Extend the functionality of Dataset #57
Comments
Just an "off" note: I still don't understand the "sacred" meaning of the main modality. Doesn't taking ANY modality as main and then recalculating the weights accordingly bring them all to "equal ground", so to speak? Or, more mathematically: there exists a hyperplane of "equal regularization effect", and setting the coefficient of one of them to one scales the others accordingly?
Let me break into the discussion and say a couple of words in defense of the main modality :) IMHO this is not about equal weights. Topic modeling is about analyzing texts, so it is reasonable to provide a way to tell the plain text (aka main modality) apart from the other modalities (which are either meta info such as author or title, or manually created fancy things like bigrams, trigrams, skip-grams, and god knows what else one could come up with). A user may want to build models solely on plain text. Or she may want to use this modality for coherence computation, for example (if the words of the main modality are in natural order in the VW, but the other modalities are bag-of-words). So, main modality == preprocessed raw text. Or maybe it would be better to give it some other name (not main modality, but preprocessed_text or plain_text?)
I agree with @Alvant, but I want to add another consideration. In many models, multiplying every modality weight by the same constant should leave the model unchanged (as a consequence, you definitely could recalculate weights based on any modality). However, this is not the case when regularizers are involved. If we want to transfer good
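A minimal sketch of the scaling argument above, assuming the usual additive form of the objective (weighted modality log-likelihoods plus `tau * R` for a regularizer `R`). The function names and the `@text` / `@author` modality names are made up for illustration; they are not part of any library API. Dividing all modality weights by the weight of the chosen modality scales the likelihood part by a constant, so a regularizer coefficient has to be divided by the same constant to keep its relative effect.

```python
def rescale_to_main(weights, main):
    """Divide every modality weight by the weight of `main`,
    so that `main` ends up with weight 1."""
    scale = weights[main]
    if scale == 0:
        raise ValueError("the chosen modality must have a non-zero weight")
    return {modality: w / scale for modality, w in weights.items()}


def rescale_tau(tau, weights, main):
    """Regularizer coefficient that keeps the same balance against the
    likelihood after the weights were divided by weights[main].
    Assumption: the objective is sum_m w_m * L_m + tau * R."""
    return tau / weights[main]


# Hypothetical weights; only their ratios matter without regularizers.
weights = {"@text": 2.0, "@author": 4.0}
rescaled = rescale_to_main(weights, "@text")
new_tau = rescale_tau(0.5, weights, "@text")
```

Without regularizers, `weights` and `rescaled` describe the same model. With a regularizer, reusing the old `tau` together with `rescaled` would change the balance between the two terms, which is why the choice of reference modality matters when coefficients are transferred.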
Something along the lines of "convert between `Counter` and `vowpal_wabbit`" would be very helpful. Also, maybe we need to store more metadata (such as main modality and co-occurrences).
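A rough sketch of what such a conversion could look like, not an existing `Dataset` method. It assumes the Vowpal Wabbit-style line layout `doc_id |modality token:count token ...`, tokens without whitespace, and integer counts; the function names and the modality names in the example are hypothetical.

```python
from collections import Counter


def counters_to_vw(doc_id, modality_counters):
    """Build one VW-style line from a {modality: Counter} mapping."""
    parts = [doc_id]
    for modality, counter in modality_counters.items():
        parts.append(f"|{modality}")
        for token, count in counter.items():
            # A count of 1 is left implicit.
            parts.append(token if count == 1 else f"{token}:{count}")
    return " ".join(parts)


def vw_to_counters(line):
    """Parse one VW-style line back into (doc_id, {modality: Counter})."""
    fields = line.split()
    if not fields:
        raise ValueError("empty line")
    doc_id, rest = fields[0], fields[1:]
    result = {}
    current = None
    for field in rest:
        if field.startswith("|"):
            current = field[1:]
            result.setdefault(current, Counter())
            continue
        if current is None:
            raise ValueError("token found before any |modality marker")
        token, sep, count = field.rpartition(":")
        if sep and token and count.isdigit():
            result[current][token] += int(count)
        else:
            # No explicit count: the whole field is a token seen once.
            result[current][field] += 1
    return doc_id, result


counters = {
    "@text": Counter({"topic": 2, "model": 1}),
    "@author": Counter({"alice": 1}),
}
line = counters_to_vw("doc1", counters)
doc_id, parsed = vw_to_counters(line)
```

Note that a `Counter` discards word order, so converting a modality stored in natural order to a `Counter` and back is lossy; that is one reason to keep track of which modality is the main one.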
Related code: https://github.com/machine-intelligence-laboratory/OptimalNumberOfTopics/blob/master/topnum/scores/arun.py (which is especially relevant now since we distribute the descriptions of corpora that are obtained using this code)