[Topic Models] Evaluating topic models #1866
Comments
Aaaand of course, one makes plans, and reality laughs :) One important factor to take into account in the above is that statements are very short texts. This risks playing havoc with the n-gram frequencies used for co-occurrence in NPMI. This could explain the problem of negative NPMI that I'm seeing in my first experiment with gensim's NPMI for topics assembled by clustering sentence embeddings. Not only negative NPMI, meaning many words in a topic never occurring together (which makes sense: the shorter the comment, the fewer words in it, the fewer co-occurrences), but also a small support, i.e. a small number of pairs if I understand gensim's "support" correctly, and a high standard deviation (which would also make sense with small texts). It could also be, more simply, that I'm calling the Coherence estimator badly in that first experiment. (I'll push the code to a new public repo as soon as I have pushing rights, and link to it in these comments.) Need to experiment and think a bit further.
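For reference, the kind of call I mean, as a minimal sketch (the names `topics` and `texts` are just illustrative: a list of top-word lists per topic, and the tokenized statements):

```python
from gensim.corpora import Dictionary
from gensim.models import CoherenceModel

# topics: list of lists of top words per topic; texts: list of tokenized statements
dictionary = Dictionary(texts)
cm = CoherenceModel(topics=topics, texts=texts, dictionary=dictionary,
                    coherence="c_npmi")
print(cm.get_coherence())  # aggregate NPMI across topics
# Per-topic NPMI, with standard deviation and support (number of scored pairs)
print(cm.get_coherence_per_topic(with_std=True, with_support=True))
```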
In my experience, topic models, especially LLM topic models, have two critical axes of performance worth evaluating:
1. Topic generation: the quality of the generated topic taxonomy.
2. Topic tagging: how consistently each statement gets tagged with those topics.
The most straightforward way to evaluate them is to have humans generate ground-truth topic taxonomies for some example sets of statements, and then have them manually tag each statement with those topics. Our main test data set is 5 different sets of statements (i.e. responses to 5 different prompts), each with taxonomies and codes produced by 5 different people (so 25 different taxonomies x tags). This lets us benchmark a topic model's consistency with humans against human consistency with humans. There are lots of metrics/consistency eval approaches, but the main thing is just to have at least one that gives a signal of the quality of the topic taxonomy relative to humans, and another that gives a signal of topic tag consistency. And unless you're one-shotting the taxonomy AND the tagging, you can isolate the latter by starting with the human-generated topics, having your model use those for tagging, and then comparing those to the human tags with standard FN, FP, F-score metrics. If you're interested in doing this kind of thing, maybe we could open-source the human-produced ground-truth data we use for these evals and turn it into a short-statement topic model benchmark?
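A minimal sketch of that last comparison with scikit-learn, assuming `human_tags` and `model_tags` are per-statement lists of topic labels (the names are just placeholders):

```python
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.metrics import precision_recall_fscore_support

# human_tags / model_tags: one list of topic labels per statement,
# with the model tagging against the human-generated taxonomy
mlb = MultiLabelBinarizer().fit(human_tags + model_tags)
y_true = mlb.transform(human_tags)
y_pred = mlb.transform(model_tags)
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="micro", zero_division=0
)
print(precision, recall, f1)
```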
Hi Andrew @akonya, lovely to e-meet you after having heard so much via @colinmegill :) Very nice separation into two sub-tasks indeed, thanks! It's not unlike the two dimensions that Topic Coherence and Topic Diversity try to capture, but you offer it from a more sequential angle, with the tagging evaluation being conditioned on the generation being done right.

The point "2. Topic tagging" (conditional on a list of topics) is essentially [multi-class] classification, for which we have all the usual metrics of precision and recall (which cover the "with" and the "only with"). If we anticipate that some comments can belong to several topics (I need to check how frequent that is in polis with its short comments -- I know that some of the methods we look at, based on clustering, don't allow that), we need to include multi-label generalizations, which is feasible.

My dream would be to have some metrics that we can run on all the past conversations (both publicly for the open data, and privately for the not-open data), and therefore unsupervised. But that doesn't exclude supervised metrics (i.e. comparing to human topics) as you suggest -- and to be honest, supervised metrics are easier :) So they're a good place to start. We do have a couple of conversations for which we have some human-defined topics. Of course this then begs the question of how good those human-defined topics are ("quis custodiet etc etc" -> who labels the labellers :) or evaluates the evaluations? Gold standards with multiple labellers could be a dream, but is it worth it at this stage?), but they're a start. In a perfect world we'd have a distribution of topics based on multiple labellers etc, but I'm overcomplicating. Let me see what I can hack quickly for very basic evaluations, so we have something tangible to iterate on. I need to check whether the V-measure evaluation from Rosenberg and Hirschberg (2007), used in the MTEB benchmark for embeddings, can apply to our supervised evaluation.
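For the record, the V-measure is available off the shelf in scikit-learn, so a first supervised check could be as simple as the sketch below (assuming a single topic label per statement; `human_labels` and `model_labels` are placeholder names):

```python
from sklearn.metrics import v_measure_score

# human_labels / model_labels: one topic id per statement (single-label case)
print(v_measure_score(human_labels, model_labels))
```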
(woops, please ignore the "Completed/Reopened", wrong keyboard shortcut 🤦 )
Problem:
As we start considering topic models for polis conversations, we need quantitative quality assessments of the algorithms we consider, evaluated empirically on past polis conversations (either open or private).
This issue discusses the metrics I think are the most adequate based on a literature search. We could manually come up with something smart, but there's an advantage to going with a metric that's already accepted: e.g. raw counts of errors come with normalization questions, and by the time you solve those normalizations you've reinvented something close to the established metrics 😅 but without the years of refinement that get you the extra mile.
Suggested solution:
No metric is perfect, as any dive into the literature and the references below shows, but I recommend evaluating:
- Topic Coherence
- Topic Diversity

These, for example, are the metrics used by Grootendorst (2022) to evaluate the BERTopic library.
For Topic Coherence we can use Normalized Pointwise Mutual Information (NPMI), as defined by Bouma (2009) and evaluated by Lau et al. (2014) as one that "emulates human judgment with reasonable performance". Scores range from -1 (words never occur together) to +1 (words always occur together), with 0 indicating independence; higher is better.
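For reference, Bouma (2009)'s definition for a word pair $(x, y)$, with $p(x, y)$ the probability of the two words co-occurring (within the reference window):

$$\mathrm{NPMI}(x, y) = \frac{\log\frac{p(x, y)}{p(x)\,p(y)}}{-\log p(x, y)}$$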
For Topic Diversity we will use the definition from Dieng et al. (2020): the percentage of unique words in the top K=25 words of all topics. Diversity close to 0 indicates redundant topics; diversity close to 1 indicates more varied topics.
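That definition is simple enough to reimplement directly; a minimal sketch, assuming `topics` is a list of top-word lists (the name is illustrative):

```python
def topic_diversity(topics, top_k=25):
    """Fraction of unique words among the top_k words of all topics (Dieng et al., 2020)."""
    top_words = [word for topic in topics for word in topic[:top_k]]
    return len(set(top_words)) / len(top_words)
```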
Why those metrics:
We can then compare topic models across our whole historical data if we so want :) and they're relatively quick to implement.
A few more advantages to the above:
Implementation:
We can either do our own implementation or, to save time and avoid reinventing the wheel, call one of the existing NLP toolboxes that already implement those evaluations.
For example, OCTIS (Terragni et al. 2021; MIND-LAB/OCTIS 2020) has both NPMI for Coherence and Dieng et al. (2020)'s Topic Diversity, and is used by BERTopic's evaluation scripts (https://github.com/MaartenGr/BERTopic_evaluation/tree/main). However, it is a bit fiddly to install (see e.g. MIND-Lab/OCTIS#106, where certain libraries are pinned to specific old versions). Under the hood, OCTIS calls gensim (Řehůřek and Sojka, 2010)'s pipeline Topic Coherence metric and has straightforward code for Topic Diversity. Since we do not (yet?) use OCTIS's other features, we could call gensim directly (along the lines of the sketches above) and reimplement the diversity.

References: