[Topic Models] Evaluating topic models #1866
Comments
Aaaand of course, one makes plans, and reality laughs :) One important factor to take into account in the above is that statements are very short texts. This risks playing havoc with the n-gram frequencies used for co-occurrence in NPMI. This could explain the problem of negative NPMI that I'm seeing in my first experiment with gensim's NPMI for topics assembled by clustering sentence embeddings. Not only negative NPMI, meaning many words in a topic never occurring together (which makes sense: the shorter the comment, the fewer words in it, the fewer co-occurrences), but also a small support, i.e. a small number of pairs if I understand gensim's "support" correctly, and a high standard deviation (which would also make sense with small texts). It could also be, more simply, that I'm calling the Coherence estimator badly in that first experiment. (I'll push the code to a new public repo as soon as I have pushing rights, and link to it in these comments.) Need to experiment and think a bit further.
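For reference, the kind of call I mean, as a minimal sketch (the names `topics` and `texts` are just illustrative: a list of top-word lists per topic, and the tokenized statements):

```python
from gensim.corpora import Dictionary
from gensim.models import CoherenceModel

# topics: list of lists of top words per topic; texts: list of tokenized statements
dictionary = Dictionary(texts)
cm = CoherenceModel(topics=topics, texts=texts, dictionary=dictionary,
                    coherence="c_npmi")
print(cm.get_coherence())  # aggregate NPMI across topics
# Per-topic NPMI, with standard deviation and support (number of scored pairs)
print(cm.get_coherence_per_topic(with_std=True, with_support=True))
```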
In my experience, topic models, especially LLM topic models, have two critical axes of performance worth evaluating:
1. Topic generation: the quality of the generated topic taxonomy.
2. Topic tagging: how consistently each statement gets tagged with those topics.
The most straightforward way to evaluate them is to have humans generate ground-truth topic taxonomies for some example sets of statements, and then have them manually tag each statement with those topics. Our main test data set is 5 different sets of statements (i.e. responses to 5 different prompts), each with taxonomies and codes produced by 5 different people (so 25 different taxonomies x tags). This lets us benchmark a topic model's consistency with humans against human consistency with humans. There are lots of metrics/consistency eval approaches, but the main thing is just to have at least one that gives a signal of the quality of the topic taxonomy relative to humans, and another that gives a signal of topic tag consistency. And unless you're one-shotting the taxonomy AND the tagging, you can isolate the latter by starting with the human-generated topics, having your model use those for tagging, and then comparing those to the human tags with standard FN, FP, F-score metrics. If you're interested in doing this kind of thing, maybe we could open-source the human-produced ground-truth data we use for these evals and turn it into a short-statement topic model benchmark?
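A minimal sketch of that last comparison with scikit-learn, assuming `human_tags` and `model_tags` are per-statement lists of topic labels (the names are just placeholders):

```python
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.metrics import precision_recall_fscore_support

# human_tags / model_tags: one list of topic labels per statement,
# with the model tagging against the human-generated taxonomy
mlb = MultiLabelBinarizer().fit(human_tags + model_tags)
y_true = mlb.transform(human_tags)
y_pred = mlb.transform(model_tags)
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="micro", zero_division=0
)
print(precision, recall, f1)
```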
Hi Andrew @akonya, lovely to e-meet you after having heard so much via @colinmegill :) Very nice separation into two sub-tasks indeed, thanks! It's not unlike the two dimensions that Topic Coherence and Topic Diversity try to capture, but you offer it from a more sequential angle, with the tagging evaluation being conditioned on the generation being done right.

The point "2. Topic tagging" (conditional on a list of topics) is essentially [multi-class] classification, for which we have all the usual metrics of precision and recall (which cover the "with" and the "only with"). If we anticipate that some comments can belong to several topics (I need to check how frequent that is in polis with its short comments -- I know that some of the methods we look at, based on clustering, don't allow that), we need to include multi-label generalizations, which is feasible.

My dream would be to have some metrics that we can run on all the past conversations (both publicly for the open data, and privately for the not-open data), and therefore unsupervised. But that doesn't exclude supervised metrics (i.e. comparing to human topics) as you suggest -- and to be honest, supervised metrics are easier :) So they're a good place to start. We do have a couple of conversations for which we have some human-defined topics. Of course this then begs the question of how good those human-defined topics are ("quis custodiet etc etc" -> who labels the labellers :) or evaluates the evaluations? Gold standards with multiple labellers could be a dream, but is it worth it at this stage?), but they're a start. In a perfect world we'd have a distribution of topics based on multiple labellers etc, but I'm overcomplicating. Let me see what I can hack quickly for very basic evaluations, so we have something tangible to iterate on. I need to check whether the V-measure evaluation from Rosenberg and Hirschberg (2007), used in the MTEB benchmark for embeddings, can apply to our supervised evaluation.
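For the record, the V-measure is available off the shelf in scikit-learn, so a first supervised check could be as simple as the sketch below (assuming a single topic label per statement; `human_labels` and `model_labels` are placeholder names):

```python
from sklearn.metrics import v_measure_score

# human_labels / model_labels: one topic id per statement (single-label case)
print(v_measure_score(human_labels, model_labels))
```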
(woops, please ignore the "Completed/Reopened", wrong keyboard shortcut 🤦 )
Problem:
As we start considering topic models for polis conversations, we need quantitative quality assessments of the algorithms we consider, evaluated empirically on past polis conversations (either open or private).
This issue discusses the metrics I think are the most adequate based on a literature search. We could manually come up with something smart, but there's an advantage to going with a metric that's already accepted: e.g. raw counts of errors come with normalization questions, and by the time you solve those normalizations you've reinvented something close to the established metrics 😅 but without the years of refinement that get you the extra mile.
Suggested solution:
No metric is perfect, as any dive into the literature and the references below shows, but I recommend evaluating:
- Topic Coherence
- Topic Diversity

These, for example, are the metrics used by Grootendorst (2022) to evaluate the BERTopic library.
For Topic Coherence we can use Normalized Pointwise Mutual Information (NPMI), as defined by Bouma (2009) and evaluated by Lau et al. (2014) as one that "emulates human judgment with reasonable performance". Scores range from -1 (words never occur together) to +1 (words always occur together), with 0 indicating independence; higher is better.
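For reference, Bouma (2009)'s definition for a word pair $(x, y)$, with $p(x, y)$ the probability of the two words co-occurring (within the reference window):

$$\mathrm{NPMI}(x, y) = \frac{\log\frac{p(x, y)}{p(x)\,p(y)}}{-\log p(x, y)}$$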
For Topic Diversity we will use the definition from Dieng et al. (2020): the percentage of unique words in the top K=25 words of all topics. Diversity close to 0 indicates redundant topics; diversity close to 1 indicates more varied topics.
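That definition is simple enough to reimplement directly; a minimal sketch, assuming `topics` is a list of top-word lists (the name is illustrative):

```python
def topic_diversity(topics, top_k=25):
    """Fraction of unique words among the top_k words of all topics (Dieng et al., 2020)."""
    top_words = [word for topic in topics for word in topic[:top_k]]
    return len(set(top_words)) / len(top_words)
```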
Why those metrics:
We can then compare topic models across our whole historical data if we so want :) and they're relatively quick to implement.
A few more advantages to the above:
Implementation:
We can either do our own implementation or, to save time and avoid reinventing the wheel, call one of the existing NLP toolboxes that already implement those evaluations.
For example, OCTIS (Terragni et al. 2021; MIND-LAB/OCTIS 2020) has both NPMI for Coherence and Dieng et al. (2020)'s Topic Diversity, and is used by BERTopic's evaluation scripts (https://github.com/MaartenGr/BERTopic_evaluation/tree/main). However, it is a bit fiddly to install (see e.g. MIND-Lab/OCTIS#106, where certain libraries are pinned to specific old versions). Under the hood, OCTIS calls gensim (Řehůřek and Sojka, 2010)'s pipeline Topic Coherence metric and has straightforward code for Topic Diversity. Since we do not (yet?) use OCTIS's other features, we could call gensim directly (along the lines of the sketches above) and reimplement the diversity.

References: