sumgrams vs collocations
anwala committed Sep 10, 2019
1 parent 3878c1a commit 7341419
Showing 3 changed files with 13 additions and 4 deletions.
17 changes: 13 additions & 4 deletions README.md
@@ -312,12 +312,21 @@ both MWPN: "federal emergency management agency" occurrence rate: 0.625
`mvg_window_glue_split_ngrams` favors longer MWPNs as long as their occurrence rate >= `mvg_window_min_proper_noun_rate`, so even though the left MWPN (`federal emergency management`) has the highest occurrence rate (0.875), the algorithm selects the longest MWPN (`federal emergency management agency`) since it fulfills the selection criterion: its occurrence rate (0.625) exceeds `mvg_window_min_proper_noun_rate` (0.5). For each fragment ngram, `mvg_window_glue_split_ngrams` searches for the longest MWPN that fulfills the selection criterion.
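
To make the rule concrete, here is a minimal sketch of the selection criterion (the helper `glue_fragment` and its inputs are hypothetical illustrations, not sumgram's internal API):

```python
# Minimal sketch (not sumgram's actual internals) of how a fragment ngram
# such as "federal emergency management" could be expanded: among candidate
# MWPNs that contain the fragment, pick the LONGEST one whose occurrence
# rate meets the threshold (mvg_window_min_proper_noun_rate).

def glue_fragment(candidates, min_proper_noun_rate=0.5):
    """candidates: list of (mwpn_string, occurrence_rate) tuples."""
    eligible = [(mwpn, rate) for mwpn, rate in candidates
                if rate >= min_proper_noun_rate]
    if not eligible:
        return None
    # Longest MWPN (by word count) wins, even if a shorter one has a higher rate.
    return max(eligible, key=lambda c: len(c[0].split()))

candidates = [
    ("federal emergency management", 0.875),
    ("federal emergency management agency", 0.625),
]
print(glue_fragment(candidates))
# ('federal emergency management agency', 0.625)
```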

## Sumgrams vs. Collocations
Sumgram may be likened to some of the [multiple methods of detecting common phrases or collocations](https://nlp.stanford.edu/fsnlp/promo/colloc.pdf) since it strives to identify multi-word proper nouns. But there are some significant differences.

First, the primary goal of sumgram is to summarize the text documents in a collection; the process of conjoining split ngrams is secondary. This primary goal is the reason sumgram uses [binary TF counts](https://github.com/oduwsdl/sumgram/#counting-term-frequencies): to give all documents a fair chance in deciding what terms are important, and to represent as many different (diverse) terms as possible. From Fig. 3, I claim that the sumgrams cover a broader and more cohesive topical scope compared to the bigram collocations. Collocations, unlike sumgram, use raw TF counts, and the primary focus of collocation is to identify groups of words ([with limited compositionality](https://nlp.stanford.edu/fsnlp/promo/colloc.pdf)) that frequently co-occur. Therefore, unlike collocation methods, sumgram simultaneously generates a summary for a collection and conjoins split ngrams (akin to extracting collocations). However, sumgram is primarily a collection summarization method.
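
The effect of binary versus raw TF counting can be seen in a small sketch (the documents and terms below are made up for illustration):

```python
from collections import Counter

docs = [
    ["hurricane", "harvey", "hurricane", "hurricane"],  # one repetitive document
    ["hurricane", "harvey"],
    ["flood", "warning"],
]

raw_tf = Counter()      # collocation-style: every occurrence counts
binary_tf = Counter()   # sumgram-style: a term counts at most once per document

for doc in docs:
    raw_tf.update(doc)
    binary_tf.update(set(doc))

print(raw_tf["hurricane"])     # 4 -- dominated by the single repetitive document
print(binary_tf["hurricane"])  # 2 -- each document gets one vote
```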

Second, collocation methods often begin by ranking frequent ngrams (e.g., bigrams or trigrams). Variation comes with applying filters: stopword/punctuation removal, or the use of [Pointwise Mutual Information](https://en.wikipedia.org/wiki/Pointwise_mutual_information), [Chi-squared tests](https://en.wikipedia.org/wiki/Chi-squared_test), [T-tests](https://en.wikipedia.org/wiki/Student%27s_t-test), or some other statistical test to ensure the collocations are statistically significant. The key difference between sumgram and collocation methods is that after collocation detection methods apply filters to get better results (e.g., bigrams/trigrams), they often stop. This means that, by design, since the calculation of collocations is restricted to a specific ngram width (e.g., bigrams), we may split collocations with more than two terms. In contrast, after sumgram returns a bigram, for example, it tries to expand the bigram into a k-gram (k > 2), especially if the bigram is part of a multi-word proper noun. So it is not sufficient that the bigram is frequent; sumgram strives to avoid splitting multi-word proper nouns. This gives sumgram the flexibility to return ngrams of multiple widths (bigrams, trigrams, six-grams, etc.) as part of the list of most frequent ngrams.
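
For comparison, the sketch below shows a typical fixed-width collocation pipeline using NLTK (illustrative text; PMI is one of the scoring methods covered in Ruchirawat's post). Note that the output is locked to 2-word units:

```python
import nltk
from nltk.collocations import BigramCollocationFinder
from nltk.corpus import stopwords

nltk.download("stopwords", quiet=True)

text = ("the federal emergency management agency responded after "
        "hurricane harvey made landfall and the federal emergency "
        "management agency opened shelters")

# Simple stopword filter before ranking candidate bigrams
tokens = [t for t in text.split() if t not in stopwords.words("english")]

bigram_measures = nltk.collocations.BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(tokens)

# Rank bigrams by PMI; chi-squared or t-test scorers could be swapped in here.
print(finder.nbest(bigram_measures.pmi, 5))
# Only 2-word units are returned, so "federal emergency management agency"
# is necessarily split across several bigrams.
```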

*Fig. 3: Comparison of the top 20 sumgrams and the top 20 bigram collocations generated using different methods for labeling collocations, for a collection of documents about Ebola. Since collocations are calculated for a fixed (n=2) ngram width, they are prone to splitting MWPNs (highlighted). The collocation output was generated by running [Nicha Ruchirawat's](https://medium.com/@nicharuch/collocations-identifying-phrases-that-act-like-individual-words-in-nlp-f58a93a2f84a) implementation of some common methods for identifying ngram collocations.*
<img src="pics/sumgrams_v_collocations_ebola.png" alt="sumgrams vs collocations ebola" style="width: 50%;"/>

*Fig. 4: Comparison of the top 20 sumgrams and the top 20 bigram collocations generated using different methods for labeling collocations, for a collection of documents about Hurricane Harvey. Since collocations are calculated for a fixed (n=2) ngram width, they are prone to splitting MWPNs (highlighted). The collocation output was generated by running [Nicha Ruchirawat's](https://medium.com/@nicharuch/collocations-identifying-phrases-that-act-like-individual-words-in-nlp-f58a93a2f84a) implementation of some common methods for identifying ngram collocations.*
<img src="pics/sumgrams_v_collocations_harvey.png" alt="sumgrams vs collocations harvey" style="width: 50%;"/>

## Sumgrams vs. LDA

*Fig. 5: Comparison of top LDA topics and top 20 sumgrams (conjoined ngrams) generated by sumgram for a collection of documents about [Hurricane Harvey](https://en.wikipedia.org/wiki/Hurricane_Harvey).*
<img src="pics/sumgrams_v_lda.png" alt="sumgrams vs lda" style="width: 50%;"/>
[LDA](https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation) is often used to discover abstract "topics" in a large volume of documents. Fig. 5 juxtaposes the top 20 topics surfaced by a [Python LDA library](https://github.com/lda-project/lda/) against the top 20 sumgrams. As seen in Fig. 5 (Column 2), the LDA abstract topics confirm our intuition that the collection is indeed about a hurricane. However, I am uncomfortable with LDA's output and prefer sumgrams as a summary for the collection for the following reasons (a minimal LDA sketch follows the list below):
* It is not easy to set the `words per topic` input to the LDA system. In Fig. 5 we set `words per topic = 5` to facilitate comparison with sumgrams. Also, since our collection is homogeneous (we already know most documents are about Hurricane Harvey), applying LDA to our collection is akin to finding the subtopics within our broader Hurricane Harvey topic. Consequently, if `words per topic` is set too small, we may abbreviate a subtopic, and if it is set too high, we may mix multiple subtopics within a single LDA abstract topic. This means LDA might be too coarse to find subtopics because it is hard to control topic boundaries in a supervised manner. Sumgrams, in contrast, isolate entities. For example, the first sumgram identifies only one entity, `hurricane harvey`, as does the second: `the federal emergency management agency`.
* Similar to the first point, the LDA topics are not sequences; words in the topics are ordered according to their probability of belonging to the topic. This means the discovery of proper nouns in the LDA topics is not obvious. For example, in the 8th LDA topic, `puerto hurricane rico maria images`, `rico` is not next to `puerto`. I could only detect this because I am familiar with the collection topic.
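
For context, here is a minimal sketch of the kind of LDA run that produces such topic lists. It uses scikit-learn's `LatentDirichletAllocation` rather than the exact library used for Fig. 5, and the documents and parameters are purely illustrative:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "hurricane harvey flooded houston and texas",
    "the federal emergency management agency coordinated relief",
    "puerto rico braced for hurricane maria",
]

# Build a document-term matrix of raw counts
vec = CountVectorizer(stop_words="english")
X = vec.fit_transform(docs)

model = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)

words = vec.get_feature_names_out()
words_per_topic = 5  # the hard-to-tune knob discussed above
for i, topic in enumerate(model.components_):
    top = [words[j] for j in topic.argsort()[::-1][:words_per_topic]]
    # Words are ordered by topic weight, not by position in the text,
    # so MWPNs such as "puerto rico" are not kept contiguous.
    print(f"topic {i}: {' '.join(top)}")
```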
Binary file added pics/sumgrams_v_collocations_ebola.png
Binary file added pics/sumgrams_v_collocations_harvey.png
