Co-occurrence network distance parameter: Jaccard, Euclid or Cosine? #1108
-
Hello, Thank you for providing this tool for free; it is very much appreciated. I am currently using it to build a co-occurrence network using codes. I have a set of about 150,000 words and 9 different codes. I was wondering which distance would be preferred, Jaccard or Euclid. In your manual you suggest using Euclid for large data sets, but I am not sure why. Thank you for your consideration.
Replies: 3 comments 6 replies
-
Hello,
What matters is the size of the computing unit rather than the size of the dataset as a whole.
If you use small units such as "sentences" or "paragraphs", I recommend Jaccard. The Jaccard coefficient only looks at whether the codes co-occurred (1) or not (0). There are only 0s and 1s; there is no such value as 5 or 10.
On the other hand, Cosine is preferable when using larger units, such as a whole speech or piece of writing. Cosine is similar to the correlation coefficient: it distinguishes between 0, 1, 5, and 10. With larger units, we want to distinguish not only between occurring (1) and not occurring (0), but also between occurring many times (10) and occurring somewhat often (5).
Euclid behaves similarly to Cosine in this context, but in my experience Cosine often produces better results. Please try both Cosine and Euclid if you use larger units.
By the way, the unit "h5" means each cell of the spreadsheet if you use an Excel file as the target file.
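To illustrate the difference between presence/absence and counts, here is a minimal Python sketch. It uses the standard textbook definitions of the three measures and is not KH Coder's actual implementation, which may preprocess the data differently. The vectors `word_a` and `word_b` are made-up counts of two items across four large units.

```python
import math

def jaccard(a, b):
    """Jaccard coefficient: counts are first reduced to presence (1) / absence (0)."""
    both = sum(1 for x, y in zip(a, b) if x > 0 and y > 0)
    either = sum(1 for x, y in zip(a, b) if x > 0 or y > 0)
    return both / either if either else 0.0

def cosine(a, b):
    """Cosine similarity: uses the actual counts, so 1, 5 and 10 are distinguished."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def euclid(a, b):
    """Euclidean distance: also uses the counts, but is a distance (smaller = closer)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical counts per unit (illustrative data only).
word_a = [10, 5, 0, 1]
word_b = [1, 5, 0, 10]

print(jaccard(word_a, word_b))           # 1.0: both appear in exactly the same units
print(round(cosine(word_a, word_b), 3))  # 0.357: the counts differ a lot
print(round(euclid(word_a, word_b), 2))  # 12.73
```

Jaccard treats the two items as identical because they appear in the same units, while Cosine and Euclid notice that one occurs 10 times where the other occurs once. That extra information only exists when units are large enough for repeated occurrences.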
-
I am sorry, my description above was wrong. It applies only to WORDS, not to CODES. CODES are either 1 (match) or 0 (no match) in the first place; they are binary data. So, when you are creating a co-occurrence network of CODES, you can choose the coefficient according to your preference for the formulas that define them. I would try Jaccard first, then Cosine. The conclusion is therefore the same, but I apologize for the mistake in the explanation along the way.
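For reference, the usual textbook formulas for two binary code vectors can be written as follows. This is a sketch of the general definitions, not necessarily exactly what KH Coder computes internally. The symbols are my own notation: $n_{11}$ is the number of units where both codes match, $n_{10}$ the number where only the first matches, and $n_{01}$ the number where only the second matches.

```latex
\[
\text{Jaccard} = \frac{n_{11}}{n_{11} + n_{10} + n_{01}}, \qquad
\text{Cosine} = \frac{n_{11}}{\sqrt{(n_{11} + n_{10})(n_{11} + n_{01})}}, \qquad
d_{\text{Euclid}} = \sqrt{n_{10} + n_{01}}
\]
```

Jaccard and Cosine are similarities (higher means closer), and neither uses the units where both codes are absent. Euclid is a distance (lower means closer). With binary data, the choice therefore comes down to which normalization you prefer.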
-
I have a related question about the co-occurrence network of words and documents, and about how it is illustrated in the attached figure. If I choose Jaccard, then the count for each word (the size of the bubble) corresponds to the number of units in which the word was used. Am I right about this interpretation of the bubble size?