Co-occurrence network distance parameter: Jaccard, Euclid or Cosine? #1108

tchiby · 2023-06-21T02:20:23Z

Hello,

Thank you for providing this tool for free, it is very appreciated.

I am currently using it to build a co-occurrence network from codes. I have a data set of about 150,000 words and 9 different codes. I was wondering which distance is preferable, Jaccard or Euclid. In your manual you suggest using Euclid for large data sets, but I am not sure why.

Thank you for your consideration,
Have a great day!

Answered by ko-ichi-h

ko-ichi-h · 2023-06-21T04:05:51Z

ko-ichi-h
Jun 21, 2023
Maintainer

Hello,

What matters is the size of the computing unit, rather than the size of the data set as a whole.

If you use small units like "sentences" or "paragraphs", I recommend Jaccard. The Jaccard coefficient looks only at whether the codes co-occurred (1) or not (0). There are only 0 and 1; there is no such thing as 5 or 10.

On the other hand, Cosine is preferable when using larger units, such as a whole speech or piece of writing. Cosine is similar to the correlation coefficient. It distinguishes between 0, 1, 5, and 10. With larger units, we want to distinguish not only between occurring (1) and not occurring (0), but also between occurring many times (10) and occurring several times (5).

Euclid behaves similarly to Cosine in this context, but in my experience Cosine often gives better results. Please try both Cosine and Euclid if you use larger units.
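To illustrate the difference described above, here is a minimal Python sketch using the plain textbook formulas. The two count vectors are made-up values for two hypothetical words across five units, and the function names are illustrative only. This is not KH Coder's own code, and KH Coder may preprocess the data (for example by standardizing it) before computing a distance.

```python
import math

# Hypothetical counts of two words in five large units (made-up numbers).
word_a = [0, 1, 5, 10, 0]
word_b = [0, 1, 1, 10, 2]

def jaccard(x, y):
    """Jaccard similarity: counts are reduced to present (1) / absent (0)."""
    both = sum(1 for a, b in zip(x, y) if a > 0 and b > 0)
    either = sum(1 for a, b in zip(x, y) if a > 0 or b > 0)
    return both / either if either else 0.0

def cosine(x, y):
    """Cosine similarity: uses the actual counts, so 1, 5 and 10 differ."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / norm if norm else 0.0

def euclid(x, y):
    """Euclidean distance: also uses the actual counts (smaller = closer)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

print(jaccard(word_a, word_b))  # 0.75
print(cosine(word_a, word_b))   # about 0.917
print(euclid(word_a, word_b))   # about 4.472
```

Changing the 5 in `word_a` to 1 leaves the Jaccard value untouched but changes both the Cosine and the Euclid values, which is the distinction that matters for larger units.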

By the way, the unit "h5" corresponds to each cell of the Excel file if you use an Excel file as the target file.

5 replies

tchiby Jun 21, 2023
Author

Thank you very much for your quick reply! I believe I understand.

In my case, each cell of the Excel file I used as the target contains a paragraph from my articles (sometimes a single sentence, sometimes several), rather than one sentence per cell. So the size of the computing unit varies, from one sentence to sometimes 5 to 10 sentences. Is that considered small?

Thank you very much for your help; it is greatly appreciated,
Have a great day.

ko-ichi-h Jun 21, 2023
Maintainer

Yes, it's small, I think.

Best,

tchiby Jun 21, 2023
Author

Thank you for your quick reply. Sorry to bother you, but consider this example:

"this safety guide does not apply to the decommissioning of facilities. decommissioning is an authorized process primarily concerned with the decontamination and dismantling of systems, structures and components of a facility and with the decontamination and demolition of buildings. remediation can entail activities that are similar to decommissioning; both remediation and decommissioning activities are typically performed under an authorization. abandoned and currently unauthorized industrial sites, such as former uranium mines and mills and former radium processing facilities, may have buildings and structures to be taken down by actions consistent with the decommissioning process (e.g. decontamination and dismantling); however, such activities are considered to be a part of site remediation and would typically be carried out as part of a site specific remediation plan. consequently, such activities are within the scope of this safety guide."

This is a single cell analyzed with KH Coder. For the word "decommissioning" (which appears 5 times here), wouldn't it be better to use Euclid or Cosine? And even more so if I use a code such as Dismantlement: decommissioning + decontamination + dismantling...?

ko-ichi-h Jun 21, 2023
Maintainer

I believe Jaccard will do.

But if you have second thoughts, just try both Jaccard and Cosine, then decide based on the results.

tchiby Jun 21, 2023
Author

Alright, I understand. Thank you very much for your help!

ko-ichi-h · 2023-06-21T13:11:12Z

ko-ichi-h
Jun 21, 2023
Maintainer

On the other hand, Cosine is preferable when using larger units, such as a whole speech or piece of writing. Cosine is similar to the correlation coefficient. It distinguishes between 0, 1, 5, and 10. With larger units, we want to distinguish not only between occurring (1) and not occurring (0), but also between occurring many times (10) and occurring several times (5).

I am sorry, this description was wrong. It applies only to WORDS, not to CODES.

CODES are all 1 (match) or 0 (no match) in the first place. CODES are binary data.

So, when you create a co-occurrence network of CODES, you can choose the coefficient according to which defining formula you prefer. I would try Jaccard first, then Cosine.

The conclusion is therefore the same, but I apologize for the mistake in the explanation along the way.
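As a rough illustration of why codes are binary, the sketch below turns a code into a 0/1 vector over units and then applies the textbook Jaccard and Cosine formulas to those vectors. The code definitions, the four sample units, and the function names are all hypothetical, and the matching rule (a code matches when any of its words occurs in the unit) is a simplifying assumption rather than KH Coder's actual coding-rule engine.

```python
import math

# Hypothetical coding rules: a code matches a unit if any listed word occurs.
codes = {
    "Dismantlement": {"decommissioning", "decontamination", "dismantling"},
    "Remediation": {"remediation"},
}

# Hypothetical units (for example, one Excel cell each).
units = [
    "decommissioning is an authorized process decommissioning involves dismantling",
    "remediation can entail activities similar to decommissioning",
    "such activities are part of site remediation",
    "this guide covers safety assessment",
]

def code_vector(words, units):
    """1 if the code matches the unit, else 0. Repeated words do not add up."""
    return [1 if words & set(u.split()) else 0 for u in units]

def jaccard(x, y):
    both = sum(1 for a, b in zip(x, y) if a and b)
    either = sum(1 for a, b in zip(x, y) if a or b)
    return both / either if either else 0.0

def cosine(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / norm if norm else 0.0

dis = code_vector(codes["Dismantlement"], units)  # [1, 1, 0, 0]
rem = code_vector(codes["Remediation"], units)    # [0, 1, 1, 0]

print(jaccard(dis, rem))  # 1 shared unit out of 3 matched units
print(cosine(dis, rem))   # 1 shared unit / sqrt(2 * 2)
```

The first unit contains "decommissioning" twice plus "dismantling", yet the Dismantlement code is still just 1 for that unit. On such 0/1 vectors both coefficients use the same ingredients (units matched by both codes and units matched by each code) and differ only in how they normalize.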

0 replies

talk100 · 2023-09-12T17:59:41Z

I have a related question about the co-occurrence of words and documents, and about how it is illustrated in the attached figure. If I choose Jaccard, does the size of a bubble correspond to the number of units in which the word was used? Am I right about this interpretation of the bubble size?
cluster1_subclusters.pdf

1 reply

ko-ichi-h Sep 12, 2023
Maintainer

No. The size of the bubble corresponds to term frequency, that is, the number of times a word appears in the entire data. It is unrelated to the choice of distance parameter.

When you use codes rather than words, it corresponds to the number of units (document frequency), because codes are all 1 or 0.
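The difference between the two quantities can be seen in a small Python sketch. The three units below are made-up data and the whitespace tokenization is a simplification; this only illustrates the definitions of term frequency and document frequency, not how KH Coder computes bubble sizes internally.

```python
from collections import Counter

# Hypothetical units, already tokenized by whitespace for simplicity.
units = [
    "decommissioning decommissioning remediation",
    "remediation plan",
    "decommissioning plan plan",
]

term_freq = Counter()  # total occurrences across the entire data
doc_freq = Counter()   # number of units containing the word at least once

for unit in units:
    tokens = unit.split()
    term_freq.update(tokens)
    doc_freq.update(set(tokens))  # each word counted at most once per unit

print(term_freq["decommissioning"], doc_freq["decommissioning"])  # 3 2
print(term_freq["plan"], doc_freq["plan"])                        # 3 2
print(term_freq["remediation"], doc_freq["remediation"])          # 2 2
```

For a word that never repeats within a unit the two numbers coincide, which is also the situation for codes, since a code is either matched or not matched in each unit.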

BTW, when you have a new question, please create a new discussion thread.
