duplicate synsets vs. poorly worded glosses #161

restinplace · 2019-05-04T09:44:06Z

This is prompted by the "court #160" discussion.

Imho poor glosses are common, but true synset duplicates
(for alternative senses of a word) are rare. I assume the reason
is that glosses were written independently to denote each sense,
rather than as a system of contrasts that both denotes individual
senses and clearly partitions related synsets' overlapping
semantic spaces.

Even as a native speaker I occasionally scratch my head, but in the
end I almost invariably agree that the sense distinction is there,
and is worth making (just as I usually disagree with splitting hairs
to add more senses that may exist, but imho are not synset-worthy!).

The "solution" has generally been to try to coarsen the WN sense
inventory, not just in posts here; see e.g. Hovy et al. 2006,
"OntoNotes: the 90% Solution"
https://dl.acm.org/citation.cfm?id=1614064
The SUMO :: WN mappings are similar to the OntoNotes clusters,
and other approaches to WN sense coarsening have been published.
For example, for "court#n" we have these coarsened sense sets:
SUMO (n):Government court#n#1|court#n#8
OntoNotes (n):a sovereign regime and its assemblage (1) court#n#3|court#n#6

I mention this because, in a sledgehammer kind of way, these are
ginormous sets of "duplicate synset" notices. I think it would be
a very bad idea to treat them as requests, but they provide an extremely
informative guide as to a) why existing glosses are confusing, and
b) how a minimal gloss change might clarify the implied contrast
(as we saw the other day).
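
As a rough illustration of reading the coarsened sense sets quoted above as review hints rather than merge requests, the sketch below turns each set into pairs of senses whose glosses could be compared side by side. The `label key|key` line format is inferred only from the two examples in this post, and the function names are hypothetical.

```python
from itertools import combinations

# The two lines quoted in the post. The "label<space>key|key" layout is an
# assumption inferred from these two examples only.
COARSE_SETS = [
    "SUMO (n):Government court#n#1|court#n#8",
    "OntoNotes (n):a sovereign regime and its assemblage (1) court#n#3|court#n#6",
]


def parse_cluster(line):
    """Split a coarsened-set line into (label, [sense keys])."""
    # The sense keys are the last whitespace-separated field on the line.
    label, _, keys = line.rpartition(" ")
    return label, keys.split("|")


def review_pairs(lines):
    """Yield (label, sense_a, sense_b) for every pair of senses in a set."""
    for line in lines:
        label, keys = parse_cluster(line)
        for sense_a, sense_b in combinations(keys, 2):
            yield label, sense_a, sense_b


pairs = list(review_pairs(COARSE_SETS))
for label, sense_a, sense_b in pairs:
    print(f"{sense_a} vs {sense_b}  [{label}]")
```

Each printed pair is only a pointer to two glosses worth reading together; nothing here decides whether the senses should be merged.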

I also think that trying to improve the gloss first is advantageous
because it:

- does not intrude on PWN's underlying relational structure, and
- is readily extracted for checking by native speakers, who can
  simply compare the old vs new glosses (or gloss sets).

In contrast, the effects of adding or deleting forms or synsets
are not always readily apparent (although obviously it's still the
best path sometimes).
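
To make the "compare the old vs new glosses" step concrete, here is a minimal sketch using Python's standard `difflib`. The old gloss is the noun.body "milk" gloss quoted later in this thread; the revised gloss is purely hypothetical and only shows what a reviewer would see.

```python
import difflib


def gloss_diff(old, new):
    """Word-level diff: '- ' removed, '+ ' added, '  ' unchanged."""
    diff = difflib.ndiff(old.split(), new.split())
    # Drop the '?' hint lines that ndiff may emit for similar tokens.
    return [entry for entry in diff if not entry.startswith("?")]


old_gloss = "produced by mammary glands of female mammals for feeding their young"
# Hypothetical revision, invented for this example only.
new_gloss = "liquid produced by mammary glands of female mammals for feeding their young"

result = gloss_diff(old_gloss, new_gloss)
for entry in result:
    print(entry)
```

A reviewer sees only the changed words, which is the property that makes gloss edits easier to check than edits to the relational structure.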

arademaker · 2019-05-04T13:47:10Z

I agree that glosses are the real problem, probably because they were introduced later in PWN. But issues should be more specific, and what we really need is to devise a methodology for what a good template for glosses would look like.

The link is broken; I think the stable link for this article is https://dl.acm.org/citation.cfm?id=1614064? I found other versions via Google too. Thank you for sharing. I am aware of the SUMO solutions, but I don't expect it to be easy to define a uniform approach for the ontology ~> WN clustering decisions. I tend to believe more in the original idea of using a semantic concordance, https://dl.acm.org/citation.cfm?id=1075742, as a guide for sense clustering.

I am reading the OntoNotes solution now and it seems to be more related to the idea of semantic concordance, corpus annotation.

jmccrae · 2019-05-08T08:31:15Z

This is closely related to #141

I think there are a few issues here:

- Glosses are frequently very poorly written; there should be clearer guidelines on exactly what constitutes a definition.
- We should avoid having a word be a hypernym of itself (there are currently 329 such cases).
- There is no mechanism to group related senses, e.g., metonyms such as 'court' as a building vs 'court' as an institution. Traditional dictionaries would use hierarchies of senses here. Further, the lexicographer files have often led to distinctions that other resources would not make, for example these two senses of 'milk', where the food meaning is distinguished from the body meaning, but this does not seem a meaningful distinction:

ewn-07860018-n (Interlingual Index: i78383)

(n) milk a white nutritious liquid secreted by mammals and used as food by human beings
Topic: noun.food

ewn-05406377-n (Interlingual Index: i65344)

(n) milk produced by mammary glands of female mammals for feeding their young
Topic: noun.body
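
For the self-hypernym point in the list above, a check could look roughly like the sketch below. It assumes "a word is a hypernym of itself" means a lemma that appears both in a synset and in one of its direct hypernyms; the actual criterion behind the count of 329 is not stated here. The data, ids, and lemma names are invented placeholders, not entries from the resource.

```python
# Toy data: synset id -> (lemmas, direct hypernym ids). Every id and lemma
# here is an invented placeholder, not taken from the wordnet itself.
SYNSETS = {
    "s1": ({"lemma_a", "lemma_b"}, {"s2"}),
    "s2": ({"lemma_a", "lemma_c"}, {"s3"}),
    "s3": ({"lemma_d"}, set()),
}


def self_hypernyms(synsets):
    """Return sorted (lemma, synset, hypernym) triples where a lemma of a
    synset also appears in one of that synset's direct hypernyms."""
    hits = []
    for synset_id, (lemmas, hypernym_ids) in synsets.items():
        for hypernym_id in hypernym_ids:
            shared = lemmas & synsets[hypernym_id][0]
            hits.extend((lemma, synset_id, hypernym_id) for lemma in shared)
    return sorted(hits)


found = self_hypernyms(SYNSETS)
print(found)
```

Only direct hypernyms are inspected; extending the check to the transitive closure would be a separate design decision.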

vcvpaiva · 2019-07-29T09:32:18Z

I totally agree with @restinplace on:

- poor glosses are common, but true synset duplicates (for alternative senses of a word) are rare.
- I usually scratch my head, but in the end, I almost invariably agree that the sense distinction made in PWN is there (but I am not a native speaker).
- Trying to improve the gloss first is advantageous because it does not intrude on PWN's underlying relational structure (one shouldn't touch the graph, unless absolutely necessary), and
- glosses are readily extracted for checking by native speakers, who can simply compare the old vs new glosses.

I also agree with @jmccrae that:

- We should avoid having a word be a hypernym of itself (there are currently 329 such cases).

jmccrae · 2019-12-27T16:52:48Z

Closing this as it seems to be a very general discussion which will not suggest any specific changes to the resource. For discussion of the process of writing definitions see #141. For a general idea of whether to split synsets I suggest we continue to proceed case by case.

The self-hypernyms can be discussed under #237

jmccrae added the enhancement label May 8, 2019

jmccrae added this to the 2020 Release milestone May 8, 2019

jmccrae closed this as completed Dec 27, 2019
