duplicate synsets vs. poorly worded glosses #161

Closed
restinplace opened this issue May 4, 2019 · 4 comments

@restinplace

restinplace commented May 4, 2019

This is prompted by the "court #160" discussion.

Imho poor glosses are common, but true synset duplicates
(for alternative senses of a word) are rare. I assume the reason
is that glosses were written independently to denote each sense,
rather than as a system of contrasts that both denote individual
senses, and clearly partition related synsets' overlapping
semantic spaces.

Even as a native speaker I occasionally scratch my head, but in the
end I almost invariably agree that the sense distinction is there,
and is worth making (just as I usually disagree with splitting hairs
to add more senses that may exist, but imho are not synset-worthy!).

The "solution" has generally been to try to coarsen the WN sense
inventory, not just in posts here; see e.g. Hovy et al. 2006,
"OntoNotes: the 90% Solution"
https://dl.acm.org/citation.cfm?id=1614064
The SUMO :: WN mappings are similar to the OntoNotes clusters,
and other approaches to WN sense coarsening have been published.
For example, for "court#n" we have these coarsened sense sets:
SUMO (n):Government court#n#1|court#n#8
OntoNotes (n):a sovereign regime and its assemblage (1) court#n#3|court#n#6
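
The two cluster lines above can be split into sense pairs mechanically. This is only a minimal sketch: the line layout (`<resource> (<pos>):<label> <sense>|<sense>`) is an assumption inferred from those two examples, and `parse_cluster` and `review_pairs` are hypothetical helper names, not part of any WordNet, SUMO, or OntoNotes tooling.

```python
import re
from itertools import combinations

# Assumed layout, inferred only from the two example lines in this comment:
#   <resource> (<pos>):<cluster label> <sense>|<sense>|...
CLUSTER_LINE = re.compile(
    r"^(?P<source>\S+) \((?P<pos>\w+)\):(?P<label>.*)\s(?P<senses>\S+)$"
)


def parse_cluster(line):
    """Split one coarsened-sense line into its resource, label and senses."""
    match = CLUSTER_LINE.match(line.strip())
    if match is None:
        raise ValueError(f"unrecognised cluster line: {line!r}")
    senses = []
    for key in match["senses"].split("|"):
        # A sense key such as court#n#8 is lemma, part of speech, sense number.
        lemma, pos, number = key.rsplit("#", 2)
        senses.append((lemma, pos, int(number)))
    return {
        "source": match["source"],
        "label": match["label"].strip(),
        "senses": senses,
    }


def review_pairs(cluster):
    """Every pair of senses in a cluster: candidates for a gloss contrast check."""
    return list(combinations(cluster["senses"], 2))
```

Each pair returned by `review_pairs` would then be a prompt to reread the two glosses side by side, not a request to merge the synsets.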

I mention this because in a sledge-hammer kind of way these are
ginormous sets of "duplicate synset" notices. I think it would be
a very bad idea to treat them as requests, but they provide an extremely
informative guide as to a) why existing glosses are confusing, and
b) how a minimal gloss change might clarify the implied contrast
(as we saw the other day).

I also think that trying to improve the gloss first is advantageous
because it:

  • does not intrude on PWN's underlying relational structure, and
  • is readily extracted for checking by native speakers, who can
    simply compare the old vs new glosses (or gloss sets).
    In contrast, the effects of adding or deleting forms or synsets
    are not always readily apparent (although obviously it's still the
    best path sometimes).
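
The "compare the old vs new glosses" step could be supported by a word-level diff, so a reviewer sees only what changed. A rough sketch using Python's standard `difflib`; the function name `gloss_changes` is hypothetical, and splitting on whitespace is a simplifying assumption.

```python
import difflib


def gloss_changes(old, new):
    """Return (operation, old words, new words) for each edited span."""
    old_words, new_words = old.split(), new.split()
    matcher = difflib.SequenceMatcher(a=old_words, b=new_words, autojunk=False)
    changes = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":  # keep only inserted, deleted or replaced spans
            changes.append(
                (tag, " ".join(old_words[i1:i2]), " ".join(new_words[j1:j2]))
            )
    return changes
```

Running this over an old/new gloss pair yields a short list of edits that can be handed to native speakers without exposing them to the relational structure at all.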
@arademaker
Member

arademaker commented May 4, 2019

I agree that glosses are the real problem, probably because they were introduced later in PWN. But issues should be more specific, and what we really need is to devise a methodology for what a good template for glosses would be.

The link is broken; I think the stable link for this article is https://dl.acm.org/citation.cfm?id=1614064. I found other versions via Google too. Thank you for sharing. I am aware of the SUMO solutions, but I don't expect it to be easy to define a uniform approach for the ontology ~> WN clustering decisions. I tend to believe more in the original idea of using a semantic concordance, https://dl.acm.org/citation.cfm?id=1075742, as a guide for sense clustering.

I am reading the OntoNotes solution now and it seems to be more related to the idea of semantic concordance, corpus annotation.

@jmccrae jmccrae added the enhancement New feature or request label May 8, 2019
@jmccrae jmccrae added this to the 2020 Release milestone May 8, 2019
@jmccrae
Member

jmccrae commented May 8, 2019

This is closely related to #141

I think there are a few issues here:

  • Glosses are frequently very poorly written; there should be clearer guidelines on exactly what constitutes a definition.
  • We should avoid having a word be a hypernym of itself (there are currently 329 such cases).
  • There is no mechanism to group related senses, e.g., metonyms such as 'court' as a building vs 'court' as an institution. Traditional dictionaries would use hierarchies of senses here. Further, the lexicographer files have often led to distinctions that other resources would not make. For example, in these two senses of 'milk' the food meaning is distinguished from the body meaning, but this does not seem a meaningful distinction:

ewn-07860018-n (Interlingual Index: i78383)

(n) milk: a white nutritious liquid secreted by mammals and used as food by human beings
Topic: noun.food

ewn-05406377-n (Interlingual Index: i65344)

(n) milk: produced by mammary glands of female mammals for feeding their young
Topic: noun.body
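
The self-hypernym point in the list above lends itself to an automated check. The sketch below is only illustrative: it assumes "a word is a hypernym of itself" means a lemma that appears both in a synset and in one of its direct hypernym synsets, and it runs on a hypothetical in-memory dictionary rather than on the actual resource files. The function name, synset IDs and lemmas are made up.

```python
def self_hypernym_pairs(synsets):
    """List (lemma, synset id, hypernym id) where a lemma is its own hypernym.

    `synsets` maps a synset id to {"lemmas": [...], "hypernyms": [...]};
    this layout is assumed for illustration only.
    """
    hits = []
    for synset_id, entry in synsets.items():
        for hypernym_id in entry["hypernyms"]:
            # A lemma shared by a synset and its direct hypernym is a hit.
            shared = set(entry["lemmas"]) & set(synsets[hypernym_id]["lemmas"])
            for lemma in sorted(shared):
                hits.append((lemma, synset_id, hypernym_id))
    return hits


# Hypothetical toy data, not taken from the resource.
toy = {
    "toy-1": {"lemmas": ["alpha", "beta"], "hypernyms": ["toy-2"]},
    "toy-2": {"lemmas": ["alpha", "gamma"], "hypernyms": ["toy-3"]},
    "toy-3": {"lemmas": ["delta"], "hypernyms": []},
}
flagged = self_hypernym_pairs(toy)
```

Here `flagged` holds one entry, for the lemma "alpha" shared by "toy-1" and its hypernym "toy-2". Whether each such hit is an error or an acceptable sense pair would still need a case-by-case look.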

@vcvpaiva

I totally agree with @restinplace on:

  1. Poor glosses are common, but true synset duplicates (for alternative senses of a word) are rare.
  2. I usually scratch my head, but in the end I almost invariably agree that the sense distinction made in PWN is there (but I am not a native speaker).
  3. Trying to improve the gloss first is advantageous because it does not intrude on PWN's underlying relational structure (one shouldn't touch the graph unless absolutely necessary), and because glosses are readily extracted for checking by native speakers, who can simply compare the old vs new glosses.

I also agree with @jmccrae that:

  4. We should avoid having a word be a hypernym of itself (there are currently 329 such cases).

@jmccrae
Member

jmccrae commented Dec 27, 2019

Closing this as it seems to be a very general discussion which will not suggest any specific changes to the resource. For discussion of the process of writing definitions see #141. For a general idea of whether to split synsets I suggest we continue to proceed case by case.

The self-hypernyms can be discussed under #237

@jmccrae jmccrae closed this as completed Dec 27, 2019