"at least", "in general", and related expressions: fixed? ExtPos? and validator rule prohibiting det(X, Y) & nmod(Y, Z) #553

Open
2 tasks
nschneid opened this issue Dec 3, 2024 · 29 comments

@nschneid
Contributor

nschneid commented Dec 3, 2024

Some instances of "at least" attaching as nmod to a det-dependent ("at least some...") are now triggering validator errors. See UniversalDependencies/docs#1059 (comment). We might as well change them all to specify ExtPos=ADV and attach as advmod rather than nmod.

Note: non-quantitative "at least" and "at most" are considered fixed expressions, so they are already taken care of.

@amir-zeldes
Contributor

We might as well change them all to specify ExtPos=ADV and attach as advmod rather than nmod.

Wouldn't that mean that we are starting to treat all "at least"s as fixed expressions?

@nschneid
Contributor Author

nschneid commented Dec 3, 2024

It would still be saying they are PPs internally, which I think is fair. Though TBH I don't understand the syntactic reason for distinguishing the quantitative and non-quantitative ones—isn't it just a matter of idiomatic meaning?

@amir-zeldes
Contributor

It would still be saying they are PPs internally

Wait, are you saying you want to keep obl+case for non-quantitative "at least", but you want to put ExtPos=ADV on the head? I'm not sure that makes sense to me: obl dependents are already adverbial in the sense that any adverbial PP is, so I don't see what this adds. It makes more sense for the fixed version (with quantities).

@nschneid
Contributor Author

nschneid commented Dec 5, 2024

Right now it's the nonquantity ones that are fixed. Which I find confusing.

What about dispensing with fixed entirely:

at/case least/ADJ[ExtPos=ADV]/advmod 3 books

at/case least/ADJ[ExtPos=ADV]/advmod some homework

at/case least/ADJ/obl you are having fun

@amir-zeldes
Contributor

I agree it's confusing, and I'd be for making them all the same. But I don't think we should have a compositional-looking subtree with case and then use advmod + ExtPos. I'd do one or the other:

  1. Keep obl + case, then it's just compositional and there's no need for advmod or ExtPos
  2. Decide it's a special multi-word adverbial, then use fixed + advmod + ExtPos

I feel like mixing the two strategies is confusing.

@nschneid
Contributor Author

nschneid commented Dec 6, 2024

The problem is we can't do nmod+case for "at least some (books)", because the validator now prohibits it.

@amir-zeldes
Contributor

That suggests we should prefer fixed+advmod, which is fine by me, though I will note there are variants ("at the least" comes to mind, or "at the very least").

@nschneid
Contributor Author

nschneid commented Jan 7, 2025

Ruling from the Core Group: under a word functioning as det or nummod, dependents should have obl rather than nmod. The theory is that these uses of DET and NUM are in a structural sense more adjective-like than noun-like. The validator should accept obl(some, least), case(least, at).
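A minimal Python sketch of the ruling above, for illustration only. It is not the UD validator's code: the `Token` tuple and `flag_nmod_under_modifier` are hypothetical names, the tree for "at least some homework" is hand-built, and comparing only the base relation (ignoring subtypes) is an assumption.

```python
from typing import List, NamedTuple


class Token(NamedTuple):
    idx: int      # 1-based token ID
    form: str
    upos: str
    head: int     # 0 means the token is the root
    deprel: str


def flag_nmod_under_modifier(tokens: List[Token]) -> List[str]:
    """Return one message per nmod dependent whose head is attached as det or nummod."""
    by_id = {t.idx: t for t in tokens}
    problems = []
    for t in tokens:
        parent = by_id.get(t.head)
        if parent is None:
            continue  # the root has no parent token
        # Assumption: subtypes are ignored, only the base relation is compared.
        if t.deprel.split(":")[0] == "nmod" and parent.deprel.split(":")[0] in {"det", "nummod"}:
            problems.append(
                f"nmod({parent.form}, {t.form}) under {parent.deprel}; expected obl"
            )
    return problems


# Hypothetical fragment: "at least some homework", old annotation with nmod.
old = [
    Token(1, "at", "ADP", 2, "case"),
    Token(2, "least", "ADJ", 3, "nmod"),
    Token(3, "some", "DET", 4, "det"),
    Token(4, "homework", "NOUN", 0, "root"),
]
# Same fragment after the ruling: obl(some, least), case(least, at).
new = [t._replace(deprel="obl") if t.form == "least" else t for t in old]

print(flag_nmod_under_modifier(old))
print(flag_nmod_under_modifier(new))
```

The old annotation yields one message and the updated one yields none, which mirrors the change from nmod to obl described in the ruling.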

@nschneid
Contributor Author

nschneid commented Jan 7, 2025

I was able to easily update the EWT tokens of at least/at most + nummod/det to be obl instead of nmod.

  • at least a couple nuclear weapons: This is really the same construction with "at least" modifying a quantity that modifies a nominal. However, "a couple" is annotated as nmod:unmarked, so I am leaving nmod(couple, least).
  • ...of at least two of them: I assume "two" is acting nominally here so keeping nmod for "at least".

A broader query for this structure, https://universal.grew.fr/?custom=677d861f9252b, also surfaces many tokens that are ranges, e.g. "2-4 days". Our convention is to treat "- 4" as a PP equivalent to "to 4". Should these also be changed to obl?

@amir-zeldes
Contributor

Should these also be changed to obl?

I can't say I find it super intuitive (I really don't think of numbers as adjectives at all), but if that's the ruling then that's what we have to do. Or to put it differently, if we don't like this for the number ranges, I think we have to take it back for a second round of discussions with the Core Group, but otherwise, yeah, this falls under that decision IMO.

@nschneid
Contributor Author

nschneid commented Jan 7, 2025

You know I think "2-3 days" resembles "2 or 3 days" and would be open to treating ranges as coordination, but I don't expect to win that fight. :) Let's change to obl then.

@nschneid
Contributor Author

nschneid commented Jan 7, 2025

OK here's a case to consider: "80-120 million barrels". It uses compound(million, 80) and nmod(80, 120). Is compound an adjective-like environment warranting obl?? Seems strange to use obl for "80-120 barrels" but nmod when adding "million" in there.

@amir-zeldes
Contributor

I took @jnivre's position to be that numbers fundamentally take obl rather than nmod, unless they are standing in as a nominal head. A compound dependent is still a modifier, and the upos is still NUM, so I would assume the position would still be that it should take obl (Joakim can correct me if I misunderstood). But I have to agree, it doesn't seem intuitive to me that numbers should take obl dependents when they are inside an NP. For a modifier of an NP head there is a difference (obl modifies the predication a noun heads, nmod modifies the noun head itself: "It's Tuesday in Japan" vs. "it's Tuesday of last week"), but within an NP I don't understand what the contrast between nmod and obl is.

@jnivre

jnivre commented Jan 11, 2025

I am not completely sure what the question is here, but if it's about an "at least/most" that attaches to "80", then I would say it's still obl. For me it is strange to use compound for the relation between "80" and "million", but I guess I am applying my Swedish compound intuitions as usual. :)

@nschneid
Contributor Author

nschneid commented Jan 11, 2025

unless they are standing in as a nominal head

OK, so as the head of a full nominal. I thought I understood @jnivre & @dan-zeman to be saying that we should be guided by parent deprels rather than UPOS in formulating these rules. But results of this query suggest a strict interpretation of this rule beyond determiners/numbers will run into problems. In attributive position, we find proper names like "World's Fair" ("World's Fair museum": compound(museum, Fair)), but I don't think that should prevent us from using nmod:poss. Likewise for nmod:desc ("St Pancras station"), and common nouns modified by PPs ("a unit by unit basis", "a state of the art result").

So, let's just say that in English, DETs and NUMs are modifiers by default, and PP modifiers of them should be obl, unless they are "promoted" to stand as head of a full nominal (which we can tell from the external deprel).
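The rule of thumb just stated can be sketched as a small Python function. This is a hedged illustration, not an official procedure: `expected_pp_relation` is a hypothetical name, and the set of external relations treated as signalling "promotion" to a full nominal is an assumption chosen to match the examples in this thread, not an exhaustive list.

```python
# Assumption: external relations that suggest the DET/NUM heads a full nominal.
# Illustrative only; the thread does not enumerate them.
NOMINAL_ARGUMENT_DEPRELS = {"nsubj", "obj", "iobj", "obl", "nmod"}


def expected_pp_relation(head_upos: str, head_deprel: str) -> str:
    """Guess whether a PP attached to this head should be obl or nmod."""
    base = head_deprel.split(":")[0]
    if head_upos in {"DET", "NUM"}:
        # Modifier by default; nmod only when "promoted" to head of a nominal.
        return "nmod" if base in NOMINAL_ARGUMENT_DEPRELS else "obl"
    if head_upos in {"NOUN", "PROPN", "PRON"}:
        # Ordinary nominals keep nmod even in attributive (compound) position.
        return "nmod"
    return "obl"


examples = [
    ("some", "DET", "det"),        # "at least some homework"
    ("80", "NUM", "compound"),     # "80-120 million barrels"
    ("two", "NUM", "nmod"),        # "...of at least two of them"
    ("Fair", "PROPN", "compound"), # "World's Fair museum"
]
for form, upos, deprel in examples:
    print(form, upos, deprel, "->", expected_pp_relation(upos, deprel))
```

Keying the DET/NUM branch on the external deprel, and everything else on UPOS, reflects the compromise in the comment above: a purely deprel-based rule would wrongly push "World's Fair museum" away from nmod:poss.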

@nschneid
Contributor Author

For me it is strange to use compound for the relation between "80" and "million", but I guess I am applying my Swedish compound intuitions as usual.

Multiword number names are tricky and we had a section exploring them in https://arxiv.org/abs/2108.12928. What would it be in Swedish?

@jnivre

jnivre commented Jan 11, 2025

I agree that multiword numbers are tricky, but for the specific case of "80 miljoner" (lit. "80 millions"), we just treat "miljoner" as an ordinary noun and attach "80" as nummod. If the whole expression modifies another head noun, as in "80 miljoner fat" ("80 million barrels"), then "miljoner" attaches to "fat" as nmod.

@nschneid
Contributor Author

nschneid commented Jan 11, 2025

Hmm. Semantically, in "80 miljoner fat" / "80 million barrels", there is a complex number serving as a quantity modifier of an entity-referring-noun. I guess the question is whether nummod means

(a) a simple numeral word modifying some other word (possibly another quantity), or
(b) a number expression modifying a count noun, following the language's ordinary modification construction for expressing counts of entities, or
(c) both.

(N.B. By "number expression" for present purposes I mean a cardinal numeral that can be expressed with a series of digits, even if verbalized as multiple words. By "ordinary construction" I mean to rule out other morphosyntactic forms being occasionally recruited for this purpose—I can't think of a great English example for small counts but something like "books numbering 3" meaning '3 books'. Or "books in the millions", "millions of books", etc. for large counts. None of these should be nummod IMO.)

nschneid added a commit that referenced this issue Jan 11, 2025
@jnivre

jnivre commented Jan 11, 2025 via email

@nschneid
Contributor Author

@jnivre in English we tag words like "million" as NUM, and attach them as nummod in expressing dollar amounts for instance. https://universal.grew.fr/?custom=6782b41c66e8f

@jnivre

jnivre commented Jan 11, 2025 via email

@dan-zeman
Member

That is indeed the question. But since (unless I'm mistaken) the validator only allows "nummod" with upos NUM, we cannot make "miljoner" a "nummod".

The validator currently permits NUM, NOUN and SYM, although the error message mentions only NUM as the expected UPOS. The comment in the code refers to the discussion in UniversalDependencies/docs#596.
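As a rough illustration of the check described above, here is a Python sketch. It is not taken from the validator's source: `ALLOWED_NUMMOD_UPOS` and `nummod_upos_ok` are hypothetical names, and the allowed set simply restates what this comment reports.

```python
# Restates the comment above: nummod is accepted with these UPOS tags.
ALLOWED_NUMMOD_UPOS = {"NUM", "NOUN", "SYM"}


def nummod_upos_ok(upos: str, deprel: str) -> bool:
    """Return False only for a nummod dependent with a disallowed UPOS."""
    if deprel.split(":")[0] != "nummod":
        return True  # the check applies to nummod dependents only
    return upos in ALLOWED_NUMMOD_UPOS


print(nummod_upos_ok("NOUN", "nummod"))  # e.g. Swedish "miljoner" tagged NOUN
print(nummod_upos_ok("ADJ", "nummod"))
```

Under this reading, tagging "miljoner" as NOUN would not by itself block a nummod attachment.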

@nschneid
Contributor Author

it behaves like a noun morphosyntactically, so I guess it is yet another case where one has to choose whether to go with morphosyntax or with semantics.

I always thought NUM was a semantically-based cross between NOUN and DET. Though if there's plural marking (tens of books, the 1990s), we tag as NOUN in English.

@amir-zeldes
Contributor

Though if there's plural marking (tens of books, the 1990s), we tag as NOUN in English.

I think that's the issue with Swedish, isn't it? Isn't "miljoner" morphologically plural?

For me it is strange to use compound for the relation between "80" and "million"

It's sort of an inheritance from SD, but I think it makes sense in its own right, because in English "million" is not pluralized, similarly to how compounds with a number don't trigger pluralization ("a two day trip", not "a two days trip").

@nschneid
Contributor Author

Well "million" is not attaching as compound; "80" is attaching as compound. So I don't think the singular status is relevant.

@amir-zeldes
Contributor

"80" is attaching as compound.

Yeah, and in "two day plan" I think "two" is also compound, not nummod, right?

@nschneid
Contributor Author

Oh, now I see your point. (I am always confused by the overloading of the term "compound" to mean both the attributive relation of a word or phrase with respect to a noun, and the internal structure of that phrase.)

With a standard N+N combination, like "egg carton", it is the dependent of the compound relation that resists being plural.

With "80 million dollars", it is the dependent of the nummod relation and the head of the compound relation that resists being plural ("million" not "millions").

Your point is that if "80" were the nummod, even when it's not attributively modifying another noun ("The population is 80 million"), we would expect "millions". Which makes sense I guess. For Swedish, where it IS plural, nested nummods would make sense.

@amir-zeldes
Contributor

Exactly!
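To put the two analyses discussed in this exchange side by side, here is a small Python sketch. The edge lists restate the relations mentioned in the thread for English "80 million dollars" and Swedish "80 miljoner fat"; the triple representation and the `relation_of` helper are hypothetical conveniences, with word forms standing in for token IDs.

```python
# (dependent, head, relation) triples.
ENGLISH = [
    ("80", "million", "compound"),    # "million" stays singular
    ("million", "dollars", "nummod"),
]
SWEDISH = [
    ("80", "miljoner", "nummod"),     # "miljoner" is plural, treated as a noun
    ("miljoner", "fat", "nmod"),
]


def relation_of(edges, dependent):
    """Look up the relation under which a word attaches to its head."""
    for dep, _head, rel in edges:
        if dep == dependent:
            return rel
    raise KeyError(dependent)


for name, edges in (("English", ENGLISH), ("Swedish", SWEDISH)):
    print(name, ", ".join(f"{rel}({head}, {dep})" for dep, head, rel in edges))
```

The contrast is in how "80" attaches: as compound where the multiplier word resists plural marking, and as nummod where it is pluralized.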

nschneid added a commit to UniversalDependencies/docs that referenced this issue Feb 9, 2025
…general"

- "at least" is now never fixed (previously there was a semantic distinction)
- "in general" is documented as non-fixed
@nschneid
Contributor Author

nschneid commented Feb 9, 2025

@amir-zeldes and I have decided that, despite a strong presumption of stare decisis for entries in the fixed list, the treatment of at least as fixed in some meanings was problematic.

Old guidelines

Image

&

Image

New guidelines

Image

A technical issue

It is not immediately obvious whether the superlative words after "at" should be ADJ or ADV. The guidelines previously had ADV for at best/worst. Empirically, the situation in EWT

Image

and in GUM

Image

is a strong preference for the ADJ/JJS tags. I presume we should standardize that and update the RBS tokens.

in general

I have added "in general", which was not previously documented, as non-fixed. It is another ADP+ADJ combination with idiomatic meaning that can nonetheless be analyzed as a regular PP.

Image

@nschneid nschneid changed the title ExtPos/advmod for quantitative "at (the) least/most" "at least", "in general", and related expressions: fixed? ExtPos? Feb 9, 2025
@nschneid nschneid added the MWE label Feb 9, 2025
@nschneid nschneid changed the title "at least", "in general", and related expressions: fixed? ExtPos? "at least", "in general", and related expressions: fixed? ExtPos? and validator rule prohibiting det(X, Y) & nmod(Y, Z) Feb 9, 2025