Add to English validator: Style should almost always mean the form is distinct from the lemma #554

Open
nschneid opened this issue Dec 6, 2024 · 11 comments

Comments

@nschneid
Contributor

nschneid commented Dec 6, 2024

My understanding of Style is that it indicates a stylistically marked form of a word, and should normally have a more canonical lemma.

An exception could be Style=Arch on archaic pronouns thou, ye, etc., which have no precise modern equivalent to use as the lemma.

Case-sensitive equivalence:

I don't know if Grew permits searching for case-insensitive equality between two attributes of a word (@bguil?). But I know GUM has "Hmm" with lemma "hmm" and Style=Expr.
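Independently of Grew, the proposed check can be sketched in plain Python over CoNLL-U text. This is only an illustration under stated assumptions: the function names, the `ignore_case` switch, the default exemption of `Style=Arch`, and the toy annotations in `SAMPLE` are all hypothetical and are not taken from the actual English validator or from any treebank.

```python
def parse_feats(feats):
    """Turn a CoNLL-U FEATS string such as 'Style=Coll|VerbForm=Ger' into a dict."""
    if feats in ("", "_"):
        return {}
    return dict(pair.split("=", 1) for pair in feats.split("|"))


def style_lemma_warnings(conllu_text, ignore_case=False, exempt_styles=("Arch",)):
    """Return (id, form, lemma, style) for tokens that carry Style but whose
    form equals the lemma. The exemption list is an assumption for this sketch."""
    warnings = []
    for line in conllu_text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and sentence-level comments
        cols = line.split("\t")
        if len(cols) != 10:
            continue  # not a well-formed token line
        tok_id, form, lemma, feats = cols[0], cols[1], cols[2], cols[5]
        if "-" in tok_id or "." in tok_id:
            continue  # skip multiword token ranges and empty nodes
        style = parse_feats(feats).get("Style")
        if style is None:
            continue
        # Style may be multi-valued (comma-separated); exempt only if all values are exempt
        if set(style.split(",")) <= set(exempt_styles):
            continue
        same = form.lower() == lemma.lower() if ignore_case else form == lemma
        if same:
            warnings.append((tok_id, form, lemma, style))
    return warnings


# Toy input, invented for this sketch
SAMPLE = "\n".join([
    "# text = Hmm goin' thou cuz",
    "1\tHmm\thmm\tINTJ\tUH\tStyle=Expr\t0\troot\t_\t_",
    "2\tgoin'\tgo\tVERB\tVBG\tStyle=Coll|VerbForm=Ger\t1\tdep\t_\t_",
    "3\tthou\tthou\tPRON\tPRP\tStyle=Arch\t1\tdep\t_\t_",
    "4\tcuz\tcuz\tSCONJ\tIN\tStyle=Coll\t1\tdep\t_\t_",
])

strict = style_lemma_warnings(SAMPLE)
loose = style_lemma_warnings(SAMPLE, ignore_case=True)
```

With case-sensitive comparison only the toy "cuz" token is reported; with `ignore_case=True` the "Hmm"/"hmm" pair is reported as well, which is the case the question about case-insensitive equality is concerned with.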

@amir-zeldes
Contributor

> Style is that it indicates a stylistically marked form of a word, and should normally have a more canonical lemma.

I'm not sure I understand it the same way: for example, English "because" has stylistic variants "cause" and "cuz", but I'm not sure I want to say the lemma of "cause" is "because". Using "cause" is a stylistic choice, but there's no inflection going on or anything - it's just a stylistically marked alternative to "because".

@nschneid
Contributor Author

nschneid commented Dec 6, 2024

Isn't it an abbreviation? Like "'fraid" for "afraid" etc.

@amir-zeldes
Contributor

Not sure, I feel like it's its own word by now, no? dictionary.com lists both cause and cuz:

https://www.dictionary.com/browse/cuz

@nschneid
Contributor Author

nschneid commented Dec 6, 2024

"a shortened form of because"

@amir-zeldes
Contributor

Right, but "lab" is also a shortened form of "laboratory", and I don't think we've been lemmatizing the first to the second. It's now an independent form, no? I think this is different from the sort of ad hoc normalization we do for abbreviations that take on different forms, like e.g., eg, e.g -> e.g.

@nschneid
Contributor Author

nschneid commented Dec 8, 2024

Historically yes, but "lab" and "laboratory" are both mainstream rather than marked—I don't think any of us in normal parlance would say we hold laboratory meetings, though in writing "laboratory" is common enough.

I guess I am thinking that if there is an obvious/standard form of the word and a minor stylistically marked form, it makes sense to call that a stylistic variant with a shared lemma. But if they are treated as fully independent words with distinct lemmas, I would not use the Style feature. (That would make it a pure meaning feature that could apply to all sorts of formal or informal words; I take it the Style feature is "morphological" because it implies a contrast with another form.)

@nschneid
Contributor Author

nschneid commented Dec 8, 2024

(I agree that "'cause" involves a degree of conventionalization that goes beyond spelling variation—it's also part of the spoken language. I think this is similar to "'em" for "them", or "gonna" and "wanna", which we normalize to the non-colloquial lemma in combination with the Style feature. Also "c'mon" in EWT.)

@amir-zeldes
Contributor

If this annotation mandates that form != lemma, then I think it shouldn't be called Style. It would really be something like "NonCanonical" or similar. Lots of words have stylistic implications but are probably their own lemmas (for example, if I speak in pirate style and say "avast" or "ahoy", I don't think there are other lemmas for those). Is the intention here to point out non-canonical language? Or just informal/spoken language features?

@nschneid
Contributor Author

nschneid commented Dec 9, 2024

There can be exceptions for clearly archaic forms (e.g. we have "thou" as archaic) but formal/informal seems very hard to apply throughout the lexicon. Should we be comparing lexical frequencies in speech vs. edited writing in order to mark some words ("expectorate", "deceased") as formal and others ("busted", "yucky") as informal? I bet agreement would not be high if we left it to annotators' intuitions.

@amir-zeldes
Contributor

I agree - but all this makes me think we should maybe not be using Style as a feature? At the moment it's only used for very few items anyway.

@nschneid
Contributor Author

nschneid commented Dec 12, 2024

For words in a paradigm, like pronouns, it is useful for explaining the variants I think! And likewise if there is an alternative spelling that can be understood as colloquial/expressive ("walkin'", "looooong").

As it is a "morphological" feature I understand it as an explanation of why a variant form of a word exists, with the canonical form as lemma.
