Update Latin word tokenizer to handle 'nec' #146
Conversation
Following the practice of the Perseus NLP data, 'nec' now tokenizes as: 'nec' > ['c', 'ne']. For now, this is handled as a special case, with the token replaced in the list by the two tokens before other enclitics etc. are handled. See the comment in the file. The test has also been updated to cover 'nec'.
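For readers following along, here is a minimal usage sketch of the behavior described above, assuming the CLTK WordTokenizer interface of this era; the import path and class name may differ in later releases, and the output shown is the expected result per this PR rather than a verified run:

```python
# Minimal sketch, assuming the CLTK Latin WordTokenizer API of this era.
from cltk.tokenize.word import WordTokenizer

tokenizer = WordTokenizer('latin')

# Following the Perseus NLP practice, 'nec' is split into the enclitic
# conjunction and the negation particle, conjunction first.
print(tokenizer.tokenize('nec'))  # expected per this PR: ['c', 'ne']
```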
Very nice. I trust your judgment on the algorithm. Since this has seen so much development, I think that a few sentences would be fitting in the docs entry. Nothing urgent.
Update Latin word tokenizer to handle 'nec'
This is very nice!!! As I was reading some old Latin, something came to my mind.
Thanks, @marpozzi! I have been trying to get the tokenizer to follow the logic of the Perseus NLP data, which splits off the 'nec' and '-que' enclitics because these particles have a function in dependency grammar. That is, for example, in order to treebank a sentence with a "-que" word, you need to account for both the function of the word and the enclitic conjunction. The "-ce" in 'huiusce' is emphatic but has no function in the dependency grammar of the sentence, so I'm not sure it should be tokenized. I'd say it can just be lemmatized to "hic" in the next stage of preprocessing. (Of course, if your research question is about the frequency of emphatic enclitics, then...) If I'm missing something about the function of "-ce", let me know. Cf. this example from Cic. Cat. 1:
There are all sorts of similar questions that we should address in time: eiusmodo, republica, revera, tantummodo. I'm going to keep following the Perseus examples unless there's a reason not to... Speaking of which, I have written the tokenizer to take words like 'mecum' and return ['cum', 'me']. This is NOT a current practice of the Perseus NLP data, though I'm unclear why. I opened up an issue about this today to find out more, cf. PerseusDL/treebank_data#8
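To make the contrast concrete, here is a sketch of the two cases discussed above, again assuming the same WordTokenizer interface; the expected outputs follow the behavior described in this thread, not a verified run:

```python
from cltk.tokenize.word import WordTokenizer

tokenizer = WordTokenizer('latin')

# 'mecum' hides a real preposition, so it is split, preposition first.
print(tokenizer.tokenize('mecum'))    # expected per this thread: ['cum', 'me']

# The emphatic '-ce' has no role in dependency grammar, so per the discussion
# 'huiusce' is left whole and handled later by the lemmatizer (lemma 'hic').
print(tokenizer.tokenize('huiusce'))  # expected per this thread: ['huiusce']
```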
Very interesting. Thank you @marpozzi and @diyclassics for thinking about this. I'm not the best one to make a decision about this, so I put it into the hands of real scholars! Please do report back on what Perseus says about the ticket you opened.
Minor tweak—the tokenizer finds/replaces 'nec' with ['c', 'ne'] before looking at other enclitics/etc. Seemed easier to handle as a special case rather than try to match on final 'c'. Let me know if anyone thinks of a better way to handle this.
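For anyone curious how such a special case can be handled, here is a rough sketch of the substitution step described above; the function name and surrounding code are hypothetical and not the actual PR diff:

```python
def replace_nec(tokens):
    """Replace 'nec' with the two tokens ['c', 'ne'] before other
    enclitic handling runs (hypothetical helper, not the PR's code)."""
    out = []
    for token in tokens:
        if token.lower() == 'nec':
            out.extend(['c', 'ne'])
        else:
            out.append(token)
    return out

# Example: replace_nec(['nec', 'amat']) -> ['c', 'ne', 'amat']
```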