
Update Latin word tokenizer to handle 'nec' #146

Merged
merged 3 commits into cltk:master on Feb 26, 2016

Conversation

diyclassics
Collaborator

Minor tweak: the tokenizer finds and replaces 'nec' with ['c', 'ne'] before looking at other enclitics, etc. It seemed easier to handle this as a special case than to try to match on a final 'c'. Let me know if anyone thinks of a better way to handle it.

Following the practice of the Perseus NLP data, 'nec' now tokenizes to:

'nec' > ['c', 'ne']

For now, this is handled as a special case with the token replaced in
the list with two tokens before handling other enclitics/etc. See
comment in file.

Also, test updated to handle 'nec'.
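For reference, a minimal sketch of the kind of pre-pass described above. The function name split_nec and the standalone structure are illustrative assumptions, not the actual CLTK tokenizer code:

```python
def split_nec(tokens):
    """Replace every 'nec' token with the pair ['c', 'ne'],
    before any other enclitic handling runs (following Perseus NLP practice).
    Illustrative sketch only, not the CLTK implementation."""
    out = []
    for token in tokens:
        if token.lower() == 'nec':
            out.extend(['c', 'ne'])
        else:
            out.append(token)
    return out

print(split_nec(['nec', 'plura', 'dixit']))
# -> ['c', 'ne', 'plura', 'dixit']
```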
@codecov-io

Current coverage is 80.07%

Merging #146 into master will increase coverage by +0.01% as of b4dfc71

@@            master    #146   diff @@
======================================
  Files           51      51       
  Stmts         2649    2650     +1
  Branches         0       0       
  Methods          0       0       
======================================
+ Hit           2121    2122     +1
  Partial          0       0       
  Missed         528     528       

Review entire Coverage Diff as of b4dfc71

Powered by Codecov. Updated on successful CI builds.

@kylepjohnson
Member

Very nice. I trust your judgment on the algorithm.

Since this has seen so much development, I think that a few sentences would be fitting in the docs entry. Nothing urgent.

The docs code is at docs/latin.rst.

kylepjohnson added a commit that referenced this pull request Feb 26, 2016
Update Latin word tokenizer to handle 'nec'
@kylepjohnson merged commit dbc2aae into cltk:master on Feb 26, 2016
@marpozzi
Contributor

This is very nice!!! As I was reading some old Latin, something came to my
troubled mind... I was wondering if CLTK should provide coverage for the
archaic "sort-of" enclitic "-ce" (as in huius-ce, hic-ce, etc.), which is quite
common in comedy, or leave it out. Perhaps it could be handled in the same vein
as the new handling of 'nec'.


@diyclassics
Collaborator Author

Thanks, @marpozzi! I have been trying to get the tokenizer to follow the logic of the Perseus NLP data, which tokenizes out the 'nec' and '-que' enclitics because these particles have a function in the dependency grammar. For example, in order to treebank a sentence with a "-que" word, you need to account for both the function of the word itself and the enclitic conjunction.

The "-ce" in 'huiusce' is emphatic but has no function in the dependency grammar of the sentence, so I'm not sure it should be tokenized. I'd say it just can be lemmatized to "hic" in the next stage of preprocessing. (Of course, if your research question is about the frequency of empathic enclitics, then...) If I'm missing something about the function of "-ce", let me know.

Cf. this example from Cic. Cat. 1:

<word id="1" form="hisce" lemma="hic1" postag="p-p---nb-" head="2" relation="ATR"/>

There are all sorts of similar questions that we should address in time: eiusmodi, republica, revera, tantummodo. I'm going to keep following the Perseus examples unless there's a reason not to...

Speaking of which, I have written the tokenizer to take words like 'mecum' and return ['cum', 'me']. This is NOT current practice in the Perseus NLP data, though I'm unclear why. I opened an issue about this today to find out more; cf. PerseusDL/treebank_data#8
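A rough sketch of how that '-cum' splitting might look, mirroring the 'nec' treatment. The compound list and the function name split_cum are illustrative assumptions, not the tokenizer's actual code:

```python
# Pronoun + 'cum' compounds expanded to ['cum', pronoun]; list is illustrative.
CUM_COMPOUNDS = {
    'mecum': ['cum', 'me'],
    'tecum': ['cum', 'te'],
    'secum': ['cum', 'se'],
    'nobiscum': ['cum', 'nobis'],
    'vobiscum': ['cum', 'vobis'],
    'quibuscum': ['cum', 'quibus'],
}

def split_cum(tokens):
    """Expand pronoun+'cum' compounds, leaving other tokens untouched.
    Illustrative sketch only, not the CLTK implementation."""
    out = []
    for token in tokens:
        out.extend(CUM_COMPOUNDS.get(token.lower(), [token]))
    return out

print(split_cum(['vade', 'mecum']))
# -> ['vade', 'cum', 'me']
```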

@kylepjohnson
Member

Very interesting. Thank you @marpozzi and @diyclassics for thinking about this. I'm not the best one to make a decision about this, so I'll put it into the hands of real scholars!

Please do report back on what Perseus says about the ticket you opened.
