Update Latin word tokenizer to handle 'nec' #146
Conversation
Following the practice of the Perseus NLP data, 'nec' now tokenizes as: 'nec' > ['c', 'ne']. For now, this is handled as a special case, with the token replaced in the list by the two tokens before other enclitics etc. are handled. See the comment in the file. The test has also been updated to cover 'nec'.
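For readers following along, here is a minimal usage sketch of the behavior described above, assuming the CLTK WordTokenizer interface of this era; the import path and class name may differ in later releases, and the output shown is the expected result per this PR rather than a verified run:

```python
# Minimal sketch, assuming the CLTK Latin WordTokenizer API of this era.
from cltk.tokenize.word import WordTokenizer

tokenizer = WordTokenizer('latin')

# Following the Perseus NLP practice, 'nec' is split into the enclitic
# conjunction and the negation particle, conjunction first.
print(tokenizer.tokenize('nec'))  # expected per this PR: ['c', 'ne']
```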
Very nice. I trust your judgment on the algorithm. Since this has seen so much development, I think that a few sentences would be fitting in the docs entry. Nothing urgent.
Update Latin word tokenizer to handle 'nec'
This is very nice!!! As I was reading some old Latin, something came to my mind.
Thanks, @marpozzi! I have been trying to get the tokenizer to follow the logic of the Perseus NLP data, which splits off the 'nec' and '-que' enclitics because these particles have a function in dependency grammar. That is, for example, in order to treebank a sentence with a "-que" word, you need to account for both the function of the word and the enclitic conjunction. The "-ce" in 'huiusce' is emphatic but has no function in the dependency grammar of the sentence, so I'm not sure it should be tokenized. I'd say it can just be lemmatized to "hic" in the next stage of preprocessing. (Of course, if your research question is about the frequency of emphatic enclitics, then...) If I'm missing something about the function of "-ce", let me know. Cf. this example from Cic. Cat. 1:
There are all sorts of similar questions that we should address in time: eiusmodo, republica, revera, tantummodo. I'm going to keep following the Perseus examples unless there's a reason not to... Speaking of which, I have written the tokenizer to take words like 'mecum' and return ['cum', 'me']. This is NOT a current practice of the Perseus NLP data, though I'm unclear why. I opened up an issue about this today to find out more, cf. PerseusDL/treebank_data#8
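To make the contrast concrete, here is a sketch of the two cases discussed above, again assuming the same WordTokenizer interface; the expected outputs follow the behavior described in this thread, not a verified run:

```python
from cltk.tokenize.word import WordTokenizer

tokenizer = WordTokenizer('latin')

# 'mecum' hides a real preposition, so it is split, preposition first.
print(tokenizer.tokenize('mecum'))    # expected per this thread: ['cum', 'me']

# The emphatic '-ce' has no role in dependency grammar, so per the discussion
# 'huiusce' is left whole and handled later by the lemmatizer (lemma 'hic').
print(tokenizer.tokenize('huiusce'))  # expected per this thread: ['huiusce']
```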
Very interesting. Thank you @marpozzi and @diyclassics for thinking about this. I'm not the best one to make a decision about this, so I put it into the hands of real scholars! Please do report back on what Perseus says about the ticket you opened.
Minor tweak—the tokenizer finds/replaces 'nec' with ['c', 'ne'] before looking at other enclitics/etc. Seemed easier to handle as a special case rather than try to match on final 'c'. Let me know if anyone thinks of a better way to handle this.
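For anyone curious how such a special case can be handled, here is a rough sketch of the substitution step described above; the function name and surrounding code are hypothetical and not the actual PR diff:

```python
def replace_nec(tokens):
    """Replace 'nec' with the two tokens ['c', 'ne'] before other
    enclitic handling runs (hypothetical helper, not the PR's code)."""
    out = []
    for token in tokens:
        if token.lower() == 'nec':
            out.extend(['c', 'ne'])
        else:
            out.append(token)
    return out

# Example: replace_nec(['nec', 'amat']) -> ['c', 'ne', 'amat']
```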