
Question about tokenizing words like 'tecum' #8

Open
diyclassics opened this issue Feb 26, 2016 · 7 comments

Comments

@diyclassics

I noticed in the treebank data that compounds with "-cum"—like 'tecum'—are tokenized as a single token. E.g.

<word id="9" form="tecum" lemma="tu1" postag="p-s---mb-" head="11" relation="ADV"/>

Is there a reason that this is not tokenized as two tokens, i.e. 'cum' + 'te'? (Cf. 'neque', which is tokenized as 'que' + 'ne'.) From a treebanking point of view, this construction seems comparable to other prepositional phrases of the form 'cum' + abl. noun/pronoun.
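For concreteness, here is a minimal sketch of the split I have in mind; the helper name split_cum and its compound list are illustrative, not Perseus code:

CUM_COMPOUNDS = {"mecum", "tecum", "secum", "nobiscum", "vobiscum",
                 "quocum", "quacum", "quibuscum"}

def split_cum(token):
    """Split a '-cum' compound, e.g. 'tecum' -> ['cum', 'te'].

    The reversed order mirrors the treebank's 'neque' -> 'que' + 'ne'.
    """
    if token.lower() in CUM_COMPOUNDS:
        return ["cum", token[:-3]]  # strip the trailing 'cum'
    return [token]

print(split_cum("tecum"))  # ['cum', 'te']
print(split_cum("virum"))  # ['virum']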

More curiosity than anything else—I'm working on a Latin tokenizer myself and trying to follow Perseus NLP practice as closely as possible. Thanks!

ps. Are these tokenizing decisions documented anywhere that I can review?

@balmas
Contributor

balmas commented Mar 18, 2016

@gcelano has been working on normalizing the Perseus treebank data but I'm not sure if tokenization is one of the issues he is addressing. We followed slightly different practices in the early days of the treebank than we do now. You can find some discussions on this topic in the issues list for the tokenizer we are currently using for Perseids: https://github.com/latin-language-toolkit/llt-tokenizer/issues

Switching from LLT to CLTK (or offering CLTK as an alternative) is something I have been interested in pursuing, as the LLT services are no longer actively maintained.

We developed a RESTful API for the LLT tokenization and segmentation services that made it easy to integrate with other Perseids tools. It's not perfect, but the functionality exposed there may be interesting to others doing this sort of work, and standardizing on a RESTful API for this functionality would make it much easier to swap different implementations in and out.

@nevenjovanovic

nevenjovanovic commented Apr 30, 2016

Studying the treebanks in Latin Tündra Perseus, I am encountering cases of "nec" being analyzed into "c ne" (as @diyclassics seems also to have been doing in his tokenizer); cf. https://weblicht.sfs.uni-tuebingen.de/TundraPerseus/index.zul?tbname=PerseusLatin&tbsent=3119 . It is perfectly clear to me why it should be analyzed like this (I have read latin-language-toolkit/llt-tokenizer#27). It is less clear, however, why the analysis should be displayed in this order, and not as ne c, or, even better, ne -c (cf. virum -que). The inverted order confuses readers of treebanked sentences, and appears, on the whole, unnecessarily clumsy from the linguistic point of view; elsewhere you don't change the original word order in the display.
Nevertheless, there are 29 occurrences of non-analyzed "nec" in the treebank on Tündra, cf., e.g., https://weblicht.sfs.uni-tuebingen.de/TundraPerseus/index.zul?tbname=PerseusLatin&tbsent=471 , or search there for [word="nec"]. I think that the Latin treebank is inconsistent here, which is not acceptable if we want to have a gold standard.

And a sincere +1 for opening documentation on tokenizing decisions!

@gcelano
Contributor

gcelano commented Apr 30, 2016

Hi Neven,

You are right, and this is why nec and neque are now kept univerbated (i.e., as single tokens). Have a look at the repository, where there is a new version of the data (2.1). There you will find this problem solved for most texts, even though some of them (specified in the documentation) still need a major revision (which includes resolving this problem).

I will ask for the new data to be made available in Tündra, but it may take some time (the upload does not depend on me).

Best,
Giuseppe


@diyclassics
Author

diyclassics commented Apr 30, 2016

@gcelano—This is helpful to know—I wrote the CLTK tokenizer (with the nec/neque split and reversed order) based on the example from the treebank data. My goal with these tools is to align them with large projects like the Perseus NLP research. I will likely change it back now, perhaps adding a flag to keep the old behavior if the user wants it.
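Roughly what I have in mind for the flag, as a sketch (split_nec is a provisional name, not the current CLTK API):

def word_tokenize(text, split_nec=False):
    """Tokenize on whitespace; optionally split 'nec' -> ['ne', '-c']."""
    tokens = []
    for token in text.split():
        if split_nec and token.lower() == "nec":
            tokens.extend([token[:2], "-" + token[2:]])  # 'nec' -> ['ne', '-c']
        else:
            tokens.append(token)
    return tokens

print(word_tokenize("nec tamen"))                  # ['nec', 'tamen']
print(word_tokenize("nec tamen", split_nec=True))  # ['ne', '-c', 'tamen']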

Any thoughts on "-cum" compounds? Thanks!

@nevenjovanovic

nevenjovanovic commented Apr 30, 2016

@gcelano -- thanks for confirming my suspicions. I have consulted the documentation at https://github.com/PerseusDL/treebank_data/tree/master/v2.1/Latin , and things are much clearer now. The 2.1 Latin treebanks in the repo (as forked yesterday) still show 57 occurrences of //*[@form='c' and @lemma='que1'] when I load them into an XML database, in the following files (a quick way to reproduce the count is sketched after the list):

  • phi0959.phi006.perseus-lat1.tb.xml
  • phi0972.phi001.perseus-lat1.xml
  • tlg0031.tlg027.perseus-lat1.tb.xml
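The count above can be reproduced with a few lines of Python (a sketch assuming local copies of the three files; lxml is one option among many):

from lxml import etree

files = [
    "phi0959.phi006.perseus-lat1.tb.xml",
    "phi0972.phi001.perseus-lat1.xml",
    "tlg0031.tlg027.perseus-lat1.tb.xml",
]

total = 0
for f in files:
    hits = etree.parse(f).xpath("//*[@form='c' and @lemma='que1']")
    print(f, len(hits))
    total += len(hits)
print("total:", total)  # 57 in the fork described above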

We should think about how to fix this -- if you recommend that nec and neque remain unanalyzed (as, I guess, you are doing with οὐδέ in the Greek treebank, and as seems to be current practice for Latin), I will talk with Filip to organize a correction and a pull request.

@diyclassics -- when, e.g., the Morpheus parser in the Morphology Service analyzes nec (http://services.perseids.org/bsp/morphologyservice/analysis/word?lang=lat&word=nec&engine=morpheuslat), it does not split the word in any way. This is completely fine by me!
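The service can be queried directly, e.g. with the requests library (a sketch; the response format is whatever the service returns and is not shown here):

import requests

# Ask the Perseids morphology service to analyze the unsplit form 'nec'.
url = "http://services.perseids.org/bsp/morphologyservice/analysis/word"
params = {"lang": "lat", "word": "nec", "engine": "morpheuslat"}

resp = requests.get(url, params=params, timeout=30)
resp.raise_for_status()
print(resp.text)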

@gcelano
Contributor

gcelano commented May 2, 2016

Hi @diyclassics,

I would suggest keeping tecum (and similar forms) separated; this is much better. We need to correct the tokenizer at Perseus in this respect.

@balmas
Contributor

balmas commented May 2, 2016

See perseids-project/llt-tokenizer#1 for tracking of the requested tokenizer changes.

kylepjohnson pushed commits to cltk/cltk that referenced this issue (Jun 1, Jun 3, and Jun 9, 2016):
* Add feature to load Latin corpora directly using PlaintextCorpusReader

Using the Latin Library corpus as a test case, this feature allows
you to refer to a corpus and load it with PlaintextCorpusReader using
the following syntax:

from cltk.corpus.latin import latinlibrary

The code checks to make sure that a corpus is installed in the
main CLTK_DATA and raises an error if it is not there.

* Added missing comma to list 'cum_inclusions'

* Stop tokenizing 'nec'

This is based on the discussion here: PerseusDL/treebank_data#8. Until this larger issue is resolved, it seems best to leave the form 'nec' as is.

* Added hyphen before tokenized enclitics

Looking at the Perseus NLP data, many of the tokenized enclitics
are distinguished using a hyphen, e.g. "-que". This brings it more
in line with that dataset. It also better distinguishes "ne" from "-ne".

* Add 'neque' to que_exceptions

Like 'nec', 'neque' should no longer be separated as two tokens to
bring the tokenizer more in line with Perseus NLP data (see
previous commit).

* Revert "Stop tokenizing 'nec'"

This reverts commit 1e80159.

* Added hyphen before tokenized enclitics

Looking at the Perseus NLP data, many of the tokenized enclitics
are distinguished using a hyphen, e.g. "-que". This brings it more
in line with that dataset. It also better distinguishes "ne" from "-ne".

* Reversed order of tokenized enclitics

To better follow Perseus NLP practice, enclitics are now tokenized
in the order in which they appear, e.g.

'virumque' > ['virum', '-que']

See PerseusDL/treebank_data#8
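A sketch of the order-preserving split described in this commit (the exception list is abridged and the names are illustrative, not the actual CLTK code):

QUE_EXCEPTIONS = {"neque", "atque", "itaque", "quoque"}  # abridged

def split_que(token):
    """Split a trailing '-que', keeping surface order: 'virumque' -> ['virum', '-que']."""
    if token.lower().endswith("que") and token.lower() not in QUE_EXCEPTIONS:
        return [token[:-3], "-que"]
    return [token]

print(split_que("virumque"))  # ['virum', '-que']
print(split_que("neque"))     # ['neque'] (listed as an exception)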

* Stop tokenizing 'nec'

This is based on the discussion here: PerseusDL/treebank_data#8. Until this larger issue is resolved, it seems best to leave the form 'nec' as is.

* Better handle case for enclitic tokenization

* Updated test_tokenizer.py to reflect recent changes to the Latin tokenizer

* Rewrote "-cum" handling

Tokenization for "-cum" compounds, e.g. mecum, is now done through
regex replacement on the original string rather than by
iterating over and check all of the tokens. More efficient,
easier to read.

Includes a function to maintain case of original after
replacement.
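A sketch of the string-level approach described here (the pattern and the case-restoring replacer are illustrative reconstructions, not the committed code):

import re

CUM_PATTERN = r"\b(me|te|se|nobis|vobis|quo|qua|quibus)cum\b"

def replace_cum(match):
    """Rewrite e.g. 'mecum' -> 'cum me', preserving title-case of the original."""
    split = "cum " + match.group(1)
    return split.capitalize() if match.group().istitle() else split.lower()

def split_cum_compounds(text):
    return re.sub(CUM_PATTERN, replace_cum, text, flags=re.IGNORECASE)

print(split_cum_compounds("Tecum vivere amem"))  # 'Cum te vivere amem'
print(split_cum_compounds("pax vobiscum"))       # 'pax cum vobis'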

* Handle exceptions in Latin tokenizer at string level

Following the logic of 'cum' compound handling, exceptional cases
are now handled through regex replacement at the string level.
This is much more efficient than the multiple list comprehensions
currently in use and much easier to read/maintain. It also now
correctly handles 'similist' and 'qualist'.

* Moved list of Latin exceptions to a separate file

* Make tokenizer split final period; update test.

* Fixed typo in previous commit that would make the Sanskrit tokenizer test fail

* Updated tokenizer to use local function instead of NLTK's word_tokenize

* Updated tokenizer to use local function instead of NLTK's word_tokenize

* Updates to Latin tokenizer

A few changes:
- Most significant: Special handling for Latin moved into its own
    function. Makes the general tokenizer code much easier to read
    and makes an effort to avoid the clutter that will arise from
    separate exceptions for each language.
- Latin tokenizer now splits on sentences before splitting on words.
    This allows:
        - Better handling of the '-ne' enclitic, which can now be tested
            only in sentence-initial position.
        - Custom handling of Latin abbreviations. The test cases
            included here are the praenomina; e.g., sentences will no
            longer incorrectly split on the name "Cn."
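A sketch of the sentence-first approach with abbreviation handling, using NLTK's Punkt (the praenomina set is abridged and illustrative):

from nltk.tokenize.punkt import PunktParameters, PunktSentenceTokenizer

# Tell the sentence splitter that praenomina abbreviations do not end sentences.
params = PunktParameters()
params.abbrev_types = {"a", "c", "cn", "d", "l", "m", "p", "q", "sex", "t", "ti"}

sent_tokenizer = PunktSentenceTokenizer(params)

text = "Cn. Pompeius Romam venit. Deinde abiit."
for sent in sent_tokenizer.tokenize(text):
    print(sent)
# -> 'Cn. Pompeius Romam venit.'
# -> 'Deinde abiit.'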