Question about tokenizing words like 'tecum' #8
Comments
@gcelano has been working on normalizing the Perseus treebank data, but I'm not sure whether tokenization is one of the issues he is addressing. We followed slightly different practices in the early days of the treebank than we do now. You can find some discussion of this topic in the issues list for the tokenizer we currently use for Perseids: https://github.com/latin-language-toolkit/llt-tokenizer/issues. Switching from LLT to CLTK (or offering CLTK as an alternative) is something I have been interested in pursuing, as the LLT services are no longer actively maintained. We developed a RESTful API for the LLT tokenization and segmentation services that made them easy to integrate with other Perseids tools. It's not perfect, but the functionality exposed there may be of interest to others doing this sort of work, and standardizing on a RESTful API for this functionality would make it much easier to swap different implementations in and out.
Studying the treebanks in Latin Tündra Perseus, I am encountering cases of "nec" being analyzed into "c ne" (as @diyclassics seems also to have been doing in his tokenizer); cf. https://weblicht.sfs.uni-tuebingen.de/TundraPerseus/index.zul?tbname=PerseusLatin&tbsent=3119 . It is perfectly clear to me why it should be analyzed like this (I have read latin-language-toolkit/llt-tokenizer#27). It is less clear, however, why the analysis should be displayed in this order, and not as "ne c" or, even better, "ne -c" (cf. "virum -que"). The inverted order confuses readers of treebanked sentences and appears, on the whole, unnecessarily clumsy from a linguistic point of view; elsewhere you don't change the original word order in the display. And a sincere +1 for opening documentation on tokenizing decisions!
Hi Neven, you are right, and this is why nec and neque are now kept univerbated. Have a look at the repository, where there is a new version of the data (2.1). There you will find this problem solved for most texts, even though some of them (specified in the documentation) still need a major revision (which includes resolution of this problem). I will ask that the new data be made available in Tündra, but you may have to wait for some time (the upload does not depend on me). Best,
@gcelano: This is helpful to know. I wrote the CLTK tokenizer (with the nec/neque split and the reversed order) based on the example of the treebank data. My goal with these tools is to align them with large projects like the Perseus NLP research, so I will likely change it back now, and perhaps add a flag to keep the old behavior if the user wants it. Any thoughts on "-cum" compounds? Thanks!
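The enclitic behavior under discussion can be made concrete with a small sketch. This is illustrative code, not the actual CLTK or LLT implementation; the function name, the `reversed_order` flag, and the (partial) exception list are hypothetical:

```python
# Illustrative sketch of -que enclitic splitting with a switchable order.
# NOT the actual CLTK code; names and the partial exception list are
# hypothetical, based on the behavior described in this thread.
QUE_EXCEPTIONS = {"nec", "neque", "itaque", "quoque", "atque"}  # partial list

def split_que(token, reversed_order=False):
    """Split a final -que enclitic, marking it with a hyphen.

    Known univerbated forms (e.g. 'neque') are left whole. With
    reversed_order=True the enclitic is emitted first, mimicking the
    older 'que ne' style ordering discussed above.
    """
    low = token.lower()
    if low in QUE_EXCEPTIONS or not low.endswith("que"):
        return [token]
    stem, enclitic = token[:-3], "-que"
    return [enclitic, stem] if reversed_order else [stem, enclitic]
```

Under this sketch, `split_que("virumque")` yields `["virum", "-que"]`, matching the surface order of the text, while `split_que("neque")` leaves the exception whole.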
@gcelano -- thanks for confirming my suspicions. I have consulted the documentation at https://github.com/PerseusDL/treebank_data/tree/master/v2.1/Latin , and things are much clearer now. The 2.1 Latin treebanks in the repo (as forked yesterday) still show 57 occurrences of the split form; we should think about how to fix this, if you recommend that.
@diyclassics -- when, e.g., the Morpheus parser in the Morphology Service analyzes nec (http://services.perseids.org/bsp/morphologyservice/analysis/word?lang=lat&word=nec&engine=morpheuslat), it does not split the word in any way. This is completely fine by me!
Hi @diyclassics, I would suggest keeping tecum (and similia) separated. This is much better. We need to correct the tokenizer at Perseus in this respect.
See perseids-project/llt-tokenizer#1 for tracking of the requested tokenizer changes.
* Add feature to load Latin corpora directly using PlaintextCorpusReader. Using the Latin Library corpus as a test case, this feature allows you to refer to a corpus and load it with PlaintextCorpusReader using the following syntax: from cltk.corpus.latin import latinlibrary. The code checks that the corpus is installed in the main CLTK_DATA directory and raises an error if it is not there.
* Added missing comma to list 'cum_inclusions'.
* Stop tokenizing 'nec'. This is based on the discussion in PerseusDL/treebank_data#8; until this larger issue is resolved, it seems best to leave the form 'nec' as is.
* Added hyphen before tokenized enclitics. In the Perseus NLP data, many tokenized enclitics are distinguished with a hyphen, e.g. "-que". This brings the tokenizer more in line with that dataset and better distinguishes "ne" from "-ne".
* Add 'neque' to que_exceptions. Like 'nec', 'neque' should no longer be separated into two tokens, bringing the tokenizer more in line with the Perseus NLP data (see previous commit).
* Revert "Stop tokenizing 'nec'". This reverts commit 1e80159.
* Added hyphen before tokenized enclitics (same change as above).
* Reversed order of tokenized enclitics. To better follow Perseus NLP practice, enclitics are now tokenized in the order in which they appear, e.g. 'virumque' > ['virum', '-que']. See PerseusDL/treebank_data#8.
* Stop tokenizing 'nec' (same change as above).
* Better handle case for enclitic tokenization.
* Updated test_tokenizer.py to reflect recent changes to the Latin tokenizer.
* Rewrote "-cum" handling. Tokenization of "-cum" compounds, e.g. mecum, is now done through regex replacement on the original string rather than by iterating over and checking all of the tokens: more efficient and easier to read. Includes a function to maintain the case of the original after replacement.
* Handle exceptions in Latin tokenizer at the string level. Following the logic of the 'cum' compound handling, exceptional cases are now handled through regex replacement at the string level. This is much more efficient than the multiple list comprehensions previously in use and much easier to read and maintain. It also now correctly handles 'similist' and 'qualist'.
* Moved list of Latin exceptions to a separate file.
* Make tokenizer split final period; update test.
* Fixed typo in previous commit that would have made the Sanskrit tokenizer test fail.
* Updated tokenizer to use a local function instead of NLTK's word_tokenize (two commits).
* Updates to the Latin tokenizer. Most significant: special handling for Latin moved into its own function, which makes the general tokenizer code much easier to read and avoids the clutter that would arise from separate exceptions for each language. The Latin tokenizer also now splits on sentences before splitting on words, which allows (a) better handling of the '-ne' enclitic, which can now be tested only in sentence-initial position, and (b) custom handling of Latin abbreviations. The test cases included here are the praenomina; e.g., sentences will no longer incorrectly split on the name "Cn."
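The string-level "-cum" rewrite described in these commits can be sketched roughly as follows. This is a hedged reconstruction, not the actual CLTK code; the compound table, the hyphenated output convention ("me -cum"), and the helper names are all assumptions for illustration:

```python
import re

# Hypothetical table of personal-pronoun 'cum' compounds; the hyphen
# convention mirrors the enclitic style discussed in this thread.
CUM_COMPOUNDS = {
    "mecum": "me -cum", "tecum": "te -cum", "secum": "se -cum",
    "nobiscum": "nobis -cum", "vobiscum": "vobis -cum",
}

def match_case(source, target):
    """Carry the capitalization pattern of `source` over to `target`."""
    if source.isupper():
        return target.upper()
    if source[:1].isupper():
        return target.capitalize()
    return target

def replace_cum(text):
    """Split -cum compounds in a single regex pass over the raw string."""
    pattern = re.compile(r"\b(%s)\b" % "|".join(CUM_COMPOUNDS), re.IGNORECASE)
    return pattern.sub(
        lambda m: match_case(m.group(0), CUM_COMPOUNDS[m.group(0).lower()]),
        text,
    )
```

Working on the raw string means the replacement runs once before word splitting, instead of re-checking every token in a list comprehension, which is the efficiency gain the commit message describes.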
I noticed in the treebank data that compounds with "-cum", like 'tecum', are tokenized as a single token.
Is there a reason this is not tokenized as two tokens, i.e. 'cum' + 'te'? (Cf. 'neque', which is tokenized as 'que' + 'ne'.) From a treebanking point of view, this construction seems comparable to other prepositional phrases of the form 'cum' + ablative noun/pronoun.
More curiosity than anything else—I'm working on a Latin tokenizer myself and trying to follow Perseus NLP practice as closely as possible. Thanks!
P.S. Are these tokenizing decisions documented anywhere that I can review?