
Question about tokenizing words like 'tecum' #8

Open
diyclassics opened this issue Feb 26, 2016 · 7 comments

Comments

@diyclassics

I noticed in the treebank data that compounds with "-cum"—like 'tecum'—are tokenized as a single token. E.g.

<word id="9" form="tecum" lemma="tu1" postag="p-s---mb-" head="11" relation="ADV"/>

Is there a reason that this is not tokenized as two tokens, i.e. 'cum' + 'te'? (Cf. 'neque', which is tokenized as 'que' + 'ne'.) From a treebanking point of view, this construction seems comparable to other prepositional phrases of the form 'cum' + abl. noun/pronoun.
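For concreteness, here is a minimal sketch of the split I have in mind; the helper name split_cum and its compound list are illustrative, not Perseus code:

CUM_COMPOUNDS = {"mecum", "tecum", "secum", "nobiscum", "vobiscum",
                 "quocum", "quacum", "quibuscum"}

def split_cum(token):
    """Split a '-cum' compound, e.g. 'tecum' -> ['cum', 'te'].

    The reversed order mirrors the treebank's 'neque' -> 'que' + 'ne'.
    """
    if token.lower() in CUM_COMPOUNDS:
        return ["cum", token[:-3]]  # strip the trailing 'cum'
    return [token]

print(split_cum("tecum"))  # ['cum', 'te']
print(split_cum("virum"))  # ['virum']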

More curiosity than anything else—I'm working on a Latin tokenizer myself and trying to follow Perseus NLP practice as closely as possible. Thanks!

ps. Are these tokenizing decisions documented anywhere that I can review?

@balmas
Contributor

balmas commented Mar 18, 2016

@gcelano has been working on normalizing the Perseus treebank data but I'm not sure if tokenization is one of the issues he is addressing. We followed slightly different practices in the early days of the treebank than we do now. You can find some discussions on this topic in the issues list for the tokenizer we are currently using for Perseids: https://github.com/latin-language-toolkit/llt-tokenizer/issues

Switching from LLT to CLTK (or offering CLTK as an alternative) is something I have been interested in pursuing, as the LLT services are no longer actively maintained.

We developed a RESTful API for the LLT tokenization and segmentation services that made it easy to integrate with other Perseids tools. It's not perfect, but the functionality exposed there may be interesting to others doing this sort of work, and standardizing on a RESTful API for this functionality would make it much easier to swap different implementations in and out.

@nevenjovanovic

nevenjovanovic commented Apr 30, 2016

Studying the treebanks in Latin Tündra Perseus, I am encountering cases of "nec" being analyzed into "c ne" (as @diyclassics seems also to have been doing in his tokenizer); cf. https://weblicht.sfs.uni-tuebingen.de/TundraPerseus/index.zul?tbname=PerseusLatin&tbsent=3119 . It is perfectly clear to me why it should be analyzed like this (I have read latin-language-toolkit/llt-tokenizer#27). It is less clear, however, why the analysis should be displayed in this order, and not as ne c, or, even better, ne -c (cf. virum -que). The inverted order confuses readers of treebanked sentences, and appears, on the whole, unnecessarily clumsy from the linguistic point of view; elsewhere you don't change the original word order in the display.
Nevertheless, there are 29 occurrences of non-analyzed "nec" in the treebank on Tündra, cf., e.g., https://weblicht.sfs.uni-tuebingen.de/TundraPerseus/index.zul?tbname=PerseusLatin&tbsent=471 , or search there for [word="nec"]. I think that the Latin treebank is inconsistent here, which is not acceptable if we want to have a gold standard.

And a sincere +1 for opening documentation on tokenizing decisions!

@gcelano
Contributor

gcelano commented Apr 30, 2016

Hi Neven,

You are right, and this is why nec and neque are now kept univerbated (i.e., as single tokens). Have a look at the repository, where there is a new version of the data (2.1). There you will find this problem solved for most texts, even though some of them (specified in the documentation) still need a major revision (which includes resolving this problem).

I will ask for the new data to be made available in Tündra, but it may take some time (the upload does not depend on me).

Best,
Giuseppe


@diyclassics
Author

diyclassics commented Apr 30, 2016

@gcelano—This is helpful to know—I wrote the CLTK tokenizer (with the nec/neque split and reversed order) based on the example from the treebank data. My goal with these tools is to align them with large projects like the Perseus NLP research. I will likely change it back now, perhaps adding a flag to keep the old behavior if the user wants it.
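Roughly what I have in mind for the flag, as a sketch (split_nec is a provisional name, not the current CLTK API):

def word_tokenize(text, split_nec=False):
    """Tokenize on whitespace; optionally split 'nec' -> ['ne', '-c']."""
    tokens = []
    for token in text.split():
        if split_nec and token.lower() == "nec":
            tokens.extend([token[:2], "-" + token[2:]])  # 'nec' -> ['ne', '-c']
        else:
            tokens.append(token)
    return tokens

print(word_tokenize("nec tamen"))                  # ['nec', 'tamen']
print(word_tokenize("nec tamen", split_nec=True))  # ['ne', '-c', 'tamen']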

Any thoughts on "-cum" compounds? Thanks!

@nevenjovanovic

nevenjovanovic commented Apr 30, 2016

@gcelano -- thanks for confirming my suspicions. I have consulted the documentation at https://github.com/PerseusDL/treebank_data/tree/master/v2.1/Latin , and things are much clearer now. The 2.1 Latin treebanks in the repo (as forked yesterday) still show 57 occurrences of //*[@form='c' and @lemma='que1'] when I load them into an XML database, in the following files (a quick way to reproduce the count is sketched after the list):

  • phi0959.phi006.perseus-lat1.tb.xml
  • phi0972.phi001.perseus-lat1.xml
  • tlg0031.tlg027.perseus-lat1.tb.xml
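The count above can be reproduced with a few lines of Python (a sketch assuming local copies of the three files; lxml is one option among many):

from lxml import etree

files = [
    "phi0959.phi006.perseus-lat1.tb.xml",
    "phi0972.phi001.perseus-lat1.xml",
    "tlg0031.tlg027.perseus-lat1.tb.xml",
]

total = 0
for f in files:
    hits = etree.parse(f).xpath("//*[@form='c' and @lemma='que1']")
    print(f, len(hits))
    total += len(hits)
print("total:", total)  # 57 in the fork described above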

We should think about how to fix this -- if you recommend that nec and neque remain unanalyzed (as, I guess, you are doing with οὐδέ in the Greek treebank, and as seems to be current practice for Latin), I will talk with Filip to organize a correction and a pull request.

@diyclassics -- when, e.g., the Morpheus parser in the Morphology Service analyzes nec (http://services.perseids.org/bsp/morphologyservice/analysis/word?lang=lat&word=nec&engine=morpheuslat), it does not split the word in any way. This is completely fine by me!
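The service can be queried directly, e.g. with the requests library (a sketch; the response format is whatever the service returns and is not shown here):

import requests

# Ask the Perseids morphology service to analyze the unsplit form 'nec'.
url = "http://services.perseids.org/bsp/morphologyservice/analysis/word"
params = {"lang": "lat", "word": "nec", "engine": "morpheuslat"}

resp = requests.get(url, params=params, timeout=30)
resp.raise_for_status()
print(resp.text)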

@gcelano
Contributor

gcelano commented May 2, 2016

Hi @diyclassics,

I would suggest keeping tecum (and similar forms) separated; this is much better. We need to correct the tokenizer at Perseus in this respect.

@balmas
Contributor

balmas commented May 2, 2016

See perseids-project/llt-tokenizer#1 for tracking of the requested tokenizer changes.

kylepjohnson pushed commits to cltk/cltk that referenced this issue (Jun 1, Jun 3, and Jun 9, 2016):
* Add feature to load Latin corpora directly using PlaintextCorpusReader

Using the Latin Library corpus as a test case, this feature allows
you to refer to a corpus and load it with PlaintextCorpusReader using
the following syntax:

from cltk.corpus.latin import latinlibrary

The code checks to make sure that a corpus is installed in the
main CLTK_DATA and raises an error if it is not there.

* Added missing comma to list 'cum_inclusions'

* Stop tokenizing 'nec'

This is based on the discussion here: PerseusDL/treebank_data#8. Until this larger issue is resolved, it seems best to leave the form 'nec' as is.

* Added hyphen before tokenized enclitics

Looking at the Perseus NLP data, many of the tokenized enclitics
are distinguished using a hyphen, e.g. "-que". This brings it more
in line with that dataset. It also better distinguishes "ne" from "-ne".

* Add 'neque' to que_exceptions

Like 'nec', 'neque' should no longer be separated as two tokens to
bring the tokenizer more in line with Perseus NLP data (see
previous commit).

* Revert "Stop tokenizing 'nec'"

This reverts commit 1e80159.

* Added hyphen before tokenized enclitics

Looking at the Perseus NLP data, many of the tokenized enclitics
are distinguished using a hyphen, e.g. "-que". This brings it more
in line with that dataset. It also better distinguishes "ne" from "-ne".

* Reversed order of tokenized enclitics

To better follow Perseus NLP practice, enclitics are now tokenized
in the order in which they appear, e.g.

'virumque' > ['virum', '-que']

See PerseusDL/treebank_data#8
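A sketch of the order-preserving split described in this commit (the exception list is abridged and the names are illustrative, not the actual CLTK code):

QUE_EXCEPTIONS = {"neque", "atque", "itaque", "quoque"}  # abridged

def split_que(token):
    """Split a trailing '-que', keeping surface order: 'virumque' -> ['virum', '-que']."""
    if token.lower().endswith("que") and token.lower() not in QUE_EXCEPTIONS:
        return [token[:-3], "-que"]
    return [token]

print(split_que("virumque"))  # ['virum', '-que']
print(split_que("neque"))     # ['neque'] (listed as an exception)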

* Stop tokenizing 'nec'

This is based on the discussion here: PerseusDL/treebank_data#8. Until this larger issue is resolved, it seems best to leave the form 'nec' as is.

* Better handle case for enclitic tokenization

* Updated test_tokenizer.py to reflect recent changes to the Latin tokenizer

* Rewrote "-cum" handling

Tokenization for "-cum" compounds, e.g. mecum, is now done through
regex replacement on the original string rather than by
iterating over and check all of the tokens. More efficient,
easier to read.

Includes a function to maintain case of original after
replacement.
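A sketch of the string-level approach described here (the pattern and the case-restoring replacer are illustrative reconstructions, not the committed code):

import re

CUM_PATTERN = r"\b(me|te|se|nobis|vobis|quo|qua|quibus)cum\b"

def replace_cum(match):
    """Rewrite e.g. 'mecum' -> 'cum me', preserving title-case of the original."""
    split = "cum " + match.group(1)
    return split.capitalize() if match.group().istitle() else split.lower()

def split_cum_compounds(text):
    return re.sub(CUM_PATTERN, replace_cum, text, flags=re.IGNORECASE)

print(split_cum_compounds("Tecum vivere amem"))  # 'Cum te vivere amem'
print(split_cum_compounds("pax vobiscum"))       # 'pax cum vobis'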

* Handle exceptions in Latin tokenizer at string level

Following the logic of 'cum' compound handling, exceptional cases
are now handled through regex replacement at the string level.
This is much more efficient than the multiple list comprehensions
currently in use and much easier to read/maintain. It also now
correctly handles 'similist' and 'qualist'.

* Moved list of Latin exceptions to a separate file

* Make tokenizer split final period; update test.

* Fixed typo in previous commit that would make the Sanskrit tokenizer test fail

* Updated tokenizer to use local function instead of NLTK's word_tokenize

* Updated tokenizer to use local function instead of NLTK's word_tokenize

* Updates to Latin tokenizer

A few changes:
- Most significant: Special handling for Latin moved into its own
    function. Makes the general tokenizer code much easier to read
    and makes an effort to avoid the clutter that will arise from
    separate exceptions for each language.
- Latin tokenizer now splits on sentences before splitting on words.
    This allows:
        - Better handling of the '-ne' enclitic, which can now be tested
            only in sentence-initial position.
        - Custom handling of Latin abbreviations. The test cases
            included here are the praenomina; e.g., sentences will no
            longer incorrectly split on the name "Cn."
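A sketch of the sentence-first approach with abbreviation handling, using NLTK's Punkt (the praenomina set is abridged and illustrative):

from nltk.tokenize.punkt import PunktParameters, PunktSentenceTokenizer

# Tell the sentence splitter that praenomina abbreviations do not end sentences.
params = PunktParameters()
params.abbrev_types = {"a", "c", "cn", "d", "l", "m", "p", "q", "sex", "t", "ti"}

sent_tokenizer = PunktSentenceTokenizer(params)

text = "Cn. Pompeius Romam venit. Deinde abiit."
for sent in sent_tokenizer.tokenize(text):
    print(sent)
# -> 'Cn. Pompeius Romam venit.'
# -> 'Deinde abiit.'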