
Tokenizer lexicons #14

Open · wants to merge 2 commits into master from tokenizer_scripts
Conversation

@HAKSOAT HAKSOAT commented Jun 12, 2020

Created lexicons for cjk and non-cjk texts

@HAKSOAT HAKSOAT changed the title Created lexicons for cjk and non-cjk texts Tokenizer lexicons Jun 12, 2020
@HAKSOAT HAKSOAT force-pushed the tokenizer_scripts branch from 8e0c442 to e9bd355 Compare June 12, 2020 23:19
@@ -97,4 +72,10 @@
("etc", r"."),
]

wikitext_split = RegexTokenizer(LEXICON)
LEXICON_LATIN = LEXICON.copy()
LEXICON_LATIN.insert(-2, ('cjk', cjk))
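A minimal sketch of what this insert position does, assuming `LEXICON` is a list of `(name, pattern)` pairs ending in a catch-all `("etc", r".")` rule (the placeholder patterns here are simplified, not the real lexicon):

```python
# Simplified stand-in for the tokenizer's LEXICON: ordered (name, pattern) pairs.
LEXICON = [
    ("word", r"\w+"),        # placeholder patterns for illustration
    ("whitespace", r"\s+"),
    ("etc", r"."),           # catch-all, must stay last
]

# Hiragana + Katakana blocks and CJK Unified Ideographs, as in the diff.
cjk = r'[\u3040-\u30ff\u4e00-\u9FFF]'

LEXICON_LATIN = LEXICON.copy()
LEXICON_LATIN.insert(-2, ('cjk', cjk))

# insert(-2, ...) places the cjk rule before the last two entries, so it is
# tried before the catch-all but after the earlier Latin-oriented rules.
assert [name for name, _ in LEXICON_LATIN] == ['word', 'cjk', 'whitespace', 'etc']
```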
Owner
Why not insert this right after "word"?

Contributor Author
I can do that. My thought process was: since we won't have lots of CJK in a regular Latin-dominant text, we don't need to handle it before tab_open, tab_close, etc.


word = r'(?:[^\W\d]|[' + combined_word + r'])' + \
cjk_re = r'\u3040-\u30ff' + r'\u4e00-\u9FFF'
Owner
Does this still cover the full range?

Contributor Author
Yes, it does.
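This can be spot-checked directly: `\u3040-\u30ff` spans the Hiragana and Katakana blocks and `\u4e00-\u9FFF` the CJK Unified Ideographs, so the combined character class should match characters from each block (a quick sketch, using the same pattern as the diff):

```python
import re

# Ranges from the diff: Hiragana + Katakana, then CJK Unified Ideographs.
cjk_re = r'\u3040-\u30ff' + r'\u4e00-\u9FFF'
cjk = r'[' + cjk_re + ']'

# ぁ (U+3041, Hiragana), ヶ (U+30F6, Katakana), 漢 (U+6F22, ideograph)
for ch in ('ぁ', 'ヶ', '漢'):
    assert re.fullmatch(cjk, ch) is not None

# Latin characters fall outside the class.
assert re.fullmatch(cjk, 'a') is None
```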


cjk = r'[' + cjk_re + ']'

word = r'(?:[^\W\d' + cjk_re + r']|[' + combined_word + r'])' + \
Owner
Do we need to explicitly exclude CJK here?

Contributor Author
Without doing that, some CJK characters get captured as word.
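This is easy to verify: in Python 3 the `re` module's `\w` is Unicode-aware, so the negated class `[^\W\d]` (word characters minus digits) matches CJK characters too. A short check of both variants of the class:

```python
import re

# Unicode \w includes CJK, so the un-excluded class matches an ideograph.
assert re.fullmatch(r'[^\W\d]', '漢') is not None

# With the CJK ranges excluded, as in the diff, the ideograph no longer
# matches, while a Latin letter still does.
cjk_re = r'\u3040-\u30ff\u4e00-\u9FFF'
assert re.fullmatch(r'[^\W\d' + cjk_re + r']', '漢') is None
assert re.fullmatch(r'[^\W\d' + cjk_re + r']', 'a') is not None
```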

@HAKSOAT HAKSOAT force-pushed the tokenizer_scripts branch 6 times, most recently from f3489da to 1d59026 Compare June 22, 2020 20:02
@HAKSOAT HAKSOAT force-pushed the tokenizer_scripts branch from 1d59026 to df6225b Compare June 22, 2020 20:17