[Feature request] Train a classifier to better classify languages #21

Muhtasham · 2021-10-05T13:59:01Z

Is your feature request related to a problem? Please describe.
OSCAR is limited by the fasttext language classifier, which was trained on Wikipedia, so the per-language datasets also contain sentences in other languages. For instance, the Tajik file (tg.txt) contains large chunks of Uzbek sentences in Cyrillic script.

Describe the solution you'd like
Train new models on data other than Wikipedia, for instance text taken from randomly chosen language-specific websites, language-specific news websites, and text collected via the CURL portal (https://curl.corpora.uni-leipzig.de).

Describe alternatives you've considered
Leipzig Corpora, but they also contain some "noise" that needs to be cleaned for efficient language detection.

Additional context
For example:
File: tg.txt Line: 660247: Маҳаллий расмийлар агар у пахта теримига чиқмаса, набираси учун тўланадиган нафақадан маҳрум этилиши билан таҳдид қилишгани.

If you do a simple check using fasttext:

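import fasttext

# Load the compressed pretrained fasttext language identification model.
model = fasttext.load_model('lid.176.ftz')
# Ask for the two most likely labels for the sentence from tg.txt.
print(model.predict('Маҳаллий расмийлар агар у пахта теримига чиқмаса, набираси учун тўланадиган нафақадан маҳрум этилиши билан таҳдид қилишгани.', k=2))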
The output will be:

#(('__label__tg', '__label__bg'), array([0.38605371, 0.14384778]))

This indicates that the sentence is Tajik, but in fact it is Uzbek written in Cyrillic script.
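
The same check could be run over a whole file to surface suspicious lines. The sketch below is only an illustration: `flag_suspect_lines`, the `min_prob` threshold of 0.5, and the `predict` callable are assumptions, not part of OSCAR or fasttext. `predict` is expected to behave like `model.predict(text, k=1)`, returning a tuple of labels and a sequence of probabilities.

```python
from typing import Callable, Iterable, Iterator, Sequence, Tuple

# Shape of fasttext's single-string predict result: (labels, probabilities).
Prediction = Tuple[Sequence[str], Sequence[float]]


def flag_suspect_lines(
    lines: Iterable[str],
    predict: Callable[[str], Prediction],
    expected_label: str = "__label__tg",
    min_prob: float = 0.5,  # hypothetical threshold, tune on real data
) -> Iterator[Tuple[int, str, float]]:
    """Yield (line_number, top_label, top_prob) for lines that look misfiled."""
    for number, line in enumerate(lines, start=1):
        text = line.strip()
        if not text:
            continue  # skip blank lines, there is nothing to classify
        labels, probs = predict(text)
        top_label, top_prob = labels[0], float(probs[0])
        # Flag lines with a different top label or a weak top probability.
        if top_label != expected_label or top_prob < min_prob:
            yield number, top_label, top_prob
```

With the real model this could be called as `flag_suspect_lines(open('tg.txt', encoding='utf-8'), lambda t: model.predict(t, k=1))`. The example sentence would be flagged because its top probability (0.386) is below the assumed threshold, even though its top label is `__label__tg`.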

chris-ha458 · 2023-07-18T14:06:58Z

Training a whole new classifier might be difficult.
It is much like training a whole new language model, with all the difficulty that implies.
Thankfully, since the release of the original fasttext model (lid.176.bin) that ungoliant uses, two strong new models have been released.

The first is from Facebook's NLLB (No Language Left Behind) team.
This model handles 218 languages and is about 1GiB in size.
The second is from "An Open Dataset and Model for Language Identification" (Burchell et al., ACL 2023).
It handles 201 languages and is also around 1GiB in size.
Both are available under different open-source licenses.

Since they are large, I have quantized each of them into a 150MiB ftz model (they are drop-in compatible).
https://huggingface.co/hac541309/fasttext_langID_models

chris-ha458 · 2023-07-18T14:09:57Z

Unfortunately, they might not be sufficient to handle this issue, though.

fasttext predict-prob lid201-model.ftz - 5
Маҳаллий расмийлар агар у пахта теримига чиқмаса, набираси учун тўланадиган нафақадан маҳрум этилиши билан таҳдид қилишгани
__label__tgk_Cyrl 0.811877 __label__tat_Cyrl 0.116819 __label__kir_Cyrl 0.0491767 __label__rus_Cyrl 0.00656079 __label__khk_Cyrl 0.00462278

fasttext predict-prob lid218e.ftz - 5
Маҳаллий расмийлар агар у пахта теримига чиқмаса, набираси учун тўланадиган нафақадан маҳрум этилиши билан таҳдид қилишгани
__label__tgk_Cyrl 0.995232 __label__prs_Arab 0.0018278 __label__kir_Cyrl 0.00124926 __label__tat_Cyrl 0.000336861 __label__bjn_Latn 0.000244104
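
To compare the two runs programmatically, the `predict-prob` output can be split into label and probability pairs. This is a small sketch: `parse_predict_prob` is a hypothetical helper, and it assumes labels follow the `__label__<lang>_<Script>` pattern seen in the output above. Only the top three candidates of each run are copied in, to keep it short.

```python
from typing import List, Tuple

PREFIX = "__label__"


def parse_predict_prob(line: str) -> List[Tuple[str, str, float]]:
    """Split 'label prob label prob ...' into (lang, script, prob) triples."""
    tokens = line.split()
    triples = []
    # Tokens alternate between a label and its probability.
    for label, prob in zip(tokens[0::2], tokens[1::2]):
        code = label[len(PREFIX):] if label.startswith(PREFIX) else label
        lang, _, script = code.partition("_")
        triples.append((lang, script, float(prob)))
    return triples


lid201 = parse_predict_prob(
    "__label__tgk_Cyrl 0.811877 __label__tat_Cyrl 0.116819 "
    "__label__kir_Cyrl 0.0491767"
)
lid218 = parse_predict_prob(
    "__label__tgk_Cyrl 0.995232 __label__prs_Arab 0.0018278 "
    "__label__kir_Cyrl 0.00124926"
)
# Print the top (lang, script) pair from each model for comparison.
print(lid201[0][:2], lid218[0][:2])
```

Both models put Tajik (`tgk_Cyrl`) first, and no Uzbek label appears among the candidates shown, which is why they do not resolve the issue for this sentence.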

Muhtasham added the enhancement (New feature or request) label Oct 5, 2021

Muhtasham assigned pjox Oct 5, 2021

pjox added the good first issue (Good for newcomers) and help wanted (Extra attention is needed) labels Oct 5, 2021

pjox assigned Uinelj Oct 5, 2021

chris-ha458 mentioned this issue Jul 13, 2023

[Feature request] Document how to set fasttext model #106

Open
