[Feature request] Train a classifier to better classify languages #21

Open
Muhtasham opened this issue Oct 5, 2021 · 2 comments
Labels
enhancement New feature or request good first issue Good for newcomers help wanted Extra attention is needed

Comments

@Muhtasham

Is your feature request related to a problem? Please describe.
OSCAR is limited by the fastText language classifier, which was trained on Wikipedia, so the per-language datasets also contain sentences in other languages. For instance, the Tajik file (tg.txt) contains large chunks of Uzbek sentences in Cyrillic script.

Describe the solution you'd like
Train new models using data other than Wikipedia, for instance text taken from randomly chosen language-specific websites, language-specific news websites, and text collected via the CURL portal (https://curl.corpora.uni-leipzig.de).
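A minimal sketch of how such data could be prepared, assuming the usual fastText supervised format (one example per line, prefixed with `__label__<code>`). The helper names `to_fasttext_line` and `write_training_file` and the file name `train.txt` are hypothetical; the training call is shown only as a comment because it needs the `fasttext` package and a real corpus.

```python
LABEL_PREFIX = "__label__"


def to_fasttext_line(lang, sentence):
    """Format one example as a fastText supervised training line."""
    # fastText reads one example per line, so collapse newlines and
    # repeated whitespace into single spaces.
    text = " ".join(sentence.split())
    return f"{LABEL_PREFIX}{lang} {text}"


def write_training_file(examples, path):
    """Write (lang, sentence) pairs to `path`, skipping empty sentences.

    Returns the number of lines written.
    """
    written = 0
    with open(path, "w", encoding="utf-8") as f:
        for lang, sentence in examples:
            if sentence.strip():
                f.write(to_fasttext_line(lang, sentence) + "\n")
                written += 1
    return written


# Hypothetical usage (requires the `fasttext` package and real data):
#   import fasttext
#   write_training_file(examples, "train.txt")
#   model = fasttext.train_supervised(input="train.txt")
```

The quality of the resulting model depends almost entirely on how clean the labelled sentences are, which is why the noise in the source corpora matters.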

Describe alternatives you've considered
Leipzig Corpora, but it also contains some "noise" that needs to be cleaned before it can be used for reliable language detection.

Additional context
For example:
File: tg.txt Line: 660247: Маҳаллий расмийлар агар у пахта теримига чиқмаса, набираси учун тўланадиган нафақадан маҳрум этилиши билан таҳдид қилишгани.

If you do a simple check using fastText:

import fasttext
model = fasttext.load_model('lid.176.ftz')
print(model.predict('Маҳаллий расмийлар агар у пахта теримига чиқмаса, набираси учун тўланадиган нафақадан маҳрум этилиши билан таҳдид қилишгани.', k=2))
The output will be:

#(('__label__tg', '__label__bg'), array([0.38605371, 0.14384778]))

This indicates that the sentence is Tajik, but in fact it is not: it is Uzbek written in Cyrillic script.
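One cheap way to surface lines like this, sketched under assumptions: treat a prediction as unreliable when the top probability is low or close to the runner-up. The function name `is_uncertain` and both thresholds are hypothetical and would need tuning; the numbers in the example are the ones from the output above.

```python
def is_uncertain(probs, min_top=0.5, min_margin=0.3):
    """Return True if a ranked list of probabilities looks unreliable.

    probs: probabilities sorted from most to least likely, as returned by
    fastText's predict with k >= 2. The thresholds are illustrative only.
    """
    top = float(probs[0])
    runner_up = float(probs[1]) if len(probs) > 1 else 0.0
    return top < min_top or (top - runner_up) < min_margin


# Values from the lid.176.ftz output above: top label "tg" at about 0.386.
print(is_uncertain([0.38605371, 0.14384778]))  # True, because 0.386 < 0.5
```

A filter like this only catches low-confidence mistakes; it does nothing for a sentence that the model labels wrongly with high confidence.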

@Muhtasham Muhtasham added the enhancement New feature or request label Oct 5, 2021
@pjox pjox added good first issue Good for newcomers help wanted Extra attention is needed labels Oct 5, 2021
@chris-ha458

Training a whole new classifier might be difficult.
It is much like training a whole new language model, with all the difficulty that implies.
Thankfully, since the release of the original fastText model (lid.176.bin) that ungoliant uses, two strong new models have been released.

The first is from Facebook's NLLB (No Language Left Behind) team.
This model handles 218 languages and is about 1 GiB in size.
The second is from "An Open Dataset and Model for Language Identification" (Burchell et al., ACL 2023).
It handles 201 languages and is also around 1 GiB in size.
Both are available under different open-source licenses.

Since they are large, I have quantized each of them into a 150 MiB .ftz model (they are drop-in compatible):
https://huggingface.co/hac541309/fasttext_langID_models
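For reference, a sketch of using one of these from Python in the same way as `lid.176.ftz`. It assumes the model file has already been downloaded locally; the local path `lid218e.ftz` and the helper name `top_k` are assumptions, and the helper accepts any object with a fastText-style `predict(text, k=...)` method.

```python
LABEL_PREFIX = "__label__"


def top_k(model, text, k=5):
    """Return [(label, probability), ...] for the k most likely languages."""
    # fastText's predict handles one line at a time, so drop newlines first.
    labels, probs = model.predict(text.replace("\n", " "), k=k)
    return [
        (label[len(LABEL_PREFIX):] if label.startswith(LABEL_PREFIX) else label,
         float(prob))
        for label, prob in zip(labels, probs)
    ]


# Usage (needs the `fasttext` package and a downloaded model file):
#   import fasttext
#   model = fasttext.load_model("lid218e.ftz")
#   print(top_k(model, "Some sentence to identify", k=5))
```

Note that the label sets differ between models (for example `tg` versus `tgk_Cyrl`), so any downstream code that matches on labels would need a mapping.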

@chris-ha458

Unfortunately, they might not be sufficient to handle this issue, though. Both still label the example sentence as Tajik:
fasttext predict-prob lid201-model.ftz - 5
Маҳаллий расмийлар агар у пахта теримига чиқмаса, набираси учун тўланадиган нафақадан маҳрум этилиши билан таҳдид қилишгани
__label__tgk_Cyrl 0.811877 __label__tat_Cyrl 0.116819 __label__kir_Cyrl 0.0491767 __label__rus_Cyrl 0.00656079 __label__khk_Cyrl 0.00462278

fasttext predict-prob lid218e.ftz - 5
Маҳаллий расмийлар агар у пахта теримига чиқмаса, набираси учун тўланадиган нафақадан маҳрум этилиши билан таҳдид қилишгани
__label__tgk_Cyrl 0.995232 __label__prs_Arab 0.0018278 __label__kir_Cyrl 0.00124926 __label__tat_Cyrl 0.000336861 __label__bjn_Latn 0.000244104
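To compare the two runs side by side, the label/probability pairs can be parsed out of the `predict-prob` output. This is a small sketch: the function names are hypothetical, and the set of codes checked (`uzb`, `uzn`, `uzs`, ISO 639-3 codes for Uzbek and its varieties) is an assumption about how these models would label Uzbek.

```python
LABEL_PREFIX = "__label__"
UZBEK_CODES = ("uzb", "uzn", "uzs")  # assumption: ISO 639-3 codes for Uzbek


def parse_predictions(line):
    """Parse '__label__xxx p __label__yyy q ...' into [(label, prob), ...]."""
    tokens = line.split()
    pairs = []
    for label, prob in zip(tokens[0::2], tokens[1::2]):
        if not label.startswith(LABEL_PREFIX):
            raise ValueError(f"unexpected token: {label!r}")
        pairs.append((label[len(LABEL_PREFIX):], float(prob)))
    return pairs


def mentions_uzbek(pairs):
    """True if any predicted label has an Uzbek language code."""
    return any(label.split("_")[0] in UZBEK_CODES for label, _ in pairs)


# Output lines copied from the two runs above.
lid201 = ("__label__tgk_Cyrl 0.811877 __label__tat_Cyrl 0.116819 "
          "__label__kir_Cyrl 0.0491767 __label__rus_Cyrl 0.00656079 "
          "__label__khk_Cyrl 0.00462278")
lid218 = ("__label__tgk_Cyrl 0.995232 __label__prs_Arab 0.0018278 "
          "__label__kir_Cyrl 0.00124926 __label__tat_Cyrl 0.000336861 "
          "__label__bjn_Latn 0.000244104")

for name, line in (("lid201", lid201), ("lid218e", lid218)):
    pairs = parse_predictions(line)
    print(name, "top:", pairs[0], "Uzbek in top 5:", mentions_uzbek(pairs))
```

For both models, no Uzbek label appears anywhere in the top five for this sentence, and the larger model is the more confident of the two in the Tajik label.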
