[Feature request] Train a classifier to better classify languages #21
Labels: enhancement, good first issue, help wanted
Is your feature request related to a problem? Please describe.
OSCAR relies on the fastText language classifier, which was trained on Wikipedia, so the per-language datasets also contain sentences in other languages. For instance, the Tajik file (tg.txt) contains large chunks of Uzbek sentences in Cyrillic script.
Describe the solution you'd like
Train new models on data other than Wikipedia, for instance text taken from randomly chosen language-specific websites, language-specific news websites, and text material collected via the CURL portal (https://curl.corpora.uni-leipzig.de).
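A minimal sketch of how such training data could be prepared, assuming the standard fastText supervised format (one `__label__<code> <sentence>` entry per line). The helper names `to_fasttext_lines` and `train_language_id`, the training file path, and the hyperparameters are illustrative assumptions, not part of the OSCAR pipeline.

```python
from typing import Iterable, List, Tuple


def to_fasttext_lines(samples: Iterable[Tuple[str, str]]) -> List[str]:
    """Turn (language_code, sentence) pairs into fastText supervised lines."""
    lines = []
    for lang, sentence in samples:
        # fastText reads one example per line, so collapse newlines and
        # repeated whitespace inside the sentence.
        text = " ".join(sentence.split())
        if text:  # skip empty sentences
            lines.append(f"__label__{lang} {text}")
    return lines


def train_language_id(train_path: str):
    """Train a classifier on a prepared file (hypothetical settings)."""
    import fasttext  # imported lazily so the data preparation works without it

    # Character n-grams (minn/maxn) let the model use sub-word spelling cues.
    # These values are assumptions and would need tuning.
    return fasttext.train_supervised(
        input=train_path, minn=2, maxn=4, dim=16, loss="hs"
    )


# Example using the sentence quoted in this issue, labelled as Uzbek.
samples = [("uz", "Маҳаллий расмийлар агар у пахта теримига чиқмаса")]
prepared = to_fasttext_lines(samples)
```

Writing `prepared` to a text file, one entry per line, would give an input that `train_language_id` could consume.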
Describe alternatives you've considered
The Leipzig Corpora, but they also contain some "noise" that would need to be cleaned for efficient language detection.
Additional context
For example:
File: tg.txt Line: 660247: Маҳаллий расмийлар агар у пахта теримига чиқмаса, набираси учун тўланадиган нафақадан маҳрум этилиши билан таҳдид қилишгани.
A simple check with fastText:
import fasttext
model = fasttext.load_model('lid.176.ftz')
print(model.predict('Маҳаллий расмийлар агар у пахта теримига чиқмаса, набираси учун тўланадиган нафақадан маҳрум этилиши билан таҳдид қилишгани.', k=2))
The output is:
#(('__label__tg', '__label__bg'), array([0.38605371, 0.14384778]))
This labels the sentence as Tajik (with a confidence of only about 0.39), but in fact it is not Tajik.
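The same check could be run over a whole file to find lines that deserve a second look. The sketch below is a hedged illustration: `flag_suspect_lines` is a hypothetical helper, the 0.5 threshold is an arbitrary assumption, and `predict` is any callable that returns `(labels, probabilities)` for a single line, as `model.predict` does.

```python
from typing import Callable, Iterable, List, Sequence, Tuple

Prediction = Tuple[Sequence[str], Sequence[float]]


def flag_suspect_lines(
    lines: Iterable[str],
    predict: Callable[[str], Prediction],
    expected_label: str,
    min_confidence: float = 0.5,  # assumed threshold, tune per language
) -> List[Tuple[int, str, float]]:
    """Return (line_number, top_label, confidence) for doubtful lines."""
    suspects = []
    for number, line in enumerate(lines, start=1):
        text = line.strip()  # the predictor expects a single line without "\n"
        if not text:
            continue
        labels, probs = predict(text)
        top_label, confidence = labels[0], float(probs[0])
        # Flag lines with an unexpected label or a weak top prediction.
        if top_label != expected_label or confidence < min_confidence:
            suspects.append((number, top_label, confidence))
    return suspects
```

With the model loaded as above, a call could look like `flag_suspect_lines(open('tg.txt', encoding='utf-8'), lambda t: model.predict(t, k=1), '__label__tg')`. The quoted sentence would be flagged because its top score is below the threshold, even though the predicted label matches the file.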