♻️ Reproduce the work of ULMFiT in Vaaku2Vec #3

Open
kurianbenoy opened this issue May 29, 2022 · 0 comments
Labels: good first issue (Good for newcomers), help wanted (Extra attention is needed)

Comments

@kurianbenoy (Member)
Vaaku2Vec claims to provide state-of-the-art language modeling and text classification for the Malayalam language.

ℹ️ We trained a Malayalam language model on the Wikipedia article dump from October 2018. The Wikipedia dump had 55k+ articles. The main difficulty in training a Malayalam language model is text tokenization, since Malayalam is a highly inflectional and agglutinative language. In the current model, we are using the nltk tokenizer (we will try better alternatives in the future) and the vocab size is 30k. The language model was used to train a classifier which classifies a news article into 5 categories (India, Kerala, Sports, Business, Entertainment). Our classifier achieved a whopping 92% accuracy on the classification task.
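For context, the tokenization and vocabulary step described above might look roughly like the minimal sketch below. This is not the actual Vaaku2Vec preprocessing code; the sample Malayalam sentence is a placeholder and the 30k cut-off simply mirrors the vocab size quoted above.

```python
from collections import Counter

import nltk
from nltk.tokenize import word_tokenize

nltk.download("punkt")  # tokenizer models required by word_tokenize

# Stand-in corpus; the real input is the October 2018 Malayalam Wikipedia dump (55k+ articles).
corpus = ["കേരളം ഇന്ത്യയിലെ ഒരു സംസ്ഥാനമാണ്."]

# nltk's general-purpose word tokenizer, as mentioned in the description above.
tokens = [tok for article in corpus for tok in word_tokenize(article)]

# Keep the 30k most frequent tokens to match the stated vocabulary size.
vocab = [tok for tok, _ in Counter(tokens).most_common(30_000)]
print(len(vocab), vocab[:10])
```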

Note

Since it has been almost three years since the original work, I am assuming a few things have changed, such as the move to fastai version 2, which will make reproduction a bit more difficult. Also, the dataset has not been fully made public by the authors of the work.
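For anyone picking this up, a reproduction attempt on fastai version 2 could start from something like the sketch below. This is only my assumption of how an ULMFiT-style pipeline would look in fastai v2, not the original training code: the CSV file names, the dataframes `wiki_df` and `news_df`, the column names, and the hyperparameters are all placeholders, and the original work was built on fastai v1.

```python
import pandas as pd
from fastai.text.all import *

# Hypothetical inputs: wiki_df has a 'text' column of Wikipedia articles,
# news_df has 'text' and 'category' columns (India, Kerala, Sports, Business, Entertainment).
wiki_df = pd.read_csv("ml_wikipedia.csv")
news_df = pd.read_csv("ml_news.csv")

# 1. Fine-tune an AWD-LSTM language model on the Wikipedia text.
dls_lm = TextDataLoaders.from_df(wiki_df, text_col="text", is_lm=True, valid_pct=0.1)
lm_learn = language_model_learner(dls_lm, AWD_LSTM, drop_mult=0.3,
                                  metrics=[accuracy, Perplexity()])
lm_learn.fit_one_cycle(1, 2e-2)
lm_learn.unfreeze()
lm_learn.fit_one_cycle(10, 2e-3)
lm_learn.save_encoder("ml_wiki_enc")

# 2. Train the 5-way news classifier on top of the fine-tuned encoder.
dls_clas = TextDataLoaders.from_df(news_df, text_col="text", label_col="category",
                                   text_vocab=dls_lm.vocab)
clas_learn = text_classifier_learner(dls_clas, AWD_LSTM, drop_mult=0.5, metrics=accuracy)
clas_learn = clas_learn.load_encoder("ml_wiki_enc")
clas_learn.fit_one_cycle(1, 2e-2)  # ULMFiT proper also applies gradual unfreezing here
```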

kurianbenoy added the good first issue and help wanted labels on May 29, 2022