♻️ Reproduce the work of ULMFiT in Vaaku2Vec #3

Open
kurianbenoy opened this issue May 29, 2022 · 0 comments
Labels: good first issue (Good for newcomers), help wanted (Extra attention is needed)

Comments

@kurianbenoy (Member)
Vaaku2Vec claims to provide state-of-the-art language modeling and text classification for the Malayalam language.

ℹ️ We trained a Malayalam language model on the Wikipedia article dump from October 2018. The Wikipedia dump had 55k+ articles. The main difficulty in training a Malayalam language model is text tokenization, since Malayalam is a highly inflectional and agglutinative language. In the current model, we are using the nltk tokenizer (we will try better alternatives in the future) and the vocab size is 30k. The language model was used to train a classifier which classifies a news article into 5 categories (India, Kerala, Sports, Business, Entertainment). Our classifier achieved a whopping 92% accuracy on the classification task.
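For context, the tokenization and vocabulary step described above might look roughly like the minimal sketch below. This is not the actual Vaaku2Vec preprocessing code; the sample Malayalam sentence is a placeholder and the 30k cut-off simply mirrors the vocab size quoted above.

```python
from collections import Counter

import nltk
from nltk.tokenize import word_tokenize

nltk.download("punkt")  # tokenizer models required by word_tokenize

# Stand-in corpus; the real input is the October 2018 Malayalam Wikipedia dump (55k+ articles).
corpus = ["കേരളം ഇന്ത്യയിലെ ഒരു സംസ്ഥാനമാണ്."]

# nltk's general-purpose word tokenizer, as mentioned in the description above.
tokens = [tok for article in corpus for tok in word_tokenize(article)]

# Keep the 30k most frequent tokens to match the stated vocabulary size.
vocab = [tok for tok, _ in Counter(tokens).most_common(30_000)]
print(len(vocab), vocab[:10])
```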

Note

Since it has been almost three years since the original work, I am assuming a few things have changed, such as the move to fastai version 2, which will make reproduction a bit more difficult. Also, the dataset has not been fully made public by the authors of the work.
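For anyone picking this up, a reproduction attempt on fastai version 2 could start from something like the sketch below. This is only my assumption of how an ULMFiT-style pipeline would look in fastai v2, not the original training code: the CSV file names, the dataframes `wiki_df` and `news_df`, the column names, and the hyperparameters are all placeholders, and the original work was built on fastai v1.

```python
import pandas as pd
from fastai.text.all import *

# Hypothetical inputs: wiki_df has a 'text' column of Wikipedia articles,
# news_df has 'text' and 'category' columns (India, Kerala, Sports, Business, Entertainment).
wiki_df = pd.read_csv("ml_wikipedia.csv")
news_df = pd.read_csv("ml_news.csv")

# 1. Fine-tune an AWD-LSTM language model on the Wikipedia text.
dls_lm = TextDataLoaders.from_df(wiki_df, text_col="text", is_lm=True, valid_pct=0.1)
lm_learn = language_model_learner(dls_lm, AWD_LSTM, drop_mult=0.3,
                                  metrics=[accuracy, Perplexity()])
lm_learn.fit_one_cycle(1, 2e-2)
lm_learn.unfreeze()
lm_learn.fit_one_cycle(10, 2e-3)
lm_learn.save_encoder("ml_wiki_enc")

# 2. Train the 5-way news classifier on top of the fine-tuned encoder.
dls_clas = TextDataLoaders.from_df(news_df, text_col="text", label_col="category",
                                   text_vocab=dls_lm.vocab)
clas_learn = text_classifier_learner(dls_clas, AWD_LSTM, drop_mult=0.5, metrics=accuracy)
clas_learn = clas_learn.load_encoder("ml_wiki_enc")
clas_learn.fit_one_cycle(1, 2e-2)  # ULMFiT proper also applies gradual unfreezing here
```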

kurianbenoy added the good first issue and help wanted labels on May 29, 2022