Update tokenizer.py #60
Conversation
Added proper tokenizer support for the Hindi language, which prevents a crash while fine-tuning on Hindi.
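For readers skimming the PR, here is a minimal, self-contained sketch (not the actual diff in this PR) of the kind of change involved, assuming the tokenizer follows the upstream XTTS layout with a per-language character-limit table and a language check inside the text-preprocessing step. All names and the Hindi character limit below are illustrative.

```python
import warnings

# Illustrative sketch: allow "hi" so an XTTS-style tokenizer stops raising
# for Hindi during fine-tuning. Names and limits are assumptions, not the
# actual code in this fork.
SUPPORTED_LANGS = {"en", "es", "fr", "de", "it", "pt", "pl", "tr", "ru",
                   "nl", "cs", "ar", "zh-cn", "hu", "ko", "ja", "hi"}  # "hi" added

CHAR_LIMITS = {"en": 250, "hi": 150}  # illustrative per-language limits


def check_input_length(text: str, lang: str) -> None:
    """Warn when the input exceeds the per-language character limit."""
    limit = CHAR_LIMITS.get(lang, 250)
    if len(text) > limit:
        warnings.warn(
            f"Text length exceeds the {limit}-character limit for '{lang}'; "
            "output quality may degrade."
        )


def preprocess_text(text: str, lang: str) -> str:
    """Fail early for unsupported languages instead of crashing mid-training."""
    if lang not in SUPPORTED_LANGS:
        raise NotImplementedError(f"Language '{lang}' is not supported.")
    check_input_length(text, lang)
    # Devanagari has no letter case, so lowercasing is a no-op for Hindi, but
    # keeping the shared cleaning path (whitespace collapsing) lets Hindi reuse
    # the multilingual pipeline.
    return " ".join(text.lower().split())


if __name__ == "__main__":
    print(preprocess_text("नमस्ते दुनिया", "hi"))
```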
Thanks a lot for the PR, this is great!
It includes some changes that are not related to Hindi: there are small differences in the code between this fork and the original repo, and it looks like you started from the original one. Could you revert these unrelated changes?
FIX: Tokenizer for Hindi Language
Changes fixed, please check.
Hi @akshatrocky, I am getting this error while trying to run the tokenizer script from your branch:
Which language are you trying to generate, @manash997? (Edit) How are you running the tokenizer script: from fine-tuning XTTS?
I just tried running it as a standalone Python script. The changes for Hindi look good to me; however, I wanted to test the entire script once.
Running this script standalone, even without the changes I made, also produces an error. The main use of tokenizer.py is for fine-tuning XTTS: when we fine-tune any language, the training program loads the tokenizer from this file. As far as I know, this script is only used during fine-tuning.
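For context, a minimal sketch of how tokenizer.py is normally exercised through the fine-tuning path, assuming the upstream coqui TTS package layout and the VoiceBpeTokenizer interface; the vocab path below is a placeholder.

```python
# Hedged sketch, assuming the upstream coqui TTS layout: during XTTS fine-tuning
# the trainer loads VoiceBpeTokenizer from tokenizer.py and calls encode() with
# the target language code.
from TTS.tts.layers.xtts.tokenizer import VoiceBpeTokenizer

tokenizer = VoiceBpeTokenizer(vocab_file="path/to/vocab.json")  # placeholder path
token_ids = tokenizer.encode("नमस्ते, यह एक परीक्षण वाक्य है।", lang="hi")
print(token_ids)
```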
Yes, the checks in this file are not actually run during any tests and were probably already broken for a while. Otherwise everything looks good. The author deleted their account, so I had to recreate the PR in #64.