Update tokenizer.py #60

Closed
wants to merge 4 commits

Conversation

@ghost ghost commented Jul 18, 2024

Added proper tokenizer support for the Hindi language, which prevents a crash while fine-tuning on Hindi.
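
(For context, a change like this usually amounts to registering the new language code in the tokenizer's per-language tables and giving it a cleaning path. The sketch below is illustrative only and is not the diff in this PR; the table name, function, and limit value are assumptions.)

```python
# Illustrative sketch, not the actual patch: assumes the XTTS tokenizer keeps a
# per-language character-limit table and dispatches text cleaning by language code.
char_limits = {
    "en": 250,
    "hi": 250,  # hypothetical new entry so Hindi input is length-checked instead of failing
}

def preprocess_text(text: str, lang: str) -> str:
    """Minimal stand-in for a per-language cleaning step."""
    if lang not in char_limits:
        raise KeyError(f"Language '{lang}' is not supported by the tokenizer")
    # Devanagari has no case, so no lower-casing or Latin abbreviation expansion
    # is needed here; just normalise whitespace.
    return " ".join(text.split())

print(preprocess_text("नमस्ते दुनिया", "hi"))
```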
@eginhard (Member) left a comment

Thanks a lot for the PR, this is great!

It includes some changes that are not related to Hindi: there are some small differences in the code between this fork and the original repo, and it looks like you started from the original one. Could you revert these unrelated changes?

Review thread on TTS/tts/layers/xtts/tokenizer.py (outdated, resolved)
FIX: Tokenizer for Hindi Language
@ghost (Author) left a comment

Changes fixed, please check.

@ghost ghost requested a review from eginhard July 18, 2024 08:47
@manash997

Hi @akshatrocky, I am getting this error while trying to run the tokenizer script from your branch:
tokenizer.py", line 806, in test_expand_numbers_multilingual
assert out == b, f"'{out}' vs '{b}'"
AssertionError: 'Через двенадцать целых пять десятых секунды.' vs 'Через двенадцать запятая пять секунды.'
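
(For reference, this failure appears to be in the Russian case of the file's number-expansion self-check rather than in the Hindi change: the two strings are different ways of reading the Russian decimal 12.5 aloud, "twelve point five tenths" vs "twelve comma five". Below is a hypothetical reproduction of the check pattern; the Russian input sentence is an assumption, since only the two expanded strings appear in the error message.)

```python
# Hypothetical reproduction of the failing check, based only on the traceback above.
# expand_numbers_multilingual is assumed to turn digits into words for the given
# language; the input sentence below is a guess at the test's source text.
from TTS.tts.layers.xtts.tokenizer import expand_numbers_multilingual

reference = "Через двенадцать запятая пять секунды."
out = expand_numbers_multilingual("Через 12.5 секунды.", lang="ru")
assert out == reference, f"'{out}' vs '{reference}'"
```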

@ghost (Author) commented Jul 22, 2024

Which language are you trying to generate, @manash997?

(Edit) How are you running the tokenizer script? From fine-tuning XTTS?

@manash997

> Which language are you trying to generate, @manash997?
>
> (Edit) How are you running the tokenizer script? From fine-tuning XTTS?

I just tried running it as a standalone Python script. The changes for Hindi look good to me; however, I wanted to test the entire script once.

@ghost (Author) commented Jul 22, 2024

Running this script standalone, even without my changes, also produces an error. The main use of tokenizer.py is for fine-tuning XTTS: when we fine-tune any language, the program loads the tokenizer that the model is fine-tuned with. As far as I know, this script is only used during fine-tuning.
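
(To make that concrete, here is a hedged sketch of how the tokenizer is typically consumed during fine-tuning rather than run as a script; the vocab path is a placeholder and the exact constructor/encode signatures are assumptions.)

```python
# Hypothetical usage sketch: during XTTS fine-tuning the trainer builds a tokenizer
# from a vocab file and encodes training text per language; importing the module
# does not execute its __main__ self-checks. Path and signatures are assumptions.
from TTS.tts.layers.xtts.tokenizer import VoiceBpeTokenizer

tokenizer = VoiceBpeTokenizer(vocab_file="path/to/vocab.json")  # placeholder path
token_ids = tokenizer.encode("नमस्ते दुनिया", lang="hi")
print(token_ids)
```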

@ghost ghost closed this Jul 25, 2024
@eginhard (Member)

Yes, the checks in this file are not actually run during any tests and were probably already broken for a while. Otherwise everything looks good. The author deleted their account, so I had to recreate the PR in #64.
