
Low Model Accuracy After Extended Training with Unicode-Based Language #852

Open
6 tasks done
cod3r0k opened this issue Jan 24, 2025 · 2 comments
Labels
bug Something isn't working

Comments


cod3r0k commented Jan 24, 2025

Self Checks

  • This template is only for bug reports. For questions, please visit Discussions.
  • I have thoroughly reviewed the project documentation (installation, training, inference) but couldn't find information to solve my problem.
  • I have searched for existing issues, including closed ones. Search issues
  • I confirm that I am using English to submit this report (I have read and agree to the Language Policy).
  • [FOR CHINESE USERS] Please submit issues in English; otherwise they will be closed. Thank you! :)
  • Please do not modify this template and fill in all required fields.

Cloud or Self Hosted

Self Hosted (Source)

Environment Details

Ubuntu

Steps to Reproduce

I have been training the model for over 41 hours, but accuracy remains very low: the training loss is 7.160 and the validation loss is 6.950, with top-5 accuracy stuck at approximately 44% for training and 46% for validation. My dataset is in a language with a complex Unicode script (e.g., Arabic), which poses specific challenges for tokenization and script handling.

Possible Issues:

Tokenization: The model may be struggling with tokenizing characters and subword units specific to the Unicode language. Standard tokenizers may not handle the complexities of these languages, such as right-to-left text, character variations, and complex scripts, leading to poor performance.
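As a quick sanity check (a stdlib-only sketch, not part of the project's pipeline), you can measure how much a complex script expands under byte-level tokenization — if the tokenizer falls back to bytes, Arabic text yields roughly twice as many units per character as ASCII, so the same context window covers far less text:

```python
# Stdlib-only sketch: complex scripts expand under byte-level tokenization.
ar = "السلام عليكم"  # Arabic sample (right-to-left script)
en = "hello world"   # ASCII sample of similar length

for label, s in (("arabic", ar), ("english", en)):
    print(label, "chars:", len(s), "utf-8 bytes:", len(s.encode("utf-8")))
```

If the byte count greatly exceeds the character count for your data, an unadapted vocabulary is a plausible cause of the flat loss.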

Data Preprocessing: There may be issues in preprocessing the text (e.g., normalization, diacritics handling) for languages with distinct character structures and additional linguistic complexities.
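One concrete preprocessing step worth verifying is Unicode normalization. The sketch below uses only Python's stdlib `unicodedata`; whether stripping diacritics (harakat) is appropriate depends on your targets, since they carry pronunciation information — NFKC folding alone is usually the safe part:

```python
import unicodedata

def normalize_arabic(text: str) -> str:
    """NFKC-fold presentation forms, then drop combining marks (category Mn).

    Stripping diacritics loses pronunciation detail, so only do this if
    your pipeline cannot model them; NFKC normalization alone is usually safe.
    """
    text = unicodedata.normalize("NFKC", text)
    return "".join(ch for ch in text if unicodedata.category(ch) != "Mn")

print(normalize_arabic("مُحَمَّد"))  # base letters only, harakat removed
```

Running normalization consistently at both training and inference time matters more than which exact form you pick.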

Model Overfitting/Underfitting: The model might be underfitting due to a simple architecture or insufficient training epochs. The low accuracy suggests that the model might not be able to fully capture the intricacies of the language.

Hyperparameter Issues: A suboptimal learning rate or other hyperparameters could be hindering the model’s ability to converge.
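On the learning-rate point, one common recipe is linear warmup followed by cosine decay. The sketch below is pure Python with placeholder values (base LR, warmup steps) that are illustrative only, not the project's defaults — sweep them against your own loss curves:

```python
import math

def lr_at(step: int, total: int = 10_000, warmup: int = 500,
          base: float = 2e-4, floor: float = 2e-5) -> float:
    # Linear warmup, then cosine decay from `base` down to `floor`.
    # All values here are illustrative placeholders.
    if step < warmup:
        return base * step / warmup
    t = (step - warmup) / max(1, total - warmup)
    return floor + 0.5 * (base - floor) * (1 + math.cos(math.pi * t))
```

A loss that plateaus near its starting value, as reported above, is more often a data or tokenization problem than a schedule problem, but ruling out an LR that is orders of magnitude off is cheap.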

Request for Guidance:
Could you provide guidance on best practices for handling Unicode-based languages with the current architecture? Specifically:

Recommendations for improving tokenization for languages with complex scripts.
Suggestions for model adjustments or more suitable architectures.
Potential preprocessing steps for better handling of Unicode text.
Any advice on hyperparameter tuning (learning rate, optimizer, etc.) for better performance on such languages.
Additional Information:
If there are any known issues with training on complex Unicode languages or specific steps that I might have overlooked, any help would be greatly appreciated.

✔️ Expected Behavior

Training takes a very long time and accuracy improves only slowly:

42:01:26<00:00, 0.20it/s, v_num=0, train/loss=6.500, train/top_5_accuracy=0.461, val/loss=6.950, val/top_5_accuracy=0.459

❌ Actual Behavior

No response

@cod3r0k cod3r0k added the bug Something isn't working label Jan 24, 2025
@abhisirka2001

Hey, are you doing LoRA fine-tuning or full fine-tuning?


cod3r0k commented Jan 24, 2025

Hi @abhisirka2001, I am just following the documentation, which describes LoRA fine-tuning (https://speech.fish.audio/finetune/). What would you suggest I do?
