
Low Model Accuracy After Extended Training with Unicode-Based Language #852

Open
6 tasks done
cod3r0k opened this issue Jan 24, 2025 · 2 comments
Labels
bug Something isn't working

Comments


cod3r0k commented Jan 24, 2025

Self Checks

  • This template is only for bug reports. For questions, please visit Discussions.
  • I have thoroughly reviewed the project documentation (installation, training, inference) but couldn't find information to solve my problem.
  • I have searched for existing issues, including closed ones. Search issues
  • I confirm that I am using English to submit this report (I have read and agree to the Language Policy).
  • [FOR CHINESE USERS] Please submit issues in English; otherwise they will be closed. Thank you! :)
  • Please do not modify this template and fill in all required fields.

Cloud or Self Hosted

Self Hosted (Source)

Environment Details

Ubuntu

Steps to Reproduce

I have been training the model for over 41 hours, but accuracy remains very low: the training loss is 7.160 and the validation loss is 6.950, with top-5 accuracy stuck at approximately 44% for training and 46% for validation. My dataset is in a language with a complex Unicode script (e.g., Arabic), which poses specific challenges for tokenization and script handling.

Possible Issues:

Tokenization: The model may be struggling with tokenizing characters and subword units specific to the Unicode language. Standard tokenizers may not handle the complexities of these languages, such as right-to-left text, character variations, and complex scripts, leading to poor performance.
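As a quick sanity check (a stdlib-only sketch, not part of the project's pipeline), you can measure how much a complex script expands under byte-level tokenization — if the tokenizer falls back to bytes, Arabic text yields roughly twice as many units per character as ASCII, so the same context window covers far less text:

```python
# Stdlib-only sketch: complex scripts expand under byte-level tokenization.
ar = "السلام عليكم"  # Arabic sample (right-to-left script)
en = "hello world"   # ASCII sample of similar length

for label, s in (("arabic", ar), ("english", en)):
    print(label, "chars:", len(s), "utf-8 bytes:", len(s.encode("utf-8")))
```

If the byte count greatly exceeds the character count for your data, an unadapted vocabulary is a plausible cause of the flat loss.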

Data Preprocessing: There may be issues in preprocessing the text (e.g., normalization, diacritics handling) for languages with distinct character structures and additional linguistic complexities.
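One concrete preprocessing step worth verifying is Unicode normalization. The sketch below uses only Python's stdlib `unicodedata`; whether stripping diacritics (harakat) is appropriate depends on your targets, since they carry pronunciation information — NFKC folding alone is usually the safe part:

```python
import unicodedata

def normalize_arabic(text: str) -> str:
    """NFKC-fold presentation forms, then drop combining marks (category Mn).

    Stripping diacritics loses pronunciation detail, so only do this if
    your pipeline cannot model them; NFKC normalization alone is usually safe.
    """
    text = unicodedata.normalize("NFKC", text)
    return "".join(ch for ch in text if unicodedata.category(ch) != "Mn")

print(normalize_arabic("مُحَمَّد"))  # base letters only, harakat removed
```

Running normalization consistently at both training and inference time matters more than which exact form you pick.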

Model Overfitting/Underfitting: The model might be underfitting due to a simple architecture or insufficient training epochs. The low accuracy suggests that the model might not be able to fully capture the intricacies of the language.

Hyperparameter Issues: A suboptimal learning rate or other hyperparameters could be hindering the model’s ability to converge.
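On the learning-rate point, one common recipe is linear warmup followed by cosine decay. The sketch below is pure Python with placeholder values (base LR, warmup steps) that are illustrative only, not the project's defaults — sweep them against your own loss curves:

```python
import math

def lr_at(step: int, total: int = 10_000, warmup: int = 500,
          base: float = 2e-4, floor: float = 2e-5) -> float:
    # Linear warmup, then cosine decay from `base` down to `floor`.
    # All values here are illustrative placeholders.
    if step < warmup:
        return base * step / warmup
    t = (step - warmup) / max(1, total - warmup)
    return floor + 0.5 * (base - floor) * (1 + math.cos(math.pi * t))
```

A loss that plateaus near its starting value, as reported above, is more often a data or tokenization problem than a schedule problem, but ruling out an LR that is orders of magnitude off is cheap.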

Request for Guidance:
Could you provide guidance on best practices for handling Unicode-based languages with the current architecture? Specifically:

Recommendations for improving tokenization for languages with complex scripts.
Suggestions for model adjustments or more suitable architectures.
Potential preprocessing steps for better handling of Unicode text.
Any advice on hyperparameter tuning (learning rate, optimizer, etc.) for better performance on such languages.
Additional Information:
If there are any known issues with training on complex Unicode languages or specific steps that I might have overlooked, any help would be greatly appreciated.

✔️ Expected Behavior

Training takes a very long time and accuracy improves only slowly:

42:01:26<00:00, 0.20it/s, v_num=0, train/loss=6.500, train/top_5_accuracy=0.461, val/loss=6.950, val/top_5_accuracy=0.459

❌ Actual Behavior

No response

@cod3r0k cod3r0k added the bug Something isn't working label Jan 24, 2025
@abhisirka2001

Hey, are you doing LoRA fine-tuning or full fine-tuning?


cod3r0k commented Jan 24, 2025

Hi @abhisirka2001, I am just following the documentation, which describes LoRA fine-tuning (https://speech.fish.audio/finetune/). What would you suggest I do?
