Self Checks
This template is only for bug reports. For questions, please visit Discussions.
I have thoroughly reviewed the project documentation (installation, training, inference) but couldn't find information to solve my problem.
I have searched for existing issues, including closed ones.
I confirm that I am using English to submit this report (I have read and agree to the Language Policy).
[FOR CHINESE USERS] Please be sure to submit your issue in English, or it will be closed. Thank you! :)
Please do not modify this template and fill in all required fields.
Cloud or Self Hosted
Self Hosted (Source)
Environment Details
Ubuntu
Steps to Reproduce
I have been training the model for over 41 hours, but the accuracy remains very low. Specifically, the training loss is 7.160 and the validation loss is 6.950, with top-5 accuracy stuck at approximately 44% for training and 46% for validation. The dataset I'm using is in a Unicode-heavy language (Arabic), which poses specific challenges around tokenization and script handling.
Possible Issues:
Tokenization: The model may be struggling to tokenize the characters and subword units of this language. Standard tokenizers may not handle complexities such as right-to-left text, character variations, and complex script shaping, leading to poor performance.
Data Preprocessing: The text may not be preprocessed appropriately (e.g., Unicode normalization, diacritics handling) for a language with a distinct character structure and additional linguistic complexities; a rough sketch of the kind of check I have in mind follows this list.
Model Overfitting/Underfitting: The model might be underfitting due to a simple architecture or insufficient training epochs. The low accuracy suggests that the model might not be able to fully capture the intricacies of the language.
Hyperparameter Issues: A suboptimal learning rate or other hyperparameters could be hindering the model’s ability to converge.
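To make the tokenization/preprocessing concern concrete, here is a minimal sketch of the kind of normalization and tokenizer coverage check I have in mind. It assumes a SentencePiece-style tokenizer; the file paths, the sample file, and the decision to strip diacritics are my own placeholders, not part of the project.

```python
# Minimal sketch (not the project's pipeline): normalize Arabic text and
# check how well a trained SentencePiece tokenizer covers it.
# "tokenizer.model" and "train_samples.txt" are placeholders for my own files.
import re
import unicodedata

import sentencepiece as spm

# Arabic combining diacritics (tashkeel) plus tatweel; stripping them is one
# normalization option I am considering, not necessarily the right one.
DIACRITICS = re.compile(r"[\u0610-\u061A\u064B-\u065F\u0670\u0640]")

def normalize(text: str) -> str:
    text = unicodedata.normalize("NFKC", text)  # unify presentation forms
    return DIACRITICS.sub("", text)

sp = spm.SentencePieceProcessor(model_file="tokenizer.model")
unk_id = sp.unk_id()

total = unknown = 0
with open("train_samples.txt", encoding="utf-8") as f:
    for line in f:
        ids = sp.encode(normalize(line.strip()))
        total += len(ids)
        unknown += sum(1 for i in ids if i == unk_id)

print(f"UNK rate: {unknown / max(total, 1):.2%} over {total} tokens")
```

If the UNK rate came out high, that would point at the tokenizer or preprocessing rather than the model itself.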
Request for Guidance:
Could you provide guidance on best practices for handling Unicode-based languages with the current architecture? Specifically:
Recommendations for improving tokenization for languages with complex scripts.
Suggestions for model adjustments or more suitable architectures.
Potential preprocessing steps for better handling of Unicode text.
Any advice on hyperparameter tuning (learning rate, optimizer, schedule, etc.) for better performance on such languages; an illustrative sketch of what I am currently experimenting with follows below.
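For context, this is roughly the kind of optimizer/schedule setup I am experimenting with. All values are illustrative guesses on my side (a standard PyTorch AdamW with linear warmup and decay), not the project's defaults, and `model` stands in for the actual network.

```python
# Illustrative only: lower learning rate with warmup, assuming plain PyTorch.
import torch

model = torch.nn.Linear(8, 8)  # placeholder for the actual network

optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=1e-4,             # reduced from the value I started with
    betas=(0.9, 0.95),
    weight_decay=0.01,
)

warmup_steps, total_steps = 2_000, 200_000

def lr_lambda(step: int) -> float:
    if step < warmup_steps:
        return step / max(warmup_steps, 1)                 # linear warmup
    progress = (step - warmup_steps) / max(total_steps - warmup_steps, 1)
    return max(0.1, 1.0 - progress)                        # decay to 10%

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
```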
Additional Information:
If there are any known issues with training on complex Unicode languages or specific steps that I might have overlooked, any help would be greatly appreciated.
✔️ Expected Behavior
Training takes a very long time and accuracy improves only slowly:
42:01:26<00:00, 0.20it/s, v_num=0, train/loss=6.500, train/top_5_accuracy=0.461, val/loss=6.950, val/top_5_accuracy=0.459
❌ Actual Behavior
No response