Describe the bug
Hello, training XTTSv2 leads to strange training stalls: training gets stuck with no errors when using DDP.
(screenshot: GPU load graph with DDP)
6x RTX A6000 and 512 GB RAM.
Here is the GPU load monitoring graph: purple is gpu0, green is gpu1 (all the remaining GPUs behave like gpu1).
Without DDP
(screenshot: GPU load graph without DDP)
I tried different dataset sizes (2500 hrs and 250 hrs); the result remains the same.
I think there may be some kind of error in the Trainer or in the XTTS scripts, but I don't know where to dig. Thank you.
There is no swap usage, no CPU overload, and no RAM overload (according to ClearML, htop, and top, at least).
The disk is a fast NVMe drive.
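
Since the run stalls silently, one way to see where each rank is blocked is to dump Python stack traces from every training process once the hang starts. A minimal diagnostic sketch, assuming it is placed near the top of the recipe script (this is a debugging suggestion, not part of the original setup):

```python
import sys
import faulthandler

# Dump the traceback of every thread in this process every 10 minutes.
# If one rank (e.g. gpu0) diverges while the others sit in a collective,
# the repeated dumps show exactly which call each rank is waiting in.
# Output goes to stderr, so each rank logs its own dump.
faulthandler.dump_traceback_later(timeout=600, repeat=True, file=sys.stderr)
```

Comparing the dump from gpu0 against the dumps from the idle ranks should show whether they are stuck waiting on a collective that gpu0 never reaches (for example, because of an uneven number of batches per rank).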
To Reproduce
With DDP on 6 GPUs:
python -m trainer.distribute --script recipes/ljspeech/xtts_v2/train_gpt_xtts.py --gpus 0,1,2,3,4,5
With DDP on 2 GPUs:
python -m trainer.distribute --script recipes/ljspeech/xtts_v2/train_gpt_xtts.py --gpus 0,1
Without DDP:
python3 recipes/ljspeech/xtts_v2/train_gpt_xtts.py
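
For more visibility into where the DDP hang happens, the same launches can be repeated with PyTorch's distributed debug logging enabled. A sketch of the environment variables involved, set at the top of the recipe script so they apply to every spawned rank (these are standard PyTorch/NCCL debug settings and an added suggestion, not part of the original reproduction steps):

```python
import os

# Must be set before the process group is initialized.
# NCCL_DEBUG=INFO logs NCCL communicator setup and collective activity;
# TORCH_DISTRIBUTED_DEBUG=DETAIL adds consistency checks that can flag
# ranks issuing mismatched collectives; TORCH_CPP_LOG_LEVEL=INFO surfaces
# c10d log messages that are hidden by default.
os.environ.setdefault("NCCL_DEBUG", "INFO")
os.environ.setdefault("TORCH_DISTRIBUTED_DEBUG", "DETAIL")
os.environ.setdefault("TORCH_CPP_LOG_LEVEL", "INFO")
```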
Expected behavior
No response
Logs
No response
Environment
Additional context
No response