Describe the bug
Hello, training XTTSv2 leads to strange training stalls: training gets stuck with no errors when using DDP.
(screenshot: GPU load graph with DDP)
6x RTX A6000 and 512 GB RAM.
Here is the GPU load monitoring graph: purple is gpu0, green is gpu1 (all the remaining GPUs behave like gpu1).
Without DDP
(screenshot: GPU load graph without DDP)
I tried different dataset sizes (2500 hrs and 250 hrs); the result remains the same.
I think there may be some kind of error in the Trainer or in the XTTS scripts, but I don't know where to dig. Thank you.
There is no swap usage, no CPU overload, and no RAM overload (according to ClearML, htop, and top, at least).
The disk is a fast NVMe drive.
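
Since the run stalls silently, one way to see where each rank is blocked is to dump Python stack traces from every training process once the hang starts. A minimal diagnostic sketch, assuming it is placed near the top of the recipe script (this is a debugging suggestion, not part of the original setup):

```python
import sys
import faulthandler

# Dump the traceback of every thread in this process every 10 minutes.
# If one rank (e.g. gpu0) diverges while the others sit in a collective,
# the repeated dumps show exactly which call each rank is waiting in.
# Output goes to stderr, so each rank logs its own dump.
faulthandler.dump_traceback_later(timeout=600, repeat=True, file=sys.stderr)
```

Comparing the dump from gpu0 against the dumps from the idle ranks should show whether they are stuck waiting on a collective that gpu0 never reaches (for example, because of an uneven number of batches per rank).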
To Reproduce
With DDP on 6 GPUs:
python -m trainer.distribute --script recipes/ljspeech/xtts_v2/train_gpt_xtts.py --gpus 0,1,2,3,4,5
With DDP on 2 GPUs:
python -m trainer.distribute --script recipes/ljspeech/xtts_v2/train_gpt_xtts.py --gpus 0,1
Without DDP:
python3 recipes/ljspeech/xtts_v2/train_gpt_xtts.py
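
For more visibility into where the DDP hang happens, the same launches can be repeated with PyTorch's distributed debug logging enabled. A sketch of the environment variables involved, set at the top of the recipe script so they apply to every spawned rank (these are standard PyTorch/NCCL debug settings and an added suggestion, not part of the original reproduction steps):

```python
import os

# Must be set before the process group is initialized.
# NCCL_DEBUG=INFO logs NCCL communicator setup and collective activity;
# TORCH_DISTRIBUTED_DEBUG=DETAIL adds consistency checks that can flag
# ranks issuing mismatched collectives; TORCH_CPP_LOG_LEVEL=INFO surfaces
# c10d log messages that are hidden by default.
os.environ.setdefault("NCCL_DEBUG", "INFO")
os.environ.setdefault("TORCH_DISTRIBUTED_DEBUG", "DETAIL")
os.environ.setdefault("TORCH_CPP_LOG_LEVEL", "INFO")
```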
Expected behavior
No response
Logs
No response
Environment
Additional context
No response