Self Checks
This template is only for bug reports. For questions, please visit Discussions.
I have thoroughly reviewed the project documentation (installation, training, inference) but couldn't find information to solve my problem.
I have searched for existing issues, including closed ones.
I confirm that I am using English to submit this report (I have read and agree to the Language Policy).
[FOR CHINESE USERS] Please be sure to submit issues in English, or they will be closed. Thank you! :)
Please do not modify this template and fill in all required fields.
Cloud or Self Hosted
Self Hosted (Source)
Environment Details
Ubuntu
Steps to Reproduce
I ran training with LoRA. I don't know why it takes so much longer than VITS or other models I have trained. Does that make sense? How long should it typically take, for example, on LJSpeech with a T4 GPU?
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
[2025-01-24 20:32:52,673][fish_speech.models.text2semantic.lit_module][INFO] - [rank: 0] Set weight decay: 0 for 432 parameters
[2025-01-24 20:32:52,674][fish_speech.models.text2semantic.lit_module][INFO] - [rank: 0] Set weight decay: 0.0 for 61 parameters
| Name | Type | Params | Mode
-------------------------------------------------------------------------
0 | model | DualARTransformer | 644 M | train
1 | model.embeddings | Embedding | 105 M | train
2 | model.codebook_embeddings | Embedding | 8.5 M | train
3 | model.layers | ModuleList | 362 M | train
4 | model.norm | RMSNorm | 1.0 K | train
5 | model.output | Linear | 105 M | train
6 | model.fast_project_in | Identity | 0 | train
7 | model.fast_embeddings | Embedding | 1.1 M | train
8 | model.fast_layers | ModuleList | 60.4 M | train
9 | model.fast_norm | RMSNorm | 1.0 K | train
10 | model.fast_output | Linear | 1.1 M | train
-------------------------------------------------------------------------
6.2 M Trainable params
637 M Non-trainable params
644 M Total params
2,576.370 Total estimated model params size (MB)
433 Modules in train mode
0 Modules in eval mode
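The summary above shows only 6.2 M of the 644 M parameters are trainable, which is what LoRA is supposed to do: the base transformer stays frozen and only the low-rank adapters receive gradients. A quick sanity check on the logged numbers (my own arithmetic, not project code):

```python
# Sanity-checking the Lightning parameter summary above. All counts are
# copied from the log; the 4 bytes/param factor assumes fp32 storage.
trainable = 6.2e6      # "6.2 M Trainable params"
total = 644e6          # "644 M Total params"

frac = trainable / total
size_mb = total * 4 / 1e6  # fp32: 4 bytes per parameter

print(f"trainable fraction: {frac:.1%}")    # ~1.0%, typical for LoRA adapters
print(f"estimated size: {size_mb:.0f} MB")  # 2576 MB, matching the log line
```

So the trainable fraction is in the expected range for LoRA, and the 2,576 MB estimate in the log is consistent with fp32 weights.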
Sanity Checking: | | 0/? [00:00<?, ?it/s][2025-01-24 20:32:52,851][fish_speech.datasets.semantic][INFO] - [rank: 0] Reading 1 / 1 files
[2025-01-24 20:32:53,264][fish_speech.datasets.semantic][INFO] - [rank: 0] Read total 1 groups of data
[2025-01-24 20:33:05,749][fish_speech.datasets.semantic][INFO] - [rank: 0] Reading 1 / 1 files
[2025-01-24 20:33:06,059][fish_speech.datasets.semantic][INFO] - [rank: 0] Read total 1 groups of data
Epoch 0: | | 100/? [26:13<00:00, 0.06it/s, v_num=0, train/loss=8.250, train/top_5_accuracy=0.412]
Validation: | | 0/? [00:00<?, ?it/s]
Validation: 0% 0/10 [00:00<?, ?it/s]
Validation DataLoader 0: 0% 0/10 [00:00<?, ?it/s]
Validation DataLoader 0: 100% 10/10 [00:47<00:00, 4.80s/it]
Epoch 0: | | 200/? [53:16<00:00, 0.06it/s, v_num=0, train/loss=7.880, train/top_5_accuracy=0.438, val/loss=8.060, val/top_5_accuracy=0.423]
Validation: | | 0/? [00:00<?, ?it/s]
Validation: 0% 0/10 [00:00<?, ?it/s]
Validation DataLoader 0: 0% 0/10 [00:00<?, ?it/s]
Validation DataLoader 0: 100% 10/10 [00:47<00:00, 4.78s/it]
Epoch 0: | | 300/? [1:20:17<00:00, 0.06it/s, v_num=0, train/loss=7.940, train/top_5_accuracy=0.418, val/loss=8.060, val/top_5_accuracy=0.423]
Validation: | | 0/? [00:00<?, ?it/s]
Validation: 0% 0/10 [00:00<?, ?it/s]
Validation DataLoader 0: 0% 0/10 [00:00<?, ?it/s]
Validation DataLoader 0: 100% 10/10 [00:47<00:00, 4.79s/it]
Epoch 0: | | 380/? [1:42:08<00:00, 0.06it/s, v_num=0, train/loss=7.880, train/top_5_accuracy=0.430, val/loss=7.890, val/top_5_accuracy=0.433]
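From the progress bar, throughput sits at about 0.06 it/s for the entire epoch. Converting that to wall-clock time (just arithmetic on the logged rate, nothing model-specific):

```python
# Back-of-the-envelope timing from the progress bar above, which reports
# roughly 0.06 it/s throughout epoch 0.
rate = 0.06  # iterations per second, read from the log

seconds_per_step = 1 / rate

def eta_minutes(steps: int, rate: float = rate) -> float:
    """Wall-clock minutes to complete `steps` at `rate` it/s."""
    return steps / rate / 60

print(f"{seconds_per_step:.1f} s per step")        # ~16.7 s
print(f"100 steps -> {eta_minutes(100):.0f} min")  # ~28 min, close to the logged 26:13
```

At ~17 s per optimizer step, a few hundred steps per day is about what the log shows, so the slow epoch progress is consistent with the reported rate rather than a stalled run.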
✔️ Expected Behavior
No response
❌ Actual Behavior
No response
I use an A100, and its utilization is at 100%, but it takes a long time compared to my previous experience with VITS. The step count increases very slowly: after multiple days it is still at epoch 0 and has made only minimal progress, which seems very inefficient. It also consumes a huge amount of RAM (~120 GB). Why is this happening? @Stardust-minus
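To narrow down where the ~120 GB of RAM goes, one starting point is checking the training process's peak resident memory. This is a generic stdlib-only diagnostic sketch, not part of fish-speech; note that DataLoader workers are separate processes, so each would need to be inspected on its own (e.g. with `ps` or `smem`):

```python
# Generic diagnostic sketch: report this process's peak resident set size.
# resource.getrusage only covers the current process, so DataLoader worker
# processes must be checked separately.
import resource
import sys

def peak_rss_gib() -> float:
    """Peak RSS of the current process in GiB.

    On Linux ru_maxrss is reported in KiB; on macOS it is in bytes.
    """
    raw = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    divisor = 1024**3 if sys.platform == "darwin" else 1024**2
    return raw / divisor

print(f"peak RSS: {peak_rss_gib():.2f} GiB")
```

If most of the memory sits in the worker processes, lowering the dataloader worker count or avoiding fully in-memory dataset caching would be the usual things to try.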