Training a 4-step SDXL gets "CUDA error: no kernel image is available for execution on the device" #47
Comments
And once I upgrade torch to, say, 2.2.0 with cu121, I get the same error as in #41.
I think it requires torch 2.0.1, which unfortunately doesn't seem to have a build with CUDA 12.x. The deeper reason for the error is probably some mysterious implementation detail of FSDP in accelerate. To solve this, we basically need to use the raw FSDP wrapper (https://pytorch.org/docs/stable/fsdp.html#torch.distributed.fsdp.FullyShardedDataParallel). Some code snippets look like the following.
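A minimal sketch of what that raw FSDP wrapping could look like (the `wrap_with_fsdp` helper, the `BasicTransformerBlock` wrap target, and the bf16 mixed-precision settings are assumptions for illustration, not the repo's actual code):

```python
import functools

import torch
from torch.distributed.fsdp import (
    FullyShardedDataParallel as FSDP,
    MixedPrecision,
    ShardingStrategy,
)
from torch.distributed.fsdp.wrap import transformer_auto_wrap_policy

# Assumed transformer block class for an SDXL UNet; adjust to your model.
from diffusers.models.attention import BasicTransformerBlock


def wrap_with_fsdp(unet: torch.nn.Module) -> FSDP:
    # Assumes torch.distributed is already initialized (e.g. via torchrun).
    # Wrap each transformer block as its own FSDP unit.
    auto_wrap_policy = functools.partial(
        transformer_auto_wrap_policy,
        transformer_layer_cls={BasicTransformerBlock},
    )
    mixed_precision = MixedPrecision(
        param_dtype=torch.bfloat16,
        reduce_dtype=torch.bfloat16,
        buffer_dtype=torch.bfloat16,
    )
    return FSDP(
        unet,
        auto_wrap_policy=auto_wrap_policy,
        mixed_precision=mixed_precision,
        sharding_strategy=ShardingStrategy.FULL_SHARD,
        device_id=torch.cuda.current_device(),
        use_orig_params=True,  # plays nicer with optimizer param groups
    )
```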
The auto_wrap policy needs to be adapted to SDXL (e.g., using size-based wrapping instead of the transformer-based one here; see the sketch below). I haven't had time to test this recently, but it works well in another project's codebase, and once this is fixed, it should support any torch / CUDA version.
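For example, a size-based policy could be swapped in like this (the parameter-count threshold is an arbitrary placeholder to tune, not a recommended value):

```python
import functools

from torch.distributed.fsdp.wrap import size_based_auto_wrap_policy

# Wrap any submodule whose parameter count exceeds the threshold.
auto_wrap_policy = functools.partial(
    size_based_auto_wrap_policy,
    min_num_params=int(1e8),  # placeholder threshold; tune for SDXL
)
```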
Hi @tianweiy, I removed FSDP and got it to train with bs=2 (with bf16) on a single 80GB GPU, thanks. However, when I tested it on two cards, I can at most fit bs=1 per GPU; is this expected? Besides, it would be even better if you could share the training loss, etc. The task is SDXL 4-step distillation.
DDP and the like add multiple GB of extra overhead, so that is possible. Once there is no FSDP, I think you can enable gradient checkpointing; this might save some memory (you might need to add gradient checkpointing when computing the GAN loss too). Full bf16 might or might not have precision issues; interested to see how it goes for you. One more trick you could do is to offload the real UNet to CPU after computing the DMD loss, which should save quite a few GB. Basically, load it to GPU at Line 178 in 0f8a481 and offload it to CPU at Line 255 in 0f8a481.
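A rough sketch of that load/offload pattern, assuming a frozen real UNet and placeholder loss callables (`memory_frugal_step`, `dmd_loss_fn`, and `gan_loss_fn` are illustrative names, not the code at those lines):

```python
import torch


def memory_frugal_step(real_unet, fake_latents, dmd_loss_fn, gan_loss_fn, device="cuda"):
    """Keep the frozen real UNet on GPU only while the DMD loss needs it."""
    real_unet.to(device)                      # roughly what loading at Line 178 does
    dmd_loss = dmd_loss_fn(real_unet, fake_latents)

    real_unet.to("cpu")                       # roughly what offloading at Line 255 does
    torch.cuda.empty_cache()                  # return the freed blocks to the allocator

    gan_loss = gan_loss_fn(fake_latents)      # GAN loss no longer needs the real UNet
    return dmd_loss + gan_loss
```

In diffusers, gradient checkpointing on the trainable UNet can usually be turned on with `unet.enable_gradient_checkpointing()`, trading extra recompute per step for memory.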
It will be slow, but the whole codebase is not very fast at the moment anyway, lol. I think newer PyTorch versions seem to be faster. By the way, feel free to let me know if you have any questions about how useful a specific loss / hyperparameter / etc. is.
Hi @tianweiy, thanks a lot for your help. Now I can train DMD on multiple GPUs with bs=2 per card, but the results are not good for now. According to your logs, after 2k iters we should see some nice generated pics; however, the DM loss is not dropping, and the generated image std is not increasing... Have you tried bf16 training on SDXL before?
I didn't. Actually, could you send me an email? We can probably set up a call to figure out the issues. Thanks.
Hi Yuanzhi, I'm curious about what modifications you made to enable the model to train within 80GB of GPU memory. Did you simply remove FSDP?
Hey, can you try to set
I created the conda env following the README and got this error when training SDXL.
It seems the installed torch version is built with cu11, but my GPU runs CUDA 12.2 (I checked with nvidia-smi).
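One quick way to confirm what the installed wheel was actually built against (nvidia-smi reports the driver's CUDA version, not torch's build) is something like the following; the commented outputs are illustrative only:

```python
import torch

print(torch.__version__)                     # e.g. 2.0.1+cu118
print(torch.version.cuda)                    # CUDA toolkit the wheel was built with
print(torch.cuda.is_available())             # whether a usable GPU was found
print(torch.cuda.get_device_capability(0))   # compute capability of GPU 0
```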
However, the previous issue #41 indicates that the only working env is the one that follows the README?
Massive thanks for your work