
AssertionError: Invalid: `_post_backward_hook_state` when training a 4-step SDXL model #41

Open
joelulu opened this issue Aug 9, 2024 · 8 comments


@joelulu

joelulu commented Aug 9, 2024

Thanks for the excellent work!
When I try to train a 4-step SDXL model (2 nodes, 16 GPUs), I get the following error:

```
[rank2]: Traceback (most recent call last):
[rank2]:   File "/mnt/nas/gaohl/project/DMD2-main/main/train_sd.py", line 739, in <module>
[rank2]:     trainer.train()
[rank2]:   File "/mnt/nas/gaohl/project/DMD2-main/main/train_sd.py", line 633, in train
[rank2]:     self.train_one_step()
[rank2]:   File "/mnt/nas/gaohl/project/DMD2-main/main/train_sd.py", line 390, in train_one_step
[rank2]:     self.accelerator.backward(generator_loss)
[rank2]:   File "/usr/local/lib/python3.10/site-packages/accelerate/accelerator.py", line 2159, in backward
[rank2]:     loss.backward(**kwargs)
[rank2]:   File "/usr/local/lib/python3.10/site-packages/torch/_tensor.py", line 525, in backward
[rank2]:     torch.autograd.backward(
[rank2]:   File "/usr/local/lib/python3.10/site-packages/torch/autograd/__init__.py", line 267, in backward
[rank2]:     _engine_run_backward(
[rank2]:   File "/usr/local/lib/python3.10/site-packages/torch/autograd/graph.py", line 744, in _engine_run_backward
[rank2]:     return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
[rank2]:   File "/usr/local/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
[rank2]:     return func(*args, **kwargs)
[rank2]:   File "/usr/local/lib/python3.10/site-packages/torch/distributed/fsdp/_runtime_utils.py", line 1099, in _post_backward_final_callback
[rank2]:     _finalize_params(fsdp_state)
[rank2]:   File "/usr/local/lib/python3.10/site-packages/torch/distributed/fsdp/_runtime_utils.py", line 1168, in _finalize_params
[rank2]:     _p_assert(
[rank2]:   File "/usr/local/lib/python3.10/site-packages/torch/distributed/utils.py", line 146, in _p_assert
[rank2]:     raise AssertionError(s)
[rank2]: AssertionError: Invalid: _post_backward_hook_state: (<torch.autograd.graph.register_multi_grad_hook.<locals>.Handle object at 0x7f0bfb6a3a00>,)
```

My launch command is:

```bash
accelerate launch \
    --main_process_port $MASTER_PORT \
    --main_process_ip $MASTER_ADDR \
    --config_file fsdp_configs/fsdp_8node_debug_joe.yaml \
    --machine_rank $RANK \
    main/train_sd.py \
    --generator_lr 5e-7 \
    --guidance_lr 5e-7 \
    --train_iters 100000000 \
    --output_path $CHECKPOINT_PATH/sdxl_cond999_8node_lr5e-7_denoising4step_diffusion1000_gan5e-3_guidance8_noinit_noode_backsim_scratch \
    --batch_size 2 \
    --grid_size 2 \
    --initialie_generator \
    --log_iters 1000 \
    --resolution 1024 \
    --latent_resolution 128 \
    --seed 10 \
    --real_guidance_scale 8 \
    --fake_guidance_scale 1.0 \
    --max_grad_norm 10.0 \
    --model_id "/mnt/nas/gaohl/models/models--stabilityai--stable-diffusion-xl-base-1.0/snapshots/462165984030d82259a11f4367a4eed129e94a7b/" \
    --wandb_iters 100 \
    --wandb_entity $WANDB_ENTITY \
    --wandb_project $WANDB_PROJECT \
    --wandb_name "sdxl_cond999_8node_lr5e-7_denoising4step_diffusion1000_gan5e-3_guidance8_noinit_noode_backsim_scratch" \
    --log_loss \
    --dfake_gen_update_ratio 5 \
    --fsdp \
    --sdxl \
    --use_fp16 \
    --max_step_percent 0.98 \
    --cls_on_clean_image \
    --gen_cls_loss \
    --gen_cls_loss_weight 5e-3 \
    --guidance_cls_loss_weight 1e-2 \
    --diffusion_gan \
    --diffusion_gan_max_timestep 1000 \
    --denoising \
    --num_denoising_step 4 \
    --denoising_timestep 1000 \
    --backward_simulation \
    --train_prompt_path $CHECKPOINT_PATH/captions_laion_score6.25.pkl \
    --real_image_path $CHECKPOINT_PATH/sdxl_vae_latents_laion_500k_lmdb/
```

fsdp_8node_debug_joe.yaml:
```yaml
compute_environment: LOCAL_MACHINE
debug: true
distributed_type: FSDP
downcast_bf16: 'no'
fsdp_config:
  fsdp_auto_wrap_policy: SIZE_BASED_WRAP
  fsdp_backward_prefetch_policy: BACKWARD_PRE
  fsdp_forward_prefetch: false
  fsdp_min_num_params: 3000
  fsdp_offload_params: false
  fsdp_sharding_strategy: 1
  fsdp_state_dict_type: SHARDED_STATE_DICT
  fsdp_sync_module_states: true
  fsdp_use_orig_params: false
machine_rank: 0
main_training_function: main
mixed_precision: 'no'
num_machines: 1
num_processes: 8
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
```
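For context, a rough sketch of what those `fsdp_config` settings correspond to in raw PyTorch FSDP (this is not the repo's actual wrapping code, and `wrap_with_fsdp` is a hypothetical helper): `SIZE_BASED_WRAP` with `fsdp_min_num_params: 3000` becomes a `size_based_auto_wrap_policy`, and `fsdp_sharding_strategy: 1` is `FULL_SHARD`.

```python
import functools

import torch.nn as nn
from torch.distributed.fsdp import (
    BackwardPrefetch,
    FullyShardedDataParallel as FSDP,
    ShardingStrategy,
)
from torch.distributed.fsdp.wrap import size_based_auto_wrap_policy


def wrap_with_fsdp(model: nn.Module) -> FSDP:
    """Hypothetical helper mirroring the accelerate yaml above."""
    return FSDP(
        model,
        # SIZE_BASED_WRAP with fsdp_min_num_params: 3000
        auto_wrap_policy=functools.partial(
            size_based_auto_wrap_policy, min_num_params=3000
        ),
        # fsdp_sharding_strategy: 1 == FULL_SHARD
        sharding_strategy=ShardingStrategy.FULL_SHARD,
        # fsdp_backward_prefetch_policy: BACKWARD_PRE
        backward_prefetch=BackwardPrefetch.BACKWARD_PRE,
        forward_prefetch=False,   # fsdp_forward_prefetch: false
        use_orig_params=False,    # fsdp_use_orig_params: false
    )
```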

@tianweiy
Owner

I have no clue how to solve this error.

But the most likely cause of the error is a mismatched torch and accelerate version. Could you double-check that your torch and accelerate versions match the ones in the README?

Other versions simply don't work, unfortunately...

Also related to #25 (comment).
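A quick way to compare the installed versions against the README pins (a minimal sketch; run it inside the training environment):

```python
# Print the installed torch/accelerate versions so they can be
# compared against the pins in the DMD2 README; a mismatched pair
# is the usual suspect for FSDP hook-state assertions like this one.
import accelerate
import torch

print(f"torch=={torch.__version__}")
print(f"accelerate=={accelerate.__version__}")
```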

@fire2323

@tianweiy, do you use TorchDynamo with torch FSDP? I found that this error might be related to TorchDynamo compilation. One way to rule it out is sketched below.
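A minimal sketch for disabling TorchDynamo globally, assuming the standard `TORCHDYNAMO_DISABLE` environment switch (it must be set before anything is traced or compiled):

```python
import os

# Turn TorchDynamo off globally to rule it out; the env var must be
# set before torch traces/compiles anything, so put this at the very
# top of main/train_sd.py (or export it in the launch script).
os.environ["TORCHDYNAMO_DISABLE"] = "1"

import torch  # noqa: E402  (imported after the env var on purpose)
```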

@tianweiy
Owner

I didn't use TorchDynamo.

@BeBuBu

BeBuBu commented Aug 12, 2024

Is there any new progress on this issue? My accelerate and torch versions are as follows:
accelerate 0.25.0
pytorch 2.1.2 py3.8_cuda11.8_cudnn8.7.0_0

@tianweiy
Owner

tianweiy commented Aug 12, 2024 via email

Please just try the one specified in the readme first. Other versions are not tested and likely just don't work.

@joelulu
Author

joelulu commented Aug 12, 2024

> I have no clue how to solve this error.
>
> But the most likely cause of the error is a mismatched torch and accelerate version. Could you double-check that your torch and accelerate versions match the ones in the README?
>
> Other versions simply don't work, unfortunately...
>
> Also related to #25 (comment).

Thanks a lot. I solved my problem.

@BeBuBu

BeBuBu commented Aug 12, 2024

> Please just try the one specified in the readme first. Other versions are not tested and likely just don't work.

Thank you. Problem solved.

@joelulu
Author

joelulu commented Aug 20, 2024

Does DMD2 have plans to support StableCascade distillation?
