Describe the bug

Due to memory limitations, I am attempting to use ZeRO-3 for Flux training on 8 GPUs with 32 GB each. I ran into a bug similar to the one reported in issue #1865 and applied the fix proposed in pull request #3076, but the same error persists; as far as I can tell, the fix does not fully resolve the problem. Could you advise on how to modify it further?

The relevant code from https://github.com/huggingface/diffusers/blob/main/examples/dreambooth/train_dreambooth_lora_flux.py#L1157 has been updated as follows:
def deepspeed_zero_init_disabled_context_manager():
    """
    returns either a context list that includes one that will disable zero.Init or an empty context list
    """
    deepspeed_plugin = AcceleratorState().deepspeed_plugin if accelerate.state.is_initialized() else None
    print(f"deepspeed_plugin: {deepspeed_plugin}")
    if deepspeed_plugin is None:
        return []
    return [deepspeed_plugin.zero3_init_context_manager(enable=False)]

# Load the frozen text encoders and VAE with zero.Init disabled so that their
# weights are materialized locally instead of being partitioned by ZeRO-3.
with ContextManagers(deepspeed_zero_init_disabled_context_manager()):
    text_encoder_one, text_encoder_two = load_text_encoders(text_encoder_cls_one, text_encoder_cls_two)
    vae = AutoencoderKL.from_pretrained(
        args.pretrained_model_name_or_path,
        subfolder="vae",
        revision=args.revision,
        variant=args.variant,
    )
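As a sanity check (added here for illustration only, not part of train_dreambooth_lora_flux.py), one can verify right after this block that the frozen models were actually loaded with materialized weights; parameters still partitioned by ZeRO-3's zero.Init show up locally as empty placeholder tensors:

# Illustrative check only: with zero.Init disabled, none of the frozen models'
# parameters should be empty ZeRO-3 placeholders.
for name, module in [("vae", vae),
                     ("text_encoder_one", text_encoder_one),
                     ("text_encoder_two", text_encoder_two)]:
    num_placeholders = sum(p.numel() == 0 for p in module.parameters())
    print(f"{name}: {num_placeholders} placeholder parameters (expected 0)")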
Reproduction
deepspeed config:
accelerate config:
training shell:
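The actual config files and launch script are not included above. Purely as an illustration of the kind of setup being described (ZeRO-3 across 8 GPUs with 32 GB each), an equivalent programmatic configuration through Accelerate might look like the sketch below; the specific values (offloading, precision, clipping) are assumptions, not the settings used in this report.

# Illustrative ZeRO-3 setup via Accelerate; values are placeholders, not the
# reporter's actual deepspeed/accelerate config.
from accelerate import Accelerator, DeepSpeedPlugin

deepspeed_plugin = DeepSpeedPlugin(
    zero_stage=3,
    gradient_accumulation_steps=1,
    gradient_clipping=1.0,
    offload_optimizer_device="cpu",
    offload_param_device="cpu",
    zero3_init_flag=True,
)
accelerator = Accelerator(mixed_precision="bf16", deepspeed_plugin=deepspeed_plugin)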
Logs
RuntimeError: 'weight' must be 2-D
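For context on where this message comes from: torch.nn.functional.embedding rejects any weight tensor that is not 2-D, and a ZeRO-3 parameter that has not been gathered is an empty placeholder locally, which reproduces the same error. A minimal sketch (an assumption about the mechanism, not the actual traceback):

import torch
import torch.nn.functional as F

# Stand-in for a ZeRO-3 partitioned weight that was never gathered: an empty,
# non-2-D tensor. The embedding lookup then fails with the reported message.
ids = torch.tensor([1, 2, 3])
placeholder_weight = torch.empty(0)
try:
    F.embedding(ids, placeholder_weight)
except RuntimeError as err:
    print(err)  # 'weight' must be 2-D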
System Info
pytorch: 2.1.0
deepspeed: 0.14.0
accelerate: 1.3.0
diffusers: develop
Who can help?
No response
the problem is a bug in the interaction between Diffusers, Accelerate, PEFT, and DeepSpeed, which weren't involved in that Megatron training run :D

@bghira I see, sorry for the unclear wording. My question is whether Megatron can be used for Flux training on 8 GPUs with 32 GB each; I haven't seen that setup mentioned in any issue.