Running train_controlnet_flux.py with multiple GPUs results in an NCCL timeout error after N iterations of train_dataset.map(). The error can be partially worked around by initializing Accelerator with a larger timeout:
from datetime import timedelta

from accelerate import Accelerator, InitProcessGroupKwargs

x = InitProcessGroupKwargs(timeout=timedelta(seconds=N))
accelerator = Accelerator(
    ...,
    kwargs_handlers=[x],
)
However, the NCCL timeout error recurs at a later iteration of train_dataset.map():
Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
Reproduction
accelerate launch --config_file configs/distributed train_controlnet_flux.py \
  --pretrained_model_name_or_path="black-forest-labs/FLUX.1-dev" \
  --conditioning_image_column=conditioning_image \
  --image_column=image \
  --caption_column=text \
  --output_dir="path" \
  --mixed_precision="bf16" \
  --resolution=1024 \
  --learning_rate=5e-6 \
  --max_train_steps=100000 \
  --validation_steps=1000 \
  --checkpointing_steps=25000 \
  --validation_image "placeholder" \
  --validation_prompt "placeholder" \
  --train_batch_size=4 \
  --gradient_accumulation_steps=1 \
  --report_to="tensorboard" \
  --seed=42 \
  --jsonl_for_train="path" \
  --cache_dir="path"
The accelerate config (configs/distributed):

compute_environment: LOCAL_MACHINE
deepspeed_config: {}
distributed_type: MULTI_GPU
fsdp_config: {}
machine_rank: 0
main_process_ip: null
main_process_port: null
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 4
use_cpu: false
Logs
See the NCCL timeout message quoted in the description above.
System Info
diffusers from source
accelerate == 1.1.1
datasets == 3.1.0
transformers == 4.46.2
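The versions above can be collected programmatically with the standard library, which may help when comparing environments across machines (the package list mirrors the one reported here):

```python
from importlib.metadata import PackageNotFoundError, version

# Report the installed version of each package from the issue's System Info.
for pkg in ("diffusers", "accelerate", "datasets", "transformers"):
    try:
        print(pkg, version(pkg))
    except PackageNotFoundError:
        print(pkg, "not installed")
```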