
Although the folder preparation button is back, the image folder still does not work #2245

Open
jjiikkkk opened this issue Apr 10, 2024 · 6 comments

@jjiikkkk

See below; I do not know what is wrong.

Traceback (most recent call last):
File "/kaggle/working/kohya_ss/sd-scripts/train_network.py", line 1115, in
trainer.train(args)
File "/kaggle/working/kohya_ss/sd-scripts/train_network.py", line 234, in train
model_version, text_encoder, vae, unet = self.load_target_model(args, weight_dtype, accelerator)
File "/kaggle/working/kohya_ss/sd-scripts/train_network.py", line 101, in load_target_model
text_encoder, vae, unet, _ = train_util.load_target_model(args, weight_dtype, accelerator)
File "/kaggle/working/kohya_ss/sd-scripts/library/train_util.py", line 4387, in load_target_model
text_encoder, vae, unet, load_stable_diffusion_format = _load_target_model(
File "/kaggle/working/kohya_ss/sd-scripts/library/train_util.py", line 4363, in _load_target_model
original_unet = UNet2DConditionModel(
File "/kaggle/working/kohya_ss/sd-scripts/library/original_unet.py", line 1427, in init
attn_num_head_channels=attention_head_dim[i],
IndexError: list index out of range
[2024-04-09 05:08:36,532] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 1114) of binary: /opt/conda/bin/python
Traceback (most recent call last):
File "/opt/conda/bin/accelerate", line 8, in
sys.exit(main())
File "/opt/conda/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 47, in main
args.func(args)
File "/opt/conda/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1008, in launch_command
multi_gpu_launcher(args)
File "/opt/conda/lib/python3.10/site-packages/accelerate/commands/launch.py", line 666, in multi_gpu_launcher
distrib_run.run(args)
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/run.py", line 797, in run
elastic_launch(
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

/kaggle/working/kohya_ss/sd-scripts/train_network.py FAILED

Failures:
<NO_OTHER_FAILURES>

Root Cause (first observed failure):
[0]:
time : 2024-04-09_05:08:36
host : c9fb5313e204
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 1114)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
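
The IndexError at `attention_head_dim[i]` in `original_unet.py` typically shows up when an SDXL checkpoint is handed to the SD1.x/SD2.x loader in `train_network.py`; the fix reported later in this thread (switching to `sdxl_train_network.py`) matches that. Below is a minimal sketch, not part of kohya_ss, for telling the two families apart from a single-file `.safetensors` checkpoint before launching; the key prefixes and the example path are assumptions based on the usual "original" checkpoint layout.

```python
# Minimal sketch (not part of kohya_ss): guess whether a single-file checkpoint is
# SDXL or SD1.x/SD2.x from its state-dict keys, so the matching training script can
# be chosen (train_network.py vs sdxl_train_network.py).
from safetensors import safe_open

def guess_model_family(path: str) -> str:
    with safe_open(path, framework="pt", device="cpu") as f:
        keys = list(f.keys())
    # SDXL checkpoints carry a second text encoder under conditioner.embedders.1.*
    if any(k.startswith("conditioner.embedders.1.") for k in keys):
        return "sdxl -> use sdxl_train_network.py"
    # SD1.x/SD2.x checkpoints keep their single CLIP encoder under cond_stage_model.*
    if any(k.startswith("cond_stage_model.") for k in keys):
        return "sd1.x/sd2.x -> use train_network.py"
    return "unknown (possibly a diffusers-format folder)"

if __name__ == "__main__":
    # Hypothetical path; point this at the checkpoint passed via --pretrained_model_name_or_path.
    print(guess_model_family("/kaggle/input/model/checkpoint.safetensors"))
```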

@jjiikkkk
Author

Seems like the same error as #2244.

@bmaltais
Owner

Not sure what causes this error… it comes from the training script, so I don’t think I can do anything about it. You might want to open an issue directly on the sd-scripts repo.

@jjiikkkk
Author

okay

@jjiikkkk
Author

File "/kaggle/working/kohya_ss/sd-scripts/sdxl_train_network.py", line 185, in
trainer.train(args)
File "/kaggle/working/kohya_ss/sd-scripts/train_network.py", line 272, in train
train_dataset_group.cache_latents(vae, args.vae_batch_size, args.cache_latents_to_disk, accelerator.is_main_process)
File "/kaggle/working/kohya_ss/sd-scripts/library/train_util.py", line 2080, in cache_latents
dataset.cache_latents(vae, vae_batch_size, cache_to_disk, is_main_process)
File "/kaggle/working/kohya_ss/sd-scripts/library/train_util.py", line 1023, in cache_latents
cache_batch_latents(vae, cache_to_disk, batch, subset.flip_aug, subset.random_crop)
File "/kaggle/working/kohya_ss/sd-scripts/library/train_util.py", line 2428, in cache_batch_latents
raise RuntimeError(f"NaN detected in latents: {info.absolute_path}")
RuntimeError: NaN detected in latents: /kaggle/working/results/img/25_ohwx tanglaoya/1 (1)_resized.png
[2024-04-11 07:13:14,425] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 1115) of binary: /opt/conda/bin/python
Traceback (most recent call last):
File "/opt/conda/bin/accelerate", line 8, in
sys.exit(main())
File "/opt/conda/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 47, in main
args.func(args)
File "/opt/conda/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1008, in launch_command
multi_gpu_launcher(args)
File "/opt/conda/lib/python3.10/site-packages/accelerate/commands/launch.py", line 666, in multi_gpu_launcher
distrib_run.run(args)
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/run.py", line 797, in run
elastic_launch(
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

/kaggle/working/kohya_ss/sd-scripts/sdxl_train_network.py FAILED

I checked several threads here and found that other people have run into the same problem, so I'm really puzzled.
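
The "NaN detected in latents" error is raised by `cache_batch_latents` when the VAE output contains NaNs for a batch; the offending file is printed in the message. Below is a minimal sketch, outside kohya_ss, for checking whether that particular image encodes cleanly in float32. The VAE repo id and the preprocessing here are assumptions for illustration, not exactly what the trainer does internally.

```python
# Minimal sketch (outside kohya_ss): check whether the flagged image itself yields
# NaN latents when encoded in float32.
import numpy as np
import torch
from PIL import Image
from diffusers import AutoencoderKL

path = "/kaggle/working/results/img/25_ohwx tanglaoya/1 (1)_resized.png"

img = Image.open(path).convert("RGB")                        # fails loudly if the file is corrupted
x = torch.from_numpy(np.array(img)).float() / 127.5 - 1.0    # scale pixels to [-1, 1]
x = x.permute(2, 0, 1).unsqueeze(0)                          # HWC -> NCHW

vae = AutoencoderKL.from_pretrained("stabilityai/sdxl-vae", torch_dtype=torch.float32)
with torch.no_grad():
    latents = vae.encode(x).latent_dist.sample()

print("NaNs in latents:", torch.isnan(latents).any().item())
# False here but NaNs during fp16 training points at the half-precision VAE;
# True here (or a failure in Image.open) points at the image file itself.
```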

@attashe

attashe commented Apr 12, 2024

Same error; I launched train_network.py directly, without the GUI.

@attashe

attashe commented Apr 12, 2024

> Same error; I launched train_network.py directly, without the GUI.

It was a stupid mistake: I forgot to change the script name to the sdxl_ version.
