
Although the folder preparation button is back, the image folder still does not work #2245

Open
jjiikkkk opened this issue Apr 10, 2024 · 6 comments

@jjiikkkk

See below; I do not know what is wrong.

Traceback (most recent call last):
File "/kaggle/working/kohya_ss/sd-scripts/train_network.py", line 1115, in
trainer.train(args)
File "/kaggle/working/kohya_ss/sd-scripts/train_network.py", line 234, in train
model_version, text_encoder, vae, unet = self.load_target_model(args, weight_dtype, accelerator)
File "/kaggle/working/kohya_ss/sd-scripts/train_network.py", line 101, in load_target_model
text_encoder, vae, unet, _ = train_util.load_target_model(args, weight_dtype, accelerator)
File "/kaggle/working/kohya_ss/sd-scripts/library/train_util.py", line 4387, in load_target_model
text_encoder, vae, unet, load_stable_diffusion_format = _load_target_model(
File "/kaggle/working/kohya_ss/sd-scripts/library/train_util.py", line 4363, in _load_target_model
original_unet = UNet2DConditionModel(
File "/kaggle/working/kohya_ss/sd-scripts/library/original_unet.py", line 1427, in init
attn_num_head_channels=attention_head_dim[i],
IndexError: list index out of range
[2024-04-09 05:08:36,532] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 1114) of binary: /opt/conda/bin/python
Traceback (most recent call last):
File "/opt/conda/bin/accelerate", line 8, in
sys.exit(main())
File "/opt/conda/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 47, in main
args.func(args)
File "/opt/conda/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1008, in launch_command
multi_gpu_launcher(args)
File "/opt/conda/lib/python3.10/site-packages/accelerate/commands/launch.py", line 666, in multi_gpu_launcher
distrib_run.run(args)
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/run.py", line 797, in run
elastic_launch(
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

/kaggle/working/kohya_ss/sd-scripts/train_network.py FAILED

Failures:
<NO_OTHER_FAILURES>

Root Cause (first observed failure):
[0]:
time : 2024-04-09_05:08:36
host : c9fb5313e204
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 1114)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
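
The IndexError at `attention_head_dim[i]` in `original_unet.py` typically shows up when an SDXL checkpoint is handed to the SD1.x/SD2.x loader in `train_network.py`; the fix reported later in this thread (switching to `sdxl_train_network.py`) matches that. Below is a minimal sketch, not part of kohya_ss, for telling the two families apart from a single-file `.safetensors` checkpoint before launching; the key prefixes and the example path are assumptions based on the usual "original" checkpoint layout.

```python
# Minimal sketch (not part of kohya_ss): guess whether a single-file checkpoint is
# SDXL or SD1.x/SD2.x from its state-dict keys, so the matching training script can
# be chosen (train_network.py vs sdxl_train_network.py).
from safetensors import safe_open

def guess_model_family(path: str) -> str:
    with safe_open(path, framework="pt", device="cpu") as f:
        keys = list(f.keys())
    # SDXL checkpoints carry a second text encoder under conditioner.embedders.1.*
    if any(k.startswith("conditioner.embedders.1.") for k in keys):
        return "sdxl -> use sdxl_train_network.py"
    # SD1.x/SD2.x checkpoints keep their single CLIP encoder under cond_stage_model.*
    if any(k.startswith("cond_stage_model.") for k in keys):
        return "sd1.x/sd2.x -> use train_network.py"
    return "unknown (possibly a diffusers-format folder)"

if __name__ == "__main__":
    # Hypothetical path; point this at the checkpoint passed via --pretrained_model_name_or_path.
    print(guess_model_family("/kaggle/input/model/checkpoint.safetensors"))
```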

@jjiikkkk
Author

Seems like the same error as #2244.

@bmaltais
Owner

Not sure what causes this error… it comes from the training script, so I don’t think I can do anything about it. You might want to open an issue directly on the sd-scripts repo.

@jjiikkkk
Author

okay

@jjiikkkk
Author

File "/kaggle/working/kohya_ss/sd-scripts/sdxl_train_network.py", line 185, in
trainer.train(args)
File "/kaggle/working/kohya_ss/sd-scripts/train_network.py", line 272, in train
train_dataset_group.cache_latents(vae, args.vae_batch_size, args.cache_latents_to_disk, accelerator.is_main_process)
File "/kaggle/working/kohya_ss/sd-scripts/library/train_util.py", line 2080, in cache_latents
dataset.cache_latents(vae, vae_batch_size, cache_to_disk, is_main_process)
File "/kaggle/working/kohya_ss/sd-scripts/library/train_util.py", line 1023, in cache_latents
cache_batch_latents(vae, cache_to_disk, batch, subset.flip_aug, subset.random_crop)
File "/kaggle/working/kohya_ss/sd-scripts/library/train_util.py", line 2428, in cache_batch_latents
raise RuntimeError(f"NaN detected in latents: {info.absolute_path}")
RuntimeError: NaN detected in latents: /kaggle/working/results/img/25_ohwx tanglaoya/1 (1)_resized.png
[2024-04-11 07:13:14,425] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 1115) of binary: /opt/conda/bin/python
Traceback (most recent call last):
File "/opt/conda/bin/accelerate", line 8, in
sys.exit(main())
File "/opt/conda/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 47, in main
args.func(args)
File "/opt/conda/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1008, in launch_command
multi_gpu_launcher(args)
File "/opt/conda/lib/python3.10/site-packages/accelerate/commands/launch.py", line 666, in multi_gpu_launcher
distrib_run.run(args)
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/run.py", line 797, in run
elastic_launch(
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

/kaggle/working/kohya_ss/sd-scripts/sdxl_train_network.py FAILED

I checked several threads here and found that other people have run into the same problem, so I'm really puzzled.
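
The "NaN detected in latents" error is raised by `cache_batch_latents` when the VAE output contains NaNs for a batch; the offending file is printed in the message. Below is a minimal sketch, outside kohya_ss, for checking whether that particular image encodes cleanly in float32. The VAE repo id and the preprocessing here are assumptions for illustration, not exactly what the trainer does internally.

```python
# Minimal sketch (outside kohya_ss): check whether the flagged image itself yields
# NaN latents when encoded in float32.
import numpy as np
import torch
from PIL import Image
from diffusers import AutoencoderKL

path = "/kaggle/working/results/img/25_ohwx tanglaoya/1 (1)_resized.png"

img = Image.open(path).convert("RGB")                        # fails loudly if the file is corrupted
x = torch.from_numpy(np.array(img)).float() / 127.5 - 1.0    # scale pixels to [-1, 1]
x = x.permute(2, 0, 1).unsqueeze(0)                          # HWC -> NCHW

vae = AutoencoderKL.from_pretrained("stabilityai/sdxl-vae", torch_dtype=torch.float32)
with torch.no_grad():
    latents = vae.encode(x).latent_dist.sample()

print("NaNs in latents:", torch.isnan(latents).any().item())
# False here but NaNs during fp16 training points at the half-precision VAE;
# True here (or a failure in Image.open) points at the image file itself.
```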

@attashe

attashe commented Apr 12, 2024

Same error; I launched train_network.py directly, without the GUI.

@attashe

attashe commented Apr 12, 2024

> Same error; I launched train_network.py directly, without the GUI.

It was a stupid mistake: I forgot to change the script name to the sdxl_ version.
