
After training, there are two pytorch_model files. Which one should I use? #31

Open · koking0 opened this issue Jul 15, 2024 · 5 comments

koking0 commented Jul 15, 2024

We found two pytorch_model files in the training checkpoint, pytorch_model_1.bin and pytorch_model.bin. Which one should I use?

$ ll time_1720783819_seed10/checkpoint_model_087000/
total 21G
drwxr-xr-x 2 root root 4.0K Jul 15 10:10 ./
drwxr-xr-x 3 root root 4.0K Jul 15 10:10 ../
-rw-r--r-- 1 root root 6.5G Jul 15 10:10 optimizer_1.bin
-rw-r--r-- 1 root root 6.5G Jul 15 10:10 optimizer.bin
-rw-r--r-- 1 root root 4.9G Jul 15 10:10 pytorch_model_1.bin
-rw-r--r-- 1 root root 3.3G Jul 15 10:10 pytorch_model.bin
-rw-r--r-- 1 root root  16K Jul 15 10:10 random_states_0.pkl
-rw-r--r-- 1 root root  988 Jul 15 10:10 scaler.pt
-rw-r--r-- 1 root root 1008 Jul 15 10:10 scheduler_1.bin
-rw-r--r-- 1 root root 1000 Jul 15 10:10 scheduler.bin

We observed that the sizes of these two files differ significantly from the official dmd2_sdxl_1step_unet.bin (10.3 GB), which might indicate an issue.

We tried inference with each of these pytorch_model files but encountered different errors.

Using pytorch_model_1.bin for inference:

ckpt_path = "CHECKPOINT/checkpoint_model_087000/pytorch_model_1.bin"
unet.load_state_dict(torch.load(ckpt_path, map_location="cuda"))
pipe = StableDiffusionXLPipeline.from_pretrained(base_model_id, unet=unet).to("cuda")
pipe.scheduler = LCMScheduler.from_config(pipe.scheduler.config)
dmd2_step1_baseline(pipe)

Error:

% CUDA_VISIBLE_DEVICES=1 HF_ENDPOINT=https://hf-mirror.com python eval1024_en_train.py
/ssd2/anaconda3/envs/dmd2/lib/python3.8/site-packages/diffusers/configuration_utils.py:245: FutureWarning: It is deprecated to pass a pretrained model name or path to `from_config`. If you were trying to load a model, please use <class 'diffusers.models.unets.unet_2d_condition.UNet2DConditionModel'>.load_config(...) followed by <class 'diffusers.models.unets.unet_2d_condition.UNet2DConditionModel'>.from_config(...) instead. Otherwise, please make sure to pass a configuration dictionary instead. This functionality will be removed in v1.0.0.
  deprecate("config-passed-as-path", "1.0.0", deprecation_message, standard_warn=False)
Traceback (most recent call last):
  File "eval1024_en_train.py", line 106, in <module>
    unet.load_state_dict(torch.load(ckpt_path, map_location="cuda"))
  File "/ssd2/anaconda3/envs/dmd2/lib/python3.8/site-packages/torch/serialization.py", line 809, in load
    return _load(opened_zipfile, map_location, pickle_module, **pickle_load_args)
  File "/ssd2/anaconda3/envs/dmd2/lib/python3.8/site-packages/torch/serialization.py", line 1172, in _load
    result = unpickler.load()
  File "/ssd2/anaconda3/envs/dmd2/lib/python3.8/site-packages/torch/_utils.py", line 172, in _rebuild_tensor_v2
    set_tensor_metadata(tensor, metadata)
  File "/ssd2/anaconda3/envs/dmd2/lib/python3.8/site-packages/torch/_utils.py", line 163, in set_tensor_metadata
    torch._C._set_tensor_metadata(tensor, metadata)  # type: ignore[attr-defined]
RuntimeError: Unexpected key `base_strides_/1/` passed to setTensorMetadata.

Using pytorch_model.bin for inference:

ckpt_path = "CHECKPOINT/checkpoint_model_087000/pytorch_model.bin"
unet.load_state_dict(torch.load(ckpt_path, map_location="cuda"))
pipe = StableDiffusionXLPipeline.from_pretrained(base_model_id, unet=unet).to("cuda")
pipe.scheduler = LCMScheduler.from_config(pipe.scheduler.config)
dmd2_step4_baseline(pipe)

Error:

% CUDA_VISIBLE_DEVICES=1 HF_ENDPOINT=https://hf-mirror.com python eval1024_en_train.py
/ssd2/anaconda3/envs/dmd2/lib/python3.8/site-packages/diffusers/configuration_utils.py:245: FutureWarning: It is deprecated to pass a pretrained model name or path to `from_config`. If you were trying to load a model, please use <class 'diffusers.models.unets.unet_2d_condition.UNet2DConditionModel'>.load_config(...) followed by <class 'diffusers.models.unets.unet_2d_condition.UNet2DConditionModel'>.from_config(...) instead. Otherwise, please make sure to pass a configuration dictionary instead. This functionality will be removed in v1.0.0.
  deprecate("config-passed-as-path", "1.0.0", deprecation_message, standard_warn=False)
Traceback (most recent call last):
  File "eval1024_en_train.py", line 112, in <module>
    unet.load_state_dict(torch.load(ckpt_path, map_location="cuda"))
  File "/ssd2/anaconda3/envs/dmd2/lib/python3.8/site-packages/torch/serialization.py", line 809, in load
    return _load(opened_zipfile, map_location, pickle_module, **pickle_load_args)
  File "/ssd2/anaconda3/envs/dmd2/lib/python3.8/site-packages/torch/serialization.py", line 1172, in _load
    result = unpickler.load()
  File "/ssd2/anaconda3/envs/dmd2/lib/python3.8/site-packages/torch/_utils.py", line 172, in _rebuild_tensor_v2
    set_tensor_metadata(tensor, metadata)
  File "/ssd2/anaconda3/envs/dmd2/lib/python3.8/site-packages/torch/_utils.py", line 163, in set_tensor_metadata
    torch._C._set_tensor_metadata(tensor, metadata)  # type: ignore[attr-defined]
RuntimeError: Unexpected key `storage_sizes_/320/4/3/3/` passed to setTensorMetadata.

tianweiy (Owner) commented Jul 15, 2024

We should use pytorch_model.bin; this one stores the generator's parameters. pytorch_model_1.bin stores the guidance model's parameters, specifically the real and fake UNets.

In the error message, it seems torch.load itself doesn't work, which is weird. Could you check whether your torch version is the same as the one we used? I am also wondering whether changing to map_location="cpu" makes any difference.
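
Something like this should work for loading the generator checkpoint (a rough sketch, not your exact eval script; base_model_id and the full-precision default are assumptions, adjust to your setup):

import torch
from diffusers import StableDiffusionXLPipeline, UNet2DConditionModel, LCMScheduler

base_model_id = "stabilityai/stable-diffusion-xl-base-1.0"  # assumption: SDXL base
ckpt_path = "CHECKPOINT/checkpoint_model_087000/pytorch_model.bin"  # generator weights

# Build the UNet with the base architecture, then load the distilled generator weights into it.
unet = UNet2DConditionModel.from_config(
    UNet2DConditionModel.load_config(base_model_id, subfolder="unet")
)

# Load the state dict on CPU first, then let the pipeline move everything to the GPU.
state_dict = torch.load(ckpt_path, map_location="cpu")
print(len(state_dict), list(state_dict)[:3])  # quick sanity check of the parameter keys
unet.load_state_dict(state_dict)

pipe = StableDiffusionXLPipeline.from_pretrained(base_model_id, unet=unet).to("cuda")
pipe.scheduler = LCMScheduler.from_config(pipe.scheduler.config)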

Thanks

koking0 (Author) commented Jul 17, 2024

> We should use pytorch_model.bin; this one stores the generator's parameters. pytorch_model_1.bin stores the guidance model's parameters, specifically the real and fake UNets.
>
> In the error message, it seems torch.load itself doesn't work, which is weird. Could you check whether your torch version is the same as the one we used? I am also wondering whether changing to map_location="cpu" makes any difference.
>
> Thanks

Thank you, setting map_location="cpu" works, but the inference seems to be very slow.
I have another question. I modified your script to train SD 1.5 and swapped in our own dataset. After training, what should the inference step count for the resulting pytorch_model.bin be set to? I don't see a relevant parameter among the training arguments:

torchrun --nnodes 1 --nproc_per_node=16 main/train_sd.py \
    --generator_lr 1e-5  \
    --guidance_lr 1e-5 \
    --train_iters 30000 \
    --output_path $CHECKPOINT_PATH \
    --batch_size 44 \
    --grid_size 2 \
    --initialie_generator --log_iters 1000 \
    --resolution 512 \
    --latent_resolution 64 \
    --seed 10 \
    --real_guidance_scale 1.75 \
    --fake_guidance_scale 1.0 \
    --max_grad_norm 10.0 \
    --model_id "/root/workspace/env_run/sd1.5" \
    --train_prompt_path /root/workspace/env_run/dmd2/prompts/shuffled.txt \
    --afs_data_path="/root/workspace/env/2kw_merge_result/" \
    --afs_part_list="/root/paddlejob/workspace/env/2kw_part_count/part-00000" \
    --log_path /root/env_run/dmd2/tensorboard_log_sd1.5 \
    --wandb_iters 50 \
    --use_fp16 \
    --log_loss \
    --dfake_gen_update_ratio 10 \
    --gradient_checkpointing

tianweiy (Owner) commented

> Thank you, setting map_location="cpu" works, but the inference seems to be very slow.

I don't get it. map_location="cpu" only specifies the device used while loading the checkpoint;

pipe = StableDiffusionXLPipeline.from_pretrained(base_model_id, unet=unet).to("cuda")

should still run on CUDA.
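
If it still feels slow, it is worth double-checking where the weights actually ended up and which dtype they use, e.g. (sketch; casting to fp16 is an assumption, skip it if you need fp32):

import torch

# map_location only affects torch.load; the pipeline device is set by .to("cuda").
print(next(pipe.unet.parameters()).device)  # expect cuda:0
print(next(pipe.unet.parameters()).dtype)   # fp32 here is noticeably slower than fp16

# Optionally run the whole pipeline in half precision on the GPU.
pipe.to("cuda", torch.float16)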

> I have another question. I modified your script to train SD 1.5 and swapped in our own dataset. After training, what should the inference step count for the resulting pytorch_model.bin be set to? I don't see a relevant parameter among the training arguments.

What step are you referring to?

tianweiy (Owner) commented

And by the way, if you want good images, you need to use a larger guidance scale.

koking0 (Author) commented Jul 17, 2024

> What step are you referring to?

For example, in your README, when loading the "dmd2_sdxl_1step_unet_fp16.bin" model, num_inference_steps is set to 1 for pipeline inference, and when loading the "dmd2_sdxl_4step_unet_fp16.safetensors" model, num_inference_steps is set to 4.

So, when loading "pytorch_model.bin", what should num_inference_steps be set to?
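
For context, the README-style calls look roughly like this (pipe as set up earlier, prompt being any text prompt; guidance_scale=0 and the timesteps values are taken from the SDXL examples as I remember them and are only an assumption for a custom-trained SD 1.5 model):

# 1-step checkpoint: single denoising step.
image = pipe(prompt=prompt, num_inference_steps=1, guidance_scale=0, timesteps=[399]).images[0]

# 4-step checkpoint: four denoising steps.
image = pipe(prompt=prompt, num_inference_steps=4, guidance_scale=0, timesteps=[999, 749, 499, 249]).images[0]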
