
After training, there are two pytorch_model files. Which one should I use? #31

Open · koking0 opened this issue Jul 15, 2024 · 5 comments

koking0 commented Jul 15, 2024

We found two pytorch_model files in the training checkpoint, pytorch_model_1.bin and pytorch_model.bin. Which one should I use?

$ ll time_1720783819_seed10/checkpoint_model_087000/
total 21G
drwxr-xr-x 2 root root 4.0K Jul 15 10:10 ./
drwxr-xr-x 3 root root 4.0K Jul 15 10:10 ../
-rw-r--r-- 1 root root 6.5G Jul 15 10:10 optimizer_1.bin
-rw-r--r-- 1 root root 6.5G Jul 15 10:10 optimizer.bin
-rw-r--r-- 1 root root 4.9G Jul 15 10:10 pytorch_model_1.bin
-rw-r--r-- 1 root root 3.3G Jul 15 10:10 pytorch_model.bin
-rw-r--r-- 1 root root  16K Jul 15 10:10 random_states_0.pkl
-rw-r--r-- 1 root root  988 Jul 15 10:10 scaler.pt
-rw-r--r-- 1 root root 1008 Jul 15 10:10 scheduler_1.bin
-rw-r--r-- 1 root root 1000 Jul 15 10:10 scheduler.bin

We observed that the sizes of these two files differ significantly from the official dmd2_sdxl_1step_unet.bin (10.3 GB), which might indicate an issue.

We tried inference with each of these pytorch_model files but encountered different errors.

Using pytorch_model_1.bin for inference:

ckpt_path = "CHECKPOINT/checkpoint_model_087000/pytorch_model_1.bin"
unet.load_state_dict(torch.load(ckpt_path, map_location="cuda"))
pipe = StableDiffusionXLPipeline.from_pretrained(base_model_id, unet=unet).to("cuda")
pipe.scheduler = LCMScheduler.from_config(pipe.scheduler.config)
dmd2_step1_baseline(pipe)

Error:

% CUDA_VISIBLE_DEVICES=1 HF_ENDPOINT=https://hf-mirror.com python eval1024_en_train.py
/ssd2/anaconda3/envs/dmd2/lib/python3.8/site-packages/diffusers/configuration_utils.py:245: FutureWarning: It is deprecated to pass a pretrained model name or path to `from_config`. If you were trying to load a model, please use <class 'diffusers.models.unets.unet_2d_condition.UNet2DConditionModel'>.load_config(...) followed by <class 'diffusers.models.unets.unet_2d_condition.UNet2DConditionModel'>.from_config(...) instead. Otherwise, please make sure to pass a configuration dictionary instead. This functionality will be removed in v1.0.0.
  deprecate("config-passed-as-path", "1.0.0", deprecation_message, standard_warn=False)
Traceback (most recent call last):
  File "eval1024_en_train.py", line 106, in <module>
    unet.load_state_dict(torch.load(ckpt_path, map_location="cuda"))
  File "/ssd2/anaconda3/envs/dmd2/lib/python3.8/site-packages/torch/serialization.py", line 809, in load
    return _load(opened_zipfile, map_location, pickle_module, **pickle_load_args)
  File "/ssd2/anaconda3/envs/dmd2/lib/python3.8/site-packages/torch/serialization.py", line 1172, in _load
    result = unpickler.load()
  File "/ssd2/anaconda3/envs/dmd2/lib/python3.8/site-packages/torch/_utils.py", line 172, in _rebuild_tensor_v2
    set_tensor_metadata(tensor, metadata)
  File "/ssd2/anaconda3/envs/dmd2/lib/python3.8/site-packages/torch/_utils.py", line 163, in set_tensor_metadata
    torch._C._set_tensor_metadata(tensor, metadata)  # type: ignore[attr-defined]
RuntimeError: Unexpected key `base_strides_/1/` passed to setTensorMetadata.

Using pytorch_model.bin for inference:

ckpt_path = "CHECKPOINT/checkpoint_model_087000/pytorch_model.bin"
unet.load_state_dict(torch.load(ckpt_path, map_location="cuda"))
pipe = StableDiffusionXLPipeline.from_pretrained(base_model_id, unet=unet).to("cuda")
pipe.scheduler = LCMScheduler.from_config(pipe.scheduler.config)
dmd2_step4_baseline(pipe)

Error:

% CUDA_VISIBLE_DEVICES=1 HF_ENDPOINT=https://hf-mirror.com python eval1024_en_train.py
/ssd2/anaconda3/envs/dmd2/lib/python3.8/site-packages/diffusers/configuration_utils.py:245: FutureWarning: It is deprecated to pass a pretrained model name or path to `from_config`. If you were trying to load a model, please use <class 'diffusers.models.unets.unet_2d_condition.UNet2DConditionModel'>.load_config(...) followed by <class 'diffusers.models.unets.unet_2d_condition.UNet2DConditionModel'>.from_config(...) instead. Otherwise, please make sure to pass a configuration dictionary instead. This functionality will be removed in v1.0.0.
  deprecate("config-passed-as-path", "1.0.0", deprecation_message, standard_warn=False)
Traceback (most recent call last):
  File "eval1024_en_train.py", line 112, in <module>
    unet.load_state_dict(torch.load(ckpt_path, map_location="cuda"))
  File "/ssd2/anaconda3/envs/dmd2/lib/python3.8/site-packages/torch/serialization.py", line 809, in load
    return _load(opened_zipfile, map_location, pickle_module, **pickle_load_args)
  File "/ssd2/anaconda3/envs/dmd2/lib/python3.8/site-packages/torch/serialization.py", line 1172, in _load
    result = unpickler.load()
  File "/ssd2/anaconda3/envs/dmd2/lib/python3.8/site-packages/torch/_utils.py", line 172, in _rebuild_tensor_v2
    set_tensor_metadata(tensor, metadata)
  File "/ssd2/anaconda3/envs/dmd2/lib/python3.8/site-packages/torch/_utils.py", line 163, in set_tensor_metadata
    torch._C._set_tensor_metadata(tensor, metadata)  # type: ignore[attr-defined]
RuntimeError: Unexpected key `storage_sizes_/320/4/3/3/` passed to setTensorMetadata.

tianweiy (Owner) commented Jul 15, 2024

We should use pytorch_model.bin; this one stores the generator's parameters. pytorch_model_1.bin stores the guidance model's parameters, specifically the real and fake UNets.

In the error message, it seems torch.load itself doesn't work, which is weird. Could you check whether your torch version is the same as the one we used? I am also wondering whether changing to map_location="cpu" makes any difference.
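
Something like this should work for loading the generator checkpoint (a rough sketch, not your exact eval script; base_model_id and the full-precision default are assumptions, adjust to your setup):

import torch
from diffusers import StableDiffusionXLPipeline, UNet2DConditionModel, LCMScheduler

base_model_id = "stabilityai/stable-diffusion-xl-base-1.0"  # assumption: SDXL base
ckpt_path = "CHECKPOINT/checkpoint_model_087000/pytorch_model.bin"  # generator weights

# Build the UNet with the base architecture, then load the distilled generator weights into it.
unet = UNet2DConditionModel.from_config(
    UNet2DConditionModel.load_config(base_model_id, subfolder="unet")
)

# Load the state dict on CPU first, then let the pipeline move everything to the GPU.
state_dict = torch.load(ckpt_path, map_location="cpu")
print(len(state_dict), list(state_dict)[:3])  # quick sanity check of the parameter keys
unet.load_state_dict(state_dict)

pipe = StableDiffusionXLPipeline.from_pretrained(base_model_id, unet=unet).to("cuda")
pipe.scheduler = LCMScheduler.from_config(pipe.scheduler.config)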

Thanks

koking0 (Author) commented Jul 17, 2024

> We should use pytorch_model.bin; this one stores the generator's parameters. pytorch_model_1.bin stores the guidance model's parameters, specifically the real and fake UNets.
>
> In the error message, it seems torch.load itself doesn't work, which is weird. Could you check whether your torch version is the same as the one we used? I am also wondering whether changing to map_location="cpu" makes any difference.
>
> Thanks

Thank you, setting map_location="cpu" works, but the inference seems to be very slow.
I have another question. I modified your script to train SD 1.5 and swapped in our own dataset. After training, what should the inference step count for the resulting pytorch_model.bin be set to? I don't see a relevant parameter among the training arguments:

torchrun --nnodes 1 --nproc_per_node=16 main/train_sd.py \
    --generator_lr 1e-5  \
    --guidance_lr 1e-5 \
    --train_iters 30000 \
    --output_path $CHECKPOINT_PATH \
    --batch_size 44 \
    --grid_size 2 \
    --initialie_generator --log_iters 1000 \
    --resolution 512 \
    --latent_resolution 64 \
    --seed 10 \
    --real_guidance_scale 1.75 \
    --fake_guidance_scale 1.0 \
    --max_grad_norm 10.0 \
    --model_id "/root/workspace/env_run/sd1.5" \
    --train_prompt_path /root/workspace/env_run/dmd2/prompts/shuffled.txt \
    --afs_data_path="/root/workspace/env/2kw_merge_result/" \
    --afs_part_list="/root/paddlejob/workspace/env/2kw_part_count/part-00000" \
    --log_path /root/env_run/dmd2/tensorboard_log_sd1.5 \
    --wandb_iters 50 \
    --use_fp16 \
    --log_loss \
    --dfake_gen_update_ratio 10 \
    --gradient_checkpointing

tianweiy (Owner) commented

> Thank you, setting map_location="cpu" works, but the inference seems to be very slow.

I don't get it. map_location="cpu" only specifies the device used while loading the checkpoint;

pipe = StableDiffusionXLPipeline.from_pretrained(base_model_id, unet=unet).to("cuda")

should still run on CUDA.
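
If it still feels slow, it is worth double-checking where the weights actually ended up and which dtype they use, e.g. (sketch; casting to fp16 is an assumption, skip it if you need fp32):

import torch

# map_location only affects torch.load; the pipeline device is set by .to("cuda").
print(next(pipe.unet.parameters()).device)  # expect cuda:0
print(next(pipe.unet.parameters()).dtype)   # fp32 here is noticeably slower than fp16

# Optionally run the whole pipeline in half precision on the GPU.
pipe.to("cuda", torch.float16)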

> I have another question. I modified your script to train SD 1.5 and swapped in our own dataset. After training, what should the inference step count for the resulting pytorch_model.bin be set to? I don't see a relevant parameter among the training arguments.

What step are you referring to?

tianweiy (Owner) commented

And by the way, if you want good images, you need to use a larger guidance scale.

koking0 (Author) commented Jul 17, 2024

> What step are you referring to?

For example, in your README, when loading the "dmd2_sdxl_1step_unet_fp16.bin" model, num_inference_steps is set to 1 for pipeline inference, and when loading the "dmd2_sdxl_4step_unet_fp16.safetensors" model, num_inference_steps is set to 4.

So, when loading "pytorch_model.bin", what should num_inference_steps be set to?
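
For context, the README-style calls look roughly like this (pipe as set up earlier, prompt being any text prompt; guidance_scale=0 and the timesteps values are taken from the SDXL examples as I remember them and are only an assumption for a custom-trained SD 1.5 model):

# 1-step checkpoint: single denoising step.
image = pipe(prompt=prompt, num_inference_steps=1, guidance_scale=0, timesteps=[399]).images[0]

# 4-step checkpoint: four denoising steps.
image = pipe(prompt=prompt, num_inference_steps=4, guidance_scale=0, timesteps=[999, 749, 499, 249]).images[0]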
