-
Did you update the repo? There was a very similar issue yesterday that was fixed.
-
I have SDP set and "disable memory attention" enabled. When I start training, I get an OOM error with batch size 5 on an RTX 3060 12 GB card. Should I not even try TI training with SDP?
-
Does it work with lower batch sizes? Embedding training with a high batch size has been buggy for a long time; I don't think it has anything to do with SDP. Plus, any kind of memory-optimized attention can create a false backward pass, so training is much slower to actually learn. For the OOM itself, see the sketch below.
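A common workaround for OOM on a 12 GB card is gradient accumulation: keep the effective batch size while lowering per-step VRAM use. This is a generic PyTorch sketch, not the webui's actual training loop; the model, optimizer, and data below are toy stand-ins.

```python
import torch
import torch.nn as nn

# Toy stand-ins for the real embedding-training objects (assumption:
# nothing here mirrors the webui's actual classes or names).
model = nn.Linear(8, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
data = [(torch.randn(1, 8), torch.randn(1, 1)) for _ in range(10)]

accum_steps = 5  # effective batch = micro-batch size * accum_steps
optimizer.zero_grad()
for step, (x, y) in enumerate(data):
    # Scale the loss so the accumulated gradients average instead of summing.
    loss = nn.functional.mse_loss(model(x), y) / accum_steps
    loss.backward()                      # gradients accumulate across micro-batches
    if (step + 1) % accum_steps == 0:
        optimizer.step()                 # one optimizer step per accum_steps batches
        optimizer.zero_grad()
```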
-
Training with cross-attention optimization enabled under torch 2.0 is just bad. It doesn't matter if it worked before; it's not recommended, and there is no way to make training work properly with cross-attention optimization in general. Optimized cross-attention shortcuts the backward pass (that's part of why it's faster), which means the loss function cannot be correctly evaluated, so you think you're learning when you're not. I asked a question; if you want to proceed with troubleshooting, let's go through that.
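If anyone wants to test that claim directly, torch 2.0 lets you force a specific SDP backend and compare its gradients against the plain "math" implementation. A minimal sketch, assuming a CUDA GPU where the memory-efficient kernel is available; the shapes are arbitrary:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

def make():
    return torch.randn(1, 8, 64, 40, device="cuda", requires_grad=True)

q, k, v = make(), make(), make()

def grads(flash, math, mem_efficient):
    # Run one forward/backward pass with only the selected SDP backend enabled.
    with torch.backends.cuda.sdp_kernel(
        enable_flash=flash, enable_math=math, enable_mem_efficient=mem_efficient
    ):
        out = F.scaled_dot_product_attention(q, k, v)
    return torch.autograd.grad(out.sum(), (q, k, v))

g_math = grads(False, True, False)   # reference "math" backend
g_mem = grads(False, False, True)    # memory-efficient backend

# If an optimized backend truly broke the backward pass, these would diverge;
# small numerical differences are expected and harmless.
for a, b in zip(g_math, g_mem):
    print((a - b).abs().max().item())
```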
-
Preparing dataset...
100%|████████████████████████████████████████████████████████████████████████████████| 334/334 [00:09<00:00, 36.39it/s]
100%|██████████████████████████████████████████████████████████████████████████████████| 20/20 [00:02<00:00, 7.64it/s]
embedding train: TypeError█████████████████████████████████████████████████████████████| 20/20 [00:02<00:00, 8.16it/s]
┌───────────────────────────────────────── Traceback (most recent call last) ─────────────────────────────────────────┐
│ D:\automatic\modules\textual_inversion\textual_inversion.py:604 in train_embedding │
│ │
│ 603 │ │ │ │ │ │ captioned_image = caption_image_overlay(image, title, footer_lef │
│ > 604 │ │ │ │ │ │ captioned_image = insert_image_data_embed(captioned_image, data) │
│ 605 │
│ │
│ D:\automatic\modules\textual_inversion\image_embedding.py:74 in insert_image_data_embed │
│ │
│ 73 │ │
│ > 74 │ h = image.size[1] │
│ 75 │ next_size = data_np_low.shape[0] + (h-(data_np_low.shape[0] % h)) │
└─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
TypeError: 'int' object is not subscriptable
Applying scaled dot product cross attention optimization
The images are 512x512. See the note below on the TypeError.
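On the TypeError itself: `image.size[1]` raises "'int' object is not subscriptable" when `image` is a numpy array rather than a PIL image, because `ndarray.size` is the total element count (an int), not PIL's `(width, height)` tuple. A minimal sketch of a guard, assuming that is the cause here; `as_pil` is a hypothetical helper, not part of the repo:

```python
import numpy as np
from PIL import Image

def as_pil(image):
    # Hypothetical helper: accept either a PIL image or an HxWxC numpy array.
    if isinstance(image, np.ndarray):
        return Image.fromarray(image.astype(np.uint8))
    return image

arr = np.zeros((512, 512, 3), dtype=np.uint8)
print(type(arr.size))   # <class 'int'> -- subscripting this raises the TypeError
img = as_pil(arr)
print(img.size[1])      # 512 -- tuple indexing works as image_embedding.py expects
```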