AMD is way behind #807
Replies: 3 comments 6 replies
-
It's not that CUDA is faster per se; it's far more complete and stable. All the "extra" code, like cross-attention memory optimizations (xformers or SDP), works perfectly on CUDA but not on ROCm, and that's where the big performance and memory-savings boosts come from.
-
First off, I have a 6700 XT with 12GB of VRAM, so I do have a slightly better card. That said, I can generate MUCH higher-res images than that, and I'm getting around 7.6 it/s at 512x512 using Euler a, so either my card is significantly faster than yours or you've got an issue with your settings/setup. Can you post screenshots of your settings? I can make some recommendations, specifically for the Stable Diffusion and compute settings screens.
-
I'll list my weird performance-quirk findings here (fresh install, settings above from iDeNoh):
token merging -> either does not work or does not impact generation speed at all
hi-res fix (latent) -> abysmal performance here
cross attention -> SDP, InvokeAI and Doggettx on the same level
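For anyone wondering what the "SDP" cross-attention option actually does: it routes attention through PyTorch's fused `scaled_dot_product_attention`, which computes the same result as naive attention but can use Flash/memory-efficient backends that avoid materializing the full score matrix. A minimal sketch (shapes are illustrative, not SD's real dimensions):

```python
# Sketch: naive attention vs PyTorch's fused SDP attention.
# Illustrative shapes only; SD's UNet uses different sizes.
import torch
import torch.nn.functional as F

q = torch.randn(1, 8, 64, 32)  # (batch, heads, tokens, head_dim)
k = torch.randn(1, 8, 64, 32)
v = torch.randn(1, 8, 64, 32)

# Naive attention: materializes the full (tokens x tokens) score matrix,
# which is what eats VRAM at high resolutions.
scores = q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5)
naive = scores.softmax(dim=-1) @ v

# Fused SDP: same math, but the backend may never materialize the
# score matrix, saving memory. On ROCm, backend support is spottier,
# which matches the performance differences reported above.
fused = F.scaled_dot_product_attention(q, k, v)

print(torch.allclose(naive, fused, atol=1e-5))
```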
-
Since SD was released I was always kind of content with the performance of my 8GB RX 6650 XT on Ubuntu.
I just installed my old 6GB GeForce GTX 1060 in my second slot to use another program that has no AMD support yet.
Now, I don't know if something in my AMD setup was always messed up, but I am absolutely blown away by the VRAM management of the Nvidia card.
I can do highres fix up to 1400x788 with a lot of VRAM to spare, while the AMD card runs out of memory at around 1100x620.
Speed of course is a bit slower on the 1060 (1.2 it/s vs 3.2 it/s at 512x512), but is ROCm really THAT MUCH worse than CUDA?
SD on AMD is basically constantly walking on eggshells to avoid running out of VRAM, while the 6-year-old Nvidia GPU is just stable and solid.
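If you want to see the headroom directly instead of guessing by trial and error, PyTorch exposes free/total device memory through the same `torch.cuda` namespace on both CUDA and ROCm builds. A hedged sketch (the helper name is mine, not part of any SD UI):

```python
# Sketch: check free/total VRAM before attempting a hi-res pass.
# torch.cuda.mem_get_info works on both CUDA and ROCm builds of
# PyTorch; on a CPU-only machine this returns None.
import torch

def vram_headroom_gb():
    """Return (free_gb, total_gb) for the current device, or None."""
    if not torch.cuda.is_available():
        return None
    free, total = torch.cuda.mem_get_info()  # values in bytes
    return free / 2**30, total / 2**30

print(vram_headroom_gb())
```

Watching the free number drop as you raise the hi-res target makes the "walking on eggshells" effect very visible, and helps compare how much scratch memory each cross-attention backend actually needs.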