Slower than automatic1111, help me configure it better #288

gruevy · 2023-04-20T15:19:59Z

gruevy
Apr 20, 2023

I tried this fork because I thought I used the new TensorRT thing that Nvidia put out but it turns out it runs slower, not faster, than automatic1111 main. I use --opt-sdp-attention instead of xformers because it's easier and the performance is about the same, and it looks like it works in both repos. I'm still trying to get anywhere close to the 40 it/s that some people are getting with their 4090s and having no luck. Here are my benchmarks:

This branch:

2023-04-20 08:37:12.965865

7.67 / 11.89 / 15.19

updated:2023-04-20 hash:f2c3978a url:https://github.com/anapnoe/stable-diffusion-webui-ux.git/tree/master

arch:AMD64 cpu:AMD64 Family 25 Model 33 Stepping 0, AuthenticAMD system:Windows release:Windows-10-10.0.22621-SP0 python:3.10.7

torch:1.13.1+cu117 autocast half xformers:unavailable accelerate:0.12.0 transformers:4.25.1

device:NVIDIA GeForce RTX 4090 (1) (compute_37) (8, 9) cuda:11.7 cudnn:8500 24GB

Doggettx none

v1-5-pruned.safetensors [5929c1736f]

automatic1111 main:

2023-04-20 09:11:57.833105

10.06 / 15.73 / 17.67

updated:2023-03-29 hash:22bcc7be url:https://github.com/AUTOMATIC1111/stable-diffusion-webui/tree/master

arch:AMD64 cpu:AMD64 Family 25 Model 33 Stepping 0, AuthenticAMD system:Windows release:Windows-10-10.0.22621-SP0 python:3.10.7

torch:1.13.1+cu117 autocast half xformers:unavailable accelerate:0.12.0 transformers:4.25.1

device:NVIDIA GeForce RTX 4090 (1) (compute_37) (8, 9) cuda:11.7 cudnn:8800 24GB

Doggettx none

v1-5-pruned-emaonly.safetensors [6ce0161689]

vladmandic · 2023-04-20T15:29:13Z

vladmandic
Apr 20, 2023
Maintainer

SDP is only available on torch 2.0 and you have torch 1.13.
uninstall torch and let installer install correct version.
same for other system libs - your accelerate is ancient - this is not what installer here installs.

as a result, you can clearly see you're not using SDP, you're using Doggettx - its printed in the output (and on the console when you start server)

10 replies

vladmandic Apr 20, 2023
Maintainer

so this is pretty much double your original performance.
you can retry using extra steps and extensive level and see how far that gets you?

gruevy Apr 20, 2023
Author

Here's a test with extra steps and extensive. I'm pretty pleased.

2023-04-20 10:58:54.010766

16.19 / 31.81 / 41.67 / 40.72 / 40.76

updated:2023-04-20 hash:752b91d3 url:https://github.com/vladmandic/automatic/tree/master

arch:AMD64 cpu:AMD64 Family 25 Model 33 Stepping 0, AuthenticAMD system:Windows release:Windows-10-10.0.22621-SP0 python:3.10.7

torch:2.0.0+cu118 Autocast half xformers:unavailable accelerate:0.18.0 transformers:4.26.1

device:NVIDIA GeForce RTX 4090 (1) (compute_37) (8, 9) cuda:11.8 cudnn:8700 24GB

sdp none

lyriel_v15.safetensors [4d91c4c217]

vladmandic Apr 20, 2023
Maintainer

now you're getting to expected levels :)

gruevy Apr 20, 2023
Author

Feels good man :D

Squeezitgirdle Apr 26, 2023

I'm also on a 4090, would you mind sharing your settings with me? Mine runs fine but I've been running into a lot of memory issues or where it starts generating then just stops and never finishes or gives any error.

I probably have something configured incorrectly.

Michoko92 · 2023-04-21T09:01:40Z

Michoko92
Apr 21, 2023

Hi,

I'd love to adopt this repo, but despite all my efforts, I can't obtain the same speed as Auto1111's original repo.

I updated Auto1111 some time ago to latest torch + xformers, and am getting around 5.2 it/s with xformers (which for me performs better than SDP on my RTX 2060 6 GB VRAM card). So I copied my xformers repo to Vlad's one, activated xformers in the settings, then restarted. Vlad's System Info tells me this:

Torch
2.0.0+cu118 Autocast  half
GPU
device: NVIDIA GeForce RTX 2060 (1) (compute_37) (7, 5)
cuda: 11.8
cudnn: 8700
Memory optimization
medvram
0.2/0.9s
Cross-attention
xformers
Libs
xformers: 0.0.17
accelerate: 0.18.0
transformers: 4.26.1

All seems good. I compared both Auto1111 and Vlad1111 settings one by one, and they are identical. And despite this, I barely get 4 it/s with Vlad's (vs 5.2 it/s with Auto).

I tried SDP in Vlad's, as I saw it was recommended in several threads, but it's not better for me, like in Auto1111's one. I'm a bit out of options, as everything seems to be configured the same in both repos. Would you have any suggestion, please?

16 replies

Michoko92 Apr 26, 2023

I installed v0.0.18, but the launch script downgrades back to v0.0.17. I checked the requirements.txt file, but can't see a mention to the xformers version. Is there a way to prevent the launcher from downgrading xformers, please?

vladmandic Apr 26, 2023
Maintainer

@Michoko92 i've provided answer in your separate thread.

Michoko92 Apr 26, 2023

Oh thank you Vlad, I'm not the creator of this thread, but I appreciate the help :)

Gazzoo-byte Apr 26, 2023

@Michoko92 #524 was a thread I made as I ran into the same issue lol, though the solution I arrived at was much less elegant - requirements.txt gets checked after xformers gets rolled back to 0.0.17, so I added xformers==0.0.18 to requirements.txt temporarily. Leads to xformers being downgraded then upgraded again so vlad's advice in the thread is vastly better

Michoko92 Apr 26, 2023

OK made a test with 0.0.18, and unfortunately, results are actually a bit worse than with 0.0.17 (with medvram)

That's frustrating, because I'm really hyped about this new repo. But for now, Auto1111 is still performing better on single image generation with VRAM optimizations. Seems I'll have to deal with it.

Gazzoo-byte · 2023-04-26T13:49:08Z

Gazzoo-byte
Apr 26, 2023

I decided to approach this from a different angle by doing a bunch of generations and noting the "Time taken"

Both used fresh pulls of AUTOMATIC1111/stable-diffusion-webui@22bcc7b and da35bfb respectively.

Settings for generation used were DPM++ 2M Karras, 512x512, 28 steps, both using torch: 2.1.0.dev20230424+cu118 with --opt-sdp-attention, live previews disabled, lastly same model and VAE in both instances

Only thing that notibly altered the time taken was when the token length went >75 - however the difference was still within margin of error for both

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Slower than automatic1111, help me configure it better #288

{{title}}

{{editor}}'s edit

{{editor}}'s edit

Replies: 3 comments 26 replies

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

Select a reply

Slower than automatic1111, help me configure it better #288

Replies: 3 comments · 26 replies

vladmandic Apr 20, 2023 Maintainer

vladmandic Apr 20, 2023 Maintainer

gruevy Apr 20, 2023 Author

vladmandic Apr 20, 2023 Maintainer

gruevy Apr 20, 2023 Author

vladmandic Apr 26, 2023 Maintainer

Replies: 3 comments 26 replies

vladmandic
Apr 20, 2023
Maintainer

vladmandic Apr 20, 2023
Maintainer

gruevy Apr 20, 2023
Author

vladmandic Apr 20, 2023
Maintainer

gruevy Apr 20, 2023
Author

vladmandic Apr 26, 2023
Maintainer