-
SDP is only available on torch 2.0, and you have torch 1.13. As a result, you can see from your own system info that you're not using SDP; you're using Doggettx.
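If you want to verify this yourself, here's a minimal check using only standard torch APIs; SDP means torch.nn.functional.scaled_dot_product_attention, which only exists from torch 2.0 onward:

```python
# Minimal check: SDP (scaled_dot_product_attention) only exists in torch >= 2.0.
import torch

print("torch:", torch.__version__)
if hasattr(torch.nn.functional, "scaled_dot_product_attention"):
    print("SDP available: --opt-sdp-attention can actually take effect")
else:
    print("SDP unavailable: the flag can't do anything on this torch build")
```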
-
Hi, I'd love to adopt this repo, but despite all my efforts I can't match the speed of Auto1111's original repo. I updated Auto1111 some time ago to the latest torch + xformers, and I get around 5.2 it/s with xformers (which, on my RTX 2060 6 GB VRAM card, performs better than SDP). So I copied my xformers setup over to Vlad's repo, activated xformers in the settings, and restarted; Vlad's System Info confirms it.
All seems good. I compared the Auto1111 and Vlad1111 settings one by one, and they are identical. Despite this, I barely get 4 it/s with Vlad's (vs 5.2 it/s with Auto). I tried SDP in Vlad's, since it was recommended in several threads, but for me it's no better there, just as in Auto1111. I'm a bit out of options, as everything seems to be configured the same in both repos. Would you have any suggestions, please?
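For reference, this is the sanity check I can run from the venv each webui uses, to confirm xformers actually imports rather than being silently unavailable (xformers.ops is where memory_efficient_attention lives):

```python
# Verify xformers can be imported by the webui's Python environment.
try:
    import xformers
    import xformers.ops  # provides memory_efficient_attention
    print("xformers:", xformers.__version__)
except ImportError as e:
    print("xformers unavailable:", e)
```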
-
I decided to approach this from a different angle by doing a bunch of generations and noting the "Time taken". Both used fresh pulls: AUTOMATIC1111/stable-diffusion-webui@22bcc7b and da35bfb respectively. Generation settings were DPM++ 2M Karras, 512x512, 28 steps, both using torch 2.1.0.dev20230424+cu118 with --opt-sdp-attention, live previews disabled, and the same model and VAE in both instances. The only thing that noticeably altered the time taken was when the token length went above 75, and even then the difference was within the margin of error for both.
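If anyone wants to reproduce this, a rough timing sketch against the webui API takes the stopwatch out of the equation. This assumes the webui was launched with --api on the default http://127.0.0.1:7860, and the prompt is just a placeholder:

```python
# Time a few identical txt2img calls and average them, so warmup noise
# doesn't dominate a single measurement.
import time
import requests

URL = "http://127.0.0.1:7860/sdapi/v1/txt2img"  # assumes --api and default port
payload = {
    "prompt": "a photo of an astronaut riding a horse",  # placeholder prompt
    "steps": 28,
    "width": 512,
    "height": 512,
    "sampler_name": "DPM++ 2M Karras",
}

times = []
for _ in range(5):
    t0 = time.perf_counter()
    requests.post(URL, json=payload).raise_for_status()
    times.append(time.perf_counter() - t0)

print(f"mean over {len(times)} runs: {sum(times) / len(times):.2f}s")
```

Running the same script against both repos with an identical payload makes the numbers directly comparable.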
-
I tried this fork because I thought it used the new TensorRT thing that Nvidia put out, but it turns out it runs slower, not faster, than automatic1111 main. I use --opt-sdp-attention instead of xformers because it's easier and the performance is about the same, and it appears to work in both repos. I'm still trying to get anywhere close to the 40 it/s that some people are getting with their 4090s, with no luck so far. Here are my benchmarks (with a quick environment check after them):
This branch:
2023-04-20 08:37:12.965865
7.67 / 11.89 / 15.19
updated:2023-04-20 hash:f2c3978a url:https://github.com/anapnoe/stable-diffusion-webui-ux.git/tree/master
arch:AMD64 cpu:AMD64 Family 25 Model 33 Stepping 0, AuthenticAMD system:Windows release:Windows-10-10.0.22621-SP0 python:3.10.7
torch:1.13.1+cu117 autocast half xformers:unavailable accelerate:0.12.0 transformers:4.25.1
device:NVIDIA GeForce RTX 4090 (1) (compute_37) (8, 9) cuda:11.7 cudnn:8500 24GB
Doggettx none
v1-5-pruned.safetensors [5929c1736f]
automatic1111 main:
2023-04-20 09:11:57.833105
10.06 / 15.73 / 17.67
updated:2023-03-29 hash:22bcc7be url:https://github.com/AUTOMATIC1111/stable-diffusion-webui/tree/master
arch:AMD64 cpu:AMD64 Family 25 Model 33 Stepping 0, AuthenticAMD system:Windows release:Windows-10-10.0.22621-SP0 python:3.10.7
torch:1.13.1+cu117 autocast half xformers:unavailable accelerate:0.12.0 transformers:4.25.1
device:NVIDIA GeForce RTX 4090 (1) (compute_37) (8, 9) cuda:11.7 cudnn:8800 24GB
Doggettx none
v1-5-pruned-emaonly.safetensors [6ce0161689]
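Quick environment check, using only standard torch APIs. I'll note that both of my system-info blocks above report torch 1.13.1 and "Doggettx none", which would explain why --opt-sdp-attention isn't actually kicking in:

```python
# Print the runtime facts that decide which attention optimization can run.
import torch

print("torch:", torch.__version__)
print("cuda:", torch.version.cuda)
if torch.cuda.is_available():
    print("cudnn:", torch.backends.cudnn.version())
    print("device:", torch.cuda.get_device_name(0))
    print("capability:", torch.cuda.get_device_capability(0))  # a 4090 reports (8, 9)
print("sdp:", hasattr(torch.nn.functional, "scaled_dot_product_attention"))
```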