Takes an excessive amount of time #13

Open
AndrewToomey opened this issue Sep 7, 2024 · 4 comments
AndrewToomey commented Sep 7, 2024

This is taking about 2 hours with the smallest model.

I presume the issue is that my GPU cannot load a t5_XXL model into memory. According to the Hugging Face page, the model weights are 44.5 GB.

Is there any possibility of switching it out for a GGUF of t5_XXL, or at least quantizing it (just the encoder) with bitsandbytes?
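
Something like this is what I had in mind: a rough, untested sketch assuming the text encoder is loaded through Hugging Face transformers (the checkpoint name here is just the public t5-v1_1-xxl, substitute whatever this repo actually uses):

```python
# Rough sketch (untested against this repo): load only the T5 encoder in 8-bit
# via bitsandbytes instead of pulling the full-precision weights into memory.
import torch
from transformers import BitsAndBytesConfig, T5EncoderModel, T5Tokenizer

bnb_config = BitsAndBytesConfig(load_in_8bit=True)

tokenizer = T5Tokenizer.from_pretrained("google/t5-v1_1-xxl")
encoder = T5EncoderModel.from_pretrained(
    "google/t5-v1_1-xxl",            # assumption: swap in the repo's actual checkpoint
    quantization_config=bnb_config,
    device_map="auto",
)

tokens = tokenizer("a test prompt", return_tensors="pt").to(encoder.device)
with torch.no_grad():
    embeddings = encoder(**tokens).last_hidden_state  # what the rest of the pipeline would consume
```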

@eftSharptooth

It appears that the model is loaded in bf16 instead of the fp32 it was created in. Usually that means the model is first loaded into system RAM, cast down to the lower precision, and then moved into VRAM. With a model this large, that obviously requires a LOT of system RAM. I looked it up: t5_xxl needs a little over 20.5 GB of VRAM when run in bf16 (down from the ~44 GB if you run it in fp32). The system RAM is freed again once the converted copy is in VRAM.

This isn't my repo and I can't test it right now (I'm captioning music for training, so my GPU is in use), but it may be possible to either swap the t5 model for xl or another smaller model, load xxl in int8 or int4, or save a bf16 copy of xxl once and use that permanently, as sketched below.
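
Untested sketch of the "save a bf16 copy once" idea, again assuming the encoder comes from Hugging Face transformers and that the checkpoint name matches whatever the repo actually pulls:

```python
# Untested sketch: do the fp32 -> bf16 conversion once, save it to disk, and
# load the bf16 copy directly afterwards so later runs skip the expensive cast.
import torch
from transformers import T5EncoderModel

# One-time conversion (this step still needs enough system RAM for the cast).
encoder = T5EncoderModel.from_pretrained(
    "google/t5-v1_1-xxl",            # assumption: use the repo's actual checkpoint here
    torch_dtype=torch.bfloat16,
)
encoder.save_pretrained("t5-xxl-encoder-bf16")

# Later runs: load the smaller bf16 copy straight onto the GPU.
encoder = T5EncoderModel.from_pretrained(
    "t5-xxl-encoder-bf16", torch_dtype=torch.bfloat16
).to("cuda")
```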

Easiest thing to start with: how much system RAM and VRAM do you have? I would assume the hang-up is either that free system RAM is too low for the conversion, forcing it into swap (SLOW), or that there isn't enough VRAM and you have newer NVIDIA drivers, which page parts of the model in and out of system RAM to prevent an out-of-memory error (ALSO SLOW).

@mejikomtv

I just tested it.

@eftSharptooth

This is on the most current pull of the repo: inference should fit entirely within a 24 GB VRAM card, even on Windows with the UI and everything else loaded.
@AndrewToomey Were you talking about training the model or just generating samples?
@mejikomtv Could you post whether you were testing inference (generating samples) or training, and how much GPU VRAM you have?

@eftSharptooth

OK, so on a 24 GB card (a 3090) a single inference currently takes roughly 20 seconds. The easiest way to test is to copy the config\example.txt file, remove all but one line, then sample with this new file supplied as the prompt; it stays just under the VRAM limit of a 24 GB card. Interestingly, it takes roughly 2 minutes to generate two samples if you leave two prompts in the copied file.
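
If it helps, a small helper to trim the prompt file instead of editing it by hand (the output filename is just something I picked):

```python
# Quick helper: build a one-prompt test file from the bundled example prompts.
from pathlib import Path

src = Path("config") / "example.txt"        # prompt file shipped with the repo
dst = Path("config") / "single_prompt.txt"  # arbitrary name for the trimmed copy

first_prompt = src.read_text(encoding="utf-8").splitlines()[0]
dst.write_text(first_prompt + "\n", encoding="utf-8")
# Then supply config/single_prompt.txt as the prompt file when sampling.
```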
