Achieving llama.cpp Speed with Lower VRAM Requirements using llama-cpp-python #53

Open
Rio77Shiina opened this issue Aug 2, 2024 · 0 comments

Currently, running with llama.cpp works well, but keeping the GGUF model resident in VRAM alongside the SD model is a challenge: with only 16 GB of VRAM, I can only run SDXL at a batch size of 1.
I would like to use llama-cpp-python so that ComfyUI can manage VRAM allocation between the LLM and the SD model. The envisioned workflow is as follows:
1. Load GGUF model with llama-cpp-python: load the model using the Python bindings.
2. Omost chat: perform inference and text generation with the loaded model.
3. GGUF model CPU offload: unload the GGUF model from VRAM to CPU memory.
4. Load SD model: load the Stable Diffusion model into the now-available VRAM.
This approach should enable running larger models and higher batch sizes in VRAM-constrained environments by leveraging the dynamic loading and unloading capabilities of llama-cpp-python and ComfyUI. A rough sketch of the idea is shown below.
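
The following is a minimal sketch of the load → chat → free → load-SD cycle, assuming llama-cpp-python's `Llama` API. The model paths and the `load_sd_pipeline()` hook are placeholders for whatever the ComfyUI/Omost nodes actually use; also note that, as far as I know, llama-cpp-python cannot move an already-loaded model between devices, so "CPU offload" here is approximated by freeing the GPU copy.

```python
# Sketch of the proposed workflow; paths and the SD-loading hook are placeholders.
import gc
from llama_cpp import Llama

GGUF_PATH = "models/omost-llama-3-8b-Q4_K_M.gguf"  # placeholder path

# 1. Load the GGUF model, offloading all layers to the GPU.
llm = Llama(model_path=GGUF_PATH, n_gpu_layers=-1, n_ctx=4096)

# 2. Omost-style chat / text generation with the loaded model.
result = llm.create_chat_completion(
    messages=[{"role": "user", "content": "generate an image of a cat on a sofa"}],
    max_tokens=512,
)
prompt_text = result["choices"][0]["message"]["content"]

# 3. Release the GGUF model's VRAM. llama-cpp-python does not support moving a
#    loaded model between devices, so "CPU offload" here means freeing the GPU
#    copy; it could be re-created later with n_gpu_layers=0 to run on CPU only.
del llm
gc.collect()

# 4. With the VRAM released, load the SD/SDXL pipeline (placeholder hook that
#    would be handled by ComfyUI's own model management in practice).
# sd_pipeline = load_sd_pipeline("models/sdxl_base_1.0.safetensors")
```

Whether ComfyUI's model management can drive steps 3 and 4 automatically is exactly what this issue is asking about.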
