Achieving llama.cpp Speed with Lower VRAM Requirements using llama-cpp-python #53

Open
Rio77Shiina opened this issue Aug 2, 2024 · 0 comments

Currently, running with llama.cpp works well, but keeping the GGUF model resident in VRAM alongside the SD model is a challenge: with only 16 GB of VRAM, I can only run SDXL at a batch size of 1.
I would like to use llama-cpp-python so that ComfyUI can manage VRAM allocation between the LLM and the SD model. The envisioned workflow is as follows:
1. Load GGUF model with llama-cpp-python: load the model using the Python bindings.
2. Omost chat: perform inference and text generation with the loaded model.
3. GGUF model CPU offload: unload the GGUF model from VRAM to CPU memory.
4. Load SD model: load the Stable Diffusion model into the now-available VRAM.
This approach should enable running larger models and higher batch sizes in VRAM-constrained environments by leveraging the dynamic loading and unloading capabilities of llama-cpp-python and ComfyUI. A rough sketch of the idea is shown below.
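
The following is a minimal sketch of the load → chat → free → load-SD cycle, assuming llama-cpp-python's `Llama` API. The model paths and the `load_sd_pipeline()` hook are placeholders for whatever the ComfyUI/Omost nodes actually use; also note that, as far as I know, llama-cpp-python cannot move an already-loaded model between devices, so "CPU offload" here is approximated by freeing the GPU copy.

```python
# Sketch of the proposed workflow; paths and the SD-loading hook are placeholders.
import gc
from llama_cpp import Llama

GGUF_PATH = "models/omost-llama-3-8b-Q4_K_M.gguf"  # placeholder path

# 1. Load the GGUF model, offloading all layers to the GPU.
llm = Llama(model_path=GGUF_PATH, n_gpu_layers=-1, n_ctx=4096)

# 2. Omost-style chat / text generation with the loaded model.
result = llm.create_chat_completion(
    messages=[{"role": "user", "content": "generate an image of a cat on a sofa"}],
    max_tokens=512,
)
prompt_text = result["choices"][0]["message"]["content"]

# 3. Release the GGUF model's VRAM. llama-cpp-python does not support moving a
#    loaded model between devices, so "CPU offload" here means freeing the GPU
#    copy; it could be re-created later with n_gpu_layers=0 to run on CPU only.
del llm
gc.collect()

# 4. With the VRAM released, load the SD/SDXL pipeline (placeholder hook that
#    would be handled by ComfyUI's own model management in practice).
# sd_pipeline = load_sd_pipeline("models/sdxl_base_1.0.safetensors")
```

Whether ComfyUI's model management can drive steps 3 and 4 automatically is exactly what this issue is asking about.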
