The memory usage of vLLM's KV cache is directly proportional to the model's maximum batch size. vLLM's default is 256, but many users don't need nearly that many concurrent sequences. For example, someone running a personal model (one request at a time) only needs a cache sized for a single sequence. Unfortunately, the default value is designed for very large-scale parallel inference, which makes it prohibitive to run models fast on anything but the largest type of card. I think that being able to adjust this value would be an easy win for the performance and usefulness of this repo.
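For reference, here's a minimal sketch of what I have in mind, assuming the relevant knob is vLLM's `max_num_seqs` engine argument (default 256) and that the worker builds its engine from `AsyncEngineArgs`; the `MAX_NUM_SEQS` environment variable name is just illustrative, not something the worker reads today:

```python
import os

from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.engine.async_llm_engine import AsyncLLMEngine

# MAX_NUM_SEQS is an illustrative env var name for this sketch.
# vLLM's own default for max_num_seqs is 256, which is far more
# than a single-user deployment needs.
engine_args = AsyncEngineArgs(
    model=os.environ.get("MODEL_NAME", "facebook/opt-125m"),  # placeholder model
    max_num_seqs=int(os.environ.get("MAX_NUM_SEQS", "256")),
)
engine = AsyncLLMEngine.from_engine_args(engine_args)
```

Setting `MAX_NUM_SEQS=1` (or some other small value) in the endpoint's environment would then shrink the scheduler's batch ceiling without touching any other engine settings.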
I can write up a PR for this if that works better; I think I know what needs to be done. I'm just not very familiar with RunPod serverless right now.
For context, when running an 11B model on an L40S, throughput is okay, but the GPU barely gets used (1-2% utilization) because CPU-side PyTorch work is the bottleneck. Throughput would be significantly better if this value could be set lower so that the KV cache fits in VRAM.