
Improved KV cache loading performance for Vulkan, resulting in a 20x acceleration #11815

Open

idales wants to merge 1 commit into master
Conversation


idales commented on Feb 12, 2025


In the current implementation, loading the KV cache calls the ggml_backend_tensor_set function repeatedly for each layer. On the Vulkan backend each call takes 1–2 ms, and because of the large number of calls, loading the entire cache can take around 0.5 seconds, depending on the model. This is particularly noticeable in applications where prompts need to be changed frequently.

Optimization Idea:
The proposed optimization involves assembling the KV cache for each layer in memory and then loading it into the backend with a single call to ggml_backend_tensor_set. In my case, this approach resulted in a 20x improvement in KV cache loading speed.
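To make the idea concrete, here is a minimal sketch (not the actual patch) of the two access patterns for a single layer tensor. The kv_chunk struct, load_layer_k, and layer_size are hypothetical names introduced only for illustration, and the sketch assumes the chunks together cover the whole tensor; only ggml_backend_tensor_set is the real ggml API. The extra host-side memcpy is cheap compared with issuing many individual Vulkan transfers at 1–2 ms each.

```cpp
#include <cstdint>
#include <cstring>
#include <vector>

#include "ggml-backend.h"

// Hypothetical chunk of serialized KV data destined for one layer tensor:
// `size` bytes at `data` to be written at byte `offset` inside the tensor.
struct kv_chunk {
    const uint8_t * data;
    size_t          offset;
    size_t          size;
};

// Sketch: restore one layer's K tensor (`k_l`, `layer_size` bytes) from chunks.
static void load_layer_k(ggml_tensor * k_l, const std::vector<kv_chunk> & chunks,
                         size_t layer_size) {
    // Before: one ggml_backend_tensor_set per chunk. On Vulkan each call
    // costs roughly 1-2 ms, which adds up to ~0.5 s for a whole cache.
    //
    // for (const kv_chunk & c : chunks) {
    //     ggml_backend_tensor_set(k_l, c.data, c.offset, c.size);
    // }

    // After: gather the chunks into a host-side staging buffer, then upload
    // the assembled layer to the backend with a single call.
    std::vector<uint8_t> staging(layer_size);
    for (const kv_chunk & c : chunks) {
        std::memcpy(staging.data() + c.offset, c.data, c.size);
    }
    ggml_backend_tensor_set(k_l, staging.data(), 0, staging.size());
}
```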

When running the CI pipeline locally, after approximately 1.5 hours I encountered the following error:

1.24.132.134 I perplexity: 80.88 seconds per pass - ETA 5.38 minutes
[1]inf,terminate called after throwing an instance of 'vk::DeviceLostError'
  what():  vk::Device::waitForFences: ErrorDeviceLost
./ci/run.sh: line 614: 1312784 Aborted                 (core dumped) ./bin/llama-perplexity --model ${model_q4_k} -f ${wiki_test} -t 1 -ngl 99 -c 2048 -b 512 --chunks 4

real    1m46,938s
user    0m1,617s
sys     0m0,911s

However, the same error occurs when running the original, unmodified code. I am unsure if there are specific tests for cache loading, but I have tested the modified code in my application and did not encounter any issues.
