
Improved KV cache loading performance for Vulkan, resulting in a 20x acceleration #11815

Open

idales wants to merge 1 commit into master
Conversation


idales commented on Feb 12, 2025


In the current implementation, loading the KV cache calls the ggml_backend_tensor_set function repeatedly for each layer. On the Vulkan backend each call takes 1–2 ms, and because of the large number of calls, loading the entire cache can take around 0.5 seconds, depending on the model. This is particularly noticeable in applications where prompts need to be changed frequently.

Optimization Idea:
The proposed optimization involves assembling the KV cache for each layer in memory and then loading it into the backend with a single call to ggml_backend_tensor_set. In my case, this approach resulted in a 20x improvement in KV cache loading speed.
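To make the idea concrete, here is a minimal sketch (not the actual patch) of the two access patterns for a single layer tensor. The kv_chunk struct, load_layer_k, and layer_size are hypothetical names introduced only for illustration, and the sketch assumes the chunks together cover the whole tensor; only ggml_backend_tensor_set is the real ggml API. The extra host-side memcpy is cheap compared with issuing many individual Vulkan transfers at 1–2 ms each.

```cpp
#include <cstdint>
#include <cstring>
#include <vector>

#include "ggml-backend.h"

// Hypothetical chunk of serialized KV data destined for one layer tensor:
// `size` bytes at `data` to be written at byte `offset` inside the tensor.
struct kv_chunk {
    const uint8_t * data;
    size_t          offset;
    size_t          size;
};

// Sketch: restore one layer's K tensor (`k_l`, `layer_size` bytes) from chunks.
static void load_layer_k(ggml_tensor * k_l, const std::vector<kv_chunk> & chunks,
                         size_t layer_size) {
    // Before: one ggml_backend_tensor_set per chunk. On Vulkan each call
    // costs roughly 1-2 ms, which adds up to ~0.5 s for a whole cache.
    //
    // for (const kv_chunk & c : chunks) {
    //     ggml_backend_tensor_set(k_l, c.data, c.offset, c.size);
    // }

    // After: gather the chunks into a host-side staging buffer, then upload
    // the assembled layer to the backend with a single call.
    std::vector<uint8_t> staging(layer_size);
    for (const kv_chunk & c : chunks) {
        std::memcpy(staging.data() + c.offset, c.data, c.size);
    }
    ggml_backend_tensor_set(k_l, staging.data(), 0, staging.size());
}
```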

When running the CI pipeline locally, after approximately 1.5 hours I encountered the following error:

1.24.132.134 I perplexity: 80.88 seconds per pass - ETA 5.38 minutes
[1]inf,terminate called after throwing an instance of 'vk::DeviceLostError'
  what():  vk::Device::waitForFences: ErrorDeviceLost
./ci/run.sh: line 614: 1312784 Aborted                 (core dumped) ./bin/llama-perplexity --model ${model_q4_k} -f ${wiki_test} -t 1 -ngl 99 -c 2048 -b 512 --chunks 4

real    1m46,938s
user    0m1,617s
sys     0m0,911s

However, the same error occurs when running the original, unmodified code. I am unsure if there are specific tests for cache loading, but I have tested the modified code in my application and did not encounter any issues.
