Improved KV cache loading performance for Vulkan, resulting in a 20x acceleration #11815
In the current implementation, loading the KV cache calls the ggml_backend_tensor_set function repeatedly for each layer. On the Vulkan backend, each call takes 1-2 ms, and because of the large number of calls, loading the entire cache can take around 0.5 seconds, depending on the model. This is particularly noticeable in applications that need to switch prompts frequently and quickly.
Optimization Idea:
The proposed optimization assembles the KV cache for each layer in host memory and then loads it into the backend with a single call to ggml_backend_tensor_set. In my case, this approach resulted in a 20x improvement in KV cache loading speed.
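A rough sketch of the idea is below. This is illustrative, not the actual patch: `load_kv_layer_batched`, `read_fn_t`, `n_cells`, and `cell_size` are hypothetical names standing in for whatever reader and layout the real state-loading code uses; `ggml_backend_tensor_set` is the real ggml API.

```cpp
#include "ggml-backend.h"

#include <cstdint>
#include <vector>

// Hypothetical reader callback that pulls the next serialized chunk of the
// saved KV cache (e.g. from a session file) into dst.
typedef void (*read_fn_t)(void * dst, size_t size, void * ctx);

// kv_l is one layer's K (or V) tensor living on the Vulkan backend.
static void load_kv_layer_batched(struct ggml_tensor * kv_l,
                                  read_fn_t read_fn, void * read_ctx,
                                  size_t n_cells, size_t cell_size) {
    // Before the change: one ggml_backend_tensor_set per chunk, each upload
    // costing 1-2 ms on Vulkan, so many small calls dominate the load time.
    //
    // After: assemble the whole layer in a host-side staging buffer first...
    std::vector<uint8_t> staging(n_cells * cell_size);
    for (size_t i = 0; i < n_cells; ++i) {
        read_fn(staging.data() + i * cell_size, cell_size, read_ctx);
    }
    // ...then push it to the backend with a single call.
    ggml_backend_tensor_set(kv_l, staging.data(), 0, staging.size());
}
```

The staging buffer trades a little extra host memory per layer for one large upload instead of many small ones, which is where the speedup comes from.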
When running the CI pipeline locally, it failed with an error after approximately 1.5 hours of running. However, the same error occurs with the original, unmodified code. I am not sure whether there are dedicated tests for cache loading, but I have tested the modified code in my application and did not encounter any issues.