
llama : fix defrag logic #11707

Merged: 3 commits into master from gg/llama-fix-defrag on Feb 7, 2025
Conversation

ggerganov (Owner)

While working on #11213 I realized that we are currently doing many unnecessary graph defrags because of incorrect cache fragmentation logic. The cache padding triggers the fragmentation threshold for small contexts even if there is no fragmentation at all.
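To make the misfire concrete, here is a standalone sketch (not llama.cpp source) of how the old check behaves once the cache view is padded; the padding value of 256 and the 0.1 threshold are assumptions chosen to mirror the flash-attention padding and the benchmark setting used below:

```cpp
// Hedged illustration, not llama.cpp source: with the old formula
//   fragmentation = 1 - used / n
// a small, hole-free cache still looks "fragmented" once n is padded up.
#include <cstdio>

int main() {
    const int used    = 100;  // tokens actually stored, no holes at all
    const int padding = 256;  // assumed cache padding (e.g. with flash attention)
    const int n       = ((used + padding - 1) / padding) * padding;  // padded cache view: 256

    const float fragmentation = 1.0f - float(used) / float(n);  // ~0.61
    const float defrag_thold  = 0.1f;                           // threshold forced in the benchmark patch below

    printf("fragmentation = %.2f -> defrag triggered: %s\n",
           fragmentation, fragmentation > defrag_thold ? "yes" : "no");
    return 0;
}
```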

./scripts/compare-commits.sh master gg/llama-fix-defrag -m models/llama-3.1-8b-instruct/ggml-model-q4_0.gguf -m models/llama-3.1-8b-instruct/ggml-model-q8_0.gguf -m models/llama-3.1-8b-instruct/ggml-model-f16.gguf -m models/qwen2.5-3b-coder/ggml-model-q4_0.gguf -m models/qwen2.5-3b-coder/ggml-model-q8_0.gguf -m models/qwen2.5-3b-coder/ggml-model-f16.gguf -fa 1
| Model | Test | t/s master | t/s gg/llama-fix-defrag | Speedup |
| --- | --- | --- | --- | --- |
| llama 8B F16 | pp512 | 1458.51 | 1458.18 | 1.00 |
| llama 8B F16 | tg128 | 38.82 | 39.19 | 1.01 |
| llama 8B Q4_0 | pp512 | 1324.28 | 1323.85 | 1.00 |
| llama 8B Q4_0 | tg128 | 99.55 | 101.37 | 1.02 |
| llama 8B Q8_0 | pp512 | 1298.42 | 1298.34 | 1.00 |
| llama 8B Q8_0 | tg128 | 66.23 | 66.99 | 1.01 |
| qwen2 3B F16 | pp512 | 3226.49 | 3226.91 | 1.00 |
| qwen2 3B F16 | tg128 | 71.26 | 72.44 | 1.02 |
| qwen2 3B Q4_0 | pp512 | 2927.50 | 2925.14 | 1.00 |
| qwen2 3B Q4_0 | tg128 | 138.02 | 142.55 | 1.03 |
| qwen2 3B Q8_0 | pp512 | 2880.21 | 2878.93 | 1.00 |
| qwen2 3B Q8_0 | tg128 | 108.89 | 112.35 | 1.03 |

master has the following patch applied:

diff --git a/examples/llama-bench/llama-bench.cpp b/examples/llama-bench/llama-bench.cpp
index 4ac19ca86..8e9f90f27 100644
--- a/examples/llama-bench/llama-bench.cpp
+++ b/examples/llama-bench/llama-bench.cpp
@@ -753,6 +753,7 @@ struct cmd_params_instance {
         cparams.offload_kqv = !no_kv_offload;
         cparams.flash_attn  = flash_attn;
         cparams.embeddings  = embeddings;
+        cparams.defrag_thold = 0.1f;
 
         return cparams;
     }
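
Since the patch above only exists to make llama-bench exercise the defrag path, here is a minimal sketch of how an application would opt in through the public API. `defrag_thold`, `n_ctx`, and `flash_attn` are real fields of `llama_context_params`; the context size, threshold value, and the commented-out model loading are placeholders, not part of this PR:

```cpp
// Minimal sketch: enabling KV-cache defragmentation via llama_context_params.
#include "llama.h"

int main() {
    llama_context_params cparams = llama_context_default_params();

    cparams.n_ctx        = 4096;
    cparams.flash_attn   = true;
    cparams.defrag_thold = 0.1f;  // defrag once measured fragmentation exceeds 10%

    // Assumed usage, model path is a placeholder:
    // llama_model   * model = llama_model_load_from_file("model.gguf", llama_model_default_params());
    // llama_context * ctx   = llama_init_from_model(model, cparams);

    return 0;
}
```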

ggerganov merged commit ed926d8 into master on Feb 7, 2025 (50 of 53 checks passed), then deleted the gg/llama-fix-defrag branch the same day at 14:05.

Review thread on the changed defrag condition:

-    if (cparams.causal_attn && cparams.defrag_thold >= 0.0f) {
-        const float fragmentation = kv_self.n >= 128 ? 1.0f - float(kv_self.used)/float(kv_self.n) : 0.0f;
+    if (cparams.causal_attn && cparams.defrag_thold > 0.0f) {
+        // - do not defrag small contexts (i.e. < 2048 tokens)
MoonRide303 commented on Feb 8, 2025:
@ggerganov I am sometimes running benchmarks that require only 256 or 512 tokens per slot, with a total context size of 512 or 1024 (for big models that don't fully fit into my VRAM). Will it work properly in cases like that?

ggerganov (Owner, Author) replied:

Defragmentation for such small contexts is not really worth it, so my expectation is that with this change you should get better performance overall.
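
To make that behaviour concrete, here is a hedged sketch of the guarded check; the 2048-token cutoff comes from the comment in the diff above, while the ratio itself is an approximation and may differ from the exact merged expression:

```cpp
// Hedged sketch of the guarded fragmentation measure discussed above; the
// 2048 cutoff is taken from the diff comment, the rest is an approximation.
static float kv_fragmentation(int n_cells, int n_used) {
    // do not defrag small contexts (i.e. < 2048 tokens): report no fragmentation
    if (n_cells < 2048) {
        return 0.0f;
    }
    return 1.0f - float(n_used) / float(n_cells);
}

// With a 512- or 1024-token context (as in the benchmarks described above),
// kv_fragmentation() always returns 0.0f, so the defrag pass is never scheduled.
```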
