
llama : fix defrag logic #11707

Merged: 3 commits into master from gg/llama-fix-defrag on Feb 7, 2025
Conversation

ggerganov (Owner)

While working on #11213 I realized that we are currently doing many unnecessary graph defrags because of incorrect cache fragmentation logic. The cache padding triggers the fragmentation threshold for small contexts even if there is no fragmentation at all.
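To make the misfire concrete, here is a standalone sketch (not llama.cpp source) of how the old check behaves once the cache view is padded; the padding value of 256 and the 0.1 threshold are assumptions chosen to mirror the flash-attention padding and the benchmark setting used below:

```cpp
// Hedged illustration, not llama.cpp source: with the old formula
//   fragmentation = 1 - used / n
// a small, hole-free cache still looks "fragmented" once n is padded up.
#include <cstdio>

int main() {
    const int used    = 100;  // tokens actually stored, no holes at all
    const int padding = 256;  // assumed cache padding (e.g. with flash attention)
    const int n       = ((used + padding - 1) / padding) * padding;  // padded cache view: 256

    const float fragmentation = 1.0f - float(used) / float(n);  // ~0.61
    const float defrag_thold  = 0.1f;                           // threshold forced in the benchmark patch below

    printf("fragmentation = %.2f -> defrag triggered: %s\n",
           fragmentation, fragmentation > defrag_thold ? "yes" : "no");
    return 0;
}
```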

./scripts/compare-commits.sh master gg/llama-fix-defrag -m models/llama-3.1-8b-instruct/ggml-model-q4_0.gguf -m models/llama-3.1-8b-instruct/ggml-model-q8_0.gguf -m models/llama-3.1-8b-instruct/ggml-model-f16.gguf -m models/qwen2.5-3b-coder/ggml-model-q4_0.gguf -m models/qwen2.5-3b-coder/ggml-model-q8_0.gguf -m models/qwen2.5-3b-coder/ggml-model-f16.gguf -fa 1
| Model | Test | t/s master | t/s gg/llama-fix-defrag | Speedup |
| --- | --- | --- | --- | --- |
| llama 8B F16 | pp512 | 1458.51 | 1458.18 | 1.00 |
| llama 8B F16 | tg128 | 38.82 | 39.19 | 1.01 |
| llama 8B Q4_0 | pp512 | 1324.28 | 1323.85 | 1.00 |
| llama 8B Q4_0 | tg128 | 99.55 | 101.37 | 1.02 |
| llama 8B Q8_0 | pp512 | 1298.42 | 1298.34 | 1.00 |
| llama 8B Q8_0 | tg128 | 66.23 | 66.99 | 1.01 |
| qwen2 3B F16 | pp512 | 3226.49 | 3226.91 | 1.00 |
| qwen2 3B F16 | tg128 | 71.26 | 72.44 | 1.02 |
| qwen2 3B Q4_0 | pp512 | 2927.50 | 2925.14 | 1.00 |
| qwen2 3B Q4_0 | tg128 | 138.02 | 142.55 | 1.03 |
| qwen2 3B Q8_0 | pp512 | 2880.21 | 2878.93 | 1.00 |
| qwen2 3B Q8_0 | tg128 | 108.89 | 112.35 | 1.03 |

master has the following patch applied:

diff --git a/examples/llama-bench/llama-bench.cpp b/examples/llama-bench/llama-bench.cpp
index 4ac19ca86..8e9f90f27 100644
--- a/examples/llama-bench/llama-bench.cpp
+++ b/examples/llama-bench/llama-bench.cpp
@@ -753,6 +753,7 @@ struct cmd_params_instance {
         cparams.offload_kqv = !no_kv_offload;
         cparams.flash_attn  = flash_attn;
         cparams.embeddings  = embeddings;
+        cparams.defrag_thold = 0.1f;
 
         return cparams;
     }
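
Since the patch above only exists to make llama-bench exercise the defrag path, here is a minimal sketch of how an application would opt in through the public API. `defrag_thold`, `n_ctx`, and `flash_attn` are real fields of `llama_context_params`; the context size, threshold value, and the commented-out model loading are placeholders, not part of this PR:

```cpp
// Minimal sketch: enabling KV-cache defragmentation via llama_context_params.
#include "llama.h"

int main() {
    llama_context_params cparams = llama_context_default_params();

    cparams.n_ctx        = 4096;
    cparams.flash_attn   = true;
    cparams.defrag_thold = 0.1f;  // defrag once measured fragmentation exceeds 10%

    // Assumed usage, model path is a placeholder:
    // llama_model   * model = llama_model_load_from_file("model.gguf", llama_model_default_params());
    // llama_context * ctx   = llama_init_from_model(model, cparams);

    return 0;
}
```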

ggerganov merged commit ed926d8 into master on Feb 7, 2025 (50 of 53 checks passed), then deleted the gg/llama-fix-defrag branch the same day at 14:05.

Review thread on the changed defrag condition:

-    if (cparams.causal_attn && cparams.defrag_thold >= 0.0f) {
-        const float fragmentation = kv_self.n >= 128 ? 1.0f - float(kv_self.used)/float(kv_self.n) : 0.0f;
+    if (cparams.causal_attn && cparams.defrag_thold > 0.0f) {
+        // - do not defrag small contexts (i.e. < 2048 tokens)
MoonRide303 commented on Feb 8, 2025:
@ggerganov I am sometimes running benchmarks that require only 256 or 512 tokens per slot, with a total context size of 512 or 1024 (for big models that don't fully fit into my VRAM). Will it work properly in cases like that?

ggerganov (Owner, Author) replied:

Defragmentation for such small contexts is not really worth it, so my expectation is that with this change you should get better performance overall.
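
To make that behaviour concrete, here is a hedged sketch of the guarded check; the 2048-token cutoff comes from the comment in the diff above, while the ratio itself is an approximation and may differ from the exact merged expression:

```cpp
// Hedged sketch of the guarded fragmentation measure discussed above; the
// 2048 cutoff is taken from the diff comment, the rest is an approximation.
static float kv_fragmentation(int n_cells, int n_used) {
    // do not defrag small contexts (i.e. < 2048 tokens): report no fragmentation
    if (n_cells < 2048) {
        return 0.0f;
    }
    return 1.0f - float(n_used) / float(n_cells);
}

// With a 512- or 1024-token context (as in the benchmarks described above),
// kv_fragmentation() always returns 0.0f, so the defrag pass is never scheduled.
```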
