[Inference PagedAttention] Integrate initial paged attention implementation into maxengine (2/N) #1336

Open
wants to merge 1 commit into main from wyzhang/page/2-n-0228

Conversation

wyzhang
Collaborator

@wyzhang wyzhang commented Mar 2, 2025

This change is based on a branch from Pate and Rupeng with code refactoring and modifications.

What:

  • This PR integrates the initial paged attention components into maxengine, guarded behind the `attention=paged` config setting (a toy sketch of the idea follows below).
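For readers unfamiliar with the technique, here is a toy sketch of the paged KV-cache idea. All names, shapes, and the ToyPagedKVCache class are illustrative assumptions, not MaxText code; the real paged attention op consumes page-organized KV buffers and page tables rather than a class like this. The point is that keys and values live in fixed-size pages, and a per-sequence page table maps logical token positions to physical pages, so memory is allocated on demand instead of being reserved for the full target length up front.

import numpy as np

# Toy illustration only (assumed names, not MaxText's API).
class ToyPagedKVCache:
  def __init__(self, num_pages=64, tokens_per_page=8, num_heads=4, head_dim=16):
    self.tokens_per_page = tokens_per_page
    # Physical storage: fixed-size pages instead of one contiguous buffer
    # sized for the full max_target_length.
    self.k_pages = np.zeros((num_pages, tokens_per_page, num_heads, head_dim), np.float32)
    self.v_pages = np.zeros_like(self.k_pages)
    self.free_pages = list(range(num_pages))
    self.page_table = []  # logical page index -> physical page id (one sequence)

  def append(self, pos, k, v):
    """Store key/value for token position `pos`, allocating a page on demand."""
    page_idx, slot = divmod(pos, self.tokens_per_page)
    if page_idx == len(self.page_table):  # first token landing on a new logical page
      self.page_table.append(self.free_pages.pop())
    phys = self.page_table[page_idx]
    self.k_pages[phys, slot] = k
    self.v_pages[phys, slot] = v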

Impact of this change:

  • This PR is a no-op by default: paged attention is enabled only when `attention=paged` is set in the config. The default `attention=autoselected` will NOT trigger paged attention.

Key changes:

  • MaxText/layers/attentions.py: Use the paged attention op when `attention=paged` for all model modes other than MODEL_MODE_TRAIN (see the sketch after this list)
  • MaxText/layers/models.py: Initialize paged attention components when `attention=paged`
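As an orientation aid, the gating in attentions.py looks roughly like this. This is a sketch consistent with the diff excerpt shown later on this page, not the exact committed code; the paged_attention_op attribute and its call signature are assumptions, only the condition is taken verbatim from the diff.

# Sketch of the attention=paged gating in MaxText/layers/attentions.py (assumed method body).
if self.config.attention == "paged" and model_mode != common_types.MODEL_MODE_TRAIN:
  out = self.paged_attention_op(query, key, value, decoder_segment_ids, model_mode, previous_chunk)
else:
  out = self.attention_op(query, key, value, decoder_segment_ids, model_mode, previous_chunk)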

Why:

  • Paged attention should improve inference performance: the KV cache is managed in fixed-size pages allocated on demand, avoiding memory reserved up front for the full target length.

Testing:

  • python -m unittest tests/inference/paged_attention_test.py
  • python MaxText/decode.py MaxText/configs/base.yml tokenizer_path=assets/tokenizer.llama2 \
      load_parameters_path=gs://msingh-bkt/checkpoints/quant_llama2-7b-chat/20241120034012/int8_ \
      max_prefill_predict_length=16 max_target_length=32 model_name=llama2-7b \
      ici_fsdp_parallelism=1 ici_autoregressive_parallelism=1 ici_tensor_parallelism=-1 \
      scan_layers=false weight_dtype=bfloat16 per_device_batch_size=1 \
      checkpoint_is_quantized=true quantization=int8 \
      attention=paged pagedattn_num_pages=64 pagedattn_tokens_per_page=8 pagedattn_pages_per_compute_block=4
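    As a rough sanity check on these flags (assuming total KV-cache token capacity is pagedattn_num_pages × pagedattn_tokens_per_page): 64 pages × 8 tokens per page = 512 token slots, which comfortably covers max_target_length=32 for this smoke test, and pagedattn_pages_per_compute_block=4 would have the kernel work over 4 pages (32 tokens) per compute block.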

Before submitting this PR, please make sure (put X in square brackets):

  • I have performed a self-review of my code.
  • I have necessary comments in my code, particularly in hard-to-understand areas.
  • I have run end-to-end tests and provided workload links above if applicable.
  • I have made or will make corresponding changes to the doc if needed.

Collaborator

@vipannalla vipannalla left a comment

LGTM

@vipannalla
Collaborator

@richjames0, can you also take a look and LGTM?

@wyzhang wyzhang force-pushed the wyzhang/page/1-n-0228 branch 2 times, most recently from f93f36a to 6814eaa on March 4, 2025 03:48
Base automatically changed from wyzhang/page/1-n-0228 to main on March 4, 2025 04:58
@wyzhang wyzhang force-pushed the wyzhang/page/2-n-0228 branch 2 times, most recently from 18c9b2b to 0e620e5 on March 4, 2025 05:49
@wyzhang wyzhang force-pushed the wyzhang/page/2-n-0228 branch from 0e620e5 to c131d50 on March 4, 2025 21:26
@wyzhang wyzhang force-pushed the wyzhang/page/2-n-0228 branch 3 times, most recently from ace9d03 to 887e944 on March 5, 2025 05:11
@@ -1520,11 +1546,15 @@ def __call__(

assert not self.config.quantize_kvcache or self.kv_quant

out = self.attention_op(query, key, value, decoder_segment_ids, model_mode, previous_chunk)
if self.config.attention == "paged" and model_mode != common_types.MODEL_MODE_TRAIN:
Collaborator

it is better to do:

and (model_mode == common_types.MODEL_MODE_PREFILL or model_mode == common_types.MODEL_MODE_AUTOREGRESSIVE)
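Spelled out, the reviewer's suggestion would make the guard read roughly as follows (a sketch of the suggestion, not committed code):

# Enumerate the inference modes explicitly instead of excluding MODEL_MODE_TRAIN.
if self.config.attention == "paged" and (
    model_mode == common_types.MODEL_MODE_PREFILL
    or model_mode == common_types.MODEL_MODE_AUTOREGRESSIVE
):
  ...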

decode_state["cache"],
self.kv_cache_annotations_named,
)
if self.config.attention == "paged":
Collaborator

have you run tests for this chunk of code?

@xy12181 xy12181 self-requested a review March 5, 2025 05:42
@wyzhang wyzhang force-pushed the wyzhang/page/2-n-0228 branch 2 times, most recently from 83ee89a to 570215d on March 5, 2025 06:23
@wyzhang wyzhang force-pushed the wyzhang/page/2-n-0228 branch 4 times, most recently from ea4e683 to f6afd83 on March 5, 2025 17:26
@wyzhang wyzhang force-pushed the wyzhang/page/2-n-0228 branch from f6afd83 to 6fd47d0 on March 5, 2025 18:11
copybara-service bot pushed a commit that referenced this pull request Mar 5, 2025
Lumosis pushed a commit that referenced this pull request Mar 6, 2025