
Add seq parallelism for attention and MoE MLP #1328

Open · wants to merge 24 commits into main

Conversation

@suexu1025 (Collaborator) commented Mar 1, 2025

Description

  1. Add sequence parallelism (seq_parallelism) plus expert parallelism (exp_parallelism) for the attention module and the MoE MLP module that follows it. With SP + EP, MoE inference at 2k sequence length improved by 20% for the customer workload (a rough sketch of the sharding idea follows the Description).
  2. Fix the prefill KV-cache sharding mismatch under sequence parallelism.
  3. Decode improved by 10%.
  4. Enable inference auto layout in the Mistral model.

FIXES: b/374773995
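
For readers unfamiliar with the approach, here is a minimal, hypothetical sketch of the sharding idea: split the activation's sequence dimension across a "context" mesh axis so that attention and the MoE MLP each operate on a sub-sequence per device. The mesh setup, axis names, and shapes below are assumptions for illustration only, not the PR's actual code.

  import jax
  import jax.numpy as jnp
  from jax.experimental import mesh_utils
  from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

  # Build a 1-D device mesh whose single axis plays the role of "context"
  # (i.e. what ici_context_parallelism controls in MaxText configs).
  devices = mesh_utils.create_device_mesh((jax.device_count(),))
  mesh = Mesh(devices, ("context",))

  batch, seq_len, emb = 1, 2048, 4096
  activations = jnp.zeros((batch, seq_len, emb))

  # Shard the sequence dimension across the "context" axis; batch and embed
  # stay unsharded in this toy example, so each device holds a sub-sequence.
  activations = jax.device_put(
      activations, NamedSharding(mesh, P(None, "context", None))
  )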

Tests

Tested on v6e/v5p:
SEQ=2048

python MaxText/inference_microbenchmark.py MaxText/configs/base.yml \
  max_prefill_predict_length=$SEQ max_target_length=6144 model_name=mixtral-8x7b \
  ici_fsdp_parallelism=1 ici_autoregressive_parallelism=1 ici_expert_parallelism=1 \
  ici_context_parallelism=4 ici_tensor_parallelism=1 scan_layers=false \
  per_device_batch_size=1 attention=dot_product megablox=False quantization=int8 \
  checkpoint_is_quantized=True quantize_kvcache=True capacity_factor=1 \
  tokenizer_path=assets/tokenizer.mistral-v3 compute_axis_order=0,2,1,3 \
  ar_cache_axis_order=0,2,1,3 enable_jax_profiler=True \
  inference_microbenchmark_prefill_lengths="$SEQ" base_output_directory=$OUT_DIR \
  run_name=$RUN_NAME profiler=xplane model_call_mode=inference \
  inference_microbenchmark_stages=prefill

Checklist

Before submitting this PR, please make sure (put X in square brackets):

  • [x] I have performed a self-review of my code.
  • [x] I have added necessary comments in my code, particularly in hard-to-understand areas.
  • [x] I have run end-to-end tests and provided workload links above if applicable.
  • [x] I have made or will make corresponding changes to the doc if needed.

google-cla bot commented Mar 1, 2025

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

View this failed invocation of the CLA check for more information.

For the most up-to-date status, view the checks section at the bottom of the pull request.

@suexu1025 suexu1025 changed the title [Draft] Add seq parallelism for attention and MLP Add seq parallelism for attention and MoE MLP Mar 6, 2025
@mailvijayasingh (Collaborator) left a comment:
Overall, looks good, left some comments.

Let's test dense models, training, and MoE 8x22B on v6e-8 before pushing.

@@ -1424,7 +1457,7 @@ def out_projection(self, output_dim: int, out: Array) -> Array:
features=output_dim,
axis=(-2, -1),
kernel_init=self.kernel_init,
kernel_axes=("heads", "kv", "embed"),
kernel_axes=(None, None, None), # trade speed with memory
Collaborator:
does this mean we might OOM in some higher batch sizes?

@@ -103,6 +103,9 @@ def __call__(
float32_logits=cfg.float32_logits,
quant=self.quant,
kv_quant=quantizations.configure_kv_quant(cfg),
prefill_cache_axis_order=tuple([int(i) for i in cfg.prefill_cache_axis_order.split(",")]),
ar_cache_axis_order=tuple([int(i) for i in cfg.ar_cache_axis_order.split(",")]),
Collaborator:
thanks for fixing this :)

@@ -212,6 +212,7 @@ def validate_model_name(s: str) -> bool:
"llama3.1-70b",
"llama3.1-405b",
"llama3.3-70b",
"subsup",
Collaborator:
remove

cp = self.config.ici_context_parallelism
batch_size = inputs.shape[0]
seq_len = inputs.shape[1]
if seq_len % cp != 0:
Collaborator:
nit: maybe abstract this part to get_cp and get_sub_seq_length?

Collaborator:
+1, let's extract this into a helper function and reuse it, or have get_context_partition_and_sub_seq return both.
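
A rough sketch of what the suggested helper could look like, using the name proposed in this thread; the fallback to cp = 1 when the sequence length is not divisible is an assumption based on the surrounding diff, not the PR's actual code.

  def get_context_partition_and_sub_seq(config, inputs):
    """Returns (context_partitions, sub_seq_len) for sequence parallelism."""
    cp = config.ici_context_parallelism
    seq_len = inputs.shape[1]
    if cp <= 0 or seq_len % cp != 0:
      cp = 1  # fall back to no context partitioning
    return cp, seq_len // cp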

['activation_length', ['sequence']],
['activation_length', ['sequence', 'context']],
['activation_length', ['context']],
['activation_length_q', ['context']],
Collaborator:
As discussed offline, shall we pull this out into a separate config, e.g. moe_config_inference.yml, specifically for MoE?

Collaborator:
What's this config? i.e. is this context parallelism related?

LENGTH = common_types.LENGTH
KV_LENGTH = common_types.KV_LENGTH
Collaborator:
We will have to make changes wherever kv_quant=True, or else it's broken. One workaround is to push this code with an assertion that kv_quant must be False when context_parallelism != 1, and quickly follow up with the quantization changes. But if it's not too hard, I would prefer they go in together.
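
A minimal sketch of the guard described above, using the config names that appear in the benchmark command (quantize_kvcache, ici_context_parallelism); the exact flag names and where the check lives are assumptions, not the PR's code.

  # Sketch only: refuse KV-cache quantization while context parallelism is on.
  if config.ici_context_parallelism != 1:
    assert not config.quantize_kvcache, (
        "quantize_kvcache is not yet supported with ici_context_parallelism != 1"
    )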

@RissyRan (Collaborator) left a comment:
Thanks for the change and improvement! Overall, LGTM! One thing I'd like to see is the performance impact on training. When it's ready, could you help run a benchmark on 8x7B (or another model size) with FSDP + EP sharding in dropping mode, with and without this change? Capturing profiles would be great! Thank you!

@@ -568,16 +582,32 @@ def apply_attention_dot(
key = key.astype(jnp.float32)

q_seq_len = query.shape[1]
# special sharding for decode
if self.config.ici_context_parallelism > 0 and q_seq_len == 1:
Collaborator:
Nit: could we wrap self.config.ici_context_parallelism > 0 and q_seq_len == 1 in a helper function and reuse it, named something like is_context_parallelism_in_decoding()?
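
For illustration, the suggested predicate might look like this; the method name comes from the review, and placing it on the attention class is an assumption.

  def is_context_parallelism_in_decoding(self, q_seq_len):
    """True when context parallelism is enabled and we are decoding one token."""
    return self.config.ici_context_parallelism > 0 and q_seq_len == 1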

cp = 1
sub_seq = seq_len // cp

top_k_indices = jnp.reshape(top_k_indices, (batch_size, cp, sub_seq, top_k_indices.shape[2]))
Collaborator:
Could we rename cp as context_partitions or similar? And add a comment about top_k_indices shape, i.e. [batch_size, context_partition, sub_sequence, num_experts_per_tok]?
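
A sketch of the suggested rename and shape comment, based on the quoted lines above (illustrative only, not the PR's code):

  context_partitions = 1  # no context partitioning on this path
  sub_seq = seq_len // context_partitions
  # top_k_indices: [batch_size, context_partitions, sub_seq, num_experts_per_tok]
  top_k_indices = jnp.reshape(
      top_k_indices, (batch_size, context_partitions, sub_seq, top_k_indices.shape[2])
  )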

# intermediate_layer = nn.with_logical_constraint(
# intermediate_layer,
# ("activation_exp", "activation_batch_no_exp", None, "activation_embed"),
# )
Collaborator:
I see this block is removed

if self.config.activations_in_float32:
          intermediate_layer = intermediate_layer.astype(jnp.float32)
