[torch_xla] Changing the sharding of `model.embed_tokens.weight` produces NaN gradients in Llama 3.1 405B #114

tengyifei · 2025-02-18T08:01:49Z

The scaling configuration for Llama 3.1 405B on one Trillium pod is:

activation_checkpoint_layers:
 - LlamaDecoderLayer

optimization_barrier_layers:
 - LlamaDecoderLayer

sharding:
  model.embed_tokens.weight: [fsdp, tensor]
  model.layers.*.self_attn.q_proj.weight: [tensor, fsdp]
  model.layers.*.self_attn.k_proj.weight: [tensor, fsdp]
  model.layers.*.self_attn.v_proj.weight: [tensor, fsdp]
  model.layers.*.self_attn.o_proj.weight: [fsdp, tensor]
  model.layers.*.mlp.gate_proj.weight: [tensor, fsdp]
  model.layers.*.mlp.up_proj.weight: [tensor, fsdp]
  model.layers.*.mlp.down_proj.weight: [fsdp, tensor]
  model.layers.*.input_layernorm.weight: [fsdp]
  model.layers.*.post_attention_layernorm.weight: [fsdp]
  model.norm.weight: [fsdp]
  lm_head.weight: [tensor, fsdp]
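
Each sharding spec names one mesh axis per tensor dimension. As a rough illustration, the sketch below computes the per-device shard shape of `model.embed_tokens.weight` for either ordering of the two axes. The mesh sizes (fsdp=64, tensor=4) and the weight shape (128256, 16384) are assumptions made for this example, and `shard_shape` is a hypothetical helper, not a torch_xla or torchprime API.

```python
from math import prod

# Assumed mesh axis sizes for this illustration.
MESH = {"fsdp": 64, "tensor": 4}

# Assumed shape of model.embed_tokens.weight: (vocab_size, hidden_size).
EMBED_SHAPE = (128256, 16384)


def shard_shape(shape, spec, mesh=MESH):
    """Return the per-device shard shape when dim i is split over mesh axis spec[i]."""
    if len(shape) != len(spec):
        raise ValueError("spec must name one mesh axis (or None) per dimension")
    out = []
    for size, axis in zip(shape, spec):
        parts = 1 if axis is None else mesh[axis]
        if size % parts != 0:
            raise ValueError(f"dimension {size} is not divisible by {parts}")
        out.append(size // parts)
    return tuple(out)


original = shard_shape(EMBED_SHAPE, ["fsdp", "tensor"])
swapped = shard_shape(EMBED_SHAPE, ["tensor", "fsdp"])
print(original, swapped)
# Both layouts hold the same number of elements per device.
print(prod(original) == prod(swapped))
```

Under these assumptions, both orderings leave the same number of elements on each device; only the shape of the local shard differs.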

If we replace `model.embed_tokens.weight: [fsdp, tensor]` with `model.embed_tokens.weight: [tensor, fsdp]`, one would expect the model to train just as well, because this change should not affect any of the subsequent decoder layers. In practice, we observe that:

The gradients of some decoder layers become NaN by the 6th iteration.
The collectives in the backward pass are drastically different (e.g. all-reduce becomes all-gather).
The model uses less HBM.
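
One way to quantify the change in collectives is to count the collective ops in the HLO text dumped for each configuration and compare the tallies. The sketch below is a minimal version of that idea. The two HLO fragments are made-up placeholders rather than output from this model, and `count_collectives` is a hypothetical helper.

```python
import re
from collections import Counter

# Collective opcodes as they appear in HLO text; async forms add a "-start" suffix.
_COLLECTIVE = re.compile(
    r"\s(all-reduce|all-gather|reduce-scatter|all-to-all|collective-permute)"
    r"(?:-start)?\("
)


def count_collectives(hlo_text):
    """Count collective instructions by opcode in an HLO text dump."""
    return Counter(m.group(1) for m in _COLLECTIVE.finditer(hlo_text))


# Placeholder HLO fragments for illustration only (not real dumps).
hlo_before = """
%all-reduce.1 = bf16[1024,256]{1,0} all-reduce(bf16[1024,256]{1,0} %p0), to_apply=%add
%all-reduce.2 = bf16[1024,256]{1,0} all-reduce(bf16[1024,256]{1,0} %p1), to_apply=%add
"""
hlo_after = """
%all-gather.1 = bf16[1024,1024]{1,0} all-gather(bf16[1024,256]{1,0} %p0), dimensions={1}
%all-reduce.1 = bf16[1024,256]{1,0} all-reduce(bf16[1024,256]{1,0} %p1), to_apply=%add
"""

before = count_collectives(hlo_before)
after = count_collectives(hlo_after)

# Print a per-opcode comparison; a Counter returns 0 for opcodes it never saw.
for op in sorted(set(before) | set(after)):
    print(f"{op}: {before[op]} -> {after[op]}")
```

The regex matches the opcode position (preceded by whitespace and followed by an opening parenthesis), so instruction names such as `%all-reduce.1` are not counted twice.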

This is a tracking bug to find the root cause of this problem. One hypothesis:

The `model.embed_tokens.weight` sharding is propagated to other tensors during the backward pass, which changes the collectives significantly and introduces numerical instability. To dig deeper, we'll need to inspect how GSPMD propagates shardings to tensors in the backward pass.

To reproduce:

tp run torchprime/torch_xla_models/train.py model=llama-3.1-405b global_batch_size=64 mesh.fsdp=64 mesh.tensor=4 dataset_config_name=wikitext-103-raw-v1 profile_step=15 logging_steps=1 model.scaling.sharding='{model.embed_tokens.weight:[tensor,fsdp]}'
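
The override in the command above changes only one key of the sharding map. As a rough illustration of how such a map might resolve to concrete parameters, the sketch below merges the override into a trimmed copy of the base config and matches glob-style keys against parameter names with `fnmatch`. Whether torchprime resolves patterns this way is an assumption; `resolve_sharding` and the sample parameter names are hypothetical.

```python
from fnmatch import fnmatchcase

# Trimmed copy of the base sharding map from the configuration above.
base = {
    "model.embed_tokens.weight": ["fsdp", "tensor"],
    "model.layers.*.self_attn.q_proj.weight": ["tensor", "fsdp"],
    "lm_head.weight": ["tensor", "fsdp"],
}

# The single-key override passed on the command line.
override = {"model.embed_tokens.weight": ["tensor", "fsdp"]}


def resolve_sharding(param_names, sharding):
    """Map each parameter name to the spec of the first pattern that matches it."""
    resolved = {}
    for name in param_names:
        for pattern, spec in sharding.items():
            if fnmatchcase(name, pattern):
                resolved[name] = spec
                break
    return resolved


# Later keys win, so the override replaces only the embedding entry.
merged = {**base, **override}

# Sample parameter names, chosen for illustration.
params = [
    "model.embed_tokens.weight",
    "model.layers.0.self_attn.q_proj.weight",
    "lm_head.weight",
]
resolved = resolve_sharding(params, merged)
for name, spec in resolved.items():
    print(name, spec)
```

In this sketch, only the embedding entry differs between `base` and `merged`; every other parameter keeps its original spec.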

tengyifei self-assigned this Feb 18, 2025
