TRN2 Meshes and Configurations #916

Open
wants to merge 9 commits into base: main

Conversation

apoorvtintin (Contributor):

This PR adds meshes for TRN2/TRN1 for the Fuji models, along with a transformer layer configuration favorable to Neuron.

Neuron supports the stacked transformer layout and GroupedQKVLinear (instead of FusedGroupedQKVLinear) for Grouped Query Attention (GQA).

This is a newer version of PR #885; it resolves all comments and requested changes from the linked PR.
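As a rough illustration of the kind of override being described (a hedged sketch, not the exact diff in this PR; the class names come from axlearn.common.attention, and whether the Fuji configs set them exactly this way is an assumption):

```python
# Hedged sketch, not the PR's exact code: configure GQA with the unfused
# GroupedQKVLinear input projection, which Neuron supports.
from axlearn.common.attention import GroupedQKVLinear, GroupedQueryAttention


def neuron_friendly_gqa_config() -> GroupedQueryAttention.Config:
    cfg = GroupedQueryAttention.default_config()
    # Replace the default fused projection (FusedGroupedQKVLinear) with the
    # unfused GroupedQKVLinear.
    cfg.input_linear = GroupedQKVLinear.default_config()
    return cfg
```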

@apoorvtintin apoorvtintin requested review from ruomingp, markblee and a team as code owners January 10, 2025 00:48
@apoorvtintin apoorvtintin force-pushed the mainline-upstream-boilerplate branch 2 times, most recently from 6b404f6 to 3f7c840 on January 10, 2025 at 00:53
@apoorvtintin (Contributor Author):

Added a ModelConfigModifier that overrides the class of a module, allowing different model configurations based on model size and platform.
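As a rough sketch of how such a modifier might be configured (the field name `model_cfg_modifications` and the target path below are assumptions inferred from the review discussion, not the final API):

```python
# Hypothetical usage of ModelConfigModifier (added in this PR); field names and
# config paths are assumptions, not the merged API.
from axlearn.common.attention import StackedTransformerLayer
from axlearn.common.trainer_config_modifier import ModelConfigModifier

model_modifier = ModelConfigModifier.default_config().set(
    model_cfg_modifications={
        # Dotted module path -> replacement config, e.g. use a stacked
        # transformer layout on Neuron platforms.
        "model.decoder.transformer": StackedTransformerLayer.default_config(),
    },
)
```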

@kelvin-zou (Contributor) left a comment:

Thank you for making this change; overall it looks good. A few nit comments.

```python
continue
# Here we assume x.y.z format.
# One example would be model.decoder.transformer.layer.
target_modules = module_name.split(".")
```
Contributor:

Can you try to extract a common util function named something like `def replace_module_recursive(target_modules: str, config_key: str, target_config)` and apply it to both here and RematSpecModifier?

@apoorvtintin (Contributor Author) commented Jan 10, 2025:

I extracted a helper function; let me know if this looks good.
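For reference, a minimal sketch of what such a path-walking helper could look like, assuming the x.y.z module-path format noted in the snippet above; the return signature mirrors the later usage in this thread but is not necessarily identical to the helper that landed:

```python
# Hedged sketch of a helper that walks a dotted config path; not the exact code
# merged in this PR.
def _find_target_module(module_name: str, cfg):
    """Follows a path like "model.decoder.transformer.layer" through `cfg`.

    Returns (found_module, parent_module, key_in_parent) so the caller can both
    inspect and replace the target child config.
    """
    parent, key, current = None, None, cfg
    for key in module_name.split("."):
        parent = current
        current = getattr(current, key)
    return current, parent, key
```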

axlearn/common/trainer_config_modifier_test.py (review thread resolved)
@apoorvtintin apoorvtintin force-pushed the mainline-upstream-boilerplate branch 2 times, most recently from 708fc5e to d481132 on January 10, 2025 at 07:38
@apoorvtintin (Contributor Author) commented Jan 10, 2025:

Added a ParameterPartitionSpecModifier to shard the embeddings in a vocab-parallel manner, as described in Megatron-LM.
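As a hedged illustration of what vocab-parallel embedding sharding means here (the mesh axis name below is a placeholder, not necessarily this PR's mesh):

```python
from jax.sharding import PartitionSpec

# The embedding table has shape (vocab_size, hidden_dim). Megatron-LM style
# vocab parallelism shards the vocab axis over the model-parallel mesh axis and
# replicates the hidden axis. "model" is a placeholder axis name.
vocab_parallel_embedding_spec = PartitionSpec("model", None)
```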

@apoorvtintin apoorvtintin force-pushed the mainline-upstream-boilerplate branch 2 times, most recently from 5be50d7 to 9b10041 on January 10, 2025 at 08:10
axlearn/common/trainer_config_modifier.py (4 review threads resolved)

```python
found_module, parent_module, key_in_parent = find_target_module(module_name, cfg)

# Copy configurations from the config being replaced on a best-effort basis.
```
Contributor:

Wait, this behavior is not explained in the class comments. So we are not replacing but merging the configs? Maybe we should support a merge function instead?

apoorvtintin (Contributor Author):

Yeah, the goal is to change the config to a similar module, which means most of the configuration can be reused. Essentially we replace the module but merge the config. Let me extract a merge function.

apoorvtintin (Contributor Author):

Abstracted out a merge function; let me know if more changes are needed.
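A minimal sketch of the kind of best-effort merge being discussed, assuming ConfigBase exposes `keys()`; the helper name and exact semantics are assumptions, not the merged implementation:

```python
# Hedged sketch: copy fields from the config being replaced into the new config
# wherever the new config defines the same field.
def _merge_configs(target_cfg, source_cfg):
    for key in target_cfg.keys():
        if key == "klass":
            continue  # The class itself is what the modifier replaces.
        if hasattr(source_cfg, key):
            setattr(target_cfg, key, getattr(source_cfg, key))
    return target_cfg
```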

@apoorvtintin apoorvtintin force-pushed the mainline-upstream-boilerplate branch from 9b10041 to 0f0a530 on January 12, 2025 at 07:06
@apoorvtintin (Contributor Author):

@ruomingp Thank you for the review. I have addressed all your comments; please let me know if more changes are needed.

@apoorvtintin apoorvtintin requested a review from ruomingp January 12, 2025 07:08
axlearn/common/trainer_config_modifier.py (5 review threads resolved)
Comment on lines 239 to 244
```python
for module_name, model_cfg in self._model_cfg_modifications.items():
    found_module = _find_target_module(module_name, cfg)
```
Contributor:

In utils.py we have get_recursively and set_recursively for Nested[...]. I wonder if it will be useful to add corresponding methods to ConfigBase. Then we can do something like:

Suggested change:

```diff
-for module_name, model_cfg in self._model_cfg_modifications.items():
-    found_module = _find_target_module(module_name, cfg)
+for cfg_path, cfg_modification in self._model_cfg_modifications.items():
+    child_cfg = cfg.get_recursively(cfg_path)
+    child_cfg = cfg_modification(child_cfg, path=cfg_path)
+    cfg.set_recursively(cfg_path, value=child_cfg)
```

apoorvtintin (Contributor Author):

Added get_recursively and set_recursively functions to ConfigBase. Let me know if it looks good.
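A hedged sketch of what these accessors might look like on ConfigBase; the actual axlearn implementation may differ, for example in validation and error messages:

```python
from typing import Any, Sequence


class _ConfigBaseSketch:
    """Illustrative only; not the exact methods added to axlearn's ConfigBase."""

    def get_recursively(self, path: Sequence[str]) -> Any:
        # Follow each key in the path, e.g. ["model", "decoder", "transformer"].
        cfg = self
        for key in path:
            cfg = getattr(cfg, key)
        return cfg

    def set_recursively(self, path: Sequence[str], *, value: Any) -> None:
        # Navigate to the parent of the final key, then overwrite that key.
        parent = self.get_recursively(path[:-1])
        setattr(parent, path[-1], value)
```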

Contributor:

I wonder if an alternative (which aims to simplify the ConfigBase API) is to do something similar to Python's sorted: we allow utils.get_recursively to take a value_fn:

```python
# Default behavior is to use key lookup:
utils.get_recursively(..., value_fn=lambda k, v: v[k])

# Custom behavior can be attribute lookup:
utils.get_recursively(..., value_fn=lambda k, v: getattr(v, k))
```

A benefit is that other non-config instances can also leverage get_recursively.

Contributor:

Hi @markblee, maybe we can do this in a follow-up PR?

@apoorvtintin (Contributor Author) commented Jan 15, 2025:

Added a more flexible PartitionSpecModifier that can modify multiple partition_spec attributes in a single module config.
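As a rough sketch of how such a modifier might be configured (the field name `partition_specs`, the config path, and the attribute names below are assumptions drawn from this thread, not the merged API):

```python
# Hypothetical usage of the more flexible PartitionSpecModifier; names and paths
# are assumptions, not the exact merged code.
from axlearn.common.trainer_config_modifier import PartitionSpecModifier

partition_spec_modifier = PartitionSpecModifier.default_config().set(
    partition_specs={
        # Several partition_spec attributes on a single module config can be set at once.
        "model.decoder.emb.token_emb": {
            "param_partition_spec": ("model", None),  # Vocab-parallel embedding.
        },
    },
)
```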

@apoorvtintin apoorvtintin force-pushed the mainline-upstream-boilerplate branch 2 times, most recently from 45c7df1 to 8807856 on January 17, 2025 at 01:17
@kelvin-zou (Contributor) left a comment:

Mostly LGTM, some minor comments.

axlearn/common/trainer_config_modifier.py (2 review threads resolved)
@apoorvtintin apoorvtintin force-pushed the mainline-upstream-boilerplate branch from 8807856 to 25510d6 on January 22, 2025 at 01:39
@apoorvtintin apoorvtintin force-pushed the mainline-upstream-boilerplate branch 2 times, most recently from eec33eb to 86bafa8 on January 23, 2025 at 05:40
@apoorvtintin apoorvtintin force-pushed the mainline-upstream-boilerplate branch from 780d424 to b6ae638 on February 5, 2025 at 19:01
@apoorvtintin apoorvtintin requested a review from ruomingp February 5, 2025 19:03
@kelvin-zou kelvin-zou requested a review from hanzhi713 February 5, 2025 19:52
```
@@ -151,6 +155,72 @@ def get_trainer_kwargs(

rope_theta = ROPE_THETA[version]

# TRN2 specific model config modifications
```
Contributor:

Can we move all the modifications into a helper function? Something like:

```python
def _generate_trainium2_custom_configs():
    ...
    return trn2_model_modifications, trn2_partition_spec_modifications
```

apoorvtintin (Contributor Author):

Addressed it; let me know if it looks good.

@apoorvtintin apoorvtintin force-pushed the mainline-upstream-boilerplate branch from b6ae638 to f10ebd0 on February 5, 2025 at 22:44
@kelvin-zou (Contributor) left a comment:

Approved overall; please address @hanzhi713's comment.

@apoorvtintin apoorvtintin force-pushed the mainline-upstream-boilerplate branch from f10ebd0 to 37986ce on February 6, 2025 at 04:53
@apoorvtintin (Contributor Author):

Updated the PR to address the failing tests. Can we re-trigger the CI, please? Thank you.

@apoorvtintin apoorvtintin force-pushed the mainline-upstream-boilerplate branch from 37986ce to 4cda0dd on February 6, 2025 at 22:13
@apoorvtintin (Contributor Author):

@kelvin-zou @hanzhi713 Thank you both for the review; I have addressed all the comments. Let's merge this if it looks good.

@kelvin-zou (Contributor) left a comment:

Thank you!

@apoorvtintin (Contributor Author):

Hello @ruomingp, can I please get an approval if this PR looks good? It looks like that is needed for the PR to merge.

@kelvin-zou (Contributor):

Maybe @markblee can take a second look for the final round?

@markblee (Contributor) left a comment:

Approve to unblock.

axlearn/common/config.py (review thread resolved)
@markblee (Contributor) commented Feb 7, 2025:

(Looks like we still need @ruomingp's approval to unblock the 'requested changes'.)

@apoorvtintin apoorvtintin force-pushed the mainline-upstream-boilerplate branch from 4cda0dd to 53472f2 on February 10, 2025 at 01:23
@apoorvtintin (Contributor Author):

Addressed the final comment from @markblee. Thanks, everyone! Can we run the CI again and merge this?

@ruomingp ruomingp added this pull request to the merge queue Feb 10, 2025
@ruomingp ruomingp removed this pull request from the merge queue due to a manual request Feb 10, 2025