feat: Add ALCF/examples/finetune_llama3/* #74

Open · saforem2 wants to merge 9 commits into base: main

Conversation

saforem2 (Member) commented:

Copilot Generated Summary:

This pull request includes several updates to the ALCF/examples/finetune_llama3 directory, focusing on adding a new README file, configuration files, and a shell script for fine-tuning the Llama3 model. Additionally, there is a minor update in the megatron/checkpointing.py file to ensure directory creation when saving the learning rate state dictionary.

Documentation and setup:

  • ALCF/examples/finetune_llama3/README.md: Added comprehensive instructions for setting up the environment, installing dependencies, downloading data, and converting Hugging Face checkpoints for fine-tuning Llama3.

Configuration files:

Shell script:

  • ALCF/examples/finetune_llama3/finetune_llama.sh: Added a script for setting up the environment, configuring DeepSpeed, and running the fine-tuning process for Llama3. This script includes logic for converting Hugging Face models to Megatron-Deepspeed format and vice versa.

Minor update:

  • megatron/checkpointing.py: Ensured the parent directory is created if it does not exist when saving the learning rate state dictionary.
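
  A minimal sketch of the kind of guard described above, under assumed names (save_lr_state, opt_param_scheduler, and lr_state_path are illustrative; the actual identifiers in megatron/checkpointing.py may differ):

  from pathlib import Path
  import torch

  def save_lr_state(opt_param_scheduler, lr_state_path):
      # Create the parent directory if it does not already exist, then save the
      # learning-rate scheduler state dict.
      path = Path(lr_state_path)
      path.parent.mkdir(parents=True, exist_ok=True)
      torch.save(opt_param_scheduler.state_dict(), path)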

@saforem2 changed the title feat: Add ALCF/examples/finetune_llama3/* on Jan 15, 2025
Comment on lines 130 to +135

for name, param in hf_auto_model.named_parameters():
    hf_model[name] = param.clone()
    log.info(name)
hf_model = {}
for name, submodule in hf_auto_model.named_children():
    for pname, param in submodule.named_parameters():
        logger.info(f'[{name}.{pname}] shape={param.shape}')
        hf_model[f'{name}.{pname}'] = param.clone()
saforem2 (Member, Author) commented:

There was an issue caused by the following block:

for name, param in hf_auto_model.named_parameters():

which failed to capture hf_auto_model.lm_head.weight and therefore prevented the checkpoint from being converted successfully.

Replacing this block with

>>> for name, submodule in lmodel.named_children():
...     for pname, param in submodule.named_parameters():
...         named_submods[f'{name}.{pname}'] = param.clone()

fixes this issue, as shown explicitly below:

>>> from transformers import AutoModelForCausalLM, LlamaConfig, AutoTokenizer, LlamaForCausalLM, AutoConfig
>>> lmodel = AutoModelForCausalLM.from_pretrained('meta-llama/Llama-3.2-1B')
>>> named_params = {}
>>> for name, param in lmodel.named_parameters():
...     named_params[name] = param.clone()

>>> named_submods = {}
>>> for name, submodule in lmodel.named_children():
...     for pname, param in submodule.named_parameters():
...         named_submods[f'{name}.{pname}'] = param.clone()

>>> len(named_submods.keys())
147

>>> len(named_params.keys())
146

>>> list(named_submods.keys())[-3:]
['model.layers.15.post_attention_layernorm.weight',
 'model.norm.weight',
 'lm_head.weight']

>>> list(named_params.keys())[-3:]
['model.layers.15.input_layernorm.weight',
 'model.layers.15.post_attention_layernorm.weight',
 'model.norm.weight']
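
The likely explanation (my reading, not something verified in this PR) is that Llama-3.2-1B ties lm_head.weight to model.embed_tokens.weight, and nn.Module.named_parameters() de-duplicates shared parameters by default, so the tied tensor only shows up once under its first name. Continuing the session above, these checks (outputs omitted) should confirm that reading:

>>> # Both of these should be True if the missing key comes from weight tying:
>>> lmodel.config.tie_word_embeddings
>>> lmodel.lm_head.weight is lmodel.model.embed_tokens.weight
>>> # Asking named_parameters() not to de-duplicate should surface lm_head.weight as well:
>>> len(dict(lmodel.named_parameters(remove_duplicate=False)))
>>> sorted(set(named_submods) - set(named_params))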

Comment on lines +167 to +174
self.tokenizer = get_tokenizer()
if args.tokenizer_type == 'HFTokenizer':
    self.hf_tokenizer = get_hf_tokenizer(args.tokenizer_model)
    self.token_vocab = len(self.hf_tokenizer)
else:
    self.hf_tokenizer = None
    assert self.tokenizer is not None
    self.token_vocab = self.tokenizer.vocab_size
saforem2 (Member, Author) commented:

Mismatch between self.tokenizer.vocab_size and hf_w.shape[0] when using Llama3 tokenizers.

This discrepancy causes the following assertion:

assert hf_w.shape[0] == self.padded_vocab_size

to fail since hf_w.shape[0] = 128256 but self.tokenizer.vocab_size = 128000.

Explicitly:

>>> type(self.hf_tokenizer)
<class 'transformers.tokenization_utils_fast.PreTrainedTokenizerFast'>
>>> type(self.tokenizer)
<class 'megatron.tokenizer.tokenizer._HFTokenizer'>
>>> self.tokenizer.vocab_size
128000
>>> len(self.hf_tokenizer)
128256

So, setting self.token_vocab = len(self.hf_tokenizer) (instead of self.tokenizer.vocab_size) seems to resolve this issue.
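
This matches the usual Hugging Face convention (my understanding, not something stated in the PR): tokenizer.vocab_size counts only the base vocabulary, while len(tokenizer) also counts added/special tokens, which is presumably where the extra 256 entries come from for Llama3. Continuing with the objects above (outputs omitted):

>>> # The gap between 128000 and 128256 should be accounted for by added tokens:
>>> len(self.hf_tokenizer.get_added_vocab())
>>> self.hf_tokenizer.vocab_size + len(self.hf_tokenizer.get_added_vocab()) == len(self.hf_tokenizer)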
