Trainer Updating Only One Adapter During Fine-Tuning with Multiple Adapters and a Router #1592

Open
Hazem-Abbas opened this issue Jan 29, 2025 · 0 comments

Hazem-Abbas commented Jan 29, 2025

Issue Summary: When fine-tuning Meta-Llama-3.1-8B with Unsloth, only one LoRA adapter is updated during training, even though the model has three adapters plus an additional router component whose parameters require gradients.

Steps to Reproduce:

  • Install Unsloth
%%capture
!pip install unsloth
# Also get the latest nightly Unsloth!
!pip install --force-reinstall --no-cache-dir --no-deps git+https://github.com/unslothai/unsloth.git
  • Load the model and tokenizer using FastLanguageModel.from_pretrained method.
from unsloth import FastLanguageModel

# MAX_SEQ_LENGTH, DTYPE and LOAD_IN_4BIT are set earlier in the notebook (values not shown in the report)
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Meta-Llama-3.1-8B",
    max_seq_length = MAX_SEQ_LENGTH,
    dtype = DTYPE,
    load_in_4bit = LOAD_IN_4BIT,
    # token = "hf_...", # use one if using gated models like meta-llama/Llama-2-7b-hf
);

EOS_TOKEN = tokenizer.eos_token # Must add EOS_TOKEN
  • Load Adapters and add a Router (a sketch of this step is shown below)
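The report does not include the code for this step, so here is a minimal sketch of what it likely looks like, assuming the standard Unsloth LoRA setup plus PEFT's add_adapter and a hypothetical Router module (the names router, net, adapter_1 and adapter_2 are inferred from the weight checks further down):

import torch.nn as nn
from peft import LoraConfig

# Create the first ("default") LoRA adapter the usual Unsloth way
model = FastLanguageModel.get_peft_model(
    model,
    r = 16,
    target_modules = ["gate_proj", "up_proj", "down_proj"],
    lora_alpha = 16,
    lora_dropout = 0,
    bias = "none",
    use_gradient_checkpointing = "unsloth",
    random_state = 3407,
)

# Add two more LoRA adapters on top of the default one
extra_config = LoraConfig(r=16, lora_alpha=16, lora_dropout=0, bias="none",
                          target_modules=["gate_proj", "up_proj", "down_proj"])
model.add_adapter("adapter_1", extra_config)
model.add_adapter("adapter_2", extra_config)

# Hypothetical router: a small linear gate over the three adapters, attached to
# every MLP block (matches the .mlp.router / .net[0] access in the checks below)
class Router(nn.Module):
    def __init__(self, hidden_size, num_adapters=3):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(hidden_size, num_adapters),
                                 nn.Softmax(dim=-1))

    def forward(self, x):
        return self.net(x)

for layer in model.base_model.model.model.layers:
    layer.mlp.router = Router(model.config.hidden_size).to(model.device)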

  • Configure the model to have multiple adapters and an additional component that requires gradients.

model.print_trainable_parameters()
# Make the Router also trainable
for index in range(len(model.base_model.model.model.layers)):
    for param in model.base_model.model.model.layers[index].mlp.router.parameters():
        param.requires_grad = True

model.print_trainable_parameters()
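Note (an assumption, not stated in the report): PEFT normally marks only the currently active adapter as trainable, so if all three adapters are meant to be trained jointly, their LoRA weights may also need requires_grad enabled explicitly, for example:

# Sketch / assumption: make every adapter's LoRA weights trainable,
# since PEFT leaves non-active adapters frozen by default
for name, param in model.named_parameters():
    if "lora_A" in name or "lora_B" in name:
        param.requires_grad = True

model.print_trainable_parameters()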
  • Copy weights before fine-tuning.
import copy

# Sanity check: keep copies of the initial weights so they can be compared after training
router_0 = copy.deepcopy(model.base_model.model.model.layers[0].mlp.router)

gate_lora_a_0_ada_0 = copy.deepcopy(model.base_model.model.model.layers[0].mlp.gate_proj.lora_A["default"])
gate_lora_a_0_ada_1 = copy.deepcopy(model.base_model.model.model.layers[0].mlp.gate_proj.lora_A["adapter_1"])
gate_lora_a_0_ada_2 = copy.deepcopy(model.base_model.model.model.layers[0].mlp.gate_proj.lora_A["adapter_2"])
  • Perform the fine-tuning process.
from transformers import TrainingArguments
from trl import SFTTrainer
from unsloth import is_bfloat16_supported

training_arguments = TrainingArguments(per_device_train_batch_size=1,
                                       gradient_accumulation_steps=4,
                                       warmup_ratio=0.1,
                                       # num_train_epochs=3, # Set this for 1 full training run.
                                       max_steps=60,
                                       learning_rate=2e-5,
                                       fp16=not is_bfloat16_supported(),
                                       bf16=is_bfloat16_supported(),
                                       logging_steps=2,
                                       optim="adamw_8bit",
                                       weight_decay=0.01,
                                       lr_scheduler_type="linear",
                                       seed=3407,
                                       output_dir="outputs",
                                       report_to="none", # Use this for WandB etc
                                       )

trainer = SFTTrainer(model=model,
                     tokenizer=tokenizer,
                     train_dataset=hybrid_dataset,
                     dataset_text_field="text",
                     max_seq_length=MAX_SEQ_LENGTH,
                     dataset_num_proc=2,
                     packing=False, # packing=True can make training 5x faster for short sequences.
                     # gradient_checkpointing=True,
                     args=training_arguments,)

trainer_stats = trainer.train()
  • Compare the copied weights with the trained weights.
import torch

# Each check counts how many weight entries are unchanged (True) vs. changed (False)
torch.unique((router_0.net[0].weight == model.base_model.model.model.layers[0].mlp.router.net[0].weight), return_counts=True)
torch.unique((gate_lora_a_0_ada_0.weight == model.base_model.model.model.layers[0].mlp.gate_proj.lora_A["default"].weight), return_counts=True)
torch.unique((gate_lora_a_0_ada_1.weight == model.base_model.model.model.layers[0].mlp.gate_proj.lora_A["adapter_1"].weight), return_counts=True)
torch.unique((gate_lora_a_0_ada_2.weight == model.base_model.model.model.layers[0].mlp.gate_proj.lora_A["adapter_2"].weight), return_counts=True)
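Equivalently, torch.equal can be used for the same sanity check (True means the tensor is unchanged since before training):

layer0_mlp = model.base_model.model.model.layers[0].mlp
print(torch.equal(router_0.net[0].weight, layer0_mlp.router.net[0].weight))
print(torch.equal(gate_lora_a_0_ada_0.weight, layer0_mlp.gate_proj.lora_A["default"].weight))
print(torch.equal(gate_lora_a_0_ada_1.weight, layer0_mlp.gate_proj.lora_A["adapter_1"].weight))
print(torch.equal(gate_lora_a_0_ada_2.weight, layer0_mlp.gate_proj.lora_A["adapter_2"].weight))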

Expected Behavior: All three adapters and the additional component should receive gradient updates during fine-tuning.

Observed Behavior: Only one of the adapters is updated; the other two adapters and the router receive no gradient updates.
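A diagnostic sketch (not part of the original report) that can confirm this without a full training run is to do a single forward/backward pass on one batch and print which trainable parameters actually receive a gradient:

# Diagnostic sketch: one manual forward/backward pass, then report which
# trainable parameters received a non-zero gradient
batch = next(iter(trainer.get_train_dataloader()))
batch = {k: v.to(model.device) for k, v in batch.items()}
loss = model(**batch).loss
loss.backward()

for name, param in model.named_parameters():
    if param.requires_grad:
        has_grad = param.grad is not None and param.grad.abs().sum().item() > 0
        print(f"{name}: {'got gradient' if has_grad else 'NO gradient'}")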

Environment:
Unsloth version: 2025.1.7
PyTorch version: 2.5.1+cu121
Python version: 3.10.12
