Trainer Updating Only One Adapter During Fine-Tuning with Multiple Adapters and a Router #1592

Open
Hazem-Abbas opened this issue Jan 29, 2025 · 0 comments

Hazem-Abbas commented Jan 29, 2025

Issue Summary: When fine-tuning Meta-Llama-3.1-8B with Unsloth, only one LoRA adapter is updated during training, even though the model has three adapters plus an additional router component whose parameters require gradients.

Steps to Reproduce:

  • Install Unsloth
%%capture
!pip install unsloth
# Also get the latest nightly Unsloth!
!pip install --force-reinstall --no-cache-dir --no-deps git+https://github.com/unslothai/unsloth.git
  • Load the model and tokenizer using FastLanguageModel.from_pretrained method.
from unsloth import FastLanguageModel

# MAX_SEQ_LENGTH, DTYPE and LOAD_IN_4BIT are set earlier in the notebook (values not shown in the report)
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Meta-Llama-3.1-8B",
    max_seq_length = MAX_SEQ_LENGTH,
    dtype = DTYPE,
    load_in_4bit = LOAD_IN_4BIT,
    # token = "hf_...", # use one if using gated models like meta-llama/Llama-2-7b-hf
);

EOS_TOKEN = tokenizer.eos_token # Must add EOS_TOKEN
  • Load Adapters and add a Router (a sketch of this step is shown below)
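The report does not include the code for this step, so here is a minimal sketch of what it likely looks like, assuming the standard Unsloth LoRA setup plus PEFT's add_adapter and a hypothetical Router module (the names router, net, adapter_1 and adapter_2 are inferred from the weight checks further down):

import torch.nn as nn
from peft import LoraConfig

# Create the first ("default") LoRA adapter the usual Unsloth way
model = FastLanguageModel.get_peft_model(
    model,
    r = 16,
    target_modules = ["gate_proj", "up_proj", "down_proj"],
    lora_alpha = 16,
    lora_dropout = 0,
    bias = "none",
    use_gradient_checkpointing = "unsloth",
    random_state = 3407,
)

# Add two more LoRA adapters on top of the default one
extra_config = LoraConfig(r=16, lora_alpha=16, lora_dropout=0, bias="none",
                          target_modules=["gate_proj", "up_proj", "down_proj"])
model.add_adapter("adapter_1", extra_config)
model.add_adapter("adapter_2", extra_config)

# Hypothetical router: a small linear gate over the three adapters, attached to
# every MLP block (matches the .mlp.router / .net[0] access in the checks below)
class Router(nn.Module):
    def __init__(self, hidden_size, num_adapters=3):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(hidden_size, num_adapters),
                                 nn.Softmax(dim=-1))

    def forward(self, x):
        return self.net(x)

for layer in model.base_model.model.model.layers:
    layer.mlp.router = Router(model.config.hidden_size).to(model.device)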

  • Configure the model to have multiple adapters and an additional component that requires gradients.

model.print_trainable_parameters()
# Make the Router also trainable
for index in range(len(model.base_model.model.model.layers)):
    for param in model.base_model.model.model.layers[index].mlp.router.parameters():
        param.requires_grad = True

model.print_trainable_parameters()
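Note (an assumption, not stated in the report): PEFT normally marks only the currently active adapter as trainable, so if all three adapters are meant to be trained jointly, their LoRA weights may also need requires_grad enabled explicitly, for example:

# Sketch / assumption: make every adapter's LoRA weights trainable,
# since PEFT leaves non-active adapters frozen by default
for name, param in model.named_parameters():
    if "lora_A" in name or "lora_B" in name:
        param.requires_grad = True

model.print_trainable_parameters()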
  • Copy weights before fine-tuning.
import copy

# Sanity check: keep copies of the initial weights so they can be compared after training
router_0 = copy.deepcopy(model.base_model.model.model.layers[0].mlp.router)

gate_lora_a_0_ada_0 = copy.deepcopy(model.base_model.model.model.layers[0].mlp.gate_proj.lora_A["default"])
gate_lora_a_0_ada_1 = copy.deepcopy(model.base_model.model.model.layers[0].mlp.gate_proj.lora_A["adapter_1"])
gate_lora_a_0_ada_2 = copy.deepcopy(model.base_model.model.model.layers[0].mlp.gate_proj.lora_A["adapter_2"])
  • Perform the fine-tuning process.
from transformers import TrainingArguments
from trl import SFTTrainer
from unsloth import is_bfloat16_supported

training_arguments = TrainingArguments(per_device_train_batch_size=1,
                                       gradient_accumulation_steps=4,
                                       warmup_ratio=0.1,
                                       # num_train_epochs=3, # Set this for 1 full training run.
                                       max_steps=60,
                                       learning_rate=2e-5,
                                       fp16=not is_bfloat16_supported(),
                                       bf16=is_bfloat16_supported(),
                                       logging_steps=2,
                                       optim="adamw_8bit",
                                       weight_decay=0.01,
                                       lr_scheduler_type="linear",
                                       seed=3407,
                                       output_dir="outputs",
                                       report_to="none", # Use this for WandB etc
                                       )

trainer = SFTTrainer(model=model,
                     tokenizer=tokenizer,
                     train_dataset=hybrid_dataset,
                     dataset_text_field="text",
                     max_seq_length=MAX_SEQ_LENGTH,
                     dataset_num_proc=2,
                     packing=False, # packing=True can make training 5x faster for short sequences.
                     # gradient_checkpointing=True,
                     args=training_arguments,)

trainer_stats = trainer.train()
  • Compare the copied weights with the trained weights.
import torch

# Each check counts how many weight entries are unchanged (True) vs. changed (False)
torch.unique((router_0.net[0].weight == model.base_model.model.model.layers[0].mlp.router.net[0].weight), return_counts=True)
torch.unique((gate_lora_a_0_ada_0.weight == model.base_model.model.model.layers[0].mlp.gate_proj.lora_A["default"].weight), return_counts=True)
torch.unique((gate_lora_a_0_ada_1.weight == model.base_model.model.model.layers[0].mlp.gate_proj.lora_A["adapter_1"].weight), return_counts=True)
torch.unique((gate_lora_a_0_ada_2.weight == model.base_model.model.model.layers[0].mlp.gate_proj.lora_A["adapter_2"].weight), return_counts=True)
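Equivalently, torch.equal can be used for the same sanity check (True means the tensor is unchanged since before training):

layer0_mlp = model.base_model.model.model.layers[0].mlp
print(torch.equal(router_0.net[0].weight, layer0_mlp.router.net[0].weight))
print(torch.equal(gate_lora_a_0_ada_0.weight, layer0_mlp.gate_proj.lora_A["default"].weight))
print(torch.equal(gate_lora_a_0_ada_1.weight, layer0_mlp.gate_proj.lora_A["adapter_1"].weight))
print(torch.equal(gate_lora_a_0_ada_2.weight, layer0_mlp.gate_proj.lora_A["adapter_2"].weight))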

Expected Behavior: All three adapters and the additional component should receive gradient updates during fine-tuning.

Observed Behavior: Only one of the adapters is updated; the other two adapters and the router receive no gradient updates.
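A diagnostic sketch (not part of the original report) that can confirm this without a full training run is to do a single forward/backward pass on one batch and print which trainable parameters actually receive a gradient:

# Diagnostic sketch: one manual forward/backward pass, then report which
# trainable parameters received a non-zero gradient
batch = next(iter(trainer.get_train_dataloader()))
batch = {k: v.to(model.device) for k, v in batch.items()}
loss = model(**batch).loss
loss.backward()

for name, param in model.named_parameters():
    if param.requires_grad:
        has_grad = param.grad is not None and param.grad.abs().sum().item() > 0
        print(f"{name}: {'got gradient' if has_grad else 'NO gradient'}")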

Environment:
Unsloth version: 2025.1.7
PyTorch version: 2.5.1+cu121
Python version: 3.10.12
