This repository contains an implementation of the LoRA and DoRA layers, as proposed in the original LoRA and DoRA papers. These layers are used in a Multi-Layer Perceptron (MLP) model.
LoRA is designed to reduce computational costs and memory usage during fine-tuning of large pre-trained models. By updating only a subset of parameters using low-rank matrices, LoRA allows efficient adaptation to specific tasks, especially when computational resources are limited.
- **Low-Rank Matrices:** In LoRA, two low-rank matrices, $A$ and $B$, are introduced. These matrices have far fewer parameters than the original weight matrix $W$. During fine-tuning, only these low-rank matrices are updated instead of the full weight matrix.
- **Weight Update:** The weight update in LoRA can be represented as $W' = W + \Delta W = W + BA$. Here, $B \in \mathbb{R}^{d \times r}$ and $A \in \mathbb{R}^{r \times k}$ with rank $r \ll \min(d, k)$, so the product $BA$ has the same shape as $W \in \mathbb{R}^{d \times k}$.
- **Dimensionality Reduction:** By using low-rank matrices, LoRA captures the essential adaptations in a lower-dimensional subspace, reducing the number of learnable parameters and improving training efficiency.
- **Efficiency:** The small number of parameters in $A$ and $B$ speeds up training and mitigates overfitting.
- **Applications:** LoRA is beneficial in transfer learning, where a pre-trained model needs quick adaptation to new tasks with limited data.
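As a concrete illustration, here is a minimal NumPy sketch of the LoRA forward pass. The matrix sizes, initialization scheme, and scaling factor `alpha / r` are illustrative assumptions, not the repository's actual implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r, alpha = 8, 6, 2, 4             # illustrative sizes: W is d x k, rank r

W = rng.standard_normal((d, k))         # frozen pretrained weight (never updated)
A = rng.standard_normal((r, k)) * 0.01  # trainable low-rank factor
B = np.zeros((d, r))                    # trainable; zero init so BA = 0 at start

def lora_forward(x):
    """Frozen base projection plus the scaled low-rank update."""
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.standard_normal(k)
# With B initialized to zero, the LoRA path contributes nothing yet:
assert np.allclose(lora_forward(x), W @ x)
```

In this toy case the trainable parameters drop from $d \cdot k = 48$ (full matrix) to $r(d + k) = 28$ (the two factors); the gap widens dramatically at real model sizes.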
DoRA extends the concept of LoRA by decomposing the pretrained weight matrix into a magnitude vector and a directional matrix. This allows the model to adapt more flexibly to new tasks by dynamically adjusting the low-rank matrices based on the current state of the training process, providing improved adaptability and efficiency.
In DoRA, the weight update is represented as:

$$W' = m \, \frac{V + \Delta V}{\| V + \Delta V \|_c} = m \, \frac{W_0 + BA}{\| W_0 + BA \|_c}$$

where:
- $W'$ is the updated weight matrix.
- $m$ is the learned magnitude vector.
- $V$ is the initial directional matrix.
- $\Delta V$ represents the update to the directional matrix $V$.
- $W_0$ is the initial pretrained weight matrix.
- $BA$ is the low-rank update applied to $W_0$.
- $\| \cdot \|_c$ denotes the vector-wise norm, taken across each column, used for normalization.
The magnitude vector $m$ is initialized from the pretrained weights as $m = \| W_0 \|_c$, i.e., each entry of $m$ is the norm of the corresponding column of $W_0$, so that the decomposition exactly reproduces $W_0$ at the start of training. During training, the low-rank matrices $A$ and $B$ update the directional component, while $m$ is learned separately to rescale each column.
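The column-wise norm used to initialize $m$ can be sketched as follows (the toy matrix is chosen purely for illustration):

```python
import numpy as np

# Toy pretrained weight: the columns are (3, 4) and (0, 1)
W0 = np.array([[3.0, 0.0],
               [4.0, 1.0]])

# m = ||W0||_c : one Euclidean norm per column
m = np.linalg.norm(W0, axis=0)
assert np.allclose(m, [5.0, 1.0])  # sqrt(9 + 16) = 5, sqrt(0 + 1) = 1
```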
- **Weights Updated:** As in LoRA, only the low-rank matrices $A$ and $B$ are updated (together with the magnitude vector $m$), and they are adjusted dynamically during training.
- **Improvement:** The key improvement of DoRA over LoRA lies in its ability to focus on directional adjustments while training the magnitude component separately. This separation can lead to more effective fine-tuning, as it mimics the nuanced adjustments observed in full fine-tuning (FT), potentially improving learning efficiency and stability.
- **Magnitude Vector $m$:**
  - The parameter $m$ is initialized based on the column-wise norm of the pretrained weight matrix $W_0$.
  - It allows the model to dynamically adjust the scale of each weight column in the combined weight matrix during training. This additional flexibility helps the model capture the relative importance of different features.
- **Directional Component:**
  - The directional component is calculated by normalizing, column by column, the sum of the original weights $W_0$ and the low-rank adaptation (LoRA) output $BA$.
  - This normalization separates direction from magnitude, so the low-rank update changes only the direction of each weight column.

The new weights for the linear layer are then calculated by scaling the directional component with the magnitude parameter $m$.
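Putting these pieces together, here is a minimal NumPy sketch of the DoRA weight computation. The sizes and initialization are illustrative assumptions rather than the repository's exact code:

```python
import numpy as np

rng = np.random.default_rng(1)
d, k, r = 5, 4, 2                       # illustrative sizes

W0 = rng.standard_normal((d, k))        # frozen pretrained weight
A = rng.standard_normal((r, k)) * 0.01  # trainable low-rank factor
B = np.zeros((d, r))                    # trainable; zero init => BA = 0 at start

# Magnitude vector: column-wise norm of the pretrained weights
m = np.linalg.norm(W0, axis=0)

def dora_weight():
    """W' = m * (W0 + BA) / ||W0 + BA||_c."""
    V = W0 + B @ A                              # combined directional component
    return m * (V / np.linalg.norm(V, axis=0))  # normalize columns, rescale by m

# At initialization (BA = 0) the decomposition reproduces W0 exactly:
assert np.allclose(dora_weight(), W0)
```

During training, updates to $A$ and $B$ rotate the column directions while $m$ alone controls their lengths, which is the separation DoRA exploits.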
The PEFT (Parameter-Efficient Fine-Tuning) package from Hugging Face offers efficient techniques for fine-tuning large pre-trained models with a focus on parameter-efficient methods. It supports various configurations, including LoRA (Low-Rank Adaptation), making it suitable for diverse tasks such as sequence-to-sequence learning. For more details, visit the official documentation.
Example Usage
```python
from peft import LoraConfig, get_peft_model, TaskType

# Define the LoRA configuration
lora_config = LoraConfig(
    r=32,                             # Rank: controls the dimensionality of the low-rank update
    lora_alpha=32,                    # Scaling factor for the LoRA updates
    target_modules=["q", "v"],        # Target only the attention query/value projections
    lora_dropout=0.05,                # Dropout rate for regularization
    bias="none",                      # Do not train bias terms
    task_type=TaskType.SEQ_2_SEQ_LM,  # Specify task type, e.g., sequence-to-sequence for FLAN-T5
)

# Apply the LoRA configuration to the original (pretrained) model
peft_model = get_peft_model(original_model, lora_config)
```
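To get a rough sense of the savings such a configuration buys, the back-of-the-envelope count below compares full fine-tuning of a single $d \times d$ projection with its LoRA factors at $r = 32$. The hidden size $d = 512$ is a hypothetical example, not a specific model's dimension:

```python
d, r = 512, 32               # hypothetical hidden size and LoRA rank
full_params = d * d          # updating the full projection matrix
lora_params = r * d + d * r  # A (r x d) plus B (d x r)

print(full_params, lora_params)   # 262144 32768
print(lora_params / full_params)  # 0.125 -> LoRA trains 12.5% of these weights
```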
This work has been widely influenced by the contributions of Sebastian Raschka, particularly through his detailed explanations and implementations in the following resources:
- LoRA and DoRA from Scratch: An in-depth article that explores the concepts of LoRA and DoRA, providing foundational knowledge and practical implementation tips.
- DoRA from Scratch GitHub Repository: A comprehensive repository containing the code and detailed instructions for implementing DoRA, as discussed in the article.
These resources have been instrumental in shaping the approach and implementation strategies presented in this work.