
After using peft, the performance metrics decreased #2336

Open
2 of 4 tasks
KQDtianxiaK opened this issue Jan 20, 2025 · 1 comment

Comments

@KQDtianxiaK

System Info

Sorry, I just wrapped up my previous question and already have to ask a new one. I am using the DNABERT-2 model, whose original structure is as follows:

BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(4096, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0): BertLayer(
          (attention): BertUnpadAttention(
            (self): BertUnpadSelfAttention(
              (dropout): Dropout(p=0.0, inplace=False)
              (Wqkv): Linear(in_features=768, out_features=2304, bias=True)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
          )
          (mlp): BertGatedLinearUnitMLP(
            (gated_layers): Linear(in_features=768, out_features=6144, bias=False)
            (act): GELU(approximate='none')
            (wo): Linear(in_features=3072, out_features=768, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
            (layernorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          )
        )
        ......
        (11): BertLayer(
          (attention): BertUnpadAttention(
            (self): BertUnpadSelfAttention(
              (dropout): Dropout(p=0.0, inplace=False)
              (Wqkv): Linear(in_features=768, out_features=2304, bias=True)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
          )
          (mlp): BertGatedLinearUnitMLP(
            (gated_layers): Linear(in_features=768, out_features=6144, bias=False)
            (act): GELU(approximate='none')
            (wo): Linear(in_features=3072, out_features=768, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
            (layernorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          )
        )
      )
    )
    (pooler): BertPooler(
      (dense): Linear(in_features=768, out_features=768, bias=True)
      (activation): Tanh()
    )
  )
  (dropout): Dropout(p=0.1, inplace=False)
  (classifier): Linear(in_features=768, out_features=157, bias=True)
)

On three different classification tasks, I used OFT, LNTuning, and other methods to add fine-tuning modules to linear layers such as ['Wqkv', 'wo', 'gated_layers'], or to other parts such as ['LayerNorm'], and trained with supervision. During training, the logs printed at each logging_steps look normal: the loss keeps decreasing and the performance metrics keep rising. However, when I load the saved model weights and evaluate on the independent test set, the performance is very poor; the results are essentially the same as with no training at all. The following message is printed when loading the model:

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at model/DNABERT2-117M and are newly initialized: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'classifier.bias', 'classifier.weight']
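
(For reference, a minimal sketch for checking which tensors actually end up in a saved checkpoint; it assumes the checkpoint directory contains the adapter_model.safetensors file that peft writes, and the path below is hypothetical:)

import os

from safetensors.torch import load_file

# Hypothetical checkpoint path; replace with a real checkpoint directory.
checkpoint_dir = "results/checkpoint-500"
print(os.listdir(checkpoint_dir))

# List the saved tensors to see whether the classifier/pooler weights
# were stored alongside the adapter weights.
adapter_file = os.path.join(checkpoint_dir, "adapter_model.safetensors")
if os.path.exists(adapter_file):
    state_dict = load_file(adapter_file)
    print([key for key in state_dict if "classifier" in key or "pooler" in key])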

Who can help?

@BenjaminBossan

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder
  • My own task or dataset (give details below)

Reproduction

Here is the code I use to load the model from each path that holds the best model weights:

import os

import numpy as np
from peft import AutoPeftModelForSequenceClassification
from transformers import DataCollatorWithPadding, Trainer

# tokenizer, training_args, and eval_predict are defined earlier in the script.

def load_best_model_for_test(checkpoint_dir, fold_number):
    fold_dir = os.path.join(checkpoint_dir)
    checkpoint_folders = [d for d in os.scandir(fold_dir) if d.is_dir() and d.name.startswith('checkpoint')]
    # Pick the most recently written checkpoint as the "best" model.
    best_model_dir = max(checkpoint_folders, key=lambda d: os.path.getmtime(d.path), default=None)
    best_model_path = best_model_dir.path

    model = AutoPeftModelForSequenceClassification.from_pretrained(best_model_path, trust_remote_code=True, num_labels=2)
    return model

def evaluate_on_test_set(models, test_dataset):
    test_results = []
    for model in models:
        trainer = Trainer(
            model=model,
            args=training_args,
            eval_dataset=test_dataset,
            data_collator=DataCollatorWithPadding(tokenizer=tokenizer),
            compute_metrics=eval_predict
        )
        metrics = trainer.evaluate()
        test_results.append(metrics)
    # Average each metric across all evaluated models.
    average_metrics = {key: np.mean([result[key] for result in test_results]) for key in test_results[0].keys()}
    return average_metrics
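
(A hypothetical usage sketch of the two helpers above, assuming one checkpoint directory per fold and an already tokenized test_dataset; the paths and number of folds are illustrative:)

# Hypothetical fold output directories; in the real script these come from the training loop.
fold_dirs = ["output/fold_1", "output/fold_2", "output/fold_3"]

models = [load_best_model_for_test(fold_dir, fold + 1) for fold, fold_dir in enumerate(fold_dirs)]
print(evaluate_on_test_set(models, test_dataset))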

However, when I did full-parameter supervised fine-tuning without peft, the final results on the independent test set were all normal. I have tried different tasks, different peft methods, different target modules, and the latest version of peft, and I still cannot solve the problem.
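
(Since the warning above says the pooler and classifier weights are newly initialized, one thing worth ruling out is whether the classification head is being saved and restored together with the adapter. Below is a minimal sketch, not my actual configuration, assuming OFTConfig from peft and illustrative hyperparameters; task_type=TaskType.SEQ_CLS and/or modules_to_save ask peft to store the head alongside the adapter weights:)

from peft import OFTConfig, TaskType, get_peft_model

# Illustrative configuration only; base_model is assumed to be the loaded
# BertForSequenceClassification shown above.
peft_config = OFTConfig(
    r=8,
    target_modules=["Wqkv", "wo", "gated_layers"],
    modules_to_save=["classifier"],  # save the newly initialized head with the adapter
    task_type=TaskType.SEQ_CLS,
)
peft_model = get_peft_model(base_model, peft_config)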

Expected behavior

Find out the cause and fix the problem

@githubnemo
Collaborator

githubnemo commented Jan 21, 2025

Hey, thanks for reporting the issue.

I could not reproduce this behavior with a simpler example. Can you try to modify my code to reproduce the behavior you're experiencing?

Here's the code I used to test saving/loading the model:

from peft import LoraConfig
from peft import get_peft_model, TaskType, PeftModel
from transformers import AutoModelForSequenceClassification, BertConfig
import sys
import torch

config = BertConfig.from_pretrained("zhihan1996/DNABERT-2-117M")
base_model = AutoModelForSequenceClassification.from_pretrained(
    'zhihan1996/DNABERT-2-117M', config=config, trust_remote_code=True)

if sys.argv[1] == 'train':
    lora_config = LoraConfig(r=3, target_modules='all-linear', task_type=TaskType.SEQ_CLS)
    peft_model = get_peft_model(base_model, lora_config)

    # simulate training by modifying classifier weights
    peft_model.classifier.weight.data = torch.ones_like(peft_model.classifier.weight)
    peft_model.save_pretrained('peft_model')
else:
    peft_model = PeftModel.from_pretrained(model=base_model, model_id='./peft_model', trust_remote_code=True)
    print(peft_model.classifier.weight.data)

You can run this script using python example.py train to create the adapter model and then python example.py test to test the loading behavior. Expected is that the modified classifier weights are shown to be all ones.
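
(As an optional check, one could extend the else branch of the script above with an assertion that the restored classifier weights are all ones, which is the expected outcome described above; this is only a sketch on top of that script:)

# Optional check for the 'test' branch: the classifier weights should be
# restored as all ones if saving and loading the head worked correctly.
assert torch.allclose(
    peft_model.classifier.weight.data,
    torch.ones_like(peft_model.classifier.weight.data),
), "classifier weights were not restored from the adapter checkpoint"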
