Replies: 2 comments 3 replies
-
Interesting, thanks for the comment. The reference model's weights shouldn't actually get updated, though, since only the policy model's parameters are passed to the optimizer:

```python
optimizer = torch.optim.AdamW(policy_model.parameters(), lr=5e-6, weight_decay=0.01)
...
loss.backward()   # Calculate loss gradients
optimizer.step()  # Update model weights using loss gradients
```

But computing gradients for the reference model is still bad from an efficiency perspective. I believe that the following change should improve it.

Before:

```python
def compute_dpo_loss_batch(batch, policy_model, reference_model, beta):
    """Compute the DPO loss on an input batch"""

    # where policy_model(batch["chosen"]) are the logits
    policy_chosen_log_probas = compute_logprobs(
        logits=policy_model(batch["chosen"]),
        labels=batch["chosen"],
        selection_mask=batch["chosen_mask"]
    )
    policy_rejected_log_probas = compute_logprobs(
        logits=policy_model(batch["rejected"]),
        labels=batch["rejected"],
        selection_mask=batch["rejected_mask"]
    )
    ref_chosen_log_probas = compute_logprobs(
        logits=reference_model(batch["chosen"]),
        labels=batch["chosen"],
        selection_mask=batch["chosen_mask"]
    )
    ref_rejected_log_probas = compute_logprobs(
        logits=reference_model(batch["rejected"]),
        labels=batch["rejected"],
        selection_mask=batch["rejected_mask"]
    )
```

After:

```python
def compute_dpo_loss_batch(batch, policy_model, reference_model, beta):
    """Compute the DPO loss on an input batch"""

    # where policy_model(batch["chosen"]) are the logits
    policy_chosen_log_probas = compute_logprobs(
        logits=policy_model(batch["chosen"]),
        labels=batch["chosen"],
        selection_mask=batch["chosen_mask"]
    )
    policy_rejected_log_probas = compute_logprobs(
        logits=policy_model(batch["rejected"]),
        labels=batch["rejected"],
        selection_mask=batch["rejected_mask"]
    )
    with torch.no_grad():
        ref_chosen_log_probas = compute_logprobs(
            logits=reference_model(batch["chosen"]),
            labels=batch["chosen"],
            selection_mask=batch["chosen_mask"]
        )
        ref_rejected_log_probas = compute_logprobs(
            logits=reference_model(batch["rejected"]),
            labels=batch["rejected"],
            selection_mask=batch["rejected_mask"]
        )
```

Could you give this a try?
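As a side note (not part of the patch above), here is a minimal sketch of a complementary safeguard, assuming the same `policy_model` and `reference_model` objects as in the notebook: freezing the reference model's parameters guarantees that no gradient graph is built for them, independent of the `torch.no_grad()` context.

```python
import torch

# Sketch of an additional safeguard (assumption, not the notebook's code):
# freeze the reference model so autograd never tracks its parameters.
reference_model.eval()                       # fixed eval-mode behavior for the frozen model
for param in reference_model.parameters():
    param.requires_grad_(False)              # no gradients will be stored for these weights

# Only the policy model's parameters are registered with the optimizer,
# so optimizer.step() can only ever change the policy model.
optimizer = torch.optim.AdamW(policy_model.parameters(), lr=5e-6, weight_decay=0.01)
```

With the parameters frozen, the `torch.no_grad()` block is no longer the only thing preventing gradient tracking, but it is still useful for keeping activation memory down.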
-
Yes! Gradients are no longer computed for the reference model, and the memory requirements are drastically reduced. This is a good patch that could be merged into master. On another note (not trying to mix topics here), you mentioned somewhere that you intend to also implement RLHF in the future. Is that still the plan?
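For anyone who wants to quantify the savings, here is a rough sketch of a measurement on a CUDA device; the `beta` value is a placeholder, and it assumes `compute_dpo_loss_batch` returns the scalar loss for the batch (its return value isn't shown above).

```python
import torch

# Rough sketch for measuring peak GPU memory of one DPO forward/backward pass
# (assumes the policy_model / reference_model / batch objects from the notebook;
# beta=0.1 is a placeholder and the scalar-loss return value is an assumption).
torch.cuda.reset_peak_memory_stats()

loss = compute_dpo_loss_batch(batch, policy_model, reference_model, beta=0.1)
loss.backward()

peak_gb = torch.cuda.max_memory_allocated() / 1024**3
print(f"Peak GPU memory for one DPO step: {peak_gb:.2f} GB")
```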
-
I have noticed that the weights of the reference model are being updated during DPO training in ch07, section 04_preference-tuning-with-dpo. You can see that, for example, by looking at reference_model.out_head.weight.grad, which is not None.
My understanding was that the reference model does not get gradient updates and that only the policy model is being changed. If that were the case, there would be no need to compute gradients for the reference model, but it seems like they are computed nonetheless.
Could you please clarify why this is the case?
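For reference, the check I mean looks roughly like this, run with the original (pre-fix) code; the `beta` value is a placeholder, and I'm assuming `compute_dpo_loss_batch` returns the scalar loss for the batch.

```python
# Sketch of the check described above, run with the original (pre-fix) code
# (beta=0.1 is a placeholder; the scalar-loss return value is an assumption).
loss = compute_dpo_loss_batch(batch, policy_model, reference_model, beta=0.1)
loss.backward()

# With the original code this prints a tensor rather than None:
print(reference_model.out_head.weight.grad)

# Count how many reference-model parameter tensors received gradients:
num_with_grad = sum(p.grad is not None for p in reference_model.parameters())
print(f"{num_with_grad} reference model parameter tensors have gradients")
```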