Why pad to same length in Ch07-04, Preference Tuning with DPO #476

Answered by rasbt
EricTay1997 asked this question in Q&A

That's a good question. It's been quite some time since I implemented this notebook, but if I remember correctly, padding to the same length was mainly for convenience in the data loading utilities. The padding tokens also shouldn't have any effect here, since we ignore them in the loss computation:

    if selection_mask is not None:
        mask = selection_mask[:, 1:].clone()

        # Apply the mask to filter out padding tokens
        selected_log_probs = selected_log_probs * mask

        # Calculate the average log probability excluding padding tokens
        # selected_log_probs has shape (batch_size, num_tokens); averaging over
        # the token dimension gives a result of shape (batch_size,)
        avg_log_prob = selected_log_probs.sum(-1) / mask.sum(-1)
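
To make the "padding has no effect" point concrete, here is a minimal, self-contained sketch (not the notebook's exact code) that computes the masked average log probability for a toy batch twice: once without padding and once with two extra padding positions appended and masked out. The two results are identical up to floating-point error. The helper name masked_avg_logprob, the toy vocabulary, and the pad token ID are illustrative assumptions.

    import torch
    import torch.nn.functional as F

    torch.manual_seed(123)

    vocab_size, pad_token_id = 10, 0
    logits = torch.randn(1, 5, vocab_size)          # (batch_size, num_tokens, vocab_size)
    labels = torch.randint(1, vocab_size, (1, 5))   # real (non-padding) token IDs

    def masked_avg_logprob(logits, labels, selection_mask):
        # Shift so that logits at position t score the label at position t+1
        log_probs = F.log_softmax(logits[:, :-1, :], dim=-1)
        selected = torch.gather(
            log_probs, dim=-1, index=labels[:, 1:].unsqueeze(-1)
        ).squeeze(-1)
        mask = selection_mask[:, 1:].float()
        # Zero out padding positions and average over the real tokens only
        return (selected * mask).sum(-1) / mask.sum(-1)

    # 1) No padding: every position counts
    mask = torch.ones_like(labels)
    print(masked_avg_logprob(logits, labels, mask))

    # 2) Same sequence with two padding positions appended and masked out
    pad_logits = torch.cat([logits, torch.randn(1, 2, vocab_size)], dim=1)
    pad_labels = torch.cat([labels, torch.full((1, 2), pad_token_id)], dim=1)
    pad_mask = torch.cat([mask, torch.zeros(1, 2, dtype=torch.long)], dim=1)
    print(masked_avg_logprob(pad_logits, pad_labels, pad_mask))  # same value as above

Because the mask zeros out the padded positions and the denominator counts only the real tokens, the averaged log probability, and therefore the DPO loss, is unchanged by how much padding is added.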
