
Question on mathematical equivalence #2

Open
nlpfollower opened this issue Feb 2, 2025 · 1 comment
nlpfollower commented Feb 2, 2025

Hey, really like this idea!

I was experimenting with the mask a bit and was curious: have you ever tested the masked and unmasked forward passes for mathematical equivalence on Llama?

I tried running `unmasked = model(prompt + rejected, mask=None)` and then `masked = model(prompt + chosen + rejected, mask=block_mask)`, and compared the logits of `unmasked` against `masked[:prompt] + masked[prompt + chosen:]`. However, the logits for the rejected part differ in my experiments, at least with Llama. I saw the same discrepancy when using a 2D bool mask with SDPA instead of the block mask.
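To make the comparison concrete, here is a minimal sketch of the check at the raw attention level, with made-up lengths and `F.scaled_dot_product_attention` standing in for the full model forward (so positional encodings are deliberately out of the picture):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
P, C, R, D = 4, 3, 5, 8  # prompt / chosen / rejected lengths, head dim (made up)
x = torch.randn(P + C + R, D)  # stand-in for per-token q = k = v vectors

def causal(n):
    return torch.tril(torch.ones(n, n, dtype=torch.bool))

# Prefix-sharing mask over the packed sequence: causal everywhere,
# except that rejected tokens never attend to chosen tokens.
mask = causal(P + C + R)
mask[P + C:, P:P + C] = False

# "Unmasked" pass: plain causal attention over prompt + rejected only.
x_pr = torch.cat([x[:P], x[P + C:]])
unmasked = F.scaled_dot_product_attention(x_pr, x_pr, x_pr, attn_mask=causal(P + R))

# Masked pass over prompt + chosen + rejected, then slice out prompt + rejected.
masked = F.scaled_dot_product_attention(x, x, x, attn_mask=mask)
sliced = torch.cat([masked[:P], masked[P + C:]])

print(torch.allclose(unmasked, sliced, atol=1e-6))  # True: same key/value sets
```

At this level the two agree, since each surviving query attends to exactly the same set of key/value vectors in both passes. Note, though, that in the packed sequence the rejected tokens sit at absolute positions `prompt + chosen + j` rather than `prompt + j`, so anything position-dependent in the real model would have to be accounted for before comparing logits.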

It's possible this is an issue with my implementation. I'll see if I can reproduce it in your repo.

nlpfollower commented Feb 2, 2025

Okay, I was able to reproduce the experiment on a fork (nlpfollower#1), running single-rank Llama 3.1-8B.

In my original implementation I also tested the document mask (e.g. https://github.com/pytorch-labs/attention-gym/blob/main/attn_gym/masks/document_mask.py) with packing, i.e. the boring approach, and there was still a difference between the logits, though it was significantly smaller than with the prefix-sharing mask.
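For completeness, this is roughly how I set up the document mask, along the lines of the attention-gym example above; the lengths are made up, and the `create_block_mask` call assumes a recent PyTorch with flex attention:

```python
import torch
from torch.nn.attention.flex_attention import create_block_mask

# Made-up packing: three documents whose lengths sum to a multiple of 128,
# which lines up with create_block_mask's default block size.
lengths = torch.tensor([64, 32, 32])
document_id = torch.repeat_interleave(torch.arange(len(lengths)), lengths)

def document_causal(b, h, q_idx, kv_idx):
    # Causal within a document; no attention across document boundaries.
    return (q_idx >= kv_idx) & (document_id[q_idx] == document_id[kv_idx])

seq_len = int(lengths.sum())
block_mask = create_block_mask(
    document_causal, B=None, H=None, Q_LEN=seq_len, KV_LEN=seq_len,
    device="cpu",  # use "cuda" when running against the real model
)
```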

Do you see any potential issues with my code or this experiment?
