The current DPO reward model returns a hardcoded value of -11 for empty or very short completions. It was chosen as the nearest integer whose exponential, exp(-11), is below the uniform probability across all logits (1/50257). Reference: #138
...
# Check if completion is
if completion.strip() == '' or len(completion) <= 5:
    return -11  # exp(-11)=1.67e-5 < 2e-5=1/50257 (typical vocab size)
...
The vocabulary size of 50257 is assumed to be typical, but other models and tokenizers may use a different size.
Ideally, this value would be computed automatically from the model, e.g. from 1 / model.vocab_size, rather than being a hard-coded number.
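A minimal sketch of how the fallback could be derived instead of hard-coded. The function name `uniform_log_prob_floor` is hypothetical and not part of openvalidators. How the vocabulary size is obtained is also an assumption; with Hugging Face models it would typically come from something like `model.config.vocab_size`.

```python
import math


def uniform_log_prob_floor(vocab_size: int) -> int:
    """Return the nearest integer at or below log(1 / vocab_size).

    Hypothetical helper: mirrors the reasoning behind the hard-coded -11,
    i.e. an integer log-probability whose exponential is below the
    uniform probability over the vocabulary.
    """
    if vocab_size < 2:
        raise ValueError("vocab_size must be at least 2")
    # log(1 / V) == -log(V); floor moves toward negative infinity,
    # so exp(result) <= 1 / V.
    return math.floor(-math.log(vocab_size))


# Assumed usage: vocab_size would come from the reward model's config.
penalty = uniform_log_prob_floor(50257)
print(penalty)  # -11, matching the current hard-coded value
```

With this approach the value for the 50257-token vocabulary stays at -11, while models with a different vocabulary size would get a penalty scaled to their own uniform probability.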
The code snippet above is from openvalidators/reward/dpo.py.