LlamaGuard Only Responds with S4 when Ages are In Prompt #74
Comments
Ooof, thanks for flagging, making a note of this for future models that might have the same issue. @sam-h-bean by any chance do you see the same behavior with the 8B version of the model?
@EricMichaelSmith Yeah, we found that this behavior was ubiquitous across model sizes (1B and 8B) and versions (3.1 and 3.2). This is what led us to believe that some bias was introduced in the training data.
Okay, this is very good to know, thanks for informing me of this @sam-h-bean - we'll try to look into what may be causing this so we can fix it in future releases.
Thought I'd chime in here. I think what Llama Guard really needs to be is a text classification model. We forked the one from HuggingFace and followed the instructions here: https://discuss.huggingface.co/t/announcement-generation-get-probabilities-for-generated-output/30075 to use `compute_transition_scores`, then deployed the model as a text classifier. This gives us the probability of an item being "unsafe". You're right: I ran your example against our model, and yes, it always comes back as unsafe S4... but only at a probability of 0.679. You would really only want to deem something unsafe when that probability is above some threshold (say 0.9). Anyway, thought I'd throw that out there in case you wanted to do the same.
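The thresholding idea above can be sketched in a few lines. Assuming you already have the normalized transition score (a log-probability) for the model's first generated token, e.g. from `model.compute_transition_scores(..., normalize_logits=True)` in transformers, converting it to a probability and applying a cutoff might look like this (the helper name and the 0.9 default are illustrative choices, not part of Llama Guard):

```python
import math

def is_unsafe(unsafe_logprob: float, threshold: float = 0.9) -> bool:
    """Deem a classification unsafe only when the model's probability
    for the 'unsafe' token clears the threshold.

    unsafe_logprob: log-probability of the first generated token being
    'unsafe', e.g. from compute_transition_scores(..., normalize_logits=True).
    """
    return math.exp(unsafe_logprob) >= threshold

# The example in this thread scored unsafe at probability 0.679,
# which a 0.9 threshold would reclassify as safe:
print(is_unsafe(math.log(0.679)))  # False
print(is_unsafe(math.log(0.95)))   # True
```

Tuning the threshold against a labeled validation set would be the natural next step; 0.9 here is just a stand-in.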
I'm trying to use LlamaGuard and found that, across models, if the prompt mentions the ages of two people, the model will almost ubiquitously respond with S4. This happens even when the ages mentioned are well above the age of consent. Here is an example.
I am using vLLM to serve the model and found that in situations like this, if ages are mentioned, the model almost always responds with S4. This seems to be an issue (I am assuming) with the training data containing many examples that mention the ages of a child and an adult, and the model overfitting to that pattern. Here is another example where the model says the situation is child exploitation even though it involves only adults.
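For readers unfamiliar with the output format: to my understanding, Llama Guard responds either with `safe`, or with `unsafe` followed by the violated category codes on the next line, where S4 is the Child Sexual Exploitation category in the Llama Guard 3 hazard taxonomy. The false positives described here would therefore look like:

```
unsafe
S4
```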