LlamaGuard Only Responds with S4 when Ages are In Prompt #74

Open
sam-h-bean opened this issue Jan 16, 2025 · 4 comments

@sam-h-bean

I'm trying to use LlamaGuard and found that, across models, if the prompt mentions ages for two people, the model will almost always respond with S4. This happens even when the ages mentioned are well above the age of consent. Here is an example:

from transformers import AutoTokenizer
import openai

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-Guard-3-1B")

# Point the (pre-1.0) openai client at the local vLLM server.
openai.api_key = "EMPTY"
openai.api_base = "http://localhost:8000/v1"


messages = [
    {
        "role": "user",
        "content": [{
            "type": "text",
            "text": "Hi I am trying to make a chatbot character and scenario for roleplaying! Can you help me?",
        }]
    },
    {
        "role": "assistant",
        "content": [{
            "type": "text",
            "text": "that sounds great! let's come up with a name first",
        }]
    },
    {
        "role": "user",
        "content": [{
            "type": "text",
            "text": "How about Benjamin Dane?",
        }]
    },
    {
        "role": "assistant",
        "content": [{
            "type": "text",
            "text": "That is such a good name! What about a description of the character?",
        }]
    },
    {
        "role": "user",
        "content": [{
            "type": "text",
            "text": "I would describe the situation like this: You're in a awkward date with him, you are 50 years old and Benjamin is 50 years old, he's Italian",
        }]
    },
]
prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
)


completion = openai.Completion.create(
    model="meta-llama/Llama-Guard-3-1B",
    prompt=prompt,
    echo=False,
    timeout=5.,
    n=1,
    temperature=0.0,
    stream=False,
    max_tokens=7,
)
print(completion.choices[0].text)

I am using vLLM to serve the model, and across situations like this the model will almost always respond with S4 whenever ages are mentioned. This seems (I am assuming) to be an issue with the training data containing many examples that mention ages for a child and an adult, with the model overfitting to that pattern. Here is another example where the model flags child exploitation even though the situation involves only adults:

content = """name: Benjamin dane
description: You're in a awkward bowling night with him, you're 20 and Benjamin is 27 
greeting: Hi I'm Benjamin let's go bowling!"""
messages = [
    {
        "role": "user",
        "content": [{
            "type": "text",
            "text": content,
        }]
    },
]
prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
)


completion = openai.Completion.create(
    model="meta-llama/Llama-Guard-3-1B",
    prompt=prompt,
    echo=False,
    timeout=5.,
    n=1,
    temperature=0.0,
    stream=False,
    max_tokens=7,
)
print(completion.choices[0].text)
@EricMichaelSmith
Contributor

Ooof, thanks for flagging, making a note of this for future models that might have the same issue. @sam-h-bean by any chance do you see the same behavior with the 8B version of the model?

@sam-h-bean
Author

@EricMichaelSmith Yeah, we found that this behavior was ubiquitous across model sizes (1B and 8B) and versions (3.1 and 3.2). This is what led us to believe that it was some bias introduced in the training data.

@EricMichaelSmith
Contributor

Okay, this is very good to know, thanks for informing me of this @sam-h-bean. We'll try to look into what may be causing this so that we can fix it in future releases.

@james-deee

Thought I'd chime in here. I think Llama Guard needs to be treated as a text-classification model. We forked the one from Hugging Face, followed the instructions here: https://discuss.huggingface.co/t/announcement-generation-get-probabilities-for-generated-output/30075 to use compute_transition_scores, and then deployed the model as a text classifier.

What this gives us is the probability of an item being "unsafe". You're right: I ran your example against our model, and it always comes back as unsafe (S4), but only with a probability of 0.679. You would really only want to deem something unsafe when that probability is above some threshold (say 0.9).
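
A minimal sketch of this thresholding idea, assuming local generation with transformers rather than the vLLM server used above; the example message, the 0.9 cutoff, and the single-token verdict check are illustrative assumptions, not the exact deployed setup:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-Guard-3-1B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

messages = [{
    "role": "user",
    "content": [{
        "type": "text",
        "text": "You're in an awkward bowling night with him, you're 20 and Benjamin is 27",
    }],
}]
input_ids = tokenizer.apply_chat_template(messages, return_tensors="pt")

outputs = model.generate(
    input_ids,
    max_new_tokens=7,
    do_sample=False,
    output_scores=True,
    return_dict_in_generate=True,
)

# Per-token log-probabilities of the greedy continuation.
transition_scores = model.compute_transition_scores(
    outputs.sequences, outputs.scores, normalize_logits=True
)

generated_tokens = outputs.sequences[0, input_ids.shape[-1]:]
probs = torch.exp(transition_scores[0])

# The verdict token's position depends on the chat template, so scan for it
# instead of assuming it is the first generated token. This assumes the
# verdict word decodes from a single token.
for token, prob in zip(generated_tokens, probs):
    text = tokenizer.decode(token).strip()
    if text in ("safe", "unsafe"):
        print(f"verdict={text}, confidence={prob.item():.3f}")
        if text == "unsafe" and prob.item() < 0.9:
            print("Below threshold; treating as safe.")
        break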

Anyway, thought I'd throw that out there in case you wanted to look to do the same.
