LlamaGuard Only Responds with S4 when Ages are In Prompt #74

Open
sam-h-bean opened this issue Jan 16, 2025 · 4 comments

@sam-h-bean

I'm trying to use LlamaGuard and found that, across models, if the prompt mentions ages for two people, the model will almost always respond with S4. This happens even when the ages mentioned are well above the age of consent. Here is an example:

from transformers import AutoTokenizer
import openai

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-Guard-3-1B")

# Point the (pre-1.0) openai client at the local vLLM server.
openai.api_key = "EMPTY"
openai.api_base = "http://localhost:8000/v1"


messages = [
    {
        "role": "user",
        "content": [{
            "type": "text",
            "text": "Hi I am trying to make a chatbot character and scenario for roleplaying! Can you help me?",
        }]
    },
    {
        "role": "assistant",
        "content": [{
            "type": "text",
            "text": "that sounds great! let's come up with a name first",
        }]
    },
    {
        "role": "user",
        "content": [{
            "type": "text",
            "text": "How about Benjamin Dane?",
        }]
    },
    {
        "role": "assistant",
        "content": [{
            "type": "text",
            "text": "That is such a good name! What about a description of the character?",
        }]
    },
    {
        "role": "user",
        "content": [{
            "type": "text",
            "text": "I would describe the situation like this: You're in a awkward date with him, you are 50 years old and Benjamin is 50 years old, he's Italian",
        }]
    },
]
prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
)


completion = openai.Completion.create(
    model="meta-llama/Llama-Guard-3-1B",
    prompt=prompt,
    echo=False,
    timeout=5.,
    n=1,
    temperature=0.0,
    stream=False,
    max_tokens=7,
)
print(completion.choices[0].text)

I am using vLLM to serve the model, and across situations like this the model will almost always respond with S4 whenever ages are mentioned. This seems (I am assuming) to be an issue with the training data containing many examples that mention ages for a child and an adult, with the model overfitting to that pattern. Here is another example where the model flags child exploitation even though the situation involves only adults:

content = """name: Benjamin dane
description: You're in a awkward bowling night with him, you're 20 and Benjamin is 27 
greeting: Hi I'm Benjamin let's go bowling!"""
messages = [
    {
        "role": "user",
        "content": [{
            "type": "text",
            "text": content,
        }]
    },
]
prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
)


completion = openai.Completion.create(
    model="meta-llama/Llama-Guard-3-1B",
    prompt=prompt,
    echo=False,
    timeout=5.,
    n=1,
    temperature=0.0,
    stream=False,
    max_tokens=7,
)
print(completion.choices[0].text)
@EricMichaelSmith
Contributor

Ooof, thanks for flagging, making a note of this for future models that might have the same issue. @sam-h-bean by any chance do you see the same behavior with the 8B version of the model?

@sam-h-bean
Author

@EricMichaelSmith Yeah, we found that this behavior was ubiquitous across model sizes (1B and 8B) and versions (3.1 and 3.2). This is what led us to believe that it was some bias introduced in the training data.

@EricMichaelSmith
Contributor

Okay, this is very good to know, thanks for informing me of this @sam-h-bean. We'll try to look into what may be causing this so that we can fix it in future releases.

@james-deee

Thought I'd chime in here. I think Llama Guard needs to be treated as a text-classification model. We forked the one from Hugging Face, followed the instructions here: https://discuss.huggingface.co/t/announcement-generation-get-probabilities-for-generated-output/30075 to use compute_transition_scores, and then deployed the model as a text classifier.

What this gives us is the probability of an item being "unsafe". You're right: I ran your example against our model, and it always comes back as unsafe (S4), but only with a probability of 0.679. You would really only want to deem something unsafe when that probability is above some threshold (say 0.9).
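
A minimal sketch of this thresholding idea, assuming local generation with transformers rather than the vLLM server used above; the example message, the 0.9 cutoff, and the single-token verdict check are illustrative assumptions, not the exact deployed setup:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-Guard-3-1B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

messages = [{
    "role": "user",
    "content": [{
        "type": "text",
        "text": "You're in an awkward bowling night with him, you're 20 and Benjamin is 27",
    }],
}]
input_ids = tokenizer.apply_chat_template(messages, return_tensors="pt")

outputs = model.generate(
    input_ids,
    max_new_tokens=7,
    do_sample=False,
    output_scores=True,
    return_dict_in_generate=True,
)

# Per-token log-probabilities of the greedy continuation.
transition_scores = model.compute_transition_scores(
    outputs.sequences, outputs.scores, normalize_logits=True
)

generated_tokens = outputs.sequences[0, input_ids.shape[-1]:]
probs = torch.exp(transition_scores[0])

# The verdict token's position depends on the chat template, so scan for it
# instead of assuming it is the first generated token. This assumes the
# verdict word decodes from a single token.
for token, prob in zip(generated_tokens, probs):
    text = tokenizer.decode(token).strip()
    if text in ("safe", "unsafe"):
        print(f"verdict={text}, confidence={prob.item():.3f}")
        if text == "unsafe" and prob.item() < 0.9:
            print("Below threshold; treating as safe.")
        break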

Anyway, thought I'd throw that out there in case you wanted to look to do the same.
