
Incorrect calculation at ToolCallAccuracy #1893

Open
licux opened this issue Feb 1, 2025 · 2 comments
Labels
bug (Something isn't working) · module-metrics (this is part of metrics module)

Comments

@licux
Contributor

licux commented Feb 1, 2025

Describe the bug
In ToolCallAccuracy, if the user_input contains more tool calls than there are reference_tool_calls, the extra calls do not affect the evaluation score. In other words, the score remains unchanged even when more tool calls than expected occur.

Ragas version: latest
Python version: 3.12

Code to Reproduce

from ragas.dataset_schema import MultiTurnSample
from ragas.messages import AIMessage, HumanMessage, ToolCall, ToolMessage

# Conversation in which the agent makes two tool calls, while the reference
# expects only one (weather_check).
conversation = [
    HumanMessage(content="What's the weather like in New York right now?"),
    AIMessage(content="The current temperature in New York is 75°F and it's partly cloudy.", tool_calls=[
        ToolCall(name="weather_check", args={"location": "New York"})
    ]),
    HumanMessage(content="Can you translate that to Celsius?"),
    AIMessage(content="Let me convert that to Celsius for you.", tool_calls=[
        ToolCall(name="temperature_conversion", args={"temperature_fahrenheit": 75})
    ]),
    ToolMessage(content="75°F is approximately 23.9°C."),
    AIMessage(content="75°F is approximately 23.9°C.")
]

sample = MultiTurnSample(
    user_input=conversation,
    reference_tool_calls=[
        ToolCall(name="weather_check", args={"location": "New York"})
    ]
)
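
For completeness, a minimal sketch of how the score below can be obtained, assuming ToolCallAccuracy is imported from ragas.metrics and scored through its multi_turn_ascore coroutine:

import asyncio

from ragas.metrics import ToolCallAccuracy

# ToolCallAccuracy compares the sample's tool calls against reference_tool_calls
# (args are compared by exact match by default), so no judge LLM is configured here.
scorer = ToolCallAccuracy()
score = asyncio.run(scorer.multi_turn_ascore(sample))
print(score)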

Output:

1


Expected behavior
The evaluation score should be 0 (or at least penalized), although I think opinions may differ on what the right behavior is.


@licux licux added the bug Something isn't working label Feb 1, 2025
@dosubot dosubot bot added the module-metrics this is part of metrics module label Feb 1, 2025
@sahusiddharth
Collaborator

sahusiddharth commented Feb 1, 2025

Hi @licux,

The way ToolCallAccuracy works is that it scores whether the predicted sample contains the reference tool calls in the same order as they appear in the reference.

Reference Tool Calls: ["weather_check"]
Predicted Tool Calls: ["weather_check", "temperature_conversion"]

Since the reference had one tool call (["weather_check"]) and it was present in the predicted tool calls, the score is 1.

The formula used for scoring is:

$$\text{score} = \frac{\sum{(\text{matched tool call scores})}}{\text{number of reference tool calls}}$$

Here, $\frac{1.0}{1} = 1.0$
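
To make the rule concrete, here is a simplified sketch of that order-respecting matching; it ignores argument comparison and any extra alignment checks the actual metric performs:

# Simplified sketch of the rule above: each reference tool call that is matched,
# in order, by a predicted tool call contributes 1.0; extra predicted calls are
# simply ignored.
def sketch_score(reference, predicted):
    matched = 0.0
    idx = 0
    for ref in reference:
        # advance through the predicted calls until this reference call is found
        while idx < len(predicted) and predicted[idx] != ref:
            idx += 1
        if idx < len(predicted):
            matched += 1.0
            idx += 1
    return matched / len(reference)

print(sketch_score(["weather_check"], ["weather_check", "temperature_conversion"]))  # 1.0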

I hope this helps.

@licux
Contributor Author

licux commented Feb 1, 2025

Hi,
Thank you for your reply!

In the case you describe, there was an unnecessary tool call, "temperature_conversion".
So I think the score should be low, because the agent did something unexpected.
I was a bit confused, but I'm gradually understanding.

For example:

  • Reference Tool Calls: ["weather_check", "temperature_conversion"]
  • Predicted Tool Calls: ["weather_check"]

In this case, the temperature_conversion tool call was not executed.
Therefore, even if weather_check was executed correctly, including its args, the score is 0 with this metric (a concrete setup is sketched below).
Is my understanding correct?
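
To make the scenario concrete, it could be set up like this, reusing the message classes from the reproduction snippet above (the resulting score is exactly what the question is about):

# Reference expects two tool calls, but the conversation only performs one.
incomplete_sample = MultiTurnSample(
    user_input=[
        HumanMessage(content="What's the weather like in New York right now?"),
        AIMessage(content="The current temperature in New York is 75°F.", tool_calls=[
            ToolCall(name="weather_check", args={"location": "New York"})
        ]),
    ],
    reference_tool_calls=[
        ToolCall(name="weather_check", args={"location": "New York"}),
        ToolCall(name="temperature_conversion", args={"temperature_fahrenheit": 75}),
    ],
)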
