
Incorrect calculation at ToolCallAccuracy #1893

Open
licux opened this issue Feb 1, 2025 · 2 comments
Labels
bug (Something isn't working) · module-metrics (this is part of metrics module)

Comments

@licux
Contributor

licux commented Feb 1, 2025

Describe the bug
In ToolCallAccuracy, if the user_input contains more tool calls than there are reference_tool_calls, the extra calls do not affect the evaluation score. In other words, the score remains unchanged even when more tool calls than expected occur.

Ragas version: latest
Python version: 3.12

Code to Reproduce

from ragas.dataset_schema import MultiTurnSample
from ragas.messages import AIMessage, HumanMessage, ToolCall, ToolMessage

# Conversation in which the agent makes two tool calls, while the reference
# expects only one (weather_check).
conversation = [
    HumanMessage(content="What's the weather like in New York right now?"),
    AIMessage(content="The current temperature in New York is 75°F and it's partly cloudy.", tool_calls=[
        ToolCall(name="weather_check", args={"location": "New York"})
    ]),
    HumanMessage(content="Can you translate that to Celsius?"),
    AIMessage(content="Let me convert that to Celsius for you.", tool_calls=[
        ToolCall(name="temperature_conversion", args={"temperature_fahrenheit": 75})
    ]),
    ToolMessage(content="75°F is approximately 23.9°C."),
    AIMessage(content="75°F is approximately 23.9°C.")
]

sample = MultiTurnSample(
    user_input=conversation,
    reference_tool_calls=[
        ToolCall(name="weather_check", args={"location": "New York"})
    ]
)
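
For completeness, a minimal sketch of how the score below can be obtained, assuming ToolCallAccuracy is imported from ragas.metrics and scored through its multi_turn_ascore coroutine:

import asyncio

from ragas.metrics import ToolCallAccuracy

# ToolCallAccuracy compares the sample's tool calls against reference_tool_calls
# (args are compared by exact match by default), so no judge LLM is configured here.
scorer = ToolCallAccuracy()
score = asyncio.run(scorer.multi_turn_ascore(sample))
print(score)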

Output:

1


Expected behavior
The evaluation score should be 0 (or at least penalized), although I think opinions may differ on what the right behavior is.


@licux licux added the bug Something isn't working label Feb 1, 2025
@dosubot dosubot bot added the module-metrics this is part of metrics module label Feb 1, 2025
@sahusiddharth
Collaborator

sahusiddharth commented Feb 1, 2025

Hi @licux,

The way ToolCallAccuracy works is that it scores whether the predicted sample contains the reference tool calls in the same order as they appear in the reference.

Reference Tool Calls: ["weather_check"]
Predicted Tool Calls: ["weather_check", "temperature_conversion"]

Since the reference had one tool call (["weather_check"]) and it was present in the predicted tool calls, the score is 1.

The formula used for scoring is:

$$\text{score} = \frac{\sum{(\text{matched tool call scores})}}{\text{number of reference tool calls}}$$

Here, $\frac{1.0}{1} = 1.0$
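
To make the rule concrete, here is a simplified sketch of that order-respecting matching; it ignores argument comparison and any extra alignment checks the actual metric performs:

# Simplified sketch of the rule above: each reference tool call that is matched,
# in order, by a predicted tool call contributes 1.0; extra predicted calls are
# simply ignored.
def sketch_score(reference, predicted):
    matched = 0.0
    idx = 0
    for ref in reference:
        # advance through the predicted calls until this reference call is found
        while idx < len(predicted) and predicted[idx] != ref:
            idx += 1
        if idx < len(predicted):
            matched += 1.0
            idx += 1
    return matched / len(reference)

print(sketch_score(["weather_check"], ["weather_check", "temperature_conversion"]))  # 1.0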

I hope this helps.

@licux
Contributor Author

licux commented Feb 1, 2025

Hi,
Thank you for your reply!

In the case you describe, there was an unnecessary tool call, "temperature_conversion".
So I think the score should be low, because the agent did something unexpected.
I was a bit confused, but I'm gradually understanding.

For example:

  • Reference Tool Calls: ["weather_check", "temperature_conversion"]
  • Predicted Tool Calls: ["weather_check"]

In this case, the temperature_conversion tool call was not executed.
Therefore, even if weather_check was executed correctly, including its args, the score is 0 with this metric (a concrete setup is sketched below).
Is my understanding correct?
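
To make the scenario concrete, it could be set up like this, reusing the message classes from the reproduction snippet above (the resulting score is exactly what the question is about):

# Reference expects two tool calls, but the conversation only performs one.
incomplete_sample = MultiTurnSample(
    user_input=[
        HumanMessage(content="What's the weather like in New York right now?"),
        AIMessage(content="The current temperature in New York is 75°F.", tool_calls=[
            ToolCall(name="weather_check", args={"location": "New York"})
        ]),
    ],
    reference_tool_calls=[
        ToolCall(name="weather_check", args={"location": "New York"}),
        ToolCall(name="temperature_conversion", args={"temperature_fahrenheit": 75}),
    ],
)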
