Agent Leaderboard


Overview

The Agent Leaderboard evaluates language models' ability to effectively utilize tools in complex scenarios. With major tech CEOs predicting 2025 as a pivotal year for AI agents, we built this leaderboard to answer: "How do AI agents perform in real-world business scenarios?"

Get the latest version of the leaderboard on Hugging Face Spaces. For more information, check out the blog post for a detailed overview of our evaluation methodology.

https://huggingface.co/spaces/galileo-ai/agent-leaderboard

Methodology

Our evaluation process follows a systematic approach:

Model Selection: Curated diverse set of leading language models (12 private, 5 open-source)
Agent Configuration: Standardized system prompt and consistent tool access
Metric Definition: Tool Selection Quality (TSQ) as primary metric
Dataset Curation: Strategic sampling from established benchmarks
Scoring System: Equally weighted average across datasets
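
To make the scoring step concrete, a model's final leaderboard score is the unweighted mean of its average TSQ across the four datasets. A minimal sketch of that aggregation (the function and the placeholder values below are illustrative, not code or results from this repo):

def leaderboard_score(tsq_by_dataset: dict[str, float]) -> float:
    """Equally weighted average of per-dataset TSQ scores."""
    return sum(tsq_by_dataset.values()) / len(tsq_by_dataset)

# Example with placeholder TSQ values (not actual leaderboard results)
final_score = leaderboard_score(
    {"bfcl": 0.90, "tau_bench": 0.60, "xlam": 0.85, "toolace": 0.88}
)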

Model Rankings

Current standings across different models:

Development Guidelines

Key considerations for implementing AI agents:

Key Insights

Analysis of model performance and capabilities:

Tool Selection Complexity

Understanding the nuances of tool selection and usage across different scenarios:

Dataset Structure

We evaluate comprehensively across multiple domains and interaction types by leveraging diverse datasets:

BFCL: Mathematics, Entertainment, Education, and Academic Domains
τ-bench: Retail and Airline Industry Scenarios
xLAM: Cross-domain Data Generation (21 Domains)
ToolACE: API Interactions across 390 Domains
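
The source datasets are converted (via the notebooks under datasets/) into a common format that the evaluation loop can consume. As a hypothetical illustration only, a normalized sample might carry a conversation plus the tool schemas exposed to the model; the field names and values below are ours, and the actual converted data may differ:

sample = {
    "conversation": [
        {"role": "user", "content": "Book me a flight from SFO to JFK tomorrow."},
    ],
    "tools": [
        {
            "name": "search_flights",
            "description": "Search available flights between two airports on a date.",
            "parameters": {
                "type": "object",
                "properties": {
                    "origin": {"type": "string"},
                    "destination": {"type": "string"},
                    "date": {"type": "string", "format": "date"},
                },
                "required": ["origin", "destination", "date"],
            },
        }
    ],
}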

Evaluation

Our evaluation metric, Tool Selection Quality (TSQ), assesses how well models select and use tools based on real-world requirements.

Implementation

Here's how we evaluate the models using the Tool Selection Quality (TSQ) metric:

import promptquality as pq

# Initialize evaluation handler with TSQ scorer
chainpoll_tool_selection_scorer = pq.CustomizedChainPollScorer(
    scorer_name=pq.CustomizedScorerName.tool_selection_quality,
    model_alias=pq.Models.gpt_4o,
)

evaluate_handler = pq.GalileoPromptCallback(
    project_name=project_name,
    run_name=run_name,
    scorers=[chainpoll_tool_selection_scorer],
)

# Configure LLM with zero temperature for consistent evaluation
llm = llm_handler.get_llm(model, temperature=0.0, max_tokens=4000)

# System prompt for standardized tool usage
system_msg = {
    "role": "system",
    "content": """Your job is to use the given tools to answer the query of human. 
                 If there is no relevant tool then reply with "I cannot answer the question with given tools". 
                 If tool is available but sufficient information is not available, then ask human to get the same. 
                 You can call as many tools as you want. Use multiple tools if needed. 
                 If the tools need to be called in a sequence then just call the first tool."""
}

# Run evaluation over each conversation in the dataset, collecting model outputs
outputs = []
for row in df.itertuples():
    chain = llm.bind_tools(tools)
    outputs.append(
        chain.invoke(
            [system_msg, *row.conversation], 
            config=dict(callbacks=[evaluate_handler])
        )
    )

evaluate_handler.finish()
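
The get_llm call above comes from evaluate/llm_handler.py in this repo. As a rough sketch only (the actual handler may differ), such a helper could wrap LangChain chat-model classes and dispatch on the model name:

# Hypothetical sketch of an LLM handler; evaluate/llm_handler.py may differ.
from langchain_anthropic import ChatAnthropic
from langchain_openai import ChatOpenAI

def get_llm(model: str, temperature: float = 0.0, max_tokens: int = 4000):
    """Return a tool-calling chat model for the given model name."""
    if model.startswith("claude"):
        return ChatAnthropic(model=model, temperature=temperature, max_tokens=max_tokens)
    # Default to OpenAI-compatible chat models
    return ChatOpenAI(model=model, temperature=temperature, max_tokens=max_tokens)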

Repo Structure

agent-leaderboard/
├── data/                   # Data storage directory
├── datasets/ 
│   ├── bfcl.ipynb          # BFCL data conversion
│   ├── tau.ipynb           # Tau benchmark data conversion
│   ├── toolace.ipynb       # ToolACE data conversion
│   └── xlam.ipynb          # XLAM data conversion
├── evaluate/ 
│   ├── get_results.ipynb   # Results aggregation
│   ├── llm_handler.py      # LLM initialization handler
│   ├── test_r1.ipynb       # Test R1
│   └── tool_call_exp.ipynb # Tool calling experiment runner
├── .env                    # Environment variables for API keys
├── LICENSE 
├── README.md 
└── requirements.txt        # Dependencies

Acknowledgements

We extend our sincere gratitude to the creators of the benchmark datasets that made this evaluation framework possible:

  • BFCL: Thanks to the Berkeley AI Research team for their comprehensive dataset evaluating function calling capabilities.

  • τ-bench: Thanks to the Sierra Research team for developing this benchmark focusing on real-world tool use scenarios.

  • xLAM: Thanks to the Salesforce AI Research team for their extensive Large Action Model dataset covering 21 domains.

  • ToolACE: Thanks to the ToolACE team for their comprehensive API interaction dataset spanning 390 domains.

These datasets have been instrumental in creating a comprehensive evaluation framework for tool-calling capabilities in language models.

Citation

@misc{agent-leaderboard,
    author = {Pratik Bhavsar},
    title = {Agent Leaderboard},
    year = {2025},
    publisher = {Galileo.ai},
    howpublished = "\url{https://huggingface.co/spaces/galileo-ai/agent-leaderboard}"
}