
Hugging Face leaderboard #64

Open
recursix opened this issue Oct 15, 2024 · 1 comment
@recursix (Collaborator)

Make a leaderboard for all benchmarks combined and for each individual benchmark.

The all-benchmarks view would show every agent on every benchmark ('-' where a result is not available).
The per-benchmark view would have the following columns (a rough rendering sketch follows the list):

  • agent name
  • agent score (float with std_err)
  • backend LLM (str or link)
  • benchmark specific (yes/no): whether the agent was hand-engineered for this benchmark
  • benchmark tuned (yes/no): whether the agent was automatically adjusted on this benchmark
  • followed evaluation protocol in the paper (yes/no, with a potential footnote)
  • reproducible (yes/no): whether there is a paper with open-sourced code and enough detail to reproduce the results
  • reproduced (list of scores, dates, and UUIDs of traces reproducing the result)
  • comments (str): whatever the author thinks is useful to know
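
As a minimal sketch of how such a per-benchmark table could be assembled (the `Entry` dataclass and `per_benchmark_view` helper are hypothetical names, not existing code; the fields just mirror the columns above):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Entry:
    """One leaderboard result; fields mirror the proposed columns."""
    agent_name: str
    benchmark: str
    score: float
    std_err: float
    backend_llm: str
    benchmark_specific: Optional[bool] = None   # hand-engineered for this benchmark?
    benchmark_tuned: Optional[bool] = None      # automatically adjusted on this benchmark?
    followed_protocol: Optional[bool] = None
    reproducible: Optional[bool] = None
    comments: str = ""

def per_benchmark_view(entries, benchmark):
    """Return rows for one benchmark's leaderboard, sorted by score (descending)."""
    def flag(v):
        # Render an unknown flag as "-", otherwise yes/no.
        return "-" if v is None else ("yes" if v else "no")
    rows = []
    for e in sorted(entries, key=lambda e: e.score, reverse=True):
        if e.benchmark != benchmark:
            continue
        rows.append([
            e.agent_name,
            f"{e.score:.1f} ± {e.std_err:.1f}",
            e.backend_llm,
            flag(e.benchmark_specific),
            flag(e.benchmark_tuned),
            flag(e.followed_protocol),
            flag(e.reproducible),
            e.comments,
        ])
    return rows
```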
@recursix (Collaborator, Author)

```json
{
    "agent_name": "my agent",
    "benchmark": "WorkArena-L1",
    "benchmark_version": "1.0",
    "score": 6.1,
    "std_err": 0.3,
    "study_id": "study_id",
    "date_time": "2021-01-01 12:00:00",
    "benchmark_specific": "?",
    "benchmark_tuned": "?",
    "followed_evaluation_protocol": "?",
    "reproducible": "?",
    "comments": "NA",
    "original_or_reproduced": "original"
}
```
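
As one illustration of how entries in this format could feed the all-benchmarks view ('-' where no result exists), here is a small stdlib-only sketch; `all_benchmarks_view` is a hypothetical helper, not part of the repo:

```python
import json
from collections import defaultdict

def all_benchmarks_view(json_entries):
    """Build the all-benchmarks table from a list of JSON strings in the format above:
    one row per agent, one column per benchmark, "-" where no result exists."""
    entries = [json.loads(s) for s in json_entries]
    benchmarks = sorted({e["benchmark"] for e in entries})
    scores = defaultdict(dict)  # agent -> benchmark -> "score ± std_err"
    for e in entries:
        scores[e["agent_name"]][e["benchmark"]] = f'{e["score"]} ± {e["std_err"]}'
    header = ["agent"] + benchmarks
    rows = [[agent] + [scores[agent].get(b, "-") for b in benchmarks]
            for agent in sorted(scores)]
    return [header] + rows
```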
