Make a leaderboard for all benchmarks and for individual benchmarks
the all-benchmarks view would let us see all agents on all benchmarks ('-' when a score is not available)
the per-benchmark view would have the following columns:
agent name
agent score (float with std_err)
backend llm (str or link)
benchmark specific (yes/no) (if the agent was hand-engineered on this benchmark)
benchmark tuned (yes/no) (if the agent was automatically adjusted on this benchmark)
followed evaluation protocol in the paper (yes/no with potential footnote)
reproducible (yes/no): whether there is a paper with open-sourced code and enough detail to reproduce the results
reproduced (list of scores, each with the date and uuid of the traces reproducing the result)
comments (str): whatever the author thinks is useful to know
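
A minimal sketch of how the two views could be represented, assuming one record per (agent, benchmark) pair. All names here (`LeaderboardEntry`, `Reproduction`, `all_benchmarks_view`, and the field names) are hypothetical and chosen only to mirror the columns listed above; nothing here reflects an existing implementation.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Reproduction:
    """One independent reproduction of a reported result (hypothetical schema)."""
    score: float
    date: str        # assumed to be an ISO date string
    trace_uuid: str  # uuid of the traces reproducing the result


@dataclass
class LeaderboardEntry:
    """One row of the per-benchmark view (hypothetical schema)."""
    agent_name: str
    benchmark: str
    score: float
    std_err: float
    backend_llm: str            # model name or link
    benchmark_specific: bool    # hand-engineered on this benchmark
    benchmark_tuned: bool       # automatically adjusted on this benchmark
    followed_protocol: bool     # followed the paper's evaluation protocol
    reproducible: bool          # paper, open-sourced code and enough detail
    protocol_footnote: Optional[str] = None
    reproduced: List[Reproduction] = field(default_factory=list)
    comments: str = ""


def format_score(entry: LeaderboardEntry) -> str:
    """Render the score column as 'score +/- std_err'."""
    return f"{entry.score:.2f} +/- {entry.std_err:.2f}"


def all_benchmarks_view(entries: List[LeaderboardEntry]) -> List[List[str]]:
    """Pivot entries into an agents-by-benchmarks table, '-' where missing.

    Assumes at most one entry per (agent_name, benchmark) pair; if there
    are several, the last one wins.
    """
    benchmarks = sorted({e.benchmark for e in entries})
    agents = sorted({e.agent_name for e in entries})
    cells = {(e.agent_name, e.benchmark): format_score(e) for e in entries}
    header = ["agent"] + benchmarks
    rows = [[a] + [cells.get((a, b), "-") for b in benchmarks] for a in agents]
    return [header] + rows
```

Keeping a flat list of per-benchmark entries and deriving the all-benchmarks table from it means the '-' placeholder falls out of the pivot rather than needing to be stored.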