When to choose HAL over Inspect Evals #16

Open
jbragg opened this issue Jan 17, 2025 · 1 comment
jbragg commented Jan 17, 2025

Hey, cool to see this eval harness!
I'm wondering if you could articulate when someone should use HAL over Inspect Evals (might be helpful to add that to the README).
I'm also curious why you are expanding benchmarks here (e.g., #15) rather than contributing them to the upstream Inspect Evals harness.
Thanks!
Jonathan

@benediktstroebl
Collaborator

@jbragg

Hi Jonathan,
thanks, I'm glad you like it!

There are a few key reasons to use HAL over Inspect:

  • Ease of use for developers: With HAL, we wanted to make the harness as general as possible and minimize constraints on the agent developer, so developers do not have to follow a particular agent framework. By contrast, to run agents on Inspect benchmarks with Inspect Evals, the agents need to be implemented as Inspect solvers.
  • Simplicity of adding new benchmarks: Most agent benchmarks see almost no adoption because there is no public leaderboard and they are tedious to set up (some lack a proper harness altogether). Adding a benchmark to HAL can take as little as 150 lines of code (e.g., appworld.py).
  • Containerization: We wanted to avoid using Docker to sandbox agent runs and use the cloud instead. Some agents use Docker in their own implementation, and we wanted to avoid Docker-in-Docker (DinD).
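
As a rough illustration of the first point, a framework-agnostic harness only needs the agent to be a plain callable. The sketch below is hypothetical: `Task`, `Agent`, `echo_agent`, and `run_benchmark` are made-up names for this example and are not HAL's or Inspect's actual API.

```python
from typing import Callable, Dict

# Hypothetical types for illustration only; not HAL's actual interface.
Task = Dict[str, str]
Agent = Callable[[Dict[str, Task]], Dict[str, str]]


def echo_agent(tasks: Dict[str, Task]) -> Dict[str, str]:
    """A toy agent: any plain function that maps task inputs to answers."""
    return {task_id: task["prompt"].upper() for task_id, task in tasks.items()}


def run_benchmark(agent: Agent, tasks: Dict[str, Task], expected: Dict[str, str]) -> float:
    """Call the agent once and return the fraction of exact-match answers."""
    answers = agent(tasks)
    correct = sum(answers.get(task_id) == answer for task_id, answer in expected.items())
    return correct / len(expected)


# Toy data: the agent gets "t1" right and "t2" wrong.
tasks = {"t1": {"prompt": "hello"}, "t2": {"prompt": "world"}}
expected = {"t1": "HELLO", "t2": "nope"}
score = run_benchmark(echo_agent, tasks, expected)
```

Because the harness only relies on the call signature, the agent can be built with any framework (or none) as long as it exposes such a function.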

Happy to chat about it more!
Best,
Benedikt
