Hey, cool to see this eval harness!
I'm wondering if you could articulate when someone should use HAL over Inspect Evals (might be helpful to add that to the README).
I'm also curious why you are expanding benchmarks here (e.g., #15) rather than contributing them to the upstream Inspect Evals harness.
Thanks!
Jonathan
There are a couple of key reasons to use HAL over Inspect:
Ease of use for developers: We designed HAL to be as general as possible and to minimize constraints on the agent developer, so agents do not have to follow any particular framework. To run agents on Inspect benchmarks via Inspect Evals, they have to be implemented as Inspect solvers (see the first sketch after this list).
Simplicity of adding new benchmarks: Most agent benchmarks see almost no adoption because there is no public leaderboard and they are tedious to set up (some lack a proper harness altogether). Adding a benchmark to HAL can be as easy as ~150 lines of code (e.g., appworld.py; see the second sketch after this list).
Containerization: We wanted to avoid Docker for sandboxing agent runs and rely on the cloud instead, because some agents use Docker in their own implementation and we wanted to avoid Docker-in-Docker (DinD).
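For reference, the Inspect constraint mentioned above means the agent has to be wrapped as a solver before it can run against Inspect Evals tasks. A minimal sketch with inspect_ai's solver API (the name `passthrough_agent` is made up, and the body just delegates to the model):

```python
from inspect_ai.solver import Generate, TaskState, solver


@solver
def passthrough_agent():
    # An Inspect solver is an async function taking the task state and a
    # generate callable; a real agent would add its own loop/tools here.
    async def solve(state: TaskState, generate: Generate) -> TaskState:
        return await generate(state)

    return solve
```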
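And to give a feel for the "~150 lines" point, here is a minimal sketch of what a benchmark adapter can look like when the harness only needs a task loader and a scorer, and treats the agent as an arbitrary callable. The class, file format, and `agent_fn` signature are illustrative assumptions, not HAL's actual API; see appworld.py in the repo for the real thing.

```python
# Hypothetical HAL-style benchmark adapter sketch (not HAL's actual API).
import json
from typing import Any, Callable


class ToyBenchmark:
    def __init__(self, tasks: list[dict[str, Any]]):
        # Each task carries an "id", an "input" prompt, and a reference "answer".
        self.tasks = tasks

    @classmethod
    def from_json(cls, path: str) -> "ToyBenchmark":
        with open(path) as f:
            return cls(json.load(f))

    def run(self, agent_fn: Callable[[dict[str, Any]], str]) -> dict[str, dict[str, Any]]:
        # agent_fn can be implemented with any framework (or none); the harness
        # only calls it with a task dict and expects a string answer back.
        results = {}
        for task in self.tasks:
            prediction = agent_fn(task)
            results[task["id"]] = {
                "prediction": prediction,
                "correct": self._score(task, prediction),
            }
        return results

    @staticmethod
    def _score(task: dict[str, Any], prediction: str) -> bool:
        # Exact-match scoring for the toy example; a real benchmark plugs in its own metric.
        return prediction.strip() == str(task["answer"]).strip()


if __name__ == "__main__":
    tasks = [{"id": "t1", "input": "What is 2 + 2?", "answer": "4"}]
    print(ToyBenchmark(tasks).run(lambda task: "4"))
```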