Hey, cool to see this eval harness!
I'm wondering if you could articulate when someone should use HAL over Inspect Evals (might be helpful to add that to the README).
I'm also curious why you are expanding benchmarks here (e.g., #15) rather than contributing them to the upstream Inspect Evals harness.
Thanks!
Jonathan
There are a couple of key reasons to use HAL over Inspect:
Ease of use for developers: We designed HAL to be as general as possible and to minimize constraints on the agent developer, so agents do not have to follow any particular framework. To run agents on Inspect benchmarks via Inspect Evals, they have to be implemented as Inspect solvers (see the first sketch after this list).
Simplicity of adding new benchmarks: Most agent benchmarks see almost no adoption because there is no public leaderboard and they are tedious to set up (some lack a proper harness altogether). Adding a benchmark to HAL can be as easy as ~150 lines of code (e.g., appworld.py; see the second sketch after this list).
Containerization: We wanted to avoid Docker for sandboxing agent runs and rely on the cloud instead, because some agents use Docker in their own implementation and we wanted to avoid Docker-in-Docker (DinD).
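For reference, the Inspect constraint mentioned above means the agent has to be wrapped as a solver before it can run against Inspect Evals tasks. A minimal sketch with inspect_ai's solver API (the name `passthrough_agent` is made up, and the body just delegates to the model):

```python
from inspect_ai.solver import Generate, TaskState, solver


@solver
def passthrough_agent():
    # An Inspect solver is an async function taking the task state and a
    # generate callable; a real agent would add its own loop/tools here.
    async def solve(state: TaskState, generate: Generate) -> TaskState:
        return await generate(state)

    return solve
```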
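And to give a feel for the "~150 lines" point, here is a minimal sketch of what a benchmark adapter can look like when the harness only needs a task loader and a scorer, and treats the agent as an arbitrary callable. The class, file format, and `agent_fn` signature are illustrative assumptions, not HAL's actual API; see appworld.py in the repo for the real thing.

```python
# Hypothetical HAL-style benchmark adapter sketch (not HAL's actual API).
import json
from typing import Any, Callable


class ToyBenchmark:
    def __init__(self, tasks: list[dict[str, Any]]):
        # Each task carries an "id", an "input" prompt, and a reference "answer".
        self.tasks = tasks

    @classmethod
    def from_json(cls, path: str) -> "ToyBenchmark":
        with open(path) as f:
            return cls(json.load(f))

    def run(self, agent_fn: Callable[[dict[str, Any]], str]) -> dict[str, dict[str, Any]]:
        # agent_fn can be implemented with any framework (or none); the harness
        # only calls it with a task dict and expects a string answer back.
        results = {}
        for task in self.tasks:
            prediction = agent_fn(task)
            results[task["id"]] = {
                "prediction": prediction,
                "correct": self._score(task, prediction),
            }
        return results

    @staticmethod
    def _score(task: dict[str, Any], prediction: str) -> bool:
        # Exact-match scoring for the toy example; a real benchmark plugs in its own metric.
        return prediction.strip() == str(task["answer"]).strip()


if __name__ == "__main__":
    tasks = [{"id": "t1", "input": "What is 2 + 2?", "answer": "4"}]
    print(ToyBenchmark(tasks).run(lambda task: "4"))
```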