Add AGIEval #121

clefourrier · 2024-03-20T14:47:20Z

Compared results with YALL; ours are within stderr range.
Closes #79

NathanHB · 2024-03-21T10:10:36Z

tests/reference_scores/reference_task_scores.py

@@ -615,5 +615,59 @@
        "lighteval|bigbench:tracking_shuffled_objects_five_objects|3|0": {"acc": 0.2000, "acc_stderr": 0.1333},
        "lighteval|bigbench:tracking_shuffled_objects_seven_objects|3|0": {"acc": 0.3000, "acc_stderr": 0.1528},
        "lighteval|bigbench:tracking_shuffled_objects_three_objects|3|0": {"acc": 0.4000, "acc_stderr": 0.1633},
+        "lighteval|agieval:_average|0|0": {


Do those come from YALL, or did you compute them with lighteval?

These are our own results on this model, which is not in YALL iirc, but I tested on a range of models from YALL and we are always within stderr range.
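For readers unfamiliar with the "within stderr range" check, here is a minimal sketch of one way it could be done. It is not taken from lighteval: the helper names `acc_stderr` and `within_stderr` are hypothetical, the sample count of 10 is an assumption inferred from the "add gpt 10 samples test" commit, and the acceptance rule (absolute difference no larger than our stderr) is only one plausible reading of the comment above. Under those assumptions, the formula does reproduce the `acc_stderr` values visible in the diff context.

```python
import math


def acc_stderr(acc: float, n: int) -> float:
    """Standard error of the mean of n binary (0/1) scores with mean `acc`.

    Uses the sample variance (ddof=1): var = n * acc * (1 - acc) / (n - 1),
    so stderr = sqrt(var / n) = sqrt(acc * (1 - acc) / (n - 1)).
    """
    return math.sqrt(acc * (1 - acc) / (n - 1))


def within_stderr(ours: float, ours_stderr: float, theirs: float) -> bool:
    """Hypothetical acceptance rule: scores differ by at most our stderr."""
    return abs(ours - theirs) <= ours_stderr


# Assumption: 10 samples per task, as suggested by the commit message.
N_SAMPLES = 10

# Accuracies taken from the three bigbench lines shown in the diff context.
for acc in (0.2, 0.3, 0.4):
    print(acc, round(acc_stderr(acc, N_SAMPLES), 4))
```

With 10 samples the error bars are wide, so a check like `within_stderr(0.30, 0.1528, 0.40)` passes even though the two accuracies differ by ten points; the comparison value 0.40 here is purely illustrative and not a YALL result.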

NathanHB

Looks good!

clefourrier and others added 7 commits March 20, 2024 10:51

task path fixed + added prompt formatting function

b8bfd22

results within error rates when compared to yall

cf5052e

add gpt 10 samples test

7b37187

added agieval to tests

e0a2428

add missing bbh to tests

b1333ec

Merge branch 'main' into clem_add_agieval

fdec0e1

fixed precision

72a4245

clefourrier requested a review from NathanHB March 20, 2024 16:58

NathanHB reviewed Mar 21, 2024

NathanHB approved these changes Mar 21, 2024

clefourrier merged commit 133cf9b into main Mar 21, 2024
2 checks passed
