diff --git a/evaluation_results/2024-12-21_13-57-31/results.md b/evaluation_results/2024-12-21_13-57-31/results.md
index 3be0a9c..1923c0d 100644
--- a/evaluation_results/2024-12-21_13-57-31/results.md
+++ b/evaluation_results/2024-12-21_13-57-31/results.md
@@ -4,9 +4,7 @@ There are 4 scenarios and 4 test cases with 3 attempts (48 total tests).
 ## Test: blank_math
 ### claude_sonnet_latest_with_seg
-
-
-
+
 ```
 10
@@ -15,36 +13,22 @@ There are 4 scenarios and 4 test cases with 3 attempts (48 total tests).
 ### gpt-4o-mini_no_seg
-
-
-
-
-
+
 ### gpt-4o_with_seg
-
-
-
-
-
+
 ```
 10
 ```
 ### claude_sonnet_latest_no_seg
-
-
-
-
-
+
 ## Test: tic_tac_toe_1
 ### claude_sonnet_latest_with_seg
-
-
-
+
 ```
 Your turn! Place an O anywhere you'd like.
@@ -53,83 +37,39 @@ Your turn! Place an O anywhere you'd like.
 ### gpt-4o-mini_no_seg
-
-
-
-
-
+
 ### gpt-4o_with_seg
-
-
-
-
-
+
 ### claude_sonnet_latest_no_seg
-
-
-
-
-
+
 ## Test: x_in_box
 ### claude_sonnet_latest_with_seg
-
-
-
-
-
+
 ### gpt-4o-mini_no_seg
-
-
-
-
-
+
 ### gpt-4o_with_seg
-
-
-
-
-
+
 ### claude_sonnet_latest_no_seg
-
-
-
-
-
+
 ## Test: x_in_boxes
 ### claude_sonnet_latest_with_seg
-
-
-
-
-
+
 ### gpt-4o-mini_no_seg
-
-
-
-
-
+
 ### gpt-4o_with_seg
-
-
-
-
-
+
 ### claude_sonnet_latest_no_seg
-
-
-
-
-
+