O1 mini #137

haesleinhuepf · 2024-09-25T06:25:36Z

This PR contains:

a new test-case for the benchmark
- I hereby confirm that NO LLM-based technology (such as GitHub Copilot) was used while writing this benchmark
new dependencies in requirements.txt
- The environment.yml file was updated using the command conda env export > environment.yml
new generator functions for sampling from other LLMs
new samples (sample_....jsonl files)
new benchmarking results (..._results.jsonl files)
documentation update
bug fixes

Related GitHub issue (if relevant): related to #135

Short description:

How do you think this will influence the benchmark results?

Why do you think it makes sense to merge this PR?

o1-mini is the most recent model from OpenAI. The company claims it is good at coding. Our benchmark confirms this, while also showing that an older gpt4omni model scores a tiny bit higher. It's a nice independent measurement.

haesleinhuepf added 3 commits September 25, 2024 07:16

sampled o1-mini

df25b1b

evaluated o1-mini, updated plots

2ffcfa2

updated environment the benchmark was executed on

380cbe9

haesleinhuepf merged commit d984f03 into main Sep 26, 2024
