reasoning-gym Evaluation

This repository stores evaluation results for reasoning-gym datasets, including LLM outputs.

Progress and LLM accuracy metrics are tracked on our main Google Spreadsheet.

Team

You can reach the eval team in the #reasoning-gym channel of the GPU-Mode Discord server.
We would gladly welcome donations of API keys for OpenRouter or other inference API providers!
