Compilation Benchmark

Work in progress

This repo extends what I started in rust-bench. I am interested in comparing how well different LLMs write code that compiles, and how that relates to the code working correctly.

I have used some hand-written problems, plus the problems from Advent of Code 2024, hoping that they were not in the training data. For the AOC problems, I haven't uploaded my inputs and outputs.

Correlations

Looking at the correlations between how often code compiles and how often it is correct in different languages, it seems plausible that writing code that compiles requires a slightly different skill from solving the problem with code. For example, writing Haskell code that compiles correlates more strongly with writing OCaml code that compiles (0.929) than with solving the problem in Haskell (0.891).

	compiles cpp	compiles haskell	compiles ocaml	compiles python	compiles rust	compiles go	avg compiles	correct cpp	correct haskell	correct ocaml	correct python	correct rust	correct go	avg correct
compiles cpp	1.000	0.825	0.860	-0.345	0.889	-0.440	0.923	0.774	0.732	0.770	0.661	0.818	0.838	0.810
compiles haskell	0.825	1.000	0.929	-0.226	0.884	-0.526	0.952	0.780	0.891	0.844	0.620	0.919	0.893	0.866
compiles ocaml	0.860	0.929	1.000	-0.361	0.886	-0.592	0.953	0.763	0.826	0.820	0.642	0.928	0.859	0.845
compiles python	-0.345	-0.226	-0.361	1.000	-0.417	0.523	-0.241	-0.245	-0.086	-0.180	-0.291	-0.291	-0.163	-0.231
compiles rust	0.889	0.884	0.886	-0.417	1.000	-0.702	0.950	0.596	0.659	0.646	0.489	0.830	0.743	0.707
compiles go	-0.440	-0.526	-0.592	0.523	-0.702	1.000	-0.545	-0.327	-0.343	-0.270	-0.199	-0.588	-0.336	-0.369
avg compiles	0.923	0.952	0.953	-0.241	0.950	-0.545	1.000	0.753	0.806	0.816	0.609	0.912	0.874	0.841
correct cpp	0.774	0.780	0.763	-0.245	0.596	-0.327	0.753	1.000	0.900	0.922	0.907	0.841	0.934	0.978
correct haskell	0.732	0.891	0.826	-0.086	0.659	-0.343	0.806	0.900	1.000	0.961	0.713	0.863	0.945	0.936
correct ocaml	0.770	0.844	0.820	-0.180	0.646	-0.270	0.816	0.922	0.961	1.000	0.813	0.867	0.953	0.968
correct python	0.661	0.620	0.642	-0.291	0.489	-0.199	0.609	0.907	0.713	0.813	1.000	0.669	0.841	0.891
correct rust	0.818	0.919	0.928	-0.291	0.830	-0.588	0.912	0.841	0.863	0.867	0.669	1.000	0.868	0.896
correct go	0.838	0.893	0.859	-0.163	0.743	-0.336	0.874	0.934	0.945	0.953	0.841	0.868	1.000	0.980
avg correct	0.810	0.866	0.845	-0.231	0.707	-0.369	0.841	0.978	0.936	0.968	0.891	0.896	0.980	1.000

Python and Go are a bit different, as most attempts in those languages compile. The focus is therefore on Rust, Haskell, C++ and OCaml only.
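A correlation matrix like the one above could be computed roughly as follows. This is a minimal sketch, not the code used in this repo: the helper names `pearson` and `correlation_matrix` are hypothetical, the numbers are made up for illustration, and it assumes each column holds one rate per observation (for example, one value per model).

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant column")
    return cov / sqrt(var_x * var_y)


def correlation_matrix(columns):
    """Map every pair of column names to their Pearson correlation."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}


# Made-up rates for illustration only; NOT results from this benchmark.
# Each position in a list is one hypothetical observation (e.g. one model).
example_rates = {
    "compiles haskell": [0.2, 0.5, 0.7, 0.9],
    "compiles ocaml": [0.1, 0.4, 0.8, 0.9],
    "correct haskell": [0.1, 0.1, 0.5, 0.6],
}

matrix = correlation_matrix(example_rates)
for row_name, row in matrix.items():
    # Print one tab-separated row per column, rounded to three decimals.
    print(row_name, "\t".join(f"{value:.3f}" for value in row.values()), sep="\t")
```

Restricting the analysis to a subset of languages then amounts to passing only the relevant columns to `correlation_matrix`.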

Some AOC results

Some graphs comparing the performance of different models on Advent of Code 2024

Gusanidas/compilation-benchmark
