Codenames is a party card game in which players find creative word associations. It has several properties that make it compelling as a testbed for scalable oversight experiments:
- It should be easy for language models to learn.
- The computational complexity of generating a clue is much greater than that of finding an issue with a clue, which in turn is greater than that of evaluating a proposed issue.
- It's easy to procedurally generate many games.
- It's easy to simulate overseers with various kinds of flaws, or to artificially limit the oversight budget (a minimal sketch of game generation and a flawed overseer follows this list).
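To make the last two points concrete, here is a minimal Python sketch, assuming a standard 9/8/7/1 Codenames board split. The word pool, the `generate_game` and `flawed_overseer` names, and the `is_associated` stub are hypothetical placeholders, not the project's actual code; the stub stands in for whatever model- or human-based judgment the real setup would use.

```python
import random
from dataclasses import dataclass

# Hypothetical word pool; a real setup would load the full Codenames word list.
WORD_POOL = [
    "apple", "bridge", "castle", "dragon", "engine", "forest", "glacier",
    "harbor", "island", "jungle", "knight", "lantern", "mirror", "needle",
    "ocean", "palace", "queen", "river", "shadow", "temple", "umbrella",
    "violin", "whale", "yacht", "zebra", "anchor", "beacon", "crystal",
]

@dataclass
class Game:
    team_words: list      # words the clue-giver wants guessed
    opponent_words: list  # words belonging to the other team
    neutral_words: list   # bystanders
    assassin: str         # instant-loss word

def generate_game(rng: random.Random) -> Game:
    """Procedurally generate one Codenames-style board (9/8/7/1 split)."""
    board = rng.sample(WORD_POOL, 25)
    return Game(
        team_words=board[:9],
        opponent_words=board[9:17],
        neutral_words=board[17:24],
        assassin=board[24],
    )

def is_associated(clue: str, word: str) -> bool:
    # Stub: in a real experiment this would query a model or a human judge.
    return True

def flawed_overseer(game: Game, clue: str, targets: list,
                    rng: random.Random,
                    budget: int = 5, error_rate: float = 0.1) -> bool:
    """Toy overseer: checks only `budget` randomly chosen target words and
    misjudges each with probability `error_rate`. Returns True if the clue
    is accepted."""
    checked = rng.sample(targets, min(budget, len(targets)))
    for word in checked:
        ok = is_associated(clue, word)   # placeholder association judgment
        if rng.random() < error_rate:    # simulated overseer flaw
            ok = not ok
        if not ok:
            return False
    return True
```

Varying `budget` and `error_rate` yields a family of overseers of different strength and reliability, against which an oversight protocol can be tested cheaply across many procedurally generated games.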
This project aims to extend the theory of predicting whether a scalable oversight technique will robustly succeed for a given problem domain and overseer, and then to test that theory with many small experiments.
For more detail, see:
- the LW post draft
- the original Debate paper
- Redwood Research's post on meta-level adversarial evaluations