This is a work-in-progress LLM benchmark being written entirely with Mentat (the GitHub bot). The project provides a framework for comparing and evaluating different language models.
The benchmark is built around a card-matching game. In each round (sketched in code below):
- A green card is drawn
- Players are dealt red cards
- Players choose a red card from their hand that best matches the green card
- A judge selects the best match among the played cards
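As a rough illustration, one round could look like the following sketch. All names here (`play_round`, `green_deck`, `choose_red_card`, `pick_winner`) are hypothetical and do not necessarily match the benchmark's actual code:

```python
def play_round(players, green_deck, judge):
    """Hypothetical sketch of one round: draw a green card, collect a red-card
    play from every non-judge player, then let the judge pick the best match."""
    green_card = green_deck.pop()                        # a green card is drawn
    plays = {
        player.name: player.choose_red_card(green_card)  # each player picks from their hand
        for player in players
        if player is not judge
    }
    winner_name = judge.pick_winner(green_card, plays)   # judge selects the best match
    return winner_name
```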
The benchmark supports both real language models and random players (see the sketch after this list), allowing for:
- Evaluation of model performance in understanding word relationships
- Comparison between different models
- Testing and development using random players
- Mixed games with both real models and random players
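One way to support this mix is a shared player interface with a random baseline and a model-backed implementation. The classes below are only a sketch of that idea, not the project's actual API; `ask_model` stands in for whatever helper calls the OpenRouter API:

```python
import random


class RandomPlayer:
    """Baseline player: picks a red card from its hand at random."""

    def __init__(self, name, hand):
        self.name = name
        self.hand = hand

    def choose_red_card(self, green_card):
        return random.choice(self.hand)


class ModelPlayer:
    """LLM-backed player: asks a language model which red card best matches
    the green card. `ask_model` is a hypothetical callable taking (model, prompt)."""

    def __init__(self, name, hand, model, ask_model):
        self.name = name
        self.hand = hand
        self.model = model
        self.ask_model = ask_model

    def choose_red_card(self, green_card):
        prompt = (
            f"Green card: {green_card}\n"
            f"Red cards: {', '.join(self.hand)}\n"
            "Reply with the single red card that best matches the green card."
        )
        answer = self.ask_model(self.model, prompt)
        # Fall back to a random card if the model's reply isn't actually in the hand.
        return answer if answer in self.hand else random.choice(self.hand)
```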
To run a game, use the `benchmark.run` module with the following arguments:
- `--rounds`: Number of rounds to play
- `--players`: Number of players in the game
- `--models`: Model type for each player (one per player)
Example commands:

```bash
# Run a game with all real models
python -m benchmark.run --rounds 5 --players 3 --models gpt-4 claude-2 gpt-3.5-turbo

# Mix random and real models
python -m benchmark.run --rounds 5 --players 3 --models random gpt-4 random

# Test with all random players
python -m benchmark.run --rounds 5 --players 3 --models random random random
```
Available player types:
- `random`: Makes random selections (useful for testing and baselines)
- Real models (via OpenRouter API): `gpt-4`, `gpt-3.5-turbo`, `claude-2`, and other models supported by OpenRouter
The benchmark uses the OpenRouter API for model access. Set up your environment:
- Create a `.env` file in the project root
- Add your OpenRouter API key: `OPEN_ROUTER_KEY=your_api_key_here`
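For reference, here is a minimal sketch of loading that key and making a request to OpenRouter's OpenAI-compatible chat completions endpoint. It assumes the `python-dotenv` and `requests` packages are installed; the model name and prompt are placeholders, and this is not necessarily how the benchmark itself issues requests:

```python
import os

import requests
from dotenv import load_dotenv

load_dotenv()  # reads OPEN_ROUTER_KEY from the .env file in the project root
api_key = os.getenv("OPEN_ROUTER_KEY")

response = requests.post(
    "https://openrouter.ai/api/v1/chat/completions",
    headers={"Authorization": f"Bearer {api_key}"},
    json={
        "model": "openai/gpt-3.5-turbo",  # placeholder model name
        "messages": [{"role": "user", "content": "Say hello"}],
    },
    timeout=30,
)
response.raise_for_status()
print(response.json()["choices"][0]["message"]["content"])
```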
🚧 Work in Progress 🚧
This project is in its early stages of development. Stay tuned for updates!
This project is being developed using Mentat, an AI-powered coding assistant. The entire codebase is being written through interactions with the Mentat GitHub bot.