This project simulates multiple AI language models playing the social deduction game Werewolf (also known as Mafia). Each model is assigned a role as either a Villager or a Werewolf, and the players interact with each other to deduce who is who.
- Players are divided into two teams: Werewolves and Villagers
- Each night, Werewolves choose a Villager to eliminate
- Each day, all players discuss and vote to eliminate a suspected Werewolf
- Werewolves win if they equal or outnumber Villagers
- Villagers win if they eliminate all Werewolves
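The two win conditions above can be expressed as a small check that runs after every elimination. This is a minimal sketch, not the code in `werewolf.ts`: the `Player` shape and the `checkWinner` name are assumptions made for illustration.

```typescript
// Hypothetical types for illustration; the real engine may model players differently.
type Role = "werewolf" | "villager";

interface Player {
  name: string;
  role: Role;
  alive: boolean;
}

type Winner = "werewolves" | "villagers" | null;

// Returns the winning team, or null if the game should continue.
function checkWinner(players: Player[]): Winner {
  const alive = players.filter((p) => p.alive);
  const wolves = alive.filter((p) => p.role === "werewolf").length;
  const villagers = alive.length - wolves;

  // Villagers win once every Werewolf has been eliminated.
  if (wolves === 0) return "villagers";
  // Werewolves win once they equal or outnumber the Villagers.
  if (wolves >= villagers) return "werewolves";
  return null;
}
```

Checking the Villager condition first means a game with no living players on either side is never counted as a Werewolf win.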
- `werewolf.ts` - Main game engine
- `llms.txt` - List of LLM models to use as players
- `names.txt` - List of player names
- `matches/` - Directory containing game transcripts
- `eval.ts` - Script to analyze game statistics
- `stats/` - Directory containing generated statistics
- Bun runtime
- OpenRouter API key for accessing multiple LLM models
This project is licensed under the MIT License - see the LICENSE file for details.
To run a new game of Werewolf, use:
`OPEN_ROUTER_API_KEY=sk-or-v1-asdf bun werewolf.ts`

This will:
- Randomly assign roles to 10 different AI models
- Run through night and day phases
- Save the complete game transcript to the `matches/` directory
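The role assignment step can be sketched as a shuffle followed by a split. This is an illustration under assumptions, not the engine's actual logic: `assignRoles`, `shuffle`, and the `werewolfCount` parameter are hypothetical names, and the number of Werewolves per game is left as a parameter because it is not stated here.

```typescript
type Role = "werewolf" | "villager";

interface Assignment {
  model: string;
  role: Role;
}

// Fisher-Yates shuffle on a copy; `rng` is injectable so runs can be made reproducible.
function shuffle<T>(items: T[], rng: () => number = Math.random): T[] {
  const out = [...items];
  for (let i = out.length - 1; i > 0; i--) {
    const j = Math.floor(rng() * (i + 1));
    [out[i], out[j]] = [out[j], out[i]];
  }
  return out;
}

// The first `werewolfCount` models after shuffling become Werewolves, the rest Villagers.
function assignRoles(
  models: string[],
  werewolfCount: number,
  rng: () => number = Math.random,
): Assignment[] {
  if (werewolfCount < 0 || werewolfCount > models.length) {
    throw new RangeError("werewolfCount must be between 0 and the number of models");
  }
  return shuffle(models, rng).map((model, i) => ({
    model,
    role: i < werewolfCount ? "werewolf" : "villager",
  }));
}
```

For example, `assignRoles(models, 3)` on a list of 10 model IDs would yield 3 Werewolves and 7 Villagers in random order.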
Running a single game typically costs about $5 USD on OpenRouter.
You can analyze the performance of different LLM models across games using:
`bun eval.ts`

This generates statistics on:
- Overall win rates per model
- Survival rates per model (alive at the end of the game)
- Win rates as Villager per model
- Win rates as Werewolf per model
- Role distribution
- Game-wide statistics
The statistics are saved to `stats/werewolf_stats.md` and can be viewed directly on GitHub.
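To make the per-model metrics concrete, here is a rough sketch of how they could be aggregated from per-player results. It is not the code in `eval.ts`: the `PlayerResult` record, the `aggregate` and `pct` helpers, and the choice to report `0.0%` when a model has no games in a role are all assumptions for illustration.

```typescript
// Hypothetical per-player record; the real transcript format may differ.
interface PlayerResult {
  model: string;
  role: "werewolf" | "villager";
  won: boolean;      // the player's team won the game
  survived: boolean; // the player was alive at the end of the game
}

interface ModelStats {
  games: number;
  wins: number;
  survived: number;
  villagerGames: number;
  villagerWins: number;
  werewolfGames: number;
  werewolfWins: number;
}

// Accumulates counts per model across all games.
function aggregate(results: PlayerResult[]): Map<string, ModelStats> {
  const stats = new Map<string, ModelStats>();
  for (const r of results) {
    const s = stats.get(r.model) ?? {
      games: 0, wins: 0, survived: 0,
      villagerGames: 0, villagerWins: 0,
      werewolfGames: 0, werewolfWins: 0,
    };
    s.games += 1;
    if (r.won) s.wins += 1;
    if (r.survived) s.survived += 1;
    if (r.role === "villager") {
      s.villagerGames += 1;
      if (r.won) s.villagerWins += 1;
    } else {
      s.werewolfGames += 1;
      if (r.won) s.werewolfWins += 1;
    }
    stats.set(r.model, s);
  }
  return stats;
}

// Formats a ratio as a percentage with one decimal place; 0 games is shown as 0.0%.
function pct(numerator: number, denominator: number): string {
  if (denominator === 0) return "0.0%";
  return `${((numerator / denominator) * 100).toFixed(1)}%`;
}
```

With these helpers, a model's overall win rate is `pct(s.wins, s.games)` and its Werewolf win rate is `pct(s.werewolfWins, s.werewolfGames)`.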
Last updated: 2025-02-26
| Model | Games | Win % | Survival % | Villager Games (% of total) | Villager Win Rate | Werewolf Games (% of total) | Werewolf Win Rate |
|---|---|---|---|---|---|---|---|
| anthropic/claude-3.5-haiku-20241022 | 1 | 100.0% | 100.0% | 1 (100.0%) | 100.0% | 0 (0.0%) | 0.0% |
| thedrummer/unslopnemo-12b | 1 | 100.0% | 0.0% | 1 (100.0%) | 100.0% | 0 (0.0%) | 0.0% |
| sao10k/l3-euryale-70b | 1 | 100.0% | 0.0% | 1 (100.0%) | 100.0% | 0 (0.0%) | 0.0% |
| perplexity/llama-3.1-sonar-small-128k-chat | 1 | 100.0% | 0.0% | 1 (100.0%) | 100.0% | 0 (0.0%) | 0.0% |
| mistralai/mistral-tiny | 1 | 100.0% | 0.0% | 1 (100.0%) | 100.0% | 0 (0.0%) | 0.0% |
| cognitivecomputations/dolphin-mixtral-8x22b | 1 | 100.0% | 100.0% | 1 (100.0%) | 100.0% | 0 (0.0%) | 0.0% |
| x-ai/grok-2-vision-1212 | 1 | 100.0% | 0.0% | 0 (0.0%) | 0.0% | 1 (100.0%) | 100.0% |
| openai/gpt-4-turbo-preview | 1 | 100.0% | 100.0% | 0 (0.0%) | 0.0% | 1 (100.0%) | 100.0% |
| x-ai/grok-2-1212 | 1 | 100.0% | 100.0% | 0 (0.0%) | 0.0% | 1 (100.0%) | 100.0% |
| meta-llama/llama-guard-2-8b | 1 | 100.0% | 100.0% | 0 (0.0%) | 0.0% | 1 (100.0%) | 100.0% |
| openai/gpt-4-32k-0314 | 1 | 100.0% | 100.0% | 0 (0.0%) | 0.0% | 1 (100.0%) | 100.0% |
| openai/gpt-3.5-turbo-1106 | 1 | 100.0% | 100.0% | 0 (0.0%) | 0.0% | 1 (100.0%) | 100.0% |
| anthropic/claude-2 | 2 | 50.0% | 50.0% | 2 (100.0%) | 50.0% | 0 (0.0%) | 0.0% |
| openai-gpt-4o | 9 | 33.3% | 44.4% | 6 (66.7%) | 0.0% | 3 (33.3%) | 100.0% |
| sophosympatheia/rogue-rose-103b-v0.2:free | 1 | 0.0% | 0.0% | 0 (0.0%) | 0.0% | 1 (100.0%) | 0.0% |
| google/gemini-2.0-flash-thinking-exp-1219:free | 1 | 0.0% | 0.0% | 0 (0.0%) | 0.0% | 1 (100.0%) | 0.0% |
| openai/o1-preview | 1 | 0.0% | 0.0% | 0 (0.0%) | 0.0% | 1 (100.0%) | 0.0% |
| openai/chatgpt-4o-latest | 1 | 0.0% | 0.0% | 1 (100.0%) | 0.0% | 0 (0.0%) | 0.0% |
| aion-labs/aion-1.0 | 1 | 0.0% | 0.0% | 1 (100.0%) | 0.0% | 0 (0.0%) | 0.0% |
| deepseek/deepseek-r1-distill-qwen-1.5b | 1 | 0.0% | 0.0% | 1 (100.0%) | 0.0% | 0 (0.0%) | 0.0% |
| anthropic/claude-2.0 | 1 | 0.0% | 0.0% | 1 (100.0%) | 0.0% | 0 (0.0%) | 0.0% |
| cohere/command-r-03-2024 | 1 | 0.0% | 100.0% | 1 (100.0%) | 0.0% | 0 (0.0%) | 0.0% |
| qwen/qwen-2-72b-instruct | 1 | 0.0% | 100.0% | 1 (100.0%) | 0.0% | 0 (0.0%) | 0.0% |
| microsoft/phi-3-mini-128k-instruct | 1 | 0.0% | 0.0% | 1 (100.0%) | 0.0% | 0 (0.0%) | 0.0% |
| perplexity/sonar-reasoning | 1 | 0.0% | 0.0% | 1 (100.0%) | 0.0% | 0 (0.0%) | 0.0% |
| mistralai/mixtral-8x7b | 1 | 0.0% | 0.0% | 1 (100.0%) | 0.0% | 0 (0.0%) | 0.0% |
| microsoft/wizardlm-2-7b | 1 | 0.0% | 100.0% | 1 (100.0%) | 0.0% | 0 (0.0%) | 0.0% |
| microsoft/phi-4 | 1 | 0.0% | 0.0% | 1 (100.0%) | 0.0% | 0 (0.0%) | 0.0% |
| anthropic/claude-3.5-haiku | 1 | 0.0% | 100.0% | 1 (100.0%) | 0.0% | 0 (0.0%) | 0.0% |
| mistralai/mistral-medium | 1 | 0.0% | 0.0% | 1 (100.0%) | 0.0% | 0 (0.0%) | 0.0% |
- Total matches analyzed: 4
- Total players: 39
- Werewolves: 12 (30.8%)
- Villagers: 27 (69.2%)
- Werewolf team wins: 3 (75.0%)
- Villager team wins: 1 (25.0%)