LLM-Werewolf: LLMs Playing Werewolf

This project simulates multiple AI language models playing the social deduction game Werewolf (also known as Mafia). Each model takes the role of either a Villager or a Werewolf, and the models interact with each other to deduce who is who.

TLDR

Players are divided into two teams: Werewolves and Villagers
Each night, Werewolves choose a Villager to eliminate
Each day, all players discuss and vote to eliminate a suspected Werewolf
Werewolves win if they equal or outnumber Villagers
Villagers win if they eliminate all Werewolves
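
The two win conditions above can be sketched as a single check over the living players. This is a minimal illustration, not the logic in werewolf.ts: the `Player` shape and the `checkWinner` name are assumptions made for this example.

```typescript
// Hypothetical types for illustration; the real engine may model players differently.
type Role = "Villager" | "Werewolf";
type Winner = "Werewolves" | "Villagers" | null;

interface Player {
  name: string;
  role: Role;
  alive: boolean;
}

// Returns the winning team, or null if the game should continue.
function checkWinner(players: Player[]): Winner {
  const alive = players.filter((p) => p.alive);
  const wolves = alive.filter((p) => p.role === "Werewolf").length;
  const villagers = alive.length - wolves;

  // Villagers win once every Werewolf has been eliminated.
  if (wolves === 0) return "Villagers";
  // Werewolves win once they equal or outnumber the Villagers.
  if (wolves >= villagers) return "Werewolves";
  return null;
}
```

A check like this would typically run after every night kill and every day vote, since either event can end the game.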

Project Structure

werewolf.ts - Main game engine
llms.txt - List of LLM models to use as players
names.txt - List of player names
matches/ - Directory containing game transcripts
eval.ts - Script to analyze game statistics
stats/ - Directory containing generated statistics

Requirements

Bun runtime
OpenRouter API key for accessing multiple LLM models

License

This project is licensed under the MIT License - see the LICENSE file for details.

Running a Game

To run a new game of Werewolf, set your OpenRouter API key and run the engine (the key shown is a placeholder):

OPEN_ROUTER_API_KEY=sk-or-v1-asdf bun werewolf.ts

This will:

Randomly assign roles to 10 different AI models
Run through night and day phases
Save the complete game transcript to the matches/ directory
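
Random role assignment could look roughly like the sketch below, which shuffles the model list and marks the first few entries as Werewolves. The function names and the `werewolfCount` parameter are assumptions for this example; how many Werewolves the engine actually assigns per game is not stated here.

```typescript
type Role = "Villager" | "Werewolf";

interface Assignment {
  model: string;
  role: Role;
}

// Fisher-Yates shuffle that returns a new array. The rng parameter is an
// assumption added so the shuffle can be made deterministic in tests.
function shuffle<T>(items: readonly T[], rng: () => number = Math.random): T[] {
  const out = [...items];
  for (let i = out.length - 1; i > 0; i--) {
    const j = Math.floor(rng() * (i + 1));
    const tmp = out[i] as T;
    out[i] = out[j] as T;
    out[j] = tmp;
  }
  return out;
}

// Shuffles the models, then makes the first werewolfCount of them Werewolves.
function assignRoles(
  models: readonly string[],
  werewolfCount: number,
  rng: () => number = Math.random,
): Assignment[] {
  if (werewolfCount < 1 || werewolfCount >= models.length) {
    throw new RangeError("werewolfCount must be at least 1 and less than the number of players");
  }
  return shuffle(models, rng).map((model, i) => ({
    model,
    role: i < werewolfCount ? "Werewolf" : "Villager",
  }));
}
```

Returning an array rather than a map keyed by model name keeps the sketch correct even if the same model appears more than once in the player list.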

Running a single game typically costs about $5 USD on OpenRouter.

Analyzing Game Statistics

You can analyze how different models perform across games using:

bun eval.ts

This generates statistics on:

Overall win rates per model
Survival rates per model (alive at the end of the game)
Win rates as Villager per model
Win rates as Werewolf per model
Role distribution
Game-wide statistics

The statistics are saved to stats/werewolf_stats.md and can be viewed directly on GitHub.
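
As an illustration of how per-model win and survival rates might be aggregated, here is a minimal sketch. It is not the code in eval.ts: the `PlayerResult` record shape and the `summarize` function are assumptions, and the real script reads its input from the transcripts in matches/.

```typescript
// Hypothetical record of one player's outcome in one game.
interface PlayerResult {
  model: string;
  role: "Villager" | "Werewolf";
  won: boolean;      // the player's team won the game
  survived: boolean; // the player was alive at the end of the game
}

interface ModelStats {
  games: number;
  winRate: number;      // fraction between 0 and 1
  survivalRate: number; // fraction between 0 and 1
}

// Groups results by model and computes overall win and survival rates.
function summarize(results: readonly PlayerResult[]): Map<string, ModelStats> {
  const counts = new Map<string, { games: number; wins: number; survived: number }>();
  for (const r of results) {
    const c = counts.get(r.model) ?? { games: 0, wins: 0, survived: 0 };
    c.games += 1;
    if (r.won) c.wins += 1;
    if (r.survived) c.survived += 1;
    counts.set(r.model, c);
  }

  const stats = new Map<string, ModelStats>();
  counts.forEach((c, model) => {
    stats.set(model, {
      games: c.games,
      winRate: c.wins / c.games,
      survivalRate: c.survived / c.games,
    });
  });
  return stats;
}
```

Per-role win rates would follow the same pattern, with the results filtered by `role` before counting. With only a handful of games per model, the resulting percentages are not statistically meaningful and are best read as anecdotes.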

Werewolf Game Statistics

Last updated: 2025-02-26

Model Performance (Sorted by Win Rate)

Model	Games	Win %	Survival %	Villager Games (% of total)	Villager Win Rate	Werewolf Games (% of total)	Werewolf Win Rate
anthropic/claude-3.5-haiku-20241022	1	100.0%	100.0%	1 (100.0%)	100.0%	0 (0.0%)	0.0%
thedrummer/unslopnemo-12b	1	100.0%	0.0%	1 (100.0%)	100.0%	0 (0.0%)	0.0%
sao10k/l3-euryale-70b	1	100.0%	0.0%	1 (100.0%)	100.0%	0 (0.0%)	0.0%
perplexity/llama-3.1-sonar-small-128k-chat	1	100.0%	0.0%	1 (100.0%)	100.0%	0 (0.0%)	0.0%
mistralai/mistral-tiny	1	100.0%	0.0%	1 (100.0%)	100.0%	0 (0.0%)	0.0%
cognitivecomputations/dolphin-mixtral-8x22b	1	100.0%	100.0%	1 (100.0%)	100.0%	0 (0.0%)	0.0%
x-ai/grok-2-vision-1212	1	100.0%	0.0%	0 (0.0%)	0.0%	1 (100.0%)	100.0%
openai/gpt-4-turbo-preview	1	100.0%	100.0%	0 (0.0%)	0.0%	1 (100.0%)	100.0%
x-ai/grok-2-1212	1	100.0%	100.0%	0 (0.0%)	0.0%	1 (100.0%)	100.0%
meta-llama/llama-guard-2-8b	1	100.0%	100.0%	0 (0.0%)	0.0%	1 (100.0%)	100.0%
openai/gpt-4-32k-0314	1	100.0%	100.0%	0 (0.0%)	0.0%	1 (100.0%)	100.0%
openai/gpt-3.5-turbo-1106	1	100.0%	100.0%	0 (0.0%)	0.0%	1 (100.0%)	100.0%
anthropic/claude-2	2	50.0%	50.0%	2 (100.0%)	50.0%	0 (0.0%)	0.0%
openai-gpt-4o	9	33.3%	44.4%	6 (66.7%)	0.0%	3 (33.3%)	100.0%
sophosympatheia/rogue-rose-103b-v0.2:free	1	0.0%	0.0%	0 (0.0%)	0.0%	1 (100.0%)	0.0%
google/gemini-2.0-flash-thinking-exp-1219:free	1	0.0%	0.0%	0 (0.0%)	0.0%	1 (100.0%)	0.0%
openai/o1-preview	1	0.0%	0.0%	0 (0.0%)	0.0%	1 (100.0%)	0.0%
openai/chatgpt-4o-latest	1	0.0%	0.0%	1 (100.0%)	0.0%	0 (0.0%)	0.0%
aion-labs/aion-1.0	1	0.0%	0.0%	1 (100.0%)	0.0%	0 (0.0%)	0.0%
deepseek/deepseek-r1-distill-qwen-1.5b	1	0.0%	0.0%	1 (100.0%)	0.0%	0 (0.0%)	0.0%
anthropic/claude-2.0	1	0.0%	0.0%	1 (100.0%)	0.0%	0 (0.0%)	0.0%
cohere/command-r-03-2024	1	0.0%	100.0%	1 (100.0%)	0.0%	0 (0.0%)	0.0%
qwen/qwen-2-72b-instruct	1	0.0%	100.0%	1 (100.0%)	0.0%	0 (0.0%)	0.0%
microsoft/phi-3-mini-128k-instruct	1	0.0%	0.0%	1 (100.0%)	0.0%	0 (0.0%)	0.0%
perplexity/sonar-reasoning	1	0.0%	0.0%	1 (100.0%)	0.0%	0 (0.0%)	0.0%
mistralai/mixtral-8x7b	1	0.0%	0.0%	1 (100.0%)	0.0%	0 (0.0%)	0.0%
microsoft/wizardlm-2-7b	1	0.0%	100.0%	1 (100.0%)	0.0%	0 (0.0%)	0.0%
microsoft/phi-4	1	0.0%	0.0%	1 (100.0%)	0.0%	0 (0.0%)	0.0%
anthropic/claude-3.5-haiku	1	0.0%	100.0%	1 (100.0%)	0.0%	0 (0.0%)	0.0%
mistralai/mistral-medium	1	0.0%	0.0%	1 (100.0%)	0.0%	0 (0.0%)	0.0%

Game Summary

Total matches analyzed: 4
Total players: 39
Werewolves: 12 (30.8%)
Villagers: 27 (69.2%)
Werewolf team wins: 3 (75.0%)
Villager team wins: 1 (25.0%)

gabigabogabu/llm-werewolf