This repository contains a complete AI evaluations course built around a Recipe Chatbot. Through 5 progressive homework assignments, you'll learn practical techniques for evaluating and improving AI systems.
1. **Clone & Setup**

   ```bash
   git clone https://github.com/ai-evals-course/recipe-chatbot.git
   cd recipe-chatbot
   uv sync
   source .venv/bin/activate
   ```

2. **Configure Environment**

   ```bash
   cp env.example .env
   # Edit .env to add your model and API keys
   ```

3. **Run the Chatbot**

   ```bash
   uv run uvicorn backend.main:app --reload
   # Open http://127.0.0.1:8000
   ```
Bonus: Using AI-Assisted Coding to Tackle Homework Problems
- **HW1: Basic Prompt Engineering** (`homeworks/hw1/`) - Write system prompts and expand test queries
  - Walkthrough: See the HW2 walkthrough for HW1 content
- **HW2: Error Analysis & Failure Taxonomy** (`homeworks/hw2/`)
- **HW3: LLM-as-Judge Evaluation** (`homeworks/hw3/`) - Automated evaluation using the `judgy` library (see the judge sketch after this list)
  - Interactive Walkthrough:
    - Code: `homeworks/hw3/hw3_walkthrough.ipynb`
    - Video: walkthrough of solution
- **HW4: RAG/Retrieval Evaluation** (`homeworks/hw4/`) - BM25 retrieval system with synthetic query generation
  - Interactive Walkthrough:
    - Code: `homeworks/hw4/hw4_walkthrough.ipynb`
    - Video: walkthrough of solution
- **HW5: Agent Failure Analysis** (`homeworks/hw5/`) - Analyze conversation traces and failure patterns
  - Interactive Walkthrough:
    - Code: `homeworks/hw5/hw5_walkthrough.ipynb`
    - Video: walkthrough of solution
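HW3's automated evaluation is built around the `judgy` library, which corrects the judge's bias against a small set of human labels. The sketch below only illustrates the underlying LLM-as-judge call, using LiteLLM (which the backend already depends on); the prompt, the PASS/FAIL labels, and the parsing are simplified assumptions for illustration, not the course's actual judge.

```python
# Minimal LLM-as-judge sketch using LiteLLM (illustrative only, not the HW3 solution).
# The judge prompt, labels, and parsing below are simplified assumptions.
import litellm

JUDGE_PROMPT = """You are grading a recipe chatbot answer.
Question: {question}
Answer: {answer}
Reply with exactly one word: PASS or FAIL."""

def judge_answer(question: str, answer: str, model: str = "openai/gpt-5-mini") -> bool:
    """Return True if the judge model labels the answer PASS."""
    response = litellm.completion(
        model=model,
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(question=question, answer=answer),
        }],
    )
    verdict = (response.choices[0].message.content or "").strip().upper()
    return verdict.startswith("PASS")

# Example:
# judge_answer("How do I make pesto?", "Blend basil, pine nuts, parmesan, garlic, and olive oil.")
```

In the homework, raw judge verdicts like these are then combined with human-labeled examples and `judgy` to estimate a bias-corrected success rate rather than trusting the judge's pass rate directly.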
- Backend: FastAPI with LiteLLM (multi-provider LLM support)
- Frontend: Simple chat interface with conversation history
- Annotation Tool: FastHTML-based interface for manual evaluation (`annotation/`)
- Retrieval: BM25-based recipe search (`backend/retrieval.py`); a minimal sketch follows this list
- Query Rewriting: LLM-powered query optimization (`backend/query_rewrite_agent.py`)
- Evaluation Tools: Automated metrics, bias correction, and analysis scripts
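Both the retrieval component and HW4 revolve around BM25 scoring. As a rough illustration of the idea, here is a toy sketch using the `rank_bm25` package; the tiny corpus, the whitespace tokenization, and even the choice of library are assumptions made for brevity. The project's real implementation lives in `backend/retrieval.py`.

```python
# Toy BM25 retrieval sketch (illustrative only; see backend/retrieval.py for the real code).
from rank_bm25 import BM25Okapi

recipes = [
    "Spicy chickpea curry with coconut milk and rice",
    "Classic margherita pizza with fresh basil",
    "Quick vegetable stir fry with tofu and soy sauce",
]

# Whitespace tokenization keeps the example short; a real index would normalize text more carefully.
tokenized = [doc.lower().split() for doc in recipes]
bm25 = BM25Okapi(tokenized)

query = "vegetarian curry".lower().split()
scores = bm25.get_scores(query)

# Rank recipes by BM25 score, best match first.
for score, recipe in sorted(zip(scores, recipes), reverse=True):
    print(f"{score:.2f}  {recipe}")
```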
```
recipe-chatbot/
├── backend/      # FastAPI app & core logic
├── frontend/     # Chat UI (HTML/CSS/JS)
├── homeworks/    # 5 progressive assignments
│   ├── hw1/      # Prompt engineering
│   ├── hw2/      # Error analysis (with walkthrough)
│   ├── hw3/      # LLM-as-Judge (with walkthrough)
│   ├── hw4/      # Retrieval eval (with walkthroughs)
│   └── hw5/      # Agent analysis
├── annotation/   # Manual annotation tools
├── scripts/      # Utility scripts
├── data/         # Datasets and queries
└── results/      # Evaluation outputs
```
Each homework (HW2-HW5) includes a self-contained Jupyter notebook walkthrough:
```bash
cd homeworks/hw2
jupyter notebook hw2_walkthrough.ipynb
```

The walkthroughs use data from `reference_files/` and can be run without any external scripts. Each notebook includes:
- Data loading and exploration
- Step-by-step solution code
- Expected outputs and analysis
- Annotation Interface: Run `python annotation/annotation.py` for manual evaluation
- Bulk Testing: Use `python scripts/bulk_test.py` to test multiple queries
- Trace Analysis: All conversations saved as JSON for analysis (see the sketch after this list)
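To give a feel for what trace analysis looks like, here is a hypothetical sketch that counts assistant turns per saved conversation. The `results/` location and the `{"messages": [...]}` schema are assumptions for illustration and may not match how this repository actually stores its traces.

```python
# Hypothetical trace-analysis sketch: count assistant turns per conversation.
# Assumes traces are JSON files under results/ shaped like {"messages": [{"role": ..., "content": ...}]};
# the real filenames and schema in this repo may differ.
import json
from pathlib import Path

for path in sorted(Path("results").glob("*.json")):
    trace = json.loads(path.read_text())
    assistant_turns = sum(1 for m in trace.get("messages", []) if m.get("role") == "assistant")
    print(f"{path.name}: {assistant_turns} assistant turns")
```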
Configure your `.env` file with:

- `MODEL_NAME`: LLM model for the chatbot (e.g., `openai/gpt-5-chat-latest`, `anthropic/claude-3-sonnet-20240229`)
- `MODEL_NAME_JUDGE`: LLM model for the judge, which can be smaller than the chatbot model (e.g., `openai/gpt-5-mini`, `anthropic/claude-3-haiku-20240307`)
- API keys: `OPENAI_API_KEY`, `ANTHROPIC_API_KEY`, etc.
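For example, a minimal `.env` might look like the following; the model choices are just placeholders, and the key value is obviously not real:

```
MODEL_NAME=openai/gpt-5-chat-latest
MODEL_NAME_JUDGE=openai/gpt-5-mini
OPENAI_API_KEY=sk-your-key-here
```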
See LiteLLM docs for supported providers.
This course emphasizes:
- Practical experience over theory
- Systematic evaluation over "vibes"
- Progressive complexity - each homework builds on previous work
- Industry-standard techniques for real-world AI evaluation
