A comprehensive collection of evaluation methods, tools, and frameworks for assessing Large Language Models (LLMs), RAG systems, and AI agents in real-world applications.
This repository provides practical implementations and detailed guidance for evaluating AI systems, with a focus on understanding when and how to apply different evaluation methods. Unlike simple metric collections, we offer working code, mathematical foundations, and domain-specific considerations.
- Implementation-First: Every metric includes complete, tested code examples
- Decision Frameworks: Clear tables and guides for metric selection
- Mathematical Rigor: Understanding the theory behind each evaluation method
- Domain Expertise: Tailored approaches for medical, legal, financial, and other specialized applications
- System-Level Thinking: Evaluation of components and their interactions
| Concept | Finding | Source | Application |
|---|---|---|---|
| Consistency vs Accuracy | Models can have high accuracy but low consistency | SCORE (NVIDIA 2025) | Evaluate reliability alongside correctness |
| Pass@k vs Pass^k | Metrics measure different aspects (optimistic vs reliable) | Code generation research | Choose based on deployment needs |
| Confidence Scoring | Ensemble methods correlate better with accuracy than logprobs | Industry studies | Use majority voting for confidence |
| Component Interaction | System performance ≠ sum of component performance | RAG research | Evaluate end-to-end and per-component |
| Bias Detection | Bayes Factors superior to p-values | QuaCer-B research | Use Bayesian methods for statistical rigor |
- Getting Started
- Evaluation Methods
- Domain-Specific Evaluation
- Tools & Platforms
- Benchmarks & Datasets
- Implementation Guide
- Resources
- Contributing
- Citation
| Task Type | Primary Metrics | Secondary Metrics | Key Considerations |
|---|---|---|---|
| Text Generation | Perplexity, G-Eval | BLEU, ROUGE | Need reference texts for BLEU/ROUGE |
| Question Answering | Answer Correctness, Faithfulness | BERTScore, Exact Match | Domain expertise affects threshold |
| Code Generation | Pass@k (benchmarks), Pass^k (reliability) | Syntax validity, Security | Pass@k ≠ Pass^k for planning |
| RAG Systems | Faithfulness, Context Relevance | Precision@k, NDCG | Evaluate retrieval and generation separately |
| Translation | BLEU, METEOR | BERTScore, Human eval | BLEU has known limitations |
| Summarization | ROUGE, Relevance | Coherence, Consistency | ROUGE may miss semantic equivalence |
| Dialogue | Coherence, Engagement | Response diversity | Context window important |
| Multi-Agent | Task completion, Coordination | Communication efficiency | System-level metrics needed |
| Domain | Metric Type | Typical Threshold | Rationale |
|---|---|---|---|
| Medical | Faithfulness | > 0.9 | Patient safety critical |
| Legal | Factual accuracy | > 0.95 | Regulatory compliance |
| Financial | Numerical precision | > 0.98 | Monetary implications |
| Customer Support | Response relevance | > 0.7 | User satisfaction |
| Creative Writing | Diversity score | > 0.6 | Avoid repetition |
| Education | Answer correctness | > 0.85 | Learning outcomes |
- Python 3.8+ for evaluation frameworks
- Node.js 14+ for JavaScript-based tools (optional)
- API keys for LLM providers (OpenAI, Anthropic, etc.)
- Docker for self-hosted solutions (optional)
# Clone this repository
git clone https://github.com/hparreao/Awesome-AI-Evaluation-Guide.git
cd Awesome-AI-Evaluation-Guide
# Install core dependencies
pip install -r requirements.txt
# Optional: Install specific evaluation frameworks
pip install deepeval ragas langfuse trulens-eval # Python frameworks
npm install -g promptfoo  # JavaScript CLI tool (optional)
from examples.llm_as_judge import evaluate_response
result = evaluate_response(
question="What is the capital of France?",
response="Paris is the capital of France.",
criteria="factual_accuracy"
)
print(f"Score: {result.score}, Reasoning: {result.reasoning}")from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy
# Evaluate your RAG system
result = evaluate(
dataset=your_test_data,
metrics=[faithfulness, answer_relevancy]
)
print(f"Faithfulness: {result['faithfulness']:.2f}")from examples.consistency_robustness.score_framework import SCOREEvaluator
evaluator = SCOREEvaluator(model=your_model, k=5)
metrics = evaluator.evaluate(test_cases)
print(f"Accuracy: {metrics.accuracy:.2f}, Consistency: {metrics.consistency_rate:.2f}")Foundational metrics from NLP research, adapted for LLM evaluation.
What it measures: Model uncertainty in predicting the next token. Lower values indicate better performance.
When to use:
- Comparing language models on the same task
- Pre-training evaluation
- Domain adaptation assessment
Implementation: examples/traditional_metrics/perplexity.py
Documentation: docs/traditional-metrics/perplexity.md
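For quick reference, here is a minimal sketch of perplexity computed directly from per-token log-probabilities (however you obtain them from your model); perplexity is the exponential of the mean negative log-likelihood.

```python
import math

def perplexity(token_logprobs: list[float]) -> float:
    """Perplexity = exp(mean negative log-likelihood) over all scored tokens."""
    if not token_logprobs:
        raise ValueError("Need at least one token log-probability")
    nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(nll)

# Example: natural-log probabilities returned by a model for each generated token
logprobs = [-0.21, -1.35, -0.08, -2.4, -0.57]
print(f"Perplexity: {perplexity(logprobs):.2f}")
```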
What it measures: Precision-based n-gram overlap between generated and reference text.
When to use:
- Machine translation evaluation
- Text generation with reference outputs
- Paraphrase quality assessment
Limitations:
- Doesn't account for semantic similarity
- Biased toward shorter outputs
- Requires reference text
Implementation: examples/traditional_metrics/bleu_score.py
Documentation: docs/traditional-metrics/bleu-score.md
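A corpus-level BLEU sketch using the `sacrebleu` package (one common choice; the repository's own implementation may differ):

```python
import sacrebleu  # pip install sacrebleu

hypotheses = ["The cat sits on the mat."]
references = [["The cat is sitting on the mat."]]  # one reference stream, aligned with hypotheses

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(f"BLEU: {bleu.score:.1f}")  # 0-100 scale; rewards exact n-gram overlap only
```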
What it measures: Recall-oriented n-gram overlap, primarily for summarization.
When to use:
- Summarization tasks
- Content coverage assessment
- Information preservation evaluation
Implementation: examples/traditional_metrics/rouge_score.py
Documentation: docs/traditional-metrics/rouge-score.md
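A ROUGE sketch using Google's `rouge-score` package (one common choice, not necessarily the one used in the repository examples):

```python
from rouge_score import rouge_scorer  # pip install rouge-score

scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
scores = scorer.score(
    target="The quick brown fox jumps over the lazy dog.",   # reference summary
    prediction="A quick brown fox jumped over a lazy dog.",  # generated summary
)
print(f"ROUGE-1 recall: {scores['rouge1'].recall:.2f}")
print(f"ROUGE-L F1: {scores['rougeL'].fmeasure:.2f}")
```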
Leverage model confidence through token probabilities.
What it measures: Log probabilities for each generated token.
Applications:
- Hallucination detection (low probability = potential hallucination)
- Confidence estimation
- Classification with uncertainty quantification
Key Finding: OpenAI research shows logprobs enable reliable confidence scoring for classification tasks.
Implementation: examples/probability_based/logprobs.py
Documentation: docs/probability-based/logprobs.md
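A sketch of the hallucination-flagging application above, assuming the OpenAI Chat Completions API with `logprobs` enabled (adapt the client call to your provider):

```python
import math
from openai import OpenAI

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Who wrote 'The Master and Margarita'?"}],
    logprobs=True,
)

# Flag low-probability tokens as potential hallucination signals
for token_info in response.choices[0].logprobs.content:
    prob = math.exp(token_info.logprob)
    if prob < 0.3:  # heuristic threshold; tune per task
        print(f"Low-confidence token: {token_info.token!r} (p={prob:.2f})")
```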
What it measures: Distribution of top-k most probable tokens at each position.
Applications:
- Diversity assessment
- Uncertainty quantification
- Alternative generation paths exploration
Implementation: examples/probability_based/topk_analysis.py
Documentation: docs/probability-based/topk-analysis.md
Use LLMs to evaluate LLM outputs based on custom criteria.
What it is: Chain-of-thought (CoT) based evaluation using LLMs with token probability normalization.
Why it works:
- Better human alignment than traditional metrics
- Flexible custom criteria
- Token probability weighting reduces bias
Production Scale: DeepEval processes 10M+ G-Eval metrics monthly.
Core Use Cases:
- Answer Correctness - Validate factual accuracy
- Coherence & Clarity - Assess text quality without references
- Tonality & Professionalism - Domain-appropriate style
- Safety & Compliance - PII detection, bias, toxicity
- Domain-Specific Faithfulness - RAG evaluation with heavy hallucination penalties
Implementation: examples/llm_as_judge/
Complete Guide: docs/llm-as-judge/g-eval-framework.md
Quick Example:
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams
# Define custom evaluation
correctness = GEval(
name="Correctness",
evaluation_steps=[
"Check for factual contradictions",
"Penalize missing critical information",
"Accept paraphrasing and style differences"
],
evaluation_params=[
LLMTestCaseParams.ACTUAL_OUTPUT,
LLMTestCaseParams.EXPECTED_OUTPUT
],
threshold=0.7
)
# Evaluate
test_case = LLMTestCase(
input="What is Python?",
actual_output="Python is a high-level programming language.",
expected_output="Python is an interpreted, high-level programming language."
)
correctness.measure(test_case)
print(f"Score: {correctness.score}") # 0-1 scaleThe SCORE framework (NVIDIA 2025) evaluates model consistency alongside accuracy, providing insights into reliability.
| Metric | What it Measures | Use Case |
|---|---|---|
| Consistency Rate (CR@K) | If model gives same correct answer K times | Reliability assessment |
| Prompt Robustness | Stability across paraphrased prompts | Input variation handling |
| Sampling Robustness | Consistency under temperature changes | Deployment configuration |
| Order Robustness | Invariance to choice ordering | Multiple-choice tasks |
- Evaluating model reliability beyond accuracy
- Testing robustness to input variations
- Assessing deployment readiness
- Comparing model stability
Implementation: examples/consistency_robustness/score_framework.py
Documentation: docs/consistency-robustness/score-framework.md
from examples.consistency_robustness import SCOREEvaluator
evaluator = SCOREEvaluator(model=your_model)
metrics = evaluator.evaluate(test_cases)
# Compare accuracy vs consistency
print(f"Accuracy: {metrics.accuracy:.2%}")
print(f"Consistency Rate: {metrics.consistency_rate:.2%}")Ensemble-based methods for reliable confidence estimation.
What it measures: Consensus across multiple model generations.
Key Finding: Industry studies show strong positive correlation between majority voting confidence and actual accuracy, while "no clear correlation was found between logprob-based confidence score and accuracy."
Optimal Configuration: 4-7 diverse models (sweet spot for reliability vs. cost)
Implementation: examples/confidence_scoring/majority_voting.py
Documentation: docs/confidence-scoring/majority-voting.md
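A minimal majority-voting sketch: sample multiple generations, take the modal answer, and use its vote share as the confidence estimate.

```python
from collections import Counter

def majority_vote(answers: list[str]) -> tuple[str, float]:
    """Return (consensus answer, confidence = fraction of votes it received)."""
    normalized = [a.strip().lower() for a in answers]
    answer, votes = Counter(normalized).most_common(1)[0]
    return answer, votes / len(normalized)

samples = ["Paris", "Paris", "Lyon", "Paris", "paris"]
answer, confidence = majority_vote(samples)
print(f"Answer: {answer}, confidence: {confidence:.2f}")  # Answer: paris, confidence: 0.80
```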
What it does: Weight model votes by historical accuracy.
Weighting Strategy: Linear weights preferred (w_i = Accuracy_i) over exponential to maintain ensemble diversity.
Implementation: examples/confidence_scoring/weighted_ensemble.py
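A sketch of linear accuracy weighting as described above; each model's vote counts in proportion to its historical accuracy (model names and accuracy values here are illustrative):

```python
from collections import defaultdict

def weighted_vote(predictions: dict[str, str], accuracies: dict[str, float]) -> str:
    """predictions: model_name -> answer; accuracies: model_name -> historical accuracy."""
    scores = defaultdict(float)
    for model, answer in predictions.items():
        scores[answer.strip().lower()] += accuracies.get(model, 0.5)  # linear weight w_i = Accuracy_i
    return max(scores, key=scores.get)

preds = {"model_a": "Paris", "model_b": "Lyon", "model_c": "Paris"}
accs = {"model_a": 0.82, "model_b": 0.90, "model_c": 0.75}
print(weighted_vote(preds, accs))  # "paris" wins: 0.82 + 0.75 > 0.90
```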
What it solves: Aligns raw confidence scores with actual accuracy.
Goal: Expected Calibration Error (ECE) < 0.05 for production systems.
Implementation: examples/confidence_scoring/calibration.py
Complete Guide: docs/confidence-scoring/ensemble-methods.md
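A minimal Expected Calibration Error (ECE) sketch with equal-width confidence bins, useful for checking the ECE < 0.05 target above:

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins: int = 10) -> float:
    """Weighted average gap between mean confidence and accuracy per bin."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(confidences[mask].mean() - correct[mask].mean())
            ece += mask.mean() * gap  # weight by fraction of samples in bin
    return ece

ece = expected_calibration_error([0.9, 0.8, 0.95, 0.6], [1, 1, 0, 1])
print(f"ECE: {ece:.3f}")
```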
Methods for identifying fabricated or unsupported information.
How it works: Measures consistency across multiple samples from the same LLM. Factual statements remain consistent; hallucinations show high variance.
Why it's better: Unlike legacy NLP metrics (WER, METEOR), SelfCheckGPT is designed for Transformer-era LLMs and addresses hallucination problems that didn't exist in pre-Transformer systems.
Zero-resource: No external knowledge base required.
Implementation: examples/hallucination_detection/selfcheck_gpt.py
Documentation: docs/hallucination-detection/selfcheck-gpt.md
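A deliberately simplified consistency check in the spirit of SelfCheckGPT, using token overlap between a claim and resampled responses; the official implementation uses stronger NLI, QA, and LLM-prompt scorers.

```python
def token_overlap(a: str, b: str) -> float:
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / max(len(ta), 1)

def inconsistency_score(sentence: str, resamples: list[str]) -> float:
    """Higher score = less supported by resamples = higher hallucination risk."""
    support = [token_overlap(sentence, sample) for sample in resamples]
    return 1.0 - sum(support) / len(support)

claim = "The Eiffel Tower was completed in 1889."
resamples = [
    "The Eiffel Tower opened in 1889 for the World's Fair.",
    "Construction of the Eiffel Tower finished in 1889.",
    "The Eiffel Tower is located in Paris.",
]
print(f"Inconsistency: {inconsistency_score(claim, resamples):.2f}")
```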
Method: Identify low-confidence tokens as potential hallucinations.
Threshold: Typical cutoff at 0.3 probability for hallucination risk flagging.
Implementation: examples/hallucination_detection/logprobs_detection.py
Systematic methods for identifying unfair treatment across demographic groups.
Method: Test model responses with demographic identifiers varied systematically.
Example: Same resume with different names (e.g., "John" vs. "Jamal") to detect hiring bias.
Implementation: examples/bias_detection/correspondence.py
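A counterfactual (correspondence-style) probe sketch: send the same prompt with only the demographic identifier swapped and compare outcome rates. `call_model` is a placeholder for your own LLM client, and the template is illustrative.

```python
TEMPLATE = (
    "Resume: {name} has 5 years of Python experience and a CS degree. "
    "Should we invite {name} to interview? Answer yes or no."
)

def name_swap_probe(call_model, names=("John", "Jamal"), n_trials: int = 20) -> dict:
    """Return invite rate per name; large gaps indicate potential bias."""
    invite_rates = {}
    for name in names:
        prompt = TEMPLATE.format(name=name)
        answers = [call_model(prompt).strip().lower() for _ in range(n_trials)]
        invite_rates[name] = sum(a.startswith("yes") for a in answers) / n_trials
    return invite_rates

# rates = name_swap_probe(call_model=my_llm_client)
# print(rates)
```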
What it measures: Statistical evidence of bias using Bayesian inference.
Advantage: Quantifies uncertainty in bias detection, avoiding false positives from small sample sizes.
Implementation: examples/bias_detection/bayesian_testing.py
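A minimal sketch of a Beta-Binomial Bayes factor comparing "both groups share one positive-outcome rate" (H0) against "independent rates" (H1), assuming uniform Beta(1,1) priors. This is a generic Bayesian bias test, not the certified QuaCer-B procedure.

```python
from math import exp
from scipy.special import betaln

def bayes_factor_10(x1, n1, x2, n2, a=1.0, b=1.0) -> float:
    """BF > 1 favors 'different rates' (evidence of disparate treatment)."""
    log_m1 = (betaln(a + x1, b + n1 - x1) - betaln(a, b)) + \
             (betaln(a + x2, b + n2 - x2) - betaln(a, b))
    log_m0 = betaln(a + x1 + x2, b + (n1 - x1) + (n2 - x2)) - betaln(a, b)
    return exp(log_m1 - log_m0)

# e.g. group A invited 45/100 times, group B invited 30/100 times
print(f"BF_10: {bayes_factor_10(45, 100, 30, 100):.2f}")
```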
What it provides: Certified bounds on bias magnitude with statistical guarantees.
Use case: Regulatory compliance and high-stakes applications.
Documentation: docs/bias-detection/quacer-b.md
Complete Guide: docs/bias-detection/bias-methods.md
Retrieval-Augmented Generation requires specialized evaluation of both retrieval and generation components.
Retrieval Quality:
- Precision@k: Relevance of top-k retrieved documents
- Recall@k: Coverage of relevant documents in top-k
- MRR (Mean Reciprocal Rank): Position of first relevant document
- NDCG (Normalized Discounted Cumulative Gain): Graded relevance scoring
Generation Faithfulness:
- Groundedness: All claims supported by retrieval context
- Hallucination penalty: Severity weighting for fabricated information
- Attribution accuracy: Correct source citation
End-to-End:
- Answer correctness: Factual accuracy given context
- Completeness: Coverage of relevant information from context
- Conciseness: Avoiding unnecessary verbosity
Implementation: examples/rag_evaluation/
Documentation: docs/rag-evaluation/rag-metrics.md
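Minimal sketches of the retrieval-quality metrics listed above; `retrieved` is a ranked list of document IDs and `relevant` is the set of gold-relevant IDs.

```python
import math

def precision_at_k(retrieved, relevant, k):
    return sum(1 for d in retrieved[:k] if d in relevant) / k

def recall_at_k(retrieved, relevant, k):
    return sum(1 for d in retrieved[:k] if d in relevant) / max(len(relevant), 1)

def mrr(retrieved, relevant):
    for rank, d in enumerate(retrieved, start=1):
        if d in relevant:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(retrieved, relevance_grades, k):
    """relevance_grades: doc_id -> graded relevance (0 = not relevant)."""
    dcg = sum(relevance_grades.get(d, 0) / math.log2(i + 2)
              for i, d in enumerate(retrieved[:k]))
    ideal = sorted(relevance_grades.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(i + 2) for i, g in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0

retrieved = ["d3", "d1", "d7", "d2"]
relevant = {"d1", "d2"}
print(f"P@3: {precision_at_k(retrieved, relevant, 3):.2f}")  # 0.33
print(f"MRR: {mrr(retrieved, relevant):.2f}")                # 0.50
```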
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCaseParams
medical_faithfulness = GEval(
name="Medical Faithfulness",
evaluation_steps=[
"Extract medical claims from output",
"Verify each claim against clinical guidelines in context",
"Identify contradictions or unsupported claims",
"HEAVILY PENALIZE hallucinations that could cause patient harm",
"Emphasize clinical accuracy and safety"
],
evaluation_params=[
LLMTestCaseParams.ACTUAL_OUTPUT,
LLMTestCaseParams.RETRIEVAL_CONTEXT
],
threshold=0.9 # High threshold for medical
)
These metrics measure different aspects of code generation performance and serve different purposes.
| Metric | Definition | Formula | Use Case |
|---|---|---|---|
| Pass@k | At least one of k solutions passes | `1 - C(n-c,k)/C(n,k)` | Benchmark comparison |
| Pass^k | All k solutions pass | `p^k` | Reliability planning |
Use Pass@k for:
- Comparing models on benchmarks
- Reporting best-case performance
- Academic evaluation
Use Pass^k for:
- Planning system reliability
- Resource allocation
- SLA commitments
Implementation: examples/code_generation/pass_metrics.py
Documentation: docs/code-generation/pass-metrics-distinction.md
Implementation: examples/code_generation/pass_at_k.py
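For reference, sketches of both estimators: the standard unbiased pass@k estimator from the HumanEval paper and the simple pass^k product.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """P(at least one of k sampled solutions passes), given c of n samples passed."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

def pass_hat_k(p: float, k: int) -> float:
    """P(all k independent attempts pass), given per-attempt pass rate p."""
    return p ** k

n, c = 100, 60  # 60 of 100 sampled solutions passed the unit tests
print(f"pass@5 = {pass_at_k(n, c, 5):.3f}")     # optimistic: ~0.99
print(f"pass^5 = {pass_hat_k(c / n, 5):.3f}")   # reliability: ~0.078
```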
- Functional correctness: Unit test passage
- Code efficiency: Runtime and memory benchmarks
- Code style: PEP8, linting scores
- Security: Vulnerability scanning (Bandit, CodeQL)
Documentation: docs/code-generation/metrics.md
Evaluation challenges unique to autonomous and cooperative agents.
Challenge: Traditional metrics fail to capture dynamic, context-dependent agent behaviors.
Approach:
- Scenario-based testing: Predefined interaction sequences
- Trace analysis: Evaluate decision trees and communication patterns
- Goal achievement: Success rate on complex multi-step objectives
- Communication efficiency: Message volume vs. task complexity
- Role adherence: Agent specialization maintenance
- Conflict resolution: Time to consensus in disagreements
- Distributed explainability: Transparency across agent decisions
Implementation: examples/multi_agent/
- DeepEval - Comprehensive evaluation framework with G-Eval implementation and 14+ pre-built metrics
- Ragas - Specialized for RAG evaluation with reference-free metrics (faithfulness, relevance, context quality)
- TruLens - Custom feedback functions with LangChain/LlamaIndex integration
- Promptfoo - CLI tool for prompt testing with cost tracking and regression detection
- OpenAI Evals - Reference implementation with 100+ community-contributed evaluations
- LangCheck - Simple, composable evaluation metrics for LLM applications
- Athina AI - Configurable evaluation metrics with focus on reliability
- Langfuse - Open-source LLM engineering platform with tracing and prompt management
- Langwatch - Real-time quality monitoring with custom evaluators and cost analytics
- Arize Phoenix - OpenTelemetry-based observability with embedding visualization
- Opik - Self-hostable platform with dataset management and experiment tracking
- Helicone - Observability platform with request caching and rate limiting
- Weights & Biases - ML experiment tracking extended for LLM evaluation
- Braintrust - CI/CD for AI with regression testing and agent sandboxes
- LangSmith - LangChain's hosted platform for tracing and evaluation
- Confident AI - Production monitoring with scheduled evaluation suites
- HoneyHive - Enterprise platform with A/B testing and fine-tuning workflows
- Humanloop - Prompt engineering and evaluation with human-in-the-loop
- Galileo - ML observability extended for GenAI applications
- Amazon Bedrock Evaluations - AWS-native evaluation for foundation models
- Azure AI Foundry - Integrated with Azure OpenAI and Prompt Flow
- Vertex AI Evaluation - Google Cloud's evaluation service with custom rubrics
- OpenAI Platform - Built-in evaluation capabilities for GPT models
Detailed Comparison: tools-and-platforms.md
- MMLU - Massive Multitask Language Understanding across 57 subjects from STEM to humanities
- MMLU-Pro - Enhanced version with 10 choices per question, emphasizing reasoning over memorization
- BIG-bench - Beyond the Imitation Game with 200+ diverse tasks testing emergent capabilities
- HELM - Holistic evaluation across 42 scenarios and 7 metrics (accuracy, calibration, robustness, fairness, bias, toxicity, efficiency)
- AGIEval - Human-centric standardized exams (SAT, LSAT, GRE, GMAT)
- HellaSwag - Commonsense natural language inference with adversarial filtering
- WinoGrande - Large-scale Winograd schema challenge for commonsense reasoning
- ARC - AI2 Reasoning Challenge with grade-school science questions
- HumanEval - Function synthesis with 164 Python problems and unit tests
- MBPP - 974 crowd-sourced Python programming problems
- CodeContests - Competitive programming problems from Codeforces, TopCoder
- SWE-bench - Real GitHub issues requiring repository-level understanding
- DS-1000 - Data science problems across NumPy, Pandas, TensorFlow, PyTorch
- MATH - 12,500 competition mathematics problems with step-by-step solutions
- GSM8K - 8,500 grade school math word problems
- MINERVA - STEM problem-solving requiring quantitative reasoning
- ScienceQA - Multimodal science questions with explanations
- PubMedQA - Biomedical research question answering
- MedQA - Medical examination questions (USMLE style)
- BEIR - Heterogeneous benchmark for information retrieval across 18 datasets
- MTEB - Massive Text Embedding Benchmark with 8 tasks and 58 datasets
- MS MARCO - Machine reading comprehension with 1M+ real queries
- Natural Questions - Real Google search queries with Wikipedia answers
- MT-Bench - Multi-turn conversation quality assessment
- ChatbotArena - Crowd-sourced pairwise model comparisons
- DialogSum - Dialogue summarization dataset
- WMT - Annual shared tasks in machine translation
- FLORES-200 - Translation between 200 languages
- XL-Sum - Multilingual abstractive summarization
- AgentBench - Comprehensive agent evaluation across 8 environments
- GAIA - General AI assistants with real-world questions requiring tools
- WebArena - Autonomous web agents in realistic environments
- ToolBench - Tool learning with 16,000+ real-world APIs
- TruthfulQA - Measures whether models generate truthful answers to questions
- HaluEval - Hallucination evaluation across diverse tasks
- FActScore - Fine-grained atomic fact verification
- BBQ - Bias Benchmark for Question Answering in ambiguous contexts
- BOLD - Bias in Open-ended Language Generation Dataset
- WinoBias - Gender bias in coreference resolution
- ToxiGen - Large-scale machine-generated toxicity dataset
- RealToxicityPrompts - Toxic generation evaluation
- SafetyBench - Chinese and English safety evaluation
When standard benchmarks don't fit your needs:
- Define Clear Evaluation Criteria: Specify what success looks like
- Create Diverse Test Cases: Cover edge cases and failure modes
- Establish Ground Truth: Use expert annotations or automated validation
- Version Control: Track benchmark changes over time
- Statistical Rigor: Ensure sufficient sample size and significance testing
Guide: docs/creating-custom-benchmarks.md
# ❌ Less consistent (regenerates steps each time)
metric = GEval(criteria="Check for correctness", ...)
# ✅ More consistent (fixed procedure)
metric = GEval(
evaluation_steps=[
"Verify factual accuracy",
"Check for completeness",
"Assess clarity"
],
...
)
# Fit calibrator on validation set
calibrator = ConfidenceCalibrator()
calibrator.fit(validation_confidences, ground_truth)
# Apply to production
calibrated_score = calibrator.calibrate(raw_confidence)
# Monitor ECE < 0.05
from deepeval.tracing import observe
@observe(metrics=[retrieval_quality])
def retrieve(query):
# Retrieval logic
return documents
@observe(metrics=[generation_faithfulness])
def generate(query, documents):
# Generation logic
return answer
# Separate scores for retrieval vs. generation
# Medical application: high threshold, strict mode
medical_metric = GEval(
threshold=0.9,
strict_mode=True, # Binary: perfect or fail
...
)
# General chatbot: lower threshold, graded scoring
chatbot_metric = GEval(
threshold=0.7,
strict_mode=False,
...
)
# Validate Spearman ρ > 0.7 between confidence and accuracy
from scipy.stats import spearmanr

correlation, _ = spearmanr(confidences, accuracies)
if correlation < 0.7:
    print("⚠️ Confidence scores unreliable - recalibrate")
def route_for_review(calibrated_confidence):
if calibrated_confidence >= 0.85:
return "AUTO_PROCESS"
elif calibrated_confidence >= 0.60:
return "SPOT_CHECK" # 10% sampling
else:
return "HUMAN_REVIEW" # 100% review# Use cheaper models for less critical evals
fast_metric = GEval(
model="gpt-4o-mini", # ~10x cheaper than GPT-4o
...
)
# Cache repeated evaluations
from functools import lru_cache

@lru_cache(maxsize=1000)
def cached_evaluate(input_hash, output_hash):
    return metric.measure(test_case)
- LLM Evaluation: A Practical Guide - DeepLearning.AI course on evaluation fundamentals
- A Survey on Evaluation of LLMs - Comprehensive academic survey (2024)
- Holistic Evaluation of LLMs - Stanford's HELM methodology and learnings
- RAG Evaluation Guide - Best practices for evaluating retrieval-augmented generation
- Prompt Engineering Guide - Includes evaluation strategies for prompts
- G-Eval Paper (2023) - NLG evaluation using GPT-4 with chain-of-thought
- SCORE Framework (2025) - NVIDIA's consistency and robustness evaluation
- Judging LLM-as-a-Judge (2024) - Meta-evaluation of LLM judges
- Constitutional AI (2022) - Anthropic's approach to AI safety evaluation
- LangChain Evaluation - Evaluation chains and criteria
- OpenAI Cookbook - Practical examples including evaluation techniques
- Hugging Face Evaluate - Library for easily evaluating ML models and datasets
- Microsoft Promptflow - Evaluation flows for LLM applications
- Colab: LLM Evaluation Basics - Interactive introduction
- RAG Evaluation Notebook - Step-by-step RAG metrics
- Custom Metrics Creation - Building domain-specific evaluations
- r/MachineLearning - Academic ML discussions including evaluation
- Hugging Face Forums - Community discussions on model evaluation
- LangChain Discord - Active community for LLM application development
- AI Alignment Forum - Safety and alignment evaluation discussions
- NeurIPS Datasets and Benchmarks Track - Annual benchmark proposals
- EMNLP Evaluation Track - NLP evaluation methodologies
- ACL Workshop on Evaluation - Specialized evaluation workshops
- Awesome LLM - Comprehensive LLM resources
- Awesome RAG - RAG-specific tools and papers
- Awesome LLM Safety - Safety evaluation and alignment
- Awesome Production LLM - Production deployment including monitoring
- Open LLM Leaderboard - Hugging Face's model rankings
- LMSYS Chatbot Arena - Human preference rankings
- Big Code Leaderboard - Code generation benchmarks
- MTEB Leaderboard - Text embedding rankings
- State of AI Report - Annual industry overview including evaluation trends
- OpenAI System Card - GPT-4 evaluation methodology
- Anthropic Claude Evaluations - Constitutional AI evaluation approach
- Google PaLM Technical Report - Comprehensive evaluation across 150+ tasks
This evaluation guide is developed in support of research on Agentic AI Explainable-by-Design, focusing on:
- Multi-agent systems for ethical analysis of regulatory documents
- Behavior metrics in RAG systems applied to sensitive domains (healthcare, legal, financial)
- Interpretive auditing frameworks combining technical performance with human interpretability
- Global South perspectives on AI evaluation in contexts of limited infrastructure and linguistic diversity
Contributions are welcome! This guide aims to be a living resource for the AI evaluation community.
Ways to contribute:
- Add new evaluation methods with code examples
- Improve documentation clarity
- Report issues or inaccuracies
- Share production case studies
- Translate content (especially into Portuguese and Spanish for Latin American accessibility)
Please read CONTRIBUTING.md for detailed guidelines.
If you use this guide in your research or projects, please cite:
@misc{parreao2024awesome_ai_eval,
author = {Parreão, Hugo},
title = {Awesome AI Evaluation Guide: Implementation-Focused Methods for LLMs, RAG, and Agentic AI},
year = {2025},
publisher = {GitHub},
url = {https://github.com/hparreao/Awesome-AI-Evaluation-Guide}
}
This work is released under CC0 1.0 Universal (Public Domain). You are free to use, modify, and distribute this content without attribution, though attribution is appreciated.
Maintained by: Hugo Parreão | AI Engineering MSc
Contact: Open an issue or reach out via GitHub for questions, suggestions, or collaboration opportunities.