
Awesome AI Evaluation Guide

License: CC0-1.0 · Contributions Welcome · Python 3.8+

A comprehensive collection of evaluation methods, tools, and frameworks for assessing Large Language Models (LLMs), RAG systems, and AI agents in real-world applications.

About This Guide

This repository provides practical implementations and detailed guidance for evaluating AI systems, with a focus on understanding when and how to apply different evaluation methods. Unlike simple metric collections, we offer working code, mathematical foundations, and domain-specific considerations.

What Makes This Guide Different

  • Implementation-First: Every metric includes complete, tested code examples
  • Decision Frameworks: Clear tables and guides for metric selection
  • Mathematical Rigor: Understanding the theory behind each evaluation method
  • Domain Expertise: Tailored approaches for medical, legal, financial, and other specialized applications
  • System-Level Thinking: Evaluation of components and their interactions

Key Research Insights

| Concept | Finding | Source | Application |
|---|---|---|---|
| Consistency vs Accuracy | Models can have high accuracy but low consistency | SCORE (NVIDIA 2025) | Evaluate reliability alongside correctness |
| Pass@k vs Pass^k | The metrics measure different aspects (optimistic vs. reliable) | Code generation research | Choose based on deployment needs |
| Confidence Scoring | Ensemble methods correlate better with accuracy than logprobs | Industry studies | Use majority voting for confidence |
| Component Interaction | System performance ≠ sum of component performance | RAG research | Evaluate end-to-end and per-component |
| Bias Detection | Bayes factors superior to p-values | QuaCer-B research | Use Bayesian methods for statistical rigor |

Metric Selection Guide

Quick Decision Table

| Task Type | Primary Metrics | Secondary Metrics | Key Considerations |
|---|---|---|---|
| Text Generation | Perplexity, G-Eval | BLEU, ROUGE | Need reference texts for BLEU/ROUGE |
| Question Answering | Answer Correctness, Faithfulness | BERTScore, Exact Match | Domain expertise affects threshold |
| Code Generation | Pass@k (benchmarks), Pass^k (reliability) | Syntax validity, Security | Pass@k ≠ Pass^k for planning |
| RAG Systems | Faithfulness, Context Relevance | Precision@k, NDCG | Evaluate retrieval and generation separately |
| Translation | BLEU, METEOR | BERTScore, Human eval | BLEU has known limitations |
| Summarization | ROUGE, Relevance | Coherence, Consistency | ROUGE may miss semantic equivalence |
| Dialogue | Coherence, Engagement | Response diversity | Context window important |
| Multi-Agent | Task completion, Coordination | Communication efficiency | System-level metrics needed |

Domain-Specific Thresholds

| Domain | Metric Type | Typical Threshold | Rationale |
|---|---|---|---|
| Medical | Faithfulness | > 0.9 | Patient safety critical |
| Legal | Factual accuracy | > 0.95 | Regulatory compliance |
| Financial | Numerical precision | > 0.98 | Monetary implications |
| Customer Support | Response relevance | > 0.7 | User satisfaction |
| Creative Writing | Diversity score | > 0.6 | Avoid repetition |
| Education | Answer correctness | > 0.85 | Learning outcomes |

Getting Started

Prerequisites

  • Python 3.8+ for evaluation frameworks
  • Node.js 14+ for JavaScript-based tools (optional)
  • API keys for LLM providers (OpenAI, Anthropic, etc.)
  • Docker for self-hosted solutions (optional)

Installation

# Clone this repository
git clone https://github.com/hparreao/Awesome-AI-Evaluation-Guide.git
cd Awesome-AI-Evaluation-Guide

# Install core dependencies
pip install -r requirements.txt

# Optional: Install specific evaluation frameworks
pip install deepeval ragas langfuse trulens-eval  # Python frameworks
npm install -g promptfoo  # JavaScript CLI tool (optional)

Quick Examples

Basic LLM Evaluation

from examples.llm_as_judge import evaluate_response

result = evaluate_response(
    question="What is the capital of France?",
    response="Paris is the capital of France.",
    criteria="factual_accuracy"
)
print(f"Score: {result.score}, Reasoning: {result.reasoning}")

RAG Pipeline Evaluation

from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy

# Evaluate your RAG system
result = evaluate(
    dataset=your_test_data,
    metrics=[faithfulness, answer_relevancy]
)
print(f"Faithfulness: {result['faithfulness']:.2f}")

Consistency Testing (SCORE Framework)

from examples.consistency_robustness.score_framework import SCOREEvaluator

evaluator = SCOREEvaluator(model=your_model, k=5)
metrics = evaluator.evaluate(test_cases)
print(f"Accuracy: {metrics.accuracy:.2f}, Consistency: {metrics.consistency_rate:.2f}")

Evaluation Metrics

Traditional Metrics

Foundational metrics from NLP research, adapted for LLM evaluation.

Perplexity

What it measures: Model uncertainty in predicting the next token. Lower values indicate better performance.

When to use:

  • Comparing language models on the same task
  • Pre-training evaluation
  • Domain adaptation assessment
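
For reference, perplexity can be computed directly from per-token log probabilities; a minimal sketch, assuming natural-log probabilities as returned by most APIs' logprobs fields:

import math

def perplexity(token_logprobs):
    """Perplexity from per-token natural-log probabilities of a sequence."""
    # PPL = exp(-(1/N) * sum(log p_i)); lower is better
    avg_neg_logprob = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(avg_neg_logprob)

# Example: a fairly confident 4-token continuation
print(perplexity([-0.1, -0.4, -0.2, -0.3]))  # ≈ 1.28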

Implementation: examples/traditional_metrics/perplexity.py

Documentation: docs/traditional-metrics/perplexity.md

BLEU Score

What it measures: Precision-based n-gram overlap between generated and reference text.

When to use:

  • Machine translation evaluation
  • Text generation with reference outputs
  • Paraphrase quality assessment

Limitations:

  • Doesn't account for semantic similarity
  • Biased toward shorter outputs
  • Requires reference text
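
A minimal corpus-level sketch using the sacrebleu package (the example sentences are illustrative; the repository's own implementation is linked below):

import sacrebleu  # pip install sacrebleu

hypotheses = ["The cat sat on the mat."]
references = [["The cat is sitting on the mat."]]  # one list per reference stream

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(f"BLEU: {bleu.score:.2f}")  # 0-100 scale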

Implementation: examples/traditional_metrics/bleu_score.py

Documentation: docs/traditional-metrics/bleu-score.md

ROUGE Score

What it measures: Recall-oriented n-gram overlap, primarily for summarization.

When to use:

  • Summarization tasks
  • Content coverage assessment
  • Information preservation evaluation
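
A minimal sketch using the rouge-score package (example texts are illustrative):

from rouge_score import rouge_scorer  # pip install rouge-score

scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
scores = scorer.score(
    target="The quick brown fox jumps over the lazy dog.",   # reference summary
    prediction="A quick brown fox jumped over a lazy dog.",  # generated summary
)
print(scores["rougeL"].fmeasure)  # precision, recall, and F1 are all available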

Implementation: examples/traditional_metrics/rouge_score.py

Documentation: docs/traditional-metrics/rouge-score.md


Probability-Based Metrics

Leverage model confidence through token probabilities.

Logprobs Analysis

What it measures: Log probabilities for each generated token.

Applications:

  • Hallucination detection (low probability = potential hallucination)
  • Confidence estimation
  • Classification with uncertainty quantification

Key Finding: OpenAI research shows logprobs enable reliable confidence scoring for classification tasks.
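
A minimal sketch of pulling token logprobs from the OpenAI Chat Completions API and flagging low-probability tokens (the model name and 0.3 cutoff are illustrative; adapt field access for other providers):

import math
from openai import OpenAI

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Who wrote 'Dom Casmurro'?"}],
    logprobs=True,
    top_logprobs=3,
)

# Flag tokens whose probability falls below a chosen cutoff
for token_info in response.choices[0].logprobs.content:
    prob = math.exp(token_info.logprob)
    if prob < 0.3:
        print(f"Low-confidence token: {token_info.token!r} (p={prob:.2f})")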

Implementation: examples/probability_based/logprobs.py

Documentation: docs/probability-based/logprobs.md

Top-k Token Analysis

What it measures: Distribution of top-k most probable tokens at each position.

Applications:

  • Diversity assessment
  • Uncertainty quantification
  • Alternative generation paths exploration

Implementation: examples/probability_based/topk_analysis.py

Documentation: docs/probability-based/topk-analysis.md


LLM-as-a-Judge

Use LLMs to evaluate LLM outputs based on custom criteria.

G-Eval Framework

What it is: Chain-of-thought (CoT) based evaluation using LLMs with token probability normalization.

Why it works:

  • Better human alignment than traditional metrics
  • Flexible custom criteria
  • Token probability weighting reduces bias

Production Scale: DeepEval processes 10M+ G-Eval metrics monthly.

Core Use Cases:

  1. Answer Correctness - Validate factual accuracy
  2. Coherence & Clarity - Assess text quality without references
  3. Tonality & Professionalism - Domain-appropriate style
  4. Safety & Compliance - PII detection, bias, toxicity
  5. Domain-Specific Faithfulness - RAG evaluation with heavy hallucination penalties

Implementation: examples/llm_as_judge/

Complete Guide: docs/llm-as-judge/g-eval-framework.md

Quick Example:

from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams

# Define custom evaluation
correctness = GEval(
    name="Correctness",
    evaluation_steps=[
        "Check for factual contradictions",
        "Penalize missing critical information",
        "Accept paraphrasing and style differences"
    ],
    evaluation_params=[
        LLMTestCaseParams.ACTUAL_OUTPUT,
        LLMTestCaseParams.EXPECTED_OUTPUT
    ],
    threshold=0.7
)

# Evaluate
test_case = LLMTestCase(
    input="What is Python?",
    actual_output="Python is a high-level programming language.",
    expected_output="Python is an interpreted, high-level programming language."
)

correctness.measure(test_case)
print(f"Score: {correctness.score}")  # 0-1 scale

Modern Metrics

Consistency & Robustness (SCORE)

The SCORE framework (NVIDIA 2025) evaluates model consistency alongside accuracy, providing insights into reliability.

Components of SCORE

| Metric | What it Measures | Use Case |
|---|---|---|
| Consistency Rate (CR@K) | Whether the model gives the same correct answer across K samples | Reliability assessment |
| Prompt Robustness | Stability across paraphrased prompts | Input variation handling |
| Sampling Robustness | Consistency under temperature changes | Deployment configuration |
| Order Robustness | Invariance to choice ordering | Multiple-choice tasks |
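
A minimal sketch of the strictest of these, CR@K: the fraction of questions answered correctly on every one of K independent samples (sample_fn is a placeholder for your model call):

def consistency_rate_at_k(questions, gold_answers, sample_fn, k=5):
    """CR@K: share of questions answered correctly in all K independent samples."""
    fully_consistent = 0
    for question, gold in zip(questions, gold_answers):
        samples = [sample_fn(question) for _ in range(k)]
        if all(s.strip().lower() == gold.strip().lower() for s in samples):
            fully_consistent += 1
    return fully_consistent / len(questions)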

When to Use SCORE

  • Evaluating model reliability beyond accuracy
  • Testing robustness to input variations
  • Assessing deployment readiness
  • Comparing model stability

Implementation: examples/consistency_robustness/score_framework.py

Documentation: docs/consistency-robustness/score-framework.md

from examples.consistency_robustness import SCOREEvaluator

evaluator = SCOREEvaluator(model=your_model)
metrics = evaluator.evaluate(test_cases)

# Compare accuracy vs consistency
print(f"Accuracy: {metrics.accuracy:.2%}")
print(f"Consistency Rate: {metrics.consistency_rate:.2%}")

Confidence Scoring

Ensemble-based methods for reliable confidence estimation.

Majority Voting

What it measures: Consensus across multiple model generations.

Key Finding: Industry studies show strong positive correlation between majority voting confidence and actual accuracy, while "no clear correlation was found between logprob-based confidence score and accuracy."

Optimal Configuration: 4-7 diverse models (sweet spot for reliability vs. cost)
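
A minimal sketch of majority-vote confidence over k sampled answers (the string normalization here is deliberately naive; in practice it is task-specific):

from collections import Counter

def majority_vote(answers):
    """Return the consensus answer and its vote share as a confidence proxy."""
    counts = Counter(a.strip().lower() for a in answers)  # naive normalization
    answer, votes = counts.most_common(1)[0]
    return answer, votes / len(answers)

answer, confidence = majority_vote(["Paris", "Paris", "paris", "Lyon", "Paris"])
print(answer, confidence)  # paris 0.8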

Implementation: examples/confidence_scoring/majority_voting.py

Documentation: docs/confidence-scoring/majority-voting.md

Weighted Ensemble

What it does: Weight model votes by historical accuracy.

Weighting Strategy: Linear weights (w_i = Accuracy_i) are preferred over exponential weighting, which would let a single strong model dominate and erode ensemble diversity.
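
A minimal sketch of linearly weighted voting with w_i = Accuracy_i (model names and accuracies are hypothetical):

from collections import defaultdict

def weighted_vote(predictions, accuracies):
    """predictions/accuracies: dicts keyed by model name; weights w_i = Accuracy_i."""
    scores = defaultdict(float)
    for model, answer in predictions.items():
        scores[answer] += accuracies[model]
    answer = max(scores, key=scores.get)
    return answer, scores[answer] / sum(accuracies.values())

answer, confidence = weighted_vote(
    predictions={"model_a": "42", "model_b": "42", "model_c": "41"},
    accuracies={"model_a": 0.82, "model_b": 0.74, "model_c": 0.65},
)
print(answer, round(confidence, 2))  # 42 0.71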

Implementation: examples/confidence_scoring/weighted_ensemble.py

Calibration (Platt Scaling)

What it solves: Aligns raw confidence scores with actual accuracy.

Goal: Expected Calibration Error (ECE) < 0.05 for production systems.
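
A minimal Platt-scaling sketch using scikit-learn's LogisticRegression, fit on raw confidences versus observed correctness (the validation data shown is hypothetical):

import numpy as np
from sklearn.linear_model import LogisticRegression

# Raw confidences and whether the corresponding answers were actually correct
raw_conf = np.array([0.55, 0.62, 0.71, 0.80, 0.88, 0.93]).reshape(-1, 1)
correct = np.array([0, 0, 1, 1, 1, 1])

platt = LogisticRegression().fit(raw_conf, correct)
calibrated = platt.predict_proba(np.array([[0.75]]))[:, 1]
print(f"Calibrated confidence: {calibrated[0]:.2f}")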

Implementation: examples/confidence_scoring/calibration.py

Complete Guide: docs/confidence-scoring/ensemble-methods.md


Hallucination Detection

Methods for identifying fabricated or unsupported information.

SelfCheckGPT

How it works: Measures consistency across multiple samples from the same LLM. Factual statements remain consistent; hallucinations show high variance.

Why it's better: Unlike legacy NLP metrics (WER, METEOR), SelfCheckGPT was designed for Transformer-era LLMs and targets open-ended hallucination, a failure mode that surface-overlap metrics were never built to detect.

Zero-resource: No external knowledge base required.
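
A library-free sketch of the core idea, using token overlap as a crude stand-in for the NLI/QA scorers used in the original paper:

def inconsistency_score(claim, samples):
    """Score in [0, 1]: low = claim is supported across samples, high = potential hallucination."""
    claim_tokens = set(claim.lower().split())
    support = [
        len(claim_tokens & set(s.lower().split())) / len(claim_tokens)
        for s in samples
    ]
    # High disagreement across resampled responses suggests a possible hallucination
    return 1.0 - sum(support) / len(support)

samples = [
    "Machado de Assis wrote Dom Casmurro in 1899.",
    "Dom Casmurro was written by Machado de Assis.",
    "Dom Casmurro is an 1899 novel by Machado de Assis.",
]
print(inconsistency_score("Dom Casmurro was written by Machado de Assis.", samples))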

Implementation: examples/hallucination_detection/selfcheck_gpt.py

Documentation: docs/hallucination-detection/selfcheck-gpt.md

Logprobs-based Detection

Method: Identify low-confidence tokens as potential hallucinations.

Threshold: Typical cutoff at 0.3 probability for hallucination risk flagging.
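
A minimal sketch of the flagging rule over per-token probabilities (the 0.3 cutoff is a starting point to tune per domain):

def flag_hallucination_risk(token_probs, threshold=0.3):
    """Return indices of tokens whose probability falls below the risk threshold."""
    return [i for i, p in enumerate(token_probs) if p < threshold]

risky = flag_hallucination_risk([0.91, 0.88, 0.12, 0.95, 0.27])
print(risky)  # [2, 4]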

Implementation: examples/hallucination_detection/logprobs_detection.py


Bias Detection

Systematic methods for identifying unfair treatment across demographic groups.

Correspondence Experiments

Method: Test model responses with demographic identifiers varied systematically.

Example: Same resume with different names (e.g., "John" vs. "Jamal") to detect hiring bias.
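
A minimal sketch of a correspondence test: hold everything fixed, vary only the name, and compare outcome rates (the prompt template, model_callback, and name lists are placeholders):

TEMPLATE = (
    "Resume: 10 years of backend experience, BSc in CS. "
    "Candidate name: {name}. Should we invite them to interview? Answer yes or no."
)

def correspondence_rates(model_callback, group_a_names, group_b_names):
    """Return the 'yes' rate per group; a large gap signals potential bias."""
    def yes_rate(names):
        answers = [model_callback(TEMPLATE.format(name=n)) for n in names]
        return sum("yes" in a.lower() for a in answers) / len(answers)
    return yes_rate(group_a_names), yes_rate(group_b_names)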

Implementation: examples/bias_detection/correspondence.py

Bayesian Hypothesis Testing

What it measures: Statistical evidence of bias using Bayesian inference.

Advantage: Quantifies uncertainty in bias detection, avoiding false positives from small sample sizes.
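
A minimal Beta-Binomial sketch of the idea: a Bayes factor comparing "the two groups have different rates" (H1) against "they share one rate" (H0) under uniform Beta(1, 1) priors. This illustrates the general construction, not the QuaCer-B procedure itself:

import math
from scipy.special import betaln

def bayes_factor_two_proportions(x1, n1, x2, n2):
    """BF10 for H1 (different success rates) vs H0 (shared rate), Beta(1, 1) priors."""
    log_m1 = betaln(1 + x1, 1 + n1 - x1) + betaln(1 + x2, 1 + n2 - x2)
    log_m0 = betaln(1 + x1 + x2, 1 + n1 + n2 - x1 - x2)
    return math.exp(log_m1 - log_m0)

# e.g. 42/100 positive outcomes for group A vs. 28/100 for group B
print(bayes_factor_two_proportions(42, 100, 28, 100))  # BF10 > 1 favors a group difference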

Implementation: examples/bias_detection/bayesian_testing.py

QuaCer-B Certification

What it provides: Certified bounds on bias magnitude with statistical guarantees.

Use case: Regulatory compliance and high-stakes applications.

Documentation: docs/bias-detection/quacer-b.md

Complete Guide: docs/bias-detection/bias-methods.md


Domain-Specific Evaluation

RAG Systems

Retrieval-Augmented Generation requires specialized evaluation of both retrieval and generation components.

Component-Level Metrics

Retrieval Quality:

  • Precision@k: Relevance of top-k retrieved documents
  • Recall@k: Coverage of relevant documents in top-k
  • MRR (Mean Reciprocal Rank): Position of first relevant document
  • NDCG (Normalized Discounted Cumulative Gain): Graded relevance scoring
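
A minimal sketch of these ranking metrics for binary relevance labels over a ranked list of retrieved document IDs (the example lists are illustrative):

import math

def precision_at_k(retrieved, relevant, k):
    return sum(1 for doc in retrieved[:k] if doc in relevant) / k

def recall_at_k(retrieved, relevant, k):
    return sum(1 for doc in retrieved[:k] if doc in relevant) / len(relevant)

def mrr(retrieved, relevant):
    for rank, doc in enumerate(retrieved, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(retrieved, relevant, k):
    dcg = sum(1.0 / math.log2(i + 1) for i, doc in enumerate(retrieved[:k], start=1) if doc in relevant)
    ideal = sum(1.0 / math.log2(i + 1) for i in range(1, min(k, len(relevant)) + 1))
    return dcg / ideal if ideal else 0.0

retrieved, relevant = ["d3", "d7", "d1", "d9"], {"d1", "d3"}
print(precision_at_k(retrieved, relevant, 3), mrr(retrieved, relevant))  # 0.67 1.0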

Generation Faithfulness:

  • Groundedness: All claims supported by retrieval context
  • Hallucination penalty: Severity weighting for fabricated information
  • Attribution accuracy: Correct source citation

End-to-End:

  • Answer correctness: Factual accuracy given context
  • Completeness: Coverage of relevant information from context
  • Conciseness: Avoiding unnecessary verbosity

Implementation: examples/rag_evaluation/

Documentation: docs/rag-evaluation/rag-metrics.md

Example: Medical RAG Evaluation

from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCaseParams

medical_faithfulness = GEval(
    name="Medical Faithfulness",
    evaluation_steps=[
        "Extract medical claims from output",
        "Verify each claim against clinical guidelines in context",
        "Identify contradictions or unsupported claims",
        "HEAVILY PENALIZE hallucinations that could cause patient harm",
        "Emphasize clinical accuracy and safety"
    ],
    evaluation_params=[
        LLMTestCaseParams.ACTUAL_OUTPUT,
        LLMTestCaseParams.RETRIEVAL_CONTEXT
    ],
    threshold=0.9  # High threshold for medical
)

Code Generation

Understanding Pass@k vs Pass^k

These metrics measure different aspects of code generation performance and serve different purposes.

Metric Comparison

| Metric | Definition | Formula | Use Case |
|---|---|---|---|
| Pass@k | At least one of k solutions passes | 1 - C(n-c, k) / C(n, k) | Benchmark comparison |
| Pass^k | All k solutions pass | p^k | Reliability planning |
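
A minimal sketch of both estimators, with pass_at_k following the unbiased formula in the table (n samples generated, c of which passed):

from math import comb

def pass_at_k(n, c, k):
    """P(at least one of k sampled solutions passes), from n samples with c passing."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

def pass_pow_k(p, k):
    """P(all k independent attempts pass), given single-attempt pass rate p."""
    return p ** k

# Same underlying 60% pass rate, very different readings
print(pass_at_k(n=10, c=6, k=3))  # ≈ 0.97 - optimistic, benchmark view
print(pass_pow_k(p=0.6, k=3))     # ≈ 0.22 - reliability view
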
When to Use Each

Use Pass@k for:

  • Comparing models on benchmarks
  • Reporting best-case performance
  • Academic evaluation

Use Pass^k for:

  • Planning system reliability
  • Resource allocation
  • SLA commitments

Implementation: examples/code_generation/pass_metrics.py and examples/code_generation/pass_at_k.py

Documentation: docs/code-generation/pass-metrics-distinction.md

Code Quality Metrics

  • Functional correctness: Unit test passage
  • Code efficiency: Runtime and memory benchmarks
  • Code style: PEP8, linting scores
  • Security: Vulnerability scanning (Bandit, CodeQL)

Documentation: docs/code-generation/metrics.md


Multi-Agent Systems

Evaluation challenges unique to autonomous and cooperative agents.

Emergent Behavior Assessment

Challenge: Traditional metrics fail to capture dynamic, context-dependent agent behaviors.

Approach:

  1. Scenario-based testing: Predefined interaction sequences
  2. Trace analysis: Evaluate decision trees and communication patterns
  3. Goal achievement: Success rate on complex multi-step objectives

Coordination Metrics

  • Communication efficiency: Message volume vs. task complexity
  • Role adherence: Agent specialization maintenance
  • Conflict resolution: Time to consensus in disagreements
  • Distributed explainability: Transparency across agent decisions

Implementation: examples/multi_agent/


Tools & Platforms

Open Source Frameworks

Evaluation Libraries

  • DeepEval - Comprehensive evaluation framework with G-Eval implementation and 14+ pre-built metrics
  • Ragas - Specialized for RAG evaluation with reference-free metrics (faithfulness, relevance, context quality)
  • TruLens - Custom feedback functions with LangChain/LlamaIndex integration
  • Promptfoo - CLI tool for prompt testing with cost tracking and regression detection
  • OpenAI Evals - Reference implementation with 100+ community-contributed evaluations
  • LangCheck - Simple, composable evaluation metrics for LLM applications
  • Athina AI - Configurable evaluation metrics with focus on reliability

Observability & Monitoring

  • Langfuse - Open-source LLM engineering platform with tracing and prompt management
  • Langwatch - Real-time quality monitoring with custom evaluators and cost analytics
  • Arize Phoenix - OpenTelemetry-based observability with embedding visualization
  • Opik - Self-hostable platform with dataset management and experiment tracking
  • Helicone - Observability platform with request caching and rate limiting
  • Weights & Biases - ML experiment tracking extended for LLM evaluation

Commercial Solutions

Evaluation Platforms

  • Braintrust - CI/CD for AI with regression testing and agent sandboxes
  • LangSmith - LangChain's hosted platform for tracing and evaluation
  • Confident AI - Production monitoring with scheduled evaluation suites
  • HoneyHive - Enterprise platform with A/B testing and fine-tuning workflows
  • Humanloop - Prompt engineering and evaluation with human-in-the-loop
  • Galileo - ML observability extended for GenAI applications

Cloud Services

Detailed Comparison: tools-and-platforms.md


Benchmarks & Datasets

General Language Understanding

Knowledge & Reasoning

  • MMLU - Massive Multitask Language Understanding across 57 subjects from STEM to humanities
  • MMLU-Pro - Enhanced version with 10 choices per question, emphasizing reasoning over memorization
  • BIG-bench - Beyond the Imitation Game with 200+ diverse tasks testing emergent capabilities
  • HELM - Holistic evaluation across 42 scenarios and 7 metrics (accuracy, calibration, robustness, fairness, bias, toxicity, efficiency)
  • AGIEval - Human-centric standardized exams (SAT, LSAT, GRE, GMAT)

Common Sense & World Knowledge

  • HellaSwag - Commonsense natural language inference with adversarial filtering
  • WinoGrande - Large-scale Winograd schema challenge for commonsense reasoning
  • ARC - AI2 Reasoning Challenge with grade-school science questions

Domain-Specific Benchmarks

Code Generation

  • HumanEval - Function synthesis with 164 Python problems and unit tests
  • MBPP - 974 crowd-sourced Python programming problems
  • CodeContests - Competitive programming problems from Codeforces, TopCoder
  • SWE-bench - Real GitHub issues requiring repository-level understanding
  • DS-1000 - Data science problems across NumPy, Pandas, TensorFlow, PyTorch

Mathematics

  • MATH - 12,500 competition mathematics problems with step-by-step solutions
  • GSM8K - 8,500 grade school math word problems
  • MINERVA - STEM problem-solving requiring quantitative reasoning

Scientific Understanding

  • ScienceQA - Multimodal science questions with explanations
  • PubMedQA - Biomedical research question answering
  • MedQA - Medical examination questions (USMLE style)

Retrieval & RAG

  • BEIR - Heterogeneous benchmark for information retrieval across 18 datasets
  • MTEB - Massive Text Embedding Benchmark with 8 tasks and 58 datasets
  • MS MARCO - Machine reading comprehension with 1M+ real queries
  • Natural Questions - Real Google search queries with Wikipedia answers

Task-Specific Benchmarks

Dialogue & Conversation

  • MT-Bench - Multi-turn conversation quality assessment
  • ChatbotArena - Crowd-sourced pairwise model comparisons
  • DialogSum - Dialogue summarization dataset

Translation & Multilingual

  • WMT - Annual shared tasks in machine translation
  • FLORES-200 - Translation between 200 languages
  • XL-Sum - Multilingual abstractive summarization

Agent & Tool Use

  • AgentBench - Comprehensive agent evaluation across 8 environments
  • GAIA - General AI assistants with real-world questions requiring tools
  • WebArena - Autonomous web agents in realistic environments
  • ToolBench - Tool learning with 16,000+ real-world APIs

Safety & Alignment Benchmarks

Truthfulness & Hallucination

  • TruthfulQA - Measures whether models generate truthful answers to questions
  • HaluEval - Hallucination evaluation across diverse tasks
  • FActScore - Fine-grained atomic fact verification

Bias & Fairness

  • BBQ - Bias Benchmark for Question Answering in ambiguous contexts
  • BOLD - Bias in Open-ended Language Generation Dataset
  • WinoBias - Gender bias in coreference resolution

Safety & Toxicity

Creating Custom Benchmarks

When standard benchmarks don't fit your needs:

  1. Define Clear Evaluation Criteria: Specify what success looks like
  2. Create Diverse Test Cases: Cover edge cases and failure modes
  3. Establish Ground Truth: Use expert annotations or automated validation
  4. Version Control: Track benchmark changes over time
  5. Statistical Rigor: Ensure sufficient sample size and significance testing

Guide: docs/creating-custom-benchmarks.md


Production Best Practices

1. Use Evaluation Steps, Not Criteria

# ❌ Less consistent (regenerates steps each time)
metric = GEval(criteria="Check for correctness", ...)

# ✅ More consistent (fixed procedure)
metric = GEval(
    evaluation_steps=[
        "Verify factual accuracy",
        "Check for completeness",
        "Assess clarity"
    ],
    ...
)

2. Implement Calibration

# Fit calibrator on validation set
calibrator = ConfidenceCalibrator()
calibrator.fit(validation_confidences, ground_truth)

# Apply to production
calibrated_score = calibrator.calibrate(raw_confidence)

# Monitor ECE < 0.05

3. Use Component-Level Tracing

from deepeval.tracing import observe

@observe(metrics=[retrieval_quality])
def retrieve(query):
    # Retrieval logic
    return documents

@observe(metrics=[generation_faithfulness])
def generate(query, documents):
    # Generation logic
    return answer

# Separate scores for retrieval vs. generation

4. Set Domain-Appropriate Thresholds

# Medical application: high threshold, strict mode
medical_metric = GEval(
    threshold=0.9,
    strict_mode=True,  # Binary: perfect or fail
    ...
)

# General chatbot: lower threshold, graded scoring
chatbot_metric = GEval(
    threshold=0.7,
    strict_mode=False,
    ...
)

5. Monitor Confidence-Accuracy Correlation

from scipy.stats import spearmanr

# Validate Spearman ρ > 0.7 between confidence and correctness
correlation, p_value = spearmanr(confidences, accuracies)
if correlation < 0.7:
    print("⚠️ Confidence scores unreliable - recalibrate")

6. Implement Human-in-the-Loop Thresholds

def route_for_review(calibrated_confidence):
    if calibrated_confidence >= 0.85:
        return "AUTO_PROCESS"
    elif calibrated_confidence >= 0.60:
        return "SPOT_CHECK"  # 10% sampling
    else:
        return "HUMAN_REVIEW"  # 100% review

7. Cost Optimization

# Use cheaper models for less critical evals
fast_metric = GEval(
    model="gpt-4o-mini",  # ~10x cheaper than GPT-4o
    ...
)

# Cache repeated evaluations (hash the input/output pair to form a stable cache key)
from functools import lru_cache

@lru_cache(maxsize=1000)
def cached_evaluate(input_hash, output_hash):
    return metric.measure(test_case)

Resources

Learning Materials

Tutorials & Guides

Papers & Research

Implementation Examples

Code Repositories

Notebooks & Demos

Communities & Discussions

Forums & Groups

Conferences & Workshops

Related Collections

Awesome Lists

Benchmark Leaderboards

Industry Reports & Case Studies


Research Context

This evaluation guide is developed in support of research on Agentic AI Explainable-by-Design, focusing on:

  1. Multi-agent systems for ethical analysis of regulatory documents
  2. Behavior metrics in RAG systems applied to sensitive domains (healthcare, legal, financial)
  3. Interpretive auditing frameworks combining technical performance with human interpretability
  4. Global South perspectives on AI evaluation in contexts of limited infrastructure and linguistic diversity

Contributing

Contributions are welcome! This guide aims to be a living resource for the AI evaluation community.

Ways to contribute:

  • Add new evaluation methods with code examples
  • Improve documentation clarity
  • Report issues or inaccuracies
  • Share production case studies
  • Translate content (especially into Portuguese and Spanish, for accessibility across Latin America)

Please read CONTRIBUTING.md for detailed guidelines.


Citation

If you use this guide in your research or projects, please cite:

@misc{parreao2025awesome_ai_eval,
  author = {Parreão, Hugo},
  title = {Awesome AI Evaluation Guide: Implementation-Focused Methods for LLMs, RAG, and Agentic AI},
  year = {2025},
  publisher = {GitHub},
  url = {https://github.com/hparreao/Awesome-AI-Evaluation-Guide}
}

License

This work is released under CC0 1.0 Universal (Public Domain). You are free to use, modify, and distribute this content without attribution, though attribution is appreciated.


Maintained by: Hugo Parreão | AI Engineering MSc

Contact: Open an issue or reach out via GitHub for questions, suggestions, or collaboration opportunities.
