
Awesome AI Evaluation Guide

License: CC0-1.0 · Contributions Welcome · Python 3.8+

A comprehensive collection of evaluation methods, tools, and frameworks for assessing Large Language Models (LLMs), RAG systems, and AI agents in real-world applications.

About This Guide

This repository provides practical implementations and detailed guidance for evaluating AI systems, with a focus on understanding when and how to apply different evaluation methods. Unlike simple metric collections, we offer working code, mathematical foundations, and domain-specific considerations.

What Makes This Guide Different

  • Implementation-First: Every metric includes complete, tested code examples
  • Decision Frameworks: Clear tables and guides for metric selection
  • Mathematical Rigor: Understanding the theory behind each evaluation method
  • Domain Expertise: Tailored approaches for medical, legal, financial, and other specialized applications
  • System-Level Thinking: Evaluation of components and their interactions

Key Research Insights

| Concept | Finding | Source | Application |
|---|---|---|---|
| Consistency vs Accuracy | Models can have high accuracy but low consistency | SCORE (NVIDIA 2025) | Evaluate reliability alongside correctness |
| Pass@k vs Pass^k | The metrics measure different aspects (optimistic vs. reliable) | Code generation research | Choose based on deployment needs |
| Confidence Scoring | Ensemble methods correlate better with accuracy than logprobs | Industry studies | Use majority voting for confidence |
| Component Interaction | System performance ≠ sum of component performance | RAG research | Evaluate end-to-end and per-component |
| Bias Detection | Bayes factors superior to p-values | QuaCer-B research | Use Bayesian methods for statistical rigor |

Metric Selection Guide

Quick Decision Table

| Task Type | Primary Metrics | Secondary Metrics | Key Considerations |
|---|---|---|---|
| Text Generation | Perplexity, G-Eval | BLEU, ROUGE | Need reference texts for BLEU/ROUGE |
| Question Answering | Answer Correctness, Faithfulness | BERTScore, Exact Match | Domain expertise affects threshold |
| Code Generation | Pass@k (benchmarks), Pass^k (reliability) | Syntax validity, Security | Pass@k ≠ Pass^k for planning |
| RAG Systems | Faithfulness, Context Relevance | Precision@k, NDCG | Evaluate retrieval and generation separately |
| Translation | BLEU, METEOR | BERTScore, Human eval | BLEU has known limitations |
| Summarization | ROUGE, Relevance | Coherence, Consistency | ROUGE may miss semantic equivalence |
| Dialogue | Coherence, Engagement | Response diversity | Context window important |
| Multi-Agent | Task completion, Coordination | Communication efficiency | System-level metrics needed |

Domain-Specific Thresholds

| Domain | Metric Type | Typical Threshold | Rationale |
|---|---|---|---|
| Medical | Faithfulness | > 0.9 | Patient safety critical |
| Legal | Factual accuracy | > 0.95 | Regulatory compliance |
| Financial | Numerical precision | > 0.98 | Monetary implications |
| Customer Support | Response relevance | > 0.7 | User satisfaction |
| Creative Writing | Diversity score | > 0.6 | Avoid repetition |
| Education | Answer correctness | > 0.85 | Learning outcomes |

Getting Started

Prerequisites

  • Python 3.8+ for evaluation frameworks
  • Node.js 14+ for JavaScript-based tools (optional)
  • API keys for LLM providers (OpenAI, Anthropic, etc.)
  • Docker for self-hosted solutions (optional)

Installation

# Clone this repository
git clone https://github.com/hparreao/Awesome-AI-Evaluation-Guide.git
cd Awesome-AI-Evaluation-Guide

# Install core dependencies
pip install -r requirements.txt

# Optional: Install specific evaluation frameworks
pip install deepeval ragas langfuse trulens-eval  # Python frameworks
npm install -g promptfoo  # JavaScript CLI tool (optional)

Quick Examples

Basic LLM Evaluation

from examples.llm_as_judge import evaluate_response

result = evaluate_response(
    question="What is the capital of France?",
    response="Paris is the capital of France.",
    criteria="factual_accuracy"
)
print(f"Score: {result.score}, Reasoning: {result.reasoning}")

RAG Pipeline Evaluation

from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy

# Evaluate your RAG system
result = evaluate(
    dataset=your_test_data,
    metrics=[faithfulness, answer_relevancy]
)
print(f"Faithfulness: {result['faithfulness']:.2f}")

Consistency Testing (SCORE Framework)

from examples.consistency_robustness.score_framework import SCOREEvaluator

evaluator = SCOREEvaluator(model=your_model, k=5)
metrics = evaluator.evaluate(test_cases)
print(f"Accuracy: {metrics.accuracy:.2f}, Consistency: {metrics.consistency_rate:.2f}")

Evaluation Metrics

Traditional Metrics

Foundational metrics from NLP research, adapted for LLM evaluation.

Perplexity

What it measures: Model uncertainty in predicting the next token. Lower values indicate better performance.

When to use:

  • Comparing language models on the same task
  • Pre-training evaluation
  • Domain adaptation assessment
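
For reference, perplexity can be computed directly from per-token log probabilities; a minimal sketch, assuming natural-log probabilities as returned by most APIs' logprobs fields:

import math

def perplexity(token_logprobs):
    """Perplexity from per-token natural-log probabilities of a sequence."""
    # PPL = exp(-(1/N) * sum(log p_i)); lower is better
    avg_neg_logprob = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(avg_neg_logprob)

# Example: a fairly confident 4-token continuation
print(perplexity([-0.1, -0.4, -0.2, -0.3]))  # ≈ 1.28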

Implementation: examples/traditional_metrics/perplexity.py

Documentation: docs/traditional-metrics/perplexity.md

BLEU Score

What it measures: Precision-based n-gram overlap between generated and reference text.

When to use:

  • Machine translation evaluation
  • Text generation with reference outputs
  • Paraphrase quality assessment

Limitations:

  • Doesn't account for semantic similarity
  • Biased toward shorter outputs
  • Requires reference text
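
A minimal corpus-level sketch using the sacrebleu package (the example sentences are illustrative; the repository's own implementation is linked below):

import sacrebleu  # pip install sacrebleu

hypotheses = ["The cat sat on the mat."]
references = [["The cat is sitting on the mat."]]  # one list per reference stream

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(f"BLEU: {bleu.score:.2f}")  # 0-100 scale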

Implementation: examples/traditional_metrics/bleu_score.py

Documentation: docs/traditional-metrics/bleu-score.md

ROUGE Score

What it measures: Recall-oriented n-gram overlap, primarily for summarization.

When to use:

  • Summarization tasks
  • Content coverage assessment
  • Information preservation evaluation
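
A minimal sketch using the rouge-score package (example texts are illustrative):

from rouge_score import rouge_scorer  # pip install rouge-score

scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
scores = scorer.score(
    target="The quick brown fox jumps over the lazy dog.",   # reference summary
    prediction="A quick brown fox jumped over a lazy dog.",  # generated summary
)
print(scores["rougeL"].fmeasure)  # precision, recall, and F1 are all available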

Implementation: examples/traditional_metrics/rouge_score.py

Documentation: docs/traditional-metrics/rouge-score.md


Probability-Based Metrics

Leverage model confidence through token probabilities.

Logprobs Analysis

What it measures: Log probabilities for each generated token.

Applications:

  • Hallucination detection (low probability = potential hallucination)
  • Confidence estimation
  • Classification with uncertainty quantification

Key Finding: OpenAI research shows logprobs enable reliable confidence scoring for classification tasks.
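
A minimal sketch of pulling token logprobs from the OpenAI Chat Completions API and flagging low-probability tokens (the model name and 0.3 cutoff are illustrative; adapt field access for other providers):

import math
from openai import OpenAI

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Who wrote 'Dom Casmurro'?"}],
    logprobs=True,
    top_logprobs=3,
)

# Flag tokens whose probability falls below a chosen cutoff
for token_info in response.choices[0].logprobs.content:
    prob = math.exp(token_info.logprob)
    if prob < 0.3:
        print(f"Low-confidence token: {token_info.token!r} (p={prob:.2f})")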

Implementation: examples/probability_based/logprobs.py

Documentation: docs/probability-based/logprobs.md

Top-k Token Analysis

What it measures: Distribution of top-k most probable tokens at each position.

Applications:

  • Diversity assessment
  • Uncertainty quantification
  • Alternative generation paths exploration

Implementation: examples/probability_based/topk_analysis.py

Documentation: docs/probability-based/topk-analysis.md


LLM-as-a-Judge

Use LLMs to evaluate LLM outputs based on custom criteria.

G-Eval Framework

What it is: Chain-of-thought (CoT) based evaluation using LLMs with token probability normalization.

Why it works:

  • Better human alignment than traditional metrics
  • Flexible custom criteria
  • Token probability weighting reduces bias

Production Scale: DeepEval processes 10M+ G-Eval metrics monthly.

Core Use Cases:

  1. Answer Correctness - Validate factual accuracy
  2. Coherence & Clarity - Assess text quality without references
  3. Tonality & Professionalism - Domain-appropriate style
  4. Safety & Compliance - PII detection, bias, toxicity
  5. Domain-Specific Faithfulness - RAG evaluation with heavy hallucination penalties

Implementation: examples/llm_as_judge/

Complete Guide: docs/llm-as-judge/g-eval-framework.md

Quick Example:

from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams

# Define custom evaluation
correctness = GEval(
    name="Correctness",
    evaluation_steps=[
        "Check for factual contradictions",
        "Penalize missing critical information",
        "Accept paraphrasing and style differences"
    ],
    evaluation_params=[
        LLMTestCaseParams.ACTUAL_OUTPUT,
        LLMTestCaseParams.EXPECTED_OUTPUT
    ],
    threshold=0.7
)

# Evaluate
test_case = LLMTestCase(
    input="What is Python?",
    actual_output="Python is a high-level programming language.",
    expected_output="Python is an interpreted, high-level programming language."
)

correctness.measure(test_case)
print(f"Score: {correctness.score}")  # 0-1 scale

Modern Metrics

Consistency & Robustness (SCORE)

The SCORE framework (NVIDIA 2025) evaluates model consistency alongside accuracy, providing insights into reliability.

Components of SCORE

| Metric | What it Measures | Use Case |
|---|---|---|
| Consistency Rate (CR@K) | Whether the model gives the same correct answer across K samples | Reliability assessment |
| Prompt Robustness | Stability across paraphrased prompts | Input variation handling |
| Sampling Robustness | Consistency under temperature changes | Deployment configuration |
| Order Robustness | Invariance to choice ordering | Multiple-choice tasks |
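
A minimal sketch of the strictest of these, CR@K: the fraction of questions answered correctly on every one of K independent samples (sample_fn is a placeholder for your model call):

def consistency_rate_at_k(questions, gold_answers, sample_fn, k=5):
    """CR@K: share of questions answered correctly in all K independent samples."""
    fully_consistent = 0
    for question, gold in zip(questions, gold_answers):
        samples = [sample_fn(question) for _ in range(k)]
        if all(s.strip().lower() == gold.strip().lower() for s in samples):
            fully_consistent += 1
    return fully_consistent / len(questions)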

When to Use SCORE

  • Evaluating model reliability beyond accuracy
  • Testing robustness to input variations
  • Assessing deployment readiness
  • Comparing model stability

Implementation: examples/consistency_robustness/score_framework.py

Documentation: docs/consistency-robustness/score-framework.md

from examples.consistency_robustness import SCOREEvaluator

evaluator = SCOREEvaluator(model=your_model)
metrics = evaluator.evaluate(test_cases)

# Compare accuracy vs consistency
print(f"Accuracy: {metrics.accuracy:.2%}")
print(f"Consistency Rate: {metrics.consistency_rate:.2%}")

Confidence Scoring

Ensemble-based methods for reliable confidence estimation.

Majority Voting

What it measures: Consensus across multiple model generations.

Key Finding: Industry studies show strong positive correlation between majority voting confidence and actual accuracy, while "no clear correlation was found between logprob-based confidence score and accuracy."

Optimal Configuration: 4-7 diverse models (sweet spot for reliability vs. cost)
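
A minimal sketch of majority-vote confidence over k sampled answers (the string normalization here is deliberately naive; in practice it is task-specific):

from collections import Counter

def majority_vote(answers):
    """Return the consensus answer and its vote share as a confidence proxy."""
    counts = Counter(a.strip().lower() for a in answers)  # naive normalization
    answer, votes = counts.most_common(1)[0]
    return answer, votes / len(answers)

answer, confidence = majority_vote(["Paris", "Paris", "paris", "Lyon", "Paris"])
print(answer, confidence)  # paris 0.8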

Implementation: examples/confidence_scoring/majority_voting.py

Documentation: docs/confidence-scoring/majority-voting.md

Weighted Ensemble

What it does: Weight model votes by historical accuracy.

Weighting Strategy: Linear weights (w_i = Accuracy_i) are preferred over exponential weighting, which would let a single strong model dominate and erode ensemble diversity.
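
A minimal sketch of linearly weighted voting with w_i = Accuracy_i (model names and accuracies are hypothetical):

from collections import defaultdict

def weighted_vote(predictions, accuracies):
    """predictions/accuracies: dicts keyed by model name; weights w_i = Accuracy_i."""
    scores = defaultdict(float)
    for model, answer in predictions.items():
        scores[answer] += accuracies[model]
    answer = max(scores, key=scores.get)
    return answer, scores[answer] / sum(accuracies.values())

answer, confidence = weighted_vote(
    predictions={"model_a": "42", "model_b": "42", "model_c": "41"},
    accuracies={"model_a": 0.82, "model_b": 0.74, "model_c": 0.65},
)
print(answer, round(confidence, 2))  # 42 0.71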

Implementation: examples/confidence_scoring/weighted_ensemble.py

Calibration (Platt Scaling)

What it solves: Aligns raw confidence scores with actual accuracy.

Goal: Expected Calibration Error (ECE) < 0.05 for production systems.
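
A minimal Platt-scaling sketch using scikit-learn's LogisticRegression, fit on raw confidences versus observed correctness (the validation data shown is hypothetical):

import numpy as np
from sklearn.linear_model import LogisticRegression

# Raw confidences and whether the corresponding answers were actually correct
raw_conf = np.array([0.55, 0.62, 0.71, 0.80, 0.88, 0.93]).reshape(-1, 1)
correct = np.array([0, 0, 1, 1, 1, 1])

platt = LogisticRegression().fit(raw_conf, correct)
calibrated = platt.predict_proba(np.array([[0.75]]))[:, 1]
print(f"Calibrated confidence: {calibrated[0]:.2f}")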

Implementation: examples/confidence_scoring/calibration.py

Complete Guide: docs/confidence-scoring/ensemble-methods.md


Hallucination Detection

Methods for identifying fabricated or unsupported information.

SelfCheckGPT

How it works: Measures consistency across multiple samples from the same LLM. Factual statements remain consistent; hallucinations show high variance.

Why it's better: Unlike legacy NLP metrics (WER, METEOR), SelfCheckGPT was designed for Transformer-era LLMs and targets open-ended hallucination, a failure mode that surface-overlap metrics were never built to detect.

Zero-resource: No external knowledge base required.
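
A library-free sketch of the core idea, using token overlap as a crude stand-in for the NLI/QA scorers used in the original paper:

def inconsistency_score(claim, samples):
    """Score in [0, 1]: low = claim is supported across samples, high = potential hallucination."""
    claim_tokens = set(claim.lower().split())
    support = [
        len(claim_tokens & set(s.lower().split())) / len(claim_tokens)
        for s in samples
    ]
    # High disagreement across resampled responses suggests a possible hallucination
    return 1.0 - sum(support) / len(support)

samples = [
    "Machado de Assis wrote Dom Casmurro in 1899.",
    "Dom Casmurro was written by Machado de Assis.",
    "Dom Casmurro is an 1899 novel by Machado de Assis.",
]
print(inconsistency_score("Dom Casmurro was written by Machado de Assis.", samples))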

Implementation: examples/hallucination_detection/selfcheck_gpt.py

Documentation: docs/hallucination-detection/selfcheck-gpt.md

Logprobs-based Detection

Method: Identify low-confidence tokens as potential hallucinations.

Threshold: Typical cutoff at 0.3 probability for hallucination risk flagging.
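
A minimal sketch of the flagging rule over per-token probabilities (the 0.3 cutoff is a starting point to tune per domain):

def flag_hallucination_risk(token_probs, threshold=0.3):
    """Return indices of tokens whose probability falls below the risk threshold."""
    return [i for i, p in enumerate(token_probs) if p < threshold]

risky = flag_hallucination_risk([0.91, 0.88, 0.12, 0.95, 0.27])
print(risky)  # [2, 4]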

Implementation: examples/hallucination_detection/logprobs_detection.py


Bias Detection

Systematic methods for identifying unfair treatment across demographic groups.

Correspondence Experiments

Method: Test model responses with demographic identifiers varied systematically.

Example: Same resume with different names (e.g., "John" vs. "Jamal") to detect hiring bias.
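
A minimal sketch of a correspondence test: hold everything fixed, vary only the name, and compare outcome rates (the prompt template, model_callback, and name lists are placeholders):

TEMPLATE = (
    "Resume: 10 years of backend experience, BSc in CS. "
    "Candidate name: {name}. Should we invite them to interview? Answer yes or no."
)

def correspondence_rates(model_callback, group_a_names, group_b_names):
    """Return the 'yes' rate per group; a large gap signals potential bias."""
    def yes_rate(names):
        answers = [model_callback(TEMPLATE.format(name=n)) for n in names]
        return sum("yes" in a.lower() for a in answers) / len(answers)
    return yes_rate(group_a_names), yes_rate(group_b_names)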

Implementation: examples/bias_detection/correspondence.py

Bayesian Hypothesis Testing

What it measures: Statistical evidence of bias using Bayesian inference.

Advantage: Quantifies uncertainty in bias detection, avoiding false positives from small sample sizes.
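
A minimal Beta-Binomial sketch of the idea: a Bayes factor comparing "the two groups have different rates" (H1) against "they share one rate" (H0) under uniform Beta(1, 1) priors. This illustrates the general construction, not the QuaCer-B procedure itself:

import math
from scipy.special import betaln

def bayes_factor_two_proportions(x1, n1, x2, n2):
    """BF10 for H1 (different success rates) vs H0 (shared rate), Beta(1, 1) priors."""
    log_m1 = betaln(1 + x1, 1 + n1 - x1) + betaln(1 + x2, 1 + n2 - x2)
    log_m0 = betaln(1 + x1 + x2, 1 + n1 + n2 - x1 - x2)
    return math.exp(log_m1 - log_m0)

# e.g. 42/100 positive outcomes for group A vs. 28/100 for group B
print(bayes_factor_two_proportions(42, 100, 28, 100))  # BF10 > 1 favors a group difference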

Implementation: examples/bias_detection/bayesian_testing.py

QuaCer-B Certification

What it provides: Certified bounds on bias magnitude with statistical guarantees.

Use case: Regulatory compliance and high-stakes applications.

Documentation: docs/bias-detection/quacer-b.md

Complete Guide: docs/bias-detection/bias-methods.md


Domain-Specific Evaluation

RAG Systems

Retrieval-Augmented Generation requires specialized evaluation of both retrieval and generation components.

Component-Level Metrics

Retrieval Quality:

  • Precision@k: Relevance of top-k retrieved documents
  • Recall@k: Coverage of relevant documents in top-k
  • MRR (Mean Reciprocal Rank): Position of first relevant document
  • NDCG (Normalized Discounted Cumulative Gain): Graded relevance scoring
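
A minimal sketch of these ranking metrics for binary relevance labels over a ranked list of retrieved document IDs (the example lists are illustrative):

import math

def precision_at_k(retrieved, relevant, k):
    return sum(1 for doc in retrieved[:k] if doc in relevant) / k

def recall_at_k(retrieved, relevant, k):
    return sum(1 for doc in retrieved[:k] if doc in relevant) / len(relevant)

def mrr(retrieved, relevant):
    for rank, doc in enumerate(retrieved, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(retrieved, relevant, k):
    dcg = sum(1.0 / math.log2(i + 1) for i, doc in enumerate(retrieved[:k], start=1) if doc in relevant)
    ideal = sum(1.0 / math.log2(i + 1) for i in range(1, min(k, len(relevant)) + 1))
    return dcg / ideal if ideal else 0.0

retrieved, relevant = ["d3", "d7", "d1", "d9"], {"d1", "d3"}
print(precision_at_k(retrieved, relevant, 3), mrr(retrieved, relevant))  # 0.67 1.0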

Generation Faithfulness:

  • Groundedness: All claims supported by retrieval context
  • Hallucination penalty: Severity weighting for fabricated information
  • Attribution accuracy: Correct source citation

End-to-End:

  • Answer correctness: Factual accuracy given context
  • Completeness: Coverage of relevant information from context
  • Conciseness: Avoiding unnecessary verbosity

Implementation: examples/rag_evaluation/

Documentation: docs/rag-evaluation/rag-metrics.md

Example: Medical RAG Evaluation

from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCaseParams

medical_faithfulness = GEval(
    name="Medical Faithfulness",
    evaluation_steps=[
        "Extract medical claims from output",
        "Verify each claim against clinical guidelines in context",
        "Identify contradictions or unsupported claims",
        "HEAVILY PENALIZE hallucinations that could cause patient harm",
        "Emphasize clinical accuracy and safety"
    ],
    evaluation_params=[
        LLMTestCaseParams.ACTUAL_OUTPUT,
        LLMTestCaseParams.RETRIEVAL_CONTEXT
    ],
    threshold=0.9  # High threshold for medical
)

Code Generation

Understanding Pass@k vs Pass^k

These metrics measure different aspects of code generation performance and serve different purposes.

Metric Comparison

| Metric | Definition | Formula | Use Case |
|---|---|---|---|
| Pass@k | At least one of k solutions passes | 1 - C(n-c, k) / C(n, k) | Benchmark comparison |
| Pass^k | All k solutions pass | p^k | Reliability planning |
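
A minimal sketch of both estimators, with pass_at_k following the unbiased formula in the table (n samples generated, c of which passed):

from math import comb

def pass_at_k(n, c, k):
    """P(at least one of k sampled solutions passes), from n samples with c passing."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

def pass_pow_k(p, k):
    """P(all k independent attempts pass), given single-attempt pass rate p."""
    return p ** k

# Same underlying 60% pass rate, very different readings
print(pass_at_k(n=10, c=6, k=3))  # ≈ 0.97 - optimistic, benchmark view
print(pass_pow_k(p=0.6, k=3))     # ≈ 0.22 - reliability view
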
When to Use Each

Use Pass@k for:

  • Comparing models on benchmarks
  • Reporting best-case performance
  • Academic evaluation

Use Pass^k for:

  • Planning system reliability
  • Resource allocation
  • SLA commitments

Implementation: examples/code_generation/pass_metrics.py and examples/code_generation/pass_at_k.py

Documentation: docs/code-generation/pass-metrics-distinction.md

Code Quality Metrics

  • Functional correctness: Unit test passage
  • Code efficiency: Runtime and memory benchmarks
  • Code style: PEP8, linting scores
  • Security: Vulnerability scanning (Bandit, CodeQL)

Documentation: docs/code-generation/metrics.md


Multi-Agent Systems

Evaluation challenges unique to autonomous and cooperative agents.

Emergent Behavior Assessment

Challenge: Traditional metrics fail to capture dynamic, context-dependent agent behaviors.

Approach:

  1. Scenario-based testing: Predefined interaction sequences
  2. Trace analysis: Evaluate decision trees and communication patterns
  3. Goal achievement: Success rate on complex multi-step objectives

Coordination Metrics

  • Communication efficiency: Message volume vs. task complexity
  • Role adherence: Agent specialization maintenance
  • Conflict resolution: Time to consensus in disagreements
  • Distributed explainability: Transparency across agent decisions

Implementation: examples/multi_agent/


Tools & Platforms

Open Source Frameworks

Evaluation Libraries

  • DeepEval - Comprehensive evaluation framework with G-Eval implementation and 14+ pre-built metrics
  • Ragas - Specialized for RAG evaluation with reference-free metrics (faithfulness, relevance, context quality)
  • TruLens - Custom feedback functions with LangChain/LlamaIndex integration
  • Promptfoo - CLI tool for prompt testing with cost tracking and regression detection
  • OpenAI Evals - Reference implementation with 100+ community-contributed evaluations
  • LangCheck - Simple, composable evaluation metrics for LLM applications
  • Athina AI - Configurable evaluation metrics with focus on reliability

Observability & Monitoring

  • Langfuse - Open-source LLM engineering platform with tracing and prompt management
  • Langwatch - Real-time quality monitoring with custom evaluators and cost analytics
  • Arize Phoenix - OpenTelemetry-based observability with embedding visualization
  • Opik - Self-hostable platform with dataset management and experiment tracking
  • Helicone - Observability platform with request caching and rate limiting
  • Weights & Biases - ML experiment tracking extended for LLM evaluation

Commercial Solutions

Evaluation Platforms

  • Braintrust - CI/CD for AI with regression testing and agent sandboxes
  • LangSmith - LangChain's hosted platform for tracing and evaluation
  • Confident AI - Production monitoring with scheduled evaluation suites
  • HoneyHive - Enterprise platform with A/B testing and fine-tuning workflows
  • Humanloop - Prompt engineering and evaluation with human-in-the-loop
  • Galileo - ML observability extended for GenAI applications

Cloud Services

Detailed Comparison: tools-and-platforms.md


Benchmarks & Datasets

General Language Understanding

Knowledge & Reasoning

  • MMLU - Massive Multitask Language Understanding across 57 subjects from STEM to humanities
  • MMLU-Pro - Enhanced version with 10 choices per question, emphasizing reasoning over memorization
  • BIG-bench - Beyond the Imitation Game with 200+ diverse tasks testing emergent capabilities
  • HELM - Holistic evaluation across 42 scenarios and 7 metrics (accuracy, calibration, robustness, fairness, bias, toxicity, efficiency)
  • AGIEval - Human-centric standardized exams (SAT, LSAT, GRE, GMAT)

Common Sense & World Knowledge

  • HellaSwag - Commonsense natural language inference with adversarial filtering
  • WinoGrande - Large-scale Winograd schema challenge for commonsense reasoning
  • ARC - AI2 Reasoning Challenge with grade-school science questions

Domain-Specific Benchmarks

Code Generation

  • HumanEval - Function synthesis with 164 Python problems and unit tests
  • MBPP - 974 crowd-sourced Python programming problems
  • CodeContests - Competitive programming problems from Codeforces, TopCoder
  • SWE-bench - Real GitHub issues requiring repository-level understanding
  • DS-1000 - Data science problems across NumPy, Pandas, TensorFlow, PyTorch

Mathematics

  • MATH - 12,500 competition mathematics problems with step-by-step solutions
  • GSM8K - 8,500 grade school math word problems
  • MINERVA - STEM problem-solving requiring quantitative reasoning

Scientific Understanding

  • ScienceQA - Multimodal science questions with explanations
  • PubMedQA - Biomedical research question answering
  • MedQA - Medical examination questions (USMLE style)

Retrieval & RAG

  • BEIR - Heterogeneous benchmark for information retrieval across 18 datasets
  • MTEB - Massive Text Embedding Benchmark with 8 tasks and 58 datasets
  • MS MARCO - Machine reading comprehension with 1M+ real queries
  • Natural Questions - Real Google search queries with Wikipedia answers

Task-Specific Benchmarks

Dialogue & Conversation

  • MT-Bench - Multi-turn conversation quality assessment
  • ChatbotArena - Crowd-sourced pairwise model comparisons
  • DialogSum - Dialogue summarization dataset

Translation & Multilingual

  • WMT - Annual shared tasks in machine translation
  • FLORES-200 - Translation between 200 languages
  • XL-Sum - Multilingual abstractive summarization

Agent & Tool Use

  • AgentBench - Comprehensive agent evaluation across 8 environments
  • GAIA - General AI assistants with real-world questions requiring tools
  • WebArena - Autonomous web agents in realistic environments
  • ToolBench - Tool learning with 16,000+ real-world APIs

Safety & Alignment Benchmarks

Truthfulness & Hallucination

  • TruthfulQA - Measures whether models generate truthful answers to questions
  • HaluEval - Hallucination evaluation across diverse tasks
  • FActScore - Fine-grained atomic fact verification

Bias & Fairness

  • BBQ - Bias Benchmark for Question Answering in ambiguous contexts
  • BOLD - Bias in Open-ended Language Generation Dataset
  • WinoBias - Gender bias in coreference resolution

Safety & Toxicity

Creating Custom Benchmarks

When standard benchmarks don't fit your needs:

  1. Define Clear Evaluation Criteria: Specify what success looks like
  2. Create Diverse Test Cases: Cover edge cases and failure modes
  3. Establish Ground Truth: Use expert annotations or automated validation
  4. Version Control: Track benchmark changes over time
  5. Statistical Rigor: Ensure sufficient sample size and significance testing

Guide: docs/creating-custom-benchmarks.md


Production Best Practices

1. Use Evaluation Steps, Not Criteria

# ❌ Less consistent (regenerates steps each time)
metric = GEval(criteria="Check for correctness", ...)

# ✅ More consistent (fixed procedure)
metric = GEval(
    evaluation_steps=[
        "Verify factual accuracy",
        "Check for completeness",
        "Assess clarity"
    ],
    ...
)

2. Implement Calibration

# Fit calibrator on validation set
calibrator = ConfidenceCalibrator()
calibrator.fit(validation_confidences, ground_truth)

# Apply to production
calibrated_score = calibrator.calibrate(raw_confidence)

# Monitor ECE < 0.05

3. Use Component-Level Tracing

from deepeval.tracing import observe

@observe(metrics=[retrieval_quality])
def retrieve(query):
    # Retrieval logic
    return documents

@observe(metrics=[generation_faithfulness])
def generate(query, documents):
    # Generation logic
    return answer

# Separate scores for retrieval vs. generation

4. Set Domain-Appropriate Thresholds

# Medical application: high threshold, strict mode
medical_metric = GEval(
    threshold=0.9,
    strict_mode=True,  # Binary: perfect or fail
    ...
)

# General chatbot: lower threshold, graded scoring
chatbot_metric = GEval(
    threshold=0.7,
    strict_mode=False,
    ...
)

5. Monitor Confidence-Accuracy Correlation

from scipy.stats import spearmanr

# Validate Spearman ρ > 0.7 between confidence and correctness
correlation, p_value = spearmanr(confidences, accuracies)
if correlation < 0.7:
    print("⚠️ Confidence scores unreliable - recalibrate")

6. Implement Human-in-the-Loop Thresholds

def route_for_review(calibrated_confidence):
    if calibrated_confidence >= 0.85:
        return "AUTO_PROCESS"
    elif calibrated_confidence >= 0.60:
        return "SPOT_CHECK"  # 10% sampling
    else:
        return "HUMAN_REVIEW"  # 100% review

7. Cost Optimization

# Use cheaper models for less critical evals
fast_metric = GEval(
    model="gpt-4o-mini",  # ~10x cheaper than GPT-4o
    ...
)

# Cache repeated evaluations (hash the input/output pair to form a stable cache key)
from functools import lru_cache

@lru_cache(maxsize=1000)
def cached_evaluate(input_hash, output_hash):
    return metric.measure(test_case)

Resources

Learning Materials

Tutorials & Guides

Papers & Research

Implementation Examples

Code Repositories

Notebooks & Demos

Communities & Discussions

Forums & Groups

Conferences & Workshops

Related Collections

Awesome Lists

Benchmark Leaderboards

Industry Reports & Case Studies


Research Context

This evaluation guide is developed in support of research on Agentic AI Explainable-by-Design, focusing on:

  1. Multi-agent systems for ethical analysis of regulatory documents
  2. Behavior metrics in RAG systems applied to sensitive domains (healthcare, legal, financial)
  3. Interpretive auditing frameworks combining technical performance with human interpretability
  4. Global South perspectives on AI evaluation in contexts of limited infrastructure and linguistic diversity

Contributing

Contributions are welcome! This guide aims to be a living resource for the AI evaluation community.

Ways to contribute:

  • Add new evaluation methods with code examples
  • Improve documentation clarity
  • Report issues or inaccuracies
  • Share production case studies
  • Translate content (especially into Portuguese and Spanish, for accessibility across Latin America)

Please read CONTRIBUTING.md for detailed guidelines.


Citation

If you use this guide in your research or projects, please cite:

@misc{parreao2025awesome_ai_eval,
  author = {Parreão, Hugo},
  title = {Awesome AI Evaluation Guide: Implementation-Focused Methods for LLMs, RAG, and Agentic AI},
  year = {2025},
  publisher = {GitHub},
  url = {https://github.com/hparreao/Awesome-AI-Evaluation-Guide}
}

License

This work is released under CC0 1.0 Universal (Public Domain). You are free to use, modify, and distribute this content without attribution, though attribution is appreciated.


Maintained by: Hugo Parreão | AI Engineering MSc

Contact: Open an issue or reach out via GitHub for questions, suggestions, or collaboration opportunities.
