RTEB (Retrieval-focused Text Embedding Benchmark) is a new, reliable, high-quality benchmark designed to evaluate the retrieval accuracy of embedding models and rerankers. Unlike existing benchmarks, RTEB focuses specifically on retrieval tasks that reflect real-world applications, particularly search and RAG (Retrieval-Augmented Generation) systems.
Most existing benchmarks used to evaluate embedding models and rerankers have significant shortcomings for real-world applications:
MTEB Issues:
- Many models train on the test set, some intentionally, leading to inflated performance scores
- Measures performance on tasks such as STS (Semantic Textual Similarity) and classification which are less relevant for typical embedding applications today (search and RAG)
- Uses only academic datasets, many of which are irrelevant to actual enterprise use cases and problems
TREC Issues:
- Too large to be of practical use for most evaluation scenarios
- Contains many false negatives due to the data collection process
- Has data which is less relevant for enterprise retrieval applications
RTEB addresses these shortcomings by:
- Retrieval-focused: Concentrating on high-quality retrieval-specific datasets
- Real-world relevance: Including datasets that reflect actual enterprise use cases
- Tiered approach: Implementing a multi-tier system (0-3) to prevent overfitting while maintaining transparency
- Practical dataset sizes: Ensuring datasets are large enough to be meaningful but small enough for efficient evaluation
RTEB uses a hierarchical organization system:
- Dataset: The minimal unit of evaluation. Each dataset produces one score per model for a fixed evaluation metric (e.g., NDCG@10)
- Dataset Groups: Collections of related datasets (e.g., "legal", "healthcare", "code"). Datasets can belong to multiple groups
- Group Scoring: Each group has a single aggregated score per model, calculated as the weighted average of the constituent dataset scores (see the sketch after this list)
- Main Leaderboard: Features the "text overall" group containing almost all text datasets
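To make the group scoring concrete, here is a minimal sketch of the weighted average described above. The dataset names and scores are hypothetical, and uniform weights are an assumption; the framework's own aggregation may weight datasets differently.

# Hypothetical NDCG@10 scores for one model on the datasets in a "legal" group.
dataset_scores = {
    "AILACasedocs": 0.42,
    "AILAStatutes": 0.38,
    "LegalQuAD": 0.55,
}

# Per-dataset weights (uniform here as an illustrative assumption).
weights = {name: 1.0 for name in dataset_scores}

# Group score = weighted average of the constituent dataset scores.
group_score = sum(weights[d] * s for d, s in dataset_scores.items()) / sum(weights.values())
print(f"legal group score: {group_score:.4f}")  # -> legal group score: 0.4500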
Based on production interactions with embedding model users, RTEB includes these key groups:
- Domain-specific: text, legal, code, healthcare, finance, conversation, long-context
- Language-specific: english, german, french, japanese, multilingual
To prevent overfitting while maintaining transparency, RTEB implements a multi-tier system:
Tier 0 (fully open):
- All files publicly available: corpus.jsonl, queries.jsonl, relevance.jsonl
- Includes existing high-quality datasets with open test sets
- At least one open dataset per task group

Tier 1 (held-out relevance judgments):
- Documents and queries publicly available, but relevance judgments are held out
- Allows for development and analysis while preventing direct optimization on labels

Tier 2 (held-out queries and relevance judgments):
- Only the document corpus is publicly available
- Queries and relevance judgments are held out
- Private evaluation sets to prevent manipulation

Tier 3 (fully held out):
- All data is held out: corpus, queries, and relevance judgments
- Public metadata includes: basic description, data sources, dataset statistics, and token length distributions
- Five sample (query, document, relevance) triplets provided for reference
- Most closed datasets in RTEB are Tier 3
git clone <repository-url>
cd rteb
pip install -r requirements.txt

Note: Additional packages may be required depending on the models you want to evaluate. The framework uses lazy loading, so model-specific dependencies (like sentence-transformers, openai, cohere, etc.) are only imported when needed. Install additional packages as required:
# For sentence-transformers models
pip install sentence-transformers
# For OpenAI models
pip install openai
# For Cohere models
pip install cohere
# For VoyageAI models
pip install voyageai
# For other specific model requirements, check the model's documentation

- List available models and datasets:
python -m rteb --list-models
python -m rteb --list-tasks

- Run evaluation on all datasets with all models:
python -m rteb

- Evaluate specific models on specific datasets:
python -m rteb --models "text-embedding-3-small_float32_1536d,all-MiniLM-L6-v2" --tasks "FinanceBench,LegalQuAD"

python -m rteb \
--models "text-embedding-3-small_float32_1536d,text-embedding-3-large_float32_3072d,text-embedding-004_float32_768d" \
--tasks "FinanceBench,HC3Finance,FinQA" \
--gpus 1 \
--batch_size 32

python -m rteb \
--models "sentence-transformers__all-MiniLM-L6-v2_float32_384d,sentence-transformers__all-mpnet-base-v2_float32_768d" \
--tasks "APPS,DS1000,HumanEval,MBPP,WikiSQL,FreshStack" \
--save_path "output/code_evaluation" \
--gpus 1

python -m rteb \
--models "sentence-transformers__all-MiniLM-L6-v2_float32_384d" \
--tasks "ChatDoctor_HealthCareMagic,CUREv1_en" \
--cpus 4 \
--batch_size 16

python -m rteb \
--models "nvidia__NV-Embed-v2_float32_4096d" \
--gpus 1 \
--bf16 \
--offload-model \
--embd_in_memory_threshold 100000

RTEB supports multiple model types organized by provider:
- OpenAI: text-embedding-3-small, text-embedding-3-large
- Google: text-embedding-004
- Sentence-transformers: all-MiniLM-L6-v2, all-mpnet-base-v2, etc.
- NVIDIA: NV-Embed-v2
- Cohere: embed-english-v3.0, embed-multilingual-v3.0
- VoyageAI: voyage-3, voyage-3.5
- And many more...
Use --list-models to see all available models with their specifications.
RTEB includes datasets across multiple domains:
- Legal: AILACasedocs, AILAStatutes, LegalSummarization, LegalQuAD
- Finance: FinanceBench, HC3Finance, FinQA
- Code: APPS, DS1000, HumanEval, MBPP, WikiSQL, FreshStack
- Healthcare: ChatDoctor_HealthCareMagic, CUREv1_en, CUREv1_fr
- Closed datasets: 13 held-out datasets across various domains and languages
Use --list-tasks to see all available datasets with their tier information and groups.
Results for each model-dataset combination are stored in JSON format with comprehensive metrics:
{
"ndcg_at_1": 0.18,
"ndcg_at_3": 0.16432,
"ndcg_at_5": 0.16492,
"ndcg_at_10": 0.19717,
"ndcg_at_20": 0.22987,
"ndcg_at_50": 0.28899,
"ndcg_at_100": 0.35235,
"map_at_1": 0.04421,
"map_at_3": 0.08374,
"map_at_5": 0.10432,
"map_at_10": 0.12772,
"map_at_20": 0.13851,
"map_at_50": 0.15196,
"map_at_100": 0.16262,
"recall_at_1": 0.04421,
"recall_at_3": 0.09949,
"recall_at_5": 0.14982,
"recall_at_10": 0.25601,
"recall_at_20": 0.35048,
"recall_at_50": 0.55142,
"recall_at_100": 0.83315,
"precision_at_1": 0.18,
"precision_at_3": 0.14667,
"precision_at_5": 0.12,
"precision_at_10": 0.092,
"precision_at_20": 0.065,
"precision_at_50": 0.0428,
"precision_at_100": 0.0314,
"model_name": "all-MiniLM-L6-v2",
"embd_dim": 384,
"embd_dtype": "float32"
}

Models are evaluated using multiple retrieval metrics:
- NDCG@{1,3,5,10,20,50,100}: Normalized Discounted Cumulative Gain
- MAP@{1,3,5,10,20,50,100}: Mean Average Precision
- Recall@{1,3,5,10,20,50,100}: Recall at different cutoffs
- Precision@{1,3,5,10,20,50,100}: Precision at different cutoffs
Default metric: NDCG@10 is used as the primary ranking metric on leaderboards.
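For reference, the sketch below shows how NDCG@10 is computed for a single query from a ranked list and its relevance judgments; dataset-level scores average this value over all queries. It is an illustrative implementation, not the framework's own evaluation code.

import math

def dcg_at_k(gains, k):
    # Discounted cumulative gain over the first k positions (ranks are 1-indexed).
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains[:k], start=1))

def ndcg_at_k(ranked_doc_ids, relevance, k=10):
    # NDCG@k for one query: DCG of the returned ranking divided by the ideal DCG.
    gains = [relevance.get(doc_id, 0) for doc_id in ranked_doc_ids]
    ideal_gains = sorted(relevance.values(), reverse=True)
    ideal = dcg_at_k(ideal_gains, k)
    return dcg_at_k(gains, k) / ideal if ideal > 0 else 0.0

# Toy example: judgments in the same shape as a relevance.jsonl entry.
relevance = {"doc1": 1, "doc2": 0, "doc3": 1}
ranking = ["doc2", "doc1", "doc3"]  # doc2 retrieved first, but irrelevant
print(round(ndcg_at_k(ranking, relevance, k=10), 4))  # -> 0.6934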
Results are organized in the following directory structure:
output/
├── {dataset_name}/
│   └── {model_id}/
│       ├── retrieve_eval.json    # Evaluation metrics
│       └── retrieve_pred.json    # Predictions (if --save_prediction)

results/
├── results.json # Compiled results for all datasets
├── models.json # Model metadata
└── datasets.json # Dataset group information
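The sketch below shows one way to collect the per-dataset retrieve_eval.json files into a flat list of (dataset, model, score) rows, assuming the directory layout above; treat the glob pattern as an assumption about how the framework names its output folders.

import json
from pathlib import Path

def collect_results(output_dir="output", metric="ndcg_at_10"):
    # Walk output/{dataset_name}/{model_id}/retrieve_eval.json and collect one metric.
    rows = []
    for eval_file in Path(output_dir).glob("*/*/retrieve_eval.json"):
        dataset, model_id = eval_file.parent.parent.name, eval_file.parent.name
        with eval_file.open() as f:
            metrics = json.load(f)
        rows.append((dataset, model_id, metrics.get(metric)))
    return rows

for dataset, model_id, score in collect_results():
    print(f"{dataset:30s} {model_id:40s} {score}")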
Each dataset follows a standard format:
- corpus.jsonl: Document collection
  {"_id": "doc1", "title": "Document Title", "text": "Document content..."}
- queries.jsonl: Query collection
  {"_id": "query1", "text": "What is the query?"}
- relevance.jsonl: Relevance judgments
  {"query1": {"doc1": 1, "doc2": 0}}
Key command-line options:
- --gpus N: Number of GPUs for encoding (default: 0)
- --cpus N: Number of CPUs for computation (default: 1)
- --bf16: Use bfloat16 precision for memory efficiency
- --batch_size N: Encoding batch size (default: 16)
- --embd_batch_size N: Similarity computation batch size (default: 1024)
- --embd_in_memory_threshold N: Embedding memory threshold (default: 200000)
- --offload-model: Offload the model after encoding to save memory
- --save_path DIR: Output directory (default: "output/")
- --save_prediction: Save detailed predictions
- --keep_embds: Keep embedding files after retrieval
- --topk N: Number of top documents per query (default: 100)
- --overwrite: Overwrite existing results
We welcome contributions of new embedding models! The framework supports:
- Sentence-transformers models: Inherit from SentenceTransformersEmbeddingModel
- API-based models: Inherit from the appropriate API model class
- Custom models: Implement the EmbeddingModel interface

To contribute a model:
- Create a model class inheriting from the appropriate base class
- Implement required methods (embed, _id, etc.); a sketch follows this list
- Add model metadata using ModelMeta
- Submit a PR with your model and results on Tier 1 datasets
- The model will then be evaluated on Tier 2 datasets by the organizers
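The sketch below illustrates roughly what a custom model class might look like. The exact base class, import paths, and method signatures are assumptions; check the EmbeddingModel interface in the framework source before implementing.

import numpy as np

# Import paths are assumptions; check the rteb source for the real ones, e.g.:
# from rteb.models import EmbeddingModel, ModelMeta

class MyEmbeddingModel:  # in practice, inherit from EmbeddingModel
    """Sketch of a custom model exposing the methods named above (embed, _id)."""

    def __init__(self, dim: int = 384):
        self.dim = dim

    @property
    def _id(self) -> str:
        # Identifier used in result paths, e.g. "my-model_float32_384d".
        return f"my-model_float32_{self.dim}d"

    def embed(self, texts: list[str], batch_size: int = 16) -> np.ndarray:
        # Placeholder encoder: random unit vectors. Replace with a real model call.
        rng = np.random.default_rng(0)
        vectors = rng.normal(size=(len(texts), self.dim))
        return vectors / np.linalg.norm(vectors, axis=1, keepdims=True)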
Each model must include:
- Dimensionality: Embedding vector size
- Model parameters: Total parameter count
- Precision: float32, int8, binary, etc.
- Context length: Maximum input token length
- Similarity metric: cosine, dot_product, etc.
- Reference: Link to model documentation/paper
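The required metadata could be declared along the lines of the sketch below; the field names are illustrative and mirror the list above rather than ModelMeta's exact schema.

# Illustrative only: field names mirror the list above, not necessarily ModelMeta's real schema.
my_model_meta = dict(
    name="my-model",
    embd_dim=384,                  # Dimensionality: embedding vector size
    num_params=22_700_000,         # Model parameters: total parameter count
    embd_dtype="float32",          # Precision: float32, int8, binary, ...
    max_tokens=512,                # Context length: maximum input token length
    similarity="cosine",           # Similarity metric: cosine, dot_product, ...
    reference="https://example.com/my-model",  # Reference: link to documentation/paper
)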
Results are compiled into a comprehensive leaderboard showing:
- Model rankings across different dataset groups
- Detailed performance breakdowns by domain
- Model specifications and efficiency metrics
- Statistical significance testing
Only models with complete results across all datasets are included in the final leaderboard to ensure fair comparison.
Datasets in RTEB follow specific size constraints for practical evaluation:
Lower bounds:
- Minimum 1000 documents
- Minimum 50 queries (ideally 100+)
Upper bounds:
- Most datasets: <100M tokens
- Large datasets: <1B tokens (typically closed)
- Target: Evaluation within 10 minutes on 8xH100 + CPU
RTEB is designed to address the needs of the embedding community by providing a practical, fair, and comprehensive evaluation framework. We thank all contributors and the broader NLP community for their feedback and support.
[License information to be added]
If you use RTEB in your research, please cite:
[Citation to be added]