This repository contains evaluation metrics and scripts for Deep Research reports.
We provide a pairwise-comparison evaluation script inspired by Google's Deep Research capabilities, as described in their blog post about Deep Research on Gemini 2.5 Pro Experimental.
The evaluation compares generated research reports against reference reports (in our case, OpenAI's Deep Research outputs) across four key dimensions:
- Instruction Following: Evaluates the response's fidelity to user-specified instructions and constraints.
- Comprehensiveness: Measures the breadth and range of information covered in the response, addressing the scope of the user's request.
- Completeness: Measures the depth and thoroughness of information for the topics addressed in the report.
- Writing Quality: Evaluates the clarity, conciseness, logical organization, and overall readability of the report.
These dimensions align with the capabilities that make Deep Research tools effective at analytical reasoning, information synthesis, and generating insightful research reports.
We include the DeepConsult dataset in the `datasets/DeepConsult` directory, which consists of:

- `queries.csv` - A collection of business and consulting-related prompts designed for deep research. These queries cover a wide range of topics, including:
  - Market analysis and investment opportunities
  - Industry-specific evaluations
  - Financial modeling and assessment
  - Technology trend analysis
  - Strategic business planning
- `responses_OpenAI-DeepResearch_vs_ARI_2025-05-15.csv` - Responses from OpenAI Deep Research and ARI, formatted for use with the evaluation script. The file follows the format required by the evals script:
  - `question`: The original research questions/prompts
  - `baseline_answer`: Responses from OpenAI's Deep Research (used as the reference)
  - `candidate_answer`: Responses from ARI, to be evaluated against the baseline
The dataset is designed to benchmark language models' ability to perform deep research on complex business and consulting queries and to provide comprehensive, well-structured, and insightful analysis comparable to professional consulting reports.
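For a quick look at the dataset, the files can be loaded with pandas. This is a minimal sketch, assuming pandas is installed in your environment:

```python
# Minimal sketch (assumes pandas is installed): peek at the DeepConsult files.
import pandas as pd

queries = pd.read_csv("datasets/DeepConsult/queries.csv")
responses = pd.read_csv(
    "datasets/DeepConsult/responses_OpenAI-DeepResearch_vs_ARI_2025-05-15.csv"
)

print(f"{len(queries)} queries, {len(responses)} response pairs")
# The response file should contain at least these columns:
# question, baseline_answer, candidate_answer
print(responses.columns.tolist())
```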
- Clone this repository:

  ```bash
  git clone https://github.com/your-org/ydc-deep-research-evals.git
  cd ydc-deep-research-evals
  ```

- Install Git LFS (Large File Storage) if you don't have it already; it is required for downloading the example dataset in this repo:

  ```bash
  # On Ubuntu/Debian
  apt-get install git-lfs

  # On macOS (using Homebrew)
  brew install git-lfs

  # Initialize Git LFS
  git lfs install
  ```

- Install the required dependencies:

  ```bash
  pip install -r requirements.txt
  ```

- Set up environment variables for OpenAI API access:

  ```bash
  export OPENAI_API_KEY=your_openai_api_key
  export OPENAI_ORGANIZATION_ID=your_openai_org_id
  ```
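If you want to confirm from Python that the credentials are visible before running an evaluation, a small sanity check (variable names as above; whether `OPENAI_ORGANIZATION_ID` is required may depend on your account) could look like:

```python
# Sanity check (not part of this repo): confirm the OpenAI credentials
# exported above are visible to the Python process.
import os

for var in ("OPENAI_API_KEY", "OPENAI_ORGANIZATION_ID"):
    print(f"{var}: {'set' if os.environ.get(var) else 'MISSING'}")
```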
To evaluate research-style responses, use the `deep_research_pairwise_evals.py` script:

```bash
python evals/deep_research_pairwise_evals.py \
    --input-data datasets/DeepConsult/responses_OpenAI-DeepResearch_vs_ARI_2025-05-15.csv \
    --output-dir path/to/output/directory \
    --model o3-mini-2025-01-31 \
    --num-workers 4 \
    --metric-num-workers 3 \
    --metric-num-trials 3
```

The input CSV file should contain the following columns:

- `question`: The research question or prompt
- `baseline_answer`: The reference answer to compare against
- `candidate_answer`: The candidate answer to evaluate
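To evaluate your own system against a reference, you can assemble an input CSV with exactly these columns. A minimal sketch using pandas; the filename and the example row are placeholders, not part of this repository:

```python
# Sketch: build an input CSV with the required columns. The row below is a
# placeholder; substitute your own questions and report texts.
import pandas as pd

rows = [
    {
        "question": "What are the impacts of climate change on agriculture?",
        "baseline_answer": "Reference report text...",
        "candidate_answer": "Candidate report text...",
    },
]
pd.DataFrame(rows).to_csv("my_eval_input.csv", index=False)
```

The resulting file can then be passed to the script via `--input-data my_eval_input.csv`.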
The evaluation results are saved as a JSONL file in the specified output directory. Each line contains the evaluation results for a single question-answer pair, including:
- Original input data (question, answers, metadata)
- Scores for each evaluation dimension
- Aggregate metrics
- Raw evaluation data
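Since the exact JSON field names are not listed here, the sketch below simply loads the JSONL file and prints the keys of the first record so you can inspect the schema before aggregating scores yourself; the path is a placeholder for the file the script actually writes:

```python
# Sketch: load the evaluation output (one JSON object per line) and inspect
# its schema. The path below is a placeholder.
import json

records = []
with open("path/to/output/directory/results.jsonl", encoding="utf-8") as f:
    for line in f:
        if line.strip():
            records.append(json.loads(line))

print(f"Loaded {len(records)} evaluation records")
if records:
    print("Top-level keys:", sorted(records[0].keys()))
```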
You can also use the evaluation metric directly in your Python code:
```python
from evals.metrics.deep_research_pairwise_metric import DeepResearchPairwiseMetric

# Initialize the metric
metric = DeepResearchPairwiseMetric(
    eval_model="o3-mini-2025-01-31",
    num_trials=3,
    num_workers=3
)

# Evaluate a single question-answer pair
result = metric.score(
    question="What are the impacts of climate change on agriculture?",
    baseline_answer="Your reference answer text...",
    candidate_answer="Your candidate answer text..."
)

# Access the evaluation results
print(f"Instruction Following Score: {result.instruction_following.score}")
print(f"Comprehensiveness Score: {result.comprehensiveness.score}")
print(f"Completeness Score: {result.completeness.score}")
print(f"Writing Quality Score: {result.writing_quality.score}")
```

The evaluator supports several configuration options:
- `model`: The OpenAI model to use for evaluation (default: `o3-mini-2025-01-31`)
- `num-workers`: Number of worker threads for parallel processing (default: 4)
- `metric-num-workers`: Number of worker threads for the underlying metric (default: 3)
- `metric-num-trials`: Number of trials per evaluation for more stable results (default: 3)

For each trial, the evaluation runs twice: once with the original order and once with the baseline and candidate answers flipped. This helps mitigate potential position bias in the evaluation (see the conceptual sketch below).
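As an illustration of how flipping the order counteracts position bias (a conceptual sketch under assumed scoring conventions, not the repository's actual implementation), suppose each run yields a preference score in [0, 1] for whichever answer is presented first. The candidate's debiased score can then average its original-order score with the complement of the baseline-first score:

```python
# Conceptual sketch only (an assumption, not this repo's exact logic): combine
# the original-order and flipped-order judgments so that a judge's preference
# for a particular position cancels out. Scores are preferences in [0, 1] for
# whichever answer was shown first.
def debiased_candidate_score(original_order: float, flipped_order: float) -> float:
    # Original order: candidate shown first, so the score favors the candidate.
    # Flipped order: baseline shown first, so the candidate gets 1 - score.
    return (original_order + (1.0 - flipped_order)) / 2.0


# A judge biased toward the first position might give 0.7 and 0.6; the
# combined candidate score is 0.55, closer to a tie than either run alone.
print(debiased_candidate_score(0.7, 0.6))
```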
This project is licensed under the terms included in the LICENSE file.
- Python 3.10+
- OpenAI API access
- Dependencies listed in requirements.txt