This repository contains evaluation metrics and scripts for Deep Research reports.
We provide a pairwise-comparison evaluation script inspired by Google's Deep Research capabilities, as described in their blog post about Deep Research on Gemini 2.5 Pro Experimental.
The evaluation compares generated research reports against reference reports (in our case, OpenAI's Deep Research outputs) across four key dimensions:
- Instruction Following: Evaluates the response's fidelity to user-specified instructions and constraints.
- Comprehensiveness: Measures the breadth and range of information covered in the response, addressing the scope of the user's request.
- Completeness: Measures the depth and thoroughness of information for the topics addressed in the report.
- Writing Quality: Evaluates the clarity, conciseness, logical organization, and overall readability of the report.
These dimensions align with the capabilities that make Deep Research tools effective at analytical reasoning, information synthesis, and generating insightful research reports.
We include the DeepConsult dataset in the `datasets/DeepConsult` directory, which consists of:

- `queries.csv` - A collection of business and consulting-related prompts designed for deep research. These queries cover a wide range of topics, including:
  - Market analysis and investment opportunities
  - Industry-specific evaluations
  - Financial modeling and assessment
  - Technology trend analysis
  - Strategic business planning
- `responses_OpenAI-DeepResearch_vs_ARI_2025-05-15.csv` - Responses from OpenAI Deep Research and ARI, formatted for use with the evaluation script. The file follows the format required by the evals script:
  - `question`: The original research questions/prompts
  - `baseline_answer`: Responses from OpenAI's Deep Research (used as the reference)
  - `candidate_answer`: Responses from ARI, to be evaluated against the baseline
The dataset is designed to benchmark language models' ability to perform deep research on complex business and consulting queries and to provide comprehensive, well-structured, and insightful analysis comparable to professional consulting reports.
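For a quick look at the dataset, the files can be loaded with pandas. This is a minimal sketch, assuming pandas is installed in your environment:

```python
# Minimal sketch (assumes pandas is installed): peek at the DeepConsult files.
import pandas as pd

queries = pd.read_csv("datasets/DeepConsult/queries.csv")
responses = pd.read_csv(
    "datasets/DeepConsult/responses_OpenAI-DeepResearch_vs_ARI_2025-05-15.csv"
)

print(f"{len(queries)} queries, {len(responses)} response pairs")
# The response file should contain at least these columns:
# question, baseline_answer, candidate_answer
print(responses.columns.tolist())
```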
- Clone this repository:

  ```bash
  git clone https://github.com/your-org/ydc-deep-research-evals.git
  cd ydc-deep-research-evals
  ```

- Install Git LFS (Large File Storage) if you don't have it already; it is required for downloading the example dataset in this repo:

  ```bash
  # On Ubuntu/Debian
  apt-get install git-lfs

  # On macOS (using Homebrew)
  brew install git-lfs

  # Initialize Git LFS
  git lfs install
  ```

- Install the required dependencies:

  ```bash
  pip install -r requirements.txt
  ```

- Set up environment variables for OpenAI API access:

  ```bash
  export OPENAI_API_KEY=your_openai_api_key
  export OPENAI_ORGANIZATION_ID=your_openai_org_id
  ```
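If you want to confirm from Python that the credentials are visible before running an evaluation, a small sanity check (variable names as above; whether `OPENAI_ORGANIZATION_ID` is required may depend on your account) could look like:

```python
# Sanity check (not part of this repo): confirm the OpenAI credentials
# exported above are visible to the Python process.
import os

for var in ("OPENAI_API_KEY", "OPENAI_ORGANIZATION_ID"):
    print(f"{var}: {'set' if os.environ.get(var) else 'MISSING'}")
```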
To evaluate research-style responses, use the `deep_research_pairwise_evals.py` script:

```bash
python evals/deep_research_pairwise_evals.py \
    --input-data datasets/DeepConsult/responses_OpenAI-DeepResearch_vs_ARI_2025-05-15.csv \
    --output-dir path/to/output/directory \
    --model o3-mini-2025-01-31 \
    --num-workers 4 \
    --metric-num-workers 3 \
    --metric-num-trials 3
```

The input CSV file should contain the following columns:

- `question`: The research question or prompt
- `baseline_answer`: The reference answer to compare against
- `candidate_answer`: The candidate answer to evaluate
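To evaluate your own system against a reference, you can assemble an input CSV with exactly these columns. A minimal sketch using pandas; the filename and the example row are placeholders, not part of this repository:

```python
# Sketch: build an input CSV with the required columns. The row below is a
# placeholder; substitute your own questions and report texts.
import pandas as pd

rows = [
    {
        "question": "What are the impacts of climate change on agriculture?",
        "baseline_answer": "Reference report text...",
        "candidate_answer": "Candidate report text...",
    },
]
pd.DataFrame(rows).to_csv("my_eval_input.csv", index=False)
```

The resulting file can then be passed to the script via `--input-data my_eval_input.csv`.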
The evaluation results are saved as a JSONL file in the specified output directory. Each line contains the evaluation results for a single question-answer pair, including:
- Original input data (question, answers, metadata)
- Scores for each evaluation dimension
- Aggregate metrics
- Raw evaluation data
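Since the exact JSON field names are not listed here, the sketch below simply loads the JSONL file and prints the keys of the first record so you can inspect the schema before aggregating scores yourself; the path is a placeholder for the file the script actually writes:

```python
# Sketch: load the evaluation output (one JSON object per line) and inspect
# its schema. The path below is a placeholder.
import json

records = []
with open("path/to/output/directory/results.jsonl", encoding="utf-8") as f:
    for line in f:
        if line.strip():
            records.append(json.loads(line))

print(f"Loaded {len(records)} evaluation records")
if records:
    print("Top-level keys:", sorted(records[0].keys()))
```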
You can also use the evaluation metric directly in your Python code:
```python
from evals.metrics.deep_research_pairwise_metric import DeepResearchPairwiseMetric

# Initialize the metric
metric = DeepResearchPairwiseMetric(
    eval_model="o3-mini-2025-01-31",
    num_trials=3,
    num_workers=3
)

# Evaluate a single question-answer pair
result = metric.score(
    question="What are the impacts of climate change on agriculture?",
    baseline_answer="Your reference answer text...",
    candidate_answer="Your candidate answer text..."
)

# Access the evaluation results
print(f"Instruction Following Score: {result.instruction_following.score}")
print(f"Comprehensiveness Score: {result.comprehensiveness.score}")
print(f"Completeness Score: {result.completeness.score}")
print(f"Writing Quality Score: {result.writing_quality.score}")
```

The evaluator supports several configuration options:
- `model`: The OpenAI model to use for evaluation (default: `o3-mini-2025-01-31`)
- `num-workers`: Number of worker threads for parallel processing (default: 4)
- `metric-num-workers`: Number of worker threads for the underlying metric (default: 3)
- `metric-num-trials`: Number of trials per evaluation for more stable results (default: 3)

For each trial, the evaluation runs twice: once with the original order and once with the baseline and candidate answers flipped. This helps mitigate potential position bias in the evaluation (see the conceptual sketch below).
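As an illustration of how flipping the order counteracts position bias (a conceptual sketch under assumed scoring conventions, not the repository's actual implementation), suppose each run yields a preference score in [0, 1] for whichever answer is presented first. The candidate's debiased score can then average its original-order score with the complement of the baseline-first score:

```python
# Conceptual sketch only (an assumption, not this repo's exact logic): combine
# the original-order and flipped-order judgments so that a judge's preference
# for a particular position cancels out. Scores are preferences in [0, 1] for
# whichever answer was shown first.
def debiased_candidate_score(original_order: float, flipped_order: float) -> float:
    # Original order: candidate shown first, so the score favors the candidate.
    # Flipped order: baseline shown first, so the candidate gets 1 - score.
    return (original_order + (1.0 - flipped_order)) / 2.0


# A judge biased toward the first position might give 0.7 and 0.6; the
# combined candidate score is 0.55, closer to a tie than either run alone.
print(debiased_candidate_score(0.7, 0.6))
```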
This project is licensed under the terms included in the LICENSE file.
- Python 3.10+
- OpenAI API access
- Dependencies listed in requirements.txt