Can AI agents analyze real-world single-cell data?
SCBench is a benchmark of verifiable problems derived from practical single-cell RNA-seq workflows. Each problem snapshots an analysis state immediately before a target step and pairs it with a deterministic grader that evaluates recovery of a key biological result.
| Model | Accuracy | Cost/Eval | Latency |
|---|---|---|---|
| Opus-4.5 | 50.0% | $0.39 | 275s |
| GPT-5.2 | 46.6% | $0.04 | 89s |
| Sonnet-4.5 | 41.4% | $0.08 | 116s |
| Gemini-2.5-Pro | 41.1% | $0.19 | 194s |
Full results with 95% confidence intervals are in `results/`.
Evaluations across:
- 6 platforms: Chromium, BD Rhapsody, Parse, Illumina, MissionBio, CSGenetics
- 7 task categories: QC, Normalization, Dimensionality Reduction, Clustering, Cell Typing, Differential Expression, Trajectory Analysis
Tasks require empirical interaction with the data: agents that lean on prior knowledge instead of performing the requisite analysis fail many tasks.
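For instance, a cell-filtering QC task has to be answered by loading the snapshotted data and computing the result. A minimal sketch of what that interaction might look like using scanpy; the file path and filtering thresholds here are illustrative, not taken from any actual task:

```python
import scanpy as sc

# Load the snapshotted analysis state (path is hypothetical)
adata = sc.read_h5ad("snapshot.h5ad")

# Annotate mitochondrial genes (mouse-style "mt-" prefix) and compute
# per-cell QC metrics, then filter on illustrative thresholds
adata.var["mt"] = adata.var_names.str.startswith("mt-")
sc.pp.calculate_qc_metrics(adata, qc_vars=["mt"], inplace=True)
sc.pp.filter_cells(adata, min_genes=200)
adata = adata[adata.obs["pct_counts_mt"] < 20].copy()

# The count a cell-filtering grader would check
print(adata.n_obs)
```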
This repository includes canonical examples in `evals_canonical/` demonstrating the evaluation format. The full benchmark is withheld to prevent overfitting.
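The exact schema is defined by the JSON files in `evals_canonical/`; conceptually, each spec pairs a data snapshot with a task prompt and a grader configuration. A hypothetical sketch (field names are illustrative, not the actual schema):

```json
{
  "task_id": "chromium_qc_4T1_filter_cells",
  "prompt": "Filter low-quality cells and report the number remaining.",
  "snapshot": "data/chromium_qc_4T1.h5ad",
  "grader": {
    "type": "NumericTolerance",
    "answer_key": "cells_after_filtering",
    "expected": 6355,
    "rel_tol": 0.01
  }
}
```

The canonical set covers: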
| Task | Platform | Grader |
|---|---|---|
| QC | Chromium | Numeric |
| Normalization | Chromium | Numeric |
| Dimensionality Reduction | Chromium | MCQ |
| Clustering | Chromium | MCQ |
| Cell Typing | Chromium | Cosine |
| Differential Expression | Chromium | P@K |
| Trajectory Analysis | Chromium | P@K |
```bash
pip install -e .

# Validate an evaluation
scbench validate evals_canonical/chromium/chromium_qc_4T1_filter_cells.json

# Run with mini-swe-agent
export ANTHROPIC_API_KEY=your_key
scbench run evals_canonical/chromium/chromium_qc_4T1_filter_cells.json --agent minisweagent --model anthropic/claude-opus-4-5
```
Evaluations can also be run programmatically with a custom agent function:

```python
from scbench import EvalRunner

def my_agent(task_prompt, work_dir):
    import json
    # Perform the analysis, then write the answer where the grader expects it
    answer = {"cells_after_filtering": 6355}
    (work_dir / "eval_answer.json").write_text(json.dumps(answer))
    return answer

runner = EvalRunner("evals_canonical/chromium/chromium_qc_4T1_filter_cells.json")
result = runner.run(agent_function=my_agent)
print(f"Passed: {result['passed']}")
```

Five grader families handle different answer types:
| Grader | Use Case |
|---|---|
| NumericTolerance | QC metrics, counts, expression values |
| MultipleChoice | Discrete interpretation questions |
| MarkerGenePrecisionRecall | Gene lists (P@K, R@K) |
| LabelSetJaccard | Cell type sets |
| DistributionComparison | Cell type proportions |
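For intuition (these are illustrative sketches, not the repository's implementations), a numeric-tolerance check and a P@K gene-list check reduce to:

```python
def numeric_tolerance(answer: float, expected: float, rel_tol: float = 0.01) -> bool:
    # Pass if the answer is within a relative tolerance of the expected value
    return abs(answer - expected) <= rel_tol * abs(expected)

def precision_at_k(predicted: list[str], reference: set[str], k: int = 10) -> float:
    # Fraction of the top-k predicted marker genes found in the reference set
    top_k = predicted[:k]
    return sum(gene in reference for gene in top_k) / k

assert numeric_tolerance(6355, 6355)  # exact match passes
assert precision_at_k(["Cd3e", "Cd8a", "Ms4a1"], {"Cd3e", "Cd8a"}, k=3) == 2 / 3
```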
See `eval-graders` for the actual implementations.
Citation: TODO

License: Apache 2.0