scBench

Can AI agents analyze real-world single-cell data?

scBench is a benchmark of verifiable problems derived from practical single-cell RNA-seq workflows. Each problem snapshots an analysis state immediately before a target step and pairs it with a deterministic grader that evaluates recovery of a key biological result.

Key Findings

| Model          | Accuracy | Cost/Eval | Latency |
|----------------|----------|-----------|---------|
| Opus-4.5       | 50.0%    | $0.39     | 275s    |
| GPT-5.2        | 46.6%    | $0.04     | 89s     |
| Sonnet-4.5     | 41.4%    | $0.08     | 116s    |
| Gemini-2.5-Pro | 41.1%    | $0.19     | 194s    |

Full results with 95% confidence intervals are in results/.

Benchmark Structure

Evaluations span:

  • 6 platforms: Chromium, BD Rhapsody, Parse, Illumina, MissionBio, CSGenetics
  • 7 task categories: QC, Normalization, Dimensionality Reduction, Clustering, Cell Typing, Differential Expression, Trajectory Analysis

Tasks require empirical interaction with the data: agents that rely on prior knowledge without performing the analysis fail many tasks.

Canonical Examples

This repository includes canonical examples in evals_canonical/ demonstrating the evaluation format. The full benchmark is withheld to prevent overfitting.

| Task                     | Platform | Grader  |
|--------------------------|----------|---------|
| QC                       | Chromium | Numeric |
| Normalization            | Chromium | Numeric |
| Dimensionality Reduction | Chromium | MCQ     |
| Clustering               | Chromium | MCQ     |
| Cell Typing              | Chromium | Cosine  |
| Differential Expression  | Chromium | P@K     |
| Trajectory Analysis      | Chromium | P@K     |

Quick Start

pip install -e .

# Validate an evaluation
scbench validate evals_canonical/chromium/chromium_qc_4T1_filter_cells.json

# Run with mini-swe-agent
export ANTHROPIC_API_KEY=your_key
scbench run evals_canonical/chromium/chromium_qc_4T1_filter_cells.json --agent minisweagent --model anthropic/claude-opus-4-5

Custom Agent

import json

from scbench import EvalRunner

def my_agent(task_prompt, work_dir):
    # The agent receives the task prompt and a working directory containing the
    # snapshotted analysis state. It must write its answer to eval_answer.json.
    # A real agent would inspect the data and run the requested step; this stub
    # returns a hard-coded count for illustration.
    answer = {"cells_after_filtering": 6355}
    (work_dir / "eval_answer.json").write_text(json.dumps(answer))
    return answer

runner = EvalRunner("evals_canonical/chromium/chromium_qc_4T1_filter_cells.json")
result = runner.run(agent_function=my_agent)
print(f"Passed: {result['passed']}")

Graders

Five grader families handle different answer types:

| Grader                    | Use Case                               |
|---------------------------|----------------------------------------|
| NumericTolerance          | QC metrics, counts, expression values  |
| MultipleChoice            | Discrete interpretation questions      |
| MarkerGenePrecisionRecall | Gene lists (P@K, R@K)                  |
| LabelSetJaccard           | Cell type sets                         |
| DistributionComparison    | Cell type proportions                  |

See eval-graders for implementations.
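
For intuition, the sketches below show the kind of check a NumericTolerance grader and a P@K grader perform. The function names, signatures, and the default tolerance are illustrative assumptions, not the actual grader classes in eval-graders.

# Illustrative sketches only; the real graders and their APIs live in eval-graders.

def within_tolerance(answer: float, expected: float, rel_tol: float = 0.05) -> bool:
    # NumericTolerance-style check: pass if the numeric answer falls within a
    # relative tolerance of the expected value (the 5% default is assumed here).
    return abs(answer - expected) <= rel_tol * abs(expected)

def precision_at_k(predicted_genes: list[str], reference_genes: set[str], k: int) -> float:
    # P@K-style check: the fraction of the top-k predicted marker genes that
    # appear in the reference gene set.
    top_k = predicted_genes[:k]
    return sum(gene in reference_genes for gene in top_k) / k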

Citation

TODO

License

Apache 2.0
