Commit c16dc6e

new eval thanks @ashikshafi08
1 parent 850d841 commit c16dc6e

26 files changed, +3414 -1707 lines changed

bun.lock

Lines changed: 51 additions & 73 deletions

packages/eval/.gitignore

Lines changed: 1 addition & 2 deletions
@@ -1,3 +1,2 @@
-cache
-results
+runs
 data

packages/eval/.python-version

Lines changed: 0 additions & 1 deletion
This file was deleted.

packages/eval/README.md

Lines changed: 116 additions & 0 deletions
@@ -0,0 +1,116 @@
# @supermemory/eval

SWE-bench Lite retrieval-only evaluation harness comparing two Claude Agent SDK variants:

- **Agent1 (ops-only)**: Read/Grep/Glob tools only
- **Agent2 (ops+search)**: Read/Grep/Glob + semantic search via `code-chunk` embeddings

## Setup

```bash
# From monorepo root
bun install
```

Required environment variables:

```bash
ANTHROPIC_API_KEY=...  # Claude API access
GOOGLE_API_KEY=...     # Gemini embeddings (default)
# or
OPENAI_API_KEY=...     # If using --embedding-provider openai
```

## Usage

```bash
cd packages/eval

# Full evaluation on test split
bun run src/run.ts

# Dev split, limited instances
bun run src/run.ts --split dev --max-instances 10

# Only Agent1 (ops-only)
bun run src/run.ts --skip-agent2

# Specific instance
bun run src/run.ts --instance django__django-12345

# Custom embedding dimensions (768/1536/3072)
bun run src/run.ts --embedding-dimensions 768
```

## Options

| Flag | Description | Default |
|------|-------------|---------|
| `--split <dev\|test>` | Dataset split | `test` |
| `--max-instances <n>` | Limit instances | all |
| `--max-turns <n>` | Max agent turns | 20 |
| `--max-tool-calls <n>` | Max tool calls | 50 |
| `--model <name>` | Claude model | `claude-sonnet-4-5` |
| `--skip-agent1` | Skip ops-only agent | false |
| `--skip-agent2` | Skip ops+search agent | false |
| `--instance <id>` | Run specific instance(s) | - |
| `--run-dir <path>` | Output directory | `./runs` |
| `--embedding-provider` | `gemini` or `openai` | `gemini` |
| `--embedding-dimensions` | Gemini output dims | 1536 |

## Output

Runs output to `runs/<timestamp>/`:

```
runs/
└── 2025-01-01T12-00-00-000Z/
    ├── events/
    │   ├── django__django-12345_ops-only.jsonl
    │   └── django__django-12345_ops+search.jsonl
    ├── metrics.jsonl
    └── summary.json
```
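
The layout above suggests that the run directory name is an ISO-8601 timestamp with the characters that are awkward in paths replaced by hyphens. The following is a minimal sketch under that assumption; `runDirName` and `eventLogPath` are hypothetical helper names for illustration, not functions confirmed to exist in `src/`.

```typescript
// Assumption: the directory name is Date.toISOString() with ":" and "."
// replaced by "-", which matches the example "2025-01-01T12-00-00-000Z".
function runDirName(now: Date = new Date()): string {
  return now.toISOString().replace(/[:.]/g, "-");
}

// Hypothetical helper: path of the JSONL event log for one instance and
// one agent variant, following the <instance>_<variant>.jsonl pattern.
function eventLogPath(
  runDir: string,
  instanceId: string,
  variant: "ops-only" | "ops+search",
): string {
  return `${runDir}/events/${instanceId}_${variant}.jsonl`;
}
```

For instance, `eventLogPath("runs/" + runDirName(), "django__django-12345", "ops-only")` would yield a path shaped like the first event file in the tree.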
74+
75+
## Metrics
76+
77+
- **Hit@k**: Whether oracle file appears in top-k predictions
78+
- **MRR**: Mean Reciprocal Rank of first oracle file
79+
- **Coverage@k**: Fraction of oracle files in top-k
80+
- **Time-to-first-hit**: Turns/tool calls until first oracle file accessed
81+
- **Embedding latency**: Index build + query times (Agent2 only)
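
The three ranking metrics above can be sketched as small pure functions. This is an illustration of the standard definitions, not the contents of `score.ts`; the function names, the ranked `predicted` list, and the choice to return 0 for an empty oracle set are all assumptions.

```typescript
// Hit@k: 1 if any oracle file appears among the first k predictions, else 0.
function hitAtK(predicted: string[], oracle: string[], k: number): number {
  const oracleSet = new Set(oracle);
  return predicted.slice(0, k).some((file) => oracleSet.has(file)) ? 1 : 0;
}

// Reciprocal rank of the first oracle file in the ranking (0 if none appears).
// MRR is the mean of this value across instances.
function reciprocalRank(predicted: string[], oracle: string[]): number {
  const oracleSet = new Set(oracle);
  const index = predicted.findIndex((file) => oracleSet.has(file));
  return index === -1 ? 0 : 1 / (index + 1);
}

// Coverage@k: fraction of oracle files found among the first k predictions.
// Assumption: an empty oracle set scores 0 rather than being undefined.
function coverageAtK(predicted: string[], oracle: string[], k: number): number {
  if (oracle.length === 0) return 0;
  const topK = new Set(predicted.slice(0, k));
  return oracle.filter((file) => topK.has(file)).length / oracle.length;
}

// Arithmetic mean used for cross-instance aggregation (0 for an empty list).
function mean(values: number[]): number {
  if (values.length === 0) return 0;
  return values.reduce((sum, value) => sum + value, 0) / values.length;
}
```

With predictions `["a.py", "b.py", "c.py"]` and oracle files `["b.py", "d.py"]`, Hit@1 is 0, Hit@2 is 1, the reciprocal rank is 0.5, and Coverage@3 is 0.5.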

## Architecture

```
src/
├── run.ts                              # CLI entrypoint
└── swebench/
    ├── types.ts                        # SWEbenchInstance, metrics types
    ├── dataset.ts                      # HuggingFace dataset loader with caching
    ├── git.ts                          # Bare clone + worktree management
    ├── score.ts                        # Per-instance metric computation
    ├── aggregate.ts                    # Cross-instance aggregation
    ├── run.ts                          # Main evaluation loop
    ├── agent/
    │   ├── prompts.ts                  # Retrieval-only system/user prompts
    │   ├── variants.ts                 # Agent1/Agent2 tool configurations
    │   └── semantic_search_adapter.ts  # Gemini embeddings + MCP server
    └── observe/
        └── instrumentation.ts          # SDK hooks, event writer
```

## How it works

1. Loads SWE-bench Lite dataset (300 instances)
2. For each instance:
   - Creates git worktree at target commit
   - Runs Agent1 (ops-only) with Read/Grep/Glob
   - Builds semantic index using `code-chunk`
   - Runs Agent2 (ops+search) with additional semantic_search tool
   - Computes retrieval metrics against oracle files from patch
3. Aggregates metrics, prints summary, writes results
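
The last sub-step of the per-instance loop relies on "oracle files from patch". One plausible reading is that the oracle set is the list of files touched by the instance's gold patch. The sketch below extracts those paths from the `diff --git` headers of a git-style unified diff; `oracleFilesFromPatch` is a hypothetical name, and the actual extraction in this package may differ (for example in how renames or paths containing spaces are handled).

```typescript
// Collect the files touched by a git-style unified diff, in first-seen order.
// Assumption: each file section starts with "diff --git a/<path> b/<path>".
function oracleFilesFromPatch(patch: string): string[] {
  const files: string[] = [];
  for (const line of patch.split(/\r?\n/)) {
    const match = /^diff --git a\/(.+) b\/(.+)$/.exec(line);
    // Use the "b/" side so a renamed file is reported under its new path.
    if (match && files.indexOf(match[2]) === -1) {
      files.push(match[2]);
    }
  }
  return files;
}
```

The returned list can then be compared against the files an agent ranked or accessed when computing the retrieval metrics.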

## Embedding cache

Semantic search indexes are cached at `~/.cache/swebench-eval/embeddings/` to avoid re-embedding repos. Cache key includes instance ID + embedding provider + dimensions.
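
To illustrate why all three components belong in the key, here is a minimal sketch of a key builder. The separator, the `.json` extension, and the names `embeddingCacheKey` and `embeddingCachePath` are assumptions made for this example; the README only states which fields the key includes, not how they are combined.

```typescript
// The three fields the README says the cache key includes.
interface EmbeddingCacheKeyParts {
  instanceId: string;
  provider: "gemini" | "openai";
  dimensions: number;
}

// Hypothetical key format: fields joined with "_". Changing the provider or
// the dimensions yields a different key, so stale vectors are never reused.
function embeddingCacheKey(parts: EmbeddingCacheKeyParts): string {
  return `${parts.instanceId}_${parts.provider}_${parts.dimensions}`;
}

// Hypothetical on-disk location; cacheRoot would be the expanded form of
// ~/.cache/swebench-eval/embeddings, and the file extension is a guess.
function embeddingCachePath(cacheRoot: string, parts: EmbeddingCacheKeyParts): string {
  return `${cacheRoot}/${embeddingCacheKey(parts)}.json`;
}
```

Under this scheme, running the same instance with `--embedding-dimensions 768` and then with the default 1536 would produce two separate cache entries.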

packages/eval/package.json

Lines changed: 10 additions & 3 deletions
@@ -2,17 +2,24 @@
   "name": "@supermemory/eval",
   "version": "0.1.0",
   "private": true,
-  "description": "Evaluation harness for code-chunk",
+  "description": "SWE-bench Lite retrieval-only evaluation harness for code-chunk",
   "type": "module",
   "scripts": {
     "start": "bun run src/run.ts",
+    "eval": "bun run src/run.ts",
+    "eval:dev": "bun run src/run.ts --split dev",
+    "eval:quick": "bun run src/run.ts --max-instances 5",
     "type-check": "tsc --noEmit"
   },
   "dependencies": {
+    "@anthropic-ai/claude-agent-sdk": "^0.1.75",
+    "@anthropic-ai/sdk": "^0.71.2",
     "code-chunk": "workspace:*",
-    "openai": "^4.0.0"
+    "dotenv": "^16.4.0",
+    "zod": "^3.24.0"
   },
   "devDependencies": {
-    "@types/bun": "^1.3.4"
+    "@types/bun": "^1.3.4",
+    "typescript": "^5.0.0"
   }
 }

packages/eval/pyproject.toml

Lines changed: 0 additions & 9 deletions
This file was deleted.

packages/eval/src/chunkers/ast.ts

Lines changed: 0 additions & 40 deletions
This file was deleted.

packages/eval/src/chunkers/chonkie.ts

Lines changed: 0 additions & 82 deletions
This file was deleted.

packages/eval/src/chunkers/chonkie_chunk.py

Lines changed: 0 additions & 92 deletions
This file was deleted.
