Experimental setup for training code generation models on the R2E-Gym dataset using:
- Tinker API for RL model training
- Agent Sandbox for safe code execution
- R2E-Gym dataset (4.5K real-world GitHub issues)

The goal is to reproduce the DeepSWE experiments (42.2% Pass@1 on SWE-Bench-Verified).
```bash
git clone https://github.com/novitalabs/rft-tinker.git
cd rft-tinker
python3 -m venv venv
source venv/bin/activate
pip install datasets huggingface-hub novita-sandbox tinker torch transformers
```

Copy the example environment file:

```bash
cp .env.example .env.local
```

Edit `.env.local` with your API keys:
```bash
# Agent Sandbox API Key (get from https://novita.ai)
NOVITA_API_KEY=your_novita_api_key_here

# Tinker API Token (get from Tinker platform)
TINKER_API_TOKEN=your_tinker_api_token_here

# Template IDs
NOVITA_TEMPLATE_BASE=vn9xnp3cm92x6rmqlgwc
```

Warning: Never commit `.env.local` with real credentials!
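The `KEY=VALUE` format above can be parsed without extra dependencies; a minimal sketch (`load_env_file` is a hypothetical helper, not part of the repo — libraries like python-dotenv do the same thing):

```python
def load_env_file(path: str) -> dict:
    """Parse simple KEY=VALUE lines, skipping comments and blanks."""
    env = {}
    with open(path) as f:
        for line in f:
            line = line.strip()
            # Ignore blank lines, comments, and anything without '='
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            env[key.strip()] = value.strip()
    return env
```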
Test Agent Sandbox connectivity:

```bash
python -m tests.integration.test_novita_basic
```

Test the R2E-Gym workflow:

```bash
python -m tests.integration.test_r2e_gym_workflow
```

Download the R2E-Gym sample (50 instances):

```bash
python scripts/prepare_data/prepare_r2e_sample.py
```

Test dataset loading:

```bash
python -m tests.unit.test_dataset_loading
```

```
rft-tinker/
├── src/                      # Core source code
│   ├── datasets/             # Dataset utilities and repo mapping
│   ├── environments/         # Sandbox environment wrappers
│   ├── rollout/              # Multi-turn rollout pipeline
│   └── utils/                # Utility functions
├── tests/                    # All test files
│   ├── integration/          # Integration tests
│   ├── rollout/              # Rollout pipeline tests
│   └── unit/                 # Unit tests
├── scripts/                  # Utility scripts
├── templates/                # Agent Sandbox Dockerfile templates
├── docs/                     # Documentation
├── data/                     # Datasets (gitignored)
├── outputs/                  # Generated outputs (gitignored)
├── tinker_r2e_training.py    # RL training script
├── tinker_sft_training.py    # SFT training script
└── .env.example              # API keys template
```
```bash
python tinker_r2e_training.py
```

Configuration (in script):
| Parameter | Value | Purpose |
|---|---|---|
| GROUP_SIZE | 10 | Parallel sandboxes per problem |
| MAX_STEPS | 40 | Max actions per episode |
| SAVE_INTERVAL | 2 | Checkpoint frequency (batches) |
| TEMPERATURE | 1.0 | Sampling temperature |
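These parameters presumably live as constants inside `tinker_r2e_training.py`; a hypothetical sketch of how they might be declared and used (names follow the table, not verified against the script):

```python
# Hypothetical training constants mirroring the configuration table
GROUP_SIZE = 10      # parallel sandboxes rolled out per problem
MAX_STEPS = 40       # max agent actions per episode
SAVE_INTERVAL = 2    # checkpoint every N batches
TEMPERATURE = 1.0    # sampling temperature for rollouts

def should_checkpoint(batch_idx: int) -> bool:
    """Save a checkpoint on every SAVE_INTERVAL-th completed batch."""
    return batch_idx > 0 and batch_idx % SAVE_INTERVAL == 0
```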
```bash
python tinker_sft_training.py
```

Converts gold patches to edit trajectories for a supervised fine-tuning warm start.
```bash
python validate_sft_weights.py
```

Validates SFT checkpoint weights before RL training.
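One typical check of this kind is verifying that no parameter went NaN or Inf during SFT; a minimal sketch (hypothetical helper names — the actual script's checks may differ):

```python
import math

def all_finite(tensor) -> bool:
    """Recursively check a nested list of numbers for NaN/Inf."""
    if isinstance(tensor, (int, float)):
        return math.isfinite(tensor)
    return all(all_finite(x) for x in tensor)

def validate_state(state: dict) -> bool:
    """A checkpoint is usable only if every parameter is finite."""
    return all(all_finite(t) for t in state.values())
```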
Base template (for most Python repositories):
- Python 3.8.10, pytest 8.3.5, numpy 1.24.4
- Core: scipy, sympy, requests, pillow

Scientific template (for scientific computing):
- Adds: pandas, scikit-learn, matplotlib, seaborn, h5py

Image template (for image-heavy repositories):
- Pillow 10.4.0 with full image processing
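A repo could be routed to a template based on its declared dependencies; a hypothetical sketch (`pick_template` and the routing rules are illustrative, not part of the repo):

```python
# Hypothetical template router based on a repo's declared dependencies.
SCIENTIFIC_DEPS = {"pandas", "scikit-learn", "matplotlib", "seaborn", "h5py"}

def pick_template(deps: set) -> str:
    """Choose the smallest sandbox template that covers the repo's needs."""
    if deps & SCIENTIFIC_DEPS:
        return "scientific"   # pandas/sklearn stack required
    if "pillow" in deps:
        return "image"        # full image-processing build
    return "base"             # numpy/scipy core covers most repos
```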
```python
from novita_sandbox.core import Sandbox

# Create sandbox
sandbox = Sandbox.create(
    api_key=api_key,
    template=template_id,
    timeout=3600
)

# Run commands (synchronous - no await)
result = sandbox.commands.run("echo 'Hello World'")
print(result.stdout)
print(result.exit_code)

# Write files
sandbox.files.write("/path/to/file.py", content.encode())
```

Standard evaluation workflow:
```python
# 1. Clone repo at base commit
sandbox.commands.run(f"git clone {repo_url} /tmp/testbed")
sandbox.commands.run(f"cd /tmp/testbed && git checkout {base_commit}")

# 2. Apply model-generated patch
sandbox.files.write("/tmp/patch.diff", patch_content)
sandbox.commands.run("cd /tmp/testbed && git apply /tmp/patch.diff")

# 3. Run tests that should now pass (FAIL_TO_PASS)
result = sandbox.commands.run(f"cd /tmp/testbed && pytest {fail_tests}")

# 4. Run tests that should remain passing (PASS_TO_PASS)
result = sandbox.commands.run(f"cd /tmp/testbed && pytest {pass_tests}")

# 5. Compute reward
reward = 1.0 if all_tests_passed else 0.0
```

Each R2E-Gym instance contains:
```
{
  "instance_id": "orange3__2d9617bd",
  "repo": "orange3",
  "commit_hash": "2d9617bd0cb1f0ba61771258410ab8fae8e7e24d",
  "problem_statement": "[ISSUE] ...",
  "modified_files": [...],
  "test_files": ["test_1.py"],
  "test_codes": ["..."],
  "old_commit_exit_code": 1,   # Tests fail before fix
  "new_commit_exit_code": 0,   # Tests pass after fix
  "gold_patch": {...}
}
```

The rollout generator provides 8 tools for the model:
- `bash` - Execute shell commands
- `read` - Read file content (with line-range support)
- `search` - Pattern search (`grep -rn`)
- `find_file` - Locate files by pattern
- `list_dir` - Directory listing (`ls -lah`)
- `edit` - Line-based file editing
- `run_test` - Execute test commands
- `submit` - Submit solution
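Several of these tools reduce to shell commands run inside the sandbox; a minimal dispatch sketch covering a few of them (hypothetical names — the real pipeline lives in `src/rollout/`):

```python
# Hypothetical sketch: map a model-emitted tool call to a sandbox command.
def dispatch(sandbox, tool: str, args: dict) -> str:
    """Translate a tool call into a shell command and return its stdout."""
    shell = {
        "bash": lambda a: a["command"],
        "search": lambda a: f"grep -rn '{a['pattern']}' {a.get('path', '.')}",
        "list_dir": lambda a: f"ls -lah {a.get('path', '.')}",
        "run_test": lambda a: a["test_command"],
    }
    if tool == "submit":
        return "SUBMIT"   # terminate the episode; trigger evaluation
    result = sandbox.commands.run(shell[tool](args))
    return result.stdout
```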
Based on actual training measurements:
| Phase | Duration | % of Batch |
|---|---|---|
| Sandbox creation (10×) | ~21s | 1.2% |
| Repository setup (10×) | ~2 min | 6.7% |
| Rollout execution | ~25-28 min | ~90% |
| Training update | ~30s | 1.7% |
| Sandbox cleanup | ~15s | 0.8% |
Key metrics:
- Sandbox hot-start latency: 60-100ms/task
- Concurrent sandboxes: Up to 150 per account
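A quick back-of-the-envelope check of the phase table (using the midpoint of the 25-28 min rollout range) confirms that one batch takes roughly half an hour and rollout execution dominates:

```python
# Approximate per-batch wall-clock time from the measured phases (seconds)
phases = {
    "sandbox_creation": 21,
    "repo_setup": 2 * 60,
    "rollout": 26.5 * 60,   # midpoint of 25-28 min
    "training_update": 30,
    "sandbox_cleanup": 15,
}
total_s = sum(phases.values())
print(f"~{total_s / 60:.0f} min per batch")   # rollout alone is ~90% of this
```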
| Aspect | DeepSWE | This Setup |
|---|---|---|
| Model | Qwen3-32B | Qwen3-30B-A3B |
| Hardware | 64 H100 | Tinker |
| Dataset | R2E-Gym (4.5K) | Same ✅ |
| Sandbox | Kubernetes + Docker | Agent Sandbox ✅ |
| Pass@1 | 42.2% (SOTA) | TBD |
- Technical Blog - Detailed guide on RL training with Agent Sandbox
- Progress Report - Development progress notes
- DeepSWE Paper: https://www.together.ai/blog/deepswe
- R2E-Gym Dataset: https://huggingface.co/datasets/R2E-Gym/R2E-Gym-Subset
- Novita AI Platform: https://novita.ai
MIT License