Experimental setup for training code generation models on the R2E-Gym dataset using:
- Tinker API for RL model training
- Agent Sandbox for safe code execution
- R2E-Gym dataset (4.5K real-world GitHub issues)

The goal is to reproduce the DeepSWE experiments (42.2% Pass@1 on SWE-Bench-Verified).
```bash
git clone https://github.com/novitalabs/rft-tinker.git
cd rft-tinker
python3 -m venv venv
source venv/bin/activate
pip install datasets huggingface-hub novita-sandbox tinker torch transformers
```

Copy the example environment file:

```bash
cp .env.example .env.local
```

Edit `.env.local` with your API keys:
```bash
# Agent Sandbox API Key (get from https://novita.ai)
NOVITA_API_KEY=your_novita_api_key_here

# Tinker API Token (get from Tinker platform)
TINKER_API_TOKEN=your_tinker_api_token_here

# Template IDs
NOVITA_TEMPLATE_BASE=vn9xnp3cm92x6rmqlgwc
```

Warning: Never commit `.env.local` with real credentials!
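The `KEY=VALUE` format above can be parsed without extra dependencies; a minimal sketch (`load_env_file` is a hypothetical helper, not part of the repo — libraries like python-dotenv do the same thing):

```python
def load_env_file(path: str) -> dict:
    """Parse simple KEY=VALUE lines, skipping comments and blanks."""
    env = {}
    with open(path) as f:
        for line in f:
            line = line.strip()
            # Ignore blank lines, comments, and anything without '='
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            env[key.strip()] = value.strip()
    return env
```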
Test Agent Sandbox connectivity:

```bash
python -m tests.integration.test_novita_basic
```

Test the R2E-Gym workflow:

```bash
python -m tests.integration.test_r2e_gym_workflow
```

Download the R2E-Gym sample (50 instances):

```bash
python scripts/prepare_data/prepare_r2e_sample.py
```

Test dataset loading:

```bash
python -m tests.unit.test_dataset_loading
```

```
rft-tinker/
├── src/                      # Core source code
│   ├── datasets/             # Dataset utilities and repo mapping
│   ├── environments/         # Sandbox environment wrappers
│   ├── rollout/              # Multi-turn rollout pipeline
│   └── utils/                # Utility functions
├── tests/                    # All test files
│   ├── integration/          # Integration tests
│   ├── rollout/              # Rollout pipeline tests
│   └── unit/                 # Unit tests
├── scripts/                  # Utility scripts
├── templates/                # Agent Sandbox Dockerfile templates
├── docs/                     # Documentation
├── data/                     # Datasets (gitignored)
├── outputs/                  # Generated outputs (gitignored)
├── tinker_r2e_training.py    # RL training script
├── tinker_sft_training.py    # SFT training script
└── .env.example              # API keys template
```
```bash
python tinker_r2e_training.py
```

Configuration (in script):
| Parameter | Value | Purpose |
|---|---|---|
| GROUP_SIZE | 10 | Parallel sandboxes per problem |
| MAX_STEPS | 40 | Max actions per episode |
| SAVE_INTERVAL | 2 | Checkpoint frequency (batches) |
| TEMPERATURE | 1.0 | Sampling temperature |
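These parameters presumably live as constants inside `tinker_r2e_training.py`; a hypothetical sketch of how they might be declared and used (names follow the table, not verified against the script):

```python
# Hypothetical training constants mirroring the configuration table
GROUP_SIZE = 10      # parallel sandboxes rolled out per problem
MAX_STEPS = 40       # max agent actions per episode
SAVE_INTERVAL = 2    # checkpoint every N batches
TEMPERATURE = 1.0    # sampling temperature for rollouts

def should_checkpoint(batch_idx: int) -> bool:
    """Save a checkpoint on every SAVE_INTERVAL-th completed batch."""
    return batch_idx > 0 and batch_idx % SAVE_INTERVAL == 0
```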
```bash
python tinker_sft_training.py
```

Converts gold patches to edit trajectories for a supervised fine-tuning warm start.
```bash
python validate_sft_weights.py
```

Validates SFT checkpoint weights before RL training.
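One typical check of this kind is verifying that no parameter went NaN or Inf during SFT; a minimal sketch (hypothetical helper names — the actual script's checks may differ):

```python
import math

def all_finite(tensor) -> bool:
    """Recursively check a nested list of numbers for NaN/Inf."""
    if isinstance(tensor, (int, float)):
        return math.isfinite(tensor)
    return all(all_finite(x) for x in tensor)

def validate_state(state: dict) -> bool:
    """A checkpoint is usable only if every parameter is finite."""
    return all(all_finite(t) for t in state.values())
```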
Base template (for most Python repositories):
- Python 3.8.10, pytest 8.3.5, numpy 1.24.4
- Core: scipy, sympy, requests, pillow

Scientific template (for scientific computing):
- Adds: pandas, scikit-learn, matplotlib, seaborn, h5py

Image template (for image-heavy repositories):
- Pillow 10.4.0 with full image processing
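A repo could be routed to a template based on its declared dependencies; a hypothetical sketch (`pick_template` and the routing rules are illustrative, not part of the repo):

```python
# Hypothetical template router based on a repo's declared dependencies.
SCIENTIFIC_DEPS = {"pandas", "scikit-learn", "matplotlib", "seaborn", "h5py"}

def pick_template(deps: set) -> str:
    """Choose the smallest sandbox template that covers the repo's needs."""
    if deps & SCIENTIFIC_DEPS:
        return "scientific"   # pandas/sklearn stack required
    if "pillow" in deps:
        return "image"        # full image-processing build
    return "base"             # numpy/scipy core covers most repos
```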
```python
from novita_sandbox.core import Sandbox

# Create sandbox
sandbox = Sandbox.create(
    api_key=api_key,
    template=template_id,
    timeout=3600
)

# Run commands (synchronous - no await)
result = sandbox.commands.run("echo 'Hello World'")
print(result.stdout)
print(result.exit_code)

# Write files
sandbox.files.write("/path/to/file.py", content.encode())
```

Standard evaluation workflow:
```python
# 1. Clone repo at base commit
sandbox.commands.run(f"git clone {repo_url} /tmp/testbed")
sandbox.commands.run(f"cd /tmp/testbed && git checkout {base_commit}")

# 2. Apply model-generated patch
sandbox.files.write("/tmp/patch.diff", patch_content)
sandbox.commands.run("cd /tmp/testbed && git apply /tmp/patch.diff")

# 3. Run tests that should now pass (FAIL_TO_PASS)
result = sandbox.commands.run(f"cd /tmp/testbed && pytest {fail_tests}")

# 4. Run tests that should remain passing (PASS_TO_PASS)
result = sandbox.commands.run(f"cd /tmp/testbed && pytest {pass_tests}")

# 5. Compute reward
reward = 1.0 if all_tests_passed else 0.0
```

Each R2E-Gym instance contains:
```
{
  "instance_id": "orange3__2d9617bd",
  "repo": "orange3",
  "commit_hash": "2d9617bd0cb1f0ba61771258410ab8fae8e7e24d",
  "problem_statement": "[ISSUE] ...",
  "modified_files": [...],
  "test_files": ["test_1.py"],
  "test_codes": ["..."],
  "old_commit_exit_code": 1,   # Tests fail before fix
  "new_commit_exit_code": 0,   # Tests pass after fix
  "gold_patch": {...}
}
```

The rollout generator provides 8 tools for the model:
- `bash` - Execute shell commands
- `read` - Read file content (with line-range support)
- `search` - Pattern search (`grep -rn`)
- `find_file` - Locate files by pattern
- `list_dir` - Directory listing (`ls -lah`)
- `edit` - Line-based file editing
- `run_test` - Execute test commands
- `submit` - Submit solution
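Several of these tools reduce to shell commands run inside the sandbox; a minimal dispatch sketch covering a few of them (hypothetical names — the real pipeline lives in `src/rollout/`):

```python
# Hypothetical sketch: map a model-emitted tool call to a sandbox command.
def dispatch(sandbox, tool: str, args: dict) -> str:
    """Translate a tool call into a shell command and return its stdout."""
    shell = {
        "bash": lambda a: a["command"],
        "search": lambda a: f"grep -rn '{a['pattern']}' {a.get('path', '.')}",
        "list_dir": lambda a: f"ls -lah {a.get('path', '.')}",
        "run_test": lambda a: a["test_command"],
    }
    if tool == "submit":
        return "SUBMIT"   # terminate the episode; trigger evaluation
    result = sandbox.commands.run(shell[tool](args))
    return result.stdout
```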
Based on actual training measurements:
| Phase | Duration | % of Batch |
|---|---|---|
| Sandbox creation (10×) | ~21s | 1.2% |
| Repository setup (10×) | ~2 min | 6.7% |
| Rollout execution | ~25-28 min | ~90% |
| Training update | ~30s | 1.7% |
| Sandbox cleanup | ~15s | 0.8% |
Key metrics:
- Sandbox hot-start latency: 60-100ms/task
- Concurrent sandboxes: Up to 150 per account
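A quick back-of-the-envelope check of the phase table (using the midpoint of the 25-28 min rollout range) confirms that one batch takes roughly half an hour and rollout execution dominates:

```python
# Approximate per-batch wall-clock time from the measured phases (seconds)
phases = {
    "sandbox_creation": 21,
    "repo_setup": 2 * 60,
    "rollout": 26.5 * 60,   # midpoint of 25-28 min
    "training_update": 30,
    "sandbox_cleanup": 15,
}
total_s = sum(phases.values())
print(f"~{total_s / 60:.0f} min per batch")   # rollout alone is ~90% of this
```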
| Aspect | DeepSWE | This Setup |
|---|---|---|
| Model | Qwen3-32B | Qwen3-30B-A3B |
| Hardware | 64 H100 | Tinker |
| Dataset | R2E-Gym (4.5K) | Same ✅ |
| Sandbox | Kubernetes + Docker | Agent Sandbox ✅ |
| Pass@1 | 42.2% (SOTA) | TBD |
- Technical Blog - Detailed guide on RL training with Agent Sandbox
- Progress Report - Development progress notes
- DeepSWE Paper: https://www.together.ai/blog/deepswe
- R2E-Gym Dataset: https://huggingface.co/datasets/R2E-Gym/R2E-Gym-Subset
- Novita AI Platform: https://novita.ai
MIT License