This repository contains benchmark results for various OpenHands agents and LLM configurations.
Results are organized in the results/ directory with the following structure:
```
results/
├── {version}_{model_name}/
│   ├── metadata.json
│   └── scores.json
```
Each agent directory follows the format: {version}_{model_name}/
- `{version}`: Agent version, a semantic version prefixed with `v` (e.g., `v1.8.3`)
- `{model_name}`: LLM model name (e.g., `claude-sonnet-4-5`, `GPT-5.2`)
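As a quick illustration, the snippet below splits a result directory name into its two components. The `parse_result_dir` helper is hypothetical and only meant to show the convention; it is not part of the repository tooling.

```python
from pathlib import Path

def parse_result_dir(path: str) -> tuple[str, str]:
    # Hypothetical helper: split "v1.8.3_claude-sonnet-4-5" at the first
    # underscore into (version, model_name).
    name = Path(path).name
    version, _, model_name = name.partition("_")
    if not version.startswith("v") or not model_name:
        raise ValueError(f"unexpected results directory name: {name}")
    return version, model_name

print(parse_result_dir("results/v1.8.3_claude-sonnet-4-5"))
# ('v1.8.3', 'claude-sonnet-4-5')
```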
Each `metadata.json` file contains agent metadata and configuration:
```json
{
  "agent_name": "OpenHands CodeAct v2.0",
  "agent_version": "v1.8.3",
  "model": "claude-sonnet-4-5",
  "openness": "closed_api_available",
  "tool_usage": "standard",
  "submission_time": "2025-11-24T19:56:00.092895",
  "directory_name": "v1.8.3_claude-sonnet-4-5"
}
```

Fields:
- `agent_name`: Display name of the agent
- `agent_version`: Semantic version number (e.g., "1.0.0", "1.0.2")
- `model`: LLM model used
- `openness`: Model availability type
  - `closed_api_available`: Commercial API-based models
  - `open_api_available`: Open-source models with API access
  - `open_weights_available`: Open-weights models that can be self-hosted
- `tool_usage`: Agent tooling type
  - `standard`: Standard tool usage
  - `custom_interface`: Custom tool interface
- `submission_time`: ISO 8601 timestamp
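As a rough sketch, a `metadata.json` file can be checked against the field list above. The file path and the allowed-value sets below are assumptions drawn from this README, not an official schema.

```python
import json

# Allowed values assumed from the field descriptions above.
ALLOWED_OPENNESS = {"closed_api_available", "open_api_available", "open_weights_available"}
ALLOWED_TOOL_USAGE = {"standard", "custom_interface"}
REQUIRED_FIELDS = {"agent_name", "agent_version", "model", "openness", "tool_usage", "submission_time"}

with open("results/v1.8.3_claude-sonnet-4-5/metadata.json") as f:
    metadata = json.load(f)

missing = REQUIRED_FIELDS - metadata.keys()
assert not missing, f"missing fields: {missing}"
assert metadata["openness"] in ALLOWED_OPENNESS, metadata["openness"]
assert metadata["tool_usage"] in ALLOWED_TOOL_USAGE, metadata["tool_usage"]
print(f"{metadata['agent_name']} ({metadata['model']}) looks valid")
```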
Each `scores.json` file contains benchmark scores and performance metrics:
```json
[
  {
    "benchmark": "swe-bench",
    "score": 45.1,
    "metric": "resolve_rate",
    "total_cost": 32.55,
    "average_runtime": 3600,
    "tags": ["bug_fixing"]
  },
  ...
]
```

Fields:
- `benchmark`: Benchmark identifier (e.g., "swe-bench", "commit0")
- `score`: Primary metric score (percentage or numeric value)
- `metric`: Type of metric (e.g., "resolve_rate", "success_rate")
- `total_cost`: Total API cost in USD
- `average_runtime`: Average runtime per instance in seconds (optional)
- `tags`: Category tags for grouping (e.g., ["bug_fixing"], ["app_creation"])
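For example, a short Python sketch (assuming the example directory above) that reads a `scores.json` file and prints one summary line per benchmark:

```python
import json

with open("results/v1.8.3_claude-sonnet-4-5/scores.json") as f:
    scores = json.load(f)

for entry in scores:
    line = f"{entry['benchmark']}: {entry['score']} ({entry['metric']}), ${entry['total_cost']:.2f} total"
    runtime = entry.get("average_runtime")  # optional field
    if runtime is not None:
        line += f", {runtime}s avg runtime"
    print(line)
```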
The 1.0.0-dev1/ directory contains the original benchmark-centric JSONL files:
- swe-bench.jsonl
- swe-bench-multimodal.jsonl
- commit0.jsonl
- swt-bench.jsonl
- gaia.jsonl
This format is maintained for backward compatibility.
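Each line of a legacy JSONL file is a standalone JSON object, so the files can be read as in this minimal sketch (the path is an assumption):

```python
import json

# One JSON record per line in the legacy benchmark-centric format.
with open("1.0.0-dev1/swe-bench.jsonl") as f:
    records = [json.loads(line) for line in f if line.strip()]

print(f"{len(records)} records in swe-bench.jsonl")
```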
- SWE-Bench: Resolving GitHub issues from real Python repositories
- SWE-Bench-Multimodal: Similar to SWE-Bench with multimodal inputs
- Commit0: Building applications from scratch based on specifications
- SWT-Bench: Generating comprehensive test suites
- GAIA: General AI assistant tasks requiring web search and reasoning
Results are grouped into 4 main categories on the leaderboard:
- Bug Fixing: SWE-Bench, SWE-Bench-Multimodal
- App Creation: Commit0
- Test Generation: SWT-Bench
- Information Gathering: GAIA
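The grouping can be expressed as a simple lookup table; the dictionary below is a sketch derived from the list above, not code taken from the leaderboard itself.

```python
# Benchmark-to-category mapping, assumed from the grouping listed above.
BENCHMARK_CATEGORIES = {
    "swe-bench": "Bug Fixing",
    "swe-bench-multimodal": "Bug Fixing",
    "commit0": "App Creation",
    "swt-bench": "Test Generation",
    "gaia": "Information Gathering",
}

def category_for(benchmark: str) -> str:
    return BENCHMARK_CATEGORIES.get(benchmark, "Uncategorized")

print(category_for("swe-bench"))  # Bug Fixing
```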
To add new benchmark results:
- Create a directory following the naming convention: `results/{version}_{model_name}/`
- Add `metadata.json` with agent configuration
- Add `scores.json` with benchmark results
- Commit and push to the repository
Example:

```bash
# Create directory
mkdir -p results/v1.8.3_claude-sonnet-4-5/
# Add metadata
cat > results/v1.8.3_claude-sonnet-4-5/metadata.json << 'EOF'
{
  "agent_name": "OpenHands CodeAct v2.0",
  "agent_version": "v1.8.3",
  "model": "claude-sonnet-4-5",
  "openness": "closed_api_available",
  "tool_usage": "standard",
  "submission_time": "2025-11-24T19:56:00.092895",
  "directory_name": "v1.8.3_claude-sonnet-4-5"
}
EOF
# Add scores
cat > results/v1.8.3_claude-sonnet-4-5/scores.json << 'EOF'
[
  {
    "benchmark": "swe-bench",
    "score": 45.1,
    "metric": "resolve_rate",
    "total_cost": 32.55,
    "average_runtime": 3600,
    "tags": ["bug_fixing"]
  },
  ...
]
EOF
# Commit and push
git add results/v1.8.3_claude-sonnet-4-5/
git commit -m "Add results for OpenHands CodeAct v1.8.3 with Claude 4.5 Sonnet"
git push origin main
```

View the live leaderboard at: https://huggingface.co/spaces/OpenHands/openhands-index
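To sanity-check a submission directory locally before it reaches the leaderboard, a script along these lines can help. It is a hedged sketch based on the formats described above, not an official validation tool.

```python
import json
from pathlib import Path

def check_submission(directory: str) -> None:
    # Hypothetical pre-submission check: both files exist, parse as JSON,
    # and the metadata's directory_name matches the actual directory.
    path = Path(directory)
    metadata = json.loads((path / "metadata.json").read_text())
    scores = json.loads((path / "scores.json").read_text())
    assert metadata["directory_name"] == path.name, "directory_name mismatch"
    assert isinstance(scores, list) and scores, "scores.json should be a non-empty list"
    print(f"{path.name}: {len(scores)} benchmark entries OK")

check_submission("results/v1.8.3_claude-sonnet-4-5")
```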
MIT License - See repository for details.