This repository contains benchmark results for various OpenHands agents and LLM configurations.
Results are organized in the results/ directory with the following structure:
```
results/
├── {version}_{model_name}/
│   ├── metadata.json
│   └── scores.json
```
Each agent directory follows the format: {version}_{model_name}/
- `{version}`: Agent version, a semantic version prefixed with `v` (e.g., `v1.8.3`)
- `{model_name}`: LLM model name (e.g., `claude-sonnet-4-5`, `GPT-5.2`)
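As a quick illustration, the snippet below splits a result directory name into its two components. The `parse_result_dir` helper is hypothetical and only meant to show the convention; it is not part of the repository tooling.

```python
from pathlib import Path

def parse_result_dir(path: str) -> tuple[str, str]:
    # Hypothetical helper: split "v1.8.3_claude-sonnet-4-5" at the first
    # underscore into (version, model_name).
    name = Path(path).name
    version, _, model_name = name.partition("_")
    if not version.startswith("v") or not model_name:
        raise ValueError(f"unexpected results directory name: {name}")
    return version, model_name

print(parse_result_dir("results/v1.8.3_claude-sonnet-4-5"))
# ('v1.8.3', 'claude-sonnet-4-5')
```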
Each `metadata.json` file contains agent metadata and configuration:
```json
{
  "agent_name": "OpenHands CodeAct v2.0",
  "agent_version": "v1.8.3",
  "model": "claude-sonnet-4-5",
  "openness": "closed_api_available",
  "tool_usage": "standard",
  "submission_time": "2025-11-24T19:56:00.092895",
  "directory_name": "v1.8.3_claude-sonnet-4-5"
}
```

Fields:
- `agent_name`: Display name of the agent
- `agent_version`: Semantic version number (e.g., "1.0.0", "1.0.2")
- `model`: LLM model used
- `openness`: Model availability type
  - `closed_api_available`: Commercial API-based models
  - `open_api_available`: Open-source models with API access
  - `open_weights_available`: Open-weights models that can be self-hosted
- `tool_usage`: Agent tooling type
  - `standard`: Standard tool usage
  - `custom_interface`: Custom tool interface
- `submission_time`: ISO 8601 timestamp
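As a rough sketch, a `metadata.json` file can be checked against the field list above. The file path and the allowed-value sets below are assumptions drawn from this README, not an official schema.

```python
import json

# Allowed values assumed from the field descriptions above.
ALLOWED_OPENNESS = {"closed_api_available", "open_api_available", "open_weights_available"}
ALLOWED_TOOL_USAGE = {"standard", "custom_interface"}
REQUIRED_FIELDS = {"agent_name", "agent_version", "model", "openness", "tool_usage", "submission_time"}

with open("results/v1.8.3_claude-sonnet-4-5/metadata.json") as f:
    metadata = json.load(f)

missing = REQUIRED_FIELDS - metadata.keys()
assert not missing, f"missing fields: {missing}"
assert metadata["openness"] in ALLOWED_OPENNESS, metadata["openness"]
assert metadata["tool_usage"] in ALLOWED_TOOL_USAGE, metadata["tool_usage"]
print(f"{metadata['agent_name']} ({metadata['model']}) looks valid")
```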
Each `scores.json` file contains benchmark scores and performance metrics:
```json
[
  {
    "benchmark": "swe-bench",
    "score": 45.1,
    "metric": "resolve_rate",
    "total_cost": 32.55,
    "average_runtime": 3600,
    "tags": ["bug_fixing"]
  },
  ...
]
```

Fields:
- `benchmark`: Benchmark identifier (e.g., "swe-bench", "commit0")
- `score`: Primary metric score (percentage or numeric value)
- `metric`: Type of metric (e.g., "resolve_rate", "success_rate")
- `total_cost`: Total API cost in USD
- `average_runtime`: Average runtime per instance in seconds (optional)
- `tags`: Category tags for grouping (e.g., ["bug_fixing"], ["app_creation"])
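For example, a short Python sketch (assuming the example directory above) that reads a `scores.json` file and prints one summary line per benchmark:

```python
import json

with open("results/v1.8.3_claude-sonnet-4-5/scores.json") as f:
    scores = json.load(f)

for entry in scores:
    line = f"{entry['benchmark']}: {entry['score']} ({entry['metric']}), ${entry['total_cost']:.2f} total"
    runtime = entry.get("average_runtime")  # optional field
    if runtime is not None:
        line += f", {runtime}s avg runtime"
    print(line)
```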
The 1.0.0-dev1/ directory contains the original benchmark-centric JSONL files:
- swe-bench.jsonl
- swe-bench-multimodal.jsonl
- commit0.jsonl
- swt-bench.jsonl
- gaia.jsonl
This format is maintained for backward compatibility.
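Each line of a legacy JSONL file is a standalone JSON object, so the files can be read as in this minimal sketch (the path is an assumption):

```python
import json

# One JSON record per line in the legacy benchmark-centric format.
with open("1.0.0-dev1/swe-bench.jsonl") as f:
    records = [json.loads(line) for line in f if line.strip()]

print(f"{len(records)} records in swe-bench.jsonl")
```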
- SWE-Bench: Resolving GitHub issues from real Python repositories
- SWE-Bench-Multimodal: Similar to SWE-Bench with multimodal inputs
- Commit0: Building applications from scratch based on specifications
- SWT-Bench: Generating comprehensive test suites
- GAIA: General AI assistant tasks requiring web search and reasoning
Results are grouped into 4 main categories on the leaderboard:
- Bug Fixing: SWE-Bench, SWE-Bench-Multimodal
- App Creation: Commit0
- Test Generation: SWT-Bench
- Information Gathering: GAIA
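The grouping can be expressed as a simple lookup table; the dictionary below is a sketch derived from the list above, not code taken from the leaderboard itself.

```python
# Benchmark-to-category mapping, assumed from the grouping listed above.
BENCHMARK_CATEGORIES = {
    "swe-bench": "Bug Fixing",
    "swe-bench-multimodal": "Bug Fixing",
    "commit0": "App Creation",
    "swt-bench": "Test Generation",
    "gaia": "Information Gathering",
}

def category_for(benchmark: str) -> str:
    return BENCHMARK_CATEGORIES.get(benchmark, "Uncategorized")

print(category_for("swe-bench"))  # Bug Fixing
```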
To add new benchmark results:
- Create a directory following the naming convention: `results/{version}_{model_name}/`
- Add `metadata.json` with agent configuration
- Add `scores.json` with benchmark results
- Commit and push to the repository
Example:

```bash
# Create directory
mkdir -p results/v1.8.3_claude-sonnet-4-5/
# Add metadata
cat > results/v1.8.3_claude-sonnet-4-5/metadata.json << 'EOF'
{
  "agent_name": "OpenHands CodeAct v2.0",
  "agent_version": "v1.8.3",
  "model": "claude-sonnet-4-5",
  "openness": "closed_api_available",
  "tool_usage": "standard",
  "submission_time": "2025-11-24T19:56:00.092895",
  "directory_name": "v1.8.3_claude-sonnet-4-5"
}
EOF
# Add scores
cat > results/v1.8.3_claude-sonnet-4-5/scores.json << 'EOF'
[
  {
    "benchmark": "swe-bench",
    "score": 45.1,
    "metric": "resolve_rate",
    "total_cost": 32.55,
    "average_runtime": 3600,
    "tags": ["bug_fixing"]
  },
  ...
]
EOF
# Commit and push
git add results/v1.8.3_claude-sonnet-4-5/
git commit -m "Add results for OpenHands CodeAct v1.8.3 with Claude 4.5 Sonnet"
git push origin main
```

View the live leaderboard at: https://huggingface.co/spaces/OpenHands/openhands-index
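To sanity-check a submission directory locally before it reaches the leaderboard, a script along these lines can help. It is a hedged sketch based on the formats described above, not an official validation tool.

```python
import json
from pathlib import Path

def check_submission(directory: str) -> None:
    # Hypothetical pre-submission check: both files exist, parse as JSON,
    # and the metadata's directory_name matches the actual directory.
    path = Path(directory)
    metadata = json.loads((path / "metadata.json").read_text())
    scores = json.loads((path / "scores.json").read_text())
    assert metadata["directory_name"] == path.name, "directory_name mismatch"
    assert isinstance(scores, list) and scores, "scores.json should be a non-empty list"
    print(f"{path.name}: {len(scores)} benchmark entries OK")

check_submission("results/v1.8.3_claude-sonnet-4-5")
```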
MIT License - See repository for details.