Neurosim is a Python framework for building, running, and evaluating AI agent systems. It provides core primitives for agent evaluation, cloud storage integration, and an LLM-as-a-judge system for automated scoring.
Developed at Paradigm Shift AI
Contributors: Anais Howland, Ashwin Thinnappan, Vaibhav Gupta, Jameel Shahid Mohammed
┌─────────────────────────────────────────────────┐
│            Your Agent Implementation            │
│           (extends Evaluation class)            │
│                                                 │
│   async def run() -> AgentResult                │
│   def compute_steps()                           │
│   def compute_tokens()                          │
└─────────────────────────────────────────────────┘
                         │
                         ▼
┌─────────────────────────────────────────────────┐
│               Neurosim Framework                │
│  ┌───────────────────────────────────────────┐  │
│  │  Evaluation Base Class                    │  │
│  │  • execute() orchestration                │  │
│  │  • CLI argument parsing                   │  │
│  │  • Result persistence                     │  │
│  └───────────────────────────────────────────┘  │
│  ┌───────────────────────────────────────────┐  │
│  │  GCSUploader                              │  │
│  │  • upload_json() - Results (zstd)         │  │
│  │  • upload_png() - Screenshots             │  │
│  └───────────────────────────────────────────┘  │
│  ┌───────────────────────────────────────────┐  │
│  │  LLM Judge                                │  │
│  │  • Score: 0-100 (≥70 = success)           │  │
│  │  • Reasoning & suggestions                │  │
│  └───────────────────────────────────────────┘  │
└─────────────────────────────────────────────────┘
                         │
                         ▼
┌─────────────────────────────────────────────────┐
│              Google Cloud Storage               │
│   results/{user}/{job}/{episode}/{task}/        │
│   ├── result.json.zst (compressed)              │
│   └── screenshot_*.png                          │
└─────────────────────────────────────────────────┘
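
The results layout at the bottom of the diagram is the convention used throughout this README. Purely as an illustration (the path segments map to the EvaluationRequest fields shown in the Quick Start below):

# Illustrative only: a result path following the layout above.
# {user}/{job}/{episode}/{task} correspond to the userid, jobid, episode, and
# taskid fields of EvaluationRequest (see the Quick Start example).
def result_prefix(userid: str, jobid: str, episode: int, taskid: str) -> str:
    return f"results/{userid}/{jobid}/{episode}/{taskid}/"

print(result_prefix("user123", "job001", 0, "task001"))
# -> results/user123/job001/0/task001/
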
- Agent Evaluation Framework: Abstract base class for implementing custom agent evaluations
- Cloud Storage Integration: Upload agent results, screenshots, and artifacts to Google Cloud Storage (GCS)
- LLM Judge System: Automated evaluation of agent performance using GPT or Gemini models
- Result Management: Structured result format with Zstandard compression support
- Firestore Integration: Track evaluation status and metrics
- Monitoring: Real-time monitoring of agent execution jobs
- Installation
- Quick Start
- Core Components
- Usage Examples
- LLM Judge
- Configuration
- Development
- License
- Citation
- Python 3.11 or higher
- Google Cloud Platform account (for GCS storage features)
- Google Cloud SDK (optional, for GCS authentication)
# Clone the repository
git clone https://github.com/anaishowland/neurosim.git
cd neurosim
# Create virtual environment
python3 -m venv .venv
source .venv/bin/activate # On Windows: .venv\Scripts\activate
# Install the package
pip install -e .
# Install with optional dependencies
pip install -e ".[core]" # Core evaluation features
pip install -e ".[judge]" # LLM judge system
pip install -e ".[monitor]" # Job monitoring features# Install uv (if not already installed)
curl -LsSf https://astral.sh/uv/install.sh | sh
# Create virtual environment and install
uv venv --python python3.11
source .venv/bin/activate
uv pip install -e ".[core,judge,monitor]"

Create a .env file in your project:
# Required for GCS storage
GCS_BUCKET_NAME=your-gcs-bucket-name
GOOGLE_APPLICATION_CREDENTIALS=/path/to/your/service-account-key.json
# Optional: Firestore configuration
GCP_PROJECT_ID=your-gcp-project-id
FIRESTORE_DATABASE=(default)
FIRESTORE_COLLECTION=evaluations
# Optional: Logging
LOG_LEVEL=INFO

# Option 1: Using service account key
export GOOGLE_APPLICATION_CREDENTIALS=/path/to/service-account-key.json
# Option 2: Using gcloud CLI
gcloud auth application-default login
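
Before running an evaluation, it can help to confirm that credentials resolve. A minimal sanity check, assuming the google-cloud-storage package is installed and GCS_BUCKET_NAME is set as in the .env example above:

# Sanity check: can Application Default Credentials see the configured bucket?
import os
from google.cloud import storage

client = storage.Client()
print(client.bucket(os.environ["GCS_BUCKET_NAME"]).exists())  # True if the bucket is reachable
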
Define an evaluation by subclassing the Evaluation base class:

from neurosim.evaluation import Evaluation
from neurosim.utils.models import EvaluationRequest, AgentResult
class MyAgentEvaluation(Evaluation):
    def __init__(self, request: EvaluationRequest):
        super().__init__(request)
        self.agent_name = "MyAgent"
        self.agent_version = "1.0.0"

    def get_llm(self):
        return "gpt-4"

    async def run(self) -> AgentResult:
        # Implement your agent logic here
        self.result.success = True
        self.result.results = "Task completed successfully"
        return self.result

    def compute_steps(self):
        # Process agent steps/trajectory
        self.result.steps = []

    def compute_tokens(self):
        # Track token usage
        self.result.tokens = [{"prompt_tokens": 100, "completion_tokens": 50}]

# Run evaluation
import asyncio

request = EvaluationRequest(
    userid="user123",
    model="gpt-4",
    jobid="job001",
    task="Open google.com and search for 'AI agents'",
    taskid="task001",
    browser_channel="chrome",
    episode=0,
    advanced_settings={},
    bucket_name="your-bucket"
)

eval_instance = MyAgentEvaluation(request)
asyncio.run(eval_instance.execute())

The Evaluation abstract class provides the foundation for implementing agent evaluations:
from neurosim.evaluation import Evaluation
class YourEvaluation(Evaluation):
    async def run(self) -> AgentResult:
        """Execute the agent task"""
        pass

    def get_llm(self):
        """Return the LLM model identifier"""
        pass

    def compute_steps(self):
        """Process agent steps/trajectory"""
        pass

    def compute_tokens(self):
        """Calculate token usage"""
        pass

Upload evaluation results and screenshots to Google Cloud Storage:
from neurosim.core.storage import GCSUploader

uploader = GCSUploader(bucket_name="your-bucket")

# Upload JSON results (with optional compression)
uri = uploader.upload_json(
    blob_path="results/task_001/result.json",
    data={"success": True, "score": 95},
    compress_zstd=True,
    zstd_level=3
)

# Upload screenshots
with open("screenshot.png", "rb") as f:
    uri = uploader.upload_png(
        data=f.read(),
        blob_path="results/task_001/screenshot.png"
    )
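
Results uploaded with compress_zstd=True land in GCS as Zstandard-compressed JSON (result.json.zst in the layout above). A minimal read-back sketch, assuming the google-cloud-storage and zstandard packages; the bucket name and blob path are placeholders:

# Download and decode a compressed result (bucket name and blob path are placeholders).
import json
import zstandard
from google.cloud import storage

blob = storage.Client().bucket("your-bucket").blob("results/task_001/result.json.zst")
result = json.loads(zstandard.ZstdDecompressor().decompress(blob.download_as_bytes()))
print(result["success"], result.get("score"))
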
Structured data models using Pydantic:

from neurosim.utils.models import (
    EvaluationRequest,   # Input configuration
    AgentResult,         # Evaluation output
    AgentErrors,         # Error tracking
    EvaluationConfig     # Runtime config
)

To run the bundled Notte example:

cd examples/NotteEvaluation
# Set up environment
cp .env.example .env
# Edit .env with your GCS bucket and credentials
# Run the evaluation
bash scripts/basic.sh

To create your own evaluation script:

# your_eval.py
from neurosim.evaluation import Evaluation
class CustomEvaluation(Evaluation):
    # ... implementation ...
    pass

if __name__ == "__main__":
    import asyncio

    eval = CustomEvaluation.from_cli()
    asyncio.run(eval.execute())

Run from command line:
python your_eval.py \
--jobId job_123 \
--task "Navigate to example.com" \
--taskId task_001 \
--user user_001 \
--episode 0 \
--model gpt-4 \
--advanced_settings '{"max_steps": 50}'

Evaluate agent performance automatically using GPT or Gemini models.
The LLM judge analyzes agent execution results and assigns a score (0-100) with detailed reasoning:
- Score ≥70: Success
- Score <70: Failure
# Set API key
export OPENAI_API_KEY=your_openai_key
# or
export GOOGLE_API_KEY=your_google_key
# Run judge on evaluation results
export JUDGE_MAX_CONCURRENCY=50
python -m neurosim.judge.evaluate_results \
/path/to/evaluation/folder \
--model gpt-4o \
--max-images 10 \
--output llm_judge.json

Example output (llm_judge.json):

{
"task_001": {
"score": 85,
"success": true,
"reasoning": "Agent successfully completed the task...",
"issues": ["minor_navigation_delay"],
"suggestions": ["Optimize element selection"]
}
}Create a .env file with these variables:
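
A small sketch for aggregating judge output with the ≥70 success threshold described above (assumes the llm_judge.json format shown):

# Summarize llm_judge.json: count tasks that meet the >= 70 success threshold.
import json

with open("llm_judge.json") as f:
    judgements = json.load(f)

passed = [task for task, r in judgements.items() if r["score"] >= 70]
print(f"{len(passed)}/{len(judgements)} tasks succeeded")
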
Create a .env file with these variables:

# Required
GCS_BUCKET_NAME=your-gcs-bucket-name
GOOGLE_APPLICATION_CREDENTIALS=/path/to/service-account-key.json
# Optional - GCP
GCP_PROJECT_ID=your-gcp-project-id
GCP_REGION=us-central1
FIRESTORE_DATABASE=(default)
FIRESTORE_COLLECTION=evaluations
# Optional - LLM Judge
OPENAI_API_KEY=your_openai_key
GOOGLE_API_KEY=your_google_key
JUDGE_MAX_CONCURRENCY=50
# Optional - Logging
LOG_LEVEL=INFO

# Install build tools
pip install build wheel hatchling
# Build distribution
python -m build
# or
make build

# Install test dependencies
pip install pytest pytest-asyncio pytest-cov
# Run tests
pytest
# With coverage
pytest --cov=src/neurosim --cov-report=html
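
Async evaluation methods need pytest-asyncio. A minimal test sketch against the MyAgentEvaluation class from the Quick Start (the module name your_eval and the request values are illustrative):

# test_my_eval.py -- minimal async test sketch.
import pytest
from neurosim.utils.models import EvaluationRequest
from your_eval import MyAgentEvaluation  # hypothetical module holding the Quick Start class

@pytest.mark.asyncio
async def test_run_reports_success():
    request = EvaluationRequest(
        userid="user123",
        model="gpt-4",
        jobid="job001",
        task="Open example.com",
        taskid="task001",
        browser_channel="chrome",
        episode=0,
        advanced_settings={},
        bucket_name="test-bucket",
    )
    result = await MyAgentEvaluation(request).run()
    assert result.success
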
# Format code
black src/ examples/
# Sort imports
isort src/ examples/
# Type checking
mypy src/neurosim

This project is licensed under the MIT License - see the LICENSE file for details.
If you use this work in your research or projects, please cite:
@software{neurosim2025,
  author    = {Howland, Anais and Thinnappan, Ashwin and Gupta, Vaibhav and Mohammed, Jameel Shahid},
  title     = {Neurosim: Core Evaluation Framework for AI Agents},
  year      = {2025},
  publisher = {Paradigm Shift AI},
  url       = {https://github.com/anaishowland/neurosim}
}

Developed at Paradigm Shift AI
This project was created to provide a robust framework for evaluating AI agent systems.
- Agent-CE: Continuous Evaluation platform with pre-built agent integrations (Browser Use, Notte, Anthropic/OpenAI Computer Use)
This is a snapshot release of work developed at Paradigm Shift AI. The code is provided as-is under the MIT License for the community to use, modify, and build upon.