Neurosim is a Python framework for building, running, and evaluating AI agent systems. It provides core primitives for agent evaluation, cloud storage integration, and an LLM-as-a-judge system for automated scoring.
Developed at Paradigm Shift AI
Contributors: Anais Howland, Ashwin Thinnappan, Vaibhav Gupta, Jameel Shahid Mohammed
┌─────────────────────────────────────────────────┐
│            Your Agent Implementation            │
│           (extends Evaluation class)            │
│                                                 │
│   async def run() -> AgentResult                │
│   def compute_steps()                           │
│   def compute_tokens()                          │
└─────────────────────────────────────────────────┘
                         │
                         ▼
┌─────────────────────────────────────────────────┐
│               Neurosim Framework                │
│  ┌───────────────────────────────────────────┐  │
│  │  Evaluation Base Class                    │  │
│  │  • execute() orchestration                │  │
│  │  • CLI argument parsing                   │  │
│  │  • Result persistence                     │  │
│  └───────────────────────────────────────────┘  │
│  ┌───────────────────────────────────────────┐  │
│  │  GCSUploader                              │  │
│  │  • upload_json() - Results (zstd)         │  │
│  │  • upload_png() - Screenshots             │  │
│  └───────────────────────────────────────────┘  │
│  ┌───────────────────────────────────────────┐  │
│  │  LLM Judge                                │  │
│  │  • Score: 0-100 (≥70 = success)           │  │
│  │  • Reasoning & suggestions                │  │
│  └───────────────────────────────────────────┘  │
└─────────────────────────────────────────────────┘
                         │
                         ▼
┌─────────────────────────────────────────────────┐
│              Google Cloud Storage               │
│   results/{user}/{job}/{episode}/{task}/        │
│   ├── result.json.zst (compressed)              │
│   └── screenshot_*.png                          │
└─────────────────────────────────────────────────┘
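
The results layout at the bottom of the diagram is the convention used throughout this README. Purely as an illustration (the path segments map to the EvaluationRequest fields shown in the Quick Start below):

# Illustrative only: a result path following the layout above.
# {user}/{job}/{episode}/{task} correspond to the userid, jobid, episode, and
# taskid fields of EvaluationRequest (see the Quick Start example).
def result_prefix(userid: str, jobid: str, episode: int, taskid: str) -> str:
    return f"results/{userid}/{jobid}/{episode}/{taskid}/"

print(result_prefix("user123", "job001", 0, "task001"))
# -> results/user123/job001/0/task001/
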
- Agent Evaluation Framework: Abstract base class for implementing custom agent evaluations
- Cloud Storage Integration: Upload agent results, screenshots, and artifacts to Google Cloud Storage (GCS)
- LLM Judge System: Automated evaluation of agent performance using GPT or Gemini models
- Result Management: Structured result format with Zstandard compression support
- Firestore Integration: Track evaluation status and metrics
- Monitoring: Real-time monitoring of agent execution jobs
- Installation
- Quick Start
- Core Components
- Usage Examples
- LLM Judge
- Configuration
- Development
- License
- Citation
- Python 3.11 or higher
- Google Cloud Platform account (for GCS storage features)
- Google Cloud SDK (optional, for GCS authentication)
# Clone the repository
git clone https://github.com/anaishowland/neurosim.git
cd neurosim
# Create virtual environment
python3 -m venv .venv
source .venv/bin/activate # On Windows: .venv\Scripts\activate
# Install the package
pip install -e .
# Install with optional dependencies
pip install -e ".[core]" # Core evaluation features
pip install -e ".[judge]" # LLM judge system
pip install -e ".[monitor]" # Job monitoring features# Install uv (if not already installed)
curl -LsSf https://astral.sh/uv/install.sh | sh
# Create virtual environment and install
uv venv --python python3.11
source .venv/bin/activate
uv pip install -e ".[core,judge,monitor]"

Create a .env file in your project:
# Required for GCS storage
GCS_BUCKET_NAME=your-gcs-bucket-name
GOOGLE_APPLICATION_CREDENTIALS=/path/to/your/service-account-key.json
# Optional: Firestore configuration
GCP_PROJECT_ID=your-gcp-project-id
FIRESTORE_DATABASE=(default)
FIRESTORE_COLLECTION=evaluations
# Optional: Logging
LOG_LEVEL=INFO

# Option 1: Using service account key
export GOOGLE_APPLICATION_CREDENTIALS=/path/to/service-account-key.json
# Option 2: Using gcloud CLI
gcloud auth application-default login
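
Before running an evaluation, it can help to confirm that credentials resolve. A minimal sanity check, assuming the google-cloud-storage package is installed and GCS_BUCKET_NAME is set as in the .env example above:

# Sanity check: can Application Default Credentials see the configured bucket?
import os
from google.cloud import storage

client = storage.Client()
print(client.bucket(os.environ["GCS_BUCKET_NAME"]).exists())  # True if the bucket is reachable
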
Define an evaluation by subclassing the Evaluation base class:

from neurosim.evaluation import Evaluation
from neurosim.utils.models import EvaluationRequest, AgentResult
class MyAgentEvaluation(Evaluation):
    def __init__(self, request: EvaluationRequest):
        super().__init__(request)
        self.agent_name = "MyAgent"
        self.agent_version = "1.0.0"

    def get_llm(self):
        return "gpt-4"

    async def run(self) -> AgentResult:
        # Implement your agent logic here
        self.result.success = True
        self.result.results = "Task completed successfully"
        return self.result

    def compute_steps(self):
        # Process agent steps/trajectory
        self.result.steps = []

    def compute_tokens(self):
        # Track token usage
        self.result.tokens = [{"prompt_tokens": 100, "completion_tokens": 50}]

# Run evaluation
import asyncio

request = EvaluationRequest(
    userid="user123",
    model="gpt-4",
    jobid="job001",
    task="Open google.com and search for 'AI agents'",
    taskid="task001",
    browser_channel="chrome",
    episode=0,
    advanced_settings={},
    bucket_name="your-bucket"
)

eval_instance = MyAgentEvaluation(request)
asyncio.run(eval_instance.execute())

The Evaluation abstract class provides the foundation for implementing agent evaluations:
from neurosim.evaluation import Evaluation
class YourEvaluation(Evaluation):
    async def run(self) -> AgentResult:
        """Execute the agent task"""
        pass

    def get_llm(self):
        """Return the LLM model identifier"""
        pass

    def compute_steps(self):
        """Process agent steps/trajectory"""
        pass

    def compute_tokens(self):
        """Calculate token usage"""
        pass

Upload evaluation results and screenshots to Google Cloud Storage:
from neurosim.core.storage import GCSUploader

uploader = GCSUploader(bucket_name="your-bucket")

# Upload JSON results (with optional compression)
uri = uploader.upload_json(
    blob_path="results/task_001/result.json",
    data={"success": True, "score": 95},
    compress_zstd=True,
    zstd_level=3
)

# Upload screenshots
with open("screenshot.png", "rb") as f:
    uri = uploader.upload_png(
        data=f.read(),
        blob_path="results/task_001/screenshot.png"
    )
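
Results uploaded with compress_zstd=True land in GCS as Zstandard-compressed JSON (result.json.zst in the layout above). A minimal read-back sketch, assuming the google-cloud-storage and zstandard packages; the bucket name and blob path are placeholders:

# Download and decode a compressed result (bucket name and blob path are placeholders).
import json
import zstandard
from google.cloud import storage

blob = storage.Client().bucket("your-bucket").blob("results/task_001/result.json.zst")
result = json.loads(zstandard.ZstdDecompressor().decompress(blob.download_as_bytes()))
print(result["success"], result.get("score"))
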
Structured data models using Pydantic:

from neurosim.utils.models import (
    EvaluationRequest,   # Input configuration
    AgentResult,         # Evaluation output
    AgentErrors,         # Error tracking
    EvaluationConfig     # Runtime config
)

To run the bundled Notte example:

cd examples/NotteEvaluation
# Set up environment
cp .env.example .env
# Edit .env with your GCS bucket and credentials
# Run the evaluation
bash scripts/basic.sh

To create your own evaluation script:

# your_eval.py
from neurosim.evaluation import Evaluation
class CustomEvaluation(Evaluation):
    # ... implementation ...
    pass

if __name__ == "__main__":
    import asyncio

    eval = CustomEvaluation.from_cli()
    asyncio.run(eval.execute())

Run from command line:
python your_eval.py \
--jobId job_123 \
--task "Navigate to example.com" \
--taskId task_001 \
--user user_001 \
--episode 0 \
--model gpt-4 \
--advanced_settings '{"max_steps": 50}'

Evaluate agent performance automatically using GPT or Gemini models.
The LLM judge analyzes agent execution results and assigns a score (0-100) with detailed reasoning:
- Score ≥70: Success
- Score <70: Failure
# Set API key
export OPENAI_API_KEY=your_openai_key
# or
export GOOGLE_API_KEY=your_google_key
# Run judge on evaluation results
export JUDGE_MAX_CONCURRENCY=50
python -m neurosim.judge.evaluate_results \
/path/to/evaluation/folder \
--model gpt-4o \
--max-images 10 \
--output llm_judge.json

Example output (llm_judge.json):

{
"task_001": {
"score": 85,
"success": true,
"reasoning": "Agent successfully completed the task...",
"issues": ["minor_navigation_delay"],
"suggestions": ["Optimize element selection"]
}
}Create a .env file with these variables:
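
A small sketch for aggregating judge output with the ≥70 success threshold described above (assumes the llm_judge.json format shown):

# Summarize llm_judge.json: count tasks that meet the >= 70 success threshold.
import json

with open("llm_judge.json") as f:
    judgements = json.load(f)

passed = [task for task, r in judgements.items() if r["score"] >= 70]
print(f"{len(passed)}/{len(judgements)} tasks succeeded")
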
Create a .env file with these variables:

# Required
GCS_BUCKET_NAME=your-gcs-bucket-name
GOOGLE_APPLICATION_CREDENTIALS=/path/to/service-account-key.json
# Optional - GCP
GCP_PROJECT_ID=your-gcp-project-id
GCP_REGION=us-central1
FIRESTORE_DATABASE=(default)
FIRESTORE_COLLECTION=evaluations
# Optional - LLM Judge
OPENAI_API_KEY=your_openai_key
GOOGLE_API_KEY=your_google_key
JUDGE_MAX_CONCURRENCY=50
# Optional - Logging
LOG_LEVEL=INFO

# Install build tools
pip install build wheel hatchling
# Build distribution
python -m build
# or
make build

# Install test dependencies
pip install pytest pytest-asyncio pytest-cov
# Run tests
pytest
# With coverage
pytest --cov=src/neurosim --cov-report=html
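
Async evaluation methods need pytest-asyncio. A minimal test sketch against the MyAgentEvaluation class from the Quick Start (the module name your_eval and the request values are illustrative):

# test_my_eval.py -- minimal async test sketch.
import pytest
from neurosim.utils.models import EvaluationRequest
from your_eval import MyAgentEvaluation  # hypothetical module holding the Quick Start class

@pytest.mark.asyncio
async def test_run_reports_success():
    request = EvaluationRequest(
        userid="user123",
        model="gpt-4",
        jobid="job001",
        task="Open example.com",
        taskid="task001",
        browser_channel="chrome",
        episode=0,
        advanced_settings={},
        bucket_name="test-bucket",
    )
    result = await MyAgentEvaluation(request).run()
    assert result.success
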
# Format code
black src/ examples/
# Sort imports
isort src/ examples/
# Type checking
mypy src/neurosim

This project is licensed under the MIT License - see the LICENSE file for details.
If you use this work in your research or projects, please cite:
@software{neurosim2025,
  author    = {Howland, Anais and Thinnappan, Ashwin and Gupta, Vaibhav and Mohammed, Jameel Shahid},
  title     = {Neurosim: Core Evaluation Framework for AI Agents},
  year      = {2025},
  publisher = {Paradigm Shift AI},
  url       = {https://github.com/anaishowland/neurosim}
}

Developed at Paradigm Shift AI
This project was created to provide a robust framework for evaluating AI agent systems.
- Agent-CE: Continuous Evaluation platform with pre-built agent integrations (Browser Use, Notte, Anthropic/OpenAI Computer Use)
This is a snapshot release of work developed at Paradigm Shift AI. The code is provided as-is under the MIT License for the community to use, modify, and build upon.