
SOLLOL: Intelligent Routing & Observability for Ollama Clusters

Maintained by B-A-M-N

PyPI version Python 3.8+ License: MIT Tests codecov Ollama

Transform your Ollama nodes into a production AI cluster with adaptive routing that learns which nodes work best for each task, plus unified observability across all applications.

✅ Adaptive Routing: GPU-aware, task-aware, learns from performance history ✅ Production Observability: Single dashboard for all applications and nodes ✅ Zero-Config Discovery: Auto-finds nodes, handles failover automatically ✅ Proven Stable: Powers FlockParser (5.5x speedup) and SynapticLlamas in production

Quick Start • 📚 Distributed Guide • Why SOLLOL • Architecture • Documentation


🎯 What is SOLLOL?

SOLLOL (Super Ollama Load Balancer & Orchestration Layer) is a production-ready orchestration framework that transforms your collection of Ollama nodes into an intelligent AI cluster with adaptive routing and unified observability, all running on your own hardware.

The Problem You Have:

  • ❌ Manual node selection for each request
  • ❌ Can't distribute multi-agent workloads efficiently
  • ❌ No automatic failover or load balancing
  • ❌ Zero visibility into cluster performance

The SOLLOL Solution:

  • ✅ Intelligent routing that learns which nodes work best for each task
  • ✅ Parallel agent execution for multi-agent frameworks
  • ✅ Auto-discovery of Ollama nodes across your network
  • ✅ Built-in observability with real-time metrics and dashboard
  • ✅ Automatic failover and health monitoring

⚡ Quickstart (3 Commands)

# 1. Install SOLLOL
pip install sollol

# 2. Start the dashboard (optional but recommended)
python3 -m sollol.dashboard_service &

# 3. Run your first query
python3 -c "from sollol import OllamaPool; pool = OllamaPool.auto_configure(); print(pool.chat(model='llama3.2', messages=[{'role': 'user', 'content': 'Hello!'}])['message']['content'])"

What just happened?

  • ✅ SOLLOL auto-discovered all Ollama nodes on your network
  • ✅ Intelligently routed your request to the best available node
  • ✅ Dashboard live at http://localhost:8080 (shows routing decisions, metrics, logs)

Expected output:

Discovering Ollama nodes...
Found 3 nodes: 192.168.1.22:11434, 192.168.1.10:11434, localhost:11434
Selected node: 192.168.1.22:11434 (GPU, 12ms latency)
Hello! How can I help you today?



🔥 Why SOLLOL?

1. Intelligent, Not Just Balanced

SOLLOL doesn't just distribute requests randomly; it learns and optimizes:

Feature              Simple Load Balancer   SOLLOL
Routing              Round-robin            Context-aware scoring
Learning             None                   Adapts from performance history
Resource Awareness   None                   GPU/CPU/memory-aware
Task Optimization    None                   Routes by task type and complexity
Failover             Manual                 Automatic with health checks
Priority             FIFO                   Priority queue with fairness

Example: SOLLOL automatically routes:

  • Heavy generation tasks → GPU nodes with 24GB VRAM
  • Fast embeddings → CPU nodes or smaller GPUs
  • Critical requests → Fastest, most reliable nodes
  • Batch processing → Lower priority, distributed load

2. Unified Observability for Your Entire AI Network

SOLLOL provides a single pane of glass to monitor every application and every node in your distributed AI network.

  • ✅ Centralized Dashboard: One web interface shows all applications, nodes, and performance metrics
  • ✅ Multi-App Tracking: See which applications (e.g., SynapticLlamas, custom agents) are using the cluster in real-time
  • ✅ Network-Wide Visibility: The dashboard runs as a persistent service, discovering and monitoring all components
  • ✅ Zero-Config: Applications automatically appear in the dashboard with no extra code required

This moves beyond per-application monitoring to provide true, centralized observability for your entire infrastructure.


3. Production-Ready from Day One

from sollol import OllamaPool

# Literally 3 lines to production
pool = OllamaPool.auto_configure()
response = pool.chat(model="llama3.2", messages=[...])
print(response['message']['content'])

Out of the box:

  • Auto-discovery of Ollama nodes
  • Health monitoring and failover
  • Prometheus metrics
  • Web dashboard with P50/P95/P99 latency tracking
  • Connection pooling
  • Request hedging
  • Priority queuing

4. Task Distribution at Scale

Distribute multiple requests across your cluster in parallel:

import asyncio
from sollol import OllamaPool

# Run 10 agents simultaneously across 5 nodes
pool = OllamaPool.auto_configure()
responses = await asyncio.gather(*[
    pool.chat(model="llama3.2", messages=[...])
    for _ in range(10)
])
# Parallel execution across available nodes

Proven results:

  • ✅ FlockParser: 5.5x speedup on document processing
  • ✅ SynapticLlamas: Parallel multi-agent execution across nodes
  • ✅ Production-tested: Real-world applications running at scale

📸 Dashboard Screenshots

Dashboard Overview

SOLLOL Unified Dashboard: Real-time monitoring with P50/P95/P99 latency metrics, network nodes, and active applications.

Activity Monitoring

Real-time Activity Logs: Live request/response activity streams from Ollama nodes with performance tracking.

Ray & Dask Integration

Ray and Dask Dashboards: Embedded Ray and Dask dashboards for distributed task monitoring.


πŸ—οΈ Production Applications

SOLLOL powers production-ready applications that leverage its intelligent routing and task distribution:

Distributed PDF Processing & RAG System

  • ✅ Production-stable document parsing and embedding
  • ✅ Distributed vector search across Ollama cluster
  • ✅ Automatic load balancing for embedding generation
  • ✅ Real-time monitoring via SOLLOL dashboard
  • Use Case: Enterprise document processing, RAG pipelines, knowledge base systems

Multi-Agent Collaboration Framework

  • ✅ Production-stable parallel agent execution
  • ✅ Collaborative workflow orchestration (research → critique → synthesis)
  • ✅ Quality control with automated validation
  • ✅ Distributed across multiple Ollama nodes via SOLLOL
  • Use Case: Complex reasoning tasks, research synthesis, multi-perspective analysis

πŸ—οΈ Architecture

High-Level Overview

┌─────────────────────────────────────────────────────────┐
│                  Your Application                        │
│         (SynapticLlamas, custom agents, etc.)            │
└───────────────────────┬─────────────────────────────────┘
                        │
                        ▼
┌─────────────────────────────────────────────────────────┐
│                 SOLLOL Gateway (:8000)                   │
│  ┌───────────────────────────────────────────────────┐  │
│  │         Intelligent Routing Engine                │  │
│  │  • Analyzes: task type, complexity, resources     │  │
│  │  • Scores: all nodes based on context             │  │
│  │  • Learns: from performance history               │  │
│  │  • Routes: to optimal node                        │  │
│  └───────────────────────────────────────────────────┘  │
│  ┌───────────────────────────────────────────────────┐  │
│  │          Priority Queue + Failover                │  │
│  └───────────────────────────────────────────────────┘  │
└────────┬────────────────────────────────────────────────┘
         │
         ▼
┌─────────────────────────────────────────────────────────┐
│              Your Heterogeneous Cluster                  │
│  GPU (24GB) │ GPU (16GB) │ CPU (64c) │ GPU (8GB) │ ...   │
└─────────────────────────────────────────────────────────┘

How Routing Works

# 1. Request arrives
POST /api/chat {
  "model": "llama3.2",
  "messages": [{"role": "user", "content": "Complex analysis task..."}],
  "priority": 8
}

# 2. SOLLOL analyzes
task_type = "generation"       # Auto-detected
complexity = "high"             # Token count analysis
requires_gpu = True             # Based on task
estimated_duration = 3.2s       # From history

# 3. SOLLOL scores all nodes
Node A (GPU 24GB, load: 0.2, latency: 120ms) → Score: 185.3 ✓ WINNER
Node B (GPU 8GB,  load: 0.6, latency: 200ms) → Score: 92.1
Node C (CPU only, load: 0.1, latency: 80ms)  → Score: 41.2

# 4. Routes to Node A, monitors execution, learns for next time

Scoring Algorithm:

Score = 100.0 (baseline)
      × success_rate (0.0-1.0)
      ÷ (1 + latency_penalty)
      × gpu_bonus (1.5x if GPU available & needed)
      ÷ (1 + load_penalty)
      × priority_alignment
      × task_specialization
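
To make the formula concrete, here is a minimal Python sketch of scoring and picking a node (illustrative only: the field names, the latency scale, and the omission of the priority-alignment and task-specialization terms are simplifications, not SOLLOL's internal implementation):

def score_node(node: dict, needs_gpu: bool) -> float:
    """Toy version of the scoring formula above; higher is better."""
    score = 100.0                                  # baseline
    score *= node["success_rate"]                  # 0.0-1.0 from history
    score /= 1.0 + node["latency_ms"] / 100.0      # latency penalty (assumed scale)
    if needs_gpu and node["has_gpu"]:
        score *= 1.5                               # GPU bonus
    score /= 1.0 + node["load"]                    # load penalty (0.0 = idle)
    return score

nodes = [
    {"name": "A", "has_gpu": True,  "load": 0.2, "latency_ms": 120, "success_rate": 0.99},
    {"name": "B", "has_gpu": True,  "load": 0.6, "latency_ms": 200, "success_rate": 0.97},
    {"name": "C", "has_gpu": False, "load": 0.1, "latency_ms": 80,  "success_rate": 0.99},
]
winner = max(nodes, key=lambda n: score_node(n, needs_gpu=True))
print(winner["name"])  # the request goes to the highest-scoring node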

📦 Installation

Quick Install (PyPI)

pip install sollol

From Source

git clone https://github.com/B-A-M-N/SOLLOL.git
cd SOLLOL
pip install -e .

⚡ Quick Start

1. Synchronous API (No async/await needed!)

New in v0.3.6: SOLLOL provides a synchronous API for easier integration.

from sollol.sync_wrapper import OllamaPool
from sollol.priority_helpers import Priority

# Auto-discover and connect to all Ollama nodes
pool = OllamaPool.auto_configure()

# Make requests - SOLLOL routes intelligently
# No async/await needed!
response = pool.chat(
    model="llama3.2",
    messages=[{"role": "user", "content": "Hello!"}],
    priority=Priority.HIGH,  # Semantic priority levels
    timeout=60  # Request timeout in seconds
)

print(response['message']['content'])
print(f"Routed to: {response.get('_sollol_routing', {}).get('host', 'unknown')}")

Key features:

  • ✅ No async/await syntax required
  • ✅ Works with synchronous agent frameworks
  • ✅ Same intelligent routing and features
  • ✅ Runs async code in a background thread automatically (see the sketch below)
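
Conceptually, such a wrapper can be built by running an asyncio event loop in a daemon thread and submitting coroutines to it from synchronous code. A generic sketch of that pattern (an illustration of the mechanism, not SOLLOL's actual source):

import asyncio
import threading

class SyncBridge:
    """Generic pattern: call async code from synchronous callers."""

    def __init__(self):
        self._loop = asyncio.new_event_loop()
        threading.Thread(target=self._loop.run_forever, daemon=True).start()

    def run(self, coro, timeout=None):
        # Submit the coroutine to the background loop and block for the result
        future = asyncio.run_coroutine_threadsafe(coro, self._loop)
        return future.result(timeout)

A wrapper built this way exposes ordinary blocking methods while the routing logic underneath stays asynchronous.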

2. Priority-Based Multi-Agent Execution

from sollol.sync_wrapper import OllamaPool
from sollol.priority_helpers import Priority, get_priority_for_role

pool = OllamaPool.auto_configure()

# Define agents with different priorities
agents = [
    {"name": "Researcher", "role": "researcher"},  # Priority 8
    {"name": "Editor", "role": "editor"},          # Priority 6
    {"name": "Summarizer", "role": "summarizer"},  # Priority 5
]

for agent in agents:
    priority = get_priority_for_role(agent["role"])

    response = pool.chat(
        model="llama3.2",
        messages=[{"role": "user", "content": f"Task for {agent['name']}"}],
        priority=priority
    )
    # User-facing agents get priority, background tasks wait

Priority levels available:

  • Priority.CRITICAL (10) - Mission-critical
  • Priority.URGENT (9) - Fast response needed
  • Priority.HIGH (7) - Important tasks
  • Priority.NORMAL (5) - Default
  • Priority.LOW (3) - Background tasks
  • Priority.BATCH (1) - Can wait

3. Enable Real-Time GPU Monitoring

For accurate VRAM-aware routing, install the GPU reporter on each node:

# On each Ollama node, run:
sollol install-gpu-reporter --redis-host <redis-server-ip>

# Example:
sollol install-gpu-reporter --redis-host 192.168.1.10

What this does:

  • Installs vendor-agnostic GPU monitoring (NVIDIA/AMD/Intel via gpustat)
  • Publishes real-time VRAM stats to Redis every 5 seconds
  • SOLLOL uses this data for intelligent routing decisions
  • See GPU Monitoring Guide for details

Without GPU monitoring: SOLLOL falls back to estimates, which may be inaccurate.
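
For a rough sense of the per-node data the reporter collects, the gpustat library it relies on can be queried directly (a quick local check; the reporter's Redis schema and publishing loop are internal to SOLLOL):

import gpustat

# Snapshot of local GPU state on a node
stats = gpustat.new_query()
for gpu in stats.gpus:
    print(f"GPU {gpu.index}: {gpu.memory_used}/{gpu.memory_total} MB used, "
          f"utilization {gpu.utilization}%")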


📊 Performance & Benchmarks

Production-Validated Performance

FlockParser Document Processing:

  • ✅ 5.5x speedup on large document batch processing
  • ✅ Distributed embedding generation across nodes
  • ✅ Production-tested with real-world workloads

SynapticLlamas Multi-Agent:

  • ✅ Parallel agent execution across multiple nodes
  • ✅ Automatic failover between agents
  • ✅ Priority-based task scheduling

Measured Baseline Performance

Single Ollama Node (llama3.2-3B, 50 requests, concurrency=5):

  • ✅ Success Rate: 100%
  • ⚡ Throughput: 0.51 req/s
  • 📈 Average Latency: 5,659 ms
  • 📈 P95 Latency: 11,299 ms
  • 📈 P99 Latency: 12,259 ms

Hardware: Single Ollama instance with 75+ models loaded
Data: See benchmarks/results/ for raw JSON

Run Your Own:

# Baseline test (no cluster needed)
python benchmarks/simple_ollama_benchmark.py llama3.2 50

# Comparative test (requires docker-compose)
docker-compose up -d
python benchmarks/run_benchmarks.py --sollol-url http://localhost:8000 --duration 60

Overhead

  • Routing decision: ~5-10ms (tested with 5-10 nodes)
  • Network overhead: Varies by network (typically 5-20ms)
  • Total added latency: ~20-50ms
  • Benefit: Better resource utilization + automatic failover

📚 Building Distributed Applications

New to distributed Ollama? Read our comprehensive guide:

Learn to build production-grade distributed AI applications with:

  • 4 proven architecture patterns (batch processing, multi-agent, code synthesis)
  • Real performance data from production applications
  • Complete code examples from real projects
  • Performance tuning guide for your workload
  • Production best practices and troubleshooting

Quick preview:

from sollol import OllamaPool

# Auto-discover and distribute work across cluster
pool = OllamaPool.auto_configure()

# Batch process 10,000 embeddings with adaptive parallelism
embeddings = pool.embed_batch(
    model="mxbai-embed-large",
    inputs=texts,
    use_adaptive=True  # SOLLOL optimizes based on node speeds
)
# Automatic work stealing, retry logic, and real-time dashboard

βš™οΈ Advanced Configuration

Custom Routing Strategy

from sollol import OllamaPool

pool = OllamaPool(
    nodes=[
        {"host": "gpu-1.local", "port": 11434, "priority": 10},  # Prefer this
        {"host": "gpu-2.local", "port": 11434, "priority": 5},
        {"host": "cpu-1.local", "port": 11434, "priority": 1},   # Last resort
    ],
    enable_intelligent_routing=True,
    enable_hedging=True,  # Duplicate critical requests
    max_queue_size=100
)

Observability & Monitoring

Zero-Config Auto-Registration 🎯

SOLLOL provides automatic observability with zero configuration required:

from sollol import OllamaPool

# Creates pool AND auto-registers with dashboard (if running)
pool = OllamaPool.auto_configure()
# ✅ Application automatically appears in dashboard at http://localhost:8080

Persistent Dashboard Service

Start the persistent dashboard once (survives application exits):

# Start dashboard service (runs until stopped)
python3 -m sollol.dashboard_service --port 8080 --redis-url redis://localhost:6379

# Or run in background
nohup python3 -m sollol.dashboard_service --port 8080 --redis-url redis://localhost:6379 > /tmp/dashboard_service.log 2>&1 &

Features:

  • 📊 Real-time metrics: System status, latency, success rate, GPU memory
  • 📜 Live log streaming: WebSocket-based log tailing (via Redis pub/sub)
  • 🌐 Activity monitoring: Ollama server activity tracking
  • 🔍 Auto-discovery: Automatically discovers Ollama nodes

Programmatic Stats Access

# Get detailed stats
stats = pool.get_stats()
print(f"Total requests: {stats['total_requests']}")
print(f"Average latency: {stats['avg_latency_ms']}ms")
print(f"Success rate: {stats['success_rate']:.2%}")

# Per-node breakdown
for host, metrics in stats['hosts'].items():
    print(f"{host}: {metrics['latency_ms']}ms, {metrics['success_rate']:.2%}")

🏭 Production Deployment

Multi-Node Bare Metal Setup

For teams preferring bare metal infrastructure, SOLLOL provides systemd-based deployment:

Architecture:

┌─────────────────────────────────────────┐
│   Central Router Machine                │
│   - SOLLOL Dashboard (port 8080)        │
│   - Redis (port 6379)                   │
└────────────┬────────────────────────────┘
             │ Auto-discovery
     ┌───────┼──────────┬─────────────┐
     ▼       ▼          ▼             ▼
┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐
│ Node 1  │ │ Node 2  │ │ Node 3  │ │ Node N  │
│ Ollama  │ │ Ollama  │ │ Ollama  │ │ Ollama  │
│ :11434  │ │ :11434  │ │ :11434  │ │ :11434  │
│ GPU 24GB│ │ GPU 16GB│ │ CPU 64c │ │ ...     │
└─────────┘ └─────────┘ └─────────┘ └─────────┘

Quick Setup:

# 1. Install Ollama on each node
curl -fsSL https://ollama.ai/install.sh | sh

# 2. Install SOLLOL on control plane
pip install sollol redis

# 3. Start dashboard
python3 -m sollol.dashboard_service --port 8080 --redis-url redis://localhost:6379

# 4. Test discovery
python3 -c "from sollol import OllamaPool; pool = OllamaPool.auto_configure(); print(pool.get_stats())"

See INSTALLATION.md for complete deployment guide including systemd services and production hardening.


🎓 Use Cases

1. Multi-Agent AI Systems (SynapticLlamas, CrewAI, AutoGPT)

Problem: Running 10 agents sequentially takes 10x longer than necessary.

Solution: SOLLOL distributes agents across nodes in parallel.

import asyncio
from sollol import OllamaPool

pool = OllamaPool.auto_configure()
agents = await asyncio.gather(*[
    pool.chat(model="llama3.2", messages=agent_prompts[i])
    for i in range(10)
])
# Speedup depends on number of available nodes and their capacity
# Speedup depends on number of available nodes and their capacity

2. Mixed Workloads

Problem: Different tasks need different resources.

Solution: SOLLOL routes each task to the optimal node.

pool = OllamaPool.auto_configure()

# Heavy generation β†’ GPU node
chat = pool.chat(model="llama3.2:70b", messages=[...])

# Fast embeddings β†’ CPU node
embeddings = pool.embed(model="nomic-embed-text", input=[...])

# SOLLOL automatically routes each to the best available node

3. High Availability Production

Problem: Node failures break your service.

Solution: SOLLOL auto-fails over and recovers.

# Node A fails mid-request
# ✅ SOLLOL automatically:
# 1. Detects failure
# 2. Retries on Node B
# 3. Marks Node A as degraded
# 4. Periodically re-checks Node A
# 5. Restores Node A when healthy
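
From the caller's side nothing changes; a minimal sketch (the prompt is a placeholder, and the failover steps above happen inside the pool):

from sollol import OllamaPool

pool = OllamaPool.auto_configure()

# Same call as always: if the selected node fails mid-request,
# the pool retries on another healthy node before returning.
response = pool.chat(
    model="llama3.2",
    messages=[{"role": "user", "content": "Summarize the incident report."}],
)
print(response['message']['content'])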

🔌 Integration Examples

SynapticLlamas Integration

from sollol import OllamaPool

# SynapticLlamas uses SOLLOL for intelligent routing
pool = OllamaPool(
    nodes=None,  # Auto-discover all Ollama nodes
    enable_intelligent_routing=True,
    app_name="SynapticLlamas",
    enable_ray=True
)

# All agent execution routes through SOLLOL
response = pool.chat(model="llama3.2", messages=[{"role": "user", "content": "query"}])

FlockParser Integration

from sollol import OllamaPool

# FlockParser uses SOLLOL's OllamaPool directly
pool = OllamaPool(
    nodes=None,  # Auto-discover all Ollama nodes
    enable_intelligent_routing=True,
    exclude_localhost=True,
    discover_all_nodes=True,
    app_name="FlockParser",
    enable_ray=True
)

# All document embeddings and queries route through SOLLOL
embeddings = pool.embed(model="mxbai-embed-large", input="document text")

LangChain Integration

from langchain.llms import Ollama

# Point LangChain at the SOLLOL gateway (port 8000); SOLLOL then routes
# each request to the best available Ollama node.
llm = Ollama(
    base_url="http://localhost:8000",
    model="llama3.2"
)

# LangChain requests now go through SOLLOL
response = llm("What is quantum computing?")

⚡ Performance Optimizations

SOLLOL includes production-grade performance optimizations:

🚀 Response Caching Layer

Intelligent LRU cache with TTL expiration:

from sollol import OllamaPool

# Enable response caching (enabled by default)
pool = OllamaPool.auto_configure(
    enable_cache=True,
    cache_max_size=1000,  # Cache up to 1000 responses
    cache_ttl=3600        # 1 hour TTL
)

# Get cache stats
stats = pool.get_cache_stats()
print(f"Hit rate: {stats['hit_rate']:.1%}")

🌊 Streaming Support

Token-by-token streaming for better UX:

# Stream chat responses
for chunk in pool.chat(
    model="llama3.2",
    messages=[{"role": "user", "content": "Tell me a story"}],
    stream=True
):
    content = chunk.get("message", {}).get("content", "")
    print(content, end="", flush=True)

🔥 Smart Model Prefetching

Pre-load models into VRAM before first use:

# Warm a single model
pool.warm_model("llama3.2")

# Warm multiple models in parallel
results = pool.warm_models(
    models=["llama3.2", "codellama", "mistral"],
    parallel=True
)

Additional Optimizations

  • Connection Pool Tuning: Optimized pool sizes for better concurrency
  • Adaptive Health Checks: Dynamic intervals based on node stability
  • Telemetry Sampling: Configurable sampling reduces overhead by ~90%
  • HTTP/2 Multiplexing: 30-50% latency reduction for concurrent requests

📚 Documentation

Getting Started

Complete Documentation

See docs/ for detailed documentation organized by category.


🤝 Contributing

We welcome contributions! Areas we'd love help with:

  • ML-based routing predictions
  • Additional monitoring integrations
  • Cloud provider integrations
  • Performance optimizations
  • Documentation improvements

See CONTRIBUTING.md for guidelines.


🔬 Experimental: Distributed Inference Research

⚠️ Research Feature - Not Production Ready

SOLLOL includes experimental distributed inference capabilities via llama.cpp RPC integration. This feature distributes model layer computation across multiple nodes.

Current Status:

  • ✅ Basic functionality validated (13B models, 2-3 nodes)
  • ⚠️ Performance: 5x slower than local inference
  • ⚠️ Complexity: Requires manual setup, exact version matching
  • ⚠️ Limitation: Coordinator still requires full model in RAM

When to use this:

  • Research and experimentation only
  • When you absolutely need to run a model that won't fit on any single node
  • When you're willing to accept significant performance tradeoffs

For production workloads: Use SOLLOL's proven task distribution features instead.


Future Work:

Further optimization of distributed inference requires:

  • Multi-node cluster infrastructure for comprehensive testing
  • Performance tuning to reduce startup time and inference overhead
  • Automated version management and deployment

Status: Research track with working foundation. See src/sollol/distributed_pipeline.py for technical details.


📜 License

MIT License - see LICENSE file for details.


πŸ™ Credits

Created by B-A-M-N

Part of the Complete AI Ecosystem:

Special Thanks:

  • Dallan Loomis - For always providing invaluable support, feedback, and guidance throughout development

Built with: Ray, Dask, FastAPI, llama.cpp, Ollama


🎯 What Makes SOLLOL Different?

  1. Adaptive routing that learns from performance history
  2. Context-aware scoring based on task type, complexity, and resources
  3. Auto-discovery of nodes with minimal configuration
  4. Built-in failover and priority queuing
  5. Production-ready: Powers FlockParser and SynapticLlamas at scale
  6. Unified observability: Single dashboard for entire AI network

Stop manually managing your LLM cluster. Let SOLLOL optimize it for you.

Get Started • View on GitHub • Report Issue
