Maintained by B-A-M-N
Transform your Ollama nodes into a production AI cluster with adaptive routing that learns which nodes work best for each task, plus unified observability across all applications.
- Adaptive Routing: GPU-aware, task-aware, learns from performance history
- Production Observability: Single dashboard for all applications and nodes
- Zero-Config Discovery: Auto-finds nodes, handles failover automatically
- Proven Stable: Powers FlockParser (5.5x speedup) and SynapticLlamas in production
Quick Start • Distributed Guide • Why SOLLOL • Architecture • Documentation
SOLLOL (Super Ollama Load balancer & Orchestration Layer) is a production-ready orchestration framework that transforms your collection of Ollama nodes into an intelligent AI cluster with adaptive routing and unified observability, all running on your own hardware.
The Problem You Have:
- Manual node selection for each request
- Can't distribute multi-agent workloads efficiently
- No automatic failover or load balancing
- Zero visibility into cluster performance
The SOLLOL Solution:
- Intelligent routing that learns which nodes work best for each task
- Parallel agent execution for multi-agent frameworks
- Auto-discovery of Ollama nodes across your network
- Built-in observability with real-time metrics and dashboard
- Automatic failover and health monitoring
# 1. Install SOLLOL
pip install sollol
# 2. Start the dashboard (optional but recommended)
python3 -m sollol.dashboard_service &
# 3. Run your first query
python3 -c "from sollol import OllamaPool; pool = OllamaPool.auto_configure(); print(pool.chat(model='llama3.2', messages=[{'role': 'user', 'content': 'Hello!'}])['message']['content'])"

What just happened?
- SOLLOL auto-discovered all Ollama nodes on your network
- Intelligently routed your request to the best available node
- Dashboard live at http://localhost:8080 (shows routing decisions, metrics, logs)
Expected output:
Discovering Ollama nodes...
Found 3 nodes: 192.168.1.22:11434, 192.168.1.10:11434, localhost:11434
Selected node: 192.168.1.22:11434 (GPU, 12ms latency)
Hello! How can I help you today?
Next steps:
- Visit http://localhost:8080 to see the dashboard
- Read the Distributed Ollama Guide to learn how to build distributed AI applications with proven patterns
SOLLOL doesn't just distribute requests randomly; it learns and optimizes:
| Feature | Simple Load Balancer | SOLLOL |
|---|---|---|
| Routing | Round-robin | Context-aware scoring |
| Learning | None | Adapts from performance history |
| Resource Awareness | None | GPU/CPU/memory-aware |
| Task Optimization | None | Routes by task type and complexity |
| Failover | Manual | Automatic with health checks |
| Priority | FIFO | Priority queue with fairness |
Example: SOLLOL automatically routes (see the sketch after this list):
- Heavy generation tasks → GPU nodes with 24GB VRAM
- Fast embeddings → CPU nodes or smaller GPUs
- Critical requests → fastest, most reliable nodes
- Batch processing → lower priority, distributed load
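For example, a single pool can carry both kinds of traffic by tagging requests with the priority levels documented later in this README. A minimal sketch using the synchronous API shown below; the prompts are placeholders:

```python
from sollol.sync_wrapper import OllamaPool
from sollol.priority_helpers import Priority

pool = OllamaPool.auto_configure()

# Interactive, user-facing request: high priority, favors the fastest healthy node
answer = pool.chat(
    model="llama3.2",
    messages=[{"role": "user", "content": "Summarize this incident report."}],
    priority=Priority.HIGH,
)

# Background job: batch priority, queued behind higher-priority work
digest = pool.chat(
    model="llama3.2",
    messages=[{"role": "user", "content": "Generate the weekly digest."}],
    priority=Priority.BATCH,
)
```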
SOLLOL provides a single pane of glass to monitor every application and every node in your distributed AI network.
- Centralized Dashboard: One web interface shows all applications, nodes, and performance metrics
- Multi-App Tracking: See which applications (e.g., SynapticLlamas, custom agents) are using the cluster in real time
- Network-Wide Visibility: The dashboard runs as a persistent service, discovering and monitoring all components
- Zero-Config: Applications automatically appear in the dashboard with no extra code required
This moves beyond per-application monitoring to provide true, centralized observability for your entire infrastructure.
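A minimal sketch of what that looks like from an application's side, based on the integration examples later in this README; the `app_name` value here is hypothetical:

```python
from sollol import OllamaPool

# Naming the application labels its traffic in the centralized dashboard;
# no other observability wiring is required.
pool = OllamaPool(
    nodes=None,                      # auto-discover Ollama nodes
    enable_intelligent_routing=True,
    app_name="MyDocPipeline",        # hypothetical application name
)

response = pool.chat(
    model="llama3.2",
    messages=[{"role": "user", "content": "ping"}],
)
# The request, its routing decision, and per-node metrics now appear at
# http://localhost:8080 when the dashboard service is running.
```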
from sollol import OllamaPool
# Literally 3 lines to production
pool = OllamaPool.auto_configure()
response = pool.chat(model="llama3.2", messages=[...])
print(response['message']['content'])

Out of the box:
- Auto-discovery of Ollama nodes
- Health monitoring and failover
- Prometheus metrics
- Web dashboard with P50/P95/P99 latency tracking
- Connection pooling
- Request hedging (concept sketch after this list)
- Priority queuing
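Request hedging sends a duplicate of a latency-critical request to a second node and keeps whichever response arrives first; in SOLLOL it is enabled with `enable_hedging=True` (see the manual configuration example further down). The snippet below is only a concept sketch with a hypothetical `send_to_node` callable, not SOLLOL's internal implementation:

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait

def hedged_request(send_to_node, nodes, payload):
    """Concept sketch: fire the same request at two nodes and return the
    first response. `send_to_node(node, payload)` is a hypothetical callable."""
    with ThreadPoolExecutor(max_workers=2) as executor:
        futures = [executor.submit(send_to_node, node, payload) for node in nodes[:2]]
        done, pending = wait(futures, return_when=FIRST_COMPLETED)
        for future in pending:
            future.cancel()  # best effort; an already-running call still completes
        return next(iter(done)).result()
```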
Distribute multiple requests across your cluster in parallel:
import asyncio

# Run 10 agents simultaneously across 5 nodes
pool = OllamaPool.auto_configure()
responses = await asyncio.gather(*[
pool.chat(model="llama3.2", messages=[...])
for _ in range(10)
])
# Parallel execution across available nodes

Proven results:
- FlockParser: 5.5x speedup on document processing
- SynapticLlamas: Parallel multi-agent execution across nodes
- Production-tested: Real-world applications running at scale
Real-time monitoring with P50/P95/P99 latency metrics, network nodes, and active applications
Live request/response activity streams from Ollama nodes with performance tracking
Embedded Ray and Dask dashboards for distributed task monitoring
SOLLOL powers production-ready applications that leverage its intelligent routing and task distribution:
Distributed PDF Processing & RAG System
- Production-stable document parsing and embedding
- Distributed vector search across Ollama cluster
- Automatic load balancing for embedding generation
- Real-time monitoring via SOLLOL dashboard
- Use Case: Enterprise document processing, RAG pipelines, knowledge base systems
Multi-Agent Collaboration Framework
- Production-stable parallel agent execution
- Collaborative workflow orchestration (research → critique → synthesis)
- Quality control with automated validation
- Distributed across multiple Ollama nodes via SOLLOL
- Use Case: Complex reasoning tasks, research synthesis, multi-perspective analysis
┌──────────────────────────────────────────────────────────┐
│                     Your Application                      │
│            (SynapticLlamas, custom agents, etc.)          │
└───────────────────────────┬──────────────────────────────┘
                            │
                            ▼
┌──────────────────────────────────────────────────────────┐
│                  SOLLOL Gateway (:8000)                   │
│  ┌────────────────────────────────────────────────────┐  │
│  │  Intelligent Routing Engine                        │  │
│  │   • Analyzes: task type, complexity, resources     │  │
│  │   • Scores: all nodes based on context             │  │
│  │   • Learns: from performance history               │  │
│  │   • Routes: to optimal node                        │  │
│  └────────────────────────────────────────────────────┘  │
│  ┌────────────────────────────────────────────────────┐  │
│  │  Priority Queue + Failover                         │  │
│  └────────────────────────────────────────────────────┘  │
└───────────────────────────┬──────────────────────────────┘
                            │
                            ▼
┌──────────────────────────────────────────────────────────┐
│                 Your Heterogeneous Cluster                │
│   GPU (24GB) │ GPU (16GB) │ CPU (64c) │ GPU (8GB) │ ...   │
└──────────────────────────────────────────────────────────┘
# 1. Request arrives
POST /api/chat {
"model": "llama3.2",
"messages": [{"role": "user", "content": "Complex analysis task..."}],
"priority": 8
}
# 2. SOLLOL analyzes
task_type = "generation" # Auto-detected
complexity = "high" # Token count analysis
requires_gpu = True # Based on task
estimated_duration = 3.2s # From history
# 3. SOLLOL scores all nodes
Node A (GPU 24GB, load: 0.2, latency: 120ms) → Score: 185.3 ← WINNER
Node B (GPU 8GB, load: 0.6, latency: 200ms)  → Score: 92.1
Node C (CPU only, load: 0.1, latency: 80ms)  → Score: 41.2
# 4. Routes to Node A, monitors execution, learns for next time

Scoring Algorithm:
Score = 100.0 (baseline)
      × success_rate (0.0-1.0)
      ÷ (1 + latency_penalty)
      × gpu_bonus (1.5x if GPU available & needed)
      ÷ (1 + load_penalty)
      × priority_alignment
      × task_specialization
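The same formula as a Python sketch. The `node` and `task` objects and their attribute names are illustrative, not SOLLOL's internal types; the exact penalty and bonus terms live in the routing engine:

```python
def score_node(node, task, baseline=100.0):
    """Illustrative version of the scoring formula above."""
    score = baseline
    score *= node.success_rate                    # 0.0-1.0, from performance history
    score /= 1 + node.latency_penalty             # grows with observed latency
    if task.requires_gpu and node.has_gpu:
        score *= 1.5                              # gpu_bonus
    score /= 1 + node.load_penalty                # grows with current load
    score *= node.priority_alignment(task)        # fast, reliable nodes for high-priority work
    score *= node.task_specialization(task.type)  # learned per-task affinity
    return score

# The highest-scoring node wins (Node A, 185.3, in the walkthrough above).
```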
# Install from PyPI
pip install sollol

# Or install from source
git clone https://github.com/B-A-M-N/SOLLOL.git
cd SOLLOL
pip install -e .

New in v0.3.6: SOLLOL provides a synchronous API for easier integration.
from sollol.sync_wrapper import OllamaPool
from sollol.priority_helpers import Priority
# Auto-discover and connect to all Ollama nodes
pool = OllamaPool.auto_configure()
# Make requests - SOLLOL routes intelligently
# No async/await needed!
response = pool.chat(
model="llama3.2",
messages=[{"role": "user", "content": "Hello!"}],
priority=Priority.HIGH, # Semantic priority levels
timeout=60 # Request timeout in seconds
)
print(response['message']['content'])
print(f"Routed to: {response.get('_sollol_routing', {}).get('host', 'unknown')}")Key features:
- No async/await syntax required
- Works with synchronous agent frameworks
- Same intelligent routing and features
- Runs async code in a background thread automatically
from sollol.sync_wrapper import OllamaPool
from sollol.priority_helpers import Priority, get_priority_for_role
pool = OllamaPool.auto_configure()
# Define agents with different priorities
agents = [
{"name": "Researcher", "role": "researcher"}, # Priority 8
{"name": "Editor", "role": "editor"}, # Priority 6
{"name": "Summarizer", "role": "summarizer"}, # Priority 5
]
for agent in agents:
    priority = get_priority_for_role(agent["role"])
    response = pool.chat(
        model="llama3.2",
        messages=[{"role": "user", "content": f"Task for {agent['name']}"}],
        priority=priority
    )
# User-facing agents get priority, background tasks wait

Priority levels available:
- Priority.CRITICAL (10) - Mission-critical
- Priority.URGENT (9) - Fast response needed
- Priority.HIGH (7) - Important tasks
- Priority.NORMAL (5) - Default
- Priority.LOW (3) - Background tasks
- Priority.BATCH (1) - Can wait
For accurate VRAM-aware routing, install the GPU reporter on each node:
# On each Ollama node, run:
sollol install-gpu-reporter --redis-host <redis-server-ip>
# Example:
sollol install-gpu-reporter --redis-host 192.168.1.10

What this does:
- Installs vendor-agnostic GPU monitoring (NVIDIA/AMD/Intel via gpustat)
- Publishes real-time VRAM stats to Redis every 5 seconds
- SOLLOL uses this data for intelligent routing decisions
- See GPU Monitoring Guide for details
Without GPU monitoring, SOLLOL falls back to estimates, which may be inaccurate.
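Conceptually, the reporter is a small publish loop like the sketch below. This is not the installed service itself; the Redis key name and payload layout are hypothetical:

```python
import json
import socket
import time

import gpustat  # pip install gpustat
import redis    # pip install redis

r = redis.Redis(host="192.168.1.10", port=6379)  # your Redis server
hostname = socket.gethostname()

while True:
    stats = gpustat.new_query()  # query local GPUs
    payload = [
        {
            "index": gpu.index,
            "memory_used_mb": gpu.memory_used,
            "memory_total_mb": gpu.memory_total,
            "utilization_pct": gpu.utilization,
        }
        for gpu in stats.gpus
    ]
    # Hypothetical key layout; the real reporter's schema may differ.
    r.set(f"sollol:gpu:{hostname}", json.dumps(payload), ex=15)
    time.sleep(5)  # matches the 5-second cadence described above
```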
FlockParser Document Processing:
- 5.5x speedup on large document batch processing
- Distributed embedding generation across nodes
- Production-tested with real-world workloads
SynapticLlamas Multi-Agent:
- Parallel agent execution across multiple nodes
- Automatic failover between agents
- Priority-based task scheduling
Single Ollama Node (llama3.2-3B, 50 requests, concurrency=5):
- Success Rate: 100%
- Throughput: 0.51 req/s
- Average Latency: 5,659 ms
- P95 Latency: 11,299 ms
- P99 Latency: 12,259 ms
Hardware: Single Ollama instance with 75+ models loaded
Data: See benchmarks/results/ for raw JSON
Run Your Own:
# Baseline test (no cluster needed)
python benchmarks/simple_ollama_benchmark.py llama3.2 50
# Comparative test (requires docker-compose)
docker-compose up -d
python benchmarks/run_benchmarks.py --sollol-url http://localhost:8000 --duration 60

- Routing decision: ~5-10ms (tested with 5-10 nodes)
- Network overhead: Varies by network (typically 5-20ms)
- Total added latency: ~20-50ms
- Benefit: Better resource utilization + automatic failover
New to distributed Ollama? Read our comprehensive guide:
Learn to build production-grade distributed AI applications with:
- 4 proven architecture patterns (batch processing, multi-agent, code synthesis)
- Real performance data from production applications
- Complete code examples from real projects
- Performance tuning guide for your workload
- Production best practices and troubleshooting
Quick preview:
from sollol import OllamaPool
# Auto-discover and distribute work across cluster
pool = OllamaPool.auto_configure()
# Batch process 10,000 embeddings with adaptive parallelism
embeddings = pool.embed_batch(
model="mxbai-embed-large",
inputs=texts,
use_adaptive=True # SOLLOL optimizes based on node speeds
)
# Automatic work stealing, retry logic, and real-time dashboard

from sollol import OllamaPool
pool = OllamaPool(
nodes=[
{"host": "gpu-1.local", "port": 11434, "priority": 10}, # Prefer this
{"host": "gpu-2.local", "port": 11434, "priority": 5},
{"host": "cpu-1.local", "port": 11434, "priority": 1}, # Last resort
],
enable_intelligent_routing=True,
enable_hedging=True, # Duplicate critical requests
max_queue_size=100
)

SOLLOL provides automatic observability with zero configuration required:
from sollol import OllamaPool
# Creates pool AND auto-registers with dashboard (if running)
pool = OllamaPool.auto_configure()
# Application automatically appears in the dashboard at http://localhost:8080

Start the persistent dashboard once (survives application exits):
# Start dashboard service (runs until stopped)
python3 -m sollol.dashboard_service --port 8080 --redis-url redis://localhost:6379
# Or run in background
nohup python3 -m sollol.dashboard_service --port 8080 --redis-url redis://localhost:6379 > /tmp/dashboard_service.log 2>&1 &

Features:
- Real-time metrics: System status, latency, success rate, GPU memory
- Live log streaming: WebSocket-based log tailing (via Redis pub/sub)
- Activity monitoring: Ollama server activity tracking
- Auto-discovery: Automatically discovers Ollama nodes
# Get detailed stats
stats = pool.get_stats()
print(f"Total requests: {stats['total_requests']}")
print(f"Average latency: {stats['avg_latency_ms']}ms")
print(f"Success rate: {stats['success_rate']:.2%}")
# Per-node breakdown
for host, metrics in stats['hosts'].items():
print(f"{host}: {metrics['latency_ms']}ms, {metrics['success_rate']:.2%}")For teams preferring bare metal infrastructure, SOLLOL provides systemd-based deployment:
┌──────────────────────────────────────────┐
│         Central Router Machine           │
│   - SOLLOL Dashboard (port 8080)         │
│   - Redis (port 6379)                    │
└────────────┬─────────────────────────────┘
             │ Auto-discovery
     ┌───────┼───────────┬─────────────┐
     ▼       ▼           ▼             ▼
┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐
│ Node 1  │ │ Node 2  │ │ Node 3  │ │ Node N  │
│ Ollama  │ │ Ollama  │ │ Ollama  │ │ Ollama  │
│ :11434  │ │ :11434  │ │ :11434  │ │ :11434  │
│ GPU 24GB│ │ GPU 16GB│ │ CPU 64c │ │ ...     │
└─────────┘ └─────────┘ └─────────┘ └─────────┘
Quick Setup:
# 1. Install Ollama on each node
curl -fsSL https://ollama.ai/install.sh | sh
# 2. Install SOLLOL on control plane
pip install sollol redis
# 3. Start dashboard
python3 -m sollol.dashboard_service --port 8080 --redis-url redis://localhost:6379
# 4. Test discovery
python3 -c "from sollol import OllamaPool; pool = OllamaPool.auto_configure(); print(pool.get_stats())"See INSTALLATION.md for complete deployment guide including systemd services and production hardening.
Problem: Running 10 agents sequentially takes 10x longer than necessary.
Solution: SOLLOL distributes agents across nodes in parallel.
import asyncio

pool = OllamaPool.auto_configure()
agents = await asyncio.gather(*[
pool.chat(model="llama3.2", messages=agent_prompts[i])
for i in range(10)
])
# Speedup depends on the number of available nodes and their capacity

Problem: Different tasks need different resources.
Solution: SOLLOL routes each task to the optimal node.
pool = OllamaPool.auto_configure()
# Heavy generation β GPU node
chat = pool.chat(model="llama3.2:70b", messages=[...])
# Fast embeddings β CPU node
embeddings = pool.embed(model="nomic-embed-text", input=[...])
# SOLLOL automatically routes each to the best available node

Problem: Node failures break your service.
Solution: SOLLOL auto-fails over and recovers.
# Node A fails mid-request
# SOLLOL automatically:
# 1. Detects failure
# 2. Retries on Node B
# 3. Marks Node A as degraded
# 4. Periodically re-checks Node A
# 5. Restores Node A when healthy

from sollol import OllamaPool
# SynapticLlamas uses SOLLOL for intelligent routing
pool = OllamaPool(
nodes=None, # Auto-discover all Ollama nodes
enable_intelligent_routing=True,
app_name="SynapticLlamas",
enable_ray=True
)
# All agent execution routes through SOLLOL
response = pool.chat(model="llama3.2", messages=[{"role": "user", "content": "query"}])

from sollol import OllamaPool
# FlockParser uses SOLLOL's OllamaPool directly
pool = OllamaPool(
nodes=None, # Auto-discover all Ollama nodes
enable_intelligent_routing=True,
exclude_localhost=True,
discover_all_nodes=True,
app_name="FlockParser",
enable_ray=True
)
# All document embeddings and queries route through SOLLOL
embeddings = pool.embed(model="mxbai-embed-large", input="document text")from langchain.llms import Ollama
from sollol import OllamaPool
# Use SOLLOL as LangChain backend
pool = OllamaPool.auto_configure()
llm = Ollama(
base_url="http://localhost:8000",
model="llama3.2"
)
# LangChain requests now go through SOLLOL
response = llm("What is quantum computing?")

SOLLOL includes production-grade performance optimizations:
Intelligent LRU cache with TTL expiration:
from sollol import OllamaPool
# Enable response caching (enabled by default)
pool = OllamaPool.auto_configure(
enable_cache=True,
cache_max_size=1000, # Cache up to 1000 responses
cache_ttl=3600 # 1 hour TTL
)
# Get cache stats
stats = pool.get_cache_stats()
print(f"Hit rate: {stats['hit_rate']:.1%}")Token-by-token streaming for better UX:
# Stream chat responses
for chunk in pool.chat(
model="llama3.2",
messages=[{"role": "user", "content": "Tell me a story"}],
stream=True
):
    content = chunk.get("message", {}).get("content", "")
    print(content, end="", flush=True)

Pre-load models into VRAM before first use:
# Warm a single model
pool.warm_model("llama3.2")
# Warm multiple models in parallel
results = pool.warm_models(
models=["llama3.2", "codellama", "mistral"],
parallel=True
)

- Connection Pool Tuning: Optimized pool sizes for better concurrency
- Adaptive Health Checks: Dynamic intervals based on node stability
- Telemetry Sampling: Configurable sampling reduces overhead by ~90%
- HTTP/2 Multiplexing: 30-50% latency reduction for concurrent requests
- Installation Guide - Complete setup for bare-metal deployment
- Quick Start - Get up and running in 3 commands
- Configuration - All configuration options
- Architecture - System architecture overview
See docs/ for detailed documentation organized by category:
- Setup Guides - Ray, Redis, GPU monitoring, Grafana
- Features - Routing, dashboard, batch processing
- Architecture - System design and patterns
- Integration - Code examples and walkthroughs
- Benchmarks - Performance testing and results
- Troubleshooting - Known issues and fixes
We welcome contributions! Areas we'd love help with:
- ML-based routing predictions
- Additional monitoring integrations
- Cloud provider integrations
- Performance optimizations
- Documentation improvements
See CONTRIBUTING.md for guidelines.
SOLLOL includes experimental distributed inference capabilities via llama.cpp RPC integration. This feature distributes model layer computation across multiple nodes.
Current Status:
- Basic functionality validated (13B models, 2-3 nodes)
- Performance: 5x slower than local inference
- Complexity: Requires manual setup and exact version matching
- Limitation: Coordinator still requires the full model in RAM
When to use this:
- Research and experimentation only
- When you absolutely need to run a model that won't fit on any single node
- When you're willing to accept significant performance tradeoffs
For production workloads: Use SOLLOL's proven task distribution features instead.
Learn more:
- EXPERIMENTAL_FEATURES.md - Honest assessment, realistic expectations, known issues
- Complete llama.cpp Guide - Setup, optimization, troubleshooting
- Distributed Ollama Guide - Production-ready patterns
Future Work:
Further optimization of distributed inference requires:
- Multi-node cluster infrastructure for comprehensive testing
- Performance tuning to reduce startup time and inference overhead
- Automated version management and deployment
Status: Research track with working foundation. See src/sollol/distributed_pipeline.py for technical details.
MIT License - see LICENSE file for details.
Created by B-A-M-N
Part of the Complete AI Ecosystem:
- SynapticLlamas - Multi-Agent Orchestration
- FlockParser - Document RAG Intelligence
- SOLLOL - Distributed Inference Platform (this project)
Special Thanks:
- Dallan Loomis - For always providing invaluable support, feedback, and guidance throughout development
Built with: Ray, Dask, FastAPI, llama.cpp, Ollama
- Adaptive routing that learns from performance history
- Context-aware scoring based on task type, complexity, and resources
- Auto-discovery of nodes with minimal configuration
- Built-in failover and priority queuing
- Production-ready: Powers FlockParser and SynapticLlamas at scale
- Unified observability: Single dashboard for entire AI network
Stop manually managing your LLM cluster. Let SOLLOL optimize it for you.
Get Started β’ View on GitHub β’ Report Issue