A containerized, OpenAI-compatible LLM inference gateway that intelligently routes requests to a local GPU model (vLLM). Includes prompt classification, Redis caching, monitoring, and robust configuration.
- API Key Authentication: All API key registration and authentication is now database-backed. Keys registered via `/api/v1/apikeys` are instantly usable for all protected endpoints. No more reliance on environment variables for key validation in production.
- Pydantic v2 Compatibility: All models and settings use `model_dump()` instead of `.dict()` and leverage `json_schema_extra` for environment variable field mapping, ensuring compatibility with Pydantic v2 and beyond.
- Test Suite Parity: The test suite validates API key registration and authentication using the same DB-backed logic as production, ensuring tests reflect real-world usage.
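The DB-backed key flow above can be sketched roughly as follows. This is a minimal illustration, not the project's actual schema: the `api_keys` table, `key_hash` column, and SHA-256 hashing are assumptions.

```python
import hashlib
import sqlite3


def register_api_key(db_path: str, new_key: str) -> None:
    """Persist a key hash, mirroring what POST /api/v1/apikeys would store.

    Table and column names here are illustrative, not the real schema.
    """
    digest = hashlib.sha256(new_key.encode()).hexdigest()
    with sqlite3.connect(db_path) as conn:
        conn.execute(
            "CREATE TABLE IF NOT EXISTS api_keys (key_hash TEXT PRIMARY KEY)"
        )
        conn.execute(
            "INSERT OR IGNORE INTO api_keys (key_hash) VALUES (?)", (digest,)
        )


def is_valid_api_key(db_path: str, presented_key: str) -> bool:
    """Return True if the presented key's hash exists in the database.

    Storing only a digest means a leaked DB does not reveal usable keys.
    """
    digest = hashlib.sha256(presented_key.encode()).hexdigest()
    with sqlite3.connect(db_path) as conn:
        row = conn.execute(
            "SELECT 1 FROM api_keys WHERE key_hash = ? LIMIT 1", (digest,)
        ).fetchone()
    return row is not None
```

Because registration and validation hit the same table, a key created through the API is usable immediately — the property the tests exercise.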
- OpenAI-compatible endpoints: `/v1/chat/completions`, `/v1/models`, `/health`
- Prompt classification (local only)
- Local vLLM backend integration (Llama-3-8B-Instruct)
- Redis-based response caching
- Prometheus/Grafana monitoring
- API key authentication & rate limiting (DB-backed, production-grade)
- Pydantic v2 compatible (uses `model_dump` and `json_schema_extra` for env fields)
- Robust test suite: registration/auth tests mirror production behavior
- Docker Compose deployment
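The gateway's endpoints accept standard OpenAI-style request bodies, so any OpenAI client or plain HTTP library can talk to it. A minimal sketch of building such a request (the gateway URL is hypothetical — adjust to your deployment):

```python
import json

# Hypothetical local deployment address; not defined by this project.
GATEWAY_URL = "http://localhost:8000/v1/chat/completions"


def build_chat_request(model: str, user_prompt: str) -> dict:
    """Build a minimal OpenAI-compatible chat completion payload.

    POST this body to the gateway with an
    `Authorization: Bearer <api-key>` header.
    """
    return {
        "model": model,  # must use the <provider>/<model> format
        "messages": [{"role": "user", "content": user_prompt}],
    }


payload = build_chat_request("local/llama-3-8b-instruct", "Hello!")
print(json.dumps(payload))
```

Because the body matches the OpenAI schema, existing OpenAI SDKs work unchanged by pointing their base URL at the gateway.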
- All model names must use the format `<provider>/<model>`, e.g. `local/llama-3-8b-instruct`, `openai/gpt-4.1`, `anthropic/claude-3-opus`.
- The router will parse the provider from the model name and route requests to the correct backend.
- "Smart routing" mode: If a model is available and configured, the router will load and use it automatically.
- If an unknown provider or model is provided, a clear error will be returned.
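The parsing-and-routing rule above can be sketched as follows. The set of providers shown is an illustrative subset, not the router's actual configuration:

```python
def parse_model_name(model: str) -> "tuple[str, str]":
    """Split '<provider>/<model>' into its two parts.

    Raises ValueError for names that do not match the required format,
    mirroring the clear error the router returns for bad input.
    """
    provider, sep, name = model.partition("/")
    if not sep or not provider or not name:
        raise ValueError(
            f"model name must be '<provider>/<model>', got {model!r}"
        )
    return provider, name


# Illustrative subset; the real router reads providers from its registry.
KNOWN_PROVIDERS = {"local", "openai", "anthropic"}


def route(model: str) -> str:
    """Return the backend provider for a request, or raise a clear error."""
    provider, _name = parse_model_name(model)
    if provider not in KNOWN_PROVIDERS:
        raise ValueError(f"unknown provider: {provider!r}")
    return provider
```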
The model registry database (usually `~/.agent_coder/models.db`) is the ONLY source of truth for available models. Do NOT manually edit the database; use the provided admin tools or scripts for all changes.
- The router uses a dynamic model registry that supports multiple providers (local, OpenAI, Hugging Face, OpenRouter, Google Gemini/PaLM, etc.).
- Hardware-aware model discovery can be triggered on demand to recommend the best models for your system (GPU/CPU detected automatically).
- Model registry and discovery are persisted and only refreshed when explicitly requested.
- API endpoints:
  - `GET /v1/models`: List all available models in OpenAI-compatible format
  - `POST /v1/registry/refresh`: Run hardware/model discovery and refresh the registry
  - `GET /v1/registry/status`: Get last refresh time and hardware info
- Registry metadata is stored in `~/.agent_coder/` by default.
- To update the registry after hardware or model changes, call `/v1/registry/refresh`.
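The hardware-aware side of discovery can be sketched as below. The response fields and fallback behavior are assumptions for illustration, not the gateway's actual contract:

```python
from datetime import datetime
from typing import Optional


def detect_hardware() -> dict:
    """Best-effort GPU/CPU detection, in the spirit of registry discovery.

    Falls back to CPU when torch is not installed.
    """
    info = {"device": "cpu", "gpu_name": None}
    try:
        import torch  # optional dependency, used only for GPU detection
        if torch.cuda.is_available():
            info["device"] = "cuda"
            info["gpu_name"] = torch.cuda.get_device_name(0)
    except ImportError:
        pass
    return info


def registry_status(last_refresh: Optional[datetime]) -> dict:
    """Illustrative shape of a GET /v1/registry/status response."""
    return {
        "last_refresh": last_refresh.isoformat() if last_refresh else None,
        "hardware": detect_hardware(),
    }
```

Keeping detection behind an explicit refresh call matches the design above: discovery is expensive, so its result is persisted rather than recomputed per request.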
- Python 3.9+
- `sqlite3`, `requests`, `torch` (for GPU detection), `openai` (for LLM-powered recommendations)
- Set `OPENAI_API_KEY` in your environment for discovery
See docs/SETUP.md for prerequisites and setup instructions.
- `docs/SETUP.md`: Setup & deployment
- `docs/CONFIGURATION.md`: Config reference
- `docs/API.md`: API details
- `docs/ARCHITECTURE.md`: Design & diagrams
- `docs/MONITORING.md`: Metrics & dashboards
- `docs/TROUBLESHOOTING.md`: Common issues
- The project uses a lazy environment variable lookup for `REDIS_URL` to ensure robust Redis connectivity in all environments.
- Always set `REDIS_URL=redis://localhost:6379/0` in your `.env` for local development and testing.
- See `docs/SETUP.md` for details.
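The lazy lookup mentioned above can be sketched like this. The function name and the fallback default are illustrative, not the project's actual code:

```python
import os


def get_redis_url() -> str:
    """Read REDIS_URL at call time rather than import time.

    A module-level `REDIS_URL = os.environ[...]` freezes the value when
    the module is first imported; reading it lazily lets tests and
    containers set or change the variable before any connection is made.
    """
    return os.environ.get("REDIS_URL", "redis://localhost:6379/0")
```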
This repository is structured for seamless onboarding to the Gaia Infra Platform. It follows the canonical app stack pattern described in the central Gaia ONBOARDING.md. This enables automated import, monitoring, and provisioning by the Gaia infra team or CI/CD.
```
intelligent_inference_router/
  docker-compose.yml
  .env.example
  monitoring/
    prometheus.yml
    grafana-dashboard.json (or router-cache-metrics.json)
  db-provisioning/
    # (empty or with DB manifests)
  n8n/
    # (empty or with n8n workflows)
  README.md
  .gitlab-ci.yml
```
- monitoring/: Prometheus scrape config and Grafana dashboards for this app.
- db-provisioning/: Reserved for DB manifests (leave empty if not needed).
- n8n/: Reserved for n8n workflow exports (leave empty if not needed).
- .gitlab-ci.yml: GitLab CI/CD pipeline for lint, test, build, push, and deploy.
To onboard this app into a Gaia environment, use the import script from the Gaia Infra Platform:
```shell
# Usage: ./scripts/import-app-stack.sh <path-to-app-repo> <environment>
./scripts/import-app-stack.sh ../intelligent_inference_router dev
```

This will validate the structure and copy all required files into the correct environment stack.
For more, see the Gaia Infra Platform ONBOARDING.md.
This project uses GitLab Community Edition (CE) for all CI/CD automation:
- All pipelines are defined in `.gitlab-ci.yml` at the repo root.
- Pipelines run on GitLab Runners and push images to the GitLab Container Registry.
- Key branches: `main` (production), `dev` (development), `test` (pre-prod/integration).
- All build, lint, test, and deploy steps are visible in GitLab's UI.
- lint: Checks code style (flake8, black)
- unit_test: Runs all Python tests
- build: Builds Docker image
- push: Pushes image to GitLab Container Registry
- deploy_dev: Manual deploy to dev environment
- deploy_prod: Manual deploy to prod environment
- Pushes to `main`, `dev`, or `test` trigger the pipeline.
- Manual deploy steps can be triggered from the GitLab UI.
For more, see Gaia’s ONBOARDING.md and the .gitlab-ci.yml in this repo.
As of 2025-04-24, this project no longer depends on the ml_ops repository. All code, tests, and infrastructure for the Intelligent Inference Router are now self-contained. Any prior references to ml_ops as a submodule, symlink, or dependency have been removed for clarity and maintainability.
If you require legacy utilities, refer to the archived ml_ops repository separately.
For full requirements and context, see the Memory Bank and PRD in memory-bank/.