-
Notifications
You must be signed in to change notification settings - Fork 293
Description
Issue Description
Overview
Add a new skill for Semantic Routing or Mixture-of-Models (vLLM Semantic Router) to the 19-emerging-techniques category. Semantic Routing provides system-level intelligence for Mixture-of-Models (MoM) through signal-driven decision engine and plugin chain architecture for intelligent LLM routing, security, and optimization.
What is Semantic Routing?
Semantic Routing is an intelligent routing layer that uses signal-driven decisions and plugin chains to:
- Route queries intelligently across multiple specialized models (math → Qwen-Math, code → DeepSeek-Coder)
- Optimize costs by using smaller models for simple tasks, larger models for complex ones
- Secure LLM systems with built-in jailbreak, PII, and hallucination detection
- Reduce latency through semantic caching (10-100× speedup)
- Enable model collaboration through Mixture-of-Models (MoM) architecture
Key Features
Signal-Driven Decision Engine:
- 10 signal types: keyword , embedding, domain/MMLU, fact_check, user_feedback, preference, language, latency (TPOT/TTFT), context, complexity
- Flexible combination: AND/OR operators for complex routing logic
- Multi-signal fusion: Combine signals for higher accuracy than single classifiers
Plugin Chain Architecture:
semantic-cache- 10-100× latency reduction for similar queriesjailbreak- Adversarial prompt detection and blockingpii- Personally identifiable information detectionsystem_prompt- Dynamic system prompt injection per routeheader_mutation- HTTP header manipulation for routing controlhallucination- Token-level hallucination detection during generation
Model Training:
https://huggingface.co/llm-semantic-router
Why This Belongs in Emerging Techniques
- Novel approach: System-level intelligence for MoM (vs. model-level MoE)
- Production-ready: Used in real-world vLLM deployments
- Research-backed: NeurIPS 2025 MLForSys paper, ICLR 2026 RouterArena Suggestion: Convert this repo into a Claude Marketplace #1 ranking
- Cost-effective: 80-90% cost reduction vs. always using largest model
- Active development: Regular releases, bi-weekly community meetings, AMD partnership
Proposed Skill Structure
19-emerging-techniques/semantic-routing/
├── SKILL.md # 200-500 lines main guidance
├── references/
│ ├── README.md # Architecture overview
│ ├── signals.md # 10 signal types deep dive
│ ├── plugins.md # Plugin chain architecture
│ ├── training.md # ModernBERT + LoRA training guide
│ ├── deployment.md # Docker/Kubernetes deployment
│ ├── api.md # API reference
│ └── issues.md # Common issues and solutions
└── examples/
├── basic-routing.yaml # Simple keyword routing
├── multi-signal.yaml # Complex signal combination
└── production-stack.yaml # Full production setup
Content Outline
SKILL.md (200-500 lines):
-
When to Use
- Multi-model collaboration scenarios
- Cost optimization needs
- Security requirements (jailbreak/PII/hallucination)
- Semantic caching for latency reduction
-
Quick Start
pip install vllm-sr vllm-sr serve
-
Core Concepts
- Mixture of Models (MoM) vs. Mixture of Experts (MoE)
- Signal-Driven Decisions (10 signal types overview)
- Plugin Chain Architecture (6 plugins overview)
-
Two Complete Workflows with Checklists
-
Workflow 1: Basic Multi-Model Routing
- Define signals (keyword + domain)
- Configure decision rules (AND/OR)
- Set model mappings
- Test routing decisions
- Validate routing accuracy
-
Workflow 2: Production Deployment
- Configure security plugins (jailbreak + PII)
- Enable semantic cache
- Set up monitoring metrics
- Configure multiple backend models
- Load testing
- Deploy to Kubernetes
-
-
When to Use vs Alternatives
- vs. LiteLLM (simple routing only)
- vs. LangChain Router (slow LLM-based routing)
- vs. Hand-written if-else (hard to maintain)
-
Common Issues
- Signal conflicts resolution
- Inaccurate routing decisions
- High latency troubleshooting
- Low cache hit rate optimization
- Model loading failures
references/ (300KB+ target):
- signals.md: Detailed documentation of all 10 signal types with configuration examples, latency comparison, use cases, and combination strategies
- plugins.md: Deep dive into 6 plugins, plugin development guide, execution order
- training.md: Why ModernBERT, 4 classifier models, LoRA training methodology, datasets, performance metrics
- deployment.md: Docker Compose, Kubernetes + Helm, production configuration, performance tuning, observability
- api.md: OpenAI-compatible API, routing API, classification API, configuration API
- issues.md: Real GitHub issues, common errors and solutions, debugging methods
examples/:
- basic-routing.yaml: Simple keyword-based routing
- multi-signal.yaml: Multi-signal combination (keyword + domain + embedding)
- production-stack.yaml: Full production config with plugins, monitoring, multiple models
Key Highlights to Emphasize
Why Use Semantic Router?
- Cost optimization: Use Llama-3-8B for simple queries, GPT-4 for complex ones
- Quality improvement: Route math to Qwen-Math, code to DeepSeek-Coder
- Security built-in: Jailbreak, PII, hallucination detection
- Performance boost: 10-100× latency reduction via semantic cache
Core Advantages:
- Multi-signal fusion: 10 signals combined > single classifier
- Low latency: keyword 1ms, embedding 10-50ms, domain 50-100ms
- Extensible: Plugin architecture for custom signals and processing
- Production-ready: Kubernetes-native, Prometheus metrics, OpenTelemetry tracing
Resources
- GitHub: https://github.com/vllm-project/semantic-router (513 source files)
- Documentation: https://vllm-semantic-router.com (24,000+ lines)
- Paper: When to Reason: Semantic Router for vLLM (NeurIPS 2025)
- Blog: https://blog.vllm.ai/2025/09/11/semantic-router.html
- Community: vLLM Slack #semantic-router channel
Acceptance Criteria
- SKILL.md with proper YAML frontmatter (
name: semantic-routing) - 200-500 lines of focused guidance in SKILL.md
- 300KB+ reference documentation from official sources
- At least 2 complete workflows with checklists
- Code examples with language tags (
yaml,bash, ```python) - "When to use vs alternatives" section
- Common issues and solutions section
- References one level deep from SKILL.md (no nested references)
- Examples directory with 3 runnable configuration files
Related Skills
- 12-inference-serving/vllm - vLLM inference engine (backend for semantic router)
- 14-agents/langchain - Agent frameworks that can benefit from intelligent routing
- 15-rag - RAG systems that benefit from semantic caching and routing
- 16-prompt-engineering/dspy - Prompt optimization with routing decisions
Labels: enhancement, new-skill, emerging-techniques, documentation