Skip to content

Add Semantic Routing or Mixture-of-Models skill to Emerging Techniques #23

@Xunzhuo

Description

@Xunzhuo

Issue Description

Overview

Add a new skill for Semantic Routing or Mixture-of-Models (vLLM Semantic Router) to the 19-emerging-techniques category. Semantic Routing provides system-level intelligence for Mixture-of-Models (MoM) through signal-driven decision engine and plugin chain architecture for intelligent LLM routing, security, and optimization.

What is Semantic Routing?

Semantic Routing is an intelligent routing layer that uses signal-driven decisions and plugin chains to:

  1. Route queries intelligently across multiple specialized models (math → Qwen-Math, code → DeepSeek-Coder)
  2. Optimize costs by using smaller models for simple tasks, larger models for complex ones
  3. Secure LLM systems with built-in jailbreak, PII, and hallucination detection
  4. Reduce latency through semantic caching (10-100× speedup)
  5. Enable model collaboration through Mixture-of-Models (MoM) architecture

Key Features

Signal-Driven Decision Engine:

  • 10 signal types: keyword , embedding, domain/MMLU, fact_check, user_feedback, preference, language, latency (TPOT/TTFT), context, complexity
  • Flexible combination: AND/OR operators for complex routing logic
  • Multi-signal fusion: Combine signals for higher accuracy than single classifiers

Plugin Chain Architecture:

  • semantic-cache - 10-100× latency reduction for similar queries
  • jailbreak - Adversarial prompt detection and blocking
  • pii - Personally identifiable information detection
  • system_prompt - Dynamic system prompt injection per route
  • header_mutation - HTTP header manipulation for routing control
  • hallucination - Token-level hallucination detection during generation

Model Training:
https://huggingface.co/llm-semantic-router

Why This Belongs in Emerging Techniques

  1. Novel approach: System-level intelligence for MoM (vs. model-level MoE)
  2. Production-ready: Used in real-world vLLM deployments
  3. Research-backed: NeurIPS 2025 MLForSys paper, ICLR 2026 RouterArena Suggestion: Convert this repo into a Claude Marketplace #1 ranking
  4. Cost-effective: 80-90% cost reduction vs. always using largest model
  5. Active development: Regular releases, bi-weekly community meetings, AMD partnership

Proposed Skill Structure

19-emerging-techniques/semantic-routing/
├── SKILL.md                    # 200-500 lines main guidance
├── references/
│   ├── README.md              # Architecture overview
│   ├── signals.md             # 10 signal types deep dive
│   ├── plugins.md             # Plugin chain architecture
│   ├── training.md            # ModernBERT + LoRA training guide
│   ├── deployment.md          # Docker/Kubernetes deployment
│   ├── api.md                 # API reference
│   └── issues.md              # Common issues and solutions
└── examples/
    ├── basic-routing.yaml     # Simple keyword routing
    ├── multi-signal.yaml      # Complex signal combination
    └── production-stack.yaml  # Full production setup

Content Outline

SKILL.md (200-500 lines):

  1. When to Use

    • Multi-model collaboration scenarios
    • Cost optimization needs
    • Security requirements (jailbreak/PII/hallucination)
    • Semantic caching for latency reduction
  2. Quick Start

    pip install vllm-sr
    vllm-sr serve
  3. Core Concepts

    • Mixture of Models (MoM) vs. Mixture of Experts (MoE)
    • Signal-Driven Decisions (10 signal types overview)
    • Plugin Chain Architecture (6 plugins overview)
  4. Two Complete Workflows with Checklists

    • Workflow 1: Basic Multi-Model Routing

      • Define signals (keyword + domain)
      • Configure decision rules (AND/OR)
      • Set model mappings
      • Test routing decisions
      • Validate routing accuracy
    • Workflow 2: Production Deployment

      • Configure security plugins (jailbreak + PII)
      • Enable semantic cache
      • Set up monitoring metrics
      • Configure multiple backend models
      • Load testing
      • Deploy to Kubernetes
  5. When to Use vs Alternatives

    • vs. LiteLLM (simple routing only)
    • vs. LangChain Router (slow LLM-based routing)
    • vs. Hand-written if-else (hard to maintain)
  6. Common Issues

    • Signal conflicts resolution
    • Inaccurate routing decisions
    • High latency troubleshooting
    • Low cache hit rate optimization
    • Model loading failures

references/ (300KB+ target):

  • signals.md: Detailed documentation of all 10 signal types with configuration examples, latency comparison, use cases, and combination strategies
  • plugins.md: Deep dive into 6 plugins, plugin development guide, execution order
  • training.md: Why ModernBERT, 4 classifier models, LoRA training methodology, datasets, performance metrics
  • deployment.md: Docker Compose, Kubernetes + Helm, production configuration, performance tuning, observability
  • api.md: OpenAI-compatible API, routing API, classification API, configuration API
  • issues.md: Real GitHub issues, common errors and solutions, debugging methods

examples/:

  • basic-routing.yaml: Simple keyword-based routing
  • multi-signal.yaml: Multi-signal combination (keyword + domain + embedding)
  • production-stack.yaml: Full production config with plugins, monitoring, multiple models

Key Highlights to Emphasize

Why Use Semantic Router?

  • Cost optimization: Use Llama-3-8B for simple queries, GPT-4 for complex ones
  • Quality improvement: Route math to Qwen-Math, code to DeepSeek-Coder
  • Security built-in: Jailbreak, PII, hallucination detection
  • Performance boost: 10-100× latency reduction via semantic cache

Core Advantages:

  1. Multi-signal fusion: 10 signals combined > single classifier
  2. Low latency: keyword 1ms, embedding 10-50ms, domain 50-100ms
  3. Extensible: Plugin architecture for custom signals and processing
  4. Production-ready: Kubernetes-native, Prometheus metrics, OpenTelemetry tracing

Resources

Acceptance Criteria

  • SKILL.md with proper YAML frontmatter (name: semantic-routing)
  • 200-500 lines of focused guidance in SKILL.md
  • 300KB+ reference documentation from official sources
  • At least 2 complete workflows with checklists
  • Code examples with language tags (yaml, bash, ```python)
  • "When to use vs alternatives" section
  • Common issues and solutions section
  • References one level deep from SKILL.md (no nested references)
  • Examples directory with 3 runnable configuration files

Related Skills

  • 12-inference-serving/vllm - vLLM inference engine (backend for semantic router)
  • 14-agents/langchain - Agent frameworks that can benefit from intelligent routing
  • 15-rag - RAG systems that benefit from semantic caching and routing
  • 16-prompt-engineering/dspy - Prompt optimization with routing decisions

Labels: enhancement, new-skill, emerging-techniques, documentation

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions