Velox AI is a high-performance, full-duplex conversational voice engine designed to bridge the gap between standard chatbots and human-like voice interaction. Unlike traditional voice assistants that rely on "turn-based" communication (listen -> stop -> think -> speak), Velox operates on a continuous streaming architecture.
It features instant barge-in (interruptibility), sub-700ms latency, and a provider-agnostic backend that decouples the core logic from specific AI vendors. The system is built to handle the chaos of real-world audio—interruptions, background noise, and network jitter—while maintaining a fluid conversational flow.
The architecture follows a Client-Server-Service model, heavily utilizing Event-Driven Design to manage asynchronous data streams. The system is designed to be "Frontend Agnostic" (the client is dumb) and "Backend Intelligent" (the server manages state).
```mermaid
graph TD
%% --- STYLING ---
classDef frontend fill:#e1f5fe,stroke:#01579b,stroke-width:2px,color:#000;
classDef backend fill:#fff3e0,stroke:#e65100,stroke-width:2px,color:#000;
classDef aiService fill:#e8f5e9,stroke:#1b5e20,stroke-width:2px,stroke-dasharray: 5 5,color:#000;
classDef userNode fill:#f3e5f5,stroke:#4a148c,stroke-width:2px,color:#000;
classDef hardware fill:#e0e0e0,stroke:#616161,stroke-width:2px,color:#000;
%% --- NODES ---
User((👤 User)):::userNode
Speakers[🔊 Device Speakers]:::hardware
subgraph Frontend [Frontend Layer]
Client[💻 Next.js Client<br/><i>AudioWorklet & WebSocket</i>]:::frontend
end
subgraph Server_Layer [Orchestration Layer]
Server[⚙️ FastAPI Server]:::backend
VAD_Logic{⚡ Smart VAD<br/><i>Interruption Logic</i>}:::backend
Queue[📥 Event Queue]:::backend
end
subgraph Cloud_AI [AI Infrastructure]
direction TB
STT[👂 STT<br/><i>Deepgram / Gladia</i>]:::aiService
LLM[🧠 LLM<br/><i>Groq / Cerebras</i>]:::aiService
TTS[🗣️ TTS<br/><i>Deepgram / Piper</i>]:::aiService
end
%% --- DATA FLOW ---
%% 1. Input
User ==> |"🎤 Raw PCM Stream"| Client
Client <==> |"🔌 WebSocket (Int16)"| Server
%% 2. Processing
Server ==> |"Audio Stream"| STT
STT --> |"Transcript Events"| VAD_Logic
%% 3. Decision
VAD_Logic -- "User Spoke?" --> Queue
Queue --> |"Context"| LLM
%% 4. Generation
LLM --> |"Tokens"| TTS
TTS --> |"Synthesized Audio"| Server
%% 5. Output
Server -.-> |"Buffered Audio"| Client
Client ==> |"Playback"| Speakers
```
Instead of heavy client-side processing, Velox uses a lightweight "Thin Client" architecture to ensure performance on low-end mobile devices.
- AudioWorklet Processing: A custom `AudioProcessor` runs on the browser's high-priority audio thread.
- Client-Side Resampling: Raw 48kHz audio is downsampled to 16kHz and converted to Int16 (PCM) directly in the Worklet using linear interpolation (see the sketch after this list).
- Result: The main JavaScript thread is bypassed entirely for audio handling, preventing UI work (React re-renders) from causing audio glitches.
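The resampling math itself is simple; the sketch below shows it in Python (with NumPy) for readability, assuming Float32 input frames in the range [-1, 1], rather than reproducing the actual AudioWorklet code.

```python
import numpy as np


def downsample_to_int16(frame: np.ndarray, src_rate: int = 48_000,
                        dst_rate: int = 16_000) -> np.ndarray:
    """Linear-interpolation resample of a Float32 frame, then Int16 conversion."""
    n_out = int(len(frame) * dst_rate / src_rate)
    # Positions of the output samples on the input sample timeline.
    positions = np.linspace(0, len(frame) - 1, n_out)
    resampled = np.interp(positions, np.arange(len(frame)), frame)
    # Clamp to [-1, 1] and scale to signed 16-bit PCM.
    return (np.clip(resampled, -1.0, 1.0) * 32767).astype(np.int16)
```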
The core of Velox is a FastAPI backend running a non-blocking asyncio event loop. It manages the delicate state machine of the conversation.
- State Management: Tracks `is_user_speaking`, `is_ai_speaking`, and `processing` flags to prevent race conditions.
- Smart Barge-In (Server-Side VAD):
  - Velox implements Transcript-Based VAD using Deepgram's `UtteranceEnd` signals.
  - Logic: Instead of cutting audio on loud noises (energy VAD), the system only interrupts when it detects valid phonemes/words. This eliminates false positives from background noise (e.g., a door slamming). A minimal sketch of this flag handling appears below.
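The sketch below shows how these flags might interact; the event-handler names and the `cancel_tts` helper are illustrative assumptions, not the exact Velox internals.

```python
class ConversationState:
    """Tracks who is talking so transcript events can trigger barge-in."""

    def __init__(self) -> None:
        self.is_user_speaking = False
        self.is_ai_speaking = False
        self.processing = False

    async def on_transcript(self, event: dict) -> None:
        # Only real words interrupt; energy spikes (door slams) never reach here.
        if event.get("transcript"):
            self.is_user_speaking = True
            if self.is_ai_speaking:
                await self.cancel_tts()   # barge-in: stop playback mid-sentence

    async def on_utterance_end(self) -> None:
        # UtteranceEnd-style signal: the user has finished talking.
        self.is_user_speaking = False
        self.processing = True            # hand the turn to the LLM

    async def cancel_tts(self) -> None:
        # Illustrative stub: flush the outgoing audio queue and mark the AI silent.
        self.is_ai_speaking = False
```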
Velox implements the Adapter Design Pattern to ensure no vendor lock-in. The core logic interacts with generic `LLMProvider` and `STTProvider` interfaces.
- Swappable Providers: The system can switch between Groq (Llama 3) for speed and OpenAI (GPT-4) for complexity, depending on the user's needs.
- Adapter Implementation:
```python
# The system doesn't care if it's Groq or Cerebras
class LLMService:
    async def stream(self, provider_name: str, messages: list):
        adapter = self.adapters.get(provider_name)
        return adapter.generate(messages)
```
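To make the pattern concrete, here is a minimal, hypothetical sketch of the provider interface plus one adapter. The `LLMProvider` name comes from the description above; the `GroqAdapter` wiring assumes an OpenAI-compatible async client and is illustrative rather than the exact Velox implementation.

```python
from abc import ABC, abstractmethod
from typing import AsyncIterator


class LLMProvider(ABC):
    """Generic interface the core logic depends on (vendor-agnostic)."""

    @abstractmethod
    def generate(self, messages: list) -> AsyncIterator[str]:
        """Stream response tokens for the given chat history."""


class GroqAdapter(LLMProvider):
    """Hypothetical adapter wrapping an OpenAI-compatible async client."""

    def __init__(self, client, model: str = "llama3-70b-8192"):
        self.client = client
        self.model = model

    async def generate(self, messages: list) -> AsyncIterator[str]:
        # The vendor-specific call is isolated here; swapping vendors means
        # writing a new adapter, not touching the core logic.
        stream = await self.client.chat.completions.create(
            model=self.model, messages=messages, stream=True
        )
        async for chunk in stream:
            delta = chunk.choices[0].delta.content
            if delta:
                yield delta
```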
Streaming audio over the internet introduces packet jitter. Velox handles this via a two-stage buffering strategy:
- Server-Side Aggregation: The backend buffers tiny TTS chunks into larger, playback-ready packets before sending (see the sketch after this list).
- Client-Side Jitter Buffer: The frontend maintains a dynamic ~150ms buffer to smooth out network irregularities, ensuring the AI's voice doesn't sound "robotic" on 4G networks.
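As an illustration of the first stage, the sketch below aggregates TTS chunks into roughly 100ms packets before sending. The window size and helper name are assumptions rather than the exact Velox values; the frame math follows the 16kHz Int16 pipeline described above.

```python
SAMPLE_RATE = 16_000       # Hz, matches the Int16 PCM pipeline
BYTES_PER_SAMPLE = 2       # Int16
TARGET_MS = 100            # assumed aggregation window (~100 ms)
TARGET_BYTES = SAMPLE_RATE * BYTES_PER_SAMPLE * TARGET_MS // 1000


async def aggregate_tts(tts_chunks, websocket):
    """Buffer tiny TTS chunks into ~100 ms packets before sending."""
    buffer = bytearray()
    async for chunk in tts_chunks:              # chunks may be only a few ms each
        buffer.extend(chunk)
        while len(buffer) >= TARGET_BYTES:
            await websocket.send_bytes(bytes(buffer[:TARGET_BYTES]))
            del buffer[:TARGET_BYTES]
    if buffer:                                  # flush the remainder at utterance end
        await websocket.send_bytes(bytes(buffer))
```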
This is the exact lifecycle of a single user interaction, demonstrating the parallelism of the system.
- INPUT: User speaks "What is the price?"
- Client sends Int16 binary stream -> Server.
- DETECTION:
- Server forwards audio -> Deepgram.
- Deepgram detects "UtteranceEnd" (User stopped talking).
- GENERATION:
- Server signals LLM (Groq) immediately.
- Time-to-First-Token: <200ms.
- SYNTHESIS:
- LLM streams text tokens -> TTS Service.
- TTS streams audio bytes -> Server.
- PLAYBACK:
- Server flushes audio queue -> Client WebSocket.
- Client plays audio via Web Audio API.
Total Round-Trip Latency: ~700ms (Network Dependent)
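A minimal sketch of how the uplink (mic to STT) and downlink (TTS to speakers) can run concurrently on one FastAPI WebSocket, which is what makes this parallelism possible; the endpoint path and the omitted STT/LLM/TTS wiring are assumptions, not the exact Velox code.

```python
import asyncio
from fastapi import FastAPI, WebSocket

app = FastAPI()


@app.websocket("/ws/voice")                    # assumed endpoint path
async def voice_session(ws: WebSocket) -> None:
    await ws.accept()
    # Audio synthesized by the TTS stage is pushed onto this queue by the
    # STT -> LLM -> TTS pipeline (omitted in this sketch).
    outbound: asyncio.Queue[bytes] = asyncio.Queue()

    async def uplink() -> None:
        # Client -> Server: read raw Int16 PCM frames as they arrive.
        while True:
            pcm = await ws.receive_bytes()
            _ = pcm                            # forward to the STT provider in the real pipeline

    async def downlink() -> None:
        # Server -> Client: flush buffered TTS packets as soon as they exist.
        while True:
            packet = await outbound.get()
            await ws.send_bytes(packet)

    # Full duplex: both directions share one socket and run concurrently.
    await asyncio.gather(uplink(), downlink())
```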
| Feature | Implementation | Benefit |
|---|---|---|
| Full-Duplex | Raw WebSockets | Allows speaking and listening simultaneously. |
| Semantic Barge-In | Transcript VAD | AI stops only for speech, ignores noise. |
| Optimized Audio | Int16 PCM | Half the payload of Float32; ready for AI ingestion. |
| Resiliency | Auto-Reconnection | Handles socket drops/network switching gracefully. |
| Modularity | Adapter Pattern | Zero refactoring required to change AI models. |
- Framework: Next.js 14 (React)
- Language: TypeScript
- Styling: TailwindCSS
- Core: Python
- Framework: FastAPI (Uvicorn)
- Concurrency: AsyncIO (`async`/`await`)
- Networking: WebSockets
- LLM Inference: Groq, Cerebras
- Speech-to-Text (STT): Deepgram, Gladia
- Text-to-Speech (TTS): Deepgram (Cloud), Piper (Local)
- Orchestration: Custom Event Loop State Machine
