⚡ VELOX AI

Ultra-Low Latency Real-Time Voice Orchestration Engine

Velox AI Banner

📖 Project Overview

Velox AI is a high-performance, full-duplex conversational voice engine designed to bridge the gap between standard chatbots and human-like voice interaction. Unlike traditional voice assistants that rely on "turn-based" communication (listen -> stop -> think -> speak), Velox operates on a continuous streaming architecture.

It features instant barge-in (interruptibility), sub-700ms latency, and a provider-agnostic backend that decouples the core logic from specific AI vendors. The system is built to handle the chaos of real-world audio—interruptions, background noise, and network jitter—while maintaining a fluid conversational flow.


🏗️ System Architecture

The architecture follows a Client-Server-Service model, heavily utilizing Event-Driven Design to manage asynchronous data streams. The system is designed to be "Frontend Agnostic" (the client is dumb) and "Backend Intelligent" (the server manages state).

High-Level Data Flow

graph TD
    %% --- STYLING ---
    classDef frontend fill:#e1f5fe,stroke:#01579b,stroke-width:2px,color:#000;
    classDef backend fill:#fff3e0,stroke:#e65100,stroke-width:2px,color:#000;
    classDef aiService fill:#e8f5e9,stroke:#1b5e20,stroke-width:2px,stroke-dasharray: 5 5,color:#000;
    classDef userNode fill:#f3e5f5,stroke:#4a148c,stroke-width:2px,color:#000;
    classDef hardware fill:#e0e0e0,stroke:#616161,stroke-width:2px,color:#000;

    %% --- NODES ---
    User((👤 User)):::userNode
    Speakers[🔊 Device Speakers]:::hardware

    subgraph Frontend [Frontend Layer]
        Client[💻 Next.js Client<br/><i>AudioWorklet & WebSocket</i>]:::frontend
    end

    subgraph Server_Layer [Orchestration Layer]
        Server[⚙️ FastAPI Server]:::backend
        VAD_Logic{⚡ Smart VAD<br/><i>Interruption Logic</i>}:::backend
        Queue[📥 Event Queue]:::backend
    end

    subgraph Cloud_AI [AI Infrastructure]
        direction TB
        STT[👂 STT<br/><i>Deepgram / Gladia</i>]:::aiService
        LLM[🧠 LLM<br/><i>Groq / Cerebras</i>]:::aiService
        TTS[🗣️ TTS<br/><i>Deepgram / Piper</i>]:::aiService
    end

    %% --- DATA FLOW ---
    
    %% 1. Input
    User ==> |"🎤 Raw PCM Stream"| Client
    Client <==> |"🔌 WebSocket (Int16)"| Server

    %% 2. Processing
    Server ==> |"Audio Stream"| STT
    STT --> |"Transcript Events"| VAD_Logic
    
    %% 3. Decision
    VAD_Logic -- "User Spoke?" --> Queue
    Queue --> |"Context"| LLM

    %% 4. Generation
    LLM --> |"Tokens"| TTS
    TTS --> |"Synthesized Audio"| Server
    
    %% 5. Output
    Server -.-> |"Buffered Audio"| Client
    Client ==> |"Playback"| Speakers

⚙️ The Engineering Pipeline

1. The Ingestion Layer ("The Dumb Client")

Instead of heavy client-side processing, Velox uses a lightweight "Thin Client" architecture to ensure performance on low-end mobile devices.

  • AudioWorklet Processing: A custom AudioProcessor runs on the browser's high-priority audio thread.
  • Client-Side Resampling: Raw 48kHz audio is downsampled to 16kHz and converted to Int16 (PCM) directly in the Worklet using Linear Interpolation (sketched after this list).
  • Result: The main JavaScript thread is bypassed entirely for audio handling, preventing UI freezes (React re-renders) from causing audio glitches.
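
For illustration, here is a minimal NumPy sketch of the resampling math. The production code runs in JavaScript inside the AudioWorklet; the function name and NumPy usage below are assumptions for readability, not the repo's actual implementation.

    import numpy as np

    def downsample_to_int16(samples: np.ndarray,
                            src_rate: int = 48000,
                            dst_rate: int = 16000) -> np.ndarray:
        """Linear-interpolation downsample, then quantize Float32 -> Int16 PCM."""
        n_out = int(len(samples) * dst_rate / src_rate)
        # Where each output sample falls on the input timeline
        positions = np.arange(n_out) * (src_rate / dst_rate)
        resampled = np.interp(positions, np.arange(len(samples)), samples)
        # Clamp to [-1, 1] and scale to the Int16 range
        return (np.clip(resampled, -1.0, 1.0) * 32767).astype(np.int16)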

2. The Orchestration Layer ("The Event Loop")

The core of Velox is a FastAPI backend running a non-blocking asyncio event loop. It manages the delicate state machine of the conversation.

  • State Management: Tracks is_user_speaking, is_ai_speaking, and processing flags to prevent race conditions.
  • Smart Barge-In (Server-Side VAD):
    • Velox implements Transcript-Based VAD utilizing Deepgram's UtteranceEnd signals.
    • Logic: Instead of cutting audio on loud noises (energy VAD), the system only interrupts when it detects valid phonemes/words. This eliminates false positives from background noise (e.g., a door slamming). See the sketch below.
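
A simplified sketch of that decision; the session object and cancel_tts helper are illustrative names, not the repo's actual API:

    # Barge-in check: fires only on transcribed words, never on raw energy.
    async def on_transcript(session, transcript: str):
        if not transcript.strip():
            return  # noise without words never reaches the interrupt path
        session.is_user_speaking = True
        if session.is_ai_speaking:
            await session.cancel_tts()  # stop playback mid-sentence
            session.is_ai_speaking = False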

3. The Intelligence Layer (Provider-Agnostic Design)

Velox implements the Adapter Design Pattern to ensure no vendor lock-in. The core logic interacts with generic LLMProvider and STTProvider interfaces.

  • Swappable Providers: The system can switch between Groq (Llama 3) for speed and OpenAI (GPT-4) for complexity, depending on user preference.
  • Adapter Implementation:
    # The system doesn't care if it's Groq or Cerebras
    class LLMService:
        def __init__(self, adapters: dict):
            self.adapters = adapters  # provider name -> adapter instance

        async def stream(self, provider_name: str, messages: list):
            adapter = self.adapters.get(provider_name)
            if adapter is None:
                raise ValueError(f"Unknown provider: {provider_name}")
            return adapter.generate(messages)
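
At the call site, swapping vendors is a one-argument change (the llm_service instance here is illustrative):

    token_stream = await llm_service.stream("groq", messages)        # prioritize speed
    # token_stream = await llm_service.stream("cerebras", messages)  # alternative backend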

4. The Synthesis Layer (Jitter-Free Playback)

Streaming audio over the internet introduces packet jitter. Velox handles this via a two-stage buffering strategy:

  1. Server-Side Aggregation: The backend buffers tiny TTS chunks into playable packets before sending (sketched after this list).
  2. Client-Side Jitter Buffer: The frontend maintains a dynamic ~150ms buffer to smooth out network irregularities, ensuring the AI's voice doesn't sound "robotic" on 4G networks.
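
A sketch of stage 1, assuming a FastAPI/Starlette WebSocket; the flush threshold and function name are assumptions, and the real value would be tuned against measured jitter:

    MIN_PACKET_BYTES = 3200  # ~100ms of 16kHz Int16 mono audio

    async def aggregate_tts(tts_chunks, websocket):
        buffer = bytearray()
        async for chunk in tts_chunks:
            buffer.extend(chunk)
            if len(buffer) >= MIN_PACKET_BYTES:
                await websocket.send_bytes(bytes(buffer))
                buffer.clear()
        if buffer:  # flush the tail so the utterance isn't cut short
            await websocket.send_bytes(bytes(buffer))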

🔄 Sequence of a "Turn" (The Lifecycle)

This is the exact lifecycle of a single user interaction, demonstrating the parallelism of the system (sketched in code after the latency summary).

  1. INPUT: User speaks "What is the price?"
    • Client sends Int16 binary stream -> Server.
  2. DETECTION:
    • Server forwards audio -> Deepgram.
    • Deepgram detects "UtteranceEnd" (User stopped talking).
  3. GENERATION:
    • Server signals LLM (Groq) immediately.
    • Time-to-First-Token: <200ms.
  4. SYNTHESIS:
    • LLM streams text tokens -> TTS Service.
    • TTS streams audio bytes -> Server.
  5. PLAYBACK:
    • Server flushes audio queue -> Client WebSocket.
    • Client plays audio via Web Audio API.

Total Round-Trip Latency: ~700ms (Network Dependent)
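
The latency budget works because stages 3-5 overlap: tokens stream into TTS while the LLM is still generating. A minimal asyncio sketch of that handoff, with illustrative llm/tts interfaces:

    # Sentence-level pipelining: playback starts on the first complete
    # sentence while later tokens are still arriving from the LLM.
    async def respond(llm, tts, websocket, messages):
        sentence = ""
        async for token in await llm.stream("groq", messages):
            sentence += token
            if sentence.rstrip().endswith((".", "?", "!")):
                async for audio in tts.synthesize(sentence):
                    await websocket.send_bytes(audio)  # PLAYBACK begins here
                sentence = ""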


🛠️ Key Technical Features

Feature            Implementation     Benefit
Full-Duplex        Raw WebSockets     Allows speaking and listening simultaneously.
Semantic Barge-In  Transcript VAD     AI stops only for speech, ignores noise.
Optimized Audio    Int16 PCM          3x smaller payload than Float32; ready for AI ingestion.
Resiliency         Auto-Reconnection  Handles socket drops/network switching gracefully.
Modularity         Adapter Pattern    Zero refactoring required to change AI models.

💻 Tech Stack

Frontend (Client)

  • Framework: Next.js 14 (React)
  • Language: TypeScript
  • Styling: TailwindCSS

Backend (Server)

  • Core: Python
  • Framework: FastAPI (Uvicorn)
  • Concurrency: AsyncIO (async/await)
  • Networking: WebSockets

AI Infrastructure

  • LLM Inference: Groq, Cerebras
  • Speech-to-Text (STT): Deepgram, Gladia
  • Text-to-Speech (TTS): Deepgram (Cloud), Piper (Local)
  • Orchestration: Custom Event Loop State Machine