Velox AI is a high-performance, full-duplex conversational voice engine designed to bridge the gap between standard chatbots and human-like voice interaction. Unlike traditional voice assistants that rely on "turn-based" communication (listen -> stop -> think -> speak), Velox operates on a continuous streaming architecture.
It features instant barge-in (interruptibility), sub-700ms latency, and a provider-agnostic backend that decouples the core logic from specific AI vendors. The system is built to handle the chaos of real-world audio—interruptions, background noise, and network jitter—while maintaining a fluid conversational flow.
The architecture follows a Client-Server-Service model, heavily utilizing Event-Driven Design to manage asynchronous data streams. The system is designed to be "Frontend Agnostic" (the client is dumb) and "Backend Intelligent" (the server manages state).
```mermaid
graph TD
%% --- STYLING ---
classDef frontend fill:#e1f5fe,stroke:#01579b,stroke-width:2px,color:#000;
classDef backend fill:#fff3e0,stroke:#e65100,stroke-width:2px,color:#000;
classDef aiService fill:#e8f5e9,stroke:#1b5e20,stroke-width:2px,stroke-dasharray: 5 5,color:#000;
classDef userNode fill:#f3e5f5,stroke:#4a148c,stroke-width:2px,color:#000;
classDef hardware fill:#e0e0e0,stroke:#616161,stroke-width:2px,color:#000;
%% --- NODES ---
User((👤 User)):::userNode
Speakers[🔊 Device Speakers]:::hardware
subgraph Frontend [Frontend Layer]
Client[💻 Next.js Client<br/><i>AudioWorklet & WebSocket</i>]:::frontend
end
subgraph Server_Layer [Orchestration Layer]
Server[⚙️ FastAPI Server]:::backend
VAD_Logic{⚡ Smart VAD<br/><i>Interruption Logic</i>}:::backend
Queue[📥 Event Queue]:::backend
end
subgraph Cloud_AI [AI Infrastructure]
direction TB
STT[👂 STT<br/><i>Deepgram / Gladia</i>]:::aiService
LLM[🧠 LLM<br/><i>Groq / Cerebras</i>]:::aiService
TTS[🗣️ TTS<br/><i>Deepgram / Piper</i>]:::aiService
end
%% --- DATA FLOW ---
%% 1. Input
User ==> |"🎤 Raw PCM Stream"| Client
Client <==> |"🔌 WebSocket (Int16)"| Server
%% 2. Processing
Server ==> |"Audio Stream"| STT
STT --> |"Transcript Events"| VAD_Logic
%% 3. Decision
VAD_Logic -- "User Spoke?" --> Queue
Queue --> |"Context"| LLM
%% 4. Generation
LLM --> |"Tokens"| TTS
TTS --> |"Synthesized Audio"| Server
%% 5. Output
Server -.-> |"Buffered Audio"| Client
Client ==> |"Playback"| Speakers
```
Instead of heavy client-side processing, Velox uses a lightweight "Thin Client" architecture to ensure performance on low-end mobile devices.
- AudioWorklet Processing: A custom `AudioProcessor` runs on the browser's high-priority audio thread.
- Client-Side Resampling: Raw 48kHz audio is downsampled to 16kHz and converted to Int16 (PCM) directly in the Worklet using linear interpolation (see the sketch after this list).
- Result: The main JavaScript thread is bypassed entirely for audio handling, preventing UI work (React re-renders) from causing audio glitches.
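The resampling math itself is simple; the sketch below shows it in Python (with NumPy) for readability, assuming Float32 input frames in the range [-1, 1], rather than reproducing the actual AudioWorklet code.

```python
import numpy as np


def downsample_to_int16(frame: np.ndarray, src_rate: int = 48_000,
                        dst_rate: int = 16_000) -> np.ndarray:
    """Linear-interpolation resample of a Float32 frame, then Int16 conversion."""
    n_out = int(len(frame) * dst_rate / src_rate)
    # Positions of the output samples on the input sample timeline.
    positions = np.linspace(0, len(frame) - 1, n_out)
    resampled = np.interp(positions, np.arange(len(frame)), frame)
    # Clamp to [-1, 1] and scale to signed 16-bit PCM.
    return (np.clip(resampled, -1.0, 1.0) * 32767).astype(np.int16)
```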
The core of Velox is a FastAPI backend running a non-blocking asyncio event loop. It manages the delicate state machine of the conversation.
- State Management: Tracks `is_user_speaking`, `is_ai_speaking`, and `processing` flags to prevent race conditions.
- Smart Barge-In (Server-Side VAD):
  - Velox implements Transcript-Based VAD using Deepgram's `UtteranceEnd` signals.
  - Logic: Instead of cutting audio on loud noises (energy VAD), the system only interrupts when it detects valid phonemes/words. This eliminates false positives from background noise (e.g., a door slamming). A minimal sketch of this flag handling appears below.
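The sketch below shows how these flags might interact; the event-handler names and the `cancel_tts` helper are illustrative assumptions, not the exact Velox internals.

```python
class ConversationState:
    """Tracks who is talking so transcript events can trigger barge-in."""

    def __init__(self) -> None:
        self.is_user_speaking = False
        self.is_ai_speaking = False
        self.processing = False

    async def on_transcript(self, event: dict) -> None:
        # Only real words interrupt; energy spikes (door slams) never reach here.
        if event.get("transcript"):
            self.is_user_speaking = True
            if self.is_ai_speaking:
                await self.cancel_tts()   # barge-in: stop playback mid-sentence

    async def on_utterance_end(self) -> None:
        # UtteranceEnd-style signal: the user has finished talking.
        self.is_user_speaking = False
        self.processing = True            # hand the turn to the LLM

    async def cancel_tts(self) -> None:
        # Illustrative stub: flush the outgoing audio queue and mark the AI silent.
        self.is_ai_speaking = False
```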
Velox implements the Adapter Design Pattern to ensure no vendor lock-in. The core logic interacts with generic `LLMProvider` and `STTProvider` interfaces.
- Swappable Providers: The system can switch between Groq (Llama 3) for speed and OpenAI (GPT-4) for complexity, depending on the user's needs.
- Adapter Implementation:
```python
# The system doesn't care if it's Groq or Cerebras
class LLMService:
    async def stream(self, provider_name: str, messages: list):
        adapter = self.adapters.get(provider_name)
        return adapter.generate(messages)
```
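To make the pattern concrete, here is a minimal, hypothetical sketch of the provider interface plus one adapter. The `LLMProvider` name comes from the description above; the `GroqAdapter` wiring assumes an OpenAI-compatible async client and is illustrative rather than the exact Velox implementation.

```python
from abc import ABC, abstractmethod
from typing import AsyncIterator


class LLMProvider(ABC):
    """Generic interface the core logic depends on (vendor-agnostic)."""

    @abstractmethod
    def generate(self, messages: list) -> AsyncIterator[str]:
        """Stream response tokens for the given chat history."""


class GroqAdapter(LLMProvider):
    """Hypothetical adapter wrapping an OpenAI-compatible async client."""

    def __init__(self, client, model: str = "llama3-70b-8192"):
        self.client = client
        self.model = model

    async def generate(self, messages: list) -> AsyncIterator[str]:
        # The vendor-specific call is isolated here; swapping vendors means
        # writing a new adapter, not touching the core logic.
        stream = await self.client.chat.completions.create(
            model=self.model, messages=messages, stream=True
        )
        async for chunk in stream:
            delta = chunk.choices[0].delta.content
            if delta:
                yield delta
```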
Streaming audio over the internet introduces packet jitter. Velox handles this via a two-stage buffering strategy:
- Server-Side Aggregation: The backend buffers tiny TTS chunks into larger, playback-ready packets before sending (see the sketch after this list).
- Client-Side Jitter Buffer: The frontend maintains a dynamic ~150ms buffer to smooth out network irregularities, ensuring the AI's voice doesn't sound "robotic" on 4G networks.
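As an illustration of the first stage, the sketch below aggregates TTS chunks into roughly 100ms packets before sending. The window size and helper name are assumptions rather than the exact Velox values; the frame math follows the 16kHz Int16 pipeline described above.

```python
SAMPLE_RATE = 16_000       # Hz, matches the Int16 PCM pipeline
BYTES_PER_SAMPLE = 2       # Int16
TARGET_MS = 100            # assumed aggregation window (~100 ms)
TARGET_BYTES = SAMPLE_RATE * BYTES_PER_SAMPLE * TARGET_MS // 1000


async def aggregate_tts(tts_chunks, websocket):
    """Buffer tiny TTS chunks into ~100 ms packets before sending."""
    buffer = bytearray()
    async for chunk in tts_chunks:              # chunks may be only a few ms each
        buffer.extend(chunk)
        while len(buffer) >= TARGET_BYTES:
            await websocket.send_bytes(bytes(buffer[:TARGET_BYTES]))
            del buffer[:TARGET_BYTES]
    if buffer:                                  # flush the remainder at utterance end
        await websocket.send_bytes(bytes(buffer))
```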
This is the exact lifecycle of a single user interaction, demonstrating the parallelism of the system.
- INPUT: User speaks "What is the price?"
- Client sends Int16 binary stream -> Server.
- DETECTION:
- Server forwards audio -> Deepgram.
- Deepgram detects "UtteranceEnd" (User stopped talking).
- GENERATION:
- Server signals LLM (Groq) immediately.
- Time-to-First-Token: <200ms.
- SYNTHESIS:
- LLM streams text tokens -> TTS Service.
- TTS streams audio bytes -> Server.
- PLAYBACK:
- Server flushes audio queue -> Client WebSocket.
- Client plays audio via Web Audio API.
Total Round-Trip Latency: ~700ms (Network Dependent)
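A minimal sketch of how the uplink (mic to STT) and downlink (TTS to speakers) can run concurrently on one FastAPI WebSocket, which is what makes this parallelism possible; the endpoint path and the omitted STT/LLM/TTS wiring are assumptions, not the exact Velox code.

```python
import asyncio
from fastapi import FastAPI, WebSocket

app = FastAPI()


@app.websocket("/ws/voice")                    # assumed endpoint path
async def voice_session(ws: WebSocket) -> None:
    await ws.accept()
    # Audio synthesized by the TTS stage is pushed onto this queue by the
    # STT -> LLM -> TTS pipeline (omitted in this sketch).
    outbound: asyncio.Queue[bytes] = asyncio.Queue()

    async def uplink() -> None:
        # Client -> Server: read raw Int16 PCM frames as they arrive.
        while True:
            pcm = await ws.receive_bytes()
            _ = pcm                            # forward to the STT provider in the real pipeline

    async def downlink() -> None:
        # Server -> Client: flush buffered TTS packets as soon as they exist.
        while True:
            packet = await outbound.get()
            await ws.send_bytes(packet)

    # Full duplex: both directions share one socket and run concurrently.
    await asyncio.gather(uplink(), downlink())
```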
| Feature | Implementation | Benefit |
|---|---|---|
| Full-Duplex | Raw WebSockets | Allows speaking and listening simultaneously. |
| Semantic Barge-In | Transcript VAD | AI stops only for speech, ignores noise. |
| Optimized Audio | Int16 PCM | Half the payload of Float32; ready for AI ingestion. |
| Resiliency | Auto-Reconnection | Handles socket drops/network switching gracefully. |
| Modularity | Adapter Pattern | Zero refactoring required to change AI models. |
- Framework: Next.js 14 (React)
- Language: TypeScript
- Styling: TailwindCSS
- Core: Python
- Framework: FastAPI (Uvicorn)
- Concurrency: AsyncIO (`async`/`await`)
- Networking: WebSockets
- LLM Inference: Groq, Cerebras
- Speech-to-Text (STT): Deepgram, Gladia
- Text-to-Speech (TTS): Deepgram (Cloud), Piper (Local)
- Orchestration: Custom Event Loop State Machine
