Agent Voice Response - Sarvam Speech-to-Text Integration

This repository provides a real-time speech-to-text transcription service that uses the Sarvam Speech-to-Text WebSocket API and integrates with the Agent Voice Response system. The service sets up an Express.js server that accepts audio streams via HTTP POST, connects to Sarvam over a WebSocket for real-time transcription, and streams the transcribed text back to clients using Server-Sent Events (SSE).

Prerequisites

Before setting up the project, ensure you have the following:

  1. Node.js and npm installed.
  2. A Sarvam account with the Speech-to-Text API enabled.
  3. A Sarvam API Key with the necessary permissions to access the Speech-to-Text API.

Setup

1. Clone the Repository

git clone https://github.com/agentvoiceresponse/avr-asr-sarvam.git
cd avr-asr-sarvam

2. Install Dependencies

npm install

3. Set Up Sarvam Credentials

Create a .env file in the root directory and add your Sarvam API key:

SARVAM_API_KEY=your_sarvam_api_key

You can obtain your API key from the Sarvam Console.

4. Configuration

Configure the following environment variables in your .env file:

# Required: Sarvam API Key
SARVAM_API_KEY=your_sarvam_api_key

# Optional: Sarvam WebSocket URL (defaults to production endpoint)
SARVAM_WEBSOCKET_URL=wss://api.sarvam.ai/speech-to-text/ws

# Optional: Speech recognition model (default: saarika:v2.5)
SARVAM_SPEECH_RECOGNITION_MODEL=saarika:v2.5

# Optional: Language hints (default: en-IN)
SARVAM_SPEECH_RECOGNITION_LANGUAGE=en-IN

# Optional: Server port (default: 6050)
PORT=6050
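
As a rough illustration of how these variables and their defaults could be read in Node.js, here is a minimal sketch. The helper name `loadConfig` and the field names of the returned object are hypothetical and are not taken from this repository; only the variable names and default values come from the list above.

```javascript
// Hypothetical helper (not part of this repository): collects the documented
// environment variables into one object and applies the documented defaults.
function loadConfig(env = process.env) {
  if (!env.SARVAM_API_KEY) {
    // The API key is the only variable documented as required.
    throw new Error('SARVAM_API_KEY is required');
  }
  return {
    apiKey: env.SARVAM_API_KEY,
    websocketUrl: env.SARVAM_WEBSOCKET_URL || 'wss://api.sarvam.ai/speech-to-text/ws',
    model: env.SARVAM_SPEECH_RECOGNITION_MODEL || 'saarika:v2.5',
    language: env.SARVAM_SPEECH_RECOGNITION_LANGUAGE || 'en-IN',
    // Environment values are strings, so the port is converted to a number.
    port: Number(env.PORT) || 6050,
  };
}
```

Calling `loadConfig({ SARVAM_API_KEY: 'example' })` would return the defaults for every optional field, which can be a quick way to confirm what the service would use when only the key is set.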

How It Works

This application sets up an Express.js server that accepts audio streams via HTTP POST and uses the Sarvam WebSocket API for real-time transcription. The architecture follows this flow:

1. Express.js Server

The server listens for audio streams on the /speech-to-text-stream POST endpoint. When a client sends audio data, the server:

  • Sets up Server-Sent Events (SSE) headers for streaming responses
  • Creates a WebSocket connection to Sarvam
  • Forwards audio chunks to Sarvam in real-time
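
To make the SSE step more concrete, the sketch below shows the standard response headers for an event stream and one way to serialize a payload as an SSE event. The names `SSE_HEADERS` and `formatSseEvent` are hypothetical, and the payload shape the service actually emits is not documented here, so treat this as an illustration of the SSE wire format rather than the repository's code.

```javascript
// Standard headers for a Server-Sent Events response.
const SSE_HEADERS = {
  'Content-Type': 'text/event-stream',
  'Cache-Control': 'no-cache',
  Connection: 'keep-alive',
};

// Hypothetical helper: serializes a string or object as one SSE event.
// Each line of the payload gets its own "data:" prefix, and a blank line
// terminates the event, as the SSE format requires.
function formatSseEvent(payload) {
  const text = typeof payload === 'string' ? payload : JSON.stringify(payload);
  const lines = text.split('\n').map((line) => `data: ${line}`);
  return lines.join('\n') + '\n\n';
}
```

In an Express handler, this would typically be used as `res.writeHead(200, SSE_HEADERS)` followed by `res.write(formatSseEvent(transcript))` for each transcript.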

2. Sarvam WebSocket Connection

The service establishes a persistent WebSocket connection to wss://api.sarvam.ai/speech-to-text/ws:

  • Sends a configuration message with API key, model, audio format, and language hints
  • Streams audio data as binary WebSocket frames
  • Receives JSON responses containing transcription tokens

3. Audio Format

The service expects audio in the following format:

  • Format: s16le (signed 16-bit little-endian PCM)
  • Sample Rate: 8000 Hz
  • Channels: Mono (1 channel)
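
The format above implies a fixed byte rate: 8000 samples per second at 2 bytes per sample on 1 channel gives 16000 bytes per second. The sketch below works through that arithmetic and generates a short test tone in the same format. The helper names `pcmDurationMs` and `makeTone` are hypothetical and exist only for illustration.

```javascript
const SAMPLE_RATE_HZ = 8000;
const BYTES_PER_SAMPLE = 2; // signed 16-bit
const CHANNELS = 1; // mono
const BYTES_PER_SECOND = SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * CHANNELS; // 16000

// Duration in milliseconds represented by a raw PCM buffer of this size.
function pcmDurationMs(byteLength) {
  return (byteLength * 1000) / BYTES_PER_SECOND;
}

// Generates a sine tone as raw s16le mono PCM at 8000 Hz.
function makeTone(frequencyHz, durationMs, amplitude = 0.5) {
  const sampleCount = Math.round((durationMs / 1000) * SAMPLE_RATE_HZ);
  const buffer = Buffer.alloc(sampleCount * BYTES_PER_SAMPLE);
  for (let i = 0; i < sampleCount; i++) {
    const value = Math.sin((2 * Math.PI * frequencyHz * i) / SAMPLE_RATE_HZ);
    // Scale to the signed 16-bit range and write little-endian.
    buffer.writeInt16LE(Math.round(value * amplitude * 32767), i * BYTES_PER_SAMPLE);
  }
  return buffer;
}
```

For example, a 20 ms chunk is 160 samples, which is 320 bytes, so `makeTone(440, 20).length` is 320 and `pcmDurationMs(320)` is 20.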

4. Transcription Response

Sarvam returns JSON responses with token arrays.

5. Route /speech-to-text-stream

This POST endpoint:

  • Accepts raw audio stream in the request body
  • Returns transcription results via Server-Sent Events (SSE)
  • Automatically closes the Sarvam connection when the audio stream ends
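
On the client side, the SSE response arrives as text chunks that may split an event across chunk boundaries. The following is a minimal sketch of an incremental parser for the `data:` fields of such a stream. The name `createSseParser` is hypothetical, the sketch assumes LF line endings, and it makes no assumption about what the service puts inside each `data:` field.

```javascript
// Hypothetical client-side helper: returns a feed(chunk) function that
// buffers incoming text and calls onData once per complete SSE event.
function createSseParser(onData) {
  let pending = '';
  return function feed(chunk) {
    pending += chunk;
    let boundary;
    // A blank line (two consecutive newlines) terminates an event.
    while ((boundary = pending.indexOf('\n\n')) !== -1) {
      const rawEvent = pending.slice(0, boundary);
      pending = pending.slice(boundary + 2);
      const data = rawEvent
        .split('\n')
        .filter((line) => line.startsWith('data:'))
        // Drop the field name and the single optional space after the colon.
        .map((line) => line.slice(5).replace(/^ /, ''))
        .join('\n');
      if (data) onData(data);
    }
  };
}
```

A client would call `feed` with each decoded chunk of the HTTP response body and handle transcripts in the `onData` callback.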

Architecture Overview

The service implements a bridge pattern between HTTP and WebSocket protocols:

Client (HTTP POST) → Express Server → Sarvam WebSocket API
                    ↓
              SSE Response ← Transcription Tokens

Key Components:

  • handleAudioStream: Main handler function that:

    • Creates a WebSocket connection to Sarvam
    • Sends configuration message with API credentials and settings
    • Forwards incoming audio chunks to Sarvam as binary frames
    • Processes Sarvam responses to extract final transcription tokens
    • Streams transcripts back to the client via Server-Sent Events
    • Handles connection lifecycle (open, message, close, error)
  • WebSocket Event Handlers:

    • open: Sends configuration and enables audio streaming
    • message: Parses JSON responses, extracts final tokens, builds transcripts
    • close: Gracefully closes client connection
    • error: Handles and reports connection errors
  • HTTP Request Handlers:

    • data: Forwards audio chunks to Sarvam WebSocket
    • end: Sends empty frame to gracefully close Sarvam connection
    • error: Handles client-side stream errors
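
The HTTP request handlers listed above can be sketched as follows. This is an illustration under stated assumptions, not the repository's `handleAudioStream`: the function name `forwardAudio` is hypothetical, and `upstream` stands for any object with a `send(buffer)` method, such as an already open WebSocket from the `ws` package. A fake request and a fake upstream are used so the sketch runs without a network connection.

```javascript
const { EventEmitter } = require('node:events');

// Hypothetical helper: wires the three request events to an upstream socket.
function forwardAudio(req, upstream, onError = console.error) {
  // data: forward each audio chunk as it arrives.
  req.on('data', (chunk) => upstream.send(chunk));
  // end: send an empty frame to signal that the audio stream is finished.
  req.on('end', () => upstream.send(Buffer.alloc(0)));
  // error: report client-side stream errors.
  req.on('error', (err) => onError(err));
}

// Demonstration with stand-ins for the HTTP request and the WebSocket.
const fakeRequest = new EventEmitter();
const sentFrames = [];
forwardAudio(fakeRequest, { send: (frame) => sentFrames.push(frame) });
fakeRequest.emit('data', Buffer.from([0x01, 0x02]));
fakeRequest.emit('end');
console.log(sentFrames.map((frame) => frame.length)); // [ 2, 0 ]
```

Keeping the wiring in a function that only depends on `on` and `send` makes it possible to exercise the forwarding logic in isolation, as the demonstration does.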

Running the Application

To start the application:

npm run start

or

npm run start:dev

The server starts and listens on the port set by PORT in the .env file, or on 6050 if PORT is not set.

Audio Format Issues

  • Ensure audio is in s16le format (signed 16-bit little-endian PCM)
  • Verify sample rate is exactly 8000 Hz
  • Confirm audio is mono (single channel)
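
If the source audio is a WAV file, one way to run these checks is to read its header before streaming. The sketch below assumes a canonical 44-byte RIFF/WAVE header with the `fmt ` chunk directly after the RIFF header; files with extra chunks would need a fuller parser. The helper names are hypothetical. Since the service expects raw PCM, the header itself would presumably need to be skipped before sending the samples.

```javascript
// Hypothetical helper: reads the format fields of a canonical WAV header.
function inspectWavHeader(buffer) {
  const isRiffWave =
    buffer.length >= 44 &&
    buffer.toString('ascii', 0, 4) === 'RIFF' &&
    buffer.toString('ascii', 8, 12) === 'WAVE' &&
    buffer.toString('ascii', 12, 16) === 'fmt ';
  if (!isRiffWave) {
    throw new Error('Not a canonical RIFF/WAVE header');
  }
  return {
    audioFormat: buffer.readUInt16LE(20), // 1 means uncompressed PCM
    channels: buffer.readUInt16LE(22),
    sampleRate: buffer.readUInt32LE(24),
    bitsPerSample: buffer.readUInt16LE(34),
  };
}

// True when the header describes 16-bit PCM, mono, 8000 Hz.
function matchesExpectedFormat(header) {
  return (
    header.audioFormat === 1 &&
    header.channels === 1 &&
    header.sampleRate === 8000 &&
    header.bitsPerSample === 16
  );
}
```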

Error Responses

The service handles Sarvam error responses and forwards them with appropriate HTTP status codes:

  • 400: Bad request (invalid parameters)
  • 401: Unauthorized (invalid API key)
  • 402: Payment required (account balance exhausted)
  • 429: Too many requests (rate limit exceeded)
  • 500: Internal server error
  • 503: Service unavailable

Check the server logs for detailed error messages.
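
For client code that needs readable messages, the status codes above can be kept in a small lookup table. This is a minimal sketch: the names `ERROR_DESCRIPTIONS` and `describeError` are hypothetical, and treating any unlisted code as a 500 is an assumption made here, not documented behavior of the service.

```javascript
// Descriptions copied from the list of status codes above.
const ERROR_DESCRIPTIONS = {
  400: 'Bad request (invalid parameters)',
  401: 'Unauthorized (invalid API key)',
  402: 'Payment required (account balance exhausted)',
  429: 'Too many requests (rate limit exceeded)',
  500: 'Internal server error',
  503: 'Service unavailable',
};

// Hypothetical helper: returns a status and message pair for a given code.
// Assumption: codes outside the table are reported as 500.
function describeError(status) {
  if (Object.prototype.hasOwnProperty.call(ERROR_DESCRIPTIONS, status)) {
    return { status, message: ERROR_DESCRIPTIONS[status] };
  }
  return { status: 500, message: ERROR_DESCRIPTIONS[500] };
}
```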

Support & Community

Support AVR

AVR is free and open-source. Any support is entirely voluntary and intended as a personal gesture of appreciation. Donations do not provide access to features, services, or special benefits, and the project remains fully available regardless of donations.

Support us on Ko-fi

License

MIT License - see the LICENSE file for details.
