This repository provides a real-time speech-to-text transcription service that uses the Sarvam Speech-to-Text WebSocket API and integrates with the Agent Voice Response system. The service sets up an Express.js server that accepts audio streams via HTTP POST, connects to Sarvam via WebSocket for real-time transcription, and streams the transcribed text back to clients using Server-Sent Events (SSE).
Before setting up the project, ensure you have the following:
- Node.js and npm installed.
- A Sarvam account with the Speech-to-Text API enabled.
- A Sarvam API Key with the necessary permissions to access the Speech-to-Text API.
git clone https://github.com/agentvoiceresponse/avr-asr-sarvam.git
cd avr-asr-sarvam
npm install
Create a .env file in the root directory and add your Sarvam API key:
SARVAM_API_KEY=your_sarvam_api_key
You can obtain your API key from the Sarvam Console.
Configure the following environment variables in your .env file:
# Required: Sarvam API Key
SARVAM_API_KEY=your_sarvam_api_key
# Optional: Sarvam WebSocket URL (defaults to production endpoint)
SARVAM_WEBSOCKET_URL=wss://api.sarvam.ai/speech-to-text/ws
# Optional: Speech recognition model (default: saarika:v2.5)
SARVAM_SPEECH_RECOGNITION_MODEL=saarika:v2.5
# Optional: Language hints (default: en-IN)
SARVAM_SPEECH_RECOGNITION_LANGUAGE=en-IN
# Optional: Server port (default: 6050)
PORT=6050
This application sets up an Express.js server that accepts audio streams via HTTP POST and uses the Sarvam WebSocket API for real-time transcription. The architecture follows this flow:
The server listens for audio streams on the /speech-to-text-stream POST endpoint. When a client sends audio data, the server:
- Sets up Server-Sent Events (SSE) headers for streaming responses
- Creates a WebSocket connection to Sarvam
- Forwards audio chunks to Sarvam in real-time
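The SSE side of the steps above can be sketched as follows. This is a minimal illustration, not the repository's code: the names `SSE_HEADERS` and `formatSseEvent` are hypothetical, and the header set is the one commonly used for SSE rather than one confirmed by this project.

```javascript
// Headers commonly used to keep an SSE response open and uncached.
// (Assumed set; the project's actual headers may differ.)
const SSE_HEADERS = {
  'Content-Type': 'text/event-stream',
  'Cache-Control': 'no-cache',
  Connection: 'keep-alive',
};

// Format one payload as an SSE event. Every line of the payload gets its own
// "data:" field, and a blank line terminates the event.
function formatSseEvent(text) {
  const lines = String(text).split('\n');
  return lines.map((line) => `data: ${line}`).join('\n') + '\n\n';
}
```

Inside an Express handler this would typically be used as `res.writeHead(200, SSE_HEADERS)` followed by `res.write(formatSseEvent(transcript))` for each transcript.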
The service establishes a persistent WebSocket connection to wss://api.sarvam.ai/speech-to-text/ws:
- Sends a configuration message with API key, model, audio format, and language hints
- Streams audio data as binary WebSocket frames
- Receives JSON responses containing transcription tokens
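As a rough sketch of how the environment variables and the configuration message could fit together: the defaults below come from the configuration section of this README, but the function names and, in particular, the JSON field names in `buildConfigMessage` are assumptions made for illustration. Check the Sarvam API documentation for the real message schema.

```javascript
// Defaults as documented in the environment variable section of this README.
const DEFAULTS = {
  websocketUrl: 'wss://api.sarvam.ai/speech-to-text/ws',
  model: 'saarika:v2.5',
  language: 'en-IN',
  port: 6050,
};

// Resolve settings from an env-like object (for example process.env).
function resolveSettings(env) {
  if (!env.SARVAM_API_KEY) {
    throw new Error('SARVAM_API_KEY is required');
  }
  return {
    apiKey: env.SARVAM_API_KEY,
    websocketUrl: env.SARVAM_WEBSOCKET_URL || DEFAULTS.websocketUrl,
    model: env.SARVAM_SPEECH_RECOGNITION_MODEL || DEFAULTS.model,
    language: env.SARVAM_SPEECH_RECOGNITION_LANGUAGE || DEFAULTS.language,
    port: Number(env.PORT) || DEFAULTS.port,
  };
}

// HYPOTHETICAL field names: the README only says the message carries the
// API key, model, audio format, and language hints.
function buildConfigMessage(settings) {
  return JSON.stringify({
    api_key: settings.apiKey,
    model: settings.model,
    audio_format: 's16le',
    sample_rate: 8000,
    num_channels: 1,
    language: settings.language,
  });
}
```

Failing fast on a missing API key keeps the error close to startup instead of surfacing later as a 401 from the upstream service.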
The service expects audio in the following format:
- Format: s16le (signed 16-bit little-endian PCM)
- Sample Rate: 8000 Hz
- Channels: Mono (1 channel)
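The format above fixes the data rate at 8000 samples × 2 bytes × 1 channel = 16000 bytes per second. The helpers below (hypothetical names, not part of the service) show how chunk sizes and durations relate, which can be handy when checking that a client sends audio in the expected format.

```javascript
const SAMPLE_RATE_HZ = 8000;
const BYTES_PER_SAMPLE = 2; // signed 16-bit
const CHANNELS = 1; // mono
const BYTES_PER_SECOND = SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * CHANNELS; // 16000

// Duration in milliseconds of a chunk with the given byte length.
function chunkDurationMs(byteLength) {
  return (byteLength * 1000) / BYTES_PER_SECOND;
}

// Number of bytes needed to hold the given duration of audio.
function bytesForDuration(ms) {
  return (ms * BYTES_PER_SECOND) / 1000;
}

// True when the chunk contains only complete samples.
function isWholeSamples(byteLength) {
  return byteLength % (BYTES_PER_SAMPLE * CHANNELS) === 0;
}
```

For example, a 20 ms frame occupies 320 bytes, and a chunk with an odd byte length cannot be valid s16le audio on its own.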
Sarvam returns JSON responses with token arrays.
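A minimal sketch of turning such a response into text, assuming a payload shaped like `{ "tokens": [{ "text": "...", "is_final": true }] }`. That shape and the name `extractFinalTranscript` are assumptions for illustration; this README does not specify the field names, so adjust them to the actual Sarvam response.

```javascript
// Build a transcript from the final tokens of one WebSocket message.
// ASSUMED shape: { tokens: [{ text: string, is_final: boolean }, ...] }
function extractFinalTranscript(rawMessage) {
  let parsed;
  try {
    parsed = JSON.parse(rawMessage);
  } catch {
    return ''; // Not JSON: nothing to extract.
  }
  const tokens = Array.isArray(parsed && parsed.tokens) ? parsed.tokens : [];
  return tokens
    .filter((token) => token && token.is_final)
    .map((token) => token.text)
    .join('');
}
```

Returning an empty string for malformed or token-free messages lets the caller skip the SSE write with a simple truthiness check.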
This POST endpoint:
- Accepts raw audio stream in the request body
- Returns transcription results via Server-Sent Events (SSE)
- Automatically closes the Sarvam connection when the audio stream ends
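On the client side, the SSE response arrives as text made of `data:` fields separated by blank lines. The parser below is a simplified sketch under two assumptions: each chunk passed in contains only complete events, and only `data:` fields matter. The name `parseSseEvents` is hypothetical, and the payload format (plain text or JSON) is not specified in this README.

```javascript
// Extract the data payloads from a block of SSE text.
// Events are separated by a blank line; multi-line payloads are re-joined.
function parseSseEvents(text) {
  return text
    .split('\n\n')
    .map((block) =>
      block
        .split('\n')
        .filter((line) => line.startsWith('data:'))
        // Drop the "data:" prefix and the single optional space after it.
        .map((line) => line.slice(5).replace(/^ /, ''))
    )
    .filter((dataLines) => dataLines.length > 0)
    .map((dataLines) => dataLines.join('\n'));
}
```

A production client would also buffer partial events across network chunks, which this sketch leaves out.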
The service implements a bridge pattern between HTTP and WebSocket protocols:
Client (HTTP POST) → Express Server → Sarvam WebSocket API
                                              ↓
SSE Response ← Transcription Tokens
Key Components:
- handleAudioStream: Main handler function that:
  - Creates a WebSocket connection to Sarvam
  - Sends configuration message with API credentials and settings
  - Forwards incoming audio chunks to Sarvam as binary frames
  - Processes Sarvam responses to extract final transcription tokens
  - Streams transcripts back to the client via Server-Sent Events
  - Handles connection lifecycle (open, message, close, error)
- WebSocket Event Handlers:
  - open: Sends configuration and enables audio streaming
  - message: Parses JSON responses, extracts final tokens, builds transcripts
  - close: Gracefully closes client connection
  - error: Handles and reports connection errors
- HTTP Request Handlers:
  - data: Forwards audio chunks to Sarvam WebSocket
  - end: Sends empty frame to gracefully close Sarvam connection
  - error: Handles client-side stream errors
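The event wiring listed above can be sketched as a single function. This is an illustration of the bridge pattern, not the actual `handleAudioStream`: `bridgeStreams` and `toTranscript` are hypothetical names, the socket is assumed to be open and already configured, and the `send`, `close`, `write`, and `end` calls follow the usual interfaces of a `ws` WebSocket and a Node.js HTTP response.

```javascript
// Wire an incoming HTTP request to an upstream WebSocket and an SSE response.
// toTranscript maps one raw upstream message to text ('' means "nothing yet").
function bridgeStreams(request, socket, response, toTranscript) {
  // HTTP request handlers: forward audio, then signal the end of the stream.
  request.on('data', (chunk) => socket.send(chunk));
  request.on('end', () => socket.send(Buffer.alloc(0))); // empty frame
  request.on('error', () => socket.close());

  // WebSocket handlers: relay transcripts and mirror the connection lifecycle.
  socket.on('message', (raw) => {
    const transcript = toTranscript(raw);
    if (transcript) {
      response.write(`data: ${transcript}\n\n`);
    }
  });
  socket.on('close', () => response.end());
  socket.on('error', () => response.end());
}
```

Passing the message parser in as a parameter keeps the wiring independent of the upstream response format.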
To start the application:
npm run start
or
npm run start:dev
The server will start and listen on the port specified in the .env file, or on the default port 6050 if PORT is not set.
- Ensure audio is in s16le format (signed 16-bit little-endian PCM)
- Verify sample rate is exactly 8000 Hz
- Confirm audio is mono (single channel)
The service handles Sarvam error responses and forwards them with appropriate HTTP status codes:
- 400: Bad request (invalid parameters)
- 401: Unauthorized (invalid API key)
- 402: Payment required (account balance exhausted)
- 429: Too many requests (rate limit exceeded)
- 500: Internal server error
- 503: Service unavailable
Check the server logs for detailed error messages.
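One way to express this mapping in code is a lookup table. The sketch below uses hypothetical names (`SARVAM_ERRORS`, `describeError`, `toHttpStatus`), and the fallback to 500 for unlisted codes is an assumption rather than documented behavior.

```javascript
// Error codes and meanings as listed in this README.
const SARVAM_ERRORS = new Map([
  [400, 'Bad request (invalid parameters)'],
  [401, 'Unauthorized (invalid API key)'],
  [402, 'Payment required (account balance exhausted)'],
  [429, 'Too many requests (rate limit exceeded)'],
  [500, 'Internal server error'],
  [503, 'Service unavailable'],
]);

// Human-readable description for a code, with a generic fallback.
function describeError(code) {
  return SARVAM_ERRORS.get(code) || 'Unknown error';
}

// HTTP status to send to the client: known codes pass through unchanged,
// anything else becomes 500 (assumed fallback).
function toHttpStatus(code) {
  return SARVAM_ERRORS.has(code) ? code : 500;
}
```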
- GitHub: https://github.com/agentvoiceresponse - Report issues, contribute code.
- Discord: https://discord.gg/DFTU69Hg74 - Join the community discussion.
- Docker Hub: https://hub.docker.com/u/agentvoiceresponse - Find Docker images.
- NPM: https://www.npmjs.com/~agentvoiceresponse - Browse our packages.
- Wiki: https://wiki.agentvoiceresponse.com/en/home - Project documentation and guides.
AVR is free and open-source. Any support is entirely voluntary and intended as a personal gesture of appreciation. Donations do not provide access to features, services, or special benefits, and the project remains fully available regardless of donations.
MIT License - see the LICENSE file for details.