Agent Voice Response - Sarvam Speech-to-Text Integration

This repository provides a real-time speech-to-text transcription service that integrates the Sarvam Speech-to-Text WebSocket API with the Agent Voice Response (AVR) system. The service runs an Express.js server that accepts audio streams via HTTP POST, connects to Sarvam over WebSocket for real-time transcription, and streams the transcribed text back to clients using Server-Sent Events (SSE).

Prerequisites

Before setting up the project, ensure you have the following:

Node.js and npm installed.
A Sarvam account with the Speech-to-Text API enabled.
A Sarvam API Key with the necessary permissions to access the Speech-to-Text API.

Setup

1. Clone the Repository

git clone https://github.com/agentvoiceresponse/avr-asr-sarvam.git
cd avr-asr-sarvam

2. Install Dependencies

npm install

3. Set Up Sarvam Credentials

Create a .env file in the root directory and add your Sarvam API key:

SARVAM_API_KEY=your_sarvam_api_key

You can obtain your API key from the Sarvam Console.

4. Configuration

Configure the following environment variables in your .env file:

# Required: Sarvam API Key
SARVAM_API_KEY=your_sarvam_api_key

# Optional: Sarvam WebSocket URL (defaults to production endpoint)
SARVAM_WEBSOCKET_URL=wss://api.sarvam.ai/speech-to-text/ws

# Optional: Speech recognition model (default: saarika:v2.5)
SARVAM_SPEECH_RECOGNITION_MODEL=saarika:v2.5

# Optional: Recognition language code (default: en-IN)
SARVAM_SPEECH_RECOGNITION_LANGUAGE=en-IN

# Optional: Server port (default: 6050)
PORT=6050
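
The variables above could be resolved at startup roughly as follows. This is a minimal sketch, not the code in index.js: `loadConfig` and the property names on the returned object are hypothetical, and only the variable names and defaults come from the list above.

```javascript
// Sketch: resolve the service configuration from environment variables.
// `loadConfig` is a hypothetical helper; it is not taken from this repository.
function loadConfig(env = process.env) {
  if (!env.SARVAM_API_KEY) {
    throw new Error("SARVAM_API_KEY is required");
  }

  // Fall back to the documented default port when PORT is not set.
  const port = Number.parseInt(env.PORT || "6050", 10);
  if (Number.isNaN(port)) {
    throw new Error(`PORT must be a number, got "${env.PORT}"`);
  }

  return {
    apiKey: env.SARVAM_API_KEY,
    websocketUrl:
      env.SARVAM_WEBSOCKET_URL || "wss://api.sarvam.ai/speech-to-text/ws",
    model: env.SARVAM_SPEECH_RECOGNITION_MODEL || "saarika:v2.5",
    language: env.SARVAM_SPEECH_RECOGNITION_LANGUAGE || "en-IN",
    port,
  };
}
```

Failing fast on a missing API key keeps the error close to startup instead of surfacing later as a rejected WebSocket connection.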

How It Works

This application sets up an Express.js server that accepts audio streams via HTTP POST and uses the Sarvam WebSocket API for real-time transcription. The architecture follows this flow:

1. Express.js Server

The server listens for audio streams on the /speech-to-text-stream POST endpoint. When a client sends audio data, the server:

Sets up Server-Sent Events (SSE) headers for streaming responses
Creates a WebSocket connection to Sarvam
Forwards audio chunks to Sarvam in real-time
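
For illustration, the SSE side of these steps might look like the sketch below. The header set is the usual one for Server-Sent Events; `formatSseEvent` and the `{ transcript }` payload shape are assumptions made for this example and may differ from what index.js actually writes.

```javascript
// Usual response headers for a Server-Sent Events stream.
const SSE_HEADERS = {
  "Content-Type": "text/event-stream",
  "Cache-Control": "no-cache",
  Connection: "keep-alive",
};

// Serialize one payload as a single SSE event: "data: <json>\n\n".
// The blank line (second "\n") is what terminates the event.
function formatSseEvent(payload) {
  return `data: ${JSON.stringify(payload)}\n\n`;
}

// Hypothetical use inside an Express handler (not executed here):
//   res.writeHead(200, SSE_HEADERS);
//   res.write(formatSseEvent({ transcript: "hello" }));
```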

2. Sarvam WebSocket Connection

The service establishes a persistent WebSocket connection to wss://api.sarvam.ai/speech-to-text/ws:

Sends a configuration message with API key, model, audio format, and language hints
Streams audio data as binary WebSocket frames
Receives JSON responses containing transcription tokens

3. Audio Format

The service expects audio in the following format:

Format: s16le (signed 16-bit little-endian PCM)
Sample Rate: 8000 Hz
Channels: Mono (1 channel)
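
The format above implies 8000 samples × 2 bytes × 1 channel = 16000 bytes per second of audio. The following sketch shows that arithmetic and how raw s16le bytes map to sample values; the helper names are illustrative and do not come from this repository.

```javascript
// Constants taken from the audio format described above.
const SAMPLE_RATE_HZ = 8000;
const BYTES_PER_SAMPLE = 2; // 16-bit samples
const CHANNELS = 1; // mono
const BYTES_PER_SECOND = SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * CHANNELS; // 16000

// Duration in milliseconds of a buffer of raw s16le audio.
function chunkDurationMs(chunk) {
  return (chunk.length * 1000) / BYTES_PER_SECOND;
}

// Decode raw s16le bytes into signed 16-bit sample values.
// Expects a buffer holding whole samples (an even number of bytes).
function decodeSamples(chunk) {
  if (chunk.length % BYTES_PER_SAMPLE !== 0) {
    throw new RangeError("buffer does not contain whole 16-bit samples");
  }
  const samples = [];
  for (let offset = 0; offset < chunk.length; offset += BYTES_PER_SAMPLE) {
    samples.push(chunk.readInt16LE(offset));
  }
  return samples;
}
```

For example, a 320-byte buffer holds 160 samples, which is 20 ms of audio at this rate.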

4. Transcription Response

Sarvam returns JSON responses with token arrays.
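
As a rough sketch of how a transcript could be assembled from such a response: the field names used here (`tokens`, `text`, `is_final`) are assumptions for illustration only, since this README does not specify the response schema. Check the Sarvam API documentation and index.js for the real shape.

```javascript
// Sketch: build a transcript from the final tokens of one JSON message.
// ASSUMPTION: messages look like { tokens: [{ text, is_final }] }.
// These field names are hypothetical and not confirmed by this README.
function extractFinalTranscript(rawMessage) {
  let message;
  try {
    message = JSON.parse(rawMessage);
  } catch {
    return ""; // Ignore frames that are not valid JSON.
  }

  const tokens = Array.isArray(message?.tokens) ? message.tokens : [];
  return tokens
    .filter((token) => token.is_final)
    .map((token) => token.text)
    .join("");
}
```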

5. Route `/speech-to-text-stream`

This POST endpoint:

Accepts raw audio stream in the request body
Returns transcription results via Server-Sent Events (SSE)
Automatically closes the Sarvam connection when the audio stream ends
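
On the client side, the SSE response arrives as text that has to be split into events. The sketch below shows one way to do that; `parseSseBuffer` is a hypothetical helper, and it handles only `data:` fields, which is all this example assumes the service emits.

```javascript
// Sketch: split buffered SSE text into complete events.
// Returns the data of each complete event plus any trailing partial text,
// which the caller should prepend to the next chunk it receives.
function parseSseBuffer(buffer) {
  const blocks = buffer.split("\n\n");
  const remainder = blocks.pop(); // Last piece is incomplete (or empty).

  const events = blocks
    .map((block) =>
      block
        .split("\n")
        .filter((line) => line.startsWith("data:"))
        // Per the SSE format, one leading space after "data:" is dropped.
        .map((line) => line.slice(5).replace(/^ /, ""))
        .join("\n")
    )
    .filter((data) => data.length > 0);

  return { events, remainder };
}
```

A caller would typically append each received chunk to the previous `remainder`, call `parseSseBuffer`, and handle the returned `events` in order.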

Architecture Overview

The service implements a bridge pattern between HTTP and WebSocket protocols:

Client (HTTP POST) → Express Server → Sarvam WebSocket API
                    ↓
              SSE Response ← Transcription Tokens

Key Components:

handleAudioStream: Main handler function that:
- Creates a WebSocket connection to Sarvam
- Sends configuration message with API credentials and settings
- Forwards incoming audio chunks to Sarvam as binary frames
- Processes Sarvam responses to extract final transcription tokens
- Streams transcripts back to the client via Server-Sent Events
- Handles connection lifecycle (open, message, close, error)

WebSocket Event Handlers:
- open: Sends configuration and enables audio streaming
- message: Parses JSON responses, extracts final tokens, builds transcripts
- close: Gracefully closes client connection
- error: Handles and reports connection errors

HTTP Request Handlers:
- data: Forwards audio chunks to Sarvam WebSocket
- end: Sends empty frame to gracefully close Sarvam connection
- error: Handles client-side stream errors

Running the Application

To start the application:

npm run start

or

npm run start:dev

The server listens on the port set by PORT in the .env file, or on 6050 if PORT is not set.

Audio Format Issues

Ensure audio is in s16le format (signed 16-bit little-endian PCM)
Verify sample rate is exactly 8000 Hz
Confirm audio is mono (single channel)

Error Responses

The service handles Sarvam error responses and forwards them with appropriate HTTP status codes:

400: Bad request (invalid parameters)
401: Unauthorized (invalid API key)
402: Payment required (account balance exhausted)
429: Too many requests (rate limit exceeded)
500: Internal server error
503: Service unavailable

Check the server logs for detailed error messages.
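
A client could turn these status codes into readable messages along the lines of the sketch below. The descriptions mirror the list above; `describeError` and `isRetryable` are hypothetical helpers, and treating 429, 500 and 503 as retryable is an assumption of this example, not documented behaviour of the service.

```javascript
// Status codes and descriptions from the list above.
const ERROR_DESCRIPTIONS = {
  400: "Bad request (invalid parameters)",
  401: "Unauthorized (invalid API key)",
  402: "Payment required (account balance exhausted)",
  429: "Too many requests (rate limit exceeded)",
  500: "Internal server error",
  503: "Service unavailable",
};

// Human-readable description for a status code.
function describeError(status) {
  return ERROR_DESCRIPTIONS[status] ?? `Unexpected status ${status}`;
}

// ASSUMPTION: rate limiting and server-side failures are worth retrying,
// while client errors (400, 401, 402) need a fix before trying again.
function isRetryable(status) {
  return status === 429 || status === 500 || status === 503;
}
```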

Support & Community

GitHub: https://github.com/agentvoiceresponse - Report issues, contribute code.
Discord: https://discord.gg/DFTU69Hg74 - Join the community discussion.
Docker Hub: https://hub.docker.com/u/agentvoiceresponse - Find Docker images.
NPM: https://www.npmjs.com/~agentvoiceresponse - Browse our packages.
Wiki: https://wiki.agentvoiceresponse.com/en/home - Project documentation and guides.

Support AVR

AVR is free and open-source. Any support is entirely voluntary and intended as a personal gesture of appreciation. Donations do not provide access to features, services, or special benefits, and the project remains fully available regardless of donations.

License

MIT License - see the LICENSE file for details.
