A sample implementation of real-time voice assistant using Amazon Nova 2 Sonic, AWS's speech-to-speech foundation model. Talk naturally with an AI that listens, thinks, and responds — all in real-time with human-like conversation flow.
Unlike traditional pipelines (ASR → LLM → TTS), a speech-to-speech model processes audio input and generates audio output directly in a single model, preserving acoustic context like tone, pace, and emotion.
- Turn-Taking — Detects pauses, hesitations, and non-verbal cues to know when to respond
- Barge-In — Gracefully handles interruptions mid-response without losing context
- Adaptive Speech — Adjusts response tone and pace based on how you speak, not just what you say
- Low Latency — Bidirectional streaming delivers responses as they're generated
- Multiple Voices — Tiffany (feminine) and Matthew (masculine) with native expressivity
- Polyglot Support — Voices can switch languages within the same conversation
- Text Input Mode — Switch to typing when voice isn't convenient (crossmodal interaction)
- Languages — Officially supports English (US/UK), Spanish, French, German, Italian, Portuguese, and Hindi. Other languages like Chinese and Japanese should also work.
Nova 2 Sonic supports asynchronous function calling — the model can invoke tools while continuing the conversation, then incorporate results naturally when they return. See src/tools for implementation. You can enable each of these tools individually in UI.
| Tool | Description |
|---|---|
| DateTime | Current date and time |
| Weather | Weather forecasts via Open-Meteo API |
| Wikipedia | Search and retrieve Wikipedia articles |
| Location Search | Find places and coordinates |
| Reasoning | Extended thinking for complex questions |
| Transcript Correction | Fix ASR transcription errors |
┌─────────────────┐ Socket.IO ┌─────────────────┐ AWS Bedrock ┌─────────────────┐
│ Browser UI │◄─────────────────►│ Express Server │◄─────────────────►│ Nova 2 Sonic │
│ (JavaScript) │ WebSocket │ (TypeScript) │ Bidirectional │ Model │
└─────────────────┘ └─────────────────┘ Streaming └─────────────────┘
Audio is captured via Web Audio API (16kHz mono PCM), streamed through Socket.IO to an Express server, then forwarded to Nova 2 Sonic via AWS Bedrock's bidirectional streaming API. Responses (24kHz audio + text transcripts) flow back in real-time.
- Node.js 18+
- AWS account with Amazon Bedrock access to Nova 2 Sonic
- AWS credentials configured
npm install
# Configure AWS credentials
export AWS_PROFILE=your-profile
# or
export AWS_ACCESS_KEY_ID=xxx && export AWS_SECRET_ACCESS_KEY=xxx
# (optional) Listen incoming connections on all available network interfaces
export HOST=0.0.0.0
# Compile and run application
npm run build
npm start
# Alternatively you can run in development mode for easier debugging
npm run devOpen http://localhost:3000 and click the microphone to start talking.
Settings are adjustable from the UI panel:
| Setting | Description |
|---|---|
| Voice | Choose between Tiffany or Matthew personas |
| Response Timing | Controls voice activity detection sensitivity — faster means quicker responses but may cut off longer pauses |
| Audio Quality | Output sample rate — higher quality uses more bandwidth |
| Initial Buffer | Audio buffer before playback starts — lower means faster response, higher means smoother playback |
| Temperature | Higher values produce more varied responses |
| Top P | Controls response diversity via nucleus sampling |
| Max Tokens | Limits response length |
| AWS Region | Select a closer region to reduce latency |
Pre-built personas available in public/prompts:
- Friendly Chat — Casual conversational assistant
- Insurance Service — Customer service agent
- Intelligent Assistant — General-purpose with tool usage
- Math Tutor — High school math teacher
See CONTRIBUTING for more information.
This library is licensed under the MIT-0 License. See the LICENSE file.

