Aura is a real-time, voice-to-voice AI companion that listens, thinks, and speaks back naturally, like a human.

**Flow:** User speaks → audio is recorded and sent to the server over WebSocket → STT transcribes it → LLM generates a response → TTS streams audio back → the mobile app plays it → the user hears the response
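
The whole round trip fits in a few lines of server code. A minimal sketch, assuming the service modules each export one function (the `ws` usage is real; the imported names and signatures are guesses based on the project structure below):

```ts
// Minimal sketch of the server-side pipeline. The `ws` API is real; the
// imported functions are assumptions about what the service modules
// (stt.ts, llm.ts, tts.ts) export -- real names/signatures may differ.
import { WebSocketServer } from "ws";
import { transcribe } from "./services/stt";   // assumed export
import { chat } from "./services/llm";         // assumed export
import { streamSpeech } from "./services/tts"; // assumed export

const wss = new WebSocketServer({ port: Number(process.env.PORT ?? 5000) });

wss.on("connection", (socket) => {
  socket.on("message", async (data, isBinary) => {
    if (!isBinary) return; // control messages handled elsewhere

    const userText = await transcribe(data as Buffer); // STT
    const reply = await chat(userText);                // LLM (non-streaming)
    for await (const chunk of streamSpeech(reply)) {   // TTS (streaming)
      socket.send(chunk); // binary audio chunk, played as it arrives
    }
  });
});
```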

**Prerequisites:**

- Node.js 18+
- Expo Go mobile app
- API keys: `GROQ_API_KEY`, `ELEVENLABS_API_KEY`

**Backend setup:**

```bash
cd backend
npm install
# Create .env file with:
# GROQ_API_KEY=your_key
# ELEVENLABS_API_KEY=your_key
# PORT=5000
npm run dev
```

**Mobile app setup:**

```bash
cd mobile-app
npm install
# Create .env file with:
# EXPO_PUBLIC_WEBSOCKET_URL=ws://localhost:5000
npm run start
# Open Expo Go app on your phone and scan the QR code from the terminal
# Start using the app!
```
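
Once both sides are running, the client boils down to a plain WebSocket pointed at `EXPO_PUBLIC_WEBSOCKET_URL`. A rough sketch (the helper names and playback queue are illustrative, not the app's actual code):

```ts
// Hypothetical client-side sketch (Expo / React Native).
const url = process.env.EXPO_PUBLIC_WEBSOCKET_URL ?? "ws://localhost:5000";
const socket = new WebSocket(url);
socket.binaryType = "arraybuffer"; // receive TTS chunks as ArrayBuffers

const playbackQueue: ArrayBuffer[] = [];

socket.onmessage = (event) => {
  // Binary frames are streamed TTS audio; queue them for playback.
  if (event.data instanceof ArrayBuffer) playbackQueue.push(event.data);
};

// Send a finished recording to the server as a single binary frame.
async function sendRecording(fileUri: string) {
  const response = await fetch(fileUri); // one way to read the local file
  socket.send(await response.arrayBuffer());
}
```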

**Latency optimizations:**

| Optimization | Why |
|---|---|
| Groq LLM (Llama 3.3 70B) | Fastest inference provider (~200-400ms for short responses) |
| ElevenLabs Scribe v2 | Low-latency STT model optimized for real-time |
| Streaming TTS | Audio chunks sent as they're generated, not waiting for full synthesis |
| WebSocket (persistent) | Eliminates HTTP connection overhead per request |
| Raw WebSocket library (ws) | Avoids the overhead and abstraction of higher-level libraries, e.g. Socket.IO |
| Non-streaming LLM for short responses | For short voice replies, waiting for the full completion is often faster than paying per-chunk streaming overhead |
| Binary audio over WebSocket | Minimal encoding overhead for audio data |
| Low-quality recording preset | 16kHz mono @ 128kbps — fast to encode & transmit |
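
To make the streaming-TTS and binary-audio rows concrete, here is a sketch of forwarding ElevenLabs audio chunks straight to the client socket. It uses the public `/v1/text-to-speech/:voiceId/stream` HTTP endpoint; the function itself is illustrative, not the project's actual `tts.ts`:

```ts
import type WebSocket from "ws";

// Illustrative: stream ElevenLabs TTS straight to the client socket.
async function streamTtsToClient(text: string, voiceId: string, socket: WebSocket) {
  const res = await fetch(`https://api.elevenlabs.io/v1/text-to-speech/${voiceId}/stream`, {
    method: "POST",
    headers: {
      "xi-api-key": process.env.ELEVENLABS_API_KEY!,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({ text, model_id: "eleven_multilingual_v2" }),
  });
  if (!res.ok || !res.body) throw new Error(`TTS request failed: ${res.status}`);

  // Forward each chunk the moment it arrives instead of buffering the
  // whole synthesis; raw binary frames avoid base64/JSON overhead.
  for await (const chunk of res.body) {
    socket.send(chunk);
  }
}
```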

**Latency breakdown:**

| Stage | Typical Time |
|---|---|
| Audio Recording + Send | ~100-200ms |
| STT (ElevenLabs Scribe) | ~400-800ms |
| LLM Response (Groq) | ~200-500ms |
| TTS First Chunk | ~500-2000ms |
| Total (Time to First Audio) | ~1.2-3.5s typical (up to ~5s worst case) |

Latency is measured programmatically in the mobile app, end to end from the moment recording stops to the first audio playback, and the result is displayed in the UI after each interaction.
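
A minimal sketch of that measurement (the handler names are illustrative, not the app's actual code):

```ts
// Stamp when recording stops, compute the delta when the first
// binary (audio) frame comes back from the server.
let recordingStoppedAt = 0;
let awaitingFirstChunk = false;

function onRecordingStop(socket: WebSocket, audio: ArrayBuffer) {
  recordingStoppedAt = Date.now();
  awaitingFirstChunk = true;
  socket.send(audio);
}

function onAudioChunk(chunk: ArrayBuffer) {
  if (awaitingFirstChunk) {
    awaitingFirstChunk = false;
    const latencyMs = Date.now() - recordingStoppedAt;
    console.log(`Time to first audio: ${latencyMs}ms`); // surfaced in the UI
  }
  // ...enqueue chunk for playback
}
```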
Given more time, I would:
- Implement a streaming LLM → TTS pipeline — Stream sentences to TTS as the LLM generates them (`groqChatStream` is already in place); see the sketch after this list
- Add VAD (Voice Activity Detection) — Auto-detect end of speech instead of push-to-talk
- Client-side audio chunking — Stream audio during recording for faster STT start
- Optimize latency further — Profile each pipeline stage and attack its largest bottleneck first
- Audio compression — Implement more efficient audio codecs for lower bandwidth
- Preemptive TTS warming — Pre-initialize TTS connection to reduce first-chunk latency
- Edge deployment — Deploy backend closer to user for reduced network latency
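
For the first item, a sentence-chunked pipeline could look roughly like this. `groqChatStream` exists per the note above, but its signature here is an assumption, and `speakSentence` is a hypothetical wrapper around the streaming TTS call:

```ts
// Assumed signatures: groqChatStream is real per the note above, but its
// shape here is a guess; speakSentence is a hypothetical TTS wrapper.
declare function groqChatStream(prompt: string): AsyncIterable<string>;
declare function speakSentence(sentence: string): Promise<void>;

async function streamPipeline(userText: string) {
  let buffer = "";
  for await (const delta of groqChatStream(userText)) {
    buffer += delta;
    // Flush at sentence boundaries so TTS can start speaking before
    // the LLM has finished the whole response.
    const match = buffer.match(/^(.*?[.!?])\s+(.*)$/s);
    if (match) {
      await speakSentence(match[1]);
      buffer = match[2];
    }
  }
  if (buffer.trim()) await speakSentence(buffer); // trailing fragment
}
```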

**Project structure:**

```
├── backend/                  # Node.js WebSocket server
│   └── src/
│       ├── index.ts          # WebSocket server & message handling
│       ├── constant.ts       # System prompt for Aura personality
│       └── services/
│           ├── stt.ts        # ElevenLabs Speech-to-Text
│           ├── llm.ts        # Groq LLM (Llama 3.3)
│           ├── tts.ts        # ElevenLabs Text-to-Speech (streaming)
│           └── context.ts    # Conversation history management
│
└── mobile-app/               # Expo React Native app
    └── app/
        └── index.tsx         # Main voice interface
```

**Tech stack:**

- Mobile: Expo, React Native, expo-audio
- Backend: Node.js, WebSocket (ws), TypeScript
- STT: ElevenLabs Scribe v2
- LLM: Groq (Llama 3.3 70B / GPT OSS)
- TTS: ElevenLabs Multilingual v2 (Streaming)