Full-duplex Discord voice pipeline for OpenClaw. Speak into Discord. An AI agent listens, thinks, and speaks back — hands-free, no buttons, no modes.
The first fully operational voice bot built for the OpenClaw agent platform. Built by Greyforge Labs. Validated by adversarial multi-agent pre-build audit before a single line of implementation was written.
Chronicle: greyforge.tech/chronicles/voiceops-integration
A walkie-talkie is half-duplex: one side talks, the other listens, then they switch. You press a button.
A phone call is full-duplex: both sides transmit simultaneously, in real time. No button-pressing, no mode-switching.
VoiceOps is full-duplex for Discord. You join a voice channel. The bot is already there, listening. You speak naturally. When you finish (detected by 800ms of silence), the bot transcribes your words, sends them to an OpenClaw AI agent, synthesizes the response with a local neural TTS engine, and plays it back through the voice channel. The whole loop — from the end of your sentence to the first word of the response — takes 3 to 7 seconds.
OpenClaw is a local AI agent platform built around a persistent WebSocket gateway. It hosts a council of AI agents (using models from Anthropic, Google, and OpenAI), manages sessions, routing, and multi-agent coordination, and exposes a single unified API over its v3 Gateway protocol.
VoiceOps is a standalone process that plugs into the OpenClaw Gateway over WebSocket. It does not modify OpenClaw internals — it simply speaks the Gateway protocol, sending voice turns as chat.send requests and receiving agent responses as chat events.
```
┌─────────────────────────────────────────────────────────────────┐
│                        VoiceOps Pipeline                        │
└─────────────────────────────────────────────────────────────────┘

  You speak in Discord
         │
         ▼
  ┌─────────────┐
  │ Discord RX  │  @discordjs/voice receives Opus-encoded audio
  │ (Opus UDP)  │  from the operator (target Discord user) only.
  └──────┬──────┘
         │  Opus frames (48kHz stereo)
         ▼
  ┌─────────────┐
  │ Opus Decode │  prism-media decodes to PCM16 16kHz mono
  │  (prism)    │  (Whisper-compatible format)
  └──────┬──────┘
         │  Raw PCM buffer
         ▼
  ┌──────────────────┐
  │ Silence VAD +    │  EndBehaviorType.AfterSilence (800ms)
  │ RMS Energy Gate  │  RMS threshold filters near-silence clips.
  │                  │  Min utterance: 500ms. Max: 30s.
  └──────┬───────────┘
         │  PCM buffer (utterance complete)
         ▼
  ┌─────────────┐
  │  Whisper    │  WAV header prepended to PCM.
  │  ASR API    │  Posted to OpenAI whisper-1 (or local whisper.cpp).
  └──────┬──────┘
         │  Transcript text
         ▼
  ┌──────────────────┐
  │  OpenClaw        │  WebSocket v3 Gateway (ws://127.0.0.1:18789)
  │  Gateway Client  │  chat.send → configured voiceSessionKey session
  └──────┬───────────┘
         │  Agent response text (streaming → final event)
         ▼
  ┌─────────────┐
  │  kokoro-js  │  Local neural TTS. 82MB ONNX model.
  │  TTS Engine │  Runs in subprocess (WASM isolation pattern).
  └──────┬──────┘
         │  WAV buffer (24kHz mono 16-bit)
         ▼
  ┌─────────────┐
  │ Discord TX  │  @discordjs/voice + ffmpeg encode WAV → Opus.
  │ (Opus UDP)  │  Played back into the voice channel.
  └─────────────┘

  You hear the response.
```
| Stage | Time |
|---|---|
| VAD / silence detection | 800ms |
| Whisper ASR (5s clip) | 500ms - 1.5s |
| Agent reasoning | 1 - 3s |
| kokoro-js TTS synthesis (warm) | < 300ms |
| Discord TX buffering | ~200ms |
| Total end-to-end | 3 - 7s |
The "thinking cue" feature (configurable) plays a short audio phrase ("Let me think about that...") immediately after transcription, masking the agent reasoning latency.
TTS cold start is ~1-2s on the first call, when the 82MB ONNX model is downloaded and loaded. Subsequent calls are < 300ms: each synthesis respawns the subprocess, but the model file is already cached locally, so the per-call load cost stays small.
The 43ms one-time cold start and < 1ms per-frame figures come from the benchmarked Silero VAD (see the VAD architecture note below); the shipped silence-based VAD adds no model-loading cost.
Five TTS options were evaluated before selecting kokoro-js. The winner had to be free, local, and fast.
| Engine | Latency | Quality | Cost |
|---|---|---|---|
| kokoro-js (CHOSEN) | < 300ms warm | Excellent (near-ElevenLabs) | $0 |
| piper-tts | < 1s | Good | $0 |
| edge-tts | 1 - 2s | Excellent | $0 (cloud) |
| espeak-ng | < 100ms | Robotic | $0 |
| ElevenLabs Starter | 300 - 800ms | Excellent | ~$0.108/turn |
kokoro-js uses the Kokoro-82M-v1.0-ONNX model from HuggingFace. It runs entirely locally via ONNX Runtime. No API key, no cloud round-trip.
| Configuration | Cost per turn |
|---|---|
| kokoro-js TTS + Whisper API | ~$0.0005 (ASR only) |
| kokoro-js TTS + local whisper.cpp | $0.00 |
| ElevenLabs TTS + Whisper API | ~$0.108 |
At 20 turns per session, the Whisper-only configuration costs about $0.01 per session. The fully local (whisper.cpp) configuration costs nothing.
- Node.js 20 or higher
- A running OpenClaw Gateway instance (v3 protocol, default port 18789)
- A Discord bot token with the following permissions in the target voice channel:
- View Channel
- Connect
- Speak
- An OpenAI API key for Whisper ASR (or local whisper.cpp — see below)
- ffmpeg installed and on PATH (required by @discordjs/voice for audio encoding)
- OS: Ubuntu 24.04 LTS
- CPU: modern multi-core x86_64
- Node: v24.12.0
- Discord.js: v14
```
git clone https://github.com/GreyforgeLabs/voiceops.git
cd voiceops
npm install
```

The kokoro-js package will download the 82MB ONNX model on the first TTS call. This is automatic.
Copy the example config and fill in your values:
```
cp voiceops.config.example.json voiceops.config.json
```

Edit `voiceops.config.json`:
| Key | Description |
|---|---|
| `voiceChannelId` | Right-click your Discord voice channel → Copy Channel ID |
| `guildId` | Right-click your Discord server → Copy Server ID |
| `operatorUserId` | Your Discord user ID (the operator who will speak to the bot) |
| `tts.voice` | kokoro-js voice name (default: `af_bella`) |
| `tts.speed` | Speech speed multiplier (default: 1.0) |
| `vad.silenceDurationMs` | Milliseconds of silence that ends an utterance (default: 800) |
| `vad.minUtteranceDurationMs` | Clips shorter than this are discarded (default: 500) |
| `vad.rmsThreshold` | RMS energy floor — clips below this are discarded (default: 0.008) |
| `asr.model` | Whisper model name (default: `whisper-1`) |
| `asr.language` | Language hint for Whisper (default: `en`) |
| `pipeline.utterancesPerMinuteLimit` | Rate cap to control API costs (default: 20) |
| `pipeline.thinkingCueEnabled` | Play a short audio cue while the agent is thinking (default: true) |
| `pipeline.thinkingCueText` | Text to synthesize as the thinking cue |
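For reference, a filled-in `voiceops.config.json` might look like the sketch below. The keys and defaults are taken from the table above; the IDs are placeholders, and the shipped `voiceops.config.example.json` may group things slightly differently.

```json
{
  "voiceChannelId": "123456789012345678",
  "guildId": "123456789012345678",
  "operatorUserId": "123456789012345678",
  "tts": { "voice": "af_bella", "speed": 1.0 },
  "vad": {
    "silenceDurationMs": 800,
    "minUtteranceDurationMs": 500,
    "rmsThreshold": 0.008
  },
  "asr": { "model": "whisper-1", "language": "en" },
  "pipeline": {
    "utterancesPerMinuteLimit": 20,
    "thinkingCueEnabled": true,
    "thinkingCueText": "Let me think about that..."
  }
}
```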
To enable Developer Mode in Discord (required to copy IDs): User Settings → App Settings → Advanced → Developer Mode.
VoiceOps reads from ~/.openclaw/openclaw.json. The following keys must be present:
| Key | Description |
|---|---|
| `channels.discord.token` | Discord bot token |
| `gateway.auth.token` | OpenClaw Gateway authentication token |
| `gateway.port` | Gateway WebSocket port (default: 18789) |
| `env.OPENAI_API_KEY` | OpenAI API key for Whisper ASR |
```
npm start
```

Or with file watching during development:

```
npm run dev
```

On startup, VoiceOps will:
- Log in to Discord as the bot
- Verify the bot is a member of the target guild
- Connect to the OpenClaw Gateway via WebSocket (v3 protocol)
- Join the configured voice channel
- Begin listening for the Operator's audio
The bot subscribes to the Operator's Discord audio stream. Discord sends compressed Opus audio. VoiceOps decodes it to raw 16kHz PCM (the format Whisper expects).
When the Operator stops speaking for 800ms, the audio stream ends. VoiceOps immediately re-subscribes so it is always listening. The collected PCM buffer is then checked against a minimum duration (500ms) and an RMS energy threshold (0.008) to filter out silence and noise. If those checks pass, the buffer goes to Whisper.
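The duration and energy checks can be sketched as a small gate function. This is illustrative (the function name is invented, not the literal `asr.mjs` code) and assumes 16-bit little-endian PCM at 16kHz:

```javascript
// Gate an utterance buffer: discard clips that are too short or too quiet.
// Input: Node.js Buffer of PCM16 little-endian samples.
function passesGate(pcmBuffer, {
  sampleRate = 16000,
  minUtteranceMs = 500,
  rmsThreshold = 0.008,
} = {}) {
  const samples = pcmBuffer.length / 2;            // 2 bytes per PCM16 sample
  const durationMs = (samples / sampleRate) * 1000;
  if (durationMs < minUtteranceMs) return false;   // too short: discard

  // RMS energy, normalized to [0, 1] by the PCM16 full-scale value (32768).
  let sumSquares = 0;
  for (let i = 0; i < samples; i++) {
    const s = pcmBuffer.readInt16LE(i * 2) / 32768;
    sumSquares += s * s;
  }
  const rms = Math.sqrt(sumSquares / samples);
  return rms >= rmsThreshold;                      // near-silence: discard
}
```

A pure-silence clip has RMS 0 and fails the gate even if it is long enough, which is exactly the "near-silence clips" case the RMS floor exists for.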
The PCM buffer is wrapped in a WAV header and posted to OpenAI's whisper-1 endpoint. The transcript comes back as plain text.
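Wrapping raw PCM in a WAV header is a fixed 44-byte RIFF preamble. A minimal sketch (the function name is hypothetical; field offsets follow the standard PCM WAV layout):

```javascript
// Prepend a 44-byte RIFF/WAV header to a raw PCM16 buffer.
function pcmToWav(pcm, sampleRate = 16000, channels = 1, bitsPerSample = 16) {
  const blockAlign = (channels * bitsPerSample) / 8;
  const byteRate = sampleRate * blockAlign;
  const header = Buffer.alloc(44);
  header.write("RIFF", 0);
  header.writeUInt32LE(36 + pcm.length, 4);   // total size minus first 8 bytes
  header.write("WAVE", 8);
  header.write("fmt ", 12);
  header.writeUInt32LE(16, 16);               // fmt chunk size (16 for PCM)
  header.writeUInt16LE(1, 20);                // audio format 1 = linear PCM
  header.writeUInt16LE(channels, 22);
  header.writeUInt32LE(sampleRate, 24);
  header.writeUInt32LE(byteRate, 28);
  header.writeUInt16LE(blockAlign, 32);
  header.writeUInt16LE(bitsPerSample, 34);
  header.write("data", 36);
  header.writeUInt32LE(pcm.length, 40);       // payload size in bytes
  return Buffer.concat([header, pcm]);
}
```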
The transcript is sent to the OpenClaw Gateway as a chat.send request. The session key agent:main:voice:user routes it to the main agent. The gateway responds asynchronously: a chat event with state: "final" carries the complete agent response text.
The response text is passed to a kokoro-js TTS subprocess. The subprocess loads the ONNX model, synthesizes Float32 audio, converts it to 16-bit PCM, and writes a WAV file to stdout. The main process collects the WAV bytes.
The WAV buffer is fed to the Discord AudioPlayer. Discord.js + ffmpeg encode it to Opus and transmit it through the voice channel UDP connection.
If a new utterance arrives while the bot is still processing or playing a previous response, it is queued rather than dropped (subject to the per-minute rate limit). The queue drains in order as each response completes.
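One way to get this ordered-drain behavior is a promise chain plus a sliding-window rate cap. The sketch below is illustrative (class and method names are invented, not the actual `pipeline.mjs` internals):

```javascript
// Serialize voice turns: each utterance runs only after the previous
// turn's full ASR → agent → TTS → playback cycle completes.
class TurnQueue {
  constructor(handler, maxPerMinute = 20) {
    this.handler = handler;          // async fn that processes one turn
    this.maxPerMinute = maxPerMinute;
    this.timestamps = [];            // accepted-turn times in the last minute
    this.chain = Promise.resolve();  // tail of the serial promise chain
  }

  // Returns false if the per-minute rate cap rejects the utterance.
  push(utterance) {
    const now = Date.now();
    this.timestamps = this.timestamps.filter((t) => now - t < 60_000);
    if (this.timestamps.length >= this.maxPerMinute) return false;
    this.timestamps.push(now);
    // Chain the turn: it starts only when all queued turns have drained.
    this.chain = this.chain.then(() => this.handler(utterance)).catch(() => {});
    return true;
  }
}
```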
The kokoro-js ONNX phonemizer (Emscripten-compiled WASM) calls process.exit(7) during its cleanup cycle. If TTS ran in-process, this would kill VoiceOps.
The solution is the subprocess pattern: tts.mjs spawns tts-worker.mjs as a child process for every synthesis call. The worker loads the model, synthesizes, writes WAV to stdout, and exits. Exit code 7 is expected and harmless. The main process collects stdout and continues.
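The coordinator side of this pattern can be sketched as follows. This is an illustrative simplification (the actual `tts.mjs` may differ); the key detail is that exit code 7 is whitelisted:

```javascript
import { spawn } from "node:child_process";

// Spawn a worker, pipe it input, collect its stdout, and treat the
// phonemizer's process.exit(7) cleanup quirk as a successful run.
function runWorker(cmd, args, input) {
  return new Promise((resolve, reject) => {
    const child = spawn(cmd, args, { stdio: ["pipe", "pipe", "inherit"] });
    const chunks = [];
    child.stdout.on("data", (c) => chunks.push(c));
    child.on("close", (code) => {
      if (code === 0 || code === 7) resolve(Buffer.concat(chunks)); // 7 = expected
      else reject(new Error(`worker exited with code ${code}`));
    });
    child.stdin.end(input ?? "");
  });
}
```

Because the worker process owns the WASM runtime, its abrupt exit can never take the Discord connection or the Gateway WebSocket down with it.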
@ricky0123/vad-node (Silero VAD) was benchmarked: 43ms cold start, 0.42ms per frame. It was not adopted for this release because it depends on onnxruntime-node@1.24.2, which conflicts with kokoro-js's dependency on onnxruntime-node@1.21.0. Running both simultaneously causes a hard crash.
The adopted VAD is EndBehaviorType.AfterSilence (provided by @discordjs/voice, built on Discord's own silence packets) combined with an RMS energy gate in asr.mjs. This approach has zero external dependencies and is sufficient for single-speaker desktop use.
Silero VAD can be re-evaluated if/when the ONNX version conflict is resolved upstream.
VoiceOps uses OpenClaw Gateway protocol v3.
Connect handshake:
```
Server → { type: "event", event: "connect.challenge" }
Client → { type: "req", id: uuid, method: "connect", params: {
             minProtocol: 3, maxProtocol: 3,
             client: { id, version, platform, mode },
             scopes: ["operator.admin"],
             auth: { token }
           }}
Server → { type: "res", id: uuid, ok: true, payload: { ... } }
```
Send a voice turn:
```
Client → { type: "req", id: uuid, method: "chat.send", params: {
             sessionKey: "agent:main:voice:user",
             message: "transcript text",
             idempotencyKey: uuid
           }}
Server → { type: "event", event: "chat", payload: {
             state: "final", runId: idempotencyKey,
             message: { content: [{ type: "text", text: "response" }] }
           }}
```
The idempotencyKey sent with chat.send is echoed as runId in the final chat event, allowing VoiceOps to match responses to the correct pending promise.
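That matching can be sketched as a map of pending promises keyed by `idempotencyKey`. Names here are illustrative, not the actual `gateway-client.mjs` API:

```javascript
import { randomUUID } from "node:crypto";

// Pair chat.send requests with their final chat events via idempotencyKey/runId.
class PendingTurns {
  constructor() { this.pending = new Map(); }

  // sendFn transmits one protocol message (e.g. ws.send(JSON.stringify(msg))).
  // Returns a promise that resolves with the agent's final response text.
  send(sendFn, sessionKey, message) {
    const idempotencyKey = randomUUID();
    const promise = new Promise((resolve, reject) => {
      this.pending.set(idempotencyKey, { resolve, reject });
    });
    sendFn({ type: "req", id: randomUUID(), method: "chat.send",
             params: { sessionKey, message, idempotencyKey } });
    return promise;
  }

  // Feed every incoming chat event payload through here.
  onChatEvent(payload) {
    if (payload.state !== "final") return;   // ignore streaming partials
    const entry = this.pending.get(payload.runId);
    if (!entry) return;                      // not a turn we originated
    this.pending.delete(payload.runId);
    const text = payload.message.content
      .filter((p) => p.type === "text")
      .map((p) => p.text).join("");
    entry.resolve(text);
  }
}
```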
The voice is set via tts.voice in voiceops.config.json. Available voices in the Kokoro-82M model include:
| Voice | Style |
|---|---|
| `af_bella` | American female, warm (default) |
| `af_sarah` | American female, neutral |
| `am_adam` | American male, neutral |
| `am_michael` | American male, deep |
| `bf_emma` | British female |
| `bm_george` | British male |
See the kokoro-js documentation for the full voice list.
```
voiceops/
  index.mjs                      Entry point — Discord client, pipeline init, graceful shutdown
  src/
    config.mjs                   Config loader — merges openclaw.json + voiceops.config.json
    pipeline.mjs                 Pipeline orchestrator — wires ASR → Gateway → TTS
    discord-voice.mjs            Discord voice channel management (RX + TX + reconnect)
    gateway-client.mjs           OpenClaw Gateway WebSocket client (v3 protocol)
    asr.mjs                      Whisper ASR — PCM → WAV → transcript
    tts.mjs                      TTS coordinator — spawns tts-worker subprocess
    tts-worker.mjs               kokoro-js synthesis worker (subprocess isolation)
  voiceops.config.example.json   Configuration template
  package.json
  LICENSE
```
Send SIGINT (Ctrl+C) or SIGTERM:
```
kill -TERM <pid>
```

VoiceOps will leave the voice channel, close the Gateway WebSocket, and destroy the Discord client before exiting.
When inviting the bot to your server, it needs these permissions:
- `bot` scope
- `CONNECT` (permission integer: 1048576)
- `SPEAK` (permission integer: 2097152)
Combined permission integer: 3145728
The startup log prints a ready-made invite URL:
```
[VoiceOps] Re-invite URL: https://discord.com/oauth2/authorize?client_id=YOUR_BOT_ID&scope=bot&permissions=3145728
```
MIT. See LICENSE.
VoiceOps was designed as a standalone satellite process for the OpenClaw platform. The design goal was a clean, dependency-minimal voice pipeline with no modifications to OpenClaw internals — communication happens entirely through the public Gateway WebSocket API.
Before implementation began, the architecture was validated by an adversarial multi-agent pre-build audit (a dedicated team of AI agents stress-testing the design for failure modes, dependency conflicts, and ONNX version collisions). The VAD architecture note above is one direct output of that audit.
Read the full chronicle: greyforge.tech/chronicles/voiceops-integration
Built by Greyforge Labs · OpenForge · GitHub · MIT License