Feature Request: Trigger Word for End-of-Speech Detection ("Walkie-Talkie Mode")
Problem
When using VoiceMode in noisy environments (walking on streets, cafes, outdoors), the silence-based end-of-speech detection (WebRTC VAD) struggles to determine when the user has finished speaking. Background noise prevents reliable silence detection, leading to:
- Premature cutoffs (VAD thinks noise is the end of speech)
- Long waits (VAD never detects silence, waits for `listen_duration_max`)
Proposed Solution
Add support for an optional trigger word (like "over" in walkie-talkie/radio communication) that explicitly signals the end of speech. When the user says "over" at the end of their message, VoiceMode stops listening immediately.
Configuration
```
# In voicemode.env or as an environment variable
VOICEMODE_TRIGGER_WORD=over
# Or for non-English: VOICEMODE_TRIGGER_WORD=fertig
```
Behavior
- User speaks: "Please help me refactor the authentication module, over"
- VoiceMode detects trigger word → stops recording immediately
- Trigger word is stripped from transcription → Claude receives: "Please help me refactor the authentication module"
Implementation Options
Option A: Post-transcription Detection (Simple)
After STT completes, check whether the transcription ends with the trigger word and strip it.
Location: tools/converse.py around line 1655
```python
# After: response_text = stt_result.get("text")
if TRIGGER_WORD and response_text:
    # Check for the trigger word at the end (case-insensitive),
    # tolerating trailing punctuation such as "..., over."
    stripped = response_text.rstrip(" .,;:!?")
    if stripped.lower().endswith(TRIGGER_WORD.lower()):
        # Strip the trigger word, then any punctuation/comma before it
        response_text = stripped[: -len(TRIGGER_WORD)].rstrip(" .,;:")
```
Pros: Simple, minimal code changes
Cons: Doesn't solve the waiting problem - still waits for silence/timeout before transcribing
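The check above could also be factored into a standalone helper, which makes edge cases (trailing punctuation, words that merely end in the trigger, e.g. "moreover") easy to test. This is a sketch; `strip_trigger_word` is an illustrative name, not an existing VoiceMode function:

```python
def strip_trigger_word(text: str, trigger_word: str) -> tuple[str, bool]:
    """Return (cleaned_text, detected).

    Matches the trigger word case-insensitively at the end of the
    transcription, tolerating punctuation around it ("..., over.").
    """
    if not trigger_word or not text:
        return text, False
    stripped = text.rstrip(" .,;:!?")
    if not stripped.lower().endswith(trigger_word.lower()):
        return text, False
    cleaned = stripped[: -len(trigger_word)]
    if cleaned and cleaned[-1].isalnum():
        # The match was part of a larger word (e.g. "moreover" vs "over")
        return text, False
    return cleaned.rstrip(" .,;:"), True
```

For example, `strip_trigger_word("Refactor the auth module, over.", "over")` returns `("Refactor the auth module", True)`, while `"That is moreover"` is left untouched.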
Option B: Real-time Keyword Detection (Better UX)
Detect trigger word during recording to stop immediately.
Approaches:
- Vosk - Lightweight offline speech recognition, can do streaming
- Periodic STT - Send audio chunks every few seconds to Whisper, check for keyword
- Porcupine/similar - Wake word detection (would need custom model training)
Location: tools/converse.py in record_audio_with_silence_detection()
Pros: Immediate response when user says trigger word
Cons: More complex, additional dependencies, potential latency
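The periodic-STT variant of Option B could be sketched as a recording loop that transcribes the accumulated tail every few seconds and stops as soon as the keyword appears. Here `read_chunk` and `transcribe` are placeholders for VoiceMode's existing audio capture and STT calls; all names and parameters are illustrative:

```python
import time
from typing import Callable

def record_until_trigger(
    read_chunk: Callable[[], bytes],      # returns the next ~100 ms of audio
    transcribe: Callable[[bytes], str],   # STT on an audio buffer
    trigger_word: str = "over",
    check_interval: float = 2.0,          # seconds between STT checks
    max_duration: float = 120.0,
) -> bytes:
    """Record until the trigger word is heard or max_duration elapses."""
    audio = b""
    start = last_check = time.monotonic()
    while time.monotonic() - start < max_duration:
        audio += read_chunk()
        now = time.monotonic()
        if now - last_check >= check_interval:
            last_check = now
            # Transcribe only the recent tail (~2 s at 16 kHz, 16-bit mono)
            # so the periodic check stays cheap
            text = transcribe(audio[-64000:])
            if text.lower().rstrip(" .,!?").endswith(trigger_word.lower()):
                break
    return audio
```

The `check_interval` trades latency against STT cost; a streaming recognizer like Vosk would replace the periodic `transcribe` call with partial-result callbacks.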
Recommended Approach
Start with Option A for quick implementation, then consider Option B as an enhancement if users need faster response in noisy environments.
Use Cases
- Outdoor/mobile use - Walking, public transport, street noise
- Open office - Background chatter interferes with silence detection
- Hands-free operation - Clear signal without needing to press a button
- International users - Can configure trigger word in their language
Workaround (Current)
Users can currently work around this by:
```
disable_silence_detection=True, listen_duration_max=30
```
But this requires either waiting out the full duration or sitting through an awkward silence.
Additional Considerations
- Trigger word should be configurable (different languages, preferences)
- Should be optional (disabled by default for backward compatibility)
- Could support multiple trigger words: `VOICEMODE_TRIGGER_WORDS=over,done,finished`
- Consider adding a per-call parameter to `converse()`: `trigger_word: Optional[str] = None`
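Reading the configuration might look like the following sketch. The `VOICEMODE_TRIGGER_WORDS`/`VOICEMODE_TRIGGER_WORD` names follow the proposal above; the helper itself is hypothetical:

```python
import os

def load_trigger_words() -> list[str]:
    """Read trigger words from the environment.

    VOICEMODE_TRIGGER_WORDS (comma-separated) takes precedence over the
    single-word VOICEMODE_TRIGGER_WORD. An empty or unset value leaves
    the feature disabled, preserving backward compatibility.
    """
    raw = os.environ.get("VOICEMODE_TRIGGER_WORDS") or os.environ.get(
        "VOICEMODE_TRIGGER_WORD", ""
    )
    return [w.strip().lower() for w in raw.split(",") if w.strip()]
```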
Environment:
- VoiceMode version: 7.4.2
- Use case: concept elaboration via voice while mobile