
Trigger Word for End-of-Speech Detection ("Walkie-Talkie Mode") #210

@evbrandy

Description


Feature Request: Trigger Word for End-of-Speech Detection ("Walkie-Talkie Mode")

Problem

When using VoiceMode in noisy environments (walking on streets, cafes, outdoors), the silence-based end-of-speech detection (WebRTC VAD) struggles to determine when the user has finished speaking. Background noise prevents reliable silence detection, leading to:

  • Premature cutoffs (VAD misclassifies ongoing speech in noise and ends the turn early)
  • Long waits (VAD never detects silence and falls back to listen_duration_max)

Proposed Solution

Add support for an optional trigger word (like "over" in walkie-talkie/radio communication) that explicitly signals the end of speech. When the user says "over" at the end of their message, VoiceMode stops listening immediately.

Configuration

```shell
# In voicemode.env or as an environment variable
VOICEMODE_TRIGGER_WORD=over
# Or for non-English, e.g. German: VOICEMODE_TRIGGER_WORD=fertig
```

Behavior

  1. User speaks: "Please help me refactor the authentication module, over"
  2. VoiceMode detects trigger word → stops recording immediately
  3. Trigger word is stripped from transcription → Claude receives: "Please help me refactor the authentication module"

Implementation Options

Option A: Post-transcription Detection (Simple)

After STT completes, check if transcription ends with trigger word and strip it.

Location: tools/converse.py around line 1655

```python
# After: response_text = stt_result.get("text")
if TRIGGER_WORD and response_text:
    # Normalize first so "over." or "over " still matches (case-insensitive)
    stripped = response_text.rstrip().rstrip('.,;:!?').rstrip()
    if stripped.lower().endswith(TRIGGER_WORD.lower()):
        # Strip the trigger word, then any separator left before it ("..., over")
        response_text = stripped[:-len(TRIGGER_WORD)].rstrip().rstrip('.,;:')
```

Pros: Simple, minimal code changes
Cons: Doesn't solve the waiting problem; recording still waits for silence or timeout before transcribing
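Packaged as a standalone helper, the post-transcription check might look like the sketch below. The function name is hypothetical; it additionally guards against matching the trigger inside a longer word (e.g. "moreover" for trigger "over"), which the inline snippet above does not:

```python
def strip_trigger_word(text: str, trigger: str) -> tuple[str, bool]:
    """Remove a trailing trigger word (case-insensitive) from a transcription.

    Hypothetical helper, not part of VoiceMode's current API. Handles trailing
    punctuation the STT engine may add ("..., over.") and refuses to match
    inside a longer word ("moreover"). Returns (cleaned_text, was_detected).
    """
    if not trigger or not text:
        return text, False
    # Drop trailing whitespace/punctuation so "over." still matches.
    stripped = text.rstrip().rstrip(".,;:!?").rstrip()
    if not stripped.lower().endswith(trigger.lower()):
        return text, False
    head = stripped[: -len(trigger)]
    if head and not (head[-1].isspace() or head[-1] in ".,;:!?"):
        return text, False  # e.g. "moreover" should not match trigger "over"
    # Strip the trigger, then any separator left before it.
    return head.rstrip().rstrip(".,;:"), True
```

A helper like this is also easy to unit-test against STT quirks (added periods, capitalization) independently of the recording pipeline.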

Option B: Real-time Keyword Detection (Better UX)

Detect trigger word during recording to stop immediately.

Approaches:

  1. Vosk - Lightweight offline speech recognition, can do streaming
  2. Periodic STT - Send audio chunks every few seconds to Whisper, check for keyword
  3. Porcupine/similar - Wake word detection (would need custom model training)

Location: tools/converse.py in record_audio_with_silence_detection()

Pros: Immediate response when user says trigger word
Cons: More complex, additional dependencies, potential latency
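Approach 2 (periodic STT) could be sketched as below. Everything here is illustrative: `transcribe` stands in for a real Whisper call, and the function name, chunk cadence, and check interval are placeholders, not VoiceMode internals:

```python
from typing import Callable, Iterable

def record_until_trigger(
    chunks: Iterable[bytes],
    transcribe: Callable[[bytes], str],
    trigger: str,
    check_every: int = 3,
) -> bytes:
    """Accumulate audio chunks; every `check_every` chunks, run STT on the
    audio so far and stop as soon as the partial transcript ends with the
    trigger word. Sketch of the periodic-STT idea, not production code.
    """
    audio = b""
    for i, chunk in enumerate(chunks, start=1):
        audio += chunk
        if i % check_every == 0:
            text = transcribe(audio).rstrip().rstrip(".,;:!?").lower()
            if text.endswith(trigger.lower()):
                break  # trigger word heard; stop recording early
    return audio
```

The trade-off is visible in the loop: each periodic STT call adds cost and latency, which is why the streaming options (Vosk, Porcupine) may be preferable despite the extra dependency.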

Recommended Approach

Start with Option A for quick implementation, then consider Option B as an enhancement if users need faster response in noisy environments.

Use Cases

  1. Outdoor/mobile use - Walking, public transport, street noise
  2. Open office - Background chatter interferes with silence detection
  3. Hands-free operation - Clear signal without needing to press a button
  4. International users - Can configure trigger word in their language

Workaround (Current)

Users can currently work around this by:

disable_silence_detection=True, listen_duration_max=30

But this forces every turn to run for the full listen_duration_max, leaving dead air after the user finishes speaking.

Additional Considerations

  • Trigger word should be configurable (different languages, preferences)
  • Should be optional (disabled by default for backward compatibility)
  • Could support multiple trigger words: VOICEMODE_TRIGGER_WORDS=over,done,finished
  • Consider adding a parameter to converse(): trigger_word: Optional[str] = None

Environment:

  • VoiceMode version: 7.4.2
  • Use case: concept elaboration via voice while mobile
