Conversation
Note: Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior in the settings.
📝 Walkthrough

This pull request expands the Janus audio pipeline with production-ready infrastructure: upgrading audio processing to 48 kHz with larger chunks and adding testable threading controls.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Client as Client (Web)
    participant WS as WebSocket Server
    participant Engine as Engine
    participant Sender as Sender Process
    participant Executor as ThreadPoolExecutor
    Client->>WS: Connect
    WS->>Engine: Read initial state
    WS-->>Client: ControlStateMessage (mode, is_streaming, is_recording, emotion_override)
    Client->>WS: Start Stream
    WS->>Engine: Update state (is_streaming=true)
    WS-->>Client: ControlStateMessage (updated state)
    Sender->>Sender: Read audio chunk
    Sender->>Sender: Check pre-roll buffer & VAD
    Sender->>Sender: Detect speech
    Sender->>Executor: Submit process_and_transmit (async)
    Executor->>Executor: Transcribe audio
    Executor->>Executor: Extract prosody
    Executor->>Executor: Transmit packet
    Executor-->>Sender: Job complete
    Sender->>Engine: Detect silence
    Engine->>Engine: Check silence threshold
    Engine->>Sender: Trigger processing if buffered
    Client->>WS: Stop Stream
    WS->>Engine: Update state (is_streaming=false)
    WS-->>Client: ControlStateMessage (updated state)
```
```mermaid
sequenceDiagram
    participant Test as Test Runner
    participant MockAudio as MockAudioService
    participant Receiver as Receiver Loop
    participant Sender as Sender Loop
    participant Executor as ThreadPoolExecutor
    Test->>MockAudio: Set input pattern (conversation)
    Test->>Receiver: Start in thread with stop_event
    Test->>Sender: Start in thread with stop_event
    Sender->>MockAudio: read_chunk()
    MockAudio-->>Sender: Audio frame
    Sender->>Executor: process_and_transmit()
    Executor->>Executor: Transcribe + analyze
    Executor->>Receiver: UDP packet
    Receiver->>MockAudio: write_chunk() (synthesized)
    Sender->>Sender: Detect silence threshold
    Sender->>Executor: Submit next batch
    Test->>Test: Wait for multiple outputs
    Test->>Sender: Set stop_event
    Test->>Receiver: Set stop_event
    Sender->>Executor: shutdown()
    Receiver-->>Test: Cleanup complete
    Test->>Test: Assert ≥2 WAV chunks, no deadlock
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~75 minutes
🚥 Pre-merge checks: ✅ 3 | ❌ 1

❌ Failed checks (1 inconclusive)

✅ Passed checks (3 passed)

✏️ Tip: You can configure your own custom pre-merge checks in the settings.
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@backend/tests/test_e2e_local.py`:
- Around line 366-378: The test currently prints thread_exceptions but does not
fail when they exist; update the post-thread check to fail the test if
thread_exceptions is non-empty by asserting or calling pytest.fail with a clear
message that includes the thread_exceptions contents (e.g., reference
thread_exceptions and include thread_name/exc), then continue to check
mock_audio_service.written_audio_chunks as before to ensure multiple chunks; use
the existing thread_exceptions variable and
mock_audio_service.written_audio_chunks to locate where to add the failing
assertion.
🧹 Nitpick comments (4)
backend/tests/test_engine.py (1)
88-101: Consider moving fixture definition before the test that uses it.

While pytest resolves fixtures by name regardless of position, placing fixtures before their dependent tests improves readability and follows conventional test file organization.
♻️ Suggested reordering
Move the `reset_ducking_state` fixture (lines 88-101) to appear before `test_audio_ducking` (line 53), typically after the existing tests and before the new ducking test, or at the top of the file after imports.

backend/tests/test_e2e_local.py (3)
17-20: Remove unused imports.

These classes are imported but never used directly. The `monkeypatch.setattr` calls reference them via string paths (e.g., `'backend.services.transcriber.Transcriber'`), not the imported symbols.

Proposed fix

```diff
-from backend.common.protocol import JanusMode, JanusPacket
+from backend.common.protocol import JanusPacket
-from backend.services.prosody import ProsodyExtractor
-from backend.services.synthesizer import Synthesizer
-from backend.services.transcriber import Transcriber
-from backend.services.vad import VoiceActivityDetector
```
244-244: Rename unused loop variable to `_`.

The loop variable `turn` is not used within the loop body.

Proposed fix

```diff
-    for turn in range(num_turns):
+    for _ in range(num_turns):
```
334-339: Consider using a synchronization primitive instead of fixed sleep.

The `time.sleep(0.5)` for waiting on the receiver to be ready is a race condition that may cause flaky tests on slower CI environments. A more robust approach would be using a `threading.Event` signaled by the receiver when it's ready to accept connections.

This is a common pattern in integration tests and may be acceptable given the generous timeouts elsewhere, but worth considering if you observe test flakiness.
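The Event-based handshake the comment suggests can be sketched as follows; names like `receiver_ready` and the UDP socket details are illustrative assumptions, not code from the test file:

```python
import socket
import threading

receiver_ready = threading.Event()  # hypothetical name, not from the test file

def receiver_loop(stop_event: threading.Event) -> None:
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(("127.0.0.1", 0))   # bound before signalling readiness
    sock.settimeout(0.1)
    receiver_ready.set()          # replaces the fixed time.sleep(0.5)
    while not stop_event.is_set():
        try:
            sock.recvfrom(4096)
        except socket.timeout:
            continue
    sock.close()

stop = threading.Event()
t = threading.Thread(target=receiver_loop, args=(stop,), daemon=True)
t.start()
# The test blocks until the receiver is actually listening, with a bounded wait.
assert receiver_ready.wait(timeout=5.0), "receiver never became ready"
stop.set()
t.join(timeout=5.0)
```

The bounded `wait(timeout=...)` keeps the test from hanging forever if the receiver thread dies before signalling.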
```python
# Check for thread exceptions
if thread_exceptions:
    for thread_name, exc in thread_exceptions:
        print(f"Exception in {thread_name} thread: {exc}")
    # Don't fail immediately - check if we got results anyway

# Verification: Assert multiple audio chunks were received
written_chunks = mock_audio_service.written_audio_chunks

# Critical assertion: Must receive at least 2 chunks (proving multiple turns work)
assert len(written_chunks) >= 2, (
    f"Expected at least 2 audio chunks (multiple turns), but got {len(written_chunks)}. "
    f"This indicates a deadlock bug after the first transmission."
```
Thread exceptions should cause test failure.
Currently, if an exception occurs in a thread but some chunks were still received, the test passes despite the error. The exceptions are only printed but not asserted against, which could mask real issues.
Proposed fix

```diff
     # Check for thread exceptions
     if thread_exceptions:
         for thread_name, exc in thread_exceptions:
             print(f"Exception in {thread_name} thread: {exc}")
-        # Don't fail immediately - check if we got results anyway
+        pytest.fail(f"Thread exceptions occurred: {thread_exceptions}")

     # Verification: Assert multiple audio chunks were received
```

Alternatively, if you want to allow partial success with warnings, consider using `pytest.warns` or explicitly documenting why swallowing exceptions is acceptable.
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@README.md`:
- Around line 10-18: The README badges for Next.js and Tailwind are out of date;
locate the badge lines containing the alt texts "![Next.js]" and "![Tailwind]"
and update their version labels to match package.json (Next.js => 16.1.6,
Tailwind CSS => 4). Ensure the shield URLs/labels reflect these exact versions
so the displayed badges match the project's current dependencies.
*(Quoted diff: README badge block, lines 10-18; the badge image markdown was not captured in this view.)*
🧩 Analysis chain

🏁 Script executed:

```shell
git ls-files | head -20
```

Repository: akshatvasisht/janus
Length of output: 555

🏁 Script executed:

```shell
find . -maxdepth 2 -type f \( -name "package.json" -o -name "pyproject.toml" -o -name "requirements.txt" -o -name "poetry.lock" -o -name "package-lock.json" \) 2>/dev/null
```

Repository: akshatvasisht/janus
Length of output: 142

🏁 Script executed:

```shell
cat backend/requirements.txt
```

Repository: akshatvasisht/janus
Length of output: 707

🏁 Script executed:

```shell
cat frontend/package.json
```

Repository: akshatvasisht/janus
Length of output: 1021
Update Next.js and Tailwind CSS badges to reflect current dependency versions.
The Next.js badge shows version 15.0, but the project uses 16.1.6 (package.json). The Tailwind CSS badge shows 3.4, but the project uses version 4 (package.json devDependencies). Update both badges to match the current dependencies.
🧰 Tools

🪛 LanguageTool

[style] ~10-~10: Using many exclamation marks might seem excessive (in this case: 9 exclamation marks for a text that's 4625 characters long)
Actionable comments posted: 4
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (5)
backend/services/synthesizer.py (1)
29-41: ⚠️ Potential issue | 🟡 Minor

Stale docstring still describes removed default-path fallback.

Lines 33–35 say "If no reference audio path is provided, uses a default path in the backend directory", but the code (lines 50–53) and the updated param doc (line 40) confirm that no default is loaded when the path is `None`. Please align the top paragraph with the new behavior.

📝 Proposed fix

```diff
     Sets up the Fish Audio SDK client with the provided API key and loads
-    reference audio for voice cloning. If no reference audio path is provided,
-    uses a default path in the backend directory. Supports hot-reload of
-    reference audio files without restarting the service.
+    reference audio for voice cloning. If no reference audio path is provided,
+    no reference audio is loaded. Supports hot-reload of reference audio
+    files without restarting the service.
```

backend/services/prosody.py (1)
11-21: ⚠️ Potential issue | 🟡 Minor

Stale docstring: default sample rate documented as 44100 Hz but actual default is 48000 Hz.

Line 20 still reads `Default is 44100 Hz` while the constructor default on Line 11 is now `48000`.

📝 Proposed fix

```diff
     Args:
-        sample_rate: Audio sample rate in Hz. Default is 44100 Hz.
+        sample_rate: Audio sample rate in Hz. Default is 48000 Hz.
         hop_size: Analysis hop size in samples. Default is 512 samples.
```

backend/services/vad.py (1)
11-24: ⚠️ Potential issue | 🟡 Minor

Stale docstring: default sample rate documented as 44100 Hz but actual default is 48000 Hz.

Same issue as in `prosody.py` — Line 23 still says "Default is 44100 Hz."

📝 Proposed fix

```diff
-        sample_rate: Input audio sample rate in Hz. Default is 44100 Hz.
+        sample_rate: Input audio sample rate in Hz. Default is 48000 Hz.
```

backend/services/audio_io.py (1)
24-25: ⚠️ Potential issue | 🟡 Minor

Stale docstring: still references "44100Hz sample rate, 512 sample chunk size."

Update to reflect the new values (48000 Hz, 1536 samples).

📝 Proposed fix

```diff
-    Sets up PyAudio instance and opens input/output streams for microphone capture
-    and speaker playback. Configures audio parameters (44100Hz sample rate, 512 sample
-    chunk size, mono channel) required for AI processing pipelines.
+    Sets up PyAudio instance and opens input/output streams for microphone capture
+    and speaker playback. Configures audio parameters (48000Hz sample rate, 1536 sample
+    chunk size, mono channel) required for AI processing pipelines.
```

backend/tests/test_input_processing.py (1)
53-67: ⚠️ Potential issue | 🟡 Minor

Stale docstring: says "default 512" but actual default is 1536.

Line 58 still reads `(default 512)` while the function signature on Line 53 defaults to `chunk_size=1536`.

📝 Proposed fix

```diff
     """
-    Generate a valid audio chunk matching CHUNK_SIZE (1536 samples).
+    Generate a valid audio chunk matching CHUNK_SIZE.

     Args:
-        chunk_size: Number of samples (default 512)
+        chunk_size: Number of samples (default 1536)
         sample_rate: Sample rate in Hz
```
🤖 Fix all issues with AI agents
In `@backend/scripts/verify_audio_params.py`:
- Around line 9-10: DEFAULT_OUTPUT_SR is set to 44100 but
AudioService.SAMPLE_RATE was changed to 48000, so update the verification script
to use the actual service sample rate: replace the hardcoded DEFAULT_OUTPUT_SR
(44_100) with 48_000 or, better, import/read AudioService.SAMPLE_RATE and set
DEFAULT_OUTPUT_SR = AudioService.SAMPLE_RATE so generate()/resampling checks
align with AudioService; ensure references to DEFAULT_OUTPUT_SR in
verify_audio_params.py use the updated value.
In `@backend/services/engine.py`:
- Around line 235-251: The pitch comparisons use 'Low' but
ProsodyExtractor.analyze_buffer() returns 'Deep' (not 'Low'), so replace the
checks for pitch == 'Low' in the prosody-based branch (the block that sets
emotion_tag when packet.override_emotion == "Auto") with pitch == 'Deep'
(including both the branch that maps to 'Panicked' and the branch that maps to
'Serious'); also consider normalizing/casing prosody.get('pitch') beforehand in
the same block to avoid future mismatches.
In `@backend/services/vad.py`:
- Around line 52-60: The code decimates 48000 and 44100 inputs by slicing
(audio_chunk[::3]) but for 44100 this yields ~14700 Hz while still setting
vad_sample_rate=16000, and the else branch passes unsupported rates to Silero
VAD (which only supports 8000/16000). Fix by validating self.sample_rate in the
VAD handling function/class: for 48000 resample/decimate to 16000 and set
vad_sample_rate=16000; for 44100 either perform proper resampling to 16000
(e.g., use librosa.resample or scipy.signal.resample_poly) or raise ValueError
if 44100 support is not desired; for any other sample rates either resample to
8000 or 16000 as appropriate or raise an explicit ValueError instead of passing
arbitrary rates. Ensure that the actual audio_chunk sampling rate matches
vad_sample_rate and update/log any errors accordingly (refer to
variables/methods using self.sample_rate, audio_chunk, and vad_sample_rate).
In `@backend/tests/test_input_processing.py`:
- Line 677: The inline comment on the for loop that iterates "for _ in
range(20)" incorrectly states "SILENCE_THRESHOLD_CHUNKS (6)"; update that
comment to reflect the actual threshold (15) or remove the parenthetical to
avoid misleading readers—locate the test loop in test_input_processing.py (the
for _ in range(20) block) and change the comment to e.g. "More than
SILENCE_THRESHOLD_CHUNKS (15)" referencing the SILENCE_THRESHOLD_CHUNKS
constant.
🧹 Nitpick comments (5)
backend/scripts/list_audio_devices.py (2)
7-7: Unused import: `sys` is imported but never referenced.

🧹 Proposed fix

```diff
 import pyaudio
-import sys
```
9-20: `p.terminate()` is not called if an exception occurs during device enumeration.

If `get_default_input_device_info()`, `get_default_output_device_info()`, or the device iteration loop raises (e.g., `IOError` when no default device exists), the PyAudio instance is never terminated. Consider using a `try/finally` block.

🛡️ Proposed fix

```diff
 def list_devices():
     p = pyaudio.PyAudio()
-    print(f"PyAudio Version: {pyaudio.__version__}")
-    print(f"Default Input Device: {p.get_default_input_device_info() if p.get_device_count() > 0 else 'None'}")
-    print(f"Default Output Device: {p.get_default_output_device_info() if p.get_device_count() > 0 else 'None'}")
-
-    print("\nAll Devices:")
-    for i in range(p.get_device_count()):
-        info = p.get_device_info_by_index(i)
-        print(f"Index {i}: {info['name']} (Inputs: {info['maxInputChannels']}, Outputs: {info['maxOutputChannels']})")
-
-    p.terminate()
+    try:
+        print(f"PyAudio Version: {pyaudio.__version__}")
+        print(f"Default Input Device: {p.get_default_input_device_info() if p.get_device_count() > 0 else 'None'}")
+        print(f"Default Output Device: {p.get_default_output_device_info() if p.get_device_count() > 0 else 'None'}")
+
+        print("\nAll Devices:")
+        for i in range(p.get_device_count()):
+            info = p.get_device_info_by_index(i)
+            print(f"Index {i}: {info['name']} (Inputs: {info['maxInputChannels']}, Outputs: {info['maxOutputChannels']})")
+    finally:
+        p.terminate()
```

backend/scripts/verify_audio_params.py (1)
83-86: Directly resetting singleton internals (`_instance`, `_initialized`) is fragile.

This couples the script to `ModelManager`'s private implementation. If the singleton pattern is refactored, this will break silently. Consider adding a public `reset()` class method to `ModelManager` if this pattern is needed.
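A minimal sketch of the public `reset()` hook the comment proposes; the `ModelManager` shape below is a simplified stand-in, not the project's actual class:

```python
class ModelManager:
    """Simplified stand-in for a lazily-initialized singleton."""
    _instance = None
    _initialized = False

    def __new__(cls):
        if cls._instance is None:
            cls._instance = super().__new__(cls)
        return cls._instance

    def __init__(self):
        if not type(self)._initialized:
            self.models = {}              # expensive setup would happen here
            type(self)._initialized = True

    @classmethod
    def reset(cls) -> None:
        """Public test hook: drop the singleton so the next call re-initializes."""
        cls._instance = None
        cls._initialized = False

a = ModelManager()
ModelManager.reset()   # scripts/tests call this instead of
b = ModelManager()     # poking _instance/_initialized directly
```

Callers then depend only on the public method, so internal refactors of the singleton stay contained.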
backend/services/engine.py (1)

496-531: `audio_buffer` and `silence_counter` are reset twice — at Lines 498-499 and again at Lines 530-531.

The second reset at Lines 530-531 is redundant since `audio_buffer` was already cleared to `[]` and `silence_counter` set to `0` before the executor call.

🧹 Proposed fix — remove redundant reset

```diff
         text, meta = await loop.run_in_executor(
             executor, process_audio_blocking, combined_audio
         )
-        audio_buffer = []
-        silence_counter = 0
-
         if text.strip():
```

backend/scripts/sender_main.py (1)
148-151: Empty `if` block — commented-out heartbeat log leaves a no-op conditional.

The `if chunk_counter % 100 == 0:` block body is entirely commented out. Either add `pass` explicitly, remove the block, or uncomment the log at a `DEBUG` level.

🧹 Option A: Remove the block

```diff
     chunk_counter += 1
-    if chunk_counter % 100 == 0:
-        # logger.info(f"Audio consumer heartbeat: Processed {chunk_counter} chunks (Mock mode: {'Yes' if not audio_service._pyaudio_available else 'No'})")
     trigger_processing = False
```

🧹 Option B: Uncomment at DEBUG level

```diff
     chunk_counter += 1
     if chunk_counter % 100 == 0:
-        # logger.info(f"Audio consumer heartbeat: Processed {chunk_counter} chunks (Mock mode: {'Yes' if not audio_service._pyaudio_available else 'No'})")
+        logger.debug(f"Audio consumer heartbeat: {chunk_counter} chunks processed")
     trigger_processing = False
```
```
generate() in model_manager.py resamples output with librosa.resample when
native_sr != target_sr so that playback matches AudioService (44.1 kHz).
```
DEFAULT_OUTPUT_SR (44100) is inconsistent with AudioService.SAMPLE_RATE (48000).
Line 10 says "playback matches AudioService (44.1 kHz)" and Line 21 sets DEFAULT_OUTPUT_SR = 44_100, but AudioService.SAMPLE_RATE was updated to 48000 in this same PR. Users running this script with defaults will get misleading verification results.
📝 Proposed fix

```diff
-generate() in model_manager.py resamples output with librosa.resample when
-native_sr != target_sr so that playback matches AudioService (44.1 kHz).
+generate() in model_manager.py resamples output with librosa.resample when
+native_sr != target_sr so that playback matches AudioService (48 kHz).
 """
 from __future__ import annotations

 import argparse
 import json
 import sys

 # Default model ID; must match ModelManager default.
 DEFAULT_MODEL_ID = "Qwen/Qwen3-TTS-12Hz-0.6B-Base"
-DEFAULT_OUTPUT_SR = 44_100
+DEFAULT_OUTPUT_SR = 48_000
```

Also applies to: 21-21
```python
if packet.override_emotion != "Auto":
    emotion_tag = packet.override_emotion
else:
    prosody = packet.prosody
    pitch = prosody.get('pitch', 'Normal')
    energy = prosody.get('energy', 'Normal')

    if pitch == 'High' and energy == 'Loud':
        emotion_tag = 'Excited'
    elif pitch == 'High' and energy == 'Normal':
        emotion_tag = 'Joyful'
    elif pitch == 'Low' and energy == 'Loud':
        emotion_tag = 'Panicked'
    elif pitch == 'Low' and energy in ('Quiet', 'Low'):
        emotion_tag = 'Serious'
    else:
        emotion_tag = 'Neutral'
```
Bug: pitch == 'Low' will never match — ProsodyExtractor returns 'Deep', not 'Low'.
ProsodyExtractor.analyze_buffer() returns 'Deep', 'Normal', or 'High' for the pitch key. Lines 247-249 check for 'Low', which will never match, so the 'Panicked' and 'Serious' emotion tags are unreachable through prosody-based derivation.
🐛 Proposed fix

```diff
     if pitch == 'High' and energy == 'Loud':
         emotion_tag = 'Excited'
     elif pitch == 'High' and energy == 'Normal':
         emotion_tag = 'Joyful'
-    elif pitch == 'Low' and energy == 'Loud':
+    elif pitch == 'Deep' and energy == 'Loud':
         emotion_tag = 'Panicked'
-    elif pitch == 'Low' and energy in ('Quiet', 'Low'):
+    elif pitch == 'Deep' and energy in ('Quiet',):
         emotion_tag = 'Serious'
     else:
         emotion_tag = 'Neutral'
```
```python
if self.sample_rate == 48000:
    audio_chunk = audio_chunk[::3]
    vad_sample_rate = 16000  # Silero VAD expects 16k
elif self.sample_rate == 44100:
    # Fallback for old rate if used, though it's still slightly inaccurate without proper resampling
    audio_chunk = audio_chunk[::3]
    vad_sample_rate = 16000
else:
    vad_sample_rate = self.sample_rate
```
44100 Hz fallback decimates by 3 (→14700 Hz) but tells Silero VAD it's 16000 Hz.
The comment on Line 56 acknowledges the inaccuracy, but the mismatch (14700 actual vs. 16000 declared) can meaningfully degrade speech detection. If 44100 Hz support is needed, consider using proper resampling (e.g., librosa.resample or scipy.signal.resample_poly). If it's not needed, removing the branch and raising a ValueError for unsupported rates would be clearer.
Also, the else branch (Line 59-60) passes arbitrary sample rates directly to Silero VAD, which only supports 8000 and 16000 Hz — any other rate will silently produce incorrect results or error.
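A sketch of the proper-resampling option using `scipy.signal.resample_poly` (44100 → 16000 reduces exactly to the ratio 160/441); the function name and the strict `ValueError` fallback are illustrative choices from this comment, not the project's code:

```python
import numpy as np
from scipy.signal import resample_poly

def to_vad_rate(audio_chunk: np.ndarray, sample_rate: int) -> tuple[np.ndarray, int]:
    """Convert a chunk to a rate Silero VAD supports (8000 or 16000 Hz)."""
    if sample_rate == 48000:
        return audio_chunk[::3], 16000   # integer decimation: 48000 / 3 = 16000
    if sample_rate == 44100:
        # 16000/44100 == 160/441, so polyphase resampling hits 16 kHz exactly
        return resample_poly(audio_chunk, up=160, down=441), 16000
    if sample_rate in (8000, 16000):
        return audio_chunk, sample_rate
    raise ValueError(f"Unsupported sample rate for VAD: {sample_rate}")

chunk = np.zeros(4410, dtype=np.float32)   # 100 ms at 44.1 kHz
resampled, rate = to_vad_rate(chunk, 44100)
```

Note that plain `[::3]` slicing applies no anti-aliasing filter; `resample_poly` does, so it could also be used for the 48000 Hz branch if aliasing proves audible in VAD behavior.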
```diff
 mock_vad.is_speech.return_value = False

-for _ in range(20):  # More than SILENCE_THRESHOLD_CHUNKS (16)
+for _ in range(20):  # More than SILENCE_THRESHOLD_CHUNKS (6)
```
Stale comment: says "SILENCE_THRESHOLD_CHUNKS (6)" but the actual threshold is 15.
The test behavior is correct (20 > 15), but the parenthetical is misleading.
📝 Proposed fix

```diff
-    for _ in range(20):  # More than SILENCE_THRESHOLD_CHUNKS (6)
+    for _ in range(20):  # More than SILENCE_THRESHOLD_CHUNKS (15)
```
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
backend/services/engine.py (1)
554-565: ⚠️ Potential issue | 🔴 Critical

`avg_pitch_hz` and `avg_energy` are always `None`, but the proposed fix won't work because `meta` doesn't contain those keys.

The issue is real: lines 554–555 set both to `None` without extracting values from the `meta` dict. However, `analyze_buffer()` only returns categorical tags (`'energy'` and `'pitch'` strings), not numeric values. The method computes `avg_pitch` internally (line 90 of prosody.py) but discards it instead of returning the numeric Hz value. The receiver path (lines 232–233) has the same problem—it tries to extract `'avg_pitch_hz'` and `'avg_energy'` keys that don't exist in `packet.prosody`.

To fix this properly, `analyze_buffer()` should be modified to return the numeric average pitch and energy values alongside or instead of the categorical tags. The caller can then extract and use these values.
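One way the suggested fix could look; this is a hedged sketch with assumed thresholds and a placeholder pitch estimate, since the real `ProsodyExtractor` internals aren't shown here:

```python
import numpy as np

def analyze_buffer(audio: np.ndarray, sample_rate: int = 48000) -> dict:
    """Sketch: return numeric averages alongside the categorical tags."""
    rms = float(np.sqrt(np.mean(audio.astype(np.float64) ** 2)))
    avg_pitch = 150.0  # placeholder; the real extractor already computes avg_pitch
    pitch_tag = 'Deep' if avg_pitch < 120 else 'High' if avg_pitch > 220 else 'Normal'
    energy_tag = 'Quiet' if rms < 0.01 else 'Loud' if rms > 0.1 else 'Normal'
    return {
        'pitch': pitch_tag,
        'energy': energy_tag,
        'avg_pitch_hz': avg_pitch,  # the numeric keys the callers try to read
        'avg_energy': rms,
    }

meta = analyze_buffer(np.zeros(1536, dtype=np.float32))
```

With the numeric keys present, both the engine path (`meta`) and the receiver path (`packet.prosody`) can read `'avg_pitch_hz'` and `'avg_energy'` without further changes.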
🤖 Fix all issues with AI agents
In `@frontend/components/PushToTalk.tsx`:
- Around line 65-68: The aria-pressed attribute must reflect the same combined
active state used for styling/labeling: change any aria-pressed={isRecording} to
aria-pressed={isRecording || isStreaming} in the PushToTalk component (update
the primary button element and the secondary occurrence around the code block
that handles the alternate button/label), ensuring both the visual state and
accessibility state use the same boolean expression (isRecording ||
isStreaming).
🧹 Nitpick comments (6)
backend/services/engine.py (1)
499-506: Double reset of `audio_buffer` and `silence_counter`; magic threshold.

`audio_buffer` and `silence_counter` are reset at lines 501–502 and again at lines 533–534. Since this is single-threaded async, no new chunks can arrive during the `await` on line 529 within the same iteration, so the second reset is dead code.

The threshold `1536 * 6` is a magic number. Consider extracting it as a named constant with a comment explaining its derivation (e.g., minimum meaningful utterance length at 48 kHz).

Proposed cleanup

```diff
+MIN_PROCESS_SAMPLES = 1536 * 6  # ~192ms at 48kHz – minimum for meaningful transcription
+
 ...
-        if len(combined_audio) < 1536 * 6:
+        if len(combined_audio) < MIN_PROCESS_SAMPLES:
             logger.info(f"Skipping short audio buffer ({len(combined_audio)} samples)")
             continue
```

And remove the redundant reset at lines 533–534.
backend/api/socket_manager.py (1)
75-84: Consider using `logger` instead of `print()`.

The new `_reset_control_state()` uses `print()` (line 83), while the rest of the backend uses the `logging` module. This is also true for other `print()` calls in this file (lines 65, 68, 109, 136, 173), but since those are pre-existing, the new code should at least follow the project-wide logging pattern.

Proposed fix

```diff
+import logging
+
+logger = logging.getLogger(__name__)
+
 def _reset_control_state() -> None:
     """
     Reset engine_state.control_state to idle values.
     """
     state = engine_state.control_state
     state.is_streaming = False
     state.is_recording = False
     state.is_talking = False
-    print(f"Control State Reset on Disconnect: {state}")
+    logger.info(f"Control State Reset on Disconnect: {state}")
```

frontend/hooks/useJanusSocket.ts (1)
49-62: Clever use of React Query as a reactive state bus.

The `enabled: false` + `queryFn: () => null` pattern works for reading externally-set query data, but it's non-obvious. A brief comment explaining that the actual data is injected via `queryClient.setQueryData` from `useJanusWebSocket` would help future readers understand this isn't a "real" query.

Suggested comment

```diff
+  // Reactive reader for control_state messages set via queryClient.setQueryData
+  // in useJanusWebSocket — not a real fetch query.
   const { data: remoteState } = useQuery<ControlStateMessage | null>({
     queryKey: queryKeys.controlState,
     queryFn: async () => null,
     enabled: false,
   });
```

frontend/hooks/useJanusWebSocket.ts (3)
189-196: In-memory transcripts grow unbounded while persistence is capped at 100.

`next` (the full list) is returned to the query cache, but only `next.slice(0, 100)` is persisted. Over a long session the in-memory list can grow without limit, and on refresh the user loses everything beyond 100. If the cap is intentional, consider applying it to the returned value too so memory stays bounded.

Additionally, calling `localStorage.setItem` inside the `setQueryData` updater introduces a side effect in what React Query expects to be a pure function. Consider moving persistence to a separate `useEffect` that watches `transcripts`.
206-213: `localStorage` write on every packet may be a hot-path concern.

Packet summaries can arrive at high frequency. Writing up to 200 items to `localStorage` on every single packet is expensive (serialization + synchronous I/O). Consider debouncing or throttling the persist, or persisting on an interval / on `beforeunload` instead.
99-118: Persisted data loaded in `useEffect` will flash empty initial data for one render.

The `useQuery` hooks set `initialData: []` synchronously, then this effect overwrites from `localStorage` after the first render. This causes a brief flash of empty state. A minor UX nit — if you want to avoid it, you could initialize query data from `localStorage` synchronously before the first render (e.g., via `initialData` in the `useQuery` call itself).
```tsx
className={`w-48 h-48 transition-all border-4 border-black rounded-none ${isRecording || isStreaming
  ? 'bg-red-500 hover:bg-red-600 text-white shadow-none translate-x-[4px] translate-y-[4px]'
  : 'bg-white hover:bg-gray-100 text-black shadow-[8px_8px_0px_0px_rgba(0,0,0,1)] hover:shadow-[6px_6px_0px_0px_rgba(0,0,0,1)] hover:translate-x-[2px] hover:translate-y-[2px]'
  }`}
```
aria-pressed doesn't reflect the combined active state.
The visual styling and label now treat isRecording || isStreaming as the active state, but aria-pressed on line 70 still only reflects isRecording. Screen-reader users won't be informed that the button is visually "active" when streaming alone is true.
Proposed fix

```diff
-  aria-pressed={isRecording}
+  aria-pressed={isRecording || isStreaming}
```

Also applies to: 94-101
…mproved error handling

Force-pushed from `114585f` to `6c1b72a`.
Summary by CodeRabbit
New Features
Bug Fixes
Documentation
Chores