@mohammadaaftabv mohammadaaftabv commented Jan 23, 2026

Add new NeMo Curator stages for ALM (Audio Language Model) data curation:

  • ALMDataBuilderStage: Creates training windows from audio segments with quality filtering (sample rate, bandwidth, speaker count, duration)
  • ALMDataOverlapStage: Filters overlapping windows based on threshold, keeping windows closest to target duration

Add complete tutorial with:

  • Python CLI (pipeline.py) and Hydra runner (run.py)
  • Sample input data for testing
  • Comprehensive documentation

Tested with sample data:

  • Stage 1 produces 181 windows from 5 input entries
  • Stage 2 filters to 25 non-overlapping windows (3035.5s total)

Description

Usage

# Add snippet demonstrating usage

Checklist

  • I am familiar with the Contributing Guide.
  • New or Existing tests cover these changes.
  • The documentation is up to date with these changes.

copy-pr-bot bot commented Jan 23, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@chtruong814 chtruong814 added the needs-follow-up Issue needs follow-up label Jan 25, 2026
@mohammadaaftabv mohammadaaftabv force-pushed the alm_data_build branch 3 times, most recently from deb0bd2 to 4f06bc7 Compare January 27, 2026 17:04
Add new NeMo Curator stages for ALM (Audio Language Model) data curation:
- ALMDataBuilderStage: Creates training windows from audio segments with
  quality filtering (sample rate, bandwidth, speaker count, duration)
- ALMDataOverlapStage: Filters overlapping windows based on threshold,
  keeping windows closest to target duration

Add complete tutorial with:
- YAML-driven pipeline configuration (main.py + pipeline.yaml)
- Sample input data for testing
- Comprehensive documentation with pipeline flow diagram

Add unit tests for both stages (14 tests, all passing).

The output JSONL is consumed by downstream processors, producing
sharded data ready for training Audio Language Models.

Tested with sample data:
- Stage 1 produces 181 windows from 5 input entries
- Stage 2 filters to 25 non-overlapping windows (3035.5s total)

Signed-off-by: aaftaabv@gmail.com <aaftaabv@gmail.com>
- Use loguru.logger instead of standard logging (consistent with codebase)
- Add proper type annotations (dict[str, Any], list[...])
- Use set comprehensions instead of set(...)
- Add noqa comments for complexity issues (C901, PLR0912, PLR0915) that
  are inherent to the ported SDP algorithm
- Fix exception message patterns (EM101/EM102)
- Add match parameter to pytest.raises()
- Remove unused variables in tests

Pre-commit checks now pass for all ALM-related files.

Signed-off-by: aaftaabv@gmail.com <aaftaabv@gmail.com>
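
Two of the lint patterns named in this commit can be illustrated with a short sketch. The `validate_sample_rate` helper and the sample segments are hypothetical and not taken from the PR; they only show the shape of the fixes (exception message bound to a variable for EM101/EM102, and a set comprehension instead of `set(...)` around a generator).

```python
def validate_sample_rate(sample_rate: int, minimum: int = 16000) -> int:
    """Hypothetical helper used only to illustrate the lint patterns."""
    if sample_rate < minimum:
        # EM102: assign the f-string to a variable instead of raising it inline.
        msg = f"sample_rate {sample_rate} is below the minimum {minimum}"
        raise ValueError(msg)
    return sample_rate


segments = [{"speaker": "A"}, {"speaker": "B"}, {"speaker": "A"}]
# Set comprehension instead of set(s["speaker"] for s in segments).
speakers = {s["speaker"] for s in segments}
```

In a test, `pytest.raises(ValueError, match="below the minimum")` would then check both the exception type and its message.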
@mohammadaaftabv mohammadaaftabv marked this pull request as ready for review January 29, 2026 09:38
greptile-apps bot commented Jan 29, 2026

Greptile Overview

Greptile Summary

This PR introduces a new ALM (Audio Language Model) data curation pipeline in NeMo Curator by adding two new audio stages: ALMDataBuilderStage (builds ~120s training windows from manifest segments with sample-rate/bandwidth/speaker constraints and optional truncation) and ALMDataOverlapStage (filters windows based on overlap threshold, preferring windows closest to target duration). It also adds a tutorial (Hydra YAML + runner script), sample fixture data, and unit tests for both stages.

The new stages live under nemo_curator/stages/audio/alm/ and are re-exported via nemo_curator.stages.audio so they can be composed in existing pipelines/executors. Tests and tutorial exercise the intended end-to-end flow: JSONL manifest → builder stage → overlap stage → JSONL output + basic stats.
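
The flow summarized above (segments, then window building, then overlap filtering) can be sketched in plain Python. This is an illustrative sketch only, not the PR's implementation: the names `build_windows`, `overlap_ratio` and `filter_overlaps`, the `start`/`end` segment fields, the 10% tolerance, the overlap definition, and the greedy selection order are all assumptions. The sample-rate, bandwidth and speaker-count checks performed by the real builder stage are omitted.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Window:
    start: float  # seconds
    end: float    # seconds

    @property
    def duration(self) -> float:
        return self.end - self.start


def build_windows(segments, target=120.0, tolerance=0.1):
    """Hypothetical builder: grow a window from each segment until it nears the target."""
    lo, hi = target * (1 - tolerance), target * (1 + tolerance)
    windows = []
    for i, first in enumerate(segments):
        for last in segments[i:]:
            duration = last["end"] - first["start"]
            if duration > hi:
                break
            if duration >= lo:
                windows.append(Window(first["start"], last["end"]))
                break
    return windows


def overlap_ratio(a, b):
    """Overlap length divided by the shorter window's duration (an assumed definition)."""
    overlap = min(a.end, b.end) - max(a.start, b.start)
    return max(0.0, overlap) / min(a.duration, b.duration)


def filter_overlaps(windows, max_overlap=0.5, target=120.0):
    """Hypothetical filter: visit windows closest to the target first, drop heavy overlaps."""
    kept = []
    for w in sorted(windows, key=lambda w: abs(w.duration - target)):
        if all(overlap_ratio(w, k) <= max_overlap for k in kept):
            kept.append(w)
    return sorted(kept, key=lambda w: w.start)


# Twelve back-to-back 30 s segments covering 0-360 s.
segments = [{"start": 30.0 * i, "end": 30.0 * (i + 1)} for i in range(12)]
windows = build_windows(segments)
kept = filter_overlaps(windows)
print(len(windows), len(kept))  # 9 5
```

With this toy input, nine 120 s candidate windows are built and five survive at a 50% maximum overlap. The numbers are specific to the sketch and unrelated to the 181 and 25 windows reported for the PR's sample data.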

Confidence Score: 3/5

  • This PR is not yet safe to merge because overlap filtering semantics appear inverted relative to tests/tutorial, which can silently produce incorrect window selection.
  • Core stage logic is relatively contained, but the overlap threshold mapping disagrees with how the API is documented/used (100% treated as keep-all, 0% as aggressive) which will lead to incorrect filtering and misleading metrics. Once threshold semantics are corrected (and tests updated away from golden totals), risk drops substantially.
  • nemo_curator/stages/audio/alm/alm_data_overlap.py, tests/stages/audio/alm/test_alm_data_overlap.py, tutorials/audio/alm/main.py

Important Files Changed

| Filename | Overview |
| --- | --- |
| nemo_curator/stages/audio/alm/alm_data_overlap.py | Adds ALMDataOverlapStage for overlap filtering; review found overlap_percentage=100 maps to threshold=1.0, causing all non-identical overlaps to be kept (likely incorrect), and the module uses inverted (end, start) timestamp tuples internally, which is fragile (but already noted in prior threads). |
| nemo_curator/stages/audio/__init__.py | Exports ALM stages and common audio stages via __all__; no functional issues found. |
| nemo_curator/stages/audio/alm/__init__.py | Adds ALM stage package exports; no functional issues found. |
| nemo_curator/stages/audio/alm/alm_data_builder.py | Adds ALMDataBuilderStage to build windows from segments with quality/speaker/duration filtering; no new must-fix issues found beyond the prior-thread IndexError (appears already addressed with a bounds check). |
| tests/fixtures/audio/alm/sample_input.jsonl | Adds JSONL fixture with 5 sample entries for ALM stages; no code issues. |
| tests/stages/audio/alm/__init__.py | Adds empty __init__ for test package; no issues. |
| tests/stages/audio/alm/test_alm_data_builder.py | Adds tests for builder stage; contains brittle golden assertion on total window count (already raised in prior threads). |
| tests/stages/audio/alm/test_alm_data_overlap.py | Adds overlap stage tests; includes brittle golden totals and a permissive-mode expectation that may be incorrect depending on threshold semantics. |
| tutorials/audio/README.md | Adds tutorial index doc updates; no functional issues. |
| tutorials/audio/alm/README.md | Adds ALM tutorial documentation; no code issues found. |
| tutorials/audio/alm/main.py | Adds Hydra-driven ALM pipeline runner; imports XennaExecutor from nemo_curator.backends.xenna (prior thread says wrong) and computes stage1_windows using a total_dur_list_window fallback, which changes meaning depending on stage outputs (already noted). |
| tutorials/audio/alm/pipeline.yaml | Adds Hydra YAML config defining builder and overlap stages; correctness depends on stage __init__ params and tutorial runner. |
| tutorials/audio/alm/requirements.txt | Adds tutorial-specific requirements file; ensure versions align with repo tooling. |

@greptile-apps greptile-apps bot left a comment

4 files reviewed, 2 comments

Prevent potential IndexError when accessing segments[curr_idx] after
the inner loop completes. The loop variable curr_idx could exceed
valid segment indices if the loop exits without breaking.

Added min(curr_idx, len(segments) - 1) bounds check at lines 365 and 389.

Signed-off-by: aaftaabv@gmail.com <aaftaabv@gmail.com>
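
The failure mode and the clamp can be illustrated with a small stand-alone sketch. The function `segment_at_scan_end` and the segment layout are hypothetical; only the `min(curr_idx, len(segments) - 1)` clamp comes from the commit message.

```python
def segment_at_scan_end(segments, start_idx, max_duration):
    """Return the segment at the index where a forward scan from start_idx stopped."""
    curr_idx = start_idx
    while curr_idx < len(segments):
        if segments[curr_idx]["end"] - segments[start_idx]["start"] > max_duration:
            break
        curr_idx += 1
    # If the loop ran to the end without breaking, curr_idx == len(segments),
    # so segments[curr_idx] would raise IndexError. Clamp to the last valid index.
    safe_idx = min(curr_idx, len(segments) - 1)
    return segments[safe_idx]


segments = [
    {"start": 0.0, "end": 10.0},
    {"start": 10.0, "end": 20.0},
    {"start": 20.0, "end": 30.0},
]
no_break = segment_at_scan_end(segments, 0, max_duration=100.0)   # loop exhausts the list
with_break = segment_at_scan_end(segments, 0, max_duration=15.0)  # loop breaks at index 1
```

Without the clamp, the first call would index one past the end of `segments`; with it, the last segment is returned instead.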
@greptile-apps greptile-apps bot left a comment

5 files reviewed, no comments

@karpnv karpnv self-requested a review January 31, 2026 00:35
@karpnv karpnv left a comment

LGTM

@ayushdg ayushdg left a comment

Few comments:

  1. Can you add a benchmarking script to benchmarks and share a representative dataset that can be used to run an alm pipeline.
  2. You are already logging many statistics in the stages here, is it possible to also use _log_metrics like done in some of the text stages to log some of these timing metrics so that they can be tracked better to catch regressions?

@greptile-apps greptile-apps bot left a comment

5 files reviewed, no comments

@mohammadaaftabv
Author

Few comments:

  1. Can you add a benchmarking script to benchmarks and share a representative dataset that can be used to run an alm pipeline.

The representative dataset is at https://github.com/mohammadaaftabv/Curator/tree/alm_data_build/tests/fixtures/audio/alm. I am assuming that by benchmarks you mean the result of running both processors on the representative data. In that case, ALM data build should produce 181 windows with the config in the test file, and ALM data overlap, applied to those 181 windows with a maximum allowed overlap of 50%, should give 3035.5 seconds of total output.

All this is in test cases here.

@mohammadaaftabv
Author

2. You are already logging many statistics in the stages here, is it possible to also use _log_metrics like done in some of the text stages to log some of these timing metrics so that they can be tracked better to catch regressions?

Added _log_metrics calls to both stages, following the pattern in text stages. Now tracking:

  • ALMDataBuilderStage: process_entry_time, segments_processed, windows_created
  • ALMDataOverlapStage: filter_time, input_windows, output_windows
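
A minimal sketch of the timing-and-count pattern described here. The `MetricsMixin` class is a stand-in written for this example; it is not NeMo Curator's `_log_metrics` implementation, whose exact signature should be checked in the codebase. The placeholder filter is also an assumption. Only the metric names mirror those listed above.

```python
import time


class MetricsMixin:
    """Stand-in for a stage base class that collects per-call metrics."""

    def __init__(self):
        self.metrics = []

    def _log_metrics(self, metrics):
        # Assumption: metrics are recorded as a flat dict of name -> number.
        self.metrics.append(dict(metrics))


class OverlapStageSketch(MetricsMixin):
    def process(self, windows):
        start = time.perf_counter()
        # Placeholder filter for illustration: keep every second window.
        kept = windows[::2]
        self._log_metrics(
            {
                "filter_time": time.perf_counter() - start,
                "input_windows": len(windows),
                "output_windows": len(kept),
            }
        )
        return kept


stage = OverlapStageSketch()
result = stage.process(list(range(10)))
```

Recording counts alongside the elapsed time makes a regression easier to interpret, since a slower `filter_time` can be compared against the number of windows that were actually processed.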

- Remove output_dir parameter and fcntl file locking from stages
- Add _log_metrics for timing/count tracking in both ALM stages
- Use explicit snake_case stage names (alm_data_builder, alm_data_overlap)
- Add fsspec support to tutorial for cloud path compatibility
@greptile-apps greptile-apps bot left a comment

13 files reviewed, 4 comments

@greptile-apps greptile-apps bot left a comment

13 files reviewed, 5 comments

@greptile-apps greptile-apps bot left a comment

3 files reviewed, 2 comments

@greptile-apps greptile-apps bot left a comment

13 files reviewed, 1 comment

Comment on lines +215 to +218
{
"filter_time": filter_time,
"input_windows": input_windows,
"output_windows": output_windows,
Overlap threshold inverted

threshold = overlap_percentage / 100 combined with if ratio >= threshold: remove makes overlap_percentage=100 most aggressive (threshold=1.0 → almost nothing removed) and overlap_percentage=0 least aggressive (threshold=0.0 → everything after first removed). Tests/tutorial treat 100 as “keep all” and 0 as “aggressive filtering”, so current semantics are reversed. Either invert the threshold (e.g., threshold = 1 - overlap_percentage/100) or rename/redefine overlap_percentage to match behavior and update docs/tests accordingly.
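
The effect of the quoted rule can be checked with a small sketch. It assumes `ratio` is the overlapping fraction of a candidate window relative to an already kept window, and that windows are visited in start order. The function `filter_windows` and its `invert` flag are hypothetical names for illustration; `invert=True` corresponds to the suggested `threshold = 1 - overlap_percentage/100`.

```python
def filter_windows(windows, overlap_percentage, invert=False):
    """Greedy filter: drop a window when its overlap ratio with a kept one reaches the threshold."""
    threshold = overlap_percentage / 100
    if invert:
        threshold = 1 - threshold
    kept = []
    for start, end in windows:
        removed = False
        for k_start, k_end in kept:
            overlap = max(0.0, min(end, k_end) - max(start, k_start))
            ratio = overlap / (end - start)
            if ratio >= threshold:
                removed = True
                break
        if not removed:
            kept.append((start, end))
    return kept


# Five 120 s windows, each shifted by 60 s (50% overlap with its neighbour).
windows = [(60.0 * i, 60.0 * i + 120.0) for i in range(5)]

for pct in (0, 50, 100):
    plain = len(filter_windows(windows, pct))
    inverted = len(filter_windows(windows, pct, invert=True))
    print(pct, plain, inverted)
# 0 1 5
# 50 3 3
# 100 5 1
```

Under these assumptions, the un-inverted rule keeps every partially overlapping window at `overlap_percentage=100` and removes everything after the first window at `overlap_percentage=0`; the inverted threshold swaps the two extremes and leaves 50 unchanged.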
