Add ALM Data Pipeline tutorial and stages #1419

mohammadaaftabv · 2026-01-23T16:56:24Z

Add new NeMo Curator stages for ALM (Audio Language Model) data curation:

- ALMDataBuilderStage: Creates training windows from audio segments with quality filtering (sample rate, bandwidth, speaker count, duration)
- ALMDataOverlapStage: Filters overlapping windows based on a threshold, keeping the windows closest to the target duration
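
As a rough illustration of the window-building idea, the sketch below grows one candidate window per starting segment until the target duration would be exceeded, then keeps it only if its duration and speaker count are in range. The segment fields (`start`, `end`, `speaker`), the function name `build_windows`, and every default value are assumptions made for illustration, not the stage's actual code; the sample-rate and bandwidth checks are left out.

```python
from typing import Any


def build_windows(
    segments: list[dict[str, Any]],
    target_duration: float = 120.0,
    tolerance: float = 0.1,
    min_speakers: int = 2,
    max_speakers: int = 5,
) -> list[dict[str, Any]]:
    """Hypothetical sketch: one candidate window per starting segment."""
    max_duration = target_duration * (1 + tolerance)
    min_duration = target_duration * (1 - tolerance)
    windows: list[dict[str, Any]] = []
    for i, first in enumerate(segments):
        chosen = []
        for seg in segments[i:]:
            # Stop growing once the next segment would overshoot the window.
            if seg["end"] - first["start"] > max_duration:
                break
            chosen.append(seg)
        if not chosen:
            continue
        duration = chosen[-1]["end"] - first["start"]
        speakers = {seg["speaker"] for seg in chosen}
        if duration >= min_duration and min_speakers <= len(speakers) <= max_speakers:
            windows.append(
                {
                    "start": first["start"],
                    "end": chosen[-1]["end"],
                    "duration": duration,
                    "speakers": sorted(speakers),
                }
            )
    return windows


# Ten back-to-back 30 s segments with two alternating speakers.
demo_segments = [
    {"start": 30.0 * i, "end": 30.0 * (i + 1), "speaker": "A" if i % 2 == 0 else "B"}
    for i in range(10)
]
demo_windows = build_windows(demo_segments)
```

Because every starting segment yields its own candidate, neighbouring windows overlap heavily, which is why a second, overlap-filtering stage follows.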

Add complete tutorial with:

- Python CLI (pipeline.py) and Hydra runner (run.py)
- Sample input data for testing
- Comprehensive documentation

Tested with sample data:

- Stage 1 produces 181 windows from 5 input entries
- Stage 2 filters to 25 non-overlapping windows (3035.5s total)

Description

Usage

# Add snippet demonstrating usage

Checklist

I am familiar with the Contributing Guide.
New or Existing tests cover these changes.
The documentation is up to date with these changes.

copy-pr-bot · 2026-01-23T16:56:28Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

Add new NeMo Curator stages for ALM (Audio Language Model) data curation:

- ALMDataBuilderStage: Creates training windows from audio segments with quality filtering (sample rate, bandwidth, speaker count, duration)
- ALMDataOverlapStage: Filters overlapping windows based on threshold, keeping windows closest to target duration

Add complete tutorial with:

- YAML-driven pipeline configuration (main.py + pipeline.yaml)
- Sample input data for testing
- Comprehensive documentation with pipeline flow diagram

Add unit tests for both stages (14 tests, all passing).

The output JSONL is consumed by downstream processors, producing sharded data ready for training Audio Language Models.

Tested with sample data:

- Stage 1 produces 181 windows from 5 input entries
- Stage 2 filters to 25 non-overlapping windows (3035.5s total)

Signed-off-by: aaftaabv@gmail.com <aaftaabv@gmail.com>

- Use loguru.logger instead of standard logging (consistent with codebase)
- Add proper type annotations (dict[str, Any], list[...])
- Use set comprehensions instead of set(...)
- Add noqa comments for complexity issues (C901, PLR0912, PLR0915) that are inherent to the ported SDP algorithm
- Fix exception message patterns (EM101/EM102)
- Add match parameter to pytest.raises()
- Remove unused variables in tests

Pre-commit checks now pass for all ALM-related files.

Signed-off-by: aaftaabv@gmail.com <aaftaabv@gmail.com>
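
For readers unfamiliar with these lint rules, here is a small sketch of two of the patterns named in the commit: building the exception message in a variable before raising (the EM101/EM102 style) and using a set comprehension. The function names and the validated parameter are hypothetical and are not taken from the PR's code.

```python
def validate_overlap_percentage(value: float) -> float:
    """Return the percentage as a fraction, rejecting out-of-range input."""
    if not 0 <= value <= 100:
        # EM102 style: assign the f-string to a variable, then raise with it.
        msg = f"overlap_percentage must be in [0, 100], got {value}"
        raise ValueError(msg)
    return value / 100


def unique_speakers(segments: list[dict[str, str]]) -> set[str]:
    # Set comprehension instead of set(<generator expression>).
    return {seg["speaker"] for seg in segments}
```

In a test, `pytest.raises(ValueError, match="overlap_percentage")` would then check the message with a regular-expression search, rather than only the exception type.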

greptile-apps · 2026-01-29T09:41:03Z

Greptile Overview

Greptile Summary

This PR introduces a new ALM (Audio Language Model) data curation pipeline in NeMo Curator by adding two new audio stages: ALMDataBuilderStage (builds ~120s training windows from manifest segments with sample-rate/bandwidth/speaker constraints and optional truncation) and ALMDataOverlapStage (filters windows based on overlap threshold, preferring windows closest to target duration). It also adds a tutorial (Hydra YAML + runner script), sample fixture data, and unit tests for both stages.

The new stages live under nemo_curator/stages/audio/alm/ and are re-exported via nemo_curator.stages.audio so they can be composed in existing pipelines/executors. Tests and tutorial exercise the intended end-to-end flow: JSONL manifest → builder stage → overlap stage → JSONL output + basic stats.

Confidence Score: 3/5

This PR is not yet safe to merge because overlap filtering semantics appear inverted relative to tests/tutorial, which can silently produce incorrect window selection.
Core stage logic is relatively contained, but the overlap threshold mapping disagrees with how the API is documented/used (100% treated as keep-all, 0% as aggressive) which will lead to incorrect filtering and misleading metrics. Once threshold semantics are corrected (and tests updated away from golden totals), risk drops substantially.
nemo_curator/stages/audio/alm/alm_data_overlap.py, tests/stages/audio/alm/test_alm_data_overlap.py, tutorials/audio/alm/main.py

Important Files Changed

| Filename | Overview |
| --- | --- |
| `nemo_curator/stages/audio/alm/alm_data_overlap.py` | Adds ALMDataOverlapStage for overlap filtering; review found overlap_percentage=100 maps to threshold=1.0 causing all non-identical overlaps to be kept (likely incorrect), and module uses inverted (end,start) timestamp tuples internally which is fragile (but already noted in prior threads). |
| `nemo_curator/stages/audio/__init__.py` | Exports ALM stages and common audio stages via `__all__`; no functional issues found. |
| `nemo_curator/stages/audio/alm/__init__.py` | Adds ALM stage package exports; no functional issues found. |
| `nemo_curator/stages/audio/alm/alm_data_builder.py` | Adds ALMDataBuilderStage to build windows from segments with quality/speaker/duration filtering; no new must-fix issues found beyond prior-thread IndexError (appears already addressed with bounds check). |
| `tests/fixtures/audio/alm/sample_input.jsonl` | Adds JSONL fixture with 5 sample entries for ALM stages; no code issues. |
| `tests/stages/audio/alm/__init__.py` | Adds empty init for test package; no issues. |
| `tests/stages/audio/alm/test_alm_data_builder.py` | Adds tests for builder stage; contains brittle golden assertion on total window count (already raised in prior threads). |
| `tests/stages/audio/alm/test_alm_data_overlap.py` | Adds overlap stage tests; includes brittle golden totals and permissive-mode expectation that may be incorrect depending on threshold semantics. |
| `tutorials/audio/README.md` | Adds tutorial index doc updates; no functional issues. |
| `tutorials/audio/alm/README.md` | Adds ALM tutorial documentation; no code issues found. |
| `tutorials/audio/alm/main.py` | Adds Hydra-driven ALM pipeline runner; imports XennaExecutor from nemo_curator.backends.xenna (prior thread says wrong) and computes stage1_windows using total_dur_list_window fallback, which changes meaning depending on stage outputs (already noted). |
| `tutorials/audio/alm/pipeline.yaml` | Adds Hydra YAML config defining builder and overlap stages; correctness depends on stage init params and tutorial runner. |
| `tutorials/audio/alm/requirements.txt` | Adds tutorial-specific requirements file; ensure versions align with repo tooling. |

greptile-apps

4 files reviewed, 2 comments

nemo_curator/stages/audio/alm/alm_data_builder.py

Prevent potential IndexError when accessing segments[curr_idx] after the inner loop completes. The loop variable curr_idx could exceed valid segment indices if the loop exits without breaking. Added min(curr_idx, len(segments) - 1) bounds check at lines 365 and 389.

Signed-off-by: aaftaabv@gmail.com <aaftaabv@gmail.com>
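
The failure mode and the clamp described in this commit can be sketched in isolation as follows. The helper name `last_segment_index`, the sample segments, and the 10-second limit are hypothetical; only the `min(curr_idx, len(segments) - 1)` expression comes from the commit message.

```python
def last_segment_index(curr_idx: int, segments: list[dict]) -> int:
    """Clamp an index that may have run one past the end of `segments`."""
    return min(curr_idx, len(segments) - 1)


segments = [{"start": 0.0, "end": 4.0}, {"start": 4.0, "end": 9.0}]

curr_idx = 0
# The loop ends without breaking, so curr_idx finishes at len(segments).
while curr_idx < len(segments) and segments[curr_idx]["end"] <= 10.0:
    curr_idx += 1

# segments[curr_idx] would raise IndexError here; the clamped index is safe.
last = segments[last_segment_index(curr_idx, segments)]
```

Note that the clamp alone does not protect against an empty `segments` list, where `len(segments) - 1` is `-1`; that case needs a separate guard.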

greptile-apps

5 files reviewed, no comments

karpnv

LGTM

ayushdg

A few comments:

1. Can you add a benchmarking script to benchmarks and share a representative dataset that can be used to run an ALM pipeline?
2. You are already logging many statistics in the stages here. Is it possible to also use _log_metrics, as done in some of the text stages, to log some of these timing metrics so that they can be tracked better to catch regressions?

nemo_curator/stages/audio/alm/alm_data_builder.py

greptile-apps

5 files reviewed, no comments

mohammadaaftabv · 2026-02-07T12:20:46Z

Few comments:

Can you add a benchmarking script to benchmarks and share a representative dataset that can be used to run an alm pipeline.

The representative dataset is https://github.com/mohammadaaftabv/Curator/tree/alm_data_build/tests/fixtures/audio/alm and I am assuming that by benchmarks you mean the result of running both processors on that data. In that case, the ALM data builder should produce 181 windows with the config in the test file, and the ALM data overlap stage, applied to those 181 windows with a maximum allowed overlap of 50%, should give 3035.5 seconds of total output.

All this is in test cases here.

mohammadaaftabv · 2026-02-07T12:57:01Z

2. You are already logging many statistics in the stages here, is it possible to also use _log_metrics like done in some of the text stages to log some of these timing metrics so that they can be tracked better to catch regressions?

Added _log_metrics calls to both stages, following the pattern in text stages. Now tracking:

ALMDataBuilderStage: process_entry_time, segments_processed, windows_created

ALMDataOverlapStage: filter_time, input_windows, output_windows
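
A minimal sketch of what recording these metrics might look like. `MetricsRecorder` is a stand-in written for this example, not NeMo Curator's base class, and the assumption that `_log_metrics` accepts a plain dict of names to numbers is exactly that, an assumption; only the metric names come from the comment above.

```python
import time


class MetricsRecorder:
    """Stand-in base class that mimics a `_log_metrics`-style hook (assumed shape)."""

    def __init__(self) -> None:
        self.metrics: dict[str, float] = {}

    def _log_metrics(self, metrics: dict[str, float]) -> None:
        self.metrics.update(metrics)


class OverlapFilterSketch(MetricsRecorder):
    def process(self, windows: list[dict]) -> list[dict]:
        start = time.perf_counter()
        # Placeholder for the real overlap filtering logic.
        kept = [w for w in windows if w.get("keep", True)]
        self._log_metrics(
            {
                "filter_time": time.perf_counter() - start,
                "input_windows": len(windows),
                "output_windows": len(kept),
            }
        )
        return kept


stage = OverlapFilterSketch()
kept_windows = stage.process([{"keep": True}, {"keep": False}, {}])
```

Recording counts alongside timing makes it possible to tell a slowdown apart from a change in how many windows pass the filter.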

- Remove output_dir parameter and fcntl file locking from stages
- Add _log_metrics for timing/count tracking in both ALM stages
- Use explicit snake_case stage names (alm_data_builder, alm_data_overlap)
- Add fsspec support to tutorial for cloud path compatibility

greptile-apps

13 files reviewed, 4 comments

nemo_curator/stages/audio/alm/alm_data_overlap.py

tutorials/audio/alm/main.py

nemo_curator/stages/audio/alm/alm_data_overlap.py

tests/stages/audio/alm/test_alm_data_overlap.py

greptile-apps

13 files reviewed, 5 comments

tutorials/audio/alm/main.py

nemo_curator/stages/audio/alm/alm_data_overlap.py

tests/stages/audio/alm/test_alm_data_builder.py

tutorials/audio/alm/main.py

…iltered

greptile-apps

3 files reviewed, 2 comments

tutorials/audio/alm/main.py

nemo_curator/stages/audio/alm/alm_data_overlap.py

greptile-apps

13 files reviewed, 1 comment

greptile-apps · 2026-02-10T09:25:26Z

nemo_curator/stages/audio/alm/alm_data_overlap.py

+            {
+                "filter_time": filter_time,
+                "input_windows": input_windows,
+                "output_windows": output_windows,


Overlap threshold inverted

threshold = overlap_percentage / 100 combined with if ratio >= threshold: remove makes overlap_percentage=100 most aggressive (threshold=1.0 → almost nothing removed) and overlap_percentage=0 least aggressive (threshold=0.0 → everything after first removed). Tests/tutorial treat 100 as “keep all” and 0 as “aggressive filtering”, so current semantics are reversed. Either invert the threshold (e.g., threshold = 1 - overlap_percentage/100) or rename/redefine overlap_percentage to match behavior and update docs/tests accordingly.
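
To make the two candidate mappings concrete, the sketch below runs the same greedy filter with `threshold = overlap_percentage / 100` and with `threshold = 1 - overlap_percentage / 100`. The overlap-ratio definition (overlap length divided by the shorter window), the function names, and the three sample windows are assumptions for illustration; only the `ratio >= threshold` removal rule and the two formulas come from the comment.

```python
Window = tuple[float, float]  # (start, end) in seconds


def overlap_ratio(a: Window, b: Window) -> float:
    """Overlap length divided by the shorter window's length (assumed definition)."""
    overlap = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    shorter = min(a[1] - a[0], b[1] - b[0])
    return overlap / shorter if shorter > 0 else 0.0


def filter_windows(windows: list[Window], threshold: float) -> list[Window]:
    """Greedy pass: drop a window whose ratio with any kept window is >= threshold."""
    kept: list[Window] = []
    for w in windows:
        if all(overlap_ratio(w, k) < threshold for k in kept):
            kept.append(w)
    return kept


# Three 120 s windows, each overlapping its neighbour by 50%.
windows = [(0.0, 120.0), (60.0, 180.0), (120.0, 240.0)]

results = {}
for pct in (0, 50, 100):
    direct = filter_windows(windows, pct / 100)
    inverted = filter_windows(windows, 1 - pct / 100)
    results[pct] = (len(direct), len(inverted))

print(results)  # {0: (1, 3), 50: (2, 2), 100: (3, 1)}
```

Under these assumptions the direct mapping keeps more windows as the percentage rises, and the inverted mapping keeps fewer. Which of the two matches the intended meaning of overlap_percentage is the question the comment raises.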

github-actions bot added the community-request label Jan 23, 2026

chtruong814 added the needs-follow-up Issue needs follow-up label Jan 25, 2026

mohammadaaftabv force-pushed the alm_data_build branch 3 times, most recently from deb0bd2 to 4f06bc7 January 27, 2026 17:04

mohammadaaftabv added 2 commits January 29, 2026 15:06

mohammadaaftabv force-pushed the alm_data_build branch from 66abf28 to 0125f32 January 29, 2026 09:36

mohammadaaftabv marked this pull request as ready for review January 29, 2026 09:38

greptile-apps bot reviewed Jan 29, 2026

nemo_curator/stages/audio/alm/alm_data_builder.py (outdated)

nemo_curator/stages/audio/alm/alm_data_builder.py (outdated)

greptile-apps bot reviewed Jan 29, 2026

karpnv self-requested a review January 31, 2026 00:35

karpnv approved these changes Jan 31, 2026

ayushdg reviewed Feb 5, 2026

Merge branch 'main' into alm_data_build

5093c0a

greptile-apps bot reviewed Feb 7, 2026

nemo_curator/stages/audio/alm/alm_data_overlap.py

tutorials/audio/alm/main.py

nemo_curator/stages/audio/alm/alm_data_overlap.py

tests/stages/audio/alm/test_alm_data_overlap.py

Fix sample data paths in README customization examples

3e95b86

greptile-apps bot reviewed Feb 7, 2026

Ensure stable output schema in ALMDataOverlapStage when all windows filtered

5ebc720

greptile-apps bot reviewed Feb 7, 2026

tutorials/audio/alm/main.py

nemo_curator/stages/audio/alm/alm_data_overlap.py

mohammadaaftabv requested review from ayushdg and karpnv February 9, 2026 02:41

Merge branch 'main' into alm_data_build

83c35a7

greptile-apps bot reviewed Feb 10, 2026


4 participants