Add ALM Data Pipeline tutorial and stages #1419
base: main
deb0bd2 to 4f06bc7
Add new NeMo Curator stages for ALM (Audio Language Model) data curation:
- ALMDataBuilderStage: Creates training windows from audio segments with quality filtering (sample rate, bandwidth, speaker count, duration)
- ALMDataOverlapStage: Filters overlapping windows based on threshold, keeping windows closest to target duration

Add complete tutorial with:
- YAML-driven pipeline configuration (main.py + pipeline.yaml)
- Sample input data for testing
- Comprehensive documentation with pipeline flow diagram

Add unit tests for both stages (14 tests, all passing).

The output JSONL is consumed by downstream processors, producing sharded data ready for training Audio Language Models.

Tested with sample data:
- Stage 1 produces 181 windows from 5 input entries
- Stage 2 filters to 25 non-overlapping windows (3035.5s total)

Signed-off-by: aaftaabv@gmail.com <aaftaabv@gmail.com>
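The quality filtering named in the commit message (sample rate, bandwidth, speaker count, duration) can be pictured as a single predicate over a candidate window. The following is only a sketch under assumptions: the `WindowLimits` class, the `window_passes` function, the dictionary keys, and every threshold value are hypothetical and are not taken from ALMDataBuilderStage.

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class WindowLimits:
    """Hypothetical thresholds; the real stage's parameter names may differ."""

    min_sample_rate: int = 16000  # Hz
    min_bandwidth: int = 8000  # Hz
    max_speakers: int = 5
    min_duration: float = 30.0  # seconds
    max_duration: float = 120.0  # seconds


def window_passes(window: dict[str, Any], limits: WindowLimits) -> bool:
    """Return True only if the window clears every quality check."""
    duration = window["end"] - window["start"]
    speakers = {seg["speaker"] for seg in window["segments"]}
    return (
        window["sample_rate"] >= limits.min_sample_rate
        and window["bandwidth"] >= limits.min_bandwidth
        and 1 <= len(speakers) <= limits.max_speakers
        and limits.min_duration <= duration <= limits.max_duration
    )


# Assumed window layout, for illustration only.
good = {
    "start": 0.0,
    "end": 60.0,
    "sample_rate": 16000,
    "bandwidth": 8000,
    "segments": [{"speaker": "A"}, {"speaker": "B"}],
}
too_short = {**good, "end": 10.0}
print(window_passes(good, WindowLimits()), window_passes(too_short, WindowLimits()))
```

Keeping the limits in one frozen dataclass makes it straightforward to populate them from a YAML config, which fits the YAML-driven tutorial described above.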
- Use loguru.logger instead of standard logging (consistent with codebase)
- Add proper type annotations (dict[str, Any], list[...])
- Use set comprehensions instead of set(...)
- Add noqa comments for complexity issues (C901, PLR0912, PLR0915) that are inherent to the ported SDP algorithm
- Fix exception message patterns (EM101/EM102)
- Add match parameter to pytest.raises()
- Remove unused variables in tests

Pre-commit checks now pass for all ALM-related files.

Signed-off-by: aaftaabv@gmail.com <aaftaabv@gmail.com>
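For readers unfamiliar with these lint rules, a small illustration of three of the patterns listed above (type annotations, set comprehensions, and the EM101/EM102 exception-message style) might look like this. The function names `speaker_ids` and `check_overlap_percentage` are hypothetical and do not come from the PR.

```python
from typing import Any


def speaker_ids(segments: list[dict[str, Any]]) -> set[str]:
    """Collect distinct speakers with a set comprehension, not set(generator)."""
    return {seg["speaker"] for seg in segments}


def check_overlap_percentage(value: float) -> float:
    """Validate a percentage, assigning the message before raising (EM102 style)."""
    if not 0 <= value <= 100:
        # EM101/EM102 flag string literals and f-strings passed directly to
        # the exception constructor; binding the message to a variable first
        # satisfies both rules.
        msg = f"overlap_percentage must be in [0, 100], got {value}"
        raise ValueError(msg)
    return value


print(speaker_ids([{"speaker": "A"}, {"speaker": "B"}, {"speaker": "A"}]))
```

In a test, the matching change would be along the lines of `pytest.raises(ValueError, match="overlap_percentage")`, so the test fails if a different ValueError is raised.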
66abf28 to 0125f32
Greptile Overview
Greptile Summary
This PR introduces a new ALM (Audio Language Model) data curation pipeline in NeMo Curator by adding two new audio stages:
The new stages live under
Confidence Score: 3/5
4 files reviewed, 2 comments
Prevent potential IndexError when accessing segments[curr_idx] after the inner loop completes. The loop variable curr_idx could exceed valid segment indices if the loop exits without breaking. Added min(curr_idx, len(segments) - 1) bounds check at lines 365 and 389.

Signed-off-by: aaftaabv@gmail.com <aaftaabv@gmail.com>
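A minimal sketch of the failure mode and the clamp described in this commit, assuming a while-style scan over segments (the actual loop in the stage is not shown on this page, and `target_end` and the segment layout are hypothetical):

```python
segments = [
    {"start": 0.0, "end": 4.0},
    {"start": 4.0, "end": 9.0},
    {"start": 9.0, "end": 12.0},
]
target_end = 60.0  # hypothetical window end, later than every segment

# Advance while segments still fit. No early exit fires here, so the
# index runs one past the last valid position.
curr_idx = 0
while curr_idx < len(segments) and segments[curr_idx]["end"] <= target_end:
    curr_idx += 1

# curr_idx == len(segments) at this point, so segments[curr_idx] would
# raise IndexError. Clamping keeps the access in range.
safe_idx = min(curr_idx, len(segments) - 1)
last_segment = segments[safe_idx]
print(curr_idx, safe_idx, last_segment["end"])
```

The clamp does not cover an empty segment list: `len(segments) - 1` is then -1, and indexing an empty list still raises, so that case needs its own guard.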
5 files reviewed, no comments
karpnv
left a comment
LGTM
ayushdg
left a comment
Few comments:
- Can you add a benchmarking script to benchmarks and share a representative dataset that can be used to run an ALM pipeline?
- You are already logging many statistics in the stages here. Is it possible to also use _log_metrics, as done in some of the text stages, to log some of these timing metrics so that they can be tracked better to catch regressions?
5 files reviewed, no comments
https://github.com/mohammadaaftabv/Curator/tree/alm_data_build/tests/fixtures/audio/alm is the representative dataset. I am assuming that by benchmarks you mean the result of running both processors on the representative data. In that case, ALM data build should produce 181 windows based on the config in the test file, and ALM data overlap applied to the resulting 181 windows with a maximum allowed overlap of 50% yields 3035.5 seconds of total output. All of this is covered in the test cases here.
Added _log_metrics calls to both stages, following the pattern in text stages. Now tracking:
- Remove output_dir parameter and fcntl file locking from stages
- Add _log_metrics for timing/count tracking in both ALM stages
- Use explicit snake_case stage names (alm_data_builder, alm_data_overlap)
- Add fsspec support to tutorial for cloud path compatibility
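The timing/count tracking mentioned in this commit can be sketched as follows. This is an assumption-laden illustration: `SketchOverlapStage` is a stand-in class, its `_log_metrics` is a hypothetical substitute for the method the framework provides (whose real signature is not shown on this page), and the filter is a placeholder. Only the metric keys `filter_time`, `input_windows`, and `output_windows` come from the diff excerpt in this PR.

```python
import time
from typing import Any


class SketchOverlapStage:
    """Stand-in for a stage; NOT the NeMo Curator base class."""

    def __init__(self) -> None:
        self.metrics: dict[str, float] = {}

    def _log_metrics(self, metrics: dict[str, float]) -> None:
        # Hypothetical substitute: here the values are simply stored.
        self.metrics.update(metrics)

    def process(self, windows: list[dict[str, Any]]) -> list[dict[str, Any]]:
        input_windows = len(windows)
        start = time.perf_counter()
        # Placeholder filter: drop zero-length windows.
        kept = [w for w in windows if w["end"] - w["start"] > 0]
        filter_time = time.perf_counter() - start
        self._log_metrics(
            {
                "filter_time": filter_time,
                "input_windows": input_windows,
                "output_windows": len(kept),
            }
        )
        return kept


stage = SketchOverlapStage()
result = stage.process(
    [{"start": 0.0, "end": 5.0}, {"start": 5.0, "end": 5.0}, {"start": 6.0, "end": 9.0}]
)
print(stage.metrics["input_windows"], stage.metrics["output_windows"])
```

Using `time.perf_counter()` rather than `time.time()` gives a monotonic clock, which is the usual choice when the numbers are meant to catch regressions.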
13 files reviewed, 4 comments
13 files reviewed, 5 comments
3 files reviewed, 2 comments
13 files reviewed, 1 comment
    {
        "filter_time": filter_time,
        "input_windows": input_windows,
        "output_windows": output_windows,
Overlap threshold inverted
threshold = overlap_percentage / 100 combined with if ratio >= threshold: remove makes overlap_percentage=100 most aggressive (threshold=1.0 → almost nothing removed) and overlap_percentage=0 least aggressive (threshold=0.0 → everything after first removed). Tests/tutorial treat 100 as “keep all” and 0 as “aggressive filtering”, so current semantics are reversed. Either invert the threshold (e.g., threshold = 1 - overlap_percentage/100) or rename/redefine overlap_percentage to match behavior and update docs/tests accordingly.
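To make the two candidate semantics concrete, here is a small sketch of the rule quoted in the comment (`threshold = overlap_percentage / 100`, remove when `ratio >= threshold`) alongside the suggested inverted form. It is not the stage's implementation: `overlap_ratio`, `filter_windows`, the `invert` flag, the tuple window layout, and the choice to measure overlap against the shorter window are all assumptions made for illustration.

```python
Window = tuple[float, float]  # (start, end) in seconds, assumed layout


def overlap_ratio(a: Window, b: Window) -> float:
    """Overlap length divided by the shorter window's duration (assumed definition)."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    shorter = min(a[1] - a[0], b[1] - b[0])
    return inter / shorter if shorter > 0 else 0.0


def filter_windows(
    windows: list[Window], overlap_percentage: float, invert: bool = False
) -> list[Window]:
    """Greedily keep windows whose overlap with every kept window is below threshold."""
    threshold = overlap_percentage / 100
    if invert:
        # The alternative proposed in the review comment.
        threshold = 1 - threshold
    kept: list[Window] = []
    for w in windows:
        if any(overlap_ratio(w, k) >= threshold for k in kept):
            continue  # removed: overlaps a kept window too much
        kept.append(w)
    return kept


windows = [(0.0, 10.0), (5.0, 15.0), (20.0, 30.0)]
print([len(filter_windows(windows, p)) for p in (0, 50, 100)])  # [1, 2, 3]
print([len(filter_windows(windows, p, invert=True)) for p in (0, 50, 100)])  # [3, 2, 1]
```

Under the quoted rule, 0 removes every window after the first (a ratio of 0 already meets the threshold) and 100 removes only windows that fully overlap a kept one. The inverted form flips that ordering, so whichever reading the docs and tests adopt, the parameter name should make the direction explicit.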
Add new NeMo Curator stages for ALM (Audio Language Model) data curation:
Add complete tutorial with:
Tested with sample data:
Description
Usage
# Add snippet demonstrating usage
Checklist