-
Notifications
You must be signed in to change notification settings - Fork 215
(fix): refactor audio stage names to be shown after running benchmark #1470
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
ee5bba1 to
3d7cca0
Compare
Greptile OverviewGreptile SummaryThis PR focuses on making stage performance stats retain meaningful stage names after benchmarking. It does this by (1) propagating The change also updates Confidence Score: 3/5
Important Files Changed
Sequence DiagramsequenceDiagram
participant Exec as Executor
participant Stage as ProcessingStage
participant Legacy as LegacySpeechStage
participant Task as Task/AudioBatch
participant Out as Output Task
Exec->>Stage: process_batch([task])
Stage->>Stage: validate_input(task)
Stage->>Stage: process(task)
alt process() returns None (filtered)
Stage-->>Exec: skip (continue)
else process() returns list
Stage-->>Exec: extend results
else process() returns single task
Stage-->>Exec: append result
end
Exec->>Legacy: process(AudioBatch)
loop each entry in task.data
Legacy->>Legacy: process_dataset_entry(entry)
Legacy->>Out: propagate _stage_perf if missing/empty
end
Legacy-->>Exec: list[Task]
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
4 files reviewed, 1 comment
| result = self.process(task) | ||
| if isinstance(result, list): | ||
| for r in result: | ||
| if r is not task and hasattr(r, "_stage_perf") and not r._stage_perf: | ||
| r._stage_perf = list(task._stage_perf) | ||
| results.extend(result) | ||
| else: | ||
| if result is not task and hasattr(result, "_stage_perf") and not result._stage_perf: | ||
| result._stage_perf = list(task._stage_perf) | ||
| results.append(result) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
None result can crash
ProcessingStage.process() is documented to allow None for filtering, but process_batch() treats any non-list result as a task-like object and then does hasattr(result, "_stage_perf") / results.append(result). With this change, if process() returns None, the else branch will raise (at result is not task / hasattr(...)) and/or append None into results. This is a functional regression in the default batch path for any stage that filters tasks by returning None.
Fix by explicitly handling result is None before the list/non-list logic (skip or continue).
2a05acd to
f6a5132
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
4 files reviewed, no comments
|
Hey @SwekeR-463 thanks for the PR, since few of the classes where you added name extend from Can you try making the same changes as https://github.com/NVIDIA-NeMo/Curator/pull/1433/changes and seeing if it works Also I'm not sure if the changes in |
No it didnt worked after making the changes you said for
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
3 files reviewed, 1 comment
nemo_curator/stages/audio/common.py
Outdated
| def __init__(self, **kwargs): | ||
| super().__init__(**kwargs) | ||
| self.name = self.__class__.__name__ | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Dataclass init breakage
LegacySpeechStage is a base for @dataclass stages in this file (e.g., GetAudioDurationStage). With the new LegacySpeechStage.__init__(...), the dataclass-generated __init__ will pass its field names into super().__init__(...), which then forwards them to ProcessingStage. Since ProcessingStage doesn’t accept those extra keywords, instantiation will raise TypeError for unexpected keyword arguments.
Worked with the having the same changes from #1433 and keeping my changes for
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
3 files reviewed, 1 comment
nemo_curator/stages/audio/common.py
Outdated
| def __init__(self, **kwargs): | ||
| super().__init__(**kwargs) | ||
| self.name = self.__class__.__name__ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Dataclass init breaks
LegacySpeechStage is a base for @dataclass stages (e.g., GetAudioDurationStage). With the new LegacySpeechStage.__init__(self, **kwargs) calling super().__init__(**kwargs), the dataclass-generated __init__ will pass its field args into super().__init__, and ProcessingStage doesn’t accept those extra keywords. This will raise TypeError during instantiation of those dataclass stages.
nemo_curator/stages/audio/common.py
Outdated
| def __init__(self, **kwargs): | ||
| super().__init__(**kwargs) | ||
| self.name = self.__class__.__name__ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems to only work as intended for PreserveByValueStage since that is the only child that does a super().__init__.
nemo_curator/stages/base.py
Outdated
| if result is None: | ||
| continue | ||
| if isinstance(result, list): | ||
| for r in result: | ||
| if r is not task and hasattr(r, "_stage_perf") and not r._stage_perf: | ||
| r._stage_perf = list(task._stage_perf) | ||
| results.extend(result) | ||
| else: | ||
| if result is not task and hasattr(result, "_stage_perf") and not result._stage_perf: | ||
| result._stage_perf = list(task._stage_perf) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am able to reproduce what @SwekeR-463 has seen with vs without this change. Although I would like to investigate if there is something that can be changed on the audio side instead of here...
| """ | ||
|
|
||
| name = "AudioToDocumentStage" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah I think I found the bug! The process function below does not explicitly forward the _stage_perf list when returning the DocumentBatch.
We may need to update this elsewhere too... then we should be able to remove the changes from nemo_curator/stages/base.py.
@SwekeR-463 let me know what you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fix: make sure _stage_perf=task._stage_perf is being propagated everywhere when an AudioBatch/DocumentBatch is being returned.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks a ton @sarahyurick for debugging this 🙏
@SwekeR-463 here is an example of how other stages propogate it
Curator/nemo_curator/stages/text/modules/add_id.py
Lines 80 to 81 in 90ce791
| _metadata=batch._metadata, | |
| _stage_perf=batch._stage_perf, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fix: make sure
_stage_perf=task._stage_perfis being propagated everywhere when anAudioBatch/DocumentBatchis being returned.
made changes as per mentioned. if any other changes required please tell. 😄
Signed-off-by: SwekeR-463 <swekerswasti@gmail.com>
Signed-off-by: SwekeR-463 <swekerswasti@gmail.com>
…e hardcoded names Signed-off-by: SwekeR-463 <swekerswasti@gmail.com>
* Fuse document iterate and extract stages Signed-off-by: Sarah Yurick <sarahyurick@gmail.com> * ruff Signed-off-by: Sarah Yurick <sarahyurick@gmail.com> * fix bug Signed-off-by: Sarah Yurick <sarahyurick@gmail.com> * update docs and tutorial Signed-off-by: Sarah Yurick <sarahyurick@gmail.com> * save progress Signed-off-by: Sarah Yurick <sarahyurick@gmail.com> * update more tests Signed-off-by: Sarah Yurick <sarahyurick@gmail.com> * ruff Signed-off-by: Sarah Yurick <sarahyurick@gmail.com> * fix tests Signed-off-by: Sarah Yurick <sarahyurick@gmail.com> * ruff Signed-off-by: Sarah Yurick <sarahyurick@gmail.com> * update benchmark Signed-off-by: Sarah Yurick <sarahyurick@gmail.com> * move class Signed-off-by: Sarah Yurick <sarahyurick@gmail.com> * add missing import Signed-off-by: Sarah Yurick <sarahyurick@gmail.com> * update comment Signed-off-by: Sarah Yurick <sarahyurick@gmail.com> --------- Signed-off-by: Sarah Yurick <sarahyurick@gmail.com> Co-authored-by: Ayush Dattagupta <ayushdg95@gmail.com> Signed-off-by: SwekeR-463 <swekerswasti@gmail.com>
* sdg ray docs init Signed-off-by: Lawrence Lane <llane@nvidia.com> * header, tab fixes Signed-off-by: Lawrence Lane <llane@nvidia.com> * style guide Signed-off-by: Lawrence Lane <llane@nvidia.com> * release notes change, bump version Signed-off-by: Lawrence Lane <llane@nvidia.com> * feedback Signed-off-by: Lawrence Lane <llane@nvidia.com> * readme Signed-off-by: Lawrence Lane <llane@nvidia.com> * updates Signed-off-by: Lawrence Lane <llane@nvidia.com> * updates Signed-off-by: Lawrence Lane <llane@nvidia.com> * updates Signed-off-by: Lawrence Lane <llane@nvidia.com> * updates Signed-off-by: Lawrence Lane <llane@nvidia.com> * Update tutorials/synthetic/README.md Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com> Signed-off-by: Lawrence Lane <llane@nvidia.com> --------- Signed-off-by: Lawrence Lane <llane@nvidia.com> Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com> Signed-off-by: SwekeR-463 <swekerswasti@gmail.com>
Signed-off-by: Dong Hyuk Chang <donghyukc@nvidia.com> Signed-off-by: SwekeR-463 <swekerswasti@gmail.com>
…set stage names Signed-off-by: SwekeR-463 <swekerswasti@gmail.com>
67db1ad to
3a413f5
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
5 files reviewed, 3 comments
nemo_curator/stages/audio/common.py
Outdated
| super().__init__() | ||
|
|
||
| def process(self, task: AudioBatch) -> list[Task]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
_stage_perf not propagated
LegacySpeechStage.process() copies _stage_perf only when the returned item already has that attribute (hasattr(r, "_stage_perf")). Many audio stages return new tasks like AudioBatch(data=...) without passing _stage_perf, so they won’t receive the prior perf chain and the stage name still won’t show up in later aggregation. Consider unconditionally propagating when creating new task objects (e.g., construct AudioBatch(..., _stage_perf=task._stage_perf) in process_dataset_entry callers, or in this loop detect known Task types and set _stage_perf even if currently empty).
| text_key: str = "text" | ||
| pred_text_key: str = "pred_text" | ||
| wer_key: str = "wer" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
New tasks drop perf
process_dataset_entry() returns a freshly-constructed AudioBatch(data=...) without _stage_perf, and LegacySpeechStage.process() only copies perf when hasattr(r, "_stage_perf"). This means the perf chain will be lost for this stage’s outputs. Pass _stage_perf=task._stage_perf when constructing the AudioBatch (and similarly for other stages returning new Task instances).
Additional Comments (1)
|
Signed-off-by: SwekeR-463 <swekerswasti@gmail.com>
3b0170e to
38cf9bc
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
31 files reviewed, 1 comment
| def process(self, task: AudioBatch) -> list[Task]: | ||
| result = [] | ||
| for entry in task.data: | ||
| result.extend(self.process_dataset_entry(entry)) | ||
| entries = self.process_dataset_entry(entry) | ||
| for r in entries: | ||
| if r is not task: | ||
| r._stage_perf = list(task._stage_perf) | ||
| result.extend(entries) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perf chain overwritten
LegacySpeechStage.process() unconditionally overwrites r._stage_perf for every returned task (r._stage_perf = list(task._stage_perf)). If a downstream stage’s process_dataset_entry() returns a task that already has its own perf entries (e.g., it appends to _stage_perf before returning, or returns an object that already accumulated perf), this assignment will drop those entries and lose timing/name data. Consider extending instead of overwriting (e.g., only copy when missing/empty, or merge with r._stage_perf).
Signed-off-by: SwekeR-463 <swekerswasti@gmail.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
5 files reviewed, 1 comment
| entries = self.process_dataset_entry(entry) | ||
| for r in entries: | ||
| if r is not task and not r._stage_perf: | ||
| r._stage_perf = list(task._stage_perf) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unsafe _stage_perf access
LegacySpeechStage.process() assumes every returned entry has a _stage_perf attribute (not r._stage_perf), but several audio stages return freshly-constructed AudioBatch(data=...) without _stage_perf (e.g., GetAudioDurationStage.process_dataset_entry() at common.py:73). In that case this will raise AttributeError at runtime. Consider guarding with hasattr(r, "_stage_perf") (or using getattr(r, "_stage_perf", None)), and then propagating from task._stage_perf when missing/empty.


Description
Fixes #1464
_stage_perfwhen stages return new task instances.StagePerfStatsstage names.Snippet
After re running
python benchmarking/run.py --config benchmarking/nightly-benchmark.yaml --entries audio_fleursgot this output.Checklist