(fix): refactor audio stage names to be shown after running benchmark #1470

Open
SwekeR-463 wants to merge 2 commits into NVIDIA-NeMo:main from SwekeR-463:fix/stage-name-propagate

Conversation

SwekeR-463 commented Feb 7, 2026

Description

Fixes #1464

  • Preserve _stage_perf when stages return new task instances.
  • Define explicit name fields for audio stages to populate StagePerfStats stage names.
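One way the second bullet plays out can be sketched as follows. This is a minimal illustration only: BaseStage, PerfRecord, and run_timed are hypothetical stand-ins and are not taken from the NeMo Curator source. The point is that a stage without its own name reports under the generic base-class default, while an explicit name field gives each perf record a distinguishable label.

```python
from dataclasses import dataclass
import time


@dataclass
class PerfRecord:
    """Hypothetical stand-in for a per-stage perf stats entry."""
    stage_name: str
    process_time: float


@dataclass
class BaseStage:
    """Hypothetical base class; not the real ProcessingStage."""
    name: str = "ProcessingStage"  # generic default shared by every stage

    def process(self, value: float) -> float:
        return value

    def run_timed(self, value: float) -> PerfRecord:
        # Time one call and label the record with this stage's name.
        start = time.perf_counter()
        self.process(value)
        return PerfRecord(self.name, time.perf_counter() - start)


@dataclass
class UnnamedDurationStage(BaseStage):
    """No explicit name: inherits the generic default."""


@dataclass
class NamedDurationStage(BaseStage):
    """Explicit name field, in the spirit of what the PR adds to audio stages."""
    name: str = "GetAudioDurationStage"


unnamed = UnnamedDurationStage().run_timed(1.0)
named = NamedDurationStage().run_timed(1.0)
print(unnamed.stage_name)  # ProcessingStage
print(named.stage_name)    # GetAudioDurationStage
```

With the explicit field, two different stages no longer collapse into the same row of a benchmark report keyed by stage name.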

Snippet

After re-running python benchmarking/run.py --config benchmarking/nightly-benchmark.yaml --entries audio_fleurs, I got this output:

image

Checklist

  • I am familiar with the Contributing Guide.
  • New or Existing tests cover these changes.
  • The documentation is up to date with these changes.

copy-pr-bot bot commented Feb 7, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

greptile-apps bot commented Feb 7, 2026

Greptile Overview

Greptile Summary

  • Adds explicit name attributes to several audio stages so StagePerfStats / benchmark output shows stable, human-readable stage names.
  • Updates ProcessingStage.process_batch() to handle None results (filtering) and to preserve _stage_perf when stages return new task instances.
  • Change is localized to stage metadata and the default batch-processing fallback path; no new APIs introduced.

Confidence Score: 4/5

  • This PR is likely safe to merge and primarily improves benchmarking/stats reporting, with a small behavior change in the default batch-processing path.
  • Changes are small and well-scoped. The new None handling aligns process_batch() with the documented contract, and _stage_perf propagation is guarded to only fill missing perf stats. Remaining risk is around assuming _stage_perf is always list-like on both input and outputs, and that all list results contain task objects.
  • nemo_curator/stages/base.py (process_batch semantics and _stage_perf propagation)
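The remaining risk noted above (a _stage_perf that is not list-like, or list results that contain non-task items) could be narrowed with a defensive copy. The sketch below assumes a hypothetical helper, inherit_stage_perf, and a toy DemoTask class; neither is part of the PR or of NeMo Curator.

```python
from typing import Any


def inherit_stage_perf(source: Any, target: Any) -> None:
    """Copy perf history from source to target only when it is safe.

    Hypothetical helper for illustration; not part of the PR.
    """
    if target is None or target is source:
        return
    src_perf = getattr(source, "_stage_perf", None)
    dst_perf = getattr(target, "_stage_perf", None)
    # Propagate only from a real list/tuple into an existing, empty list.
    if not isinstance(src_perf, (list, tuple)):
        return
    if isinstance(dst_perf, list) and not dst_perf:
        target._stage_perf = list(src_perf)  # shallow copy, not an alias


class DemoTask:
    """Toy task used only to exercise the helper."""

    def __init__(self, perf):
        self._stage_perf = perf


parent = DemoTask(["reader", "resampler"])
child = DemoTask([])
inherit_stage_perf(parent, child)
print(child._stage_perf)  # ['reader', 'resampler']
```

Objects without a _stage_perf attribute, and sources whose _stage_perf is not a list or tuple, are left untouched rather than raising.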

Important Files Changed

Filename | Overview
--- | ---
nemo_curator/stages/base.py | Updates default ProcessingStage.process_batch() to (a) skip None results and (b) copy _stage_perf from input task onto newly-created output tasks when their _stage_perf is empty.
nemo_curator/stages/audio/common.py | Adds explicit name fields to GetAudioDurationStage and PreserveByValueStage so benchmark stage perf can show stable stage names.
nemo_curator/stages/audio/io/convert.py | Adds explicit name = "AudioToDocumentStage" attribute on the stage class for consistent reporting.
nemo_curator/stages/audio/metrics/get_wer.py | Adds explicit name field to GetPairwiseWerStage dataclass for consistent stage perf naming.

Sequence Diagram

sequenceDiagram
  autonumber
  participant Exec as Executor/Backend
  participant Stage as ProcessingStage
  participant TaskIn as Input Task
  participant TaskOut as Output Task(s)

  Exec->>Stage: process_batch(tasks)
  loop for each task in tasks
    Stage->>Stage: validate_input(task)
    Stage->>Stage: result = process(task)
    alt result is None
      Stage-->>Exec: skip (filtered out)
    else result is list
      loop for each r in result
        alt r has empty _stage_perf and r != task
          Stage->>TaskOut: r._stage_perf = copy(task._stage_perf)
        end
      end
      Stage-->>Exec: results.extend(result)
    else result is single task
      alt result has empty _stage_perf and result != task
        Stage->>TaskOut: result._stage_perf = copy(task._stage_perf)
      end
      Stage-->>Exec: results.append(result)
    end
  end
  Exec-->>Exec: downstream stages consume results

greptile-apps bot left a comment

4 files reviewed, 1 comment

Comment on lines 182 to 193
    result = self.process(task)
    if isinstance(result, list):
        for r in result:
            if r is not task and hasattr(r, "_stage_perf") and not r._stage_perf:
                r._stage_perf = list(task._stage_perf)
        results.extend(result)
    else:
        if result is not task and hasattr(result, "_stage_perf") and not result._stage_perf:
            result._stage_perf = list(task._stage_perf)
        results.append(result)

None result can crash

ProcessingStage.process() is documented to allow None for filtering, but process_batch() treats any non-list result as a task-like object and then does hasattr(result, "_stage_perf") / results.append(result). With this change, if process() returns None, the else branch does not raise (result is not task is True and hasattr(None, "_stage_perf") is False, so the copy is skipped), but it still appends None into results. Any stage that filters tasks by returning None therefore leaks None entries out of the default batch path, which downstream code expecting task objects may not handle.

Fix by explicitly handling result is None before the list/non-list logic (skip or continue).
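The suggested fix can be sketched as below. This is a self-contained toy, not the code in nemo_curator/stages/base.py: Task and KeepEvenAndSplit are hypothetical classes, and the loop mirrors only the logic quoted in the comment, with a None guard placed before the list/non-list handling.

```python
from dataclasses import dataclass, field


@dataclass
class Task:
    """Toy task; only the fields needed for the sketch."""
    data: int
    _stage_perf: list = field(default_factory=list)


class KeepEvenAndSplit:
    """Toy stage: drops odd inputs (None), wraps evens in new Task objects."""

    def process(self, task: Task):
        if task.data % 2:
            return None  # filtered out
        return [Task(task.data), Task(task.data * 10)]

    def process_batch(self, tasks: list) -> list:
        results = []
        for task in tasks:
            result = self.process(task)
            if result is None:
                continue  # skip filtered tasks before touching any attribute
            outputs = result if isinstance(result, list) else [result]
            for r in outputs:
                # Fill perf history only on new task objects that lack it.
                if r is not task and hasattr(r, "_stage_perf") and not r._stage_perf:
                    r._stage_perf = list(task._stage_perf)
            results.extend(outputs)
        return results


batch = [Task(1, ["reader"]), Task(2, ["reader"])]
out = KeepEvenAndSplit().process_batch(batch)
print([(t.data, t._stage_perf) for t in out])
# [(2, ['reader']), (20, ['reader'])]
```

Normalizing a single result into a one-element list keeps the perf-propagation check in one place instead of duplicating it across both branches.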

Signed-off-by: SwekeR-463 <swekerswasti@gmail.com>
Signed-off-by: SwekeR-463 <swekerswasti@gmail.com>
SwekeR-463 force-pushed the fix/stage-name-propagate branch from 2a05acd to f6a5132 on February 7, 2026 05:51
greptile-apps bot left a comment

4 files reviewed, no comments

Development

Successfully merging this pull request may close these issues.

Audio stage names aren't being propogated as noticed in benchmark script