
[benchmark] Add FastText filter benchmarking script (#1411)#1452

Merged
sarahyurick merged 10 commits into NVIDIA-NeMo:main from
KunalSachdev2005:fixes-1411-fasttext-filters-benchmarking-script
Feb 17, 2026

Conversation

@KunalSachdev2005
Contributor

Description

This PR adds a benchmarking script for FastText-based document filters (language ID and quality) to the NeMo Curator benchmarking framework. The implementation follows the same pattern as the existing score_filter_benchmark.py script.

Changes:

  • Added fasttext_filter_benchmark.py script that benchmarks FastText filters using a Hydra-configured pipeline
  • Added fasttext_filter_raydata and fasttext_filter_xenna entries to nightly-benchmark.yaml for both executors
  • Supports FastText language ID and quality filters with proper model setup requirements

The script handles FastText filters that require setup() for model loading, which differentiates them from heuristic filters.
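As a rough illustration of that distinction, a model-backed filter must load its weights in setup() before it can score anything, while a heuristic filter is ready immediately. The class and method names below are simplified stand-ins, not the actual NeMo Curator API:

```python
class WordCountFilter:
    """Heuristic filter: stateless pure Python, no setup step needed."""

    def __init__(self, min_words: int = 50):
        self.min_words = min_words

    def score_document(self, text: str) -> float:
        return float(len(text.split()))

    def keep_document(self, score: float) -> bool:
        return score >= self.min_words


class FastTextLikeFilter:
    """Model-backed filter: setup() must run once per worker before scoring."""

    def __init__(self, model_path: str):
        self.model_path = model_path
        self._model = None  # loaded lazily in setup()

    def setup(self) -> None:
        # Real code would do something like:
        #   self._model = fasttext.load_model(self.model_path)
        self._model = object()  # stand-in for the loaded model

    def score_document(self, text: str) -> float:
        if self._model is None:
            raise RuntimeError("setup() must be called before scoring")
        return 0.9  # stand-in for a model prediction
```

The benchmark script has to account for this by invoking the filters' setup path, which heuristic-filter benchmarks can skip.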

Related Issue: #1411

Questions for Discussion

I have a couple of questions posted in the issue comments (#1411) regarding:

  1. Metric requirements: Should we add requirements sections for the FastText benchmarks now, or add them in a follow-up PR after establishing baseline metrics?
  2. Model paths: Confirmation that the model paths ({datasets_path}/models/fasttext/lid.176.bin and {datasets_path}/models/fasttext/quality.bin) are acceptable.

Usage

The benchmark can be run via the benchmarking framework:

./benchmarking/tools/run.sh --config ./benchmarking/nightly-benchmark.yaml

Or directly:

python benchmarking/scripts/fasttext_filter_benchmark.py \
  --benchmark-results-path /path/to/results \
  --input-path /path/to/input \
  --yaml-config nemo_curator/config/text/fasttext_filter_pipeline.yaml \
  --executor ray_data \
  --overrides "fasttext_langid_model_path=/path/to/lid.176.bin, fasttext_quality_model_path=/path/to/quality.bin"

Checklist

  • I am familiar with the Contributing Guide.
  • New or Existing tests cover these changes.
    • Benchmark scripts serve as integration tests run by the benchmarking framework.
  • The documentation is up to date with these changes.
    • Script includes docstrings and follows the same pattern as other benchmark scripts.

copy-pr-bot bot commented Feb 3, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

greptile-apps bot commented Feb 3, 2026

Greptile Summary

This PR adds a new FastText filter benchmarking script (fasttext_filter_benchmark.py) that follows the established pattern from score_filter_benchmark.py. It also renames the existing fasttext_model dataset to fasttext_langid_model, adds a new fasttext_quality_model dataset entry, and updates the arxiv E2E benchmark to use the new naming convention.

  • Adds fasttext_filter_benchmark.py with Hydra-based pipeline construction, dedicated CLI args for langid and quality model paths, and standard metrics collection (documents processed, kept, throughput)
  • Adds fasttext_filter_raydata and fasttext_filter_xenna entries to nightly-benchmark.yaml using the tinystories dataset with ParquetReader override
  • Renames --fasttext-model-path to --fasttext-langid-model-path across the arxiv E2E benchmark script and nightly config for clarity
  • Adds .ftz format entry for the langid model dataset
  • New benchmark entries intentionally omit requirements sections pending baseline metric establishment (as discussed in the PR)

Confidence Score: 4/5

  • This PR is safe to merge; it follows established patterns closely and the new script is a well-structured addition to the benchmarking framework
  • The new script closely mirrors the existing score_filter_benchmark.py, adding appropriate FastText-specific CLI args. The YAML changes are clean renames with new dataset entries. The arxiv benchmark changes are a consistent rename. The only notable gap is the absence of requirements in the new nightly entries, but this is explicitly called out in the PR as intentional pending baseline establishment.
  • No files require special attention; all changes follow established patterns

Important Files Changed

Filename Overview
benchmarking/scripts/fasttext_filter_benchmark.py New benchmarking script for FastText filters closely following the existing score_filter_benchmark.py pattern. Adds dedicated CLI args for model paths, Hydra-based pipeline construction, and standard metrics collection. Well structured with no critical issues.
benchmarking/nightly-benchmark.yaml Adds two new benchmark entries (fasttext_filter_raydata, fasttext_filter_xenna), renames fasttext_model dataset to fasttext_langid_model, adds fasttext_quality_model dataset, and updates arxiv benchmark args to use new naming. No requirements section for new entries (intentional per PR discussion).
benchmarking/scripts/arxiv_e2e_pipeline_benchmark.py Straightforward rename of fasttext_model_path to fasttext_langid_model_path across parameter names, docstrings, argument parser, and function calls. Clean, consistent change with no issues.

Flowchart

flowchart TD
    A[CLI Arguments] --> B[main]
    B --> C[run_fasttext_filter_benchmark]
    C --> D[setup_executor]
    C --> E[Build Hydra overrides list]
    E --> F[load_hydra_yaml]
    F --> G[compose DictConfig]
    G --> H[create_pipeline_from_yaml]
    H --> I["Pipeline with stages:\n0: ParquetReader\n1: ScoreFilter(FastTextLangId)\n2: ScoreFilter(FastTextQualityFilter)\n3: JsonlWriter"]
    I --> J[pipeline.run executor]
    J --> K{Success?}
    K -->|Yes| L[Collect metrics from _stage_perf]
    K -->|No| M[Set failure metrics]
    L --> N[write_benchmark_results]
    M --> N
    N --> O[Return exit code]

Last reviewed commit: 2eea7f8
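The success/failure branch at the bottom of the chart can be sketched as follows. The function and metric key names are hypothetical stand-ins that mirror the chart, not the script's actual implementation:

```python
import json
import pathlib
import time


def run_benchmark(run_pipeline, results_path: str) -> int:
    """Run the pipeline, collect metrics on success or record the failure,
    always write a results file, and return a process exit code."""
    start = time.perf_counter()
    try:
        num_docs, num_kept = run_pipeline()
        elapsed = time.perf_counter() - start
        metrics = {
            "success": True,
            "num_documents_processed": num_docs,
            "num_kept_documents": num_kept,
            "throughput_docs_per_sec": num_docs / elapsed if elapsed > 0 else 0.0,
        }
        exit_code = 0
    except Exception as err:  # benchmark failures must not crash the harness
        metrics = {"success": False, "error": str(err)}
        exit_code = 1
    pathlib.Path(results_path).write_text(json.dumps(metrics))
    return exit_code
```

Writing results on both branches means the nightly harness always has a record to surface, even for failed runs.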

@greptile-apps bot left a comment

2 files reviewed, no comments

- Add fasttext_filter_benchmark.py script following the pattern from
  score_filter_benchmark.py
- Add fasttext_filter_raydata and fasttext_filter_xenna entries to
  nightly-benchmark.yaml
- Support FastText language ID and quality filters with model setup
  requirements

Fixes NVIDIA-NeMo#1411

Signed-off-by: Kunal Sachdev <kunalmgsachdev@gmail.com>
@KunalSachdev2005 force-pushed the fixes-1411-fasttext-filters-benchmarking-script branch from 2b52542 to c2ba0da on February 4, 2026 at 04:05
@greptile-apps bot left a comment

3 files reviewed, no comments

…onfig (NVIDIA-NeMo#1411)

- Add separate dataset entries for FastText langid and quality models
- Pass FastText model paths as explicit CLI arguments to benchmarks
- Remove hardcoded model paths from Hydra overrides
- Update FastText filter benchmarks to use model_weights_path
- Align arxiv E2E benchmark arg naming with FastText langid usage

Signed-off-by: Kunal Sachdev <kunalmgsachdev@gmail.com>
@KunalSachdev2005 force-pushed the fixes-1411-fasttext-filters-benchmarking-script branch from c2ba0da to de0cec9 on February 4, 2026 at 16:11
@greptile-apps bot left a comment

3 files reviewed, 4 comments

Comment on lines 642 to 676
- name: fasttext_filter_raydata
  enabled: true
  script: fasttext_filter_benchmark.py
  args: >-
    --benchmark-results-path={session_entry_dir}
    --output-path={session_entry_dir}/scratch/output
    --executor=ray_data
    --input-path={dataset:tinystories,parquet}
    --yaml-config={curator_repo_dir}/nemo_curator/config/text/fasttext_filter_pipeline.yaml
    --fasttext-langid-model-path={dataset:fasttext_langid_model,bin}
    --fasttext-quality-model-path={dataset:fasttext_quality_model,bin}
    --overrides="stages.0._target_=nemo_curator.stages.text.io.reader.ParquetReader"
  timeout_s: 400
  sink_data:
    - name: slack
      additional_metrics:
        - num_kept_documents
        - throughput_docs_per_sec
  ray:
    num_cpus: 64
    num_gpus: 0
    enable_object_spilling: false

- name: fasttext_filter_xenna
  enabled: true
  script: fasttext_filter_benchmark.py
  args: >-
    --benchmark-results-path={session_entry_dir}
    --output-path={session_entry_dir}/scratch/output
    --executor=xenna
    --input-path={dataset:tinystories,parquet}
    --yaml-config={curator_repo_dir}/nemo_curator/config/text/fasttext_filter_pipeline.yaml
    --fasttext-langid-model-path={dataset:fasttext_langid_model,bin}
    --fasttext-quality-model-path={dataset:fasttext_quality_model,bin}
    --overrides="stages.0._target_=nemo_curator.stages.text.io.reader.ParquetReader"

[P1] New FastText benchmark entries lack requirements, so regressions won’t be caught by nightly.

Most existing entries define a requirements: section to enforce throughput and/or data-integrity expectations. fasttext_filter_raydata and fasttext_filter_xenna currently only report metrics to Slack, so they’ll run but won’t fail the nightly job on major performance or correctness changes. If baseline metrics are known (or can be captured), adding minimal requirements (e.g., exact num_documents_processed and a conservative min throughput) would make these benchmarks actionable.
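If and when baselines are captured, a minimal requirements block in the style suggested above might look like the following sketch. The key names follow the metrics mentioned in the comment, but the exact schema and thresholds are assumptions until checked against the existing entries and a baseline run:

```yaml
# Hypothetical sketch; fill thresholds from a measured baseline run.
requirements:
  num_documents_processed: <exact count from baseline>
  min_throughput_docs_per_sec: <conservative fraction of baseline throughput>
```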

@sarahyurick self-requested a review on February 12, 2026 at 20:07
@greptile-apps bot left a comment

3 files reviewed, no comments

@sarahyurick left a comment
Added a few minor requests. Testing right now with:

results_path: /path/to/fasttext-results
datasets_path: /path/to/datasets
model_weights_path: /path/to/model_weights

datasets:
  - name: "tinystories"
    formats:
    - type: "parquet"
      path: "{datasets_path}/tinystories_train_parquet"
  - name: "fasttext_langid_model"
    formats:
    - type: "bin"
      path: "{model_weights_path}/fasttext/lid.176.bin"
    - type: "ftz"
      path: "{model_weights_path}/fasttext/lid.176.ftz"
  - name: "fasttext_quality_model"
    formats:
    - type: "bin"
      path: "{model_weights_path}/fasttext/model.bin"

default_timeout_s: 7200

delete_scratch: true

entries:
  - name: fasttext_filter_raydata
    enabled: true
    script: fasttext_filter_benchmark.py
    args: >-
      --benchmark-results-path={session_entry_dir}
      --output-path={session_entry_dir}/scratch/output
      --executor=ray_data
      --input-path={dataset:tinystories,parquet}
      --yaml-config={curator_repo_dir}/nemo_curator/config/text/fasttext_filter_pipeline.yaml
      --fasttext-langid-model-path={dataset:fasttext_langid_model,bin}
      --fasttext-quality-model-path={dataset:fasttext_quality_model,bin}
      --overrides="stages.0._target_=nemo_curator.stages.text.io.reader.ParquetReader"
    timeout_s: 400
    sink_data:
      - name: slack
        additional_metrics:
          - num_kept_documents
          - throughput_docs_per_sec
    ray:
      num_cpus: 64
      num_gpus: 0
      enable_object_spilling: false

  - name: fasttext_filter_xenna
    enabled: true
    script: fasttext_filter_benchmark.py
    args: >-
      --benchmark-results-path={session_entry_dir}
      --output-path={session_entry_dir}/scratch/output
      --executor=xenna
      --input-path={dataset:tinystories,parquet}
      --yaml-config={curator_repo_dir}/nemo_curator/config/text/fasttext_filter_pipeline.yaml
      --fasttext-langid-model-path={dataset:fasttext_langid_model,bin}
      --fasttext-quality-model-path={dataset:fasttext_quality_model,bin}
      --overrides="stages.0._target_=nemo_curator.stages.text.io.reader.ParquetReader"
    timeout_s: 600
    sink_data:
      - name: slack
        additional_metrics:
          - num_kept_documents
          - throughput_docs_per_sec
    ray:
      num_cpus: 64
      num_gpus: 0
      enable_object_spilling: false

@greptile-apps bot left a comment

3 files reviewed, no comments

…htly-benchmark.yaml basis Sarah Yurick's test run (NVIDIA-NeMo#1411)

Signed-off-by: Kunal Sachdev <kunalmgsachdev@gmail.com>
@KunalSachdev2005 force-pushed the fixes-1411-fasttext-filters-benchmarking-script branch from d2b1f3a to 5a0a25a on February 13, 2026 at 21:52
…ly-benchmark.yaml basis Sarah Yurick's test run (NVIDIA-NeMo#1411)

Co-authored-by: Sarah Yurick <53962159+sarahyurick@users.noreply.github.com>
Signed-off-by: Kunal Sachdev <kunalmgsachdev@gmail.com>
@greptile-apps bot left a comment

3 files reviewed, no comments

…el.bin in benchmarking/nightly-benchmark.yaml (NVIDIA-NeMo#1411)

Co-authored-by: Sarah Yurick <53962159+sarahyurick@users.noreply.github.com>
Signed-off-by: Kunal Sachdev <kunalmgsachdev@gmail.com>
@greptile-apps bot left a comment

3 files reviewed, no comments

…nchmarking/nightly-benchmark.yaml (NVIDIA-NeMo#1411)

Co-authored-by: Sarah Yurick <53962159+sarahyurick@users.noreply.github.com>
Signed-off-by: Kunal Sachdev <kunalmgsachdev@gmail.com>
@greptile-apps bot left a comment

3 files reviewed, no comments

… after ScoreFilter benchmarks in benchmarking/nightly-benchmark.yaml (NVIDIA-NeMo#1411)

Signed-off-by: Kunal Sachdev <kunalmgsachdev@gmail.com>
@greptile-apps bot left a comment

3 files reviewed, no comments

Signed-off-by: Kunal Sachdev <kunalmgsachdev@gmail.com>
@greptile-apps bot left a comment

3 files reviewed, no comments

@sarahyurick left a comment

Thank you!

@greptile-apps bot left a comment

3 files reviewed, no comments

@sarahyurick merged commit 235fa2e into NVIDIA-NeMo:main on Feb 17, 2026
49 checks passed