
[benchmark] Add FastText filter benchmarking script (#1411)#1452

Merged
sarahyurick merged 10 commits into NVIDIA-NeMo:main from
KunalSachdev2005:fixes-1411-fasttext-filters-benchmarking-script
Feb 17, 2026

Conversation

@KunalSachdev2005
Contributor

Description

This PR adds a benchmarking script for FastText-based document filters (language ID and quality) to the NeMo Curator benchmarking framework. The implementation follows the same pattern as the existing score_filter_benchmark.py script.

Changes:

  • Added fasttext_filter_benchmark.py script that benchmarks FastText filters using a Hydra-configured pipeline
  • Added fasttext_filter_raydata and fasttext_filter_xenna entries to nightly-benchmark.yaml for both executors
  • Supports FastText language ID and quality filters with proper model setup requirements

The script handles FastText filters that require setup() for model loading, which differentiates them from heuristic filters.
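As a rough illustration of that distinction, a model-backed filter must load its weights in setup() before it can score anything, while a heuristic filter is ready immediately. The class and method names below are simplified stand-ins, not the actual NeMo Curator API:

```python
class WordCountFilter:
    """Heuristic filter: stateless pure Python, no setup step needed."""

    def __init__(self, min_words: int = 50):
        self.min_words = min_words

    def score_document(self, text: str) -> float:
        return float(len(text.split()))

    def keep_document(self, score: float) -> bool:
        return score >= self.min_words


class FastTextLikeFilter:
    """Model-backed filter: setup() must run once per worker before scoring."""

    def __init__(self, model_path: str):
        self.model_path = model_path
        self._model = None  # loaded lazily in setup()

    def setup(self) -> None:
        # Real code would do something like:
        #   self._model = fasttext.load_model(self.model_path)
        self._model = object()  # stand-in for the loaded model

    def score_document(self, text: str) -> float:
        if self._model is None:
            raise RuntimeError("setup() must be called before scoring")
        return 0.9  # stand-in for a model prediction
```

The benchmark script has to account for this by invoking the filters' setup path, which heuristic-filter benchmarks can skip.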

Related Issue: #1411

Questions for Discussion

I have a couple of questions posted in the issue comments (#1411) regarding:

  1. Metric requirements: Should we add requirements sections for the FastText benchmarks now, or add them in a follow-up PR after establishing baseline metrics?
  2. Model paths: Confirmation that the model paths ({datasets_path}/models/fasttext/lid.176.bin and {datasets_path}/models/fasttext/quality.bin) are acceptable.

Usage

The benchmark can be run via the benchmarking framework:

./benchmarking/tools/run.sh --config ./benchmarking/nightly-benchmark.yaml

Or directly:

python benchmarking/scripts/fasttext_filter_benchmark.py \
  --benchmark-results-path /path/to/results \
  --input-path /path/to/input \
  --yaml-config nemo_curator/config/text/fasttext_filter_pipeline.yaml \
  --executor ray_data \
  --overrides "fasttext_langid_model_path=/path/to/lid.176.bin, fasttext_quality_model_path=/path/to/quality.bin"

Checklist

  • I am familiar with the Contributing Guide.
  • New or Existing tests cover these changes.
    • Benchmark scripts serve as integration tests run by the benchmarking framework.
  • The documentation is up to date with these changes.
    • Script includes docstrings and follows the same pattern as other benchmark scripts.

copy-pr-bot bot commented Feb 3, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

greptile-apps bot commented Feb 3, 2026

Greptile Summary

This PR adds a new FastText filter benchmarking script (fasttext_filter_benchmark.py) that follows the established pattern from score_filter_benchmark.py. It also renames the existing fasttext_model dataset to fasttext_langid_model, adds a new fasttext_quality_model dataset entry, and updates the arxiv E2E benchmark to use the new naming convention.

  • Adds fasttext_filter_benchmark.py with Hydra-based pipeline construction, dedicated CLI args for langid and quality model paths, and standard metrics collection (documents processed, kept, throughput)
  • Adds fasttext_filter_raydata and fasttext_filter_xenna entries to nightly-benchmark.yaml using the tinystories dataset with ParquetReader override
  • Renames --fasttext-model-path to --fasttext-langid-model-path across the arxiv E2E benchmark script and nightly config for clarity
  • Adds .ftz format entry for the langid model dataset
  • New benchmark entries intentionally omit requirements sections pending baseline metric establishment (as discussed in the PR)

Confidence Score: 4/5

  • This PR is safe to merge; it follows established patterns closely and the new script is a well-structured addition to the benchmarking framework
  • The new script closely mirrors the existing score_filter_benchmark.py, adding appropriate FastText-specific CLI args. The YAML changes are clean renames with new dataset entries. The arxiv benchmark changes are a consistent rename. The only notable gap is the absence of requirements in the new nightly entries, but this is explicitly called out in the PR as intentional pending baseline establishment.
  • No files require special attention; all changes follow established patterns

Important Files Changed

Filename Overview
benchmarking/scripts/fasttext_filter_benchmark.py New benchmarking script for FastText filters closely following the existing score_filter_benchmark.py pattern. Adds dedicated CLI args for model paths, Hydra-based pipeline construction, and standard metrics collection. Well structured with no critical issues.
benchmarking/nightly-benchmark.yaml Adds two new benchmark entries (fasttext_filter_raydata, fasttext_filter_xenna), renames fasttext_model dataset to fasttext_langid_model, adds fasttext_quality_model dataset, and updates arxiv benchmark args to use new naming. No requirements section for new entries (intentional per PR discussion).
benchmarking/scripts/arxiv_e2e_pipeline_benchmark.py Straightforward rename of fasttext_model_path to fasttext_langid_model_path across parameter names, docstrings, argument parser, and function calls. Clean, consistent change with no issues.

Flowchart

flowchart TD
    A[CLI Arguments] --> B[main]
    B --> C[run_fasttext_filter_benchmark]
    C --> D[setup_executor]
    C --> E[Build Hydra overrides list]
    E --> F[load_hydra_yaml]
    F --> G[compose DictConfig]
    G --> H[create_pipeline_from_yaml]
    H --> I["Pipeline with stages:\n0: ParquetReader\n1: ScoreFilter(FastTextLangId)\n2: ScoreFilter(FastTextQualityFilter)\n3: JsonlWriter"]
    I --> J[pipeline.run executor]
    J --> K{Success?}
    K -->|Yes| L[Collect metrics from _stage_perf]
    K -->|No| M[Set failure metrics]
    L --> N[write_benchmark_results]
    M --> N
    N --> O[Return exit code]

Last reviewed commit: 2eea7f8
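The success/failure branch at the bottom of the chart can be sketched as follows. The function and metric key names are hypothetical stand-ins that mirror the chart, not the script's actual implementation:

```python
import json
import pathlib
import time


def run_benchmark(run_pipeline, results_path: str) -> int:
    """Run the pipeline, collect metrics on success or record the failure,
    always write a results file, and return a process exit code."""
    start = time.perf_counter()
    try:
        num_docs, num_kept = run_pipeline()
        elapsed = time.perf_counter() - start
        metrics = {
            "success": True,
            "num_documents_processed": num_docs,
            "num_kept_documents": num_kept,
            "throughput_docs_per_sec": num_docs / elapsed if elapsed > 0 else 0.0,
        }
        exit_code = 0
    except Exception as err:  # benchmark failures must not crash the harness
        metrics = {"success": False, "error": str(err)}
        exit_code = 1
    pathlib.Path(results_path).write_text(json.dumps(metrics))
    return exit_code
```

Writing results on both branches means the nightly harness always has a record to surface, even for failed runs.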

@greptile-apps bot left a comment

2 files reviewed, no comments

- Add fasttext_filter_benchmark.py script following the pattern from
  score_filter_benchmark.py
- Add fasttext_filter_raydata and fasttext_filter_xenna entries to
  nightly-benchmark.yaml
- Support FastText language ID and quality filters with model setup
  requirements

Fixes NVIDIA-NeMo#1411

Signed-off-by: Kunal Sachdev <kunalmgsachdev@gmail.com>
@KunalSachdev2005 force-pushed the fixes-1411-fasttext-filters-benchmarking-script branch from 2b52542 to c2ba0da on February 4, 2026 at 04:05
@greptile-apps bot left a comment

3 files reviewed, no comments

…onfig (NVIDIA-NeMo#1411)

- Add separate dataset entries for FastText langid and quality models
- Pass FastText model paths as explicit CLI arguments to benchmarks
- Remove hardcoded model paths from Hydra overrides
- Update FastText filter benchmarks to use model_weights_path
- Align arxiv E2E benchmark arg naming with FastText langid usage

Signed-off-by: Kunal Sachdev <kunalmgsachdev@gmail.com>
@KunalSachdev2005 force-pushed the fixes-1411-fasttext-filters-benchmarking-script branch from c2ba0da to de0cec9 on February 4, 2026 at 16:11
@greptile-apps bot left a comment

3 files reviewed, 4 comments

Comment on lines 642 to 676
- name: fasttext_filter_raydata
  enabled: true
  script: fasttext_filter_benchmark.py
  args: >-
    --benchmark-results-path={session_entry_dir}
    --output-path={session_entry_dir}/scratch/output
    --executor=ray_data
    --input-path={dataset:tinystories,parquet}
    --yaml-config={curator_repo_dir}/nemo_curator/config/text/fasttext_filter_pipeline.yaml
    --fasttext-langid-model-path={dataset:fasttext_langid_model,bin}
    --fasttext-quality-model-path={dataset:fasttext_quality_model,bin}
    --overrides="stages.0._target_=nemo_curator.stages.text.io.reader.ParquetReader"
  timeout_s: 400
  sink_data:
    - name: slack
      additional_metrics:
        - num_kept_documents
        - throughput_docs_per_sec
  ray:
    num_cpus: 64
    num_gpus: 0
    enable_object_spilling: false

- name: fasttext_filter_xenna
  enabled: true
  script: fasttext_filter_benchmark.py
  args: >-
    --benchmark-results-path={session_entry_dir}
    --output-path={session_entry_dir}/scratch/output
    --executor=xenna
    --input-path={dataset:tinystories,parquet}
    --yaml-config={curator_repo_dir}/nemo_curator/config/text/fasttext_filter_pipeline.yaml
    --fasttext-langid-model-path={dataset:fasttext_langid_model,bin}
    --fasttext-quality-model-path={dataset:fasttext_quality_model,bin}
    --overrides="stages.0._target_=nemo_curator.stages.text.io.reader.ParquetReader"

[P1] New FastText benchmark entries lack requirements, so regressions won’t be caught by nightly.

Most existing entries define a requirements: section to enforce throughput and/or data-integrity expectations. fasttext_filter_raydata and fasttext_filter_xenna currently only report metrics to Slack, so they’ll run but won’t fail the nightly job on major performance or correctness changes. If baseline metrics are known (or can be captured), adding minimal requirements (e.g., exact num_documents_processed and a conservative min throughput) would make these benchmarks actionable.
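If and when baselines are captured, a minimal requirements block in the style suggested above might look like the following sketch. The key names follow the metrics mentioned in the comment, but the exact schema and thresholds are assumptions until checked against the existing entries and a baseline run:

```yaml
# Hypothetical sketch; fill thresholds from a measured baseline run.
requirements:
  num_documents_processed: <exact count from baseline>
  min_throughput_docs_per_sec: <conservative fraction of baseline throughput>
```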

@sarahyurick self-requested a review on February 12, 2026 at 20:07
@greptile-apps bot left a comment

3 files reviewed, no comments

@sarahyurick left a comment
Added a few minor requests. Testing right now with:

results_path: /path/to/fasttext-results
datasets_path: /path/to/datasets
model_weights_path: /path/to/model_weights

datasets:
  - name: "tinystories"
    formats:
    - type: "parquet"
      path: "{datasets_path}/tinystories_train_parquet"
  - name: "fasttext_langid_model"
    formats:
    - type: "bin"
      path: "{model_weights_path}/fasttext/lid.176.bin"
    - type: "ftz"
      path: "{model_weights_path}/fasttext/lid.176.ftz"
  - name: "fasttext_quality_model"
    formats:
    - type: "bin"
      path: "{model_weights_path}/fasttext/model.bin"

default_timeout_s: 7200

delete_scratch: true

entries:
  - name: fasttext_filter_raydata
    enabled: true
    script: fasttext_filter_benchmark.py
    args: >-
      --benchmark-results-path={session_entry_dir}
      --output-path={session_entry_dir}/scratch/output
      --executor=ray_data
      --input-path={dataset:tinystories,parquet}
      --yaml-config={curator_repo_dir}/nemo_curator/config/text/fasttext_filter_pipeline.yaml
      --fasttext-langid-model-path={dataset:fasttext_langid_model,bin}
      --fasttext-quality-model-path={dataset:fasttext_quality_model,bin}
      --overrides="stages.0._target_=nemo_curator.stages.text.io.reader.ParquetReader"
    timeout_s: 400
    sink_data:
      - name: slack
        additional_metrics:
          - num_kept_documents
          - throughput_docs_per_sec
    ray:
      num_cpus: 64
      num_gpus: 0
      enable_object_spilling: false

  - name: fasttext_filter_xenna
    enabled: true
    script: fasttext_filter_benchmark.py
    args: >-
      --benchmark-results-path={session_entry_dir}
      --output-path={session_entry_dir}/scratch/output
      --executor=xenna
      --input-path={dataset:tinystories,parquet}
      --yaml-config={curator_repo_dir}/nemo_curator/config/text/fasttext_filter_pipeline.yaml
      --fasttext-langid-model-path={dataset:fasttext_langid_model,bin}
      --fasttext-quality-model-path={dataset:fasttext_quality_model,bin}
      --overrides="stages.0._target_=nemo_curator.stages.text.io.reader.ParquetReader"
    timeout_s: 600
    sink_data:
      - name: slack
        additional_metrics:
          - num_kept_documents
          - throughput_docs_per_sec
    ray:
      num_cpus: 64
      num_gpus: 0
      enable_object_spilling: false

@greptile-apps bot left a comment

3 files reviewed, no comments

…htly-benchmark.yaml basis Sarah Yurick's test run (NVIDIA-NeMo#1411)

Signed-off-by: Kunal Sachdev <kunalmgsachdev@gmail.com>
@KunalSachdev2005 force-pushed the fixes-1411-fasttext-filters-benchmarking-script branch from d2b1f3a to 5a0a25a on February 13, 2026 at 21:52
…ly-benchmark.yaml basis Sarah Yurick's test run (NVIDIA-NeMo#1411)

Co-authored-by: Sarah Yurick <53962159+sarahyurick@users.noreply.github.com>
Signed-off-by: Kunal Sachdev <kunalmgsachdev@gmail.com>
@greptile-apps bot left a comment

3 files reviewed, no comments

…el.bin in benchmarking/nightly-benchmark.yaml (NVIDIA-NeMo#1411)

Co-authored-by: Sarah Yurick <53962159+sarahyurick@users.noreply.github.com>
Signed-off-by: Kunal Sachdev <kunalmgsachdev@gmail.com>
@greptile-apps bot left a comment

3 files reviewed, no comments

…nchmarking/nightly-benchmark.yaml (NVIDIA-NeMo#1411)

Co-authored-by: Sarah Yurick <53962159+sarahyurick@users.noreply.github.com>
Signed-off-by: Kunal Sachdev <kunalmgsachdev@gmail.com>
@greptile-apps bot left a comment

3 files reviewed, no comments

… after ScoreFilter benchmarks in benchmarking/nightly-benchmark.yaml (NVIDIA-NeMo#1411)

Signed-off-by: Kunal Sachdev <kunalmgsachdev@gmail.com>
@greptile-apps bot left a comment

3 files reviewed, no comments

Signed-off-by: Kunal Sachdev <kunalmgsachdev@gmail.com>
@greptile-apps bot left a comment

3 files reviewed, no comments

@sarahyurick left a comment

Thank you!

@greptile-apps bot left a comment

3 files reviewed, no comments

@sarahyurick merged commit 235fa2e into NVIDIA-NeMo:main on Feb 17, 2026
49 checks passed