hotchpotch (Contributor) commented Feb 11, 2026

Hello!

Pull request overview

  • This PR focuses on performance improvements in the sampler, aiming to improve runtime and memory behavior on very large datasets.
  • In my measurements, this change gives a small speedup (about 2%) for NO_DUPLICATES and roughly a 22% speedup for NO_DUPLICATES_HASHED.
  • It targets the hot loop in NoDuplicatesBatchSampler.
  • It aims to keep sampling behavior compatible with the historical implementation.
  • It adds tests to check compatibility and dtype behavior.

Details

The old implementation kept remaining indices in a Python dict and deleted accepted indices batch by batch.
That works, but it carries a lot of Python object overhead on large datasets.
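To make the comparison concrete, here is a rough schematic of that dict-based pattern. It is simplified and not the library's actual code; `sample_values(idx)` is a hypothetical accessor that returns the set of values to deduplicate for row `idx`:

```python
def dict_based_batches(shuffled_indices, sample_values, batch_size):
    # Rough schematic only (not the actual library code).
    # `sample_values(idx)` is a hypothetical accessor returning the set of
    # values in row `idx` that must not repeat within a batch.
    remaining = dict.fromkeys(shuffled_indices)   # insertion-ordered "set" of candidate indices
    while remaining:
        batch, seen = [], set()
        for idx in remaining:
            values = sample_values(idx)
            if seen & values:                     # conflict: defer to a later batch
                continue
            seen |= values
            batch.append(idx)
            if len(batch) == batch_size:
                break
        for idx in batch:
            del remaining[idx]                    # per-item dict deletions in the hot loop
        yield batch
```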

This PR switches that part to:

  • randperm(..., dtype=int32|int64) as a NumPy-backed index array
  • a singly linked list over positions (next_positions) for O(1) removals
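A minimal sketch of this combination follows. It is illustrative only, not the PR's exact code; `values[idx]` stands in for whatever is compared for duplicates (texts or precomputed hashes), and `no_duplicates_batches` is a hypothetical name:

```python
import numpy as np
import torch


def no_duplicates_batches(values, batch_size, seed=12, dtype=np.int64):
    # Illustrative sketch (not the PR's exact code): build batches in which no
    # value repeats, using a shuffled NumPy index array plus a singly linked
    # list over positions so that removing an accepted index is O(1).
    n = len(values)
    generator = torch.Generator().manual_seed(seed)
    order = torch.randperm(n, generator=generator).numpy().astype(dtype)

    # next_pos[p] = next still-available position after p; n is the "end" sentinel.
    next_pos = np.arange(1, n + 1, dtype=dtype)
    head, remaining, batches = 0, n, []

    while remaining:
        batch, seen = [], set()
        prev, cur = -1, head
        while cur != n and len(batch) < batch_size:
            idx = int(order[cur])
            if values[idx] in seen:              # conflict: defer to a later batch
                prev, cur = cur, int(next_pos[cur])
                continue
            seen.add(values[idx])
            batch.append(idx)
            nxt = int(next_pos[cur])             # unlink `cur` in O(1)
            if prev == -1:
                head = nxt
            else:
                next_pos[prev] = nxt
            cur = nxt
            remaining -= 1
        batches.append(batch)
    return batches
```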

My intent here is to improve performance without changing semantics.

  • Conflicting samples are still deferred and retried later.
  • drop_last behavior is unchanged.
  • Seeded determinism is unchanged.

About dtypes:

  • Most datasets have row counts that fit in int32, so the index/position NumPy arrays use int32 in that case to reduce memory usage.
  • They automatically switch to int64 only when the row count exceeds the int32 range.
  • Precomputed hashes stay int64.
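A small hypothetical helper illustrating that selection (`index_dtype` is not a name from the PR):

```python
import numpy as np

def index_dtype(num_rows: int) -> type:
    # Hypothetical helper (not from the PR): int32 covers row indices for most
    # datasets; fall back to int64 only when the row count exceeds the int32 range.
    return np.int64 if num_rows > np.iinfo(np.int32).max else np.int32
```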

Benchmark

Setup:

  • Script: examples/sentence_transformer/evaluation/evaluation_no_dup_batch_sampler_speed.py
  • Dataset: sentence-transformers/msmarco-co-condenser-margin-mse-sym-mnrl-mean-v1
  • Subset/split: triplet-hard / train
  • Args: --batch-size 8192 --no-progress-bar --precompute-num-proc 8
  • Baseline: d0df0ccc
  • Current: this PR branch (no_dup_hashed_improve)
  • Note: NO_DUPLICATES is reported with 3 runs because each run is long; NO_DUPLICATES_HASHED also has an extra 10-run check for stability.

precompute_hashes=False (NO_DUPLICATES) (3-run mean)

| Metric | Baseline | Current | Delta |
|---|---|---|---|
| Sampler time (s) | 175.216 | 171.275 | -3.941s (-2.25%) |

Runs:

| Run | Baseline (s) | Current (s) |
|---|---|---|
| 1 | 174.716 | 170.041 |
| 2 | 176.797 | 172.140 |
| 3 | 174.136 | 171.644 |

precompute_hashes=True (NO_DUPLICATES_HASHED) (3-run mean)

| Metric | Baseline | Current | Delta |
|---|---|---|---|
| Sampler time (s) | 22.819 | 17.768 | -5.051s (-22.13%) |
| Hash build time (s) | 4.545 | 4.590 | +0.045s (+0.99%) |
| Total time (s) | 27.364 | 22.358 | -5.006s (-18.29%) |
| hash_uss current_delta (MiB) | 306.567 | 303.043 | -3.523 MiB (-1.15%) |
| hash_uss peak_delta (MiB) | 6891.520 | 6802.773 | -88.747 MiB (-1.29%) |

Runs:

| Run | Baseline sampler (s) | Baseline build (s) | Baseline total (s) | Current sampler (s) | Current build (s) | Current total (s) |
|---|---|---|---|---|---|---|
| 1 | 23.079 | 4.582 | 27.661 | 17.740 | 4.590 | 22.330 |
| 2 | 22.884 | 4.582 | 27.466 | 17.622 | 4.618 | 22.240 |
| 3 | 22.493 | 4.472 | 26.965 | 17.942 | 4.563 | 22.505 |

Extra check (NO_DUPLICATES_HASHED, 10 runs)

  • Sampler time: 22.768s -> 17.842s (-21.64%)

In short:

| Mode | Summary |
|---|---|
| NO_DUPLICATES | -2.25% (3-run mean) |
| NO_DUPLICATES_HASHED | -22.13% (3-run mean), -21.64% (10-run check) |

Memory note:

  • For NO_DUPLICATES_HASHED, memory stays roughly flat in these runs, with a slight average decrease in both hash_uss current_delta and hash_uss peak_delta.

Testing

  • uv run pytest tests/samplers/test_no_duplicates_batch_sampler.py -q
  • uv run pytest tests/test_trainer.py -k "get_batch_sampler" -q

Added tests:

  • Output equality against a reference implementation of the historical dict-based algorithm (drop_last and precompute_hashes combinations)
  • precompute_hashes=True stores precomputed hashes as np.int64
  • Index/position arrays switch to int64 when int32 range is exceeded

Related PR


This implementation was developed in collaboration with Codex, and all code has been reviewed by @hotchpotch.
If you have any questions or see anything that should be improved, I would really appreciate your feedback.

hotchpotch marked this pull request as ready for review February 11, 2026 08:38