hotchpotch (Contributor) commented Feb 11, 2026

Hello!

Pull request overview

  • This PR focuses on performance improvements in the sampler, aiming to improve runtime and memory behavior on very large datasets.
  • In my measurements, this change gives a small speedup (about 2%) for NO_DUPLICATES and roughly a 22% speedup for NO_DUPLICATES_HASHED.
  • It targets the hot loop in NoDuplicatesBatchSampler.
  • It aims to keep sampling behavior compatible with the historical implementation.
  • It adds tests to check compatibility and dtype behavior.

Details

The old implementation kept remaining indices in a Python dict and deleted accepted indices batch by batch.
That works, but it carries a lot of Python object overhead on large datasets.
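To make the comparison concrete, here is a rough schematic of that dict-based pattern. It is simplified and not the library's actual code; `sample_values(idx)` is a hypothetical accessor that returns the set of values to deduplicate for row `idx`:

```python
def dict_based_batches(shuffled_indices, sample_values, batch_size):
    # Rough schematic only (not the actual library code).
    # `sample_values(idx)` is a hypothetical accessor returning the set of
    # values in row `idx` that must not repeat within a batch.
    remaining = dict.fromkeys(shuffled_indices)   # insertion-ordered "set" of candidate indices
    while remaining:
        batch, seen = [], set()
        for idx in remaining:
            values = sample_values(idx)
            if seen & values:                     # conflict: defer to a later batch
                continue
            seen |= values
            batch.append(idx)
            if len(batch) == batch_size:
                break
        for idx in batch:
            del remaining[idx]                    # per-item dict deletions in the hot loop
        yield batch
```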

This PR switches that part to:

  • randperm(..., dtype=int32|int64) as a NumPy-backed index array
  • a singly linked list over positions (next_positions) for O(1) removals
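A minimal sketch of this combination follows. It is illustrative only, not the PR's exact code; `values[idx]` stands in for whatever is compared for duplicates (texts or precomputed hashes), and `no_duplicates_batches` is a hypothetical name:

```python
import numpy as np
import torch


def no_duplicates_batches(values, batch_size, seed=12, dtype=np.int64):
    # Illustrative sketch (not the PR's exact code): build batches in which no
    # value repeats, using a shuffled NumPy index array plus a singly linked
    # list over positions so that removing an accepted index is O(1).
    n = len(values)
    generator = torch.Generator().manual_seed(seed)
    order = torch.randperm(n, generator=generator).numpy().astype(dtype)

    # next_pos[p] = next still-available position after p; n is the "end" sentinel.
    next_pos = np.arange(1, n + 1, dtype=dtype)
    head, remaining, batches = 0, n, []

    while remaining:
        batch, seen = [], set()
        prev, cur = -1, head
        while cur != n and len(batch) < batch_size:
            idx = int(order[cur])
            if values[idx] in seen:              # conflict: defer to a later batch
                prev, cur = cur, int(next_pos[cur])
                continue
            seen.add(values[idx])
            batch.append(idx)
            nxt = int(next_pos[cur])             # unlink `cur` in O(1)
            if prev == -1:
                head = nxt
            else:
                next_pos[prev] = nxt
            cur = nxt
            remaining -= 1
        batches.append(batch)
    return batches
```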

My intent here is to improve performance without changing semantics.

  • Conflicting samples are still deferred and retried later.
  • drop_last behavior is unchanged.
  • Seeded determinism is unchanged.

About dtypes:

  • Most datasets have row counts that fit in int32, so the index/position NumPy arrays use int32 in that case to reduce memory usage.
  • They automatically switch to int64 only when the row count exceeds the int32 range.
  • Precomputed hashes stay int64.
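A small hypothetical helper illustrating that selection (`index_dtype` is not a name from the PR):

```python
import numpy as np

def index_dtype(num_rows: int) -> type:
    # Hypothetical helper (not from the PR): int32 covers row indices for most
    # datasets; fall back to int64 only when the row count exceeds the int32 range.
    return np.int64 if num_rows > np.iinfo(np.int32).max else np.int32
```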

Benchmark

Setup:

  • Script: examples/sentence_transformer/evaluation/evaluation_no_dup_batch_sampler_speed.py
  • Dataset: sentence-transformers/msmarco-co-condenser-margin-mse-sym-mnrl-mean-v1
  • Subset/split: triplet-hard / train
  • Args: --batch-size 8192 --no-progress-bar --precompute-num-proc 8
  • Baseline: d0df0ccc
  • Current: this PR branch (no_dup_hashed_improve)
  • Note: NO_DUPLICATES is reported with 3 runs because each run is long; NO_DUPLICATES_HASHED also has an extra 10-run check for stability.

precompute_hashes=False (NO_DUPLICATES) (3-run mean)

| Metric | Baseline | Current | Delta |
|---|---|---|---|
| Sampler time (s) | 175.216 | 171.275 | -3.941s (-2.25%) |

Runs:

| Run | Baseline (s) | Current (s) |
|---|---|---|
| 1 | 174.716 | 170.041 |
| 2 | 176.797 | 172.140 |
| 3 | 174.136 | 171.644 |

precompute_hashes=True (NO_DUPLICATES_HASHED) (3-run mean)

| Metric | Baseline | Current | Delta |
|---|---|---|---|
| Sampler time (s) | 22.819 | 17.768 | -5.051s (-22.13%) |
| Hash build time (s) | 4.545 | 4.590 | +0.045s (+0.99%) |
| Total time (s) | 27.364 | 22.358 | -5.006s (-18.29%) |
| hash_uss current_delta (MiB) | 306.567 | 303.043 | -3.523 MiB (-1.15%) |
| hash_uss peak_delta (MiB) | 6891.520 | 6802.773 | -88.747 MiB (-1.29%) |

Runs:

| Run | Baseline sampler (s) | Baseline build (s) | Baseline total (s) | Current sampler (s) | Current build (s) | Current total (s) |
|---|---|---|---|---|---|---|
| 1 | 23.079 | 4.582 | 27.661 | 17.740 | 4.590 | 22.330 |
| 2 | 22.884 | 4.582 | 27.466 | 17.622 | 4.618 | 22.240 |
| 3 | 22.493 | 4.472 | 26.965 | 17.942 | 4.563 | 22.505 |

Extra check (NO_DUPLICATES_HASHED, 10 runs)

  • Sampler time: 22.768s -> 17.842s (-21.64%)

In short:

| Mode | Summary |
|---|---|
| NO_DUPLICATES | -2.25% (3-run mean) |
| NO_DUPLICATES_HASHED | -22.13% (3-run mean), -21.64% (10-run check) |

Memory note:

  • For NO_DUPLICATES_HASHED, memory stays roughly flat in these runs, with a slight average decrease in both hash_uss current_delta and hash_uss peak_delta.

Testing

  • uv run pytest tests/samplers/test_no_duplicates_batch_sampler.py -q
  • uv run pytest tests/test_trainer.py -k "get_batch_sampler" -q

Added tests:

  • Output equality against a reference implementation of the historical dict-based algorithm (drop_last and precompute_hashes combinations)
  • precompute_hashes=True stores precomputed hashes as np.int64
  • Index/position arrays switch to int64 when int32 range is exceeded

Related PR


This implementation was developed in collaboration with Codex, and all code has been reviewed by @hotchpotch.
If you have any questions or see anything that should be improved, I would really appreciate your feedback.

hotchpotch marked this pull request as ready for review February 11, 2026 08:38