
Conversation

@hotchpotch (Contributor) commented Jan 9, 2026

The current NoDuplicatesBatchSampler can become very slow on datasets with many duplicate values across the query / positive / negatives columns, especially with large batch sizes (e.g., bs=8192). This is particularly noticeable with triplet or hard-negatives data.

Summary of Changes

This PR adds NoDuplicatesFastBatchSampler, which speeds up duplicate checking by pre-computing xxhash 64-bit values for each sample using datasets.map(). It maintains the same batch construction policy as NoDuplicatesBatchSampler (avoiding duplicates within a batch) while significantly improving performance.

Since this approach increases memory usage, both options are provided (a brief usage sketch follows the list):

  • NO_DUPLICATES: Existing sampler (memory-efficient)
  • NO_DUPLICATES_FAST: New sampler (faster, but uses more memory)
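For illustration, a minimal usage sketch of selecting between the two via the training arguments. NO_DUPLICATES_FAST is the value proposed by this PR (it is renamed later in this thread); the model, dataset, and loss choices are placeholders:

```python
from datasets import load_dataset
from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer
from sentence_transformers.losses import MultipleNegativesRankingLoss
from sentence_transformers.training_args import BatchSamplers, SentenceTransformerTrainingArguments

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
train_dataset = load_dataset(
    "sentence-transformers/msmarco-co-condenser-margin-mse-sym-mnrl-mean-v1", "triplet-hard", split="train"
)
loss = MultipleNegativesRankingLoss(model)

args = SentenceTransformerTrainingArguments(
    output_dir="output",
    per_device_train_batch_size=8192,
    # batch_sampler=BatchSamplers.NO_DUPLICATES,    # existing sampler, memory-efficient
    batch_sampler=BatchSamplers.NO_DUPLICATES_FAST,  # this PR: faster, uses more memory
)

trainer = SentenceTransformerTrainer(model=model, args=args, train_dataset=train_dataset, loss=loss)
trainer.train()
```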

Benchmarks (MS MARCO)

Benchmarked using the following HuggingFace datasets:

  • sentence-transformers/msmarco-co-condenser-margin-mse-sym-mnrl-mean-v1 / triplet-hard
  • sentence-transformers/msmarco-co-condenser-margin-mse-sym-mnrl-mean-v1 / triplet-50

Conditions

  • Batch size: 128 and 8192
  • Hash parallelization: num_proc=8
  • Progress bar disabled (--no-progress-bar)

The table below summarizes execution time, memory usage, and batch counts. Memory is measured using USS (Unique Set Size). The fast sampler stores hash values as NumPy int64 arrays, which accounts for the increased memory usage. The original NO_DUPLICATES checks values on-the-fly and does not increase memory usage.

| dataset | sampler | bs | total_time | hash_time | hash_uss_current | hash_uss_peak | batches (ideal/delta) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| triplet-50 | NO_DUPLICATES | 128 | 71.386s | n/a | n/a | n/a | 3929 (ideal=3929, delta=0) |
| triplet-50 | NO_DUPLICATES_FAST | 128 | 3.496s | 3.724s | 211.61MiB | 211.62MiB | 3929 (ideal=3929, delta=0) |
| triplet-50 | NO_DUPLICATES | 8192 | 283.215s | n/a | n/a | n/a | 58 (ideal=61, delta=3) |
| triplet-50 | NO_DUPLICATES_FAST | 8192 | 6.835s | 3.723s | 201.52MiB | 201.54MiB | 58 (ideal=61, delta=3) |
| triplet-hard | NO_DUPLICATES | 128 | 405.658s | n/a | n/a | n/a | 91114 (ideal=91114, delta=0) |
| triplet-hard | NO_DUPLICATES_FAST | 128 | 261.424s | 4.674s | 314.26MiB | 510.76MiB | 91114 (ideal=91114, delta=0) |
| triplet-hard | NO_DUPLICATES | 8192 | 171.853s | n/a | n/a | n/a | 1423 (ideal=1423, delta=0) |
| triplet-hard | NO_DUPLICATES_FAST | 8192 | 21.567s | 4.579s | 313.82MiB | 526.93MiB | 1423 (ideal=1423, delta=0) |

Environment: Ryzen 9 7950 (num_proc=8), Ubuntu 24

Memory Considerations

This implementation stores hash values as int64 NumPy ndarrays, which increases memory usage compared to the current NoDuplicatesBatchSampler.

For reference, using sentence-transformers/msmarco-co-condenser-margin-mse-sym-mnrl-mean-v1:

  • triplet-50 (503,302 rows): ~200MiB additional memory
  • triplet-hard (11,662,655 rows): ~314MiB additional memory

Therefore, users can choose between:

  • NO_DUPLICATES: Memory-efficient (existing)
  • NO_DUPLICATES_FAST: Faster (new)

How It Works

  1. On the first iteration only: use datasets.map() to retrieve all values from the query / positive / negatives columns
  2. Hash each string value using xxhash (64-bit)
  3. Store the per-row hashes as NumPy arrays (this assumes a fixed number of values per row, which holds because the query, positive, and negatives columns are consistent within a dataset)
  4. In __iter__, use the hash arrays for fast duplicate checking while constructing batches (see the sketch after this list)
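The steps above can be illustrated with a rough, self-contained sketch. This is not the PR's actual code; the helper names, the uint64 storage, and the greedy batching loop are simplifications of the approach described here:

```python
import os

import numpy as np
import xxhash
from datasets import Dataset


def hash_values(batch: dict, columns: list[str]) -> dict:
    # Hash every string value with xxhash64; list-valued columns (e.g. "negatives")
    # are flattened so each row becomes a fixed-length vector of hashes.
    all_hashes = []
    for i in range(len(batch[columns[0]])):
        row_hashes = []
        for column in columns:
            value = batch[column][i]
            values = value if isinstance(value, list) else [value]
            row_hashes.extend(xxhash.xxh64(str(v).encode("utf-8")).intdigest() for v in values)
        all_hashes.append(row_hashes)
    return {"hashes": all_hashes}


def build_row_hashes(dataset: Dataset, batch_size: int = 1000) -> np.ndarray:
    # Steps 1-3: a single datasets.map() pass, parallelized and capped at 8 workers.
    columns = [c for c in dataset.column_names if c != "dataset_name"]
    hashed = dataset.map(
        hash_values,
        batched=True,
        batch_size=batch_size,
        num_proc=min(8, os.cpu_count() or 1),
        remove_columns=dataset.column_names,
        fn_kwargs={"columns": columns},
        desc="Hashing dataset values",
    )
    # Fixed number of values per row -> dense matrix; this is the extra memory cost.
    # (The PR stores int64; uint64 is used here to hold the raw xxh64 digests.)
    return np.asarray(hashed["hashes"], dtype=np.uint64)


def iter_no_duplicate_batches(row_hashes: np.ndarray, batch_size: int):
    # Step 4: greedily fill each batch, skipping rows whose hashes collide with any
    # value already in the current batch; skipped rows are retried in later batches.
    remaining = list(range(len(row_hashes)))
    while remaining:
        batch, seen, skipped = [], set(), []
        for idx in remaining:
            if len(batch) == batch_size:
                skipped.append(idx)
                continue
            values = row_hashes[idx].tolist()
            if seen.isdisjoint(values):
                batch.append(idx)
                seen.update(values)
            else:
                skipped.append(idx)
        if len(batch) == batch_size:
            yield batch  # incomplete final batches are dropped, as with drop_last
        remaining = skipped
```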

Implementation Notes

  • While xxhash64 can theoretically produce hash collisions, the probability is extremely low. Even if a collision occurs, it would only result in excluding a non-duplicate sample from the same batch, which has minimal impact on training. Therefore, this is considered negligible in practice.
  • Hashing is parallelized using datasets.map(..., num_proc=N) for speed.
  • I haven't found other places in this project that use multiprocessing in a similar way. If a different implementation style is preferred, or if parallelization should be avoided, please let me know.
  • The number of parallel workers is capped at 8 even on machines with more cores. Feedback on whether this default is appropriate is welcome.
  • Suggestions for better optimization approaches or alternative implementations are also welcome.

Benchmark Commands
# triplet-50, bs=128
python examples/sentence_transformer/evaluation/evaluation_no_dup_batch_sampler_speed.py \
  --dataset-subset triplet-50 --batch-size 128 --target default --target fast --no-progress-bar --measure-hash-uss --num-proc 8

# triplet-50, bs=8192
python examples/sentence_transformer/evaluation/evaluation_no_dup_batch_sampler_speed.py \
  --dataset-subset triplet-50 --batch-size 8192 --target default --target fast --no-progress-bar --measure-hash-uss --num-proc 8

# triplet-hard, bs=128
python examples/sentence_transformer/evaluation/evaluation_no_dup_batch_sampler_speed.py \
  --dataset-subset triplet-hard --batch-size 128 --target default --target fast --no-progress-bar --measure-hash-uss --num-proc 8

# triplet-hard, bs=8192
python examples/sentence_transformer/evaluation/evaluation_no_dup_batch_sampler_speed.py \
  --dataset-subset triplet-hard --batch-size 8192 --target default --target fast --no-progress-bar --measure-hash-uss --num-proc 8

Feedback and suggestions are appreciated!

@hotchpotch marked this pull request as ready for review on January 9, 2026 07:44
@hotchpotch (Contributor, Author)

Hi @tomaarsen

I recently ran training successfully on 50 other datasets totaling 300M rows with this implementation.

With the existing NoDuplicatesBatchSampler, training speed is almost the same at the start, but on datasets with many duplicates, skipped duplicate samples accumulate toward the end of an epoch. As a result, training can become very slow in the later stages.

Given that, I’d be happy if you could consider merging this into SentenceTransformers.

If there are any parts I should revise, or any additional information you need to evaluate whether to merge it, please let me know. I’d really appreciate your feedback.

@tomaarsen (Member)

Hello!

Apologies for the delayed response. I mentioned it briefly in another PR, but I've been a bit busy working on a transformers PR to attempt to strengthen multi-modality support in Sentence Transformers.

For this PR, I've had a quick look a handful of times now, and I think there are a few different options:

  1. Merge this as-is (a very valid option)
  2. Add the NoDuplicatesFastBatchSampler pre-compute functionality to the existing NoDuplicatesBatchSampler with a flag/toggle. However, samplers are often specified with "no_duplicates", BatchSamplers.NO_DUPLICATES, or NoDuplicatesBatchSampler, although there's also support for functions. If users want to toggle the pre-compute on/off, they'd have to use e.g. batch_sampler=partial(NoDuplicatesBatchSampler, pre_compute=False) (see the sketch after this list). Not very convenient.
  3. Replace the existing NoDuplicatesBatchSampler with NoDuplicatesFastBatchSampler (although it would likely be nicer to go for nr. 2 to keep the option)
  4. Perhaps controversial: A bit like 2, but if the pre_compute argument (not a great name, but you get the gist) is not explicitly provided by the user, choose whether to use the old vs the new approach based on whether xxhash is installed.
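To illustrate option 2's downside, a rough sketch of what the toggle would look like for users (pre_compute is the hypothetical flag name from the list above, not an existing argument):

```python
from functools import partial

from sentence_transformers.sampler import NoDuplicatesBatchSampler
from sentence_transformers.training_args import BatchSamplers, SentenceTransformerTrainingArguments

# Enum / string form: convenient, but offers no way to pass extra flags.
args = SentenceTransformerTrainingArguments(output_dir="output", batch_sampler=BatchSamplers.NO_DUPLICATES)

# Callable form: needed just to flip the hypothetical pre_compute flag off.
args = SentenceTransformerTrainingArguments(
    output_dir="output",
    batch_sampler=partial(NoDuplicatesBatchSampler, pre_compute=False),
)
```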

Thank you for your benchmarks by the way, they are extremely valuable.

dataset.map seems like a nice solution: do you know whether it places the entire dataset in memory during processing? The original reason for not pre-computing the exact batches to process was that I expected a memory explosion simply for accessing all inputs.

I'm not too bothered with potential hash collisions at the moment, I think.

Either way: I'm definitely planning to include this, and your other PR, in the next release. Thanks a lot for your valuable contributions again! I've done some experiments with @NohTow regarding the various different InfoNCE variants using your implementation as well.

  • Tom Aarsen

@hotchpotch (Contributor, Author)

Hello!

Thank you for reviewing this while you are busy with the multimodality work. I am sorry for sending an @mention while you were occupied.

Thank you for the suggestions on how to integrate this. For now, I have added NO_DUPLICATES_FAST as a separate sampler. If you prefer a single-class approach, I can consolidate it into NoDuplicatesBatchSampler with a pre_compute flag, and optionally switch via partial(...). I am happy to follow whichever direction you prefer.

Regarding dataset.map: HF Datasets uses Arrow's memory-mapped cache, so Arrow-backed datasets do not need to load entirely into RAM (datasets loaded via datasets.load_dataset(hf_url) are also stored in Arrow). The map operation also writes to the cache by default while processing in batches, and writer_batch_size controls the memory/speed trade-off. ref: Datasets 🤝 Arrow.

The main source of additional RAM usage is the 64-bit hashes that this implementation stores. If each row has k values, the rough estimate is 8 * N * k bytes (plus overhead). In my 300M-row pair case (k = 2), the estimate is about 4.8 GB plus overhead, and I observed a bit over 5 GB in practice.
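As a rough cross-check against the benchmark table above (assuming each triplet-50 row contributes one query, one positive, and 50 negatives, i.e. k = 52): 503,302 rows × 52 hashes × 8 bytes ≈ 209 MB ≈ 200 MiB, which lines up with the ~200 MiB measured for that subset.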

If you have time after the multimodality work settles down, I would really appreciate another look at this PR. Thank you as well for evaluating the other MultipleNegativesBidirectionalRankingLoss PR.

P.S. I am also excited about the multimodality support and would love to try it. Your continued commitment to Sentence Transformers is a huge help to practitioners like me.

@tomaarsen (Member)

> Thank you for reviewing this while you are busy with the multimodality work. I am sorry for sending an @mention while you were occupied.

No need to apologize! I'm used to a big list of notifications; pinging me when useful is always fine.

> If you prefer a single-class approach, I can consolidate it into NoDuplicatesBatchSampler with a pre_compute flag, and optionally switch via partial(...). I am happy to follow whichever direction you prefer.

I think this is indeed the cleanest. Let's aim for precompute_hashes as the argument name perhaps? Just like with #3607, I think this feature is so strong that we shouldn't hide it as much, and instead integrate it directly with the features already in use. Could you work on this?

  • Tom Aarsen

@hotchpotch (Contributor, Author)

Hello, thanks for the feedback.

As suggested, I merged the implementation into NoDuplicatesBatchSampler with a precompute_hashes flag.
BatchSamplers.NO_DUPLICATES_FAST now just enables that flag.

I reran the benchmarks and performance is essentially unchanged from the original results.

If there are any other changes needed for merge, please let me know.

@tomaarsen (Member)

Okay, I think this is almost ready! I like the structure now. There's only a handful of changes I think we should make: rename NO_DUPLICATES_FAST to NO_DUPLICATES_HASHED to give more info about what's different; I think this should age better in case there are other alternative versions in the future. The docs are clear that this is faster at the cost of some memory.

In the long term, I think I'll add batch_sampler_kwargs and multi_dataset_batch_sampler_kwargs to the training arguments to allow easier access to flags like precompute_batch_size, but I'd like that to be integrated after #3554, because that PR creates BaseTrainingArguments/BaseTrainer that should host these changes.

But with this PR, NO_DUPLICATES_HASHED would already be possible. I'm also fine with keeping the evaluation script.
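For reference, a minimal sketch of what selecting the renamed sampler would look like (the enum value and flag are the ones discussed above; other argument values are placeholders):

```python
from sentence_transformers.training_args import BatchSamplers, SentenceTransformerTrainingArguments

# Faster duplicate checking via precomputed xxhash64 values (requires xxhash, uses more memory):
args = SentenceTransformerTrainingArguments(
    output_dir="output",
    batch_sampler=BatchSamplers.NO_DUPLICATES_HASHED,
)

# Internally this maps to the same sampler with the flag enabled, roughly:
# NoDuplicatesBatchSampler(dataset, batch_size=..., drop_last=..., precompute_hashes=True)
```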

I'll push some tiny changes I made locally, and will continue to make some changes, but this is like 95% there! Feel free to let me know what you think of my changes.

  • Tom Aarsen

The _iter_no_duplicate_batches and _remove_label_columns can be placed back in the class itself, as there's now just one class again.
@tomaarsen (Member) left a comment:

Looks strong I think!

@hotchpotch (Contributor, Author)

Hello!

Renaming it to NO_DUPLICATES_HASHED is clear and makes a lot of sense; I'm happy with that change. Thanks also for the refactoring.

I really appreciate the effort you put into reviewing this PR. Thank you!

Copilot AI (Contributor) left a comment:

Pull request overview

This PR introduces a faster variant of the no-duplicates batch sampler by precomputing xxhash64-based hashes for dataset rows, wires it into the training configuration API, and adds a benchmark script plus tests.

Changes:

  • Extend NoDuplicatesBatchSampler to support an optional precompute_hashes mode that uses Hugging Face datasets.map and xxhash to precompute per-row hash vectors for faster duplicate detection, including Arrow-based validation and dense NumPy storage.
  • Add the BatchSamplers.NO_DUPLICATES_HASHED enum value and integrate it into SentenceTransformerTrainer.get_batch_sampler so it maps to NoDuplicatesBatchSampler(..., precompute_hashes=True).
  • Add tests for the new batch sampler argument value and for both hashed / non-hashed sampler paths, plus an example benchmarking script to compare speed and memory characteristics.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 1 comment.

Summary per file

| File | Description |
| --- | --- |
| sentence_transformers/sampler.py | Adds xxhash-based hashing utilities, new precompute_hashes / _build_hashes logic, and adjusts NoDuplicatesBatchSampler.__iter__ to use precomputed hash matrices when enabled. |
| sentence_transformers/training_args.py | Introduces BatchSamplers.NO_DUPLICATES_HASHED and updates the docstring to describe the hashed variant and its recommended use cases. |
| sentence_transformers/trainer.py | Extends get_batch_sampler to instantiate NoDuplicatesBatchSampler with precompute_hashes=True when batch_sampler=BatchSamplers.NO_DUPLICATES_HASHED. |
| tests/samplers/test_no_duplicates_batch_sampler.py | Parametrizes existing tests over precompute_hashes=False/True (skipping when xxhash is unavailable) to cover both code paths and ensure behavior remains consistent. |
| tests/test_training_args.py | Verifies that HfArgumentParser parses --batch_sampler no_duplicates_hashed into BatchSamplers.NO_DUPLICATES_HASHED. |
| examples/sentence_transformer/evaluation/evaluation_no_dup_batch_sampler_speed.py | New benchmark script to compare default vs. hashed NoDuplicatesBatchSampler in terms of runtime, batch counts, and memory (RSS/USS and tracemalloc) on Hugging Face datasets. |


Comment on lines +274 to +290
def _build_hashes(self) -> None:
    if not self.precompute_hashes or self._row_hashes is not None:
        return
    exclude_columns = {"dataset_name"}
    columns = list(self.dataset.column_names)
    # Precompute hash values once to avoid repeated string processing per batch.
    # Use num_proc to parallelize hashing across CPU cores.
    hash_ds: Dataset | None = None
    hash_ds = self.dataset.map(
        _hash_batch,
        batched=True,
        batch_size=self.precompute_batch_size,
        num_proc=self.precompute_num_proc,
        remove_columns=columns,
        fn_kwargs={"columns": columns, "exclude_columns": exclude_columns},
        desc="Hashing dataset values",
    )
Copilot AI commented Jan 29, 2026

The precompute_hashes path assumes that self.dataset is a Hugging Face datasets.Dataset (or at least has a .map method), but this is not validated up front. If a user constructs NoDuplicatesBatchSampler with precompute_hashes=True on a non-HF dataset, self.dataset.map(...) will raise an AttributeError outside the try/except block, resulting in an unhelpful error. Consider adding an explicit type/feature check (e.g. hasattr(self.dataset, "map")) and raising a clear ValueError explaining that precompute_hashes=True requires a Hugging Face Dataset (or an object exposing a compatible .map API).

@tomaarsen (Member) replied:

Unimportant; datasets can safely be assumed to be a datasets.Dataset

@tomaarsen changed the title from "[feat] Add NoDuplicatesFastBatchSampler" to "[feat] Add NO_DUPLICATES_HASHED: optional hashing for NoDuplicatesBatchSampler" on Jan 29, 2026
@tomaarsen enabled auto-merge (squash) on January 29, 2026 13:32
@tomaarsen merged commit b07e0ce into huggingface:main on Jan 29, 2026
17 checks passed