
Conversation

@hotchpotch (Contributor) commented Jan 9, 2026

The current NoDuplicatesBatchSampler can become very slow on datasets with many duplicate values across the query / positive / negatives columns, especially with large batch sizes (e.g., bs=8192). This is particularly noticeable with triplet or hard-negatives data.

Summary of Changes

This PR adds NoDuplicatesFastBatchSampler, which speeds up duplicate checking by pre-computing xxhash 64-bit values for each sample using datasets.map(). It maintains the same batch construction policy as NoDuplicatesBatchSampler (avoiding duplicates within a batch) while significantly improving performance.

Since this approach increases memory usage, both options are provided (a brief usage sketch follows the list):

  • NO_DUPLICATES: Existing sampler (memory-efficient)
  • NO_DUPLICATES_FAST: New sampler (faster, but uses more memory)
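For illustration, a minimal usage sketch of selecting between the two via the training arguments. NO_DUPLICATES_FAST is the value proposed by this PR (it is renamed later in this thread); the model, dataset, and loss choices are placeholders:

```python
from datasets import load_dataset
from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer
from sentence_transformers.losses import MultipleNegativesRankingLoss
from sentence_transformers.training_args import BatchSamplers, SentenceTransformerTrainingArguments

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
train_dataset = load_dataset(
    "sentence-transformers/msmarco-co-condenser-margin-mse-sym-mnrl-mean-v1", "triplet-hard", split="train"
)
loss = MultipleNegativesRankingLoss(model)

args = SentenceTransformerTrainingArguments(
    output_dir="output",
    per_device_train_batch_size=8192,
    # batch_sampler=BatchSamplers.NO_DUPLICATES,    # existing sampler, memory-efficient
    batch_sampler=BatchSamplers.NO_DUPLICATES_FAST,  # this PR: faster, uses more memory
)

trainer = SentenceTransformerTrainer(model=model, args=args, train_dataset=train_dataset, loss=loss)
trainer.train()
```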

Benchmarks (MS MARCO)

Benchmarked using the following HuggingFace datasets:

  • sentence-transformers/msmarco-co-condenser-margin-mse-sym-mnrl-mean-v1 / triplet-hard
  • sentence-transformers/msmarco-co-condenser-margin-mse-sym-mnrl-mean-v1 / triplet-50

Conditions

  • Batch size: 128 and 8192
  • Hash parallelization: num_proc=8
  • Progress bar disabled (--no-progress-bar)

The table below summarizes execution time, memory usage, and batch counts. Memory is measured using USS (Unique Set Size). The fast sampler stores hash values as NumPy int64 arrays, which accounts for the increased memory usage. The original NO_DUPLICATES checks values on-the-fly and does not increase memory usage.

| dataset | sampler | bs | total_time | hash_time | hash_uss_current | hash_uss_peak | batches (ideal/delta) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| triplet-50 | NO_DUPLICATES | 128 | 71.386s | n/a | n/a | n/a | 3929 (ideal=3929, delta=0) |
| triplet-50 | NO_DUPLICATES_FAST | 128 | 3.496s | 3.724s | 211.61MiB | 211.62MiB | 3929 (ideal=3929, delta=0) |
| triplet-50 | NO_DUPLICATES | 8192 | 283.215s | n/a | n/a | n/a | 58 (ideal=61, delta=3) |
| triplet-50 | NO_DUPLICATES_FAST | 8192 | 6.835s | 3.723s | 201.52MiB | 201.54MiB | 58 (ideal=61, delta=3) |
| triplet-hard | NO_DUPLICATES | 128 | 405.658s | n/a | n/a | n/a | 91114 (ideal=91114, delta=0) |
| triplet-hard | NO_DUPLICATES_FAST | 128 | 261.424s | 4.674s | 314.26MiB | 510.76MiB | 91114 (ideal=91114, delta=0) |
| triplet-hard | NO_DUPLICATES | 8192 | 171.853s | n/a | n/a | n/a | 1423 (ideal=1423, delta=0) |
| triplet-hard | NO_DUPLICATES_FAST | 8192 | 21.567s | 4.579s | 313.82MiB | 526.93MiB | 1423 (ideal=1423, delta=0) |

Environment: Ryzen 9 7950 (num_proc=8), Ubuntu 24

Memory Considerations

This implementation stores hash values as int64 NumPy ndarrays, which increases memory usage compared to the current NoDuplicatesBatchSampler.

For reference, using sentence-transformers/msmarco-co-condenser-margin-mse-sym-mnrl-mean-v1:

  • triplet-50 (503,302 rows): ~200MiB additional memory
  • triplet-hard (11,662,655 rows): ~314MiB additional memory

Therefore, users can choose between:

  • NO_DUPLICATES: Memory-efficient (existing)
  • NO_DUPLICATES_FAST: Faster (new)

How It Works

  1. On the first iteration only: use datasets.map() to retrieve all values from the query / positive / negatives columns
  2. Hash each string value using xxhash (64-bit)
  3. Store the per-row hashes as NumPy arrays (this assumes a fixed number of values per row, which holds because the query, positive, and negatives columns are consistent within a dataset)
  4. In __iter__, use the hash arrays for fast duplicate checking while constructing batches (see the sketch after this list)
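The steps above can be illustrated with a rough, self-contained sketch. This is not the PR's actual code; the helper names, the uint64 storage, and the greedy batching loop are simplifications of the approach described here:

```python
import os

import numpy as np
import xxhash
from datasets import Dataset


def hash_values(batch: dict, columns: list[str]) -> dict:
    # Hash every string value with xxhash64; list-valued columns (e.g. "negatives")
    # are flattened so each row becomes a fixed-length vector of hashes.
    all_hashes = []
    for i in range(len(batch[columns[0]])):
        row_hashes = []
        for column in columns:
            value = batch[column][i]
            values = value if isinstance(value, list) else [value]
            row_hashes.extend(xxhash.xxh64(str(v).encode("utf-8")).intdigest() for v in values)
        all_hashes.append(row_hashes)
    return {"hashes": all_hashes}


def build_row_hashes(dataset: Dataset, batch_size: int = 1000) -> np.ndarray:
    # Steps 1-3: a single datasets.map() pass, parallelized and capped at 8 workers.
    columns = [c for c in dataset.column_names if c != "dataset_name"]
    hashed = dataset.map(
        hash_values,
        batched=True,
        batch_size=batch_size,
        num_proc=min(8, os.cpu_count() or 1),
        remove_columns=dataset.column_names,
        fn_kwargs={"columns": columns},
        desc="Hashing dataset values",
    )
    # Fixed number of values per row -> dense matrix; this is the extra memory cost.
    # (The PR stores int64; uint64 is used here to hold the raw xxh64 digests.)
    return np.asarray(hashed["hashes"], dtype=np.uint64)


def iter_no_duplicate_batches(row_hashes: np.ndarray, batch_size: int):
    # Step 4: greedily fill each batch, skipping rows whose hashes collide with any
    # value already in the current batch; skipped rows are retried in later batches.
    remaining = list(range(len(row_hashes)))
    while remaining:
        batch, seen, skipped = [], set(), []
        for idx in remaining:
            if len(batch) == batch_size:
                skipped.append(idx)
                continue
            values = row_hashes[idx].tolist()
            if seen.isdisjoint(values):
                batch.append(idx)
                seen.update(values)
            else:
                skipped.append(idx)
        if len(batch) == batch_size:
            yield batch  # incomplete final batches are dropped, as with drop_last
        remaining = skipped
```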

Implementation Notes

  • While xxhash64 can theoretically produce hash collisions, the probability is extremely low. Even if a collision occurs, it would only result in excluding a non-duplicate sample from the same batch, which has minimal impact on training. Therefore, this is considered negligible in practice.
  • Hashing is parallelized using datasets.map(..., num_proc=N) for speed.
  • I haven't found other places in this project that use multiprocessing in a similar way. If a different implementation style is preferred, or if parallelization should be avoided, please let me know.
  • The number of parallel workers is capped at 8 even on machines with more cores. Feedback on whether this default is appropriate is welcome.
  • Suggestions for better optimization approaches or alternative implementations are also welcome.

Benchmark Commands
# triplet-50, bs=128
python examples/sentence_transformer/evaluation/evaluation_no_dup_batch_sampler_speed.py \
  --dataset-subset triplet-50 --batch-size 128 --target default --target fast --no-progress-bar --measure-hash-uss --num-proc 8

# triplet-50, bs=8192
python examples/sentence_transformer/evaluation/evaluation_no_dup_batch_sampler_speed.py \
  --dataset-subset triplet-50 --batch-size 8192 --target default --target fast --no-progress-bar --measure-hash-uss --num-proc 8

# triplet-hard, bs=128
python examples/sentence_transformer/evaluation/evaluation_no_dup_batch_sampler_speed.py \
  --dataset-subset triplet-hard --batch-size 128 --target default --target fast --no-progress-bar --measure-hash-uss --num-proc 8

# triplet-hard, bs=8192
python examples/sentence_transformer/evaluation/evaluation_no_dup_batch_sampler_speed.py \
  --dataset-subset triplet-hard --batch-size 8192 --target default --target fast --no-progress-bar --measure-hash-uss --num-proc 8

Feedback and suggestions are appreciated!

@hotchpotch marked this pull request as ready for review on January 9, 2026 07:44
@hotchpotch (Contributor, Author)

Hi @tomaarsen

I recently ran training successfully on 50 other datasets totaling 300M rows with this implementation.

With the existing NoDuplicatesBatchSampler, training speed is almost the same at the start, but on datasets with many duplicates, skipped duplicate samples accumulate toward the end of an epoch. As a result, training can become very slow in the later stages.

Given that, I’d be happy if you could consider merging this into SentenceTransformers.

If there are any parts I should revise, or any additional information you need to evaluate whether to merge it, please let me know. I’d really appreciate your feedback.

@tomaarsen (Member)

Hello!

Apologies for the delayed response. I mentioned it briefly in another PR, but I've been a bit busy working on a transformers PR to attempt to strengthen multi-modality support in Sentence Transformers.

For this PR, I've had a quick look a handful of times now, and I think there are a few different options:

  1. Merge this as-is (a very valid option)
  2. Add the NoDuplicatesFastBatchSampler pre-compute functionality to the existing NoDuplicatesBatchSampler with a flag/toggle. However, samplers are often specified with "no_duplicates", BatchSamplers.NO_DUPLICATES, or NoDuplicatesBatchSampler, although there's also support for functions. If users want to toggle the pre-compute on/off, they'd have to use e.g. batch_sampler=partial(NoDuplicatesBatchSampler, pre_compute=False) (see the sketch after this list). Not very convenient.
  3. Replace the existing NoDuplicatesBatchSampler with NoDuplicatesFastBatchSampler (although it would likely be nicer to go for nr. 2 to keep the option)
  4. Perhaps controversial: A bit like 2, but if the pre_compute argument (not a great name, but you get the gist) is not explicitly provided by the user, choose whether to use the old vs the new approach based on whether xxhash is installed.
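To illustrate option 2's downside, a rough sketch of what the toggle would look like for users (pre_compute is the hypothetical flag name from the list above, not an existing argument):

```python
from functools import partial

from sentence_transformers.sampler import NoDuplicatesBatchSampler
from sentence_transformers.training_args import BatchSamplers, SentenceTransformerTrainingArguments

# Enum / string form: convenient, but offers no way to pass extra flags.
args = SentenceTransformerTrainingArguments(output_dir="output", batch_sampler=BatchSamplers.NO_DUPLICATES)

# Callable form: needed just to flip the hypothetical pre_compute flag off.
args = SentenceTransformerTrainingArguments(
    output_dir="output",
    batch_sampler=partial(NoDuplicatesBatchSampler, pre_compute=False),
)
```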

Thank you for your benchmarks by the way, they are extremely valuable.

dataset.map seems like a nice solution: do you know whether it places the entire dataset in memory during processing? The original reason for not pre-computing the exact batches to process was that I expected a memory explosion simply for accessing all inputs.

I'm not too bothered with potential hash collisions at the moment, I think.

Either way: I'm definitely planning to include this, and your other PR, in the next release. Thanks a lot for your valuable contributions again! I've done some experiments with @NohTow regarding the various different InfoNCE variants using your implementation as well.

  • Tom Aarsen

@hotchpotch (Contributor, Author)

Hello!

Thank you for reviewing this while you are busy with the multimodality work. I am sorry for sending an @mention while you were occupied.

Thank you for the suggestions on how to integrate this. For now, I have added NO_DUPLICATES_FAST as a separate sampler. If you prefer a single-class approach, I can consolidate it into NoDuplicatesBatchSampler with a pre_compute flag, and optionally switch via partial(...). I am happy to follow whichever direction you prefer.

Regarding dataset.map: HF Datasets uses Arrow's memory-mapped cache, so Arrow-backed datasets do not need to load entirely into RAM (datasets loaded via datasets.load_dataset(hf_url) are also stored in Arrow). The map operation also writes to the cache by default while processing in batches, and writer_batch_size controls the memory/speed trade-off. ref: Datasets 🤝 Arrow.

The main source of additional RAM usage is the 64-bit hashes that this implementation stores. If each row has k values, the rough estimate is 8 * N * k bytes (plus overhead). In my 300M-row pair case (k = 2), the estimate is about 4.8 GB plus overhead, and I observed a bit over 5 GB in practice.
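As a rough cross-check against the benchmark table above (assuming each triplet-50 row contributes one query, one positive, and 50 negatives, i.e. k = 52): 503,302 rows × 52 hashes × 8 bytes ≈ 209 MB ≈ 200 MiB, which lines up with the ~200 MiB measured for that subset.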

If you have time after the multimodality work settles down, I would really appreciate another look at this PR. Thank you as well for evaluating the other MultipleNegativesBidirectionalRankingLoss PR.

P.S. I am also excited about the multimodality support and would love to try it. Your continued commitment to Sentence Transformers is a huge help to practitioners like me.

@tomaarsen (Member)

> Thank you for reviewing this while you are busy with the multimodality work. I am sorry for sending an @mention while you were occupied.

No need to apologize! I'm used to a big list of notifications; pinging me when useful is always fine.

> If you prefer a single-class approach, I can consolidate it into NoDuplicatesBatchSampler with a pre_compute flag, and optionally switch via partial(...). I am happy to follow whichever direction you prefer.

I think this is indeed the cleanest. Let's aim for precompute_hashes as the argument name perhaps? Just like with #3607, I think this feature is so strong that we shouldn't hide it as much, and instead integrate it directly with the features already in use. Could you work on this?

  • Tom Aarsen

@hotchpotch (Contributor, Author)

Hello, thanks for the feedback.

As suggested, I merged the implementation into NoDuplicatesBatchSampler with a precompute_hashes flag.
BatchSamplers.NO_DUPLICATES_FAST now just enables that flag.

I reran the benchmarks and performance is essentially unchanged from the original results.

If there are any other changes needed for merge, please let me know.

@tomaarsen (Member)

Okay, I think this is almost ready! I like the structure now. There's only a handful of changes I think we should make: rename NO_DUPLICATES_FAST to NO_DUPLICATES_HASHED to give more info about what's different; I think this should age better in case there are other alternative versions in the future. The docs are clear that this is faster at the cost of some memory.

In the long term, I think I'll add batch_sampler_kwargs and multi_dataset_batch_sampler_kwargs to the training arguments to allow easier access to flags like precompute_batch_size, but I'd like that to be integrated after #3554, because that PR creates BaseTrainingArguments/BaseTrainer that should host these changes.

But with this PR, NO_DUPLICATES_HASHED would already be possible. I'm also fine with keeping the evaluation script.
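For reference, a minimal sketch of what selecting the renamed sampler would look like (the enum value and flag are the ones discussed above; other argument values are placeholders):

```python
from sentence_transformers.training_args import BatchSamplers, SentenceTransformerTrainingArguments

# Faster duplicate checking via precomputed xxhash64 values (requires xxhash, uses more memory):
args = SentenceTransformerTrainingArguments(
    output_dir="output",
    batch_sampler=BatchSamplers.NO_DUPLICATES_HASHED,
)

# Internally this maps to the same sampler with the flag enabled, roughly:
# NoDuplicatesBatchSampler(dataset, batch_size=..., drop_last=..., precompute_hashes=True)
```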

I'll push some tiny changes I made locally, and will continue to make some changes, but this is like 95% there! Feel free to let me know what you think of my changes.

  • Tom Aarsen

The _iter_no_duplicate_batches and _remove_label_columns can be placed back in the class itself, as there's now just one class again.
@tomaarsen (Member) left a comment:

Looks strong I think!

@hotchpotch (Contributor, Author)

Hello!

Renaming it to NO_DUPLICATES_HASHED is clear and makes a lot of sense; I'm happy with that change. Thanks also for the refactoring.

I really appreciate the effort you put into reviewing this PR. Thank you!

Copilot AI (Contributor) left a comment:

Pull request overview

This PR introduces a faster variant of the no-duplicates batch sampler by precomputing xxhash64-based hashes for dataset rows, wires it into the training configuration API, and adds a benchmark script plus tests.

Changes:

  • Extend NoDuplicatesBatchSampler to support an optional precompute_hashes mode that uses Hugging Face datasets.map and xxhash to precompute per-row hash vectors for faster duplicate detection, including Arrow-based validation and dense NumPy storage.
  • Add the BatchSamplers.NO_DUPLICATES_HASHED enum value and integrate it into SentenceTransformerTrainer.get_batch_sampler so it maps to NoDuplicatesBatchSampler(..., precompute_hashes=True).
  • Add tests for the new batch sampler argument value and for both hashed / non-hashed sampler paths, plus an example benchmarking script to compare speed and memory characteristics.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 1 comment.

Summary per file

| File | Description |
| --- | --- |
| sentence_transformers/sampler.py | Adds xxhash-based hashing utilities, new precompute_hashes / _build_hashes logic, and adjusts NoDuplicatesBatchSampler.__iter__ to use precomputed hash matrices when enabled. |
| sentence_transformers/training_args.py | Introduces BatchSamplers.NO_DUPLICATES_HASHED and updates the docstring to describe the hashed variant and its recommended use cases. |
| sentence_transformers/trainer.py | Extends get_batch_sampler to instantiate NoDuplicatesBatchSampler with precompute_hashes=True when batch_sampler=BatchSamplers.NO_DUPLICATES_HASHED. |
| tests/samplers/test_no_duplicates_batch_sampler.py | Parametrizes existing tests over precompute_hashes=False/True (skipping when xxhash is unavailable) to cover both code paths and ensure behavior remains consistent. |
| tests/test_training_args.py | Verifies that HfArgumentParser parses --batch_sampler no_duplicates_hashed into BatchSamplers.NO_DUPLICATES_HASHED. |
| examples/sentence_transformer/evaluation/evaluation_no_dup_batch_sampler_speed.py | New benchmark script to compare default vs. hashed NoDuplicatesBatchSampler in terms of runtime, batch counts, and memory (RSS/USS and tracemalloc) on Hugging Face datasets. |


Comment on lines +274 to +290
def _build_hashes(self) -> None:
    if not self.precompute_hashes or self._row_hashes is not None:
        return
    exclude_columns = {"dataset_name"}
    columns = list(self.dataset.column_names)
    # Precompute hash values once to avoid repeated string processing per batch.
    # Use num_proc to parallelize hashing across CPU cores.
    hash_ds: Dataset | None = None
    hash_ds = self.dataset.map(
        _hash_batch,
        batched=True,
        batch_size=self.precompute_batch_size,
        num_proc=self.precompute_num_proc,
        remove_columns=columns,
        fn_kwargs={"columns": columns, "exclude_columns": exclude_columns},
        desc="Hashing dataset values",
    )
Copilot AI commented Jan 29, 2026

The precompute_hashes path assumes that self.dataset is a Hugging Face datasets.Dataset (or at least has a .map method), but this is not validated up front. If a user constructs NoDuplicatesBatchSampler with precompute_hashes=True on a non-HF dataset, self.dataset.map(...) will raise an AttributeError outside the try/except block, resulting in an unhelpful error. Consider adding an explicit type/feature check (e.g. hasattr(self.dataset, "map")) and raising a clear ValueError explaining that precompute_hashes=True requires a Hugging Face Dataset (or an object exposing a compatible .map API).

@tomaarsen (Member) replied:

Unimportant; datasets can safely be assumed to be a datasets.Dataset

@tomaarsen changed the title from "[feat] Add NoDuplicatesFastBatchSampler" to "[feat] Add NO_DUPLICATES_HASHED: optional hashing for NoDuplicatesBatchSampler" on Jan 29, 2026
@tomaarsen enabled auto-merge (squash) on January 29, 2026 13:32
@tomaarsen merged commit b07e0ce into huggingface:main on Jan 29, 2026
17 checks passed