[feat] Add NO_DUPLICATES_HASHED: optional hashing for NoDuplicatesBatchSampler #3611
Conversation
Hi @tomaarsen, I recently ran training successfully on 50 other datasets totaling 300M rows with this implementation. Given that, I'd be happy if you could consider merging this into Sentence Transformers. If there are any parts I should revise, or any additional information you need to evaluate whether to merge it, please let me know. I'd really appreciate your feedback.
Hello! Apologies for the delayed response. I mentioned it briefly in another PR, but I've been a bit busy working on multimodality support. For this PR, I've had a quick look a handful of times now, and I think there are a few different options for integrating it.
Thank you for your benchmarks, by the way; they are extremely valuable.
I'm not too bothered with potential hash collisions at the moment, I think. Either way: I'm definitely planning to include this, and your other PR, in the next release. Thanks a lot for your valuable contributions again! I've done some experiments with @NohTow regarding the various InfoNCE variants using your implementation as well.
Hello! Thank you for reviewing this while you are busy with the multimodality work. I am sorry for sending an @mention while you were occupied. Thank you for the suggestions on how to integrate this.

Regarding memory: the main source of additional RAM usage is the 64-bit hashes that this implementation stores. If each row has k values, the rough estimate is about 8 × k bytes per row.

If you have time after the multimodality work settles down, I would really appreciate another look at this PR. Thank you as well for evaluating the other PR.

P.S. I am also excited about the multimodality support and would love to try it. Your continued commitment to Sentence Transformers is a huge help to practitioners like me.
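As a rough illustration of that estimate (assuming one 8-byte xxhash64 value is stored per text value per row; the row and column counts below are illustrative, taken from the benchmark datasets mentioned later in this PR):

```python
# Rough extra memory for precomputed hashes: N rows x k values/row x 8 bytes/hash.
def estimated_hash_memory_mib(num_rows: int, values_per_row: int) -> float:
    return num_rows * values_per_row * 8 / 2**20

# e.g. ~11.7M triplet rows with 3 hashed values (query, positive, negative) each:
print(estimated_hash_memory_mib(11_662_655, 3))  # ~267 MiB of raw hash storage
# The PR reports ~314 MiB of additional memory for this split, i.e. the raw
# hashes dominate, with some NumPy/bookkeeping overhead on top.
```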
No need to apologize! I'm used to a big list of notifications, pinging me when useful is always fine.
I think this is indeed the cleanest. Let's aim for that.
Hello, thanks for the feedback. As suggested, I merged the implementation into `NoDuplicatesBatchSampler`. I reran the benchmarks and performance is essentially unchanged from the original results. If there are any other changes needed for merge, please let me know.
Tiny code improvements
Okay, I think this is almost ready! I like the structure now. There's only a handful of changes I think we should do: rename `NO_DUPLICATES_FAST` to `NO_DUPLICATES_HASHED`, for one. In the long term I have some further additions in mind, but with this PR I'll push some tiny changes I made locally, and will continue to make some changes. This is like 95% there! Feel free to let me know what you think of my changes.
The `_iter_no_duplicate_batches` and `_remove_label_columns` methods can be placed back in the class itself, as there's now just one class again.
tomaarsen left a comment:
Looks strong I think!
Hello! I've renamed it to `NO_DUPLICATES_HASHED`. I really appreciate the effort you put into reviewing this PR. Thank you!
Pull request overview
This PR introduces a faster variant of the no-duplicates batch sampler by precomputing xxhash64-based hashes for dataset rows, wires it into the training configuration API, and adds a benchmark script plus tests.
Changes:
- Extend `NoDuplicatesBatchSampler` to support an optional `precompute_hashes` mode that uses Hugging Face `datasets.map` and xxhash to precompute per-row hash vectors for faster duplicate detection, including Arrow-based validation and dense NumPy storage.
- Add the `BatchSamplers.NO_DUPLICATES_HASHED` enum value and integrate it into `SentenceTransformerTrainer.get_batch_sampler` so it maps to `NoDuplicatesBatchSampler(..., precompute_hashes=True)` (see the usage sketch after this list).
- Add tests for the new batch sampler argument value and for both hashed / non-hashed sampler paths, plus an example benchmarking script to compare speed and memory characteristics.
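For orientation, a minimal usage sketch of the option described in the second bullet above (hypothetical argument values; assumes the enum and trainer wiring land exactly as listed):

```python
from sentence_transformers.training_args import BatchSamplers, SentenceTransformerTrainingArguments

# Opt into the hashed no-duplicates sampler added by this PR; the trainer then
# builds NoDuplicatesBatchSampler(..., precompute_hashes=True) under the hood.
args = SentenceTransformerTrainingArguments(
    output_dir="output/my-run",                        # hypothetical output path
    per_device_train_batch_size=8192,                  # large batches benefit most
    batch_sampler=BatchSamplers.NO_DUPLICATES_HASHED,
)
# The existing BatchSamplers.NO_DUPLICATES remains available for memory-constrained runs.
```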
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| `sentence_transformers/sampler.py` | Adds xxhash-based hashing utilities, new `precompute_hashes` / `_build_hashes` logic, and adjusts `NoDuplicatesBatchSampler.__iter__` to use precomputed hash matrices when enabled. |
| `sentence_transformers/training_args.py` | Introduces `BatchSamplers.NO_DUPLICATES_HASHED` and updates the docstring to describe the hashed variant and its recommended use cases. |
| `sentence_transformers/trainer.py` | Extends `get_batch_sampler` to instantiate `NoDuplicatesBatchSampler` with `precompute_hashes=True` when `batch_sampler=BatchSamplers.NO_DUPLICATES_HASHED`. |
| `tests/samplers/test_no_duplicates_batch_sampler.py` | Parametrizes existing tests over `precompute_hashes=False/True` (skipping when xxhash is unavailable) to cover both code paths and ensure behavior remains consistent. |
| `tests/test_training_args.py` | Verifies that `HfArgumentParser` parses `--batch_sampler no_duplicates_hashed` into `BatchSamplers.NO_DUPLICATES_HASHED`. |
| `examples/sentence_transformer/evaluation/evaluation_no_dup_batch_sampler_speed.py` | New benchmark script to compare default vs. hashed `NoDuplicatesBatchSampler` in terms of runtime, batch counts, and memory (RSS/USS and tracemalloc) on Hugging Face datasets. |
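Since the last row above mentions RSS/USS and tracemalloc, here is a minimal sketch of how those memory figures can be read in Python (illustrative only; not necessarily how the PR's benchmark script measures them):

```python
import tracemalloc

import psutil

tracemalloc.start()
# ... build the batch sampler and iterate over its batches here ...
_, tracemalloc_peak = tracemalloc.get_traced_memory()  # peak Python-level allocations, in bytes
tracemalloc.stop()

mem = psutil.Process().memory_full_info()  # .uss requires a platform where USS is available
print(f"RSS={mem.rss / 2**20:.1f} MiB, USS={mem.uss / 2**20:.1f} MiB, "
      f"tracemalloc peak={tracemalloc_peak / 2**20:.1f} MiB")
```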
```python
def _build_hashes(self) -> None:
    if not self.precompute_hashes or self._row_hashes is not None:
        return
    exclude_columns = {"dataset_name"}
    columns = list(self.dataset.column_names)
    # Precompute hash values once to avoid repeated string processing per batch.
    # Use num_proc to parallelize hashing across CPU cores.
    hash_ds: Dataset | None = None
    hash_ds = self.dataset.map(
        _hash_batch,
        batched=True,
        batch_size=self.precompute_batch_size,
        num_proc=self.precompute_num_proc,
        remove_columns=columns,
        fn_kwargs={"columns": columns, "exclude_columns": exclude_columns},
        desc="Hashing dataset values",
    )
```
Copilot AI commented on Jan 29, 2026:
The `precompute_hashes` path assumes that `self.dataset` is a Hugging Face `datasets.Dataset` (or at least has a `.map` method), but this is not validated up front. If a user constructs `NoDuplicatesBatchSampler` with `precompute_hashes=True` on a non-HF dataset, `self.dataset.map(...)` will raise an `AttributeError` outside the try/except block, resulting in an unhelpful error. Consider adding an explicit type/feature check (e.g. `hasattr(self.dataset, "map")`) and raising a clear `ValueError` explaining that `precompute_hashes=True` requires a Hugging Face `Dataset` (or an object exposing a compatible `.map` API).
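A short sketch of the kind of guard being proposed here (hypothetical placement inside `_build_hashes`; see the maintainer's reply below for why it was ultimately deemed unnecessary):

```python
# Hypothetical early validation, roughly as Copilot suggests above.
if self.precompute_hashes and not hasattr(self.dataset, "map"):
    raise ValueError(
        "precompute_hashes=True requires a Hugging Face datasets.Dataset "
        "(or an object exposing a compatible .map API)."
    )
```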
Unimportant; datasets can safely be assumed to be a datasets.Dataset
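For readers following the diff above, a minimal sketch of a batched hashing function that would be compatible with the `datasets.map(...)` call in `_build_hashes` (hypothetical; the PR's actual `_hash_batch` implementation may differ):

```python
import xxhash


def _hash_batch(batch: dict, columns: list[str], exclude_columns: set[str]) -> dict:
    # Hash every relevant column value of every row in the batch to a 64-bit integer.
    hash_columns = [col for col in columns if col not in exclude_columns]
    num_rows = len(batch[hash_columns[0]])
    hashes = [
        [xxhash.xxh64_intdigest(str(batch[col][row]).encode("utf-8")) for col in hash_columns]
        for row in range(num_rows)
    ]
    return {"hashes": hashes}
```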
The current `NoDuplicatesBatchSampler` can become significantly slow when working with datasets that have many duplicate values across `query` / `positive` / `negatives` columns, especially with large batch sizes (e.g., bs=8192). This is particularly noticeable with triplet or hard-negatives data.

Summary of Changes

This PR adds `NoDuplicatesFastBatchSampler`, which speeds up duplicate checking by pre-computing xxhash 64-bit values for each sample using `datasets.map()`. It maintains the same batch construction policy as `NoDuplicatesBatchSampler` (avoiding duplicates within a batch) while significantly improving performance.

Since this approach increases memory usage, both options are provided:

- `NO_DUPLICATES`: Existing sampler (memory-efficient)
- `NO_DUPLICATES_FAST`: New sampler (faster, but uses more memory)

Benchmarks (MS MARCO)

Benchmarked using the following Hugging Face datasets:

- `sentence-transformers/msmarco-co-condenser-margin-mse-sym-mnrl-mean-v1` / `triplet-hard`
- `sentence-transformers/msmarco-co-condenser-margin-mse-sym-mnrl-mean-v1` / `triplet-50`

Conditions
- Batch sizes: `128` and `8192`
- `num_proc=8`
- Progress bar disabled (`--no-progress-bar`)

The table below summarizes execution time, memory usage, and batch counts. Memory is measured using USS (Unique Set Size). The fast sampler stores hash values as NumPy int64 arrays, which accounts for the increased memory usage. The original `NO_DUPLICATES` checks values on-the-fly and does not increase memory usage.

Environment: Ryzen 9 7950 (`num_proc=8`), Ubuntu 24

Memory Considerations
This implementation stores hash values as `int64` NumPy ndarrays, which increases memory usage compared to the current `NoDuplicatesBatchSampler`.

For reference, using `sentence-transformers/msmarco-co-condenser-margin-mse-sym-mnrl-mean-v1`:

- `triplet-50` (503,302 rows): ~200 MiB additional memory
- `triplet-hard` (11,662,655 rows): ~314 MiB additional memory

Therefore, users can choose between:

- `NO_DUPLICATES`: Memory-efficient (existing)
- `NO_DUPLICATES_FAST`: Faster (new)

How It Works
1. Use `datasets.map()` to retrieve all values from the `query` / `positive` / `negatives` columns and pre-compute a 64-bit xxhash value for each one.
2. In `__iter__`, use the resulting hash arrays for fast duplicate checking while constructing batches (a simplified sketch follows).
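A simplified, hypothetical sketch of step 2 (the PR's actual `__iter__` handles shuffling, epochs, and leftover rows with more care; this only illustrates checking precomputed hashes against the current batch):

```python
import numpy as np


def iter_no_duplicate_batches(row_hashes: np.ndarray, batch_size: int, seed: int = 0):
    """row_hashes: (num_rows, k) int64 matrix with one 64-bit hash per value per row."""
    rng = np.random.default_rng(seed)
    remaining = list(rng.permutation(len(row_hashes)))
    while len(remaining) >= batch_size:
        batch, seen, skipped = [], set(), []
        for idx in remaining:
            hashes = row_hashes[idx].tolist()
            if seen.intersection(hashes):
                skipped.append(idx)        # would duplicate a value already in this batch
                continue
            batch.append(idx)
            seen.update(hashes)
            if len(batch) == batch_size:
                break
        if len(batch) < batch_size:        # cannot fill another duplicate-free batch
            break
        yield batch
        # Skipped rows and anything not yet visited are carried over to later batches.
        remaining = skipped + remaining[remaining.index(idx) + 1:]
```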
Implementation Notes

- Uses `datasets.map(..., num_proc=N)` for speed.

Benchmark Commands
Feedback and suggestions are appreciated!