
Evaluate Compression Methods on Retrieval Tasks#3950

Open
casparil wants to merge 3 commits into embeddings-benchmark:main from casparil:main

Conversation

@casparil

@casparil casparil commented Jan 16, 2026

This PR adds the option to evaluate MTEB tasks on compressed embeddings (see Issue #3949).

What the code does

  • Introduces a new (command-line) parameter that, when set, evaluates performance at different quantization levels (float8, int8, int4, and binary).
  • The model is then wrapped in a CompressionWrapper class that handles compression.
  • Embeddings are computed as usual, compressed per batch, and then evaluated as usual (a rough sketch follows this list).
  • Stores the result JSON file in a folder named after the model and compression level.
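A rough sketch of the per-batch compression step described above, using plain NumPy; quantize_batch is an illustrative name and not the PR's actual implementation, and float8 is omitted because NumPy has no native float8 dtype:

import numpy as np

def quantize_batch(embeddings: np.ndarray, level: str) -> np.ndarray:
    """Hypothetical per-batch compression; the PR's CompressionWrapper may differ in detail."""
    if level == "binary":
        # Keep only the sign of each dimension, packed to 1 bit per value.
        return np.packbits(embeddings > 0, axis=-1)
    if level in ("int8", "int4"):
        # Simple symmetric linear quantization using per-dimension maxima of this batch.
        max_int = 127 if level == "int8" else 7
        scale = np.abs(embeddings).max(axis=0, keepdims=True) / max_int
        scale[scale == 0] = 1.0
        return np.clip(np.round(embeddings / scale), -max_int, max_int).astype(np.int8)
    raise ValueError(f"Unsupported compression level: {level}")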

@Samoed
Member

Samoed commented Jan 16, 2026

I think it would be better to add support for any model, not just retrieval, by creating a wrapper for encoder models, like the cache wrapper

class CachedEmbeddingWrapper:

rather than creating a specific implementation for retrieval models.

@KennethEnevoldsen
Contributor

KennethEnevoldsen commented Jan 19, 2026

Thanks for the outline! It's always great to have something to work from.

If we want to have these displayed as different models, I am probably leaning in a slightly different direction (moving the compression logic to the model):

int8_mdl = CompressionWrapper(model, dtype="int8")
# also manipulates the task metadata such that the name is "{name} (dtype=int8)"

res = mteb.evaluate(int8_mdl, task)

This is of course inefficient, which is why I would probably use:

cached_mdl = CachedEmbeddingWrapper(model)
int8_mdl = CompressionWrapper(cached_mdl, dtype="int8")
# also manipulates the task metadata such that the name is "{name} (dtype=int8)"

res = mteb.evaluate(int8_mdl, task)

This would allow for fast iteration over different compression levels without any changes to the core evaluation loop, and it would automatically make the approach applicable to any task within MTEB.
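A sketch of the iteration pattern this enables, following the wrapper names and the mteb.evaluate call from the snippets above (exact signatures are assumptions at this point in the discussion, and the model and task are arbitrary examples):

import mteb

model = mteb.get_model("sentence-transformers/all-MiniLM-L6-v2")
task = mteb.get_task("NFCorpus")

# Embeddings are computed once by the cached wrapper and reused for every compression level.
cached_mdl = CachedEmbeddingWrapper(model)

results = {}
for dtype in ("float8", "int8", "int4", "binary"):
    compressed_mdl = CompressionWrapper(cached_mdl, dtype=dtype)
    results[dtype] = mteb.evaluate(compressed_mdl, task)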

@Samoed
Member

Samoed commented Jan 19, 2026

This is of course inefficient, which is why I would probably use:

I mentioned CachedEmbeddingWrapper as an example of a wrapper that is a better approach to integration. I don't think it is required to use both of them.

If we want to have these displayed as different models, I am probably leaning in a slightly different direction (moving the compression logic to the model)

I don't think we should move this logic to the model. I think this can be solved by #1211.

@KennethEnevoldsen
Contributor

KennethEnevoldsen commented Jan 19, 2026

I don't think we should move this logic to the model. I think this can be solved by #1211.

But then we have to implement compression metrics for all tasks? (Maybe that is easier, but our results are already quite big.)

The issue would also be solved by my suggestion, though I am not sure it is the best approach.

@Samoed
Member

Samoed commented Jan 19, 2026

But then we have to implement compression metrics for all tasks? (Maybe that is easier, but our results are already quite big.)

I don't think we need to implement additional metrics for them. We can measure the same metrics, but on quantized embeddings.

Your approach is similar to #1211 overall. Maybe I misunderstood the part about moving the compression logic to the model; from your comment it seems this is just a wrapper around our implementations, and I don't think we need to create separate model instances for this.

@KennethEnevoldsen
Contributor

Yes, the compression would just happen in the wrapper (so no need to create a new model), but for models that require it, we could create their own custom wrappers.
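A minimal sketch of that idea: the wrapper only post-processes the output of encode() and delegates everything else to the wrapped model. Names and quantization details are illustrative, not the PR's final implementation:

import numpy as np

class CompressionWrapper:
    """Compress embeddings on the way out; leave the wrapped model untouched otherwise."""

    def __init__(self, model, dtype: str = "int8"):
        self.model = model
        self.dtype = dtype

    def encode(self, sentences, **kwargs) -> np.ndarray:
        embeddings = np.asarray(self.model.encode(sentences, **kwargs))
        if self.dtype == "binary":
            return np.packbits(embeddings > 0, axis=-1)
        # int8: symmetric linear quantization per dimension.
        scale = np.abs(embeddings).max(axis=0, keepdims=True) / 127.0
        scale[scale == 0] = 1.0
        return np.clip(np.round(embeddings / scale), -127, 127).astype(np.int8)

    def __getattr__(self, name):
        # Model metadata, prompts, etc. fall through to the wrapped model,
        # so custom per-model wrappers only need to override what they change.
        return getattr(self.model, name)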

@Samoed
Member

Samoed commented Jan 19, 2026

Yes, agree

@KennethEnevoldsen
Contributor

Potentially raise a warning for cases like:

mdl = mteb.get_model("voyageai/voyage-3.5")
int8_mdl = CompressionWrapper(mdl, dtype="int8")
# Warning: The model `voyageai/voyage-3.5 (output_dtype=int8)` already exists. The model name will be set to `voyageai/voyage-3.5 (output_dtype=int8*)` to avoid conflicts.
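One possible way to implement that check, written against a hypothetical set of already-registered model names (the real lookup would go through mteb's model registry):

import warnings

def resolve_model_name(base_name: str, dtype: str, existing_names: set[str]) -> str:
    """Hypothetical helper: append '*' until the derived name no longer collides."""
    name = f"{base_name} (output_dtype={dtype})"
    while name in existing_names:
        warnings.warn(
            f"The model `{name}` already exists. "
            f"The model name will be set to `{name}*` to avoid conflicts."
        )
        name = f"{name}*"
    return name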

@casparil
Author

Thanks to both of you for the feedback!

Using a wrapper class as you suggested sounds like a better approach. We'll update the code accordingly and try to integrate your comments.

search_model = model
from mteb.models import CachedEmbeddingWrapper, CompressionWrapper

if isinstance(model, CompressionWrapper):
Member

You don't need to add CompressionWrapper to the retrieval evaluator. You can wrap the model and then pass it to mteb without changing the evaluation code.

Author

The model is wrapped in the CompressionWrapper class before mteb.evaluate() is called, so it now works with other task types as well. There's just this specific code snippet during retrieval evaluation that performs the following checks:

if isinstance(model, EncoderProtocol) and not isinstance(model, SearchProtocol):
    return SearchEncoderWrapper(model)
elif isinstance(model, CrossEncoderProtocol):
    return SearchCrossEncoderWrapper(model)
elif isinstance(model, SearchProtocol):
    return model
else:
    raise TypeError(
        f"RetrievalEvaluator expects a SearchInterface, Encoder, or CrossEncoder, got {type(model)}"
    )

Because the model is wrapped in the CompressionWrapper class, this check would raise the error, so I've adapted the code accordingly. If you prefer to handle this differently, I'm open to suggestions.

Comment on lines +101 to +120
if prompt_type == PromptType.query and task_metadata.category in [
    "t2i",
    "i2t",
    "it2i",
    "i2it",
]:
    # With multimodal tasks, always quantize text and image embeddings separately
    logger.info(f"Quantizing query embeddings to {self._quantization_level}")
    return self._quantize_embeddings(embeddings, PromptType.document)
elif prompt_type == PromptType.query and self._quantization_level in [
    "int8",
    "int4",
]:
    # Otherwise, compute thresholds for int8/int4 quantization on documents first, then apply them on queries
    logger.info("Query embeddings will be quantized on similarity calculation.")
    self.query_embeds = embeddings
    return embeddings
else:
    logger.info(f"Quantizing embeddings to {self._quantization_level}")
    return self._quantize_embeddings(embeddings, prompt_type)
Member

Why not always quantize embeddings?

Author

In a lot of datasets, the number of queries is relatively small, while the number of documents is much larger. For integer quantization, we need to estimate the thresholds that decide which range of floating-point values is mapped to which integer. Estimating these thresholds from a relatively small number of embeddings might lead to a poor estimate, so we first compute them on the larger set of document embeddings and then apply them to the queries. This also ensures that both queries and documents are quantized using the same thresholds.
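A small NumPy sketch of that calibration scheme, assuming per-dimension min/max thresholds (the PR's _quantize_embeddings may differ in the details; the data here is dummy data standing in for real embeddings):

import numpy as np

def int8_thresholds(doc_embeddings: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """Estimate per-dimension quantization ranges on the (large) document set."""
    return doc_embeddings.min(axis=0), doc_embeddings.max(axis=0)

def apply_int8(embeddings: np.ndarray, lo: np.ndarray, hi: np.ndarray) -> np.ndarray:
    """Map values in [lo, hi] to [-128, 127] using the shared thresholds."""
    scale = np.where(hi > lo, (hi - lo) / 255.0, 1.0)
    return np.clip(np.round((embeddings - lo) / scale) - 128, -128, 127).astype(np.int8)

# Dummy corpus and query embeddings.
rng = np.random.default_rng(0)
doc_embeddings = rng.normal(size=(10_000, 384)).astype(np.float32)  # many documents
query_embeddings = rng.normal(size=(50, 384)).astype(np.float32)    # few queries

# Thresholds come from the documents and are reused for the queries,
# so both sides land on the same integer grid.
lo, hi = int8_thresholds(doc_embeddings)
doc_int8 = apply_int8(doc_embeddings, lo, hi)
query_int8 = apply_int8(query_embeddings, lo, hi)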

@casparil casparil marked this pull request as ready for review February 6, 2026 08:24
