
Evaluate Compression Methods on Retrieval Tasks#3950

Open
casparil wants to merge 3 commits into embeddings-benchmark:main from casparil:main

Conversation

@casparil

@casparil casparil commented Jan 16, 2026

This PR adds the option to evaluate MTEB tasks on compressed embeddings (see Issue #3949).

What the code does

  • Introduces a new (command-line) parameter that, when set, evaluates performance at different quantization levels (float8, int8, int4, and binary).
  • The model is then wrapped in a CompressionWrapper class that handles compression.
  • Embeddings are computed as usual, compressed per batch, and then evaluated as usual (a rough sketch follows this list).
  • Stores the result JSON file in a folder named after the model and compression level.
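A rough sketch of the per-batch compression step described above, using plain NumPy; quantize_batch is an illustrative name and not the PR's actual implementation, and float8 is omitted because NumPy has no native float8 dtype:

import numpy as np

def quantize_batch(embeddings: np.ndarray, level: str) -> np.ndarray:
    """Hypothetical per-batch compression; the PR's CompressionWrapper may differ in detail."""
    if level == "binary":
        # Keep only the sign of each dimension, packed to 1 bit per value.
        return np.packbits(embeddings > 0, axis=-1)
    if level in ("int8", "int4"):
        # Simple symmetric linear quantization using per-dimension maxima of this batch.
        max_int = 127 if level == "int8" else 7
        scale = np.abs(embeddings).max(axis=0, keepdims=True) / max_int
        scale[scale == 0] = 1.0
        return np.clip(np.round(embeddings / scale), -max_int, max_int).astype(np.int8)
    raise ValueError(f"Unsupported compression level: {level}")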

@Samoed
Member

Samoed commented Jan 16, 2026

I think it would be better to add support for any model, not just retrieval, by creating a wrapper for encoder models, like the cache wrapper

class CachedEmbeddingWrapper:

rather than creating a specific implementation for retrieval models.

@KennethEnevoldsen
Contributor

KennethEnevoldsen commented Jan 19, 2026

Thanks for the outline! It's always great to have something to work from.

If we want to have these displayed as different models, I am probably leaning in a slightly different direction (moving the compression logic to the model):

int8_mdl = CompressionWrapper(model, dtype="int8")
# also manipulates the task metadata such that the name is "{name} (dtype=int8)"

res = mteb.evaluate(int8_mdl, task)

This is of course inefficient, which is why I would probably use:

cached_mdl = CachedEmbeddingWrapper(model)
int8_mdl = CompressionWrapper(cached_mdl, dtype="int8")
# also manipulates the task metadata such that the name is "{name} (dtype=int8)"

res = mteb.evaluate(int8_mdl, task)

This would allow for fast iteration over different compression levels without any changes to the core evaluation loop, and it would automatically make the approach applicable to any task within MTEB.
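A sketch of the iteration pattern this enables, following the wrapper names and the mteb.evaluate call from the snippets above (exact signatures are assumptions at this point in the discussion, and the model and task are arbitrary examples):

import mteb

model = mteb.get_model("sentence-transformers/all-MiniLM-L6-v2")
task = mteb.get_task("NFCorpus")

# Embeddings are computed once by the cached wrapper and reused for every compression level.
cached_mdl = CachedEmbeddingWrapper(model)

results = {}
for dtype in ("float8", "int8", "int4", "binary"):
    compressed_mdl = CompressionWrapper(cached_mdl, dtype=dtype)
    results[dtype] = mteb.evaluate(compressed_mdl, task)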

@Samoed
Member

Samoed commented Jan 19, 2026

This is of course inefficient, which is why I would probably use:

I mentioned CachedEmbeddingWrapper as an example of a wrapper that is a better approach to integration. I don't think it is required to use both of them.

If we want to have these displayed as different models, I am probably leaning in a slightly different direction (moving the compression logic to the model)

I don't think we should move this logic to the model. I think this can be solved by #1211.

@KennethEnevoldsen
Contributor

KennethEnevoldsen commented Jan 19, 2026

I don't think we should move this logic to the model. I think this can be solved by #1211.

But then we have to implement compression metrics for all tasks? (Maybe that is easier, but our results are already quite big.)

The issue would also be solved by my suggestion, though I am not sure it is the best approach.

@Samoed
Member

Samoed commented Jan 19, 2026

But then we have to implement compression metrics for all tasks? (Maybe that is easier, but our results are already quite big.)

I don't think we need to implement additional metrics for them. We can measure the same metrics, but on quantized embeddings.

Your approach is similar to #1211 overall. Maybe I misunderstood the part about moving the compression logic to the model; from your comment it seems this is just a wrapper around our implementations, and I don't think we need to create separate model instances for this.

@KennethEnevoldsen
Contributor

Yes, the compression would just happen in the wrapper (so no need to create a new model), but for models that require it, we could create their own custom wrappers.
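A minimal sketch of that idea: the wrapper only post-processes the output of encode() and delegates everything else to the wrapped model. Names and quantization details are illustrative, not the PR's final implementation:

import numpy as np

class CompressionWrapper:
    """Compress embeddings on the way out; leave the wrapped model untouched otherwise."""

    def __init__(self, model, dtype: str = "int8"):
        self.model = model
        self.dtype = dtype

    def encode(self, sentences, **kwargs) -> np.ndarray:
        embeddings = np.asarray(self.model.encode(sentences, **kwargs))
        if self.dtype == "binary":
            return np.packbits(embeddings > 0, axis=-1)
        # int8: symmetric linear quantization per dimension.
        scale = np.abs(embeddings).max(axis=0, keepdims=True) / 127.0
        scale[scale == 0] = 1.0
        return np.clip(np.round(embeddings / scale), -127, 127).astype(np.int8)

    def __getattr__(self, name):
        # Model metadata, prompts, etc. fall through to the wrapped model,
        # so custom per-model wrappers only need to override what they change.
        return getattr(self.model, name)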

@Samoed
Member

Samoed commented Jan 19, 2026

Yes, agree

@KennethEnevoldsen
Contributor

Potentially raise a warning for cases like:

mdl = mteb.get_model("voyageai/voyage-3.5")
int8_mdl = CompressionWrapper(mdl, dtype="int8")
# Warning: The model `voyageai/voyage-3.5 (output_dtype=int8)` already exists. The model name will be set to `voyageai/voyage-3.5 (output_dtype=int8*)` to avoid conflicts.
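One possible way to implement that check, written against a hypothetical set of already-registered model names (the real lookup would go through mteb's model registry):

import warnings

def resolve_model_name(base_name: str, dtype: str, existing_names: set[str]) -> str:
    """Hypothetical helper: append '*' until the derived name no longer collides."""
    name = f"{base_name} (output_dtype={dtype})"
    while name in existing_names:
        warnings.warn(
            f"The model `{name}` already exists. "
            f"The model name will be set to `{name}*` to avoid conflicts."
        )
        name = f"{name}*"
    return name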

@casparil
Author

Thanks to both of you for the feedback!

Using a wrapper class as you suggested sounds like a better approach. We'll update the code accordingly and try to integrate your comments.

search_model = model
from mteb.models import CachedEmbeddingWrapper, CompressionWrapper

if isinstance(model, CompressionWrapper):
Member

You don't need to add CompressionWrapper to the retrieval evaluator. You can wrap the model and then pass it to mteb without changing the evaluation code.

Author

The model is wrapped in the CompressionWrapper class before mteb.evaluate() is called, so it now works with other task types as well. There's just this specific code snippet during retrieval evaluation that performs the following checks:

if isinstance(model, EncoderProtocol) and not isinstance(model, SearchProtocol):
    return SearchEncoderWrapper(model)
elif isinstance(model, CrossEncoderProtocol):
    return SearchCrossEncoderWrapper(model)
elif isinstance(model, SearchProtocol):
    return model
else:
    raise TypeError(
        f"RetrievalEvaluator expects a SearchInterface, Encoder, or CrossEncoder, got {type(model)}"
    )

Because the model is wrapped in the CompressionWrapper class, this check would raise the error, so I've adapted the code accordingly. If you prefer to handle this differently, I'm open to suggestions.

Comment on lines +101 to +120
if prompt_type == PromptType.query and task_metadata.category in [
    "t2i",
    "i2t",
    "it2i",
    "i2it",
]:
    # With multimodal tasks, always quantize text and image embeddings separately
    logger.info(f"Quantizing query embeddings to {self._quantization_level}")
    return self._quantize_embeddings(embeddings, PromptType.document)
elif prompt_type == PromptType.query and self._quantization_level in [
    "int8",
    "int4",
]:
    # Otherwise, compute thresholds for int8/int4 quantization on documents first, then apply them on queries
    logger.info("Query embeddings will be quantized on similarity calculation.")
    self.query_embeds = embeddings
    return embeddings
else:
    logger.info(f"Quantizing embeddings to {self._quantization_level}")
    return self._quantize_embeddings(embeddings, prompt_type)
Member

Why not always quantize embeddings?

Author

In a lot of datasets, the number of queries is relatively small, while the number of documents is much larger. For integer quantization, we need to estimate the thresholds that decide which range of floating-point values is mapped to which integer. Estimating these thresholds from a relatively small number of embeddings might lead to a poor estimate, so we first compute them on the larger set of document embeddings and then apply them to the queries. This also ensures that both queries and documents are quantized using the same thresholds.
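A small NumPy sketch of that calibration scheme, assuming per-dimension min/max thresholds (the PR's _quantize_embeddings may differ in the details; the data here is dummy data standing in for real embeddings):

import numpy as np

def int8_thresholds(doc_embeddings: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """Estimate per-dimension quantization ranges on the (large) document set."""
    return doc_embeddings.min(axis=0), doc_embeddings.max(axis=0)

def apply_int8(embeddings: np.ndarray, lo: np.ndarray, hi: np.ndarray) -> np.ndarray:
    """Map values in [lo, hi] to [-128, 127] using the shared thresholds."""
    scale = np.where(hi > lo, (hi - lo) / 255.0, 1.0)
    return np.clip(np.round((embeddings - lo) / scale) - 128, -128, 127).astype(np.int8)

# Dummy corpus and query embeddings.
rng = np.random.default_rng(0)
doc_embeddings = rng.normal(size=(10_000, 384)).astype(np.float32)  # many documents
query_embeddings = rng.normal(size=(50, 384)).astype(np.float32)    # few queries

# Thresholds come from the documents and are reused for the queries,
# so both sides land on the same integer grid.
lo, hi = int8_thresholds(doc_embeddings)
doc_int8 = apply_int8(doc_embeddings, lo, hi)
query_int8 = apply_int8(query_embeddings, lo, hi)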

@casparil casparil marked this pull request as ready for review February 6, 2026 08:24
