Evaluate Compression Methods on Retrieval Tasks #3950
casparil wants to merge 3 commits into embeddings-benchmark:main
Conversation
I think it would be better to add support for any model, not just retrieval, by creating a wrapper for the encoder model (like the cache wrapper) rather than a retrieval-specific implementation.
Thanks for the outline! It's always great to have something to work from. If we want to have these displayed as different models, I am probably leaning in a slightly different direction (moving the compression logic to the model):

int8_mdl = CompressionWrapper(model, dtype="int8")
# also manipulates the task metadata such that the name is "{name} (dtype=int8)"
res = mteb.evaluate(int8_mdl, task)

This is of course inefficient, which is why I would probably use:

cached_mdl = CachedEmbeddingWrapper(model)
int8_mdl = CompressionWrapper(cached_mdl, dtype="int8")
# also manipulates the task metadata such that the name is "{name} (dtype=int8)"
res = mteb.evaluate(int8_mdl, task)

This would allow for fast iteration over different compression levels without any changes to the
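For illustration, a minimal sketch of what such a wrapper could look like. The class name follows the thread, but the constructor arguments, the encode signature, and the dtype handling below are assumptions, not the PR's actual implementation (the task-metadata renaming mentioned above is omitted here).

import numpy as np

class CompressionWrapper:
    # Hypothetical sketch: wrap an encoder and cast its embeddings to a lower precision.
    def __init__(self, model, dtype: str = "float16"):
        self.model = model   # wrapped encoder, assumed to expose .encode()
        self.dtype = dtype   # target precision, e.g. "float16" or "int8"

    def encode(self, sentences, **kwargs):
        embeddings = np.asarray(self.model.encode(sentences, **kwargs), dtype=np.float32)
        if self.dtype == "float16":
            return embeddings.astype(np.float16)
        if self.dtype == "int8":
            # naive per-dimension linear quantization to int8 (illustrative only)
            scale = np.maximum(np.abs(embeddings).max(axis=0, keepdims=True) / 127.0, 1e-12)
            return np.clip(np.round(embeddings / scale), -128, 127).astype(np.int8)
        return embeddings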
I've mentioned
I don't think that we should move this logic to the model. I think this can be solved by #1211
But then we have to implement compression metrics for all tasks? (Maybe that is easier, but our results are already quite big.) The issue would also be solved by my suggestion, though (though I am not sure it is the best approach).
I don't think we need to implement additional metrics for them. We can measure the same metrics, but on a quantized embedding. Your approach is similar to #1211 overall. Maybe I misunderstood part of
Yea, the compression would just happen in the wrapper (so no need to create a new wrapper), but for models that require it, we could create their own custom wrappers.
Yes, agree
Potentially raise a warning for cases like:
Thanks to both of you for the feedback! Using a wrapper class as you suggested sounds like a better approach. We'll update the code accordingly and try to integrate your comments.
search_model = model
from mteb.models import CachedEmbeddingWrapper, CompressionWrapper

if isinstance(model, CompressionWrapper):
You don't need to add CompressionWrapper to the retrieval evaluator. You can wrap the model and then pass it to mteb without changing the evaluation code.
The model is wrapped in the CompressionWrapper class before mteb.evaluate() is called, so it now works with other task types as well. There's just this specific code snippet during retrieval evaluation that performs the following checks:
if isinstance(model, EncoderProtocol) and not isinstance(model, SearchProtocol):
    return SearchEncoderWrapper(model)
elif isinstance(model, CrossEncoderProtocol):
    return SearchCrossEncoderWrapper(model)
elif isinstance(model, SearchProtocol):
    return model
else:
    raise TypeError(
        f"RetrievalEvaluator expects a SearchInterface, Encoder, or CrossEncoder, got {type(model)}"
    )
As the model is wrapped in the CompressionWrapper class, this check would raise the TypeError, so I've adapted the code accordingly. If you prefer to handle this differently, I'm open to suggestions.
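As a sketch of one possible adaptation (not necessarily the PR's actual change): the dispatch could unwrap the compression wrapper before the protocol checks. The .model attribute below is an assumption about how the wrapper stores the inner encoder.

# hypothetical: unwrap before dispatching on the underlying model's protocol
if isinstance(model, CompressionWrapper):
    inner_model = model.model  # assumed attribute holding the wrapped encoder
    if isinstance(inner_model, EncoderProtocol) and not isinstance(inner_model, SearchProtocol):
        # keep the outer wrapper so search still runs on compressed embeddings
        return SearchEncoderWrapper(model)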
if prompt_type == PromptType.query and task_metadata.category in [
    "t2i",
    "i2t",
    "it2i",
    "i2it",
]:
    # With multimodal tasks, always quantize text and image embeddings separately
    logger.info(f"Quantizing query embeddings to {self._quantization_level}")
    return self._quantize_embeddings(embeddings, PromptType.document)
elif prompt_type == PromptType.query and self._quantization_level in [
    "int8",
    "int4",
]:
    # Otherwise, compute thresholds for int8/int4 quantization on documents first, then apply them on queries
    logger.info("Query embeddings will be quantized on similarity calculation.")
    self.query_embeds = embeddings
    return embeddings
else:
    logger.info(f"Quantizing embeddings to {self._quantization_level}")
    return self._quantize_embeddings(embeddings, prompt_type)
Why not always quantize embeddings?
In a lot of datasets, the number of queries is relatively small, while the number of documents is much larger. For integer quantization, we need to estimate the thresholds that decide which range of floating points is mapped to which integer. Applying this to a relatively small number of embeddings might lead to a bad estimation, so we first compute those thresholds on the larger number of documents, then apply the thresholds to the queries. This also ensures that both queries and documents are quantized using the same thresholds.
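As a rough sketch of the idea in plain NumPy (not the PR's code, and with stand-in random embeddings): per-dimension thresholds are estimated from the document embeddings and then reused for the queries, so both sides share the same quantization grid.

import numpy as np

rng = np.random.default_rng(0)
doc_embeds = rng.normal(size=(10_000, 384)).astype(np.float32)   # stand-in document embeddings
query_embeds = rng.normal(size=(100, 384)).astype(np.float32)    # stand-in query embeddings

def calibrate_int8(doc_embeddings):
    # estimate per-dimension thresholds on the (large) document set
    lo = doc_embeddings.min(axis=0)
    hi = doc_embeddings.max(axis=0)
    scale = np.maximum((hi - lo) / 255.0, 1e-12)
    return lo, scale

def quantize_int8(embeddings, lo, scale):
    # map values onto the shared grid, then shift into the int8 range
    q = np.round((embeddings - lo) / scale) - 128
    return np.clip(q, -128, 127).astype(np.int8)

lo, scale = calibrate_int8(doc_embeds)               # thresholds come from documents only
doc_q = quantize_int8(doc_embeds, lo, scale)
query_q = quantize_int8(query_embeds, lo, scale)     # queries reuse the same thresholds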
This PR adds the option to evaluate MTEB tasks on compressed embeddings (see Issue #3949).
What the code does
CompressionWrapper class that handles compression.
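A hedged usage sketch based on the snippets in this thread; the model and task names are illustrative, and the constructor argument (dtype here, following the discussion above) and the evaluate call may differ in the final PR.

import mteb
from mteb.models import CompressionWrapper

model = mteb.get_model("sentence-transformers/all-MiniLM-L6-v2")  # illustrative encoder
tasks = mteb.get_tasks(tasks=["NFCorpus"])                        # illustrative retrieval task

int8_model = CompressionWrapper(model, dtype="int8")  # argument name taken from the thread; may differ in the PR
results = mteb.evaluate(int8_model, tasks)

As discussed above, the wrapper could also be combined with CachedEmbeddingWrapper to iterate over several compression levels without re-encoding.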