[feat] Introduce cross-modality and multi-modality support; modularize CrossEncoder class #3554
tomaarsen wants to merge 59 commits into huggingface:main from
Conversation
coreintelligence: Question: In addition to Qwen/Qwen3-Embedding-#B, would the rerankers be supported as an initializer to …

tomaarsen (Member, Author):
@coreintelligence Yes, that is the intention. The 'Qwen3-Reranker-#B' rerankers are a new style of reranker based on CausalLM models with specific templates, where the model's scores for specific tokens (e.g. "yes", "no", "1", "0") are used to compute a relevance score. Other examples are https://huggingface.co/mixedbread-ai/mxbai-rerank-large-v2 or https://huggingface.co/ContextualAI/ctxl-rerank-v2-instruct-multilingual-1b.
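To make that concrete, here is a minimal sketch of the yes/no-token scoring idea using plain `transformers`. The prompt template below is illustrative only; real rerankers like the ones linked above ship their own required templates and token choices.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Any CausalLM-based reranker works here; mxbai-rerank-large-v2 is one of
# the examples linked above.
model_name = "mixedbread-ai/mxbai-rerank-large-v2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Illustrative template: the real model defines its own prompt format
prompt = (
    "Query: what is a panda?\n"
    "Document: The giant panda is a bear native to China.\n"
    "Is the document relevant to the query? Answer yes or no: "
)
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits[0, -1]  # logits for the next token

yes_id = tokenizer.convert_tokens_to_ids("yes")
no_id = tokenizer.convert_tokens_to_ids("no")
# Relevance score: probability mass on "yes" vs. "no" as the next token
score = torch.softmax(logits[[yes_id, no_id]], dim=0)[0].item()
print(score)
```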
…rclasses for all 3 archetypes
This was referenced Dec 4, 2025
tomaarsen added a commit to omkar-334/sentence-transformers that referenced this pull request on Dec 5, 2025:
This will align better with my goals for the big refactor of huggingface#3554, where these methods will be called _multi_process and _multi_process_worker
tomaarsen added a commit that referenced this pull request on Dec 5, 2025:
* Add multiprocessing support for Cross Encoder
* Rename _predict_multi_process... -> _multi_process... This will align better with my goals for the big refactor of #3554, where these methods will be called _multi_process and _multi_process_worker
* Add test suite for multi-processing, mirroring the test suite for ST models
* Reorder kwargs to match ST
* Change how device is determined, matching ST
* Add device, pool, and chunk_size to other predict typings
* Upgrade with multi-gpu reranking
* Update test_hard_negatives test, simplify mine_hard_negatives slightly

Co-authored-by: Tom Aarsen <Cubiegamedev@gmail.com>
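Based on that commit message, a rough usage sketch of the new multi-process prediction (parameter names taken from the commit, mirroring `SentenceTransformer.encode`'s multi-process support; treat this as an approximation, not the final API):

```python
from sentence_transformers import CrossEncoder

model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
pairs = [
    ("What is a panda?", "The giant panda is a bear native to China."),
    ("What is a panda?", "Paris is the capital of France."),
]

# Per the commit, predict() gains device, pool, and chunk_size; passing
# multiple devices spreads reranking across GPUs.
scores = model.predict(pairs, device=["cuda:0", "cuda:1"], chunk_size=256)
print(scores)
```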
Also update util API Reference, add missing backend_export_sidebar to efficiency.rst; move similarity_function into similarity
…e MLMTransformer from docs
Hello!
Pull Request overview
- Introduce cross-modality and multi-modality support in the `SentenceTransformer`, `Router`, and `Transformer` classes
- Modularize the `CrossEncoder` class, initially by subclassing `SentenceTransformer`, but long term I want to subclass a new superclass

Details
This pull request is very much a work-in-progress, although it is already functional. In short:
- `Transformer` now works with an `AutoProcessor` and handles inputs through that. This accepts multiple modalities.
- `SentenceTransformer`, `Transformer` and `Router` check the modality of inputs; only one modality is allowed per inference.
- `Router` has been adapted to allow for modality-based routing.
- Modalities are specified like `["text", "image"]` and `[("text", "image")]`. The former is cross-modal, i.e. you can pass either text or images, and you can then compare the embeddings across the modalities. The latter is multi-modal, i.e. you can pass text AND images at the same time, and this joint input results in one embedding output. The "one input in, one embedding out" is a core feature.
- Multi-modal inputs are passed as dictionaries, e.g. `model.encode([{"text": "This is my <image>", "image": "cat.jpg"}, ...])` (see the sketch after this list).
- `Transformer` is designed to be somewhat flexible moving forward. Model authors can specify which modalities are supported, which methods on the `AutoModel` need to be called, and which output keys need to be used from the outputs of those methods. The goal is to have strong defaulting as well.
- `model.modalities` gives a list of supported modalities, e.g. `SentenceTransformer("laion/clap-htsat-unfused").modalities` is `['text', 'audio']`.
- Supported modalities are `text`, `image`, `audio`, `video`, combinations of the previous, and `message`. The latter uses `processor.apply_chat_template`.
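As a rough illustration of the API described above (a sketch only: this PR is still in flux, and `clip-ViT-B-32` merely stands in for any cross-modal model):

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/clip-ViT-B-32")
print(model.modalities)  # e.g. ['text', 'image'] under this PR

# Cross-modal: one modality per encode() call; the resulting embeddings
# share one space and can be compared across calls
text_emb = model.encode(["a photo of a cat"])
image_emb = model.encode(["cat.jpg"])  # image path or PIL.Image
print(model.similarity(text_emb, image_emb))

# Multi-modal: one joint text+image input -> one embedding
joint_emb = model.encode([{"text": "This is my <image>", "image": "cat.jpg"}])
```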
Here are two cross-modal models that I trained:

This is an incomplete list of models that you can simply initialize with `SentenceTransformer(model_name)`:

For more complex setups, you can use a `Router`, e.g. when one `transformer` model doesn't have the modalities that you're after. For the text modality, this example uses `all-MiniLM-L6-v2` with a linear layer that projects the token embeddings to 768 dimensions before pooling. For the multimodal text+image, this uses `ModernVBERT/modernvbert`, a model that supports both text AND image inputs simultaneously for one output embedding. This model could be trained to perform multimodal retrieval or similar tasks (a hypothetical reconstruction of this setup follows after these notes).

There are currently a lot of tiny breaking changes that I want to iron out. If I can't get rid of all of them, then sadly this refactor will have to wait until a v6.0 release, which I would normally only do alongside the introduction of a new archetype, Late Interaction in this case. Tons of TODOs also remain.
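The `Router` example code itself did not survive here; a hypothetical reconstruction of the setup described above might look like the following. The modality-keyed routes, and a `Dense` projection applied before pooling, are assumptions layered on top of this PR; exact names and signatures may differ.

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.models import Dense, Pooling, Router, Transformer

# Text route: all-MiniLM-L6-v2 (384-dim token embeddings), projected to
# 768 dimensions before mean pooling, as described above
text_route = [
    Transformer("sentence-transformers/all-MiniLM-L6-v2"),
    Dense(in_features=384, out_features=768),  # assumption: applied pre-pooling
    Pooling(word_embedding_dimension=768, pooling_mode="mean"),
]

# Multi-modal route: ModernVBERT/modernvbert embeds joint text+image inputs
text_image_route = [Transformer("ModernVBERT/modernvbert")]

router = Router({
    "text": text_route,                   # assumption: routes keyed by modality
    ("text", "image"): text_image_route,  # assumption: tuple key = multi-modal
})
model = SentenceTransformer(modules=[router])
```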
cc @NohTow - this should in theory allow you to work on multi-modal/cross-modal Late Interaction. One big annoyance for you for now will likely be that many architectures like CLIP/CLAP default to using `get_text_features`/`get_..._features` from the transformers model, and these methods all output pooled embeddings rather than token embeddings. This is okay-ish for Sentence Transformers, but a big problem for LI models.
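To make that annoyance concrete, here is the pooled-vs-token distinction with plain `transformers` CLIP (standard APIs, nothing PR-specific):

```python
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

inputs = processor(text=["a photo of a cat"], return_tensors="pt")
with torch.no_grad():
    # Convenience method: one pooled, projected vector per input text
    pooled = model.get_text_features(**inputs)  # shape (1, 512)
    # Token-level embeddings, which Late Interaction models need
    tokens = model.text_model(**inputs).last_hidden_state  # shape (1, seq_len, 512)
```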