[feat] Introduce cross-modality and multi-modality support; modularize CrossEncoder class #3554
tomaarsen wants to merge 59 commits into huggingface:main from
Conversation
coreintelligence: Question: In addition to Qwen/Qwen3-Embedding-#B, would the rerankers be supported as an initializer to …

tomaarsen (Member, Author):
@coreintelligence Yes, that is the intention. The 'Qwen3-Reranker-#B' rerankers are a new style of reranker based on CausalLM models with specific templates, where the model's scores for specific tokens (e.g. "yes", "no", "1", "0") are used to compute a relevance score. Other examples are https://huggingface.co/mixedbread-ai/mxbai-rerank-large-v2 or https://huggingface.co/ContextualAI/ctxl-rerank-v2-instruct-multilingual-1b.
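To make that concrete, here is a minimal sketch of the yes/no-token scoring idea using plain `transformers`. The prompt template below is illustrative only; real rerankers like the ones linked above ship their own required templates and token choices.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Any CausalLM-based reranker works here; mxbai-rerank-large-v2 is one of
# the examples linked above.
model_name = "mixedbread-ai/mxbai-rerank-large-v2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Illustrative template: the real model defines its own prompt format
prompt = (
    "Query: what is a panda?\n"
    "Document: The giant panda is a bear native to China.\n"
    "Is the document relevant to the query? Answer yes or no: "
)
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits[0, -1]  # logits for the next token

yes_id = tokenizer.convert_tokens_to_ids("yes")
no_id = tokenizer.convert_tokens_to_ids("no")
# Relevance score: probability mass on "yes" vs. "no" as the next token
score = torch.softmax(logits[[yes_id, no_id]], dim=0)[0].item()
print(score)
```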
…rclasses for all 3 archetypes
This was referenced Dec 4, 2025
tomaarsen added a commit to omkar-334/sentence-transformers that referenced this pull request on Dec 5, 2025:
This will align better with my goals for the big refactor of huggingface#3554, where these methods will be called _multi_process and _multi_process_worker
tomaarsen added a commit that referenced this pull request on Dec 5, 2025:
* Add multiprocessing support for Cross Encoder
* Rename _predict_multi_process... -> _multi_process... This will align better with my goals for the big refactor of #3554, where these methods will be called _multi_process and _multi_process_worker
* Add test suite for multi-processing, mirroring the test suite for ST models
* Reorder kwargs to match ST
* Change how device is determined, matching ST
* Add device, pool, and chunk_size to other predict typings
* Upgrade with multi-gpu reranking
* Update test_hard_negatives test, simplify mine_hard_negatives slightly

Co-authored-by: Tom Aarsen <Cubiegamedev@gmail.com>
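Based on that commit message, a rough usage sketch of the new multi-process prediction (parameter names taken from the commit, mirroring `SentenceTransformer.encode`'s multi-process support; treat this as an approximation, not the final API):

```python
from sentence_transformers import CrossEncoder

model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
pairs = [
    ("What is a panda?", "The giant panda is a bear native to China."),
    ("What is a panda?", "Paris is the capital of France."),
]

# Per the commit, predict() gains device, pool, and chunk_size; passing
# multiple devices spreads reranking across GPUs.
scores = model.predict(pairs, device=["cuda:0", "cuda:1"], chunk_size=256)
print(scores)
```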
Also update util API Reference, add missing backend_export_sidebar to efficiency.rst; move similarity_function into similarity
…e MLMTransformer from docs
Hello!
Pull Request overview
- Introduce cross-modality and multi-modality support in the `SentenceTransformer`, `Router`, and `Transformer` classes
- Modularize the `CrossEncoder` class, initially by subclassing `SentenceTransformer`, but long term I want to subclass a new superclass

Details
This pull request is very much a work-in-progress, although it is already functional. In short:
- `Transformer` now works with an `AutoProcessor` and handles inputs through that. This accepts multiple modalities.
- `SentenceTransformer`, `Transformer` and `Router` check the modality of inputs; only one modality is allowed per inference.
- `Router` has been adapted to allow for modality-based routing.
- Modalities are specified like `["text", "image"]` and `[("text", "image")]`. The former is cross-modal, i.e. you can pass either text or images, and you can then compare the embeddings across the modalities. The latter is multi-modal, i.e. you can pass text AND images at the same time, and this joint input results in one embedding output. The "one input in, one embedding out" is a core feature.
- Multi-modal inputs are passed as dictionaries, e.g. `model.encode([{"text": "This is my <image>", "image": "cat.jpg"}, ...])` (see the sketch after this list).
- `Transformer` is designed to be somewhat flexible moving forward. Model authors can specify which modalities are supported, which methods on the `AutoModel` need to be called, and which output keys need to be used from the outputs of those methods. The goal is to have strong defaulting as well.
- `model.modalities` gives a list of supported modalities, e.g. `SentenceTransformer("laion/clap-htsat-unfused").modalities` is `['text', 'audio']`.
- Supported modalities are `text`, `image`, `audio`, `video`, combinations of the previous, and `message`. The latter uses `processor.apply_chat_template`.
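As a rough illustration of the API described above (a sketch only: this PR is still in flux, and `clip-ViT-B-32` merely stands in for any cross-modal model):

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/clip-ViT-B-32")
print(model.modalities)  # e.g. ['text', 'image'] under this PR

# Cross-modal: one modality per encode() call; the resulting embeddings
# share one space and can be compared across calls
text_emb = model.encode(["a photo of a cat"])
image_emb = model.encode(["cat.jpg"])  # image path or PIL.Image
print(model.similarity(text_emb, image_emb))

# Multi-modal: one joint text+image input -> one embedding
joint_emb = model.encode([{"text": "This is my <image>", "image": "cat.jpg"}])
```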
Here are two cross-modal models that I trained:

This is an incomplete list of models that you can simply initialize with `SentenceTransformer(model_name)`:

For more complex setups, you can use a `Router`, e.g. when one `transformer` model doesn't have the modalities that you're after. For the text modality, this example uses `all-MiniLM-L6-v2` with a linear layer that projects the token embeddings to 768 dimensions before pooling. For the multimodal text+image, this uses `ModernVBERT/modernvbert`, a model that supports both text AND image inputs simultaneously for one output embedding. This model could be trained to perform multimodal retrieval or similar tasks (a hypothetical reconstruction of this setup follows after these notes).

There are currently a lot of tiny breaking changes that I want to iron out. If I can't get rid of all of them, then sadly this refactor will have to wait until a v6.0 release, which I would normally only do alongside the introduction of a new archetype, Late Interaction in this case. Tons of TODOs also remain.
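The `Router` example code itself did not survive here; a hypothetical reconstruction of the setup described above might look like the following. The modality-keyed routes, and a `Dense` projection applied before pooling, are assumptions layered on top of this PR; exact names and signatures may differ.

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.models import Dense, Pooling, Router, Transformer

# Text route: all-MiniLM-L6-v2 (384-dim token embeddings), projected to
# 768 dimensions before mean pooling, as described above
text_route = [
    Transformer("sentence-transformers/all-MiniLM-L6-v2"),
    Dense(in_features=384, out_features=768),  # assumption: applied pre-pooling
    Pooling(word_embedding_dimension=768, pooling_mode="mean"),
]

# Multi-modal route: ModernVBERT/modernvbert embeds joint text+image inputs
text_image_route = [Transformer("ModernVBERT/modernvbert")]

router = Router({
    "text": text_route,                   # assumption: routes keyed by modality
    ("text", "image"): text_image_route,  # assumption: tuple key = multi-modal
})
model = SentenceTransformer(modules=[router])
```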
cc @NohTow - this should in theory allow you to work on multi-modal/cross-modal Late Interaction. One big annoyance for you for now will likely be that many architectures like CLIP/CLAP default to using `get_text_features`/`get_..._features` from the transformers model, and these methods all output pooled embeddings rather than token embeddings. This is okay-ish for Sentence Transformers, but a big problem for LI models.
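To make that annoyance concrete, here is the pooled-vs-token distinction with plain `transformers` CLIP (standard APIs, nothing PR-specific):

```python
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

inputs = processor(text=["a photo of a cat"], return_tensors="pt")
with torch.no_grad():
    # Convenience method: one pooled, projected vector per input text
    pooled = model.get_text_features(**inputs)  # shape (1, 512)
    # Token-level embeddings, which Late Interaction models need
    tokens = model.text_model(**inputs).last_hidden_state  # shape (1, seq_len, 512)
```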