[Frontend] Support using chat template as custom score template for reranking models #30550
Conversation
|
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.
Just a reminder: PRs would not trigger a full CI run by default. Instead, it would only run fastcheck CI, which covers a small and essential subset of CI tests to quickly catch errors. You can ask your reviewers to trigger select CI tests on top of fastcheck CI. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.
If you have any questions, please reach out to us on Slack at https://slack.vllm.ai. 🚀 |
|
Documentation preview: https://vllm--30550.org.readthedocs.build/en/30550/ |
Code Review
This pull request introduces a --score-template CLI argument, allowing users to provide a custom Jinja2 template for score/rerank models. This is a valuable feature for decoupling prompt formatting from model-specific code. The implementation is mostly solid, with new CLI arguments, documentation, and tests. However, I've identified a high-severity issue related to code reuse that impacts maintainability and user experience. Specifically, chat-template-specific utilities are being reused for score templates, which can lead to confusing error messages. I've suggested a refactoring to create more generic template-handling functions.
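For readers skimming this review, here is a minimal sketch of what such a Jinja2 score template could look like and how it renders. The template text and the query/document variable names are illustrative assumptions, not the template in this PR.

```python
# Illustrative only: a hypothetical Jinja2 score template rendered with jinja2.
# The variable names `query` and `document` are assumptions for this sketch.
from jinja2 import Template

score_template = (
    "Instruct: Judge whether the document answers the query.\n"
    "Query: {{ query }}\n"
    "Document: {{ document }}"
)

prompt = Template(score_template).render(
    query="What is the capital of France?",
    document="Paris is the capital of France.",
)
print(prompt)
```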
|
Hi @jzakrzew, the pre-commit checks have failed. Please run:
uv pip install pre-commit
pre-commit install
pre-commit run --all-files
Then, commit the changes and push to your branch. For future commits, the installed hooks will run automatically.
|
Force-pushed from 1f64fa9 to 9258e17
|
Hello @Samoed, how does MTEB handle score templates? We are looking to align our implementation with MTEB. More links related to score templates:
Let's find a way to resolve this once and for all. |
|
Hi! We have a separate class for handling instruction-based models that process instructions, with an example for Qwen3. However, this approach is a bit naive, since there's no standard way of doing this yet. Maybe @tomaarsen has some thoughts on standardizing prompt templates for cross-encoders. For me, it has always been unclear why there are no models that define their prompts in Jinja templates that could be used more automatically. |
|
Hello @tomaarsen, please take a look at this thread. |
|
Thanks for pinging me @noooop & @Samoed.
This modern format is becoming a lot more prevalent. For my codebase, there were always two main concerns:
I think working on this support is so important that I'm working on a major refactor of the Sentence Transformers codebase, notably around the CrossEncoder, to help modularize it. This allows me to very easily support models that don't rely on …
For concern 2, a very simple solution is to rely on the chat template:
[
{
"role": "query",
"content": "What is the capital of France?",
},
{
"role": "document",
"content": "Paris is the capital of France.",
}
]
tokenized = tokenizer.apply_chat_template(messages, ...)
(This one matches the format required for https://huggingface.co/Qwen/Qwen3-Reranker-0.6B, I believe.) An additional benefit here is that we can take advantage of a "system prompt" of sorts as an instruction/prompt for the reranker. In the above chat template, I hardcoded … Some of the obvious advantages are that the … But my primary hesitation at the current stage is that … This tempts me to write a more "manual" templating implementation, where I can apply the truncation on the second input (often the 'document' in a query-document setting). Recurring issues that I've found with my initial attempts are that you can't fully separately tokenize the template from the actual texts, as many template tokens will want to "merge" with actual text tokens (e.g. having …).
Those are my thoughts for now. @noooop, where do you stand regarding:
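To make the query/document role format above concrete, here is a rough, self-contained sketch. The template string is my own illustration (not the actual Qwen3-Reranker template), and it relies on transformers' apply_chat_template accepting a chat_template override.

```python
# Sketch of a chat template that understands "query" and "document" roles.
# The template text is an illustrative assumption, not any model's shipped template.
from transformers import AutoTokenizer

RERANK_TEMPLATE = (
    "{% for message in messages %}"
    "{% if message['role'] == 'query' %}<Query>: {{ message['content'] }}\n"
    "{% elif message['role'] == 'document' %}<Document>: {{ message['content'] }}"
    "{% endif %}"
    "{% endfor %}"
)

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-Reranker-0.6B")
messages = [
    {"role": "query", "content": "What is the capital of France?"},
    {"role": "document", "content": "Paris is the capital of France."},
]
# Passing chat_template= overrides whatever template ships with the tokenizer.
prompt = tokenizer.apply_chat_template(
    messages, chat_template=RERANK_TEMPLATE, tokenize=False
)
print(prompt)
```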
|
Generally I think this is how it should be, but for now there are no such models. Even Qwen just inherits the template from the original LLM.
Generally I think it's a good approach, but I'm afraid some libraries won't allow custom role names. Probably you can use … |
|
What are your thoughts?
This is also my concern, which is why I'd like to seek your advice. After all, Sentence Transformers and MTEB are upstream of vLLM, and vLLM only supports a very limited number of CrossEncoder models. Just to mention: currently, vLLM does not perform truncation by default, following the OpenAI API behavior for /v1/embeddings. |
This makes sense. If the HF Hub repo has an incorrect chat template, you can override it in vLLM by passing …
As @noooop said, since we don't allow truncation by default, it should not be a problem. |
|
Ok, I'll modify the PR so that it uses the chat template. |
|
I think that's the right move. I'll also move to …
Does vLLM support prompts/instructions?
Edit: As mentioned by @Samoed, the above approaches are not very robust for listwise rerankers, which take multiple documents.
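For the listwise case, the message-list shape could in principle carry several documents; the following is only a sketch of the shape under discussion, not a format any library currently guarantees.

```python
# Hypothetical listwise rerank input: one query followed by multiple "document" messages.
messages = [
    {"role": "system", "content": "Rank the documents by relevance to the query."},
    {"role": "query", "content": "How does AI work?"},
    {"role": "document", "content": "AI works like ..."},
    {"role": "document", "content": "Neural networks are loosely inspired by the brain."},
]
```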
|
|
Also, support based on chunk content can be added, like:
{
"role": "user",
"content": [
{
"type": "query",
"text": { # query/text
"value": "How does AI work? Explain it in simple terms.",
"annotations": []
}
},
{
"type": "document",
"text": { # document/text
"value": "AI works like ...",
}
}
],
}
But I'm not sure whether it's possible to handle this from Jinja, or whether it would work with other libraries.
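As a quick sanity check that plain Jinja can iterate over such content parts, here is a sketch (my own illustration; whether other libraries would pass content lists through untouched is exactly the open question):

```python
# Sketch: rendering the structured content parts above with plain Jinja2.
from jinja2 import Template

PARTS_TEMPLATE = (
    "{% for message in messages %}"
    "{% for part in message['content'] %}"
    "{% if part['type'] == 'query' %}<Query>: {{ part['text']['value'] }}\n"
    "{% elif part['type'] == 'document' %}<Document>: {{ part['text']['value'] }}\n"
    "{% endif %}"
    "{% endfor %}"
    "{% endfor %}"
)

messages = [{
    "role": "user",
    "content": [
        {"type": "query", "text": {"value": "How does AI work? Explain it in simple terms.", "annotations": []}},
        {"type": "document", "text": {"value": "AI works like ..."}},
    ],
}]

print(Template(PARTS_TEMPLATE).render(messages=messages))
```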
By the way, this won't work for |
|
I think we need to ask someone from the tokenizers/chat template maintainers for a better way to handle this. |
|
cc @hmellor |
Force-pushed from 9258e17 to 40808e9
|
Hi @jzakrzew, the pre-commit checks have failed. Please run:
uv pip install pre-commit
pre-commit install
pre-commit run --all-files
Then, commit the changes and push to your branch. For future commits, the installed hooks will run automatically.
|
|
Yes. For context:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("HuggingFaceH4/zephyr-7b-beta")
model = AutoModelForCausalLM.from_pretrained("HuggingFaceH4/zephyr-7b-beta", device_map="auto", dtype=torch.bfloat16)
messages = [
{"role": "system", "content": "You are a friendly chatbot who always responds in the style of a pirate",},
{"role": "user", "content": "How many helicopters can a human eat in one sitting?"},
]
tokenized_chat = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True, return_tensors="pt")
print(tokenizer.decode(tokenized_chat[0]))
Docs: https://huggingface.co/docs/transformers/main/en/chat_templating#using-applychattemplate
Sentence Transformers relies on …
|
DarkLight1337
left a comment
LGTM now, thanks for the detailed discussion!
Head branch was pushed to by a user without write access
|
@noooop Sorry, I just wanted to clarify one comment; I did not notice you had enabled auto-merge. |
TLDR
Purpose
This PR allows users to specify a custom prompt template for score/rerank models by providing the --chat-template CLI argument or setting chat_template in tokenizer_config.json.
Motivation: The current mechanism for setting custom score templates (SupportsScoreTemplate) is architecture-specific: it requires modifying the model class itself. This change decouples the prompt template from the model class, enabling support for any model requiring a custom score template without model-specific code changes.
Immediate use case: The nvidia/llama-nemotron-rerank-1b-v2 model, which uses the Llama architecture but needs a custom score template, can now be made to run correctly on vLLM with minor config.json modifications.
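As a usage sketch (the template file name is made up; the serve flag and rerank payload follow vLLM's existing --chat-template option and /rerank API as I understand them), one might start the server with "vllm serve nvidia/llama-nemotron-rerank-1b-v2 --chat-template ./nemotron_rerank.jinja" and then score documents against a query:

```python
# Hedged sketch: score documents against a query via the server's rerank endpoint.
# Assumes the server was started as described above; payload fields follow vLLM's rerank API.
import requests

response = requests.post(
    "http://localhost:8000/rerank",
    json={
        "model": "nvidia/llama-nemotron-rerank-1b-v2",
        "query": "How does AI work? Explain it in simple terms.",
        "documents": ["AI works like ...", "Paris is the capital of France."],
    },
)
print(response.json())
```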
Running nvidia/llama-nemotron-rerank-1b-v2 with examples provided in the model's README, using FP32 precision:
Running without the custom template:
Running with a custom template:
Without a custom template:
With a custom template:
Test Plan
tests/entrypoints/pooling/score/test_utils.py
tests/models/language/pooling_mteb_test/test_nemotron.py
Test Result
pass
TODO
Since vLLM doesn't allow truncation by default, this should not be a problem.