
[NPUW] Support multiple outputs in LLMCompiledModel#33664

Open
AsyaPronina wants to merge 10 commits into openvinotoolkit:master from AsyaPronina:support_npuw_multioutputs

Conversation

@AsyaPronina
Contributor

Details:

Tickets:

  • ticket-id

@AsyaPronina AsyaPronina requested review from a team as code owners January 17, 2026 16:23
@github-actions github-actions bot added category: NPU OpenVINO NPU plugin category: NPUW NPUW plugin labels Jan 17, 2026
@AsyaPronina AsyaPronina force-pushed the support_npuw_multioutputs branch from d8580e5 to 78264b9 on January 17, 2026 16:25
@dmatveev
Contributor

If LM head cutting is ON, then the output embeddings (not logits) become a Result of the prefill/kvcache model. However, the last operation before the LM head already had a connected Result node corresponding to the second output of the original OmniThinker model

If there's already a Result, why don't you use one in your 3mp?

Contributor


I believe this change fixes the issue where a single Node has two Results connected to it. And this happens when you introduce one more Result to feed the 3rd model. Can this be avoided?

If our partitioning doesn't work for that case, I'd probably fix it later

Contributor Author

AsyaPronina commented Jan 17, 2026


I understand the concern; however, the CutLMHead transformation happens before the model is split into prefill and generate. When we work with the prefill model, we also add a Slice, but before only one Result node. So the two Result nodes get separated: one is now after the Slice, the other is still connected to the layer before the LM head. However, we don't need a Slice for the generate model, as it already outputs 1 token, and only here do we face the issue of two Results from the one layer before the LM head.
If we merged the two Results into one in advance, then the Slice would be added to both results.

Contributor Author


After testing, it turns out that preserving the changes in partitioning is a safer approach. Merging multiple Result nodes per output layer for the generate model works; however, that single Result node would then carry multiple names (collected from the merged Result nodes to preserve their meanings). But in LLMInferRequest, on the contrary, we use output_port->get_any_name(), which returns only one of the tensor names, for the mapping of names to outputs. This may cause issues, so I am preserving the partitioning changes for now.
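To illustrate the naming concern, here is a toy model of an output tensor carrying several names after a merge. `ToyTensor` and its `get_any_name()` are simplified stand-ins, not the real `ov::descriptor::Tensor` API:

```cpp
#include <cassert>
#include <set>
#include <string>

// Toy model of an output tensor's name set. When two Result nodes are
// merged, both names end up on one tensor; get_any_name() then returns
// an arbitrary member of the set, so a name->output mapping built from
// it may pick either name. (Calling it on an empty set would be UB;
// the sketch assumes at least one name is present.)
struct ToyTensor {
    std::set<std::string> names;
    const std::string& get_any_name() const { return *names.begin(); }
};
```

With `names = {"hidden_state", "logits"}`, the call returns one unspecified member of the set, which is exactly why a name-to-output mapping keyed on it can become ambiguous.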

Contributor

@GuoliangShiIntel GuoliangShiIntel left a comment


@AsyaPronina I have tested this change on both the Eagle3 and Omni pipelines, and both work well.

I'd like to leave an additional comment here in case it's related to the "multi-outputs" feature. Since "multi-outputs" may result in incorrect KV cache redirection, I made some related changes in the previous Eagle3 PR.

I noticed that Eugene added a comment after the PR was merged. I'm not sure if there's anything we should address regarding this?

@GuoliangShiIntel
Contributor

@AsyaPronina And another question: is this PR planned for the OV 26.0 release? If yes, I will also update the Eagle3 pipeline to align with this PR.

@dmatveev
Contributor

@GuoliangShiIntel there are a lot of PR checks failing here for some reason..

@dmatveev dmatveev added this to the 2026.0 milestone Jan 21, 2026
// then all Result nodes share the same shape.
if (maybe_results.size() > 1) {
    const auto shape = (*maybe_results.begin())->get_shape();
    for (auto i = 1; i < maybe_results.size(); ++i) {
Contributor


Linux builds failed here:

error: comparison of integer expressions of different signedness: 'int' and 'std::vector<std::shared_ptr<ov::Node> >::size_type' {aka 'long unsigned int'} [-Werror=sign-compare]
for (auto i = 1; i < maybe_results.size(); ++i) {
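A minimal self-contained sketch of the fix: using `std::size_t` as the loop index matches the unsigned type returned by `std::vector::size()` and silences `-Werror=sign-compare`. Plain `int` vectors stand in here for the real shapes, which in the PR come from `ov::Node::get_shape()`:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Signedness fix sketch: std::size_t (an unsigned type) as the loop
// index matches std::vector::size(), so no signed/unsigned comparison.
// The shape-equality check mirrors the intent of the quoted snippet.
bool all_results_share_shape(const std::vector<std::vector<int>>& shapes) {
    if (shapes.size() > 1) {
        const auto& first = shapes.front();
        for (std::size_t i = 1; i < shapes.size(); ++i) {
            if (shapes[i] != first) {
                return false;
            }
        }
    }
    return true;
}
```

An equivalent alternative is to spell the index type as `std::vector<std::shared_ptr<ov::Node>>::size_type`, which is what the compiler diagnostic names.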

Contributor Author


Thanks a lot!

@AsyaPronina
Contributor Author

AsyaPronina commented Jan 22, 2026

> @AsyaPronina I have tested this change on both the Eagle3 and Omni pipelines, and both work well.
>
> I'd like to leave an additional comment here in case it's related to the "multi-outputs" feature. Since "multi-outputs" may result in incorrect KV cache redirection, I made some related changes in the previous Eagle3 PR.
>
> I noticed that Eugene added a comment after the PR was merged. I'm not sure if there's anything we should address regarding this?

Hello @GuoliangShiIntel!
Thanks a lot for the thorough review and testing of the pipelines!

  1. Yes, the comment about KVCache redirection and Eugene's comment about empty KV inputs removal are both valid.
    • Regarding KVCache redirection: for now, your solution for Eagle3's additional outputs will work perfectly. As long as we only redirect outputs with present_key or present_value names, issues shouldn't arise.
    • Regarding KVCache redirection in Eugene's (@esmirno) PR: Eugene refined the KVCache redirection logic to use pattern matching. These patterns still won't match the additional outputs of either Eagle3 or Omni.
    • Regarding empty KV inputs removal: removal of empty KV inputs is based on pattern matching. This pattern matching looks for a param->concat pattern, which should match neither the Eagle3 outputs (add, add, add->concat) nor OmniThinker's (no concat).
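The param->concat safety argument above can be illustrated with a toy matcher. `ToyNode` and the string op names are simplified stand-ins, not the real `ov::pass::pattern` API:

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <vector>

// Toy stand-in for the param->concat pattern check: a Concat matches
// only when one of its direct inputs is a Parameter. Chains like
// add->concat (Eagle3's extra outputs) or ones with no Concat at all
// (OmniThinker) fall through unmatched, so their inputs are not removed.
struct ToyNode {
    std::string op;
    std::vector<std::shared_ptr<ToyNode>> inputs;
};

bool matches_param_to_concat(const ToyNode& node) {
    if (node.op != "Concat") {
        return false;
    }
    for (const auto& in : node.inputs) {
        if (in->op == "Parameter") {
            return true;
        }
    }
    return false;
}
```

Under this sketch, `Parameter -> Concat` matches while `Add -> Concat` does not, which is the property the comment relies on.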

@AsyaPronina AsyaPronina force-pushed the support_npuw_multioutputs branch from 1ad052b to ee57571 on January 22, 2026 03:47
Contributor

@GuoliangShiIntel GuoliangShiIntel left a comment


LGTM

@AsyaPronina AsyaPronina force-pushed the support_npuw_multioutputs branch from 3d2ef23 to 63200b2 on January 22, 2026 07:49
@rzubarev rzubarev added the priority: high High priority label Jan 22, 2026
@AsyaPronina AsyaPronina force-pushed the support_npuw_multioutputs branch from dd88e0e to 49a8cb5 on January 23, 2026 14:20
@dmatveev dmatveev modified the milestones: 2026.0, 2026.1 Jan 26, 2026