
[NPUW] Support multiple outputs in LLMCompiledModel#33664

Open
AsyaPronina wants to merge 10 commits into openvinotoolkit:master from AsyaPronina:support_npuw_multioutputs

Conversation

@AsyaPronina
Contributor

Details:

Tickets:

  • ticket-id

@AsyaPronina AsyaPronina requested review from a team as code owners January 17, 2026 16:23
@github-actions github-actions bot added category: NPU OpenVINO NPU plugin category: NPUW NPUW plugin labels Jan 17, 2026
@AsyaPronina AsyaPronina force-pushed the support_npuw_multioutputs branch from d8580e5 to 78264b9 on January 17, 2026 16:25
@dmatveev
Contributor

If LM head cutting is ON, then the output embeddings (not logits) become a Result of the prefill/kvcache model. However, the last operation before the LM head already had a connected Result node corresponding to the second output of the original OmniThinker model

If there's already a Result, why don't you use one in your 3mp?

Contributor


I believe this change fixes the issue where a single Node has two Results connected to it. And this happens when you introduce one more Result to feed the 3rd model. Can this be avoided?

If our partitioning doesn't work for that case, I'd probably fix it later

Contributor Author

AsyaPronina commented Jan 17, 2026


I understand the concern; however, the CutLMHead transformation happens before the model is split into prefill and generate. When we work with the prefill model, we also add a Slice, but before only one Result node. So the two Result nodes get separated: one is now after the Slice, the other is still connected to the layer before the LM head. However, we don't need a Slice for the generate model, as it already outputs 1 token, and only here do we face the issue of two Results from the one layer before the LM head.
If we merged the two Results into one in advance, then the Slice would be added to both results.

Contributor Author


After testing, it turns out that preserving the changes in partitioning is a safer approach. Merging multiple Result nodes per output layer for the generate model works; however, that single Result node would then carry multiple names (collected from the merged Result nodes to preserve their meanings). But in LLMInferRequest, on the contrary, we use output_port->get_any_name(), which returns only one of the tensor names, for the mapping of names to outputs. This may cause issues, so I am preserving the partitioning changes for now.
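To illustrate the naming concern, here is a toy model of an output tensor carrying several names after a merge. `ToyTensor` and its `get_any_name()` are simplified stand-ins, not the real `ov::descriptor::Tensor` API:

```cpp
#include <cassert>
#include <set>
#include <string>

// Toy model of an output tensor's name set. When two Result nodes are
// merged, both names end up on one tensor; get_any_name() then returns
// an arbitrary member of the set, so a name->output mapping built from
// it may pick either name. (Calling it on an empty set would be UB;
// the sketch assumes at least one name is present.)
struct ToyTensor {
    std::set<std::string> names;
    const std::string& get_any_name() const { return *names.begin(); }
};
```

With `names = {"hidden_state", "logits"}`, the call returns one unspecified member of the set, which is exactly why a name-to-output mapping keyed on it can become ambiguous.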

Contributor

@GuoliangShiIntel GuoliangShiIntel left a comment


@AsyaPronina I have tested this change on both the Eagle3 and Omni pipelines, and both work well.

I'd like to leave an additional comment here in case it's related to the "multi-outputs" feature. Since "multi-outputs" may result in incorrect KV cache redirection, I made some related changes in the previous Eagle3 PR.

I noticed that Eugene added a comment after the PR was merged. I'm not sure if there's anything we should address regarding this?

@GuoliangShiIntel
Contributor

@AsyaPronina And another question: is this PR planned for the OV 26.0 release? If yes, I will also update the Eagle3 pipeline to align with this PR.

@dmatveev
Contributor

@GuoliangShiIntel there are a lot of PR checks failing here for some reason..

@dmatveev dmatveev added this to the 2026.0 milestone Jan 21, 2026
// then all Result nodes share the same shape.
if (maybe_results.size() > 1) {
    const auto shape = (*maybe_results.begin())->get_shape();
    for (auto i = 1; i < maybe_results.size(); ++i) {
Contributor


Linux builds failed here:

error: comparison of integer expressions of different signedness: 'int' and 'std::vector<std::shared_ptr<ov::Node> >::size_type' {aka 'long unsigned int'} [-Werror=sign-compare]
for (auto i = 1; i < maybe_results.size(); ++i) {
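A minimal self-contained sketch of the fix: using `std::size_t` as the loop index matches the unsigned type returned by `std::vector::size()` and silences `-Werror=sign-compare`. Plain `int` vectors stand in here for the real shapes, which in the PR come from `ov::Node::get_shape()`:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Signedness fix sketch: std::size_t (an unsigned type) as the loop
// index matches std::vector::size(), so no signed/unsigned comparison.
// The shape-equality check mirrors the intent of the quoted snippet.
bool all_results_share_shape(const std::vector<std::vector<int>>& shapes) {
    if (shapes.size() > 1) {
        const auto& first = shapes.front();
        for (std::size_t i = 1; i < shapes.size(); ++i) {
            if (shapes[i] != first) {
                return false;
            }
        }
    }
    return true;
}
```

An equivalent alternative is to spell the index type as `std::vector<std::shared_ptr<ov::Node>>::size_type`, which is what the compiler diagnostic names.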

Contributor Author


Thanks a lot!

@AsyaPronina
Contributor Author

AsyaPronina commented Jan 22, 2026

> @AsyaPronina I have tested this change on both the Eagle3 and Omni pipelines, and both work well.
>
> I'd like to leave an additional comment here in case it's related to the "multi-outputs" feature. Since "multi-outputs" may result in incorrect KV cache redirection, I made some related changes in the previous Eagle3 PR.
>
> I noticed that Eugene added a comment after the PR was merged. I'm not sure if there's anything we should address regarding this?

Hello @GuoliangShiIntel!
Thanks a lot for the thorough review and testing of the pipelines!

  1. Yes, the comment about KVCache redirection and Eugene's comment about empty KV inputs removal are both valid.
    • Regarding KVCache redirection: for now, your solution for Eagle3's additional outputs will work perfectly. As long as we only redirect outputs with present_key or present_value names, issues shouldn't arise.
    • Regarding KVCache redirection in Eugene's (@esmirno) PR: Eugene refined the KVCache redirection logic to use pattern matching. These patterns still won't match the additional outputs of either Eagle3 or Omni.
    • Regarding empty KV inputs removal: removal of empty KV inputs is based on pattern matching. This pattern matching looks for a param->concat pattern, which should match neither the Eagle3 outputs (add, add, add->concat) nor OmniThinker's (no concat).
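The param->concat safety argument above can be illustrated with a toy matcher. `ToyNode` and the string op names are simplified stand-ins, not the real `ov::pass::pattern` API:

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <vector>

// Toy stand-in for the param->concat pattern check: a Concat matches
// only when one of its direct inputs is a Parameter. Chains like
// add->concat (Eagle3's extra outputs) or ones with no Concat at all
// (OmniThinker) fall through unmatched, so their inputs are not removed.
struct ToyNode {
    std::string op;
    std::vector<std::shared_ptr<ToyNode>> inputs;
};

bool matches_param_to_concat(const ToyNode& node) {
    if (node.op != "Concat") {
        return false;
    }
    for (const auto& in : node.inputs) {
        if (in->op == "Parameter") {
            return true;
        }
    }
    return false;
}
```

Under this sketch, `Parameter -> Concat` matches while `Add -> Concat` does not, which is the property the comment relies on.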

@AsyaPronina AsyaPronina force-pushed the support_npuw_multioutputs branch from 1ad052b to ee57571 on January 22, 2026 03:47
Contributor

@GuoliangShiIntel GuoliangShiIntel left a comment


LGTM

@AsyaPronina AsyaPronina force-pushed the support_npuw_multioutputs branch from 3d2ef23 to 63200b2 on January 22, 2026 07:49
@rzubarev rzubarev added the priority: high High priority label Jan 22, 2026
@AsyaPronina AsyaPronina force-pushed the support_npuw_multioutputs branch from dd88e0e to 49a8cb5 on January 23, 2026 14:20
@dmatveev dmatveev modified the milestones: 2026.0, 2026.1 Jan 26, 2026