[NPUW] Fix eagle3 with chunk prefill #33975
Conversation
Force-pushed from 5d192f4 to dfefbd5 (compare)
AsyaPronina left a comment:
Great fix, thank you!
const uint32_t target_total_len = static_cast<uint32_t>(target_shape[1]);

OPENVINO_ASSERT(m_chunked_seq_offset + chunk_token_count <= target_total_len,
                "Chunked sequence offset exceeds pre-allocated size");
"Can't write chunk by stored chunked sequence offset and requested number of tokens, as it will exceed pre-allocated size"
Done
// Copy chunk data directly to the correct position in pre-allocated tensor
uint8_t* dst_ptr = reinterpret_cast<uint8_t*>(m_last_hidden_state->data());
dst_ptr += m_chunked_seq_offset * row_bytes;  // Move to the current write position
Could we please use ov::npuw::util::make_tensor_slice and tensor->copy_to(another_tensor) here? Some examples can be found in LLMInferRequest: https://github.com/openvinotoolkit/openvino/blob/master/src/plugins/intel_npu/src/plugin/npuw/llm_infer_request.cpp
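For reference, a minimal sketch of what the suggested refactor could look like. The helper name write_chunk_hidden_state is hypothetical, and the make_tensor_slice signature and include paths are assumptions based on the reviewer's pointer to llm_infer_request.cpp, not the actual patch:

// Hypothetical helper sketching the suggestion: take a view over the pre-allocated
// tensor with make_tensor_slice() and copy into it with copy_to(), instead of raw
// pointer arithmetic. Signatures and header paths are assumed, not verified here.
#include <openvino/runtime/itensor.hpp>  // ov::ITensor (dev API)
#include <openvino/runtime/so_ptr.hpp>   // ov::SoPtr

#include "util.hpp"  // ov::npuw::util::make_tensor_slice (npuw plugin-internal header, path assumed)

void write_chunk_hidden_state(const ov::SoPtr<ov::ITensor>& chunk_hidden_state,
                              const ov::SoPtr<ov::ITensor>& last_hidden_state,
                              uint32_t chunked_seq_offset,
                              uint32_t chunk_token_count) {
    // View of rows [offset, offset + chunk_token_count) along the sequence dimension (dim 1).
    auto dst_view = ov::npuw::util::make_tensor_slice(last_hidden_state,
                                                      1u,
                                                      chunked_seq_offset,
                                                      chunked_seq_offset + chunk_token_count);
    // Copy the chunk's hidden states into their slot in the pre-allocated tensor.
    chunk_hidden_state->copy_to(dst_view._ptr);
}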
Good proposal, done
Force-pushed from f04e2c0 to 2827581 (compare)
// Reset chunked prefill state before starting a new chunked prefill session
void reset_chunked_prefill_state() {
    m_last_hidden_state = {};
Why do we need to do this?
It means that on each prefill stage we are allocating a new tensor. Why?
Good question. Please consider this scenario, where we run two prompts consecutively using infer():
For the first prompt: m_last_hidden_state is null -> pre-allocate a tensor for the full sequence -> copy each chunk's last_hidden_state into the pre-allocated memory.
After the first prefill completes, the generate phase also updates m_last_hidden_state. When the generate phase finishes, m_last_hidden_state remains non-null.
For the second prompt: since m_last_hidden_state is still non-null, prefill will not enter the "Pre-allocate tensor on first chunk" path, causing a memory size mismatch that triggers the assertion.
Given that each prompt inference only prefills once, it is reasonable to reset the tensor here.
So m_last_hidden_state points to different tensors in the prefill and generate phases?
It would be nice to have an explanatory comment here.
Having an allocation per prefill is not a big deal, I think, but we could also keep a pre-allocated tensor for the prefill phase and avoid allocating it every time.
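To make the discussed lifecycle concrete, here is a minimal sketch assuming the member names visible in the diff (m_last_hidden_state, m_chunked_seq_offset); the surrounding struct, method names, and include paths are illustrative assumptions, not the plugin's actual code:

// Sketch of the allocation/reset lifecycle discussed above. Member names follow the
// diff; everything else is illustrative only.
#include <cstdint>

#include <openvino/core/shape.hpp>
#include <openvino/runtime/itensor.hpp>      // dev API
#include <openvino/runtime/make_tensor.hpp>  // dev API: ov::make_tensor (signature assumed)
#include <openvino/runtime/so_ptr.hpp>

struct ChunkedPrefillState {
    ov::SoPtr<ov::ITensor> m_last_hidden_state;
    uint32_t m_chunked_seq_offset = 0;

    void on_prefill_chunk(const ov::SoPtr<ov::ITensor>& chunk_hidden_state,
                          const ov::Shape& full_shape,  // e.g. [1, total_prompt_len, hidden_dim]
                          uint32_t chunk_token_count) {
        if (!m_last_hidden_state._ptr) {
            // First chunk of a new prompt: pre-allocate storage for the whole sequence.
            m_last_hidden_state = ov::SoPtr<ov::ITensor>(
                ov::make_tensor(chunk_hidden_state->get_element_type(), full_shape));
            m_chunked_seq_offset = 0;
        }
        // ... copy the chunk into rows [m_chunked_seq_offset, m_chunked_seq_offset + chunk_token_count) ...
        m_chunked_seq_offset += chunk_token_count;
    }

    // Called before a new prompt: the generate phase re-points m_last_hidden_state to its
    // own tensor, so the next prefill must re-enter the pre-allocation branch above.
    void reset_chunked_prefill_state() {
        m_last_hidden_state = {};
        m_chunked_seq_offset = 0;
    }
};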
Force-pushed from 2827581 to 8babf24 (compare)
Force-pushed from 8babf24 to 2aaccdb (compare)
build_jenkins

@dmatveev All CI passed, can we merge this PR?
Details:

Background:
- Eagle3 requires last_hidden_state in addition to logits.
- logits only needs the last token (via the slice output), while last_hidden_state requires the tensors for all tokens.
- Chunk prefill therefore has to accumulate last_hidden_state across chunks, unlike logits, which only needs the final chunk.

Changes in this PR:
- Added logic to accumulate and concatenate last_hidden_state outputs across chunks during chunk prefill in the Eagle3 pipeline (see the conceptual sketch below).

Tickets:
- CVS-180647
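For illustration, a self-contained sketch of the accumulation pattern described above; ChunkOutputs, run_chunk, and chunked_prefill are hypothetical names, not the plugin's API:

// Conceptual sketch only: accumulate last_hidden_state across all prefill chunks,
// while logits only need to reflect the final chunk.
#include <cstdint>
#include <functional>
#include <utility>
#include <vector>

struct ChunkOutputs {
    std::vector<float> logits;             // per-chunk logits (sliced to the last token)
    std::vector<float> last_hidden_state;  // per-chunk hidden states for every token
};

void chunked_prefill(const std::vector<std::vector<int64_t>>& chunks,
                     const std::function<ChunkOutputs(const std::vector<int64_t>&)>& run_chunk,
                     std::vector<float>& all_hidden_states,  // accumulated across chunks
                     std::vector<float>& final_logits) {     // only the last chunk matters
    for (const auto& chunk : chunks) {
        ChunkOutputs out = run_chunk(chunk);
        // Eagle3 needs last_hidden_state for every prompt token, so concatenate chunk by chunk.
        all_hidden_states.insert(all_hidden_states.end(),
                                 out.last_hidden_state.begin(),
                                 out.last_hidden_state.end());
        // logits are only consumed for the last token of the last chunk, so overwriting is enough.
        final_logits = std::move(out.logits);
    }
}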