
Conversation

Contributor

@JonathanC-ARM JonathanC-ARM commented Jan 30, 2026

Description

This PR fixes non-deterministic corruption on the CPU EP, observed when the CPU Conv path is dispatched through the MLAS KleidiAI implementation.

The fix is entirely within onnxruntime/core/mlas/lib/kleidiai/convolve_kleidiai.cpp and addresses two correctness issues in the KleidiAI Conv LHS packing/padding logic:

  • Fully initialize the LHS indirection table (lhs_ptrs) for partial tiles (m < m_step) so the LHS packing kernel never reads uninitialized entries.
  • Replace the padding buffer, which was effectively fixed in size after the first call, with a per-thread grow-only buffer sized to at least the current ci, preventing out-of-bounds reads when later convolutions use larger channel counts.

Motivation and Context

This change is required to fix incorrect, non-deterministic Conv outputs on CPU (often extreme magnitudes / -FLT_MAX-like values) after running certain Conv-heavy models in the same process. It fixes the root causes of uninitialized reads in the KleidiAI Conv LHS packing path and out-of-bounds reads from an undersized padding buffer.
Fixes: #26669

Signed-off-by: Jonathan Clohessy <Jonathan.Clohessy@arm.com>
Contributor

@xenova xenova left a comment


Tested in the same environment from #26669, and I now get deterministic outputs. Thanks so much!

Contributor

@adrianlizarraga adrianlizarraga left a comment


I can also confirm that this patch resolves the non-deterministic output on M4. Thank you!

@adrianlizarraga
Contributor

/azp run Linux QNN CI Pipeline, Win_TRT_Minimal_CUDA_Test_CI, Win_TRT_Minimal_CUDA_Test_CI, Windows GPU Doc Gen CI Pipeline

@azure-pipelines

Azure Pipelines successfully started running 3 pipeline(s).

@adrianlizarraga
Contributor

/azp run Windows ARM64 QNN CI Pipeline

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@edgchen1
Contributor

thanks for the fix. are there existing unit tests which would catch this or do we need more test coverage?

@JonathanC-ARM
Contributor Author

thanks for the fix. are there existing unit tests which would catch this or do we need more test coverage?

I was planning to investigate this and follow up with a subsequent PR to address it.

I suspect the answer is that no tests cover this exact case. We routinely run the unit and integration tests through this code path, and we haven't seen accuracy issues like this before.


// pad_ptr must be at least 'ci' floats for padding pixels.
// Using a thread_local grow-only buffer to avoid cross-thread interference and ensure sizing is correct.
thread_local std::vector<float> pad_ptr;
Member


thread_local std::vector<float> pad_ptr;

Would a stack local vector avoid cross-thread interference? And if indeed there is an issue, should this be synchronized?

Contributor Author


Each thread gets its own individual instance of this vector, so no synchronization mechanism is required: the storage is thread-local and never shared between threads. A stack-local vector would also avoid cross-thread interference, but it would lose the grow-only sizing between calls.

@edgchen1
Contributor

thanks for the fix. are there existing unit tests which would catch this or do we need more test coverage?

I was planning to investigate this and follow up with a subsequent PR to address it.

I suspect the answer is that no tests cover this exact case. We routinely run the unit and integration tests through this code path, and we haven't seen accuracy issues like this before.

sounds good. I'm fine with a subsequent PR for it due to the urgency.

Contributor

Copilot AI left a comment


Pull request overview

This PR attempts to fix non-deterministic corruption in CPU Conv operations when using the MLAS KleidiAI implementation. It addresses two issues:

  • Fully initialize the LHS indirection table (lhs_ptrs) to prevent uninitialized reads by the packing kernel
  • Replace the static fixed-size padding buffer with a thread-local grow-only buffer to handle varying channel counts


Comment on lines +471 to +478
thread_local std::vector<float> pad_ptr;
if (pad_ptr.size() < padsize) {
  pad_ptr.resize(padsize, 0.f);
} else {
  // Ensure any previously-used region remains zeroed (grow-only means it should already be zeros,
  // but keep this explicit for safety).
  std::fill(pad_ptr.begin(), pad_ptr.end(), 0.f);
}

Copilot AI Jan 30, 2026


The cached lhs_ptrs array contains pointers to pad_ptr (line 403, and via pixel_offset lambda at lines 420, 425, 430). When pad_ptr is resized (line 473), its underlying storage may be reallocated to a new address, invalidating all pointers stored in previously cached lhs_ptrs entries.

Example scenario:

  1. Call with ci=100, padsize=256: pad_ptr at address A, lhs_ptrs cached with pointers to A
  2. Call with ci=500, padsize=512: pad_ptr resized to address B
  3. Call with ci=100 again: cache hit returns lhs_ptrs with stale pointers to A (now invalid)

The old static vector never moved after first initialization (though it had the undersizing bug this PR fixes). The thread_local grow-only approach fixes the sizing issue but breaks the caching because the cache doesn't invalidate when pad_ptr moves.

Possible solutions:

  • Invalidate lhs_ptrs_cache when pad_ptr is resized
  • Store pad_ptr address in cache key to detect when it has moved
  • Use a stable allocation for pad_ptr that doesn't move when it grows
  • Don't cache lhs_ptrs, or recalculate pointers before use

Member


I think Copilot has a point about potential hanging on to pointers that may be invalidated due to the resize() logic

Member

@yuslepukhin yuslepukhin Jan 30, 2026


Absolutely.

Contributor

@adrianlizarraga adrianlizarraga Jan 30, 2026


I don't think this comment is accurate. The lhs_ptrs array is created every time LhsPtrFill is called, and LhsPtrFill always receives the updated pad_ptr after it has been resized. Please correct me if I'm mistaken @JonathanC-ARM.

Update: It seems this may indeed be a problem.

Member


I think adding pad_ptr to the cache key might be the way to go (Copilot's second bullet)

@adrianlizarraga adrianlizarraga self-requested a review January 30, 2026 18:04
hanbitmyths pushed a commit that referenced this pull request Feb 3, 2026
### Description

Refer to V1 of the fix here:
#27214

This PR includes all fixes from the V1 PR + logic to invalidate the lhs
cache pointers in case the pad buffer's underlying buffer has changed
due to a resize. The ARM team will look at potentially enhancing this
logic after the 1.24.0 release.

### Motivation and Context
Fix #26669
tianleiwu pushed a commit that referenced this pull request Feb 3, 2026
Successfully merging this pull request may close these issues.

[cpu] Loading certain models leads to global error state on M4 Max
