@perheld commented Feb 11, 2026

Add support for a Linux version of the executor_runner targeting the Corstone-1000 FVP.

This commit splits the backends/arm/runtime/EthosUBackend.cpp into multiple files for Cortex-M and Cortex-A.

  • Cortex-M is for baremetal and Zephyr; this is mostly "old" code moved into its new home.

  • Cortex-A is for the Linux version of the executor_runner to communicate with the Linux kernel device driver. This is enabled in cmake with EXECUTORCH_BUILD_ARM_ETHOSU_LINUX=ON.

EthosUBackend.cpp is kept for code shared between the different targets.

Change-Id: I0dfdf4bff793f7c7d83e20eb4d388f3c151fbfd3

cc @freddan80 @per @zingo @oscarandersson8218 @digantdesai

Signed-off-by: Per Held <per.held@arm.com>
Copilot AI review requested due to automatic review settings February 11, 2026 12:40
@perheld added the partner: arm, ciflow/trunk, and release notes: arm labels Feb 11, 2026

pytorch-bot bot commented Feb 11, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/17367

Note: Links to docs will display an error until the docs builds have been completed.

❌ 12 New Failures, 1 Cancelled Job, 17 Unrelated Failures

As of commit 5e8a9a5 with merge base 9c74c32 (image):

NEW FAILURES - The following jobs have failed:

CANCELLED JOB - The following job was cancelled. Please retry:

FLAKY - The following jobs failed but were likely due to flakiness present on trunk:

BROKEN TRUNK - The following jobs failed but were present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@meta-cla bot added the CLA Signed label Feb 11, 2026

Copilot AI left a comment

Pull request overview

Adds a Linux (Cortex-A) Ethos-U backend path for executor_runner and refactors the existing backend into shared + platform-specific implementations.

Changes:

  • Split Ethos-U backend into common code plus Cortex-M (baremetal) and Cortex-A (Linux driver stack) implementations.
  • Added build options and CMake plumbing to enable the Linux backend and fetch/build the Linux driver stack.
  • Extended Vela binary parsing to accept an additional command stream header.

Reviewed changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 7 comments.

| File | Description |
| --- | --- |
| tools/cmake/preset/default.cmake | Adds EXECUTORCH_BUILD_ARM_ETHOSU_LINUX option and conflicts with baremetal build. |
| examples/arm/executor_runner/ethosu_link_helper.cpp | Adds force-link hook so the Ethos-U backend is retained in portable runner builds. |
| examples/arm/ethos-u-setup/aarch64-linux-musl-toolchain.cmake | Introduces a musl cross toolchain file for aarch64 Linux builds. |
| backends/arm/runtime/VelaBinStream.cpp | Accepts COP2 “magic header” for cmd stream blocks in vela binaries. |
| backends/arm/runtime/EthosUBackend_Internal.h | Adds shared declarations, platform hooks, and profiling macros for split backend. |
| backends/arm/runtime/EthosUBackend_Cortex_M.cpp | Moves baremetal Ethos-U driver execution into a Cortex-M specific file. |
| backends/arm/runtime/EthosUBackend_Cortex_A.cpp | Adds Linux Ethos-U driver stack execution path (device, buffers, inference). |
| backends/arm/runtime/EthosUBackend.cpp | Refactors common backend to delegate platform-specific execution via platform_execute. |
| backends/arm/CMakeLists.txt | Adds Linux driver stack fetch/configuration and builds the split platform sources. |
| CMakeLists.txt | Wires new build flags and adds runner linking helpers/options for Linux backend. |

Comment on lines +329 to +380
const size_t expand_factor = chunk_size / vela_chunk_size;
if (expand_factor == 2 && elem_size == 1 &&
tensor_out.scalar_type() == ScalarType::Char) {
const uint8_t* src_bytes = reinterpret_cast<const uint8_t*>(src);
int8_t* dest = tensor_out.mutable_data_ptr<int8_t>();
const uint8_t* chunk_src = src_bytes;
int8_t* chunk_dest = dest;
for (size_t chunk_idx = 0; chunk_idx < chunk_count; ++chunk_idx) {
for (size_t byte_idx = 0; byte_idx < vela_chunk_size; ++byte_idx) {
const uint8_t packed = chunk_src[byte_idx];
int8_t low = static_cast<int8_t>(packed & 0x0F);
int8_t high = static_cast<int8_t>((packed >> 4) & 0x0F);
if (low >= 8) {
low -= 16;
}
if (high >= 8) {
high -= 16;
}
chunk_dest[2 * byte_idx] = low;
chunk_dest[2 * byte_idx + 1] = high;
}
chunk_src += vela_chunk_size;
chunk_dest += chunk_size;
}
dest[2 * byte_idx] = low;
dest[2 * byte_idx + 1] = high;
return Error::Ok;
}

ET_LOG(
Error,
"Ethos-U output %d expansion factor %zu with element size %d not supported",
output_index,
expand_factor,
elem_size);
return Error::InvalidProgram;
}

Error strip_delegate_padding(
const char* src,
char* dest,
size_t chunk_count,
size_t dest_chunk_size,
size_t src_chunk_size) const {
if (dest_chunk_size > src_chunk_size) {
ET_LOG(
Error,
"dest chunk size %zu must not exceed src chunk size %zu",
dest_chunk_size,
src_chunk_size);
return Error::InvalidProgram;
}
if (src == nullptr || dest == nullptr) {
ET_LOG(Error, "Ethos-U padded copy received null buffer");
return Error::InvalidState;
}
for (size_t chunk_idx = 0; chunk_idx < chunk_count; ++chunk_idx) {
memcpy(dest, src, dest_chunk_size);
src += src_chunk_size;
dest += dest_chunk_size;
}
return Error::Ok;
if (src == nullptr) {
ET_LOG(Error, "Ethos-U padded copy received null buffer");
return Error::InvalidState;
}
char* dest = tensor_out.mutable_data_ptr<char>();
if (dest == nullptr) {
ET_LOG(Error, "Ethos-U padded copy received null destination");
return Error::InvalidState;
}
const char* src_bytes = src;
for (size_t chunk_idx = 0; chunk_idx < chunk_count; ++chunk_idx) {
memcpy(dest, src_bytes, chunk_size);
src_bytes += vela_chunk_size;
dest += chunk_size;
}
return Error::Ok;

Copilot AI Feb 11, 2026

memcpy(dest, src_bytes, chunk_size) can read past the end of each source chunk when chunk_size > vela_chunk_size (e.g., dest chunk 10 bytes, src chunk 8 bytes). The previous implementation guarded against this; please reintroduce an explicit check (and error) for chunk_size > vela_chunk_size before the copy loop.
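As a rough illustration of the guard requested above, the following standalone sketch copies padded source chunks into a packed destination and rejects the case where the destination chunk is larger than the source chunk. The names `CopyStatus` and `copy_strip_padding` are hypothetical stand-ins rather than identifiers from this PR, and the real code would return the runtime's `Error` values and log through `ET_LOG`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical status type standing in for the runtime's Error enum.
enum class CopyStatus { Ok, InvalidArgument };

// Copies chunk_count chunks from a padded source into a tightly packed
// destination, dropping the trailing padding of each source chunk.
CopyStatus copy_strip_padding(
    const char* src,
    char* dest,
    std::size_t chunk_count,
    std::size_t dest_chunk_size,
    std::size_t src_chunk_size) {
  if (src == nullptr || dest == nullptr) {
    return CopyStatus::InvalidArgument;
  }
  // Guard: each memcpy reads dest_chunk_size bytes from a source chunk that
  // only holds src_chunk_size bytes, so the destination chunk must not be
  // larger than the source chunk.
  if (dest_chunk_size > src_chunk_size) {
    return CopyStatus::InvalidArgument;
  }
  for (std::size_t i = 0; i < chunk_count; ++i) {
    std::memcpy(dest, src, dest_chunk_size);
    src += src_chunk_size;   // skip payload plus padding in the source
    dest += dest_chunk_size; // destination is packed, no padding
  }
  return CopyStatus::Ok;
}
```

With this check in place, the 10-byte destination chunk and 8-byte source chunk from the example above would be rejected before any bytes are copied.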

Copilot uses AI. Check for mistakes.
if (i < static_cast<int>(output_io_bytes.size())) {
output_io_bytes[i] = static_cast<size_t>(io_count) *
static_cast<size_t>(handles.outputs->io[i].elem_size);
}

Copilot AI Feb 11, 2026

output_io_bytes is derived from the delegate (VelaIO) shape and can be larger than tensor_out.nbytes() when the delegate output includes padding / packed layouts. Later, the Linux driver reads output_io_bytes[i] bytes directly into tensor_out.mutable_data_ptr<char>(), which can overflow the tensor buffer. Fix by either (a) enforcing tensor_out.nbytes() == output_io_bytes[i] and erroring otherwise, or (b) reading into a temporary buffer sized to output_io_bytes[i] and then using the existing copy_with_layout_adjustment(...) path to populate the tensor safely.

Suggested change
- }
+ }
+ const size_t tensor_nbytes = tensor_out.nbytes();
+ if (i < static_cast<int>(output_io_bytes.size()) &&
+     output_io_bytes[i] != tensor_nbytes) {
+   ET_LOG(
+       Error,
+       "Ethos-U Linux backend output size mismatch for index %d: "
+       "driver IO bytes = %zu, tensor bytes = %zu",
+       i,
+       output_io_bytes[i],
+       tensor_nbytes);
+   return Error::InvalidState;
+ }

Comment on lines +192 to +206
std::shared_ptr<EthosU::Buffer> constant_buffer =
std::make_shared<EthosU::Buffer>();
if (handles.weight_data_size > 0) {
auto constant_buffers = device.createBuffers({handles.weight_data_size});
constant_buffer = constant_buffers.front();
constant_buffer->write(
const_cast<char*>(handles.weight_data), handles.weight_data_size);
}

std::shared_ptr<EthosU::Buffer> intermediate_buffer =
std::make_shared<EthosU::Buffer>();
if (handles.scratch_data_size > 0) {
auto scratch_buffers = device.createBuffers({handles.scratch_data_size});
intermediate_buffer = scratch_buffers.front();
}

Copilot AI Feb 11, 2026

When handles.weight_data_size == 0 and/or handles.scratch_data_size == 0, this passes a default-constructed EthosU::Buffer into EthosU::Inference. If the driver stack expects these buffers to be either properly allocated or nullptr, this can lead to invalid buffer handles at runtime. Prefer initializing these as null (std::shared_ptr<EthosU::Buffer>{}) and only setting them when a real device buffer is created.
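The difference between a null pointer and a default-constructed object can be sketched in isolation as follows. `DeviceBuffer` and `make_optional_buffer` are hypothetical names used only for illustration; they are not part of the `EthosU::Buffer` API, and whether the driver stack's `EthosU::Inference` accepts a null buffer pointer is an assumption that would need to be checked against the driver library.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical stand-in for a device buffer; not the EthosU::Buffer API.
struct DeviceBuffer {
  explicit DeviceBuffer(std::size_t size) : storage(size) {}
  std::vector<char> storage;
};

// Returns a null pointer when no bytes are requested, so callers can tell
// "no buffer" apart from "allocated buffer" without inspecting the object.
std::shared_ptr<DeviceBuffer> make_optional_buffer(std::size_t size) {
  if (size == 0) {
    return nullptr;
  }
  return std::make_shared<DeviceBuffer>(size);
}
```

A caller can then test the pointer directly (`if (buffer) { ... }`) rather than relying on the state of a default-constructed buffer.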


for (int i = 0; i < input_count; ++i) {
auto tensor_in = args[i]->toTensor();
linux_input_ptrs[i] = tensor_in.mutable_data_ptr<char>();

Copilot AI Feb 11, 2026

Inputs are treated as read-only (written into driver buffers), but this takes a mutable pointer via mutable_data_ptr<char>(). Prefer using a const pointer (const_data_ptr<char>() if available) and keep linux_input_ptrs as const char* (it already is), avoiding accidental mutation and unnecessary const-casts downstream.

Suggested change
- linux_input_ptrs[i] = tensor_in.mutable_data_ptr<char>();
+ linux_input_ptrs[i] = tensor_in.const_data_ptr<char>();

}

void destroy(DelegateHandle* handle) const override {
return;

Copilot AI Feb 11, 2026

ExecutionHandle now owns/points to platform_state (Linux PlatformState contains std::string), but destroy() never runs destructors. This can leak memory (and potentially other resources held by PlatformState) across repeated loads/unloads. Consider explicitly calling the placement-new destructors (handle->~ExecutionHandle() and handle->platform_state->~PlatformState() when non-null) and returning the memory to the allocator if the runtime provides a deallocation API for delegate handles.

Suggested change
- return;
+ if (handle == nullptr) {
+   return;
+ }
+ // The delegate handle memory contains an ExecutionHandle constructed via
+ // placement new. Explicitly run its destructor and the destructor of the
+ // owned PlatformState (if present) to avoid leaking resources such as
+ // std::string.
+ auto* exec_handle = reinterpret_cast<ExecutionHandle*>(handle);
+ if (exec_handle->platform_state != nullptr) {
+   exec_handle->platform_state->~PlatformState();
+   exec_handle->platform_state = nullptr;
+ }
+ exec_handle->~ExecutionHandle();
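
The pattern behind this suggestion can be shown as a standalone sketch: objects built with placement new in caller-owned storage need their destructors run explicitly, because nothing else will do it. `State`, `Handle`, `init_handle`, and `destroy_handle` are hypothetical stand-ins for the PR's `PlatformState`, `ExecutionHandle`, and backend methods; the live-object counter exists only to make the effect observable.

```cpp
#include <cassert>
#include <cstddef>
#include <new>
#include <string>

// Hypothetical stand-in for PlatformState: owns a std::string and tracks
// how many instances are alive through an external counter.
struct State {
  explicit State(int* counter) : live(counter) { ++*live; }
  ~State() { --*live; }
  std::string device_path = "/dev/hypothetical0";
  int* live;
};

// Hypothetical stand-in for ExecutionHandle.
struct Handle {
  State* state = nullptr;
};

// Constructs both objects in caller-provided storage, as an arena-style
// allocator would; no heap allocation is made for the objects themselves.
Handle* init_handle(void* handle_mem, void* state_mem, int* counter) {
  auto* handle = new (handle_mem) Handle();
  handle->state = new (state_mem) State(counter);
  return handle;
}

// Runs the destructors explicitly; the storage stays owned by the caller.
void destroy_handle(Handle* handle) {
  if (handle == nullptr) {
    return;
  }
  if (handle->state != nullptr) {
    handle->state->~State();
    handle->state = nullptr;
  }
  handle->~Handle();
}
```

Without the explicit `~State()` call, the `std::string` member would never release its heap storage, which is the leak described in the comment.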

Comment on lines +6 to +8
# Minimum version kept low so this file can be used even if the global project
# requires a newer CMake.
cmake_minimum_required(VERSION 3.20)

Copilot AI Feb 11, 2026

Calling cmake_minimum_required() inside a toolchain file can unexpectedly reset policy settings for the parent project (toolchain files are loaded very early). Prefer removing this call and, if a minimum version is needed, guarding on it with an if(CMAKE_VERSION VERSION_LESS ...) message(FATAL_ERROR ...) check instead.

Suggested change
- # Minimum version kept low so this file can be used even if the global project
- # requires a newer CMake.
- cmake_minimum_required(VERSION 3.20)
+ # Minimum version requirement for using this toolchain file. Do not call
+ # cmake_minimum_required() here, as that can reset policies for the parent
+ # project; instead, guard on CMAKE_VERSION explicitly.
+ if(CMAKE_VERSION VERSION_LESS 3.20)
+   message(FATAL_ERROR "This toolchain file requires at least CMake 3.20")
+ endif()

Comment on lines +17 to +21
if(POLICY CMP0169)
# Allow FetchContent_Populate to be used for source-only fetch
cmake_policy(SET CMP0169 OLD)
endif()

Copilot AI Feb 11, 2026

Forcing CMP0169 to OLD opts into deprecated FetchContent_Populate behavior and can cause policy warnings or future breakage. Prefer updating the FetchContent usage to the NEW behavior (e.g., FetchContent_MakeAvailable / modern pattern) so you don’t need to change the policy globally for this directory.

Suggested change
- if(POLICY CMP0169)
-   # Allow FetchContent_Populate to be used for source-only fetch
-   cmake_policy(SET CMP0169 OLD)
- endif()
endif()
