Arm backend: Linux backend for U85 direct drive #17367
base: main
Add support for a Linux version of the executor_runner targeting the corstone1000 fvp.

This commit splits backends/arm/runtime/EthosUBackend.cpp into multiple files for Cortex-M and Cortex-A.

* Cortex-M is for baremetal and Zephyr; this is mostly "old" code moved into its new home.
* Cortex-A is for the Linux version of the executor_runner, which communicates with the Linux kernel device driver. This is enabled in cmake with EXECUTORCH_BUILD_ARM_ETHOSU_LINUX=ON.

EthosUBackend.cpp is kept for code shared between the different targets.

Change-Id: I0dfdf4bff793f7c7d83e20eb4d388f3c151fbfd3
Signed-off-by: Per Held <per.held@arm.com>
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/17367
Note: Links to docs will display an error until the docs builds have been completed.
❌ 12 New Failures, 1 Cancelled Job, 17 Unrelated Failures
As of commit 5e8a9a5 with merge base 9c74c32:
NEW FAILURES - The following jobs have failed:
CANCELLED JOB - The following job was cancelled. Please retry:
FLAKY - The following jobs failed but were likely due to flakiness present on trunk:
BROKEN TRUNK - The following jobs failed but were present on the merge base:
👉 Rebase onto the `viable/strict` branch to avoid these failures
This comment was automatically generated by Dr. CI and updates every 15 minutes.
Pull request overview
Adds a Linux (Cortex-A) Ethos-U backend path for executor_runner and refactors the existing backend into shared + platform-specific implementations.
Changes:
- Split Ethos-U backend into common code plus Cortex-M (baremetal) and Cortex-A (Linux driver stack) implementations.
- Added build options and CMake plumbing to enable the Linux backend and fetch/build the Linux driver stack.
- Extended Vela binary parsing to accept an additional command stream header.
Reviewed changes
Copilot reviewed 10 out of 10 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| tools/cmake/preset/default.cmake | Adds the EXECUTORCH_BUILD_ARM_ETHOSU_LINUX option and marks it as conflicting with the baremetal build. |
| examples/arm/executor_runner/ethosu_link_helper.cpp | Adds force-link hook so the Ethos-U backend is retained in portable runner builds. |
| examples/arm/ethos-u-setup/aarch64-linux-musl-toolchain.cmake | Introduces a musl cross toolchain file for aarch64 Linux builds. |
| backends/arm/runtime/VelaBinStream.cpp | Accepts COP2 “magic header” for cmd stream blocks in vela binaries. |
| backends/arm/runtime/EthosUBackend_Internal.h | Adds shared declarations, platform hooks, and profiling macros for split backend. |
| backends/arm/runtime/EthosUBackend_Cortex_M.cpp | Moves baremetal Ethos-U driver execution into a Cortex-M specific file. |
| backends/arm/runtime/EthosUBackend_Cortex_A.cpp | Adds Linux Ethos-U driver stack execution path (device, buffers, inference). |
| backends/arm/runtime/EthosUBackend.cpp | Refactors common backend to delegate platform-specific execution via platform_execute. |
| backends/arm/CMakeLists.txt | Adds Linux driver stack fetch/configuration and builds the split platform sources. |
| CMakeLists.txt | Wires new build flags and adds runner linking helpers/options for Linux backend. |
```cpp
const size_t expand_factor = chunk_size / vela_chunk_size;
if (expand_factor == 2 && elem_size == 1 &&
    tensor_out.scalar_type() == ScalarType::Char) {
  const uint8_t* src_bytes = reinterpret_cast<const uint8_t*>(src);
  int8_t* dest = tensor_out.mutable_data_ptr<int8_t>();
  const uint8_t* chunk_src = src_bytes;
  int8_t* chunk_dest = dest;
  for (size_t chunk_idx = 0; chunk_idx < chunk_count; ++chunk_idx) {
    for (size_t byte_idx = 0; byte_idx < vela_chunk_size; ++byte_idx) {
      const uint8_t packed = chunk_src[byte_idx];
      int8_t low = static_cast<int8_t>(packed & 0x0F);
      int8_t high = static_cast<int8_t>((packed >> 4) & 0x0F);
      if (low >= 8) {
        low -= 16;
      }
      if (high >= 8) {
        high -= 16;
      }
      chunk_dest[2 * byte_idx] = low;
      chunk_dest[2 * byte_idx + 1] = high;
    }
    chunk_src += vela_chunk_size;
    chunk_dest += chunk_size;
  }
  dest[2 * byte_idx] = low;
  dest[2 * byte_idx + 1] = high;
  return Error::Ok;
}
```
```cpp
  ET_LOG(
      Error,
      "Ethos-U output %d expansion factor %zu with element size %d not supported",
      output_index,
      expand_factor,
      elem_size);
  return Error::InvalidProgram;
}
```
```cpp
Error strip_delegate_padding(
    const char* src,
    char* dest,
    size_t chunk_count,
    size_t dest_chunk_size,
    size_t src_chunk_size) const {
  if (dest_chunk_size > src_chunk_size) {
    ET_LOG(
        Error,
        "dest chunk size %zu must not exceed src chunk size %zu",
        dest_chunk_size,
        src_chunk_size);
    return Error::InvalidProgram;
  }
  if (src == nullptr || dest == nullptr) {
    ET_LOG(Error, "Ethos-U padded copy received null buffer");
    return Error::InvalidState;
  }
  for (size_t chunk_idx = 0; chunk_idx < chunk_count; ++chunk_idx) {
    memcpy(dest, src, dest_chunk_size);
    src += src_chunk_size;
    dest += dest_chunk_size;
  }
  return Error::Ok;
  if (src == nullptr) {
    ET_LOG(Error, "Ethos-U padded copy received null buffer");
    return Error::InvalidState;
  }
  char* dest = tensor_out.mutable_data_ptr<char>();
  if (dest == nullptr) {
    ET_LOG(Error, "Ethos-U padded copy received null destination");
    return Error::InvalidState;
  }
  const char* src_bytes = src;
  for (size_t chunk_idx = 0; chunk_idx < chunk_count; ++chunk_idx) {
    memcpy(dest, src_bytes, chunk_size);
    src_bytes += vela_chunk_size;
    dest += chunk_size;
  }
  return Error::Ok;
```
Copilot
AI
Feb 11, 2026
memcpy(dest, src_bytes, chunk_size) can read past the end of each source chunk when chunk_size > vela_chunk_size (e.g., dest chunk 10 bytes, src chunk 8 bytes). The previous implementation guarded against this; please reintroduce an explicit check (and error) for chunk_size > vela_chunk_size before the copy loop.
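The guard requested above can be sketched in isolation. This is a minimal illustration rather than the PR's code: `CopyStatus` and `copy_strip_padding` are hypothetical names, and the real backend would report failures with `ET_LOG` and its own `Error` enum.

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical stand-in for the runtime's error enum.
enum class CopyStatus { Ok, InvalidArgument };

// Copies `chunk_count` chunks, keeping the first `dest_chunk_size` bytes of
// each `src_chunk_size`-byte source chunk and dropping the trailing padding.
CopyStatus copy_strip_padding(
    const char* src,
    char* dest,
    std::size_t chunk_count,
    std::size_t dest_chunk_size,
    std::size_t src_chunk_size) {
  if (src == nullptr || dest == nullptr) {
    return CopyStatus::InvalidArgument;
  }
  // Without this check, memcpy would read past the end of each source chunk
  // (e.g. a 10-byte destination chunk fed from an 8-byte source chunk).
  if (dest_chunk_size > src_chunk_size) {
    return CopyStatus::InvalidArgument;
  }
  for (std::size_t i = 0; i < chunk_count; ++i) {
    std::memcpy(dest, src, dest_chunk_size);
    src += src_chunk_size;
    dest += dest_chunk_size;
  }
  return CopyStatus::Ok;
}
```

Doing the size comparison once before the loop keeps the per-chunk copy branch-free while still rejecting the overread case up front.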
```cpp
if (i < static_cast<int>(output_io_bytes.size())) {
  output_io_bytes[i] = static_cast<size_t>(io_count) *
      static_cast<size_t>(handles.outputs->io[i].elem_size);
}
```
Copilot
AI
Feb 11, 2026
output_io_bytes is derived from the delegate (VelaIO) shape and can be larger than tensor_out.nbytes() when the delegate output includes padding / packed layouts. Later, the Linux driver reads output_io_bytes[i] bytes directly into tensor_out.mutable_data_ptr<char>(), which can overflow the tensor buffer. Fix by either (a) enforcing tensor_out.nbytes() == output_io_bytes[i] and erroring otherwise, or (b) reading into a temporary buffer sized to output_io_bytes[i] and then using the existing copy_with_layout_adjustment(...) path to populate the tensor safely.
```cpp
  }
}
const size_t tensor_nbytes = tensor_out.nbytes();
if (i < static_cast<int>(output_io_bytes.size()) &&
    output_io_bytes[i] != tensor_nbytes) {
  ET_LOG(
      Error,
      "Ethos-U Linux backend output size mismatch for index %d: "
      "driver IO bytes = %zu, tensor bytes = %zu",
      i,
      output_io_bytes[i],
      tensor_nbytes);
  return Error::InvalidState;
}
```
```cpp
std::shared_ptr<EthosU::Buffer> constant_buffer =
    std::make_shared<EthosU::Buffer>();
if (handles.weight_data_size > 0) {
  auto constant_buffers = device.createBuffers({handles.weight_data_size});
  constant_buffer = constant_buffers.front();
  constant_buffer->write(
      const_cast<char*>(handles.weight_data), handles.weight_data_size);
}

std::shared_ptr<EthosU::Buffer> intermediate_buffer =
    std::make_shared<EthosU::Buffer>();
if (handles.scratch_data_size > 0) {
  auto scratch_buffers = device.createBuffers({handles.scratch_data_size});
  intermediate_buffer = scratch_buffers.front();
}
```
Copilot
AI
Feb 11, 2026
When handles.weight_data_size == 0 and/or handles.scratch_data_size == 0, this passes a default-constructed EthosU::Buffer into EthosU::Inference. If the driver stack expects these buffers to be either properly allocated or nullptr, this can lead to invalid buffer handles at runtime. Prefer initializing these as null (std::shared_ptr<EthosU::Buffer>{}) and only setting them when a real device buffer is created.
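The pattern recommended above can be shown with a stand-in type. In this sketch `FakeBuffer` and `make_buffer_or_null` are hypothetical and are not part of the Ethos-U driver stack; whether `EthosU::Inference` actually accepts a null buffer is the open question the comment raises and is not confirmed here.

```cpp
#include <cstddef>
#include <memory>

// Hypothetical stand-in for a device buffer; not the driver stack's class.
struct FakeBuffer {
  explicit FakeBuffer(std::size_t n) : size(n) {}
  std::size_t size;
};

// Returns a null pointer when there is nothing to allocate, so callers can
// tell "no buffer" apart from "a buffer object with no valid backing store".
std::shared_ptr<FakeBuffer> make_buffer_or_null(std::size_t nbytes) {
  if (nbytes == 0) {
    return nullptr;
  }
  return std::make_shared<FakeBuffer>(nbytes);
}
```

A caller can then branch on `if (buffer)` instead of inspecting the state of a default-constructed object.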
```cpp
for (int i = 0; i < input_count; ++i) {
  auto tensor_in = args[i]->toTensor();
  linux_input_ptrs[i] = tensor_in.mutable_data_ptr<char>();
```
Copilot
AI
Feb 11, 2026
Inputs are treated as read-only (written into driver buffers), but this takes a mutable pointer via mutable_data_ptr<char>(). Prefer using a const pointer (const_data_ptr<char>() if available) and keep linux_input_ptrs as const char* (it already is), avoiding accidental mutation and unnecessary const-casts downstream.
```diff
- linux_input_ptrs[i] = tensor_in.mutable_data_ptr<char>();
+ linux_input_ptrs[i] = tensor_in.const_data_ptr<char>();
```
```cpp
}

void destroy(DelegateHandle* handle) const override {
  return;
```
Copilot
AI
Feb 11, 2026
ExecutionHandle now owns/points to platform_state (Linux PlatformState contains std::string), but destroy() never runs destructors. This can leak memory (and potentially other resources held by PlatformState) across repeated loads/unloads. Consider explicitly calling the placement-new destructors (handle->~ExecutionHandle() and handle->platform_state->~PlatformState() when non-null) and returning the memory to the allocator if the runtime provides a deallocation API for delegate handles.
```diff
- return;
+ if (handle == nullptr) {
+   return;
+ }
+ // The delegate handle memory contains an ExecutionHandle constructed via
+ // placement new. Explicitly run its destructor and the destructor of the
+ // owned PlatformState (if present) to avoid leaking resources such as
+ // std::string.
+ auto* exec_handle = reinterpret_cast<ExecutionHandle*>(handle);
+ if (exec_handle->platform_state != nullptr) {
+   exec_handle->platform_state->~PlatformState();
+   exec_handle->platform_state = nullptr;
+ }
+ exec_handle->~ExecutionHandle();
```
```cmake
# Minimum version kept low so this file can be used even if the global project
# requires a newer CMake.
cmake_minimum_required(VERSION 3.20)
```
Copilot
AI
Feb 11, 2026
Calling cmake_minimum_required() inside a toolchain file can unexpectedly reset policy settings for the parent project (toolchain files are loaded very early). Prefer removing this call and, if a minimum version is needed, guarding it with an if(${CMAKE_VERSION} VERSION_LESS ...) message(FATAL_ERROR ...) check instead.
```diff
- # Minimum version kept low so this file can be used even if the global project
- # requires a newer CMake.
- cmake_minimum_required(VERSION 3.20)
+ # Minimum version requirement for using this toolchain file. Do not call
+ # cmake_minimum_required() here, as that can reset policies for the parent
+ # project; instead, guard on CMAKE_VERSION explicitly.
+ if(CMAKE_VERSION VERSION_LESS 3.20)
+   message(FATAL_ERROR "This toolchain file requires at least CMake 3.20")
+ endif()
```
```cmake
if(POLICY CMP0169)
  # Allow FetchContent_Populate to be used for source-only fetch
  cmake_policy(SET CMP0169 OLD)
endif()
```
Copilot
AI
Feb 11, 2026
Forcing CMP0169 to OLD opts into deprecated FetchContent_Populate behavior and can cause policy warnings or future breakage. Prefer updating the FetchContent usage to the NEW behavior (e.g., FetchContent_MakeAvailable / modern pattern) so you don’t need to change the policy globally for this directory.
```diff
- if(POLICY CMP0169)
-   # Allow FetchContent_Populate to be used for source-only fetch
-   cmake_policy(SET CMP0169 OLD)
- endif()
```
cc @freddan80 @per @zingo @oscarandersson8218 @digantdesai