Arm backend: Linux backend for U85 direct drive #17367

perheld · 2026-02-11T12:40:13Z

Add support for a Linux version of the executor_runner targeting the Corstone-1000 FVP.

This commit splits backends/arm/runtime/EthosUBackend.cpp into multiple files for Cortex-M and Cortex-A.

Cortex-M is for bare-metal and Zephyr; this is mostly "old" code moved into its new home.
Cortex-A is for the Linux version of the executor_runner, which communicates with the Linux kernel device driver. This is enabled in CMake with EXECUTORCH_BUILD_ARM_ETHOSU_LINUX=ON.

EthosUBackend.cpp is kept for code shared between the different targets.

Change-Id: I0dfdf4bff793f7c7d83e20eb4d388f3c151fbfd3

cc @freddan80 @per @zingo @oscarandersson8218 @digantdesai

Signed-off-by: Per Held <per.held@arm.com>

pytorch-bot · 2026-02-11T12:40:17Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/17367

📄 Preview Python docs built from this PR

Note: Links to docs will display an error until the docs builds have been completed.

❌ 12 New Failures, 1 Cancelled Job, 17 Unrelated Failures

As of commit 5e8a9a5 with merge base 9c74c32:

NEW FAILURES - The following jobs have failed:

pull / test-llama-runner-qnn-linux (fp32, qnn_16a16w, qnn) / linux-job (gh)
RuntimeError: Command docker exec -t 658d769c72841307527b5733504d5acf4820d75c81772c8963a7e9f3d7a08a4e /exec failed with exit code 137
pull / test-samsung-models-linux / linux-job (gh)
RuntimeError: Command docker exec -t 6eab5664f9020324f3deed09de6bbf3e39d388189f9d1d0dede74252539e5ce8 /exec failed with exit code 1
pull / test-samsung-quantmodels-linux / linux-job (gh)
RuntimeError: Command docker exec -t 86d50cc3f9555306e710fe9f97d9cfed3dee2cf2dce70e5636e7031e9d9822e6 /exec failed with exit code 1
trunk / test-huggingface-transformers-macos (bert|coreml_fp32_gpu|--quantize) / macos-job (gh)
RuntimeError: Command bash /Users/runner/work/_temp/exec_script failed with exit code 1
trunk / test-models-linux-aarch64 (add_mul, portable, linux.arm64.2xlarge) / linux-job (gh)
RuntimeError: Command docker exec -t f06219e3c803f43b004e86acf5620a19d7df941ee5c1d5597c6e619c1a06e92d /exec failed with exit code 1
trunk / test-models-macos-coreml (efficient_sam) / macos-job (gh)
RuntimeError: Command bash /Users/ec2-user/runner/_work/_temp/exec_script failed with exit code 1
trunk / test-models-macos-cpu (resnet50, xnnpack-quantization-delegation) / macos-job (gh)
RuntimeError: Command bash /Users/ec2-user/runner/_work/_temp/exec_script failed with exit code 1
trunk / test-qnn-model (fp32, ic3) / linux-job (gh)
RuntimeError: Command docker exec -t d315d9d84ebd08b3fd626aac9118f35ae14ea553122171df7bde68aa135b1f3f /exec failed with exit code 137
trunk / test-qnn-model (fp32, ic4) / linux-job (gh)
RuntimeError: Command docker exec -t cf57f13d57a221c25ab2a18318c923735877ae56c9ca1ef1752ae03fe4014e30 /exec failed with exit code 137
trunk / test-qnn-model (fp32, vit) / linux-job (gh)
RuntimeError: Command docker exec -t 7611f18480d59e44f2ec1279ed295e5f96d7d2d296382156763ecd032d0d2df4 /exec failed with exit code 137
trunk / test-qnn-optimum-model (fp32, pvt) / linux-job (gh)
RuntimeError: Command docker exec -t b62d27d47f91313fc071ec2ba7e6a66ae9eae2bcda04850a277a56755db5f521 /exec failed with exit code 137
trunk / test-torchao-huggingface-checkpoints (qwen3_4b, linux.arm64.2xlarge, executorch-ubuntu-22.04-gcc1... / linux-job (gh)
RuntimeError: Command docker exec -t 62da82a982dc338971cefe07303718561c7318cea0534f657a10e75fefce39ec /exec failed with exit code 1

CANCELLED JOB - The following job was cancelled. Please retry:

trunk / test-qnn-optimum-model (fp32, focalnet) / linux-job (gh)
##[error]The operation was canceled.

FLAKY - The following jobs failed but were likely due to flakiness present on trunk:

trunk / test-qnn-model (fp32, mb) / linux-job (gh) (detected as infra flaky with no log or failing log classifier)
trunk / test-qnn-model (fp32, mv2) / linux-job (gh) (detected as infra flaky with no log or failing log classifier)
trunk / test-qnn-optimum-model (fp32, distilbert) / linux-job (gh) (detected as infra flaky with no log or failing log classifier)
trunk / test-qnn-optimum-model (fp32, efficientnet) / linux-job (gh) (detected as infra flaky with no log or failing log classifier)
trunk / test-qnn-optimum-model (fp32, mobilevit_v1) / linux-job (gh) (detected as infra flaky with no log or failing log classifier)
trunk / test-qnn-optimum-model (fp32, roberta) / linux-job (gh) (detected as infra flaky with no log or failing log classifier)

BROKEN TRUNK - The following jobs failed but were present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.

Copilot

Pull request overview

Adds a Linux (Cortex-A) Ethos-U backend path for executor_runner and refactors the existing backend into shared + platform-specific implementations.

Changes:

- Split Ethos-U backend into common code plus Cortex-M (baremetal) and Cortex-A (Linux driver stack) implementations.
- Added build options and CMake plumbing to enable the Linux backend and fetch/build the Linux driver stack.
- Extended Vela binary parsing to accept an additional command stream header.

Reviewed changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 7 comments.

| File | Description |
| --- | --- |
| tools/cmake/preset/default.cmake | Adds `EXECUTORCH_BUILD_ARM_ETHOSU_LINUX` option and conflicts with baremetal build. |
| examples/arm/executor_runner/ethosu_link_helper.cpp | Adds force-link hook so the Ethos-U backend is retained in portable runner builds. |
| examples/arm/ethos-u-setup/aarch64-linux-musl-toolchain.cmake | Introduces a musl cross toolchain file for aarch64 Linux builds. |
| backends/arm/runtime/VelaBinStream.cpp | Accepts `COP2` “magic header” for cmd stream blocks in vela binaries. |
| backends/arm/runtime/EthosUBackend_Internal.h | Adds shared declarations, platform hooks, and profiling macros for split backend. |
| backends/arm/runtime/EthosUBackend_Cortex_M.cpp | Moves baremetal Ethos-U driver execution into a Cortex-M specific file. |
| backends/arm/runtime/EthosUBackend_Cortex_A.cpp | Adds Linux Ethos-U driver stack execution path (device, buffers, inference). |
| backends/arm/runtime/EthosUBackend.cpp | Refactors common backend to delegate platform-specific execution via `platform_execute`. |
| backends/arm/CMakeLists.txt | Adds Linux driver stack fetch/configuration and builds the split platform sources. |
| CMakeLists.txt | Wires new build flags and adds runner linking helpers/options for Linux backend. |


Copilot · 2026-02-11T12:42:52Z

backends/arm/runtime/EthosUBackend.cpp

+    const size_t expand_factor = chunk_size / vela_chunk_size;
+    if (expand_factor == 2 && elem_size == 1 &&
+        tensor_out.scalar_type() == ScalarType::Char) {
+      const uint8_t* src_bytes = reinterpret_cast<const uint8_t*>(src);
+      int8_t* dest = tensor_out.mutable_data_ptr<int8_t>();
+      const uint8_t* chunk_src = src_bytes;
+      int8_t* chunk_dest = dest;
+      for (size_t chunk_idx = 0; chunk_idx < chunk_count; ++chunk_idx) {
+        for (size_t byte_idx = 0; byte_idx < vela_chunk_size; ++byte_idx) {
+          const uint8_t packed = chunk_src[byte_idx];
+          int8_t low = static_cast<int8_t>(packed & 0x0F);
+          int8_t high = static_cast<int8_t>((packed >> 4) & 0x0F);
+          if (low >= 8) {
+            low -= 16;
+          }
+          if (high >= 8) {
+            high -= 16;
+          }
+          chunk_dest[2 * byte_idx] = low;
+          chunk_dest[2 * byte_idx + 1] = high;
+        }
+        chunk_src += vela_chunk_size;
+        chunk_dest += chunk_size;
      }
-      dest[2 * byte_idx] = low;
-      dest[2 * byte_idx + 1] = high;
+      return Error::Ok;
    }
+
+    ET_LOG(
+        Error,
+        "Ethos-U output %d expansion factor %zu with element size %d not supported",
+        output_index,
+        expand_factor,
+        elem_size);
+    return Error::InvalidProgram;
  }

-  Error strip_delegate_padding(
-      const char* src,
-      char* dest,
-      size_t chunk_count,
-      size_t dest_chunk_size,
-      size_t src_chunk_size) const {
-    if (dest_chunk_size > src_chunk_size) {
-      ET_LOG(
-          Error,
-          "dest chunk size %zu must not exceed src chunk size %zu",
-          dest_chunk_size,
-          src_chunk_size);
-      return Error::InvalidProgram;
-    }
-    if (src == nullptr || dest == nullptr) {
-      ET_LOG(Error, "Ethos-U padded copy received null buffer");
-      return Error::InvalidState;
-    }
-    for (size_t chunk_idx = 0; chunk_idx < chunk_count; ++chunk_idx) {
-      memcpy(dest, src, dest_chunk_size);
-      src += src_chunk_size;
-      dest += dest_chunk_size;
-    }
-    return Error::Ok;
+  if (src == nullptr) {
+    ET_LOG(Error, "Ethos-U padded copy received null buffer");
+    return Error::InvalidState;
+  }
+  char* dest = tensor_out.mutable_data_ptr<char>();
+  if (dest == nullptr) {
+    ET_LOG(Error, "Ethos-U padded copy received null destination");
+    return Error::InvalidState;
+  }
+  const char* src_bytes = src;
+  for (size_t chunk_idx = 0; chunk_idx < chunk_count; ++chunk_idx) {
+    memcpy(dest, src_bytes, chunk_size);
+    src_bytes += vela_chunk_size;
+    dest += chunk_size;
  }
+  return Error::Ok;


memcpy(dest, src_bytes, chunk_size) can read past the end of each source chunk when chunk_size > vela_chunk_size (e.g., dest chunk 10 bytes, src chunk 8 bytes). The previous implementation guarded against this; please reintroduce an explicit check (and error) for chunk_size > vela_chunk_size before the copy loop.
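The guard requested here could look roughly like the sketch below. It is only an illustration: `copy_strided_chunks` is a hypothetical free function, and the `bool` return stands in for the `Error` codes and `ET_LOG` calls used in the actual backend.

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical helper, not the ExecuTorch API. Copies chunk_count chunks of
// dest_chunk_size bytes from a source whose chunks start src_chunk_size bytes
// apart. Returns false rather than reading past the end of a source chunk.
bool copy_strided_chunks(
    const char* src,
    char* dest,
    std::size_t chunk_count,
    std::size_t dest_chunk_size,
    std::size_t src_chunk_size) {
  if (src == nullptr || dest == nullptr) {
    return false;
  }
  // Each memcpy reads dest_chunk_size bytes out of one source chunk, so the
  // source chunk must be at least that large.
  if (dest_chunk_size > src_chunk_size) {
    return false;
  }
  for (std::size_t i = 0; i < chunk_count; ++i) {
    std::memcpy(dest, src, dest_chunk_size);
    src += src_chunk_size;
    dest += dest_chunk_size;
  }
  return true;
}
```

With a destination chunk of 10 bytes and a source chunk of 8 bytes, as in the example above, this sketch returns `false` before any copy takes place.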

Copilot · 2026-02-11T12:42:52Z

backends/arm/runtime/EthosUBackend_Cortex_A.cpp

+      if (i < static_cast<int>(output_io_bytes.size())) {
+        output_io_bytes[i] = static_cast<size_t>(io_count) *
+            static_cast<size_t>(handles.outputs->io[i].elem_size);
+      }


output_io_bytes is derived from the delegate (VelaIO) shape and can be larger than tensor_out.nbytes() when the delegate output includes padding / packed layouts. Later, the Linux driver reads output_io_bytes[i] bytes directly into tensor_out.mutable_data_ptr<char>(), which can overflow the tensor buffer. Fix by either (a) enforcing tensor_out.nbytes() == output_io_bytes[i] and erroring otherwise, or (b) reading into a temporary buffer sized to output_io_bytes[i] and then using the existing copy_with_layout_adjustment(...) path to populate the tensor safely.

Suggested change

-      }
+      }
+      const size_t tensor_nbytes = tensor_out.nbytes();
+      if (i < static_cast<int>(output_io_bytes.size()) &&
+          output_io_bytes[i] != tensor_nbytes) {
+        ET_LOG(
+            Error,
+            "Ethos-U Linux backend output size mismatch for index %d: "
+            "driver IO bytes = %zu, tensor bytes = %zu",
+            i,
+            output_io_bytes[i],
+            tensor_nbytes);
+        return Error::InvalidState;
+      }

Copilot · 2026-02-11T12:42:52Z

backends/arm/runtime/EthosUBackend_Cortex_A.cpp

+    std::shared_ptr<EthosU::Buffer> constant_buffer =
+        std::make_shared<EthosU::Buffer>();
+    if (handles.weight_data_size > 0) {
+      auto constant_buffers = device.createBuffers({handles.weight_data_size});
+      constant_buffer = constant_buffers.front();
+      constant_buffer->write(
+          const_cast<char*>(handles.weight_data), handles.weight_data_size);
+    }
+
+    std::shared_ptr<EthosU::Buffer> intermediate_buffer =
+        std::make_shared<EthosU::Buffer>();
+    if (handles.scratch_data_size > 0) {
+      auto scratch_buffers = device.createBuffers({handles.scratch_data_size});
+      intermediate_buffer = scratch_buffers.front();
+    }


When handles.weight_data_size == 0 and/or handles.scratch_data_size == 0, this passes a default-constructed EthosU::Buffer into EthosU::Inference. If the driver stack expects these buffers to be either properly allocated or nullptr, this can lead to invalid buffer handles at runtime. Prefer initializing these as null (std::shared_ptr<EthosU::Buffer>{}) and only setting them when a real device buffer is created.
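A small sketch of the null-by-default pattern suggested here. `Buffer` is a hypothetical stand-in for `EthosU::Buffer`, and `make_optional_buffer` is an illustrative helper; whether the real `EthosU::Inference` accepts a null buffer is not confirmed by this thread and would need checking against the driver stack.

```cpp
#include <cstddef>
#include <memory>

// Hypothetical stand-in for EthosU::Buffer; the real class lives in the
// Linux driver stack and has a different interface.
struct Buffer {
  std::size_t size = 0;
};

// Returns a null pointer when no bytes are needed and a real buffer otherwise,
// so callers can tell "no buffer" apart from "empty placeholder buffer".
std::shared_ptr<Buffer> make_optional_buffer(std::size_t size) {
  std::shared_ptr<Buffer> buffer;  // Default-constructed shared_ptr is null.
  if (size > 0) {
    buffer = std::make_shared<Buffer>();
    buffer->size = size;
  }
  return buffer;
}
```

The caller can then test the pointer before handing it to the driver, instead of passing along an object that was never backed by a device allocation.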

Copilot · 2026-02-11T12:42:53Z

backends/arm/runtime/EthosUBackend_Cortex_A.cpp

+
+  for (int i = 0; i < input_count; ++i) {
+    auto tensor_in = args[i]->toTensor();
+    linux_input_ptrs[i] = tensor_in.mutable_data_ptr<char>();


Inputs are treated as read-only (written into driver buffers), but this takes a mutable pointer via mutable_data_ptr<char>(). Prefer using a const pointer (const_data_ptr<char>() if available) and keep linux_input_ptrs as const char* (it already is), avoiding accidental mutation and unnecessary const-casts downstream.

Suggested change

-    linux_input_ptrs[i] = tensor_in.mutable_data_ptr<char>();
+    linux_input_ptrs[i] = tensor_in.const_data_ptr<char>();

Copilot · 2026-02-11T12:42:53Z

backends/arm/runtime/EthosUBackend.cpp

  }

  void destroy(DelegateHandle* handle) const override {
    return;


ExecutionHandle now owns/points to platform_state (Linux PlatformState contains std::string), but destroy() never runs destructors. This can leak memory (and potentially other resources held by PlatformState) across repeated loads/unloads. Consider explicitly calling the placement-new destructors (handle->~ExecutionHandle() and handle->platform_state->~PlatformState() when non-null) and returning the memory to the allocator if the runtime provides a deallocation API for delegate handles.

Suggested change

-    return;
+    if (handle == nullptr) {
+      return;
+    }
+    // The delegate handle memory contains an ExecutionHandle constructed via
+    // placement new. Explicitly run its destructor and the destructor of the
+    // owned PlatformState (if present) to avoid leaking resources such as
+    // std::string.
+    auto* exec_handle = reinterpret_cast<ExecutionHandle*>(handle);
+    if (exec_handle->platform_state != nullptr) {
+      exec_handle->platform_state->~PlatformState();
+      exec_handle->platform_state = nullptr;
+    }
+    exec_handle->~ExecutionHandle();
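The placement-new lifetime issue behind this comment can be reproduced in isolation. The sketch below uses hypothetical stand-ins for `PlatformState` and `ExecutionHandle` (the PR's real types have more members), and a global counter that exists only to make the destructor call observable.

```cpp
#include <new>
#include <string>

// Only here to make the destructor call observable in this sketch.
int g_states_destroyed = 0;

// Hypothetical stand-ins for the PR's types.
struct PlatformState {
  std::string device_path;
  ~PlatformState() { ++g_states_destroyed; }
};

struct ExecutionHandle {
  PlatformState* platform_state = nullptr;
};

// Constructs both objects in caller-provided storage, mimicking an arena
// allocator that never runs destructors on its own.
ExecutionHandle* create_handle(
    void* handle_mem, void* state_mem, const char* path) {
  auto* state = new (state_mem) PlatformState{path};
  auto* handle = new (handle_mem) ExecutionHandle{};
  handle->platform_state = state;
  return handle;
}

// Runs the destructors explicitly. The storage itself is left to its owner.
void destroy_handle(ExecutionHandle* handle) {
  if (handle == nullptr) {
    return;
  }
  if (handle->platform_state != nullptr) {
    handle->platform_state->~PlatformState();
    handle->platform_state = nullptr;
  }
  handle->~ExecutionHandle();
}
```

Without the explicit `~PlatformState()` call, the heap memory owned by the `std::string` member would never be released, which is the leak described above.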

Copilot · 2026-02-11T12:42:53Z

examples/arm/ethos-u-setup/aarch64-linux-musl-toolchain.cmake

+# Minimum version kept low so this file can be used even if the global project
+# requires a newer CMake.
+cmake_minimum_required(VERSION 3.20)


Calling cmake_minimum_required() inside a toolchain file can unexpectedly reset policy settings for the parent project (toolchain files are loaded very early). Prefer removing this call and, if needed, guarding features with an if(${CMAKE_VERSION} VERSION_LESS ...) message(FATAL_ERROR ...) check instead.

Suggested change

-# Minimum version kept low so this file can be used even if the global project
-# requires a newer CMake.
-cmake_minimum_required(VERSION 3.20)
+# Minimum version requirement for using this toolchain file. Do not call
+# cmake_minimum_required() here, as that can reset policies for the parent
+# project; instead, guard on CMAKE_VERSION explicitly.
+if(CMAKE_VERSION VERSION_LESS 3.20)
+  message(FATAL_ERROR "This toolchain file requires at least CMake 3.20")
+endif()

Copilot · 2026-02-11T12:42:54Z

backends/arm/CMakeLists.txt

+if(POLICY CMP0169)
+  # Allow FetchContent_Populate to be used for source-only fetch
+  cmake_policy(SET CMP0169 OLD)
+endif()
+


Forcing CMP0169 to OLD opts into deprecated FetchContent_Populate behavior and can cause policy warnings or future breakage. Prefer updating the FetchContent usage to the NEW behavior (e.g., FetchContent_MakeAvailable / modern pattern) so you don’t need to change the policy globally for this directory.

Suggested change

-if(POLICY CMP0169)
-  # Allow FetchContent_Populate to be used for source-only fetch
-  cmake_policy(SET CMP0169 OLD)
-endif()

Copilot AI review requested due to automatic review settings February 11, 2026 12:40

perheld requested review from digantdesai, kirklandsign and larryliu0820 as code owners February 11, 2026 12:40

perheld added the `partner: arm`, `ciflow/trunk`, and `release notes: arm` labels Feb 11, 2026

meta-cla bot added the `CLA Signed` label Feb 11, 2026

Copilot AI reviewed Feb 11, 2026


