Feature request
Feature description
Currently, nav2_mppi relies heavily on the CPU for trajectory sampling and evaluation. While this is efficient on desktop-class CPUs, it becomes a significant bottleneck on edge computing platforms such as the NVIDIA Jetson Orin, especially in high-density obstacle environments or when the number of sampled trajectories is large.
I propose adding an optional CUDA-accelerated backend that offloads these computations to the GPU, significantly improving real-time performance on ARM-based SoC platforms.
Implementation considerations
I suggest implementing this as an optional, plugin-based optimization. Key technical points include:

- Parallel Computing: Use cuRAND for parallel noise generation and custom CUDA kernels for trajectory rollouts and cost scoring (see the sketch after this list).
- Memory Optimization: Leverage Unified Memory (managed memory) to minimize host-to-device data transfer overhead, specifically targeting the shared-memory architecture of Jetson devices.
- Build System: Gate the CUDA backend behind a CMake flag (e.g., `-DENABLE_CUDA=ON`), ensuring full backward compatibility and no additional dependencies for non-NVIDIA users (see the CMake sketch below).
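As a rough sketch of what the sampling and rollout stage could look like: the example below draws all control perturbations in one batched cuRAND call, stores them in Unified Memory, and evaluates one trajectory per thread. The constants (`K`, `T`), the toy unicycle model, and the placeholder cost are all illustrative assumptions, not existing nav2_mppi interfaces.

```cuda
#include <cuda_runtime.h>
#include <curand.h>
#include <cstdio>

constexpr int K = 2048;  // number of sampled trajectories (illustrative)
constexpr int T = 56;    // timesteps per trajectory (illustrative)
constexpr int U = 2;     // control dims: linear and angular velocity

// One thread per trajectory: integrate a toy unicycle model over the noisy
// control sequence and accumulate a placeholder cost. A real backend would
// evaluate the controller's critics here instead.
__global__ void rollout_kernel(const float* noise, float* costs) {
    int k = blockIdx.x * blockDim.x + threadIdx.x;
    if (k >= K) return;
    float x = 0.f, y = 0.f, yaw = 0.f, cost = 0.f;
    const float dt = 0.05f;
    for (int t = 0; t < T; ++t) {
        float v = 0.5f + noise[(k * T + t) * U + 0];  // nominal + perturbation
        float w = 0.0f + noise[(k * T + t) * U + 1];
        x += v * cosf(yaw) * dt;
        y += v * sinf(yaw) * dt;
        yaw += w * dt;
        cost += x * x + y * y;  // stand-in for obstacle/goal scoring
    }
    costs[k] = cost;
}

int main() {
    float *noise, *costs;
    // Unified Memory: on Jetson the same physical DRAM backs CPU and GPU,
    // so this avoids explicit host<->device copies of the sampled batch.
    cudaMallocManaged(&noise, K * T * U * sizeof(float));
    cudaMallocManaged(&costs, K * sizeof(float));

    curandGenerator_t gen;
    curandCreateGenerator(&gen, CURAND_RNG_PSEUDO_PHILOX4_32_10);
    curandSetPseudoRandomGeneratorSeed(gen, 42ULL);
    // Draw all K*T*U control perturbations in one batched call.
    curandGenerateNormal(gen, noise, K * T * U, 0.0f /*mean*/, 0.2f /*stddev*/);

    rollout_kernel<<<(K + 255) / 256, 256>>>(noise, costs);
    cudaDeviceSynchronize();  // required before the CPU reads managed memory
    printf("cost[0] = %f\n", costs[0]);

    curandDestroyGenerator(gen);
    cudaFree(noise);
    cudaFree(costs);
    return 0;
}
```

Generating the whole noise batch in a single cuRAND call keeps the GPU saturated and avoids per-trajectory launch overhead, which is where the CPU implementation spends most of its time at high sample counts.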
Pros:
- Much higher sampling density (e.g., $K > 2000$).
- Significantly lower latency and higher control frequency.
- Reduced CPU overhead, freeing cycles for other critical tasks such as perception or localization.

Cons:
- An additional build-time dependency on the CUDA Toolkit, but only for developers who explicitly enable this feature.
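To make the build-system gating concrete, here is a minimal CMake sketch; the `ENABLE_CUDA` option, the `mppi_cuda_backend` and `mppi_controller` target names, and the source path are hypothetical placeholders, not the actual nav2_mppi build layout.

```cmake
# Hypothetical sketch: gate the CUDA backend behind an opt-in flag (OFF by
# default), so non-NVIDIA users build exactly what they build today.
option(ENABLE_CUDA "Build the optional CUDA backend for the MPPI controller" OFF)

if(ENABLE_CUDA)
  enable_language(CUDA)
  find_package(CUDAToolkit REQUIRED)

  add_library(mppi_cuda_backend SHARED src/cuda/trajectory_sampler.cu)
  target_link_libraries(mppi_cuda_backend PRIVATE CUDA::curand CUDA::cudart)

  # The existing CPU controller only sees the backend when the flag is on.
  target_compile_definitions(mppi_controller PRIVATE MPPI_HAS_CUDA)
  target_link_libraries(mppi_controller PRIVATE mppi_cuda_backend)
endif()
```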
Recent research, such as "MPPI-Generic: A CUDA Library for Stochastic Trajectory Optimization" (arXiv:2409.07563), has already demonstrated the feasibility and performance gains of such an approach. I am a robotics algorithm engineer and would be happy to contribute the implementation and open a PR for this feature.