What would you like to be added?
During KubeCon + CloudNativeCon NA 2025, several users reached out to discuss challenges around GPU failures and the need for better observability of GPU utilization when running AI workloads.
A few potential approaches were proposed, including:
- Deeper integration with the PyTorch Profiler to capture per–neural-network-layer GPU utilization (see the sketch after this list): https://docs.pytorch.org/tutorials/recipes/recipes/profiler_recipe.html
- Providing optional SSH access to the MASTER node so users can run tools like nvidia-smi for real-time GPU monitoring (e.g. via a new TrainerClient() API)
- Any other ideas?
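
For context, here is a minimal sketch of what the PyTorch Profiler approach could surface, roughly following the linked recipe. The model, input batch, and `forward_pass` label are placeholders; how (or whether) the SDK would wire this into a training job and export the results is exactly what's up for discussion here.

```python
import torch
import torch.nn as nn
from torch.profiler import profile, record_function, ProfilerActivity

# Placeholder model and batch, only to illustrate the mechanics.
model = nn.Sequential(nn.Linear(1024, 1024), nn.ReLU(), nn.Linear(1024, 10)).cuda()
inputs = torch.randn(64, 1024, device="cuda")

with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    record_shapes=True,
) as prof:
    with record_function("forward_pass"):  # label a region so it shows up by name
        model(inputs)

# Aggregated per-operator GPU time; sorting by CUDA time highlights the
# layers/ops that dominate GPU usage.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
```

An SDK integration could collect and export these tables (or full traces) from each training node, but the exact shape of that API is open for discussion.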
Let’s use this issue to brainstorm solutions that can improve visibility into GPU utilization and failure modes, enabling users to better tune their workloads and cluster configurations.
cc @kubeflow/kubeflow-sdk-team
Why is this needed?
Improve the GPU utilization of AI workloads through better visibility into utilization and failure modes.
Love this feature?
Give it a 👍. We prioritize the features with the most 👍.