
Enhancing GPU Visibility for AI Workloads created with Kubeflow SDK #165

@andreyvelich

Description

What you would like to be added?

During KubeCon + CloudNativeCon NA 2025, several users reached out to discuss challenges around GPU failures and the need for better observability of GPU utilization when running AI workloads.

A few potential approaches were proposed, including:

  1. Deeper integration with the PyTorch Profiler to capture GPU utilization per neural-network layer (see the profiler sketch after this list): https://docs.pytorch.org/tutorials/recipes/recipes/profiler_recipe.html
  2. Providing optional SSH access to the master node so users can run tools like nvidia-smi for real-time GPU monitoring (e.g. via a new TrainerClient() API); a polling sketch follows below.
  3. Any other ideas?
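
To make option 1 concrete, here is a minimal sketch of what a profiler-based capture could look like, assuming plain PyTorch and the public torch.profiler API; the toy model, tensor shapes, and the "forward_pass" label are illustrative only and not part of any proposed SDK interface:

```python
# Sketch: capture per-operator GPU time with the PyTorch Profiler.
# Assumption: a plain PyTorch script; nothing here is a Kubeflow SDK API.
import torch
import torch.nn as nn
from torch.profiler import profile, record_function, ProfilerActivity

model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 10))
inputs = torch.randn(32, 512)

activities = [ProfilerActivity.CPU]
if torch.cuda.is_available():
    model, inputs = model.cuda(), inputs.cuda()
    activities.append(ProfilerActivity.CUDA)

with profile(activities=activities, record_shapes=True, profile_memory=True) as prof:
    with record_function("forward_pass"):
        model(inputs)

# Show the operators that spent the most time on the GPU
# (falls back to CPU time when no GPU is available).
sort_key = "cuda_time_total" if torch.cuda.is_available() else "cpu_time_total"
print(prof.key_averages().table(sort_by=sort_key, row_limit=10))
```

Something like this per-operator table (or the profiler's Chrome trace export) is the kind of breakdown the SDK could collect and surface back to the user per training step.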

Let’s use this issue to brainstorm solutions that can improve visibility into GPU utilization and failure modes, enabling users to better tune their workloads and cluster configurations.
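
As a reference point for option 2, below is a minimal sketch of the kind of polling loop that SSH access to the master node would enable. It assumes only the standard nvidia-smi CLI; the function name, query fields, and 5-second interval are illustrative, and nothing here reflects an agreed-on TrainerClient() method:

```python
# Sketch: sample nvidia-smi periodically and report per-GPU utilization/memory.
# Assumption: runs where nvidia-smi is available (e.g. on the master node);
# the SSH wiring from the SDK side is intentionally left out.
import subprocess
import time

QUERY = "index,utilization.gpu,memory.used,memory.total"


def sample_gpus():
    """Return (gpu_index, util_pct, mem_used_mib, mem_total_mib) for each GPU."""
    out = subprocess.check_output(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        text=True,
    )
    rows = []
    for line in out.strip().splitlines():
        idx, util, used, total = (field.strip() for field in line.split(","))
        rows.append((int(idx), int(util), int(used), int(total)))
    return rows


if __name__ == "__main__":
    # Poll for about a minute; a real integration would stream these samples
    # back to the user instead of printing them.
    for _ in range(12):
        for idx, util, used, total in sample_gpus():
            print(f"GPU {idx}: {util}% util, {used}/{total} MiB")
        time.sleep(5)
```

The same loop could run over an SSH channel (or kubectl exec) from the client side; the open question for this issue is what the TrainerClient() surface for that should look like.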

cc @kubeflow/kubeflow-sdk-team

Why is this needed?

Better visibility into GPU utilization and failure modes helps users tune their AI workloads and cluster configurations, and ultimately improve GPU utilization.

Love this feature?

Give it a 👍. We prioritize the features with the most 👍.
