What would you like to be added?
During KubeCon + CloudNativeCon NA 2025, several users reached out to discuss challenges around GPU failures and the need for better observability of GPU utilization when running AI workloads.
A few potential approaches were proposed, including:
- Deeper integration with the PyTorch Profiler to capture per–neural-network-layer GPU utilization (see the sketch after this list): https://docs.pytorch.org/tutorials/recipes/recipes/profiler_recipe.html
- Providing optional SSH access to the MASTER node so users can run tools like nvidia-smi for real-time GPU monitoring (e.g. via a new TrainerClient() API)
- Any other ideas?
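
For context, here is a minimal sketch of what the PyTorch Profiler approach could surface, roughly following the linked recipe. The model, input batch, and `forward_pass` label are placeholders; how (or whether) the SDK would wire this into a training job and export the results is exactly what's up for discussion here.

```python
import torch
import torch.nn as nn
from torch.profiler import profile, record_function, ProfilerActivity

# Placeholder model and batch, only to illustrate the mechanics.
model = nn.Sequential(nn.Linear(1024, 1024), nn.ReLU(), nn.Linear(1024, 10)).cuda()
inputs = torch.randn(64, 1024, device="cuda")

with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    record_shapes=True,
) as prof:
    with record_function("forward_pass"):  # label a region so it shows up by name
        model(inputs)

# Aggregated per-operator GPU time; sorting by CUDA time highlights the
# layers/ops that dominate GPU usage.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
```

An SDK integration could collect and export these tables (or full traces) from each training node, but the exact shape of that API is open for discussion.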
Let’s use this issue to brainstorm solutions that can improve visibility into GPU utilization and failure modes, enabling users to better tune their workloads and cluster configurations.
cc @kubeflow/kubeflow-sdk-team
Why is this needed?
Improve the GPU utilization of AI workloads through better visibility into utilization and failure modes.
Love this feature?
Give it a 👍. We prioritize the features with the most 👍.