
fix(mem): correct GPU memory accounting (host vs container) and memory limits accordingly#153

Open
loiht2 wants to merge 2 commits into Project-HAMi:main from loiht2:fix/container-memory


loiht2 commented Jan 23, 2026

  • Fixes incorrect GPU memory reporting inside the container vs. on the host (as shown by nvidia-smi).
  • Enforces GPU memory limits using the corrected container-visible memory, preventing incorrect OOM enforcement.

GPU Memory Usage

In container

Command: nvidia-smi
[Screenshot: nvidia-smi output inside the container]

Command: nvidia-smi -a

    FB Memory Usage
        Total                             : 3072 MiB
        Reserved                          : 274 MiB
        Used                              : 2584 MiB
        Free                              : 488 MiB
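
As a rough illustration of the container-side numbers above (a sketch, not the code in this PR): the container's Total is its own memory limit rather than the physical GPU size, and Free is consistent with Total minus Used. The `FbMemory` class and `container_view` helper are hypothetical names introduced only for this example, and deriving Free as `total - used` is an assumption inferred from the figures shown.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FbMemory:
    """Frame-buffer memory figures in MiB, as nvidia-smi -a lists them."""
    total_mib: int
    reserved_mib: int
    used_mib: int

    @property
    def free_mib(self) -> int:
        # Assumption: Free is Total minus Used, clamped at zero.
        return max(self.total_mib - self.used_mib, 0)


def container_view(limit_mib: int, reserved_mib: int, used_mib: int) -> FbMemory:
    """Hypothetical helper: report the container's limit as Total."""
    return FbMemory(total_mib=limit_mib, reserved_mib=reserved_mib, used_mib=used_mib)


# Figures taken from the in-container nvidia-smi -a output above.
view = container_view(limit_mib=3072, reserved_mib=274, used_mib=2584)
print(view.free_mib)  # 488
```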

On host

Command: nvidia-smi
[Screenshot: nvidia-smi output on the host]

Command: nvidia-smi -a

    FB Memory Usage
        Total                             : 32768 MiB
        Reserved                          : 274 MiB
        Used                              : 2588 MiB
        Free                              : 29907 MiB

GPU Memory Limit Enforcement (OOM scenario)

I ran the same pod, which needs ~2588 MiB of GPU memory. In this test, however, the ResourceClaim requests only 2 GiB (2048 MiB) of GPU memory, so the pod hits a GPU OOM.
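
The arithmetic behind this scenario can be sketched as follows. This is only an illustration of a limit check under stated assumptions, not the enforcement code in this PR: `exceeds_limit` is a hypothetical helper, and the rule "reject when current usage plus the new request is above the limit" is an assumption about how such a check might look.

```python
MIB_PER_GIB = 1024


def exceeds_limit(limit_mib: int, used_mib: int, request_mib: int) -> bool:
    """Hypothetical check: True if the request would push usage past the limit."""
    return used_mib + request_mib > limit_mib


limit_mib = 2 * MIB_PER_GIB   # ResourceClaim request: 2 GiB = 2048 MiB
needed_mib = 2588             # approximate GPU memory the pod needs

shortfall_mib = needed_mib - limit_mib
print(shortfall_mib)                             # 540
print(exceeds_limit(limit_mib, 0, needed_mib))   # True
```

With the 3072 MiB limit from the first test, the same check returns False, which matches the pod running normally there.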

Pod log showing the OOM:
[Screenshot: pod log showing the GPU OOM]

Hoang Thanh Loi added 2 commits January 23, 2026 11:09
…its accordingly

Signed-off-by: Hoang Thanh Loi <loi.hoangthanh.24@gmail.com>
…its accordingly (updated)

Signed-off-by: Hoang Thanh Loi <loi.hoangthanh.24@gmail.com>

hami-robot bot commented Jan 23, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: loiht2
Once this PR has been reviewed and has the lgtm label, please assign archlitchi for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

hami-robot bot added the size/M label Jan 23, 2026