
Uppercase letters in resource names set in the Python script are lowercased in the TrainJob YAML, so the NPU resource cannot be allocated. #263

@wangyakun

What happened?

I started a TrainJob that uses an NPU through the Python SDK. The key code related to this issue is:

job_id = TrainerClient().train(
    runtime=TrainerClient().get_runtime(name='custom-minimize-npu'),
    trainer=CustomTrainerContainer(
        # image='36.134.128.101.nip.io:31104/aict/yolov9:20240802',
        image='36.134.128.101.nip.io:31104/aict-2025/llamafactory:0.9.4-npu-a2-tensorboard-bnb',
        num_nodes=1,
        resources_per_node={
            "cpu": 2,
            "memory": "16Gi",
            "huawei.com/Ascend910": "1",
            # "gpu": 1, # Comment this line if you don't have GPUs.
        },
    ),
...

I then ran the script and checked the TrainJob details. The letter 'A' in the resource name had been changed to 'a' (TrainJob YAML):

...
kind: TrainJob
spec:
  trainer:
    resourcesPerNode:
      limits:
        cpu: "2"
        huawei.com/ascend910: "1"
        memory: 16Gi
      requests:
        cpu: "2"
        huawei.com/ascend910: "1"
        memory: 16Gi
...

As a result, the pod requests huawei.com/ascend910 instead of huawei.com/Ascend910 and cannot be allocated the NPU resource.
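To illustrate why the lowercased key matters: as far as I understand, Kubernetes matches extended resource names as exact, case-sensitive strings, so a request for huawei.com/ascend910 does not match a node that advertises huawei.com/Ascend910. The sketch below only mimics that comparison with plain dictionaries. The allocatable values and the helper name `unsatisfiable_resources` are made up for illustration and are not taken from the cluster in this issue.

```python
# Assumed example of what a node with the Ascend device plugin might advertise.
# These numbers are illustrative, not captured from a real node.
node_allocatable = {"cpu": "64", "memory": "256Gi", "huawei.com/Ascend910": "8"}


def unsatisfiable_resources(requests: dict, allocatable: dict) -> list:
    """Return requested resource names the node does not advertise.

    The lookup is an exact, case-sensitive string match.
    """
    return [name for name in requests if name not in allocatable]


# What ends up in the TrainJob after the SDK lowercases the key.
requested_by_sdk = {"cpu": "2", "memory": "16Gi", "huawei.com/ascend910": "1"}
# What the script originally asked for.
requested_as_written = {"cpu": "2", "memory": "16Gi", "huawei.com/Ascend910": "1"}

print(unsatisfiable_resources(requested_by_sdk, node_allocatable))
print(unsatisfiable_resources(requested_as_written, node_allocatable))
```

With the lowercased key, the NPU resource name is reported as unmatched. With the original casing, nothing is.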

Related code

The code in this file may need to be modified to fix the issue:

sdk\kubeflow\trainer\backends\kubernetes\utils.py

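One possible direction for a fix, as a rough sketch only: lowercase the key just for the well-known names (cpu, memory, and a gpu shorthand) and pass every other key through with its original casing. The function `normalize_resources` and the constant `WELL_KNOWN_KEYS` are hypothetical names for illustration. This is not the SDK's actual code in utils.py, and the gpu to nvidia.com/gpu mapping is an assumption about what the SDK intends.

```python
# Hypothetical helper, NOT the SDK's actual implementation.
# Keys that are safe to match case-insensitively.
WELL_KNOWN_KEYS = {"cpu", "memory"}


def normalize_resources(resources_per_node: dict) -> dict:
    """Normalize well-known keys, keep extended resource names untouched."""
    normalized = {}
    for key, value in resources_per_node.items():
        lowered = key.lower()
        if lowered == "gpu":
            # Assumed alias: expand the "gpu" shorthand to the NVIDIA name.
            normalized["nvidia.com/gpu"] = str(value)
        elif lowered in WELL_KNOWN_KEYS:
            normalized[lowered] = str(value)
        else:
            # Extended resources such as huawei.com/Ascend910 are
            # case-sensitive, so the original key is preserved.
            normalized[key] = str(value)
    return normalized


resources = normalize_resources(
    {"CPU": 2, "memory": "16Gi", "huawei.com/Ascend910": "1"}
)
print(resources)
```

With this approach, "CPU" and "cpu" are still treated the same, while vendor resource names reach the TrainJob exactly as the user wrote them.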

What did you expect to happen?

I expect the letter 'A' not to be changed to 'a', so that the NPU resource is allocated properly.

Environment

Kubernetes version:

$ kubectl version

result:
Client Version: v1.34.3+k3s1
Kustomize Version: v5.7.1
Server Version: v1.34.3+k3s1

Kubeflow Trainer version:

$ kubectl get pods -n kubeflow -l app.kubernetes.io/name=trainer -o jsonpath="{.items[*].spec.containers[*].image}"

result:
(empty)

Kubeflow Python SDK version:

$ pip show kubeflow

result:
Name: kubeflow
Version: 0.2.1
Summary: Kubeflow Python SDK to manage ML workloads and to interact with Kubeflow APIs.
Home-page: https://github.com/kubeflow/sdk
Author:
Author-email: The Kubeflow Authors <kubeflow-discuss@googlegroups.com>
License-Expression: Apache-2.0
Location: /opt/conda/lib/python3.11/site-packages
Requires: kubeflow-katib-api, kubeflow-trainer-api, kubernetes, pydantic
Required-by:

Impacted by this bug?

The NPU resource cannot be allocated to the pod.
