Description
What happened?
I started a TrainJob that uses an NPU through the Python SDK. The code relevant to this issue is:
job_id = TrainerClient().train(
    runtime=TrainerClient().get_runtime(name='custom-minimize-npu'),
    trainer=CustomTrainerContainer(
        # image='36.134.128.101.nip.io:31104/aict/yolov9:20240802',
        image='36.134.128.101.nip.io:31104/aict-2025/llamafactory:0.9.4-npu-a2-tensorboard-bnb',
        num_nodes=1,
        resources_per_node={
            "cpu": 2,
            "memory": "16Gi",
            "huawei.com/Ascend910": "1",
            # "gpu": 1,  # Comment this line if you don't have GPUs.
        },
    ),
    ...
I then ran this script and checked the details of the resulting TrainJob. The letter 'A' in the resource name had been changed to 'a':
(TrainJob YAML:)
...
kind: TrainJob
spec:
  trainer:
    resourcesPerNode:
      limits:
        cpu: "2"
        huawei.com/ascend910: "1"
        memory: 16Gi
      requests:
        cpu: "2"
        huawei.com/ascend910: "1"
        memory: 16Gi
...
As a result, the pod cannot be allocated the NPU resource.
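Extended resource names are matched as exact strings, so a request for huawei.com/ascend910 does not match a node that advertises huawei.com/Ascend910. A minimal sketch of that mismatch, using plain dictionaries in place of real node and pod objects (the allocatable values and the helper name unknown_resources are made up for illustration):

```python
# Hypothetical node allocatable map. The device plugin is assumed to advertise
# the resource with a capital "A", matching the name used in the SDK call.
node_allocatable = {"cpu": "8", "memory": "32Gi", "huawei.com/Ascend910": "1"}

# Resource names as they appear in the generated TrainJob.
requested = {"cpu": "2", "memory": "16Gi", "huawei.com/ascend910": "1"}


def unknown_resources(requested, allocatable):
    """Return the requested resource names that the node does not advertise."""
    return sorted(name for name in requested if name not in allocatable)


missing = unknown_resources(requested, node_allocatable)
print(missing)
```

With the original casing the same check reports nothing missing, which is why the lowercased key is the suspect.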
related code
This code may need to be modified to fix this issue:
sdk\kubeflow\trainer\backends\kubernetes\utils.py
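One possible direction, assuming the helper in that file lowercases every key of resources_per_node (this is not verified against the SDK source): lowercase only the short aliases the SDK understands and pass every other resource name through unchanged. The names normalize_resources and KNOWN_ALIASES below are hypothetical and are not the SDK's actual identifiers.

```python
# Short aliases that are safe to lowercase. This set is an assumption made
# for illustration; it is not taken from the SDK source.
KNOWN_ALIASES = {"cpu", "memory", "gpu"}


def normalize_resources(resources_per_node):
    """Lowercase known aliases only; keep all other resource names verbatim."""
    normalized = {}
    for name, value in resources_per_node.items():
        key = name.lower() if name.lower() in KNOWN_ALIASES else name
        normalized[key] = str(value)
    return normalized


result = normalize_resources(
    {"CPU": 2, "memory": "16Gi", "huawei.com/Ascend910": "1"}
)
print(result)
```

This keeps the convenience of case-insensitive cpu, memory, and gpu keys while leaving vendor-specific extended resources such as huawei.com/Ascend910 exactly as the user wrote them.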
What did you expect to happen?
I expect the letter 'A' to be left unchanged, so that the NPU resource is allocated properly.
Environment
Kubernetes version:
$ kubectl version
result:
Client Version: v1.34.3+k3s1
Kustomize Version: v5.7.1
Server Version: v1.34.3+k3s1
Kubeflow Trainer version:
$ kubectl get pods -n kubeflow -l app.kubernetes.io/name=trainer -o jsonpath="{.items[*].spec.containers[*].image}"
result:
(empty)
Kubeflow Python SDK version:
$ pip show kubeflow
result:
Name: kubeflow
Version: 0.2.1
Summary: Kubeflow Python SDK to manage ML workloads and to interact with Kubeflow APIs.
Home-page: https://github.com/kubeflow/sdk
Author:
Author-email: The Kubeflow Authors kubeflow-discuss@googlegroups.com
License-Expression: Apache-2.0
Location: /opt/conda/lib/python3.11/site-packages
Requires: kubeflow-katib-api, kubeflow-trainer-api, kubernetes, pydantic
Required-by:
Impacted by this bug?
The NPU resource cannot be allocated to the pod.