Terraform Updates for HyperPod Inference Operator v2 Official EKS addon #947

Draft
bluecrayon52 wants to merge 4 commits into main from tf-hpio-update


bluecrayon52 commented Feb 5, 2026

Issue #, if available:

Description of changes:
Updated hyperpod_inference_operator to deploy through the new amazon-sagemaker-hyperpod-inference EKS addon instead of the legacy Helm chart.
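
Because the operator is now installed as an EKS addon, its state can also be read from the EKS API rather than from Helm. The sketch below is one possible check, not part of this PR: the function name `addon_is_active` and the cluster name in the usage line are hypothetical, and it assumes the AWS CLI is installed and configured for the target account and region.

```shell
# Hypothetical helper (not part of this PR): succeeds only when the
# addon reports ACTIVE. $1 is the EKS cluster name; $2 optionally
# overrides the addon name taken from the description above.
addon_is_active() {
  local cluster="$1"
  local addon="${2:-amazon-sagemaker-hyperpod-inference}"
  local status
  # `aws eks describe-addon` returns the addon record; --query extracts its status.
  status=$(aws eks describe-addon \
    --cluster-name "$cluster" \
    --addon-name "$addon" \
    --query 'addon.status' \
    --output text) || return 1
  echo "addon ${addon}: ${status}"
  [ "$status" = "ACTIVE" ]
}
```

Usage, with a placeholder cluster name: `addon_is_active my-hyperpod-eks-cluster`.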

Deployment verification:


# HPIO EKS Addon
kubectl get all -n hyperpod-inference-system
NAME                                                        READY   STATUS    RESTARTS      AGE
pod/hyperpod-inference-alb-5f655d5f97-7rvzk                 1/1     Running   0             10m
pod/hyperpod-inference-alb-5f655d5f97-s5lvt                 1/1     Running   0             10m
pod/hyperpod-inference-controller-manager-6885bb776-j45mm   1/1     Running   0             10m
pod/keda-admission-webhooks-686f6f9c64-mxhcn                1/1     Running   0             10m
pod/keda-operator-5fc4b88dbb-rj9ts                          1/1     Running   1 (10m ago)   10m
pod/keda-operator-metrics-apiserver-694685d8bd-ksgj5        1/1     Running   0             10m

NAME                                                            TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)            AGE
service/alb-webhook-service                                     ClusterIP   172.20.172.204   <none>        443/TCP            10m
service/hyperpod-inference-controller-manager-metrics-service   ClusterIP   172.20.125.203   <none>        8443/TCP           10m
service/hyperpod-inference-conversion-webhook                   ClusterIP   172.20.136.39    <none>        443/TCP            10m
service/keda-admission-webhooks                                 ClusterIP   172.20.30.94     <none>        443/TCP            10m
service/keda-operator                                           ClusterIP   172.20.98.231    <none>        9666/TCP           10m
service/keda-operator-metrics-apiserver                         ClusterIP   172.20.230.120   <none>        443/TCP,8080/TCP   10m

NAME                                                    READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/hyperpod-inference-alb                  2/2     2            2           10m
deployment.apps/hyperpod-inference-controller-manager   1/1     1            1           10m
deployment.apps/keda-admission-webhooks                 1/1     1            1           10m
deployment.apps/keda-operator                           1/1     1            1           10m
deployment.apps/keda-operator-metrics-apiserver         1/1     1            1           10m

NAME                                                              DESIRED   CURRENT   READY   AGE
replicaset.apps/hyperpod-inference-alb-5f655d5f97                 2         2         2       10m
replicaset.apps/hyperpod-inference-controller-manager-6885bb776   1         1         1       10m
replicaset.apps/keda-admission-webhooks-686f6f9c64                1         1         1       10m
replicaset.apps/keda-operator-5fc4b88dbb                          1         1         1       10m
replicaset.apps/keda-operator-metrics-apiserver-694685d8bd        1         1         1       10m

# FSxL CSI Driver
kubectl get all -l app=fsx-csi-controller -n kube-system
NAME                                      READY   STATUS    RESTARTS   AGE
pod/fsx-csi-controller-55bc879f94-cbkjn   4/4     Running   0          26m
pod/fsx-csi-controller-55bc879f94-nxdbp   4/4     Running   0          26m

NAME                                            DESIRED   CURRENT   READY   AGE
replicaset.apps/fsx-csi-controller-55bc879f94   2         2         2       26m

# S3 CSI Driver
kubectl get all -l app=s3-csi-node -n kube-system
NAME                    READY   STATUS    RESTARTS   AGE
pod/s3-csi-node-ccb6v   3/3     Running   0          28m
pod/s3-csi-node-qxl7k   3/3     Running   0          28m

# Metrics Server
kubectl get all -l app.kubernetes.io/name=metrics-server -n kube-system
NAME                                  READY   STATUS    RESTARTS   AGE
pod/metrics-server-6cdc4574fb-d7t59   1/1     Running   0          30m
pod/metrics-server-6cdc4574fb-djn97   1/1     Running   0          30m

NAME                     TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)   AGE
service/metrics-server   ClusterIP   172.20.138.85   <none>        443/TCP   30m

NAME                             READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/metrics-server   2/2     2            2           30m

NAME                                        DESIRED   CURRENT   READY   AGE
replicaset.apps/metrics-server-6cdc4574fb   2         2         2       30m

# Cert Manager
kubectl get all -n cert-manager
NAME                                           READY   STATUS    RESTARTS   AGE
pod/cert-manager-7657f9b596-nfssj              1/1     Running   0          32m
pod/cert-manager-cainjector-78d6749d66-6ztd8   1/1     Running   0          32m
pod/cert-manager-webhook-777dbb5b86-h5jd5      1/1     Running   0          32m

NAME                              TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)            AGE
service/cert-manager              ClusterIP   172.20.184.240   <none>        9402/TCP           32m
service/cert-manager-cainjector   ClusterIP   172.20.70.94     <none>        9402/TCP           32m
service/cert-manager-webhook      ClusterIP   172.20.66.73     <none>        443/TCP,9402/TCP   32m

NAME                                      READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/cert-manager              1/1     1            1           32m
deployment.apps/cert-manager-cainjector   1/1     1            1           32m
deployment.apps/cert-manager-webhook      1/1     1            1           32m

NAME                                                 DESIRED   CURRENT   READY   AGE
replicaset.apps/cert-manager-7657f9b596              1         1         1       32m
replicaset.apps/cert-manager-cainjector-78d6749d66   1         1         1       32m
replicaset.apps/cert-manager-webhook-777dbb5b86      1         1         1       32m
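
The checks above were read by eye. As a rough sketch of how the same pod check could be scripted, the hypothetical helper below (`pods_all_ready`, not part of this PR) reads `kubectl get pods` or `kubectl get all` table output on stdin and fails if any pod row is not `Running` with all of its containers ready. It assumes the default column order shown above (NAME, READY, STATUS) and an awk implementation on the PATH.

```shell
# Hypothetical helper: exit 0 only if every pod row read from stdin is
# Running and fully ready (READY column of the form n/n).
pods_all_ready() {
  awk '
    NF == 0 || $1 == "NAME" { next }          # skip blank lines and table headers
    $1 ~ /\// && $1 !~ /^pod\// { next }      # in "get all" output, skip non-pod rows
    {
      rows++
      split($2, ready, "/")                   # READY column, e.g. "4/4"
      if (ready[1] != ready[2] || $3 != "Running") {
        print "not ready: " $1 " (" $2 ", " $3 ")"
        bad++
      }
    }
    END { exit (rows == 0 || bad > 0) ? 1 : 0 }   # no pods at all also counts as failure
  '
}
```

For example: `kubectl get all -n hyperpod-inference-system | pods_all_ready && echo "all pods ready"`. Treating an empty result as a failure is a design choice here, so that a wrong namespace or label selector does not pass silently.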

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.
