@@ -500,6 +500,14 @@ Set the following parameters to `true` in your `custom.tfvars` file to enable op
 | `create_hyperpod_inference_operator_module` | Installs the [HyperPod inference operator addon](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-hyperpod-model-deployment-setup.html) for deployment and management of machine learning inference endpoints |
 | `create_observability_module` | Installs the [HyperPod Observability addon](https://docs.aws.amazon.com/sagemaker/latest/dg/hyperpod-observability-addon-setup.html) to publish key metrics to Amazon Managed Service for Prometheus and display them in Amazon Managed Grafana dashboards |
 
+The HyperPod training and inference operators both require the [cert-manager](https://cert-manager.io/) EKS addon as a prerequisite. The variable `enable_cert_manager` is `true` by default, so when `create_hyperpod_training_operator_module` or `create_hyperpod_inference_operator_module` is also set to `true`, cert-manager is installed as a dependency of the operators. In other words, this stack never installs cert-manager as a standalone component; set `enable_cert_manager = false` if cert-manager is already installed on an existing EKS cluster and you wish to use one of the HyperPod operators.
+
+The HyperPod inference operator also has the following additional dependencies (a combined `custom.tfvars` example follows this hunk):
+- The [Amazon FSx for Lustre CSI driver](https://github.com/kubernetes-sigs/aws-fsx-csi-driver): This EKS addon is installed by default as part of the FSx for Lustre module. Set `create_fsx_module = false` if you already have it installed on an existing EKS cluster.
+- The [Mountpoint for Amazon S3 CSI Driver](https://github.com/awslabs/mountpoint-s3-csi-driver): This EKS addon is bundled with the HyperPod inference operator module and is enabled by default. Set `enable_s3_csi_driver = false` if you already have it installed on an existing EKS cluster.
+- The [AWS Load Balancer Controller](https://github.com/kubernetes-sigs/aws-load-balancer-controller): This is bundled with the HyperPod inference operator EKS addon and is enabled by default. Set `enable_alb_controller = false` if you already have it installed on an existing EKS cluster.
+- The [KEDA (Kubernetes Event-driven Autoscaling) Operator](https://keda.sh/): This is bundled with the HyperPod inference operator EKS addon and is enabled by default. Set `enable_keda = false` if you already have it installed on an existing EKS cluster.
+
 ---
 ### Advanced Observability Metrics Configuration
 In addition to enabling the [HyperPod Observability addon](https://docs.aws.amazon.com/sagemaker/latest/dg/hyperpod-observability-addon-setup.html) by setting `create_observability_module = true`, you can also configure the following metrics that you wish to collect on your cluster:
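For example, to run the HyperPod operators on an existing EKS cluster that already has all of these dependencies installed, a `custom.tfvars` along the following lines would skip the bundled components. This is a minimal sketch using the flags documented above; leave a flag at `true` for any component that is not already present on the target cluster.

```hcl
# Install the HyperPod operators themselves.
create_hyperpod_training_operator_module  = true
create_hyperpod_inference_operator_module = true

# Skip dependencies that already exist on the target cluster.
enable_cert_manager   = false # cert-manager already installed
create_fsx_module     = false # FSx for Lustre CSI driver already installed
enable_s3_csi_driver  = false # Mountpoint for S3 CSI driver already installed
enable_alb_controller = false # AWS Load Balancer Controller already installed
enable_keda           = false # KEDA already installed
```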
@@ -4,17 +4,17 @@ hyperpod_cluster_name = "tf-hp-cluster"
 resource_name_prefix = "tf-eks-test"
 aws_region = "us-east-1"
 instance_groups = [
-{
-name = "accelerated-instance-group-1"
-instance_type = "ml.g5.8xlarge",
-instance_count = 2,
-availability_zone_id = "use1-az2",
-ebs_volume_size_in_gb = 100,
-threads_per_core = 1,
-enable_stress_check = false,
-enable_connectivity_check = false,
-lifecycle_script = "on_create.sh"
-}
+  {
+    name = "accelerated-instance-group-1"
+    instance_type = "ml.g5.8xlarge",
+    instance_count = 2,
+    availability_zone_id = "use1-az2",
+    ebs_volume_size_in_gb = 100,
+    threads_per_core = 1,
+    enable_stress_check = false,
+    enable_connectivity_check = false,
+    lifecycle_script = "on_create.sh"
+  }
 ]
 create_observability_module = true
 network_metric_level = "ADVANCED"
@@ -23,4 +23,4 @@ create_task_governance_module = true
 create_hyperpod_training_operator_module = true
 create_hyperpod_inference_operator_module = true
 enable_guardduty_cleanup = true
-create_new_fsx_filesystem = true
\ No newline at end of file
+create_new_fsx_filesystem = true
@@ -57,7 +57,7 @@ locals {
   create_s3_bucket_module = !local.rig_mode && var.create_s3_bucket_module
   s3_bucket_name = !local.rig_mode ? (var.create_s3_bucket_module ? module.s3_bucket[0].s3_bucket_name : var.existing_s3_bucket_name) : null
   create_lifecycle_script_module = !local.rig_mode && var.create_lifecycle_script_module
-  enable_cert_manager = !local.rig_mode && (var.create_hyperpod_training_operator_module || var.create_hyperpod_inference_operator_module)
+  enable_cert_manager = !local.rig_mode && var.enable_cert_manager && (var.create_hyperpod_training_operator_module || var.create_hyperpod_inference_operator_module)
   wait_for_nodes = !local.rig_mode && anytrue(local.features_requiring_nodes)
   create_fsx_module = !local.rig_mode ? var.create_fsx_module : false
   create_task_governance_module = !local.rig_mode && var.create_task_governance_module
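The new `var.enable_cert_manager` reference implies a matching declaration in the root module's variables file, which this diff does not show. Based on the README text above (`enable_cert_manager` is `true` by default), it presumably looks something like this sketch:

```hcl
# Hypothetical declaration, not part of this diff; the default of true
# matches the behavior described in the README.
variable "enable_cert_manager" {
  description = "Install cert-manager as a prerequisite of the HyperPod training and inference operators"
  type        = bool
  default     = true
}
```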
@@ -295,14 +295,13 @@ module "hyperpod_inference_operator" {
   source = "./modules/hyperpod_inference_operator"
 
   resource_name_prefix = var.resource_name_prefix
-  helm_repo_path = var.helm_repo_path_hpio
-  helm_release_name = var.helm_release_name_hpio
-  helm_repo_revision = var.helm_repo_revision_hpio
   namespace = var.namespace
   eks_cluster_name = local.eks_cluster_name
   vpc_id = local.vpc_id
-  hyperpod_cluster_arn = module.hyperpod_cluster[0].hyperpod_cluster_arn
-  access_logs_bucket_name = module.s3_bucket[0].s3_logs_bucket_name
+  enable_s3_csi_driver = var.enable_s3_csi_driver
+  enable_alb_controller = var.enable_alb_controller
+  enable_keda = var.enable_keda
+  enable_metrics_server = var.enable_metrics_server
 
   depends_on = [
     module.hyperpod_cluster,
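This swaps the module's interface from Helm source coordinates (`helm_repo_path_hpio` and friends) to per-dependency enable flags. The corresponding root-level declarations are not shown in this diff; given the README bullets, they presumably look something like:

```hcl
# Hypothetical declarations, not part of this diff. Defaults of true for the
# first three follow the README ("enabled by default"); the metrics-server
# default is an assumption.
variable "enable_s3_csi_driver" {
  description = "Install the Mountpoint for Amazon S3 CSI driver"
  type        = bool
  default     = true
}

variable "enable_alb_controller" {
  description = "Install the AWS Load Balancer Controller"
  type        = bool
  default     = true
}

variable "enable_keda" {
  description = "Install the KEDA operator"
  type        = bool
  default     = true
}

variable "enable_metrics_server" {
  description = "Install the Kubernetes metrics server"
  type        = bool
  default     = true
}
```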
@@ -1,9 +1,5 @@
 data "aws_region" "current" {}
 
-data "aws_vpc" "selected" {
-  id = var.vpc_id
-}
-
 data "aws_availability_zones" "available" {
   state = "available"
   filter {
@@ -19,7 +15,7 @@ locals {
 
 resource "aws_subnet" "private" {
   count = var.create_eks_subnets ? length(var.private_subnet_cidrs) : 0
-  vpc_id = data.aws_vpc.selected.id
+  vpc_id = var.vpc_id
   cidr_block = var.private_subnet_cidrs[count.index]
   availability_zone = data.aws_availability_zones.available.names[count.index]
@@ -32,7 +28,7 @@ resource "aws_subnet" "private" {
 
 resource "aws_route_table" "eks_private" {
   count = var.create_eks_subnets ? length(var.private_subnet_cidrs) : 0
-  vpc_id = data.aws_vpc.selected.id
+  vpc_id = var.vpc_id
 
   tags = {
     Name = "${var.resource_name_prefix}-EKS-Private-RT-${count.index + 1}"
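Dropping the `data "aws_vpc" "selected"` lookup in favor of `var.vpc_id` removes a redundant AWS API read and a dependency edge: the module was only echoing back the ID it was given. A data lookup earns its keep only when attributes beyond the ID are needed, for example (illustrative sketch, not part of this module):

```hcl
# Worth a lookup: deriving a subnet CIDR from the VPC's address range,
# which is an attribute the caller did not pass in.
data "aws_vpc" "selected" {
  id = var.vpc_id
}

locals {
  derived_subnet_cidr = cidrsubnet(data.aws_vpc.selected.cidr_block, 8, 1)
}
```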
@@ -43,7 +43,7 @@ resource "aws_eks_addon" "fsx_lustre_csi_driver" {
   }
 }
 
-# Wait for FSx CSI driver to be available (required for the HPIO and dynamic provisioning)
+# Wait for FSx CSI driver to be available (required for the HPIO and PVC binding)
 resource "null_resource" "wait_for_fsx_csi_driver" {
   count = local.wait_for_fsx_csi_driver ? 1 : 0
 
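The provisioner body of this `null_resource` is collapsed in the diff. A typical shape for such a wait, sketched here hypothetically (the variable name and polling details are assumptions, not the code in this PR), polls the addon until EKS reports it `ACTIVE`:

```hcl
# Hypothetical sketch of the wait loop; not the actual implementation.
resource "null_resource" "wait_for_fsx_csi_driver" {
  count = local.wait_for_fsx_csi_driver ? 1 : 0

  provisioner "local-exec" {
    command = <<-EOT
      # Poll the EKS addon status until it becomes ACTIVE.
      until [ "$(aws eks describe-addon \
        --cluster-name ${var.eks_cluster_name} \
        --addon-name aws-fsx-csi-driver \
        --query addon.status --output text)" = "ACTIVE" ]; do
        sleep 10
      done
    EOT
  }

  depends_on = [aws_eks_addon.fsx_lustre_csi_driver]
}
```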
This file was deleted.

This file was deleted.
