SREP-2644: Managed Policy additions for Karpenter/AutoNode on ROSA HCP #2653

Open
MitaliBhalla wants to merge 1 commit into openshift:master from MitaliBhalla:SREP-2644-v2

MitaliBhalla (Contributor) commented Feb 26, 2026

What type of PR is this?

feature

What this PR does / why we need it?

Adds AWS managed policy support for Karpenter/AutoNode on ROSA HCP:

  • New Karpenter Controller Policy - EC2, KMS, IAM, and SQS permissions for node provisioning with red-hat-managed tag conditions
  • Control Plane Operator Update - Security group tagging for Karpenter discovery
  • Installer Policy Update - SQS queue validation
  • Policy Justification Doc - AWS-required permission justifications

Improvements over original PR #2581:

  • Added ec2:DescribeCapacityReservations for capacity reservation support
  • Enhanced KMS grant conditions following HCP CAPA Controller pattern

Which Jira/Github issue(s) this PR fixes?

SREP-2644

Special notes for your reviewer:

  • Re-validated with latest hypershift operator release
  • CreateLaunchTemplate succeeds with red-hat-managed=true tag
  • NodeClaims Ready, pods scheduled successfully
  • Details in SREP-2644

Pre-checks (if applicable):

  • Tested latest changes against a cluster

  • Included documentation changes with PR

  • If this is a new object that is not intended for the FedRAMP environment (if unsure, please reach out to team FedRAMP), please exclude it with:

    matchExpressions:
    - key: api.openshift.com/fedramp
      operator: NotIn
      values: ["true"]

- Add Karpenter controller credentials policy with least-privilege permissions
- Add ec2:DescribeCapacityReservations for capacity reservation support
- Add kms:ViaService condition to KMS grants (following CAPA pattern)
- Update control plane operator policy with security group tagging
- Update installer policy with SQS queue validation

Addresses PR review feedback:
- EC2 Describe actions: Cannot be tag-conditioned (AWS API limitation)
- KMS grants: Added kms:ViaService condition per existing CAPA pattern
- PassRole: Already has iam:PassedToService condition
- IAM Get/ListInstanceProfiles: Read-only, require Resource '*' per AWS docs

Tested and validated on latest hypershift operator release.

Signed-off-by: Mitali Bhalla <mbhalla@redhat.com>
openshift-ci-robot added the jira/valid-reference label (indicates that this PR references a valid Jira ticket of any type) Feb 26, 2026

openshift-ci-robot commented Feb 26, 2026

@MitaliBhalla: This pull request references SREP-2644 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.22.0" version, but no target version was set.

openshift-ci bot commented Feb 26, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: MitaliBhalla
Once this PR has been reviewed and has the lgtm label, please assign typeid for approval. For more information see the Code Review Process.

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

openshift-ci bot commented Feb 26, 2026

@MitaliBhalla: all tests passed!

MitaliBhalla (Contributor Author) commented

@joshbranham @rafael-azevedo

joshbranham (Contributor) left a comment

Some comments; sorry if they were addressed elsewhere. I'm just anticipating questions from AWS.

"Action": [
"ec2:CreateTags"
],
"Resource": "arn:aws:ec2:*:*:security-group/*",
Contributor

I may have missed it, but why does the CPO need this permission now as part of Karpenter?

Contributor Author

As I understand it, the Control Plane Operator adds karpenter.sh/discovery tags to the cluster's security groups when AutoNode is enabled as a Day-2 operation.

Karpenter uses these tags to discover which security groups to attach to provisioned nodes. Without this permission, the CPO cannot tag the existing Red Hat-managed security groups, and Karpenter won't be able to find the correct security groups for node provisioning.

The permission is scoped so that tagging is allowed only on security groups that already carry aws:ResourceTag/red-hat-managed: "true".
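
A minimal sketch of how such a statement might look, combining the diff fragment quoted above with the tag condition described in this reply. The Sid is hypothetical, and the condition operator actually used in the PR may differ.

```json
{
  "Sid": "ExampleTagRedHatManagedSecurityGroups",
  "Effect": "Allow",
  "Action": [
    "ec2:CreateTags"
  ],
  "Resource": "arn:aws:ec2:*:*:security-group/*",
  "Condition": {
    "StringEquals": {
      "aws:ResourceTag/red-hat-managed": "true"
    }
  }
}
```

Because the condition is evaluated against tags already present on the security group, a request to tag a group that lacks red-hat-managed=true would not match this statement.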

"Sid": "AllowListSQSQueues",
"Effect": "Allow",
"Action": [
"sqs:ListQueues"
Contributor

I presume this is primarily for the installer to validate that the provided queue actually exists?

Contributor Author

Correct. This allows Cluster Service to validate that the customer-provided SQS queue exists before cluster creation, preventing misconfiguration errors.

@joshbranham / @MitaliBhalla Bala confirmed that this was not a hard requirement as long as we have clear error messages on the Karpenter APIs if the SQS queue is misconfigured. See the requirements in https://issues.redhat.com/browse/AUTOSCALE-354

We can remove this permission request from the installer role.

"Sid": "SSMReadActions",
"Effect": "Allow",
"Action": [
"ssm:GetParameter"
Contributor

This is a pretty sensitive call; lots of customer data could be accessed. I see we are wildcarding on /aws/service/*. Is there any tag conditioning we can do? I presume Karpenter looks up something in this path?

Contributor Author

Good catch on the sensitivity concern. However, the resource is already scoped to AWS-managed service parameters only, not customer parameters:
"Resource": "arn:aws:ssm:*::parameter/aws/service/*"
The double colon (::) in the ARN means the account ID field is empty, which indicates AWS-owned public parameters only.

Karpenter uses this to look up official AMI IDs. For example:

  • /aws/service/bottlerocket/aws-k8s-*/x86_64/latest/image_id
  • /aws/service/ami-amazon-linux-latest/*

These are read-only, publicly available AWS parameters - no customer data is accessible through this path.
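
For reference, the statement under discussion can be sketched by joining the diff fragment quoted above with the Resource line from this reply. The action list in the quoted fragment is cut off, so the real statement may contain more actions than the single one shown here.

```json
{
  "Sid": "SSMReadActions",
  "Effect": "Allow",
  "Action": [
    "ssm:GetParameter"
  ],
  "Resource": "arn:aws:ssm:*::parameter/aws/service/*"
}
```

A parameter stored in a customer account would have an account ID between the two colons, so it would not match this Resource pattern.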

Comment on lines +252 to +254
"sqs:DeleteMessage",
"sqs:GetQueueUrl",
"sqs:ReceiveMessage"
Contributor

Can these be conditioned or resource scoped?

Contributor Author

Unfortunately, we can't easily scope these to a specific queue ARN because:

  1. The queue is customer-provided - we don't know the queue name/ARN at policy creation time
  2. sqs:GetQueueUrl is used to resolve the queue name to a URL, which requires Resource: "*"

However, we could potentially add a tag condition if customers are required to tag their interruption queues:
"Condition": { "StringEquals": { "aws:ResourceTag/rosa-karpenter-interruption-queue": "true" } }

Do we want to require customers to tag their SQS queues for this to work?
This adds a customer setup step but improves security scoping.
Alternatively, the current approach matches how other Karpenter deployments handle this - the queue ARN is validated at runtime by the controller.
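
A rough sketch of what the tag-conditioned variant could look like, assuming the three actions from the commented lines and the tag key floated in this reply. The Sid is hypothetical, the tag key has not been agreed on, and whether each of these SQS actions honors aws:ResourceTag would need to be checked against the SQS documentation before relying on it.

```json
{
  "Sid": "ExampleInterruptionQueueAccess",
  "Effect": "Allow",
  "Action": [
    "sqs:DeleteMessage",
    "sqs:GetQueueUrl",
    "sqs:ReceiveMessage"
  ],
  "Resource": "*",
  "Condition": {
    "StringEquals": {
      "aws:ResourceTag/rosa-karpenter-interruption-queue": "true"
    }
  }
}
```

With this shape the Resource stays "*", since the queue ARN is not known when the policy is written, and the tag on the customer's queue does the scoping instead.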

Contributor

I don't see AWS accepting this policy with unrestricted SQS permissions, so we would likely have to require tags on the resources. The way we do this for KMS, etc. is to require red-hat: true to denote that the customer made the resource but is granting us access to it.

4 participants