Skip to content

Conversation

@carlbraganza
Copy link
Contributor

@carlbraganza carlbraganza commented Feb 2, 2026

  • One-line PR description: Update KEP-3314 PRR for Beta.
  • Other comments:

@k8s-ci-robot k8s-ci-robot added the kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory label Feb 2, 2026
@k8s-ci-robot k8s-ci-robot requested a review from saad-ali February 2, 2026 21:41
@k8s-ci-robot k8s-ci-robot added the sig/storage Categorizes an issue or PR as relevant to SIG Storage. label Feb 2, 2026
@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Feb 2, 2026
@k8s-ci-robot
Copy link
Contributor

Hi @carlbraganza. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Feb 2, 2026
@xing-yang
Copy link
Contributor

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Feb 2, 2026
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: carlbraganza
Once this PR has been reviewed and has the lgtm label, please assign jpbetz, jsafrane for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@carlbraganza carlbraganza force-pushed the kep3314-beta branch 4 times, most recently from 6f475d4 to 943a783 Compare February 3, 2026 16:35
@xing-yang
Copy link
Contributor

General comment: Please provide more details when you say "CSI driver specific."

For e2e tests, please include links to the test files.

@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Feb 3, 2026
Copy link
Member

@stlaz stlaz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PRR shadow review

NOTE: Also set `disable-supported` to `true` or `false` in `kep.yaml`.
-->

The feature can be disabled by removing the sidecar from the CSI driver.
Copy link
Member

@stlaz stlaz Feb 4, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the design text, add these changes:

  • Explain the abbreviation "SP" which is often used throughout the text. I'm assuming it's "Storage Provider".

  • Where it says:

The full specification of the Kubernetes SnapshotMetadata API will be published in the source code repository of the [external-snapshot-metadata sidecar](https://github.com/kubernetes/enhancements/blob/1d3cb5e92e7cb3a815974b172f238e23b957ed96/keps/sig-storage/3314-csi-changed-block-tracking/README.md#the-external-snapshot-metadata-sidecar).

add a link to the actual API spec.

  • In the The External Snapshot Metadata Sidecar section, add a link to the repo with the code.
    In the same section - The sidecar must be configured with the name of this CR object. <- it's unclear what this means.

  • Since you're referring to it from your official docs multiple times. add a concrete RBAC rule definition that would let a backup client scrape the snapshot metadata to the Risks and Mitigations section. You can just limit the rule to what you've got in your actual authorizer (the volumesnapshots SAR check).

  • Also, in the Risks and Mitigations, explain how you implement rate limiting to address the Uncontrolled access issue raised by sig-auth.

  • In the e2e section of the Test Plan, add detailed explanation how you're going to test your sidecar's authenticator and authorizer. Make these a requirement for beta in the Graduation criteria section.

In the code of your sidecar:

In your sidecar repo:

  • Create an issue where you explain how you plan on improving the e2e testing for the repository. Currently the only e2e tests live in a .github/workflows directory in a file called integration-test.yaml as a long bash script.
    Make these improvements a requirement for Beta in the Graduation criteria section.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the feedback.

Explain the abbreviation "SP" ...

I added a link to https://github.com/container-storage-interface/spec/blob/master/spec.md#terminology for terminology in the Proposal section.

... add a link to the repo with the code ...
Added a link to kubernetes-csi/external-snapshot-metadata.
The KEP was written before the associated external-snapshot-repository was created so lacks these links to the code.

add a concrete RBAC rule definition that would let a backup client scrape the snapshot metadata to the Risks and Mitigations section...

Added a link to deploy/example at the end of the section.

... explain how you implement rate limiting to address the Uncontrolled access issue raised by sig-auth.

We do not implement rate limiting. Our mitigation of the Uncontrolled access is partially achieved as a side-effect of requiring K8s authentication and authorization (TokenRequest/TokenReview/SAR) in order to use the service. This was implied previously, so I added language to call it out explicititly.

In the e2e section of the Test Plan, add detailed explanation how you're going to test your sidecar's authenticator and authorizer. Make these a requirement for beta in the Graduation criteria section.

Hmm... the Test Plan does not provide that level of detail on anything. I've added a link to the E2E test PR in case anyone wants to see the actual test logic, and I also added an explicit requirement for authentication and authorization checks to the Beta graduation criteria.

Make the authorization test introspect...

I created Issue 206 to track your suggestion for the authorization unit test.

installed by the CSI driver, or by
reinstalling the CSI driver without this feature enabled.

It can also be effectively disabled by modifying the CSI driver specific
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You would still likely have at least one cluster-admin in the cluster that could still use it. I wouldn't say this is a valid way to disable the feature.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, I'll remove this.

-->

No.
The presence of `SnapshotMetadataService` CR objects indicates that this
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The question is about test presence.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll revert to the previous text.

will rollout across nodes.
-->

The CSI driver installs the `SnapshotMetadataService` CRD, creates an instance
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think the CSI driver should be responsible for API presence.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is as per the KEP.

This feature adds functionality to a CSI driver and follows the established pattern of existing CSI driver sidecars that provide for volume snapshots, etc, in how the sidecar is deployed (https://kubernetes-csi.github.io/docs/deploying.html).

However, unlike the other CSI features, access to this new sidecar bypasses the K8s API server for the most part, as it involves a lot of I/O. Secure access to the sidecar requires that the client know the server's certificate and audience string - the CR exists to convey this information, and how this CR gets created is up to the CSI driver.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, but the cluster operator/administrator should be responsible for the APIs that exist in the cluster, correct?

If my assumption is wrong, please explain how conflicts in different CRD versions should be handled by the CSI drivers, e.g.. a field gets added into the CRD but an old version of the CSI driver attempts to reconcile the CRD to its desired version.

-->

CSI driver specific.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The API is common for all the CSI drivers I believe. Please fill this out.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No upgrade or rollback was tested.

CSI sidecar's are deployed as part of a CSI driver installation and hence are vendor specific in nature. The CSI specifies how a sidecar container communicates with the CSI driver container, but does not specify how the containers are deployed.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok

The CSI driver installs the `SnapshotMetadataService` CRD, creates an instance
of this CR, and deploys the `external-snapshot-metadata` sidecar.
Each CSI driver vendor independently must explicitly handle failure to construct
these objects during rollout or rollback of their driver fails.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Explain how a backup client should handle rollbacks that would cause the feature to suddenly disappear/stop working for them.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added:

Any rollback will cause active operations by a backup application to fail.
Such failure is similar to a network failure that the application ought to be
capable of handling.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. I think there might be a typo I missed in the first read.

Suggested change
these objects during rollout or rollback of their driver fails.
these objects during rollout or rollback if their driver fails.

Comment on lines 1475 to 1478
- Event Reason: A `GetMetadataAllocated` Normal Event with a response containing the details of the
metadata blocks of the snapshot from the CSI driver.
- Event Reason: A `GetMetadataDelta` Normal Event with a response containing the delta of the metadata
blocks of the 2 snapshots from the CSI driver.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This probably shouldn't be logged into events but should just be kept as a response to the client?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, you're right. Reverted.


CSI driver specific.
Log records from the `external-snapshot-metadata` sidecar should be visible in
the CSI driver logs.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Which logs would appear in the CSI driver?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re-wrote to be:

Log records from the external-snapshot-metadata sidecar container should be visible in its logs,
assuming a typical deployment such as that described in
Deploying CSI Driver on Kubernetes.

deleting the `SnapshotMetadataService` CR or rolling back the feature.
It is also possible that the operator may be able to disable the use of the
feature in the backup application concerned.
- The [integration tests](https://github.com/kubernetes-csi/external-snapshot-metadata/blob/main/.github/workflows/integration-test.yaml)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the Test Plan section you explicitly claim integration tests are not suitable for the feature.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm... Thanks for pointing that out! I removed that comment and moved this link to that section.

- Testing: Are there any tests for failure mode? If not, describe why.
-->

- Failure due to RBAC permissions require the operator to update security
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You should follow the template from the .md comment above.

Copy link
Contributor Author

@carlbraganza carlbraganza left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@stlaz I hope I've addressed all your feedback.

deleting the `SnapshotMetadataService` CR or rolling back the feature.
It is also possible that the operator may be able to disable the use of the
feature in the backup application concerned.
- The [integration tests](https://github.com/kubernetes-csi/external-snapshot-metadata/blob/main/.github/workflows/integration-test.yaml)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm... Thanks for pointing that out! I removed that comment and moved this link to that section.

installed by the CSI driver, or by
reinstalling the CSI driver without this feature enabled.

It can also be effectively disabled by modifying the CSI driver specific
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, I'll remove this.

NOTE: Also set `disable-supported` to `true` or `false` in `kep.yaml`.
-->

The feature can be disabled by removing the sidecar from the CSI driver.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the feedback.

Explain the abbreviation "SP" ...

I added a link to https://github.com/container-storage-interface/spec/blob/master/spec.md#terminology for terminology in the Proposal section.

... add a link to the repo with the code ...
Added a link to kubernetes-csi/external-snapshot-metadata.
The KEP was written before the associated external-snapshot-repository was created so lacks these links to the code.

add a concrete RBAC rule definition that would let a backup client scrape the snapshot metadata to the Risks and Mitigations section...

Added a link to deploy/example at the end of the section.

... explain how you implement rate limiting to address the Uncontrolled access issue raised by sig-auth.

We do not implement rate limiting. Our mitigation of the Uncontrolled access is partially achieved as a side-effect of requiring K8s authentication and authorization (TokenRequest/TokenReview/SAR) in order to use the service. This was implied previously, so I added language to call it out explicititly.

In the e2e section of the Test Plan, add detailed explanation how you're going to test your sidecar's authenticator and authorizer. Make these a requirement for beta in the Graduation criteria section.

Hmm... the Test Plan does not provide that level of detail on anything. I've added a link to the E2E test PR in case anyone wants to see the actual test logic, and I also added an explicit requirement for authentication and authorization checks to the Beta graduation criteria.

Make the authorization test introspect...

I created Issue 206 to track your suggestion for the authorization unit test.

-->

No.
The presence of `SnapshotMetadataService` CR objects indicates that this
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll revert to the previous text.

will rollout across nodes.
-->

The CSI driver installs the `SnapshotMetadataService` CRD, creates an instance
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is as per the KEP.

This feature adds functionality to a CSI driver and follows the established pattern of existing CSI driver sidecars that provide for volume snapshots, etc, in how the sidecar is deployed (https://kubernetes-csi.github.io/docs/deploying.html).

However, unlike the other CSI features, access to this new sidecar bypasses the K8s API server for the most part, as it involves a lot of I/O. Secure access to the sidecar requires that the client know the server's certificate and audience string - the CR exists to convey this information, and how this CR gets created is up to the CSI driver.

-->

CSI driver specific.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No upgrade or rollback was tested.

CSI sidecar's are deployed as part of a CSI driver installation and hence are vendor specific in nature. The CSI specifies how a sidecar container communicates with the CSI driver container, but does not specify how the containers are deployed.


CSI driver specific.
Log records from the `external-snapshot-metadata` sidecar should be visible in
the CSI driver logs.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re-wrote to be:

Log records from the external-snapshot-metadata sidecar container should be visible in its logs,
assuming a typical deployment such as that described in
Deploying CSI Driver on Kubernetes.

The CSI driver installs the `SnapshotMetadataService` CRD, creates an instance
of this CR, and deploys the `external-snapshot-metadata` sidecar.
Each CSI driver vendor independently must explicitly handle failure to construct
these objects during rollout or rollback of their driver fails.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added:

Any rollback will cause active operations by a backup application to fail.
Such failure is similar to a network failure that the application ought to be
capable of handling.

Comment on lines 1475 to 1478
- Event Reason: A `GetMetadataAllocated` Normal Event with a response containing the details of the
metadata blocks of the snapshot from the CSI driver.
- Event Reason: A `GetMetadataDelta` Normal Event with a response containing the delta of the metadata
blocks of the 2 snapshots from the CSI driver.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, you're right. Reverted.

-->

*Please refer to
[CSI Terminology](https://github.com/container-storage-interface/spec/blob/master/spec.md#terminology)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

permanent links everywhere, please

[TokenReview](https://kubernetes.io/docs/reference/kubernetes-api/authentication-resources/token-review-v1/) and
[SubjectAccessReview](https://kubernetes.io/docs/reference/kubernetes-api/authorization-resources/subject-access-review-v1/)
APIs are controlled by Kubernetes security policy.
A side effect of the use of these APIs is that it partially mitigates
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kubernetes/sig-auth-proposals This KEP suggests that an unauthenticated client can cause N authenticated clients to fan out requests to the KAS until KAS starts rate-limiting each of these authenticated clients. N is the number of CSI driver instances installed in the cluster.

Is this ok?


The proposal does not specify how such a security policy is to be configured.

An example of such a policy can be seen in [deploy/example](https://github.com/kubernetes-csi/external-snapshot-metadata/tree/main/deploy/example)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Inline the specific RBAC rule here

published in the source code repository of the
[external-snapshot-metadata sidecar](#the-external-snapshot-metadata-sidecar).
[external-snapshot-metadata sidecar](#the-external-snapshot-metadata-sidecar)
(see [proto/schema.proto](https://github.com/kubernetes-csi/external-snapshot-metadata/blob/main/proto/schema.proto)).
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

permanent link, please

Comment on lines -1134 to +1147
No integration tests are required. This feature is better tested with e2e tests.
[Integration tests](https://github.com/kubernetes-csi/external-snapshot-metadata/blob/main/.github/workflows/integration-test.yaml)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Permanent link, please.

Why is this considered an integration test?

will rollout across nodes.
-->

The CSI driver installs the `SnapshotMetadataService` CRD, creates an instance
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, but the cluster operator/administrator should be responsible for the APIs that exist in the cluster, correct?

If my assumption is wrong, please explain how conflicts in different CRD versions should be handled by the CSI drivers, e.g.. a field gets added into the CRD but an old version of the CSI driver attempts to reconcile the CRD to its desired version.

The CSI driver installs the `SnapshotMetadataService` CRD, creates an instance
of this CR, and deploys the `external-snapshot-metadata` sidecar.
Each CSI driver vendor independently must explicitly handle failure to construct
these objects during rollout or rollback of their driver fails.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. I think there might be a typo I missed in the first read.

Suggested change
these objects during rollout or rollback of their driver fails.
these objects during rollout or rollback if their driver fails.

Comment on lines +1426 to +1427
Such failure is similar to a network failure that the application ought to be
capable of handling.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Such failure is similar to a network failure that the application ought to be
capable of handling.
Such failure is similar to a network failure that the backup client application ought to be
capable of handling gracefully.

-->

CSI driver specific.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok

block mode backups should become more efficient in both storage
space and time utilization.
The `GetMetadataAllocated` and `GetMetadataDelta` snapshot metadata operations
should be successful most of time.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Define "most of the time" with a specific number.

@k8s-ci-robot k8s-ci-robot added sig/auth Categorizes an issue or PR as relevant to SIG Auth. kind/design Categorizes issue or PR as related to design. labels Feb 9, 2026
@enj enj moved this to Needs Triage in SIG Auth Feb 9, 2026
@enj enj added this to SIG Auth Feb 9, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/design Categorizes issue or PR as related to design. kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory ok-to-test Indicates a non-member PR verified by an org member that is safe to test. sig/auth Categorizes an issue or PR as relevant to SIG Auth. sig/storage Categorizes an issue or PR as relevant to SIG Storage. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.

Projects

Status: Needs Triage

Development

Successfully merging this pull request may close these issues.

4 participants