TQ: Support for ZFS Key Rotation #9737

Open
plaidfinch wants to merge 2 commits into main from tq-zfs-key-rotation

Conversation

@plaidfinch
Contributor

@plaidfinch plaidfinch commented Jan 28, 2026

When Trust Quorum commits a new epoch, all U.2 crypt datasets must have their encryption keys rotated to use the new epoch's derived key. This change implements the key rotation flow triggered by epoch commits.

Trust Quorum Integration

  • Add watch channel to NodeTaskHandle for epoch change notifications
  • Initialize channel with current committed epoch on startup
  • Notify subscribers via send_if_modified() when epoch changes
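
A minimal sketch of this notification pattern on top of tokio's watch channel; the Epoch newtype, field names, and the subscribe/notify methods below are illustrative stand-ins rather than the exact NodeTaskHandle API in this PR:

use tokio::sync::watch;

/// Illustrative stand-in for the trust quorum epoch type.
#[derive(Copy, Clone, PartialEq, Eq, PartialOrd, Ord, Debug)]
pub struct Epoch(pub u64);

pub struct NodeTaskHandle {
    // Subscribers watch this channel for newly committed epochs.
    committed_epoch_tx: watch::Sender<Option<Epoch>>,
}

impl NodeTaskHandle {
    /// Initialize the channel with whatever epoch was committed at startup.
    pub fn new(current: Option<Epoch>) -> Self {
        let (committed_epoch_tx, _rx) = watch::channel(current);
        Self { committed_epoch_tx }
    }

    pub fn subscribe_committed_epoch(&self) -> watch::Receiver<Option<Epoch>> {
        self.committed_epoch_tx.subscribe()
    }

    /// Only wakes subscribers when the committed epoch actually changes.
    pub fn notify_committed(&self, epoch: Epoch) {
        self.committed_epoch_tx.send_if_modified(|cur| {
            if *cur == Some(epoch) {
                false
            } else {
                *cur = Some(epoch);
                true
            }
        });
    }
}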

Config Reconciler Integration

  • Accept committed_epoch_rx watch channel from trust quorum
  • Trigger reconciliation when epoch changes
  • Track per-disk encryption epoch in ExternalDisks
  • Add rekey_for_epoch() to coordinate key rotation:
    • Filter disks needing rekey (cached epoch < target OR unknown)
    • Derive keys for each disk via StorageKeyRequester
    • Send batch request to dataset task
    • Update cached epochs on success
    • Retry on failure via normal reconciliation retry logic
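
Roughly, the coordination in rekey_for_epoch looks like the sketch below. DiskState, RekeyItem, and the two closures are illustrative stand-ins for the reconciler's real state and channels (the real function is async), but the filter/derive/batch/update flow follows the bullets above:

type Epoch = u64; // stand-in

struct DiskState {
    dataset_name: String,
    cached_epoch: Option<Epoch>,
}

struct RekeyItem {
    dataset_name: String,
    key: Vec<u8>,
}

/// Returns true if any rekey operation failed, so the caller can lean on the
/// normal reconciliation retry loop.
fn rekey_for_epoch(
    disks: &mut [DiskState],
    target: Epoch,
    derive_key: impl Fn(&str) -> Option<Vec<u8>>,
    send_batch: impl Fn(Vec<RekeyItem>) -> bool,
) -> bool {
    // 1. Filter to disks whose cached epoch is unknown or behind the target
    //    (None compares less than Some, so it covers "unknown" too).
    let needs: Vec<usize> = (0..disks.len())
        .filter(|&i| disks[i].cached_epoch < Some(target))
        .collect();
    if needs.is_empty() {
        return false;
    }

    // 2. Derive a key per disk; bail out on failure and retry later.
    let mut batch = Vec::new();
    for &i in &needs {
        match derive_key(&disks[i].dataset_name) {
            Some(key) => batch.push(RekeyItem {
                dataset_name: disks[i].dataset_name.clone(),
                key,
            }),
            None => return true,
        }
    }

    // 3. Send one batch request to the dataset task, and 4. update the cached
    //    epochs only when the batch reports success.
    if send_batch(batch) {
        for &i in &needs {
            disks[i].cached_epoch = Some(target);
        }
        false
    } else {
        true
    }
}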

Dataset Task Changes

  • Add RekeyRequest/RekeyResult types for batch rekey operations
  • Add datasets_rekey() with idempotency check (skip if already at target)
  • Use Zfs::change_key() for atomic key + epoch property update
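
A sketch of the per-dataset idempotency check; current_epoch_of and change_key here are stand-ins for the ZFS property read and Zfs::change_key(), whose real signatures live in illumos-utils:

type Epoch = u64; // stand-in

fn rekey_one_dataset(
    dataset: &str,
    new_key: &[u8],
    new_epoch: Epoch,
    current_epoch_of: impl Fn(&str) -> Option<Epoch>,
    change_key: impl Fn(&str, &[u8], Epoch) -> Result<(), String>,
) -> Result<(), String> {
    // Idempotency: if the dataset already records the target epoch, a prior
    // attempt (possibly interrupted by a crash) finished the rotation.
    if current_epoch_of(dataset) == Some(new_epoch) {
        return Ok(());
    }
    // Otherwise rotate the wrapping key and stamp the epoch property in one
    // atomic step.
    change_key(dataset, new_key, new_epoch)
}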

ZFS Utilities

  • Add Zfs::change_key() using zfs_atomic_change_key crate
  • Add Zfs::load_key(), unload_key(), dataset_exists()
  • Add epoch field to DatasetProperties
  • Add structured error types for key operations
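
For the simpler helpers, here is a rough CLI-backed sketch in the spirit of the existing "zfs list" helper shown later in this review; the real Zfs::* methods in illumos-utils may differ in shape, and change_key itself goes through the zfs_atomic_change_key crate rather than the CLI:

use tokio::process::Command;

// Path assumed for illustration; illumos-utils has its own ZFS constant.
const ZFS: &str = "/usr/sbin/zfs";

pub async fn dataset_exists(dataset: &str) -> std::io::Result<bool> {
    // `zfs list -H <dataset>` exits non-zero when the dataset does not exist.
    Ok(Command::new(ZFS).args(["list", "-H", dataset]).status().await?.success())
}

pub async fn unload_key(dataset: &str) -> std::io::Result<bool> {
    // `zfs unload-key` fails if no key is currently loaded; callers that only
    // care that the key ends up unloaded can ignore a failure here.
    Ok(Command::new(ZFS).args(["unload-key", dataset]).status().await?.success())
}

pub async fn epoch_property(dataset: &str) -> std::io::Result<Option<u64>> {
    // `zfs get -Hpo value oxide:epoch <dataset>` prints "-" when unset.
    let out = Command::new(ZFS)
        .args(["get", "-Hpo", "value", "oxide:epoch", dataset])
        .output()
        .await?;
    if !out.status.success() {
        return Ok(None);
    }
    Ok(String::from_utf8_lossy(&out.stdout).trim().parse().ok())
}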

Crash Recovery

  • Add trial decryption recovery in sled-storage for datasets with missing epoch property (e.g., crash during initial creation)
  • Unload key before each trial attempt to handle crash-after-load-key
  • Set epoch property after successful recovery
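
A sketch of the trial-decryption loop, with stand-ins for the key derivation and ZFS calls (the real recovery lives in sled-storage and is async); candidate epochs would typically be tried newest-first:

type Epoch = u64; // stand-in

fn recover_missing_epoch(
    dataset: &str,
    candidate_epochs: &[Epoch],
    derive_key: impl Fn(Epoch) -> Vec<u8>,
    unload_key: impl Fn(&str),
    try_load_key: impl Fn(&str, &[u8]) -> bool,
    set_epoch_property: impl Fn(&str, Epoch),
) -> Option<Epoch> {
    for &epoch in candidate_epochs {
        // If we crashed after a successful load-key but before stamping the
        // epoch property, the correct key may already be loaded; unload first
        // so the trial load below actually tells us something.
        unload_key(dataset);
        if try_load_key(dataset, &derive_key(epoch)) {
            // Record which epoch's key this dataset is really using.
            set_epoch_property(dataset, epoch);
            return Some(epoch);
        }
    }
    None
}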

@andrewjstone andrewjstone mentioned this pull request Jan 31, 2026
andrewjstone added a commit that referenced this pull request Jan 31, 2026
This commit adds a 3 phase mechanism for sled expungement.

The first phase is to remove the sled from the latest trust quorum
configuration via omdb. The second phase is to reboot the sled after
polling for the trust quorum removal to commit. The third phase is to
issue the existing omdb expunge command, which changes the sled policy
as before.

The first and second phases remove the need to physically remove the
sled before expungement. They act as a software mechanism that prevents
the sled-agent from restarting on the sled and doing work when the sled
should be treated as "absent". We've discussed this numerous times in
the update huddle and it is finally arriving!

The third phase is what informs reconfigurator that the sled is gone.
It remains the same as before, except for an extra sanity check that
the last committed trust quorum configuration does not contain the sled
that is to be expunged.

The removed sled may be added back to this rack or another after being
clean slated. I tested this by deleting the files in the internal
"cluster" and "config" directories and rebooting the removed sled in
a4x2 and it worked.

This PR is marked draft because it changes the current
sled-expunge pathway to depend on real trust quorum. We
cannot safely merge it in until the key-rotation work from
#9737 is merged in.

This also builds on #9741 and should merge after that PR.

## Safety Properties

- Atomic: Key and epoch property set together via `zfs_atomic_change_key`
- Idempotent: Skip rekey if dataset already at target epoch
- Crash-safe: Epoch read from ZFS on restart rebuilds cache correctly
- Conservative: Unknown epochs (None) trigger rekey attempt
@plaidfinch plaidfinch linked an issue Feb 4, 2026 that may be closed by this pull request
@plaidfinch plaidfinch marked this pull request as ready for review February 4, 2026 03:11
Create a new key-manager-types crate containing the disk encryption key
types (Aes256GcmDiskEncryptionKey and VersionedAes256GcmDiskEncryptionKey)
that were previously defined in key-manager. This breaks the dependency
from illumos-utils to key-manager, allowing illumos-utils to depend only
on the minimal types crate.

The key-manager crate re-exports VersionedAes256GcmDiskEncryptionKey for
backwards compatibility.
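
For reference, the backwards-compatibility piece amounts to a re-export along these lines in key-manager's lib.rs (exact module path assumed):

// key-manager/src/lib.rs (sketch): the type now lives in key-manager-types,
// but existing key_manager::VersionedAes256GcmDiskEncryptionKey imports keep
// working via this re-export.
pub use key_manager_types::VersionedAes256GcmDiskEncryptionKey;
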
/// Returns (exists=true, mounted) if the dataset exists with the expected
/// mountpoint, (exists=false, mounted=false) if it doesn't exist.
/// Returns an error if the dataset exists but has an unexpected mountpoint.
async fn dataset_mount_info(
Contributor Author

I renamed this internal function: it was previously named dataset_exists, but I wanted to use that name elsewhere, and it wasn't a good description of what this function actually returns (DatasetMountInfo).

Contributor

@andrewjstone andrewjstone left a comment


This looks fantastic @plaidfinch.

We'll need to test this and possibly coordinate merging it in case it relies on trust quorum being enabled / lrtq upgrade, etc...

impl DatasetProperties {
const ZFS_GET_PROPS: &'static str =
"oxide:uuid,name,mounted,avail,used,quota,reservation,compression";
const ZFS_GET_PROPS: &'static str = "oxide:uuid,oxide:epoch,name,mounted,avail,used,quota,reservation,compression";
Contributor

This needs to be formatted.

log,
"Rotating encryption key";
"dataset" => &req.dataset_name,
"new_epoch" => new_epoch,
Contributor

Worth logging the current epoch?

Member

could just add the current epoch to the new logging scope that's created above, perhaps...

log,
"Failed to rotate encryption key";
"dataset" => &req.dataset_name,
"error" => %e,
Contributor

Worth logging the old and new epochs here?

/// Returns true if any rekey operations failed.
async fn rekey_for_epoch(&mut self, target_epoch: Epoch) -> bool {
// Filter to disks that need rekeying:
// - Known epoch < target: definitely needs rekey
Contributor

I don't see how this would be possible, but I'm curious if we should log an error in case e > target_epoch, just to cover all conditions.

}

if request.disks.is_empty() {
info!(
Contributor

Should this be warn! ?

format!("{}/{}", info.disk.zpool_name(), CRYPT_DATASET);
match self
.key_requester
.get_key(target_epoch.0, info.disk.identity().clone())
Contributor

This can block indefinitely, while waiting to load the secret. I don't think that's a problem but wanted to mention it. I think other things can block in the reconciler also.


// Get the key for this epoch
let key =
match key_requester.get_key(epoch, disk_identity.clone()).await {
Contributor

It may be worth it at some point to get the set of known committed epochs when loading the latest key. Loading the latest key decrypts and loads all prior epochs, so this should be cheap. Then if we aborted a bunch for some reason we wouldn't need to make those requests. This is just an optimization though and not necessary to do in this PR.

pub compression: String,
/// The encryption key epoch for this dataset.
///
/// Only present on encrypted datasets (e.g., crypt datasets on U.2s).
Contributor

Maybe worth noting here that this is not present on datasets that inherit encryption from a parent (if I understand that correctly)?

#[derive(Debug, thiserror::Error)]
pub enum KeyRotationError {
#[error("failed to rotate encryption key for dataset {dataset}")]
ChangeKeyFailed {
Contributor

This is fine as-is, but if this is the only variant this error will ever have, we could shorten it a bit by making it a struct?

#[derive(Debug, thiserror::Error)]
#[error("failed to rotate encryption key for dataset {dataset}")]
pub struct KeyRotationError {
    dataset: String,
    #[source]
    err: anyhow::Error,
}

#[derive(Debug, Default)]
pub struct RekeyRequest {
/// Datasets to rekey, keyed by physical disk UUID.
pub disks: BTreeMap<PhysicalDiskUuid, DatasetRekeyInfo>,
Contributor

I think this is fine but just making sure I understand - we only expect to ever have one encrypted dataset per disk, and any datasets that need encryption will be children of it (and inherit its encryption), right?

Contributor

Correct.

// Ensure we process the initial epoch on startup. Using mark_unchanged()
// means that changed() will fire on the first value, allowing us to
// catch any missed rekeys from crashes.
self.committed_epoch_rx.mark_unchanged();
Contributor

I don't think we need this now that we moved rekeying into do_reconciliation; we're about to go into the loop which unconditionally starts with a call to do_reconciliation, which means we will immediately try to rekey if necessary. As written this will induce a second do_reconciliation (which is harmless but not useful).

// Check if any disks need rekeying to the current committed epoch.
// We use borrow_and_update() to mark the epoch as seen, so we don't
// trigger another reconciliation for the same epoch change.
// Note: We must copy the epoch out of the Ref before any await points.
Contributor

// Note: We must copy the epoch out of the Ref before any await points.

This is true but I think in general we want the time we hold the watch channel to be as short as possible. If we did something like:

let guard = watch_rx.borrow_and_update();
do_a_synchronous_thing(&guard);
drop(guard);
do_an_async_thing().await;

we still hold the channel's read lock throughout do_a_synchronous_thing, which we'd prefer to avoid. It's wonderful when watch channels hold Copy types like this one and we can just immediately deref as you have here.

Comment on lines +723 to +724
let dataset_name =
format!("{}/{}", info.disk.zpool_name(), CRYPT_DATASET);
Contributor

Turbo-nit - maybe move this down into the Ok(key) => { ... } branch? That'd minimize the scope of the variable and let us skip the allocation if we don't need it.

let name = format!("{}/{}", zpool_name, dataset);
let epoch = if let Ok(epoch_str) =
Zfs::get_oxide_value(dataset, "epoch").await
Zfs::get_oxide_value(&name, "epoch").await
Contributor

Was this wrong before and always failing?

Contributor

Ha yes. Claude found this one yesterday and Finch pointed it out to me. It was always wrong, but innocuous because LRTQ doesn't allow key rotations.

let mut command = tokio::process::Command::new(illumos_utils::zfs::ZFS);
let cmd = command.args(&["list", "-H", dataset]);
Ok(cmd.status().await?.success())
Ok(Zfs::dataset_exists(dataset).await?)
Contributor

At this point should we remove this helper and inline Zfs::dataset_exists wherever it's called?

// loaded keys across process restarts, so if we crashed after a
// successful load-key but before setting the epoch property, the
// correct key would still be loaded and our load-key would fail.
let _ = Zfs::unload_key(dataset_name).await;
Contributor

What happens if this fails?

"epoch" => epoch,
);

// No need to unload here -- we unload at the start of each iteration
Contributor

Do we need to unload on the last iteration?

@andrewjstone
Contributor

We tested this on a4x2 with TQ enabled and keys were created correctly during RSS. We then added a sled and rotation completed correctly. We're going to relaunch with trust quorum disabled and see if we can upgrade out of lrtq successfully and then add a sled in our next experiment.

.await
.map_err(|e| ChangeKeyError {
name: dataset.to_string(),
err: anyhow::anyhow!("{e}"),
Member

I think this should be

Suggested change
err: anyhow::anyhow!("{e}"),
err: anyhow::Error::from(e),

rather than formatting it, as that will eat the structured representation of the error's source chain, while Error::from preserves it.


Comment on lines +707 to +710
.filter(|i| match i.cached_epoch {
Some(e) => e < target_epoch,
None => true,
})
Member

turbo nit: Option<T: PartialOrd> is itself PartialOrd, and None compares less than any Some, so this can just be

Suggested change
.filter(|i| match i.cached_epoch {
Some(e) => e < target_epoch,
None => true,
})
.filter(|i| i.cached_epoch < Some(target_epoch))

@andrewjstone
Contributor

Woohoo. Tested on hardware. LRTQ upgrade succeeded. This required a small patch, which we should pull into this branch: 494d49a

I think it's probably good to go in once the comments are resolved and this patch is pulled in. Will re-assess when fresh tomorrow.
