ct: hook cloud topics recovery into whole cluster restore #29515
andrwng merged 8 commits into redpanda-data:dev from
Conversation
Pull request overview
This PR hooks cloud topics recovery into the whole cluster restore process. It introduces new recovery stages for cloud topics metastore and cloud topic data, allowing cloud topics to be restored from controller snapshots during cluster recovery.
Changes:
- Added two new recovery stages (recovered_cloud_topics_metastore and recovered_cloud_topic_data) for restoring cloud topics
- Modified the topic creation flow to use the metastore's remote label for cloud topics
- Implemented recovery actions for cloud topics metastore and cloud topic data
- Added comprehensive tests for cloud topics cluster recovery scenarios
Reviewed changes
Copilot reviewed 26 out of 26 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| src/v/redpanda/application_start.cc | Passes cloud topics state to controller start for recovery |
| src/v/kafka/server/handlers/alter_configs.cc | Prevents remote_label from being modified via AlterConfigs |
| src/v/cluster/types.h | Adds remote_label property and new recovery stages |
| src/v/cluster/types.cc | Adds string representation for new recovery stages |
| src/v/cluster/topics_frontend.cc | Updates topic creation to use metastore's remote label for cloud topics |
| src/v/cluster/topic_table.cc | Adds incremental update handling for remote_label |
| src/v/cluster/partition_manager.cc | Skips log download for cloud topics during recovery |
| src/v/cluster/controller.h | Adds cloud topics state parameter to controller start |
| src/v/cluster/controller.cc | Passes cloud topics state to recovery backend |
| src/v/cluster/cluster_recovery_reconciler.h | Adds cloud topics fields to recovery actions |
| src/v/cluster/cluster_recovery_reconciler.cc | Implements recovery action generation for cloud topics |
| src/v/cluster/cloud_metadata/tests/controller_snapshot_reconciliation_test.cc | Tests for metastore and cloud topic reconciliation |
| src/v/cluster/cloud_metadata/tests/cluster_metadata_utils.h | Helper functions for cloud topic properties |
| src/v/cluster/cloud_metadata/tests/BUILD | Adds dependencies for test helpers |
| src/v/cluster/cloud_metadata/cluster_recovery_backend.h | Adds cloud topics state member |
| src/v/cluster/cloud_metadata/cluster_recovery_backend.cc | Implements metastore and cloud topic recovery actions |
| src/v/cluster/BUILD | Adds cloud topics metastore dependencies |
| src/v/cloud_topics/tests/cluster_recovery_test.cc | End-to-end test for cloud topics cluster recovery |
| src/v/cloud_topics/tests/BUILD | Adds cluster recovery test target |
| src/v/cloud_topics/level_one/metastore/tests/replicated_metastore_test.cc | Tests restore with correct partition count |
| src/v/cloud_topics/level_one/metastore/tests/BUILD | Adds test dependencies |
| src/v/cloud_topics/level_one/metastore/replicated_metastore.cc | Updates restore to handle missing manifest and partition count |
| src/v/cloud_topics/level_one/metastore/leader_router.h | Adds num_partitions parameter to ensure_topic_exists |
| src/v/cloud_topics/level_one/metastore/leader_router.cc | Passes num_partitions through to domain supervisor |
| src/v/cloud_topics/level_one/domain/domain_supervisor.h | Adds num_partitions parameter |
| src/v/cloud_topics/level_one/domain/domain_supervisor.cc | Creates metastore topic with specified partition count |
Comments suppressed due to low confidence (1)
src/v/cloud_topics/level_one/metastore/replicated_metastore.cc:1
- Corrected spelling of 'uplaod' to 'upload'.
```cpp
// Restore on the metastore with our desired remote label. This will
// create the underlying topic if it doesn't already exist. Retry on
// transport errors since leadership may be unstable during recovery.
auto* metastore = _ct_state->local().get_l1_metastore();
```
Potential null pointer dereference. While there's a check for !_ct_state earlier, there's no validation that get_l1_metastore() returns a non-null pointer before using it.
Suggested change:
```cpp
auto* metastore = _ct_state->local().get_l1_metastore();
if (metastore == nullptr) {
    vlog(
      clusterlog.error,
      "Cloud topics metastore not available for recovery");
    co_return cluster::errc::invalid_configuration_update;
}
```
Force-pushed from 4de461f to eef4196
```cpp
auto last_included_term = last_segment->archiver_term;

// TODO: implement a recovery primitive for cloud topics
if (!ntp_cfg.cloud_topic_enabled()) {
```
Can you remove this change? It interferes with the implementation of the CTP recovery.
Is it tricky to rebase over the conflict? Would different formatting make the conflict easier to deal with?
This is required for my tests, as without it, WCR will fail in trying to autocreate the topics.
Just rebased
Lazin left a comment:
LGTM except the last mile (partition manager recovery) but I'll revert that later
```cpp
case recovered_acls:
    return 5;
case recovered_remote_topic_data:
case recovered_cloud_topics_metastore:
```
the order has to be changed because subsequent steps need all topics to be recovered already?
Are you suggesting that I change the order, or are you asking why it's important that it's ordered that way?
The order is important for a few reasons:
- it's generally going to be less error-prone to do metastore recovery before we've restored all the other (e.g. tiered storage) topics
- we want to restore the cloud topics before e.g. getting to the stage where we truncate consumer group offsets
```cpp
          tp_ns);
        continue;
    }
    if (tp_config.is_cloud_topic()) {
```
For cloud topics the redpanda.remote.recovery topic property is not needed.
The flow of the L0 recovery:
- the reconciler recovers the metastore
- then it reads starting points for all CTPs
- then it invokes a method of the topics_frontend that replicates the starting points
- the starting points are added to the topics table
- CTPs are created normally (no recovery prop), and the partitions will be seeded with the right starting points from the metastore
I think here we just need to add a config to the list without adding the recovery property, or I guess I can change it in my PR later.
Yea, that makes sense. We can either do that in the cluster_recovery_backend, which has access to the metastore when getting into the recovered_cloud_topic_data stage, or, as you suggest, by plumbing that into the topics_frontend. I suspect doing it in the recovery backend will be more straightforward.
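A rough sketch of the flow described in this thread; the helper names (read_starting_points, replicate_starting_points), the l1_metastore type name, and the snapshot-topics container are hypothetical and not part of this PR, while only is_cloud_topic() and the overall sequencing come from the discussion and diff above:

```cpp
// Hypothetical sketch of the proposed L0 recovery flow: read per-partition
// starting points from the restored metastore, replicate them through the
// topics_frontend, then create the cloud topics without the recovery prop.
ss::future<cluster::errc> recover_cloud_topic_data_sketch(
  l1_metastore& metastore, // placeholder type name
  cluster::topics_frontend& frontend,
  const std::vector<cluster::topic_configuration>& snapshot_topics) {
    for (const auto& cfg : snapshot_topics) {
        if (!cfg.is_cloud_topic()) {
            continue;
        }
        // Read the per-partition starting offsets/terms from the restored
        // metastore (placeholder call).
        auto starting_points = co_await read_starting_points(
          metastore, cfg.tp_ns);
        // Replicate them through the topics_frontend so they land in the
        // topics table (placeholder call).
        auto ec = co_await replicate_starting_points(
          frontend, cfg.tp_ns, std::move(starting_points));
        if (ec != cluster::errc::success) {
            co_return ec;
        }
        // The topic itself is then created normally, without the recovery
        // property; its partitions are seeded from the replicated starting
        // points.
    }
    co_return cluster::errc::success;
}
```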
Force push:
There were some uncaught exceptions in opening the replicated database. Wraps these in an appropriate error type. Along the way, I updated some other exception handling to use the new wrapper.
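As an illustration of the wrapping pattern this commit describes, here is a minimal sketch; open_replicated_db, db_handle, and the chosen errc value are placeholders, not the PR's actual wrapper type:

```cpp
// Hypothetical sketch: convert exceptions thrown while opening the
// replicated database into an error the caller can handle, rather than
// letting them escape the coroutine.
ss::future<cluster::errc> open_db_wrapped_sketch(db_handle& db) {
    try {
        co_await open_replicated_db(db);
    } catch (const std::exception&) {
        // Placeholder error value; the real change wraps the exception in
        // an appropriate error type for the metastore code path.
        co_return cluster::errc::replication_error;
    }
    co_return cluster::errc::success;
}
```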
In restoring the metastore topic during WCR, one of the steps to complete recovery is to change the metastore topic's remote label to be that of the original metastore topic. In this way, the location of the metastore manifest will be consistent between the original cluster and the restored cluster (which is a useful property for read replicas, which will need to locate the metastore).
It'll generally be more robust if the restored topic is created with the number of partitions from the metastore manifest, vs. the current behavior of rejecting the restore if there's a partition count mismatch. To that end, allow ensure_topic_exists() to take in a parameter for the number of partitions.
- We previously rejected the restore if the number of domains didn't match the default number of partitions. Instead, we just create the same number of partitions as there are expected domains.
- We previously rejected the restore if there is no metastore manifest in the cloud. Instead, treat this as success, and opt to create the metastore topic if needed.
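A minimal sketch of the adjusted restore behavior under these assumptions; only ensure_topic_exists() and its new num_partitions parameter come from this change, while metastore_manifest, its domains field, and default_metastore_partitions are placeholders:

```cpp
// Hypothetical sketch of the restore path after these two changes.
ss::future<cluster::errc> restore_sketch(
  std::optional<metastore_manifest> manifest) {
    if (!manifest.has_value()) {
        // No metastore manifest in the cloud: treat this as success, and
        // just make sure the metastore topic exists with the default layout.
        co_await ensure_topic_exists(default_metastore_partitions);
        co_return cluster::errc::success;
    }
    // Create the topic with one partition per expected domain instead of
    // rejecting a mismatch with the default partition count.
    const auto num_partitions = static_cast<int32_t>(
      manifest->domains.size());
    co_await ensure_topic_exists(num_partitions);
    // Each domain is then restored into its corresponding partition.
    co_return cluster::errc::success;
}
```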
This allows us to rely on the invariant that a topic's remote_label points to the same remote label that the metastore's manifest will be. This is a nice property to have for read replicas: given a topic manifest, we'll be able to find the appropriate metastore manifest in the object storage.
The current ordering of actions is somewhat brittle since action order is tied to stage enum value. This makes it difficult to extend with new stages because stages are serialized in the controller log, and thus it's unsafe to add new stage enums with values that aren't at the end of the enum space. This commit takes a stab at making this a bit better by defining a non-persisted ordering of stages that is decoupled from enum value. My intent is to add more stages for cloud topics restore that will not be the last stage actioned.
This adds a couple of stages that will be used for cloud topics recovery:
- metastore recovery: the metastore will be restored such that the new cluster's metastore topic will host the appropriate domain databases
- topic recovery: each cloud topic will query the metastore to learn about the offsets/terms at which to start each partition, and then initialize their CTP STMs with those starting points
These stages are only performed if there are cloud topics that need to be restored.
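A hedged sketch of what the decoupled ordering could look like, assuming the stage enum in cluster/types.h is named recovery_stage; the function name and the specific indices are illustrative only:

```cpp
// Illustrative sketch only: the ordinal used to sequence recovery actions is
// computed here rather than derived from the enum's serialized value, so new
// stages can be appended to the end of the enum space (keeping controller
// log serialization stable) yet ordered wherever the recovery flow needs.
int stage_order_sketch(cluster::recovery_stage stage) {
    switch (stage) {
    case cluster::recovery_stage::recovered_cloud_topics_metastore:
        return 6; // hypothetical: before other topic data is restored
    case cluster::recovery_stage::recovered_cloud_topic_data:
        return 7; // hypothetical: once the metastore is available
    // remaining stages elided from this sketch
    default:
        return 0;
    }
}
```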
This implements the body of the cluster restore stages for cloud topics, by restoring the metastore and then calling create topics on all the desired cloud topics. One of the invariants we're trying to uphold is that the metastore topic will have the same remote_label after restoring as the original (this has the nice property of being able to point at the metastore from a topic manifest, e.g. for read replicas). To that end, restoring the metastore looks like:
1. creating the metastore topic
2. calling restore
3. upon success, updating the metastore topic's remote_label to be what it was in the dead cluster
After this happens, we go ahead and create all the cloud topics, which is only partially implemented in this commit: we create the topics, but additional logic is needed to get the restored topics to bootstrap their ctp_stms from the metastore. That is left as separate work.
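A rough sketch of that three-step flow; create_metastore_topic, restore_all_domains, update_metastore_remote_label, and the remote_label parameter type are placeholders, while _ct_state->local().get_l1_metastore() appears in the PR diff itself:

```cpp
// Hypothetical sketch of the recovered_cloud_topics_metastore stage.
ss::future<cluster::errc> restore_metastore_stage_sketch(
  remote_label original_label) {
    auto* metastore = _ct_state->local().get_l1_metastore();
    if (metastore == nullptr) {
        co_return cluster::errc::invalid_configuration_update;
    }
    // 1. Create the metastore topic if it doesn't already exist.
    if (auto ec = co_await create_metastore_topic();
        ec != cluster::errc::success) {
        co_return ec;
    }
    // 2. Restore each domain database from the metastore manifest.
    if (auto ec = co_await restore_all_domains(*metastore);
        ec != cluster::errc::success) {
        co_return ec;
    }
    // 3. On success, point the metastore topic's remote_label back at the
    //    original cluster's label so the manifest location stays the same,
    //    and so that a repeated restore attempt becomes a no-op.
    co_return co_await update_metastore_remote_label(original_label);
}
```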
Retry command for Build #80268. Please wait until all jobs are finished before running the slash command.
This hooks cloud topics restore into the Whole Cluster Restore machinery. At a high level, cloud topics restore entails two steps: restoring the metastore, and then restoring the cloud topics themselves.
In restoring the metastore, we restore each domain, and then update the metastore topic config so its remote_label will be that of the original metastore. This sets the metastore up to continue uploading its metastore manifest to the original cluster's location, which will be a nice property to have for discoverability of the metastore (e.g. for read replicas). This updated remote label also serves as an indicator that restore is complete and that subsequent attempts to restore this metastore will be no-ops.
In restoring the cloud topics, all the original cloud topics' properties are preserved, but with the recovery property set. There is some additional work that needs to be done to make the restored cloud topics operational (e.g. on restore, they should query the metastore to bootstrap their CTP STMs), but that is left for later work.

Backports Required
Release Notes