
feat: replace SharedDomainType with SharedForkLifecycle for fork-aware peer selection#824

Open
diegomrsantos wants to merge 10 commits into sigp:unstable from diegomrsantos:feat/shared-fork-lifecycle

Conversation

@diegomrsantos
Member

Issue Addressed

Follows up on PR #820 (fork-aware observed_peer_subnets). While #820 correctly stores per-fork bitmaps, all query methods still aggregate across forks with a blind union(). After the Boole grace period ends, peers only subscribed to Alan topics appear useful for Boole subnets — leading to incorrect peer selection.

Proposed Changes

  • Introduce ForkLifecycle enum (Normal, WarmUp, GracePeriod) and SharedForkLifecycle shared state in the fork crate, replacing SharedDomainType
  • ForkMonitor updates the shared lifecycle state before broadcasting each ForkPhase event
  • Peer selection queries in ConnectionManager now consult the lifecycle: in Normal state only the current fork's bitmap is considered; during WarmUp/GracePeriod both forks are aggregated
  • All components that previously read SharedDomainType (Discovery, Handshake) now read SharedForkLifecycle::domain_type() instead
  • Network::on_fork_phase simplified — no longer sets domain type (monitor handles it), only updates ENR
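The shape described above can be sketched as follows (variant fields taken from the review excerpts later in this thread; a std-only stand-in, not the merged code):

```rust
use std::sync::{Arc, RwLock};

// Stand-ins for the crate's real types; illustrative only.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct DomainType(pub [u8; 4]);

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Fork {
    Alan,
    Boole,
}

/// Fork lifecycle state. Updated only by ForkMonitor.
#[derive(Clone, Debug, PartialEq, Eq)]
pub enum ForkLifecycle {
    /// Single-fork operation (pre-fork or post-grace-period).
    Normal { current: Fork, domain_type: DomainType },
    /// Upcoming fork announced; dual-subscribed to both topic sets.
    WarmUp { current: Fork, upcoming: Fork, domain_type: DomainType },
    /// Fork activated; old subscriptions kept for late messages.
    GracePeriod { current: Fork, previous: Fork, domain_type: DomainType },
}

/// Shared handle: ForkMonitor is the single writer, everyone else reads.
#[derive(Clone)]
pub struct SharedForkLifecycle(Arc<RwLock<ForkLifecycle>>);

impl SharedForkLifecycle {
    pub fn new(initial: ForkLifecycle) -> Self {
        Self(Arc::new(RwLock::new(initial)))
    }

    pub fn get(&self) -> ForkLifecycle {
        self.0.read().unwrap().clone()
    }

    pub fn set(&self, next: ForkLifecycle) {
        *self.0.write().unwrap() = next;
    }

    /// Every variant carries a domain type, so there is no state in
    /// which the domain type is missing or stale relative to the phase.
    pub fn domain_type(&self) -> DomainType {
        match *self.0.read().unwrap() {
            ForkLifecycle::Normal { domain_type, .. }
            | ForkLifecycle::WarmUp { domain_type, .. }
            | ForkLifecycle::GracePeriod { domain_type, .. } => domain_type,
        }
    }
}
```

Because the domain type travels inside each variant, readers cannot observe a phase without its matching domain type, which is the "invalid states unrepresentable" property mentioned below.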

Additional Info

diegomrsantos and others added 4 commits February 11, 2026 13:13
Add a new lifecycle module to the fork crate that models fork transition
states as a single enum with three variants:

- Normal: operating on a single fork (pre-fork or post-grace-period)
- WarmUp: preparing for an upcoming fork (dual-subscribing to new topics)
- GracePeriod: fork activated but keeping old subscriptions for late messages

SharedForkLifecycle wraps this in Arc<RwLock<T>> for cross-component
sharing, following the same pattern as the SharedDomainType it will
replace. Each variant carries domain_type, making invalid states
unrepresentable.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
ForkMonitor now accepts a SharedForkLifecycle and updates it right
before broadcasting each ForkPhase event:

- Preparing  → WarmUp { current, upcoming, domain_type }
- Activated  → GracePeriod { current, previous, domain_type }
- GracePeriodEnded → Normal { current, domain_type }

This makes ForkMonitor the single writer of fork lifecycle state.
Components that previously needed to react to events just to cache
derived values can now read the shared state directly.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
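The phase-to-state mapping in this commit message can be sketched as a pure function (type names are stand-ins mirroring the message above, not the crate's actual definitions):

```rust
// Stand-ins mirroring the commit message above; illustrative only.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct DomainType(pub [u8; 4]);

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Fork {
    Alan,
    Boole,
}

#[derive(Clone, Debug, PartialEq, Eq)]
pub enum ForkLifecycle {
    Normal { current: Fork, domain_type: DomainType },
    WarmUp { current: Fork, upcoming: Fork, domain_type: DomainType },
    GracePeriod { current: Fork, previous: Fork, domain_type: DomainType },
}

/// Hypothetical shape of the broadcast event.
pub enum ForkPhase {
    Preparing { current: Fork, upcoming: Fork },
    Activated { current: Fork, previous: Fork },
    GracePeriodEnded { current: Fork },
}

/// State ForkMonitor writes to SharedForkLifecycle right before
/// broadcasting the corresponding ForkPhase event.
pub fn lifecycle_for(phase: &ForkPhase, domain_type: DomainType) -> ForkLifecycle {
    match *phase {
        ForkPhase::Preparing { current, upcoming } => {
            ForkLifecycle::WarmUp { current, upcoming, domain_type }
        }
        ForkPhase::Activated { current, previous } => {
            ForkLifecycle::GracePeriod { current, previous, domain_type }
        }
        ForkPhase::GracePeriodEnded { current } => {
            ForkLifecycle::Normal { current, domain_type }
        }
    }
}
```

Writing the shared state before the broadcast means any component woken by the event already observes the new lifecycle when it reads the handle.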
Replace the SharedDomainType abstraction with SharedForkLifecycle
across all network components, making peer selection fork-aware.

Key changes:
- Remove SharedDomainType from network crate entirely
- Discovery and Handshake read domain_type via fork_lifecycle
- Network::on_fork_phase simplified to only update ENR (domain type
  updates are now handled by ForkMonitor via SharedForkLifecycle)
- ConnectionManager uses fork lifecycle to filter peer subnets:
  in Normal state only the current fork's bitmap is returned;
  during WarmUp/GracePeriod both forks are aggregated
- Client creates SharedForkLifecycle and passes to both ForkMonitor
  and Network

This prevents peers subscribed only to a defunct fork from appearing
useful for subnet coverage after the grace period ends.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
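The ConnectionManager filtering rule described above can be sketched like this (a plain u128 stands in for Bitfield&lt;Fixed&lt;U128&gt;&gt;, and the early return for an empty map reflects the ENR-fallback concern raised in review; hypothetical shapes throughout):

```rust
use std::collections::BTreeMap;

// u128 stands in for Bitfield<Fixed<U128>>; illustrative only.
type SubnetBitmap = u128;

#[derive(Clone, Copy, Debug, PartialEq, Eq, PartialOrd, Ord)]
pub enum Fork {
    Alan,
    Boole,
}

pub enum ForkLifecycle {
    Normal { current: Fork },
    WarmUp { current: Fork, upcoming: Fork },
    GracePeriod { current: Fork, previous: Fork },
}

/// Union of every per-fork bitmap; only meaningful during a transition,
/// when peers subscribed to either fork are useful.
fn aggregate_fork_bitmaps(fork_map: &BTreeMap<Fork, SubnetBitmap>) -> SubnetBitmap {
    fork_map.values().fold(0, |acc, bits| acc | bits)
}

/// None means "no gossipsub data yet; fall back to the peer's ENR claims".
pub fn subnets_for_lifecycle(
    fork_map: &BTreeMap<Fork, SubnetBitmap>,
    lifecycle: &ForkLifecycle,
) -> Option<SubnetBitmap> {
    if fork_map.is_empty() {
        // Consistent ENR fallback across all lifecycle states.
        return None;
    }
    match lifecycle {
        ForkLifecycle::Normal { current } => fork_map.get(current).copied(),
        ForkLifecycle::WarmUp { .. } | ForkLifecycle::GracePeriod { .. } => {
            Some(aggregate_fork_bitmaps(fork_map))
        }
    }
}
```

In Normal state a peer tracked only under the defunct fork yields None (or a bitmap lacking the current fork), so it no longer counts toward subnet coverage after the grace period ends.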
@diegomrsantos diegomrsantos force-pushed the feat/shared-fork-lifecycle branch from 47555e3 to 50378e6 Compare February 11, 2026 16:14
@diegomrsantos diegomrsantos marked this pull request as draft February 11, 2026 16:14
@diegomrsantos diegomrsantos marked this pull request as ready for review February 11, 2026 16:14
@diegomrsantos diegomrsantos self-assigned this Feb 11, 2026
@diegomrsantos diegomrsantos marked this pull request as draft February 11, 2026 16:21
@diegomrsantos diegomrsantos marked this pull request as ready for review February 11, 2026 16:21
@diegomrsantos
Member Author

@claude review this PR

@diegomrsantos diegomrsantos marked this pull request as draft February 11, 2026 19:15
@diegomrsantos diegomrsantos marked this pull request as ready for review February 11, 2026 19:15
@diegomrsantos diegomrsantos marked this pull request as draft February 11, 2026 19:46
@diegomrsantos diegomrsantos marked this pull request as ready for review February 11, 2026 19:46
///
/// Both forks' contexts are relevant — peers subscribed to either
/// the current or upcoming fork are useful.
WarmUp {
@shane-moore
Member

What about collapsing WarmUp and GracePeriod variants into something like

Transitioning {
   current,
   domain_type
}

just an idea since consumers don't seem to care whether they're in WarmUp or GracePeriod, just that current and domain_type are correct. and upcoming/previous fields seem unused

) -> Option<Bitfield<Fixed<U128>>> {
let fork_map = self.observed_peer_subnets.get(peer)?;
Some(Self::aggregate_fork_bitmaps(fork_map))
match lifecycle {
@shane-moore
Member

when fork_map is empty (newly connected peer, no gossipsub yet), Normal returns None (triggering ENR fallback) but WarmUp/GracePeriod returns Some(all-zeros) (skipping ENR fallback). Adding an early if fork_map.is_empty() { return None } would make the behavior consistent


/// Fork lifecycle state. Updated only by ForkMonitor.
#[derive(Clone, Debug, PartialEq, Eq)]
pub enum ForkLifecycle {
@shane-moore
Member

thinking of ways we could simplify this further @dknopik. I think one way would be to convert this enum to a flat struct:

pub struct ForkState {
    pub domain_type: DomainType,
    pub current_fork: Fork,
    pub in_transition: bool,
}

since network components seem to just care about what is the current domain type and whether the network is in a fork transition period (warm up/grace period) vs if normal operation

another idea would just be to extend ForkSchedule with a current field:

  pub struct ForkSchedule {
      configs: BTreeMap<Fork, ForkConfig>,
      network_name: String,
      // Runtime: updated by ForkMonitor, read by network components
      current: RwLock<ActiveFork>,
  }

  struct ActiveFork {
      fork: Fork,
      in_transition: bool,
  }

benefit of this is killing need for new ForkState module. con is that it mixes config (current ForkSchedule struct) with runtime characteristics (new current field)

or perhaps you wanted to take a look and come up with something. if so, feel free to lmk if you want me to try out whatever you're thinking

@diegomrsantos
Member Author
Feb 23, 2026

Good call on simplification. I’d keep runtime state separate from ForkSchedule (it’s config-only today), so I don’t think adding mutable current there is the right direction.
I’m fine simplifying the consumer API with an is_transition() helper, but I’d keep ForkLifecycle as an enum for now so we retain WarmUp vs GracePeriod semantics if/when we need them

@claude-code-actions-sigp

claude-code-actions-sigp bot commented Feb 12, 2026

Claude finished @diegomrsantos's task in 1m 44s


Code Review Complete

Reviewing PR #824: fork-aware peer selection with SharedForkLifecycle

Todo List

  • Read and understand the new ForkLifecycle and SharedForkLifecycle implementation
  • Review ForkMonitor changes for lifecycle state management
  • Analyze peer selection logic changes in ConnectionManager
  • Review network component updates (Discovery, Handshake, Network)
  • Check for proper error handling and thread safety
  • Evaluate test coverage
  • Consider existing review feedback from shane-moore
  • Provide comprehensive review feedback

Summary

This PR successfully addresses the core issue of fork-aware peer selection after grace periods end. The implementation is well-structured with comprehensive tests. However, there are architectural opportunities for simplification and a critical behavioral inconsistency that should be addressed.

Critical Issues

1. Inconsistent ENR Fallback Behavior (connection.rs:319)

Issue: Shane-moore correctly identified that when fork_map is empty (newly connected peer), behavior differs based on lifecycle state:

  • Normal state: returns None → triggers ENR fallback
  • WarmUp/GracePeriod: returns Some(all-zeros bitfield) → skips ENR fallback

Location: anchor/network/src/peer_manager/connection.rs:313-324

fn get_peer_subnets_for_lifecycle(...) -> Option<Bitfield<Fixed<U128>>> {
    let fork_map = self.observed_peer_subnets.get(peer)?;
    match lifecycle {
        ForkLifecycle::Normal { current, .. } => fork_map.get(current).cloned(),
        ForkLifecycle::WarmUp { .. } | ForkLifecycle::GracePeriod { .. } => {
            Some(Self::aggregate_fork_bitmaps(fork_map))  // ← returns Some(zeros) if fork_map is empty
        }
    }
}

Impact: During fork transitions, newly connected peers that haven't gossiped yet appear to offer zero subnets instead of falling back to their ENR claims.

Fix: Add early return before the match statement:

fn get_peer_subnets_for_lifecycle(...) -> Option<Bitfield<Fixed<U128>>> {
    let fork_map = self.observed_peer_subnets.get(peer)?;
    if fork_map.is_empty() {
        return None;  // Consistent ENR fallback across all states
    }
    match lifecycle {
        // ... rest of the logic
    }
}

Architectural Suggestions

2. Enum Complexity vs. Benefit (lifecycle.rs:16)

Context: Shane-moore suggested collapsing WarmUp and GracePeriod into a single Transitioning variant since consumers only care about:

  1. Current fork identity
  2. Whether to aggregate bitmaps (transition) or not (normal)

Current Design:

pub enum ForkLifecycle {
    Normal { current: Fork, domain_type: DomainType },
    WarmUp { current: Fork, upcoming: Fork, domain_type: DomainType },
    GracePeriod { current: Fork, previous: Fork, domain_type: DomainType },
}

Observation: The upcoming/previous fields in WarmUp/GracePeriod are never accessed outside ForkMonitor. All consumers pattern-match to distinguish transition states from normal, then treat both transition states identically.

Evidence from code:

  • connection.rs:319-324: WarmUp | GracePeriod => { aggregate() } (treats both identically)
  • Monitor is the only component that sets/reads upcoming/previous

Options:

  1. Simplify enum (shane-moore's suggestion):

    pub enum ForkLifecycle {
        Normal { current: Fork, domain_type: DomainType },
        Transitioning { current: Fork, domain_type: DomainType },
    }

    Pros: Simpler API, fewer match arms, clearer intent
    Cons: Loses explicit phase distinction in type system

  2. Keep current design (status quo):
    Pros: Explicit phases self-document the fork lifecycle
    Cons: Added complexity, unused fields

Recommendation: Consider option 1. The explicit phase distinction provides minimal value if the fields aren't used and consumers treat both phases identically. The simpler enum would be easier to understand and maintain.

3. Alternative: Extend ForkSchedule (review comments)

Shane-moore also suggested embedding runtime state in ForkSchedule:

pub struct ForkSchedule {
    configs: BTreeMap<Fork, ForkConfig>,
    network_name: String,
    current: RwLock<ActiveFork>,  // Runtime state
}

Pros: Eliminates separate SharedForkLifecycle module
Cons: Mixes configuration (immutable) with runtime state (mutable)

Recommendation: The current separate SharedForkLifecycle approach maintains better separation of concerns. Configuration and runtime state have different lifecycles and mutation patterns.

Code Quality

Strengths

  1. Excellent Test Coverage:

    • lifecycle.rs: Basic lifecycle operations well-tested
    • monitor.rs: Comprehensive state machine tests covering all transitions
    • connection.rs:552-997: 32 tests covering fork-aware behavior, edge cases, and the multi-fork subscription bug
  2. Clear Documentation:

    • Doc comments explain the "why" effectively
    • Per-fork tracking motivation is well-documented
  3. Thread Safety:

    • SharedForkLifecycle correctly uses Arc<RwLock<T>>
    • Single-writer pattern (ForkMonitor only) is enforced by convention and documented
  4. Performance Optimization:

    • connection.rs:223: Hoists lifecycle read outside peer loop to avoid repeated lock acquisition (good catch!)

Areas for Improvement

  1. Missing Error Handling:

    // monitor.rs:508
    let Some(domain_type) = fork_schedule.domain_type(current_fork) else {
        tracing::error!(...);
        return;  // Silent failure - lifecycle not updated!
    };

    Issue: If the fork schedule is misconfigured, lifecycle updates fail silently. This could lead to incorrect peer selection.

    Suggestion: Consider panic or more explicit error propagation for config errors vs. runtime errors.

  2. Clone vs. Reference (lifecycle.rs:85-86):

    pub fn get(&self) -> ForkLifecycle {
        self.0.read().clone()  // Clones the enum
    }

    Trade-off: Cloning the enum (small, 3 variants) is acceptable, but for bulk operations, prefer passing &ForkLifecycle or pre-reading once (as done in count_observed_peers_for_subnets).

  3. Discovery Domain Type Filtering (discovery.rs:455-462):
    The domain type predicate correctly reads from SharedForkLifecycle:

    let domain_type_predicate = move |enr: &Enr| {
        if let Some(Ok(domain_type)) = enr.get_decodable::<[u8; 4]>("domaintype") {
            shared_lifecycle.domain_type().0 == domain_type
        } else {
            trace!(?enr, "Rejecting ENR with missing domaintype");
            false
        }
    };

    This is correct - discovery automatically adapts to fork transitions.

Testing

Strong Points:

  • All three lifecycle states tested
  • Fork activation sequence tested (monitor.rs:935-1010)
  • Multi-fork bitmap aggregation thoroughly tested
  • Edge cases covered (duplicate subscribes, partial unsubscribes, disconnected peers)
  • Handshake fork lifecycle update propagation tested (handshake/mod.rs:542-684)

Suggestion: Add integration test demonstrating the core bug fix - that after grace period ends, peers only on the old fork are correctly excluded from subnet coverage calculations. The existing tests verify the mechanism but not the end-to-end behavior.

Security & Performance

No Issues Found:

  • No obvious security vulnerabilities
  • No performance regressions - the lifecycle read optimization is a positive change
  • Thread-safe shared state correctly implemented

Comparison with Existing Review Comments

Shane-moore's feedback aligns with my findings:

  1. ✅ ENR fallback inconsistency (issue #1 above) - valid and should be fixed
  2. ✅ Enum simplification suggestion (issue #2 above) - worth considering but not critical
  3. The "ForkSchedule extension" suggestion is interesting but I recommend keeping the current separation

Summary of Recommendations

Must Fix

  • Fix ENR fallback inconsistency for empty fork_map during transitions (add early return in get_peer_subnets_for_lifecycle)

Should Consider

  • Simplify ForkLifecycle enum by collapsing WarmUp/GracePeriod into Transitioning
  • Improve error handling for fork schedule misconfigurations (don't fail silently)
  • Add integration test for core bug fix scenario

Nice to Have

  • Add inline comments explaining why certain lifecycle phases are treated identically in peer selection

Conclusion

The PR successfully solves the fork-aware peer selection problem with a well-tested implementation. The critical ENR fallback inconsistency should be fixed before merge. The architectural simplification suggestions are worth discussing but not blocking.


@dknopik
Member

dknopik commented Feb 20, 2026

I think we can simplify this - I am just not sure how exactly. Here are a few possibilities:

  1. Roll back #814 (fix: share domain type across network components to fix handshake after fork activation). Considering the original motivation was a bugfix for keeping the current domain type updated, this has grown quite complex - hundreds of lines and a lot of code to test. Instead, I think the original approach is way simpler and at this point less error prone.
  2. Kind of this approach, but less complex: I think having a RwLock for the current ForkConfig is simple enough. Any parts of the code interested in the lifecycle can use the existing broadcast channel.
  3. Finally, avoid the shared state, but instead pass around the ForkSchedule to parts of the code. Or maybe a new wrapper that brings along a SlotClock for easier querying of the current fork.

Right now, this is the rough order of my preference. Happy to discuss all of these approaches!
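Option 2 above could look roughly like this (hypothetical names; the real ForkConfig and broadcast channel live in the fork crate):

```rust
use std::sync::{Arc, RwLock};

// Hypothetical stand-in for the crate's ForkConfig.
#[derive(Clone, Debug, PartialEq, Eq)]
pub struct ForkConfig {
    pub domain_type: [u8; 4],
}

/// Shared current config: ForkMonitor writes it on transitions and still
/// broadcasts ForkPhase. Components interested in lifecycle *events* keep
/// using the existing channel; components that only need the current
/// *value* just read this handle.
#[derive(Clone)]
pub struct SharedForkConfig(Arc<RwLock<ForkConfig>>);

impl SharedForkConfig {
    pub fn new(initial: ForkConfig) -> Self {
        Self(Arc::new(RwLock::new(initial)))
    }

    pub fn current(&self) -> ForkConfig {
        self.0.read().unwrap().clone()
    }

    pub fn set(&self, next: ForkConfig) {
        *self.0.write().unwrap() = next;
    }
}
```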

@diegomrsantos
Member Author

Thanks for the feedback, I'll take a look at it next week. Did you consider this is also adding a fork-aware peer selection and isn't just a refactor? The idea was to have only one source of truth for fork-dependent data.

@shane-moore
Member

@dknopik @diegomrsantos, made a POC of how I interpret option 1 against this PR just to see what it could look like: diegomrsantos/anchor@feat/shared-fork-lifecycle...shane-moore:anchor:option1-simplification

def kills a lot of code

idea is that ForkPhase broadcast is source of truth for fork transitions. so we replace SharedForkLifecycle and have Network::on_fork_phase dispatch directly to the connection manager / discovery / handshake

and we keep the fork aware peer selection logic from this PR via having ConnectionManager track current_fork and in_transition

Not saying what i'm suggesting is necessarily the right call, just showing what it could look like

@diegomrsantos
Member Author

diegomrsantos commented Feb 21, 2026

idea is that ForkPhase broadcast is source of truth for fork transitions. so we replace SharedForkLifecycle and have Network::on_fork_phase dispatch directly to the connection manager / discovery / handshake

They serve two different goals. ForkPhase is for components that need to act on fork events. SharedForkLifecycle is for components that only need the current value.
Less code is good, but not at the expense of clarity. The priority is code that is easy to understand and reason about; cleverness that obscures behavior is a net negative.

One nuance: fewer lines often help clarity, so it’s not less code vs clarity, it’s clarity first.

@shane-moore
Member

agreed with clarity first! i think it's also nice to think about these types of decisions from the perspective of "what's easier for someone new to this part of the codebase to debug?". i would think if you were trying to debug something related to this, you'd just trace through the call paths and find in Network::on_fork_phase where domain type propagates to the three components and go from there. With SharedForkLifecycle, it has the benefit of self-contained state, so components don't need to know who updates them. But it might be more daunting to debug since you're tracing through Arc<RwLock<>> shared state that gets mutated from the monitor task, vs just following fn calls in Network::on_fork_phase

If we find we need this shared state across more components in the future, we can always introduce something like SharedForkLifecycle then since it'd for sure be more justified with more consumers

Just my thoughts!

@diegomrsantos
Member Author

diegomrsantos commented Feb 23, 2026

I’m still struggling to see where the additional complexity is coming from here.

We already need a background task (ForkMonitor) to track fork schedule and broadcast ForkPhase transitions. That’s true regardless of whether we propagate state via events or via a shared handle. In this PR, the monitor simply becomes the single writer of a single value (SharedForkLifecycle), and everyone else reads it when they need the current domain type / lifecycle. This is the same pattern we already had with SharedDomainType, just with a better invariant model (ForkLifecycle makes invalid states unrepresentable). 

What this change buys us is removing duplicated “cache derived value on event” logic from multiple components and avoiding desync between “what phase are we in?” vs “what domain type are we using?” 

If the concern is specifically:
• traceability (following explicit call paths from Network::on_fork_phase), or
• locks (hot-path reads),

can we call that out directly? From what I can see, reads are cheap (and we already hoist outside loops where it matters), and the writer is centralized. 

Happy to simplify further, but I’d like to keep the decision rule explicit:
• single source of truth,
• minimal duplicated derived state,
• easiest to debug “current fork / domain type” at any point in time.

If there’s a concrete code path where reading SharedForkLifecycle is harder to reason about/debug than propagating via Network::on_fork_phase, point me at it and we’ll address it.
