Skip to content

Comments

Aggregator Committee Runner - Pre-Consensus Completeness Improvement#604

Open
MatheusFranco99 wants to merge 12 commits intoboolefrom
agg-comm-pre-consensus-termination
Open

Aggregator Committee Runner - Pre-Consensus Completeness Improvement#604
MatheusFranco99 wants to merge 12 commits intoboolefrom
agg-comm-pre-consensus-termination

Conversation

@MatheusFranco99
Copy link
Contributor

@MatheusFranco99 MatheusFranco99 commented Feb 12, 2026

Overview

This PR improves the liveness vs. completeness trade-off of the Pre-Consensus phase of the Aggregator Committee Runner.

  • Before: as soon as at least one validator duty has received a quorum of pre-consensus partial signatures, all such duties are checked for aggregator or sync committee contribution selection. If there's any aggregator, it starts the consensus phase, else it terminates the duty.
  • Now: as soon as at least one validator duty has received a quorum of pre-consensus partial signatures and it's considered selected, it starts the consensus phase. And, once all messages (from all operators) are received or if all duties have been checked and none were selected, then terminate the duty.

More explicitly, with the new change, if a quorum is reached for some validator duty but it isn't selected (as aggregator or sync committee contribution), the duty wont' stop, and it will wait for more messages that may discover selected validators.

Spec Change

AggregatorCommitteeRunner have two new state variables:

  • preConsensusSeenSigners: that registers the signers of the pre-consensus messages we've seen.
  • preConsensusDutiesCheckedForSelection: that registers the (validator index, root) tuples (a bijection to validator duties) that we've already checked for selection in pre-consensus.

In PreConsensus processing:

  • We mark the signer of the message as seen.
  • After quorums are processed, if no aggregator (or contributor) was selected, we check if we've seen messages from all signers, or if all duties have been checked and none were selected, and if so, we terminate the duty. Else, we keep waiting for more messages.
  • Corner case: if no new quorum is detected by basePreConsensusMsgProcessing, we can't simply return. Instead, we should check if this is the last message we were waiting for. For example, we received messages from operators 1, 2, and 3. No aggregator have been selected. Then, we receive a message from operator 4, but it doesn't trigger any new quorum (only contributing to already checked duties). In this case, we now check that all operators have been seen, and since no aggregator was selected, we terminate the duty.
  • We add the initial check that, if consensus has already started, we ignore pre-consensus messages. This is important not only because it avoids redundant processing, but it also prevents the pre-consensus termination checks that were added from being made.

Tests

TestingBeaconNode is expanded with a new field CommitteeIndexAggregators map[phase0.CommitteeIndex]bool, that allows the test to specify the validators (through their CommitteeIndex`) it expects to be selected as aggregators.

Accordingly, MsgProcessingSpecTest is expanded with two fields BeaconAggregators []phase0.CommitteeIndex and BeaconAggregatorsValues []bool to specify the same values for the beacon node. This is needed as the beacon node interface isn't added to the json file. And it needs two field as map can't as well be used for json files.

MsgProcessingSpecTest also has a new field QBFTProposals [][]byte to specify the expected consensus data in the QBFT proposal (if any). This is needed to check cases in which only some validators are selected, which should reflect in the consensus data of the proposal.

New tests

  • SingleValidatorNotSelected: tests that a single validator quorum that is not selected won't cause the duty to terminate or start consensus.
  • SingleValidatorSelected: tests that a single validator quorum that is selected will cause the consensus to start (one test for aggregator and one for sync committee contribution).
  • TerminatedWhenAllSignersSeen: tests that if all messages from all operators are received but no validator duty is selected as aggregator, then the duty is terminated.
  • TerminatesWhenAllSelectionsChecked: tests that if all validator duties are checked for selection but none is selected, then the duty is terminated.
  • IgnoreIfAlreadyStartedConsensus: tests that if consensus is already started, pre-consensus messages are ignored (and thus don't cause any state change, such termination checks being made).

Side effects

  • The change in runner/preconsensus/post_decided.go happened because the new unexported states (of the aggregator committee runner) were not being passed to the json files, thus causing post state mismatches (in the json test). However, the logic has been kept the same, simply by passing the messages to the Messages field, instead of only adding its impact on the runner as done before.
  • TestAggregatorCommitteeConsensusDataForDuty has been expanded to put correct selection signatures in the consensus data, which wasn't being tested before by the testing value check functions. This is needed to properly test the QBFTProposals field of the new tests, which checks the exact match of the consensus data of the proposal.

@greptile-apps
Copy link
Contributor

greptile-apps bot commented Feb 12, 2026

Greptile Overview

Greptile Summary

This PR enhances the Aggregator Committee Runner's pre-consensus phase to improve the liveness vs. completeness trade-off by waiting for more messages when no aggregator is selected.

Key Changes:

  • Adds two new state variables to track seen signers (preConsensusSeenSigners) and tested selections (preConsensusSelectionsTested)
  • Modifies pre-consensus logic to continue waiting for messages when a quorum is reached but no aggregator is selected, instead of immediately terminating
  • Only terminates duty when either all operators have sent messages OR all possible validator duties have been checked for selection
  • Updates test infrastructure with CommitteeIndexAggregators to allow fine-grained control over which validators are selected as aggregators
  • Adds comprehensive tests covering three new scenarios: single validator not selected, single validator selected, and termination when all signers seen

Critical Issue Found:
The new state variables preConsensusSeenSigners and preConsensusSelectionsTested are not reset in StartNewDuty (line 71), which will cause state to leak across different duties. This could lead to incorrect behavior where the runner thinks it has already seen operators or tested selections from previous duties.

Confidence Score: 2/5

  • This PR has a critical state management bug that will cause incorrect behavior across duties
  • Score reflects the critical bug in ssv/aggregator_committee.go:71 where new state variables are not reset in StartNewDuty. This causes state to leak across duties, potentially making the runner terminate prematurely or proceed incorrectly based on stale data from previous duties. While the logic improvements are sound and tests are comprehensive, this fundamental state management issue must be fixed before merging.
  • Pay close attention to ssv/aggregator_committee.go - the StartNewDuty method must reset the new state variables

Important Files Changed

Filename Overview
ssv/aggregator_committee.go Adds pre-consensus tracking state variables but fails to reset them in StartNewDuty, causing state leak across duties
types/testingutils/beacon_node.go Adds CommitteeIndexAggregators map to allow test control over aggregator selection
ssv/spectest/tests/msg_processing_spectest.go Adds BeaconAggregators and BeaconAggregatorsValues fields to support new test scenarios
ssv/spectest/tests/runner/preconsensus/single_validator_not_selected.go New test verifying non-selected validator quorum doesn't start consensus
ssv/spectest/tests/runner/preconsensus/single_validator_selected.go New test verifying selected validator quorum starts consensus
ssv/spectest/tests/runner/preconsensus/terminates_when_all_signers_seen.go New test verifying duty terminates after all operators send messages with no selected aggregators
ssv/spectest/tests/runner/preconsensus/post_decided.go Updated to handle pre-consensus messages after consensus decision correctly

Sequence Diagram

sequenceDiagram
    participant O1 as Operator 1
    participant O2 as Operator 2
    participant O3 as Operator 3
    participant O4 as Operator 4
    participant Runner as Aggregator Committee Runner
    participant Beacon as Beacon Node
    participant QBFT as QBFT Consensus

    Note over Runner: StartNewDuty()
    Note over Runner: Reset preConsensusSeenSigners<br/>Reset preConsensusSelectionsTested

    O1->>Runner: Pre-consensus message (validator A)
    Runner->>Runner: Mark O1 as seen
    Note over Runner: Process message, check for quorum
    alt No quorum yet
        Runner-->>O1: Wait for more messages
    end

    O2->>Runner: Pre-consensus message (validator A)
    Runner->>Runner: Mark O2 as seen
    Note over Runner: Process message, check for quorum
    
    O3->>Runner: Pre-consensus message (validator A)
    Runner->>Runner: Mark O3 as seen
    Note over Runner: Quorum reached for validator A

    Runner->>Beacon: IsAggregator(validator A)?
    Beacon-->>Runner: Selection result
    Runner->>Runner: Mark (validator A, root) as checked

    alt Validator A selected
        Runner->>QBFT: Start consensus
        Note over Runner,QBFT: Proceed to consensus phase
    else Validator A not selected
        alt All operators seen OR all selections checked
            Note over Runner: Terminate duty (finished=true)
        else More messages expected
            Note over Runner: Wait for more messages
            O4->>Runner: Pre-consensus message (validator B)
            Runner->>Runner: Mark O4 as seen
            Note over Runner: Check if all operators seen now
            alt All operators seen AND no aggregators
                Note over Runner: Terminate duty (finished=true)
            end
        end
    end
Loading

Copy link
Contributor

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

26 files reviewed, 1 comment

Edit Code Review Agent Settings | Greptile

@greptile-apps
Copy link
Contributor

greptile-apps bot commented Feb 12, 2026

Additional Comments (1)

ssv/aggregator_committee.go
State variables preConsensusSeenSigners and preConsensusSelectionsTested are not reset when starting a new duty. This causes state to leak across duties, potentially allowing old signer/selection data to affect new duties.

func (r *AggregatorCommitteeRunner) StartNewDuty(duty types.Duty, quorum uint64) error {
	err := r.BaseRunner.baseStartNewDuty(r, duty, quorum)
	if err != nil {
		return errors.Wrap(err, "failed to start new duty")
	}

	// Initialize submission tracking for both duty types
	r.submittedDuties[types.BNRoleAggregator] = make(map[phase0.ValidatorIndex]map[[32]byte]struct{})
	r.submittedDuties[types.BNRoleSyncCommitteeContribution] = make(map[phase0.ValidatorIndex]map[[32]byte]struct{})

	// Reset pre-consensus tracking state
	r.preConsensusSeenSigners = make(map[types.OperatorID]struct{})
	r.preConsensusSelectionsTested = make(map[phase0.ValidatorIndex]map[[32]byte]struct{})

	return nil
}

Copy link
Contributor

@GalRogozinski GalRogozinski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pre‑consensus signer “seen” tracking happens before validation in processing. This is fine since we have p2p message validation checking signatures.

So again we see a motivation adding this to spec... Just a little observation not for this PR.

// If already started consensus, ignore pre-consensus messages.
// This is important because it avoids redundant processing and prevents the pre-consensus termination checks from being made.
if r.HasStartedConsensus() {
return types.NewError(types.AggCommPreConsensusIgnoredSinceAlreadyStartedConsensusErrorCode, "ignoring pre-consensus message since consensus already started")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmmm...
I just wonder if it should be an error... because this isn't really an error (we are expecting N messages in the happy flow)

I think you can test w.o the error by just seeing that the last preconf message doesn't add validators

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, originally I didn't add it, but didn't know how to test.
I see, I didn't find a way to test it but I'll look at it

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After looking I think we can still just show the output of a single QBFT message. There is also the state introspection in the generated state that should be useful

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we can still just show the output of a single QBFT message

how so?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we will show the expected CD doesn't change

// If all duties have been tested for selection or all messages (from all operators) have been seen, terminate.
if r.HaveCheckedAllDutiesForSelection(aggregatorMap, contributionMap) || r.HasSeenAllPreConsensusSigners() {
r.BaseRunner.State.Finished = true
return nil
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it can happen on that you see all preconsensus signers and there is an error for processing selectionProof.
i.e. if the beacon node is bad

Suggested change
return nil
return anyErr

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. I think this motivates us to bring the selection check to spec (which is currently in the implementation but doesn't need the beacon ndoe).
Also, about bad communication, maybe the impl should have retries and for spec this matter less? Wdyt?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm dunno if we need to specify what is already specified in ETH spec
even though I think currently we have some things that are duplicate specifications but we don't always do it.

In spec we don't return error for selection, but for getting the aggregate attestations (that impl does refer to the beacon node for it).

It is just about propagating the error correctly upwards. It doesn't mean we retry.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In spec we don't return error for selection, but for getting the aggregate attestations (that impl does refer to the beacon node for it).

In implementation right? As spec doesn't have this detail

Comment on lines +81 to +102
"Msgs": {
"1": [
{
"SignedMessage": {
"Signatures": [
"Dh866O3mNb1dj0DSNzv6IpSnnjhgmmpUCDHTpsrtxVEUGs4sgCWyPUdcp/g27EXoyfPwPomQyoCY5M3AwjL0rJPOuMutNFqcqmOy86f7We/Zdg0U3brmnJ2zJAZt4CG9sro/GXzzMzdtDJH0dvhm4Z4vtv87k+485vM2R5CqlDt3j8bfsWRa5EcDiOedxL4V97WeBR/cLV7buIcwBxnYuek9UsS7qDab3wAYkGlDHmXFUPNkC9NcGoiGJklBUF+7Ljsx3It9UycBKw/rEEHw5kQ7lLCwZEqyAs3LN51L1Ao6ZR8LKDchnTCdqaa5xpTeA+rCXeKX2WBz9738PAHImQ=="
],
"OperatorIDs": [
1
],
"SSVMessage": {
"MsgType": 0,
"MsgID": "000003010600000000000000000000000000000000000000cf97adeedb59e05bfd73a2b4c2a8885708c4f4f70c84c64b27120e72ab733b72",
"Data": "AAAAAAAAAAAMSHEAAAAAAAEAAAAAAAAATAAAAKV7Rbu15E9BsS9PCaIudvca/pUK8NZtWKS73NHxnlw4AAAAAAAAAACEAAAAhAAAAAAAAwEGAAAAAAAAAAAAAAAAAAAAAAAAAM+Xre7bWeBb/XOitMKoiFcIxPT3DITGSycSDnKrczty"
},
"FullData": "BgAAAAAAAAAcAAAAjAAAAJQAAACVAQAAlQEAAAEAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAwAAAAAAAAADAAAAAAAAAAQAAADsAAAADEhxAAAAAAAAAAAAAAAAAAECAwQFBgcICQoBAgMEBQYHCAkKAQIDBAUGBwgJCgECAAAAAAAAAAABAgMEBQYHCAkKAQIDBAUGBwgJCgECAwQFBgcICQoBAgEAAAAAAAAAAQIDBAUGBwgJCgECAwQFBgcICQoBAgMEBQYHCAkKAQIAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAB"
},
"QBFTMessage": {
"DataRound": 0,
"Height": 7424012,
"Identifier": "AAADAQYAAAAAAAAAAAAAAAAAAAAAAAAAz5et7ttZ4Fv9c6K0wqiIVwjE9PcMhMZLJxIOcqtzO3I=",
"MsgType": 0,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not sure why this changed

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had to change the post decided test. Pls, checkout the description above on the "Side effects" section

}

multiSpecTest.Tests = append(multiSpecTest.Tests, &tests.MsgProcessingSpecTest{
Name: fmt.Sprintf("single validator quorum not selected doesn't start consensus (%s)", version.String()),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

go: downloading github.com/ssvlabs/ssv-spec v1.2.3-0.20260216165555-9749b2d49c63
go: github.com/ssvlabs/ssv-spec@agg-comm-pre-consensus-termination-fix-repo-size: create zip: ssv/spectest/generate/state_comparison/tests_MultiMsgProcessingSpecTest/pre_consensus_single_validator_not_selected/single validator quorum not selected doesn't start consensus (electra).json: malformed file path "ssv/spectest/generate/state_comparison/tests_MultiMsgProcessingSpecTest/pre_consensus_single_validator_not_selected/single validator quorum not selected doesn't start consensus (electra).json": invalid char '\''
ssv/spectest/generate/state_comparison/tests_MultiMsgProcessingSpecTest/pre_consensus_single_validator_not_selected/single validator quorum not selected doesn't start consensus (phase0).json: malformed file path "ssv/spectest/generate/state_comparison/tests_MultiMsgProcessingSpecTest/pre_consensus_single_validator_not_selected/single validator quorum not selected doesn't start consensus (phase0).json": invalid char '\''
Suggested change
Name: fmt.Sprintf("single validator quorum not selected doesn't start consensus (%s)", version.String()),
Name: fmt.Sprintf("single validator quorum not selected does not start consensus (%s)", version.String()),

Copy link

@shane-moore shane-moore left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hi, reviewed the PR from the lens of anchor, lgtm! on our end, we collect selection proofs independently per validator, so we're unaffected by the cross-validator early termination bug this fixes. def agree it's a good idea to postpone ending the pre-consensus phase in your agg committee runner until after you've received messages from all operators in a committee or after you've received partial sig messages back for all duties in the operator's view

also, thanks for tagging us to review the spec PR! small ask to tag me in lieu of @dknopik in future spec PR's 🙏

@nkryuchkov nkryuchkov requested review from iurii-ssv and shane-moore and removed request for shane-moore February 20, 2026 13:18
Comment on lines 1204 to 1216
// HasStartedConsensus checks if consensus has already started by verifying if we have a QBFT instance for the duty's slot.
func (r *AggregatorCommitteeRunner) HasStartedConsensus() bool {
if r.BaseRunner.QBFTController == nil {
return false
}
if r.BaseRunner.State == nil {
return false
}
if r.BaseRunner.State.StartingDuty == nil {
return false
}
return r.BaseRunner.QBFTController.InstanceForHeight(qbft.Height(r.BaseRunner.State.StartingDuty.DutySlot())) != nil
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To simplify, the implementation of this func can probably be just this instead ?

// HasStartedConsensus checks if consensus has already started (or maybe even finished) for the current duty.
func (r *AggregatorCommitteeRunner) HasStartedConsensus() bool {
	if r.BaseRunner.State == nil {
		return false
	}
	return r.BaseRunner.State.RunningInstance != nil || r.BaseRunner.State.DecidedValue != nil
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree, thanks!

Comment on lines 107 to 113
if !hasNewQuorum {
// If didn't get any new quorum, didn't yet start QBFT, and has received the last message, then terminate.
if r.HasSeenAllPreConsensusSigners() {
if !r.HasStartedConsensus() {
r.BaseRunner.State.Finished = true
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

!r.HasStartedConsensus() at this point is always true, the early return at the top of ProcessPreConsensus already exits when consensus has started, and basePreConsensusMsgProcessing doesn't start consensus.

Consider removing the inner check to avoid implying there's a path where consensus could have started between the two checks.

if !hasNewQuorum {
    if r.HasSeenAllPreConsensusSigners() {
        r.BaseRunner.State.Finished = true
    }
    return nil
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree, thanks!

// compareConsensusDataSample compares two consensus data samples.
// It tries decoding into ProposerConsensusData, AggregatorCommitteeConsensusData, or BeaconVote
// and compare each type accordingly.
func compareConsensusDataSample(t *testing.T, expectedData []byte, actualData []byte) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider adding a require.Fail at the end of compareConsensusDataSample if none of the decode attempts succeed, otherwise the test silently passes without comparing anything.

require.Fail(t, "could not decode consensus data as any known type")

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree, thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants