
blockchain, netsync: implement a complete headers-first download during ibd#2428

Open
kcalvinalvin wants to merge 17 commits into btcsuite:master from kcalvinalvin:2025-09-16-new-parallel-block-downloads

Conversation

@kcalvinalvin
Collaborator

@kcalvinalvin kcalvinalvin commented Sep 22, 2025

Change Description

Right now the headers-first download is based solely on checkpoints and is thus limited to the last checkpoint.
The newly implemented headers-first download will always download headers first and will validate that the headers connect and have proper proof of work.

The block download is then based on the verified headers. This eliminates any potential downloading of transactions or orphan blocks during IBD. It also sets up future parallel block download, since blocks can only be downloaded in parallel once we already have their headers.

This isn't wired into the code yet, but it also allows the node to receive block headers instead of invs during block propagation. I'll do this in a follow-up.

Steps to Test

  • Perform IBD on any of the networks and verify that the node syncs up fine.
  • Check that the newly added function tests are correct.

Pull Request Checklist

Testing

  • Your PR passes all CI checks.
  • Tests covering the positive and negative (error paths) are included.
  • Bug fixes contain tests triggering the bug to prevent regressions.

Code Style and Documentation

📝 Please see our Contribution Guidelines for further guidance.

We add a chainview of bestHeaders so that we'll be able to keep track of
headers separately from the bestChain. This is needed as we're getting
headers for new block announcements instead of invs.
Since we may now have blockNodes with just the block header stored
without the data, we add a new status to account for this.
@coveralls

coveralls commented Sep 22, 2025

Pull Request Test Coverage Report for Build 19373443826

Details

  • 139 of 254 (54.72%) changed or added relevant lines in 6 files are covered.
  • 65 unchanged lines in 6 files lost coverage.
  • Overall coverage increased (+0.5%) to 55.433%

Changes Missing Coverage:

  File                         Covered Lines  Changed/Added Lines  %
  blockchain/chainio.go        1              3                    33.33%
  blockchain/chain.go          11             22                   50.0%
  netsync/manager.go           40             142                  28.17%

Files with Coverage Reduction:

  File                         New Missed Lines  %
  mempool/mempool.go           1                 66.56%
  btcutil/gcs/gcs.go           3                 81.25%
  database/ffldb/blockio.go    4                 88.81%
  rpcclient/infrastructure.go  4                 47.8%
  netsync/manager.go           18                7.05%
  blockchain/chainio.go        35                69.75%
Totals Coverage Status
Change from base Build 19121431262: 0.5%
Covered Lines: 31425
Relevant Lines: 56690

💛 - Coveralls

@kcalvinalvin kcalvinalvin force-pushed the 2025-09-16-new-parallel-block-downloads branch 3 times, most recently from 2308a1e to 3519720 Compare September 23, 2025 09:43
maybeAcceptHeader performs checks to accept block headers into the
header chain.  This function allows for a true headers-first download
where we only accept block headers for downloading new blocks.
@kcalvinalvin kcalvinalvin force-pushed the 2025-09-16-new-parallel-block-downloads branch 5 times, most recently from 242fa3b to e6fadc9 Compare September 25, 2025 07:25
@kcalvinalvin kcalvinalvin marked this pull request as ready for review September 25, 2025 13:17
@saubyk
Collaborator

saubyk commented Oct 21, 2025

cc: @gijswijs for review

@saubyk
Collaborator

saubyk commented Oct 21, 2025

cc: @mohamedawnallah for review

Member

@Roasbeef Roasbeef left a comment


Great set of changes! I found the commit structure very easy to follow as well.

Have performed an initial review pass, but will start to run some active tests on one of my nodes to get a better feel for the changes.

}

// HaveHeader returns whether the header data is stored in the database.
func (status blockStatus) HaveHeader() bool {
Member


Isn't this just the inverse of HaveData?

Collaborator Author


Then you can't communicate not having headers for a given block hash, so I think having this is better.

You can use !HaveData() to check that you don't have the block data, but then what do you do when you want to check whether you have a block header at all?
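The distinction being argued for here can be sketched with a small bit-flag status type. The flag names follow the ones discussed in this thread, but the values and layout are illustrative, not btcd's actual definitions:

```go
package main

import "fmt"

// blockStatus is a bit field tracking what we have stored for a block.
type blockStatus byte

const (
	statusHeaderStored blockStatus = 1 << 0 // header validated and kept in the index
	statusDataStored   blockStatus = 1 << 1 // full block data stored on disk
)

// HaveHeader reports whether the block's header has been stored.
func (s blockStatus) HaveHeader() bool { return s&statusHeaderStored != 0 }

// HaveData reports whether the full block data has been stored.
func (s blockStatus) HaveData() bool { return s&statusDataStored != 0 }

func main() {
	headerOnly := statusHeaderStored
	full := statusHeaderStored | statusDataStored

	// For a header-only node, HaveHeader is true while HaveData is false.
	// !HaveData() alone could not distinguish "header only" from
	// "nothing known at all", which is the point made above.
	fmt.Println(headerOnly.HaveHeader(), headerOnly.HaveData()) // true false
	fmt.Println(full.HaveHeader(), full.HaveData())             // true true
}
```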

// If we're in the initial block download mode, check if we have
// peers that we can download headers from.
_, bestHeaderHeight := sm.chain.BestHeader()
higherHeaderPeers := sm.fetchHigherPeers(bestHeaderHeight)
Member


Even though many years have passed at this point, we should retain the logic that we removed to filter out non-segwit peers once segwit is active.

Collaborator Author


In sm.fetchHigherPeers(), we filter out peers that aren't sync candidates. Peers can only be sync candidates if they're segwit-activated.

This check is done in isSyncCandidate()

ProcessBlockHeader performs chain selection and context-free &
contextual validation for the given block header.  The function allows
headers-first downloading of blocks even without checkpoints.
On flushes to the database, we check that blockNodes for downloaded
block headers are not flushed to disk unless the block data is stored
as well, for backwards compatibility.

Older btcd clients rely on the presence of a blockNode to check
whether the block data is also present.  Since we now store blockNodes
for just the block headers, this is no longer true.

Because of this, we don't flush blockNodes that have no accompanying
block data.  This results in downloading and verifying the headers
again if the node restarts, but since the header data is small and
the verification is quick, it's not a big downside.
In block processing and block downloading, HaveBlock is used to check
if the block data already exists.  It used to be enough to check for
the existence of the blockNode, but now we also need to check that the
data exists, as the blockNode may be present for just the block header.
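A minimal sketch of the updated check this commit describes, with simplified stand-ins for the index and hash types (not btcd's actual code):

```go
package main

import "fmt"

// hash32 stands in for chainhash.Hash.
type hash32 [32]byte

type blockStatus byte

const (
	statusHeaderStored blockStatus = 1 << 0
	statusDataStored   blockStatus = 1 << 1
)

func (s blockStatus) HaveData() bool { return s&statusDataStored != 0 }

type blockNode struct{ status blockStatus }

type blockIndex struct{ index map[hash32]*blockNode }

// haveBlock: a node in the index is no longer enough, since it may
// carry only the header; the data flag must be checked too.
func (bi *blockIndex) haveBlock(h hash32) bool {
	node, ok := bi.index[h]
	return ok && node.status.HaveData()
}

func main() {
	var a, b hash32
	a[0], b[0] = 1, 2
	bi := &blockIndex{index: map[hash32]*blockNode{
		a: {status: statusHeaderStored},                    // header-only node
		b: {status: statusHeaderStored | statusDataStored}, // full block
	}}
	fmt.Println(bi.haveBlock(a), bi.haveBlock(b)) // false true
}
```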
The added exported methods on BlockChain provide access to the block
header tip, such as fetching block hashes and heights from the headers
chain.
checkHeadersList takes in a block hash and returns whether it's a
checkpointed block, along with the correct behavior flags for verifying
the block.
fetchHigherPeers provides a convenient function to get peers that are
sync candidates and have a higher advertised height than the passed-in
height.
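A sketch of what such a helper might look like; the peer type and its fields are illustrative (the real code filters via isSyncCandidate and the peer's advertised last block):

```go
package main

import "fmt"

// peer is a simplified stand-in for peerpkg.Peer.
type peer struct {
	lastBlock     int32
	syncCandidate bool // e.g. segwit-activated, per isSyncCandidate
}

// fetchHigherPeers keeps only sync candidates whose advertised height
// exceeds ours, as the commit message describes.
func fetchHigherPeers(peers []*peer, height int32) []*peer {
	var higher []*peer
	for _, p := range peers {
		if p.syncCandidate && p.lastBlock > height {
			higher = append(higher, p)
		}
	}
	return higher
}

func main() {
	peers := []*peer{
		{lastBlock: 900000, syncCandidate: true},  // ahead of us
		{lastBlock: 100, syncCandidate: true},     // behind us
		{lastBlock: 900000, syncCandidate: false}, // not a sync candidate
	}
	fmt.Println(len(fetchHigherPeers(peers, 800000))) // 1
}
```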
isInIBDMode returns whether the SyncManager needs to download blocks
and sync to the latest chain tip.

It determines this by checking whether the blockchain thinks we're
current and whether there are peers at higher advertised heights.
fetchHeaders picks a random peer at a higher advertised block and
requests headers from them.
Instead of the old headerList-based header processing, we make use of
the new ProcessBlockHeader function.

Since we now utilize ProcessBlockHeader, we change fetchHeaderBlocks
to be based on the block index instead of the headerList in
SyncManager.
Since we now utilize ProcessBlockHeaders, we change the startSync
function to utilize the block index for downloading blocks/headers
instead of using the headerList in SyncManager.
handleBlockMsg is updated to no longer be based on the headerList and
sm.nextCheckpoint when doing operations related to checkpoints and
headers.
Since we no longer utilize the headerList for doing headers-first
download and checkpoint tracking, we remove related code.
ibdMode is a more fitting name than headersFirstMode since all blocks
are downloaded headers-first during the initial block download.
@kcalvinalvin kcalvinalvin force-pushed the 2025-09-16-new-parallel-block-downloads branch from e6fadc9 to 2f4c749 Compare November 14, 2025 18:07
@kcalvinalvin
Collaborator Author

Addressed all the review comments by roasbeef

@saubyk saubyk requested a review from Roasbeef November 14, 2025 18:39
@Roasbeef
Member

Tested in the wild, and does what it says on the tin:

Details:

```
2025-12-17 18:32:10.885 [INF] SYNC: Downloading headers for blocks 136001 to 2401568 from peer PEER_A:18333
2025-12-17 18:32:10.885 [INF] SYNC: Lost peer PEER_B:18333 (outbound)
2025-12-17 18:32:16.309 [INF] SYNC: New valid peer PEER_C:18333 (outbound) (/Satoshi:27.1.0/)
2025-12-17 18:32:40.885 [INF] SYNC: Downloading headers for blocks 136001 to 4807558 from peer PEER_D:18333
2025-12-17 18:32:40.885 [INF] SYNC: Lost peer PEER_A:18333 (outbound)
2025-12-17 18:32:41.044 [INF] SYNC: New valid peer PEER_E:18333 (outbound) (/Satoshi:28.1.0/)
2025-12-17 18:32:41.135 [INF] SYNC: New valid peer PEER_F:18333 (outbound) (/Satoshi:27.0.0/)
2025-12-17 18:32:46.438 [INF] CHAN: Verified checkpoint at height 200000/block 0000000000287bffd321963ef05feab753ebe274e1d78b2fd4e2bfe9ad3aa6f2
2025-12-17 18:32:51.245 [INF] SYNC: New valid peer PEER_G:18333 (outbound) (/Satoshi:28.1.0/)
2025-12-17 18:32:55.171 [INF] CHAN: Verified checkpoint at height 300001/block 0000000000004829474748f3d1bc8fcf893c88be255e6d7f571c548aff57abf4
2025-12-17 18:33:03.103 [INF] CHAN: Verified checkpoint at height 400002/block 0000000005e2c73b8ecb82ae2dbc2e8274614ebad7172b53528aba7501f5a089
2025-12-17 18:33:10.885 [INF] SYNC: Downloading headers for blocks 490001 to 4807558 from peer PEER_G:18333
2025-12-17 18:33:10.885 [INF] SYNC: Lost peer PEER_D:18333 (outbound)
2025-12-17 18:33:13.135 [INF] CHAN: Verified checkpoint at height 500011/block 00000000000929f63977fbac92ff570a9bd9e7715401ee96f2848f7b07750b02
2025-12-17 18:33:25.990 [INF] CHAN: Verified checkpoint at height 600002/block 000000000001f471389afd6ee94dcace5ccc44adc18e8bff402443f034b07240
2025-12-17 18:33:38.331 [INF] CHAN: Verified checkpoint at height 700000/block 000000000000406178b12a4dea3b27e13b3c4fe4510994fd667d7c1e6a3f4dc1
2025-12-17 18:33:40.884 [INF] SYNC: Downloading headers for blocks 718001 to 4807558 from peer PEER_E:18333
2025-12-17 18:33:40.888 [INF] SYNC: Lost peer PEER_G:18333 (outbound)
2025-12-17 18:33:46.254 [INF] SYNC: New valid peer PEER_H:18333 (outbound) (/Satoshi:25.1.0/)
2025-12-17 18:33:50.353 [INF] CHAN: Verified checkpoint at height 800010/block 000000000017ed35296433190b6829db01e657d80631d43f5983fa403bfdb4c1
2025-12-17 18:34:01.433 [INF] CHAN: Verified checkpoint at height 900000/block 0000000000356f8d8924556e765b7a94aaebc6b5c8685dcfa2b1ee8b41acd89b
2025-12-17 18:34:10.884 [INF] SYNC: Downloading headers for blocks 986001 to 4807558 from peer PEER_E:18333
2025-12-17 18:34:10.884 [INF] SYNC: Lost peer PEER_E:18333 (outbound)
2025-12-17 18:34:10.884 [INF] SYNC: Downloading headers for blocks 986001 to 4807558 from peer PEER_H:18333
2025-12-17 18:34:11.269 [INF] SYNC: New valid peer PEER_I:18333 (outbound) (/Satoshi:25.1.0/)
2025-12-17 18:34:11.270 [INF] SYNC: New valid peer PEER_J:18333 (outbound) (/Satoshi:0.20.1/)
2025-12-17 18:34:13.213 [INF] CHAN: Verified checkpoint at height 1000007/block 00000000001ccb893d8a1f25b70ad173ce955e5f50124261bbbc50379a612ddf
2025-12-17 18:34:16.312 [INF] SYNC: New valid peer PEER_K:18333 (outbound) (/Satoshi:27.0.0/)
2025-12-17 18:34:25.081 [INF] CHAN: Verified checkpoint at height 1100007/block 00000000000abc7b2cd18768ab3dee20857326a818d1946ed6796f42d66dd1e8
```

One thing I noticed though is that peers may disconnect us along the way, so we rotate through a few peers to fetch all the headers. This is testnet3, so there are quite a lot of headers (over 4 million). Are we hitting something like bitcoind's default per-peer upload limit?

IIUC this is a stepping stone for multi-peer header download, so perhaps that'll be sorted out as we'll fetch blocks of headers from many peers in parallel.

Member

@Roasbeef Roasbeef left a comment


LGTM 🗑️


@Abdulkbk Abdulkbk left a comment


Nice structure 🎉. Left a few comments. I will also do a manual test and have another deeper look.

Edit: successfully synced to testnet3 with ~5M blocks (despite facing a few challenges leading to restarts, unrelated to this change).


return headers
}

func TestProcessBlockHeader(t *testing.T) {


nit: godoc

// If we don't already know this is a valid header, return false and
// BFNone.
if !sm.chain.IsValidHeader(blockHash) {
return false, blockchain.BFNone


Do we want to be explicit? Otherwise we can just return the declared variables.

filename := filepath.Join("testdata/", testFile)

file, err := os.Open(filename)
if err != nil {


nit: use require.NoError

for scanner.Scan() {
line := scanner.Text()
b, err := hex.DecodeString(line)
if err != nil {


nit: use the require package, and other places where t.Fatal was used.

t.Fatal(err)
}

if !isMainChain {


yeah we can use require.True

header.BlockHash())
}
}
if err != nil {


I think this check is redundant.

require.Equal(t, test.behaviorFlags, gotFlags)
}
}


missing godoc

return 0
}

func TestIsInIBDMode(t *testing.T) {


godoc

// Orphan headers are not allowed and this function should never be called
// with the genesis block.
prevHash := &header.PrevBlock
prevNode := b.index.LookupNode(prevHash)


Seems like a duplicate prevNode lookup; we just did it above in the if node != nil block. Maybe we should consolidate it?

_, height := sm.chain.BestHeader()
higherPeers := sm.fetchHigherPeers(height)
var bestPeer *peerpkg.Peer
switch {


Do we expect other cases in the future? Otherwise we can just use if.

@saubyk saubyk requested a review from yyforyongyu January 27, 2026 17:35
Collaborator

@yyforyongyu yyforyongyu left a comment


Cool stuff - left a few comments that I think we need to fix before merging.

if node == nil {
node = newBlockNode(header, prevNode)
node.status = statusHeaderStored
b.index.AddNode(node)
Collaborator


The if node == nil guard here correctly avoids overwriting an existing node. But maybeAcceptBlock at line 67 has no such guard — it unconditionally creates a fresh *blockNode and calls AddNode, overwriting the index entry. If this header-only node already exists in bestHeader's chainView, the old pointer becomes orphaned while the chainView still holds it.

After that, bestHeader.Contains(b.index.LookupNode(&hash)) returns false because LookupNode returns the new pointer while bestHeader still has the old one. This breaks IsValidHeader, HeaderHashByHeight, and checkHeadersList in netsync.

Could maybeAcceptBlock check if a node already exists and upgrade its status to statusDataStored rather than creating a new one? Similar to the if node == nil pattern here.
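The upgrade-in-place pattern suggested here can be sketched as follows. The types and function names are simplified stand-ins for btcd's block index; the point is only that a pointer held elsewhere (e.g. by a chainView) survives the upgrade:

```go
package main

import "fmt"

type blockStatus byte

const (
	statusHeaderStored blockStatus = 1 << 0
	statusDataStored   blockStatus = 1 << 1
)

type hash32 [32]byte

type blockNode struct{ status blockStatus }

type blockIndex struct{ index map[hash32]*blockNode }

// addHeader models the header path: a header-only node enters the index.
func (bi *blockIndex) addHeader(h hash32) *blockNode {
	node := &blockNode{status: statusHeaderStored}
	bi.index[h] = node
	return node
}

// acceptBlock models the suggested fix: reuse an existing node and
// upgrade its status instead of overwriting the map entry with a fresh
// *blockNode, which would orphan pointers held by a chainView.
func (bi *blockIndex) acceptBlock(h hash32) *blockNode {
	node, ok := bi.index[h]
	if !ok {
		node = &blockNode{}
		bi.index[h] = node
	}
	node.status |= statusHeaderStored | statusDataStored
	return node
}

func main() {
	var h hash32
	bi := &blockIndex{index: map[hash32]*blockNode{}}

	held := bi.addHeader(h) // a chainView-style holder keeps this pointer
	got := bi.acceptBlock(h)

	fmt.Println(held == got)                       // true: same node, upgraded in place
	fmt.Println(held.status&statusDataStored != 0) // true: the held pointer sees the new status
}
```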

str := fmt.Sprintf(
"previous block %s is known to be invalid", prevHash)
return false, ruleError(ErrInvalidAncestorBlock, str)
}
Collaborator


Small gap here -- when the header is already known (node != nil), the code checks prevNode.KnownInvalid() but not node.status.KnownInvalid() itself. The godoc at line 109-111 promises we return an error for known-invalid headers, so we should probably add that check before the bestHeader.Contains early return.

node, hasBlock := bi.index[*hash]
bi.RUnlock()
return hasBlock
return hasBlock && node.status.HaveData()
Collaborator


Small concurrency note -- HaveBlock now reads node.status after releasing RLock at line 395. The old code only returned a captured bool. Since SetStatusFlags/UnsetStatusFlags mutate status under the write lock, this is technically a data race under Go's memory model, even though byte reads are usually atomic in practice. Moving the HaveData() check inside the lock would keep it clean.
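A sketch of the suggested fix, moving the status read inside the read lock; the types here are simplified stand-ins for btcd's blockIndex:

```go
package main

import (
	"fmt"
	"sync"
)

type blockStatus byte

const statusDataStored blockStatus = 1 << 1

func (s blockStatus) HaveData() bool { return s&statusDataStored != 0 }

type hash32 [32]byte

type blockNode struct{ status blockStatus }

type blockIndex struct {
	sync.RWMutex
	index map[hash32]*blockNode
}

// HaveBlock reads node.status while still holding the read lock, so the
// status byte is never observed concurrently with a writer mutating it.
func (bi *blockIndex) HaveBlock(h hash32) bool {
	bi.RLock()
	defer bi.RUnlock()
	node, ok := bi.index[h]
	return ok && node.status.HaveData()
}

// SetStatusFlags mutates the status under the write lock.
func (bi *blockIndex) SetStatusFlags(node *blockNode, flags blockStatus) {
	bi.Lock()
	node.status |= flags
	bi.Unlock()
}

func main() {
	var h hash32
	n := &blockNode{}
	bi := &blockIndex{index: map[hash32]*blockNode{h: n}}
	fmt.Println(bi.HaveBlock(h)) // false: header-only so far
	bi.SetStatusFlags(n, statusDataStored)
	fmt.Println(bi.HaveBlock(h)) // true
}
```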

// up a btcd node with an older version, it would result in
// an unrecoverable error as older versions would consider a
// blockNode being present as having the block data as well.
if node.status.HaveHeader() &&
Collaborator


flushToDB opens a DB write transaction whenever dirty is non-empty, but during header processing every dirty node is header-only and gets skipped here. That means each ProcessBlockHeader call ends up doing a write txn that writes nothing. During long header sync that's a lot of no-op transactions. Maybe we could skip the flush entirely in maybeAcceptBlockHeader, or have flushToDB bail early if all dirty nodes would be skipped?
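One way the early bail-out could look, sketched with simplified types; needsFlush is a hypothetical helper, not an existing btcd function:

```go
package main

import "fmt"

type blockStatus byte

const (
	statusHeaderStored blockStatus = 1 << 0
	statusDataStored   blockStatus = 1 << 1
)

func (s blockStatus) HaveData() bool { return s&statusDataStored != 0 }

type blockNode struct{ status blockStatus }

// needsFlush only reports true when at least one dirty node would
// actually be written, i.e. it carries block data. Header-only nodes
// are skipped on flush (for compatibility with older btcd versions),
// so a purely header-only dirty set means the write txn would be a no-op.
func needsFlush(dirty map[*blockNode]struct{}) bool {
	for node := range dirty {
		if node.status.HaveData() {
			return true
		}
	}
	return false
}

func main() {
	headerOnly := &blockNode{status: statusHeaderStored}
	full := &blockNode{status: statusHeaderStored | statusDataStored}

	fmt.Println(needsFlush(map[*blockNode]struct{}{headerOnly: {}}))           // false: skip the no-op txn
	fmt.Println(needsFlush(map[*blockNode]struct{}{headerOnly: {}, full: {}})) // true
}
```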

}

// Set the lastNode in the blockindex as the bestheader.
b.bestHeader.SetTip(lastNode)
Collaborator


initChainState sets bestHeader to lastNode here, which is the last entry in the block index bucket's key order (height+hash). If side-chain blocks exist at a height above the best chain tip, this picks the wrong branch. A few lines below, the best block chain tip is resolved from state.hash at line 1270 — should bestHeader be set to the same tip node? Since header-only nodes aren't persisted, after restart the header and block views should be identical.

peer.Disconnect()
return
}

Collaborator


The old headersFirstMode guard got removed but there's no replacement check for unsolicited headers. In non-IBD with syncPeer == nil, a peer sending valid unsolicited headers can drive handleHeadersMsg into fetchHeaderBlocks (line 971), which dereferences sm.syncPeer at lines 899/907/920 — that's a nil pointer panic. Might need either restoring a request/response guard or adding a nil check on syncPeer before entering fetchHeaderBlocks.

sm.headerList.Remove(sm.headerList.Front())
log.Infof("Received %v block headers: Fetching blocks",
sm.headerList.Len())
sm.progressLogger.SetLastLogTime(time.Now())
Collaborator


fetchHeaders at line 294 sends getheaders but never refreshes lastProgressTime, and handleHeadersMsg doesn't update it on successful header processing either. With ~870k mainnet headers and a 3-minute stall timeout, the stall handler at line 470 will fire mid-header-download and disconnect a healthy sync peer. This would hit almost immediately on mainnet IBD. We probably need a lastProgressTime refresh in the header processing path, similar to how block processing refreshes it.

return
bestHash, bestHeight := sm.chain.BestHeader()
if sm.ibdMode {
if bestHeight < sm.syncPeer.LastBlock() {
Collaborator


sm.syncPeer.LastBlock() here has no nil guard. syncPeer can already be nil when we enter handleHeadersMsg (e.g., if a peer disconnect was processed first), so this is a reachable panic. A quick nil check before this call would make it safe.
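The guard might look like this; the types and method names are simplified stand-ins for the netsync manager and peer package:

```go
package main

import "fmt"

// peer stands in for peerpkg.Peer.
type peer struct{ lastBlock int32 }

func (p *peer) LastBlock() int32 { return p.lastBlock }

type syncManager struct {
	ibdMode  bool
	syncPeer *peer
}

// shouldFetchMore bails out before dereferencing syncPeer, since a
// disconnect may have cleared it before the headers message was
// processed.
func (sm *syncManager) shouldFetchMore(bestHeight int32) bool {
	if !sm.ibdMode || sm.syncPeer == nil {
		return false
	}
	return bestHeight < sm.syncPeer.LastBlock()
}

func main() {
	sm := &syncManager{ibdMode: true, syncPeer: nil}
	fmt.Println(sm.shouldFetchMore(100)) // false, and no nil-pointer panic

	sm.syncPeer = &peer{lastBlock: 800000}
	fmt.Println(sm.shouldFetchMore(100)) // true: still behind the sync peer
}
```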

return sm, tearDown
}

func TestCheckHeadersList(t *testing.T) {
Collaborator


The new tests cover the helper predicates well, but the risky control flow in handleHeadersMsg, fetchHeaderBlocks, startSync, and stall handling isn't exercised. Those are where the trickiest behavior changes live — would be great to have at least a basic end-to-end test that drives headers through to block download, so regressions in the sync state machine get caught.

// Check that the tip is correct.
lastHeader := headers[len(headers)-1]
lastHeaderHash := lastHeader.BlockHash()
tipNode := chain.bestHeader.Tip()
Collaborator


Small thing — tipNode is captured once here and reused in assertions at lines 134 and 179. If bestHeader moved unexpectedly, these checks would still pass against the stale pointer. Re-fetching chain.bestHeader.Tip() for each assertion would make the "tip didn't move" checks more robust.

