Description
The milestone synchronization mechanism introduces structural latency at multiple points in the sync pipeline. This latency is inherent to the current design and affects both the syncToTip phase and the event loop at tip, regardless of whether the node eventually catches up.
Root causes
1. Milestone scraper pollDelay = 1s (vs 200ms for spans)
In polygon/heimdall/service.go:89, the milestone scraper polls Heimdall every 1 second:
```go
milestoneScraper := NewScraper(
	"milestones",
	store.Milestones(),
	milestoneFetcher,
	1*time.Second, // ← 5x slower than spans (200ms)
	...
)
```

During syncToTip, each cycle calls `SynchronizeMilestones()`, which blocks on `syncEvent.Wait()` until the scraper completes a full poll cycle. This adds up to ~1s of dead time per syncToTip iteration, contributing to the 32-58s inter-cycle gap observed in production.

The span scraper uses `200*time.Millisecond` (service.go:98). There is no reason for milestones to be 5x slower.
2. futureMilestoneDelay = 1s re-queue polling loop
In polygon/sync/sync.go:289-308, when a milestone arrives ahead of the current tip (which is the common case since Heimdall publishes milestones before the node has executed the blocks):
```go
if milestone.EndBlock().Uint64() > ccb.Tip().Number.Uint64() {
	// finality is already tracked here (line 293) ✓
	go func() {
		time.Sleep(futureMilestoneDelay) // 1s
		s.tipEvents.events.PushEvent(...) // re-queue
	}()
	return nil
}
```

This spawns a goroutine that sleeps 1s and re-pushes the event, repeating until the tip catches up. For a milestone 3 blocks (6s) ahead of the tip, this creates ~6 polling goroutines. The finality tracking (`lastFinalizedBlockNum`) is already done at line 293 before the re-queue; the re-queue only serves CCB pruning and milestone verification, which the next on-time milestone will handle anyway (~32s later).
3. WaitUntilHeimdallIsSynced + SynchronizeSpans on every block event
In polygon/sync/sync.go:364-376, every single block event in the event loop triggers:
```go
err := s.heimdallSync.WaitUntilHeimdallIsSynced(ctx, 200*time.Millisecond)
err = s.heimdallSync.SynchronizeSpans(ctx, math.MaxUint64)
```

This is fast in steady state, but during span rotation (every 128 blocks, ~256s), SynchronizeSpans must fetch from Heimdall and recompute producer selection, adding ~12s of overhead: a systematic source of lag at a predictable interval.
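One way to make the span sync non-blocking for ordinary block events is to gate it on span boundaries. The sketch below is a hypothetical helper (the name and gating strategy are not from the Erigon code): the blocking SynchronizeSpans call would run only when the tip advances across a 128-block span boundary, and other block events would skip it or run it in the background.

```go
package main

import "fmt"

// spanLength is the span size referenced above: a new span every 128 blocks.
const spanLength = 128

// crossesSpanBoundary reports whether advancing the tip from prev to next
// crosses a span boundary, i.e. whether a new span could have started in
// the interval (prev, next]. A block event loop could call the blocking
// span synchronization only when this returns true.
func crossesSpanBoundary(prev, next uint64) bool {
	return next/spanLength > prev/spanLength
}

func main() {
	fmt.Println(crossesSpanBoundary(255, 256)) // 256 starts a new span
	fmt.Println(crossesSpanBoundary(256, 257)) // still inside the same span
}
```

With a gate like this, the ~12s rotation overhead is paid once per 128 blocks where it is unavoidable, instead of putting a Heimdall round trip on the critical path of every block event.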
Production data
From v3.4.0-beta (bor-mainnet, commit 48d7b0b):
- syncToTip phase: 32-58s inter-cycle gap. Execution + trie dominates, but the 1s scraper poll is wasted time on every iteration.
- Event loop at tip: FC cycle avg = 2.07s for 2s blocks. Steady-state head age = 2-4s. Every second of unnecessary latency directly translates to falling further behind.
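To make the "falling further behind" claim concrete, the arithmetic on those two figures can be sketched as follows (a simple drift model using the quoted production numbers, not measured code):

```go
package main

import "fmt"

// lagGrowth models steady-state drift when each fork-choice cycle takes
// cycleSec while blocks are produced every blockSec: the node loses
// (cycleSec - blockSec) per block, and falls a full block behind every
// blockSec/drift blocks.
func lagGrowth(cycleSec, blockSec float64) (driftPerBlock, blocksPerBlockLost float64) {
	driftPerBlock = cycleSec - blockSec
	blocksPerBlockLost = blockSec / driftPerBlock
	return
}

func main() {
	// 2.07s FC cycles against 2s blocks, per the figures above.
	drift, n := lagGrowth(2.07, 2.0)
	fmt.Printf("drift %.2fs per block; one full block behind every ~%.0f blocks\n", drift, n)
}
```

At 2.07s cycles the node loses roughly a block every half-minute of wall time, which is why every second shaved off the pipeline matters.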
From issue #59 logs (v3.1.2):
```
[sync] update fork choice done in=8.7s
[sync] applying new milestone event milestoneId=3648361 ...
[sync] applying new milestone event milestoneId=3648362 ...
[span-rotation] need to wait for span rotation ...
[bor.heimdall] anticipating new span update within 8 seconds
[span-rotation] producer set was not updated within 8 seconds
```
Milestones are processed sequentially after the FC update, then span rotation adds another 8s — the node is idle for seconds doing Heimdall bookkeeping instead of processing blocks.
Suggested improvements
- Reduce milestone `pollDelay` from 1s to 200ms: align with spans, a one-line change in `service.go:89`
- Remove the `futureMilestoneDelay` re-queue loop: finality is already tracked; drop the event and let the next on-time milestone handle CCB pruning and verification
- Consider making span synchronization non-blocking for block events that don't fall on a span boundary
Related issues
- Polygon sync enters feedback loop when milestones accumulate #116 — Milestone accumulation feedback loop (consequence of this latency)
- Polygon mainnet: Erigon v3.3.7 intermittently falls behind by several thousand blocks occasionally #112 — Intermittent lag of thousands of blocks
- Sync Performance Degradation - Node Falls Behind by Dozens of Blocks with Multi-Second Processing Times #59 — Sync performance degradation with multi-second FC updates + span rotation