Skip to content

fix: add WAL checkpoint robustness with retry logic#1281

Closed
rsnodgrass wants to merge 1 commit intosteveyegge:mainfrom
rsnodgrass:pr/sync-flush-robustness
Closed

fix: add WAL checkpoint robustness with retry logic#1281
rsnodgrass wants to merge 1 commit intosteveyegge:mainfrom
rsnodgrass:pr/sync-flush-robustness

Conversation

@rsnodgrass
Copy link
Contributor

@rsnodgrass rsnodgrass commented Jan 23, 2026

Summary

  • Add WAL checkpoint retry logic with exponential backoff
  • Prevent silent data loss when checkpoint fails due to concurrent readers

Changes

  • store.go: Add 5-retry exponential backoff for WAL checkpoint in Close()
  • flush_manager_race_test.go: Concurrency tests for flush manager
  • wal_test.go: Tests for checkpoint with concurrent readers

Security Fixes

Test Plan

  • Run go test -race ./cmd/bd/... -run FlushManager for flush tests
  • Run go test -race ./internal/storage/sqlite/... -run WAL for WAL tests

Related Issues

sequenceDiagram
    participant S as Store.Close()
    participant W as WAL

    S->>W: Checkpoint (attempt 1)
    W-->>S: SQLITE_BUSY (reader lock)
    S->>S: Sleep 100ms
    S->>W: Checkpoint (attempt 2)
    W-->>S: SQLITE_BUSY
    S->>S: Sleep 200ms
    S->>W: Checkpoint (attempt 3)
    W-->>S: Success
    S->>S: Close DB
Loading

@rsnodgrass rsnodgrass marked this pull request as ready for review January 23, 2026 08:39
Add WAL checkpoint retry logic with exponential backoff to prevent data
loss when concurrent readers hold locks during database close.

Security fixes from SECURITY_AUDIT.md:
- Issue steveyegge#8: WAL checkpoint failure silently losing data

Changes:
- Add 5-retry exponential backoff for WAL checkpoint in Close()
- Warn user if checkpoint fails after retries (potential data loss)
- Add flush_manager_race_test.go for concurrency testing
- Add wal_test.go for checkpoint with concurrent readers

Co-Authored-By: SageOx <ox@sageox.ai>
@rsnodgrass rsnodgrass force-pushed the pr/sync-flush-robustness branch from efcef23 to 63c8da9 Compare January 23, 2026 08:47
@rsnodgrass rsnodgrass changed the title fix: improve flush manager timeout and WAL checkpoint robustness fix: add WAL checkpoint robustness with retry logic Jan 23, 2026
@steveyegge
Copy link
Owner

PR Review: fix: add WAL checkpoint robustness with retry logic

Recommendation: needs-changes


Summary

This PR adds retry logic with exponential backoff to the WAL checkpoint in SQLiteStorage.Close(), and includes two new test files (1464 lines total) covering FlushManager race conditions and WAL behavior.

What's Good

  • Real problem, appropriate solution. WAL checkpoint can fail with SQLITE_BUSY when concurrent readers hold locks. Retry with exponential backoff (100ms, 200ms, 400ms, 800ms, 1600ms -- about 3.1s total) is a reasonable strategy.
  • User-visible warning. Printing to stderr when all retries fail is better than the previous silent discard.
  • FlushManager race tests cover meaningful concurrent scenarios: shutdown during marking, channel buffer overflow, timer races, concurrent FlushNow, etc.
  • WAL tests include useful tests for checkpoint-during-transaction, crash recovery, and data visibility after checkpoint.

Issues to Address

1. Error discrimination (production bug risk)

The retry loop retries on any error, not just SQLITE_BUSY. If the error is something other than SQLITE_BUSY (e.g., disk full, I/O error, corruption), retrying is pointless and delays Close() by ~3 seconds for no reason. The loop should check for SQLITE_BUSY specifically and bail immediately on other errors.

2. No direct test for the Close() retry logic

The core change is the retry loop in Close(), but neither test file tests it. wal_test.go tests CheckpointWAL() (which uses FULL mode, not TRUNCATE), and flush_manager_race_test.go tests FlushManager concurrency. A test that exercises the actual retry path in Close() -- e.g., by holding a read transaction while calling Close() and verifying the warning or retry behavior -- would validate the fix directly.

3. Nonexistent SECURITY_AUDIT.md reference

The comment references "SECURITY_AUDIT.md issue #8", but this file does not exist in the repository. The repo has SECURITY.md, but no SECURITY_AUDIT.md. This reference should be corrected or removed.

4. Test-to-fix ratio / scope creep

The production change is ~30 lines. The test additions are 1464 lines, and much of it tests SQLite internals rather than the application's checkpoint logic:

  • TestCheckpointModes -- tests PASSIVE/FULL/TRUNCATE modes (SQLite behavior, not app logic)
  • TestWALAutoCheckpoint -- queries wal_autocheckpoint pragma then does nothing with it
  • TestWALModeIsEnabled -- verifies WAL mode is set (already implicitly covered by existing tests)
  • TestWALFileCreation -- checks whether -wal and -shm files exist

These add maintenance burden without testing the retry logic. Consider trimming to tests that directly validate the new behavior.

5. Checkpoint error not propagated

Close() prints a warning but still returns only db.Close(). If the checkpoint fails, the caller has no way to know programmatically. This matches the old behavior (errors were discarded), so it's not a regression, but the PR description claims to "prevent silent data loss" -- the data loss is now noisy but still not preventable by the caller. Worth noting in the PR description.

Minor Notes

  • flush_manager_race_test.go timeouts of 35 seconds could slow CI.
  • TestWALReaderIsolation acknowledges in comments that mid-transaction read behavior is non-deterministic, which weakens that test's assertions.

Verdict

The core fix is sound in concept but needs error discrimination to avoid unnecessary retries on non-busy errors. A direct test of the Close() retry path would strengthen confidence. The ancillary WAL tests are nice-to-have but are mostly testing SQLite itself, and the SECURITY_AUDIT.md reference needs fixing.


Reviewed by: beads/crew/wickham

@steveyegge
Copy link
Owner

Merged to main via direct push (0b92eae)

@steveyegge steveyegge closed this Feb 5, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants