fix: add WAL checkpoint robustness with retry logic#1281
fix: add WAL checkpoint robustness with retry logic#1281rsnodgrass wants to merge 1 commit intosteveyegge:mainfrom
Conversation
Add WAL checkpoint retry logic with exponential backoff to prevent data loss when concurrent readers hold locks during database close. Security fixes from SECURITY_AUDIT.md: - Issue steveyegge#8: WAL checkpoint failure silently losing data Changes: - Add 5-retry exponential backoff for WAL checkpoint in Close() - Warn user if checkpoint fails after retries (potential data loss) - Add flush_manager_race_test.go for concurrency testing - Add wal_test.go for checkpoint with concurrent readers Co-Authored-By: SageOx <ox@sageox.ai>
efcef23 to
63c8da9
Compare
PR Review: fix: add WAL checkpoint robustness with retry logicRecommendation: needs-changes SummaryThis PR adds retry logic with exponential backoff to the WAL checkpoint in What's Good
Issues to Address1. Error discrimination (production bug risk) The retry loop retries on any error, not just SQLITE_BUSY. If the error is something other than SQLITE_BUSY (e.g., disk full, I/O error, corruption), retrying is pointless and delays Close() by ~3 seconds for no reason. The loop should check for SQLITE_BUSY specifically and bail immediately on other errors. 2. No direct test for the Close() retry logic The core change is the retry loop in Close(), but neither test file tests it. wal_test.go tests CheckpointWAL() (which uses FULL mode, not TRUNCATE), and flush_manager_race_test.go tests FlushManager concurrency. A test that exercises the actual retry path in Close() -- e.g., by holding a read transaction while calling Close() and verifying the warning or retry behavior -- would validate the fix directly. 3. Nonexistent SECURITY_AUDIT.md reference The comment references "SECURITY_AUDIT.md issue #8", but this file does not exist in the repository. The repo has SECURITY.md, but no SECURITY_AUDIT.md. This reference should be corrected or removed. 4. Test-to-fix ratio / scope creep The production change is ~30 lines. The test additions are 1464 lines, and much of it tests SQLite internals rather than the application's checkpoint logic:
These add maintenance burden without testing the retry logic. Consider trimming to tests that directly validate the new behavior. 5. Checkpoint error not propagated Close() prints a warning but still returns only db.Close(). If the checkpoint fails, the caller has no way to know programmatically. This matches the old behavior (errors were discarded), so it's not a regression, but the PR description claims to "prevent silent data loss" -- the data loss is now noisy but still not preventable by the caller. Worth noting in the PR description. Minor Notes
VerdictThe core fix is sound in concept but needs error discrimination to avoid unnecessary retries on non-busy errors. A direct test of the Close() retry path would strengthen confidence. The ancillary WAL tests are nice-to-have but are mostly testing SQLite itself, and the SECURITY_AUDIT.md reference needs fixing. Reviewed by: beads/crew/wickham |
|
Merged to main via direct push (0b92eae) |
Summary
Changes
store.go: Add 5-retry exponential backoff for WAL checkpoint in Close()flush_manager_race_test.go: Concurrency tests for flush managerwal_test.go: Tests for checkpoint with concurrent readersSecurity Fixes
Test Plan
go test -race ./cmd/bd/... -run FlushManagerfor flush testsgo test -race ./internal/storage/sqlite/... -run WALfor WAL testsRelated Issues
sequenceDiagram participant S as Store.Close() participant W as WAL S->>W: Checkpoint (attempt 1) W-->>S: SQLITE_BUSY (reader lock) S->>S: Sleep 100ms S->>W: Checkpoint (attempt 2) W-->>S: SQLITE_BUSY S->>S: Sleep 200ms S->>W: Checkpoint (attempt 3) W-->>S: Success S->>S: Close DB