Re-queue failed pageview batch for retry on next interval #710
rockinrimmer wants to merge 2 commits into rybbit-io:master from
Conversation
@rockinrimmer is attempting to deploy a commit to the goldflag's projects Team on Vercel. A member of the Team first needs to authorize it.
Walkthrough

Adds a configurable, exponential-backoff retry mechanism for bulk pageview inserts, a helper to parse positive numeric env vars, and environment variable support for configuring the retry behavior.
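The walkthrough mentions a helper for parsing positive numeric env vars. A minimal sketch of what such a helper could look like — the name parsePositiveNumberEnv and both env var names are assumptions for illustration, not the PR's actual identifiers:

```ts
// Hypothetical sketch: the helper and env var names are assumptions, not from the PR.
function parsePositiveNumberEnv(name: string, defaultValue: number): number {
  const raw = process.env[name];
  if (raw === undefined || raw === "") return defaultValue;
  const parsed = Number(raw);
  // Reject NaN, Infinity, zero, and negatives; keep the default instead.
  return Number.isFinite(parsed) && parsed > 0 ? parsed : defaultValue;
}

// Example usage (env var names assumed):
const maxRetries = parsePositiveNumberEnv("PAGEVIEW_INSERT_MAX_RETRIES", 3);
const retryIntervalMs = parsePositiveNumberEnv("PAGEVIEW_RETRY_INTERVAL_MS", 10_000);
```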
Sequence Diagram(s)

```mermaid
sequenceDiagram
    autonumber
    actor Scheduler
    participant Queue
    participant Service as PageviewQueue
    participant DB as ClickHouse

    Scheduler->>Service: trigger processQueue()
    Service->>Queue: dequeue batch -> processedPageviews
    Service->>Service: call insertBatchWithRetry(batch, retryCount=0)

    alt Insert succeeds
        Service->>DB: bulkInsert(batch)
        DB-->>Service: OK
        Service->>Queue: finalize batch (removed)
    else Insert fails
        Service->>DB: bulkInsert(batch)
        DB-->>Service: Error
        Service->>Service: if retryCount < maxRetries, wait interval*2^retryCount
        Service->>Service: insertBatchWithRetry(batch, retryCount+1)
        alt Exhausted retries
            Service-->>Scheduler: throw error (bubbled)
            Note right of Service: logs include retry attempts & batch size
        end
    end
```
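Read alongside the diagram, a minimal sketch of the retry flow it describes; bulkInsert, the row type, and the default values below are assumptions for illustration, not the PR's exact code:

```ts
// Sketch under assumptions: bulkInsert, ProcessedPageview, and the defaults
// below stand in for the PR's real identifiers.
type ProcessedPageview = Record<string, unknown>;
declare function bulkInsert(rows: ProcessedPageview[]): Promise<void>; // e.g. a ClickHouse client insert

const maxRetries = 3;           // assumed default
const retryIntervalMs = 10_000; // assumed default

const sleep = (ms: number) => new Promise<void>(resolve => setTimeout(resolve, ms));

async function insertBatchWithRetry(
  batch: ProcessedPageview[],
  retryCount = 0,
): Promise<void> {
  try {
    await bulkInsert(batch);
  } catch (error) {
    if (retryCount >= maxRetries) {
      // Retries exhausted: bubble the error up to the caller.
      throw error;
    }
    // Exponential backoff: interval * 2^retryCount (10s, 20s, 40s with the defaults above).
    await sleep(retryIntervalMs * 2 ** retryCount);
    return insertBatchWithRetry(batch, retryCount + 1);
  }
}
```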
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
Pre-merge checks and finishing touches: ✅ Passed checks (3 passed)
Actionable comments posted: 1
🧹 Nitpick comments (1)
server/src/services/tracker/pageviewQueue.ts (1)
45-116: Consider preserving processed pageviews to avoid redundant geo-data fetching.

Currently, when a batch fails and is re-queued, the original unprocessed payloads are added back, which means the geo-location lookup (line 45) and all transformations (lines 48-116) are repeated on retry. If the ClickHouse insert failed but geo-data was successfully fetched, this re-processing is redundant.
You could optimize by:

- Caching the processedPageviews alongside the retry count when re-queuing (sketched after this note)
- Relying on the geo-data service's caching layer, if one exists
However, this adds complexity and may not be worth the effort if:

- Geo-data lookups are fast/cached
- Retry failures are rare
- Fresh data on retry is preferred
Only pursue this if profiling shows geo-data fetching is a bottleneck during retries.
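If profiling ever did justify it, here is one hedged sketch of the first option: re-queuing the already-processed rows together with a retry count so the geo lookup is skipped on retry (the RetryEntry shape, retryQueue, and ProcessedPageview names are hypothetical):

```ts
// Hypothetical structure for re-queuing processed rows (names not from the PR).
type ProcessedPageview = Record<string, unknown>;

type RetryEntry = {
  rows: ProcessedPageview[]; // geo lookup and transformations already applied
  retryCount: number;
};

// Drained ahead of fresh payloads so retried rows skip re-processing.
const retryQueue: RetryEntry[] = [];

function requeueProcessed(rows: ProcessedPageview[], previousRetryCount: number): void {
  retryQueue.unshift({ rows, retryCount: previousRetryCount + 1 });
}
```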
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
server/src/services/tracker/pageviewQueue.ts (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: Build Client Images (ubuntu-24.04-arm, linux/arm64)
- GitHub Check: Build Client Images (ubuntu-latest, linux/amd64)
```ts
this.logger.error(error, `Error processing pageview queue, re-adding ${batch.length} items to the start of the queue`);
// Re-queue the failed batch so it can be retried on the next interval
this.queue.unshift(...batch);
```
Infinite retry loop and potential memory exhaustion.
The retry mechanism lacks safeguards for persistent failures:
- Unbounded retries: If ClickHouse insertion fails repeatedly (e.g., network issues, schema mismatch, or corrupt data in the batch), the same batch will retry indefinitely every 10 seconds.
- Memory leak: During prolonged ClickHouse outages, new pageviews continuously arrive while failed batches are re-queued, causing the queue to grow without bounds.
- No failure isolation: A single corrupt item in the batch causes the entire batch to fail and retry forever.
Consider implementing:
- Retry limit: Track retry attempts per batch (e.g., max 3 retries) and move permanently failed batches to a dead-letter queue or log them for investigation.
- Exponential backoff: Delay subsequent retries to reduce pressure on ClickHouse (e.g., 10s, 30s, 60s).
- Queue size limit: Implement a maximum queue size with alerting when approaching capacity.
- Item-level error handling: Consider catching per-item validation errors if ClickHouse supports partial batch inserts.
Example approach for retry tracking:
```ts
type TotalPayloadWithRetry = TotalPayload & {
  retryCount?: number;
};

// In the error handler: bump the retry count on every item in the failed batch.
const batchWithRetry = (batch as TotalPayloadWithRetry[]).map(item => ({
  ...item,
  retryCount: (item.retryCount ?? 0) + 1,
}));

const itemsToRequeue = batchWithRetry.filter(item => item.retryCount < 3);
const failedItems = batchWithRetry.filter(item => item.retryCount >= 3);

if (itemsToRequeue.length > 0) {
  this.queue.unshift(...itemsToRequeue);
}

if (failedItems.length > 0) {
  this.logger.error({ count: failedItems.length }, "Dropping items after max retries");
  // Optionally: write to a dead-letter queue/file for investigation
}
```
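For the queue-size limit suggested above, a minimal sketch in the same style (MAX_QUEUE_SIZE is an assumed value, and this reuses the this.queue/this.logger context of the snippet above):

```ts
// Hypothetical cap on the in-memory queue (MAX_QUEUE_SIZE is an assumed value).
const MAX_QUEUE_SIZE = 50_000;

// After re-queuing the failed batch, truncate anything beyond the cap. Because
// failed batches are unshifted to the front, truncation drops the most recently
// enqueued pageviews from the tail.
if (this.queue.length > MAX_QUEUE_SIZE) {
  const dropped = this.queue.length - MAX_QUEUE_SIZE;
  this.queue.length = MAX_QUEUE_SIZE;
  this.logger.warn({ dropped }, "Pageview queue over capacity; dropped newest items");
}
```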
@goldflag I can add handling so we don't retry forever if you think it's worth it?
Seems like the humans are having a chat. I'll hop back into my burrow for now. If you need me again, just tag @coderabbitai in a new comment, and I'll come hopping out!
Yeah, maybe we can retry 3 times. What were the issues you were seeing, though? I wouldn't expect ClickHouse to be flaky in this way.
Sure can do.
The only issue was during a restart of ClickHouse. For example, I added some resource limits to the k8s pod, which recreated the pod, and ClickHouse was down for a short while.
Any more changes required for this?
This re-adds pageviews to the queue if there is an error inserting into ClickHouse.