
Re-queue failed pageview batch for retry on next interval#710

Open
rockinrimmer wants to merge 2 commits into rybbit-io:master from rockinrimmer:handle-insert-error

Conversation

rockinrimmer (Contributor) commented Nov 18, 2025

This re-adds pageviews to the queue if there is an error inserting into Clickhouse.

Summary by CodeRabbit

  • Bug Fixes

    • Improved pageview tracking reliability with automatic retrying of failed submissions and clearer failure handling to reduce data loss.
  • New Features

    • Added configurable retry behavior for pageview queue processing so failed batches are retried automatically up to a set limit.
  • Documentation

    • Updated self-hosting docs to document the new environment variable for configuring pageview retry limits.


vercel bot commented Nov 18, 2025

@rockinrimmer is attempting to deploy a commit to the goldflag's projects Team on Vercel.

A member of the Team first needs to authorize it.

coderabbitai bot (Contributor) commented Nov 18, 2025

Walkthrough

Adds a configurable, exponential-backoff retry mechanism for bulk pageview inserts, a helper to parse positive numeric env vars, and environment variable support (PAGEVIEW_QUEUE_MAX_RETRIES). processQueue now delegates to a retry-aware insert method and logs retry attempts before ultimately throwing on exhaustion.

Changes

Cohort / File(s) — Summary

Pageview queue retry logic (server/src/services/tracker/pageviewQueue.ts)
Introduces parsePositiveNumber(value, fallback), reads PAGEVIEW_QUEUE_MAX_RETRIES into maxRetries, extracts the bulk insert into insertBatchWithRetry(processedPageviews, retryCount) implementing exponential backoff and a retry cap, updates processQueue to call the retry method, adds try/catch logging and finalization, and throws after max retries.

Docs: environment variable (docs/content/docs/self-hosting-guides/self-hosting-advanced.mdx)
Documents the new PAGEVIEW_QUEUE_MAX_RETRIES=3 environment variable and comment, explaining the default number of ClickHouse insert retries.
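Per the docs change, the variable is set in the server environment. A minimal sketch of the documented default:

```shell
# .env — maximum retries per failed ClickHouse pageview batch insert
# (default shown; the initial attempt is not counted as a retry)
PAGEVIEW_QUEUE_MAX_RETRIES=3
```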

Sequence Diagram(s)

sequenceDiagram
    autonumber
    actor Scheduler
    participant Queue
    participant Service as PageviewQueue
    participant DB as ClickHouse

    Scheduler->>Service: trigger processQueue()
    Service->>Queue: dequeue batch -> processedPageviews
    Service->>Service: call insertBatchWithRetry(batch, retryCount=0)

    alt Insert succeeds
        Service->>DB: bulkInsert(batch)
        DB-->>Service: OK
        Service->>Queue: finalize batch (removed)
    else Insert fails
        Service->>DB: bulkInsert(batch)
        DB-->>Service: Error
        Service->>Service: if retryCount < maxRetries, wait interval*2^retryCount
        Service->>Service: insertBatchWithRetry(batch, retryCount+1)
        alt Exhausted retries
            Service-->>Scheduler: throw error (bubbled)
            Note right of Service: logs include retry attempts & batch size
        end
    end

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

  • Moderate complexity: new retry/backoff logic and env parsing in a single core service file.
  • Review focus:
    • Correctness of exponential backoff calculation and delay implementation.
    • Proper handling of async control flow and thrown errors bubbling to callers.
    • Validation and bounds for parsePositiveNumber and PAGEVIEW_QUEUE_MAX_RETRIES behavior.
    • Logging clarity and potential sensitive data in logs.
    • Docs addition consistency.

Poem

🐰 I thump my foot, then count to three,
Retries like carrots, growing exponentially.
Batches hop back, patient and bright,
Till ClickHouse nods and accepts them right.
A little rabbit cheers on every retry night. 🥕

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)
  • Description Check — ✅ Passed: check skipped; CodeRabbit's high-level summary is enabled.
  • Title Check — ✅ Passed: the title accurately reflects the main change: implementing a retry mechanism that re-queues failed pageview batches for retry on the next interval.
  • Docstring Coverage — ✅ Passed: no functions found in the changed files to evaluate docstring coverage; skipping docstring coverage check.

📜 Recent review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f79a29f and f4efd7f.

📒 Files selected for processing (2)
  • docs/content/docs/self-hosting-guides/self-hosting-advanced.mdx (1 hunks)
  • server/src/services/tracker/pageviewQueue.ts (3 hunks)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-09-13T04:45:39.436Z
Learnt from: nktnet1
Repo: rybbit-io/rybbit PR: 270
File: server/src/db/clickhouse/clickhouse.ts:7-7
Timestamp: 2025-09-13T04:45:39.436Z
Learning: In the rybbit codebase, environment variables for ClickHouse configuration (CLICKHOUSE_HOST, CLICKHOUSE_DB, CLICKHOUSE_USER, CLICKHOUSE_PASSWORD) are used directly from process.env without default values to maintain consistency across the configuration pattern.

Applied to files:

  • docs/content/docs/self-hosting-guides/self-hosting-advanced.mdx
🧬 Code graph analysis (1)
server/src/services/tracker/pageviewQueue.ts (1)
server/src/db/clickhouse/clickhouse.ts (1)
  • clickhouse (4-8)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: Build Client Images (ubuntu-latest, linux/amd64)
  • GitHub Check: Build Client Images (ubuntu-24.04-arm, linux/arm64)
🔇 Additional comments (6)
docs/content/docs/self-hosting-guides/self-hosting-advanced.mdx (1)

89-91: Documentation looks good.

The environment variable documentation is clear and accurate. The comment correctly describes the default behavior (3 retries after the initial attempt, totaling 4 attempts).

server/src/services/tracker/pageviewQueue.ts (5)

13-16: Robust environment variable parsing.

The helper correctly handles all edge cases (undefined, non-numeric strings, zero, negative numbers, and Infinity) by falling back to the default value.
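The edge cases listed above can be captured in a small sketch; this is a hypothetical reconstruction of the helper the review describes, not the actual code from pageviewQueue.ts:

```typescript
// Parse an env-var string into a positive finite number, or fall back.
function parsePositiveNumber(value: string | undefined, fallback: number): number {
  if (value === undefined) return fallback; // env var not set
  const parsed = Number(value);
  // Number("abc") is NaN; zero, negatives, and Infinity are also rejected.
  if (!Number.isFinite(parsed) || parsed <= 0) return fallback;
  return parsed;
}
```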


32-32: Appropriate use of default value.

Using a sensible default (3 retries) for this operational configuration is appropriate, as it allows the system to function without requiring explicit configuration.


125-132: Well-implemented error handling that addresses past concerns.

The try-catch-finally structure properly handles failures:

  • Delegates to retry method with exponential backoff
  • Logs detailed error context including attempt count
  • Ensures processing flag is reset via finally block
  • Drops batch after exhausting retries (preventing infinite retry loop)

This directly addresses the concerns raised in previous reviews about unbounded retries and memory exhaustion.


134-152: Solid retry implementation with exponential backoff.

The retry logic is well-designed:

  • Exponential backoff delays (10s → 20s → 40s) reduce load during transient failures
  • Clear retry exhaustion condition (retryCount >= maxRetries)
  • Informative warning logs with attempt number and delay
  • Recursive approach is safe for the default retry limit (3)

The implementation provides resilience against transient ClickHouse failures while preventing infinite retry loops.
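The delay schedule the review describes (base interval doubled per attempt) can be sketched as follows; BASE_DELAY_MS and backoffDelayMs are illustrative names, not taken from the PR:

```typescript
const BASE_DELAY_MS = 10_000; // 10-second base interval

// Exponential backoff: delay grows as base * 2^retryCount.
function backoffDelayMs(retryCount: number): number {
  return BASE_DELAY_MS * 2 ** retryCount;
}

// Retries 0..2 wait 10s, 20s, 40s before the next insert attempt.
const delays = [0, 1, 2].map(backoffDelayMs); // [10000, 20000, 40000]
```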


134-152: Implementation differs from PR description.

The PR description states: "This change re-adds pageviews to the queue if there is an error inserting into ClickHouse, so failed pageview batches are retried on the next interval."

However, the implementation doesn't re-add items to this.queue for retry on the next interval. Instead, it retries immediately with exponential backoff delays within the same processQueue call. After exhausting retries, the batch is dropped rather than re-queued.

This approach is actually better (avoids mixing failed and new data, uses exponential backoff), but the PR description should be updated to accurately reflect the implementation.


coderabbitai bot (Contributor) left a comment
Actionable comments posted: 1

🧹 Nitpick comments (1)
server/src/services/tracker/pageviewQueue.ts (1)

45-116: Consider preserving processed pageviews to avoid redundant geo-data fetching.

Currently, when a batch fails and is re-queued, the original unprocessed payloads are added back, which means the geo-location lookup (line 45) and all transformations (lines 48-116) are repeated on retry. If the ClickHouse insert failed but geo-data was successfully fetched, this re-processing is redundant.

You could optimize by:

  • Caching the processedPageviews alongside the retry count when re-queuing
  • Or, rely on the geo-data service's caching layer if one exists

However, this adds complexity and may not be worth the effort if:

  • Geo-data lookups are fast/cached
  • Retry failures are rare
  • Fresh data on retry is preferred

Only pursue this if profiling shows geo-data fetching is a bottleneck during retries.

📜 Review details


📥 Commits

Reviewing files that changed from the base of the PR and between ba9a3a7 and f79a29f.

📒 Files selected for processing (1)
  • server/src/services/tracker/pageviewQueue.ts (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: Build Client Images (ubuntu-24.04-arm, linux/arm64)
  • GitHub Check: Build Client Images (ubuntu-latest, linux/amd64)

Comment on lines 127 to 129
this.logger.error(error, `Error processing pageview queue, re-adding ${batch.length} items to the start of the queue`);
// Re-queue the failed batch so it can be retried on the next interval
this.queue.unshift(...batch);
coderabbitai bot (Contributor) commented Nov 18, 2025

⚠️ Potential issue | 🟠 Major

Infinite retry loop and potential memory exhaustion.

The retry mechanism lacks safeguards for persistent failures:

  1. Unbounded retries: If ClickHouse insertion fails repeatedly (e.g., network issues, schema mismatch, or corrupt data in the batch), the same batch will retry indefinitely every 10 seconds.
  2. Memory leak: During prolonged ClickHouse outages, new pageviews continuously arrive while failed batches are re-queued, causing the queue to grow without bounds.
  3. No failure isolation: A single corrupt item in the batch causes the entire batch to fail and retry forever.

Consider implementing:

  • Retry limit: Track retry attempts per batch (e.g., max 3 retries) and move permanently failed batches to a dead-letter queue or log them for investigation.
  • Exponential backoff: Delay subsequent retries to reduce pressure on ClickHouse (e.g., 10s, 30s, 60s).
  • Queue size limit: Implement a maximum queue size with alerting when approaching capacity.
  • Item-level error handling: Consider catching per-item validation errors if ClickHouse supports partial batch inserts.

Example approach for retry tracking:

type TotalPayloadWithRetry = TotalPayload & {
  retryCount?: number;
};

// In the error handler:
const batchWithRetry = batch.map(item => ({
  ...item,
  retryCount: (item.retryCount || 0) + 1
}));

const itemsToRequeue = batchWithRetry.filter(item => (item.retryCount || 0) < 3);
const failedItems = batchWithRetry.filter(item => (item.retryCount || 0) >= 3);

if (itemsToRequeue.length > 0) {
  this.queue.unshift(...itemsToRequeue);
}
if (failedItems.length > 0) {
  this.logger.error({ count: failedItems.length }, "Dropping items after max retries");
  // Optionally: write to dead-letter queue/file for investigation
}
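The "queue size limit" suggestion above can likewise be sketched; MAX_QUEUE_SIZE and requeueWithCap are hypothetical names for illustration, not part of the PR:

```typescript
const MAX_QUEUE_SIZE = 100_000; // assumed cap; tune for expected traffic

// Re-queue a failed batch at the front, shedding overflow past the cap.
function requeueWithCap<T>(queue: T[], failedBatch: T[]): { queue: T[]; dropped: number } {
  // Failed items go back to the front so they are retried first.
  const combined = [...failedBatch, ...queue];
  const dropped = Math.max(0, combined.length - MAX_QUEUE_SIZE);
  // Drop items from the tail once the cap is exceeded, and report the count
  // so it can be logged or alerted on.
  return { queue: combined.slice(0, MAX_QUEUE_SIZE), dropped };
}
```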

rockinrimmer (Contributor, Author) commented

@goldflag I can add handling so we don't retry forever, if you think it's worth it?

coderabbitai bot (Contributor) commented

Seems like the humans are having a chat. I'll hop back into my burrow for now. If you need me again, just tag @coderabbitai in a new comment, and I'll come hopping out!

Collaborator commented

Yeah, maybe we can retry 3 times. What were the issues you were seeing, though? I wouldn't expect ClickHouse to be flaky in this way.

rockinrimmer (Contributor, Author) commented

Sure, can do.

The only issue was during a restart of ClickHouse. For example, I added some resource limits to the k8s pod, which recreated the pod, and ClickHouse was down for a short while.

rockinrimmer (Contributor, Author) commented

Any more changes required for this?
