
Re-queue failed pageview batch for retry on next interval#710

Open
rockinrimmer wants to merge 2 commits into rybbit-io:master from rockinrimmer:handle-insert-error

Conversation

rockinrimmer (Contributor) commented Nov 18, 2025

This re-adds pageviews to the queue if there is an error inserting into Clickhouse.

Summary by CodeRabbit

  • Bug Fixes

    • Improved pageview tracking reliability with automatic retrying of failed submissions and clearer failure handling to reduce data loss.
  • New Features

    • Added configurable retry behavior for pageview queue processing so failed batches are retried automatically up to a set limit.
  • Documentation

    • Updated self-hosting docs to document the new environment variable for configuring pageview retry limits.


vercel bot commented Nov 18, 2025

@rockinrimmer is attempting to deploy a commit to the goldflag's projects Team on Vercel.

A member of the Team first needs to authorize it.

coderabbitai bot (Contributor) commented Nov 18, 2025

Walkthrough

Adds a configurable, exponential-backoff retry mechanism for bulk pageview inserts, a helper to parse positive numeric env vars, and environment variable support (PAGEVIEW_QUEUE_MAX_RETRIES). processQueue now delegates to a retry-aware insert method and logs retry attempts before ultimately throwing on exhaustion.

Changes

Cohort / File(s) — Summary

Pageview queue retry logic (server/src/services/tracker/pageviewQueue.ts)
Introduces parsePositiveNumber(value, fallback), reads PAGEVIEW_QUEUE_MAX_RETRIES into maxRetries, extracts the bulk insert into insertBatchWithRetry(processedPageviews, retryCount) implementing exponential backoff and a retry cap, updates processQueue to call the retry method, adds try/catch logging and finalization, and throws after max retries.

Docs: environment variable (docs/content/docs/self-hosting-guides/self-hosting-advanced.mdx)
Documents the new PAGEVIEW_QUEUE_MAX_RETRIES=3 environment variable and comment, explaining the default number of ClickHouse insert retries.
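Per the docs change, the variable is set in the server environment. A minimal sketch of the documented default:

```shell
# .env — maximum retries per failed ClickHouse pageview batch insert
# (default shown; the initial attempt is not counted as a retry)
PAGEVIEW_QUEUE_MAX_RETRIES=3
```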

Sequence Diagram(s)

sequenceDiagram
    autonumber
    actor Scheduler
    participant Queue
    participant Service as PageviewQueue
    participant DB as ClickHouse

    Scheduler->>Service: trigger processQueue()
    Service->>Queue: dequeue batch -> processedPageviews
    Service->>Service: call insertBatchWithRetry(batch, retryCount=0)

    alt Insert succeeds
        Service->>DB: bulkInsert(batch)
        DB-->>Service: OK
        Service->>Queue: finalize batch (removed)
    else Insert fails
        Service->>DB: bulkInsert(batch)
        DB-->>Service: Error
        Service->>Service: if retryCount < maxRetries, wait interval*2^retryCount
        Service->>Service: insertBatchWithRetry(batch, retryCount+1)
        alt Exhausted retries
            Service-->>Scheduler: throw error (bubbled)
            Note right of Service: logs include retry attempts & batch size
        end
    end

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

  • Moderate complexity: new retry/backoff logic and env parsing in a single core service file.
  • Review focus:
    • Correctness of exponential backoff calculation and delay implementation.
    • Proper handling of async control flow and thrown errors bubbling to callers.
    • Validation and bounds for parsePositiveNumber and PAGEVIEW_QUEUE_MAX_RETRIES behavior.
    • Logging clarity and potential sensitive data in logs.
    • Docs addition consistency.

Poem

🐰 I thump my foot, then count to three,
Retries like carrots, growing exponentially.
Batches hop back, patient and bright,
Till ClickHouse nods and accepts them right.
A little rabbit cheers on every retry night. 🥕

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)
  • Description Check — ✅ Passed: check skipped; CodeRabbit's high-level summary is enabled.
  • Title Check — ✅ Passed: the title accurately reflects the main change: implementing a retry mechanism that re-queues failed pageview batches for retry on the next interval.
  • Docstring Coverage — ✅ Passed: no functions found in the changed files to evaluate docstring coverage; skipping docstring coverage check.

📜 Recent review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f79a29f and f4efd7f.

📒 Files selected for processing (2)
  • docs/content/docs/self-hosting-guides/self-hosting-advanced.mdx (1 hunks)
  • server/src/services/tracker/pageviewQueue.ts (3 hunks)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-09-13T04:45:39.436Z
Learnt from: nktnet1
Repo: rybbit-io/rybbit PR: 270
File: server/src/db/clickhouse/clickhouse.ts:7-7
Timestamp: 2025-09-13T04:45:39.436Z
Learning: In the rybbit codebase, environment variables for ClickHouse configuration (CLICKHOUSE_HOST, CLICKHOUSE_DB, CLICKHOUSE_USER, CLICKHOUSE_PASSWORD) are used directly from process.env without default values to maintain consistency across the configuration pattern.

Applied to files:

  • docs/content/docs/self-hosting-guides/self-hosting-advanced.mdx
🧬 Code graph analysis (1)
server/src/services/tracker/pageviewQueue.ts (1)
server/src/db/clickhouse/clickhouse.ts (1)
  • clickhouse (4-8)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: Build Client Images (ubuntu-latest, linux/amd64)
  • GitHub Check: Build Client Images (ubuntu-24.04-arm, linux/arm64)
🔇 Additional comments (6)
docs/content/docs/self-hosting-guides/self-hosting-advanced.mdx (1)

89-91: Documentation looks good.

The environment variable documentation is clear and accurate. The comment correctly describes the default behavior (3 retries after the initial attempt, totaling 4 attempts).

server/src/services/tracker/pageviewQueue.ts (5)

13-16: Robust environment variable parsing.

The helper correctly handles all edge cases (undefined, non-numeric strings, zero, negative numbers, and Infinity) by falling back to the default value.
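The edge cases listed above can be captured in a small sketch; this is a hypothetical reconstruction of the helper the review describes, not the actual code from pageviewQueue.ts:

```typescript
// Parse an env-var string into a positive finite number, or fall back.
function parsePositiveNumber(value: string | undefined, fallback: number): number {
  if (value === undefined) return fallback; // env var not set
  const parsed = Number(value);
  // Number("abc") is NaN; zero, negatives, and Infinity are also rejected.
  if (!Number.isFinite(parsed) || parsed <= 0) return fallback;
  return parsed;
}
```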


32-32: Appropriate use of default value.

Using a sensible default (3 retries) for this operational configuration is appropriate, as it allows the system to function without requiring explicit configuration.


125-132: Well-implemented error handling that addresses past concerns.

The try-catch-finally structure properly handles failures:

  • Delegates to retry method with exponential backoff
  • Logs detailed error context including attempt count
  • Ensures processing flag is reset via finally block
  • Drops batch after exhausting retries (preventing infinite retry loop)

This directly addresses the concerns raised in previous reviews about unbounded retries and memory exhaustion.


134-152: Solid retry implementation with exponential backoff.

The retry logic is well-designed:

  • Exponential backoff delays (10s → 20s → 40s) reduce load during transient failures
  • Clear retry exhaustion condition (retryCount >= maxRetries)
  • Informative warning logs with attempt number and delay
  • Recursive approach is safe for the default retry limit (3)

The implementation provides resilience against transient ClickHouse failures while preventing infinite retry loops.
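The delay schedule the review describes (base interval doubled per attempt) can be sketched as follows; BASE_DELAY_MS and backoffDelayMs are illustrative names, not taken from the PR:

```typescript
const BASE_DELAY_MS = 10_000; // 10-second base interval

// Exponential backoff: delay grows as base * 2^retryCount.
function backoffDelayMs(retryCount: number): number {
  return BASE_DELAY_MS * 2 ** retryCount;
}

// Retries 0..2 wait 10s, 20s, 40s before the next insert attempt.
const delays = [0, 1, 2].map(backoffDelayMs); // [10000, 20000, 40000]
```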


134-152: Implementation differs from PR description.

The PR description states: "This change re-adds pageviews to the queue if there is an error inserting into ClickHouse, so failed pageview batches are retried on the next interval."

However, the implementation doesn't re-add items to this.queue for retry on the next interval. Instead, it retries immediately with exponential backoff delays within the same processQueue call. After exhausting retries, the batch is dropped rather than re-queued.

This approach is actually better (avoids mixing failed and new data, uses exponential backoff), but the PR description should be updated to accurately reflect the implementation.


coderabbitai bot (Contributor) left a comment
Actionable comments posted: 1

🧹 Nitpick comments (1)
server/src/services/tracker/pageviewQueue.ts (1)

45-116: Consider preserving processed pageviews to avoid redundant geo-data fetching.

Currently, when a batch fails and is re-queued, the original unprocessed payloads are added back, which means the geo-location lookup (line 45) and all transformations (lines 48-116) are repeated on retry. If the ClickHouse insert failed but geo-data was successfully fetched, this re-processing is redundant.

You could optimize by:

  • Caching the processedPageviews alongside the retry count when re-queuing
  • Or, rely on the geo-data service's caching layer if one exists

However, this adds complexity and may not be worth the effort if:

  • Geo-data lookups are fast/cached
  • Retry failures are rare
  • Fresh data on retry is preferred

Only pursue this if profiling shows geo-data fetching is a bottleneck during retries.

📜 Review details


📥 Commits

Reviewing files that changed from the base of the PR and between ba9a3a7 and f79a29f.

📒 Files selected for processing (1)
  • server/src/services/tracker/pageviewQueue.ts (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: Build Client Images (ubuntu-24.04-arm, linux/arm64)
  • GitHub Check: Build Client Images (ubuntu-latest, linux/amd64)

Comment on lines 127 to 129
this.logger.error(error, `Error processing pageview queue, re-adding ${batch.length} items to the start of the queue`);
// Re-queue the failed batch so it can be retried on the next interval
this.queue.unshift(...batch);
coderabbitai bot (Contributor) commented Nov 18, 2025

⚠️ Potential issue | 🟠 Major

Infinite retry loop and potential memory exhaustion.

The retry mechanism lacks safeguards for persistent failures:

  1. Unbounded retries: If ClickHouse insertion fails repeatedly (e.g., network issues, schema mismatch, or corrupt data in the batch), the same batch will retry indefinitely every 10 seconds.
  2. Memory leak: During prolonged ClickHouse outages, new pageviews continuously arrive while failed batches are re-queued, causing the queue to grow without bounds.
  3. No failure isolation: A single corrupt item in the batch causes the entire batch to fail and retry forever.

Consider implementing:

  • Retry limit: Track retry attempts per batch (e.g., max 3 retries) and move permanently failed batches to a dead-letter queue or log them for investigation.
  • Exponential backoff: Delay subsequent retries to reduce pressure on ClickHouse (e.g., 10s, 30s, 60s).
  • Queue size limit: Implement a maximum queue size with alerting when approaching capacity.
  • Item-level error handling: Consider catching per-item validation errors if ClickHouse supports partial batch inserts.

Example approach for retry tracking:

type TotalPayloadWithRetry = TotalPayload & {
  retryCount?: number;
};

// In the error handler:
const batchWithRetry = batch.map(item => ({
  ...item,
  retryCount: (item.retryCount || 0) + 1
}));

const itemsToRequeue = batchWithRetry.filter(item => (item.retryCount || 0) < 3);
const failedItems = batchWithRetry.filter(item => (item.retryCount || 0) >= 3);

if (itemsToRequeue.length > 0) {
  this.queue.unshift(...itemsToRequeue);
}
if (failedItems.length > 0) {
  this.logger.error({ count: failedItems.length }, "Dropping items after max retries");
  // Optionally: write to dead-letter queue/file for investigation
}
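The "queue size limit" suggestion above can likewise be sketched; MAX_QUEUE_SIZE and requeueWithCap are hypothetical names for illustration, not part of the PR:

```typescript
const MAX_QUEUE_SIZE = 100_000; // assumed cap; tune for expected traffic

// Re-queue a failed batch at the front, shedding overflow past the cap.
function requeueWithCap<T>(queue: T[], failedBatch: T[]): { queue: T[]; dropped: number } {
  // Failed items go back to the front so they are retried first.
  const combined = [...failedBatch, ...queue];
  const dropped = Math.max(0, combined.length - MAX_QUEUE_SIZE);
  // Drop items from the tail once the cap is exceeded, and report the count
  // so it can be logged or alerted on.
  return { queue: combined.slice(0, MAX_QUEUE_SIZE), dropped };
}
```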

rockinrimmer (Contributor, Author) commented

@goldflag I can add handling so we don't retry forever, if you think it's worth it?

coderabbitai bot (Contributor) commented

Seems like the humans are having a chat. I'll hop back into my burrow for now. If you need me again, just tag @coderabbitai in a new comment, and I'll come hopping out!

Collaborator commented

Yeah, maybe we can retry 3 times. What were the issues you were seeing, though? I wouldn't expect ClickHouse to be flaky in this way.

rockinrimmer (Contributor, Author) commented

Sure, can do.

The only issue was during a restart of ClickHouse. For example, I added some resource limits to the k8s pod, which recreated the pod, and ClickHouse was down for a short while.

rockinrimmer (Contributor, Author) commented

Any more changes required for this?
