
Fix race condition causing bookmark queue items to accumulate during rapid succession resumes #7225

Open
Copilot wants to merge 4 commits into release/3.6.0 from copilot/fix-bookmarks-processing-issue

Conversation

Contributor

Copilot AI commented Jan 30, 2026

When multiple /tasks/{taskid}/complete requests arrive in rapid succession, duplicate queue items are created for the same bookmark. The original implementation deleted queue items only when responses.Count > 0, so the second queue item remained stuck after the bookmark had already been processed.

Changes

Modified BookmarkQueueProcessor.ProcessItemAsync()

  • Move queue item deletion to a finally block so it always executes, regardless of bookmark presence or exceptions
  • Use CancellationToken.None for deletion to ensure cleanup during shutdown
  • Add explicit exception logging for workflow resumption failures

Modified BookmarkQueueProcessor.ProcessPageAsync()

  • Add try-catch wrapper to prevent one failing item from blocking batch processing

Before/After

// Before: queue item only deleted if bookmark found
var responses = await workflowResumer.ResumeAsync(filter, options, cancellationToken);
if (responses.Count > 0) {
    await store.DeleteAsync(item.Id, cancellationToken);
}

// After: always deleted, exception-safe
try {
    var responses = await workflowResumer.ResumeAsync(filter, options, cancellationToken);
    // ... logging ...
}
catch (Exception ex) {
    logger.LogError(ex, "Error resuming workflow...");
}
finally {
    try {
        await store.DeleteAsync(item.Id, CancellationToken.None);
    }
    catch (Exception ex) {
        logger.LogWarning(ex, "Failed to delete queue item...");
    }
}

Safety

The distributed lock in WorkflowResumer.ResumeAsync() ensures bookmarks are only processed once. Queue items are ephemeral markers, so they are safe to delete after a processing attempt. The existing 1-minute TTL purge task serves as a fallback for any deletion failures.

Original prompt

This section details the original issue you should resolve

<issue_title>Bookmarks remain stuck in database when resuming multiple workflows in rapid succession</issue_title>
<issue_description>## Description
When attempting to resume multiple bookmarks in rapid succession, several records remain stuck in the Bookmarks table and are never processed, even though the API calls to /tasks/{taskid}/complete return a successful (200 OK) response.

Steps to Reproduce

  1. Setup & Execution

    • Open the attached Visual Studio project.
    • Configure your connection string in the appsettings.Development.json file of the Server project.
    • Start the Server project.
    • Log into Elsa Studio using default credentials and import the attached Workflow JSON.
    • Publish the imported workflow.
    • Configure your connection string in the appsettings.json of the ConsoleTestApp project.
    • Build the ConsoleTestApp project.
    • Ensure the Bookmarks table is currently empty.
    • Open a command prompt (CMD) in the directory containing Sample.ElsaWorkflow.ConsoleTestApp.exe.
    • Execute the following command to initialize workflows:
      Sample.ElsaWorkflow.ConsoleTestApp.exe run 200
    • Wait for completion and verify that 200 records exist in the Bookmarks table.
    • Execute the following command to resume the workflows:
      Sample.ElsaWorkflow.ConsoleTestApp.exe resume 200
    • Wait for the resume process to finish.
  2. Observation

    • Upon completion, several records remain "hanging" in the Bookmarks table.
    • No errors are reported in the server logs.

Expected Behavior

All bookmarks should be successfully resumed, and the corresponding workflows should proceed to completion.

Actual Behavior

A subset of bookmarks remains stuck in the database, leaving the workflows in a permanent Suspended state.

Environment

  • Elsa Package Version: 3.5.3 and 3.6.0-rc2
  • Operating System: Windows 11, Windows Server 2016, Windows Server 2019.
  • Database Engine: SQL Server 15 Express, SQL Server 14 Developer.
  • Dispatcher: Default Workflow Dispatcher.

test-input-parent.json
Sample.ElsaWorkflow.zip
</issue_description>

Comments on the Issue (you are @copilot in this section)



Copilot AI and others added 3 commits January 30, 2026 16:36
This change ensures that bookmark queue items are always deleted after processing,
regardless of whether matching bookmarks were found. This prevents duplicate queue
items from accumulating in the database when the same bookmark is queued multiple
times in rapid succession.

The distributed lock in WorkflowResumer ensures that the actual bookmark is only
processed once, so it's safe to always delete the queue item.

Fixes race condition where:
1. Multiple rapid requests create duplicate queue items
2. First item processes and deletes the bookmark
3. Second item finds no bookmark but was NOT deleted
4. Queue items remain stuck in database indefinitely

Co-authored-by: sfmskywalker <938393+sfmskywalker@users.noreply.github.com>
Enhanced the fix to ensure queue items are always deleted even if an exception
occurs during workflow resumption. The deletion is now in a finally block with
its own try-catch to handle any deletion failures gracefully.

This addresses the code review feedback and ensures robustness in all scenarios:
- Normal processing (bookmark found or not)
- Exception during workflow resumption (timeout, network error, etc.)
- Exception during deletion itself (database connectivity issues)

Co-authored-by: sfmskywalker <938393+sfmskywalker@users.noreply.github.com>
Enhanced the fix based on code review feedback:
1. Added error handling at ProcessPageAsync level to ensure one failing item
   doesn't block processing of other items in the batch
2. Added explicit exception logging in ProcessItemAsync to surface errors
   during workflow resumption
3. Changed deletion to use CancellationToken.None to ensure cleanup happens
   even during application shutdown
4. Added detailed comments explaining error handling strategy

These improvements ensure the system is robust and provides visibility into
any issues while still guaranteeing queue items are cleaned up properly.

Co-authored-by: sfmskywalker <938393+sfmskywalker@users.noreply.github.com>
Copilot AI changed the title [WIP] Fix bookmarks remaining stuck in database during workflow resumption Fix race condition causing bookmark queue items to accumulate during rapid succession resumes Jan 30, 2026
Copilot AI requested a review from sfmskywalker January 30, 2026 16:48
@sfmskywalker sfmskywalker marked this pull request as ready for review February 2, 2026 10:03
Copilot AI review requested due to automatic review settings February 2, 2026 10:03
Contributor

Copilot AI left a comment


Pull request overview

Addresses a race condition where rapid successive resume requests can leave duplicate bookmark queue items stuck in the queue, leading to accumulation and unprocessed workflows.

Changes:

  • Wrap per-item processing in ProcessPageAsync with error handling to keep batch processing moving.
  • Update ProcessItemAsync to always delete the queue item via a finally block and add logging for resume failures.

Comment on lines +33 to +43
foreach (var bookmarkQueueItem in page.Items)
{
    try
    {
        await ProcessItemAsync(bookmarkQueueItem, cancellationToken);
    }
    catch (Exception ex)
    {
        // Log the error but continue processing other items in the batch
        logger.LogError(ex, "Error processing bookmark queue item {BookmarkQueueItemId}. Continuing with next item.", bookmarkQueueItem.Id);
    }

Copilot AI Feb 2, 2026


ProcessPageAsync catches all exceptions and continues, which will also catch OperationCanceledException if cancellation is requested (e.g., during shutdown) and ProcessItemAsync starts propagating it. This can lead to noisy error logs and the processor continuing work even though cancellation was requested. Handle OperationCanceledException separately (rethrow) and/or break out of the loop when cancellationToken.IsCancellationRequested is true.
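A minimal sketch of the suggested shape (hypothetical, not the merged code; it assumes the same `page`, `logger`, and `cancellationToken` as the snippet above):

```csharp
foreach (var bookmarkQueueItem in page.Items)
{
    try
    {
        await ProcessItemAsync(bookmarkQueueItem, cancellationToken);
    }
    catch (OperationCanceledException) when (cancellationToken.IsCancellationRequested)
    {
        // Shutdown was requested: rethrow so cancellation propagates instead of being logged as an error.
        throw;
    }
    catch (Exception ex)
    {
        // Unexpected failure: log and continue with the next item in the batch.
        logger.LogError(ex, "Error processing bookmark queue item {BookmarkQueueItemId}. Continuing with next item.", bookmarkQueueItem.Id);
    }
}
```

The `when` filter lets genuine cancellation flow out of the loop while an OperationCanceledException from an unrelated inner token is still logged like any other failure.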

Comment on lines +67 to 70
catch (Exception ex)
{
    logger.LogError(ex, "Error resuming workflow for bookmark queue item {BookmarkQueueItemId} for workflow instance {WorkflowInstanceId}. The queue item will still be deleted to prevent accumulation.", item.Id, item.WorkflowInstanceId);
}

Copilot AI Feb 2, 2026


The catch (Exception ex) here will also catch OperationCanceledException when cancellationToken is canceled, log it as an error, and then continue processing more items. This makes shutdown/cancellation noisy and prevents normal cancellation flow. Add an OperationCanceledException catch/when-filter that rethrows when cancellationToken.IsCancellationRequested, and only log unexpected exceptions.

Comment on lines +71 to +80
finally
{
    // Always delete the queue item after processing, regardless of whether bookmarks were found or an exception occurred.
    // This prevents duplicate queue items from accumulating when the same bookmark is queued multiple times in rapid succession.
    // The distributed lock in WorkflowResumer ensures that the actual bookmark is only processed once.
    // Use CancellationToken.None to ensure cleanup happens even during application shutdown.
    try
    {
        await store.DeleteAsync(item.Id, CancellationToken.None);
    }

Copilot AI Feb 2, 2026


Because queue items are now always deleted, the offset-based pagination in ProcessAsync can skip items: after deleting the first page, incrementing offset causes the next query to skip records in the remaining set, potentially leaving items unprocessed until the next trigger (or even until they are purged by TTL). Consider switching the loop to always read from offset 0 until no items remain, or use cursor-based paging (e.g., last CreatedAt/Id) instead of offsets when deleting during iteration.
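The first suggestion can be sketched by re-reading from offset 0 until the queue drains; names such as `FindManyAsync`, `PageArgs.FromRange`, `filter`, and `batchSize` are illustrative assumptions, not necessarily the actual Elsa API:

```csharp
// Because each processed item is now deleted, the remaining set shifts down.
// Keeping the offset at 0 and looping until an empty page avoids skipping records.
while (!cancellationToken.IsCancellationRequested)
{
    var page = await store.FindManyAsync(filter, PageArgs.FromRange(0, batchSize), cancellationToken);
    if (page.Items.Count == 0)
        break; // queue drained

    foreach (var item in page.Items)
        await ProcessItemAsync(item, cancellationToken);
    // Note: the offset is intentionally NOT incremented.
}
```

Cursor-based paging keyed on (CreatedAt, Id) would achieve the same result without re-querying rows that were just deleted.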

Comment on lines +76 to +80
// Use CancellationToken.None to ensure cleanup happens even during application shutdown.
try
{
    await store.DeleteAsync(item.Id, CancellationToken.None);
}

Copilot AI Feb 2, 2026


Using CancellationToken.None for the delete means this await can block indefinitely if the underlying store call hangs (e.g., DB connectivity issues), which is especially risky during shutdown. Prefer using a bounded token (e.g., a short CancelAfter timeout) so cleanup is attempted but can't stall the processor forever.
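The bounded-token idea can be sketched like this (the 10-second timeout is an illustrative choice, not a value from the PR):

```csharp
finally
{
    // Attempt cleanup with a bounded timeout rather than CancellationToken.None,
    // so a hung store call cannot stall the processor indefinitely during shutdown.
    using var deleteCts = new CancellationTokenSource(TimeSpan.FromSeconds(10));
    try
    {
        await store.DeleteAsync(item.Id, deleteCts.Token);
    }
    catch (Exception ex)
    {
        // Includes OperationCanceledException when the timeout fires;
        // the 1-minute TTL purge task remains the fallback for missed deletions.
        logger.LogWarning(ex, "Failed to delete bookmark queue item {BookmarkQueueItemId} within the cleanup timeout.", item.Id);
    }
}
```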
