
refactor(api,runner): make snapshot pull/build async and poll for result#3798

Open
MDzaja wants to merge 2 commits into main from refactor/async-snapshot-pull


@MDzaja MDzaja commented Feb 17, 2026

Description

Snapshot pull and build operations on the runner were synchronous: the API blocked until the full Docker operation completed before it received a response. These operations can exceed the 1-hour Axios timeout, causing failures for large snapshots.

Solution

Runner-side POST /snapshots/pull and POST /snapshots/build endpoints now launch the Docker operation in a background goroutine and return 202 Accepted immediately. Failed operations store the error reason in an in-memory concurrent map (cmap) on the runner instance.

The API polls for completion by calling getSnapshotInfo which returns:

  • 200 with snapshot details when the image exists
  • 422 with the error reason if the last operation failed
  • 404 if the image doesn't exist yet (still in progress)
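
The three outcomes above can be sketched as a small classification step. This is an illustrative sketch only: `PollOutcome` and `classifySnapshotInfoStatus` are hypothetical names that do not appear in the PR, and only the status codes (200, 422, 404) come from the description.

```typescript
// Hypothetical helper: maps the HTTP status of a getSnapshotInfo call
// to the three polling outcomes listed in the PR description.
type PollOutcome =
  | { kind: 'ready' } // 200: the image exists
  | { kind: 'failed'; reason: string } // 422: the last operation failed
  | { kind: 'pending' } // 404: the image does not exist yet

function classifySnapshotInfoStatus(status: number, message?: string): PollOutcome {
  switch (status) {
    case 200:
      return { kind: 'ready' }
    case 422:
      // The error reason is assumed to arrive as the response message.
      return { kind: 'failed', reason: message ?? 'unknown error' }
    case 404:
      return { kind: 'pending' }
    default:
      // Any other status is treated as unexpected in this sketch.
      throw new Error(`unexpected status ${status} from getSnapshotInfo`)
  }
}
```

One consequence of this scheme is that 404 is ambiguous on its own: it means "still in progress" only while a pull or build is known to have been requested, which is presumably why the caller bounds the polling with a timeout.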

Changes

  • Runner (snapshot.go): PullSnapshot and BuildSnapshot run asynchronously via goroutines with context.Background(). Errors are stored in runner.SnapshotErrors. GetSnapshotInfo returns 422 when a stored error is found.
  • Runner (runner.go): Added SnapshotErrors field using cmap.ConcurrentMap[string, string] for thread-safe error storage.
  • Common errors (http.go, middleware.go): Added UnprocessableEntityError type with 422 status handling in the error middleware.
  • API interceptor (runnerAdapter.v0.ts): Axios error interceptor now throws RunnerApiError preserving statusCode and code, enabling callers to inspect HTTP status.
  • API adapters (runnerAdapter.v0.ts, runnerAdapter.v2.ts): getSnapshotInfo converts 422 responses into SnapshotStateError. V2 adapter throws SnapshotStateError for failed jobs.
  • Snapshot manager (snapshot.manager.ts): Polling handlers use getSnapshotInfo instead of snapshotExists to detect both completion and failure. SnapshotStateError sets runner state to ERROR with the reason.
  • Sandbox start (sandbox-start.action.ts): Pull/build polling uses a 1-hour timeout with 5-second intervals instead of 10 retries with ECONNRESET checks.
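
A minimal sketch of how the API-side pieces listed above might fit together. The class names `RunnerApiError` and `SnapshotStateError` come from the PR, but their fields beyond `statusCode` and `code`, the `waitForSnapshot` helper, and the `PollOptions` shape are assumptions made for illustration. The sketch also assumes that a 404 or 422 from the runner surfaces as a thrown `RunnerApiError`.

```typescript
// Sketch only: class shapes and helper names are assumptions, not the PR's code.
class RunnerApiError extends Error {
  readonly statusCode?: number
  readonly code?: string
  constructor(message: string, statusCode?: number, code?: string) {
    super(message)
    // Keeps instanceof working when compiled to older targets.
    Object.setPrototypeOf(this, new.target.prototype)
    this.name = 'RunnerApiError'
    this.statusCode = statusCode
    this.code = code
  }
}

class SnapshotStateError extends Error {
  readonly errorReason: string
  constructor(errorReason: string) {
    super(errorReason)
    Object.setPrototypeOf(this, new.target.prototype)
    this.name = 'SnapshotStateError'
    this.errorReason = errorReason
  }
}

interface PollOptions {
  timeoutMs: number
  intervalMs: number
  sleep?: (ms: number) => Promise<void> // injectable for tests
  now?: () => number // injectable clock for tests
}

// Polls getSnapshotInfo until it resolves (image exists), converts a 422
// into SnapshotStateError, and keeps waiting while the runner answers 404.
async function waitForSnapshot<T>(getSnapshotInfo: () => Promise<T>, opts: PollOptions): Promise<T> {
  const sleep = opts.sleep ?? ((ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms)))
  const now = opts.now ?? (() => Date.now())
  const deadline = now() + opts.timeoutMs
  while (true) {
    try {
      return await getSnapshotInfo()
    } catch (err) {
      if (err instanceof RunnerApiError && err.statusCode === 422) {
        throw new SnapshotStateError(err.message)
      }
      if (!(err instanceof RunnerApiError && err.statusCode === 404)) {
        throw err // anything other than "not there yet" is not retried here
      }
    }
    if (now() + opts.intervalMs > deadline) {
      throw new Error(`snapshot not ready after ${opts.timeoutMs} ms`)
    }
    await sleep(opts.intervalMs)
  }
}
```

With the values from the description, a caller would pass `{ timeoutMs: 60 * 60 * 1000, intervalMs: 5000 }`. Injecting `sleep` and `now` is a design choice of this sketch so the loop can be exercised without real delays.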

Copilot AI review requested due to automatic review settings February 17, 2026 08:27

Copilot AI left a comment


Pull request overview

This PR refactors snapshot pull and build operations to be asynchronous, preventing timeout failures for large Docker images that exceed the 1-hour Axios timeout. The runner now returns HTTP 202 Accepted immediately and processes operations in background goroutines, while the API polls for completion using getSnapshotInfo.

Changes:

  • Runner-side PullSnapshot and BuildSnapshot endpoints launch Docker operations asynchronously and return 202 immediately, storing errors in an in-memory concurrent map
  • API polling logic uses getSnapshotInfo to detect completion (200), failure (422 with error reason), or in-progress (404) states
  • Sandbox start action implements 1-hour polling with 5-second intervals instead of 10 retries with connection reset checks

Reviewed changes

Copilot reviewed 14 out of 14 changed files in this pull request and generated 6 comments.

| File | Description |
| --- | --- |
| apps/runner/pkg/runner/runner.go | Added SnapshotErrors concurrent map field to store async operation errors |
| libs/common-go/pkg/errors/http.go | Added UnprocessableEntityError type for 422 status code handling |
| libs/common-go/pkg/errors/middleware.go | Added middleware case for UnprocessableEntityError with 422 response |
| apps/runner/pkg/api/controllers/snapshot.go | Made PullSnapshot and BuildSnapshot async with goroutines; GetSnapshotInfo returns 422 for failed operations; error cleanup on snapshot removal |
| apps/runner/pkg/api/docs/swagger.yaml | Updated API documentation to reflect async behavior and 202 responses |
| apps/runner/pkg/api/docs/swagger.json | Updated API documentation (JSON format) |
| apps/runner/pkg/api/docs/docs.go | Updated embedded API documentation |
| apps/api/src/sandbox/errors/snapshot-state-error.ts | New error class for snapshot operation failures |
| apps/api/src/sandbox/errors/runner-api-error.ts | New error class preserving HTTP status codes from runner API |
| apps/api/src/sandbox/runner-adapter/runnerAdapter.v0.ts | Added error interceptor to preserve status codes; getSnapshotInfo converts 422 to SnapshotStateError |
| apps/api/src/sandbox/runner-adapter/runnerAdapter.v2.ts | Throws SnapshotStateError for failed jobs with error message |
| apps/api/src/sandbox/managers/snapshot.manager.ts | Replaced snapshotExists with getSnapshotInfo for polling; handles SnapshotStateError by setting runner state to ERROR; removed retry logic for ECONNRESET |
| apps/api/src/sandbox/managers/sandbox-actions/sandbox-start.action.ts | Implements 1-hour polling with 5-second intervals for pull/build operations |
| libs/runner-api-client/src/api/snapshots-api.ts | Updated generated documentation for async endpoints |


…use runner context for pull/build goroutines for graceful shutdown

Signed-off-by: MDzaja <mirkodzaja0@gmail.com>