
feat: unified runner and healthchecks #4920

Merged
chronark merged 8 commits into main from runner on Feb 5, 2026
Conversation

@chronark chronark (Collaborator) commented Feb 3, 2026

Deduplicates a lot of startup logic and makes it easy to use proper probes.

This needs to be updated in the infra repo too before rolling out; otherwise healthchecks will fail.

@vercel vercel bot commented Feb 3, 2026

The latest updates on your projects.

Project     Deployment   Actions            Updated (UTC)
dashboard   Ready        Preview, Comment   Feb 5, 2026 10:17am

1 Skipped Deployment
Project     Deployment   Actions   Updated (UTC)
engineering Ignored      Preview   Feb 5, 2026 10:17am


@coderabbitai coderabbitai bot commented Feb 3, 2026

Caution: Review failed. The pull request is closed.

📝 Walkthrough

Replaces the shutdown package with a runner-based lifecycle (New(logger), Go, Defer/DeferCtx, Recover, Wait), adds in-process health endpoints (/health/live, /health/ready, /health/startup) with parallel readiness checks, changes InitGrafana to return a shutdown func, and standardizes Kubernetes probe paths.
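
Pieced together from the walkthrough and the review comments below, a migrated service entrypoint presumably looks roughly like the following sketch. Signatures, import paths, and the port are inferred from this page rather than verified against the repo:

package service // illustrative placement

import (
	"context"
	"errors"
	"net/http"
	"time"

	"github.com/unkeyed/unkey/pkg/otel/logging"
	"github.com/unkeyed/unkey/pkg/runner"
)

func run(ctx context.Context) error {
	r := runner.New(logging.NewNoop()) // tests use a noop logger; services pass their own
	defer r.Recover()                  // convert panics in this goroutine into runner errors

	mux := http.NewServeMux()
	r.RegisterHealth(mux) // mounts /health/live, /health/ready, /health/startup

	srv := &http.Server{Addr: ":8080", Handler: mux} // port is illustrative
	r.Go(func(ctx context.Context) error {
		if err := srv.ListenAndServe(); err != nil && !errors.Is(err, http.ErrServerClosed) {
			return err
		}
		return nil
	})
	r.DeferCtx(srv.Shutdown) // deferred cleanups run LIFO during shutdown

	// Wait blocks until an OS signal or a task error, then runs cleanups.
	return r.Wait(ctx, runner.WithTimeout(time.Minute))
}

Per the note in the sequence diagram below, deferred cleanups run LIFO, so the last-registered cleanup (typically the server's Shutdown) runs first.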

Changes

Cohort / File(s) | Summary
Runner core & health
pkg/runner/runner.go, pkg/runner/health.go
Adds Runner state machine, New(logger), Wait(...) (replaces Run), Go/Defer/DeferCtx/Recover, internal healthState and HTTP endpoints /health/live, /health/ready, /health/startup with parallel readiness checks and per-check timeouts.
Runner build & tests
pkg/runner/BUILD.bazel, pkg/runner/*.go, pkg/runner/*_test.go
Adds health.go to build, adds //pkg/otel/logging dependency, updates tests to construct Runner with a noop logger and use Wait semantics; adds comprehensive health tests (health_test.go).
Shutdown package removal
pkg/shutdown/BUILD.bazel, pkg/shutdown/doc.go, pkg/shutdown/shutdown.go, pkg/shutdown/shutdown_test.go
Removes the entire pkg/shutdown package (BUILD, docs, implementation, tests). Consumers migrated to runner APIs.
OTel / Grafana API change
pkg/otel/grafana.go, pkg/otel/BUILD.bazel
InitGrafana signature changed to return (func(ctx context.Context) error, error); shutdown manager usage removed and BUILD deps drop //pkg/shutdown.
Service migrations to Runner
svc/*/run.go, e.g. svc/api/run.go, svc/ctrl/api/run.go, svc/frontline/run.go, svc/krane/run.go, svc/preflight/run.go, svc/sentinel/run.go, svc/vault/run.go, svc/ctrl/worker/run.go
Replaces shutdown manager usage with runner: create Runner, use r.Recover(), r.Go, r.Defer/r.DeferCtx, r.RegisterHealth, r.Wait; moves servers and Prometheus under runner tasks and tightens error propagation.
BUILD dependency updates
svc/*/BUILD.bazel, pkg/otel/BUILD.bazel, pkg/runner/BUILD.bazel
Replaces many //pkg/shutdown deps with //pkg/runner, adds //pkg/otel, //pkg/version, and logging deps where required.
Preflight health route removal
svc/preflight/routes/healthz/handler.go, svc/preflight/routes/healthz/BUILD.bazel
Removes legacy /healthz handler and BUILD target; health endpoints now provided via Runner.RegisterHealth.
Kubernetes manifests
dev/k8s/manifests/*.yaml, dev/k8s/manifests/...
Standardizes liveness/readiness probe paths to /health/live and /health/ready, adds probes where missing (e.g., ctrl-api, krane), and adjusts frontline initContainer command formatting.
Service config & CLI flags
cmd/*/main.go, svc/*/config.go, cmd/ctrl/worker.go, cmd/vault/main.go
Adds observability and region flags/fields (OtelEnabled, OtelTraceSamplingRate, Region) to service configs and CLI wiring.
Docs & frontend
web/apps/engineering/content/docs/architecture/services/healthchecks.mdx, web/apps/engineering/content/docs/architecture/services/meta.json, web/apps/dashboard/components/virtual-table/index.tsx
Adds healthchecks documentation and index update; TSX refactor introduces VirtualTableRef, layout helper, and rendering adjustments.
Tests & harness updates
svc/ctrl/api/harness_test.go, various runner tests
Updates tests to new health endpoints and Runner constructor/Wait semantics; adds new runner/health tests and adjusts harness expectations.

Sequence Diagram

sequenceDiagram
    participant Service as Service
    participant Runner as Runner
    participant Checks as ReadinessChecks
    participant Client as HTTP Client

    Service->>Runner: New(logger)
    Service->>Runner: RegisterHealth(mux)
    Service->>Runner: Go(startServerTask)
    Service->>Runner: DeferCtx(server.Shutdown)
    Service->>Runner: Defer(db.Close)

    Client->>Runner: GET /health/live
    Runner-->>Client: 200 (started) or 503

    Client->>Runner: GET /health/ready
    Runner->>Checks: runChecks(ctx) (parallel, per-check timeouts)
    Checks-->>Runner: results
    Runner-->>Client: 200 if all pass, else 503

    Service->>Runner: Wait(ctx, WithTimeout(...))
    Note right of Runner: on cancellation → mark shuttingDown, run deferred cleanups (LIFO), wait for tasks
    Runner-->>Service: shutdown complete / aggregated error
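In sketch form, the readiness leg of this diagram is a per-check fan-out with individual timeouts. The identifier names (ReadinessCheck, checkResult, resultCh) are borrowed from review comments further down this page; the actual pkg/runner/health.go implementation may differ:

package runner // sketch; real code lives in pkg/runner/health.go

import (
	"context"
	"time"
)

type ReadinessCheck func(ctx context.Context) error

type checkResult struct {
	name string
	err  error
}

func runChecks(ctx context.Context, timeout time.Duration, checks map[string]ReadinessCheck) map[string]error {
	resultCh := make(chan checkResult, len(checks))
	for name, check := range checks {
		go func(name string, check ReadinessCheck) {
			checkCtx, cancel := context.WithTimeout(ctx, timeout) // per-check timeout
			defer cancel()
			resultCh <- checkResult{name: name, err: check(checkCtx)}
		}(name, check)
	}
	results := make(map[string]error, len(checks))
	for range checks {
		res := <-resultCh
		results[res.name] = res.err // any non-nil error turns /health/ready into a 503
	}
	return results
}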

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120 minutes

🚥 Pre-merge checks | ✅ 1 passed | ❌ 2 failed

❌ Failed checks (2 warnings)

Description check ⚠️ Warning: The PR description is minimal and lacks required sections; it does not fill out the template's 'What does this PR do?', 'Type of change', 'How should this be tested?', or checklist sections. Resolution: complete the description using the template, add an issue link, select the type of change, describe testing, and tick the checklist.

Docstring Coverage ⚠️ Warning: Docstring coverage is 66.00%, below the required 80.00% threshold. Resolution: write docstrings for the functions missing them.

✅ Passed checks (1 passed)

Title check ✅ Passed: The title 'feat: unified runner and healthchecks' clearly and concisely summarizes the main changes (a unified runner abstraction and healthchecks system) and matches the substantial refactoring across multiple service files.


@chronark chronark (Collaborator, Author) commented Feb 4, 2026

@coderabbitai review

@coderabbitai coderabbitai bot commented Feb 4, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 6

Caution: Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
dev/k8s/manifests/preflight.yaml (1)

126-139: ⚠️ Potential issue | 🔴 Critical

The probe paths do not match the service handlers—critical deployment issue.

The manifest probes for /health/live and /health/ready will fail because the preflight service still only exposes /healthz. The routes directory contains only healthz and mutate handlers. The HTTP health check handlers must be added to match the new probe paths, or the pod will continuously fail liveness/readiness checks and restart.

Fix required: Add /health/live and /health/ready handlers to svc/preflight/routes/ and register them in svc/preflight/run.go.

pkg/otel/grafana.go (1)

68-90: ⚠️ Potential issue | 🟡 Minor

Docs still show the removed shutdowns API.

✏️ Update example
-// The function registers all necessary shutdown handlers with the provided shutdowns instance.
+// The function registers all necessary shutdown handlers with the provided runner.
@@
-//	shutdowns := shutdown.New()
-//	err := otel.InitGrafana(ctx, otel.Config{
+//	r := runner.New(logging.NewNoop())
+//	err := otel.InitGrafana(ctx, otel.Config{
 //	    GrafanaEndpoint: "https://otlp-sentinel-prod-us-east-0.grafana.net/otlp",
 //	    Application:     "unkey-api",
 //	    Version:         version.Version,
-//	}, shutdowns)
+//	}, r)
@@
-//	errs := shutdowns.Shutdown(ctx)
-//	for _, err := range errs {
-//	    log.Printf("Shutdown error: %v", err)
-//	}
+//	_ = r.Wait(ctx)
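Note that the suggested snippet above still passes the runner into InitGrafana; per the walkthrough, the new signature instead returns a cleanup func (a later review comment on this file flags the same staleness). Under that signature, the call site presumably becomes something like:

shutdown, err := otel.InitGrafana(ctx, otel.Config{
	GrafanaEndpoint: "https://otlp-sentinel-prod-us-east-0.grafana.net/otlp",
	Application:     "unkey-api",
	Version:         version.Version,
})
if err != nil {
	return err
}
r.DeferCtx(shutdown) // flushed during r.Wait's shutdown phase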
🤖 Fix all issues with AI agents
In `@pkg/runner/health.go`:
- Around line 177-198: The function Runner.runChecks reads r.health.checkTimeout
without synchronization which can race with writers (e.g., Runner.Wait); add a
mutex to the health container (e.g., sync.RWMutex on the health struct) and use
a read lock when accessing checkTimeout in runChecks (acquire RLock, copy
timeout, RUnlock) and use the write lock (Lock/Unlock) wherever checkTimeout is
updated. Update the health struct and the writers that modify checkTimeout to
hold the write lock to eliminate the race.
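In sketch form, the locking this asks for (struct abbreviated, names taken from the comment; assumes sync and time are imported):

type healthState struct {
	mu           sync.RWMutex
	checkTimeout time.Duration
	// ...other fields elided
}

func (h *healthState) getCheckTimeout() time.Duration {
	h.mu.RLock()
	timeout := h.checkTimeout // copy under read lock, as runChecks should
	h.mu.RUnlock()
	return timeout
}

func (h *healthState) setCheckTimeout(d time.Duration) {
	h.mu.Lock() // writers such as Runner.Wait take the write lock
	h.checkTimeout = d
	h.mu.Unlock()
}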

In `@pkg/runner/runner.go`:
- Around line 36-39: The struct comment incorrectly references Run() but the
method is named Wait(); update the doc on Runner to mention Wait() (and describe
that Wait() blocks until OS signals or errors and then triggers cleanup), i.e.,
change the text "Run() blocks until OS signals or errors, then triggers cleanup"
to "Wait() blocks until OS signals or errors, then triggers cleanup" and ensure
any other occurrences in the Runner comment or nearby docs refer to Wait rather
than Run to keep documentation consistent with the Runner.Wait method.
- Around line 221-223: The variables 'cause' and 'reason' are pre-initialized
but unused; remove their unnecessary initializations in runner.go and instead
declare them at the point of first use (or use short declaration := when setting
them). Update references to 'cause' and 'reason' so they are defined where
assigned (e.g., inside the select/case blocks or immediately before they are
needed) to satisfy the linter and avoid unused init.

In `@svc/api/run.go`:
- Around line 340-347: The logger.Info call is placed after srv.Serve returns
(in the anonymous r.Go function), so "API server started successfully" is logged
on shutdown; move the success log to before calling srv.Serve (i.e., log
immediately before invoking srv.Serve(ctx, cfg.Listener) inside the same r.Go
goroutine) or change the message to reflect shutdown (e.g., "API server
stopped"); update the code around the anonymous function passed to r.Go, keeping
srv.Serve, cfg.Listener and logger.Info in mind.

In `@svc/frontline/run.go`:
- Around line 282-289: The HTTP server goroutine currently ignores only
context.Canceled when checking serveErr from httpSrv.Serve; update the error
check inside the r.Go anonymous function that calls httpSrv.Serve (referencing
httpSrv.Serve, httpListener and serveErr) to also ignore http.ErrServerClosed
(i.e., return an error only if serveErr != nil && !errors.Is(serveErr,
context.Canceled) && !errors.Is(serveErr, http.ErrServerClosed)), so the HTTP
shutdown path matches the HTTPS handling.
- Around line 247-254: The HTTPS goroutine currently treats any
non-context.Canceled error from httpsSrv.Serve as fatal; update the check inside
the r.Go block that wraps httpsSrv.Serve so it also treats http.ErrServerClosed
as non-fatal (i.e., ignore errors that satisfy errors.Is(err,
http.ErrServerClosed)) to avoid reporting expected shutdown errors triggered by
r.DeferCtx(httpsSrv.Shutdown); reference the r.Go anonymous function, serveErr
variable, httpsSrv.Serve call, and the context.Canceled and http.ErrServerClosed
sentinel errors when making the change.
🧹 Nitpick comments (7)
pkg/runner/concurrency_test.go (1)

151-156: Use require.Eventually; drop t.Fatal

Lines 151-156 use t.Fatal; switch to require.Eventually to keep assertions require-only and reduce flakiness. As per coding guidelines, use testify/require for assertions in tests.

Proposed change
-	select {
-	case <-started:
-		// Task started immediately - good
-	case <-time.After(100 * time.Millisecond):
-		t.Fatal("task did not start immediately")
-	}
+	require.Eventually(t, func() bool {
+		select {
+		case <-started:
+			return true
+		default:
+			return false
+		}
+	}, 100*time.Millisecond, 5*time.Millisecond, "task did not start immediately")
svc/api/run.go (2)

125-131: Consider also ignoring http.ErrServerClosed for Prometheus.

The API server (line 342) ignores both context.Canceled and http.ErrServerClosed, but Prometheus only ignores context.Canceled. For consistency and to avoid spurious errors during graceful shutdown, consider adding the same check.

Proposed fix
 r.Go(func(ctx context.Context) error {
     serveErr := prom.Serve(ctx, promListener)
-    if serveErr != nil && !errors.Is(serveErr, context.Canceled) {
+    if serveErr != nil && !errors.Is(serveErr, context.Canceled) && !errors.Is(serveErr, http.ErrServerClosed) {
         return fmt.Errorf("prometheus server failed: %w", serveErr)
     }
     return nil
 })

350-353: Inconsistent shutdown timeout across services.

This service uses runner.WithTimeout(time.Minute), but svc/frontline/run.go (line 295) and svc/ctrl/api/run.go (line 275) use r.Wait(ctx) without a timeout. Consider standardizing the shutdown timeout behavior across all services.
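
If standardizing on an explicit bound, each run.go would presumably end with something like:

// Give cleanups a uniform upper bound across services.
if err := r.Wait(ctx, runner.WithTimeout(time.Minute)); err != nil {
	return fmt.Errorf("shutdown failed: %w", err)
}
return nil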

svc/frontline/run.go (1)

97-103: Prometheus error handling consistency.

Same as other services—consider adding http.ErrServerClosed check for consistency.

Proposed fix
 r.Go(func(ctx context.Context) error {
     promListenErr := prom.Serve(ctx, promListener)
-    if promListenErr != nil && !errors.Is(promListenErr, context.Canceled) {
+    if promListenErr != nil && !errors.Is(promListenErr, context.Canceled) && !errors.Is(promListenErr, http.ErrServerClosed) {
         return fmt.Errorf("prometheus server failed: %w", promListenErr)
     }
     return nil
 })
svc/ctrl/api/run.go (3)

232-236: Inconsistent error comparison style.

Line 232 uses direct comparison err != http.ErrServerClosed instead of errors.Is(err, http.ErrServerClosed) used elsewhere. While functionally equivalent for this error, using errors.Is is more robust for wrapped errors and maintains consistency across services.

Proposed fix
-    if err != nil && err != http.ErrServerClosed {
+    if err != nil && !errors.Is(err, http.ErrServerClosed) {
         return fmt.Errorf("server failed: %w", err)
     }

247-248: Certificate bootstrap uses bare go instead of runner.

certBootstrap.run(ctx) is launched with a bare go goroutine, bypassing the runner's lifecycle management. This means:

  • Errors from bootstrap won't propagate to the runner
  • The goroutine may outlive graceful shutdown

Consider using r.Go for consistency, or document why this is intentionally fire-and-forget.

Proposed fix using runner
-    go certBootstrap.run(ctx)
+    r.Go(func(ctx context.Context) error {
+        certBootstrap.run(ctx)
+        return nil
+    })

264-270: Prometheus error handling consistency.

Same as other services—consider adding http.ErrServerClosed check for consistency.

Proposed fix
 r.Go(func(ctx context.Context) error {
     logger.Info("prometheus started", "port", cfg.PrometheusPort)
-    if serveErr := prom.Serve(ctx, ln); serveErr != nil && !errors.Is(serveErr, context.Canceled) {
+    if serveErr := prom.Serve(ctx, ln); serveErr != nil && !errors.Is(serveErr, context.Canceled) && !errors.Is(serveErr, http.ErrServerClosed) {
         return fmt.Errorf("failed to start prometheus server: %w", serveErr)
     }
     return nil
 })

@chronark chronark marked this pull request as ready for review February 4, 2026 09:28
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@pkg/runner/health.go`:
- Around line 184-191: The goroutine launched for each readiness check (the loop
over checks calling the anonymous function that uses ReadinessCheck) can panic
and then never send a checkResult to resultCh, causing /health/ready to hang;
wrap the body of that goroutine with a defer that recovers from panics and
converts the panic into an error value, and ensure that you always send a
checkResult{name: name, err: err} to resultCh in all cases (including after a
panic) while still canceling the check context via defer cancel(); update the
anonymous function around checkCtx, cancel, err := check(checkCtx) to guarantee
a resultCh send even on panic.
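
One way to satisfy this, mirroring the variable names used in the comment (a sketch of the goroutine body only; assumes fmt is imported and the surrounding fan-out loop provides name, check, timeout, and resultCh):

go func(name string, check ReadinessCheck) {
	checkCtx, cancel := context.WithTimeout(ctx, timeout)
	defer cancel()

	var err error
	defer func() {
		if rec := recover(); rec != nil {
			err = fmt.Errorf("readiness check %q panicked: %v", name, rec)
		}
		resultCh <- checkResult{name: name, err: err} // send happens even after a panic
	}()
	err = check(checkCtx)
}(name, check)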

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

Caution: Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
svc/preflight/run.go (1)

100-100: ⚠️ Potential issue | 🟡 Minor

Use r.Go() to integrate cleanup service into runner's lifecycle.

cleanupService.Start(ctx) spawned with raw go won't be tracked by the runner, so shutdown won't wait for it to complete. Use r.Go() for consistency with other background tasks.

Suggested fix
-	go cleanupService.Start(ctx)
+	r.Go(func(ctx context.Context) error {
+		cleanupService.Start(ctx)
+		return nil
+	})
🧹 Nitpick comments (1)
svc/krane/run.go (1)

198-199: Remove extra blank lines.

Two consecutive blank lines; one is sufficient.

Suggested fix
-	
-	
+
 	// Wait for signal and handle shutdown

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 3

Caution: Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (3)
pkg/otel/grafana.go (1)

58-90: ⚠️ Potential issue | 🟡 Minor

Docs still mention the shutdowns param; update them for the cleanup-func return.

web/apps/dashboard/components/virtual-table/index.tsx (2)

142-143: ⚠️ Potential issue | 🟡 Minor

Use column.key instead of array index as key in colgroup elements. Array indices are unstable as React keys and can cause reconciliation issues if columns are reordered. The column.key property is already used correctly in the thead section (line 148) and should be used consistently here as well.

🔧 Suggested patch
-            {columns.map((_, idx) => (
-              <col key={idx} style={{ width: colWidths[idx].width }} />
+            {columns.map((column, idx) => (
+              <col key={column.key} style={{ width: colWidths[idx].width }} />
             ))}

Also applies to: 195-196


55-60: ⚠️ Potential issue | 🟡 Minor

Replace any with unknown and specify ref type. This improves type safety per coding guidelines.

-// biome-ignore lint/suspicious/noExplicitAny: Safe to leave
-export const VirtualTable = forwardRef<VirtualTableRef, VirtualTableProps<any>>(
+export const VirtualTable = forwardRef<VirtualTableRef, VirtualTableProps<unknown>>(
   function VirtualTable<TTableData>(
     props: VirtualTableProps<TTableData>,
-    ref: Ref<unknown> | undefined,
+    ref: Ref<VirtualTableRef> | undefined,
   ) {
🤖 Fix all issues with AI agents
In `@cmd/krane/main.go`:
- Around line 97-105: The Config Clock field is set to nil but should use a real
clock; update the krane.Config initialization in main.go to set Clock to
clock.RealClock{} (not nil) so production code uses a real clock implementation
as required by the Clock docs and to avoid an illegal nil state in the Config
struct; locate the krane.Config literal and replace Clock: nil with Clock:
clock.RealClock{}.

In `@web/apps/dashboard/components/virtual-table/index.tsx`:
- Around line 316-318: Replace all unsafe type assertions of DOM nodes with
runtime instanceof guards: instead of using document.activeElement as
HTMLElement, check if (document.activeElement instanceof HTMLElement) and then
call blur() on it; instead of document.querySelector(...) as HTMLElement, store
the result in a variable (e.g., const el = document.querySelector(selector)) and
guard with if (el instanceof HTMLElement) before accessing HTMLElement-specific
properties or methods. Apply this pattern to every occurrence of the expressions
document.activeElement as HTMLElement and document.querySelector(...) as
HTMLElement in virtual-table/index.tsx so all six assertions are removed and
guarded with instanceof checks.
- Around line 85-90: The ref variables parentRef and containerRef are
redundantly asserted with "as React.RefObject<HTMLDivElement>" after
useRef<HTMLDivElement>(null); remove those type assertions and let
useRef<HTMLDivElement>(null) infer the RefObject type directly (update the
declarations of parentRef and containerRef to just use
useRef<HTMLDivElement>(null) without the "as" cast).
🧹 Nitpick comments (4)
svc/krane/config.go (1)

63-83: Validate OtelTraceSamplingRate range. Add a 0–1 bounds check when enabled. The sampling rate must fall within the valid range per OpenTelemetry specification.

🔧 Suggested patch
 import (
+	"fmt"
 	"github.com/unkeyed/unkey/pkg/clock"
 )
@@
 func (c Config) Validate() error {
+	if c.OtelEnabled {
+		if c.OtelTraceSamplingRate < 0 || c.OtelTraceSamplingRate > 1 {
+			return fmt.Errorf("otel trace sampling rate must be between 0 and 1")
+		}
+	}
 	return nil
 }
pkg/otel/grafana.go (1)

198-215: Collect all shutdown errors with errors.Join instead of stopping on first failure.

During graceful shutdown, it's best practice to attempt all cleanup steps and report all failures together. The current implementation returns immediately on the first error, preventing subsequent cleanups from running.

Replace the early return with error aggregation using errors.Join (available in Go 1.20+). Import the "errors" package and collect all shutdown errors before returning.

🔧 Suggested patch
 import (
 	"context"
+	"errors"
 	"fmt"
 	"time"
@@
 	return func(ctx context.Context) error {
+		var errs []error
 		for _, fn := range []func(context.Context) error{
 			meterProvider.Shutdown,
 			reader.Shutdown,
 			metricExporter.Shutdown,
 			traceProvider.Shutdown,
 			traceExporter.Shutdown,
 			logProvider.Shutdown,
 			processor.Shutdown,
 			logExporter.Shutdown,
 		} {
 			if err := fn(ctx); err != nil {
-				return err
+				errs = append(errs, err)
 			}
-
 		}
+		if len(errs) > 0 {
+			return errors.Join(errs...)
+		}
 		return nil
 
 	}, nil
 }
svc/vault/config.go (1)

18-19: Consider validating OtelTraceSamplingRate bounds.

Comment documents 0.0 to 1.0 range but Validate() doesn't enforce it. Values outside this range may cause unexpected behavior in OTel.

Proposed validation addition
 func (c Config) Validate() error {
+	if c.OtelEnabled {
+		if err := assert.All(
+			assert.GreaterOrEqual(c.OtelTraceSamplingRate, 0.0, "otelTraceSamplingRate must be >= 0.0"),
+			assert.LessOrEqual(c.OtelTraceSamplingRate, 1.0, "otelTraceSamplingRate must be <= 1.0"),
+		); err != nil {
+			return err
+		}
+	}
 	return assert.All(
svc/ctrl/worker/config.go (1)

144-146: Same validation gap as vault/config.go.

Consider validating OtelTraceSamplingRate bounds (0.0-1.0) if OtelEnabled is true, for consistency.

@Flo4604 Flo4604 (Member) left a comment

just some questions otherwise lgtm

also we have some stale comments if you search for shutdown, e.g. in pkg/otel/grafana.go

@chronark chronark merged commit 540ee3f into main Feb 5, 2026
11 of 14 checks passed
@chronark chronark deleted the runner branch February 5, 2026 10:20