Conversation
The latest updates on your projects. Learn more about Vercel for GitHub.
1 Skipped Deployment
Caution: Review failed. The pull request is closed.
📝 Walkthrough
Replaces the shutdown package with a runner-based lifecycle (New(logger), Go, Defer/DeferCtx, Recover, Wait), adds in-process health endpoints (/health/live, /health/ready, /health/startup) with parallel readiness checks, changes InitGrafana to return a shutdown func, and standardizes Kubernetes probe paths.
Sequence Diagram
```mermaid
sequenceDiagram
    participant Service as Service
    participant Runner as Runner
    participant Checks as ReadinessChecks
    participant Client as HTTP Client
    Service->>Runner: New(logger)
    Service->>Runner: RegisterHealth(mux)
    Service->>Runner: Go(startServerTask)
    Service->>Runner: DeferCtx(server.Shutdown)
    Service->>Runner: Defer(db.Close)
    Client->>Runner: GET /health/live
    Runner-->>Client: 200 (started) or 503
    Client->>Runner: GET /health/ready
    Runner->>Checks: runChecks(ctx) (parallel, per-check timeouts)
    Checks-->>Runner: results
    Runner-->>Client: 200 if all pass, else 503
    Service->>Runner: Wait(ctx, WithTimeout(...))
    Note right of Runner: on cancellation → mark shuttingDown, run deferred cleanups (LIFO), wait for tasks
    Runner-->>Service: shutdown complete / aggregated error
```
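To ground the diagram, here is a minimal sketch of a service wiring the runner as the walkthrough describes. The import paths and exact signatures (runner.New, Go, DeferCtx, RegisterHealth, Wait, WithTimeout, logging.NewNoop) are assumptions based on the names above, not the verified pkg/runner API.

```go
// Sketch only: illustrates the runner lifecycle described above.
// Import paths and method signatures are assumptions, not the verified pkg/runner API.
package example

import (
	"context"
	"errors"
	"net/http"
	"time"

	"github.com/unkeyed/unkey/pkg/logging" // assumed path
	"github.com/unkeyed/unkey/pkg/runner"  // assumed path
)

func run(ctx context.Context, srv *http.Server, mux *http.ServeMux) error {
	r := runner.New(logging.NewNoop())

	// Expose /health/live, /health/ready and /health/startup on the service mux.
	r.RegisterHealth(mux)

	// Long-running task tracked by the runner until Wait returns.
	r.Go(func(ctx context.Context) error {
		if err := srv.ListenAndServe(); err != nil && !errors.Is(err, http.ErrServerClosed) {
			return err
		}
		return nil
	})

	// Cleanups run LIFO once shutdown is triggered.
	r.DeferCtx(srv.Shutdown) // func(context.Context) error

	// Blocks until ctx is cancelled or an OS signal arrives, then runs cleanups.
	return r.Wait(ctx, runner.WithTimeout(time.Minute))
}
```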
Estimated code review effort: 🎯 5 (Critical) | ⏱️ ~120 minutes
🚥 Pre-merge checks: ✅ 1 | ❌ 2
❌ Failed checks (2 warnings)
✅ Passed checks (1 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings.
✨ Finishing touches
🧪 Generate unit tests (beta)
@coderabbitai review
✅ Actions performed: Review triggered.
Actionable comments posted: 6
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
dev/k8s/manifests/preflight.yaml (1)
126-139: ⚠️ Potential issue | 🔴 Critical
The probe paths do not match the service handlers: critical deployment issue.
The manifest probes for `/health/live` and `/health/ready` will fail because the preflight service still only exposes `/healthz`. The routes directory contains only `healthz` and `mutate` handlers. The HTTP health check handlers must be added to match the new probe paths, or the pod will continuously fail liveness/readiness checks and restart.
Fix required: Add `/health/live` and `/health/ready` handlers to `svc/preflight/routes/` and register them in `svc/preflight/run.go` (a handler sketch follows the grafana item below).
pkg/otel/grafana.go (1)
68-90: ⚠️ Potential issue | 🟡 Minor
Docs still show `shutdowns`.
✏️ Update example
```diff
-// The function registers all necessary shutdown handlers with the provided shutdowns instance.
+// The function registers all necessary shutdown handlers with the provided runner.
@@
-//	shutdowns := shutdown.New()
-//	err := otel.InitGrafana(ctx, otel.Config{
+//	r := runner.New(logging.NewNoop())
+//	err := otel.InitGrafana(ctx, otel.Config{
 //		GrafanaEndpoint: "https://otlp-sentinel-prod-us-east-0.grafana.net/otlp",
 //		Application:     "unkey-api",
 //		Version:         version.Version,
-//	}, shutdowns)
+//	}, r)
@@
-//	errs := shutdowns.Shutdown(ctx)
-//	for _, err := range errs {
-//		log.Printf("Shutdown error: %v", err)
-//	}
+//	_ = r.Wait(ctx)
```
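To make the preflight fix above concrete, here is a minimal sketch of hand-rolled `/health/live` and `/health/ready` handlers. The `RegisterHealth` name, the readiness closure, and the mux wiring are hypothetical, not the actual `svc/preflight` code; if the shared runner's health registration is adopted instead, that call would replace these handlers.

```go
// Sketch only: hand-rolled /health/live and /health/ready handlers for preflight.
// Handler names and the readiness closure are illustrative, not the actual svc/preflight code.
package routes

import (
	"context"
	"net/http"
	"time"
)

// RegisterHealth wires the probe paths used by dev/k8s/manifests/preflight.yaml.
func RegisterHealth(mux *http.ServeMux, ready func(context.Context) error) {
	mux.HandleFunc("/health/live", func(w http.ResponseWriter, r *http.Request) {
		// Liveness: the process is up and serving HTTP.
		w.WriteHeader(http.StatusOK)
	})

	mux.HandleFunc("/health/ready", func(w http.ResponseWriter, r *http.Request) {
		// Readiness: run the dependency check with a bounded timeout.
		ctx, cancel := context.WithTimeout(r.Context(), 2*time.Second)
		defer cancel()
		if err := ready(ctx); err != nil {
			http.Error(w, err.Error(), http.StatusServiceUnavailable)
			return
		}
		w.WriteHeader(http.StatusOK)
	})
}
```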
🤖 Fix all issues with AI agents
In `@pkg/runner/health.go`:
- Around line 177-198: The function Runner.runChecks reads r.health.checkTimeout
without synchronization, which can race with writers (e.g., Runner.Wait); add a
mutex to the health container (e.g., sync.RWMutex on the health struct) and use
a read lock when accessing checkTimeout in runChecks (acquire RLock, copy the
timeout, RUnlock), and use the write lock (Lock/Unlock) wherever checkTimeout is
updated. Update the health struct and the writers that modify checkTimeout to
hold the write lock to eliminate the race (a locking sketch follows this list).
In `@pkg/runner/runner.go`:
- Around line 36-39: The struct comment incorrectly references Run() but the
method is named Wait(); update the doc on Runner to mention Wait() (and describe
that Wait() blocks until OS signals or errors and then triggers cleanup), i.e.,
change the text "Run() blocks until OS signals or errors, then triggers cleanup"
to "Wait() blocks until OS signals or errors, then triggers cleanup" and ensure
any other occurrences in the Runner comment or nearby docs refer to Wait rather
than Run to keep documentation consistent with the Runner.Wait method.
- Around line 221-223: The variables 'cause' and 'reason' are pre-initialized
but unused; remove their unnecessary initializations in runner.go and instead
declare them at the point of first use (or use short declaration := when setting
them). Update references to 'cause' and 'reason' so they are defined where
assigned (e.g., inside the select/case blocks or immediately before they are
needed) to satisfy the linter and avoid unused init.
In `@svc/api/run.go`:
- Around line 340-347: The logger.Info call is placed after srv.Serve returns
(in the anonymous r.Go function), so "API server started successfully" is logged
on shutdown; move the success log to before calling srv.Serve (i.e., log
immediately before invoking srv.Serve(ctx, cfg.Listener) inside the same r.Go
goroutine) or change the message to reflect shutdown (e.g., "API server
stopped"); update the code around the anonymous function passed to r.Go, keeping
srv.Serve, cfg.Listener and logger.Info in mind.
In `@svc/frontline/run.go`:
- Around line 282-289: The HTTP server goroutine currently ignores only
context.Canceled when checking serveErr from httpSrv.Serve; update the error
check inside the r.Go anonymous function that calls httpSrv.Serve (referencing
httpSrv.Serve, httpListener and serveErr) to also ignore http.ErrServerClosed
(i.e., return an error only if serveErr != nil && !errors.Is(serveErr,
context.Canceled) && !errors.Is(serveErr, http.ErrServerClosed)), so the HTTP
shutdown path matches the HTTPS handling.
- Around line 247-254: The HTTPS goroutine currently treats any
non-context.Canceled error from httpsSrv.Serve as fatal; update the check inside
the r.Go block that wraps httpsSrv.Serve so it also treats http.ErrServerClosed
as non-fatal (i.e., ignore errors that satisfy errors.Is(err,
http.ErrServerClosed)) to avoid reporting expected shutdown errors triggered by
r.DeferCtx(httpsSrv.Shutdown); reference the r.Go anonymous function, serveErr
variable, httpsSrv.Serve call, and the context.Canceled and http.ErrServerClosed
sentinel errors when making the change.
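As referenced in the first health.go item above, here is a minimal sketch of the locking pattern it asks for; beyond the checkTimeout field named in the comment, the struct shape and method names are assumptions.

```go
// Sketch only: guard checkTimeout with an RWMutex so runChecks and Runner.Wait
// don't race. Fields and methods beyond checkTimeout are illustrative.
package runner

import (
	"sync"
	"time"
)

type health struct {
	mu           sync.RWMutex
	checkTimeout time.Duration
}

// Readers copy the value under a read lock and release it before running checks.
func (h *health) timeout() time.Duration {
	h.mu.RLock()
	defer h.mu.RUnlock()
	return h.checkTimeout
}

// Writers (e.g. Runner.Wait adjusting the timeout) take the write lock.
func (h *health) setCheckTimeout(d time.Duration) {
	h.mu.Lock()
	defer h.mu.Unlock()
	h.checkTimeout = d
}
```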
🧹 Nitpick comments (7)
pkg/runner/concurrency_test.go (1)
151-156: Use require.Eventually; drop t.Fatal
Line 151-156 uses t.Fatal; switch to require.Eventually to keep require-only assertions, less flake. As per coding guidelines, use `testify/require` for assertions in tests.
Proposed change
```diff
-	select {
-	case <-started:
-		// Task started immediately - good
-	case <-time.After(100 * time.Millisecond):
-		t.Fatal("task did not start immediately")
-	}
+	require.Eventually(t, func() bool {
+		select {
+		case <-started:
+			return true
+		default:
+			return false
+		}
+	}, 100*time.Millisecond, 5*time.Millisecond, "task did not start immediately")
```
svc/api/run.go (2)
125-131: Consider also ignoring `http.ErrServerClosed` for Prometheus.
The API server (line 342) ignores both `context.Canceled` and `http.ErrServerClosed`, but Prometheus only ignores `context.Canceled`. For consistency and to avoid spurious errors during graceful shutdown, consider adding the same check.
Proposed fix
```diff
 r.Go(func(ctx context.Context) error {
 	serveErr := prom.Serve(ctx, promListener)
-	if serveErr != nil && !errors.Is(serveErr, context.Canceled) {
+	if serveErr != nil && !errors.Is(serveErr, context.Canceled) && !errors.Is(serveErr, http.ErrServerClosed) {
 		return fmt.Errorf("prometheus server failed: %w", serveErr)
 	}
 	return nil
 })
```
350-353: Inconsistent shutdown timeout across services.
This service uses `runner.WithTimeout(time.Minute)`, but `svc/frontline/run.go` (line 295) and `svc/ctrl/api/run.go` (line 275) use `r.Wait(ctx)` without a timeout. Consider standardizing the shutdown timeout behavior across all services.
svc/frontline/run.go (1)
97-103: Prometheus error handling consistency.
Same as other services: consider adding an `http.ErrServerClosed` check for consistency.
Proposed fix
```diff
 r.Go(func(ctx context.Context) error {
 	promListenErr := prom.Serve(ctx, promListener)
-	if promListenErr != nil && !errors.Is(promListenErr, context.Canceled) {
+	if promListenErr != nil && !errors.Is(promListenErr, context.Canceled) && !errors.Is(promListenErr, http.ErrServerClosed) {
 		return fmt.Errorf("prometheus server failed: %w", promListenErr)
 	}
 	return nil
 })
```
svc/ctrl/api/run.go (3)
232-236: Inconsistent error comparison style.
Line 232 uses direct comparison `err != http.ErrServerClosed` instead of `errors.Is(err, http.ErrServerClosed)` used elsewhere. While functionally equivalent for this error, using `errors.Is` is more robust for wrapped errors and maintains consistency across services.
Proposed fix
```diff
-	if err != nil && err != http.ErrServerClosed {
+	if err != nil && !errors.Is(err, http.ErrServerClosed) {
 		return fmt.Errorf("server failed: %w", err)
 	}
```
247-248: Certificate bootstrap uses bare `go` instead of runner.
`certBootstrap.run(ctx)` is launched with a bare `go` goroutine, bypassing the runner's lifecycle management. This means:
- Errors from bootstrap won't propagate to the runner
- The goroutine may outlive graceful shutdown
Consider using `r.Go` for consistency, or document why this is intentionally fire-and-forget.
Proposed fix using runner
```diff
-	go certBootstrap.run(ctx)
+	r.Go(func(ctx context.Context) error {
+		certBootstrap.run(ctx)
+		return nil
+	})
```
264-270: Prometheus error handling consistency.
Same as other services: consider adding an `http.ErrServerClosed` check for consistency.
Proposed fix
```diff
 r.Go(func(ctx context.Context) error {
 	logger.Info("prometheus started", "port", cfg.PrometheusPort)
-	if serveErr := prom.Serve(ctx, ln); serveErr != nil && !errors.Is(serveErr, context.Canceled) {
+	if serveErr := prom.Serve(ctx, ln); serveErr != nil && !errors.Is(serveErr, context.Canceled) && !errors.Is(serveErr, http.ErrServerClosed) {
 		return fmt.Errorf("failed to start prometheus server: %w", serveErr)
 	}
 	return nil
 })
```
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@pkg/runner/health.go`:
- Around line 184-191: The goroutine launched for each readiness check (the loop
over checks calling the anonymous function that uses ReadinessCheck) can panic
and then never send a checkResult to resultCh, causing /health/ready to hang;
wrap the body of that goroutine with a defer that recovers from panics and
converts the panic into an error value, and ensure that you always send a
checkResult{name: name, err: err} to resultCh in all cases (including after a
panic) while still canceling the check context via defer cancel(); update the
anonymous function around checkCtx, cancel, err := check(checkCtx) to guarantee
a resultCh send even on panic.
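A minimal sketch of the panic-safe per-check goroutine described above; the ReadinessCheck and checkResult shapes are inferred from the comment rather than copied from pkg/runner, and the surrounding runChecks plumbing is simplified.

```go
// Sketch only: always deliver a checkResult, even if the check panics.
// Type shapes are inferred from the review comment, not copied from pkg/runner.
package runner

import (
	"context"
	"fmt"
	"time"
)

type ReadinessCheck func(ctx context.Context) error

type checkResult struct {
	name string
	err  error
}

func runCheck(ctx context.Context, name string, check ReadinessCheck, timeout time.Duration, resultCh chan<- checkResult) {
	go func() {
		checkCtx, cancel := context.WithTimeout(ctx, timeout)
		defer cancel()

		var err error
		defer func() {
			// Convert a panic into an error and still send, so /health/ready never hangs.
			if rec := recover(); rec != nil {
				err = fmt.Errorf("readiness check %q panicked: %v", name, rec)
			}
			resultCh <- checkResult{name: name, err: err}
		}()

		err = check(checkCtx)
	}()
}
```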
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
svc/preflight/run.go (1)
100-100: ⚠️ Potential issue | 🟡 Minor
Use `r.Go()` to integrate cleanup service into runner's lifecycle.
`cleanupService.Start(ctx)` spawned with raw `go` won't be tracked by the runner, so shutdown won't wait for it to complete. Use `r.Go()` for consistency with other background tasks.
Suggested fix
```diff
-	go cleanupService.Start(ctx)
+	r.Go(func(ctx context.Context) error {
+		cleanupService.Start(ctx)
+		return nil
+	})
```
🧹 Nitpick comments (1)
svc/krane/run.go (1)
198-199: Remove extra blank lines. Two consecutive blank lines; one is sufficient.
Suggested fix
```diff
-
-
+	// Wait for signal and handle shutdown
```
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (3)
pkg/otel/grafana.go (1)
58-90: ⚠️ Potential issue | 🟡 Minor
Docs still mention `shutdowns` param; update for cleanup func return.
web/apps/dashboard/components/virtual-table/index.tsx (2)
142-143: ⚠️ Potential issue | 🟡 Minor
Use `column.key` instead of array index as key in colgroup elements. Array indices are unstable as React keys and can cause reconciliation issues if columns are reordered. The `column.key` property is already used correctly in the `thead` section (line 148) and should be used consistently here as well.
🔧 Suggested patch
```diff
-  {columns.map((_, idx) => (
-    <col key={idx} style={{ width: colWidths[idx].width }} />
+  {columns.map((column, idx) => (
+    <col key={column.key} style={{ width: colWidths[idx].width }} />
   ))}
```
Also applies to: 195-196
55-60: ⚠️ Potential issue | 🟡 Minor
Replace `any` with `unknown` and specify ref type. This improves type safety per coding guidelines.
```diff
-// biome-ignore lint/suspicious/noExplicitAny: Safe to leave
-export const VirtualTable = forwardRef<VirtualTableRef, VirtualTableProps<any>>(
+export const VirtualTable = forwardRef<VirtualTableRef, VirtualTableProps<unknown>>(
   function VirtualTable<TTableData>(
     props: VirtualTableProps<TTableData>,
-    ref: Ref<unknown> | undefined,
+    ref: Ref<VirtualTableRef> | undefined,
   ) {
```
🤖 Fix all issues with AI agents
In `@cmd/krane/main.go`:
- Around line 97-105: The Config Clock field is set to nil but should use a real
clock; update the krane.Config initialization in main.go to set Clock to
clock.RealClock{} (not nil) so production code uses a real clock implementation
as required by the Clock docs and to avoid an illegal nil state in the Config
struct; locate the krane.Config literal and replace Clock: nil with Clock:
clock.RealClock{}.
In `@web/apps/dashboard/components/virtual-table/index.tsx`:
- Around line 316-318: Replace all unsafe type assertions of DOM nodes with
runtime instanceof guards: instead of using document.activeElement as
HTMLElement, check if (document.activeElement instanceof HTMLElement) and then
call blur() on it; instead of document.querySelector(...) as HTMLElement, store
the result in a variable (e.g., const el = document.querySelector(selector)) and
guard with if (el instanceof HTMLElement) before accessing HTMLElement-specific
properties or methods. Apply this pattern to every occurrence of the expressions
document.activeElement as HTMLElement and document.querySelector(...) as
HTMLElement in virtual-table/index.tsx so all six assertions are removed and
guarded with instanceof checks.
- Around line 85-90: The ref variables parentRef and containerRef are
redundantly asserted with "as React.RefObject<HTMLDivElement>" after
useRef<HTMLDivElement>(null); remove those type assertions and let
useRef<HTMLDivElement>(null) infer the RefObject type directly (update the
declarations of parentRef and containerRef to just use
useRef<HTMLDivElement>(null) without the "as" cast).
🧹 Nitpick comments (4)
svc/krane/config.go (1)
63-83: Validate OtelTraceSamplingRate range. Add a 0–1 bounds check when enabled. The sampling rate must fall within the valid range per OpenTelemetry specification.
🔧 Suggested patch
```diff
 import (
+	"fmt"
 	"github.com/unkeyed/unkey/pkg/clock"
 )
@@
 func (c Config) Validate() error {
+	if c.OtelEnabled {
+		if c.OtelTraceSamplingRate < 0 || c.OtelTraceSamplingRate > 1 {
+			return fmt.Errorf("otel trace sampling rate must be between 0 and 1")
+		}
+	}
 	return nil
 }
```
pkg/otel/grafana.go (1)
198-215: Collect all shutdown errors with `errors.Join` instead of stopping on first failure.
During graceful shutdown, it's best practice to attempt all cleanup steps and report all failures together. The current implementation returns immediately on the first error, preventing subsequent cleanups from running.
Replace the early return with error aggregation using `errors.Join` (available in Go 1.20+). Import the `"errors"` package and collect all shutdown errors before returning.
🔧 Suggested patch
```diff
 import (
 	"context"
+	"errors"
 	"fmt"
 	"time"
@@
 	return func(ctx context.Context) error {
+		var errs []error
 		for _, fn := range []func(context.Context) error{
 			meterProvider.Shutdown,
 			reader.Shutdown,
 			metricExporter.Shutdown,
 			traceProvider.Shutdown,
 			traceExporter.Shutdown,
 			logProvider.Shutdown,
 			processor.Shutdown,
 			logExporter.Shutdown,
 		} {
 			if err := fn(ctx); err != nil {
-				return err
+				errs = append(errs, err)
 			}
-		}
+		}
+		if len(errs) > 0 {
+			return errors.Join(errs...)
+		}
 		return nil
 	}, nil
 }
```
svc/vault/config.go (1)
18-19: Consider validating OtelTraceSamplingRate bounds.
Comment documents `0.0 to 1.0` range but `Validate()` doesn't enforce it. Values outside this range may cause unexpected behavior in OTel.
Proposed validation addition
```diff
 func (c Config) Validate() error {
+	if c.OtelEnabled {
+		if err := assert.All(
+			assert.GreaterOrEqual(c.OtelTraceSamplingRate, 0.0, "otelTraceSamplingRate must be >= 0.0"),
+			assert.LessOrEqual(c.OtelTraceSamplingRate, 1.0, "otelTraceSamplingRate must be <= 1.0"),
+		); err != nil {
+			return err
+		}
+	}
 	return assert.All(
```
svc/ctrl/worker/config.go (1)
144-146: Same validation gap as vault/config.go.
Consider validating `OtelTraceSamplingRate` bounds (0.0-1.0) if OtelEnabled is true, for consistency.
Flo4604
left a comment
just some questions otherwise lgtm
also we have some stale comments if you search for shutdown, e.g. in pkg/otel/grafana.go
deduplicates a lot of startup logic and makes it easy to use proper probes.
This needs to be updated in the infra repo too before rolling out, otherwise healthchecks will fail