
Bug report draft: SHC ManualDetention + Kubernetes readiness gating can deadlock operator-driven rollouts (and can strand users with sticky ingress) #1676

@dpericaxon

Describe the request

When running Search Head Cluster (SHC) on Kubernetes, the Splunk Operator’s rollout/recycle logic can become stuck if an SHC member enters ManualDetention and Kubernetes readiness is used as the primary gate to proceed.

In addition, when ingress/session stickiness is enabled (common for Splunk Web), users can appear “stuck” on a specific SH member during rollouts/detention events, even though the cluster is attempting to drain traffic from that member.

We need a robust operator behavior that:

  • avoids deadlocking upgrades/rollouts when an SH member is detained, and
  • avoids or minimizes user impact when a member is out of service (detained/unhealthy) in environments that use sticky sessions.

Expected behavior

  • During an operator-driven SHC recycle/upgrade:

    • The operator detains a member, waits for safe conditions, updates/restarts as needed, then releases detention.
    • The rollout should always converge without requiring manual intervention.
  • During detention/restart events:

    • Users should not be pinned indefinitely to an SH member that is out of service.
    • If sticky sessions are used, there should be a failover path when a pinned backend becomes unhealthy/out of service.

Splunk setup on K8S

  • Splunk Enterprise 10.0.3 deployed with Splunk Operator 3.0.0.
  • Includes:
    • SearchHeadCluster (SHC) with multiple members
    • (Optionally) Standalone instances in the same cluster (not required to reproduce this issue)
  • Splunk Web traffic routed through Kubernetes Ingress (NGINX Ingress Controller) with cookie-based session affinity enabled (typical for Splunk Web session behavior).
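
For reference, the stickiness setup looks roughly like the following. This is an illustrative sketch only; the resource name, host, backend Service name, and cookie name are placeholders, not our exact manifest:

```yaml
# Illustrative NGINX Ingress with cookie-based session affinity for Splunk Web.
# Names, host, backend Service, and cookie name are placeholders.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: splunk-shc-web                    # placeholder name
  annotations:
    nginx.ingress.kubernetes.io/affinity: "cookie"
    nginx.ingress.kubernetes.io/affinity-mode: "persistent"
    nginx.ingress.kubernetes.io/session-cookie-name: "SPLUNKWEB_AFFINITY"   # placeholder cookie name
    nginx.ingress.kubernetes.io/session-cookie-max-age: "172800"            # e.g. 48h
spec:
  ingressClassName: nginx
  rules:
    - host: splunk.example.com            # placeholder host
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: splunk-shc-search-head   # placeholder; actual Service name depends on the CR
                port:
                  number: 8000
```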

Reproduction/Testing steps

A) Deadlock / “circular loop” between detention and readiness

  1. Start a normal SHC with N members managed by the Splunk Operator.
  2. Trigger an operator-driven rolling operation (e.g., image upgrade / recycle path).
  3. Ensure an SH member becomes detained as part of the recycle process (status=ManualDetention at the Splunk SHC member layer).
  4. Configure the readiness probe such that:
    • ManualDetention → readiness probe fails → Kubernetes marks the pod Ready=false (a probe of this shape is sketched after these steps).
  5. Observe:
    • Kubernetes removes the pod from Service endpoints (expected).
    • The Splunk Operator’s SHC rollout can stall waiting for ReadyReplicas / pod readiness gates, and never reaches the step that releases detention.
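
To make step 4 concrete, the probe we have in mind looks roughly like this. It is a sketch only: the REST field/value that exposes detention, the credentials path, and the thresholds are assumptions, not a tested configuration:

```yaml
# Sketch of a readiness probe that reports NotReady while the member is in ManualDetention.
# Assumes detention state is visible via the local shcluster member info REST endpoint;
# the grep pattern and secrets path are illustrative placeholders.
readinessProbe:
  exec:
    command:
      - /bin/sh
      - -c
      - |
        # Query the local management port for this member's SHC state.
        status=$(curl -sk -u "admin:$(cat /mnt/splunk-secrets/password)" \
          "https://localhost:8089/services/shcluster/member/info?output_mode=json")
        # Fail readiness if the member reports ManualDetention (exact field/value assumed).
        echo "$status" | grep -qi 'manualdetention' && exit 1
        exit 0
  initialDelaySeconds: 60
  periodSeconds: 30
  failureThreshold: 3
```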

The loop in one line

Operator detains a member → probe marks it NotReady → operator waits for all pods Ready → operator never reaches the step that undetains → member stays detained → probe keeps it NotReady → rollout is stuck.

Why this happens (mechanically)

During an operator-driven SHC upgrade/recycle, the operator:

  • puts a member into detention (Splunk-side “out of service”)
  • later, when safe, it releases detention (i.e., clears manual detention)

But the operator’s control loop gates progress on Kubernetes readiness (e.g., StatefulSet ReadyReplicas == replicas or equivalent “cluster is ready” conditions).

If the readiness probe is defined such that detention implies NotReady, Kubernetes will keep that pod Ready=false while detained. If the operator requires all pods Ready before proceeding to the release/undetain step, the process can deadlock.
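
Observationally, the stuck state looks like a StatefulSet whose status never converges; something like the following (field values illustrative, revision name a placeholder):

```yaml
# Illustrative status of the SHC StatefulSet while the rollout is stalled:
# one member is detained, its pod stays Ready=false, so readyReplicas never
# reaches replicas and the operator's readiness gate never opens.
status:
  replicas: 3
  readyReplicas: 2            # the detained member stays NotReady
  updatedReplicas: 1          # rollout stops partway through
  updateRevision: splunk-shc-search-head-7d9f8c6b5   # placeholder revision name
```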

Important nuance

A “detention ⇒ NotReady” probe is reasonable for human/manual detentions (you want traffic drained).

It conflicts specifically with operator-driven rolling restarts/upgrades, because the operator expects to be able to detain a member and still progress through the rest of the orchestration, eventually releasing detention.

We attempted a “fail-open” guard approach (only mark NotReady for detention when a rolling restart is not in progress). Without a guard like that (or a different operator gating model), the deadlock can occur.
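
The guard we tried looks roughly like the sketch below. It assumes a rolling-restart indicator is readable from the captain (e.g., a rolling_restart_flag on the captain info endpoint); the exact field names, grep patterns, and secrets path are assumptions, not a verified implementation:

```yaml
# Sketch of the "fail-open" guard: only report NotReady for detention when
# no rolling restart is in progress.
readinessProbe:
  exec:
    command:
      - /bin/sh
      - -c
      - |
        AUTH="admin:$(cat /mnt/splunk-secrets/password)"
        member=$(curl -sk -u "$AUTH" \
          "https://localhost:8089/services/shcluster/member/info?output_mode=json")
        # If the member is not detained, report Ready.
        echo "$member" | grep -qi 'manualdetention' || exit 0
        # Member is detained: check whether the cluster is mid rolling-restart.
        captain=$(curl -sk -u "$AUTH" \
          "https://localhost:8089/services/shcluster/captain/info?output_mode=json")
        # Fail open (stay Ready) while a rolling restart is in progress so the
        # operator can keep progressing and eventually release detention.
        echo "$captain" | grep -Eqi '"rolling_restart_flag"\s*:\s*(true|"?1"?)' && exit 0
        # Detained outside of a rolling restart: drain traffic by reporting NotReady.
        exit 1
  periodSeconds: 30
  failureThreshold: 3
```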

B) User impact: sticky ingress can strand users on a detained member

  1. Enable cookie-based stickiness in the ingress for Splunk Web (common).
  2. Have a user establish a session routed to SH member X.
  3. During a rollout, member X is detained/restarted/unhealthy.
  4. Observe:
    • The user’s browser continues to send requests with the same affinity cookie.
    • Depending on ingress behavior, the user can appear “stuck” (errors/timeouts/looping) until the cookie expires or the ingress fails over the session.

This is typically addressed at the ingress layer (e.g., session-cookie-change-on-failure and proxy-next-upstream settings), but it’s tightly coupled to how detention is reflected in readiness/endpoints during operator-driven operations.
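
Concretely, the ingress-layer mitigations we are experimenting with look like the following (annotation values illustrative, cookie name a placeholder; they reduce, but do not eliminate, the coupling described above):

```yaml
# Illustrative ingress-nginx annotations to let stuck sessions fail over:
# re-issue the affinity cookie when the pinned backend fails, and retry the
# request against another upstream on connection errors / 502 / 503 / 504.
metadata:
  annotations:
    nginx.ingress.kubernetes.io/affinity: "cookie"
    nginx.ingress.kubernetes.io/session-cookie-name: "SPLUNKWEB_AFFINITY"   # placeholder cookie name
    nginx.ingress.kubernetes.io/session-cookie-change-on-failure: "true"
    nginx.ingress.kubernetes.io/proxy-next-upstream: "error timeout http_502 http_503 http_504"
    nginx.ingress.kubernetes.io/proxy-next-upstream-tries: "3"
```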


K8s environment

  • Kubernetes cluster (managed)
  • NGINX Ingress Controller
  • (Optional) strict session stickiness enabled for Splunk Web ingress
