Description
Describe the request
When running Search Head Cluster (SHC) on Kubernetes, the Splunk Operator’s rollout/recycle logic can become stuck if an SHC member enters ManualDetention and Kubernetes readiness is used as the primary gate to proceed.
In addition, when ingress/session stickiness is enabled (common for Splunk Web), users can appear “stuck” on a specific SH member during rollouts/detention events, even though the cluster is attempting to drain traffic from that member.
We need a robust operator behavior that:
- avoids deadlocking upgrades/rollouts when an SH member is detained, and
- avoids or minimizes user impact when a member is out of service (detained/unhealthy) in environments that use sticky sessions.
Expected behavior
- During an operator-driven SHC recycle/upgrade:
  - The operator detains a member, waits for safe conditions, updates/restarts as needed, then releases detention.
  - The rollout should always converge without requiring manual intervention.
- During detention/restart events:
  - Users should not be pinned indefinitely to an SH member that is out of service.
  - If sticky sessions are used, there should be a failover path when a pinned backend becomes unhealthy/out of service.
Splunk setup on K8S
- Splunk Enterprise 10.0.3 deployed with Splunk Operator 3.0.0.
- Includes:
- SearchHeadCluster (SHC) with multiple members
- (Optionally) Standalone instances in the same cluster (not required to reproduce this issue)
- Splunk Web traffic routed through Kubernetes Ingress (NGINX Ingress Controller) with cookie-based session affinity enabled (typical for Splunk Web session behavior).
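For reference, a minimal sketch of the kind of Ingress configuration described above, using the ingress-nginx cookie-affinity annotations. The hostname, Service name, and cookie name below are placeholders for illustration, not values from the actual deployment.

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: splunk-web
  annotations:
    # Cookie-based session affinity for Splunk Web (ingress-nginx).
    nginx.ingress.kubernetes.io/affinity: "cookie"
    nginx.ingress.kubernetes.io/affinity-mode: "persistent"
    nginx.ingress.kubernetes.io/session-cookie-name: "SPLUNK_WEB_AFFINITY"
spec:
  ingressClassName: nginx
  rules:
    - host: splunk.example.com            # placeholder hostname
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: splunk-shc-search-head-service   # placeholder Service name
                port:
                  number: 8000             # Splunk Web port
```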
Reproduction/Testing steps
A) Deadlock / “circular loop” between detention and readiness
- Start a normal SHC with N members managed by Splunk Operator.
- Trigger an operator-driven rolling operation (e.g., image upgrade / recycle path).
- Ensure an SH member becomes detained as part of the recycle process (`status=ManualDetention` at the Splunk SHC member layer).
- Implement a readiness probe such that `ManualDetention` → readiness probe fails → Kubernetes marks the pod `Ready=false` (a sketch of such a probe follows this list).
- Observe:
- Kubernetes removes the pod from Service endpoints (expected).
- The Splunk Operator’s SHC rollout can stall waiting for `ReadyReplicas` / pod readiness gates, and never reaches the step that releases detention.
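A minimal sketch of the “detention ⇒ NotReady” probe referenced in the steps above. The shcluster member-info endpoint and `manual_detention` field are used here, but the credential handling (the `SPLUNK_PASSWORD` variable), thresholds, and exact JSON matching are illustrative assumptions; verify them against your Splunk version before use.

```yaml
# Illustrative readinessProbe on the search head container: fail readiness
# while the SHC member reports manual detention, so Kubernetes removes the
# pod from Service endpoints.
readinessProbe:
  exec:
    command:
      - /bin/sh
      - -c
      - |
        # Query the local SHC member info and require manual_detention == off.
        # Credential handling below is an assumption for this sketch.
        curl -sk -u "admin:${SPLUNK_PASSWORD}" \
          "https://localhost:8089/services/shcluster/member/info?output_mode=json" \
          | grep -Eq '"manual_detention" *: *"off"'
  initialDelaySeconds: 30
  periodSeconds: 10
  failureThreshold: 3
```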
The loop in one line
Operator detains a member → probe marks it NotReady → operator waits for all pods Ready → operator never reaches the step that undetains → member stays detained → probe keeps it NotReady → rollout is stuck.
Why this happens (mechanically)
During an operator-driven SHC upgrade/recycle, the operator:
- puts a member into detention (Splunk-side “out of service”)
- later, when safe, it releases detention (i.e., clears manual detention)
But the operator’s control loop gates progress on Kubernetes readiness (e.g., StatefulSet `ReadyReplicas == replicas` or equivalent “cluster is ready” conditions).
If the readiness probe is defined such that detention implies NotReady, Kubernetes will keep that pod `Ready=false` while detained. If the operator requires all pods Ready before proceeding to the release/undetain step, the process can deadlock.
Important nuance
A “detention ⇒ NotReady” probe is reasonable for human/manual detentions (you want traffic drained).
It conflicts specifically with operator-driven rolling restarts/upgrades, because the operator expects to be able to detain a member and still progress through the rest of the orchestration, eventually releasing detention.
We attempted a “fail-open” guard approach (only marking the pod NotReady for detention when a rolling restart is not in progress). Without a guard like that (or a different operator gating model), the deadlock can occur.
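A sketch of what that fail-open guard could look like at the probe level, assuming some signal tells the probe that an operator-driven rolling restart is in progress. The marker file used below is hypothetical; substitute whatever signal your environment actually exposes.

```yaml
readinessProbe:
  exec:
    command:
      - /bin/sh
      - -c
      - |
        # Fail-open guard: if an operator-driven rolling restart is in
        # progress, report Ready even while detained, so the operator's
        # readiness gate can converge and eventually release detention.
        # NOTE: the marker file path is hypothetical -- replace it with your
        # own signal (flag file, env var, pod annotation, ...).
        if [ -f /opt/splunk/var/run/rolling_restart_in_progress ]; then
          exit 0
        fi
        # Otherwise, treat manual detention as NotReady so traffic drains.
        curl -sk -u "admin:${SPLUNK_PASSWORD}" \
          "https://localhost:8089/services/shcluster/member/info?output_mode=json" \
          | grep -Eq '"manual_detention" *: *"off"'
```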
B) User impact: sticky ingress can strand users on a detained member
- Enable cookie-based stickiness in the ingress for Splunk Web (common).
- Have a user establish a session routed to SH member X.
- During a rollout, member X is detained/restarted/unhealthy.
- Observe:
- The user’s browser continues to send requests with the same affinity cookie.
- Depending on ingress behavior, the user can appear “stuck” (errors/timeouts/looping) until the cookie expires or the ingress fails over the session.
This is typically addressed at the ingress layer (e.g., `session-cookie-change-on-failure` and `proxy-next-upstream` settings), but it’s tightly coupled to how detention is reflected in readiness/endpoints during operator-driven operations.
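For illustration, the ingress-layer mitigation mentioned above could look roughly like this on an ingress-nginx Ingress; the annotation values are examples to tune, not a tested recommendation.

```yaml
metadata:
  annotations:
    nginx.ingress.kubernetes.io/affinity: "cookie"
    # Re-issue the affinity cookie when the pinned backend fails, instead of
    # continuing to route the session to an unavailable member.
    nginx.ingress.kubernetes.io/session-cookie-change-on-failure: "true"
    # Retry the request against another upstream on connection errors / 5xx.
    nginx.ingress.kubernetes.io/proxy-next-upstream: "error timeout http_502 http_503 http_504"
    nginx.ingress.kubernetes.io/proxy-next-upstream-tries: "3"
```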
K8s environment
- Kubernetes cluster (managed)
- NGINX Ingress Controller
- (Optional) strict session stickiness enabled for Splunk Web ingress