
[1.34–1.35] Powered‑off node stays Ready indefinitely; Lease expired but controller actions stall until kubelite restart #5386

@kcarson77

Description

Summary

I have 3 nodes, dmhost1, dmhost2 and dmhost3. In this example I've powered down dmhost1.
On MicroK8s 1.34 and 1.35, a powered‑off node can remain Ready=True indefinitely (pods still show Running) even though the node’s Lease renewTime is stale (kubelet offline). The problem resolves immediately after restarting microk8s.daemon-kubelite on a healthy node, suggesting a kubelite/controller‑manager watch or write‑path stall in the node‑lifecycle reconciliation.
This is reproducible and appears to be a regression (not observed in earlier MicroK8s releases, e.g. 1.32).
To me this looks like a kubelite-level stall, or a controller that is stuck and no longer watching Leases.
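A quick way to see the mismatch described above is to compare each node's Lease renewTime with its reported Ready condition; a rough sketch (node names are the ones in this cluster):

    # Compare kubelet heartbeat (Lease renewTime) with the node's reported Ready condition.
    # On an affected cluster, dmhost1's renewTime is stale while Ready still reports True.
    for n in dmhost1 dmhost2 dmhost3; do
      renew=$(microk8s kubectl get lease -n kube-node-lease "$n" -o jsonpath='{.spec.renewTime}')
      ready=$(microk8s kubectl get node "$n" \
        -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}')
      echo "$n  lease renewTime=$renew  Ready=$ready"
    done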

Environment

MicroK8s channels: 1.35/stable (also reproducible on 1.34/stable)
Kubernetes: v1.35.0 (server & client)
HA: Yes — dqlite with 3 voters
OS: Ubuntu 22.04.5 LTS (kernel 5.15.0-164-generic)
Container runtime: containerd 2.1.3
CNI: Calico
API service endpoints:
default/kubernetes -> 10.173.128.165:16443, 10.173.128.166:16443

MicroK8s status
high-availability: yes
datastore master nodes:
10.173.128.164:19001
10.173.128.166:19001
10.173.128.165:19001

Nodes (example):
NAME      STATUS   ROLES   VERSION   INTERNAL-IP
dmhost1   Ready            v1.35.0   10.173.128.164
dmhost2   Ready            v1.35.0   10.173.128.165
dmhost3   Ready            v1.35.0   10.173.128.166

Log Snippets:
labuser@dmhost2:~$ date
Wed 4 Feb 12:01:18 UTC 2026
labuser@dmhost2:~$ ping dmhost1
PING dmhost1 (10.173.128.164) 56(84) bytes of data.
From dmhost2 (10.173.128.165) icmp_seq=1 Destination Host Unreachable
From dmhost2 (10.173.128.165) icmp_seq=2 Destination Host Unreachable
From dmhost2 (10.173.128.165) icmp_seq=3 Destination Host Unreachable
^C
--- dmhost1 ping statistics ---
4 packets transmitted, 0 received, +3 errors, 100% packet loss, time 3029ms
pipe 3
labuser@dmhost2:~$ microk8s kubectl get lease -n kube-node-lease dmhost1 -o yaml
apiVersion: coordination.k8s.io/v1
kind: Lease
metadata:
  creationTimestamp: "2026-01-19T19:03:15Z"
  name: dmhost1
  namespace: kube-node-lease
  ownerReferences:
  - apiVersion: v1
    kind: Node
    name: dmhost1
    uid: 9dfa6bcc-5c7b-4405-b648-c4f5edca51b4
  resourceVersion: "7004700"
  uid: 6b13c86b-c57c-4056-a435-760351b52d7f
spec:
  holderIdentity: dmhost1
  leaseDurationSeconds: 40
  renewTime: "2026-02-04T10:46:27.257957Z"
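For the record, the staleness can be computed directly from that object rather than eyeballing the YAML; a small sketch (GNU date assumed):

    # Seconds since dmhost1's kubelet last renewed its Lease
    renew=$(microk8s kubectl get lease -n kube-node-lease dmhost1 -o jsonpath='{.spec.renewTime}')
    echo "$(( $(date -u +%s) - $(date -u -d "$renew" +%s) )) seconds since last renew"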

Excerpt from the node describe output for dmhost1:

Lease:
  HolderIdentity:  dmhost1
  AcquireTime:
  RenewTime:       Wed, 04 Feb 2026 10:46:27 +0000
Conditions:
  Type                 Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
  ----                 ------  -----------------                 ------------------                ------                       -------
  NetworkUnavailable   False   Mon, 02 Feb 2026 18:43:35 +0000   Mon, 02 Feb 2026 18:43:35 +0000   CalicoIsUp                   Calico is running on this node
  MemoryPressure       False   Wed, 04 Feb 2026 10:46:35 +0000   Mon, 02 Feb 2026 17:46:01 +0000   KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure         False   Wed, 04 Feb 2026 10:46:35 +0000   Mon, 02 Feb 2026 17:46:01 +0000   KubeletHasNoDiskPressure     kubelet has no disk pressure
  PIDPressure          False   Wed, 04 Feb 2026 10:46:35 +0000   Mon, 02 Feb 2026 17:46:01 +0000   KubeletHasSufficientPID      kubelet has sufficient PID available
  Ready                True    Wed, 04 Feb 2026 10:46:35 +0000   Mon, 02 Feb 2026 17:50:00 +0000   KubeletReady                 kubelet is posting ready status

labuser@dmhost2:~$ microk8s kubectl get lease -n kube-system kube-controller-manager -o yaml
apiVersion: coordination.k8s.io/v1
kind: Lease
metadata:
  creationTimestamp: "2026-01-19T19:03:05Z"
  name: kube-controller-manager
  namespace: kube-system
  resourceVersion: "7021853"
  uid: 9ecf9c97-2af1-4575-92d4-c2bab5144ec4
spec:
  acquireTime: "2026-02-04T10:47:43.452780Z"
  holderIdentity: dmhost3_9cb98c2e-2d0b-4404-99c4-5613e7eed79a
  leaseDurationSeconds: 60
  leaseTransitions: 18
  renewTime: "2026-02-04T12:02:44.917919Z"
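The same check in compact form, for anyone who wants to watch the leader Lease without the full YAML (jsonpath fields as in the output above):

    # Print the controller-manager leader identity and its last renew time
    microk8s kubectl get lease -n kube-system kube-controller-manager \
      -o jsonpath='{.spec.holderIdentity}{"  "}{.spec.renewTime}{"\n"}'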

Observed behavior

Powered‑off node Lease shows stale renewTime:
HolderIdentity: dmhost1
RenewTime: 2026-02-04T10:46:27Z # stale long after shutdown

Node conditions remain Ready=True long after the Lease expired:
Type: Ready Status: True
LastHeartbeatTime: 2026-02-04T10:46:35Z
Reason: KubeletReady

Controller‑manager leader Lease healthy and renewing:
holderIdentity: dmhost3_9cb98c2e-...
leaseDurationSeconds: 60
renewTime: 2026-02-04T11:39:57Z

kubelite logs around the stuck period include repeated timeouts updating Leases and Node status (sample excerpt):
E0204 10:54:11 writers.go:123] "Unhandled Error" err="apiserver was unable to write a JSON response: http: Handler timeout"
E0204 10:54:18 controller.go:251] "Failed to update lease" err="Put \"https://127.0.0.1:16443/apis/coordination.k8s.io/.../leases/kube-controller-manager\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)"
E0204 10:54:21 kubelet_node_status.go:474] "Error updating node status, will retry" err="failed to patch status ... http: Handler timeout"
E0204 10:54:31 kubelet_node_status.go:461] "Unable to update node status" err="update node status exceeds retry count"
E0204 10:54:18 timeout.go:140] "Post-timeout activity" method="PUT" path="/apis/coordination.k8s.io/v1/.../leases/..."
{"logger":"etcd-client","msg":"retrying of unary invoke"} # repeated (MicroK8s logs label the KV retry layer as etcd; cluster is dqlite-backed)

After:
sudo snap restart microk8s.daemon-kubelite

→ The node promptly becomes NotReady and eviction resumes.

Expected behavior
Once a node’s Lease expires (kubelet stopped) and heartbeats are missed beyond the grace period, the node‑lifecycle controller should mark the node NotReady within ~1 minute and begin standard eviction timing (subject to tolerations).
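For context, the ~1 minute figure assumes the controller-manager's --node-monitor-grace-period is at its default; assuming the standard MicroK8s snap layout, whether it has been overridden can be checked with:

    # Empty output means the flag is not set and the kube-controller-manager built-in default applies
    grep -- '--node-monitor-grace-period' /var/snap/microk8s/current/args/kube-controller-manager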

Actual behavior

Node remains Ready=True indefinitely (observed >1 hour).
Lease is stale, confirming kubelet is offline.
Controller‑manager leader is healthy, but internal controller actions (Lease/Node status updates) time out and do not progress until kubelite is restarted.

Impact

False health state for failed nodes.
Evictions do not trigger; workloads are not rescheduled.
Failure drills and real outages appear healthy when they are not.

Reproduction Steps

Yes, I can reproduce this. Typically I fail a node (that first failure is handled correctly), bring it back, and wait >30 minutes for things to settle; subsequent node failures may then produce this issue.

1. Start with a healthy 3-node HA MicroK8s cluster on 1.34 or 1.35 (dqlite, 3 voters).
2. Confirm controller-manager leadership:
   microk8s kubectl get lease -n kube-system kube-controller-manager -o yaml
3. Power off one node at the host level (e.g., dmhost1).
4. Wait well beyond node-monitor-grace-period (e.g., >60–120 s; we observed >60 minutes).
5. Observe:
   - The powered-off node remains Ready=True.
   - Pods on that node still show Running.
6. Check the node Lease:
   microk8s kubectl get lease -n kube-node-lease dmhost1 -o yaml
   → renewTime is stale (it stops around the power-off time), confirming the kubelet is offline.
7. Check the controller-manager leader Lease:
   microk8s kubectl get lease -n kube-system kube-controller-manager -o yaml
   → holderIdentity present; renewTime fresh (the controller-manager is alive and leading).
8. Workaround: restart kubelite:
   sudo snap restart microk8s.daemon-kubelite
   → Within ~60 seconds, the powered-off node flips to NotReady and normal eviction behavior resumes (a verification loop is sketched below).
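To confirm the recovery timing after the restart, a simple poll works (node name is just the example from above):

    # Poll the Ready condition every 10 s and timestamp each sample,
    # so the NotReady transition can be compared against the kubelite restart time.
    while true; do
      echo -n "$(date -u +%FT%TZ)  "
      microk8s kubectl get node dmhost1 \
        -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}{"\n"}'
      sleep 10
    done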

Introspection Report

Can you suggest a fix?

Are you interested in contributing with a fix?
