Description
What would you like to be added?
Introduce stale DNS response serving in NodeLocalDNS, similar to “serve‑stale” mechanisms implemented in production resolvers (e.g., Unbound, BIND).
Behavior
If a cached record exists (even if its TTL has expired), NodeLocalDNS should return the stale response when any of the following conditions occur:
- Upstream DNS is unreachable or times out
- Upstream DNS returns a temporary NXDOMAIN
- Upstream DNS returns a response with no IP addresses (empty A/AAAA records)
This mechanism is intended as a resiliency feature, not a replacement for normal TTL‑based resolution.
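The decision described above could be sketched roughly as follows. This is only an illustration of the requested behavior, not NodeLocalDNS or CoreDNS code: the names `Upstream`, `CacheEntry`, `resolve`, `query_upstream`, and the `MAX_STALE_SECONDS` bound are all hypothetical. Because a resolver cannot reliably tell a transient NXDOMAIN from a genuine one, the sketch assumes any NXDOMAIN or empty answer for a name that still has a cached entry may be transient, and limits the exposure with a maximum stale age.

```python
import time
from dataclasses import dataclass
from enum import Enum, auto


class Upstream(Enum):
    """Hypothetical classification of an upstream reply."""
    OK = auto()        # answer with at least one address
    TIMEOUT = auto()   # upstream unreachable or timed out
    NXDOMAIN = auto()  # upstream says the name does not exist
    EMPTY = auto()     # NOERROR but no A/AAAA records


@dataclass
class CacheEntry:
    addresses: list
    expires_at: float  # absolute time at which the TTL runs out


# Assumed upper bound on how long past expiry a record may still be served.
MAX_STALE_SECONDS = 3600.0


def resolve(name, cache, query_upstream, now=None):
    """Return (addresses, source) for `name`.

    `cache` maps names to CacheEntry objects; `query_upstream(name)` is a
    hypothetical callable returning (Upstream status, addresses, ttl).
    """
    now = time.time() if now is None else now
    entry = cache.get(name)

    # Normal TTL-based path: a fresh entry is served without asking upstream.
    if entry is not None and now < entry.expires_at:
        return entry.addresses, "fresh"

    status, addresses, ttl = query_upstream(name)
    if status is Upstream.OK:
        cache[name] = CacheEntry(addresses, now + ttl)
        return addresses, "upstream"

    # Upstream failed (timeout, NXDOMAIN, or empty answer): fall back to the
    # expired entry, but only while it is within the stale window.
    if entry is not None and now - entry.expires_at <= MAX_STALE_SECONDS:
        return entry.addresses, "stale"

    # Nothing usable in the cache: the failure is passed on to the client.
    return [], status.name.lower()
```

Fresh entries and successful upstream answers follow the usual path; the stale branch is reached only after an upstream failure, which keeps it a resiliency fallback rather than a replacement for TTL-based resolution.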
Why is this needed?
NodeLocalDNS currently does not prevent outages due to DNS resolution failures when the upstream DNS resolver(s):
- Are temporarily unavailable or not functioning
- Return transient NXDOMAIN responses
- Return responses without any IP addresses (empty A/AAAA answers)
These failures can directly propagate to workloads and cause application outages, even when valid DNS data existed shortly before the failure.
RFC 8767 already defines serving stale DNS responses when upstream DNS servers are unavailable or not functioning. In addition, we need stale responses to be served for temporary NXDOMAIN and empty responses.
Intermittent DNS failures are a well‑known source of cascading outages in distributed systems. A recent high‑profile AWS outage (caused by transient DNS resolution failures) highlighted how short‑lived DNS unavailability can lead to widespread service impact.
References
- AWS DNS‑related outage (high‑level incident summary)
- RFC 8767 – Serving Stale Data to Improve DNS Resiliency