Labels: area/reduce (Reduce operations like GroupByKey), backport (Back port the commit to previous stable release)
Description
Discussed in #3191
Originally posted by bobo February 4, 2026
We are having issues with the fence file blocking for too long after an abrupt pod failure. A 5-minute block is long enough to fill any buffers we have multiple times over.
I think it should be possible to lower the time quite a lot in a few different ways, but I'm open to suggestions.
I'm leaning towards the first option here, since it feels easiest, although I think Kubernetes Leases are the more correct option.
Options
Heartbeat timestamp in fence file
- Pod writes {pod_name}:{timestamp}, updates periodically
- New pod detects stale timestamp (~30s) and takes over
- Simple, no external dependencies
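To make the heartbeat option concrete, here is a minimal sketch in Python. It is an illustration only, not the project's implementation: the function names, the `.tmp` naming, and the 30-second threshold are assumptions taken from the bullets above, and timestamps are plain Unix seconds.

```python
import os
import time

STALE_AFTER_SECONDS = 30.0  # assumption: the ~30s staleness window from the option above


def write_heartbeat(path, pod_name, now=None):
    """Write '{pod_name}:{timestamp}' to the fence file via an atomic rename."""
    now = time.time() if now is None else now
    tmp_path = f"{path}.{pod_name}.tmp"  # hypothetical temp-file naming
    with open(tmp_path, "w") as f:
        f.write(f"{pod_name}:{now}")
        f.flush()
        os.fsync(f.fileno())
    # os.replace is atomic on POSIX, so readers never see a half-written file.
    os.replace(tmp_path, path)


def read_heartbeat(path):
    """Return (pod_name, timestamp), or None if the file is missing or malformed."""
    try:
        with open(path) as f:
            content = f.read().strip()
    except FileNotFoundError:
        return None
    pod_name, sep, raw_ts = content.rpartition(":")
    if not sep:
        return None
    try:
        return pod_name, float(raw_ts)
    except ValueError:
        return None


def can_take_over(path, now=None, stale_after=STALE_AFTER_SECONDS):
    """True if there is no valid heartbeat or the last one is older than stale_after."""
    now = time.time() if now is None else now
    heartbeat = read_heartbeat(path)
    if heartbeat is None:
        return True
    _, last_seen = heartbeat
    return now - last_seen > stale_after
```

Two caveats with this approach: the staleness check compares clocks across pods, so clock skew between nodes eats into the 30-second window, and the check followed by a takeover is not atomic, so two new pods could both decide to take over unless the storage layer offers some form of exclusive create or rename.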
Kubernetes Lease objects
- Use native k8s coordination primitive
- Automatic expiry built-in
- Requires API access from data plane
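For comparison, the expiry rule behind a Lease can be sketched without touching the API server. As far as I understand, a Lease is not deleted or cleared by Kubernetes when it lapses; contenders compare `renewTime` plus `leaseDurationSeconds` against the current time. The sketch below models only that decision in Python. The dataclass mirrors three fields of the `coordination.k8s.io/v1` LeaseSpec, while the function names are hypothetical and no real client calls are made.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class LeaseSpec:
    # Subset of coordination.k8s.io/v1 LeaseSpec fields, in snake_case.
    holder_identity: str
    lease_duration_seconds: int
    renew_time: datetime


def lease_expired(spec, now):
    """A lease counts as expired once renew_time + duration lies in the past."""
    return now > spec.renew_time + timedelta(seconds=spec.lease_duration_seconds)


def try_acquire(spec, candidate, now):
    """Return the spec the candidate would write, or None if another holder is still live."""
    if spec.holder_identity != candidate and not lease_expired(spec, now):
        return None
    # Either the candidate already holds the lease (renewal) or the lease has lapsed (takeover).
    return LeaseSpec(candidate, spec.lease_duration_seconds, now)


# Example: pod-a holds a 15-second lease; pod-b may only take over after it lapses.
t0 = datetime(2026, 2, 4, 12, 0, 0, tzinfo=timezone.utc)
current = LeaseSpec("pod-a", 15, t0)
```

In a real implementation the write would be an update against the API server carrying the object's `resourceVersion`, so that when two pods race for an expired lease only one update succeeds and the other receives a conflict.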
Configurable timeout
- Just add a config option to wait for a shorter time (with added risks)
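A rough sketch of what the configurable timeout could look like, in Python for illustration. The environment variable `FENCE_WAIT_TIMEOUT_SECONDS` and both function names are hypothetical, and the sketch assumes the fence is released by the file disappearing, which may not match how the fence actually works.

```python
import os
import time

DEFAULT_TIMEOUT_SECONDS = 300.0  # the 5-minute wait described in this issue


def fence_timeout_from_env(env=None):
    """Read the wait timeout from a (hypothetical) environment variable."""
    env = os.environ if env is None else env
    raw = env.get("FENCE_WAIT_TIMEOUT_SECONDS")
    if raw is None:
        return DEFAULT_TIMEOUT_SECONDS
    value = float(raw)
    if value <= 0:
        raise ValueError("fence timeout must be positive")
    return value


def wait_for_fence_release(path, timeout, poll_interval=1.0):
    """Return True if the fence file disappeared, False if the timeout elapsed first."""
    deadline = time.monotonic() + timeout
    while os.path.exists(path):
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            return False
        # Never sleep past the deadline.
        time.sleep(min(poll_interval, remaining))
    return True
```

A `False` return is where the added risk sits: the caller proceeds without knowing whether the previous owner is really gone, so a timeout that is too short can leave two pods working on the same data.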
Add pod identity to the file
- Check if the pod owning the file is still alive, and try to take over if it is not.
I'd be happy to implement any of them if this is something that is wanted.