Reducing the Fence File Recovery time


### Discussed in https://github.com/numaproj/numaflow/discussions/3191

<div type='discussions-op-text'>

<sup>Originally posted by **bobo** February  4, 2026</sup>

We are having issues with the fence file blocking for too long after an abrupt pod failure. Five minutes is long enough to fill any buffers we can have multiple times over.
I think it should be possible to lower that time quite a lot in a few different ways, but I'm open to suggestions.

I'm leaning towards the first option here, since it feels easiest, although I think Kubernetes Leases are the more correct option.

Options:

  Heartbeat timestamp in fence file
  - Pod writes {pod_name}:{timestamp}, updates periodically
  - New pod detects stale timestamp (~30s) and takes over
  - Simple, no external dependencies
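
A minimal sketch of the heartbeat idea, written in Python only for illustration. The function names, the `{pod_name}:{timestamp}` parsing and the 30-second threshold are assumptions taken from the description above, not existing Numaflow code.

```python
import os
import tempfile
import time
from typing import Optional, Tuple

# Assumed staleness threshold, taken from the "~30s" in the proposal.
STALE_AFTER_SECS = 30.0


def write_heartbeat(path: str, pod_name: str, now: Optional[float] = None) -> None:
    """Write '{pod_name}:{timestamp}' to the fence file via an atomic rename."""
    ts = time.time() if now is None else now
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        f.write(f"{pod_name}:{ts:.3f}")
        f.flush()
        os.fsync(f.fileno())
    # os.replace is atomic on the same filesystem, so readers never see a partial write.
    os.replace(tmp, path)


def read_heartbeat(path: str) -> Optional[Tuple[str, float]]:
    """Return (pod_name, timestamp), or None if the fence file does not exist."""
    try:
        with open(path) as f:
            raw = f.read().strip()
    except FileNotFoundError:
        return None
    pod_name, sep, ts = raw.rpartition(":")
    if not sep:
        raise ValueError(f"malformed fence file: {raw!r}")
    return pod_name, float(ts)


def can_take_over(path: str, now: Optional[float] = None) -> bool:
    """True if there is no fence file or its heartbeat is older than the threshold."""
    heartbeat = read_heartbeat(path)
    if heartbeat is None:
        return True
    _, ts = heartbeat
    current = time.time() if now is None else now
    return current - ts > STALE_AFTER_SECS
```

The owning pod would call `write_heartbeat` on a timer well below the threshold, and a starting pod would poll `can_take_over` before claiming the file. The sketch compares wall-clock timestamps written by one pod and read by another, so clock skew between nodes has to stay small relative to the threshold, and it does not address two new pods racing to take over at the same time.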

  Kubernetes Lease objects
  - Use native k8s coordination primitive
  - Automatic expiry built-in
  - Requires API access from data plane

  Configurable timeout
  - Just add a config option to wait for a shorter time (with added risks)

  Add pod identity to the file
  - Check if the pod owning the file is still alive, and try to take over if it is not.
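
The last two options can be combined into a single decision, roughly sketched below in Python. Everything here is hypothetical: `should_take_over`, `owner_id`, and the `is_owner_alive` callback (which would have to be backed by some liveness lookup, for example against the Kubernetes API) are illustrative names, not existing Numaflow code.

```python
from typing import Callable, Optional


def should_take_over(
    owner_id: Optional[str],
    my_id: str,
    is_owner_alive: Callable[[str], bool],
    waited_secs: float,
    max_wait_secs: float,
) -> bool:
    """Decide whether the current pod may claim the fence file.

    owner_id is the identity recorded in the fence file (None if no file exists).
    max_wait_secs is the configurable timeout from the third option.
    """
    # No owner recorded, or the file is already ours: nothing to wait for.
    if owner_id is None or owner_id == my_id:
        return True
    # The recorded owner is confirmed gone: take over immediately.
    if not is_owner_alive(owner_id):
        return True
    # Owner still looks alive: fall back to the configured timeout (with its added risks).
    return waited_secs >= max_wait_secs
```

For this to work, the identity written to the file would need to be unique per pod incarnation (a pod UID rather than a name that a replacement pod could reuse), otherwise a restarted pod cannot be told apart from the one that crashed.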


I'd be happy to implement any of them if this is something that is wanted.
</div>

Reducing the Fence File Recovery time #3192
