Reducing the Fence File Recovery time #3192

@vigith

Discussed in #3191

Originally posted by bobo February 4, 2026

We are having issues with the fence file blocking for too long after an abrupt pod failure. Five minutes is long enough to fill any buffers we have multiple times over.
I think it should be possible to lower that time quite a lot in a few different ways, but I'm open to suggestions.

I'm leaning towards the first option here, since it feels easiest, although I think Kubernetes Leases are the more correct option.

Options
Heartbeat timestamp in fence file

  • Pod writes {pod_name}:{timestamp}, updates periodically
  • New pod detects stale timestamp (~30s) and takes over
  • Simple, no external dependencies
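A minimal sketch of how the heartbeat option could work, assuming the fence file holds a single `{pod_name}:{timestamp}` line with a Unix timestamp. The function names and the `STALE_AFTER_SECONDS` constant are hypothetical and are not taken from the existing code base.

```python
import os
import tempfile
import time

STALE_AFTER_SECONDS = 30.0  # assumption: the ~30s threshold from the proposal


def write_heartbeat(path, pod_name, now=None):
    """Atomically write '{pod_name}:{timestamp}' to the fence file."""
    now = time.time() if now is None else now
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "w") as tmp:
            tmp.write(f"{pod_name}:{now}")
            tmp.flush()
            os.fsync(tmp.fileno())
        # Rename is atomic on the same filesystem, so readers never
        # observe a half-written heartbeat.
        os.replace(tmp_path, path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise


def read_heartbeat(path):
    """Return (pod_name, timestamp), or None if the fence file is missing."""
    try:
        with open(path) as f:
            content = f.read()
    except FileNotFoundError:
        return None
    # Split on the last colon so pod names containing ':' still parse.
    pod_name, _, timestamp = content.rpartition(":")
    return pod_name, float(timestamp)


def try_take_over(path, my_pod_name, now=None):
    """Claim the fence if it is free, already ours, or stale."""
    now = time.time() if now is None else now
    current = read_heartbeat(path)
    if current is not None:
        owner, last_beat = current
        if owner != my_pod_name and now - last_beat < STALE_AFTER_SECONDS:
            return False  # another pod is still heartbeating
    write_heartbeat(path, my_pod_name, now)
    return True
```

The owning pod would call `write_heartbeat` on a timer well below the stale threshold, and a replacement pod would poll `try_take_over`. The check-then-write step is not atomic across contenders, and the staleness test trusts that pods have roughly synchronized clocks, so a real implementation would need to address both.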

Kubernetes Lease objects

  • Use native k8s coordination primitive
  • Automatic expiry built-in
  • Requires API access from data plane
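To make the Lease option concrete, here is a rough sketch of the expiry rule only. It assumes the usual Lease spec fields (`holderIdentity`, `leaseDurationSeconds`, `renewTime`) and, as far as I understand it, that contenders decide expiry themselves by comparing `renewTime` plus the duration against the current time. The `LeaseSpec` dataclass is a local stand-in, not a Kubernetes client type, and no API calls are made.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class LeaseSpec:
    """Local stand-in for the relevant fields of a Lease spec."""
    holder_identity: str         # mirrors holderIdentity
    lease_duration_seconds: int  # mirrors leaseDurationSeconds
    renew_time: datetime         # mirrors renewTime


def lease_expired(spec, now):
    """True once the holder has not renewed within the lease duration."""
    deadline = spec.renew_time + timedelta(seconds=spec.lease_duration_seconds)
    return now > deadline


def can_acquire(spec, candidate, now):
    """A candidate may take the lease if it is absent, its own, or expired."""
    if spec is None:
        return True
    return spec.holder_identity == candidate or lease_expired(spec, now)
```

In a real implementation the read and the update would go through the Kubernetes API with optimistic concurrency, so that only one contender wins the takeover; that part is deliberately left out here.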

Configurable timeout

  • Just add a config option to wait for a shorter time (with added risks)
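As an illustration of the configurable timeout, the sketch below reads the wait time from an environment variable and falls back to the current 5 minutes. The variable name `FENCE_TIMEOUT_SECONDS` and the 5 second floor are assumptions made up for this example, not existing settings.

```python
import math
import os

DEFAULT_FENCE_TIMEOUT_SECONDS = 300.0  # the 5 minutes described above
MIN_FENCE_TIMEOUT_SECONDS = 5.0        # assumption: arbitrary safety floor


def fence_timeout_seconds(env=None):
    """Return the fence wait time, read from a hypothetical env variable."""
    env = os.environ if env is None else env
    raw = env.get("FENCE_TIMEOUT_SECONDS")  # hypothetical variable name
    if raw is None or raw.strip() == "":
        return DEFAULT_FENCE_TIMEOUT_SECONDS
    try:
        value = float(raw)
    except ValueError:
        raise ValueError(f"invalid FENCE_TIMEOUT_SECONDS: {raw!r}") from None
    if not math.isfinite(value):
        raise ValueError(f"invalid FENCE_TIMEOUT_SECONDS: {raw!r}")
    # Clamp to the floor so a typo cannot disable fencing entirely.
    return max(value, MIN_FENCE_TIMEOUT_SECONDS)
```

The floor is where the "added risks" show up: the shorter the wait, the more likely a slow but still-alive pod is treated as gone.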

Add pod identity to the file

  • Check if the pod owning the file is still alive, and try to take over if it is not.
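A small sketch of the pod identity option, assuming the fence file contains only the owning pod's name. The liveness check is passed in as a callable (`is_pod_alive`, a hypothetical name) because how to ask Kubernetes whether a pod still exists is an open design question here.

```python
import os


def read_owner(path):
    """Return the pod name stored in the fence file, or None if absent/empty."""
    try:
        with open(path) as f:
            return f.read().strip() or None
    except FileNotFoundError:
        return None


def claim_fence(path, my_pod_name, is_pod_alive):
    """Take over the fence unless a different, still-alive pod owns it."""
    owner = read_owner(path)
    if owner is not None and owner != my_pod_name and is_pod_alive(owner):
        return False
    # Write to a temp file and rename so readers never see partial content.
    tmp_path = f"{path}.{my_pod_name}.tmp"
    with open(tmp_path, "w") as tmp:
        tmp.write(my_pod_name)
    os.replace(tmp_path, path)
    return True
```

For example, `claim_fence("/data/fence", "pod-b", lambda name: False)` would succeed immediately when the previous owner is reported dead. Note that a restarted pod may reuse the same name, so a real check would probably compare something more unique than the name alone.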

I'd be happy to implement any of them if this is something that is wanted.

Labels: area/reduce, backport
