Skip to content

Bug: File Handle Resource Leak in SyncLock.acquire() #403

@Serhan-Asad

Description

@Serhan-Asad

Bug Description

File: pdd/sync_determine_operation.py
Line: 183
Type: Resource Leak

The SyncLock.acquire() method opens a file without guaranteeing it will be closed if certain exceptions occur, leading to file descriptor leaks in long-running processes.

Location

# pdd/sync_determine_operation.py:183-200

def acquire(self):
    try:
        # ... setup code ...
        
        # Line 183: File opened, FD assigned
        self.fd = open(self.lock_file, 'w')  # ← PROBLEM: No guaranteed cleanup
        
        # Lines 185-194: Operations that can raise various exceptions
        if HAS_FCNTL:
            fcntl.flock(self.fd.fileno(), fcntl.LOCK_EX | fcntl.LOCK_NB)
        elif HAS_MSVCRT:
            msvcrt.locking(self.fd.fileno(), msvcrt.LK_NBLCK, 1)
        
        self.fd.write(str(self.current_pid))
        self.fd.flush()
        
    except (IOError, OSError) as e:  # ← Only catches IOError and OSError!
        if self.fd:
            self.fd.close()
            self.fd = None
        raise TimeoutError(f"Failed to acquire lock: {e}")

Root Cause

The file is opened at line 183, but the exception handler only catches (IOError, OSError). If any other exception occurs (e.g., KeyboardInterrupt, RuntimeError, MemoryError), the file descriptor is never closed.

Impact

When Does This Leak Occur?

  1. User presses Ctrl+C during lock acquisition
  2. Unexpected exceptions (not IOError/OSError) during fcntl.flock()
  3. System interrupts during the lock acquisition process
  4. Out of memory or other system errors

Consequences

  • File descriptors accumulate over time in long-running processes
  • Eventually causes: OSError: [Errno 24] Too many open files
  • Process crashes requiring manual restart
  • Affects:
    • PDD server running continuously
    • Automated scripts with repeated sync operations
    • CI/CD pipelines
    • Users frequently interrupting operations

Proof of Concept

Automated Test Results

Created isolated test in bug_verification_workspace/simple_leak_test.py:

$ python3 simple_leak_test.py

======================================================================
SIMPLE FILE HANDLE LEAK TEST
======================================================================

Process PID: 9908

1. Initial state (no locks):
   Open .lock files: 0

2. Normal lock acquisition and release:
   ✓ Lock acquired
   Open .lock files: 1
   ✓ Lock released
   Open .lock files after release: 0

3. Interrupted lock acquisition (simulating Ctrl+C):
   Acquiring lock (will be interrupted)...
   ✓ KeyboardInterrupt occurred (as expected)
   Open .lock files after interrupt: 1

   🔴 LEAK DETECTED!
   The following .lock files are still open:
     Python  9908  9w  REG  .../test_interrupt_python.lock

======================================================================
RESULT: Bug confirmed - file handle leak exists
======================================================================

Manual Reproduction

import fcntl
from sync_determine_operation import SyncLock

# Simulate interrupt during lock acquisition
original_flock = fcntl.flock
fcntl.flock = lambda fd, op: (_ for _ in ()).throw(KeyboardInterrupt())

lock = SyncLock("test", "python")
try:
    lock.acquire()  # File opened at line 183, but never closed
except KeyboardInterrupt:
    pass

# Check with: lsof -p <PID> | grep .lock
# The .lock file will still be open!

Recommended Fix

Option 1: Try-Finally Block (Minimal Change)

def acquire(self):
    self.lock_file.parent.mkdir(parents=True, exist_ok=True)
    
    try:
        # ... re-entrancy checks ...
        
        self.lock_file.touch()
        self.fd = open(self.lock_file, 'w')
        
        try:
            # Critical section - must close file if anything fails
            if HAS_FCNTL:
                fcntl.flock(self.fd.fileno(), fcntl.LOCK_EX | fcntl.LOCK_NB)
            elif HAS_MSVCRT:
                msvcrt.locking(self.fd.fileno(), msvcrt.LK_NBLCK, 1)
            
            self.fd.write(str(self.current_pid))
            self.fd.flush()
        except:
            # Close file on ANY exception (not just IOError/OSError)
            if self.fd:
                self.fd.close()
                self.fd = None
            raise
            
    except (IOError, OSError) as e:
        if self.fd:
            self.fd.close()
            self.fd = None
        raise TimeoutError(f"Failed to acquire lock: {e}")

Option 2: Context Manager (More Pythonic)

Use context manager for initial file operations, then duplicate FD:

def acquire(self):
    # ... setup ...
    
    self.lock_file.touch()
    
    with open(self.lock_file, 'w') as temp_fd:
        if HAS_FCNTL:
            fcntl.flock(temp_fd.fileno(), fcntl.LOCK_EX | fcntl.LOCK_NB)
        elif HAS_MSVCRT:
            msvcrt.locking(temp_fd.fileno(), msvcrt.LK_NBLCK, 1)
        
        temp_fd.write(str(self.current_pid))
        temp_fd.flush()
        
        # Duplicate FD to keep lock alive after context exits
        self.fd = os.fdopen(os.dup(temp_fd.fileno()), 'w')

Option 3: Broader Exception Handling

Change exception handler to catch all exceptions:

except Exception as e:  # Instead of (IOError, OSError)
    if self.fd:
        self.fd.close()
        self.fd = None
    if isinstance(e, (IOError, OSError)):
        raise TimeoutError(f"Failed to acquire lock: {e}")
    raise

Related Issues

This is similar to issues that occur when resources are not properly managed in exception scenarios:

  • File handles left open
  • Locks not released
  • Memory not freed

Environment

  • Confirmed on: macOS (Darwin)
  • Detection method: lsof showing leaked file descriptors

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions