feat(reconnection): add slow probe mode after max attempts#94

Open
Milofax wants to merge 7 commits into chris-schra:develop from Milofax:feature/auto-reconnect-probe-mode

Milofax commented Jan 16, 2026

Summary

  • After max reconnection attempts are reached, the proxy now enters "slow probe mode" instead of giving up
  • Attempts to reconnect every 60 seconds, following the circuit breaker pattern
  • Proper cleanup on shutdown and on successful reconnection
  • Fixed: the transport cache is now cleared before each reconnect attempt so that a fresh transport is created

Motivation

When actively developing local MCP servers (e.g., custom integrations, Docker-based services), frequent restarts are common during development. Currently, when a backend MCP server restarts, mcp-funnel reaches max reconnection attempts and gives up completely - requiring users to restart their entire Claude Code / mcp-funnel session just to reconnect.

This is especially frustrating when:

  • Developing and testing MCP servers locally
  • Running MCP servers in Docker containers that restart on code changes
  • Managing multiple MCP servers where one might temporarily be down

Problem

When a backend MCP server restarts (e.g., Docker container restart), the proxy eventually reaches max reconnection attempts and gives up completely. Users then need to restart the entire mcp-funnel process to reconnect.

Solution

Implements the "slow probe" phase of the circuit breaker pattern:

  1. After exponential backoff exhausts max attempts, enter slow probe mode
  2. Attempt reconnection every 60 seconds
  3. Clear transport cache before each attempt (critical fix - closed transports were being reused)
  4. On success, stop probing and resume normal operation
  5. Emit events for monitoring (server.slow_probe_started, recovery logs)

This follows industry best practices (AWS, Microsoft) for resilient service connections - never give up entirely, just back off to a slower probe interval.
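
As a rough illustration of the steps above, here is a minimal sketch of a slow probe loop. It is not the code from this PR: the method and constant names are borrowed from the Changes list, while the `SlowProber` class, the `reconnect` and `clearTransportCache` callbacks, `probeOnce()`, and the other helpers are assumptions made for the example.

```typescript
// Hypothetical sketch; only the names startSlowProbeMode, stopSlowProbeMode,
// slowProbeTimers and SLOW_PROBE_INTERVAL_MS come from the PR description.
const SLOW_PROBE_INTERVAL_MS = 60_000;

type Timer = ReturnType<typeof setInterval>;

class SlowProber {
  private readonly slowProbeTimers = new Map<string, Timer>();

  constructor(
    // Assumed callback: resolves on a successful reconnect, rejects otherwise.
    private readonly reconnect: (serverName: string) => Promise<void>,
    // Assumed callback standing in for clearTransportCache().
    private readonly clearTransportCache: () => void,
    private readonly intervalMs: number = SLOW_PROBE_INTERVAL_MS,
  ) {}

  /** Step 1: called once exponential backoff has exhausted its attempts. */
  startSlowProbeMode(serverName: string): void {
    if (this.slowProbeTimers.has(serverName)) return; // never stack duplicate timers
    const timer = setInterval(() => {
      void this.probeOnce(serverName);
    }, this.intervalMs);
    this.slowProbeTimers.set(serverName, timer);
  }

  stopSlowProbeMode(serverName: string): void {
    const timer = this.slowProbeTimers.get(serverName);
    if (timer === undefined) return;
    clearInterval(timer);
    this.slowProbeTimers.delete(serverName);
  }

  /** Steps 2-4: one probe. Returns true if the reconnect succeeded. */
  async probeOnce(serverName: string): Promise<boolean> {
    this.clearTransportCache(); // step 3: never reuse a closed transport
    try {
      await this.reconnect(serverName);
      this.stopSlowProbeMode(serverName); // step 4: resume normal operation
      return true;
    } catch {
      return false; // stay in slow probe mode and try again on the next tick
    }
  }

  isProbing(serverName: string): boolean {
    return this.slowProbeTimers.has(serverName);
  }

  activeProbeCount(): number {
    return this.slowProbeTimers.size;
  }

  /** Clears every pending timer so the process can exit cleanly. */
  shutdown(): void {
    for (const timer of this.slowProbeTimers.values()) clearInterval(timer);
    this.slowProbeTimers.clear();
  }
}
```

Event emission (step 5) is left out to keep the sketch short. One design point worth noting: with `setInterval`, a probe that takes longer than the interval can overlap with the next one, so a real implementation may want to guard against concurrent attempts for the same server.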

Changes

server-connection-manager.ts:

  • Add slowProbeTimers Map and SLOW_PROBE_INTERVAL_MS constant (60s)
  • Add startSlowProbeMode() and stopSlowProbeMode() private methods
  • Modify onMaxAttemptsReached callback to call startSlowProbeMode instead of just deleting the manager
  • Update shutdown() to clean up slow probe timers
  • Update connectToSingleServer() to stop slow probe mode on successful connection
  • Call clearTransportCache() before each reconnection attempt

transport-cache.ts:

  • Add removeCachedTransport() helper function
  • Add invalidateCachedTransport() helper function

transport/index.ts:

  • Export new cache management functions
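
To make the caching problem concrete, the sketch below shows how a naive cache hands back a closed transport and how clearing or invalidating the entry forces a fresh one. The function names match the helpers listed above, but their signatures and semantics here are guesses, and the `Transport` interface and `getOrCreateTransport()` are hypothetical stand-ins rather than mcp-funnel's actual types.

```typescript
// Hypothetical minimal transport shape; the real mcp-funnel types may differ.
interface Transport {
  closed: boolean;
  close(): void;
}

const transportCache = new Map<string, Transport>();

/** Returns the cached transport for a key, creating one on a cache miss. */
function getOrCreateTransport(key: string, create: () => Transport): Transport {
  const cached = transportCache.get(key);
  // The bug described in the PR: a cached entry is returned even if it is closed.
  if (cached !== undefined) return cached;
  const fresh = create();
  transportCache.set(key, fresh);
  return fresh;
}

/** Assumed semantics: drop the entry without touching the transport itself. */
function removeCachedTransport(key: string): boolean {
  return transportCache.delete(key);
}

/** Assumed semantics: close the transport if it is still open, then drop the entry. */
function invalidateCachedTransport(key: string): boolean {
  const cached = transportCache.get(key);
  if (cached === undefined) return false;
  if (!cached.closed) cached.close();
  return transportCache.delete(key);
}

/** Drops every entry so the next lookup creates a fresh transport. */
function clearTransportCache(): void {
  transportCache.clear();
}
```

Under these assumptions, calling `clearTransportCache()` before a reconnect attempt guarantees that `getOrCreateTransport()` runs its factory again instead of returning the transport that died with the backend.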

Test plan

  • Unit tests for slow probe mode start/stop (9 tests passing)
  • E2E test: Graphiti Docker container restart → automatic recovery via slow probe ✅

🤖 Generated with Claude Code

Milofax and others added 6 commits January 4, 2026 09:27
The README.md files were creating a loop:
- /README.md → packages/mcp/README.md
- /packages/mcp/README.md → ../../README.md

This caused npm to fail with ELOOP when spawning MCP servers via npx.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Instead of giving up when max reconnection attempts are reached,
the proxy now enters "slow probe mode" - periodically attempting
to reconnect every 60 seconds. This follows the circuit breaker
pattern best practice.

Changes:
- Add slowProbeTimers Map and SLOW_PROBE_INTERVAL_MS constant
- Add startSlowProbeMode() and stopSlowProbeMode() methods
- Modify onMaxAttemptsReached to enter slow probe mode
- Update shutdown() to clean up slow probe timers
- Stop slow probe mode on successful connection

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Tests cover:
- Slow probe mode lifecycle (start, stop, prevent duplicates)
- Reconnection attempts every 60 seconds
- Successful reconnection stops probing
- Shutdown cleanup of all timers
- Manual disconnect request handling

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
The root cause of failed reconnections was transport caching - closed
transports were being reused instead of creating fresh ones.

Changes:
- Add clearTransportCache() call before each slow probe reconnect attempt
- Add invalidateCachedTransport() and removeCachedTransport() helpers
- Export cache management functions from transport index
- Update tests to mock clearTransportCache

This fix was verified with E2E testing: Graphiti container restart
now successfully auto-reconnects via slow probe mode.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Problem: Claude Code starts new MCP-Funnel processes for each session
but doesn't always send clean shutdown signals. This leads to zombie
processes accumulating (10+ processes, 1.2GB RAM observed).

Solution: Implement PID-file based singleton pattern:
- Check ~/.mcp-funnel.pid on startup
- Kill stale process if still running (SIGTERM then SIGKILL)
- Write current PID to file
- Clean up PID file on shutdown

This ensures only one MCP-Funnel instance runs at a time.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Milofax commented Jan 26, 2026

Additional Fix: Singleton Enforcement (Commit 74049f3)

Added PID-file based singleton pattern to prevent zombie processes.

Problem

Claude Code starts new MCP-Funnel processes per session but doesn't always send clean shutdown signals. This leads to zombie processes accumulating (observed: 10+ processes, ~1.2GB RAM).

Solution

  • Check ~/.mcp-funnel.pid on startup
  • Kill stale process if still running
  • Write current PID to file
  • Clean up on shutdown
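
A PID-file check along these lines could look roughly like the sketch below. Only the `~/.mcp-funnel.pid` path comes from this PR; the function names are hypothetical, and the example relies on the standard Node.js behavior that `process.kill(pid, 0)` tests for a process's existence without sending a signal.

```typescript
import * as fs from "node:fs";
import * as os from "node:os";
import * as path from "node:path";

// Path taken from the PR description; everything else here is illustrative.
const DEFAULT_PID_FILE = path.join(os.homedir(), ".mcp-funnel.pid");

/** Returns true if a process with this PID exists. */
function isProcessRunning(pid: number): boolean {
  try {
    process.kill(pid, 0); // signal 0: existence check only, nothing is delivered
    return true;
  } catch (err) {
    // EPERM means the process exists but we are not allowed to signal it.
    return (err as { code?: string }).code === "EPERM";
  }
}

/** Reads a PID from the file, or returns null if it is missing or malformed. */
function readPidFile(pidFile: string): number | null {
  try {
    const pid = Number.parseInt(fs.readFileSync(pidFile, "utf8").trim(), 10);
    return Number.isInteger(pid) && pid > 0 ? pid : null;
  } catch {
    return null;
  }
}

/**
 * Signals any previous instance recorded in the PID file, then records the
 * current process. Returns the PID that was signalled, or null if none was.
 */
function claimPidFile(pidFile: string = DEFAULT_PID_FILE): number | null {
  const previous = readPidFile(pidFile);
  let signalled: number | null = null;
  if (previous !== null && previous !== process.pid && isProcessRunning(previous)) {
    try {
      process.kill(previous, "SIGTERM");
      signalled = previous;
    } catch {
      // The process exited in the meantime or cannot be signalled; carry on.
    }
  }
  fs.writeFileSync(pidFile, String(process.pid));
  return signalled;
}

/** Removes the PID file on shutdown, but only if it still belongs to this process. */
function releasePidFile(pidFile: string = DEFAULT_PID_FILE): void {
  if (readPidFile(pidFile) === process.pid) fs.rmSync(pidFile, { force: true });
}
```

The sketch sends only `SIGTERM`; escalating to `SIGKILL` after a grace period, as the commit message describes, would need a short wait and a second liveness check. The read-then-write sequence is also not atomic, so two instances starting at the same moment could both believe they hold the file.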

Background

This is a known MCP ecosystem issue: typescript-sdk#208

The VS Code/LSP approach passes the host PID to the server for periodic liveness checks. Since Claude Code doesn't support this, a PID-file singleton is a practical workaround.

Previous singleton approach killed ALL other mcp-funnel processes,
breaking multi-terminal setups where each Claude Code session needs
its own mcp-funnel.

New approach: Only kill processes whose parent (Claude Code) no longer
exists. This cleans up true zombies while keeping active sessions alive.
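
The orphan check can be sketched as follows. The names `isOrphanedProcess()` and `cleanupOrphanedProcesses()` are taken from this commit's change list, but the signatures, the `ProcessInfo` shape, and the injected `isAlive`, `kill`, and `log` callbacks are assumptions made so the logic can be shown without platform-specific process enumeration.

```typescript
// Hypothetical record for one candidate mcp-funnel process.
interface ProcessInfo {
  pid: number;
  ppid: number; // parent PID, e.g. the Claude Code session that spawned it
}

/**
 * A process counts as orphaned when its parent no longer exists.
 * Treating ppid <= 1 as orphaned is an assumption: on Unix an orphan is
 * usually re-parented to init (PID 1), but subreapers and containers can
 * behave differently.
 */
function isOrphanedProcess(proc: ProcessInfo, isAlive: (pid: number) => boolean): boolean {
  return proc.ppid <= 1 || !isAlive(proc.ppid);
}

/**
 * Kills only orphaned candidates and returns their PIDs.
 * Processes whose parent is still alive are logged and left running.
 */
function cleanupOrphanedProcesses(
  candidates: ProcessInfo[],
  isAlive: (pid: number) => boolean,
  kill: (pid: number) => void,
  log: (message: string) => void = () => {},
): number[] {
  const killed: number[] = [];
  for (const proc of candidates) {
    if (isOrphanedProcess(proc, isAlive)) {
      kill(proc.pid);
      killed.push(proc.pid);
    } else {
      log(`skipping ${proc.pid}: parent ${proc.ppid} is still active`);
    }
  }
  return killed;
}
```

In a real process, `isAlive` could be backed by `process.kill(pid, 0)` and `kill` by `process.kill(pid, "SIGTERM")`; how the candidate list and parent PIDs are gathered is platform-specific and not covered here.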

- Add isOrphanedProcess() to check if parent PID exists
- Rename enforceSingleton() to cleanupOrphanedProcesses()
- Log when skipping processes with active parents

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@chris-schra
Owner

thanks! Sorry, was totally busy - still interested in merging this?

Milofax commented Mar 4, 2026

Yes, absolutely still interested! Ready to merge whenever you are. Let me know if you need any changes.
