feat(reconnection): add slow probe mode after max attempts #94
Open
Milofax wants to merge 7 commits into chris-schra:develop
The README.md files were creating a loop:
- /README.md → packages/mcp/README.md
- /packages/mcp/README.md → ../../README.md

This caused npm to fail with ELOOP when spawning MCP servers via npx.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Instead of giving up when max reconnection attempts are reached, the proxy now enters "slow probe mode" - periodically attempting to reconnect every 60 seconds. This follows the circuit breaker pattern best practice.

Changes:
- Add slowProbeTimers Map and SLOW_PROBE_INTERVAL_MS constant
- Add startSlowProbeMode() and stopSlowProbeMode() methods
- Modify onMaxAttemptsReached to enter slow probe mode
- Update shutdown() to clean up slow probe timers
- Stop slow probe mode on successful connection

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
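The timer bookkeeping described in this commit could look roughly like the sketch below. Only slowProbeTimers, SLOW_PROBE_INTERVAL_MS, startSlowProbeMode(), stopSlowProbeMode() and shutdown() are names taken from the commit message. The class name, the tryReconnect callback, and the isProbing() helper are hypothetical, and the real methods live on the connection manager with signatures that are not shown here.

```typescript
// Interval from the commit message: one reconnect attempt every 60 seconds.
const SLOW_PROBE_INTERVAL_MS = 60 * 1000;

// Hypothetical callback: resolves to true once the server is reachable again.
type TryReconnect = () => Promise<boolean>;

class SlowProbeSketch {
  // One timer per server name, so each backend is probed independently.
  private slowProbeTimers = new Map<string, ReturnType<typeof setInterval>>();
  private intervalMs: number;

  constructor(intervalMs: number = SLOW_PROBE_INTERVAL_MS) {
    this.intervalMs = intervalMs;
  }

  startSlowProbeMode(serverName: string, tryReconnect: TryReconnect): void {
    // Prevent duplicate timers for the same server.
    if (this.slowProbeTimers.has(serverName)) return;
    const timer = setInterval(() => {
      tryReconnect()
        .then((connected) => {
          // A successful reconnection ends slow probe mode.
          if (connected) this.stopSlowProbeMode(serverName);
        })
        .catch(() => {
          // A failed probe is expected; keep the timer running.
        });
    }, this.intervalMs);
    this.slowProbeTimers.set(serverName, timer);
  }

  stopSlowProbeMode(serverName: string): void {
    const timer = this.slowProbeTimers.get(serverName);
    if (timer === undefined) return;
    clearInterval(timer);
    this.slowProbeTimers.delete(serverName);
  }

  // Clear every outstanding timer so the process can exit cleanly.
  shutdown(): void {
    this.slowProbeTimers.forEach((timer) => clearInterval(timer));
    this.slowProbeTimers.clear();
  }

  isProbing(serverName: string): boolean {
    return this.slowProbeTimers.has(serverName);
  }
}
```

Keying the timers by server name keeps the duplicate check and the shutdown cleanup to simple Map operations.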
Tests cover:
- Slow probe mode lifecycle (start, stop, prevent duplicates)
- Reconnection attempts every 60 seconds
- Successful reconnection stops probing
- Shutdown cleanup of all timers
- Manual disconnect request handling

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
The root cause of failed reconnections was transport caching - closed transports were being reused instead of creating fresh ones.

Changes:
- Add clearTransportCache() call before each slow probe reconnect attempt
- Add invalidateCachedTransport() and removeCachedTransport() helpers
- Export cache management functions from transport index
- Update tests to mock clearTransportCache

This fix was verified with E2E testing: Graphiti container restart now successfully auto-reconnects via slow probe mode.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
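To illustrate the failure mode, here is a minimal sketch of a keyed transport cache that hands back a closed transport until the cache is cleared. The function names clearTransportCache(), invalidateCachedTransport() and removeCachedTransport() come from the commit message, but their signatures and the difference between "invalidate" and "remove" are assumptions. SketchTransport, createTransport() and getOrCreateTransport() are hypothetical stand-ins for the real transport code.

```typescript
// Hypothetical minimal transport shape; the real transport type is not shown in the PR.
interface SketchTransport {
  id: number;
  closed: boolean;
  close(): void;
}

const transportCache = new Map<string, SketchTransport>();
let nextTransportId = 1;

function createTransport(): SketchTransport {
  const transport: SketchTransport = {
    id: nextTransportId++,
    closed: false,
    close() {
      transport.closed = true;
    },
  };
  return transport;
}

// Returns the cached transport if present, even when it has already been
// closed. This is the reuse problem the commit describes.
function getOrCreateTransport(key: string): SketchTransport {
  const cached = transportCache.get(key);
  if (cached !== undefined) return cached;
  const fresh = createTransport();
  transportCache.set(key, fresh);
  return fresh;
}

// Assumed semantics: drop the entry without touching the transport.
function removeCachedTransport(key: string): boolean {
  return transportCache.delete(key);
}

// Assumed semantics: close the transport, then drop the entry.
function invalidateCachedTransport(key: string): boolean {
  const cached = transportCache.get(key);
  if (cached === undefined) return false;
  cached.close();
  return transportCache.delete(key);
}

// Close and forget everything, so the next lookup builds a fresh transport.
function clearTransportCache(): void {
  transportCache.forEach((transport) => transport.close());
  transportCache.clear();
}
```

Calling clearTransportCache() before each probe trades a little extra setup cost for the guarantee that a reconnect attempt never starts from a dead transport.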
Problem: Claude Code starts new MCP-Funnel processes for each session but doesn't always send clean shutdown signals. This leads to zombie processes accumulating (10+ processes, 1.2GB RAM observed).

Solution: Implement PID-file based singleton pattern:
- Check ~/.mcp-funnel.pid on startup
- Kill stale process if still running (SIGTERM then SIGKILL)
- Write current PID to file
- Clean up PID file on shutdown

This ensures only one MCP-Funnel instance runs at a time.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
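The four steps above can be sketched with Node's built-in modules as follows. This is an illustration under assumptions, not the code from the commit: every function name here is hypothetical, the sketch sends only SIGTERM (the commit escalates to SIGKILL), and it omits any locking against two instances starting at the same moment.

```typescript
import * as fs from "node:fs";
import * as os from "node:os";
import * as path from "node:path";

// PID file location taken from the commit message.
const DEFAULT_PID_FILE = path.join(os.homedir(), ".mcp-funnel.pid");

// Signal 0 performs the existence check without delivering a signal.
function isProcessAlive(pid: number): boolean {
  try {
    process.kill(pid, 0);
    return true;
  } catch (err) {
    // EPERM means the process exists but is owned by another user.
    return (err as { code?: string }).code === "EPERM";
  }
}

// Returns the recorded PID, or null if the file is missing or malformed.
function readPidFile(pidFile: string): number | null {
  try {
    const pid = Number.parseInt(fs.readFileSync(pidFile, "utf8").trim(), 10);
    return Number.isInteger(pid) && pid > 0 ? pid : null;
  } catch {
    return null;
  }
}

// Terminates a still-running previous instance (if any) and records our PID.
// Returns the PID that was signalled, or null if there was none.
function claimPidFile(pidFile: string = DEFAULT_PID_FILE): number | null {
  const previous = readPidFile(pidFile);
  let signalled: number | null = null;
  if (previous !== null && previous !== process.pid && isProcessAlive(previous)) {
    try {
      // A fuller version would wait briefly and then send SIGKILL.
      process.kill(previous, "SIGTERM");
      signalled = previous;
    } catch {
      // The process exited between the check and the signal; nothing to do.
    }
  }
  fs.writeFileSync(pidFile, String(process.pid));
  return signalled;
}

// Remove the PID file on shutdown, but only if it still belongs to us.
function releasePidFile(pidFile: string = DEFAULT_PID_FILE): void {
  if (readPidFile(pidFile) === process.pid) {
    fs.rmSync(pidFile, { force: true });
  }
}
```

A caller would run claimPidFile() once at startup and releasePidFile() from its shutdown handler.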
Author
Additional Fix: Singleton Enforcement (Commit 74049f3)
Added PID-file based singleton pattern to prevent zombie processes.

Problem
Claude Code starts new MCP-Funnel processes per session but doesn't always send clean shutdown signals. This leads to zombie processes accumulating (observed: 10+ processes, ~1.2GB RAM).

Solution

Background
This is a known MCP ecosystem issue: typescript-sdk#208. The VS Code/LSP approach passes host PID for periodic liveness checks. Since Claude Code doesn't support this, PID-file singleton is a practical workaround.
Previous singleton approach killed ALL other mcp-funnel processes, breaking multi-terminal setups where each Claude Code session needs its own mcp-funnel.

New approach: Only kill processes whose parent (Claude Code) no longer exists. This cleans up true zombies while keeping active sessions alive.

- Add isOrphanedProcess() to check if parent PID exists
- Rename enforceSignleton() to cleanupOrphanedProcesses()
- Log when skipping processes with active parents

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
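One way the orphan check could work on a POSIX system is sketched below. Only the name isOrphanedProcess() comes from the commit message. The parameters, the getParentPid() helper (which shells out to ps), and the rule that a parent PID of 1 counts as orphaned are all assumptions. The last one is a heuristic based on orphans being reparented to init, and it would misfire for processes legitimately started by PID 1, for example inside a container.

```typescript
import { execFileSync } from "node:child_process";

// Signal 0 performs the existence check without delivering a signal.
function isProcessAlive(pid: number): boolean {
  try {
    process.kill(pid, 0);
    return true;
  } catch (err) {
    return (err as { code?: string }).code === "EPERM";
  }
}

// Hypothetical helper: asks ps for the parent PID of a process.
// Returns null when the process does not exist or ps is unavailable.
function getParentPid(pid: number): number | null {
  try {
    const out = execFileSync("ps", ["-o", "ppid=", "-p", String(pid)], {
      encoding: "utf8",
      stdio: ["ignore", "pipe", "ignore"],
    });
    const ppid = Number.parseInt(out.trim(), 10);
    return Number.isInteger(ppid) ? ppid : null;
  } catch {
    return null;
  }
}

// A process counts as orphaned when its original parent is gone.
// The lookup functions are injectable so the decision logic can be tested.
function isOrphanedProcess(
  pid: number,
  lookupParent: (pid: number) => number | null = getParentPid,
  parentAlive: (pid: number) => boolean = isProcessAlive,
): boolean {
  const ppid = lookupParent(pid);
  if (ppid === null) return false; // process itself is gone; nothing to clean up
  if (ppid <= 1) return true; // assumed: reparented to init after the parent died
  return !parentAlive(ppid);
}
```

A cleanup pass would then terminate only the candidates for which isOrphanedProcess() returns true and log the ones it skips.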
Owner
thanks! Sorry, was totally busy - still interested in merging this?
Author
Yes, absolutely still interested! Ready to merge whenever you are. Let me know if you need any changes.
Summary
Motivation
When actively developing local MCP servers (e.g., custom integrations, Docker-based services), frequent restarts are common during development. Currently, when a backend MCP server restarts, mcp-funnel reaches max reconnection attempts and gives up completely - requiring users to restart their entire Claude Code / mcp-funnel session just to reconnect.
This is especially frustrating when:
Problem
When a backend MCP server restarts (e.g., Docker container restart), the proxy eventually reaches max reconnection attempts and gives up completely. Users then need to restart the entire mcp-funnel process to reconnect.
Solution
Implements the "slow probe" phase of the circuit breaker pattern:
- … (server.slow_probe_started, recovery logs)

This follows industry best practices (AWS, Microsoft) for resilient service connections - never give up entirely, just back off to a slower probe interval.
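The idea of "back off, then probe slowly forever" can be expressed as a small pure function that picks the delay before the next attempt. This is a sketch under assumptions: the RetryPolicy fields, the function name, and the exponential shape of the initial backoff are hypothetical, and only the 60-second slow probe interval comes from this PR.

```typescript
// Slow probe interval from the PR description (60s).
const SLOW_PROBE_INTERVAL_MS = 60 * 1000;

// Hypothetical policy shape; the real reconnection config is not shown in the PR.
interface RetryPolicy {
  maxAttempts: number;
  initialDelayMs: number;
  backoffMultiplier: number;
  maxDelayMs: number;
}

interface NextAttempt {
  phase: "backoff" | "slow_probe";
  delayMs: number;
}

// failedAttempts is the number of reconnection attempts that have already failed.
function nextReconnectDelay(failedAttempts: number, policy: RetryPolicy): NextAttempt {
  // Once the normal budget is spent, keep probing at a fixed slow interval
  // instead of giving up.
  if (failedAttempts >= policy.maxAttempts) {
    return { phase: "slow_probe", delayMs: SLOW_PROBE_INTERVAL_MS };
  }
  // Assumed exponential backoff, capped at maxDelayMs.
  const delay = policy.initialDelayMs * Math.pow(policy.backoffMultiplier, failedAttempts);
  return { phase: "backoff", delayMs: Math.min(delay, policy.maxDelayMs) };
}
```

With a policy of 3 attempts, a 1-second initial delay and a multiplier of 2, the first three retries fall in the backoff phase and every later one falls in the slow probe phase.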
Changes
server-connection-manager.ts:
- Add slowProbeTimers Map and SLOW_PROBE_INTERVAL_MS constant (60s)
- Add startSlowProbeMode() and stopSlowProbeMode() private methods
- Modify onMaxAttemptsReached callback to call startSlowProbeMode instead of just deleting the manager
- Update shutdown() to clean up slow probe timers
- Update connectToSingleServer() to stop slow probe mode on successful connection
- Call clearTransportCache() before each reconnection attempt

transport-cache.ts:
- Add removeCachedTransport() helper function
- Add invalidateCachedTransport() helper function

transport/index.ts:
- Export the cache management functions
Test plan
🤖 Generated with Claude Code