Commit b8fdf8a

sirtimid and claude authored
feat(ocap-kernel): add permanent failure detection for reconnection (#789)
Closes #688

## Summary

- Add error pattern tracking to `ReconnectionManager` to detect permanently unreachable peers
- When the same error code (ECONNREFUSED, EHOSTUNREACH, ENOTFOUND, or ENETUNREACH) occurs consecutively N times (default 5), the peer is marked as permanently failed
- Permanent failures stop reconnection attempts, avoiding wasted retries against unreachable peers

## Changes

**kernel-errors:**

- Add `getNetworkErrorCode()` helper to extract error codes from errors

**ocap-kernel:**

- Extend `ReconnectionState` with `errorHistory` and `permanentlyFailed` fields
- Add `recordError()`, `isPermanentlyFailed()`, and `clearPermanentFailure()` methods to `ReconnectionManager`
- Update `startReconnection()` to return `false` for permanently failed peers and reset error history
- Integrate error recording into the reconnection lifecycle
- Check permanent failure status before attempting reconnection

## Test plan

- [x] Unit tests for the `getNetworkErrorCode` helper
- [x] Unit tests for error tracking in `ReconnectionManager`
- [x] Unit tests for permanent failure detection (consecutive identical errors)
- [x] Unit tests for clearing permanent failure state
- [x] Integration tests for the reconnection lifecycle with permanent failure
- [x] All existing tests pass

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---

> [!NOTE]
> **Medium Risk**
> Changes reconnection control flow to stop retry loops based on tracked error patterns and modifies `startReconnection` semantics (it now returns a boolean), which could alter peer recovery behavior if errors are misclassified or the integration is wrong.
>
> **Overview**
> Adds permanent-failure detection to the remote reconnection flow: the `ReconnectionManager` now tracks recent retryable network error codes per peer and marks a peer as permanently failed after a configurable threshold of consecutive identical "unreachable" codes.
>
> The reconnection lifecycle records retryable failures (via the new `kernel-errors` export `getNetworkErrorCode`) and short-circuits future attempts when a peer is permanently failed, calling `onRemoteGiveUp`; manual `reconnectPeer` now clears permanent-failure state before restarting. It also treats `ENOTFOUND` as retryable to support this detection, with expanded unit and integration test coverage.

---

Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
1 parent 353bfec commit b8fdf8a

12 files changed, +948 −19 lines
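The `ReconnectionManager` changes themselves are among the 12 changed files but are not included in the diff excerpt below. As a rough sketch only, here is one shape the consecutive-error tracking described in the summary could take. Everything here is hypothetical except what the summary pins down: the method names (`recordError`, `isPermanentlyFailed`, `clearPermanentFailure`), the state fields (`errorHistory`, `permanentlyFailed`), the four "unreachable" codes, and the default threshold of 5.

```ts
// Hypothetical sketch of the per-peer tracking described in the PR summary;
// the real ReconnectionManager diff is not shown on this page.
const UNREACHABLE_CODES = new Set([
  'ECONNREFUSED',
  'EHOSTUNREACH',
  'ENOTFOUND',
  'ENETUNREACH',
]);

type TrackerState = {
  errorHistory: string[];
  permanentlyFailed: boolean;
};

class ReconnectionTracker {
  readonly #states = new Map<string, TrackerState>();

  constructor(readonly threshold = 5) {}

  recordError(peerId: string, code: string): void {
    const state = this.#getState(peerId);
    // Only "unreachable" codes count toward permanent failure; any other
    // code breaks the consecutive run.
    if (!UNREACHABLE_CODES.has(code)) {
      state.errorHistory = [];
      return;
    }
    // A different unreachable code also restarts the run: the failure must
    // be the *same* code N times in a row.
    if (state.errorHistory[0] !== undefined && state.errorHistory[0] !== code) {
      state.errorHistory = [];
    }
    state.errorHistory.push(code);
    if (state.errorHistory.length >= this.threshold) {
      state.permanentlyFailed = true;
    }
  }

  isPermanentlyFailed(peerId: string): boolean {
    return this.#states.get(peerId)?.permanentlyFailed ?? false;
  }

  clearPermanentFailure(peerId: string): void {
    const state = this.#getState(peerId);
    state.permanentlyFailed = false;
    state.errorHistory = [];
  }

  #getState(peerId: string): TrackerState {
    let state = this.#states.get(peerId);
    if (!state) {
      state = { errorHistory: [], permanentlyFailed: false };
      this.#states.set(peerId, state);
    }
    return state;
  }
}
```

The key property is that any non-unreachable code, or a different unreachable code, resets the run, so only N consecutive identical codes trip the permanent-failure flag.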

packages/kernel-errors/src/index.test.ts

Lines changed: 1 addition & 0 deletions

```diff
@@ -20,6 +20,7 @@ describe('index', () => {
       'VatAlreadyExistsError',
       'VatDeletedError',
       'VatNotFoundError',
+      'getNetworkErrorCode',
       'isMarshaledError',
       'isMarshaledOcapError',
       'isOcapError',
```

packages/kernel-errors/src/index.ts

Lines changed: 1 addition & 0 deletions

```diff
@@ -27,4 +27,5 @@ export { unmarshalError } from './marshal/unmarshalError.ts';
 export { isMarshaledError } from './marshal/isMarshaledError.ts';
 export { isMarshaledOcapError } from './marshal/isMarshaledOcapError.ts';
 export { isRetryableNetworkError } from './utils/isRetryableNetworkError.ts';
+export { getNetworkErrorCode } from './utils/getNetworkErrorCode.ts';
 export { isResourceLimitError } from './utils/isResourceLimitError.ts';
```
packages/kernel-errors/src/utils/getNetworkErrorCode.test.ts

Lines changed: 88 additions & 0 deletions (new file)

```ts
import { describe, it, expect } from 'vitest';

import { getNetworkErrorCode } from './getNetworkErrorCode.ts';

describe('getNetworkErrorCode', () => {
  describe('Node.js network error codes', () => {
    it.each([
      { code: 'ECONNRESET' },
      { code: 'ETIMEDOUT' },
      { code: 'EPIPE' },
      { code: 'ECONNREFUSED' },
      { code: 'EHOSTUNREACH' },
      { code: 'ENETUNREACH' },
      { code: 'ENOTFOUND' },
    ])('returns $code from error with code property', ({ code }) => {
      const error = new Error('Network error') as Error & { code: string };
      error.code = code;
      expect(getNetworkErrorCode(error)).toBe(code);
    });
  });

  describe('libp2p and other named errors', () => {
    it.each([
      { name: 'DialError' },
      { name: 'TransportError' },
      { name: 'MuxerClosedError' },
      { name: 'WebRTCDialError' },
    ])('returns $name from error with name property', ({ name }) => {
      const error = new Error('Connection failed');
      error.name = name;
      expect(getNetworkErrorCode(error)).toBe(name);
    });

    it('prefers code over name when both are present', () => {
      const error = Object.assign(new Error('Network error'), {
        code: 'ECONNREFUSED',
        name: 'DialError',
      });
      expect(getNetworkErrorCode(error)).toBe('ECONNREFUSED');
    });
  });

  describe('relay reservation errors', () => {
    it('returns name for Error with NO_RESERVATION in message (name takes precedence)', () => {
      const error = new Error(
        'failed to connect via relay with status NO_RESERVATION',
      );
      // name ('Error') takes precedence over message parsing
      expect(getNetworkErrorCode(error)).toBe('Error');
    });

    it('returns NO_RESERVATION when error has empty name', () => {
      const error = Object.assign(
        new Error('failed to connect via relay with status NO_RESERVATION'),
        { name: '' },
      );
      expect(getNetworkErrorCode(error)).toBe('NO_RESERVATION');
    });

    it('returns name when both name and NO_RESERVATION message are present', () => {
      const error = Object.assign(
        new Error('failed to connect via relay with status NO_RESERVATION'),
        { name: 'InvalidMessageError' },
      );
      // name takes precedence over message parsing
      expect(getNetworkErrorCode(error)).toBe('InvalidMessageError');
    });
  });

  describe('unknown errors', () => {
    it.each([
      { name: 'null', value: null },
      { name: 'undefined', value: undefined },
      { name: 'string', value: 'error string' },
      { name: 'number', value: 42 },
      { name: 'plain object', value: { message: 'error' } },
      { name: 'empty error name and code', value: { name: '', code: '' } },
    ])('returns UNKNOWN for $name', ({ value }) => {
      expect(getNetworkErrorCode(value)).toBe('UNKNOWN');
    });

    it("returns the name 'Error' for a generic Error with no code", () => {
      const error = new Error('Generic error');
      // Generic Error has name 'Error', so it returns 'Error'
      expect(getNetworkErrorCode(error)).toBe('Error');
    });
  });
});
```
packages/kernel-errors/src/utils/getNetworkErrorCode.ts

Lines changed: 36 additions & 0 deletions (new file)

```ts
/**
 * Extract a network error code from an error object.
 *
 * Returns error codes like 'ECONNREFUSED', 'ETIMEDOUT', etc., or
 * the error name for libp2p errors, or 'UNKNOWN' for unrecognized errors.
 *
 * @param error - The error to extract the code from.
 * @returns The error code string.
 */
export function getNetworkErrorCode(error: unknown): string {
  const anyError = error as {
    code?: string;
    name?: string;
    message?: string;
  };

  // Node.js network error codes (ECONNREFUSED, ETIMEDOUT, etc.)
  if (typeof anyError?.code === 'string' && anyError.code.length > 0) {
    return anyError.code;
  }

  // libp2p errors and other named errors
  if (typeof anyError?.name === 'string' && anyError.name.length > 0) {
    return anyError.name;
  }

  // Check message for relay reservation errors
  if (
    typeof anyError?.message === 'string' &&
    anyError.message.includes('NO_RESERVATION')
  ) {
    return 'NO_RESERVATION';
  }

  return 'UNKNOWN';
}
```
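Since the implementation above checks `code`, then `name`, then the message, in that order, a few calls illustrate the precedence callers can rely on (the values mirror the unit tests):

```ts
import { getNetworkErrorCode } from '@metamask/kernel-errors';

// `code` wins over `name` when both are set.
const refused = Object.assign(new Error('dial failed'), {
  code: 'ECONNREFUSED',
  name: 'DialError',
});
console.log(getNetworkErrorCode(refused)); // 'ECONNREFUSED'

// With no `code`, a plain Error falls through to its name, which is 'Error'.
console.log(getNetworkErrorCode(new Error('boom'))); // 'Error'

// The NO_RESERVATION message check is only reached when the name is empty.
const relay = Object.assign(
  new Error('failed to connect via relay with status NO_RESERVATION'),
  { name: '' },
);
console.log(getNetworkErrorCode(relay)); // 'NO_RESERVATION'

// Non-error values map to 'UNKNOWN'.
console.log(getNetworkErrorCode('not an error')); // 'UNKNOWN'
```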

packages/kernel-errors/src/utils/isRetryableNetworkError.test.ts

Lines changed: 1 addition & 0 deletions

```diff
@@ -34,6 +34,7 @@ describe('isRetryableNetworkError', () => {
     { code: 'ECONNREFUSED', description: 'connection refused' },
     { code: 'EHOSTUNREACH', description: 'no route to host' },
     { code: 'ENETUNREACH', description: 'network unreachable' },
+    { code: 'ENOTFOUND', description: 'DNS lookup failed' },
   ])('returns true for $code ($description)', ({ code }) => {
     const error = new Error('Network error') as Error & { code: string };
     error.code = code;
```

packages/kernel-errors/src/utils/isRetryableNetworkError.ts

Lines changed: 5 additions & 1 deletion

```diff
@@ -21,14 +21,18 @@ export function isRetryableNetworkError(error: unknown): boolean {
   }

   // Node.js network error codes
+  // Note: ENOTFOUND (DNS lookup failed) is included to allow permanent failure
+  // detection to work - after multiple consecutive ENOTFOUND errors, the peer
+  // will be marked as permanently failed rather than giving up immediately.
   const code = anyError?.code;
   if (
     code === 'ECONNRESET' ||
     code === 'ETIMEDOUT' ||
     code === 'EPIPE' ||
     code === 'ECONNREFUSED' ||
     code === 'EHOSTUNREACH' ||
-    code === 'ENETUNREACH'
+    code === 'ENETUNREACH' ||
+    code === 'ENOTFOUND'
   ) {
     return true;
   }
```
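The effect of the change above is easy to demonstrate: a DNS-lookup failure is now classified as retryable, so the reconnection loop keeps going and the consecutive-error counter, rather than a single failed lookup, decides when to give up:

```ts
import { isRetryableNetworkError } from '@metamask/kernel-errors';

const dnsError = Object.assign(
  new Error('getaddrinfo ENOTFOUND example.com'),
  { code: 'ENOTFOUND' },
);

// Previously false (give up immediately); now true, so permanent-failure
// detection can observe N consecutive ENOTFOUND results first.
console.log(isRetryableNetworkError(dnsError)); // true
```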

packages/ocap-kernel/src/remotes/platform/reconnection-lifecycle.test.ts

Lines changed: 168 additions & 2 deletions

```diff
@@ -17,14 +17,15 @@ vi.mock('@metamask/kernel-utils', async () => {
   };
 });

-// Mock kernel-errors for isRetryableNetworkError
+// Mock kernel-errors for isRetryableNetworkError and getNetworkErrorCode
 vi.mock('@metamask/kernel-errors', async () => {
   const actual = await vi.importActual<typeof kernelErrors>(
     '@metamask/kernel-errors',
   );
   return {
     ...actual,
     isRetryableNetworkError: vi.fn(),
+    getNetworkErrorCode: vi.fn().mockReturnValue('ECONNREFUSED'),
   };
 });

@@ -75,9 +76,11 @@ describe('reconnection-lifecycle', () => {
       incrementAttempt: vi.fn().mockReturnValue(1),
       decrementAttempt: vi.fn(),
       calculateBackoff: vi.fn().mockReturnValue(100),
-      startReconnection: vi.fn(),
+      startReconnection: vi.fn().mockReturnValue(true),
       stopReconnection: vi.fn(),
       resetBackoff: vi.fn(),
+      isPermanentlyFailed: vi.fn().mockReturnValue(false),
+      recordError: vi.fn(),
     },
     maxRetryAttempts: 3,
     onRemoteGiveUp: vi.fn(),
@@ -616,5 +619,168 @@
       // stopReconnection should be called on success
       expect(deps.reconnectionManager.stopReconnection).toHaveBeenCalled();
     });
+
+    describe('permanent failure detection', () => {
+      it('gives up when peer is permanently failed at start of loop', async () => {
+        (
+          deps.reconnectionManager.isPermanentlyFailed as ReturnType<
+            typeof vi.fn
+          >
+        ).mockReturnValue(true);
+
+        const lifecycle = makeReconnectionLifecycle(deps);
+
+        await lifecycle.attemptReconnection('peer1');
+
+        expect(deps.reconnectionManager.stopReconnection).toHaveBeenCalledWith(
+          'peer1',
+        );
+        expect(deps.onRemoteGiveUp).toHaveBeenCalledWith('peer1');
+        expect(deps.logger.log).toHaveBeenCalledWith(
+          expect.stringContaining('permanently failed'),
+        );
+      });
+
+      it('records error after failed dial attempt', async () => {
+        const error = new Error('Connection refused');
+        (error as Error & { code: string }).code = 'ECONNREFUSED';
+        (deps.dialPeer as ReturnType<typeof vi.fn>).mockRejectedValueOnce(
+          error,
+        );
+        (
+          kernelErrors.isRetryableNetworkError as ReturnType<typeof vi.fn>
+        ).mockReturnValue(true);
+        (deps.reconnectionManager.isReconnecting as ReturnType<typeof vi.fn>)
+          .mockReturnValueOnce(true)
+          .mockReturnValueOnce(false);
+
+        const lifecycle = makeReconnectionLifecycle(deps);
+
+        await lifecycle.attemptReconnection('peer1');
+
+        expect(deps.reconnectionManager.recordError).toHaveBeenCalledWith(
+          'peer1',
+          'ECONNREFUSED',
+        );
+      });
+
+      it('gives up when error triggers permanent failure', async () => {
+        const error = new Error('Connection refused');
+        (deps.dialPeer as ReturnType<typeof vi.fn>).mockRejectedValueOnce(
+          error,
+        );
+        (
+          kernelErrors.isRetryableNetworkError as ReturnType<typeof vi.fn>
+        ).mockReturnValue(true);
+        (
+          deps.reconnectionManager.isPermanentlyFailed as ReturnType<
+            typeof vi.fn
+          >
+        )
+          .mockReturnValueOnce(false) // At start of loop
+          .mockReturnValueOnce(true); // After recording error
+
+        const lifecycle = makeReconnectionLifecycle(deps);
+
+        await lifecycle.attemptReconnection('peer1');
+
+        expect(deps.reconnectionManager.stopReconnection).toHaveBeenCalledWith(
+          'peer1',
+        );
+        expect(deps.onRemoteGiveUp).toHaveBeenCalledWith('peer1');
+        expect(deps.outputError).toHaveBeenCalledWith(
+          'peer1',
+          expect.stringContaining('permanent failure detected'),
+          expect.any(Error),
+        );
+      });
+
+      it('continues retrying when error does not trigger permanent failure', async () => {
+        (deps.dialPeer as ReturnType<typeof vi.fn>)
+          .mockRejectedValueOnce(new Error('Temporary failure'))
+          .mockResolvedValueOnce(mockChannel);
+        (
+          kernelErrors.isRetryableNetworkError as ReturnType<typeof vi.fn>
+        ).mockReturnValue(true);
+        (
+          deps.reconnectionManager.isPermanentlyFailed as ReturnType<
+            typeof vi.fn
+          >
+        ).mockReturnValue(false);
+        (deps.reconnectionManager.isReconnecting as ReturnType<typeof vi.fn>)
+          .mockReturnValueOnce(true)
+          .mockReturnValueOnce(true)
+          .mockReturnValueOnce(false);
+
+        const lifecycle = makeReconnectionLifecycle(deps);
+
+        await lifecycle.attemptReconnection('peer1');
+
+        expect(deps.dialPeer).toHaveBeenCalledTimes(2);
+        expect(deps.reconnectionManager.recordError).toHaveBeenCalled();
+      });
+
+      it('does not record non-retryable errors in history', async () => {
+        const error = new Error('Auth failed');
+        (deps.dialPeer as ReturnType<typeof vi.fn>).mockRejectedValueOnce(
+          error,
+        );
+        // Non-retryable error
+        (
+          kernelErrors.isRetryableNetworkError as ReturnType<typeof vi.fn>
+        ).mockReturnValue(false);
+
+        const lifecycle = makeReconnectionLifecycle(deps);
+
+        await lifecycle.attemptReconnection('peer1');
+
+        // Non-retryable errors should NOT be recorded - they don't contribute
+        // to permanent failure detection
+        expect(deps.reconnectionManager.recordError).not.toHaveBeenCalled();
+        // But should still give up
+        expect(deps.onRemoteGiveUp).toHaveBeenCalledWith('peer1');
+      });
+    });
+  });
+
+  describe('handleConnectionLoss with permanent failure', () => {
+    it('skips reconnection and calls onRemoteGiveUp for permanently failed peer', () => {
+      (
+        deps.reconnectionManager.isReconnecting as ReturnType<typeof vi.fn>
+      ).mockReturnValue(false);
+      (
+        deps.reconnectionManager.startReconnection as ReturnType<typeof vi.fn>
+      ).mockReturnValue(false);
+
+      const lifecycle = makeReconnectionLifecycle(deps);
+
+      lifecycle.handleConnectionLoss('peer1');
+
+      expect(deps.reconnectionManager.startReconnection).toHaveBeenCalledWith(
+        'peer1',
+      );
+      expect(deps.onRemoteGiveUp).toHaveBeenCalledWith('peer1');
+      expect(deps.logger.log).toHaveBeenCalledWith(
+        expect.stringContaining('permanently failed'),
+      );
+    });
+
+    it('proceeds with reconnection when startReconnection returns true', () => {
+      (
+        deps.reconnectionManager.isReconnecting as ReturnType<typeof vi.fn>
+      ).mockReturnValue(false);
+      (
+        deps.reconnectionManager.startReconnection as ReturnType<typeof vi.fn>
+      ).mockReturnValue(true);
+
+      const lifecycle = makeReconnectionLifecycle(deps);
+
+      lifecycle.handleConnectionLoss('peer1');
+
+      expect(deps.reconnectionManager.startReconnection).toHaveBeenCalledWith(
+        'peer1',
+      );
+      expect(deps.onRemoteGiveUp).not.toHaveBeenCalled();
+    });
   });
 });
```
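The two `handleConnectionLoss` tests above pin down the new `startReconnection` boolean contract: `false` now means the peer is permanently failed and the loop must not start. The lifecycle source itself is not part of this excerpt, so the following is a sketch reconstructed from those test expectations alone; the `LifecycleDeps` type and `makeHandleConnectionLoss` wrapper are illustrative, not the actual implementation:

```ts
// Sketch only: control flow inferred from the test expectations above, not
// taken from the reconnection-lifecycle source (which this page omits).
type LifecycleDeps = {
  reconnectionManager: {
    isReconnecting: (peerId: string) => boolean;
    startReconnection: (peerId: string) => boolean;
  };
  onRemoteGiveUp: (peerId: string) => void;
  attemptReconnection: (peerId: string) => Promise<void>;
  logger: { log: (message: string) => void };
};

function makeHandleConnectionLoss(deps: LifecycleDeps) {
  return (peerId: string): void => {
    if (deps.reconnectionManager.isReconnecting(peerId)) {
      return; // A retry loop for this peer is already running.
    }
    // startReconnection now returns false for a permanently failed peer.
    if (!deps.reconnectionManager.startReconnection(peerId)) {
      deps.logger.log(`Peer ${peerId} is permanently failed; not reconnecting`);
      deps.onRemoteGiveUp(peerId);
      return;
    }
    // Otherwise kick off the asynchronous reconnection attempt.
    void deps.attemptReconnection(peerId);
  };
}
```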
