Description
When the bidirectional gRPC stream encounters an RST_STREAM error (error code 8), the _server_request_watcher task exits silently without triggering any reconnection logic. As a result, ephemeral instances are removed from the Nacos server after the heartbeat timeout, while the client remains unaware of the disconnection.
Environment
- nacos-sdk-python version: 3.0.4
- Python version: 3.13
- Nacos server version: 3.1.1
- Network: Behind AWS ALB Gateway
Steps to Reproduce
- Register an ephemeral instance using NacosNamingService
- Wait for network idle timeout (e.g., ALB 60s idle timeout) or simulate RST_STREAM
- Observe that the client's rpc_client_status remains RUNNING
- The instance disappears from Nacos registry after server-side heartbeat timeout
Expected Behavior
When the bidirectional stream dies, the SDK should:
- Detect the stream termination
- Update rpc_client_status to UNHEALTHY
- Trigger reconnection via reconnection_chan
- Re-register the ephemeral instance
Actual Behavior
The _server_request_watcher task in grpc_client.py exits silently:
    async def _server_request_watcher(self, grpc_conn: GrpcConnection):
        async for payload in grpc_conn.bi_stream_send():
            try:
                # ... handle payload
            except Exception as e:
                self.logger.error(f"handle server request occur exception: {e}")
When RST_STREAM occurs, the async for loop simply terminates without:
- Catching the stream termination
- Updating client status
- Triggering reconnection
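The silent exit can be reproduced without gRPC at all. The sketch below assumes, as this report describes, that stream termination surfaces to the watcher as plain end-of-iteration rather than as an exception; fake_bi_stream, silent_watcher, and the status string are hypothetical stand-ins, not SDK code.

```python
import asyncio


async def fake_bi_stream():
    """Stand-in for the server-to-client stream: one payload, then it ends."""
    yield "payload-1"
    # Falling off the end models the stream being torn down in a way that
    # the consumer only sees as end-of-iteration.


async def silent_watcher(handled: list) -> None:
    # Same shape as the watcher above: nothing runs after the loop ends.
    async for payload in fake_bi_stream():
        handled.append(payload)


async def show_silent_exit() -> dict:
    handled: list = []
    status = "RUNNING"  # nothing in the watcher ever changes this
    task = asyncio.create_task(silent_watcher(handled))
    await task  # returns normally; no exception is raised
    return {
        "handled": handled,
        "task_done": task.done(),
        "task_exception": task.exception(),
        "status": status,
    }
```

The task finishes with no exception and the status value is never touched, which matches the symptom: from the outside, a dead watcher is indistinguishable from one that was never needed.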
Root Cause Analysis
The issue is in v2/nacos/transport/grpc_client.py:
- Line 93: Task is created via asyncio.create_task(self._server_request_watcher(grpc_conn))
- Lines 95-115: The async for loop has no finally block to handle stream termination
- The SDK's health check (send_health_check) uses unary requests, not the bidirectional stream, so it may still succeed even when the stream is dead
Proposed Fix
    async def _server_request_watcher(self, grpc_conn: GrpcConnection):
        try:
            async for payload in grpc_conn.bi_stream_send():
                try:
                    self.logger.info(
                        "receive stream server request, connection_id:%s, original info: %s"
                        % (grpc_conn.get_connection_id(), str(payload))
                    )
                    request = GrpcUtils.parse(payload)
                    if request:
                        await self._handle_server_request(request, grpc_conn)
                except Exception as e:
                    self.logger.error(
                        f"[{grpc_conn.connection_id}] handle server request occur exception: {e}"
                    )
        except Exception as e:
            self.logger.error(
                f"[{grpc_conn.connection_id}] bidirectional stream error: {e}"
            )
        finally:
            # Trigger reconnection when stream ends unexpectedly
            if not self.is_shutdown() and not grpc_conn.is_abandon():
                self.logger.warning(
                    f"[{grpc_conn.connection_id}] bidirectional stream ended, triggering reconnect"
                )
                self.rpc_client_status = RpcClientStatus.UNHEALTHY
                await self.reconnection_chan.put(
                    ReconnectContext(server_info=None, on_request_fail=True)
                )
Workaround
We implemented a wrapper that monitors SDK internal state and forces reconnection:
    def _check_sdk_internal_state(self) -> tuple[bool, str]:
        # Check rpc_client_status
        # Check current_connection exists
        # Check connection.abandon flag
        # Check if _server_request_task.done()  <-- Key detection
        ...
This works but requires accessing private attributes, which is fragile.
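For readers who need the same stopgap, here is a minimal sketch of what such a check might look like. It follows the four checks listed in the comments above, but the attribute names (rpc_client_status, current_connection, abandon, _server_request_task) and the comparison against the name "RUNNING" are assumptions about SDK internals, not a documented API, and the function takes the client as an argument rather than being the wrapper method itself.

```python
import asyncio
from types import SimpleNamespace


def check_sdk_internal_state(rpc_client) -> tuple[bool, str]:
    """Return (healthy, reason) based on assumed internal attributes."""
    status = getattr(rpc_client, "rpc_client_status", None)
    # Accept either an enum member or a plain string for the status.
    if getattr(status, "name", status) != "RUNNING":
        return False, "rpc_client_status is not RUNNING"
    conn = getattr(rpc_client, "current_connection", None)
    if conn is None:
        return False, "no current connection"
    if getattr(conn, "abandon", False):
        return False, "connection is abandoned"
    task = getattr(rpc_client, "_server_request_task", None)
    if task is None or task.done():
        # Key detection: the watcher task finished, so the stream is dead.
        return False, "server request watcher has exited"
    return True, "ok"


async def demo_check() -> list:
    async def finished_watcher() -> None:
        return None  # models a watcher that has already exited

    dead = asyncio.create_task(finished_watcher())
    await dead
    live = asyncio.create_task(asyncio.sleep(60))  # a still-running watcher
    conn = SimpleNamespace(abandon=False)
    healthy = SimpleNamespace(
        rpc_client_status="RUNNING", current_connection=conn,
        _server_request_task=live,
    )
    broken = SimpleNamespace(
        rpc_client_status="RUNNING", current_connection=conn,
        _server_request_task=dead,
    )
    results = [check_sdk_internal_state(healthy), check_sdk_internal_state(broken)]
    live.cancel()  # clean up the placeholder task
    return results
```

Using getattr with defaults keeps the check from raising if an attribute is renamed in a later SDK release, though it would then report the client as unhealthy, which is the safer failure mode for a watchdog.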
Impact
This issue affects all users who:
- Use ephemeral instances (the default)
- Operate behind load balancers or NAT with idle timeouts
- Experience any network interruption that causes RST_STREAM
The service silently disappears from the Nacos registry without any client-side error or recovery attempt.