_server_request_watcher silently exits on RST_STREAM without triggering reconnection #294

@jeremy9494

Description

When the bidirectional gRPC stream encounters an RST_STREAM error (error code 8), the _server_request_watcher task exits silently without triggering any reconnection logic. As a result, the Nacos server removes the client's ephemeral instances once the heartbeat timeout expires, while the client remains unaware of the disconnection.

Environment

  • nacos-sdk-python version: 3.0.4
  • Python version: 3.13
  • Nacos server version: 3.1.1
  • Network: Behind AWS ALB Gateway

Steps to Reproduce

  1. Register an ephemeral instance using NacosNamingService
  2. Wait for a network idle timeout (e.g., the ALB 60s idle timeout) or simulate RST_STREAM
  3. Observe that the client's rpc_client_status remains RUNNING
  4. The instance disappears from the Nacos registry after the server-side heartbeat timeout

Expected Behavior

When the bidirectional stream dies, the SDK should:

  1. Detect the stream termination
  2. Update rpc_client_status to UNHEALTHY
  3. Trigger reconnection via reconnection_chan
  4. Re-register the ephemeral instance

Actual Behavior

The _server_request_watcher task in grpc_client.py exits silently:

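async def _server_request_watcher(self, grpc_conn: GrpcConnection):
    # The loop ends as soon as the stream iterator stops or raises.
    async for payload in grpc_conn.bi_stream_send():
        try:
            ...  # handle payload (elided)
        except Exception as e:
            # Only per-payload errors are caught here, not stream errors.
            self.logger.error(f"handle server request occur exception: {e}")
    # Nothing runs after the loop: no status update, no reconnect signal.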

When RST_STREAM occurs, the async for loop simply terminates without:

  • Catching the stream termination
  • Updating client status
  • Triggering reconnection
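The silent exit can be reproduced without a Nacos server. The following is a minimal sketch using only asyncio. `StreamReset`, `fake_bi_stream`, `watcher`, and `demo` are hypothetical stand-ins, not SDK names, and the sketch assumes the reset surfaces as an exception raised out of the stream iterator. If the iterator instead ends cleanly, the outcome is the same: the task finishes and nothing reacts.

```python
import asyncio


class StreamReset(Exception):
    """Hypothetical stand-in for the error raised when the stream is reset."""


async def fake_bi_stream():
    # Yield one payload, then fail the way a reset stream might.
    yield "payload-1"
    raise StreamReset("RST_STREAM with error code 8")


async def watcher(handled: list) -> None:
    # Same shape as the SDK loop: errors are only caught per payload.
    async for payload in fake_bi_stream():
        try:
            handled.append(payload)
        except Exception as e:
            print(f"handle server request occur exception: {e}")


async def demo() -> tuple:
    handled: list = []
    status = "RUNNING"  # nothing below ever changes this value
    task = asyncio.create_task(watcher(handled))
    # Wait for the task without re-raising its exception.
    await asyncio.wait([task])
    return status, task.done(), type(task.exception()).__name__, handled


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

In this sketch the task is done and holds the exception, yet the status value is still "RUNNING", because no code outside the task inspects it.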

Root Cause Analysis

The issue is in v2/nacos/transport/grpc_client.py:

  1. Line 93: Task is created via asyncio.create_task(self._server_request_watcher(grpc_conn))
  2. Lines 95-115: The async for loop has no finally block to handle stream termination
  3. The SDK's health check (send_health_check) uses unary requests, not the bidirectional stream, so it may still succeed even when the stream is dead

Proposed Fix

async def _server_request_watcher(self, grpc_conn: GrpcConnection):
    try:
        async for payload in grpc_conn.bi_stream_send():
            try:
                self.logger.info(
                    "receive stream server request, connection_id:%s, original info: %s"
                    % (grpc_conn.get_connection_id(), str(payload))
                )
                request = GrpcUtils.parse(payload)
                if request:
                    await self._handle_server_request(request, grpc_conn)
            except Exception as e:
                self.logger.error(
                    f"[{grpc_conn.connection_id}] handle server request occur exception: {e}"
                )
    except Exception as e:
        self.logger.error(
            f"[{grpc_conn.connection_id}] bidirectional stream error: {e}"
        )
    finally:
        # Trigger reconnection when stream ends unexpectedly
        if not self.is_shutdown() and not grpc_conn.is_abandon():
            self.logger.warning(
                f"[{grpc_conn.connection_id}] bidirectional stream ended, triggering reconnect"
            )
            self.rpc_client_status = RpcClientStatus.UNHEALTHY
            await self.reconnection_chan.put(
                ReconnectContext(server_info=None, on_request_fail=True)
            )
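The try/except/finally pattern in the proposed fix can be exercised in isolation. This is a sketch under stated assumptions, not the SDK's code: `Status`, `FakeConn`, `FakeClient`, and `run_once` are hypothetical stand-ins for RpcClientStatus, GrpcConnection, and the RPC client, and a plain asyncio.Queue stands in for reconnection_chan.

```python
import asyncio
from dataclasses import dataclass
from enum import Enum


class Status(Enum):  # hypothetical stand-in for RpcClientStatus
    RUNNING = "RUNNING"
    UNHEALTHY = "UNHEALTHY"


@dataclass
class FakeConn:  # hypothetical stand-in for GrpcConnection
    connection_id: str = "conn-1"
    abandoned: bool = False

    async def bi_stream_send(self):
        # Deliver one payload, then simulate a reset stream.
        yield "payload-1"
        raise ConnectionResetError("stream reset")


class FakeClient:
    def __init__(self) -> None:
        self.status = Status.RUNNING
        self.shutdown = False
        self.reconnection_chan: asyncio.Queue = asyncio.Queue()

    async def watch(self, conn: FakeConn) -> None:
        try:
            async for _payload in conn.bi_stream_send():
                pass  # per-payload handling elided
        except Exception as e:
            print(f"[{conn.connection_id}] bidirectional stream error: {e}")
        finally:
            # Skip the signal on deliberate shutdown or a replaced connection.
            if not self.shutdown and not conn.abandoned:
                self.status = Status.UNHEALTHY
                await self.reconnection_chan.put(conn.connection_id)


async def run_once(shutdown: bool = False) -> tuple:
    client = FakeClient()
    client.shutdown = shutdown
    await client.watch(FakeConn())
    return client.status, client.reconnection_chan.qsize()


if __name__ == "__main__":
    print(asyncio.run(run_once()))
    print(asyncio.run(run_once(shutdown=True)))
```

With the guard in place, a dead stream queues exactly one reconnect request, while a deliberate shutdown leaves the status and the queue untouched.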

Workaround

We implemented a wrapper that monitors SDK internal state and forces reconnection:

def _check_sdk_internal_state(self) -> tuple[bool, str]:
    # Check rpc_client_status
    # Check current_connection exists
    # Check connection.abandon flag
    # Check if _server_request_task.done()  <-- Key detection
    ...

This works but requires accessing private attributes, which is fragile.
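For readers who need a stopgap, the four checks listed in the workaround might look roughly like the sketch below. It is an illustration only: the attribute names (rpc_client_status, current_connection, abandon, _server_request_task) and where they live on the client object are assumptions taken from the comments above, not a verified description of the SDK, and `running_status` is a hypothetical parameter for whatever value the SDK uses for its RUNNING state.

```python
from types import SimpleNamespace


def check_sdk_internal_state(rpc_client, running_status) -> tuple[bool, str]:
    """Return (healthy, reason). All attribute names are assumptions."""
    status = getattr(rpc_client, "rpc_client_status", None)
    if status != running_status:
        return False, f"status is {status}"

    conn = getattr(rpc_client, "current_connection", None)
    if conn is None:
        return False, "no current connection"

    if getattr(conn, "abandon", False):
        return False, "connection is abandoned"

    # Key detection: the watcher task has finished, so the stream is dead.
    task = getattr(rpc_client, "_server_request_task", None)
    if task is not None and task.done():
        return False, "stream watcher task has exited"

    return True, "ok"


if __name__ == "__main__":
    # Fake client whose watcher task has already exited.
    dead_task = SimpleNamespace(done=lambda: True)
    client = SimpleNamespace(
        rpc_client_status="RUNNING",
        current_connection=SimpleNamespace(abandon=False),
        _server_request_task=dead_task,
    )
    print(check_sdk_internal_state(client, "RUNNING"))
```

Using getattr with defaults keeps the check from crashing if a private attribute is renamed, but a renamed attribute would then be missed without any error, which is part of why this approach is fragile.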

Impact

This issue affects all users who:

  • Use ephemeral instances (the default)
  • Operate behind load balancers or NAT with idle timeouts
  • Experience any network interruption that causes RST_STREAM

The service silently disappears from the Nacos registry, with no client-side error and no recovery attempt.
