Raven IPsec L3 tunnel: traffic from non-gateway nodes to edge Pod CIDRs fails when forwarded via gateway #184

@Sabre94

Description

In an OpenYurt cluster using Raven for cloud-edge connectivity (IPsec L3 tunnel), only the cloud gateway node can reach the edge Pod network. Traffic from other cloud nodes to edge Pod IPs times out, and traffic from edge nodes to Pods on non-gateway cloud nodes also fails.

Example symptom: Prometheus running on a non-gateway cloud node cannot scrape a metrics endpoint on an edge node:

Error scraping target: Get "http://Pod_IP:9400/metrics": context deadline exceeded

Observed connectivity:

  • cloud gateway node (master-gw) <-> edge: OK
  • other cloud nodes (master-other) -> edge Pod IPs: FAIL
  • edge -> master-other Pod IPs: FAIL

Flannel MASQUERADE SNAT breaks xfrm policy matching (tunnel not triggered)

On the cloud gateway node (master-gw), Flannel installs a MASQUERADE rule similar to:

-A FLANNEL-POSTRTG -s 10.16.0.0/12 ! -d 224.0.0.0/4 -j MASQUERADE --random-fully

When traffic from a non-gateway node Pod (e.g. 10.16.2.x) is forwarded via master-gw towards the edge Pod CIDR (10.16.10.0/24), it hits nat/POSTROUTING on master-gw and is SNATed to master-gw’s node IP. After SNAT, the original (src=10.16.2.0/24, dst=10.16.10.0/24) no longer matches the xfrm policy, so the IPsec tunnel is not triggered.
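The mismatch can be illustrated with a small bash sketch that models only the source/destination part of an xfrm selector. This is a simplified illustration, not how the kernel implements the lookup: the helper names (ip_to_int, in_cidr, matches_selector) are hypothetical, and 192.168.0.10 is an assumed stand-in for master-gw's node IP, which is not given in this report.

```shell
#!/usin/bin/env bash
# Simplified model: does a (src, dst) pair fall inside a selector's CIDRs?
# Real xfrm selectors can also match protocol, ports, marks, etc.

# Convert a dotted-quad IPv4 address to a 32-bit integer.
ip_to_int() {
    local IFS=.
    set -- $1
    echo $(( ($1 << 24) | ($2 << 16) | ($3 << 8) | $4 ))
}

# Succeed if IPv4 address $1 lies inside CIDR $2 (e.g. 10.16.2.0/24).
in_cidr() {
    local ip net bits mask
    ip=$(ip_to_int "$1")
    net=$(ip_to_int "${2%/*}")
    bits=${2#*/}
    mask=$(( (0xFFFFFFFF << (32 - bits)) & 0xFFFFFFFF ))
    [ $(( ip & mask )) -eq $(( net & mask )) ]
}

# $1 = packet src, $2 = packet dst, $3 = selector src CIDR, $4 = selector dst CIDR
matches_selector() {
    in_cidr "$1" "$3" && in_cidr "$2" "$4"
}

# Before SNAT: Pod source address, as in the policy above.
if matches_selector 10.16.2.5 10.16.10.7 10.16.2.0/24 10.16.10.0/24; then
    echo "before SNAT: match"
fi

# After SNAT: source rewritten to the gateway node IP (assumed value).
if ! matches_selector 192.168.0.10 10.16.10.7 10.16.2.0/24 10.16.10.0/24; then
    echo "after SNAT: no match"
fi
```

Once the source address is rewritten to a node IP outside 10.16.2.0/24, the pair no longer falls inside the selector, which is consistent with the policy and SA counters staying at 0.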

We saw:

  • ip xfrm policy exists for src 10.16.2.0/24 -> dst 10.16.10.0/24 (dir out), but lifetime current stays 0
  • ip -s xfrm state shows OUT SA counters for that subnet pair stay at 0

Workaround

On the cloud gateway node (master-gw), inserting a “podCIDR -> podCIDR skip SNAT” rule at the top of nat/POSTROUTING fixes the SNAT/xfrm mismatch:

iptables -t nat -I POSTROUTING 1 -s 10.16.0.0/12 -d 10.16.0.0/12 -j RETURN
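Running the command above repeatedly inserts duplicate rules. A minimal sketch of an idempotent variant, using iptables -C to check for the rule first, might look like this. The function name ensure_pod_to_pod_return and the IPTABLES override variable are hypothetical, and 10.16.0.0/12 is assumed to cover all Pod CIDRs as in the rule above.

```shell
# Insert the "podCIDR -> podCIDR skip SNAT" rule only if it is not already present.
# IPTABLES can be overridden (e.g. to iptables-nft); this variable is an assumption
# made for the sketch, not something Raven or Flannel defines.
ensure_pod_to_pod_return() {
    local cidr="$1"
    local ipt="${IPTABLES:-iptables}"

    # -C exits 0 when an identical rule already exists in the chain.
    if "$ipt" -t nat -C POSTROUTING -s "$cidr" -d "$cidr" -j RETURN 2>/dev/null; then
        echo "rule already present"
    else
        # Insert at position 1 so it is evaluated before FLANNEL-POSTRTG.
        "$ipt" -t nat -I POSTROUTING 1 -s "$cidr" -d "$cidr" -j RETURN \
            && echo "rule inserted"
    fi
}
```

On master-gw this would be called as root with `ensure_pod_to_pod_return 10.16.0.0/12`. Note that the check only avoids duplicates; it does not guarantee the rule stays at the top if another component later inserts its own rules ahead of it.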

After this, traffic from the non-gateway cloud nodes (master-02/03) to edge Pod IPs works and the xfrm counters start increasing.

We also observed that disabling VPC “source/destination check” on the gateway NIC can restore connectivity (even without the above iptables rule), implying VPC forwarding restrictions can be a second factor.
