Description
In an OpenYurt cluster using Raven for cloud-edge connectivity (IPsec L3 tunnel), only the cloud gateway node can reach the edge Pod network. Traffic from other cloud nodes to edge Pod IPs times out, and traffic from edge nodes to non-gateway cloud nodes also fails.
Example symptom: Prometheus running on a non-gateway cloud node cannot scrape a metrics endpoint on an edge node:
Error scraping target: Get "http://Pod_IP:9400/metrics": context deadline exceeded
Observed connectivity:
- cloud gateway node (master-gw) <-> edge: OK
- other cloud nodes (master-other) -> edge Pod IPs: FAIL
- edge -> master-other Pod IPs: FAIL
Flannel MASQUERADE SNAT breaks xfrm policy matching (tunnel not triggered)
On the cloud gateway node, Flannel installs a MASQUERADE rule similar to:
-A FLANNEL-POSTRTG -s 10.16.0.0/12 ! -d 224.0.0.0/4 -j MASQUERADE --random-fully
When traffic from a non-gateway node Pod (e.g. 10.16.2.x) is forwarded via master-gw towards the edge Pod CIDR (10.16.10.0/24), it hits nat/POSTROUTING on master-gw and is SNATed to master-gw’s node IP. After SNAT, the packet's source is no longer inside 10.16.2.0/24, so it no longer matches the xfrm policy selector (src 10.16.2.0/24, dst 10.16.10.0/24) and the IPsec tunnel is not triggered.
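To make the mismatch concrete, here is a minimal sketch in POSIX sh that checks whether a (src, dst) pair falls inside a selector such as src 10.16.2.0/24, dst 10.16.10.0/24. The helper names (ip_to_int, in_cidr, matches_selector) and the three example addresses are assumptions for illustration only; in particular 192.168.0.10 is a made-up stand-in for master-gw's node IP. It only mimics the prefix comparison, not the kernel's actual xfrm lookup.

```shell
#!/bin/sh
# Convert a dotted-quad IPv4 address to a 32-bit integer.
ip_to_int() {
  old_ifs=$IFS
  IFS=.
  set -- $1          # split the address on dots into $1..$4
  IFS=$old_ifs
  echo $(( ($1 << 24) | ($2 << 16) | ($3 << 8) | $4 ))
}

# Return 0 if address $1 falls inside CIDR $2, non-zero otherwise.
in_cidr() {
  addr=$(ip_to_int "$1")
  net=$(ip_to_int "${2%/*}")
  prefix=${2#*/}
  if [ "$prefix" -eq 0 ]; then
    mask=0
  else
    mask=$(( (0xFFFFFFFF << (32 - prefix)) & 0xFFFFFFFF ))
  fi
  [ $(( addr & mask )) -eq $(( net & mask )) ]
}

# Return 0 if (src=$1, dst=$2) matches selector (src CIDR $3, dst CIDR $4).
matches_selector() {
  in_cidr "$1" "$3" && in_cidr "$2" "$4"
}

POD_SRC=10.16.2.5        # hypothetical Pod on a non-gateway cloud node
EDGE_DST=10.16.10.7      # hypothetical edge Pod
GW_NODE_IP=192.168.0.10  # hypothetical master-gw node IP (source after SNAT)

for src in "$POD_SRC" "$GW_NODE_IP"; do
  if matches_selector "$src" "$EDGE_DST" 10.16.2.0/24 10.16.10.0/24; then
    echo "src=$src dst=$EDGE_DST: matches selector"
  else
    echo "src=$src dst=$EDGE_DST: no match"
  fi
done
```

With the original Pod source the pair matches; with the SNATed node IP as source it does not, which is consistent with the tunnel never being triggered for that traffic.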
We saw:
- An ip xfrm policy entry exists for src 10.16.2.0/24 -> dst 10.16.10.0/24 (dir out), but its lifetime current stays 0
- ip -s xfrm state shows that the OUT SA counters for that subnet pair stay at 0
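For anyone trying to reproduce the checks above on their own gateway node, something like the following sketch could be used. The run and check_xfrm helpers and the DRY_RUN variable are assumptions made for this example, not part of Raven or Flannel. By default the script only prints the commands; setting DRY_RUN=0 executes them, which needs root on the gateway node.

```shell
#!/bin/sh
# Print commands by default; set DRY_RUN=0 to actually execute them (needs root).
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Collect xfrm and NAT counters for one subnet pair.
check_xfrm() {
  src_cidr=$1
  dst_cidr=$2
  # Outbound policy for the subnet pair, including lifetime statistics (-s).
  run ip -s xfrm policy list src "$src_cidr" dst "$dst_cidr" dir out
  # All SAs with their byte/packet counters.
  run ip -s xfrm state list
  # Packet counters of the Flannel chain that holds the MASQUERADE rule.
  run iptables -t nat -L FLANNEL-POSTRTG -n -v
}

check_xfrm 10.16.2.0/24 10.16.10.0/24
```

If the MASQUERADE rule's packet counter grows while the xfrm counters stay at 0 during a test such as the Prometheus scrape, that would point to the SNAT happening before the policy match.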
Workaround
On the cloud gateway node (master-gw), inserting a “podCIDR -> podCIDR skip SNAT” rule at the top of nat/POSTROUTING fixes the SNAT/xfrm mismatch:
iptables -t nat -I POSTROUTING 1 -s 10.16.0.0/12 -d 10.16.0.0/12 -j RETURN
After this, traffic from the non-gateway cloud nodes (master-02/03) to edge Pod IPs works and the xfrm counters start increasing.
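Running the insert command repeatedly adds duplicate rules, so a slightly more careful version of the workaround could first check for the rule with iptables -C. The sketch below is only one way to do that; the run and ensure_skip_snat helpers and the DRY_RUN variable are assumptions for this example. By default it only prints the command; set DRY_RUN=0 to apply it as root on master-gw.

```shell
#!/bin/sh
# Print commands by default; set DRY_RUN=0 to actually execute them (needs root).
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Insert the "podCIDR -> podCIDR skip SNAT" rule unless it already exists.
ensure_skip_snat() {
  pod_cidr=$1
  # -C exits 0 only if an identical rule is already in the chain.
  # It checks existence, not position, so it will not notice if another
  # rule has since been inserted above this one.
  if [ "${DRY_RUN:-1}" != "1" ] &&
     iptables -t nat -C POSTROUTING -s "$pod_cidr" -d "$pod_cidr" -j RETURN 2>/dev/null; then
    echo "rule already present"
    return 0
  fi
  run iptables -t nat -I POSTROUTING 1 -s "$pod_cidr" -d "$pod_cidr" -j RETURN
}

ensure_skip_snat 10.16.0.0/12
```

Note that a rule added this way is not persistent across reboots, and whether it stays ahead of the Flannel jump after Flannel resyncs its rules would need to be verified on the cluster.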
We also observed that disabling the VPC “source/destination check” on the gateway NIC can restore connectivity (even without the above iptables rule), which suggests that VPC forwarding restrictions are a second contributing factor.