
egress SNAT HA breaks while using multiple nodes for SNAT gateway #4115

Closed

iMikeG6 opened this issue Aug 15, 2022 · 4 comments
Labels: kind/bug

Comments

iMikeG6 commented Aug 15, 2022

Describe the bug
We are currently testing the Antrea Egress SNAT feature in HA mode. There are two nodes that carry the snat-origin label, let's say node1 and node3.

Both nodes show the antrea-egress0 interface (state DOWN) with the Egress IP assigned:

antrea-egress0   DOWN           10.246.3.170/32 

Leader is currently node1
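
For reference, the snat-origin label was applied to the two gateway nodes with something like the following (a sketch, using our node names):

kubectl label node node1 network-role=snat-origin
kubectl label node node3 network-role=snat-origin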

The SNAT ExternalIPPool is the following:

apiVersion: crd.antrea.io/v1alpha2
kind: ExternalIPPool
metadata:
  name: snat-ippool-default
spec:
  ipRanges:
    - end: 10.246.3.174
      start: 10.246.3.169
    - cidr: 10.246.3.168/29
  nodeSelector:
    matchLabels:
      network-role: snat-origin

and the Egress config is:

apiVersion: crd.antrea.io/v1alpha2
kind: Egress
metadata:
  annotations:
  name: snat-default-ip
spec:
  appliedTo:
    namespaceSelector:
      matchLabels:
        field.cattle.io/projectId: p-ht5wm
  egressIP: 10.246.3.170
  externalIPPool: snat-ippool-default

The Egress is applied via the matchLabels selector field.cattle.io/projectId: p-ht5wm, which, when using Rancher, applies to all namespaces within project p-ht5wm. The selector could also have been "kubernetes.io/metadata.name": default; the result is the same.

Everything works fine until the leader node is rebooted. The second node (node3) takes over the Egress IP 10.246.3.170, and pods can still reach the internet or ping external machines outside the k8s cluster. But when node1 comes back up, things start to get messy: randomly, pod1 can't reach the internet while pod2 can, and sometimes it's the opposite, pod2 can't and pod1 can.

The only way I can get back to a normal state is by removing the snat-origin label from node3, which basically means only one node handles the Egress SNAT IP and, in a way, breaks the HA feature.
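
To see which node currently claims the Egress IP, we check roughly like this (a sketch: the NODE column assumes the printer columns of the Antrea Egress CRD, and the interface check uses iproute2 on each gateway node):

kubectl get egress snat-default-ip      # NODE column should show a single active node
ip -br addr show antrea-egress0         # run on node1 and node3; only one should carry 10.246.3.170/32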

Expected

Pods should still be able to get internet access no matter which Antrea Egress node currently holds the Egress SNAT IP.

Versions:
Antrea version: Main
Kubernetes version: Rancher-deployed cluster (RKE1), k8s version 1.23.7
Docker version: 20.10.17, build 100c701
Linux kernel: 5.14.0-70.22.1.el9_0.x86_64 #1 SMP PREEMPT Tue Aug 2 10:02:12 EDT 2022 x86_64 x86_64 x86_64 GNU/Linux
OS: Red Hat Enterprise Linux release 9.0 (Plow)

Additional context
All nodes use Intel SFP+ cards with active-backup bonding.
The problem also occurred on our VMware infrastructure while I was testing on virtual machines.
The logs did not seem relevant, but let me know if they are needed. Currently, on our physical machines, I have dedicated only one machine to handle the Egress SNAT gateway.
Firewalld is enabled.
SELinux is in permissive mode.

Thanks for your help.

iMikeG6 added the kind/bug label Aug 15, 2022
iMikeG6 (Author) commented Aug 15, 2022

Here are the controller logs:

I0815 12:35:13.763268       1 controller.go:416] Processing Egress snat-default-ip UPDATE event with selector ({nil &LabelSelector{MatchLabels:map[string]string{field.cattle.io/projectId: p-ht5wm,},MatchExpressions:[]LabelSelectorRequirement{},} []})
I0815 12:35:13.771145       1 controller.go:416] Processing Egress snat-default-ip UPDATE event with selector ({nil &LabelSelector{MatchLabels:map[string]string{field.cattle.io/projectId: p-ht5wm,},MatchExpressions:[]LabelSelectorRequirement{},} []})
I0815 12:35:13.963767       1 controller.go:416] Processing Egress snat-default-ip UPDATE event with selector ({nil &LabelSelector{MatchLabels:map[string]string{field.cattle.io/projectId: p-ht5wm,},MatchExpressions:[]LabelSelectorRequirement{},} []})
I0815 12:35:13.971046       1 controller.go:416] Processing Egress snat-default-ip UPDATE event with selector ({nil &LabelSelector{MatchLabels:map[string]string{field.cattle.io/projectId: p-ht5wm,},MatchExpressions:[]LabelSelectorRequirement{},} []})
I0815 12:35:14.163243       1 controller.go:416] Processing Egress snat-default-ip UPDATE event with selector ({nil &LabelSelector{MatchLabels:map[string]string{field.cattle.io/projectId: p-ht5wm,},MatchExpressions:[]LabelSelectorRequirement{},} []})
I0815 12:35:14.171530       1 controller.go:416] Processing Egress snat-default-ip UPDATE event with selector ({nil &LabelSelector{MatchLabels:map[string]string{field.cattle.io/projectId: p-ht5wm,},MatchExpressions:[]LabelSelectorRequirement{},} []})
I0815 12:35:14.365428       1 controller.go:416] Processing Egress snat-default-ip UPDATE event with selector ({nil &LabelSelector{MatchLabels:map[string]string{field.cattle.io/projectId: p-ht5wm,},MatchExpressions:[]LabelSelectorRequirement{},} []})
I0815 12:35:14.371244       1 controller.go:416] Processing Egress snat-default-ip UPDATE event with selector ({nil &LabelSelector{MatchLabels:map[string]string{field.cattle.io/projectId: p-ht5wm,},MatchExpressions:[]LabelSelectorRequirement{},} []})
I0815 12:35:14.563450       1 controller.go:416] Processing Egress snat-default-ip UPDATE event with selector ({nil &LabelSelector{MatchLabels:map[string]string{field.cattle.io/projectId: p-ht5wm,},MatchExpressions:[]LabelSelectorRequirement{},} []})
I0815 12:35:14.571143       1 controller.go:416] Processing Egress snat-default-ip UPDATE event with selector ({nil &LabelSelector{MatchLabels:map[string]string{field.cattle.io/projectId: p-ht5wm,},MatchExpressions:[]LabelSelectorRequirement{},} []})
I0815 12:35:14.763937       1 controller.go:416] Processing Egress snat-default-ip UPDATE event with selector ({nil &LabelSelector{MatchLabels:map[string]string{field.cattle.io/projectId: p-ht5wm,},MatchExpressions:[]LabelSelectorRequirement{},} []})
I0815 12:35:14.771830       1 controller.go:416] Processing Egress snat-default-ip UPDATE event with selector ({nil &LabelSelector{MatchLabels:map[string]string{field.cattle.io/projectId: p-ht5wm,},MatchExpressions:[]LabelSelectorRequirement{},} []})
I0815 12:35:14.963498       1 controller.go:416] Processing Egress snat-default-ip UPDATE event with selector ({nil &LabelSelector{MatchLabels:map[string]string{field.cattle.io/projectId: p-ht5wm,},MatchExpressions:[]LabelSelectorRequirement{},} []})
I0815 12:35:14.971201       1 controller.go:416] Processing Egress snat-default-ip UPDATE event with selector ({nil &LabelSelector{MatchLabels:map[string]string{field.cattle.io/projectId: p-ht5wm,},MatchExpressions:[]LabelSelectorRequirement{},} []})

Below are the agent logs:

https://gist.github.com/iMikeG6/97417c05659e2b3ed514a732559d64f0

tnqn (Member) commented Aug 15, 2022

@iMikeG6 thanks for reporting the issue. It seems the egress IP 10.246.3.170/32 was configured on antrea-egress0 of two Nodes, I suspect the agents failed to reach each other to negotiate the active Egress node for the Egress IP. It could be caused by the nodes' firewall dropping the traffic. Could you check whether TCP port 10351 is allowed?
https://github.com/antrea-io/antrea/blob/main/docs/network-requirements.md

// ClusterMembershipPort is the server port used by the antrea-agent to run a gossip-based cluster
// membership protocol. Currently it's used only when the Egress feature is enabled.
// Defaults to 10351.
ClusterMembershipPort int `yaml:"clusterPort,omitempty"`
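
Since firewalld is enabled on your Nodes, opening the port on each gateway Node would look roughly like this (a sketch; TCP is the port referenced above, and memberlist typically uses UDP on the same port as well, so opening both is likely safest):

sudo firewall-cmd --permanent --add-port=10351/tcp
sudo firewall-cmd --permanent --add-port=10351/udp
sudo firewall-cmd --reload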

iMikeG6 (Author) commented Aug 15, 2022

Hi @tnqn,

Thank you very much for the quick reply.

Indeed, I realized that as soon as I dug into the agent logs.

I then opened port 10351 and restarted the agents and the controller; now only node1 has the VIP on antrea-egress0.
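
For anyone hitting the same issue: with the standard Antrea manifest in kube-system, restarting the agents and the controller is roughly the following (a sketch, assuming the default DaemonSet and Deployment names):

kubectl -n kube-system rollout restart daemonset/antrea-agent
kubectl -n kube-system rollout restart deployment/antrea-controller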

For the record, I used this document to open the firewall ports: https://antrea.io/docs/v1.6.0/docs/network-requirements/ which is missing port 10351. The doc needs to be updated.

Once again, thanks for your quick reply.

It looks like this solved the problem; if not, I'll reopen the issue if need be.

Cheers.

iMikeG6 (Author) commented Aug 15, 2022

Oops, my bad, I read the wrong doc page version instead of the main one. Port 10351 is indeed mentioned.

iMikeG6 closed this as completed Aug 17, 2022