
egress SNAT HA breaks while using multiple nodes for SNAT gateway #4115

Closed

iMikeG6 opened this issue Aug 15, 2022 · 4 comments
Labels: kind/bug

Comments

iMikeG6 commented Aug 15, 2022

Describe the bug
We are currently testing the Antrea Egress SNAT feature in HA mode. There are two nodes that carry the snat-origin label, let's say node1 and node3.

Both nodes show the antrea-egress0 interface (state DOWN) with the Egress IP assigned:

antrea-egress0   DOWN           10.246.3.170/32 

Leader is currently node1
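
For reference, the snat-origin label was applied to the two gateway nodes with something like the following (a sketch, using our node names):

kubectl label node node1 network-role=snat-origin
kubectl label node node3 network-role=snat-origin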

The SNAT ExternalIPPool is the following:

apiVersion: crd.antrea.io/v1alpha2
kind: ExternalIPPool
metadata:
  name: snat-ippool-default
spec:
  ipRanges:
    - end: 10.246.3.174
      start: 10.246.3.169
    - cidr: 10.246.3.168/29
  nodeSelector:
    matchLabels:
      network-role: snat-origin

and the Egress config is:

apiVersion: crd.antrea.io/v1alpha2
kind: Egress
metadata:
  annotations:
  name: snat-default-ip
spec:
  appliedTo:
    namespaceSelector:
      matchLabels:
        field.cattle.io/projectId: p-ht5wm
  egressIP: 10.246.3.170
  externalIPPool: snat-ippool-default

The Egress is applied via the matchLabels selector field.cattle.io/projectId: p-ht5wm, which, when using Rancher, applies to all namespaces within project p-ht5wm. The selector could also have been "kubernetes.io/metadata.name": default; the result is the same.

Everything works fine until the leader node is rebooted. The second node (node3) takes over the Egress IP 10.246.3.170, and pods can still reach the internet or ping external machines outside the k8s cluster. But when node1 comes back up, things start to get messy: randomly, pod1 can't reach the internet while pod2 can, and sometimes it's the opposite, pod2 can't and pod1 can.

The only way I can get back to a normal state is by removing the snat-origin label from node3, which basically means only one node handles the Egress SNAT IP and, in a way, breaks the HA feature.
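
To see which node currently claims the Egress IP, we check roughly like this (a sketch: the NODE column assumes the printer columns of the Antrea Egress CRD, and the interface check uses iproute2 on each gateway node):

kubectl get egress snat-default-ip      # NODE column should show a single active node
ip -br addr show antrea-egress0         # run on node1 and node3; only one should carry 10.246.3.170/32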

Expected

Pods should still be able to get internet access no matter which Antrea Egress node currently holds the Egress SNAT IP.

Versions:
Antrea version: Main
Kubernetes version: Rancher-deployed cluster (RKE1), k8s version 1.23.7
Docker version: 20.10.17, build 100c701
Linux kernel: 5.14.0-70.22.1.el9_0.x86_64 #1 SMP PREEMPT Tue Aug 2 10:02:12 EDT 2022 x86_64 x86_64 x86_64 GNU/Linux
OS: Red Hat Enterprise Linux release 9.0 (Plow)

Additional context
All nodes use Intel SFP+ cards with active-backup bonding.
The problem also occurred on our VMware infrastructure while I was testing on virtual machines.
The logs did not seem relevant, but let me know if they are needed. Currently, on our physical machines, I have dedicated only one machine to handle the Egress SNAT gateway.
Firewalld is enabled.
SELinux is in permissive mode.

Thanks for your help.

iMikeG6 added the kind/bug label Aug 15, 2022
iMikeG6 (Author) commented Aug 15, 2022

Here are the controller logs:

I0815 12:35:13.763268       1 controller.go:416] Processing Egress snat-default-ip UPDATE event with selector ({nil &LabelSelector{MatchLabels:map[string]string{field.cattle.io/projectId: p-ht5wm,},MatchExpressions:[]LabelSelectorRequirement{},} []})
I0815 12:35:13.771145       1 controller.go:416] Processing Egress snat-default-ip UPDATE event with selector ({nil &LabelSelector{MatchLabels:map[string]string{field.cattle.io/projectId: p-ht5wm,},MatchExpressions:[]LabelSelectorRequirement{},} []})
I0815 12:35:13.963767       1 controller.go:416] Processing Egress snat-default-ip UPDATE event with selector ({nil &LabelSelector{MatchLabels:map[string]string{field.cattle.io/projectId: p-ht5wm,},MatchExpressions:[]LabelSelectorRequirement{},} []})
I0815 12:35:13.971046       1 controller.go:416] Processing Egress snat-default-ip UPDATE event with selector ({nil &LabelSelector{MatchLabels:map[string]string{field.cattle.io/projectId: p-ht5wm,},MatchExpressions:[]LabelSelectorRequirement{},} []})
I0815 12:35:14.163243       1 controller.go:416] Processing Egress snat-default-ip UPDATE event with selector ({nil &LabelSelector{MatchLabels:map[string]string{field.cattle.io/projectId: p-ht5wm,},MatchExpressions:[]LabelSelectorRequirement{},} []})
I0815 12:35:14.171530       1 controller.go:416] Processing Egress snat-default-ip UPDATE event with selector ({nil &LabelSelector{MatchLabels:map[string]string{field.cattle.io/projectId: p-ht5wm,},MatchExpressions:[]LabelSelectorRequirement{},} []})
I0815 12:35:14.365428       1 controller.go:416] Processing Egress snat-default-ip UPDATE event with selector ({nil &LabelSelector{MatchLabels:map[string]string{field.cattle.io/projectId: p-ht5wm,},MatchExpressions:[]LabelSelectorRequirement{},} []})
I0815 12:35:14.371244       1 controller.go:416] Processing Egress snat-default-ip UPDATE event with selector ({nil &LabelSelector{MatchLabels:map[string]string{field.cattle.io/projectId: p-ht5wm,},MatchExpressions:[]LabelSelectorRequirement{},} []})
I0815 12:35:14.563450       1 controller.go:416] Processing Egress snat-default-ip UPDATE event with selector ({nil &LabelSelector{MatchLabels:map[string]string{field.cattle.io/projectId: p-ht5wm,},MatchExpressions:[]LabelSelectorRequirement{},} []})
I0815 12:35:14.571143       1 controller.go:416] Processing Egress snat-default-ip UPDATE event with selector ({nil &LabelSelector{MatchLabels:map[string]string{field.cattle.io/projectId: p-ht5wm,},MatchExpressions:[]LabelSelectorRequirement{},} []})
I0815 12:35:14.763937       1 controller.go:416] Processing Egress snat-default-ip UPDATE event with selector ({nil &LabelSelector{MatchLabels:map[string]string{field.cattle.io/projectId: p-ht5wm,},MatchExpressions:[]LabelSelectorRequirement{},} []})
I0815 12:35:14.771830       1 controller.go:416] Processing Egress snat-default-ip UPDATE event with selector ({nil &LabelSelector{MatchLabels:map[string]string{field.cattle.io/projectId: p-ht5wm,},MatchExpressions:[]LabelSelectorRequirement{},} []})
I0815 12:35:14.963498       1 controller.go:416] Processing Egress snat-default-ip UPDATE event with selector ({nil &LabelSelector{MatchLabels:map[string]string{field.cattle.io/projectId: p-ht5wm,},MatchExpressions:[]LabelSelectorRequirement{},} []})
I0815 12:35:14.971201       1 controller.go:416] Processing Egress snat-default-ip UPDATE event with selector ({nil &LabelSelector{MatchLabels:map[string]string{field.cattle.io/projectId: p-ht5wm,},MatchExpressions:[]LabelSelectorRequirement{},} []})

Below are the agent logs:

https://gist.github.com/iMikeG6/97417c05659e2b3ed514a732559d64f0

tnqn (Member) commented Aug 15, 2022

@iMikeG6 thanks for reporting the issue. It seems the egress IP 10.246.3.170/32 was configured on antrea-egress0 of two Nodes, I suspect the agents failed to reach each other to negotiate the active Egress node for the Egress IP. It could be caused by the nodes' firewall dropping the traffic. Could you check whether TCP port 10351 is allowed?
https://github.com/antrea-io/antrea/blob/main/docs/network-requirements.md

// ClusterMembershipPort is the server port used by the antrea-agent to run a gossip-based cluster
// membership protocol. Currently it's used only when the Egress feature is enabled.
// Defaults to 10351.
ClusterMembershipPort int `yaml:"clusterPort,omitempty"`
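
Since firewalld is enabled on your Nodes, opening the port on each gateway Node would look roughly like this (a sketch; TCP is the port referenced above, and memberlist typically uses UDP on the same port as well, so opening both is likely safest):

sudo firewall-cmd --permanent --add-port=10351/tcp
sudo firewall-cmd --permanent --add-port=10351/udp
sudo firewall-cmd --reload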

iMikeG6 (Author) commented Aug 15, 2022

Hi @tnqn,

Thank you very much for the quick reply.

Indeed, I realized that as soon as I dug into the agent logs.

I then opened port 10351 and restarted the agents and the controller; now only node1 has the VIP on antrea-egress0.
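
For anyone hitting the same issue: with the standard Antrea manifest in kube-system, restarting the agents and the controller is roughly the following (a sketch, assuming the default DaemonSet and Deployment names):

kubectl -n kube-system rollout restart daemonset/antrea-agent
kubectl -n kube-system rollout restart deployment/antrea-controller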

For the record, I used this document to open the firewall ports: https://antrea.io/docs/v1.6.0/docs/network-requirements/ which is missing port 10351. The doc needs to be updated.

Once again, thanks for your quick reply.

It looks like this solved the problem; if not, I'll reopen the issue if need be.

Cheers.

iMikeG6 (Author) commented Aug 15, 2022

Oops, my bad, I read the wrong doc page version instead of the main one. Port 10351 is indeed mentioned.

iMikeG6 closed this as completed Aug 17, 2022