Egress SNAT HA breaks while using multiple nodes for the SNAT gateway #4115
Comments
Here are the controller logs; the agent logs are here: https://gist.github.com/iMikeG6/97417c05659e2b3ed514a732559d64f0
@iMikeG6 thanks for reporting the issue. It seems the egress IP 10.246.3.170/32 was configured on antrea-egress0 of two Nodes; I suspect the agents failed to reach each other to negotiate the active Egress Node for the Egress IP. It could be caused by the Nodes' firewall dropping the traffic. Could you check whether TCP port 10351 is allowed? (See antrea/pkg/config/agent/config.go, lines 125 to 128 at 65b62cc.)
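For context, 10351 is the port the antrea-agent uses for its gossip-based cluster membership protocol, which the Egress feature relies on to elect the active Egress Node. A minimal excerpt of the corresponding antrea-agent setting, shown with what should be its default value (a sketch, not taken from the reporter's cluster):

```yaml
# antrea-agent.conf (excerpt)
# Port used by the antrea-agent's gossip-based cluster membership protocol,
# which Egress uses to pick the active Node for each egress IP. Node
# firewalls must allow this port between all Nodes.
clusterMembershipPort: 10351
```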
Hi @tnqn, thank you very much for the quick reply. Indeed, I realized that as soon as I dug into the agent logs. I then opened port 10351 and restarted the agents and the controller, and now only node1 has the VIP on antrea-egress0. For the record, I used this document to open the firewall ports: https://antrea.io/docs/v1.6.0/docs/network-requirements/, which is missing port 10351; the doc needs to be updated. Once again, thanks for your quick reply. It looks like this solved the problem; if not, I'll reopen the issue if need be. Cheers.
Oops, my bad, I read the wrong doc page version instead of the main one. Port 10351 is indeed mentioned.
Describe the bug
We are currently testing the Antrea Egress SNAT feature in HA mode. Two nodes carry the snat-origin label, let's say node1 and node3, and both have the interface down.
The leader is currently node1. The SNAT pool and the Egress config are the following:
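The actual manifests were not captured in this thread; below is a minimal sketch of an ExternalIPPool and Egress pair consistent with the details in this report (the egress IP 10.246.3.170, the snat-origin Node label, and the project selector discussed next). The resource names and the label value are hypothetical:

```yaml
# Sketch only: names are hypothetical; apiVersion matches Antrea v1.x CRDs.
apiVersion: crd.antrea.io/v1alpha2
kind: ExternalIPPool
metadata:
  name: snat-pool                # hypothetical name
spec:
  ipRanges:
    - start: 10.246.3.170        # the egress IP discussed in this issue
      end: 10.246.3.170
  nodeSelector:
    matchLabels:
      snat-origin: "true"        # assumed value; the report only names the label key
---
apiVersion: crd.antrea.io/v1alpha2
kind: Egress
metadata:
  name: project-egress           # hypothetical name
spec:
  appliedTo:
    namespaceSelector:
      matchLabels:
        field.cattle.io/projectId: p-ht5wm
  egressIP: 10.246.3.170
  externalIPPool: snat-pool
```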
The Egress is applied via the following matchLabels selector: field.cattle.io/projectId: p-ht5wm, which, when using Rancher, applies to all namespaces within project p-ht5wm. But the selector could also have been kubernetes.io/metadata.name: default; the result is the same (a sketch of this variant follows below).
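For illustration, the alternate selector would replace the namespaceSelector in the Egress spec like this (again a sketch, not the reporter's manifest):

```yaml
spec:
  appliedTo:
    namespaceSelector:
      matchLabels:
        kubernetes.io/metadata.name: default
```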
Everything works fine until the leader node is rebooted: the second node (node3) takes the lead on the egress IP 10.246.3.170, and pods can still reach the internet or ping external machines outside the k8s cluster. But when node1 comes back up, things start to get messy: randomly, pod1 can't reach the internet but pod2 can, and sometimes it's the opposite, pod2 can't and pod1 can.
The only way I can get back to a normal state is by removing the snat-origin label on node3, which basically means that only one node handles the egress SNAT IP, which in a way breaks the HA feature.
Expected
Pods should still be able to get internet access no matter which Antrea Egress leader holds the egress SNAT IP.
Versions:
Antrea version: Main
Kubernetes version: 1.23.7 (Rancher-deployed RKE1 cluster)
Docker version: 20.10.17, build 100c701
Linux kernel: 5.14.0-70.22.1.el9_0.x86_64 #1 SMP PREEMPT Tue Aug 2 10:02:12 EDT 2022 x86_64 GNU/Linux
OS: Red Hat Enterprise Linux release 9.0 (Plow)
Additional context
All nodes use Intel SFP+ cards with active-backup bonding.
The problem also occurred on our VMware infrastructure while I was testing on virtual machines.
The logs were not relevant, but let me know if you need them. Currently, on our physical machines, I have dedicated only one machine to handle the egress SNAT gateway.
Firewalld is enabled.
SELinux is in permissive mode.
Thanks for your help.