Applying ACL rule causes BGP neighbor to go down #21183

Open
Javier-Tan opened this issue Dec 16, 2024 · 9 comments

@Javier-Tan

Javier-Tan commented Dec 16, 2024

Description

We noticed that applying a specific ACL rule causes one specific BGP neighbor (fc00::a) to go down during ACL tests (specifically those with the "IPV6" and "INGRESS" parameters). Removing the rule brings the neighbor back up.

admin@sonic:~$ show acl rule
...
DATA_INGRESS_IPV6_TEST  RULE_15       9985        DROP      DST_IPV6: 20c0:a800::9/128      {'asic0': 'Active', 'asic1': 'Active'}
                                                           IP_TYPE: IPV6ANY
...

admin@sonic:~$ show ipv6 bgp sum
...

Neighbhor      V     AS    MsgRcvd    MsgSent    TblVer    InQ    OutQ  Up/Down    State/PfxRcd    NeighborName
-----------  ---  -----  ---------  ---------  --------  -----  ------  ---------  --------------  --------------
...
fc00::a        4  65200        278         52         0      0       0  00:01:18   Connect         ARISTA03T3

admin@sonic:~$ show ipv6 interface
Interface       Master    IPv6 address/mask                            Admin/Oper    BGP Neighbor    Neighbor IP
--------------  --------  -------------------------------------------  ------------  --------------  -------------
...
Ethernet64                fc00::9/126                                  up/up         ARISTA03T3      fc00::a

Steps to reproduce the issue:

  1. Run any ACL tests with ipv6+ingress parameters e.g. acl/test_acl.py::TestBasicAcl::test_ingress_unmatched_blocked[ipv6-ingress-downlink->uplink-default-no_vlan] with breakpoint after ACL rules are applied
  2. After rule 15 is added, neighbor fc00::a goes down once BGP updates (~3 minutes)
  3. Removing the rule brings the neighbor back up immediately

NOTE: BGP neighbor fc00::a always goes down when the rule is applied during ipv6+ingress test runs. However, the only tests that fail are acl/test_acl.py::TestAclWithReboot...[ipv6-ingress...], because they have explicit BGP neighbor up checks.
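
The manual wait in step 2 could also be scripted. The following is a minimal sketch, not part of sonic-mgmt: `wait_for_state_change` and its `get_state` callable are hypothetical names, and `get_state` is assumed to return the string "Established" while the session is healthy. The 300-second default timeout is likewise an assumption, chosen only to exceed the ~3 minutes reported above.

```python
import time
from typing import Callable, Optional


def wait_for_state_change(
    get_state: Callable[[], str],
    healthy_state: str = "Established",
    timeout_s: float = 300.0,
    interval_s: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> Optional[str]:
    """Poll get_state() until it stops returning healthy_state.

    Returns the first non-healthy state observed, or None if the
    neighbor stayed healthy for the whole timeout.
    """
    deadline = clock() + timeout_s
    while True:
        state = get_state()
        if state != healthy_state:
            return state
        if clock() >= deadline:
            return None
        sleep(interval_s)
```

In a real run, `get_state` would wrap whatever query the testbed uses to read the state of fc00::a; `sleep` and `clock` are injectable only so the loop can be exercised without waiting.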

Describe the results you received:

ACL rule 15 causes BGP neighbor fc00::a to go down, even though the two are seemingly unrelated.

Describe the results you expected:

BGP neighbor fc00::a should stay up.

Output of show version:

SONiC Software Version: SONiC.20240510.16
BRCM SAI ver: [11.2.13.1], OCP SAI ver: [1.14.0], SDK ver: [sdk-6.5.30-SP4]

Output of show techsupport:

(paste your output here or download and attach the file here )

Additional information you deem important (e.g. issue happens only occasionally):

Rules applied can be found at sonic-mgmt-int/tests/acl/templates/acltb_v6_test_rules.j2

{
    "acl": {
        "acl-sets": {
            "acl-set": {
                "{{ acl_table_name }}": {
                    "acl-entries": {
                        "acl-entry": {
                            ...
                            "15": {
                                "actions": {
                                    "config": {
                                        "forwarding-action": "DROP"
                                    }
                                },
                                "config": {
                                    "sequence-id": 15
                                },
                                "ip": {
                                    "config": {
                                        "destination-ip-address": "20c0:a800::9/128"
                                    }
                                }
                            },
                            ...
                        }
                    }
                }
            }
        }
    }
}
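
To make the "seemingly unrelated" point concrete, here is a small sketch using Python's standard `ipaddress` module with the addresses quoted in this issue. It only checks prefix containment in software; it says nothing about how the ASIC actually programs or matches the rule. The helper name `rule_matches` is illustrative.

```python
import ipaddress

# Values copied from the issue text.
acl_dst = ipaddress.ip_network("20c0:a800::9/128")   # RULE_15 DST_IPV6
local_if = ipaddress.ip_interface("fc00::9/126")     # Ethernet64
neighbor = ipaddress.ip_address("fc00::a")           # ARISTA03T3


def rule_matches(dst_prefix, address):
    """Return True if a packet destined to `address` falls in `dst_prefix`."""
    return address in dst_prefix


# Does RULE_15's destination prefix cover either end of the BGP session?
matches_local = rule_matches(acl_dst, local_if.ip)
matches_neighbor = rule_matches(acl_dst, neighbor)

# Sanity check: is the neighbor on the same /126 as Ethernet64?
neighbor_on_link = neighbor in local_if.network

print(matches_local, matches_neighbor, neighbor_on_link)
# prints: False False True
```

By prefix arithmetic alone, the DROP rule should not match traffic to either fc00::9 or fc00::a, which is consistent with the expectation that the neighbor stays up.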

@Javier-Tan
Author

@arlakshm for vis

@arlakshm
Contributor

After this change (sonic-net/sonic-mgmt#15921), if any of the BGP sessions are down, the test is marked as failed.

@arlakshm
Contributor

@arista-nwolfe, @kenneth-arista, @saksarav-nokia, @sanjair-git do you see these failures as well?

@arista-nwolfe
Contributor

@arista-nwolfe, @kenneth-arista, @saksarav-nokia, @sanjair-git do you see these failures as well?

I'll try out the manual steps @Javier-Tan outlined with the pdb and wait to see if the bgp neighbors go down, but our ACL pass rate has been pretty consistently at 100% so we aren't seeing the failures caused by this.

@arlakshm
Contributor

Thanks @arista-nwolfe, are you using the latest sonic-mgmt code for 202405? As I mentioned earlier, after this change (sonic-net/sonic-mgmt#15921) we check that all the BGP sessions are up after applying the ACLs.

@arista-nwolfe
Contributor

Thanks @arista-nwolfe, are you using the latest sonic-mgmt code for 202405? As I mentioned earlier, after this change (sonic-net/sonic-mgmt#15921) we check that all the BGP sessions are up after applying the ACLs.

Yeah, last weekend's run has this change, and we didn't see any failures due to "All BGP sessions are not up after reboot, no point in continuing the test" on any of our 3 testbeds.

@sanjair-git

Hi @arlakshm, we have the latest code change from #15921 and all the tests from ACL are passing in our test beds too.

@arista-nwolfe
Contributor

@arista-nwolfe, @kenneth-arista, @saksarav-nokia, @sanjair-git do you see these failures as well?

I'll try out the manual steps @Javier-Tan outlined with the pdb and wait to see if the bgp neighbors go down, but our ACL pass rate has been pretty consistently at 100% so we aren't seeing the failures caused by this.

I see the same behavior @Javier-Tan sees when I put a pdb after setup_rules:

DATA_INGRESS_IPV6_TEST  RULE_15       9985        DROP      DST_IPV6: 20c0:a800::9/128      {'asic0': 'Active', 'asic1': 'Active'}
                                                            IP_TYPE: IPV6ANY
Neighbhor       V     AS    MsgRcvd    MsgSent    TblVer    InQ    OutQ  Up/Down    State/PfxRcd    NeighborName
------------  ---  -----  ---------  ---------  --------  -----  ------  ---------  --------------  --------------
fc00:3000::1    4  65100       1226       1220         0      0       0  00:22:33   6               ASIC0
fc00:3000::3    4  65100       1219       1227         0      0       0  00:22:33   7               ASIC1
fc00:3000::5    4  65100        719       1221         0      0       0  00:22:36   519             cmp214-6-ASIC0
fc00:3000::5    4  65100        719       1228         0      0       0  00:22:36   519             cmp214-6-ASIC0
fc00:3000::7    4  65100        719       1228         0      0       0  00:22:37   519             cmp214-6-ASIC1
fc00:3000::7    4  65100        722       1224         0      0       0  00:22:42   519             cmp214-6-ASIC1
fc00::2         4  65200        724       1046         0      0       0  00:22:38   34050           ARISTA01T3
fc00::16        4  65200        726        796         0      0       0  00:22:43   34050           ARISTA06T3
fc00::a         4  65200        699        757         0      0       0  00:01:11   Connect         ARISTA03T3
fc00::e         4  65200        725        795         0      0       0  00:22:39   34050           ARISTA04T3

Strangely, it's just the one neighbor that goes down.
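
Spotting the one down neighbor in a long summary can be automated. The sketch below parses text shaped like the `show ipv6 bgp sum` output above and reports rows whose State/PfxRcd column is not a prefix count. It assumes the 11-column layout shown in this thread; `down_neighbors` is a hypothetical helper, not a SONiC or sonic-mgmt function, and the sample rows are copied from the table above.

```python
SAMPLE = """\
Neighbhor       V     AS    MsgRcvd    MsgSent    TblVer    InQ    OutQ  Up/Down    State/PfxRcd    NeighborName
------------  ---  -----  ---------  ---------  --------  -----  ------  ---------  --------------  --------------
fc00::2         4  65200        724       1046         0      0       0  00:22:38   34050           ARISTA01T3
fc00::16        4  65200        726        796         0      0       0  00:22:43   34050           ARISTA06T3
fc00::a         4  65200        699        757         0      0       0  00:01:11   Connect         ARISTA03T3
fc00::e         4  65200        725        795         0      0       0  00:22:39   34050           ARISTA04T3
"""


def down_neighbors(summary_text):
    """Return (neighbor, state, name) for rows whose State/PfxRcd is not a number."""
    down = []
    for line in summary_text.splitlines():
        fields = line.split()
        # Data rows have at least 11 columns and a numeric version column;
        # this skips the header, the dashed separator, and "..." lines.
        if len(fields) < 11 or not fields[1].isdigit():
            continue
        neighbor, name = fields[0], fields[-1]
        # A state may contain spaces, so join everything between
        # the Up/Down column and the neighbor name.
        state = " ".join(fields[9:-1])
        if not state.isdigit():
            down.append((neighbor, state, name))
    return down


print(down_neighbors(SAMPLE))
# prints: [('fc00::a', 'Connect', 'ARISTA03T3')]
```

An established session shows a received-prefix count in that column, so any non-numeric value (here "Connect") marks a session that is not up.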

@Javier-Tan
Author

Javier-Tan commented Dec 16, 2024

Sorry @arista-nwolfe, I wasn't clear enough in the description: it is just that one BGP neighbor, "fc00::a", that goes down, so this is the same bug we see.

@rlhui rlhui added the BRCM label Dec 18, 2024