[202405] Running vrf test cases in T0 topology crashes orchagent #21431

Open
vikramchandra123 opened this issue Jan 14, 2025 · 1 comment

When running the VRF test cases on a T0 topology, the box is rebooted with the VRF config. After the reboot, orchagent crashes and the ports do not come up. As a result, every VRF test case fails in setup after loading the VRF configuration. I have not looked at the exact code changes yet, but here is some additional info.

Core file info:

[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `/usr/bin/orchagent -d /var/log/swss -b 1024 -s -m 4c:62:cd:b3:b1:18'.
Program terminated with signal SIGABRT, Aborted.
#0  0x00007f4798d53ebc in ?? () from /lib/x86_64-linux-gnu/libc.so.6
[Current thread is 1 (Thread 0x7f47985caa40 (LWP 116))]
(gdb) bt
#0  0x00007f4798d53ebc in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x00007f4798d04fb2 in raise () from /lib/x86_64-linux-gnu/libc.so.6
#2  0x00007f4798cef472 in abort () from /lib/x86_64-linux-gnu/libc.so.6
#3  0x00007f4799046919 in ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6
#4  0x00007f4799051e1a in ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6
#5  0x00007f4799051e85 in std::terminate() () from /lib/x86_64-linux-gnu/libstdc++.so.6
#6  0x00007f47990520d8 in __cxa_throw () from /lib/x86_64-linux-gnu/libstdc++.so.6
#7  0x00007f4799049240 in std::__throw_out_of_range(char const*) () from /lib/x86_64-linux-gnu/libstdc++.so.6
#8  0x00005647a68daba6 in std::map<unsigned long, std::map<swss::IpPrefix, RouteNhg, std::less<swss::IpPrefix>, std::allocator<std::pair<swss::IpPrefix const, RouteNhg> > >, std::less<unsigned long>, std::allocator<std::pair<unsigned long const, std::map<swss::IpPrefix, RouteNhg, std::less<swss::IpPrefix>, std::allocator<std::pair<swss::IpPrefix const, RouteNhg> > > > > >::at (
    this=<optimized out>, __k=<optimized out>) at /usr/include/c++/12/bits/stl_map.h:551
#9  0x00005647a68d245b in RouteOrch::addRoutePost (this=this@entry=0x5647a8477680, ctx=..., nextHops=...) at ./orchagent/routeorch.cpp:2143
#10 0x00005647a68d7b5e in RouteOrch::doTask (this=0x5647a8477680, consumer=...) at ./orchagent/routeorch.cpp:1021
#11 0x00005647a6896c82 in Orch::doTask (this=0x5647a8477680) at ./orchagent/orch.cpp:539
#12 0x00005647a6887d0a in OrchDaemon::start (this=this@entry=0x5647a83d9ce0) at ./orchagent/orchdaemon.cpp:880
#13 0x00005647a67f5a49 in main (argc=<optimized out>, argv=<optimized out>) at ./orchagent/main.cpp:810


(gdb) p m_syncdRoutes
$5 = std::map with 1 element = {[844424930132002] = std::map with 2 elements = {[{m_ip = {m_ip = {family = 2 '\002', ip_addr = {ipv4_addr = 0, ipv6_addr = '\000' <repeats 15 times>}}}, 
      m_mask = 0}] = {nhg_key = {m_nexthops = std::set with 0 elements, m_overlay_nexthops = false, m_srv6_nexthops = false}, nhg_index = ""}, [{m_ip = {m_ip = {family = 10 '\n', ip_addr = {
            ipv4_addr = 0, ipv6_addr = '\000' <repeats 15 times>}}}, m_mask = 0}] = {nhg_key = {m_nexthops = std::set with 0 elements, m_overlay_nexthops = false, m_srv6_nexthops = false}, 
      nhg_index = ""}}}
(gdb) p vrf_id
$6 = (const sai_object_id_t &) @0x5647a84c2fd0: 844424930133303
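
To illustrate what the backtrace and the two gdb prints suggest: `m_syncdRoutes` holds a single VRF key (844424930132002), while the `vrf_id` being looked up is 844424930133303, so `std::map::at` throws `std::out_of_range`, and since nothing catches it the process ends in `std::terminate` and SIGABRT. Below is a minimal standalone sketch of that behaviour, not the actual `routeorch.cpp` code. The type aliases and the helpers `lookupThrows` and `findVrfRoutes` are hypothetical names made up for this illustration, and the route table is simplified to a string map.

```cpp
#include <cstdint>
#include <map>
#include <stdexcept>
#include <string>

// Hypothetical stand-ins for the orchagent types; NOT the real definitions.
using VrfId = std::uint64_t;                             // stands in for sai_object_id_t
using RouteTable = std::map<std::string, std::string>;   // prefix -> nhg_index (simplified)
using SyncdRoutes = std::map<VrfId, RouteTable>;

// Builds a map shaped like the one printed in gdb: one VRF with two default routes.
SyncdRoutes makeRoutesFromCore() {
    return SyncdRoutes{{844424930132002ULL, {{"0.0.0.0/0", ""}, {"::/0", ""}}}};
}

// Returns true if at() throws for this key. In the crash the exception is
// not caught, which is why the backtrace ends in std::terminate and abort.
bool lookupThrows(const SyncdRoutes& routes, VrfId vrf) {
    try {
        (void)routes.at(vrf);
        return false;
    } catch (const std::out_of_range&) {
        return true;
    }
}

// Checked alternative: find() reports a missing VRF as nullptr instead of throwing.
const RouteTable* findVrfRoutes(const SyncdRoutes& routes, VrfId vrf) {
    auto it = routes.find(vrf);
    return it == routes.end() ? nullptr : &it->second;
}
```

With the values from the core, `lookupThrows(makeRoutesFromCore(), 844424930133303ULL)` is true and `findVrfRoutes` returns `nullptr` for the same key. Whether the right fix is a checked lookup or making sure the VRF entry exists before `addRoutePost` runs is for the maintainers to decide.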

This failure started happening recently and appears to be caused by the following commit in sonic-swss:

commit 640da98efd1b45e0dea276d9c0802e5b2afb83ab (origin/202405)
Author: abdosi <[email protected]>
Date:   2024-12-18 12:39:21 -0500

    Added change not to create ECMP Group in SAI and program the route if none of ECMP members are active/link-up (#3394)
    
    What I did:
    Added change not to create ECMP Group in SAI and program the route if none of the ECMP members are active/link-up.
    Also do not program the Temp Route if Neigh is not active (Link Down)
    
    Also as part of this change if Route is not programmed and if we remove that route than decrement VRF Reference count in removeRoute as removeRoutePost will not be called in this case.
    
    Why I did:
    In scale setup of T2 it's possible all links can go down simultaneously which case we can get Route messages with all nexthops being in down state. In such case we might create empty Nexthop Group in SAI for the given route which causes not needed SAI call for Nexthop Group creation and also create traffic blackhole for the route where that route can still forward traffic via default route if eligible/applicable.
    
    Also in this case no point to add Temp Route if neighbor is link down.

tudupa commented Jan 20, 2025

@abdosi Can you please take a look at this issue?
