[warm-reboot] ERR swss#supervisor-proc-exit-listener: Process 'orchagent' is stuck in namespace 'host' (1.0 minutes). after performing warm-boot command #16686

Closed
dgsudharsan opened this issue Sep 25, 2023 · 3 comments
Labels: Issue for 202305, MSFT, Triaged

Comments

@dgsudharsan
Collaborator

Description

The following error message appears in syslog when running the warm-boot command. It is caused by the watchdog recently introduced in #15429 to monitor orchagent. The watchdog should be disabled while warm-boot or fast-boot commands are executing.

Sep  7 18:46:12.366130 r-anaconda-51 NOTICE swss#orchagent: :- setAgingFDB: Set switch 21000000000000 fdb_aging_time 0 sec
Sep  7 18:46:12.366130 r-anaconda-51 INFO swss#orchagent: :- set: setting attribute 0x10000004 status: SAI_STATUS_SUCCESS
Sep  7 18:46:12.366130 r-anaconda-51 WARNING swss#orchagent: :- start: Orchagent is frozen for warm restart!
**Sep  7 18:47:07.639517 r-anaconda-51 ERR swss#supervisor-proc-exit-listener: Process 'orchagent' is stuck in namespace 'host' (1.0 minutes).**
Sep  7 18:47:30.581322 r-anaconda-51 INFO systemd[1]: Stopping switch state service...
Sep  7 18:47:30.640958 r-anaconda-51 NOTICE root: Stopping swss service...
Sep  7 18:47:30.645693 r-anaconda-51 NOTICE root: Locking /tmp/swss-syncd-lock from swss service
Sep  7 18:47:30.651871 r-anaconda-51 NOTICE root: Locked /tmp/swss-syncd-lock (10) from swss service
Sep  7 18:47:30.673126 r-anaconda-51 NOTICE root: Warm boot flag: swss true.
Sep  7 18:47:30.686321 r-anaconda-51 NOTICE root: Fast boot flag: swss false.
Sep  7 18:47:30.690974 r-anaconda-51 NOTICE root: Killing Docker swss...
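
The behaviour requested here can be sketched as a heartbeat watchdog that stays quiet while a warm or fast reboot is in progress. This is only an illustration, not the actual supervisor-proc-exit-listener code: the class name `HeartbeatWatchdog`, the `reboot_in_progress` callback, and the injectable `clock` are assumptions made for the example.

```python
import time
from typing import Callable, Dict, List


class HeartbeatWatchdog:
    """Hypothetical watchdog: reports processes whose last heartbeat is too old."""

    def __init__(self, timeout_s: float,
                 reboot_in_progress: Callable[[], bool],
                 clock: Callable[[], float] = time.monotonic):
        self.timeout_s = timeout_s
        # Assumed hook; in practice this might read a warm/fast boot flag.
        self.reboot_in_progress = reboot_in_progress
        self.clock = clock
        self.last_seen: Dict[str, float] = {}

    def heartbeat(self, process: str) -> None:
        """Record that `process` is alive right now."""
        self.last_seen[process] = self.clock()

    def stuck_processes(self) -> List[str]:
        """Return processes past the timeout, or nothing during a planned reboot."""
        if self.reboot_in_progress():
            return []
        now = self.clock()
        return [name for name, seen in self.last_seen.items()
                if now - seen >= self.timeout_s]
```

With a 60-second timeout, a process that last sent a heartbeat 61 seconds ago would be reported as stuck, unless the `reboot_in_progress` hook returns true, in which case the list is empty.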

Steps to reproduce the issue:

  1. Execute warmboot.

Describe the results you received:

Error in logs

Describe the results you expected:

No error in logs.

Output of show version:

(paste your output here)

Output of show techsupport:

SONiC Software Version: SONiC.202305_RC.1-ca3667ac0_Internal
SONiC OS Version: 11
Distribution: Debian 11.7
Kernel: 5.10.0-18-2-amd64
Build commit: ca3667ac0
Build date: Thu Aug 31 08:08:59 UTC 2023
Built by: sw-r2d2-bot@r-build-sonic-ci03-244

Platform: x86_64-mlnx_msn3700-r0
HwSKU: ACS-MSN3700
ASIC: mellanox
ASIC Count: 1
Serial Number: MT1949X06182
Model Number: MSN3700-VS2F
Hardware Revision: A3
Uptime: 18:50:52 up 2 min,  3 users,  load average: 2.25, 1.27, 0.51
Date: Thu 07 Sep 2023 18:50:52

Docker images:
REPOSITORY                                         TAG                              IMAGE ID       SIZE
docker-orchagent                                   202305_RC.1-ca3667ac0_Internal   eea7e5499baf   328MB
docker-orchagent                                   latest                           eea7e5499baf   328MB
docker-platform-monitor                            202305_RC.1-ca3667ac0_Internal   e900cb12e21a   859MB
docker-platform-monitor                            latest                           e900cb12e21a   859MB
docker-nat                                         202305_RC.1-ca3667ac0_Internal   2179d6fe9058   319MB
docker-nat                                         latest                           2179d6fe9058   319MB
docker-teamd                                       202305_RC.1-ca3667ac0_Internal   526d2e82e919   317MB
docker-teamd                                       latest                           526d2e82e919   317MB
docker-sflow                                       202305_RC.1-ca3667ac0_Internal   dc2a4b5005ea   318MB
docker-sflow                                       latest                           dc2a4b5005ea   318MB
docker-fpm-frr                                     202305_RC.1-ca3667ac0_Internal   68295485543a   348MB
docker-fpm-frr                                     latest                           68295485543a   348MB
docker-syncd-mlnx                                  202305_RC.1-ca3667ac0_Internal   6346ad2d0073   870MB
docker-syncd-mlnx                                  latest                           6346ad2d0073   870MB
docker-macsec                                      latest                           27ca898ade3a   319MB
docker-snmp                                        202305_RC.1-ca3667ac0_Internal   4f7178d22f96   338MB
docker-snmp                                        latest                           4f7178d22f96   338MB
docker-sonic-telemetry                             202305_RC.1-ca3667ac0_Internal   ab21c6b604b7   599MB
docker-sonic-telemetry                             latest                           ab21c6b604b7   599MB
docker-dhcp-relay                                  latest                           1c536017f212   306MB
docker-eventd                                      202305_RC.1-ca3667ac0_Internal   05ec40cf1898   299MB
docker-eventd                                      latest                           05ec40cf1898   299MB
docker-lldp                                        202305_RC.1-ca3667ac0_Internal   760bf2209a2e   341MB
docker-lldp                                        latest                           760bf2209a2e   341MB
docker-router-advertiser                           202305_RC.1-ca3667ac0_Internal   8a56229f873b   299MB
docker-router-advertiser                           latest                           8a56229f873b   299MB
docker-mux                                         202305_RC.1-ca3667ac0_Internal   14cc6ecdbc5a   348MB
docker-mux                                         latest                           14cc6ecdbc5a   348MB
docker-database                                    202305_RC.1-ca3667ac0_Internal   2766cb678d20   299MB
docker-database                                    latest                           2766cb678d20   299MB
docker-sonic-mgmt-framework                        202305_RC.1-ca3667ac0_Internal   182387b4f331   415MB
docker-sonic-mgmt-framework                        latest                           182387b4f331   415MB
urm.nvidia.com/sw-nbu-sws-sonic-docker/sonic-wjh   1.0.0-202305-1                   da0d5011c828   432MB
urm.nvidia.com/sw-nbu-sws-sonic-docker/doai        1.0.0-202305-1                   17676f080268   277MB

Additional information you deem important (e.g. issue happens only occasionally):

sonic_dump_r-anaconda-51_20230907_185038.tar.gz

@dgsudharsan
Collaborator Author

@qiluo-msft @liuh-80 Can you please investigate and, if possible, stop the watchdog during warmboot and fastboot commands?

@judyjoseph added the Triaged and MSFT labels on Sep 27, 2023
@judyjoseph
Contributor

@vaibhavhd FYI

@liuh-80
Contributor

liuh-80 commented Sep 28, 2023

I will investigate and create a fix for this issue ASAP.

StormLiangMS pushed a commit to sonic-net/sonic-swss that referenced this issue Nov 11, 2023
Orchagent sends heartbeat during warm-reboot to prevent the orchagent stuck alert.

Why I did it
Orchagent freezes during warm-reboot, which causes supervisor-proc-exit-listener to generate a false alert:
sonic-net/sonic-buildimage#16686

Work item tracking
Microsoft ADO: 25295846
How I did it
Send heartbeat during warm-reboot freeze.
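
One way to picture this fix: instead of blocking indefinitely while frozen, the frozen loop wakes up at a fixed interval and emits a heartbeat, so the watchdog keeps seeing the process as alive. The sketch below is written in Python for brevity and is not the orchagent implementation; `frozen_loop`, `send_heartbeat`, `interval_s`, and `max_beats` are hypothetical names.

```python
import threading
from typing import Callable


def frozen_loop(unfreeze: threading.Event,
                send_heartbeat: Callable[[], None],
                interval_s: float,
                max_beats: int) -> int:
    """Stay frozen until `unfreeze` is set, sending a heartbeat every `interval_s`.

    `max_beats` bounds the loop so this sketch always terminates.
    Returns the number of heartbeats sent.
    """
    beats = 0
    while beats < max_beats:
        # Event.wait() returns True once the event is set, False on timeout.
        if unfreeze.wait(timeout=interval_s):
            break
        send_heartbeat()
        beats += 1
    return beats
```

The heartbeat interval would need to be shorter than the watchdog's stuck threshold for the alert to be suppressed.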

How to verify it
All UTs pass.
Manually verified that the issue is fixed by checking syslog.
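
The manual syslog check could be scripted roughly as follows. The regular expression is modeled on the error line quoted in this issue; the helper name `find_stuck_alerts` is hypothetical, and the exact message format may differ between releases.

```python
import re
from typing import Iterable, List, Tuple

# Pattern modeled on the alert text reported in this issue.
STUCK_RE = re.compile(
    r"supervisor-proc-exit-listener: Process '(?P<proc>[^']+)' is stuck "
    r"in namespace '(?P<ns>[^']+)'"
)


def find_stuck_alerts(lines: Iterable[str]) -> List[Tuple[str, str]]:
    """Return a (process, namespace) pair for every stuck alert found."""
    alerts = []
    for line in lines:
        match = STUCK_RE.search(line)
        if match:
            alerts.append((match.group("proc"), match.group("ns")))
    return alerts


# Two lines taken from the log excerpt in this issue.
sample = [
    "Sep  7 18:46:12.366130 r-anaconda-51 WARNING swss#orchagent: "
    ":- start: Orchagent is frozen for warm restart!",
    "Sep  7 18:47:07.639517 r-anaconda-51 ERR swss#supervisor-proc-exit-listener: "
    "Process 'orchagent' is stuck in namespace 'host' (1.0 minutes).",
]
print(find_stuck_alerts(sample))  # [('orchagent', 'host')]
```

After the fix, running the same check over the syslog captured during a warm reboot should return an empty list.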
liuh-80 added a commit to sonic-net/sonic-mgmt that referenced this issue Nov 17, 2023
Add orchagent heartbeat during warm-reboot UT

### Description of PR
Add orchagent heartbeat during warm-reboot UT

##### Work item tracking
- Microsoft ADO: 25295846

### Type of change


- [ ] Bug fix
- [ ] Testbed and Framework(new/improvement)
- [x] Test case(new/improvement)


### Back port request
- [ ] 201911
- [ ] 202012
- [ ] 202205

### Approach
#### What is the motivation for this PR?
Fix orchagent stuck error during warm-reboot:
sonic-net/sonic-buildimage#16686

#### How did you do it?
Add a new UT: freeze orchagent for warm-reboot, then check that the process listener does not send an alert.

#### How did you verify/test it?
Pass all UT

#### Any platform specific information?

#### Supported testbed topology if it's a new test case?

### Documentation