Add watchdog mechanism to swss service and generate alert when swss have issue. #15429

Merged 1 commit on Jun 13, 2023

@liuh-80 liuh-80 commented Jun 12, 2023

Add a watchdog mechanism to the swss service and generate an alert when swss has an issue.

Work item tracking
Microsoft ADO (number only): 16578912

What I did
Added an orchagent watchdog that monitors orchagent and raises an alert when it is stuck.

Why I did it
Currently, the SONiC monit system only checks whether the orchagent process exists. If the orchagent process is stuck and stops processing, monit cannot detect or report it.
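To illustrate the difference between an existence check and a liveness check, here is a minimal sketch of a heartbeat-based watchdog. It is not the code from this PR: the `HeartbeatWatchdog` class, the `format_alert` helper, and the 60-second timeout are all hypothetical names and values chosen for illustration. The only detail taken from the PR is the wording of the alert message.

```python
import time


class HeartbeatWatchdog:
    """Hypothetical sketch: remembers when each process last sent a heartbeat."""

    def __init__(self, timeout_seconds=60.0, clock=time.monotonic):
        # timeout_seconds is an assumed value, not taken from the PR.
        self.timeout_seconds = timeout_seconds
        self.clock = clock
        self.last_heartbeat = {}

    def register(self, process_name):
        # Start tracking a process; the timer starts at registration time.
        self.last_heartbeat[process_name] = self.clock()

    def heartbeat(self, process_name):
        # Called whenever a heartbeat message arrives from the process.
        self.last_heartbeat[process_name] = self.clock()

    def stuck_processes(self):
        # A process that still exists but has been silent for at least
        # timeout_seconds is reported, with the silent time in seconds.
        now = self.clock()
        return {
            name: now - last
            for name, last in self.last_heartbeat.items()
            if now - last >= self.timeout_seconds
        }


def format_alert(process_name, namespace, elapsed_seconds):
    # Mirrors the wording of the log message quoted in this PR.
    return "Process '{}' is stuck in namespace '{}' ({:.1f} minutes).".format(
        process_name, namespace, elapsed_seconds / 60.0
    )
```

A process paused with SIGSTOP still exists, so an existence check passes, but it stops sending heartbeats and would show up in `stuck_processes()` once the timeout elapses.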

How I verified it
All UTs pass.

Manually verified that process_monitoring/test_critical_process_monitoring.py passes.

Added a new UT, sonic-net/sonic-mgmt#8306, to check that the watchdog works correctly.

Manual test: after pausing orchagent with 'kill -STOP <pid>', check that the following alert appears in the log:

Apr 28 23:36:41.504923 vlab-01 ERR swss#supervisor-proc-watchdog-listener: Process 'orchagent' is stuck in namespace 'host' (1.0 minutes).
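When checking for this alert in a script, a regular expression can pull out the process name, namespace, and stuck duration. The sketch below is an assumption-based helper, not part of this PR or of the referenced UT: `STUCK_ALERT` and `find_stuck_alerts` are hypothetical names, and the pattern is derived only from the single log line shown above.

```python
import re

# Pattern inferred from the one sample log line in this PR; other message
# variants, if any exist, are not covered.
STUCK_ALERT = re.compile(
    r"Process '(?P<process>[^']+)' is stuck in namespace '(?P<namespace>[^']+)' "
    r"\((?P<minutes>\d+(?:\.\d+)?) minutes\)\."
)


def find_stuck_alerts(log_lines):
    """Return (process, namespace, minutes) for each matching log line."""
    alerts = []
    for line in log_lines:
        match = STUCK_ALERT.search(line)
        if match:
            alerts.append(
                (
                    match.group("process"),
                    match.group("namespace"),
                    float(match.group("minutes")),
                )
            )
    return alerts
```

Applied to the log line above, `find_stuck_alerts` yields one entry for `orchagent` in namespace `host`.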

Details if related
Heartbeat message PR: sonic-net/sonic-swss#2737
UT PR: sonic-net/sonic-mgmt#8306

@liuh-80 liuh-80 requested a review from qiluo-msft June 12, 2023 08:46
@liuh-80 liuh-80 marked this pull request as ready for review June 12, 2023 08:46
@liuh-80 liuh-80 requested a review from lguohan as a code owner June 12, 2023 08:46
@qiluo-msft qiluo-msft merged commit 05f1a5a into sonic-net:master Jun 13, 2023
sonic-otn pushed a commit to sonic-otn/sonic-buildimage that referenced this pull request Sep 20, 2023
…ave issue. (sonic-net#15429)

mint570 commented Dec 13, 2024

This PR introduces some log spam. Filed #21157.
