Fix supervisor-proc-exit-listener false alert during warm reboot issue. #16742
status,
namespace,
dead_minutes)
if is_warm_reboot():
Fixed. The error log is now ignored only when orchagent is stuck during warm reboot.
It seems that WARM_RESTART_ENABLE_TABLE is the indicator for warm boot-up, not for warm shutdown, which is what you want. Since the orchagent process has full visibility of its own freezing status, how about having it continue the heartbeat even while it is frozen?
Sorry for misleading you.
Currently orchagent freezes by sleeping. I will check whether the restart code in orchagent can be improved so that it stops handling all tasks instead:
SWSS_LOG_WARN("Orchagent is frozen for warm restart!");
sleep(UINT_MAX);
Closing this PR; the issue will be fixed in orchagent instead: sonic-net/sonic-swss#2923
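The reviewer's suggestion, continuing the heartbeat while frozen rather than blocking in one long sleep, could look roughly like the sketch below. It is illustrative Python only: orchagent itself is C++, and `frozen_wait`, `send_heartbeat`, and `should_stay_frozen` are hypothetical names that do not come from sonic-swss.

```python
import time


def frozen_wait(send_heartbeat, should_stay_frozen,
                interval_seconds=1.0, sleep=time.sleep):
    """Idle while frozen, but keep emitting heartbeats.

    All parameters are hypothetical stand-ins:
      send_heartbeat     -- callable that notifies the supervisor listener
      should_stay_frozen -- callable returning False once the wait may end
      sleep              -- injectable so the loop can be tested without delay
    Returns the number of heartbeats sent.
    """
    beats = 0
    while should_stay_frozen():
        send_heartbeat()          # the listener keeps seeing a live process
        beats += 1
        sleep(interval_seconds)   # short sleeps replace one indefinite sleep
    return beats
```

With a loop of this shape the listener would not need to know about warm reboot at all, because the frozen process never stops reporting in.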
Fix supervisor-proc-exit-listener false alert during warm reboot issue: #16686
Why I did it
supervisor-proc-exit-listener generates false alerts during warm reboot.
Work item tracking
How I did it
Ignore the alert message during warm reboot.
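A minimal sketch of this idea, based only on the log lines shown in the verification section and on the review discussion: when a warm reboot is in progress and the stuck process is orchagent, the message is logged at INFO with a "Warm rebooting, " prefix instead of at ERR. The function name `build_stuck_alert` and the `warm_rebooting` flag are assumptions for illustration; how the listener actually detects warm reboot is not shown here.

```python
import logging


def build_stuck_alert(process_name, namespace, dead_minutes, warm_rebooting):
    """Return (log_level, message) for a process that stopped sending heartbeats.

    Hypothetical helper: `warm_rebooting` is assumed to be supplied by the
    caller; the real detection logic is outside this sketch.
    """
    message = "Process '{}' is stuck in namespace '{}' ({:.1f} minutes).".format(
        process_name, namespace, dead_minutes)
    # Only orchagent is expected to stop responding during warm reboot,
    # so every other process still raises an error.
    if warm_rebooting and process_name == "orchagent":
        return logging.INFO, "Warm rebooting, " + message
    return logging.ERROR, message
```

Returning the level and message instead of logging directly keeps the decision easy to unit test; the caller would pass both to its own logger.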
How to verify it
All unit tests pass.
Manually verified that the issue is fixed:
Sep 28 07:53:20.285652 vlab-01 ERR swss#supervisor-proc-exit-listener: message repeated 27 times: [ Process 'orchagent' is stuck in namespace 'host' (7.0 minutes).]
Sep 28 07:53:20.287377 vlab-01 INFO swss#supervisor-proc-exit-listener: Warm rebooting, Process 'orchagent' is stuck in namespace 'host' (7.0 minutes).
Which release branch to backport (provide reason below if selected)
Tested branch (Please provide the tested image version)
Description for the changelog
Fix supervisor-proc-exit-listener false alert during warm reboot issue.
Link to config_db schema for YANG module changes
A picture of a cute animal (not mandatory but encouraged)