Fix supervisor-proc-exit-listener false alert during warm reboot issue. #16742

Closed

liuh-80 (Contributor) commented Sep 28, 2023

Fixes the supervisor-proc-exit-listener false alert during warm reboot: #16686

Why I did it

supervisor-proc-exit-listener raises a false alert during warm reboot: it logs an error reporting that orchagent is stuck.

Work item tracking
  • Microsoft ADO: 25295846

How I did it

Ignore the alert message during warm reboot.

How to verify it

All unit tests pass.
Manually verified that the issue is fixed:

Sep 28 07:53:20.285652 vlab-01 ERR swss#supervisor-proc-exit-listener: message repeated 27 times: [ Process 'orchagent' is stuck in namespace 'host' (7.0 minutes).]
Sep 28 07:53:20.287377 vlab-01 INFO swss#supervisor-proc-exit-listener: Warm rebooting, Process 'orchagent' is stuck in namespace 'host' (7.0 minutes).

Which release branch to backport (provide reason below if selected)

  • 201811
  • 201911
  • 202006
  • 202012
  • 202106
  • 202111
  • [] 202205
  • [] 202211
  • [] 202305

Tested branch (Please provide the tested image version)

  • master-16742.373711-493d8b7bf

Description for the changelog

Fix supervisor-proc-exit-listener false alert during warm reboot issue.

Link to config_db schema for YANG module changes

A picture of a cute animal (not mandatory but encouraged)

@liuh-80 liuh-80 marked this pull request as ready for review September 28, 2023 09:40
@liuh-80 liuh-80 requested a review from lguohan as a code owner September 28, 2023 09:40
@liuh-80 liuh-80 requested a review from qiluo-msft September 28, 2023 09:40
status,
namespace,
dead_minutes)
if is_warm_reboot():
Collaborator

is_warm_reboot

Could this condition be more conservative? For example, is_warm_reboot and orchagent being stuck at the same time?

Contributor Author

Fixed: the error log is now ignored only when orchagent is stuck during warm reboot.
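
A minimal sketch of the narrowed condition, for illustration only. The function name `classify_stuck_alert` and the `warm_reboot` flag are hypothetical and are not taken from the PR diff; the message text and the "Warm rebooting, " prefix follow the log lines quoted in the description.

```python
def classify_stuck_alert(process_name, namespace, dead_minutes, warm_reboot):
    """Return (level, message) for a process that has stopped sending heartbeats.

    Hypothetical helper: only an orchagent stall during warm reboot is
    downgraded to INFO; every other stall is still reported as ERR.
    """
    message = "Process '{}' is stuck in namespace '{}' ({:.1f} minutes).".format(
        process_name, namespace, dead_minutes)
    if warm_reboot and process_name == "orchagent":
        # Expected stall: orchagent is frozen on purpose during warm reboot.
        return "INFO", "Warm rebooting, " + message
    # Anything else is treated as a genuine stuck process.
    return "ERR", message
```

With this shape, a stuck process other than orchagent still produces an ERR entry during warm reboot, which is the more conservative behavior the review asked for.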

Collaborator

It seems that WARM_RESTART_ENABLE_TABLE indicates warm boot-up, not warm shutdown, which is the case you want to detect. Since the orchagent process has full visibility of its own freeze status, how about having it continue sending heartbeats even while it is frozen?

Sorry for the misleading suggestion.

Contributor Author

Currently orchagent freezes by sleeping. I will check whether the restart code in orchagent can be improved so that it stops handling all tasks instead:

                SWSS_LOG_WARN("Orchagent is frozen for warm restart!");
                sleep(UINT_MAX);
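
The alternative raised in review (keep sending heartbeats while frozen) could look roughly like the sketch below. It is written in Python purely for illustration, whereas orchagent itself is C++; `freeze_with_heartbeat`, `send_heartbeat`, and `should_stop` are hypothetical names, and this is not necessarily the approach taken in sonic-net/sonic-swss#2923.

```python
import time


def freeze_with_heartbeat(send_heartbeat, should_stop,
                          interval_seconds=1.0, sleep=time.sleep):
    """Stay frozen without going silent.

    Instead of one unbounded sleep, sleep in short intervals and emit a
    heartbeat before each one, so a watchdog does not see the process as
    stuck. Returns the number of heartbeats sent.
    """
    beats = 0
    while not should_stop():
        send_heartbeat()          # tell the watchdog we are alive
        beats += 1
        sleep(interval_seconds)   # remain frozen; no tasks are handled here
    return beats
```

The `sleep` parameter is injectable only so the loop can be exercised without real delays; in a process that stays frozen until it is killed, `should_stop` would simply always return False.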

@qiluo-msft qiluo-msft requested a review from vaibhavhd October 10, 2023 18:40
liuh-80 (Contributor, Author) commented Oct 11, 2023

Closing this PR; the issue will be fixed in orchagent instead: sonic-net/sonic-swss#2923

@liuh-80 liuh-80 closed this Oct 11, 2023