featured: use run() to run cli commands in place of check_call() #177

Merged
merged 2 commits into from
Nov 18, 2024

anamehra
Contributor

Signed-off-by: anamehra [email protected]

Fixes: sonic-net/sonic-buildimage#20662

During some reboots, the featured.service script occasionally failed to start services such as pmon, snmp, and lldp.

The logs show that the 'sudo systemctl enable' command failed with return code -13, meaning it was killed by signal 13 (SIGPIPE):

2024 Oct 29 01:31:26.191236 aaa14-rp INFO featured: Running cmd: '['sudo', 'systemctl', 'unmask', 'pmon.service']'
2024 Oct 29 01:31:26.211167 aaa14-rp INFO systemd[1]: Reloading.
2024 Oct 29 01:31:27.212381 aaa14-rp INFO featured: Running cmd: '['sudo', 'systemctl', 'enable', 'pmon.service']'
2024 Oct 29 01:31:27.232428 aaa14-rp INFO systemd[1]: Reloading.
2024 Oct 29 01:31:28.135667 aaa14-rp ERR featured: ['sudo', 'systemctl', 'enable', 'pmon.service'] - failed: return code - -13, output:#012None
2024 Oct 29 01:31:28.135746 aaa14-rp ERR featured: Feature 'pmon.service' failed to be enabled and started

2024 Oct 29 01:34:08.661711 aaa14-rp INFO featured: Running cmd: '['sudo', 'systemctl', 'enable', 'gnmi.service']'
2024 Oct 29 01:34:08.677242 aaa14-rp INFO systemd[1]: Reloading.
2024 Oct 29 01:34:09.316554 aaa14-rp ERR featured: ['sudo', 'systemctl', 'enable', 'gnmi.service'] - failed: return code - -13, output:#012None
2024 Oct 29 01:34:09.316791 aaa14-rp ERR featured: Feature 'gnmi.service' failed to be enabled and started
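The -13 in these logs follows Python's subprocess convention: on POSIX, a child terminated by signal N is reported with return code -N, and signal 13 is SIGPIPE on Linux. The sketch below decodes such codes and reproduces the value with a shell that sends itself SIGPIPE. describe_returncode is a hypothetical helper, not part of featured.

```python
import signal
import subprocess


def describe_returncode(returncode: int) -> str:
    """Translate a subprocess return code into readable text (hypothetical helper).

    On POSIX, Python reports a child killed by signal N as returncode -N.
    """
    if returncode < 0:
        return f"killed by {signal.Signals(-returncode).name}"
    return f"exited with status {returncode}"


if __name__ == "__main__":
    # subprocess restores SIGPIPE to its default action in the child,
    # so a shell that signals itself with SIGPIPE terminates with -13.
    proc = subprocess.run(["sh", "-c", "kill -s PIPE $$"])
    print(proc.returncode, describe_returncode(proc.returncode))
```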

The failure does not recover: pmon and the other affected services never start. On the supervisor, this also keeps swss, syncd, and other related Docker containers down.

In general, systemctl enable does not work for services such as pmon, snmp, and lldp, because their unit files set no WantedBy= directive.

The command writes the following to stderr:

"The unit files have no installation config (WantedBy=, RequiredBy=, Also=,
Alias= settings in the [Install] section, and DefaultInstance= for template
units). This means they are not meant to be enabled using systemctl.

Possible reasons for having this kind of units are:
• A unit may be statically enabled by being symlinked from another unit's
  .wants/ or .requires/ directory.
• A unit's purpose may be to act as a helper for some other unit which has
  a requirement dependency on it.
• A unit may be started when needed via activation (socket, path, timer,
  D-Bus, udev, scripted systemctl call, ...).
• In case of template units, the unit is meant to be enabled with some
  instance name specified.
"
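Units like these are reported as "static" by systemctl is-enabled. As a rough illustration of the condition systemctl describes, the following sketch checks whether a unit file's [Install] section sets any of the listed keys. has_install_config is hypothetical and is a simplified line scanner, not systemd's own parser (it ignores drop-in files and line continuations); the sample unit text is made up.

```python
# Keys that systemctl lists as "installation config" in the notice above.
INSTALL_KEYS = {"WantedBy", "RequiredBy", "Also", "Alias", "DefaultInstance"}


def has_install_config(unit_text: str) -> bool:
    """Return True if the [Install] section sets any key from INSTALL_KEYS.

    Hypothetical, simplified scanner: ignores drop-in files and line
    continuations, and treats an empty assignment (e.g. "WantedBy=") as unset.
    """
    section = None
    for raw in unit_text.splitlines():
        line = raw.strip()
        if not line or line[0] in "#;":
            continue  # skip blank lines and comments
        if line.startswith("[") and line.endswith("]"):
            section = line[1:-1]  # entering a new section
            continue
        if section == "Install" and "=" in line:
            key, value = line.split("=", 1)
            if key.strip() in INSTALL_KEYS and value.strip():
                return True
    return False


if __name__ == "__main__":
    # Made-up unit text shaped like a service with no [Install] section.
    static_unit = "[Unit]\nDescription=Example\n\n[Service]\nExecStart=/bin/true\n"
    enableable_unit = static_unit + "\n[Install]\nWantedBy=multi-user.target\n"
    print(has_install_config(static_unit), has_install_config(enableable_unit))
```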

The featured Python script invokes these commands with subprocess.check_call(), which appears not to handle stderr reliably and may cause a SIGPIPE when the output is large.

Modifying the function to use subprocess.run() resolves this issue.

run() is more reliable at handling the returned data.
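A minimal sketch of a run()-based command helper, assuming the keyword arguments that appear in this PR's unit-test output (capture_output=True, check=True, text=True). run_cmd is a hypothetical name, and the real featured code may be structured differently. One relevant property: with capture_output=True, run() collects stdout and stderr through Popen.communicate(), which keeps reading both pipes until the child exits, so even large stderr output is captured in full.

```python
import subprocess
import sys


def run_cmd(cmd: list) -> subprocess.CompletedProcess:
    """Run cmd and return the CompletedProcess (hypothetical helper).

    capture_output=True: stdout and stderr are both collected in memory.
    check=True: a non-zero return code raises CalledProcessError.
    text=True: output is decoded to str instead of bytes.
    """
    try:
        return subprocess.run(cmd, capture_output=True, check=True, text=True)
    except subprocess.CalledProcessError as err:
        # Same shape as the featured error log; err.stderr holds the captured
        # text, and a negative returncode means the child was killed by a signal.
        print(f"{cmd} - failed: return code - {err.returncode}, output:\n{err.stderr}",
              file=sys.stderr)
        raise


if __name__ == "__main__":
    # A child that writes 1 MiB to stderr still completes and is fully captured.
    res = run_cmd([sys.executable, "-c",
                   "import sys; sys.stderr.write('x' * (1 << 20)); print('done')"])
    print(repr(res.stdout), len(res.stderr))
```

On failure the CalledProcessError is re-raised, so a caller can still log err.returncode alongside the captured err.stderr rather than the None seen in the original logs.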

Validated the change with multiple reboots.

@anamehra
Contributor Author

Hi @abdosi, please review.

@anamehra anamehra marked this pull request as draft October 31, 2024 22:08
@anamehra anamehra force-pushed the anamehra/featured_1 branch 2 times, most recently from feffc38 to 7b38356 Compare October 31, 2024 23:19
@anamehra anamehra force-pushed the anamehra/featured_1 branch from a97c285 to 36cec0a Compare October 31, 2024 23:54
With run(), extra entries (accesses to the returned stdout and stderr) appear in the mocked call list, causing the call-order check to fail:
2024-11-01T00:00:02.3225936Z E               Actual: [call(['sudo', 'systemctl', 'daemon-reload'], capture_output=True, check=True, text=True),
2024-11-01T00:00:02.3226361Z E                call().stdout.__str__(),
2024-11-01T00:00:02.3226594Z E                call().stderr.__str__(),
2024-11-01T00:00:02.3227055Z E                call(['sudo', 'systemctl', 'unmask', 'dhcp_relay.service'], capture_output=True, check=True, text=True),
2024-11-01T00:00:02.3227342Z E                call().stdout.__str__(),
2024-11-01T00:00:02.3227570Z E                call().stderr.__str__(),
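The extra entries appear to come from the mock itself: mock_calls also records calls made on the mocked run()'s return value (here, converting result.stdout and result.stderr to strings), so an ordering check over mock_calls sees them interleaved with the commands. One way to keep an order-sensitive assertion, sketched below under the assumption that only the commands matter, is to compare call_args_list, which holds only direct calls to the mock. enable_feature and its unmask/enable/start sequence are hypothetical stand-ins for the code under test; this is not necessarily how the PR resolved the failure.

```python
import subprocess
from unittest import mock


def enable_feature(unit: str) -> None:
    """Hypothetical code under test: runs a systemctl sequence via subprocess.run()."""
    for action in ("unmask", "enable", "start"):
        result = subprocess.run(["sudo", "systemctl", action, unit],
                                capture_output=True, check=True, text=True)
        # Reading the result (as a logging call would) can add child entries
        # such as call().stdout.__str__() to the mock's mock_calls.
        str(result.stdout)


def test_call_order() -> None:
    with mock.patch("subprocess.run") as mocked_run:
        enable_feature("dhcp_relay.service")
    expected = [
        mock.call(["sudo", "systemctl", action, "dhcp_relay.service"],
                  capture_output=True, check=True, text=True)
        for action in ("unmask", "enable", "start")
    ]
    # call_args_list holds only direct calls to run(), in order, so the
    # child entries recorded in mock_calls do not affect this check.
    assert mocked_run.call_args_list == expected


if __name__ == "__main__":
    test_call_order()
    print("ok")
```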
@anamehra anamehra force-pushed the anamehra/featured_1 branch from 01bf6a1 to c268fb7 Compare November 1, 2024 00:14
@anamehra anamehra marked this pull request as ready for review November 1, 2024 00:29
@rlhui rlhui requested a review from judyjoseph November 6, 2024 18:22
@anamehra
Contributor Author

anamehra commented Nov 8, 2024

Hi @judyjoseph, please help review this PR. Thanks

@anamehra anamehra requested a review from judyjoseph November 13, 2024 22:53
Contributor

@judyjoseph judyjoseph left a comment

Lgtm

@qiluo-msft qiluo-msft merged commit c05d43e into sonic-net:master Nov 18, 2024
5 checks passed
@anamehra anamehra deleted the anamehra/featured_1 branch November 18, 2024 20:56
@anamehra
Contributor Author

Hi @yejianquan, please help get this one into 202405. Thanks

@yejianquan
@bingwang-ms, could you help cherry-pick this? Thanks!

mssonicbld pushed a commit to mssonicbld/sonic-host-services that referenced this pull request Nov 22, 2024
…ic-net#177)

@mssonicbld

Cherry-pick PR to 202405: #188

mssonicbld pushed a commit that referenced this pull request Nov 22, 2024
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

[202405]: featured enable/start failed for pmon, snmp, lldp services with SIGPIPE
6 participants