Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Default implementation of under/over speed checks #382

Merged
merged 5 commits into from
Jul 11, 2023

Conversation

spilkey-cisco
Copy link
Contributor

Description

Provide default implementation of fan under and over speed threshold checks, providing backwards compatibility for vendors that only implement get_speed_tolerance

Motivation and Context

Fan under/over speed checks should be vendor customizable, since a tolerance based off the pwm/percentage fan speed can easily give false failures, especially for low fan speeds.

How Has This Been Tested?

root@sonic:/home/cisco# echo 10000 > /opt/cisco/etc/fantray0.fan0.rpm
root@sonic:/home/cisco# grep thermalctld /var/log/syslog
<snip>
May 19 05:09:47.763970 sonic WARNING pmon#thermalctld: Fan high speed warning: fantray0.fan0 current speed=91, target speed=20
May 19 05:09:51.129298 sonic INFO pmon#supervisord 2023-05-19 05:09:51,128 INFO success: thermalctld entered RUNNING state, process has stayed up for > than 10 seconds (startsecs)
May 19 05:10:01.347935 sonic INFO pmon#supervisord: thermalctld WARNING:cisco.pacific.thermal.thermal_zone:level minor: fantray0.fan0: pwm 20; motor out of tolerance @ rpm 10000; maximum rpm 2950
root@sonic:/home/cisco# echo 2400 > /opt/cisco/etc/fantray0.fan0.rpm
root@sonic:/home/cisco# grep thermalctld /var/log/syslog
<snip>
May 19 05:09:47.763970 sonic WARNING pmon#thermalctld: Fan high speed warning: fantray0.fan0 current speed=91, target speed=20
May 19 05:09:51.129298 sonic INFO pmon#supervisord 2023-05-19 05:09:51,128 INFO success: thermalctld entered RUNNING state, process has stayed up for > than 10 seconds (startsecs)
May 19 05:10:01.347935 sonic INFO pmon#supervisord: thermalctld WARNING:cisco.pacific.thermal.thermal_zone:level minor: fantray0.fan0: pwm 20; motor out of tolerance @ rpm 10000; maximum rpm 2950
May 19 05:10:47.023156 sonic NOTICE pmon#thermalctld: Fan high speed warning cleared: fantray0.fan0 speed is back to normal

@gechiang
Copy link

@spilkey-cisco , can you please add the needed test file to cover your new added functions? This is needed in order to pass the coverage requirement.

@spilkey-cisco
Copy link
Contributor Author

@spilkey-cisco , can you please add the needed test file to cover your new added functions? This is needed in order to pass the coverage requirement.

Are you referring to sonic-mgmt? PR is here: sonic-net/sonic-mgmt#8587

@gechiang
Copy link

@spilkey-cisco , can you please add the needed test file to cover your new added functions? This is needed in order to pass the coverage requirement.

Are you referring to sonic-mgmt? PR is here: sonic-net/sonic-mgmt#8587

NO... I am referring to the pull request coverage verification for this PR which you can see is failing...

Pull Request Coverage
Total: 20 lines
Missing: 18 lines
Coverage: 10%
Threshold: 80%

It appears that you have not added any new tests for the PR code coverage of the new code you added...
Please refer to other recently merged PRs under this repo for reference on what code coverage tests need to be added...

@prgeor
Copy link
Collaborator

prgeor commented Jun 14, 2023

@spilkey-cisco please provide more context.

@spilkey-cisco
Copy link
Contributor Author

@gechiang sure, I'll update the PR ASAP. I didn't see an existing file for fans so I was unsure of the requirements. I'll create a file to cover the new fan APIs.

Comment on lines +69 to +103
def is_under_speed(self):
"""
Calculates if the fan speed is under the tolerated low speed threshold

Default calculation requires get_speed_tolerance to be implemented, and checks
if the current fan speed (expressed as a percentage) is lower than <get_speed_tolerance>
percent below the target fan speed (expressed as a percentage)

Returns:
A boolean, True if fan speed is under the low threshold, False if not
"""
speed = self.get_speed()
target_speed = self.get_target_speed()
tolerance = self.get_speed_tolerance()

for param, value in (('speed', speed), ('target speed', target_speed), ('speed tolerance', tolerance)):
if not isinstance(value, int):
raise TypeError(f'Fan {param} is not an integer value: {param}={value}')
if value < 0 or value > 100:
raise ValueError(f'Fan {param} is not a valid percentage value: {param}={value}')

return speed * 100 < target_speed * (100 - tolerance)

def is_over_speed(self):
"""
Calculates if the fan speed is over the tolerated high speed threshold

Default calculation requires get_speed_tolerance to be implemented, and checks
if the current fan speed (expressed as a percentage) is higher than <get_speed_tolerance>
percent above the target fan speed (expressed as a percentage)

Returns:
A boolean, True if fan speed is over the high threshold, False if not
"""
speed = self.get_speed()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@spilkey-cisco This is an abstract class. Where is the HLD changes?

Copy link
Contributor Author

@spilkey-cisco spilkey-cisco Jun 14, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My understanding is that https://github.com/sonic-net/sonic-platform-common/blob/master/sonic_fan/fan_base.py is the (obsolete, from what I can tell) fan abstract base class, and the file in this changeset follows the design from https://github.com/sonic-net/SONiC/blob/master/doc/platform_api/new_platform_api.md indicating ABCs are no longer used, 'No longer abstract base classes: All abstract methods simply raise "NotImplementedError"'. The new methods are not abstract since they have a default implementation, but can be overridden with vendor specific implementations if desired. If an HLD is needed, please advise on where it should go.

Please correct me if this understanding is inaccurate or incomplete.

@spilkey-cisco
Copy link
Contributor Author

@spilkey-cisco please provide more context.

The tolerance calculations that thermalctld performed (the defaults in the base file now) were not sufficient for Cisco's fan tolerance requirements. Since the default tolerance checks were required to calculate using fan speed as percentage values (1-100), low fan speeds had a large margin of error, and frequently falsely indicated the fan speed was outside the allowable tolerance. Allowing vendors to specify their own tolerance calculations if desired resolves this issue (such as performing calculations on larger rpm values instead of percentages, as an example), and the default implementation here allows existing vendor code to continue to work with no modifications.

@Junchao-Mellanox
Copy link
Contributor

I guess there is a PR for sonic-platform-daemons to use these new APIs, where can I find it?

@spilkey-cisco
Copy link
Contributor Author

I guess there is a PR for sonic-platform-daemons to use these new APIs, where can I find it?

sonic-net/sonic-platform-daemons#378

@arlakshm arlakshm requested a review from mlok-nokia June 14, 2023 17:32
@mlok-nokia
Copy link
Contributor

It looks ok to me

@spilkey-cisco
Copy link
Contributor Author

@gechiang sure, I'll update the PR ASAP. I didn't see an existing file for fans so I was unsure of the requirements. I'll create a file to cover the new fan APIs.

Added fan_base_test.py, all checks now pass

@gechiang
Copy link

MSFT ADO: 24645343

@gechiang
Copy link

@StormLiangMS , @yxieca , MSFT ADO: 24645343 created. Please approve to cherry pick this to 202205, 202211, and 202305.
Thanks!

@yxieca
Copy link
Contributor

yxieca commented Jul 30, 2023

@gechiang, @spilkey-cisco , this change is a feature change and impacts all platforms. I don't think we should cherry-pick this change to feature branches.

@gechiang
Copy link

gechiang commented Aug 1, 2023

@yxieca , This is not really a feature change. it is rather a bug fix where the original design is insufficient to accommodate the vendors where the old way of specifying the tolerance can cause quite a bit of inaccuracy and this PR fixes this issue. We originally found this issue in 202205 branch and asked the vendor to fix this.

yxieca pushed a commit that referenced this pull request Sep 6, 2023
* Default implementation of under/over speed checks

* Update comment, remove requirement for float conversion

* Fan test coverage
@gechiang
Copy link

gechiang commented Sep 6, 2023

@yxieca , @StormLiangMS , Appreciate if this PR can also be approved for 202211 and 202305 so we can mark the ADO as done.
Thanks!

StormLiangMS pushed a commit that referenced this pull request Sep 21, 2023
* Default implementation of under/over speed checks

* Update comment, remove requirement for float conversion

* Fan test coverage
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

7 participants