Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use vendor customizable fan speed threshold checks #378

Merged
merged 7 commits into from
Jul 14, 2023

Conversation

spilkey-cisco
Copy link
Contributor

@spilkey-cisco spilkey-cisco commented Jun 13, 2023

MSFT ADO: 24591579

Description

Use vendor customizable fan speed threshold checks in thermalctld instead of hard-coded calculations based on pwm/percentage speed.

Motivation and Context

Fan under/over speed checks should be vendor customizable, since a tolerance based off the pwm/percentage fan speed can easily give false failures, especially for low fan speeds.

How Has This Been Tested?

root@sonic:/home/cisco# echo 10000 > /opt/cisco/etc/fantray0.fan0.rpm
root@sonic:/home/cisco# grep thermalctld /var/log/syslog
<snip>
May 19 05:09:47.763970 sonic WARNING pmon#thermalctld: Fan high speed warning: fantray0.fan0 current speed=91, target speed=20
May 19 05:09:51.129298 sonic INFO pmon#supervisord 2023-05-19 05:09:51,128 INFO success: thermalctld entered RUNNING state, process has stayed up for > than 10 seconds (startsecs)
May 19 05:10:01.347935 sonic INFO pmon#supervisord: thermalctld WARNING:cisco.pacific.thermal.thermal_zone:level minor: fantray0.fan0: pwm 20; motor out of tolerance @ rpm 10000; maximum rpm 2950
root@sonic:/home/cisco# echo 2400 > /opt/cisco/etc/fantray0.fan0.rpm
root@sonic:/home/cisco# grep thermalctld /var/log/syslog
<snip>
May 19 05:09:47.763970 sonic WARNING pmon#thermalctld: Fan high speed warning: fantray0.fan0 current speed=91, target speed=20
May 19 05:09:51.129298 sonic INFO pmon#supervisord 2023-05-19 05:09:51,128 INFO success: thermalctld entered RUNNING state, process has stayed up for > than 10 seconds (startsecs)
May 19 05:10:01.347935 sonic INFO pmon#supervisord: thermalctld WARNING:cisco.pacific.thermal.thermal_zone:level minor: fantray0.fan0: pwm 20; motor out of tolerance @ rpm 10000; maximum rpm 2950
May 19 05:10:47.023156 sonic NOTICE pmon#thermalctld: Fan high speed warning cleared: fantray0.fan0 speed is back to normal

@gechiang
Copy link
Contributor

@spilkey-cisco , can you address the test failures?

@spilkey-cisco
Copy link
Contributor Author

spilkey-cisco commented Jun 14, 2023

@spilkey-cisco , can you address the test failures?

Are you referring to sonic-mgmt? PR is here: sonic-net/sonic-mgmt#8587
Edit: I see what you're referring to, I'll check

@patrickmacarthur
Copy link
Contributor

@patrickmacarthur to review

@prgeor prgeor requested a review from Junchao-Mellanox June 27, 2023 20:55
@prgeor
Copy link
Collaborator

prgeor commented Jun 27, 2023

@Junchao-Mellanox could you review.

@mlok-nokia
Copy link
Contributor

It looks ok to me

@spilkey-cisco
Copy link
Contributor Author

@spilkey-cisco , can you address the test failures?

Are you referring to sonic-mgmt? PR is here: sonic-net/sonic-mgmt#8587 Edit: I see what you're referring to, I'll check

Failed tests depend on APIs provided by sonic-net/sonic-platform-common#382

@gechiang
Copy link
Contributor

@spilkey-cisco can you rebase your branch and let's see if the tests can pass afterwards...
Thanks!

@spilkey-cisco
Copy link
Contributor Author

@spilkey-cisco can you rebase your branch and let's see if the tests can pass afterwards... Thanks!

Still fails for the same reasons, eg. Failed to update fan status - AttributeError("'MockFan' object has no attribute 'is_under_speed'"). I'm not sure how sonic-platforms-common gets updated for these tests, it needs to include sonic-net/sonic-platform-common#382. How can this be checked/updated?

@gechiang
Copy link
Contributor

@spilkey-cisco can you rebase your branch and let's see if the tests can pass afterwards... Thanks!

Still fails for the same reasons, eg. Failed to update fan status - AttributeError("'MockFan' object has no attribute 'is_under_speed'"). I'm not sure how sonic-platforms-common gets updated for these tests, it needs to include sonic-net/sonic-platform-common#382. How can this be checked/updated?

The mentioned fix was merged yesterday. it probably will take another few days before it gets propagated and pick up in the test. I will try to run the test again tomorrow and if still not passing will consult build team on how to overcome the issue.

@prgeor
Copy link
Collaborator

prgeor commented Jul 14, 2023

@spilkey-cisco please check the build failure

@spilkey-cisco
Copy link
Contributor Author

@spilkey-cisco please check the build failure

Updated the branch, so the tests ran again. Looks like the sonic-platforms-common changes have been propagated now, so those tests passed.

@gechiang gechiang merged commit 94242c2 into sonic-net:master Jul 14, 2023
@gechiang
Copy link
Contributor

@gechiang could we have an ADO to track this?

MSFT ADO: 24591579

@gechiang
Copy link
Contributor

@StormLiangMS , @yxieca , MSFT ADO 24591579 created . Please help cherry pick to 202205, 202305, and 202211.
Thanks!

@yxieca
Copy link

yxieca commented Jul 30, 2023

@gechiang, @spilkey-cisco this is a feature change. It should say in master branch until next feature branch is created.

@gechiang
Copy link
Contributor

gechiang commented Aug 1, 2023

@yxieca This is not really a feature change. it is rather a bug fix where the original design is insufficient to accommodate the vendors where the old way of specifying the tolerance can cause quite a bit of inaccuracy and this PR fixes this issue. We originally found this issue in 202205 branch and asked the vendor to fix this.

@StormLiangMS
Copy link

@spilkey-cisco could you test this PR with 202305?

@gechiang for vis.

@abdosi
Copy link
Contributor

abdosi commented Oct 30, 2023

@bmridul what is latest on this ? Which all PR's are needed here ?

@spilkey-cisco
Copy link
Contributor Author

spilkey-cisco commented Oct 30, 2023

In master, 4 PRs are needed:

  1. This PR
  2. Default implementation of under/over speed checks sonic-platform-common#382 (merged)
  3. Update fan related tests for under/over speed check APIs sonic-mgmt#8587 (complete, blocked by @prgeor, urgently need to be unblocked as it is approved)
  4. Fix system-health hardware_checker to consume fan tolerance details sonic-buildimage#16689 (blocked in same way as 3)

202305:
Number 2 has been cherry-picked into 202305, 1, 3, and 4 will need to be cherry-picked into 202305.

202205:
Original communication was that this would not be cherry-picked into 202205, so cisco vendor code was not updated in 202205. However, 1 and 2 ended up being cherry-picked recently which caused extra cisco test failures, and ended up being reverted. If this feature is wanted in 202205, please let us know before cherry-picking again so that vendor code can be updated.

@spilkey-cisco
Copy link
Contributor Author

From above analysis, please cherry-pick this into 202305. Not sure why "Request for 202305 branch" label was removed.

@gechiang
Copy link
Contributor

@spilkey-cisco This fix was originated from test failure on 202205 so the expectation was to have the "full fix" also available for 202205. Are you saying that Cisco platform code on 202205 does not have the implementation to support this yet? This is a surprise for me. Given that this fix was causing regression, it got backed out from 202205. Only if you think all regression issues are fully addressed for 202205, we can then try to pick this up again. Please add dependency PRs for the 202205 branch here so we clearly know what order the PRs should be merged in order to not cause any regressions. Also, we must not cause regressions on other vendor platform test as well. So, if there is any dependency on other Vendors to also need to make changes in their platform code for 202205 then we should not request this be backported to 202205. Can you clarify on this so we can make a decision for 202205 accordingly?
Thanks!

@spilkey-cisco
Copy link
Contributor Author

@gechiang Correct, Cisco platform code is needed before these upstream changes are merged (platform code was never added to 202205 because last communication was that these upstream changes would not be cherry-picked into 202205). This code is backwards compatible with existing get_speed_tolerance API implementations, so other vendors should not need to make any changes. I see no issue with cherry-picking this into 202205 again after Cisco platform support is added. I can make those platform changes ASAP.

@anamehra
Copy link
Contributor

@spilkey-cisco , @gechiang , all dependent PRs will also be required in 202305 (and 202311)

@shivuv
Copy link

shivuv commented Dec 21, 2023

@gechiang @abdosi : Can you please help to cherry pick this into 202305 branch?

@abdosi
Copy link
Contributor

abdosi commented Dec 29, 2023

We will not take this bug fix in 202205.
Added Label for 202305.
@StormLiangMS please take this PR and sonic-net/sonic-buildimage#16689 into 202305. MSFT ADO is their and has been tested by @spilkey-cisco on 202305 so we should be good.

202311: Already have this change.

@mssonicbld
Copy link
Collaborator

Cherry-pick PR to 202305: #418

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.