Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[thermalctld] Add check for psu presence to thermalctld when updating PSU thermals #387

Closed

Conversation

cytsao1
Copy link

@cytsao1 cytsao1 commented Jul 13, 2023

Description

issue #384

For any nonpresent PSU, thermalctld will add its sensors to the redis database, with all values N/A.
thermalctld should check for PSU presence before adding their sensors to the db.

Motivation and Context

The empty non-present sensor entries are not necessary to be tracked, and causes test case failures as values are expected to be retrieved from the sensors.

How Has This Been Tested?

With the update to thermalctld, ran relevant CLI and checked the db.

show plat inv

Power Supplies
    psu0                8800-HV-TRAY    1.0             FOC2417N4X0     Cisco 8800 Power Tray for AC/HVAC/HVDC Power Supply
        psu0.0 -- not present
        psu0.1 -- not present
        psu0.2 -- not present
    psu1                8800-HV-TRAY    1.0             FOC2417N4VJ     Cisco 8800 Power Tray for AC/HVAC/HVDC Power Supply
        psu1.0 -- not present
        psu1.1 -- not present
        psu1.2 -- not present
    psu2                8800-HV-TRAY    1.0             FOC2417N4U6     Cisco 8800 Power Tray for AC/HVAC/HVDC Power Supply
        psu2.0          PSU6.3KW-20A-HV 1.0             DTM243501T9     6.3KW AC/HVAC/HVDC Power Supply-20A
        psu2.1          PSU6.3KW-20A-HV 1.0             DTM243501UV     6.3KW AC/HVAC/HVDC Power Supply-20A
        psu2.2          PSU6.3KW-20A-HV 1.0             DTM2435022V     6.3KW AC/HVAC/HVDC Power Supply-20A
    psu3 -- not present
    psu4 -- not present
    psu5 -- not present

show plat temp (no empty entries)

            PSU2.0 Outlet_Temp         48.992         97        -5             102            -10      False  20230228 13:58:20
            PSU2.1 Outlet_Temp         48.992         97        -5             102            -10      False  20230228 13:58:21
            PSU2.2 Outlet_Temp         48.992         97        -5             102            -10      False  20230228 13:58:21

db entries (no non present PSUs)

root@sonic:/home/cisco# redis-cli -n 6
127.0.0.1:6379[6]> keys TEMPERATURE_INFO|PSU*
1) "TEMPERATURE_INFO|PSU2.1 Outlet_Temp"
2) "TEMPERATURE_INFO|PSU2.2 Outlet_Temp"
3) "TEMPERATURE_INFO|PSU2.0 Outlet_Temp"
127.0.0.1:6379[6]>

Additional Information (Optional)

@prgeor prgeor requested a review from Junchao-Mellanox July 14, 2023 19:28
@@ -592,6 +592,8 @@ class TemperatureUpdater(logger.Logger):
self._refresh_temperature_status(CHASSIS_INFO_KEY, thermal, index)

for psu_index, psu in enumerate(self.chassis.get_all_psus()):
if not psu.get_presence():
Copy link
Collaborator

@Junchao-Mellanox Junchao-Mellanox Jul 17, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure that we should check presence here. "N/A" is good enough to tell user that the PSU thermal is not available at the moment, no matter it is because of PSU absence or hardware error. Maybe they should fix their test expectation. But if you insist the change, please consider following items:

  1. Do we need check psu.get_power_good_status here? If PSU is present but not powered on, thermal sensor should not be available.
  2. Do we need remove previous DB entry here?
  3. For show platform fan and show platform psu, non presence PSU will be displayed, should we align it with this design?
  4. For existing sonic-mgmt test, can we guarantee no test is affected?
  5. For other vendor, who may have private test based on existing design might run into issue
  6. Unit test is needed to cover the new change.

Copy link
Collaborator

@prgeor prgeor Jul 21, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cytsao1 @Junchao-Mellanox how about get_chassis.all_psus() return only those PSUs which are present? In that case, "N/A" would indicate some issue with physical present PSU?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only returning present PSUs from get_all_psus would cause issues with inconsistent indexing throughout the code; when enumerate is used, numbering will always be sequential from 0, and inserting a new PSU may cause numbering of existing PSUs to change.

@Junchao-Mellanox
Copy link
Collaborator

Hi @cytsao1 and @prgeor , I personally prefer to keeping existing code and requesting to change their test logic. Let me know if you have concern.

@gechiang
Copy link
Contributor

Hi @cytsao1 and @prgeor , I personally prefer to keeping existing code and requesting to change their test logic. Let me know if you have concern.

@Junchao-Mellanox , I still don't get why you need to keep sensors info for PSU that are not present in redis-db? what is the purpose of it? It has absolutely no useful info so why are we keeping this? I think not including it is the correct way of handling this.

@Junchao-Mellanox
Copy link
Collaborator

Junchao-Mellanox commented Oct 11, 2023

Hi @cytsao1 and @prgeor , I personally prefer to keeping existing code and requesting to change their test logic. Let me know if you have concern.

@Junchao-Mellanox , I still don't get why you need to keep sensors info for PSU that are not present in redis-db? what is the purpose of it? It has absolutely no useful info so why are we keeping this? I think not including it is the correct way of handling this.

It tells user that there are 2 PSU sensors, and 1 is not available, I don't see an issue here. To my understanding, this change will cause sonic-mgmt test_show_platform_temperature_mocked case failure. So, why should we choose a solution that would affect other vendors with no obviously benefit? How about change your own test case? How about change your platform API psu.get_all_thermals to return empty list while the PSU is not present?

@gechiang
Copy link
Contributor

Hi @cytsao1 and @prgeor , I personally prefer to keeping existing code and requesting to change their test logic. Let me know if you have concern.

@Junchao-Mellanox , I still don't get why you need to keep sensors info for PSU that are not present in redis-db? what is the purpose of it? It has absolutely no useful info so why are we keeping this? I think not including it is the correct way of handling this.

It tells user that there are 2 PSU sensors, and 1 is not available, I don't see an issue here. To my understanding, this change will cause sonic-mgmt test_show_platform_temperature_mocked case failure. So, why should we choose a solution that would affect other vendors with no obviously benefit? How about change your own test case? How about change your platform API psu.get_all_thermals to return empty list while the PSU is not present?

Is N/A in redis-db indicate Not Available Or it is bad sensor data? how can one tell? Also, for your example, there is only 1 PSU sensor. there is really no "2 PSU sensors" since one PSU is not even present?? How do you differentiate the case where the PSU is really present but not providing proper sensor data which resulted to N/A?

@Junchao-Mellanox
Copy link
Collaborator

Hi @cytsao1 and @prgeor , I personally prefer to keeping existing code and requesting to change their test logic. Let me know if you have concern.

@Junchao-Mellanox , I still don't get why you need to keep sensors info for PSU that are not present in redis-db? what is the purpose of it? It has absolutely no useful info so why are we keeping this? I think not including it is the correct way of handling this.

It tells user that there are 2 PSU sensors, and 1 is not available, I don't see an issue here. To my understanding, this change will cause sonic-mgmt test_show_platform_temperature_mocked case failure. So, why should we choose a solution that would affect other vendors with no obviously benefit? How about change your own test case? How about change your platform API psu.get_all_thermals to return empty list while the PSU is not present?

Is N/A in redis-db indicate Not Available Or it is bad sensor data? how can one tell? Also, for your example, there is only 1 PSU sensor. there is really no "2 PSU sensors" since one PSU is not even present?? How do you differentiate the case where the PSU is really present but not providing proper sensor data which resulted to N/A?

We don't have mechanism to check bad sensor data. So, with or without this change, we cannot tell it. For PSU presence, you can check show platform psustatus. Like I said, it will cause sonic-mgmt test case failure. The change should not break exist test case, agree? Why don't you change the platform API psu.get_all_thermals?

@spilkey-cisco
Copy link
Contributor

There will be a new PR raised to hopefully address this more completely. One issue I haven't seen mentioned here is that without some sort of presence check, sensors from PSUs that were previously present will not be removed from redis when the PSU is removed, and stale data from those sensors will remain in redis (this also affects snmp).

@gechiang
Copy link
Contributor

@spilkey-cisco ,
I am not sure what is the final decision and what we are still waiting for on this PR? Based on your last comment reply, are you making additional changes? raising a new PR and abandoning this PR? BTW, you need to remerge as this PR is out-of-sync with base branch already...

@spilkey-cisco
Copy link
Contributor

@spilkey-cisco , I am not sure what is the final decision and what we are still waiting for on this PR? Based on your last comment reply, are you making additional changes? raising a new PR and abandoning this PR? BTW, you need to remerge as this PR is out-of-sync with base branch already...

New PR will be created, this PR will not be merged and can be closed.

@prgeor
Copy link
Collaborator

prgeor commented Dec 1, 2023

closing @spilkey-cisco @cytsao1

@prgeor prgeor closed this Dec 1, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants