Coupling to Default Crate Fan Cooling Program #354
-
@swh76 Do you know how we should disable the fan program? If we decide this is the best route forward, I can add it to jackhammer.
-
@jlashner @msilvafe Be aware that 15 is the max speed for ELMA and ASIS crates, but the COMTELs top out at 100. Be careful disabling the fan policy; it should only be done to set the fans to max, otherwise the crate will no longer be able to protect hardware from overheating. On the other hand, the carriers do have some thermal protection built in: if the FPGA temperature exceeds a (very high) threshold, it will automatically shut itself down. This is the FPGA OT (over-temperature) protection, which is built into the firmware. It acts on the output of this sensor:
and the threshold can be queried like this:
Typically we have it set quite high. We may want to consider lowering it for the site systems. There is a function for setting it (https://pysmurf.readthedocs.io/en/main/client/command.html?highlight=ot#pysmurf.client.command.smurf_command.SmurfCommandMixin.set_ultrascale_ot_upper_threshold), but don't set it higher!

One thing I found recently that may be helpful: I noticed that the SLAC ASIS crate was constantly railing despite
Here's an example of one of the "Temperature" sensors that the crate controller uses to decide how to set the fan speed:
The crate acts on preprogrammed thresholds, accessible from the shelf manager like this:
e.g. here's the same sensor's thresholds in my crate:
I don't actually have a good understanding of the crate fan control logic, but it seems that if the UNCT (Upper Non-Critical Threshold) is exceeded, the controller tries to counter by increasing the fan speed. The thresholds are preprogrammed into the boards at SLAC, but can be changed. After digging into this for a while, I think the sensor above is the source of most of our fan speed woes: it's a different sensor for the FPGA temperature, but we know from testing that it's not a very accurate sensor. In particular, it tends to read ~10-15C hotter than the more accurate sensor whose data is returned by

One other thing on this: Jesus took a hard look at how to set the fans to run at a persistent fixed level. Find details here: https://confluence.slac.stanford.edu/pages/viewpage.action?spaceKey=ppareg&title=How+to+disable+the+FAN+policy. In particular, see the instructions for how to make the change persistent. But be aware that if you disable the fan policy and someone changes the fan level, the controller will not be able to counter the change. So if you make this change you'll need to make sure e.g.
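On the "don't set it higher" point for the OT threshold: a minimal sketch of a guard that only ever lowers it. This assumes the pysmurf client exposes a getter named `get_ultrascale_ot_upper_threshold` alongside the linked setter, and that both take or return a single value in the same units; check both against the pysmurf docs before using. `FakeClient` and `lower_ot_threshold` are hypothetical names and exist only so the example runs without hardware.

```python
class FakeClient:
    """Hypothetical stand-in for a pysmurf client; NOT the real API."""

    def __init__(self, threshold):
        self._threshold = threshold

    def get_ultrascale_ot_upper_threshold(self):
        # Assumed getter name; verify it exists in your pysmurf version.
        return self._threshold

    def set_ultrascale_ot_upper_threshold(self, val):
        # Setter name taken from the link above; signature is assumed.
        self._threshold = val


def lower_ot_threshold(client, new_val):
    """Set the OT upper threshold, refusing any value above the current one."""
    current = client.get_ultrascale_ot_upper_threshold()
    if new_val > current:
        raise ValueError(
            f"refusing to raise OT threshold from {current} to {new_val}"
        )
    client.set_ultrascale_ot_upper_threshold(new_val)
    return current, new_val


# Illustrative numbers only; these are not real threshold values.
demo = FakeClient(threshold=100)
lower_ot_threshold(demo, 90)  # allowed: 90 is not above 100
```

With a real system, pass the actual pysmurf control object in place of `FakeClient`.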
-
Uploading the SLAC internal Confluence page I referenced, just in case folks don't have access.
-
Just wanted to post a few things that Shawn brought up on Slack that I think belong here. Shawn wrote this script, which regularly checks the temperature sensors and compares them to the fan thresholds, to help determine what is tripping the threshold: https://github.com/slaclab/pysmurf/blob/main/scratch/shawn/check_crate_thermal_sensors.py

Also, Max is thinking about upgrading the crate-monitor from parsing the

Jesus used this when writing the ATCA-monitor rogue process, so implementation details can be found here: https://github.com/slaclab/smurf-atca-monitor/blob/main/python/atcaipmi/monitor.py
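For anyone skimming, the comparison such a check performs can be sketched roughly as below. This is not Shawn's script, just the core idea: take sensor readings and their UNCT (Upper Non-Critical Threshold) values and report which sensors are over, worst first. The function name, sensor names, and numbers are all made up for illustration; real readings would come from the shelf manager.

```python
def sensors_over_unct(readings, unct_thresholds):
    """Return (name, reading, threshold, excess) for sensors above UNCT.

    readings and unct_thresholds are dicts keyed by sensor name.
    Sensors without a known threshold are skipped.
    """
    over = []
    for name, temp in readings.items():
        unct = unct_thresholds.get(name)
        if unct is None:
            continue
        if temp > unct:
            over.append((name, temp, unct, temp - unct))
    # Largest excess first, since that sensor is the likeliest culprit.
    return sorted(over, key=lambda row: row[3], reverse=True)


# Hypothetical sensor names and values, in degrees C.
readings = {"fpga_temp": 80.0, "inlet_temp": 30.0, "board_temp": 62.0}
thresholds = {"fpga_temp": 75.0, "inlet_temp": 45.0, "board_temp": 60.0}

for name, temp, unct, excess in sensors_over_unct(readings, thresholds):
    print(f"{name}: {temp:.1f} C exceeds UNCT {unct:.1f} C by {excess:.1f} C")
```

Logging this periodically alongside the fan level should show which sensor crosses its threshold right before the fans ramp.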
-
Logging here so this doesn't get lost in SLAC. We (UCSD folks) found this in P10R2, and Yuhan + Penn folks found this during the initial LATR highbay cooldown: the fans seem to want to cycle with a ~32 min period (this was definitely the cadence in SAT1), and there is a measurable coupling in the superconducting datasets (likely through the temperature coefficient of the TES bias DAC voltage reference) and in the phase monitoring channels (likely through time delays in the smurf having some temperature dependence, although that is less understood at this point). Regardless, the best solution for operations is to disable the crate fan program and leave the fans running at maximum. We may need to consider sourcing more fan tray spares, as this will likely shorten the fan tray lifetime, but that is a worthwhile investment to remove this source of 1/f.
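For anyone looking for this in a power spectrum, a quick sketch of where a ~32 min cycle should land in frequency. The only input is the ~32 min cadence quoted above; the function name is just for illustration.

```python
def cycle_frequency_hz(period_minutes):
    """Convert a cycle period in minutes to a frequency in Hz."""
    return 1.0 / (period_minutes * 60.0)


fan_cycle_hz = cycle_frequency_hz(32.0)  # 1 / 1920 s
print(f"fan cycle: {fan_cycle_hz * 1e3:.2f} mHz")  # about 0.52 mHz
```

A dataset needs to span at least a few of these 32 min periods for the line to be resolved; in shorter datasets it will look like low-frequency drift.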
Some plots included below for reference in case others rediscover this in the future.
From SAT1:
From LATR: