Add EDAC hardware check (ECC memory errors) #23

Open: wants to merge 59 commits into base: master

@kcgthb kcgthb commented Sep 7, 2016

This patch adds a check_hw_edac check that looks for correctable and uncorrectable ECC memory errors, as reported by edac-utils.

EDAC is an alternative to MCE checks that supports older hardware (cf. http://www.mcelog.org/faq.html#13), so it can be used on platforms where mcelog is not available.

# edac-util
mc1: csrow1: CPU_SrcID#1_Channel#1_DIMM#0: 31 Corrected Errors

# nhc -e check_hw_edac MARK_OFFLINE=0
check_hw_edac:  ECC errors detected - 31 corrected memory errors detected (limit of 9).
ERROR:  nhc:  Health check failed:  check_hw_edac:  ECC errors detected - 31 corrected memory errors detected (limit of 9).
ERROR:  nhc:  Health check failed:  check_hw_edac:  ECC errors detected - 31 corrected memory errors detected (limit of 9).

It is closely modelled on the check_hw_mcelog function, with similar threshold definitions.
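
For readers who want a feel for the parsing and threshold comparison involved, here is a minimal sketch. It is not the code from this patch: the helper names `edac_sum_corrected` and `edac_over_limit` and the use of `awk` are assumptions made for illustration, and the input is the sample line from the report above rather than a live `edac-util` call. The limit of 9 is taken from the sample output.

```bash
# Sketch only: hypothetical helpers, not the check_hw_edac implementation.

# Sum the "N Corrected Errors" counts from edac-util output read on stdin.
# The count is the third field from the end of each matching line.
edac_sum_corrected() {
    awk '/ Corrected Errors$/ { total += $(NF - 2) } END { print total + 0 }'
}

# Succeed (return 0) when the total ($1) exceeds the limit ($2).
edac_over_limit() {
    [ "$1" -gt "$2" ]
}

# Example using the line from the report above instead of calling edac-util.
sample='mc1: csrow1: CPU_SrcID#1_Channel#1_DIMM#0: 31 Corrected Errors'
count=$(printf '%s\n' "$sample" | edac_sum_corrected)
if edac_over_limit "$count" 9; then
    echo "ECC errors detected - $count corrected memory errors detected (limit of 9)."
fi
```

Reading from stdin keeps the parsing testable on machines without EDAC support; on a real node the input would presumably come from `edac-util` itself.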

Michael Jennings and others added 30 commits November 15, 2015 13:36
…es by re-execing NHC after each loop.

Some problems have been reported when NHC is used as a Grid Engine
load sensor, including memory leaks, due to the long-running NHC
process.  All function variables reported as being leaked are local to
their respective functions, but for unknown reasons (BASH bug?),
they're being leaked anyway.

In an attempt to work around these issues, this commit re-execs NHC
after the completion of each iteration.  There are some side effects
to this, including resetting of non-exported environment variables, so
this is experimental for now.
new check: check_nvsmi_healthmon which uses nvidia-smi
… does or does not match a specific node range
…after the last check, such as if pbsnodes hangs and causes nhc to timeout
Added more debug information for if a check matches the node range in nhc.conf
…n works before running tests that need it.
…keeping debugging output added by NREL. Avoids executing mcheck() 3 times on same data.
fixed misleading timeout message if nhc times out after all checks were completed
David Whiteside (GitHub user starboarder2001) noticed (see his PR mej#9
for details) that NHC would sometimes report that it had timed out
while executing a particular check when, in fact, it had already
completed that check and was trying to finish its work and/or clean up
(e.g., marking a node online).

This tweaks his fix to be even more specific about what exactly NHC is
doing as it goes through the final stages of execution.
…o add file/line/function info to PS4 for xtrace output.
This fixes mej#15 by using squeue instead of stat to obtain the list of authorized users
…m bbbbbrie to use built-in NHC variable HOSTNAME_S instead of shelling out to the hostname command.
* sge-fixes:
  workaround:  Try avoiding SGE-related memory leaks and other BASH issues by re-execing NHC after each loop.
  Allow SIGUSR1 and SIGUSR2 to toggle bash tracing and debug mode, respectively.
* master:
  Makefile.am:  Minor/cosmetic fix for removing /var/{lib,run}/nhc in uninstall-local rule.
  Allow SLURM nodes in reservation to be marked offline
  Fixed bad links
* master:
  nhc:  Update help output to show correct default value for $TIMEOUT.  Fixes mej#22 from @eshelman.
mej and others added 6 commits December 28, 2018 14:05
One of my last major changes before leaving LBNL was to globally
disable pathname expansion (i.e., globbing) by default throughout
NHC.  Apparently when I did that, I missed a spot:  the unit test
driver script!  So only the unit tests specific to `nhc` itself have
been running since that happened (commit 8fa6657).

While this problem has existed for over a year-and-a-half in wallclock
time, commit-wise it hasn't been very long.  (About 10-ish, not
counting merging work from others.)  That's my story, and I'm sticking
to it.
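
A sketch of the general failure mode, assuming a driver that locates test files with a glob pattern (the real driver script may work differently): with pathname expansion disabled via `set -f`, the pattern is left as a literal string and the loop finds nothing. The function name `count_test_files` and the `.nhc` file names are hypothetical.

```bash
# Sketch only: shows how `set -f` (noglob) silently changes a file loop.
# The function name and directory layout are hypothetical.
count_test_files() {
    dir=$1
    n=0
    for f in "$dir"/*.nhc; do
        # With globbing off, the pattern stays literal and names no real file.
        if [ -e "$f" ]; then
            n=$((n + 1))
        fi
    done
    echo "$n"
}

tmp=$(mktemp -d)
touch "$tmp/a.nhc" "$tmp/b.nhc"
set -f                      # pathname expansion disabled
count_test_files "$tmp"     # finds no files
set +f                      # pathname expansion re-enabled
count_test_files "$tmp"     # finds both files
rm -rf "$tmp"
```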
Fix the process substitution sanity check so that we're not skipping
46 unit tests for no reason whatsoever.
While taking another look at @SMark-Black's mej#50 and mej#53, I realized
that the code in question regarding `$MAX_SYS_UID` is doing exactly
what it is supposed to do, given the intended meaning of the variable
based on its name.  What was *actually* wrong was that the
`nhc_common_get_max_sys_uid()` function has been reading the wrong
variable!

So `nhc_common_get_max_sys_uid()` will now look for `$SYS_UID_MAX` in
`/etc/login.defs` like it should have been doing all along, and using
the value of `UID_MIN - 1` as a fallback if necessary.

NOTE:  This means that the default auto-detected value of
`$MAX_SYS_UID` will likely be something ending in `99` (like `499` or
`999`) rather than `00` because it was always intended to be (and the
code has always treated it as) the *top* of the exempt UID range, NOT
the bottom of the non-exempt UID range!  If you have any configs or
scripts that rely on different assumptions, please make sure to make
any necessary updates.

Closes mej#50.
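
The lookup order described above can be sketched roughly as follows. This is a hypothetical stand-in, not the body of `nhc_common_get_max_sys_uid()`: the function name `get_max_sys_uid`, the optional file argument, and the use of `awk` are assumptions for illustration.

```bash
# Sketch only: prefer SYS_UID_MAX, fall back to UID_MIN - 1.
# Hypothetical helper; takes an optional path so it can be tested on a copy.
get_max_sys_uid() {
    defs=${1:-/etc/login.defs}
    awk '
        # Commented-out settings do not match because $1 would start with "#".
        $1 == "SYS_UID_MAX" { sys_max = $2 }
        $1 == "UID_MIN"     { uid_min = $2 }
        END {
            if (sys_max != "")      print sys_max       # preferred value
            else if (uid_min != "") print uid_min - 1   # fallback
            else                    exit 1              # nothing usable found
        }
    ' "$defs"
}
```

With `UID_MIN 1000` and no `SYS_UID_MAX` line, this prints `999`, which matches the "ending in `99`" behaviour mentioned in the note.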
Based on a couple changes suggested by @SMark-Black in his PR mej#53, add
another command to look for to auto-detect LSF, and add support for
the LSF `res` daemon to the `check_ps_userproc_lineage()` check.  Also
moved the setting of `$RM_DAEMON_MATCH` to inside the check -- that's
the only thing in that whole entire file that actually requires a
resource manager!
Disable the logfile when in eval (`-e`) mode so that NHC doesn't try
to output to both the log and `stdout` and wind up saying the same
thing twice!
@mej
Owner

mej commented Dec 29, 2018

EDAC support is high on our priority list at LANL, so getting this merged is very much on my radar and at the top of my priority list! I want to make sure it gets some bake time in production before putting it into a release, so I'm retargeting this for 1.4.4, but it will be going in very early in the new year! :-)

@mej mej added this to the 1.4.4 Release milestone Dec 29, 2018
@mej mej self-assigned this Dec 29, 2018
Add a Table of Contents to the README.md documentation file
automatically generated by the `gh-md-toc` script from @ekalinin.

Many thanks to @basvandervlies for both suggesting this much-needed
addition and helping find an editor-agnostic way to generate it
automatically going forward!

To update:
```bash
$ git clone https://github.com/ekalinin/github-markdown-toc.git
$ github-markdown-toc/gh-md-toc --insert README.md
```

Closes mej#67.  Feedback is welcome on (1) whether or not to
submodule-ize this, and (2) whether or not any changes are needed to
tweak the output for NHC...it looks like it might need some manual
tweaking at the moment.
Made a couple tweaks to the `gh-md-toc` script (NHC-specific) to put
the Table of Contents more in line with what the intent of the
formatting was, not necessarily the exact indentation level.  I
intentionally skip heading levels to achieve the correct style, but
that confuses the ToC generator.  Some special casing in the `awk`
script has remedied that.

I haven't taken the time to make the changes in a way that would be
generic enough to consider upstreaming.  Maybe in the future!
@OleHolmNielsen

OleHolmNielsen commented Jan 2, 2019

The current edac-utils release 0.16 from RHEL/CentOS 7 contains a bug as documented in https://github.com/grondo/edac-utils/blob/master/NEWS:

Version 0.18 (2011-11-09);

  • Do not print "No errors to report" with edac-util --quiet

We get this useless output from the command:
$ edac-util --quiet
edac-util: No errors to report.

I have opened a bug report for RHEL 7 requesting an upgrade of edac-utils to version 0.18, see https://bugzilla.redhat.com/show_bug.cgi?id=1662858

Hopefully this will be useful for NHC when Red Hat implements the update.
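
Until an updated package is available, a check could drop the spurious line itself. The following is a minimal sketch, assuming the message text is exactly as shown above; the wrapper name `edac_quiet_filter` is hypothetical.

```bash
# Sketch only: hide the spurious message printed by affected edac-utils
# releases. Hypothetical helper; reads edac-util output on stdin.
edac_quiet_filter() {
    # grep -v exits non-zero when nothing is left, so normalise the status.
    grep -v 'No errors to report' || true
}

# Intended use on a real node (not run here). Whether the message goes to
# stdout or stderr is not stated above, so both streams are merged:
#   edac-util --quiet 2>&1 | edac_quiet_filter
```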

@OleHolmNielsen

It turns out that edac-utils is deprecated (though still supported) in RHEL 7, and the hardware checking functionality is replaced by rasdaemon, see https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html/migration_planning_guide/sect-red_hat_enterprise_linux-migration_planning_guide-deprecated_packages

The rasdaemon is documented for RHEL 7 in https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html/system_administrators_guide/sec-checking_for_hardware_errors
The rasdaemon project page is https://pagure.io/rasdaemon

Using rasdaemon on RHEL 7 requires a running daemon: systemctl start rasdaemon
You can then query the daemon with the ras-mc-ctl command.

For NHC it may be preferable to use rasdaemon instead of the deprecated edac-utils or mcelog.
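
As a rough idea of what a rasdaemon-based check might parse, here is a sketch that totals the counters from `ras-mc-ctl --error-count`. It assumes that command prints a header row followed by one row per DIMM label, with the corrected (CE) and uncorrected (UE) counts as the last two columns; that layout should be verified against the installed version. The helper name `ras_sum_errors` is hypothetical.

```bash
# Sketch only: total CE and UE counts from `ras-mc-ctl --error-count` output
# read on stdin. Assumes one header line, then rows ending in "CE UE" columns.
ras_sum_errors() {
    # Fields are taken relative to the end of the line so that labels
    # containing spaces do not shift the counts.
    awk 'NR > 1 { ce += $(NF - 1); ue += $NF } END { printf "%d %d\n", ce, ue }'
}

# Intended use on a node with rasdaemon running (not run here):
#   ras-mc-ctl --error-count | ras_sum_errors
```

The output is a single line of the form `CE UE`, which a check could compare against thresholds in the same way as the mcelog and EDAC checks.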

@kcgthb kcgthb force-pushed the master branch 3 times, most recently from 2cc5f7c to 38142c4 Compare April 8, 2021 01:18