Add EDAC hardware check (ECC memory errors) #23

Open: wants to merge 59 commits into base: master

Commits on Nov 15, 2015

  1. Prepare dev branch for 1.4.3 development.

    Michael Jennings committed Nov 15, 2015
    bb2268b

Commits on Nov 20, 2015

  1. Allow SIGUSR1 and SIGUSR2 to toggle bash tracing and debug mode, respectively.

    Michael Jennings committed Nov 20, 2015
    21cbe95
  2. workaround: Try avoiding SGE-related memory leaks and other BASH issues by re-execing NHC after each loop.
    
    Some problems have been reported when NHC is used as a Grid Engine
    load sensor, including memory leaks, due to the long-running NHC
    process.  All function variables reported as being leaked are local to
    their respective functions, but for unknown reasons (BASH bug?),
    they're being leaked anyway.
    
    In an attempt to work around these issues, this commit re-execs NHC
    after the completion of each iteration.  There are some side effects
    to this, including resetting of non-exported environment variables, so
    this is experimental for now.
    Michael Jennings committed Nov 20, 2015
    e068246
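
The signal-driven toggles named in the first commit of this group could be wired up roughly as follows. This is a minimal sketch, not NHC's actual code: the names `XTRACE`, `DEBUG`, `toggle_xtrace`, and `toggle_debug` are assumptions made for illustration.

```bash
# Illustrative state flags (hypothetical names, not taken from NHC).
XTRACE=0
DEBUG=0

# Flip shell tracing (set -x) on or off.
toggle_xtrace() {
    if [ "$XTRACE" -eq 0 ]; then
        XTRACE=1
        set -x
    else
        set +x
        XTRACE=0
    fi
}

# Flip the debug flag between 0 and 1.
toggle_debug() {
    DEBUG=$((1 - DEBUG))
}

# SIGUSR1 toggles tracing, SIGUSR2 toggles debug mode.
trap toggle_xtrace USR1
trap toggle_debug USR2
```

With traps like these installed, sending `kill -USR1 <pid>` to the running script would switch tracing on, and a second signal would switch it off again.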

Commits on Dec 18, 2015

  1. 5a06141
  2. Merge pull request mej#5 from CSC-IT-Center-for-Science/nvidia_smi_dev

    new check: check_nvsmi_healthmon which uses nvidia-smi
    Michael Jennings committed Dec 18, 2015
    b75c969

Commits on Jan 29, 2016

  1. Added more debug information, this allows easier debugging if a check does or does not match a specific node range

    thedavidwhiteside committed Jan 29, 2016
    227a021

Commits on Feb 11, 2016

  1. fixed misleading timeout message if nhc times out in another section after the last check, such as if pbsnodes hangs and causes nhc to timeout

    thedavidwhiteside committed Feb 11, 2016
    109e234

Commits on Feb 17, 2016

  1. Merge pull request mej#8 from starboarder2001/feature/debugchecksrun

    Added more debug information for if a check matches the node range in nhc.conf
    Michael Jennings committed Feb 17, 2016
    cf25960
  2. check_nvsmi_healthmon(): Minor typo fix and cleanup of output.

    Michael Jennings committed Feb 17, 2016
    687c156
  3. test_lbnl_file.nhc: Add sanity check to make sure process substitution works before running tests that need it.

    Michael Jennings committed Feb 17, 2016
    7a59b89
  4. scripts/common.nhc: Refactor and clean/speed up check matching while keeping debugging output added by NREL. Avoids executing mcheck() 3 times on same data.

    Michael Jennings committed Feb 17, 2016
    23bd7e5
  5. Merge pull request mej#9 from starboarder2001/bug/checktimeoutmsg

    fixed misleading timeout message if nhc times out after all checks were completed
    Michael Jennings committed Feb 17, 2016
    46bf407

Commits on Mar 30, 2016

  1. 99cd537

Commits on Mar 31, 2016

  1. nhc: Fix erroneous reporting of timeouts during final checks.

    David Whiteside (GitHub user starboarder2001) noticed (see his PR mej#9
    for details) that NHC would sometimes report that it had timed out
    while executing a particular check when, in fact, it had already
    completed that check and was trying to finish its work and/or clean up
    (e.g., marking a node online).
    
    This tweaks his fix to be even more specific about what exactly NHC is
    doing as it goes through the final stages of execution.
    Michael Jennings committed Mar 31, 2016
    ecddd41
  2. test/test_common.nhc: Fix typo in mcheck tests.

    Michael Jennings committed Mar 31, 2016
    a7837f4
  3. scripts/common.nhc: Fix debugging messages for external matches with bracketing delimiters.

    Michael Jennings committed Mar 31, 2016
    19ff2de

Commits on Apr 20, 2016

  1. aca9cb0
  2. nhc-wrapper: In verbose mode, dump command output to stdout as well as the result file.

    Michael Jennings committed Apr 20, 2016
    a1c012a

Commits on Apr 21, 2016

  1. Add bash_stack_trace() function for printing BASH backtrace info. Also add file/line/function info to PS4 for xtrace output.

    Michael Jennings committed Apr 21, 2016
    c02d299

Commits on May 10, 2016

  1. This fixes mej#15 by using squeue instead of stat to obtain the list of authorized users.

    bbbbbrie committed May 10, 2016
    43f18ec

Commits on May 31, 2016

  1. Merge pull request mej#16 from bbbbbrie/dev

    This fixes mej#15 by using squeue instead of stat to obtain the list of authorized users
    Michael Jennings committed May 31, 2016
    919063d
  2. scripts/lbnl_job.nhc: Slight tweak to nhc_job_find_users() change from bbbbbrie to use built-in NHC variable HOSTNAME_S instead of shelling out to the hostname command.

    Michael Jennings committed May 31, 2016
    ca7783e

Commits on Aug 20, 2016

  1. nhc: Enhance PS4 value with additional shell level/subshell/call stack depth info.

    Michael Jennings committed Aug 20, 2016
    d548834
  2. Allow SIGUSR1 and SIGUSR2 to toggle bash tracing and debug mode, respectively.

    Michael Jennings committed Aug 20, 2016
    bdfa98c
  3. workaround: Try avoiding SGE-related memory leaks and other BASH issues by re-execing NHC after each loop.
    
    Some problems have been reported when NHC is used as a Grid Engine
    load sensor, including memory leaks, due to the long-running NHC
    process.  All function variables reported as being leaked are local to
    their respective functions, but for unknown reasons (BASH bug?),
    they're being leaked anyway.
    
    In an attempt to work around these issues, this commit re-execs NHC
    after the completion of each iteration.  There are some side effects
    to this, including resetting of non-exported environment variables, so
    this is experimental for now.
    Michael Jennings committed Aug 20, 2016
    3a6d299
  4. Merge branch 'sge-fixes' into dev

    * sge-fixes:
      workaround:  Try avoiding SGE-related memory leaks and other BASH issues by re-execing NHC after each loop.
      Allow SIGUSR1 and SIGUSR2 to toggle bash tracing and debug mode, respectively.
    Michael Jennings committed Aug 20, 2016
    8604209

Commits on Sep 6, 2016

  1. Merge branch 'master' into dev

    * master:
      Makefile.am:  Minor/cosmetic fix for removing /var/{lib,run}/nhc in uninstall-local rule.
      Allow SLURM nodes in reservation to be marked offline
      Fixed bad links
    Michael Jennings committed Sep 6, 2016
    a0287d3
  2. ac1bb8c

Commits on Sep 7, 2016

  1. Merge branch 'master' into dev

    * master:
      nhc:  Update help output to show correct default value for $TIMEOUT.  Fixes mej#22 from @eshelman.
    Michael Jennings committed Sep 7, 2016
    ba4963d

Commits on Jan 12, 2017

  1. scripts/common.nhc: Minor indentation/spacing fix.

    Michael Jennings committed Jan 12, 2017
    6027fcd

Commits on Mar 3, 2017

  1. After multiple encounters with, and user reports of, unintended
    expansion of wildcards -- especially * -- during parsing of file
    content, process listings, etc. (the latest being GH Issue mej#25 from
    @barrymoo and @bbenedetto), I finally decided I needed a better, more
    fool-proof method of protecting NHC from unintended pathname
    expansion.  So I decided to force the issue...literally.
    
    By invoking "set -f" on startup, NHC now disables all pathname (i.e.,
    wildcard) expansion throughout its run.  I went through the code and
    identified all the locations where globbing files and/or paths was
    required, and I've specifically re-enabled the expansion for only
    those commands/code blocks that require it, making sure to turn it
    back off afterward.
    
    This is going on a dedicated branch for now until I have a chance to
    do more thorough testing and vetting of this change since it
    potentially impacts how almost every line of code is processed.
    Michael Jennings committed Mar 3, 2017
    8fa6657
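
The approach described in this commit (disable globbing everywhere, then re-enable it only around the commands that need it) might look something like the sketch below. The helper names `count_matches` and `third_field` are hypothetical and do not come from NHC; they only illustrate the `set -f` / `set +f` pattern.

```bash
# Disable pathname expansion for the whole script, as the commit describes.
set -f

# Hypothetical helper: count the entries matching a glob pattern,
# re-enabling expansion only for the one command that needs it.
count_matches() {
    set +f            # globbing on, just for the expansion below
    set -- $1         # unquoted on purpose so the pattern expands
    set -f            # globbing back off
    # An unmatched pattern is left as-is, so check that the first result exists.
    if [ -e "$1" ]; then
        echo "$#"
    else
        echo 0
    fi
}

# Hypothetical helper: print the third whitespace-separated field of a line.
third_field() {
    set -- $1         # word splitting only; "*" stays literal under set -f
    echo "$3"
}

third_field 'root 1 * /sbin/init'    # prints: *
```

Without the initial `set -f`, the unquoted `*` in the sample line would be replaced by the names of files in the current directory, which is the class of bug the commit sets out to eliminate.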

Commits on Mar 23, 2017

  1. scripts/lbnl_ps.nhc: Fix bigtime stupid thinko on my part thanks to Bill Benedetto (@bbenedetto).  This closes mej#25, and similar globbing issues, once and for all!

    Michael Jennings committed Mar 23, 2017
    5d981db

Commits on May 28, 2017

  1. RELEASE_NOTES.txt: Add release notes.

    Michael Jennings authored and mej committed May 28, 2017
    1fcf31e
  2. scripts/lbnl_ps.nhc: Pull userid by PID from ps output, not from `passwd` cache. Should fix mej#28 from Trey Dockendorf.

    Michael Jennings authored and mej committed May 28, 2017
    42149b2
  3. test/test_lbnl_file.nhc: Put all process substitutions on the same lines where they're used.
    
    Some (newer?) versions of BASH don't like creating a process
    substitution file on one line and using it on the next while other
    (older?) versions seem to handle it just fine.  Keeping everything all
    on the same line should work everywhere, so let's do that.  This
    should (hopefully?) fix mej#31 which appears on both Fedora and Cygwin64
    but which I had previously misunderstood.
    mej committed May 28, 2017
    5ced918
  4. Merge branch 'master' into dev

    * master:
      README.md:  Add blurb and link for SGE/Grid Engine Integration.
    mej committed May 28, 2017
    b6c487c

Commits on May 30, 2018

  1. Add parameters to check_ps_unauth_users and check_ps_userproc_lineage to
    ignore specific processes, user names, uids or pids.
    rpabel committed May 30, 2018
    05c5938

Commits on May 31, 2018

  1. Merge pull request mej#61 from rpabel/check_ps

    Add parameters to check_ps_unauth_users and check_ps_userproc_lineage
    mej authored May 31, 2018
    85a4811

Commits on Jun 9, 2018

  1. Merge branch 'master' into dev

    * master:
      escaped bash pipe symbols in markdown table
      Make nhc debian aware, fixes mej#38
      Fix typo in check_file_stat
    mej committed Jun 9, 2018
    ed18abe

Commits on Oct 31, 2018

  1. Limit length of ps output returned

    Avoid hangs of NHC when processes are excessively long. Resolves mej#52
    treydock committed Oct 31, 2018
    4f2dcb5
  2. Merge pull request mej#64 from treydock/ps-width

    Limit length of ps output returned
    
    Set COLUMNS environment variable to 1024 prior to invoking `ps` command to avoid hangs with long command lines.
    mej authored Oct 31, 2018
    4d3950e
  3. check_file_test added negative option

    Only pass the check if there is no result, e.g.:
     *  check_file_test -! -e /etc/node_status/reboot_node
    
    This check will pass if the file is absent.
    basvandervlies committed Oct 31, 2018
    8026428
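
The negation behavior described for `check_file_test` in the last commit of this group can be illustrated with a simplified stand-in. The function `file_test` below is hypothetical: it only mirrors the idea of a leading `-!` inverting the result, and NHC's real check accepts more options and reports failures differently.

```bash
# Hypothetical, simplified stand-in for the idea behind check_file_test.
file_test() {
    local negate=0
    if [ "$1" = "-!" ]; then
        negate=1
        shift
    fi
    if test "$@"; then
        [ "$negate" -eq 0 ]   # test passed: succeed unless negated
    else
        [ "$negate" -eq 1 ]   # test failed: succeed only when negated
    fi
}
```

Under these assumptions, `file_test '-!' -e /etc/node_status/reboot_node` would succeed only while the marker file is absent, matching the example in the commit message.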

Commits on Nov 1, 2018

  1. Merge branch 'master' into dev

    * master:
      check_dmi_data_match():  Use -! for consistency
      README.md:  Add missing documentation
      check_hw_mem():  Fix handling of fudge factor
      Fixed dead link to Torque documentation
      Fix link to source tarball. Should point to releases/download not archive
      Fix incorrect call to `die`
    mej committed Nov 1, 2018
    35b7ad6

Commits on Nov 8, 2018

  1. nhc: Fix detached mode result writability test

    In detached mode, we check to see if we can write to the result file
    before actually trying to do so, but `test -w` returns false for
    non-existent files, not just unwritable files.  So we need to redo the
    logic of that test to account for both cases: (1) file exists and is
    writable, or (2) file does not exist but directory does and is
    writable.
    
    While I was at it, I also changed where the write is done to account
    for a third scenario:  (3) everything looks like it should work but
    the write itself fails.
    
    Fixes mej#59.  Thanks to @hwj0303 for spotting this!
    mej committed Nov 8, 2018
    fa5d035
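
The three scenarios listed in this commit message can be sketched as below. This is an illustration under assumptions, not the code NHC uses: `can_write_result` and `write_result` are hypothetical names, and the real script may structure the checks differently.

```bash
# Hypothetical helper: succeed if the result file at $1 looks writable.
can_write_result() {
    local file="$1" dir
    dir=$(dirname "$file")
    if [ -e "$file" ]; then
        # Case 1: the file exists and is writable.
        [ -w "$file" ]
    else
        # Case 2: the file does not exist, but its directory does and is writable.
        # A bare "test -w" on the missing file would wrongly report failure here.
        [ -d "$dir" ] && [ -w "$dir" ]
    fi
}

# Hypothetical helper: write $2 to the file $1, reporting failure either way.
write_result() {
    local file="$1" text="$2"
    can_write_result "$file" || return 1
    # Case 3: everything looked fine, but the write itself can still fail.
    { printf '%s\n' "$text" > "$file"; } 2>/dev/null || return 1
}
```

Checking the exit status of the write itself, rather than trusting the earlier test, covers conditions such as a full filesystem that no permission check can predict.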

Commits on Dec 8, 2018

  1. Merge branch 'master' into dev

    * master:
      test:  Add BASH variable tests for clobbering
      check_cmd_output():  Fix index of failed match
      scripts:  Don't clobber $LINENO
      Change to preferred capitalization for Slurm.
      correct example check
      Correct Readme
    mej committed Dec 8, 2018
    dafc417

Commits on Dec 28, 2018

  1. test/nhc-test: Fix unit tests after glob disable

    One of my last major changes before leaving LBNL was to globally
    disable pathname expansion (i.e., globbing) by default throughout
    NHC.  Apparently when I did that, I missed a spot:  the unit test
    driver script!  So only the unit tests specific to `nhc` itself have
    been running since that happened (commit 8fa6657).
    
    While this problem has existed for over a year-and-a-half in wallclock
    time, commit-wise it hasn't been very long.  (About 10-ish, not
    counting merging work from others.)  That's my story, and I'm sticking
    to it.
    mej committed Dec 28, 2018
    ca5e72c
  2. test/test_lbnl_file.nhc: Fix proc sub sanity chk

    Fix the process substitution sanity check so that we're not skipping
    46 unit tests for no reason whatsoever.
    mej committed Dec 28, 2018
    35e9389
  3. Merge pull request mej#68 from basvandervlies/check_file_test_negative

    check_file_test added negative option
    mej authored Dec 28, 2018
    2e4e4e9

Commits on Dec 29, 2018

  1. scripts/common.nhc: Fix MAX_SYS_UID auto-detection

    While taking another look at @SMark-Black's mej#50 and mej#53, I realized
    that the code in question regarding `$MAX_SYS_UID` is doing exactly
    what it is supposed to do, given the intended meaning of the variable
    based on its name.  What was *actually* wrong was that the
    `nhc_common_get_max_sys_uid()` function has been reading the wrong
    variable!
    
    So `nhc_common_get_max_sys_uid()` will now look for `$SYS_UID_MAX` in
    `/etc/login.defs` like it should have been doing all along, and using
    the value of `UID_MIN - 1` as a fallback if necessary.
    
    NOTE:  This means that the default auto-detected value of
    `$MAX_SYS_UID` will likely be something ending in `99` (like `499` or
    `999`) rather than `00` because it was always intended to be (and the
    code has always treated it as) the *top* of the exempt UID range, NOT
    the bottom of the non-exempt UID range!  If you have any configs or
    scripts that rely on different assumptions, please make sure to make
    any necessary updates.
    
    Closes mej#50.
    mej committed Dec 29, 2018
    88a5fab
  2. scripts/lbnl_ps.nhc: Improve LSF support

    Based on a couple changes suggested by @SMark-Black in his PR mej#53, add
    another command to look for to auto-detect LSF, and add support for
    the LSF `res` daemon to the `check_ps_userproc_lineage()` check.  Also
    moved the setting of `$RM_DAEMON_MATCH` to inside the check -- that's
    the only thing in that whole entire file that actually requires a
    resource manager!
    mej committed Dec 29, 2018
    9e8d0dc
  3. nhc: Don't output check results twice in -e mode

    Disable the logfile when in eval (`-e`) mode so that NHC doesn't try
    to output to both the log and `stdout` and wind up saying the same
    thing twice!
    mej committed Dec 29, 2018
    f9d7286
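
The lookup order described in the `MAX_SYS_UID` commit of this group (prefer `SYS_UID_MAX`, fall back to `UID_MIN - 1`) can be sketched as follows. The function name `get_max_sys_uid` and its optional file argument are assumptions for illustration; this is not the body of `nhc_common_get_max_sys_uid()`.

```bash
# Hypothetical sketch: print the highest system UID according to a
# login.defs-style file (defaults to /etc/login.defs).
get_max_sys_uid() {
    local defs="${1:-/etc/login.defs}" max min
    # Preferred source: an explicit SYS_UID_MAX setting.
    max=$(awk '$1 == "SYS_UID_MAX" { print $2; exit }' "$defs" 2>/dev/null)
    if [ -n "$max" ]; then
        echo "$max"
        return 0
    fi
    # Fallback: one below the first regular-user UID.
    min=$(awk '$1 == "UID_MIN" { print $2; exit }' "$defs" 2>/dev/null)
    if [ -n "$min" ]; then
        echo $((min - 1))
        return 0
    fi
    return 1    # neither setting was found
}
```

For a file that sets only `UID_MIN 500`, this sketch prints `499`, which is the "ending in `99`" behavior the commit's NOTE warns about.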

Commits on Jan 1, 2019

  1. README.md: Add Table of Contents

    Add a Table of Contents to the README.md documentation file
    automatically generated by the `gh-md-toc` script from @ekalinin.
    
    Many thanks to @basvandervlies for both suggesting this much-needed
    addition and helping find an editor-agnostic way to generate it
    automatically going forward!
    
    To update:
    ```bash
    $ git clone https://github.com/ekalinin/github-markdown-toc.git
    $ github-markdown-toc/gh-md-toc --insert README.md
    ```
    
    Closes mej#67.  Feedback is welcome on (1) whether or not to
    submodule-ize this, and (2) whether or not any changes are needed to
    tweak the output for NHC...it looks like it might need some manual
    tweaking at the moment.
    mej committed Jan 1, 2019
    894f144
  2. README.md: Fix Table of Contents

    Made a couple tweaks to the `gh-md-toc` script (NHC-specific) to put
    the Table of Contents more in line with what the intent of the
    formatting was, not necessarily the exact indentation level.  I
    intentionally skip heading levels to achieve the correct style, but
    that confuses the ToC generator.  Some special casing in the `awk`
    script has remedied that.
    
    I haven't taken the time to make the changes in a way that would be
    generic enough to consider upstreaming.  Maybe in the future!
    mej committed Jan 1, 2019
    8a21cd4

Commits on May 31, 2019

  1. 6f3a7f9
  2. fix for mej#14

    kcgthb committed May 31, 2019
    c38c7c6

Commits on Mar 23, 2020

  1. e0a587a

Commits on Feb 12, 2021

  1. Merge branch 'master' of https://github.com/mej/nhc

    Conflicts:
    	nhc
    kcgthb committed Feb 12, 2021
    9a2606d
  2. Merge branch 'ib'

    kcgthb committed Feb 12, 2021
    8fa27b5

Commits on Apr 8, 2021

  1. 3404a34