Add EDAC hardware check (ECC memory errors) #23
base: master
Commits on Nov 15, 2015
- Prepare dev branch for 1.4.3 development.
  Michael Jennings committed Nov 15, 2015
  Commit bb2268b
Commits on Nov 20, 2015
- Allow SIGUSR1 and SIGUSR2 to toggle bash tracing and debug mode, respectively.
  Michael Jennings committed Nov 20, 2015
  Commit 21cbe95
- workaround: Try avoiding SGE-related memory leaks and other BASH issues by re-execing NHC after each loop.
  Some problems have been reported when NHC is used as a Grid Engine load sensor, including memory leaks, due to the long-running NHC process. All function variables reported as being leaked are local to their respective functions, but for unknown reasons (BASH bug?), they're being leaked anyway. In an attempt to work around these issues, this commit re-execs NHC after the completion of each iteration. There are some side effects to this, including resetting of non-exported environment variables, so this is experimental for now.
  Michael Jennings committed Nov 20, 2015
  Commit e068246
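As a rough illustration of the SIGUSR1/SIGUSR2 toggling described in commit 21cbe95, the sketch below installs two traps in BASH. The function names `toggle_xtrace` and `toggle_debug` and the variables `XTRACE` and `DEBUG` are assumptions made for this example, not NHC's actual identifiers.

```bash
# Hypothetical sketch: not NHC's actual code; all names are illustrative.
XTRACE=0   # 1 while bash tracing (set -x) is active
DEBUG=0    # 1 while debug mode is active

toggle_xtrace() {
    if [ "$XTRACE" -eq 1 ]; then
        set +x
        XTRACE=0
    else
        XTRACE=1
        set -x
    fi
}

toggle_debug() {
    # Flip DEBUG between 0 and 1.
    DEBUG=$(( 1 - DEBUG ))
}

# SIGUSR1 toggles tracing; SIGUSR2 toggles debug mode.
trap toggle_xtrace USR1
trap toggle_debug USR2
```

With traps like these installed, sending `kill -USR1 <pid>` from another shell would flip tracing on a running process without restarting it.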
Commits on Dec 18, 2015
- Commit 5a06141
- Merge pull request mej#5 from CSC-IT-Center-for-Science/nvidia_smi_dev
  new check: check_nvsmi_healthmon which uses nvidia-smi
  Michael Jennings committed Dec 18, 2015
  Commit b75c969
Commits on Jan 29, 2016
- Added more debug information, this allows easier debugging if a check does or does not match a specific node range
  Commit 227a021
Commits on Feb 11, 2016
- fixed misleading timeout message if nhc times out in another section after the last check, such as if pbsnodes hangs and causes nhc to timeout
  Commit 109e234
Commits on Feb 17, 2016
- Merge pull request mej#8 from starboarder2001/feature/debugchecksrun
  Added more debug information for if a check matches the node range in nhc.conf
  Michael Jennings committed Feb 17, 2016
  Commit cf25960
- check_nvsmi_healthmon(): Minor typo fix and cleanup of output.
  Michael Jennings committed Feb 17, 2016
  Commit 687c156
- test_lbnl_file.nhc: Add sanity check to make sure process substitution works before running tests that need it.
  Michael Jennings committed Feb 17, 2016
  Commit 7a59b89
- scripts/common.nhc: Refactor and clean/speed up check matching while keeping debugging output added by NREL.
  Avoids executing mcheck() 3 times on same data.
  Michael Jennings committed Feb 17, 2016
  Commit 23bd7e5
- Merge pull request mej#9 from starboarder2001/bug/checktimeoutmsg
  fixed misleading timeout message if nhc times out after all checks were completed
  Michael Jennings committed Feb 17, 2016
  Commit 46bf407
Commits on Mar 30, 2016
- Properly (I hope...) handle nodes in the "resv" (reserved) state.
  Michael Jennings committed Mar 30, 2016
  Commit 99cd537
Commits on Mar 31, 2016
- nhc: Fix erroneous reporting of timeouts during final checks.
  David Whiteside (GitHub user starboarder2001) noticed (see his PR mej#9 for details) that NHC would sometimes report that it had timed out while executing a particular check when, in fact, it had already completed that check and was trying to finish its work and/or clean up (e.g., marking a node online). This tweaks his fix to be even more specific about what exactly NHC is doing as it goes through the final stages of execution.
  Michael Jennings committed Mar 31, 2016
  Commit ecddd41
- test/test_common.nhc: Fix typo in mcheck tests.
  Michael Jennings committed Mar 31, 2016
  Commit a7837f4
- scripts/common.nhc: Fix debugging messages for external matches with bracketing delimiters.
  Michael Jennings committed Mar 31, 2016
  Commit 19ff2de
Commits on Apr 20, 2016
- nhc-wrapper: Return actual exit status of subprogram, not just 1 or 0.
  Michael Jennings committed Apr 20, 2016
  Commit aca9cb0
- nhc-wrapper: In verbose mode, dump command output to stdout as well as the result file.
  Michael Jennings committed Apr 20, 2016
  Commit a1c012a
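One way to get both behaviours from commits aca9cb0 and a1c012a, sketched under assumptions: run the subprogram through `tee` in verbose mode and recover its real exit status from BASH's `PIPESTATUS` array. The function `run_wrapped` and its argument order are hypothetical and are not taken from `nhc-wrapper`.

```bash
# Hypothetical sketch; run_wrapped and its arguments are illustrative only.
# Usage: run_wrapped RESULT_FILE VERBOSE(0|1) COMMAND [ARGS...]
run_wrapped() {
    local result_file="$1" verbose="$2" rc
    shift 2
    if [ "$verbose" -eq 1 ]; then
        # Copy output to stdout as well as the result file.
        "$@" 2>&1 | tee "$result_file"
        # PIPESTATUS[0] holds the command's status, not tee's.
        rc=${PIPESTATUS[0]}
    else
        "$@" > "$result_file" 2>&1
        rc=$?
    fi
    # Pass the subprogram's actual status through, not just 0 or 1.
    return "$rc"
}
```

Reading `PIPESTATUS[0]` immediately after the pipeline matters, because any later command overwrites the array.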
Commits on Apr 21, 2016
- Add bash_stack_trace() function for printing BASH backtrace info.
  Also add file/line/function info to PS4 for xtrace output.
  Michael Jennings committed Apr 21, 2016
  Commit c02d299
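For readers unfamiliar with the BASH facilities involved in commit c02d299, here is a minimal sketch using the built-in `FUNCNAME`, `BASH_SOURCE`, `BASH_LINENO`, and `LINENO` variables. The function name `print_stack_trace` and the exact `PS4` layout are assumptions for illustration; NHC's `bash_stack_trace()` may format its output differently.

```bash
# Hypothetical sketch; NHC's real bash_stack_trace() may differ.
# Prefix each xtrace line with file, line, and function. Single quotes
# delay expansion until bash prints each trace line.
PS4='+ ${BASH_SOURCE[0]##*/}:${LINENO}:${FUNCNAME[0]:-main}(): '

print_stack_trace() {
    local i
    # Frame 0 is print_stack_trace itself, so start at its caller.
    for (( i = 1; i < ${#FUNCNAME[@]}; i++ )); do
        # BASH_LINENO[i-1] is the line in frame i that called frame i-1.
        echo "  at ${FUNCNAME[i]}() in ${BASH_SOURCE[i]:-?}:${BASH_LINENO[i-1]}"
    done
}
```

Calling `print_stack_trace` from inside nested functions lists the innermost caller first and the outermost frame last.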
Commits on May 10, 2016
- This fixes mej#15 by using squeue instead of stat to obtain the list of authorized users.
  Commit 43f18ec
Commits on May 31, 2016
- Merge pull request mej#16 from bbbbbrie/dev
  This fixes mej#15 by using squeue instead of stat to obtain the list of authorized users
  Michael Jennings committed May 31, 2016
  Commit 919063d
- scripts/lbnl_job.nhc: Slight tweak to nhc_job_find_users() change from bbbbbrie to use built-in NHC variable HOSTNAME_S instead of shelling out to the hostname command.
  Michael Jennings committed May 31, 2016
  Commit ca7783e
Commits on Aug 20, 2016
- nhc: Enhance PS4 value with additional shell level/subshell/call stack depth info.
  Michael Jennings committed Aug 20, 2016
  Commit d548834
- Allow SIGUSR1 and SIGUSR2 to toggle bash tracing and debug mode, respectively.
  Michael Jennings committed Aug 20, 2016
  Commit bdfa98c
- workaround: Try avoiding SGE-related memory leaks and other BASH issues by re-execing NHC after each loop.
  Some problems have been reported when NHC is used as a Grid Engine load sensor, including memory leaks, due to the long-running NHC process. All function variables reported as being leaked are local to their respective functions, but for unknown reasons (BASH bug?), they're being leaked anyway. In an attempt to work around these issues, this commit re-execs NHC after the completion of each iteration. There are some side effects to this, including resetting of non-exported environment variables, so this is experimental for now.
  Michael Jennings committed Aug 20, 2016
  Commit 3a6d299
- Merge branch 'sge-fixes' into dev
  * sge-fixes:
    workaround: Try avoiding SGE-related memory leaks and other BASH issues by re-execing NHC after each loop.
    Allow SIGUSR1 and SIGUSR2 to toggle bash tracing and debug mode, respectively.
  Michael Jennings committed Aug 20, 2016
  Commit 8604209
Commits on Sep 6, 2016
- Merge branch 'master' into dev
  * master:
    Makefile.am: Minor/cosmetic fix for removing /var/{lib,run}/nhc in uninstall-local rule.
    Allow SLURM nodes in reservation to be marked offline
    Fixed bad links
  Michael Jennings committed Sep 6, 2016
  Commit a0287d3
- Commit ac1bb8c
Commits on Sep 7, 2016
- Merge branch 'master' into dev
  * master:
    nhc: Update help output to show correct default value for $TIMEOUT. Fixes mej#22 from @eshelman.
  Michael Jennings committed Sep 7, 2016
  Commit ba4963d
Commits on Jan 12, 2017
- scripts/common.nhc: Minor indentation/spacing fix.
  Michael Jennings committed Jan 12, 2017
  Commit 6027fcd
Commits on Mar 3, 2017
- After multiple encounters with, and user reports of, unintended expansion of wildcards -- especially * -- during parsing of file content, process listings, etc. (the latest being GH Issue mej#25 from @barrymoo and @bbenedetto), I finally decided I needed a better, more fool-proof method of protecting NHC from unintended pathname expansion. So I decided to force the issue...literally.
  By invoking "set -f" on startup, NHC now disables all pathname (i.e., wildcard) expansion throughout its run. I went through the code and identified all the locations where globbing files and/or paths was required, and I've specifically re-enabled the expansion for only those commands/code blocks that require it, making sure to turn it back off afterward.
  This is going on a dedicated branch for now until I have a chance to do more thorough testing and vetting of this change since it potentially impacts how almost every line of code is processed.
  Michael Jennings committed Mar 3, 2017
  Commit 8fa6657
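The approach described in commit 8fa6657 can be sketched as follows: disable pathname expansion once at startup, then switch it back on only around the one expansion that needs it. The helper name `expand_glob` is hypothetical and is not taken from NHC.

```bash
# Hypothetical sketch; expand_glob is illustrative, not an NHC function.
set -f   # disable pathname (wildcard) expansion for the whole run

# Print one line per path matching the glob pattern in $1.
expand_glob() {
    local pattern="$1"
    local -a matches
    set +f                 # re-enable globbing just for this expansion
    matches=( $pattern )   # unquoted on purpose so the glob expands
    set -f                 # and turn it straight back off
    printf '%s\n' "${matches[@]}"
}
```

With globbing off by default, a stray `*` read from a file or process listing stays a literal asterisk. Note that in this sketch a pattern containing whitespace would also be word-split, and a pattern with no matches is printed unchanged.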
Commits on Mar 23, 2017
- scripts/lbnl_ps.nhc: Fix bigtime stupid thinko on my part thanks to Bill Benedetto (@bbenedetto).
  This closes mej#25, and similar globbing issues, once and for all!
  Michael Jennings committed Mar 23, 2017
  Commit 5d981db
Commits on May 28, 2017
- Commit 1fcf31e
- scripts/lbnl_ps.nhc: Pull userid by PID from `ps` output, not from `passwd` cache.
  Should fix mej#28 from Trey Dockendorf <[email protected]>.
  Commit 42149b2
- test/test_lbnl_file.nhc: Put all process substitutions on the same lines where they're used.
  Some (newer?) versions of BASH don't like creating a process substitution file on one line and using it on the next while other (older?) versions seem to handle it just fine. Keeping everything all on the same line should work everywhere, so let's do that. This should (hopefully?) fix mej#31 which appears on both Fedora and Cygwin64 but which I had previously misunderstood.
  Commit 5ced918
- Merge branch 'master' into dev
  * master:
    README.md: Add blurb and link for SGE/Grid Engine Integration.
  Commit b6c487c
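The portability point in commit 5ced918 can be illustrated with a small BASH example: the process substitution is created and consumed within a single command, rather than stored on one line and read on the next. The function `count_lines` is a hypothetical helper written for this sketch, and the commented-out two-line form is only one shape the fragile pattern could take.

```bash
# Hypothetical sketch; count_lines is illustrative only.
# Count the lines in the file (or /dev/fd path) named by $1.
count_lines() {
    local n=0 line
    while IFS= read -r line; do
        n=$(( n + 1 ))
    done < "$1"
    echo "$n"
}

# Portable form: the substitution lives on the same line as its consumer.
count_lines <(printf 'one\ntwo\nthree\n')   # prints 3

# Fragile form, shown only as a comment: some BASH versions may already
# have closed the substituted file by the time the second line runs.
#   tmp=<(printf 'one\ntwo\nthree\n')
#   count_lines "$tmp"
```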
Commits on May 30, 2018
- Add parameters to check_ps_unauth_users and check_ps_userproc_lineage to ignore specific processes, user names, uids or pids.
  Commit 05c5938
Commits on May 31, 2018
- Merge pull request mej#61 from rpabel/check_ps
  Add parameters to check_ps_unauth_users and check_ps_userproc_lineage
  Commit 85a4811
Commits on Jun 9, 2018
- Merge branch 'master' into dev
  * master:
    escaped bash pipe symbols in markdown table
    Make nhc debian aware, fixes mej#38
    Fix typo in check_file_stat
  Commit ed18abe
Commits on Oct 31, 2018
- Limit length of ps output returned
  Avoid hangs of NHC when processes are excessively long. Resolves mej#52
  Commit 4f2dcb5
- Merge pull request mej#64 from treydock/ps-width
  Limit length of ps output returned
  Set COLUMNS environment variable to 1024 prior to invoking `ps` command to avoid hangs with long command lines.
  Commit 4d3950e
- check_file_test added negative option
  Only pass check if there is no result, e.g.:
    * check_file_test -! -e /etc/node_status/reboot_node
  This check will pass if the file is absent
  Commit 8026428
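The `-!` negation added in commit 8026428 amounts to inverting the result of a file test. The sketch below shows the idea with a hypothetical function `file_test`, which accepts an optional leading `-!` followed by arguments for the shell's `test` builtin. It is not NHC's `check_file_test` implementation and makes no attempt to mirror its option handling.

```bash
# Hypothetical sketch; not NHC's check_file_test.
# Usage: file_test [-!] TEST_ARGS...   e.g. file_test -! -e /some/path
file_test() {
    local negate=0
    if [ "$1" = "-!" ]; then
        negate=1
        shift
    fi
    if test "$@"; then
        # Test succeeded: pass only when not negated.
        [ "$negate" -eq 0 ]
    else
        # Test failed: pass only when negated.
        [ "$negate" -eq 1 ]
    fi
}
```

In this sketch, `file_test -! -e /etc/node_status/reboot_node` succeeds exactly when the file is absent, which matches the behaviour the commit message describes.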
Commits on Nov 1, 2018
- Merge branch 'master' into dev
  * master:
    check_dmi_data_match(): Use -! for consistency
    README.md: Add missing documentation
    check_hw_mem(): Fix handling of fudge factor
    Fixed dead link to Torque documentation
    Fix link to source tarball. Should point to releases/download not archive
    Fix incorrect call to `die`
  Commit 35b7ad6
Commits on Nov 8, 2018
- nhc: Fix detached mode result writability test
  In detached mode, we check to see if we can write to the result file before actually trying to do so, but `test -w` returns false for non-existent files, not just unwritable files. So we need to redo the logic of that test to account for both cases: (1) file exists and is writable, or (2) file does not exist but directory does and is writable. While I was at it, I also changed where the write is done to account for a third scenario: (3) everything looks like it should work but the write itself fails.
  Fixes mej#59. Thanks to @hwj0303 for spotting this!
  Commit fa5d035
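The three scenarios in commit fa5d035 can be sketched in BASH as below. The function names `result_file_writable` and `write_result` are assumptions for this example and do not come from the `nhc` script.

```bash
# Hypothetical sketch; function names are illustrative only.
# Succeed if $1 can plausibly be written:
#   (1) the file exists and is writable, or
#   (2) the file does not exist but its directory exists and is writable.
result_file_writable() {
    local file="$1" dir
    if [ -e "$file" ]; then
        [ -w "$file" ]
    else
        dir=$(dirname "$file")
        [ -d "$dir" ] && [ -w "$dir" ]
    fi
}

# (3) Even when the checks pass, the write itself can fail, so report it.
write_result() {
    local file="$1" text="$2"
    result_file_writable "$file" || return 1
    if ! { printf '%s\n' "$text" > "$file"; } 2>/dev/null; then
        return 1
    fi
}
```

Checking the directory only when the file is missing avoids the false negative that a bare `test -w` gives for a result file that has not been created yet.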
Commits on Dec 8, 2018
- Merge branch 'master' into dev
  * master:
    test: Add BASH variable tests for clobbering
    check_cmd_output(): Fix index of failed match
    scripts: Don't clobber $LINENO
    Change to preferred capitalization for Slurm.
    correct example check
    Correct Readme
  Commit dafc417
Commits on Dec 28, 2018
- test/nhc-test: Fix unit tests after glob disable
  One of my last major changes before leaving LBNL was to globally disable pathname expansion (i.e., globbing) by default throughout NHC. Apparently when I did that, I missed a spot: the unit test driver script! So only the unit tests specific to `nhc` itself have been running since that happened (commit 8fa6657).
  While this problem has existed for over a year-and-a-half in wallclock time, commit-wise it hasn't been very long. (About 10-ish, not counting merging work from others.) That's my story, and I'm sticking to it.
  Commit ca5e72c
- test/test_lbnl_file.nhc: Fix proc sub sanity chk
  Fix the process substitution sanity check so that we're not skipping 46 unit tests for no reason whatsoever.
  Commit 35e9389
- Merge pull request mej#68 from basvandervlies/check_file_test_negative
  check_file_test added negative option
  Commit 2e4e4e9
Commits on Dec 29, 2018
- scripts/common.nhc: Fix MAX_SYS_UID auto-detection
  While taking another look at @SMark-Black's mej#50 and mej#53, I realized that the code in question regarding `$MAX_SYS_UID` is doing exactly what it is supposed to do, given the intended meaning of the variable based on its name. What was *actually* wrong was that the `nhc_common_get_max_sys_uid()` function has been reading the wrong variable! So `nhc_common_get_max_sys_uid()` will now look for `$SYS_UID_MAX` in `/etc/login.defs` like it should have been doing all along, and using the value of `UID_MIN - 1` as a fallback if necessary.
  NOTE: This means that the default auto-detected value of `$MAX_SYS_UID` will likely be something ending in `99` (like `499` or `999`) rather than `00` because it was always intended to be (and the code has always treated it as) the *top* of the exempt UID range, NOT the bottom of the non-exempt UID range! If you have any configs or scripts that rely on different assumptions, please make sure to make any necessary updates.
  Closes mej#50.
  Commit 88a5fab
- scripts/lbnl_ps.nhc: Improve LSF support
  Based on a couple changes suggested by @SMark-Black in his PR mej#53, add another command to look for to auto-detect LSF, and add support for the LSF `res` daemon to the `check_ps_userproc_lineage()` check. Also moved the setting of `$RM_DAEMON_MATCH` to inside the check -- that's the only thing in that whole entire file that actually requires a resource manager!
  Commit 9e8d0dc
- nhc: Don't output check results twice in -e mode
  Disable the logfile when in eval (`-e`) mode so that NHC doesn't try to output to both the log and `stdout` and wind up saying the same thing twice!
  Commit f9d7286
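The lookup order described in commit 88a5fab can be sketched as follows: prefer `SYS_UID_MAX` from `/etc/login.defs`, and fall back to `UID_MIN - 1`. The function `get_max_sys_uid` and its optional file argument are hypothetical; this is not the body of `nhc_common_get_max_sys_uid()`.

```bash
# Hypothetical sketch; not the body of nhc_common_get_max_sys_uid().
# Usage: get_max_sys_uid [LOGIN_DEFS_PATH]
get_max_sys_uid() {
    local defs="${1:-/etc/login.defs}" key val rest
    local sys_uid_max="" uid_min=""
    [ -r "$defs" ] || return 1
    # Each setting is a "KEY value" line; commented lines never match.
    while read -r key val rest || [ -n "$key" ]; do
        case "$key" in
            SYS_UID_MAX) sys_uid_max="$val" ;;
            UID_MIN)     uid_min="$val" ;;
        esac
    done < "$defs"
    if [ -n "$sys_uid_max" ]; then
        echo "$sys_uid_max"        # top of the exempt (system) UID range
    elif [ -n "$uid_min" ]; then
        echo $(( uid_min - 1 ))    # fallback: one below the first regular UID
    else
        return 1
    fi
}
```

Given a file containing only `UID_MIN 1000`, this sketch prints `999`, which is consistent with the commit's note that the detected value will typically end in `99`.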
Commits on Jan 1, 2019
- README.md: Add Table of Contents
  Add a Table of Contents to the README.md documentation file automatically generated by the `gh-md-toc` script from @ekalinin. Many thanks to @basvandervlies for both suggesting this much-needed addition and helping find an editor-agnostic way to generate it automatically going forward!
  To update:
  ```bash
  $ git clone https://github.com/ekalinin/github-markdown-toc.git
  $ github-markdown-toc/gh-md-toc --insert README.md
  ```
  Closes mej#67. Feedback is welcome on (1) whether or not to submodule-ize this, and (2) whether or not any changes are needed to tweak the output for NHC...it looks like it might need some manual tweaking at the moment.
  Commit 894f144
- README.md: Fix Table of Contents
  Made a couple tweaks to the `gh-md-toc` script (NHC-specific) to put the Table of Contents more in line with what the intent of the formatting was, not necessarily the exact indentation level. I intentionally skip heading levels to achieve the correct style, but that confuses the ToC generator. Some special casing in the `awk` script has remedied that.
  I haven't taken the time to make the changes in a way that would be generic enough to consider upstreaming. Maybe in the future!
  Commit 8a21cd4
Commits on May 31, 2019
- Commit 6f3a7f9
- Commit c38c7c6
Commits on Mar 23, 2020
- Added check for Infiniband PCIe link width and speed.
  Yong Qin committed Mar 23, 2020
  Commit e0a587a
Commits on Feb 12, 2021
- Commit 9a2606d
- Commit 8fa27b5
Commits on Apr 8, 2021
- Commit 3404a34