Author: | Marius Gedminas <[email protected]> |
---|---|
Date: | 2020-10-31 |
Version: | 0.13.2 |
Manual section: | 8 |
check-health [-c] [-v] [-f configfile]
check-health -g > configfile
check-health -h
check-health is a "poor man's Nagios": a script that performs some
basic system health checks. The checks are specified in the configuration
file /etc/pov/check-health; if that file doesn't exist,
check-health will exit silently without checking anything.
You can run check-health -g to generate a config file. You'll probably
need to modify it to suit your needs.
Usually check-health is run automatically from cron. It doesn't emit
any output and returns exit code 0 if all checks pass. Any output
indicates an error, and cron emails it to root.
-h | Print brief usage message and exit. |
-v | Verbose output: show what checks are being performed. |
-c | Colorize error messages in red. |
-g | Generate a sample config file and print it to stdout. |
-f FILENAME | Use the specified config file instead of /etc/pov/check-health. |
Note: -v also uses some colors, for informational messages, when
standard output is a terminal that supports colors. -c, on the other
hand, is unconditional and always uses colors, which is useful when
you run check-health over ssh without an allocated terminal and want
to see the errors stand out.
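The difference between the two color policies can be sketched as follows. This is an illustration only, not check-health's actual code: the function names `info` and `error`, the `COLOR_ERRORS` variable, and the choice of green for informational messages are all assumptions. `[ -t 1 ]` only tests whether standard output is a terminal; it does not verify that the terminal supports colors.

```shell
# Hypothetical sketch, not taken from check-health itself.

# Like -v: colorize informational messages only when stdout is a terminal.
info() {
    if [ -t 1 ]; then
        printf '\033[32m%s\033[0m\n' "$*"
    else
        printf '%s\n' "$*"
    fi
}

# Like -c when COLOR_ERRORS=1: colorize errors in red unconditionally,
# even when the output is a pipe (e.g. ssh without an allocated terminal).
error() {
    if [ "${COLOR_ERRORS:-0}" = 1 ]; then
        printf '\033[31m%s\033[0m\n' "$*" >&2
    else
        printf '%s\n' "$*" >&2
    fi
}
```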
All checks return a status code in addition to warning about problems.
- checkuptime [<uptime>[s/m/h/sec/min/hour]]
Skip the rest of the checks if system uptime is less than N seconds/minutes/hours.
<uptime> defaults to 10 minutes.
Example:
checkuptime 10m
- checkfs <mountpoint> [<amount>[K/M/G/T]]
Check that the filesystem mounted on <mountpoint> has at least <amount> of metric kilo/mega/giga/terabytes free.
<amount> defaults to 1M.
Example:
checkfs / 100M
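A rough sketch of how such a check could work, assuming metric multipliers (powers of 1000) as described above. The helper names `to_bytes` and `has_free_space` are hypothetical and not part of check-health, and treating a bare number as bytes is an assumption.

```shell
# Hypothetical helper: convert an amount such as "100M" to bytes,
# using metric multipliers (K = 1000, M = 1000^2, ...).
# Assumption: a number without a suffix is already in bytes.
to_bytes() {
    amount=$1
    case $amount in
        *K) echo $(( ${amount%K} * 1000 )) ;;
        *M) echo $(( ${amount%M} * 1000 * 1000 )) ;;
        *G) echo $(( ${amount%G} * 1000 * 1000 * 1000 )) ;;
        *T) echo $(( ${amount%T} * 1000 * 1000 * 1000 * 1000 )) ;;
        *)  echo "$amount" ;;
    esac
}

# Hypothetical check: succeed if <mountpoint> has at least <amount> free.
has_free_space() {
    mountpoint=$1
    need=$(to_bytes "${2:-1M}")    # 1M is the documented default
    # df -P -k prints POSIX-format output in 1024-byte blocks;
    # the fourth column of the second line is the available space.
    avail_kb=$(df -P -k "$mountpoint" | awk 'NR == 2 { print $4 }')
    [ $(( avail_kb * 1024 )) -ge "$need" ]
}
```

With these definitions, `has_free_space / 100M || echo "low disk space on /" >&2` would mimic the example above.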
- checkinodes <mountpoint> [<inodes>]
Check that the filesystem mounted on <mountpoint> has at least <inodes> free inodes.
<inodes> defaults to 5000.
Example:
checkinodes /
- checknfs <mountpoint>
Check that an NFS file system is mounted on <mountpoint>.
If not, try to mount all NFS filesystems.
Used as a workaround for an Ubuntu issue where NFS filesystems would fail to mount during boot, but would mount fine afterwards.
This hasn't been a problem lately.
Example:
checknfs /home
- checkpidfile <filename>
Check that the process listed in a given pidfile is running.
Example:
checkpidfile /var/run/crond.pid
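For illustration, a pidfile check can be approximated with `kill -0`, which sends no signal and only tests whether the process exists. The function name `pidfile_process_running` is hypothetical, and check-health's own implementation may differ.

```shell
# Hypothetical sketch, not check-health's actual code.
pidfile_process_running() {
    pidfile=$1
    [ -r "$pidfile" ] || return 1
    # read fails on a file without a trailing newline, but still sets $pid,
    # so only give up when nothing was read at all.
    read -r pid < "$pidfile" || [ -n "$pid" ] || return 1
    # kill -0 performs the existence check without delivering a signal.
    # When not running as root, it also fails for other users' processes.
    kill -0 "$pid" 2>/dev/null
}
```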
- checkpidfiles <filename> ...
Check that the processes listed in given pidfiles are running.
Suppresses warnings for /var/run/sm-notify.pid because it feels like a false positive.
Suppresses warnings for failed glob expansion under /run or /var/run.
Example:
checkpidfiles /var/run/*.pid /var/run/*/*.pid
- checkproc <name>
Check that a process with a given name is running.
See also: checkproc_pgrep, checkproc_pgrep_full.
Example:
checkproc crond
- checkproc_pgrep <name>
Check that a process with a given name is running.
Uses pgrep instead of pidof.
Example:
checkproc_pgrep tracd
- checkproc_pgrep_full <cmdline>
Check that a process matching a given command line is running.
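Uses pgrep -f instead of pidof. pgrep -f matches against the full command line, so it can find scripts run by an interpreter and processes that differ only in their arguments.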
Example:
checkproc_pgrep_full scriptname.py
Example:
checkproc_pgrep_full '/usr/bin/java -jar /usr/share/jenkins/jenkins.war'
- checktoomanyproc <name> <limit>
Check that fewer than <limit> instances of a given process are running.
See also: checktoomanyproc_pgrep, checktoomanyproc_pgrep_full.
Example:
checktoomanyproc aspell 2
- checktoomanyproc_pgrep <name> <limit>
Check that fewer than <limit> instances of a given process are running.
Uses pgrep instead of pidof.
Example:
checktoomanyproc_pgrep tracd 2
- checktoomanyproc_pgrep_full <limit> <cmdline>
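Check that fewer than <limit> instances of a given process are running.
Uses pgrep -f instead of pidof. pgrep -f matches against the full command line, so it can find scripts run by an interpreter and processes that differ only in their arguments.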
Example:
checktoomanyproc_pgrep_full 2 scriptname.py
Example:
checktoomanyproc_pgrep_full 2 '/usr/bin/java -jar /usr/share/jenkins/jenkins.war'
- checkthreads <min> <pgrep-args>
Check that a process has at least <min> threads.
Uses pgrep <pgrep-args> to find the process. Shows an error if pgrep finds nothing, or if pgrep finds more than one process.
Useful to detect dying threads due to missing/buggy exception handling.
Example:
checkthreads 7 runzope -u ivija-staging
- checklocale <locale> <pgrep-args>
Check that a process is running with the correct locale set.
Uses pgrep <pgrep-args> to find the process. Shows an error if pgrep finds nothing, or if pgrep finds more than one process.
Looks at LC_ALL/LC_CTYPE/LANG in the process environment. <locale> can be a glob pattern.
Background: this is useful to detect problems when a system daemon's locale differs depending on which sysadmin used their ssh session to launch it (or if the daemon was started at system startup).
Example:
checklocale en_US.UTF-8 runzope -u ivija-staging
Example:
checklocale '*.UTF-8' runzope -u ivija-staging
- checkram [<free>[M/G/T]]
Check that at least <free> metric mega/giga/terabytes of virtual memory are free.
<free> defaults to 100 megabytes.
Example:
checkram 100M
- checkswap [<limit>[M/G/T]]
Check if more than <limit> metric mega/giga/terabytes of swap are used.
<limit> defaults to 100 megabytes.
Example:
checkswap 2G
- checkmailq [<limit>]
Check if more than <limit> emails are waiting in the outgoing mail queue.
<limit> defaults to 20.
The check is silently skipped if you don't have any MTA (that provides a mailq command) installed. Otherwise it probably works only with Postfix.
Example:
checkmailq 100
- checkzopemailq <path> ...
Check if any messages older than one minute are present in the outgoing maildir used by zope.sendmail.
<path> needs to refer to the 'new' subdirectory of the mail queue.
Example:
checkzopemailq /apps/zopes/*/var/mailqueue/new
- checkcups <queuename>
Check if the printer is ready.
Try to enable it if it became disabled.
Background: I had this issue with CUPS randomly disabling a particular print queue after it couldn't talk to the printer for a while due to network issues or something. Manually reenabling the printer got old fast. This hasn't been a problem lately.
Example:
checkcups cheese
- cmpfiles <pathname1> <pathname2>
Check if the two files are identical.
Background: there were some init.d scripts that were writable by a non-root user. I wanted to inspect them manually before copying them into /etc/init.d/.
Example:
cmpfiles /etc/init.d/someservice /home/someservice/initscript
- check_no_matching_lines <regexp> <pathname>
Check that a file has no lines matching a regular expression.
Background: I had Jenkins jobs install random user crontabs.
Example:
check_no_matching_lines '^[^#]' /var/spool/cron/crontabs/jenkins
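A minimal sketch of such a check built on grep. The function name `no_matching_lines` is hypothetical, the use of basic (rather than extended) regular expressions is an assumption, and so is the decision to treat a missing file as having no matching lines. Quoting the pattern keeps the shell from interpreting `[^#]` as a glob.

```shell
# Hypothetical sketch, not check-health's actual code.
no_matching_lines() {
    regexp=$1
    pathname=$2
    # Assumption: a missing file trivially has no matching lines.
    [ -f "$pathname" ] || return 0
    # grep -q exits 0 when a line matches; -e protects patterns starting with "-".
    if grep -q -e "$regexp" "$pathname"; then
        echo "$pathname has lines matching $regexp" >&2
        return 1
    fi
}
```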
- checkaliases
Check if /etc/aliases.db is up to date.
Probably works only with Postfix, and only if you use the default database format.
Background: when you edit /etc/aliases it's so easy to forget to run newaliases.
Example:
checkaliases
- check_postmap_up_to_date <pathname>
Check if <pathname>.db is up to date with respect to <pathname>.
Background: when you edit /etc/postfix/* it's so easy to forget to run postmap.
Example:
check_postmap_up_to_date /etc/postfix/virtual
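One plausible way to implement this is to compare modification times with the shell's `-nt` ("newer than") test. The sketch below assumes that approach; the function name `postmap_up_to_date` is hypothetical and the real check may work differently.

```shell
# Hypothetical sketch, not check-health's actual code.
postmap_up_to_date() {
    pathname=$1
    # Stale if the .db file is missing, or if the source file was
    # modified more recently than the compiled database.
    if [ ! -e "$pathname.db" ] || [ "$pathname" -nt "$pathname.db" ]; then
        echo "$pathname.db is out of date, run postmap $pathname" >&2
        return 1
    fi
}
```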
- checklilo
Check if LILO was run after a kernel update.
Background: if you don't re-run LILO after you update your kernel, your machine will not boot. We had to use LILO on one server because GRUB completely refused to boot from the Software RAID-1 root partition.
Example:
checklilo
- checkweb
Check if a website is available over HTTP/HTTPS.
A thin wrapper around check_http from nagios-plugins-basic. See https://www.monitoring-plugins.org/doc/man/check_http.html for the available options.
Normally you would use this from /etc/pov/check-web-health, and not from /etc/pov/check-health.
Example:
checkweb -H www.example.com
Example:
checkweb --ssl -H www.example.com -u /prefix/ -f follow -s 'Expect this string' --timeout=30
Example:
checkweb --ssl -H www.example.com -u /protected/ -e 'HTTP/1.1 401 Unauthorized' -s 'Login required'
Example:
checkweb --ssl -H www.example.com --invert-regex -r "Database connection error"
This function is normally used from /etc/pov/check-web-health.
- checkweb_auth
Check if a website is available over HTTP/HTTPS.
checkweb_auth user:pwd args is equivalent to checkweb -a user:pwd args, but the username/password pair is not printed if the check fails or in verbose mode. (It's still visible to any local system user who can run 'ps' while check-web-health is running.)
Example:
checkweb_auth username:password -H www.example.com
This function is normally used from /etc/pov/check-web-health.
- checkcert <hostname>[:<port>] [<days>]
Check if the SSL certificate of a website is close to expiration.
<days> defaults to $CHECKCERT_WARN_BEFORE, and if that's not specified, 21.
Example:
checkcert www.example.com
Example:
checkcert www.example.com:8443
This function is normally used from /etc/pov/check-ssl-certs.
- checkcert_ssmtp <hostname> [<days>]
Check if the SSL certificate of an SSMTP server is close to expiration.
<days> defaults to $CHECKCERT_WARN_BEFORE, and if that's not specified, 21.
Example:
checkcert_ssmtp mail.example.com
This function is normally used from /etc/pov/check-ssl-certs.
- checkcert_smtp_starttls <hostname> [<days>]
Check if the SSL certificate of an SMTP server is close to expiration.
<days> defaults to $CHECKCERT_WARN_BEFORE, and if that's not specified, 21.
Example:
checkcert_smtp_starttls mail.example.com
This function is normally used from /etc/pov/check-ssl-certs.
- checkcert_imaps <hostname> [<days>]
Check if the SSL certificate of an IMAPS server is close to expiration.
<days> defaults to $CHECKCERT_WARN_BEFORE, and if that's not specified, 21.
Example:
checkcert_imaps mail.example.com
This function is normally used from /etc/pov/check-ssl-certs.
Example /etc/pov/check-health:

    # Check that processes are running
    checkproc apache2
    checkproc cron
    checkproc sshd
    checkproc_pgrep tracd
    checkproc_pgrep_full '/usr/bin/java -jar /usr/share/jenkins/jenkins.war'

    # Check for daemons with known bugs and restart them automatically
    checkproc atop || service atop restart

    # Check for stale aspell processes (more than 2)
    checktoomanyproc aspell 2

    # Check for stale pidfiles
    checkpidfiles /var/run/*.pid /var/run/*/*.pid

    # Check free disk space
    checkfs / 200M
    checkfs /var 200M

    # Check free inodes
    checkinodes /
    checkinodes /var

    # Check free memory
    checkram 100M

    # Check excessive swap usage
    checkswap 2G

    # Check mail queue
    checkmailq 100

    # Check if /etc/aliases is up to date
    checkaliases

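Because every check returns a status code, a recovery action can be chained with `||`, as the atop line in the example does. The sketch below shows the pattern with stand-in functions: `checkdemo` and `restart_demo` are hypothetical and only imitate a failing check and its remedy.

```shell
# Stand-in for a real check: warn on stderr and return non-zero,
# as a failing check would.
checkdemo() {
    echo "demo-daemon is not running" >&2
    return 1
}

# Stand-in for a recovery command such as "service atop restart".
restart_demo() {
    echo "restarting demo-daemon"
}

# The warning still goes to stderr (and so into cron's email);
# the recovery action runs only when the check fails.
checkdemo || restart_demo
```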
check-health returns exit code 0 even if some checks failed. You need to watch stderr to notice problems.
Many checks don't check their arguments for correctness and may fail in unexpected ways if you supply a wrong value (or neglect to supply a value where one was expected).
If cron doesn't work, or email sending doesn't work, check-health won't be able to report problems. You can combine it with a service like https://healthchecks.io to catch these kinds of problems.
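A possible wrapper for this, sketched under a few assumptions: the function name `run_and_ping` and the `CHECK_CMD`/`PING_CMD` variables are hypothetical, the ping URL is whatever your monitoring service assigns you, and appending `/fail` to report a failure follows the healthchecks.io convention as I understand it (verify against its documentation). Since check-health exits with 0 even when checks fail, the wrapper looks at the output rather than the exit code.

```shell
# Hypothetical wrapper, not part of check-health.
# Both commands are overridable so the logic can be tried out
# without root privileges or network access.
CHECK_CMD=${CHECK_CMD:-check-health}
PING_CMD=${PING_CMD:-curl -fsS -m 10 --retry 3 -o /dev/null}

run_and_ping() {
    ping_url=$1
    # Any output from check-health indicates a problem.
    output=$($CHECK_CMD 2>&1)
    if [ -n "$output" ]; then
        # Re-emit the output so cron still emails it, then signal failure.
        printf '%s\n' "$output" >&2
        $PING_CMD "$ping_url/fail"
    else
        # All checks passed: report that cron and check-health are alive.
        $PING_CMD "$ping_url"
    fi
}
```

Calling `run_and_ping` with your own ping URL from cron, in place of plain check-health, lets the external service notice when the pings stop arriving.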
check-health is stateless and as such will keep reporting the same error once an hour (assuming default cron configuration) until you fix it.
check-web-health(8), check-ssl-certs(8)