Various updates and quality of life changes #405
…tures. Since most of the alerts are going to be permanent, it does not make sense to wait for the alert to be on for a certain time. Temperature sensors likewise vary, so using only the last sample is not sufficient to alert on potential issues.
…ed and simplified a few Postgres queries.
_data/rules.yml
  - name: Host context switching
    description: Context switching is growing on the node (> 10000 / CPU / s)
-   query: '((rate(node_context_switches_total[5m])) / (count without(cpu, mode) (node_cpu_seconds_total{mode="idle"})) > 10000) * on(instance) group_left (nodename) node_uname_info{nodename=~".+"}'
+   query: '(rate(node_context_switches_total[5m])) / (count without(cpu, mode) (node_cpu_seconds_total{mode="idle"})) > 10000'
Why did you remove the left join? The nodename lookup is nice for labelling the alert (see #348).
The uname collector, although enabled by default, may be disabled. Joining on it is nice for the systems that have it, but it means some environments get a different set of results. This happens on some BSDs, and Docker is weird about it too (if I remember correctly, it returns the information from the container running the exporter, not the monitored container).
Additionally, systemd out of the box no longer updates the hostname that uname returns, so all of our RHEL boxes were returning localhost.localdomain despite having a 'correct' hostname on the network. Since the query wasn't reliable in our environment, we now use a regexp on instance for alert formatting (in most people's cases, the instance is based on the actual, external DNS name).
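For illustration, the instance-based formatting could look roughly like the rule below. This is a sketch, not the rule from this PR: the group name, alert name, `for` duration, severity and the regexp are all assumptions, and it presumes `instance` has the usual `host:port` form. It relies on the `reReplaceAll` function from Prometheus' alert templating.

```yaml
groups:
  - name: host-example            # hypothetical group name
    rules:
      - alert: HostContextSwitching   # hypothetical alert name
        # Same expression as the new query in the diff, without the
        # node_uname_info join.
        expr: >-
          rate(node_context_switches_total[5m])
          / count without(cpu, mode) (node_cpu_seconds_total{mode="idle"})
          > 10000
        for: 5m                   # illustrative value
        labels:
          severity: warning       # illustrative value
        annotations:
          # Assumed regexp: strip a trailing ":port" so only the host part
          # of the instance label is shown in the alert text.
          summary: 'Host context switching on {{ reReplaceAll ":[0-9]+$" "" $labels.instance }}'
```

Because the host name is derived from a label that every series already carries, the alert does not depend on any optional collector being enabled.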
# Conflicts:
#	_data/rules.yml
#	dist/rules/host-and-hardware/node-exporter.yml
Pre-calculation of percentages
Reduced noise from alerts flapping into the "pending" state by querying over a time range (longitudinal queries) instead of the last sample
Based the network/disk calculations on total bandwidth; we have 1, 10 and 100G networking, NVMe drives, etc., so fixed values don't work well
Simplified the prediction query, removed unnecessary queries and filtered out "weird" filesystems
The software RAID (MD) subsystem has completely changed in the last few versions; modified the queries accordingly
Host kernel version deviations had no function. Made a guess on the desired functionality and adjusted using a modern PromQL function
Made sure SMART no longer alerts on the temperature trip values (which are fixed values)
Various other SMART improvements including the actual SMART status
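As a rough sketch of two of the ideas in this list (range-based queries and bandwidth-relative thresholds), rules along the following lines would work. These are not the queries from this PR: the alert names, thresholds and windows are assumptions, and the second rule presumes the node_exporter netclass collector is enabled so that `node_network_speed_bytes` is available.

```yaml
groups:
  - name: host-sketch             # hypothetical group name
    rules:
      # Range-based check: alert on the average over a window rather than
      # on the last sample, so a single noisy reading does not trigger it.
      - alert: HostTemperatureHigh      # hypothetical alert name
        expr: avg_over_time(node_hwmon_temp_celsius[10m]) > 75   # assumed threshold and window
        labels:
          severity: warning

      # Bandwidth-relative check: compare receive throughput with the link
      # speed reported for the same device instead of a fixed byte value.
      # Interfaces that report no usable speed (0 or negative) are skipped.
      - alert: HostNetworkReceiveSaturated   # hypothetical alert name
        expr: >-
          rate(node_network_receive_bytes_total[5m])
          / on(instance, device) (node_network_speed_bytes > 0)
          > 0.8
        labels:
          severity: warning
```

The ratio form means the same rule applies unchanged to 1G, 10G and 100G links, which is the point of moving away from fixed values.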