Various updates and quality of life changes #405

Open
wants to merge 28 commits into
base: master


@guruevi commented Feb 25, 2024

  • Made the GitHub Actions workflows manually triggerable for E2E testing.
  • The standalone docker-compose command is deprecated; Compose is now fully integrated into the docker command (docker compose).
  • Node-exporter:
    Percentages are now pre-calculated
    Reduced noise from Prometheus alerts flapping into the "pending" state by using longitudinal (over-time) queries
    Based network/disk calculations on total bandwidth; we have 1, 10 and 100G networking, NVMe drives, etc., so fixed values don't work well
    Simplified the prediction query, removed unnecessary queries, and filtered out "weird" filesystems
    The software RAID (MD) subsystem has changed completely in the last few versions; modified the queries accordingly
    The host kernel version deviations rule did not work. Made a guess at the desired functionality and rewrote it using a modern PromQL function
  • Smartctl-exporter:
    Made sure SMART no longer alerts on the temperature trip values (which are fixed values)
    Various other SMART improvements, including the actual SMART status
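
To illustrate two of the node-exporter points above, here is a minimal sketch of what a bandwidth-relative threshold and an over-time query could look like. These are not the queries from this PR: the rule names, thresholds, and window sizes are assumptions chosen for illustration, and only the `name`, `description`, and `query` keys seen in `_data/rules.yml` are used. The first rule assumes the exporter exposes `node_network_speed_bytes` for the interface; the second assumes "longitudinal" means evaluating the condition over a time window with a subquery.

```yaml
# Hypothetical rule: alert relative to the link speed instead of a fixed byte rate.
# Interfaces that report no usable speed (<= 0) are filtered out before dividing.
- name: Host network receive saturation (sketch)
  description: Interface is receiving at more than 80% of its link speed
  query: 'rate(node_network_receive_bytes_total[5m]) / (node_network_speed_bytes > 0) * 100 > 80'

# Hypothetical rule: the percentage is computed once in the query, then checked
# over a 10 minute window. max_over_time(...) < 10 is only true when every
# sample in the window was below 10, so short dips do not produce a result.
- name: Host out of memory (sketch)
  description: Available memory stayed below 10% for the last 10 minutes
  query: 'max_over_time((node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100)[10m:1m]) < 10'
```

Because the first query divides by the reported link speed, the same 80% threshold applies to a 1G and a 100G interface alike.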

_data/rules.yml Outdated
  - name: Host context switching
    description: Context switching is growing on the node (> 10000 / CPU / s)
-   query: '((rate(node_context_switches_total[5m])) / (count without(cpu, mode) (node_cpu_seconds_total{mode="idle"})) > 10000) * on(instance) group_left (nodename) node_uname_info{nodename=~".+"}'
+   query: '(rate(node_context_switches_total[5m])) / (count without(cpu, mode) (node_cpu_seconds_total{mode="idle"})) > 10000'
Owner

Why did you remove the left join? The nodename lookup is nice for labelling the alert.

(see #348)

Author

The uname collector is enabled by default, but it can be disabled. The join is nice on systems that have it, but the join only returns instances that have a matching node_uname_info series, so environments without the collector get a different set of results once it is added. This happens on some BSDs, and Docker is weird about it too (if I remember correctly, it returns the information from the container running the exporter, not the monitored container).
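
For reference, one way to keep the nodename label where it exists without losing instances that lack node_uname_info is to fall back to the plain expression with `or`. This is only a sketch of that idea, not something proposed in this PR; it assumes each instance produces a single series for the alert expression, and the rule name is hypothetical.

```yaml
# Hypothetical variant: the left operand of "or" is the joined query (with nodename),
# the right operand is the same expression without the join. "or on(instance)" keeps
# every joined result and adds only those instances that had no node_uname_info match.
- name: Host context switching (join with fallback, sketch)
  description: Context switching is growing on the node (> 10000 / CPU / s)
  query: '((rate(node_context_switches_total[5m]) / count without(cpu, mode) (node_cpu_seconds_total{mode="idle"}) > 10000) * on(instance) group_left (nodename) node_uname_info{nodename=~".+"}) or on(instance) (rate(node_context_switches_total[5m]) / count without(cpu, mode) (node_cpu_seconds_total{mode="idle"}) > 10000)'
```

The cost is a longer query that evaluates the expression twice, and it does nothing about a nodename that is present but wrong.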

Additionally, systemd out of the box no longer updates the hostname that uname returns, so all of our RHEL boxes were returning localhost.localdomain despite having a 'correct' hostname on the network. Since the query wasn't reliable in our environment, we now use a regexp on instance for alert formatting (in most people's cases, the instance is based on the actual (external) DNS name).
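
A regexp on instance can be applied in several places (alert templates, relabeling, or the query itself). As a rough sketch of the query-side option, `label_replace` can copy the host part of `instance` into a separate label. The `hostname` label name and the rule name are assumptions made for this example, and the pattern assumes `instance` looks like `host` or `host:port`.

```yaml
# Hypothetical variant: derive a "hostname" label from "instance" instead of joining
# on node_uname_info. The regex is fully anchored by label_replace; when it does not
# match (for example a bracketed IPv6 address), the series is returned unchanged.
- name: Host context switching (hostname from instance, sketch)
  description: Context switching is growing on the node (> 10000 / CPU / s)
  query: 'label_replace((rate(node_context_switches_total[5m]) / count without(cpu, mode) (node_cpu_seconds_total{mode="idle"}) > 10000), "hostname", "$1", "instance", "([^:]+)(:[0-9]+)?")'
```

With an instance of `web01.example.com:9100`, the result carries `hostname="web01.example.com"` and does not depend on the uname collector being enabled.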
