Various updates and quality of life changes #405

guruevi · 2024-02-25T01:02:26Z

Made the GitHub Actions workflow manually executable for E2E testing.
Docker-compose is deprecated; it is now fully integrated into the docker command.
Node-exporter:
- Pre-calculation of percentages.
- Reduced noise from Prometheus alerts flapping into the "pending" state by using longitudinal queries.
- Based the network and disk calculations on total bandwidth: we have 1, 10 and 100G networking, NVMe drives, etc., so fixed values don't work well.
- Simplified the prediction query, removed unnecessary queries and filtered out "weird" filesystems.
- The software RAID (MD) subsystem has changed completely in the last few versions; modified the queries accordingly.
- The host kernel version deviations rule had no function. Made a guess at the desired functionality and adjusted it using a modern PromQL function.
Smartctl-exporter:
- Made sure SMART no longer alerts on the temperature trip values (which are fixed values).
- Various other SMART improvements, including the actual SMART status.
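
To illustrate the bandwidth-relative idea from the Node-exporter notes, here is a minimal sketch of a rule that compares inbound traffic with the link speed reported by the exporter instead of a fixed byte count. The rule name, the 80% threshold and the exact expression are assumptions for illustration, not the queries changed in this PR. It also assumes the exporter exposes `node_network_speed_bytes` for the interface.

```yaml
# Hypothetical rule, written in the same name/description/query layout as _data/rules.yml.
- name: Host network receive saturation
  description: Inbound traffic is above 80% of the interface link speed
  # "> 0" drops interfaces that report an unknown (non-positive) link speed,
  # so the division only runs where a real speed is available.
  query: '(rate(node_network_receive_bytes_total[5m]) / (node_network_speed_bytes > 0)) > 0.8'
```

Because the threshold is a ratio, the same rule would apply unchanged to 1G, 10G and 100G interfaces.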

…tures. Since most of the alerts are going to be permanent, it does not make sense to wait for the alert to be on for a certain time. Temperature sensors likewise vary, so using the last sample is not sufficient to alert on potential issues.

.github/workflows/dist.yml

samber · 2024-04-20T23:02:11Z

_data/rules.yml

              - name: Host context switching
                description: Context switching is growing on the node (> 10000 / CPU / s)
-                query: '((rate(node_context_switches_total[5m])) / (count without(cpu, mode) (node_cpu_seconds_total{mode="idle"})) > 10000) * on(instance) group_left (nodename) node_uname_info{nodename=~".+"}'
+                query: '(rate(node_context_switches_total[5m])) / (count without(cpu, mode) (node_cpu_seconds_total{mode="idle"})) > 10000'


Why did you remove the left join? The nodename lookup is nice for labelling the alert.

(see #348)

The uname collector, although enabled by default, may be disabled. Joining on it is nice for systems that have it, but it gives some environments a different set of results. This happens on some BSDs, and Docker is weird about it too (if I remember correctly, it returns the information from the container running the exporter, not the monitored container).

Additionally, systemd out of the box no longer updates the hostname that uname returns, so all of our RHEL boxes were returning localhost.localdomain despite having a 'correct' hostname on the network. Since the query wasn't reliable in our environment, we now use a regexp on instance for alert formatting (in most people's cases, the instance is based on the actual (external) DNS name).
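
As a rough illustration of a regexp on instance, the sketch below uses PromQL's `label_replace` to derive a label from `instance` rather than joining on `node_uname_info`. The `host` label name and the pattern are assumptions for illustration, and this is not necessarily how the formatting is done in this PR. The pattern assumes `instance` looks like `name` or `name:port`.

```yaml
# Hypothetical variant of the rule from the diff above.
- name: Host context switching
  description: Context switching is growing on the node (> 10000 / CPU / s)
  # label_replace(vector, dst_label, replacement, src_label, regex):
  # copies the part of "instance" before an optional ":port" into "host".
  query: 'label_replace((rate(node_context_switches_total[5m])) / (count without(cpu, mode) (node_cpu_seconds_total{mode="idle"})) > 10000, "host", "$1", "instance", "([^:]+)(:[0-9]+)?")'
```

Series whose `instance` does not match the pattern are returned unchanged, so no results are dropped the way they can be when the joined metric is missing.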

guruevi and others added 11 commits February 24, 2024 13:49

Add an option to run GitHub Action manually

6fe429e

Add an option to force running the action for testing purposes

fbca1a1

Set variables correctly

e3bc917

Set variables correctly

79960ae

Publish

59dc6dc

Clean up some more metrics

d6ef8e7

Publish

b660faf

Minor bug fixes

87ee129

Merge remote-tracking branch 'guruevi/master'

46b9ccf

Publish

45a711f

guruevi mentioned this pull request Feb 25, 2024

Rule "Host RAID array got inactive" has misleading description #395

Open

guruevi and others added 11 commits February 25, 2024 14:53

Removed queries that throw errors when systems are upgraded. Also fix…

4604336

…ed and simplified a few Postgres queries.

Publish

c026db7

Refined some more queries

224e6d0

Publish

7e0d009

Merge branch 'samber:master' into master

bfd04e6

PostgreSQL now has optimized autovacuum behavior

a68beee

Merge remote-tracking branch 'guruevi/master'

351e45c

Publish

8789b86

PostgreSQL now has optimized autovacuum behavior

c823aca

Publish

76a86c3

Merge branch 'samber:master' into master

0c2876e

samber reviewed Apr 12, 2024

.github/workflows/dist.yml

samber reviewed Apr 20, 2024

guruevi and others added 5 commits July 2, 2024 13:32

Merge remote-tracking branch 'samber/master'

51d0484

# Conflicts: # _data/rules.yml # dist/rules/host-and-hardware/node-exporter.yml

Publish

6e48cba

Query fails if instance names are not unique across jobs. This fixes it.

54e2b09

Merge remote-tracking branch 'origin/master'

84a9260

Publish

9766507

Merge branch 'master' into master

860055d
