Add alert if an OpenSearch scrape fails (#507)

If a scrape fails, the unit may not be in a healthy state. OpenSearch currently does not expose a metric that reports a node as down. For example, if the systemd service is stopped on one node, the cluster (N nodes) drops the faulty node because of connectivity issues, and the metrics show that the cluster now has N-1 nodes without indicating that a node has failed.

With this new alert, a notification is at least raised when a node stops being responsive.

How to test:
- Deploy opensearch units
- Stop the opensearch daemon in one of the units

The grafana-agent injects the juju topology into the alert rule, so the expression `up < 1` matches only OpenSearch apps:

![image](https://github.com/user-attachments/assets/d09b22be-a571-4ec2-b76d-a654186df327)

The alert will trigger:

![image](https://github.com/user-attachments/assets/a1ace958-1188-4b2b-840c-febfd639dd57)
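In addition to the manual test, the rule could probably be checked offline with a `promtool` rule unit test. The following is a minimal sketch, not part of this commit: the test file name `test_scrape_failed.yaml`, the relative path to the rule file, and the `juju_unit` value `opensearch/0` are all assumptions made for illustration, and it presumes the rule file loads standalone in `promtool` without the topology injection that grafana-agent performs at deploy time.

```yaml
# test_scrape_failed.yaml (hypothetical file name, not in the repository)
rule_files:
  # Assumed path, relative to this test file.
  - prometheus_alerts.yaml

evaluation_interval: 1m

tests:
  - interval: 1m
    input_series:
      # Simulate a unit whose scrape keeps failing: up stays at 0.
      # The juju_unit value is made up for this example.
      - series: 'up{juju_unit="opensearch/0"}'
        values: '0+0x10'
    alert_rule_test:
      # The rule has "for: 5m", so it should be firing by minute 6.
      - eval_time: 6m
        alertname: OpenSearchScrapeFailed
        exp_alerts:
          - exp_labels:
              severity: critical
              juju_unit: opensearch/0
            exp_annotations:
              message: "Scrape on opensearch/0 failed. Ensure that the OpenSearch systemd service is healthy and that the unit is part of the cluster."
              summary: "OpenSearch exporter scrape failed"
```

Under those assumptions, it would be run with `promtool test rules test_scrape_failed.yaml`. The test only exercises the bare `up < 1` expression; it does not verify that the injected juju topology restricts the match to OpenSearch apps.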
canonical · Nov 27, 2024 · e225961
1 parent 1fa589c
commit e225961
Showing 1 changed file with 11 additions and 0 deletions.
diff --git a/src/alert_rules/prometheus/prometheus_alerts.yaml b/src/alert_rules/prometheus/prometheus_alerts.yaml
@@ -1,6 +1,17 @@
 "groups":
 - "name": "opensearch.alerts"
   "rules":
+
+  - "alert": "OpenSearchScrapeFailed"
+    "annotations":
+      "message": "Scrape on {{ $labels.juju_unit }} failed. Ensure that the OpenSearch systemd service is healthy and that the unit is part of the cluster."
+      "summary": "OpenSearch exporter scrape failed"
+    "expr": |
+      up < 1
+    "for": "5m"
+    "labels":
+      "severity": "critical"
+
   - "alert": "OpenSearchClusterRed"
     "annotations":
       "message": "Cluster {{ $labels.cluster }} health status has been RED for at least 2m. Cluster does not accept writes, shards may be missing or master node hasn't been elected yet."