Add alert if an OpenSearch scrape fails (#507)
If a scrape fails, this might indicate that a unit is not in a healthy
state.

OpenSearch currently has no metric that reports a node as down. For
example, if the systemd service is stopped on one node, the cluster (N
nodes) drops the faulty node because of connectivity issues, and the
metrics show that the cluster now has N-1 nodes without saying that
one node has failed.

With this new alert, at least a notification will appear if one node
stops being responsive.
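
To illustrate the gap described above, here is a minimal Python sketch with made-up sample values. The unit names and the `cluster_node_count` variable are hypothetical; only the `up` metric and the `up < 1` expression come from the rule added in this commit.

```python
# Hypothetical scrape results: one `up` sample per scraped unit
# (1 = scrape succeeded, 0 = scrape failed).
up_samples = {
    "opensearch/0": 1,
    "opensearch/1": 0,  # daemon stopped on this unit
    "opensearch/2": 1,
}

# A cluster-level node count (modelled here as the number of healthy units)
# only says how many nodes remain, not which one is missing.
cluster_node_count = sum(up_samples.values())

# Units matching the alert expression `up < 1` are named explicitly.
failed_units = sorted(unit for unit, up in up_samples.items() if up < 1)

print(cluster_node_count)  # 2
print(failed_units)        # ['opensearch/1']
```

The count alone tells an operator that something is gone; the per-target `up` series tells them which unit to look at.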


How to test:

- Deploy opensearch units
- Stop the opensearch daemon in one of the units

The grafana-agent injects the Juju topology into the alert rule, so the
expression `up < 1` will match only OpenSearch apps:


![image](https://github.com/user-attachments/assets/d09b22be-a571-4ec2-b76d-a654186df327)

The alert will trigger:

![image](https://github.com/user-attachments/assets/a1ace958-1188-4b2b-840c-febfd639dd57)
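
The timing of the alert can be sketched in Python as follows. This is a simplified model, not Prometheus's implementation: it assumes the rule is evaluated once per sample at a fixed interval and ignores the topology labels. The function name `alert_states` and the sample values are hypothetical; the 5-minute duration matches the `"for": "5m"` field of the rule.

```python
from datetime import timedelta

FOR_DURATION = timedelta(minutes=5)  # matches "for": "5m" in the rule


def alert_states(samples, interval, for_duration=FOR_DURATION):
    """Return 'inactive', 'pending' or 'firing' for each `up` sample."""
    states = []
    failing_for = None  # how long `up < 1` has held without interruption
    for up in samples:
        if up < 1:
            if failing_for is None:
                failing_for = timedelta(0)  # condition just became true
            else:
                failing_for += interval
            states.append("firing" if failing_for >= for_duration else "pending")
        else:
            failing_for = None  # a successful scrape resets the timer
            states.append("inactive")
    return states


# One sample per minute: the daemon is stopped after the first sample
# and restarted before the last one.
states = alert_states([1, 0, 0, 0, 0, 0, 0, 1], timedelta(minutes=1))
print(states)
```

Under these assumptions the alert stays pending for the first five failed evaluations and fires on the sixth, so a short restart of the daemon would not page anyone.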
gabrielcocenza authored Nov 27, 2024
1 parent 1fa589c commit e225961
Showing 1 changed file with 11 additions and 0 deletions.
11 changes: 11 additions & 0 deletions src/alert_rules/prometheus/prometheus_alerts.yaml
@@ -1,6 +1,17 @@
"groups":
- "name": "opensearch.alerts"
  "rules":

  - "alert": "OpenSearchScrapeFailed"
    "annotations":
      "message": "Scrape on {{ $labels.juju_unit }} failed. Ensure that the OpenSearch systemd service is healthy and that the unit is part of the cluster."
      "summary": "OpenSearch exporter scrape failed"
    "expr": |
      up < 1
    "for": "5m"
    "labels":
      "severity": "critical"

  - "alert": "OpenSearchClusterRed"
    "annotations":
      "message": "Cluster {{ $labels.cluster }} health status has been RED for at least 2m. Cluster does not accept writes, shards may be missing or master node hasn't been elected yet."