
Monitoring Vault with Prometheus

A Vault node exposes telemetry information that can be used to monitor and alert on the health and performance of a Vault cluster.

How the Vault Metrics are Exposed

By default, the Vault operator configures each Vault pod to publish statsd metrics.

The Vault operator runs a statsd-exporter container inside each Vault pod to convert those metrics and expose them in the Prometheus format.
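As a rough guide to how metric names change in this conversion, the sketch below approximates statsd-exporter's default translation when no mapping rule applies: characters that are not valid in a Prometheus metric name are replaced with underscores. The function name `statsd_to_prom_name` is hypothetical, and the exporter configuration used by the operator may apply additional mappings.

```shell
# statsd_to_prom_name NAME: approximate the exporter's default name translation
# (an assumption for illustration, not the exporter's actual code path).
statsd_to_prom_name() {
  # Replace every character outside [A-Za-z0-9_] with an underscore.
  printf '%s\n' "$1" | sed 's/[^A-Za-z0-9_]/_/g'
}

statsd_to_prom_name vault.core.unseal
# prints vault_core_unseal
```

This helps when looking up a Prometheus series in Vault's telemetry documentation, where metrics are listed under their dotted names.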

To get the Prometheus metrics, curl the /metrics endpoint on port 9102 of any Vault pod:

$ VPOD=$(kubectl -n default get vault example -o jsonpath='{.status.vaultStatus.active}')
$ kubectl -n default exec -ti ${VPOD} --container=vault -- curl localhost:9102/metrics
# HELP vault_core_unseal Metric autogenerated by statsd_exporter.
# TYPE vault_core_unseal summary
vault_core_unseal{quantile="0.5"} NaN
vault_core_unseal{quantile="0.9"} NaN
vault_core_unseal{quantile="0.99"} NaN
vault_core_unseal_sum 2.077112
vault_core_unseal_count 1
. . .
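Because timers are published as summaries, the `_sum` and `_count` series can be combined to get the mean observed value. The following is a minimal sketch, assuming the metrics text arrives on stdin and the series carry no labels, as in the output above. `summary_mean` is a hypothetical helper name.

```shell
# summary_mean NAME: read Prometheus text-format metrics on stdin and print
# NAME_sum / NAME_count, or NaN when no observations were recorded.
summary_mean() {
  awk -v name="$1" '
    $1 == (name "_sum")   { sum = $2 }
    $1 == (name "_count") { count = $2 }
    END {
      if (count > 0) printf "%.6f\n", sum / count
      else print "NaN"
    }
  '
}

# Sample input copied from the output above.
printf '%s\n' \
  'vault_core_unseal_sum 2.077112' \
  'vault_core_unseal_count 1' | summary_mean vault_core_unseal
# prints 2.077112
```

Against a live pod, the output of the curl command above can be piped into the function instead. In that case it may be safer to drop the `-t` flag from `kubectl exec` so that no terminal line endings end up in the parsed text.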

Consuming the Metrics

The Vault operator also creates a service with the same name as the Vault cluster. This service exposes the /metrics endpoint of the Vault nodes via the prometheus port. For a Vault cluster named example, the following service exists:

$ kubectl -n default get service example -o yaml
apiVersion: v1
kind: Service
metadata:
  labels:
    app: vault
    vault_cluster: example
  name: example
  namespace: default
  ...
spec:
  ports:
  - name: vault-client
    port: 8200
    protocol: TCP
    targetPort: 8200
  - name: vault-cluster
    port: 8201
    protocol: TCP
    targetPort: 8201
  - name: prometheus
    port: 9102
    protocol: TCP
    targetPort: 9102
  selector:
    app: vault
    vault_cluster: example
  type: ClusterIP
  ...

The above service can be scraped to consume the Prometheus metrics for the Vault cluster.
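For a quick manual check before wiring up Prometheus, the service can be queried directly. The sketch below builds the in-cluster URL from the service name and namespace, assuming the standard Kubernetes service DNS form and plain HTTP on port 9102. `vault_metrics_url` is a hypothetical helper name.

```shell
# vault_metrics_url CLUSTER [NAMESPACE]: print the in-cluster URL of the
# metrics endpoint exposed by the Vault service (plain HTTP is an assumption).
vault_metrics_url() {
  cluster="$1"
  namespace="${2:-default}"
  printf 'http://%s.%s.svc:9102/metrics\n' "${cluster}" "${namespace}"
}

vault_metrics_url example default
# prints http://example.default.svc:9102/metrics
```

From a pod inside the cluster that has curl installed, `curl -s "$(vault_metrics_url example default)"` should return the same text as the per-pod endpoint. From a workstation, `kubectl -n default port-forward service/example 9102:9102` followed by `curl localhost:9102/metrics` is an alternative.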

Consult the Prometheus operator docs on how to set up and configure Prometheus with a ServiceMonitor to consume the metrics for a target service.

A ServiceMonitor with the following spec can be created to describe the above Vault service as a target for Prometheus:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  ...
spec:
  selector:
    matchLabels:
      app: vault
      vault_cluster: example
  namespaceSelector:
    matchNames:
      - default
  endpoints:
    - interval: 30s
      path: /metrics
      port: prometheus

Alerting Rules

The following sample alert rules cover some key metrics and are provided as a guide to alerting on Vault metrics.

The sample alert rules assume Prometheus is configured to monitor a Vault service named example.

alert: VaultLeadershipLoss
expr: sum(increase(vault_core_leadership_lost_count{job="example"}[1h])) > 5
for: 1m
labels:
  severity: critical
annotations:
  summary: High frequency of Vault leadership losses
  description: There have been more than 5 Vault leadership losses in the past 1h

alert: VaultLeadershipStepDowns
expr: sum(increase(vault_core_step_down_count{job="example"}[1h])) > 5
for: 1m
labels:
  severity: critical
annotations:
  summary: High frequency of Vault leadership step downs
  description: There have been more than 5 Vault leadership step downs in the past 1h

alert: VaultLeadershipSetupFailures
expr: sum(increase(vault_core_leadership_setup_failed_count{job="example"}[1h])) > 5
for: 1m
labels:
  severity: critical
annotations:
  summary: High frequency of Vault leadership setup failures
  description: There have been more than 5 Vault leadership setup failures in the past 1h

Tune the queries, thresholds, and durations of these alert rules for your particular use case. See the Prometheus documentation on queries and alerting rules to learn how to write rules of your own.
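Each sample rule above is a single rule entry. In a Prometheus rule file, such entries sit under a named group. The sketch below writes one of the sample rules into a rule file and validates it with `promtool check rules` when promtool is installed. The file name and the group name `vault.rules` are hypothetical.

```shell
# Hypothetical file name; point this at wherever your Prometheus loads rules from.
RULES_FILE="${RULES_FILE:-vault-alerts.rules.yml}"

# Wrap a sample rule in the groups/rules structure that rule files require.
cat > "${RULES_FILE}" <<'EOF'
groups:
  - name: vault.rules   # hypothetical group name
    rules:
      - alert: VaultLeadershipLoss
        expr: sum(increase(vault_core_leadership_lost_count{job="example"}[1h])) > 5
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: High frequency of Vault leadership losses
          description: There have been more than 5 Vault leadership losses in the past 1h
EOF

# Validate the file if promtool (shipped with Prometheus) is on the PATH.
if command -v promtool >/dev/null 2>&1; then
  promtool check rules "${RULES_FILE}"
else
  echo "promtool not found; skipping validation of ${RULES_FILE}"
fi
```

When Prometheus is managed by the Prometheus operator, the same group would typically go under `spec.groups` of a PrometheusRule resource rather than into a file on disk.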