Adds prometheus metrics #37

Open: wants to merge 1 commit into base: master

anguswilliams

Adds an endpoint for Prometheus metrics and instruments the snapshot function.

This allows for alerts to be created when snapshots fail or have not been taken for some time.

@anguswilliams
Author

anguswilliams commented Oct 16, 2024

Hey @Argelbargel, I've added some Prometheus metrics because I like to be alerted when backups fail. They let me generate alerts via Alertmanager, e.g.

vrsa_last_snapshot_success != 1

I will also add a ServiceMonitor and the additional ports to your Helm chart if you're happy with this.

@Argelbargel
Owner

Hi @anguswilliams,

Sorry, I have not had time to review your pull request yet. I'll try to find some time for it this week.

@erlisb

erlisb commented Dec 10, 2024

Hi @anguswilliams, thanks a lot for this PR.

I tried to test it locally:

erlisb ✗ go build -o vault-raft-snapshot-agent cmd/vault-raft-snapshot-agent/main.go

Using the following config.yaml file:

vault:
  nodes:
    urls:
      - https://127.0.0.1:8200
  auth:
    token: hvs.mP0IOUkQ6SzOXpEehQjGc3Di
snapshots:
  frequency: "2m"
  retain: 10
  storages:
    local:
      path: ./snapshots
erlisb ✗ ./vault-raft-snapshot-agent --config ./config.yaml --log-level="info" --metrics-port=10050                    
2024/12/10 23:56:18 INFO Using configuration from /home/erlisb/hashicorp-vault/vault-raft-snapshot-agent/config.yaml...
2024/12/10 23:56:19 INFO (re-)connected to leader node=https://127.0.0.1:8200
2024/12/10 23:56:19 INFO Successfully uploaded snapshot to all scheduled destinations nextSnapshot=2024-12-10T23:58:18.879+01:00

The feature seems to work as expected:

erlisb ✗ curl -k http://127.0.0.1:10050/metrics | grep vrsa
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  9285    0  9285    0     0  1511k      0 --:--:-- --:--:-- --:--:-- 1511k
# HELP vrsa_last_snapshot_size Size of the last snapshot in bytes
# TYPE vrsa_last_snapshot_size gauge
vrsa_last_snapshot_size 17665
# HELP vrsa_last_snapshot_success Returns 1 if the last snapshot was successful and 0 if not
# TYPE vrsa_last_snapshot_success gauge
vrsa_last_snapshot_success 1
# HELP vrsa_last_snapshot_time Unix timestamp of the last snapshot time
# TYPE vrsa_last_snapshot_time gauge
vrsa_last_snapshot_time 1.733871595e+09
# HELP vrsa_next_snapshot_time Unix timestamp of the next scheduled snapshot time
# TYPE vrsa_next_snapshot_time gauge
vrsa_next_snapshot_time 1.733871715e+09
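
The timestamp gauges are printed in scientific notation, which is hard to read at a glance. The following sketch parses unlabeled samples from output like the above and converts the two timestamps back into times. `parseGauges` is a hypothetical helper written for this illustration and handles only the simple `name value` lines shown here, not the full exposition format.

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
	"time"
)

// parseGauges reads unlabeled "name value" samples from Prometheus text output.
// Comment lines (# HELP, # TYPE) and labeled series are skipped in this sketch.
func parseGauges(text string) map[string]float64 {
	out := make(map[string]float64)
	sc := bufio.NewScanner(strings.NewReader(text))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if line == "" || strings.HasPrefix(line, "#") {
			continue
		}
		fields := strings.Fields(line)
		if len(fields) != 2 || strings.Contains(fields[0], "{") {
			continue
		}
		v, err := strconv.ParseFloat(fields[1], 64)
		if err != nil {
			continue
		}
		out[fields[0]] = v
	}
	return out
}

func main() {
	sample := `# TYPE vrsa_last_snapshot_time gauge
vrsa_last_snapshot_time 1.733871595e+09
# TYPE vrsa_next_snapshot_time gauge
vrsa_next_snapshot_time 1.733871715e+09
`
	g := parseGauges(sample)
	last := time.Unix(int64(g["vrsa_last_snapshot_time"]), 0).UTC()
	next := time.Unix(int64(g["vrsa_next_snapshot_time"]), 0).UTC()
	fmt.Println("last:", last.Format(time.RFC3339))
	fmt.Println("next:", next.Format(time.RFC3339))
	fmt.Println("interval:", next.Sub(last))
}
```

For the values above, the two timestamps are 120 seconds apart, which matches the `frequency: "2m"` setting in the config.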

Could you also add some documentation on how to access the metrics and what each metric means?
That would be really helpful for end users.

@Argelbargel, could you also have a look so that we can maybe get it merged?

Thank you very much.

@erlisb

erlisb commented Dec 17, 2024

Hi @Argelbargel, another gentle reminder, as we depend on this feature to some extent.
Thank you.

3 participants