
Missing core leadership metrics #11732

Closed
agaudreault opened this issue May 31, 2021 · 6 comments · Fixed by #27966
Labels
bug, good-first-issue

Comments

@agaudreault

agaudreault commented May 31, 2021

Describe the bug
Some metrics documented at https://www.vaultproject.io/docs/internals/telemetry seem to be missing from Vault. Without them, the recommended alerts cannot be built in some alerting tools.

  • vault.core.leadership_setup_failed
  • vault.core.leadership_lost
  • vault.core.step_down

To Reproduce
Steps to reproduce the behavior:

  1. Run curl -H "X-Vault-Request: true" -H "X-Vault-Token: $(vault print token)" -H "X-Vault-No-Request-Forwarding: true" $VAULT_ADDR/v1/sys/metrics\?format=prometheus (repeat until the request reaches the HA active node and returns a result)
  2. Check the result for vault_core_leadership_lost and the other metrics
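
Step 2 can be automated. The following is a minimal sketch, assuming the Prometheus-format response from step 1 has been saved as text. The helper names `prometheus_name` and `missing_metrics` are hypothetical, and the name mapping (every character outside `[a-zA-Z0-9_:]` becomes `_`) is an assumption based on vault.core.leadership_lost showing up as vault_core_leadership_lost.

```python
import re

# Dotted metric names from the telemetry documentation (listed in this issue).
EXPECTED = [
    "vault.core.leadership_setup_failed",
    "vault.core.leadership_lost",
    "vault.core.step_down",
]


def prometheus_name(dotted: str) -> str:
    # Assumption: characters that are not valid in a Prometheus metric
    # name are replaced with "_".
    return re.sub(r"[^a-zA-Z0-9_:]", "_", dotted)


def missing_metrics(prometheus_text: str, expected=EXPECTED) -> list:
    """Return the expected metrics that have no sample line in the scrape."""
    seen = set()
    for line in prometheus_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        # The metric name ends at the first "{" (labels) or whitespace (value).
        seen.add(re.split(r"[{\s]", line, maxsplit=1)[0])

    missing = []
    for dotted in expected:
        base = prometheus_name(dotted)
        # Summary metrics also expose <name>_sum and <name>_count series.
        if not seen & {base, base + "_sum", base + "_count"}:
            missing.append(dotted)
    return missing
```

Calling `missing_metrics(text)` on the saved response returns the dotted names that never appear, which makes it easy to compare nodes or Vault versions.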

Expected behavior
Metrics should be present.

Environment:

  • Vault Server Version (retrieve with vault status): 1.6.3
  • HA Enabled: true

Additional context
I also noticed that metrics for the seal operations are missing, which might be because we use awskms auto-unseal.

  • vault.core.post_unseal
  • vault.core.pre_seal
  • vault.core.seal-with-request
  • vault.core.seal
  • vault.core.seal-internal
  • vault.core.unseal

I don't know whether this is a bug or a missing configuration. I can provide more information if necessary.

vishalnayak added the bug and good-first-issue labels Jun 2, 2021
@AkashSirimanna

Any word on this?

@agaudreault
Author

I just tested and I still have the same metrics missing in 1.9.0.

@divyaac
Contributor

divyaac commented Sep 14, 2022

@agaudreault-jive Just wanted to confirm: this happens right after starting up a cluster, correct? As in, we're not asking the nodes to perform any actions after they've been spun up?

@divyaac
Contributor

divyaac commented Sep 16, 2022

In relation to these metrics:

* vault.core.leadership_setup_failed
* vault.core.leadership_lost
* vault.core.step_down

These metrics are only emitted when there is a change in leadership.

For example, an action like
vault operator step-down
(which forces the current leader to resign leadership) would cause the "step_down" metric to be emitted.

If we try to make the current node the leader but the attempt fails, or the node remains sealed, then metrics such as "leadership_setup_failed" and "leadership_lost" would also be emitted. If there is no error while switching leadership, these metrics will not be emitted.
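
To illustrate why a scrape taken right after startup would not contain these names, here is a toy model in which a series only exists once a sample has been recorded. It is not Vault's code; the `LazySink` class and the sample values are hypothetical.

```python
class LazySink:
    """Toy metrics sink: a series exists only after its first sample."""

    def __init__(self):
        self.samples = {}

    def measure(self, name: str, millis: float) -> None:
        # The series is created on first use, not registered up front.
        self.samples.setdefault(name, []).append(millis)

    def scrape(self) -> list:
        return sorted(self.samples)


sink = LazySink()
sink.measure("vault.core.unseal", 12.5)    # recorded during startup
before = sink.scrape()

sink.measure("vault.core.step_down", 3.0)  # recorded only after a step-down
after = sink.scrape()
```

In this model `before` lacks vault.core.step_down entirely, so an alert written against that name has nothing to evaluate until the first leadership change.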

In relation to the other metrics:

* vault.core.post_unseal
* vault.core.pre_seal
* vault.core.seal-with-request
* vault.core.seal
* vault.core.seal-internal
* vault.core.unseal

These metrics should be emitted regardless of whether auto-unseal is used. However, it's possible that the telemetry metrics aren't being retained for long enough. Try changing prometheus_retention_time in the telemetry stanza to a larger value (maybe 5m); this will probably show the results.
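
The retention effect can be pictured with a toy model, assuming that a metric which has not been updated within the retention window is dropped from the scrape. The `ExpiringSink` class is hypothetical and takes explicit timestamps (in seconds) so the behavior is easy to follow; it is not Vault's implementation.

```python
class ExpiringSink:
    """Toy sink that forgets metrics not updated within the retention window."""

    def __init__(self, retention_seconds: float):
        self.retention = retention_seconds
        self.last_update = {}

    def measure(self, name: str, now: float) -> None:
        # Remember when each metric was last emitted.
        self.last_update[name] = now

    def scrape(self, now: float) -> list:
        # Only metrics updated within the retention window are reported.
        return sorted(
            name
            for name, updated in self.last_update.items()
            if now - updated <= self.retention
        )


short = ExpiringSink(retention_seconds=60)
longer = ExpiringSink(retention_seconds=300)
for sink in (short, longer):
    sink.measure("vault.core.unseal", now=0)  # emitted once, at unseal time
```

Scraping both sinks at `now=120` shows the difference: the one-off unseal metric is gone from `short` but still present in `longer`.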

@divyaac
Contributor

divyaac commented Sep 16, 2022

This comment should help resolve the issue! We'll close the issue out after some time with no response from you. Please feel free to re-open!

@claviola

@divyaac On standby nodes in a cluster, I see none of the vault.core.* metrics. I'm using the default retention time of 24h. What could I do to solve this?

This may be related to #10015 by the way.

banks added a commit that referenced this issue Aug 5, 2024
banks added a commit that referenced this issue Aug 30, 2024
* Register ha timing metrics. Fixes #11732

* Add CHANGELOG

* Fix copywrite headers

* Relicence SDK files after move

* Update vault/ha.go

6 participants