Missing telemetry core.unsealed metrics on standby nodes #10015

Closed
exo-cedric opened this issue Sep 22, 2020 · 18 comments
Labels
bug Used to indicate a potential bug core/metric core/telemetry

Comments

@exo-cedric

Describe the bug

This is a follow-up to the slightly different issue #9771.

To Reproduce
Steps to reproduce the behavior:

  1. Set up a Raft cluster
  2. Query metrics on a standby (Follower) node
  3. Observe that the vault.core.unsealed metric is missing

Expected behavior
The vault.core.unsealed metric should be present, as it is on the active (Leader) node:

# wget -qO- http://127.0.0.1:18200/v1/sys/metrics | jq '.Gauges[] | select(.Name=="vault.core.unsealed")'
{
  "Name": "vault.core.unsealed",
  "Value": 1
}

The lack of the core.unsealed metric on an HA standby node is problematic, since it prevents monitoring the health of every HA node (and making sure HA is actually still available).
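
For monitoring scripts that cannot rely on jq, the same presence check can be sketched in Python. This is a minimal sketch that assumes the JSON shape shown above (a top-level `Gauges` list whose entries carry `Name` and `Value`); the helper name `find_gauge` and both sample payloads are hypothetical and only illustrate the difference reported between active and standby nodes.

```python
import json


def find_gauge(payload, name):
    """Return the value of gauge `name` from a parsed /v1/sys/metrics
    JSON document, or None when the gauge is absent."""
    for gauge in payload.get("Gauges", []):
        if gauge.get("Name") == name:
            return gauge.get("Value")
    return None


# Hypothetical payloads: the active node reports the gauge, the standby does not.
active_payload = json.loads(
    '{"Gauges": [{"Name": "vault.core.unsealed", "Value": 1}]}'
)
standby_payload = json.loads('{"Gauges": []}')

print(find_gauge(active_payload, "vault.core.unsealed"))
print(find_gauge(standby_payload, "vault.core.unsealed"))
```

Returning `None` rather than `0` keeps "metric missing" distinguishable from "node sealed", which is the distinction this issue is about.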

Environment:

  • Vault Server Version (retrieve with vault status): 1.5.0
  • Vault CLI Version (retrieve with vault version): 1.5.0
  • Server Operating System/Architecture: Linux x86_64

Vault server configuration file(s):

N/A

Additional context

Quickly going through core.go, core_metrics.go and ha.go, it seems to me that emitMetrics (which spawns the metrics loop that refreshes the core.unsealed metric) is only called via postUnseal, which is not called for an HA standby node (in core.go); only the Leader/Active node actually calls postUnseal (in ha.go).
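
The call chain described above can be illustrated with a toy model. This is not Vault's code (Vault is written in Go); the class `ToyCore` and its methods are hypothetical stand-ins that only mirror the suspected control flow, where the metric is published from a step that a standby never reaches.

```python
class ToyCore:
    """Hypothetical stand-in for a Vault core; not Vault's real code."""

    def __init__(self):
        self.gauges = {}

    def emit_metrics(self):
        # Stand-in for the loop that publishes core.unsealed.
        self.gauges["vault.core.unsealed"] = 1

    def post_unseal(self):
        # In the behaviour reported above, only the active node runs this step.
        self.emit_metrics()

    def unseal(self, becomes_active):
        if becomes_active:
            self.post_unseal()
        # A standby returns here without ever calling emit_metrics().


leader, follower = ToyCore(), ToyCore()
leader.unseal(becomes_active=True)
follower.unseal(becomes_active=False)

print(leader.gauges)
print(follower.gauges)
```

In this model the follower is unsealed but has an empty gauge set, which matches the symptom: the metric is absent rather than reported as 0.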

@alwaysastudent

Facing the same issue on 1.6.x versions as well. Is this something that will be fixed?

@ncabatoff ncabatoff changed the title Missing telemetry core.unsealed metrics in Raft stand-by nodes Missing telemetry core.unsealed metrics on standby nodes Jul 21, 2021
@KawaiDesu

Hello! Any update on this? This is blocking us from abandoning vault_exporter.

@hellstrikes13

This issue is also seen on Vault version 1.7.2; I'm using the statsite telemetry provider.

@heatherezell
Contributor

Wanting to chime in that we're still working on a resolution for this. Thanks for your patience!

@Bowser1704

@hsimon-hashicorp Hi, are there any updates on this? It's important to emit metrics on the standby node in HA mode.

@geekofalltrades

We just upgraded to Vault 1.11.3. We saw all Vault replicas export vault_core_unsealed for 12h (the value of our prometheus_retention_time), but without the cluster label. The leader also exported one with the cluster label. After 12 hours, the unlabeled ones disappeared.

I'm going to guess they just hadn't finished determining they were a cluster yet, and as soon as they went into HA standby mode, the standbys started hitting this bug and not reporting the metric.

@dguihal

dguihal commented Nov 24, 2022

Just my 2 cents: "vault.core.unsealed" is missing, but the very basic "vault.core.active" is also missing.
The issue is probably related to the whole vault.core telemetry namespace?

@laugmanuel

laugmanuel commented Dec 9, 2022

I see the same behaviour with Vault 1.12.1: the vault_core_active metric goes missing after some time.
We've used the absence of that metric to detect missing leaders and have been alerted by Prometheus many times in the past.

We have a 3 node Vault setup with Raft storage deployed in K8s.
I've queried the metrics endpoint from each pod and the metric is missing everywhere. Also, the vault-active service does not include the metric (as expected if it's missing on the pods themselves.)

@claviola

@hsimon-hashicorp Any updates on this issue? The lack of reliable core metrics makes it very difficult to properly monitor Vault using Prometheus.

@none0nfg

none0nfg commented Feb 8, 2023

Also faced this problem; it needs a resolution.

@konstantin-921

+1

@cadmuxe

cadmuxe commented May 18, 2023

Hello, Any update? This really makes the unsealed metric useless. Thanks.

@p-k-sharma

Any update on this? It's been more than 3 years... The issue is still open.

@ameflorenti

Do I understand this right? There is no way of knowing with Prometheus whether a VM in an HA cluster is sealed as long as some nodes are unsealed. Has anyone found a solution to this? I really do not want to wait until the whole Vault cluster is sealed before I get an alert. That defeats the purpose of an HA setup, where you can fix issues as they happen while keeping Vault unsealed. I just checked, and this seems to be the case for Vault Enterprise too.

@ameflorenti

ameflorenti commented Dec 6, 2023

WORKAROUND:
While trying to get label values for the cluster, I noticed that Vault does not return metrics for sealed nodes. I then named and organized the Prometheus jobs per cluster, as I did in Vault, as a way of getting the list of nodes in a cluster even when they are sealed.
Using this query I can "deduce" that the nodes in the HA cluster not returning metrics are SEALED or UNAVAILABLE to Vault:
count by (instance)(up{job="$cluster"}) unless on(instance) count by (instance)(vault_core_unsealed{job="$cluster"})

Does that make sense?
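
The `unless on(instance)` part of the query above behaves like a set difference keyed on the instance label. A minimal Python sketch of that logic follows; it assumes Prometheus records an `up` series for every configured scrape target, and the function name and instance addresses are hypothetical.

```python
def instances_missing_metric(up_instances, metric_instances):
    """Mimic `up unless on(instance) vault_core_unsealed`: keep the scrape
    targets for which no vault_core_unsealed series exists."""
    return sorted(set(up_instances) - set(metric_instances))


# Hypothetical cluster: three scrape targets, only one reports the metric.
targets = ["vault-0:8200", "vault-1:8200", "vault-2:8200"]
reporting = ["vault-0:8200"]

print(instances_missing_metric(targets, reporting))
```

Whatever is left after the difference is a node that Prometheus knows about but that exposes no unsealed gauge, so it is either sealed, unreachable, or affected by this bug.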

@LeoQuote

LeoQuote commented Dec 19, 2023

up{job="vault"} > 0 unless on(instance) vault_core_unsealed

for a warning alert (some Vault instances are sealed)

sum(vault_core_unsealed) < 1 or absent(vault_core_unsealed)

for a critical alert

I think this is a problem that needs to be solved, but currently I can only use this workaround.
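
The critical condition above fires in two cases: no instance reports an unsealed value of 1, or the metric is absent altogether. A minimal Python sketch of that decision, with a hypothetical function name and made-up sample values:

```python
def critical_alert(unsealed_values):
    """Mimic `sum(vault_core_unsealed) < 1 or absent(vault_core_unsealed)`.

    `unsealed_values` holds one sample per instance that reports the
    metric; an empty list means the metric is absent everywhere.
    """
    if not unsealed_values:
        return True  # absent(vault_core_unsealed)
    return sum(unsealed_values) < 1  # every reporting instance is sealed


print(critical_alert([]))         # metric missing on all instances
print(critical_alert([0, 0, 0]))  # all instances report sealed
print(critical_alert([1, 0]))     # at least one instance is unsealed
```

The `absent` branch matters because a `sum` over no series produces no result in PromQL, so the comparison alone would never fire when the metric disappears.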

@cascadia-sati

cascadia-sati commented Jan 9, 2024

Is this still an issue? I'm seeing vault_core_unsealed metrics even from standby nodes, but note that according to the docs, you need to enable unauthenticated access:

"The /v1/sys/metrics endpoint is only accessible on active nodes and automatically disabled on standby nodes. You can enable the /v1/sys/metrics endpoint on standby nodes by enabling unauthenticated metrics access."

This is on an HA setup in K8s with the Vault Helm chart v0.25.0 and Vault v1.14.0.

When all are sealed:

$ for POD in {0..2}; do echo -n "vault-$POD: "; k get pod vault-$POD -oyaml | grep vault-active || echo; done
vault-0:     vault-active: "false"
vault-1:     vault-active: "false"
vault-2:     vault-active: "false"

$ for POD in {0..2}; do echo -n "vault-$POD: "; k exec -it vault-$POD -- /bin/sh -c "wget -qO - localhost:8200/v1/sys/metrics?format=prometheus" | grep "^vault_core_unsealed" || echo; done
vault-0: vault_core_unsealed{cluster="pace-vault"} 0
vault-1: vault_core_unsealed{cluster="pace-vault"} 0
vault-2: vault_core_unsealed{cluster="pace-vault"} 0

When all are unsealed:

$ for POD in {0..2}; do echo -n "vault-$POD: "; k get pod vault-$POD -oyaml | grep vault-active || echo; done
vault-0:     vault-active: "true"
vault-1:     vault-active: "false"
vault-2:     vault-active: "false"

$ for POD in {0..2}; do echo -n "vault-$POD: "; k exec -it vault-$POD -- /bin/sh -c "wget -qO - localhost:8200/v1/sys/metrics?format=prometheus" | grep "^vault_core_unsealed" || echo; done
vault-0: vault_core_unsealed{cluster="pace-vault"} 1
vault-1: vault_core_unsealed{cluster="pace-vault"} 1
vault-2: vault_core_unsealed{cluster="pace-vault"} 1
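
The `grep` step in the loops above can also be done programmatically. The sketch below parses Prometheus text-format output and returns the unsealed value; the regular expression is a simplified assumption that covers sample lines like the ones shown here, not the full exposition format, and the function name `read_unsealed` is hypothetical.

```python
import re

# Simplified sample-line pattern: metric name, optional {labels}, then a value.
SAMPLE_LINE = re.compile(
    r"^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>[^}]*)\})?\s+(?P<value>\S+)"
)


def read_unsealed(metrics_text):
    """Return the vault_core_unsealed value as a float, or None if absent."""
    for line in metrics_text.splitlines():
        match = SAMPLE_LINE.match(line)
        if match and match.group("name") == "vault_core_unsealed":
            return float(match.group("value"))
    return None


sample = '# TYPE vault_core_unsealed gauge\nvault_core_unsealed{cluster="pace-vault"} 1\n'
print(read_unsealed(sample))
print(read_unsealed("vault_core_active 1\n"))
```

Comment lines starting with `#` never match, because `#` is not a valid first character of a metric name.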

Make sure to set the cluster_name field in the config to avoid duplicate metrics: #11988

I ran into another issue specific to the Vault Helm chart that caused metrics to disappear when all Vault pods are sealed, which we had to work around: hashicorp/vault-helm#990

And I'm running into another problem, which I'm about to file an issue for: the vault_core_unsealed metric disappears after the prometheus_retention_time elapses, apparently because Vault doesn't periodically refresh the metric when it doesn't change.
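
The retention effect described above can be modelled with a toy gauge store that forgets any gauge not written within the retention window. This is an illustrative model only, not the implementation used by Vault or its Prometheus sink; the class name, method names, and timestamps are hypothetical.

```python
class ExpiringGaugeStore:
    """Toy model: a gauge is dropped unless it was set within the retention window."""

    def __init__(self, retention_seconds):
        self.retention = retention_seconds
        self._gauges = {}  # name -> (value, time of last write)

    def set_gauge(self, name, value, now):
        self._gauges[name] = (value, now)

    def collect(self, now):
        # Discard gauges whose last write is older than the retention window.
        self._gauges = {
            name: (value, written)
            for name, (value, written) in self._gauges.items()
            if now - written <= self.retention
        }
        return {name: value for name, (value, _) in self._gauges.items()}


store = ExpiringGaugeStore(retention_seconds=12 * 3600)
store.set_gauge("vault_core_unsealed", 1, now=0)  # written once, never refreshed

early = store.collect(now=1 * 3600)
late = store.collect(now=13 * 3600)
print(early, late)
```

In this model the gauge is still visible one hour after the write but gone after thirteen, even though the underlying state never changed. A periodic `set_gauge` call would keep it alive.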

@banks
Member

banks commented Jul 24, 2024

Hi folks. Just checking through older bugs. As @cascadia-sati mentioned, it would seem that this is not an issue any more. Does anyone on this thread still see this problem?

As far as I can see this was actually fixed by #12166 a couple of years ago - I've looked through the code and can confirm that now runStandby calls metricsLoop which is what outputs this.

Closing for now, please let us know if someone is still seeing this on a version of Vault after 1.13.0.

@banks banks closed this as completed Jul 24, 2024