Missing telemetry core.unsealed metrics on standby nodes #10015

exo-cedric · 2020-09-22T15:13:05Z

Describe the bug

This is a follow-up to the slightly different #9771.

To Reproduce
Steps to reproduce the behavior:

1. Set up a Raft cluster
2. Query metrics on a standby (Follower) node
3. The vault.core.unsealed metric is missing

Expected behavior
The vault.core.unsealed metric should be present, as it is on the active (Leader) node:

# wget -qO- http://127.0.0.1:18200/v1/sys/metrics | jq '.Gauges[] | select(.Name=="vault.core.unsealed")'
{
  "Name": "vault.core.unsealed",
  "Value": 1
}

The lack of the core.unsealed metric on an HA standby node is problematic, since it prevents monitoring the health of all HA nodes (and making sure HA is actually still available).
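The check done with jq above can also be scripted. The following is a minimal Go sketch, assuming the JSON response has a top-level Gauges array whose entries carry Name and Value (as the jq filter implies). The findGauge helper and the sample payloads are hypothetical, made up for illustration, and are not part of Vault.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// gauge mirrors the two fields the jq filter reads from each "Gauges" entry.
type gauge struct {
	Name  string  `json:"Name"`
	Value float64 `json:"Value"`
}

// metricsPayload models only the part of the response used here (assumption).
type metricsPayload struct {
	Gauges []gauge `json:"Gauges"`
}

// findGauge returns the value of the named gauge and whether it was present.
func findGauge(body []byte, name string) (float64, bool, error) {
	var p metricsPayload
	if err := json.Unmarshal(body, &p); err != nil {
		return 0, false, err
	}
	for _, g := range p.Gauges {
		if g.Name == name {
			return g.Value, true, nil
		}
	}
	return 0, false, nil
}

func main() {
	// Hypothetical payloads: one node reports the gauge, the other omits it.
	active := []byte(`{"Gauges":[{"Name":"vault.core.unsealed","Value":1}]}`)
	standby := []byte(`{"Gauges":[]}`)

	v, ok, err := findGauge(active, "vault.core.unsealed")
	fmt.Println("active:", v, ok, err)
	v, ok, err = findGauge(standby, "vault.core.unsealed")
	fmt.Println("standby:", v, ok, err)
}
```

Returning a separate "present" flag keeps a missing gauge distinguishable from a gauge whose value is 0, which is the distinction this report is about.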

Environment:

Vault Server Version (retrieve with vault status): 1.5.0
Vault CLI Version (retrieve with vault version): 1.5.0
Server Operating System/Architecture: Linux x86_64

Vault server configuration file(s):

N/A

Additional context

From a quick read through core.go, core_metrics.go and ha.go, it seems to me that emitMetrics (which spawns the metrics loop that refreshes the core.unsealed metric) is only called via postUnseal, which is not called for an HA standby node (in core.go); only the Leader/Active node actually calls postUnseal (in ha.go).
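To make the suspected call path concrete, here is a toy Go model of it. This is only a sketch of the control flow described above, not Vault's actual code: the node type, its fields and newNode are invented for illustration, and emitMetrics and postUnseal merely borrow the names from the report.

```go
package main

import "fmt"

// node is a toy stand-in for a Vault core (hypothetical, not Vault's type).
type node struct {
	active bool
	gauges map[string]float64
}

func newNode(active bool) *node {
	return &node{active: active, gauges: map[string]float64{}}
}

// emitMetrics stands in for the function named in the report; the real one
// is said to start a refresh loop, here it just records the gauge once.
func (n *node) emitMetrics() {
	n.gauges["vault.core.unsealed"] = 1
}

// postUnseal models the setup step that is the only caller of emitMetrics.
func (n *node) postUnseal() {
	n.emitMetrics()
}

// unseal models the reported behaviour: only the active node reaches
// postUnseal, so a standby never starts emitting the gauge.
func (n *node) unseal() {
	if n.active {
		n.postUnseal()
	}
}

func main() {
	leader, follower := newNode(true), newNode(false)
	leader.unseal()
	follower.unseal()

	_, onLeader := leader.gauges["vault.core.unsealed"]
	_, onFollower := follower.gauges["vault.core.unsealed"]
	fmt.Println(onLeader, onFollower) // prints "true false"
}
```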


alwaysastudent · 2021-05-08T00:34:46Z

Facing the same issue on 1.6.x versions as well. Is this something that will be fixed?

KawaiDesu · 2021-10-01T15:41:32Z

Hello! Any update on this? This is blocking us from abandoning vault_exporter.

hellstrikes13 · 2021-10-12T08:13:35Z

This issue is also seen on Vault version 1.7.2; I'm using the statsite telemetry provider.

heatherezell · 2021-12-15T20:25:52Z

Wanting to chime in that we're still working on a resolution for this. Thanks for your patience!

Bowser1704 · 2022-04-14T03:16:54Z

@hsimon-hashicorp Hi, are there any updates on this? It's important to emit metrics on the standby node in HA mode.

geekofalltrades · 2022-09-26T16:29:38Z

We just upgraded to Vault 1.11.3. We saw all Vault replicas export vault_core_unsealed for 12h (the value of our prometheus_retention_time), but without the cluster label. The leader also exported one with the cluster label. After 12 hours, the unlabeled ones disappeared.

I'm going to guess they just hadn't finished determining they were a cluster yet, and as soon as they went into HA standby mode, the standbys started hitting this bug and not reporting the metric.

dguihal · 2022-11-24T13:09:11Z

Just my 2 cents: "vault.core.unsealed" is missing, but the very basic "vault.core.active" is also missing.
Perhaps the issue affects the whole vault.core telemetry namespace?

laugmanuel · 2022-12-09T13:12:01Z

I see the same behaviour with Vault 1.12.1 and missing vault_core_active metric after some time.
We've used the absence of that metric to determine missing leaders and got alerted by prometheus many times in the past.

We have a 3 node Vault setup with Raft storage deployed in K8s.
I've queried the metrics endpoint from each pod and the metric is missing everywhere. Also, the vault-active service does not include the metric (as expected if it's missing on the pods themselves.)

claviola · 2023-01-24T16:41:48Z

@hsimon-hashicorp Any updates on this issue? The lack of reliable core metrics makes it very difficult to properly monitor Vault using Prometheus.

none0nfg · 2023-02-08T01:02:18Z

I also faced this problem; it also needs a resolution.

konstantin-921 · 2023-02-09T11:51:27Z

+1

cadmuxe · 2023-05-18T23:12:31Z

Hello, Any update? This really makes the unsealed metric useless. Thanks.

p-k-sharma · 2023-10-27T09:00:20Z

Any update on this? It's been more than 3 years... The issue is still open.

ameflorenti · 2023-12-05T16:30:57Z

Do I understand correctly? There is no way of knowing with Prometheus whether a VM in an HA cluster is sealed as long as some are unsealed. Has anyone found a solution to this? I really do not want to wait until the whole Vault cluster is sealed before I get an alert. That defeats the purpose of an HA setup, where you can fix issues as they happen while keeping Vault unsealed. I just checked, and this seems to be the case for Vault Enterprise too.

ameflorenti · 2023-12-06T15:13:35Z

WORKAROUND:
While trying to get label values for the cluster, I noticed that Vault does not return metrics for sealed nodes. I then named and organized the Prometheus jobs per cluster, as I did in Vault. This gives me a way of getting the list of nodes in a cluster even when they are sealed.
Using this query I can "deduce" that the nodes in the HA cluster that are not returning metrics are SEALED or UNAVAILABLE to Vault:
count by (instance)(up{job="$cluster"}) unless on(instance) count by (instance)(vault_core_unsealed{job="$cluster"})

Does that make sense?

LeoQuote · 2023-12-19T10:16:09Z

up{job="vault"} > 0 unless on(instance) vault_core_unsealed

for the warning alert (some Vault instances are sealed)

sum(vault_core_unsealed) < 1 or absent(vault_core_unsealed)

for the critical alert

I think this is a problem that needs to be solved, but currently I can only use this workaround.
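For readers less familiar with PromQL's unless and absent, the logic of the two alert expressions above can be written out in Go. This is a rough sketch over a hand-built snapshot, not something that talks to Prometheus; the sample type and the function names are invented for illustration.

```go
package main

import (
	"fmt"
	"sort"
)

// sample is one scrape target: whether the scrape succeeded (the "up" series)
// and, if exported, the value of vault_core_unsealed for that instance.
type sample struct {
	up       bool
	unsealed *float64 // nil means the metric is absent for this instance
}

// warningInstances mirrors `up > 0 unless on(instance) vault_core_unsealed`:
// instances that were scraped successfully but export no unsealed gauge.
func warningInstances(s map[string]sample) []string {
	var out []string
	for inst, v := range s {
		if v.up && v.unsealed == nil {
			out = append(out, inst)
		}
	}
	sort.Strings(out)
	return out
}

// critical mirrors `sum(vault_core_unsealed) < 1 or absent(vault_core_unsealed)`:
// either no instance exports the gauge, or the exported values sum to less than 1.
func critical(s map[string]sample) bool {
	sum, seen := 0.0, false
	for _, v := range s {
		if v.unsealed != nil {
			sum += *v.unsealed
			seen = true
		}
	}
	return !seen || sum < 1
}

func main() {
	one := 1.0
	snapshot := map[string]sample{
		"vault-0": {up: true, unsealed: &one}, // exports the gauge
		"vault-1": {up: true},                 // scraped, but gauge missing
		"vault-2": {up: false},                // target down
	}
	fmt.Println(warningInstances(snapshot), critical(snapshot)) // prints "[vault-1] false"
}
```

Note that a target that is down (vault-2 here) is not covered by the warning expression, since it filters on up > 0.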

cascadia-sati · 2024-01-09T13:41:43Z

Is this still an issue? I'm seeing vault_core_unsealed metrics even from standby nodes, but note that according to the docs, you need to enable unauthenticated access:

"The /v1/sys/metrics endpoint is only accessible on active nodes and automatically disabled on standby nodes. You can enable the /v1/sys/metrics endpoint on standby nodes by enabling unauthenticated metrics access."

This is on an HA setup in K8s with the Vault Helm chart v0.25.0 and Vault v1.14.0.

When all are sealed:

$ for POD in {0..2}; do echo -n "vault-$POD: "; k get pod vault-$POD -oyaml | grep vault-active || echo; done
vault-0:     vault-active: "false"
vault-1:     vault-active: "false"
vault-2:     vault-active: "false"

$ for POD in {0..2}; do echo -n "vault-$POD: "; k exec -it vault-$POD -- /bin/sh -c "wget -qO - localhost:8200/v1/sys/metrics?format=prometheus" | grep "^vault_core_unsealed" || echo; done
vault-0: vault_core_unsealed{cluster="pace-vault"} 0
vault-1: vault_core_unsealed{cluster="pace-vault"} 0
vault-2: vault_core_unsealed{cluster="pace-vault"} 0

When all are unsealed:

$ for POD in {0..2}; do echo -n "vault-$POD: "; k get pod vault-$POD -oyaml | grep vault-active || echo; done
vault-0:     vault-active: "true"
vault-1:     vault-active: "false"
vault-2:     vault-active: "false"

$ for POD in {0..2}; do echo -n "vault-$POD: "; k exec -it vault-$POD -- /bin/sh -c "wget -qO - localhost:8200/v1/sys/metrics?format=prometheus" | grep "^vault_core_unsealed" || echo; done
vault-0: vault_core_unsealed{cluster="pace-vault"} 1
vault-1: vault_core_unsealed{cluster="pace-vault"} 1
vault-2: vault_core_unsealed{cluster="pace-vault"} 1
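The grep step in the loops above can be approximated in Go as well. The sketch below assumes the Prometheus text format seen in the output (metric name, optional label set in braces, then the value); unsealedFromText is a hypothetical helper, and a real client would more likely use a proper exposition-format parser.

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// unsealedFromText scans Prometheus text-format output for the first
// vault_core_unsealed sample and returns its value and whether it was found.
func unsealedFromText(body string) (float64, bool) {
	const name = "vault_core_unsealed"
	sc := bufio.NewScanner(strings.NewReader(body))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if !strings.HasPrefix(line, name) {
			continue // also skips "# HELP" and "# TYPE" comment lines
		}
		rest := line[len(name):]
		// The name must be followed by labels or whitespace, so that a
		// longer metric name sharing this prefix is not matched.
		if rest == "" || (rest[0] != '{' && rest[0] != ' ' && rest[0] != '\t') {
			continue
		}
		// Drop the label set, if any; what remains starts with the value.
		if i := strings.LastIndex(rest, "}"); i >= 0 {
			rest = rest[i+1:]
		}
		fields := strings.Fields(rest)
		if len(fields) == 0 {
			continue
		}
		v, err := strconv.ParseFloat(fields[0], 64)
		if err != nil {
			continue
		}
		return v, true
	}
	return 0, false
}

func main() {
	out := "# TYPE vault_core_unsealed gauge\n" +
		"vault_core_unsealed{cluster=\"pace-vault\"} 1\n"
	v, ok := unsealedFromText(out)
	fmt.Println(v, ok) // prints "1 true"
}
```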

Make sure to set the cluster_name field in the config to avoid duplicate metrics: #11988

I ran into another issue specific to the Vault Helm chart that caused metrics to disappear when all Vault pods are sealed, which we had to work around: hashicorp/vault-helm#990

I'm also running into another problem, which I'm about to file an issue for: the vault_core_unsealed metric disappears after the prometheus_retention_time elapses, because apparently Vault doesn't periodically refresh the metric when it doesn't change.
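The retention behaviour described here can be illustrated with a small Go model. It is a sketch under stated assumptions, not Vault's or the Prometheus sink's real implementation: retainedGauge and hours are invented names, time is modelled as a duration since process start, and the 12-hour window is only an example value.

```go
package main

import (
	"fmt"
	"time"
)

// retainedGauge is a toy sink entry that is forgotten once it has not been
// written for longer than the retention window (hypothetical model).
type retainedGauge struct {
	retention time.Duration
	value     float64
	lastSet   time.Duration // time of the last write, measured from process start
	written   bool
}

// hours is a small helper for building durations in whole hours.
func hours(n int) time.Duration { return time.Duration(n) * time.Hour }

// Set records a value and remembers when it was written.
func (g *retainedGauge) Set(v float64, now time.Duration) {
	g.value, g.lastSet, g.written = v, now, true
}

// Read returns the value only if the last write is inside the retention window.
func (g *retainedGauge) Read(now time.Duration) (float64, bool) {
	if !g.written || now-g.lastSet > g.retention {
		return 0, false
	}
	return g.value, true
}

func main() {
	// Written once and never again: gone after the window has elapsed.
	once := &retainedGauge{retention: hours(12)}
	once.Set(1, hours(0))
	_, ok := once.Read(hours(13))
	fmt.Println("set once:", ok) // prints "set once: false"

	// Rewritten every hour with the same value: still visible.
	refreshed := &retainedGauge{retention: hours(12)}
	for h := 0; h <= 13; h++ {
		refreshed.Set(1, hours(h))
	}
	_, ok = refreshed.Read(hours(13))
	fmt.Println("refreshed:", ok) // prints "refreshed: true"
}
```

In this model the only way to keep an unchanged gauge visible is to write it again before the window expires, which is why a periodic refresh matters even when the value never changes.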

banks · 2024-07-24T14:23:29Z

Hi folks. Just checking through older bugs. As @cascadia-sati mentioned, it seems this is no longer an issue. Does anyone on this thread still see this problem?

As far as I can see this was actually fixed by #12166 a couple of years ago - I've looked through the code and can confirm that now runStandby calls metricsLoop which is what outputs this.

Closing for now, please let us know if someone is still seeing this on a version of Vault after 1.13.0.

raskchanky added bug core/metric core/telemetry ha/raft labels Sep 22, 2020

masa213f mentioned this issue Mar 31, 2021

Refine the vault alert rule after upgrading vault to v1.5.0 or higher. cybozu-go/neco#1246

Open

4 tasks

heatherezell added core/storage storage/raft and removed ha/raft labels Jul 20, 2021

ncabatoff removed core/storage storage/raft version/1.5.x labels Jul 21, 2021

ncabatoff changed the title ~~Missing telemetry core.unsealed metrics in Raft stand-by nodes~~ Missing telemetry core.unsealed metrics on standby nodes Jul 21, 2021

ncabatoff mentioned this issue Sep 14, 2021

Add more raft metrics, emit more metrics on non-perf standbys #12166

Merged

heatherezell mentioned this issue Dec 15, 2021

missing vault_core_unsealed metrics for non-leader nodes #13450

Closed

claviola mentioned this issue Jan 24, 2023

Missing core leadership metrics #11732

Closed

banks closed this as completed Jul 24, 2024
