Add more raft metrics, emit more metrics on non-perf standbys #12166

ncabatoff · 2021-07-24T14:47:57Z

ha_rpc_client_echo and ha_rpc_client_echo_errors aren't raft-specific, but heartbeating is essential to raft autopilot
raft_storage_stats_applied_index and raft_storage_stats_commit_index show what each node has applied locally
raft_storage_stats_fsm_pending shows how many logs are pending to be applied to the FSM
raft_storage_follower_last_heartbeat_ms reports how long it's been since the active node has seen each follower's heartbeat
raft_storage_follower_applied_index_delta reports delta between active node applied index and each follower's reported applied index (seen in heartbeat)

* ha_rpc_client_echo and ha_rpc_client_echo_errors aren't raft-specific, but heartbeating is essential to raft autopilot
* raft_storage_fsm_applied_term and raft_storage_fsm_applied_index show what each node has applied locally
* raft_storage_follower_last_heartbeat_ms reports how long it's been since the active node has seen each follower's heartbeat
* raft_storage_follower_applied_index_delta reports delta between active node applied index and each follower's reported applied index (seen in heartbeat)

…the metrics only when the value changes, better to do it periodically, even when not changing. We also add commit_index and fsm_pending (size of fsm apply queue) to the list. Furthermore, we weren't emitting bolt metrics on regular (non-perf) standbys, and there were other metrics in metricsLoop that would make sense to include in OSS but weren't. We now have an active-node-only func, emitMetricsActiveNode. This runs metricsLoop on the active node. Standbys and perf-standbys run metricsLoop from a goroutine managed by the runStandby rungroup.

ncabatoff · 2021-07-28T15:52:22Z

vault/core_metrics.go

 	// The gauge collection processes are started and stopped here
 	// because there's more than one TokenManager created during startup,
 	// but we only want one set of gauges.
 	//
-	// Both active nodes and performance standby nodes call emitMetrics
+	// Both active nodes and performance standby nodes call emitMetricsNonStandby


Whoops, this comment change is a holdover from an earlier iteration, will fix.

ncabatoff · 2021-09-14T18:55:30Z

This will incidentally fix #10015, since standby nodes will now be periodically emitting metrics including the one for unsealed.

briankassouf

One comment, but looks good otherwise. Please don't forget to add docs for these new metrics.

briankassouf · 2021-09-15T22:27:20Z

physical/raft/raft.go

+			Value: b.localID,
+		},
+	}
+	for _, key := range []string{"term", "commit_index", "applied_index", "fsm_pending"} {


One small thing here. It's not super obvious, but b.raft.Stats()["applied_index"] is actually the latest index that the raft library has queued to the FSM (it includes fsm_pending items), not what has actually been applied. Our heartbeating mechanism uses the actual last applied index we've seen in the FSM. If the intent is to compare this value between nodes, you might find that it's more up to date than reality. If that's not the intent (which it may not be, given the existence of the delta metric below), then we may just want to leave it as is and mention this fact in the docs?

I think I'll take the doc strategy. As you say, applied_index_delta can tell us how far behind followers are. IIRC here I was mostly just looking to expose potentially useful info from the raft lib for debugging.
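
For readers following along, the loop in the diff above iterates over keys of the raft library's stats map, which holds string values. Below is a minimal sketch of turning such a map into numeric gauges. `statsToGauges` is a hypothetical helper, and silently skipping missing or non-numeric entries is an assumption of this sketch rather than the PR's behavior. Per the review comment, a gauge built from "applied_index" may run ahead of what the FSM has actually applied.

```go
package main

import (
	"fmt"
	"strconv"
)

// statsToGauges converts the selected entries of a string-valued stats
// map into numeric gauge values. Keys that are missing or do not parse
// as unsigned integers are skipped (an assumption of this sketch).
func statsToGauges(stats map[string]string, keys []string) map[string]float32 {
	out := make(map[string]float32, len(keys))
	for _, key := range keys {
		raw, ok := stats[key]
		if !ok {
			continue
		}
		n, err := strconv.ParseUint(raw, 10, 64)
		if err != nil {
			continue
		}
		out[key] = float32(n)
	}
	return out
}

func main() {
	// Example values only; a real stats map comes from the raft library.
	stats := map[string]string{
		"term":          "3",
		"commit_index":  "1000",
		"applied_index": "1000",
		"fsm_pending":   "4",
		"state":         "Leader",
	}
	keys := []string{"term", "commit_index", "applied_index", "fsm_pending"}
	gauges := statsToGauges(stats, keys)
	for _, key := range keys {
		fmt.Printf("%s=%v\n", key, gauges[key])
	}
}
```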

Bowser1704 · 2022-04-14T03:16:08Z

Hi, are there any updates on this? It's important to emit metrics on standby nodes in HA mode.

Add some metrics helpful for monitoring raft cluster state. Furthermore, we weren't emitting bolt metrics on regular (non-perf) standbys, and there were other metrics in metricsLoop that would make sense to include in OSS but weren't. We now have an active-node-only func, emitMetricsActiveNode. This runs metricsLoop on the active node. Standbys and perf-standbys run metricsLoop from a goroutine managed by the runStandby rungroup.

Co-authored-by: Nick Cabatoff

ncabatoff added 2 commits July 24, 2021 10:47

Add CL.

0e42e11

raskchanky approved these changes Jul 26, 2021

vishalnayak reviewed Jul 28, 2021

vault/core.go

vishalnayak reviewed Jul 28, 2021

physical/raft/raft_autopilot.go

ncabatoff commented Jul 28, 2021

Merge branch 'main' into add-raft-metrics

d364476

# Conflicts: # vault/core_metrics.go

Fix use of statelock on standbys, avoid deadlock on shutdown.

b769329

ncabatoff changed the title ~~Add some metrics helpful for monitoring raft cluster state~~ Add more raft metrics, emit more metrics on non-perf standbys Sep 15, 2021

Update CL.

b0b993d

briankassouf approved these changes Sep 15, 2021

Merge branch 'main' into add-raft-metrics

1340977

# Conflicts: # physical/raft/raft_autopilot.go # vault/core_metrics.go # vault/request_forwarding_rpc.go

Fix a merge issue. Comment a noisy log line I think was only using for debugging. Address a possible bug in generating label slice passed to a defer.

d90cab6

ncabatoff mentioned this pull request Oct 4, 2022

Raft index telemetry and docs #17397

Merged

ncabatoff added 2 commits October 4, 2022 13:29

Fix data race on fsm.db

49021ec

Document metrics

9411869

ncabatoff requested review from raskchanky and briankassouf October 4, 2022 18:29

raskchanky reviewed Oct 5, 2022

physical/raft/fsm.go

physical/raft/raft_autopilot.go (outdated)

raskchanky approved these changes Oct 5, 2022

Remove debug line.

a89bf05

ncabatoff enabled auto-merge (squash) October 6, 2022 13:04

ncabatoff merged commit ce74f4f into main Oct 7, 2022

raskchanky deleted the add-raft-metrics branch October 7, 2022 16:27

ncabatoff added backport/1.10.x labels Dec 6, 2022

This was referenced Dec 6, 2022

Backport of Add more raft metrics, emit more metrics on non-perf standbys into release/1.12.x #18246

Merged

Backport of Add more raft metrics, emit more metrics on non-perf standbys into release/1.10.x #18247

Closed

banks mentioned this pull request Jul 24, 2024

Missing telemetry core.unsealed metrics on standby nodes #10015

Closed
