Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add more raft metrics, emit more metrics on non-perf standbys #12166

Merged
merged 11 commits into from
Oct 7, 2022

Conversation

ncabatoff
Copy link
Collaborator

@ncabatoff ncabatoff commented Jul 24, 2021

  • ha_rpc_client_echo and ha_rpc_client_echo_errors aren't raft-specific, but heartbeating is essential to raft autopilot
  • raft_storage_stats_applied_index and raft_storage_stats_commit_index show what each node has applied locally
  • raft_storage_stats_fsm_pending shows how many logs are pending to be applied to the FSM
  • raft_storage_follower_last_heartbeat_ms reports how long it's been since the active node has seen each follower's heartbeat
  • raft_storage_follower_applied_index_delta reports delta between active node applied index and each follower's reported applied index (seen in heartbeat)

* ha_rpc_client_echo and ha_rpc_client_echo_errors aren't raft-specific, but heartbeating is essential to raft autopilot
* raft_storage_fsm_applied_term and raft_storage_fsm_applied_index show what each node has applied locally
* raft_storage_follower_last_heartbeat_ms reports how long it's been since the active node has seen each follower's heartbeat
* raft_storage_follower_applied_index_delta reports delta between active node applied index and each follower's reported applied index (seen in heartbeat)
@vercel vercel bot temporarily deployed to Preview – vault-storybook July 24, 2021 15:16 Inactive
@vercel vercel bot temporarily deployed to Preview – vault July 24, 2021 15:16 Inactive
…the metrics only when the value changes,

better to do it periodically, even when not changing.  We also add commit_index and fsm_pending (size of fsm
apply queue) to the list.

Furthermore, we weren't emitting bolt metrics on regular (non-perf) standbys, and there were other metrics
in metricsLoop that would make sense to include in OSS but weren't.  We now have an active-node-only func,
emitMetricsActiveNode.  This runs metricsLoop on the active node.  Standbys and perf-standbys run metricsLoop
from a goroutine managed by the runStandby rungroup.
@vercel vercel bot temporarily deployed to Preview – vault July 24, 2021 17:35 Inactive
@vercel vercel bot temporarily deployed to Preview – vault-storybook July 24, 2021 17:35 Inactive
// The gauge collection processes are started and stopped here
// because there's more than one TokenManager created during startup,
// but we only want one set of gauges.
//
// Both active nodes and performance standby nodes call emitMetrics
// Both active nodes and performance standby nodes call emitMetricsNonStandby
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Whoops, this comment change is a holdover from an earlier iteration, will fix.

@ncabatoff
Copy link
Collaborator Author

This will incidentally fix #10015, since standby nodes will now be periodically emitting metrics including the one for unsealed.

# Conflicts:
#	vault/core_metrics.go
@vercel vercel bot temporarily deployed to Preview – vault-storybook September 15, 2021 18:32 Inactive
@vercel vercel bot temporarily deployed to Preview – vault September 15, 2021 18:32 Inactive
@ncabatoff ncabatoff changed the title Add some metrics helpful for monitoring raft cluster state Add more raft metrics, emit more metrics on non-perf standbys Sep 15, 2021
@vercel vercel bot temporarily deployed to Preview – vault September 15, 2021 19:39 Inactive
@vercel vercel bot temporarily deployed to Preview – vault-storybook September 15, 2021 19:39 Inactive
Copy link
Contributor

@briankassouf briankassouf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One comment, but looks good otherwise. Please don't forget to add docs for these new metrics

Value: b.localID,
},
}
for _, key := range []string{"term", "commit_index", "applied_index", "fsm_pending"} {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One small thing here. It's not super obvious, but the b.raft.Stats()["applied_index"] is actually the latest index that the raft library has queued to the FSM (it includes fsm_pending items), not what has actually been applied. Our heartbeating mechanism uses the actual last applied index we've seen in the FSM. If the intent is to compare this value between nodes you might find that it's more up-to-date then reality. If that's not the intent (which maybe not given the existence of the delta metric below), then we may just want to leave it as is and mention this fact in the docs?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I'll take the doc strategy. As you say, applied_index_delta can tell us how far behind followers are. IIRC here I was mostly just looking to expose potentially useful info from the raft lib for debugging.

@Bowser1704
Copy link

Hi, have any updates on this? It's important to emit metrics on the standby node in HA mode.

# Conflicts:
#	physical/raft/raft_autopilot.go
#	vault/core_metrics.go
#	vault/request_forwarding_rpc.go
…r debugging. Address a possible bug in generating label slice passed to a defer.
physical/raft/fsm.go Show resolved Hide resolved
physical/raft/raft_autopilot.go Outdated Show resolved Hide resolved
@ncabatoff ncabatoff enabled auto-merge (squash) October 6, 2022 13:04
@ncabatoff ncabatoff merged commit ce74f4f into main Oct 7, 2022
@raskchanky raskchanky deleted the add-raft-metrics branch October 7, 2022 16:27
miagilepner pushed a commit that referenced this pull request Jun 15, 2023
Add some metrics helpful for monitoring raft cluster state.

Furthermore, we weren't emitting bolt metrics on regular (non-perf) standbys, and there were other metrics
in metricsLoop that would make sense to include in OSS but weren't.  We now have an active-node-only func,
emitMetricsActiveNode.  This runs metricsLoop on the active node.  Standbys and perf-standbys run metricsLoop
from a goroutine managed by the runStandby rungroup.
miagilepner added a commit that referenced this pull request Jun 15, 2023
Add some metrics helpful for monitoring raft cluster state.

Furthermore, we weren't emitting bolt metrics on regular (non-perf) standbys, and there were other metrics
in metricsLoop that would make sense to include in OSS but weren't.  We now have an active-node-only func,
emitMetricsActiveNode.  This runs metricsLoop on the active node.  Standbys and perf-standbys run metricsLoop
from a goroutine managed by the runStandby rungroup.

Co-authored-by: Nick Cabatoff <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants