Add more raft metrics, emit more metrics on non-perf standbys #12166
ncabatoff
commented
Jul 24, 2021
- ha_rpc_client_echo and ha_rpc_client_echo_errors aren't raft-specific, but heartbeating is essential to raft autopilot
- raft_storage_stats_applied_index and raft_storage_stats_commit_index show what each node has applied locally
- raft_storage_stats_fsm_pending shows how many logs are pending to be applied to the FSM
- raft_storage_follower_last_heartbeat_ms reports how long it's been since the active node has seen each follower's heartbeat
- raft_storage_follower_applied_index_delta reports delta between active node applied index and each follower's reported applied index (seen in heartbeat)
* ha_rpc_client_echo and ha_rpc_client_echo_errors aren't raft-specific, but heartbeating is essential to raft autopilot
* raft_storage_fsm_applied_term and raft_storage_fsm_applied_index show what each node has applied locally
* raft_storage_follower_last_heartbeat_ms reports how long it's been since the active node has seen each follower's heartbeat
* raft_storage_follower_applied_index_delta reports delta between active node applied index and each follower's reported applied index (seen in heartbeat)
…the metrics only when the value changes, better to do it periodically, even when not changing. We also add commit_index and fsm_pending (size of fsm apply queue) to the list. Furthermore, we weren't emitting bolt metrics on regular (non-perf) standbys, and there were other metrics in metricsLoop that would make sense to include in OSS but weren't. We now have an active-node-only func, emitMetricsActiveNode. This runs metricsLoop on the active node. Standbys and perf-standbys run metricsLoop from a goroutine managed by the runStandby rungroup.
vault/core_metrics.go
Outdated
// The gauge collection processes are started and stopped here
// because there's more than one TokenManager created during startup,
// but we only want one set of gauges.
//
// Both active nodes and performance standby nodes call emitMetrics
// Both active nodes and performance standby nodes call emitMetricsNonStandby
Whoops, this comment change is a holdover from an earlier iteration, will fix.
This will incidentally fix #10015, since standby nodes will now be periodically emitting metrics including the one for unsealed.
# Conflicts: # vault/core_metrics.go
One comment, but looks good otherwise. Please don't forget to add docs for these new metrics.
Value: b.localID,
},
}
for _, key := range []string{"term", "commit_index", "applied_index", "fsm_pending"} {
One small thing here. It's not super obvious, but b.raft.Stats()["applied_index"] is actually the latest index that the raft library has queued to the FSM (it includes fsm_pending items), not what has actually been applied. Our heartbeating mechanism uses the actual last applied index we've seen in the FSM. If the intent is to compare this value between nodes, you might find that it's more up-to-date than reality. If that's not the intent (which it may not be, given the existence of the delta metric below), then we may just want to leave it as is and mention this fact in the docs?
I think I'll take the doc strategy. As you say, applied_index_delta can tell us how far behind followers are. IIRC here I was mostly just looking to expose potentially useful info from the raft lib for debugging.
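For reference, the loop under review reads a handful of keys from the raft library's stats and exposes them as gauges. A minimal sketch of that conversion, assuming the stats arrive as a string-valued map (as the b.raft.Stats()["applied_index"] indexing above suggests); `statsGauges` is a hypothetical helper, and skipping missing or non-numeric entries is an assumption rather than the PR's behavior:

```go
package main

import (
	"fmt"
	"strconv"
)

// statsGauges converts the selected entries of a string-valued stats map
// into numeric gauge values. Keys that are missing or do not parse as
// unsigned integers are skipped.
func statsGauges(stats map[string]string, keys []string) map[string]float64 {
	out := make(map[string]float64, len(keys))
	for _, key := range keys {
		raw, ok := stats[key]
		if !ok {
			continue
		}
		v, err := strconv.ParseUint(raw, 10, 64)
		if err != nil {
			continue
		}
		out[key] = float64(v)
	}
	return out
}

func main() {
	// Example values only. Per the review comment, applied_index here is
	// the latest index queued to the FSM, so it may run ahead of what the
	// FSM has actually applied while fsm_pending is non-zero.
	stats := map[string]string{
		"state":         "Leader",
		"term":          "3",
		"commit_index":  "105",
		"applied_index": "105",
		"fsm_pending":   "4",
	}
	keys := []string{"term", "commit_index", "applied_index", "fsm_pending"}
	gauges := statsGauges(stats, keys)
	for _, key := range keys {
		fmt.Println(key, gauges[key])
	}
}
```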
Hi, are there any updates on this? It's important to emit metrics on the standby node in HA mode.
# Conflicts: # physical/raft/raft_autopilot.go # vault/core_metrics.go # vault/request_forwarding_rpc.go
…r debugging. Address a possible bug in generating label slice passed to a defer.
Add some metrics helpful for monitoring raft cluster state. Furthermore, we weren't emitting bolt metrics on regular (non-perf) standbys, and there were other metrics in metricsLoop that would make sense to include in OSS but weren't. We now have an active-node-only func, emitMetricsActiveNode. This runs metricsLoop on the active node. Standbys and perf-standbys run metricsLoop from a goroutine managed by the runStandby rungroup.
Co-authored-by: Nick Cabatoff <[email protected]>