Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

health_check: add stats counters to monitor health check behavior #37409

Open
wants to merge 4 commits into
base: main
Choose a base branch
from

Conversation

agrawroh
Copy link
Contributor

@agrawroh agrawroh commented Nov 28, 2024

Description

This PR adds stats to the health check HTTP filter. These new stats provide visibility into health check behavior including request counts, successful/failed checks, cached responses, and cluster health status. These stats help operators monitor the health checking system and diagnose issues.

Here is a list of key stats added:

  • request_total (Counter) : Total number of requests that were served from this health check filter
  • failed (Counter) : Total number of health checks that failed (including failures from cluster status)
  • ok (Counter) : Total number of health checks that passed
  • cached_response (Counter) : Total number of requests that were responded to with cached health check status
  • failed_cluster_not_found (Counter) : Total number of failed health checks due to referenced cluster not being found
  • failed_cluster_empty (Counter) : Total number of failed health checks due to empty cluster membership when checking cluster health
  • failed_cluster_unhealthy (Counter) : Total number of failed health checks due to cluster falling below minimum healthy percentage threshold
  • degraded (Counter) : Total number of health check responses that reported degraded status

Commit Message: health_check: add stats counters to monitor health check behavior
Additional Description: This change improves observability of the health check filter by exposing key metrics about health check processing and cluster health state. The stats are scoped under the connection manager and follow standard Envoy stats naming conventions.
Risk Level: Low
Testing: Added unit and integration tests verifying all stats counters
Docs Changes: Added
Release Notes: Added

@agrawroh
Copy link
Contributor Author

/assign adisuissa

Copy link
Contributor

@adisuissa adisuissa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall LGTM, thanks!
Left a couple of high-level questions, but mostly minor comments.

Statistics
----------

The health check filter outputs statistics in the ``<stat_prefix>.health_check.`` namespace.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please clarify which stats_prefix is used here (and if it is optional, what will be the default value).

Another option to consider is adding stats support as optional (i.e., only if the stat_prefix is set). This adds the ability to avoid paying the stats costs if they are not needed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Clarified it. Please let me know if you think we must do this part - "adding stats support as optional (i.e., only if the stat_prefix is set). This adds the ability to avoid paying the stats costs if they are not needed."

source/extensions/filters/http/health_check/health_check.h Outdated Show resolved Hide resolved
source/extensions/filters/http/health_check/health_check.h Outdated Show resolved Hide resolved
@@ -173,9 +174,16 @@ void HealthCheckFilter::onComplete() {
if (!Http::CodeUtility::is2xx(enumToInt(final_status))) {
callbacks_->streamInfo().setResponseFlag(
StreamInfo::CoreResponseFlag::FailedLocalHealthCheck);
stats_->failed_.inc();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thinking out loud: should the failed_/ok_ also be incremented for the cached response?
Either way, the doc above (about the new statistics) should reflect this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it does make sense to have fail/ok include the cached response as well. I have clarified this in the doc. Let me know if you think otherwise.

Signed-off-by: Rohit Agrawal <[email protected]>
Copy link
Contributor

@adisuissa adisuissa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks!
/assign-from @envoyproxy/senior-maintainers

Copy link

@envoyproxy/senior-maintainers assignee is @RyanTheOptimist

🐱

Caused by: a #37409 (review) was submitted by @adisuissa.

see: more, trace.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants