From 359dff3ad3d6a564da34004489c883f23a040387 Mon Sep 17 00:00:00 2001
From: Paul Gottschling
Date: Thu, 12 Sep 2024 08:47:26 -0400
Subject: [PATCH 1/2] Add a guide to metrics for monitoring Teleport

Closes #40664

This change turns the Metrics guide in `admin-guides` into a conceptual guide to the most important metrics for monitoring a Teleport cluster. Since Agent metrics have inconsistent comprehensiveness across Teleport services--and to reduce the scope of this change--this guide focuses on self-hosted clusters.

To make this a conceptual guide instead of a reference, this change removes the reference table from the `admin-guides` metrics page. There is already a table in the dedicated metrics reference guide.

Note that, while the new metrics guide is specific to self-hosted clusters, this change does not move the guide to the subsection of Admin Guides related to self-hosting Teleport. Doing this would mean having one subsection of Admin Guides for diagnostics-related guides and one subsection for self-hosted-specific diagnostics, which is potentially confusing. We may also want to add Agent-specific metrics eventually.

Finally, this change does not include alert thresholds for the metrics it describes. We can define these in a subsequent change.
---
 .../management/diagnostics/metrics.mdx | 185 +++++++++++++++++-
 1 file changed, 175 insertions(+), 10 deletions(-)

diff --git a/docs/pages/admin-guides/management/diagnostics/metrics.mdx b/docs/pages/admin-guides/management/diagnostics/metrics.mdx
index bbea5111e27bd..037c24cb3c2bc 100644
--- a/docs/pages/admin-guides/management/diagnostics/metrics.mdx
+++ b/docs/pages/admin-guides/management/diagnostics/metrics.mdx
@@ -1,11 +1,20 @@
 ---
-title: Metrics
-description: How to enable and consume metrics
+title: Key Metrics for Self-Hosted Clusters
+description: Describes important metrics to monitor if you are self-hosting Teleport.
+tocDepth: 3
 ---
 
-Teleport exposes metrics for all of its components, helping you get insight
-into the state of your cluster. This guide explains the metrics that you can
-collect from your Teleport cluster.
+This guide explains the metrics you should use to get started monitoring your
+self-hosted Teleport cluster, focusing on metrics reported by the Auth Service
+and Proxy Service. If you use Teleport Enterprise (Cloud), the Teleport team
+monitors and responds to these metrics for you.
+
+For a reference of all available metrics, see the [Teleport Metrics
+Reference](../../../reference/monitoring/metrics.mdx).
+
+This guide assumes that you already monitor compute resources on all instances
+that run the Teleport Auth Service and Proxy Service (e.g., CPU, memory, disk,
+bandwidth, and open file descriptors).
 
 ## Enabling metrics
 
@@ -14,12 +23,168 @@ collect from your Teleport cluster.
 This will enable the `http://127.0.0.1:3000/metrics` endpoint, which serves the
 metrics that Teleport tracks. It is compatible with
 [Prometheus](https://prometheus.io/) collectors.
 
-The following metrics are available:
+## Backend operations
+
+A Teleport cluster cannot function if the Auth Service does not have a healthy
+cluster state backend. You need to track the ability of the Auth Service to read
+from and write to its backend.
+
+The Auth Service can connect to [several possible
+backends](../../../reference/backends.mdx).
+In addition to Teleport backend
+metrics, you should set up monitoring for your backend of choice so that, if
+these metrics show problematic values, you can correlate them with metrics on
+your backend infrastructure.
+
+### Backend operation throughput and availability
+
+On each backend operation, the Auth Service increments a metric. Backend
+operation metrics have the following format:
+
+```text
+teleport_backend_<operation>[_failed]_total
+```
+
+If an operation results in an error, the Auth Service adds the `_failed` segment
+to the metric name. For example, successfully creating a record increments the
+`teleport_backend_write_requests_total` metric. If the create operation fails,
+the Auth Service increments `teleport_backend_write_requests_failed_total`
+instead.
+
+The following backend operation metrics are available:
+
+|Operation|Incremented metric name|
+|---|---|
+|Create an item|`write_requests`|
+|Modify an item, creating it if it does not exist|`write_requests`|
+|Update an item|`write_requests`|
+|Conditionally update an item if versions match|`write_requests`|
+|List a range of items|`batch_read_requests`|
+|Get a single item|`read_requests`|
+|Compare and swap items|`write_requests`|
+|Delete an item|`write_requests`|
+|Conditionally delete an item if versions match|`write_requests`|
+|Write a batch of updates atomically, failing the write if any update fails|Both `write_requests` and `atomic_write_requests`|
+|Delete a range of items|`batch_write_requests`|
+|Update the keepalive status of an item|`write_requests`|
+
+You can use these metrics to define an availability formula, i.e., the
+percentage of reads or writes that succeeded. Take the sum of requests that
+succeeded (including batch requests) over the total sum of requests, multiplied
+by 100. If your backend begins to appear unavailable, you can investigate your
+backend infrastructure.
+
+### Backend operation performance
+
+To help you track backend operation performance, the Auth Service also exposes
+Prometheus [histogram metrics](https://prometheus.io/docs/practices/histograms/)
+for read and write operations:
+
+- `teleport_backend_read_seconds_bucket`
+- `teleport_backend_write_seconds_bucket`
+- `teleport_backend_batch_write_seconds_bucket`
+- `teleport_backend_batch_read_seconds_bucket`
+- `teleport_backend_atomic_write_seconds_bucket`
+
+The backend throughput metrics discussed in the previous section map onto
+latency metrics. Whenever the Auth Service increments one of the throughput
+metrics, it reports one of the corresponding latency metrics. See the table
+below for which throughput metrics map to which latency metrics. Each metric
+name excludes the standard prefixes and suffixes.
+
+|Throughput|Latency|
+|---|---|
+|`read_requests`|`read_seconds_bucket`|
+|`write_requests`|`write_seconds_bucket`|
+|`batch_read_requests`|`batch_read_seconds_bucket`|
+|`batch_write_requests`|`batch_write_seconds_bucket`|
+|`atomic_write_requests`|`atomic_write_seconds_bucket`|
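+
+For example, you can estimate backend write latency percentiles from these
+histograms with a query similar to the following (a sketch; the 0.95 quantile
+and the five-minute rate window are example values to adjust for your
+environment):
+
+```text
+histogram_quantile(0.95, sum(rate(teleport_backend_write_seconds_bucket[5m])) by (le))
+```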
+
+## Agents and connected resources
+
+To enable users to access most infrastructure with Teleport, you must join a
+[Teleport Agent](../../../enroll-resources/agents/agents.mdx) to your Teleport
+cluster and configure it to proxy your infrastructure. In a typical setup, an
+Agent establishes an SSH reverse tunnel with the Proxy Service. User traffic to
+Teleport-protected resources flows through the Proxy Service, an Agent, and
+finally the infrastructure resource the Agent proxies. Return traffic from the
+resource takes this path in reverse.
+
+### Number of connected resources by type
+
+Teleport-connected resources periodically send heartbeat (keepalive) messages to
+the Auth Service. The Auth Service uses these heartbeats to track the number of
+Teleport-protected resources by type with the `teleport_connected_resources`
+metric.
+
+The Auth Service tracks this metric for the following resources:
+
+- SSH servers
+- Kubernetes clusters
+- Applications
+- Databases
+- Teleport Database Service instances
+- Windows desktops
+
+You can use this metric to:
+- Compare the number of resources that are protected by Teleport with those that
+  are not so you can plan your Teleport rollout, e.g., by configuring [Auto
+  Discovery](../../../enroll-resources/auto-discovery/auto-discovery.mdx).
+- Correlate changes in Teleport usage with resource utilization on Auth Service
+  and Proxy Service compute instances to determine scaling needs.
+
+You can include this query in your Grafana configuration to break this metric
+down by resource type:
+
+```text
+sum(teleport_connected_resources) by (type)
+```
+
+### Reverse tunnels by type
+
+Every Teleport service that starts up establishes an SSH reverse tunnel to the
+Proxy Service. (Self-hosted clusters can configure Agent services to connect to
+the Auth Service directly without establishing a reverse tunnel.) The Proxy
+Service tracks the number of reverse tunnels using the metric,
+`teleport_reverse_tunnels_connected`.
+
+With an improperly scaled Proxy Service pool, the Proxy Service can become a
+bottleneck for traffic to Teleport-protected resources. If Proxy Service
+instances display heavy utilization of compute resources while the number of
+connected infrastructure resources is high, you can consider scaling out your
+Proxy Service pool and using [Proxy Peering](../operations/proxy-peering.mdx).
+
+Use the following Grafana query to track the maximum number of reverse tunnels
+by type over a given interval:
+
+```text
+max(teleport_reverse_tunnels_connected) by (type)
+```
+
+### Count and version of Teleport Agents
+
+Alongside the number of connected resources and reverse tunnels, you can track
+the number of Agents in your Teleport cluster. Since you can run multiple
+Teleport services on a single Agent instance, this metric helps you understand
+the architecture of your Teleport Agent deployment so you can diagnose issues
+with resource utilization.
+
+At regular intervals (around 7 seconds with jitter), the Auth Service refreshes
+its count of registered Agents. You can measure this count with the metric,
+`teleport_registered_servers`. To get the number of registered Agents by
+version, you can use this query in Grafana:
-
+```text
+sum by (version)(teleport_registered_servers)
+```
- Teleport Cloud does not expose monitoring endpoints for the Auth Service and Proxy Service.
+Since this metric is grouped by version, you can also tell how many of your
+Agents are behind the version of the Auth Service and Proxy Service, which can
+help you identify any that are at risk of violating the Teleport [version
+compatibility guarantees](../../../upgrading/overview.mdx).
-
+We strongly encourage self-hosted Teleport users to enroll their Agents in
+automatic updates. You can track the count of Teleport Agents that are not
+enrolled in automatic updates using the metric, `teleport_enrolled_in_upgrades`.
+[Read the documentation](../../../upgrading/automatic-agent-updates.mdx) for how
+to enroll Agents in automatic updates.
-(!docs/pages/includes/metrics.mdx!)
\ No newline at end of file

From 2d3c2f32d22658b04f8049c01d1e04b4646734bd Mon Sep 17 00:00:00 2001
From: Paul Gottschling
Date: Wed, 2 Oct 2024 10:48:44 -0400
Subject: [PATCH 2/2] Respond to evanfreed feedback

- Describe `backend_write_requests_failed_precondition_total`
- Include the precondition metric in the write availability formula.
- Turn the `registered_servers` discussion into a discussion of Teleport instance version, since it's not possible to group this metric by service and subtract the count of Auth Service/Proxy Service instances from the count of all registered services.
---
 .../management/diagnostics/metrics.mdx | 60 ++++++++++++-------
 1 file changed, 37 insertions(+), 23 deletions(-)

diff --git a/docs/pages/admin-guides/management/diagnostics/metrics.mdx b/docs/pages/admin-guides/management/diagnostics/metrics.mdx
index 037c24cb3c2bc..3888390bcd795 100644
--- a/docs/pages/admin-guides/management/diagnostics/metrics.mdx
+++ b/docs/pages/admin-guides/management/diagnostics/metrics.mdx
@@ -67,11 +67,31 @@ The following backend operation metrics are available:
 |Delete a range of items|`batch_write_requests`|
 |Update the keepalive status of an item|`write_requests`|
 
-You can use these metrics to define an availability formula, i.e., the
-percentage of reads or writes that succeeded. Take the sum of requests that
-succeeded (including batch requests) over the total sum of requests, multiplied
-by 100. If your backend begins to appear unavailable, you can investigate your
-backend infrastructure.
+During failed backend writes, a Teleport process also increments the
+`backend_write_requests_failed_precondition_total` metric if the cause of the
+failure is expected. For example, the metric increments during a create
+operation if a record already exists, during an update or delete operation if
+the record is not found, and during an atomic write if the resource was modified
+concurrently. All of these conditions can hold in a well-functioning Teleport
+cluster.
+
+`backend_write_requests_failed_precondition_total` increments whenever
+`backend_write_requests_failed_total` increments, and you can use it to
+distinguish potentially expected write failures from unexpected, problematic
+ones.
+
+You can use backend operation metrics to define an availability formula, i.e.,
+the percentage of reads or writes that succeeded. For example, in Prometheus,
+you can define a query similar to the following. This takes the percentage of
+write requests that failed for unexpected reasons and subtracts it from 1 to get
+a percentage of successful writes:
+
+```
+1 - (
+  sum(rate(teleport_backend_write_requests_failed_total[5m]))
+  -
+  sum(rate(teleport_backend_write_requests_failed_precondition_total[5m]))
+) / sum(rate(teleport_backend_write_requests_total[5m]))
+```
+
+If your backend begins to appear unavailable, you can investigate your backend
+infrastructure.
 
 ### Backend operation performance
 
@@ -127,8 +147,7 @@ The Auth Service tracks this metric for the following resources:
 
 You can use this metric to:
 - Compare the number of resources that are protected by Teleport with those that
-  are not so you can plan your Teleport rollout, e.g., by configuring [Auto
-  Discovery](../../../enroll-resources/auto-discovery/auto-discovery.mdx).
+  are not so you can plan your Teleport rollout.
 - Correlate changes in Teleport usage with resource utilization on Auth Service
   and Proxy Service compute instances to determine scaling needs.
@@ -160,31 +179,26 @@ by type over a given interval:
 max(teleport_reverse_tunnels_connected) by (type)
 ```
 
-### Count and version of Teleport Agents
-
-Alongside the number of connected resources and reverse tunnels, you can track
-the number of Agents in your Teleport cluster. Since you can run multiple
-Teleport services on a single Agent instance, this metric helps you understand
-the architecture of your Teleport Agent deployment so you can diagnose issues
-with resource utilization.
+## Teleport instance versions
 
 At regular intervals (around 7 seconds with jitter), the Auth Service refreshes
-its count of registered Agents. You can measure this count with the metric,
-`teleport_registered_servers`. To get the number of registered Agents by
-version, you can use this query in Grafana:
+its count of registered Teleport instances, including Agents and Teleport
+processes that run the Auth Service and Proxy Service. You can measure this
+count with the metric, `teleport_registered_servers`. To get the number of
+registered instances by version, you can use this query in Grafana:
 
 ```text
 sum by (version)(teleport_registered_servers)
 ```
 
-Since this metric is grouped by version, you can also tell how many of your
-Agents are behind the version of the Auth Service and Proxy Service, which can
-help you identify any that are at risk of violating the Teleport [version
-compatibility guarantees](../../../upgrading/overview.mdx).
+You can use this metric to tell how many of your registered Teleport instances
+are behind the version of the Auth Service and Proxy Service, which can help you
+identify any that are at risk of violating the Teleport [version compatibility
+guarantees](../../../upgrading/overview.mdx).
 
 We strongly encourage self-hosted Teleport users to enroll their Agents in
 automatic updates. You can track the count of Teleport Agents that are not
 enrolled in automatic updates using the metric, `teleport_enrolled_in_upgrades`.
-[Read the documentation](../../../upgrading/automatic-agent-updates.mdx) for how
-to enroll Agents in automatic updates.
+[Read the documentation](../../../upgrading.mdx) for how to enroll Agents in
+automatic updates.
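+
+As with the other metrics above, you can chart this count in Grafana with a
+query similar to the following (a sketch; adjust the aggregation to match how
+you label and collect metrics in your environment):
+
+```text
+sum(teleport_enrolled_in_upgrades)
+```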