From 359dff3ad3d6a564da34004489c883f23a040387 Mon Sep 17 00:00:00 2001
From: Paul Gottschling
Date: Thu, 12 Sep 2024 08:47:26 -0400
Subject: [PATCH 1/2] Add a guide to metrics for monitoring Teleport

Closes #40664

This change turns the Metrics guide in `admin-guides` into a conceptual guide to the most important metrics for monitoring a Teleport cluster. Since Agent metrics have inconsistent comprehensiveness across Teleport services--and to reduce the scope of this change--this guide focuses on self-hosted clusters.

To make this a conceptual guide instead of a reference, this change removes the reference table from the `admin-guides` metrics page. There is already a table in the dedicated metrics reference guide.

Note that, while the new metrics guide is specific to self-hosted clusters, this change does not move the guide to the subsection of Admin Guides related to self-hosting Teleport. Doing this would mean having one subsection of Admin Guides for diagnostics-related guides and one subsection for self-hosted-specific diagnostics, which is potentially confusing. We may also want to add Agent-specific metrics eventually.

Finally, this change does not include alert thresholds for the metrics it describes. We can define these in a subsequent change.
---
 .../management/diagnostics/metrics.mdx | 185 +++++++++++++++++-
 1 file changed, 175 insertions(+), 10 deletions(-)

diff --git a/docs/pages/admin-guides/management/diagnostics/metrics.mdx b/docs/pages/admin-guides/management/diagnostics/metrics.mdx
index bbea5111e27bd..037c24cb3c2bc 100644
--- a/docs/pages/admin-guides/management/diagnostics/metrics.mdx
+++ b/docs/pages/admin-guides/management/diagnostics/metrics.mdx
@@ -1,11 +1,20 @@
 ---
-title: Metrics
-description: How to enable and consume metrics
+title: Key Metrics for Self-Hosted Clusters
+description: Describes important metrics to monitor if you are self-hosting Teleport.
+tocDepth: 3
 ---
 
-Teleport exposes metrics for all of its components, helping you get insight
-into the state of your cluster. This guide explains the metrics that you can
-collect from your Teleport cluster.
+This guide explains the metrics you should use to get started monitoring your
+self-hosted Teleport cluster, focusing on metrics reported by the Auth Service
+and Proxy Service. If you use Teleport Enterprise (Cloud), the Teleport team
+monitors and responds to these metrics for you.
+
+For a reference of all available metrics, see the [Teleport Metrics
+Reference](../../../reference/monitoring/metrics.mdx).
+
+This guide assumes that you already monitor compute resources on all instances
+that run the Teleport Auth Service and Proxy Service (e.g., CPU, memory, disk,
+bandwidth, and open file descriptors).
 
 ## Enabling metrics
 
@@ -14,12 +23,168 @@ collect from your Teleport cluster.
 This will enable the `http://127.0.0.1:3000/metrics` endpoint, which serves the
 metrics that Teleport tracks. It is compatible with
 [Prometheus](https://prometheus.io/) collectors.
 
-The following metrics are available:
+## Backend operations
+
+A Teleport cluster cannot function if the Auth Service does not have a healthy
+cluster state backend. You need to track the ability of the Auth Service to read
+from and write to its backend.
+
+The Auth Service can connect to [several possible
+backends](../../../reference/backends.mdx).
+In addition to Teleport backend
+metrics, you should set up monitoring for your backend of choice so that, if
+these metrics show problematic values, you can correlate them with metrics on
+your backend infrastructure.
+
+### Backend operation throughput and availability
+
+On each backend operation, the Auth Service increments a metric. Backend
+operation metrics have the following format:
+
+```text
+teleport_backend_<operation>[_failed]_total
+```
+
+If an operation results in an error, the Auth Service adds the `_failed` segment
+to the metric name. For example, successfully creating a record increments the
+`teleport_backend_write_requests_total` metric. If the create operation fails,
+the Auth Service increments `teleport_backend_write_requests_failed_total`
+instead.
+
+The following backend operation metrics are available:
+
+|Operation|Incremented metric name|
+|---|---|
+|Create an item|`write_requests`|
+|Modify an item, creating it if it does not exist|`write_requests`|
+|Update an item|`write_requests`|
+|Conditionally update an item if versions match|`write_requests`|
+|List a range of items|`batch_read_requests`|
+|Get a single item|`read_requests`|
+|Compare and swap items|`write_requests`|
+|Delete an item|`write_requests`|
+|Conditionally delete an item if versions match|`write_requests`|
+|Write a batch of updates atomically, failing the write if any update fails|Both `write_requests` and `atomic_write_requests`|
+|Delete a range of items|`batch_write_requests`|
+|Update the keepalive status of an item|`write_requests`|
+
+You can use these metrics to define an availability formula, i.e., the
+percentage of reads or writes that succeeded. Take the sum of requests that
+succeeded (including batch requests) over the total sum of requests, multiplied
+by 100. If your backend begins to appear unavailable, you can investigate your
+backend infrastructure.
+
+### Backend operation performance
+
+To help you track backend operation performance, the Auth Service also exposes
+Prometheus [histogram metrics](https://prometheus.io/docs/practices/histograms/)
+for read and write operations:
+
+- `teleport_backend_read_seconds_bucket`
+- `teleport_backend_write_seconds_bucket`
+- `teleport_backend_batch_write_seconds_bucket`
+- `teleport_backend_batch_read_seconds_bucket`
+- `teleport_backend_atomic_write_seconds_bucket`
+
+The backend throughput metrics discussed in the previous section map onto
+latency metrics. Whenever the Auth Service increments one of the throughput
+metrics, it reports one of the corresponding latency metrics. See the table
+below for which throughput metrics map to which latency metrics. Each metric
+name excludes the standard prefixes and suffixes.
+
+|Throughput|Latency|
+|---|---|
+|`read_requests`|`read_seconds_bucket`|
+|`write_requests`|`write_seconds_bucket`|
+|`batch_read_requests`|`batch_read_seconds_bucket`|
+|`batch_write_requests`|`batch_write_seconds_bucket`|
+|`atomic_write_requests`|`atomic_write_seconds_bucket`|
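+
+For example, you can estimate backend write latency percentiles from these
+histograms with a query similar to the following (a sketch; the 0.95 quantile
+and the five-minute rate window are example values to adjust for your
+environment):
+
+```text
+histogram_quantile(0.95, sum(rate(teleport_backend_write_seconds_bucket[5m])) by (le))
+```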
+
+## Agents and connected resources
+
+To enable users to access most infrastructure with Teleport, you must join a
+[Teleport Agent](../../../enroll-resources/agents/agents.mdx) to your Teleport
+cluster and configure it to proxy your infrastructure. In a typical setup, an
+Agent establishes an SSH reverse tunnel with the Proxy Service. User traffic to
+Teleport-protected resources flows through the Proxy Service, an Agent, and
+finally the infrastructure resource the Agent proxies. Return traffic from the
+resource takes this path in reverse.
+
+### Number of connected resources by type
+
+Teleport-connected resources periodically send heartbeat (keepalive) messages to
+the Auth Service. The Auth Service uses these heartbeats to track the number of
+Teleport-protected resources by type with the `teleport_connected_resources`
+metric.
+
+The Auth Service tracks this metric for the following resources:
+
+- SSH servers
+- Kubernetes clusters
+- Applications
+- Databases
+- Teleport Database Service instances
+- Windows desktops
+
+You can use this metric to:
+- Compare the number of resources that are protected by Teleport with those that
+  are not so you can plan your Teleport rollout, e.g., by configuring [Auto
+  Discovery](../../../enroll-resources/auto-discovery/auto-discovery.mdx).
+- Correlate changes in Teleport usage with resource utilization on Auth Service
+  and Proxy Service compute instances to determine scaling needs.
+
+You can include this query in your Grafana configuration to break this metric
+down by resource type:
+
+```text
+sum(teleport_connected_resources) by (type)
+```
+
+### Reverse tunnels by type
+
+Every Teleport service that starts up establishes an SSH reverse tunnel to the
+Proxy Service. (Self-hosted clusters can configure Agent services to connect to
+the Auth Service directly without establishing a reverse tunnel.) The Proxy
+Service tracks the number of reverse tunnels using the metric,
+`teleport_reverse_tunnels_connected`.
+
+With an improperly scaled Proxy Service pool, the Proxy Service can become a
+bottleneck for traffic to Teleport-protected resources. If Proxy Service
+instances display heavy utilization of compute resources while the number of
+connected infrastructure resources is high, you can consider scaling out your
+Proxy Service pool and using [Proxy Peering](../operations/proxy-peering.mdx).
+
+Use the following Grafana query to track the maximum number of reverse tunnels
+by type over a given interval:
+
+```text
+max(teleport_reverse_tunnels_connected) by (type)
+```
+
+### Count and version of Teleport Agents
+
+Alongside the number of connected resources and reverse tunnels, you can track
+the number of Agents in your Teleport cluster. Since you can run multiple
+Teleport services on a single Agent instance, this metric helps you understand
+the architecture of your Teleport Agent deployment so you can diagnose issues
+with resource utilization.
+
+At regular intervals (around 7 seconds with jitter), the Auth Service refreshes
+its count of registered Agents. You can measure this count with the metric,
+`teleport_registered_servers`. To get the number of registered Agents by
+version, you can use this query in Grafana:
-
+```text
+sum by (version)(teleport_registered_servers)
+```
- Teleport Cloud does not expose monitoring endpoints for the Auth Service and Proxy Service.
+Since this metric is grouped by version, you can also tell how many of your
+Agents are behind the version of the Auth Service and Proxy Service, which can
+help you identify any that are at risk of violating the Teleport [version
+compatibility guarantees](../../../upgrading/overview.mdx).
-
+We strongly encourage self-hosted Teleport users to enroll their Agents in
+automatic updates. You can track the count of Teleport Agents that are not
+enrolled in automatic updates using the metric, `teleport_enrolled_in_upgrades`.
+[Read the documentation](../../../upgrading/automatic-agent-updates.mdx) for how
+to enroll Agents in automatic updates.
-(!docs/pages/includes/metrics.mdx!)
\ No newline at end of file

From 2d3c2f32d22658b04f8049c01d1e04b4646734bd Mon Sep 17 00:00:00 2001
From: Paul Gottschling
Date: Wed, 2 Oct 2024 10:48:44 -0400
Subject: [PATCH 2/2] Respond to evanfreed feedback

- Describe `backend_write_requests_failed_precondition_total`
- Include the precondition metric in the write availability formula.
- Turn the `registered_servers` discussion into a discussion of Teleport instance version, since it's not possible to group this metric by service and subtract the count of Auth Service/Proxy Service instances from the count of all registered services.
---
 .../management/diagnostics/metrics.mdx | 60 ++++++++++++-------
 1 file changed, 37 insertions(+), 23 deletions(-)

diff --git a/docs/pages/admin-guides/management/diagnostics/metrics.mdx b/docs/pages/admin-guides/management/diagnostics/metrics.mdx
index 037c24cb3c2bc..3888390bcd795 100644
--- a/docs/pages/admin-guides/management/diagnostics/metrics.mdx
+++ b/docs/pages/admin-guides/management/diagnostics/metrics.mdx
@@ -67,11 +67,31 @@ The following backend operation metrics are available:
 |Delete a range of items|`batch_write_requests`|
 |Update the keepalive status of an item|`write_requests`|
 
-You can use these metrics to define an availability formula, i.e., the
-percentage of reads or writes that succeeded. Take the sum of requests that
-succeeded (including batch requests) over the total sum of requests, multiplied
-by 100. If your backend begins to appear unavailable, you can investigate your
-backend infrastructure.
+During failed backend writes, a Teleport process also increments the
+`backend_write_requests_failed_precondition_total` metric if the cause of the
+failure is expected. For example, the metric increments during a create
+operation if a record already exists, during an update or delete operation if
+the record is not found, and during an atomic write if the resource was modified
+concurrently. All of these conditions can hold in a well-functioning Teleport
+cluster.
+
+`backend_write_requests_failed_precondition_total` increments whenever
+`backend_write_requests_failed_total` increments, and you can use it to
+distinguish potentially expected write failures from unexpected, problematic
+ones.
+
+You can use backend operation metrics to define an availability formula, i.e.,
+the percentage of reads or writes that succeeded. For example, in Prometheus,
+you can define a query similar to the following. This takes the percentage of
+write requests that failed for unexpected reasons and subtracts it from 1 to get
+a percentage of successful writes:
+
+```
+1 - (
+  sum(rate(teleport_backend_write_requests_failed_total[5m]))
+  -
+  sum(rate(teleport_backend_write_requests_failed_precondition_total[5m]))
+) / sum(rate(teleport_backend_write_requests_total[5m]))
+```
+
+If your backend begins to appear unavailable, you can investigate your backend
+infrastructure.
 
 ### Backend operation performance
 
@@ -127,8 +147,7 @@ The Auth Service tracks this metric for the following resources:
 
 You can use this metric to:
 - Compare the number of resources that are protected by Teleport with those that
-  are not so you can plan your Teleport rollout, e.g., by configuring [Auto
-  Discovery](../../../enroll-resources/auto-discovery/auto-discovery.mdx).
+  are not so you can plan your Teleport rollout.
 - Correlate changes in Teleport usage with resource utilization on Auth Service
   and Proxy Service compute instances to determine scaling needs.
@@ -160,31 +179,26 @@ by type over a given interval:
 max(teleport_reverse_tunnels_connected) by (type)
 ```
 
-### Count and version of Teleport Agents
-
-Alongside the number of connected resources and reverse tunnels, you can track
-the number of Agents in your Teleport cluster. Since you can run multiple
-Teleport services on a single Agent instance, this metric helps you understand
-the architecture of your Teleport Agent deployment so you can diagnose issues
-with resource utilization.
+## Teleport instance versions
 
 At regular intervals (around 7 seconds with jitter), the Auth Service refreshes
-its count of registered Agents. You can measure this count with the metric,
-`teleport_registered_servers`. To get the number of registered Agents by
-version, you can use this query in Grafana:
+its count of registered Teleport instances, including Agents and Teleport
+processes that run the Auth Service and Proxy Service. You can measure this
+count with the metric, `teleport_registered_servers`. To get the number of
+registered instances by version, you can use this query in Grafana:
 
 ```text
 sum by (version)(teleport_registered_servers)
 ```
 
-Since this metric is grouped by version, you can also tell how many of your
-Agents are behind the version of the Auth Service and Proxy Service, which can
-help you identify any that are at risk of violating the Teleport [version
-compatibility guarantees](../../../upgrading/overview.mdx).
+You can use this metric to tell how many of your registered Teleport instances
+are behind the version of the Auth Service and Proxy Service, which can help you
+identify any that are at risk of violating the Teleport [version compatibility
+guarantees](../../../upgrading/overview.mdx).
 
 We strongly encourage self-hosted Teleport users to enroll their Agents in
 automatic updates. You can track the count of Teleport Agents that are not
 enrolled in automatic updates using the metric, `teleport_enrolled_in_upgrades`.
-[Read the documentation](../../../upgrading/automatic-agent-updates.mdx) for how
-to enroll Agents in automatic updates.
+[Read the documentation](../../../upgrading.mdx) for how to enroll Agents in
+automatic updates.
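+
+As with the other metrics above, you can chart this count in Grafana with a
+query similar to the following (a sketch; adjust the aggregation to match how
+you label and collect metrics in your environment):
+
+```text
+sum(teleport_enrolled_in_upgrades)
+```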