Add a guide to metrics for monitoring Teleport
Closes #40664

This change turns the Metrics guide in `admin-guides` into a conceptual
guide to the most important metrics for monitoring a Teleport cluster.

Since Agent metrics have inconsistent comprehensiveness across Teleport
services--and to reduce the scope of this change--this guide focuses on
self-hosted clusters.

To make this a conceptual guide instead of a reference, this change
removes the reference table from the `admin-guides` metrics page. There
is already a table in the dedicated metrics reference guide.

Note that, while the new metrics guide is specific to self-hosted
clusters, this change does not move the guide to the subsection of Admin
Guides related to self-hosting Teleport. Doing this would mean having
one subsection of Admin Guides for diagnostics-related guides and one
subsection for self-hosted-specific diagnostics, which is potentially
confusing. We may also want to add Agent-specific metrics eventually.

Finally, this change does not include alert thresholds for the metrics
it describes. We can define these in a subsequent change.
ptgott committed Sep 30, 2024
1 parent 56b1bab commit 4d69a4f
Showing 1 changed file with 175 additions and 10 deletions.
185 changes: 175 additions & 10 deletions docs/pages/admin-guides/management/diagnostics/metrics.mdx
@@ -1,11 +1,20 @@
---
title: Metrics
description: How to enable and consume metrics
title: Key Metrics for Self-Hosted Clusters
description: Describes important metrics to monitor if you are self-hosting Teleport.
tocDepth: 3
---

Teleport exposes metrics for all of its components, helping you get insight
into the state of your cluster. This guide explains the metrics that you can
collect from your Teleport cluster.
This guide explains the metrics you should use to get started monitoring your
self-hosted Teleport cluster, focusing on metrics reported by the Auth Service
and Proxy Service. If you use Teleport Enterprise (Cloud), the Teleport team
monitors and responds to these metrics for you.

For a reference of all available metrics, see the [Teleport Metrics
Reference](../../../reference/monitoring/metrics.mdx).

This guide assumes that you already monitor compute resources on all instances
that run the Teleport Auth Service and Proxy Service (e.g., CPU, memory, disk,
bandwidth, and open file descriptors).

## Enabling metrics

@@ -14,12 +23,168 @@ collect from your Teleport cluster.
This will enable the `http://127.0.0.1:3000/metrics` endpoint, which serves the
metrics that Teleport tracks. It is compatible with [Prometheus](https://prometheus.io/) collectors.

The following metrics are available:
## Backend operations

A Teleport cluster cannot function if the Auth Service does not have a healthy
cluster state backend. You need to track the ability of the Auth Service to read
from and write to its backend.

The Auth Service can connect to [several possible
backends](../../../reference/backends.mdx). In addition to Teleport backend
metrics, you should set up monitoring for your backend of choice so that, if
these metrics show problematic values, you can correlate them with metrics on
your backend infrastructure.

### Backend operation throughput and availability

On each backend operation, the Auth Service increments a metric. Backend
operation metrics have the following format:

```text
teleport_backend_<METRIC_NAME>[_failed]_total
```

If an operation results in an error, the Auth Service adds the `_failed` segment
to the metric name. For example, successfully creating a record increments the
`teleport_backend_write_requests_total` metric. If the create operation fails,
the Auth Service increments `teleport_backend_write_requests_failed_total`
instead.

The following backend operation metrics are available:

|Operation|Incremented metric name|
|---|---|
|Create an item|`write_requests`|
|Modify an item, creating it if it does not exist|`write_requests`|
|Update an item|`write_requests`|
|Conditionally update an item if versions match|`write_requests`|
|List a range of items|`batch_read_requests`|
|Get a single item|`read_requests`|
|Compare and swap items|`write_requests`|
|Delete an item|`write_requests`|
|Conditionally delete an item if versions match|`write_requests`|
|Write a batch of updates atomically, failing the write if any update fails|Both `write_requests` and `atomic_write_requests`|
|Delete a range of items|`batch_write_requests`|
|Update the keepalive status of an item|`write_requests`|

You can use these metrics to define an availability formula, i.e., the
percentage of reads or writes that succeeded. Divide the number of requests
that succeeded (including batch requests) by the total number of requests, both
succeeded and failed, and multiply the result by 100. If your backend begins to
appear unavailable, you can investigate your backend infrastructure.
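
As a rough sketch, the availability formula for write operations might look
like the following PromQL query. It assumes, as described above, that a failed
operation increments only the `_failed` counter, so the total is the sum of both
counters. The 5-minute window is an arbitrary choice, and `or vector(0)` is
there so the query still returns a value when no failures have been recorded.
Batch and read operations could be folded in the same way, assuming their
counters follow the naming format above.

```text
# Write availability (%) over the last 5 minutes.
# Assumption: failed writes increment only the *_failed_total counter.
100 *
  sum(rate(teleport_backend_write_requests_total[5m]))
/
  (
    sum(rate(teleport_backend_write_requests_total[5m]))
    + (sum(rate(teleport_backend_write_requests_failed_total[5m])) or vector(0))
  )
```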

### Backend operation performance

To help you track backend operation performance, the Auth Service also exposes
Prometheus [histogram metrics](https://prometheus.io/docs/practices/histograms/)
for read and write operations:

- `teleport_backend_read_seconds_bucket`
- `teleport_backend_write_seconds_bucket`
- `teleport_backend_batch_write_seconds_bucket`
- `teleport_backend_batch_read_seconds_bucket`
- `teleport_backend_atomic_write_seconds_bucket`

Each backend throughput metric discussed in the previous section maps onto a
latency metric. Whenever the Auth Service increments a throughput metric, it
also records an observation in the corresponding latency histogram. The table
below shows which throughput metrics map to which latency metrics. Metric names
in the table omit the `teleport_backend_` prefix, and throughput metric names
also omit the `_total` suffix.

|Throughput|Latency|
|---|---|
|`read_requests`|`read_seconds_bucket`|
|`write_requests`|`write_seconds_bucket`|
|`batch_read_requests`|`batch_read_seconds_bucket`|
|`batch_write_requests`|`batch_write_seconds_bucket`|
|`atomic_write_requests`|`atomic_write_seconds_bucket`|
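
One way to turn these histograms into a latency figure is to estimate a
percentile with the PromQL `histogram_quantile` function. The query below is a
minimal sketch for write operations. The 0.99 quantile and the 5-minute window
are illustrative assumptions, not recommended thresholds.

```text
# Approximate 99th-percentile backend write latency, in seconds,
# over the last 5 minutes, aggregated across all Auth Service instances.
histogram_quantile(
  0.99,
  sum by (le) (rate(teleport_backend_write_seconds_bucket[5m]))
)
```

The same pattern should apply to the other histograms listed above if you swap
in the relevant `_bucket` metric name.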

## Agents and connected resources

To enable users to access most infrastructure with Teleport, you must join a
[Teleport Agent](../../../enroll-resources/agents/agents.mdx) to your Teleport
cluster and configure it to proxy your infrastructure. In a typical setup, an
Agent establishes an SSH reverse tunnel with the Proxy Service. User traffic to
Teleport-protected resources flows through the Proxy Service, an Agent, and
finally the infrastructure resource the Agent proxies. Return traffic from the
resource takes this path in reverse.

### Number of connected resources by type

Teleport-connected resources periodically send heartbeat (keepalive) messages to
the Auth Service. The Auth Service uses these heartbeats to track the number of
Teleport-protected resources by type with the `teleport_connected_resources`
metric.

The Auth Service tracks this metric for the following resources:

- SSH servers
- Kubernetes clusters
- Applications
- Databases
- Teleport Database Service instances
- Windows desktops

You can use this metric to:

- Compare the number of resources that are protected by Teleport with those that
  are not so you can plan your Teleport rollout, e.g., by configuring [Auto
  Discovery](../../../enroll-resources/auto-discovery/auto-discovery.mdx).
- Correlate changes in Teleport usage with resource utilization on Auth Service
  and Proxy Service compute instances to determine scaling needs.

You can include this query in your Grafana configuration to break this metric
down by resource type:

```text
sum(teleport_connected_resources) by (type)
```

### Reverse tunnels by type

Every Teleport service that starts up establishes an SSH reverse tunnel to the
Proxy Service. (Self-hosted clusters can configure Agent services to connect to
the Auth Service directly without establishing a reverse tunnel.) The Proxy
Service tracks the number of reverse tunnels using the metric,
`teleport_reverse_tunnels_connected`.

With an improperly scaled Proxy Service pool, the Proxy Service can become a
bottleneck for traffic to Teleport-protected resources. If Proxy Service
instances display heavy utilization of compute resources while the number of
connected infrastructure resources is high, you can consider scaling out your
Proxy Service pool and using [Proxy Peering](../operations/proxy-peering.mdx).

Use the following Grafana query to track the maximum number of reverse tunnels
by type over a given interval:

```text
max(teleport_reverse_tunnels_connected) by (type)
```
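
If you want the interval to be explicit in the query rather than taken from the
dashboard's time range, one option is to wrap the metric in `max_over_time`.
This is a sketch, and the 1-hour window is an arbitrary assumption:

```text
# Highest number of reverse tunnels seen per type during the last hour.
max by (type) (max_over_time(teleport_reverse_tunnels_connected[1h]))
```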

### Count and version of Teleport Agents

Alongside the number of connected resources and reverse tunnels, you can track
the number of Agents in your Teleport cluster. Since you can run multiple
Teleport services on a single Agent instance, this metric helps you understand
the architecture of your Teleport Agent deployment so you can diagnose issues
with resource utilization.

At regular intervals (around 7 seconds with jitter), the Auth Service refreshes
its count of registered Agents. You can measure this count with the metric,
`teleport_registered_servers`. To get the number of registered Agents by
version, you can use this query in Grafana:

<Notice scope={["cloud"]} type="tip">
```text
sum by (version)(teleport_registered_servers)
```

Teleport Cloud does not expose monitoring endpoints for the Auth Service and Proxy Service.
Since this metric is grouped by version, you can also tell how many of your
Agents are behind the version of the Auth Service and Proxy Service, which can
help you identify any that are at risk of violating the Teleport [version
compatibility guarantees](../../../upgrading/overview.mdx).

</Notice>
We strongly encourage self-hosted Teleport users to enroll their Agents in
automatic updates. You can track the count of Teleport Agents that are not
enrolled in automatic updates using the metric, `teleport_enrolled_in_upgrades`.
[Read the documentation](../../../upgrading/automatic-agent-updates.mdx) for how
to enroll Agents in automatic updates.

(!docs/pages/includes/metrics.mdx!)
