---
title: Monitoring your Cluster
description: Monitoring your Teleport deployment
layout: tocless-doc
---

Teleport provides health checking mechanisms to verify that it is healthy and ready to serve traffic.
Metrics, tracing, and profiling provide in-depth data for tracking cluster performance and responsiveness.

## Enable health monitoring

This section shows you how to monitor the health of a Teleport instance.

(!docs/pages/includes/diagnostics/diag-addr-prereqs-tabs.mdx!)

Now you can collect monitoring information from several endpoints. These can be used by systems such as
Kubernetes probes to monitor the health of a Teleport process.

### `/healthz`

The `http://127.0.0.1:3000/healthz` endpoint responds with a body of
`{"status":"ok"}` and an HTTP 200 OK status code if the process is running.

This check indicates only whether the Teleport process is still running, which
makes it suitable for use as a liveness probe.
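
For example, you could point a Kubernetes liveness probe at this endpoint. The following is a minimal sketch, assuming the diagnostics service listens on port 3000 and is reachable by the kubelet:

```yaml
# A sketch of a Kubernetes liveness probe for a Teleport container.
# Assumes the diagnostics address is reachable from outside the
# container loopback (e.g. diag-addr is set to 0.0.0.0:3000).
livenessProbe:
  httpGet:
    path: /healthz
    port: 3000
  initialDelaySeconds: 5
  periodSeconds: 10
```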

### `/readyz`

The `http://127.0.0.1:3000/readyz` endpoint is similar to `/healthz`, but its
response includes information about the state of the process.

The response body is a JSON object of the form:

```
{ "status": "a status message here"}
```

#### `/readyz` and heartbeats

If a Teleport component fails to execute its heartbeat procedure, it will enter
a degraded state. Teleport will begin recovering from this state when a
heartbeat completes successfully.

The first successful heartbeat will transition Teleport into a recovering state. A second consecutive
successful heartbeat will cause Teleport to transition to the OK state.

Teleport heartbeats run approximately every 60 seconds when healthy, and failed
heartbeats are retried approximately every 5 seconds. This means that depending
on the timing of heartbeats, it can take 60-70 seconds after connectivity is
restored for `/readyz` to start reporting healthy again.

#### Status codes

The status code of the response can be one of:

- HTTP 200 OK: Teleport is operating normally.
- HTTP 503 Service Unavailable: Teleport has encountered a connection error and
is running in a degraded state. This happens when a Teleport heartbeat fails.
- HTTP 400 Bad Request: Teleport is either entering its initial startup phase or
has begun recovering from a degraded state.
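
For example, you can inspect both the response body and the status code with `curl` (a quick sketch; the output depends on the current process state):

```code
# Print the readiness status followed by the HTTP status code.
$ curl -s -w "\n%{http_code}\n" http://127.0.0.1:3000/readyz
```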

The same state information is also available via the `process_state` metric
under the `/metrics` endpoint.
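
For example, the following sketch retrieves the current value of that metric:

```code
# Fetch the process_state metric from the Prometheus endpoint.
$ curl -s http://127.0.0.1:3000/metrics | grep '^process_state'
```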

## Metrics

This section shows you how to enable exporting Prometheus metrics.

Teleport exposes metrics for all of its components, helping you get insight
into the state of your cluster. This guide explains the metrics that you can
collect from your Teleport cluster.

### Enabling metrics

(!docs/pages/includes/diagnostics/diag-addr-prereqs-tabs.mdx!)

This will enable the `http://127.0.0.1:3000/metrics` endpoint, which serves the
metrics that Teleport tracks. It is compatible with [Prometheus](https://prometheus.io/) collectors.
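
As a sketch, a minimal Prometheus scrape configuration for this endpoint might look like the following; the job name and target address are placeholders for your deployment:

```yaml
# A minimal sketch of a Prometheus scrape job for the Teleport
# diagnostics endpoint; adjust the target to match your diag-addr.
scrape_configs:
  - job_name: teleport
    metrics_path: /metrics
    static_configs:
      - targets: ["127.0.0.1:3000"]
```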

The following metrics are available:

<Notice scope={["cloud"]} type="tip">

Cloud-hosted Teleport Enterprise does not expose monitoring endpoints for the Auth Service and Proxy Service.


</Notice>

(!docs/pages/includes/metrics.mdx!)


## Distributed tracing

This section shows you how to enable distributed tracing for a Teleport instance.

Teleport leverages [OpenTelemetry](https://opentelemetry.io/) to generate traces
and export them to any [OpenTelemetry Protocol (OTLP)](https://opentelemetry.io/docs/reference/specification/protocol/otlp/)
capable exporter. If your telemetry backend doesn't support receiving OTLP traces, you may be able to
use the [OpenTelemetry Collector](https://opentelemetry.io/docs/collector/) to proxy traces from OTLP
to a format that your telemetry backend accepts.
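
For example, a minimal OpenTelemetry Collector pipeline that receives OTLP traces over gRPC and forwards them over OTLP/HTTP might look like the following sketch; the backend endpoint is a placeholder, and available exporter names vary across Collector versions:

```yaml
# A sketch of an OpenTelemetry Collector pipeline that proxies
# OTLP traces to another backend; endpoints are placeholders.
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
exporters:
  otlphttp:
    endpoint: https://telemetry.example.com:4318
service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [otlphttp]
```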

### Configure Teleport

To enable tracing for a `teleport` instance, add the following section to that instance's configuration file (`/etc/teleport.yaml`).
For a detailed description of these configuration fields, see the [configuration reference](./reference/configuration.mdx) page.

```yaml
tracing_service:
  enabled: yes
  exporter_url: grpc://collector.example.com:4317
  sampling_rate_per_million: 1000000
```

#### Sampling rate

Choose the sampling rate carefully: sampling at a rate of 100% can negatively impact the
performance of your cluster. Note that Teleport honors the sampling rate included in incoming requests.
Even when the `tracing_service` is enabled with a sampling rate of 0, if Teleport receives a request containing a
sampled span, it samples and exports all spans that are generated in response to that request.
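
For example, to sample roughly 1% of traces, you would set the rate to 10,000 out of 1,000,000:

```yaml
tracing_service:
  enabled: yes
  exporter_url: grpc://collector.example.com:4317
  # 10,000 spans per 1,000,000, or 1% of traces.
  sampling_rate_per_million: 10000
```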

#### Exporter URL

The `exporter_url` setting indicates where Teleport should send spans. Supported schemes are `grpc://`, `http://`,
`https://`, and `file://` (if no scheme is provided, `grpc://` is used).

When using `file://`, the URL must be a path to a directory that Teleport has write permissions for. Spans are saved to files within
the provided directory, with each file containing one proto-encoded span per line. Files are rotated after exceeding 100MB. To
override the default limit, add `?limit=<desired_file_size_in_bytes>` to the `exporter_url` (for example, `file:///var/lib/teleport/traces?limit=100`).

By default, the connection to the exporter is insecure. To configure TLS, add the following to the `tracing_service` configuration:

```yaml
# Optional paths to CA certificates used to validate the exporter.
ca_certs:
  - /var/lib/teleport/exporter_ca.pem
# Optional paths to TLS key pairs used to enable mTLS with the exporter.
https_keypairs:
  - key_file: /var/lib/teleport/exporter_key.pem
    cert_file: /var/lib/teleport/exporter_cert.pem
```

After updating `teleport.yaml`, start your `teleport` instance to apply the new configuration.

### tsh

To capture traces from `tsh`, add the `--trace` flag to your command. All traces generated by `tsh --trace` are
proxied to the `exporter_url` defined for the Auth Service of the cluster that the command connects to.

```code
$ tsh --trace ssh root@myserver
$ tsh --trace ls
```

You can also export traces from `tsh` to a different exporter than the one defined in the Auth Service configuration
via the `--trace-exporter` flag. The URL provided must adhere to the same format as the `exporter_url` of the
`tracing_service`.

```code
$ tsh --trace --trace-exporter=grpc://collector.example.com:4317 ssh root@myserver
$ tsh --trace --trace-exporter=file:///var/lib/teleport/traces ls
```

## Collecting profiles

This section shows you how to collect runtime profiling data from a Teleport instance.

Teleport leverages Go's diagnostic capabilities to collect and export
profiling data. Profiles can help identify the cause of spikes in CPU usage,
the source of memory leaks, or the reason for a deadlock.

### Using the Debug Service

The Teleport Debug Service enables administrators to collect diagnostic profiles
without enabling pprof endpoints at startup. The service, which is enabled by
default, allows local access only and must be consumed from the same host that
runs the Teleport instance.

`teleport debug profile` collects a set of pprof profiles. It outputs a
compressed tarball (`.tar.gz`) to STDOUT, which you can decompress using `tar`
or direct to a file.

By default, it collects the `goroutine`, `heap`, and `profile` profiles.

Each collected profile has a corresponding file inside the tarball. For
example, collecting `goroutine,trace,heap` results in `goroutine.pprof`,
`trace.pprof`, and `heap.pprof` files.

```code
# Collect the default profiles and save them to a file, then extract them.
$ teleport debug profile > pprof.tar.gz
$ tar xvf pprof.tar.gz

# Collect the default profiles and decompress them directly.
$ teleport debug profile | tar xzv -C ./

# Collect "trace" and "mutex" profiles and save them to a file.
$ teleport debug profile trace,mutex > pprof.tar.gz

# Collect a "trace" profile, setting the profiling time to 20 seconds.
$ teleport debug profile -s 20 trace > pprof.tar.gz
```

(!docs/pages/includes/diagnostics/teleport-debug-config.mdx!)

If you're running Teleport on a Kubernetes cluster, you can collect
profiles to a local directory without an interactive session:

```code
$ kubectl -n teleport exec my-pod -- teleport debug profile > pprof.tar.gz
```

After extracting the contents, you can use `go tool` commands to explore and
visualize them:

```code
# Opens the terminal interactive explorer
$ go tool pprof heap.pprof

# Opens the web visualizer
$ go tool pprof -http : heap.pprof

# Visualize trace profiles
$ go tool trace trace.pprof
```

### Using diagnostics endpoints

The profiling endpoint is only enabled if the `--debug` flag is supplied.

(!docs/pages/includes/diagnostics/diag-addr-prereqs-tabs.mdx flags="--debug" !)

#### Collecting profiles

Go's standard profiling endpoints are served at `http://127.0.0.1:3000/debug/pprof/`.
Retrieving a profile requires sending a request to the endpoint corresponding
to the desired profile type. When debugging an issue, it is helpful to collect
a series of profiles over a period of time.
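
For example, a small shell loop can capture a series of heap profiles a minute apart (a sketch; the count and interval are arbitrary):

```code
# Capture five heap profiles, one per minute.
$ for i in 1 2 3 4 5; do curl -o heap-$i.profile http://127.0.0.1:3000/debug/pprof/heap; sleep 60; done
```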

##### CPU

The CPU profile shows execution statistics gathered over a user-specified period:

```code
# Download the profile into a file:
$ curl -o cpu.profile http://127.0.0.1:3000/debug/pprof/profile?seconds=30

# Visualize the profile
$ go tool pprof -http : cpu.profile
```

##### Goroutine

Goroutine profiles show the stack traces for all running goroutines in the system:

```code
# Download the profile into a file:
$ curl -o goroutine.profile http://127.0.0.1:3000/debug/pprof/goroutine

# Visualize the profile
$ go tool pprof -http : goroutine.profile
```

##### Heap

Heap profiles show allocated objects in the system:

```code
# Download the profile into a file:
$ curl -o heap.profile http://127.0.0.1:3000/debug/pprof/heap

# Visualize the profile
$ go tool pprof -http : heap.profile
```

##### Trace

Trace profiles capture scheduling, system calls, garbage collections, heap size, and other events collected by the Go runtime
over a user-specified period of time:

```code
# Download the profile into a file:
$ curl -o trace.out http://127.0.0.1:3000/debug/pprof/trace?seconds=5

# Visualize the profile
$ go tool trace trace.out
```

## Further Reading

- [More information about diagnostics in the Go ecosystem](https://go.dev/doc/diagnostics)
- [Go's profiling endpoints](https://golang.org/pkg/net/http/pprof/)
- [A deep dive on profiling Go programs](https://go.dev/blog/pprof)
