---
title: Monitoring your Cluster
description: Monitoring your Teleport deployment
layout: tocless-doc
---

- [Health Monitoring](./diagnostics/monitoring.mdx): How to monitor the health of a Teleport instance.
- [Metrics](./diagnostics/metrics.mdx): How to enable exporting Prometheus metrics.
- [Collecting Profiles](./diagnostics/profiles.mdx): How to collect runtime profiling data from a Teleport instance.
- [Distributed Tracing](./diagnostics/tracing.mdx): How to enable distributed tracing for a Teleport instance.

Teleport provides health checking mechanisms to verify that it is healthy and ready to serve traffic.
Metrics, tracing, and profiling provide in-depth data for tracking cluster performance and responsiveness.

## Enable health monitoring

This section explains how to monitor the health of a Teleport instance.

(!docs/pages/includes/diagnostics/diag-addr-prereqs-tabs.mdx!)

Now you can collect monitoring information from several endpoints. These can be consumed by tools
such as Kubernetes probes to monitor the health of a Teleport process.

## `/healthz`

The `http://127.0.0.1:3000/healthz` endpoint responds with a body of
`{"status":"ok"}` and an HTTP 200 OK status code if the process is running.

This is a simple check, suitable for determining if the Teleport process is
still running.
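
For example, you can query the endpoint with `curl` (the address assumes the diagnostics
configuration above; the output shown is what a running process returns):

```code
$ curl http://127.0.0.1:3000/healthz
{"status":"ok"}
```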

## `/readyz`

The `http://127.0.0.1:3000/readyz` endpoint is similar to `/healthz`, but its
response includes information about the state of the process.

The response body is a JSON object of the form:

```json
{"status": "a status message here"}
```

### `/readyz` and heartbeats

If a Teleport component fails to execute its heartbeat procedure, it will enter
a degraded state. Teleport will begin recovering from this state when a
heartbeat completes successfully.

The first successful heartbeat will transition Teleport into a recovering state. A second consecutive
successful heartbeat will cause Teleport to transition to the OK state.

Teleport heartbeats run approximately every 60 seconds when healthy, and failed
heartbeats are retried approximately every 5 seconds. This means that depending
on the timing of heartbeats, it can take 60-70 seconds after connectivity is
restored for `/readyz` to start reporting healthy again.
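
To observe these state transitions as they happen, you can poll the endpoint, for example
with `watch` (the one-second interval here is illustrative):

```code
# Poll /readyz every second to watch the degraded -> recovering -> ok transitions
$ watch -n 1 curl -s http://127.0.0.1:3000/readyz
```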

### Status codes

The status code of the response can be one of:

- HTTP 200 OK: Teleport is operating normally
- HTTP 503 Service Unavailable: Teleport has encountered a connection error and
  is running in a degraded state. This happens when a Teleport heartbeat fails.
- HTTP 400 Bad Request: Teleport is either entering its initial startup phase or
  has begun recovering from a degraded state.

The same state information is also available via the `process_state` metric
under the `/metrics` endpoint.
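
Because `/readyz` reflects the process state in its status code, it is a natural target for a
Kubernetes readiness probe. A minimal sketch, assuming the diagnostics address above is exposed
on port 3000 of the container (adjust the port and thresholds for your deployment):

```yaml
readinessProbe:
  httpGet:
    path: /readyz
    port: 3000
  periodSeconds: 10
  failureThreshold: 3
```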

## Metrics

This section explains how to enable exporting Prometheus metrics.

Teleport exposes metrics for all of its components, helping you get insight
into the state of your cluster. This guide explains the metrics that you can
collect from your Teleport cluster.

## Enabling metrics

(!docs/pages/includes/diagnostics/diag-addr-prereqs-tabs.mdx!)

This will enable the `http://127.0.0.1:3000/metrics` endpoint, which serves the
metrics that Teleport tracks. It is compatible with [Prometheus](https://prometheus.io/) collectors.
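
If you run Prometheus yourself, a minimal scrape configuration for this endpoint might look like
the following sketch (the job name and target address are illustrative):

```yaml
scrape_configs:
  - job_name: teleport
    metrics_path: /metrics
    static_configs:
      - targets: ['127.0.0.1:3000']
```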

The following metrics are available:

<Notice scope={["cloud"]} type="tip">

Teleport Cloud does not expose monitoring endpoints for the Auth Service and Proxy Service.

</Notice>

(!docs/pages/includes/metrics.mdx!)

## Distributed tracing

This section explains how to enable distributed tracing for a Teleport instance.

Teleport leverages [OpenTelemetry](https://opentelemetry.io/) to generate traces
and export them to any [OpenTelemetry Protocol (OTLP)](https://opentelemetry.io/docs/reference/specification/protocol/otlp/)
capable exporter. If your telemetry backend doesn't support receiving OTLP traces, you may be able to
use the [OpenTelemetry Collector](https://opentelemetry.io/docs/collector/) to proxy traces from OTLP
to a format that your telemetry backend accepts.
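
For instance, a minimal OpenTelemetry Collector configuration that receives OTLP traces and
re-exports them to a Zipkin backend might look like this sketch (the Zipkin endpoint is a
placeholder for your own backend):

```yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

exporters:
  zipkin:
    endpoint: http://zipkin.example.com:9411/api/v2/spans

service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [zipkin]
```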

## Configure Teleport

To enable tracing for a `teleport` instance, add the following section to that instance's configuration file (`/etc/teleport.yaml`).
For a detailed description of these configuration fields, see the [configuration reference](./reference/configuration.mdx) page.

```yaml
tracing_service:
  enabled: yes
  exporter_url: grpc://collector.example.com:4317
  sampling_rate_per_million: 1000000
```

### Sampling rate

It is important to choose the sampling rate wisely. Sampling at a rate of 100% could have a negative impact on the
performance of your cluster. Teleport honors the sampling rate included in any incoming requests: even when the
`tracing_service` is enabled and the sampling rate is 0, if Teleport receives a request that contains a sampled span,
Teleport will sample and export all spans that are generated in response to that request.
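
The rate is expressed per million requests, so, for example, sampling 1% of requests looks like
the following:

```yaml
tracing_service:
  enabled: yes
  exporter_url: grpc://collector.example.com:4317
  # 10,000 out of every 1,000,000 requests = 1%
  sampling_rate_per_million: 10000
```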

### Exporter URL

The `exporter_url` setting indicates where Teleport should send spans. Supported schemes are `grpc://`, `http://`,
`https://`, and `file://` (if no scheme is provided, then `grpc://` is used).

When using `file://`, the URL must be a path to a directory that Teleport has write permissions for. Spans will be saved to files within
the provided directory, each file containing one proto-encoded span per line. Files are rotated after exceeding 100MB; to
override the default limit, add `?limit=<desired_file_size_in_bytes>` to the `exporter_url` (e.g. `file:///var/lib/teleport/traces?limit=100`).

By default, the connection to the exporter is insecure. To support TLS, add the following to the `tracing_service` configuration:

```yaml
# Optional paths to CA certificates used to validate the exporter.
ca_certs:
  - /var/lib/teleport/exporter_ca.pem
# Optional paths to TLS key pairs used to enable mTLS for the exporter.
https_keypairs:
  - key_file: /var/lib/teleport/exporter_key.pem
    cert_file: /var/lib/teleport/exporter_cert.pem
```

After updating `teleport.yaml`, start your `teleport` instance to apply the new configuration.

## tsh

To capture traces from `tsh`, add the `--trace` flag to your command. All traces generated by `tsh --trace` will be
proxied to the `exporter_url` defined for the Auth Service of the cluster the command is run against.

```code
$ tsh --trace ssh root@myserver
$ tsh --trace ls
```

Exporting traces from `tsh` to a different exporter than the one defined in the Auth Service config
is also possible via the `--trace-exporter` flag. The provided URL must adhere to the same
format as the `exporter_url` of the `tracing_service`.

```code
$ tsh --trace --trace-exporter=grpc://collector.example.com:4317 ssh root@myserver
$ tsh --trace --trace-exporter=file:///var/lib/teleport/traces ls
```

## Collecting profiles

This section explains how to collect runtime profiling data from a Teleport instance.

Teleport leverages Go's diagnostic capabilities to collect and export
profiling data. Profiles can help identify the cause of spikes in CPU usage,
the source of memory leaks, or the reason for a deadlock.

## Using the Debug Service

The Teleport Debug Service enables administrators to collect diagnostic profiles
without enabling pprof endpoints at startup. The service, enabled by default,
accepts only local connections and must be consumed from inside the same instance.

`teleport debug profile` collects a list of pprof profiles. It outputs a
compressed tarball (`.tar.gz`) to STDOUT. You can decompress it using `tar` or
direct the result to a file.

By default, it collects the `goroutine`, `heap`, and `profile` profiles.

Each profile collected has a corresponding file inside the tarball. For
example, collecting `goroutine,trace,heap` results in `goroutine.pprof`,
`trace.pprof`, and `heap.pprof` files.

```code
# Collect default profiles and save them to a file.
$ teleport debug profile > pprof.tar.gz
# Extract the resulting tarball.
$ tar xvf pprof.tar.gz
# Collect default profiles and decompress them directly.
$ teleport debug profile | tar xzv -C ./
# Collect "trace" and "mutex" profiles and save them to a file.
$ teleport debug profile trace,mutex > pprof.tar.gz
# Collect the "trace" profile, setting the profiling time in seconds.
$ teleport debug profile -s 20 trace > pprof.tar.gz
```

(!docs/pages/includes/diagnostics/teleport-debug-config.mdx!)

If you're running Teleport on a Kubernetes cluster, you can directly collect
profiles to a local directory without an interactive session:

```code
$ kubectl -n teleport exec my-pod -- teleport debug profile > pprof.tar.gz
```

After extracting the contents, you can use `go tool` commands to explore and
visualize them:

```code
# Open the interactive terminal explorer
$ go tool pprof heap.pprof
# Open the web visualizer
$ go tool pprof -http : heap.pprof
# Visualize trace profiles
$ go tool trace trace.pprof
```

## Using diagnostics endpoints

The profiling endpoint is only enabled if the `--debug` flag is supplied.

(!docs/pages/includes/diagnostics/diag-addr-prereqs-tabs.mdx flags="--debug" !)

### Collecting profiles

Go's standard profiling endpoints are served at `http://127.0.0.1:3000/debug/pprof/`.
Retrieving a profile requires sending a request to the endpoint corresponding
to the desired profile type. When debugging an issue, it is helpful to collect
a series of profiles over a period of time.
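
For example, a small shell loop (illustrative; adjust the count and interval to your needs) can
capture a series of heap profiles for later comparison:

```code
# Capture five heap profiles, 30 seconds apart
$ for i in 1 2 3 4 5; do curl -o heap-$i.profile http://127.0.0.1:3000/debug/pprof/heap; sleep 30; done
```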

#### CPU

The CPU profile shows execution statistics gathered over a user-specified period:

```code
# Download the profile into a file:
$ curl -o cpu.profile http://127.0.0.1:3000/debug/pprof/profile?seconds=30
# Visualize the profile
$ go tool pprof -http : cpu.profile
```

#### Goroutine

Goroutine profiles show the stack traces for all running goroutines in the system:

```code
# Download the profile into a file:
$ curl -o goroutine.profile http://127.0.0.1:3000/debug/pprof/goroutine
# Visualize the profile
$ go tool pprof -http : goroutine.profile
```

#### Heap

Heap profiles show allocated objects in the system:

```code
# Download the profile into a file:
$ curl -o heap.profile http://127.0.0.1:3000/debug/pprof/heap
# Visualize the profile
$ go tool pprof -http : heap.profile
```

#### Trace

Trace profiles capture scheduling, system calls, garbage collections, heap size, and other events collected by the Go runtime
over a user-specified period of time:

```code
# Download the profile into a file:
$ curl -o trace.out http://127.0.0.1:3000/debug/pprof/trace?seconds=5
# Visualize the profile
$ go tool trace trace.out
```

## Further Reading

- More information about diagnostics in the Go ecosystem: https://go.dev/doc/diagnostics
- Go's profiling endpoints: https://golang.org/pkg/net/http/pprof/
- A deep dive on profiling Go programs: https://go.dev/blog/pprof