---
title: Monitoring your Cluster
description: Monitoring your Teleport deployment
layout: tocless-doc
---

- [Health Monitoring](./diagnostics/monitoring.mdx): How to monitor the health of a Teleport instance.
- [Metrics](./diagnostics/metrics.mdx): How to enable exporting Prometheus metrics.
- [Collecting Profiles](./diagnostics/profiles.mdx): How to collect runtime profiling data from a Teleport instance.
- [Distributed Tracing](./diagnostics/tracing.mdx): How to enable distributed tracing for a Teleport instance.

Teleport provides health checking mechanisms to verify that it is healthy and ready to serve traffic.
Metrics, tracing, and profiling provide in-depth data for tracking cluster performance and responsiveness.

## Enable health monitoring

This section explains how to monitor the health of a Teleport instance.

(!docs/pages/includes/diagnostics/diag-addr-prereqs-tabs.mdx!)

Now you can collect monitoring information from several endpoints. These can be consumed by tools
such as Kubernetes probes to monitor the health of a Teleport process.

## `/healthz`

The `http://127.0.0.1:3000/healthz` endpoint responds with a body of
`{"status":"ok"}` and an HTTP 200 OK status code if the process is running.

This is a simple check, suitable for determining if the Teleport process is
still running.
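
For example, you can query the endpoint with `curl` (the address assumes the diagnostics
configuration above; the output shown is what a running process returns):

```code
$ curl http://127.0.0.1:3000/healthz
{"status":"ok"}
```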

## `/readyz`

The `http://127.0.0.1:3000/readyz` endpoint is similar to `/healthz`, but its
response includes information about the state of the process.

The response body is a JSON object of the form:

```json
{"status": "a status message here"}
```

### `/readyz` and heartbeats

If a Teleport component fails to execute its heartbeat procedure, it will enter
a degraded state. Teleport will begin recovering from this state when a
heartbeat completes successfully.

The first successful heartbeat will transition Teleport into a recovering state. A second consecutive
successful heartbeat will cause Teleport to transition to the OK state.

Teleport heartbeats run approximately every 60 seconds when healthy, and failed
heartbeats are retried approximately every 5 seconds. This means that depending
on the timing of heartbeats, it can take 60-70 seconds after connectivity is
restored for `/readyz` to start reporting healthy again.
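
To observe these state transitions as they happen, you can poll the endpoint, for example
with `watch` (the one-second interval here is illustrative):

```code
# Poll /readyz every second to watch the degraded -> recovering -> ok transitions
$ watch -n 1 curl -s http://127.0.0.1:3000/readyz
```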

### Status codes

The status code of the response can be one of:

- HTTP 200 OK: Teleport is operating normally
- HTTP 503 Service Unavailable: Teleport has encountered a connection error and
  is running in a degraded state. This happens when a Teleport heartbeat fails.
- HTTP 400 Bad Request: Teleport is either entering its initial startup phase or
  has begun recovering from a degraded state.

The same state information is also available via the `process_state` metric
under the `/metrics` endpoint.
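
Because `/readyz` reflects the process state in its status code, it is a natural target for a
Kubernetes readiness probe. A minimal sketch, assuming the diagnostics address above is exposed
on port 3000 of the container (adjust the port and thresholds for your deployment):

```yaml
readinessProbe:
  httpGet:
    path: /readyz
    port: 3000
  periodSeconds: 10
  failureThreshold: 3
```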

## Metrics

This section explains how to enable exporting Prometheus metrics.

Teleport exposes metrics for all of its components, helping you get insight
into the state of your cluster. This guide explains the metrics that you can
collect from your Teleport cluster.

## Enabling metrics

(!docs/pages/includes/diagnostics/diag-addr-prereqs-tabs.mdx!)

This will enable the `http://127.0.0.1:3000/metrics` endpoint, which serves the
metrics that Teleport tracks. It is compatible with [Prometheus](https://prometheus.io/) collectors.
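
If you run Prometheus yourself, a minimal scrape configuration for this endpoint might look like
the following sketch (the job name and target address are illustrative):

```yaml
scrape_configs:
  - job_name: teleport
    metrics_path: /metrics
    static_configs:
      - targets: ['127.0.0.1:3000']
```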

The following metrics are available:

<Notice scope={["cloud"]} type="tip">

Teleport Cloud does not expose monitoring endpoints for the Auth Service and Proxy Service.

</Notice>

(!docs/pages/includes/metrics.mdx!)

## Distributed tracing

This section explains how to enable distributed tracing for a Teleport instance.

Teleport leverages [OpenTelemetry](https://opentelemetry.io/) to generate traces
and export them to any [OpenTelemetry Protocol (OTLP)](https://opentelemetry.io/docs/reference/specification/protocol/otlp/)
capable exporter. If your telemetry backend doesn't support receiving OTLP traces, you may be able to
use the [OpenTelemetry Collector](https://opentelemetry.io/docs/collector/) to proxy traces from OTLP
to a format that your telemetry backend accepts.
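
For instance, a minimal OpenTelemetry Collector configuration that receives OTLP traces and
re-exports them to a Zipkin backend might look like this sketch (the Zipkin endpoint is a
placeholder for your own backend):

```yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

exporters:
  zipkin:
    endpoint: http://zipkin.example.com:9411/api/v2/spans

service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [zipkin]
```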

## Configure Teleport

To enable tracing for a `teleport` instance, add the following section to that instance's configuration file (`/etc/teleport.yaml`).
For a detailed description of these configuration fields, see the [configuration reference](./reference/configuration.mdx) page.

```yaml
tracing_service:
  enabled: yes
  exporter_url: grpc://collector.example.com:4317
  sampling_rate_per_million: 1000000
```

### Sampling rate

It is important to choose the sampling rate wisely. Sampling at a rate of 100% could have a negative impact on the
performance of your cluster. Teleport honors the sampling rate included in any incoming requests: even when the
`tracing_service` is enabled and the sampling rate is 0, if Teleport receives a request that contains a sampled span,
Teleport will sample and export all spans that are generated in response to that request.
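
The rate is expressed per million requests, so, for example, sampling 1% of requests looks like
the following:

```yaml
tracing_service:
  enabled: yes
  exporter_url: grpc://collector.example.com:4317
  # 10,000 out of every 1,000,000 requests = 1%
  sampling_rate_per_million: 10000
```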

### Exporter URL

The `exporter_url` setting indicates where Teleport should send spans. Supported schemes are `grpc://`, `http://`,
`https://`, and `file://` (if no scheme is provided, then `grpc://` is used).

When using `file://`, the URL must be a path to a directory that Teleport has write permissions for. Spans will be saved to files within
the provided directory, each file containing one proto-encoded span per line. Files are rotated after exceeding 100MB; to
override the default limit, add `?limit=<desired_file_size_in_bytes>` to the `exporter_url` (e.g. `file:///var/lib/teleport/traces?limit=100`).

By default, the connection to the exporter is insecure. To support TLS, add the following to the `tracing_service` configuration:

```yaml
# Optional paths to CA certificates used to validate the exporter.
ca_certs:
  - /var/lib/teleport/exporter_ca.pem
# Optional paths to TLS key pairs used to enable mTLS for the exporter.
https_keypairs:
  - key_file: /var/lib/teleport/exporter_key.pem
    cert_file: /var/lib/teleport/exporter_cert.pem
```

After updating `teleport.yaml`, start your `teleport` instance to apply the new configuration.

## tsh

To capture traces from `tsh`, add the `--trace` flag to your command. All traces generated by `tsh --trace` will be
proxied to the `exporter_url` defined for the Auth Service of the cluster the command is run against.

```code
$ tsh --trace ssh root@myserver
$ tsh --trace ls
```

Exporting traces from `tsh` to a different exporter than the one defined in the Auth Service config
is also possible via the `--trace-exporter` flag. The provided URL must adhere to the same
format as the `exporter_url` of the `tracing_service`.

```code
$ tsh --trace --trace-exporter=grpc://collector.example.com:4317 ssh root@myserver
$ tsh --trace --trace-exporter=file:///var/lib/teleport/traces ls
```

## Collecting profiles

This section explains how to collect runtime profiling data from a Teleport instance.

Teleport leverages Go's diagnostic capabilities to collect and export
profiling data. Profiles can help identify the cause of spikes in CPU usage,
the source of memory leaks, or the reason for a deadlock.

## Using the Debug Service

The Teleport Debug Service enables administrators to collect diagnostic profiles
without enabling pprof endpoints at startup. The service, enabled by default,
accepts only local connections and must be consumed from inside the same instance.

`teleport debug profile` collects a list of pprof profiles. It outputs a
compressed tarball (`.tar.gz`) to STDOUT. You can decompress it using `tar` or
direct the result to a file.

By default, it collects the `goroutine`, `heap`, and `profile` profiles.

Each profile collected has a corresponding file inside the tarball. For
example, collecting `goroutine,trace,heap` results in `goroutine.pprof`,
`trace.pprof`, and `heap.pprof` files.

```code
# Collect default profiles and save them to a file.
$ teleport debug profile > pprof.tar.gz
# Extract the resulting tarball.
$ tar xvf pprof.tar.gz
# Collect default profiles and decompress them directly.
$ teleport debug profile | tar xzv -C ./
# Collect "trace" and "mutex" profiles and save them to a file.
$ teleport debug profile trace,mutex > pprof.tar.gz
# Collect the "trace" profile, setting the profiling time in seconds.
$ teleport debug profile -s 20 trace > pprof.tar.gz
```

(!docs/pages/includes/diagnostics/teleport-debug-config.mdx!)

If you're running Teleport on a Kubernetes cluster, you can directly collect
profiles to a local directory without an interactive session:

```code
$ kubectl -n teleport exec my-pod -- teleport debug profile > pprof.tar.gz
```

After extracting the contents, you can use `go tool` commands to explore and
visualize them:

```code
# Open the interactive terminal explorer
$ go tool pprof heap.pprof
# Open the web visualizer
$ go tool pprof -http : heap.pprof
# Visualize trace profiles
$ go tool trace trace.pprof
```

## Using diagnostics endpoints

The profiling endpoint is only enabled if the `--debug` flag is supplied.

(!docs/pages/includes/diagnostics/diag-addr-prereqs-tabs.mdx flags="--debug" !)

### Collecting profiles

Go's standard profiling endpoints are served at `http://127.0.0.1:3000/debug/pprof/`.
Retrieving a profile requires sending a request to the endpoint corresponding
to the desired profile type. When debugging an issue, it is helpful to collect
a series of profiles over a period of time.
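
For example, a small shell loop (illustrative; adjust the count and interval to your needs) can
capture a series of heap profiles for later comparison:

```code
# Capture five heap profiles, 30 seconds apart
$ for i in 1 2 3 4 5; do curl -o heap-$i.profile http://127.0.0.1:3000/debug/pprof/heap; sleep 30; done
```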

#### CPU

The CPU profile shows execution statistics gathered over a user-specified period:

```code
# Download the profile into a file:
$ curl -o cpu.profile http://127.0.0.1:3000/debug/pprof/profile?seconds=30
# Visualize the profile
$ go tool pprof -http : cpu.profile
```

#### Goroutine

Goroutine profiles show the stack traces for all running goroutines in the system:

```code
# Download the profile into a file:
$ curl -o goroutine.profile http://127.0.0.1:3000/debug/pprof/goroutine
# Visualize the profile
$ go tool pprof -http : goroutine.profile
```

#### Heap

Heap profiles show allocated objects in the system:

```code
# Download the profile into a file:
$ curl -o heap.profile http://127.0.0.1:3000/debug/pprof/heap
# Visualize the profile
$ go tool pprof -http : heap.profile
```

#### Trace

Trace profiles capture scheduling, system calls, garbage collections, heap size, and other events collected by the Go runtime
over a user-specified period of time:

```code
# Download the profile into a file:
$ curl -o trace.out http://127.0.0.1:3000/debug/pprof/trace?seconds=5
# Visualize the profile
$ go tool trace trace.out
```

## Further Reading

- More information about diagnostics in the Go ecosystem: https://go.dev/doc/diagnostics
- Go's profiling endpoints: https://golang.org/pkg/net/http/pprof/
- A deep dive on profiling Go programs: https://go.dev/blog/pprof