docs: Update Observability topic (#13323) (#13712)
JStickler authored Jul 30, 2024
1 parent 2f667dd commit 0406b75
Showing 6 changed files with 303 additions and 117 deletions.
File: docs/sources/operations/meta-monitoring/_index.md (103 additions)
---
title: Monitor Loki
description: Describes the various options for monitoring your Loki environment, and the metrics available.
aliases:
- ../operations/observability
---

# Monitor Loki

As part of your Loki implementation, you will also want to monitor your Loki cluster.

As a best practice, you should collect data about Loki in a separate instance of Loki, for example, send your Loki data to a [Grafana Cloud account](https://grafana.com/products/cloud/). This will let you troubleshoot a broken Loki cluster from a working one.

Loki exposes the following observability data about itself:

- **Metrics**: Loki provides a `/metrics` endpoint that exports information about Loki in Prometheus format. These metrics provide an aggregated view of the health of your Loki cluster, letting you observe query response times, error rates, and other operational signals.
- **Logs**: Loki emits a detailed log line `metrics.go` for every query, which shows query duration, number of lines returned, query throughput, the specific LogQL that was executed, chunks searched, and much more. You can use these log lines to improve and optimize your query performance.

You can also scrape Loki's logs and metrics and push them to separate instances of Loki and Mimir to provide information about the health of your Loki system (a process known as "meta-monitoring").

The Loki [mixin](https://github.com/grafana/loki/blob/main/production/loki-mixin) is an opinionated set of dashboards, alerts and recording rules to monitor your Loki cluster. The mixin provides a comprehensive package for monitoring Loki in production. You can install the mixin into a Grafana instance.

- To install meta-monitoring using the Loki Helm Chart and Grafana Cloud, follow [these directions](https://grafana.com/docs/loki/<LOKI_VERSION>/setup/install/helm/monitor-and-alert/with-grafana-cloud/).

- To install meta-monitoring using the Loki Helm Chart and a local Loki stack, follow [these directions](https://grafana.com/docs/loki/<LOKI_VERSION>/setup/install/helm/monitor-and-alert/with-local-monitoring/).

- To install the Loki mixin, follow [these directions]({{< relref "./mixins" >}}).

You should also plan separately for infrastructure-level monitoring, for example to monitor the capacity or throughput of your storage provider or your networking layer.

- [MinIO](https://min.io/docs/minio/linux/operations/monitoring/collect-minio-metrics-using-prometheus.html)
- [Kubernetes](https://grafana.com/docs/grafana-cloud/monitor-infrastructure/kubernetes-monitoring/)

## Loki Metrics

As Loki is a [distributed system](https://grafana.com/docs/loki/<LOKI_VERSION>/get-started/components/), each component exports its own metrics. The `/metrics` endpoint exposes hundreds of different metrics. You can find a sampling of the metrics exposed by Loki and their descriptions in the sections below.

You can find a complete list of the exposed metrics by checking the `/metrics` endpoint.

`http://<host>:<http_listen_port>/metrics`

For example:

[http://localhost:3100/metrics](http://localhost:3100/metrics)

Both Grafana Loki and Promtail expose a `/metrics` endpoint that exposes Prometheus metrics (the default port is 3100 for Loki and 80 for Promtail). You will need a local Prometheus instance, and you will need to add Loki and Promtail as scrape targets. See [configuring Prometheus](https://prometheus.io/docs/prometheus/latest/configuration/configuration) for more information.
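
For example, a minimal Prometheus scrape configuration might look like the following sketch. The `loki` and `promtail` host names and ports are assumptions and must match your deployment.

```yaml
# prometheus.yml (sketch): scrape the /metrics endpoints of Loki and Promtail.
# Prometheus scrapes the /metrics path by default.
scrape_configs:
  - job_name: loki
    static_configs:
      - targets: ["loki:3100"]
  - job_name: promtail
    static_configs:
      - targets: ["promtail:80"]
```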

All components of Loki expose the following metrics:

| Metric Name | Metric Type | Description |
| ---------------------------------- | ----------- | ----------------------------------------------------------------------- |
| `loki_internal_log_messages_total` | Counter | Total number of log messages created by Loki itself. |
| `loki_request_duration_seconds`     | Histogram   | Time (in seconds) spent serving HTTP requests.                           |

Note that most of the metrics are counters and should continuously increase during normal operation. For example, as a log line travels from Promtail into Loki:

1. Your app emits a log line to a file that is tracked by Promtail.
1. Promtail reads the new line and increases its counters.
1. Promtail forwards the log line to a Loki distributor, where the received
counters should increase.
1. The Loki distributor forwards the log line to a Loki ingester, where the
request duration counter should increase.

If Promtail uses any pipelines with metrics stages, those metrics will also be
exposed by Promtail at its `/metrics` endpoint. See Promtail's documentation on
[Pipelines](https://grafana.com/docs/loki/<LOKI_VERSION>/send-data/promtail/pipelines/) for more information.
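
The following is a pipeline sketch that defines a custom counter. It assumes the scraped log lines contain a `level=<value>` token; the resulting metric is exposed on Promtail's `/metrics` endpoint with Promtail's default `promtail_custom_` prefix.

```yaml
# Sketch of a Promtail pipeline with a metrics stage (in promtail.yaml).
pipeline_stages:
  # Extract the log level from lines such as "level=info msg=...".
  - regex:
      expression: ".*level=(?P<level>\\w+).*"
  # Increment a counter whenever a level value was extracted.
  - metrics:
      log_lines_by_level:
        type: Counter
        description: "Count of scraped log lines that contain a level field"
        source: level
        config:
          action: inc
```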

### Metrics cardinality

Some of the Loki observability metrics are emitted per tracked file (active), with the file path included in labels. This increases the quantity of label values across the environment, thereby increasing cardinality. Best practices with Prometheus labels discourage increasing cardinality in this way. Review your emitted metrics before scraping with Prometheus, and configure the scraping to avoid this issue.
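
For example, the following sketch drops the per-file path label at scrape time. The job name, target, and the `path` label name are assumptions; check them against the metrics your deployment actually emits.

```yaml
# Sketch: drop a high-cardinality per-file label before samples are stored.
scrape_configs:
  - job_name: promtail
    static_configs:
      - targets: ["promtail:80"]
    metric_relabel_configs:
      # Remove the per-file label from every scraped series.
      - action: labeldrop
        regex: path
```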

## Example Loki log line: metrics.go

Loki emits a `metrics.go` log line from the Querier, Query frontend, and Ruler components, which lets you inspect query and recording rule performance.

The following is an example of a `metrics.go` line for a query:

`level=info ts=2024-03-11T13:44:10.322919331Z caller=metrics.go:143 component=frontend org_id=mycompany latency=fast query="sum(count_over_time({kind=\"auditing\"} | json | user_userId =`` [1m]))" query_type=metric range_type=range length=10m0s start_delta=10m10.322900424s end_delta=10.322900663s step=1s duration=47.61044ms status=200 limit=100 returned_lines=0 throughput=9.8MB total_bytes=467kB total_entries=1 queue_time=0s subqueries=2 cache_chunk_req=1 cache_chunk_hit=1 cache_chunk_bytes_stored=0 cache_chunk_bytes_fetched=14394 cache_index_req=19 cache_index_hit=19 cache_result_req=1 cache_result_hit=1`

You can use the query-frontend `metrics.go` lines to understand a query's overall performance. The `metrics.go` line output by the Queriers contains the same information as the Query frontend's, but is often more helpful for understanding and troubleshooting query performance, largely because it shows how the querier spent its time executing each subquery. Here are the most useful stats:

- **total_bytes**: how many total bytes the query processed
- **duration**: how long the query took to execute
- **throughput**: total_bytes/duration
- **total_lines**: how many total lines the query processed
- **length**: the time range over which the query was executed
- **post_filter_lines**: how many lines matched the filters in the query
- **cache_chunk_req**: total number of chunks fetched for the query (the cache will be asked for every chunk so this is equivalent to the total chunks requested)
- **splits**: how many pieces the query was split into based on time and split_queries_by_interval
- **shards**: how many shards the query was split into

For more information, refer to the blog post [The concise guide to Loki: How to get the most out of your query performance](https://grafana.com/blog/2023/12/28/the-concise-guide-to-loki-how-to-get-the-most-out-of-your-query-performance/).
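
If you also send Loki's own logs to a Loki instance, as described earlier, you can filter the `metrics.go` lines with LogQL to find slow queries. This is a sketch; the `container="query-frontend"` selector is an assumption about how your Loki logs are labeled, and the duration threshold is arbitrary.

```logql
{container="query-frontend"} |= "metrics.go" | logfmt | duration > 10s
```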

### Configure Logging Levels

To change the configuration for Loki logging levels, update the `log_level` configuration parameter in your `config.yaml` file.

```yaml
# Only log messages with the given severity or above. Valid levels: [debug,
# info, warn, error]
# CLI flag: -log.level
[log_level: <string> | default = "info"]
```
File: docs/sources/operations/meta-monitoring/mixins.md (189 additions)
---
title: Install Loki mixins
menuTitle: Install mixins
description: Describes the Loki mixins, how to configure and install the dashboards, alerts, and recording rules.
weight: 100
---

# Install Loki mixins

Loki is instrumented to expose metrics about itself via the `/metrics` endpoint, designed to be scraped by Prometheus. Each Loki release includes a mixin, which provides a set of Grafana dashboards, Prometheus recording rules, and alerts for monitoring Loki.

To set up monitoring using the mixin, you need to:

- Deploy an instance of Prometheus (or a Prometheus-compatible time series database, like [Mimir](https://grafana.com/docs/mimir/latest/)) which can store Loki metrics.
- Deploy an agent, such as Grafana Alloy or Grafana Agent, to scrape Loki metrics.
- Set up Grafana to visualize Loki metrics, by installing the dashboards.
- Install the recording rules and alerts into Prometheus using `mimirtool`.

This procedure assumes that you have set up Loki using the Helm chart.

{{< admonition type="note" >}}
Be sure to update the commands and configuration to match your own deployment.
{{< /admonition >}}

## Before you begin

To make full use of the Loki mixin, you’ll need the following running in your environment:

- A Loki instance - The Loki deployment that you want to monitor.
- Grafana - For visualizing logs and metrics ([install on Kubernetes](https://grafana.com/docs/grafana/latest/setup-grafana/installation/kubernetes/#deploy-grafana-oss-on-kubernetes)).
- Prometheus or Mimir - An instance of Prometheus or Mimir to store the metrics scraped from Loki.

To scrape metrics from Loki, you can use Grafana Alloy or the OpenTelemetry Collector. This procedure provides examples only for Grafana Alloy.

If you have installed Loki using a Helm Chart, this documentation assumes that the Loki and Grafana instances are located on the same Kubernetes cluster.

## Configure Alloy to scrape Loki metrics

Loki exposes Prometheus metrics from all of its components to allow meta-monitoring. To retrieve these metrics, you need to configure a suitable scraper. Grafana Alloy can collect metrics and act as a Prometheus scraper. To use this capability, you need to configure Alloy to scrape from all of the components.

{{< admonition type="tip" >}}
If you're running on Kubernetes, you can use the Kubernetes Monitoring Helm chart.
{{< /admonition >}}

To scrape metrics from Loki, follow these steps:

1. Install Grafana Alloy using the instructions for your platform:

   - [Standalone](https://grafana.com/docs/alloy/latest/get-started/install/binary/)
   - [Kubernetes](https://grafana.com/docs/alloy/latest/get-started/install/kubernetes/)
   - [Docker](https://grafana.com/docs/alloy/latest/get-started/install/docker/)

1. Add a configuration block to scrape metrics from your Loki component instances and forward them to a Prometheus or Mimir instance.

   - On Kubernetes, you can use the Alloy `discovery.kubernetes` component to discover the Loki Pods to scrape metrics from.
   - On non-Kubernetes deployments, you can use `prometheus.scrape` with an explicit list of targets to scrape your Loki instances.

   For an example, see [Collect and forward Prometheus metrics](https://grafana.com/docs/alloy/latest/tasks/collect-prometheus-metrics/).
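
The following is a minimal Alloy configuration sketch for a Kubernetes deployment. The `loki` namespace and the remote write URL are assumptions; adjust them, and the discovery rules, to match your environment.

```alloy
// Discover the Pods in the namespace where Loki is running.
// The namespace name is an assumption; change it to match your deployment.
discovery.kubernetes "loki_pods" {
  role = "pod"

  namespaces {
    names = ["loki"]
  }
}

// Scrape the /metrics endpoint of every discovered Pod and forward
// the samples to the remote write component below.
prometheus.scrape "loki" {
  targets    = discovery.kubernetes.loki_pods.targets
  forward_to = [prometheus.remote_write.metrics.receiver]
}

// Write the scraped metrics to a Prometheus-compatible database.
// The URL is an assumption; point it at your Prometheus or Mimir instance.
prometheus.remote_write "metrics" {
  endpoint {
    url = "http://mimir:9009/api/v1/push"
  }
}
```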

## Configure Grafana

In your Grafana instance, you'll need to [create a Prometheus datasource](https://grafana.com/docs/grafana/latest/datasources/prometheus/configure-prometheus-data-source/) to visualize the metrics scraped from your Loki cluster.
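
If you manage Grafana with provisioning files, the data source can also be defined declaratively. The following is a minimal sketch; the data source name, file path, and URL are assumptions and must point at your Prometheus or Mimir instance.

```yaml
# Example provisioning file, for example provisioning/datasources/loki-metrics.yaml.
apiVersion: 1
datasources:
  - name: Loki Metrics
    type: prometheus
    access: proxy
    url: http://mimir:9009/prometheus
```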

## Install Loki dashboards in Grafana

After Loki metrics are scraped by Grafana Alloy and stored in a Prometheus-compatible time series database, you can monitor Loki’s operation using the Loki mixin.

Each Loki release includes a mixin that includes:

- Relevant dashboards for overseeing the health of Loki as a whole, as well as its individual Loki components
- [Recording rules](https://grafana.com/docs/loki/latest/alert/#recording-rules) that compute metrics that are used in the dashboards
- Alerts that trigger when Loki generates metrics that are outside of normal parameters

To install the mixins in Grafana and Mimir, the general steps are as follows:

1. Download the mixin dashboards from the Loki repository.

1. Import the dashboards in your Grafana instance.

1. Upload `alerts.yaml` and `rules.yaml` files to Prometheus or Mimir with `mimirtool`.

### Download the `loki-mixin` dashboards

1. First, clone the Loki repository from GitHub:

   ```bash
   git clone https://github.com/grafana/loki
   cd loki
   ```

1. Once you have a local copy of the repository, navigate to the `production/loki-mixin-compiled-ssd` directory:

   ```bash
   cd production/loki-mixin-compiled-ssd
   ```

   Or, if you're deploying Loki in microservices mode:

   ```bash
   cd production/loki-mixin-compiled
   ```

This directory contains a compiled version of the alert and recording rules, as well as the dashboards.

{{< admonition type="note" >}}
If you want to change any of the mixins, make your updates in the `production/loki-mixin` directory.
Use the instructions in the [README](https://github.com/grafana/loki/tree/main/production/loki-mixin) in that directory to regenerate the files.
{{< /admonition >}}

### Import the dashboards to Grafana

The `dashboards` directory includes the monitoring dashboards that can be installed into your Grafana instance.
Refer to [Import a dashboard](https://grafana.com/docs/grafana/latest/dashboards/build-dashboards/import-dashboards/) in the Grafana documentation.

{{< admonition type="tip" >}}
Install all of the dashboards.
You can only import one dashboard at a time, so repeat the import steps for each dashboard.
Create a new folder in the Dashboards area, for example “Loki Monitoring”, as an easy location to save the imported dashboards.
{{< /admonition >}}

To create a folder:

1. Open your Grafana instance and select **Dashboards**.
1. Click the **New** button.
1. Select **New folder** from the **New** menu.
1. Name your folder, for example, “Loki Monitoring”.
1. Click **Create**.

To import a dashboard:

1. Open your Grafana instance and select **Dashboards**.
1. Click the **New** button.
1. Select **Import** from the **New** menu.
1. On the **Import dashboard** screen, select **Upload dashboard JSON file.**
1. Browse to `production/loki-mixin-compiled-ssd/dashboards` and select the dashboard to import. Or, drag the dashboard file, for example, `loki-operational.json`, onto the **Upload** area of the **Import dashboard** screen.
1. Select a folder in the **Folder** menu where you want to save the imported dashboard. For example, select "Loki Monitoring" created in the earlier steps.
1. Click **Import**.

The imported files are listed in the Loki Monitoring dashboard folder.
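
As an alternative to importing each dashboard through the UI, Grafana can load dashboards from disk using a provisioning file. The following is a minimal sketch; the folder name and path are assumptions, and the JSON files from `production/loki-mixin-compiled-ssd/dashboards` must be copied to a path readable by Grafana.

```yaml
# Example dashboard provisioning file, for example provisioning/dashboards/loki-mixin.yaml.
apiVersion: 1
providers:
  - name: loki-monitoring
    folder: Loki Monitoring
    type: file
    options:
      path: /var/lib/grafana/dashboards/loki
```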

To view the dashboards in Grafana:

1. Select **Dashboards** in your Grafana instance.
1. Select **Loki Monitoring**, or the folder where you uploaded the imported dashboards.
1. Select any file in the folder to view the dashboard.

### Add alerts and recording rules to Prometheus or Mimir

The alerting and recording rules need to be installed into a Prometheus instance, a Mimir cluster, or a Grafana Enterprise Metrics cluster.

You can find the YAML files for alerts and rules in the following directories in the Loki repo:

For SSD mode:

- `production/loki-mixin-compiled-ssd/alerts.yaml`
- `production/loki-mixin-compiled-ssd/rules.yaml`

For microservices mode:

- `production/loki-mixin-compiled/alerts.yaml`
- `production/loki-mixin-compiled/rules.yaml`

You use `mimirtool` to load the mixin alerts and rules definitions into a Prometheus instance, Mimir or a Grafana Enterprise Metrics cluster.

1. Download [mimirtool](https://github.com/grafana/mimir/releases).

1. Using the details of your Prometheus instance or Mimir cluster, run the following command to load the recording rules:

   ```bash
   mimirtool rules load --address=http://prometheus:9090 rules.yaml
   ```

   Or, if your Mimir cluster requires a tenant ID and an API key, as is the case with Grafana Enterprise Metrics:

   ```bash
   mimirtool rules load --id=<tenant-id> --address=http://<mimir-hostname>:<port> --key="<mimir-api key>" rules.yaml
   ```

1. Run the following command to load the alerting rules:

   ```bash
   mimirtool rules load --address=http://prometheus:9090 alerts.yaml
   ```

   Or, if your cluster requires a tenant ID and an API key:

   ```bash
   mimirtool rules load --id=<tenant-id> --address=http://<mimir-hostname>:<port> --key="<mimir-api key>" alerts.yaml
   ```
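
After loading the files, you can confirm that the rule groups were stored by listing them. This is a sketch; omit the `--id` and `--key` flags if your instance does not require them.

```bash
mimirtool rules list --id=<tenant-id> --address=http://<mimir-hostname>:<port> --key="<mimir-api key>"
```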

Refer to the [mimirtool](https://grafana.com/docs/mimir/latest/manage/tools/mimirtool/) documentation for more information.