docs(alert): clarify remote rule evaluation
LukoJy3D committed Oct 12, 2024
1 parent 10569ab commit ae556f9
Showing 2 changed files with 69 additions and 43 deletions.
71 changes: 49 additions & 22 deletions docs/sources/alert/_index.md
@@ -33,7 +33,6 @@ ruler:
kvstore:
store: inmemory
enable_api: true

```
We support two kinds of rules: [alerting](#alerting-rules) rules and [recording](#recording-rules) rules.
@@ -62,9 +61,9 @@ groups:
> 0.05
for: 10m
labels:
severity: page
severity: page
annotations:
summary: High request latency
summary: High request latency
- name: credentials_leak
rules:
- alert: http-credentials-leaked
@@ -106,7 +105,6 @@ This query (`expr`) will be executed every 1 minute (`interval`), the result of
name we have defined (`record`). This metric named `nginx:requests:rate1m` can now be sent to Prometheus, where it will be stored
just like any other metric.


### Limiting Alerts and Recording Rule Samples

Like [Prometheus](https://prometheus.io/docs/prometheus/latest/configuration/recording_rules/#limiting-alerts-and-series), you can configure a limit for alerts produced by alerting rules and samples produced by recording rules. This limit can be configured per-group. Using limits can prevent a faulty rule from generating a large number of alerts or recording samples. When the limit is exceeded, all recording samples produced by the rule are discarded, and if it is an alerting rule, all alerts for the rule, active, pending, or inactive, are cleared. The event will be recorded as an error in the evaluation, and the rule health will be set to `err`. The default value for limit is `0` meaning no limit.
@@ -115,8 +113,6 @@ Like [Prometheus](https://prometheus.io/docs/prometheus/latest/configuration/rec

Here is an example of a rule group along with its limit configured.



```yaml
groups:
- name: production_rules
@@ -131,9 +127,9 @@ groups:
> 0.05
for: 10m
labels:
severity: page
severity: page
annotations:
summary: High request latency
summary: High request latency
- record: nginx:requests:rate1m
expr: |
sum(
@@ -184,6 +180,7 @@ We don't always control the source code of applications we run. Load balancers a
### Event alerting

Sometimes you want to know whether _any_ instance of something has occurred. Alerting based on logs can be a great way to handle this, such as finding examples of leaked authentication credentials:

```yaml
- name: credentials_leak
rules:
@@ -209,10 +206,11 @@ As an example, we can use LogQL v2 to help Loki to monitor _itself_, alerting us
## Interacting with the Ruler

### Lokitool

Because the rule files are identical to Prometheus rule files, we can interact with the Loki Ruler via `lokitool`.

{{% admonition type="note" %}}
lokitool is intended to run against multi-tenant Loki. The commands need an `--id=` flag set to the Loki instance ID or set the environment variable `LOKI_TENANT_ID`. If Loki is running in single tenant mode, the required ID is `fake`.
lokitool is intended to run against multi-tenant Loki. The commands need an `--id=` flag set to the Loki instance ID or set the environment variable `LOKI_TENANT_ID`. If Loki is running in single tenant mode, the required ID is `fake`.
{{% /admonition %}}

An example workflow is included below:
@@ -284,6 +282,28 @@ resource "loki_rule_group_recording" "test" {
```

### Cortex rules action

The [Cortex rules action](https://github.com/grafana/cortex-rules-action) introduced Loki as a backend, which can be handy for managing rules in a CI/CD pipeline. It can be used to lint, diff, and sync rules between a local directory and a remote Loki instance.

```yaml
- name: Lint Loki rules
uses: grafana/cortex-rules-action@master
env:
ACTION: check
RULES_DIR: <source_dir_of_rules> # Example: logs/recording_rules/,logs/alerts/
BACKEND: loki
- name: Deploy rules to Loki staging
uses: grafana/cortex-rules-action@master
env:
CORTEX_ADDRESS: <loki_ingress_addr>
CORTEX_TENANT_ID: fake
ACTION: sync
RULES_DIR: <source_dir_of_rules> # Example: logs/recording_rules/,logs/alerts/
BACKEND: loki
```

## Scheduling and best practices

One way to scale the Ruler is horizontally. However, when multiple Ruler instances are running, they need to coordinate to determine which instance evaluates which rule. Similar to the ingesters, the Rulers establish a hash ring to divide up the responsibility of evaluating rules.
@@ -294,19 +314,19 @@ A full sharding-enabled Ruler example is:

```yaml
ruler:
alertmanager_url: <alertmanager_endpoint>
enable_alertmanager_v2: true
enable_api: true
enable_sharding: true
ring:
kvstore:
consul:
host: consul.loki-dev.svc.cluster.local:8500
store: consul
rule_path: /tmp/rules
storage:
gcs:
bucket_name: <loki-rules-bucket>
alertmanager_url: <alertmanager_endpoint>
enable_alertmanager_v2: true # true by default since Loki 3.2.0
enable_api: true
enable_sharding: true
ring:
kvstore:
consul:
host: consul.loki-dev.svc.cluster.local:8500
store: consul
rule_path: /tmp/rules
storage:
gcs:
bucket_name: <loki-rules-bucket>
```

## Ruler storage
@@ -316,18 +336,25 @@ The Ruler supports the following types of storage: `azure`, `gcs`, `s3`, `swift`
The local implementation reads the rule files from the local filesystem. This is a read-only backend that does not support creating or deleting rules through the [Ruler API](https://grafana.com/docs/loki/<LOKI_VERSION>/reference/loki-http-api#ruler). Although it reads from the local filesystem, this method can still be used in a sharded Ruler configuration if the operator takes care to load the same rules onto every Ruler. For instance, this could be accomplished by mounting a [Kubernetes ConfigMap](https://kubernetes.io/docs/concepts/configuration/configmap/) onto every Ruler pod.

A typical local configuration might look something like:

```
-ruler.storage.type=local
-ruler.storage.local.directory=/tmp/loki/rules
```
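
The same settings can presumably be placed in the YAML configuration file instead of being passed as CLI flags. This is a minimal sketch, assuming the flags map onto the `ruler.storage` block in the usual way; verify the field names against the configuration reference for your Loki version.

```yaml
ruler:
  storage:
    # Assumed equivalent of -ruler.storage.type=local
    type: local
    local:
      # Assumed equivalent of -ruler.storage.local.directory=/tmp/loki/rules
      directory: /tmp/loki/rules
```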

With the above configuration, the Ruler would expect the following layout:

```
/tmp/loki/rules/<tenant id>/rules1.yaml
                           /rules2.yaml
```

YAML files are expected to be [Prometheus-compatible](https://prometheus.io/docs/prometheus/latest/configuration/alerting_rules/) but include LogQL expressions, as specified at the beginning of this document.
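
As an illustration, a rule file in that layout might look like the sketch below. The path, group name, alert name, label selector, and threshold are all hypothetical. The tenant directory is `fake` on the assumption that Loki runs in single-tenant mode.

```yaml
# Hypothetical contents of /tmp/loki/rules/fake/rules1.yaml
groups:
  - name: example_rules            # illustrative group name
    rules:
      - alert: HighErrorLogRate    # illustrative alert name
        # LogQL metric query: per-job rate of log lines containing "error"
        expr: |
          sum by (job) (rate({app="example"} |= "error" [5m])) > 10
        for: 10m
        labels:
          severity: page
        annotations:
          summary: High rate of error log lines
```

The structure (`groups`, `rules`, `alert`, `expr`, `for`, `labels`, `annotations`) follows the Prometheus rule format; only the `expr` is written in LogQL.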

## Remote rule evaluation

In larger deployments with complex rules, running the ruler in local evaluation mode can produce results that are inconsistent or incomplete compared to what you see in Grafana. To solve this, use remote mode, which evaluates rules against the query frontend. For a more detailed explanation, see the [scalability documentation]({{< relref "../operations/scalability.md" >}}).
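
A minimal sketch of a ruler configured for remote evaluation is shown here. The `mode` field comes from the configuration reference. The `query_frontend.address` field and the placeholder address are assumptions to check against the reference for your Loki version.

```yaml
ruler:
  evaluation:
    # 'local' is the default; 'remote' sends rule queries to the query frontend
    mode: remote
    query_frontend:
      # Placeholder: gRPC address of your query frontend (assumed field name)
      address: dns:///<query_frontend_service>:<grpc_port>
```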

## Future improvements

There are a few things coming to increase the robustness of this service. In no particular order:
41 changes: 20 additions & 21 deletions docs/sources/shared/configuration.md
@@ -672,7 +672,7 @@ compactor_grpc_client:

# Override the default cipher suite list (separated by commas). Allowed
# values:
#
#
# Secure Ciphers:
# - TLS_AES_128_GCM_SHA256
# - TLS_AES_256_GCM_SHA384
@@ -687,7 +687,7 @@ compactor_grpc_client:
# - TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384
# - TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305_SHA256
# - TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305_SHA256
#
#
# Insecure Ciphers:
# - TLS_RSA_WITH_RC4_128_SHA
# - TLS_RSA_WITH_3DES_EDE_CBC_SHA
@@ -738,7 +738,7 @@ compactor_grpc_client:

# Configuration for memberlist client. Only applies if the selected kvstore is
# memberlist.
#
#
# When a memberlist config with at least 1 join_members is defined, kvstore of
# type memberlist is automatically selected for all the components that require
# a ring unless otherwise specified in the component's configuration section.
@@ -1518,7 +1518,7 @@ memcached_client:
# Override the default cipher suite list (separated by commas). Allowed
# values:
#
#
# Secure Ciphers:
# - TLS_AES_128_GCM_SHA256
# - TLS_AES_256_GCM_SHA384
@@ -1533,7 +1533,7 @@ memcached_client:
# - TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384
# - TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305_SHA256
# - TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305_SHA256
#
#
# Insecure Ciphers:
# - TLS_RSA_WITH_RC4_128_SHA
# - TLS_RSA_WITH_3DES_EDE_CBC_SHA
@@ -2356,7 +2356,7 @@ Configuration for an ETCD v3 client. Only applies if the selected kvstore is `et
[tls_insecure_skip_verify: <boolean> | default = false]
# Override the default cipher suite list (separated by commas). Allowed values:
#
#
# Secure Ciphers:
# - TLS_AES_128_GCM_SHA256
# - TLS_AES_256_GCM_SHA384
@@ -2371,7 +2371,7 @@ Configuration for an ETCD v3 client. Only applies if the selected kvstore is `et
# - TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384
# - TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305_SHA256
# - TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305_SHA256
#
#
# Insecure Ciphers:
# - TLS_RSA_WITH_RC4_128_SHA
# - TLS_RSA_WITH_3DES_EDE_CBC_SHA
@@ -2690,7 +2690,7 @@ backoff_config:
[tls_insecure_skip_verify: <boolean> | default = false]
# Override the default cipher suite list (separated by commas). Allowed values:
#
#
# Secure Ciphers:
# - TLS_AES_128_GCM_SHA256
# - TLS_AES_256_GCM_SHA384
@@ -2705,7 +2705,7 @@ backoff_config:
# - TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384
# - TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305_SHA256
# - TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305_SHA256
#
#
# Insecure Ciphers:
# - TLS_RSA_WITH_RC4_128_SHA
# - TLS_RSA_WITH_3DES_EDE_CBC_SHA
@@ -3502,7 +3502,7 @@ The `limits_config` block configures global and per-tenant limits in Loki. The v
# metadata request that falls in this window is split using
# `split_recent_metadata_queries_by_interval`. The value 0 disables using a
# different split interval for recent metadata queries.
#
#
# This is added to improve cacheability of recent metadata queries. Query split
# interval also determines the interval used in cache key. The default split
# interval of 24h is useful for caching long queries, each cache key holding 1
@@ -4006,7 +4006,7 @@ When a memberlist config with at least 1 join_members is defined, kvstore of type
[tls_insecure_skip_verify: <boolean> | default = false]
# Override the default cipher suite list (separated by commas). Allowed values:
#
#
# Secure Ciphers:
# - TLS_AES_128_GCM_SHA256
# - TLS_AES_256_GCM_SHA384
@@ -4021,7 +4021,7 @@ When a memberlist config with at least 1 join_members is defined, kvstore of type
# - TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384
# - TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305_SHA256
# - TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305_SHA256
#
#
# Insecure Ciphers:
# - TLS_RSA_WITH_RC4_128_SHA
# - TLS_RSA_WITH_3DES_EDE_CBC_SHA
@@ -4623,7 +4623,7 @@ alertmanager_client:
# Override the default cipher suite list (separated by commas). Allowed
# values:
#
#
# Secure Ciphers:
# - TLS_AES_128_GCM_SHA256
# - TLS_AES_256_GCM_SHA384
@@ -4638,7 +4638,7 @@ alertmanager_client:
# - TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384
# - TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305_SHA256
# - TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305_SHA256
#
#
# Insecure Ciphers:
# - TLS_RSA_WITH_RC4_128_SHA
# - TLS_RSA_WITH_3DES_EDE_CBC_SHA
@@ -4852,9 +4852,8 @@ remote_write:
# Configuration for rule evaluation.
evaluation:
# The evaluation mode for the ruler. Can be either 'local' or 'remote'. If set
# to 'local', the ruler will evaluate rules locally. If set to 'remote', the
# ruler will evaluate rules remotely. If unset, the ruler will evaluate rules
# locally.
# to 'local', the ruler will evaluate rules locally (default). If set to 'remote', the
# ruler will evaluate rules remotely (recommended for bigger deployments).
# CLI flag: -ruler.evaluation.mode
[mode: <string> | default = "local"]
@@ -4899,7 +4898,7 @@ evaluation:
# Override the default cipher suite list (separated by commas). Allowed
# values:
#
#
# Secure Ciphers:
# - TLS_AES_128_GCM_SHA256
# - TLS_AES_256_GCM_SHA384
@@ -4914,7 +4913,7 @@ evaluation:
# - TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384
# - TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305_SHA256
# - TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305_SHA256
#
#
# Insecure Ciphers:
# - TLS_RSA_WITH_RC4_128_SHA
# - TLS_RSA_WITH_3DES_EDE_CBC_SHA
@@ -6253,7 +6252,7 @@ The TLS configuration.
[tls_insecure_skip_verify: <boolean> | default = false]
# Override the default cipher suite list (separated by commas). Allowed values:
#
#
# Secure Ciphers:
# - TLS_AES_128_GCM_SHA256
# - TLS_AES_256_GCM_SHA384
@@ -6268,7 +6267,7 @@ The TLS configuration.
# - TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384
# - TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305_SHA256
# - TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305_SHA256
#
#
# Insecure Ciphers:
# - TLS_RSA_WITH_RC4_128_SHA
# - TLS_RSA_WITH_3DES_EDE_CBC_SHA
