feat: mixin, allow overriding of some labels by parameterizing mixin recording/alert rules #11495

Merged Oct 2, 2024 (23 commits)
Changes from 21 commits
638a1ea
parameterize alert rules
alex5517 Dec 15, 2023
7d1297a
parameterize recording rules
alex5517 Dec 15, 2023
d24862d
Merge branch 'main' into feat/parameterize-mixin
alex5517 Dec 15, 2023
943abab
Use group_prefix_jobs var for metric name
alex5517 Dec 18, 2023
a7b55bd
Merge branch 'feat/parameterize-mixin' of github.com:neticdk/loki int…
alex5517 Dec 18, 2023
8a23765
Merge branch 'grafana:main' into feat/parameterize-mixin
alex5517 Jan 9, 2024
696f716
Merge branch 'grafana:main' into feat/parameterize-mixin
alex5517 Feb 21, 2024
2e43670
Remove config.libsonnet import from dashboards since imported in mixi…
alex5517 Feb 21, 2024
3aacecc
Merge branch 'main' into feat/parameterize-mixin
alex5517 Feb 22, 2024
77a6f15
Merge branch 'main' into feat/parameterize-mixin
alex5517 Feb 22, 2024
ea13e6f
Merge branch 'grafana:main' into feat/parameterize-mixin
alex5517 Feb 26, 2024
7754cf3
Lint error - remove 2 spaces
alex5517 Feb 26, 2024
7ebdd9d
Merge branch 'main' into feat/parameterize-mixin
alex5517 Feb 29, 2024
3a61421
Run make loki-mixin
alex5517 Feb 29, 2024
105ec09
Merge branch 'feat/parameterize-mixin' of github.com:neticdk/loki int…
alex5517 Feb 29, 2024
6571838
Merge branch 'main' into feat/parameterize-mixin
alex5517 Mar 6, 2024
de90f70
Pull from main
alex5517 Apr 18, 2024
8e183e7
Fix merge conflict
alex5517 Apr 18, 2024
88e5d73
Run make loki-mixin
alex5517 Apr 18, 2024
ed0c9d8
Merge branch 'main' of github.com:neticdk/loki into feat/parameterize…
alex5517 May 29, 2024
fdfa18c
Build mixin
alex5517 May 29, 2024
9b3c5df
fix merge conflict
alex5517 Sep 30, 2024
859636f
Merge branch 'main' into feat/parameterize-mixin
alex5517 Oct 2, 2024
8 changes: 4 additions & 4 deletions production/loki-mixin-compiled-ssd/alerts.yaml
@@ -7,9 +7,9 @@ groups:
  {{ $labels.job }} {{ $labels.route }} is experiencing {{ printf "%.2f" $value }}% errors.
  summary: Loki request error rate is high.
  expr: |
- 100 * sum(rate(loki_request_duration_seconds_count{status_code=~"5.."}[2m])) by (namespace, job, route)
+ 100 * sum(rate(loki_request_duration_seconds_count{status_code=~"5.."}[2m])) by (cluster, namespace, job, route)
  /
- sum(rate(loki_request_duration_seconds_count[2m])) by (namespace, job, route)
+ sum(rate(loki_request_duration_seconds_count[2m])) by (cluster, namespace, job, route)
  > 10
  for: 15m
  labels:
@@ -20,7 +20,7 @@ groups:
  {{ $labels.job }} is experiencing {{ printf "%.2f" $value }}% increase of panics.
  summary: Loki requests are causing code panics.
  expr: |
- sum(increase(loki_panic_total[10m])) by (namespace, job) > 0
+ sum(increase(loki_panic_total[10m])) by (cluster, namespace, job) > 0
  labels:
  severity: critical
  - alert: LokiRequestLatency
@@ -39,7 +39,7 @@ groups:
  {{ $labels.cluster }} {{ $labels.namespace }} has had {{ printf "%.0f" $value }} compactors running for more than 5m. Only one compactor should run at a time.
  summary: Loki deployment is running more than one compactor.
  expr: |
- sum(loki_boltdb_shipper_compactor_running) by (namespace, cluster) > 1
+ sum(loki_boltdb_shipper_compactor_running) by (cluster, namespace) > 1
  for: 5m
  labels:
  severity: warning
8 changes: 4 additions & 4 deletions production/loki-mixin-compiled/alerts.yaml
@@ -7,9 +7,9 @@ groups:
  {{ $labels.job }} {{ $labels.route }} is experiencing {{ printf "%.2f" $value }}% errors.
  summary: Loki request error rate is high.
  expr: |
- 100 * sum(rate(loki_request_duration_seconds_count{status_code=~"5.."}[2m])) by (namespace, job, route)
+ 100 * sum(rate(loki_request_duration_seconds_count{status_code=~"5.."}[2m])) by (cluster, namespace, job, route)
  /
- sum(rate(loki_request_duration_seconds_count[2m])) by (namespace, job, route)
+ sum(rate(loki_request_duration_seconds_count[2m])) by (cluster, namespace, job, route)
  > 10
  for: 15m
  labels:
@@ -20,7 +20,7 @@ groups:
  {{ $labels.job }} is experiencing {{ printf "%.2f" $value }}% increase of panics.
  summary: Loki requests are causing code panics.
  expr: |
- sum(increase(loki_panic_total[10m])) by (namespace, job) > 0
+ sum(increase(loki_panic_total[10m])) by (cluster, namespace, job) > 0
  labels:
  severity: critical
  - alert: LokiRequestLatency
@@ -39,7 +39,7 @@ groups:
  {{ $labels.cluster }} {{ $labels.namespace }} has had {{ printf "%.0f" $value }} compactors running for more than 5m. Only one compactor should run at a time.
  summary: Loki deployment is running more than one compactor.
  expr: |
- sum(loki_boltdb_shipper_compactor_running) by (namespace, cluster) > 1
+ sum(loki_boltdb_shipper_compactor_running) by (cluster, namespace) > 1
  for: 5m
  labels:
  severity: warning
18 changes: 9 additions & 9 deletions production/loki-mixin/alerts.libsonnet
@@ -7,11 +7,11 @@
  {
  alert: 'LokiRequestErrors',
  expr: |||
- 100 * sum(rate(loki_request_duration_seconds_count{status_code=~"5.."}[2m])) by (namespace, job, route)
+ 100 * sum(rate(loki_request_duration_seconds_count{status_code=~"5.."}[2m])) by (%(group_by_cluster)s, job, route)
  /
- sum(rate(loki_request_duration_seconds_count[2m])) by (namespace, job, route)
+ sum(rate(loki_request_duration_seconds_count[2m])) by (%(group_by_cluster)s, job, route)
  > 10
- |||,
+ ||| % $._config,
  'for': '15m',
  labels: {
  severity: 'critical',
@@ -26,8 +26,8 @@
  {
  alert: 'LokiRequestPanics',
  expr: |||
- sum(increase(loki_panic_total[10m])) by (namespace, job) > 0
- |||,
+ sum(increase(loki_panic_total[10m])) by (%(group_by_cluster)s, job) > 0
+ ||| % $._config,
  labels: {
  severity: 'critical',
  },
@@ -41,8 +41,8 @@
  {
  alert: 'LokiRequestLatency',
  expr: |||
- %s_namespace_job_route:loki_request_duration_seconds:99quantile{route!~"(?i).*tail.*|/schedulerpb.SchedulerForQuerier/QuerierLoop"} > 1
- ||| % $._config.per_cluster_label,
+ %(group_prefix_jobs)s_route:loki_request_duration_seconds:99quantile{route!~"(?i).*tail.*|/schedulerpb.SchedulerForQuerier/QuerierLoop"} > 1
+ ||| % $._config,
  'for': '15m',
  labels: {
  severity: 'critical',
@@ -57,8 +57,8 @@
  {
  alert: 'LokiTooManyCompactorsRunning',
  expr: |||
- sum(loki_boltdb_shipper_compactor_running) by (namespace, %s) > 1
- ||| % $._config.per_cluster_label,
+ sum(loki_boltdb_shipper_compactor_running) by (%(group_by_cluster)s) > 1
+ ||| % $._config,
  'for': '5m',
  labels: {
  severity: 'warning',
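The rewritten expressions lean on Jsonnet's `%` operator accepting an object on the right-hand side, so that `%(name)s` placeholders are resolved by field name instead of by position. A minimal sketch of that mechanism follows; `cfg` is a hypothetical stand-in for `$._config` that carries only the one field this rule needs.

```jsonnet
// Hypothetical stand-in for $._config, reduced to the field used below.
local cfg = {
  group_by_cluster: 'cluster, namespace',
};

{
  // Named placeholder: looked up as cfg.group_by_cluster at format time.
  expr: |||
    sum(increase(loki_panic_total[10m])) by (%(group_by_cluster)s, job) > 0
  ||| % cfg,
}
```

With the default value this should render the same `by (cluster, namespace, job)` clause seen in the compiled alerts.yaml. One thing to keep in mind: once a string is passed through `%`, any literal percent sign inside it has to be written as `%%`.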
17 changes: 17 additions & 0 deletions production/loki-mixin/config.libsonnet
@@ -1,4 +1,7 @@
  {
+ local makePrefix(groups) = std.join('_', groups),
+ local makeGroupBy(groups) = std.join(', ', groups),
+
  _config+:: {
  // Tags for dashboards.
  tags: ['loki'],
@@ -11,6 +14,20 @@

  // The label used to differentiate between different clusters.
  per_cluster_label: 'cluster',
+ per_namespace_label: 'namespace',
+ per_job_label: 'job',
+
+ // Grouping labels, to uniquely identify and group by {jobs, clusters}
+ job_labels: [$._config.per_cluster_label, $._config.per_namespace_label, $._config.per_job_label],
+ cluster_labels: [$._config.per_cluster_label, $._config.per_namespace_label],
Comment on lines +21 to +22
Contributor


Can we make these names more explicit? Even if they end up long, I'd prefer either names that tell me every label they contain, or spelling out each per_x_label at the point of use, for readability's sake.

Contributor Author


@cstyan,

I can change it, but wouldn't it be better to match what mimir-mixin does, so that it is easier for someone who has used mimir-mixin to also use loki-mixin?


+ // Each group prefix is composed of `_`-separated labels
+ group_prefix_jobs: makePrefix($._config.job_labels),
+ group_prefix_clusters: makePrefix($._config.cluster_labels),
+
+ // Each group-by label list is `, `-separated and unique identifies
+ group_by_job: makeGroupBy($._config.job_labels),
+ group_by_cluster: makeGroupBy($._config.cluster_labels),

  // Enable dashboard and panels for Grafana Labs internal components.
  internal_components: false,
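For readers wondering what "overriding" looks like from the consumer side, here is a rough sketch. It re-declares a subset of the fields added in this diff in a local `base` object instead of importing the real mixin, and the label name `k8s_cluster` is purely hypothetical.

```jsonnet
local makePrefix(groups) = std.join('_', groups);
local makeGroupBy(groups) = std.join(', ', groups);

// Subset of the config from this diff, copied here so the sketch is self-contained.
local base = {
  _config+:: {
    per_cluster_label: 'cluster',
    per_namespace_label: 'namespace',
    per_job_label: 'job',
    job_labels: [$._config.per_cluster_label, $._config.per_namespace_label, $._config.per_job_label],
    cluster_labels: [$._config.per_cluster_label, $._config.per_namespace_label],
    group_prefix_jobs: makePrefix($._config.job_labels),
    group_by_cluster: makeGroupBy($._config.cluster_labels),
  },
};

// Override a single label; 'k8s_cluster' is a made-up name for illustration.
local overridden = base {
  _config+:: {
    per_cluster_label: 'k8s_cluster',
  },
};

{
  group_prefix_jobs: overridden._config.group_prefix_jobs,  // "k8s_cluster_namespace_job"
  group_by_cluster: overridden._config.group_by_cluster,  // "k8s_cluster, namespace"
}
```

Because the derived fields read `$._config.per_*_label` lazily, changing one `per_*_label` is enough for the prefixes and group-by lists to follow. The same late binding is what lets the alert and recording rules pick up the override without further changes.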
1 change: 0 additions & 1 deletion production/loki-mixin/dashboards.libsonnet
@@ -1,4 +1,3 @@
- (import 'config.libsonnet') +
  (import 'dashboards/loki-retention.libsonnet') +
  (import 'dashboards/loki-chunks.libsonnet') +
  (import 'dashboards/loki-logs.libsonnet') +
4 changes: 1 addition & 3 deletions production/loki-mixin/mixin-ssd.libsonnet
@@ -1,6 +1,4 @@
- (import 'dashboards.libsonnet') +
- (import 'alerts.libsonnet') +
- (import 'recording_rules.libsonnet') + {
+ (import 'mixin.libsonnet') + {
  grafanaDashboardFolder: 'Loki SSD',

  _config+:: {
1 change: 1 addition & 0 deletions production/loki-mixin/mixin.libsonnet
@@ -1,5 +1,6 @@
  (import 'dashboards.libsonnet') +
  (import 'alerts.libsonnet') +
+ (import 'config.libsonnet') +
  (import 'recording_rules.libsonnet') + {
  grafanaDashboardFolder: 'Loki',
  }
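Moving the `config.libsonnet` import up into `mixin.libsonnet` works because `$` is late-bound: each part of the mixin reads `$._config` from the final composed object, wherever the config was added in the `+` chain. A small sketch of that behaviour, using hypothetical stand-ins for the two files:

```jsonnet
// Hypothetical stand-in for alerts.libsonnet: it only reads $._config.
local alerts = {
  exprs: [
    'sum(loki_boltdb_shipper_compactor_running) by (%(group_by_cluster)s) > 1' % $._config,
  ],
};

// Hypothetical stand-in for config.libsonnet: it only provides _config.
local config = {
  _config+:: { group_by_cluster: 'cluster, namespace' },
};

// `config` is added after `alerts`, yet `alerts` still resolves $._config.
alerts + config
```

Evaluating `alerts` on its own would fail, since `_config` does not exist until the objects are composed.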
2 changes: 1 addition & 1 deletion production/loki-mixin/recording_rules.libsonnet
@@ -7,7 +7,7 @@ local utils = import 'mixin-utils/utils.libsonnet';
  rules:
  utils.histogramRules('loki_request_duration_seconds', [$._config.per_cluster_label, 'job']) +
  utils.histogramRules('loki_request_duration_seconds', [$._config.per_cluster_label, 'job', 'route']) +
- utils.histogramRules('loki_request_duration_seconds', [$._config.per_cluster_label, 'namespace', 'job', 'route']),
+ utils.histogramRules('loki_request_duration_seconds', $._config.job_labels + ['route']),
  }],
  },
  }
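The recording rule and the `LokiRequestLatency` alert have to agree on the series name, so it is worth checking that both derive it from the same label list. The sketch below assumes the recorded name is prefixed with the labels joined by `_`, which is what the alert's existing `..._namespace_job_route:loki_request_duration_seconds:99quantile` name suggests; it does not call the real `utils.histogramRules`.

```jsonnet
// Mirrors the default job_labels from config.libsonnet.
local job_labels = ['cluster', 'namespace', 'job'];

// Assumed naming scheme for the recorded series: labels joined with `_`.
local recorded_prefix = std.join('_', job_labels + ['route']);

// Prefix the alert builds from group_prefix_jobs.
local alert_prefix = '%(group_prefix_jobs)s_route' % {
  group_prefix_jobs: std.join('_', job_labels),
};

{
  recorded_prefix: recorded_prefix,
  alert_prefix: alert_prefix,
  in_sync: recorded_prefix == alert_prefix,
}
```

Under that assumption both prefixes come out as `cluster_namespace_job_route`, and they keep matching when any of the `per_*_label` values is overridden, because both sides are built from `job_labels`.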