feat(Helm): Support for Zone Awareness and rollout-operator #11404

Closed — wants to merge 21 commits.

Commits
622ec99
Add support for replication zone awareness for write component and pr…
alex5517 Dec 7, 2023
be6775e
Merge branch 'grafana:main' into feat/zone-aware-helm
alex5517 Dec 7, 2023
7743675
Update helm reference docs
alex5517 Dec 7, 2023
a2dfeec
Merge branch 'feat/zone-aware-helm' of github.com:neticdk/loki into f…
alex5517 Dec 7, 2023
4de1b87
Bump chart version + helm-docs
alex5517 Dec 7, 2023
999e0d9
update changelog
alex5517 Dec 7, 2023
c94ba61
Merge branch 'main' into feat/zone-aware-helm
alex5517 Dec 7, 2023
36544c3
Keep old podAntiAffinity for write
alex5517 Dec 7, 2023
7601ebf
Merge branch 'feat/zone-aware-helm' of github.com:neticdk/loki into f…
alex5517 Dec 7, 2023
1902f22
cleanup affinity for write - no default for write.zoneAwareReplicatio…
alex5517 Dec 7, 2023
166da45
change order of labels to better match old version - easier to diff
alex5517 Dec 8, 2023
4f22ba7
Add back write affinity for legacy support
alex5517 Dec 8, 2023
0c3b780
fix dict for labels
alex5517 Dec 8, 2023
5ccd2bd
keep tpl support for write affinity - legacy/migrate support
alex5517 Dec 8, 2023
1ecbd2f
Remove ingester config test + fix ingester config for zone_awareness_…
alex5517 Dec 8, 2023
f624c48
Fix indentation error on template labels
alex5517 Dec 8, 2023
9072a4b
Add initial version of migration docs
alex5517 Dec 11, 2023
d892835
Merge branch 'main' into feat/zone-aware-helm
alex5517 Dec 11, 2023
80566cb
Improve value examples
alex5517 Dec 11, 2023
d509911
Merge branch 'feat/zone-aware-helm' of github.com:neticdk/loki into f…
alex5517 Dec 11, 2023
7b1eb57
fix - wrong location for end
alex5517 Dec 11, 2023
503 changes: 501 additions & 2 deletions docs/sources/setup/install/helm/reference.md

Large diffs are not rendered by default.

@@ -0,0 +1,248 @@
---
title: "Migrate from single zone to zone-aware replication in Loki Helm chart version 5.41.0"
menuTitle: "Migrate to zone-aware ingesters (write) with Helm"
description: "Learn how to migrate ingesters (write) from a single availability zone to full zone-aware replication using the Grafana Loki Helm chart"
weight: 800
keywords:
- migrate
- ssd
- scalable
- simple
- zone-aware
- helm
---

# Migrate from single zone to zone-aware replication in Loki Helm chart version 5.41.0

> **Note:** This document was tested with Loki version 2.9.2 and `loki` Helm chart version 5.41.0. It might work with more recent versions of Loki, but it is recommended to verify that no changes have been made to the default configuration and Helm chart values.

The `loki` Helm chart version 5.41.0 supports enabling zone-aware replication for the write component. It is not enabled by default in version 5.41.0 because it is a breaking change for existing installations and requires a migration.

This document explains how to migrate the write component from single zone to [zone-aware replication]({{< relref "../operations/zone-ingesters" >}}) with Helm.

**Before you begin:**

We recommend having a Grafana instance available to monitor the existing cluster, to make sure there is no data loss during the migration process.

## Migration

The following steps live-migrate (no downtime) an existing Loki deployment from a single write StatefulSet to three zone-aware write StatefulSets.

These instructions assume you are using the SSD `loki` Helm chart deployment.

1. Temporarily double max series limits

Explanation: while the new write StatefulSets are being added, some series start to be written to the new write pods. However, these series still exist on the old write pods as well, so they count twice toward limits. Not raising the limits might lead to writes being refused due to limit violations.

The `limits_config.max_global_streams_per_user` Loki configuration parameter has a non-zero default value of 5000. Double the default or your value by setting:

```yaml
loki:
  limits_config:
    max_global_streams_per_user: 10000 # <-- or your value doubled
```

1. Start the migration by using the following Helm values:
```yaml
rollout_operator:
  enabled: true

write:
  zoneAwareReplication:
    enabled: true
    maxUnavailable: <N>
    topologyKey: 'kubernetes.io/hostname'
    zones:
      - name: <ZONE-A>
        nodeSelector:
          topology.kubernetes.io/zone: <ZONE-A>
      - name: <ZONE-B>
        nodeSelector:
          topology.kubernetes.io/zone: <ZONE-B>
      - name: <ZONE-C>
        nodeSelector:
          topology.kubernetes.io/zone: <ZONE-C>
    migration:
      enabled: true
```
> **Note**: Replace `<N>` with one third of the current number of replicas.
> **Note**: Replace `<ZONE-[A-C]>` with your actual zone names.

1. Allow the new changes to be rolled out.

1. Scale up the new write StatefulSets to match the old write StatefulSet.
```yaml
rollout_operator:
  enabled: true

write:
  zoneAwareReplication:
    enabled: true
    maxUnavailable: <N>
    topologyKey: 'kubernetes.io/hostname'
    zones:
      - name: <ZONE-A>
        nodeSelector:
          topology.kubernetes.io/zone: <ZONE-A>
      - name: <ZONE-B>
        nodeSelector:
          topology.kubernetes.io/zone: <ZONE-B>
      - name: <ZONE-C>
        nodeSelector:
          topology.kubernetes.io/zone: <ZONE-C>
    migration:
      enabled: true
      replicas: <N>
```
> **Note**: Replace `<N>` with the total number of replicas in the old write StatefulSet. The chart divides `<N>` by 3 across the zones, so if `<N>` is set to 3, each new StatefulSet gets 1 replica.
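
The arithmetic in the notes above can be sketched in a few lines of shell (the `9` below is a hypothetical old replica count, not a value from this chart):

```shell
# Hypothetical size of the old write StatefulSet.
OLD_REPLICAS=9

# maxUnavailable: roughly one third of the current replicas.
MAX_UNAVAILABLE=$((OLD_REPLICAS / 3))

# migration.replicas is the old total; the chart divides it by 3,
# so each of the three zone StatefulSets gets this many replicas.
PER_ZONE=$((OLD_REPLICAS / 3))

echo "maxUnavailable=${MAX_UNAVAILABLE}, replicas per zone=${PER_ZONE}"
# prints: maxUnavailable=3, replicas per zone=3
```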

1. Enable zone-awareness on the write path.
```yaml
rollout_operator:
  enabled: true

write:
  zoneAwareReplication:
    enabled: true
    maxUnavailable: <N>
    topologyKey: 'kubernetes.io/hostname'
    zones:
      - name: <ZONE-A>
        nodeSelector:
          topology.kubernetes.io/zone: <ZONE-A>
      - name: <ZONE-B>
        nodeSelector:
          topology.kubernetes.io/zone: <ZONE-B>
      - name: <ZONE-C>
        nodeSelector:
          topology.kubernetes.io/zone: <ZONE-C>
    migration:
      enabled: true
      replicas: <N>
      writePath: true
```
1. Check that all the write pods have restarted properly.

1. Wait the number of hours configured in `query_ingesters_within`, which is 3h by default. This ensures that no data is missing when the new write pods are queried.
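
   If you are unsure what value is in effect, Loki serves its effective configuration over HTTP at `/config`. A minimal sketch — the pod name, namespace, and port are assumptions, and the parsing below runs against an inline sample instead of a live cluster:

```shell
# In a live cluster you would dump the effective config first, e.g.:
#   kubectl -n loki exec loki-write-zone-a-0 -- \
#     wget -qO- http://localhost:3100/config > loki-config.yaml
# Here we parse a sample fragment of that output instead.
config='
querier:
  query_ingesters_within: 3h
'
wait_for=$(printf '%s' "$config" | grep query_ingesters_within | awk '{print $2}')
echo "wait at least ${wait_for} before continuing"
```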

1. Check that rule evaluations are still correct during the migration by looking for increases in the rates of metrics whose names end with the following suffixes:

```
rule_evaluations_total
rule_evaluation_failures_total
rule_group_iterations_missed_total
```
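
Since the exact metric prefix depends on your Loki version, a prefix-agnostic way to watch these rates is to match on the name suffix — a sketch; adjust the range and add label matchers for your setup:

```promql
sum(rate({__name__=~".+_rule_evaluations_total"}[5m]))
sum(rate({__name__=~".+_rule_evaluation_failures_total"}[5m]))
sum(rate({__name__=~".+_rule_group_iterations_missed_total"}[5m]))
```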

1. Enable zone-awareness on the read path.
```yaml
rollout_operator:
  enabled: true

write:
  zoneAwareReplication:
    enabled: true
    maxUnavailable: <N>
    topologyKey: 'kubernetes.io/hostname'
    zones:
      - name: <ZONE-A>
        nodeSelector:
          topology.kubernetes.io/zone: <ZONE-A>
      - name: <ZONE-B>
        nodeSelector:
          topology.kubernetes.io/zone: <ZONE-B>
      - name: <ZONE-C>
        nodeSelector:
          topology.kubernetes.io/zone: <ZONE-C>
    migration:
      enabled: true
      replicas: <N>
      writePath: true
      readPath: true
```
1. Check that queries are still executing correctly. For example, look at `loki_logql_querystats_latency_seconds_count` to verify that there is no large increase in latency or error count for a specific query type.
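
   One way to watch this is a simple rate over that metric; if your setup exposes a label distinguishing query types or statuses (an assumption here), group by it to spot per-type regressions:

```promql
sum(rate(loki_logql_querystats_latency_seconds_count[5m]))
```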

1. Exclude the non zone-aware write pods from the write path.
```yaml
rollout_operator:
  enabled: true

write:
  zoneAwareReplication:
    enabled: true
    maxUnavailable: <N>
    topologyKey: 'kubernetes.io/hostname'
    zones:
      - name: <ZONE-A>
        nodeSelector:
          topology.kubernetes.io/zone: <ZONE-A>
      - name: <ZONE-B>
        nodeSelector:
          topology.kubernetes.io/zone: <ZONE-B>
      - name: <ZONE-C>
        nodeSelector:
          topology.kubernetes.io/zone: <ZONE-C>
    migration:
      enabled: true
      replicas: <N>
      writePath: true
      readPath: true
      excludeDefaultZone: true
```
It's a good idea to check rule evaluations again at this point, and to verify that the zone-aware write StatefulSets are now receiving all the write traffic. You can compare `sum(loki_ingester_memory_streams{cluster="<cluster>",job=~"(<namespace>)/loki-write"})` with `sum(loki_ingester_memory_streams{cluster="<cluster>",job=~"(<namespace>)/loki-write-zone.*"})`.

1. Scale down the non zone-aware write StatefulSet to 0.
```yaml
rollout_operator:
  enabled: true

write:
  zoneAwareReplication:
    enabled: true
    maxUnavailable: <N>
    topologyKey: 'kubernetes.io/hostname'
    zones:
      - name: <ZONE-A>
        nodeSelector:
          topology.kubernetes.io/zone: <ZONE-A>
      - name: <ZONE-B>
        nodeSelector:
          topology.kubernetes.io/zone: <ZONE-B>
      - name: <ZONE-C>
        nodeSelector:
          topology.kubernetes.io/zone: <ZONE-C>
    migration:
      enabled: true
      replicas: <N>
      writePath: true
      readPath: true
      excludeDefaultZone: true
      scaleDownDefaultZone: true
```

1. Wait until all non zone-aware write pods are terminated.

1. Remove all values used for the migration, causing defaults to be used, which removes the old write StatefulSet.
```yaml
rollout_operator:
  enabled: true

write:
  zoneAwareReplication:
    enabled: true
    maxUnavailable: <N>
    topologyKey: 'kubernetes.io/hostname'
    zones:
      - name: <ZONE-A>
        nodeSelector:
          topology.kubernetes.io/zone: <ZONE-A>
      - name: <ZONE-B>
        nodeSelector:
          topology.kubernetes.io/zone: <ZONE-B>
      - name: <ZONE-C>
        nodeSelector:
          topology.kubernetes.io/zone: <ZONE-C>
    # migration: removed from value overrides.
```

1. Wait at least the duration configured in `chunk_idle_period`, which is 30m by default.

1. Undo the doubling of series limits done in the first step.
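
   For example, if you started from the default limit, restore the override from the first step (a sketch; use your actual pre-migration value):

```yaml
loki:
  limits_config:
    max_global_streams_per_user: 5000 # <-- your original (pre-migration) value
```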
5 changes: 5 additions & 0 deletions production/helm/loki/CHANGELOG.md
@@ -13,6 +13,11 @@ Entries should include a reference to the pull request that introduced the change

[//]: # (<AUTOMATED_UPDATES_LOCATOR> : do not remove this line. This locator is used by the CI pipeline to automatically create a changelog entry for each new Loki release. Add other chart versions and respective changelog entries bellow this line.)

## 5.41.0

- [CHANGE] Make changes to how labels, annotations, resource names and priorityclassnames are created - matches how it is done in mimir-distributed helm chart.
- [FEATURE] Add support for zone aware deployment of write component and usage of rollout-operator.

## 5.40.1

- [BUGFIX] Remove ruler enabled condition in networkpolicies.
7 changes: 5 additions & 2 deletions production/helm/loki/Chart.lock
@@ -5,5 +5,8 @@ dependencies:
- name: grafana-agent-operator
repository: https://grafana.github.io/helm-charts
version: 0.2.16
digest: sha256:56eeb13a669bc816c1452cde5d6dddc61f6893f8aff3da1d2b56ce3bdcbcf84d
generated: "2023-11-09T12:22:25.317696-03:00"
- name: rollout-operator
repository: https://grafana.github.io/helm-charts
version: 0.10.0
digest: sha256:145b8ca76dde55e195c514e81e4607bf227461ba744682c8ce81bce7ed1640b4
generated: "2023-12-07T08:36:59.813605+01:00"
7 changes: 6 additions & 1 deletion production/helm/loki/Chart.yaml
@@ -3,7 +3,7 @@ name: loki
description: Helm chart for Grafana Loki in simple, scalable mode
type: application
appVersion: 2.9.2
version: 5.40.1
version: 5.41.0
home: https://grafana.github.io/helm-charts
sources:
- https://github.com/grafana/loki
@@ -21,6 +21,11 @@ dependencies:
version: 0.2.16
repository: https://grafana.github.io/helm-charts
condition: monitoring.selfMonitoring.grafanaAgent.installOperator
- name: rollout-operator
alias: rollout_operator
repository: https://grafana.github.io/helm-charts
version: 0.10.0
condition: rollout_operator.enabled
maintainers:
- name: trevorwhitney
- name: jeschkies
3 changes: 2 additions & 1 deletion production/helm/loki/README.md
@@ -1,6 +1,6 @@
# loki

![Version: 5.40.1](https://img.shields.io/badge/Version-5.40.1-informational?style=flat-square) ![Type: application](https://img.shields.io/badge/Type-application-informational?style=flat-square) ![AppVersion: 2.9.2](https://img.shields.io/badge/AppVersion-2.9.2-informational?style=flat-square)
![Version: 5.41.0](https://img.shields.io/badge/Version-5.41.0-informational?style=flat-square) ![Type: application](https://img.shields.io/badge/Type-application-informational?style=flat-square) ![AppVersion: 2.9.2](https://img.shields.io/badge/AppVersion-2.9.2-informational?style=flat-square)

Helm chart for Grafana Loki in simple, scalable mode

@@ -16,5 +16,6 @@ Helm chart for Grafana Loki in simple, scalable mode
|------------|------|---------|
| https://charts.min.io/ | minio(minio) | 4.0.15 |
| https://grafana.github.io/helm-charts | grafana-agent-operator(grafana-agent-operator) | 0.2.16 |
| https://grafana.github.io/helm-charts | rollout_operator(rollout-operator) | 0.10.0 |

Find more information in the Loki Helm Chart [documentation](https://grafana.com/docs/loki/next/installation/helm).