chore(multi-region): add operational simplification from Zeebe 8.6. #4205

Merged
merged 5 commits into from
Aug 30, 2024
18 changes: 6 additions & 12 deletions docs/self-managed/concepts/multi-region/dual-region.md
Original file line number Diff line number Diff line change
Expand Up @@ -54,8 +54,6 @@ The currently supported Camunda 8 Self-Managed components are:

The overall system is **active-passive**, even though some components may be **active-active**. You must handle user traffic routing and DNS yourself; this guide does not cover them further. Select one region as the actively serving region and route the user traffic there. In case of a total region failure, route the traffic to the passive region yourself.

<!-- Should we provide some reading materials on how to tackle this? -->

### Components

#### Zeebe
Expand Down Expand Up @@ -129,11 +127,8 @@ In the event of a total active region loss, the following data will be lost:
- Role Based Access Control (RBAC) does not work.
- Optimize is not supported.
- This is due to Optimize depending on Identity to work.
- Connectors are not supported.
- This is due to Connectors depending on Operate to work for inbound Connectors and potentially resulting in race condition.
- During the failback procedure, there’s a small chance that some data will be lost in Elasticsearch affecting Operate and Tasklist.
- This **does not** affect the processing of process instances in any way. The impact is that some information about the affected instances might not be visible in Operate and Tasklist.
- This is further explained in the [operational procedure](./../../operational-guides/multi-region/dual-region-ops.md?failback=step2#failback) during the relevant step.
- Connectors can be deployed alongside, but make sure you understand idempotency as described in [the documentation](../../../components/connectors/use-connectors/inbound.md#creating-the-connector-event).
- In a dual-region setup, you'll have two connector deployments, so message idempotency is important to avoid duplicate events.
- Zeebe cluster scaling is not supported.
- Web-Modeler is a standalone component and is not covered in this guide.
- Modeling applications can operate independently outside of the automation clusters.
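
To illustrate the Connectors limitation above: with two connector deployments, the same inbound event can arrive twice, so whatever acts on it should be keyed on a stable message ID. The following is a minimal sketch of that idea only, not how Camunda implements it; the function name `publish_once`, the temporary file, and the ID `order-42` are hypothetical.

```shell
# Hypothetical sketch: act on an event only the first time its ID is seen.
# SEEN_FILE stands in for whatever store tracks IDs that were already handled.
SEEN_FILE="$(mktemp)"

publish_once() {
  # $1: a stable message ID derived from the event itself
  if grep -qxF -- "$1" "$SEEN_FILE"; then
    echo "skipped duplicate $1"
  else
    echo "$1" >> "$SEEN_FILE"
    echo "published $1"
  fi
}

publish_once order-42   # event seen by the first connector deployment
publish_once order-42   # same event seen by the second deployment
```

The second call is skipped because the ID is already recorded. In a real setup, the ID would come from the message ID configured on the inbound connector, as described in the linked documentation.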
Expand Down Expand Up @@ -194,14 +189,13 @@ The **Recovery Point Objective (RPO)** is the maximum tolerable data loss measur

The **Recovery Time Objective (RTO)** is the time to restore services to a functional state.

For Zeebe the **RPO** is **0**.

For Operate and Tasklist the **RPO** is close to **0** for critical data due to the previously mentioned small chance of data loss in Elasticsearch during the failback procedure.
For Operate, Tasklist, and Zeebe the **RPO** is **0**.

The **RTO** can be considered for the **failover** and **failback** procedures, both resulting in a functional state.

- **failover** has an **RTO** of **15-20** minutes to restore a functional state, excluding DNS considerations.
- **failback** has an **RTO** of **25-30 + X** minutes to restore a functional state. Where X is the time it takes to back up and restore Elasticsearch, which is highly dependent on the setup and chosen [Elasticsearch backup type](https://www.elastic.co/guide/en/elasticsearch/reference/current/snapshots-register-repository.html#ess-repo-types).
- **failover** has an **RTO** of **< 1** minute to restore a functional state, excluding DNS considerations.
- **failback** has an **RTO** of **5 + X** minutes to restore a functional state, where X is the time it takes to back up and restore Elasticsearch. This timing is highly dependent on the setup and chosen [Elasticsearch backup type](https://www.elastic.co/guide/en/elasticsearch/reference/current/snapshots-register-repository.html#ess-repo-types).
During our automated tests, the reinstallation and reconfiguration of Camunda 8 takes 5 minutes. This can serve as a general guideline for the time required, though your experience may vary depending on your available resources and familiarity with the operational procedure.
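
As a worked example of the failback estimate above, the sketch below adds the 5-minute baseline to X. The function name `estimate_failback_rto` and the 12-minute value for X are illustrative assumptions, not measurements.

```shell
# Estimate the failback RTO in minutes: the 5-minute baseline from the text
# plus X, the Elasticsearch backup and restore time.
estimate_failback_rto() {
  # $1: X in whole minutes (depends on your setup and backup type)
  echo $((5 + $1))
}

# Hypothetical case: backup and restore take 12 minutes in total.
estimate_failback_rto 12   # prints 17
```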

:::info

Expand Down
783 changes: 368 additions & 415 deletions docs/self-managed/operational-guides/multi-region/dual-region-ops.md

Large diffs are not rendered by default.


This file was deleted.

This file was deleted.


This file was deleted.

31 changes: 4 additions & 27 deletions docs/self-managed/setup/deploy/amazon/amazon-eks/dual-region.md
Original file line number Diff line number Diff line change
Expand Up @@ -4,7 +4,7 @@ title: "Dual-region setup (EKS)"
description: "Deploy two Amazon Kubernetes (EKS) clusters with Terraform for a peered setup allowing dual-region communication."
---

<!-- Image source: https://docs.google.com/presentation/d/1mbEIc0KuumQCYeg1YMpvdVR8AEUcbTWqlesX-IxVIjY/edit?usp=sharing -->
<!-- Image source: https://docs.google.com/presentation/d/1w1KUsvx4r6RS7DAozx6X65BtLJcIxU6ve_y3bYFcfYk/edit?usp=sharing -->

import CoreDNSKubeDNS from "./assets/core-dns-kube-dns.svg"

Expand All @@ -22,8 +22,8 @@ This guide requires you to have previously completed or reviewed the steps taken

- An [AWS account](https://docs.aws.amazon.com/accounts/latest/reference/accounts-welcome.html) to create resources within AWS.
- [Helm (3.x)](https://helm.sh/docs/intro/install/) for installing and upgrading the [Camunda Helm chart](https://github.com/camunda/camunda-platform-helm).
- [Kubectl (1.28.x)](https://kubernetes.io/docs/tasks/tools/#kubectl) to interact with the cluster.
- [Terraform (1.7.x)](https://developer.hashicorp.com/terraform/downloads)
- [Kubectl (1.30.x)](https://kubernetes.io/docs/tasks/tools/#kubectl) to interact with the cluster.
- [Terraform (1.9.x)](https://developer.hashicorp.com/terraform/downloads)

## Considerations

Expand Down Expand Up @@ -69,8 +69,6 @@ You have to choose unique namespaces for Camunda 8 installations. The namespace
For example, you can install Camunda 8 into `CAMUNDA_NAMESPACE_0` in `CLUSTER_0`, and `CAMUNDA_NAMESPACE_1` on the `CLUSTER_1`, where `CAMUNDA_NAMESPACE_0` != `CAMUNDA_NAMESPACE_1`.
Using the same namespace names on both clusters won't work as CoreDNS won't be able to distinguish between traffic targeted at the local and remote cluster.

In addition to namespaces for Camunda installations, create the namespaces for failover (`CAMUNDA_NAMESPACE_0_FAILOVER` in `CLUSTER_0` and `CAMUNDA_NAMESPACE_1_FAILOVER` in `CLUSTER_1`), for the case of a total region loss. This is for completeness, so you don't forget to add the mapping on region recovery. The operational procedure is handled in a different [document on dual-region](./../../../../operational-guides/multi-region/dual-region-ops.md).

:::

4. Execute the script via the following command:
Expand Down Expand Up @@ -259,13 +257,6 @@ kubectl --context cluster-london -n kube-system edit configmap coredns
force_tcp
}
}
camunda-paris-failover.svc.cluster.local:53 {
errors
cache 30
forward . 10.202.19.54 10.202.53.21 10.202.84.222 {
force_tcp
}
}
### Cluster 0 - End ###

Please copy the following between
Expand All @@ -282,13 +273,6 @@ kubectl --context cluster-paris -n kube-system edit configmap coredns
force_tcp
}
}
camunda-london-failover.svc.cluster.local:53 {
errors
cache 30
forward . 10.192.27.56 10.192.84.117 10.192.36.238 {
force_tcp
}
}
### Cluster 1 - End ###
```
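
The shape of each generated entry can be sketched as a small function that takes a namespace and the load balancer IPs of the remote cluster. This is not the implementation of `generate_core_dns_entry.sh`; the function name `render_coredns_stanza` is hypothetical, and the namespace and IPs are the illustrative values from this diff, which will not work in your environment.

```shell
# Hypothetical sketch: render one CoreDNS server block that forwards lookups
# for a remote namespace to the given IPs over TCP.
render_coredns_stanza() {
  namespace="$1"
  shift
  printf '%s.svc.cluster.local:53 {\n' "$namespace"
  printf '    errors\n    cache 30\n'
  # "$*" joins the remaining arguments (the IPs) with single spaces
  printf '    forward . %s {\n' "$*"
  printf '        force_tcp\n    }\n}\n'
}

render_coredns_stanza camunda-paris 10.202.19.54 10.202.53.21 10.202.84.222
```

With the failover namespaces removed by this change, only one such block per remote Camunda namespace remains in each cluster's ConfigMap.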

Expand Down Expand Up @@ -340,13 +324,6 @@ data:
force_tcp
}
}
camunda-paris-failover.svc.cluster.local:53 {
errors
cache 30
forward . 10.202.19.54 10.202.53.21 10.202.84.222 {
force_tcp
}
}
```

</summary>
Expand Down Expand Up @@ -375,7 +352,7 @@ The script [test_dns_chaining.sh](https://github.com/camunda/c8-multi-region/blo

### Create the secret for Elasticsearch

Elasticsearch will need an S3 bucket for data backup and restore procedure, required during a regional failover. For this, you will need to configure a Kubernetes secret to not expose those in cleartext.
Elasticsearch needs an S3 bucket for the data backup and restore procedure required during a regional failback. To avoid exposing the bucket credentials in cleartext, configure a Kubernetes secret for them.

You can pull the data from Terraform since you exposed those via `output.tf`.

Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -129,8 +129,8 @@ In the event of a total active region loss, the following data will be lost:
- Role Based Access Control (RBAC) does not work.
- Optimize is not supported.
- This is due to Optimize depending on Identity to work.
- Connectors are not supported.
- This is due to Connectors depending on Operate to work for inbound Connectors and potentially resulting in race condition.
- Connectors can be deployed alongside, but make sure you understand idempotency as described in [the documentation](../../../components/connectors/use-connectors/inbound.md#creating-the-connector-event).
- In a dual-region setup, you'll have two connector deployments, so message idempotency is important to avoid duplicate events.
- During the failback procedure, there’s a small chance that some data will be lost in Elasticsearch affecting Operate and Tasklist.
- This **does not** affect the processing of process instances in any way. The impact is that some information about the affected instances might not be visible in Operate and Tasklist.
- This is further explained in the [operational procedure](./../../operational-guides/multi-region/dual-region-ops.md?failback=step2#failback) during the relevant step.
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -149,6 +149,8 @@ One of the regions is lost, meaning Zeebe:

For the failover procedure, ensure the lost region does not accidentally reconnect. You should be sure it is lost, and if so, look into measures to prevent it from reconnecting. For example, by utilizing the suggested solution below to isolate your active environment.

It's crucial to isolate the environments because, during the operational procedure, there will be duplicate Zeebe broker IDs. These would collide if the environments were not correctly isolated and the lost region accidentally came back online.

#### How to get there

Depending on your architecture, possible approaches are:
Expand Down Expand Up @@ -585,7 +587,7 @@ kubectl --context $CLUSTER_SURVIVING scale -n $CAMUNDA_NAMESPACE_SURVIVING deplo
kubectl --context $CLUSTER_SURVIVING scale -n $CAMUNDA_NAMESPACE_SURVIVING deployments/$HELM_RELEASE_NAME-tasklist --replicas 0
```

2. Disable the Zeebe Elasticsearch exporters in Zeebe via kubectl:
2. Disable the Zeebe Elasticsearch exporters in Zeebe via kubectl using the [exporting API](./../../zeebe-deployment/operations/management-api.md#exporting-api):

```bash
kubectl --context $CLUSTER_SURVIVING port-forward services/$HELM_RELEASE_NAME-zeebe-gateway 9600:9600 -n $CAMUNDA_NAMESPACE_SURVIVING
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -226,10 +226,10 @@ kubectl --context $CLUSTER_0 apply -f https://raw.githubusercontent.com/camunda/
kubectl --context $CLUSTER_1 apply -f https://raw.githubusercontent.com/camunda/c8-multi-region/main/aws/dual-region/kubernetes/internal-dns-lb.yml
```

2. Execute the script [generate_core_dns_entry.sh](https://github.com/camunda/c8-multi-region/blob/main/aws/dual-region/scripts/generate_core_dns_entry.sh) in the folder `aws/dual-region/scripts/` of the repository to help you generate the CoreDNS config. Make sure that you have previously exported the [environment prerequisites](#environment-prerequisites) since the script builds on top of it.
2. Execute the script [generate_core_dns_entry.sh](https://github.com/camunda/c8-multi-region/blob/main/aws/dual-region/scripts/generate_core_dns_entry.sh) with the parameter `legacy` in the folder `aws/dual-region/scripts/` of the repository to help you generate the CoreDNS config. Make sure that you have previously exported the [environment prerequisites](#environment-prerequisites) since the script builds on top of it.

```shell
./generate_core_dns_entry.sh
./generate_core_dns_entry.sh legacy
```

3. The script will retrieve the IPs of the load balancer via the AWS CLI and return the required config change.
Expand All @@ -244,7 +244,7 @@ For illustration purposes only. These values will not work in your environment.
:::

```shell
./generate_core_dns_entry.sh
./generate_core_dns_entry.sh legacy
Please copy the following between
### Cluster 0 - Start ### and ### Cluster 0 - End ###
and insert it at the end of your CoreDNS configmap in Cluster 0
Expand Down Expand Up @@ -375,7 +375,7 @@ The script [test_dns_chaining.sh](https://github.com/camunda/c8-multi-region/blo

### Create the secret for Elasticsearch

Elasticsearch will need an S3 bucket for data backup and restore procedure, required during a regional failover. For this, you will need to configure a Kubernetes secret to not expose those in cleartext.
Elasticsearch needs an S3 bucket for the data backup and restore procedure required during a regional failback. To avoid exposing the bucket credentials in cleartext, configure a Kubernetes secret for them.

You can pull the data from Terraform since you exposed those via `output.tf`.

Expand Down