From f697cf47612d95b499867d76e7456c9b014a1967 Mon Sep 17 00:00:00 2001 From: Lars Lange <9141483+Langleu@users.noreply.github.com> Date: Fri, 30 Aug 2024 16:39:39 +0200 Subject: [PATCH] chore(multi-region): add operational simplification from Zeebe 8.6 --- .../concepts/multi-region/dual-region.md | 18 +- .../multi-region/dual-region-ops.md | 783 ++++++++---------- .../multi-region/img/10.svg | 2 +- .../multi-region/img/11.svg | 2 +- .../multi-region/img/12.svg | 2 +- .../multi-region/img/13.svg | 2 +- .../multi-region/img/14.svg | 2 +- .../multi-region/img/15.svg | 1 - .../operational-guides/multi-region/img/3.svg | 1 - .../operational-guides/multi-region/img/4.svg | 2 +- .../operational-guides/multi-region/img/5.svg | 2 +- .../operational-guides/multi-region/img/6.svg | 2 +- .../operational-guides/multi-region/img/7.svg | 1 - .../operational-guides/multi-region/img/8.svg | 1 + .../operational-guides/multi-region/img/9.svg | 2 +- .../deploy/amazon/amazon-eks/dual-region.md | 31 +- .../concepts/multi-region/dual-region.md | 4 +- .../multi-region/dual-region-ops.md | 4 +- .../deploy/amazon/amazon-eks/dual-region.md | 8 +- 19 files changed, 397 insertions(+), 473 deletions(-) delete mode 100644 docs/self-managed/operational-guides/multi-region/img/15.svg delete mode 100644 docs/self-managed/operational-guides/multi-region/img/3.svg delete mode 100644 docs/self-managed/operational-guides/multi-region/img/7.svg create mode 100644 docs/self-managed/operational-guides/multi-region/img/8.svg diff --git a/docs/self-managed/concepts/multi-region/dual-region.md b/docs/self-managed/concepts/multi-region/dual-region.md index 60b732ff088..a7035ad862a 100644 --- a/docs/self-managed/concepts/multi-region/dual-region.md +++ b/docs/self-managed/concepts/multi-region/dual-region.md @@ -54,8 +54,6 @@ The currently supported Camunda 8 Self-Managed components are: The overall system is **active-passive**, even though some components may be **active-active**. You will have to take care of the user traffic routing or DNS by yourself, and won't be considered further. Select one region as the actively serving region and route the user traffic there. In case of a total region failure, route the traffic to the passive region yourself. - - ### Components #### Zeebe @@ -129,11 +127,8 @@ In the event of a total active region loss, the following data will be lost: - Role Based Access Control (RBAC) does not work. - Optimize is not supported. - This is due to Optimize depending on Identity to work. -- Connectors are not supported. - - This is due to Connectors depending on Operate to work for inbound Connectors and potentially resulting in race condition. -- During the failback procedure, there’s a small chance that some data will be lost in Elasticsearch affecting Operate and Tasklist. - - This **does not** affect the processing of process instances in any way. The impact is that some information about the affected instances might not be visible in Operate and Tasklist. - - This is further explained in the [operational procedure](./../../operational-guides/multi-region/dual-region-ops.md?failback=step2#failback) during the relevant step. +- Connectors can be deployed alongside but ensure to understand idempotency based on [the described documentation](../../../components/connectors/use-connectors/inbound.md#creating-the-connector-event). + - in a dual-region setup, you'll have two connector deployments and using message idempotency is of importance to not duplicate events. - Zeebe cluster scaling is not supported. 
- Web-Modeler is a standalone component and is not covered in this guide. - Modeling applications can operate independently outside of the automation clusters. @@ -194,14 +189,13 @@ The **Recovery Point Objective (RPO)** is the maximum tolerable data loss measur The **Recovery Time Objective (RTO)** is the time to restore services to a functional state. -For Zeebe the **RPO** is **0**. - -For Operate and Tasklist the **RPO** is close to **0** for critical data due to the previously mentioned small chance of data loss in Elasticsearch during the failback procedure. +For Operate, Tasklist, and Zeebe the **RPO** is **0**. The **RTO** can be considered for the **failover** and **failback** procedures, both resulting in a functional state. -- **failover** has an **RTO** of **15-20** minutes to restore a functional state, excluding DNS considerations. -- **failback** has an **RTO** of **25-30 + X** minutes to restore a functional state. Where X is the time it takes to back up and restore Elasticsearch, which is highly dependent on the setup and chosen [Elasticsearch backup type](https://www.elastic.co/guide/en/elasticsearch/reference/current/snapshots-register-repository.html#ess-repo-types). +- **failover** has an **RTO** of **< 1** minute to restore a functional state, excluding DNS considerations. +- **failback** has an **RTO** of **5 + X** minutes to restore a functional state, where X is the time it takes to back up and restore Elasticsearch. This timing is highly dependent on the setup and chosen [Elasticsearch backup type](https://www.elastic.co/guide/en/elasticsearch/reference/current/snapshots-register-repository.html#ess-repo-types). + During our automated tests, the reinstallation and reconfiguration of Camunda 8 takes 5 minutes. This can serve as a general guideline for the time required, though your experience may vary depending on your available resources and familiarity with the operational procedure. :::info diff --git a/docs/self-managed/operational-guides/multi-region/dual-region-ops.md b/docs/self-managed/operational-guides/multi-region/dual-region-ops.md index 6cca48ebb72..5ad2e1b0b55 100644 --- a/docs/self-managed/operational-guides/multi-region/dual-region-ops.md +++ b/docs/self-managed/operational-guides/multi-region/dual-region-ops.md @@ -14,21 +14,25 @@ import StateContainer from './components/stateContainer.jsx'; -import Three from './img/3.svg'; import Four from './img/4.svg'; import Five from './img/5.svg'; import Six from './img/6.svg'; -import Seven from './img/7.svg'; +import Eight from './img/8.svg'; import Nine from './img/9.svg'; import Ten from './img/10.svg'; import Eleven from './img/11.svg'; import Twelve from './img/12.svg'; import Thirteen from './img/13.svg'; import Fourteen from './img/14.svg'; -import Fifteen from './img/15.svg'; + +:::info + +This procedure has been updated in the Camunda 8.6 release. The procedure used in Camunda 8.5 has been deprecated, and compatibility will be removed in the 8.7 release. + +::: ## Introduction @@ -49,7 +53,7 @@ Running dual-region setups requires the users to be able to detect any regional - A dual-region Camunda 8 setup installed in two different regions, preferably derived from our [AWS dual-region guide](/self-managed/setup/deploy/amazon/amazon-eks/dual-region.md). 
- In that guide, we're showcasing Kubernetes dual-region installation, based on the following tools: - [Helm (3.x)](https://helm.sh/docs/intro/install/) for installing and upgrading the [Camunda Helm chart](https://github.com/camunda/camunda-platform-helm). - - [Kubectl (1.28.x)](https://kubernetes.io/docs/tasks/tools/#kubectl) to interact with the Kubernetes cluster. + - [Kubectl (1.30.x)](https://kubernetes.io/docs/tasks/tools/#kubectl) to interact with the Kubernetes cluster. - [zbctl](./../../../apis-tools/cli-client/index.md) to interact with the Zeebe cluster. ## Terminology @@ -72,7 +76,7 @@ After you've identified a region loss and before beginning the region restoratio In case the region is only lost temporarily (for example, due to network hiccups), Zeebe can survive a region loss but will stop processing due to the loss in quorum and ultimately fill up the persistent disk before running out of volume, resulting in the loss of data. -The **failover** phase of the procedure results in the temporary restoration of Camunda 8 functionality by redeploying it within the surviving region to resume Zeebe engine functionality. Before the completion of this phase, Zeebe is unable to export or process new data until it achieves quorum and the configured Elasticsearch endpoints for the exporters become accessible, which is the outcome of the failover procedure. +The **failover** phase of the procedure temporarily restores Camunda 8 functionality by removing the lost brokers and the export to the unreachable Elasticsearch instance. The **failback** phase of the procedure results in completely restoring the failed region to its full functionality. It requires you to have the lost region ready again for the redeployment of Camunda 8. @@ -101,7 +105,6 @@ Depending on which region you lost, select the correct tab below and export thos export CLUSTER_SURVIVING=$CLUSTER_1 export CLUSTER_RECREATED=$CLUSTER_0 export CAMUNDA_NAMESPACE_SURVIVING=$CAMUNDA_NAMESPACE_1 -export CAMUNDA_NAMESPACE_FAILOVER=$CAMUNDA_NAMESPACE_1_FAILOVER export CAMUNDA_NAMESPACE_RECREATED=$CAMUNDA_NAMESPACE_0 export REGION_SURVIVING=region1 export REGION_RECREATED=region0 @@ -114,7 +117,6 @@ export REGION_RECREATED=region0 export CLUSTER_SURVIVING=$CLUSTER_0 export CLUSTER_RECREATED=$CLUSTER_1 export CAMUNDA_NAMESPACE_SURVIVING=$CAMUNDA_NAMESPACE_0 -export CAMUNDA_NAMESPACE_FAILOVER=$CAMUNDA_NAMESPACE_0_FAILOVER export CAMUNDA_NAMESPACE_RECREATED=$CAMUNDA_NAMESPACE_1 export REGION_SURVIVING=region0 export REGION_RECREATED=region1 @@ -128,43 +130,11 @@ export REGION_RECREATED=region1 -#### Ensure network isolation between two regions (for example, between two Kubernetes clusters) +#### Remove lost brokers from Zeebe cluster in the surviving region } -desired={} -/> - -
- -#### Current state - -One of the regions is lost, meaning Zeebe: - -- Is unable to process new requests due to losing the quorum -- Stops exporting new data to Elasticsearch in the lost region -- Stops exporting new data to Elasticsearch in the survived region - -#### Desired state - -For the failover procedure, ensure the lost region does not accidentally reconnect. You should be sure it is lost, and if so, look into measures to prevent it from reconnecting. For example, by utilizing the suggested solution below to isolate your active environment. - -#### How to get there - -Depending on your architecture, possible approaches are: - -- Configuring [Kubernetes Network Policies](https://kubernetes.io/docs/concepts/services-networking/network-policies/) to disable traffic flow between the clusters. -- Configure firewall rules to disable traffic flow between the clusters. - -
-
- - -#### Create temporary Camunda 8 installation in the failover mode in the surviving region - -} -desired={} +current={} +desired={} />
@@ -177,32 +147,25 @@ Due to the Zeebe data replication, no data has been lost. #### Desired state -You are creating a temporary Camunda 8 deployment within the same region, but different namespace, to recover the Zeebe cluster functionality. Using a different namespace allows for easier distinguishing between the normal Zeebe deployment and Zeebe failover deployment. - -The newly deployed Zeebe brokers will be running in the failover mode. This will restore the quorum and the Zeebe data processing. Additionally, the new failover brokers are configured to export the data to the surviving Elasticsearch instance and to the newly deployed failover Elasticsearch instance. +You have removed the lost brokers from the Zeebe cluster. This will allow us to continue processing after the next step and ensure that the new brokers in the failback procedure will only join the cluster with our intervention. #### How to get there -In the case **Region 1** was lost: in the previously cloned repository [c8-multi-region](https://github.com/camunda/c8-multi-region), navigate to the folder [aws/dual-region/kubernetes/region0](https://github.com/camunda/c8-multi-region/blob/main/aws/dual-region/kubernetes/region0/). This contains the example Helm values yaml `camunda-values-failover.yml` containing the required overlay for the **failover** mode. +You will port-forward the `Zeebe Gateway` in the surviving region to the local host to interact with the Gateway. -In the case when your **Region 0** was lost, instead go to the folder [aws/dual-region/kubernetes/region1](https://github.com/camunda/c8-multi-region/blob/main/aws/dual-region/kubernetes/region1/) for the `camunda-values-failover.yml` file. +The following alternatives to port-forwarding are possible: -The chosen `camunda-values-failover.yml` requires adjustments before installing the Helm chart and the same has to be done for the base `camunda-values.yml` in `aws/dual-region/kubernetes`. +- if it's exposed to the outside, one can skip port-forwarding and use the URL directly +- one can [`exec`](https://kubernetes.io/docs/reference/kubectl/generated/kubectl_exec/) into an existing pod (such as Elasticsearch), and `curl` from there +- or temporarily [`run`](https://kubernetes.io/docs/reference/kubectl/generated/kubectl_run/) an Ubuntu pod in the cluster to `curl` from there -- `ZEEBE_BROKER_CLUSTER_INITIALCONTACTPOINTS` -- `ZEEBE_BROKER_EXPORTERS_ELASTICSEARCHREGION0_ARGS_URL` -- `ZEEBE_BROKER_EXPORTERS_ELASTICSEARCHREGION1_ARGS_URL` +In our example, we went with port-forwarding to a local host, but other alternatives can also be used. -1. The bash script [generate_zeebe_helm_values.sh](https://github.com/camunda/c8-multi-region/blob/main/aws/dual-region/scripts/generate_zeebe_helm_values.sh) in the repository folder `aws/dual-region/scripts/` helps generate those values. You only have to copy and replace them within the previously mentioned Helm values files. It will use the exported environment variables of the environment prerequisites for namespaces and regions. Additionally, you have to pass in whether your region 0 or 1 was lost. +1. 
Use the [zbctl client](../../../apis-tools/cli-client/index.md) to retrieve list of remaining brokers ```bash -./generate_zeebe_helm_values.sh failover - -# It will ask you to provide the following values -# Enter the region that was lost, values can either be 0 or 1: -## In our case we lost region 1, therefore input 1 -# Enter Zeebe cluster size (total number of Zeebe brokers in both Kubernetes clusters): -## for a dual-region setup we recommend 8. Resulting in 4 brokers per region. +kubectl --context $CLUSTER_SURVIVING port-forward services/$HELM_RELEASE_NAME-zeebe-gateway 26500:26500 -n $CAMUNDA_NAMESPACE_SURVIVING +zbctl status --insecure --address localhost:26500 ```
@@ -210,49 +173,56 @@ The chosen `camunda-values-failover.yml` requires adjustments before installing ```bash -Please use the following to change the existing environment variable ZEEBE_BROKER_CLUSTER_INITIALCONTACTPOINTS in the failover Camunda Helm chart values file 'region0/camunda-values-failover.yml' and in the base Camunda Helm chart values file 'camunda-values.yml'. It's part of the 'zeebe.env' path. - -- name: ZEEBE_BROKER_CLUSTER_INITIALCONTACTPOINTS - value: camunda-zeebe-0.camunda-zeebe.camunda-london.svc.cluster.local:26502,camunda-zeebe-0.camunda-zeebe.camunda-paris.svc.cluster.local:26502,camunda-zeebe-1.camunda-zeebe.camunda-london.svc.cluster.local:26502,camunda-zeebe-1.camunda-zeebe.camunda-paris.svc.cluster.local:26502,camunda-zeebe-2.camunda-zeebe.camunda-london.svc.cluster.local:26502,camunda-zeebe-2.camunda-zeebe.camunda-paris.svc.cluster.local:26502,camunda-zeebe-3.camunda-zeebe.camunda-london.svc.cluster.local:26502,camunda-zeebe-3.camunda-zeebe.camunda-paris.svc.cluster.local:26502 - -Please use the following to change the existing environment variable ZEEBE_BROKER_EXPORTERS_ELASTICSEARCHREGION0_ARGS_URL in the failover Camunda Helm chart values file 'region0/camunda-values-failover.yml' and in the base Camunda Helm chart values file 'camunda-values.yml'. It's part of the 'zeebe.env' path. - -- name: ZEEBE_BROKER_EXPORTERS_ELASTICSEARCHREGION0_ARGS_URL - value: http://camunda-elasticsearch-master-hl.camunda-london.svc.cluster.local:9200 - -Please use the following to change the existing environment variable ZEEBE_BROKER_EXPORTERS_ELASTICSEARCHREGION1_ARGS_URL in the failover Camunda Helm chart values file 'region0/camunda-values-failover.yml' and in the base Camunda Helm chart values file 'camunda-values.yml'. It's part of the 'zeebe.env' path. - -- name: ZEEBE_BROKER_EXPORTERS_ELASTICSEARCHREGION1_ARGS_URL - value: http://camunda-elasticsearch-master-hl.camunda-london-failover.svc.cluster.local:9200 +Cluster size: 8 +Partitions count: 8 +Replication factor: 4 +Gateway version: 8.6.0 +Brokers: + Broker 0 - camunda-zeebe-0.camunda-zeebe.camunda-london.svc:26501 + Version: 8.6.0 + Partition 1 : Leader, Healthy + Partition 6 : Follower, Healthy + Partition 7 : Follower, Healthy + Partition 8 : Follower, Healthy + Broker 2 - camunda-zeebe-1.camunda-zeebe.camunda-london.svc:26501 + Version: 8.6.0 + Partition 1 : Follower, Healthy + Partition 2 : Follower, Healthy + Partition 3 : Follower, Healthy + Partition 8 : Leader, Healthy + Broker 4 - camunda-zeebe-2.camunda-zeebe.camunda-london.svc:26501 + Version: 8.6.0 + Partition 2 : Follower, Healthy + Partition 3 : Leader, Healthy + Partition 4 : Follower, Healthy + Partition 5 : Follower, Healthy + Broker 6 - camunda-zeebe-3.camunda-zeebe.camunda-london.svc:26501 + Version: 8.6.0 + Partition 4 : Follower, Healthy + Partition 5 : Follower, Healthy + Partition 6 : Follower, Healthy + Partition 7 : Leader, Healthy ```
-2. As the script suggests, replace the environment variables within `camunda-values-failover.yml`.
-3. Repeat the adjustments for the base Helm values file `camunda-values.yml` in `aws/dual-region/kubernetes` with the same output for the mentioned environment variables.
-4. From the terminal context of `aws/dual-region/kubernetes`, execute the following:
+2. Port-forward the service of the Zeebe Gateway for the [management REST API](../../zeebe-deployment/configuration/gateway.md#managementserver):

 ```bash
-helm install $HELM_RELEASE_NAME camunda/camunda-platform \
-  --version $HELM_CHART_VERSION \
-  --kube-context $CLUSTER_SURVIVING \
-  --namespace $CAMUNDA_NAMESPACE_FAILOVER \
-  -f camunda-values.yml \
-  -f $REGION_SURVIVING/camunda-values-failover.yml
+kubectl --context $CLUSTER_SURVIVING port-forward services/$HELM_RELEASE_NAME-zeebe-gateway 9600:9600 -n $CAMUNDA_NAMESPACE_SURVIVING
 ```

-#### Verification
-
-The following command will show the deployed pods of the failover namespace.
-
-Only the minimal amount of brokers required to restore the quorum will be deployed in the failover installation. For example, if `clusterSize` is eight, two Zeebe brokers will be deployed in the failover installation instead of the normal four. This is expected.
+3. Based on the [Cluster Scaling APIs](../../zeebe-deployment/operations/cluster-scaling.md), send a request to the Zeebe Gateway to redistribute the load to the remaining brokers, thereby removing the lost brokers.
+   In our example, we have lost region 1 and with it the odd-numbered brokers. This means we have to redistribute the load to the remaining even-numbered brokers 0, 2, 4, and 6.

 ```bash
-kubectl --context $CLUSTER_SURVIVING get pods -n $CAMUNDA_NAMESPACE_FAILOVER
+curl -XPOST 'http://localhost:9600/actuator/cluster/brokers?force=true' -H 'Content-Type: application/json' -d '["0", "2", "4", "6"]'
 ```

-Port-forwarding the Zeebe Gateway via `kubectl` and printing the topology should reveal that the **failover** brokers have joined the cluster.
+#### Verification
+
+Port-forwarding the Zeebe Gateway via `kubectl` and printing the topology should reveal that the cluster size has decreased to 4, partitions have been redistributed over the remaining brokers, and new leaders have been elected.
```bash kubectl --context $CLUSTER_SURVIVING port-forward services/$HELM_RELEASE_NAME-zeebe-gateway 26500:26500 -n $CAMUNDA_NAMESPACE_SURVIVING @@ -264,43 +234,31 @@ zbctl status --insecure --address localhost:26500 ```bash -Cluster size: 8 +Cluster size: 4 Partitions count: 8 -Replication factor: 4 -Gateway version: 8.5.0 +Replication factor: 2 +Gateway version: 8.6.0 Brokers: Broker 0 - camunda-zeebe-0.camunda-zeebe.camunda-london.svc:26501 - Version: 8.5.0 + Version: 8.6.0 Partition 1 : Leader, Healthy - Partition 6 : Follower, Healthy - Partition 7 : Follower, Healthy - Partition 8 : Follower, Healthy - Broker 1 - camunda-zeebe-0.camunda-zeebe.camunda-london-failover.svc:26501 - Version: 8.5.0 - Partition 1 : Follower, Healthy - Partition 2 : Leader, Healthy + Partition 6 : Leader, Healthy Partition 7 : Follower, Healthy Partition 8 : Follower, Healthy Broker 2 - camunda-zeebe-1.camunda-zeebe.camunda-london.svc:26501 - Version: 8.5.0 + Version: 8.6.0 Partition 1 : Follower, Healthy - Partition 2 : Follower, Healthy + Partition 2 : Leader, Healthy Partition 3 : Follower, Healthy Partition 8 : Leader, Healthy Broker 4 - camunda-zeebe-2.camunda-zeebe.camunda-london.svc:26501 - Version: 8.5.0 + Version: 8.6.0 Partition 2 : Follower, Healthy - Partition 3 : Follower, Healthy - Partition 4 : Follower, Healthy - Partition 5 : Follower, Healthy - Broker 5 - camunda-zeebe-1.camunda-zeebe.camunda-london-failover.svc:26501 - Version: 8.5.0 Partition 3 : Leader, Healthy Partition 4 : Follower, Healthy Partition 5 : Follower, Healthy - Partition 6 : Leader, Healthy Broker 6 - camunda-zeebe-3.camunda-zeebe.camunda-london.svc:26501 - Version: 8.5.0 + Version: 8.6.0 Partition 4 : Leader, Healthy Partition 5 : Leader, Healthy Partition 6 : Follower, Healthy @@ -310,16 +268,39 @@ Brokers: +You can also use the Zeebe Gateway's REST API to ensure the scaling progress has been completed. For better readability of the output, it is recommended to use [jq](https://jqlang.github.io/jq/). + +```bash +kubectl --context $CLUSTER_SURVIVING port-forward services/$HELM_RELEASE_NAME-zeebe-gateway 9600:9600 -n $CAMUNDA_NAMESPACE_SURVIVING +curl -XGET 'http://localhost:9600/actuator/cluster' | jq .lastChange +``` + +
+ Example output + + +```bash +{ + "id": 2, + "status": "COMPLETED", + "startedAt": "2024-08-23T11:33:08.355681311Z", + "completedAt": "2024-08-23T11:33:09.170531963Z" +} +``` + + +
+
- + -#### Configure Zeebe to export data to temporary Elasticsearch deployment +#### Configure Zeebe to disable the Elastic exporter to the lost region } -desired={} +current={} +desired={} />
@@ -328,15 +309,9 @@ desired={} Zeebe is not yet be able to continue exporting data since the Zeebe brokers in the surviving region are configured to point to the Elasticsearch instance of the lost region. -:::info - -Simply disabling the exporter would not be helpful here, since the sequence numbers in the exported data are not persistent when an exporter configuration is removed from Zeebe settings and added back later. The correct sequence numbers are required by Operate and Tasklist to import Elasticsearch data correctly. - -::: - #### Desired state -You have reconfigured the existing Camunda deployment in `CAMUNDA_NAMESPACE_SURVIVING` to point Zeebe to the export data to the temporary Elasticsearch instance that was previously created in **Step 2**. +You have disabled the Elasticsearch exporter to the failed region in the Zeebe cluster. The Zeebe cluster is then unblocked and can export data to Elasticsearch again. @@ -344,41 +319,42 @@ Completing this step will restore regular interaction with Camunda 8 for your us #### How to get there -In **Step 2** you have already adjusted the base Helm values file `camunda-values.yml` in `aws/dual-region/kubernetes` with the same changes as for the failover deployment for the environment variables. +1. Portforward the service of the Zeebe Gateway for the [management REST API](../../zeebe-deployment/configuration/gateway.md#managementserver) -- `ZEEBE_BROKER_CLUSTER_INITIALCONTACTPOINTS` -- `ZEEBE_BROKER_EXPORTERS_ELASTICSEARCHREGION0_ARGS_URL` -- `ZEEBE_BROKER_EXPORTERS_ELASTICSEARCHREGION1_ARGS_URL` +```bash +kubectl --context $CLUSTER_SURVIVING port-forward services/$HELM_RELEASE_NAME-zeebe-gateway 9600:9600 -n $CAMUNDA_NAMESPACE_SURVIVING +``` -From the `aws/dual-region/kubernetes` directory, do a Helm upgrade to update the configuration of the Zeebe deployment in `CAMUNDA_NAMESPACE_SURVIVING` to point to the failover Elasticsearch instance: +2. List all exporters to find the corresponding ID. Alternatively, you can check your `camunda-values.yml` file, which lists the exporters as those had to be configured explicitly. ```bash -helm upgrade $HELM_RELEASE_NAME camunda/camunda-platform \ - --version $HELM_CHART_VERSION \ - --kube-context $CLUSTER_SURVIVING \ - --namespace $CAMUNDA_NAMESPACE_SURVIVING \ - -f camunda-values.yml \ - -f $REGION_SURVIVING/camunda-values.yml +curl -XGET 'http://localhost:9600/actuator/exporters' ``` -#### Verification - -The following command will show the deployed pods of the surviving namespace. You should see that the Zeebe brokers have just restarted or are still restarting due to the configuration upgrade. +
+ Example output + ```bash -kubectl --context $CLUSTER_SURVIVING get pods -n $CAMUNDA_NAMESPACE_SURVIVING +[{"exporterId":"elasticsearchregion0","status":"ENABLED"},{"exporterId":"elasticsearchregion1","status":"ENABLED"}] ``` -Furthermore, the following command will watch the [StatefulSets](https://kubernetes.io/docs/concepts/workloads/controllers/statefulset/) update of the Zeebe brokers and wait until it's done. + +
+ +2. Based on the Exporter API you will send a request to the Zeebe Gateway to disable the Elasticsearch exporter to the lost region. ```bash -kubectl --context $CLUSTER_SURVIVING rollout status --watch statefulset/$HELM_RELEASE_NAME-zeebe -n $CAMUNDA_NAMESPACE_SURVIVING +curl -XPOST 'http://localhost:9600/actuator/exporters/elasticsearchregion1/disable' ``` -Alternatively, you can check that the Elasticsearch value was updated in the [StatefulSets](https://kubernetes.io/docs/concepts/workloads/controllers/statefulset/) configuration of the Zeebe brokers and are reflecting the previous output of the script `generate_zeebe_helm_values.sh` in **Step 2**. +#### Verification + +Port-forwarding the Zeebe Gateway via `kubectl` for the REST API and listing all exporters will reveal their current status. ```bash -kubectl --context $CLUSTER_SURVIVING get statefulsets $HELM_RELEASE_NAME-zeebe -oyaml -n $CAMUNDA_NAMESPACE_SURVIVING | grep -A1 'ZEEBE_BROKER_EXPORTERS_ELASTICSEARCHREGION[0-1]_ARGS_URL' +kubectl --context $CLUSTER_SURVIVING port-forward services/$HELM_RELEASE_NAME-zeebe-gateway 9600:9600 -n $CAMUNDA_NAMESPACE_SURVIVING +curl -XGET 'http://localhost:9600/actuator/exporters' ```
@@ -386,21 +362,16 @@ kubectl --context $CLUSTER_SURVIVING get statefulsets $HELM_RELEASE_NAME-zeebe - ```bash - - name: ZEEBE_BROKER_EXPORTERS_ELASTICSEARCHREGION0_ARGS_URL - value: http://camunda-elasticsearch-master-hl.camunda-london.svc.cluster.local:9200 --- - - name: ZEEBE_BROKER_EXPORTERS_ELASTICSEARCHREGION1_ARGS_URL - value: http://camunda-elasticsearch-master-hl.camunda-london-failover.svc.cluster.local:9200 +[{"exporterId":"elasticsearchregion0","status":"ENABLED"},{"exporterId":"elasticsearchregion1","status":"DISABLED"}] ```
-Lastly, port-forwarding the Zeebe Gateway via `kubectl` and printing the topology should reveal that all brokers have joined the Zeebe cluster again. +Via the already port-forwarded Zeebe Gateway, you can also check the status of the change by using the Cluster API. ```bash -kubectl --context $CLUSTER_SURVIVING port-forward services/$HELM_RELEASE_NAME-zeebe-gateway 26500:26500 -n $CAMUNDA_NAMESPACE_SURVIVING -zbctl status --insecure --address localhost:26500 +curl -XGET 'http://localhost:9600/actuator/cluster' | jq .lastChange ```
@@ -408,47 +379,12 @@ zbctl status --insecure --address localhost:26500 ```bash -Cluster size: 8 -Partitions count: 8 -Replication factor: 4 -Gateway version: 8.5.0 -Brokers: - Broker 0 - camunda-zeebe-0.camunda-zeebe.camunda-london.svc:26501 - Version: 8.5.0 - Partition 1 : Leader, Healthy - Partition 6 : Follower, Healthy - Partition 7 : Follower, Healthy - Partition 8 : Follower, Healthy - Broker 1 - camunda-zeebe-0.camunda-zeebe.camunda-london-failover.svc:26501 - Version: 8.5.0 - Partition 1 : Follower, Healthy - Partition 2 : Leader, Healthy - Partition 7 : Follower, Healthy - Partition 8 : Follower, Healthy - Broker 2 - camunda-zeebe-1.camunda-zeebe.camunda-london.svc:26501 - Version: 8.5.0 - Partition 1 : Follower, Healthy - Partition 2 : Follower, Healthy - Partition 3 : Follower, Healthy - Partition 8 : Leader, Healthy - Broker 4 - camunda-zeebe-2.camunda-zeebe.camunda-london.svc:26501 - Version: 8.5.0 - Partition 2 : Follower, Healthy - Partition 3 : Follower, Healthy - Partition 4 : Follower, Healthy - Partition 5 : Follower, Healthy - Broker 5 - camunda-zeebe-1.camunda-zeebe.camunda-london-failover.svc:26501 - Version: 8.5.0 - Partition 3 : Leader, Healthy - Partition 4 : Follower, Healthy - Partition 5 : Follower, Healthy - Partition 6 : Leader, Healthy - Broker 6 - camunda-zeebe-3.camunda-zeebe.camunda-london.svc:26501 - Version: 8.5.0 - Partition 4 : Leader, Healthy - Partition 5 : Leader, Healthy - Partition 6 : Follower, Healthy - Partition 7 : Leader, Healthy +{ + "id": 4, + "status": "COMPLETED", + "startedAt": "2024-08-23T11:36:14.127510679Z", + "completedAt": "2024-08-23T11:36:14.379980715Z" +} ``` @@ -463,40 +399,31 @@ Brokers: -#### Deploy Camunda 8 in the failback mode in the newly created region +#### Deploy Camunda 8 in the newly created region } -desired={} +current={} +desired={} />
#### Current state -You have temporary Zeebe brokers deployed in failover mode together with a temporary Elasticsearch within the same surviving region. +You have a standalone region with a working Camunda 8 setup, including Zeebe, Operate, Tasklist, and Elasticsearch. #### Desired state -You want to restore the dual-region functionality again and deploy Zeebe in failback mode to the newly restored region. - -Failback mode means new `clusterSize/2` brokers will be installed in the restored region: - -- `clusterSize/4` brokers are running in the normal mode, participating processing and restoring the data. -- `clusterSize/4` brokers are temporarily running in the sleeping mode. They will run in the normal mode later once the failover setup is removed. - -An Elasticsearch will also be deployed in the restored region, but not used yet, before the data is restored into it from the backup from the surviving Elasticsearch cluster. +You want to restore the dual-region functionality and deploy Camunda 8, consisting of Zeebe and Elasticsearch, to the newly restored region. Operate and Tasklist need to stay disabled to prevent interference with the database backup and restore. #### How to get there -The changes previously done in the base Helm values file `camunda-values.yml` in `aws/dual-region/kubernetes` should still be present from **Failover - Step 2**. +From your initial dual-region deployment, your base Helm values file `camunda-values.yml` in `aws/dual-region/kubernetes` should still be present. -In particular, the values `ZEEBE_BROKER_EXPORTERS_ELASTICSEARCHREGION0_ARGS_URL` and `ZEEBE_BROKER_EXPORTERS_ELASTICSEARCHREGION1_ARGS_URL` should solely point at the surviving region. +In particular, the values `ZEEBE_BROKER_EXPORTERS_ELASTICSEARCHREGION0_ARGS_URL` and `ZEEBE_BROKER_EXPORTERS_ELASTICSEARCHREGION1_ARGS_URL` should point to their respective regions. The placeholder in `ZEEBE_BROKER_CLUSTER_INITIALCONTACTPOINTS` should contain the Zeebe endpoints of both regions, the result of the `aws/dual-region/scripts/generate_zeebe_helm_values.sh`. In addition, the following Helm command will disable Operate and Tasklist since those will only be enabled at the end of the full region restore. It's required to keep them disabled in the newly created region due to their Elasticsearch importers. -Lastly, the `installationType` is set to `failBack` to switch the behavior of Zeebe and prepare for this procedure. - 1. From the terminal context of `aws/dual-region/kubernetes` execute: ```bash @@ -506,7 +433,6 @@ helm install $HELM_RELEASE_NAME camunda/camunda-platform \ --namespace $CAMUNDA_NAMESPACE_RECREATED \ -f camunda-values.yml \ -f $REGION_RECREATED/camunda-values.yml \ - --set global.multiregion.installationType=failBack \ --set operate.enabled=false \ --set tasklist.enabled=false ``` @@ -515,22 +441,72 @@ helm install $HELM_RELEASE_NAME camunda/camunda-platform \ The following command will show the deployed pods of the newly created region. -Depending on your chosen `clusterSize`, you should see that the **failback** deployment contains some Zeebe instances being ready and others unready. Those unready instances are sleeping indefinitely and is the expected behavior. -This behavior stems from the **failback** mode since we still have the temporary **failover**, which acts as a replacement for the lost region. +Depending on your chosen `clusterSize`, you should see that half of the amount are spawned in Zeebe brokers. 
-For example, in the case of `clusterSize: 8`, you find two active Zeebe brokers and two unready brokers in the newly created region. +For example, in the case of `clusterSize: 8`, you find four Zeebe brokers in the newly created region. + +:::warning +It is expected that the Zeebe broker pods don't become ready as they're not yet part of a Zeebe cluster, therefore not considered healthy by the Kubernetes readiness probe. +::: ```bash kubectl --context $CLUSTER_RECREATED get pods -n $CAMUNDA_NAMESPACE_RECREATED ``` -Port-forwarding the Zeebe Gateway via `kubectl` and printing the topology should reveal that the **failback** brokers have joined the cluster. +Port-forwarding the Zeebe Gateway via `kubectl` and printing the topology should reveal that the new Zeebe brokers are recognized but yet a full member of the Zeebe cluster. ```bash kubectl --context $CLUSTER_SURVIVING port-forward services/$HELM_RELEASE_NAME-zeebe-gateway 26500:26500 -n $CAMUNDA_NAMESPACE_SURVIVING zbctl status --insecure --address localhost:26500 ``` +
+ Example Output + + +```bash +Cluster size: 4 +Partitions count: 8 +Replication factor: 2 +Gateway version: 8.6.0 +Brokers: + Broker 0 - camunda-zeebe-0.camunda-zeebe.camunda-london.svc:26501 + Version: 8.6.0 + Partition 1 : Leader, Healthy + Partition 6 : Follower, Healthy + Partition 7 : Follower, Healthy + Partition 8 : Leader, Healthy + Broker 1 - camunda-zeebe-0.camunda-zeebe.camunda-paris.svc:26501 + Version: 8.6.0 + Broker 2 - camunda-zeebe-1.camunda-zeebe.camunda-london.svc:26501 + Version: 8.6.0 + Partition 1 : Follower, Healthy + Partition 2 : Leader, Healthy + Partition 3 : Leader, Healthy + Partition 8 : Follower, Healthy + Broker 3 - camunda-zeebe-1.camunda-zeebe.camunda-paris.svc:26501 + Version: 8.6.0 + Broker 4 - camunda-zeebe-2.camunda-zeebe.camunda-london.svc:26501 + Version: 8.6.0 + Partition 2 : Follower, Healthy + Partition 3 : Follower, Healthy + Partition 4 : Leader, Healthy + Partition 5 : Leader, Healthy + Broker 5 - camunda-zeebe-2.camunda-zeebe.camunda-paris.svc:26501 + Version: 8.6.0 + Broker 6 - camunda-zeebe-3.camunda-zeebe.camunda-london.svc:26501 + Version: 8.6.0 + Partition 4 : Follower, Healthy + Partition 5 : Follower, Healthy + Partition 6 : Leader, Healthy + Partition 7 : Leader, Healthy + Broker 7 - camunda-zeebe-3.camunda-zeebe.camunda-london.svc:26501 + Version: 8.6.0 +``` + + +
+
@@ -538,32 +514,22 @@ zbctl status --insecure --address localhost:26500 #### Pause Zeebe exporters to Elasticsearch, pause Operate and Tasklist } -desired={} +current={} +desired={} />
#### Current state -You currently have the following setups: +You currently have the following setup: -- Functioning Zeebe cluster (in multi-region mode): - - Camunda 8 installation in the failover mode in the surviving region - - Camunda 8 installation in the failback mode in the recreated region +- Functioning Zeebe cluster (within a single region): + - working Camunda 8 installation in the surviving region + - non-participating Camunda 8 installation in the recreated region #### Desired state -:::warning - -This step is very important to minimize the risk of losing any data when restoring the backup in the recreated region. - -There remains a small chance of losing some data in Elasticsearch (and in turn, in Operate and Tasklist too). This is because Zeebe might have exported some records to the failover Elasticsearch in the surviving region, but not to the main Elasticsearch in the surviving region, before the exporters have been paused. - -This means those records will not be included in the surviving region's Elasticsearch backup when the recreated region's Elasticsearch is restored from the backup, leading to the new region missing those records (as Zeebe does not re-export them). - -::: - You are preparing everything for the newly created region to take over again to restore the functioning dual-region setup. For this, stop the Zeebe exporters from exporting any new data to Elasticsearch so you can create an Elasticsearch backup. @@ -585,7 +551,7 @@ kubectl --context $CLUSTER_SURVIVING scale -n $CAMUNDA_NAMESPACE_SURVIVING deplo kubectl --context $CLUSTER_SURVIVING scale -n $CAMUNDA_NAMESPACE_SURVIVING deployments/$HELM_RELEASE_NAME-tasklist --replicas 0 ``` -2. Disable the Zeebe Elasticsearch exporters in Zeebe via kubectl: +2. Disable the Zeebe Elasticsearch exporters in Zeebe via kubectl using the [exporting API](./../../zeebe-deployment/operations/management-api.md#exporting-api): ```bash kubectl --context $CLUSTER_SURVIVING port-forward services/$HELM_RELEASE_NAME-zeebe-gateway 9600:9600 -n $CAMUNDA_NAMESPACE_SURVIVING @@ -605,7 +571,7 @@ kubectl --context $CLUSTER_SURVIVING get deployments $HELM_RELEASE_NAME-operate # camunda-tasklist 0/0 0 0 23m ``` -For the Zeebe Elasticsearch exporters, there's currently no API available to confirm this. Only the response code of `204` indicates a successful disabling. +For the Zeebe Elasticsearch exporters, there's currently no API available to confirm this. Only the response code of `204` indicates a successful disabling. This is a synchronous operation.
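+Because a successful pause is only signaled by the HTTP status code, you can capture that code explicitly instead of reading the `curl -i` output. A small sketch using the same exporting endpoint as above:
+
+```bash
+# Expect 204 (No Content) if the exporters were paused successfully
+STATUS=$(curl -s -o /dev/null -w '%{http_code}' -XPOST 'http://localhost:9600/actuator/exporting/pause')
+echo "Pause request returned HTTP ${STATUS}"
+```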
@@ -614,8 +580,8 @@ For the Zeebe Elasticsearch exporters, there's currently no API available to con #### Create and restore Elasticsearch backup } -desired={} +current={} +desired={} />
@@ -626,7 +592,7 @@ The Camunda components are currently not reachable by end-users and will not pro #### Desired state -You are creating a backup of the main Elasticsearch instance in the surviving region and restore it in the recreated region. This Elasticsearch backup contains all the data and may take some time to be finished. The failover Elasticsearch instance only contains a subset of the data from after the region loss and is not sufficient to restore this in the new region. +You are creating a backup of the main Elasticsearch instance in the surviving region and restore it in the recreated region. This Elasticsearch backup contains all the data and may take some time to be finished. #### How to get there @@ -648,7 +614,7 @@ export S3_BUCKET_NAME=$(terraform output -raw s3_bucket_name) ```bash ELASTIC_POD=$(kubectl --context $CLUSTER_SURVIVING get pod --selector=app\.kubernetes\.io/name=elasticsearch -o jsonpath='{.items[0].metadata.name}' -n $CAMUNDA_NAMESPACE_SURVIVING) -kubectl --context $CLUSTER_SURVIVING exec -n $CAMUNDA_NAMESPACE_SURVIVING -it $ELASTIC_POD -c elasticsearch -- curl -XPUT "http://localhost:9200/_snapshot/camunda_backup" -H "Content-Type: application/json" -d' +kubectl --context $CLUSTER_SURVIVING exec -n $CAMUNDA_NAMESPACE_SURVIVING -it $ELASTIC_POD -c elasticsearch -- curl -XPUT 'http://localhost:9200/_snapshot/camunda_backup' -H 'Content-Type: application/json' -d' { "type": "s3", "settings": { @@ -664,13 +630,13 @@ kubectl --context $CLUSTER_SURVIVING exec -n $CAMUNDA_NAMESPACE_SURVIVING -it $E ```bash # The backup will be called failback -kubectl --context $CLUSTER_SURVIVING exec -n $CAMUNDA_NAMESPACE_SURVIVING -it $ELASTIC_POD -c elasticsearch -- curl -XPUT "http://localhost:9200/_snapshot/camunda_backup/failback?wait_for_completion=true" +kubectl --context $CLUSTER_SURVIVING exec -n $CAMUNDA_NAMESPACE_SURVIVING -it $ELASTIC_POD -c elasticsearch -- curl -XPUT 'http://localhost:9200/_snapshot/camunda_backup/failback?wait_for_completion=true' ``` 4. Verify the backup has been completed successfully by checking all backups and ensuring the `state` is `SUCCESS`: ```bash -kubectl --context $CLUSTER_SURVIVING exec -n $CAMUNDA_NAMESPACE_SURVIVING -it $ELASTIC_POD -c elasticsearch -- curl -XGET "http://localhost:9200/_snapshot/camunda_backup/_all" +kubectl --context $CLUSTER_SURVIVING exec -n $CAMUNDA_NAMESPACE_SURVIVING -it $ELASTIC_POD -c elasticsearch -- curl -XGET 'http://localhost:9200/_snapshot/camunda_backup/_all' ```
@@ -759,7 +725,7 @@ kubectl --context $CLUSTER_SURVIVING exec -n $CAMUNDA_NAMESPACE_SURVIVING -it $E ```bash ELASTIC_POD=$(kubectl --context $CLUSTER_RECREATED get pod --selector=app\.kubernetes\.io/name=elasticsearch -o jsonpath='{.items[0].metadata.name}' -n $CAMUNDA_NAMESPACE_RECREATED) -kubectl --context $CLUSTER_RECREATED exec -n $CAMUNDA_NAMESPACE_RECREATED -it $ELASTIC_POD -c elasticsearch -- curl -XPUT "http://localhost:9200/_snapshot/camunda_backup" -H "Content-Type: application/json" -d' +kubectl --context $CLUSTER_RECREATED exec -n $CAMUNDA_NAMESPACE_RECREATED -it $ELASTIC_POD -c elasticsearch -- curl -XPUT 'http://localhost:9200/_snapshot/camunda_backup' -H 'Content-Type: application/json' -d' { "type": "s3", "settings": { @@ -774,7 +740,7 @@ kubectl --context $CLUSTER_RECREATED exec -n $CAMUNDA_NAMESPACE_RECREATED -it $E 6. Verify that the backup can be found in the shared S3 bucket: ```bash -kubectl --context $CLUSTER_RECREATED exec -n $CAMUNDA_NAMESPACE_RECREATED -it $ELASTIC_POD -c elasticsearch -- curl -XGET "http://localhost:9200/_snapshot/camunda_backup/_all" +kubectl --context $CLUSTER_RECREATED exec -n $CAMUNDA_NAMESPACE_RECREATED -it $ELASTIC_POD -c elasticsearch -- curl -XGET 'http://localhost:9200/_snapshot/camunda_backup/_all' ``` The example output above should be the same since it's the same backup. @@ -782,13 +748,13 @@ The example output above should be the same since it's the same backup. 7. Restore Elasticsearch backup in the new region namespace `CAMUNDA_NAMESPACE_RECREATED`. Depending on the amount of data, this operation will take a while to complete. ```bash -kubectl --context $CLUSTER_RECREATED exec -n $CAMUNDA_NAMESPACE_RECREATED -it $ELASTIC_POD -c elasticsearch -- curl -XPOST "http://localhost:9200/_snapshot/camunda_backup/failback/_restore?wait_for_completion=true" +kubectl --context $CLUSTER_RECREATED exec -n $CAMUNDA_NAMESPACE_RECREATED -it $ELASTIC_POD -c elasticsearch -- curl -XPOST 'http://localhost:9200/_snapshot/camunda_backup/failback/_restore?wait_for_completion=true' ``` 8. Verify that the restore has been completed successfully in the new region: ```bash -kubectl --context $CLUSTER_RECREATED exec -n $CAMUNDA_NAMESPACE_RECREATED -it $ELASTIC_POD -c elasticsearch -- curl -XGET "http://localhost:9200/_snapshot/camunda_backup/failback/_status" +kubectl --context $CLUSTER_RECREATED exec -n $CAMUNDA_NAMESPACE_RECREATED -it $ELASTIC_POD -c elasticsearch -- curl -XGET 'http://localhost:9200/_snapshot/camunda_backup/failback/_status' ```
@@ -842,11 +808,11 @@ The important part being the `state: "SUCCESS"` and that `done` and `total` are -#### Configure Zeebe exporters to use Elasticsearch in the recreated region +#### Start Operate and Tasklist again } -desired={} +current={} +desired={} />
@@ -859,54 +825,13 @@ The Camunda components remain unreachable by end-users as you proceed to restore #### Desired state -You are repointing all Zeebe brokers from the temporary Elasticsearch instance to the Elasticsearch in the recreated region. - -The Elasticsearch exporters will remain paused during this step. +You can enable Operate and Tasklist again both in the surviving and recreated region. This will allow users to interact with Camunda 8 again. #### How to get there -Your `camunda-values-failover.yml` and base `camunda-values.yml` require adjustments again to reconfigure all installations to the Elasticsearch instance in the new region: - -- `ZEEBE_BROKER_EXPORTERS_ELASTICSEARCHREGION0_ARGS_URL` -- `ZEEBE_BROKER_EXPORTERS_ELASTICSEARCHREGION1_ARGS_URL` +The base Helm values file `camunda-values.yml` in `aws/dual-region/kubernetes` contains the adjustments for Elasticsearch and the Zeebe initial brokers, meaning we just have to reapply / upgrade the release to enable and deploy Operate and Tasklist. -1. The bash script [generate_zeebe_helm_values.sh](https://github.com/camunda/c8-multi-region/blob/main/aws/dual-region/scripts/generate_zeebe_helm_values.sh) in the repository folder `aws/dual-region/scripts/` helps generate those values again. You only have to copy and replace them within the previously mentioned Helm values files. It will use the exported environment variables of the environment prerequisites for namespaces and regions. - -```bash -./generate_zeebe_helm_values.sh failback - -# It will ask you to provide the following values -# Enter Zeebe cluster size (total number of Zeebe brokers in both Kubernetes clusters): -## for a dual-region setup we recommend eight, resulting in four brokers per region. -``` - -
- Example output - - -```bash -Please use the following to change the existing environment variable ZEEBE_BROKER_CLUSTER_INITIALCONTACTPOINTS in the failover Camunda Helm chart values file 'region0/camunda-values-failover.yml' and in the base Camunda Helm chart values file 'camunda-values.yml'. It's part of the 'zeebe.env' path. - -- name: ZEEBE_BROKER_CLUSTER_INITIALCONTACTPOINTS - value: camunda-zeebe-0.camunda-zeebe.camunda-london.svc.cluster.local:26502,camunda-zeebe-0.camunda-zeebe.camunda-paris.svc.cluster.local:26502,camunda-zeebe-1.camunda-zeebe.camunda-london.svc.cluster.local:26502,camunda-zeebe-1.camunda-zeebe.camunda-paris.svc.cluster.local:26502,camunda-zeebe-2.camunda-zeebe.camunda-london.svc.cluster.local:26502,camunda-zeebe-2.camunda-zeebe.camunda-paris.svc.cluster.local:26502,camunda-zeebe-3.camunda-zeebe.camunda-london.svc.cluster.local:26502,camunda-zeebe-3.camunda-zeebe.camunda-paris.svc.cluster.local:26502 - -Please use the following to change the existing environment variable ZEEBE_BROKER_EXPORTERS_ELASTICSEARCHREGION0_ARGS_URL in the failover Camunda Helm chart values file 'region0/camunda-values-failover.yml' and in the base Camunda Helm chart values file 'camunda-values.yml'. It's part of the 'zeebe.env' path. - -- name: ZEEBE_BROKER_EXPORTERS_ELASTICSEARCHREGION0_ARGS_URL - value: http://camunda-elasticsearch-master-hl.camunda-london.svc.cluster.local:9200 - -Please use the following to change the existing environment variable ZEEBE_BROKER_EXPORTERS_ELASTICSEARCHREGION1_ARGS_URL in the failover Camunda Helm chart values file 'region0/camunda-values-failover.yml' and in the base Camunda Helm chart values file 'camunda-values.yml'. It's part of the 'zeebe.env' path. - -- name: ZEEBE_BROKER_EXPORTERS_ELASTICSEARCHREGION1_ARGS_URL - value: http://camunda-elasticsearch-master-hl.camunda-paris.svc.cluster.local:9200 -``` - - -
- -2. As the script suggests, replace the environment variables within `camunda-values-failover.yml`. -3. Repeat the adjustments for the base Helm values file `camunda-values.yml` in `aws/dual-region/kubernetes` with the same output for the mentioned environment variables. -4. Upgrade the normal Camunda environment in `CAMUNDA_NAMESPACE_SURVIVING` and `REGION_SURVIVING` to point to the new Elasticsearch: +1. Upgrade the normal Camunda environment in `CAMUNDA_NAMESPACE_SURVIVING` and `REGION_SURVIVING` to deploy Operate and Tasklist: ```bash helm upgrade $HELM_RELEASE_NAME camunda/camunda-platform \ @@ -914,60 +839,90 @@ helm upgrade $HELM_RELEASE_NAME camunda/camunda-platform \ --kube-context $CLUSTER_SURVIVING \ --namespace $CAMUNDA_NAMESPACE_SURVIVING \ -f camunda-values.yml \ - -f $REGION_SURVIVING/camunda-values.yml \ - --set operate.enabled=false \ - --set tasklist.enabled=false + -f $REGION_SURVIVING/camunda-values.yml ``` -5. Upgrade the failover Camunda environment in `CAMUNDA_NAMESPACE_FAILOVER` and `REGION_SURVIVING` to point to the new Elasticsearch: +2. Upgrade the new region environment in `CAMUNDA_NAMESPACE_RECREATED` and `REGION_RECREATED` to deploy Operate and Tasklist: ```bash helm upgrade $HELM_RELEASE_NAME camunda/camunda-platform \ --version $HELM_CHART_VERSION \ - --kube-context $CLUSTER_SURVIVING \ - --namespace $CAMUNDA_NAMESPACE_FAILOVER \ + --kube-context $CLUSTER_RECREATED \ + --namespace $CAMUNDA_NAMESPACE_RECREATED \ -f camunda-values.yml \ - -f $REGION_SURVIVING/camunda-values-failover.yml + -f $REGION_RECREATED/camunda-values.yml ``` -6. Upgrade the new region environment in `CAMUNDA_NAMESPACE_RECREATED` and `REGION_RECREATED` to point to the new Elasticsearch: +#### Verification + +For Operate and Tasklist, you can confirm that the deployments have successfully deployed by listing those and indicating `1/1` ready. The same command can be applied for the `CLUSTER_RECREATED` and `CAMUNDA_NAMESPACE_RECREATED`: ```bash -helm upgrade $HELM_RELEASE_NAME camunda/camunda-platform \ - --version $HELM_CHART_VERSION \ - --kube-context $CLUSTER_RECREATED \ - --namespace $CAMUNDA_NAMESPACE_RECREATED \ - -f camunda-values.yml \ - -f $REGION_RECREATED/camunda-values.yml \ - --set global.multiregion.installationType=failBack \ - --set operate.enabled=false \ - --set tasklist.enabled=false +kubectl --context $CLUSTER_SURVIVING get deployments -n $CAMUNDA_NAMESPACE_SURVIVING +# NAME READY UP-TO-DATE AVAILABLE AGE +# camunda-operate 1/1 1 1 3h24m +# camunda-tasklist 1/1 1 1 3h24m +# camunda-zeebe-gateway 1/1 1 1 3h24m ``` -7. Delete all the Zeebe broker pods in the recreated region, as those are blocking a successful rollout of the config change due to the failback mode. The resulting recreated Zeebe brokers pods are expected to be again half of them being functional and half of them running in the sleeping mode due to the failback mode. +
+
+ + +#### Initialize new Zeebe exporter to recreated region + +} +desired={} +/> + +
+ +#### Current state + +Camunda 8 is reachable to the end-user but not yet exporting any data. + +#### Desired state + +You are initializing a new exporter to the recreated region. This will ensure that both Elasticsearch instances are populated, resulting in data redundancy. + +Separating this step from resuming the exporters is essential as the initialization is an asynchronous procedure, and you must ensure it's finished before resuming the exporters. + +#### How to get there + +1. Initialize the new exporter for the recreated region by sending an API request via the Zeebe Gateway: ```bash -kubectl --context $CLUSTER_RECREATED --namespace $CAMUNDA_NAMESPACE_RECREATED delete pods --selector=app\.kubernetes\.io/component=zeebe-broker +kubectl --context $CLUSTER_SURVIVING port-forward services/$HELM_RELEASE_NAME-zeebe-gateway 9600:9600 -n $CAMUNDA_NAMESPACE_SURVIVING +curl -XPOST 'http://localhost:9600/actuator/exporters/elasticsearchregion1/enable' -H 'Content-Type: application/json' -d '{"initializeFrom" : "elasticsearchregion0"}' ``` #### Verification -The following command will show the deployed pods of the namespaces. You should see that the Zeebe brokers are restarting. Adjusting the command for the other cluster and namespaces should reveal the same. +Port-forwarding the Zeebe Gateway via `kubectl` for the REST API and listing all exporters will reveal their current status. ```bash -kubectl --context $CLUSTER_SURVIVING get pods -n $CAMUNDA_NAMESPACE_SURVIVING +kubectl --context $CLUSTER_SURVIVING port-forward services/$HELM_RELEASE_NAME-zeebe-gateway 9600:9600 -n $CAMUNDA_NAMESPACE_SURVIVING +curl -XGET 'http://localhost:9600/actuator/exporters' ``` -Furthermore, the following command will watch the [StatefulSets](https://kubernetes.io/docs/concepts/workloads/controllers/statefulset/) update of the Zeebe brokers and wait until it's done. Adjusting the command for the other cluster and namespaces should have the same effect. +
+ Example output + ```bash -kubectl --context $CLUSTER_SURVIVING rollout status --watch statefulset/$HELM_RELEASE_NAME-zeebe -n $CAMUNDA_NAMESPACE_SURVIVING +[{"exporterId":"elasticsearchregion0","status":"ENABLED"},{"exporterId":"elasticsearchregion1","status":"ENABLED"}] ``` -Alternatively, you can check that the Elasticsearch value was updated in the [StatefulSets](https://kubernetes.io/docs/concepts/workloads/controllers/statefulset/) configuration of the Zeebe brokers and are reflecting the previous output of the script `generate_zeebe_helm_values.sh` in **Step 1**. + +
+ +Via the already port-forwarded Zeebe Gateway, you can also check the status of the change by using the Cluster API. + +**Ensure it says "COMPLETED" before proceeding with the next step.** ```bash -kubectl --context $CLUSTER_SURVIVING get statefulsets $HELM_RELEASE_NAME-zeebe -oyaml -n $CAMUNDA_NAMESPACE_SURVIVING | grep -A1 'ZEEBE_BROKER_EXPORTERS_ELASTICSEARCHREGION[0-1]_ARGS_URL' +curl -XGET 'http://localhost:9600/actuator/cluster' | jq .lastChange ```
@@ -975,21 +930,22 @@ kubectl --context $CLUSTER_SURVIVING get statefulsets $HELM_RELEASE_NAME-zeebe - ```bash - - name: ZEEBE_BROKER_EXPORTERS_ELASTICSEARCHREGION0_ARGS_URL - value: http://camunda-elasticsearch-master-hl.camunda-london.svc.cluster.local:9200 --- - - name: ZEEBE_BROKER_EXPORTERS_ELASTICSEARCHREGION1_ARGS_URL - value: http://camunda-elasticsearch-master-hl.camunda-london-failover.svc.cluster.local:9200 +{ + "id": 6, + "status": "COMPLETED", + "startedAt": "2024-08-23T12:54:07.968549269Z", + "completedAt": "2024-08-23T12:54:09.282558853Z" +} ```
-
- + + -#### Reactivate Zeebe exporters, Operate, and Tasklist +#### Reactivate Zeebe exporter } @@ -1000,38 +956,17 @@ desired={} #### Current state -Camunda 8 is pointing at the Elasticsearch instances in both regions again and not the temporary instance. It still remains unreachable to the end-users and no processes are advanced. +Camunda 8 is reachable to the end-user but not yet exporting any data. + +Elasticsearch exporters are enabled for both regions, and it's ensured that the operation has finished. #### Desired state -You are reactivating the exporters and enabling Operate and Tasklist again within the two regions. This will allow users to interact with Camunda 8 again. +You are reactivating the existing exporters. This will allow Zeebe to export data to Elasticsearch again. #### How to get there -1. Upgrade the normal Camunda environment in `CAMUNDA_NAMESPACE_SURVIVING` and `REGION_SURVIVING` to deploy Operate and Tasklist: - -```bash -helm upgrade $HELM_RELEASE_NAME camunda/camunda-platform \ - --version $HELM_CHART_VERSION \ - --kube-context $CLUSTER_SURVIVING \ - --namespace $CAMUNDA_NAMESPACE_SURVIVING \ - -f camunda-values.yml \ - -f $REGION_SURVIVING/camunda-values.yml -``` - -2. Upgrade the new region environment in `CAMUNDA_NAMESPACE_RECREATED` and `REGION_RECREATED` to deploy Operate and Tasklist: - -```bash -helm upgrade $HELM_RELEASE_NAME camunda/camunda-platform \ - --version $HELM_CHART_VERSION \ - --kube-context $CLUSTER_RECREATED \ - --namespace $CAMUNDA_NAMESPACE_RECREATED \ - -f camunda-values.yml \ - -f $REGION_RECREATED/camunda-values.yml \ - --set global.multiregion.installationType=failBack -``` - -3. Reactivate the exporters by sending the API activation request via the Zeebe Gateway: +1. Reactivate the exporters by sending the [exporting API](./../../zeebe-deployment/operations/management-api.md#exporting-api) activation request via the Zeebe Gateway: ```bash kubectl --context $CLUSTER_SURVIVING port-forward services/$HELM_RELEASE_NAME-zeebe-gateway 9600:9600 -n $CAMUNDA_NAMESPACE_SURVIVING @@ -1042,23 +977,13 @@ curl -i localhost:9600/actuator/exporting/resume -XPOST #### Verification -For Operate and Tasklist, you can confirm that the deployments have successfully deployed by listing those and indicating `1/1` ready. The same command can be applied for the `CLUSTER_RECREATED` and `CAMUNDA_NAMESPACE_RECREATED`: - -```bash -kubectl --context $CLUSTER_SURVIVING get deployments -n $CAMUNDA_NAMESPACE_SURVIVING -# NAME READY UP-TO-DATE AVAILABLE AGE -# camunda-operate 1/1 1 1 3h24m -# camunda-tasklist 1/1 1 1 3h24m -# camunda-zeebe-gateway 1/1 1 1 3h24m -``` - -For the Zeebe Elasticsearch exporters, there's currently no API available to confirm this. Only the response code of `204` indicates a successful resumption. +For the reactivating the exporters, there's currently no API available to confirm this. Only the response code of `204` indicates a successful resumption. This is a synchronous operation.
- + -#### Remove temporary failover installation +#### Add new brokers to the Zeebe cluster } @@ -1069,94 +994,122 @@ desired={} #### Current state -Camunda 8 is healthy and running in two regions again. You have redeployed Operate and Tasklist and enabled the Elasticsearch exporters again. This will allow users to interact with Camunda 8 again. +Camunda 8 is running in two regions but not yet utilizing all Zeebe brokers. You have redeployed Operate and Tasklist and enabled the Elasticsearch exporters again. This will allow users to interact with Camunda 8 again. #### Desired state -You can remove the temporary failover solution since it is not required anymore and would hinder disablement of the failback mode within the new region. +You have a functioning Camunda 8 setup in two regions and utilizing both regions. This will fully recover the dual-region benefits. #### How to get there -1. Uninstall the failover installation via Helm: +1. Based on the base Helm values file `camunda-values.yml` in `aws/dual-region/kubernetes`, you have to extract the `clusterSize` and `replicationFactor` as you have to re-add the brokers to the Zeebe cluster. +2. Port-forwarding the Zeebe Gateway via `kubectl` for the REST API allows you to send a Cluster API call to add the new brokers to the Zeebe cluster with the previous information on size and replication. + E.g. in our case the `clusterSize` is 8 and `replicationFactor` is 4 meaning we have to list all broker IDs starting from 0 to 7 and set the correct `replicationFactor` in the query. ```bash -helm uninstall $HELM_RELEASE_NAME --kube-context $CLUSTER_SURVIVING --namespace $CAMUNDA_NAMESPACE_FAILOVER +kubectl --context $CLUSTER_SURVIVING port-forward services/$HELM_RELEASE_NAME-zeebe-gateway 9600:9600 -n $CAMUNDA_NAMESPACE_SURVIVING +curl -XPOST 'http://localhost:9600/actuator/cluster/brokers?replicationFactor=4' -H 'Content-Type: application/json' -d '["0", "1", "2", "3", "4", "5", "6", "7"]' ``` -2. Delete the leftover persistent volume claims of the Camunda 8 components: - -```bash -kubectl --context $CLUSTER_SURVIVING delete pvc --all -n $CAMUNDA_NAMESPACE_FAILOVER -``` +:::note +This step can take longer depending on the size of the cluster, size of the data and the current load. +::: #### Verification -The following will show the pods within the namespace. You deleted the Helm installation in the failover namespace, which should result in no pods or in deletion state. +Port-forwarding the Zeebe Gateway via `kubectl` for the REST API and checking the Cluster API endpoint will show the status of the last change. ```bash -kubectl --context $CLUSTER_SURVIVING get pods -n $CAMUNDA_NAMESPACE_FAILOVER +kubectl --context $CLUSTER_SURVIVING port-forward services/$HELM_RELEASE_NAME-zeebe-gateway 9600:9600 -n $CAMUNDA_NAMESPACE_SURVIVING +curl -XGET 'http://localhost:9600/actuator/cluster' | jq .lastChange ``` -Port-forwarding the Zeebe Gateway via `kubectl` and printing the topology should reveal that the failover brokers are missing: +
+ Example output + ```bash -kubectl --context $CLUSTER_SURVIVING port-forward services/$HELM_RELEASE_NAME-zeebe-gateway 26500:26500 -n $CAMUNDA_NAMESPACE_SURVIVING -zbctl status --insecure --address localhost:26500 +{ + "id": 6, + "status": "COMPLETED", + "startedAt": "2024-08-23T12:54:07.968549269Z", + "completedAt": "2024-08-23T12:54:09.282558853Z" +} ``` -
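+The following sketch shows how steps 1 and 2 above could be scripted instead of done by hand. It assumes `yq` (v4) and `jq` are installed, the `9600` port-forward is still active, and that the base values file uses the chart's standard `zeebe.clusterSize` and `zeebe.replicationFactor` keys; adjust the key paths if your values file differs:
+
+```bash
+# Read cluster size and replication factor from the base values file.
+CLUSTER_SIZE=$(yq '.zeebe.clusterSize' camunda-values.yml)
+REPLICATION_FACTOR=$(yq '.zeebe.replicationFactor' camunda-values.yml)
+
+# Build the JSON list of broker IDs 0..clusterSize-1, for example ["0", "1", ..., "7"].
+BROKER_IDS="[\"$(seq -s '", "' 0 $((CLUSTER_SIZE - 1)))\"]"
+
+# Request the scale operation, then poll the Cluster API until the change is completed.
+curl -XPOST "http://localhost:9600/actuator/cluster/brokers?replicationFactor=${REPLICATION_FACTOR}" \
+  -H 'Content-Type: application/json' -d "$BROKER_IDS"
+
+until curl -s 'http://localhost:9600/actuator/cluster' | jq -e '.lastChange.status == "COMPLETED"' > /dev/null; do
+  echo "Waiting for the cluster change to complete..."
+  sleep 10
+done
+```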
-
- - -#### Switch Zeebe brokers in the recreated region to normal mode - -} -desired={} -/> - -
- -#### Current state - -You have almost fully restored the dual-region setup. Two Camunda deployments exist in two different regions. - -The failback mode is still enabled in the restored region. - -#### Desired state - -You restore the new region to its normal functionality by removing the failback mode and forcefully removing the sleeping Zeebe pods. They would otherwise hinder the rollout since they will never be ready. - -With this done, Zeebe is fully functional again and you are prepared in case of another region loss. - -#### How to get there + + -1. Upgrade the new region environment in `CAMUNDA_NAMESPACE_RECREATED` and `REGION_RECREATED` by removing the failback mode: +Port-forwarding the Zeebe Gateway via kubectl and printing the topology should reveal that all brokers have joined the Zeebe cluster again. -```bash -helm upgrade $HELM_RELEASE_NAME camunda/camunda-platform \ - --version $HELM_CHART_VERSION \ - --kube-context $CLUSTER_RECREATED \ - --namespace $CAMUNDA_NAMESPACE_RECREATED \ - -f camunda-values.yml \ - -f $REGION_RECREATED/camunda-values.yml ``` - -2. Delete the sleeping pods in the new region, as those are blocking a successful rollout due to the failback mode: - -```bash -kubectl --context $CLUSTER_RECREATED --namespace $CAMUNDA_NAMESPACE_RECREATED delete pods --selector=app\.kubernetes\.io/component=zeebe-broker +kubectl --context $CLUSTER_SURVIVING port-forward services/$HELM_RELEASE_NAME-zeebe-gateway 26500:26500 -n $CAMUNDA_NAMESPACE_SURVIVING +zbctl status --insecure --address localhost:26500 ``` -#### Verification - -Port-forwarding the Zeebe Gateway via `kubectl` and printing the topology should reveal that all brokers have joined the Zeebe cluster again. +
+ Example Output + ```bash -kubectl --context $CLUSTER_SURVIVING port-forward services/$HELM_RELEASE_NAME-zeebe-gateway 26500:26500 -n $CAMUNDA_NAMESPACE_SURVIVING -zbctl status --insecure --address localhost:26500 +Cluster size: 8 +Partitions count: 8 +Replication factor: 4 +Gateway version: 8.6.0 +Brokers: +Broker 0 - camunda-zeebe-0.camunda-zeebe.camunda-london.svc:26501 + Version: 8.6.0 + Partition 1 : Leader, Healthy + Partition 6 : Follower, Healthy + Partition 7 : Follower, Healthy + Partition 8 : Leader, Healthy +Broker 1 - camunda-zeebe-0.camunda-zeebe.camunda-paris.svc:26501 + Version: 8.6.0 + Partition 1 : Follower, Healthy + Partition 2 : Follower, Healthy + Partition 7 : Follower, Healthy + Partition 8 : Follower, Healthy +Broker 2 - camunda-zeebe-1.camunda-zeebe.camunda-london.svc:26501 + Version: 8.6.0 + Partition 1 : Follower, Healthy + Partition 2 : Follower, Healthy + Partition 3 : Follower, Healthy + Partition 8 : Follower, Healthy +Broker 3 - camunda-zeebe-1.camunda-zeebe.camunda-paris.svc:26501 + Version: 8.6.0 + Partition 1 : Follower, Healthy + Partition 2 : Follower, Healthy + Partition 3 : Follower, Healthy + Partition 4 : Follower, Healthy +Broker 4 - camunda-zeebe-2.camunda-zeebe.camunda-london.svc:26501 + Version: 8.6.0 + Partition 2 : Leader, Healthy + Partition 3 : Leader, Healthy + Partition 4 : Leader, Healthy + Partition 5 : Follower, Healthy +Broker 5 - camunda-zeebe-2.camunda-zeebe.camunda-paris.svc:26501 + Version: 8.6.0 + Partition 3 : Follower, Healthy + Partition 4 : Follower, Healthy + Partition 5 : Follower, Healthy + Partition 6 : Follower, Healthy +Broker 6 - camunda-zeebe-3.camunda-zeebe.camunda-london.svc:26501 + Version: 8.6.0 + Partition 4 : Follower, Healthy + Partition 5 : Leader, Healthy + Partition 6 : Leader, Healthy + Partition 7 : Leader, Healthy +Broker 7 - camunda-zeebe-3.camunda-zeebe.camunda-paris.svc:26501 + Version: 8.6.0 + Partition 5 : Follower, Healthy + Partition 6 : Follower, Healthy + Partition 7 : Follower, Healthy + Partition 8 : Follower, Healthy ``` + +
+
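+If you prefer a scripted check over reading the `zbctl` topology output, the following one-liner is a possible alternative. It assumes the `9600` port-forward and `jq` are available and that the Cluster API response exposes a `brokers` array:
+
+```bash
+# Count the brokers currently known to the cluster topology; expect 8 after the scale operation.
+curl -s 'http://localhost:9600/actuator/cluster' | jq '.brokers | length'
+```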
diff --git a/docs/self-managed/operational-guides/multi-region/img/10.svg b/docs/self-managed/operational-guides/multi-region/img/10.svg index 45afbdccfeb..8aa2f51081a 100644 --- a/docs/self-managed/operational-guides/multi-region/img/10.svg +++ b/docs/self-managed/operational-guides/multi-region/img/10.svg @@ -1 +1 @@ - \ No newline at end of file + \ No newline at end of file diff --git a/docs/self-managed/operational-guides/multi-region/img/11.svg b/docs/self-managed/operational-guides/multi-region/img/11.svg index 460316eee9e..d59c1009483 100644 --- a/docs/self-managed/operational-guides/multi-region/img/11.svg +++ b/docs/self-managed/operational-guides/multi-region/img/11.svg @@ -1 +1 @@ - \ No newline at end of file + \ No newline at end of file diff --git a/docs/self-managed/operational-guides/multi-region/img/12.svg b/docs/self-managed/operational-guides/multi-region/img/12.svg index c2918534765..5a8e7976504 100644 --- a/docs/self-managed/operational-guides/multi-region/img/12.svg +++ b/docs/self-managed/operational-guides/multi-region/img/12.svg @@ -1 +1 @@ - \ No newline at end of file + \ No newline at end of file diff --git a/docs/self-managed/operational-guides/multi-region/img/13.svg b/docs/self-managed/operational-guides/multi-region/img/13.svg index e5ab3a79b3c..7f61fb6ac32 100644 --- a/docs/self-managed/operational-guides/multi-region/img/13.svg +++ b/docs/self-managed/operational-guides/multi-region/img/13.svg @@ -1 +1 @@ - \ No newline at end of file + \ No newline at end of file diff --git a/docs/self-managed/operational-guides/multi-region/img/14.svg b/docs/self-managed/operational-guides/multi-region/img/14.svg index 492473fe1a5..c38be161cbf 100644 --- a/docs/self-managed/operational-guides/multi-region/img/14.svg +++ b/docs/self-managed/operational-guides/multi-region/img/14.svg @@ -1 +1 @@ - \ No newline at end of file + \ No newline at end of file diff --git a/docs/self-managed/operational-guides/multi-region/img/15.svg b/docs/self-managed/operational-guides/multi-region/img/15.svg deleted file mode 100644 index 4fd23ce94e0..00000000000 --- a/docs/self-managed/operational-guides/multi-region/img/15.svg +++ /dev/null @@ -1 +0,0 @@ - \ No newline at end of file diff --git a/docs/self-managed/operational-guides/multi-region/img/3.svg b/docs/self-managed/operational-guides/multi-region/img/3.svg deleted file mode 100644 index 6703d8c9488..00000000000 --- a/docs/self-managed/operational-guides/multi-region/img/3.svg +++ /dev/null @@ -1 +0,0 @@ - \ No newline at end of file diff --git a/docs/self-managed/operational-guides/multi-region/img/4.svg b/docs/self-managed/operational-guides/multi-region/img/4.svg index 41f2701e8c3..cc0e2a23f66 100644 --- a/docs/self-managed/operational-guides/multi-region/img/4.svg +++ b/docs/self-managed/operational-guides/multi-region/img/4.svg @@ -1 +1 @@ - \ No newline at end of file + \ No newline at end of file diff --git a/docs/self-managed/operational-guides/multi-region/img/5.svg b/docs/self-managed/operational-guides/multi-region/img/5.svg index b38aa23ed66..e61ad4b423e 100644 --- a/docs/self-managed/operational-guides/multi-region/img/5.svg +++ b/docs/self-managed/operational-guides/multi-region/img/5.svg @@ -1 +1 @@ - \ No newline at end of file + \ No newline at end of file diff --git a/docs/self-managed/operational-guides/multi-region/img/6.svg b/docs/self-managed/operational-guides/multi-region/img/6.svg index edcde812348..47ffd254606 100644 --- a/docs/self-managed/operational-guides/multi-region/img/6.svg +++ 
b/docs/self-managed/operational-guides/multi-region/img/6.svg @@ -1 +1 @@ - \ No newline at end of file + \ No newline at end of file diff --git a/docs/self-managed/operational-guides/multi-region/img/7.svg b/docs/self-managed/operational-guides/multi-region/img/7.svg deleted file mode 100644 index 8ce6cae3502..00000000000 --- a/docs/self-managed/operational-guides/multi-region/img/7.svg +++ /dev/null @@ -1 +0,0 @@ - \ No newline at end of file diff --git a/docs/self-managed/operational-guides/multi-region/img/8.svg b/docs/self-managed/operational-guides/multi-region/img/8.svg new file mode 100644 index 00000000000..d792456987e --- /dev/null +++ b/docs/self-managed/operational-guides/multi-region/img/8.svg @@ -0,0 +1 @@ + \ No newline at end of file diff --git a/docs/self-managed/operational-guides/multi-region/img/9.svg b/docs/self-managed/operational-guides/multi-region/img/9.svg index 79b0c50e6ee..469eec88f09 100644 --- a/docs/self-managed/operational-guides/multi-region/img/9.svg +++ b/docs/self-managed/operational-guides/multi-region/img/9.svg @@ -1 +1 @@ - \ No newline at end of file + \ No newline at end of file diff --git a/docs/self-managed/setup/deploy/amazon/amazon-eks/dual-region.md b/docs/self-managed/setup/deploy/amazon/amazon-eks/dual-region.md index fcb10d39e99..9eb4c332bd2 100644 --- a/docs/self-managed/setup/deploy/amazon/amazon-eks/dual-region.md +++ b/docs/self-managed/setup/deploy/amazon/amazon-eks/dual-region.md @@ -4,7 +4,7 @@ title: "Dual-region setup (EKS)" description: "Deploy two Amazon Kubernetes (EKS) clusters with Terraform for a peered setup allowing dual-region communication." --- - + import CoreDNSKubeDNS from "./assets/core-dns-kube-dns.svg" @@ -22,8 +22,8 @@ This guide requires you to have previously completed or reviewed the steps taken - An [AWS account](https://docs.aws.amazon.com/accounts/latest/reference/accounts-welcome.html) to create resources within AWS. - [Helm (3.x)](https://helm.sh/docs/intro/install/) for installing and upgrading the [Camunda Helm chart](https://github.com/camunda/camunda-platform-helm). -- [Kubectl (1.28.x)](https://kubernetes.io/docs/tasks/tools/#kubectl) to interact with the cluster. -- [Terraform (1.7.x)](https://developer.hashicorp.com/terraform/downloads) +- [Kubectl (1.30.x)](https://kubernetes.io/docs/tasks/tools/#kubectl) to interact with the cluster. +- [Terraform (1.9.x)](https://developer.hashicorp.com/terraform/downloads) ## Considerations @@ -69,8 +69,6 @@ You have to choose unique namespaces for Camunda 8 installations. The namespace For example, you can install Camunda 8 into `CAMUNDA_NAMESPACE_0` in `CLUSTER_0`, and `CAMUNDA_NAMESPACE_1` on the `CLUSTER_1`, where `CAMUNDA_NAMESPACE_0` != `CAMUNDA_NAMESPACE_1`. Using the same namespace names on both clusters won't work as CoreDNS won't be able to distinguish between traffic targeted at the local and remote cluster. -In addition to namespaces for Camunda installations, create the namespaces for failover (`CAMUNDA_NAMESPACE_0_FAILOVER` in `CLUSTER_0` and `CAMUNDA_NAMESPACE_1_FAILOVER` in `CLUSTER_1`), for the case of a total region loss. This is for completeness, so you don't forget to add the mapping on region recovery. The operational procedure is handled in a different [document on dual-region](./../../../../operational-guides/multi-region/dual-region-ops.md). - ::: 4. 
Execute the script via the following command: @@ -259,13 +257,6 @@ kubectl --context cluster-london -n kube-system edit configmap coredns force_tcp } } - camunda-paris-failover.svc.cluster.local:53 { - errors - cache 30 - forward . 10.202.19.54 10.202.53.21 10.202.84.222 { - force_tcp - } - } ### Cluster 0 - End ### Please copy the following between @@ -282,13 +273,6 @@ kubectl --context cluster-paris -n kube-system edit configmap coredns force_tcp } } - camunda-london-failover.svc.cluster.local:53 { - errors - cache 30 - forward . 10.192.27.56 10.192.84.117 10.192.36.238 { - force_tcp - } - } ### Cluster 1 - End ### ``` @@ -340,13 +324,6 @@ data: force_tcp } } - camunda-paris-failover.svc.cluster.local:53 { - errors - cache 30 - forward . 10.202.19.54 10.202.53.21 10.202.84.222 { - force_tcp - } - } ``` @@ -375,7 +352,7 @@ The script [test_dns_chaining.sh](https://github.com/camunda/c8-multi-region/blo ### Create the secret for Elasticsearch -Elasticsearch will need an S3 bucket for data backup and restore procedure, required during a regional failover. For this, you will need to configure a Kubernetes secret to not expose those in cleartext. +Elasticsearch will need an S3 bucket for data backup and restore procedure, required during a regional failback. For this, you will need to configure a Kubernetes secret to not expose those in cleartext. You can pull the data from Terraform since you exposed those via `output.tf`. diff --git a/versioned_docs/version-8.5/self-managed/concepts/multi-region/dual-region.md b/versioned_docs/version-8.5/self-managed/concepts/multi-region/dual-region.md index 60b732ff088..9189a983fba 100644 --- a/versioned_docs/version-8.5/self-managed/concepts/multi-region/dual-region.md +++ b/versioned_docs/version-8.5/self-managed/concepts/multi-region/dual-region.md @@ -129,8 +129,8 @@ In the event of a total active region loss, the following data will be lost: - Role Based Access Control (RBAC) does not work. - Optimize is not supported. - This is due to Optimize depending on Identity to work. -- Connectors are not supported. - - This is due to Connectors depending on Operate to work for inbound Connectors and potentially resulting in race condition. +- Connectors can be deployed alongside but ensure to understand idempotency based on [the described documentation](../../../components/connectors/use-connectors/inbound.md#creating-the-connector-event). + - in a dual-region setup, you'll have two connector deployments and using message idempotency is of importance to not duplicate events. - During the failback procedure, there’s a small chance that some data will be lost in Elasticsearch affecting Operate and Tasklist. - This **does not** affect the processing of process instances in any way. The impact is that some information about the affected instances might not be visible in Operate and Tasklist. - This is further explained in the [operational procedure](./../../operational-guides/multi-region/dual-region-ops.md?failback=step2#failback) during the relevant step. 
diff --git a/versioned_docs/version-8.5/self-managed/operational-guides/multi-region/dual-region-ops.md b/versioned_docs/version-8.5/self-managed/operational-guides/multi-region/dual-region-ops.md
index 6cca48ebb72..f0df2244b3a 100644
--- a/versioned_docs/version-8.5/self-managed/operational-guides/multi-region/dual-region-ops.md
+++ b/versioned_docs/version-8.5/self-managed/operational-guides/multi-region/dual-region-ops.md
@@ -149,6 +149,8 @@ One of the regions is lost, meaning Zeebe:

For the failover procedure, ensure the lost region does not accidentally reconnect. You should be sure it is lost, and if so, look into measures to prevent it from reconnecting. For example, by utilizing the suggested solution below to isolate your active environment.

+Isolating the environments is crucial because, during the operational procedure, duplicate Zeebe broker IDs exist. They would collide if the regions are not correctly isolated and the lost region accidentally comes back online.
+
#### How to get there

Depending on your architecture, possible approaches are:
@@ -585,7 +587,7 @@ kubectl --context $CLUSTER_SURVIVING scale -n $CAMUNDA_NAMESPACE_SURVIVING deplo
kubectl --context $CLUSTER_SURVIVING scale -n $CAMUNDA_NAMESPACE_SURVIVING deployments/$HELM_RELEASE_NAME-tasklist --replicas 0
```

-2. Disable the Zeebe Elasticsearch exporters in Zeebe via kubectl:
+2. Disable the Zeebe Elasticsearch exporters via kubectl using the [exporting API](./../../zeebe-deployment/operations/management-api.md#exporting-api):

```bash
kubectl --context $CLUSTER_SURVIVING port-forward services/$HELM_RELEASE_NAME-zeebe-gateway 9600:9600 -n $CAMUNDA_NAMESPACE_SURVIVING
diff --git a/versioned_docs/version-8.5/self-managed/setup/deploy/amazon/amazon-eks/dual-region.md b/versioned_docs/version-8.5/self-managed/setup/deploy/amazon/amazon-eks/dual-region.md
index fcb10d39e99..c1def9c7131 100644
--- a/versioned_docs/version-8.5/self-managed/setup/deploy/amazon/amazon-eks/dual-region.md
+++ b/versioned_docs/version-8.5/self-managed/setup/deploy/amazon/amazon-eks/dual-region.md
@@ -226,10 +226,10 @@ kubectl --context $CLUSTER_0 apply -f https://raw.githubusercontent.com/camunda/
kubectl --context $CLUSTER_1 apply -f https://raw.githubusercontent.com/camunda/c8-multi-region/main/aws/dual-region/kubernetes/internal-dns-lb.yml
```

-2. Execute the script [generate_core_dns_entry.sh](https://github.com/camunda/c8-multi-region/blob/main/aws/dual-region/scripts/generate_core_dns_entry.sh) in the folder `aws/dual-region/scripts/` of the repository to help you generate the CoreDNS config. Make sure that you have previously exported the [environment prerequisites](#environment-prerequisites) since the script builds on top of it.
+2. Execute the script [generate_core_dns_entry.sh](https://github.com/camunda/c8-multi-region/blob/main/aws/dual-region/scripts/generate_core_dns_entry.sh) with the parameter `legacy` in the folder `aws/dual-region/scripts/` of the repository to help you generate the CoreDNS config. Make sure that you have previously exported the [environment prerequisites](#environment-prerequisites) since the script builds on top of it.

```shell
-./generate_core_dns_entry.sh
+./generate_core_dns_entry.sh legacy
```

3. The script will retrieve the IPs of the load balancer via the AWS CLI and return the required config change.

@@ -244,7 +244,7 @@ For illustration purposes only. These values will not work in your environment.
::: ```shell -./generate_core_dns_entry.sh +./generate_core_dns_entry.sh legacy Please copy the following between ### Cluster 0 - Start ### and ### Cluster 0 - End ### and insert it at the end of your CoreDNS configmap in Cluster 0 @@ -375,7 +375,7 @@ The script [test_dns_chaining.sh](https://github.com/camunda/c8-multi-region/blo ### Create the secret for Elasticsearch -Elasticsearch will need an S3 bucket for data backup and restore procedure, required during a regional failover. For this, you will need to configure a Kubernetes secret to not expose those in cleartext. +Elasticsearch will need an S3 bucket for data backup and restore procedure, required during a regional failback. For this, you will need to configure a Kubernetes secret to not expose those in cleartext. You can pull the data from Terraform since you exposed those via `output.tf`.
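+For illustration, a hedged sketch of what creating such a secret could look like follows. The Terraform output names, the secret name, and the key names below are placeholders and must be replaced with the ones actually defined in your `output.tf` and referenced by your Elasticsearch configuration:
+
+```bash
+# Hypothetical Terraform output names; run inside the Terraform directory and adjust to your output.tf.
+S3_ACCESS_KEY=$(terraform output -raw s3_access_key)
+S3_SECRET_KEY=$(terraform output -raw s3_secret_key)
+
+# Create the secret in both regions so either side can run the backup and restore.
+kubectl --context "$CLUSTER_0" --namespace "$CAMUNDA_NAMESPACE_0" \
+  create secret generic elasticsearch-backup-credentials \
+  --from-literal=S3_ACCESS_KEY="$S3_ACCESS_KEY" \
+  --from-literal=S3_SECRET_KEY="$S3_SECRET_KEY"
+
+kubectl --context "$CLUSTER_1" --namespace "$CAMUNDA_NAMESPACE_1" \
+  create secret generic elasticsearch-backup-credentials \
+  --from-literal=S3_ACCESS_KEY="$S3_ACCESS_KEY" \
+  --from-literal=S3_SECRET_KEY="$S3_SECRET_KEY"
+```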