TODO : Rollback procedure using restore
Check the list in the corresponding .md file.
https://docs.mirantis.com/docker-enterprise/v3.0/dockeree-products/ucp/admin/disaster-recovery.html
Action | Duration |
---|---|
0. Open a proactive support case | 5 minutes |
1. Take a support dump | 5 minutes |
2. Check available backups | 15 minutes |
3. Check current health of the cluster | 15 minutes |
4. Communicate start of DR | 5 minutes |
5. Stop VMs in Datacenter dc1 | 15 minutes |
6. Restore UCP quorum | 40 minutes |
7. Restore DTR quorum | 10 minutes |
8. Communicate end of failover to dc2 | 5 minutes |
9. Application testing | 120 minutes |
10. Add workers | 10 minutes |
11. Add DTR replicas | 30 minutes |
12. Add managers | 30 minutes |
13. Check current health of the cluster | 15 minutes |
Before starting the procedure, connect to UCP and take a full support dump.
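A support dump can be taken from the UCP UI, or from a manager node with the support command of the UCP image. The following is a minimal sketch, assuming the support command is available in your UCP version and reusing the image-detection pattern used below; adjust the output path to your needs:
UCP_DOCKERHUB=$(docker ps --filter "name=ucp-proxy" --format "{{.Image}}" | cut -d"/" -f1)
UCP_VERSION=$(docker ps --filter "name=ucp-proxy" --format "{{.Image}}" | cut -d":" -f2)
UCP_IMAGE=${UCP_DOCKERHUB}/ucp:${UCP_VERSION}
docker container run --rm \
--name ucp \
--log-driver none \
--volume /var/run/docker.sock:/var/run/docker.sock \
${UCP_IMAGE} support > /tmp/docker-support-$(date +%Y%m%d-%H_%M_%S).tgz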
https://docs.mirantis.com/docker-enterprise/v3.0/dockeree-products/dee-intro/manage/backup.html
If you need to take a backup, connect to manager3 (the manager on dc2) :
UCP_DOCKERHUB=$(docker ps --filter "name=ucp-proxy" --format "{{.Image}}" | cut -d"/" -f1)
UCP_VERSION=$(docker ps --filter "name=ucp-proxy" --format "{{.Image}}" | cut -d":" -f2)
UCP_IMAGE=${UCP_DOCKERHUB}/ucp:${UCP_VERSION}
docker container run \
--rm \
--security-opt label=disable \
--log-driver none \
--name ucp \
--volume /var/run/docker.sock:/var/run/docker.sock \
--volume /tmp:/backup \
${UCP_IMAGE} backup \
--debug \
--file ucp-backup-$(date +%Y%m%d-%H_%M_%S).tar \
--no-passphrase
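The host directory /tmp is mounted at /backup in the container, so the archive should appear under /tmp on manager3. A quick sanity check (a sketch):
ls -lh /tmp/ucp-backup-*.tar
tar -tf /tmp/ucp-backup-*.tar > /dev/null && echo "backup archive is readable"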
If you need to take a backup, connect to dtr3 (the DTR replica on dc2) :
DTR_DOCKERHUB=$(docker ps --filter "name=dtr-registry" --format "{{.Image}}" | cut -d"/" -f1)
DTR_VERSION=$(docker ps --filter "name=dtr-registry" --format "{{.Image}}" | cut -d":" -f2)
DTR_IMAGE=${DTR_DOCKERHUB}/dtr:${DTR_VERSION}
UCP_URL=
read -p 'ucp-admin: ' UCP_ADMIN; \
read -sp 'ucp-password: ' UCP_PASSWORD; \
REPLICA_ID=$(docker ps --format '{{.Names}}' -f name=dtr-rethink | cut -f 3 -d '-'); \
docker run --log-driver none -i --rm \
--env UCP_PASSWORD=$UCP_PASSWORD \
$DTR_IMAGE backup \
--ucp-username $UCP_ADMIN \
--ucp-url $UCP_URL \
--ucp-insecure-tls \
--existing-replica-id $REPLICA_ID > dtr-metadata-backup-$(date +%Y%m%d-%H_%M_%S).tar
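The DTR metadata backup is streamed to stdout and redirected to a tar file in the current directory. A quick integrity check (a sketch):
ls -lh dtr-metadata-backup-*.tar
tar -tf dtr-metadata-backup-*.tar > /dev/null && echo "backup archive is readable"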
Before starting the operations, check that the cluster is healthy.
- Connect to the UCP UI and check that there is no banner with error or warning.
- Check the UCP API
UCP_URL=
curl -ks https://${UCP_URL}/_ping
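The endpoint answers with HTTP 200 when the controller is healthy. If you prefer to check the status code explicitly, a variant of the same check (a sketch) is:
curl -ks -o /dev/null -w '%{http_code}\n' https://${UCP_URL}/_ping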
- Using a client bundle, check that every node is OK
cd <BUNDLE_DIR>
source env.sh
docker node ls
kubectl get nodes
Every node should be in the Ready state
- Connect to the DTR UI and check that there is no banner with error or warning.
- Check the DTR API
DTR_URL=
read -p 'ucp-admin: ' UCP_ADMIN; \
read -sp 'ucp-password: ' UCP_PASSWORD; \
curl -ksL -u $UCP_ADMIN:$UCP_PASSWORD https://${DTR_URL}/api/v0/meta/cluster_status | jq .replica_health
A test application has been deployed in the cluster.
You can check the availability of the application
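For example, assuming the test application is reachable over HTTPS at a URL of your own (the APP_URL variable below is a placeholder, not defined elsewhere in this procedure), a simple availability check is:
APP_URL=<test application URL>
curl -ksL -o /dev/null -w '%{http_code}\n' https://${APP_URL}
A 2xx response code means the application answers.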
NOTE : For the DR exercise, you will just stop the docker engine.
Stop the docker engine on the nodes in the table List of Nodes that are in Datacenter dc1.
First stop the managers, then the DTR nodes, then the workers, in order to lose quorum before Kubernetes is able to reschedule Pods on dc2.
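A minimal sketch, assuming the engines are managed by systemd; run it on each dc1 node, in the order described above (managers first, then DTR nodes, then workers):
systemctl stop docker
systemctl is-active docker   # should report inactive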
After the shutdown :
- UCP UI is down
- the client bundle and kubectl no longer work
kubectl get pods
Error from server (InternalError): an error on the server ("unknown") has prevented the request from succeeding (get pods)
- manager3 (remaining manager on dc2) has lost quorum
Execute these commands on manager3 (the remaining manager on dc2) :
docker node ls
Error response from daemon: rpc error: code = Unknown desc = The swarm does not have a leader. It's possible that too few managers are online. Make sure more than half of the managers are online.
- DTR is down
docker login <dtr>
Username (admin):
Password:
Error response from daemon: Get https://dtr/v2/: received unexpected HTTP status: 500 Internal Server Error
- Replicas of the apps running on dc2 are still UP
Up to this point, rollback is as simple as starting all the VMs in dc1 again.
After executing the following steps, the rollback procedure, if anything goes wrong, will be to restore the cluster from a backup.
https://docs.mirantis.com/docker-enterprise/v3.0/dockeree-products/ucp/admin/disaster-recovery.html
Force a new Swarm cluster, and force a new Swarm snapshot to happen.
Execute these commands on manager3 (the remaining manager on dc2) :
docker swarm init --force-new-cluster --default-addr-pool 10.10.0.0/16 --default-addr-pool 10.20.0.0/16 --default-addr-pool-mask-length 21
docker swarm update --snapshot-interval 1
# Creating and removing a throwaway overlay network generates Raft writes,
# which forces a new snapshot now that the interval is set to 1
docker network create temporarynet -d overlay
docker network rm temporarynet
# Restore the default snapshot interval
docker swarm update --snapshot-interval 10000
Wait for the dc2 workers to become Ready; the stopped dc1 VMs should appear as Down
watch docker node ls
Now Swarm quorum is restored and Swarm commands are working on manager3 :
docker node ls
docker service ls
Execute these commands on manager3 (the remaining manager on dc2) :
UCP_VERSION=$(docker ps --filter "name=ucp-proxy" --format "{{.Image}}" | cut -d":" -f2)
UCP_DOCKERHUB=$(docker ps --filter "name=ucp-proxy" --format "{{.Image}}" | cut -d"/" -f1)
UCP_ETCD_IMAGE=${UCP_DOCKERHUB}/ucp-etcd:${UCP_VERSION}
NODE_NAME=$(docker info --format '{{.Name}}')
docker node update --label-add com.docker.ucp.agent-pause=true $NODE_NAME
docker stop ucp-kv
Move the following file to another location if it exists :
mv /var/lib/docker/volumes/ucp-kv/_data/datav3/extra.args /tmp
Create a temporary etcd container :
docker run --rm -it --entrypoint sh -v ucp-kv:/data -v ucp-kv-certs:/etc/docker/ssl $UCP_ETCD_IMAGE
NOTE : If you get the error docker: Error response from daemon: OCI runtime create failed, run the following commands, then re-execute the command that creates the temporary etcd container.
systemctl restart docker
# wait a few seconds
docker stop ucp-kv
# check that ucp-kv is no longer running
docker ps | grep ucp-kv
When the temporary etcd container is launched, run this command inside the container
/bin/etcd --data-dir /data/datav3 --force-new-cluster
Wait for the message ready to serve client requests, then stop etcd with <CTRL+C> and leave the container shell with exit (or <CTRL+D>)
docker start ucp-kv
docker node update --label-rm com.docker.ucp.agent-pause $NODE_NAME
Wait a few seconds, then check etcd Health
docker exec -it --env ETCDCTL_API=2 ucp-kv etcdctl --endpoint https://127.0.0.1:2379 --ca-file /etc/docker/ssl/ca.pem --cert-file /etc/docker/ssl/cert.pem --key-file /etc/docker/ssl/key.pem cluster-health
docker exec -it --env ETCDCTL_API=2 ucp-kv etcdctl --endpoint https://127.0.0.1:2379 --ca-file /etc/docker/ssl/ca.pem --cert-file /etc/docker/ssl/cert.pem --key-file /etc/docker/ssl/key.pem member list
The etcd cluster now has only 1 member
- Restore the auth store (RethinkDB) quorum
Execute these commands on manager3 (the remaining manager on dc2) :
UCP_VERSION=$(docker ps --filter "name=ucp-proxy" --format "{{.Image}}" | cut -d":" -f2)
UCP_DOCKERHUB=$(docker ps --filter "name=ucp-proxy" --format "{{.Image}}" | cut -d"/" -f1)
UCP_AUTH_IMAGE=${UCP_DOCKERHUB}/ucp-auth:${UCP_VERSION}
NUM_MANAGERS=1
NODE_ADDRESS=$(docker info --format '{{.Swarm.NodeAddr}}')
docker container run \
--rm -v ucp-auth-store-certs:/tls ${UCP_AUTH_IMAGE} \
--db-addr=${NODE_ADDRESS}:12383 \
--debug reconfigure-db \
--num-replicas ${NUM_MANAGERS} \
--emergency-repair
NOTE : If you get the following error, restart the ucp-auth-store container, then run the emergency-repair command again :
time="2020-11-03T15:01:00Z" level=debug msg="Connecting to db ..."
time="2020-11-03T15:01:00Z" level=debug msg="connecting to DB Addrs: [172.0.0.1:12383]"
time="2020-11-03T15:01:00Z" level=fatal msg="unable to connect to database cluster: rethinkdb: read tcp 172.0.0.1:48350->172.0.0.1:12383: read: connection reset by peer"
- Check rethinkdb health
UCP_VERSION=$(docker ps --filter "name=ucp-proxy" --format "{{.Image}}" | cut -d":" -f2)
UCP_DOCKERHUB=$(docker ps --filter "name=ucp-proxy" --format "{{.Image}}" | cut -d"/" -f1)
UCP_AUTH_IMAGE=${UCP_DOCKERHUB}/ucp-auth:${UCP_VERSION}
NODE_ADDRESS=$(docker info --format '{{.Swarm.NodeAddr}}')
docker container run \
--rm -v ucp-auth-store-certs:/tls ${UCP_AUTH_IMAGE} \
--db-addr=${NODE_ADDRESS}:12383 \
--debug db-status
- Restart docker engine
systemctl restart docker
- Connect to the UCP UI and check that there is no banner with error or warning.
- Check the UCP API
UCP_URL=
curl -ks https://${UCP_URL}/_ping
- Using a client bundle, check that UCP and Kubernetes respond
cd <BUNDLE_DIR>
source env.sh
docker node ls
kubectl get nodes
kubectl get pods -A
Note :
- The nginx-controller replicas of dc1 will stay Pending because of the PodAntiAffinity rule on the datacenter label.
- App replicas of dc1 will be rescheduled on the workers of dc2 after 5 minutes, but they can be stuck in the ContainerCreating state if the image is not present locally and DTR is not available.
Now that the client bundle works again, use it to remove the old managers in dc1 (not the DTR nodes or the workers)
cd <BUNDLE_DIR>
source env.sh
docker node rm <manager1> <manager2>
Execute these commands on dtr3 (DTR replica in dc2) :
DTR_DOCKERHUB=$(docker ps --filter "name=dtr-registry" --format "{{.Image}}" | cut -d"/" -f1)
DTR_VERSION=$(docker ps --filter "name=dtr-registry" --format "{{.Image}}" | cut -d":" -f2)
DTR_IMAGE=${DTR_DOCKERHUB}/dtr:${DTR_VERSION}
UCP_URL=
REPLICA_ID=$(docker inspect -f '{{.Name}}' $(docker ps -q -f name=dtr-rethink) | cut -f 3 -d '-')
read -p 'ucp-admin: ' UCP_ADMIN; \
docker run -it --rm ${DTR_IMAGE} emergency-repair \
--ucp-url ${UCP_URL} \
--ucp-username ${UCP_ADMIN} \
--ucp-insecure-tls \
--existing-replica-id $REPLICA_ID
- Check that there is now 1 replica available
DTR_URL=
read -p 'ucp-admin: ' UCP_ADMIN ; \
read -sp 'ucp-password: ' UCP_PASSWORD ; \
curl -ksL -u $UCP_ADMIN:$UCP_PASSWORD https://${DTR_URL}/api/v0/meta/cluster_status | jq .replica_health
Test that you are able to pull an image
docker image pull <IMAGE ON DTR>
- To add back the worker nodes to the cluster, just start the VMs and they will automatically join the cluster.
- Check the status of the nodes using a client bundle:
cd <BUNDLE DIR>
source env.sh
kubectl get nodes
docker node ls
NOTE : Sometimes the nodes will take some time to appear as Ready.
You may need to restart the ucp-swarm-manager container on the manager to update the swarm classic inventory. Execute this command on manager3 (the remaining manager in dc2) :
docker restart ucp-swarm-manager
- Follow the same procedure as for the workers to add the nodes back to the Swarm cluster.
- Check that the UCP scheduler setting "Allow administrators to deploy containers on UCP managers or nodes running DTR" is enabled (UCP UI > Admin Settings > Scheduler).
- Before adding the DTR replicas back to the DTR cluster, clear the old DTR state. On each DTR node in dc1 :
docker container rm -f $(docker ps -aq --filter "name=dtr-")
docker volume rm $(docker volume ls -q --filter "name=dtr-")
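To confirm a node is clean before rejoining, both of the following should return no dtr- entries (a quick sanity check):
docker ps -a --filter "name=dtr-"
docker volume ls --filter "name=dtr-"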
- Join new replicas to cluster
Execute these commands on dtr3 (the DTR replica on dc2) :
DTR_DOCKERHUB=$(docker ps --filter "name=dtr-registry" --format "{{.Image}}" | cut -d"/" -f1)
DTR_VERSION=$(docker ps --filter "name=dtr-registry" --format "{{.Image}}" | cut -d":" -f2)
DTR_IMAGE=${DTR_DOCKERHUB}/dtr:${DTR_VERSION}
UCP_URL=
read -p 'ucp-admin: ' UCP_ADMIN
# Replace <dtrX> with the name of the node you want to add
docker run -it --rm ${DTR_IMAGE} join --ucp-node <dtr1> --ucp-insecure-tls --ucp-url $UCP_URL --ucp-username $UCP_ADMIN
# Wait for the replica to join the cluster, then join the last replica
docker run -it --rm ${DTR_IMAGE} join --ucp-node <dtr2> --ucp-insecure-tls --ucp-url $UCP_URL --ucp-username $UCP_ADMIN
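Once both replicas have joined, you can re-run the replica health check used earlier to confirm that all three replicas report healthy (same variables as in the health-check section):
DTR_URL=
read -p 'ucp-admin: ' UCP_ADMIN ; \
read -sp 'ucp-password: ' UCP_PASSWORD ; \
curl -ksL -u $UCP_ADMIN:$UCP_PASSWORD https://${DTR_URL}/api/v0/meta/cluster_status | jq .replica_health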
TODO : check labels
For the exercise, you will reuse the old nodes manager1 and manager2.
First start manager1 and wipe the data on it before starting manager2.
- Wipe the data to prevent the node from automatically reconnecting to the old etcd/RethinkDB clusters
Execute the following commands on manager1, and then on manager2
systemctl stop docker
# Wait a few seconds, then wipe the data
rm -rf /var/lib/docker/*
systemctl start docker
NOTE : you can safely ignore the following errors :
rm: cannot remove '/var/lib/docker/containers': Device or resource busy
rm: cannot remove '/var/lib/docker/volumes': Device or resource busy
- Load UCP images
UCP_VERSION=3.2.4
curl https://<images>/ucp_images_${UCP_VERSION}.tar.gz | docker load
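To confirm the images were loaded at the expected version (a quick sanity check):
docker image ls | grep ucp | grep "${UCP_VERSION}"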
- Join the first node as a manager
Execute the following command on manager3 (the remaining manager in dc2) to generate a join token
docker swarm join-token manager
Paste the generated command on the node you want to join :
docker swarm join --token <TOKEN> <HOST:PORT>
Wait until the manager has fully joined the cluster and is healthy, then repeat the operation for the second manager.
NOTE : the command can time out; this is normal. It can take up to 5 minutes for the node to join the cluster.
Follow the same procedure as before in step 3. Check current health of the cluster
# manager3
docker container run --rm -it --security-opt label=disable -v /var/run/docker.sock:/var/run/docker.sock --name ucp $UCP_IMAGE uninstall-ucp --interactive
docker container run --rm -i --security-opt label=disable -v /var/run/docker.sock:/var/run/docker.sock --name ucp $UCP_IMAGE restore < /tmp/mybackup.tar
docker node rm <manager1> <manager2>
# dtr3
docker run -it --rm mirantis/dtr:2.7.8 destroy --ucp-insecure-tls
DTR_IMAGE=mirantis/dtr:2.7.8
UCP_URL=
# TODO : docker run -it --rm ${DTR_IMAGE} restore --ucp-node $(hostname) --ucp-insecure-tls --ucp-url $UCP_URL --ucp-username $USER < dtr-metadata-backup-20201030-15_59_32.tar