Commit 5d14220
Merge pull request #370 from wireapp/release_2020-10-28
fisx authored Oct 28, 2020
2 parents a4a35b6 + d8b3900 commit 5d14220
Showing 28 changed files with 884 additions and 435 deletions.
6 changes: 6 additions & 0 deletions .envrc
@@ -0,0 +1,6 @@
env="$(nix-build $PWD/default.nix -A env --no-out-link)"

PATH_add "${env}/bin"

# allow local .envrc overrides
[[ -f .envrc.local ]] && source_env .envrc.local
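
For illustration, a hypothetical `.envrc.local` override could look like this (both lines are assumptions, not part of this commit):

```bash
# .envrc.local -- personal, uncommitted overrides (see the .gitignore entry below)
export ENV=staging
PATH_add "$HOME/.local/bin"
```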
7 changes: 7 additions & 0 deletions .gitignore
@@ -15,3 +15,10 @@ values-init-done
*~
# Emacs autosave files
\#*\#

# Envrc local overrides
/.envrc.local

# Nix-created result symlinks
result
result-*
26 changes: 26 additions & 0 deletions CHANGELOG.md
@@ -1,3 +1,29 @@
# 2020-10-28

## Features

* ansible/requirements.yml: Bump SFT for new checksum format (#361)
* Create SFT servers in two groups (#356)
* Skip creating SFT monitoring certs if there are no SFT servers (#357)
* Delete the SFT SRV record after provisioning (#368)
* Update message stats dashboard (#208)

## Bug fixes / work-arounds

* add support for cargohold s3Compatibility option (#364)

## Documentation

* Comment on email visibility feature flag (#276)

## Internal

* Better nix support (#362, #358, #367, #369)
* ansible/Makefile: Print errors correctly when ENV is not in order (#359)
* Makefile target to get logs (#355)
* Makefile target to decrypt sops containers (#354)
* [tf-module:push-notifications] Allow to define multiple apps per client platform (#347)

# 2020-10-06

## Internal
9 changes: 9 additions & 0 deletions Dockerfile
@@ -0,0 +1,9 @@
FROM nixos/nix

COPY . /wire-server-deploy

RUN apk add -u bash git

RUN nix-env -f /wire-server-deploy/default.nix -iA env

RUN rm -rf /wire-server-deploy
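
A minimal usage sketch, assuming the image is built from the repository root (the `wire-server-deploy-env` tag is an assumption, not part of this commit):

```bash
# build an image with the tools from default.nix installed via nix-env
docker build -t wire-server-deploy-env .
# open a shell with those tools on PATH
docker run --rm -it wire-server-deploy-env bash
```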
48 changes: 36 additions & 12 deletions ansible/Makefile
@@ -30,15 +30,15 @@ download-ansible-roles-force:
provision-sft: check-env
poetry run ansible-playbook ${ANSIBLE_DIR}/provision-sft.yml \
-i ${ENV_DIR}/gen/terraform-inventory.yml \
-  -i ${ENV_DIR}/inventory.yml \
+  -i ${ENV_DIR}/inventory \
--private-key ${ENV_DIR}/operator-ssh.dec \
-vv

.PHONY: bootstrap
bootstrap: check-env
poetry run ansible-playbook ${ANSIBLE_DIR}/bootstrap.yml \
-i ${ENV_DIR}/gen/terraform-inventory.yml \
-  -i ${ENV_DIR}/inventory.yml \
+  -i ${ENV_DIR}/inventory \
--private-key ${ENV_DIR}/operator-ssh.dec \
-vv

@@ -47,26 +47,50 @@ bootstrap: check-env
kube-minio-static-files: check-env
poetry run ansible-playbook ${ANSIBLE_DIR}/kube-minio-static-files.yml \
-i ${ENV_DIR}/gen/terraform-inventory.yml \
-  -i ${ENV_DIR}/inventory.yml \
+  -i ${ENV_DIR}/inventory \
--private-key ${ENV_DIR}/operator-ssh.dec \
--extra-vars "service_cluster_ip=$$(KUBECONFIG=${ENV_DIR}/gen/artifacts/admin.conf kubectl get service fake-aws-s3 -o json | jq -r .spec.clusterIP)" \
-vv
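
The `service_cluster_ip` extra-var above is obtained by querying the `fake-aws-s3` service with the generated admin kubeconfig; the same lookup can be run standalone:

```bash
# print the ClusterIP of the fake-aws-s3 service
KUBECONFIG=${ENV_DIR}/gen/artifacts/admin.conf \
  kubectl get service fake-aws-s3 -o json | jq -r .spec.clusterIP
```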

-.PHONY: check-env
-check-env:
+LOG_UNTIL ?= "now"
+.PHONY: get-logs
+get-logs: check-env
+ifndef LOG_HOST
+	$(error please define LOG_HOST)
+endif
+ifndef LOG_SERVICE
+	$(error please define LOG_SERVICE)
+endif
+ifndef LOG_SINCE
+	$(error please define LOG_SINCE)
+endif
+	poetry run ansible-playbook ${ANSIBLE_DIR}/get-logs.yml \
+	  -i ${ENV_DIR}/gen/terraform-inventory.yml \
+	  -i ${ENV_DIR}/inventory.yml \
+	  --private-key ${ENV_DIR}/operator-ssh.dec \
+	  --extra-vars "log_host=${LOG_HOST}" \
+	  --extra-vars "log_service=${LOG_SERVICE}" \
+	  --extra-vars "log_since=${LOG_SINCE}" \
+	  --extra-vars "log_until=${LOG_UNTIL}"
+
+.PHONY: ensure-env-dir
+ensure-env-dir:
ifndef ENV_DIR
ifndef ENV
	$(error please define either ENV or ENV_DIR)
else
ENV_DIR=${CAILLEACH_DIR}/environments/${ENV}
endif
endif
-ifeq ("$(wildcard ${ENV_DIR}/inventory.yml)", "")
-	$(error please make sure ${ENV_DIR}/inventory.yml exists)
-endif
-ifeq ("$(wildcard ${ENV_DIR}/gen/terraform-inventory.yml)", "")
-	$(error please make sure you have applied terraform for ${ENV_DIR})
-endif
-ifeq ("$(wildcard ${ENV_DIR}/operator-ssh.dec)", "")
-	$(error please make sure ${ENV_DIR}/operator-ssh.dec exists and contains the private key to ssh into servers)
-endif
+
+${ENV_DIR}/inventory:
+	$(error please make sure ${ENV_DIR}/inventory exists)
+
+${ENV_DIR}/gen/terraform-inventory.yml:
+	$(error please make sure you have applied terraform for ${ENV_DIR})
+
+${ENV_DIR}/operator-ssh.dec:
+	$(error please make sure ${ENV_DIR}/operator-ssh.dec exists and contains the private key to ssh into servers)
+
+.PHONY: check-env
+check-env: ensure-env-dir ${ENV_DIR}/operator-ssh.dec ${ENV_DIR}/gen/terraform-inventory.yml ${ENV_DIR}/inventory
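
With these targets in place, a log-fetching invocation might look like this (the environment, host, and service names are hypothetical; `LOG_UNTIL` defaults to `"now"`):

```bash
make -C ansible get-logs ENV=staging \
  LOG_HOST=staging-sft-1 LOG_SERVICE=sftd LOG_SINCE="1 hour ago"
```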
175 changes: 174 additions & 1 deletion ansible/README.md
@@ -25,6 +25,179 @@ soon.
1. Ensure that `make apply` and `make create-inventory` have been run for the
environment. Please refer to the [docs in the terraform
folder](../terraform/README.md) for details about how to run this.
-1. Ensure all required variables are set in `$ENV_DIR/inventory.yml`
+1. Ensure all required variables are set in `$ENV_DIR/inventory/*.yml`
1. Running `make bootstrap` from this directory will bootstrap the
environment.

## Operating SFT Servers

There are a few things to consider while running SFT servers.

1. Restarting an SFT server while a call is going on will drop the call. To
avoid this, we must allow a 6-hour grace period after stopping SRV record
announcements before restarting or removing a server.
1. Let's Encrypt will not issue more than 50 certificates per registered domain
per week.
1. Let's Encrypt will not do more than 5 renewals per week for any given set of
domains.

To deal with these issues, we create two groups of SFT servers (blue and
green). These groups are configured in terraform like this:
```tfvars
sft_server_names_blue = ["1", "2"] # defaults to []
sft_server_type_blue = "cx21" # defaults to "cx11"
sft_server_names_green = ["3", "4"] # defaults to []
sft_server_type_green = "cx21" # defaults to "cx11"
```

Terraform will put all the SFT servers (blue and green) in a group called
`sft_servers` and additionally, it will put the blue servers in
`sft_servers_blue` group and green servers in `sft_servers_green` group. This
allows putting common variables in the `sft_servers` group and uncommon ones
(like `sft_artifact_file_url`) in the respective groups.
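
As an illustrative sketch of that layout (URLs and values are placeholders; only the group and variable names come from this repository), the group variables could look like this:

```yaml
sft_servers:
  vars: {}   # variables common to all SFT servers (blue and green) go here
sft_servers_blue:
  vars:
    sft_artifact_file_url: "https://example.com/path/to/sftd_42.tar.gz"
    srv_announcer_active: true
sft_servers_green:
  vars:
    sft_artifact_file_url: "https://example.com/path/to/sftd_43.tar.gz"
    srv_announcer_active: false
```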

To maintain uptime, at least one of the groups must be active at any time. The
groups should ideally be equal in size, and each group on its own must be able
to support peak traffic.

### Deployment

Assume the blue servers are serving version 42 and we want to upgrade to
version 43.

In this case, the initial group vars for the `sft_servers_blue` group would
look like this:
```yaml
sft_servers_blue:
vars:
sft_artifact_file_url: "https://example.com/path/to/sftd_42.tar.gz"
sft_artifact_checksum: somechecksum_42
srv_announcer_active: true
```
For `sft_servers_green`, `srv_announcer_active` must be `false`.

1. Make sure all environment variables like `ENV` and `ENV_DIR` are set.
1. Create the terraform inventory (this section assumes all commands are
executed from the root of this repository):
```bash
make -C terraform/environment create-inventory
```
1. Set up the green servers to have version 43 and become active:
```yaml
sft_servers_green:
vars:
sft_artifact_file_url: "https://example.com/path/to/sftd_43.tar.gz"
sft_artifact_checksum: somechecksum_43
srv_announcer_active: true
```
1. Run ansible:
```bash
make -C ansible provision-sft
```

This will make sure that the green SFT servers have version 43 of sftd and are
available. At this point both the blue and green groups are active.
1. Ensure that the new servers function properly. If they don't, you can set
`srv_announcer_active` to `false` for the green group.
1. If the servers are working properly, set up the old servers to be deactivated:
```yaml
sft_servers_blue:
vars:
sft_artifact_file_url: "https://example.com/path/to/sftd_42.tar.gz"
sft_artifact_checksum: somechecksum_42
srv_announcer_active: false
```
1. Run ansible again:
```bash
make -C ansible provision-sft
```
1. There is a race condition in stopping SRV announcers, which means that a
server will sometimes not get removed from the list. This can be detected by
running:
```bash
dig SRV _sft._tcp.<env>.<domain>
```

If an old server is still listed even after the record's TTL has expired, it
must be cleaned up manually. It is safe to delete all the SRV records; they
should get re-populated within 20 seconds.
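
For reference, each active announcer contributes one record to the answer; an illustrative (purely hypothetical) entry would look like:

```
_sft._tcp.staging.example.com. 30 IN SRV 10 10 443 staging-sft-2.example.com.
```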

### Decommission one specific server

Assuming the terraform variables look like this, and we have to take down
server `"1"`:
```tfvars
sft_server_names_blue = ["1", "2"] # defaults to []
sft_server_type_blue = "cx21" # defaults to "cx11"
sft_server_names_green = ["3", "4"] # defaults to []
sft_server_type_green = "cx21" # defaults to "cx11"
environment = "staging"
```

#### When the server is active

1. Add one more server to the blue group by replacing the first line with:
```tfvars
sft_server_names_blue = ["1", "2", "5"] # These shouldn't overlap with the green ones
```
1. Run terraform (this will wait for approval)
```bash
make -C terraform/environment init apply create-inventory
```
1. Set `srv_announcer_active` to `false` only for the host which is to be taken
down. Here, the ansible host name would be `staging-sft-1`.
1. Run ansible
```bash
make -C ansible provision-sft
```
1. Ensure that the SRV records don't contain `sft1`, the same as in the last step of the deployment procedure.
1. Monitor `sft_calls` metric to make sure that there are no calls left.
1. Set up the instance for deletion by removing it from `sft_server_names_blue`:
```tfvars
sft_server_names_blue = ["2", "5"]
```
1. Run terraform (this will again wait for approval)
```bash
make -C terraform/environment apply
```

#### When the server is not active

1. Remove the server from `sft_server_names_blue` and add a new name by
replacing the first line like this:
```tfvars
sft_server_names_blue = ["2", "5"]
```
1. Run terraform (this will wait for approval)
```bash
make -C terraform/environment init apply
```

### Change server type of all servers

Assuming:
1. The initial tfvars file has these variables:
```tfvars
sft_server_names_blue = ["1", "2"] # defaults to []
sft_server_type_blue = "cx21" # defaults to "cx11"
sft_server_names_green = ["3", "4"] # defaults to []
sft_server_type_green = "cx21" # defaults to "cx11"
environment = "staging"
```
1. We want to make all the servers `"cx31"`.
1. The blue group is active, green is not.

We can do it like this:

1. Replace all the green servers by changing `sft_server_type_green`:
```tfvars
sft_server_type_green = "cx31"
```
1. Run terraform (this will wait for approval):
```bash
make -C terraform/environment init apply create-inventory
```
1. Deploy the same version as blue to green by following the steps in the
deployment procedure.
1. Once the blue servers are inactive and all the calls have finished, replace
them the same way as the green servers. There is no need to make them active
again.
11 changes: 11 additions & 0 deletions ansible/get-logs.yml
@@ -0,0 +1,11 @@
- hosts: "{{ log_host }}"
tasks:
- name: get logs
shell: journalctl -u {{ log_service }} --since {{ log_since }} --until {{ log_until }}
register: the_logs
- name: save logs
delegate_to: localhost
become: no
copy:
dest: /tmp/{{log_host}}-{{ log_service }}-{{ log_since }}-{{ log_until }}
content: "{{ the_logs.stdout }}"