Commit

Updates to devops docs (#134)
hellais authored Jan 6, 2025
1 parent 8ac1779 commit 903b3c7
Showing 10 changed files with 345 additions and 415 deletions.
116 changes: 1 addition & 115 deletions README.md
# OONI Devops


## Infrastructure Tiers

We divide our infrastructure components into 3 tiers:

- **Tier 0: Critical**: These are mission-critical infrastructure components. If they become unavailable or significantly disrupted, the impact is major.

- **Tier 1: Essential**: These components are important, but not as critical as
  tier 0. They are part of our core operations; if they become unavailable the
  impact is significant, but not major.

- **Tier 2: Non-Essential**: These are auxiliary components. Their
unavailability does not have a major impact.

### Tier 0 (Critical) components

- [ ] Probe Services (collector specifically)
- [ ] Fastpath (part responsible for storing post-cans)
- [x] DNS configuration
- [ ] Monitoring
- [ ] OONI bridges
- [ ] OONI.org website
- [x] Web Connectivity test helpers
- [x] Code signing

### Tier 1 (Essential) components

- [ ] OONI API measurement listing
- [x] OONI Explorer
- [x] OONI Run
- [ ] OONI Data analysis pipeline
- [x] OONI Findings API
- [x] Website analytics

### Tier 2 (Non-Essential) components

- [ ] Test list editor
- [ ] Jupyter notebooks
- [ ] Countly

## DNS and Domains

The primary domains used by the backend are:
- `ooni.org`
- `ooni.io`
- `ooni.nu`

### DNS naming policy

The public-facing names of services follow this format:

- `<service>.ooni.org`

Examples:

- `explorer.ooni.org`
- `run.ooni.org`

Public-facing means the FQDNs are used directly by external users or services,
or are embedded in the probes. They cannot be changed or retired without
causing outages.

Use public-facing names sparingly and, when possible, start off by creating a
private name first.
Not every host needs to have a public-facing name. For example, staging and
testing environments might not have a public-facing name.

Each service also has a public name which points to the specific host running
that service; these are hosted in the `.io` zone.
This is helpful because sometimes the same host runs multiple services, or
multiple services sit behind the same public service endpoint (e.g. in the case
of an API gateway setup).

Names in the `.io` zone should always also include the environment name they
relate to:

- `<service>.prod.ooni.io` for production services
- `<service>.test.ooni.io` for test services

When there may be multiple instances of a service running, you can append a
number to the service name. Otherwise the service name should contain only
alphabetic characters.

Examples:

- `clickhouse.prod.ooni.io`
- `postgres0.prod.ooni.io`
- `postgres1.prod.ooni.io`
- `prometheus.prod.ooni.io`
- `grafana.prod.ooni.io`
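
These service names are ordinary DNS records, so you can check what a given name currently resolves to, for example:

```
dig +short clickhouse.prod.ooni.io
```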

Finally, the actual host which runs the service should have an FQDN defined
inside the `.nu` zone.

This might not apply to every host, especially in a cloud environment. The
FQDNs in the `.nu` zone are the ones stored in the ansible inventory file and
used as targets for configuration management.

The structure of these domains is:

- `<name>.<location>.[prod|test].ooni.nu`

The location tag can be either just the provider name, or the provider name followed by `-` and the location.

Here is a list of location tags:

- `htz-fsn`: Hetzner in Falkenstein
- `htz-hel`: Hetzner in Helsinki
- `grh-ams`: Greenhost in Amsterdam
- `grh-mia`: Greenhost in Miami
- `aws-fra`: AWS in Europe (Frankfurt)

Examples:

- `monitoring.htz-fsn.prod.ooni.nu`
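
As an illustration of how these names are then consumed (hypothetical group and host names, not the actual inventory), an ansible inventory built from `.nu` FQDNs could look like:

```
[monitoring]
monitoring.htz-fsn.prod.ooni.nu

[clickhouse]
clickhouse.htz-fsn.prod.ooni.nu
```
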
This documentation contains information
174 changes: 164 additions & 10 deletions ansible/README.md
# Ansible

**NOTE** We are currently in the process of migrating ansible configurations from [ooni/sysadmin](https://github.com/ooni/sysadmin) to [ooni/devops](https://github.com/ooni/devops).

Ansible is used to configure the OS on long-term provisioned backend hosts and to manage the configuration of these components.

For example, ansible is used to configure VPSs and dedicated hosts that are provisioned manually or using terraform.

In the case of hosts that are continuously delivered, we instead use cloud-native configuration management tools.

## Installation and setup

It's recommended to make use of a virtualenv, for example managed using `pyenv virtualenv`:
```
pyenv virtualenv ooni-devops
pyenv activate ooni-devops
```

### Ansible setup

You should then install the required python and ansible-galaxy dependencies with:
```
pip install -r requirements/python.yml
ansible-galaxy install -r requirements/ansible-galaxy.yml
```

### AWS configuration

You should then set up AWS credentials by adding 2 profiles called `oonidevops_user_dev` and `oonidevops_user_prod`, which have access to the development and production environments respectively.

To this end edit your `~/.aws/credentials` file to contain:

```
[oonidevops_user_dev]
region = eu-central-1
role_arn = arn:aws:iam::471112720364:role/oonidevops
```
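
Assuming you have the AWS CLI installed, you can verify that both profiles resolve and can assume their role with:

```
aws sts get-caller-identity --profile oonidevops_user_dev
aws sts get-caller-identity --profile oonidevops_user_prod
```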

### SSH configuration

You should configure your `~/.ssh/config` with the following:

```
IdentitiesOnly yes
ServerAliveInterval 120
UserKnownHostsFile ~/.ssh/known_hosts ~/REPLACE_ME/sysadmin/ext/known_hosts
host *.ooni.io
user YOUR_USERNAME
host *.ooni.nu
user YOUR_USERNAME
host *.ooni.org
user YOUR_USERNAME
```

**TODO** restore ext/known_hosts setup

Replace `~/REPLACE_ME/sysadmin/ext/known_hosts` with the path where you have cloned
the `ooni/sysadmin` repo. This will ensure you use the host key
fingerprints from this repo instead of just relying on TOFU.

You should replace `YOUR_USERNAME` with your username from `adm_login`.
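
To check how these stanzas resolve for a particular host (hostname shown for illustration), you can ask ssh to print its effective configuration:

```
ssh -G monitoring.htz-fsn.prod.ooni.nu | grep -E '^(user|userknownhostsfile) '
```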

On macOS you may also want to add:

```
host *
    UseKeychain yes
```

to use the Keychain to store passwords.

## Running ansible playbooks

Playbooks are run via a wrapper script called `./play`, which notifies the slack #ooni-bots channel that a deployment has been triggered.

```
./play -i inventory deploy-<component>.yml -l <hostname> --diff -C
./play -i inventory deploy-<component>.yml -l <hostname> --diff
```
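
For example, a check-mode run of the monitoring playbook against a single host (hostname shown for illustration) would look like:

```
./play -i inventory deploy-monitoring.yml -l monitoring.htz-fsn.prod.ooni.nu --diff -C
```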

:::caution
Any minor error in configuration files or ansible playbooks can be
destructive for the backend infrastructure. Always test-run playbooks
with `--diff` and `-C` first and carefully verify the configuration
changes. After verification, run the playbook without `-C` and verify
the applied changes again.
:::

:::note
[Etckeeper](#etckeeper)&thinsp;🔧 can be useful to verify configuration
changes from a different point of view.
:::

In general there are two classes of playbooks:
* Those starting with `deploy-*.yml`, which are used to deploy specific components or pieces of components related to OONI infrastructure. All of these playbooks are included inside of `playbook.yml` to facilitate testing and to ensure that every component in our infrastructure is fully deployable.
* Those starting with `playbook-*`, which are playbooks for specific tasks that may not be part of the main infrastructure deployment (e.g. bootstrapping nodes once when they are created, creating snapshots of remote configurations, etc.)

Some notable playbooks or roles are:

The bootstrap playbook, `playbook-bootstrap.yml`, should be run once when a new host is created.
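
For example (using the same placeholder convention as above), a first run against a freshly created host might look like:

```
./play -i inventory playbook-bootstrap.yml -l <hostname> --diff
```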

The nftables firewall is configured to read every `.nft` file under
`/etc/ooni/nftables/`. This allows each role to create a small file that
opens a single port, keeping the configuration as close as possible to the
ansible step that deploys the service. See this in use inside the `nftables` role.
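
As a sketch of such a drop-in (the file name, port, and the `inet filter input` table/chain names are assumptions for illustration), a role could install a one-line file like:

```
# /etc/ooni/nftables/node-exporter.nft
add rule inet filter input tcp dport 9100 counter accept comment "prometheus node exporter"
```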

#### The root account

Runbooks use ssh to log into the hosts with your own account, leveraging `sudo` to act as root.

The only exception is when a new host is being deployed - in that case ansible will log in as root to create
individual accounts and lock out the root user.

When running the entire runbook ansible might try to run it as root.
This can be avoided by selecting only the required tags using `-t <tagname>`.

Ideally the root user should be disabled after successfully creating user accounts.

#### Roles layout

Ansible playbooks use multiple roles (see
[example](https://github.com/ooni/sysadmin/blob/master/ansible/deploy-backend.yml#L46))
to deploy various components.

A few roles use the `meta/main.yml` file to depend on other roles. See
[example](https://github.com/ooni/sysadmin/blob/master/ansible/roles/ooni-backend/meta/main.yml)

:::note
The latter method should be used sparingly because ansible does not
indicate where each task in a playbook is coming from. Moreover, if a dependency is specified by two different roles, it will run twice.
:::
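
For instance, based on the dependency diagram below, the `meta/main.yml` of the `ooni-backend` role pulling in `nftables` would look roughly like this (a sketch, not the exact file):

```
# roles/ooni-backend/meta/main.yml (sketch)
dependencies:
  - role: nftables
```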

A diagram of the role dependencies for the deploy-backend.yml playbook:

```mermaid
flowchart LR
A(deploy-backend.yml) --> B(base-bullseye)
B -- meta --> G(adm)
A --> F(nftables)
A --> C(nginx-buster)
A --> D(dehydrated)
D -- meta --> C
E -- meta --> F
A --> E(ooni-backend)
style B fill:#eeffee
style C fill:#eeffee
style D fill:#eeffee
style E fill:#eeffee
style F fill:#eeffee
style G fill:#eeffee
```

A similar diagram for deploy-monitoring.yml:

```mermaid
flowchart LR
B -- meta --> G(adm)
M(deploy-monitoring.yml) --> B(base-bookworm)
M --> O(ooca-cert)
M --> F(nftables)
M --> D(dehydrated) -- meta --> N(nginx-buster)
M --> P(prometheus)
M --> X(blackbox-exporter)
M --> T(alertmanager)
style B fill:#eeffee
style D fill:#eeffee
style F fill:#eeffee
style G fill:#eeffee
style N fill:#eeffee
style O fill:#eeffee
style P fill:#eeffee
style T fill:#eeffee
style X fill:#eeffee
```

:::note
When deploying files or updating files already existing on the hosts it can be useful to add a note e.g. "Deployed by ansible, see <role_name>".
This helps track down how files on the host were modified and why.
:::
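
As a sketch of this convention (role, file names and paths are hypothetical), a template task and the header it leaves behind might look like:

```
# roles/prometheus/tasks/main.yml (sketch)
- name: Deploy prometheus configuration
  ansible.builtin.template:
    src: prometheus.yml.j2
    dest: /etc/prometheus/prometheus.yml

# first line of templates/prometheus.yml.j2
# Deployed by ansible, see role prometheus
```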

### Platform-specific known bugs

On macOS you might run into this issue: https://github.com/ansible/ansible/issues/76322

The current workaround is to export the following environment variable before running ansible:
3 changes: 3 additions & 0 deletions ansible/requirements/python.yml
ansible==9.3.0
boto3==1.34.65
dnspython==2.6.1
16 changes: 4 additions & 12 deletions docs/IncidentResponse.md
On Android devices the following apps can be used:
* [Grafana](#grafana)&thinsp;🔧 viewer
<https://play.google.com/store/apps/details?id=it.ksol.grafanaview>

## Severities

When designing the architecture of backend components or handling incidents it can be useful to have
defined severities and tiers.
In this case there is no distinction between severity and priority.
Incidents and alarms from monitoring can be classified by severity levels based on their impact:

- 1: Serious security breach or data loss; serious loss of privacy impacting users or team members; legal risks.
- 2: Downtime impacting service usability for a significant fraction of users or a tier 0 component; serious security vulnerability.
Examples: probes being unable to submit measurements
- 3: Downtime or poor performance impacting secondary services (tier 1 or above); anything that can cause a level 2 event if not addressed within 24h; outages of monitoring infrastructure
- 4: Every other event that requires attention within 7 days

For an outline of infrastructure tiers see [infrastructure tiers](devops/infrastructure).

### Relations and dependencies between services

Example: An active/standby database pair provides a tier 2 service. An automatic failover tool is triggered by a simple monitoring script.
Both have to be labeled tier 2.


### Handling incidents

Depending on the severity of an event a different workflow can be followed.