Commit

Updates to devops docs (#134)
hellais authored Jan 6, 2025
1 parent 8ac1779 commit 903b3c7
Showing 10 changed files with 345 additions and 415 deletions.
116 changes: 1 addition & 115 deletions README.md
# OONI Devops


## Infrastructure Tiers

We divide our infrastructure components into 3 tiers:

- **Tier 0: Critical**: These are mission-critical infrastructure components. If they become unavailable or significantly disrupted, the impact is major.

- **Tier 1: Essential**: These components are important, but not as critical as
  tier 0. They are part of our core operations; if they become unavailable the
  impact is significant, but not major.

- **Tier 2: Non-Essential**: These are auxiliary components. Their
unavailability does not have a major impact.

### Tier 0 (Critical) components

- [ ] Probe Services (collector specifically)
- [ ] Fastpath (part responsible for storing post-cans)
- [x] DNS configuration
- [ ] Monitoring
- [ ] OONI bridges
- [ ] OONI.org website
- [x] Web Connectivity test helpers
- [x] Code signing

### Tier 1 (Essential) components

- [ ] OONI API measurement listing
- [x] OONI Explorer
- [x] OONI Run
- [ ] OONI Data analysis pipeline
- [x] OONI Findings API
- [x] Website analytics

### Tier 2 (Non-Essential) components

- [ ] Test list editor
- [ ] Jupyter notebooks
- [ ] Countly

## DNS and Domains

The primary domains used by the backend are:
- `ooni.org`
- `ooni.io`
- `ooni.nu`

### DNS naming policy

The public-facing names of services follow this format:

- `<service>.ooni.org`

Examples:

- `explorer.ooni.org`
- `run.ooni.org`

Public-facing means the FQDNs are used directly by external users or services,
or are embedded in the probes. They cannot be changed or retired without
causing outages.

Use public-facing names sparingly and, when possible, start off by creating a
private name first.
Not every host needs to have a public-facing name. For example, staging and
testing environments might not have a public-facing name.

Each service also has a public name which points to the specific host running
that service; these are hosted in the `.io` zone.
This is helpful because sometimes the same host runs multiple services, or
multiple services sit behind the same public service endpoint (e.g. in the case
of an API gateway setup).

Names in the `.io` zone should always also include the environment name they
relate to:

- `<service>.prod.ooni.io` for production services
- `<service>.test.ooni.io` for test services

When there may be multiple instances of a service running, you can append a
number to the service name. Otherwise the service name should contain only
alphabetic characters.

Examples:

- `clickhouse.prod.ooni.io`
- `postgres0.prod.ooni.io`
- `postgres1.prod.ooni.io`
- `prometheus.prod.ooni.io`
- `grafana.prod.ooni.io`
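
These service names are ordinary DNS records, so you can check what a given name currently resolves to, for example:

```
dig +short clickhouse.prod.ooni.io
```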

Finally, the actual host which runs the service should have an FQDN defined
inside the `.nu` zone.

This might not apply to every host, especially in a cloud environment. The
FQDNs in the `.nu` zone are the ones stored in the ansible inventory file and
used as targets for configuration management.

The structure of these domains is:

- `<name>.<location>.[prod|test].ooni.nu`

The location tag can be either just the provider name, or the provider name followed by `-` and the location.

Here is a list of location tags:

- `htz-fsn`: Hetzner in Falkenstein
- `htz-hel`: Hetzner in Helsinki
- `grh-ams`: Greenhost in Amsterdam
- `grh-mia`: Greenhost in Miami
- `aws-fra`: AWS in Europe (Frankfurt)

Examples:

- `monitoring.htz-fsn.prod.ooni.nu`
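
As an illustration of how these names are then consumed (hypothetical group and host names, not the actual inventory), an ansible inventory built from `.nu` FQDNs could look like:

```
[monitoring]
monitoring.htz-fsn.prod.ooni.nu

[clickhouse]
clickhouse.htz-fsn.prod.ooni.nu
```
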
This documentation contains information
174 changes: 164 additions & 10 deletions ansible/README.md
# Ansible

**NOTE** We are currently in the process of migrating ansible configurations from [ooni/sysadmin](https://github.com/ooni/sysadmin) to [ooni/devops](https://github.com/ooni/devops).

Ansible is used to configure the OS on long-term provisioned backend hosts and to manage the configuration of these components.

For example, ansible is used to configure VPSs and dedicated hosts that are provisioned manually or using terraform.

In the case of hosts that are continuously delivered, we instead use cloud-native configuration management tools.

## Installation and setup

It's recommended to make use of a virtualenv, for example managed using `pyenv virtualenv`:
```
pyenv virtualenv ooni-devops
pyenv activate ooni-devops
```

### Ansible setup

You should then install the required python and ansible-galaxy dependencies with:
```
pip install -r requirements/python.yml
ansible-galaxy install -r requirements/ansible-galaxy.yml
```

### AWS configuration

You should then set up AWS credentials by adding 2 profiles called `oonidevops_user_dev` and `oonidevops_user_prod`, which have access to the development and production environments respectively.

To this end edit your `~/.aws/credentials` file to contain:

```
[oonidevops_user_dev]
region = eu-central-1
role_arn = arn:aws:iam::471112720364:role/oonidevops
```
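
Assuming you have the AWS CLI installed, you can verify that both profiles resolve and can assume their role with:

```
aws sts get-caller-identity --profile oonidevops_user_dev
aws sts get-caller-identity --profile oonidevops_user_prod
```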

### SSH configuration

You should configure your `~/.ssh/config` with the following:

```
IdentitiesOnly yes
ServerAliveInterval 120
UserKnownHostsFile ~/.ssh/known_hosts ~/REPLACE_ME/sysadmin/ext/known_hosts
host *.ooni.io
user YOUR_USERNAME
host *.ooni.nu
user YOUR_USERNAME
host *.ooni.org
user YOUR_USERNAME
```

**TODO** restore ext/known_hosts setup

Replace `~/REPLACE_ME/sysadmin/ext/known_hosts` with the path where you have cloned
the `ooni/sysadmin` repo. This will ensure you use the host key
fingerprints from this repo instead of just relying on TOFU.

You should replace `YOUR_USERNAME` with your username from `adm_login`.
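
To check how these stanzas resolve for a particular host (hostname shown for illustration), you can ask ssh to print its effective configuration:

```
ssh -G monitoring.htz-fsn.prod.ooni.nu | grep -E '^(user|userknownhostsfile) '
```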

On macOS you may also want to add:

```
host *
    UseKeychain yes
```

to use the Keychain to store passwords.

## Running ansible playbooks

Playbooks are run via a wrapper script called `./play`, which notifies the slack #ooni-bots channel that a deployment has been triggered.

```
./play -i inventory deploy-<component>.yml -l <hostname> --diff -C
./play -i inventory deploy-<component>.yml -l <hostname> --diff
```
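
For example, a check-mode run of the monitoring playbook against a single host (hostname shown for illustration) would look like:

```
./play -i inventory deploy-monitoring.yml -l monitoring.htz-fsn.prod.ooni.nu --diff -C
```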

:::caution
Any minor error in configuration files or ansible playbooks can be
destructive for the backend infrastructure. Always test-run playbooks
with `--diff` and `-C` first and carefully verify the configuration
changes. After verification, run the playbook without `-C` and verify
the applied changes again.
:::

:::note
[Etckeeper](#etckeeper)&thinsp;🔧 can be useful to verify configuration
changes from a different point of view.
:::

In general there are two classes of playbooks:
* Those starting with `deploy-*.yml`, which are used to deploy specific components or pieces of components related to OONI infrastructure. All of these playbooks are included inside of `playbook.yml` to facilitate testing and to ensure that every component in our infrastructure is fully deployable.
* Those starting with `playbook-*`, which are playbooks for specific tasks that may not be part of the main infrastructure deployment (e.g. bootstrapping nodes once when they are created, creating snapshots of remote configurations, etc.)

Some notable playbooks or roles are:

The bootstrap playbook, `playbook-bootstrap.yml`, should be run once when a new host is created.
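
For example (using the same placeholder convention as above), a first run against a freshly created host might look like:

```
./play -i inventory playbook-bootstrap.yml -l <hostname> --diff
```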

The nftables firewall is configured to read every `.nft` file under
`/etc/ooni/nftables/`. This allows each role to create a small file that
opens a single port, keeping the configuration as close as possible to the
ansible step that deploys the service. See this in use inside the `nftables` role.
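
As a sketch of such a drop-in (the file name, port, and the `inet filter input` table/chain names are assumptions for illustration), a role could install a one-line file like:

```
# /etc/ooni/nftables/node-exporter.nft
add rule inet filter input tcp dport 9100 counter accept comment "prometheus node exporter"
```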

#### The root account

Runbooks use ssh to log into the hosts with your own account, leveraging `sudo` to act as root.

The only exception is when a new host is being deployed - in that case ansible will log in as root to create
individual accounts and lock out the root user.

When running the entire runbook ansible might try to run it as root.
This can be avoided by selecting only the required tags using `-t <tagname>`.

Ideally the root user should be disabled after successfully creating user accounts.

#### Roles layout

Ansible playbooks use multiple roles (see
[example](https://github.com/ooni/sysadmin/blob/master/ansible/deploy-backend.yml#L46))
to deploy various components.

A few roles use the `meta/main.yml` file to depend on other roles. See
[example](https://github.com/ooni/sysadmin/blob/master/ansible/roles/ooni-backend/meta/main.yml)

:::note
The latter method should be used sparingly because ansible does not
indicate where each task in a playbook is coming from. Moreover, if a dependency is specified by two different roles, it will run twice.
:::
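
For instance, based on the dependency diagram below, the `meta/main.yml` of the `ooni-backend` role pulling in `nftables` would look roughly like this (a sketch, not the exact file):

```
# roles/ooni-backend/meta/main.yml (sketch)
dependencies:
  - role: nftables
```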

A diagram of the role dependencies for the deploy-backend.yml playbook:

```mermaid
flowchart LR
A(deploy-backend.yml) --> B(base-bullseye)
B -- meta --> G(adm)
A --> F(nftables)
A --> C(nginx-buster)
A --> D(dehydrated)
D -- meta --> C
E -- meta --> F
A --> E(ooni-backend)
style B fill:#eeffee
style C fill:#eeffee
style D fill:#eeffee
style E fill:#eeffee
style F fill:#eeffee
style G fill:#eeffee
```

A similar diagram for deploy-monitoring.yml:

```mermaid
flowchart LR
B -- meta --> G(adm)
M(deploy-monitoring.yml) --> B(base-bookworm)
M --> O(ooca-cert)
M --> F(nftables)
M --> D(dehydrated) -- meta --> N(nginx-buster)
M --> P(prometheus)
M --> X(blackbox-exporter)
M --> T(alertmanager)
style B fill:#eeffee
style D fill:#eeffee
style F fill:#eeffee
style G fill:#eeffee
style N fill:#eeffee
style O fill:#eeffee
style P fill:#eeffee
style T fill:#eeffee
style X fill:#eeffee
```

:::note
When deploying files or updating files already existing on the hosts it can be useful to add a note e.g. "Deployed by ansible, see <role_name>".
This helps track down how files on the host were modified and why.
:::
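
As a sketch of this convention (role, file names and paths are hypothetical), a template task and the header it leaves behind might look like:

```
# roles/prometheus/tasks/main.yml (sketch)
- name: Deploy prometheus configuration
  ansible.builtin.template:
    src: prometheus.yml.j2
    dest: /etc/prometheus/prometheus.yml

# first line of templates/prometheus.yml.j2
# Deployed by ansible, see role prometheus
```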

### Platform-specific known bugs

On macOS you might run into this issue: https://github.com/ansible/ansible/issues/76322

The current workaround is to export the following environment variable before running ansible:
3 changes: 3 additions & 0 deletions ansible/requirements/python.yml
ansible==9.3.0
boto3==1.34.65
dnspython==2.6.1
16 changes: 4 additions & 12 deletions docs/IncidentResponse.md
On Android devices the following apps can be used:
* [Grafana](#grafana)&thinsp;🔧 viewer
<https://play.google.com/store/apps/details?id=it.ksol.grafanaview>

## Severities

When designing the architecture of backend components or handling incidents it can be useful to have
defined severities and tiers.
In this case there is no distinction between severity and priority.
Incidents and alarms from monitoring can be classified by severity levels based on their impact:

- 1: Serious security breach or data loss; serious loss of privacy impacting users or team members; legal risks.
- 2: Downtime impacting service usability for a significant fraction of users or a tier 0 component; serious security vulnerability.
Examples: probes being unable to submit measurements
- 3: Downtime or poor performance impacting secondary services (tier 1 or above); anything that can cause a level 2 event if not addressed within 24h; outages of monitoring infrastructure
- 4: Every other event that requires attention within 7 days

For an outline of infrastructure tiers see [infrastructure tiers](devops/infrastructure).

### Relations and dependencies between services

Example: An active/standby database pair provides a tier 2 service. An automatic failover tool is triggered by a simple monitoring script.
Both have to be labeled tier 2.


### Handling incidents

Depending on the severity of an event a different workflow can be followed.