[tanka] Consolidate service deployments to new deploy structure (#1111)
barroco authored Sep 25, 2024
1 parent d7b42ab commit 1f7f50a
Showing 35 changed files with 121 additions and 117 deletions.
4 changes: 2 additions & 2 deletions Makefile
@@ -168,8 +168,8 @@ dss-tests: evaluate-tanka test-go-units test-go-units-crdb build-dss down-locall

.PHONY: evaluate-tanka
evaluate-tanka:
docker container run -v $(CURDIR)/build/jsonnetfile.json:/build/jsonnetfile.json -v $(CURDIR)/build/deploy:/build/deploy grafana/tanka show --dangerous-allow-redirect /build/deploy/examples/minimum
docker container run -v $(CURDIR)/build/jsonnetfile.json:/build/jsonnetfile.json -v $(CURDIR)/build/deploy:/build/deploy grafana/tanka show --dangerous-allow-redirect /build/deploy/examples/schema_manager
docker container run -v $(CURDIR)/deploy/services/tanka:/deploy/services/tanka grafana/tanka show --dangerous-allow-redirect /deploy/services/tanka/examples/minimum
docker container run -v $(CURDIR)/deploy/services/tanka:/deploy/services/tanka grafana/tanka show --dangerous-allow-redirect /deploy/services/tanka/examples/schema_manager

# This reproduces the entire continuous integration workflow (.github/workflows/ci.yml)
.PHONY: presubmit
9 changes: 7 additions & 2 deletions build/README.md
@@ -262,9 +262,14 @@ a PR to that effect would be greatly appreciated.
[previous section](#docker-images).
1. From this working directory,
`cp -r deploy/examples/minimum/* workspace/$CLUSTER_CONTEXT`. Note that
`cp -r ../deploy/services/tanka/examples/minimum/* workspace/$CLUSTER_CONTEXT`. Note that
the `workspace/$CLUSTER_CONTEXT` folder should have already been created
by the `make-certs.py` script.
Replace the imports at the top of `main.jsonnet` to correctly locate the files:
```
local dss = import '../../../deploy/services/tanka/dss.libsonnet';
local metadataBase = import '../../../deploy/services/tanka/metadata_base.libsonnet';
```
1. If providing a .pem file directly as the public key to validate incoming
access tokens, copy it to [dss/build/jwt-public-certs](./jwt-public-certs).
@@ -561,7 +566,7 @@ existing clusters you will need to:
1. Create `workspace/$CLUSTER_CONTEXT_schema_manager` in this (build) directory.
1. From this (build) working directory,
`cp -r deploy/examples/schema_manager/* workspace/$CLUSTER_CONTEXT_schema_manager`.
`cp -r ../deploy/services/tanka/examples/schema_manager/* workspace/$CLUSTER_CONTEXT_schema_manager`.
1. Edit `workspace/$CLUSTER_CONTEXT_schema_manager/main.jsonnet` and replace all `VAR_*`
instances with appropriate values where applicable as explained in the above section.
4 changes: 2 additions & 2 deletions build/db_schemas/README.md
@@ -16,8 +16,8 @@ When a new database version is created, it needs to be targeted in a number of
places:
* Both .sql files in the appropriate folder in db_schemas when setting
schema_versions.schema_version
* [DSS main.jsonnet](../deploy/examples/minimum/main.jsonnet)
* [Schema manager main.jsonnet](../deploy/examples/schema_manager/main.jsonnet)
* [DSS main.jsonnet](../../deploy/services/tanka/examples/minimum/main.jsonnet)
* [Schema manager main.jsonnet](../../deploy/services/tanka/examples/schema_manager/main.jsonnet)
* /pkg/{rid|scd}/store/cockroach/store.go
* /deploy/infrastructure/dependencies/terraform-commons-dss/default_latest.tf
* /deploy/services/helm-charts/dss/templates/schema-manager.yaml
105 changes: 11 additions & 94 deletions build/deploy/README.md
@@ -1,99 +1,16 @@
# Kubernetes deployment via Tanka

This folder contains a set of configuration files to be used by
[tanka](https://tanka.dev/install) to deploy a single DSS instance via
Kubernetes following the procedures found in the [build](..) folder.
The documentation and configuration have been moved to [deploy/services/tanka](../../deploy/services/tanka).
The [Architecture](../../deploy/architecture.md#architecture), [Survivability](../../deploy/architecture.md#survivability)
and [Sizing](../../deploy/architecture.md#sizing) sections have been moved to [deploy/architecture](../../deploy/architecture.md).

## Architecture
## Migrating configurations to new location

The expected deployment configuration of a DSS pool supporting a DSS Region is
multiple organizations to each host one DSS instance that is interoperable with
each other organization's DSS instance. A DSS pool with three participating
organizations (USSs) will have an architecture similar to the diagram below.
The following steps describe how to update your workspace configurations to use the new configuration location.

_**Note** that the diagram shows 2 stateful sets per DSS instance. Currently, the
files in this folder produce 3 stateful sets per DSS instance. However, after
Issue #481 is resolved, this is expected to be reduced to 2 stateful sets._

![Pool architecture diagram](../../assets/generated/pool_architecture.png)

## Survivability

One of the primary design considerations of the DSS is to be very resilient to
failures. This resiliency is obtained primarily from the behavior of the
underlying CockroachDB database technology and how we configure it. The diagram
below shows the result of failures (bringing a node down for maintenance, or
having an entire USS go down) from different starting points, assuming 3 replicas.

![Survivability diagram](../../assets/generated/survivability_3x2.svg)


The table
below summarizes survivable failures with 3 DSS instances configured according
to the architecture described above. Each system state is summarized by three
groups (one group per USS) of two nodes per USS.

* 🟩 : Functional node has no recent changes in functionality
* 🟥 : Non-functional node in down USS has no recent changes in functionality
* 🟧 : Non-functional node due to USS upgrade or maintenance has no recent changes in functionality
* 🔴 : Node becomes non-functional due to a USS going down
* 🟠 : Node becomes non-functional due to USS upgrade or maintenance

| Pre-existing conditions | New failures | Survivable?
| --- | --- | ---
| (🟩 , 🟩 ) (🟩 , 🟩 ) (🟩 , 🟩 ) | (🟩 , 🟩 ) (🟩 , 🟩 ) (🟩 , 🟠 ) | 🟢 Yes
| | (🟩 , 🟩 ) (🟩 , 🟠 ) (🟩 , 🟠 ) | 🔴 No; some ranges may be lost because of [this bug](https://github.com/cockroachdb/cockroach/issues/66159)
| | (🟩 , 🟠 ) (🟩 , 🟠 ) (🟩 , 🟠 ) | 🔴 No; some ranges may be lost
| | (🟩 , 🟩 ) (🟩 , 🟩 ) (🔴 , 🔴 ) | 🟢 Yes
| | (🟩 , 🟩 ) (🔴 , 🔴 ) (🔴 , 🔴 ) | 🔴 No; ranges guaranteed to be lost
| (🟩 , 🟩 ) (🟩 , 🟩 ) (🟩 , 🟧 ) | (🟩 , 🟩 ) (🟩 , 🟠 ) (🟩 , 🟧 ) | 🟢 Yes
| | (🟩 , 🟠 ) (🟩 , 🟠 ) (🟩 , 🟧 ) | 🔴 No; some ranges may be lost because of [this bug](https://github.com/cockroachdb/cockroach/issues/66159)
| | (🟩 , 🟩 ) (🟩 , 🟩 ) (🔴 , 🔴 ) | 🟢 Yes
| | (🟩 , 🟩 ) (🔴 , 🔴 ) (🟩 , 🟧 ) | 🟡 Yes, with 3 replicas
| (🟩 , 🟩 ) (🟩 , 🟧 ) (🟩 , 🟧 ) | (🟩 , 🟠 ) (🟩 , 🟧 ) (🟩 , 🟧 ) | 🟢 Yes
| | (🟩 , 🟩 ) (🟩 , 🟧 ) (🟠 , 🟧 ) | 🟢 Yes
| | (🟩 , 🟩 ) (🟩 , 🟧 ) (🔴 , 🔴 ) | 🟢 Yes
| | (🔴 , 🔴 ) (🟩 , 🟧 ) (🟩 , 🟧 ) | 🟡 Yes, with 3 replicas
| (🟩 , 🟧 ) (🟩 , 🟧 ) (🟩 , 🟧 ) | (🟩 , 🟧 ) (🟩 , 🟧 ) (🟠 , 🟧 ) | 🟡 Yes, with 3 replicas
| | (🟩 , 🟧 ) (🟠 , 🟧 ) (🟠 , 🟧 ) | 🔴 No; ranges guaranteed to be lost
| | (🟠 , 🟧 ) (🟠 , 🟧 ) (🟠 , 🟧 ) | 🔴 No; ranges guaranteed to be lost
| | (🟩 , 🟧 ) (🟩 , 🟧 ) (🔴 , 🔴 ) | 🟡 Yes, with 3 replicas
| (🟩 , 🟩 ) (🟩 , 🟩 ) (🟥 , 🟥 ) | (🟩 , 🟩 ) (🟩 , 🟠 ) (🟥 , 🟥 ) | 🟡 Yes, with 3 replicas
| | (🟩 , 🟠 ) (🟩 , 🟠 ) (🟥 , 🟥 ) | 🔴 No; some ranges may be lost
| | (🟩 , 🟩 ) (🔴 , 🔴 ) (🟥 , 🟥 ) | 🔴 No; some ranges may be lost

## Sizing

### Introduction
This section contains an estimate of the computational and other resources
likely necessary to support expected demand in a country similar to the United
States.

### Time required to fulfill queries for a single flight
1. Assume 1 ISA per flight (worst case)
1. 2 ISA management queries per flight (create & delete)
1. Assume 90% of flights are nominal and require 3 strategic deconfliction queries (Accepted, Activated, Ended) while 10% of flights have problems and require 7 strategic deconfliction queries
1. 3.4 strategic deconfliction queries per flight
1. Assume 0.1 seconds to fulfill a query
1. Therefore, 0.54 seconds required (on average) to fulfill management queries to support a flight

### Time required to fulfill queries for a RID Display Provider
1. Assume 2 Display Providers viewing each flight on average, 4 subscriptions per flight per DP, and 40% chance of subscription reuse
1. 9.6 subscription queries per flight
1. 0.96 seconds required (on average) to fulfill viewing queries to support a flight

### Required parallelism
1. Use [348,537 remote pilots in 2024](https://www.faa.gov/uas/resources/by_the_numbers/)
1. Assume 100 flights per month per remote pilot
1. Use [989,916 recreational pilots](https://www.faa.gov/data_research/aviation/aerospace_forecasts/media/FY2020-40_faa_aerospace_forecast.pdf) as a baseline (even though this is likely number of aircraft, not number of pilots) and double it for the future
1. Use [7.1 flights per month per recreational pilot](https://www.faa.gov/data_research/aviation/aerospace_forecasts/media/FY2020-40_faa_aerospace_forecast.pdf)
1. Therefore, expect about 18.6 flights per second
1. With 1.5 seconds of query time per flight, a nominal parallelism of 28 is required to satisfy the demand
1. Assuming a peak-average ratio of 3.5, a parallelism of 98 is required

### Required resources
1. With Cockroach Labs guidance of 4 parallel operations per vCPU, the DSS pool requires 25 vCPUs.
1. Assuming 3 DSS instances and the need to continue to operate when one instance is down, each DSS instance requires 13 vCPUs.
1. Using 8-vCPU virtual machines (like n2-standard-8), this means each instance needs 2 of these virtual machines
1. Assuming that 5 days' worth of flights are occupying space on disk at any given time and that each flight record on disk is 100k, approximately 83 GB of storage is required
1. Note that Cockroach Labs recommends 4,000 read IO/s and 4,000 write IO/s, and some cloud providers scale storage speed with storage size, so 83 GB of storage may be far less than is necessary to achieve these speed numbers
For Tanka-only deployments, update the imports of the `dss` and `metadataBase` libraries in your `main.jsonnet`.
Replace the current paths with:
```
local dss = import '../../../deploy/services/tanka/dss.libsonnet';
local metadataBase = import '../../../deploy/services/tanka/metadata_base.libsonnet';
```
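
The import update described above can also be scripted. The following is a minimal sketch, not part of this commit: `migrate_imports` is a hypothetical helper, and it assumes the workspace lives at `build/workspace/$CLUSTER_CONTEXT` so that the `../../../deploy/services/tanka/` prefix resolves correctly.

```python
import re

# New library locations, taken from the import lines shown above.
# Assumption: main.jsonnet lives in build/workspace/$CLUSTER_CONTEXT.
NEW_IMPORTS = {
    "dss.libsonnet": "../../../deploy/services/tanka/dss.libsonnet",
    "metadata_base.libsonnet": "../../../deploy/services/tanka/metadata_base.libsonnet",
}

# Matches `local <name> = import '<any dir>/<library>'` for the two libraries only.
IMPORT_RE = re.compile(
    r"(local\s+\w+\s*=\s*import\s+)(['\"])((?:[^'\"]*/)?)"
    r"(dss\.libsonnet|metadata_base\.libsonnet)\2"
)


def migrate_imports(source: str) -> str:
    """Return `source` with the dss/metadataBase imports pointing to the new location."""
    def repl(match: re.Match) -> str:
        # Keep the `local x = import ` prefix, swap only the quoted path.
        return f"{match.group(1)}'{NEW_IMPORTS[match.group(4)]}'"

    return IMPORT_RE.sub(repl, source)
```

To apply it, read a workspace `main.jsonnet`, pass its text through `migrate_imports`, and write the result back. Other imports are left untouched, and running the helper twice gives the same result as running it once.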
2 changes: 1 addition & 1 deletion deploy/MIGRATION.md
@@ -91,7 +91,7 @@ CockroachDB requires to upgrade one minor version at a time, therefore the follo

**Important notes:**

- The migration plan below has been tested with the deployment of services using [Helm](services/helm-charts) and [Tanka](../build/deploy) without Istio enabled. Note that this configuration flag has been decommissioned since [#995](https://github.com/interuss/dss/pull/995).
- The migration plan below has been tested with the deployment of services using [Helm](services/helm-charts) and [Tanka](services/tanka) without Istio enabled. Note that this configuration flag has been decommissioned since [#995](https://github.com/interuss/dss/pull/995).
- Further work is required to test and evaluate the availability of the DSS during migrations.
- It is highly recommended to rehearse such operations on a test cluster before applying them to a production environment.

2 changes: 1 addition & 1 deletion deploy/README.md
@@ -22,7 +22,7 @@ Terraform modules are provided for:

1. [Services](#services) provides the tooling to deploy a DSS instance to a Kubernetes cluster.
- [Helm Charts](services/helm-charts/dss)
- [Tanka](../build/deploy)
- [Tanka](services/tanka)

1. [Operations](#operations) provides instructions to operate a deployed DSS instance.
- [Pooling procedure](./operations/README.md#pooling-procedure)
86 changes: 83 additions & 3 deletions deploy/architecture.md
@@ -6,7 +6,16 @@ See [introduction](../build/pooling.md#introduction)

## Architecture

See [architecture](../build/deploy/README.md#architecture)
The expected deployment configuration of a DSS pool supporting a DSS Region is
multiple organizations to each host one DSS instance that is interoperable with
each other organization's DSS instance. A DSS pool with three participating
organizations (USSs) will have an architecture similar to the diagram below.

_**Note** that the diagram shows 2 stateful sets per DSS instance. Currently, the
helm and tanka deployments produce 3 stateful sets per DSS instance. However, after
Issue #481 is resolved, this is expected to be reduced to 2 stateful sets._

![Pool architecture diagram](../assets/generated/pool_architecture.png)

### Terminology notes

@@ -24,8 +33,79 @@ See [Additional requirements](../build/pooling.md#additional-requirements).

### Survivability

See [survivability](../build/deploy/README.md#survivability).
One of the primary design considerations of the DSS is to be very resilient to
failures. This resiliency is obtained primarily from the behavior of the
underlying CockroachDB database technology and how we configure it. The diagram
below shows the result of failures (bringing a node down for maintenance, or
having an entire USS go down) from different starting points, assuming 3 replicas.

![Survivability diagram](../assets/generated/survivability_3x2.svg)

The table below summarizes survivable failures with 3 DSS instances configured according
to the architecture described above. Each system state is summarized by three
groups (one group per USS) of two nodes per USS.

* 🟩 : Functional node has no recent changes in functionality
* 🟥 : Non-functional node in down USS has no recent changes in functionality
* 🟧 : Non-functional node due to USS upgrade or maintenance has no recent changes in functionality
* 🔴 : Node becomes non-functional due to a USS going down
* 🟠 : Node becomes non-functional due to USS upgrade or maintenance

| Pre-existing conditions | New failures | Survivable?
| --- | --- | ---
| (🟩 , 🟩 ) (🟩 , 🟩 ) (🟩 , 🟩 ) | (🟩 , 🟩 ) (🟩 , 🟩 ) (🟩 , 🟠 ) | 🟢 Yes
| | (🟩 , 🟩 ) (🟩 , 🟠 ) (🟩 , 🟠 ) | 🔴 No; some ranges may be lost because of [this bug](https://github.com/cockroachdb/cockroach/issues/66159)
| | (🟩 , 🟠 ) (🟩 , 🟠 ) (🟩 , 🟠 ) | 🔴 No; some ranges may be lost
| | (🟩 , 🟩 ) (🟩 , 🟩 ) (🔴 , 🔴 ) | 🟢 Yes
| | (🟩 , 🟩 ) (🔴 , 🔴 ) (🔴 , 🔴 ) | 🔴 No; ranges guaranteed to be lost
| (🟩 , 🟩 ) (🟩 , 🟩 ) (🟩 , 🟧 ) | (🟩 , 🟩 ) (🟩 , 🟠 ) (🟩 , 🟧 ) | 🟢 Yes
| | (🟩 , 🟠 ) (🟩 , 🟠 ) (🟩 , 🟧 ) | 🔴 No; some ranges may be lost because of [this bug](https://github.com/cockroachdb/cockroach/issues/66159)
| | (🟩 , 🟩 ) (🟩 , 🟩 ) (🔴 , 🔴 ) | 🟢 Yes
| | (🟩 , 🟩 ) (🔴 , 🔴 ) (🟩 , 🟧 ) | 🟡 Yes, with 3 replicas
| (🟩 , 🟩 ) (🟩 , 🟧 ) (🟩 , 🟧 ) | (🟩 , 🟠 ) (🟩 , 🟧 ) (🟩 , 🟧 ) | 🟢 Yes
| | (🟩 , 🟩 ) (🟩 , 🟧 ) (🟠 , 🟧 ) | 🟢 Yes
| | (🟩 , 🟩 ) (🟩 , 🟧 ) (🔴 , 🔴 ) | 🟢 Yes
| | (🔴 , 🔴 ) (🟩 , 🟧 ) (🟩 , 🟧 ) | 🟡 Yes, with 3 replicas
| (🟩 , 🟧 ) (🟩 , 🟧 ) (🟩 , 🟧 ) | (🟩 , 🟧 ) (🟩 , 🟧 ) (🟠 , 🟧 ) | 🟡 Yes, with 3 replicas
| | (🟩 , 🟧 ) (🟠 , 🟧 ) (🟠 , 🟧 ) | 🔴 No; ranges guaranteed to be lost
| | (🟠 , 🟧 ) (🟠 , 🟧 ) (🟠 , 🟧 ) | 🔴 No; ranges guaranteed to be lost
| | (🟩 , 🟧 ) (🟩 , 🟧 ) (🔴 , 🔴 ) | 🟡 Yes, with 3 replicas
| (🟩 , 🟩 ) (🟩 , 🟩 ) (🟥 , 🟥 ) | (🟩 , 🟩 ) (🟩 , 🟠 ) (🟥 , 🟥 ) | 🟡 Yes, with 3 replicas
| | (🟩 , 🟠 ) (🟩 , 🟠 ) (🟥 , 🟥 ) | 🔴 No; some ranges may be lost
| | (🟩 , 🟩 ) (🔴 , 🔴 ) (🟥 , 🟥 ) | 🔴 No; some ranges may be lost

### Sizing

See [sizing](../build/deploy/README.md#sizing).
#### Introduction
This section contains an estimate of the computational and other resources
likely necessary to support expected demand in a country similar to the United
States.

#### Time required to fulfill queries for a single flight
1. Assume 1 ISA per flight (worst case)
1. 2 ISA management queries per flight (create & delete)
1. Assume 90% of flights are nominal and require 3 strategic deconfliction queries (Accepted, Activated, Ended) while 10% of flights have problems and require 7 strategic deconfliction queries
1. 3.4 strategic deconfliction queries per flight
1. Assume 0.1 seconds to fulfill a query
1. Therefore, 0.54 seconds required (on average) to fulfill management queries to support a flight

#### Time required to fulfill queries for a RID Display Provider
1. Assume 2 Display Providers viewing each flight on average, 4 subscriptions per flight per DP, and 40% chance of subscription reuse
1. 9.6 subscription queries per flight
1. 0.96 seconds required (on average) to fulfill viewing queries to support a flight

#### Required parallelism
1. Use [348,537 remote pilots in 2024](https://www.faa.gov/uas/resources/by_the_numbers/)
1. Assume 100 flights per month per remote pilot
1. Use [989,916 recreational pilots](https://www.faa.gov/data_research/aviation/aerospace_forecasts/media/FY2020-40_faa_aerospace_forecast.pdf) as a baseline (even though this is likely number of aircraft, not number of pilots) and double it for the future
1. Use [7.1 flights per month per recreational pilot](https://www.faa.gov/data_research/aviation/aerospace_forecasts/media/FY2020-40_faa_aerospace_forecast.pdf)
1. Therefore, expect about 18.6 flights per second
1. With 1.5 seconds of query time per flight, a nominal parallelism of 28 is required to satisfy the demand
1. Assuming a peak-average ratio of 3.5, a parallelism of 98 is required

#### Required resources
1. With Cockroach Labs guidance of 4 parallel operations per vCPU, the DSS pool requires 25 vCPUs.
1. Assuming 3 DSS instances and the need to continue to operate when one instance is down, each DSS instance requires 13 vCPUs.
1. Using 8-vCPU virtual machines (like n2-standard-8), this means each instance needs 2 of these virtual machines
1. Assuming that 5 days' worth of flights are occupying space on disk at any given time and that each flight record on disk is 100k, approximately 83 GB of storage is required
1. Note that Cockroach Labs recommends 4,000 read IO/s and 4,000 write IO/s, and some cloud providers scale storage speed with storage size, so 83 GB of storage may be far less than is necessary to achieve these speed numbers
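
The CPU part of the sizing estimate above can be reproduced with a short calculation. This is a sketch under stated assumptions rather than the authors' own method: the function name `sizing` is hypothetical, the average month length (365.25 / 12 days) and the factor of 2 queries per subscription are assumptions chosen because they reproduce the 18.6 flights per second and 9.6 subscription queries quoted in the text, and the storage estimate is not covered.

```python
import math

# Assumption: an average month of 365.25 / 12 days, which reproduces 18.6 flights/s.
SECONDS_PER_MONTH = 365.25 / 12 * 24 * 3600


def sizing() -> dict:
    """Recompute the CPU sizing figures from the assumptions listed in the text."""
    query_time_s = 0.1

    # Single flight: 2 ISA queries plus a 90%/10% mix of 3 and 7 deconfliction queries.
    scd_queries = 0.9 * 3 + 0.1 * 7
    management_s = (2 + scd_queries) * query_time_s

    # RID viewing: 2 DPs x 4 subscriptions, 40% reused.
    # Assumption: 2 queries per subscription (gives the 9.6 quoted in the text).
    rid_queries = 2 * 4 * (1 - 0.4) * 2
    viewing_s = rid_queries * query_time_s

    # Demand: remote pilots plus doubled recreational baseline.
    flights_per_month = 348_537 * 100 + 2 * 989_916 * 7.1
    flights_per_s = flights_per_month / SECONDS_PER_MONTH

    nominal = math.ceil(flights_per_s * (management_s + viewing_s))
    peak = math.ceil(nominal * 3.5)          # peak-average ratio of 3.5
    pool_vcpus = math.ceil(peak / 4)         # 4 parallel operations per vCPU
    per_instance = math.ceil(pool_vcpus / (3 - 1))  # 3 instances, one may be down
    vms = math.ceil(per_instance / 8)        # 8-vCPU virtual machines

    return {
        "management_s": management_s,
        "viewing_s": viewing_s,
        "flights_per_s": flights_per_s,
        "nominal_parallelism": nominal,
        "peak_parallelism": peak,
        "pool_vcpus": pool_vcpus,
        "vcpus_per_instance": per_instance,
        "vms_per_instance": vms,
    }
```

Calling `sizing()` returns the intermediate values as a dictionary, so a single input (for example the peak-average ratio) can be changed to see how the vCPU requirement moves.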
@@ -1,8 +1,7 @@
// This file was automatically generated by terraform-commons-dss.
// Do not edit it directly.

local dss = import '../../deploy/dss.libsonnet';
local metadataBase = import '../../deploy/metadata_base.libsonnet';
local dss = import '../../../deploy/services/tanka/dss.libsonnet';
local metadataBase = import '../../../deploy/services/tanka/metadata_base.libsonnet';

// All VAR_* values below must be replaced with appropriate values; see
// dss/build/README.md for more information.