Simplify onboarding new participants, including for more major cloud providers #874

Open
barroco opened this issue Oct 28, 2022 · 1 comment
Labels: dss (Relating to one of the DSS implementations), feature (Issue would improve software), P1 (High priority)

Comments

barroco commented Oct 28, 2022

The current instructions for bringing up a DSS instance are actionable (complete and clear), but they are very long and require a fair amount of engineering expertise. We have a tool under development called deployment_manager which should simplify this process substantially, and therefore make deployment of a DSS instance easier.

Deployment instructions: https://github.com/interuss/dss/tree/master/build
Deployment tool: https://github.com/interuss/dss/tree/master/monitoring/deployment_manager

barroco commented Oct 28, 2022

Following discussions with @BenjaminPelletier, @BradNicolle, and @marcadamsge, here is the plan to update the DSS deployment approach to support other cloud providers while keeping it manageable for InterUSS over time.

Background

The deployment of the DSS is currently documented mostly in a README, and the Kubernetes (K8s) deployment instructions only cover GKE. Tanka is used to generate and configure Kubernetes resources. In addition, the DSS codebase is being refactored to require only one container instead of the current two. Most of the complexity lies in getting a Kubernetes cluster running CockroachDB ready to be pooled, and in the pooling steps themselves. We have undertaken the process of extracting self-contained modules to separate repositories. Finally, we are starting the work to support other cloud providers.

Default DSS infrastructure

The DSS is composed of two services (we count the http-gateway and core-service applications as one, since refactoring to merge them is under way): the DSS API and the CockroachDB database.
In addition, the current configuration proposes the following supporting services:

  • Istio: service mesh
  • Prometheus & Grafana: monitoring and dashboards

The requirements for the InterUSS standard deployment of the DSS in terms of infrastructure are:

  • A Kubernetes cluster with 3 nodes
  • 4 ingresses with individual static IPs and associated DNS records (1 for the gateway, 3 for the CockroachDB instances)

Objectives and change plan overview

1. Infrastructure as code

Conceptually, the deployment will be broken down in three main categories:

Infrastructure: responsible for the cloud resources required to run the DSS services. It includes the Kubernetes cluster creation, cluster nodes, load balancers and associated static IPs, etc. This stage is cloud provider specific. The objective is to support Amazon Web Services (EKS), Azure (AKS), and Google Cloud (GKE).
To manage multi-cloud resources, we propose to use Terraform providers [C.1]. Terraform providers offer the following infrastructure-as-code benefits:

  • Limit the number of untested command line steps in the READMEs.
  • Allow users to keep track of the infrastructure lifecycle and run simple upgrades.
  • Exercise multi-cloud deployment as part of the CI/CD (see the sketch after this list).
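
As a rough sketch of the intended user experience (the module path and variable names below are hypothetical, not a committed interface), deploying the infrastructure for one cloud provider could reduce to a single Terraform module invocation:

```hcl
# Hypothetical main.tf in a user workspace; the module path and
# variables are illustrative only.
module "dss_infrastructure" {
  source = "../infrastructure/gke" # or ../infrastructure/eks, ../infrastructure/aks

  cluster_name    = "dss-us-demo"
  crdb_node_count = 3              # the 3-node requirement described above
  dns_zone        = "dss.example.com"
}
```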

Services: The ambition is to be cloud provider agnostic for the services part, which is responsible for managing Kubernetes resources. We will distinguish core services, the minimal set of services required by the DSS, from supporting services, which may be of interest to users wishing to operate the DSS out of the box.

Currently, services are deployed using Tanka, which provides a templating mechanism for K8s manifests. The second main change proposal is to replace Tanka with Helm [C.2]. In addition to templating, Helm allows packaging and publishing charts, so more advanced users can reuse them for their own deployments. Helm is especially well suited for GitOps deployments: charts are versioned and can be used to automate the upgrade lifecycle, they can be published to cloud providers' container registries, and Helm supports hooks and tests that allow sequencing operations during upgrades and validation steps.
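
To keep everything in a single tool for new users, the published chart could even be consumed from Terraform via the Helm provider. A minimal sketch, assuming a chart named dss in a not-yet-existing public repository:

```hcl
# Sketch only: the chart name, repository URL, and version are
# assumptions, since no chart is published yet.
resource "helm_release" "dss" {
  name       = "dss"
  repository = "https://example.com/interuss-charts" # placeholder
  chart      = "dss"
  version    = "0.1.0"

  # User values, e.g. the services.yaml file from the workspace.
  values = [file("${path.module}/services.yaml")]
}
```

Advanced users could equally run helm install against the same chart directly, without Terraform.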

Operations: Diagnostic and utility operations, such as certificate management, may be simplified using the deployment_manager CLI tool / pod.

To keep the learning curve and maintenance burden low, new users should be able to deploy the DSS with knowledge of Terraform only. Advanced users running their own infrastructure should be able to deploy the DSS using the Helm chart directly.

2. New repository structure

This is an opportunity to reorganize the repository structure incrementally and split build from deployment [C.3]. All assets are currently located in the build folder, and users are expected to work by default in an ignored folder, build/workspace/. A new folder at the root of the repository may be created with the following structure:

/deploy
    infrastructure (terraform)
        aks *
        eks *
        gke *
    common (common modules, if needed)
    services
        tanka 
        helm *
    operations
        README (how to use the deployment manager)
        scripts
            make_certs
            apply_certs
    workspace (environment definitions, custom to each user)
        example
            example.tfvars (User variables for the infrastructure deployment)
            services.yaml (User values for the helm chart)
            main.tf (terraform deployment specification)
            certs/ (Generated certificates - this should move to the secret manager store, see **C.7**)
  • Note that Terraform modules and Helm charts may need to be moved to independent repositories to be properly published to public registries. The folder structure may be revised to take Terraform and Helm limitations into account.

3. Extract deployment example to a new repository

Terraform modules, Helm charts, and the deployment_manager CLI can be packaged and published [C.4]. Once those components can be installed from a publicly available registry, an example repository could be created to support users working outside the main dss repository for their own deployments [C.5], along the lines of the sketch below.
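
For illustration, once published, a deployment outside this repository might reference a module by its Git source instead of a relative path. The version tag and variable values below are assumptions:

```hcl
# Hypothetical published-module reference; the ref and variable
# values are placeholders, not a released version.
module "dss" {
  source = "github.com/interuss/dss//deploy/infrastructure/modules/terraform-google-dss?ref=v0.1.0"

  cluster_name    = "dss-us-demo"
  crdb_node_count = 3
}
```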

4. Use secret manager to store the generated certificates

Currently, certificates are generated in the repository in an ignored folder. We propose storing them in a secret manager instead [C.7]. The following services are available:

  • Google: https://cloud.google.com/secret-manager
  • AWS: https://aws.amazon.com/secrets-manager/
  • Azure: https://azure.microsoft.com/en-us/products/key-vault/

The secret manager will be provisioned by the infrastructure stage and populated and updated by the deployment manager. Secrets will be exposed as a K8s resource in the cluster or via CLI for local usage, as sketched below.
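
As a sketch of the provisioning step on GCP (the secret name and certificate path are illustrative; AWS and Azure have equivalent Terraform resources, aws_secretsmanager_secret and azurerm_key_vault_secret):

```hcl
# Provision a secret for a generated certificate in Google Secret
# Manager; the secret_id is an illustrative name.
resource "google_secret_manager_secret" "crdb_ca_cert" {
  secret_id = "dss-crdb-ca-cert"

  replication {
    automatic = true
  }
}

# Store the certificate content; the local path is a placeholder for
# wherever the deployment manager generates certificates.
resource "google_secret_manager_secret_version" "crdb_ca_cert" {
  secret      = google_secret_manager_secret.crdb_ca_cert.id
  secret_data = file("certs/ca.crt")
}
```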

5. Automatically test the deployment

Once the infrastructure and the services can be deployed using infrastructure as code, the pooling procedure for a DSS Region deployment with multi-cloud DSS instances can be added to the CI/CD [C.6]. The pooling procedure will be orchestrated by the deployment manager. This will help committers and contributors gain confidence in changes to the deployment procedure, and surface otherwise-unnoticed changes on the cloud providers' side.

Changes summary

Priority 1

C.1: Introduce Terraform to manage the infrastructure stage for each cloud provider.
C.3: Reorganize the dss repository to separate build from deployment.
C.2: Replace Tanka with Helm charts.
C.4: Publish the DSS Helm chart and Terraform modules for each cloud provider to simplify usage outside of the repository.
C.5: Create an example repository using the published artifacts.

Priority 2

C.7: Use a secret manager to store the certificates.

Priority 3

C.6: Test in the CI/CD the deployment of a DSS Region with multi-cloud DSS instances, including the pooling procedure.
