Cloud Infrastructure

Essentially all of our cloud resources are hosted on DigitalOcean (DO).

Broken down by category, below is an inventory of those resources.

Storage

We use DO Spaces for cloud object/file storage. This is effectively AWS S3 (it even uses the same API and is likely just S3 under the hood). We have two main buckets here:

  • edm-recipes is our "data lake" and long-term store for data. All source data we ingest is stored and versioned here, and never deleted. Once product builds are complete, they are archived here as well. The recipes bucket supplies (almost) all source data for product builds, and we have a Python API for things like pulling a dataset or finding its available versions (see the sketch after this list).
  • edm-publishing is where we dump the outputs of product builds, make them available to other teams, and eventually package and distribute from. It is also something of a long-term store, holding historical build outputs for every version of each data product we produce. The difference is that publishing holds the whole data product "package" (outputs in various formats, multiple datasets per product, etc.), while a dataset archived in recipes is much more regular and tabular, intended to be treated closer to a database; publishing is focused on, well, publishing.
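
Because Spaces speaks the S3 API, standard S3 tooling works against it. Below is a minimal sketch of the kind of access our Python API wraps, assuming boto3 and a Spaces key pair in the environment; the region, key layout, and dataset names are illustrative, not our actual conventions.

```python
import os

import boto3

# DO Spaces is S3-compatible, so boto3 works when pointed at a Spaces endpoint.
session = boto3.session.Session()
client = session.client(
    "s3",
    endpoint_url="https://nyc3.digitaloceanspaces.com",  # placeholder region
    aws_access_key_id=os.environ["DO_SPACES_KEY"],
    aws_secret_access_key=os.environ["DO_SPACES_SECRET"],
)

# Find available "versions" of a hypothetical dataset by listing key prefixes.
resp = client.list_objects_v2(
    Bucket="edm-recipes", Prefix="datasets/example_dataset/", Delimiter="/"
)
versions = [p["Prefix"] for p in resp.get("CommonPrefixes", [])]

# Pull a single file from one version of the dataset.
client.download_file(
    "edm-recipes",
    "datasets/example_dataset/2024-08-01/example_dataset.csv",
    "example_dataset.csv",
)
```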

Database

A database obviously has both storage and compute resources, but it is a bit of its own entity and we use it quite differently from our other storage, hence it gets its own section.

We have a single persisted database cluster: a PostgreSQL cluster on DO. We do not use it for long-term storage, but rather as an engine for builds and transformations. PostGIS has long been one of the more mature and accessible tools for geospatial transformations, so most of our transformation logic is written in PostgreSQL. For a standard build, data is pulled from our long-term data store (edm-recipes) and loaded into this database, and then our transformation logic runs there. Once a build is complete (all transformations have run), exports are produced from the build's database/schema and dumped back to file storage (edm-publishing). Everything in the database is somewhat transient, but during a build cycle (QAing data, etc.) the intermediate tables (as well as source and output tables) remain available, which is useful for debugging anything that might have gone wrong with a build and for running tests.
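
As a rough illustration of that flow, here is a minimal sketch, assuming psycopg2 against a PostGIS-enabled database; the connection string, table names, and columns are hypothetical, and the real tooling handles loading sources and uploading exports.

```python
import psycopg2

# Placeholder connection string; real credentials come from secrets at build time.
conn = psycopg2.connect("postgresql://builder:password@db.example.com:25060/example_build")

with conn, conn.cursor() as cur:
    # Source tables (previously loaded from edm-recipes) feed a PostGIS transformation.
    # Table and column names here are hypothetical.
    cur.execute(
        """
        CREATE TABLE example_output AS
        SELECT
            lots.bbl,
            districts.district_id,
            lots.geom
        FROM example_lots AS lots
        JOIN example_districts AS districts
            ON ST_Intersects(lots.geom, districts.geom);
        """
    )

# Export the result to a local file, which would then be uploaded to edm-publishing.
with conn.cursor() as cur, open("example_output.csv", "w") as f:
    cur.copy_expert("COPY example_output TO STDOUT WITH CSV HEADER", f)

conn.close()
```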

Compute

As far as our builds go, we don't actually have managed compute resources; we (ab)use GitHub Actions to run our builds. These GitHub-hosted runners are spun up when jobs run and can pull a docker image to run a set of commands in a container. We use this with a relatively simple set of commands: load secrets, "plan" a build, load data into the production database, run the build, and export outputs.
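
The sequence of those commands looks roughly like the sketch below. The CLI name and subcommands are placeholders, not our actual tooling; in practice each is a step of a GitHub Actions job running in the container, with secrets supplied by the workflow environment.

```python
import subprocess

# Hypothetical build steps, in the order a job runs them. "build-tool" and its
# subcommands are placeholders for illustration, not a real CLI.
BUILD_STEPS = [
    ["build-tool", "plan"],    # resolve which source dataset versions to build against
    ["build-tool", "load"],    # load source data from edm-recipes into the database
    ["build-tool", "build"],   # run the SQL transformations
    ["build-tool", "export"],  # export outputs and upload them to edm-publishing
]

def run_build() -> None:
    """Run each step in order; check=True stops the job on the first failure."""
    for step in BUILD_STEPS:
        subprocess.run(step, check=True)

if __name__ == "__main__":
    run_build()
```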

We do have a single cloud VM, a DO "Droplet". This hosts a Streamlit app which we use to QA the outputs of our builds.
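
For a sense of scale, a QA app of that sort can be as small as the sketch below; this is a hypothetical example using Streamlit's file uploader, whereas the real app reads build outputs from edm-publishing.

```python
import pandas as pd
import streamlit as st

# Minimal, hypothetical QA page: load a build output and eyeball it.
st.title("Build QA")
uploaded = st.file_uploader("Upload a build output (CSV)")
if uploaded is not None:
    df = pd.read_csv(uploaded)
    st.write(f"{len(df):,} rows, {len(df.columns)} columns")
    st.dataframe(df)
```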

Diagram

  • edm-inbox is a bucket that we dump raw data into; we're phasing it out.
  • edm-distributions has been scrapped in favor of handling packaging/distribution in edm-publishing.