Current state of the data flow #56
TangoYankee
started this conversation in General
@TangoYankee I've finally had a chance to digest this and would like to move it forward. Are one or both of the open draft PRs ready for a proper review? I don't think I'll have much feedback on a line-by-line scale, but I feel good about the broad strokes outlined above. Moving to a proper dependency management system (in our case, …

Description

Updating the Data Flow for the Capital Planning Project highlighted pain points in using and maintaining it. Several branches have made different changes in pursuit of alleviating these pain points. The most recent iteration is `cp-flow-node-only`. The next several sections highlight the steps of the data flow, the associated pain points, and how the pain points are addressed in the latest iteration. We do not need to accept the whole of the changes. We can identify which parts make sense and use them on their own, leaving behind the parts that do not.

Configuring the architecture

On `main`, the data flow is run in persistent or semi-persistent databases. In live environments, there is a dedicated Digital Ocean Managed Database instance for each build environment. During local development, there is a database volume. The Digital Ocean Managed Databases create a maintenance burden, adding three databases that need to be updated. Both the local and live databases add complexity to managing the state of the database: if a failed build leaves the database in a "bad" state, the developer needs to perform several steps to wipe the state and start over.

`cp-flow-node-only` changes both the local and live environments. First, the GitHub action uses a docker "service" to create a dockerized database on each run. This database is ephemeral and disappears at the end of each build. We can make this change because we expect that any data flows running in the GitHub action have already been tested locally; they will likely succeed by the time they are on `main`. Getting rid of the database at the end of the build means 1) we do not need to maintain the database and 2) each run is guaranteed to be independent of previous runs. Second, the local environment removes the volume. If state needs to be temporarily saved across development sessions, it can be preserved with `docker compose stop`. If it needs to be wiped for a clean run, this can be done with `docker compose down`.
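
One reason the ephemeral database is cheap to adopt: the flow code itself can be indifferent to where the database lives, as long as connection details come from the environment. Below is a minimal sketch of that idea; the `FLOW_DB_*` variable names are assumptions for illustration, not the branch's actual configuration.

```ts
// Sketch (assumed variable names): the same pool works against the ephemeral
// CI service container or the local docker compose container, because nothing
// here depends on persistent infrastructure.
import { Pool } from "pg";

export const flowDb = new Pool({
  host: process.env.FLOW_DB_HOST ?? "localhost",
  port: Number(process.env.FLOW_DB_PORT ?? 5432),
  user: process.env.FLOW_DB_USER,
  password: process.env.FLOW_DB_PASSWORD,
  database: process.env.FLOW_DB_NAME,
});
```
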
Installing dependencies

On `main`, the data flow requires the installation of several programs, including MinIO, python/pyenv, and postgres. Though not included on `main`, development of the capital projects map also required handling shapefiles using the `ogr2ogr` tool from `gdal`. Each of these dependencies needs to be managed and updated independently. They also need a corresponding but distinct installation instruction in the GitHub runner. This creates a burden on developers, especially those who develop across different machines and operating systems.

On the `cp-flow-node-only` branch, dependencies are managed either through `npm` or docker. `pg_dump` and `psql` come with the postgres docker image, which we need anyway for the flow database, so no additional installation is required to access them. The remaining `minio`, `node-postgres`, `shapefile`, and `drizzle` packages are managed using `npm`. This is a familiar workflow for application engineering, as (almost) all other AE repos rely on `npm`.

Downloading data

`main` and `cp-flow-node-only` both rely on MinIO. While `main` uses a direct install/CLI tool, `cp-flow-node-only` uses a node module. Using the MinIO node module allows us to 1) write the loading logic in TypeScript and 2) load asynchronously.
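
As a rough illustration of the node-module approach (a sketch only; the endpoint, credentials, bucket, and object names are placeholders), the `minio` package exposes a promise-based client, which is what makes concurrent downloads straightforward:

```ts
// Sketch: placeholder connection details, bucket, and object names.
import { Client } from "minio";

const minioClient = new Client({
  endPoint: process.env.MINIO_ENDPOINT ?? "localhost",
  port: 9000,
  useSSL: false,
  accessKey: process.env.MINIO_ACCESS_KEY ?? "",
  secretKey: process.env.MINIO_SECRET_KEY ?? "",
});

// Because the client is promise-based, several source files can be
// downloaded concurrently instead of one at a time.
export async function downloadSources() {
  await Promise.all([
    minioClient.fGetObject("source-bucket", "borough.csv", "tmp/borough.csv"),
    minioClient.fGetObject("source-bucket", "capital_project.csv", "tmp/capital_project.csv"),
  ]);
}
```
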

Defining the source tables

On `main`, source tables are defined manually by writing the tables in SQL. However, not every data source is loaded into an intermediate "source table". For example, "borough" is loaded directly from the source file into the "borough" table.

The `cp-flow-node-only` branch still manually defines source tables. The significant difference is that it makes every data source go into an intermediate source table, even boroughs.
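
For illustration only, a hand-written source table might look like the following, executed against the flow database with `node-postgres`. The table and column names are assumptions, not copied from the branch.

```ts
// Sketch: an intermediate source table for boroughs (illustrative names).
import { Pool } from "pg";

const pool = new Pool(); // connection settings come from the standard PG* environment variables

export async function createBoroughSourceTable() {
  await pool.query(`
    CREATE TABLE IF NOT EXISTS source_borough (
      id TEXT,
      title TEXT
    );
  `);
}
```
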

Validating the source tables

On `main`, dbt is used to apply tests to select columns. Dbt allows for the reuse of these tests.

The `cp-flow-node-only` branch relies on "check constraints" applied directly to the source table columns. These check constraints are not as reusable. However, we haven't seen much duplication in tests, and relying on SQL itself simplifies the source table configuration.
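
Continuing the illustrative `source_borough` sketch from above, validation becomes part of the table definition itself. The constraint names and rules here are assumptions, not the branch's actual checks.

```ts
// Sketch: validation expressed as check constraints (illustrative rules).
import { Pool } from "pg";

const pool = new Pool();

export async function addBoroughChecks() {
  await pool.query(`
    ALTER TABLE source_borough
      ADD CONSTRAINT source_borough_id_present CHECK (id IS NOT NULL),
      ADD CONSTRAINT source_borough_id_valid CHECK (id ~ '^[1-5]$');
  `);
}
```
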

Loading data into the source tables

Both `main` and `cp-flow-node-only` rely on pg copies. However, `main` relies on psql and `cp-flow-node-only` relies on `node-postgres`. `node-postgres` allows us to 1) write the loading logic in TypeScript and 2) load asynchronously.
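
One way to run a COPY through `node-postgres` is the `pg-copy-streams` helper; whether the branch uses that exact package or another mechanism is an assumption. The sketch below streams a local CSV into the illustrative `source_borough` table.

```ts
// Sketch: one possible COPY implementation with node-postgres.
// The table, columns, and file are illustrative.
import { createReadStream } from "node:fs";
import { pipeline } from "node:stream/promises";
import { Pool } from "pg";
import { from as copyFrom } from "pg-copy-streams";

const pool = new Pool();

export async function loadBoroughs(csvPath: string) {
  const client = await pool.connect();
  try {
    const copyStream = client.query(
      copyFrom("COPY source_borough (id, title) FROM STDIN WITH (FORMAT csv, HEADER true)"),
    );
    await pipeline(createReadStream(csvPath), copyStream);
  } finally {
    client.release();
  }
}
```
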
The "model" tables refer to the tables that exist in the "Flow" database but mirror the tables in the "API" database.
main
manually defines (possibly copying migration files) the model tables. This is unsustainable as the database is continuously updating. At the time of writing, there are 17 migrations.cp-flow-node-only
introspects the api database for its schema. It uses drizzle to get theenums
. Once these schemas exists, it can usepg_dump
to create a sql file defining the latest schema. Finally,psql
uses this schema definition to create the model tables in the flow database. These definitions are updated on each run, ensuring that always model the latest api database schema.Transforming from source to model tables
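
A minimal sketch of the `pg_dump`-then-`psql` step, driven from TypeScript. The connection strings and file path are placeholders, and the branch's actual commands and options may differ.

```ts
// Sketch: dump the latest API schema and recreate it in the flow database.
import { execFile } from "node:child_process";
import { promisify } from "node:util";

const run = promisify(execFile);

export async function refreshModelTables(apiDbUrl: string, flowDbUrl: string) {
  // Write only the table definitions (no data) to a local sql file.
  await run("pg_dump", ["--schema-only", "--file", "tmp/api-schema.sql", apiDbUrl]);

  // Replay that definition against the flow database to create the model tables.
  await run("psql", ["--file", "tmp/api-schema.sql", flowDbUrl]);
}
```
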

Transforming from source to model tables

`main` and `cp-flow-node-only` both run `sql` files against the database; `main`'s file is one of `cp-flow`'s files. `main` uses psql while `cp-flow` uses `node-postgres`. Using `node-postgres` allows for complex flow control using TypeScript (a familiar language to AE).
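
For a sense of what "flow control in TypeScript" can look like, here is a sketch that runs hand-written transform files inside one transaction; the file handling and transaction wrapping are assumptions for illustration, not the branch's exact logic.

```ts
// Sketch: run transform sql files with node-postgres. Ordering, branching,
// and error handling live in TypeScript rather than in shell scripts.
import { readFile } from "node:fs/promises";
import { Pool } from "pg";

const pool = new Pool();

export async function runTransforms(files: string[]) {
  const client = await pool.connect();
  try {
    await client.query("BEGIN");
    for (const file of files) {
      const sql = await readFile(file, "utf-8");
      await client.query(sql);
    }
    await client.query("COMMIT");
  } catch (error) {
    await client.query("ROLLBACK");
    throw error;
  } finally {
    client.release();
  }
}
```
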
Populating the target database

`main` dumps the flow database using `pg_dump` and then attempts a clean restore to the API database using `pg_restore`. We should avoid this approach because it undermines drizzle as the "source of truth" for the API database. Also, in practice it creates a flurry of errors.

The `--data-only` option of `pg_restore` is also insufficient for our needs. We want to complete the entirety of the data upload in a single transaction, which requires two steps: 1) truncate the target tables and 2) load the data. `pg_restore` does not support doing both in a single transaction.

`cp-flow-node-only` uses psql to run a single sql file that runs the `TRUNCATE` and all of the required `copy` commands in a single transaction.
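
A sketch of that final step, invoked from TypeScript. The file name is a placeholder, and whether the transaction comes from `BEGIN`/`COMMIT` inside the file or from psql's `--single-transaction` flag is an assumption; the sketch uses the flag.

```ts
// Sketch: run the populate script against the API database with psql.
// populate.sql is assumed to contain the TRUNCATE statements followed by
// the copy commands for each target table.
import { execFile } from "node:child_process";
import { promisify } from "node:util";

const run = promisify(execFile);

export async function populateApiDatabase(apiDbUrl: string) {
  // --single-transaction wraps the whole file in one BEGIN/COMMIT, so either
  // every table is truncated and reloaded, or nothing changes.
  await run("psql", ["--single-transaction", "--file", "populate.sql", apiDbUrl]);
}
```
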

Separating into Domains

`main` does not separate the database runs into different domains. As I understand it, `dbt` would be used in future iterations to create different builds for specific domains.

`cp-flow-node-only` mostly uses TypeScript to control the flow based on the desired domain. The domain is passed as a `BUILD` variable when the command is run. (The TypeScript files enforce the use of this variable using Zod.)
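
A sketch of what the Zod enforcement can look like; the specific domain names are placeholders, not the branch's actual values.

```ts
// Sketch: validate the BUILD environment variable before any flow step runs.
import { z } from "zod";

const buildSchema = z.enum(["all", "capital-planning", "zoning"]);

// Throws immediately with a readable error if BUILD is missing or misspelled.
export const build = buildSchema.parse(process.env.BUILD);
```
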

Bonus

There was exploration of Foreign Data Wrappers. However, a foreign data wrapper does not transfer constraints to the inspecting database. Consequently, these constraints could not be checked until the data were being loaded into the actual API database. This is a fatal flaw: we need to check these constraints while transforming the data within the flow database.