Current state of the data flow #56
TangoYankee
started this conversation in General
@TangoYankee I've finally had a chance to digest this and would like to move it forward. Are one or both of the open draft PRs ready for a proper review? I don't think I'll have much feedback on a line-by-line scale, but I feel good about the broad strokes outlined above. Moving to a proper dependency management system (in our case, …

Description

Updating the Data Flow for the Capital Planning Project highlighted pain points in using and maintaining it. Several branches have made different changes in pursuit of alleviating these pain points. The most recent iteration is `cp-flow-node-only`. The next several sections highlight the steps of the data flow, the associated pain points, and how the pain points are addressed in the latest iteration. We do not need to accept the whole of the changes. We can identify which parts make sense and use them on their own, leaving behind the parts that do not.

Configuring the architecture

On `main`, the data flow is run in persistent or semi-persistent databases. In live environments, there is a dedicated Digital Ocean Managed Database instance for each build environment. During local development, there is a database volume. The Digital Ocean Managed Databases create a maintenance burden, adding three databases that need to be updated. Both the local and live databases add complexity to managing the state of the database: if a failed build leaves the database in a "bad" state, the developer needs to perform several steps to wipe the state and start over.

`cp-flow-node-only` changes both the local and live environments. First, the GitHub action uses a docker "service" to create a dockerized database on each run. This database is ephemeral and disappears at the end of each build. We can make this change because we expect that any data flows running in the GitHub action have already been tested locally; they will likely succeed by the time they are on `main`. Getting rid of the database at the end of the build means 1) we do not need to maintain the database and 2) each run is guaranteed to be independent of previous runs. Second, the local environment removes the volume. If state needs to be temporarily saved across development sessions, it can be preserved with `docker compose stop`. If it needs to be wiped for a clean run, this can be done with `docker compose down`.
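
One reason the ephemeral database is cheap to adopt: the flow code itself can be indifferent to where the database lives, as long as connection details come from the environment. Below is a minimal sketch of that idea; the `FLOW_DB_*` variable names are assumptions for illustration, not the branch's actual configuration.

```ts
// Sketch (assumed variable names): the same pool works against the ephemeral
// CI service container or the local docker compose container, because nothing
// here depends on persistent infrastructure.
import { Pool } from "pg";

export const flowDb = new Pool({
  host: process.env.FLOW_DB_HOST ?? "localhost",
  port: Number(process.env.FLOW_DB_PORT ?? 5432),
  user: process.env.FLOW_DB_USER,
  password: process.env.FLOW_DB_PASSWORD,
  database: process.env.FLOW_DB_NAME,
});
```
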
Installing dependencies

On `main`, the data flow requires the installation of several programs, including MinIO, python/pyenv, and postgres. Though not included on `main`, development of the capital projects map also required handling shapefiles using the `ogr2ogr` tool from `gdal`. Each of these dependencies needs to be managed and updated independently. They also need a corresponding but distinct installation instruction in the GitHub runner. This creates a burden on developers, especially those who develop across different machines and operating systems.

On the `cp-flow-node-only` branch, dependencies are managed either through `npm` or docker. `pg_dump` and `psql` come with the postgres docker image, which we need anyway for the flow database, so no additional installation is required to access them. The remaining `minio`, `node-postgres`, `shapefile`, and `drizzle` packages are managed using `npm`. This is a familiar workflow for application engineering, as (almost) all other AE repos rely on `npm`.

Downloading data

`main` and `cp-flow-node-only` both rely on MinIO. While `main` uses a direct install/CLI tool, `cp-flow-node-only` uses a node module. Using the MinIO node module allows us to 1) write the loading logic in TypeScript and 2) load asynchronously.
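
As a rough illustration of the node-module approach (a sketch only; the endpoint, credentials, bucket, and object names are placeholders), the `minio` package exposes a promise-based client, which is what makes concurrent downloads straightforward:

```ts
// Sketch: placeholder connection details, bucket, and object names.
import { Client } from "minio";

const minioClient = new Client({
  endPoint: process.env.MINIO_ENDPOINT ?? "localhost",
  port: 9000,
  useSSL: false,
  accessKey: process.env.MINIO_ACCESS_KEY ?? "",
  secretKey: process.env.MINIO_SECRET_KEY ?? "",
});

// Because the client is promise-based, several source files can be
// downloaded concurrently instead of one at a time.
export async function downloadSources() {
  await Promise.all([
    minioClient.fGetObject("source-bucket", "borough.csv", "tmp/borough.csv"),
    minioClient.fGetObject("source-bucket", "capital_project.csv", "tmp/capital_project.csv"),
  ]);
}
```
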

Defining the source tables

On `main`, source tables are defined manually by writing the tables in SQL. However, not every data source is loaded into an intermediate "source table". For example, "borough" is loaded directly from the source file into the "borough" table.

The `cp-flow-node-only` branch still manually defines source tables. The significant difference is that it makes every data source go into an intermediate source table, even boroughs.
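
For illustration only, a hand-written source table might look like the following, executed against the flow database with `node-postgres`. The table and column names are assumptions, not copied from the branch.

```ts
// Sketch: an intermediate source table for boroughs (illustrative names).
import { Pool } from "pg";

const pool = new Pool(); // connection settings come from the standard PG* environment variables

export async function createBoroughSourceTable() {
  await pool.query(`
    CREATE TABLE IF NOT EXISTS source_borough (
      id TEXT,
      title TEXT
    );
  `);
}
```
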

Validating the source tables

On `main`, dbt is used to apply tests to select columns. Dbt allows for the reuse of these tests.

The `cp-flow-node-only` branch relies on "check constraints" applied directly to the source table columns. These check constraints are not as reusable. However, we haven't seen much duplication in tests, and relying on SQL itself simplifies the source table configuration.
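
Continuing the illustrative `source_borough` sketch from above, validation becomes part of the table definition itself. The constraint names and rules here are assumptions, not the branch's actual checks.

```ts
// Sketch: validation expressed as check constraints (illustrative rules).
import { Pool } from "pg";

const pool = new Pool();

export async function addBoroughChecks() {
  await pool.query(`
    ALTER TABLE source_borough
      ADD CONSTRAINT source_borough_id_present CHECK (id IS NOT NULL),
      ADD CONSTRAINT source_borough_id_valid CHECK (id ~ '^[1-5]$');
  `);
}
```
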

Loading data into the source tables

Both `main` and `cp-flow-node-only` rely on pg copies. However, `main` relies on psql and `cp-flow-node-only` relies on `node-postgres`. `node-postgres` allows us to 1) write the loading logic in TypeScript and 2) load asynchronously.
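
One way to run a COPY through `node-postgres` is the `pg-copy-streams` helper; whether the branch uses that exact package or another mechanism is an assumption. The sketch below streams a local CSV into the illustrative `source_borough` table.

```ts
// Sketch: one possible COPY implementation with node-postgres.
// The table, columns, and file are illustrative.
import { createReadStream } from "node:fs";
import { pipeline } from "node:stream/promises";
import { Pool } from "pg";
import { from as copyFrom } from "pg-copy-streams";

const pool = new Pool();

export async function loadBoroughs(csvPath: string) {
  const client = await pool.connect();
  try {
    const copyStream = client.query(
      copyFrom("COPY source_borough (id, title) FROM STDIN WITH (FORMAT csv, HEADER true)"),
    );
    await pipeline(createReadStream(csvPath), copyStream);
  } finally {
    client.release();
  }
}
```
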
The "model" tables refer to the tables that exist in the "Flow" database but mirror the tables in the "API" database.
main
manually defines (possibly copying migration files) the model tables. This is unsustainable as the database is continuously updating. At the time of writing, there are 17 migrations.cp-flow-node-only
introspects the api database for its schema. It uses drizzle to get theenums
. Once these schemas exists, it can usepg_dump
to create a sql file defining the latest schema. Finally,psql
uses this schema definition to create the model tables in the flow database. These definitions are updated on each run, ensuring that always model the latest api database schema.Transforming from source to model tables
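
A minimal sketch of the `pg_dump`-then-`psql` step, driven from TypeScript. The connection strings and file path are placeholders, and the branch's actual commands and options may differ.

```ts
// Sketch: dump the latest API schema and recreate it in the flow database.
import { execFile } from "node:child_process";
import { promisify } from "node:util";

const run = promisify(execFile);

export async function refreshModelTables(apiDbUrl: string, flowDbUrl: string) {
  // Write only the table definitions (no data) to a local sql file.
  await run("pg_dump", ["--schema-only", "--file", "tmp/api-schema.sql", apiDbUrl]);

  // Replay that definition against the flow database to create the model tables.
  await run("psql", ["--file", "tmp/api-schema.sql", flowDbUrl]);
}
```
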

Transforming from source to model tables

`main` and `cp-flow-node-only` both run `sql` files against the database; `main`'s file is one of `cp-flow`'s files. `main` uses psql while `cp-flow` uses `node-postgres`. Using `node-postgres` allows for complex flow control using TypeScript (a familiar language to AE).
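
For a sense of what "flow control in TypeScript" can look like, here is a sketch that runs hand-written transform files inside one transaction; the file handling and transaction wrapping are assumptions for illustration, not the branch's exact logic.

```ts
// Sketch: run transform sql files with node-postgres. Ordering, branching,
// and error handling live in TypeScript rather than in shell scripts.
import { readFile } from "node:fs/promises";
import { Pool } from "pg";

const pool = new Pool();

export async function runTransforms(files: string[]) {
  const client = await pool.connect();
  try {
    await client.query("BEGIN");
    for (const file of files) {
      const sql = await readFile(file, "utf-8");
      await client.query(sql);
    }
    await client.query("COMMIT");
  } catch (error) {
    await client.query("ROLLBACK");
    throw error;
  } finally {
    client.release();
  }
}
```
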
Populating the target database

`main` dumps the flow database using `pg_dump` and then attempts a clean restore to the API database using `pg_restore`. We should avoid this approach because it undermines drizzle as the "source of truth" for the API database. Also, in practice it creates a flurry of errors.

The `--data-only` option of `pg_restore` is also insufficient for our needs. We want to complete the entirety of the data upload in a single transaction, which requires two steps: 1) truncate the target tables and 2) load the data. `pg_restore` does not support doing both in a single transaction.

`cp-flow-node-only` uses psql to run a single sql file that runs the `TRUNCATE` and all of the required `copy` commands in a single transaction.
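
A sketch of that final step, invoked from TypeScript. The file name is a placeholder, and whether the transaction comes from `BEGIN`/`COMMIT` inside the file or from psql's `--single-transaction` flag is an assumption; the sketch uses the flag.

```ts
// Sketch: run the populate script against the API database with psql.
// populate.sql is assumed to contain the TRUNCATE statements followed by
// the copy commands for each target table.
import { execFile } from "node:child_process";
import { promisify } from "node:util";

const run = promisify(execFile);

export async function populateApiDatabase(apiDbUrl: string) {
  // --single-transaction wraps the whole file in one BEGIN/COMMIT, so either
  // every table is truncated and reloaded, or nothing changes.
  await run("psql", ["--single-transaction", "--file", "populate.sql", apiDbUrl]);
}
```
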

Separating into Domains

`main` does not separate the database runs into different domains. As I understand it, `dbt` would be used in future iterations to create different builds for specific domains.

`cp-flow-node-only` mostly uses TypeScript to control the flow based on the desired domain. The domain is passed as a `BUILD` variable when the command is run. (The TypeScript files enforce the use of this variable using Zod.)
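
A sketch of what the Zod enforcement can look like; the specific domain names are placeholders, not the branch's actual values.

```ts
// Sketch: validate the BUILD environment variable before any flow step runs.
import { z } from "zod";

const buildSchema = z.enum(["all", "capital-planning", "zoning"]);

// Throws immediately with a readable error if BUILD is missing or misspelled.
export const build = buildSchema.parse(process.env.BUILD);
```
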

Bonus

There was exploration of Foreign Data Wrappers. However, a foreign data wrapper does not transfer constraints to the inspecting database. Consequently, these constraints could not be checked until the data were being loaded into the actual API database. This is a fatal flaw: we need to check these constraints while transforming the data within the flow database.