Remove Enums and Drizzle-Kit references #71

Merged
6 changes: 1 addition & 5 deletions .gitignore
@@ -16,7 +16,7 @@ logs
# database volume
db-volume

# dbt files
**/.user.yml
dbt_packages/
target/
@@ -29,7 +29,3 @@ data/download/*

data/convert/*
!data/convert/.gitkeep

# drizzle sql migrations
drizzle/migration/meta/*
drizzle/migration/*.sql
42 changes: 13 additions & 29 deletions documentation/design.md
@@ -9,26 +9,18 @@ The data flow coordinates data across a few resources. These resources are:
- The data flow is coordinated from this service. During development, it is run on a developer's machine. In staging and production, it is a GitHub Action Runner.
- Data flow database (Docker Container within the Runner)
- Both versions of the Data flow runner (the GitHub Action and the Local Machine) contain a database within a Docker container.
- It is referred to as the "Flow" database
- API database (Docker container on the Local Machine or Digital Ocean Postgres Cluster)
- This is the database that actually serves the data to applications. Locally, it exists within a Docker container. In live environments, it exists on a Digital Ocean Postgres cluster. There are two steps which are run from the "Flow" database against the "API" database. When running locally, these databases communicate via a Docker network. When run against the Digital Ocean cluster, the connection goes over the internet.
- It is referred to as the "API" or "Target" database
## Steps

The entire data flow contains 10 steps. They are found in the `scripts` section of the [package.json](../package.json). With some exceptions, they follow the format `<tool used>:<targeted resource>:<operation performed>`. In the [flow steps diagram](./diagrams/flow_steps.drawio.png), they are numbered 0 through 9. They are described in greater detail in the list below.
The entire data flow contains 8 steps. They are found in the `scripts` section of the [package.json](../package.json). With some exceptions, they follow the format `<tool used>:<targeted resource>:<operation performed>`. In the [flow steps diagram](./diagrams/flow_steps.drawio.png), they are numbered 1 through 8. They are described in greater detail in the list below.

Steps 1 through 9 can be run with no configuration. They are part of the `flow` command listed in the [README](../README.md#run-the-local-data-flow). Step 0 requires some configuration and is not part of the `flow` command; more context is listed in the step 0 description. Of the remaining steps, they can be run individually using their listed `Command`. This is helpful when a step fails. After fixing the failure, the next steps of the flow can be run without rerunning the steps that already succeeded. In addition to the individual commands, multiple commands can be run together as a `Group`.
These steps are part of the `flow` command listed in the [README](../README.md#run-the-local-data-flow). They can be run individually using their listed `Command`. This is helpful when a step fails. After fixing the failure, the next steps of the flow can be run without rerunning the steps that already succeeded. In addition to the individual commands, multiple commands can be run together as a `Group`.

The available groups are `download`, `configure`, `seed`, and `populate`. `download` will retrieve the source files and convert the shapefiles to csvs. `configure` will install dependencies on the flow database. `seed` will initialize the source tables in the flow database and fill them with source data. `populate` will transform the source data within the flow database and then transfer the transformed data to the target database.
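
For orientation, here is a minimal sketch of how those groups map onto the step commands listed below. It is illustrative only: the real composition lives in the `scripts` section of [package.json](../package.json), and the `configure` command name is a guess, since the command for that step is not shown in this excerpt.

```typescript
import { execSync } from "node:child_process";

// Illustrative only: the groups below mirror the `Group` fields in the step list,
// but the repository wires them up directly as npm scripts rather than through a helper.
const groups: Record<string, string[]> = {
  download: ["minio:download", "shp:convert"],
  configure: ["pg:configure"], // hypothetical name; the configure command is not shown in this excerpt
  seed: ["pg:source:create", "pg:source:load", "db:pg:model:create"],
  populate: ["pg:model:transform", "db:pg:target:populate"],
};

// Run every command in a group, in order, stopping on the first failure.
function runGroup(name: keyof typeof groups): void {
  for (const script of groups[name]) {
    execSync(`npm run ${script}`, { stdio: "inherit" });
  }
}
```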

0) Pull custom types from target database
- Command: `drizzle:api:pull`
- Group: none
- Tool: Drizzle
- Run from: Runner, [drizzle (api config)](../drizzle/api.config.ts)
- Run against: API database
- Description: The API schemas include several custom enum types. These types need to be defined in the data flow database before it can replicate the API schema. Drizzle introspection is performed against the api database, looking only at the custom types. The resulting introspection is saved in the `drizzle/migrations` folder in the `schema.ts` file. There are a few things to note. First, the introspection also produces meta and sql files. However, the data flow does not use these files and they are ignored by source control. Second, Drizzle does not automatically import the `pgEnum` function in the `schema.ts`; any developer that reruns the introspection needs to manually import this method. Third, this step should only need to be run when there are changes to the custom types. Consequently, it is excluded from the `flow` command. Finally, changes to the custom types should happen rarely. When they do happen, they will require updates to the [drizzle/migration/schema.ts](../drizzle/migration/schema.ts) file. These files will need to be committed and integrated into the `main` branch.

1) Download source files from Digital Ocean Spaces
- Command: `minio:download`
- Group: `download`
@@ -38,7 +30,7 @@
- Description: Use the minio node module to download source files from Digital Ocean. Files are saved on the Data flow runner in the [data/download](../data/download/) folder (see the sketch below).
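
A minimal sketch of what this download call might look like, assuming a promise-based version of the minio client; the endpoint, credential variable names, and bucket/object parameters are placeholders rather than the repository's actual configuration.

```typescript
import { Client } from "minio";

// Placeholder endpoint and credentials; the real values come from the flow's environment.
const spaces = new Client({
  endPoint: "nyc3.digitaloceanspaces.com",
  useSSL: true,
  accessKey: process.env.SPACES_ACCESS_KEY ?? "",
  secretKey: process.env.SPACES_SECRET_KEY ?? "",
});

// Download one object from a Spaces bucket into the data/download folder.
// fGetObject writes the object directly to the given local path.
async function downloadSourceFile(bucket: string, objectName: string): Promise<void> {
  await spaces.fGetObject(bucket, objectName, `data/download/${objectName}`);
}
```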

2) Convert shapefiles into csv files
- Command: `shp:convert`
- Group: `download`
- Tool: shapefile.js
- Run from: Data flow runner, [shp/convert.ts](../shp/convert.ts)
@@ -53,51 +45,43 @@
- Run against: Data flow database
- Description: Run a sql command against the data flow database to [activate PostGIS](../pg/configure/configure.sql) (see the sketch below).
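
The repository performs this step with the linked [configure.sql](../pg/configure/configure.sql) file; the sketch below shows an equivalent call through pg.js, with a placeholder connection string.

```typescript
import { Client } from "pg";

// Placeholder connection string; the flow database runs in a local Docker container.
const flowDb = new Client({ connectionString: process.env.FLOW_DATABASE_URL });

// Enable PostGIS so the flow database can hold the geometry columns
// produced from the converted shapefiles.
async function configureFlowDatabase(): Promise<void> {
  await flowDb.connect();
  try {
    await flowDb.query("CREATE EXTENSION IF NOT EXISTS postgis;");
  } finally {
    await flowDb.end();
  }
}
```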

4) Push custom types and enums to the flow database
- Command: `drizzle:flow:push`
- Group: `configure`
- Tool: drizzle
- Run from: Data flow runner, [drizzle (flow config)](../drizzle/flow.config.ts)
- Run against: Flow database
- Description: Push the enums stored in [drizzle/migration/schema.ts](../drizzle/migration/schema.ts)

5) Create tables in flow database to hold source data
4) Create tables in flow database to hold source data
- Command: `pg:source:create`
- Group: `seed`
- Tool: pg.js [pg/source-create](../pg/source-create/create.ts)
- Run from: Data flow runner
- Run against: Flow database
- Description: Run sql commands to [create tables](../pg/source-create/borough.sql) that hold data as they are stored in their source files. The source tables also create constraints that will check data validity as it is copied into the source tables.
  If any source tables already exist, drop them before adding them again (see the sketch below).
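
A hedged sketch of the drop-then-create pattern this step describes; the table, columns, and constraint are invented for illustration, while the real DDL lives in the linked `pg/source-create` sql files.

```typescript
import { Client } from "pg";

// Hypothetical source table; the real definitions live in pg/source-create/*.sql.
// The CHECK constraint shows how validity is enforced as raw data is copied in.
const createBoroughSource = `
  DROP TABLE IF EXISTS source_borough;
  CREATE TABLE source_borough (
    id text PRIMARY KEY,
    title text NOT NULL,
    CONSTRAINT borough_id_is_numeric CHECK (id ~ '^[0-9]+$')
  );
`;

async function createSourceTables(flowDb: Client): Promise<void> {
  // pg executes the two statements above as a single multi-statement query.
  await flowDb.query(createBoroughSource);
}
```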

6) Load source tables with source data
5) Load source tables with source data
- Command: `pg:source:load`
- Group: `seed`
- Tool: pg.js [pg/source-load](../pg/source-load/load.ts)
- Run from: Data flow runner
- Run against: Flow database
- Description: Copy the source data from the `data` folder to the source tables within the flow database (see the sketch below).
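
The real loader is [pg/source-load/load.ts](../pg/source-load/load.ts); one common way to stream a csv into Postgres from node is `pg-copy-streams`, sketched here with hypothetical table and file parameters. It is an assumption that the actual script takes this approach.

```typescript
import { createReadStream } from "node:fs";
import { pipeline } from "node:stream/promises";
import { Client } from "pg";
import { from as copyFrom } from "pg-copy-streams";

// Stream one csv file from the data folder into its source table.
// The table name and csv path are hypothetical parameters.
async function loadSourceTable(flowDb: Client, table: string, csvPath: string): Promise<void> {
  const copyIn = flowDb.query(copyFrom(`COPY ${table} FROM STDIN WITH (FORMAT csv, HEADER true)`));
  await pipeline(createReadStream(csvPath), copyIn);
}
```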

7) Create tables in the flow database that model the api database tables
6) Create tables in the flow database that model the api database tables
- Command: `db:pg:model:create`
- Group: `seed`
- Tool: pg_dump and psql, [db/pg/model-create](../db/pg/model-create/all.sh)
- Run from: Flow database
- Run against: Flow database
- Description: Run `pg_dump` and `psql` from the flow database docker container.
  Use `pg_dump` to extract the API Table Schemas into a `sql` file stored in the flow database docker container.
  Use `psql` to read the `sql` file of the dump into the flow database.
  If any model tables already exist in the flow database, drop them before adding them again (see the sketch below).
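
The actual step is the shell script linked above, run inside the flow database container. Purely as an illustration, an equivalent pair of calls might look like the sketch below, with placeholder connection strings and without any table filtering the real script may apply.

```typescript
import { execFileSync } from "node:child_process";

// Placeholder connection strings; the real script reads its targets from the container environment.
const apiDbUrl = process.env.API_DATABASE_URL ?? "";
const flowDbUrl = process.env.FLOW_DATABASE_URL ?? "";

// Dump only the table definitions (no data) from the API database,
// then replay that schema into the flow database.
function createModelTables(): void {
  const schemaSql = execFileSync("pg_dump", ["--schema-only", apiDbUrl], { encoding: "utf8" });
  execFileSync("psql", [flowDbUrl], { input: schemaSql });
}
```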

8) Transform the source data and insert it into the model tables
7) Transform the source data and insert it into the model tables
- Command: `pg:model:transform`
- Group: `populate`
- Tool: pg.js, [pg/model-transform](../pg/model-transform/transform.ts)
- Run from: Data flow runner
- Run against: Flow database
- Description: Use pg node to run the [`sql` files](../pg/model-transform/capital-planning.sql) that transform the `source` columns into their respective `model` columns. Export the populated `model` tables to `.csv` files within the flow database docker container. Truncate any data that may already exist in the `model` tables before inserting the data again (see the sketch below).
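
A minimal sketch of how a transform file might be executed, assuming each file truncates and repopulates its own model tables as described above; the example path is taken from the capital-planning file linked in the description.

```typescript
import { readFile } from "node:fs/promises";
import { Client } from "pg";

// Run one transform file against the flow database.
async function transformDomain(flowDb: Client, sqlPath: string): Promise<void> {
  const transformSql = await readFile(sqlPath, "utf8");
  // The sql file is expected to truncate its model tables and
  // repopulate them from the source tables.
  await flowDb.query(transformSql);
}

// Example usage with the file referenced above:
// await transformDomain(flowDb, "pg/model-transform/capital-planning.sql");
```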

9) Move the data from the model tables in the flow database to their corresponding target tables in the api database
8) Move the data from the model tables in the flow database to their corresponding target tables in the api database
- Command: `db:pg:target:populate`
- Group: `populate`
- Tool: psql, [db/pg/target-populate](../db/pg/target-populate/populate.sh)
@@ -107,7 +91,7 @@

## Domains

The data flow can be used to either initialize a full data suite or update a portion of the full suite. These portions are grouped into "Domains". The data flow is divided into four Domains: "all", "admin", "capital-planning", and "pluto". They are visualized in the [domains diagram](./diagrams/domains.drawio.png).

The "all" domain contains every other domain plus "Boroughs". The "admin" domain contains administrative boundaries, other than Boroughs. Boroughs are excluded from the "admin" domain because tax lots depend on the Borough Ids existing. Rebuilding Boroughs would require also require rebuilding tax lots.

Binary file modified documentation/diagrams/flow_steps.drawio.png
20 changes: 0 additions & 20 deletions drizzle/api.config.ts

This file was deleted.

20 changes: 0 additions & 20 deletions drizzle/flow.config.ts

This file was deleted.

1 change: 0 additions & 1 deletion drizzle/migration/relations.ts

This file was deleted.

7 changes: 0 additions & 7 deletions drizzle/migration/schema.ts

This file was deleted.
