
dcpy is our internal python package. While we have longer-term hopes to make this a publicly available package, for now it is our product-agnostic code used for a variety of purposes. It contains the following submodules.

test

This is code to be run with pytest. Its organization mirrors the overall structure of dcpy.

models

We strive for type safety in our python code, and an important step in this is creating classes/objects in python that represent discrete practical entities. Further, we often want to define these entities in structured yaml or json. For this reason, we rely heavily on pydantic. Pydantic's main stated purpose is data validation in python - the ability to read in data from json or yaml, validate it at parse time, and then (assuming the data was valid) provide type-safe objects to use in code. These classes can also have methods and computed attributes defined on them, making it easy to go from a json definition of a dataset in edm-recipes ({"name": "bpl_libraries", "version": "20240609"}) to an S3 key ("datasets/bpl_libraries/20240609/bpl_libraries.parquet").
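As an illustration (a minimal sketch, not the actual dcpy model - the class and property names here are hypothetical), a pydantic model along these lines can parse that json and derive the S3 key:

```python
from pydantic import BaseModel


class RecipeDataset(BaseModel):
    """Hypothetical model of a dataset reference in edm-recipes."""

    name: str
    version: str

    @property
    def s3_key(self) -> str:
        # e.g. "datasets/bpl_libraries/20240609/bpl_libraries.parquet"
        return f"datasets/{self.name}/{self.version}/{self.name}.parquet"


dataset = RecipeDataset.model_validate({"name": "bpl_libraries", "version": "20240609"})
print(dataset.s3_key)  # datasets/bpl_libraries/20240609/bpl_libraries.parquet
```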

Models are organized by domain - parts of the product lifecycle (plan, builds, packaging), data definitions for connectors to external apis, or more purely conceptual domains (geospatial). Some of the layout of models therefore mirrors the folder structure of the rest of dcpy, while some does not map 1-1.

Outside of class/object methods, no code should live in models. Models is meant to be one of the more "base" submodules of dcpy - it should not depend on any other submodule, and with this design, the various submodules of dcpy can have knowledge of all of the defined entities that exist within dcpy (in models) without any circular references.

utils

These are meant to be relatively pure utilities. This is maybe slightly too broad a category at the moment, with too many top-level files, but in general most files are representative of what utilities should look like - simple, atomic functions with no concept of business logic.
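For example (a hypothetical illustration, not a function that exists in utils), a utility in this spirit is a small, self-contained helper with no knowledge of products or business rules:

```python
from pathlib import Path


def file_size_mb(path: Path) -> float:
    """Hypothetical utility: return a file's size in megabytes."""
    return path.stat().st_size / 1_000_000
```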

connectors

The submodules within connectors are meant to represent entities outside of dcpy with which we want to interact. Some are our own resources - edm.recipes and edm.publishing both have their apis here, for getting data, publishing built data products, etc. There are also third-party connectors: Socrata, the tool/api underlying NYC OpenData, and ArcGIS Online, from which we pull data from several servers. There could be some work to align functionality between the different connectors - each having a concept of a "dataset" with a "download" function, or something along those lines - but for now the code for each is a bit more specific to that connector. These are also supposed to be largely free of business logic (at least the external connectors), and operations should be relatively atomic: get a dataset, push a dataset, etc.
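To illustrate what that alignment might look like (a sketch only - the names and methods below are hypothetical, not the current connector apis), a shared interface could be expressed as a Protocol:

```python
from pathlib import Path
from typing import Protocol


class Connector(Protocol):
    """Hypothetical shared interface for connectors; not the current dcpy api."""

    def download(self, dataset: str, version: str, destination: Path) -> Path:
        """Pull one dataset/version from the external resource to a local path."""
        ...

    def push(self, dataset: str, version: str, source: Path) -> None:
        """Publish a local file as a dataset/version on the external resource."""
        ...
```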

library

A tool we will (hopefully) soon be deprecating, data library was the original tool/cli we built and used to extract and archive source data from a variety of sources (declaring definitions of these datasets as yaml templates, and archiving the ingested datasets in [edm-recipes](https://github.com/NYCPlanning/data-engineering/wiki/Cloud-Infrastructure#storage)). We recently worked on building a replacement for library (lifecycle.ingest), which streamlines code, makes preprocessing simpler to run and qa, and will be easier to build upon and customize as we begin to do more work on QA of source data.

Essentially, library takes a yaml template and an optionally supplied version. Based on the information in the template file, a dataset is pulled, opened in memory by gdal, and dumped in a desired format. Historically we've used pg_dumps, but we are moving towards parquet (or, when applicable, GeoParquet). During this transformation from input data (json, shapefile, csv, etc.) to pg_dump or another output format, various operations can also be performed - often we reproject geospatial datasets, specify that empty strings be read in as null, or similar.
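The rough shape of that transformation, sketched with geopandas rather than library's actual gdal-based internals (the file names and target projection here are just examples):

```python
import geopandas as gpd

# Read the raw source (shapefile, geojson, csv with geometry, etc.) into memory.
raw = gpd.read_file("bpl_libraries.shp")

# One common preprocessing step: reproject to a desired coordinate system.
reprojected = raw.to_crs("EPSG:2263")

# Dump in the desired output format - here, GeoParquet.
reprojected.to_parquet("bpl_libraries.parquet")
```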

lifecycle

All code related to actually running the higher-level processes within the lifecycle of a product. Within lifecycle, we have

ingest

The replacement for library. run.py holds the logic of what actually happens going from a template to an archived dataset, much as in library. In it, we (see the sketch after this list)

  • read in a template file and "compute" certain aspects of it - for example, looking up the version of the data's source, which provides a mechanism for versioning
  • download the raw data
  • archive this raw data
  • convert the raw data to parquet
  • run any defined preprocessing steps for the dataset, including spatial reprojection
  • archive the processed dataset
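
A compressed sketch of that flow (the Template model, file layout, and reprojection target below are all illustrative assumptions, not the actual logic in run.py):

```python
import urllib.request
from pathlib import Path

import geopandas as gpd
import yaml
from pydantic import BaseModel


class Template(BaseModel):
    """Hypothetical, heavily simplified ingest template."""

    name: str
    source_url: str
    version: str


def run(template_path: Path, archive_dir: Path) -> Path:
    # Read the yaml template; a real template would also have "computed" fields
    # (e.g. a looked-up source version) resolved at this point.
    template = Template.model_validate(yaml.safe_load(template_path.read_text()))

    # Download the raw data and archive it as-is.
    raw_dir = archive_dir / template.name / template.version / "raw"
    raw_dir.mkdir(parents=True, exist_ok=True)
    raw_file = raw_dir / Path(template.source_url).name
    urllib.request.urlretrieve(template.source_url, raw_file)

    # Convert the raw data to parquet, running preprocessing along the way
    # (here, just a reprojection as an example).
    gdf = gpd.read_file(raw_file).to_crs("EPSG:2263")

    # Archive the processed dataset.
    processed_file = archive_dir / template.name / template.version / f"{template.name}.parquet"
    gdf.to_parquet(processed_file)
    return processed_file
```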

builds

Would prefer "build" but that's a special folder in python, so the tense does not match, alas. All python logic related to a product build is in here. For now, that's really setting up a build.

  • "planning" a build from its definition in its recipe yaml file
  • loading source datasets into our build Postgres database (see the sketch after this list). The actual running of the transformations (be they python, bash/postgres, or dbt/postgres) and the export of build outputs are not governed by dcpy code yet; for now, that logic is defined in the product folders.
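
As an illustration of that loading step (a sketch only - the connection string, table name, and file path are placeholders, not the actual dcpy code):

```python
import geopandas as gpd
from sqlalchemy import create_engine

# Placeholder connection string for the build database.
engine = create_engine("postgresql+psycopg2://user:password@localhost:5432/build_db")

# Read an archived source dataset and load it into the build database.
gdf = gpd.read_parquet("bpl_libraries.parquet")
gdf.to_postgis("bpl_libraries", engine, if_exists="replace")
```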

package

Our data products have different expectations depending on how we distribute them. This is a rapidly evolving section of our codebase, so this will be a bit of a stub for now. But essentially, the idea here is code that takes the metadata and outputs of a build and, as the name implies, "packages" them up for distribution - assembling metadata, adding annotations/attachments as necessary, etc.

distribute

Also a relatively new and quickly evolving section of dcpy, this contains code related to actually distributing our exports to external destinations. Our main focus has been Socrata/OpenData, where much of our data is currently loaded manually rather than programmatically.