Roadmap
Warning
This page is stale and will be made more useful soon. We now use the DE Roadmap meeting doc in SharePoint as our running doc during relevant periodic meetings.
This is our long-term vision to improve Data Engineering's infrastructure and operations without major disruptions to our data product releases.
This document guides our project planning and our task-planning GitHub project.
We want data engineering infrastructure that:
- reduces time spent building and maintaining data pipelines
- reduces time spent performing updates and QA of datasets
- standardizes the approaches and code used to build datasets
The data platform we imagine may do the following (a minimal sketch follows this list):
- extract and store unstructured source data
- load source data to a persistent build database
- transform data in the build database to build data products
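To make this concrete, below is a minimal sketch of that extract -> load -> transform flow in Python. Everything in it is an illustrative assumption rather than existing DE infrastructure: the source URL, the file paths, the table names, and the use of SQLite as a stand-in for the persistent build database.

```python
# Hypothetical sketch of the imagined extract -> load -> transform flow.
# The source URL, paths, and table names are assumptions for illustration;
# sqlite3 stands in for whatever persistent build database we end up using.
import csv
import os
import sqlite3
import urllib.request

SOURCE_URL = "https://example.com/source.csv"  # hypothetical source
RAW_PATH = "raw/source.csv"                    # unstructured, as-received storage


def extract() -> None:
    """Pull source data and store it unmodified."""
    os.makedirs(os.path.dirname(RAW_PATH), exist_ok=True)
    urllib.request.urlretrieve(SOURCE_URL, RAW_PATH)


def load(conn: sqlite3.Connection) -> None:
    """Load the raw file into the build database without transforming it."""
    with open(RAW_PATH, newline="") as f:
        header, *rows = list(csv.reader(f))
    cols = ", ".join(f'"{c}" TEXT' for c in header)
    conn.execute(f"CREATE TABLE IF NOT EXISTS source_raw ({cols})")
    placeholders = ", ".join("?" for _ in header)
    conn.executemany(f"INSERT INTO source_raw VALUES ({placeholders})", rows)


def transform(conn: sqlite3.Connection) -> None:
    """Build the data product inside the build database (SQL-first)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS product AS "
        "SELECT * FROM source_raw  -- real transforms would live here"
    )


if __name__ == "__main__":
    with sqlite3.connect("build.db") as conn:
        extract()
        load(conn)
        transform(conn)
```

The property we care about is that raw source data is archived untouched, and all transformation happens inside the build database, where intermediate results can be inspected and tested after a build.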
Deprecated Roadmap notes
This is a long-term vision to improve Data Engineering's infrastructure and operations without major disruptions to our data product releases.
These ideas and notes inform some of the tasks in our team project.
- Our current data engineering infrastructure
  - extracts and transforms source data before storing it as a SQL dump file
  - often uses a temporary database to build a dataset
  - relies heavily on long Bash scripts which call Python and SQL files
  - is spread across dozens of repos
- Pros
  - rigorous versioning of archived source data
  - flexibility/stability of isolated build processes
  - Python and SQL are mature, popular languages
  - already very cloud-based
- Cons
  - maintenance of many repos
  - cognitive costs of inconsistency across pipelines
  - difficult to test pipelines before production runs
  - long Bash scripts
  - post-build questions are hard to answer
  - variety of dataset handoff processes to GIS, OSE, Capital Planning, etc.
  - lack of data lineage and end-to-end testing
We want data engineering infrastructure that:
- reduces time spent building and maintaining data pipelines
- reduces time spent performing updates and QA of datasets
- standardizes the approaches and code used to build datasets
The data platform we imagine may do the following:
- extract and store unstructured source data
- load source data to a persistent build database
- transform data in the build database to build data products
So far, our ideas for changes and features fall into these general buckets of work:
- refactor of data products
  - Bash -> Python
  - reduce inconsistencies across data products using dbt, etc.
- expansion of QA tools
  - especially source data QA, though that is possibly blocked by the reworking of data library
- design of our "data platform"
  - including the build engine, prod/dev environments, and maybe data storage architecture in general
- rework or replacement of data library
- use of orchestration and compute resources
  - e.g. Airflow (a sketch of what this could look like follows this list)
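As a hypothetical illustration of the orchestration bucket, here is what one product build could look like as an Airflow DAG (assuming a recent Airflow 2.x and its TaskFlow API). The DAG id, schedule, and task bodies are placeholders, not an existing DE pipeline.

```python
# Hypothetical sketch only: one product build expressed as an Airflow 2.x DAG
# using the TaskFlow API. DAG id, schedule, and task bodies are placeholders.
from datetime import datetime

from airflow.decorators import dag, task


@dag(schedule=None, start_date=datetime(2024, 1, 1), catchup=False, tags=["example"])
def example_product_build():
    @task
    def extract():
        """Pull source data into raw storage."""
        ...

    @task
    def load():
        """Copy raw data into the build database."""
        ...

    @task
    def transform():
        """Run the SQL/dbt transforms that build the product."""
        ...

    extract() >> load() >> transform()


example_product_build()
```

Whether we adopt Airflow specifically is still an open question; the point is that each build becomes a declared, observable sequence of tasks rather than one long Bash script.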
Some things we've discussed recently and likely want to do
- for each data product, list a primary data engineer
- reclaim the term "recipe" to be a data product's build instructions
- start a Data Catalog to list and detail our data products
- ensure Data Dictionaries are defined before builds and used as inputs to them
- start the mono repo
- build all primary data products in the mono repo
- use some amount of common code in all builds
- standardize the build output folder structure in DigitalOcean
- implement a Build, QA, Publish workflow for data products
- use the QA app to inspect source data
- Celebrate! 🎊
- build an MVP build database
- ...
- implement a standardized Extract and Load process
- implement a standardized Transform process
- Celebrate! 🎊
- ...
- Celebrate! 🎊
- Celebrate! 🎊
Whiteboard pictures
- Current state
- Desired state
- Output folder structure