# Dagster Power User

This repo will parallel the Indicium Engineering blog series Dagster Power User.

The goal is that, at the end of the series, we have a simple end-to-end data engineering project with Dagster + embedded-elt + dbt, from deployment to the data marts.

## Setup

- Create and activate a Python virtualenv, install the dependencies, and start the Dagster dev server for the EL code location:

  ```bash
  python3 -m venv venv
  source venv/bin/activate
  pip install -r requirements.txt
  dagster dev -m definitions -d el_code_location
  ```

- For visualizing the dbt assets, run:

  ```bash
  dagster dev -m definitions -d dbt_code_location
  ```

- Before proceeding to the actual deployment, create an S3 bucket (name suggestion: `dagster-ecs-poc-support-bucket`) and upload the `sap_adventure_works` folder to its root (see the sketch after this list).
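A minimal sketch of that last step with the AWS CLI, assuming the suggested bucket name and your default profile/region:

```bash
# Assumes the suggested bucket name and your default AWS CLI profile/region
aws s3 mb s3://dagster-ecs-poc-support-bucket
aws s3 cp sap_adventure_works s3://dagster-ecs-poc-support-bucket/sap_adventure_works --recursive
```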

## Deployment

- Install the prerequisites: Terraform, the AWS CLI, and Docker.
- Create the files `<el|dbt>_code_location/rsa_key.p8` following the Snowflake Guide to Private Key Auth (see the sketch after this list). For simplicity, this demo does not encrypt the key file with a passphrase, but you should of course do so if you use this auth method in production.
- Create a `.env` file based on `.env.example`.
- Create a `terraform.tfvars` file based on `infra/terraform.tfvars.example`.
- Run `chmod -R +x scripts`.
- If needed, set an appropriate bucket and region for storing the Terraform state in the `config.s3.tfbackend` file (see the sketch after this list). Then, do one of the following before applying new Terraform plans:
  - Migrate the Terraform state (recommended): run `./scripts/migrate.sh <stack>`.
  - Reset: if the previous state is no longer available, run `./scripts/reset.sh <stack>`. For a definition of what a stack means, see the Usage section below.
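A minimal sketch of the key creation, assuming the unencrypted variant used in this demo (the Snowflake guide covers the passphrase-protected version):

```bash
# Sketch: generate an unencrypted key pair (no passphrase, as in this demo)
# and place the private key in both code locations
openssl genrsa 2048 | openssl pkcs8 -topk8 -inform PEM -out rsa_key.p8 -nocrypt
openssl rsa -in rsa_key.p8 -pubout -out rsa_key.pub
cp rsa_key.p8 el_code_location/rsa_key.p8
cp rsa_key.p8 dbt_code_location/rsa_key.p8
# Register the public key on the Snowflake user (strip the PEM header/footer when pasting):
#   ALTER USER <your_user> SET RSA_PUBLIC_KEY='MIIB...';
```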
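If you do customize the state backend, the `config.s3.tfbackend` contents are just the S3 backend settings. An illustrative sketch, where the bucket, region, and file path are placeholders to adapt to your environment:

```bash
# Illustrative only: bucket and region are placeholders; the file path assumes it lives under infra/
cat > infra/config.s3.tfbackend <<'EOF'
bucket = "my-terraform-state-bucket"
region = "us-east-1"
EOF
```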

Note: we did not bother with some warnings about unused Terraform variables, as they would not be an issue in a real-world scenario following production deployment best practices.

## Usage

We have four stacks under the `infra` directory: `base`, `cluster`, `locations`, and `dagster`. To operate on a given stack, we provide the convenience scripts:

- Deployment: `bash scripts/deploy.sh <stack>`
- Retraction: `bash scripts/retract.sh <stack>`

Therefore, we have the following:

### Full Deployment

```bash
bash scripts/deploy.sh base
bash scripts/deploy.sh cluster
bash scripts/deploy.sh locations
bash scripts/deploy.sh dagster
```

OR

```bash
make deploy
```

Note: if you keep getting a DNS error in the Dagster UI even after all infra modules have deployed successfully, restart the core services. For instance, assuming you are using the standard project name (i.e. `dagster-ecs-poc`):

```bash
aws ecs update-service --force-new-deployment --service dagster-daemon --cluster dagster-ecs-poc-cluster
aws ecs update-service --force-new-deployment --service dagster-webserver --cluster dagster-ecs-poc-cluster
```

OR

```bash
make restart
```

Then, refresh the page and you should be ready to go!

### Full Retraction

```bash
bash scripts/retract.sh dagster
bash scripts/retract.sh locations
bash scripts/retract.sh cluster
bash scripts/retract.sh base
```

OR

```bash
make retract
```

## Production Deployment Tips

For simplicity, we are developing with a single repo for all assets. A better practice for production deployments, at least in contexts similar to this project, would be to split the work across three repos, for example one for the infrastructure and one for each code location.

## Issues

There have been intermittent issues with the deployment of the `dagster` stack, but re-running `bash scripts/deploy.sh dagster` has been sufficient to resolve them.

We recommend using appropriate CI/CD workflows for deploying ECR images and managing ECS tasks, for instance the AWS Deploy ECS Task GitHub Action.