Data Vault powered by dbtVault and Greenplum

Assignment TODO

⚠️ Attention! Always delete resources after you finish your work!

Configure Developer Environment

You have several options for setting up the environment:

Start with GitHub Codespaces

GitHub Codespaces

Use devcontainer (locally)

  1. Install Docker on your local machine.

  2. Install devcontainer CLI:

    Open the command palette (CMD + SHIFT + P) and type Install devcontainer CLI

  3. Next, build and open the dev container:

    # build dev container
    devcontainer build .
    
    # open dev container
    devcontainer open .

Verify that you are in the development container by running these commands:

terraform -v

yc --version

dbt --version

If any of these commands fails to print its software version, you are probably running it on your local machine and not in the dev container!
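The three checks above can be combined into one pass. The following is a minimal sketch, not part of the original project: `missing_tools` is a hypothetical helper name, and it only tests whether each command is on `PATH`, not which version is installed.

```shell
# missing_tools NAME...: print each command that is not found on PATH
# (hypothetical helper, introduced here for illustration only)
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Check the three tools this README relies on
missing=$(missing_tools terraform yc dbt)
if [ -n "$missing" ]; then
  echo "Not found (probably outside the dev container):" $missing
else
  echo "All tools found"
fi
```

An empty result suggests the shell is running inside the dev container; any listed name points to a missing tool.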

Deploy Infrastructure

  1. Get familiar with Managed Service for Greenplum

  2. Install and configure yc CLI: Getting started with the command-line interface by Yandex Cloud

    yc init

  3. Populate .env file

    .env is used to store secrets as environment variables.

    Copy template file .env.template to .env file:

    cp .env.template .env

    Open file in editor and set your own values.

    ❗️ Never commit secrets to git

  4. Set environment variables:

    export YC_TOKEN=$(yc iam create-token)
    export YC_CLOUD_ID=$(yc config get cloud-id)
    export YC_FOLDER_ID=$(yc config get folder-id)
    export TF_VAR_folder_id=$(yc config get folder-id)
    export $(xargs <.env)

  5. Deploy using Terraform

    Configure YC Terraform provider:

    cp terraformrc ~/.terraformrc
    terraform init
    terraform validate
    terraform fmt
    terraform plan
    terraform apply

    Store Terraform output values as environment variables:

    export DBT_HOST=$(terraform output -raw greenplum_host_fqdn)
    export DBT_USER='greenplum'
    export DBT_PASSWORD=${TF_VAR_greenplum_password}
    export S3_ACCESSKEY=$(terraform output -raw access_key)
    export S3_SECRETKEY=$(terraform output -raw secret_key)

    [EN] Reference: Getting started with Terraform by Yandex Cloud

    [RU] Reference: Getting started with Terraform by Yandex Cloud (Russian-language version)

  6. Alternatively, deploy using yc CLI

    Checklist:

    yc managed-greenplum cluster create gp_datavault \
    --network-name default \
    --zone-id ru-central1-a \
    --environment prestable \
    --master-host-count 2 \
    --segment-host-count 2 \
    --master-config resource-id=s3-c2-m8,disk-size=30,disk-type=network-ssd \
    --segment-config resource-id=s3-c2-m8,disk-size=30,disk-type=network-ssd \
    --segment-in-host 1 \
    --user-name greenplum \
    --user-password $TF_VAR_greenplum_password \
    --greenplum-version 6.22 \
    --assign-public-ip
    
    yc vpc gateway create --name gp-gateway
    yc vpc route-table create --name=gp-route-table --network-name=default --route destination=0.0.0.0/0,gateway-id=<gateway_id>
    yc vpc subnet update <subnet_name> --route-table-name=gp-route-table
    
    yc managed-greenplum hosts list master --cluster-name gp_datavault
    
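    # Replace the <...> placeholders with the master host FQDN listed above and your storage keys
    export DBT_HOST=<master_host_fqdn>
    export DBT_USER=greenplum
    export DBT_PASSWORD=$TF_VAR_greenplum_password
    export S3_ACCESSKEY=<access_key>
    export S3_SECRETKEY=<secret_key>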
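Step 4 above loads `.env` with `export $(xargs <.env)`, which can misbehave when the file contains comment lines. One possible alternative is sketched below. It assumes that `.env` contains only valid shell assignments (values with spaces must be quoted), and `load_env` is a hypothetical helper name, not something the project provides.

```shell
# load_env FILE: export every KEY=value assignment found in FILE
# (hypothetical helper; assumes FILE is valid shell syntax)
load_env() {
  [ -f "$1" ] || { echo "no such file: $1" >&2; return 1; }
  # Prefix bare file names with ./ so that "." does not search PATH
  case $1 in
    */*) env_file=$1 ;;
    *)   env_file=./$1 ;;
  esac
  set -a            # auto-export every variable assigned from here on
  . "$env_file"     # comment lines and blank lines are skipped by the shell
  set +a
}

# Load .env from the current directory when it exists
if [ -f .env ]; then
  load_env .env
fi
```

Because the file is sourced, it is executed as shell code, so this approach is only appropriate for a `.env` file you wrote yourself.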

Check database connection

Configure JDBC (DBeaver) connection:

DBeaver + Greenplum

Make sure dbt can connect to your target database:

dbt debug

dbt + Greenplum connection

If there are any errors, check that the environment variables are set:

env | grep DBT_
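Note that `env | grep DBT_` also prints the password value. As a sketch of a check that reports only the names of missing variables, assuming the five variables exported in the deployment steps are the ones needed (`unset_vars` is a hypothetical helper name):

```shell
# unset_vars NAME...: print the names that are unset or empty,
# without echoing any values (hypothetical helper)
unset_vars() {
  for name in "$@"; do
    eval "value=\${$name:-}"
    [ -n "$value" ] || printf '%s\n' "$name"
  done
}

# No output means every variable below has a non-empty value
unset_vars DBT_HOST DBT_USER DBT_PASSWORD S3_ACCESSKEY S3_SECRETKEY
```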

Populate Data Vault day-by-day

  1. Initialize data sources (external tables):

dbt run-operation init_s3_sources

  2. Install packages:

dbt deps

  3. Run models step-by-step

Load one day to Data Vault structures:

dbt run -m tag:raw
dbt run -m tag:stage

dbt run -m tag:hub
dbt run -m tag:link
dbt run -m tag:satellite
dbt run -m tag:t_link

  4. Load next day

Simulate the next day's load by incrementing the load_date variable:

# dbt_project.yml

vars:
  load_date: '1992-01-02' # increment by one day

Then update the Data Vault:

dbt build
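Instead of editing the file for every simulated day, the date can be computed and passed on the command line, since dbt accepts a `--vars` option that overrides project variables. The sketch below is one way to do this and makes two assumptions: GNU `date` is available (as in a typical Linux dev container), and the helper names `next_day` and `load_day` are hypothetical. Setting `DBT=echo` previews the commands without running dbt.

```shell
# Command used to invoke dbt; set DBT=echo for a dry run
DBT=${DBT:-dbt}

# next_day YYYY-MM-DD: print the following calendar day (assumes GNU date)
next_day() {
  date -u -d "$1 +1 day" +%F
}

# load_day YYYY-MM-DD: run the tagged models in the order listed above,
# overriding load_date for this invocation only
load_day() {
  for tag in raw stage hub link satellite t_link; do
    "$DBT" run -m "tag:$tag" --vars "{load_date: '$1'}" || return 1
  done
}

next_day 1992-01-01   # prints 1992-01-02
```

For example, `load_day "$(next_day 1992-01-01)"` would run the six tagged steps for 1992-01-02 while leaving the value stored in the project file unchanged.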

Build Business Vault on top of Data Vault

  1. Point In Time (PIT) table
  2. Bridge Table

Create and submit PR