This file describes the steps necessary to run ETLs using this codebase.
First comes the set of commands to run to start working on and with the ETL. The following sections simply add more explanations (or variations).
- You will need a way to clone this repo, e.g. the `git` command line tool.
- Arthur runs inside a Docker container, so make sure to have Docker installed. (The software can be installed directly into a virtual environment, but we no longer recommend that.)
- You will also need an account with AWS and access using a profile. Be sure to have your access already configured:
    - If you have to work with multiple access keys, check out the support for profiles in the CLI.
    - It is important to set up a default region in case you are not using `us-east-1`.
    - Leave the `output` setting alone or set it to `json`.
```shell
aws configure
```
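For reference, `aws configure` (optionally with `--profile <name>`) stores your settings in `~/.aws/credentials` and `~/.aws/config`. A profile-based setup might end up looking like this sketch, where the profile name `etl-dev` and all values are placeholders, not real keys:

```ini
# ~/.aws/credentials -- placeholder values only
[etl-dev]
aws_access_key_id = AKIA...
aws_secret_access_key = ...

# ~/.aws/config
[profile etl-dev]
region = us-east-1
output = json
```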
```shell
git clone [email protected]:harrystech/arthur-redshift-etl.git
cd arthur-redshift-etl
git pull
bin/build_arthur.sh
```
You will now have an image `arthur-redshift-etl:latest`.
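If you want to double-check the build, a generic check (not an Arthur script) is to ask Docker directly whether the image from the build step exists:

```shell
# Generic check: is Docker usable, and did the build produce the image?
if command -v docker >/dev/null 2>&1; then
    docker image ls arthur-redshift-etl || echo "could not query Docker -- is the daemon running?" >&2
else
    echo "docker not found on PATH -- install Docker first" >&2
fi
```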
It's easiest to set these environment variables, e.g. in your `~/.bashrc` or `~/.zshrc` file:
```shell
export DATA_WAREHOUSE_CONFIG= ...
export ARTHUR_DEFAULT_PREFIX= ...
export AWS_DEFAULT_PROFILE= ...
export AWS_DEFAULT_REGION= ...
```
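As a sketch, you can fail early if any of these variables is still unset before invoking the wrapper scripts. The loop below is a generic bash idiom, not part of Arthur:

```shell
#!/usr/bin/env bash
# Hypothetical sanity check: report any required variable that is unset or empty.
for var in DATA_WAREHOUSE_CONFIG ARTHUR_DEFAULT_PREFIX AWS_DEFAULT_PROFILE AWS_DEFAULT_REGION; do
    if [ -z "${!var}" ]; then
        echo "error: $var is not set" >&2
    fi
done
```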
Then you can simply run:
```shell
bin/run_arthur.sh
```
You can set or override the settings on the command line:
```shell
bin/run_arthur.sh -p aws_profile ../warehouse-repo/config-dir wip
```
When in doubt, ask for help:
```shell
bin/run_arthur.sh -h
```
You should now try to deploy your code into the object store in S3:
```shell
bin/deploy_arthur.sh
```
Even if you want to work on the ETL code itself, you should first follow the steps above to get to a running setup. This section describes what else you should do when you want to develop here.
While our EC2 installations will use `requirements.txt` (see `bootstrap.sh`), you should always use `requirements-all.txt` for local development. The packages listed in that file are installed when building the Docker image.
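If you do install the requirements locally anyway (e.g. so your editor can resolve packages), the standard approach is a virtual environment. This is a generic Python sketch, not an Arthur script; the file name comes from the paragraph above:

```shell
# Generic sketch: create a virtual environment and install the development requirements.
python3 -m venv venv
source venv/bin/activate
python3 -m pip install --requirement requirements-all.txt
```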
See the "Formatting code" section in the README file. Keep this cheat sheet close by for help with types, etc.