By Ismail Tigrek (linkedin.com/in/ismailtigrek)
As presented in the Why R? Turkey 2022 conference
This tutorial walks through deploying R code, machine learning models, and Shiny applications in the cloud. With this knowledge, you will be able to take any local R-based project you’ve built on your machine or at your company and deploy it into production on AWS using modern serverless and microservices architectures. To do this, you will learn how to properly containerize R code with Docker, allowing you to create reproducible environments. You will also learn how to set up event-based and time-based triggers. We will build out a real example that reads in live data, processes it, and writes it into a data lake, entirely in the cloud.
This tutorial has been published in the Why R? 2022 Abstract Book: link
These are the AWS resources and their names as used in this codebase. You will need to change the names in your version of the code to match the names of your resources. You should be able to create all the resources below with the same names except for the S3 buckets.
- Create and activate an AWS account
- Retrieve access keys
- Create S3 buckets
- Create input bucket (whyr2022test)
- Create output bucket (whyr2022testoutput)
- Enable CloudTrail event logging for S3 buckets and objects
- Create ECR repository (whyr2022)
- Create ECS cluster (ETL)
- Create ECS task definition (whyr2022)
- Create EventBridge rule (whyr2022input_upload)
- Create event pattern
{
  "source": ["aws.s3"],
  "detail-type": ["AWS API Call via CloudTrail"],
  "detail": {
    "eventSource": ["s3.amazonaws.com"],
    "eventName": ["PutObject"],
    "requestParameters": {
      "bucketName": ["whyr2022input"]
    }
  }
}
- Set target to ECS task whyr2022 in ECS cluster ETL
- Install and setup Docker Desktop
- Install and setup AWS CLI
- Create a named AWS profile (whyr2022)
- Put access keys into .secrets
- Authenticate Docker client to registry
aws ecr get-login-password --region us-east-1 --profile whyr2022 | docker login --username AWS --password-stdin 631607388267.dkr.ecr.us-east-1.amazonaws.com
- Build Docker image
docker build -t whyr2022 .
- Run Docker image locally to test
docker run whyr2022
- Tag Docker image
docker tag whyr2022:latest 631607388267.dkr.ecr.us-east-1.amazonaws.com/whyr2022:latest
- Push Docker image to AWS ECR
docker push 631607388267.dkr.ecr.us-east-1.amazonaws.com/whyr2022:latest
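For orientation, here is a minimal sketch of the kind of R script the container could run. This is not the exact code in this repository, just an illustration assuming the aws.s3 package and the bucket names used in this README (whyr2022input as input, whyr2022testoutput as output); change these to match your own buckets.
# Minimal sketch: read the most recently uploaded RDS file from the input
# bucket, apply a stand-in transformation, and write the result to the
# output bucket.
library(aws.s3)  # picks up AWS credentials from environment variables

input_bucket  <- "whyr2022input"
output_bucket <- "whyr2022testoutput"

# List the input bucket and take the most recently modified object
objects    <- get_bucket_df(input_bucket)
objects    <- objects[order(objects$LastModified), ]
latest_key <- tail(objects$Key, 1)

# Read, process, and write to the output bucket
df     <- s3readRDS(object = latest_key, bucket = input_bucket)
df$sum <- df[[1]] + df[[2]]
s3saveRDS(df, object = paste0("processed-", latest_key), bucket = output_bucket)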
You can view the logs of all container runs in AWS CloudWatch under Log Groups.
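To exercise the whole pipeline and produce such a log entry, upload a small test RDS file to the input bucket, either through the console or from R. A quick sketch (bucket name is a placeholder, as above):
# Sketch: trigger the pipeline by uploading a small data frame as an RDS
# object. The PutObject call matches the EventBridge rule shown earlier.
library(aws.s3)
test_df <- data.frame(x = rnorm(5), y = rnorm(5))
s3saveRDS(test_df, object = "test-upload.rds", bucket = "whyr2022input")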
This was only meant to be a brief tutorial to fit into 30 minutes. Many crucial steps were omitted for the sake of brevity. Ideally, you should consider doing the following:
We used the access keys of our root AWS account. This is not ideal for security reasons. Use your root account to create an admin user for yourself. Then lock the root credentials away and never use them again unless absolutely necessary. Next, using your new admin account, create another IAM user for your production code. Replace the access keys in your .secrets file with the access keys from this account.
Ideally, you don't want to store your access keys inside your code or Docker image. Instead, pass the access keys in as environment variables when creating your ECS task definition.
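For example, if the task definition injects AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, and AWS_DEFAULT_REGION as container environment variables, aws.s3 will pick them up automatically and no .secrets file is needed. A small sketch that simply verifies they are present:
# Sketch: rely on credentials injected by the ECS task definition as
# environment variables instead of a .secrets file baked into the image.
required <- c("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY", "AWS_DEFAULT_REGION")
missing  <- required[Sys.getenv(required) == ""]
if (length(missing) > 0) {
  stop("Missing environment variables: ", paste(missing, collapse = ", "))
}
library(aws.s3)  # reads the variables above automatically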
We baked our script into our Docker image. This is not ideal since every time you update your code, you will need to rebuild your Docker image and re-push to ECR. Instead, create a general Docker image that can read in the name of a script to pull from an S3 bucket (you can pass this name in as an ECS task environment variable). This way, you can have a bucket that contains your R scripts, and you will only need to build your Docker image once. Every time your container is deployed, it will pull the latest version of your script from the S3 bucket.
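A sketch of what such a generic entrypoint could look like, assuming a hypothetical scripts bucket (whyr2022scripts) and an environment variable (SCRIPT_NAME) set in the ECS task definition; both names are made up for illustration.
# Generic entrypoint baked into the image once; the actual R script is pulled
# from S3 at run time, so updating code does not require rebuilding the image.
library(aws.s3)

script_bucket <- "whyr2022scripts"          # hypothetical scripts bucket
script_name   <- Sys.getenv("SCRIPT_NAME")  # set as an ECS task environment variable
if (script_name == "") stop("SCRIPT_NAME environment variable not set")

# Download the latest version of the script and execute it
local_path <- file.path(tempdir(), script_name)
save_object(object = script_name, bucket = script_bucket, file = local_path)
source(local_path)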
Our script is running on a lot of assumptions, such as:
- only one file is uploaded to whyr2022input at a time
- only RDS files are uploaded to whyr2022input
- the files uploaded to whyr2022input are data frames with two numeric columns
Production code should not run on any assumptions. Everything should be validated, and possible errors or edge cases should be gracefully handled.
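As an illustration, here is a sketch of the kind of validation the script could perform before processing an uploaded object (the function name is made up):
# Sketch: validate an uploaded object before processing it instead of assuming
# it is a well-formed RDS file containing a data frame with two numeric columns.
validate_input <- function(key, bucket) {
  if (!grepl("\\.rds$", tolower(key))) {
    stop("Object '", key, "' is not an RDS file")
  }
  df <- tryCatch(
    aws.s3::s3readRDS(object = key, bucket = bucket),
    error = function(e) stop("Could not read '", key, "' as RDS: ", conditionMessage(e))
  )
  if (!is.data.frame(df) || ncol(df) != 2 || !all(vapply(df, is.numeric, logical(1)))) {
    stop("Object '", key, "' is not a data frame with two numeric columns")
  }
  df
}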
Another thing that can be done to enhance the pipeline is to mark "used" input files. This can be done by appending "-used_" to the file name.
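Since S3 has no rename operation, one way to do this with aws.s3 is to copy the object to a new key and delete the original. A sketch following the suffix convention above:
# Sketch: "rename" a processed input file by copying it to a new key and
# deleting the original (S3 objects cannot be renamed in place).
mark_as_used <- function(key, bucket) {
  used_key <- paste0(key, "-used_")   # suffix convention from this README
  aws.s3::copy_object(from_object = key, to_object = used_key,
                      from_bucket = bucket, to_bucket = bucket)
  aws.s3::delete_object(object = key, bucket = bucket)
}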
We provisioned all our resources through the AWS Console. This is not ideal since we cannot easily recreate the allocation and configuration of these resources. Ideally, you want to codify this process using an infrastructure-as-code solution (e.g., Terraform).