
Why R? 2022: Serverless R in the Cloud

Deploying R into Production with AWS and Docker

By Ismail Tigrek (linkedin.com/in/ismailtigrek)

As presented in the Why R? Turkey 2022 conference

This tutorial will walk through deploying R code, machine learning models, or Shiny applications in the cloud environment. With this knowledge, you will be able to take any local R-based project you’ve built on your machine or at your company and deploy it into production on AWS using modern serverless and microservices architectures. In order to do this, you will learn how to properly containerize R code using Docker, allowing you to create reproducible environments. You will also learn how to set up event-based and time-based triggers. We will build out a real example that reads in live data, processes it, and writes it into a data lake, all in the cloud.

This tutorial has been published in the Why R? 2022 Abstract Book: link


AWS Resources

These are the AWS resources and their names as used in this codebase. You will need to change the names in your version of the code to match the names of your resources. You should be able to create all of the resources below with the same names, except for the S3 buckets (bucket names are globally unique, so you will have to pick your own).

S3 Buckets: whyr2022input, whyr2022output

ECR Repository: whyr2022

ECS Cluster: ETL

ECS Task: whyr2022

EventBridge Rule: whyr2022input_upload

Setting Up

AWS

  1. Create and activate an AWS account
  2. Retrieve access keys
  3. Create S3 buckets (whyr2022input, whyr2022output)
  4. Create ECR repository (whyr2022)
  5. Create ECS cluster (ETL)
  6. Create ECS task definition (whyr2022)
  7. Create EventBridge rule (whyr2022input_upload)
  • Create event pattern (this pattern matches S3 PutObject calls recorded by CloudTrail, so CloudTrail data event logging must be enabled for the input bucket)
{
  "source": ["aws.s3"],
  "detail-type": ["AWS API Call via CloudTrail"],
  "detail": {
    "eventSource": ["s3.amazonaws.com"],
    "eventName": ["PutObject"],
    "requestParameters": {
      "bucketName": ["whyr2022input"]
    }
  }
}
  • Set target to ECS task whyr2022 in ECS cluster ETL
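If you prefer the command line to the console, steps 3–7 above map roughly to the AWS CLI commands below. This is only a sketch: the tutorial itself uses the AWS Console, event_pattern.json is a hypothetical file holding the event pattern shown above, and the EventBridge target is easier to attach in the console because it needs an IAM role and ECS launch parameters. Add --region us-east-1 if your profile has no default region.

aws s3 mb s3://whyr2022input --profile whyr2022
aws s3 mb s3://whyr2022output --profile whyr2022
aws ecr create-repository --repository-name whyr2022 --profile whyr2022
aws ecs create-cluster --cluster-name ETL --profile whyr2022
aws events put-rule --name whyr2022input_upload --event-pattern file://event_pattern.json --profile whyr2022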

Your Computer

  1. Install and set up Docker Desktop
  2. Install and set up AWS CLI
  3. Create a named AWS profile called whyr2022
  4. Put your access keys into the .secrets file
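The named profile can be created with the AWS CLI. The .secrets layout below is only an assumed key=value example; the actual format is whatever the R code in this repository expects.

aws configure --profile whyr2022   # prompts for access key ID, secret access key, and default region

# .secrets (assumed layout: environment-variable style key=value pairs)
AWS_ACCESS_KEY_ID=<your access key id>
AWS_SECRET_ACCESS_KEY=<your secret access key>
AWS_DEFAULT_REGION=us-east-1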

Deployment

  1. Authenticate Docker client to registry (replace 631607388267 with your own AWS account ID in the commands below)
aws ecr get-login-password --region us-east-1 --profile whyr2022 | docker login --username AWS --password-stdin 631607388267.dkr.ecr.us-east-1.amazonaws.com
  2. Build Docker image
docker build -t whyr2022 .
  3. Run Docker image locally to test
docker run whyr2022
  4. Tag Docker image
docker tag whyr2022:latest 631607388267.dkr.ecr.us-east-1.amazonaws.com/whyr2022:latest
  5. Push Docker image to AWS ECR
docker push 631607388267.dkr.ecr.us-east-1.amazonaws.com/whyr2022:latest
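For reference, the Dockerfile being built in step 2 might look roughly like the sketch below. The base image, package list, and etl.R script name are assumptions for illustration; the actual Dockerfile in this repository may differ.

# Minimal sketch of a Dockerfile for an R ETL container (assumed file names)
FROM rocker/r-ver:4.2.0
RUN install2.r --error aws.s3 dplyr
WORKDIR /app
COPY .secrets etl.R ./
CMD ["Rscript", "etl.R"]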

View logs

You can view the logs of all container runs in AWS CloudWatch under Log Groups.
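If you prefer the terminal, AWS CLI v2 can also tail these logs. The log group name below assumes the console default of /ecs/<task family>; use whatever log group your task definition actually writes to.

aws logs tail /ecs/whyr2022 --follow --profile whyr2022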

Further steps

This was only meant to be a brief tutorial that fits into 30 minutes, so many important steps were skipped for the sake of brevity. Ideally, you should also do the following:

Create an IAM user for your production code

We used the access keys of our root AWS account. This is not ideal for security reasons. Use your root account to create an admin user for yourself, then lock the root credentials away and never use them again unless absolutely necessary. Using your new admin account, create another IAM user for your production code, and replace the access keys in your .secrets file with the access keys of that user.
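As a sketch, creating such a user from the CLI might look like the following. The user name is hypothetical and the attached policy is only an example; in practice, scope the permissions down to just the S3, ECR, and ECS actions the pipeline needs.

aws iam create-user --user-name whyr2022-etl --profile whyr2022
aws iam attach-user-policy --user-name whyr2022-etl --policy-arn arn:aws:iam::aws:policy/AmazonS3FullAccess --profile whyr2022
aws iam create-access-key --user-name whyr2022-etl --profile whyr2022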

Remove access keys from your code

Ideally, you don't want to store your access keys inside your code or Docker image. Instead, pass the access keys in as environment variables when creating your ECS task definition.
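In the task definition JSON, environment variables live inside the container definition. A trimmed sketch (most required fields omitted) might look like this; the variable names are assumptions and should match whatever your R code reads.

{
  "containerDefinitions": [
    {
      "name": "whyr2022",
      "image": "631607388267.dkr.ecr.us-east-1.amazonaws.com/whyr2022:latest",
      "environment": [
        { "name": "AWS_ACCESS_KEY_ID", "value": "<key id>" },
        { "name": "AWS_SECRET_ACCESS_KEY", "value": "<secret key>" }
      ]
    }
  ]
}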

Replace Docker image with a general image

We baked our script into our Docker image. This is not ideal since every time you update your code, you will need to rebuild your Docker image and re-push to ECR. Instead, create a general Docker image that can read in the name of a script to pull from an S3 bucket (you can pass this name in as an ECS task environment variable). This way, you can have a bucket that contains your R scripts, and you will only need to build your Docker image once. Every time your container is deployed, it will pull the latest version of your script from the S3 bucket.
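A minimal sketch of such a generic entrypoint, assuming the aws.s3 package and two hypothetical environment variables (SCRIPT_NAME and SCRIPTS_BUCKET) set on the ECS task:

# generic_entrypoint.R (hypothetical): fetch and run the script named in the task's environment
library(aws.s3)

script_name    <- Sys.getenv("SCRIPT_NAME", "etl.R")               # passed in via the ECS task definition
scripts_bucket <- Sys.getenv("SCRIPTS_BUCKET", "whyr2022scripts")   # hypothetical bucket holding R scripts

save_object(object = script_name, bucket = scripts_bucket, file = script_name)  # download the latest version
source(script_name)                                                             # run it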

Make code more robust

Our script relies on a number of assumptions, such as:

  • only one file is uploaded to whyr2022input at a time
  • only RDS files are uploaded to whyr2022input
  • the files uploaded to whyr2022input contain data frames with two numeric columns

Production code should not rely on unvalidated assumptions. Everything should be validated, and possible errors or edge cases should be handled gracefully.
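As a sketch of what such validation could look like in R (the expected shape of the data follows the assumptions listed above):

# Defensive checks before processing a downloaded input file
validate_input <- function(path) {
  df <- tryCatch(readRDS(path),
                 error = function(e) stop("Not a readable RDS file: ", path, call. = FALSE))
  if (!is.data.frame(df)) stop("Expected a data frame, got: ", class(df)[1])
  if (ncol(df) != 2) stop("Expected 2 columns, got: ", ncol(df))
  if (!all(vapply(df, is.numeric, logical(1)))) stop("Expected all columns to be numeric")
  df
}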

Another enhancement to the pipeline is to mark input files that have already been processed, for example by appending a marker such as "-used_" to the file name.
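One way to do this with the aws.s3 package is to copy the object to a new key carrying the marker and then delete the original. The marker convention below (a "used_" prefix) is illustrative; any consistent naming scheme works.

# Rename a processed input object so it is not picked up again
library(aws.s3)

mark_used <- function(key, bucket = "whyr2022input") {
  copy_object(from_object = key, to_object = paste0("used_", key),
              from_bucket = bucket, to_bucket = bucket)
  delete_object(object = key, bucket = bucket)
}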

Convert all infrastructure to code

We provisioned all of our resources through the AWS Console. This is not ideal, since we cannot easily recreate the allocation and configuration of these resources. Ideally, you want to codify this process using an Infrastructure-as-Code solution (e.g., Terraform).
