Proteins are large biomolecules that play a critical role in the body, and a key problem in the field of macromolecular modeling is protein structure design. Frequently, this problem takes the following form: given a protein structure and sequence, and a particular functional site, design a new protein that will fold into the same structure. An additional requirement is usually that certain components of the structure have conserved properties, such as the preservation of an active site.
This can be visualized with the following schematic:
In the schematic above, new structures (orange, blue, yellow) are designed that are structurally similar to the original structure (green) but have different sequences, while still conserving the ANK region.
A number of tools and approaches leveraging deep learning, such as RFDesign and diffusion models, have been developed to approach this problem and variants of it.
In this repository, we demonstrate how RFDesign can be deployed on AWS infrastructure. The repository contains the CloudFormation template, Dockerfile, and sample scripts for submitting jobs to AWS Batch.
The architecture for this approach is similar to the previously published AWS Batch Architecture for Protein Folding and Design. The architecture is as follows:
Note also that this image only supports the hallucination and inpainting functionality, and not the AF_metrics and pyrosetta functionality provided by RFDesign. If you wish to run that functionality, you will need to modify the image to download the AlphaFold parameters. See the Docker image released by RFDesign for how to do so.
We next outline the steps to deploy the workload.
After cloning this repository and changing directory into AWS-Batch-Arch-for-RFDesign, you must build and push the container to Amazon Elastic Container Registry (ECR). Building the image clones the RFDesign repository and installs the relevant dependencies within the Docker image. For this step, you must have Docker installed.
Optionally, you may also choose to locally run the tests provided by the RFDesign repository. If you wish to run these tests, you should comment out the line ENTRYPOINT ["bash", "-c"] in the Dockerfile. You can then run the tests interactively by running the Docker image in interactive mode.
IMAGE_NAME="proteindesign_image" #or select your own name
sh ./build_and_push.sh Dockerfile $IMAGE_NAME #push the image to ecr
Copy the ECR URI for the image (it will look something like xxxxxxxxxxxx.dkr.ecr.us-east-1.amazonaws.com/$IMAGE_NAME); you will need this for the CloudFormation template in the next step. Do not include the tag in the URI; the CloudFormation template deployed in the next step will automatically pull the latest version.
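If the URI you copied includes a tag, it can be trimmed in the shell before you paste it into the template. The following is a minimal sketch: strip_image_tag is a hypothetical helper name that is not part of this repository, and the account ID in the usage line is a placeholder.

```shell
# Hypothetical helper (not part of this repository): print an image URI
# without its trailing ":tag", leaving untagged URIs unchanged.
strip_image_tag() {
  # Only inspect the final path component, so a colon earlier in the URI
  # (for example a registry port) is never mistaken for a tag.
  case "${1##*/}" in
    *:*) printf '%s\n' "${1%:*}" ;;
    *)   printf '%s\n' "$1" ;;
  esac
}

# Usage with a placeholder account ID:
strip_image_tag "123456789012.dkr.ecr.us-east-1.amazonaws.com/proteindesign_image:latest"
```

The usage line prints the URI with the ":latest" suffix removed, which is the form the template parameter expects.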
Please note that while it is possible to build and push the container from your local machine (assuming that you have AWS CLI access), you may want to leverage an Amazon SageMaker Notebook with a GPU instance to build and push the container; building the image is generally faster on notebook instances. When testing, we were able to use an ml.g4dn.xlarge instance to build and push the image to ECR.
Next, you must deploy the underlying infrastructure to support Batch job submissions. As shown in the diagram above, this includes a Virtual Private Cloud (VPC) (which has a private and public subnet), and a Batch compute environment, which leverages the private subnet for running submitted jobs.
Use this option if you want to create all the infrastructure with new resources, including a new VPC and Batch environment.
To create all of the infrastructure needed, including the VPC, do as follows:
- ApplicationName: the name of the application (default: ProtDesign)
- StackAvailabilityZone: the availability zone you wish to deploy in (default: us-east-1a)
- ProtDesignContainerRegistryURI: the URI of the protein design image you pushed to ECR in the previous step. If you did not record the URI, you can retrieve it from the ECR console.
- Select I acknowledge that AWS CloudFormation might create IAM resources with custom names.
- Choose Create stack.
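The job submission commands later in this document pass --region us-east-1, which matches the default StackAvailabilityZone. If you choose a different availability zone, those commands need the matching region instead. As a sketch, assuming a standard availability zone name (a region name followed by a single letter), the region can be derived as shown here. region_from_az is a hypothetical helper, and the rule does not hold for Local Zones or Wavelength Zones, whose names follow a different pattern.

```shell
# Hypothetical helper: derive the region from a standard availability zone
# name by dropping the final letter (for example "us-east-1a" -> "us-east-1").
region_from_az() {
  printf '%s\n' "${1%?}"
}

# Usage with the default availability zone from the template:
region_from_az "us-east-1a"
```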
If you already have a VPC and Subnet you wish to use, you can use this option.
- ApplicationName: the name of the application (default: ProtDesign)
- DefaultSecurityGroupID: the existing security group you wish to use.
- Subnet: the existing subnet you wish to use.
- ProtDesignContainerRegistryURI: the URI of the protein design image you pushed to ECR in the previous step. If you did not record the URI, you can retrieve it from the ECR console.
- Select I acknowledge that AWS CloudFormation might create IAM resources with custom names.
- Choose Create stack.
Wait for the CloudFormation stack to finish deploying before running the next steps. This should take a few minutes.
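Rather than watching the console, you can block in the shell until the stack is ready. The sketch below wraps the AWS CLI waiter aws cloudformation wait stack-create-complete; the wrapper name wait_for_stack and the default region are assumptions made for illustration.

```shell
# Hypothetical wrapper: block until the named stack reaches CREATE_COMPLETE.
# The underlying waiter polls the stack and exits non-zero if creation
# fails or the wait times out.
wait_for_stack() {
  stack_name="$1"
  region="${2:-us-east-1}"   # assumption: same region as the other commands in this README
  aws cloudformation wait stack-create-complete \
    --region "$region" \
    --stack-name "$stack_name"
}

# Usage (requires AWS CLI credentials):
#   wait_for_stack "$STACK_NAME" && echo "Stack is ready"
```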
The RFDesign protein hallucination functionality allows for designing new proteins. The example here is slightly modified from the test script provided by RFDesign and uses the structures rsvf-v_5tpn.pdb and rsvf-v_5tpn_receptor_frag.pdb provided by the RFDesign repository.
Prior to running this job, you must first put the files rsvf-v_5tpn.pdb and rsvf-v_5tpn_receptor_frag.pdb (found here) in a bucket/directory in S3. S3_RF_DESIGN_LOCATION_INPUT must then be set to point to that location (e.g. "s3://my_bucket/5tpn_directory"). Set S3_RF_DESIGN_LOCATION_OUTPUT to a separate output directory of your choosing in S3.
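One way to place the two structures in S3 is with aws s3 cp. The following is a sketch rather than a script from this repository: upload_hallucination_inputs is a hypothetical helper, and it assumes you have already downloaded both .pdb files into a local directory of your choosing.

```shell
# Hypothetical helper: copy the two example structures from a local
# directory to the S3 input location.
upload_hallucination_inputs() {
  local_dir="$1"        # assumption: directory holding the downloaded .pdb files
  s3_prefix="${2%/}"    # drop one trailing slash, if present
  for f in rsvf-v_5tpn.pdb rsvf-v_5tpn_receptor_frag.pdb; do
    aws s3 cp "$local_dir/$f" "$s3_prefix/$f" || return 1
  done
}

# Usage (requires AWS CLI credentials and write access to the bucket):
#   upload_hallucination_inputs ./pdbs "s3://my_bucket/5tpn_directory"
```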
If you wish to leverage this script for different structures, you will need to edit the corresponding structure names in the rf_design_hallucination_batch_submission_template.json file.
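The submission commands in this section fill in the template with envsubst, which reads variables from the environment and replaces references to unset variables with empty strings without reporting an error. A small guard can catch a forgotten export before the template is rendered. This is a sketch: require_vars is a hypothetical helper, and it relies on the printenv utility being available.

```shell
# Hypothetical helper: fail if any named variable is missing from the
# environment or empty. printenv only sees exported variables, which is
# also what envsubst sees.
require_vars() {
  for name in "$@"; do
    if [ -z "$(printenv "$name")" ]; then
      echo "Missing required environment variable: $name" >&2
      return 1
    fi
  done
}

# Usage, before running envsubst:
#   require_vars S3_RF_DESIGN_LOCATION_INPUT S3_RF_DESIGN_LOCATION_OUTPUT || exit 1
```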
First, change directory into the AWS-Batch-Arch-for-RFDesign directory. Then execute the following:
STACK_NAME="" #use the name of the stack you used in cloud formation
export S3_RF_DESIGN_LOCATION_INPUT="REPLACE_WITH_S3_LOCATION_INPUT" #e.g. "s3://my_bucket/5tpn_directory"
export S3_RF_DESIGN_LOCATION_OUTPUT="REPLACE_WITH_S3_LOCATION_OUTPUT"
envsubst < rf_design_hallucination_batch_submission_template.json > example_rf_design_hallucination_batch_submission.json
#get the job queue and job definition from the stack; you can also get this from the CloudFormation Console
JOB_QUEUE=`aws cloudformation --region us-east-1 describe-stacks --stack-name ${STACK_NAME} --query "Stacks[0].Outputs[?OutputKey=='G4dnJobQueue'].OutputValue" --output text`
PROT_DESIGN_JOB_DEFINITION=`aws cloudformation --region us-east-1 describe-stacks --stack-name ${STACK_NAME} --query "Stacks[0].Outputs[?OutputKey=='ProtDesignJobDefinition'].OutputValue" --output text`
aws batch submit-job --job-name protein_hallucination_job --job-queue $JOB_QUEUE --job-definition $PROT_DESIGN_JOB_DEFINITION --container-overrides file://example_rf_design_hallucination_batch_submission.json
When we submitted this job, it took about 5-10 minutes for the job to start. Once the job started, it took about 30 seconds to finish. Note that you can decrease the latency for the job to start by increasing the MinvCPU parameter in the CloudFormation template, with a corresponding increase in cost for having continuously running CPUs. You can modify the parameters of the job by modifying the file rf_design_hallucination_batch_submission_template.json.
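Because the job may sit in the queue for several minutes, it can be convenient to poll its status from the shell. The sketch below uses aws batch describe-jobs and stops when the job reaches SUCCEEDED or FAILED. The function name wait_for_batch_job, the 30-second default interval, and the hard-coded region are assumptions for illustration.

```shell
# Hypothetical helper: poll an AWS Batch job until it reaches a terminal state.
# Returns 0 on SUCCEEDED, 1 on FAILED, 2 if the status query itself fails.
wait_for_batch_job() {
  job_id="$1"
  interval="${2:-30}"   # seconds between polls (assumed default)
  while :; do
    status=$(aws batch describe-jobs --region us-east-1 --jobs "$job_id" \
      --query 'jobs[0].status' --output text) || return 2
    echo "Job $job_id: $status"
    case "$status" in
      SUCCEEDED) return 0 ;;
      FAILED)    return 1 ;;
    esac
    sleep "$interval"
  done
}
```

To obtain the job ID, you can append --query jobId --output text to the aws batch submit-job command and capture its output in a variable, then pass that variable to wait_for_batch_job.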
Protein inpainting can be used for protein design when only some of the structure/sequence is provided. We use the example structure 2kl8.pdb (found here) from the RFDesign repository.
Note that for this job, S3_RF_DESIGN_LOCATION_INPUT must point to a file in S3 (e.g. "s3://my_bucket/2kl8_directory/2kl8.pdb"), and not a directory.
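Since the hallucination job expects a directory and this job expects a file, it is easy to reuse the wrong value. A quick check before submitting can catch that. This is a sketch only: is_s3_pdb_file is a hypothetical helper, and it assumes the input file name ends in .pdb, as in the example above.

```shell
# Hypothetical check: succeed only if the argument is an S3 URI whose key
# ends in ".pdb" (a file), not a bare bucket or directory prefix.
is_s3_pdb_file() {
  case "$1" in
    s3://*/*.pdb) return 0 ;;
    *)            return 1 ;;
  esac
}

# Usage:
#   is_s3_pdb_file "$S3_RF_DESIGN_LOCATION_INPUT" || echo "Input must be a .pdb file in S3" >&2
```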
If you wish to leverage this script for different structures, you will need to edit the corresponding structure names in the rf_design_inpainting_batch_submission_template.json file.
export S3_RF_DESIGN_LOCATION_INPUT="REPLACE_WITH_S3_LOCATION_INPUT"
export S3_RF_DESIGN_LOCATION_OUTPUT="REPLACE_WITH_S3_LOCATION_OUTPUT"
envsubst < rf_design_inpainting_batch_submission_template.json > example_rf_design_inpainting_batch_submission.json
#get the job queue and job definition from the stack; you can also get this from the CloudFormation Console
JOB_QUEUE=`aws cloudformation --region us-east-1 describe-stacks --stack-name ${STACK_NAME} --query "Stacks[0].Outputs[?OutputKey=='G4dnJobQueue'].OutputValue" --output text`
PROT_DESIGN_JOB_DEFINITION=`aws cloudformation --region us-east-1 describe-stacks --stack-name ${STACK_NAME} --query "Stacks[0].Outputs[?OutputKey=='ProtDesignJobDefinition'].OutputValue" --output text`
aws batch submit-job --job-name protein_inpainting_job --job-queue $JOB_QUEUE --job-definition $PROT_DESIGN_JOB_DEFINITION --container-overrides file://example_rf_design_inpainting_batch_submission.json
When we submitted this job, it took about 5-10 minutes for the job to start. Once the job started, it took about 30 seconds to finish. Note that you can decrease the latency for the job to start by increasing the MinvCPU parameter in the CloudFormation template, with a corresponding increase in cost for having continuously running CPUs.
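Once a job has finished, you may want to copy its results to your machine. The sketch below assumes the job writes its results under S3_RF_DESIGN_LOCATION_OUTPUT; fetch_outputs and the default local directory are hypothetical names chosen for illustration.

```shell
# Hypothetical helper: mirror the S3 output location into a local directory.
fetch_outputs() {
  s3_output="${1%/}"                 # drop one trailing slash, if present
  dest="${2:-./rfdesign_outputs}"    # assumed local destination
  aws s3 sync "$s3_output/" "$dest"
}

# Usage (requires AWS CLI credentials):
#   fetch_outputs "$S3_RF_DESIGN_LOCATION_OUTPUT"
```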
You can read more in the RFDesign preprint here and find the corresponding GitHub repository here.
The citation for RFDesign is:
@article {Wang2021.11.10.468128,
author = {Wang, Jue and Lisanza, Sidney and Juergens, David and Tischer, Doug and Anishchenko, Ivan and Baek, Minkyung and Watson, Joseph L. and Chun, Jung Ho and Milles, Lukas F. and Dauparas, Justas and Exp{\`o}sit, Marc and Yang, Wei and Saragovi, Amijai and Ovchinnikov, Sergey and Baker, David},
title = {Deep learning methods for designing proteins scaffolding functional sites},
elocation-id = {2021.11.10.468128},
year = {2021},
doi = {10.1101/2021.11.10.468128},
publisher = {Cold Spring Harbor Laboratory},
URL = {https://www.biorxiv.org/content/early/2021/11/15/2021.11.10.468128},
eprint = {https://www.biorxiv.org/content/early/2021/11/15/2021.11.10.468128.full.pdf},
journal = {bioRxiv}
}
See CONTRIBUTING for more information.
This project is licensed under the Apache-2.0 License.