MaskedProteinEnT

Code to sample sequences with a contextual Masked EnTransformer as described in "Contextual protein and antibody encodings from equivariant graph transformers".

Self-supervised learning to transduce sequence labels for masked residues from those for unmasked residues by context matching on proteins.

Installation

For sampling, in your virtual environment, pip install as follows:

pip install torch torchvision torchaudio -f https://download.pytorch.org/whl/torch_stable.html
pip install -r requirements.txt

Installation with Docker

A Dockerfile is provided as an example/demo of package use; see the example command lines below. For production use, you may need to mount a host data directory as a subdirectory of /code, where the package code is located.

docker build -t masked-protein-ent .
docker run -it masked-protein-ent

Test with Colab

An example Jupyter notebook for Colab is provided in MaskedProteinEnT-colab-example.ipynb. Because the Colab platform changes frequently, the notebook may stop working over time and may need some edits. Alternatively, we provide a Dockerfile for easy installation.

Sampling works well on both CPUs and GPUs, and is just as fast on CPUs: under 2 minutes for 10,000 sequences.

Trained models

Download and extract trained models from Zenodo.

tar -xvzf model.tar.gz
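
After extraction, it can be convenient to confirm that the checkpoints referenced by the commands in this README are present. The following is a minimal sketch, assuming the archive unpacks into a trained_models/ directory (as the MODEL paths in the sampling commands suggest); `missing_checkpoints` is a hypothetical helper and not part of the repository:

```python
from pathlib import Path

# Checkpoint file names as they appear in the MODEL variables of this README.
EXPECTED_CHECKPOINTS = (
    "ProtEnT_backup.ckpt",
    "ProtPPIEnT_backup.ckpt",
    "ProtAbAgEnT_backup.ckpt",
)


def missing_checkpoints(model_dir="trained_models"):
    """Return the expected checkpoint names that are not files in model_dir."""
    root = Path(model_dir)
    return [name for name in EXPECTED_CHECKPOINTS if not (root / name).is_file()]
```

Calling `missing_checkpoints()` from the directory where the archive was extracted returns an empty list when all three checkpoints are in place.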

Sampling protein sequences

To design/generate all positions on the protein, run:

MODEL=trained_models/ProtEnT_backup.ckpt
OUTDIR=./sampled_sequences
PDB_DIR=data/proteins
python3 ProteinSequenceSampler.py  \
	--output_dir ${OUTDIR} \
	--model $MODEL \
	--from_pdb $PDB_DIR \
	--sample_temperatures 0.2,0.5 \
	--num_samples 100

The above command samples all sequences at 100% masking (i.e., the model uses only coordinate information). You may sample at any other masking rate between 0 and 100%, in which case the model randomly selects the positions to mask. For more options, run:

python3 ProteinSequenceSampler.py --help
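
To illustrate what random masking at a given rate means, here is a small sketch. It is not the repository's implementation: `choose_masked_positions` and its parameters are hypothetical names, and the assumption is that positions are drawn uniformly at random without replacement.

```python
import random


def choose_masked_positions(num_residues, mask_rate, seed=None):
    """Pick round(mask_rate * num_residues) residue indices uniformly at random.

    mask_rate is a fraction in [0, 1]; 1.0 masks every position, which
    corresponds to the 100% masking case described above.
    """
    if not 0.0 <= mask_rate <= 1.0:
        raise ValueError("mask_rate must be between 0 and 1")
    rng = random.Random(seed)
    num_masked = round(mask_rate * num_residues)
    # Sample without replacement so no position is chosen twice.
    return sorted(rng.sample(range(num_residues), num_masked))
```

With `mask_rate=1.0` every index is returned, so the sequence is designed from coordinates alone; lower rates leave the remaining residues visible as sequence context.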

Sampling antibody sequences without partner context

To design/generate selected positions on an antibody (here, the indices passed to --mask_ab_indices), run:

MODEL=trained_models/ProtEnT_backup.ckpt
OUTDIR=./sampled_sequences
PDB_DIR=data/proteins
python3 ProteinSequenceSampler.py  \
	--output_dir ${OUTDIR} \
	--model $MODEL \
	--from_pdb $PDB_DIR \
	--sample_temperatures 0.2,0.5 \
	--num_samples 100 \
	--antibody \
	--mask_ab_indices 10,11,12
# To sample for a specific region
#	--mask_ab_region h3

The above command masks and samples only the positions given by --mask_ab_indices (or, with --mask_ab_region, a whole region such as h3). For more options, run:

python3 ProteinSequenceSampler.py --help
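
The commands above pass --sample_temperatures 0.2,0.5. As a rough illustration of what a sampling temperature does, the sketch below applies the standard temperature-scaled softmax to a list of logits. This is an assumption about the usual meaning of the term, not a description of the sampler's internals; the function names are hypothetical.

```python
import math
import random


def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after dividing them by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_index(logits, temperature, rng):
    """Draw one index according to the temperature-scaled probabilities."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]
```

Under this assumption, a lower temperature such as 0.2 concentrates probability on the highest-scoring residue type, while 0.5 spreads it more evenly and yields more diverse samples.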

Sampling interface residues with partner context

To generate/design the interface residues for the first partner (order determined by partners.json), run:

MODEL=trained_models/ProtPPIEnT_backup.ckpt
OUTDIR=./sampled_ppi_sequences
PDB_DIR=data/ppis
PPI_PARTNERS_DICT=data/ppis/heteromers_partners_example.json
python3 PPIAbAgSequenceSampler.py  \
	--output_dir ${OUTDIR} \
	--model $MODEL \
	--from_pdb $PDB_DIR \
	--sample_temperatures 0.2,0.5 \
	--num_samples 100 \
	--partners_json ${PPI_PARTNERS_DICT} \
	--partner_name p0

# to design interface residues on second partner use
# --partner_name p1
# to design interface residues on both partners use
# --partner_name both
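
When running the sampler for several --partner_name values, the command line can be assembled programmatically. The sketch below only uses flags that appear in the command above; `build_ppi_sampler_command` is a hypothetical helper, and the per-partner output subdirectories are an illustrative choice rather than a layout the repository requires.

```python
import shlex


def build_ppi_sampler_command(model, from_pdb, output_dir, partners_json,
                              partner_name, temperatures=(0.2, 0.5),
                              num_samples=100):
    """Assemble the argument list for PPIAbAgSequenceSampler.py."""
    return [
        "python3", "PPIAbAgSequenceSampler.py",
        "--output_dir", output_dir,
        "--model", model,
        "--from_pdb", from_pdb,
        "--sample_temperatures", ",".join(str(t) for t in temperatures),
        "--num_samples", str(num_samples),
        "--partners_json", partners_json,
        "--partner_name", partner_name,
    ]


# Print one command per partner setting; the paths are those used above.
for partner in ("p0", "both"):
    cmd = build_ppi_sampler_command(
        model="trained_models/ProtPPIEnT_backup.ckpt",
        from_pdb="data/ppis",
        output_dir=f"./sampled_ppi_sequences/{partner}",  # illustrative layout
        partners_json="data/ppis/heteromers_partners_example.json",
        partner_name=partner,
    )
    print(shlex.join(cmd))
```

Each list can be passed to `subprocess.run(cmd, check=True)` to execute the run instead of printing it.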

Sampling antibody interface residues with antigen context

MODEL=trained_models/ProtAbAgEnT_backup.ckpt
OUTDIR=./sampled_abag_sequences
PDB_DIR=data/abag/
PPI_PARTNERS_DICT=data/abag/1n8z_partners.json
python3 PPIAbAgSequenceSampler.py  \
	--output_dir ${OUTDIR} \
	--model $MODEL \
	--from_pdb $PDB_DIR \
	--sample_temperatures 0.2,0.5 \
	--num_samples 100 \
	--partners_json ${PPI_PARTNERS_DICT} \
	--partner_name Ab \
	--antibody
# To sample a specific CDR loop:
# --mask_ab_region h3
# To sample specific indices:
# --mask_ab_indices 10,11,12

Training

Installation

The model was trained with older versions of torch and pytorch_lightning; newer versions are not backward compatible. The following instructions work for Python 3.9 and CUDA 11.1. To train the model, install torch and the other dependencies by running the following commands in your virtual environment:

pip install torch==1.9.1+cu111 torchvision==0.10.1+cu111 torchaudio==0.9.1 -f https://download.pytorch.org/whl/torch_stable.html
pip install -r requirements_torch191.txt
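
Because newer versions are not backward compatible, it may help to check the environment against the pins before training. The sketch below is a hedged example using the standard-library importlib.metadata; `find_mismatches` is a hypothetical helper, and it assumes the installed wheels report their version with the same local suffix (for example +cu111) as in the pip command above.

```python
from importlib import metadata

# Version pins copied from the pip command above.
PINNED = {
    "torch": "1.9.1+cu111",
    "torchvision": "0.10.1+cu111",
    "torchaudio": "0.9.1",
}


def find_mismatches(pinned, get_version=metadata.version):
    """Return {package: (wanted, found)} for packages that are missing or differ.

    found is None when the package is not installed.
    """
    mismatches = {}
    for name, wanted in pinned.items():
        try:
            found = get_version(name)
        except metadata.PackageNotFoundError:
            found = None
        if found != wanted:
            mismatches[name] = (wanted, found)
    return mismatches
```

Calling `find_mismatches(PINNED)` inside the training environment returns an empty dictionary when all three packages match the pins.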

Training

  • The general training code is provided in training_masked_model.py.
  • For training the model under different settings, see scripts/train_model_protein.sh (training on the general protein dataset) and scripts/fine_tune_ppi-abag.sh (fine-tuning on the ppi-abag datasets).
  • Training datasets are available under this Zenodo link. See the table below for an overview and the methods section for detailed information of each dataset.
  • The training script requires a wandb entity for training logging. See --wandb_entity argument in train_masked_model.py.

| Description | File name | Download link | References |
|---|---|---|---|
| Training and validation datasets curated from the CASP12 version of Sidechainnet | sidechainnet_casp12_50.pkl | ⬇️ | AlQuraishi, 2019; King & Koes, 2020 |
| Training dataset identifiers | ids_train_casp12nr50_nr70Ig_nr40Others.fasta | ⬇️ | n.a. |
| Training dataset on non-redundant heterodimer protein-protein interfaces curated from referenced work | ppi_trainset_5032_noabag_aug2022.h5 | ⬇️ | Gainza et al, 2020 |
| Training dataset for antibody-antigen models curated from SAbDAb | trainset_highres_nr90_vhh-rabd-dms_abnr90agnr70_aug2022.h5 | ⬇️ | Dunbar et al, 2014 |
| Training dataset for antibody-only models curated from SAbDAb | AbSCSAbDAb_trainnr90_bkandcbcoords_aug2022.h5 | ⬇️ | Dunbar et al, 2014 |
| Training dataset for antibodies with paired antibody structures generated with AlphaFold2 from a previous study | train_af_paired_nr70.h5 | ⬇️ | Dunbar et al, 2014; Ruffolo et al, 2023 |
| Training dataset for antibodies with unpaired or single-chain antibody structures generated with AlphaFold2 from a previous study | train_af_unpaired_nr70.h5 | ⬇️ | Dunbar et al, 2014; Ruffolo et al, 2023 |
| Test dataset for Proteins (TS50) | protein_ts50_bkandcoords.h5 | ⬇️ | Li et al. |
| Test dataset for PPIs | ppi_testset_noabag_aug2022.h5 | ⬇️ | Gainza et al, 2020 |
| Test dataset for Ab-Ag interfaces curated from multiple sources | testset_rabd-dms-vhh_backboneandcb_oct2022.h5 | ⬇️ | Li et al, 2014; Gainza et al, 2020; Cho et al, 2003; Mason et al, 2021; Ruffolo, Gray & Sulam, 2021 |

References

  • EnTransformer code is based on Phil Wang's implementation of EGNN (Satorras et al. 2021) with equivariant transformer layers.
  • The models, and the sequence recovery for antibody CDRs obtained with the different models (reported in Figure 2), are available at https://zenodo.org/record/8313466.
  • Please note that our protein training dataset is sourced from SidechainNet. You can download the dataset directly from their repository. We are providing it here solely for ease of access. We highly recommend visiting the linked repository above and referring to their publications for more detailed information. Please remember to cite SidechainNet in your work if you utilize their dataset.

If you use this repository to generate or score sequences, please cite:

Mahajan, S. P., Ruffolo, J. A., Gray, J. J., "Contextual protein and antibody encodings from equivariant graph transformers", bioRxiv, 2023.
Mahajan, S. P., Davila-Hernandez, F.A., Ruffolo, J. A., Gray, J. J., "How well do contextual protein encodings learn structure, function, and evolutionary context?", 2023. Under Review.