Source code for the paper: ProteinGCN: Protein model quality assessment using Graph Convolutional Networks
Overview of ProteinGCN: Given a protein structure, it first generates a protein graph and uses a GCN to learn atom embeddings. It then pools the atom embeddings to generate residue-level embeddings, which are passed through a non-linear fully connected layer to predict the local scores. The residue embeddings are further pooled to generate a global protein embedding, which is used in the same way to predict the global score.
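The two pooling stages in the overview can be illustrated with a minimal sketch in plain Python. This is not the repository's implementation: mean pooling is an assumption (the overview does not say which pooling operator is used), the function names and toy embeddings are hypothetical, and the GCN and fully connected layers are omitted.

```python
from collections import defaultdict
from typing import Dict, List, Sequence

Vector = List[float]


def mean_pool(vectors: Sequence[Vector]) -> Vector:
    """Average a non-empty collection of equal-length vectors element-wise."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def pool_atoms_to_residues(
    atom_embeddings: Sequence[Vector], atom_to_residue: Sequence[int]
) -> Dict[int, Vector]:
    """Group atom embeddings by residue index and mean-pool each group."""
    groups = defaultdict(list)
    for embedding, residue in zip(atom_embeddings, atom_to_residue):
        groups[residue].append(embedding)
    return {residue: mean_pool(embs) for residue, embs in sorted(groups.items())}


def pool_residues_to_protein(residue_embeddings: Dict[int, Vector]) -> Vector:
    """Mean-pool all residue embeddings into one protein-level embedding."""
    return mean_pool(list(residue_embeddings.values()))


# Toy input (hypothetical): three atoms with 2-d embeddings;
# the first two atoms belong to residue 0, the third to residue 1.
atom_embeddings = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
atom_to_residue = [0, 0, 1]

residue_embeddings = pool_atoms_to_residues(atom_embeddings, atom_to_residue)
protein_embedding = pool_residues_to_protein(residue_embeddings)
```

In the model described above, each residue embedding would feed the local score predictor and the protein embedding would feed the global score predictor.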
- Compatible with PyTorch 1.0 and Python 3.x.
- Dependencies can be installed using the requirements.txt file.
- We use Rosetta-300k to train the ProteinGCN model and test it on both the Rosetta-300k and CASP13 datasets for local (residue-level) and global quality assessment predictions.
- Install all the requirements by executing
  pip install -r requirements.txt
- Install the required protein .pdb processing library by executing
  sh preprocess.sh
  which clones and installs the GitHub repository that provides it.
- Next, run the script
  python preprocess_pdb_to_pkl.py
  which creates the .pkl files required for model training from the dataset. It defaults to a sample dataset provided with the code at ./data/. To use the original datasets, change the paths accordingly.
- To start a training run:
  python train.py trial_run --epochs 10
  Once this completes successfully, it creates a folder named trial_run under ./data/pkl/results/ containing the test results (test_results.csv) and the best model checkpoint (model_best.pth.tar). The remaining training arguments and their defaults are listed in arguments.py. Multi-GPU training on a single server is supported by default through PyTorch DataParallel; to enable it, expose the desired GPUs through the CUDA_VISIBLE_DEVICES environment variable.
- To get the final Pearson correlation scores, run:
  python correlation.py -file ./data/pkl/results/trial_run/test_results.csv
- We have published our best ProteinGCN model, trained on the Rosetta-300k dataset. To run the pretrained model on the sample data, execute:
  python train.py trial_testrun --pretrained ./pretrained/pretrained.pth.tar --epochs 0 --train 0 --val 0 --test 1
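For readers who want to see what the Pearson correlation in the evaluation step measures, here is a minimal standard-library sketch. It is not the repository's correlation.py: the function name and the example score lists are hypothetical, and since the column layout of test_results.csv is not described here, the sketch works on plain lists of target and predicted scores.

```python
import math
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sums of centered products and squares (the 1/n factors cancel).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical target and predicted quality scores for four models.
targets = [0.1, 0.4, 0.5, 0.9]
predictions = [0.2, 0.3, 0.6, 0.8]
r = pearson(targets, predictions)
```

A value close to 1 means the predicted scores rank and scale the models much like the targets do; the same function applies to per-residue (local) and per-protein (global) scores.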
Please cite the following paper if you use this code in your work.
@article {Sanyal2020.04.06.028266,
author = {Sanyal, Soumya and Anishchenko, Ivan and Dagar, Anirudh and Baker, David and Talukdar, Partha},
title = {ProteinGCN: Protein model quality assessment using Graph Convolutional Networks},
year = {2020},
doi = {10.1101/2020.04.06.028266},
publisher = {Cold Spring Harbor Laboratory},
URL = {https://www.biorxiv.org/content/early/2020/04/07/2020.04.06.028266},
journal = {bioRxiv}
}
For any clarifications, comments, or suggestions, please create an issue or contact Soumya.