
TAPE

Chinese Version | English Version

Background

In recent years, with the development of sequencing technology, the scale of protein sequence databases has grown significantly. However, obtaining labeled protein sequences remains very expensive, as it requires biological experiments. Moreover, with only a small number of labeled samples, models are prone to overfitting. Borrowing ideas from natural language processing (NLP), we can pre-train on large numbers of unlabeled sequences with self-supervised learning. In this way, we can extract useful biological information from proteins and transfer it to labeled downstream tasks, making their training faster and their convergence more stable. Following the paper TAPE, this work provides model implementations of Transformer, LSTM, and ResNet.

Instructions

Training Models

We offer multiple training methods:

  • Multi-CPU training.
  • Single-GPU training.
  • Multi-GPU training.

Multi-CPU Training / Single-GPU Training

An example of multi-threaded CPU training or single-card GPU training is shown below:

python train.py \
        --train_data ./train_data # Directory of training data for training models, including multiple training files. \
        --valid_data ./valid_data # Directory of validation data for evaluating models, including multiple validation files. \
        --lr 0.0001 # Basic learning rate. \
        --use_cuda # Whether to use CUDA (GPU) for training. \
        ... # Model parameter settings and task parameter settings will be introduced in the following chapters.

Multi-GPU Training

We use paddle.distributed.launch for multi-GPU training, and the parameter "--distributed" should be added. The other parameters are the same as for multi-CPU / single-GPU training. An example of multi-GPU training is shown below:

python -m paddle.distributed.launch --log_dir log_dir train.py # Specify the log directory by "--log_dir" \
        --train_data ./train_data # Directory of training data for training models, including multiple training files. \
        --valid_data ./valid_data # Directory of validation data for evaluating models, including multiple validation files. \
        --lr 0.0001 # Basic learning rate. \
        --use_cuda # Only GPU is supported for now. \
        --distributed # Distributed training. \
        ... # Model parameter settings and task parameter settings will be introduced in the following chapters.

Evaluating Models

Model evaluation is similar to model training. Currently, only multi-CPU and single-GPU evaluation are supported.

python eval.py \
        --data ./eval_data # Directory of test data for evaluating models, including multiple test files. \
        --eval_model ./model # The model to be evaluated. \
        --use_cuda # Run the model on GPU; omit this flag to run on CPU. \
        ... # Model parameter settings and task parameter settings will be introduced in the following chapters.

Model Inference

Model inference is similar to model evaluation. Currently, only multi-CPU and single-GPU prediction are supported.

cat predict_file | # The file that contains the amino acid sequences. \
python predict.py \
        --batch_size 128 # The upper bound of the batch size. As protein sequences can be long, the model dynamically adjusts the batch size according to sequence length. \
        --model ./model # The model. \
        --use_cuda # Run the model on GPU; omit this flag to run on CPU. \
        ... # Model parameter settings and task parameter settings will be introduced in the following chapters.

Sequence Models

We provide the models Transformer, LSTM, and ResNet. The model-related parameters should be included in "--model_config". The model type (transformer, lstm, resnet) is set via "model_type" in the model_config.

python train.py \
        ... # The way to set training parameters has been introduced. \
        --model_config ./transformer_config # The configuration file of the model, organized by json format. \
        ... # Task parameter settings will be introduced in the following chapters.

Transformer

The Transformer is widely used for semantic modeling in natural language processing. To use the Transformer, set the following parameters:

  • hidden_size: The hidden size of transformer.
  • layer_num: The number of layers in transformer.
  • head_num: The number of attention heads in each layer.

For details of the Transformer, please refer to the papers listed in the Reference section.
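
As a reference, the Transformer-related portion of a model_config might look like the following sketch. The values shown are only illustrative defaults (they match the complete example at the end of this document), and the task-related fields are introduced in the following chapters.

{
    "model_type": "transformer",
    "hidden_size": 512,
    "layer_num": 12,
    "head_num": 8
}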

LSTM

We use a multilayer bidirectional LSTM. To use the LSTM, set the following parameters:

  • hidden_size: The hidden size of LSTM.
  • layer_num: The number of layers.

For details of the LSTM, please refer to the papers listed in the Reference section.
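
Accordingly, the LSTM-related portion of a model_config might look like the sketch below. The values are placeholders, not the hyper-parameters used for the results reported later.

{
    "model_type": "lstm",
    "hidden_size": 512,
    "layer_num": 3
}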

ResNet

We use a multilayer ResNet. To use ResNet, set the following parameters:

  • hidden_size: The hidden size of ResNet.
  • layer_num: The number of layers.
  • filter_num: The number of filters in the convolution layers.

For details of ResNet, please refer to the papers listed in the Reference section.
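
Accordingly, the ResNet-related portion of a model_config might look like the sketch below; again, the values are placeholders.

{
    "model_type": "resnet",
    "hidden_size": 512,
    "layer_num": 8,
    "filter_num": 256
}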

Other Parameters

Other parameters can be set in the model_config to avoid over-fitting or excessively large parameter values.

  • dropout: The dropout ratio of model parameters.
  • weight_decay: Parameter decay ratio, used to avoid excessive parameter values.

Protein Related Tasks

Following the paper TAPE, we reproduce the tasks below using PaddleHelix.

Pretraining Tasks

Pfam

The Pfam dataset contains 30 million protein sequences and can be used for pretraining models. Set the following fields in the model_config:

...
task: "pretrain",
...

Supervised Tasks

Secondary Structure

The secondary structure dataset consists of two sequence annotation tasks: a 3-category annotation task and an 8-category annotation task. For the 3-category task, set the following fields in the model_config:

...
task: "seq_classification",
class_num: 3,
label_name: "labels3",
...
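
For the 8-category annotation task, the fields change accordingly. The sketch below is an assumption: class_num becomes 8, and the label name "labels8" is inferred by analogy with "labels3", so please verify the exact key against the dataset files.

{
    "task": "seq_classification",
    "class_num": 8,
    "label_name": "labels8",
    "comment": "class_num and label_name here are assumptions; verify against the dataset."
}
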
Evaluation Results

The results of the fine-tuned models are shown below.

Three-way Accuracy:

Model        CB513   CASP12   TS115
Transformer  0.732   0.717    0.771
LSTM         0.758   0.711    0.785
ResNet       0.747   0.713    0.777

Eight-way Accuracy:

Model        CB513   CASP12   TS115
Transformer  0.586   0.584    0.649
LSTM         0.617   0.589    0.667
ResNet       0.606   0.587    0.661

Remote Homology

Remote homology is a classification task with 1195 classes. Set the following fields in the model_config:

...
task: "classification",
class_num: 1195,
label_name: "labels",
...
Evaluation Results

The results of the fine-tuned models are shown below.

Accuracy:

Model        Fold    Superfamily  Family
Transformer  0.247   0.355        0.886
LSTM         0.219   0.314        0.820
ResNet       0.153   0.136        0.535

Fluorescence

Fluorescence is a regression task. Set the following fields in the model_config:

...
task: "regression",
label_name: "labels",
...
Evaluation Results

The results of the fine-tuned models are shown below.

Spearman:

Model        Test
Transformer  0.680
LSTM         0.533
ResNet       0.573

Stability

Stability is a regression task. Set the following fields in the model_config:

...
task: "regression",
label_name: "labels",
...
Evaluation Results

The results of the fine-tuned models are shown below.

Spearman:

Model        Test
Transformer  0.807
LSTM         0.785
ResNet       0.792

Warm Start / Finetuning

We can set the parameter "--init_model" to warm-start the model or to fine-tune it on the supervised tasks during training.

python train.py \
        ... \
        --init_model ./init_model # Directory of the initialization model. If this parameter is unset, the model is randomly initialized. \
        --hot_start "hot_start" # Whether the init_model is used for "hot_start" or "finetune". \
        ... 

Complete Example

We provide multiple training and evaluation examples in the folder demos. Here is an example of training a Transformer on the secondary structure task.

#!/bin/bash
model_type="transformer" # candidate model_types: transformer, lstm, resnet
task="secondary_structure" # candidate tasks: pfam, secondary_structure, remote_homology, fluorescence, stability
model_config="./configs/${model_type}_${task}_config.json"
train_data="./secondary_structure_toy_data/"
valid_data="./secondary_structure_toy_data/"

export PYTHONPATH="../../../"
python train.py \
        --train_data ${train_data} \
        --valid_data ${valid_data} \
        --model_config ${model_config}

The following shows the corresponding model_config:

{
    "model_name": "secondary_structure",
    "task": "seq_classification",
    "class_num": 3,
    "label_name": "labels3",

    "hidden_size": 512,
    "layer_num": 12,
    "head_num": 8,

    "comment": "The following hyper-parameters are optional.",
    "dropout": 0.1,
    "weight_decay": 0.01
}

Data

The datasets can be downloaded from the following URLs:

  • pfam: raw, npz
  • secondary structure: all
  • remote homology: all
  • fluorescence: all
  • stability: all

Pre-trained Models

The pre-trained models can be downloaded from the following URLs:

  • Transformer: model
  • LSTM: model
  • ResNet: model

Reference

Paper-related

We mainly refer to the paper TAPE. The way we train the models and the hyper-parameters might differ.

TAPE:

@inproceedings{tape2019, author = {Rao, Roshan and Bhattacharya, Nicholas and Thomas, Neil and Duan, Yan and Chen, Xi and Canny, John and Abbeel, Pieter and Song, Yun S}, title = {Evaluating Protein Transfer Learning with TAPE}, booktitle = {Advances in Neural Information Processing Systems}, year = {2019} }

Transformer

@inproceedings{vaswani2017attention, title={Attention is all you need}, author={Vaswani, Ashish and Shazeer, Noam and Parmar, Niki and Uszkoreit, Jakob and Jones, Llion and Gomez, Aidan N and Kaiser, {\L}ukasz and Polosukhin, Illia}, booktitle={Advances in neural information processing systems}, pages={5998--6008}, year={2017} }

@article{devlin2018bert, title={Bert: Pre-training of deep bidirectional transformers for language understanding}, author={Devlin, Jacob and Chang, Ming-Wei and Lee, Kenton and Toutanova, Kristina}, journal={arXiv preprint arXiv:1810.04805}, year={2018} }

LSTM

@article{hochreiter1997long, title={Long short-term memory}, author={Hochreiter, Sepp and Schmidhuber, J{\"u}rgen}, journal={Neural computation}, volume={9}, number={8}, pages={1735--1780}, year={1997}, publisher={MIT Press} }

ResNet

@article{szegedy2016inception, title={Inception-v4, inception-resnet and the impact of residual connections on learning}, author={Szegedy, Christian and Ioffe, Sergey and Vanhoucke, Vincent and Alemi, Alex}, journal={arXiv preprint arXiv:1602.07261}, year={2016} }

Data-related

We further process the data from the paper TAPE to train the models.

Pfam (Pretraining):

@article{pfam, author = {El-Gebali, Sara and Mistry, Jaina and Bateman, Alex and Eddy, Sean R and Luciani, Aur{\'{e}}lien and Potter, Simon C and Qureshi, Matloob and Richardson, Lorna J and Salazar, Gustavo A and Smart, Alfredo and Sonnhammer, Erik L L and Hirsh, Layla and Paladin, Lisanna and Piovesan, Damiano and Tosatto, Silvio C E and Finn, Robert D}, doi = {10.1093/nar/gky995}, issn = {0305-1048}, journal = {Nucleic Acids Research}, keywords = {community,protein domains,tandem repeat sequences}, number = {D1}, pages = {D427--D432}, publisher = {Oxford University Press}, title = {{The Pfam protein families database in 2019}}, url = {https://academic.oup.com/nar/article/47/D1/D427/5144153}, volume = {47}, year = {2019} }

SCOPe: (Remote Homology and Contact)

@article{scop, title={SCOPe: Structural Classification of Proteins—extended, integrating SCOP and ASTRAL data and classification of new structures}, author={Fox, Naomi K and Brenner, Steven E and Chandonia, John-Marc}, journal={Nucleic acids research}, volume={42}, number={D1}, pages={D304--D309}, year={2013}, publisher={Oxford University Press} }

PDB: (Secondary Structure and Contact)

@article{pdb, title={The protein data bank}, author={Berman, Helen M and Westbrook, John and Feng, Zukang and Gilliland, Gary and Bhat, Talapady N and Weissig, Helge and Shindyalov, Ilya N and Bourne, Philip E}, journal={Nucleic acids research}, volume={28}, number={1}, pages={235--242}, year={2000}, publisher={Oxford University Press} }

CASP12: (Secondary Structure and Contact)

@article{casp, author = {Moult, John and Fidelis, Krzysztof and Kryshtafovych, Andriy and Schwede, Torsten and Tramontano, Anna}, doi = {10.1002/prot.25415}, issn = {08873585}, journal = {Proteins: Structure, Function, and Bioinformatics}, keywords = {CASP,community wide experiment,protein structure prediction}, pages = {7--15}, publisher = {John Wiley {&} Sons, Ltd}, title = {{Critical assessment of methods of protein structure prediction (CASP)-Round XII}}, url = {http://doi.wiley.com/10.1002/prot.25415}, volume = {86}, year = {2018} }

NetSurfP2.0: (Secondary Structure)

@article{netsurfp, title={NetSurfP-2.0: Improved prediction of protein structural features by integrated deep learning}, author={Klausen, Michael Schantz and Jespersen, Martin Closter and Nielsen, Henrik and Jensen, Kamilla Kjaergaard and Jurtz, Vanessa Isabell and Soenderby, Casper Kaae and Sommer, Morten Otto Alexander and Winther, Ole and Nielsen, Morten and Petersen, Bent and others}, journal={Proteins: Structure, Function, and Bioinformatics}, year={2019}, publisher={Wiley Online Library} }

ProteinNet: (Contact)

@article{proteinnet, title={ProteinNet: a standardized data set for machine learning of protein structure}, author={AlQuraishi, Mohammed}, journal={arXiv preprint arXiv:1902.00249}, year={2019} }

Fluorescence:

@article{sarkisyan2016, title={Local fitness landscape of the green fluorescent protein}, author={Sarkisyan, Karen S and Bolotin, Dmitry A and Meer, Margarita V and Usmanova, Dinara R and Mishin, Alexander S and Sharonov, George V and Ivankov, Dmitry N and Bozhanova, Nina G and Baranov, Mikhail S and Soylemez, Onuralp and others}, journal={Nature}, volume={533}, number={7603}, pages={397}, year={2016}, publisher={Nature Publishing Group} }

Stability:

@article{rocklin2017, title={Global analysis of protein folding using massively parallel design, synthesis, and testing}, author={Rocklin, Gabriel J and Chidyausiku, Tamuka M and Goreshnik, Inna and Ford, Alex and Houliston, Scott and Lemak, Alexander and Carter, Lauren and Ravichandran, Rashmi and Mulligan, Vikram K and Chevalier, Aaron and others}, journal={Science}, volume={357}, number={6347}, pages={168--175}, year={2017}, publisher={American Association for the Advancement of Science} }