PHIStruct (Phage-Host Interaction Prediction with Structure-Aware Protein Embeddings)


PHIStruct is a phage-host interaction prediction tool that uses structure-aware protein embeddings to represent the receptor-binding proteins (RBPs) of phages. By incorporating structure information, it improves on sequence-only protein embeddings and feature-engineered sequence properties, especially for phages whose RBPs have low sequence similarity to those of known phages.

Preprint: https://doi.org/10.1101/2024.08.24.609479

If you find our work useful, please consider citing:

@article{PHIStruct,
    author = {Gonzales, Mark Edward M. and Ureta, Jennifer C. and Shrestha, Anish M.S.},
    title = {PHIStruct: Improving phage-host interaction prediction at low sequence similarity settings using structure-aware protein embeddings},
    elocation-id = {2024.08.24.609479},
    year = {2024},
    doi = {10.1101/2024.08.24.609479},
    publisher = {Cold Spring Harbor Laboratory},
    URL = {https://www.biorxiv.org/content/early/2024/08/24/2024.08.24.609479},
    eprint = {https://www.biorxiv.org/content/early/2024/08/24/2024.08.24.609479.full.pdf},
    journal = {bioRxiv}
}


📰 News

  • 06 Nov 2024 - We presented our work at the 2024 Australian Bioinformatics and Computational Biology Society (ABACBS) National Conference in Sydney. Poster here.

♾️ Run on Google Colab


You can readily run PHIStruct on Google Colab, without the need to install anything on your own computer: http://phistruct.bioinfodlsu.com

🚀 Installation & Usage

Operating System: Windows (using WSL), Linux, or macOS

Clone the repository:

git clone https://github.com/bioinfodlsu/PHIStruct
cd PHIStruct

Create a virtual environment with the dependencies installed via Conda (we recommend using Miniconda):

conda env create -f environment.yaml

Activate this environment by running:

conda activate PHIStruct

Depending on your operating system, run the correct installation command (refer to the last column of the table below) to install and configure the remaining dependencies (you only need to do this once, that is, at installation):

| OS/Build | Command for Checking OS/Build | Installation Command |
| --- | --- | --- |
| Linux AVX2 Build | grep avx2 /proc/cpuinfo | bash init.sh avx2 |
| Linux SSE2 Build | grep sse2 /proc/cpuinfo | bash init.sh sse2 |
| Linux ARM64 Build | dpkg --print-architecture or uname -m | bash init.sh arm64 |
| macOS | | bash init.sh osx |

Note: Running the init.sh script may take a few minutes since it involves downloading a model (SaProt, around 5 GB) from Hugging Face.

Running PHIStruct

python3 phistruct.py --input <input_dir> --model <model_joblib> --output <results_dir>
  • Replace <input_dir> with the path to the directory storing the PDB files describing the structures of the receptor-binding proteins. Sample PDB files are provided here.
  • Replace <model_joblib> with the path to the trained model (recognized format: joblib or compressed joblib; framework: scikit-learn). Download our trained model from this link. You do not need to uncompress it, although doing so speeds up model loading at the cost of additional storage. Refer to this guide for the list of accepted compressed formats.
  • Replace <results_dir> with the path to the directory to which the results of running PHIStruct will be written. The results of running PHIStruct on the sample PDB files are provided here.

The results for each protein are written to a CSV file (without a header row). Each row contains two comma-separated values: a host genus and the corresponding prediction score (class probability). The rows are sorted in order of decreasing prediction score. Hence, the first row pertains to the top-ranked prediction.
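
As a minimal sketch of how one of these result files could be consumed downstream, the snippet below reads the rows with Python's standard csv module and returns the top-ranked predictions. The function name top_predictions and the sample rows are hypothetical and only for illustration; the only details taken from the description above are the headerless, two-column layout and the descending sort order.

```python
import csv
import io


def top_predictions(csv_file, k=3):
    """Return the first k (host_genus, score) pairs from a result file.

    csv_file is any iterable of lines, such as an open file or io.StringIO.
    Rows are assumed to be headerless, two-column, and already sorted by
    decreasing prediction score, so the first k rows are the top k.
    """
    predictions = []
    for row in csv.reader(csv_file):
        if not row:  # skip blank lines
            continue
        genus, score = row[0], float(row[1])
        predictions.append((genus, score))
        if len(predictions) == k:
            break
    return predictions


# Illustrative content only: the scores below are made up.
sample = io.StringIO("klebsiella,0.91\nescherichia,0.05\nenterobacter,0.02\n")
print(top_predictions(sample, k=2))
```

To read an actual result file, pass an open file handle instead of the in-memory sample, for example open(path, newline="") where path points to one of the CSV files in <results_dir>.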

Under the hood, this script first converts each protein into a structure-aware protein embedding using SaProt and then passes the embedding to a multilayer perceptron trained on all the entries in our dataset with host among the ESKAPEE genera (link). If your machine has a GPU, it will automatically be used to accelerate the protein embedding generation step.

Training PHIStruct

python3 train.py --input <training_dataset>
  • Replace <training_dataset> with the path to the training dataset. A sample can be downloaded here.

The training dataset should be formatted as a CSV file (without a header row) where each row corresponds to a training sample. The first column is for the protein IDs, the second column is for the host genera, and the next 1,280 columns are for the components of the SaProt embeddings.
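
Before training, it may help to check that a dataset follows this layout. The sketch below is one possible validation pass using only the standard library; the function name read_training_rows and the placeholder row are hypothetical and are not part of train.py. It assumes exactly 2 + 1,280 columns per row, as described above.

```python
import csv
import io

EMBEDDING_DIM = 1280  # number of SaProt embedding components per row


def read_training_rows(csv_file):
    """Yield (protein_id, host_genus, embedding) tuples from a headerless CSV.

    Raises ValueError if a row does not have exactly 2 + EMBEDDING_DIM columns
    or if an embedding component cannot be parsed as a float.
    """
    expected = 2 + EMBEDDING_DIM
    for line_number, row in enumerate(csv.reader(csv_file), start=1):
        if not row:  # skip blank lines
            continue
        if len(row) != expected:
            raise ValueError(
                f"row {line_number}: expected {expected} columns, got {len(row)}"
            )
        protein_id, host_genus = row[0], row[1]
        embedding = [float(value) for value in row[2:]]
        yield protein_id, host_genus, embedding


# Synthetic one-row example: the ID, genus, and all-zero embedding are placeholders.
row = ["protein_1", "escherichia"] + ["0.0"] * EMBEDDING_DIM
sample = io.StringIO(",".join(row) + "\n")
for protein_id, host_genus, embedding in read_training_rows(sample):
    print(protein_id, host_genus, len(embedding))
```

Because the function is a generator, it processes the file row by row and raises on the first malformed row rather than loading the whole dataset into memory.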

This script will output a gzip-compressed, serialized version of the trained model with filename phistruct_trained.joblib.gz.


📚 Description

Motivation: Recent computational approaches for predicting phage-host interaction have explored the use of sequence-only protein language models to produce embeddings of phage proteins without manual feature engineering. However, these embeddings do not directly capture protein structure information and structure-informed signals related to host specificity.

Method: We present PHIStruct, a multilayer perceptron that takes in structure-aware embeddings of receptor-binding proteins, generated via the structure-aware protein language model SaProt, and then predicts the host from among the ESKAPEE genera.

Results: Compared against recent tools, PHIStruct exhibits the best balance of precision and recall, with the highest and most stable F1 score across a wide range of confidence thresholds and sequence similarity settings. The margin in performance is most pronounced when the sequence similarity between the training and test sets drops below 40%, wherein, at a relatively high-confidence threshold of above 50%, PHIStruct presents a 7% to 9% increase in class-averaged F1 over machine learning tools that do not directly incorporate structure information, as well as a 5% to 6% increase over BLASTp and a 17% to 18% increase over PSI-BLAST.



🔬 Dataset of Predicted Structures of Receptor-Binding Proteins


We also release a dataset of protein structures, computationally predicted via ColabFold, of 19,081 non-redundant (i.e., with duplicates removed) receptor-binding proteins from 8,525 phages across 238 host genera. We identified these receptor-binding proteins based on GenBank annotations. For phage sequences without GenBank annotations, we employed a pipeline that uses the viral protein library PHROG and the machine learning model PhageRBPdetect.


🧪 Reproducing Our Results

Project Structure

The experiments folder contains the files and scripts for reproducing our results. Note that additional (large) files have to be downloaded (or generated) following the instructions in the Jupyter notebooks.


Directories

| Directory | Description |
| --- | --- |
| data | Contains the data (including the FASTA files and embeddings) |
| preprocessing | Contains text files related to the preprocessing of host information and the identification of annotated receptor-binding proteins |
| rbp_prediction | Contains the trained model PhageRBPdetect (in JSON format), used for the computational prediction of receptor-binding proteins. Downloaded from this repository (under the MIT License) |
| temp | Contains intermediate output files during preprocessing, exploratory data analysis, and performance evaluation |

Jupyter Notebooks

| Notebook | Description |
| --- | --- |
| 1. Sequence Preprocessing.ipynb | Preprocessing of host information and identification of annotated receptor-binding proteins |
| 2. RBP Computational Prediction.ipynb | Computational prediction of receptor-binding proteins |
| 3.0. Data Consolidation (SaProt).ipynb<br>3.1. Data Consolidation (ProstT5 - AA Tokens).ipynb<br>3.2. Data Consolidation (PST).ipynb<br>3.3. Data Consolidation (SaProt with Low-Confidence Masking).ipynb<br>3.4. Data Consolidation (SaProt with Structure Masking).ipynb<br>3.5. Data Consolidation (SaProt with Sequence Masking).ipynb<br>3.6. Data Consolidation (ProstT5 - 3Di Tokens).ipynb | Generation of CSV files consolidating the proteins, phage-host information, and embeddings |
| 4. Exploratory Data Analysis.ipynb | Exploratory data analysis |
| 5.0. Classifier Building & Evaluation (SaProt).ipynb<br>5.1. Benchmarking - Classifier Building & Evaluation (ProstT5 - AA Tokens).ipynb<br>5.2. Benchmarking - Classifier Building & Evaluation (PST).ipynb<br>5.3. Benchmarking - Classifier Building & Evaluation (ESM-1b).ipynb<br>5.4. Benchmarking - Classifier Building & Evaluation (ESM-2).ipynb<br>5.5. Benchmarking - Classifier Building & Evaluation (ProtT5).ipynb<br>5.6. Benchmarking - Classifier Building & Evaluation (SaProt with Low-Confidence Masking).ipynb<br>5.7. Benchmarking - Classifier Building & Evaluation (SaProt with Structure Masking).ipynb<br>5.8. Benchmarking - Classifier Building & Evaluation (SaProt with Sequence Masking).ipynb<br>5.9. Benchmarking - Classifier Building & Evaluation (ProstT5 - 3Di Tokens).ipynb<br>5.10. Benchmarking - Classifier Building & Evaluation (SeqVec).ipynb<br>5.11. Benchmarking - Classifier Building & Evaluation (Random Forest).ipynb<br>5.12. Benchmarking - Classifier Building & Evaluation (SVM).ipynb | Construction of phage-host interaction prediction model, benchmarking, and performance evaluation |
| 6.0. Comparison.ipynb<br>6.0. Comparison - Weighted.ipynb<br>6.1. Plotting - F1.ipynb<br>6.1. Plotting - F1 - Weighted.ipynb<br>6.2. Plotting - PR Curve.ipynb<br>6.2. Plotting - PR Curve - Weighted.ipynb<br>6.3. Confusion Matrix.ipynb | Tabular and graphical comparison of the performance of our model versus benchmarks |

Python Scripts

| Script | Description |
| --- | --- |
| ClassificationUtil.py | Contains the utility functions for constructing the training and test sets, building the phage-host interaction prediction model, and evaluating its performance |
| ConstantsUtil.py | Contains the constants used in the notebooks and scripts |
| MLPDropout.py | Implements a multilayer perceptron with dropout in scikit-learn |
| RBPPredictionUtil.py | Contains the utility functions for the computational prediction of receptor-binding proteins |
| SequenceParsingUtil.py | Contains the utility functions for preprocessing host information and identifying annotated receptor-binding proteins |
| StructureUtil.py | Contains the utility functions for consolidating the embeddings generated via structure-aware protein language models |

Folder Structure

Once you have cloned this repository and finished downloading (or generating) all the additional required files following the instructions in the Jupyter notebooks, your folder structure should be similar to the one below:

  • PHIStruct (root)
    • experiments
      • data
        • GenomesDB (Download and unzip)
          • AB002632
          • ...
        • inphared
          • consolidated (Download and unzip)
            • rbp.csv
            • ...
          • embeddings
            • prottransbert (Download and unzip)
              • complete
              • hypothetical
              • rbp
          • fasta (Download and unzip)
            • complete
            • hypothetical
            • nucleotide
            • rbp
          • structure
            • pdb (Download and unzip)
            • rbp_saprot_embeddings (Download and unzip)
              • AAA74324.1_relaxed.r3.pdb.pt
            • rbp_saprot_mask_embeddings (Download and unzip)
              • AAA74324.1_relaxed.r3.pdb.pt
            • rbp_saprot_seq_mask_embeddings (Download and unzip)
              • AAA74324.1_relaxed.r3.pdb.pt
            • rbp_saprot_struct_mask_embeddings (Download and unzip)
              • AAA74324.1_relaxed.r3.pdb.pt
            • rbp_pst_embeddings (Download and unzip)
              • AAA74324.1_relaxed.r3.pdb.pt
            • rbp_prostt5_embeddings.h5 (Download)
            • rbp_prostt5_3di_embeddings.h5 (Download)
            • rbp_saprot_mask_relaxed_r3.csv (Download)
            • rbp_saprot_relaxed_r3.csv (Download)
            • rbp_saprot_seq_mask_relaxed_r3.csv (Download)
            • rbp_saprot_struct_mask_relaxed_r3.csv (Download)
            • rbp_pst_relaxed_r3.csv (Download)
            • rbp_prostt5_relaxed_r3.csv (Download)
            • rbp_prostt5_3di_relaxed_r3.csv (Download)
        • 3Oct2023_data_excluding_refseq.tsv
        • 3Oct2023_phages_downloaded_from_genbank.gb (Download)
      • preprocessing
      • rbp_prediction
      • temp
      • 1. Sequence Preprocessing.ipynb
      • ...
      • ClassificationUtil.py
      • ...


Dependencies

Operating System: Windows (using WSL), Linux, or macOS

Create a virtual environment with the dependencies installed via Conda (we recommend using Miniconda):

conda env create -f environment_experiments.yaml

Activate this environment by running:

conda activate PHIStruct-experiments


💻 Authors

This is a research project under the Bioinformatics Laboratory, Advanced Research Institute for Informatics, Computing and Networking, De La Salle University, Philippines.

This research was partly funded by the Department of Science and Technology Philippine Council for Health Research and Development (DOST-PCHRD) under the e-Asia JRP 2021 Alternative therapeutics to tackle AMR pathogens (ATTACK-AMR) program.

This research was supported with Cloud TPUs from Google's TPU Research Cloud (TRC) and with computing resources from the Machine Learning eResearch Platform (MLeRP) of Monash University, University of Queensland, and Queensland Cyber Infrastructure Foundation Ltd.