Merge pull request #4 from jrurogers/master

Updates for corrected HSM models
aqlaboratory · Feb 16, 2023 · commit 53edf5d
2 parents 980513e + ffe24fc
Showing 62 changed files with 26,107 additions and 45 deletions.
diff --git a/.gitignore b/.gitignore
@@ -1,6 +1,8 @@
 data/
 analysis/
 
+.DS_Store
+
 #####################################
 # Standard Python .gitignore commands 
 # below this point - JMC

diff --git a/README.md b/README.md
@@ -4,37 +4,34 @@
 
 This repository implements the hierarchical statistical mechanical (HSM) model described in the paper [Biophysical prediction of protein-peptide interactions and signaling networks using machine learning.](https://doi.org/10.1038/s41592-019-0687-1) 
 
-An **associated website** is available at [proteinpeptide.io](https://proteinpeptide.io). The website is built to facilitate interactions with results from the model including: (1) specific domain-peptide and protein-protein predictions, (2) the resulting networks, and (3) structures colored using the inferred energy functions from the model. Code for the website is available via the parallel repo: [aqlaboratory/hsm-web](https://github.com/aqlaboratory/hsm-web).
+An **associated website** is available at [proteinpeptide.io](https://proteinpeptide.io). The website is built to facilitate interactions with results from the model including: (1) specific domain-peptide and protein-protein predictions, (2) the resulting networks, and (3) structures colored using the inferred energy functions from the model. Code for the website is available via the parallel repo: [aqlaboratory/hsm-web](https://github.com/aqlaboratory/hsm-web). Note that the results on the website were obtained using an [old model](#model-updates).
 
 This file documents how this package might be [used](#usage), the [location of associated data](#data), and [other metadata](#reference). 
 
 ## Usage
 
-The model was implemented in Python (>= 3.5) primarily using TensorFlow (>= 1.4) ([Software Requirements](#requirements)). To work with this repository, either download pre-processed data (see below) or include new data. The folder contains two major directories: `train/` and `predict/`. Each directory is accompanied by a `README.md` file detailing usage. 
+The model was implemented in Python (>= 3.5) primarily using TensorFlow (>= 1.14) ([Software Requirements](#requirements)). To work with this repository, either download pre-processed data (see below) or include new data. The folder contains three major directories: `train/`, `predict/`, and `publication_analysis/`. Each directory is accompanied by a `README.md` file detailing usage. 
 
 To train / re-train new models, use the `train.py` script in `train/`. To make predictions using a model, use one of two scripts, `predict_domains.py` and `predict_proteins.py`, for predicting either domain-peptide interactions or protein-protein interactions. Scripts are designed with a CLI and should be used from the command line: 
 
 ```bash
 python [SCRIPT] [OPTIONS]
 ```
 
-Options for any script may be listed using the `-h/--help` flag. 
+Options for any script may be listed using the `-h/--help` flag.
 
-Pre-processed / pre-trained data and models may be downloaded from [figshare (doi:10.6084/m9.figshare.11520552)](https://doi.org/10.6084/m9.figshare.11520552) and should be unpacked at `data/` in this directory. This directory may also be used as an example of how to structure input and output files / directories.
+To reproduce analysis and figures presented in the paper [Biophysical prediction of protein-peptide interactions and signaling networks using machine learning.](https://doi.org/10.1038/s41592-019-0687-1), use the scripts in `publication_analysis/`.
 
-An alternative use case would be to train / re-train a new model in the `train/` code and make new predictions using the `predict/` code. 
+Pre-trained models are released with this repo. An alternative use case would be to train / re-train a new model in the `train/` code and make new predictions using the `predict/` code. 
 
-### Kinase model issue
+### Model updates
 
-We have identified an issue in the dataset used to train the kinase model. For the time being, we suggest not using the kinase model until further updates are provided.
+We identified an issue in the original datasets used to train the model published in [Biophysical prediction of protein-peptide interactions and signaling networks using machine learning.](https://doi.org/10.1038/s41592-019-0687-1). As of February 15, 2023, we have corrected the datasets (released on [figshare (doi:10.6084/m9.figshare.22105529)](https://doi.org/10.6084/m9.figshare.22105529)), and replaced the original models released with this repo with corrected ones. Please verify that you use the corrected models for all predictions (see documentation in `predict/`).
 
 ## Data
 
-As reported, domain-peptide and protein-protein interactions are available via [figshare (doi:10.6084/m9.figshare.10084745)](https://doi.org/10.6084/m9.figshare.10084745). In addition, we provide pre-processed data for this repository and the website repository, 
+All associated data may be downloaded from [figshare (doi:10.6084/m9.figshare.22105529)](https://doi.org/10.6084/m9.figshare.22105529).
 
-- Raw training data: [figshare - doi:10.6084/m9.figshare.11520297](https://doi.org/10.6084/m9.figshare.11520297). Raw domain-peptide training data used to train the core HSM models. Unpack to `data/` in this directory.
-- Pre-processed data: [figshare - doi:10.6084/m9.figshare.11520552](https://doi.org/10.6084/m9.figshare.11520552). Needed to work with this repo. Unpack to `data/` in this directory.
-- Data supporting the website at [proteinpeptide.io](https://proteinpeptide.io)
 
 ## Requirements
 - Python (>= 3.5)
@@ -44,6 +41,7 @@ As reported, domain-peptide and protein-protein interactions are available via [
 - scikit-learn (0.20)
 - tqdm (4.41) (Progressbar. Not strictly necessary for functionality; needed to ensure package runs.)
 
+
 ## Reference
 Please reference the associated publication:
 

diff --git a/amino_acid_ordering.txt b/amino_acid_ordering.txt
@@ -0,0 +1,21 @@
+A
+C
+D
+E
+F
+G
+H
+I
+K
+L
+M
+N
+P
+Q
+R
+S
+T
+V
+W
+Y
+y
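The new `amino_acid_ordering.txt` lists 21 symbols, one per line: the 20 standard residues plus a lowercase `y`. A minimal sketch of turning that ordering into an index lookup follows. The helper names are hypothetical and not taken from the repository, and reading `y` as phosphotyrosine is an inference from the peptide alignment table in `predict/README.md`.

```python
# The 21 symbols from amino_acid_ordering.txt, in file order.
# Assumption: lowercase "y" marks phosphotyrosine, as in the README example
# `AAAAAAAyAAAAAAA`.
AMINO_ACID_ORDERING = list("ACDEFGHIKLMNPQRSTVWYy")


def build_index(symbols):
    """Map each symbol to its 0-based position and reject duplicates."""
    index = {}
    for position, symbol in enumerate(symbols):
        if symbol in index:
            raise ValueError("duplicate symbol: {!r}".format(symbol))
        index[symbol] = position
    return index


def read_ordering(path):
    """Read one symbol per line, skipping blank lines (hypothetical helper)."""
    with open(path) as handle:
        return [line.strip() for line in handle if line.strip()]


AA_INDEX = build_index(AMINO_ACID_ORDERING)
```

A lookup such as `AA_INDEX["y"]` is case-sensitive, so `Y` and `y` stay distinct symbols.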
diff --git a/predict/README.md b/predict/README.md
@@ -20,9 +20,12 @@ Additional options for using either script may be listed using the `-h/--help` f
 The basic steps for predicting a new interaction are:
 ### 0. Pre-process data and models.
 
-By default, the code assumes that models are located at `predict/models/` and pre-processed data, which can be downloaded from [figshare (doi:10.6084/m9.figshare.11520552)](https://doi.org/10.6084/m9.figshare.11520552), should be available at `data/predict`. New data must be passed explicitly to the code (see the next section). Output model files should be the same as formatted by `output_models.py` in the `train/` directory. 
+By default, the code assumes that models are located at `predict/models/`. Output model files should be the same as formatted by `output_models.py` in the `train/` directory.
+
+Pre-processed input domain and peptide metadata is available from [figshare (doi:10.6084/m9.figshare.22105529)](https://doi.org/10.6084/m9.figshare.22105529), specifically in `data/ppi_data/metadata`.
 
 Input domains files should have the format:
+
 ```
 Domain-Protein-Identifier,Aligned-Domain-Sequence,Domain-Type
 ```
@@ -44,6 +47,7 @@ python predict_domains.py [INPUT DOMAINS METADATA] [INPUT PEPTIDES METADATA] [OP
 ```
 
 Domain-peptide interactions are computed for all valid pairs (*e.g.* pairs that have an associated model). The two major options, `-m/--models` and `--model-format`, are useful when using newly trained models. `-m/--models` describes a directory listing new model files (each model file should specify the associated domain type). `--model-format` describes the model metadata as a csv file:
+
 ```
 Domain-Type,Peptide-Type,Domain-Alignment-Length,Peptide-Alignment-Length,Peptide-Alignment-Is-Fixed,Model-Filename
 ```
@@ -55,34 +59,49 @@ The domain and peptide alignment lengths refer to the domain / peptide alignment
 Code used for predicting protein-protein interactions is located in the predict/ directory in this repository. The functionality should primarily be accessed via the `predict_proteins.py` script.
 
 ```python
-python predict_proteins.py [-p [INPUT PPI PAIRS]] [OPTIONS] 
+python predict_proteins.py [--ppi_pairs [INPUT PPI PAIRS]] [OPTIONS] 
 ```
 Additional options for using either script may be listed using the `-h/--help` flag. 
 
 ## 0. Pre-process data and models.
 
-By default, the `predict_proteins.py` script also assumes models are located at `predict/models/` and pre-processed data, which can be downloaded via [figshare (doi:10.6084/m9.figshare.11520552)](https://doi.org/10.6084/m9.figshare.11520552), are available at `data/metadata`. New data must be passed explicitly to the code (see the next section). The same models files may be used in both domain-peptide and protein-protein interaction prediction. To use new models, the same steps to specify the new models must be passed to `predict_proteins.py`. In addition, the models requiire metadata files (by default, stored in `data/metadata`) that describe either the domain or peptide composition of proteins. Metadata are formatted as Python dictionaries (stored as pickle'd files) with the format: 
+By default, the `predict_proteins.py` script also assumes models are located at `predict/models/`. The same model files may be used in both domain-peptide and protein-protein interaction prediction. To use new models, the same steps to specify the new models must be passed to `predict_proteins.py` as described above.
+
+In addition, the models *require* metadata files that describe the domain and peptide composition of proteins. By default, the code assumes that pre-processed metadata for domains and peptidic sites identified in the human proteome is available at `../data/ppi_data/metadata/`. Metadata (`data/ppi_data/metadata`) can be downloaded from [figshare (doi:10.6084/m9.figshare.22105529)](https://doi.org/10.6084/m9.figshare.22105529). New metadata must be passed explicitly to the code (see the next section).
 
 ## 1. Run predictions
 
 Predictions can be computed using the described script:
 
-```python
+```
 python predict_proteins.py [--ppi_pairs [INPUT PPI PAIRS]] [OPTIONS] 
 ```
 The `INPUT PPI PAIRS` argument (passed using `--ppi_pairs`) is a csv file listing the protein pairs to predict, with one pair of protein IDs per line (`<ID 1>,<ID 2>`). These IDs should reference IDs in the metadata files. If no pairs are passed, all valid pairs are returned. Different metadata files may be passed in using the `--domain_metadata` and `--peptide_metadata` options.  
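The pairs file is a plain two-column csv with one `<ID 1>,<ID 2>` pair per line. A minimal sketch of writing and re-reading such a file is below. The function names are hypothetical, the absence of a header row is an assumption, and any IDs passed in are placeholders that would need to match IDs in the metadata files.

```python
import csv


def write_ppi_pairs(pairs, path):
    """Write (id_1, id_2) tuples as one `<ID 1>,<ID 2>` row per line."""
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle)
        for id_1, id_2 in pairs:
            writer.writerow([id_1, id_2])


def read_ppi_pairs(path):
    """Read the pairs back, failing loudly on rows that are not pairs."""
    pairs = []
    with open(path, newline="") as handle:
        for line_number, row in enumerate(csv.reader(handle), start=1):
            if not row:
                continue  # tolerate blank lines
            if len(row) != 2:
                raise ValueError(
                    "line {}: expected 2 IDs, got {}".format(line_number, len(row))
                )
            pairs.append((row[0], row[1]))
    return pairs
```

The resulting file would then be passed as `--ppi_pairs` to `predict_proteins.py`.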
 
 ## Pre-trained Models
-Released with this codebase are pre-trained models for using HSM / D to make novel predictions. The code assumes that the input sequences are correctly aligned.  
+Released with this codebase are pre-trained models for using HSM/D to make novel predictions. Pre-trained model files can be verified with the following MD5 hashes:
+
+```
+MD5 (amino_acid_ordering.txt) = c6458d6bd0cd9c5660632557b12cb28a
+MD5 (hsm_pretrained/Kinase_TK.npz) = 746c6f4901ec85965faca7fe47ede548
+MD5 (hsm_pretrained/PDZ.npz) = 0e34d0b3894c28f077122d82a2ffc910
+MD5 (hsm_pretrained/PTB.npz) = d1c0e700e5f4f395eed38683433802a3
+MD5 (hsm_pretrained/PTP.npz) = a8a1cf42dac2cfd16ec8757d20743ea7
+MD5 (hsm_pretrained/SH2.npz) = 650303cb9088ec31a8305e73e9420b11
+MD5 (hsm_pretrained/SH3.npz) = 802cb0a9fccb1fcfce9480993407a98a
+MD5 (hsm_pretrained/WH1.npz) = 60e72ea081ed8c955ed5ec88fdf4b7b7
+MD5 (hsm_pretrained/WW.npz) = 6aaf0473dba3aa32fe1336332503c09b
+MD5 (hsm_pretrained/model_formats.csv) = 4e6f1842a6307ed5891ad9a9e9c67aa0
+```
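One way to check the hashes listed above is sketched here with Python's standard `hashlib`. The function names are hypothetical, and the relative paths assume the check runs against `predict/models/`, where this commit places `amino_acid_ordering.txt` and `hsm_pretrained/`.

```python
import hashlib
import os

# Expected hashes copied from the list in predict/README.md.
EXPECTED_MD5 = {
    "amino_acid_ordering.txt": "c6458d6bd0cd9c5660632557b12cb28a",
    "hsm_pretrained/Kinase_TK.npz": "746c6f4901ec85965faca7fe47ede548",
    "hsm_pretrained/PDZ.npz": "0e34d0b3894c28f077122d82a2ffc910",
    "hsm_pretrained/PTB.npz": "d1c0e700e5f4f395eed38683433802a3",
    "hsm_pretrained/PTP.npz": "a8a1cf42dac2cfd16ec8757d20743ea7",
    "hsm_pretrained/SH2.npz": "650303cb9088ec31a8305e73e9420b11",
    "hsm_pretrained/SH3.npz": "802cb0a9fccb1fcfce9480993407a98a",
    "hsm_pretrained/WH1.npz": "60e72ea081ed8c955ed5ec88fdf4b7b7",
    "hsm_pretrained/WW.npz": "6aaf0473dba3aa32fe1336332503c09b",
    "hsm_pretrained/model_formats.csv": "4e6f1842a6307ed5891ad9a9e9c67aa0",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Hash a file in chunks so large model files are not loaded at once."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_mismatches(models_dir, expected=EXPECTED_MD5):
    """Return (relative_path, expected, actual) for every file that differs.

    `actual` is None when the file is missing.
    """
    mismatches = []
    for relative_path, expected_hash in sorted(expected.items()):
        full_path = os.path.join(models_dir, relative_path)
        actual = md5_of_file(full_path) if os.path.isfile(full_path) else None
        if actual != expected_hash:
            mismatches.append((relative_path, expected_hash, actual))
    return mismatches
```

An empty list from `find_mismatches("predict/models")` would indicate that every listed file is present and matches its published hash.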
 
-Models were trained using the domain alignments released with this codebase (in the `domain_metadata.csv` file in `predictions.tar.gz` in the [figshare (doi:10.6084/m9.figshare.11520552)](https://doi.org/10.6084/m9.figshare.11520552) repository. 
+The code assumes that the input sequences are correctly aligned. Models were trained using the domain alignments released on [figshare (doi:10.6084/m9.figshare.22105529)](https://doi.org/10.6084/m9.figshare.22105529) in `data/ppi_data/metadata/domain_metadata.csv`. MSAs constructed for each PBD family to obtain these alignments are also released on [figshare (doi:10.6084/m9.figshare.22105529)](https://doi.org/10.6084/m9.figshare.22105529) in `results/data/alignment/`.
 
 Peptides are aligned as follows:
 
 | **Peptidic-type** | **Alignment** |
 | ---------- | ---------------- |
 | phosphosite | A 15 residue window aligned on the central phosphotyrosine. For example, `AAAAAAAyAAAAAAA`.|
-| C-terminal | A 6 residue windo aligned (to the right) on the C-terminus. For example, `AAAAAA`. Note, alignment offsets should be to the right (*e.g.* `AAAA` would become `--AAAA`).|
+| C-terminal | An 8-residue window aligned (to the right) on the C-terminus. For example, `AAAAAAAA`. Note, alignment offsets should be to the right (*e.g.* `AAAAAA` would become `--AAAAAA`).|
 | polyproline | Any residue sequence. polyproline peptides are "scanning", or the likelihood is computed over the whole sequence.|
 
 `A` denotes any amino acid. 
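The fixed-width alignments in the table can be sketched as follows. This is an illustration under stated assumptions rather than the repository's preprocessing code: the function names are hypothetical, truncating longer peptides to their last 8 residues is an assumption, and padding phosphosite windows with `-` near the protein termini is also an assumption.

```python
GAP = "-"


def align_c_terminal(peptide, width=8):
    """Right-align a peptide on its C-terminus.

    Keeps the last `width` residues and left-pads shorter peptides with gaps,
    so `AAAAAA` becomes `--AAAAAA` as in the README.
    """
    return peptide[-width:].rjust(width, GAP)


def phosphosite_window(sequence, site, flank=7):
    """Return a (2 * flank + 1)-residue window centered on index `site`.

    `site` is the 0-based position of the phosphotyrosine. Positions that
    fall outside the sequence are filled with gaps (assumption).
    """
    if not 0 <= site < len(sequence):
        raise IndexError("site is outside the sequence")
    left = sequence[max(0, site - flank):site].rjust(flank, GAP)
    right = sequence[site + 1:site + 1 + flank].ljust(flank, GAP)
    return left + sequence[site] + right
```

With the defaults, `phosphosite_window` always returns 15 characters and `align_c_terminal` always returns 8, matching the peptide alignment lengths in `model_formats.csv`.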

diff --git a/predict/models/amino_acid_ordering.txt b/predict/models/amino_acid_ordering.txt
@@ -18,4 +18,4 @@ T
 V
 W
 Y
-y
+y
diff --git a/predict/models/hsm_pretrained/Kinase_TK.npz b/predict/models/hsm_pretrained/Kinase_TK.npz
diff --git a/predict/models/hsm_pretrained/PDZ.npz b/predict/models/hsm_pretrained/PDZ.npz
diff --git a/predict/models/hsm_pretrained/PTB.npz b/predict/models/hsm_pretrained/PTB.npz
diff --git a/predict/models/hsm_pretrained/PTP.npz b/predict/models/hsm_pretrained/PTP.npz
diff --git a/predict/models/hsm_pretrained/SH2.npz b/predict/models/hsm_pretrained/SH2.npz
diff --git a/predict/models/hsm_pretrained/SH3.npz b/predict/models/hsm_pretrained/SH3.npz
diff --git a/predict/models/hsm_pretrained/WH1.npz b/predict/models/hsm_pretrained/WH1.npz
diff --git a/predict/models/hsm_pretrained/WW.npz b/predict/models/hsm_pretrained/WW.npz
diff --git a/predict/models/hsm_pretrained/model_formats.csv b/predict/models/hsm_pretrained/model_formats.csv
@@ -1,8 +1,8 @@
-PTP,phosphosite,146,15,1,PTP.npz
-PTB,phosphosite,94,15,1,PTB.npz
-Kinase_TK,phosphosite,190,15,1,Kinase_TK.npz
-SH3,polyproline,51,16,0,SH3.npz
-SH2,phosphosite,92,15,1,SH2.npz
-PDZ,c-terminus,75,6,1,PDZ.npz
-WW,polyproline,32,13,0,WW.npz
-WH1,polyproline,105,8,0,WH1.npz
+SH2,phosphosite,92,15,1,SH2.npz
+WH1,polyproline,105,8,0,WH1.npz
+WW,polyproline,32,13,0,WW.npz
+PTB,phosphosite,94,15,1,PTB.npz
+PDZ,c-terminus,75,8,1,PDZ.npz
+Kinase_TK,phosphosite,192,15,1,Kinase_TK.npz
+SH3,polyproline,51,16,0,SH3.npz
+PTP,phosphosite,146,15,1,PTP.npz
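The rows of `model_formats.csv` follow the column order given in `predict/README.md` (`Domain-Type,Peptide-Type,Domain-Alignment-Length,Peptide-Alignment-Length,Peptide-Alignment-Is-Fixed,Model-Filename`) and the file has no header row. A minimal parsing sketch follows. The `ModelFormat` tuple and `parse_model_formats` are hypothetical names, and keying the result by domain type is an assumption.

```python
import csv
from collections import namedtuple

ModelFormat = namedtuple(
    "ModelFormat",
    [
        "domain_type",
        "peptide_type",
        "domain_alignment_length",
        "peptide_alignment_length",
        "peptide_alignment_is_fixed",
        "model_filename",
    ],
)


def parse_model_formats(lines):
    """Parse header-less model-format rows into a dict keyed by domain type."""
    formats = {}
    for row in csv.reader(lines):
        if not row:
            continue  # skip blank lines
        domain_type, peptide_type, d_len, p_len, is_fixed, filename = row
        formats[domain_type] = ModelFormat(
            domain_type,
            peptide_type,
            int(d_len),
            int(p_len),
            bool(int(is_fixed)),
            filename,
        )
    return formats


# The eight rows added in this commit.
UPDATED_ROWS = """\
SH2,phosphosite,92,15,1,SH2.npz
WH1,polyproline,105,8,0,WH1.npz
WW,polyproline,32,13,0,WW.npz
PTB,phosphosite,94,15,1,PTB.npz
PDZ,c-terminus,75,8,1,PDZ.npz
Kinase_TK,phosphosite,192,15,1,Kinase_TK.npz
SH3,polyproline,51,16,0,SH3.npz
PTP,phosphosite,146,15,1,PTP.npz
""".splitlines()

MODEL_FORMATS = parse_model_formats(UPDATED_ROWS)
```

Parsed this way, the two value changes in the hunk are easy to spot: the PDZ peptide alignment length moves from 6 to 8 and the Kinase_TK domain alignment length from 190 to 192. The remaining differences are row order only.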
diff --git a/predict/predict_proteins.py b/predict/predict_proteins.py
@@ -52,8 +52,8 @@ def predict_interactions(metadata, models, output_ppis_fname, output_dpis_fname,
     import argparse
     parser = argparse.ArgumentParser()
     parser.add_argument("--ppi_pairs", type=str, default=None)
-    parser.add_argument("--domain_metadata", type=str, default="../data/predict/domain_metadata.csv")
-    parser.add_argument("--peptide_metadata", type=str, default="../data/predict/peptide_metadata.csv")
+    parser.add_argument("--domain_metadata", type=str, default="../data/ppi_data/metadata/domain_metadata.csv")
+    parser.add_argument("--peptide_metadata", type=str, default="../data/ppi_data/metadata/peptide_metadata.csv")
     parser.add_argument("--output_ppi_prediction", type=str, default="outputs/ppi_predictions.csv",
             help="Output file for PPI predictions. Output is a csv file with format <ID 1>,<ID 2>,<PPI Likelihood>.")
     parser.add_argument("--output_dpi_prediction", type=str, default="outputs/dpi_predictions.txt",

diff --git a/publication_analysis/README.md b/publication_analysis/README.md
@@ -0,0 +1,15 @@
+# Description
+
+This directory contains code used for the publication, specifically to evaluate external models and PSSMs, analyze HSM models and HSM/P predictions, and create all figures. Detailed descriptions are provided in each subdirectory.
+
+## External models
+
+Code used to evaluate NetPhorest 2.1 predictions on HSM datasets is provided in `netphorest/`. Code used to evaluate PepInt predictions on HSM datasets is provided in `pepint/`.
+
+## PSSMs
+
+Code to construct benchmark PSSMs for each PBD family is provided in `pssm/`.
+
+## Analysis and results presented in manuscript
+
+All code used to perform analysis presented in the manuscript and create figures is provided in `results/`. Data underlying the figures is released on [figshare (doi:10.6084/m9.figshare.22105529)](https://doi.org/10.6084/m9.figshare.22105529), with a directory structure paralleling that of this directory.
diff --git a/publication_analysis/netphorest/README.md b/publication_analysis/netphorest/README.md
@@ -0,0 +1,13 @@
+# NetPhorest predictions
+
+NetPhorest 2.1 was used to make predictions for PTB, SH2, PTP, and TK domains for comparison to HSM. See `driver.sh` for commands executed. To use the bash script, the following variables need to be set:
+
+* `domain`: Specify PBD family to make predictions for. Must be one of `PTB`, `SH2`, `Kinase_TK`, or `PTP`.
+* `data_path`: Specify path to data to make predictions for. This must be raw, unaligned data provided in csv format for a single PBD and must contain the columns (with header):
+```
+Domain UniProt ID,Domain Sequence,Peptidic Sequence,Bound
+```
+Raw HSM data (`/data/data_without_processed_duplicates/raw_data/`) can be downloaded from [figshare (doi:10.6084/m9.figshare.22105529)](https://doi.org/10.6084/m9.figshare.22105529).
+* `NETPHOR`: Specify the NetPhorest 2.1 executable. Download `NetPhorest_human_2.1.zip` from [http://netphorest.science/download.shtml](http://netphorest.science/download.shtml) and compile per instructions.
+
+In order to make predictions with the HSM data, each domain was mapped to its NetPhorest model, and these mappings are provided in `map_domain_netphorest_model`. Either UniProt IDs (if only one domain of a specified PBD family is found in the protein) or raw domain sequences are used for the mapping. For further information about the NetPhorest models, see [Miller, et al. Linear Motif Atlas for Phosphorylation-Dependent Signaling (2008)](https://www.science.org/doi/10.1126/scisignal.1159433) and [Horn, et al. KinomeXplorer: an integrated platform for kinome biology studies (2014)](https://www.nature.com/articles/nmeth.2968).
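The mapping files in `map_domain_netphorest_model` share the header `Domain UniProt ID,NetPhorest Model,Use sequence,Sequence`. A minimal sketch of loading one into two lookups is below. It assumes that `Use sequence` set to `1` means the row should be matched on the `Sequence` column instead of the UniProt ID. The function name is hypothetical and this is not the logic of `process_netphorest_predictions.py`.

```python
import csv


def load_domain_model_map(lines):
    """Split mapping rows into UniProt-keyed and sequence-keyed lookups."""
    by_uniprot = {}
    by_sequence = {}
    for row in csv.DictReader(lines):
        model = row["NetPhorest Model"]
        if row["Use sequence"] == "1":
            # Assumption: the raw domain sequence disambiguates proteins
            # that carry more than one domain of the same family.
            by_sequence[row["Sequence"]] = model
        else:
            by_uniprot[row["Domain UniProt ID"]] = model
    return by_uniprot, by_sequence


# Contents of PTB_netphorest_model.csv as added in this commit.
PTB_ROWS = """\
Domain UniProt ID,NetPhorest Model,Use sequence,Sequence
Q9UKG1,APPL,0,
Q8NEU8,APPL,0,
P29353,SHC1_SHC2_SHC3_group,0,
P98077,SHC1_SHC2_SHC3_group,0,
Q92529,SHC1_SHC2_SHC3_group,0,
Q6S5L8,SHC4,0,
""".splitlines()

PTB_BY_UNIPROT, PTB_BY_SEQUENCE = load_domain_model_map(PTB_ROWS)
```

For the PTB file every row has `Use sequence` set to `0`, so the sequence lookup stays empty and all six domains resolve by UniProt ID.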
diff --git a/publication_analysis/netphorest/driver.sh b/publication_analysis/netphorest/driver.sh
@@ -0,0 +1,25 @@
+#!/bin/bash
+
+domain= # name of PBD to make predictions for
+data_path= # path to location of raw (unaligned) data
+
+NETPHOR= # netphorest 2.1 executable
+
+if [ ! -d $domain ]; then mkdir $domain; fi
+cd $domain
+   # write fasta for netphorest predictions
+   # using raw data file since has UniProt IDs & netphorest independently aligns
+   data=$data_path/$domain.csv
+   python ../write_fasta.py $data
+
+   # make predictions with netphorest
+   if [ "$domain" == "Kinase_TK" ]; then
+      classifier='KIN'
+   else
+      classifier=$domain
+   fi
+   cat peptides.fasta | $NETPHOR | grep $classifier > netphorest_predictions.tab
+
+   # process predictions
+   python ../process_netphorest_predictions.py netphorest_predictions.tab $data ../map_domain_netphorest_model/${domain}_netphorest_model.csv
+cd ..
diff --git a/publication_analysis/netphorest/map_domain_netphorest_model/Kinase_TK_netphorest_model.csv b/publication_analysis/netphorest/map_domain_netphorest_model/Kinase_TK_netphorest_model.csv
@@ -0,0 +1,57 @@
+Domain UniProt ID,NetPhorest Model,Use sequence,Sequence
+P21709,Eph_group,0,
+P29317,Eph_group,0,
+P29320,Eph_group,0,
+P54764,Eph_group,0,
+P54756,Eph_group,0,
+Q9UF33,Eph_group,0,
+Q15375,Eph_group,0,
+P29322,Eph_group,0,
+Q5JZY3,Eph_group,0,
+P54762,Eph_group,0,
+P29323,Eph_group,0,
+P54753,Eph_group,0,
+P54760,Eph_group,0,
+O15197,Eph_group,0,
+P00519,Abl_group,0,
+P42684,Abl_group,0,
+P07333,FLT3_CSF1R_Kit_PDGFR_group,0,
+P10721,FLT3_CSF1R_Kit_PDGFR_group,0,
+P36888,FLT3_CSF1R_Kit_PDGFR_group,0,
+P16234,FLT3_CSF1R_Kit_PDGFR_group,0,
+P09619,FLT3_CSF1R_Kit_PDGFR_group,0,
+O60674,JAK2,0,
+O60674,JAK2,0,
+P35968,KDR_FLT1_group,0,
+P17948,KDR_FLT1_group,0,
+P08581,Met_group,0,
+Q04912,Met_group,0,
+P43405,Syk_group,0,
+P43403,Syk_group,0,
+Q06187,Tec_group,0,
+P51813,Tec_group,0,
+Q08881,Tec_group,0,
+P42680,Tec_group,0,
+P42681,Tec_group,0,
+P04629,Trk_group,0,
+Q16620,Trk_group,0,
+Q16288,Trk_group,0,
+P29597,Tyk2,0,
+P00533,EGFR_group,0,
+P04626,EGFR_group,0,
+P21860,EGFR_group,0,
+Q15303,EGFR_group,0,
+P06213,InsR_group,0,
+P08069,InsR_group,0,
+P14616,InsR_group,0,
+P51451,Src_group,0,
+P08631,Src_group,0,
+P07948,Src_group,0,
+P06239,Src_group,0,
+P09769,Src_group,0,
+P42685,Src_group,0,
+P06241,Src_group,0,
+P12931,Src_group,0,
+P07947,Src_group,0,
+Q13882,Src_group,0,
+Q9H3Y6,Src_group,0,
diff --git a/publication_analysis/netphorest/map_domain_netphorest_model/PTB_netphorest_model.csv b/publication_analysis/netphorest/map_domain_netphorest_model/PTB_netphorest_model.csv
@@ -0,0 +1,7 @@
+Domain UniProt ID,NetPhorest Model,Use sequence,Sequence
+Q9UKG1,APPL,0,
+Q8NEU8,APPL,0,
+P29353,SHC1_SHC2_SHC3_group,0,
+P98077,SHC1_SHC2_SHC3_group,0,
+Q92529,SHC1_SHC2_SHC3_group,0,
+Q6S5L8,SHC4,0,
diff --git a/publication_analysis/netphorest/map_domain_netphorest_model/PTP_netphorest_model.csv b/publication_analysis/netphorest/map_domain_netphorest_model/PTP_netphorest_model.csv
@@ -0,0 +1,23 @@
+Domain UniProt ID,NetPhorest Model,Use sequence,Sequence
+Q12923,PTPN13,0,
+Q9H3S7,PTPN23,0,
+P26045,PTPN3,0,
+P29074,PTPN4,0,
+P43378,PTPN9,0,
+P23468,R2A_group,0,
+P10586,R2A_group,0,
+Q13332,R2A_group,0,
+P23467,R3_group,0,
+Q9HD43,R3_group,0,
+Q12913,R3_group,0,
+Q16827,R3_group,0,
+Q9UMZ3,R3_group,0,
+P18433,R4_group,0,
+P23469,R4_group,0,
+P18031,NT1_group,0,
+P17706,NT1_group,0,
+P29350,NT2_group,0,
+Q06124,NT2_group,0,
+Q05209,NT4_group,0,
+Q99952,NT4_group,0,
+Q9Y2R2,NT4_group,0,