Merge pull request #4 from jrurogers/master
Updates for corrected HSM models
jrurogers authored Feb 16, 2023
2 parents 980513e + ffe24fc commit 53edf5d
Showing 62 changed files with 26,107 additions and 45 deletions.
2 changes: 2 additions & 0 deletions .gitignore
@@ -1,6 +1,8 @@
data/
analysis/

.DS_Store

#####################################
# Standard Python .gitignore commands
# below this point - JMC
20 changes: 9 additions & 11 deletions README.md
@@ -4,37 +4,34 @@

This repository implements the hierarchical statistical mechanical (HSM) model described in the paper [Biophysical prediction of protein-peptide interactions and signaling networks using machine learning.](https://doi.org/10.1038/s41592-019-0687-1)

An **associated website** is available at [proteinpeptide.io](https://proteinpeptide.io). The website is built to facilitate interactions with results from the model including: (1) specific domain-peptide and protein-protein predictions, (2) the resulting networks, and (3) structures colored using the inferred energy functions from the model. Code for the website is available via the parallel repo: [aqlaboratory/hsm-web](https://github.com/aqlaboratory/hsm-web).
An **associated website** is available at [proteinpeptide.io](https://proteinpeptide.io). The website is built to facilitate interactions with results from the model including: (1) specific domain-peptide and protein-protein predictions, (2) the resulting networks, and (3) structures colored using the inferred energy functions from the model. Code for the website is available via the parallel repo: [aqlaboratory/hsm-web](https://github.com/aqlaboratory/hsm-web). Note that the results on the website were obtained using an [old model](#model-updates).

This file documents how this package might be [used](#usage), the [location of associated data](#data), and [other metadata](#reference).

## Usage

The model was implemented in Python (>= 3.5) primarily using TensorFlow (>= 1.4) ([Software Requirements](#requirements)). To work with this repository, either download pre-processed data (see below) or include new data. The folder contains two major directories: `train/` and `predict/`. Each directory is accompanied by a `README.md` file detailing usage.
The model was implemented in Python (>= 3.5) primarily using TensorFlow (>= 1.14) ([Software Requirements](#requirements)). To work with this repository, either download pre-processed data (see below) or include new data. The folder contains three major directories: `train/`, `predict/`, and `publication_analysis/`. Each directory is accompanied by a `README.md` file detailing usage.

To train / re-train new models, use the `train.py` script in `train/`. To make predictions using a model, use one of two scripts, `predict_domains.py` and `predict_proteins.py`, for predicting either domain-peptide interactions or protein-protein interactions. Scripts are designed with a CLI and should be used from the command line:

```bash
python [SCRIPT] [OPTIONS]
```

Options for any script may be listed using the `-h/--help` flag.
Options for any script may be listed using the `-h/--help` flag.

Pre-processed / pre-trained data and models may be downloaded from [figshare (doi:10.6084/m9.figshare.11520552)](https://doi.org/10.6084/m9.figshare.11520552) and should be unpacked at `data/` in this directory. This directory may also be used as an example of how to structure input and output files / directories.
To reproduce analysis and figures presented in the paper [Biophysical prediction of protein-peptide interactions and signaling networks using machine learning.](https://doi.org/10.1038/s41592-019-0687-1), use the scripts in `publication_analysis/`.

An alternative use case would be to train / re-train a new model in the `train/` code and make new predictions using the `predict/` code.
Pre-trained models are released with this repo. An alternative use case would be to train / re-train a new model in the `train/` code and make new predictions using the `predict/` code.

### Kinase model issue
### Model updates

We have identified an issue in the dataset used to train the kinase model. For the time being, we suggest not using the kinase model until further updates are provided.
We identified an issue in the original datasets used to train the model published in [Biophysical prediction of protein-peptide interactions and signaling networks using machine learning.](https://doi.org/10.1038/s41592-019-0687-1). As of February 15, 2023, we have corrected the datasets (released on [figshare (doi:10.6084/m9.figshare.22105529)](https://doi.org/10.6084/m9.figshare.22105529)), and replaced the original models released with this repo with corrected ones. Please verify that you use the corrected models for all predictions (see documentation in `predict/`).

## Data

As reported, domain-peptide and protein-protein interactions are available via [figshare (doi:10.6084/m9.figshare.10084745)](https://doi.org/10.6084/m9.figshare.10084745). In addition, we provide pre-processed data for this repository and the website repository,
All associated data may be downloaded from [figshare (doi:10.6084/m9.figshare.22105529)](https://doi.org/10.6084/m9.figshare.22105529).

- Raw training data: [figshare - doi:10.6084/m9.figshare.11520297](https://doi.org/10.6084/m9.figshare.11520297). Raw domain-peptide training data used to train the core HSM models. Unpack to `data/` in this directory.
- Pre-processed data: [figshare - doi:10.6084/m9.figshare.11520552](https://doi.org/10.6084/m9.figshare.11520552). Needed to work with this repo. Unpack to `data/` in this directory.
- Data supporting the website at [proteinpeptide.io](https://proteinpeptide.io)

## Requirements
- Python (>= 3.5)
@@ -44,6 +41,7 @@ As reported, domain-peptide and protein-protein interactions are available via [
- scikit-learn (0.20)
- tqdm (4.41) (Progressbar. Not strictly necessary for functionality; needed to ensure package runs.)


## Reference
Please reference the associated publication:

21 changes: 21 additions & 0 deletions amino_acid_ordering.txt
@@ -0,0 +1,21 @@
A
C
D
E
F
G
H
I
K
L
M
N
P
Q
R
S
T
V
W
Y
y
33 changes: 26 additions & 7 deletions predict/README.md
@@ -20,9 +20,12 @@ Additional options for using either script may be listed using the `-h/--help` f
The basic steps for predicting a new interaction are:
### 0. Pre-process data and models.

By default, the code assumes that models are located at `predict/models/` and pre-processed data, which can be downloaded from [figshare (doi:10.6084/m9.figshare.11520552)](https://doi.org/10.6084/m9.figshare.11520552), should be available at `data/predict`. New data must be passed explicitly to the code (see the next section). Output model files should be the same as formatted by `output_models.py` in the `train/` directory.
By default, the code assumes that models are located at `predict/models/`. Output model files should be the same as formatted by `output_models.py` in the `train/` directory.

Pre-processed input domain and peptide metadata is available from [figshare (doi:10.6084/m9.figshare.22105529)](https://doi.org/10.6084/m9.figshare.22105529), specifically in `data/ppi_data/metadata`.

Input domains files should have the format:

```
Domain-Protein-Identifier,Aligned-Domain-Sequence,Domain-Type
```
@@ -44,6 +47,7 @@ python predict_domains.py [INPUT DOMAINS METADATA] [INPUT PEPTIDES METADATA] [OP
```

Domain-peptide interactions are computed for all valid pairs (*e.g.* pairs that have an associated model). The two major options, `-m/--models` and `--model-format`, are useful when using newly trained models. `-m/--models` describes a directory listing new model files (each model file should specify the associated domain type). `--model-format` describes the model metadata as a csv file:

```
Domain-Type,Peptide-Type,Domain-Alignment-Length,Peptide-Alignment-Length,Peptide-Alignment-Is-Fixed,Model-Filename
```
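The model-format csv described above can be read into typed records before any model file is loaded. The following is a minimal sketch, not the repository's own loader: the `ModelFormat` class and `read_model_formats` function are hypothetical names, and it assumes the file has no header row and that `Peptide-Alignment-Is-Fixed` is encoded as `1` or `0`, as in the released `model_formats.csv`.

```python
import csv
import io
from typing import NamedTuple


class ModelFormat(NamedTuple):
    """One row of the model-format csv (hypothetical container)."""
    domain_type: str
    peptide_type: str
    domain_length: int
    peptide_length: int
    peptide_is_fixed: bool
    model_filename: str


def read_model_formats(handle):
    """Map (domain type, peptide type) to a ModelFormat record."""
    formats = {}
    for row in csv.reader(handle):
        if not row:
            continue  # skip blank lines
        domain, peptide, dlen, plen, fixed, fname = [f.strip() for f in row]
        formats[(domain, peptide)] = ModelFormat(
            domain, peptide, int(dlen), int(plen), fixed == "1", fname
        )
    return formats


# Two rows copied from the released model_formats.csv.
example = io.StringIO(
    "PDZ,c-terminus,75,8,1,PDZ.npz\n"
    "SH3,polyproline,51,16,0,SH3.npz\n"
)
formats = read_model_formats(example)
```

Keying the records by the (domain type, peptide type) pair makes it straightforward to check whether a given pair has an associated model before attempting a prediction.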
@@ -55,34 +59,49 @@ The domain and peptide alignment lengths refer to the domain / peptide alignment
Code used for predicting protein-protein interactions is located in the predict/ directory in this repository. The functionality should primarily be accessed via the `predict_proteins.py` script.

```python
python predict_proteins.py [-p [INPUT PPI PAIRS]] [OPTIONS]
python predict_proteins.py [--ppi_pairs [INPUT PPI PAIRS]] [OPTIONS]
```
Additional options for using either script may be listed using the `-h/--help` flag.

## 0. Pre-process data and models.

By default, the `predict_proteins.py` script also assumes models are located at `predict/models/` and pre-processed data, which can be downloaded via [figshare (doi:10.6084/m9.figshare.11520552)](https://doi.org/10.6084/m9.figshare.11520552), are available at `data/metadata`. New data must be passed explicitly to the code (see the next section). The same models files may be used in both domain-peptide and protein-protein interaction prediction. To use new models, the same steps to specify the new models must be passed to `predict_proteins.py`. In addition, the models requiire metadata files (by default, stored in `data/metadata`) that describe either the domain or peptide composition of proteins. Metadata are formatted as Python dictionaries (stored as pickle'd files) with the format:
By default, the `predict_proteins.py` script also assumes models are located at `predict/models/`. The same model files may be used in both domain-peptide and protein-protein interaction prediction. To use new models, the same steps to specify the new models must be passed to `predict_proteins.py` as described above.

In addition, the models *require* metadata files that describe the domain and peptide composition of proteins. By default, the code assumes that pre-processed metadata for domains and peptidic sites identified in the human proteome is available at `../data/ppi_data/metadata/`. Metadata (`data/ppi_data/metadata`) can be downloaded from [figshare (doi:10.6084/m9.figshare.22105529)](https://doi.org/10.6084/m9.figshare.22105529). New metadata must be passed explicitly to the code (see the next section).

## 1. Run predictions

Predictions can be computed using the described script:

```python
```
python predict_proteins.py [--ppi_pairs [INPUT PPI PAIRS]] [OPTIONS]
```
The `INPUT PPI PAIRS` option (passed using `--ppi_pairs`) denotes a csv file containing the protein pairs to predict. Each line of the file should contain one pair of protein IDs (`<ID 1>,<ID 2>`). These IDs should reference IDs in the metadata files. If no pairs are passed, all valid pairs are returned. Different metadata files may be passed in using the `--domain_metadata` and `--peptide_metadata` options.
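A pairs file in the `<ID 1>,<ID 2>` format can be generated from a list of protein IDs. The sketch below is illustrative only: `write_ppi_pairs` is a hypothetical helper, the IDs are placeholders that would need to match the IDs in the metadata files, and it assumes each unordered pair only needs to be listed once.

```python
import csv
from itertools import combinations


def write_ppi_pairs(protein_ids, path):
    """Write every unordered pair of distinct IDs as one '<ID 1>,<ID 2>' line."""
    unique_ids = sorted(set(protein_ids))
    pairs = list(combinations(unique_ids, 2))
    with open(path, "w", newline="") as handle:
        csv.writer(handle).writerows(pairs)
    return pairs


if __name__ == "__main__":
    # Placeholder IDs; replace with IDs present in the metadata files.
    write_ppi_pairs(["P00519", "P12931", "P06241"], "ppi_pairs.csv")
```

The resulting file can then be passed to `predict_proteins.py` with `--ppi_pairs ppi_pairs.csv`.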

## Pre-trained Models
Released with this codebase are pre-trained models for using HSM / D to make novel predictions. The code assumes that the input sequences are correctly aligned.
Released with this codebase are pre-trained models for using HSM/D to make novel predictions. Pre-trained model files can be verified with the following MD5 hashes:

```
MD5 (amino_acid_ordering.txt) = c6458d6bd0cd9c5660632557b12cb28a
MD5 (hsm_pretrained/Kinase_TK.npz) = 746c6f4901ec85965faca7fe47ede548
MD5 (hsm_pretrained/PDZ.npz) = 0e34d0b3894c28f077122d82a2ffc910
MD5 (hsm_pretrained/PTB.npz) = d1c0e700e5f4f395eed38683433802a3
MD5 (hsm_pretrained/PTP.npz) = a8a1cf42dac2cfd16ec8757d20743ea7
MD5 (hsm_pretrained/SH2.npz) = 650303cb9088ec31a8305e73e9420b11
MD5 (hsm_pretrained/SH3.npz) = 802cb0a9fccb1fcfce9480993407a98a
MD5 (hsm_pretrained/WH1.npz) = 60e72ea081ed8c955ed5ec88fdf4b7b7
MD5 (hsm_pretrained/WW.npz) = 6aaf0473dba3aa32fe1336332503c09b
MD5 (hsm_pretrained/model_formats.csv) = 4e6f1842a6307ed5891ad9a9e9c67aa0
```
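The hashes above can be checked programmatically. The following is a minimal sketch using only the Python standard library; the function names are hypothetical, and it assumes the listing is saved as text in the `MD5 (<relative path>) = <hash>` format shown above, with paths relative to `predict/models/`.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def parse_md5_listing(text):
    """Parse lines of the form 'MD5 (name) = hash' into {name: hash}."""
    expected = {}
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("MD5 (") or ") = " not in line:
            continue  # ignore anything that is not a hash line
        name, digest = line[len("MD5 ("):].split(") = ", 1)
        expected[name] = digest.strip()
    return expected


def verify(models_dir, expected):
    """Return {name: True/False}; missing files count as failures."""
    results = {}
    for name, digest in expected.items():
        path = Path(models_dir) / name
        results[name] = path.is_file() and md5_of_file(path) == digest
    return results
```

For example, `verify("predict/models", parse_md5_listing(listing_text))` returns a dictionary in which any `False` entry marks a file that is missing or does not match its expected hash.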

Models were trained using the domain alignments released with this codebase (in the `domain_metadata.csv` file in `predictions.tar.gz` in the [figshare (doi:10.6084/m9.figshare.11520552)](https://doi.org/10.6084/m9.figshare.11520552) repository.
The code assumes that the input sequences are correctly aligned. Models were trained using the domain alignments released on [figshare (doi:10.6084/m9.figshare.22105529)](https://doi.org/10.6084/m9.figshare.22105529) in `data/ppi_data/metadata/domain_metadata.csv`. MSAs constructed for each PBD family to obtain these alignments are also released on [figshare (doi:10.6084/m9.figshare.22105529)](https://doi.org/10.6084/m9.figshare.22105529) in `results/data/alignment/`.

Peptides are aligned as follows:

| **Peptidic-type** | **Alignment** |
| ---------- | ---------------- |
| phosphosite | A 15 residue window aligned on the central phosphotyrosine. For example, `AAAAAAAyAAAAAAA`.|
| C-terminal | A 6 residue windo aligned (to the right) on the C-terminus. For example, `AAAAAA`. Note, alignment offsets should be to the right (*e.g.* `AAAA` would become `--AAAA`).|
| C-terminal | An 8-residue window aligned (to the right) on the C-terminus. For example, `AAAAAAAA`. Note, alignment offsets should be to the right (*e.g.* `AAAAAA` would become `--AAAAAA`).|
| polyproline | Any residue sequence. polyproline peptides are "scanning", or the likelihood is computed over the whole sequence.|

`A` denotes any amino acid.
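The fixed-length alignments in the table can be sketched in a few lines of Python. This is an illustration rather than the repository's pre-processing code: both function names are hypothetical, the window widths follow the corrected model formats (8 for C-terminal, 15 for phosphosite), and padding a phosphosite window with `-` when the site lies near a sequence end is an assumption.

```python
GAP = "-"


def align_c_terminal(peptide, width=8):
    """Keep the last `width` residues and pad on the left with gaps."""
    return peptide[-width:].rjust(width, GAP)


def align_phosphosite(sequence, site_index, width=15):
    """Return a `width`-residue window centred on the phosphotyrosine `y`.

    Gap-padding at the sequence ends is an assumption of this sketch.
    """
    if sequence[site_index] != "y":
        raise ValueError("site_index must point at a phosphotyrosine ('y')")
    half = width // 2
    left = sequence[max(0, site_index - half):site_index]
    right = sequence[site_index + 1:site_index + 1 + half]
    return left.rjust(half, GAP) + "y" + right.ljust(half, GAP)
```

With these definitions, `align_c_terminal("AAAAAA")` gives `--AAAAAA`, matching the offset example in the table.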
2 changes: 1 addition & 1 deletion predict/models/amino_acid_ordering.txt
@@ -18,4 +18,4 @@ T
V
W
Y
y
y
Binary file modified predict/models/hsm_pretrained/Kinase_TK.npz
Binary file not shown.
Binary file modified predict/models/hsm_pretrained/PDZ.npz
Binary file not shown.
Binary file modified predict/models/hsm_pretrained/PTB.npz
Binary file not shown.
Binary file modified predict/models/hsm_pretrained/PTP.npz
Binary file not shown.
Binary file modified predict/models/hsm_pretrained/SH2.npz
Binary file not shown.
Binary file modified predict/models/hsm_pretrained/SH3.npz
Binary file not shown.
Binary file modified predict/models/hsm_pretrained/WH1.npz
Binary file not shown.
Binary file modified predict/models/hsm_pretrained/WW.npz
Binary file not shown.
16 changes: 8 additions & 8 deletions predict/models/hsm_pretrained/model_formats.csv
@@ -1,8 +1,8 @@
PTP,phosphosite,146,15,1,PTP.npz
PTB,phosphosite,94,15,1,PTB.npz
Kinase_TK,phosphosite,190,15,1,Kinase_TK.npz
SH3,polyproline,51,16,0,SH3.npz
SH2,phosphosite,92,15,1,SH2.npz
PDZ,c-terminus,75,6,1,PDZ.npz
WW,polyproline,32,13,0,WW.npz
WH1,polyproline,105,8,0,WH1.npz
SH2,phosphosite,92,15,1,SH2.npz
WH1,polyproline,105,8,0,WH1.npz
WW,polyproline,32,13,0,WW.npz
PTB,phosphosite,94,15,1,PTB.npz
PDZ,c-terminus,75,8,1,PDZ.npz
Kinase_TK,phosphosite,192,15,1,Kinase_TK.npz
SH3,polyproline,51,16,0,SH3.npz
PTP,phosphosite,146,15,1,PTP.npz
4 changes: 2 additions & 2 deletions predict/predict_proteins.py
@@ -52,8 +52,8 @@ def predict_interactions(metadata, models, output_ppis_fname, output_dpis_fname,
import argparse
parser = argparse.ArgumentParser()
parser.add_argument("--ppi_pairs", type=str, default=None)
parser.add_argument("--domain_metadata", type=str, default="../data/predict/domain_metadata.csv")
parser.add_argument("--peptide_metadata", type=str, default="../data/predict/peptide_metadata.csv")
parser.add_argument("--domain_metadata", type=str, default="../data/ppi_data/metadata/domain_metadata.csv")
parser.add_argument("--peptide_metadata", type=str, default="../data/ppi_data/metadata/peptide_metadata.csv")
parser.add_argument("--output_ppi_prediction", type=str, default="outputs/ppi_predictions.csv",
help="Output file for PPI predictions. Output is a csv file with format <ID 1>,<ID 2>,<PPI Likelihood>.")
parser.add_argument("--output_dpi_prediction", type=str, default="outputs/dpi_predictions.txt",
15 changes: 15 additions & 0 deletions publication_analysis/README.md
@@ -0,0 +1,15 @@
# Description

This directory contains code used for the publication, specifically to evaluate external models and PSSMs, analyze HSM models and HSM/P predictions, and create all figures. Detailed descriptions are provided in each subdirectory.

## External models

Code used to evaluate NetPhorest 2.1 predictions on HSM datasets is provided in `netphorest/`. Code used to evaluate PepInt predictions on HSM datasets is provided in `pepint/`.

## PSSMs

Code to construct benchmark PSSMs for each PBD family is provided in `pssm/`.

## Analysis and results presented in manuscript

All code used to perform analysis presented in the manuscript and create figures is provided in `results/`. Data underlying the figures is released on [figshare (doi:10.6084/m9.figshare.22105529)](https://doi.org/10.6084/m9.figshare.22105529), with a directory structure paralleling that of this directory.
13 changes: 13 additions & 0 deletions publication_analysis/netphorest/README.md
@@ -0,0 +1,13 @@
# NetPhorest predictions

NetPhorest 2.1 was used to make predictions for PTB, SH2, PTP, and TK domains for comparison to HSM. See `driver.sh` for the commands executed. To use the bash script, the following variables need to be set:

* `domain`: Specify PBD family to make predictions for. Must be one of `PTB`, `SH2`, `Kinase_TK`, and `PTP`.
* `data_path`: Specify path to data to make predictions for. This must be raw, unaligned data provided in csv format for a single PBD and must contain the columns (with header):
```
Domain UniProt ID,Domain Sequence,Peptidic Sequence,Bound
```
Raw HSM data (`/data/data_without_processed_duplicates/raw_data/`) can be downloaded from [figshare (doi:10.6084/m9.figshare.22105529)](https://doi.org/10.6084/m9.figshare.22105529).
* `NETPHOR`: Specify the NetPhorest 2.1 executable. Download `NetPhorest_human_2.1.zip` from [http://netphorest.science/download.shtml](http://netphorest.science/download.shtml) and compile per instructions.

In order to make predictions with the HSM data, each domain was mapped to its NetPhorest model, and these mappings are provided in `map_domain_netphorest_model`. Either UniProt IDs (if only one domain of a specified PBD family is found in the protein) or raw domain sequences are used for the mapping. For further information about the NetPhorest models, see [Miller, et al. Linear Motif Atlas for Phosphorylation-Dependent Signaling (2008)](https://www.science.org/doi/10.1126/scisignal.1159433) and [Horn, et al. KinomeXplorer: an integrated platform for kinome biology studies (2014)](https://www.nature.com/articles/nmeth.2968).
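The mapping files can be loaded into lookup tables as sketched below. This is not the logic of `process_netphorest_predictions.py`: the function names are hypothetical, and the sketch assumes that `Use sequence` equal to `1` means the row is matched on the `Sequence` column rather than on the UniProt ID, and that a sequence match takes precedence over an ID match.

```python
import csv
import io


def load_netphorest_mapping(handle):
    """Return (by_id, by_sequence) lookups built from a mapping csv."""
    by_id, by_sequence = {}, {}
    for row in csv.DictReader(handle):
        model = row["NetPhorest Model"]
        if row["Use sequence"] == "1":
            # Assumed meaning: match this row on the raw domain sequence.
            by_sequence[row["Sequence"]] = model
        else:
            by_id[row["Domain UniProt ID"]] = model
    return by_id, by_sequence


def lookup_model(uniprot_id, domain_sequence, by_id, by_sequence):
    """Prefer a sequence match, then fall back to the UniProt ID."""
    if domain_sequence in by_sequence:
        return by_sequence[domain_sequence]
    return by_id.get(uniprot_id)


# The first data row is copied from the mapping files; the second is a
# made-up row that only illustrates sequence-based matching.
example = io.StringIO(
    "Domain UniProt ID,NetPhorest Model,Use sequence,Sequence\n"
    "P00519,Abl_group,0,\n"
    "X00000,Example_group,1,ACDEFG\n"
)
by_id, by_sequence = load_netphorest_mapping(example)
```

Domains with no entry in either table return `None`, which makes unmapped domains easy to filter out before comparing predictions.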
25 changes: 25 additions & 0 deletions publication_analysis/netphorest/driver.sh
@@ -0,0 +1,25 @@
#!/bin/bash

domain= # name of PBD to make predictions for
data_path= # path to location of raw (unaligned) data

NETPHOR= # netphorest 2.1 executable

# create the per-domain working directory if needed and stop if it cannot be entered
mkdir -p "$domain"
cd "$domain" || exit 1
# write fasta for netphorest predictions
# using raw data file since has UniProt IDs & netphorest independently aligns
data=$data_path/$domain.csv
python ../write_fasta.py $data

# make predictions with netphorest
if [ "$domain" == "Kinase_TK" ]; then
classifier='KIN'
else
classifier=$domain
fi
cat peptides.fasta | $NETPHOR | grep $classifier > netphorest_predictions.tab

# process predictions
python ../process_netphorest_predictions.py netphorest_predictions.tab $data ../map_domain_netphorest_model/${domain}_netphorest_model.csv
cd ..
@@ -0,0 +1,57 @@
Domain UniProt ID,NetPhorest Model,Use sequence,Sequence
P21709,Eph_group,0,
P29317,Eph_group,0,
P29320,Eph_group,0,
P54764,Eph_group,0,
P54756,Eph_group,0,
Q9UF33,Eph_group,0,
Q15375,Eph_group,0,
P29322,Eph_group,0,
Q5JZY3,Eph_group,0,
P54762,Eph_group,0,
P29323,Eph_group,0,
P54753,Eph_group,0,
P54760,Eph_group,0,
O15197,Eph_group,0,
P00519,Abl_group,0,
P42684,Abl_group,0,
P07333,FLT3_CSF1R_Kit_PDGFR_group,0,
P10721,FLT3_CSF1R_Kit_PDGFR_group,0,
P36888,FLT3_CSF1R_Kit_PDGFR_group,0,
P16234,FLT3_CSF1R_Kit_PDGFR_group,0,
P09619,FLT3_CSF1R_Kit_PDGFR_group,0,
O60674,JAK2,0,
O60674,JAK2,0,
P35968,KDR_FLT1_group,0,
P17948,KDR_FLT1_group,0,
P08581,Met_group,0,
Q04912,Met_group,0,
P43405,Syk_group,0,
P43403,Syk_group,0,
Q06187,Tec_group,0,
P51813,Tec_group,0,
Q08881,Tec_group,0,
P42680,Tec_group,0,
P42681,Tec_group,0,
P04629,Trk_group,0,
Q16620,Trk_group,0,
Q16288,Trk_group,0,
P29597,Tyk2,0,
P00533,EGFR_group,0,
P04626,EGFR_group,0,
P21860,EGFR_group,0,
Q15303,EGFR_group,0,
P06213,InsR_group,0,
P08069,InsR_group,0,
P14616,InsR_group,0,
P51451,Src_group,0,
P08631,Src_group,0,
P07948,Src_group,0,
P06239,Src_group,0,
P09769,Src_group,0,
P42685,Src_group,0,
P06241,Src_group,0,
P12931,Src_group,0,
P07947,Src_group,0,
Q13882,Src_group,0,
Q9H3Y6,Src_group,0,
@@ -0,0 +1,7 @@
Domain UniProt ID,NetPhorest Model,Use sequence,Sequence
Q9UKG1,APPL,0,
Q8NEU8,APPL,0,
P29353,SHC1_SHC2_SHC3_group,0,
P98077,SHC1_SHC2_SHC3_group,0,
Q92529,SHC1_SHC2_SHC3_group,0,
Q6S5L8,SHC4,0,
@@ -0,0 +1,23 @@
Domain UniProt ID,NetPhorest Model,Use sequence,Sequence
Q12923,PTPN13,0,
Q9H3S7,PTPN23,0,
P26045,PTPN3,0,
P29074,PTPN4,0,
P43378,PTPN9,0,
P23468,R2A_group,0,
P10586,R2A_group,0,
Q13332,R2A_group,0,
P23467,R3_group,0,
Q9HD43,R3_group,0,
Q12913,R3_group,0,
Q16827,R3_group,0,
Q9UMZ3,R3_group,0,
P18433,R4_group,0,
P23469,R4_group,0,
P18031,NT1_group,0,
P17706,NT1_group,0,
P29350,NT2_group,0,
Q06124,NT2_group,0,
Q05209,NT4_group,0,
Q99952,NT4_group,0,
Q9Y2R2,NT4_group,0,