Protein Regression Assessment

Repository to replicate the results of "A systematic analysis of regression models for protein engineering".

Data & Results

All data and persisted results can be found in the Electronic Research Data Archive (ID=archive-xENMse).

Installation

Use conda to create the required virtual environment from one of the provided environment_*.yml files:

conda env create --name protein_regression --file environment_*.yml (replace the asterisk with the identifier for your system, i.e. macOS or nix_SLURM). Note that environment_nix_SLURM.yml also applies to common Linux distributions.

Alternatives

You are free to set up your own environment. Please note that in order to run the experiments the project environment should contain at least the following libraries (in non-conflicting version specifications):

numpy
scipy
tensorflow
tensorflow-probability
gpflow
scikit-learn
mlflow

and either CUDA or macOS (M1) Metal support.
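When setting up a custom environment, it can help to confirm that the listed libraries are at least discoverable before starting a long run. The following is a minimal sketch using only the standard library; the helper name find_missing is illustrative and not part of this repository. It checks importability only, not versions or GPU support.

```python
import importlib.util

# Import names for the libraries listed above. Two of them differ from their
# package names: scikit-learn imports as "sklearn" and tensorflow-probability
# as "tensorflow_probability".
REQUIRED_MODULES = [
    "numpy",
    "scipy",
    "tensorflow",
    "tensorflow_probability",
    "gpflow",
    "sklearn",
    "mlflow",
]


def find_missing(modules):
    """Return the module names that cannot be found in the current environment."""
    return [name for name in modules if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = find_missing(REQUIRED_MODULES)
    if missing:
        print("Missing libraries:", ", ".join(missing))
    else:
        print("All required libraries are importable.")
```

Run the script inside the activated environment; an empty result only means the modules were found, so version conflicts can still surface at import time.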

Reproducing Figures

If all experiments have completed successfully, or if all persisted results have been downloaded, either:

run the respective notebooks in ./notebooks/, or
run the ./make_plot_*.py scripts to create the figures.
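The plotting scripts can also be run in one pass. The sketch below is a hypothetical helper, not part of this repository. It assumes that the make_plot_*.py scripts sit in the repository root and can be run without arguments, which this README does not confirm. It defaults to a dry run that only prints the commands.

```python
import subprocess
import sys
from pathlib import Path


def find_plot_scripts(repo_root="."):
    """Return all make_plot_*.py scripts in repo_root, sorted by name."""
    return sorted(Path(repo_root).glob("make_plot_*.py"))


def run_plot_scripts(repo_root=".", dry_run=True):
    """Print each plotting command and, unless dry_run is set, execute it."""
    commands = []
    for script in find_plot_scripts(repo_root):
        command = [sys.executable, script.name]
        commands.append(command)
        print("Running:", " ".join(command))
        if not dry_run:
            # Assumption: the scripts expect to be started from the repository root.
            subprocess.run(command, cwd=repo_root, check=True)
    return commands


if __name__ == "__main__":
    run_plot_scripts(".", dry_run=True)
```

Set dry_run=False once the printed commands look right. With check=True, the loop stops at the first script that exits with an error.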

Replicating results

After installing and activating the protein_regression environment, download the required data into the ./data directory.

Files

Download

{blat|brca|calm|mth3|timb|toxi|ubqt}_data_df.pkl (sequences and observations as DataFrames)
{blat|brca|calm|mth3|timb|toxi|ubqt}_{esm|esm1v|esm2|prott5|pssm|seq_reps}_rep.{pkl|npz} (most embeddings as pickle or numpy persisted files)
ProtBert_{blat|brca|calm|mth3|timb|toxi|ubqt}_labelled_seqs.pkl (ProtBert embeddings pickled)
EVE_{BLAT|BRCA|CALM|MTH3|TIMB|TOXI|UBQT}_2000_samples.csv (EVE embeddings as csv).
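To verify a download, the file name patterns above can be expanded and compared against the local directory. The following is a minimal sketch with hypothetical helper names. It assumes that every dataset ships every representation, which may over-report because the list above only promises "most embeddings", and that either the .pkl or the .npz variant is sufficient. The directory data/files in the usage line is taken from the Project Structure section.

```python
from pathlib import Path

DATASETS = ["blat", "brca", "calm", "mth3", "timb", "toxi", "ubqt"]
REPRESENTATIONS = ["esm", "esm1v", "esm2", "prott5", "pssm", "seq_reps"]


def expected_files(dataset):
    """List expected file names for one dataset; each tuple holds alternatives."""
    files = [(f"{dataset}_data_df.pkl",)]
    for rep in REPRESENTATIONS:
        # Embeddings are persisted either as pickle or as numpy archive.
        files.append((f"{dataset}_{rep}_rep.pkl", f"{dataset}_{rep}_rep.npz"))
    files.append((f"ProtBert_{dataset}_labelled_seqs.pkl",))
    # EVE files use the upper-case dataset identifier.
    files.append((f"EVE_{dataset.upper()}_2000_samples.csv",))
    return files


def missing_files(data_dir, datasets=DATASETS):
    """Return a description of every expected file not found in data_dir."""
    data_dir = Path(data_dir)
    missing = []
    for dataset in datasets:
        for alternatives in expected_files(dataset):
            if not any((data_dir / name).is_file() for name in alternatives):
                missing.append(" or ".join(alternatives))
    return missing


if __name__ == "__main__":
    for entry in missing_files("data/files"):
        print("missing:", entry)
```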

To run experiments in the default configuration: (protein-regression) python run_experiments.py.

To run the optimization experiments: (protein-regression) python run_optimization.py.

Running specific settings and experiments

To run specific experiment settings, provide the specifications as input flags to the experiment scripts. For example, to run the Beta-Lactamase experiments using an ESM-1b embedding with a linear GP regressor under the random CV protocol: python run_experiments.py -d 1FQG -r esm -m GPLinearFactory -p 0.

See python run_experiments.py --help for more details:

usage: run_experiments.py [-h] [-d {MTH3,TIMB,CALM,1FQG,BRCA,TOXI,UBQT}] [-r {transformer,esm,eve,eve_density,one_hot,esm1v,esm2,prott5,pssm}]
                          [-p PROTOCOL] [-m {KNNFactory,RandomForestFactory,GPSEFactory,GPLinearFactory,GPMaternFactory,UncertainRFFactory}] [--dim DIM]
                          [--ablation {dim-reduction,augmentation,threshold,cv}] [--no_optimize] [--mock]

Experiment Specifications

optional arguments:
  -h, --help            show this help message and exit
  -d {MTH3,TIMB,CALM,1FQG,BRCA,TOXI,UBQT}, --data {MTH3,TIMB,CALM,1FQG,BRCA,TOXI,UBQT}
                        Dataset identifier
  -r {transformer,esm,eve,eve_density,one_hot,esm1v,esm2,prott5,pssm}, --representation {transformer,esm,eve,eve_density,one_hot,esm1v,esm2,prott5,pssm}
                        Representation of data identifier
  -p PROTOCOL, --protocol PROTOCOL
                        Index for Protocol from list [Random, Positional, Fractional]
  -m {KNNFactory,RandomForestFactory,GPSEFactory,GPLinearFactory,GPMaternFactory,UncertainRFFactory}, --method_key {KNNFactory,RandomForestFactory,GPSEFactory,GPLinearFactory,GPMaternFactory,UncertainRFFactory}
                        Method identifier
  --dim DIM             Dimension reduction experiments
  --ablation {dim-reduction,augmentation,threshold,cv}
                        Specify type of ablation for the run.
  --no_optimize         Do not optimize regressor.
  --mock                Mock experiment iterations.
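Several settings can be queued by assembling the command lines from the flags documented above. The sketch below is illustrative and not part of this repository. It assumes that protocol indices follow the order of the list in the help text; only index 0 for the random protocol is confirmed by the example above. It prints the commands by default and launches them only when dry_run is disabled.

```python
import itertools
import subprocess
import sys

# Assumed mapping: index 0 (Random) is confirmed by the README example;
# the remaining indices are inferred from the order in the help text.
PROTOCOLS = {"Random": 0, "Positional": 1, "Fractional": 2}


def build_command(dataset, representation, method, protocol):
    """Build the argument list for one run of run_experiments.py."""
    return [
        sys.executable, "run_experiments.py",
        "-d", dataset,
        "-r", representation,
        "-m", method,
        "-p", str(PROTOCOLS[protocol]),
    ]


def build_grid(datasets, representations, methods, protocols):
    """Build one command per combination of the given settings."""
    combos = itertools.product(datasets, representations, methods, protocols)
    return [build_command(*combo) for combo in combos]


def run_all(commands, dry_run=True):
    """Print each command and, unless dry_run is set, execute it in sequence."""
    for command in commands:
        print(" ".join(command))
        if not dry_run:
            subprocess.run(command, check=True)


if __name__ == "__main__":
    grid = build_grid(["1FQG"], ["esm", "one_hot"], ["GPLinearFactory"], ["Random"])
    run_all(grid, dry_run=True)
```

Run the script from the repository root so that run_experiments.py resolves. For larger grids on a cluster, the schedule_experiments_slurm*.sh scripts are likely the better fit.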

Project Structure

./algorithms/ contains the abstract definitions and implementations of the regressors,
./data/ contains scripts to compute embeddings/representations and the splitting protocols,
./data/files contains the data sets required to run the experiments (original .csv files, embeddings, MSA files); persisted files are in pickle format; all downloaded files go here (!),
./notebooks/ contains Jupyter notebooks to replicate the figures from the manuscript; requires that the experiments have run and completed successfully,
./notebooks/figures_main.ipynb contains the figures for the main manuscript,
./notebooks/figures_supplementary.ipynb contains the figures for the supplementary material,
./results/ directory for experimental results,
./results/cache/ cache of dictionaries obtained from MLflow,
./results/figures/ saved figures produced by the ./make_plot_*.py scripts,
./results/mlruns output of the MLflow experiments,
./test/ contains pytest modules for specific tests, i.e. data loading and consistency, custom CV splitters, and custom UC/UQ code,
./uncertainty_quantification/ module for UC/UQ code,
./util/ miscellaneous utility code, used for encoding of data, pre-, and post-processing,
./util/mlflow/ MLflow-specific utility module; defines variables, constants, loading functions, etc.,
./visualization/ module to generate figures; required by the ./make_plot_*.py scripts,
./run_experiments.py run script for all experiments; calls ./run_single_regression_task.py with experiment specifications,
./run_optimization.py run script for all experiments with the optimization protocol; calls ./run_single_optimization_task.py with experiment specifications,
./schedule_experiments_slurm*.sh shell scripts to schedule Slurm runs as array jobs; requires the files under ./slurm_configs/ as experiment input parameters.

Cite

This codebase is part of

Michael R, Kæstel-Hansen J, Mørch Groth P, Bartels S, Salomon J, Tian P, Hatzakis NS, Boomsma W. A systematic analysis of regression models for protein engineering. PLoS Comput Biol. 2024 May 3;20(5):e1012061. doi: 10.1371/journal.pcbi.1012061. PMID: 38701099; PMCID: PMC11095727.

NOTE: if you cite us and use results or data, make sure to also cite the respective sources as indicated in the Methods and Supplementary Files.

@article{MichaelKaestel2024Systematic,
  title={A systematic analysis of regression models for protein engineering},
  author={Michael, Richard and K{\ae}stel-Hansen, Jacob and M{\o}rch Groth, Peter and Bartels, Simon and Salomon, Jesper and Tian, Pengfei and Hatzakis, Nikos S and Boomsma, Wouter},
  journal={PLOS Computational Biology},
  volume={20},
  number={5},
  pages={e1012061},
  year={2024},
  publisher={Public Library of Science San Francisco, CA USA}
}

