This is the official code repository for the NeurIPS 2024 Spotlight paper Kermut: Composite kernel regression for protein variant effects (preprint).
Kermut is a carefully constructed Gaussian process which obtains state-of-the-art performance for supervised variant effect prediction on ProteinGym's substitution benchmark while providing well-calibrated uncertainties.
The main branch has been rewritten and restructured for ease of use and clarity. Implementation differences might result in minor numerical differences from the results reported in the paper. To reproduce the paper results, use the reproduce branch instead and consult the extensive README in that branch:
git clone -b reproduce [email protected]:petergroth/kermut.git
To install the main branch, clone the repository and set up the conda environment:
git clone [email protected]:petergroth/kermut.git
cd kermut
conda env create --file environment.yaml
conda activate kermut_envs
pip install -e .
To run Kermut from scratch without precomputed resources, e.g., for a new dataset, the ProteinMPNN repository must be installed. Additionally, the ESM-2 650M parameter model must be saved locally:
Kermut leverages structure-conditioned amino acid distributions from ProteinMPNN, which has to be installed from the official repository. An environment variable pointing to the installation location can then be set for later use:
export PROTEINMPNN_DIR=<path-to-ProteinMPNN-installation>
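Before running any preprocessing that depends on ProteinMPNN, it can help to confirm that the variable is set and points at a real directory. The following is a minimal sketch, not part of the Kermut code base; the helper name resolve_proteinmpnn_dir is hypothetical.

```python
import os
from pathlib import Path


def resolve_proteinmpnn_dir(env=None):
    """Return the directory named by PROTEINMPNN_DIR (hypothetical helper).

    Raises a descriptive error if the variable is unset or does not
    point to an existing directory.
    """
    env = os.environ if env is None else env
    raw = env.get("PROTEINMPNN_DIR")
    if not raw:
        raise EnvironmentError("PROTEINMPNN_DIR is not set")
    path = Path(raw).expanduser()
    if not path.is_dir():
        raise NotADirectoryError(f"PROTEINMPNN_DIR is not a directory: {path}")
    return path
```

Calling resolve_proteinmpnn_dir() without arguments reads the current process environment; passing a dictionary makes the check easy to test in isolation.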
Kermut leverages protein sequence embeddings and zero-shot scores extracted from ESM-2 (paper, repo). Concretely, we use the 650M parameter model (esm2_t33_650M_UR50D). While the ESM repository is installed above (via the yml-file), the model weights should be downloaded separately and placed in the models directory:
curl -o models/esm2_t33_650M_UR50D.pt https://dl.fbaipublicfiles.com/fair-esm/models/esm2_t33_650M_UR50D.pt
curl -o models/esm2_t33_650M_UR50D-contact-regression.pt https://dl.fbaipublicfiles.com/fair-esm/regression/esm2_t33_650M_UR50D-contact-regression.pt
This section describes how to access the data that was used to generate the results. To reproduce all results from scratch, follow all steps in this section and in the Data preprocessing section. To reproduce the benchmark results using precomputed resources (ESM-2 embeddings, conditional amino-acid distributions, etc.) see the section on precomputed resources.
Kermut is evaluated on the ProteinGym benchmark (paper, repo). For full details on downloading the relevant data, please see the ProteinGym resources. In the following, commands are provided to extract the relevant data.
- Reference file: A reference file with details on all assays can be downloaded from the ProteinGym repo and should be saved as data/DMS_substitutions.csv. The file can be downloaded by running the following:
curl -o data/DMS_substitutions.csv https://raw.githubusercontent.com/OATML-Markslab/ProteinGym/main/reference_files/DMS_substitutions.csv
- Assay data: All assays (with CV folds) can be downloaded and extracted to data. Run the following to download and extract all single-mutant assays. Assays will be placed in data/cv_folds_singles_substitutions:
# Download zip archive
curl -o cv_folds_singles_substitutions.zip https://marks.hms.harvard.edu/proteingym/cv_folds_singles_substitutions.zip
# Unpack and remove zip archive
unzip cv_folds_singles_substitutions.zip -d data
rm cv_folds_singles_substitutions.zip
- PDBs: All predicted structure files are downloaded and placed in data/structures/pdbs. PDBs are accessed via "Predicted 3D structures from inverse-folding models" in ProteinGym.
# Download zip archive
curl -o ProteinGym_AF2_structures.zip https://marks.hms.harvard.edu/proteingym/ProteinGym_AF2_structures.zip
# Unpack and remove zip archive
unzip ProteinGym_AF2_structures.zip -d data/structures/pdbs
rm ProteinGym_AF2_structures.zip
- Zero-shot scores: For the zero-shot mean function, precomputed scores can be downloaded and placed in data/zero_shot_fitness_predictions, where each zero-shot method has its own directory. The precomputed zero-shot scores from ProteinGym can be accessed via "Zero-shot DMS Model scores - Substitutions". NOTE: The full zip archive with all scores takes up approximately 44GB of storage. Alternatively, the zero-shot scores for the 650M parameter ESM-2 model are included in the precomputed resources, which in total take up only approximately 4GB.
# Download zip archive
curl -o zero_shot_substitutions_scores.zip https://marks.hms.harvard.edu/proteingym/zero_shot_substitutions_scores.zip
# Unpack and remove zip archive
unzip zero_shot_substitutions_scores.zip -d data/zero_shot_fitness_predictions
rm zero_shot_substitutions_scores.zip
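After the downloads above, a quick check of the resulting directory layout can catch a missed step before preprocessing starts. The sketch below only tests for the paths named in the list above; the helper name missing_paths is hypothetical, and the check assumes it is run from the repository root.

```python
from pathlib import Path

# Paths named in the data access steps, relative to the repository root.
EXPECTED_PATHS = (
    "data/DMS_substitutions.csv",
    "data/cv_folds_singles_substitutions",
    "data/structures/pdbs",
    "data/zero_shot_fitness_predictions",
)


def missing_paths(root=".", expected=EXPECTED_PATHS):
    """Return the expected paths that do not exist under `root`."""
    root = Path(root)
    return [p for p in expected if not (root / p).exists()]
```

An empty list from missing_paths() means every expected file or directory is present; it does not verify that the archives were extracted completely.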
All outputs from the preprocessing procedure (i.e., precomputed ESM-2 embeddings, conditional amino acid distributions, processed coordinate files, and zero-shot scores from ESM-2) can be readily accessed via a zip-archive hosted by the Electronic Research Data Archive (ERDA) by the University of Copenhagen using the following link. The file takes up approximately 4GB. To download and extract the data, run the following:
# Download zip archive
curl -o kermut_data.zip https://sid.erda.dk/share_redirect/c2EWrbGSCV/kermut_data.zip
# Unpack and remove zip archive
unzip kermut_data.zip && rm kermut_data.zip
After downloading and extracting the relevant data in the Data access section, ESM-2 embeddings can be generated via:
python -m kermut.cmdline.preprocess_data.extract_esm2_embeddings \
dataset=all
To generate embeddings for an individual dataset (e.g., BLAT_ECOLX_Stiffler_2015), run
python -m kermut.cmdline.preprocess_data.extract_esm2_embeddings \
dataset=single \
dataset.single.use_id=true \
dataset.single.id=BLAT_ECOLX_Stiffler_2015
To generate embeddings via index (i.e., row index in DMS_substitutions.csv), run
python -m kermut.cmdline.preprocess_data.extract_esm2_embeddings \
dataset=single \
dataset.single.id=23
The embeddings are located in data/embeddings/substitutions_singles/ESM2 (for the single-mutant assays).
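When selecting an assay by row index, it can be useful to check which assay that index refers to before launching a run. The sketch below looks up a row in the reference file. It assumes that the identifier column is named DMS_id and that indices are zero-based over the data rows (header excluded); neither is confirmed by this README, so verify both against your copy of DMS_substitutions.csv.

```python
import csv


def assay_id_for_index(reference_csv, index, id_column="DMS_id"):
    """Return the assay identifier stored in data row `index`.

    Assumptions: `index` is zero-based and excludes the header row, and
    `id_column` ("DMS_id" by default) holds the assay identifier.
    """
    with open(reference_csv, newline="") as handle:
        for row_number, row in enumerate(csv.DictReader(handle)):
            if row_number == index:
                return row[id_column]
    raise IndexError(f"{reference_csv} has no data row {index}")
```

For example, assay_id_for_index("data/DMS_substitutions.csv", 23) returns the identifier of the row that the index-based command above would select, under the stated assumptions.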
The structure-conditioned amino acid distributions for all residues and assays can be computed with ProteinMPNN via
bash example_scripts/conditional_probabilities.sh
For a single dataset, see example_scripts/conditional_probabilities_single.sh or example_scripts/conditional_probabilities_all.sh. This generates per-assay directories in data/conditional_probs/raw_ProteinMPNN_outputs. After this, postprocessing for easier access is performed via
python -m kermut.cmdline.preprocess_data.extract_ProteinMPNN_probs \
dataset=all
This generates per-assay npy-files in data/conditional_probs/ProteinMPNN.
Lastly, the 3D coordinates can be extracted from each PDB file via
python -m kermut.cmdline.preprocess_data.extract_3d_coords \
dataset=all
This saves npy-files for each assay in data/structures/coords.
For single assays, use the same dataset overrides as for the embeddings.
If you are not relying on pre-computed zero-shot scores from ProteinGym, the scores can be computed for ESM-2 via:
python -m kermut.cmdline.preprocess_data.extract_esm2_zero_shots \
dataset=all
The implementation of Kermut relies on Hydra.
Configuration files are found in kermut/hydra_configs.
Data paths are defined in data/paths.yaml and must match your setup.
To evaluate Kermut on the full benchmark, run the following:
python proteingym_benchmark.py --multirun \
dataset=benchmark \
cv_scheme=fold_random_5,fold_modulo_5,fold_contiguous_5 \
kernel=kermut # Default
To evaluate an alternative kernel (e.g., kermut_constant_mean as defined in kermut/hydra_configs/kernel/kermut_constant_mean.yaml) on the DMS assay with index 9, run
python proteingym_benchmark.py --multirun \
dataset=single \
dataset.single.id=9 \
cv_scheme=fold_random_5,fold_modulo_5,fold_contiguous_5 \
kernel=kermut_constant_mean
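With --multirun, Hydra's basic sweeper launches one job per combination of the comma-separated override values, so the command above results in three runs, one per CV scheme. The sketch below illustrates that expansion. It is a simplified stand-in, not Hydra's own parser: the function name expand_multirun is hypothetical, and it does not handle quoted values or list-valued overrides such as model_names=[kermut].

```python
from itertools import product


def expand_multirun(overrides):
    """Expand key=value overrides into one override list per job.

    Comma-separated values are treated as sweeps, and the jobs are the
    Cartesian product of all swept values. Simplified illustration only.
    """
    keys = [item.split("=", 1)[0] for item in overrides]
    choices = [item.split("=", 1)[1].split(",") for item in overrides]
    return [
        [f"{key}={value}" for key, value in zip(keys, combo)]
        for combo in product(*choices)
    ]


jobs = expand_multirun([
    "dataset=single",
    "dataset.single.id=9",
    "cv_scheme=fold_random_5,fold_modulo_5,fold_contiguous_5",
    "kernel=kermut_constant_mean",
])
for job in jobs:
    print(" ".join(job))
```

Sweeping a second key, for example kernel=kermut,kermut_constant_mean, would multiply the number of jobs accordingly.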
To compute Spearman correlation and MSE per assay and cv-scheme, run
python -m kermut.cmdline.process_results.merge_results \
dataset=benchmark
This will compute results for all models stored in the model_names.benchmark list in kermut/hydra_configs/postprocessing/default.yaml.
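For reference, the two reported metrics can be written out in a few lines. The sketch below is a plain-Python illustration of Spearman correlation (Pearson correlation of average ranks) and mean squared error; it is not the code used by merge_results, and all function names are hypothetical.

```python
def average_ranks(values):
    """Rank values from 1 to n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the block while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for i in order[start:end + 1]:
            ranks[i] = mean_rank
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(y_true, y_pred):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(y_true), average_ranks(y_pred))


def mse(y_true, y_pred):
    """Mean squared error between targets and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
```

Spearman correlation depends only on the ordering of the predictions, whereas MSE also penalizes errors in scale, which is why the two metrics are reported side by side.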
For a single model, run
python -m kermut.cmdline.process_results.merge_results \
dataset=benchmark \
"model_names=[kermut]"