Phenotypes of SARS-CoV-2 spike mutants with possible predictive power for forecasting evolution

Overview

This GitHub repository is designed to aggregate data about various phenotypes of the SARS-CoV-2 spike that may be of value for forecasting the virus's evolution.

The notebook draws on different data sources for the effects of mutations on SARS-CoV-2 phenotypes. Currently, these data sources are:

The idea is that if you are interested in making forecasts of viral evolution, this repository provides a way to obtain up-to-date data on how mutations have been measured or predicted to affect spike phenotypes.

Running the notebook in the repo to get phenotypic predictions

The repository consists of a Jupyter notebook (SARS2-spike-predictor-phenos.ipynb) that can be run with an appropriate YAML configuration file (e.g., config.yaml) to tabulate both the effects of mutations on spike phenotypes and the predicted phenotypes of different SARS-CoV-2 clades.

The notebook reads data stored in this repository on how mutations affect spike phenotypes: see the mutation_phenotype_csvs key in config.yaml for these data sources. The data themselves are in ./data/, and the README in that subdirectory provides more explanation.

The notebook reads those data and then generates four output files, which by default are as follows:

If you want, you can just use the values in those CSVs. However, although the input data on spike phenotypes in this repository are updated only occasionally (when new data become available), new clades are constantly being designated and their estimated growth rates are updated daily, so the clade phenotype estimates change constantly.

Therefore, if you want to get the latest predictions, your best bet is to clone this repository (perhaps as a submodule), and then run the notebook yourself.

After you have obtained the repo, first build the conda environment in environment.yml, then activate it with:

conda activate SARS2-spike-predictor-phenos

Then run the Jupyter notebook SARS2-spike-predictor-phenos.ipynb using papermill with:

papermill -p config_yaml config.yaml SARS2-spike-predictor-phenos.ipynb results/SARS2-spike-predictor-phenos.ipynb
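If you would rather launch the run from a Python script, the same command can be assembled with the standard library. This is only a sketch: the helper names `papermill_command` and `run_notebook` and their default arguments are illustrative and not part of the repository, and it assumes `papermill` is on the `PATH` of the activated environment.

```python
import subprocess
from pathlib import Path

NOTEBOOK = "SARS2-spike-predictor-phenos.ipynb"


def papermill_command(config_yaml="config.yaml", results_dir="results"):
    """Build the papermill command shown above as an argument list."""
    output_notebook = Path(results_dir) / NOTEBOOK
    return [
        "papermill",
        "-p", "config_yaml", str(config_yaml),
        NOTEBOOK,
        output_notebook.as_posix(),
    ]


def run_notebook(config_yaml="config.yaml"):
    """Run the notebook; call this from the top of the cloned repository."""
    # check=True raises CalledProcessError if papermill exits with an error.
    subprocess.run(papermill_command(config_yaml), check=True)
```

Calling `run_notebook()` with no arguments is intended to be equivalent to the command above; passing a different path swaps in another configuration file.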

Note that you can pass a custom configuration file to the notebook with the -p config_yaml <configuration YAML> option, so you can use a configuration other than the default one in config.yaml. In particular, if you want reproducible output, you should specify specific versions for the pango_json and pango_growth_json keys in the YAML rather than the latest versions used in the default config.yaml.
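One way to prepare such a pinned configuration is to copy the default settings and replace only the two Pango keys. The sketch below assumes that pango_json and pango_growth_json are top-level keys of the parsed YAML; the helper name `pin_pango_versions` is hypothetical, and the pinned values are left as placeholders because their exact form depends on config.yaml.

```python
import copy


def pin_pango_versions(config, pango_json, pango_growth_json):
    """Return a copy of a parsed config with the two Pango keys replaced."""
    # Assumption: both keys sit at the top level of the configuration.
    for key in ("pango_json", "pango_growth_json"):
        if key not in config:
            raise KeyError(f"expected key {key!r} in the configuration")
    pinned = copy.deepcopy(config)  # leave the caller's dict untouched
    pinned["pango_json"] = pango_json
    pinned["pango_growth_json"] = pango_growth_json
    return pinned


# Possible usage with PyYAML (file name config_pinned.yaml is illustrative):
#   import yaml
#   with open("config.yaml") as f:
#       config = yaml.safe_load(f)
#   pinned = pin_pango_versions(config, "<pinned pango_json>",
#                               "<pinned pango_growth_json>")
#   with open("config_pinned.yaml", "w") as f:
#       yaml.safe_dump(pinned, f, sort_keys=False)
```

The resulting file can then be passed to papermill with -p config_yaml in place of config.yaml.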

Interactive plot of phenotypes

In addition to the CSV files in ./results/, running the notebook creates an interactive plot that allows you to look at scatter plots of the phenotypes for clades. That plot is placed in docs/index.html, and is rendered on GitHub Pages at https://jbloomlab.github.io/SARS2-spike-predictor-phenos/.

Importance of the randomized phenotypes

When making predictions, there is always a danger of over-fitting or failing to account for phylogenetic correlations in a way that makes phenotypes seem more predictive of evolution than they really are. Therefore, the pipeline creates files (as described above) that randomize effects among mutations and generates clade phenotype predictions from these randomized data. You should always compare the accuracy of predictions made with the actual non-randomized data to the accuracy of predictions made with these randomized data: if the actual data are no more predictive than the randomized data, then you are overfitting or neglecting to account for phylogenetic correlations.
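The comparison can be sketched as follows. This is a minimal illustration, not the repository's method: it assumes that accuracy is scored as the Pearson correlation between a predicted clade phenotype and clade growth, it uses made-up toy numbers, and it does not itself correct for phylogenetic correlations. The function names are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def randomization_p_value(actual_score, randomized_scores):
    """Fraction of randomized scores at least as good as the actual score.

    The +1 terms count the actual data as one draw, so the value is never 0.
    """
    n_as_good = sum(score >= actual_score for score in randomized_scores)
    return (n_as_good + 1) / (len(randomized_scores) + 1)


# Toy numbers for illustration only; they are not from the repository.
growth = [0.1, 0.4, 0.2, 0.9, 0.7]
actual_pheno = [0.2, 0.5, 0.1, 1.0, 0.6]
randomized_phenos = [
    [0.6, 0.1, 1.0, 0.2, 0.5],
    [1.0, 0.2, 0.5, 0.6, 0.1],
    [0.5, 0.6, 0.2, 0.1, 1.0],
]

actual_score = pearson(actual_pheno, growth)
randomized_scores = [pearson(p, growth) for p in randomized_phenos]
p_value = randomization_p_value(actual_score, randomized_scores)
```

A small `p_value` suggests the actual data predict growth better than the randomized data do; a large one is the warning sign described above. With only a few randomizations the smallest attainable value is coarse, so more randomizations give a finer comparison.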

Versioning

Each new run of this pipeline on GitHub has a tag indicating the date it was run as YYYY-MM-DD. In addition, the CHANGELOG describes updates such as adding new data.
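If you track this repository programmatically, the date-style tags can be used to find the most recent run. The sketch below assumes you already have the tag names as strings (for example from `git tag --list`); the helper names are hypothetical, and any tag that is not a valid YYYY-MM-DD date is ignored.

```python
import re
from datetime import datetime

DATE_TAG = re.compile(r"\d{4}-\d{2}-\d{2}")


def parse_run_tag(tag):
    """Return the date encoded in a YYYY-MM-DD tag, or None for other tags."""
    if not DATE_TAG.fullmatch(tag):
        return None
    try:
        return datetime.strptime(tag, "%Y-%m-%d").date()
    except ValueError:  # right shape but not a real calendar date
        return None


def latest_run_tag(tags):
    """Return the most recent date-style tag, or None if there is none."""
    dated = [(parse_run_tag(tag), tag) for tag in tags]
    dated = [(day, tag) for day, tag in dated if day is not None]
    return max(dated)[1] if dated else None
```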

Acknowledgments

This repository is maintained by Jesse Bloom.

Thanks to: