Fitness effects of nucleotide and amino acid mutations updated with novel estimates of neutral rates.
This repository computes an updated estimate of the fitness effects of mutations of the SARS-CoV-2 genome, based on the work presented in this paper by H.K. Haddox, G. Angehrn, L. Sesta, C. Jennings-Shaffer, S. Temple, J.G. Galloway, W.S. DeWitt, F.A. Matsen IV, and R.A. Neher.
The counts from the SARS-CoV-2 mutation-annotated tree provided by the UShER developers are used to accurately estimate the mutation rates according to:
- A site's local nucleotide context.
- The base pairing in RNA secondary structure.
- The region of the genome the site belongs to.
Fitness effects are subsequently estimated by comparing the actual observed counts to the one predicted from the inferred rates, within a Bayesian probabilistic framework that also provides uncertainties.
- Details about the computational framework can be found in the related paper.
- Data used as reference for RNA secondary structure are in Lan et al.
- Evidence about influence of secondary structure on mutation rates were first presented in a paper by Hensel.
- The original approach for estimating mutational fitness is presented in Bloom & Neher.
Interactive plots for visualizing the results of the analysis can be found at https://neherlab.github.io/SARS2-refined-fitness/.
It is possible to reproduce the fitness estimates by running the computational analysis defined in this GitHub repository.
Firstly, a customized conda environment needs to be built from the environment.yml file. To do so, you must install conda and then run:
conda env create -f environment.yml
This will create a conda environment called SARS2-refined-fitness
, that you need to activate:
conda activate SARS2-refined-fitness
The pipeline is managed by snakemake through a Snakefile, whose configuration is defined in config.yaml. To run the pipeline use:
snakemake -c <n_cpus>
where n_cpus
is the number of CPU cores you want to use.
The pipeline mainly relies on Python Jupyter notebooks to run. These can be found in the ./notebook folder.
The pipeline runs downstream from two files fetched directly from the jbloomlab/SARS2-mut-fitness GitHub repo:
- Table of clade founders nucleotides ~/results_gisaid_2024-04-24/clade_founder_nts/clade_founder_nts.csv.
- Clade-wise table of counts ~/results_gisaid_2024-04-24/expected_vs_actual_mut_counts/expected_vs_actual_mut_counts.csv.
The related links are defined in the config.yaml file and can be changed to any version of the reference UShER tree.
The file containing the nucleotide pairing predictions from Lan et al is located at ./data/lan_2022.
Ahead of the computation of mutational fitness effects, predicted and actual mutation counts can be aggregated by defining clusters of clades. This is defined by a dictionary clade_cluster
in the config.yaml file, which can be customized.
Files produced by the pipeline are saved into the ./results folder. These are subsequently divided in the following subfolders:
- curated: the training datasets for the General Linear Model (GLM) producing the predicted counts. These are divided between pre and Omicron clades.
- master_tables: the reference models inferred from the training datasets, i.e. a site's context and the associated mutation rate.
- ntmut_fitness: files
{cluster}_ntmut_fitness.csv
for each cluster of clades with the nucleotide mutation fitness effects. - aamut_fitness: files
{cluster}_aamut_fitness.csv
with fitness effects of the amino acid mutations for each cluster of clades.
A detailed description of the theoretical framework for the GLM and the Bayesian setting can be found in this paper. Here we outline some fundamental elements.
GLM's are inferred on two curated datasets containing counts for synonymous mutations. Genome sites are retained if:
- The wildtype nucleotide of the clade founder is equal to the Wuhan-1 reference.
- The site motif, i.e. 5'-3' nucleotides context does not change.
- The codon the site belongs to is conserved.
- Sites that are marked as
excluded=True
ormasked_in_usher=True
are excluded.
The output of the GLM is a predicted count for each possible nucleotide mutation and condition:
$$n^{x_i\rightarrow y_i}{\mathrm{pred}}\left(\mathbf c_i\right) = n^{x_i\rightarrow y_i}{\mathrm{pred}}\left(p_i, m_i, l_i\right),$$
where:
-
$x_i$ ,$y_i$ are the wildtype and mutant nucleotide respectively. -
$p_i$ is the pairing state. -
$m_i$ is the site motif. -
$l_i$ indicates if the site is before/after a lightswitch position along the genome.
These predicted counts can be converted in terms of rates by imposing that:
where
Once the rates are determined from the training datasets, it is possible to compute the predicted counts for every clade-specific subtree by re-scaling according to the proper evolutionary time
Fitness effects of nucleotide mutations are computed within a Bayesian
probabilistic framework. For a specific site and nucleotide mutation,
our objective is the posteior probability of the fitness effect
The posterior reads:
with the likelihood being:
Once the posterior is known, we can compute our estimate and uncertainty of the fitness effect as:
Fitness effects of amino acid mutations are computed as weighted averages
of all nucleotide mutations producing the same nonsynonymous mutation in a
given codon. For an amino acid mutation
where