Fitness effects of nucleotide and amino acid mutations updated with novel estimates of neutral rates.

Overview

This repository computes an updated estimate of the fitness effects of mutations of the SARS-CoV-2 genome, based on the work presented in this paper by H.K. Haddox, G. Angehrn, L. Sesta, C. Jennings-Shaffer, S. Temple, J.G. Galloway, W.S. DeWitt, F.A. Matsen IV, and R.A. Neher.

The counts from the SARS-CoV-2 mutation-annotated tree provided by the UShER developers are used to accurately estimate the mutation rates according to:

A site's local nucleotide context.
The base pairing in RNA secondary structure.
The region of the genome the site belongs to.

Fitness effects are subsequently estimated by comparing the actual observed counts to the one predicted from the inferred rates, within a Bayesian probabilistic framework that also provides uncertainties.

References

Details about the computational framework can be found in the related paper.
Data used as reference for RNA secondary structure are in Lan et al.
Evidence about influence of secondary structure on mutation rates were first presented in a paper by Hensel.
The original approach for estimating mutational fitness is presented in Bloom & Neher.

Interactive plots

Interactive plots for visualizing the results of the analysis can be found at https://neherlab.github.io/SARS2-refined-fitness/.

Computational pipeline

It is possible to reproduce the fitness estimates by running the computational analysis defined in this GitHub repository.

Firstly, a customized conda environment needs to be built from the environment.yml file. To do so, you must install conda and then run:

conda env create -f environment.yml

This will create a conda environment called SARS2-refined-fitness, that you need to activate:

conda activate SARS2-refined-fitness

The pipeline is managed by snakemake through a Snakefile, whose configuration is defined in config.yaml. To run the pipeline use:

snakemake -c <n_cpus>

where n_cpus is the number of CPU cores you want to use.

The pipeline mainly relies on Python Jupyter notebooks to run. These can be found in the ./notebook folder.

Input

The pipeline runs downstream from two files fetched directly from the jbloomlab/SARS2-mut-fitness GitHub repo:

Table of clade founders nucleotides ~/results_gisaid_2024-04-24/clade_founder_nts/clade_founder_nts.csv.
Clade-wise table of counts ~/results_gisaid_2024-04-24/expected_vs_actual_mut_counts/expected_vs_actual_mut_counts.csv.

The related links are defined in the config.yaml file and can be changed to any version of the reference UShER tree.

The file containing the nucleotide pairing predictions from Lan et al is located at ./data/lan_2022.

Configuration

Ahead of the computation of mutational fitness effects, predicted and actual mutation counts can be aggregated by defining clusters of clades. This is defined by a dictionary clade_cluster in the config.yaml file, which can be customized.

Output

Files produced by the pipeline are saved into the ./results folder. These are subsequently divided in the following subfolders:

curated: the training datasets for the General Linear Model (GLM) producing the predicted counts. These are divided between pre and Omicron clades.
master_tables: the reference models inferred from the training datasets, i.e. a site's context and the associated mutation rate.
ntmut_fitness: files {cluster}_ntmut_fitness.csv for each cluster of clades with the nucleotide mutation fitness effects.
aamut_fitness: files {cluster}_aamut_fitness.csv with fitness effects of the amino acid mutations for each cluster of clades.

Theoretical framework

A detailed description of the theoretical framework for the GLM and the Bayesian setting can be found in this paper. Here we outline some fundamental elements.

GLM for predicted counts

GLM's are inferred on two curated datasets containing counts for synonymous mutations. Genome sites are retained if:

The wildtype nucleotide of the clade founder is equal to the Wuhan-1 reference.
The site motif, i.e. 5'-3' nucleotides context does not change.
The codon the site belongs to is conserved.
Sites that are marked as excluded=True or masked_in_usher=True are excluded.

The output of the GLM is a predicted count for each possible nucleotide mutation and condition:

$$n^{x_i\rightarrow y_i}{\mathrm{pred}}\left(\mathbf c_i\right) = n^{x_i\rightarrow y_i}{\mathrm{pred}}\left(p_i, m_i, l_i\right),$$

where:

$x_i$, $y_i$ are the wildtype and mutant nucleotide respectively.
$p_i$ is the pairing state.
$m_i$ is the site motif.
$l_i$ indicates if the site is before/after a lightswitch position along the genome.

These predicted counts can be converted in terms of rates by imposing that:

$$ N_s = T\sum_{x, \mathbf c}\sum_{y\neq x} s^{x\rightarrow y}\left(\mathbf c\right)r^{x\rightarrow y}(\mathbf c), $$

where $r^{x\rightarrow y}(\mathbf c)$ are the rates, $s^{x\rightarrow y}(\mathbf c)$ are the number of sites in the genome for which $x\rightarrow y$ is synonymous in condition $\mathbf c$ and $N_s$ is the total number of synonymous mutations. From the previous equation one can show that $T = N_s / \sum_{x,\mathbf{c},y\neq x}s^{x\rightarrow y}(\mathbf{c})$ and finally $r^{x\rightarrow y}(\mathbf{c}) = n^{x\rightarrow y}_{\mathrm{pred}}(\mathbf{c}) / T$.

Once the rates are determined from the training datasets, it is possible to compute the predicted counts for every clade-specific subtree by re-scaling according to the proper evolutionary time $n^{a, x\rightarrow y}_{\mathrm{pred}}(\mathbf{c}) = T^a \times r^{{x\rightarrow y}}(\mathbf{c})$, $a$ being the clade label.

Posterior estimate of mutation fitness

Nucleotide mutations

Fitness effects of nucleotide mutations are computed within a Bayesian probabilistic framework. For a specific site and nucleotide mutation, our objective is the posteior probability of the fitness effect $\Delta f$ given the observed counts $n_{\mathrm{obs}}$ and integrated over the possible expected counts $n_{\mathrm{exp}}$, whose mean is the model prediction $n_{\mathrm{pred}}$.

The posterior reads:

$$ P(\Delta f \vert n_{\mathrm{obs}}) \propto P(n_{\mathrm{obs}} \vert \Delta f) P(\Delta f), $$

with the likelihood being:

$$ P(n_{\mathrm{obs}} \vert \Delta f) = \int_{0}^{+\infty}\mathrm{d}n_{\mathrm{exp}} P(n_{\mathrm{obs}} \vert \Delta f, n_{\mathrm{exp}}) P(n_{\mathrm{exp}}). $$

Once the posterior is known, we can compute our estimate and uncertainty of the fitness effect as:

$$ \begin{cases} \left\langle \Delta f \right\rangle = \int_{-\infty}^{+\infty}\mathrm{d}(\Delta f)P(\Delta f \vert n_{\mathrm{obs}})\Delta f \\ \sigma_{\Delta f} = \sqrt{\left\langle \Delta f^2 \right\rangle - \left\langle \Delta f \right\rangle^2} \end{cases} $$

Amino acid mutations

Fitness effects of amino acid mutations are computed as weighted averages of all nucleotide mutations producing the same nonsynonymous mutation in a given codon. For an amino acid mutation $a\rightarrow a'$ one gets:

$$ \begin{cases} \Delta f\left(a\rightarrow a'\right) = \frac{{\sum_i}{\sum_y} \frac{\left\langle \Delta f_i(x_i\rightarrow y_i) \right\rangle}{\sigma^2_{\Delta f_i(x_i\rightarrow y_i)}} \delta\left(a',g(\mathbf y)\right)} {{\sum_i}{\sum_y} \frac{1}{\sigma^2_{\Delta f_i(x_i\rightarrow y_i)}} \delta\left(a',g(\mathbf y)\right)} \\ \sigma_{\Delta f(a\rightarrow a')} = \frac{1} {\sqrt{{\sum_i}{\sum_y} \frac{\delta\left(a',g(\mathbf y)\right)}{\sigma^2_{\Delta f_i(x_i\rightarrow y_i)}}}} \end{cases} $$

where $g(\mathbf y)$ maps the codon into the corresponding amino acid, and the sums run over the codon sites $i$ and possible mutant nucleotides at these sites $y_i$.

Name		Name	Last commit message	Last commit date
Latest commit History 55 Commits
data		data
docs		docs
modules		modules
notebook		notebook
results		results
scripts		scripts
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
Snakefile		Snakefile
config.yaml		config.yaml
environment.yml		environment.yml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Fitness effects of nucleotide and amino acid mutations updated with novel estimates of neutral rates.

Overview

References

Interactive plots

Computational pipeline

Input

Configuration

Output

Theoretical framework

GLM for predicted counts

Posterior estimate of mutation fitness

Nucleotide mutations

Amino acid mutations

About

Releases

Packages

Contributors 2

Languages

License

neherlab/SARS2-refined-fitness

Folders and files

Latest commit

History

Repository files navigation

Fitness effects of nucleotide and amino acid mutations updated with novel estimates of neutral rates.

Overview

References

Interactive plots

Computational pipeline

Input

Configuration

Output

Theoretical framework

GLM for predicted counts

Posterior estimate of mutation fitness

Nucleotide mutations

Amino acid mutations

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Contributors 2

Languages

Packages