Skip to content

Visualization toolkit for SARS-CoV-2 whole genome sequencing data

License

Notifications You must be signed in to change notification settings

hsmaan/CovidGenotyper

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

COVID-19 Genotyping Tool (CGT)

DOI

Please note that the CGT backend and UI are under continuous development.

Overview

The Covid-19 Genotyping Tool (CGT) is an R-Shiny based web application that allows researchers to upload fasta sequences of Covid-19 viral genomes and compare with public sequence data available on GISAID. Genomic distance is visualized using manifold projection and network analysis, and genotype information with respective to high-prevalence SNPs is determined.

A video demonstration of CGT:

CGT Demo

Details and methodology

UI and visualizations

The CGT application was developed using the shiny R package and framework. Visualizations are generated using the ggplot2, ggnetwork, and plotly R packages. User uploaded fasta files are considered input, and the visualizations reactively adjust to user data. To expedite the loading of plots using public data, the alignment, DNA distance, and plot data fetching steps are all pre-processed for GISAID public sequence data. Once users upload in-house sequencing data, these steps are re-performed with the concatenation of user and public data. DNA distance and UMAP calculations use a heuristic to expedite processing time (see below).

Sequence and metadata retrieval

Processed fasta files and metadata of Covid-19 viral genome sequence are retrieved from the GISAID EpiCoV database, which is a public database for sharing of viral genome sequence data. Viral genome data and metadata are updated on a weekly basis. Sequences are filtered for completeness (>29000 nucleotides) and high coverage (<0.1% of all ambiguous nucleotides - e.g. N, M, W). Outlier sequences are also filtered out, defined by >0.05% unique amino acid substitutions compared to all GISAID sequences. This criteria is based on the mutation rate of SARS-CoV-2 and breadth of the GISAID database. Due to space and computational limitations, since the June 26th update, 10000 sequences from those that meet the filtering criteria are randomly sampled and analyzed.

Genome sequence alignment

GISAID sequences are subset for those that have corresponding metadata. Public sequencing data is pre-aligned before being uploaded to the server. Fasta sequences are read and written using the Biostrings package. Gap removal and multiple-sequence alignment is performed using DECIPHER. Post alignment processing is done using ape. User uploaded fasta sequences are processed similarly, with the exception of complete alignment - the user sequence is aligned to the pre-aligned public data profile using AlignProfiles from DECIPHER.

DNA distance

For both pre-aligned and profile aligned data, DNA distance is determined using ape and the Kimura-80 model of nucleotide substitution. Currently only Kimura-80 is supported, but integrating other evolutionary distance metrics will be part of a future release.

The following nucleotide positions are masked post-alignment when determining DNA distance due to homoplasy, as per the recommendations from this article:

187, 1059, 2094, 3037, 3130, 6990, 8022, 10323, 10741, 11074, 13408, 14786, 19684, 20148, 21137, 24034, 24378, 25563, 26144, 26461, 26681, 28077, 28826, 28854, 29700, 4050, 13402, 11083, 15324, 21575

User-uploaded data is aligned and the distance between each user uploaded genome and the public genomes from GISAID are calculated first. If all user-uploaded sequences are significantly similar to a publicly uploaded genome (min dist < 1e-4, a lower bound based on distances between public genomes), then the distances for the user-uploaded genomes are imputed as the most similar publicly uploaded genome for each. Otherwise, if user-uploaded sequences do not meet this criteria (min dist > 1e-4 for any user-uploaded genome), then the entire distance matrix is recomputed. This heuristic allows for significantly improved computation time for user-uploaded data, and is based on properties of SARS-CoV-2 - Betacoronaviruses have low mutation rates, and most genomes sequenced within the current pandemic will be highly similar.

Uniform manifold projection and approximation (UMAP)

UMAP is performed to visualize DNA distances between public data and user uploaded data. In this context, UMAP aims to cluster groups of closely-linked viral genomes together based on DNA distance. The uwot implementation of UMAP is utilized using the pre-computed DNA distance matrix. Defaults for the uwot implementation are employed, with the exception of the following parameters:

init = "spectral"
metric = "cosine"
n_neighbors = 50
min_dist = 0.001
spread = 30
local_connectivity = 10

Similar to the DNA distance heuristic outlined above, if the minimum distance (<1e-4) requirements for all user-uploaded sequences are met, the UMAP coordinates of the user-uploaded sequences are imputed as the coordinates of the minimum distance matched genomes from GISAID. Otherwise, UMAP is recomputed with the added user-data.

Network analysis and minimum spanning tree (MST)

Network representations of DNA distance are visualized using the igraph package. DNA distance matrices are utilized to create graph representations. Connectivity is determined by employing the minimum spanning tree, which aims to determine a path that connects all vertices while minimizing distance. This method, along with the graphopt force-directed layout aims to visualize connectivity of viral genomes from affected individuals, potential paths of transmission based on connectivity, and large network hubs which may be associated with outbreak epicenters.

Single-nucleotide polymorphisms (SNP)

Genotype profiles of viral genomes are determined using high prevalence non-synonymous SNPs (based on minor allele frequency) within structural protein (E, M, N, S) genome regions in the public sequencing data. SNPs were called using snp-sites, annotated using snpEff, and frequency analysis was performed using vcftools. The top 9 most frequent non-synonymous (missense, nonsense) SNPs within structural protein genome regions are presented.

Local deployment

CGT can also be installed locally. Application deployment has currently only been tested on Linux systems including Ubuntu 18.04 LTS and Debian 9.0 LTS, thus we only provide installation instructions for Debian/Ubuntu systems.

1) Installing CGT dependencies

Clone the repository locally
git clone https://github.com/hsmaan/CovidGenotyper

Create a data directory within the CovidGenotyper directory

cd CovidGenotyper
mkdir data

Install snp-sites and vcftools
sudo apt-get install snp-sites vcftools

Download and install snpEff and ensure it’s installed in the right directory

unzip snpEff
mv snpEff /usr/local/bin

Download genbank (.gb) and fasta (.fa) files of SARS-CoV-2 reference NC_045512 (rename to genes.gbk and covid.fa respectively) from GenBank and create SARS-CoV-2 reference database for snpEff. Save a copy of the fasta file for use in downstream processing as ncov_ref_NC_045512.fasta

mkdir -p /usr/local/bin/snpEff/data/COVID
cp covid.fa data/ncov_ref_NC_045512.fasta
mv genes.gbk covid.fa /usr/local/bin/snpEff/data/COVID
echo COVID.genome : COVID >> /usr/local/bin/snpEff/snpEff.config
java -jar /usr/local/bin/snpEff/snpEff.jar build -genbank -v COVID

Download GFF3 file of SARS-CoV-2 reference NC_045512 from GenBank, and rename to ncov_NC_045512_Genes.GFF3 and save it in data directory

mv ncov_NC_045512_Genes.GFF3 data

Ensure R >3.5 is installed, and run the R package installation script. Ensure all packages are installed. Packages may fail due to unmet library depdendencies - check Rscript output and install

cd bin
Rscript --verbose packages_install.R
cd ..

2) Run preprocessing scripts

CGT relies on pre-processing plot data prior to deployment to ensure visualizations can be loaded quickly. Fasta sequences should be downloaded from GISAID’s EpiCoV database and saved as gisaid_cov2020_sequences_[mmm_dd].fasta in the data folder. Metadata from GISAID should be saved as gisaid_metadata_[mmm_dd].tsv, also in the data folder.

The order for processing scripts is the following:

cd bin
Rscript --verbose metadata_process.R
Rscript --verbose gisaid_sequence_process.R
sh snp_sites_process.sh
Rscript --verbose maf_sites_out.R
Rscript --verbose preprocess_plot_data.R
cd ..

3) Deploy CGT

Now that the shiny application dependencies have been installed and data has been preloaded, the shiny app can be deployed in a variety of ways, documented here.

We recommend creating a docker image of the shiny app, which can be deployed locally and on the cloud. First run the base docker image installation, which includes installation of the rocker shiny image, system dependencies, and R

sudo docker build -t cgt/base -f base.Dockerfile .

Now the shiny app can be routinely rebuilt on top of this image, after having rerun the processing scripts outlined in step 2). This allows for easy updating without having to rebuild the entire image from scratch

sudo docker build -t cgt/app -f app.Dockerfile .

The app.Dockerfile script modifies the config to expose port 80 instead of 3838 (shiny server default). To deploy the shiny app, run the following

docker run --rm cgt/app

Software information

R packages

  • shiny v1.4.0.2
  • shinyWidgets v0.5.1
  • shinycssloaders v0.3
  • shinythemes v1.1.2
  • BiocManager v1.30.10
  • Biostrings v2.54.0
  • ape v5.3
  • DECIPHER v2.14.0
  • uwot v0.1.8
  • igraph v1.2.4.2
  • ggplot2 v3.3.0
  • ggnetwork v0.5.8
  • plotly v4.9.2.1
  • Cairo v1.5.11
  • intergraph v2.0.2
  • tidyverse v1.3.0
  • data.table v1.12.8
  • stringr v1.4.0
  • reshape2 v1.4.3
  • dplyr v0.8.5
  • parallel v3.6.3
  • ggthemes v4.2.0
  • RColorBrewer v1.1.2
  • GenomicRanges v1.38.0

Command-line tools

  • snp-sites v2.3.3-2
  • snpEff v4.3t
  • VCFtools v0.1.15

References

  • Elbe S, Buckland-Merrett G. Data, disease and diplomacy: GISAID’s innovative contribution to global health. Glob Challenges. 2017;1(1):33–46.
  • Chang W, Cheng J, Allaire JJ, Xie Y, McPherson J. Shiny: Web Application Framework for R. R package version 1.4.0.2. 2020. Available from: https://CRAN.R-project.org/package=shiny
  • Wickham H. ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York, 2016.
  • Briatte F. ggnetwork: Geometries to Plot Networks with ‘ggplot2’. R package version 0.5.8. 2020.
  • C Sievert. Interactive Web-Based Data Visualization with R, plotly, and shiny. Chapman and Hall/CRC Florida, 2020.
  • McInnes L, Healy J, Melville J. UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction. arXiv [Internet]. 2018; Available from: http://arxiv.org/abs/1802.03426
  • Mamun A, Rajasekaran S. An efficient Minimum Spanning Tree algorithm. In: 2016 IEEE Symposium on Computers and Communication (ISCC). 2016. p. 1047–52.
  • Benson DA, Karsch-Mizrachi I, Lipman DJ, Ostell J, Wheeler DL. GenBank. Nucleic Acids Res. 2007;35(SUPPL. 1):21–5.
  • Pagès H, Aboyoun P, Gentleman R, DebRoy S. Biostrings: Efficient manipulation of biological strings. R package version 2.54.0. 2019.
  • Wright ES. Using DECIPHER v2.0 to analyze big biological sequence data in R. R J. 2016;8(1):352–9.
  • Paradis E, Claude J, Strimmer K. APE: Analyses of phylogenetics and evolution in R language. Bioinformatics. 2004;20(2):289–90.
  • Melville J. uwot: The Uniform Manifold Approximation and Projection (UMAP). Method for Dimensionality Reduction. R package version 0.1.8. 2020.
  • Csardi G, Nepusz T. The igraph software package for complex network research. InterJournal, Complex Systems. 1695. 2006. http://igraph.org
  • Page AJ, Taylor B, Delaney AJ, Soares J, Seemann T, Keane JA, et al. SNP-sites: rapid efficient extraction of SNPs from multi-FASTA alignments. Microb genomics. 2016;2(4):e000056.
  • Cingolani P, Platts A, Wang LL, Coon M, Nguyen T, Wang L, et al. A program for annotating and predicting the effects of single nucleotide polymorphisms, SnpEff: SNPs in the genome of Drosophila melanogaster strain w1118; iso-2; iso-3. Fly. 2012.
  • Danecek P, Auton A, Abecasis G, Albers CA, Banks E, DePristo MA, et al. The variant call format and VCFtools. Bioinformatics. 2011;27(15):2156–8.
  • Simon Urbanek and Jeffrey Horner (2020). Cairo: R Graphics Device using Cairo Graphics Library for Creating High-Quality Bitmap (PNG, JPEG, TIFF), Vector (PDF, SVG, PostScript) and Display (X11 and Win32) Output. R package version 1.5-11. https://CRAN.R-project.org/package=Cairo
  • Bojanowski, Michal (2015) intergraph: Coercion Routines for Network Data Objects. R package version 2.0-2. http://mbojan.github.io/intergraph
  • Wickham et al., (2019). Welcome to the tidyverse. Journal of Open Source Software, 4(43), 1686, https://doi.org/10.21105/joss.01686
  • Matt Dowle and Arun Srinivasan (2019). data.table: Extension of data.frame. R package version 1.12.8. https://CRAN.R-project.org/package=data.table
  • Hadley Wickham (2019). stringr: Simple, Consistent Wrappers for Common String Operations. R package version 1.4.0. https://CRAN.R-project.org/package=stringr
  • Hadley Wickham (2007). Reshaping Data with the reshape Package. Journal of Statistical Software, 21(12), 1-20. URL http://www.jstatsoft.org/v21/i12/.
  • Hadley Wickham, Romain François, Lionel Henry and Kirill Müller (2020). dplyr: A Grammar of Data Manipulation. R package version 0.8.5. https://CRAN.R-project.org/package=dplyr
  • R Core Team (2020). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL https://www.R-project.org/.
  • Jeffrey B. Arnold (2019). ggthemes: Extra Themes, Scales and Geoms for ‘ggplot2’. R package version 4.2.0. https://CRAN.R-project.org/package=ggthemes
  • Erich Neuwirth (2014). RColorBrewer: ColorBrewer Palettes. R package version 1.1-2. https://CRAN.R-project.org/package=RColorBrewer
  • Lawrence M, Huber W, Pages H, Aboyoun P, Carlson M, et al. (2013) Software for Computing and Annotating Genomic Ranges. PLoS Comput Biol 9(8): e1003118. doi:10.1371/journal.pcbi.1003118

License

GNU General Public License 3.0