The Covid-19 Genotyping Tool (CGT) is an R-Shiny based web application that allows researchers to upload fasta sequences of Covid-19 viral genomes and compare with public sequence data available on GISAID. Genomic distance is visualized using manifold projection and network analysis, and genotype information with respective to high-prevalence SNPs is determined.
A video demonstration of CGT:
The CGT application was developed using the shiny
R package and
framework. Visualizations are generated using the ggplot2
,
ggnetwork
, and plotly
R packages. User uploaded fasta files are
considered input, and the visualizations reactively adjust to user data.
To expedite the loading of plots using public data, the alignment, DNA
distance, and plot data fetching steps are all pre-processed for GISAID
public sequence data. Once users upload in-house sequencing data, these
steps are re-performed with the concatenation of user and public data.
DNA distance and UMAP calculations use a heuristic to expedite
processing time (see below).
Processed fasta files and metadata of Covid-19 viral genome sequence are retrieved from the GISAID EpiCoV database, which is a public database for sharing of viral genome sequence data. Viral genome data and metadata are updated on a weekly basis. Sequences are filtered for completeness (>29000 nucleotides) and high coverage (<0.1% of all ambiguous nucleotides - e.g. N, M, W). Outlier sequences are also filtered out, defined by >0.05% unique amino acid substitutions compared to all GISAID sequences. This criteria is based on the mutation rate of SARS-CoV-2 and breadth of the GISAID database. Due to space and computational limitations, since the June 26th update, 10000 sequences from those that meet the filtering criteria are randomly sampled and analyzed.
GISAID sequences are subset for those that have corresponding metadata.
Public sequencing data is pre-aligned before being uploaded to the
server. Fasta sequences are read and written using the Biostrings
package. Gap removal and multiple-sequence alignment is performed using
DECIPHER
. Post alignment processing is done using ape
. User uploaded
fasta sequences are processed similarly, with the exception of complete
alignment - the user sequence is aligned to the pre-aligned public data
profile using AlignProfiles
from DECIPHER
.
For both pre-aligned and profile aligned data, DNA distance is
determined using ape
and the Kimura-80 model of nucleotide
substitution. Currently only Kimura-80 is supported, but integrating
other evolutionary distance metrics will be part of a future release.
The following nucleotide positions are masked post-alignment when determining DNA distance due to homoplasy, as per the recommendations from this article:
187, 1059, 2094, 3037, 3130, 6990, 8022, 10323, 10741, 11074, 13408, 14786, 19684, 20148, 21137, 24034, 24378, 25563, 26144, 26461, 26681, 28077, 28826, 28854, 29700, 4050, 13402, 11083, 15324, 21575
User-uploaded data is aligned and the distance between each user uploaded genome and the public genomes from GISAID are calculated first. If all user-uploaded sequences are significantly similar to a publicly uploaded genome (min dist < 1e-4, a lower bound based on distances between public genomes), then the distances for the user-uploaded genomes are imputed as the most similar publicly uploaded genome for each. Otherwise, if user-uploaded sequences do not meet this criteria (min dist > 1e-4 for any user-uploaded genome), then the entire distance matrix is recomputed. This heuristic allows for significantly improved computation time for user-uploaded data, and is based on properties of SARS-CoV-2 - Betacoronaviruses have low mutation rates, and most genomes sequenced within the current pandemic will be highly similar.
UMAP is performed to visualize DNA distances between public data and
user uploaded data. In this context, UMAP aims to cluster groups of
closely-linked viral genomes together based on DNA distance. The uwot
implementation of UMAP is utilized using the pre-computed DNA distance
matrix. Defaults for the uwot
implementation are employed, with the
exception of the following parameters:
init = "spectral"
metric = "cosine"
n_neighbors = 50
min_dist = 0.001
spread = 30
local_connectivity = 10
Similar to the DNA distance heuristic outlined above, if the minimum distance (<1e-4) requirements for all user-uploaded sequences are met, the UMAP coordinates of the user-uploaded sequences are imputed as the coordinates of the minimum distance matched genomes from GISAID. Otherwise, UMAP is recomputed with the added user-data.
Network representations of DNA distance are visualized using the
igraph
package. DNA distance matrices are utilized to create graph
representations. Connectivity is determined by employing the minimum
spanning tree, which aims to determine a path that connects all vertices
while minimizing distance. This method, along with the graphopt
force-directed layout aims to visualize connectivity of viral genomes
from affected individuals, potential paths of transmission based on
connectivity, and large network hubs which may be associated with
outbreak epicenters.
Genotype profiles of viral genomes are determined using high prevalence
non-synonymous SNPs (based on minor allele frequency) within structural
protein (E, M, N, S) genome regions in the public sequencing data. SNPs
were called using snp-sites
, annotated using snpEff
, and frequency
analysis was performed using vcftools
. The top 9 most frequent
non-synonymous (missense, nonsense) SNPs within structural protein
genome regions are presented.
CGT can also be installed locally. Application deployment has currently only been tested on Linux systems including Ubuntu 18.04 LTS and Debian 9.0 LTS, thus we only provide installation instructions for Debian/Ubuntu systems.
Clone the repository locally
git clone https://github.com/hsmaan/CovidGenotyper
Create a data directory within the CovidGenotyper directory
cd CovidGenotyper
mkdir data
Install snp-sites and vcftools
sudo apt-get install snp-sites vcftools
Download and install
snpEff
and ensure it’s installed in the right directory
unzip snpEff
mv snpEff /usr/local/bin
Download genbank (.gb) and fasta (.fa) files of SARS-CoV-2
reference NC_045512 (rename to genes.gbk
and covid.fa
respectively)
from GenBank and
create SARS-CoV-2 reference database for snpEff. Save a copy of the
fasta file for use in downstream processing as
ncov_ref_NC_045512.fasta
mkdir -p /usr/local/bin/snpEff/data/COVID
cp covid.fa data/ncov_ref_NC_045512.fasta
mv genes.gbk covid.fa /usr/local/bin/snpEff/data/COVID
echo COVID.genome : COVID >> /usr/local/bin/snpEff/snpEff.config
java -jar /usr/local/bin/snpEff/snpEff.jar build -genbank -v COVID
Download GFF3 file of SARS-CoV-2 reference NC_045512 from
GenBank, and rename
to ncov_NC_045512_Genes.GFF3
and save it in data directory
mv ncov_NC_045512_Genes.GFF3 data
Ensure R >3.5 is installed, and run the R package installation
script. Ensure all packages are installed. Packages may fail due to
unmet library depdendencies - check Rscript output and install
cd bin
Rscript --verbose packages_install.R
cd ..
CGT relies on pre-processing plot data prior to deployment to ensure
visualizations can be loaded quickly. Fasta sequences should be
downloaded from GISAID’s EpiCoV database and
saved as gisaid_cov2020_sequences_[mmm_dd].fasta
in the data
folder.
Metadata from GISAID should be saved as gisaid_metadata_[mmm_dd].tsv
,
also in the data
folder.
The order for processing scripts is the following:
cd bin
Rscript --verbose metadata_process.R
Rscript --verbose gisaid_sequence_process.R
sh snp_sites_process.sh
Rscript --verbose maf_sites_out.R
Rscript --verbose preprocess_plot_data.R
cd ..
Now that the shiny application dependencies have been installed and data has been preloaded, the shiny app can be deployed in a variety of ways, documented here.
We recommend creating a docker image of the shiny app, which can be
deployed locally and on the cloud. First run the base docker image
installation, which includes installation of the rocker shiny image,
system dependencies, and R
sudo docker build -t cgt/base -f base.Dockerfile .
Now the shiny app can be routinely rebuilt on top of this image, after
having rerun the processing scripts outlined in step 2). This allows
for easy updating without having to rebuild the entire image from
scratch
sudo docker build -t cgt/app -f app.Dockerfile .
The app.Dockerfile
script modifies the config to expose port 80
instead of 3838 (shiny server default). To deploy the shiny app, run the
following
docker run --rm cgt/app
- shiny v1.4.0.2
- shinyWidgets v0.5.1
- shinycssloaders v0.3
- shinythemes v1.1.2
- BiocManager v1.30.10
- Biostrings v2.54.0
- ape v5.3
- DECIPHER v2.14.0
- uwot v0.1.8
- igraph v1.2.4.2
- ggplot2 v3.3.0
- ggnetwork v0.5.8
- plotly v4.9.2.1
- Cairo v1.5.11
- intergraph v2.0.2
- tidyverse v1.3.0
- data.table v1.12.8
- stringr v1.4.0
- reshape2 v1.4.3
- dplyr v0.8.5
- parallel v3.6.3
- ggthemes v4.2.0
- RColorBrewer v1.1.2
- GenomicRanges v1.38.0
- snp-sites v2.3.3-2
- snpEff v4.3t
- VCFtools v0.1.15
- Elbe S, Buckland-Merrett G. Data, disease and diplomacy: GISAID’s innovative contribution to global health. Glob Challenges. 2017;1(1):33–46.
- Chang W, Cheng J, Allaire JJ, Xie Y, McPherson J. Shiny: Web Application Framework for R. R package version 1.4.0.2. 2020. Available from: https://CRAN.R-project.org/package=shiny
- Wickham H. ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York, 2016.
- Briatte F. ggnetwork: Geometries to Plot Networks with ‘ggplot2’. R package version 0.5.8. 2020.
- C Sievert. Interactive Web-Based Data Visualization with R, plotly, and shiny. Chapman and Hall/CRC Florida, 2020.
- McInnes L, Healy J, Melville J. UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction. arXiv [Internet]. 2018; Available from: http://arxiv.org/abs/1802.03426
- Mamun A, Rajasekaran S. An efficient Minimum Spanning Tree algorithm. In: 2016 IEEE Symposium on Computers and Communication (ISCC). 2016. p. 1047–52.
- Benson DA, Karsch-Mizrachi I, Lipman DJ, Ostell J, Wheeler DL. GenBank. Nucleic Acids Res. 2007;35(SUPPL. 1):21–5.
- Pagès H, Aboyoun P, Gentleman R, DebRoy S. Biostrings: Efficient manipulation of biological strings. R package version 2.54.0. 2019.
- Wright ES. Using DECIPHER v2.0 to analyze big biological sequence data in R. R J. 2016;8(1):352–9.
- Paradis E, Claude J, Strimmer K. APE: Analyses of phylogenetics and evolution in R language. Bioinformatics. 2004;20(2):289–90.
- Melville J. uwot: The Uniform Manifold Approximation and Projection (UMAP). Method for Dimensionality Reduction. R package version 0.1.8. 2020.
- Csardi G, Nepusz T. The igraph software package for complex network research. InterJournal, Complex Systems. 1695. 2006. http://igraph.org
- Page AJ, Taylor B, Delaney AJ, Soares J, Seemann T, Keane JA, et al. SNP-sites: rapid efficient extraction of SNPs from multi-FASTA alignments. Microb genomics. 2016;2(4):e000056.
- Cingolani P, Platts A, Wang LL, Coon M, Nguyen T, Wang L, et al. A program for annotating and predicting the effects of single nucleotide polymorphisms, SnpEff: SNPs in the genome of Drosophila melanogaster strain w1118; iso-2; iso-3. Fly. 2012.
- Danecek P, Auton A, Abecasis G, Albers CA, Banks E, DePristo MA, et al. The variant call format and VCFtools. Bioinformatics. 2011;27(15):2156–8.
- Simon Urbanek and Jeffrey Horner (2020). Cairo: R Graphics Device using Cairo Graphics Library for Creating High-Quality Bitmap (PNG, JPEG, TIFF), Vector (PDF, SVG, PostScript) and Display (X11 and Win32) Output. R package version 1.5-11. https://CRAN.R-project.org/package=Cairo
- Bojanowski, Michal (2015) intergraph: Coercion Routines for Network Data Objects. R package version 2.0-2. http://mbojan.github.io/intergraph
- Wickham et al., (2019). Welcome to the tidyverse. Journal of Open Source Software, 4(43), 1686, https://doi.org/10.21105/joss.01686
- Matt Dowle and Arun Srinivasan (2019). data.table: Extension of
data.frame
. R package version 1.12.8. https://CRAN.R-project.org/package=data.table - Hadley Wickham (2019). stringr: Simple, Consistent Wrappers for Common String Operations. R package version 1.4.0. https://CRAN.R-project.org/package=stringr
- Hadley Wickham (2007). Reshaping Data with the reshape Package. Journal of Statistical Software, 21(12), 1-20. URL http://www.jstatsoft.org/v21/i12/.
- Hadley Wickham, Romain François, Lionel Henry and Kirill Müller (2020). dplyr: A Grammar of Data Manipulation. R package version 0.8.5. https://CRAN.R-project.org/package=dplyr
- R Core Team (2020). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL https://www.R-project.org/.
- Jeffrey B. Arnold (2019). ggthemes: Extra Themes, Scales and Geoms for ‘ggplot2’. R package version 4.2.0. https://CRAN.R-project.org/package=ggthemes
- Erich Neuwirth (2014). RColorBrewer: ColorBrewer Palettes. R package version 1.1-2. https://CRAN.R-project.org/package=RColorBrewer
- Lawrence M, Huber W, Pages H, Aboyoun P, Carlson M, et al. (2013) Software for Computing and Annotating Genomic Ranges. PLoS Comput Biol 9(8): e1003118. doi:10.1371/journal.pcbi.1003118