From 0d63aade0d0bf634c11fde5f9b3c5d15a47084d2 Mon Sep 17 00:00:00 2001
From: John Huddleston
Date: Thu, 25 Apr 2024 16:16:43 -0700
Subject: [PATCH] Clarify interpretability of results

Closes #93
---
 manuscript/cartography.tex | 9 ++++-----
 1 file changed, 4 insertions(+), 5 deletions(-)

diff --git a/manuscript/cartography.tex b/manuscript/cartography.tex
index b2119cbd..b71665fa 100644
--- a/manuscript/cartography.tex
+++ b/manuscript/cartography.tex
@@ -236,9 +236,8 @@ \section*{Abstract}
 We applied principal component analysis (PCA), multidimensional scaling (MDS), t-distributed stochastic neighbor embedding (t-SNE), and uniform manifold approximation and projection (UMAP) to sequences with well-defined phylogenetic clades and either reassortment (H3N2) or recombination (SARS-CoV-2).
 For each low-dimensional embedding of sequences, we calculated the correlation between pairwise genetic and Euclidean distances in the embedding and applied a hierarchical clustering method to identify clusters in the embedding.
 We measured the accuracy of clusters compared to previously defined phylogenetic clades, reassortment clusters, or recombinant lineages.
-We found that MDS maintained the strongest correlation between pairwise genetic and Euclidean distances between sequences and best captured the intermediate placement of recombinant lineages between parental lineages.
-Clusters from t-SNE most accurately recapitulated known phylogenetic clades and recombinant lineages.
-Both MDS and t-SNE accurately identified reassortment groups.
+We found that MDS embeddings accurately represented pairwise genetic distances including the intermediate placement of recombinant SARS-CoV-2 lineages between parental lineages.
+Clusters from t-SNE embeddings accurately recapitulated known phylogenetic clades, H3N2 reassortment groups, and SARS-CoV-2 recombinant lineages.
 We show that simple statistical methods without a biological model can accurately represent known genetic relationships for relevant human pathogenic viruses.
 Our open source implementation of these methods for analysis of viral genome sequences can be easily applied when phylogenetic methods are either unnecessary or inappropriate.
@@ -292,7 +291,7 @@ \section*{Introduction}
 These natural viruses are highly relevant as major causes of global human mortality, common subjects of real-time genomic epidemiology, and representatives of reassortant and recombinant human pathogens.
 For each combination of virus and embedding method, we quantified the relationship between pairwise genetic and Euclidean embedding distances, identified clusters of closely-related genomes in embedding space, and evaluated the accuracy of clusters compared to genetic groups defined by experts.
 Finally, we tested the ability of these methods to capture patterns of reassortment between seasonal influenza A/H3N2 hemagglutinin (HA) and neuraminidase (NA) segments and recombination in SARS-CoV-2 genomes.
-These results inform our recommendations for future applications of these methods including which are most effective for specific problems in genomic epidemiology and which parameters researchers should use for each method.
+These results demonstrate the interpretability of embeddings from each method and inform our recommendations for future applications of these methods to specific problems in genomic epidemiology.
 % Results and Discussion can be combined.
 \section*{Results}
@@ -592,7 +591,7 @@ \subsection*{Recommendations for application of methods to new pathogens}
 The computational complexity of the original MDS and t-SNE algorithms scales by the cube and square of the number of input samples ($N$), respectively \cite{Yang2006,maaten2008visualizing}.
 However, more efficient alternate implementations exist for both methods that can scale by $N\log{N}$ and operate on millions of samples \cite{Yang2006,Delicado2024,Yang2013,vandermaaten2013,vandermaaten2014}.
-For this reason, the primary practical limitation to applying MDS and t-SNE to tens of thousands or more pathogen sequences is the time required to calculate pairwise genetic distances for all samples.
+For this reason, the primary practical limitation to scaling MDS and t-SNE to increasingly more pathogen sequences is the time and space required to calculate pairwise genetic distances for all samples.
 Implementing more efficient pairwise distance calculations remains a future direction for this area of research.
 \subsection*{Limitations of methods and analysis}