Commit

Clarify interpretability of results
Closes #93
huddlej committed Apr 25, 2024
1 parent 842f9ee commit 0d63aad
Showing 1 changed file with 4 additions and 5 deletions.
9 changes: 4 additions & 5 deletions manuscript/cartography.tex
@@ -236,9 +236,8 @@ \section*{Abstract}
We applied principal component analysis (PCA), multidimensional scaling (MDS), t-distributed stochastic neighbor embedding (t-SNE), and uniform manifold approximation and projection (UMAP) to sequences with well-defined phylogenetic clades and either reassortment (H3N2) or recombination (SARS-CoV-2).
For each low-dimensional embedding of sequences, we calculated the correlation between pairwise genetic and Euclidean distances in the embedding and applied a hierarchical clustering method to identify clusters in the embedding.
We measured the accuracy of clusters compared to previously defined phylogenetic clades, reassortment clusters, or recombinant lineages.
-We found that MDS maintained the strongest correlation between pairwise genetic and Euclidean distances between sequences and best captured the intermediate placement of recombinant lineages between parental lineages.
-Clusters from t-SNE most accurately recapitulated known phylogenetic clades and recombinant lineages.
-Both MDS and t-SNE accurately identified reassortment groups.
+We found that MDS embeddings accurately represented pairwise genetic distances including the intermediate placement of recombinant SARS-CoV-2 lineages between parental lineages.
+Clusters from t-SNE embeddings accurately recapitulated known phylogenetic clades, H3N2 reassortment groups, and SARS-CoV-2 recombinant lineages.
We show that simple statistical methods without a biological model can accurately represent known genetic relationships for relevant human pathogenic viruses.
Our open source implementation of these methods for analysis of viral genome sequences can be easily applied when phylogenetic methods are either unnecessary or inappropriate.

@@ -292,7 +291,7 @@ \section*{Introduction}
These natural viruses are highly relevant as major causes of global human mortality, common subjects of real-time genomic epidemiology, and representatives of reassortant and recombinant human pathogens.
For each combination of virus and embedding method, we quantified the relationship between pairwise genetic and Euclidean embedding distances, identified clusters of closely-related genomes in embedding space, and evaluated the accuracy of clusters compared to genetic groups defined by experts.
Finally, we tested the ability of these methods to capture patterns of reassortment between seasonal influenza A/H3N2 hemagglutinin (HA) and neuraminidase (NA) segments and recombination in SARS-CoV-2 genomes.
-These results inform our recommendations for future applications of these methods including which are most effective for specific problems in genomic epidemiology and which parameters researchers should use for each method.
+These results demonstrate the interpretability of embeddings from each method and inform our recommendations for future applications of these methods to specific problems in genomic epidemiology.

% Results and Discussion can be combined.
\section*{Results}
@@ -592,7 +591,7 @@ \subsection*{Recommendations for application of methods to new pathogens}

The computational complexity of the original MDS and t-SNE algorithms scales by the cube and square of the number of input samples ($N$), respectively \cite{Yang2006,maaten2008visualizing}.
However, more efficient alternate implementations exist for both methods that can scale by $N\log{N}$ and operate on millions of samples \cite{Yang2006,Delicado2024,Yang2013,vandermaaten2013,vandermaaten2014}.
-For this reason, the primary practical limitation to applying MDS and t-SNE to tens of thousands or more pathogen sequences is the time required to calculate pairwise genetic distances for all samples.
+For this reason, the primary practical limitation to scaling MDS and t-SNE to increasingly more pathogen sequences is the time and space required to calculate pairwise genetic distances for all samples.
Implementing more efficient pairwise distance calculations remains a future direction for this area of research.

\subsection*{Limitations of methods and analysis}
