Sequence alignment is often used to determine sequence similarity. Algorithms implementing sequence alignment are abundant and widely used; the paper describing BLAST (Basic Local Alignment Search Tool) is one of the most-cited papers in biology, with over 70,000 citations. However, in cases of low sequence homology, horizontal gene transfer, or a lack of a priori information (conditions common with pathogenic bacteria), alignment-based methods can perform poorly.
We introduce Sequence Non-Alignment Compression Comparison (SNACC), a pipeline that combines the Normalized Compression Distance (NCD) with clustering to provide an alignment-free method for pairwise sequence comparison. NCD is a parameter-free metric that uses compression algorithms to estimate similarity. These algorithms exploit redundancies in the input signal to reduce the size of the file encoding it. Similar strings share more redundancies, so they compress better together than dissimilar strings do. NCD has been used in diverse applications, such as classifying the musical genres of mp3 files, scanning Android files for viruses, and natural language processing. However, there are no existing tools that apply NCD to biological datasets.
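For reference, NCD is conventionally defined as follows, where C(·) denotes the compressed size in bytes under the chosen compressor and xy is the concatenation of the two inputs. The symbols x, y, and C are notation introduced here for illustration and do not appear in the text above.

```latex
\mathrm{NCD}(x, y) = \frac{C(xy) - \min\{C(x),\, C(y)\}}{\max\{C(x),\, C(y)\}}
```

Values near 0 indicate that one input adds little new information to the other; values near 1 indicate that the inputs share almost no compressible structure. With real compressors the value can slightly exceed 1 because of format overhead.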
We investigated the use of six common compression algorithms on bacterial and viral nucleotide data sets with varying degrees of sequence similarity and found that only the distance matrices generated by the Lempel–Ziv–Markov chain algorithm (LZMA) were significantly similar to the benchmark test (Mantel p < 0.01). We created an intuitive command-line interface that performs LZMA compression followed by clustering with any of seven common, user-specified clustering algorithms.
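The LZMA-then-cluster workflow described above can be sketched with Python's standard library. This is a minimal illustration under stated assumptions, not the SNACC implementation: the function names (`ncd`, `distance_matrix`, `threshold_clusters`), the averaging of the two concatenation orders, and the single-linkage threshold grouping are all hypothetical stand-ins, since the clustering algorithms the tool supports are not named here.

```python
import lzma


def compressed_size(data: bytes) -> int:
    """Size in bytes of the LZMA-compressed input (default .xz container)."""
    return len(lzma.compress(data))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized Compression Distance between two byte strings."""
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


def distance_matrix(seqs):
    """Symmetric pairwise NCD matrix with a zero diagonal.

    Assumption: the two concatenation orders are averaged, because real
    compressors are not exactly symmetric in their inputs.
    """
    n = len(seqs)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = 0.5 * (ncd(seqs[i], seqs[j]) + ncd(seqs[j], seqs[i]))
            dist[i][j] = dist[j][i] = d
    return dist


def threshold_clusters(dist, cutoff):
    """Illustrative single-linkage grouping: join items closer than cutoff.

    Returns one label per item; items with equal labels share a cluster.
    This is a stand-in for whichever clustering algorithm a user selects.
    """
    n = len(dist)
    parent = list(range(n))

    def find(i):
        # Follow parent links to the root, compressing the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if dist[i][j] <= cutoff:
                parent[find(i)] = find(j)
    return [find(i) for i in range(n)]
```

A call such as `threshold_clusters(distance_matrix(seqs), cutoff=0.5)` takes a list of nucleotide sequences encoded as `bytes`; the cutoff of 0.5 is an arbitrary illustrative value. For very short inputs the fixed container overhead dominates the compressed size, so distances are more informative on sequences of at least a few thousand bases.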