v3.0.0-rc
Pre-release: Themisto-v3.0.0-rc (25 February 2023)
This is a major update to accompany the release of the preprint https://www.biorxiv.org/content/10.1101/2023.02.24.529942v1.
Data structure improvements
The de Bruijn graph is now indexed using the SBWT library (https://github.com/algbio/SBWT). This makes the k-mer search significantly faster than before. The coloring data structure has also been reworked and now uses different encodings for dense and sparse color sets. Dense sets are encoded as bitmaps and sparse sets as lists of integers. We also now support using Roaring bitmaps for the color sets.
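The dense/sparse split can be illustrated with a minimal Python sketch. This is not Themisto's implementation: the function names and the density cutoff of 0.1 are assumptions made for the example, and a Python integer stands in for the bitmap.

```python
import bisect

def encode_color_set(colors, num_colors, density_cutoff=0.1):
    """Encode a color set as ("dense", bitmap) or ("sparse", sorted list).

    density_cutoff is a hypothetical parameter for this sketch only.
    """
    colors = sorted(set(colors))
    if num_colors > 0 and len(colors) / num_colors >= density_cutoff:
        bitmap = 0
        for c in colors:
            bitmap |= 1 << c  # bit c is set iff color c is in the set
        return ("dense", bitmap)
    return ("sparse", colors)

def contains(encoded, color):
    """Membership test that works on either encoding."""
    kind, payload = encoded
    if kind == "dense":
        return (payload >> color) & 1 == 1
    i = bisect.bisect_left(payload, color)  # binary search in the sorted list
    return i < len(payload) and payload[i] == color

dense = encode_color_set([0, 1, 2, 3, 5], num_colors=8)
sparse = encode_color_set([3, 900], num_colors=1000)
```

A bitmap costs one bit per possible color regardless of the set size, so it pays off only when a large share of the colors is present; a sorted list costs one integer per member, which is cheaper for small sets.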
Index construction
The index construction now uses the GGCAT tool (https://github.com/algbio/GGCAT) to build the colored unitigs as a preprocessing step. This adds a build dependency on the nightly version of the Rust programming language.
Pseudoalignment changes
The pseudoalignment algorithm has been improved. Before, it reported a color if all k-mers of the query that are present in the index have that color. Now, there is a threshold parameter T such that, instead of requiring all of those k-mers to have the color, we require only a fraction T of them. There is also now a flag --include-unknown-kmers to take into account those k-mers that are not in the index. Those k-mers are assumed to have an empty color set. This behavior matches that of Bifrost and Metagraph.
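The thresholded rule can be sketched in Python as follows. This is an illustration under stated assumptions, not the actual implementation: the function name and the input representation (one color set per query k-mer, with None for k-mers missing from the index) are hypothetical, and whether the comparison against T is inclusive is also an assumption.

```python
from collections import Counter

def pseudoalign(kmer_color_sets, threshold=0.7, include_unknown_kmers=False):
    """Return the colors carried by at least a fraction `threshold` of the k-mers.

    kmer_color_sets: one entry per query k-mer, a set of colors,
    or None if the k-mer is not in the index (hypothetical representation).
    """
    if include_unknown_kmers:
        # Unknown k-mers count toward the total, with an empty color set.
        relevant = [cs if cs is not None else frozenset() for cs in kmer_color_sets]
    else:
        # Unknown k-mers are ignored entirely.
        relevant = [cs for cs in kmer_color_sets if cs is not None]
    if not relevant:
        return set()
    counts = Counter()
    for cs in relevant:
        counts.update(cs)
    needed = threshold * len(relevant)
    return {color for color, n in counts.items() if n >= needed}

query = [{0, 1}, {0}, None, None]
without_unknown = pseudoalign(query, threshold=0.5)
with_unknown = pseudoalign(query, threshold=0.5, include_unknown_kmers=True)
```

In this sketch, counting the unknown k-mers raises the number of k-mers a color must cover, so a color seen in only one of the four query k-mers is no longer reported. Setting the threshold to 1 without the flag reproduces the old rule.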
Command-line interface
There are a number of small changes to the command-line interface. Reverse complements are now added to the index by default. The names of some of the parameters have changed, but we also support the old names to avoid breaking existing scripts. We now support three ways of providing the colors for the index: color by input file, color by input sequence rank, and color by a user-provided color file. The default pseudoalignment method is now the thresholded method with threshold 0.7. The old method can be run by setting the threshold to 1.
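One plausible reading of the three coloring modes is sketched below in Python. The function name, the mode names, and the color file format (one integer per sequence, in input order) are assumptions for illustration and do not correspond to actual command-line options.

```python
def assign_colors(files, mode, color_file_lines=None):
    """Return one color per sequence, in input order.

    files: list of input files, each given as a list of sequences.
    mode: "file", "sequence", or "color_file" (hypothetical names).
    """
    n_seqs = sum(len(seqs) for seqs in files)
    if mode == "file":
        # Every sequence in the i-th file gets color i.
        return [i for i, seqs in enumerate(files) for _ in seqs]
    if mode == "sequence":
        # The j-th sequence overall gets color j.
        return list(range(n_seqs))
    if mode == "color_file":
        # The j-th line of the color file gives the color of the j-th sequence.
        colors = [int(line) for line in color_file_lines]
        if len(colors) != n_seqs:
            raise ValueError("color file must have one line per sequence")
        return colors
    raise ValueError(f"unknown mode: {mode}")

files = [["ACGT", "GGCA"], ["TTGA"]]
by_file = assign_colors(files, "file")
by_rank = assign_colors(files, "sequence")
by_color_file = assign_colors(files, "color_file", ["5", "5", "2"])
```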
Other
This release also contains a number of small bug fixes.