Releases: algbio/themisto
Themisto-v3.2.2 (1 May 2024)
This release is a patch that fixes a bug where the output stream was sometimes not flushed to disk when writing in the gzipped and sorted output mode. Details:
- Fixes GitHub issue #36.
- Unrelated change: Update GGCAT. With this update, GGCAT now builds with stable Rust instead of the nightly build.
(The Linux binary originally attached to this release was compiled with maximum k = 255, which made it use 8 times more disk space and run up to 8 times slower in the common use case of k = 31. The binary was recompiled for k = 31 on 10.6.2024.)
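One plausible reading of the factor of 8 is sketched below. It assumes, without confirmation from the release notes, that each k-mer is packed at 2 bits per nucleotide into whole 64-bit words sized for the compile-time maximum k. The function name and constants are illustrative only.

```python
import math

BITS_PER_NUCLEOTIDE = 2  # assumption: A, C, G, T are packed as 2 bits each
WORD_BITS = 64           # assumption: k-mers occupy whole 64-bit words


def words_per_kmer(max_k: int) -> int:
    """64-bit words reserved per k-mer when the binary supports k up to max_k."""
    return math.ceil(max_k * BITS_PER_NUCLEOTIDE / WORD_BITS)


if __name__ == "__main__":
    small = words_per_kmer(31)   # 62 bits fit in one word
    large = words_per_kmer(255)  # 510 bits need eight words
    print(small, large, large // small)
```

Under these assumptions a binary built for maximum k = 255 reserves eight words per k-mer even when the user runs with k = 31, which would be consistent with the reported overhead.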
Themisto-v3.2.1 (19 December 2023)
This release contains two minor bugfixes:
- Fixed a bug that resulted in incorrect FASTA parsing if the file contained an empty line. Now, the program terminates instead with an error message.
- There have been problems with running Themisto on cluster architectures. We think that this might be due to how the KMC sorting subroutine always uses the maximum number of threads supported by the hardware, even if the user specifies a smaller number on the command line. In cluster environments, the job scheduler may limit the number of parallel threads to far fewer than the hardware allows, which seemed to sometimes hang construction on our cluster. We fixed this by limiting the number of threads available to KMC sorting to the number specified on the command line. This solved the issue on our cluster, but it could be the case that this did not fix the underlying issue, and there might be some kind of deadlock lurking in KMC that was only exposed by such a high level of thread contention. We leave closer investigation as future work.
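The FASTA fix in the first item above can be illustrated with a minimal sketch of a parser that stops with an error on an empty line instead of silently misparsing. This is not Themisto's parser; the function name and error messages are hypothetical.

```python
from typing import Iterable, Iterator, List, Optional, Tuple


def parse_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs; raise ValueError on an empty line."""
    header: Optional[str] = None
    chunks: List[str] = []
    for line_number, raw in enumerate(lines, start=1):
        line = raw.rstrip("\r\n")
        if line == "":
            # Fail loudly rather than guess what the empty line means.
            raise ValueError(f"empty line in FASTA input at line {line_number}")
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:]
            chunks = []
        else:
            if header is None:
                raise ValueError(f"sequence data before first header at line {line_number}")
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)
```

Calling `list(parse_fasta(open(path)))` on a well-formed file returns all records; a file with a blank line raises before any later record is produced.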
Themisto-v3.2.0 (21 September 2023)
This new minor version adds a new feature, --sort-hits, and fixes some lingering issues.
- Added a flag --sort-hits to sort the colors in each output line.
- Made -d 20 default in index construction. This does not affect pseudoalignment results, but gives a smaller index with little impact on query time.
- Renamed --sort-output to --sort-output-lines. The old name still works to avoid breaking existing scripts.
- Better error handling in FASTA and FASTQ parsing.
- Fixed a linking issue with KMC binaries that was crashing tests in debug mode.
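The difference between --sort-hits and --sort-output-lines can be sketched as follows. The sketch assumes, for illustration only, that each output line holds an integer query id followed by space-separated integer color ids. Both function names are hypothetical and are not part of Themisto.

```python
from typing import List


def sort_hits(line: str) -> str:
    """Sort the color ids within one line; the leading query id stays first."""
    query_id, *colors = line.split()
    ordered = sorted(int(c) for c in colors)
    return " ".join([query_id] + [str(c) for c in ordered])


def sort_output_lines(lines: List[str]) -> List[str]:
    """Order whole lines by query id; colors within each line are untouched."""
    return sorted(lines, key=lambda line: int(line.split()[0]))


if __name__ == "__main__":
    output = ["2 10 3 7", "0 4", "1"]
    print(sort_output_lines(output))
    print([sort_hits(line) for line in output])
```

The two operations are independent: one reorders lines, the other reorders the contents of each line.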
Themisto-v3.1.3 (19 May 2023)
This version adds support for ARM architectures, which in particular makes the code compatible with Apple silicon. There are also small tweaks to make the build system compatible with Bioconda.
Themisto-v3.1.2 (16 April 2023)
This is a minor patch optimizing memory allocation in GGCAT.
Themisto-v3.1.1 (15 April 2023)
This release patches two concurrency-related issues in GGCAT, which made the construction get stuck sometimes.
Themisto-v3.1.0 (13 April 2023)
New features
- Themisto now prints estimated input and output rates during pseudoalignment to help estimate how long a run will take and how large the output will be.
- Added a new command line option --report-relevant-kmer-count which reports for each read the number of relevant k-mers for the pseudoalignment. A k-mer is relevant if it is found in the index and has at least one color associated to it.
- Added a new command line option --relevant-kmers-fraction to adjust the pseudoalignment algorithm so that it only reports pseudoalignments for reads for which the fraction of relevant k-mers was at least as large as a given threshold.
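The relevant k-mer definition above can be sketched as below. The index is modeled as a plain dictionary from k-mer to color set, and the fraction is assumed to be taken over all k-mers of the read. Both are assumptions made for illustration, and the function names are hypothetical.

```python
from typing import Dict, FrozenSet


def relevant_kmer_count(read: str, k: int, index: Dict[str, FrozenSet[int]]) -> int:
    """Count k-mers of the read that are in the index with a non-empty color set."""
    kmers = (read[i:i + k] for i in range(len(read) - k + 1))
    # index.get() returns None for unknown k-mers; an empty set is also falsy.
    return sum(1 for kmer in kmers if index.get(kmer))


def passes_relevant_fraction(read: str, k: int, index: Dict[str, FrozenSet[int]],
                             min_fraction: float) -> bool:
    """True if the fraction of relevant k-mers reaches min_fraction."""
    total = len(read) - k + 1
    if total <= 0:
        return False  # read shorter than k: no k-mers at all
    return relevant_kmer_count(read, k, index) / total >= min_fraction
```

For example, a read with four k-mers of which two are relevant passes a threshold of 0.5 but not 0.6.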
Performance
- Faster index construction by choosing as key k-mers the last k-mers of GGCAT colored unitigs.
- Added parallelism for processing GGCAT unitigs.
- Some micro-optimizations in pseudoalignment.
- Fixed a bug that blew up the coloring index size by a factor of up to 64 if there was only one distinct color in the dataset.
Maintenance
There have been reports of crashes due to unknown instructions in the precompiled binaries (#24 and #25). We now compile the release binaries with native instructions disabled in SBWT and Roaring, which should fix these issues.
Themisto-v3.0.0 (2 March 2023)
This is a major update to accompany the release of the preprint https://www.biorxiv.org/content/10.1101/2023.02.24.529942v1.
Data structure improvements
The de Bruijn graph is now indexed using the SBWT library (https://github.com/algbio/SBWT). This makes the k-mer search significantly faster than before. The coloring data structure has also been reworked, now using a different encoding for dense and sparse color sets. The dense sets are encoded as bitmaps and the sparse as lists of integers. We also now support using Roaring bitmaps for the color sets.
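The dense/sparse split can be illustrated with a small sketch. The density cutoff of 0.1 and the use of a Python integer as the bitmap are assumptions chosen for the example. The release notes do not state how Themisto decides between the two encodings.

```python
from typing import List, Set, Tuple, Union

Encoded = Tuple[str, Union[int, List[int]]]


def encode_color_set(colors: Set[int], n_colors: int, dense_cutoff: float = 0.1) -> Encoded:
    """Encode as a bitmap when dense, otherwise as a sorted list of integers.

    dense_cutoff is a hypothetical parameter, not a documented Themisto setting.
    """
    if n_colors > 0 and len(colors) / n_colors >= dense_cutoff:
        bitmap = 0
        for color in colors:
            bitmap |= 1 << color  # bit i set means color i is present
        return ("dense", bitmap)
    return ("sparse", sorted(colors))


def decode_color_set(encoded: Encoded) -> Set[int]:
    """Recover the original set of color ids from either encoding."""
    kind, payload = encoded
    if kind == "dense":
        return {i for i in range(payload.bit_length()) if (payload >> i) & 1}
    return set(payload)
```

A bitmap costs one bit per possible color regardless of how many are present, so it pays off only when a large share of the colors is set; a list costs one integer per present color.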
Index construction
The index construction now uses the GGCAT tool (https://github.com/algbio/GGCAT) to build the colored unitigs as a preprocessing step. This adds a build dependency on the nightly version of the Rust programming language.
Pseudoalignment changes
The pseudoalignment algorithm has been improved. Before, it reported a color if all k-mers of the query that are present in the index have that color. Now, there is a threshold parameter T: instead of requiring all of those k-mers to have the color, we require only a fraction T of them. There is now also a flag --include-unknown-kmers to take into account those k-mers that are not in the index. Those k-mers are assumed to have an empty color set. This behavior matches that of Bifrost and Metagraph.
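The thresholded rule can be sketched as follows. This is a simplified model and not the Themisto implementation: the index is a plain dictionary from k-mer to color set, and the function and parameter names are hypothetical.

```python
from collections import Counter
from typing import Dict, FrozenSet, Set


def pseudoalign(read: str, k: int, index: Dict[str, FrozenSet[int]],
                threshold: float = 1.0, include_unknown_kmers: bool = False) -> Set[int]:
    """Report colors carried by at least a fraction `threshold` of the counted k-mers."""
    kmers = [read[i:i + k] for i in range(len(read) - k + 1)]
    color_counts: Counter = Counter()
    known = 0
    for kmer in kmers:
        colors = index.get(kmer)
        if colors is None:
            continue  # unknown k-mer: treated as having an empty color set
        known += 1
        color_counts.update(colors)
    # Unknown k-mers enter the denominator only when explicitly requested.
    denominator = len(kmers) if include_unknown_kmers else known
    if denominator == 0:
        return set()
    return {c for c, n in color_counts.items() if n >= threshold * denominator}
```

With `threshold=1.0` and unknown k-mers ignored, the sketch reduces to intersecting the color sets of the k-mers found in the index, which is the behavior described for the old method.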
Command-line interface
There are a number of small changes to the command-line interface. Reverse complements are now added to the index by default. The names of some of the parameters have changed, but we also support the old names to avoid breaking existing scripts. We now support three ways of providing the colors for the index: color by input file, color by input sequence rank, and color by a user-provided color file. The default pseudoalignment method is now the thresholded method with threshold 1.0, which is equivalent to the old intersection method.
Other
This release also contains a number of small bug fixes.
Themisto-v3.0.0-rc (25 February 2023)
This is a major update to accompany the release of the preprint https://www.biorxiv.org/content/10.1101/2023.02.24.529942v1.
Data structure improvements
The de Bruijn graph is now indexed using the SBWT library (https://github.com/algbio/SBWT). This makes the k-mer search significantly faster than before. The coloring data structure has also been reworked, now using a different encoding for dense and sparse color sets. The dense sets are encoded as bitmaps and the sparse as lists of integers. We also now support using Roaring bitmaps for the color sets.
Index construction
The index construction now uses the GGCAT tool (https://github.com/algbio/GGCAT) to build the colored unitigs as a preprocessing step. This adds a build dependency on the nightly version of the Rust programming language.
Pseudoalignment changes
The pseudoalignment algorithm has been improved. Before, it reported a color if all k-mers of the query that are present in the index have that color. Now, there is a threshold parameter T: instead of requiring all of those k-mers to have the color, we require only a fraction T of them. There is now also a flag --include-unknown-kmers to take into account those k-mers that are not in the index. Those k-mers are assumed to have an empty color set. This behavior matches that of Bifrost and Metagraph.
Command-line interface
There are a number of small changes to the command-line interface. Reverse complements are now added to the index by default. The names of some of the parameters have changed, but we also support the old names to avoid breaking existing scripts. We now support three ways of providing the colors for the index: color by input file, color by input sequence rank, and color by a user-provided color file. The default pseudoalignment method is now the thresholded method with threshold 0.7. The old method can be run by setting the threshold to 1.
Other
This release also contains a number of small bug fixes.
Themisto-v2.1.0 (25 November 2021)
- Performance optimization.
- Flags --silent and --verbose to control the number of log messages.
- Option to output the de Bruijn graph in GFA1 format in extract-unitigs.