You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
The libcubwt is a library for fast (see Benchmarks below) GPU-based Burrows-Wheeler transform construction and inversion on the CUDA platform using prefix doubling + Skew/DC3 approach as described in the following papers:
Vitaly Osipov, Parallel Suffix Array Construction for Shared Memory Architectures, 2012
Leyuan Wang, Sean Baxter, and John D. Owens Fast Parallel Suffix Array on the GPU, 2015
Florian Büren, Daniel Jünger, Robin Kobus, Christian Hundt, Bertil Schmidt Suffix Array Construction on Multi-GPU Systems, 2019
The libcubwt provides simple API to construct Burrows-Wheeler transform from a given string over constant-size alphabet using 20.5n bytes of GPU memory.
The libcubwt works with any CUDA compatible GPU, but I recommend SM 8.9+ Ada Lovelace (CUDA 11.8 and later) due to very large L2 cache.
The libcubwt is sensitive to fast GPU memory and might not be suitable for some workloads. Please benchmark yourself.
Suffix array computation
Starting with version 1.5.0, libcubwt algorithm was changed and no longer supports the computation of suffix arrays or inverse suffix arrays. To compute these arrays, please consider using a version prior to 1.5.0.
License
The libcubwt is released under the Apache License Version 2.0 and is considered suitable for production use. However, no warranty or fitness for a particular purpose is expressed or implied.
Changes
January 24, 2024 (1.6.0)
Inverse Burrows-Wheeler transform.
March 24, 2023 (1.5.0)
Reduced memory usage and improved performance.
February 10, 2023 (1.0.0)
Initial public release of the libcubwt.
Examples of APIs (see libcubwt.cuh for complete APIs list)
/** * Allocates storage on the CUDA device that allows reusing allocated memory with each libcubwt operation. * @param max_length The maximum length of string to support. * @return LIBCUBWT_NO_ERROR if no error occurred, libcubwt error code otherwise. */int64_tlibcubwt_allocate_device_storage(void**device_storage, int64_tmax_length);
/** * Destroys the previously allocated storage on the CUDA device. * @param device_storage The previously allocated storage on the CUDA device. * @return LIBCUBWT_NO_ERROR if no error occurred, libcubwt error code otherwise. */int64_tlibcubwt_free_device_storage(void*device_storage);
/** * Constructs the Burrows-Wheeler Transform (BWT) of a given string. * @param device_storage The previously allocated storage on the CUDA device. * @param T [0..n-1] The input string. * @param L [0..n-1] The output string (can be T). * @param n The length of the input string. * @return The primary index if no error occurred, libcubwt error code otherwise. */int64_tlibcubwt_bwt(void*device_storage, constuint8_t*T, uint8_t*L, int64_tn);
/** * Reconstructs the original string from a given burrows-wheeler transformed string (BWT) with primary index. * @param device_storage The previously allocated storage on the CUDA device. * @param T [0..n-1] The input string. * @param U [0..n-1] The output string (can be T). * @param n The length of the given string. * @param freq [0..255] The input symbol frequency table (can be NULL). * @param i The primary index. * @return LIBCUBWT_NO_ERROR if no error occurred, libcubwt error code otherwise. */int64_tlibcubwt_unbwt(void*device_storage, constuint8_t*T, uint8_t*U, int64_tn, constint32_t*freq, int32_ti);
Benchmarks
Methodology
Input files were capped at 352MB due to GPU memory limit.
All timings exclude initialization and memory allocations. However, the time for data transfer in and out of GPU is included.
The timings are minimum of five runs measuring multi-threading performance of Burrows-Wheeler transform construction.
Specification (x86-64 architecture)
OS: Microsoft Windows 10 Pro 64-Bit
MB: MSI MPG Z390M GAMING EDGE AC, PCIe 3.0 x16
CPU: Intel Core i7-9700K Processor (12M Cache, 5GHz all cores)