Documentation questions for Rust bindings #26

RagnarGrootKoerkamp · 2024-10-02T20:58:54Z

Hey,

I'm in the process of creating a wrapper crate for Rust bindings (https://github.com/RagnarGrootKoerkamp/libsais-rs/blob/master/src/lib.rs), so that we can easily use libsais for bioinformatics applications. First tests seem to give a very nice 2x speedup over libdivsufsort :)

Some questions came up while looking at all the different versions of the libsais_* functions.

Signedness of input text: The supported alphabets seems to be u8, u16, i32 and i64 (using rust terminology). Are there specific reasons the large integer alphabets are signed? For bioinformatics, we typically want to not waste any space/bits, and so often look at 'k-mers' of 16 or 32 DNA bases, that exactly fill a u32 or u64. But it seems the library crashes when I try to use arbitrary (negative) i32/i64 values.

Said differently: could you also support u32 and u64 alphabets easily?
Is there a difference between the libsais_int and libsais16_int functions? (And similarly for libsais64_long and libsais16x64_long)
The _int/_long versions require the alphabet size k to be passed in. Is this the number of distinct characters, or a (strict?) upper bound on the maximum value? If any non-negative i32 value is allowed, what should I use for k?
Generally speaking, how is the alphabet size used? Is it bad if I always pass in i32::MAX or i64::MAX? Or is it bad if there are values (strictly?) larger than the alphabet size?

Edit: after some more experimenting, it seems that O(alphabet_size) memory is allocated, which unfortunately causes problems for (very sparse) i64 alphabets. Is there an easy fix in the library? Otherwise, probably the easiest is for us to first re-map our values to 0..k, although that would require sorting the kmers which is slow in itself.
I suppose another minor thing is that only some functions come in a variant that takes the reusable ctx context parameter, but it's not clear to me how much impact this would have anyway, especially when for my case I expect 1 invocation on many GB of data, rather than many small invocations.
Edit: I also seem to have some issue that the _omp function are not actually using multiple threads (for ~100MB input). Not sure what's going on there.

The text was updated successfully, but these errors were encountered:

IlyaGrebnov · 2024-10-03T01:01:04Z

inline

Hey,

I'm in the process of creating a wrapper crate for Rust bindings (https://github.com/RagnarGrootKoerkamp/libsais-rs/blob/master/src/lib.rs), so that we can easily use libsais for bioinformatics applications. First tests seem to give a very nice 2x speedup over libdivsufsort :)

Some questions came up while looking at all the different versions of the libsais_* functions.
1. Signedness of input text: The supported alphabets seems to be `u8`, `u16`, `i32` and `i64` (using rust terminology). Are there specific reasons the large integer alphabets are signed? For bioinformatics, we typically want to not waste any space/bits, and so often look at 'k-mers' of 16 or 32 DNA bases, that exactly fill a `u32` or `u64`. But it seems the library crashes when I try to use arbitrary (negative) `i32`/`i64` values.
   Said differently: could you also support `u32` and `u64` alphabets easily?

Unfortunately, this isn't straightforward because libsais utilizes the sign bit to mark specific types of suffixes. This is precisely why the i32/i64 version requires mutable input strings.

2. Is there a difference between the `libsais_int` and `libsais16_int` functions? (And similarly for `libsais64_long` and `libsais16x64_long`)

libsais and libsais16 are independent versions, though they share some identical functions as implementation. This includes the mentioned libsais_int and libsais16_int functions.

3. The `_int`/`_long` versions require the alphabet size `k` to be passed in. Is this the number of _distinct_ characters, or a (strict?) upper bound on the maximum value? If any non-negative `i32` value is allowed, what should I use for `k`?

k represents the upper bound for the input alphabet, ideally set as the maximum value plus one. For u8 inputs, this is effectively 256. If your values could be any non-negative i32, I recommend compacting the alphabet by renumbering the values.

4. Generally speaking, how is the alphabet size used? Is it bad if I always pass in `i32::MAX` or `i64::MAX`? Or is it bad if there are values (strictly?) larger than the alphabet size?
   Edit: after some more experimenting, it seems that `O(alphabet_size)` memory is allocated, which unfortunately causes problems for (very sparse) `i64` alphabets. Is there an easy fix in the library? Otherwise, probably the easiest is for us to first re-map our values to `0..k`, although that would require sorting the kmers which is slow in itself.

libsais requires space for bucket counting and, to minimize allocations, it uses the free space in the 'SA' array. Since you are using the i32/i64 version of the library directly, it is recommended to allocate additional space in the SA array and pass the extra number of elements as the 'fs' (free space) parameter. A recommended value for 'fs' is 6 * k.

5. I suppose another minor thing is that only some functions come in a variant that takes the reusable `ctx` context parameter, but it's not clear to me how much impact this would have anyway, especially when for my case I expect 1 invocation on many GB of data, rather than many small invocations.

'ctx' is only relevant if you need to perform very frequent computations on smaller inputs (<64KB). In your case, you can ignore these variants.

6. Edit: I also seem to have some issue that the `_omp` function are not actually using multiple threads (for `~100MB` input). Not sure what's going on there.

This could be caused by a large value of k, which forces libsais to fall back to sequential algorithms, or it might be due to the Rust bindings. Additionally, libsais is more memory-bound than compute-bound. Therefore, adding more threads beyond twice the number of memory channels is unlikely to improve performance. What can boost performance even for single-threaded algorithm is the use of large pages, so I recommend enabling them.

IlyaGrebnov · 2024-10-03T01:07:46Z

I'm writing a separate comment because it wasn't clear what you meant by kmers as inputs. If you are trying to combine multiple sequential ACGT values into a single i32 value, that could hurt performance. Using ACGT as i8 inputs would be more efficient.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Documentation questions for Rust bindings #26

Documentation questions for Rust bindings #26

RagnarGrootKoerkamp commented Oct 2, 2024 •

edited

Loading

IlyaGrebnov commented Oct 3, 2024

IlyaGrebnov commented Oct 3, 2024

Documentation questions for Rust bindings #26

Documentation questions for Rust bindings #26

Comments

RagnarGrootKoerkamp commented Oct 2, 2024 • edited Loading

IlyaGrebnov commented Oct 3, 2024

IlyaGrebnov commented Oct 3, 2024

RagnarGrootKoerkamp commented Oct 2, 2024 •

edited

Loading