-
-
Notifications
You must be signed in to change notification settings - Fork 29
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Vectorized xoshiro256++ #3
Comments
Impressive performance taking advantage of how "wide" modern CPUs are — but whether this is useful in practice is another question. For ciphers it does sometimes make sense to focus on "byte fill" performance, but is this useful for general-purpose random number generation, or some specific application? |
I think in practice it replaces dSFMT, which used to be faster. All applications requiring several random numbers at a time benefit. Personally, I think |
I imagine it's going to depend on the actual application significantly. You're using micro-benchmarks here? I actually think for most applications (even ones heavily using RNGs) the speed of the RNG is not very relevant. |
I was referring to the |
Wow, really nice! Note that (as I mentioned in the Julia discussion) the dSFMT returns only 52 significant digits, thus restricting the values in [0..1) to half of the possible values (IEEE has 53 digits precision as one bit is implied). I also think that the Julia guys found that an 8-fold unroll is even faster. But of course you're using a lot of space for copies (i.e., you are not getting a longer period, just faster generation). And yes, in 90% applications the speed of the PRNG is irrelevant. I'm not particularly akin to block generation, but this is what is done in high-performance scientific computing. The Intel Math Kernel library, for example, provides only block generation. If you try to generate one value at a time, the performance is abysmal, but it is unbeatable (high vectorization) for large amounts. |
There is some discussion on how to vectorize xoshiro256++ at JuliaLang/julia#27614. The method relies on interleaving 4 xoshiro256++ generators. I implemented it and the results are impressive (see
gen_bytes_fill
):The implementation is 3.3 time faster than the non-vectorized xoshiro256++ generator and more than 2.2 times faster than splitmix64 or chacha8. It is also faster than dSFMT. However, the size of the state is blown up to 128 bytes, which is almost as large as chacha's state (136 bytes).
The text was updated successfully, but these errors were encountered: