AVX for 1KB input #21
Comments
Basically: you can use SIMD for a hash function in two ways. Traditionally, you use it to parallelize the operations within a block. BLAKE3 is different: because it's a tree hash, you can instead parallelize across leaves of the tree. The latter is much more "embarrassingly parallel" than the former: it scales directly with the number of SIMD registers, and it's simpler to implement. Furthermore, for small messages, you start having to worry about the overhead of warming up "cold" SIMD registers, and (in Go specifically) the overhead of calling into assembly instead of a potentially-inlineable function. So you get much more bang for your buck by parallelizing across leaves. It is certainly possible to parallelize within a block, and it would definitely speed things up for certain workloads; but it's unclear how much faster it would be, and that makes it hard for me to justify spending too much time and energy on it. If you're curious, here's what the asm for intra-block SIMD looks like in (slightly modified) BLAKE2b: https://github.com/lukechampine/us/blob/master/merkle/blake2b/gen.go |
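To make the "parallelize across leaves" idea concrete, here is a minimal pure-Go sketch that models one 512-bit register as 16 lanes of 32 bits (512 / 32 = 16), with each lane holding the same state word from a different chunk. The names `lanes`, `gLanes` and `gScalar` are made up for this illustration and are not part of this library; the round logic is the BLAKE3 quarter-round G as I understand it from the spec. In real asm each loop over the lanes would be a single vector instruction.

```go
package main

import "math/bits"

// lanes models one 512-bit register: 16 lanes of 32 bits.
// Assumption of this sketch: lane i holds a state word of chunk i.
type lanes [16]uint32

// add is a lane-wise wrapping 32-bit addition.
func add(x, y lanes) lanes {
	for i := range x {
		x[i] += y[i]
	}
	return x
}

// xorRotr xors two vectors lane-wise, then rotates each lane right by n bits.
func xorRotr(x, y lanes, n int) lanes {
	for i := range x {
		x[i] = bits.RotateLeft32(x[i]^y[i], -n)
	}
	return x
}

// gLanes applies the quarter-round G to 16 independent states in lockstep.
// No lane ever reads another lane, which is why this form scales so directly.
func gLanes(a, b, c, d, mx, my lanes) (lanes, lanes, lanes, lanes) {
	a = add(add(a, b), mx)
	d = xorRotr(d, a, 16)
	c = add(c, d)
	b = xorRotr(b, c, 12)
	a = add(add(a, b), my)
	d = xorRotr(d, a, 8)
	c = add(c, d)
	b = xorRotr(b, c, 7)
	return a, b, c, d
}

// gScalar is the same quarter-round for a single state, kept as a reference.
func gScalar(a, b, c, d, mx, my uint32) (uint32, uint32, uint32, uint32) {
	a = a + b + mx
	d = bits.RotateLeft32(d^a, -16)
	c = c + d
	b = bits.RotateLeft32(b^c, -12)
	a = a + b + my
	d = bits.RotateLeft32(d^a, -8)
	c = c + d
	b = bits.RotateLeft32(b^c, -7)
	return a, b, c, d
}
```

Running `gLanes` on 16 different inputs should give, lane by lane, the same result as calling `gScalar` 16 times. Parallelizing within a block, by contrast, needs the words of one state shuffled between lanes on every round, which is where most of the extra complexity comes from.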
Thanks for the valuable input. The 1KB case is so critical to me (I need to compute millions of them all the time) that I might decide to invest my time in it, even though I have zero experience in SIMD programming (but it might be yet another cool thing to learn). Could you clarify a couple of things? My scenario is that I have thousands of 1KB chunks spread across the memory of my program (it's a merkle tree). I need to compute separate blake3 checksums for all of them. As you mentioned, I see two possibilities: Both approaches are fine with me, but obviously I prefer whichever one lets me compute more hashes per second. As I understand it, by using AVX-512 instructions I could process up to 16 chunks in a single shot? |
I encourage you to try it! Writing asm is much easier in Go nowadays thanks to Avo. I used Felix Cloutier's instruction reference to help understand what each of the SIMD instructions does. I think your best bet is to modify the |
Quick update... So far I have implemented all the operations required by blake3, to verify that I picked the right AVX-512 instructions: addition, xor, and right rotation by 7, 8, 12, and 16 bits. I also ran some benchmarks comparing those operations to pure-Go implementations. Here are the results:
Interesting facts:
I think AVX-512 will show its potential once all the operations are combined in the full algorithm, rather than run separately one by one. I guess the cost of loading data from memory into the registers and back is high. You can check the code here: https://github.com/outofforest/quantum/tree/51ce795a908a8addacf7d5ee529514097655f092/checksum |
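For reference, the operations listed in that update (addition, xor, and right rotation by 16, 12, 8 and 7 bits) are the ones that make up BLAKE3's quarter-round G. Below is a minimal scalar sketch in pure Go, assuming the standard G definition from the BLAKE3 spec; `rotr`, `rotrShiftOr` and `g` are names chosen for this example, not functions from either repository. `rotrShiftOr` shows the two-shifts-and-or emulation that is needed whenever no rotate instruction is available.

```go
package main

import "math/bits"

// rotr rotates x right by n bits. math/bits only has RotateLeft32,
// so a negative count is used to rotate right.
func rotr(x uint32, n int) uint32 {
	return bits.RotateLeft32(x, -n)
}

// rotrShiftOr is the emulation used when there is no rotate instruction:
// two shifts and an or. Valid for 0 < n < 32.
func rotrShiftOr(x uint32, n uint) uint32 {
	return x>>n | x<<(32-n)
}

// g mixes two message words mx and my into four state words.
func g(a, b, c, d, mx, my uint32) (uint32, uint32, uint32, uint32) {
	a = a + b + mx
	d = rotr(d^a, 16)
	c = c + d
	b = rotr(b^c, 12)
	a = a + b + my
	d = rotr(d^a, 8)
	c = c + d
	b = rotr(b^c, 7)
	return a, b, c, d
}
```

Each call to `g` performs six additions, four xors and four rotations, so a rotation that costs three instructions instead of one is a noticeable share of the work.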
Indeed, a massive speedup is visible when 10 steps are combined inside asm:
But still it seems that the right rotation is a heavy operation. |
Ohh... (not so) surprisingly, ChatGPT was wrong in saying that there is no direct right rotation instruction for ZMM registers.
|
Yeah, you're gonna encounter a lot of overhead calling into asm functions in a tight loop. It's always better to write the entire loop in asm if you can. |
I implemented a blake3 version optimized for 1KB inputs and the results are AMAZING (!!!).
|
Nice! If you're able to share the code, I'd be curious to take a look :) |
This sentence caught my interest, as I need to work with inputs of exactly 1KB :-):
Is there any specific reason?