use OR(<<, >>) for all rotations #5
This is entirely dependent on architecture. I imagine you measured this on Skylake or Skylake-X. Also cycle counts are more useful than percentages to understand the difference.
With Skylake [...] In parallel modes, where latency does not matter much but overall throughput does, doing 4 independent rotations with [...]
I was measuring this on the Skylake-SP server that AWS gives me, and on my Kaby Lake laptop. What's the best way to treat these microarchitecture-specific differences? Should we generally optimize for the most modern thing? Is it worth shipping both and putting it behind a compiler flag?
Those are all questions without definitive answers. If you don't mind the maintenance, having a version for each major microarchitecture would be the best solution. But generally the solution that works best overall (for some definition of "overall") is preferable from a maintenance standpoint.
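To make the "ship both behind a compiler flag" option concrete, here is a minimal scalar sketch in C. It is not the code from this PR, which works on AVX2 vectors. The macro name `ROTR16_BYTEWISE` and both function names are hypothetical, and the byte-wise variant is only a stand-in for whatever alternative instruction sequence a given microarchitecture prefers.

```c
#include <stdint.h>

/* Rotate right by 16 using two shifts and an OR. */
static inline uint64_t rotr64_16_shift_or(uint64_t x) {
    return (x >> 16) | (x << 48);
}

/* Same rotation expressed as a byte permutation:
 * output byte i is input byte (i + 2) mod 8. */
static inline uint64_t rotr64_16_bytewise(uint64_t x) {
    uint64_t r = 0;
    for (int i = 0; i < 8; i++) {
        uint64_t byte = (x >> (8 * ((i + 2) % 8))) & 0xff;
        r |= byte << (8 * i);
    }
    return r;
}

/* Hypothetical build switch: pass -DROTR16_BYTEWISE to select the
 * byte-wise form; the default is the shift/OR form. */
#ifdef ROTR16_BYTEWISE
#define ROTR64_16(x) rotr64_16_bytewise(x)
#else
#define ROTR64_16(x) rotr64_16_shift_or(x)
#endif
```

Both variants compute the same value, so the switch only affects which instruction sequence the compiler is nudged toward. The maintenance cost is that every variant has to be benchmarked on each target.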
So this performance optimization looks more like an LLVM "bug" than an actual optimization. In fact, I believe I had already seen this behavior somewhere before, and then forgot about it. The real root cause here is that for rotations by 16, LLVM prefers to use the pair [...] And the funny thing is that with your patch, the [...] So the solution is simple: where Clang/LLVM is concerned, if using AVX2 also add [...]
The [...] And the reason it's "fixed" with [...]
So the pair [...] Anyway, long story short, this is definitely a problem with LLVM.
Discussion/experiment PR: I was playing around with different rotations, and I found that these simplified ones seem to perform a lot better under Clang/Rustc. Most of the functions improve by about 6%, but BLAKE2b improves by 14%. Have you run into this before? Is it a known Clang vs GCC thing?
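For readers unfamiliar with the idiom in the title, here is a minimal scalar sketch in C of a rotation written as an OR of two shifts. The PR itself applies the same idea to AVX2 vectors; the function name `rotr64` is only illustrative.

```c
#include <stdint.h>

/* Rotate a 64-bit word right by n bits using two shifts and an OR.
 * Valid for 0 < n < 64; n == 0 would shift left by 64, which is
 * undefined behavior in C. */
static inline uint64_t rotr64(uint64_t x, unsigned n) {
    return (x >> n) | (x << (64 - n));
}
```

BLAKE2b rotates by 32, 24, 16 and 63 bits, so every rotation in its round function fits within the valid range of this helper.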