NEON backend for aarch64 #457
base: main
The I'm testing another point where we can use the
2d343e9 does not have an influence on performance, but the code is quite a bit cleaner, so let's leave it in.
.github/workflows/rust.yml
Outdated
  test-simd:
    name: Test simd backend (nightly, aarch64)
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - uses: dtolnay/rust-toolchain@nightly
        with:
          toolchain: aarch64-unknown-linux-gnu
      - run: cargo test --no-default-features --target aarch64-unknown-linux-gnu --features "std simd_backend"
How does this work without `cross`?
Rust has a built-in cross-compiler. I've never really used `cross` to cross-compile myself; I always had to roll my own due to circumstances.
The underlying Ubuntu machine probably needs a gcc toolchain installed for this to actually work, but I hoped that the CI would give me some feedback already.
A cross-compiler works for builds, but how does it run the tests on a non-native architecture? `cross` uses QEMU.
Oh okay, I did not spot that. The new version only runs `cargo build`, not `test`, based on the new `build-simd` CI snippet.
I'd suggest using `cross` for this, as it should be fine for testing NEON in CI.
Here are some examples (and it's probably fine to reuse the @RustCrypto `cross-install` workflow too):
https://github.com/RustCrypto/block-ciphers/blob/master/.github/workflows/aes.yml#L201-L249
Seems like that's done in release/4.0 now, so I'll just drop my own commit.
I'd still like to point out that when I submit the armv7hl NEON work, it will work on nightly only for a while, so you might want to add armv7-nightly to the test matrix.
@rubdos we're working off the
Don't forget to make sure it's still constant-time! I don't actually know what that entails; it is not something I've looked into before. But a good amount of effort was put into the initial version of this crate using constant-time operations, and new backends should adhere to that too.
@jrose-signal Indeed, that's pretty important. As far as @Tarinn and I know, this code should be constant time. It is a direct port of the AVX2 work, and there should be no time-dependent calls in there. The most sketchy call that I could see is the I'm not sure how this is generally done. Should we ask some independent reviewer to have a look at the code?
fyi - we're continuing all development via in Could this be rebased onto main, which is basically from release/4.0? Thanks! Also, would we have any benchmarks comparing u64, neon and fiat? For constant time, see e.g. subtle's `CtOption` etc.
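For readers wondering what "constant time" entails in practice: the usual pattern is to replace secret-dependent branches and memory accesses with mask arithmetic. The snippet below is only a minimal scalar sketch of that idea. `ct_select_u32` is a hypothetical helper written for illustration; it is not code from this PR, and it is not the API of the `subtle` crate, which additionally takes best-effort measures to stop the compiler from turning the mask back into a branch.

```rust
/// Hypothetical helper (not from this PR or from `subtle`):
/// returns `a` when `choice == 0` and `b` when `choice == 1`,
/// without branching on `choice`.
fn ct_select_u32(a: u32, b: u32, choice: u8) -> u32 {
    debug_assert!(choice <= 1);
    // choice = 0 -> mask = 0x0000_0000, choice = 1 -> mask = 0xFFFF_FFFF
    let mask = (choice as u32).wrapping_neg();
    // With mask = 0 this is `a`; with mask = all ones it is `a ^ a ^ b = b`.
    a ^ (mask & (a ^ b))
}
```

A review for constant-time behaviour would then look for places where a secret value decides a branch, an array index, or an instruction with data-dependent timing, rather than a mask like the one above.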
Of course - Just to make sure: is that going to remain? I did the rebase four days ago on release/4.0.
I'm working on this across a bunch of devices (including armv7). I'd like these results to appear in a small publication, but I'll certainly dump a summary in the README when I have them. I should have something by the end of the week.
Yeah this is going to stay in
Awesome! 🥳 would love to see a link to write-up in This-Week-in-Rust TWiR if you can share there too:
After publication, I'll make sure to make a huge amount of noise about it. That should be pretty short notice, I have some very strict deadlines about that now. Just to clarify: the deadline is there because this work is complete; there was no pressure on completion of this work that would have affected the quality.
src/backend/vector/neon/field.rs
Outdated
impl FieldElement2625x4 {

    pub fn split(&self) -> [FieldElement51; 4] {
        let mut out = [FieldElement51::zero(); 4];
- let mut out = [FieldElement51::zero(); 4];
+ let mut out = [FieldElement51::ZERO; 4];
This was changed to ZERO
src/backend/vector/neon/edwards.rs
Outdated

macro_rules! print_var {
    ($x:ident) => {
        println!("{} = {:?}", stringify!($x), $x.to_bytes());
- println!("{} = {:?}", stringify!($x), $x.to_bytes());
+ println!("{} = {:?}", stringify!($x), $x.as_bytes());
These were renamed to as_bytes()
src/backend/vector/neon/edwards.rs
Outdated
use super::*;

fn serial_add(P: edwards::EdwardsPoint, Q: edwards::EdwardsPoint) -> edwards::EdwardsPoint {
    use backend::serial::u64::field::FieldElement51;
- use backend::serial::u64::field::FieldElement51;
+ use crate::backend::serial::u64::field::FieldElement51;
Instead of flooding this PR with suggested changes, here are some patches:
src/backend/vector/neon/field.rs
src/backend/vector/neon/edwards.rs
clippy is still whining, but I'll send a patch later. We are using the 2021 edition.
I've done the rebase together with your patches, and squashed the NEON work into the first commit. I had not tested the rebase on 4.0, sorry about that. I'll do that now.
Working on the
Started a benchmark repo here. I ran some benchmarks earlier (will re-run now) on a Mac M1 Max (aarch64), comparing backends. FWIW, I'm going to reorganise it to cover a wider range of benchmarks across various implementations, not just dalek's.
I'll see whether I'll have all the data in that format. I currently only retain the raw outputs from Criterion for further automatic processing for our paper.
This took a bit longer than anticipated. We got a publication out, so academia-wise, we're all happy now. In between writing that paper and getting it out somewhere, there was quite some time, but I'm trying to sum up the current remaining issues with this branch in this post.
Results are especially spectacular on ARMv7/32-bit ARM CPUs (~50% speedups). Results are still great on ARMv8/aarch64 (~20%) until ARMv8.2-ish: when run in A64 mode, we get slow-downs relative to base. This is because ARM decided to nerf the TBL/TBX instructions relative to previous versions. The throughput remained the same, but the total latency of the instruction is now a function of the input size. (shamelessly stolen from https://www.stonybrook.edu/commcms/ookami/support/_docs/A64FX_Microarchitecture_Manual_en_1.3.pdf)
We'll require a bit of in-depth looking into the assembly to figure out a way around this. We sent some executables to @direc85 over in Finland to get the benchmark run on an Xperia 10 III, which is how we noted the issue. That's not really a tight debuggable loop. I should soon™ have an Xperia 10 IV to close the loop a bit.
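As background on what TBL/TBX compute, here is a small scalar model of the single-register forms. This is a sketch for illustration only: `tbl1` and `tbx1` are hypothetical names, the loops are not how the NEON backend is written, and the model says nothing about the latency behaviour discussed above, which is a property of the microarchitecture. The model also branches on the index, so it is not constant-time itself.

```rust
/// Scalar model of a single-register TBL (hypothetical helper):
/// each output byte is `table[idx[i]]`, or 0 when the index is
/// out of range (>= 16).
fn tbl1(table: [u8; 16], idx: [u8; 16]) -> [u8; 16] {
    let mut out = [0u8; 16];
    for i in 0..16 {
        let j = idx[i] as usize;
        if j < 16 {
            out[i] = table[j];
        }
    }
    out
}

/// Scalar model of a single-register TBX (hypothetical helper):
/// like TBL, but an out-of-range index leaves the corresponding
/// byte of `dst` unchanged instead of zeroing it.
fn tbx1(dst: [u8; 16], table: [u8; 16], idx: [u8; 16]) -> [u8; 16] {
    let mut out = dst;
    for i in 0..16 {
        let j = idx[i] as usize;
        if j < 16 {
            out[i] = table[j];
        }
    }
    out
}
```

The difference between the two matters for blends: TBX can merge selected bytes from a table into an existing register in one instruction, which is why a table-based blend is attractive when the instruction is fast.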
Co-authored-by: pinkforest <[email protected]> Co-authored-by: Robrecht Blancquaert <[email protected]>
…5x4::shuffle Co-authored-by: Robrecht Blacquaert <[email protected]>
(that's a dysfunctional rebase, we still have to reintegrate in the new backend-selection)
Awesome - For CI there is ARM already for cross - .github/workflows/cross.yml#L19 We also have But I think we could split backend tests into a separate file - qemu should support neon at this level - backends.yml for the sake of clarity and discoverability. I'm re-working the build script so I could include a placeholder separately there so it doesn't cause conflict. Also nightly CI is currently failing - I've fixed it here:
Today I idly tested this on my M1 Mac and the current state of the branch is a regression from the u64 backend. Changing the implementation of
Yep, it's quite devastating indeed... If we've got news on that front, we'll post our findings as soon as we have them. Thanks for testing it out! ❤️
@Tarinn has made some progress on the aarch64 problem by now, but we're not there yet. We thought we would get some speedup by working around llvm/llvm-project/issues/58323 (some vget intrinsics don't get correctly lowered to DUP/MOV instructions), but manually throwing in the assembly didn't give a measurable improvement on our Odroid N2+.
The manual injection of assembly for `vget_high_u32`/`vget_low_u32`:
diff --git a/curve25519-dalek/src/backend/vector/neon/field.rs b/curve25519-dalek/src/backend/vector/neon/field.rs
index 2f8d42c..cccde8c 100644
--- a/curve25519-dalek/src/backend/vector/neon/field.rs
+++ b/curve25519-dalek/src/backend/vector/neon/field.rs
@@ -23,20 +23,57 @@
use core::arch::aarch64::{self, vuzp1_u32};
use core::ops::{Add, Mul, Neg};
-use super::packed_simd::{u32x2, u32x4, i32x4, u64x2, u64x4};
+use super::packed_simd::{i32x4, u32x2, u32x4, u64x2, u64x4};
use crate::backend::serial::u64::field::FieldElement51;
use crate::backend::vector::neon::constants::{
P_TIMES_16_HI, P_TIMES_16_LO, P_TIMES_2_HI, P_TIMES_2_LO,
};
+#[cfg(all(target_arch = "aarch64"))]
+#[inline(always)]
+fn vget_high_u32(v: core::arch::aarch64::uint32x4_t) -> core::arch::aarch64::uint32x2_t {
+ use core::arch::asm;
+ let o;
+ unsafe {
+ asm! (
+ "DUP {o:d}, {v}.D[1]",
+ v = in(vreg) v,
+ o = out(vreg) o,
+ )
+ }
+ o
+}
+
+#[cfg(all(target_arch = "aarch64"))]
+#[inline(always)]
+fn vget_low_u32(v: core::arch::aarch64::uint32x4_t) -> core::arch::aarch64::uint32x2_t {
+ use core::arch::asm;
+ let o;
+ unsafe {
+ asm! (
+ "DUP {o:d}, {v}.D[0]",
+ v = in(vreg) v,
+ o = out(vreg) o,
+ )
+ }
+ o
+}
+#[cfg(not(target_arch = "aarch64"))]
+use core::arch::aarch64::vget_high_u32;
+#[cfg(not(target_arch = "aarch64"))]
+use core::arch::aarch64::vget_low_u32;
+
macro_rules! shuffle {
($vec0:expr, $vec1:expr, $index:expr) => {
unsafe {
core::mem::transmute::<[u32; 4], u32x4>(
*core::simd::simd_swizzle!(
- core::simd::Simd::from_array(core::mem::transmute::<u32x4, [u32; 4]>($vec0)),
- core::simd::Simd::from_array(core::mem::transmute::<u32x4, [u32; 4]>($vec1)),
- $index).as_array())
+ core::simd::Simd::from_array(core::mem::transmute::<u32x4, [u32; 4]>($vec0)),
+ core::simd::Simd::from_array(core::mem::transmute::<u32x4, [u32; 4]>($vec1)),
+ $index
+ )
+ .as_array(),
+ )
}
};
}
@@ -53,8 +90,6 @@ fn unpack_pair(src: (u32x4, u32x4)) -> ((u32x2, u32x2), (u32x2, u32x2)) {
let b0: u32x2;
let b1: u32x2;
unsafe {
- use core::arch::aarch64::vget_high_u32;
- use core::arch::aarch64::vget_low_u32;
a0 = vget_low_u32(src.0.into()).into();
a1 = vget_low_u32(src.1.into()).into();
b0 = vget_high_u32(src.0.into()).into();
@@ -72,7 +107,6 @@ fn unpack_pair(src: (u32x4, u32x4)) -> ((u32x2, u32x2), (u32x2, u32x2)) {
fn repack_pair(x: (u32x4, u32x4), y: (u32x4, u32x4)) -> (u32x4, u32x4) {
unsafe {
use core::arch::aarch64::vcombine_u32;
- use core::arch::aarch64::vget_low_u32;
use core::arch::aarch64::vgetq_lane_u32;
use core::arch::aarch64::vset_lane_u32;
@@ -223,7 +257,6 @@ impl FieldElement2625x4 {
self.shuffle(Shuffle::BACD)
}
-
// Can probably be sped up using multiple vset/vget instead of table
#[inline]
pub fn blend(&self, other: FieldElement2625x4, control: Lanes) -> FieldElement2625x4 {
@@ -318,8 +351,6 @@ impl FieldElement2625x4 {
let rotated_carryout = |v: (u32x4, u32x4)| -> (u32x4, u32x4) {
unsafe {
use core::arch::aarch64::vcombine_u32;
- use core::arch::aarch64::vget_high_u32;
- use core::arch::aarch64::vget_low_u32;
use core::arch::aarch64::vqshlq_u32;
let c: (u32x4, u32x4) = (
@@ -327,16 +358,8 @@ impl FieldElement2625x4 {
vqshlq_u32(v.1.into(), shifts.1.into()).into(),
);
(
- vcombine_u32(
- vget_high_u32(c.0.into()),
- vget_low_u32(c.0.into()),
- )
- .into(),
- vcombine_u32(
- vget_high_u32(c.1.into()),
- vget_low_u32(c.1.into()),
- )
- .into(),
+ vcombine_u32(vget_high_u32(c.0.into()), vget_low_u32(c.0.into())).into(),
+ vcombine_u32(vget_high_u32(c.1.into()), vget_low_u32(c.1.into())).into(),
)
}
};
@@ -344,19 +367,9 @@ impl FieldElement2625x4 {
let combine = |v_lo: (u32x4, u32x4), v_hi: (u32x4, u32x4)| -> (u32x4, u32x4) {
unsafe {
use core::arch::aarch64::vcombine_u32;
- use core::arch::aarch64::vget_high_u32;
- use core::arch::aarch64::vget_low_u32;
(
- vcombine_u32(
- vget_low_u32(v_lo.0.into()),
- vget_high_u32(v_hi.0.into()),
- )
- .into(),
- vcombine_u32(
- vget_low_u32(v_lo.1.into()),
- vget_high_u32(v_hi.1.into()),
- )
- .into(),
+ vcombine_u32(vget_low_u32(v_lo.0.into()), vget_high_u32(v_hi.0.into())).into(),
+ vcombine_u32(vget_low_u32(v_lo.1.into()), vget_high_u32(v_hi.1.into())).into(),
)
}
};
@@ -386,7 +399,6 @@ impl FieldElement2625x4 {
#[rustfmt::skip] // Retain formatting of return tuple
let c9_19: (u32x4, u32x4) = unsafe {
use core::arch::aarch64::vcombine_u32;
- use core::arch::aarch64::vget_low_u32;
use core::arch::aarch64::vmulq_n_u32;
let c9_19_spread: (u32x4, u32x4) = (
@@ -475,8 +487,6 @@ impl FieldElement2625x4 {
fn m_lo(x: (u32x2, u32x2), y: (u32x2, u32x2)) -> (u32x2, u32x2) {
use core::arch::aarch64::vmull_u32;
use core::arch::aarch64::vuzp1_u32;
- use core::arch::aarch64::vget_low_u32;
- use core::arch::aarch64::vget_high_u32;
unsafe {
let x: (u32x4, u32x4) = (
vmull_u32(x.0.into(), y.0.into()).into(),
@@ -530,8 +540,6 @@ impl FieldElement2625x4 {
let negate_D = |x_01: u64x4, p_01: u64x4| -> (u64x2, u64x2) {
unsafe {
- use core::arch::aarch64::vget_low_u32;
- use core::arch::aarch64::vget_high_u32;
use core::arch::aarch64::vcombine_u32;
let x = x_01.0;
@@ -640,8 +648,6 @@ impl<'a, 'b> Mul<&'b FieldElement2625x4> for &'a FieldElement2625x4 {
fn m_lo(x: (u32x2, u32x2), y: (u32x2, u32x2)) -> (u32x2, u32x2) {
use core::arch::aarch64::vmull_u32;
use core::arch::aarch64::vuzp1_u32;
- use core::arch::aarch64::vget_low_u32;
- use core::arch::aarch64::vget_high_u32;
unsafe {
let x: (u32x4, u32x4) = (
vmull_u32(x.0.into(), y.0.into()).into(),
@@ -836,5 +842,3 @@ mod test {
assert_eq!(x3, splits[3]);
}
}
-
-
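To make the data movement in the patch above easier to follow, here is a scalar model of the `vget_low_u32` / `vget_high_u32` / `vcombine_u32` pattern. It is a sketch under stated assumptions: plain arrays stand in for the NEON vector types, and `get_low`, `get_high`, `combine`, and `swap_halves` are hypothetical names, not functions from the PR.

```rust
// Assumption: plain arrays stand in for the NEON vector types.
type U32x4 = [u32; 4];
type U32x2 = [u32; 2];

/// Models `vget_low_u32`: the lower 64 bits (lanes 0 and 1).
fn get_low(v: U32x4) -> U32x2 {
    [v[0], v[1]]
}

/// Models `vget_high_u32`: the upper 64 bits (lanes 2 and 3).
fn get_high(v: U32x4) -> U32x2 {
    [v[2], v[3]]
}

/// Models `vcombine_u32`: the first argument becomes the low half,
/// the second the high half.
fn combine(lo: U32x2, hi: U32x2) -> U32x4 {
    [lo[0], lo[1], hi[0], hi[1]]
}

/// Same shape as the `rotated_carryout` closure in the diff:
/// passing the high half first swaps the two 64-bit halves.
fn swap_halves(v: U32x4) -> U32x4 {
    combine(get_high(v), get_low(v))
}
```

Seen this way, each `vget_*` call is just a 64-bit register move, which is why a `DUP`/`MOV` lowering is the expected code generation.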
The current state of this pull request gives a speed-up on the previously problematic CPUs. It is less than the speed-up on older architectures, but still a minimum of ~5%, and more often at least 10 to 12%, on benchmarks that use the SIMD NEON back-end, in comparison to the serial back-end. Results of the benchmarks: target_a55.tar.gz (merge conflicts should be resolved shortly)
As promised in #449! This is joint work with @Tarinn. In fact, @Tarinn did most of the work :-)
Fixes #147 for Aarch64, and you'll see an ARMv7 PR coming in after this.
I'm working on benchmarks across several devices now (including an unpublished very hacky armv7 version), will update here when they become available. We're seeing speedups of 20-30% in relevant benchmarks. This is the same code as in zkcrypto#19.
I'll be cleaning up the commits in the next few minutes (trailing white space fixes, mostly), and I'll run a comparison benchmark for 343be3a, because that was only introduced for future ARMv7 support, and has not been tested under load.
TODO:
- `shuffle!` call vs `vqtbx1q_u8` performance
- Explain that ARMv7 is not yet supported