New JMH benchmark method - vdot8s that implement int8 dotProduct in C… #13572
base: main
Conversation
Do we even need to use intrinsics? The function is so simple that the compiler seems to do the right thing, e.g. use the dot product instructions: https://godbolt.org/z/KG1dPnrqn https://developer.arm.com/documentation/102651/a/What-are-dot-product-intructions-
I haven't benchmarked, it just seems my cheap 2021 phone has them.
You can use it directly via an intrinsic, too, no need to use add/multiply intrinsics: https://arm-software.github.io/acle/neon_intrinsics/advsimd.html#dot-product But unless it is really faster than what GCC does with simple C, no need.
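For reference, the "simple C" being discussed is essentially the plain scalar loop below (a minimal sketch; the function name is illustrative, not taken from this PR). With -O3 and a dotprod-capable target, recent GCC versions auto-vectorize such a loop using SDOT, which is what the godbolt link above shows.
#include <stdint.h>
// Plain scalar int8 dot product: simple enough for the autovectorizer
// to lower the loop to SDOT on a dotprod-capable aarch64 target.
int32_t dot8s_scalar(int8_t* a, int8_t* b, int32_t limit) {
  int32_t result = 0;
  for (int32_t i = 0; i < limit; i++) {
    result += a[i] * b[i];
  }
  return result;
}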
With the updated compile flags, the performance of auto-vectorized code is slightly better than explicitly vectorized code (see results). An interesting thing to note is that both C-based implementations have …
This seems correct to me. The java Vector API is not performant for the integer case. Hotspot doesn't much understand ARM at all and definitely doesn't even have instructions such as …
lucene/core/build.gradle
Outdated
cCompiler.withArguments { args ->
args << "--shared"
<< "-O3"
<< "-march=armv8.2-a+dotprod"
for the simple case of benchmarking i would just try -march=native to compile the best it can for the specific cpu/microarch. Then you could test your dot8s on ARM 256-bit SVE (graviton3, graviton4) which might be interesting...
I tried -march=native on an m7g.4xlarge (Graviton3) instance type; OS, JDK, GCC were unchanged. dot8s and vdot8s are now ~3.5x faster than the java implementation, compared to being 10x better on m6g.4xlarge (Graviton2).
Benchmark | (size) | Mode | Cnt | Score | Error | Units |
---|---|---|---|---|---|---|
VectorUtilBenchmark.binaryDotProductVector | 768 | thrpt | 15 | 10.570 | ± 0.003 | ops/us |
VectorUtilBenchmark.dot8s | 768 | thrpt | 15 | 37.659 | ± 0.422 | ops/us |
VectorUtilBenchmark.vdot8s | 768 | thrpt | 15 | 37.123 | ± 0.237 | ops/us |
I think this happens because with gcc10+sve, the implementation is not really unrolled. It should be possible to be 10x faster, too.
oh, the other likely explanation on the performance is that the integer dot product in java is not AS HORRIBLE on the 256-bit SVE as it is on the 128-bit neon. It more closely resembles the logic of how it behaves on AVX-256: two 8x8-bit integers ("64-bit vectors") are multiplied into an intermediate 8x16-bit result (128-bit vector) and added to 8x32-bit (256-bit vector). Of course, it does not use the SDOT instruction, which is sad as it is a CPU instruction intended precisely for this purpose.
On the 128-bit neon there is not a possibility with java's vector api to process 4x8 bit integers ("32-bit vectors") like the SDOT instruction does: https://developer.arm.com/documentation/102651/a/What-are-dot-product-intructions-
Nor is it even performant to take 64-bit vector and process "part 0" then "part 1". The situation is really sad, and the performance reflects that.
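For contrast with what the Vector API can express, a hand-written NEON kernel can use SDOT directly via the ACLE intrinsic mentioned above. The sketch below is illustrative only (it is not the PR's vdot8s_neon): each vdotq_s32 multiplies 16 pairs of int8 values and accumulates groups of four into the 32-bit lanes, exactly the operation the linked ARM documentation describes.
#include <arm_neon.h>
#include <stdint.h>
// Requires a +dotprod target (e.g. -march=armv8.2-a+dotprod).
int32_t dot8s_neon_sketch(const int8_t* a, const int8_t* b, int32_t limit) {
  int32x4_t acc = vdupq_n_s32(0);
  int32_t i = 0;
  for (; i + 16 <= limit; i += 16) {
    int8x16_t va = vld1q_s8(a + i);
    int8x16_t vb = vld1q_s8(b + i);
    acc = vdotq_s32(acc, va, vb);   // SDOT: sums of 4 int8 products per 32-bit lane
  }
  int32_t result = vaddvq_s32(acc); // horizontal add of the four lanes
  for (; i < limit; i++) {          // scalar tail
    result += a[i] * b[i];
  }
  return result;
}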
Same. The performance benefit is large here. We do similar in Elasticsearch, and have impls for both AArch64 and x64. I remember @rmuir making a similar comment before about the auto-vectorization - which is correct. I avoided it at the time given the toolchain that we were using, but it's a good option which I'll reevaluate. https://github.com/elastic/elasticsearch/blob/main/libs/simdvec/native/src/vec/c/aarch64/vec.c
++ Yes. This is not a good state of affairs. I'll make sure to get an issue filed with OpenJDK for it.
Could Lucene ever have this directly in one of its modules? We currently use the …
Note: we currently do dot product and square distance on int7, since this gives a bit more flexibility on x64, and Lucene's scalar quantization is in the range of 0 - 127. (#13336)
It should work well with any modern gcc (@goankur uses gcc 10 here). If using clang, I think you need to add some pragmas for it, that's all, see https://llvm.org/docs/Vectorizers.html and also there are some hints in the ARM docs. One issue would be which to compile for/detect: IMO for arm there would be two good ones:
For x86 probably at most three?:
Here is my proposal visually: https://godbolt.org/z/6fcjPWojf As you can see, by passing … (these variants should all be benchmarked of course first).
edit: I modified it to use the same GCC compiler version across all variants. This newer version unrolls the SVE, too.
And i see, from playing around with compiler versions, the advantage of the intrinsics approach: although I worry how many variants we'd maintain, it would give stability across lucene releases without worrying about changes to the underlying C compiler etc, no crazy aggressive compiler flags needed, etc. But we should at least START with the autovectorizer and look at/benchmark the various approaches it uses, such as the VNNI one in the example above.
I definitely want to play around more with @goankur 's PR here and see what performance looks like across machines, but will be out of town for a bit. There is a script to run the benchmarks across every aws instance type in parallel and gather the results: https://github.com/apache/lucene/tree/main/dev-tools/aws-jmh I want to tweak https://github.com/apache/lucene/blob/main/dev-tools/aws-jmh/group_vars/all.yml to add the graviton4 (or any other new instance types since it was last touched) and also add c-compiler for testing this out, but we should be able to quickly iterate with it and make progress.
Just to expand a little on a previous comment I made above.
An alternative option to putting this in …
There are a few issues with distributing native-built methods that I can see. First, building becomes more complicated - you need a compiler, the build becomes more different on different platforms (can we support windows?), and when building you might have complex needs like targeting a different platform (or more platforms) than you are building on. Aside from building, there is the question of distributing the binaries and linking them into the runtime. I guess some command-line option is required in order to locate the .so (on linux) or .dll or whatever? And there can be environment-variable versions of this. So we would have to decide how much hand-holding to do. Finally we'd want to build the java code so it can fall back in a reasonable way when the native impl is not present. But none of this seems insurmountable. I believe Lucene previously had some native code libraries in its distribution and we could do so again. I'm not sure about the core vs misc distinction, I'd be OK either way, although I expect we want to keep core "clean"?
Thanks Chris. Please share the link to the OpenJDK issue when you get a chance.
Let me try to squeeze some cycles out next week and see if I can make progress on this front.
9e9b33b to e45224b
lucene/core/src/c/dotProduct.c
Outdated
* Looks like Apple M3 does not implement SVE and Apple's official documentation
* is not explicit about this or at least I could not find it.
*
*/
I think we should remove the ifdef, this does not happen with -march=native, correct? The problem is only if you try to "force" SVE? afaik, M3 doesn't support it and so -march=native should automatically take the neon path.
Simply having the -march=native compiler flag and no ifdef directive makes clang (15.0) throw a compilation error: SVE support not enabled. So, I think, guarding the SVE code with ifdef is necessary. That said, there is no need for the extra check on __APPLE__ and I have removed that part from the ifdef.
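A guard along these lines (a sketch of the pattern under discussion, not the PR's exact code) keeps the SVE kernel out of the build when the target does not enable SVE, so -march=native on a chip like the M3 simply takes the NEON path:
#include <stdint.h>
int32_t vdot8s_neon(int8_t* vec1, int8_t* vec2, int32_t limit);
#if defined(__ARM_FEATURE_SVE)
int32_t vdot8s_sve(int8_t* vec1, int8_t* vec2, int32_t limit);
#endif
// Compile-time dispatch: only reference the SVE kernel when the target
// actually enables SVE; otherwise always fall back to NEON.
int32_t vdot8s(int8_t* vec1, int8_t* vec2, int32_t limit) {
#if defined(__ARM_FEATURE_SVE)
  return vdot8s_sve(vec1, vec2, limit);
#else
  return vdot8s_neon(vec1, vec2, limit);
#endif
}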
lucene/core/src/c/dotProduct.h
Outdated
int32_t vdot8s_sve(int8_t* vec1[], int8_t* vec2, int32_t limit);
int32_t vdot8s_neon(int8_t* vec1[], int8_t* vec2[], int32_t limit);
int32_t dot8s(int8_t* a, int8_t* b, int32_t limit);
can we fix these prototypes to all be the same? Can we include from the .c file? Maybe also adding -Wall -Werror will help keep the code tidy?
Prototypes fixed and dotProduct.h included in dotProduct.c. I did not add -Wall -Werror yet, as after including changes from your patch the compilation of the dotProduct.c file fails for me on Graviton boxes. Let's continue that discussion in the separate thread.
lucene/core/src/c/dotProduct.c
Outdated
// REDUCE: Add every vector element in target and write result to scalar
result = svaddv_s32(svptrue_b8(), acc1);

// Scalar tail. TODO: Use FMA
I think you can remove the TODO, since aarch64 "mul" is really "madd", i expect it already emits a single instruction. Look at the assembler if you are curious.
This has been replaced with Vector tail as suggested in a later comment.
lucene/core/src/c/dotProduct.c
Outdated
acc2 = svdot_s32(acc2, va2, vb2);
acc3 = svdot_s32(acc3, va3, vb3);
acc4 = svdot_s32(acc4, va4, vb4);
}
maybe consider "vector tail", since 4 * vector length can be significant: and vector dimensions may not be exact multiple of that. It limits the worst case processing that "scalar tail" must do. Example from the java vector code: https://github.com/apache/lucene/blob/main/lucene/core/src/java21/org/apache/lucene/internal/vectorization/PanamaVectorUtilSupport.java#L153-L158
Converted the scalar tail to a vector tail and the throughput score on Graviton3 dropped from 45.5 -> 44.5, which is a bit surprising. My guess is that the auto-vectorized and unrolled code for the scalar tail is more efficient than the vector tail. Comparing the scalar and vector tails in godbolt - https://godbolt.org/z/sjo6KMnP7 - the whilelo instruction generated by auto-vectorization stands out to me, but I don't understand why that would cause a meaningful drop in performance. Let me try and dig a little deeper.
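For concreteness, the vector-tail shape being compared here looks roughly like the sketch below (illustrative only, not the exact PR code): after the unpredicated main loop, one pass with an svwhilelt predicate covers the remaining elements instead of a scalar loop.
#if defined(__ARM_FEATURE_SVE)
#include <arm_sve.h>
#include <stdint.h>
int32_t dot8s_sve_tail_sketch(const int8_t* a, const int8_t* b, int32_t limit) {
  svint32_t acc = svdup_n_s32(0);
  int32_t vec_length = (int32_t) svcntb();   // int8 elements per SVE vector
  int32_t i = 0;
  for (; i + vec_length <= limit; i += vec_length) {
    acc = svdot_s32(acc, svld1_s8(svptrue_b8(), a + i),
                         svld1_s8(svptrue_b8(), b + i));
  }
  // Vector tail: the predicate is active only for the remaining limit - i elements;
  // inactive lanes load as zero and contribute nothing to the dot product.
  svbool_t pg = svwhilelt_b8_s32(i, limit);
  acc = svdot_s32(acc, svld1_s8(pg, a + i), svld1_s8(pg, b + i));
  return (int32_t) svaddv_s32(svptrue_b32(), acc);
}
#endif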
Graviton 3
Benchmark (size) Mode Cnt Score Error Units
VectorUtilBenchmark.binaryDotProductVector 768 thrpt 15 10.570 ± 0.002 ops/us
VectorUtilBenchmark.dot8s 768 thrpt 15 44.562 ± 0.448 ops/us
lucene/core/src/c/dotProduct.c
Outdated
// Scalar tail. TODO: Use FMA
for (; i < limit; i++) {
result += vec1[i] * vec2[i];
}
for any "tails" like this where we manually unroll and vectorize the main loop, we can add pragmas to prevent GCC/LLVM from trying to unroll and vectorize the tail. It is not strictly necessary but will lead to tighter code.
and maybe it's just enough to disable autovectorization, as unrolling might still be a win for the scalar tail: can't remember what hotspot is doing on the java vector code for this.
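As a sketch of that idea (compiler- and version-dependent, so treat the exact pragmas as assumptions to verify against the GCC/Clang docs rather than settled flags), the remainder loop can be annotated so the autovectorizer leaves it alone while unrolling stays possible:
#include <stdint.h>
// Keep the short remainder loop scalar; the pragmas only suppress
// auto-vectorization of this loop (the GCC form needs a recent GCC).
static int32_t scalar_tail(const int8_t* vec1, const int8_t* vec2,
                           int32_t i, int32_t limit, int32_t result) {
#if defined(__clang__)
#pragma clang loop vectorize(disable)
#elif defined(__GNUC__)
#pragma GCC novector
#endif
  for (; i < limit; i++) {
    result += vec1[i] * vec2[i];
  }
  return result;
}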
go @goankur, awesome progress here. It is clear we gotta do something :) I left comments just to try to help. Do you mind avoiding rebase for updates? I am going to take a stab at the x86 side of the house.
lucene/core/src/c/dotProduct.c
Outdated
for (i = 0; i + 4 * vec_length <= limit; i += 4 * vec_length) {
// Load vectors into the Z registers which can range from 128-bit to 2048-bit wide
// The predicate register - P determines which bytes are active
// svptrue_b8() returns a predicate in which every element is true
With gcc autovectorization, i see use of SVE whilelo (predicate as counter) instead. I think it basically works on the loop counter, maybe look at autogenerated assembly with godbolt for inspiration? Sorry, I'm not well-versed in SVE but trying to help improve the performance as it would be important for Graviton3/Graviton4/etc.
Let me look into this tomorrow.
@rmuir -- I tried using the svwhilelt_b8_u32(i, limit) intrinsic for generating the predicate in both the unrolled loop and the vector tail, but the performance was actually worse :-(. To give you an idea of what I did, here is a link to the ARM documentation with a code sample: https://developer.arm.com/documentation/102699/0100/Optimizing-with-intrinsics
As someone who wrote and maintains the native scorer in Elasticsearch: it gives great speed ups for Scalar Quantised vectors on ARM, since Panama Vector is horrible for arithmetic operations on byte-sized values; both widening and accessing various parts of the vector. We have an AVX implementation too, but it gives less improvement. I don't see Panama Vector improving in this area any time soon :-(. For floats we don't do any native. In the latest Binary Quantization #13651, Panama Vector performs very nicely. It might be that some of the motivation for a native scorer may just go away.
gradle/java/javac.gradle
Outdated
@@ -24,7 +24,11 @@ allprojects { project ->

// Use 'release' flag instead of 'source' and 'target'
tasks.withType(JavaCompile) {
options.compilerArgs += ["--release", rootProject.minJavaVersion.toString()]
options.compilerArgs += ["--release", rootProject.minJavaVersion.toString(), "--enable-preview"]
No need for this when everything is moved to java21 folder
In the current setup it is. One could generalize this, but this should be fine step by step in separate PRs. There are several problems to solve, also with regards to modularization and where the stub jars are:
In general it's doable but I really doubt if this is healthy for this project's infrastructure. So I am still -1 to have native code before we have refactored major parts of infrastructure. One condition is to have java 22 as minimum version, so we can at least introduce MemorySegment as top level citizen into our own APIs (e.g. remove byte arrays from our IndexInput APIs). I was arguing in the java community to have the foreign API stabilized and released with Java 21, but unfortunately it was done in 22. If we can get rid of the preview mode shit and simply tell our downstream users to accept java 22 as minimum version both at compilation and runtime, then only the Panama vector stuff needs to be mr-jarred so it would simplify a lot. Uwe
28b05b9 to 2dce12f
I took a look. I have to say I'm still kind of confused about why we sometimes get these vector scorers via scorersupplier and other times get them from static create() methods and I worry about the resource usage - it's difficult to reason about by staring at code, so I think we need to run some tests to be more confident
MethodType.methodType(lookup.findClass(MemorySegment), byte[].class);
MethodHandle nativeMemorySegment =
lookup.findStatic(vectorUtilSupportClass, "nativeMemorySegment", methodType);
byte[] a = new byte[size];
I'm a little confused -- is this setup code for the benchmark? We just run it once and then run dot product on the same two vectors many times? I wonder if we would see something different if we generated a large number of vectors and randomized which ones we compare on each run. Also would performance vary if the vectors are sequential in their buffer (ie vector 0 starts at 0, vector 1 starts at size...)
Not the same two vectors on each iteration. We run setup once every iteration for a total of 15 iterations across 3 forks (5 iterations per fork) for each size being tested. Each fork is preceded by 3 warm-up iterations. Before each iteration we generate random numbers in range [0-127] in two on-heap byte[], allocate off-heap memory segments and populate them with contents from byte[]. These off-heap memory segments are provided to the VectorUtil.NATIVE_DOT_PRODUCT method handle.
(Code snippet below for reference)
@Param({"1", "128", "207", "256", "300", "512", "702", "1024"})
int size;
@Setup(Level.Iteration)
public void init() {
...
}
I wonder if we would see something different if we generated a large number of vectors and randomized which ones we compare on each run. Also would performance vary if the vectors are sequential in their buffer (ie vector 0 starts at 0, vector 1 starts at size...)
I guess the question you are hinting at is how the performance varies when the two candidate vectors are further apart in memory (L1 cache / L2 cache / L3 cache / main memory). Do the gains from the native implementation become insignificant with increasing distance? It's an interesting question and I propose that we add benchmark method(s) to answer it in a follow-up PR. Does that sound reasonable?
Nonetheless I will simplify the setup code to make it a bit more readable in the next iteration.
@@ -146,6 +146,7 @@ public float getScoreCorrectionConstant(int targetOrd) throws IOException {
}
slice.seek(((long) targetOrd * byteSize) + numBytes);
slice.readFloats(scoreCorrectionConstant, 0, 1);
lastOrd = targetOrd;
this looks like a bug because if we do this and then call vectorValue we could get the wrong result because we don't calculate the binaryValue here
Got it! I will remove this in the next revision. I was just trying to optimize for the case when getScoreCorrectionConstant(int targetOrd) gets invoked with the same targetOrd multiple times in succession.
NativeDotProductScorer(
MemorySegmentAccessInput input, KnnVectorValues values, byte[] queryVector) {
super(input, values, queryVector);
if (offHeapQuery == null) {
how would this ever not be null? can we assert offHeapQuery == null instead? Maybe we could make it final?
This whole NativeDotProductScorer is gone in the next iteration. For the byteVector case, I am planning to generate offHeap MemorySegments in case native dot product is enabled. Then in PanamaVectorUtilSupport.dotProduct(MemorySegment a, MemorySegment b) we will delegate the computation to native code if both segments are native.
@@ -34,6 +37,8 @@ abstract sealed class Lucene99MemorySegmentByteVectorScorer
final MemorySegmentAccessInput input;
final MemorySegment query;
byte[] scratch;
MemorySegment offHeapScratch;
I'm concerned about the cost of creating these for every scorer() we create because that happens a lot. During indexing, we create multiple scorers while adding each new document. Could we move these to the ScorerSupplier instead?
Hmm now I see we did do that! But then I wonder why we also need to do it here.
I'm concerned about the cost of creating these for every scorer() we create because that happens a lot.
The offHeapScratch is only allocated in the rare case of vector data being split across multiple MemorySegments each representing a mmap region.
Hmm now I see we did do that! But then I wonder why we also need to do it here.
When creating a RandomVectorScorer() for the given byte[] of the query, we will need to copy these bytes to an off-heap space for consumption within the native dot-product. Native code can only work with memory that is outside the management of the Java GC.
In the next revision, I am going to refactor some of these and add documentation to hopefully make this more clear.
@@ -0,0 +1,407 @@
/*
can you say where this file came from? Was it mostly copied from some other file, or is it all brand new?
Yep! Will do in the next revision.
lmk
lucene/native/src/c/dotProduct.c
Outdated
* is not explicit about this or at least I could not find it.
*
*/
int32_t vdot8s_sve(int8_t vec1[], int8_t vec2[], int32_t limit) {
Earlier @rmuir had suggested just relying on gcc to compile the "obvious" C code down to an efficient vectorized implementation, but here it looks like we are still hand rolling the NEON asm intrinsics? Was the gcc path less performant?
Was the gcc path less performant?
Yes. The compiler auto-vectorized dot-product implementation shows ~15% LOWER throughput compared to this manually unrolled and vectorized implementation which uses SVE intrinsics. Here is the JMH benchmark output from the next revision executed on an AWS Graviton 3 instance. The C implementation was compiled with GCC 10 and compiler flags -O3 -march=native -funroll-loops.
Benchmark (size) Mode Cnt Score Error Units
VectorUtilBenchmark.dot8s 768 thrpt 15 43.233 ± 0.639 ops/us
VectorUtilBenchmark.simpleDot8s 768 thrpt 15 36.437 ± 0.572 ops/us
Command to reproduce these results with the next revision
java --enable-native-access=ALL-UNNAMED \
--enable-preview \
-Djava.library.path="./lucene/native/build/libs/dotProduct/shared" \
-jar lucene/benchmark-jmh/build/benchmarks/lucene-benchmark-jmh-11.0.0-SNAPSHOT.jar regexp \
"(.*)?(ot8s)" \
-p "size=768"
A benefit of the manually vectorized implementation is that it is less likely to suffer from unintended performance regressions due to auto-vectorization variations across different ARM compilers.
*/
package org.apache.lucene.internal.vectorization;

/*
Duplicate copyright header.
Oops! Removed in the next revision.
@goankur -- thank you for pulling out the actual native code into a new module. I have not looked so closely at the PR, just left a few minor comments so far, but ... how come there are so many changes necessary in …?
We have measured performance using knnPerfTest.py in luceneutil with this PR commit as the candidate branch.
cmd
Lucene_Baseline
Lucene_Candidate
The latency has dropped from 0.333ms to 0.26ms.
nice improvement! I do see the index time increased and wonder if it is due to creating too many heavyweight RandomVectorScorers. Maybe we can make this easier by simplifying the whole RandomVectorScorer creation pathway. Although that is out of scope here, it might be good to try and work around it if possible as I suggested above.
3079c5d to 9adcb8b
* MemorySegment backing on-heap memory to native code we get
* "java.lang.UnsupportedOperationException: Not a native address"
*
* <p>Stack overflow thread:
Thanks this is helpful, but let's not permanently record the SO thread in the code comments
ok! removed in the next revision.
* https://stackoverflow.com/questions/69521289/jep-412-pass-a-on-heap-byte-array-to-native-code-getting-unsupportedoperatione
* explains the issue in more detail.
*
* <p>Q1. Why did we enable the native code path here if its inefficient ? A1. So that it can be
I don't think the unit-testing use case is a good one -- unit tests should be testing the actual production code path. I see that the required test setup would be complex, but IMO it's better to have simpler production code and complex unit tests than complex unused production code and simple unit tests!
Makes sense! The changes here were undone and TestVectorUtilSupport, BaseVectorizationTestCase were modified in the next revision to exercise the native code path by obtaining and invoking the method handle for dotProduct(MemorySegment a, MemorySegment b) where 'a' and 'b' are constructed as off-heap MemorySegments.
* heap. For target-vector, copying to off-heap memory will still be needed but allocation can
* happen once per scorer.
*
* <p>Q3. Should JMH benchmarks measure the performance of this method? A3. No, because they would
These comments are helpful for the PR, but I think in the actual code we would want to simplify and maybe say something like: do not call in production. Indeed we could possibly even add an assert false: "inefficient implementation, do not use"? And in production fall back to the non-native impl.
The assert false:... won't be necessary after undoing the changes, as the old code wraps the input byte[] into an on-heap MemorySegment and the native implementation is not exercised with that condition.
* @param numBytes Dimension of the byte vector
* @return offHeap memory segment
*/
public static MemorySegment offHeapByteVector(int numBytes) {
maybe put "random" in the name so we can easily tell this is creating a random value
I changed the implementation slightly to simply copy the input byte[] to an off-heap MemorySegment. The process of populating the byte[] with random bytes will happen outside this method in the next revision.
try {
int limit = (int) a.byteSize();
return (int) NativeMethodHandles.DOT_PRODUCT_IMPL.invokeExact(a, b, limit);
} catch (Throwable ex$) {
what are we trying to catch here?
WrongMethodTypeException and anything else (Throwable) propagated by the underlying method handle. WrongMethodTypeException is a subclass of Throwable.
@@ -0,0 +1,407 @@
/*
lmk
78b7c1f to 0694ac6
…lement int8 dotProduct in C using Neon and SVE intrinsics respectively. Fallback to Neon if SVE instructions are not supported by target platform
…if native dot-product is enabled. Simplify JMH benchmark code that tests native dot product. Incorporate other review feedback
Quick Update
What's next
A lot of code in … It is for this reason (and performance) that a MemorySegment slice containing the vector bytes is obtained from the underlying MemorySegmentAccessInput and passed to the native dotProduct. A flip side is that the query vector has to be copied to an off-heap buffer. I don't like the fact that so much code had to be duplicated to make this work, so I am going to take a stab to see if there is an opportunity to minimize the duplication. I will also try to improve the coverage of unit tests although the changes in …
Credit: https://www.elastic.co/search-labs/blog/vector-similarity-computations-ludicrous-speed
Description (WIP - Need to be updated)
Implements vectorized dot product for byte[] vectors in native C code using SVE and Neon intrinsics.

TODOs
- The new native module and the tests in core now depend on it. Ideally I'd prefer to only enable this dependency on supported platforms but not sure how to do this in a clean way.
- Native code is specific to the aarch64 architecture and applicable only to byte[] vectors. Additional conditions to enable/disable need to be added.
- int main(...) in dotProduct.c: this code should be converted to unit tests exercised using CUnit support in Gradle.
- Graviton2 and Apple M3 results need to be redone as some bugs in native code were fixed.

NOTE
I had to export environment variable CC in my shell environment for Gradle to pick up the gcc-10 compiler toolchain: export CC=/usr/bin/gcc10-cc

Build Lucene
Generates the compiled shared library ${PWD}/lucene/native/build/libs/dotProduct/shared/libdotProduct.so and the JMH benchmarking JAR.

Run Benchmark JAR
IGNORE -- These need to be redone

Summary
- int8 dot-product: 10X (NEON intrinsics), 4.3X (SVE intrinsics), TBD, 8X (NEON intrinsics)
- NEON intrinsics is equivalent to auto-vectorized/auto-unrolled code. SVE intrinsics are not available.
- SVE intrinsics provides +9.8% (score: 45.635) throughput gain on top of auto-vectorized/auto-unrolled code (score: 41.729).
- Graviton4 results look suspicious and need to be re-evaluated.
- NEON intrinsics is 2.38X FASTER than auto-vectorization/auto-unrolled code. SVE intrinsics are not available.

Test Environment
- Compiler flags: --shared -O3 -march=native -funroll-loops
- Graviton2 (m6g.4xlarge)
- Graviton3 (m7g.4xlarge)
- Graviton4 (r8g.4xlarge)

Results
Graviton 2 (m6g.4xlarge)
Graviton 3 (m7g.4xlarge)
[TBD] - Graviton 4 (r8g.4xlarge)
Apple M3 Pro