New JMH benchmark method - vdot8s that implement int8 dotProduct in C… #13572

Draft · wants to merge 4 commits into base: main

Conversation

@goankur (Contributor) commented Jul 16, 2024

Credit: https://www.elastic.co/search-labs/blog/vector-similarity-computations-ludicrous-speed

Description (WIP - needs updating)

Implements vectorized dot product for byte[] vectors in native C code using SVE and Neon intrinsics.

TODOs

  • Some more plumbing is pending; marking as draft as the work is incomplete.
  • The native 'C' code has been moved to a separate native module and tests in core now depend on it. Ideally I'd prefer to enable this dependency only on supported platforms, but I am not sure how to do this in a clean way.
  • Native dot product is only enabled on the aarch64 architecture and applies only to byte[] vectors. Additional conditions to enable/disable it need to be added.
  • Rudimentary correctness testing of the native code was done in int main(...) in dotProduct.c. This code should be converted to unit tests exercised using CUnit support in Gradle (see the sketch after this list).
  • JMH benchmarks on Graviton2 and Apple M3 need to be redone as some bugs in the native code were fixed.
  • KNN benchmarks on scalar quantized vectors need to be done to assess the indexing/querying impact.
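
A minimal sketch of what such a CUnit test could look like (the function name dot8s comes from the header in this PR; the harness wiring and test values here are assumptions, not the PR's actual test code):

```c
#include <CUnit/Basic.h>
#include <stdint.h>

// Prototype from dotProduct.h (this PR).
int32_t dot8s(int8_t* a, int8_t* b, int32_t limit);

// Scalar reference implementation to check against.
static int32_t scalar_dot(const int8_t* a, const int8_t* b, int32_t limit) {
  int32_t sum = 0;
  for (int32_t i = 0; i < limit; i++) {
    sum += a[i] * b[i];
  }
  return sum;
}

static void test_dot8s_matches_scalar(void) {
  int8_t a[768], b[768];
  for (int i = 0; i < 768; i++) {
    a[i] = (int8_t) (i % 127);
    b[i] = (int8_t) ((i * 7) % 127);
  }
  CU_ASSERT_EQUAL(dot8s(a, b, 768), scalar_dot(a, b, 768));
}

int main(void) {
  if (CU_initialize_registry() != CUE_SUCCESS) return CU_get_error();
  CU_pSuite suite = CU_add_suite("dotProduct", NULL, NULL);
  CU_add_test(suite, "dot8s matches scalar reference", test_dot8s_matches_scalar);
  CU_basic_set_mode(CU_BRM_VERBOSE);
  CU_basic_run_tests();
  CU_cleanup_registry();
  return CU_get_error();
}
```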

NOTE

I had to export the CC environment variable in my shell for Gradle to pick up the gcc-10 compiler toolchain: export CC=/usr/bin/gcc10-cc

Build Lucene

$ pwd
/home/goankur/open-source/lucene

$ ./gradlew build

Generates the compiled shared library ${PWD}/lucene/native/build/libs/dotProduct/shared/libdotProduct.so
and the JMH benchmarking JAR.

Run Benchmark JAR

java --enable-native-access=ALL-UNNAMED \
       --enable-preview \
       -Djava.library.path="./lucene/native/build/libs/dotProduct/shared" \
       -jar lucene/benchmark-jmh/build/benchmarks/lucene-benchmark-jmh-11.0.0-SNAPSHOT.jar \
      regexp "(.*)?(binaryDotProductVector|dot8s)"

IGNORE -- These need to be redone

Summary

  • Throughput gains relative to Panama API based java implementation of int8 dot-product.
    • Graviton2: 10X (NEON intrinsics)
    • Graviton3: 4.3X (SVE intrinsics)
    • Graviton4: TBD
    • Apple M3 Pro: 8X (NEON intrinsics)
  • On Graviton 2, performance of manually unrolled loop vectorized using NEON intrinsics is equivalent to auto-vectorized/auto-unrolled code. SVE intrinsics are not available.
  • On Graviton3, manually unrolled loop vectorized using SVE intrinsics provides +9.8% (score: 45.635) throughput gain on top of auto-vectorized/auto-unrolled code (score: 41.729).
  • [CAUTION]: Graviton4 results look suspicious and need to be re-evaluated.
  • On Apple M3 Pro, performance of the manually unrolled loop vectorized using NEON intrinsics is 2.38X FASTER than auto-vectorized/auto-unrolled code. SVE intrinsics are not available.

Test Environment

  • GCC/Clang flags: --shared -O3 -march=native -funroll-loops
| AWS Instance Type | Operating System | Compiler | JDK | Gradle |
|---|---|---|---|---|
| Graviton2 (m6g.4xlarge) | Amazon Linux 2 | GCC 10.5.0 | Amazon Corretto 21 | Gradle 8.8 |
| Graviton3 (m7g.4xlarge) | Amazon Linux 2 | GCC 10.5.0 | Amazon Corretto 21 | Gradle 8.8 |
| Graviton4 (r8g.4xlarge) | Amazon Linux 2023 | GCC 11.4.1 | Amazon Corretto 21 | Gradle 8.8 |
| Apple M3 | macOS Sonoma (14.5) | Apple clang 15.0.0 | OpenJDK 21.0.3 | Gradle 8.8 |

Results

Graviton 2 (m6g.4xlarge)

Benchmark                                   (size)   Mode  Cnt   Score   Error   Units
VectorUtilBenchmark.binaryDotProductVector     768  thrpt   15   2.857 ± 0.014  ops/us
VectorUtilBenchmark.dot8s                      768  thrpt   15  28.580 ± 0.189  ops/us
VectorUtilBenchmark.neonVdot8s                 768  thrpt   15  28.722 ± 0.454  ops/us
VectorUtilBenchmark.sveVdot8s                  768  thrpt   15  28.892 ± 0.340  ops/us

Graviton 3 (m7g.4xlarge)

Benchmark                                   (size)   Mode  Cnt   Score   Error   Units
VectorUtilBenchmark.binaryDotProductVector     768  thrpt   15  10.563 ± 0.004  ops/us
VectorUtilBenchmark.dot8s                      768  thrpt   15  41.729 ± 1.060  ops/us
VectorUtilBenchmark.neonVdot8s                 768  thrpt   15  41.032 ± 0.215  ops/us
VectorUtilBenchmark.sveVdot8s                  768  thrpt   15  45.635 ± 0.337  ops/us

[TBD] - Graviton 4 (r8g.4xlarge)


Apple M3 Pro

Benchmark                                   (size)   Mode  Cnt   Score   Error   Units
VectorUtilBenchmark.binaryDotProductVector     768  thrpt   15  10.399 ± 0.025  ops/us
VectorUtilBenchmark.dot8s                      768  thrpt   15  34.288 ± 0.062  ops/us
VectorUtilBenchmark.neonVdot8s                 768  thrpt   15  81.013 ± 0.569  ops/us
VectorUtilBenchmark.sveVdot8s                  768  thrpt   15  80.657 ± 0.760  ops/us

@rmuir (Member) commented Jul 16, 2024

Do we even need to use intrinsics? The function is so simple that the compiler seems to do the right thing, e.g. use the SDOT dot product instruction, given the correct flags:

https://godbolt.org/z/KG1dPnrqn

https://developer.arm.com/documentation/102651/a/What-are-dot-product-intructions-
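
For reference, a minimal sketch of the plain-C shape being discussed (the function name is made up here; the PR's actual code may differ). With the right -march flags GCC auto-vectorizes this loop using SDOT:

```c
#include <stdint.h>

// Plain scalar int8 dot product; with e.g.
//   gcc -O3 -march=armv8.2-a+dotprod -funroll-loops
// GCC recognizes this shape and auto-vectorizes it using the SDOT instruction.
int32_t dot8s_simple(const int8_t* a, const int8_t* b, int32_t limit) {
  int32_t result = 0;
  for (int32_t i = 0; i < limit; i++) {
    result += a[i] * b[i];
  }
  return result;
}
```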

@rmuir (Member) commented Jul 16, 2024

I haven't benchmarked; it just seems SDOT is the one to optimize for, and GCC can both recognize the code shape and autovectorize to it without hassle.

My cheap 2021 phone has the asimddp feature in /proc/cpuinfo, so dot product support seems widespread.

You can use it directly via an intrinsic, too; no need to use add/multiply intrinsics: https://arm-software.github.io/acle/neon_intrinsics/advsimd.html#dot-product

But unless it is really faster than what GCC does with simple C, there is no need.
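
And a rough sketch of the direct-intrinsic route using the ACLE dot-product intrinsic vdotq_s32 (illustrative only, not the code in this PR; requires -march=armv8.2-a+dotprod and omits tail handling):

```c
#include <arm_neon.h>
#include <stdint.h>

// NEON int8 dot product using the SDOT intrinsic (vdotq_s32).
// Assumes limit is a multiple of 16; a real implementation needs a tail loop.
int32_t vdot8s_neon_sketch(const int8_t* a, const int8_t* b, int32_t limit) {
  int32x4_t acc = vdupq_n_s32(0);
  for (int32_t i = 0; i < limit; i += 16) {
    int8x16_t va = vld1q_s8(a + i);
    int8x16_t vb = vld1q_s8(b + i);
    // Each SDOT multiplies groups of 4 int8 lanes and accumulates into 4 int32 lanes.
    acc = vdotq_s32(acc, va, vb);
  }
  return vaddvq_s32(acc);  // horizontal add of the 4 accumulator lanes
}
```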

lucene/core/build.gradle (outdated review thread, resolved)
@goankur (Contributor Author) commented Jul 18, 2024

> Do we even need to use intrinsics? function is so simple that the compiler seems to do the right thing, e.g. use SDOT dot production instruction, given the correct flags:
>
> https://godbolt.org/z/KG1dPnrqn
>
> https://developer.arm.com/documentation/102651/a/What-are-dot-product-intructions-
>
> I haven't benchmarked, just seems SDOT is the one to optimize for, and GCC can both recognize the code shape and autovectorize to it without hassle.
>
> my cheap 2021 phone has asimddp feature in /proc/cpuinfo, dot product support seems widespread.
>
> You can use it directly via intrinsic, too, no need to use add/multiply intrinsic: https://arm-software.github.io/acle/neon_intrinsics/advsimd.html#dot-product
>
> But unless it is really faster than what GCC does with simple C, no need.

With the updated compile flags, the performance of auto-vectorized code is slightly better than that of explicitly vectorized code (see results). An interesting thing to note is that both C-based implementations have 10X better throughput compared to the Panama API based Java implementation (unless I am not doing an apples-to-apples comparison).

@rmuir (Member) commented Jul 18, 2024

> With the updated compile flags, the performance of auto-vectorized code is slightly better than explicitly vectorized code (see results). Interesting thing to note is that both C-based implementations have 10X better throughput compared to the Panama API based java implementation (unless I am not doing apples-to-apples comparison).

This seems correct to me. The Java Vector API is not performant for the integer case. Hotspot doesn't understand ARM much at all and definitely doesn't have instructions such as SDOT in its vocabulary.

cCompiler.withArguments { args ->
  args << "--shared"
       << "-O3"
       << "-march=armv8.2-a+dotprod"
Member:

For the simple case of benchmarking I would just try -march=native to compile the best it can for the specific CPU/microarch. Then you could test your dot8s on ARM 256-bit SVE (Graviton3, Graviton4), which might be interesting...

@goankur (Contributor Author):

I tried -march=native on an m7g.4xlarge (Graviton3) instance type; OS, JDK, and GCC were unchanged. dot8s and vdot8s are now ~3.5x faster than the Java implementation, compared to being 10x faster on m6g.4xlarge (Graviton2).

Benchmark                                   (size)   Mode  Cnt   Score   Error   Units
VectorUtilBenchmark.binaryDotProductVector     768  thrpt   15  10.570 ± 0.003  ops/us
VectorUtilBenchmark.dot8s                      768  thrpt   15  37.659 ± 0.422  ops/us
VectorUtilBenchmark.vdot8s                     768  thrpt   15  37.123 ± 0.237  ops/us

Member:

I think this happens because with gcc10+SVE the implementation is not really unrolled. It should be possible to be 10x faster, too.

Member:

Oh, the other likely explanation for the performance is that the integer dot product in Java is not AS HORRIBLE on 256-bit SVE as it is on 128-bit NEON. It more closely resembles how it behaves on AVX-256: two 8x8-bit integer vectors ("64-bit vectors") are multiplied into an intermediate 8x16-bit result (128-bit vector) and added to an 8x32-bit accumulator (256-bit vector). Of course, it does not use the SDOT instruction, which is sad as it is a CPU instruction intended precisely for this purpose.

On 128-bit NEON there is no way with Java's Vector API to process 4x8-bit integers ("32-bit vectors") like the SDOT instruction does: https://developer.arm.com/documentation/102651/a/What-are-dot-product-intructions-
Nor is it even performant to take a 64-bit vector and process "part 0" then "part 1". The situation is really sad, and the performance reflects that.

@ChrisHegarty (Contributor) commented:

> With the updated compile flags, the performance of auto-vectorized code is slightly better than explicitly vectorized code (see results). Interesting thing to note is that both C-based implementations have 10X better throughput compared to the Panama API based java implementation (unless I am not doing apples-to-apples comparison).

> This seems correct to me.

Same. The performance benefit is large here. We do similar in Elasticsearch, and have impls for both AArch64 and x64. I remember @rmuir making a similar comment before about the auto-vectorization - which is correct. I avoided it at the time given the toolchain that we were using, but it's a good option which I'll reevaluate.

https://github.com/elastic/elasticsearch/blob/main/libs/simdvec/native/src/vec/c/aarch64/vec.c

> The java Vector API is not performant for the integer case. Hotspot doesn't much understand ARM at all and definitely doesn't even have instructions such as SDOT in its vocabulary at all.

++ Yes. This is not a good state of affairs. I'll make sure to get an issue filed with OpenJDK for it.

@ChrisHegarty (Contributor) commented Jul 18, 2024

Could Lucene ever have this directly in one of its modules? We currently use the FlatVectorsScorer to plug in the "native code optimized" alternative when scoring scalar quantized vectors. But Lucene misses out on this by default, which is a pity.

Note: we currently do dot product and square distance on int7, since this gives a bit more flexibility on x64, and Lucene's scalar quantization is in the range of 0 - 127. (#13336)

@rmuir (Member) commented Jul 18, 2024

> I avoided it at the time given the toolchain that we were using, but it's a good option which I'll reevaluate.

It should work well with any modern gcc (@goankur uses gcc 10 here). If using clang, I think you need to add some pragmas for it, that's all, see https://llvm.org/docs/Vectorizers.html and also there are some hints in the ARM docs.

One issue would be which targets to compile for and how to detect them (see the detection sketch after the lists below):

IMO for arm there would be two good ones:

  • armv8.2-a+dotprod (NEON)
  • armv9-a (SVE)

For x86 probably at most three?:

  • x86-64-v3 (AVX2)
  • x86-64-v4 (AVX512)
  • VNNI (should be tested but seems fit for the task? but not widespread? cascade-lake+?)
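
For the detection side, one possibility on Linux/aarch64 is the kernel's hardware-capability bits via getauxval; this is a sketch of an assumed dispatch helper, not something in this PR:

```c
#include <stdint.h>
#include <sys/auxv.h>
#include <asm/hwcap.h>

// Pick an implementation based on CPU features reported by the kernel (Linux/aarch64 only).
// 0 = scalar/autovectorized fallback, 1 = NEON + dotprod, 2 = SVE.
int dot_product_impl_kind(void) {
  unsigned long hwcap = getauxval(AT_HWCAP);
#ifdef HWCAP_SVE
  if (hwcap & HWCAP_SVE) return 2;
#endif
#ifdef HWCAP_ASIMDDP
  if (hwcap & HWCAP_ASIMDDP) return 1;  // shows up as "asimddp" in /proc/cpuinfo
#endif
  return 0;
}
```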

@rmuir (Member) commented Jul 18, 2024

Here is my proposal visually: https://godbolt.org/z/6fcjPWojf

As you can see, by passing -march=cascadelake it generates VNNI instructions.
IMO, no need for any intrinsics anywhere, for x86 nor ARM. Just a dead-simple C function and being smart about which targets we compile for.

(these variants should all be benchmarked of course first)

edit: I modified it to use same GCC compiler version across all variants. This newer version unrolls the SVE, too.

@rmuir (Member) commented Jul 18, 2024

And I see, from playing around with compiler versions, the advantage of the intrinsics approach (although I worry how many variants we'd maintain): it would give stability across Lucene releases without worrying about changes to the underlying C compiler, no crazy aggressive compiler flags needed, etc.

But we should at least START with the autovectorizer and look at/benchmark the various approaches it uses, such as the VNNI one in the example above.

@rmuir (Member) commented Jul 18, 2024

I definitely want to play around more with @goankur 's PR here and see what performance looks like across machines, but will be out of town for a bit.

There is a script to run the benchmarks across every aws instance type in parallel and gather the results: https://github.com/apache/lucene/tree/main/dev-tools/aws-jmh

I want to tweak https://github.com/apache/lucene/blob/main/dev-tools/aws-jmh/group_vars/all.yml to add the graviton4 (or any other new instance types since it was last touched) and also add c-compiler for testing this out, but we should be able to quickly iterate with it and make progress.

@ChrisHegarty (Contributor) commented:

Just to expand a little on a previous comment I made above.

> Could Lucene ever have this directly in one of its modules?

An alternative option to putting this in core is to put it in, say, misc, allowing users creating a KnnVectorsFormat to hook into it through Lucene99FlatVectorsFormat and FlatVectorsScorer. That way it would be somewhat optional in the build and distribution.

@msokolov (Contributor) commented:

> An alternative option to putting this in core, is to put it in say misc, allowing users creating KnnVectorsFormat to hook into it through the Lucene99FlatVectorsFormat and FlatVectorsScorer. That way it would be somewhat optional in the build and distribution.

There are a few issues with distributing native-built methods that I can see. First, building becomes more complicated - you need a compiler, the build becomes more different on different platforms (can we support windows?), and when building you might have complex needs like targeting a different platform (or more platforms) than you are building on. Aside from building, there is the question of distributing the binaries and linking them into the runtime. I guess some command-line option is required in order to locate the .so (on linux) or .dll or whatever? And there can be environment-variable versions of this. So we would have to decide how much hand-holding to do. Finally we'd want to build the java code so it can fall back in a reasonable way when the native impl is not present. But none of this seems insurmountable. I believe Lucene previously had some native code libraries in its distribution and we could do so again. I'm not sure about the core vs misc distinction, I'd be OK either way, although I expect we want to keep core "clean"?

@goankur (Contributor Author) commented Jul 20, 2024

> With the updated compile flags, the performance of auto-vectorized code is slightly better than explicitly vectorized code (see results). Interesting thing to note is that both C-based implementations have 10X better throughput compared to the Panama API based java implementation (unless I am not doing apples-to-apples comparison).
>
> This seems correct to me.
>
> Same. The performance benefit is large here. We do similar in Elasticsearch, and have impls for both AArch64 and x64. I remember @rmuir making a similar comment before about the auto-vectorization - which is correct. I avoided it at the time given the toolchain that we were using, but it's a good option which I'll reevaluate.
>
> https://github.com/elastic/elasticsearch/blob/main/libs/simdvec/native/src/vec/c/aarch64/vec.c
>
> The java Vector API is not performant for the integer case. Hotspot doesn't much understand ARM at all and definitely doesn't even have instructions such as SDOT in its vocabulary at all.
>
> ++ Yes. This is not a good state of affairs. I'll make sure to get an issue filed with OpenJDK for it.

Thanks Chris. Please share the link to OpenJDK issue when you get a chance.

@goankur (Contributor Author) commented Jul 20, 2024

> I definitely want to play around more with @goankur 's PR here and see what performance looks like across machines, but will be out of town for a bit.
>
> There is a script to run the benchmarks across every aws instance type in parallel and gather the results: https://github.com/apache/lucene/tree/main/dev-tools/aws-jmh
>
> I want to tweak https://github.com/apache/lucene/blob/main/dev-tools/aws-jmh/group_vars/all.yml to add the graviton4 (or any other new instance types since it was last touched) and also add c-compiler for testing this out, but we should be able to quickly iterate with it and make progress.

Let me try to squeeze some cycles out next week and see if I can make progress on this front.

@goankur goankur force-pushed the main branch 4 times, most recently from 9e9b33b to e45224b Compare July 30, 2024 04:31
* Looks like Apple M3 does not implement SVE and Apple's official documentation
* is not explicit about this or at least I could not find it.
*
*/
Member:

I think we should remove the ifdef; this does not happen with -march=native, correct? The problem only occurs when you try to "force" SVE? AFAIK, M3 doesn't support it, so -march=native should automatically take the NEON path.

@goankur (Contributor Author) commented Aug 2, 2024

Simply having the -march=native compiler flag and no ifdef directive makes clang (15.0) throw a compilation error: SVE support not enabled. So I think guarding the SVE code with an ifdef is necessary. That said, there is no need for the extra check on __APPLE__ and I have removed that part from the ifdef.
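
For context, the guard being discussed is the ACLE feature macro; a tiny sketch (the structure of dotProduct.c itself may differ):

```c
#include <stdint.h>

#if defined(__ARM_FEATURE_SVE)
#include <arm_sve.h>

// SVE path: only compiled when the target architecture enables SVE,
// e.g. -march=native on Graviton3 but not on Apple M3.
int32_t has_sve_path(void) { return 1; }
#else
// NEON/scalar-only build: clang on Apple M3 lands here.
int32_t has_sve_path(void) { return 0; }
#endif
```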


int32_t vdot8s_sve(int8_t* vec1[], int8_t* vec2, int32_t limit);
int32_t vdot8s_neon(int8_t* vec1[], int8_t* vec2[], int32_t limit);
int32_t dot8s(int8_t* a, int8_t* b, int32_t limit);
Member:

Can we fix these prototypes to all be the same? Can we include the header from the .c file? Maybe also adding -Wall -Werror will help keep the code tidy?

@goankur (Contributor Author):

Prototypes fixed and dotProduct.h included in dotProduct.c. I did not add -Wall -Werror yet as after including changes from your patch the compilation of dotProduct.c file fails for me on Graviton boxes. Let's continue that discussion in the separate thread.

// REDUCE: Add every vector element in target and write result to scalar
result = svaddv_s32(svptrue_b8(), acc1);

// Scalar tail. TODO: Use FMA
Member:

I think you can remove the TODO, since aarch64 "mul" is really "madd"; I expect it already emits a single instruction. Look at the assembler if you are curious.

@goankur (Contributor Author):

This has been replaced with Vector tail as suggested in a later comment.

acc2 = svdot_s32(acc2, va2, vb2);
acc3 = svdot_s32(acc3, va3, vb3);
acc4 = svdot_s32(acc4, va4, vb4);
}
Member:

Maybe consider a "vector tail", since 4 * vector length can be significant and vector dimensions may not be an exact multiple of that. It limits the worst-case processing that the "scalar tail" must do. Example from the Java vector code: https://github.com/apache/lucene/blob/main/lucene/core/src/java21/org/apache/lucene/internal/vectorization/PanamaVectorUtilSupport.java#L153-L158
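
A sketch of what the suggested predicated "vector tail" could look like in SVE (names follow the fragments quoted in this thread; this is an illustration of the suggestion, not the PR's code):

```c
#include <stdint.h>

#if defined(__ARM_FEATURE_SVE)
#include <arm_sve.h>

// Sketch: SVE dot product with a predicated "vector tail" instead of a scalar tail.
// The main loop here is deliberately not unrolled, to keep the sketch short.
int32_t vdot8s_sve_tail_sketch(const int8_t* vec1, const int8_t* vec2, int32_t limit) {
  svint32_t acc = svdup_n_s32(0);
  int64_t vl = (int64_t) svcntb();               // bytes per SVE vector (e.g. 32 on Graviton3)
  int64_t i = 0;
  for (; i + vl <= limit; i += vl) {
    svint8_t va = svld1_s8(svptrue_b8(), vec1 + i);
    svint8_t vb = svld1_s8(svptrue_b8(), vec2 + i);
    acc = svdot_s32(acc, va, vb);
  }
  // Vector tail: a predicate keeps only lanes i..limit-1 active; inactive lanes load as zero,
  // so they contribute nothing to the accumulator.
  svbool_t pred = svwhilelt_b8_s64(i, (int64_t) limit);
  svint8_t va = svld1_s8(pred, vec1 + i);
  svint8_t vb = svld1_s8(pred, vec2 + i);
  acc = svdot_s32(acc, va, vb);
  // Horizontal reduction of the int32 accumulator lanes.
  return (int32_t) svaddv_s32(svptrue_b32(), acc);
}
#endif
```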

@goankur (Contributor Author) commented Aug 2, 2024

Converted scalar tail to vector and throughput score on Graviton3 dropped from 45.5 -> 44.5 which is a bit surprising. My guess is that auto-vectorized and unrolled code for scalar tail is more efficient than vector tail. Comparing scalar and vector tail in godbolt - https://godbolt.org/z/sjo6KMnP7, the whilelo instruction generated by auto-vectorization, stands out to me but I don't understand why that would cause a meaningful drop in performance. Let me try and dig a little deeper.

@goankur (Contributor Author):

Graviton 3

Benchmark                                   (size)   Mode  Cnt   Score   Error   Units
VectorUtilBenchmark.binaryDotProductVector     768  thrpt   15  10.570 ± 0.002  ops/us
VectorUtilBenchmark.dot8s                      768  thrpt   15  44.562 ± 0.448  ops/us

// Scalar tail. TODO: Use FMA
for (; i < limit; i++) {
result += vec1[i] * vec2[i];
}
Member:

for any "tails" like this where we manually unroll and vectorize the main loop, we can add pragmas to prevent GCC/LLVM from trying to unroll and vectorize the tail. It is not strictly necessary but will lead to tighter code.

Member:

And maybe it's enough to just disable autovectorization, as unrolling might still be a win for the scalar tail; I can't remember what Hotspot does on the Java vector code for this.
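
For illustration, the kind of pragmas being referred to; exact availability varies by compiler and version, so treat this as a sketch:

```c
#include <stdint.h>

// Sketch: a scalar tail where the compiler is asked not to re-vectorize/unroll it,
// since the main loop has already been vectorized by hand.
int32_t scalar_tail_sketch(const int8_t* vec1, const int8_t* vec2,
                           int32_t i, int32_t limit, int32_t result) {
#if defined(__clang__)
  // clang: disable both vectorization and unrolling of this loop.
#pragma clang loop vectorize(disable) unroll(disable)
#elif defined(__GNUC__)
  // GCC: keep the tail rolled (GCC 13+ additionally offers "#pragma GCC novector").
#pragma GCC unroll 1
#endif
  for (; i < limit; i++) {
    result += vec1[i] * vec2[i];
  }
  return result;
}
```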

@rmuir (Member) commented Aug 1, 2024

go @goankur, awesome progress here. It is clear we gotta do something :) I left comments just to try to help. Do you mind avoiding rebase for updates? I am going to take a stab at the x86 side of the house.

for (i = 0; i + 4 * vec_length <= limit; i += 4 * vec_length) {
// Load vectors into the Z registers which can range from 128-bit to 2048-bit wide
// The predicate register - P determines which bytes are active
// svptrue_b8() returns a predicate in which every element is true
Member:

With GCC autovectorization, I see use of the SVE whilelo (predicate-as-counter) instead. I think it basically works on the loop counter; maybe look at the auto-generated assembly with godbolt for inspiration? Sorry, I'm not well-versed in SVE but I am trying to help improve the performance as it would be important for Graviton3/Graviton4/etc.

See also https://developer.arm.com/documentation/ddi0602/2024-06/SVE-Instructions/WHILELO--predicate-as-counter---While-incrementing-unsigned-scalar-lower-than-scalar--predicate-as-counter--

@goankur (Contributor Author):

Let me look into this tomorrow.

@goankur (Contributor Author):

@rmuir -- I tried using svwhilelt_b8_u32(i, limit) intrinsic for generating the predicate in both the unrolled-loop and vector tail but the performance was actually worse :-(.

To give you an idea of what I did, here is link to the ARM documentation with code sample

https://developer.arm.com/documentation/102699/0100/Optimizing-with-intrinsics

@ChrisHegarty (Contributor) commented:

> Anyways: At moment we do not want to have native code in Lucene Core.
> ..
Having the likes of OpenSearch, Elasticsearch, and Solr implement their own (high performance) direct native SIMD access seems not ideal to me. It really should somehow be an option in Lucene, even if it's not the default.

As someone who wrote and maintains the native scorer in Elasticsearch: it gives great speedups for scalar quantised vectors on ARM, since Panama Vector is horrible for arithmetic operations on byte-sized values - both widening and accessing various parts of the vector. We have an AVX implementation too, but it gives smaller improvements. I don't see Panama Vector improving in this area any time soon :-(. For floats we don't do anything native.

In the latest Binary Quantization #13651, Panama Vector performs very nicely. It might be that some of the motivation for a native scorer may just go away.

@@ -24,7 +24,11 @@ allprojects { project ->

// Use 'release' flag instead of 'source' and 'target'
tasks.withType(JavaCompile) {
options.compilerArgs += ["--release", rootProject.minJavaVersion.toString()]
options.compilerArgs += ["--release", rootProject.minJavaVersion.toString(), "--enable-preview"]
Contributor:

No need for this when everything is moved to java21 folder

@uschindler (Contributor) commented Sep 14, 2024

> Anyways: At moment we do not want to have native code in Lucene Core.

+1, we don't put native code in Lucene's core.

> But @uschindler is there maybe a way forward not using core? I.e. add native code to misc or sandbox or so, to enable iterating on direct access to SIMD for early adopters (opt in)?
>
> Or is the src/javaXX approach core-only for some reason?

In the current setup it is. One could generalize this, but that should be done step by step in separate PRs. There are several problems to solve, also with regard to modularization and where the stub jars are:

  • Compilation stubs (apijars) need to move to a common folder across Gradle projects (we don't have such a common module yet, as core itself has no dependencies).
  • The bootstrapping of Panama and memory segment code is based on factories which are not yet pluggable. This may need SPI, so one could simply drop a separate JAR onto the class path/module path and it would then enable native code.
  • I would like to prevent duplicate and hairy initialization code, so there should be one single place that loads the service providers and does the appropriate checks beforehand.

In general it's doable but I really doubt that this is healthy for this project's infrastructure.

So I am still -1 to having native code before we have refactored major parts of the infrastructure. One condition is to have Java 22 as the minimum version, so we can at least introduce MemorySegment as a top-level citizen in our own APIs (e.g. remove byte arrays from our IndexInput APIs). I was arguing in the Java community to have the foreign API stabilized and released with Java 21, but unfortunately it was done in 22. If we can get rid of the preview mode shit and simply tell our downstream users to accept Java 22 as the minimum version both at compilation and runtime, then only the Panama vector stuff needs to be MR-JARred, which would simplify a lot.

Uwe

@goankur goankur force-pushed the main branch 3 times, most recently from 28b05b9 to 2dce12f Compare October 13, 2024 22:58
@goankur goankur marked this pull request as ready for review October 13, 2024 23:54
@goankur goankur marked this pull request as draft October 17, 2024 01:51
@msokolov (Contributor) left a comment:

I took a look. I have to say I'm still kind of confused about why we sometimes get these vector scorers via a scorer supplier and other times get them from static create() methods, and I worry about the resource usage - it's difficult to reason about by staring at code, so I think we need to run some tests to be more confident.

MethodType.methodType(lookup.findClass(MemorySegment), byte[].class);
MethodHandle nativeMemorySegment =
lookup.findStatic(vectorUtilSupportClass, "nativeMemorySegment", methodType);
byte[] a = new byte[size];
Contributor:

I'm a little confused -- is this setup code for the benchmark? We just run it once and then run dot product on the same two vectors many times? I wonder if we would see something different if we generated a large number of vectors and randomized which ones we compare on each run. Also would performance vary if the vectors are sequential in their buffer (ie vector 0 starts at 0, vector 1 starts at size...)

@goankur (Contributor Author) commented Oct 25, 2024

Not the same two vectors on each iteration. We run setup once per iteration, for a total of 15 iterations across 3 forks (5 iterations per fork) for each size being tested. Each fork is preceded by 3 warm-up iterations. Before each iteration we generate random numbers in the range [0-127] in two on-heap byte[], allocate off-heap memory segments and populate them with the contents of the byte[]. These off-heap memory segments are passed to the VectorUtil.NATIVE_DOT_PRODUCT method handle.

(Code snippet below for reference)

@Param({"1", "128", "207", "256", "300", "512", "702", "1024"})
  int size;

@Setup(Level.Iteration)
public void init() {
...
}

> I wonder if we would see something different if we generated a large number of vectors and randomized which ones we compare on each run. Also would performance vary if the vectors are sequential in their buffer (ie vector 0 starts at 0, vector 1 starts at size...)

I guess the question you are hinting at is how the performance varies when the two candidate vectors are further apart in memory (L1 cache / L2 cache / L3 cache / main memory). Do the gains from the native implementation become insignificant with increasing distance? It's an interesting question and I propose that we add benchmark method(s) to answer it in a follow-up PR. Does that sound reasonable?

@goankur (Contributor Author):

Nonetheless I will simplify the setup code to make it a bit more readable in the next iteration.

@@ -146,6 +146,7 @@ public float getScoreCorrectionConstant(int targetOrd) throws IOException {
}
slice.seek(((long) targetOrd * byteSize) + numBytes);
slice.readFloats(scoreCorrectionConstant, 0, 1);
lastOrd = targetOrd;
Contributor:

This looks like a bug: if we do this and then call vectorValue, we could get the wrong result because we don't calculate the binaryValue here.

@goankur (Contributor Author):

Got it! I will remove this in the next revision. I was just trying to optimize for the case where getScoreCorrectionConstant(int targetOrd) gets invoked with the same targetOrd multiple times in succession.

NativeDotProductScorer(
MemorySegmentAccessInput input, KnnVectorValues values, byte[] queryVector) {
super(input, values, queryVector);
if (offHeapQuery == null) {
Contributor:

how would this ever not be null? can we assert offHeapQuery == null instead? Maybe we could make it final?

@goankur (Contributor Author):

This whole NativeDotProductScorer is gone in the next iteration. For the byteVector case, I am planning to generate offHeap MemorySegments in case native dot product is enabled. Then in PanamaVectorUtilSupport.dotProduct(MemorySegment a, MemorySegment b) we will delegate the computation to native code if both segments are native.

@@ -34,6 +37,8 @@ abstract sealed class Lucene99MemorySegmentByteVectorScorer
final MemorySegmentAccessInput input;
final MemorySegment query;
byte[] scratch;
MemorySegment offHeapScratch;
Contributor:

I'm concerned about the cost of creating these for every scorer() we create because that happens a lot. During indexing, we create multiple scorers while adding each new document. Could we move these to the ScorerSupplier instead?

Contributor:

Hmm now I see we did do that! But then I wonder why we also need to do it here.

@goankur (Contributor Author) commented Oct 28, 2024

> I'm concerned about the cost of creating these for every scorer() we create because that happens a lot.

The offHeapScratch is only allocated in the rare case of vector data being split across multiple MemorySegments each representing a mmap region.

> Hmm now I see we did do that! But then I wonder why we also need to do it here.

When creating a RandomVectorScorer for the given query byte[], we need to copy these bytes to off-heap space for consumption by the native dot-product. Native code can only work with memory that is outside the management of the Java GC.

In the next revision, I am going to refactor some of these and add documentation to hopefully make this more clear.

@@ -0,0 +1,407 @@
/*
Contributor:

can you say where this file came from? Was it mostly copied from some other file, or is it all brand new?

@goankur (Contributor Author):

Yep! Will do in the next revision.

Contributor:

lmk

* is not explicit about this or at least I could not find it.
*
*/
int32_t vdot8s_sve(int8_t vec1[], int8_t vec2[], int32_t limit) {
Member:

Earlier @rmuir had suggested just relying on GCC to compile the "obvious" C code down to an efficient vectorized implementation, but here it looks like we are still hand-rolling the NEON intrinsics? Was the gcc path less performant?

@goankur (Contributor Author) commented Oct 29, 2024

> Was the gcc path less performant?

Yes. The compiler auto-vectorized dot-product implementation shows ~15% LOWER throughput compared to this manually unrolled and vectorized implementation, which uses SVE intrinsics. Here is the JMH benchmark output from the next revision executed on an AWS Graviton3 instance; the C implementation was compiled with GCC 10 and the flags -O3 -march=native -funroll-loops.

Benchmark                        (size)   Mode  Cnt   Score   Error   Units
VectorUtilBenchmark.dot8s           768  thrpt   15  43.233 ± 0.639  ops/us
VectorUtilBenchmark.simpleDot8s     768  thrpt   15  36.437 ± 0.572  ops/us

Command to reproduce these results with the next revision

    java --enable-native-access=ALL-UNNAMED \
            --enable-preview \
            -Djava.library.path="./lucene/native/build/libs/dotProduct/shared" \
            -jar lucene/benchmark-jmh/build/benchmarks/lucene-benchmark-jmh-11.0.0-SNAPSHOT.jar regexp \
             "(.*)?(ot8s)" \
             -p "size=768"

A benefit of the manually vectorized implementation is that it is less likely to suffer from unintended performance regressions due to auto-vectorization variations across different ARM compilers.
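
To make the benchmarked shape concrete, here is a sketch of a 4x-unrolled SVE main loop reconstructed from the fragments quoted in this review (svdot_s32 into four independent accumulators, then svaddv_s32 to reduce); the actual dotProduct.c may differ, e.g. in its tail handling:

```c
#include <stdint.h>

#if defined(__ARM_FEATURE_SVE)
#include <arm_sve.h>

// 4x unrolled SVE int8 dot product: four independent accumulators help hide
// SDOT latency; the accumulator lanes are reduced to a scalar at the end.
int32_t vdot8s_sve_unrolled_sketch(const int8_t* vec1, const int8_t* vec2, int32_t limit) {
  svint32_t acc1 = svdup_n_s32(0), acc2 = svdup_n_s32(0);
  svint32_t acc3 = svdup_n_s32(0), acc4 = svdup_n_s32(0);
  int64_t vl = (int64_t) svcntb();   // bytes per SVE register
  int64_t i = 0;
  for (; i + 4 * vl <= limit; i += 4 * vl) {
    acc1 = svdot_s32(acc1, svld1_s8(svptrue_b8(), vec1 + i),          svld1_s8(svptrue_b8(), vec2 + i));
    acc2 = svdot_s32(acc2, svld1_s8(svptrue_b8(), vec1 + i + vl),     svld1_s8(svptrue_b8(), vec2 + i + vl));
    acc3 = svdot_s32(acc3, svld1_s8(svptrue_b8(), vec1 + i + 2 * vl), svld1_s8(svptrue_b8(), vec2 + i + 2 * vl));
    acc4 = svdot_s32(acc4, svld1_s8(svptrue_b8(), vec1 + i + 3 * vl), svld1_s8(svptrue_b8(), vec2 + i + 3 * vl));
  }
  // Combine the four accumulators and reduce to a scalar.
  svint32_t acc = svadd_s32_x(svptrue_b32(),
                              svadd_s32_x(svptrue_b32(), acc1, acc2),
                              svadd_s32_x(svptrue_b32(), acc3, acc4));
  int32_t result = (int32_t) svaddv_s32(svptrue_b32(), acc);
  // Remaining (< 4 * vl) elements: a scalar tail keeps this sketch simple.
  for (; i < limit; i++) {
    result += vec1[i] * vec2[i];
  }
  return result;
}
#endif
```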

*/
package org.apache.lucene.internal.vectorization;

/*
Member:

Duplicate copyright header.

@goankur (Contributor Author):

Oops! Removed in the next revision.

@mikemccand (Member) commented:

@goankur -- thank you for pulling out the actual native code into a new native Lucene module. I'm not sure we need a new module -- could we use misc or sandbox maybe?

I have not looked so closely at the PR, just left a few minor comments so far, but ... how come there are so many changes necessary in core? It looks like a lot of hard forking of existing sources to specialize specific cases (memory segment, byte quantization, ???). I had expected (from the view from 10,000 feet) the PR to be nearly entirely outside of core.

@yugushihuang commented:

We have measured performance using knnPerfTest.py in luceneutil with this PR commit as the candidate branch.

cmd

'/usr/lib/jvm/java-21-amazon-corretto/bin/java', '-cp', [...], '--add-modules', 'jdk.incubator.vector', '-Djava.library.path=/home/[user_name]/lucene_candidate/lucene/native/build/libs/dotProduct/shared', 'knn.KnnGraphTester', '-quantize', '-ndoc', '1500000', '-maxConn', '32', '-beamWidthIndex', '50', '-fanout', '6', '-quantizeBits', '7', '-numMergeWorker', '12', '-numMergeThread', '4', '-encoding', 'float32', '-topK', '10', '-dim', '768', '-docs', 'enwiki-20120502-lines-1k-mpnet.vec', '-reindex', '-search-and-stats', 'enwiki-20120502-mpnet.vec', '-forceMerge', '-quiet'

Lucene_Baseline

Graph level=3 size=46, connectedness=1.00
Graph level=2 size=1405, connectedness=1.00
Graph level=1 size=46174, connectedness=1.00
Graph level=0 size=1500000, connectedness=1.00

Results:
recall  latency (ms)     nDoc  topK  fanout  maxConn  beamWidth  quantized  index s  force merge s  num segments  index size (MB)
 0.332         0.333  1500000    10       6       32         50     7 bits   432.69         271.51             1          5558.90

Lucene_Candidate

Graph level=3 size=46, connectedness=1.00
Graph level=2 size=1410, connectedness=1.00
Graph level=1 size=46205, connectedness=1.00
Graph level=0 size=1500000, connectedness=1.00

Results:
recall  latency (ms)     nDoc  topK  fanout  maxConn  beamWidth  quantized  index s  force merge s  num segments  index size (MB)
 0.337         0.260  1500000    10       6       32         50     7 bits   441.25         293.41             1          5558.91

The latency has dropped from 0.333ms to 0.26ms.

@msokolov (Contributor) commented:

nice improvement! I do see the index time increased and wonder if it is due to creating too many heavyweight RandomVectorScorers. Maybe we can make this easier by simplifying the whole RandomVectorScorer creation pathway. Although that is out of scope here, it might be good to try and work around it if possible as I suggested above

@goankur goankur force-pushed the main branch 3 times, most recently from 3079c5d to 9adcb8b Compare November 1, 2024 00:23
* MemorySegment backing on-heap memory to native code we get
* "java.lang.UnsupportedOperationException: Not a native address"
*
* <p>Stack overflow thread:
Contributor:

Thanks this is helpful, but let's not permanently record the SO thread in the code comments

@goankur (Contributor Author):

ok! removed in the next revision.

* https://stackoverflow.com/questions/69521289/jep-412-pass-a-on-heap-byte-array-to-native-code-getting-unsupportedoperatione
* explains the issue in more detail.
*
* <p>Q1. Why did we enable the native code path here if its inefficient ? A1. So that it can be
Contributor:

I don't think the unit-testing use case is a good one -- unit tests should be testing the actual production code path. I see that the required test setup would be complex, but IMO it's better to have simpler production code and complex unit tests than complex unused production code and simple unit tests!

@goankur (Contributor Author) commented Nov 5, 2024

makes sense! The changes here were undone and TestVectorUtilSupport, BaseVectorizationTestCase were modified in the next revision to exercise the native code path by obtaining and invoking method handle for dotProduct(MemorySegment a, MemorySegment b) where 'a' and 'b' are constructed as off-heap MemorySegments.

* heap. For target-vector, copying to off-heap memory will still be needed but allocation can
* happen once per scorer.
*
* <p>Q3. Should JMH benchmarks measure the performance of this method? A3. No, because they would
Contributor:

These comments are helpful for the PR, but I think in the actual code we would want to simplify and maybe say something like: do not call in production. Indeed we could possibly even add an assert false: "inefficient implementation, do not use" ? And in production fall back to the non-native impl

@goankur (Contributor Author):

The assert false:... won't be necessary after undoing the changes as the old code wraps input byte[] into on-heap MemorySegment and native implementation is not exercised with that condition.

* @param numBytes Dimension of the byte vector
* @return offHeap memory segment
*/
public static MemorySegment offHeapByteVector(int numBytes) {
Contributor:

maybe put "random" in the name so we can easily tell this is creating a random value

@goankur (Contributor Author):

I changed the implementation slightly to simply copy the input byte[] to off-heap MemorySegment. The process of populating the byte[] with random bytes will happen outside this method in the next revision.

try {
int limit = (int) a.byteSize();
return (int) NativeMethodHandles.DOT_PRODUCT_IMPL.invokeExact(a, b, limit);
} catch (Throwable ex$) {
Contributor:

what are we trying to catch here?

@goankur (Contributor Author):

WrongMethodTypeException and anything else (Throwable) propagated by underlying method handle. WrongMethodTypeException is a subclass of Throwable.

@@ -0,0 +1,407 @@
/*
Contributor:

lmk

@goankur goankur force-pushed the main branch 2 times, most recently from 78b7c1f to 0694ac6 Compare November 3, 2024 22:20
Ankur Goel added 3 commits November 5, 2024 02:05
…lement int8 dotProduct in C using Neon and SVE intrinsics respectively. Fallback to Neon if SVE instructions are not supported by target platform
…lement int8 dotProduct in C using Neon and SVE intrinsics respectively. Fallback to Neon if SVE instructions are not supported by target platform
…if native dot-product is enabled. Simplify JMH benchmark code that tests native dot product. Incorporate other review feedback
@goankur (Contributor Author) commented Nov 7, 2024

Quick Update

  1. Changes to use native dot-product for both int7 scalar-quantized and raw vectors have been integrated in core/src/java21
  2. Lucene99ScalarQuantizedVectorScorer was modified to use Lucene99MemorySegmentFlatVectorScorer when PanamaVectorizationProvider is enabled.
  3. Unit tests in TestVectorUtilSupport have been expanded to exercise native dot-product code.
  4. JMH benchmarks contain methods to benchmark performance of compiler auto-vectorized implementation and manually vectorized implementation.
  5. test.native.dotProduct system property has been added to gradle/testing/randomization.gradle to enable randomized testing of native code.
  6. Running the KnnPerfTest tool (in luceneutil) on 768-dimensional mpnet vectors shows a ~25% reduction in latency compared to the baseline.

What's next

A lot of code in Lucene99MemorySegmentScalarQuantizedVectorScorer was duplicated from Lucene99ScalarQuantizedVectorScorer and modified to enable code path that uses native dotProduct. This is because the native code only works with off-heap memory and new implementation needs to pass vectors as off-heap MemorySegment to native code.

It is for this (and performance) reason that a MemorySegment slice containing the vector bytes is obtained from the underlying MemorySegmentAccessInput and passed to the native dotProduct. A flip side is that the query vector has to be copied to an off-heap buffer.

I don't like the fact that so much code had to be duplicated to make this work, so I am going to take a stab to see if there is an opportunity to minimize the duplication.

I will also try to improve the coverage of unit tests, although the changes in Lucene99ScalarQuantizedVectorScorer are gated behind the flag Constants.NATIVE_DOT_PRODUCT_ENABLED, which is toggled by randomized testing, so the existing tests for raw and scalar quantized vectors should be exercised both with and without native dotProduct.
