Using SIMD for dealing with json (and more) at speed #13773
Comments
@hpvd, what would you suggest: using the simdjson library for all JSON data handling, or something else?
I think this would be a multistep approach. We can look at what is possible on https://simdjson.org/, pick one place in Pinot, and give it a try. In the end we can use it in many ways.
@hpvd @abhioncbr - I have been very interested in exploring a wider, more holistic use of SIMD in Pinot. Historically, that endeavor has not been successful because Java offered no support for the low-level primitives. JNI is of course an option. For this issue, how are you planning to use SIMD in the Pinot code base? Is it via a JNI bridge that we build over Intel compiler intrinsics, via an abstraction (e.g. the JDK Vector API, incubating since JDK 16), or something else?
My high-level suggestion: if there is indeed a viable path to SIMD acceleration in Java, then rather than doing piecewise work for a specific scenario, it would be better to first work out how it will be integrated into the Pinot code base, so that we can reuse it in other appropriate places (e.g. in the query engine). We also need to evaluate portability.
I agree with POCing one aspect, but when we actually decide to build the feature, it should ideally be designed with broader, long-term use in mind, since we are likely to introduce platform-specific dependencies into the codebase.
@siddharthteotia have you already looked into this one:
We may also look into how Apache Doris leverages it.
Yes, my understanding was also to use the SIMD Java bindings. As @hpvd suggested, we can explore how JDK-based projects are using it and choose a path forward based on that.
+1. Yes, let's do a survey. This is based on the incubator version of vector support in the JDK (Project Panama by OpenJDK, AFAIK). Note that the package still says "incubator", so I am not sure about production use or support. We have done this in the past: we took a dependency on a less-than-production-ready library (Lbuffer), and it proved unstable once in a while; we recently removed it. So, as a first step, it would be good to check whether any of the latest JDK versions actually support it before we go much deeper into the POC / performance evaluation with the above library. Take a look at project Gandiva (under Arrow) too. We can also build a JNI bridge ourselves. I think the investment really depends on the POC showing some value. Curious whether @gortiz / @richardstartin have any advice or suggestions.
This article is already a year old, but pretty interesting: it shows how Elastic / Lucene leverage SIMD and handle incubating features, and it includes some benchmarks.
This includes the history, state, and goals of the Vector API in Java:
This is a fantastic initiative, and +100 on getting native SIMD. Given the pace at which Java is moving, it might be a good idea to gradually extract interfaces where SIMD can help. This would allow some users/companies to stay on an older JDK while others move forward. We don't want to get stuck in the same mode as last time, where moving off Java 8 meant waiting for all users to migrate off Java 8.
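A minimal sketch of what "extracting interfaces" could look like. Everything here is hypothetical and not an existing Pinot API: `ByteCountKernel`, `ScalarByteCountKernel`, and `ByteCountKernels.select` are names invented for illustration. The idea is that callers depend only on the interface, a portable scalar implementation always exists, and an accelerated implementation (for example one built on the incubating Vector API) is loaded reflectively only when the running JDK can provide it.

```java
/** Hypothetical kernel interface: counts occurrences of one byte value in a block. */
interface ByteCountKernel {
    int count(byte[] block, int offset, int length, byte target);
}

/** Portable fallback that runs on any JDK and any CPU. */
final class ScalarByteCountKernel implements ByteCountKernel {
    @Override
    public int count(byte[] block, int offset, int length, byte target) {
        int n = 0;
        for (int i = offset; i < offset + length; i++) {
            if (block[i] == target) {
                n++;
            }
        }
        return n;
    }
}

/** Hypothetical selector: prefers an accelerated kernel, falls back to scalar. */
final class ByteCountKernels {
    private ByteCountKernels() {}

    static ByteCountKernel select(String acceleratedClassName) {
        try {
            // The accelerated class is only referenced by name, so an older JDK
            // that cannot load it never touches the newer API.
            Class<?> cls = Class.forName(acceleratedClassName);
            return (ByteCountKernel) cls.getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException | LinkageError | ClassCastException e) {
            // Class missing, module not available, or wrong type: use the fallback.
            return new ScalarByteCountKernel();
        }
    }
}
```

With this shape, a deployment on an older JDK simply ends up with the scalar kernel, while a newer JDK can ship the accelerated class in a separate artifact.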
Yep, it would be great if we found a way for those who want to and can (no hard internal restrictions, suitable hardware, ...) to benefit from the new possibilities without having to wait until everybody is ready.
just edited the title to
I think explorations in this area are very interesting, but AFAIK Panama is not fast enough yet. Last month at JCrete we discussed how to access native code efficiently, and it looks like nothing has changed (yet). Calling JNI/Panama code per row is prohibitively slow. The good news is that in the single-stage engine, and in the leaf stages of the multi-stage engine, these calls can be made at block level, so we should be able to absorb the cost of the JNI call.
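The per-row versus per-block point can be illustrated with a small sketch. It uses no real JNI: `NativeSum` is a hypothetical stand-in for a native entry point that merely counts how often the boundary is crossed, and the block size of 1000 is an arbitrary value chosen for the example, not Pinot's actual block size.

```java
final class BlockCallSketch {
    private BlockCallSketch() {}

    /** Hypothetical stand-in for a JNI/Panama entry point; counts boundary crossings. */
    static final class NativeSum {
        long crossings = 0;

        long sum(long[] values, int offset, int length) {
            crossings++; // each invocation would pay the fixed native-call overhead once
            long s = 0;
            for (int i = offset; i < offset + length; i++) {
                s += values[i];
            }
            return s;
        }
    }

    /** Sums all values, crossing the (simulated) native boundary once per block. */
    static long sumInBlocks(NativeSum kernel, long[] values, int blockSize) {
        long total = 0;
        for (int start = 0; start < values.length; start += blockSize) {
            int length = Math.min(blockSize, values.length - start);
            total += kernel.sum(values, start, length);
        }
        return total;
    }
}
```

For 10,000 rows, a block size of 1 crosses the boundary 10,000 times, while a block size of 1000 crosses it 10 times for the same result, which is why the fixed call cost can be absorbed at block level.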
Good overview and starter:
Just to get an understanding of how other projects handle this:
https://www.elastic.co/search-labs/blog/lucene-and-java-moving-forward-together
As expected, Lucene has changed its requirements: v10 now requires Java 21; see https://lucene.apache.org/core/corenews.html#apache-lucenetm-1000-available
Just started a list to get an overview of what we are missing by staying compatible with older Java versions.
Using SIMD for dealing with json at speed
Inspired by PostgreSQL (up to 4-fold speedup):
see: https://www.phoronix.com/news/PostgreSQL-Opt-JSON-Esc-SIMD
and since more and more CPUs support AVX512 or its successors:
https://www.phoronix.com/review/simdjson-avx-512
https://simdjson.org/ (used by ClickHouse, Apache Doris, ...)
https://github.com/simdjson/simdjson (Apache 2.0 license)
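To give a feel for the kind of check that JSON escaping needs, here is a hedged sketch in plain Java. It is not the PostgreSQL or simdjson implementation and it does not use SIMD instructions; it uses SWAR ("SIMD within a register"), testing 8 bytes at a time inside a `long`, as a portable approximation of the same idea. The class and method names (`JsonEscapeScan`, `firstEscapeIndex`) are made up for this example. The set of bytes flagged (control characters below 0x20, `"`, and `\`) follows the JSON string grammar.

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;
import java.nio.ByteOrder;

final class JsonEscapeScan {
    private JsonEscapeScan() {}

    private static final VarHandle LONG_VIEW =
            MethodHandles.byteArrayViewVarHandle(long[].class, ByteOrder.LITTLE_ENDIAN);

    private static final long ONES = 0x0101010101010101L;
    private static final long HIGHS = 0x8080808080808080L;
    private static final long SPACES = 0x2020202020202020L;       // 0x20 in every byte
    private static final long QUOTES = 0x2222222222222222L;       // '"' in every byte
    private static final long BACKSLASHES = 0x5C5C5C5C5C5C5C5CL;  // '\' in every byte

    /** True if any of the 8 bytes in word is < 0x20, '"', or '\'. */
    static boolean anyNeedsEscape(long word) {
        long control = (word - SPACES) & ~word & HIGHS;   // some byte below 0x20
        long q = word ^ QUOTES;                           // zero byte where '"' was
        long quote = (q - ONES) & ~q & HIGHS;
        long b = word ^ BACKSLASHES;                      // zero byte where '\' was
        long backslash = (b - ONES) & ~b & HIGHS;
        return (control | quote | backslash) != 0;
    }

    static boolean needsEscape(byte value) {
        int u = value & 0xFF;
        return u < 0x20 || u == '"' || u == '\\';
    }

    /** Index of the first byte that must be escaped, or -1 if there is none. */
    static int firstEscapeIndex(byte[] utf8) {
        int i = 0;
        // Fast path: skip clean 8-byte words.
        for (; i + Long.BYTES <= utf8.length; i += Long.BYTES) {
            long word = (long) LONG_VIEW.get(utf8, i);
            if (anyNeedsEscape(word)) {
                break; // locate the exact byte in the scalar loop below
            }
        }
        // Scalar path: the flagged word, or the tail shorter than 8 bytes.
        for (; i < utf8.length; i++) {
            if (needsEscape(utf8[i])) {
                return i;
            }
        }
        return -1;
    }
}
```

Strings that contain nothing to escape, which is the common case, are skipped 8 bytes per iteration; a real SIMD implementation applies the same skip-the-clean-run strategy with wider registers.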