Benchmark branchless numeric parser #23

jabolina · 2024-05-29T16:37:36Z

For reference: netty/netty@4.1...franz1981:netty:4.1_branchless_varint

The current parser is InfinispanParser, and the new one is BranchlessParser. The new one has a lot more code.
Running on an Intel i7-9850H.

Integer:

Benchmark                                  (width)  Mode  Cnt  Score   Error  Units
IntegerBenchmark.parseVarint32Branchless         1  avgt   15  2.831 ± 0.132  ns/op
IntegerBenchmark.parseVarint32Branchless        15  avgt   15  4.451 ± 0.284  ns/op
IntegerBenchmark.parseVarint32Branchless        24  avgt   15  4.569 ± 0.183  ns/op
IntegerBenchmark.parseVarint32Branchless        31  avgt   15  5.756 ± 0.116  ns/op
IntegerBenchmark.parserVarint32Infinispan        1  avgt   15  2.328 ± 0.060  ns/op
IntegerBenchmark.parserVarint32Infinispan       15  avgt   15  5.309 ± 0.180  ns/op
IntegerBenchmark.parserVarint32Infinispan       24  avgt   15  7.218 ± 0.258  ns/op
IntegerBenchmark.parserVarint32Infinispan       31  avgt   15  9.469 ± 0.608  ns/op

Long:

Benchmark                               (width)  Mode  Cnt   Score   Error  Units
LongBenchmark.parseVarint32Branchless         1  avgt   15   2.641 ± 0.105  ns/op
LongBenchmark.parseVarint32Branchless        24  avgt   15   4.941 ± 0.448  ns/op
LongBenchmark.parseVarint32Branchless        31  avgt   15   6.611 ± 0.141  ns/op
LongBenchmark.parseVarint32Branchless        48  avgt   15   6.670 ± 0.213  ns/op
LongBenchmark.parserVarint32Infinispan        1  avgt   15   2.267 ± 0.075  ns/op
LongBenchmark.parserVarint32Infinispan       24  avgt   15   5.903 ± 0.129  ns/op
LongBenchmark.parserVarint32Infinispan       31  avgt   15   7.897 ± 0.306  ns/op
LongBenchmark.parserVarint32Infinispan       48  avgt   15   9.179 ± 0.407  ns/op
LongBenchmark.parserVarint32Infinispan       63  avgt   15  12.485 ± 0.236  ns/op

For small widths, both parsers perform similarly (InfinispanParser is even slightly faster at width 1), so this change may be unnecessary there. For larger widths, BranchlessParser is clearly faster.

franz1981 · 2024-06-18T15:39:10Z

I would improve the benchmark to cause branch mispredictions, i.e. use a large enough, reproducible set of inputs with different varint sizes; see https://github.com/netty/netty/blob/151dfa083d28e995a18f7d2c73d4a7d3b7ab73b2/microbench/src/main/java/io/netty/handler/codec/protobuf/VarintDecodingBenchmark.java#L46 for reference,

unless the point is that you expect the data to always have a specific size/length.
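
A minimal sketch of what such an input set could look like, assuming a fixed seed for reproducibility and a uniform mix of encoded lengths from 1 to 5 bytes. The names `MixedVarintInputs`, `encode`, and `generate` are hypothetical and are not taken from the Netty benchmark linked above.

```java
import java.io.ByteArrayOutputStream;
import java.util.Random;

final class MixedVarintInputs {
    // Encodes a non-negative int as a varint: 7 payload bits per byte,
    // with the high bit set on every byte except the last one.
    public static byte[] encode(int value) {
        ByteArrayOutputStream out = new ByteArrayOutputStream(5);
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        out.write(value);
        return out.toByteArray();
    }

    // Generates `count` encoded values whose encoded length changes from entry to entry,
    // so a branchy decoder cannot settle on one length. The fixed seed keeps runs comparable.
    public static byte[][] generate(int count, long seed) {
        Random random = new Random(seed);
        byte[][] inputs = new byte[count][];
        for (int i = 0; i < count; i++) {
            int length = 1 + random.nextInt(5);                      // target encoded length: 1..5 bytes
            int bits = Math.min(7 * length, 31);                     // cap at 31 bits to stay non-negative
            int lowest = length == 1 ? 0 : 1 << (7 * (length - 1));  // smallest value needing `length` bytes
            int span = (int) ((1L << bits) - lowest);                // number of values with that length
            inputs[i] = encode(lowest + random.nextInt(span));
        }
        return inputs;
    }

    public static void main(String[] args) {
        for (byte[] input : generate(8, 42L)) {
            System.out.println("encoded length: " + input.length);
        }
    }
}
```

Generating the array once in the benchmark setup and iterating over it in the measured method keeps the generation cost out of the measurement.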

jabolina · 2024-06-18T17:29:54Z

Thanks, @franz1981. That seems better. I'll try updating the benchmark.

We have some updates planned for Hot Rod to reduce the client to a single connection and improve the batching/pipelining of commands. This change would make the buffer size vary between submissions. Internally, it should likely help for individual commands, although I'm not 100% sure about that. Either way, updating the benchmark would better reflect actual usage.

jabolina · 2024-08-30T15:09:44Z

Some months passed and I finally applied the suggestions.

Results:

Benchmark                                     (elementType)  (inputDistribution)  (inputs)  Mode  Cnt   Score   Error  Units
NumericParserBenchmark.parseNumberBranchless            INT                SMALL         1  avgt   20   4.810 ± 0.223  ns/op
NumericParserBenchmark.parseNumberInfinispan            INT                SMALL         1  avgt   20   5.298 ± 0.539  ns/op

NumericParserBenchmark.parseNumberBranchless            INT                SMALL       128  avgt   20   4.164 ± 0.195  ns/op
NumericParserBenchmark.parseNumberInfinispan            INT                SMALL       128  avgt   20   6.688 ± 2.269  ns/op

NumericParserBenchmark.parseNumberBranchless            INT                SMALL    128000  avgt   20   9.111 ± 0.593  ns/op
NumericParserBenchmark.parseNumberInfinispan            INT                SMALL    128000  avgt   20  15.544 ± 1.132  ns/op

NumericParserBenchmark.parseNumberBranchless            INT                LARGE         1  avgt   20   6.619 ± 0.289  ns/op
NumericParserBenchmark.parseNumberInfinispan            INT                LARGE         1  avgt   20  14.317 ± 1.525  ns/op

NumericParserBenchmark.parseNumberBranchless            INT               MEDIUM         1  avgt   20   3.613 ± 0.129  ns/op
NumericParserBenchmark.parseNumberInfinispan            INT               MEDIUM         1  avgt   20   4.670 ± 0.533  ns/op

NumericParserBenchmark.parseNumberBranchless            INT               MEDIUM       128  avgt   20   4.772 ± 1.000  ns/op
NumericParserBenchmark.parseNumberInfinispan            INT               MEDIUM       128  avgt   20   7.183 ± 1.055  ns/op

NumericParserBenchmark.parseNumberBranchless            INT               MEDIUM    128000  avgt   20  11.379 ± 0.620  ns/op
NumericParserBenchmark.parseNumberInfinispan            INT               MEDIUM    128000  avgt   20  16.079 ± 0.971  ns/op

NumericParserBenchmark.parseNumberBranchless            INT                  ALL         1  avgt   20   3.759 ± 0.267  ns/op
NumericParserBenchmark.parseNumberInfinispan            INT                  ALL         1  avgt   20   5.182 ± 0.792  ns/op

NumericParserBenchmark.parseNumberBranchless            INT                  ALL       128  avgt   20   5.842 ± 0.149  ns/op
NumericParserBenchmark.parseNumberInfinispan            INT                  ALL       128  avgt   20   8.953 ± 1.271  ns/op

NumericParserBenchmark.parseNumberBranchless            INT                  ALL    128000  avgt   20  16.172 ± 1.114  ns/op
NumericParserBenchmark.parseNumberInfinispan            INT                  ALL    128000  avgt   20  17.555 ± 1.504  ns/op

NumericParserBenchmark.parseNumberBranchless           LONG                SMALL         1  avgt   20   5.394 ± 0.255  ns/op
NumericParserBenchmark.parseNumberInfinispan           LONG                SMALL         1  avgt   20   7.804 ± 0.524  ns/op

NumericParserBenchmark.parseNumberBranchless           LONG                SMALL       128  avgt   20   5.676 ± 0.396  ns/op
NumericParserBenchmark.parseNumberInfinispan           LONG                SMALL       128  avgt   20   7.493 ± 1.073  ns/op

NumericParserBenchmark.parseNumberBranchless           LONG                SMALL    128000  avgt   20  12.252 ± 1.023  ns/op
NumericParserBenchmark.parseNumberInfinispan           LONG                SMALL    128000  avgt   20  16.152 ± 0.754  ns/op

NumericParserBenchmark.parseNumberBranchless           LONG                LARGE         1  avgt   20   6.225 ± 0.232  ns/op
NumericParserBenchmark.parseNumberInfinispan           LONG                LARGE         1  avgt   20  18.368 ± 1.396  ns/op

NumericParserBenchmark.parseNumberBranchless           LONG               MEDIUM         1  avgt   20   4.283 ± 0.129  ns/op
NumericParserBenchmark.parseNumberInfinispan           LONG               MEDIUM         1  avgt   20   6.546 ± 0.623  ns/op

NumericParserBenchmark.parseNumberBranchless           LONG               MEDIUM       128  avgt   20   6.917 ± 0.386  ns/op
NumericParserBenchmark.parseNumberInfinispan           LONG               MEDIUM       128  avgt   20  10.355 ± 1.676  ns/op

NumericParserBenchmark.parseNumberBranchless           LONG               MEDIUM    128000  avgt   20  14.167 ± 0.840  ns/op
NumericParserBenchmark.parseNumberInfinispan           LONG               MEDIUM    128000  avgt   20  18.337 ± 1.120  ns/op

NumericParserBenchmark.parseNumberBranchless           LONG                  ALL         1  avgt   20   6.648 ± 0.169  ns/op
NumericParserBenchmark.parseNumberInfinispan           LONG                  ALL         1  avgt   20  16.557 ± 2.431  ns/op

NumericParserBenchmark.parseNumberBranchless           LONG                  ALL       128  avgt   20   6.811 ± 0.096  ns/op
NumericParserBenchmark.parseNumberInfinispan           LONG                  ALL       128  avgt   20  12.062 ± 1.354  ns/op

NumericParserBenchmark.parseNumberBranchless           LONG                  ALL    128000  avgt   20  14.776 ± 0.908  ns/op
NumericParserBenchmark.parseNumberInfinispan           LONG                  ALL    128000  avgt   20  20.272 ± 1.042  ns/op

It seems to win a few ns on all of the tests. Not sure if there is a way we can add more optimizations 🤔

franz1981 · 2024-08-30T15:12:54Z

eheh probably not, and although it seems like just a few ns, if you look at the ratio (or the throughput), it is a HUGE improvement, no?

well done @jabolina, I'm happy if someone uses it!
In short, it is better in any case, let's say
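
To make the ratio concrete, here is a small sketch using the LONG / LARGE / 1 row from the results above (6.225 ns/op for the branchless parser against 18.368 ns/op for the Infinispan one). The class and method names are hypothetical, and the error margins are ignored.

```java
final class SpeedupRatio {
    // Speedup of the candidate over the baseline, from average time per operation.
    public static double speedup(double baselineNsPerOp, double candidateNsPerOp) {
        return baselineNsPerOp / candidateNsPerOp;
    }

    // Converts an average time per operation (ns/op) into operations per second.
    public static double opsPerSecond(double nsPerOp) {
        return 1_000_000_000.0 / nsPerOp;
    }

    public static void main(String[] args) {
        double infinispan = 18.368; // ns/op, LONG / LARGE / 1
        double branchless = 6.225;  // ns/op, LONG / LARGE / 1
        System.out.printf("speedup: %.2fx%n", speedup(infinispan, branchless));
        System.out.printf("throughput: %.1f vs %.1f million ops/s%n",
                opsPerSecond(branchless) / 1e6, opsPerSecond(infinispan) / 1e6);
    }
}
```

For that row, the difference of about 12 ns corresponds to roughly a 2.95x speedup, or about 161 million against 54 million operations per second.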

franz1981 · 2024-08-30T15:26:14Z

hotrod-client-decoder/src/main/java/org/infinispan/hotrod/numeric/BranchlessParser.java

+
+      // Now we isolate the bits in sequence. We check 14 bits at a time.
+      // The intervals are 0-14 bits, 16-30 (and shift 2), 32-46 (and shift 2 + 2), 48-62 (and shift 2 + 2 + 2).
+      return (continuation & 0x3FFF) |


You can use Long::compress if you are on a recent enough JDK (it is available since 19), with a mask that isolates the bits you need "to compress";
see https://docs.oracle.com/en/java/javase/20/docs/api/java.base/java/lang/Long.html#compress(long,long)

jabolina · 2024-09-03

Thanks! We're on 17, but I'll add it as a comment to the ISPN code.
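
For readers following along, a rough sketch of the bit compaction being discussed, written so that it runs on JDK 17. It assumes the varint bytes were loaded little-endian into a `long` and that bytes past the end of the varint are already zeroed; the class and method names are hypothetical and this is not the code from `BranchlessParser`. On JDK 19+, `Long.compress(raw, 0x7F7F7F7F7F7F7F7FL)` should give the same result as both methods below.

```java
final class VarintCompaction {
    // Shift-and-mask version: first merge pairs of 7-bit groups into 14-bit groups,
    // then close the 2-bit gaps (shift 2, 2 + 2, 2 + 2 + 2).
    public static long compactShifts(long raw) {
        long pairs = (raw & 0x007F007F007F007FL) | ((raw & 0x7F007F007F007F00L) >>> 1);
        return (pairs & 0x3FFFL)
                | ((pairs & 0x3FFF0000L) >>> 2)
                | ((pairs & 0x3FFF00000000L) >>> 4)
                | ((pairs & 0x3FFF000000000000L) >>> 6);
    }

    // Reference version: gather the low 7 bits of each byte, one byte at a time.
    public static long compactLoop(long raw) {
        long result = 0;
        for (int i = 0; i < 8; i++) {
            result |= ((raw >>> (8 * i)) & 0x7FL) << (7 * i);
        }
        return result;
    }

    // Test helper: encodes a value below 2^56 as a varint packed little-endian into a long.
    public static long encodeToLong(long value) {
        long raw = 0;
        int shift = 0;
        while ((value & ~0x7FL) != 0) {
            raw |= ((value & 0x7F) | 0x80) << shift; // payload plus continuation bit
            value >>>= 7;
            shift += 8;
        }
        return raw | (value << shift);               // last byte, continuation bit clear
    }

    public static void main(String[] args) {
        long raw = encodeToLong(300);
        System.out.println(compactShifts(raw) == compactLoop(raw));
    }
}
```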

jabolina · 2024-09-03T13:34:13Z

While integrating it into ISPN, I noticed I had gotten some of the bit calculations wrong. I've fixed everything, and it now works with ISPN and continues to perform better.

Benchmark                                     (elementType)  (inputDistribution)  (inputs)  Mode  Cnt   Score   Error  Units
NumericParserBenchmark.parseNumberBranchless            INT                SMALL         1  avgt   20   4.418 ± 0.177  ns/op
NumericParserBenchmark.parseNumberInfinispan            INT                SMALL         1  avgt   20   4.619 ± 0.163  ns/op

NumericParserBenchmark.parseNumberBranchless            INT                SMALL       128  avgt   20   3.897 ± 0.042  ns/op
NumericParserBenchmark.parseNumberInfinispan            INT                SMALL       128  avgt   20   4.045 ± 0.057  ns/op

NumericParserBenchmark.parseNumberBranchless            INT                SMALL    128000  avgt   20   8.441 ± 0.111  ns/op
NumericParserBenchmark.parseNumberInfinispan            INT                SMALL    128000  avgt   20   8.396 ± 0.241  ns/op

NumericParserBenchmark.parseNumberBranchless            INT                LARGE         1  avgt   20   6.751 ± 0.199  ns/op
NumericParserBenchmark.parseNumberInfinispan            INT                LARGE         1  avgt   20  10.472 ± 1.770  ns/op

NumericParserBenchmark.parseNumberBranchless            INT               MEDIUM         1  avgt   20   3.507 ± 0.046  ns/op
NumericParserBenchmark.parseNumberInfinispan            INT               MEDIUM         1  avgt   20   3.657 ± 0.334  ns/op

NumericParserBenchmark.parseNumberBranchless            INT               MEDIUM       128  avgt   20   4.501 ± 0.117  ns/op
NumericParserBenchmark.parseNumberInfinispan            INT               MEDIUM       128  avgt   20   5.597 ± 0.399  ns/op

NumericParserBenchmark.parseNumberBranchless            INT               MEDIUM    128000  avgt   20  10.232 ± 0.269  ns/op
NumericParserBenchmark.parseNumberInfinispan            INT               MEDIUM    128000  avgt   20  13.567 ± 1.036  ns/op

NumericParserBenchmark.parseNumberBranchless            INT                  ALL         1  avgt   20   3.638 ± 0.067  ns/op
NumericParserBenchmark.parseNumberInfinispan            INT                  ALL         1  avgt   20   3.974 ± 0.267  ns/op

NumericParserBenchmark.parseNumberBranchless            INT                  ALL       128  avgt   20   5.899 ± 0.260  ns/op
NumericParserBenchmark.parseNumberInfinispan            INT                  ALL       128  avgt   20   7.243 ± 0.854  ns/op

NumericParserBenchmark.parseNumberBranchless            INT                  ALL    128000  avgt   20  14.735 ± 0.504  ns/op
NumericParserBenchmark.parseNumberInfinispan            INT                  ALL    128000  avgt   20  15.390 ± 1.167  ns/op

NumericParserBenchmark.parseNumberBranchless           LONG                SMALL         1  avgt   20   5.032 ± 0.063  ns/op
NumericParserBenchmark.parseNumberInfinispan           LONG                SMALL         1  avgt   20   6.954 ± 0.612  ns/op

NumericParserBenchmark.parseNumberBranchless           LONG                SMALL       128  avgt   20   5.118 ± 0.033  ns/op
NumericParserBenchmark.parseNumberInfinispan           LONG                SMALL       128  avgt   20   6.322 ± 0.467  ns/op

NumericParserBenchmark.parseNumberBranchless           LONG                SMALL    128000  avgt   20  11.584 ± 0.167  ns/op
NumericParserBenchmark.parseNumberInfinispan           LONG                SMALL    128000  avgt   20  14.818 ± 0.995  ns/op

NumericParserBenchmark.parseNumberBranchless           LONG                LARGE         1  avgt   20   7.519 ± 0.242  ns/op
NumericParserBenchmark.parseNumberInfinispan           LONG                LARGE         1  avgt   20  18.525 ± 2.720  ns/op

NumericParserBenchmark.parseNumberBranchless           LONG               MEDIUM         1  avgt   20   4.226 ± 0.076  ns/op
NumericParserBenchmark.parseNumberInfinispan           LONG               MEDIUM         1  avgt   20   6.359 ± 0.639  ns/op

NumericParserBenchmark.parseNumberBranchless           LONG               MEDIUM       128  avgt   20   6.510 ± 0.296  ns/op
NumericParserBenchmark.parseNumberInfinispan           LONG               MEDIUM       128  avgt   20   9.137 ± 0.793  ns/op

NumericParserBenchmark.parseNumberBranchless           LONG               MEDIUM    128000  avgt   20  13.374 ± 0.548  ns/op
NumericParserBenchmark.parseNumberInfinispan           LONG               MEDIUM    128000  avgt   20  16.068 ± 1.016  ns/op

NumericParserBenchmark.parseNumberBranchless           LONG                  ALL         1  avgt   20   6.496 ± 0.061  ns/op
NumericParserBenchmark.parseNumberInfinispan           LONG                  ALL         1  avgt   20  12.838 ± 1.337  ns/op

NumericParserBenchmark.parseNumberBranchless           LONG                  ALL       128  avgt   20   6.824 ± 0.151  ns/op
NumericParserBenchmark.parseNumberInfinispan           LONG                  ALL       128  avgt   20   9.497 ± 0.915  ns/op

NumericParserBenchmark.parseNumberBranchless           LONG                  ALL    128000  avgt   20  13.800 ± 0.346  ns/op
NumericParserBenchmark.parseNumberInfinispan           LONG                  ALL    128000  avgt   20  16.096 ± 0.666  ns/op

Larger values are the ones with the most improvement. Medium values improve by a few ns. Smaller numbers perform slightly better, by a few hundred picoseconds at most (for INT SMALL with 128000 inputs the two are within the error margin). That still means an improvement overall. I'll be opening the PR to ISPN right after some cleanup.

jabolina · 2024-09-03T13:40:18Z

@franz1981, you might notice that the method to read smaller values (24 bits) differs from the one used by Netty. On ISPN, we have some tests that slowly replay a buffer to check that we're not consuming more bytes than we should. Maybe this is not an issue for Netty. The code below reproduces it: it simulates a buffer that has received only the first 3 bytes of a larger integer.

    import io.netty.buffer.ByteBuf;
    import io.netty.buffer.Unpooled;

    public class PartialVarintRepro {
        // Run with -ea. With the Netty-derived method below, the first assertion fails:
        // all three bytes have the continuation bit set, yet they are consumed and a value is returned.
        public static void main(String[] args) {
            // Integer.MAX_VALUE as vint: {-43, -1, -1, -1, 7, 0};
            ByteBuf buf = Unpooled.buffer(6);
            buf.writeByte(-43);
            buf.writeByte(-1);
            buf.writeByte(-1);

            // Expected: no data read and no data consumed.
            assert 0 == readRawVarint24(buf.resetReaderIndex());
            assert 0 == buf.readerIndex();
        }

        private static int readRawVarint24(ByteBuf buffer) {
            // From Netty.
            if (!buffer.isReadable()) {
                return 0;
            }
            buffer.markReaderIndex();

            byte tmp = buffer.readByte();
            if (tmp >= 0) {
                return tmp;
            }
            int result = tmp & 127;
            if (!buffer.isReadable()) {
                buffer.resetReaderIndex();
                return 0;
            }
            if ((tmp = buffer.readByte()) >= 0) {
                return result | tmp << 7;
            }
            result |= (tmp & 127) << 7;
            if (!buffer.isReadable()) {
                buffer.resetReaderIndex();
                return 0;
            }
            if ((tmp = buffer.readByte()) >= 0) {
                return result | tmp << 14;
            }
            // The third byte still has the continuation bit set, but it is consumed anyway.
            return result | (tmp & 127) << 14;
        }
    }

franz1981 · 2024-09-03T13:50:10Z

In Netty I've avoided this (and the mark reader) by checking beforehand whether there are enough readable bytes in the buffer and by reading bytes at an offset, which doesn't change the reader index.
This way you can decide how far to move the reader index based on the branchless outcome (which is the "skipBytes" part in the Netty PR).
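
A rough illustration of that peek-then-skip idea, using `java.nio.ByteBuffer` as a stand-in for Netty's `ByteBuf` so it runs without dependencies. Absolute `get(index)` plays the role of the offset-based reads and `position(...)` plays the role of `skipBytes`. The class name, the `INCOMPLETE` sentinel, and the 3-byte limit are assumptions made for this sketch, and the loop is deliberately branchy: it only shows when the position moves, not the branchless decoding itself.

```java
import java.nio.ByteBuffer;

final class PeekingVarintReader {
    // Hypothetical sentinel for "no complete varint available"; valid results are never negative.
    public static final int INCOMPLETE = -1;

    // Decodes a varint of at most 3 bytes (21 payload bits) starting at buf.position().
    // Bytes are inspected with absolute get(index), which leaves the position untouched;
    // the position advances only once a terminating byte has been seen.
    public static int readVarint21(ByteBuffer buf) {
        int start = buf.position();
        int available = Math.min(buf.remaining(), 3);
        int result = 0;
        for (int i = 0; i < available; i++) {
            byte b = buf.get(start + i);           // peek, position unchanged
            result |= (b & 0x7F) << (7 * i);
            if (b >= 0) {                          // continuation bit clear: varint is complete
                buf.position(start + i + 1);       // consume exactly the bytes that were decoded
                return result;
            }
        }
        // No terminating byte among the inspected bytes: either more data is needed
        // or the value is wider than this reader handles. Nothing is consumed.
        return INCOMPLETE;
    }

    public static void main(String[] args) {
        ByteBuffer partial = ByteBuffer.wrap(new byte[] {-43, -1, -1});
        System.out.println(readVarint21(partial) == INCOMPLETE && partial.position() == 0);
    }
}
```

With the three bytes from the reproduction above, nothing is consumed, which is the behavior the ISPN replay tests expect.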

rigazilla reviewed Jun 18, 2024

franz1981 reviewed Jun 19, 2024

franz1981 reviewed Jun 19, 2024

franz1981 reviewed Jun 19, 2024

franz1981 reviewed Aug 30, 2024

Benchmark branchless numeric parser

ba87186

jabolina force-pushed the branchless-numeric branch from 5328b61 to ba87186 Compare September 3, 2024 13:30

jabolina mentioned this pull request Sep 4, 2024

ISPN-16487 Branchless varint decoding for Hot Rod infinispan/infinispan#12824

Closed
