
Improve the performance of parsing BytesRef to Number #11324

Open
ketanv3 opened this issue Nov 24, 2023 · 12 comments
Labels
enhancement · Indexing · Performance

Comments

@ketanv3
Contributor

ketanv3 commented Nov 24, 2023

Is your feature request related to a problem? Please describe.
The NumberFieldMapper is used to parse an input value into one of the numeric field types. One of the possible input value types is a BytesRef object, which is first transformed into an intermediate UTF-8 string and then parsed into its numeric value.

if (value instanceof BytesRef) {
    value = ((BytesRef) value).utf8ToString(); // allocates a char[] and a String
}
result = Float.parseFloat(value.toString());

This intermediate conversion allocates a short-lived char array and a String, which may be excessive. By avoiding it, we can reduce latency, the number of allocations, and GC pressure.

Describe the solution you'd like
Create a number parser which operates directly on the underlying bytes, offset, and length of the BytesRef object.
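For illustration, a minimal sketch of what such a parser could look like (digits only; sign handling, overflow checks, and the method name are assumptions, not the final design):

import org.apache.lucene.util.BytesRef;

// Parses an unsigned decimal value directly from the BytesRef's backing array,
// with no intermediate String or char[] allocation.
static long parse(BytesRef ref) {
    long result = 0;
    for (int i = ref.offset; i < ref.offset + ref.length; i++) {
        int digit = ref.bytes[i] - '0';
        if (digit < 0 || digit > 9) {
            throw new NumberFormatException("unexpected byte at offset " + i);
        }
        result = result * 10 + digit;
    }
    return result;
}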

@ketanv3 added the enhancement and untriaged labels Nov 24, 2023
@ketanv3
Contributor Author

ketanv3 commented Nov 24, 2023

I wrote a quick and dirty POC for zero-copy parsing of BytesRef to Long. It uses a push parser that reads characters from a UTF-8 encoded stream of bytes and performs state transitions on a greatly simplified finite state machine. Once the entire input has been read, the final result is accessed via the parser.

Code for reference: ketanv3@b288085
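In spirit (this is a simplified sketch, not the POC itself), the push parser looks roughly like this:

// The caller feeds one decoded character at a time and reads the result once
// the input is exhausted. States, error handling, and overflow checks are
// pared down; all names here are illustrative.
final class LongPushParser {
    private enum State { START, SIGN_SEEN, DIGITS }

    private State state = State.START;
    private boolean negative = false;
    private long value = 0;

    void accept(char c) {
        if (state == State.START && (c == '+' || c == '-')) {
            negative = (c == '-');
            state = State.SIGN_SEEN;
            return;
        }
        if (c < '0' || c > '9') {
            throw new NumberFormatException("unexpected character: " + c);
        }
        value = value * 10 + (c - '0');
        state = State.DIGITS;
    }

    long result() {
        if (state != State.DIGITS) {
            throw new NumberFormatException("no digits seen");
        }
        return negative ? -value : value;
    }
}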

I'm sure there's room for further improvement, but early results show about 8.5% higher throughput and 75% reduction in allocations!

Benchmark                                                    Mode  Cnt         Score   Error   Units
ParserBenchmark.baseline                                    thrpt       19601195.344           ops/s
ParserBenchmark.baseline:·gc.alloc.rate                     thrpt           1630.561          MB/sec
ParserBenchmark.baseline:·gc.alloc.rate.norm                thrpt             96.014            B/op
ParserBenchmark.baseline:·gc.churn.G1_Eden_Space            thrpt           1633.981          MB/sec
ParserBenchmark.baseline:·gc.churn.G1_Eden_Space.norm       thrpt             96.216            B/op
ParserBenchmark.baseline:·gc.churn.G1_Survivor_Space        thrpt              0.004          MB/sec
ParserBenchmark.baseline:·gc.churn.G1_Survivor_Space.norm   thrpt             ≈ 10⁻⁴            B/op
ParserBenchmark.baseline:·gc.count                          thrpt             60.000          counts
ParserBenchmark.baseline:·gc.time                           thrpt             38.000              ms
ParserBenchmark.candidate                                   thrpt       21272428.895           ops/s
ParserBenchmark.candidate:·gc.alloc.rate                    thrpt            442.671          MB/sec
ParserBenchmark.candidate:·gc.alloc.rate.norm               thrpt             24.004            B/op
ParserBenchmark.candidate:·gc.churn.G1_Eden_Space           thrpt            463.261          MB/sec
ParserBenchmark.candidate:·gc.churn.G1_Eden_Space.norm      thrpt             25.121            B/op
ParserBenchmark.candidate:·gc.churn.G1_Survivor_Space       thrpt              0.003          MB/sec
ParserBenchmark.candidate:·gc.churn.G1_Survivor_Space.norm  thrpt             ≈ 10⁻⁴            B/op
ParserBenchmark.candidate:·gc.count                         thrpt             17.000          counts
ParserBenchmark.candidate:·gc.time                          thrpt             12.000              ms

@backslasht
Contributor

This looks great; I especially love the reduction in allocations.

@ketanv3
Contributor Author

ketanv3 commented Nov 26, 2023

I found that the calls between the decoding and the parsing logic weren't being inlined. I wrote a (dirty) inlined version with a few more changes:

  • Since we're decoding numbers, the only allowed characters are digits, plus/minus, and the decimal point. These all fall within the 7-bit ASCII range, where the UTF-8 encoding is byte-for-byte identical to ASCII, so we need not perform full UTF-8 decoding (see the example below).
  • Manually inlined the decoding and the parsing logic.
  • Returned a primitive long.

Code for reference: ketanv3@b744c94
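To illustrate the ASCII shortcut (a hand-rolled example, not code from the commit):

import java.nio.charset.StandardCharsets;

// Every character a number can contain is 7-bit ASCII, so UTF-8 encodes each
// one as a single, identical byte; there are no multi-byte sequences to decode.
byte[] bytes = "-42.5".getBytes(StandardCharsets.UTF_8);
// bytes = { 0x2D, 0x34, 0x32, 0x2E, 0x35 }, i.e. '-', '4', '2', '.', '5'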

Throughput increased by 98% and allocations reduced by 100% (zero allocations).

Benchmark                                                   Mode  Cnt         Score   Error   Units
ParserBenchmark.baseline                                   thrpt       19802045.680           ops/s
ParserBenchmark.baseline:·gc.alloc.rate                    thrpt           1647.966          MB/sec
ParserBenchmark.baseline:·gc.alloc.rate.norm               thrpt             96.014            B/op
ParserBenchmark.baseline:·gc.churn.G1_Eden_Space           thrpt           1634.898          MB/sec
ParserBenchmark.baseline:·gc.churn.G1_Eden_Space.norm      thrpt             95.253            B/op
ParserBenchmark.baseline:·gc.churn.G1_Survivor_Space       thrpt              0.008          MB/sec
ParserBenchmark.baseline:·gc.churn.G1_Survivor_Space.norm  thrpt             ≈ 10⁻³            B/op
ParserBenchmark.baseline:·gc.count                         thrpt             60.000          counts
ParserBenchmark.baseline:·gc.time                          thrpt             38.000              ms
ParserBenchmark.candidate                                  thrpt       39169294.537           ops/s
ParserBenchmark.candidate:·gc.alloc.rate                   thrpt             ≈ 10⁻⁴          MB/sec
ParserBenchmark.candidate:·gc.alloc.rate.norm              thrpt             ≈ 10⁻⁶            B/op
ParserBenchmark.candidate:·gc.count                        thrpt                ≈ 0          counts

@jjoung1128

Hi @ketanv3, thank you for initiating this discussion and suggesting ways to enhance the NumberFieldMapper's performance. The idea of reducing allocations to improve throughput is particularly compelling.

I just wanted to check whether ByteBuffer was considered as an option; it shows a 20x increase in throughput with zero allocations. I observed this when I integrated ByteBuffer into the candidate parsing method of your POC code.

Benchmark                                               Mode  Cnt          Score   Error   Units
ParserBenchmark.baseline                               thrpt        14246887.876           ops/s
ParserBenchmark.baseline:·gc.alloc.rate                thrpt            1184.545          MB/sec
ParserBenchmark.baseline:·gc.alloc.rate.norm           thrpt              96.000            B/op
ParserBenchmark.baseline:·gc.churn.G1_Eden_Space       thrpt            1160.674          MB/sec
ParserBenchmark.baseline:·gc.churn.G1_Eden_Space.norm  thrpt              94.065            B/op
ParserBenchmark.baseline:·gc.churn.G1_Old_Gen          thrpt               0.001          MB/sec
ParserBenchmark.baseline:·gc.churn.G1_Old_Gen.norm     thrpt              ≈ 10⁻⁴            B/op
ParserBenchmark.baseline:·gc.count                     thrpt              35.000          counts
ParserBenchmark.baseline:·gc.time                      thrpt              10.000              ms
ParserBenchmark.candidate                              thrpt       312958576.435           ops/s
ParserBenchmark.candidate:·gc.alloc.rate               thrpt              ≈ 10⁻⁴          MB/sec
ParserBenchmark.candidate:·gc.alloc.rate.norm          thrpt              ≈ 10⁻⁷            B/op
ParserBenchmark.candidate:·gc.count                    thrpt                 ≈ 0          counts

Code for reference: jjoung1128@7c95945cc62

@ketanv3
Contributor Author

ketanv3 commented Nov 26, 2023

Hi @jjoung1128, thanks for your experiment!

Please note that ByteBuffer::getLong doesn't actually parse the string representation of a number. It simply reads the next 8 bytes and interprets those 64 bits as a long (in either little- or big-endian order).

Here's an example:
The UTF-8 encoded byte representation of "12345678" is { 0x31, 0x32, 0x33, 0x34, 0x35, 0x36, 0x37, 0x38 }. Reading it with ByteBuffer::getLong in big-endian order interprets those bytes as 0x3132333435363738L, which is 3544952156018063160 in decimal rather than 12345678.

byte[] utf8 = "12345678".getBytes(StandardCharsets.UTF_8);
ByteBuffer.wrap(utf8).getLong(); // 3544952156018063160, not 12345678

@jjoung1128

I see. Thanks for the detailed explanation with the example!

@ketanv3
Contributor Author

ketanv3 commented Dec 1, 2023

I came across a technique called "SIMD within a register" (SWAR), which uses general-purpose registers and software tricks to parallelize computation. Using it, I was able to load a chunk of 8 digits (i.e., 8 bytes) into a single 64-bit long and convert it to its numeric value in a single shot.

The results are really impressive! They may not translate into real-world gains, but I couldn't stop myself from sharing them.
Throughput increased by 306% and allocations reduced by 100% (zero allocations).

Benchmark                                                   Mode  Cnt         Score   Error   Units
ParserBenchmark.baseline                                   thrpt       21314308.267           ops/s
ParserBenchmark.baseline:·gc.alloc.rate                    thrpt           1773.225          MB/sec
ParserBenchmark.baseline:·gc.alloc.rate.norm               thrpt             96.012            B/op
ParserBenchmark.baseline:·gc.churn.G1_Eden_Space           thrpt           1771.155          MB/sec
ParserBenchmark.baseline:·gc.churn.G1_Eden_Space.norm      thrpt             95.900            B/op
ParserBenchmark.baseline:·gc.churn.G1_Survivor_Space       thrpt              0.010          MB/sec
ParserBenchmark.baseline:·gc.churn.G1_Survivor_Space.norm  thrpt              0.001            B/op
ParserBenchmark.baseline:·gc.count                         thrpt             53.000          counts
ParserBenchmark.baseline:·gc.time                          thrpt             49.000              ms
ParserBenchmark.candidate                                  thrpt       86503495.912           ops/s
ParserBenchmark.candidate:·gc.alloc.rate                   thrpt             ≈ 10⁻⁴          MB/sec
ParserBenchmark.candidate:·gc.alloc.rate.norm              thrpt             ≈ 10⁻⁶            B/op
ParserBenchmark.candidate:·gc.count                        thrpt                ≈ 0          counts

Code for reference: ketanv3@26990b0

Note: This code isn't production-ready as it doesn't handle underflows and overflows. Luckily for us, these can only happen when the number is 19 digits long and the most significant digit is 9. For most practical purposes, such large numbers are rare, so it should be fine to fall back to a slower parser when one occurs.

The code to convert 8 digits (8 bytes in little-endian order, represented as a 64-bit long) to their decimal value is inspired by the work of Daniel Lemire and Michael Eisel (blog #1, #2) (code).
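For illustration, here is a Java transcription of the published 8-digit SWAR step (my own transcription, not code from the commit; it expects all 8 bytes to be ASCII digits, loaded little-endian so the first digit sits in the lowest byte):

import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

// Converts 8 ASCII digit bytes packed into one long (first digit in the lowest
// byte) to their numeric value in 0..99999999, using three multiply-and-shift
// steps instead of a per-digit loop.
static int parseEightDigits(long chunk) {
    long val = chunk & 0x0F0F0F0F0F0F0F0FL;                // ASCII '0'-'9' -> digit values 0-9
    val = ((val * 2561) >>> 8) & 0x00FF00FF00FF00FFL;      // adjacent digits -> 2-digit values
    val = ((val * 6553601L) >>> 16) & 0x0000FFFF0000FFFFL; // 2-digit pairs -> 4-digit values
    return (int) ((val * 42949672960001L) >>> 32);         // 4-digit halves -> full value
}

// Example: pack "12345678" little-endian and parse it in one shot.
long chunk = ByteBuffer.wrap("12345678".getBytes(StandardCharsets.US_ASCII))
    .order(ByteOrder.LITTLE_ENDIAN)
    .getLong();
parseEightDigits(chunk); // 12345678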

@ketanv3
Contributor Author

ketanv3 commented Dec 1, 2023

Handling underflows and overflows turned out to be simpler than I thought. Since numbers longer than 19 digits are rejected right away, we only need to deal with under and overflows for inputs of exactly 19 digits. Luckily, when one happens, the result simply wraps around and flips the sign bit. We can check for that and reject bad inputs with hardly any performance hit (since the branch almost never mispredicts).

Patch: ketanv3@1cf343f
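A minimal sketch of that check (names are illustrative; the linked patch is the authoritative version):

// 'result' is the magnitude accumulated as result = result * 10 + digit over at
// most 19 digits, so at most one wrap-around can occur, and a single wrap
// always flips the sign bit. The only accumulator value that may legitimately
// be negative is Long.MIN_VALUE, reached for the input "-9223372036854775808".
static long finish(long result, boolean negative) {
    if (result < 0 && !(negative && result == Long.MIN_VALUE)) {
        throw new NumberFormatException("value out of range for a long");
    }
    return negative ? -result : result;
}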

@backslasht
Contributor

@ketanv3 - This looks great!

I think the sign-bit flip would work for negative values, but for positive values the wrap-around could still land on a positive value; how would we be able to identify that?

@ketanv3
Contributor Author

ketanv3 commented Dec 4, 2023

@backslasht The sign bit deserves its own detailed explanation. I had an 'aha' moment while experimenting.

To get the basics out of the way - signed integers (and longs) are stored in two's complement form. The MSB not only represents the sign, but also the magnitude of the most negative value. Suppose we're working with 8-bit numbers for simplicity, then:

10000000 = smallest value = -128
01111111 = largest value = +127

10100110 = (-128 * 1) + (64 * 0) + (32 * 1) + (16 * 0) + (8 * 0) + (4 * 1) + (2 * 1) + (1 * 0) = -90

Part 1 - Under or overflows (may) flip the sign bit

  • Underflow - Say we start with -128 (the most negative value) and subtract one; the binary representation becomes 01111111, which is equal to +127.
  • Overflow - Say we start with +127 (the most positive value) and add one; the binary representation becomes 10000000, which is equal to -128.

Part 2 - Don't allow more than one under or overflow

In part 1, I only showed examples where the sign bit flipped, but this isn't always the case. If the number is large enough to cause an even number of overflows, the sign bit is preserved. For example, 381 doesn't fit in an 8-bit number, but since it overflows twice, it gets interpreted as +125. So the sign bit alone cannot reliably tell us whether an overflow happened.

But what if we only allow an overflow to happen once? If we can guarantee this, the sign bit reliably tells us whether an overflow happened. For an n-bit integer, the representable range is $[-2^{n-1}, 2^{n-1} - 1]$. To guarantee at most a single under or overflow, the true value must lie in $[-2^{n} + 1, 2^{n} - 1]$.

Here's the neat part. For a 64-bit long, the representable range is [-9223372036854775808, 9223372036854775807], so the range of true values that guarantees a single under or overflow is [-18446744073709551615, 18446744073709551615]. Those bounds are 20-digit numbers! Since all inputs with 20 or more digits are rejected outright, this theoretical range guarantees we catch every 19-digit under and overflow. Really neat!

Part 3 - Special case

Note that the implementation stores only positive values in result and uses isNegative to negate at the end if needed. But what happens when the input is "-9223372036854775808" (i.e., Long.MIN_VALUE)? Its absolute value is larger than what fits in a positive long, so result will overflow.

But result overflows to -9223372036854775808 itself, and negating this number overflows once again to -9223372036854775808, thus yielding the correct result. It keeps getting neater!
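This is easy to verify directly:

long result = Long.MIN_VALUE;           // the accumulator after the digits "9223372036854775808"
System.out.println(-result);            // -9223372036854775808: negation overflows back to itself
System.out.println(-result == result);  // true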

@backslasht
Contributor

Thanks @ketanv3 for the detailed explanation with the examples.

@ankitkala added the Indexing label and removed the Other label Dec 17, 2023
@peternied added the Performance label Feb 7, 2024
@peternied
Member

[Triage]
@ketanv3 Thanks for filing, great description of the issue.
