I/O Stack Simplification and Optimization #430
Merged
Conversation
jbaiter force-pushed the optimized-string-allocations branch from c06d9df to 770e1a8 (May 24, 2024 06:58)
If I/O is not an issue (e.g. because the data completely resides in the page cache), then suddenly our generously sized 64KiB String buffer becomes a problem due to the amount of memory copying that goes on when constructing a String in Java (a huge drawback of immutable strings!). To optimize this, the buffer size was made dynamic, based on the block type that is looked for, and users are offered customization options to adapt the sizes to their data.
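The block-type-dependent sizing described above could be sketched roughly like this; note that the block type names and byte values below are hypothetical illustrations for the idea, not the plugin's actual types or defaults:

```java
/**
 * Hypothetical sketch: smaller OCR block types need smaller read buffers,
 * so the read size is derived from the block type being looked for.
 * Names and sizes are illustrative, not the plugin's real values.
 */
enum BlockType {
  WORD(512),            // a single word rarely spans much data
  LINE(2 * 1024),       // a line of OCR markup
  PARAGRAPH(8 * 1024),  // a block/paragraph element
  PAGE(64 * 1024);      // a whole page, the old one-size-fits-all value

  final int readSizeBytes;

  BlockType(int readSizeBytes) {
    this.readSizeBytes = readSizeBytes;
  }
}
```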
Profiling revealed that we were significantly bottlenecked by String construction during passage building. Additionally, most of those Strings were constructed from the same sections in the source files. This change set refactors the code to always read from disk in aligned chunks, and then cache those chunks for later use. This way we only ever read and construct a String once for a given chunk and then reuse that String.
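A minimal sketch of the aligned-chunk-plus-cache idea, assuming a simple LRU policy (the class name, constants, and cache strategy here are invented for illustration; the actual logic lives in this PR's reader classes):

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

/**
 * Illustrative sketch, not the plugin's actual code: reads are aligned to
 * fixed-size sections, and decoded sections are kept in a small LRU cache,
 * so each section is read and converted to a String at most once.
 */
public class SectionCacheSketch {
  static final int SECTION_SIZE = 8 * 1024; // hypothetical section size
  static final int MAX_CACHED_SECTIONS = 4; // hypothetical cache bound

  // Access-ordered LinkedHashMap acts as a tiny LRU cache keyed by section index.
  private final Map<Integer, String> cache =
      new LinkedHashMap<Integer, String>(16, 0.75f, true) {
        @Override
        protected boolean removeEldestEntry(Map.Entry<Integer, String> eldest) {
          return size() > MAX_CACHED_SECTIONS;
        }
      };

  private final byte[] source; // stands in for the file on disk
  int rawReads = 0;            // counts how often we hit the "disk"

  public SectionCacheSketch(byte[] source) {
    this.source = source;
  }

  /** Returns the section containing the given offset, reading it at most once. */
  public String getSection(int offset) {
    int sectionIdx = offset / SECTION_SIZE; // align to a section boundary
    return cache.computeIfAbsent(sectionIdx, idx -> {
      rawReads++;
      int start = idx * SECTION_SIZE;
      int len = Math.min(SECTION_SIZE, source.length - start);
      return new String(source, start, len, StandardCharsets.UTF_8);
    });
  }
}
```

Two lookups that land in the same aligned section share a single decoded String instead of each allocating their own copy.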
New configuration attributes on `OcrHighlightComponent`:
- `sectionReadSizeKiB`: Size of sections to read from inputs
- `maxSectionCacheSizeKiB`: Maximum size of cached sections, should be a multiple of `sectionReadSizeKiB`
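In a Solr setup, these might be set on the component definition in `solrconfig.xml`. A hedged sketch: only the two attribute names come from this change set; the component name, class path, and values below are placeholders for your actual configuration:

```xml
<!-- Illustrative only: the attribute names are from this PR; the
     name/class and the chosen values are placeholders. -->
<searchComponent name="ocrHighlight"
                 class="solrocr.OcrHighlightComponent"
                 sectionReadSizeKiB="8"
                 maxSectionCacheSizeKiB="64"/>
```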
Gone are `IterableCharSequence` and its implementations; we now have a largely stateless `SourceReader` interface that takes care of reading data from various sources. Accompanying it is a `BaseSourceReader` base class that takes care of caching and allowing sectioned access. Currently only file-system based implementations are included (for single and multiple files), but based on this API, adding support for other storage backends (S3, anyone?) should be as simple as implementing a single method: `int readBytes(byte[] dst, int dstOffset, int start, int len)`.
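To illustrate how small the backend surface is, here is a hedged sketch of what a trivial in-memory backend could look like. The `readBytes` signature is quoted from the commit message above, but the interface and class names (`SketchSourceReader`, `InMemorySourceReader`) are invented for this example and are not the plugin's actual classes:

```java
import java.io.IOException;

/** Hypothetical stand-in for the single method a new backend must provide. */
interface SketchSourceReader {
  /** Read up to {@code len} bytes starting at {@code start} into {@code dst}. */
  int readBytes(byte[] dst, int dstOffset, int start, int len) throws IOException;
}

/** A complete (toy) backend: one positional read method is all it takes. */
class InMemorySourceReader implements SketchSourceReader {
  private final byte[] data;

  InMemorySourceReader(byte[] data) {
    this.data = data;
  }

  @Override
  public int readBytes(byte[] dst, int dstOffset, int start, int len) {
    if (start >= data.length) {
      return -1; // past the end of the input
    }
    int n = Math.min(len, data.length - start);
    System.arraycopy(data, start, dst, dstOffset, n);
    return n;
  }
}
```

An S3 backend would look the same in spirit, with the body replaced by a ranged object request.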
They're not thrown with the current mmap-based filesystem implementations, but other implementations might have operations that can cause IOExceptions, so we add those to the API.
Maybe useful for performing regression tests between changes.
Turns out that not copying the data from the page cache ourselves, and letting the kernel handle it outside of the JVM, nets us a very decent performance improvement in multithreaded benchmarks (up to 40%). There is some negligible slowdown without multithreading, so we just wholesale switch the whole I/O stack over to `FileChannel`-based reading, away from `MappedByteBuffer`.
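The switch boils down to positional `FileChannel.read(ByteBuffer, long)` calls instead of copying bytes out of a `MappedByteBuffer`. A minimal sketch (the helper class and method names are ours, not the plugin's):

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;

/** Illustrative helper showing positional FileChannel reads. */
public class PositionalReadSketch {
  /** Read up to {@code len} bytes from absolute offset {@code start} into {@code dst}. */
  static int readAt(FileChannel ch, byte[] dst, int dstOffset, long start, int len)
      throws IOException {
    ByteBuffer buf = ByteBuffer.wrap(dst, dstOffset, len);
    int total = 0;
    while (buf.hasRemaining()) {
      // Positional read: the OS fills the buffer directly and the channel's
      // shared position is untouched, so concurrent threads don't contend
      // on it (and no mmap means no JVM crash on I/O errors like a
      // disappearing mount).
      int n = ch.read(buf, start + total);
      if (n < 0) {
        break; // hit end of file
      }
      total += n;
    }
    return total;
  }
}
```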
jbaiter force-pushed the optimized-string-allocations branch from 770e1a8 to 3778af1 (May 24, 2024 06:58)
schmika reviewed May 24, 2024 (src/main/java/com/github/dbmdz/solrocr/reader/BaseSourceReader.java)
schmika reviewed May 24, 2024 (src/main/java/com/github/dbmdz/solrocr/reader/BaseSourceReader.java)
Co-authored-by: schmika <[email protected]>
schmika reviewed May 24, 2024
schmika reviewed May 24, 2024
jbaiter force-pushed the optimized-string-allocations branch from cecbd0c to 7deeb0a (May 24, 2024 13:34)
schmika approved these changes (May 27, 2024)
The I/O stack in the plugin was previously very close to the way the `UnifiedHighlighter` in Solr worked. However, this was not really a good match, since the interfaces involved were highly stateful and had a very large surface area, which made implementations brittle and overly complicated. Additionally, the way we performed I/O incurred a lot of String construction, which proved to be a significant hotspot when profiling the plugin. This change set refactors the I/O stack to be much simpler, and performance has been improved a lot by significantly reducing the number of String allocations, which also led to a decrease in the number of filesystem reads.

Additionally, we no longer memory-map files; instead we simply use the regular `java.nio.FileChannel` API. This saves us one data copy on the JVM side, as we let the OS fill the buffer for us. This nets us an additional performance improvement for multithreaded highlighting, while being roughly on par with mmapped I/O in the single-threaded version. An additional benefit is that we now no longer run the risk of crashing the JVM when an I/O error like a disappearing mount occurs 😅

The combination of these two changes makes the plugin a whole lot more ✨performant✨, with response times now all below the latency threshold for "sluggishness" (the orange line):

As for the API simplifications, it boils down to the following:

- `IterableCharSequence` and its implementations are gone
- New `SourceReader` interface with a base class `BaseSourceReader`, for which implementations only have to provide an `int read(byte[] dst, int dstOffset, int start, int len)` implementation

These changes should make it significantly easier to add support for new I/O backends, most importantly S3 (see #49).