I/O Stack Simplification and Optimization #430

Merged
merged 17 commits into from
May 27, 2024

jbaiter
Member

@jbaiter jbaiter commented May 23, 2024

The I/O stack in the plugin was previously modeled closely on the way the UnifiedHighlighter in Solr works. However, this was not really a good match: the interfaces involved were highly stateful and had a very large surface area, which made implementations brittle and overly complicated. Additionally, the way we performed I/O incurred a lot of String construction, which proved to be a significant hotspot when profiling the plugin. This change set refactors the I/O stack to be much simpler. Performance has also improved a lot thanks to a significant reduction in the number of String allocations, which in turn led to fewer filesystem reads.

Additionally, we no longer memory-map files; instead, we simply use the regular java.nio.channels.FileChannel API. This saves us one data copy on the JVM side, since we let the OS fill the buffer for us. This nets us an additional performance improvement for multithreaded highlighting, while being roughly on par with mmapped I/O in the single-threaded version. An additional benefit is that we no longer run the risk of crashing the JVM when an I/O error like a disappearing mount occurs 😅
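
To illustrate the kind of read this paragraph describes, here is a minimal sketch of a positional read through `FileChannel` into a caller-supplied byte array. The class and method names (`PositionalRead`, `readAt`) are made up for this example and are not taken from the plugin; only `FileChannel.read(ByteBuffer, long)` and `ByteBuffer.wrap` are standard JDK calls.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical helper, not the plugin's code.
final class PositionalRead {
  /**
   * Reads up to len bytes starting at file offset start into dst, beginning
   * at dstOffset. Returns the number of bytes actually read (0 at end of file).
   */
  static int readAt(FileChannel channel, byte[] dst, int dstOffset, long start, int len)
      throws IOException {
    // Wrap the target array so the OS fills it directly, without a mapped buffer.
    ByteBuffer buf = ByteBuffer.wrap(dst, dstOffset, len);
    int total = 0;
    while (buf.hasRemaining()) {
      // Positional read: does not touch the channel's own position.
      int n = channel.read(buf, start + total);
      if (n < 0) {
        break; // end of file reached
      }
      total += n;
    }
    return total;
  }

  public static void main(String[] args) throws IOException {
    Path tmp = Files.createTempFile("positional-read", ".txt");
    Files.write(tmp, "0123456789".getBytes(StandardCharsets.US_ASCII));
    try (FileChannel channel = FileChannel.open(tmp, StandardOpenOption.READ)) {
      byte[] dst = new byte[4];
      int n = readAt(channel, dst, 0, 3L, dst.length);
      // prints "4 bytes: 3456"
      System.out.println(n + " bytes: " + new String(dst, 0, n, StandardCharsets.US_ASCII));
    } finally {
      Files.deleteIfExists(tmp);
    }
  }
}
```

Because the positional `read` overload leaves the channel's position untouched, several threads can share one channel for this kind of access. An I/O failure surfaces as an `IOException` rather than as a fault inside a mapped region.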

The combination of these two changes makes the plugin a whole lot more ✨performant✨, with response times now all below the latency threshold for "sluggishness" (the orange line):

[Image: perfplot]

As for the API simplifications, it boils down to the following:

  • IterableCharSequence and its implementations are gone
  • New interface SourceReader with a base class BaseSourceReader, for which implementations only have to provide an int read(byte[] dst, int dstOffset, int start, int len) implementation

These changes should make it significantly easier to add support for new I/O backends, most importantly S3 (see #49).
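
To give a feel for how small a backend can be under such a contract, here is a rough sketch with an in-memory byte array standing in for a remote store. `SketchSourceReader`, `InMemorySourceReader` and `readUtf8` are hypothetical names invented for this example and do not mirror the plugin's real classes. The method name `readBytes` follows the commit message (the description above spells it `read`), and the return-value semantics (bytes copied, or -1 past the end) are an assumption.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;

// Hypothetical stand-in for the reader contract; NOT the plugin's actual class.
abstract class SketchSourceReader {
  /**
   * Copies up to len bytes, starting at source offset start, into dst at
   * dstOffset. Assumed to return the number of bytes copied, or -1 when
   * start lies past the end of the source.
   */
  abstract int readBytes(byte[] dst, int dstOffset, int start, int len) throws IOException;

  /** Convenience helper built purely on top of the single primitive above. */
  String readUtf8(int start, int len) throws IOException {
    byte[] buf = new byte[len];
    int n = readBytes(buf, 0, start, len);
    return n <= 0 ? "" : new String(buf, 0, n, StandardCharsets.UTF_8);
  }
}

// A backend only has to say how to fetch a byte range.
final class InMemorySourceReader extends SketchSourceReader {
  private final byte[] data;

  InMemorySourceReader(byte[] data) {
    this.data = data;
  }

  @Override
  int readBytes(byte[] dst, int dstOffset, int start, int len) {
    if (start >= data.length) {
      return -1; // nothing left to read
    }
    int n = Math.min(len, data.length - start);
    System.arraycopy(data, start, dst, dstOffset, n);
    return n;
  }

  public static void main(String[] args) throws IOException {
    SketchSourceReader reader =
        new InMemorySourceReader("<w>hello</w>".getBytes(StandardCharsets.UTF_8));
    System.out.println(reader.readUtf8(3, 5)); // prints "hello"
  }
}
```

A backend for an object store would presumably replace the `System.arraycopy` call with a ranged request, while everything layered on top of the primitive stays unchanged.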

@jbaiter jbaiter force-pushed the optimized-string-allocations branch from c06d9df to 770e1a8 Compare May 24, 2024 06:58
jbaiter added 12 commits May 24, 2024 08:58
If I/O is not an issue (e.g. because the data completely resides in the
page cache), then suddenly our generously sized 64KiB String buffer
becomes a problem due to the amount of memory copying that goes on when
constructing a String in Java (huge drawback of immutable strings!).
To optimize this, the buffer size was made dynamic, based on the block
type that is looked for, and users are offered customization options to
adapt the sizes to their data.

Profiling revealed that we were significantly bottlenecked by String
construction during passage building. Additionally, most of those
Strings were constructed based on the same sections in the source files.

This change set refactors the code to always read from disk in aligned
chunks, and then cache those chunks for later use. This way we only
ever read and construct a String once for a given chunk and then reuse
that String.
New configuration attributes on `OcrHighlightComponent`:

- `sectionReadSizeKiB`: Size of sections to read from inputs
- `maxSectionCacheSizeKiB`: Maximum size of cached sections, should be
  a multiple of `sectionReadSizeKiB`
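
The aligned-read-and-cache idea can be sketched roughly as follows. This is an illustration under stated assumptions, not the plugin's implementation: the class `SectionCacheSketch` and its members are invented, a byte array stands in for the file, the eviction policy (least recently used) is a guess, and a single-byte encoding is used so that section boundaries never split a character. The two constructor parameters borrow the names of the configuration attributes only to show how a cache size translates into a section count.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of section-aligned reads with a bounded String cache.
final class SectionCacheSketch {
  private final byte[] source;   // stands in for the file contents
  private final int sectionSize; // section size in bytes
  private final Map<Integer, String> cache;
  int decodeCount = 0;           // how many Strings were actually constructed

  SectionCacheSketch(byte[] source, int sectionReadSizeKiB, int maxSectionCacheSizeKiB) {
    this.source = source;
    this.sectionSize = sectionReadSizeKiB * 1024;
    final int maxSections = Math.max(1, maxSectionCacheSizeKiB / sectionReadSizeKiB);
    // Access-ordered LinkedHashMap that drops the least recently used section.
    this.cache = new LinkedHashMap<Integer, String>(16, 0.75f, true) {
      @Override
      protected boolean removeEldestEntry(Map.Entry<Integer, String> eldest) {
        return size() > maxSections;
      }
    };
  }

  /** Returns the decoded section that contains the given byte offset. */
  String sectionAt(int offset) {
    if (offset < 0 || offset >= source.length) {
      throw new IndexOutOfBoundsException("offset " + offset);
    }
    int idx = offset / sectionSize; // align the request to a section boundary
    String section = cache.get(idx);
    if (section == null) {
      int start = idx * sectionSize;
      int end = Math.min(source.length, start + sectionSize);
      // The String for this section is built once and reused until evicted.
      section = new String(source, start, end - start, StandardCharsets.US_ASCII);
      decodeCount++;
      cache.put(idx, section);
    }
    return section;
  }

  public static void main(String[] args) {
    SectionCacheSketch reader = new SectionCacheSketch(new byte[3000], 1, 2);
    reader.sectionAt(10);
    reader.sectionAt(900);  // same section, served from the cache
    reader.sectionAt(1500); // second section
    System.out.println(reader.decodeCount); // prints 2
  }
}
```

The sketch is not thread-safe, and a multi-byte encoding such as UTF-8 would need extra care at section boundaries.
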

`IterableCharSequence` and its implementations are gone; we now have a
largely stateless `SourceReader` interface that takes care of reading
data from various sources. Accompanying it is a `BaseSourceReader`
base class that takes care of caching and allowing sectioned access.
Currently only file-system based implementations are included (for
single and multiple files), but based on this API adding support for
other storage backends (S3, anyone?) should be as simple as implementing
a single method `int readBytes(byte[] dst, int dstOffset, int start, int len)`.

`IOException`s are not thrown by the current mmap-based filesystem
implementations, but other implementations might have operations that
can cause them, so we add them to the API.

Maybe useful for performing regression tests between changes.

It turns out that not copying the data from the page cache ourselves, and
instead letting the kernel handle it outside of the JVM, nets us a very
decent performance improvement in multithreaded benchmarks (up to 40%).
There is some negligible slowdown without multithreading, so we just
switch the whole I/O stack over to `FileChannel`-based reading, away
from `MappedByteBuffer`.
@jbaiter jbaiter force-pushed the optimized-string-allocations branch from 770e1a8 to 3778af1 Compare May 24, 2024 06:58
@jbaiter jbaiter marked this pull request as ready for review May 24, 2024 06:58
@jbaiter jbaiter force-pushed the optimized-string-allocations branch from cecbd0c to 7deeb0a Compare May 24, 2024 13:34
@jbaiter jbaiter merged commit def4f26 into main May 27, 2024
6 checks passed