I/O Stack Simplification and Optimization #430
Merged
Conversation
jbaiter force-pushed the optimized-string-allocations branch from c06d9df to 770e1a8 (May 24, 2024 06:58)
If I/O is not an issue (e.g. because the data completely resides in the page cache), then suddenly our generously sized 64KiB String buffer becomes a problem due to the amount of memory copying that goes on when constructing a String in Java (a huge drawback of immutable strings!). To optimize this, the buffer size was made dynamic, based on the block type that is looked for, and users are offered customization options to adapt the sizes to their data.
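The block-type-dependent sizing described above could be sketched roughly like this; note that the block type names and byte values below are hypothetical illustrations for the idea, not the plugin's actual types or defaults:

```java
/**
 * Hypothetical sketch: smaller OCR block types need smaller read buffers,
 * so the read size is derived from the block type being looked for.
 * Names and sizes are illustrative, not the plugin's real values.
 */
enum BlockType {
  WORD(512),            // a single word rarely spans much data
  LINE(2 * 1024),       // a line of OCR markup
  PARAGRAPH(8 * 1024),  // a block/paragraph element
  PAGE(64 * 1024);      // a whole page, the old one-size-fits-all value

  final int readSizeBytes;

  BlockType(int readSizeBytes) {
    this.readSizeBytes = readSizeBytes;
  }
}
```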
Profiling revealed that we were significantly bottlenecked by String construction during passage building. Additionally, most of those Strings were constructed from the same sections in the source files. This change set refactors the code to always read from disk in aligned chunks, and then cache those chunks for later use. This way we only ever read and construct a String once for a given chunk and then reuse that String.
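A minimal sketch of the aligned-chunk-plus-cache idea, assuming a simple LRU policy (the class name, constants, and cache strategy here are invented for illustration; the actual logic lives in this PR's reader classes):

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

/**
 * Illustrative sketch, not the plugin's actual code: reads are aligned to
 * fixed-size sections, and decoded sections are kept in a small LRU cache,
 * so each section is read and converted to a String at most once.
 */
public class SectionCacheSketch {
  static final int SECTION_SIZE = 8 * 1024; // hypothetical section size
  static final int MAX_CACHED_SECTIONS = 4; // hypothetical cache bound

  // Access-ordered LinkedHashMap acts as a tiny LRU cache keyed by section index.
  private final Map<Integer, String> cache =
      new LinkedHashMap<Integer, String>(16, 0.75f, true) {
        @Override
        protected boolean removeEldestEntry(Map.Entry<Integer, String> eldest) {
          return size() > MAX_CACHED_SECTIONS;
        }
      };

  private final byte[] source; // stands in for the file on disk
  int rawReads = 0;            // counts how often we hit the "disk"

  public SectionCacheSketch(byte[] source) {
    this.source = source;
  }

  /** Returns the section containing the given offset, reading it at most once. */
  public String getSection(int offset) {
    int sectionIdx = offset / SECTION_SIZE; // align to a section boundary
    return cache.computeIfAbsent(sectionIdx, idx -> {
      rawReads++;
      int start = idx * SECTION_SIZE;
      int len = Math.min(SECTION_SIZE, source.length - start);
      return new String(source, start, len, StandardCharsets.UTF_8);
    });
  }
}
```

Two lookups that land in the same aligned section share a single decoded String instead of each allocating their own copy.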
New configuration attributes on `OcrHighlightComponent`:
- `sectionReadSizeKiB`: Size of sections to read from inputs
- `maxSectionCacheSizeKiB`: Maximum size of cached sections, should be a multiple of `sectionReadSizeKiB`
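In a Solr setup, these might be set on the component definition in `solrconfig.xml`. A hedged sketch: only the two attribute names come from this change set; the component name, class path, and values below are placeholders for your actual configuration:

```xml
<!-- Illustrative only: the attribute names are from this PR; the
     name/class and the chosen values are placeholders. -->
<searchComponent name="ocrHighlight"
                 class="solrocr.OcrHighlightComponent"
                 sectionReadSizeKiB="8"
                 maxSectionCacheSizeKiB="64"/>
```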
Gone are `IterableCharSequence` and its implementations; we now have a largely stateless `SourceReader` interface that takes care of reading data from various sources. Accompanying it is a `BaseSourceReader` base class that takes care of caching and allowing sectioned access. Currently only file-system based implementations are included (for single and multiple files), but based on this API, adding support for other storage backends (S3, anyone?) should be as simple as implementing a single method: `int readBytes(byte[] dst, int dstOffset, int start, int len)`.
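To illustrate how small the backend surface is, here is a hedged sketch of what a trivial in-memory backend could look like. The `readBytes` signature is quoted from the commit message above, but the interface and class names (`SketchSourceReader`, `InMemorySourceReader`) are invented for this example and are not the plugin's actual classes:

```java
import java.io.IOException;

/** Hypothetical stand-in for the single method a new backend must provide. */
interface SketchSourceReader {
  /** Read up to {@code len} bytes starting at {@code start} into {@code dst}. */
  int readBytes(byte[] dst, int dstOffset, int start, int len) throws IOException;
}

/** A complete (toy) backend: one positional read method is all it takes. */
class InMemorySourceReader implements SketchSourceReader {
  private final byte[] data;

  InMemorySourceReader(byte[] data) {
    this.data = data;
  }

  @Override
  public int readBytes(byte[] dst, int dstOffset, int start, int len) {
    if (start >= data.length) {
      return -1; // past the end of the input
    }
    int n = Math.min(len, data.length - start);
    System.arraycopy(data, start, dst, dstOffset, n);
    return n;
  }
}
```

An S3 backend would look the same in spirit, with the body replaced by a ranged object request.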
They're not thrown with the current mmap-based filesystem implementations, but other implementations might have operations that can cause IOExceptions, so we add those to the API.
Maybe useful for performing regression tests between changes.
Turns out that not copying the data from the page cache ourselves, and letting the kernel handle it outside of the JVM, nets us a very decent performance improvement in multithreaded benchmarks (up to 40%). There is some negligible slowdown without multithreading, so we just wholesale switch the whole I/O stack over to `FileChannel`-based reading, away from `MappedByteBuffer`.
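The switch boils down to positional `FileChannel.read(ByteBuffer, long)` calls instead of copying bytes out of a `MappedByteBuffer`. A minimal sketch (the helper class and method names are ours, not the plugin's):

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;

/** Illustrative helper showing positional FileChannel reads. */
public class PositionalReadSketch {
  /** Read up to {@code len} bytes from absolute offset {@code start} into {@code dst}. */
  static int readAt(FileChannel ch, byte[] dst, int dstOffset, long start, int len)
      throws IOException {
    ByteBuffer buf = ByteBuffer.wrap(dst, dstOffset, len);
    int total = 0;
    while (buf.hasRemaining()) {
      // Positional read: the OS fills the buffer directly and the channel's
      // shared position is untouched, so concurrent threads don't contend
      // on it (and no mmap means no JVM crash on I/O errors like a
      // disappearing mount).
      int n = ch.read(buf, start + total);
      if (n < 0) {
        break; // hit end of file
      }
      total += n;
    }
    return total;
  }
}
```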
jbaiter force-pushed the optimized-string-allocations branch from 770e1a8 to 3778af1 (May 24, 2024 06:58)
schmika reviewed May 24, 2024 (src/main/java/com/github/dbmdz/solrocr/reader/BaseSourceReader.java)
schmika reviewed May 24, 2024 (src/main/java/com/github/dbmdz/solrocr/reader/BaseSourceReader.java)
Co-authored-by: schmika <[email protected]>
schmika reviewed May 24, 2024
schmika reviewed May 24, 2024
jbaiter force-pushed the optimized-string-allocations branch from cecbd0c to 7deeb0a (May 24, 2024 13:34)
schmika approved these changes (May 27, 2024)
The I/O stack in the plugin was previously very close to the way the `UnifiedHighlighter` in Solr worked. However, this was not really a good match, since the interfaces involved were highly stateful and had a very large surface area, which made implementations brittle and overly complicated. Additionally, the way we performed I/O incurred a lot of String construction, which proved to be a significant hotspot when profiling the plugin. This change set refactors the I/O stack to be much simpler, and performance has been improved a lot by significantly reducing the number of String allocations, which also led to a decrease in the number of filesystem reads.

Additionally, we no longer memory-map files; instead we simply use the regular `java.nio.FileChannel` API. This saves us one data copy on the JVM side, as we let the OS fill the buffer for us. This nets us an additional performance improvement for multithreaded highlighting, while being roughly on par with mmapped I/O in the single-threaded version. An additional benefit is that we now no longer run the risk of crashing the JVM when an I/O error like a disappearing mount occurs 😅

The combination of these two changes makes the plugin a whole lot more ✨performant✨, with response times now all below the latency threshold for "sluggishness" (the orange line):

As for the API simplifications, it boils down to the following:

- `IterableCharSequence` and its implementations are gone
- New `SourceReader` interface with a base class `BaseSourceReader`, for which implementations only have to provide an `int read(byte[] dst, int dstOffset, int start, int len)` implementation

These changes should make it significantly easier to add support for new I/O backends, most importantly S3 (see #49).