Multi-threaded highlighting #64

Closed
jbaiter opened this issue Sep 12, 2019 · 0 comments · Fixed by #429
Labels
enhancement, performance

Comments

jbaiter commented Sep 12, 2019

Currently every (doc, field, matchOffset) combination is highlighted sequentially. Since highlighting is heavily I/O-bound, it would be great if this could be parallelized at the doc or field level, so we can take advantage of storage layers that allow concurrent access (see e.g. #49).

This work should probably also involve a refactor that moves away from subclassing the uhighlight.FieldHighlighter type hierarchy and replaces it with something better suited to our use case. Specifically, we should look into whether there's a better way to determine passage boundaries than the current BreakIterator approach.
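
A minimal sketch of what doc/field-level parallelization could look like, assuming a plain fixed-size thread pool (`ConcurrentHighlightDriver` and `highlightOne` are hypothetical names for illustration, not part of the codebase):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Hypothetical driver: every (docId, field) combination becomes an
 *  independent task on a thread pool, so a storage layer that supports
 *  concurrent access can serve many reads at once. */
public class ConcurrentHighlightDriver {
  private final ExecutorService pool = Executors.newFixedThreadPool(8);

  public List<String> highlight(int[] docIds, String[] fields) throws Exception {
    List<Future<String>> tasks = new ArrayList<>();
    for (int docId : docIds) {
      for (String field : fields) {
        tasks.add(pool.submit(() -> highlightOne(docId, field)));
      }
    }
    List<String> snippets = new ArrayList<>();
    for (Future<String> task : tasks) {
      // get() blocks, but all tasks run concurrently in the background.
      snippets.add(task.get());
    }
    return snippets;
  }

  /** Placeholder for the actual per-field snippet generation. */
  private String highlightOne(int docId, String field) {
    return "snippet for doc " + docId + ", field '" + field + "'";
  }
}
```

Since the work is I/O-bound rather than CPU-bound, the pool size would probably need to be tuned to the storage layer's concurrency rather than to the number of CPU cores.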

jbaiter added the enhancement and performance labels Sep 12, 2019
jbaiter added a commit that referenced this issue Feb 20, 2020
This is a relatively hacky way to address the issues raised in #64.

This PR adds a utility class to concurrently "warm" the OS page cache
with files that will be used for highlighting.

This should significantly reduce the I/O latency during the sequential
highlighting process, especially when using a network storage layer or a
RAID system.

The idea is that many storage layers can benefit from parallel I/O.
Unfortunately, snippet generation with the current UnifiedHighlighter
approach is strictly sequential, which means we give away a lot of
potential performance, since we're limited by the I/O latency of the
underlying storage layer. By concurrently pre-reading the data we
might need, we pre-populate the operating system's page cache, so any
I/O performed by the snippet generation process further down the line
should only hit the page cache and not incur as much of a latency hit.

The class also provides a way to cancel the pre-loading of a given
source pointer. This is called at the beginning of the snippet
generation process, since at that point any remaining background I/O
on the target files would only add to the latency we incur anyway.
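
A minimal sketch of what such a page-cache warmer might look like, assuming plain java.nio file I/O on a fixed thread pool (`PageCacheWarmer` and its method names are illustrative, not the actual API from the PR):

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Hypothetical page-cache warmer: reads files on a background thread
 *  pool so that later sequential reads during snippet generation hit
 *  the OS page cache instead of the storage layer. */
public class PageCacheWarmer {
  private static final int CHUNK_SIZE = 32 * 1024;
  private final ExecutorService pool = Executors.newFixedThreadPool(8);
  private final Map<Path, Future<?>> pending = new ConcurrentHashMap<>();

  /** Schedule a background read of the whole file. */
  public void preload(Path path) {
    pending.computeIfAbsent(path, p -> pool.submit(() -> readFully(p)));
  }

  /** Cancel background I/O for a file, called when snippet generation
   *  for it starts and further pre-reading would only compete with the
   *  reads we are about to do ourselves. */
  public void cancelPreload(Path path) {
    Future<?> task = pending.remove(path);
    if (task != null) {
      task.cancel(true);
    }
  }

  private void readFully(Path path) {
    ByteBuffer buf = ByteBuffer.allocate(CHUNK_SIZE);
    try (FileChannel chan = FileChannel.open(path, StandardOpenOption.READ)) {
      // Reading the bytes is enough; we discard them and let the kernel
      // keep the pages cached.
      while (!Thread.currentThread().isInterrupted() && chan.read(buf) >= 0) {
        buf.clear();
      }
    } catch (IOException e) {
      // Warming is best-effort: a failed or cancelled read just means
      // no cache benefit for this file.
    }
  }
}
```

Reading and discarding the bytes is all it takes for the warming effect: the kernel keeps recently read pages in its page cache, so the sequential reads performed later by snippet generation are served from memory rather than from the storage layer.
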
jbaiter added a commit that referenced this issue Feb 20, 2020
bitzl pushed a commit that referenced this issue Feb 20, 2020
jbaiter linked a pull request May 10, 2024 that will close this issue