
Elasticsearch: Async handling of indexing/deletion requests #8465

Open

tobias-hotz wants to merge 19 commits into main
Conversation

@tobias-hotz (Contributor) commented Oct 24, 2024

Currently, indexing is batched through a global queue of 200 elements (except when forceRefreshReaders is true). When the threshold is reached, the queued entries are submitted, and the thread that submitted the 200th element waits until Elasticsearch returns the result of the request.
For deletion, we currently send one request per deleted record and always use the deleteByQuery method.

The current design has a number of flaws:

  • When multiple threads index metadata and none of them use forceRefreshReaders, the queue can grow past 200 (e.g., two threads add at about the same time, and by the time the queue size is checked it already holds 201 elements, so the equality check never fires; see the sketch after this list). The queue then grows indefinitely, meaning no further entries are submitted before an explicit call to sendDocumentsToIndex. This can cause OutOfMemoryErrors (we observed this when using multiple reindexing threads)
  • An indexing thread spends a significant amount of time waiting for Elasticsearch, while it could already be preparing the next batch for ES to consume
  • Because we wait after every delete call, deleting many entries takes much longer than needed
  • Because deletion always uses deleteByQuery (even when we could delete by UUID), no batching is possible
  • When multiple threads index metadata and none of them use forceRefreshReaders, the queue contains a mix of entries from Thread 1 and Thread 2. At most call sites, sendDocumentsToIndex is called after all entries have been indexed. It is currently possible that Thread 1 is still sending a batch that also contains entries from Thread 2, while Thread 2 has already finished submitting its remaining entries; Thread 2 then assumes all of its entries have been sent to Elasticsearch even though some are still being submitted by Thread 1. This is a very rare problem, though
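
To illustrate the first point, here is a minimal sketch of the check-then-act pattern; the class and constant names are simplified stand-ins, not the actual GeoNetwork code:

```java
// Illustrative sketch of the race described above; simplified, not the actual code.
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

class GlobalQueueSketch {
    private static final int COMMIT_INTERVAL = 200;
    private final List<String> documents =
            Collections.synchronizedList(new ArrayList<>());

    // Called concurrently by several indexing threads.
    void index(String document) {
        documents.add(document);
        // If two threads add at about the same time, both may observe
        // size() == 201 here: the equality check never fires, nothing is
        // flushed, and the queue keeps growing until an explicit
        // sendDocumentsToIndex() call (or an OutOfMemoryError).
        if (documents.size() == COMMIT_INTERVAL) {
            sendDocumentsToIndex(); // also blocks this thread until
                                    // Elasticsearch answers the bulk request
        }
    }

    private void sendDocumentsToIndex() {
        // synchronous bulk request to Elasticsearch (omitted)
    }
}
```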

This PR solves all of these problems. The main takeaway is that it significantly improves the performance of deleting and indexing many entries.

This is accomplished by introducing an IIndexSubmitter and an IDeletionSubmitter. These new classes control how new entries are sent to the index. The direct implementations (DirectIndexSubmitter and DirectDeletionSubmitter) behave like the old forceIndexChanges parameter in that they send the data to the index immediately.
With the BatchingIndexSubmitter and BatchingDeletionSubmitter, chunks are sent to Elasticsearch periodically (just as before), but a local queue is used and we do not wait for Elasticsearch; the index responses are handled asynchronously on a different thread instead. We still guarantee that indexing is complete once the whole block is done, because the close method sends the remainder of the local queue and waits for all asynchronous responses to complete.
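
For illustration, here is a rough sketch of the idea (signatures and internals are simplified stand-ins and may differ from the actual classes in the PR):

```java
// Simplified sketch of the submitter idea; signatures and internals are
// illustrative and may differ from the classes introduced by this PR.
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;

interface IIndexSubmitter extends AutoCloseable {
    /** Queue (or directly send) one prepared document for indexing. */
    void submitToIndex(String uuid, String jsonDocument);

    /** Flush whatever is still queued and wait for all asynchronous responses. */
    @Override
    void close();
}

class BatchingIndexSubmitterSketch implements IIndexSubmitter {
    private final int chunkSize = 200;
    private final List<String> localQueue = new ArrayList<>();
    private final List<CompletableFuture<Void>> pendingResponses = new ArrayList<>();

    @Override
    public void submitToIndex(String uuid, String jsonDocument) {
        localQueue.add(jsonDocument);
        if (localQueue.size() >= chunkSize) {
            flushChunk(); // returns immediately, the response is handled asynchronously
        }
    }

    private void flushChunk() {
        if (localQueue.isEmpty()) {
            return;
        }
        List<String> chunk = new ArrayList<>(localQueue);
        localQueue.clear();
        // sendBulkAsync stands in for the asynchronous Elasticsearch bulk call;
        // failures and partial errors are handled on the response thread.
        pendingResponses.add(sendBulkAsync(chunk));
    }

    @Override
    public void close() {
        flushChunk();
        // Only here do we block, so the old guarantee still holds: once the
        // submitter is closed, everything submitted to it is in the index.
        CompletableFuture.allOf(pendingResponses.toArray(new CompletableFuture[0])).join();
    }

    private CompletableFuture<Void> sendBulkAsync(List<String> chunk) {
        return CompletableFuture.runAsync(() -> {
            // asynchronous bulk request to Elasticsearch (omitted)
        });
    }
}
```

In this picture, DirectIndexSubmitter would implement the same interface but send each document immediately, matching the old forceIndexChanges behaviour, and IDeletionSubmitter would follow the same pattern for deletions.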

We made some performance measurements on a smaller scale. Here are the average results of several runs with different CSW harvesters:

| Operation | Average time before change | Average time after change | Improvement |
| --- | --- | --- | --- |
| Harvesting ~3600 entries | 11 min 17 s | 6 min 30 s | ~74% |
| Harvesting ~1250 entries | 6 min 33 s | 4 min 28 s | ~47% |
| Harvesting ~700 entries | 3 min 14 s | 1 min 30 s | ~131% |
| Harvesting ~100 entries | 0 min 29 s | 0 min 29 s | ~35% |
| Reindexing ~5700 entries | 4 min 39 s | 2 min 40 s | ~74% |
| Deleting ~3600 entries | 1 min 35 s | 0 min 20 s | ~475% |
| Deleting ~1250 entries | 0 min 38 s | 0 min 9 s | ~433% |
| Deleting ~700 entries | 0 min 20 s | 0 min 4 s | ~500% |
| Deleting ~100 entries | 0 min 5 s | 0 min 1 s | ~500% |

As you can see, the performance gains are very significant. These numbers were recorded on a local machine; with a remote index on a different machine, the effect may be even larger due to latency/throughput limitations.

Checklist

  • I have read the contribution guidelines
  • Pull request provided for main branch, backports managed with label
  • Good housekeeping of code, cleaning up comments, tests, and documentation
  • Clean commit history broken into understandable chunks, avoiding big commits with hundreds of files, cautious of reformatting and whitespace changes
  • Clean commit messages, longer verbose messages are encouraged
  • API Changes are identified in commit messages
  • Testing provided for features or enhancements using automatic tests
  • User documentation provided for new features or enhancements in manual
  • Build documentation provided for development instructions in README.md files
  • Library management using pom.xml dependency management. Update build documentation with intended library use and library tutorials or documentation

Funded by LGL BW

This improves indexing time, especially when the ES connection has a high latency or the ES load is high.
@tobias-hotz marked this pull request as draft October 24, 2024 13:51
This causes issues with some tests. This issue was already present before the changes
@fxprunayre (Member)

Interesting work @tobias-hotz. We are also investigating how to improve indexing performance for GeoNetwork 5. See the draft work in geonetwork/geonetwork#19 and
https://github.com/geonetwork/geonetwork/blob/main/src/modules/indexing/src/main/java/org/geonetwork/indexing/IndexingService.java#L108
Maybe some ideas can be shared, and tracking some GN4 hot spots could help avoid making similar mistakes in GN5.

@tobias-hotz (Contributor, Author)

Hi @fxprunayre,
thanks for taking a look.
The idea here is to improve indexing performance while breaking none of the existing behaviour. Some code paths assume that an element is available in the index right after the call to index, so this has to be taken into account.
Also, the index preparation is still done in the main thread, as it is not unlikely that some code relies on this, and given how big the GN4 codebase is, I'd rather keep it that way.

My first approach was to just return the Future of the index response to the caller, but that was getting pretty messy and it was easy to miss a call site. That's why I chose this approach.
For GN5, it could also be beneficial to move the index preparation to another thread.
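
Purely for illustration, the contrast between the two approaches looks roughly like this (all names are hypothetical stand-ins, not code from the PR):

```java
// Hypothetical, simplified contrast of the two approaches; all names here are
// illustrative stand-ins, not code from the PR.
import java.util.concurrent.CompletableFuture;

class ApproachComparisonSketch {

    /** Stand-in for an async index client (first approach). */
    interface AsyncIndexClient {
        CompletableFuture<?> indexAsync(String uuid, String jsonDocument);
    }

    /** Stand-in for the scoped submitter (chosen approach). */
    interface IndexSubmitter extends AutoCloseable {
        void submitToIndex(String uuid, String jsonDocument);
        @Override
        void close(); // flushes the local queue and waits for pending responses
    }

    // (a) Returning the future: every call site has to keep and join it,
    // which is easy to miss across a large codebase.
    void callerManagedFuture(AsyncIndexClient client) {
        CompletableFuture<?> pending = client.indexAsync("uuid-1", "{}");
        // ... other work ...
        pending.join(); // forgetting this silently breaks "indexed once we return"
    }

    // (b) Scoped submitter: the waiting happens exactly once, in close().
    void scopedSubmitter(IndexSubmitter submitter) {
        try (submitter) {
            submitter.submitToIndex("uuid-1", "{}");
        }
    }
}
```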

This change also allows multithreaded reindexing to work again (it is somewhat broken at the moment, mainly because of concurrency issues with the single document buffer used for the bulk requests). This reduces the time spent on reindexing by a lot, so support for multithreaded indexing is something GN5 should also provide out of the box.

@CLAassistant

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.
You have signed the CLA already but the status is still pending? Let us recheck it.

@tobias-hotz changed the title from "Elasticsearch Indexing: Handle index responses asynchronously" to "Elasticsearch: Async handling of indexing/deletion requests" Dec 10, 2024
@tobias-hotz marked this pull request as ready for review December 10, 2024 14:36
…indexing

# Conflicts:
#	services/src/main/java/org/fao/geonet/api/processing/DatabaseProcessUtils.java