
Slow ingestion/indexing performance, possibility to index in batch? #128

Open
vever001 opened this issue Nov 8, 2024 · 1 comment


vever001 commented Nov 8, 2024

Hello,

We are currently considering adding oe_search on Joinup, but noticed that indexing is very slow.
oe_search performs two POST requests to the API for every document (one for the token and one to send the payload); see \Drupal\oe_search\Plugin\search_api\backend\SearchApiEuropaSearchBackend::indexItems.
This results in a large number of requests and slow indexing overall.

Here are the xhprof results for indexing 500 items via drush, which took 2m36s to complete on my local environment:

(screenshot: xhprof profile)

The profile shows 1000 API requests made via curl_exec, accounting for 87% of the total execution time.
Each document ingestion triggers two requests, each taking around 100-200 ms (× 1000 requests, which matches the curl_exec footprint):

  • 'POST', 'https://***/token',
  • 'POST', 'https://***/ingestion-api/acc/rest/ingestion/text?...

The project currently has 25K items/nodes to index, which would take more than 2 hours and 50K requests to the Europa Search API.
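A quick back-of-the-envelope check of those numbers (assuming ~150 ms per request, the midpoint of the observed 100-200 ms range):

```python
# Rough estimate of total requests and wall-clock time for per-item ingestion.
# The 150 ms figure is an assumption based on the observed 100-200 ms range.
items = 25_000
requests_per_item = 2          # one token request + one ingestion request
seconds_per_request = 0.15

total_requests = items * requests_per_item
total_hours = total_requests * seconds_per_request / 3600

print(total_requests)          # 50000
print(round(total_hours, 1))   # 2.1
```

So even at the faster end of the observed latency, the per-item approach lands well above two hours for a full reindex.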


To avoid so many requests, would it be possible to:

  • Request the token only once.
    The token likely expires after some time, but we could cache it with expiration logic (e.g., using the expirable key/value store?).
  • Index multiple items in a single HTTP POST request to the Europa Search API.
    This depends on the Europa Search API's capabilities. I reached out to them, and they mentioned support for bulk indexing via /rest/ingestion/bulk. However, that endpoint seems designed specifically for documents (e.g., PDF or Word files), so I'm not sure it supports our use case at this time.
    For context, search_api processes items in batches of 50 by default, so this approach would make oe_search similar to search_api_solr, which sends all 50 items in a single request.
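To illustrate the first suggestion, here is a minimal, generic sketch of the cached-token idea (in Python rather than Drupal/PHP; the class and the `fetch` callback are made up for illustration, standing in for the POST to /token and the expirable key/value store):

```python
import time

class CachedToken:
    """Cache an access token until shortly before it expires,
    mirroring what an expirable key/value store would do in Drupal.
    `fetch` is any callable returning (token, lifetime_in_seconds)."""

    def __init__(self, fetch, safety_margin=30):
        self._fetch = fetch
        self._margin = safety_margin  # refresh a bit before actual expiry
        self._token = None
        self._expires_at = 0.0

    def get(self):
        # Hit the token endpoint only when the cached token is (nearly) expired.
        if self._token is None or time.monotonic() >= self._expires_at:
            self._token, lifetime = self._fetch()
            self._expires_at = time.monotonic() + lifetime - self._margin
        return self._token

# Hypothetical usage: request_token stands in for the POST to the /token endpoint.
calls = 0
def request_token():
    global calls
    calls += 1
    return f"token-{calls}", 3600  # token valid for one hour

cache = CachedToken(request_token)
for _ in range(50):           # e.g. one search_api batch of 50 items
    token = cache.get()       # only the first call hits the endpoint

print(calls)  # 1
```

With this pattern, a batch of 50 items would make 51 requests instead of 100, and a full reindex would make roughly half as many requests overall.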

Thank you

@allternativ9

Dear,
Thank you for sharing your technical suggestion.

To ensure better communication and support for Europa Web Platform developers, we have established a Community of Practice for the European Commission and EUIBAs (GRP-Drupal CoP @ EC and EUIBAs | General | Microsoft Teams), along with several training sessions and fora to support members of the community:

A dedicated "Ask OEL" channel and meetings (GRP-Drupal CoP @ EC and EUIBAs | 04. 💭 Ask OEL Team!! | Microsoft Teams): these sessions are designed for developers to raise questions directly, discuss challenges, and receive guidance from the team. We encourage you to take advantage of these meetings to address any blockers or technical issues. They are organised every other Thursday on Teams, and your team members are welcome to join to be guided.

In the meantime, you can also post your message in the Ask OEL Teams channel.

Thanks a lot for your understanding.
Angela Grigore.
