Implement parallel execution of sub-queries for hybrid search #781

Merged 3 commits on Jun 11, 2024

Conversation

VijayanB

Description

Add new thread pool to schedule tasks that are related to hybrid query execution
Register executor builders with Plugin
Use Lucene's Task Executor to execute and collect results
Parallelize Query re-write
Parallelize score supplier creation
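
A minimal sketch of the fan-out and collect pattern described in the list above. It uses a plain `java.util.concurrent.ExecutorService` as a stand-in for the plugin's thread pool and Lucene's Task Executor, so the class name `ParallelSubQueryRunner` and its methods are hypothetical, not the PR's actual code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical helper (not the plugin's class): submits one task per
// sub-query to a shared pool and collects results in submission order.
final class ParallelSubQueryRunner {
    private final ExecutorService executor;

    ParallelSubQueryRunner(ExecutorService executor) {
        this.executor = executor;
    }

    <T> List<T> runAll(List<Callable<T>> tasks) {
        // Submit everything first so the tasks can run concurrently.
        List<Future<T>> futures = new ArrayList<>();
        for (Callable<T> task : tasks) {
            futures.add(executor.submit(task));
        }
        // Then wait for each result, preserving the original task order.
        List<T> results = new ArrayList<>();
        for (Future<T> future : futures) {
            try {
                results.add(future.get());
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("Interrupted while waiting for a sub-query task", e);
            } catch (ExecutionException e) {
                throw new IllegalStateException("Sub-query task failed", e.getCause());
            }
        }
        return results;
    }

    // Demo with two stand-in "sub-queries" that only return a label.
    static List<String> demo() {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        try {
            List<Callable<String>> tasks = new ArrayList<>();
            tasks.add(() -> "vector-result");
            tasks.add(() -> "term-result");
            return new ParallelSubQueryRunner(pool).runAll(tasks);
        } finally {
            pool.shutdown();
        }
    }
}
```

Collecting futures in submission order keeps each result aligned with its sub-query, which matters when the scores are combined afterwards.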

Issues Resolved

#279

Check List

  • New functionality includes testing.
    • All tests pass
  • New functionality has been documented.
    • New functionality has javadoc added
  • Commits are signed as per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

VijayanB added 2 commits June 9, 2024 22:22
…arch-project#749)

Implement parallel execution of sub-queries for hybrid search

Add new thread pool to schedule tasks that are related to hybrid query execution
Register executor builders with Plugin
Use Lucene's Task Executor to execute and collect results
Parallelize Query re-write
Parallelize score supplier creation
Parallelize build hybrid scores

Signed-off-by: Vijayan Balasubramanian <[email protected]>
…ject#779)

This parallelization is not adding any value after comparing
the benchmarks with and without this optimization.
Hence removing it.

Signed-off-by: Vijayan Balasubramanian <[email protected]>
@VijayanB

Client ( large dataset )

  • 1 Search Client ( opensearch-py )
  • 10K Queries
| Query count | Vector count | Max segments | Vector search queries | Term queries | Sub-queries | Parallelization enabled | K size | (unlabeled) | P50 client time (ms) | P90 client time (ms) | P99 client time (ms) | P50 took time (ms) | P50 took time boost | P90 took time (ms) | P90 took time boost | P99 took time (ms) | P99 took time boost | Max CPU |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 10K | 1M | 10 | 1 | 1 | 2 | Yes | 100 | 100 | 123 | 149 | 198 | 31 | 22.50% | 38 | 25% | 45 | 23% | 4% |
| 10K | 1M | 10 | 1 | 1 | 2 | No | 100 | 100 | 158 | 176 | 215 | 40 | baseline | 51 | baseline | 59 | baseline | 2.50% |
| 10K | 1M | 1 | 1 | 1 | 2 | Yes | 100 | 100 | 99 | 118 | 172 | 6 | 25% | 7 | 22% | 8 | 20% | 1% |
| 10K | 1M | 1 | 1 | 1 | 2 | No | 100 | 100 | 127 | 145 | 191 | 8 | baseline | 9 | baseline | 10 | baseline | 1% |
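
The boost figures appear to be the relative reduction in took time against the matching baseline row, for example (40 - 31) / 40 = 22.5% for P50 with 10 segments. That reading is inferred from the numbers rather than stated in the PR, and some reported values look rounded. A small sketch of the arithmetic, with a hypothetical helper name:

```java
// Hypothetical helper: percentage reduction in took time relative to the
// baseline (parallelization disabled) run. The formula is an assumption
// inferred from the reported numbers.
final class TookTimeBoost {
    private TookTimeBoost() {}

    static double boostPercent(double baselineMs, double parallelMs) {
        return (baselineMs - parallelMs) / baselineMs * 100.0;
    }
}
```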

Client Configuration ( large dataset )

  • 15 Search clients
  • 10K Queries each
| Query count | Vector count | Max segments | Search clients | Vector search queries | Term queries | Sub-queries | Parallelization enabled | K size | (unlabeled) | P50 client time (ms) | P90 client time (ms) | P99 client time (ms) | P50 took time (ms) | P50 took time boost | P90 took time (ms) | P90 took time boost | P99 took time (ms) | P99 took time boost | Max CPU |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 150K | 1M | 10 | 15 | 1 | 1 | 2 | Yes | 100 | 100 | 123 | 149 | 198 | 33 | 26% | 41 | 28% | 49 | 28% | 22% |
| 150K | 1M | 10 | 15 | 1 | 1 | 2 | No | 100 | 100 | 158 | 176 | 215 | 45 | baseline | 57 | baseline | 69 | baseline | 17.50% |
| 150K | 1M | 1 | 15 | 1 | 1 | 2 | Yes | 100 | 100 | | | | 6 | 25% | 8 | 27% | 9 | 18% | 5% |
| 150K | 1M | 1 | 15 | 1 | 1 | 2 | No | 100 | 100 | | | | 8 | baseline | 11 | baseline | 11 | baseline | 3% |

@martin-gaievski martin-gaievski left a comment

Please address my comment; otherwise the code looks good. Great job, Vijay.

Signed-off-by: Vijayan Balasubramanian <[email protected]>
@VijayanB VijayanB self-assigned this Jun 11, 2024
@VijayanB VijayanB added the backport 2.x and v2.15.0 labels Jun 11, 2024
@VijayanB VijayanB merged commit 76090de into opensearch-project:main Jun 11, 2024
75 checks passed
opensearch-trigger-bot bot pushed a commit that referenced this pull request Jun 11, 2024
* Implement parallel execution of sub-queries for hybrid search (#749)

Add new thread pool to schedule tasks that are related to hybrid query execution
Register executor builders with Plugin
Use Lucene's Task Executor to execute and collect results
Parallelize Query re-write
Parallelize score supplier creation
Parallelize build hybrid scores

Signed-off-by: Vijayan Balasubramanian <[email protected]>
(cherry picked from commit 76090de)
VijayanB added a commit to VijayanB/neural-search that referenced this pull request Jun 11, 2024
…arch-project#781)

* Implement parallel execution of sub-queries for hybrid search (opensearch-project#749)

Add new thread pool to schedule tasks that are related to hybrid query execution
Register executor builders with Plugin
Use Lucene's Task Executor to execute and collect results
Parallelize Query re-write
Parallelize score supplier creation
Parallelize build hybrid scores

Signed-off-by: Vijayan Balasubramanian <[email protected]>
VijayanB added a commit that referenced this pull request Jun 11, 2024
… search (#749) (#786)

* Implement parallel execution of sub-queries for hybrid search (#781)

* Implement parallel execution of sub-queries for hybrid search (#749)

Add new thread pool to schedule tasks that are related to hybrid query execution
Register executor builders with Plugin
Use Lucene's Task Executor to execute and collect results
Parallelize Query re-write
Parallelize score supplier creation
Parallelize build hybrid scores

Signed-off-by: Vijayan Balasubramanian <[email protected]>

* Update package name in 2.15 which is different from main

Signed-off-by: Vijayan Balasubramanian <[email protected]>

---------

Signed-off-by: Vijayan Balasubramanian <[email protected]>
*/
public static void initialize(ThreadPool threadPool) {
    if (threadPool == null) {
        throw new IllegalArgumentException(

This should probably be an IllegalStateException, since the threadPool is supplied by the OpenSearch system rather than passed in as a parameter by the customer.
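
A minimal sketch of what this suggestion could look like. `HybridExecutorHolder` is a hypothetical name, and `ExecutorService` stands in for OpenSearch's `ThreadPool` so that the example is self-contained; the error messages are also illustrative.

```java
import java.util.concurrent.ExecutorService;

// Hypothetical holder class; ExecutorService is a stand-in for the
// OpenSearch ThreadPool that the real initialize method receives.
final class HybridExecutorHolder {
    private static ExecutorService executor;

    private HybridExecutorHolder() {}

    static synchronized void initialize(ExecutorService pool) {
        if (pool == null) {
            // A missing pool is a wiring problem inside the server,
            // not invalid input from a caller.
            throw new IllegalStateException(
                "Thread pool must be available before hybrid query execution is initialized");
        }
        executor = pool;
    }

    static synchronized ExecutorService get() {
        if (executor == null) {
            throw new IllegalStateException("initialize() has not been called");
        }
        return executor;
    }
}
```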

    boolean actuallyRewritten = rewrittenQuery != query;
    return new AbstractMap.SimpleEntry(rewrittenQuery, actuallyRewritten);
} catch (IOException e) {
    throw new RuntimeException(e);

It seems better to use IllegalStateException here, since it indicates a server-internal error. An error message should also be added.
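
One way to apply this suggestion, sketched with hypothetical names (`IoWrapping`, `IoStep`, `runOrFail`) that do not come from the PR. The step interface stands in for any call that may throw `IOException`, such as a query rewrite or scorer supplier creation.

```java
import java.io.IOException;

// Hypothetical utility: runs a step that may throw IOException and
// rethrows it as IllegalStateException with a descriptive message.
final class IoWrapping {
    private IoWrapping() {}

    // Stand-in for work such as a query rewrite that can fail with IOException.
    interface IoStep<T> {
        T run() throws IOException;
    }

    static <T> T runOrFail(String description, IoStep<T> step) {
        try {
            return step.run();
        } catch (IOException e) {
            // Keep the original exception as the cause so the stack trace survives.
            throw new IllegalStateException("Failed to " + description, e);
        }
    }
}
```

A call such as `IoWrapping.runOrFail("rewrite sub-query", () -> rewriteStep())` (where `rewriteStep` is whatever the caller needs to run) then reports both what failed and why.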

try {
    return weight.scorerSupplier(leafReaderContext);
} catch (IOException e) {
    throw new RuntimeException(e);

Same as above.

private static final Integer HYBRID_QUERY_EXEC_THREAD_POOL_QUEUE_SIZE = 1000;
private static final Integer MAX_THREAD_SIZE = 1000;
private static final Integer MIN_THREAD_SIZE = 2;
private static final Integer PROCESSOR_COUNT_MULTIPLIER = 2;

Do we really need to multiply the processor count? The tasks all look computationally intensive, and for such tasks more threads may even hurt performance. For example, ForkJoinPool's common pool defaults to one less than the number of processors. Did we test this?
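
To make the trade-off concrete, here is a sketch that compares the two sizing approaches. How the PR actually combines the constants is not visible in this hunk, so the clamping formula in `threadCount` is an assumption, and `HybridPoolSizing` and `processorsMinusOne` are hypothetical names.

```java
// Hypothetical sizing helper. The constants mirror the values in the diff;
// the way they are combined here is an assumption.
final class HybridPoolSizing {
    static final int MAX_THREAD_SIZE = 1000;
    static final int MIN_THREAD_SIZE = 2;

    private HybridPoolSizing() {}

    // Assumed formula: processors * multiplier, clamped to [MIN, MAX].
    static int threadCount(int processors, int multiplier) {
        return Math.max(MIN_THREAD_SIZE, Math.min(MAX_THREAD_SIZE, processors * multiplier));
    }

    // The alternative raised in the comment: one thread fewer than the
    // processor count, with the same lower bound applied (an assumption).
    static int processorsMinusOne(int processors) {
        return Math.max(MIN_THREAD_SIZE, processors - 1);
    }
}
```

On an 8-core node, the multiplier of 2 gives 16 threads while the processors-minus-one rule gives 7, which is the gap a benchmark would need to justify.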
