Implement parallel execution of sub-queries for hybrid search #781

VijayanB · 2024-06-10T05:25:28Z

Description

Add new thread pool to schedule tasks that are related to hybrid query execution
Register executor builders with Plugin
Use Lucene's Task Executor to execute and collect results
Parallelize Query re-write
Parallelize score supplier creation
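
A minimal sketch of the fan-out pattern described above, not the code in this PR. It uses a plain `java.util.concurrent.ExecutorService` as a stand-in for the plugin's dedicated thread pool and Lucene's task executor. The class name `ParallelSubQuerySketch` and the method `runAll` are hypothetical, and the string-returning tasks stand in for per-sub-query work such as a rewrite or score supplier creation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical illustration: one task per sub-query, executed on a shared pool.
class ParallelSubQuerySketch {

    // Submits all tasks at once and returns their results in submission order.
    static <T> List<T> runAll(ExecutorService executor, List<Callable<T>> tasks) {
        try {
            List<Future<T>> futures = executor.invokeAll(tasks);
            List<T> results = new ArrayList<>(futures.size());
            for (Future<T> future : futures) {
                results.add(future.get());
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while running sub-query tasks", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("A sub-query task failed", e.getCause());
        }
    }

    public static void main(String[] args) {
        ExecutorService executor = Executors.newFixedThreadPool(2);
        try {
            // Stand-ins for the work done for each sub-query (assumed, not from the PR).
            List<Callable<String>> tasks = new ArrayList<>();
            tasks.add(() -> "vector sub-query done");
            tasks.add(() -> "term sub-query done");
            System.out.println(runAll(executor, tasks));
        } finally {
            executor.shutdown();
        }
    }
}
```

Results come back in the same order as the submitted tasks, which matters when each result has to be matched back to its sub-query.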

Issues Resolved

#279

Check List

New functionality includes testing.
- All tests pass
New functionality has been documented.
- New functionality has javadoc added
Commits are signed as per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

…arch-project#749) Implement parallel execution of sub-queries for hybrid search Add new thread pool to schedule tasks that are related to hybrid query execution Register executor builders with Plugin Use Lucene's Task Executor to execute and collect results Parallelize Query re-write Parallelize score supplier creation Parallelize build hybrid scores Signed-off-by: Vijayan Balasubramanian <[email protected]>

…ject#779) This parallelization is not adding any value after comparing the benchmarks with and without this optimization. Hence removing it. Signed-off-by: Vijayan Balasubramanian <[email protected]>

VijayanB · 2024-06-10T18:00:24Z

Client configuration (large dataset)

1 search client (opensearch-py)
10K queries

| Query Count | Vector Count | Max segments | No. of vector search queries | No. of term queries | No. of sub-queries | Parallelization enabled | K | size | P50 client time (ms) | P90 client time (ms) | P99 client time (ms) | P50 Took Time (ms) | P50 Took Time Boost | P90 Took Time (ms) | P90 Took Time Boost | P99 Took Time (ms) | P99 Took Time Boost | Max CPU |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 10K | 1M | 10 | 1 | 1 | 2 | Yes | 100 | 100 | 123 | 149 | 198 | 31 | 22.50% | 38 | 25% | 45 | 23% | 4% |
| 10K | 1M | 10 | 1 | 1 | 2 | No | 100 | 100 | 158 | 176 | 215 | 40 | baseline | 51 | baseline | 59 | baseline | 2.50% |
| 10K | 1M | 1 | 1 | 1 | 2 | Yes | 100 | 100 | 99 | 118 | 172 | 6 | 25% | 7 | 22% | 8 | 20% | 1% |
| 10K | 1M | 1 | 1 | 1 | 2 | No | 100 | 100 | 127 | 145 | 191 | 8 | baseline | 9 | baseline | 10 | baseline | 1% |

Client configuration (large dataset)

15 search clients
10K queries each

| Query Count | Vector Count | Max segments | No. of search clients | No. of vector search queries | No. of term queries | No. of sub-queries | Parallelization enabled | K | size | P50 client time (ms) | P90 client time (ms) | P99 client time (ms) | P50 Took Time (ms) | P50 Took Time Boost | P90 Took Time (ms) | P90 Took Time Boost | P99 Took Time (ms) | P99 Took Time Boost | Max CPU |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 150K | 1M | 10 | 15 | 1 | 1 | 2 | Yes | 100 | 100 | 123 | 149 | 198 | 33 | 26% | 41 | 28% | 49 | 28% | 22% |
| 150K | 1M | 10 | 15 | 1 | 1 | 2 | No | 100 | 100 | 158 | 176 | 215 | 45 | baseline | 57 | baseline | 69 | baseline | 17.50% |
| 150K | 1M | 1 | 15 | 1 | 1 | 2 | Yes | 100 | 100 |  |  |  | 6 | 25% | 8 | 27% | 9 | 18% | 5% |
| 150K | 1M | 1 | 15 | 1 | 1 | 2 | No | 100 | 100 |  |  |  | 8 | baseline | 11 | baseline | 11 | baseline | 3% |
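
The "Took Time Boost" figures are not defined in the comment. They appear consistent with the relative reduction against the non-parallel baseline, (baseline - parallel) / baseline, up to rounding. The sketch below assumes that reading; the class `TookTimeBoostSketch` and the method `boostPercent` are hypothetical names.

```java
// Hypothetical helper: relative reduction in took time against the baseline run.
class TookTimeBoostSketch {

    // Returns the reduction as a percentage of the baseline took time.
    static double boostPercent(double baselineMs, double parallelMs) {
        if (baselineMs <= 0) {
            throw new IllegalArgumentException("baselineMs must be positive");
        }
        return 100.0 * (baselineMs - parallelMs) / baselineMs;
    }

    public static void main(String[] args) {
        // P50 took time, 1 client, 10 segments: 40 ms baseline, 31 ms with parallelization.
        System.out.println(boostPercent(40, 31));
        // P50 took time, 1 client, 1 segment: 8 ms baseline, 6 ms with parallelization.
        System.out.println(boostPercent(8, 6));
    }
}
```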

martin-gaievski

Please address my comment; otherwise the code looks good. Great job, Vijay.

src/main/java/org/opensearch/neuralsearch/executors/HybridQueryExecutor.java

Signed-off-by: Vijayan Balasubramanian <[email protected]>

* Implement parallel execution of sub-queries for hybrid search (#749) Add new thread pool to schedule tasks that are related to hybrid query execution Register executor builders with Plugin Use Lucene's Task Executor to execute and collect results Parallelize Query re-write Parallelize score supplier creation Parallelize build hybrid scores Signed-off-by: Vijayan Balasubramanian <[email protected]> (cherry picked from commit 76090de)

…arch-project#781) * Implement parallel execution of sub-queries for hybrid search (opensearch-project#749) Add new thread pool to schedule tasks that are related to hybrid query execution Register executor builders with Plugin Use Lucene's Task Executor to execute and collect results Parallelize Query re-write Parallelize score supplier creation Parallelize build hybrid scores Signed-off-by: Vijayan Balasubramanian <[email protected]>

… search (#749) (#786) * Implement parallel execution of sub-queries for hybrid search (#781) * Implement parallel execution of sub-queries for hybrid search (#749) Add new thread pool to schedule tasks that are related to hybrid query execution Register executor builders with Plugin Use Lucene's Task Executor to execute and collect results Parallelize Query re-write Parallelize score supplier creation Parallelize build hybrid scores Signed-off-by: Vijayan Balasubramanian <[email protected]> * Update package name in 2.15 which is different from main Signed-off-by: Vijayan Balasubramanian <[email protected]> --------- Signed-off-by: Vijayan Balasubramanian <[email protected]>

zane-neo · 2024-06-12T23:43:13Z

src/main/java/org/opensearch/neuralsearch/executors/HybridQueryExecutor.java

+     */
+    public static void initialize(ThreadPool threadPool) {
+        if (threadPool == null) {
+            throw new IllegalArgumentException(


It seems this should be an IllegalStateException, since the threadPool is provided by the OS system rather than being a parameter passed in by the customer.
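
A minimal sketch of what this suggestion could look like, not the merged code. It uses `java.util.concurrent.ExecutorService` as a stand-in for the OpenSearch `ThreadPool` type; the class `ExecutorHolderSketch`, its field, and the error messages are hypothetical.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical holder that is initialized once by the hosting system at startup.
class ExecutorHolderSketch {
    private static ExecutorService executor;

    // A null pool here points at broken system wiring, not at bad caller input,
    // which is the reviewer's argument for IllegalStateException.
    static void initialize(ExecutorService pool) {
        if (pool == null) {
            throw new IllegalStateException("Thread pool is not available; cannot initialize the hybrid query executor");
        }
        executor = pool;
    }

    static ExecutorService getExecutor() {
        if (executor == null) {
            throw new IllegalStateException("Executor has not been initialized");
        }
        return executor;
    }

    public static void main(String[] args) {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        initialize(pool);
        System.out.println(getExecutor() == pool);
        pool.shutdown();
    }
}
```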

zane-neo · 2024-06-13T00:03:27Z

src/main/java/org/opensearch/neuralsearch/query/HybridQuery.java

+                boolean actuallyRewritten = rewrittenQuery != query;
+                return new AbstractMap.SimpleEntry(rewrittenQuery, actuallyRewritten);
+            } catch (IOException e) {
+                throw new RuntimeException(e);


It seems better to use IllegalStateException, since it indicates that this is a server-internal exception. We should also add an error message.
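
A minimal sketch of the suggested pattern, under the assumption that the checked `IOException` has to be converted because the work runs inside a task that cannot declare it. The names `IoTaskSketch`, `IoSupplier`, and `callOrFail`, and the message text, are hypothetical and not part of the PR.

```java
import java.io.IOException;

// Hypothetical helper: wrap a checked IOException with a descriptive message.
class IoTaskSketch {

    // A supplier that is allowed to throw IOException (for example, a query rewrite).
    interface IoSupplier<T> {
        T get() throws IOException;
    }

    // Runs the supplier and converts IOException into an IllegalStateException
    // that names the failed step and keeps the original exception as the cause.
    static <T> T callOrFail(String step, IoSupplier<T> supplier) {
        try {
            return supplier.get();
        } catch (IOException e) {
            throw new IllegalStateException("Failed to " + step, e);
        }
    }

    public static void main(String[] args) {
        System.out.println(callOrFail("rewrite sub-query", () -> "rewritten"));
        try {
            callOrFail("rewrite sub-query", () -> {
                throw new IOException("index unreadable");
            });
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage() + " (cause: " + e.getCause().getMessage() + ")");
        }
    }
}
```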

zane-neo · 2024-06-13T00:21:58Z

src/main/java/org/opensearch/neuralsearch/query/HybridQueryWeight.java

+            try {
+                return weight.scorerSupplier(leafReaderContext);
+            } catch (IOException e) {
+                throw new RuntimeException(e);


zane-neo · 2024-06-13T00:30:00Z

src/main/java/org/opensearch/neuralsearch/executors/HybridQueryExecutor.java

+    private static final Integer HYBRID_QUERY_EXEC_THREAD_POOL_QUEUE_SIZE = 1000;
+    private static final Integer MAX_THREAD_SIZE = 1000;
+    private static final Integer MIN_THREAD_SIZE = 2;
+    private static final Integer PROCESSOR_COUNT_MULTIPLIER = 2;


Do we really need to multiply the processor count? The tasks all look computationally intensive, and for such tasks more threads may even hurt performance; for example, ForkJoinPool uses (processors - 1) as its thread count. Have we done any testing on this?
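
To make the trade-off concrete, here is a small sketch comparing the two sizing rules under discussion. The constants are copied from the diff above, but how the PR combines them is not shown in this excerpt, so the clamping in `multipliedSize` is an assumption. The class `ThreadSizeSketch` and both method names are hypothetical.

```java
// Hypothetical comparison of two pool-sizing rules; not the PR's actual computation.
class ThreadSizeSketch {
    static final int MAX_THREAD_SIZE = 1000;
    static final int MIN_THREAD_SIZE = 2;
    static final int PROCESSOR_COUNT_MULTIPLIER = 2;

    // Assumed rule: processors * multiplier, clamped to [MIN_THREAD_SIZE, MAX_THREAD_SIZE].
    static int multipliedSize(int processors) {
        int scaled = processors * PROCESSOR_COUNT_MULTIPLIER;
        return Math.max(MIN_THREAD_SIZE, Math.min(scaled, MAX_THREAD_SIZE));
    }

    // Alternative raised in the review: roughly one thread per core, minus one.
    static int cpuBoundSize(int processors) {
        return Math.max(MIN_THREAD_SIZE, processors - 1);
    }

    public static void main(String[] args) {
        int processors = Runtime.getRuntime().availableProcessors();
        System.out.println("processors=" + processors
            + " multiplied=" + multipliedSize(processors)
            + " cpuBound=" + cpuBoundSize(processors));
    }
}
```

Which rule performs better depends on whether the tasks ever block on I/O, so a benchmark with both settings would be needed to answer the question.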

VijayanB added 2 commits June 9, 2024 22:22

Remove parallelization while collecting hybrid scores (opensearch-pro…

352dd45


VijayanB requested review from heemin32, navneet1v, vamshin, jmazanec15, naveentatikonda, junqiu-lei, martin-gaievski, sean-zheng-amazon, model-collapse, zane-neo, ylwu-amzn, jngz-es, vibrantvarun and zhichao-aws as code owners June 10, 2024 05:25

martin-gaievski approved these changes Jun 11, 2024


src/main/java/org/opensearch/neuralsearch/executors/HybridQueryExecutor.java (outdated)

Update execption formatting

7b9d422

Signed-off-by: Vijayan Balasubramanian <[email protected]>

naveentatikonda approved these changes Jun 11, 2024


VijayanB self-assigned this Jun 11, 2024

VijayanB added the backport 2.x and v2.15.0 labels Jun 11, 2024

VijayanB merged commit 76090de into opensearch-project:main Jun 11, 2024
75 checks passed

opensearch-trigger-bot bot mentioned this pull request Jun 11, 2024

[Backport 2.x] Implement parallel execution of sub-queries for hybrid search #785

Closed

zane-neo reviewed Jun 13, 2024
