Implement parallel execution of sub-queries for hybrid search #749

VijayanB · 2024-05-15T20:29:54Z

Description

Add new thread pool to schedule tasks that are related to hybrid query execution
Register executor builders with Plugin
Use Lucene's Task Executor to execute and collect results
Parallelize Query re-write
Parallelize score supplier creation
Parallelize build hybrid scores
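
As a rough illustration of the approach listed above, the sketch below fans one task per sub-query out to a shared pool and gathers the results in sub-query order. It is an assumption-laden stand-in, not the code from this PR: it uses a plain `java.util.concurrent.ExecutorService` rather than Lucene's TaskExecutor or an OpenSearch thread pool, and the names `ParallelSubQuerySketch` and `runAll` are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

// Hypothetical sketch: not the PR's HybridQueryExecutor.
final class ParallelSubQuerySketch {

    // Runs one task per sub-query on the given pool and returns results in sub-query order.
    static <Q, R> List<R> runAll(ExecutorService pool, List<Q> subQueries, Function<Q, R> action) throws Exception {
        List<Callable<R>> tasks = new ArrayList<>();
        for (Q subQuery : subQueries) {
            tasks.add(() -> action.apply(subQuery));
        }
        List<R> results = new ArrayList<>();
        // invokeAll blocks until every task is done and returns futures in task-list order.
        for (Future<R> future : pool.invokeAll(tasks)) {
            results.add(future.get()); // a failed task surfaces here as ExecutionException
        }
        return results;
    }

    public static void main(String[] args) throws Exception {
        // A dedicated pool, standing in for the new hybrid-query thread pool.
        ExecutorService pool = Executors.newFixedThreadPool(4);
        try {
            List<String> subQueries = List.of("match", "knn", "neural");
            // The "rewrite" here is a placeholder string transformation.
            System.out.println(runAll(pool, subQueries, q -> "rewritten(" + q + ")"));
        } finally {
            pool.shutdown();
        }
    }
}
```

The same helper shape could serve query rewrite, score supplier creation, and score building, since each is one independent action per sub-query.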

Issues Resolved

Check List

New functionality includes testing.
- All tests pass
New functionality has been documented.
- New functionality has javadoc added
Commits are signed as per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

codecov · 2024-05-16T23:37:54Z

Codecov Report

Attention: Patch coverage is 83.78378%, with 12 lines in your changes missing coverage. Please review.

Please upload report for BASE (feature/parallelize-hybrid-search@806042c). Learn more about missing BASE report.

❗ Current head 2750ad4 differs from pull request most recent head bc6b885

Please upload reports for the commit bc6b885 to get more accurate results.

Files	Patch %	Lines
...ch/neuralsearch/executors/HybridQueryExecutor.java	44.44%	4 Missing and 1 partial ⚠️
...org/opensearch/neuralsearch/query/HybridQuery.java	89.47%	2 Missing ⚠️
...ensearch/neuralsearch/query/HybridQueryScorer.java	86.66%	2 Missing ⚠️
...ensearch/neuralsearch/query/HybridQueryWeight.java	85.71%	2 Missing ⚠️
...g/opensearch/neuralsearch/plugin/NeuralSearch.java	50.00%	1 Missing ⚠️

Additional details and impacted files

@@                         Coverage Diff                          @@
##             feature/parallelize-hybrid-search     #749   +/-   ##
====================================================================
  Coverage                                     ?   84.89%           
  Complexity                                   ?      812           
====================================================================
  Files                                        ?       65           
  Lines                                        ?     2490           
  Branches                                     ?      410           
====================================================================
  Hits                                         ?     2114           
  Misses                                       ?      213           
  Partials                                     ?      163

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

navneet1v

Looked at the PR at a high level and added comments. Please resolve them; after that I will do one more review of the PR.

src/main/java/org/opensearch/neuralsearch/executors/HybridQueryExecutorCollectorManager.java

src/main/java/org/opensearch/neuralsearch/executors/HybridQueryExecutorCollector.java

src/main/java/org/opensearch/neuralsearch/executors/HybridQueryExecutor.java

src/main/java/org/opensearch/neuralsearch/query/HybridQuery.java

src/main/java/org/opensearch/neuralsearch/query/HybridQueryRewriteCollectorManager.java

src/main/java/org/opensearch/neuralsearch/query/HybridQuery.java

navneet1v · 2024-05-17T17:15:32Z

src/main/java/org/opensearch/neuralsearch/query/HybridQuery.java

+                perform better. For hybrid query we need to track progress of re-write for all sub-queries */
+
+                boolean actuallyRewritten = rewrittenQuery != query;
+                return new AbstractMap.SimpleEntry(rewrittenQuery, actuallyRewritten);


Can we avoid storing queries in a map? And who is going to store this simple map entry?

HybridQueryExecutorCollector.collect() will store this entry. If we don't store the entry, we can't hand the result back to the caller once execution is completed.

You can see if you can use optional instead. That will simplify the data structure and make it more readable

return rewrittenQuery != query ? Optional.of(rewrittenQuery) : Optional.empty()

Do we have data, or can we run a benchmark, to see how much overhead (if any) the Optional approach adds? I agree the code looks simpler this way, but if we're aiming for a performance improvement and this is a critical section, then we may want to save the CPU cycles spent wrapping and unwrapping the result in an Optional.

And +1 to Navneet's point regarding the map, and specifically using Query as a key. I was using a similar approach in the initial implementation of hybrid query; the problem is that for some queries hashCode can be slow.
How many times will this be called: once per sub-query x number of shards?

Understood, I confused Map.Entry<Query, Boolean> with Map<Query, Boolean>. I guess Map.Entry is used here to operate on a pair of objects. I'm good now, thanks Vijay.

Used optional to avoid null condition, but we need query irrespective of whether query is re-written or not.

We can explicitly check for null; my point was performance. Doing Optional.of(rewrittenQuery) first and optional.get() later will both add some cycles for wrapping the query object into an Optional and then unwrapping it. If we have benchmarks with and without Optional, we can use them as a data point to make a decision.

Sounds good. I will run a benchmark with and without Optional before merging into main. In the meantime, can we keep Optional? What do you think?

Yes, I'm not sure what the delta in performance is, if any. Let's keep Optional, run the benchmark, and act per the data.

the intent was never to have an Optional as a member variable, the intent was to only return optional to indicate the value is not rewritten. It got misunderstood.

I realize that the query is needed irrespective of whether it's rewritten, so Optional cannot be used. An Entry is a good option here.

In terms of performance, if Optional does turn out to be an issue then we might have to deep dive and see if wrapping results in an object is an issue overall
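
To make the outcome of this thread concrete, here is a minimal sketch of why an `Optional` return loses information that a pair-like result keeps. Everything in it is hypothetical: `Query` is a tiny stand-in for Lucene's class, `RewriteResult` only illustrates the "wrapper instead of Map.Entry" idea, and the trailing-`*` rule is an invented rewrite used purely to produce a new instance.

```java
import java.util.Optional;

// Hypothetical sketch; none of these types are the PR's actual classes.
final class RewriteResultSketch {

    // Stand-in for Lucene's Query; only object identity matters here.
    static final class Query {
        final String text;

        Query(String text) {
            this.text = text;
        }
    }

    // Always carries the query, plus a flag saying whether rewrite produced a new instance.
    static final class RewriteResult {
        final Query query;
        final boolean rewritten;

        RewriteResult(Query query, boolean rewritten) {
            this.query = query;
            this.rewritten = rewritten;
        }
    }

    // Invented rewrite rule: strip a trailing '*', otherwise return the same instance.
    static Query rewrite(Query query) {
        return query.text.endsWith("*") ? new Query(query.text.substring(0, query.text.length() - 1)) : query;
    }

    // Optional variant: an unchanged query comes back as empty, so the caller must keep its own reference.
    static Optional<Query> rewriteAsOptional(Query query) {
        Query rewritten = rewrite(query);
        return rewritten != query ? Optional.of(rewritten) : Optional.empty();
    }

    // Pair variant: the query is available whether or not it changed.
    static RewriteResult rewriteWithFlag(Query query) {
        Query rewritten = rewrite(query);
        return new RewriteResult(rewritten, rewritten != query);
    }

    public static void main(String[] args) {
        Query unchanged = new Query("title:opensearch");
        System.out.println(rewriteAsOptional(unchanged).isPresent());      // false
        System.out.println(rewriteWithFlag(unchanged).query == unchanged); // true
    }
}
```

When results are collected from parallel tasks and reassembled later, the pair variant avoids having to match each empty `Optional` back to its original sub-query.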

shatejas · 2024-05-20T20:06:45Z

src/main/java/org/opensearch/neuralsearch/query/HybridQuery.java

-               perform better. For hybrid query we need to track progress of re-write for all sub-queries */
-            actuallyRewritten |= rewrittenSub != subQuery;
-            rewrittenSubQueries.add(rewrittenSub);
+            final HybridQueryExecutorCollector<IndexSearcher, Map.Entry<Query, Boolean>> collector = manager.newCollector();


Just for my understanding, why is a collector chosen here for collecting results? You could have plugged in TaskExecutor here and then simply used the list of results returned by invokeAll. The entire collector seems to be a roundabout way to get the results. A manager would have been useful if you wanted to return a different collector based on certain conditions, IMO.

Good question. The main idea is to move the logic of collecting and reducing/merging into its own manager. By introducing TaskExecutor inside Scorer/Weight, we would be coupling how to parallelize with what needs to be parallelized. With the introduction of Collector and Collector Manager, we decoupled parallelization from logic. IMO, this abstracts the parallelization implementation from all of its callers.

Let's dissect this a little more. There are 3 classes here:

HybridQueryExecutor: responsible for initializing and holding a common thread pool for query, scorer and weight

HybridQueryExecutorCollectorManager: responsible for giving a new collector

HybridQueryExecutorCollector: holds the query results

With introduction of Collector and Collector Manager, we decoupled parallelization from logic. IMO, this abstracts parallelization implementation from all of its caller.

HybridQueryExecutor seems to be a good wrapper to make sure the same task executor is used. I am more curious about the additional value of HybridQueryExecutorCollectorManager and HybridQueryExecutorCollector

Why can't it be

final List<Callable<Entry<Query, Boolean>>> queryRewriteTasks = new ArrayList<>();
// for each subquery
queryRewriteTasks.add(() -> rewriteQuery(subQuery)); // rewrite needs to return Entry<Query, Boolean>
List<Entry<Query, Boolean>> rewrittenQueries = HybridQueryExecutor.getExecutor().invokeAll(queryRewriteTasks);
// rest of the logic

This way you don't have to worry about the thread safety of the collectors. There are some utility methods in the collector manager that aren't related to managing a collector but are more related to the rewritten queries themselves.

Overall, while this works fine, I want to make sure these interfaces have a clear definition and that it's clear how to use them. Currently it seems like we can simplify this by not having them, unless I am misunderstanding the value they bring.
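
For readers comparing the two designs, the following is a minimal sketch of the collector/manager shape being debated, written against `java.util.concurrent` only. It is not the PR's implementation: `CollectorManagerSketch`, `Collector`, and `run` are hypothetical names, and the merge step is reduced to copying results into a list.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

// Hypothetical sketch of a collector + manager split; not the PR's classes.
final class CollectorManagerSketch {

    // Holds the result of one task; collect() applies the action to the shared input.
    static final class Collector<I, R> {
        private final I input;
        private R result;

        Collector(I input) {
            this.input = input;
        }

        void collect(Function<I, R> action) {
            result = action.apply(input);
        }

        R getResult() {
            return result;
        }
    }

    // "Manager" role: one collector per action, run them all, then merge in action order.
    static <I, R> List<R> run(ExecutorService pool, I input, List<Function<I, R>> actions) throws Exception {
        List<Collector<I, R>> collectors = new ArrayList<>();
        List<Callable<Void>> tasks = new ArrayList<>();
        for (Function<I, R> action : actions) {
            Collector<I, R> collector = new Collector<>(input);
            collectors.add(collector);
            tasks.add(() -> {
                collector.collect(action);
                return null;
            });
        }
        // Waiting on every future before reading results is the contract the merge relies on.
        for (Future<Void> future : pool.invokeAll(tasks)) {
            future.get();
        }
        List<R> merged = new ArrayList<>();
        for (Collector<I, R> collector : collectors) {
            merged.add(collector.getResult());
        }
        return merged;
    }

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        try {
            List<Function<String, Integer>> actions = List.of(String::length, s -> s.indexOf('b'));
            System.out.println(run(pool, "hybrid", actions));
        } finally {
            pool.shutdown();
        }
    }
}
```

In this sketch the unsynchronized `result` field is read only after `Future.get()` returns, which the `java.util.concurrent` memory-consistency rules make safe; reading it any earlier would not be, which is the risk raised in the review.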

src/main/java/org/opensearch/neuralsearch/executors/HybridQueryExecutor.java

chishui · 2024-05-21T08:07:36Z

src/main/java/org/opensearch/neuralsearch/executors/HybridQueryExecutorCollector.java

+ * Query phase to parallelize sub query's action to improve latency
+ */
+@RequiredArgsConstructor(staticName = "newCollector")
+public final class HybridQueryExecutorCollector<I, R> {


Is HybridQueryExecutorCollector thread safe?

There are no setters, and the only way to update the result is by calling the collect method. Do you have any particular concerns?

Since collect() could set result, if collect() and getResults() are called asynchronously, technically there could be a race condition, right?

You are right. Currently getResult() is called from the collector manager, where the contract is that it can be called only after all collectors have finished collecting. The implementation takes care of this. I can add this note to the collector as well. I am reluctant to add synchronization to the result variable, since it would add latency for a scenario that is not possible at the moment. Do you have any other suggestions?

Adding a comment is well-intentioned, but you won't be able to enforce it. You are relying on the hope that the code doesn't change in the future in a way that hits the race condition, so there will always be a risk.

If you really want to enforce it, your options here are to synchronize, or to move away from result collection to returning results, as pointed out in this thread.

i am reluctant to add synchronize to result variable, since it will add additional latency for scenario that is not possible at this moment

Actually, the overhead of synchronization is much lower than we thought. Here are some synchronization benchmark articles I found:

https://baptiste-wicht.com/posts/2010/09/java-synchronization-mutual-exclusion-benchmark.html

https://isuru-perera.blogspot.com/2016/05/benchmarking-java-locks-with-counters.html
One "synchronized" operation takes < 1 microsecond. Compared with the time spent on the actual query logic, with its intensive I/O, this is negligible, and you get a simple implementation that covers the concurrent case.

@shatejas @chishui Fair enough. I am changing it to synchronized. However, while benchmarking I plan to compare runs with and without synchronization to confirm that it does not add latency to this block.

+1 to @VijayanB's comment. The purpose of this whole PR is to reduce latency. Therefore, @VijayanB, it would be good to compare the benchmark results; if you see even a small degradation in latency, we should make a call on whether to keep synchronized or not.
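
A minimal sketch of the synchronized option discussed in this thread follows. It is an assumption about how such a guard could look, not the change that was committed: the class name `SynchronizedCollectorSketch` is hypothetical, and returning an `Optional` from `getResult()` is a choice made here only to make "not collected yet" visible.

```java
import java.util.Optional;
import java.util.function.Function;

// Hypothetical sketch of a collector whose result field is guarded by its own monitor.
final class SynchronizedCollectorSketch<I, R> {
    private final I input;
    private R result; // guarded by "this"

    SynchronizedCollectorSketch(I input) {
        this.input = input;
    }

    // Runs the (possibly slow) action outside the lock and publishes the result under it.
    void collect(Function<I, R> action) {
        R computed = action.apply(input);
        synchronized (this) {
            result = computed;
        }
    }

    // Safe to call from any thread at any time; empty until collect() has published a value.
    synchronized Optional<R> getResult() {
        return Optional.ofNullable(result);
    }

    public static void main(String[] args) throws InterruptedException {
        SynchronizedCollectorSketch<String, Integer> collector = new SynchronizedCollectorSketch<>("hybrid");
        Thread worker = new Thread(() -> collector.collect(String::length));
        worker.start();
        worker.join();
        System.out.println(collector.getResult()); // Optional[6]
    }
}
```

Keeping the action outside the synchronized block means the lock is held only for the field write, so the cost under discussion is a pair of uncontended monitor operations per sub-query rather than serialization of the work itself.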

src/main/java/org/opensearch/neuralsearch/query/HybridQueryRewriteCollectorManager.java

src/main/java/org/opensearch/neuralsearch/executors/HybridQueryExecutor.java

src/main/java/org/opensearch/neuralsearch/query/HybridQueryRewriteCollectorManager.java

martin-gaievski · 2024-05-22T17:12:38Z

src/main/java/org/opensearch/neuralsearch/query/HybridQuery.java

+                perform better. For hybrid query we need to track progress of re-write for all sub-queries */
+
+                boolean actuallyRewritten = rewrittenQuery != query;
+                return new AbstractMap.SimpleEntry(rewrittenQuery, actuallyRewritten);


Do we have data, or can we run a benchmark, to see how much overhead (if any) the Optional approach adds? I agree the code looks simpler this way, but if we're aiming for a performance improvement and this is a critical section, then we may want to save the CPU cycles spent wrapping and unwrapping the result in an Optional.

martin-gaievski · 2024-05-22T17:16:07Z

src/main/java/org/opensearch/neuralsearch/query/HybridQuery.java

+                perform better. For hybrid query we need to track progress of re-write for all sub-queries */
+
+                boolean actuallyRewritten = rewrittenQuery != query;
+                return new AbstractMap.SimpleEntry(rewrittenQuery, actuallyRewritten);


And +1 to Navneet's point regarding the map, and specifically using Query as a key. I was using a similar approach in the initial implementation of hybrid query; the problem is that for some queries hashCode can be slow.
How many times will this be called: once per sub-query x number of shards?

src/main/java/org/opensearch/neuralsearch/query/HybridQueryScorer.java

Signed-off-by: Vijayan Balasubramanian <[email protected]>

martin-gaievski

Looks good to me, thank you Vijay.
Please plan to add unit tests for the new classes from the executor package. We can do it in a separate PR.

Signed-off-by: Vijayan Balasubramanian <[email protected]>

…arch-project#749) Implement parallel execution of sub-queries for hybrid search Add new thread pool to schedule tasks that are related to hybrid query execution Register executor builders with Plugin Use Lucene's Task Executor to execute and collect results Parallelize Query re-write Parallelize score supplier creation Parallelize build hybrid scores Signed-off-by: Vijayan Balasubramanian <[email protected]>

* Implement parallel execution of sub-queries for hybrid search (#749) Add new thread pool to schedule tasks that are related to hybrid query execution Register executor builders with Plugin Use Lucene's Task Executor to execute and collect results Parallelize Query re-write Parallelize score supplier creation Parallelize build hybrid scores Signed-off-by: Vijayan Balasubramanian <[email protected]>

* Implement parallel execution of sub-queries for hybrid search (#749) Add new thread pool to schedule tasks that are related to hybrid query execution Register executor builders with Plugin Use Lucene's Task Executor to execute and collect results Parallelize Query re-write Parallelize score supplier creation Parallelize build hybrid scores Signed-off-by: Vijayan Balasubramanian <[email protected]> (cherry picked from commit 76090de)

…arch-project#781) * Implement parallel execution of sub-queries for hybrid search (opensearch-project#749) Add new thread pool to schedule tasks that are related to hybrid query execution Register executor builders with Plugin Use Lucene's Task Executor to execute and collect results Parallelize Query re-write Parallelize score supplier creation Parallelize build hybrid scores Signed-off-by: Vijayan Balasubramanian <[email protected]>

… search (#749) (#786) * Implement parallel execution of sub-queries for hybrid search (#781) * Implement parallel execution of sub-queries for hybrid search (#749) Add new thread pool to schedule tasks that are related to hybrid query execution Register executor builders with Plugin Use Lucene's Task Executor to execute and collect results Parallelize Query re-write Parallelize score supplier creation Parallelize build hybrid scores Signed-off-by: Vijayan Balasubramanian <[email protected]> * Update package name in 2.15 which is different from main Signed-off-by: Vijayan Balasubramanian <[email protected]> --------- Signed-off-by: Vijayan Balasubramanian <[email protected]>

VijayanB requested review from heemin32, navneet1v, vamshin, jmazanec15, naveentatikonda, junqiu-lei, martin-gaievski, sean-zheng-amazon, model-collapse, zane-neo, ylwu-amzn, jngz-es, vibrantvarun and zhichao-aws as code owners May 15, 2024 20:29

VijayanB self-assigned this May 15, 2024

VijayanB marked this pull request as draft May 15, 2024 20:30

VijayanB changed the title ~~Parallelize hybrid query processor~~ Implement parallel execution of sub-queries for hybrid search May 16, 2024

VijayanB force-pushed the hybrid-parallel-processor branch from 53b9260 to 04ab2ac Compare May 16, 2024 21:15

VijayanB marked this pull request as ready for review May 16, 2024 21:16

VijayanB force-pushed the hybrid-parallel-processor branch 2 times, most recently from f7e0ef0 to 2750ad4 Compare May 16, 2024 23:29

VijayanB force-pushed the hybrid-parallel-processor branch from 2750ad4 to d1e4789 Compare May 17, 2024 00:23

navneet1v reviewed May 17, 2024


VijayanB requested a review from navneet1v May 20, 2024 19:04

shatejas reviewed May 20, 2024


chishui reviewed May 21, 2024


martin-gaievski reviewed May 22, 2024


VijayanB requested review from chishui and martin-gaievski May 24, 2024 02:57

VijayanB added 16 commits May 29, 2024 12:50

Add executor builders

08aa8c8

Signed-off-by: Vijayan Balasubramanian <[email protected]>

Parallelize score supplier

7f706ee

Signed-off-by: Vijayan Balasubramanian <[email protected]>

parallelize query rewrite

0bb9947

Signed-off-by: Vijayan Balasubramanian <[email protected]>

Parallelize get hybrid scores

d90f99e

Signed-off-by: Vijayan Balasubramanian <[email protected]>

Add documentation

8ad4743

Signed-off-by: Vijayan Balasubramanian <[email protected]>

update changelog

bb81c27

Signed-off-by: Vijayan Balasubramanian <[email protected]>

replace stream with loop

49383fc

Signed-off-by: Vijayan Balasubramanian <[email protected]>

remove unused generic variable

ebd3b24

Signed-off-by: Vijayan Balasubramanian <[email protected]>

Make error message more verbose

350a2b4

Signed-off-by: Vijayan Balasubramanian <[email protected]>

refactor

f08a37c

Signed-off-by: Vijayan Balasubramanian <[email protected]>

fix code reviews

62be918

Signed-off-by: Vijayan Balasubramanian <[email protected]>

Update documentation

7701213

Signed-off-by: Vijayan Balasubramanian <[email protected]>

Create a wrapper instead of Map.Entry

e061a1e

Signed-off-by: Vijayan Balasubramanian <[email protected]>

Synchronize result variable

82dabae

Signed-off-by: Vijayan Balasubramanian <[email protected]>

Improve documentation

f50357d

Signed-off-by: Vijayan Balasubramanian <[email protected]>

Refactor executor manager into executor package

6d8c080

Signed-off-by: Vijayan Balasubramanian <[email protected]>

VijayanB changed the base branch from feature/parallelize-hybrid-query to feature/parallelize-hybrid-search May 29, 2024 23:06

VijayanB force-pushed the hybrid-parallel-processor branch from bf1dc5d to 6d8c080 Compare May 29, 2024 23:08

VijayanB requested a review from martin-gaievski May 29, 2024 23:10

martin-gaievski approved these changes May 30, 2024


Update access modifier to be public

bc6b885

Signed-off-by: Vijayan Balasubramanian <[email protected]>

naveentatikonda approved these changes May 31, 2024


VijayanB merged commit d889f37 into opensearch-project:feature/parallelize-hybrid-search Jun 3, 2024
64 of 70 checks passed

VijayanB mentioned this pull request Jun 11, 2024

[Backport 2.x] Implement parallel execution of sub-queries for hybrid search (#749) #786

Merged

5 tasks
