Implement parallel execution of sub-queries for hybrid search #749

Conversation

Member

@VijayanB VijayanB commented May 15, 2024

Description

  1. Add new thread pool to schedule tasks that are related to hybrid query execution
  2. Register executor builders with Plugin
  3. Use Lucene's Task Executor to execute and collect results
  4. Parallelize Query re-write
  5. Parallelize score supplier creation
  6. Parallelize build hybrid scores

Issues Resolved

#279

Check List

  • New functionality includes testing.
    • All tests pass
  • New functionality has been documented.
    • New functionality has javadoc added
  • Commits are signed as per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@VijayanB VijayanB self-assigned this May 15, 2024
@VijayanB VijayanB marked this pull request as draft May 15, 2024 20:30
@VijayanB VijayanB changed the title Parallelize hybrid query processor Implement parallel execution of sub-queries for hybrid search May 16, 2024
@VijayanB VijayanB force-pushed the hybrid-parallel-processor branch from 53b9260 to 04ab2ac Compare May 16, 2024 21:15
@VijayanB VijayanB marked this pull request as ready for review May 16, 2024 21:16
@VijayanB VijayanB force-pushed the hybrid-parallel-processor branch 2 times, most recently from f7e0ef0 to 2750ad4 Compare May 16, 2024 23:29

codecov bot commented May 16, 2024

Codecov Report

Attention: Patch coverage is 83.78378% with 12 lines in your changes are missing coverage. Please review.

Please upload report for BASE (feature/parallelize-hybrid-search@806042c). Learn more about missing BASE report.

Current head 2750ad4 differs from pull request most recent head bc6b885

Please upload reports for the commit bc6b885 to get more accurate results.

Files | Patch % | Lines
...ch/neuralsearch/executors/HybridQueryExecutor.java | 44.44% | 4 Missing and 1 partial ⚠️
...org/opensearch/neuralsearch/query/HybridQuery.java | 89.47% | 2 Missing ⚠️
...ensearch/neuralsearch/query/HybridQueryScorer.java | 86.66% | 2 Missing ⚠️
...ensearch/neuralsearch/query/HybridQueryWeight.java | 85.71% | 2 Missing ⚠️
...g/opensearch/neuralsearch/plugin/NeuralSearch.java | 50.00% | 1 Missing ⚠️
Additional details and impacted files
@@                         Coverage Diff                          @@
##             feature/parallelize-hybrid-search     #749   +/-   ##
====================================================================
  Coverage                                     ?   84.89%           
  Complexity                                   ?      812           
====================================================================
  Files                                        ?       65           
  Lines                                        ?     2490           
  Branches                                     ?      410           
====================================================================
  Hits                                         ?     2114           
  Misses                                       ?      213           
  Partials                                     ?      163           


@VijayanB VijayanB force-pushed the hybrid-parallel-processor branch from 2750ad4 to d1e4789 Compare May 17, 2024 00:23
Collaborator

@navneet1v navneet1v left a comment

Looked at the PR at a high level and added comments. Please resolve them; after that I will do one more review of the PR.

perform better. For hybrid query we need to track progress of re-write for all sub-queries */

boolean actuallyRewritten = rewrittenQuery != query;
return new AbstractMap.SimpleEntry(rewrittenQuery, actuallyRewritten);

Collaborator

Can we avoid storing queries in a map? And who is going to store this simple map entry?

Member Author

HybridQueryExecutorCollector.collect() will store this entry. If we don't store the map entry, we can't hand the results back to the caller once the execution is completed.

Contributor

@shatejas shatejas May 20, 2024

See if you can use Optional instead. That would simplify the data structure and make it more readable:

return rewrittenQuery != query ? Optional.of(rewrittenQuery) : Optional.empty();

Member

Do we have data, or can we run a benchmark, to see how much overhead (if any) the Optional approach adds? I agree the code looks simpler this way, but if we're aiming for a performance improvement and this is a critical section, then we may want to save the CPU cycles spent wrapping and unwrapping the result in an Optional.

Member

And +1 to Navneet's point regarding the map, and specifically using Query as a key. I used a similar approach in the initial implementation of hybrid query; the problem is that for some queries hashCode can be slow.
How many times will this be called: once per sub-query x number of shards?

Member

Understood, I mixed up Map.Entry<Query, Boolean> with Map<Query, Boolean>. I guess Map.Entry is used here just to hold a pair of objects. I'm good now, thanks Vijay.

Member

Used optional to avoid null condition, but we need query irrespective of whether query is re-written or not.

We can explicitly check for null; my point was about performance. Calling Optional.of(rewrittenQuery) first and optional.get() later both add some cycles for wrapping the query object in an Optional and then unwrapping it. If we have benchmarks with and without Optional, we can use them as a data point to make a decision.

Member Author

Sounds good. I will run benchmarks with and without Optional before merging into main. In the meantime, can we keep Optional? What do you think?

Member

Yes. I'm not sure what the performance delta is, if any. Let's keep Optional, run the benchmark, and act on the data.

Contributor

@shatejas shatejas May 23, 2024

The intent was never to have an Optional as a member variable; the intent was only to return an Optional to indicate whether the value was rewritten. It got misunderstood.

I realize that the query is needed regardless of whether it's rewritten, so Optional cannot be used. An Entry is a good option here.

In terms of performance, if Optional does turn out to be an issue, then we might have to deep dive and see whether wrapping results in an object is an issue overall.
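
The trade-off discussed in this thread can be sketched as follows. This is a minimal illustration and not the PR's code: `RewriteResultSketch`, `rewriteAsEntry`, and `rewriteAsOptional` are hypothetical names, and a `UnaryOperator` stands in for the query rewrite step. It assumes, as the `rewrittenQuery != query` check in the diff does, that an unchanged query comes back as the same instance.

```java
import java.util.AbstractMap;
import java.util.Map;
import java.util.Optional;
import java.util.function.UnaryOperator;

final class RewriteResultSketch {

    // Entry variant: always carries the query to use next, plus a "was it rewritten?" flag.
    static <Q> Map.Entry<Q, Boolean> rewriteAsEntry(Q query, UnaryOperator<Q> rewriter) {
        Q rewritten = rewriter.apply(query);
        // Identity comparison (assumption): an unchanged query is returned as the same instance.
        return new AbstractMap.SimpleEntry<>(rewritten, rewritten != query);
    }

    // Optional variant: only signals a change, so the caller must still hold on to the original query.
    static <Q> Optional<Q> rewriteAsOptional(Q query, UnaryOperator<Q> rewriter) {
        Q rewritten = rewriter.apply(query);
        return rewritten != query ? Optional.of(rewritten) : Optional.empty();
    }

    public static void main(String[] args) {
        String query = "title:foo";
        Map.Entry<String, Boolean> unchanged = rewriteAsEntry(query, q -> q);
        System.out.println(unchanged.getKey() + " rewritten=" + unchanged.getValue());
        System.out.println("optional present=" + rewriteAsOptional(query, q -> q).isPresent());
    }
}
```

Note that `Map.Entry` is used here purely as a pair; nothing is inserted into a `Map`, so `Query.hashCode()` is never invoked by this structure.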

@VijayanB VijayanB requested a review from navneet1v May 20, 2024 19:04
perform better. For hybrid query we need to track progress of re-write for all sub-queries */
actuallyRewritten |= rewrittenSub != subQuery;
rewrittenSubQueries.add(rewrittenSub);
final HybridQueryExecutorCollector<IndexSearcher, Map.Entry<Query, Boolean>> collector = manager.newCollector();

Contributor

@shatejas shatejas May 20, 2024

Just for my understanding, why is a collector chosen here for collecting results? You could have plugged in TaskExecutor here and then simply used the list of results returned by invokeAll. The entire collector seems to be a roundabout way to get the results. A manager would have been useful if you wanted to return a different collector based on certain conditions, IMO.

Member Author

Good question. The main idea is to move the logic of collecting and reducing/merging into its own manager. By introducing TaskExecutor inside Scorer/Weight we would be coupling how to parallelize with what needs to be parallelized. With the introduction of Collector and Collector Manager, we decoupled parallelization from the logic. IMO, this abstracts the parallelization implementation from all of its callers.

Contributor

Let's dissect this a little more. There are 3 classes here:

  1. HybridQueryExecutor: responsible for initializing and holding a common thread pool for query, scorer and weight
  2. HybridQueryExecutorCollectorManager: responsible for giving a new collector
  3. HybridQueryExecutorCollector: holds the query results

With the introduction of Collector and Collector Manager, we decoupled parallelization from the logic. IMO, this abstracts the parallelization implementation from all of its callers.

HybridQueryExecutor seems to be a good wrapper to make sure the same task executor is used. I am more curious about the additional value of HybridQueryExecutorCollectorManager and HybridQueryExecutorCollector.

Why can't it be

final List<Callable<Entry<Query, Boolean>>> queryRewriteTasks = new ArrayList<>();
// for each subquery
queryRewriteTasks.add(() -> rewriteQuery(subQuery)); // rewriteQuery needs to return Entry<Query, Boolean>
List<Entry<Query, Boolean>> rewrittenQueries = HybridQueryExecutor.getExecutor().invokeAll(queryRewriteTasks);

// rest of the logic

This way you don't have to worry about the thread safety of the collectors. There are some utility methods in the collector manager which aren't related to managing a collector but are more related to the rewritten queries themselves.

Overall, while this works fine, I want to make sure these interfaces have a clear definition and that it's clear how to use them. Currently it seems like we can simplify this by not having them, unless I am misunderstanding the value they bring.
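
To make the collector-free alternative proposed in this comment concrete, here is a minimal, self-contained sketch. It rests on several assumptions: `java.util.concurrent.ExecutorService` stands in for Lucene's task executor (whose exact `invokeAll` signature is not reproduced here), a `UnaryOperator` stands in for the rewrite call, and `InvokeAllRewriteSketch`, `RewriteOutcome`, and `rewriteAll` are hypothetical names. Each task returns its own pair, and the caller reduces the pairs after all tasks finish, so no shared mutable state is touched by worker threads.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.UnaryOperator;

final class InvokeAllRewriteSketch {

    /** Reduced outcome: every sub-query after rewrite, and whether any of them changed. */
    static final class RewriteOutcome<Q> {
        final List<Q> subQueries = new ArrayList<>();
        boolean anyRewritten = false;
    }

    static <Q> RewriteOutcome<Q> rewriteAll(ExecutorService pool, List<Q> subQueries, UnaryOperator<Q> rewriter)
            throws InterruptedException, ExecutionException {
        // One task per sub-query; each returns (rewritten query, changed?) as a pair.
        List<Callable<Map.Entry<Q, Boolean>>> tasks = new ArrayList<>();
        for (Q subQuery : subQueries) {
            tasks.add(() -> {
                Q rewritten = rewriter.apply(subQuery);
                Map.Entry<Q, Boolean> entry = new AbstractMap.SimpleEntry<>(rewritten, rewritten != subQuery);
                return entry;
            });
        }
        // invokeAll blocks until all tasks complete and returns futures in task order.
        RewriteOutcome<Q> outcome = new RewriteOutcome<>();
        for (Future<Map.Entry<Q, Boolean>> future : pool.invokeAll(tasks)) {
            Map.Entry<Q, Boolean> entry = future.get();
            outcome.subQueries.add(entry.getKey());
            if (entry.getValue()) {
                outcome.anyRewritten = true;
            }
        }
        return outcome;
    }

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        try {
            RewriteOutcome<String> outcome = rewriteAll(pool, List.of("match", "knn"), q -> q.equals("knn") ? "KNN" : q);
            System.out.println(outcome.subQueries + " anyRewritten=" + outcome.anyRewritten);
        } finally {
            pool.shutdown();
        }
    }
}
```

The reduction runs on the calling thread only, which is the property the comment points to when it says thread safety of collectors would no longer be a concern.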

* Query phase to parallelize sub query's action to improve latency
*/
@RequiredArgsConstructor(staticName = "newCollector")
public final class HybridQueryExecutorCollector<I, R> {

Contributor

Is HybridQueryExecutorCollector thread safe?

Member Author

There are no setters, and the only way to update the result is by calling the collect method. Do you have any particular concerns?

Contributor

Since collect() could set the result, if collect() and getResult() are called asynchronously, technically there could be a race condition, right?

Member Author

@VijayanB VijayanB May 22, 2024

You are right. Currently getResult() is called from the collector manager, where the contract is that it can be called only after all collectors have finished collecting. This is taken care of in the implementation. I can add this note in the collector as well. I am reluctant to add synchronization to the result variable, since it will add latency for a scenario that is not possible at this moment. Do you have any other suggestions?

Contributor

Adding a comment is well intentioned, but you won't be able to enforce it. You are relying on the hope that the code doesn't change in the future in a way that hits that race condition, so there will always be a risk.

If you really want to enforce it, your options here are to synchronize, or to move away from result collection to returning results, as pointed out in this thread.

Contributor

I am reluctant to add synchronization to the result variable, since it will add latency for a scenario that is not possible at this moment.

Actually, the overhead of synchronization is way less than we thought. Here are some synchronization benchmark articles I found:

Member Author

@shatejas @chishui Fair enough. I am changing it to synchronized. However, while benchmarking I plan to compare with and without synchronization to confirm that it does not add latency to this block.

Member

+1 to @VijayanB's comment. The purpose of this whole PR is to reduce latency. Therefore, @VijayanB, it would be good if you compare the benchmark results. If you see even a little degradation in latency, then we should make a call on whether to keep synchronized or not.
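
For reference, the synchronized variant agreed on in this thread could look roughly like the sketch below. It is an illustration under assumptions, not the merged implementation: `SynchronizedCollectorSketch`, `ResultCollector`, and `collectAll` are hypothetical names, and a plain `ExecutorService` stands in for the hybrid query thread pool. The action itself runs outside the lock; only the write and the read of the result field are guarded.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Function;

final class SynchronizedCollectorSketch {

    /** One collector per task: applies an action to its input and remembers the result. */
    static final class ResultCollector<I, R> {
        private final I input;
        private R result; // guarded by "this"

        ResultCollector(I input) {
            this.input = input;
        }

        // Runs the (possibly slow) action without holding the lock, then publishes the result under it.
        void collect(Function<I, R> action) {
            R computed = action.apply(input);
            synchronized (this) {
                result = computed;
            }
        }

        // Empty until collect() has stored a non-null result.
        synchronized Optional<R> getResult() {
            return Optional.ofNullable(result);
        }
    }

    static <I, R> List<R> collectAll(ExecutorService pool, List<I> inputs, Function<I, R> action)
            throws InterruptedException {
        List<ResultCollector<I, R>> collectors = new ArrayList<>();
        List<Callable<Void>> tasks = new ArrayList<>();
        for (I input : inputs) {
            ResultCollector<I, R> collector = new ResultCollector<>(input);
            collectors.add(collector);
            tasks.add(() -> {
                collector.collect(action);
                return null;
            });
        }
        pool.invokeAll(tasks); // blocks until every collect() call has finished
        List<R> results = new ArrayList<>();
        for (ResultCollector<I, R> collector : collectors) {
            collector.getResult().ifPresent(results::add);
        }
        return results;
    }

    public static void main(String[] args) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        try {
            System.out.println(collectAll(pool, List.of("match", "knn"), String::length));
        } finally {
            pool.shutdown();
        }
    }
}
```

Because the lock is held only for a single field assignment or read, the critical section is short; whether that cost is measurable in the query path is the question the benchmark comparison mentioned above is meant to answer.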

VijayanB added 16 commits May 29, 2024 12:50
Signed-off-by: Vijayan Balasubramanian <[email protected]>
@VijayanB VijayanB changed the base branch from feature/parallelize-hybrid-query to feature/parallelize-hybrid-search May 29, 2024 23:06
@VijayanB VijayanB force-pushed the hybrid-parallel-processor branch from bf1dc5d to 6d8c080 Compare May 29, 2024 23:08
@VijayanB VijayanB requested a review from martin-gaievski May 29, 2024 23:10
Member

@martin-gaievski martin-gaievski left a comment

Looks good to me, thank you Vijay.
Please plan to add unit tests for the new classes from the executor package. We can do it in a separate PR.

Signed-off-by: Vijayan Balasubramanian <[email protected]>
@VijayanB VijayanB merged commit d889f37 into opensearch-project:feature/parallelize-hybrid-search Jun 3, 2024
64 of 70 checks passed
VijayanB added a commit to VijayanB/neural-search that referenced this pull request Jun 6, 2024
…arch-project#749)

Implement parallel execution of sub-queries for hybrid search

Add new thread pool to schedule tasks that are related to hybrid query execution
Register executor builders with Plugin
Use Lucene's Task Executor to execute and collect results
Parallelize Query re-write
Parallelize score supplier creation
Parallelize build hybrid scores

Signed-off-by: Vijayan Balasubramanian <[email protected]>
VijayanB added a commit to VijayanB/neural-search that referenced this pull request Jun 10, 2024
…arch-project#749)

VijayanB added a commit that referenced this pull request Jun 11, 2024
* Implement parallel execution of sub-queries for hybrid search (#749)

opensearch-trigger-bot bot pushed a commit that referenced this pull request Jun 11, 2024
* Implement parallel execution of sub-queries for hybrid search (#749)

(cherry picked from commit 76090de)
VijayanB added a commit to VijayanB/neural-search that referenced this pull request Jun 11, 2024
…arch-project#781)

* Implement parallel execution of sub-queries for hybrid search (opensearch-project#749)

VijayanB added a commit that referenced this pull request Jun 11, 2024
… search (#749) (#786)

* Implement parallel execution of sub-queries for hybrid search (#781)

* Implement parallel execution of sub-queries for hybrid search (#749)


* Update package name in 2.15 which is different from main

Signed-off-by: Vijayan Balasubramanian <[email protected]>

---------

Signed-off-by: Vijayan Balasubramanian <[email protected]>
7 participants