Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix query timeout handling by checking searchContext.isSearchTimedOut() #16882

Closed
wants to merge 1 commit into from
Closed
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
Expand Up @@ -74,6 +74,7 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
- Skip remote-repositories validations for node-joins when RepositoriesService is not in sync with cluster-state ([#16763](https://github.com/opensearch-project/OpenSearch/pull/16763))
- Fix _list/shards API failing when closed indices are present ([#16606](https://github.com/opensearch-project/OpenSearch/pull/16606))
- Fix remote shards balance ([#15335](https://github.com/opensearch-project/OpenSearch/pull/15335))
- Fix query timeout handling by checking searchContext.isSearchTimedOut() ([#16882](https://github.com/opensearch-project/OpenSearch/pull/16882))

### Security

Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -305,6 +305,10 @@
*/
@Override
protected void searchLeaf(LeafReaderContext ctx, Weight weight, Collector collector) throws IOException {
// If search is already flagged as timed out, do not continue
if (searchContext.isSearchTimedOut()) {
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jed326 @sohami @kkewwei it seems like when we were adapting to concurrent search, the exception driven flow was changed to state driven one, and the leaves are being processed even when the search request has exceeded the time budget (timed out).

Copy link
Collaborator

@jed326 jed326 Dec 18, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @reta, I will have more time to take a closer look at this later this week.

From what I recall off the top of my head the cancellable.checkCancelled(); right below this should already be doing the timeout checking and then throwing the QueryPhase.TimeExceededException. The searchContext.setSearchTimedOut(true); logic was needed because with concurrent search any exceptions thrown in searchLeaf would be handled by the Lucene TaskExecutor and it was difficult to figure out from the query phase thread if the search had actually timed out or not.

Copy link
Collaborator Author

@reta reta Dec 18, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @reta, I will have more time to take a closer look at this later this week.

Thanks @jed326 , I will look closely as well, but cancellable.checkCancelled() seems to be somehow not short-circuiting the search, I could clearly see that by checking how long tests are taking w/o this change.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cancellable.checkCancelled() seems to be somehow not short-circuiting the search

That's definitely strange. I do see that searchLeaf gets called for each leaf even post the timeout and we leave it to cancellable.checkCancelled() to throw an exception which we than catch for each leaf. I wonder if this repeated exception handling is what is causing it to take longer?

Maybe we should even have the searchContext.isSearchTimedOut() check in the search method above so we can short circuit those for loops when the timeout is hit?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we should even have the searchContext.isSearchTimedOut() check in the search method above so we can short circuit those for loops when the timeout is hit?

What I've seen, the timeout is hit by first searchLeaf call (those are called in loop), so the check in search method may not need this check (I think)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@reta, In the next code: cancellable.checkCancelled()(line 316) will do the timeout check, it looks like duplicate work.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I could clearly see that by checking how long tests are taking w/o this change.

Currently the only reason I can think why this would be the case is because cancellable.checkCancelled() involves throwing and catching an exception which is slowing things down. I think we may need to run this piece of code through a profiler to understand why.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jed326 @kkewwei thanks for the feedback folks, will spend more time looking into the cause

return;

Check warning on line 310 in server/src/main/java/org/opensearch/search/internal/ContextIndexSearcher.java

View check run for this annotation

Codecov / codecov/patch

server/src/main/java/org/opensearch/search/internal/ContextIndexSearcher.java#L310

Added line #L310 was not covered by tests
}

// Check if at all we need to call this leaf for collecting results.
if (canMatch(ctx) == false) {
Expand Down
Loading