Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] Data race due to mutable feature vectors being shared across thread boundaries when using concurrent segment search #52

Closed
jhinch-at-atlassian-com opened this issue Oct 22, 2024 · 4 comments · Fixed by #54
Labels
bug Something isn't working

Comments

@jhinch-at-atlassian-com

What is the bug?

OpenSearch 2.12 stablised concurrent segment search which results in segments being search on different threads in parallel. The LTR plugin makes use of a MutableSupplier to share feature vectors between different parts of the code. For example in DerivedExpressionQuery and ScriptFeature. Due to the implementation being made using an atomic reference (e34d607) this results in feature vectors potentially for different documents to be retrieved, resulting in the wrong feature scores and also data races.

How can one reproduce the bug?

Steps to reproduce the behavior.

What is the expected behavior?

A clear and concise description of what you expected to happen.

What is your host/environment?

Operating system, version.

Do you have any screenshots?

If applicable, add screenshots to help explain your problem.

Do you have any additional context?

The underlying issue seems to be that the vector is being shared across LeafReaderContext boundaries. If either a supplier was created one per LeafReaderContext or the feature vector sharing is kept unique for each LeafReaderContext then it should work as expected

@andrross
Copy link
Member

Thanks @jhinch-at-atlassian-com.

@msfroh can you share any thoughts here?

@msfroh
Copy link

msfroh commented Oct 23, 2024

Yeah... this comment looks like a bit of a smoking gun:

// XXX: this is not thread safe and may run into extremely weird issues
// if the searcher uses the parallel collector
// Hopefully elastic never runs
MutableSupplier<LtrRanker.FeatureVector> vectorSupplier = new Suppliers.MutableSupplier<>();

Now that we are using the concurrent CollectorManager API, the predicted breakage came true.

@msfroh
Copy link

msfroh commented Oct 23, 2024

@jhinch-at-atlassian-com -- Would you be able to develop and test a fix for this?

I think you're right that scoping the supplier to the given LeafCollector would work. Now that Lucene Weights can create a ScorerSupplier per LeafReaderContext, we should modify any custom weights to implement the scorerSupplier method (since I think the old scorer method was removed in Lucene 10? Or at least scorerSupplier is the preferred method now.) Any shared state should probably be scoped to the returned ScorerSupplier. That will also work if/when we upgrade to Lucene 10, which now support intra-segment concurrency -- so multiple ScorerSuppliers (and LeafCollectors) may be created for the same LeafReaderContext.

@jhinch-at-atlassian-com
Copy link
Author

I had a brief exploration into fixing this. It looks like using a thread local is the most convenient fix. The thread local is created when the a new feature vector is created and cleared after the ranker score is computed. I will raise a PR with the candidate fix shortly

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
Archived in project
3 participants