[GITHUB-11915] Make Lucene smarter about long runs of matches via new API on DISI #12194
Conversation
```java
while (++wordIndex < numWords) {
  word = bits[wordIndex];
  if (Long.bitCount(word) != Long.SIZE) { // there are 0s in this word
```
do you really need to call the bitcount instruction and compare to 64, or can you just do:

```java
if (word != -1L) { // there are 0s in this word
```
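For reference, a minimal sketch of the word scan with this simplification applied (the helper name is hypothetical, not the actual Lucene code):

```java
/** Returns the index of the first word at or after wordIndex containing a 0 bit, or -1 if none. */
static int firstWordWithZero(long[] bits, int wordIndex) {
  for (; wordIndex < bits.length; wordIndex++) {
    // -1L is the all-ones word, so any other value contains at least one 0 bit;
    // this replaces the Long.bitCount(word) != Long.SIZE check above.
    if (bits[wordIndex] != -1L) {
      return wordIndex;
    }
  }
  return -1;
}
```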
Ah yes! I wonder why I didn't think of it in the first place...will update them in the next commit
There's also a test missing, see the comment below.
Updated this one and a few more places to not use Long#bitCount.

There are no tests covering the correctness of those new BitSet methods for any implementation (Fixed, Sparse, ...). Would it be possible to add them?
Thanks for the review and comment @uschindler! This PR is currently in draft state to facilitate discussion and make sure I'm on the right track. I will definitely add dedicated tests before converting it to a formal PR.
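For what it's worth, a sketch of the kind of cross-check such tests could use, with java.util.BitSet as a trivially correct oracle (the nextClearBit method on FixedBitSet below is hypothetical, standing in for whatever run-detection method this PR actually adds):

```java
import java.util.Random;
import org.apache.lucene.util.FixedBitSet;

public class TestBitSetRuns {
  public static void main(String[] args) {
    Random random = new Random(42);
    int numBits = 1 + random.nextInt(10_000);
    FixedBitSet fixed = new FixedBitSet(numBits);
    java.util.BitSet reference = new java.util.BitSet(numBits);
    // Fill both bitsets identically with random bits.
    for (int i = 0; i < numBits; i++) {
      if (random.nextBoolean()) {
        fixed.set(i);
        reference.set(i);
      }
    }
    for (int from = 0; from < numBits; from++) {
      // java.util.BitSet#nextClearBit is the oracle; nextClearBit on FixedBitSet
      // is hypothetical, standing in for the new method under test.
      int expected = reference.nextClearBit(from);
      int actual = fixed.nextClearBit(from); // hypothetical API
      if (expected != actual) {
        throw new AssertionError("mismatch at " + from + ": " + expected + " != " + actual);
      }
    }
  }
}
```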
Thanks for looking into it! Did you manage to observe some speedups with this change? You explored implementing this new API in several different places (BitSetIterator, doc-value iterators, postings, etc.), and it's already a bit exhausting to review and will get worse when we add more tests. I think it would be helpful to focus the initial PR on a single thing: proving that this API is a good addition, adding good testing (e.g. enhancing AssertingScorer, CheckIndex and other similar classes to check that it is implemented correctly), and then implementing the new API on other DocIdSetIterator implementations in follow-up PRs.
```java
/**
 * Returns the next doc ID that may not be a match. Note that this API will essentially provide
 * the following two guarantees:
```
I would also add that it's illegal to call this method once the iterator is exhausted.
Yeah, I have been wondering how to handle this case properly, hence the multiple NO_MORE_DOCS returned below. By making it illegal to call this method once the iterator is exhausted, do you mean we need to throw an exception here? As this condition may be pretty common, I'm wondering if we could give callers an easier way to detect / handle this situation. Is it ok if we always return the last non-matching doc (or maxDoc) under this scenario, and warn callers to check whether the iterator's current doc is NO_MORE_DOCS to detect it?
While on this topic, I'm also wondering if we should adjust the behavior of BitSet#nextSetBit, as it returns NO_MORE_DOCS when there are no more set bits (as opposed to maxDoc):

lucene/core/src/java/org/apache/lucene/util/BitSet.java, lines 80 to 83 in 96efb34:

```java
 * Returns the index of the first set bit starting at the index specified. {@link
 * DocIdSetIterator#NO_MORE_DOCS} is returned if there are no more set bits.
 */
public abstract int nextSetBit(int index);
```

In contrast, Java's BitSet#nextSetBit would return -1 in such a scenario.
Never mind on the BitSet question above, I got myself confused.
> By making it illegal to call this method once the iterator is exhausted, do you mean we need to throw an exception here?

No, I was thinking of just documenting the behavior as undefined, and throwing assertions in AssertingScorer.

> Is it ok if we always return the last non-matching doc (or maxDoc) under this scenario, and warn callers to check whether the iterator's current doc is NO_MORE_DOCS to detect it?

Yes, exactly. This is the same as for nextDoc or advance: it is illegal to call these methods when the current doc is NO_MORE_DOCS.
Makes sense!
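Pulling the thread together, a hedged sketch of the caller-side contract (peekNextNonMatchingDocID is the API this PR proposes; the consumption loop, and the assumption that every doc in [docID(), upTo) matches, are illustrative):

```java
import java.io.IOException;
import org.apache.lucene.search.DocIdSetIterator;

static void consumeRuns(DocIdSetIterator disi) throws IOException {
  int doc = disi.nextDoc();
  while (doc != DocIdSetIterator.NO_MORE_DOCS) {
    // Legal only while the iterator is not exhausted, mirroring nextDoc()/advance().
    int upTo = disi.peekNextNonMatchingDocID(); // proposed API
    for (int d = doc; d < upTo; d++) {
      // assumed guarantee: every doc in [doc, upTo) is a match, consumable in bulk
      System.out.println("match: " + d);
    }
    doc = disi.advance(upTo);
  }
  // doc is now NO_MORE_DOCS: calling peekNextNonMatchingDocID here is undefined.
}
```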
```java
@@ -82,6 +82,11 @@ public int advance(int target) throws IOException {
    return doc;
  }

  @Override
  public int peekNextNonMatchingDocID() {
    return NO_MORE_DOCS;
```
it should return maxDoc, since maxDoc would not be a match
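A self-contained sketch of that point, using a hypothetical dense iterator in the spirit of DocIdSetIterator.all(maxDoc) (not the exact class from the diff above):

```java
import org.apache.lucene.search.DocIdSetIterator;

final class AllDocsIterator extends DocIdSetIterator {
  private final int maxDoc;
  private int doc = -1;

  AllDocsIterator(int maxDoc) {
    this.maxDoc = maxDoc;
  }

  @Override
  public int docID() {
    return doc;
  }

  @Override
  public int nextDoc() {
    return advance(doc + 1); // illegal once doc == NO_MORE_DOCS, per DISI contract
  }

  @Override
  public int advance(int target) {
    doc = target >= maxDoc ? NO_MORE_DOCS : target;
    return doc;
  }

  @Override
  public long cost() {
    return maxDoc;
  }

  /** Every doc in [0, maxDoc) matches, so the first non-match is maxDoc itself, not NO_MORE_DOCS. */
  public int peekNextNonMatchingDocID() {
    return maxDoc;
  }
}
```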
Updated.
```java
@Override
public int advance(int target) throws IOException {
  return reqApproximation.advance(target);
```
In general we like advance(docID() + 1) to not perform too differently from nextDoc(); maybe we should have the peekNextNonMatchingDocID logic here too?
Updated.
Thanks @jpountz for the review and comment!
So far I have only been able to run
Do you have any pointers on which benchmark task I could potentially use? If there isn't one available, I could try to add some next.
For sure. Once I'm able to benchmark this and observe good speedups & we are good with the API, I will break up this PR into smaller pieces.
Note: currently these two randomized tests will fail due to the implementations in IndexedDISI for doc values:

1. `gradlew :lucene:grouping:test --tests "org.apache.lucene.search.grouping.TestDoubleRangeGroupSelector.testGroupHeads" -Ptests.jvms=6 "-Ptests.jvmargs=-XX:TieredStopAtLevel=1 -XX:+UseParallelGC -XX:ActiveProcessorCount=1" -Ptests.seed=62C97A77054E21AA -Ptests.gui=false -Ptests.file.encoding=ISO-8859-1`
2. `gradlew :lucene:grouping:test --tests "org.apache.lucene.search.grouping.TestLongRangeGroupSelector.testGroupHeads" -Ptests.jvms=6 "-Ptests.jvmargs=-XX:TieredStopAtLevel=1 -XX:+UseParallelGC -XX:ActiveProcessorCount=1" -Ptests.seed=62C97A77054E21AA -Ptests.gui=false -Ptests.file.encoding=ISO-8859-1`
Fun! Thanks for picking this up @zacharymorn. What about updating
This idea will probably have diminishing returns after #12055, since we now prioritize only building the more sparse iterators into the bitset upfront, but it could still help. If you really want to look for impact, try switching
Oh and to clarify on my above comment, I’m not trying to create “scope creep” here. In fact, +1 to Adrien’s comment on limiting the scope of the initial PR. Just trying to suggest areas that might be good candidates for seeing initial impact.
Thanks @gsmiller for your review and suggestions!
I think @mdmarshmallow might be working on this as per #11915 (comment). As part of this PR, I've also added a new API to
Yeah I am working on it! Hopefully should be done by today as it's not that big of a change :)
Woohoo! Thanks @zacharymorn / @mdmarshmallow! I suspect we may not really see any benefit though if the DISI can only expose the next non-matching doc within its current block. I think the real advantage here would come from being able to actually skip blocks in the DISI, which would rely on knowing that there are actually multiple consecutive "dense" blocks in a DISI. But... if our only way to know there are multiple consecutive "dense" blocks involves decoding those blocks anyway, maybe there isn't much gain to be had? Hmm... not sure. Excited to see what we learn though!
Thanks @mdmarshmallow for working on it! Btw, I just pushed a commit (f78182b) that fixed some bugs identified by randomized tests, you may want to pull that for your work especially if you are using
@gsmiller Yeah, I'm guessing that as well, especially for postings and the sparse / dense blocks, as it would take at least one pass to identify the next candidate. I have tried caching the result as well and would like to see if that helps, and how it performs under benchmark.
Yeah, I was doing some of my own debugging and saw some of those issues. I think this fixed a decent number of the issues I was seeing, but I'm still seeing problems with some tests. I'm not sure if it's an issue with my code or not, so I'll need to dig a bit deeper.
Ok so I did some more investigation, I think there might be a bug with
Hmm, that seems strange, as I would think this API won't have a direct impact on
Sorry for the delayed response, so I saw it in my optimized version of
I'm also planning on benchmarking this and will post here when I do.
Edit: Ok so it turns out I hadn't rebased the changes; these are the actual perf numbers (deleted the old ones to not clutter up the comments):
But there doesn't seem to be much improvement either; I checked the CPU profiles and didn't see very many samples of
Maybe we could try to leverage the geonames dataset (there are a few benchmarks for it in lucene-util), which has a few low-cardinality fields like the time zone or country. Then enable index sorting on these fields, and make sure we're seeing performance gains when using the time zone in required or excluded clauses?
Thanks @mdmarshmallow for the PR and benchmark results. I have left a few comments there. I feel the reason you are not seeing much change in the benchmark could be that the implementation in
Thanks @jpountz for the suggestion! I actually did something similar by modifying the benchmarking code to support sorting the index by
However, while I was able to run this query, the test actually failed when verifying top matches & scores between the baseline and modified code. Upon further digging, I realized a potential issue with this API idea. Lucene does a lot of two-phase iteration, and a two-phase iterator's approximation may provide a superset of the actual matches. If we were to use this API to find and ignore / skip over a bunch of doc IDs from the approximation, wouldn't the result be inaccurate? For example, in the below
Please let me know if you have any suggestions on this.
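To make the two-phase concern concrete, here is an editor's sketch (not the example the comment originally referenced): the approximation may match a superset of the real matches, so a run reported by the approximation cannot be consumed in bulk without confirming each doc.

```java
import java.io.IOException;
import org.apache.lucene.search.DocIdSetIterator;
import org.apache.lucene.search.TwoPhaseIterator;

static void iterateTwoPhase(TwoPhaseIterator twoPhase) throws IOException {
  DocIdSetIterator approximation = twoPhase.approximation();
  for (int doc = approximation.nextDoc();
      doc != DocIdSetIterator.NO_MORE_DOCS;
      doc = approximation.nextDoc()) {
    // Treating [doc, approximation.peekNextNonMatchingDocID()) as all matches
    // would be wrong: every approximation doc still needs this confirmation.
    if (twoPhase.matches()) {
      System.out.println("confirmed match: " + doc);
    }
  }
}
```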
Yeah indeed. I didn't fix the random seed during my luceneutil runs, and thus the results vary a lot as they may depend on the index and queries under test.
I did a few more tests for this, and have some interesting findings.

No changes (comparing baseline with baseline):
PKLookup seems to vary a lot as well when there are no changes.

With changes (comparing modified with baseline), and with the task query also modified:
In addition, I noticed adding
I suspect it's indeed JVM compilation that's causing the difference? Below is the full JVM command line from the modified
In terms of code, PKLookup will execute this section of modified code when it's doing doc enumeration, but reverting the changes there didn't solve the issue.
I wonder if
Thanks for testing this. We've debated the merits of disabling background compilation (
Though, I would also expect that as you vary the particular query (
OK thanks for testing. I think net/net we can conclude that this is all noise and should not block this great change! The speedups for some cases are astounding!
Thanks @mikemccand for the additional context!
+1
The benchmark was run with the index sorted by
Hi @jpountz @mikemccand @rmuir @uschindler @gsmiller, I have added some tests in the last few days and believe this PR is ready for review now. Could you please take a look and let me know if you have any suggestions? I'm not particularly sure about my approach for conjunctions and leveraging skip data, by the way, and am open to alternatives!
The logic looks pretty good to me overall. I like that we're seeing good speedups when using FILTER/MUST_NOT clauses on postings that have long runs! It'd be good to understand if we can reduce the overhead for the case when the optimization doesn't kick in, as I'd expect this case to remain the most common one. I'm also curious to get other people's take on whether the additional API is worth the performance benefits. I personally like it because it allows having first-class filters (filters on fields that are part of the index sort) that perform much more efficiently than regular filters. This can be especially useful when managing multiple tenants within the same index, e.g. multiple categories of an e-commerce catalog.
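For illustration, a minimal sketch of the index-sort setup behind such first-class filters (the field name "category" is hypothetical): sorting on a low-cardinality field clusters each value's docs into one contiguous doc ID range per segment, which is exactly the shape of long runs this API can exploit.

```java
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.SortField;

static IndexWriterConfig withCategorySort() {
  IndexWriterConfig config = new IndexWriterConfig();
  // Documents must also index "category" as a SortedDocValuesField so the
  // sort (and filters on it) can be evaluated per segment.
  config.setIndexSort(new Sort(new SortField("category", SortField.Type.STRING)));
  return config;
}
```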
Thanks @jpountz for the review!
Yes indeed. Aside from further optimizing the logic itself, I have been thinking about two potential approaches, but they each have some pros / cons:
What do you think about these approaches?
I'd rather avoid introducing a flag, but your second idea sounds interesting. Maybe one way to implement it would be to introduce a bulk scorer for conjunctions, split the doc ID space into something like 128 equal windows, and only check for
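A rough sketch of that windowing idea (everything here is illustrative: the window count of 128 comes from the comment above, collect() is a stand-in consumer, and peekNextNonMatchingDocID is the proposed API): the peek is consulted at most once per window, bounding its overhead when no long runs exist.

```java
import java.io.IOException;
import org.apache.lucene.search.DocIdSetIterator;

static void scoreWindowed(DocIdSetIterator disi, int maxDoc) throws IOException {
  final int windowSize = Math.max(1, maxDoc / 128);
  int doc = disi.nextDoc();
  while (doc != DocIdSetIterator.NO_MORE_DOCS) {
    int windowEnd = Math.min(doc + windowSize, maxDoc);
    if (disi.peekNextNonMatchingDocID() >= windowEnd) {
      // The whole window is one run of matches: consume it in bulk.
      for (int d = doc; d < windowEnd; d++) {
        collect(d);
      }
      doc = disi.advance(windowEnd);
    } else {
      // No long run here: fall back to doc-at-a-time iteration for this window.
      for (; doc < windowEnd; doc = disi.nextDoc()) {
        collect(doc);
      }
    }
  }
}

static void collect(int doc) {
  // stand-in for per-doc scoring / collection
}
```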
Thanks @jpountz for the feedback! On the second approach, I was actually thinking of something simpler, such as this: 219beab. I ran the benchmark tests after the change, and got the following results:

Index with sorting

Result 1:
Result 2:
Index without sorting

Result 1:
Result 2:
Overall, it was able to prevent a severe performance drop when the index is unsorted, and only had around a 5% negative impact on
However, I also noticed that somehow with a sorted index,

```diff
--- a/tasks/wikimedium.10M.nostopwords.tasks
+++ b/tasks/wikimedium.10M.nostopwords.tasks
@@ -17381,8 +17381,8 @@ AndHighFilterMonth: +united +filter=monthPostings:feb # freq=1185528
AndHighFilterMonth: +year +filter=monthPostings:mar # freq=1098425
AndHighFilterMonth: +its +filter=monthPostings:apr # freq=1160703
AndHighFilterMonth: +but +filter=monthPostings:may # freq=1484398
-AndMedFilterMonth: +mostly +filter=monthPostings:jun # freq=89401
-AndMedFilterMonth: +interview +filter=monthPostings:jul # freq=94736
-AndMedFilterMonth: +9 +filter=monthPostings:aug # freq=541405
-AndMedFilterMonth: +hard +filter=monthPostings:sep # freq=92045
-AndMedFilterMonth: +bay +filter=monthPostings:oct # freq=117167
\ No newline at end of file
+AndMedFilterMonth: +mostly +filter=monthPostings:jan # freq=89401
+AndMedFilterMonth: +interview +filter=monthPostings:feb # freq=94736
+AndMedFilterMonth: +9 +filter=monthPostings:mar # freq=541405
+AndMedFilterMonth: +hard +filter=monthPostings:apr # freq=92045
+AndMedFilterMonth: +bay +filter=monthPostings:may # freq=117167
```

and got the following results:

Index with sorting

Result 1:
Result 2:
Index without sorting

Result 1:
Result 2:
With regard to the bulk scorer approach, it is interesting! However, if we were to split the window based on a certain size and only call
You are right, sparse queries may be affected.
This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the dev@lucene.apache.org list. Thank you for your contribution!
Sorry for the long time without a reply. I had some hesitation about moving this change forward, since it's a big API change and I didn't see much appeal for it (besides you and me, I guess). But I'd also like to move sparse/zone indexing forward, and that would need the same API at search time, so I'm now keen on moving this forward. If you're interested in updating this branch, I'll be interested in reviewing. Since we last worked on this branch, conjunctions introduced a
This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the dev@lucene.apache.org list. Thank you for your contribution!
This PR adds a new API to DISI to find / estimate long runs of matches, which can then be leveraged by higher-level code to skip over a range of matches and speed up certain query tasks.
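In signature form, the proposal roughly amounts to the following method on DocIdSetIterator (the javadoc paraphrases the description above; the conservative default body is an editor's illustration, not necessarily the PR's exact code):

```java
/**
 * Returns the next doc ID that may not be a match, so that callers can treat
 * [docID(), returned value) as one run of matches and skip ahead. Behavior is
 * undefined once the iterator is exhausted (current doc == NO_MORE_DOCS).
 */
public int peekNextNonMatchingDocID() {
  // Conservative default: claim no run beyond the current doc.
  return docID() + 1;
}
```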