TaskExecutor should not fork unnecessarily #13472

Merged: 11 commits into apache:main on Jul 4, 2024

Conversation

original-brownbear
Member

When executing N tasks and waiting on the result of all of them, we should only fork N - 1 times and execute one task on the calling thread that is getting blocked anyway. This saves at least one context switch, removes the need for any reentrancy protection, and makes better use of available CPU resources.

When executing N tasks and waiting on the result of all of them,
we should not fork every one of them; at least one should be executed
outright on the calling thread. This saves at least one context switch,
removes the need for any reentrancy protection, and makes full use of
the supplied task executor if tasks fan out again.
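
To illustrate the basic idea, a minimal sketch (a hypothetical helper, not the actual TaskExecutor code), assuming the usual java.util.concurrent types and a non-empty task list:

// Hypothetical sketch: fork N - 1 tasks and run the remaining one on the
// calling thread, which would otherwise just block on get() anyway.
static <T> List<T> invokeAll(Executor executor, List<Callable<T>> tasks) throws Exception {
  List<FutureTask<T>> futures = new ArrayList<>();
  for (int i = 0; i < tasks.size() - 1; i++) {
    FutureTask<T> future = new FutureTask<>(tasks.get(i));
    futures.add(future);
    executor.execute(future);                    // forked to the pool
  }
  T last = tasks.get(tasks.size() - 1).call();   // executed right here, no context switch
  List<T> results = new ArrayList<>();
  for (FutureTask<T> future : futures) {
    results.add(future.get());                   // the caller blocks here either way
  }
  results.add(last);
  return results;
}
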
@jpountz
Contributor

jpountz commented Jun 8, 2024

Your suggestion is how it used to work before: https://github.com/apache/lucene/pull/12499/files#diff-e744fc99cb74627f02e30c1cbda56dede66d2ecdfd57db2ce869b9a9a43fa41cR49-R64. The context switching isn't great indeed, but executing some tasks on the current thread makes it hard to correctly reason about and configure the number of threads that should perform work. The idea behind this other PR was that you would have a worker executor that would do almost all the work, and a coordination executor that would be mostly coordinating work in the worker threadpool. I'm not sure if we're at a point where this coordination executor could run on virtual threads, but at least conceptually this is how I'm thinking of it.

Something tangential that we touched a couple times but haven't implemented for now would consist of introducing an API on IndexSearcher that doesn't require waiting on futures to complete, e.g. something like public <C extends Collector, T> void search(Query query, CollectorManager<C, T> collectorManager, IOConsumer<T> resultConsumer).

Also related, we are replacing some of the forking with the new support we introduced for I/O concurrency: https://github.com/apache/lucene/pull/13359/files#diff-ad7c504406afec8940592f1fda0062d3e5321cdc5693c24ec6f5cfb02f8dd90dL100-R114.

@original-brownbear
Member Author

The idea behind this other PR was that you would have a worker executor that would do almost all the work, and a coordination executor that would be mostly coordinating work in the worker threadpool.

But that's not what is happening in practice right now, is it? Whether I run N-1 tasks on the worker pool and one on the caller, or N tasks on the worker and sleep on the caller thread, either way the coordinator thread is blocked and not coordinating anything in the meantime?
I can see the argument for using the worker pool to limit task concurrency to x threads/cores. But since we are blocking the coordinating executor's threads and leaving them idle, I wonder whether there is that much coordinator work to do in the first place, and whether coordinator and worker pool could practically just be the same threads? If any of the tasks enqueued from the coordinating executor fans out again (and they do), that's how it works anyway.

Your suggestion is how it used to work before:

Not quite I think: The difference is that in my approach the coordinator thread keeps pulling stuff off of the queue in a loop, instead of just doing the last task. This means that the coordinating thread will not be "wasted" as much if the worker pool takes time to execute the tasks and can't do them all in parallel.
Also, it resolves the deadlock issue for single-threaded or otherwise saturated executors.
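
Roughly, the approach looks like this (a simplified, hypothetical sketch, not the PR's actual code; it relies on FutureTask#run being a no-op when the task has already been run):

// Simplified sketch: every task is offloaded, but the caller drains the same
// queue while it waits, so a single-threaded or saturated executor cannot
// deadlock it and the calling thread is never idle while work is queued.
static <T> List<T> invokeAll(Executor executor, List<Callable<T>> tasks) throws Exception {
  List<FutureTask<T>> futures = new ArrayList<>();
  Queue<FutureTask<T>> queue = new ConcurrentLinkedQueue<>();
  for (Callable<T> task : tasks) {
    FutureTask<T> future = new FutureTask<>(task);
    futures.add(future);
    queue.add(future);
    executor.execute(() -> {                 // workers pull whatever is next in the queue
      FutureTask<T> next = queue.poll();
      if (next != null) {
        next.run();
      }
    });
  }
  FutureTask<T> next;
  while ((next = queue.poll()) != null) {    // the caller keeps pulling work instead of sleeping
    next.run();
  }
  List<T> results = new ArrayList<>();
  for (FutureTask<T> future : futures) {
    results.add(future.get());               // only block on tasks still running on workers
  }
  return results;
}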

@original-brownbear
Member Author

I'm not sure if we're at a point where this coordination executor could run on virtual threads, but at least conceptually this is how I'm thinking of it.

That does make sense. But I wonder if this is something we should leave to the runtime/OS/... to figure out for us. It seems very desirable to limit the number of context switches when the API is synchronous so we can go through short tasks served from page cache with as little overhead as possible?

would consist of introducing an API on IndexSearcher that doesn't require waiting on futures to complete

++ that'd be really nice and save a lot of overhead here. That said, we could optimize this code fairly easily to move from waiting on "futures" to waiting on a single future :) I haven't benchmarked this much, but if we see non-trivial overhead for the wait loop due to frequent wakeup-sleep cycles as we go through all of the futures, we could just have a ref-count around a single future, couldn't we?
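
For illustration, the ref-count idea could look something like this (a hypothetical sketch, not what this PR implements; executor and task list as in the surrounding discussion):

// Each task decrements a shared counter; the last one to finish completes a
// single shared future, so the waiting caller parks and wakes at most once.
AtomicInteger pending = new AtomicInteger(tasks.size());
CompletableFuture<Void> allDone = new CompletableFuture<>();
for (Runnable task : tasks) {
  executor.execute(() -> {
    try {
      task.run();
    } finally {
      if (pending.decrementAndGet() == 0) {
        allDone.complete(null);    // single wake-up instead of one per future
      }
    }
  });
}
allDone.join();                    // one wait instead of a get() per task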

we are replacing some of the forking with the new support we introduced for I/O concurrency

❤️

@jpountz
Contributor

jpountz commented Jun 8, 2024

Thanks for explaining, I had not read your implementation carefully. I agree that we are doing less blocking than in the previous implementation of this, though we could still be blocking at times it seems? E.g. if you have two tasks and the first one takes more time? I understand what you are saying about reducing the overhead of forking, but I'm not too happy that it makes sizing thread pools more complicated in exchange? I need to think more about the trade-off.

@original-brownbear
Member Author

E.g. if you have two tasks and the first one takes more time?

Right, that's a possible scenario, but unless we move to some kind of async API like the one you mentioned above, we'll always have blocking on the calling thread if there's a comparatively long running task running on one of the forked threads.

but I'm not too happy that it makes sizing thread pools more complicated in exchange? I need to think more about the trade-off.

To me it seems like the opposite is true: this change makes reasoning about the sizing much easier. I find it very complicated to work out the relative sizes of worker pool and coordinator pool.
I effectively want the worker pool just sized right so that I get the CPU utilisation I desire without oversubscribing and adding scheduling overhead.
Now I have to add a coordinator pool that enables that into the mix. That one I have to size in such a way that I always have another thread available as long as my worker pool isn't fully utilised. That's quite hard to get right?

With this change, I only have to size one pool and since blocking is rare I can probably ignore it in the calculation. So the size of the pool comes out to ~ desired_busy_cores / (1 - io_share_of_execution_time) doesn't it?
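
For example (made-up numbers): if I want 16 cores busy and tasks spend about 25% of their time blocked on IO, that comes out to roughly 16 / (1 - 0.25) ≈ 21 threads for the single pool.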

@javanna
Contributor

javanna commented Jun 10, 2024

I took some time to digest the suggested code changes and the discussions above. I get the sizing issues with using two thread pools (one executing IndexSearcher#search or whatever operation that uses the executor and the other one provided as executor to the searcher) if the heavy work can be performed on either of the two pools. That would revert previous changes around offloading single slices, based on the requirement that we wanted to split the load between the two pools.

If this change though makes us revisit the need for two pools, and allows users to provide the same executor that search already executes against, I think that would be a really good simplification. I need to think more about downsides, and expectations around sizing: we may need bigger queues, because a single search operation may create many more tasks than before?

@original-brownbear
Member Author

we may need bigger queues, because a single search operation may create many more tasks than before?

Right, an alternative would be to count in-progress searches at the top level and just make the queue unbounded? That would keep the behavior the same as it is today and make reasoning about the correct queue size simpler? Seems that's more of an ES than a Lucene concern though; Lucene should just make full use of the provided executor and that's that, shouldn't it?

@javanna
Contributor

javanna commented Jun 11, 2024

Lucene should just make full use of the provided executor and that's that, shouldn't it?

Yes, I think so, but perhaps Lucene needs to provide general guidelines to users around what executor is suited and how it should be configured, what factors to take into account etc.

@msokolov
Contributor

Would it make sense to provide a reference implementation factory method that creates a properly-configured threadpool, maybe using all available cores with whatever appropriate policies?
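
Something along these lines, perhaps (purely a hypothetical shape, not an existing Lucene API; NamedThreadFactory here is org.apache.lucene.util.NamedThreadFactory):

// Hypothetical factory sketch: a fixed pool sized to the available cores,
// which is roughly the configuration this discussion converges on.
static ExecutorService newSearchExecutor() {
  int threads = Runtime.getRuntime().availableProcessors();
  return Executors.newFixedThreadPool(threads, new NamedThreadFactory("searcher"));
}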

@original-brownbear
Member Author

Luceneutil benchmark results for this, running with one fewer thread for this branch vs main (credit to @jpountz and @javanna for the idea), to get an idea of the impact:

                            TaskQPS baseline      StdDevQPS my_modified_version      StdDev                Pct diff p-value
                          Fuzzy1      105.06      (3.1%)      103.22      (3.6%)   -1.7% (  -8% -    5%) 0.103
       BrowseDayOfYearTaxoFacets       14.80      (1.0%)       14.55      (4.5%)   -1.7% (  -7% -    3%) 0.096
          OrHighMedDayTaxoFacets        6.60      (3.3%)        6.49      (2.1%)   -1.6% (  -6% -    3%) 0.062
                         Respell       52.96      (2.2%)       52.56      (1.9%)   -0.8% (  -4% -    3%) 0.243
            BrowseDateTaxoFacets       14.91      (1.2%)       14.86      (3.9%)   -0.4% (  -5% -    4%) 0.695
     BrowseRandomLabelSSDVFacets        3.73      (0.5%)        3.73      (0.5%)    0.1% (   0% -    1%) 0.714
           BrowseMonthSSDVFacets        5.58      (2.0%)        5.59      (2.0%)    0.2% (  -3% -    4%) 0.763
       BrowseDayOfYearSSDVFacets        7.61      (0.6%)        7.62      (0.6%)    0.2% (   0% -    1%) 0.276
            MedTermDayTaxoFacets       25.46      (0.7%)       25.52      (0.9%)    0.3% (  -1% -    1%) 0.328
        AndHighHighDayTaxoFacets       15.24      (0.7%)       15.28      (0.5%)    0.3% (  -1% -    1%) 0.183
         AndHighMedDayTaxoFacets       17.92      (0.7%)       17.99      (0.5%)    0.4% (   0% -    1%) 0.023
     BrowseRandomLabelTaxoFacets       11.95      (1.7%)       12.00      (1.2%)    0.4% (  -2% -    3%) 0.331
           BrowseMonthTaxoFacets       12.37      (3.0%)       12.46      (1.7%)    0.7% (  -3% -    5%) 0.358
               HighTermMonthSort      306.96     (16.4%)      309.25     (14.6%)    0.7% ( -26% -   38%) 0.879
            BrowseDateSSDVFacets        1.45      (1.0%)        1.48      (2.4%)    1.7% (  -1% -    5%) 0.004
                         Prefix3      223.49     (31.2%)      228.83     (13.7%)    2.4% ( -32% -   68%) 0.754
                          Fuzzy2       55.36     (20.9%)       58.92     (14.4%)    6.4% ( -23% -   52%) 0.256
                        PKLookup      176.48     (18.1%)      194.13     (13.2%)   10.0% ( -17% -   50%) 0.045
                    OrNotHighLow      472.02      (2.4%)      567.48     (26.2%)   20.2% (  -8% -   50%) 0.001
                HighSloppyPhrase        3.06      (3.6%)        3.69      (7.1%)   20.4% (   9% -   32%) 0.000
                      AndHighLow      784.51     (24.4%)      959.85     (12.6%)   22.4% ( -11% -   78%) 0.000
                        Wildcard      124.97      (1.4%)      154.50      (2.5%)   23.6% (  19% -   27%) 0.000
                          IntNRQ       70.70      (1.2%)       87.67      (4.0%)   24.0% (  18% -   29%) 0.000
                      HighPhrase       94.06      (2.9%)      118.04      (5.3%)   25.5% (  16% -   34%) 0.000
                     AndHighHigh       53.83      (1.5%)       67.85      (2.0%)   26.1% (  22% -   30%) 0.000
                 LowSloppyPhrase       60.97      (2.4%)       77.49      (5.6%)   27.1% (  18% -   35%) 0.000
                       LowPhrase       20.56      (1.2%)       26.27      (2.9%)   27.7% (  23% -   32%) 0.000
                       MedPhrase       29.76      (1.7%)       39.75      (5.1%)   33.6% (  26% -   40%) 0.000
             LowIntervalsOrdered       15.55      (2.5%)       20.83      (4.1%)   33.9% (  26% -   41%) 0.000
                      AndHighMed       99.55      (2.7%)      135.12      (2.1%)   35.7% (  30% -   41%) 0.000
                     LowSpanNear        3.16      (1.8%)        4.30      (1.6%)   36.3% (  32% -   40%) 0.000
                       OrHighMed      117.00      (3.8%)      164.78      (4.2%)   40.8% (  31% -   50%) 0.000
                   OrHighNotHigh       89.87      (6.3%)      128.16     (36.4%)   42.6% (   0% -   91%) 0.000
                      OrHighHigh       38.70      (1.8%)       55.41      (8.0%)   43.2% (  32% -   53%) 0.000
                 MedSloppyPhrase        7.29      (3.5%)       10.68      (4.6%)   46.5% (  37% -   56%) 0.000
                    HighSpanNear        2.54      (2.1%)        3.77      (3.2%)   48.6% (  42% -   55%) 0.000
                         MedTerm      216.76     (15.6%)      324.89     (29.6%)   49.9% (   4% -  112%) 0.000
               HighTermTitleSort       13.92      (9.3%)       23.43      (8.9%)   68.3% (  45% -   95%) 0.000
                      TermDTSort       68.68      (3.3%)      117.77     (12.2%)   71.5% (  54% -   90%) 0.000
                        HighTerm      220.46      (5.7%)      396.67     (14.8%)   79.9% (  56% -  106%) 0.000
                       OrHighLow      218.43     (26.1%)      400.99     (82.8%)   83.6% ( -20% -  260%) 0.000
            HighTermTitleBDVSort        4.45      (2.1%)        8.32      (2.1%)   86.8% (  80% -   92%) 0.000
                     MedSpanNear       22.62      (2.7%)       42.88      (5.8%)   89.6% (  78% -  100%) 0.000
                    OrHighNotLow      329.64     (22.4%)      672.19     (30.0%)  103.9% (  42% -  201%) 0.000
           HighTermDayOfYearSort       57.50      (3.8%)      125.18      (9.8%)  117.7% ( 100% -  136%) 0.000
             MedIntervalsOrdered       10.22      (4.1%)       22.48      (9.4%)  119.9% ( 102% -  139%) 0.000
            HighIntervalsOrdered        2.41      (6.1%)        5.39     (10.2%)  123.3% ( 100% -  148%) 0.000
                         LowTerm      251.06     (10.8%)      634.45      (7.9%)  152.7% ( 120% -  192%) 0.000
                    OrNotHighMed       74.81      (5.4%)      221.54     (14.8%)  196.1% ( 166% -  228%) 0.000
                   OrNotHighHigh       95.65      (7.1%)      314.65     (21.1%)  228.9% ( 187% -  276%) 0.000
                    OrHighNotMed       59.11      (6.5%)      206.56     (15.0%)  249.4% ( 214% -  289%) 0.000

This is wikimediumall, 3 threads for main and 2 threads for this branch. Effectively no regressions but some considerable speedups.
The reason for this is the obvious reduction in context switching. We go from perf output for main:

Performance counter stats for process id '157418':

   574,008,686,445      cycles                                                      
 1,130,739,465,717      instructions              #    1.97  insn per cycle         
     2,599,704,747      cache-misses                                                
           429,542      context-switches                                            

      49.053969801 seconds time elapsed

to this branch

Performance counter stats for process id '157292':

   526,556,069,563      cycles                                                      
 1,122,410,787,297      instructions              #    2.13  insn per cycle         
     2,420,210,310      cache-misses                                                
           385,991      context-switches                                            

      41.044785986 seconds time elapsed

-> pretty much the same number of instructions needs to be executed, but they run in fewer cycles and encounter fewer cache misses.

This is also seen in the profile of where the CPU time goes:

main looks like this:

17.21%        328981        org.apache.lucene.search.TopScoreDocCollector$SimpleTopScoreDocCollector$1#collect()
5.75%         109925        org.apache.lucene.facet.sortedset.SortedSetDocValuesFacetCounts#countOneSegmentNHLD()
5.24%         100195        org.apache.lucene.search.TopFieldCollector$TopFieldLeafCollector#countHit()
5.17%         98733         org.apache.lucene.util.packed.DirectMonotonicReader#get()
4.11%         78637         org.apache.lucene.codecs.lucene90.Lucene90DocValuesProducer$20#ordValue()
3.98%         76164         org.apache.lucene.facet.taxonomy.FastTaxonomyFacetCounts#countAll()
2.57%         49115         org.apache.lucene.codecs.lucene99.Lucene99PostingsReader$EverythingEnum#nextPosition()
1.82%         34823         org.apache.lucene.queries.spans.NearSpansOrdered#stretchToOrder()
1.73%         33136         jdk.internal.foreign.MemorySessionImpl#checkValidStateRaw()
1.63%         31172         java.util.concurrent.atomic.AtomicLong#incrementAndGet()

while this branch looks as follows:

10.79%        183254        org.apache.lucene.search.TopScoreDocCollector$SimpleTopScoreDocCollector$1#collect()
5.89%         100099        org.apache.lucene.facet.sortedset.SortedSetDocValuesFacetCounts#countOneSegmentNHLD()
5.62%         95387         org.apache.lucene.util.packed.DirectMonotonicReader#get()
4.59%         77917         org.apache.lucene.codecs.lucene90.Lucene90DocValuesProducer$20#ordValue()
4.48%         76145         org.apache.lucene.facet.taxonomy.FastTaxonomyFacetCounts#countAll()
3.20%         54407         org.apache.lucene.search.TopFieldCollector$TopFieldLeafCollector#countHit()
2.77%         47088         org.apache.lucene.codecs.lucene99.Lucene99PostingsReader$EverythingEnum#nextPosition()
2.06%         34965         org.apache.lucene.queries.spans.NearSpansOrdered#stretchToOrder()
1.91%         32484         jdk.internal.foreign.MemorySessionImpl#checkValidStateRaw()
1.81%         30763         org.apache.lucene.codecs.lucene99.Lucene99PostingsReader$BlockImpactsPostingsEnum#advance()
1.71%         28966         org.apache.lucene.util.packed.DirectReader$DirectPackedReader12#get()
1.66%         28206         org.apache.lucene.codecs.lucene99.Lucene99PostingsReader$EverythingEnum#advance()

-> a lot less time goes into collect(), which goes through contended counter increments.

Contributor

@javanna javanna left a comment


I left some comments and questions, thanks a lot for opening this PR @original-brownbear !

@msokolov
Contributor

The luceneutil results reported here are astounding. So astounding I'm not sure I believe them? I wonder if somehow we did not run with concurrency enabled on the main branch test ... or if there was some other testing artifact? Part of my thinking is that if this change was so impactful, then wouldn't we have seen a huge regression when moving from the prior situation (where we ran N-1 tasks in the threadpool and one task on the main thread) to the current situation? Hmm I do see your comment that this is different since the "main" thread is continuing to do more tasks. Still I'm really surprised at the impact. If I can get a moment I'll try to corroborate the test result.

@javanna
Contributor

javanna commented Jun 17, 2024

I had a similar reaction to yours, @msokolov. When it comes to Lucene benchmarks, we only very recently started running those with concurrency enabled, so I think that previous changes to the concurrent code-path were not covered? In fact luceneutil did not provide the executor to the searcher until last week or so?

Part of my thinking is that if this change was so impactful, then wouldn't we have seen a huge regression when moving from the prior situation (where we ran N-1 tasks in the threadpool and one task on the main thread) to the current situation?

My general thinking is that there's a gain in using search concurrency in many cases, and some situations where it adds overhead or very little gain. At least when it comes to what we have observed in Elasticsearch benchmarks, we've so far mostly compared no concurrency with some concurrency, and not accurately observed differences between different attempts at running tasks concurrently: the root issue was the overhead of forking, and running no tasks in the caller thread vs one probably did not make a whole lot of difference. I think that the proposed solution is very different and that is why we are observing very different benchmark results. This change seems to take the concurrent code-path to a different level.

With that, please do double check the bench results, more eyes on it can only help.

@javanna
Contributor

javanna commented Jun 17, 2024

@shubhamvishu @reta @sohami @mikemccand @zhaih I am pinging you folks because you have previously been involved in some of the search concurrency discussions and this PR will affect how search concurrency is exposed to users.

Elasticsearch leverages inter-segment search concurrency in Lucene by providing an executor to the IndexSearcher. That executor has so far been separate from the one that calls IndexSearcher#search: there is a search thread pool as well as a search workers thread pool in Elasticsearch. This introduces some complexity for Lucene users, as they need to size a separate executor and provide it in order to benefit from search concurrency. We have seen that the caller thread is under-utilized and could happily help execute tasks too, while it currently blocks and waits for all search tasks to complete. @original-brownbear implemented a deadlock-free solution to this, which allows the caller thread to offload tasks to the executor and at the same time pull back tasks that have not yet been executed. This reduces forking and optimizes concurrent execution. Additionally, this makes it possible to execute all of the tasks on a single executor, removing the requirement that users size and provide a separate executor to the searcher. I think this is great both from the perspective of better resource utilization and how much easier it gets to leverage concurrency in Lucene.

What do you all think?

@javanna javanna modified the milestones: 9.10, 9.12.0 Jun 17, 2024
@original-brownbear
Member Author

@msokolov they are astounding, but in the opposite direction: in fact it's mostly concurrency itself that's the problem.
This is main vs main, no concurrency vs 4 threads:

                            TaskQPS baseline      StdDevQPS my_modified_version      StdDev                Pct diff p-value
       BrowseDayOfYearTaxoFacets       14.81      (0.4%)        5.97      (0.4%)  -59.7% ( -60% -  -59%) 0.000
            BrowseDateTaxoFacets       14.20      (9.0%)        5.85      (0.2%)  -58.8% ( -62% -  -54%) 0.000
                          IntNRQ       70.46      (1.3%)       30.29      (3.2%)  -57.0% ( -60% -  -53%) 0.000
     BrowseRandomLabelTaxoFacets       11.61      (2.7%)        5.08      (0.3%)  -56.3% ( -57% -  -54%) 0.000
                          Fuzzy1       72.82      (5.7%)       44.58      (1.1%)  -38.8% ( -43% -  -33%) 0.000
       BrowseDayOfYearSSDVFacets        7.66      (1.0%)        4.78      (0.6%)  -37.6% ( -38% -  -36%) 0.000
                       OrHighMed       74.56      (2.4%)       51.86      (3.2%)  -30.4% ( -35% -  -25%) 0.000
                     AndHighHigh       47.99      (2.7%)       34.33      (2.6%)  -28.5% ( -32% -  -23%) 0.000
                      AndHighMed       67.95      (1.5%)       52.17      (2.4%)  -23.2% ( -26% -  -19%) 0.000
                 LowSloppyPhrase       45.51      (1.7%)       37.92      (2.4%)  -16.7% ( -20% -  -12%) 0.000
                       MedPhrase       11.68      (5.0%)        9.74      (0.3%)  -16.6% ( -20% -  -11%) 0.000
           BrowseMonthTaxoFacets       12.23      (2.6%)       10.73     (27.7%)  -12.2% ( -41% -   18%) 0.378
                      OrHighHigh       45.32      (2.7%)       39.79      (4.2%)  -12.2% ( -18% -   -5%) 0.000
           BrowseMonthSSDVFacets        5.49      (4.2%)        4.85      (1.1%)  -11.7% ( -16% -   -6%) 0.000
                HighSloppyPhrase        2.01      (2.2%)        1.81      (7.4%)  -10.2% ( -19% -    0%) 0.008
                        Wildcard      123.17      (2.5%)      115.43      (0.9%)   -6.3% (  -9% -   -2%) 0.000
                    OrNotHighLow      908.00      (2.2%)      865.22      (1.4%)   -4.7% (  -8% -   -1%) 0.000
             LowIntervalsOrdered       57.32      (3.2%)       54.78      (3.9%)   -4.4% ( -11% -    2%) 0.077
            MedTermDayTaxoFacets       22.22      (0.6%)       21.57      (2.9%)   -2.9% (  -6% -    0%) 0.049
            BrowseDateSSDVFacets        1.46      (2.0%)        1.45      (2.1%)   -0.5% (  -4% -    3%) 0.743
     BrowseRandomLabelSSDVFacets        3.75      (0.6%)        3.74      (0.2%)   -0.2% (  -1% -    0%) 0.551
          OrHighMedDayTaxoFacets        1.20      (1.1%)        1.21      (4.4%)    0.9% (  -4% -    6%) 0.678
                         Respell       52.55      (1.4%)       53.25      (2.9%)    1.3% (  -2% -    5%) 0.407
         AndHighMedDayTaxoFacets       11.46      (0.8%)       11.76      (2.7%)    2.6% (   0% -    6%) 0.067
        AndHighHighDayTaxoFacets       12.74      (1.3%)       13.23      (2.1%)    3.8% (   0% -    7%) 0.002
                     MedSpanNear        8.28      (2.4%)        9.50      (5.0%)   14.7% (   7% -   22%) 0.000
                      AndHighLow      624.28     (22.4%)      726.83      (3.6%)   16.4% (  -7% -   54%) 0.147
                          Fuzzy2       51.95     (23.2%)       60.73      (2.7%)   16.9% (  -7% -   55%) 0.147
                 MedSloppyPhrase       12.94      (4.1%)       15.57     (10.9%)   20.3% (   5% -   36%) 0.001
                         Prefix3      158.65     (23.1%)      213.31      (3.8%)   34.5% (   6% -   79%) 0.003
                        PKLookup      175.73      (6.3%)      247.50      (0.6%)   40.8% (  31% -   50%) 0.000
                      HighPhrase       24.79      (6.3%)       37.67      (1.4%)   52.0% (  41% -   63%) 0.000
                       LowPhrase      153.31      (1.4%)      244.54      (1.6%)   59.5% (  55% -   63%) 0.000
                       OrHighLow      232.73     (23.8%)      371.84      (4.1%)   59.8% (  25% -  115%) 0.000
                    HighSpanNear        2.93      (3.5%)        4.82     (13.3%)   64.7% (  46% -   84%) 0.000
                     LowSpanNear       51.65      (6.0%)       98.03      (9.8%)   89.8% (  69% -  112%) 0.000
            HighTermTitleBDVSort        4.37      (4.4%)        8.65      (1.6%)   98.0% (  88% -  108%) 0.000
             MedIntervalsOrdered        9.46      (7.3%)       19.51     (13.2%)  106.1% (  79% -  136%) 0.000
            HighIntervalsOrdered        4.26      (6.5%)        8.81     (13.6%)  106.9% (  81% -  135%) 0.000
                         LowTerm      232.68      (3.8%)      485.59      (7.5%)  108.7% (  93% -  124%) 0.000
                         MedTerm      202.48     (26.4%)      535.61     (18.8%)  164.5% (  94% -  285%) 0.000
                    OrHighNotLow      172.52      (3.4%)      516.56      (7.3%)  199.4% ( 182% -  217%) 0.000
                   OrNotHighHigh       69.11      (4.1%)      224.30     (11.7%)  224.6% ( 200% -  250%) 0.000
                   OrHighNotHigh       77.32      (2.7%)      271.59     (12.7%)  251.3% ( 229% -  274%) 0.000
                      TermDTSort       62.88      (4.5%)      224.63      (5.5%)  257.2% ( 236% -  279%) 0.000
                        HighTerm      106.32      (3.1%)      385.12     (25.1%)  262.2% ( 227% -  299%) 0.000
                    OrNotHighMed       64.14     (10.1%)      247.41     (19.2%)  285.7% ( 232% -  350%) 0.000
                    OrHighNotMed       78.41      (5.3%)      306.67     (10.6%)  291.1% ( 261% -  324%) 0.000
               HighTermMonthSort      395.36     (38.7%)     2712.69     (16.2%)  586.1% ( 382% - 1046%) 0.000
           HighTermDayOfYearSort       67.77      (4.8%)      524.03     (18.3%)  673.3% ( 620% -  731%) 0.000
               HighTermTitleSort       15.06      (3.6%)      131.50      (5.9%)  773.4% ( 737% -  811%) 0.000

A large number of these items are actually showing extreme regressions from forking. Even this branch is like 50% behind no concurrency on some points. This is in fact how I got to opening this PR.

When profiling ES benchmark runs I saw a bunch of sections where the overhead of forking for a given task was higher than the cost of just executing that same task right away. It's a little hard to quantitatively show this in a flame graph but the qualitative problem is here:
This is the profiling with vanilla Lucene:

[flame graph screenshot: vanilla Lucene]

And this is the same situation with my changes in Lucene:

[flame graph screenshot: with this change]

For weight creation, the forking overhead is still overwhelming, but at least we save the future.get overhead from putting the calling thread to sleep and waking it up again. Only for longer-running search tasks is the forking overhead "ok" I think. As I tried to show with the perf output, the cache effects of context switching often outweigh any benefits of parallelization of IO. I could even see a point where the IO parallelization causes harm, not from the IO itself but from the fact that page faulting isn't super scalable in Linux, so even if you make an NVMe drive run faster, the contention on the page fault handling might actually destroy any benefit from pushing the disk (assuming a fast disk, that is) harder.

@reta
Member

reta commented Jun 17, 2024

Additionally, this makes it possible to execute all of the tasks on a single executor, removing the requirements that users size and provide a separate executor to the searcher. I think this is great both from the perspective of better resource utilization as well as how easier it gets to leverage concurrency in Lucene.

Thanks @jpountz, this change is quite clever, and I agree with @original-brownbear that it leads to better CPU utilization since the caller thread "joins" the executor pool in the effort (also confirmed by benchmarks). Maybe I am missing something, but the implementation basically introduces "double queuing": the task executor has one queue and very likely the supplied executor has one as well, and both are competing over taskId (directly or indirectly) to get something done.

On a general note, this resembles a lot how ForkJoinPool is implemented (which also utilizes the calling thread to execute some of the work); maybe we could explore this route as well? I mean have a specialized task executor that accepts a ForkJoinPool, not a general executor. I think it could simplify things quite a bit, just an idea.
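
For reference, the work-helping behavior this alludes to (a textbook fork/join example, illustrative only, not a Lucene API): a task that forks subtasks and then joins them may execute pending work itself instead of blocking idle.

// Classic recursive-sum example: left.join() can run queued subtasks on the
// calling worker rather than leaving it idle while it waits.
class SumTask extends RecursiveTask<Long> {
  private final long[] values;
  private final int from, to;

  SumTask(long[] values, int from, int to) {
    this.values = values;
    this.from = from;
    this.to = to;
  }

  @Override
  protected Long compute() {
    if (to - from <= 1024) {
      long sum = 0;
      for (int i = from; i < to; i++) {
        sum += values[i];
      }
      return sum;
    }
    int mid = (from + to) >>> 1;
    SumTask left = new SumTask(values, from, mid);
    left.fork();                                          // schedule the left half
    long right = new SumTask(values, mid, to).compute();  // compute the right half directly
    return right + left.join();                           // join() may help run pending tasks
  }
}
// long total = ForkJoinPool.commonPool().invoke(new SumTask(values, 0, values.length));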

@original-brownbear
Member Author

May be I am missing something, but the implementation basically introduces "double queuing": task executor has one and very likely the supplied executor would have one, both are competing over taskId (directly or indirectly) to get something done.

You're right, this is one of the remaining areas of contention that could be fixed for even better performance. Using a ForkJoinPool also has some potential for reducing the impact of memory barriers.
I wonder if this is the right area to optimize though? Looking at the results of concurrency vs. no-concurrency in #13472 (comment) I'm inclined to think this is not the code we should optimize further. Even with this change we're at times 2x-3x (vs. 8-9x without my changes) slower than main without concurrency for some parts of the benchmark.
I don't think we can eliminate all the overhead of requiring some memory barriers for synchronising tasks. So maybe the problem eventually just is with tasks that are too small?

@reta
Member

reta commented Jun 17, 2024

I don't think we can eliminate all the overhead of requiring some memory barriers for synchronising tasks. So maybe the problem eventually just is with tasks that are too small?

This change is definitely an improvement (over the existing implementation).

I don't think we can eliminate all the overhead of requiring some memory barriers for synchronising tasks. So maybe the problem eventually just is with tasks that are too small?

Eliminate - probably not, but reduce further - likely, though it has yet to be proven (as you rightly pointed out, maybe the problem eventually is just that the tasks are too small, so optimizing further won't help).

Contributor

@javanna javanna left a comment


LGTM

lucene/CHANGES.txt (outdated review comments, resolved)
@javanna javanna merged commit 62e08f5 into apache:main Jul 4, 2024
3 checks passed
@javanna
Contributor

javanna commented Jul 4, 2024

Thanks @original-brownbear !

javanna pushed a commit to javanna/lucene that referenced this pull request Jul 4, 2024
When an executor is provided to the IndexSearcher constructor, the searcher now executes tasks on the thread that invoked a search as well as on its configured executor. Users should reduce the executor's thread count by 1 to retain the previous level of parallelism. Moreover, it is now possible to start searches from the same executor that is configured in the IndexSearcher without risk of deadlocking. A separate executor for starting searches is no longer required.

Previously, a separate executor was required to prevent deadlock, and all tasks were offloaded to it unconditionally, wasting resources in some scenarios due to unnecessary forking while the caller thread had to wait for all tasks to complete anyway. The caller thread can now actively contribute to the execution as well.
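
In user code the migration would look roughly like this (hypothetical snippet; reader, query and the thread count are illustrative):

// The calling thread now participates in executing slice tasks, so one fewer
// pool thread keeps the previous level of parallelism.
int cores = Runtime.getRuntime().availableProcessors();
ExecutorService executor = Executors.newFixedThreadPool(cores - 1);
IndexSearcher searcher = new IndexSearcher(reader, executor);
TopDocs topDocs = searcher.search(query, 10);  // also runs slice tasks on this thread
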
javanna added a commit that referenced this pull request Jul 4, 2024

Co-authored-by: Armin Braun <[email protected]>
@jpountz
Contributor

jpountz commented Jul 7, 2024

Should we now update luceneutil's TaskThreads to fork queries into the IndexSearcher's executor (when present)?

@original-brownbear original-brownbear deleted the less-forking-exec branch July 7, 2024 23:57
@original-brownbear
Member Author

@jpountz yea makes sense to me.

@jpountz
Contributor

jpountz commented Jul 8, 2024

@original-brownbear Would you like to work on a PR?

@jpountz
Contributor

jpountz commented Jul 8, 2024

I just pushed an annotation for this change: mikemccand/luceneutil@a64ac17.

Several queries got a bit faster with a low p-value: https://people.apache.org/~mikemccand/lucenebench/2024.07.05.14.34.52.html.

@original-brownbear
Member Author

Sure thing, on it! :) Sorry, I could've done that right away, tired me just didn't realise it this morning.

original-brownbear added a commit to original-brownbear/elasticsearch that referenced this pull request Jul 20, 2024
Even with the recent Lucene improvements in
apache/lucene#13472 there is no need
to go through all the Lucene machinery for a single slice here.
The task executor allocates a bunch of objects and runs into some memory
barriers that the JVM can't necessarily compile away.
Let's save that overhead for the single-slice case.
This PR also removes the pointless list of collectors that we were building.
javanna added a commit to javanna/elasticsearch that referenced this pull request Oct 16, 2024
With recent changes in Lucene 9.12 around not forking execution when not necessary
(see apache/lucene#13472), we have removed the search
worker thread pool in elastic#111099. The worker thread pool had an unlimited queue, and we
feared that we could have much more queueing on the search thread pool if we execute
segment level searches on the same thread pool as the shard level searches, because
every shard search would take up to a thread per slice when executing the query phase.

We have then introduced an additional conditional to stop parallelizing when there
is a queue. That is perhaps a bit extreme, as it's a decision made when creating the
searcher, while a queue may no longer be there once the search is executing.
This has caused some benchmark regressions, given that having a queue may be a transient
scenario, especially with short-lived segment searches being queued up. We may end
up disabling inter-segment concurrency more aggressively than we would want, penalizing
requests that do benefit from concurrency. At the same time, we do want to have some kind
of protection against rejections of shard searches that would be caused by excessive slicing.
When the queue is above a certain size, we can turn off the slicing and effectively disable
inter-segment concurrency. With this commit we set that threshold to be the number of
threads in the search pool.
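
Not the actual Elasticsearch code, but the threshold rule described above boils down to roughly:

// Only enable inter-segment slicing while the executor's queue is shorter
// than the number of search threads; above that, search single-threaded.
static boolean shouldSlice(ThreadPoolExecutor searchPool) {
  return searchPool.getQueue().size() < searchPool.getMaximumPoolSize();
}
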
javanna added a commit to javanna/elasticsearch that referenced this pull request Oct 16, 2024
javanna added a commit to elastic/elasticsearch that referenced this pull request Oct 16, 2024
javanna added a commit to javanna/elasticsearch that referenced this pull request Oct 16, 2024
javanna added a commit to javanna/elasticsearch that referenced this pull request Oct 16, 2024
elasticsearchmachine pushed a commit to elastic/elasticsearch that referenced this pull request Oct 16, 2024
elasticsearchmachine pushed a commit to elastic/elasticsearch that referenced this pull request Oct 17, 2024

Co-authored-by: Elastic Machine <[email protected]>
georgewallace pushed a commit to georgewallace/elasticsearch that referenced this pull request Oct 25, 2024
jfreden pushed a commit to jfreden/elasticsearch that referenced this pull request Nov 4, 2024