
[Segment Replication] Allow shard idle on indices with zero replicas #7761

Closed
mch2 opened this issue May 25, 2023 · 8 comments
Labels: bug, distributed framework

mch2 (Member) commented May 25, 2023

Moving conversation from #7736 to a new issue.

Today, indices using segment replication silently ignore shard idle. This is intentional: with segment replication, replicas are only refreshed externally through their primary, so we cannot depend on a search request hitting a primary specifically in order to trigger a segment copy.

There are a few issues with the current implementation.

  1. Indices with zero replicas also have shard idle disabled, causing a significant performance degradation (see the benchmark below). This would most likely impact users who enable remote store for durability and run with zero replicas.
  2. If a user specifies a non-default value for index.search.idle.after, it is silently ignored (a minimal example of such a configuration is sketched below).
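
For reference, a minimal sketch of the configuration in question, using Python's requests library against a local cluster. The endpoint, index name, and the 30s value are illustrative assumptions; the setting keys are the ones discussed above.

```python
# Sketch only: create an index that uses segment replication with zero replicas
# and a non-default shard-idle timeout. Endpoint and index name are assumptions.
import requests

body = {
    "settings": {
        "index.replication.type": "SEGMENT",   # segment replication
        "index.number_of_replicas": 0,         # zero replicas
        "index.search.idle.after": "30s",      # currently ignored for segrep indices
    }
}
resp = requests.put("http://localhost:9200/segrep-idle-test", json=body)
print(resp.status_code, resp.json())
```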

Benchmark setup: 40 shards, 0 replicas.
Baseline = docrep
Contender = segrep

------------------------------------------------------
    _______             __   _____
   / ____(_)___  ____ _/ /  / ___/_________  ________
  / /_  / / __ \/ __ `/ /   \__ \/ ___/ __ \/ ___/ _ \
 / __/ / / / / / /_/ / /   ___/ / /__/ /_/ / /  /  __/
/_/   /_/_/ /_/\__,_/_/   /____/\___/\____/_/   \___/
------------------------------------------------------

|                                                        Metric |                     Task |    Baseline |   Contender |     Diff |   Unit |
|--------------------------------------------------------------:|-------------------------:|------------:|------------:|---------:|-------:|
|                    Cumulative indexing time of primary shards |                          |     117.767 |     126.334 |  8.56688 |    min |
|             Min cumulative indexing time across primary shard |                          |     2.79192 |     3.05287 |  0.26095 |    min |
|          Median cumulative indexing time across primary shard |                          |     2.95258 |     3.15603 |  0.20345 |    min |
|             Max cumulative indexing time across primary shard |                          |     3.06253 |     3.25807 |  0.19553 |    min |
|           Cumulative indexing throttle time of primary shards |                          |           0 |           0 |        0 |    min |
|    Min cumulative indexing throttle time across primary shard |                          |           0 |           0 |        0 |    min |
| Median cumulative indexing throttle time across primary shard |                          |           0 |           0 |        0 |    min |
|    Max cumulative indexing throttle time across primary shard |                          |           0 |           0 |        0 |    min |
|                       Cumulative merge time of primary shards |                          |           0 |     86.6928 |  86.6928 |    min |
|                      Cumulative merge count of primary shards |                          |           0 |         563 |      563 |        |
|                Min cumulative merge time across primary shard |                          |           0 |     1.15612 |  1.15612 |    min |
|             Median cumulative merge time across primary shard |                          |           0 |     2.45673 |  2.45673 |    min |
|                Max cumulative merge time across primary shard |                          |           0 |     3.08378 |  3.08378 |    min |
|              Cumulative merge throttle time of primary shards |                          |           0 |     45.4994 |  45.4994 |    min |
|       Min cumulative merge throttle time across primary shard |                          |           0 |     0.42275 |  0.42275 |    min |
|    Median cumulative merge throttle time across primary shard |                          |           0 |     1.25747 |  1.25747 |    min |
|       Max cumulative merge throttle time across primary shard |                          |           0 |     1.94772 |  1.94772 |    min |
|                     Cumulative refresh time of primary shards |                          |     11.4373 |     23.7369 |  12.2996 |    min |
|                    Cumulative refresh count of primary shards |                          |         389 |        3510 |     3121 |        |
|              Min cumulative refresh time across primary shard |                          |      0.1986 |    0.540683 |  0.34208 |    min |
|           Median cumulative refresh time across primary shard |                          |    0.279458 |     0.59135 |  0.31189 |    min |
|              Max cumulative refresh time across primary shard |                          |    0.379533 |     0.63685 |  0.25732 |    min |
|                       Cumulative flush time of primary shards |                          |     4.42715 |     1.87787 | -2.54928 |    min |
|                      Cumulative flush count of primary shards |                          |          40 |          84 |       44 |        |
|                Min cumulative flush time across primary shard |                          |     0.03735 |   0.0110667 | -0.02628 |    min |
|             Median cumulative flush time across primary shard |                          |    0.110767 |   0.0436333 | -0.06713 |    min |
|                Max cumulative flush time across primary shard |                          |     0.21965 |    0.117183 | -0.10247 |    min |
|                                       Total Young Gen GC time |                          |       9.287 |       3.867 |    -5.42 |      s |
|                                      Total Young Gen GC count |                          |         168 |         189 |       21 |        |
|                                         Total Old Gen GC time |                          |           0 |           0 |        0 |      s |
|                                        Total Old Gen GC count |                          |           0 |           0 |        0 |        |
|                                                    Store size |                          |     32.0886 |     43.6133 |  11.5247 |     GB |
|                                                 Translog size |                          | 2.04891e-06 | 2.04891e-06 |        0 |     GB |
|                                        Heap used for segments |                          |           0 |           0 |        0 |     MB |
|                                      Heap used for doc values |                          |           0 |           0 |        0 |     MB |
|                                           Heap used for terms |                          |           0 |           0 |        0 |     MB |
|                                           Heap used for norms |                          |           0 |           0 |        0 |     MB |
|                                          Heap used for points |                          |           0 |           0 |        0 |     MB |
|                                   Heap used for stored fields |                          |           0 |           0 |        0 |     MB |
|                                                 Segment count |                          |         733 |         756 |       23 |        |
|                                                Min Throughput |             index-append |       74672 |     61215.7 | -13456.2 | docs/s |
|                                               Mean Throughput |             index-append |     79794.3 |     64712.4 | -15081.9 | docs/s |
|                                             Median Throughput |             index-append |     78470.5 |     64937.8 | -13532.8 | docs/s |
|                                                Max Throughput |             index-append |     88291.9 |       67531 | -20760.9 | docs/s |
|                                       50th percentile latency |             index-append |     321.652 |     490.719 |  169.067 |     ms |
|                                       90th percentile latency |             index-append |     632.225 |     1029.97 |  397.741 |     ms |
|                                       99th percentile latency |             index-append |     5571.06 |     1502.37 | -4068.69 |     ms |
|                                     99.9th percentile latency |             index-append |     7318.35 |     2123.22 | -5195.13 |     ms |
|                                      100th percentile latency |             index-append |     9532.39 |     2299.36 | -7233.03 |     ms |
|                                  50th percentile service time |             index-append |     321.652 |     490.719 |  169.067 |     ms |
|                                  90th percentile service time |             index-append |     632.225 |     1029.97 |  397.741 |     ms |
|                                  99th percentile service time |             index-append |     5571.06 |     1502.37 | -4068.69 |     ms |
|                                99.9th percentile service time |             index-append |     7318.35 |     2123.22 | -5195.13 |     ms |
|                                 100th percentile service time |             index-append |     9532.39 |     2299.36 | -7233.03 |     ms |
|                                                    error rate |             index-append |           0 |           0 |        0 |      % |
|                                                Min Throughput | wait-until-merges-finish |     52.7632 |    0.011783 | -52.7514 |  ops/s |
|                                               Mean Throughput | wait-until-merges-finish |     52.7632 |    0.011783 | -52.7514 |  ops/s |
|                                             Median Throughput | wait-until-merges-finish |     52.7632 |    0.011783 | -52.7514 |  ops/s |
|                                                Max Throughput | wait-until-merges-finish |     52.7632 |    0.011783 | -52.7514 |  ops/s |
|                                      100th percentile latency | wait-until-merges-finish |     18.4399 |     84867.3 |  84848.9 |     ms |
|                                 100th percentile service time | wait-until-merges-finish |     18.4399 |     84867.3 |  84848.9 |     ms |
|                                                    error rate | wait-until-merges-finish |           0 |           0 |        0 |      % |


-------------------------------
[INFO] SUCCESS (took 0 seconds)
-------------------------------

Expected behavior
There should not be a performance hit for indices using the SEGMENT replication type when there are no replicas.
Users should be aware that any specified value for index.search.idle.after has no impact when using the SEGMENT replication type with replicas.
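
As an illustrative aid only (not part of the original report), a small script that flags the second case, assuming the behavior proposed in this issue, i.e. shard idle honored only when a SEGMENT-replicated index has zero replicas. The endpoint and index name are assumptions.

```python
# Sketch only: warn when index.search.idle.after cannot take effect.
import requests

index = "segrep-idle-test"
resp = requests.get(
    f"http://localhost:9200/{index}/_settings?flat_settings=true&include_defaults=true"
)
resp.raise_for_status()
data = resp.json()[index]
settings = {**data.get("defaults", {}), **data.get("settings", {})}

segrep = settings.get("index.replication.type") == "SEGMENT"
replicas = int(settings.get("index.number_of_replicas", "1"))
idle_after = settings.get("index.search.idle.after")

if segrep and replicas > 0:
    print(f"index.search.idle.after={idle_after} has no effect: "
          f"SEGMENT replication with {replicas} replica(s)")
else:
    print(f"shard idle applies after {idle_after}")
```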

mch2 (Member, Author) commented Jun 20, 2023

@andrross @Bukhtawar I know there was concern on #7736 about adding bi-modal behavior with/without replicas. I think the perf benefit here is large enough to accept this with a warning to users (and docs). I tried adding validation between search.idle.after and the replica count, but that further coupled the settings, and because they are defined in separate files it did not cover all cases, particularly adding or removing replicas after index creation.

I've re-opened the PR as #8173 and added a warning when search.idle is updated from its default value.
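
For context, this is the kind of dynamic settings update that would surface the warning; a sketch assuming a local cluster and an illustrative index name (see #8173 for the exact behavior and warning text).

```python
# Sketch only: update search idle on an existing index. Under segment replication
# this value is accepted but has no effect, so #8173 adds a warning rather than
# rejecting the request.
import requests

resp = requests.put(
    "http://localhost:9200/segrep-idle-test/_settings",
    json={"index": {"search.idle.after": "60s"}},
)
print(resp.status_code, resp.json())
```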

Bukhtawar (Collaborator) commented Jun 21, 2023

Thanks @mch2. Wondering if we could do something better: still honour shard_idle even with replicas by forwarding the request to the primary and performing a refresh before sending back the response, if the replica has not seen any new checkpoints in the last X intervals. This avoids penalising a search request that arrives on a replica, which would otherwise have to do a round-trip of (replica -> primary (refresh) -> segment transfer to replica) before executing the search. Once the refresh is done on the primary, triggered by the arrival of the search request, the newer segments get replicated.

It's also possible that no indexing has been done on the shard, in which case the primary forwarding might be moot. But once we get to a polling mechanism where replicas periodically check for newer checkpoints, they could also get a "hint" in the response to disambiguate "no-indexing shard-idle" from "indexing but shard_idle", and accordingly decide whether primary forwarding is needed.

Alternatively, we could have the coordinator logic that selects which shards to route searches to prefer the primary shard first, so requests go to the primary, which refreshes on demand and replicates segments before requests are routed to replicas.

Essentially, I don't want indexing-heavy workloads to suffer just because they had a replica for availability in the first place. We could even consider a simpler approach that does no forwarding to the primary and returns stale records for the initial request, which would force an on-demand refresh on the primary.
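
To make the proposal above concrete, a rough sketch of the replica-side decision being described; this is not actual OpenSearch code, and every name below is hypothetical.

```python
# Sketch only: hypothetical replica-side handling of a search on an idle shard.
def handle_search_on_replica(replica, primary, request, forward_and_wait=False):
    if replica.saw_checkpoint_recently():
        # Replica is reasonably fresh; serve the search directly.
        return replica.execute(request)

    # Ask the primary to refresh and publish a new checkpoint.
    primary.refresh()

    if forward_and_wait:
        # Stronger freshness, higher latency: wait for the new segments
        # to be replicated before executing the search.
        replica.await_replication(primary.latest_checkpoint())

    # Otherwise accept a stale read for this request; the refresh triggered
    # above means subsequent requests will see the newer segments.
    return replica.execute(request)
```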

mch2 (Member, Author) commented Jun 23, 2023

> forwarding the request to the primary and performing a refresh before sending back the response, if the replica has not seen any new checkpoints in the last X intervals

Are you suggesting we ping the primary before serving the search request, wait for segments to be replicated, and then execute the request on the replica? Or simply ping the primary to start the replication cycle and then perform the request, knowing it would be stale? I think I'd prefer the latter here, and have users route critical requests to the primary where staleness is not an option.

> It's also possible that no indexing has been done on the shard, in which case the primary forwarding might be moot

Even with our existing push mechanism, we would know whether the replica has received a published checkpoint from the primary, so we can tell "no-indexing" from "indexing" shard_idle and route the request accordingly. We might also be able to determine staleness by comparing the translog global checkpoint to the replica's local processed checkpoint.

> Essentially, I don't want indexing-heavy workloads to suffer just because they had a replica for availability in the first place. We could even consider a simpler approach that does no forwarding to the primary and returns stale records for the initial request, which would force an on-demand refresh on the primary.

The thinking behind disabling search_idle in the first place is that the benefits of segment replication would still outweigh the benefits of search_idle for indexing workloads. Related issue. The workload wouldn't suffer compared to docrep in terms of throughput; however, it would increase refresh/flush/merge counts when the refresh interval is low. I do like this simple approach of pinging the primary to refresh and start the cycle while serving a stale read from the replica. This is not much different from setting a high refresh interval with SR today and issuing a search. The risk is that the read could be significantly behind.
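
For illustration only, the staleness check suggested above, sketched with hypothetical names; the real values live on the engine/shard internals.

```python
# Sketch only: a replica could be considered stale if its locally processed
# checkpoint lags the translog global checkpoint it knows about.
def replica_is_stale(translog_global_checkpoint: int,
                     local_processed_checkpoint: int) -> bool:
    return local_processed_checkpoint < translog_global_checkpoint
```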

mch2 (Member, Author) commented Jun 23, 2023

One other thing I'm thinking through: in an idle state, when a primary does eventually refresh, it would yield larger segments and more data that needs to be copied out to nodes and/or the remote store. With SR and remote store pressure, this may increase the likelihood of backpressure kicking in, particularly with synchronous remote store uploads.

mch2 (Member, Author) commented Jun 23, 2023

So to be clear, this would also be a behavior change for search_idle: when the request does eventually come in, we would serve a stale read rather than solely increasing latency. This may be more of a problem for system-level indices.

mch2 (Member, Author) commented Jul 10, 2023

@Bukhtawar I am working on a better solution here. Until then, are we ok with #8173 to mitigate the perf issue with 0 replicas?

dreamer-89 (Member) commented:

> @Bukhtawar I am working on a better solution here. Until then, are we ok with #8173 to mitigate the perf issue with 0 replicas?

@Bukhtawar: Can you please respond to this query? We would like to move forward with this solution if there are no concerns.

mch2 (Member, Author) commented Jul 17, 2023

Opened a separate issue to explore supporting shard idle with segrep; closing this as completed.

mch2 closed this as completed Jul 17, 2023