[BUG] Abnormal nodes stats in indexing/search metrics during on-going index/shard closure. #12552
Comments
@vikasvb90 Nice find! I have a couple of thoughts regarding the proposed solution:
Reviving this thread. The index closure operation involves multiple steps: IndexID removal from Hashmap -> Index Close call -> Per Shard Remove call -> ShardID removal from Hashmap -> Shard Close call -> finally, the asynchronous listener `beforeIndexShardClosed` performing the `oldShardsStats` update. A fix could be updating `oldShardsStats` right after the IndexID removal from the Hashmap, but that means running expensive stat computation while blocking more critical actions. Also, since the operation is not atomic, this would only narrow the window of the race condition, not eliminate it completely. Computing the `partial_results` status runs into the same reasoning: we cannot deterministically deduce the value due to race conditions. Additionally, from the user's point of view, a partial per-node stats result makes no sense. We should either send no stats (if we can deterministically identify this scenario) or send the complete value, as the user has no way to determine what is missing or how to address it. Instead, we can update the documentation of the stats API: the client on their end can choose to treat such scenarios (present value lower than the previous one) as missing data (recommended), since the stat value is expected to be monotonically increasing.
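To make the recommended client-side handling concrete, here is a minimal sketch, assuming a hypothetical client-side wrapper (none of these names come from OpenSearch): it keeps the last observed value of a counter that is documented as monotonically increasing and treats any dip as missing data.

```java
/**
 * Minimal client-side sketch (hypothetical, not an OpenSearch API): treat a dip
 * in a counter that is documented as monotonically increasing as missing data.
 */
class MonotonicStatReader {
    private long lastSeen = 0;

    /** Returns the reported value, or the last known good value when data is incomplete. */
    long read(long reported) {
        if (reported < lastSeen) {
            // A shard/index closure raced with the stats call on the node;
            // the sample is incomplete, so keep the previous value.
            return lastSeen;
        }
        lastSeen = reported;
        return reported;
    }
}
```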
Regarding the approach I suggested above, it doesn't require any heavy computation. From a consistency perspective, since we are updating the in-memory closed-shard count right after index/shard closure, the chances of node stats being inconsistent are negligible. Also, we can completely eliminate even this rare condition by swapping the removal from the map with the update of the in-memory closed-shard count: we first update the in-memory closed-shard count and then remove from the map, and in node stats we set the response to partial only when we see that the shard count in old stats is less than the shard count in the in-memory variable. This will make the API completely deterministic, and we also don't need to put a question mark on its reliability. Edit: There can be a false positive in the rarest of scenarios, but even then the client doesn't need to handle this explicitly. The client can be generic and just retry or ignore whenever it sees the partial flag.
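A minimal sketch of this ordering, assuming simplified stand-ins for the IndicesService internals (the class, field, and method names here are illustrative, not the actual code):

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

/** Illustrative sketch: bump the closed-shard counter BEFORE removing from the map. */
class ClosedShardTracking {
    private final ConcurrentHashMap<Integer, Object> shards = new ConcurrentHashMap<>();
    private final AtomicLong closedShardCount = new AtomicLong(); // in-memory counter
    private volatile long oldStatsShardCount = 0;                 // shards folded into old stats

    void closeShard(int shardId) {
        closedShardCount.incrementAndGet(); // (1) record the closure first
        shards.remove(shardId);             // (2) then remove from the in-memory map
        // ... the shard close and the beforeIndexShardClosed listener run later,
        // eventually folding the shard's stats into the old stats:
        oldStatsShardCount++;
    }

    /** Stats path: if old stats lag the counter, a closed shard's stats are missing. */
    boolean isPartial() {
        return oldStatsShardCount < closedShardCount.get();
    }
}
```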
What happens if the stats API request lands between the removal from the index map and the update of the in-memory closed-shard count? Within an index, we can have n shards, and closing each shard (and the corresponding `oldShardsStats` update) is a sequential operation. Maintaining two counters - index closed and shard closed - will still not guarantee strong consistency.
We don't need two counters, just one counter for the overall closed-shard count. Also, as I said, we first update the in-memory closed-shard count and then update the index or shard map.
Thanks @vikasvb90 and @khushbr for the discussion on the possible solutions. This gives good context. The stats APIs are not strongly consistent, but considering these are cumulative metrics at the node level, it makes for a poor user experience if a value goes down. @vikasvb90 you can return partial, but it doesn't give any idea which metric is impacted; would this partial status be at the node level or for the overall stats API? To figure out what partial means here, the user would need to compare against the previous value of each metric. The Bulk API is different: you get doc-level status, so you know exactly which doc passed or failed. Similarly, searches can be partial, but the user decides which query gets a partial response and the application has handling for it as well. In either case, you don't need previous state to understand the impact of partial. Let's explore the closed-shard-count counter approach further (as long as it doesn't impact applying the cluster state).
I did a dirty POC for a fix based on keeping track of shard stats between the index removal from the Hashmap and the update of OldShardStats (a rough sketch follows the steps below):
1. Add the IndexShard to a pendingShardClosure queue in removeIndex().
2. Remove the IndexShard from pendingShardClosure before adding it to OldShardStats in beforeIndexShardClosed().
3. Make a local copy of pendingShardClosure and, for each IndexShard, add its stats to commonStats in stats(). This is done after the OldShardStats computation.
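A rough sketch of the POC, assuming simplified stand-ins (per-shard stats reduced to a `long`; `pendingShardClosure` mirrors the name above, everything else is illustrative):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Illustrative POC sketch: track shards between map removal and old-stats update. */
class PendingClosurePoc {
    private final Map<String, Long> pendingShardClosure = new ConcurrentHashMap<>();
    private volatile long oldShardsStats = 0;

    /** removeIndex(): the shard leaves the indices map, so park its stats here. */
    void onRemoveIndex(String shardId, long shardStats) {
        pendingShardClosure.put(shardId, shardStats);
    }

    /** beforeIndexShardClosed(): remove from pending (A) before adding to old stats (B). */
    void beforeIndexShardClosed(String shardId, long shardStats) {
        pendingShardClosure.remove(shardId);
        oldShardsStats += shardStats;
    }

    /** stats(): old stats plus live shards, then a snapshot of still-pending shards. */
    long stats(long liveShardStats) {
        long total = oldShardsStats + liveShardStats;
        for (long pending : new HashMap<>(pendingShardClosure).values()) {
            total += pending; // shards currently mid-closure
        }
        return total;
    }
}
```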
In summary, without taking a lock, we cannot guarantee consistency. In this solution, depending on the order of execution of events, the response can end up having double or missing data:
A - Remove IndexShard from pendingShardClosure queue in beforeIndexShardClosed()
To reduce the chances of adding a double value, we enforce that (A) must happen before (B) and (C) must happen before (D). Alternatively, we can design the solution to track the inconsistent race-condition state with greater accuracy. On detecting such a scenario, the engine will throw an exception, which for node-local stats will show up as a 5XX; in cluster stats, the coordinator will handle the exception and return an empty JSON for this node in the response. @shwetathareja @vikasvb90 Let me know what you think.
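For the stricter alternative, here is a minimal sketch under assumed names (`statsOrFail` and the counter parameters are illustrative, and the exception type is my choice, not the actual wiring): detect the inconsistent window and fail the node-local stats call rather than returning skewed data.

```java
/** Illustrative sketch: fail fast instead of returning skewed stats. */
long statsOrFail(long oldStatsShardCount, long removedShardCount, long computedStats) {
    if (oldStatsShardCount < removedShardCount) {
        // Node-local stats would surface this as a 5XX; in cluster stats the
        // coordinator would catch it and return an empty JSON object for this node.
        throw new IllegalStateException("Stats inconsistent: shard closure in progress");
    }
    return computedStats;
}
```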
I agree. The higher the number of distinct in-memory state updates, and the longer the execution between these updates and the updates to the in-memory maps, the higher the chances of race conditions.
Describe the bug
Node-level stats are supposed to be monotonically increasing in nature until the node is restarted, but this principle doesn't hold during a race condition when there are index or shard closures going on in parallel.
If one or more shard or index closures happen in parallel with a nodes stats execution, then while fetching the stats of any shard whose state has changed to closed, either an `IllegalIndexShardStateException` or an `AlreadyClosedException` is thrown, which is ignored here, and the stats of the respective shard are skipped in the response. This inconsistency of lower nodes stats values being returned in the response only lasts until `beforeIndexShardClosed` of old stats is invoked.
How is this still possible when `beforeIndexShardClosed` is updating old shard stats before the shard is closed?
Delta shard stats are computed here by iterating over the in-memory map of indices of `IndicesService` and then over the in-memory map of shard ids present in the `IndexService` of each index. Before `beforeIndexShardClosed` gets invoked, the index gets removed from the indices map here, and similarly the shard gets removed from the shard id map here. If any stats request comes in between the removal of an index or shard from the respective in-memory map and the invocation of `beforeIndexShardClosed`, it will only be able to return the yet-to-be-updated value of the old shard stats.
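To make the window concrete, here is a minimal sketch of the delta-stats loop, assuming simplified stand-ins for the real `IndicesService` internals (the `Shard` interface and `computeNodeStats` are illustrative): a shard that was already removed from the maps but not yet folded into the old stats is counted nowhere.

```java
import java.util.Map;

/** Simplified stand-ins for IndicesService internals (illustrative only). */
interface Shard {
    long stats(); // throws IllegalStateException if the shard is already closed
}

class NodeStatsSketch {
    long computeNodeStats(Map<String, Map<Integer, Shard>> indicesMap, long oldShardsStats) {
        long total = oldShardsStats; // copy of the old stats comes first
        for (Map<Integer, Shard> shardsOfIndex : indicesMap.values()) {
            for (Shard shard : shardsOfIndex.values()) {
                try {
                    total += shard.stats(); // delta stats of a live shard
                } catch (IllegalStateException e) {
                    // Closed concurrently (IllegalIndexShardStateException /
                    // AlreadyClosedException in the real code): the shard is skipped,
                    // even though oldShardsStats may not include it yet, so the
                    // reported value dips below the previous one.
                }
            }
        }
        return total;
    }
}
```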
Related component
Indexing
To Reproduce
This isn't easy to reproduce, as stats need to be invoked at a certain intermediate state. An integ test can reproduce it.
Expected behavior
Let's explore what options we have:
Expose a `shard_count` field in node stats so that whenever a client sees a lower node stats value together with a lower `shard_count`, it can simply ignore that event (or decide whatever it wants to do with it).
The sequence in which stats are fetched today: old stats, which store the stats of shards previously assigned on the node, are copied to a local variable; the stats of the current shards are fetched; and these are then merged into the old-stats local variable and returned.
Assume R -> remove index/shard from map, O -> copy old stats into a local variable, C -> shard or index closure, D -> delta stats of the current shard, with the constraints that R happens before C and O happens before D.
The following combinations are possible, and all of them result either in a lower `shard_count` with a lower node stats value, or in correct node stats.
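For illustration, my own enumeration of the six orderings permitted by these constraints (not from the original report):
- R O C D, R O D C, O R C D, O R D C: the old-stats copy (O) precedes the closure (C) and the removal (R) precedes the delta computation (D), so the closing shard appears in neither the old-stats copy nor the delta: lower `shard_count`, lower stats.
- R C O D: the closure completes before the old stats are copied, so the copy already includes the closed shard: correct stats.
- O D R C: the delta is computed before the shard is removed from the map, so the delta still includes the shard: correct stats.
Double counting is impossible here, since it would require both C before O and D before R, which contradicts R happening before C.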
On receiving a lower `shard_count` and lower node stats, the client can choose either to re-invoke stats or to ignore the event.
Edit 1:
Even with just `shard_count` we won't be able to distinguish between the node-restart case and the shard-closing case. We need to keep another in-memory counter which tracks the count of removed shards and is incremented whenever an index or a shard is removed from the indices or shard map. During stats execution, we can match the removed count against the `oldShardStats` shard count, and if the match fails we can return a field `partial_results` set to true in the response. We can still send `shard_count` as additional info. We also need to distinguish between an index closure and a shard closure and prevent counting shards again when the index is closed in the parent path.
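A minimal sketch of this idea, assuming illustrative names (`RemovedShardCounter` and its methods are not actual OpenSearch code): count each removal exactly once, whether the shard goes away individually or via a whole-index closure, and compare against the shard count recorded in old stats.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Illustrative sketch of the removed-shards counter driving partial_results. */
class RemovedShardCounter {
    private final AtomicLong removedShards = new AtomicLong();

    /** Whole-index closure: count all of its shards once, in the parent path. */
    void onIndexRemoved(int shardsInIndex) {
        removedShards.addAndGet(shardsInIndex);
    }

    /** Individual shard closure: skip if already counted by the index closure. */
    void onShardRemoved(boolean partOfIndexClosure) {
        if (!partOfIndexClosure) {
            removedShards.incrementAndGet();
        }
    }

    /** Stats path: a mismatch means some closed shard's stats are not in old stats yet. */
    boolean partialResults(long oldShardStatsShardCount) {
        return oldShardStatsShardCount < removedShards.get();
    }
}
```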