[Remote Store] Root cause deleted segment files during remote uploads #11025
Comments
There are two places where the local segment files map is updated in RemoteStoreRefreshListener:

- OpenSearch/server/src/main/java/org/opensearch/index/shard/ReleasableRetryableRefreshListener.java, Lines 59 to 68 in c132db9

Flow for runAfterRefreshWithPermit:

- OpenSearch/server/src/main/java/org/opensearch/index/shard/ReleasableRetryableRefreshListener.java, Lines 147 to 169 in c132db9
- OpenSearch/server/src/main/java/org/opensearch/index/shard/RemoteStoreRefreshListener.java, Lines 209 to 220 in c132db9

Flow for runAfterRefreshExactlyOnce:

- OpenSearch/server/src/main/java/org/opensearch/index/shard/RemoteStoreRefreshListener.java, Lines 123 to 140 in c132db9
We need to fix this.
The underlying map is backed by a ConcurrentHashMap. If both executions are happening simultaneously, it should still be idempotent and should not lead to any entry going missing from the map.
Agree, but the modifications to the map are not happening concurrently; they happen some millis apart. The problem is that the upload is still in progress for the previous refresh's segment files, which are removed from the map because a newer refresh flow has already updated it. A ConcurrentHashMap would ensure no two threads modify it at the same time, but we are actually modifying the map a few millis apart. Check out the logs from the integ test where I saw this issue: #9774 (comment)
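The point above can be sketched without any thread interleaving at all: each map operation below is individually thread-safe, yet the uploader's file still disappears from the tracker when a newer refresh replaces the contents. All class and method names here are illustrative, not the actual OpenSearch tracker API:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of the tracker's local-files map. ConcurrentHashMap makes
// each individual operation atomic, but a wholesale replacement triggered by a
// newer refresh can still drop entries that an in-flight upload depends on.
class LocalFilesTrackerSketch {
    private final Map<String, Long> latestLocalFiles = new ConcurrentHashMap<>();

    // Called on every refresh: replaces the tracked view of local segment files.
    void replaceWith(Map<String, Long> current) {
        latestLocalFiles.keySet().retainAll(current.keySet());
        latestLocalFiles.putAll(current);
    }

    boolean contains(String file) {
        return latestLocalFiles.containsKey(file);
    }
}
```

For example, an upload of `_0.si` starts after the first refresh; a second refresh (after a merge) replaces the map with only `_1.si`, and the uploader now finds `_0.si` missing even though no two threads ever touched the map at the same instant.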
The case you mentioned can happen if a retry is ongoing that coincides with a refresh/flush. On the solution part, let's not introduce more synchronisation or lock usage; we can deep clone the segment tracker and use the clone throughout the method's lifecycle.
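The deep-clone idea could look roughly like this: the upload method takes an immutable copy of the tracker's local-files map once at entry and uses only the copy afterwards, so later refreshes cannot change what the upload sees. This is a sketch under the assumption that the tracker exposes its map; the helper name is hypothetical:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class UploadSnapshotSketch {
    // Hypothetical helper: copy the tracker's local-files map at the start of
    // the upload method, detaching it from mutations by later refreshes.
    static Map<String, Long> snapshotForUpload(Map<String, Long> trackerLocalFiles) {
        return Map.copyOf(trackerLocalFiles); // immutable snapshot
    }
}
```

The trade-off, as noted below, is extra copying per refresh, but it avoids introducing any new locks.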
Ideally, refreshes and the afterRefresh listeners are not supposed to be executed concurrently. But with the retry path in OpenSearch/server/src/main/java/org/opensearch/index/shard/ReleasableRetryableRefreshListener.java, Lines 124 to 131 in c132db9, that concurrency becomes possible.
Even after the upload is completed, there is another place where we use the local files reference from the segment tracker, which could give incorrect results due to the concurrency allowed with retries: OpenSearch/server/src/main/java/org/opensearch/index/shard/RemoteStoreRefreshListener.java, Lines 293 to 303 in c132db9
Yeah, not going to introduce any more locks as they are not needed. But we need to understand if we can skip the call at OpenSearch/server/src/main/java/org/opensearch/index/shard/RemoteStoreRefreshListener.java, Lines 123 to 140 in c132db9.
We anyway update this map in syncSegments. The only caveat I can see is that the bytes-lag computation would be delayed a bit. Is there any other use of this call? The deep-clone solution would complicate the code further; ideally, the segment tracker should support addition of new files and removal of files as they are uploaded, instead of only supporting a replacement of the local files map.
If any upload takes a lot of time, the lag would be incorrect and it would be difficult to debug issues. We need this at both places to ensure that local refreshes show up ASAP and, if a retry is happening, that the local state correctly reflects any refreshes performed since the original retry.
The local map contains the state as it is on local. The approach you mentioned, about removing files once uploaded, can lead to memory-leak situations, as we have seen that the segments file can reference file names that are not present locally in certain circumstances. As discussed, we can create a clone in the method's scope to ensure that the segment tracker is populated correctly.
@linuxpi @gbbafna - the issue mentioned in the description is different from what we have discussed above. The issue that @gbbafna has mentioned happens when the SegmentInfos returns a segment file that does not exist locally. This leads to two problems: 1) during the upload, the segment file is missing and the upload fails; 2) after the upload fails, we try to update the segment tracker with the failed-upload details, and they do not exist. We still need to handle these two cases - just FYI.
I think it's the same issue that @gbbafna has mentioned. The files being uploaded would still not be deleted; the only issue was that the local segments map in the segment tracker was getting updated prematurely.
We should have a separate issue for this, I believe, as GatedCloseable is not able to guarantee the file won't be removed.
This is due to the same issue we are trying to solve here. If the local files map had contained the files that were being uploaded, it would not have failed.
sounds good. raised a draft PR #11896 |
@gbbafna @ashking94 @sachinpkale Looks like the issue still persists. Re-opening
Describe the bug
During remote store uploads, we do

```java
GatedCloseable<SegmentInfos> segmentInfosGatedCloseable = indexShard.getSegmentInfosSnapshot();
```

which makes sure that the files we are uploading are not deleted or merged away. However, we have seen in the async flow that multiple refreshes can happen while the remote store refresh listener is still uploading older files. When it tries to update the tracker, it sees that the file has been deleted from local.
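A minimal sketch of the GatedCloseable pattern referred to above: while the wrapper is open it pins the snapshot against release, but it says nothing about keeping the tracker's local-files map in sync, which is the gap this issue describes. This mirrors the shape, not the actual code, of OpenSearch's `GatedCloseable<SegmentInfos>`:

```java
import java.io.Closeable;

// Simplified stand-in for OpenSearch's GatedCloseable<T>: holds a reference
// and a release action to run when the caller is done with the snapshot.
final class GatedCloseableSketch<T> implements Closeable {
    private final T ref;
    private final Runnable onClose;

    GatedCloseableSketch(T ref, Runnable onClose) {
        this.ref = ref;
        this.onClose = onClose;
    }

    T get() { return ref; }

    @Override
    public void close() { onClose.run(); } // releases the pin on the snapshot
}
```

A caller would typically use it in a try-with-resources block so the pinned snapshot is released even if the upload throws.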
Stack Trace
To Reproduce
Steps to reproduce the behavior:
Expected behavior
The locked files should not get deleted