[BUG] [Snapshot Interop] Optimize batch async blob cleanup during snapshot deletion for remote store enabled indices. #12302

Closed
harishbhakuni opened this issue Feb 13, 2024 · 1 comment

@harishbhakuni
Contributor

Describe the bug

Currently, during snapshot deletion we asynchronously clean up shard blobs in batches of 1000 blobs at a time. If the index is remote store enabled, we also release the lock for each shard blob and then, if the index has already been deleted from the cluster, run remote store cleanup. If either the lock release or the remote store cleanup fails for even one shard, we end up skipping the cleanup of the entire batch.

RemoteStoreLockManager remoteStoreMetadataLockManager = remoteStoreLockManagerFactory.newLockManager(
    remoteStoreRepoForIndex,
    indexUUID,
    shardId
);
remoteStoreMetadataLockManager.release(
    FileLockInfo.getLockInfoBuilder().withAcquirerId(snapshotUUID).build()
);
if (!isIndexPresent(clusterService, indexUUID)) {
    // this is a temporary solution where snapshot deletion triggers remote store side
    // cleanup if index is already deleted. We will add a poller in future to take
    // care of remote store side cleanup.
    // see https://github.com/opensearch-project/OpenSearch/issues/8469
    new RemoteSegmentStoreDirectoryFactory(
        remoteStoreLockManagerFactory.getRepositoriesService(),
        threadPool
    ).newDirectory(
        remoteStoreRepoForIndex,
        indexUUID,
        new ShardId(Index.UNKNOWN_INDEX_NAME, indexUUID, Integer.valueOf(shardId))
    ).close();
}

Because of this, the next run calls release lock again for the entire batch. This can be optimized by skipping shard blob cleanup only for those shards for which release lock or remote store cleanup failed.
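To illustrate the idea, here is a minimal, self-contained sketch of per-shard failure isolation. It is not the OpenSearch implementation: `PerShardCleanupSketch`, `ShardCleanup`, and `blobsEligibleForDeletion` are hypothetical names, and the `ShardCleanup` callback merely stands in for the release lock and remote store cleanup step shown above. Each shard's cleanup is wrapped in its own try/catch, so a failure removes only that shard's blobs from the set that gets deleted.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class PerShardCleanupSketch {

    /** Hypothetical stand-in for "release lock + remote store cleanup" of a single shard. */
    public interface ShardCleanup {
        void run(String shardId) throws Exception;
    }

    /**
     * Runs the cleanup step once per shard and returns only the blobs of shards
     * whose cleanup succeeded. Blobs of failed shards are left out so that a
     * later run can retry them, while the rest of the batch is still deleted.
     */
    public static List<String> blobsEligibleForDeletion(
        Map<String, List<String>> blobsByShard,
        ShardCleanup cleanup
    ) {
        List<String> eligible = new ArrayList<>();
        for (Map.Entry<String, List<String>> entry : blobsByShard.entrySet()) {
            try {
                cleanup.run(entry.getKey());
                eligible.addAll(entry.getValue());
            } catch (Exception e) {
                // Skip only this shard's blobs; the other shards in the batch are unaffected.
            }
        }
        return eligible;
    }

    public static void main(String[] args) {
        Map<String, List<String>> blobsByShard = new LinkedHashMap<>();
        blobsByShard.put("0", List.of("blob-0a", "blob-0b"));
        blobsByShard.put("1", List.of("blob-1a"));
        blobsByShard.put("2", List.of("blob-2a"));

        // Simulate a release lock failure for shard 1 only.
        List<String> eligible = blobsEligibleForDeletion(blobsByShard, shardId -> {
            if (shardId.equals("1")) {
                throw new IllegalStateException("release lock failed for shard " + shardId);
            }
        });

        System.out.println(eligible); // prints [blob-0a, blob-0b, blob-2a]
    }
}
```

With this shape, a retry only has to revisit shard 1 instead of calling release lock again for every shard in the batch.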

Related component

Storage:Snapshots

Expected behavior

During batch shard blob deletion, if release lock or remote store cleanup fails, we should skip deletion only for the shard blobs of the shards that failed.

@peternied
Member

[Triage - attendees 1 2 3 4 5 6 7 8]
@harishbhakuni Thanks for filing, sounds like this is an important issue to resolve quickly.

Projects
Status: ✅ Done

4 participants