[BUG] [Snapshot Interop] Optimize batch async blob cleanup during snapshot deletion for remote store enabled indices. #12302

harishbhakuni · 2024-02-13T16:42:22Z

Describe the bug

Currently, during snapshot deletion, we asynchronously clean up shard blobs in batches of 1000 blobs at a time. If the index is remote store enabled, we also release the lock for each shard blob and, if the index has already been deleted from the cluster, run remote store cleanup. If either the lock release or the remote store cleanup fails for even one shard, we end up skipping the cleanup of the entire batch.

OpenSearch/server/src/main/java/org/opensearch/repositories/blobstore/BlobStoreRepository.java

Lines 1135 to 1155 in 76ae14a

    
    RemoteStoreLockManager remoteStoreMetadataLockManager = remoteStoreLockManagerFactory.newLockManager(
        remoteStoreRepoForIndex,
        indexUUID,
        shardId
    );
    remoteStoreMetadataLockManager.release(
        FileLockInfo.getLockInfoBuilder().withAcquirerId(snapshotUUID).build()
    );
    if (!isIndexPresent(clusterService, indexUUID)) {
        // this is a temporary solution where snapshot deletion triggers remote store side
        // cleanup if index is already deleted. We will add a poller in future to take
        // care of remote store side cleanup.
        // see https://github.com/opensearch-project/OpenSearch/issues/8469
        new RemoteSegmentStoreDirectoryFactory(
            remoteStoreLockManagerFactory.getRepositoriesService(),
            threadPool
        ).newDirectory(
            remoteStoreRepoForIndex,
            indexUUID,
            new ShardId(Index.UNKNOWN_INDEX_NAME, indexUUID, Integer.valueOf(shardId))
        ).close();

As a result, the next run calls release lock again for the entire batch. This can be optimized by skipping shard blob cleanup only for those shards whose lock release or remote store cleanup failed.
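
A minimal sketch of the per-shard handling described above, written as standalone Java rather than against the real `BlobStoreRepository` code. The names `PerShardCleanupSketch`, `ShardCleanup`, `BatchResult`, and `processBatch` are hypothetical and exist only for illustration; `ShardCleanup` stands in for the lock release plus the optional remote store cleanup of one shard.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Illustrative only: none of these types exist in OpenSearch.
final class PerShardCleanupSketch {

    /** Hypothetical stand-in for "release lock, then remote store cleanup" of one shard. */
    interface ShardCleanup {
        void run(String shardId) throws Exception;
    }

    /** Outcome of one batch: blobs that are safe to delete now, and shards left for the next run. */
    static final class BatchResult {
        final List<String> blobsToDelete = new ArrayList<>();
        final List<String> failedShards = new ArrayList<>();
    }

    /**
     * Runs the cleanup for each shard independently. A failure for one shard
     * only excludes that shard's blobs; the rest of the batch is still deleted.
     */
    static BatchResult processBatch(Map<String, List<String>> blobsByShard, ShardCleanup cleanup) {
        BatchResult result = new BatchResult();
        for (Map.Entry<String, List<String>> entry : blobsByShard.entrySet()) {
            try {
                cleanup.run(entry.getKey());
                result.blobsToDelete.addAll(entry.getValue());
            } catch (Exception e) {
                // Skip only this shard's blobs; they are retried in a later run.
                result.failedShards.add(entry.getKey());
            }
        }
        return result;
    }
}
```

With this shape, a later run only needs to repeat the lock release for the shards listed in `failedShards` instead of for the whole batch.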

Related component

Storage:Snapshots

Expected behavior

During batch shard blob deletion, if release lock or remote store cleanup fails for some shards, we should skip deletion only for the shard blobs of those shards.

peternied · 2024-02-14T16:26:53Z

[Triage - attendees 1 2 3 4 5 6 7 8]
@harishbhakuni Thanks for filing, sounds like this is an important issue to resolve quickly.

harishbhakuni added the bug and untriaged labels Feb 13, 2024

github-actions bot added the Storage:Snapshots label Feb 13, 2024

peternied added Severity-Critical and removed untriaged labels Feb 14, 2024

harishbhakuni mentioned this issue Feb 14, 2024

Optimize remote store operations during snapshot Deletion #12319

Merged

8 tasks

Bukhtawar added the Storage-Lifecycle label Feb 15, 2024

github-project-automation bot added this to Storage Project Board Feb 15, 2024

github-project-automation bot moved this to 🆕 New in Storage Project Board Feb 15, 2024

Bukhtawar removed the Storage-Lifecycle label Feb 15, 2024

gbbafna closed this as completed Apr 4, 2024

github-project-automation bot moved this from 🆕 New to ✅ Done in Storage Project Board Apr 4, 2024
