
[Remote Translog] Remove PRRL creation & modify assertions for remote translog enabled Indexes #8795

Open
ashking94 opened this issue Jul 20, 2023 · 1 comment
Labels
enhancement (Enhancement or improvement to existing feature or request), needs more info, Storage:Durability (Issues and PRs related to the durability framework)

Comments

@ashking94
Member

Is your feature request related to a problem? Please describe.
Following up from #8484 (comment), we are seeing that PRRLs (peer recovery retention leases) do get created for remote enabled indexes. In #8484 we came across cases where assertions were failing because RemoteStoreRecoverySource expects the retention leases to be empty, yet retention leases were still present in memory.
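For illustration only, the failing guard is an assertion on the shard's in-memory retention leases during a remote store recovery. The snippet below is a sketch of that shape, not the actual RemoteStoreRecoverySource code, and the indexShard variable is hypothetical:

// Sketch only (not the actual OpenSearch source): a remote store recovery
// expects the shard to carry no retention leases, so an assertion of this
// shape trips when PRRLs are still present in memory.
assert indexShard.getRetentionLeases().leases().isEmpty()
    : "expected no retention leases for remote store recovery but found " + indexShard.getRetentionLeases();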

Describe the solution you'd like
We will need to remove the creation of peer recovery retention leases for remote translog enabled indexes.
One of the places where these retention leases are created is addPeerRecoveryRetentionLeaseForSolePrimary():

private void addPeerRecoveryRetentionLeaseForSolePrimary() {
    assert primaryMode;
    assert Thread.holdsLock(this);
    final ShardRouting primaryShard = routingTable.primaryShard();
    final String leaseId = getPeerRecoveryRetentionLeaseId(primaryShard);
    if (retentionLeases.get(leaseId) == null) {
        if (replicationGroup.getReplicationTargets().equals(Collections.singletonList(primaryShard))
            || indexSettings().isRemoteTranslogStoreEnabled()) {
            assert primaryShard.allocationId().getId().equals(shardAllocationId) : routingTable.assignedShards()
                + " vs "
                + shardAllocationId;
            // Safe to call innerAddRetentionLease() without a subsequent sync since there are no other members of this replication
            // group.
            logger.trace("addPeerRecoveryRetentionLeaseForSolePrimary: adding lease [{}]", leaseId);
            innerAddRetentionLease(
                leaseId,
                Math.max(0L, checkpoints.get(shardAllocationId).globalCheckpoint + 1),
                PEER_RECOVERY_RETENTION_LEASE_SOURCE
            );
            hasAllPeerRecoveryRetentionLeases = true;
        } else {
            /*
             * We got here via a rolling upgrade from an older version that doesn't create peer recovery retention
             * leases for every shard copy, but in this case we do not expect any leases to exist.
             */
            assert hasAllPeerRecoveryRetentionLeases == false : routingTable + " vs " + retentionLeases;
            logger.debug("{} becoming primary of {} with missing lease: {}", primaryShard, routingTable, retentionLeases);
        }
    } else if (hasAllPeerRecoveryRetentionLeases == false
        && routingTable.assignedShards()
            .stream()
            .allMatch(
                shardRouting -> retentionLeases.contains(getPeerRecoveryRetentionLeaseId(shardRouting))
                    || checkpoints.get(shardRouting.allocationId().getId()).tracked == false
            )) {
        // Although this index is old enough not to have all the expected peer recovery retention leases, in fact it does, so we
        // don't need to do any more work.
        hasAllPeerRecoveryRetentionLeases = true;
    }
}
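A minimal sketch of the direction this change could take, reusing the names from the snippet above; the early return and its placement are assumptions, not the final fix:

private void addPeerRecoveryRetentionLeaseForSolePrimary() {
    assert primaryMode;
    assert Thread.holdsLock(this);
    if (indexSettings().isRemoteTranslogStoreEnabled()) {
        // Sketch: remote translog enabled indexes recover from the remote store,
        // so no peer recovery retention lease needs to be created here at all.
        hasAllPeerRecoveryRetentionLeases = true;
        return;
    }
    // ... existing lease creation for the sole primary stays as-is for
    // docrep (non remote translog) indexes ...
}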

Syncing of retention leases also needs to be disabled for remote translog enabled indexes:

private void sync(final Consumer<IndexShard> sync, final String source) {
    for (final IndexShard shard : this.shards.values()) {
        if (shard.routingEntry().active() && shard.routingEntry().primary()) {
            switch (shard.state()) {
                case CLOSED:
                case CREATED:
                case RECOVERING:
                    continue;
                case POST_RECOVERY:
                    assert false : "shard " + shard.shardId() + " is in post-recovery but marked as active";
                    continue;
                case STARTED:
                    try {
                        shard.runUnderPrimaryPermit(() -> sync.accept(shard), e -> {
                            if (e instanceof AlreadyClosedException == false
                                && e instanceof IndexShardClosedException == false
                                && e instanceof ShardNotInPrimaryModeException == false) {
                                logger.warn(new ParameterizedMessage("{} failed to execute {} sync", shard.shardId(), source), e);
                            }
                        }, ThreadPool.Names.SAME, source + " sync");
                    } catch (final AlreadyClosedException | IndexShardClosedException e) {
                        // the shard was closed concurrently, continue
                    }
                    continue;
                default:
                    throw new IllegalStateException("unknown state [" + shard.state() + "]");
            }
        }
    }
}
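A minimal sketch of how the retention lease sync could be skipped, assuming the retention lease background sync is wired through this generic sync() helper; the syncRetentionLeases() call on IndexShard, the per-shard indexSettings() check and the "retention lease" source string are assumptions used here only for illustration:

// Sketch only: skip the retention lease sync for remote translog enabled shards.
// The guard could equally live at the call site that passes the retention lease
// consumer into sync(...).
sync(shard -> {
    if (shard.indexSettings().isRemoteTranslogStoreEnabled()) {
        // Remote translog enabled indexes do not rely on PRRLs, so there is
        // nothing to sync for them.
        return;
    }
    shard.syncRetentionLeases();
}, "retention lease");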

There might be more places where retention leases are created; this needs to be investigated.


ashking94 added the enhancement and Storage:Durability labels Jul 20, 2023
BhumikaSaini-Amazon pushed a commit to BhumikaSaini-Amazon/OpenSearch that referenced this issue Jul 20, 2023
@gbbafna gbbafna moved this from 🆕 New to Ready To Be Picked in Storage Project Board Aug 8, 2024
@gbbafna
Collaborator

gbbafna commented Aug 8, 2024

Storage Triage:

Are we seeing test failures due to the same now, @ashking94? Is this issue still relevant?

@gbbafna gbbafna moved this from Ready To Be Picked to 🆕 New in Storage Project Board Aug 8, 2024