
[Remote Translog] Remove PRRL creation & modify assertions for remote translog enabled Indexes #8795

Open
ashking94 opened this issue Jul 20, 2023 · 1 comment
Labels
enhancement (Enhancement or improvement to existing feature or request), needs more info, Storage:Durability (Issues and PRs related to the durability framework)

Comments

@ashking94
Member

Is your feature request related to a problem? Please describe.
Following up from #8484 (comment), we are seeing that PRRLs (peer recovery retention leases) do get created for remote enabled indexes. In #8484 we came across cases where assertions were failing because RemoteStoreRecoverySource expects the retention leases to be empty, yet retention leases were still present in memory.
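For illustration only, the failing guard is an assertion on the shard's in-memory retention leases during a remote store recovery. The snippet below is a sketch of that shape, not the actual RemoteStoreRecoverySource code, and the indexShard variable is hypothetical:

// Sketch only (not the actual OpenSearch source): a remote store recovery
// expects the shard to carry no retention leases, so an assertion of this
// shape trips when PRRLs are still present in memory.
assert indexShard.getRetentionLeases().leases().isEmpty()
    : "expected no retention leases for remote store recovery but found " + indexShard.getRetentionLeases();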

Describe the solution you'd like
We will need to remove the creation of peer recovery retention leases for remote translog enabled indexes.
One of the places where these retention leases are created is addPeerRecoveryRetentionLeaseForSolePrimary():

private void addPeerRecoveryRetentionLeaseForSolePrimary() {
    assert primaryMode;
    assert Thread.holdsLock(this);
    final ShardRouting primaryShard = routingTable.primaryShard();
    final String leaseId = getPeerRecoveryRetentionLeaseId(primaryShard);
    if (retentionLeases.get(leaseId) == null) {
        if (replicationGroup.getReplicationTargets().equals(Collections.singletonList(primaryShard))
            || indexSettings().isRemoteTranslogStoreEnabled()) {
            assert primaryShard.allocationId().getId().equals(shardAllocationId) : routingTable.assignedShards()
                + " vs "
                + shardAllocationId;
            // Safe to call innerAddRetentionLease() without a subsequent sync since there are no other members of this replication
            // group.
            logger.trace("addPeerRecoveryRetentionLeaseForSolePrimary: adding lease [{}]", leaseId);
            innerAddRetentionLease(
                leaseId,
                Math.max(0L, checkpoints.get(shardAllocationId).globalCheckpoint + 1),
                PEER_RECOVERY_RETENTION_LEASE_SOURCE
            );
            hasAllPeerRecoveryRetentionLeases = true;
        } else {
            /*
             * We got here via a rolling upgrade from an older version that doesn't create peer recovery retention
             * leases for every shard copy, but in this case we do not expect any leases to exist.
             */
            assert hasAllPeerRecoveryRetentionLeases == false : routingTable + " vs " + retentionLeases;
            logger.debug("{} becoming primary of {} with missing lease: {}", primaryShard, routingTable, retentionLeases);
        }
    } else if (hasAllPeerRecoveryRetentionLeases == false
        && routingTable.assignedShards()
            .stream()
            .allMatch(
                shardRouting -> retentionLeases.contains(getPeerRecoveryRetentionLeaseId(shardRouting))
                    || checkpoints.get(shardRouting.allocationId().getId()).tracked == false
            )) {
        // Although this index is old enough not to have all the expected peer recovery retention leases, in fact it does, so we
        // don't need to do any more work.
        hasAllPeerRecoveryRetentionLeases = true;
    }
}
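A minimal sketch of the direction this change could take, reusing the names from the snippet above; the early return and its placement are assumptions, not the final fix:

private void addPeerRecoveryRetentionLeaseForSolePrimary() {
    assert primaryMode;
    assert Thread.holdsLock(this);
    if (indexSettings().isRemoteTranslogStoreEnabled()) {
        // Sketch: remote translog enabled indexes recover from the remote store,
        // so no peer recovery retention lease needs to be created here at all.
        hasAllPeerRecoveryRetentionLeases = true;
        return;
    }
    // ... existing lease creation for the sole primary stays as-is for
    // docrep (non remote translog) indexes ...
}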

Syncing of retention leases also needs to be disabled for remote translog enabled indexes:

private void sync(final Consumer<IndexShard> sync, final String source) {
    for (final IndexShard shard : this.shards.values()) {
        if (shard.routingEntry().active() && shard.routingEntry().primary()) {
            switch (shard.state()) {
                case CLOSED:
                case CREATED:
                case RECOVERING:
                    continue;
                case POST_RECOVERY:
                    assert false : "shard " + shard.shardId() + " is in post-recovery but marked as active";
                    continue;
                case STARTED:
                    try {
                        shard.runUnderPrimaryPermit(() -> sync.accept(shard), e -> {
                            if (e instanceof AlreadyClosedException == false
                                && e instanceof IndexShardClosedException == false
                                && e instanceof ShardNotInPrimaryModeException == false) {
                                logger.warn(new ParameterizedMessage("{} failed to execute {} sync", shard.shardId(), source), e);
                            }
                        }, ThreadPool.Names.SAME, source + " sync");
                    } catch (final AlreadyClosedException | IndexShardClosedException e) {
                        // the shard was closed concurrently, continue
                    }
                    continue;
                default:
                    throw new IllegalStateException("unknown state [" + shard.state() + "]");
            }
        }
    }
}
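A minimal sketch of how the retention lease sync could be skipped, assuming the retention lease background sync is wired through this generic sync() helper; the syncRetentionLeases() call on IndexShard, the per-shard indexSettings() check and the "retention lease" source string are assumptions used here only for illustration:

// Sketch only: skip the retention lease sync for remote translog enabled shards.
// The guard could equally live at the call site that passes the retention lease
// consumer into sync(...).
sync(shard -> {
    if (shard.indexSettings().isRemoteTranslogStoreEnabled()) {
        // Remote translog enabled indexes do not rely on PRRLs, so there is
        // nothing to sync for them.
        return;
    }
    shard.syncRetentionLeases();
}, "retention lease");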

There might be more places where retention leases are created; this needs to be investigated.


ashking94 added the enhancement and Storage:Durability labels Jul 20, 2023
BhumikaSaini-Amazon pushed a commit to BhumikaSaini-Amazon/OpenSearch that referenced this issue Jul 20, 2023
@gbbafna gbbafna moved this from 🆕 New to Ready To Be Picked in Storage Project Board Aug 8, 2024
@gbbafna
Collaborator

gbbafna commented Aug 8, 2024

Storage Triage:

Are we seeing test failures due to the same now, @ashking94? Is this issue still relevant?

@gbbafna gbbafna moved this from Ready To Be Picked to 🆕 New in Storage Project Board Aug 8, 2024