[Segment Replication] Sequence number based recoveries #10003

Open
dreamer-89 opened this issue Sep 12, 2023 · 0 comments
Labels
discuss: Issues intended to help drive brainstorming and decision making
Indexing:Replication: Issues and PRs related to core replication framework eg segrep


dreamer-89 commented Sep 12, 2023

Coming out of the #6761 exercise, a few tests fail (listed below) because sequence number based recovery is not attempted. Seq no based recovery requires the replica shard to identify a starting sequence number during recovery and to replay (index) local translog operations on the underlying engine. Since NRTReplicationEngine does not support indexing operations, UNASSIGNED_SEQ_NO is always used as the starting seq no. On the source (primary), this results in sequence number based recovery not being attempted. I created this issue to validate that this behavior has no undesired implications beyond the test failures.
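To make the decision described above concrete, here is a minimal sketch of how the source side ends up on the file-based path. It is not the actual recovery source code: the class `RecoveryPlanSketch`, the `Plan` enum, and the `choose` helper are hypothetical names used only for illustration, and the real check involves more conditions than shown. The constant mirrors the value of `SequenceNumbers.UNASSIGNED_SEQ_NO`.

```java
// Simplified sketch under stated assumptions; not the real recovery source logic.
final class RecoveryPlanSketch {
    // Mirrors the value of SequenceNumbers.UNASSIGNED_SEQ_NO.
    static final long UNASSIGNED_SEQ_NO = -2L;

    enum Plan { SEQ_NO_BASED, FILE_BASED }

    // Hypothetical helper: the source can only replay operations if the target
    // reported a real starting sequence number and the source still retains
    // the operation history from that point on.
    static Plan choose(long startingSeqNo, boolean sourceHasHistory) {
        if (startingSeqNo == UNASSIGNED_SEQ_NO) {
            return Plan.FILE_BASED;
        }
        return sourceHasHistory ? Plan.SEQ_NO_BASED : Plan.FILE_BASED;
    }

    public static void main(String[] args) {
        // A replica on NRTReplicationEngine always reports UNASSIGNED_SEQ_NO,
        // so the source never takes the seq no based path.
        System.out.println(choose(UNASSIGNED_SEQ_NO, true)); // FILE_BASED
        // A target that reports a real starting seq no can be served by replay.
        System.out.println(choose(42L, true)); // SEQ_NO_BASED
    }
}
```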

Sequence number based recovery is a performance improvement: it recovers data locally by replaying local translog operations and skips phase 1 of recovery, which copies data (segment) files from the primary. Phase 1 includes copying files whose checksums differ from the replica's copies, which adds to recovery time. With segment replication, replicas sync segment files from the primary and thus should not have files with different checksums, so avoiding seq no based recovery should not cause performance issues. With document replication, the replica and primary running on InternalEngine have data files with different checksums at the latest commit point, which is not the case with segment replication.
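The checksum argument can be illustrated with a small sketch of the phase 1 file comparison. This is an assumption-laden simplification, not the store's real diff logic: `Phase1DiffSketch` and `filesToCopy` are hypothetical names, and the file names and checksum strings are made-up sample values. The point is only that when the replica's files already match the primary's, phase 1 has nothing to copy.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Simplified sketch; file names and checksums below are made-up sample data.
final class Phase1DiffSketch {
    // Returns the files phase 1 would need to copy: files that are missing on
    // the replica or present there with a different checksum.
    static Set<String> filesToCopy(Map<String, String> primary, Map<String, String> replica) {
        Set<String> toCopy = new TreeSet<>();
        for (Map.Entry<String, String> entry : primary.entrySet()) {
            String replicaChecksum = replica.get(entry.getKey());
            if (!entry.getValue().equals(replicaChecksum)) {
                toCopy.add(entry.getKey());
            }
        }
        return toCopy;
    }

    public static void main(String[] args) {
        Map<String, String> primary = Map.of("_0.cfs", "aaa", "_1.cfs", "bbb");
        // Segment replication: the replica copied its segments from the primary.
        Map<String, String> segrepReplica = Map.of("_0.cfs", "aaa", "_1.cfs", "bbb");
        // Document replication: the replica built its own segments locally.
        Map<String, String> docrepReplica = Map.of("_0.cfs", "aaa", "_1.cfs", "zzz");

        System.out.println(filesToCopy(primary, segrepReplica)); // []
        System.out.println(filesToCopy(primary, docrepReplica)); // [_1.cfs]
    }
}
```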

Test failures

  1. PrimaryAllocationIT.testPrimaryReplicaResyncFailed: [BUG] [Segment Replication] Resync failures results in removal of in-sync allocation id #7163
  2. CloseIndexIT.testNoopPeerRecoveriesWhenIndexClosed: [BUG] [Segment Replication] NO-OP recovery not attempted #7161
  3. ReplicaShardAllocatorIT.testFullClusterRestartPerformNoopRecovery: [BUG] [Segment Replication] NO-OP recovery not attempted #7161
@dreamer-89 dreamer-89 added discuss Issues intended to help drive brainstorming and decision making untriaged distributed framework Indexing:Replication Issues and PRs related to core replication framework eg segrep labels Sep 12, 2023