-
Notifications
You must be signed in to change notification settings - Fork 1.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Segment Replication] Peer recovery checkpoint publication invariants #3923
Comments
Checklist of things to do:
|
@Bukhtawar I have put out a small PR which checks if shard is in PrimaryMode before publishing a checkpoint. Please let me your thoughts on this and if this check is enough to ensure we don't publish checkpoints when not in PrimaryMode. I think we might need an integration test to cover the specific scenario of replica shard Promotion to primary, but writing integ test for this may take a while until this issue is fixed. |
@Bukhtawar From your suggested solution: After ensuring we don't publish checkpoint if shard is not in Primary Mode, I was looking to write a check for PrimaryMode to discard checkpoint as you suggested on checkpoint receiving end here before calling onNewCheckpoint(). But I cannot access/check if shard is in PrimaryMode or not. Because I don't have access to replicationtracker or I am not in same package so I cannot do PrimaryMode check in PublishCheckpointAction class. A workaround for this is to change access of this method getReplicationTracker() from default to public. I am not sure if it is a worth/good practice to change this method access to public. Please let me know your thoughts on this PrimaryMode check on checkpoint receiving end and discarding it. Also your thoughts on if this check is actually necessary ? as we are only publishing checkpoint on primaryMode. |
@Rishikesh1159 I don't think you need to change the visibility of the replication tracker. |
The logic and unit tests for this issue have been merged with this PR. Keeping this issue open until an integration test is added for testing this |
@Rishikesh1159 : One idea around mimicking this situation is by mocking the transport action responsible for updating the cluster state. One sample test in SegmentReplicationIT. @Bukhtawar : Do you have better ideas around mimicking this scenario ? |
Is your feature request related to a problem? Please describe.
With Segment replication the primary publishes checkpoints on OpenSearch refresh/flush, however soon after the relocation hand-off is complete the peer primary source gets into replica mode which ensures that it is no longer responsible for global checkpoint tracking and the peer primary target should take over. However it's possible the peer primary source continues to concurrently perform refreshes and publish checkpoints concurrently with the
RECOVERING
peer primary target till the cluster state update changes their respective shard state.Note: The recovering target has the same primary term as that of the recovery source and hence just primary term checks during checkpoint publication and checkpoint processing might not be good enough
Describe the solution you'd like
The checkpoint publication needs to check on the primaryMode to ensure it doesn't publish checkpoint and the recovery target to discard any checkpoints if it is in primaryMode during the peer recovery process.
Describe alternatives you've considered
A clear and concise description of any alternative solutions or features you've considered.
Additional context
Add any other context or screenshots about the feature request here.
The text was updated successfully, but these errors were encountered: