[Segment Replication] Differentiate post merge checkpoints #10872

mch2 · 2023-10-23T20:06:56Z

Is your feature request related to a problem? Please describe.
Today SegRep uses the ReplicationCheckpoint to compute replica staleness, report stats, and enforce backpressure on lagging replicas. It would be useful to report separately in metrics, and to exclude from backpressure computations, the lag caused by checkpoints that come strictly from post-merge refreshes, where the searchable doc count does not change.

Describe the solution you'd like
While this data should still be surfaced, it should be excluded from backpressure computations.
/_cat/segment_replication should still show ongoing lag, but with a separate column that identifies the lag for syncing to a merged checkpoint.
This should also be surfaced through the _nodes/stats API as a separate metric for both bytes and time lag.
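
To make the proposed split concrete, here is a minimal sketch of how lag could be bucketed so that post-merge checkpoints are reported but never feed the backpressure decision. Every name in it (PendingCheckpoint, LagSummary, ReplicaLag, the postMerge flag) is hypothetical and not part of the existing OpenSearch code. It also assumes that bytes lag is summed across pending checkpoints and that time lag is taken from the oldest one.

```java
import java.util.List;

// Hypothetical stand-in for a checkpoint that a replica has not yet synced to.
final class PendingCheckpoint {
    final long bytesBehind;
    final long millisBehind;
    final boolean postMerge; // assumed flag: refresh only published merged segments

    PendingCheckpoint(long bytesBehind, long millisBehind, boolean postMerge) {
        this.bytesBehind = bytesBehind;
        this.millisBehind = millisBehind;
        this.postMerge = postMerge;
    }
}

// Lag split into the part that counts toward backpressure and the merge-only part.
final class LagSummary {
    final long backpressureBytes;
    final long backpressureMillis;
    final long mergeBytes;
    final long mergeMillis;

    LagSummary(long backpressureBytes, long backpressureMillis, long mergeBytes, long mergeMillis) {
        this.backpressureBytes = backpressureBytes;
        this.backpressureMillis = backpressureMillis;
        this.mergeBytes = mergeBytes;
        this.mergeMillis = mergeMillis;
    }
}

final class ReplicaLag {
    // Sum bytes and keep the largest time lag, separately for each bucket.
    static LagSummary summarize(List<PendingCheckpoint> pending) {
        long bpBytes = 0, bpMillis = 0, mergeBytes = 0, mergeMillis = 0;
        for (PendingCheckpoint cp : pending) {
            if (cp.postMerge) {
                mergeBytes += cp.bytesBehind;
                mergeMillis = Math.max(mergeMillis, cp.millisBehind);
            } else {
                bpBytes += cp.bytesBehind;
                bpMillis = Math.max(bpMillis, cp.millisBehind);
            }
        }
        return new LagSummary(bpBytes, bpMillis, mergeBytes, mergeMillis);
    }

    // Only the non-merge bucket is compared against the limit.
    static boolean shouldApplyBackpressure(LagSummary lag, long maxMillis) {
        return lag.backpressureMillis > maxMillis;
    }
}
```

With this shape, the merge bucket can back the extra column in /_cat/segment_replication and the separate _nodes/stats metric, while the backpressure check ignores it.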

Perhaps we can restructure the cat segment replication API to show a row for each checkpoint the replica is lagging behind, with an isMerge column.

Describe alternatives you've considered
Leaving as is / silently excluding these checkpoints.

mch2 · 2023-12-14T20:42:39Z

Rather than differentiating checkpoints, we should implement IndexWriter.IndexReaderWarmer, similar to Lucene NRT's PreCopyMergedSegmentWarmer.

At this point the gaps are as follows:

Support copying files that are not associated with any ReplicationCheckpoint out to replicas. This should be abstracted to support both node-to-node and remote-store-based copy.
Create a new function, exposed through IndexShard, that initiates the copy and does not ack until all replicas are current.
Pass that function to an implementation of IndexReaderWarmer that can be wired into EngineConfig, so that the warmer can be set in IndexWriterConfig.
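
The three gaps above could fit together roughly as in the sketch below. To stay self-contained it does not depend on Lucene or OpenSearch: MergedSegmentCopier and PreCopyWarmerSketch are hypothetical names, and the warm method takes a list of file names where a real implementation would implement IndexWriter.IndexReaderWarmer#warm(LeafReader) and be registered via IndexWriterConfig#setMergedSegmentWarmer. Treat it as an illustration of the blocking-until-acked behavior, not as the intended design.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;

// Hypothetical transport abstraction covering node-to-node or remote-store copy.
// The returned future completes once the given replica holds the files.
interface MergedSegmentCopier {
    CompletableFuture<Void> copyTo(String replicaId, List<String> fileNames);
}

// Stand-in for an IndexReaderWarmer implementation. It is invoked after a merge
// completes and before the merged segment becomes visible to a refresh.
final class PreCopyWarmerSketch {
    private final MergedSegmentCopier copier;
    private final List<String> replicaIds;

    PreCopyWarmerSketch(MergedSegmentCopier copier, List<String> replicaIds) {
        this.copier = copier;
        this.replicaIds = replicaIds;
    }

    // Starts a copy to every replica, then blocks until all of them have acked.
    void warm(List<String> mergedSegmentFiles) {
        List<CompletableFuture<Void>> acks = new ArrayList<>();
        for (String replicaId : replicaIds) {
            acks.add(copier.copyTo(replicaId, mergedSegmentFiles));
        }
        // join() rethrows a failure from any replica as a CompletionException.
        CompletableFuture.allOf(acks.toArray(new CompletableFuture[0])).join();
    }
}
```

Because the warmer blocks until every replica has the merged files, the next checkpoint that references the merged segment should have little left to copy, which is what would make a separate merge-lag metric unnecessary. Timeouts and handling of a replica that never acks are left out of this sketch.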

mch2 added the enhancement and untriaged labels · Oct 23, 2023

anasalkouz added the Indexing:Replication label and removed the untriaged label · Nov 9, 2023

kotwanikunal mentioned this issue Dec 5, 2023

Investigate the underlying Lucene internals to design the new NRT checkpointing behavior #11462

Closed

mch2 mentioned this issue Dec 14, 2023

Support copying of files that are not associated with any ReplicationCheckpoint out to replicas. #11619

Open
