Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[RFC] Existing Cluster Migration to/from Remote Store - HLD #12246

Closed
gbbafna opened this issue Feb 8, 2024 · 3 comments
Closed

[RFC] Existing Cluster Migration to/from Remote Store - HLD #12246

gbbafna opened this issue Feb 8, 2024 · 3 comments
Assignees
Labels
RFC Issues requesting major changes Storage Issues and PRs relating to data and metadata storage v2.14.0

Comments

@gbbafna
Copy link
Collaborator

gbbafna commented Feb 8, 2024

Aim

To support migration of existing Doc Rep cluster to/from Remote backed cluster which has Remote backed Segment , and Translog enabled.

RFC : #7986

Tenets

  1. Data integrity - No data loss/corruption due to the migration process itself.
  2. No downtime for customer .
    1. Writes as usual - There should be no impact to writes .
    2. Reads - The behavior will switch from DocRep <-> Remote Store in between.
  3. Migration should be possible via rolling restart or via Blue/Green mechanism.

Background & Terminology

Remote Store - https://opensearch.org/docs/latest/tuning-your-cluster/availability-and-recovery/remote-store/index/

Segment Replication - https://opensearch.org/docs/latest/tuning-your-cluster/availability-and-recovery/segment-replication/index/

  1. Remote Store is enabled in a node via static settings from yml from 2.11 version onwards.
  2. A cluster can have either all remote or all non-remote nodes.
  3. On remote backed nodes, all the indices are remote enabled .
  4. The only way to migrate between docrep and remote store backed cluster is to restore snapshot into it.

What’s not supported/covered ?

Migration involving Segment Replication enabled indices w/o remote store is not currently scoped and will be considered as a follow up .

Pre Requisites

  1. Cluster is migrated to 2.m (the version which we support the migration) .
  2. All indices are using document replication or the cluster is remote backed.

Approaches

#7986

Rolling restarts of the all the nodes

The migration process starts with updation of cluster settings to mixed mode with eventual direction as remote . This will be followed by rolling restarts of all the nodes.

DocRep to Remote Backed

Primary shard copies will be starting/relocating to remote-backed mode. Replica copy of a shard will only move to remote after primary copy movement. This will be ensured by the cluster manager. It will start operating in bilingual mode - indexing documents in doc rep replicas, uploading segments/translog to remote storage and publishing checkpoints to remote replicas . Primary shard will use node attribute to find the state of each shard copy. Whenever the primary shard starts , it will need to turn on the bilingual mode.

As we restart each nodes, the new primaries followed by replica copies will start to operate in remote-backed mode. The primary will be indexing documents to older docrep replicas as well . When all the shards of an index has migrated to remote-backed mode, we will to update the index metadata to mark the index as a remote backed index .

We will need to add the direction of the migration in cluster setting . This will be used to determine the allocation of the new indices . This ensures our migration process always moves forward as it will not increase the migration work once started .

Using node attributes also helps in modelling the remote store migration like a version upgrade process where in once we migrate the shards to remote backed node, we are not moving it back to doc rep based node.

Remote Backed to DocRep

Replica shard copies will start relocating to docrep nodes. Primary copy of a shard will only move to docrep after all replica copy movement. This will be ensured by the cluster manager. Between this the primary shard on remote nodes operates in bilingual mode - indexing documents in doc rep replicas, uploading segments/translog to remote storage and publishing checkpoints to remote replicas . Primary shard will use node attribute to find the state of each shard copy. Whenever the primary shard starts , it will need to turn on the bilingual mode.

User Story for docrep to remote migration

  1. User needs to upgrade the version to 2.m (the version which we plan to support the migration) .
  2. cluster.routing.allocation.balance.prefer_primary is enabled to balance out primary
  3. Mode is set to mixed and set direction is set to docrep
  4. Restart a node and then set direction is set to remote
  5. Restart more nodes with remote store attributes in the yml one by one . Between restarts of the node, verify that shard relocation is complete.
  6. When all nodes have been restarted , we need to switch the mode mode to strict. After this doc rep based nodes won’t be able to join the cluster

User Story for remote to docrep migration

  1. User needs to upgrade the version to 2.m (the version which we plan to support the migration) .
  2. Mode is set to mixed and set direction is set to remote
  3. Restart a node and then set direction is set to docrep
  4. Restart more nodes with remote store attributes one in the yml by one . Between restarts of the node, verify that shard relocation is complete.
  5. When all nodes have been restarted , we need to switch the mode mode to strict. After this doc rep based nodes won’t be able to join the cluster

Lifecycle of primary and replica shards for DocRep to Remote

dragonstone migration - rr(2)(1)

  1. After we set direction to remote , new shards would only come up in remote mode .

  2. Primary shards will get started/relocated on the remote nodes . We will start the primary shard on remote node when remote store is in sync . The replica shards will only start on remote nodes, after primary shard has started on remote nodes.

    1. New primaries come on remote node by primary relocation on indices
      1. having replica in rolling restarts.
      2. all indices in B/G
    2. New primaries come on remote node by local recovery on indices having no replica in rolling restarts : In local recovery , the writes can be blocked for long time , till the remote store is in complete sync . To mitigate this, it is advisable to have replica enabled for all indices.
  3. After primary shard is migrated to remote , new replica copies will start/relocate in/to remote nodes as well . The new copy will take some time to come up as it has to download all the data afresh. The read throughput can decrease during migration as all the replicas will need some time to come up .

  4. Dual Mode : When the primary shard has uploaded . From now on , it persists data on RemoteStore. But since it can have older replicas on docrep nodes, it will need to send over documents to those shards. This primary shard can only move to doc-rep mode, when it fails over and there is no remote shard available, but only a doc rep replica shard.

    1. Updates Global Checkpoint -
    2. Failover -
    3. Primary Relocation -
    4. Document Replicas - Receives documents , continues to update its own contents
    5. Remote Backed Replicas - It will publish checkpoints to these .
  5. Replica on node restart : If primary is remote , and the node is remote : Then the shard can hydrate from remote , as part of RemoteIndexRecovery

  6. When all shards of index moves to remote backed, we can set the index settings to be remote enabled. We can reload the engines of all primary nodes to switch off dual mode as well. This needs a deeper thought though.

  7. When all nodes have been restarted , we need to disable the migrating mode to false.

Lifecycle of primary and replica shards for Remote to DocRep

The main difference is that primary shard will be last to move to docrep nodes. Unlike migration to remote store, migration back to docrep nodes will have no downtime

  1. After we set direction to docrep , new shards would only come up in docrep mode .
  2. Replica shards will get started/relocated on the docrep nodes . The primary shards will only start on docrep nodes, after all replica shard have started on docrep nodes.
    1. New primaries come on docrep node by primary relocation on indices having replica
    2. New primaries come on remote node by local recovery on indices having no replica in rolling restarts.
  3. When primary is on remote node and replicas are on docrep nodes, Dual Mode keeps sending documents to those copies
  4. When all shards of index moves to remote backed, we can remove the remote enabled. index settings .
  5. When all nodes have been restarted , we need to disable the migrating mode to false.

Components

Cluster Manager

Shard Allocation Deciders

  1. Replica shards can only move to remote backed nodes after primary shard starts on remote backed node .
  2. Newly created as well as allocated primary and replica shards should be allocated only on basis of direction (remote/docrep).

Primary Promotion

  1. Once a primary shard moves to a remote node with all replicas still on doc rep nodes, its failures will lead to primary moving back to doc rep nodes. This should already be handled in PrimaryShardAllocator . But we will need to verify it .
  2. Once a primary shard is remote node with at least one replica on remote node, we should promote only remote node as primary on failover .

Indexing

In mixed mode, when the primary is on remote-backed node, replicas can be on both remote and docrep nodes. In that case the primary needs to suppy documents to docrep nodes and publish checkpoints to remote nodes. We call this Dual Replication mode, where remote backed primary shard is able to take care of both docrep replicas and remote replicas. When all the shard copies of a give shard moves to remote-backed/docrep , the Dual Replication mode would be turned off.

We need to evaluate on each action : the impact on remote primary , receivers (doc rep replica and remote replica) and remote primary receiving success/failure from the receiver for the duration of migration . This would be covered in depth in Dual Replication LLD.

Reference : #5033

Primary - Primary Relocation

Doc Rep → Remote

The remote relocation will not complete all the data has been uploaded to remote in sync . The remote upload shouldn’t block writes .

Doc Rep → Doc Rep

Status Quo

Remote → Remote

We should relocate remote primary shard to another remote node just like a remote backed shard . We need to make sure that the doc rep replicas continue to receive all documents from both of the replicas and doesn’t have any holes .

Remote → DocRep

We should relocate remote primary shard to doc node just like a docrep backed shard .

Failover

A primary shard backed by remote store with at least one replica on remote node , the failover would be like a remote shard.

A primary shard backed by remote store with all replicas on docrep , the shard will move back to docrep node.

If the shard is still migrating to remote store and the primary fails, the migration process will restart on a new remote node

Supporting References

#7986

Issues

[] #12245

@gbbafna gbbafna added untriaged RFC Issues requesting major changes labels Feb 8, 2024
@gbbafna gbbafna self-assigned this Feb 8, 2024
@gbbafna gbbafna removed the untriaged label Feb 8, 2024
@rramachand21 rramachand21 added the Storage Issues and PRs relating to data and metadata storage label Feb 16, 2024
@gbbafna gbbafna moved this from 🆕 New to 👀 In review in Storage Project Board Mar 4, 2024
@gbbafna gbbafna changed the title [Draft] [RFC] Existing Cluster Migration to/from Remote Store - HLD [RFC] Existing Cluster Migration to/from Remote Store - HLD Mar 4, 2024
@gbbafna
Copy link
Collaborator Author

gbbafna commented Mar 4, 2024

Requesting feedback from @shwetathareja , @andrross , @mch2 , @ankitkala , @itiyamas .

@mch2
Copy link
Member

mch2 commented Mar 11, 2024

Thanks for getting this going @gbbafna

A few thoughts...

  • What are the API calls a user makes? Restart a node and then set direction is set to remote Does a user do this or do we know already from config before restart?

As we restart each nodes, the new primaries followed by replica copies will start to operate in remote-backed mode.

All of the shards on the restarted node would then move to remote-backed? Wouldn't the replicas need to ensure their correlating primary has moved first? Or are all primaries updated without restarts?

  • In this flow we require node restarts and accept the read throughput impact. Is there a hard requirement for the restart or is this for simplicity? If so what is driving that? If we are able to take replicas out of commission I think at shard level we can we fail & force recovery instead with the new strategy?

If the shard is still migrating to remote store and the primary fails, the migration process will restart on a new remote node

I didn't quite follow this. If we are migrating to remote store, and the DR primary fails, we would need to either select a DR replica (if still exists) or we risk data loss & choose a RS replica? Ah my mistake, replicas would have moved last so this is a non issue.

@gbbafna
Copy link
Collaborator Author

gbbafna commented Mar 18, 2024

Thanks @mch2 for reviewing .

Thanks for getting this going @gbbafna

A few thoughts...

* What are the API calls a user makes? `Restart a node and then set direction is set to remote` Does a user do this or do we know already from config before restart?

User has to do via a cluster settings update call . This has to be done after at least one node restart . Otherwise the direction set to remote will not allocate new shards to doc rep nodes.

As we restart each nodes, the new primaries followed by replica copies will start to operate in remote-backed mode.

All of the shards on the restarted node would then move to remote-backed? Wouldn't the replicas need to ensure their correlating primary has moved first? Or are all primaries updated without restarts?

* In this flow we require node restarts and accept the read throughput impact.  Is there a hard requirement for the restart or is this for simplicity?  If so what is driving that?  If we are able to take replicas out of commission I think at shard level we can we fail & force recovery instead with the new strategy?

There is hard requirement of restart, as we need to update the yml settings for the node.

If the shard is still migrating to remote store and the primary fails, the migration process will restart on a new remote node

I didn't quite follow this. If we are migrating to remote store, and the DR primary fails, we would need to either select a DR replica (if still exists) or we risk data loss & choose a RS replica? Ah my mistake, replicas would have moved last so this is a non issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
RFC Issues requesting major changes Storage Issues and PRs relating to data and metadata storage v2.14.0
Projects
Status: ✅ Done
Development

No branches or pull requests

3 participants