Rewrite disaster recovery based on recreate and status check #1890
Just a few comments from testing this out locally by hand.
Looks good. I just have a few questions and suggestions.
@AnnaSjerling hey! I went through your guide. Wow! Tremendous work! I left some editorial comments. My idea was to tie up the beginning with the rest of the guide.
I have printed the wrong words when describing output from e.g.
…he same as the actual output.
Thanks for the documentation updates. The preview documentation has now been torn down; reopening this PR will republish it.
====

The `system` database contains the view of the cluster.
This includes which servers and databases are present, where they live and how they are configured.
During a disaster, the view of the cluster might need to change to reflect a new reality, such as removing lost servers.
Databases might also need to be recreated to regain write availability.
- Because both of these steps are executed by modifying the `system` database, making the `system` database write available is a vital first step during disaster recovery.
+ Because both of these steps are executed by modifying the `system` database, making the `system` database write-enabled is a vital first step during disaster recovery.
This is in your hands, but I don't think it is a good idea to change from a technically correct wording, which we defined at the beginning of the document, to multiple adjacent ones that are not defined, such as write-enabled, able to accept write operations, and write-capable.
This is done in two different ways:
Use the following steps to remove lost servers and add new ones to the cluster.
To remove lost servers, any allocations they were hosting must be moved to available servers in the cluster.
This can be done in two different ways:
This is not correct: it is not an either-or. Both will be needed in most cases.
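To make the precondition in the quoted text concrete, here is a minimal Python sketch of the idea that a lost server can only be removed once none of its allocations remain on it. This is a toy model under stated assumptions, not Neo4j's API: the `allocations` dictionary and the `move_allocations` and `remove_server` helpers are hypothetical names introduced for illustration. It models only the "move allocations" path; recreating databases, which the comment above says is usually also needed, is not modelled.

```python
def move_allocations(allocations, lost, available):
    """Reassign each database hosted on `lost` to an available server.

    `allocations` maps a server name to the set of databases it hosts
    (a hypothetical in-memory model, not a real cluster view).
    """
    for db in sorted(allocations.get(lost, set())):
        # Only servers that do not already host a copy can receive this one.
        candidates = [s for s in available if db not in allocations[s]]
        if not candidates:
            raise RuntimeError(f"no available server can host {db!r}")
        # Pick the least-loaded candidate; break ties by server name.
        target = min(candidates, key=lambda s: (len(allocations[s]), s))
        allocations[target].add(db)
    allocations[lost] = set()


def remove_server(allocations, server):
    """Remove `server` from the model, refusing while it still hosts anything."""
    if allocations.get(server):
        raise RuntimeError(f"{server!r} still hosts allocations")
    allocations.pop(server, None)
```

In this model, calling `remove_server` before `move_allocations` fails, which mirrors the ordering the guide describes: move (or otherwise resolve) the allocations first, then remove the server.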
This PR rewrites the disaster recovery docs based on the new recreate and status check procedures.

With regard to the functional changes, there are three major things to note:

1. The recovery of the `system` database is the same as before, even though the way we check whether the `system` database is write available has changed.
2. The step that removes lost servers now also includes recreating some databases. It was decided that mixing the recovery of servers and databases is necessary to get recreate into the guide in the best way we could see.
3. Previously, it was not clear how to handle quarantined databases, even though the guide mentioned them in the intro. Now, how they are supposed to be handled during disaster recovery is discussed more explicitly.

In the past, we have also gotten feedback that the disaster recovery docs are hard to follow. Therefore, this PR also includes refactoring that introduces a new guide structure. This new structure provides more explanation of why we are asking the user to do a certain thing.

---------

Co-authored-by: NataliaIvakina <[email protected]>
Co-authored-by: Anna Sjerling <[email protected]>