Rewrite disaster recovery based on recreate and status check #1890


AnnaSjerling
Collaborator

@AnnaSjerling AnnaSjerling commented Oct 21, 2024

This PR rewrites the disaster recovery docs based on the new recreate and status check procedures.

With regard to the functional changes, there are three major things to note.

  1. The recovery of the system database is the same as before, even though the way we check whether the system db is write available has changed.
  2. The step which removes lost servers now also includes recreating some databases. We decided that mixing the recovery of servers and databases is necessary to fit recreate into the guide in the best way we could see.
  3. Previously it was not clear how to handle quarantined databases, even though the guide mentioned them in the intro. Now, the guide discusses more explicitly how they are supposed to be handled during disaster recovery.

In the past, we have also gotten feedback that the disaster recovery docs are hard to follow. Therefore, this PR also includes refactoring which introduces a new guide structure. This new structure provides more explanations of why we are asking the user to do a certain thing.

@AnnaSjerling AnnaSjerling changed the title from "Rewrite disaster recovery based on recreate 2" to "Rewrite disaster recovery based on recreate and status check" Oct 21, 2024
Contributor

@tselmegbaasan tselmegbaasan left a comment


Just a few comments from testing this out locally by hand.

modules/ROOT/pages/clustering/disaster-recovery.adoc Outdated Show resolved Hide resolved
modules/ROOT/pages/clustering/disaster-recovery.adoc Outdated Show resolved Hide resolved
@AnnaSjerling AnnaSjerling force-pushed the rewrite-disaster-recovery-based-on-recreate-2 branch from b74c7f3 to 1d27305 Compare November 28, 2024 14:34
@AnnaSjerling AnnaSjerling force-pushed the rewrite-disaster-recovery-based-on-recreate-2 branch from 1d27305 to 3140906 Compare November 28, 2024 14:42
Contributor

@tselmegbaasan tselmegbaasan left a comment


Looks good. I just have a few questions and suggestions.

Contributor

@NataliaIvakina NataliaIvakina left a comment


@AnnaSjerling hey! I went through your guide. Wow! Tremendous work! I left some editorial comments. My idea was to tie up the beginning with the rest of the guide.

modules/ROOT/pages/clustering/disaster-recovery.adoc Outdated Show resolved Hide resolved
modules/ROOT/pages/clustering/disaster-recovery.adoc Outdated Show resolved Hide resolved
modules/ROOT/pages/clustering/disaster-recovery.adoc Outdated Show resolved Hide resolved
modules/ROOT/pages/clustering/disaster-recovery.adoc Outdated Show resolved Hide resolved
modules/ROOT/pages/clustering/disaster-recovery.adoc Outdated Show resolved Hide resolved
modules/ROOT/pages/clustering/disaster-recovery.adoc Outdated Show resolved Hide resolved
modules/ROOT/pages/clustering/disaster-recovery.adoc Outdated Show resolved Hide resolved
modules/ROOT/pages/clustering/disaster-recovery.adoc Outdated Show resolved Hide resolved
modules/ROOT/pages/clustering/disaster-recovery.adoc Outdated Show resolved Hide resolved
modules/ROOT/pages/clustering/disaster-recovery.adoc Outdated Show resolved Hide resolved
@AnnaSjerling
Collaborator Author

I had used the wrong words when describing output from, e.g., SHOW DATABASES, so I fixed the words and their capitalisation to match the actual output.

@neo-technology-commit-status-publisher
Collaborator

Thanks for the documentation updates.

The preview documentation has now been torn down - reopening this PR will republish it.

@NataliaIvakina NataliaIvakina merged commit a2f28f2 into neo4j:dev Dec 20, 2024
8 checks passed
====

The `system` database contains the view of the cluster.
This includes which servers and databases are present, where they live and how they are configured.
During a disaster, the view of the cluster might need to change to reflect a new reality, such as removing lost servers.
Databases might also need to be recreated to regain write availability.
- Because both of these steps are executed by modifying the `system` database, making the `system` database write available is a vital first step during disaster recovery.
+ Because both of these steps are executed by modifying the `system` database, making the `system` database write-enabled is a vital first step during disaster recovery.
Collaborator Author


This is in your hands, but I don't think it is nice to change from a technically correct term, which we have defined at the beginning of the document, to multiple adjacent ones which are not defined, like write-enabled, able to accept write operations, and write-capable.

This is done in two different ways:
Use the following steps to remove lost servers and add new ones to the cluster.
To remove lost servers, any allocations they were hosting must be moved to available servers in the cluster.
This can be done in two different ways:
Collaborator Author


This is not correct: it is not an either/or, since both will be needed in most cases.

NataliaIvakina added a commit to NataliaIvakina/docs-operations that referenced this pull request Dec 20, 2024
)


---------

Co-authored-by: NataliaIvakina <[email protected]>
NataliaIvakina added a commit that referenced this pull request Dec 20, 2024
…2027)


---------

Co-authored-by: Anna Sjerling <[email protected]>
4 participants