title | revision | update_date |
---|---|---|
Gorrion Disaster Recovery Guidelines | 0.0.1 | 2024-06-13 |

To determine whether a project needs a Disaster Recovery Plan (DRP), consider the following:
- Business Impact Analysis (BIA)
- Revenue impact - does the project directly generate revenue or support revenue-generating services?
- Operational impact - is the project critical to day-to-day operations and business processes?
- Reputation impact - would a disruption of the project damage the client's reputation?
- Compliance - is the project subject to regulations such as GDPR or HIPAA, which mandate data protection and disaster recovery?
- Supply chain - is the project part of a larger system or supply chain, where disruption could affect multiple components or services?
- State of the project - is the project live in production, and how many users does it have?
Whether, how, and when Gorrion creates a DRP for a client ("Client") should be part of the agreement between the Client and Gorrion.
The first step of a DRP should be documentation of the project. The minimum acceptable coverage is maintenance documentation and an architecture diagram. Consider recording architectural decisions as Architecture Decision Records (ADRs).
- Identify critical systems and components.
- Determine the potential impact of outages on business operations.
- Evaluate risks and vulnerabilities in the existing infrastructure.
- Recovery Time Objective (RTO) - maximum acceptable downtime.
- Recovery Point Objective (RPO) - maximum period during which data loss is tolerable.
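
The two objectives above can be checked against the timings recorded during an incident or a drill. The following is a minimal sketch, not a prescribed tool: `RecoveryObjectives`, `evaluate_drill`, and the 4-hour/1-hour targets are hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class RecoveryObjectives:
    """Targets agreed with the Client; the values used below are illustrative."""

    rto: timedelta  # maximum acceptable downtime
    rpo: timedelta  # maximum tolerable data-loss window


def data_loss_window(last_good_backup: datetime, failure_at: datetime) -> timedelta:
    """Everything written after the last restorable backup is lost on restore."""
    return failure_at - last_good_backup


def downtime(failure_at: datetime, restored_at: datetime) -> timedelta:
    """Time between the failure and the service being usable again."""
    return restored_at - failure_at


def evaluate_drill(
    objectives: RecoveryObjectives,
    last_good_backup: datetime,
    failure_at: datetime,
    restored_at: datetime,
) -> dict:
    """Compare measured data loss and downtime with the agreed objectives."""
    return {
        "rpo_met": data_loss_window(last_good_backup, failure_at) <= objectives.rpo,
        "rto_met": downtime(failure_at, restored_at) <= objectives.rto,
    }


if __name__ == "__main__":
    # Hypothetical targets: 4 hours of downtime, 1 hour of data loss.
    objectives = RecoveryObjectives(rto=timedelta(hours=4), rpo=timedelta(hours=1))
    result = evaluate_drill(
        objectives,
        last_good_backup=datetime(2024, 6, 13, 2, 0),
        failure_at=datetime(2024, 6, 13, 2, 45),
        restored_at=datetime(2024, 6, 13, 5, 30),
    )
    print(result)  # {'rpo_met': True, 'rto_met': True}
```

In this example the last backup is 45 minutes old at the time of failure and the service is restored after 2 hours 45 minutes, so both targets are met.
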
- Roles and Responsibilities - assign clear roles within the project team for handling disaster recovery.
- Team Members - include developers, DevOps (or solution architects, or internal Gorrion consultants), project managers, and key stakeholders.
- AWS Backup - create AWS Backup plans and document them.
- Restore procedures - create restore procedures, document them and test regularly.
- Backup testing - regularly test backup integrity.
- Redundancy - ensure backups are redundant, stored off-site, and encrypted.
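
As an illustration of the backup points above, the sketch below builds a plan definition in the shape accepted by the AWS Backup `CreateBackupPlan` API. It only constructs the dictionary and does not call AWS. The helper name `build_backup_plan`, the vault names, the account ID, the schedule, and the retention period are all assumptions made for the example.

```python
from typing import Optional


def build_backup_plan(
    name: str,
    vault_name: str,
    schedule: str,
    retention_days: int,
    dr_vault_arn: Optional[str] = None,
) -> dict:
    """Return a BackupPlan document with one rule and an optional off-site copy."""
    rule = {
        "RuleName": f"{name}-rule",
        "TargetBackupVaultName": vault_name,
        "ScheduleExpression": schedule,  # AWS cron syntax, evaluated in UTC
        "StartWindowMinutes": 60,
        "CompletionWindowMinutes": 180,
        "Lifecycle": {"DeleteAfterDays": retention_days},
    }
    if dr_vault_arn:
        # Copy each recovery point to a vault in another Region (off-site redundancy).
        rule["CopyActions"] = [
            {
                "DestinationBackupVaultArn": dr_vault_arn,
                "Lifecycle": {"DeleteAfterDays": retention_days},
            }
        ]
    return {"BackupPlanName": name, "Rules": [rule]}


if __name__ == "__main__":
    plan = build_backup_plan(
        name="project-daily",  # hypothetical plan name
        vault_name="project-vault",  # hypothetical vault in the primary Region
        schedule="cron(0 3 * * ? *)",  # every day at 03:00 UTC
        retention_days=35,
        # Placeholder ARN for a vault in a second Region.
        dr_vault_arn="arn:aws:backup:eu-west-1:123456789012:backup-vault:project-dr-vault",
    )
    print(plan["Rules"][0]["RuleName"])  # project-daily-rule
```

With boto3 installed and credentials configured, the resulting dictionary could be passed as `boto3.client("backup").create_backup_plan(BackupPlan=plan)`. Assigning resources to the plan is a separate step (`create_backup_selection`). Keeping the plan as data in the repository also serves as part of its documentation.
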
- Use Infrastructure as Code (IaC) - the project's infrastructure should be defined as code.
- Documentation - the process of bringing up, tearing down, and updating the infrastructure should be documented.
- Redundant components - design systems with redundancy by using AWS services such as Amazon EC2 Auto Scaling, Elastic Load Balancing (ELB), and Multi-AZ deployments for databases.
- Load balancing - distribute traffic evenly across healthy instances.
- Stateless architecture - implement stateless architecture where possible.
- Secondary Site - set up a secondary DR site in a different AWS Region if required.
- Data Replication - use Amazon RDS cross-Region read replicas or AWS DMS for database replication to the DR site. RDS Multi-AZ on its own only protects against failures within a single Region.
- Monitoring Tools - implement Amazon CloudWatch and custom logging solutions.
- Alert Thresholds - set thresholds for key metrics and ensure alerts are properly configured to notify the team.
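
To make the alert-threshold point concrete, the sketch below assembles the parameters for a CloudWatch alarm on RDS CPU utilisation, in the shape accepted by the `PutMetricAlarm` API. It does not call AWS. The helper name `build_cpu_alarm`, the 80% threshold, the evaluation window, and the SNS topic ARN are assumptions for illustration, and real thresholds should come from the project's own baseline.

```python
def build_cpu_alarm(
    db_instance_id: str,
    sns_topic_arn: str,
    threshold_percent: float = 80.0,
) -> dict:
    """Return PutMetricAlarm parameters for sustained high CPU on an RDS instance."""
    return {
        "AlarmName": f"{db_instance_id}-high-cpu",
        "AlarmDescription": f"CPU above {threshold_percent}% for 15 minutes",
        "Namespace": "AWS/RDS",
        "MetricName": "CPUUtilization",
        "Dimensions": [{"Name": "DBInstanceIdentifier", "Value": db_instance_id}],
        "Statistic": "Average",
        "Period": 300,  # seconds per datapoint
        "EvaluationPeriods": 3,  # 3 x 5 minutes must breach before alarming
        "Threshold": threshold_percent,
        "ComparisonOperator": "GreaterThanThreshold",
        # Treat missing datapoints as a breach: a silent database is also a problem.
        "TreatMissingData": "breaching",
        "AlarmActions": [sns_topic_arn],  # the topic that notifies the team
    }


if __name__ == "__main__":
    alarm = build_cpu_alarm(
        db_instance_id="project-db",  # hypothetical instance identifier
        sns_topic_arn="arn:aws:sns:eu-central-1:123456789012:project-alerts",  # placeholder
    )
    print(alarm["AlarmName"])  # project-db-high-cpu
```

With boto3 available, these parameters could be applied with `boto3.client("cloudwatch").put_metric_alarm(**alarm)`.
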
- Simulated Drills - conduct scheduled disaster recovery drills to validate the recovery process.
- Documentation Update - after each drill, update the documentation based on the findings to improve recovery strategies.
- Disaster Recovery Plan - detail the step-by-step recovery procedures specific to this project.
- Contact Information - maintain up-to-date contact info for the recovery team and stakeholders.
- Data Encryption - ensure all data is encrypted both in transit (for example, with TLS) and at rest (for example, with keys managed in AWS KMS).
- Access Controls - restrict access to critical systems and data according to the principle of least privilege.
- Internal Communication - establish protocols for internal team communication during a disaster.
- External Updates - prepare templates for notifying external stakeholders about the status and recovery progress.
- Tools - use AWS Database Migration Service (DMS) or other tools for real-time data synchronisation.
- Consistency Checks - ensure transactional consistency between primary and DR sites.
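
One simple way to approach the consistency check above is to compare per-table row counts and content fingerprints between the primary and the DR copy. The sketch below is an assumption-laden illustration: it works on rows already fetched into memory, the function names are hypothetical, and a real check would stream rows from both databases at a consistent point in time.

```python
import hashlib
from typing import Dict, Iterable, List, Sequence, Tuple


def table_fingerprint(rows: Iterable[Sequence]) -> Tuple[int, str]:
    """Return (row count, order-independent SHA-256 digest) for a table's rows."""
    row_digests = sorted(
        hashlib.sha256(repr(tuple(row)).encode("utf-8")).hexdigest() for row in rows
    )
    combined = hashlib.sha256("".join(row_digests).encode("utf-8")).hexdigest()
    return len(row_digests), combined


def find_mismatches(
    primary: Dict[str, List[Sequence]],
    replica: Dict[str, List[Sequence]],
) -> List[str]:
    """List tables that are missing on one side or whose contents differ."""
    mismatched = []
    for table in sorted(primary.keys() | replica.keys()):
        if table not in primary or table not in replica:
            mismatched.append(table)
        elif table_fingerprint(primary[table]) != table_fingerprint(replica[table]):
            mismatched.append(table)
    return mismatched


if __name__ == "__main__":
    # Hypothetical snapshots of two tables on each site.
    primary = {"orders": [(1, "a"), (2, "b")], "users": [(1, "x")]}
    replica = {"orders": [(2, "b"), (1, "a")], "users": [(1, "y")]}
    print(find_mismatches(primary, replica))  # ['users']
```

Row order is ignored on purpose, since replicas may return rows in a different order. Only the `users` table is reported here because its content differs.
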
- Service SLAs - review and document Service Level Agreements (SLAs) with critical third-party service providers.
- Fallback Plans - prepare backup plans for third-party services that are critical to the project.
- Budgeting - monitor and optimise disaster recovery-related expenses using AWS Cost Explorer.
- Cost-effective Measures - implement affordable solutions that do not compromise recovery objectives.
- Incident Analysis - after any disaster, perform a detailed review of the incident response.
- Plan Update - update the disaster recovery plan based on lessons learned and new insights gained.