- Disaster Recovery (DR) is about preparing for and recovering from a disaster
- Disaster: events that have negative impact on business continuity or finances
- Recovery types to cloud:
- On-premise => On-premise: Traditional DR, very expensive
- On-premise => Cloud: Hybrid recovery
- Cloud => Cloud: e.g. AWS Cloud Region A => AWS Cloud Region B
- 📝Terminology
- RPO: Recovery Point Objective
- The amount of data loss you're willing to accept
- Based on:
- How back in time you can recover
- How often you take back-up
- RTO: Recovery Time Objective
- Amount of downtime you're willing to accept
- RPO: Recovery Point Objective
- Disaster Recovery Strategies
- From business continuity to uninterrupted business:
- Backup and Restore -> Pilot Light -> Warm / Hot Standby -> Hot Site / Multi Site
- Backup and Restore
- Pilot Light
- Critical core element(s) of the app is always running in the cloud
- Faster than Backup and Restore as critical systems are already up.
- Example of active/passive failover configuration.
- From networking, you have two options: Elastic IP addresses or Elastic Load Balancing (ELB)
- DNS records to point to EC2 or point to LB using CNAME.
- Changed data in DR site after failover must be reversed back to primary site after failback.
- Critical core element(s) of the app is always running in the cloud
- Warm / Hot Standby
- Hot Site / Multi Site
- Full Production Scale is running AWS and on-premise
- Example of active-active failover configuration.
- Changed data in DR site after failover must be reversed back to primary site after failback.
- 💡 In Route 53: You configure active-active failover using any routing policy (or combination of routing policies) other than failover.
- From business continuity to uninterrupted business:
- 💡 Disaster Recovery Tips
- Backup
- EBS Snapshots, RDS automated backups / Snapshots, etc...
- Regular pushes to S3 / S3 IA / Glacier, Lifecycle Policy, Cross Region Replication
- From On-premise: Snowball or Storage Gateway
- High Availability
- Use Route 53 to migrate DNS over from Region to Region
- RDS Multi-AZ, ElastiCache Multi-AZ, EFS, S3
- Site to Site VPN As a recovery from Direct Connect
- Replication
- RDS Replication (Cross Region), AWS Aurora + Global Databases
- Database replication from on-premise to RDS
- Storage Gateway
- Automation
- CloudFormation / Elastic Beanstalk to re-create a whole new environment
- Recover / Reboot EC2 instances with CloudWatch if alarms fail
- AWS Lambda functions for customized automations
- 🤗 Chaos
- Netflix has a "simian-army" randomly terminating EC2.
- Chaos monkeys terminates instances in production.
- Netflix has a "simian-army" randomly terminating EC2.
- Backup