So your company has developed a brand new system and is ready to roll it out. You might even be leveraging the latest Platform-as-a-Service (PaaS) offerings in the public cloud because you want to offload as many operational concerns as possible to your cloud provider. But have you actually considered how to recover from a major outage that could affect the system? What would happen if there was a major issue affecting the services that you depend on, especially an issue that could result in data loss? What if your operations team or users do something catastrophic in the system and need an “undo”? As an architect, you should have a plan in place to recover from a loss of system infrastructure, services, or data that would affect your business or the customers you serve. In this post, I’ll discuss the topic of Disaster Recovery Planning.
Disaster Recovery Planning
Disaster Recovery (DR) planning prepares your team for when disaster strikes. Notice I said “when” and not “if”. Cloud providers continue to make their platforms more reliable, but things can and will go wrong. That includes manmade disasters as well as natural ones. All major public cloud providers have experienced widespread outages. The Channel Co has captured examples from 2019 and I’d recommend checking out the details to get a sense of the impacts. Besides issues caused by your cloud provider, application bugs and user errors can also impact your application’s data real estate.
(Image credit: Evolve IP)
Assemble the Plan
Everything starts with the “Plan”. As an architect, this is one of your most valuable contributions – helping to coordinate and lead the effort on developing a DR plan. Be prepared to work across different parts of your business to develop the plan. The plan will have several key questions that it must address.
Depending on your organization, you will likely need to work broadly across a diverse set of groups to build a comprehensive plan. Here are a few suggestions on where to begin collaborating on your plan.
Business Partner – Understanding the impacts of loss of service and data is critical from a business point of view. Work with your stakeholders in the business to define the time requirements for recovery. These include the RPO (Recovery Point Objective) and RTO (Recovery Time Objective). RPO is the maximum amount of acceptable data loss, expressed in terms of time; e.g., no more than 15 minutes of lost transactions. RTO is the maximum acceptable amount of time for services and data to be recovered; e.g., services should be back to normal operations within 2 hours.
Development Team – You likely worked with the development team on the design for your new system (but not always). The developers should have detailed information on the infrastructure and data real estate for your system. You’ll want to work closely with them to design a recovery process that covers all elements of the system including both infrastructure and data.
DevOps – You may have dedicated team members that focus on developing automation for managing infrastructure, deploying applications, creating databases, etc. These team members will be key to developing automated recovery procedures.
Operations Team – The care and feeding of your system will require supervision by team members that monitor its health and runtime status. This group will likely be your first line of defense and a key user of your DR plan.
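To make the RPO and RTO definitions concrete, here is a minimal Python sketch that checks a single incident against both objectives. The 15-minute and 2-hour values are the examples from above; the class name, function names, and timestamps are hypothetical and only there for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class RecoveryObjectives:
    rpo: timedelta  # maximum acceptable window of lost data
    rto: timedelta  # maximum acceptable time to restore service


def data_loss_window(last_good_backup: datetime, failure: datetime) -> timedelta:
    """Anything written after the last recoverable backup is lost."""
    return failure - last_good_backup


def downtime(failure: datetime, restored: datetime) -> timedelta:
    """Time between the failure and the return to normal operations."""
    return restored - failure


def meets_objectives(objectives, last_good_backup, failure, restored) -> dict:
    return {
        "rpo_met": data_loss_window(last_good_backup, failure) <= objectives.rpo,
        "rto_met": downtime(failure, restored) <= objectives.rto,
    }


# Example values from the text: 15 minutes of data loss, 2 hours to recover.
objectives = RecoveryObjectives(rpo=timedelta(minutes=15), rto=timedelta(hours=2))

# Hypothetical incident: backup at 09:50, failure at 10:00, restored at 12:30.
result = meets_objectives(
    objectives,
    last_good_backup=datetime(2024, 1, 1, 9, 50),
    failure=datetime(2024, 1, 1, 10, 0),
    restored=datetime(2024, 1, 1, 12, 30),
)
print(result)  # {'rpo_met': True, 'rto_met': False}
```

In this made-up incident only 10 minutes of data were lost, so the RPO holds, but the service was down for 2.5 hours, so the RTO was missed. The two objectives are independent, and your business partner may weigh them differently.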
Key Questions to Answer
Your plan should be detailed and should answer key questions.
- What? What systems and subsystems does your plan cover? What are the key components that require recovery? Don’t just assume it’s data. Your plan should also include how to recover lost infrastructure and services that may have to be rebuilt.
- Who? Who will be contacted in case of an emergency? Who will be responsible for taking action when a problem strikes? Who will be responsible for reporting status updates while the problem is ongoing? Who should you escalate to when important decisions need to be made?
- When? You should know when it’s time to execute your plan. Your plan should have clearly defined trigger points. Work with your business partner to identify acceptable downtime requirements. When one or more triggers have occurred, your plan should kick into action.
- Where? Depending on your system architecture, you may have more than one data persistence technology in play (database, caches, blobs, message stores, etc.). Your plan should outline where each of these is backed up as well as its retention needs.
- How? The plan needs to spell out the steps to recovery in detail. These can vary depending on the severity of the loss. The loss of service or data can be complete (perhaps due to a data center or regional outage), which will require a comprehensive set of steps to recover your entire system, including cloud infrastructure as well as data. Or it can be limited to parts of your data real estate, such as a corrupted database or deleted files. Ideally, your DR plan will allow you to execute recovery for specific sets of data and/or infrastructure depending on the scale of your disaster and the impact it has on your application.
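One way to keep these five questions honest is to capture the plan as structured data and check it for gaps. The sketch below is only an illustration under assumed names: the `DrPlan` and `DataStore` classes, their fields, and the sample values are all hypothetical, and a real plan will carry far more detail than this.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class DataStore:
    name: str
    backup_location: str
    retention_days: int


@dataclass
class DrPlan:
    systems: list = field(default_factory=list)       # What: systems covered
    contacts: dict = field(default_factory=dict)      # Who: role -> person
    max_downtime_minutes: Optional[int] = None        # When: trigger point
    data_stores: list = field(default_factory=list)   # Where: backups, retention
    runbooks: dict = field(default_factory=dict)      # How: scenario -> steps

    def unanswered(self) -> list:
        """Return the key questions the plan does not yet answer."""
        checks = {
            "What": bool(self.systems),
            "Who": bool(self.contacts),
            "When": self.max_downtime_minutes is not None,
            "Where": bool(self.data_stores),
            "How": bool(self.runbooks),
        }
        return [question for question, answered in checks.items() if not answered]

    def should_trigger(self, minutes_down: int) -> bool:
        """True once downtime reaches the agreed trigger point."""
        if self.max_downtime_minutes is None:
            return False
        return minutes_down >= self.max_downtime_minutes


# Hypothetical plan that has no recovery steps written down yet.
plan = DrPlan(
    systems=["orders-api"],
    contacts={"incident_lead": "on-call architect"},
    max_downtime_minutes=10,
    data_stores=[DataStore("orders-db", "secondary-region backup vault", 30)],
)
gaps = plan.unanswered()
print(gaps)                     # ['How']
print(plan.should_trigger(12))  # True
```

Keeping the plan in a machine-readable form like this is a design choice, not a requirement. The point is that a missing answer becomes visible before an outage rather than during one.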
Test Your Plan
A complete disaster recovery plan must include details on how you will test the recovery of your system. Doing dry runs of the process will help to ensure that the people involved in the recovery process know their roles and responsibilities. It will also give you the opportunity to validate the automation that you build to recover infrastructure and data. Testing is key to determining whether you can meet your RPO and RTO goals. And lastly, test early and test often. As your plan comes together, you should exercise it to find any gaps or weaknesses that need to be addressed. Also keep in mind that over time you may need to adjust your plan based on changes in requirements or changes in system design.
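A dry run is easier to learn from if each recovery step is timed. The following sketch assumes a drill can be expressed as an ordered list of callables; the `run_drill` function and the step functions are hypothetical placeholders (the `time.sleep` calls stand in for real recovery work), not a real recovery tool.

```python
import time


def run_drill(steps, rto_seconds):
    """Run each (name, action) step in order and time it."""
    timings = {}
    start = time.monotonic()
    for name, action in steps:
        step_start = time.monotonic()
        action()
        timings[name] = time.monotonic() - step_start
    total = time.monotonic() - start
    return {
        "timings": timings,
        "total_seconds": total,
        "within_rto": total <= rto_seconds,
    }


# Placeholder steps; real ones would call your recovery automation.
def rebuild_infrastructure():
    time.sleep(0.01)


def restore_database():
    time.sleep(0.01)


def verify_application():
    time.sleep(0.01)


report = run_drill(
    [
        ("rebuild infrastructure", rebuild_infrastructure),
        ("restore database", restore_database),
        ("verify application", verify_application),
    ],
    rto_seconds=2 * 60 * 60,  # the 2-hour RTO example from earlier in the post
)
for name, seconds in report["timings"].items():
    print(f"{name}: {seconds:.2f}s")
print("within RTO:", report["within_rto"])
```

Per-step timings show where the recovery time actually goes, which tells you which steps are worth automating or redesigning first.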
Creating a disaster recovery plan for your business will help protect you and your customers when system failure occurs. Successful DR planning requires you to be collaborative and thorough. While it’s best to incorporate DR planning into your early design process, don’t skip it altogether if you missed that part of your design early on. Even if the system is already in production and serving users, it’s never too late to plan for disaster recovery and prepare for handling the worst.