Disaster recovery testing is the process of verifying that an organization can restore data and applications and continue operations after a service interruption, a critical IT failure, or a complete disruption. An MSP needs to document this process and review it periodically with its clients, so that you know how to rescue a client in the event of any failure.
It’s important to ensure that you and your customers are on the same page, not only to manage expectations, but to make sure that you can point to a list of requirements that you have fulfilled if something (or everything) goes wrong.
This should include a plan for regular testing of disaster recovery scenarios, again so you can demonstrate that you’ve done your due diligence, and so you can find and eliminate potential problems before they become real ones.
Several variables affect how much testing you’ll need to do, what you’ll charge, and what your clients will expect. These include the size of the company you’re supporting, its budget for a DR solution, the complexity of its data structures and networks (whether everything is contained on your network, some of it sits on the client’s internal network, or more of it is held at additional service providers), and the amount of data to be backed up.
Disaster Recovery Testing Scenarios
There are many potential disasters, but we can categorize them into several major groups:
Hardware and infrastructure failures range from server meltdowns and storage failures to communications breakdowns and power failures.
Human error is probably the most common type of disaster: a user accidentally deletes anything from a single file to a whole database, or an update applied to a database erases data or crashes the database server. These might seem like backup issues rather than DR issues, but if recovery means migrating servers from one cloud to another or to an offsite server, they become DR scenarios.
Among natural disasters, one thinks first of flooding or hurricanes, but wildfires, earthquakes, tsunamis, landslides, and even cicadas bursting out from beneath the basement have all caused data loss or system unavailability. The end results might include extended power loss, destruction of the data center, evacuation of personnel (or personnel unable to get to work), loss of network connectivity, or even destruction of a large area, including branch offices and power, phone, and other utilities.
Loss of key staff
You should know how to get the network passwords in case your admin is hit by a bus, but also know who has the password to your cryptocurrency wallet, or the password to make changes to your network connection, or to order supplies.
Malware is a growing category. It has evolved from viruses written by amateur hackers showing how clever they were, to financially motivated worms, Trojans, and ransomware engineered by sophisticated professional hackers, to data-stealing malware that goes beyond even expert programmers and is run by nation-states.
These threats are not only pervasive and persistent, but constantly evolving. Retrofitting a building to protect against earthquakes might only need to be done once, but if you don’t keep your malware protection and DR infrastructure updated, you’ll be at risk within days.
Other unexpected events
This need not be an alien invasion. It could be as simple as a distracted driver taking a shortcut through your lobby (or your server room).
Further reading: Real-Life Disaster Recovery Scenario from an MSP
Methods for Disaster Recovery Testing
This isn’t as simple as picking one of the methods below. You might need to use all of them. Some cover ensuring that business practices align with the disaster recovery plan, some cover ongoing changes to your systems (or your customer’s systems), and some cover testing the hardware and software by simulating a disaster and restoring a file or system or data center to full functionality.
All of these plans should be reviewed and tests should be ongoing. This doesn’t necessarily mean running through a full plan once a month – you might run through some part of each plan on a weekly basis, a bigger part once a month, and a full test once a year. The important part is to test regularly, and ensure that any additions to the business are reflected in the DR plan.
A walkthrough is a step-by-step review of the plan with the client: you read through the plan together to ensure that everyone is aware of all the steps and that nothing has been overlooked or added since the last review.
A tabletop exercise is a ‘what if’ scenario. Lay out a specific kind of disaster, and ask each team member what they would do. A representative of every department should attend, and knowledge of business processes is critical. This may reveal gaps in the plan, which can be addressed before they cause a DR failure.
A parallel test restores a system that hasn’t actually broken down to an alternate location. The real system continues to run and there’s no interruption to business services. This is safe, and not only tests the functionality of backup and restore systems, but can reveal potential problems. An inexpensive way to do this is to run the restore in a virtual machine in the cloud, rather than having to dedicate a physical server somewhere.
For instance, if a new version of a server is spun up in the cloud, and it’s not exactly the same software version and OS version as the operational system, the restored system might malfunction, or a user or service on the restored system might not have the proper credentials, and cause problems. These can all be revealed by attempting to restore the production system somewhere else.
However, a parallel test is not a full test. It can test backup and restore functionality and help with ironing out permissions and other issues, but the restored system is not actually put into place with users accessing it. As a result, other issues, such as ensuring that domain name service (DNS) entries are redirected to the proper place, aren’t tested. Without production loads, it also won’t be clear whether the new system has the necessary capacity to run the applications.
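One of the problems a parallel test can surface, a restored system whose software or OS versions differ from production, lends itself to a simple automated comparison. The sketch below assumes you can already collect a component-to-version inventory from each system; the function name `find_mismatches` and the sample component names are hypothetical.

```python
def find_mismatches(
    production: dict[str, str], restored: dict[str, str]
) -> dict[str, tuple]:
    """Map each differing component to (production_version, restored_version)."""
    mismatches = {}
    for component in sorted(set(production) | set(restored)):
        prod_version = production.get(component)  # None means "not installed"
        restored_version = restored.get(component)
        if prod_version != restored_version:
            mismatches[component] = (prod_version, restored_version)
    return mismatches


# Example inventories (made-up values for illustration only).
production = {"os": "Ubuntu 22.04", "postgres": "14.9"}
restored = {"os": "Ubuntu 22.04", "postgres": "15.2", "nginx": "1.24"}
drift = find_mismatches(production, restored)
```

An empty result does not prove the restore is good, but a non-empty one is a cheap early warning before anyone tries to run applications on the restored system.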
Live, or “full interruption” testing
This actually takes down the main system and attempts to recover it. It’s a more thorough test, but if the recovery attempts fail, it can cause serious and expensive downtime, and in some cases it may not be possible due to public safety or regulatory concerns. An alternative is to migrate the main system to an alternate location, perhaps from a virtual machine (VM) on the main server to an alternate VM on another server. This still has the potential to cause disruptions, but if the restore fails or has connectivity or other problems, migrating back to the original server would normally be faster than bringing up the original server from scratch.
A third alternative is to do a restore to an alternate server or VM, without bringing down the main server, then change network addresses or DNS entries to move traffic to the alternate server, leaving the main server online, but with no traffic.
This can even be carried a step further, by using a load balancer to spread traffic across the main server and the alternate, with either one dropping out of the pair if necessary, or after the test. This can be carried out without service interruption, but the load balancer capability will add cost and complexity to the system as a whole.
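The load-balanced arrangement can be illustrated with a toy model. This is a sketch of the idea only, not a real load balancer: the class name `PairBalancer` and the backend labels are hypothetical, and a production setup would rely on health checks in an actual load-balancing product or service.

```python
class PairBalancer:
    """Round-robin across backends, skipping any that are marked down."""

    def __init__(self, backends):
        self.backends = list(backends)
        self.healthy = set(self.backends)
        self._index = 0

    def mark_down(self, backend):
        self.healthy.discard(backend)

    def mark_up(self, backend):
        if backend in self.backends:
            self.healthy.add(backend)

    def route(self):
        # Try each backend at most once per request, starting where we left off.
        for _ in range(len(self.backends)):
            candidate = self.backends[self._index % len(self.backends)]
            self._index += 1
            if candidate in self.healthy:
                return candidate
        raise RuntimeError("no healthy backend available")
```

With both members healthy, traffic alternates between them. Calling `mark_down("main")` simulates the main server dropping out of the pair, after which every request goes to the alternate until `mark_up("main")` is called.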
Disaster Recovery Testing Best Practices
Best practices are influenced by budget. It’s possible to put a clustered, multi-node data system in place that can recover from one service, one server, or even a whole data center in one location going down. The issue is cost.
Migrating a service from one server to another is easy and cheap. Migrating servers is more costly, and migrating whole data centers is much more expensive. It’s a question of what you (or the client) are willing to spend, and you’ll have to find a balance between cost and availability. This balance may not be the same for all lines of business or departments: archived accounting records don’t necessarily need to be available within a second of a failure, while the web site, e-commerce system or production database may need to be available 24x7x365.
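Once you and the client have agreed how long each system may be unavailable, the results of a restore test can be checked against those limits. The sketch below is illustrative: the service names and time limits in `TARGETS` are invented placeholders, not recommendations, and `missed_targets` is a hypothetical helper.

```python
from datetime import timedelta

# Hypothetical per-service limits agreed with the client (placeholder values).
TARGETS = {
    "ecommerce": timedelta(minutes=5),
    "production_db": timedelta(minutes=15),
    "accounting_archive": timedelta(days=2),
}


def missed_targets(measured: dict[str, timedelta], targets=TARGETS) -> list[str]:
    """Return services that restored too slowly or were not measured at all."""
    missed = []
    for service, limit in targets.items():
        restore_time = measured.get(service)
        # An untested service is treated as a miss: its recovery time is unknown.
        if restore_time is None or restore_time > limit:
            missed.append(service)
    return sorted(missed)
```

Treating an unmeasured service as a miss is a deliberate design choice here: it keeps lower-priority systems from silently dropping out of the test cycle.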
Perform testing frequently. Create a schedule for testing
This is critical to maintaining service in the event of a disaster. Many, many organizations have only found out that a system wasn’t functioning properly after a disaster took their systems down and they weren’t able to restore them. The only way to find these kinds of problems and fix them before they bankrupt the business is to test, regularly, and thoroughly.
Thoroughly document your test
Documentation is your friend. There is often resistance to documenting business practices as well as DR testing. However, these records will not only help you find gaps in protection at the next review, but also document your efforts to keep things running, which is essential if there’s an actual problem and everyone is pointing fingers at someone else.
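Test records are easier to review later if every test is logged in the same structured form. The following is a minimal sketch using only the Python standard library; the `DrTestRecord` fields are assumptions about what might be worth capturing, not a prescribed format.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class DrTestRecord:
    # Hypothetical fields; adapt them to what you and the client agree to track.
    test_date: str  # ISO date, e.g. "2024-06-01"
    method: str     # e.g. "parallel" or "full interruption"
    scope: str      # which system or client environment was tested
    passed: bool
    issues: list[str] = field(default_factory=list)


def to_log_line(record: DrTestRecord) -> str:
    """Serialize a record as one JSON line, suitable for appending to a log file."""
    return json.dumps(asdict(record), sort_keys=True)
```

Appending one line per test to a log kept outside the systems being protected means the record survives even when the environment it describes does not.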
Test both your DR solution and your people
The tests should cover not only the equipment and software, but also the people. Give department heads a scenario like this: customer ABC Enterprises has lost their entire data center in a mudslide that took out the building. We need to restore their data center to AWS instances, and find terminals for their employees to configure services and get work done for at least the next three months, until the building can be evaluated and systems purchased. What do we need to do, where is the documentation for their systems, and what’s our first step?
Review and update your DR plan regularly
Even if a plan is in place and has been successfully tested, it still needs to be reviewed and updated regularly. It’s so easy for any user with a credit card and a little knowledge to bring up a new server in the cloud, or clone a database and store it in a different service. It’s up to you to regularly review their systems to ensure that everything critical is covered and secured.
You can put policies in place to forbid people from branching out on their own, but if they’re not aware of the policy, you could still lose critical data. You need to review, get buy-in from users and departments, and ensure that everything is covered.
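Part of the regular review can be reduced to comparing what actually exists with what the plan covers. The sketch below assumes you already have two lists, one of discovered assets and one of assets named in the DR plan; how you gather them is outside its scope, and the name `coverage_gaps` and the sample asset names are hypothetical.

```python
def coverage_gaps(discovered: set[str], in_plan: set[str]) -> dict[str, list[str]]:
    """Compare discovered assets with the assets listed in the DR plan."""
    return {
        # Exists in the environment but is not covered by the plan.
        "unprotected": sorted(discovered - in_plan),
        # Listed in the plan but no longer found in the environment.
        "stale": sorted(in_plan - discovered),
    }


# Example: a user has cloned a database that the plan does not know about.
discovered = {"web-01", "db-01", "db-01-clone"}
in_plan = {"web-01", "db-01", "old-fileserver"}
gaps = coverage_gaps(discovered, in_plan)
```

Both lists are worth acting on: unprotected assets are the ones that can lose data, while stale entries waste testing effort and clutter the plan.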
Further reading: Disaster Recovery Plan Checklist for MSPs
Disaster recovery plans cannot remain static. They have to evolve to include additions to the business, and must be tested and checked for gaps in coverage. It’s also critical to ensure that all of the relevant managers and IT personnel understand the plan and know where to get the necessary information in the event of a disaster. There are so many ways to fail that the only way not to is to update, test and retest the plan.