As a next step in developing a restoration and test plan, Amir Chaudry, vice president of storage at HPE, recommends cataloging all applications in the enterprise and assigning criticality to all of them.
“You have to be able to stack-rank the importance of each of your apps so if you do have to restore from your servers, you’re bringing things back online according to business and customer priorities,” he says. “That means knowing, in advance, which apps are most vital to your ongoing operations.”
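To make the stack-ranking idea concrete, here is a minimal sketch in Python. The `App` record, the 1-to-3 criticality scale, and the application names are hypothetical, chosen only for illustration; a real catalog would come from the organization's own inventory.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class App:
    name: str
    criticality: int  # hypothetical scale: 1 = most critical, 3 = least


def restore_order(apps):
    """Return app names sorted most critical first; ties are broken by name."""
    ranked = sorted(apps, key=lambda app: (app.criticality, app.name))
    return [app.name for app in ranked]


# Illustrative catalog only; these names are assumptions, not from the article.
catalog = [
    App("reporting", 3),
    App("order-processing", 1),
    App("email", 2),
    App("customer-portal", 1),
]

print(restore_order(catalog))
```

The point of the sketch is simply that the ranking is decided and recorded ahead of time, so the restore sequence is not improvised during an outage.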
Chaudry says another part of the plan should identify an organization’s goals for recovery point objectives (RPO) and recovery time objectives (RTO). RPO measures how much data an organization can withstand losing before it runs aground. RTO gauges how much time an organization can be without key apps without causing significant damage.
“Keep it real when considering RPO and RTO,” says Chaudry. “Many IT leaders go about modeling with rose-colored glasses and decide their networks will do just fine because they planned so well. In truth, it doesn’t take long for lost data or a downed application to wreak havoc.”
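As a rough illustration of how the two objectives are measured, the Python sketch below assumes RPO is expressed as a time window (a common convention: the span between the last good backup and the failure) and RTO as the allowable outage duration. The timestamps and the four-hour and two-hour targets are invented for the example.

```python
from datetime import datetime, timedelta


def data_loss_window(last_good_backup: datetime, failure: datetime) -> timedelta:
    """Time span of data written after the last good backup and lost at failure."""
    return failure - last_good_backup


def downtime(failure: datetime, restored: datetime) -> timedelta:
    """How long the application was unavailable."""
    return restored - failure


# Hypothetical targets and incident times, for illustration only.
rpo = timedelta(hours=4)
rto = timedelta(hours=2)
last_backup = datetime(2024, 1, 10, 2, 0)
failure = datetime(2024, 1, 10, 9, 30)
restored = datetime(2024, 1, 10, 11, 0)

loss = data_loss_window(last_backup, failure)
outage = downtime(failure, restored)
rpo_met = loss <= rpo
rto_met = outage <= rto

print(f"data loss window: {loss}, RPO met: {rpo_met}")
print(f"downtime: {outage}, RTO met: {rto_met}")
```

In this made-up incident the application came back within the RTO, but seven and a half hours of data were lost against a four-hour RPO, which is the kind of gap optimistic modeling tends to hide.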
Once IT leaders better map their data and application dependencies, they can then decide how best to test against them.
Haim Glickman, a senior vice president at Sungard Availability Services, a managed disaster recovery company in Pennsylvania, notes testing can be complex. As such, many organizations offload those responsibilities to firms such as his. This gives them access to outside expertise and the most up-to-date tools so they can concentrate on other business or operational priorities.
But if companies decide they have the staff, expertise, and gumption to handle it themselves, Glickman recommends that they remember their high school Biology 101 lab courses: Do everything in controlled “bubbles.” Create a staged environment that won’t interfere with your live production data. The test environment can be on premises or in the cloud. Whichever you choose, do not allow the two environments to intermingle, because that could skew results and even create operational hazards.
Ken van Wyk, president of KRvW Associates, a small cybersecurity and incident response firm in Virginia, takes it a step further with a concept called table-topping. As he defines it, this is where an IT department simulates an actual emergency. Rather than just running digital tests in a bubble, he comes in, looks over the organization’s cataloging, criticality, and RPO and RTO analyses, and then goes to work.
“I try to get clients to take a hard look at their assessments, but ultimately, you’ll always have people in the room who are hopelessly optimistic about their estimates of downtime,” he says. “So I push them to do live testing. We disconnect a system for a while and simulate downtime in a very real way. That can be really disruptive, of course. You need to carefully consider how you do it so you don’t make things worse than what you’re simulating. But the exercise can be eye-opening, allowing you to realistically practice and prepare for a major disaster.”
The disconnect process should not be viewed as a general solution, van Wyk adds. Rather, it should be reserved for specific instances where production resilience is regularly tested, he says.
Chaudry agrees, noting it’s critical not only to run both digital and live simulations but also to do so weekly or monthly, because data is constantly changing. In addition, he suggests testing restores before and after system changes or upgrades. What’s more, tests should examine both new and old data. Too often, it’s the oldest data that becomes most troublesome during recovery efforts, he says.
In the end, Chaudry says any testing program needs to map back to RPO and RTO goals. IT teams should check the duration of restore simulations against those targets. If they’re misaligned, they need to adjust and test again.
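The comparison of restore-test durations against targets could be sketched as follows. This is a minimal Python example under stated assumptions: the function name `misaligned`, the application names, and all durations are hypothetical, and a real program would pull measured times from its own test logs.

```python
from datetime import timedelta


def misaligned(measured, targets):
    """Return {app: overshoot} for each app whose restore test exceeded its RTO target."""
    return {
        app: took - targets[app]
        for app, took in measured.items()
        if took > targets[app]
    }


# Hypothetical RTO targets and measured restore-test durations.
rto_targets = {
    "order-processing": timedelta(hours=1),
    "email": timedelta(hours=4),
    "reporting": timedelta(hours=24),
}
measured_restores = {
    "order-processing": timedelta(hours=1, minutes=45),
    "email": timedelta(hours=3),
    "reporting": timedelta(hours=30),
}

gaps = misaligned(measured_restores, rto_targets)
for app, overshoot in gaps.items():
    print(f"{app}: over target by {overshoot}; adjust and test again")
```

Any application that shows up in `gaps` is a candidate for the adjust-and-retest loop described above; an empty result means the latest round of tests met every target.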
“Rinse and repeat,” Chaudry says. “Tools and outside consultants help you do that more effectively, especially those offering continuous data protection features. But if you think you can handle all of this yourself, make sure you’re backing up efficiently and testing all of your data and apps as much as possible. Your organization’s very existence may depend upon it.”
LESSONS FOR LEADERS
- Few companies test their restore plans often or effectively enough — if at all. Not knowing whether your plan works compromises disaster recovery.
- To know if you are meeting your organization’s restore needs, you need to establish recovery point objective and recovery time objective goals.
- It’s worth the expense to bring in experts to build a plan for regular testing of restores and other disaster recovery operations.
This article/content was written by the individual writer identified and does not necessarily reflect the view of Hewlett Packard Enterprise Company.