Thursday, 13 October 2016

Getting Protected site back online after using Forced Recovery Plan with SRM

This week I had a question from one of my customers on how to correctly test disaster recovery with SRM in a scenario as close as possible to reality.

Most of you probably know how to run a non-disruptive failover test with SRM, which lets you verify the SRM recovery plan without any impact on the Production servers.

You might also have used SRM to test a planned failover, where virtual machines are powered off at the Protected Site and then recovered at the Recovery Site.

The good thing is that the official documentation provides comprehensive instructions on how to run these tests.

However, the information provided on how to correctly deal with forced recovery is a bit vague. This type of recovery is run when the Protected datacentre is not available. And that's what our customer wanted to test, to be 100% sure their infrastructure is covered for a real disaster.

Obviously, when your Protected Site is down and you have to recover your environment there are not many choices. You can only run Forced Recovery on the SRM server at the Recovery Site.

But the documentation does not explain how to deal with the situation when the Protected Site comes back online.

Here is what it says:

"After the forced recovery completes and you have verified the mirroring of the storage arrays, you can resolve the issue that necessitated the forced recovery. After you resolve the underlying issue, run planned migration on the recovery plan again, resolve any problems that occur, and rerun the plan until it finishes successfully. Running the recovery plan again does not affect the recovered virtual machines at the recovery site."

When I first read it I had several questions:

1. In which direction should the storage mirroring be configured before running Planned Migration, given that the VMs have already been recovered at the Recovery Site?
2. How will Planned Migration be able to complete successfully when so many steps in the recovery plan were already completed during the Forced Recovery? If you have ever run a Planned Migration, you know that any error will stop the Recovery Plan.
3. Should I pause/stop the storage replication prior to running Planned Migration?

So, I had no clear understanding of the sequence of actions for this scenario. That's where my home lab proved to be a very efficient investment.

To make it as close as possible to a real infrastructure, I deployed HPE VSA to simulate array-based replication. Each site consists of 3 hosts, and the Protected Site runs a couple of CentOS VMs on a replicated datastore.

So, here is the sequence of steps I used in my lab to simulate a disaster, run a forced recovery and restore the status quo after bringing the Protected Site back online.

Please note that there are many different DR scenarios and I have not tested all of them. Also, because everything runs as a nested lab, I can't test different types of storage or replication, so the outcome of a Forced Recovery with HP 3PAR or EMC VMAX with synchronous replication might differ from what I got.

1. The failure of the Protected Site was simulated using firewall rules that deny all traffic between the sites, including the replication traffic.

2. Logged into vCenter at the Recovery Site and ran the Forced Recovery plan.

The following screenshot depicts all the steps of the recovery plan and their status.

3. After confirming that all VMs were successfully restored at the Recovery Site, I shut down the VMs at the Protected Site and then removed the firewall rules to restore the connection between the sites.

SRM servers give you some hints on how to restore the status quo.

Protected Site status

Recovery Site status

Replication status
As you can see, SRM understands that the failover is not fully completed yet. Therefore, the replication status of the device is 'Failover in Progress'.

The Recovery Plan

As you can see, the Recovery Plan now looks different compared to the one in Step 2. It actually tells you to run the Planned Failover again.

4. Ran the Planned Failover again as instructed

Looks like SRM is smart enough to skip the steps that have already been done.
Essentially, the following actions are conducted when running Planned Failover:

 * Protected VMs are shut down at the Protected Site
 * Protected VMs are converted to Placeholder VMs
 * The protected datastores are unmounted at the Protected Site
 * The replicated LUNs are converted to read-only mode

That brings both SRM servers to a consistent state where all workloads now run at the Recovery Site and are replicated to the Protected Site.
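To picture why the second run does not trip over the steps that were already completed, here is a small Python model of the behaviour I observed. It is only a sketch of the idea, not how SRM is implemented: the `Step` class, the `run_plan` function and the two Recovery Site step names are my own assumptions, while the four Protected Site steps come from the list above.

```python
from dataclasses import dataclass


@dataclass
class Step:
    name: str
    site: str           # site that must be reachable: "protected" or "recovery"
    done: bool = False


def run_plan(steps, reachable_sites):
    """Run every step that is not done yet and whose site is reachable.

    Completed steps are skipped, and steps whose site is unreachable stay
    pending instead of aborting the run (hypothetical model of a forced run).
    """
    outcomes = []
    for step in steps:
        if step.done:
            outcomes.append((step.name, "skipped"))
        elif step.site in reachable_sites:
            step.done = True
            outcomes.append((step.name, "completed"))
        else:
            outcomes.append((step.name, "pending"))
    return outcomes


plan = [
    Step("Shut down protected VMs", "protected"),
    Step("Convert protected VMs to placeholders", "protected"),
    Step("Unmount protected datastores", "protected"),
    Step("Set replicated LUNs to read-only", "protected"),
    Step("Bring recovered storage online", "recovery"),   # assumed step name
    Step("Power on recovered VMs", "recovery"),           # assumed step name
]

# Forced Recovery: only the Recovery Site can be reached.
forced_run = run_plan(plan, {"recovery"})

# Second run after the link is restored: both sites can be reached.
rerun = run_plan(plan, {"protected", "recovery"})

for name, outcome in rerun:
    print(f"{outcome:10} {name}")
```

In this model the second run only touches the Protected Site steps and reports the Recovery Site steps as skipped, which matches what the recovery plan showed in my lab.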

Now you can follow the regular routine and reprotect the workload and then move it back to the Protected site using the Planned Failover option.
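The whole round trip can be summarised as a tiny state machine. Again, this is just a sketch for keeping the order of operations straight. The state names and the `next_state` helper are made up for illustration (only 'Failover in Progress' is taken from the replication status shown earlier), and none of this calls any SRM API.

```python
# (current state, operation) -> next state; all names are hypothetical labels.
TRANSITIONS = {
    ("running_at_protected", "forced_recovery"): "failover_in_progress",
    ("failover_in_progress", "planned_migration"): "running_at_recovery",
    ("running_at_recovery", "reprotect"): "running_at_recovery_reprotected",
    ("running_at_recovery_reprotected", "planned_migration"): "back_at_protected",
}


def next_state(state, operation):
    """Return the state reached by running an operation, or raise if it is out of order."""
    try:
        return TRANSITIONS[(state, operation)]
    except KeyError:
        raise ValueError(f"cannot run {operation!r} in state {state!r}") from None


state = "running_at_protected"
history = [state]
for operation in ["forced_recovery", "planned_migration", "reprotect", "planned_migration"]:
    state = next_state(state, operation)
    history.append(state)

print(" -> ".join(history))
```

The point of the table is that reprotect is only valid once the rerun of the plan has finished, so trying it while the status is still 'Failover in Progress' is rejected by the model.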

Hope this helps you understand the logic of SRM recovery after a Forced Recovery.