Tuesday 15 September 2015

Windows Failover Cluster Migration with vSphere Replication


Recently I was helping a colleague of mine with the migration of one of the clients to a new datacenter. Most of the VMs were planned to be migrated using vSphere Replication. However, the customer couldn't decide how to migrate its numerous Windows Failover Clusters (WFC) with physical RDM disks, for the simple reason that vSphere Replication doesn't support replication of RDM disks in physical compatibility mode.

Yes, you can still replicate a virtual RDM disk, but it will be automatically converted to a VMDK file at the destination, so you won't be able to use a cross-host WFC. At least, that's what I thought before I found this excellent article, which contains a very interesting note:

"If you wish to maintain the use of a virtual RDM at the target location, it is possible to create a virtual RDM at the target location using the same size LUN, unregister (not delete from disk) the virtual machine to which the virtual RDM is attached from the vCenter Server inventory, and then use that virtual RDM as a seed for replication."

That's when I realised it should still be possible to use vSphere Replication to move a WFC to another datacenter with zero impact on clustered services and zero changes at the OS/application level.


So here was my original high-level plan:

  • Switch RDM disks to virtual compatibility mode - VMware KB106599
  • Replicate one MSCS node and all its RDM disks to pre-created virtual RDM disks at the destination datacenter. 
  • Replicate the rest of the WFC nodes with OS disks only. 
  • Switch RDM back to physical mode at the destination datacenter 
  • Connect all RDM disks back to the MSCS nodes 

Then it was time to do an actual test. I have a pretty small home lab, so I had to replicate the WFC between ESXi hosts in the same cluster. 


Complete the prerequisites
  • Prepare the port groups, e.g. the data and heartbeat networks for the WFC cluster 
  • Write down the specs of the LUNs used by the WFC and provision LUNs of the same size at the destination site. The LUN numbers don't have to match. 
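Since the target LUNs must match the source LUN sizes exactly, it can be handy to dump them programmatically instead of clicking through the Web Client. Below is a minimal pyvmomi sketch, assuming vCenter connectivity; the host names and credentials are placeholders, not from this environment:

from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim
import ssl

# Connection details are examples only
ctx = ssl._create_unverified_context()
si = SmartConnect(host='vcenter.example.local',
                  user='administrator@vsphere.local',
                  pwd='***', sslContext=ctx)
content = si.RetrieveContent()

# Find the source ESXi host by DNS name (placeholder name)
host = content.searchIndex.FindByDnsName(dnsName='esx01.example.local',
                                         vmSearch=False)

# Print the canonical name and capacity of every disk LUN, so identically
# sized LUNs can be provisioned at the destination site
for lun in host.config.storageDevice.scsiLun:
    if isinstance(lun, vim.host.ScsiDisk):
        size_gb = lun.capacity.block * lun.capacity.blockSize / 1024 ** 3
        print('{0}  {1:.0f} GB'.format(lun.canonicalName, size_gb))

Disconnect(si)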

Switch RDM to virtual compatibility mode on the source WFC
  • Power off all WFC nodes - 2 in my case
  • Write down the LUN and corresponding SCSI ID
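The SCSI ID to LUN mapping can also be pulled via pyvmomi. A hedged sketch, assuming a connection and a `content` object as in the previous example; the VM name is a placeholder:

from pyVmomi import vim

# Locate the WFC node (name is an example)
vm = content.searchIndex.FindByDnsName(dnsName='wfc-node1.example.local',
                                       vmSearch=True)

# Map controller keys to controller objects so we can resolve bus numbers
controllers = {d.key: d for d in vm.config.hardware.device
               if isinstance(d, vim.vm.device.VirtualSCSIController)}

# Print SCSI ID, backing LUN device and compatibility mode for every RDM disk
for dev in vm.config.hardware.device:
    if isinstance(dev, vim.vm.device.VirtualDisk) and isinstance(
            dev.backing, vim.vm.device.VirtualDisk.RawDiskMappingVer1BackingInfo):
        ctrl = controllers[dev.controllerKey]
        print('SCSI({0}:{1})  LUN {2}  mode {3}'.format(
            ctrl.busNumber, dev.unitNumber,
            dev.backing.deviceName, dev.backing.compatibilityMode))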






  • Remove disks from WFC Node 2, do not delete disks! 


  • Delete disks from Node 1 - delete files from datastore
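The difference between these two steps is only whether the disk files are destroyed. A sketch of both removal flavours through ReconfigVM_Task; the helper name and the commented calls are mine, not from the original procedure:

from pyVmomi import vim

def detach_rdm_disks(vm, destroy_files=False):
    """Remove every RDM disk from a VM; optionally delete the mapping files."""
    changes = []
    for dev in vm.config.hardware.device:
        if isinstance(dev, vim.vm.device.VirtualDisk) and isinstance(
                dev.backing, vim.vm.device.VirtualDisk.RawDiskMappingVer1BackingInfo):
            spec = vim.vm.device.VirtualDeviceSpec()
            spec.operation = vim.vm.device.VirtualDeviceSpec.Operation.remove
            if destroy_files:
                # "delete files from datastore" - also removes the RDM descriptor
                spec.fileOperation = vim.vm.device.VirtualDeviceSpec.FileOperation.destroy
            spec.device = dev
            changes.append(spec)
    return vm.ReconfigVM_Task(vim.vm.ConfigSpec(deviceChange=changes))

# Node 2: keep the mapping files; Node 1: remove them from the datastore
# detach_rdm_disks(node2_vm, destroy_files=False)
# detach_rdm_disks(node1_vm, destroy_files=True)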



  • Create new virtual RDMs using the information we collected earlier
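Creating the virtual RDMs can be scripted as well. A minimal sketch, assuming the SCSI IDs and NAA IDs collected earlier; the function and its parameters are illustrative, not part of the original walkthrough:

from pyVmomi import vim

def add_virtual_rdm(vm, controller_key, unit_number, naa_id, rdm_path):
    """Attach a new virtual-mode RDM to an existing SCSI controller slot."""
    backing = vim.vm.device.VirtualDisk.RawDiskMappingVer1BackingInfo()
    backing.compatibilityMode = 'virtualMode'   # 'physicalMode' when converting back later
    backing.deviceName = '/vmfs/devices/disks/' + naa_id
    backing.fileName = rdm_path                 # e.g. '[datastore1] WFC-Node1/WFC-Node1_1.vmdk'
    backing.diskMode = 'persistent'

    disk = vim.vm.device.VirtualDisk()
    disk.key = -101                             # temporary negative key for a new device
    disk.controllerKey = controller_key
    disk.unitNumber = unit_number
    disk.backing = backing

    spec = vim.vm.device.VirtualDeviceSpec()
    spec.operation = vim.vm.device.VirtualDeviceSpec.Operation.add
    spec.fileOperation = vim.vm.device.VirtualDeviceSpec.FileOperation.create
    spec.device = disk
    return vm.ReconfigVM_Task(vim.vm.ConfigSpec(deviceChange=[spec]))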


  • Add the RDM disks to the second node by pointing to the RDM descriptor files (.vmdk files) located in the folder of WFC Node 1.
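Attaching the existing mapping files to the second node is the same reconfigure call, except that the backing points at the existing descriptor and no new file is created. Again a hedged sketch with a helper name of my own:

from pyVmomi import vim

def attach_existing_rdm(vm, controller_key, unit_number, rdm_path):
    """Attach an already existing RDM descriptor (.vmdk) to another cluster node."""
    backing = vim.vm.device.VirtualDisk.RawDiskMappingVer1BackingInfo()
    backing.fileName = rdm_path                 # descriptor in WFC Node 1's folder
    backing.diskMode = 'persistent'

    disk = vim.vm.device.VirtualDisk()
    disk.key = -102
    disk.controllerKey = controller_key
    disk.unitNumber = unit_number
    disk.backing = backing

    spec = vim.vm.device.VirtualDeviceSpec()
    spec.operation = vim.vm.device.VirtualDeviceSpec.Operation.add
    # No fileOperation: the descriptor already exists on the datastore
    spec.device = disk
    return vm.ReconfigVM_Task(vim.vm.ConfigSpec(deviceChange=[spec]))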

  • Create DRS affinity rule to keep WFC nodes on the same ESXi host 
  • Power on the WFC and confirm it is functional and that you can fail over clustered services between the WFC nodes.

Note: I think the previous two steps can be skipped once you gain confidence in the final result after multiple successful migrations. This time I just wanted to confirm that this particular change was successful, as I had never done it before. 
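For cross-host WFC on virtual RDMs the nodes have to stay on the same ESXi host, hence the keep-together rule. A sketch of creating one with pyvmomi; the cluster and VM objects in the commented call are placeholders:

from pyVmomi import vim

def create_keep_together_rule(cluster, rule_name, vms):
    """Create a DRS affinity (keep VMs together) rule on a cluster."""
    rule = vim.cluster.AffinityRuleSpec()
    rule.name = rule_name
    rule.enabled = True
    rule.vm = vms

    rule_spec = vim.cluster.RuleSpec(info=rule, operation='add')
    config = vim.cluster.ConfigSpecEx(rulesSpec=[rule_spec])
    return cluster.ReconfigureComputeResource_Task(config, modify=True)

# Example call with placeholder objects:
# create_keep_together_rule(source_cluster, 'WFC-keep-together', [node1_vm, node2_vm])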

  • Create a temporary virtual machine with the name of the first WFC node. This will be used to pre-create the virtual RDM disks
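Pre-creating the placeholder VM can also be done through the API. A minimal sketch, assuming a destination VM folder and resource pool have already been looked up; all names and sizes are examples:

from pyVmomi import vim

# Shell VM that only exists to hold the pre-created virtual RDMs
config = vim.vm.ConfigSpec(
    name='WFC-Node1',                           # must match the first WFC node's name
    guestId='windows8Server64Guest',            # example guest ID
    numCPUs=1,
    memoryMB=1024,
    files=vim.vm.FileInfo(vmPathName='[dest-datastore1]'))

# vm_folder and resource_pool are assumed to have been retrieved already
task = vm_folder.CreateVM_Task(config=config, pool=resource_pool)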



  • Using the information we wrote down earlier, create the same virtual RDM disks. 


  • Remove the disks, but do not delete them from the datastore

  • Delete the temporary VM 
  • Power off the source VMs 
  • Disable Sharing on the LSI SCSI adapter, otherwise the replication configuration will fail
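The bus sharing change on the clustered SCSI controller can be scripted too. A hedged sketch, assuming the shared disks sit on SCSI bus 1 (a common MSCS layout, but verify it for your VMs):

from pyVmomi import vim

def set_scsi_bus_sharing(vm, bus_number, sharing):
    """Change bus sharing on a SCSI controller, e.g. to 'noSharing' before replication."""
    for dev in vm.config.hardware.device:
        if isinstance(dev, vim.vm.device.VirtualSCSIController) \
                and dev.busNumber == bus_number:
            dev.sharedBus = sharing
            spec = vim.vm.device.VirtualDeviceSpec()
            spec.operation = vim.vm.device.VirtualDeviceSpec.Operation.edit
            spec.device = dev
            return vm.ReconfigVM_Task(vim.vm.ConfigSpec(deviceChange=[spec]))

# Before configuring replication (placeholder VM object):
# set_scsi_bus_sharing(node1_vm, 1, vim.vm.device.VirtualSCSIController.Sharing.noSharing)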



  • Configure replication of WFC Node 1 - use the pre-created virtual RDMs as replication seeds 



  • Configure Replication of the rest of the WFC nodes 



  • Trigger the replication for both VMs by pressing the Synchronise Data immediately button - it won't start automatically as the VMs are powered off. 



  • Once the replication has finished, recover the WFC nodes




!!! Don't forget to deselect the Power On option during recovery

Now change virtual RDM disks into physical ones

This is the reverse of the process we performed on WFC Node 1 at the source datacenter:

  • Delete the virtual RDM disks from the master WFC node 
  • Create new RDM disks in physical compatibility mode 
  • Add the physical RDM disks to the rest of the WFC nodes 
  • Create a DRS anti-affinity rule to keep the WFC nodes on different ESXi hosts 
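The anti-affinity rule is the mirror image of the keep-together rule created earlier; only the spec type changes. A sketch with placeholder cluster and VM objects:

from pyVmomi import vim

def create_keep_apart_rule(cluster, rule_name, vms):
    """Create a DRS anti-affinity (separate VMs) rule on the destination cluster."""
    rule = vim.cluster.AntiAffinityRuleSpec()
    rule.name = rule_name
    rule.enabled = True
    rule.vm = vms

    rule_spec = vim.cluster.RuleSpec(info=rule, operation='add')
    config = vim.cluster.ConfigSpecEx(rulesSpec=[rule_spec])
    return cluster.ReconfigureComputeResource_Task(config, modify=True)

# create_keep_apart_rule(dest_cluster, 'WFC-keep-apart', [node1_vm, node2_vm])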
And voilà! We have our cluster replicated, recovered and running just fine at the new location and on new RDM disks.


The only problem I faced was missing IP configuration on the VMs' NICs, for the simple reason that after replication the MAC address changed and the NIC was re-enumerated inside the guest OS. This issue impacts only VMXNET3 adapters, as far as I understood. Here you can find a detailed explanation of the problem.

Now that I have a proof of concept of the solution, I just have to convince the customer to go this way, which I believe will be the most challenging task :)


