Tuesday, 5 April 2016

StarWind Virtual SAN review - Part 6 - Failure Scenarios

In this part of the review I would like to talk about some possible failure scenarios and how StarWind Virtual SAN reacts to them.

1. Heartbeat link failure

This has no impact on functionality, for the simple reason that the heartbeat link is only used when the Sync links fail. The failure may be caused by a problem at the ESXi level or by a failed physical switch, so make sure both are properly monitored.

2. Single Sync link failure

ESXi hosts will still see all your targets, but storage performance may decrease because the surviving Sync interface may not have enough bandwidth to cope with the amount of Sync traffic.
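To put rough numbers on this, here is a small Python sketch of the bandwidth arithmetic. The link speeds and the Sync load are made-up figures for illustration only, and `sync_headroom` is a hypothetical helper, not anything StarWind provides.

```python
def sync_headroom(surviving_link_speeds_gbps, sync_load_gbps):
    """Spare Sync bandwidth in Gbit/s; a negative value means the links are saturated."""
    return sum(surviving_link_speeds_gbps) - sync_load_gbps


# Assumed example: two 10 Gbit/s Sync links carrying 12 Gbit/s of Sync traffic.
both_links_up = sync_headroom([10, 10], 12)  # 8 Gbit/s to spare
one_link_down = sync_headroom([10], 12)      # -2 Gbit/s, the surviving link is the bottleneck
```

If the headroom with a single link is still positive for your workload, you should not notice this failure at all.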

3. All Sync links failure

When the SW servers can't reach each other over the Sync links, each of them uses the Heartbeat link to find out whether its partner server is online or offline:

a. Online - The secondary device will disable access to all devices. The primary device will flush its cache to the disks (with a large cache this can cause temporary disk contention), and the cache will be switched to write-through mode to protect the integrity of the data in case the primary SW node fails too. All nodes will fail over their storage paths to the primary SW appliance. Once the Sync link is restored, SW will resync the HA devices.

b. Offline - The server will declare itself the master.

4. Sync and Heartbeat Link Failure

This is the worst-case scenario: both nodes believe they are the only survivor. The data on the two nodes goes out of sync and should be considered corrupted, and restoring the data is the only option in this situation. Therefore, don't hesitate to configure more than one heartbeat link. The heartbeat interface doesn't generate much traffic and can coexist with other traffic types without any significant impact.
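Scenarios 1 to 4 can be summarised as a small decision function. This is only my reading of the behaviour described above, written as a Python sketch. It is not StarWind's code, and `Role` and `node_reaction` are hypothetical names invented for the illustration.

```python
from enum import Enum


class Role(Enum):
    PRIMARY = "primary"
    SECONDARY = "secondary"


def node_reaction(role, sync_links_up, heartbeat_links_up, partner_alive=True):
    """Describe how one node reacts to a link failure (illustrative model only)."""
    if sync_links_up > 0:
        # Scenarios 1 and 2: at least one Sync link survives, so replication goes on.
        return "keep serving, replication continues"
    if heartbeat_links_up > 0:
        # Scenario 3: no Sync links, but the heartbeat reveals the partner's state.
        if not partner_alive:
            return "declare itself master"
        if role is Role.PRIMARY:
            return "flush cache, switch to write-through, keep serving"
        return "disable access to devices, wait for resync"
    # Scenario 4: no Sync and no heartbeat. The node cannot see its partner,
    # so it assumes the partner is gone, whatever the partner's real state is.
    return "declare itself master (split-brain risk)"
```

Note that in the last branch both nodes return the same answer regardless of their role, which is exactly why a second heartbeat link is worth having.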

5. Disk Failure

a. RAID0 - SW loses access to the device. All paths will be switched to the other node. 

b. RAID1, 5, 6 - performance will decrease because your RAID arrays have to handle the rebuild load while servicing other I/O.

6. SSD hosting L2 cache failure

If the L2 cache was set to Write-Through mode, the Virtual SAN device goes offline on the given server, but its replica stays online on the partner node. To bring it back online you will need to disable the L2 cache and restart the StarWind Virtual SAN service. That will make the device active again and it will resume syncing with the partner device.

7. SW Server/ESXi host failure
