If only I had known that, after becoming a VCAP-DCA, virtualization would be the least important part of my work :)
Anyway, I have a short story to tell. A couple of days ago we had a small incident that brought our mighty EVA 6400 to its knees. Yeah, I know it is not that mighty at all, but in our small infrastructure the EVA is the best part of it.
At 4:15 pm I noticed that my Outlook had lost its connection to the Exchange server. When I tried to check the Exchange server, it took me 10 minutes just to log in. I also noticed I had lost the connection to vCenter. It was starting to look like a disaster.
The network was fine; I could ping all the core switches. Since I could not open vCenter, I connected to the ESXi host directly and was really shocked by the Performance graph for the Datastores. The latency numbers were ridiculous - around 50,000 (!) ms. The storage (or the SAN) was screwed.
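(A side note: when vCenter is down, esxtop from the ESXi shell shows the same numbers. Press u to switch to the disk device view and watch the DAVG/cmd and GAVG/cmd columns - roughly, the latency the array reports and the latency the VMs actually see.)
# Interactive: press 'u' for the disk device view
esxtop
# Or capture a few samples in batch mode for later analysis (2-second interval, 10 iterations)
esxtop -b -d 2 -n 10 > /tmp/esxtop-stats.csv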
I immediately SSHed to the host and started checking the logs. BTW, my favorite way to read logs in real time is the tail command.
tail -f /var/log/vmkernel.log lets you read the entries as they appear in the log file.
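In a situation like this it also helps to filter the stream down to the interesting lines; something like this (assuming the default ESXi 5.x log location) shows only the failed SCSI commands as they come in:
# Follow the vmkernel log and show only failed SCSI commands
tail -f /var/log/vmkernel.log | grep "failed H:"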
There were plenty of "Cmd to dev failed" errors, but the most frequent ones looked like this:
*********************************************************************************
2013-04-24T21:01:55.292Z cpu4:1245952)ScsiDeviceIO: 2316: Cmd(0x4124022c3180) 0x2a, CmdSN 0x80000041 from world 2135091 to dev "naa.6001438005df09250000500002c60000" failed H:0x0 D:0x28 P:0x0 Possible sense data: 0x0 0x0 0x0.
*********************************************************************************
A quick search for SCSI error codes showed that Device status 0x28 means "TASK SET FULL" (in that log line, H: is the host status, D: the device status and P: the plugin status).
According to some articles, this error code can mean either "This status is returned when the LUN prevents accepting SCSI commands from initiators due to lack of resources, namely the queue depth on the array", or write cache congestion.
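A quick and dirty one-liner like this (just an illustration, based on the log format above) shows which LUNs were actually returning TASK SET FULL and how often:
# Count TASK SET FULL (D:0x28) responses per device in the vmkernel log
grep "D:0x28" /var/log/vmkernel.log | grep -o 'dev "naa\.[0-9a-f]*"' | sort | uniq -c | sort -rn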
Once I read that, I knew the root cause of the problem. Two days earlier I had presented a new vDisk to the backup server to provide more space for Catalog data, and I knew my colleague was moving a huge amount of data to that new vDisk at the very time we were experiencing the issues.
Of course, I had no time to read up on how to check the queue depth stats on the EVA, but I could definitely address write cache congestion by changing the vDisk Write-Cache mode from Write-Back to Write-Through.
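(I still haven't checked the EVA side, but on the ESXi side the queueing is easy to watch: in the esxtop disk device view, DQLEN is the configured device queue depth, ACTV the commands outstanding on the array and QUED the commands waiting in the host queue. The configured depth can also be checked with esxcli.)
# Show per-device settings, including Device Max Queue Depth
esxcli storage core device list -d naa.6001438005df09250000500002c60000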
With Write-Back, the array reports the write as complete to the host as soon as the data is written to cache, before it is actually transferred to disk. It definitely speeds up write IOs, but, as we can see, it can also cause write cache congestion.
When I changed the Write-Cache mode, the situation immediately improved and latency went back to normal values.
So far, I don't know what the right solution for this type of problem is. Unfortunately, I cannot partition the write cache on the EVA 6400 to protect the virtual machines from write cache congestion.
As long as only ESXi hosts are connected to the storage, similar situations can be managed by Storage I/O Control or the Adaptive Queue Depth algorithm, but when you have a physical host in the mix you probably need to think about some kind of QoS in the SAN. Actually, I have read such warnings hundreds of times, but never faced them personally. It was a good lesson.
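For the record, the Adaptive Queue Depth algorithm is exactly the mechanism that reacts to these TASK SET FULL (and BUSY) responses by throttling the LUN queue depth. If I remember correctly, on ESXi 5.1 it is enabled per device with something like the command below (the threshold and sample size values are just an illustration; see VMware KB 1008113 for the recommendations for your array):
# Enable adaptive queue depth throttling for a device (values are illustrative)
esxcli storage core device set --device naa.6001438005df09250000500002c60000 --queue-full-threshold 4 --queue-full-sample-size 32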