Saturday, 26 September 2015

ESXi's killer feature is under threat

I know this has been the case for a while, but I have only just learnt about it.

So, apparently Transparent Page Sharing (TPS) is now disabled by default. Here is the list of the patches and ESXi builds where TPS was disabled:

  • ESXi 5.0 Patch ESXi500-201502001, released on February 26, 2015
  • ESXi 5.1 Update 3, released on December 4, 2014
  • ESXi 5.5 Patch ESXi550-201501001, released on January 27, 2015
  • ESXi 6.0

This has been one of my favourite ESXi features, and I have always taken advantage of it. Even after Nehalem CPUs were released and Large Pages made TPS practically useless, I still preferred to disable Large Pages to get a better understanding of memory usage on my systems. Some VMware white papers stated that using Large Pages brings a CPU performance increase of about 15% to 20%, but I could never reproduce those results in my environments.

So, why did VMware make this decision?

According to KB2080735, academic researchers "have demonstrated that by forcing a flush and reload of cache memory, it is possible to measure memory timings to try and determine an AES encryption key in use on another virtual machine running on the same physical processor of the host server if Transparent Page Sharing is enabled between the two virtual machines". Sounds pretty dangerous, huh?

However, VMware then says "VMware believes information being disclosed in real world conditions is unrealistic" and "This technique works only in a highly controlled system configured in a non-standard way that VMware believes would not be recreated in a production environment."

I understand that VMware prefers a "better safe than sorry" approach, and that is fair enough given that the reputational damage would be huge if the flaw were ever exploited in a real production environment.

What exactly was changed and how?

TPS is disabled only for inter-VM memory sharing. Memory pages within a single VM are still shared, although this provides significantly smaller savings from memory deduplication.

To be more specific, the memory sharing feature is not actually disabled. VMware introduced the so-called salting concept, which lets an ESXi host deduplicate two identical memory pages in different virtual machines only when their salt values are the same.

This new concept is enforced by a new configuration setting, Mem.ShareForceSalting=1. Setting this option to 0 removes the salting requirement and allows inter-VM memory sharing to work as it did before the security patches were applied.
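To make the salting rule concrete, here is a minimal toy model in Python. It is not how ESXi is implemented: the Page class, the can_share function and the rule that a VM without an explicit salt never shares with other VMs are my own simplifying assumptions, based only on the behaviour described above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Page:
    """Hypothetical stand-in for one guest memory page."""
    vm: str                 # name of the VM that owns the page
    content: bytes          # raw page contents
    salt: Optional[str]     # per-VM salt, None if no salt was configured


def can_share(a: Page, b: Page, share_force_salting: int) -> bool:
    """Toy rule: may two pages be collapsed into one physical page?"""
    if a.content != b.content:
        return False        # only identical pages are candidates
    if a.vm == b.vm:
        return True         # intra-VM sharing still works
    if share_force_salting == 0:
        return True         # salting requirement switched off
    # Assumption: with salting enforced, both VMs need the same explicit salt.
    return a.salt is not None and a.salt == b.salt


zero_page = b"\x00" * 4096
web1 = Page("web1", zero_page, "tenant-a")
web2 = Page("web2", zero_page, "tenant-a")
db1 = Page("db1", zero_page, "tenant-b")

print(can_share(web1, web2, 1))  # same salt
print(can_share(web1, db1, 1))   # different salt
print(can_share(web1, db1, 0))   # salting not required
```

The point of the sketch is that page contents alone no longer decide whether two pages are merged; the salt acts as a second key that must also match.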

If you want to specify a salt value per VM, here are the steps from VMware KB2091682:

  1. Log in to ESXi or vCenter with the VI-Client.
  2. Select the relevant ESXi host.
  3. In the Configuration tab, click Advanced Settings under the Software section.
  4. In the Advanced Settings window, click Mem.
  5. Look for Mem.ShareForceSalting and set the value to 1.
  6. Click OK.
  7. Power off the VM for which you want to set a salt value.
  8. Right-click the VM and click Edit Settings.
  9. Select the Options menu and click General under the Advanced section.
  10. Click Configuration Parameters…
  11. Click Add Row; a new row will be added.
  12. On the left side enter sched.mem.pshare.salt and on the right side specify a unique string.
  13. Power on the VM for the salting to take effect.
  14. Repeat steps 7 to 13 to set the salt value for each individual VM.
  15. The same salt value can be specified on several VMs to allow page sharing across them.
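If you have many VMs, it helps to decide up front which ones should share a salt. The sketch below only plans the values to type into the Configuration Parameters dialog; it does not talk to vCenter. The plan_salts function and the group names are hypothetical, and using a random UUID as the "unique string" is my own choice.

```python
import uuid
from typing import Dict, List

# Parameter name taken from the KB steps above.
SALT_KEY = "sched.mem.pshare.salt"


def plan_salts(groups: Dict[str, List[str]]) -> Dict[str, Dict[str, str]]:
    """Return, for every VM, the configuration parameter row to add.

    VMs listed in the same group receive the same salt, so they remain
    eligible for page sharing with each other but not with other groups.
    """
    plan: Dict[str, Dict[str, str]] = {}
    for _group, vms in groups.items():
        salt = uuid.uuid4().hex          # one random string per group
        for vm in vms:
            plan[vm] = {SALT_KEY: salt}
    return plan


# Hypothetical inventory: two tenants that must not share pages.
plan = plan_salts({
    "tenant-a": ["web01", "web02"],
    "tenant-b": ["db01"],
})
for vm, row in plan.items():
    print(vm, row)
```

Each printed row corresponds to steps 11 and 12 for that VM.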

What impact may it have on your environment?

If you take advantage of TPS to overprovision your environment, and your performance stats show that the assigned virtual memory is larger than your physical memory, be really careful and make a decision on TPS before you update your hosts.
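A quick way to see whether a host falls into this category is to compare the memory assigned to its VMs with its physical memory. Here is a rough sketch with made-up numbers; the function names are mine, and configured memory is only a crude upper bound on what the VMs actually touch.

```python
from typing import Iterable


def overcommit_ratio(vm_memory_gb: Iterable[float], host_memory_gb: float) -> float:
    """Assigned VM memory divided by physical host memory."""
    if host_memory_gb <= 0:
        raise ValueError("host memory must be positive")
    return sum(vm_memory_gb) / host_memory_gb


def is_overcommitted(vm_memory_gb: Iterable[float], host_memory_gb: float) -> bool:
    """True when more memory is assigned to VMs than the host physically has."""
    return overcommit_ratio(vm_memory_gb, host_memory_gb) > 1.0


# Hypothetical host with 96 GB of RAM running four VMs.
vms = [16, 16, 32, 64]
host = 96
print(f"overcommit ratio: {overcommit_ratio(vms, host):.2f}")  # 128 / 96 -> 1.33
print("overcommitted:", is_overcommitted(vms, host))
```

A ratio above 1.0 means the host has been relying on some form of memory reclamation, so losing inter-VM page sharing deserves a closer look before patching.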

Otherwise you risk seeing all the other VMware memory management features in action: ballooning, compression and swapping. These are certainly pretty cool features, but you don't want to see them in your production environment.

What should I do now?

I am not an IT security guy, but as far as I understand, this security risk mostly applies to multitenant environments where virtual machines belong to different companies. It can also be a risk where the security requirements for the vSphere farm are significantly higher, e.g. in the banking and defence industries. So you should probably check your security policies before re-enabling TPS.

However, in my opinion, re-enabling TPS doesn't seem to be a big issue for most other companies. Just make sure it is an educated choice.

Monday, 21 September 2015

A use case for Route based on source MAC hash load balancing

It is a great pleasure to work with experienced clients, and there is always something to learn from them.

We all know the main load balancing options: based on source port, source MAC, IP hash, etc.
Very often people stick with load balancing based on source port simply because it distributes traffic sufficiently well across all physical NICs assigned to the port group and doesn't require any configuration on the physical switch.

As far as I knew, source MAC address load balancing does exactly the same, but uses extra CPU cycles to compute the MAC address hash. So there was no point in using it.

However, as I have learnt today, there is a significant difference in load balancing behaviour between the two methods mentioned above when a VM has more than one virtual NIC.

When source port load balancing is used, the ESXi switch uses the port ID of the VM's first virtual NIC to choose the uplink, and the same port ID (and hence the same uplink) is applied to the traffic sent and received by all other virtual NICs of that VM.

With source MAC address load balancing, however, the uplink is selected separately for each of the VM's virtual NICs, using that NIC's own MAC address.
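Here is a toy Python illustration of the difference as it was explained to me. Everything in it is an assumption: the VNic class and function names are mine, a simple modulo stands in for whatever hash ESXi really uses, and the port ID behaviour is modelled exactly as described above, which I have not verified.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class VNic:
    """Hypothetical virtual NIC: its switch port ID and MAC address."""
    port_id: int
    mac: str


def uplinks_by_port_id(nics: List[VNic], n_uplinks: int) -> List[int]:
    """Behaviour as described in this post (unverified): the port ID of the
    first vNIC decides the uplink for every vNIC of the VM."""
    chosen = nics[0].port_id % n_uplinks
    return [chosen for _ in nics]


def uplinks_by_mac(nics: List[VNic], n_uplinks: int) -> List[int]:
    """Each vNIC gets an uplink from its own MAC address.
    The modulo is a stand-in, not the real ESXi hash."""
    return [int(nic.mac.replace(":", ""), 16) % n_uplinks for nic in nics]


# One VM with two vNICs on a port group backed by two uplinks.
vm_nics = [VNic(10, "00:50:56:00:00:01"), VNic(11, "00:50:56:00:00:02")]
print(uplinks_by_port_id(vm_nics, 2))  # [0, 0]
print(uplinks_by_mac(vm_nics, 2))      # [1, 0]
```

In this model the MAC-based policy can spread one VM's NICs over both uplinks, while the port-based policy keeps them together.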

That's not a very common use case, but it is still good to learn what a feature nobody paid much attention to is actually for.

PS: I haven't yet tested whether this is true myself, but I definitely will once I get access to my home lab.

Tuesday, 15 September 2015

Windows Failover Cluster Migration with vSphere Replication

Recently I was helping a colleague of mine migrate one of our clients to a new datacenter. Most of the VMs were planned to be migrated using vSphere Replication. However, the customer couldn't decide how to migrate its numerous Windows Failover Clusters (WFC) with physical RDM disks, for the simple reason that vSphere Replication doesn't support replication of RDM disks in physical compatibility mode.

Yes, you can still replicate a virtual RDM disk, but it will be automatically converted to a VMDK file at the destination, so you won't be able to use a cross-host WFC. At least, that's what I thought before I found an excellent article which contains a very interesting note:

"If you wish to maintain the use of a virtual RDM at the target location, it is possible to create a virtual RDM at the target location using the same size LUN, unregister (not delete from disk) the virtual machine to which the virtual RDM is attached from the vCenter Server inventory, and then use that virtual RDM as a seed for replication."

That's when I realised it should still be possible to use vSphere Replication to move a WFC to another datacenter with zero impact on clustered services and zero changes at the OS/application level.

Sunday, 6 September 2015

ESXi and Guest VM time sync - learning from a mistake

Today I was browsing some interesting blogs while getting ready for the VCAP5 exam and stumbled upon an excellent post about time syncing in guest VMs.

The most interesting part of the post for me was the following:

"Even if you have your guests configured NOT to do periodic time syncs with VMware Tools, it will still force NTP to sync to the host on snapshot operations, suspend/resume, or vMotion."

That was a pretty big surprise for me, as I have always had all my VMs synced with NTP at the OS level.