Monday, 6 June 2011

vSphere Transparent Page Sharing (TPS)

Every time I learn something new about Transparent Page Sharing (TPS) I get the same excitement I felt as a teenager reading science fiction books about future worlds and technologies. So when you finish reading this post, ask yourself: how cool is TPS? :)

So what is TPS about?  
The main goal of TPS is to save memory and thus make it possible to provide more memory to your virtual machines than your physical host has. This is called memory overcommitment.


How does it work?
If you read my previous articles about vSphere Memory Management, you will remember that historically memory in x86 systems is split into 4 KB pages. Bearing that in mind, it is quite obvious that your Windows virtual machines might have many identical pages. The TPS process runs every 60 minutes. It scans all memory pages and calculates a hash value for each of them. Those hashes are saved in a separate table and compared to each other by the kernel. Every time the ESX kernel finds two identical hashes it starts a bit-by-bit comparison of the corresponding memory pages. If the pages are exactly the same, the kernel keeps only one copy of the page in memory and removes the second one. All subsequent requests to the removed page are redirected to the first copy. When one of your VMs tries to write to this page, the kernel creates a new private copy, and from that point on the new page is accessed only by that single VM. That's why in VMware terminology it is called Copy-On-Write (COW). The whole process of memory scanning can literally take hours, depending on the amount of RAM in your host, so don't expect immediate results when TPS kicks in.
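To make the hash, compare and copy-on-write steps more concrete, here is a minimal Python sketch. It is only a toy model and not how the ESX kernel is implemented: the Host class, its method names and the choice of SHA-1 as the hash are all my assumptions for illustration.

```python
import hashlib

PAGE_SIZE = 4096  # 4 KB small pages


class Host:
    """Toy model of page sharing; every name here is hypothetical."""

    def __init__(self):
        self.frames = {}    # frame id -> page contents (bytes)
        self.refcount = {}  # frame id -> number of guest pages backed by it
        self.mapping = {}   # (vm, guest page number) -> frame id
        self._next_id = 0

    def _alloc(self, data):
        fid = self._next_id
        self._next_id += 1
        self.frames[fid] = data
        self.refcount[fid] = 1
        return fid

    def map_page(self, vm, gpn, data):
        # Back a guest page with its own private frame.
        self.mapping[(vm, gpn)] = self._alloc(data)

    def scan(self):
        # One scan pass: hash every page, then do a full comparison on a hash match.
        seen = {}  # hash -> frame id of the copy we keep
        for key in sorted(self.mapping):
            fid = self.mapping[key]
            digest = hashlib.sha1(self.frames[fid]).digest()
            kept = seen.get(digest)
            if kept is None:
                seen[digest] = fid
            elif kept != fid and self.frames[kept] == self.frames[fid]:
                # Identical content: redirect this guest page to the kept copy.
                self.mapping[key] = kept
                self.refcount[kept] += 1
                self.refcount[fid] -= 1
                if self.refcount[fid] == 0:
                    del self.frames[fid]
                    del self.refcount[fid]

    def write(self, vm, gpn, offset, payload):
        fid = self.mapping[(vm, gpn)]
        if self.refcount[fid] > 1:
            # Copy-on-write: the writer gets a private copy first.
            self.refcount[fid] -= 1
            fid = self._alloc(self.frames[fid])
            self.mapping[(vm, gpn)] = fid
        page = bytearray(self.frames[fid])
        page[offset:offset + len(payload)] = payload
        self.frames[fid] = bytes(page)


host = Host()
host.map_page("vm1", 0, bytes(PAGE_SIZE))          # zero page
host.map_page("vm2", 0, bytes(PAGE_SIZE))          # identical zero page
host.map_page("vm2", 1, b"\x01" * PAGE_SIZE)       # different content
host.scan()
frames_after_scan = len(host.frames)               # two identical pages collapsed into one
host.write("vm2", 0, 0, b"x")                      # triggers copy-on-write
frames_after_write = len(host.frames)
```

The full comparison after a hash match matters because two different pages can in principle produce the same hash, and the write path shows why a shared page is safe: the other VM never sees the change.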

NUMA and TPS
When you have a host with NUMA architecture, vSphere is smart enough to make TPS work only within the boundaries of a NUMA node. If you want TPS to work across different NUMA nodes you need to alter the VMkernel.Boot.sharePerNode setting. However, you have to consider that the additional savings from sharing memory system wide are not worth the performance impact you will get from decreased memory locality. For example, imagine you have 4 NUMA nodes with some identical pages. If TPS leaves only one copy in the first NUMA node, all VMs hosted in the 3 other NUMA nodes will need to access that memory not locally, but through the slower shared memory bus. The more shared pages you have, the more traffic there will be on the shared bus and the bigger the performance impact your VMs will experience.
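The trade-off in the 4 node example can be sketched with a few lines of Python. This is a simplified model under my own assumptions: the share function is hypothetical, and I assume the surviving copy stays on the node of the first page found.

```python
from collections import defaultdict


def share(pages, per_node=True):
    """pages: list of (numa_node, content). Returns (copies_kept, remote_mappings)."""
    groups = defaultdict(list)
    for node, content in pages:
        # Per-node sharing only merges pages that live on the same NUMA node.
        key = (node, content) if per_node else content
        groups[key].append(node)
    copies_kept = len(groups)
    remote_mappings = 0
    for nodes in groups.values():
        home = nodes[0]  # assumption: the kept copy lives on the first node seen
        remote_mappings += sum(1 for n in nodes if n != home)
    return copies_kept, remote_mappings


# One identical page on each of 4 NUMA nodes.
pages = [(node, "A") for node in range(4)]
per_node_result = share(pages, per_node=True)
host_wide_result = share(pages, per_node=False)
```

With per-node sharing all 4 copies stay and nothing is remote. With host-wide sharing you keep 1 copy, but 3 of the 4 mappings now point to remote memory.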
  
Sharing Zero Pages
One of the nice new features introduced in vSphere 4.0 was the recognition of zero pages. A zero page is a normal memory page filled with zeros. Every time a Windows VM boots up it zeroes its whole memory space to find out the available amount of memory. vSphere can detect all zero pages of its guest VM, and instead of backing each of them with its own physical zero page, it backs all of them with only one zero page. Basically, it is the regular TPS process. If you check esxtop stats right after you boot up a VM you can easily see that the values in the SHRD and ZERO columns are almost the same.
Windows 2008 R2 - screenshot taken right after VM was powered on.

However, the main point about TPS and zero pages is how zero pages are treated by old CPUs without Hardware Assisted MMU (e.g. Xeon 5450) and new CPUs with EPT/RVI (Nehalem and Opteron). There is a small piece of research conducted by Kingston on the differences between old and new virtual MMU technologies (part 1, part 2). In short, with the new CPUs zero pages are recognized immediately and no physical memory pages are granted to the zero pages of virtual machines at all. With the old CPUs these zero pages can't be recognized immediately, and vSphere has to allocate physical memory to them. Only when TPS kicks in does the kernel start reclaiming identical pages. In VDI environments, where you can have hundreds of virtual machines booting up at the same time (a so called bootup storm), Hardware Assisted MMU can help you avoid memory contention by collapsing zero pages instantly.

A couple of things to mention about TPS and zero pages:
1. Windows operating systems have a feature called the file system cache. Using the definition from MS, the file system cache is "a subset of the memory system that retains recently used information for quick access". It resides in kernel memory space and thus in a 32 bit OS it is limited to 1 GB. However, in the 64 bit version it can reach 1 TB (KB Article). It may be very helpful for Windows performance, but with regard to zero pages it leads to a very low amount of shared pages. In the screenshot above you can see esxtop memory stats right after booting Windows 2008 R2 on ESXi 4.1 running on an Intel CPU with EPT.
This is what esxtop shows 30 mins later. 
Windows 2008 R2 - screenshot taken 30 mins after power on.

You can see how the file system cache can eliminate all the benefits of sharing zero pages on a 64 bit OS, even though Task Manager in this VM shows plenty of free memory.

2. Windows zeroes pages only when the virtual machine is powered on. It doesn't zero memory pages when you reboot the VM.

Large Pages
The size of Large Pages used by the ESX kernel is 2 MB. Compared to 4 KB pages, the chances of finding two identical Large Pages are almost zero. Therefore, ESX doesn't try to look for them. If your ESX host has a CPU with Hardware Assisted MMU, it will aggressively try to back all guest memory with Large Pages, and if you check memory stats in esxtop the SHRD value will be very low and most of the time equal to the ZERO value. That effect caused a lot of discussion on the internet, because to many people it seemed like TPS doesn't work at all with Large Pages enabled. VMware had to publish this very popular knowledge base article.

Nevertheless, even when using Large Pages, the kernel scans all memory for potentially shareable 4 KB pages and records hints for them. So whenever ESX runs into memory contention (less than 6% of free physical memory) it immediately starts splitting Large Pages into small ones and removing identical copies using those hints. In esxtop you can check the potentially shareable size of a VM's memory by looking at the COWH (copy on write hints) value.
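A quick way to see what these numbers mean on a real host is a small calculation. The sketch below only uses the 6% threshold and the page sizes mentioned above; the function name is hypothetical and this is not an ESX API.

```python
SMALL_PAGE_KB = 4
LARGE_PAGE_KB = 2 * 1024

# A 2 MB Large Page covers this many 4 KB pages, and all of them would have
# to match for two Large Pages to be identical.
SMALL_PAGES_PER_LARGE = LARGE_PAGE_KB // SMALL_PAGE_KB


def under_contention(free_gb, total_gb, threshold=0.06):
    """True when free physical memory drops below the threshold (6% per this post)."""
    return free_gb / total_gb < threshold


# Example for a host with 96 GB of RAM.
threshold_gb = round(96 * 0.06, 2)
contended = under_contention(5, 96)      # 5 GB free is below 6%
not_contended = under_contention(8, 96)  # 8 GB free is above 6%
```

So on a 96 GB host Large Pages would only start being broken up once free memory falls below roughly 5.76 GB.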



Again, I would like to mention the difference between old and new CPUs in their support of Large Pages. VMware enabled support of Large Pages starting from ESX 3.5. According to the "Large Page Performance" document by VMware, if the Guest OS or application can use Large Pages then ESX will support this feature by default for that particular VM only. However, if you run ESX 3.5 or later on CPUs with Hardware Assisted MMU, the kernel will always back all Guest OS pages with Large Pages, even if you haven't enabled Large Page support in your Windows 2003, for instance. My brief observations haven't confirmed support of Large Pages on old CPUs though, because I can still see a very high value of shared memory. This week I will try to isolate a host with a CPU without Hardware Assisted MMU and run only Windows 2008 virtual machines on it. Then I will be able to check whether Large Pages are really used by looking at the shared value, which is supposed to be low when LP are in use.

When I had just learnt all this stuff I wondered: what would the performance impact be if I disabled Large Pages host wide, and how many GB of shared memory would we get? So I did it. I tested it on our powerful host with 2 Xeon 5650 (with EPT) and 96 GB of RAM. Before I disabled LP support it had about 70 GB of granted memory and about 4 GB of shared memory. As you remember, on hosts with Hardware Assisted MMU that basically means 4 GB of zero pages. CPU usage never went higher than 25%. Within 2 hours after LP were disabled the shared memory went up to 32 GB and CPU usage remained the same. It has been running for more than 2 weeks already and I haven't heard any complaints about performance yet.
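To put those numbers in proportion, here is the arithmetic as a tiny Python snippet. I am assuming granted memory stayed at about 70 GB after the change, which I did not measure precisely, and shared_ratio is just a helper name I made up.

```python
def shared_ratio(shared_gb, granted_gb):
    """Fraction of granted memory that is reported as shared."""
    return shared_gb / granted_gb


granted_gb = 70  # assumption: roughly unchanged after disabling Large Pages
before_pct = round(shared_ratio(4, granted_gb) * 100, 1)   # with Large Pages
after_pct = round(shared_ratio(32, granted_gb) * 100, 1)   # Large Pages disabled
extra_shared_gb = 32 - 4
```

That is roughly 5.7% of granted memory shared before and 45.7% after, or 28 GB more shared memory. Keep in mind that shared memory is not exactly the same as memory saved, since at least one copy of every shared page still has to stay in RAM.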

When I first saw such a huge difference in memory savings, I thought that even though TPS still works under memory contention, with Large Pages enabled you can't see the actual amount of memory you can save using page sharing, you can't properly plan memory over-provisioning on your hosts, and you don't have a clear picture of your potential consolidation ratio. Just yesterday I found a nice discussion between very strong VMware experts regarding this issue of available memory perception (link), and it confirmed my opinion about TPS and LP. Some people say that you can use the COWH value to get an idea of potentially shareable memory, but for some reason it doesn't always give you the right data. For instance, I have a Windows 2003 VM that has been running for 2 weeks on a host with LP disabled. It has a very high COWH value, yet its shared memory is very low, and I can't find any explanation for this.




This whole article is just a compilation of other articles (mostly by Yellow Bricks and Frank Denneman) and VMware docs. You can find much more information just by googling for TPS, Nehalem, RVI, EPT and other key words in this post.

PS Sorry for bad formatting of screenshots - I just wanted to make them easy to read :)



