vmwarez.com
           -Where Virtualization is a Reality!

 

Monday, November 27, 2006

Beware the long snapshot!

Note to self: If you make a snapshot of a production VM in your virtual infrastructure, don't keep it much longer than a day or two at the most.

Not sure how else to say this... but, oops. We have this mail server that we use for our ISP customers. It is running in our Vi. It seemed to perform quite well until we moved a couple thousand POP accounts to it. We could not figure out where the slowdown was. We added more memory, more priority (I bet you didn't know that priority was a resource!)... The memory helped a bit, but it was still sluggish.

It seemed like this could have been a case of one of those servers that are not meant for consolidation. But, as a last-ditch effort, we decided to add a second virtual processor. Before doing the deed we made a snapshot of the VM just in case things went badly. Everything went fine, just as one would expect. Performance did improve, but not enough to change our minds about rephysicalizing (another new word). We thought we'd give it a month to settle down and look at some long-term trends before taking the plunge back to physical from virtual.

After getting back from VMworld 2006 we thought it would be a good idea to get our Vi up-to-date. Seems we were a little early in adopting Vi3. I was told that the newest patch (3.0.1 for ESX and 2.0.1 for VC) contained over 500 bug fixes and would greatly improve the overall performance of our virtual infrastructure. When it came time to VMotion this mail server off a host so we could upgrade the host, it gave an error stating something about there being an active snapshot... yeah, kinda forgot about that.

This is where the "Note-to-self" from above comes in. Apparently it is a bad idea to leave a snapshot in place for much longer than a day or two. We had been running on ours for about two months. After a little discussion, we decided to delete the snapshot, since running on vSMP seemed OK and after all this time we were not going to revert. Easy, right? Sure... till the task times out. The vmdk snapshot file for the mail-store drive had grown to about 35GB. When we deleted the snapshot, the 35GB file was locked and a new snapshot file was created and used until the 35GB of changes were incorporated back into the original 150GB vmdk. I guess on a very disk-busy drive, that takes a while. I panicked and called VMware. They said that it could take as long as 8 hours to finish. So we waited and hoped nothing crazy happened in the interim.
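To picture what that delete was doing, here is a toy Python model of a copy-on-write snapshot being folded back into its base disk while new writes keep arriving. It is only a sketch under loud assumptions: the class and function names are made up for illustration, one "block" stands in for roughly 1GB, and the real ESX delta-disk format and consolidation logic are certainly more involved than this.

```python
class BaseDisk:
    """Toy stand-in for the original vmdk: a fixed number of blocks."""

    def __init__(self, num_blocks):
        self.blocks = [b""] * num_blocks

    def read(self, index):
        return self.blocks[index]

    def write(self, index, data):
        self.blocks[index] = data


class Snapshot:
    """Toy delta file: holds only the blocks written since it was created."""

    def __init__(self, parent):
        self.parent = parent  # a BaseDisk or another Snapshot
        self.delta = {}       # block index -> data written after the snapshot

    def read(self, index):
        # Prefer the newest copy of a block; otherwise fall through to the parent.
        if index in self.delta:
            return self.delta[index]
        return self.parent.read(index)

    def write(self, index, data):
        self.delta[index] = data


def commit(snapshot):
    """Copy every changed block into the parent; return how many were copied."""
    copied = 0
    for index, data in snapshot.delta.items():
        snapshot.parent.write(index, data)
        copied += 1
    snapshot.delta.clear()
    return copied


# Hypothetical numbers: 150 blocks for the 150GB disk, 35 changed blocks.
base = BaseDisk(150)
old = Snapshot(base)
for i in range(35):
    old.write(i, b"mail")

# Deleting the snapshot on a live VM: the old delta is frozen and a fresh
# helper delta absorbs any writes that arrive during the merge.
helper = Snapshot(old)
helper.write(3, b"new mail")
helper.write(100, b"new mail")

copied_old = commit(old)        # the long part: 35 blocks go back into the base
helper.parent = base            # the old delta is now empty and can be dropped
copied_helper = commit(helper)  # the short part: only what changed meanwhile
```

The work in `commit` scales with the number of changed blocks, which is why a snapshot that has been collecting writes for two months takes so much longer to delete than one that is a day old. Because the toy delta is keyed by block index, it also can never hold more entries than the base disk has blocks.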

Two hours later, it was done and it finished without a hitch. The mail server was then VMotioned off and the host got its update applied.
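For a sense of scale, the figures above imply a fairly modest merge rate. The few lines of Python below just do the arithmetic; the function name is made up, and treating 1GB as 1024MB is an assumption about units, not something measured on the host.

```python
def merge_rate_mb_per_s(delta_gb, hours):
    """Average rate needed to fold delta_gb of changes back in within hours."""
    return delta_gb * 1024 / (hours * 3600)


observed = merge_rate_mb_per_s(35, 2)    # what actually happened
worst_case = merge_rate_mb_per_s(35, 8)  # the 8 hour estimate from VMware

print(f"observed: {observed:.2f} MB/s, worst case: {worst_case:.2f} MB/s")
```

That works out to roughly 5 MB/s for the two hours it took, and a little over 1 MB/s if it had needed the full 8 hours, all while the mail server was still busy on the same disk.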

Now that we've learned our lesson, the mail server is performing perfectly. The second processor was the answer, but we did not see the difference in performance because of the overhead of the too-long-lived snapshot. So, in the end we learned that snapshots are short-term friends and we will not have to put our mail server back in the physical world. That leaves just a few servers to go before we've totally virtualized all our servers. Woo Hoo!

 

2 Comments:

  • In the early days of ESX, this was even more the case. When VMware originally designed the "REDO" log file in ESX, it was designed with the intent that someone would use it for a short-term situation. When the product was 1.0, I used it for a customer's environment that we were building. It had a SQL database, and after the software was installed, the database created, and all was said and done, we had to commit the changes back in once we found them valid. Unfortunately, at that time, VMware had a hard-coded limit on how big a REDO log could be and the commit failed. I worked with VMware engineers and they kept increasing the max size for me. Each time, the commit timed out. Finally, I asked for a version with no max REDO size and the commit finally worked. I was afraid I was going to lose 4 days of work with the customer. That was almost 6 years ago or so, but I still remember it vividly. ;)

    By David Marshall, at November 28, 2006 8:31 AM  

  • Just as a heads up:
    One thing to understand regarding the difference between VMFS-2 redo logs and VMFS-3 snapshots is that within VMFS-2, a redo log could grow larger than the original VMDK it was related to, whereas VMFS-3 snapshots are based on block differences, so they will not exceed the size of the VMDK they are related to.

    D@VMware

    By DanAtVMware, at December 28, 2006 9:22 AM  
