6/26/15
So this week I had a run-in with a production SQL server that was performing poorly. For the past month I’ve been pulled in different directions and completing those tasks was my main focus, but the performance of this SQL box kept degrading week over week, as the vROps analytics showed over the course of the last month. So I started investigating…
At first glance it seemed that the IOPS were much higher than normal and approaching the operating ceiling of the SAN. That was curious because the cluster was only 4 months old and I had sized this SAN for 16 months of growth, so I knew something was up. What I found was the negative side of a tool that is otherwise pretty awesome: the culprit was VMware Snapshots.
Needless to say, snapshots are great when you make changes to a server and want an ‘oh shit’ fallback so you can revert to the previous state if something you install, such as Windows updates, breaks the server. The con is that if you leave them in the Snapshot Manager for too long, you run into a performance penalty.
Basically I had 3 snapshots from the beginning of April that I had simply forgotten about, since I was busy with other projects like a 20-AP wireless deployment and upgrading the network from a flat /24 to a 7-VLAN design with L2 switching and L3 routing. When you take a snapshot, the VM’s original .vmdk is frozen and every change is written to a delta .vmdk instead, so you can revert to a previous state. That seems awesome at first glance, but it isn’t free: reads may have to walk the whole snapshot chain to find the current copy of a block, and writes to the delta files carry extra overhead, so as I’m sure you can imagine this can compound into an insane amount of IOPS for a single workload if you have a large tree of VMware Snapshots.
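For illustration, here’s roughly what a snapshot chain looks like if you browse the VM’s folder on the datastore over SSH (the VM name and datastore path are made up for this example; the -000001/-000002 ‘delta’ files are the snapshot redo logs that keep growing as data changes):

/vmfs/volumes/datastore1/SQL01 # ls
SQL01.vmx
SQL01.vmsd                  <– snapshot metadata
SQL01.vmdk                  <– base disk descriptor
SQL01-flat.vmdk             <– base disk data (frozen while snapshots exist)
SQL01-000001.vmdk           <– snapshot 1 delta descriptor
SQL01-000001-delta.vmdk     <– snapshot 1 delta data
SQL01-000002.vmdk           <– snapshot 2 delta descriptor
SQL01-000002-delta.vmdk     <– snapshot 2 delta data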
After deleting the snapshots, which each took 5+ hours to remove, the performance of the SQL box returned to normal and the end-users were thrilled with the speed of their queries again. I learned a very valuable lesson about the pros and the unforeseen cons of VMware snapshots. I hope this post helps others avoid this mishap, because I for one won’t make the same mistake twice.
On a side note, I also want to point out that if vCenter shows the ‘Remove Snapshot’ task at 99% for, say, 4-9+ hours, do not freak out like I did. I had to use the VMware CLI to see that the percentage bar in vCenter was lying to me and wasn’t accurate.
You need to do the following:
1. Go to the Security Profile settings on the ESXi host where the VM is having its snapshots removed and enable SSH.
2. Use an SSH client like PuTTY to connect to the host via its IP address and log in with the root credentials.
3. Type this command to list the VMs on the host: ‘vim-cmd vmsvc/getallvms‘, press Enter, and look for the ‘Vmid‘ of the server in question; remember it or write it down.
4. Then type this command to look at the server’s task list, using the vmid from the previous step: ‘vim-cmd vmsvc/get.tasklist 61‘ <– your number will be different, and press Enter. You’ll get something back that looks like this: ‘vim.Task:haTask-61-vim.vm.Snapshot.remove-42708222‘.
5. Then type this command: ‘vim-cmd vimsvc/task_info‘, appending the ‘haTask‘ portion of the previous output (obviously your numbers will be different). It should look something like this: ‘vim-cmd vimsvc/task_info haTask-61-vim.vm.Snapshot.remove-42708222‘, and then press Enter.
6. You will now see a printout of the running task. Look for the line that says ‘progress = 67‘ (or whatever it is); this shows you exactly what percentage the deletion is actually at (see the sample session after this list). The vCenter progress bar isn’t accurate and can make you freak out when it just sits there for a few hours, making you think the task has stalled.
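To put steps 3 through 6 together, a sample SSH session might look something like this (the vmid 61 and the task number are just the ones from my case, and the output is trimmed and approximate; yours will differ):

~ # vim-cmd vmsvc/getallvms
Vmid   Name    File                           Guest OS                Version
61     SQL01   [datastore1] SQL01/SQL01.vmx   windows8Server64Guest   vmx-10
~ # vim-cmd vmsvc/get.tasklist 61
(ManagedObjectReference) [
   'vim.Task:haTask-61-vim.vm.Snapshot.remove-42708222'
]
~ # vim-cmd vimsvc/task_info haTask-61-vim.vm.Snapshot.remove-42708222
(vim.TaskInfo) {
   ...
   progress = 67,
   state = "running",
   ...
}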
NOTE: For very large VMs (size-wise, I mean), deleting a snapshot can take hours, and I mean many hours. Also note that if you power off a VM thinking it will speed up the process, you WILL NOT be able to power it back on until after the snapshot removal finishes.