Upgrading applications and operating systems can be a real pain. On more occasions than I’d like to count, I have seen problems during these activities. The problems generally fall into a few root causes, the most common being system errors, poor documentation, configuration anomalies and, dare I say it, Administrator mistakes. A solid Change and Release Management process can mitigate many of these issues, but that is another topic altogether. One feature I really like about VMware is the ability to take a snapshot of a Guest OS prior to an upgrade or to test a configuration change. The option of quickly rolling back to a known good state can really reduce risk and help avoid those late nights rebuilding servers. Despite the huge benefit snapshots bring to Administrators, they can also be the source of problems, as I recently experienced.
Earlier this week, I was preparing to upgrade an application running on a virtual server. The application houses all of our Service Desk tickets as well as our Development Team’s application requirements, tasks and test activities. We last upgraded this application over a year ago, and at that time we created a snapshot of the VM prior to the upgrade – the same plan I would follow now. As I prepared for this new upgrade, I shut down the server and attempted to create a snapshot. The Infrastructure Client immediately threw an unknown error. I checked the Events associated with this server and saw that it had not been rebooted in nearly 3 months – which was odd, because I knew that it *had* been rebooted. I then checked the settings of the Guest VM to make sure everything was right, and it all seemed fine. So, knowing that the VM had been working and now was not, I took a pause to collect myself.
I opened the Snapshot Manager and, to my horror, saw that the snapshot from the previous upgrade was stacked under two other snapshots. The Parent Virtual Disk was dated almost 2 years ago, and the most recent snapshot was 13 months old. Given that the Parent disk was 50GB and the sum of the snapshots was almost 20GB, my heart started to sink. A quick browse of the datastore told me exactly what the problem was. The virtual disk descriptor files were gone! Without those files, the ESX server would have no idea how to mount the snapshots or the virtual disk. Losing the descriptor file of a single virtual disk is not a crisis, but losing the same files for a whole chain of snapshots does push that envelope.
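For anyone who hasn’t poked at one before: a VMDK descriptor is just a small plain-text file that sits next to the big `-flat.vmdk` (or `-delta.vmdk`) data file and tells ESX how to interpret it. Here is a minimal sketch of what a base-disk descriptor for a 50GB VMFS disk roughly looks like – the VM name `myserver` and the exact DDB values are hypothetical, not taken from my server:

```
# Disk DescriptorFile
version=1
CID=fffffffe
parentCID=ffffffff
createType="vmfs"

# Extent description
# 104857600 sectors x 512 bytes = 50GB, backed by the flat data file
RW 104857600 VMFS "myserver-flat.vmdk"

# The Disk Data Base
#DDB

ddb.virtualHWVersion = "4"
ddb.geometry.cylinders = "6527"
ddb.geometry.heads = "255"
ddb.geometry.sectors = "63"
ddb.adapterType = "lsilogic"
```

A snapshot’s descriptor follows the same shape but uses a sparse `createType` and, crucially, names its parent (via a parent file hint) with a `parentCID` that must match the parent’s `CID` – which is exactly why losing these little files breaks the whole chain: the data is all still there in the flat and delta files, but nothing records how the pieces link together.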
I’ll jump to the end of the story first, because it took me nearly 24 hours to fix the problem. I did fix it though, and got the server back up before production started on Monday.
Now that we have a happy ending, let’s talk about the technical stuff. Read on!