Upgrading applications and operating systems can be a real pain. On more occasions than I’d like to count, I have seen problems during these activities. The problems generally fall into a few root causes, the most common being system errors, documentation, configuration anomalies and, dare I say it, Administrator mistakes. A solid Change and Release Management process can mitigate many of these issues, but that is another topic altogether. One feature I really like about VMware is the ability to take a snapshot of a Guest OS prior to upgrades or to test a configuration change. The option of quickly rolling back to a known good state can really reduce risk and help alleviate those late nights rebuilding servers. Despite the huge benefit snapshots bring to Administrators, they can also be the source of problems, as I recently experienced.
Earlier this week, I was preparing to upgrade an application running on a virtual server. The application houses all of our Service Desk tickets as well as our Development Team’s Application requirements, tasks and test activities. The last time we upgraded this application was over 1 year ago. When we upgraded it last, we had created a snapshot of the VM prior to the upgrade, which was the same plan I would follow now. As I prepared for this new upgrade, I shut down the server and attempted to create a snapshot. The Infrastructure Client immediately threw an unknown error. I checked the Events associated with this server and saw that it had not been rebooted in nearly 3 months, which was good, because I knew that it *had* been rebooted. I then checked the settings of the Guest VM to make sure they were right; everything seemed good. So, knowing that the VM had been working and now was not, I took a pause to collect myself.
I opened the Snapshot Manager and, to my horror, saw that the snapshot from the previous upgrade was stacked under two other snapshots. The Parent Virtual Disk was dated almost 2 years ago and the most recent snapshot was 13 months old. Given that the Parent disk was 50GB, and the sum of the snapshots was almost 20GB, my heart started to sink. A quick browse of the datastore told me exactly what the problem was. The virtual disk descriptor files were gone! Without those files the ESX server would have no idea how to mount the snapshots or the virtual disk. Losing the descriptor files of a single virtual disk is not a crisis; losing the same files for snapshots does push that envelope.
I’ll jump to the end of the story first, because it took me nearly 24 hours to fix the problem. I did fix it though, and got the server back up before production started on Monday.
Now that we have a happy ending, let’s talk about the technical stuff. Read on!
As I state with every technical topic, make sure you understand the concepts and know what you’re doing.
VMware has a KB for this one as well: Recreating a missing virtual disk (VMDK) header/descriptor file along with a nice video tutorial.
I am a firm believer in not restating existing documents, especially when they are accurate. I could summarize the steps as I did them, but if I missed a step or the process changed, you’d be in trouble. I will, however, provide some helpful tips on some of the detailed steps. Most of the work is done via the command line interface, and the commands I used kept me from having to switch screens. For the purpose of these tips, I will be using an example server of WEBSERVER-US.
Before you get started… Caution is a virtue in IT. I highly encourage you to back up everything in the folder for the VM you are having problems with.
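Here is a rough sketch of how that backup could be scripted. The function name backup_vm_dir and both paths are placeholders of mine, not anything from VMware, and it assumes the VM is powered off and the destination has room for a full copy.

```shell
# Copy a VM's folder into a timestamped backup directory before touching anything.
# backup_vm_dir is a hypothetical helper; both arguments are placeholders.
backup_vm_dir() {
    src=$1        # the VM's folder on the datastore
    dest_root=$2  # a location with enough free space for a full copy
    stamp=$(date +%Y%m%d-%H%M%S)
    dest="$dest_root/$(basename "$src")-backup-$stamp"
    mkdir -p "$dest" || return 1
    # -R recurses into the folder, -p preserves timestamps and permissions
    cp -Rp "$src"/. "$dest"/ || return 1
    # Print where the copy landed so it can be checked or scripted against
    echo "$dest"
}
```

Expect a copy like this to take a while on a large disk, and check the result before moving on.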
We have 24 Datastores across 8 ESX hosts. The VMFS volume and directory of a VM can be hard to keep track of. So long as you know the host the VM is on, you can run the following command to get the location.
vmware-cmd -l | grep -i <server>
ex: vmware-cmd -l | grep -i webserver-us
What does it mean:
-l (that is a lower case L) is used to list the Guest VMs on a host.
| grep -i: the | is a pipe symbol, generally located on the same key as \. grep is a Unix command that allows you to search output for a string. -i (lowercase i) makes the search case insensitive.
You should get an output something like this, which will tell you the location of your Guest VM.
Thus the server is located in: /vmfs/volumes/4921d2f0-f3077c97-c6c1-002264092818/WEBSERVER-US/
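If you would rather have the folder printed directly, here is a small sketch. It assumes vmware-cmd -l prints one full .vmx path per line, and vm_dir_from_list is just a name I made up for the helper.

```shell
# Print the folder of every listed VM whose .vmx path matches a name.
# Reads .vmx paths on stdin, one per line (assumed format of `vmware-cmd -l`).
vm_dir_from_list() {
    grep -i -- "$1" | while IFS= read -r vmx; do
        # dirname strips the trailing .vmx file name, leaving the VM folder
        dirname "$vmx"
    done
}

# On the ESX host (assumes vmware-cmd is on the PATH):
#   vmware-cmd -l | vm_dir_from_list webserver-us
```

The search is still case insensitive, so the lower case name works even though the folder is upper case.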
This one is pretty easy, just change to the directory.
I don’t like editing files unless I need to. So here is a command that will show you the SCSI controller and disk settings from the VM’s configuration file without opening it in an editor and potentially risking your file.
more <server>.vmx | grep -i scsi
ex: more WEBSERVER-US.vmx | grep -i scsi
What does it mean:
more is a command that prints a file to the screen so that you can read it. grep and -i were covered previously in step 1.
You should see output something like this:
scsi0.present = "true"
scsi0.sharedBus = "none"
scsi0.virtualDev = "lsilogic"
scsi0:0.present = "true"
scsi0:0.fileName = "WEBSERVER-US.vmdk"
scsi0:0.deviceType = "scsi-hardDisk"
scsi0:0.redo = ""
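So that gives you the information you need: the controller type (scsi0.virtualDev, which is lsilogic here) and the name of the virtual disk file.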
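If you only want the disk file names, something like the following should work. It is a sketch that assumes the .vmx file uses plain double quotes around values, and disk_files_from_vmx is a made-up helper name.

```shell
# Print the virtual disk file name(s) referenced by a .vmx file.
# Assumes lines of the form: scsi0:0.fileName = "WEBSERVER-US.vmdk"
disk_files_from_vmx() {
    # Keep only the scsiX:Y.fileName lines, then strip everything but the quoted value
    grep -i '^scsi[0-9]*:[0-9]*\.fileName' "$1" |
        sed 's/^[^=]*=[[:space:]]*"\(.*\)"[[:space:]]*$/\1/'
}

# Example: disk_files_from_vmx WEBSERVER-US.vmx
```

Like the more command, this only reads the file, so there is no risk of changing it by accident.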
Steps 4 & 5
These two are pretty self-explanatory. Just make sure you create the new file on a datastore that has the room for it.
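To avoid typing the size by hand, here is a sketch that reads the size of the existing flat file and prints the create command instead of running it, so you can look it over first. The vmkfstools options shown (-c for size in bytes, -a for adapter type, -d thin) are the ones I understand the KB to use; treat that, the temp.vmdk name and the helper name print_create_cmd as assumptions and check them against the KB.

```shell
# Print (do not run) a vmkfstools command that would create a placeholder disk
# the same size as an existing flat file. print_create_cmd is a hypothetical helper.
print_create_cmd() {
    flat=$1      # e.g. WEBSERVER-US-flat.vmdk
    adapter=$2   # e.g. lsilogic, taken from scsi0.virtualDev
    [ -f "$flat" ] || { echo "no such file: $flat" >&2; return 1; }
    # wc -c reports the size in bytes; tr removes padding some versions add
    size=$(wc -c < "$flat" | tr -d ' ')
    echo "vmkfstools -c $size -a $adapter -d thin temp.vmdk"
}

# Example: print_create_cmd WEBSERVER-US-flat.vmdk lsilogic
```

Printing the command rather than executing it is deliberate: it gives you one more chance to compare the size and adapter type with what you found in the earlier steps.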
Make sure you delete the temp-flat.vmdk file and not your actual vmdk. Per VMware, there is no way to recover a deleted vmdk on ESX4.
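Since a slip here is unrecoverable, a small guard can help. This is only a sketch, with a helper name (remove_temp_flat) I invented; it refuses to delete anything that is not named exactly temp-flat.vmdk.

```shell
# Delete a file only if it is named exactly temp-flat.vmdk.
# remove_temp_flat is a hypothetical helper, not a VMware tool.
remove_temp_flat() {
    case "$(basename "$1")" in
        temp-flat.vmdk)
            rm -- "$1"
            ;;
        *)
            echo "refusing to delete $1: not temp-flat.vmdk" >&2
            return 1
            ;;
    esac
}

# Example: remove_temp_flat ./temp-flat.vmdk
```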
Steps 7 & 8
Again, pretty self-explanatory. I would suggest making a copy of the temp.vmdk file before you edit it, just in case you make a mistake.
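Here is one way the copy and the edit could be combined, as a sketch only. It assumes the descriptor refers to the placeholder as "temp-flat.vmdk" in double quotes and that the only change needed is pointing it at your real flat file; the KB is the authority on any other edits. The helper name retarget_descriptor is mine.

```shell
# Keep a backup of a descriptor file, then point it at a different flat file.
# Assumes the descriptor names the placeholder as "temp-flat.vmdk" in double quotes.
retarget_descriptor() {
    desc=$1       # descriptor file, e.g. WEBSERVER-US.vmdk after the rename
    new_flat=$2   # real flat file name, e.g. WEBSERVER-US-flat.vmdk
    # Keep an untouched copy next to the original in case the edit goes wrong
    cp -p "$desc" "$desc.bak" || return 1
    # Rewrite from the backup so a failed edit never leaves a half-written file as the only copy
    sed "s/\"temp-flat\.vmdk\"/\"$new_flat\"/" "$desc.bak" > "$desc"
}

# Example: retarget_descriptor WEBSERVER-US.vmdk WEBSERVER-US-flat.vmdk
```

Compare the result with the .bak copy before going any further; only the flat file name should differ.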
If everything went well, you should be at the end of the VMware document and now able to use the VM (assuming you had no snapshots). If you have snapshots, I would encourage you to NOT try and power on the VM using the parent disk as a test.
So now we move on to the snapshots. Since this article is getting somewhat long, I am going to write a separate article for the snapshot process.
Here is the link: Virtual Machine Snapshots