Monday, March 4, 2013

How To "Pause" (Not Suspend) A Virtual Machine In ESXi?

Last week I received a very interesting question from a fellow blogger asking whether it was possible to "pause" (not suspend) a virtual machine running on ESXi. Today, ESXi only supports the suspend operation, which saves the current memory state of a virtual machine to disk. With a "pause" operation, the memory state of the virtual machine is not saved to disk; it stays in the physical memory of the ESXi host. The main difference is that the allocated memory is not released, which lets you resume a virtual machine almost instantly at the cost of holding onto physical memory.

The use case for this particular request was also quite interesting. The user had an NFS server housing about 200 virtual machines, and the server needed to be restarted. The goal was to minimize the impact on his virtual machines as much as possible. He decided against suspending the virtual machines, as it would have taken too long, and went with a more creative solution: he filled up the remaining capacity on the datastore, which in effect caused all virtual machines to halt their I/O operations. Though not an ideal solution IMHO, this allowed him to restart the NFS server and then run a script to have the virtual machines retry their I/O operations once the NFS server was available again.

Based on the above scenario, he asked if it was possible to "pause" the virtual machines, similar to a capability Hyper-V provides today, which would have given him a quicker way to resume them. Thinking about the question for a bit, I realized a virtual machine is just a VMX process running in ESXi, and I wondered if this process could be paused like a UNIX/Linux process using the "kill" command. Well, it turns out, it can be!

Disclaimer: This is not officially supported by VMware, use at your own risk.

Using the kill command, you can pause the VMX process by sending the STOP signal and to resume the VMX process, you can send the CONT signal. Before getting started, you will need to identify the PID (Process ID) for the virtual machine's VMX process.

There are two methods of identifying the parent VMX PID. The easiest is to use the following ESXCLI command:
esxcli vm process list
The PID for the virtual machine is listed under "VMX Cartel ID". In this example, I have a virtual machine called vcenter51-1, and I am pinging the system to verify it is up and running. An alternative way of identifying the PID is to use "ps" by running the following command:
ps -c | grep -v grep | grep [vmname]
Note: Replace [vmname] with the name of your virtual machine, without the square brackets. Make sure you identify the parent PID of the virtual machine if you are using the above command, as you will see multiple entries for the different VMX sub-processes.
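If you would rather not pick the ID out by eye, here is a minimal sketch of a helper that pulls the "VMX Cartel ID" for a given VM out of the ESXCLI output. The function name vmx_cartel_id is hypothetical, and the sketch assumes the output is a series of records in which each record starts on an unindented line and contains indented "VMX Cartel ID:" and "Display Name:" lines. Verify that layout against your own host before relying on it.

```shell
# Hypothetical helper: print the VMX Cartel ID for one VM.
# $1 = VM display name; reads `esxcli vm process list` output on stdin.
# Assumes indented "VMX Cartel ID:" and "Display Name:" fields per record.
vmx_cartel_id() {
    awk -v vm="$1" '
        /^[^ \t]/        { name = ""; cartel = "" }   # unindented line: new record
        /VMX Cartel ID:/ { cartel = $NF }             # last field is the ID
        /Display Name:/  { sub(/^.*Display Name:[ \t]*/, ""); name = $0 }
        name == vm && cartel != "" { print cartel; exit }
    '
}

# Usage on an ESXi host (not run here):
#   esxcli vm process list | vmx_cartel_id vcenter51-1
```

The function prints nothing if no record matches, so check for an empty result before passing it to any other command.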

To pause the VMX process, run the following command (substitute your PID):
kill -STOP [pid]
To resume the VMX process, run the following command:
kill -CONT [pid]
Here is a screenshot of pausing and then resuming the virtual machine. You can also see where the pings stop while the virtual machine is paused and pick up again once it is resumed. Once the virtual machine was resumed, it continued exactly where it left off, with no issues as far as I can tell.
Note: I have found that if you have VM monitoring enabled, there may be issues resuming the virtual machine. This should only be done with VM monitoring disabled, as VM monitoring may not be aware that the VMX process is being paused on purpose.
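If you want to see how STOP and CONT behave before trying them on a real virtual machine, here is a minimal sketch using a throwaway sleep process on an ordinary Linux system. The proc_state helper is a hypothetical name, and reading the state from /proc is a Linux assumption for this demo only; it says nothing about how ESXi exposes process state.

```shell
#!/bin/sh
# Stand-in for a VMX process: a harmless background sleep.
sleep 30 &
pid=$!

# Hypothetical helper: print the one-letter state of a PID from Linux /proc
# (for example S = sleeping, T = stopped). Linux-specific, not ESXi.
proc_state() {
    sed 's/^.*) //' "/proc/$1/stat" | cut -c1
}

sleep 1                         # give the process a moment to settle
echo "before:  $(proc_state "$pid")"

kill -STOP "$pid"               # pause: stays in memory, no longer scheduled
sleep 1
paused_state=$(proc_state "$pid")
echo "paused:  $paused_state"

kill -CONT "$pid"               # resume from where it stopped
sleep 1
resumed_state=$(proc_state "$pid")
echo "resumed: $resumed_state"

kill "$pid"                     # clean up the stand-in process
```

On a typical Linux box the state should change to T while the process is stopped and return to a sleeping state after CONT, which is the same pause and resume behavior described above for the VMX process.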

Though it is possible to pause a virtual machine, I am not sure I see too many valid use cases for this feature. If there are use cases where this would actually be beneficial, feel free to leave a comment. For now, this is just another neat "notsupported" trick ;)

5 comments:

  1. I know another hypervisor who does it natively... ;)

  2. Awesome!
    I found this trick similar to the vMotion 'stun' operation where the source VM is 'paused' while QuickResume transmits the remaining memory pages to the destination VM which is now live.
    Could it be that the 'stun' operation is just a kill -STOP followed by a kill when vMotion is completed!?

    Replies
    1. No, that is FSR (Fast Suspend & Resume) http://www.yellow-bricks.com/2011/04/13/vmotion-and-quick-resume/#comment-23924

      This completely pauses the VMX process.

  3. BTW, you can get rid of the grep -v grep pipe by enclosing the first letter (or first few even) of your vmname in square brackets like this:

    ps -c | grep [v]mname

    it's hard to explain why this works, has to do with character sets etc, but it does work, even in the ash shell in esxi. There isn't a massive performance difference or anything of course, but it does get rid of the one pipe and 2nd grep invocation.

    Thanks for this site btw, been a big fan for a long time.

  4. Would it work to limit the CPU for the VM to 0 MHz?
