UPDATE 07/13/2012 - vSphere 5.0 Update 1a has just been released and resolves this issue. Please take a look here for more details on the patch, as this script is no longer required.

Duncan Epping recently wrote an article about Clarifying the SvMotion / VDS problem in which he describes the scenario that would impact your VMs as well as a way to remediate those impacted VMs. I would recommend you go through Duncan's article before moving any further.

The challenge now is how to easily identify all VMs in your environment that are currently impacted by this problem. The answer is, of course, automation and leveraging the vSphere API! I created the following vSphere SDK for Perl script called querySvMotionVDSIssue.pl, which searches for all VMs that are connected to a VDS and checks whether or not each VM's expected dvPortgroup file exists in the appropriate datastore. To use the script, you just need a system with the vCLI installed, or you can just use the vMA appliance.
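To make the check a little more concrete, here is a minimal Python sketch of the idea. It is not the Perl script itself. It assumes the port file lives under `.dvsData/<switch UUID>/<port key>` on the datastore that holds the VM's .vmx file, and the names `VmNic`, `expected_port_file` and `find_impacted` are hypothetical. A plain set of paths stands in for what a real datastore browse would return.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VmNic:
    """One virtual NIC backed by a distributed switch port (hypothetical type)."""
    vm_name: str
    home_datastore: str  # datastore holding the VM's .vmx file
    switch_uuid: str
    port_key: str


def expected_port_file(nic: VmNic) -> str:
    """Build the datastore path where the port file is assumed to live."""
    return f"[{nic.home_datastore}] .dvsData/{nic.switch_uuid}/{nic.port_key}"


def find_impacted(nics, existing_files):
    """Return the sorted names of VMs with at least one missing port file."""
    return sorted({nic.vm_name for nic in nics
                   if expected_port_file(nic) not in existing_files})


if __name__ == "__main__":
    # Made-up inventory: VM2's port file is missing from its home datastore.
    nics = [
        VmNic("VM1", "Datastore1", "aa-bb", "100"),
        VmNic("VM2", "Datastore2", "aa-bb", "101"),
    ]
    existing = {"[Datastore1] .dvsData/aa-bb/100"}
    print(find_impacted(nics, existing))
```

Only the home datastore is consulted in this sketch, so a VM with disks spread over several datastores is judged solely by the datastore that holds its .vmx file.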

UPDATE: The script has now been updated to support remediation for VMs connected to either a VMware VDS or a Cisco N1KV. The solution, thanks to one of our internal engineers, is to "move" the VM's dvport to another dvport while staying within the existing dvPortgroup, which also forces the creation of the .dvsdb port file. Once the dvport move has completed successfully, the script moves the VM back to the original dvport it initially resided on. We no longer have to rely on creating a temporary dvPortgroup and, best of all, we can now remediate both VDS and N1KV. The script now combines both the "query" and "remediation" functions into a single script. Please take a look at the examples below on usage.
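The move-and-move-back idea can be sketched as a small planning step. This is only an illustration in Python, not code from the Perl script: `plan_port_swap` is a hypothetical helper, port keys are plain strings, and the actual VM reconfiguration call is deliberately left out.

```python
def plan_port_swap(current_port, free_ports):
    """Return the two moves (out and back) that stay inside one dvPortgroup.

    The temporary port must differ from the current one, otherwise the
    "move" is a no-op and would not be expected to recreate the port file.
    """
    candidates = [port for port in free_ports if port != current_port]
    if not candidates:
        raise ValueError("no free dvport other than the current one")
    temp_port = candidates[0]
    return [(current_port, temp_port), (temp_port, current_port)]


if __name__ == "__main__":
    # Made-up port keys for illustration only.
    for source, target in plan_port_swap("1153", ["1153", "1154"]):
        print(f"Moving from dvPort: {source} to dvPort: {target}")
```

Filtering out the current port is a design choice of this sketch: it guarantees that the first move really lands on a different dvport before the VM is moved back.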

Disclaimer: This script is not officially supported by VMware. Please test it in a development environment before using it on production systems.

Here is a sample output of the script running in "query" mode:

Only impacted VMs will be listed in the output. The remediation logic is now part of the query script: if you wish to remediate ALL VMs that were listed as impacted, specify the --fix flag with the value "true". This will go ahead and remediate every impacted VM that the query listed.

Here is a sample output of the script running in "remediation" mode:

In the screenshot above, you may notice a few interesting details with VM3 and VM4. If you run out of dvports in a dvPortgroup, the script will automatically increase the number of ports to satisfy the swap (a maximum of 10, due to the number of ethernet interfaces a VM can have). Once the VM has been remediated, the dvPortgroup will be reconfigured back to its originally configured number of ports, as shown with VM3.
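The grow-then-restore behavior can be illustrated with a short Python sketch. It is not taken from the Perl script and makes two assumptions: one spare dvport is needed per NIC being swapped, and a dvPortgroup is modeled as a plain dictionary with hypothetical `num_ports` and `used_ports` keys. The names `extra_ports_needed` and `temporarily_grown` are also hypothetical.

```python
from contextlib import contextmanager

MAX_NICS_PER_VM = 10  # limit on ethernet interfaces per VM mentioned above


def extra_ports_needed(nic_count, free_ports):
    """How many dvports must be added so every NIC has a spare port to move to."""
    if not 0 <= nic_count <= MAX_NICS_PER_VM:
        raise ValueError("nic_count must be between 0 and 10")
    return max(0, nic_count - free_ports)


@contextmanager
def temporarily_grown(portgroup, nic_count):
    """Grow the port count for the duration of the swap, then restore it."""
    original = portgroup["num_ports"]
    free = original - portgroup["used_ports"]
    portgroup["num_ports"] = original + extra_ports_needed(nic_count, free)
    try:
        yield portgroup
    finally:
        # Always put the originally configured port count back.
        portgroup["num_ports"] = original


if __name__ == "__main__":
    pg = {"num_ports": 8, "used_ports": 8}  # a full dvPortgroup
    with temporarily_grown(pg, nic_count=2) as grown:
        print("ports during remediation:", grown["num_ports"])
    print("ports after remediation:", pg["num_ports"])
```

Using a context manager means the original port count is restored even if the swap fails partway through.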

If you have an impacted VM that is connected to an ephemeral dvPortgroup, the script will not be able to remediate it due to the nature of how ephemeral binding works. You will get a message for the specific interface, and you will need to remediate manually using the steps outlined by Duncan or using the "old" remediation script, which creates a temporary dvPortgroup (again, this only works for VMware VDS).

If you run into any issues or have questions, feel free to leave a comment.

23 thoughts on “Identifying & Fixing Virtual Machines Affected By SvMotion / VDS Issue”

  1. Nice! The first script lists several of my VMs as victims of this. I’ll need to adapt the second script for my environment, though, since we use the n1kv. I guess I can use PowerCLI to create a vSwitch on the host and do the same.

    • @Luke,

      I did not have access to N1KV, but it looks like for N1KV, you will need to create the portgroup profiles directly on the VSM and not at the vSphere layer.

      You should be able to use PowerCLI to create a vSwitch on the same host with the same VLAN configuration of the dvPortgroup and then reconfigure the VM.

    • @Luke,

      I’ve just updated my script and its implementation to use a different method of remediation. This will allow you to remediate both VDS + N1KV. Please refer to the updated post for more details. Thanks

  2. Hi, I experienced this issue when I svMotioned machines from a datastore that was removed later on. I got errors that dvport state info could not be saved because the file containing info about the dvport was gone. In my situation, editing the port settings on the dvPortgroup for the affected machine (I added a temporary description for that particular port and then removed it) recreated the file in the proper location on the new datastore. This might be a better solution for your remediation script since you don’t need to create any temporary dvPortgroup. No VM network downtime as well.

    • @pietia7,

      I’ve actually tried this as well and it does NOT resolve the problem. Perhaps you had a different problem since the dvportgroup state file could not be created initially? The only method I’m currently aware of is a complete network reconfiguration at the VM level for that file to be regenerated.

  3. Hi William, thanks for the script. I did notice a false-positive issue with it, related to VMs that have virtual disks on multiple datastores. Take a VM with its VMX file and VMDK1 on Datastore1, and VMDK2 on Datastore2. Your script is popping positive for that VM, I’m assuming because vDS info was not found on Datastore2 (but it is present on Datastore1, which I manually confirmed).

  4. I just finished a SAN storage migration. I have vCenter 5 and ESXi 4.1 Update 2. None of the datastores on the new storage contain the dvsData folder. The folders remained on the old datastores and are still being written to. If I run your script it comes back with no VMs listed as having a problem. I want to remove the old datastores but I want to make sure this will not cause any problems first. Any ideas why the script is not detecting any problems?

    • @kwinsor,

      From my understanding, this only impacts vCenter Server 5 and ESXi 5 which is also mentioned in the KB – http://kb.vmware.com/kb/2013639

      I have an if statement that checks for the version of ESXi on line 96 and the corresponding bracket on 143.

      I would recommend contacting VMware Support to ensure you won’t be impacted before removing the datastore.

  5. Hi,

    Thanks for the script. We have quite a few machines impacted by this bug and I would like to run your script to fix this. One thing; I am wondering if network connectivity is lost (briefly?) when ‘fixing’ a VM with the script. Our machines are used in production (mail/sql/application servers).

    Thanks

    • Hi Revoklat,

      Once the task is kicked off, if you lose connection, the task will already have been sent to and executed on the server. You can just re-run the script, and if you enable the fix param, it’ll only remediate impacted VMs, so if the previous VM was fixed before the disconnect it won’t need to be remediated again.

  6. Hi,

    When the VMs are fixed via the script, do they drop off the network for a short time or is the fix no impact?

    Thanks,
    Bob

    • @Bob,

      From my testing, there was no impact. I ran a continuous ping to see if any packets were dropped and there were none. The dvPortgroup network is not modified; the script is just moving the VM’s dvport ID within the dvPortgroup.

  7. In my environment the script runs OK but does not change the dvPort number:

    vi-admin@VMA001:~> ./querySvMotionVDSIssue.pl --server 10.1.1.1 --username root
    Searching for VM’s with Storage vMotion / VDS Issue …
    TESTVM01 is currently impacted

    vi-admin@VMA001:~> ./querySvMotionVDSIssue.pl --server 10.1.1.1 --username root --fix true
    Searching for VM’s with Storage vMotion / VDS Issue …
    TESTVM01 is currently impacted
    Remediating TESTVM01
    Moving from dvPort: 1153 to dvPort: 1153
    Moving from dvPort: 1153 back to dvPort: 1153
    Moving from dvPort: 1216 to dvPort: 1216
    Moving from dvPort: 1216 back to dvPort: 1216
    Remediation complete!

    vi-admin@VMA001:~> ./querySvMotionVDSIssue.pl --server 10.1.1.1 --username root
    Searching for VM’s with Storage vMotion / VDS Issue …
    TESTVM01 is currently impacted

    Please help. Thank you.

    • Hi,

      I have the same problem as Mark Strong above. I run the script and it finds some impacted VMs. I run it with the fix parameter and the remediation procedure completes.
      But after I run the script again, the same VMs appear to be impacted.

      I am running ESXi 5.0 (build 721882), updated yesterday with Update Manager.
      I have a feeling that with the previous ESXi 5 version before the upgrade (unfortunately I cannot remember the version number) the script was working without any problems.

    • I’m using static binding for all the VMs. Also, all my dvPortgroups have 128 dvports and there are plenty of free dvports on each dvPortgroup.

      I would like to note that I succeeded in remediating my impacted VMs manually from the vSphere Client.
      After that, the script does not report any impacted VMs.

    • Did either of you get the script to work for you? I just ran across this issue and found this post. I am trying to test this against a host that has 17 impacted virtual machines. The port groups are all configured for static binding. After remediation, the virtual machines still show as impacted

    • I figured it out. I was trying to test by running this directly against a host. When I run the script and specify the --fix and --vmname switches, I was able to remediate the VMs I was testing against and was able to verify they were remediated by querying directly to the host before and after. Thanks, this was a big help!
