While updating my vGhetto Automated vSphere Lab Deployment script to add support for NSX 6.3 with vSphere 6.5, I ran into an issue with the Host Preparation step. Although the resolution turned out to be quite simple, the problem was very difficult to diagnose. I suspect others could easily run into the same scenario, so I wanted to make folks aware of it. There is also another potential host preparation gotcha that I did not encounter myself but that was brought to my attention, and I thought it was worth sharing as well.
Scenario 1 - I attempted Host Preparation and all "Install agent" tasks failed with "Cannot complete the operation. See the event log for details"; below is a screenshot of the error. The event logs for both NSX and ESXi in the vSphere Web Client contained nothing useful. Here are the esxupdate entries logged on the ESXi host during the attempt:
2017-02-16T12:38:53Z esxupdate: 73899: Transaction: DEBUG: Populating VIB list from all VIBs in metadata https://vcenter65-1.primp-industries.com:443/eam/vib?id=d4917629-51d1-4da9-82d6-8da54815447d; depots:
2017-02-16T12:38:54Z esxupdate: 73899: downloader: DEBUG: Downloading https://vcenter65-1.primp-industries.com:443/eam/vib?id=d4917629-51d1-4da9-82d6-8da54815447d to /tmp/tmpdfcbr23q...
2017-02-16T12:38:54Z esxupdate: 73899: Metadata.pyc: INFO: Unrecognized file vendor-index.xml in Metadata file
2017-02-16T12:38:54Z esxupdate: 73899: imageprofile: INFO: Adding VIB VMware_locker_tools-light_6.5.0-0.0.4564106 to ImageProfile (Updated) ESXi-6.5.0-4564106-standard
2017-02-16T12:38:54Z esxupdate: 73899: imageprofile: INFO: Adding VIB VMware_bootbank_esx-vsip_6.5.0-0.0.4987428 to ImageProfile (Updated) ESXi-6.5.0-4564106-standard
2017-02-16T12:38:54Z esxupdate: 73899: imageprofile: INFO: Adding VIB VMware_bootbank_esx-vxlan_6.5.0-0.0.4987428 to ImageProfile (Updated) ESXi-6.5.0-4564106-standard
2017-02-16T12:38:54Z esxupdate: 73899: vmware.runcommand: INFO: runcommand called with: args = '['/bin/localcli', 'system', 'maintenanceMode', 'get']', outfile = 'None', returnoutput = 'True', timeout = '0.0'.
2017-02-16T12:38:54Z esxupdate: 73899: HostInfo: INFO: localcli system returned status (0) Output: Disabled Error:
The root cause ended up being something very simple that I was aware of but had completely forgotten about. The NSX VIBs failed to install because my ESXi hosts were running 6.5 GA and not the recently released 6.5a, which is required for NSX 6.3. It would have been nice to get a simple error message stating that the required version of ESXi was not met; that would have quickly jogged my memory, or at least pointed me in the right direction for further troubleshooting. Once I applied the ESXi 6.5a patch and rebooted, NSX was able to complete the host preparation successfully.
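Since nothing in the error points at the ESXi version, a pre-check before Host Preparation can save some time. Here is a minimal sketch in Python, assuming you have captured the banner string that `vmware -v` prints on each host. The function names are hypothetical, the minimum build number is my assumption for 6.5a and should be confirmed against the release notes, and the check only covers the 6.5 case described here.

```python
import re

# Assumption: 4887370 is the ESXi 6.5a build number; confirm it against the
# ESXi 6.5a release notes before relying on this check.
MIN_BUILD_FOR_NSX_63 = 4887370

# Matches the version banner printed by `vmware -v` on an ESXi host,
# e.g. "VMware ESXi 6.5.0 build-4564106".
BANNER_RE = re.compile(r"VMware ESXi (\d+\.\d+\.\d+) build-(\d+)")


def parse_banner(banner):
    """Return (version, build) parsed from a `vmware -v` banner string."""
    match = BANNER_RE.search(banner)
    if match is None:
        raise ValueError(f"unrecognized version banner: {banner!r}")
    return match.group(1), int(match.group(2))


def ready_for_nsx_63(banner, min_build=MIN_BUILD_FOR_NSX_63):
    """True if a 6.5.0 host is at or above the assumed minimum build."""
    version, build = parse_banner(banner)
    return version == "6.5.0" and build >= min_build


if __name__ == "__main__":
    # Sample banner using the GA build number that appears in the log above.
    print(ready_for_nsx_63("VMware ESXi 6.5.0 build-4564106"))
```

Run against every host in the cluster, a check like this would have flagged my 6.5 GA hosts (build 4564106, the same number that shows up in the image profile name in the log) before NSX ever tried to push the VIBs.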
Scenario 2 - I did not encounter this one myself, but the behavior is similar to scenario #1 and it is also quite difficult to troubleshoot. Luckily, there is a VMware KB (2053782) outlining this particular situation, but while reproducing it in the lab I found that the symptoms could easily misdirect customers into looking elsewhere when troubleshooting. I attempted Host Preparation and all tasks returned successfully. The NSX VIBs actually do get installed, which you can verify on the ESXi hosts, yet NSX continues to show a "Not Ready" status for all hosts, as shown in the screenshot below.
Again, there was nothing useful in the logs to help pinpoint the issue. In this scenario, the problem occurs when the vSphere Update Manager (VUM) service is not running on the vCenter Server, even if you are not using VUM to deploy the NSX VIBs. The ESX Agent Manager (EAM), which is responsible for deploying the NSX VIBs, apparently relies on VUM to approve the installation or uninstallation of VIBs. If VUM is not running, you will find yourself in this situation. The solution is either to get the VUM service running again or to disable the VUM check on vCenter Server, as outlined in the KB article above. In my lab environment, I had manually disabled the VUM service to reproduce the issue, so once I re-enabled it, NSX was able to complete the host preparation step successfully.
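To keep the two scenarios straight, the symptoms can be boiled down to a small triage helper. This is only a sketch of the decision logic described in this post, written in Python; the function name and its two boolean inputs are hypothetical and you would fill them in from what you observe in the vSphere Web Client.

```python
def triage_host_prep(install_tasks_succeeded, nsx_status_ready):
    """Map the observed Host Preparation symptoms to the first thing to check.

    install_tasks_succeeded: did the "Install agent" tasks complete without error?
    nsx_status_ready: does NSX show the hosts as ready after the tasks finish?
    """
    if not install_tasks_succeeded:
        # Scenario 1: tasks fail with "Cannot complete the operation".
        return "Check the ESXi build (NSX 6.3 on vSphere 6.5 needs ESXi 6.5a)."
    if not nsx_status_ready:
        # Scenario 2: VIBs are installed but NSX still reports "Not Ready".
        return "Check that the VUM service is running on vCenter Server (KB 2053782)."
    return "Host preparation completed; nothing to triage."


if __name__ == "__main__":
    print(triage_host_prep(install_tasks_succeeded=True, nsx_status_ready=False))
```

It only covers the two cases I have seen or reproduced, so treat anything it does not recognize as a reason to dig into the logs rather than as a clean bill of health.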
I have already shared this feedback with the NSX team and they will be looking into how we can improve our error messages for the future to help customers better diagnose and troubleshoot NSX issues.