One thing I love about the VMware Community is the constant sharing of knowledge and information. I always enjoy discovering new tricks and tidbits from the community, especially as it helps me refine my own understanding of a given technology or solution.
My good buddy Ariel Sanchez cc'ed me on Twitter yesterday referencing a blog post by Paul Wilk about an issue he was observing in his Nested ESXi environment when configuring vSphere with Tanzu.
— Ariel Sanchez Mora (@arielsanchezmor) November 15, 2020
This was in regard to the dreaded 404 message displayed in the vSphere UI:
HTTP communication could not be completed with status 404
which is actually not unique to a Nested environment. In fact, this cryptic error message was observed even in the first release of vSphere with Tanzu, which was called vSphere with Kubernetes when vSphere 7.0 was released.
Although Paul's conclusion on why his fix worked was not exactly correct, it was the fix itself that I was most interested in. Even with the initial vSphere 7.0 release, I had assumed this was just a cosmetic vCenter Server error message. It was not ideal, but like many other customers, I just ignored it as the enablement of Workload Management was still successful.
What helped me connect the dots was the fact that Paul solved the problem by disabling the ESXi firewall, which meant this was actually an ESXi issue. Given this was related to the OVF deployment, I immediately knew what the error was referring to: it is related to an earlier blog post I had shared about a new feature that allows ESXi to "pull" remote OVF/OVA files from an HTTP(S) endpoint. In this case, it was not OVFTool driving the deployment but rather vCenter Server and the Content Library service, which is also responsible for OVF/OVA deployments.
It turns out that as part of deploying the Supervisor VMs, instead of using the typical "push" method for uploading an OVA, vCenter instructs the ESXi host to "pull" the OVA files remotely, and these files are actually hosted on the vCenter Server Appliance (VCSA) itself. Because the ESXi firewall does not allow outbound connections to the port on which the VCSA hosts the OVA, the "pull" method fails and the deployment automatically falls back to the old "push" method. This is why you see the error message and then the deployment immediately continues.
It took a bit more digging to figure out which port the VCSA was actually serving the OVA file on, because I would have assumed it was 443. It turns out it is being served on 5480, which is also the port that hosts the Virtual Appliance Management Interface (VAMI), and I suspect this is because the VAMI already has lighttpd running. The way I figured out 5480 was that I had been spending some time with the vSphere with Tanzu configuration file, which is stored under /etc/vmware/wcp/wcpsvc.yaml, and there is a commented-out configuration that specifies where the WCP Agent VM, which is another name for the Supervisor VM, is hosted:
As you can see from the example, it defaults to 5480. Looking at the URL path, I was able to determine where these files actually lived on the VCSA filesystem and I found there was a symlink from /opt/vmware/share/htdocs/wcpagent pointing to /storage/lifecycle/vmware-wcp/wcpagent which is where the Supervisor VM (Photon) OVAs are stored on the VCSA.
To confirm my suspicion, we need to configure our ESXi host to allow outbound connectivity to 5480. To do that, I had to refer to one of my older blog articles from 2011 on how to create a custom ESXi firewall rule, since 5480 is not one of the default ports available for configuration.
Create /etc/vmware/firewall/wcp.xml on the ESXi host with the following configuration:
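For illustration, here is a minimal sketch of what such a ruleset file could look like. It assumes the standard ESXi custom-ruleset XML layout, the ruleset name wcp that is used in the ESXCLI commands, and an outbound TCP rule for destination port 5480. The helper name write_wcp_ruleset and the service/rule id values are my own placeholders, so treat this as an example rather than the exact file.

```shell
# Hypothetical helper: writes a custom ESXi firewall ruleset definition
# to the path given as the first argument.
# Assumptions: ruleset name "wcp", outbound TCP, destination port 5480.
write_wcp_ruleset() {
    dest="$1"
    cat > "$dest" <<'EOF'
<ConfigRoot>
  <service id='0000'>
    <id>wcp</id>
    <rule id='0000'>
      <direction>outbound</direction>
      <protocol>tcp</protocol>
      <porttype>dst</porttype>
      <port>5480</port>
    </rule>
    <enabled>true</enabled>
    <required>false</required>
  </service>
</ConfigRoot>
EOF
}

# On the ESXi host:
#   write_wcp_ruleset /etc/vmware/firewall/wcp.xml
```

Writing the file through a function keeps the example safe to try against a scratch path first before pointing it at the real firewall directory.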
Then run the following two ESXCLI commands to load our new firewall configuration and enable the new ruleset:
esxcli network firewall refresh
esxcli network firewall ruleset set -e true -r wcp
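Before re-testing, it can be worth confirming that the ruleset was actually loaded and enabled. The sketch below is one way to do that. The helper ruleset_enabled is hypothetical, and it assumes that the first two columns of the ruleset listing are the ruleset name and its enabled state.

```shell
# Hypothetical helper: reads `esxcli network firewall ruleset list` output
# on stdin and succeeds only if the named ruleset is present and enabled.
# Assumption: column 1 is the ruleset name, column 2 is true/false.
ruleset_enabled() {
    awk -v name="$1" '$1 == name && $2 == "true" { found = 1 } END { exit !found }'
}

# On the ESXi host:
#   esxcli network firewall ruleset list | ruleset_enabled wcp && echo "wcp ruleset enabled"
#   esxcli network firewall ruleset rule list -r wcp
```

The second command lists the individual rules in the ruleset, which should show the outbound rule for port 5480.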
If we now enable Workload Management on our vSphere with Tanzu cluster, you will see that the "Download remote files" operation no longer throws a 404 and progresses as expected!
Now that we know why this is happening, I should point out that the custom ESXi firewall rule is not really a good solution. Since we do not allow for any custom firewall policies in ESXi, customers must create this XML file and then package it into a custom VIB for any type of automated and scalable solution. It is also not ideal because the optimized deployment workflow should just work out of the box, and if we do require ports to be opened on ESXi, that should be done as part of the service and then disabled when no longer required.
Lastly, I did find it strange that we would host the OVA files on something other than 443, which is the usual port for serving files over HTTPS. The VCSA does have another web server, the one that serves the main landing page on 443, which I thought would have made more sense. Since the OVA files are not actually stored in the current htdocs folder but rather symlinked, a quicker and more ideal permanent solution is to symlink the OVA files into the VCSA's primary htdocs directory and then update the OVA URL in the wcpsvc.yaml configuration file. The other really nice benefit is that you do not have to make any changes to the ESXi firewall or mess with custom firewall policies.
Disclaimer: This is not officially supported by VMware, especially as changes to the VCSA filesystem can be reverted the next time it is patched or upgraded.
Step 1 - SSH to the VCSA, change into the /etc/vmware-vpx/docRoot directory and then run the following command to create the symlink:
ln -s /storage/lifecycle/vmware-wcp/wcpagent wcpagent
Step 2 - Edit /etc/vmware/wcp/wcpsvc.yaml, uncomment the kubevm and ovfurl section, replace the address with either the hostname or IP address of the VCSA and remove port 5480 from the URL.
Step 3 - Restart the wcp service for the changes to go into effect:
service-control --restart wcp
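As a quick sanity check on the steps above, the following sketch verifies that the new symlink points where we expect and shows the kind of URL change Step 2 describes. Both helper names are hypothetical, and the example URL (hostname and path) is only a placeholder, since the real value comes from your own wcpsvc.yaml.

```shell
# Hypothetical helper: succeeds if $1 is a symlink whose target is exactly $2.
symlink_points_to() {
    [ -L "$1" ] && [ "$(readlink "$1")" = "$2" ]
}

# Hypothetical helper: drops the :5480 port from a URL read on stdin,
# leaving the scheme, host and path untouched.
strip_vami_port() {
    sed 's#:5480/#/#'
}

# On the VCSA:
#   symlink_points_to /etc/vmware-vpx/docRoot/wcpagent \
#       /storage/lifecycle/vmware-wcp/wcpagent && echo "symlink OK"
#   echo "https://vcsa.example.com:5480/wcpagent/" | strip_vami_port
```

The URL helper only illustrates the transformation; the edit to wcpsvc.yaml itself is still best done by hand so that you can review the uncommented section.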
So there you have it: the reason for, and a solution to, the HTTP 404 error that shows up when enabling vSphere with Tanzu. I will definitely be sharing this analysis with the Engineering team in case they are not aware of it. Hopefully this will be resolved in a future update so that the error no longer shows up and the system automatically does the right thing.