With more and more folks trying out the new vSphere with Kubernetes capability, I have seen an uptick in questions, both internally and externally, around the initial setup of the infrastructure required for vSphere with Kubernetes as well as the configuration of a vSphere Cluster for Workload Management.
One of the most common questions is why no vSphere Clusters are listed, or why a specific vSphere Cluster shows up as Incompatible. There are a number of reasons this can occur. For example, vCenter Server may not be able to communicate with NSX-T Manager to retrieve the list of NSX pre-checks, which causes the list to either be empty or show clusters as incompatible. Not having proper time sync between vCenter Server and NSX-T can manifest in similar behavior, as can other infrastructure issues.
Having run into some of these issues myself when developing my automation script, I figured it might be useful to share some of the troubleshooting tips I have used when trying to figure out what is going on, whether that is during the initial setup or when actually deploying workloads using vSphere with Kubernetes.
As an aside, if you are just getting started and want to quickly explore what vSphere with Kubernetes has to offer, one of the easiest ways is to leverage my vSphere 7 with Kubernetes Automation Lab Deployment Script. The script supports a number of customizations and can also be adjusted to deploy a minimal vSphere with Kubernetes environment that requires the least amount of physical resources, as explained in this blog post. I know this may not be for everyone as it uses Nested ESXi, but it certainly is the fastest and most consistent way to deploy a completely functional environment in less than 40 minutes!
During the configuration of vSphere with Kubernetes, a number of compatibility checks are performed to ensure that the vSphere Cluster you wish to enable for Workload Management is going to work. Today, the vSphere UI does not provide much detail about these incompatibilities, but the underlying vSphere with Kubernetes Management API does, and this can be used to understand what the issues are.
Luckily, we do not have to write any code. We can simply use DCLI (Datacenter CLI), which is available directly on the VCSA. The vCenter REST API namespace we are interested in is called Namespace Management (namespacemanagement), and below are the various "compatibility" checks which you can run. If you are interested in automating various aspects of vSphere with Kubernetes, be sure to check out this blog post by Vikas Shitole on how to get started.
vSphere Cluster Compatibility
This command runs cluster-level checks and shows whether a particular vSphere Cluster is compatible or not, along with a list of reasons, which can also be useful for automation purposes.
dcli com vmware vcenter namespacemanagement clustercompatibility list
vSphere Distributed Switch Compatibility
This command will give you compatibility checks for the underlying network switch and it expects the ID of a specific vSphere Cluster which you can retrieve from our previous command.
dcli com vmware vcenter namespacemanagement distributedswitchcompatibility list --cluster domain-c8
NSX-T Edge Compatibility
This last command will give you compatibility checks for the NSX-T Edge Cluster that you are expecting to use with your vSphere with Kubernetes Cluster. It expects both the ID of a specific vSphere Cluster as well as the UUID of the Distributed Virtual Switch which you can find in the previous command.
dcli com vmware vcenter namespacemanagement edgeclustercompatibility list --cluster domain-c8 --distributed-switch "50 1c 91 d0 d0 11 e5 b8-c5 a8 fa e1 0d f4 d2 6c"
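If you find yourself running these three checks repeatedly, they can be wrapped in a small shell function. This is only a sketch: the function name wcp_compat_checks is my own invention, it uses nothing beyond the three dcli commands shown above, and it assumes you already know the vSphere Cluster ID and the Distributed Virtual Switch UUID from an earlier run.

```shell
#!/bin/sh
# Hypothetical helper (not part of DCLI): runs the three compatibility
# checks from this post in order.
# Usage: wcp_compat_checks <cluster-id> <distributed-switch-uuid>
wcp_compat_checks() {
    # Cluster-level compatibility for all vSphere Clusters
    dcli com vmware vcenter namespacemanagement clustercompatibility list
    # Distributed switch compatibility for the given cluster
    dcli com vmware vcenter namespacemanagement distributedswitchcompatibility list --cluster "$1"
    # NSX-T Edge Cluster compatibility for the given cluster and switch
    dcli com vmware vcenter namespacemanagement edgeclustercompatibility list --cluster "$1" --distributed-switch "$2"
}

# Example, using the IDs from the commands above:
# wcp_compat_checks domain-c8 "50 1c 91 d0 d0 11 e5 b8-c5 a8 fa e1 0d f4 d2 6c"
```

The UUID contains spaces, so it has to be quoted when passed in, just as in the original command.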
Once the enablement of Workload Management has begun on a vSphere Cluster, the next most common question is whether the errors and warnings that show up in the vSphere UI are something to be concerned about.
The simple answer is no, this is expected and I know this will be addressed in a future update of vSphere with Kubernetes. You will see various warnings and errors like "HTTP communication could not completed with status 404" but these can simply be ignored. The overall process can take anywhere from 30 minutes to an hour to complete depending on the size of your environment. You can confirm that everything was configured correctly when you refresh the vSphere UI and see the Cluster Config Status show "Running" as shown in the screenshot below.
If you prefer to get a better sense of what is happening during the enablement, you can log in to the VCSA and take a look at some of the logs, which is especially useful for debugging when enablement has failed. The following two logs will give you information about general Workload Management enablement as well as anything related to NSX-T, which makes up a large portion of the initial configuration as various NSX-T components are deployed.
- Workload Management Logs on VCSA: /var/log/vmware/wcp/wcpsvc.log
- NSX-T Logs on VCSA: /var/log/vmware/wcp/nsxd.log
In addition to the logs above, I have also personally found the NSX-T Manager API logs to be useful, as you can see the exact error returned by NSX-T Manager when a particular vSphere with Kubernetes operation has failed, such as requesting an IP Address from the Ingress IP Pool. This stems from my experience using the NSX-T API, where the response from the API is sometimes not very clear about what the issue is, while the NSX-T API logs give greater detail and will usually pinpoint the issue. Hopefully this is something the team considers as another source of useful information which can be turned into an actionable task for users trying to self-troubleshoot.
- NSX-T Manager API Logs: /var/log/proton/nsxapi.log
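These logs can get large, so a quick first pass is to pull out only the most recent lines that mention an error or warning. The snippet below is a minimal sketch: the function name recent_problems is hypothetical, and matching on the words "error" and "warn" is an assumption on my part rather than a documented log format, so adjust the pattern to whatever you actually see in your logs.

```shell
#!/bin/sh
# Hypothetical helper: print the most recent lines of a log file that
# mention an error or warning (case-insensitive match; the pattern is an
# assumption, not a documented log format).
# Usage: recent_problems <logfile> [max_lines]
recent_problems() {
    # $1 = path to the log file
    # $2 = maximum number of matching lines to show (defaults to 20)
    grep -iE 'error|warn' "$1" | tail -n "${2:-20}"
}

# Examples, using the log paths listed above:
# recent_problems /var/log/vmware/wcp/wcpsvc.log
# recent_problems /var/log/vmware/wcp/nsxd.log 50
# recent_problems /var/log/proton/nsxapi.log
```

Each path has to be read on the system that holds that log, so the function takes the path as an argument instead of hardcoding it.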
Once a vSphere Cluster has been enabled with Workload Management, you can start deploying workloads, whether that is a vSphere Pod VM running in the Supervisor Cluster or a Tanzu Kubernetes Grid (TKG) Cluster. At this point, you are interacting with and using Kubernetes, and if you are new to it, it can certainly be daunting when something is not working as expected or when you see a partial deployment of VMs but things are still not working.
The declarative nature of Kubernetes certainly makes this challenging, as the platform will attempt to deliver the desired state, but if there are not enough resources or there are underlying configuration issues, it will simply keep trying and/or wait until the issue is resolved. This definitely took some time to get used to, as the errors may not always be apparent and you need to look at the Kubernetes-specific events. Luckily, as part of the vSphere with Kubernetes integration, you can quickly see all relevant events under the specific vSphere Namespace which you had to create to start deploying workloads.
To do so, select your vSphere Namespace and then navigate to Monitor->Kubernetes to view the Kubernetes events. To simulate a resource issue, I had created a very constrained vSphere Cluster with insufficient resources and attempted to provision a TKG Cluster. If we look at the screenshot below, we can quickly see why the provisioning of the TKG Cluster has not progressed further.
One thing to note is that the Kubernetes events UI is not automatically refreshed, which means you must explicitly refresh to see updated events, and that can be difficult when troubleshooting in real time. You can always use the filter to look for warning or error messages.
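Since these are regular Kubernetes events, another option is to query them from the command line with kubectl. The sketch below assumes you have already logged in with kubectl and have access to the vSphere Namespace in question. The function names and the namespace name "demo" are purely illustrative, while the kubectl flags themselves are standard.

```shell
#!/bin/sh
# Hypothetical helpers around standard kubectl commands.

# List only Warning events in a namespace, sorted by last timestamp.
# Usage: warning_events <namespace>
warning_events() {
    kubectl get events --namespace "$1" \
        --field-selector type=Warning \
        --sort-by=.lastTimestamp
}

# Stream events for a namespace as they arrive (Ctrl-C to stop).
# Usage: watch_events <namespace>
watch_events() {
    kubectl get events --namespace "$1" --watch
}

# Examples ("demo" is a placeholder vSphere Namespace name):
# warning_events demo
# watch_events demo
```

The watch variant sidesteps the manual refresh, at the cost of showing every event type rather than just the warnings.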
Another tool which I have been using quite a bit when working with Kubernetes is Octant. I wrote about it in my Useful Interactive Terminal and Graphical UI tools for Kubernetes blog post, which I highly recommend if you want to learn about other useful tools. Octant not only provides a graphical UI to easily explore and interact with a Kubernetes Cluster, which includes a vSphere with Kubernetes Cluster, but it also offers a real-time refresh of events and logs, which I have found to be extremely useful. You simply run the Octant binary and it will automatically launch your web browser. If you are in the context of your vSphere Namespace, you simply scroll down to the very bottom to quickly see what events are happening, and this is the exact same data which the vSphere UI is providing.