Network proxies are commonly used by customers to give internal servers and services controlled, secured access to external networks such as the Internet. While working on a recent network proxy enhancement for our VMware Event Broker Appliance (VEBA) Fling, I had set up a Squid server, which is a popular network proxy solution.
I had noticed a couple of folks asking about network proxy configuration for Standalone Tanzu Kubernetes Grid (TKG) and figured this might be interesting to explore, especially for my recently released TKG Demo Appliance Fling, which enables folks to quickly go from zero to Kubernetes in just 30 minutes! I figured this would be another good opportunity to learn a bit more about TKG as well as Kubernetes (K8s), and I jokingly said to myself, how hard could this be!? 😉 Apparently it was not trivial, and it took a bit of trial and error to figure out the correct combination. Below is the procedure, which can be followed for both a standard deployment of TKG and the TKG Demo Appliance Fling.
Proxy Setting configurations for TKG CLI
The TKG CLI uses KinD (Kubernetes in Docker) under the hood to set up the initial K8s bootstrap cluster that deploys the TKG Management Cluster. If you have not already downloaded the KinD node image (registry.tkg.vmware.run/kind/node:v1.17.3_vmware.2), or if you need to go through a network proxy to do so, then the following instructions can be followed to make your Docker Client aware of a network proxy.
Here is an example of the error if the Docker Client cannot download the image:
# docker pull registry.tkg.vmware.run/kind/node:v1.17.3_vmware.2
Error response from daemon: Get https://registry.tkg.vmware.run/v2/: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
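One common way to make Docker aware of a proxy on a systemd-based host, and the one described in Docker's own documentation, is a drop-in file for the docker service. The sketch below is not necessarily the exact procedure used for this post. The helper name write_docker_proxy_dropin and its arguments are made up for illustration, and the defaults reuse the example proxy address and registry name from this post.

```shell
#!/bin/sh
# Sketch: write a systemd drop-in that passes proxy settings to the Docker daemon.
# write_docker_proxy_dropin is a hypothetical helper, not part of Docker or TKG.
#   $1 = drop-in directory (defaults to the standard docker.service.d location)
#   $2 = proxy URL         (defaults to the example proxy used in this post)
#   $3 = NO_PROXY list     (assumption: adjust to your own environment)
write_docker_proxy_dropin() {
    dropin_dir="${1:-/etc/systemd/system/docker.service.d}"
    proxy_url="${2:-http://192.168.1.3:3128}"
    no_proxy_list="${3:-localhost,registry.rainpole.io}"

    mkdir -p "$dropin_dir" || return 1
    cat > "$dropin_dir/http-proxy.conf" << EOF
[Service]
Environment="HTTP_PROXY=$proxy_url"
Environment="HTTPS_PROXY=$proxy_url"
Environment="NO_PROXY=$no_proxy_list"
EOF
}
```

After calling the helper as root, "systemctl daemon-reload" followed by "systemctl restart docker" makes the daemon pick up the new settings, and the docker pull above can then be retried.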
If you are not using a private container registry with TKG, then you also need to ensure that the KinD Cluster can connect to your network proxy when it pulls down the required containers from the Internet. Luckily, KinD can simply detect the network proxy settings of your operating system. You can either set the proxy using the traditional environment variables (http_proxy, https_proxy and no_proxy) for the duration of your TKG CLI session, or you can set it globally so you do not forget.
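For the per-session option, a minimal sketch could look like the following. It assumes the same example proxy address and bypass list that are used for the cluster nodes later in this post; the variable names proxy_url and bypass are only there to avoid repeating the values.

```shell
# Sketch: proxy variables for the current shell session only.
# Values are the example proxy and bypass list from this post; replace with your own.
proxy_url="http://192.168.1.3:3128"
bypass="localhost,192.168.1.0/24,192.168.2.0/24,registry.rainpole.io,10.2.224.4,.svc,100.64.0.0/13,100.96.0.0/11"

export http_proxy="$proxy_url"
export https_proxy="$proxy_url"
export no_proxy="$bypass"

# Some tools only read the upper-case names, so set those as well.
export HTTP_PROXY="$proxy_url" HTTPS_PROXY="$proxy_url" NO_PROXY="$bypass"
```

These settings disappear when the shell exits, which is why the global configuration described next is the more forgiving choice.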
In my setup, TKG CLI is running in a Photon OS VM and global proxy settings are configured in /etc/sysconfig/proxy. Proxy settings vary across operating systems, so check your vendor's documentation for specific instructions. The following command sets both the HTTP and HTTPS proxy variables to use my proxy server. You will also want to make sure you whitelist all networks and addresses that should bypass the proxy.
cat > /etc/sysconfig/proxy << EOF
Note: If you are using the TKG Demo Appliance, you only need to configure the Photon OS global proxy settings. In my example, I have whitelisted my local 192.168.* addresses, registry.rainpole.io which is the embedded Harbor registry, 10.2.224.4 which is the internal IP Address of the VMC vCenter Server, *.svc addresses which cover all the internal K8s services, 100.64.0.0/13 which is the CIDR range used by TKG for the Service networks, and 100.96.0.0/11 which is the CIDR range used by TKG for the Cluster networks.
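As a sketch of what a complete global proxy file could look like, the helper below writes the same four settings that are applied to the cluster nodes later in this post (PROXY_ENABLED, HTTP_PROXY, HTTPS_PROXY and NO_PROXY). The helper name write_global_proxy and its optional path argument are assumptions for illustration, so the file can be written somewhere harmless first and inspected.

```shell
#!/bin/sh
# Sketch: write Photon OS style global proxy settings.
# write_global_proxy is a hypothetical helper; $1 is the target file and
# defaults to /etc/sysconfig/proxy (writing there requires root).
write_global_proxy() {
    target="${1:-/etc/sysconfig/proxy}"
    # Example values from this post; replace with your own proxy and networks.
    proxy_url="http://192.168.1.3:3128"
    bypass="localhost,192.168.1.0/24,192.168.2.0/24,registry.rainpole.io,10.2.224.4,.svc,100.64.0.0/13,100.96.0.0/11"

    cat > "$target" << EOF
PROXY_ENABLED="yes"
HTTP_PROXY="$proxy_url"
HTTPS_PROXY="$proxy_url"
NO_PROXY="$bypass"
EOF
}
```

Running "write_global_proxy /tmp/proxy.test" and reviewing the result before running "write_global_proxy" without arguments is a cheap way to catch a typo in the bypass list.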
Proxy Setting configurations for TKG Management and Workload Clusters
For our TKG Management and Workload Clusters to be aware of our network proxy, we need to update the default TKG plan templates:
To do so, we are going to take advantage of the preKubeadmCommands section and create /etc/systemd/system/containerd.service.d/http-proxy.conf containing our proxy settings, which can then be used by the containerd service. Since the TKG K8s OVA is Photon OS based, we will also create the global proxy settings file /etc/sysconfig/proxy for good measure.
Note: If you are using the TKG Demo Appliance, make sure to whitelist all local and K8s internal addresses, or you will run into issues because the nodes will attempt to reach those addresses through your proxy. In my example, I have whitelisted my local 192.168.* addresses, registry.rainpole.io which is the embedded Harbor registry, 10.2.224.4 which is the internal IP Address of the VMC vCenter Server, *.svc addresses which cover all the internal K8s services, 100.64.0.0/13 which is the CIDR range used by TKG for the Service networks, and 100.96.0.0/11 which is the CIDR range used by TKG for the Cluster networks.
Here is an example of what to append to the two existing preKubeadmCommands sections within each of the YAML files:
- echo '[Service]' > /etc/systemd/system/containerd.service.d/http-proxy.conf
- echo 'Environment="HTTP_PROXY=http://192.168.1.3:3128"' >> /etc/systemd/system/containerd.service.d/http-proxy.conf
- echo 'Environment="HTTPS_PROXY=http://192.168.1.3:3128"' >> /etc/systemd/system/containerd.service.d/http-proxy.conf
- echo 'Environment="NO_PROXY=localhost,192.168.1.0/24,192.168.2.0/24,registry.rainpole.io,10.2.224.4,.svc,100.64.0.0/13,100.96.0.0/11"' >> /etc/systemd/system/containerd.service.d/http-proxy.conf
- echo 'PROXY_ENABLED="yes"' > /etc/sysconfig/proxy
- echo 'HTTP_PROXY="http://192.168.1.3:3128"' >> /etc/sysconfig/proxy
- echo 'HTTPS_PROXY="http://192.168.1.3:3128"' >> /etc/sysconfig/proxy
- echo 'NO_PROXY="localhost,192.168.1.0/24,192.168.2.0/24,registry.rainpole.io,10.2.224.4,.svc,100.64.0.0/13,100.96.0.0/11"' >> /etc/sysconfig/proxy
Lastly, we need to restart containerd so it's aware of the new proxy settings. Through a bit of debugging and trial and error, I learned that the restart really is needed and that it does not take effect when run from the preKubeadmCommands section. Luckily, there is also a postKubeadmCommands section, which does work; simply add the following entry there, immediately after the preKubeadmCommands section.
- systemctl restart containerd
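Once a node is up, one way to confirm that the restart picked up the drop-in is to look at the environment systemd holds for the containerd unit. The sketch below assumes the output format of "systemctl show --property=Environment", which prints a single "Environment=..." line; proxy_env_present is a made-up helper name.

```shell
#!/bin/sh
# Sketch: succeed only if both HTTP_PROXY and HTTPS_PROXY appear in the
# unit environment passed on stdin. proxy_env_present is a hypothetical helper.
proxy_env_present() {
    env_line="$(cat)"
    case "$env_line" in
        *HTTP_PROXY=*) ;;            # found, keep checking
        *) return 1 ;;
    esac
    case "$env_line" in
        *HTTPS_PROXY=*) return 0 ;;
        *) return 1 ;;
    esac
}
```

On a node, it could be used as: systemctl show --property=Environment containerd | proxy_env_present && echo "proxy settings loaded"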
Once you have saved all your changes to both YAML files, you can deploy a TKG Management and Workload Cluster like you normally would. To confirm that our proxy will be used within the TKG Cluster, we can deploy any K8s application from the Internet. In my example, I decided to use the following:
kubectl apply -f https://raw.githubusercontent.com/lamw/vmware-k8s-app-demo/master/yelb.yaml
We should see that the application is running successfully. If you have access to your network proxy, you should also see requests originating from the TKG Cluster going to both GitHub (which is where this particular K8s manifest file is hosted) and Docker Hub to download the respective containers.
# k -n yelb get pods
NAME                              READY   STATUS    RESTARTS   AGE
redis-server-55664c8d98-mzsrv     1/1     Running   0          16h
yelb-appserver-794d7c9458-r4xqn   1/1     Running   0          16h
yelb-db-6747f54d9-f5677           1/1     Running   0          16h
yelb-ui-79c68df689-zdq5w          1/1     Running   0          16h
If you are having issues, there are a couple of things to check. First, verify that you can manually pull a container image from within a worker node of the TKG Cluster. To do so, make sure the TKG Cluster was deployed with an SSH key so that you can SSH in using the capv username, as there is no default username/password. Once you are logged in, run "sudo su -" to switch to root, which is allowed to use the crictl utility. crictl provides access to the Container Runtime Interface (CRI) that K8s uses.
The command below attempts to pull a container from Docker Hub. If the proxy settings were not configured correctly or were not detected, you should see an error like the following:
# crictl pull mreferre/yelb-db:0.3
FATA pulling image failed: rpc error: code = Unknown desc = failed to pull and unpack image "docker.io/mreferre/yelb-db:0.3": failed to resolve reference "docker.io/mreferre/yelb-db:0.3": failed to do request: Head https://registry-1.docker.io/v2/mreferre/yelb-db/manifests/0.3: dial tcp 220.127.116.11:443: i/o timeout
Depending on the error, the cause might be the proxy configuration, but it can also be DNS resolution, depending on how your environment was set up to resolve external DNS entries.
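To make that distinction a little more concrete, here is a rough heuristic for reading the error text. It is only a sketch based on the wording Go's networking errors typically use ("lookup ..." and "no such host" for DNS failures, "dial tcp <ip>:443: i/o timeout" when a direct connection times out); classify_pull_error is a made-up helper and the categories are my own labels.

```shell
#!/bin/sh
# Sketch: guess the likely cause of a failed image pull from its error text.
# classify_pull_error is a hypothetical helper; treat the result as a hint only.
classify_pull_error() {
    case "$1" in
        # Name resolution never succeeded, so look at DNS first.
        *"no such host"*|*"lookup "*) echo "dns" ;;
        # The node dialed an address directly and got no answer, which
        # suggests the proxy settings were not applied to containerd.
        *"i/o timeout"*)              echo "timeout" ;;
        *"connection refused"*)       echo "refused" ;;
        *)                            echo "unknown" ;;
    esac
}
```

With the error shown above, the helper prints "timeout": the node resolved the registry name but tried to connect to it directly instead of going through the proxy.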
If you are still receiving errors when attempting to pull a container, you can also watch the logs of your proxy server. In my setup, I had set up a Squid server and tailed the /var/log/squid/access.log file. If the TKG Cluster is able to connect to the proxy server but is still unable to download a container, then you probably missed an internal network that should have been whitelisted in the no_proxy settings. For example, if you configured your TKG Cluster with a different cluster and service CIDR and you have not whitelisted those networks, it will try to reach those endpoints through the proxy, which will fail.
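When the access log is busy, it can help to narrow it down to a single cluster node. The sketch below assumes Squid's default native log format, in which the third field is the client address, the fourth the result code, the sixth the request method and the seventh the URL. requests_from_client is a made-up helper name, and the node IP you pass in is whatever address your TKG node has.

```shell
#!/bin/sh
# Sketch: list result code, method and URL for requests from one client IP.
# requests_from_client is a hypothetical helper.
#   $1 = client (node) IP address
#   $2 = log file, defaults to the path used in this post
requests_from_client() {
    client_ip="$1"
    log_file="${2:-/var/log/squid/access.log}"
    # Field positions assume Squid's native access.log layout.
    awk -v ip="$client_ip" '$3 == ip { print $4, $6, $7 }' "$log_file"
}
```

If a node shows up in the log requesting internal addresses, such as a service or cluster CIDR, that is a strong hint those networks are missing from no_proxy.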