The Platform Services Controller (PSC) is an infrastructure component, first introduced in vSphere 6.0, that provides common services such as Single Sign-On, Licensing and Certificate Management for vCenter Server and other VMware products. A PSC can be deployed on the same system as the vCenter Server, which is referred to as an Embedded deployment, or outside of the vCenter Server, which is known as an External PSC deployment. The primary use case for an External PSC is to take advantage of the new Enhanced Linked Mode (ELM) feature, which provides customers with a single pane of glass for managing all of their vCenter Servers from within the vSphere Web Client.

When customers start to plan and design their vSphere 6.0 architecture, a topic that usually comes up is whether or not they should load balance a pair (or up to four) of their PSCs. The idea behind using a load balancer is to provide higher levels of availability for the PSC infrastructure; however, it does come at an additional cost from both an Opex and Capex standpoint. More importantly, given the added complexity, does it really provide you with what you think it does?

A couple of things stood out to me when I looked at the process (VMware KB 2113315) of setting up a load balancer (VMware NSX, F5 BIG-IP, & Citrix NetScaler) for your PSC:

  • The load balancer is not actually "load balancing" the incoming requests and spreading the load across the different backend PSC nodes
  • Although all PSCs behind the load balancer are in an Active/Active configuration (multi-master replication), the load balancer itself is configured with affinity to just a single PSC node

Customers are generally surprised when I mention the above observations. When replication is set up between two or more PSC nodes, all nodes operate in an Active/Active configuration and any one of the PSC nodes can service incoming requests. However, in a load balanced configuration, the load balancer is "affinitized" to a single PSC node, and that node is the one used to provide services to the registered vCenter Servers. From the vCenter Server's point of view, only a single PSC is really active in servicing requests even though all PSC nodes are technically in an Active/Active state. If you look at the implementation guides for the three supported load balancers (links above), you will see that this artificial "Active/Passive" behavior is accomplished by specifying a higher weight/priority on the primary or preferred PSC node.
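
To make the weight/priority behavior concrete, here is a minimal Python sketch of how a priority-based pool ends up behaving as Active/Passive. It is only an illustration and not how NSX, BIG-IP or NetScaler are actually implemented; the `PscNode` class, the `select_backend` function, the hostnames and the priority values are all made up for this example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class PscNode:
    name: str          # hypothetical hostname, for illustration only
    priority: int      # higher value = preferred node
    healthy: bool = True


def select_backend(nodes: List[PscNode]) -> Optional[PscNode]:
    """Return the healthy node with the highest priority, or None if all are down."""
    healthy = [node for node in nodes if node.healthy]
    if not healthy:
        return None
    return max(healthy, key=lambda node: node.priority)


pool = [
    PscNode("psc-01.example.com", priority=100),  # primary/preferred (assumed value)
    PscNode("psc-02.example.com", priority=50),   # only used on failover
]

# While psc-01 is healthy, every request lands on it: no load is spread.
chosen_before = select_backend(pool).name

# Only when the preferred node is marked down does traffic move to psc-02.
pool[0].healthy = False
chosen_after = select_backend(pool).name
```

Both nodes are equally capable of serving requests, but the selection rule never picks the lower-priority node unless the preferred one is unavailable, which is exactly the failover-only behavior described above.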

So what exactly does load balancing the PSC really buy you? Well, it does provide you with higher levels of availability for your PSC infrastructure, but it does this by simply failing over to one of the other available PSC nodes when the primary/preferred PSC node is no longer available or responding. Prior to vSphere 6.0 Update 1, this was the only other option to provide higher availability to your PSC infrastructure outside of using vSphere HA and SMP-FT. If you ask me, this is a pretty complex and potentially costly solution just to get basic automatic node failover without any of the real benefits of setting up a load balancer in the first place.

In vSphere 6.0 Update 1, we introduced a new capability that allows us to repoint an existing vCenter Server to another PSC node as long as it is part of the same SSO Domain. What is really interesting about this feature is that you can actually get a similar behavior to what you would have gotten with load balancing your PSC minus the added complexity and cost of actually setting up the load balancer and the associated configurations on the PSC.

[Diagram: PSC nodes behind a load balancer (left) versus manual repoint between replicating PSC nodes (right)]
In the diagram above, instead of using a load balancer as shown on the left, the alternative solution shown on the right is to manually "failover" or repoint to one of the other available and Active PSC nodes when the primary/preferred node is no longer responding. With this solution, you are still deploying the same number of PSCs and setting up replication between the PSC nodes, but instead of relying on the load balancer to perform the failover for you automatically, you would be performing this operation yourself by using the new repoint functionality. The biggest benefit here is that you get the same outcome as the load balanced configuration without the added complexity of setting up and managing one or more load balancers, which in my opinion is a huge cost. At the end of the day, both solutions are fully supported by VMware and it is important to understand what capabilities are provided by a load balancer and whether it makes sense for your organization to take on this complexity based on your SLAs.

The only downside to this solution is that when a failure occurs with the primary/preferred PSC, manual intervention is required to repoint to one of the available Active PSC nodes. Would it not be cool if this was automated? ... 🙂

Well, I am glad you asked as this is exactly what I had thought about. Below is a sneak peek at a log snippet for a script that I had prototyped for the VCSA which automatically runs a scheduled job to periodically check the health of the primary/preferred PSC node. When it detects a failure, it will retry N number of times, and when it concludes that the node has failed, it will automatically initiate a failover to the available Active PSC node. In addition, if you have an SMTP server configured on your vCenter Server, it can also send out an email notification about the failover. Stay tuned for a future blog post for more details on the script which can be found here.
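
The check/retry/failover logic described above can be sketched in a few lines of Python. To be clear, this is not the prototype script itself, just a rough outline of the decision flow under some assumptions: `check_and_failover` and its parameters are names invented for this example, and the health probe (`is_healthy`) and the repoint action (`repoint`) are passed in as placeholder callables, since how you probe a PSC and invoke the repoint on the VCSA will depend on your environment.

```python
import time
from typing import Callable, Sequence


def check_and_failover(
    active: str,
    candidates: Sequence[str],
    is_healthy: Callable[[str], bool],   # placeholder health probe (assumption)
    repoint: Callable[[str], None],      # placeholder repoint action (assumption)
    retries: int = 3,
    delay_seconds: float = 0.0,
) -> str:
    """Return the PSC that vCenter Server should be using after this check."""
    # Probe the active PSC up to `retries` times before declaring it failed.
    for attempt in range(retries):
        if is_healthy(active):
            return active
        if attempt < retries - 1:
            time.sleep(delay_seconds)

    # The active PSC failed every attempt: repoint to the first healthy alternative.
    for node in candidates:
        if node != active and is_healthy(node):
            repoint(node)
            return node

    # Nothing healthy to fail over to, so leave the configuration untouched.
    return active


# Example run with made-up hostnames, where psc-01 is simulated as down.
down_nodes = {"psc-01.example.com"}
repoint_calls = []
new_active = check_and_failover(
    active="psc-01.example.com",
    candidates=["psc-01.example.com", "psc-02.example.com", "psc-03.example.com"],
    is_healthy=lambda node: node not in down_nodes,
    repoint=repoint_calls.append,
)
```

Passing the probe and the repoint in as functions keeps the retry logic testable on its own, and walking a list of candidates means the same flow works whether you have two PSC nodes or more. A scheduled job would simply call this on an interval and send a notification whenever the returned node differs from the one it started with.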

[Screenshot: log snippet from the prototype PSC health check and failover script]

18 thoughts on “What does load balancing the Platform Services Controller really give you?”

  1. I’m working through my LTM config as I read this, so very timely. I would love to avoid doing so, just to avoid complexity. One question about your health check – can it be configured to account for more than 2 PSCs? We have 3 fault domains in our configuration, so I have 3 PSCs deployed. I’d love to cycle through them in the event that the primary fails.

    Now I can’t wait to see the vCenter HA shown in the VMworld 2015 vids come to fruition. 2016 is going to be a fun year.

    • Hi Jason,

      The script is merely an example/prototype to show how this could be further automated. Once published, you can of course adjust it to fit your environment. To keep things simple, I only had 2 PSCs deployed in my lab env.

      If you attended VMworld and saw Johnny Ferguson’s (PM for PSC) session, then you already got a sneak peek of where we would like to take the PSC in the future, which is a load balancer-less world 🙂

  2. William,
    I understand that the published load-balanced designs show active-passive solutions. I had always assumed the reason for that is that the load-balancer designs presented were done for high availability and not scalability. Is this really the only supported load-balancing model for PSCs? I don’t believe that question is really addressed in the stuff you referenced. In fact, all the reference designs really only talk about high availability with no mention of scaling. To be scalable they need to support active-active obviously. I am wondering if perhaps the authors of those designs set up active-passive because that’s the only model the PSC really supports or because in writing the guides the authors were only trying to show how to deliver PSC high-availability in the load balancer configs. There are many VMware KBs and reference architectures that talk about different configurations for various things and they don’t always show them in the most optimal configuration because that would entail much more work or complexity beyond the scope of what they want to write to. It would be nice to know from VMware engineering if the PSCs can in fact be active-active, if configured appropriately on the load balancers. Do you have any insight on that? I know most people won’t care about the scalability of the PSCs but it doesn’t mean it’s not worth asking about.

    As a related topic since you brought up a hot button issue to me, do you have any insight about why they no longer allow the web client to be load-balanced? To me that’s another big glaring gap when it comes to talking about HA and scalability for vSphere environments.

    Lastly, any insight as to why VMware just doesn’t build in this type of HA and/or scalability to their apps (vCenter and vRA as examples) instead of always dumping that off on 3rd parties? There’s certainly plenty of open source load-balancers they could use and bake into the apps so this kind of stuff isn’t an issue. With the shift toward delivering products as appliances (i.e VCSA) this would make a lot of sense. Telling my customer they need to go out and buy load balancers to be HA and scale isn’t a popular, cheap, or technically simple answer. As a developer I’m sure you are well aware of the trend towards app designs that don’t require technologies like load-balancers and server clusters to be able to scale and remain highly-available. When can we expect to see VMware adopting that kind of design philosophy for their applications?

      • Thanks for the link. Good to see others that feel similar to me. When I bring this stuff up with our company’s VMware sales team they always pooh-pooh it. I always ask the sales engineers “Have you tried setting up your products with a load balancer?” and of course the answer is “No, we have the professional services guys do that…”. The fact is that VMware faces a real challenge from products like OpenStack and other vendors’ proprietary cloud stacks. Having the limiting architecture factors we are discussing here as they exist now is quickly going to start driving people away from VMware solutions as the competitors mature. VMware can’t point to VIO as an answer to this challenge because it suffers from the same issues since it uses vCenter. VMware can’t afford to be in a fall behind kind of position because their value prop has always been that they deliver the most mature and feature-rich systems and you’re going to pay them well for that value. As more cloud products begin to have high-availability and scaling capabilities built in there isn’t going to be much reason for people to consider VMware anymore given its higher costs versus other products. I applaud William for his obvious understanding and continual blogging that automation is critical to building new IT infrastructures but I’d really like to see VMware echo that in all their products and especially the ones that underlay the critical infrastructure components (vCenter and vRA).

    • Michael,

      Appreciate the comment/feedback.

      Just to be clear, ALL PSCs that are set up in replication are actually in an Active/Active configuration (sorry, this wasn’t clear earlier and I’ve since adjusted the article). You are right that today, we’re only handling failover with the use of the load balancer and hence the article. The PSCs themselves can actually scale quite high from what I’ve been told by Engineering. However, that’s only one part of the story and it’s just as important that the applications (in this case vCenter Server/vRA) are also able to take advantage of this capability.

      As mentioned in the previous reply, at VMworld this year, we demonstrated an early Tech Preview of where we plan to take PSC/vCenter Server, where a load balancer will not be required and this will just be natively built into the product and seamlessly fail over if it detects that one of the PSC nodes is unavailable. This will be transparent as you would expect.

      I’ll be sure to forward your feedback over to the PM in case he may want to follow up

  3. Is it possible to move a PSC installation from external to embedded having a Windows-based vCenter 6? Thanks.

      • Hi William and thanks for your reply. As personal feedback, I’d like this possibility to be implemented. I have a relatively small environment with a single vCenter. In previous years the best practices were saying to install every component on a different VM so I’m now trying to consolidate everything. A good improvement was upgrading from vCenter 5.5 to 6 so I have the Web Client, syslog collector and vCenter Server itself on the same machine.

        What I’d like to do next is to consolidate further, having all the components back on a single VM.

        Matteo

  4. Nice one

    Quick question

    What if I have an external PSC and 1 vCenter on Site1, with an external PSC and 1 vCenter on Site2? Can I put a load balancer for the PSCs between the sites? I know those sites are under 100ms latency easily. Also, what happens if I lose the PSC on Site1? Can I log in through the PSC on Site2 and still be able to see my vCenter on Site1 (manage it, see VMs)?

    Thanks

    • Would like to use the same Single Sign-On Domain and Single Sign-On Site, if possible 😉

  5. “Prior to vSphere 6.0 Update 1, this was the only other option to provide higher availability to your PSC infrastructure outside of using vSphere HA and SMP-FT.”

    This line isn’t correct. We’re on 6.0.0 and I wrote a script to repoint VCSA to another PSC and restart services. So if there is a problem, just putty to the VCSA and execute a quick script and boom it is now pointing to another PSC.

  6. We are currently looking for a multi-site deployment for one of our customers.

    The topology with 2 x PSC + 2 x VS + 1 LB isn’t quite satisfying as the LB would be a single point of failure in case the site (geographically separated from the second site) without the LB went down.

    So according to the topologies VMware proposes, there is no fail-safe inter-site-HA deployment available.
    We are trying to get a second LB in the topology but this turns out to be tricky to configure.

    The reason for that being that the PSCs only seem to be able to be configured with 1 Loadbalancer-FQDN. As soon as you go through the whole procedure with the HA deployment again to add the second Loadbalancer-FQDN, the initial certificates get dropped and the second ones will become active instead. This obviously breaks the whole construct.

    So we are thinking of configuring the second LoadBalancer to make it look like it is actually the first loadbalancer. Then we would be able to upload the certificates to the second one and would have a real fail-safe solution.

    In my opinion this adds huge value, as you are failsafe for either of your Sites (geographically separated).
    But this construct is neither supported by VMware nor do we know if this will actually work.

    Has anybody successfully deployed an installation with 1 x VC + LB + ext. PSC for each site and linked them together to have them all in a single pane of glass? If so, I would be really interested to hear about it.

    Cheers,
    Ralph

  7. Hi William,

    Great post! I agree with you 100%

    I think VMware should change their architecture for PSCs to be more like how Windows domain controllers are specified on Windows computers. All VMware products that leverage SSO could have Primary/Secondary SSO fields to be filled in, a bit like how DNS is specified for Windows PCs. This would negate the need to have overly complex active/passive load balancers configured in an environment. I’ve configured LB with NetScaler VPX appliances, and have had nothing but issues right from the start. Another thing to consider when designing an environment is that if you configure a load balanced pair of PSCs and the primary PSC becomes corrupted, then it will replicate this to the secondary PSC. I think having a “DR” PSC in a different site (not replicating with the production site) is a great idea. This saved me a great deal of time as this did happen to me and I was able to repoint the vCenters to this DR PSC and services were restored.

    Paul

  8. William, should this multi-psc-no-lb design work in a situation where the PSC certificate(s) are not self-signed? If PSC-A is set up as an Intermediate CA, I assume PSC-B should be too? Any thoughts…gotchas…or other that you see?

  9. For the diagram on the right, without the load balancer: when configuring it, is that a new site or an existing one? Thank you!
