Whatever the nature of your business applications, there is a good chance that they are powered by VMware vSphere virtualization hosted on your organization’s private, public, or hybrid cloud. The availability of the underlying infrastructure is important to ensure the smooth availability, monitoring, and management of your application workloads.
Today, with the public cloud giants working around the clock to beat the competition in cloud offerings, many enhancements are arriving that result in greater workload flexibility and more agile datacenters. If you have already got your hands on VMware Cloud on AWS or the bare-metal offerings on IBM SoftLayer, I can see you nodding already.
With that said, I am sure the availability of your applications and the security of your data are the most important aspects, but manageability and monitoring of applications and workloads play an important part as well. vCenter Server, being a management plane component, doesn’t actually affect your production workloads when it is unavailable, but it plays an important part in monitoring and self-service provisioning, and in providing the management plane for NSX.
VMware introduced the vCenter High Availability (VCHA) feature in VMware vCenter Server® 6.5 to protect vCenter Server against various hardware and software failures. Nothing in this world comes without a cost, though; in this case, the cost is a bit of overhead and a performance penalty on high-latency networks. That’s a trade-off between what you get and what you pay. Hence, whether you should enable VCHA depends on your SLAs and on how reliable your management plane needs to be. Irrespective of that, let’s dig into some more detail about the feature in the coming sections.
Here is a list of prerequisites for enabling vCenter HA:
1) vCenter Server Appliance (vCenter HA is supported only on the appliance)
2) A cluster of at least three hosts on which the VCSA nodes will run
3) A dedicated port group (on a different VLAN from the management network) to carry the vCenter HA cluster IPs
The vCenter High Availability architecture uses a three-node cluster to provide availability against multiple types of hardware and software failures. A vCenter HA cluster consists of one Active node that serves client requests, one Passive node that takes over the role of the Active node in the event of failure, and one quorum node called the Witness node. Any Active and Passive node-based architecture that supports automatic failover relies on a quorum or a tie-breaking entity to solve the classic split-brain problem, which refers to data or availability inconsistencies caused by network failures within distributed systems that maintain replicated data. Traditional architectures use some form of shared storage to solve the split-brain problem. However, in order to support a vCenter HA cluster spanning multiple datacenters, the vCenter HA design does not assume a shared storage–based deployment. As a result, one node in the vCenter HA cluster is permanently designated as a quorum node, or Witness node. The other two nodes in the cluster dynamically assume the roles of Active and Passive nodes. vCenter Server availability is assured as long as two nodes are running inside the cluster. However, a cluster with only two running nodes is considered to be in a degraded state. A subsequent failure in a degraded cluster means vCenter services are no longer available.
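The node-count rule above can be summarized in a few lines of Python. This is only an illustrative sketch, not VMware code: the names `ClusterState` and `cluster_state` are hypothetical, and it assumes that every running node can reach the other running nodes over the vCenter HA network.

```python
from enum import Enum


class ClusterState(Enum):
    HEALTHY = "healthy"          # all three nodes running
    DEGRADED = "degraded"        # two nodes running; vCenter is still available
    UNAVAILABLE = "unavailable"  # fewer than two nodes running


def cluster_state(nodes_running: int) -> ClusterState:
    """Map the number of running VCHA nodes to the state described in the text.

    Hypothetical helper: assumes the running nodes can all talk to each other.
    """
    if not 0 <= nodes_running <= 3:
        raise ValueError("a vCenter HA cluster has exactly three nodes")
    if nodes_running == 3:
        return ClusterState.HEALTHY
    if nodes_running == 2:
        return ClusterState.DEGRADED
    return ClusterState.UNAVAILABLE


if __name__ == "__main__":
    for n in (3, 2, 1):
        print(n, "node(s) running:", cluster_state(n).value)
```

The point of the sketch is that a degraded cluster still serves clients, but it has no remaining tolerance for another failure.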
A vCenter Server appliance is stateful and requires a strong, consistent state for it to work correctly. The appliance state (configuration state or runtime state) is mainly composed of:
• Database data (stored in the embedded PostgreSQL database)
• Flat files (for example, configuration files).
For the state stored inside the PostgreSQL database, the PostgreSQL native replication mechanism is used to keep the database data of the Active and Passive nodes in sync. For flat files, a Linux native solution, rsync, is used for replication. Because the vCenter Server appliance requires strong consistency, the appliance state must be replicated from the Active node to the Passive node synchronously.
A vCenter HA cluster requires a vCenter HA network that is separate from the management network of the vCenter Server appliance. As such, three static IP addresses (or FQDNs) are required, one for each node, for VCHA cluster traffic on the isolated VCHA network. Clients access the Active vCenter Server appliance via the management network interface, which is public.
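Before deploying, it can help to sanity-check the addressing plan against the requirement above. The following is a minimal sketch using only Python’s standard `ipaddress` module; the function name `validate_vcha_addresses`, the role names, and the sample subnets are assumptions made for illustration and are not part of any VMware tool.

```python
import ipaddress
from typing import Dict, List

EXPECTED_ROLES = {"active", "passive", "witness"}


def validate_vcha_addresses(mgmt_network: str,
                            ha_network: str,
                            ha_ips: Dict[str, str]) -> List[str]:
    """Return a list of problems found in a planned VCHA addressing scheme."""
    problems = []
    mgmt = ipaddress.ip_network(mgmt_network)
    ha = ipaddress.ip_network(ha_network)

    # The vCenter HA network must be separate from the management network.
    if mgmt.overlaps(ha):
        problems.append("HA network overlaps the management network")

    # One HA address is needed per node.
    if set(ha_ips) != EXPECTED_ROLES:
        problems.append("expected one HA address each for active, passive, witness")

    # Every HA address must belong to the HA network.
    for role, ip in ha_ips.items():
        if ipaddress.ip_address(ip) not in ha:
            problems.append(f"{role} address {ip} is not in {ha}")

    # No two nodes may share an address.
    if len(set(ha_ips.values())) != len(ha_ips):
        problems.append("duplicate HA addresses")

    return problems


if __name__ == "__main__":
    plan = {"active": "10.10.10.1", "passive": "10.10.10.2", "witness": "10.10.10.3"}
    print(validate_vcha_addresses("192.168.1.0/24", "10.10.10.0/24", plan))
```

An empty list means the plan satisfies the checks in this sketch; it says nothing about routing, latency, or VLAN configuration, which still need to be verified separately.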
Different Nodes and their roles:
• Active Node:
– Node that runs the active instance of vCenter Server.
– Enables and uses the public IP address of the cluster.
• Passive Node:
– Node that runs as the passive instance of vCenter Server.
– Constantly receives state updates from the Active node in synchronous mode.
– Equivalent to the Active node in terms of resources.
– Takes over the role of Active Node in the event of failover.
• Witness Node:
– Serves as a quorum node.
– Used to break a tie in the event of a network partition causing a situation where the Active and Passive nodes cannot communicate with each other.
– A light-weight VM utilizing minimal hardware resources.
– Does not take over role of Active/Passive nodes.
The vCenter Server appliance behaves as follows under these failure conditions:
1. Active node fails:
– As long as the Passive node and the Witness node can communicate with each other, the Passive node will promote itself to Active and start serving client requests.
2. Passive node fails:
– As long as the Active node and the Witness node can communicate with each other, the Active node will continue to operate as Active and continue to serve client requests.
3. Witness node fails:
– As long as the Active node and the Passive node can communicate with each other, the Active node will continue to operate as Active and continue to serve client requests. The Passive node will continue to watch the Active node for failover.
4. More than one node fails or is isolated:
– This means all three nodes—Active, Passive, and Witness—cannot communicate with each other. This is more than a single point of failure and when this happens, the cluster is assumed non-functional and availability is impacted because VCHA is not designed for multiple failures.
5. Isolated node behavior:
– When a single node gets isolated from the cluster, it is automatically taken out of the cluster and all services are stopped. For example, if an Active node is isolated, all services are stopped to ensure that the Passive node can take over as long as it is connected to the Witness node.
– Isolated node detection takes into consideration intermittent network glitches and resolves to an isolated state only after all retry attempts have been exhausted.
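The failure scenarios above boil down to one rule: a node serves clients only if it can reach at least one other node. Here is a simplified Python model of that rule. It is an assumption-laden sketch, not VMware’s actual failover logic: the function name `node_serving_clients` is hypothetical, a failed node is modeled as a node with no working links, and retry handling for intermittent network glitches is ignored.

```python
from typing import Iterable, Optional, Tuple


def node_serving_clients(links: Iterable[Tuple[str, str]]) -> Optional[str]:
    """Return which node serves client requests, or None if no node does.

    `links` lists the pairs of nodes ("active", "passive", "witness") that can
    currently communicate. The role names refer to the roles held before the
    failure occurred.
    """
    connected = {frozenset(pair) for pair in links}

    def can_talk(a: str, b: str) -> bool:
        return frozenset((a, b)) in connected

    # The Active node keeps serving as long as it sees the Passive or the Witness.
    if can_talk("active", "passive") or can_talk("active", "witness"):
        return "active"
    # Active node failed or isolated: the Passive node promotes itself only
    # if it can still reach the Witness.
    if can_talk("passive", "witness"):
        return "passive"
    # More than one failure: no node has quorum, so services are stopped.
    return None


if __name__ == "__main__":
    print(node_serving_clients([("passive", "witness")]))  # Active node failed
    print(node_serving_clients([("active", "witness")]))   # Passive node failed
    print(node_serving_clients([("active", "passive")]))   # Witness node failed
    print(node_serving_clients([]))                        # all nodes isolated
```

Note that the Witness node never appears as a return value, which matches its role as a pure tie-breaker.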
vCenter Server HA Cluster Health
It is worth noting that an anti-affinity rule that keeps all three nodes on separate hosts is created automatically, so that a single hardware failure doesn’t bring down multiple cluster nodes simultaneously.
You can easily check cluster health and node status by selecting the vCenter Server in the left navigation pane and navigating to the ‘Monitor’ tab, as shown in the figure below: