NSX-T Deactivate Cluster

Submitted by Robin van Altena on Tue, 02/08/2022 - 19:22
 
 
The NSX-T Management cluster consists of three nodes. This means that if one node fails, the cluster still has quorum and everything continues to operate normally. But what if two nodes fail? What is the impact on your environment, and what is the fastest way to recover from this failure?

Recovery from failures can be one of the most challenging tasks, especially if workloads are down because of the failure. Understanding the impact and the recovery steps in advance helps you recover quickly during an outage. Similarly, during design sessions there is always a question like: what happens if this fails? The same goes for the NSX-T management cluster. Yes, there are three manager nodes, but what if two managers fail, and how do you recover from that? There is already a great deal of information available online, but the deactivate cluster command isn't well known, and I was building a setup to demonstrate the options anyway. So, I might as well share it with you.

Lab setup

In our lab environment I have created an NSX-T manager cluster with three nodes and a vSphere cluster consisting of three ESXi hosts as a compute cluster. On the ESXi hosts I have some VMs running to show the impact of losing one or two NSX-T managers.

[Image: NSX-T manager cluster status]

The status of the cluster can be viewed in the GUI, but also from the CLI on one of the NSX-T manager nodes using the get cluster status command. As you can see, the three-node cluster is up and running. The environment functions as normal: it is possible to vMotion VMs or deploy distributed firewall rules to the VMs.
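As an illustration, this is roughly how that looks on the manager CLI. The hostname and addresses below are made up and the output is abbreviated; the exact groups and column layout differ per NSX-T version.

    nsxmgr-01> get cluster status
    Cluster Id: <cluster-id>
    Overall Status: STABLE

    Group Type: MANAGER
    Group Status: STABLE
    Members:
        nsxmgr-01   192.168.10.11   UP
        nsxmgr-02   192.168.10.12   UP
        nsxmgr-03   192.168.10.13   UP
    (other groups such as CONTROLLER, POLICY and HTTPS are listed in the same way)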

One NSX-T manager failure

If, for some reason, one of the NSX-T manager nodes fails, the NSX-T manager cluster continues to function in a degraded state. There might be a small interruption while the cluster evaluates the new situation or while the Virtual IP fails over to another NSX-T manager.

[Image: NSX-T manager cluster degraded status with one manager down]

I am not sure why my lab setup doesn't show the appliances after a page refresh; before the refresh the two available appliances were still visible. From the command line, however, the cluster status is still visible. It shows that two out of the three nodes are running and that the overall status is degraded.
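To illustrate, with the same made-up hostnames as before, the abbreviated CLI output now reports something along these lines:

    nsxmgr-01> get cluster status
    Overall Status: DEGRADED

    Group Type: MANAGER
    Group Status: DEGRADED
    Members:
        nsxmgr-01   192.168.10.11   UP
        nsxmgr-02   192.168.10.12   UP
        nsxmgr-03   192.168.10.13   DOWN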

In this state the cluster is still available and functions normally. Changes to firewall rules are deployed to the VMs and changes to networking are also configured in the environment.

If it is not possible to get the failed node back online, you can detach the failed node and deploy a new NSX-T manager in its place. But my focus here is on recovering from a situation with two managers down, so I will not go into that.

Two NSX-T manager failures

If the second NSX-T manager node also fails due to the same or an additional outage, the NSX-T manager cluster becomes unavailable.

[Image: NSX-T manager cluster degraded status with two managers down]

What I find strange is that the view in the GUI is the same with one or two managers down; at least it is in my lab, which was upgraded to NSX-T version 3.2.0.1. The command line clearly shows that with two managers down the overall status is unavailable.
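For completeness, the same illustrative check on the last surviving node (again abbreviated, with made-up values) then reports:

    nsxmgr-01> get cluster status
    Overall Status: UNAVAILABLE

    Group Type: MANAGER
    Group Status: UNAVAILABLE
    Members:
        nsxmgr-01   192.168.10.11   UP
        nsxmgr-02   192.168.10.12   DOWN
        nsxmgr-03   192.168.10.13   DOWN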

In this situation VMs can't be vMotioned, and new VMs can't be started or added to NSX-T Segments. Depending on the state of the VM, the following error can be seen when attempting a vMotion.

[Image: Errors during vMotion attempts]

All running VMs continue to run and still have their firewall rules enforced, but no changes can be made that require the NSX-T management cluster. So the priority is to get the NSX-T management cluster fully operational again, as quickly as possible.

Recovery

If there is no option to quickly get the other NSX-T manager nodes up and running again, the first step is to log on to the last remaining NSX-T manager node and run the deactivate cluster command. With this command the cluster is downgraded to a single NSX-T manager node. The command can take about 10 minutes, depending on the environment; in my lab it took about 7 minutes.
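As a sketch of what that looks like (hostname made up, and the exact warning and progress text differ per version), you run the command on the surviving node and confirm it:

    nsxmgr-01> deactivate cluster
    (a warning is shown that all other nodes will be removed from the cluster; confirm to proceed)
    ...
    (after several minutes the command returns and the node operates as a single-node cluster)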

[Image: Deactivate cluster command]

Once the deactivate cluster command has completed, the (single node) NSX-T manager cluster is fully operational again. Additional NSX-T manager cluster nodes can then be added through the usual ways.
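In case it is useful, one of those usual ways is to deploy a fresh manager appliance and join it to the cluster from its CLI. The commands below are only a sketch with made-up addresses and placeholders; check the VMware documentation for your NSX-T version for the exact syntax.

    On the remaining manager, collect the cluster ID and API thumbprint:
    nsxmgr-01> get cluster config
    nsxmgr-01> get certificate api thumbprint

    On the newly deployed manager node, join it to the cluster:
    nsxmgr-04> join 192.168.10.11 cluster-id <cluster-id> username admin password <password> thumbprint <api-thumbprint>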

[Image: Single node NSX-T manager cluster]

I hope you have enjoyed reading this and don’t have to use this procedure in a production environment. If you have any comments or questions, please leave a note or send me a message. 
