Recovering from failures can be one of the most challenging tasks, especially when workloads are down because of them. Understanding the impact and the recovery steps ahead of an outage helps you recover quickly. Similarly, during design sessions there is always a question like "what happens if this fails?". The same goes for the NSX-T management cluster. Yes, there are three manager nodes, but what if two managers fail, and how do you recover from that? There is already a great deal of information available online, but the deactivate cluster command isn't well known, and since I was building a setup to demonstrate the options anyway, I might as well share it with you.
In our lab environment I have created an NSX-T manager cluster with three nodes and a vSphere cluster consisting of three ESXi hosts as a compute cluster. On the ESXi hosts I have some VMs running to show the impact of losing one or two NSX-T managers.
The status of the cluster can be viewed in the GUI, but also from the CLI on one of the NSX-T manager nodes with the get cluster status command. As you can see, the three-node cluster is up and running. The environment functions as normal: it is possible to vMotion VMs or deploy distributed firewall rules to them.
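For reference, checking the status from the manager CLI looks roughly like this. The exact output format varies between NSX-T versions, and the node names, IPs, and UUID below are placeholders:

```
nsx-mgr-01> get cluster status
Cluster Id: 7a1b-...-placeholder-uuid
Overall Status: STABLE

Group Type: MANAGER
Group Status: STABLE
Members:
    UUID          FQDN           IP             STATUS
    <node-1-id>   nsx-mgr-01     10.0.0.11      UP
    <node-2-id>   nsx-mgr-02     10.0.0.12      UP
    <node-3-id>   nsx-mgr-03     10.0.0.13      UP
```

The same command is what we will use later to verify the state of the cluster after failures.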
One NSX-T manager failure
If, for some reason, one of the NSX-T manager nodes fails, the NSX-T manager cluster continues to function in a degraded state. There may be a brief interruption while the cluster evaluates the new situation or while the Virtual IP fails over to another NSX-T manager.
I am not sure why my lab setup no longer shows the appliances after a page refresh; before the refresh the two available appliances were still visible. From the command line, however, the cluster status is still visible: it shows that two out of the three nodes are running and that the overall status is degraded.
In this state the cluster is still available and functions normally. Changes to firewall rules are deployed to the VMs and changes to networking are also configured in the environment.
If it is not possible to get the failed node back online, you can detach it and redeploy a new NSX-T manager. My focus here, however, is on recovering from a situation with two managers down, so I will not go into that.
Two NSX-T manager failures
If the second NSX-T manager node also fails due to the same or an additional outage, the NSX-T manager cluster becomes unavailable.
What I find strange is that the GUI view is the same with one or two managers down; at least it is in my lab, with NSX-T upgraded to version 220.127.116.11. The command line clearly shows that with two managers down the overall status is unavailable.
In this situation VMs can't be vMotioned, and new VMs can't be started or added to NSX-T segments. Depending on the state of the VM, an error can be seen when attempting a vMotion.
All running VMs continue to run and still have their firewall rules enforced, but no changes can be made that require the NSX-T management cluster. So, the priority is to get the NSX-T management cluster fully operational again, as quickly as possible.
If there is no option to quickly get the other NSX-T manager nodes up and running again, the first step is to log on to the last NSX-T manager node and run the deactivate cluster command. This command downgrades the cluster to a single NSX-T manager node. It can take about 10 minutes depending on the environment; in my lab it took about 7 minutes.
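The recovery step on the surviving node looks roughly like this. This is a sketch: the exact confirmation prompt and wording differ per NSX-T version, and the node name is a placeholder:

```
nsx-mgr-01> deactivate cluster
This command will remove all other nodes from the cluster.
Are you sure? (yes/no) yes
```

Afterwards, get cluster status on the same node should report a single-member cluster with an overall stable status.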
Once the deactivate cluster command has completed, the (single node) NSX-T manager cluster is fully operational again. Additional NSX-T manager cluster nodes can then be added through the usual ways.
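Besides deploying additional nodes from the GUI, newly deployed managers can also be joined from the CLI. A hedged sketch, where the cluster id, thumbprint, addresses, and credentials are all placeholders you collect from your own environment:

```
# On the surviving manager, collect the cluster id and API certificate thumbprint:
nsx-mgr-01> get cluster config
nsx-mgr-01> get certificate api thumbprint

# On the freshly deployed manager node, join it to the existing cluster:
nsx-mgr-02> join 10.0.0.11 cluster-id <cluster-id> username admin password <password> thumbprint <api-thumbprint>
```

Repeat this for the third node to get back to a fully redundant three-node cluster.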
I hope you have enjoyed reading this and don’t have to use this procedure in a production environment. If you have any comments or questions, please leave a note or send me a message.