NSX-T Load balancer overlay required

Submitted by Bart Oevering on Tue, 11/30/2021 - 16:10
 
 
For a Proof of Concept, a customer required a standalone load balancer. Since there was no need for any other Network Function Virtualization (NFV) services, I built a one-arm load balancer for them. But the load balancer did not function as expected. After some basic troubleshooting I found that the load balancer itself wasn’t the issue: the Tier-1 gateway wasn’t functioning properly, and it looked like the service port would not come up. Why wasn’t this setup working?
NSX-T One-arm load balancer

To troubleshoot this issue, I started digging through the log files on the Edge node. Since the logs contain many entries, I used a filter to narrow them down. The command below shows only the log entries of the NSX load balancer subcomponent, but even there I couldn't find any real errors.


root@nsxedgelb01:~# grep 'subcomp="lb"' /var/log/syslog

2021-11-15T16:15:58.323363+00:00 nsxedgelb01.lab02.wheatley.local NSX 10097 LOAD-BALANCER [nsx@6876 comp="nsx-edge" subcomp="lb" s2comp="lb" level="INFO"] [ed1954a5-d515-4850-848d-9b509e933e3f] cfg: engine config [version: 1] processing - stage 1/6 successful

2021-11-15T16:15:58.332662+00:00 nsxedgelb01.lab02.wheatley.local NSX 10097 LOAD-BALANCER [nsx@6876 comp="nsx-edge" subcomp="lb" s2comp="lb" level="INFO"] [ed1954a5-d515-4850-848d-9b509e933e3f] cfg: engine config [version: 1] processing - stage 2/6 successful

2021-11-15T16:15:58.343753+00:00 nsxedgelb01.lab02.wheatley.local NSX 10097 LOAD-BALANCER [nsx@6876 comp="nsx-edge" subcomp="lb" s2comp="lb" level="INFO"] [ed1954a5-d515-4850-848d-9b509e933e3f] cfg: engine config [version: 1] processing - stage 3/6 successful

2021-11-15T16:15:58.424428+00:00 nsxedgelb01.lab02.wheatley.local NSX 10097 LOAD-BALANCER [nsx@6876 comp="nsx-edge" subcomp="lb" s2comp="lb" level="INFO"] [ed1954a5-d515-4850-848d-9b509e933e3f] cfg: engine config [version: 1] processing - stage 4/6 successful
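Because every entry above is tagged level="INFO", it can help to filter on the level field as well as on the subcomponent. The following is a minimal Python sketch of that idea, not an NSX tool: the function names `parse_fields` and `filter_lines` are my own, and the sample lines are copied from the log excerpts in this post. It assumes the key="value" layout of the `[nsx@6876 ...]` block shown above.

```python
import re

# Structured-data block as it appears in the excerpts: [nsx@6876 key="value" ...]
BLOCK = re.compile(r'\[nsx@6876 ([^\]]*)\]')
PAIR = re.compile(r'(\w+)="([^"]*)"')


def parse_fields(line):
    """Return the key/value pairs of the nsx@6876 block, or {} if absent."""
    match = BLOCK.search(line)
    return dict(PAIR.findall(match.group(1))) if match else {}


def filter_lines(lines, subcomp, skip_levels=("INFO",)):
    """Yield lines of one subcomponent whose level is not in skip_levels."""
    for line in lines:
        fields = parse_fields(line)
        if fields.get("subcomp") == subcomp and fields.get("level") not in skip_levels:
            yield line.rstrip("\n")


# Sample lines taken verbatim from the excerpts in this post.
SAMPLE_LINES = [
    '2021-11-15T16:15:58.323363+00:00 nsxedgelb01.lab02.wheatley.local NSX 10097 '
    'LOAD-BALANCER [nsx@6876 comp="nsx-edge" subcomp="lb" s2comp="lb" level="INFO"] '
    '[ed1954a5-d515-4850-848d-9b509e933e3f] cfg: engine config [version: 1] '
    'processing - stage 1/6 successful',
    '2021-11-19T09:27:42.932Z nsxedgelb01.lab02.wheatley.local NSX 17 FABRIC '
    '[nsx@6876 comp="nsx-edge" subcomp="nsxa" s2comp="ha-cluster" level="INFO"] '
    'HA tunnel 172.16.21.1:172.16.21.2 state changed from Concat Path Down to Unreachable',
]

if __name__ == "__main__":
    # With INFO skipped nothing is left for "lb", which matches what I saw.
    print(list(filter_lines(SAMPLE_LINES, "lb")))
    # With no levels skipped, the single "lb" entry is returned.
    print(len(list(filter_lines(SAMPLE_LINES, "lb", skip_levels=()))))
```

To run it against a real log, a copy of the syslog could be opened with `open()` and passed to `filter_lines` in place of `SAMPLE_LINES`.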


As mentioned, the real issue was with the Tier-1 gateway: it was unable to reach the default gateway. Although I have built several NSX-T load balancers, I had never built one completely without any form of overlay, and that turned out to be the key. After trying different configurations and searching online (with the help of a colleague), we came across the 'NSX-T LB Encyclopedia' by Dimitri Desmidt. This encyclopedia contains a single remark on slide 7 that reads "[...] Edge Nodes must have at least 1 tunnel up to get its LB hosted Standalone T1 Active."

Screenshot from the NSX-T LB Encyclopedia

This remark got me thinking about why an overlay network would be needed. The load balancer is deployed together with a Tier-1 gateway, and both are hosted in Active/Standby mode on two NSX-T Edge nodes; for high availability, an NSX-T Edge node cluster contains multiple nodes. Could it be that the Edge nodes need an overlay to communicate over? That would explain why the T1 gateway and the load balancer failed to deploy.

So I added TEP addresses to the NSX-T Edge nodes, which brought my load balancer to life.

Why is the overlay really needed?

While searching for the reason the overlay is needed, I got an even bigger surprise. It seems to me that NSX creates not just one, but two tunnels for HA between the two Edge nodes: one on the underlay network and a second on the overlay network (172.16.21.x). It's not really a game changer, but it is nice to get an even better understanding of how VMware NSX-T works. I found this information in the syslog after I placed one of the nodes in 'NSX Maintenance mode'.


nsxedgelb01> get log-file syslog | find "ha-cluster"

2021-11-19T09:27:42.932Z nsxedgelb01.lab02.wheatley.local NSX 17 FABRIC [nsx@6876 comp="nsx-edge" subcomp="nsxa" s2comp="ha-cluster" level="INFO"] HA tunnel 172.16.21.1:172.16.21.2 state changed from Concat Path Down to Unreachable

2021-11-19T09:28:10.011Z nsxedgelb01.lab02.wheatley.local NSX 17 FABRIC [nsx@6876 comp="nsx-edge" subcomp="nsxa" s2comp="ha-cluster" level="INFO"] HA tunnel 10.10.80.101:10.10.80.102 state changed from Admin Down to Unreachable
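To compare the two HA tunnels side by side, the state changes can be pulled out of the log text. This is a minimal Python sketch that assumes the message format is exactly as in the two lines above; the names `HaEvent` and `parse_ha_events` are hypothetical, and I deliberately call the two addresses `endpoint_a` and `endpoint_b` because the log line itself does not say which one is local.

```python
import re
from collections import namedtuple

# One record per "HA tunnel ... state changed" message.
HaEvent = namedtuple("HaEvent", "timestamp endpoint_a endpoint_b old_state new_state")

# Pattern derived from the two log lines above; other NSX versions may differ.
HA_LINE = re.compile(
    r'^(\S+) .*s2comp="ha-cluster".*HA tunnel '
    r'(\d+\.\d+\.\d+\.\d+):(\d+\.\d+\.\d+\.\d+) '
    r'state changed from (.+?) to (.+)$'
)


def parse_ha_events(lines):
    """Return an HaEvent for every HA tunnel state change found in lines."""
    events = []
    for line in lines:
        match = HA_LINE.match(line.strip())
        if match:
            events.append(HaEvent(*match.groups()))
    return events


# Sample lines taken verbatim from the output above.
SAMPLE_LINES = [
    '2021-11-19T09:27:42.932Z nsxedgelb01.lab02.wheatley.local NSX 17 FABRIC '
    '[nsx@6876 comp="nsx-edge" subcomp="nsxa" s2comp="ha-cluster" level="INFO"] '
    'HA tunnel 172.16.21.1:172.16.21.2 state changed from Concat Path Down to Unreachable',
    '2021-11-19T09:28:10.011Z nsxedgelb01.lab02.wheatley.local NSX 17 FABRIC '
    '[nsx@6876 comp="nsx-edge" subcomp="nsxa" s2comp="ha-cluster" level="INFO"] '
    'HA tunnel 10.10.80.101:10.10.80.102 state changed from Admin Down to Unreachable',
]

if __name__ == "__main__":
    for event in parse_ha_events(SAMPLE_LINES):
        print(event.endpoint_a, event.endpoint_b, event.old_state, "->", event.new_state)
```

Grouping the resulting events by address range makes it easy to see that both the 172.16.21.x tunnel and the 10.10.80.x tunnel change state when a node goes into maintenance mode.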


Closing thoughts

In the end, it took me a while to find out why the load balancer wasn't working as expected. After an extensive search I found the clue: the Edge node must have at least one tunnel with the status 'up'. This requirement is not easily visible in the syslog, if it's mentioned at all. This exercise gave me a better understanding of how the NSX-T load balancer works and how to deploy this corner case where no overlay networking is used.

Thanks for reading. Any questions, or just want to leave a remark? Please do so. I’m always very curious to hear what you think of my content.
