Monday, 6 November 2017

Validating NSX VTEP connectivity

This post was inspired by a recent incident in a customer environment where VMs experienced networking issues because the MTU size was misconfigured on the ToR switches.

If you have ever worked with NSX-V and Logical Switches, you know that NSX configures the VTEP vmknics with an MTU of 1600 bytes. This leaves room for the VXLAN encapsulation overhead.
However, between any two VTEP interfaces there is at least one L2 or L3 networking device that may not be configured to support baby jumbo frames (another name for 1600-byte frames).

There are many posts explaining how to check MTU size and network connectivity between VTEP interfaces. It is a simple ping using esxcli:

esxcli network diag ping --netstack=vxlan --host vmknic_IP --df --size=1572
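
The value 1572 is the largest ICMP payload that fits into a 1600-byte MTU once the 20-byte IPv4 header and the 8-byte ICMP header are subtracted. A minimal sketch of that arithmetic, assuming IPv4 without IP options (the function name is made up for illustration and is not part of esxcli):

```shell
#!/bin/sh
# Largest ICMP echo payload that fits in one unfragmented IPv4 packet.
# icmp_payload_for_mtu is a hypothetical helper, not an esxcli feature.
icmp_payload_for_mtu() {
    mtu=$1
    ip_header=20    # IPv4 header without options (assumption)
    icmp_header=8   # ICMP echo header
    echo $((mtu - ip_header - icmp_header))
}

icmp_payload_for_mtu 1600   # prints 1572
icmp_payload_for_mtu 1500   # prints 1472
```

With --df set, a payload even one byte larger than this should fail on a path limited to a 1600-byte MTU.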

Now, imagine you have a small transport zone with 10 hosts, each with 2 VTEP interfaces.
You will need to run the esxcli command 360 times to validate all combinations of VTEP pairs (20 VTEPs, each pinging the 18 VTEPs on the other hosts).

With 64 hosts the number of required ping tests reaches 16,128.  Well, that's obviously something that requires automation. 
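
These numbers can be reproduced with a few lines of shell. The sketch below assumes every host has the same number of VTEPs and that a VTEP never pings the VTEPs on its own host; the function name is invented for this example:

```shell
#!/bin/sh
# Number of directed pings in a full mesh, excluding VTEPs on the same host.
# full_mesh_pings is a hypothetical helper used only for this illustration.
full_mesh_pings() {
    hosts=$1
    vteps_per_host=$2
    total_vteps=$((hosts * vteps_per_host))
    # Each VTEP targets every VTEP that lives on another host.
    targets_per_vtep=$(( (hosts - 1) * vteps_per_host ))
    echo $((total_vteps * targets_per_vtep))
}

full_mesh_pings 10 2   # prints 360
full_mesh_pings 64 2   # prints 16128
```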

Hopefully, future versions of NSX will include this validation step as part of the NSX Health Check.
Meanwhile, we can take advantage of PowerShell to make our VTEP validation test a bit easier.

I didn't spend much time writing the script and had only my home lab for testing, so it may well have some bugs.

Here is the logic of the script:
  • Connects to NSX/vCenter and validates that the connection was established successfully
  • Builds an array of Transport Zones and Hosts
  • Builds an array of Hosts and their VTEPs
  • Iterates through each TZ-Host-VTEP and pings all other VTEPs in the transport zone. This is a full-mesh test.
  • The script uses pings with 2 different sizes: 64 and 1572 bytes. The smaller one checks basic connectivity, and the larger one validates that the MTU size is configured correctly along the path between two VTEPs.
  • The results are displayed on the screen in real time
  • Two reports are produced for each transport zone:
    • Summary - a table with Source Host, Destination Host and the test result
    • Detailed - a table that contains Hosts, VTEP names and IP Addresses, the test result for each packet size and the error message, if any.
The script has been tested with vSphere 6.5 U1 and NSX 6.3.x

Update (9/11/2017) - the script was updated to work with ESXi 6.0 and 6.5 versions.
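
To illustrate the full-mesh iteration, here is a rough shell sketch that only prints the esxcli commands each host would need to run. It is not the PowerShell script itself, just an outline of the same idea. The host names, vmk names and IP addresses are made-up sample data, and pinning the source VTEP with the --interface option, as well as using --df for both sizes, are assumptions of this sketch:

```shell
#!/bin/sh
# Print the esxcli ping commands for a full-mesh VTEP test.
# Inventory format: "<host> <vmk> <ip>" per line (made-up sample data).
VTEPS="esx01 vmk3 10.0.0.11
esx01 vmk4 10.0.0.12
esx02 vmk3 10.0.0.21
esx02 vmk4 10.0.0.22"

# full_mesh_plan is a hypothetical helper; it prints commands, it does not run them.
full_mesh_plan() {
    inventory=$1
    echo "$inventory" | while read -r src_host src_vmk src_ip; do
        echo "$inventory" | while read -r dst_host dst_vmk dst_ip; do
            # Skip VTEPs that live on the same host as the source.
            [ "$src_host" = "$dst_host" ] && continue
            # 64 bytes checks connectivity, 1572 bytes checks the MTU.
            for size in 64 1572; do
                echo "$src_host: esxcli network diag ping --netstack=vxlan --interface=$src_vmk --host=$dst_ip --df --size=$size"
            done
        done
    done
}

full_mesh_plan "$VTEPS"
```

Each output line is prefixed with the host on which the command would be run, so the plan can be split per host or fed into whatever remote execution method is available.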

The following screenshot provides an example of successful tests:

This is an example of the error messages produced with a packet size of 1573 bytes, one byte more than a 1600-byte MTU can carry without fragmentation:




As you can see, the script can detect different types of issues.


Here are a couple of screenshots of the reports:

Summary Report

Detailed Report



Here is the script code




Feel free to provide feedback on any bugs you may encounter using this script.