NSX Edge loss of network connectivity on Broadcom BCM57414 NICs

At a client recently, the NSX Edge VMs were periodically losing most of their network connectivity until they were vMotioned to another host. We also saw this on Windows VMs, but to a lesser extent, likely because a Windows VM generates less network traffic than an NSX Edge.

NSX Edges are used to create virtualised Tier 0 / Tier 1 routers which peer with the physical network using a routing protocol such as OSPF or BGP; this allows routing from the physical network to software-defined NSX overlay networks. The majority of this client’s workload was running in NSX overlay networks and, as you can imagine, randomly losing the data path on an NSX Edge caused a lot of critical outages for clients outside the environment. Servers (within that VRF) could continue to communicate with other servers because that traffic did not route via the NSX Edge node; it was routed by the Distributed Logical Router on the ESXi host (or stayed within the same overlay network).

This client was running HPE Gen10 servers with ESXi 7 & NSX 4.1 on hosts with Broadcom network interface cards (Broadcom BCM57414 Ethernet 10/25Gb 2-port SFP28 Adapter for HPE), with the latest HPE Service Pack for ProLiant (2023.09) installed and vCenter/ESXi/NSX up to date.

The HPE SPP (2023.09) includes these NIC firmware/driver versions for this card:

  • Firmware: 226.1.107.0
  • Driver: 226.0.121.0

What we eventually found, after much troubleshooting of the physical network, ESXi, and NSX, was a bug in the Broadcom network interface driver (Broadcom defect ID: DCSG01533090) which can cause Windows virtual machines to lose connectivity when using VMXNET3 adapters. Our suspicion was that this bug was also affecting the NSX Edge appliances, just slightly differently to how it was affecting the Windows VMs.

Broadcom 226.0.145.4 Network Driver release notes

Updating the NIC driver in ESXi from 226.0.121.0 to 226.0.145.4 (after verifying some other VMware HCL requirements) brought in the fix and resolved the issue. The NSX environment has been stable for several months since the update.
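
If you want to check a list of hosts against the fixed driver release, the comparison can be sketched in a few lines of Python. This is only a minimal sketch: the host names and the inventory dictionary are hypothetical, and it assumes that driver releases after 226.0.145.4 keep the fix. The version strings themselves would have to be gathered separately, for example from the driver details that `esxcli network nic get -n <vmnic>` reports on each host.

```python
def parse_version(text):
    """Turn a dotted version such as '226.0.145.4' into a tuple of ints."""
    return tuple(int(part) for part in text.strip().split("."))


# Driver release that resolved the issue described in this post.
FIXED_DRIVER = parse_version("226.0.145.4")


def has_driver_fix(installed):
    """Return True if the installed driver is at or above the fixed release.

    Assumption: releases newer than 226.0.145.4 still contain the fix.
    """
    return parse_version(installed) >= FIXED_DRIVER


# Hypothetical inventory: host name -> installed NIC driver version.
hosts = {
    "esx01": "226.0.121.0",
    "esx02": "226.0.145.4",
}

for name, version in hosts.items():
    status = "has fix" if has_driver_fix(version) else "needs update"
    print(f"{name}: driver {version} -> {status}")
```

Comparing tuples of integers rather than raw strings avoids the trap where "226.0.99.0" sorts after "226.0.145.4" as text. Treat the result as a first filter only; the VMware HCL still decides which driver and firmware combination is supported.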
