R.I.P. TCP Offload Engine NICs (TOEs)

Solarflare Delivers Smart NICs for the Masses: Software Definable, Ultra-Scalable, Full Network Telemetry with Built-in Firewall for True Application Segmentation, Standard Ethernet TCP/UDP Compliant

As this blog post by Michael C. Bazarewsky states, Microsoft quietly pulled support for TCP Chimney in its Windows 10 operating system. Chimney was an architecture for offloading the state and responsibility of a TCP connection to a NIC that supported it. The piece cited numerous technical issues and lack of adoption, and Michael’s analysis hits the nail on the head. Goodbye TOE NICs.

During the early years of this millennium, Silicon Valley venture capitalists dumped hundreds of millions of dollars into start-ups that would deliver the next generation of network interface cards at 10Gb/sec using TCP offload engines. Many of these companies failed under the weight of trying to develop expensive, complicated silicon that just did not work. Others received a big surprise in 2005 when Microsoft settled with Alacritech over patents Alacritech held that described Microsoft’s Chimney architecture. In a cross-license arrangement with Microsoft and Broadcom, Alacritech received many tens of millions of dollars in licensing fees. Alacritech would later collect tens of millions more in fees from nearly every other NIC vendor implementing a TOE in their design. At the time, Broadcom was desperate to pave the way for its acquisition of Israeli-based Siliquent. Given server OEM pressure, the settlement was a small price to pay for the certain business Broadcom would garner from sales of the Siliquent device. At 1Gb/sec, Broadcom owned an astounding 100% of the server LAN-on-Motherboard (LOM) market, and yet their position was threatened by the onslaught of new, well-funded 10Gb start-ups.

In fact, the feature list for new “Ethernet” enhancements got so full of great ideas that most vendors’ designs relied on a complex “sea of cores” promising extreme flexibility that ultimately proved to be very difficult to qualify at the server OEMs. Any minor change to one code set would cause the entire design to fail in ways that were extremely difficult to debug, not to mention delivering miserably poor performance. Most notably, Netxen, another 10Gb TOE NIC vendor, quickly failed after winning major design-ins at the three big OEMs, ultimately ending in a fire sale to Qlogic. Emulex saw the same pot of gold in its acquisition of ServerEngines.

That new impetus was a move by Cisco to introduce Fibre Channel Over Ethernet (FCoE) as a standard to converge networking and storage traffic. Cisco let Qlogic and Emulex (Q & E) inside the tent before their Unified Computing System (UCS) server introduction. But the setup took some time. It required a new set of Ethernet standards, now more commonly known as Data Center Bridging (DCB). DCB was a set of physical layer requirements that attempted to emulate the reliability of TCP by injecting wire protocols that would allow “lossless” transmission of packets. What a break for Q & E! Given the duopoly’s control over the Fibre Channel market, this would surely put both companies in the pole position to take over the Ethernet NIC market. Even Broadcom spent untold millions to develop a Fibre Channel driver that would run on their NIC.

Q & E quickly released what many called the “Frankenstein NIC,” a kluge of Application-Specific Integrated Circuits (ASICs) designed to get a product to market even while struggling to develop a single ASIC, a skill at which neither company excelled. Barely achieving its targeted functionality, no design saw much traction. Through all of our customer interactions (over 1,650), we could find only one that had implemented FCoE. This large bank has since retracted its support for FCoE and, in fact, showed a presentation slide several years ago stating they were “moving from FCoE to Ethernet,” an acknowledgment that FCoE was indeed NOT Ethernet.

In conjunction with TOEs, the industry pundits believed that RDMA (Remote Direct Memory Access) was another required feature to reduce latency, and not just for High-Frequency Trading (HFT), another acknowledgment that lowering latency was critical to the hyper-scale cloud, big data, and storage architectures. However, once again, while intellectually stimulating, using RDMA in any environment proved to be complex and simply not compatible with customers’ applications or existing infrastructures.

The latest RDMA push is to position it as the underlying fabric for Non-Volatile Memory Express over Fabrics (NVMeF). Why? Flash has already reduced the latency of storage access by an order of magnitude, and the next generation of flash devices will reduce latency and increase capacity even further. Whenever there’s a step function in the performance of a particular block of computer architecture, developers come up with new ways to use that capability to drive efficiencies and introduce new and more interesting applications. Much like Moore’s Law, rotating magnetic memory is on its last legs. Several of our most significant customers have already stopped buying rotating memory in favor of Flash SSDs.

Well… here we go again. RDMA is NOT Ethernet. Despite the “fake news” about running RDMA, RoCE and iWARP on Ethernet, the largest cloud companies and our large financial services customers have declared that they cannot and will not implement NVMeF using RDMA. It just doesn’t fit in their infrastructures or applications. They want low-latency standard Ethernet.

Since our company’s beginning, we’ve never implemented TOEs, RDMA, FCoE or any of the other great and technically sound ideas for changing Ethernet. Sticking to our guns, we decided to go directly to the market and create the pull for our products. The first market to embrace our approach was High-Frequency Trading (HFT). Over 99% of the world’s volume of electronic trading, in all instruments, runs on our company’s NICs. Why? Customers could test and run our NICs without any application modifications or changes to their infrastructure and realize enormous benefits in latency, jitter, message rate and robustness… it’s standard Ethernet, and our kernel bypass software has become the industry’s default standard.

It’s not that there isn’t room for innovation in server networking, it’s that you have to consider the customer’s ability to adapt and manage that change in a way that’s not inconsistent or disruptive to their infrastructure, while at the same time, delivering highly valued capabilities.

If companies are looking for innovation in server networking, they need to look for a company that can provide the following:

  • Best-in-class PTP synchronization
  • Ultra-high resolution time stamps for every packet at every line rate
  • A method for lossless, unobtrusive, packet capture and analysis
  • Significant performance improvement in NGINX and LXC Containers
  • A firewall NIC and Application Micro-Segmentation that can control every app, VM, or container with unique security profiles
  • Real, extensive Software Definable Networking (SDN) without agents

In summary, while it’s taken a long time for the industry to realize its inertia, logic eventually prevailed. Today, companies can benefit from innovations in silicon and software architecture that are in deployment and have been validated by the market. An innovative approach such as neural-scale networking, which is designed to respond to the high-bandwidth, ultra-low-latency, hardware-based security, telemetry, and massive connectivity needs of ultra-scale computing, is likely the only strategy to achieve a next generation cloud and datacenter architecture that can scale, be easily managed, and, maybe most importantly, be secured.

— Russell Stern, CEO Solarflare

5 Reasons Infiniband Will Lose Relevance After 100G

Proprietary technologies briefly lead the market because they introduce disruptive features not found in the available standard offerings. Soon after, those features are merged into the standard. We’ve seen this many times in the interconnects used in High-Performance Computing (HPC). From 2001 through 2004 Myrinet adoption grew as rapidly in the Top500 as Ethernet, and if you were building a cluster at that time you likely used one or the other. Myrinet provided significantly lower latency, a higher performance switching fabric, and double the effective bandwidth, but it came with a larger price tag. In the graph below, Myrinet made up nearly all of the declining gray line through 2010, by which time the Top500 was split between Infiniband and Ethernet. Today Myrinet is gone and Infiniband is on top, just edging out Ethernet, but its time in the sun has begun to fade as it faces challenges in five distinct areas.

1. Competition, in 2016 and beyond Infiniband EDR customers will have several attractive options: 25GbE, 50GbE and by 2017 100GbE, along with Intel’s Omni-Path. For the past several generations Infiniband has raced so far ahead of Ethernet that it left little choice. Recently, though, 10GbE adoption within HPC has been growing rapidly, and is responsible for much of Ethernet’s growth in the past six months. During the same time, 40GbE has seen little penetration; it’s often viewed as too expensive. In 2016 we will see IEEE-approved 25GbE and 50GbE standards emerge, along with new and affordable cabling and optics options. It should be noted that a single 50GbE link aligns very well with the most common host server bus connection, PCIe Gen3 x8, which delivers roughly 52Gbps unidirectionally. For 100GbE we’ll need PCIe Gen4 x8. While 100Gbps could be done today with PCIe Gen3 x16, HPC system architects often leave this slot open for I/O hungry GPU cards. The second front Infiniband will be facing is Intel’s Omni-Path technology, which will also offer a 100Gbps solution, but it will come directly off the host CPU complex and is designed to be a routable, extensible interconnect fabric. Intel made a huge splash at SC15 with Omni-Path and switching, which is a fusion of intellectual property Intel picked up from Cray, Qlogic, and several other Infiniband acquisitions. Some view 2017 as the year when both 100GbE and Omni-Path will begin to chip away at Infiniband’s performance revenue while 25/50GbE erodes the value focused HPC and Exascale customers Infiniband has been enjoying.

2. Bandwidth, if you’ve wanted something greater than 10GbE over a single link, you’ve had little choice up to this point. While 40GbE exists, many view it as an expensive alternative. Recent pushes by two groups to flesh out 25GbE and 50GbE ahead of the IEEE have resulted in this standards group stepping up its efforts. All of this has accelerated the industry’s approach toward a unified 100GbE server solution for 2017. Add to this Arista and others pushing Intel to provide CLR4 as an affordable 4-channel 25G, 100G optical transceiver, and things get even more interesting.

3. Latency, this has always been a strong reason for selecting Infiniband. Much of its gain is the result of moving the communications stack into user space and accelerating the wire to PCIe bus connection. These tricks are not unique to Infiniband; others have played them all for Ethernet, delivering performance Ethernet controllers and OS bypass stacks which now offer similar latencies when compared at similar speeds. This is why nearly all securities worldwide are traded through systems using Solarflare adapters leveraging their OS bypass stack, called OpenOnload, while using standard UDP and TCP protocols. The domain of low latency is no longer exclusive to RDMA, as it can now be achieved more easily and transparently using existing code, via UDP and TCP transport layers over industry standard Ethernet.

4. Single vendor, if you want Infiniband there really is only one vendor who offers a single end-to-end solution. End-to-end solution providers are great because they expose a single throat to choke when things eventually don’t work. Conversely, many customers will avoid adopting technologies where there is only one provider because it removes competition and choice from the equation. Also, when that vendor stumbles, and they always do, with a single vendor you’re stuck. Ethernet, the open industry standard, affords you options while also providing you with interoperability.

5. Return to a Single Network, ever since Fiber Channel intruded into the data center nearly two decades ago network engineers have been looking for ways to remove it. Then along came exascale, HPC by another name, and Infiniband was also pulled into the data center. Some will say Infiniband can do all three, but clearly, those people have never dealt with bridging real world Ethernet traffic with Infiniband traffic. At 100Gbps Ethernet should have what it needs in both features and performance to provide a pipeline for all three protocols over a single generic network fabric.
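
The PCIe alignment noted in item 1 can be sanity-checked with a little arithmetic. The sketch below is only an illustration: the 8 GT/s per-lane rate and 128b/130b encoding are the published PCIe Gen3 figures, Gen4 is taken as double that rate, and the 0.825 efficiency factor is an assumption chosen so that Gen3 x8 lands near the roughly 52Gbps quoted above. The function names are made up for this example.

```python
# Rough PCIe-versus-Ethernet sizing check (illustrative sketch only).
GT_PER_LANE = {"gen3": 8.0, "gen4": 16.0}  # giga-transfers/sec per lane
ENCODING = 128 / 130                       # 128b/130b line encoding
ASSUMED_EFFICIENCY = 0.825                 # assumption: protocol overhead factor


def pcie_effective_gbps(gen: str, lanes: int) -> float:
    """Estimated usable one-way bandwidth of a PCIe link, in Gbps."""
    return GT_PER_LANE[gen] * lanes * ENCODING * ASSUMED_EFFICIENCY


def fits(link_gbps: float, gen: str, lanes: int) -> bool:
    """True if the slot can, on paper, keep up with the Ethernet link."""
    return pcie_effective_gbps(gen, lanes) >= link_gbps


if __name__ == "__main__":
    cases = [("gen3", 8, 50), ("gen3", 8, 100), ("gen3", 16, 100), ("gen4", 8, 100)]
    for gen, lanes, link in cases:
        bw = pcie_effective_gbps(gen, lanes)
        verdict = "ok" if fits(link, gen, lanes) else "too slow"
        print(f"{gen} x{lanes}: ~{bw:.0f} Gbps vs {link}GbE -> {verdict}")
```

Under these assumptions Gen3 x8 clears 50GbE but not 100GbE, while Gen3 x16 and Gen4 x8 both come out just above 100Gbps, which matches the slot choices described in item 1.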

Given all the above it should be interesting to revisit this post in 2018 to see how the market reacted. For some perspective, in this blog back in December 2012, I wrote “How Ethernet Won the West,” where I predicted that both Fiber Channel and Infiniband would eventually disappear: Fiber Channel as a result of Fiber Channel over Ethernet (FCoE), which never really took off, and Infiniband because everyone else was abandoning it, including Jim Cramer. Turns out while I’ve yet to be right about either, Cramer nailed it. Since January 2013, adjusting for splits and dividends, Mellanox stock has dropped 14%.

Three Mellanox Marketing Misrepresentations

So Mellanox’s ConnectX-4 line of adapters is hitting the street, and as always tall tales are being told, or rather blogged, about the amazing performance of these adapters. As is Mellanox’s strategy, they intentionally position Infiniband’s numbers to imply that they are the same on Ethernet, which they’re not. Claims of 700 nanoseconds latency, 100Gbps & 150M messages per second. Wow, a triple threat: low latency, high bandwidth, and an awesome message rate. So where does this come from? How about the second paragraph of Mellanox’s own press release for this new product: “Mellanox’s ConnectX-4 VPI adapter delivers 10, 20, 25, 40, 50, 56 and 100Gb/s throughput supporting both the InfiniBand and the Ethernet standard protocols, and the flexibility to connect any CPU architecture – x86, GPU, POWER, ARM, FPGA and more. With world-class performance at 150 million messages per second, a latency of 0.7usec, and smart acceleration engines such as RDMA, GPUDirect, and SR-IOV, ConnectX-4 will enable the most efficient compute and storage platforms.” It’s easy to understand how one might actually think that all the above numbers also pertain to Ethernet, and by extension UDP & TCP. Nothing could be further from the truth.

From Mellanox’s own website on February 14, 2015: “Mellanox MTNIC Ethernet driver support for Linux, Microsoft Windows, and VMware ESXi are based on the ConnectX® EN 10GbE and 40GbE NIC only.” So clearly all the above numbers are INFINIBAND ONLY. Today, three months after the above press release, the fastest Ethernet Mellanox supports is still 40GbE, and this is done with their own standard OS driver only. This by design will always limit things like packet rate to 3-4Mpps, and latency to somewhere around 10,000 nanoseconds, not 700. Bandwidth could be directly OS limited, but I’ve yet to see that, so on these 100Gbps adapters Mellanox might support something approaching 40Gbps/port.

So let’s imagine that someday in the distant future the gang at Mellanox delivers an OS-bypass driver for the ConnectX-4 and that it does support 100Gbps. What we’ll see is that, like the prior versions of ConnectX, this is Mellanox’s answer to doing both Infiniband & Ethernet on the same adapter, a trick they picked up from the now-defunct Myricom, who achieved this back in 2005 by delivering both Myrinet & 10G Ethernet on the same Layer-1 media. This trick allows Mellanox to ship a single adapter that can be used with two totally different driver stacks to deliver Infiniband traffic over an Infiniband hardware fabric or Ethernet over traditional switches directly to applications or the OS kernel. This simplifies things for Mellanox, OEMs, and distributors, but not for customers.

Suppose I told you I had a car that could reach 330MPH in 1,000 feet. Pretty impressive. Would you expect that same car to work on the highway? Probably not. How about on a NASCAR track? No, because those who really know auto racing immediately realize I’m talking about a beast that burns five gallons of Nitromethane in four seconds: yes, a 0.04MPG top-fuel dragster. This class of racing is analogous to High-Performance Computing (HPC), where Infiniband is king and the problem domain is extremely well defined. In HPC we measure latency using zero byte packets and often attach adapters back to back without a switch to measure perceived network system latency. So while 700 nanoseconds of latency sounds impressive, it should be noted that no end user data is passed during this test at this speed, just empty packets to prove the performance of the transport layer. In production, you can’t actually use zero byte packets because they’re simply the digital equivalent of sealed empty envelopes. Also, to see this 700 nanoseconds you’ll need to be running Infiniband on both ends, along with an Infiniband supported driver stack that bypasses the operating system; note this DOES NOT support traditional UDP or TCP communications. And to get anything near 700 nanoseconds you have to be using Infiniband RDMA functions, back to back between two systems without a network switch, and with no real data transferred. It is a synthetic measurement of the fabric’s performance.

The world of performance Ethernet is more like NASCAR, where cars typically do 200MPH and run races measured in the hundreds of miles around closed loop tracks. Here the cars have to shift gears, brake, run for extended periods of time, refuel, and handle rapid tire changes and maintenance during the race. This is not the same as running a top-fuel drag racer once down a straight 1,000-foot track. The problem is Mellanox is notorious for stating their top-fuel dragster Infiniband HPC numbers to potential NASCAR class high-performance Ethernet customers, believing many will NEVER know the difference. Several years ago Mellanox had their own high-performance OS-Bypass Ethernet stack that supported UDP & TCP called VMA (Voltaire Messaging Accelerator), but it was so fraught with problems that they spun it off as an open source project in the fall of 2013. They had hoped that the community might fix its problems, but since then it’s seen little if any development (15 posts in as many months). So seeing 700 nanosecond class 1/2 round trip UDP or TCP latency with Mellanox anytime in the near future would be very surprising.

Let’s attack misrepresentation number two, an actual Ethernet throughput of 100Gbps. This one is going to be a bit harder without an actual adapter in my hand to test, so just looking at the data sheet, several things do jump out. First, ConnectX-4 uses a 16-lane PCIe Gen3 bus which typically should have an effective unidirectional PCIe data throughput of 104Gbps. On the surface, this looks good. There may be an issue under the covers, though, because when this adapter is plugged into a state of the art Intel Haswell server the PCIe slot maps to a single processor. You can send traffic from this adapter to the other CPU, but it first must go through the CPU it’s connected to. So sticking to one CPU, the best Haswell processor has two 20 lane QPIs with an effective combined unidirectional transfer speed of 25.6GB/sec. Now note that this is all 40 PCIe lanes combined; the ConnectX-4 only has 16 lanes, so proportionally about 10.2GB/sec is available, and that’s only 82Gbps. Maybe they could sustain 100Gbps, but this number on the surface appears somewhat dubious. These numbers should also limit Infiniband’s top end performance for this adapter.
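
The proportional estimate in the paragraph above can be reproduced in a few lines. This is a sketch of that back-of-the-envelope arithmetic only, using the 25.6GB/sec, 40-lane and 16-lane figures quoted in the text; the function name is made up for illustration, and the even per-lane split is the text's simplification rather than a measured property of the platform.

```python
# Reproduces the proportional-share estimate from the text (illustrative only).
COMBINED_GBYTES_PER_SEC = 25.6  # combined one-way figure quoted in the text
TOTAL_LANES = 40                # lanes that figure is spread across (per the text)
ADAPTER_LANES = 16              # lanes used by the adapter


def proportional_share_gbps(total_gbytes: float, total_lanes: int, lanes: int) -> float:
    """Scale the combined figure by the adapter's lane share, convert GB/s to Gbps."""
    share_gbytes = total_gbytes * lanes / total_lanes
    return share_gbytes * 8  # 8 bits per byte


if __name__ == "__main__":
    share = proportional_share_gbps(COMBINED_GBYTES_PER_SEC, TOTAL_LANES, ADAPTER_LANES)
    shortfall = 100 - share
    print(f"adapter share: {share:.2f} Gbps, shortfall vs 100Gbps: {shortfall:.2f} Gbps")
```

With the quoted inputs the share works out to 10.24GB/sec, or about 82Gbps, leaving the adapter roughly 18Gbps short of its headline rate under this simple model.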

Finally, we have my favorite misrepresentation, 150M messages per second. Messages is an HPC term, and most people who think in Ethernet terms will translate this to 150M packets per second. A 10GbE link has a theoretical maximum packet rate of 14.88Mpps. There is no way their Ethernet driver for the ConnectX-4 could ever support this packet rate; even if they had a really great OS-bypass driver I’d be highly skeptical. This is analogous to saying you have an adapter capable of providing lossless Ethernet packet capture on ten 10GbE (14.88Mpps/link) links at the same time. Nobody today, not even the makers of the best FPGA NICs that cost 10X this price, will claim this.
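
For readers wondering where 14.88Mpps comes from, the following sketch derives it from standard Ethernet framing: a 64-byte minimum frame plus 8 bytes of preamble/start-of-frame delimiter and a 12-byte inter-frame gap, for 84 bytes on the wire per packet. The function and constant names are made up for illustration.

```python
# Theoretical maximum Ethernet packet rate for a given link speed (sketch).
PREAMBLE_SFD_BYTES = 8     # preamble + start-of-frame delimiter
INTERFRAME_GAP_BYTES = 12  # minimum idle time between frames, in byte times
MIN_FRAME_BYTES = 64       # smallest legal Ethernet frame


def max_packets_per_second(link_gbps: float, frame_bytes: int = MIN_FRAME_BYTES) -> float:
    """Line-rate packet count per second, including per-frame wire overhead."""
    wire_bits = (frame_bytes + PREAMBLE_SFD_BYTES + INTERFRAME_GAP_BYTES) * 8
    return link_gbps * 1e9 / wire_bits


if __name__ == "__main__":
    for speed in (10, 40, 100):
        mpps = max_packets_per_second(speed) / 1e6
        print(f"{speed}GbE: {mpps:.2f} Mpps at 64-byte frames")
```

By this measure ten 10GbE links, or one 100GbE link, top out at about 148.8Mpps for minimum-size frames, so 150M packets per second would slightly exceed what the wire itself can carry once preamble and inter-frame gap are counted.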

Let’s humor Mellanox, though, and buy into the fantasy; here is the reality that will creep back in. On Ethernet, we often say the smallest packet is 64 bytes, so 150Mpps * 64 bytes/packet * 8 bits/byte is 76.8Gbps. That is less than the 82Gbps we mentioned above, so that’s good. There are a number of clever tricks that can be used to bring this many packets into the host CPU and into user space while optimizing the use of the PCIe bus, but more often than not these require that the NIC firmware is tuned for packet capture, not generic TCP/UDP traffic flow. Let’s return to the Intel Haswell E5-2699 with 18 cores at 2.3GHz. Again for performance, we’ll steer all 150Mpps into the single Intel socket supporting this Mellanox adapter. Now for peak performance, we want to ensure that packets are going to extremely quiet cores because we know that both the OS & BIOS settings can create system jitter which kills performance and determinism. So we profile this CPU and find the 15 least busy cores, those with NOTHING going on. Now if we assume Mellanox were to have an OS Bypass UDP/TCP stack that supported a round-robin method for doling out a flood of 64-byte packets, this would mean 10Mpps/core or 100 nanoseconds/packet to do something useful with each packet. That’s about 230 clock ticks at 2.3GHz on Intel’s best processor. Unless you’re hand coding in assembler it’s going to be very hard to get that much done.
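
The per-packet budget in the paragraph above follows directly from the stated inputs: 150Mpps, 15 quiet cores and a 2.3GHz clock. The sketch below simply redoes that arithmetic; it assumes the base clock with no turbo and an even round-robin spread across cores, and the names are made up for illustration.

```python
# Per-core, per-packet processing budget from the figures in the text (sketch).
def per_packet_budget(total_pps: float, cores: int, clock_hz: float):
    """Return (packets/sec per core, nanoseconds per packet, clock cycles per packet)."""
    pps_per_core = total_pps / cores
    ns_per_packet = 1e9 / pps_per_core
    cycles_per_packet = clock_hz / pps_per_core
    return pps_per_core, ns_per_packet, cycles_per_packet


def payload_gbps(total_pps: float, frame_bytes: int) -> float:
    """Throughput counting frame bytes only (no preamble or inter-frame gap)."""
    return total_pps * frame_bytes * 8 / 1e9


if __name__ == "__main__":
    pps, ns, cycles = per_packet_budget(150e6, cores=15, clock_hz=2.3e9)
    print(f"{pps / 1e6:.0f} Mpps/core, {ns:.0f} ns/packet, {cycles:.0f} cycles/packet")
    print(f"{payload_gbps(150e6, 64):.1f} Gbps of 64-byte frames")
```

At these numbers each core gets 100 nanoseconds, or 230 cycles at the 2.3GHz base clock, to receive and act on every packet.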

So when Mellanox begins talking about supporting 25GbE, 50GbE or 100GbE, you need only remember one quote from their website: “Mellanox MTNIC Ethernet driver support for Linux, Microsoft Windows and VMware ESXi are based on the ConnectX® EN 10GbE and 40GbE NIC only.” So please don’t fall for the low latency, high bandwidth or packet rate Mellanox Ethernet hype; it’s just hogwash.

Update, on March 2, 2015, Mellanox posted an Ethernet only press release that claimed this adapter supported 100GbE, and using the DPDK interface in testing they could achieve 90Gbps with 75Mpps over the 100G link (roughly wire-rate 128 byte packets).
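
As a quick check on the "roughly wire-rate 128 byte packets" remark, the sketch below works backward from the two reported figures to an implied frame size. It assumes the 90Gbps figure counts everything on the wire, including the 20 bytes of preamble and inter-frame gap per packet; the press release may have measured differently, so treat the result as indicative only. The names are made up for illustration.

```python
# Infer the frame size implied by a reported throughput and packet rate (sketch).
WIRE_OVERHEAD_BYTES = 20  # 8-byte preamble/SFD + 12-byte inter-frame gap


def implied_frame_bytes(throughput_gbps: float, rate_mpps: float) -> float:
    """Frame size if throughput_gbps includes per-packet wire overhead (assumption)."""
    wire_bytes_per_packet = throughput_gbps * 1e9 / (rate_mpps * 1e6) / 8
    return wire_bytes_per_packet - WIRE_OVERHEAD_BYTES


if __name__ == "__main__":
    size = implied_frame_bytes(90, 75)
    print(f"implied frame size: {size:.0f} bytes")
```

Under that assumption the reported numbers imply frames of about 130 bytes, which is consistent with a test run using 128-byte packets.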