SmartNICs vs. DPUs, Who Wins?

Last week I hosted an IEEE Hot Interconnects Panel with the above title. We were lucky enough to secure time from a group of industry luminaries, and it made for an excellent panel.

Clicking on the image below should take you to the 90-minute YouTube video of our panel discussion. For those just interested in the highlights, read on for some of the more interesting points pulled from our discussion.

IEEE Hot Interconnects Panel: “SmartNICs vs. DPUs, Who Wins?”

Here are some of the key takeaways from the panel discussion:

  1. SmartNICs provide a second computing domain inside the server that can be used for security, orchestration, and control plane tasks. While some refer to this as an air-gapped domain, it isn’t truly air-gapped, but it is far more secure than running inside the same x86 system domain. This can be used to securely enable bare-metal as a service. — Michael Kagan
  2. Several vendors are actively collaborating on a Portable NIC Architecture (PNA) designed to execute P4 code. When available, it would then be possible to deliver containers with P4 code that could run on any NIC that supported this PNA model. — Vipin Jain
  3. The control plane needs to execute in the NIC for two reasons: first, to offload the host CPU from what is quickly becoming a 30% overhead for processing network traffic, and second, to improve the determinism of the applications running on the server. — Vipin Jain
  4. App stores are inevitable; the question is when. While some think it could be years, others believe it will happen within a year. Xilinx has partnered with a company that already has one for FPGA accelerators, so the leap to SmartNICs shouldn’t be that challenging. — Gordon Brebner
  5. The ISA is unimportant; it’s the micro-architecture that matters. Fungible selected MIPS-64 because of its support for simultaneous multi-threaded execution with fine-grained context switching. — Pradeep Sindhu. Others feel that the ecosystem of tools and broad access to developers matter most, which is why they’ve selected ARM.
  6. It should be noted that normally the ARM cores are NOT in the data plane.

The first 18 minutes are introductions and marketing messages. While these are educational, they are also somewhat canned. The purpose of the panel discussion was to ask questions the panelists hadn’t seen in advance, so we could draw out honest perspectives and feedback from their years of experience.

IMHO, here are some of the interesting comments, with who made them and where to find them:

18:50 Michael – The SmartNIC is a different computational domain, a computer in front of a computer, and ideal for security. It can supervise or oversee all system I/O; the key thing is that it is a real computer.

23:00 Gordon – Offloading the host CPU to the SmartNIC and enabling programmability of the device is critically important. We’ll also see functions and attributes of switches being merged into these SmartNICs.

24:50 Andy – Not only data plane offload, but control plane offload from the host is also critically important. Also, hardware, in the form of on-chip logic, should be applied to data plane offload whenever possible so that ARM cores are NOT being placed in the data plane.

26:00 Andy – Dropped the three-letter string that makes most hardware providers cringe when we hear it: SDK. He stressed the importance of providing one. It should be noted that Broadcom at this point, as far as I know, appears to be the only SmartNIC OEM that provides a customer-facing SmartNIC SDK.

26:50 Vipin – A cloud-based device that is autonomous from the system and remotely manageable. It has its own brain, and it truly runs independently of the host CPU.

29:33 Pradeep – There is no golden rule, or rule of thumb like the 1Gb/sec/core figure AMD has cited. It’s important to determine what computations should be done in the DPU; multiplexing and stateful applications are ideal. General-purpose CPUs are made for processing single-threaded applications very fast and are horrible at multiplexing.

33:37 Andy – 1Gb/core is really low; I’d not be comfortable with that. I would consider DPDK or XDP, either of which would blow that metric away. People shouldn’t settle for this metric.
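For readers unfamiliar with XDP, here is a minimal sketch of what an XDP program looks like; this one simply passes every frame up the stack, while a real program would parse headers and drop, redirect, or rewrite packets before the kernel networking stack ever sees them.

```c
// Minimal XDP sketch (illustrative only): compiled to BPF and attached to a NIC,
// it runs on every received frame before the kernel networking stack.
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

SEC("xdp")
int xdp_pass_all(struct xdp_md *ctx)
{
    /* A real program would parse the frame between ctx->data and
     * ctx->data_end and return XDP_DROP, XDP_TX, or XDP_REDIRECT. */
    return XDP_PASS;
}

char _license[] SEC("license") = "GPL";
```

Built with something like clang -O2 -target bpf -c xdp_pass.c -o xdp_pass.o and attached with ip link set dev <nic> xdp obj xdp_pass.o sec xdp (file and device names are illustrative).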

35:24 Michael – The network needs to take care of the network on its own, so zero cores for an infinite number of gigabits.

36:45 Gordon – The SmartNIC is a kind of filtering device, where sophisticated functions like IPS can be offloaded into the NIC.

40:57 Andy – The TruFlow logic delivers a 4-5X improvement in packet processing. There are a very limited number of people really concerned with hitting line-rate packets per second at these speeds. In the data center these PPS requirements are not realistic.

42:25 Michael – I support what Andy said, these packet rates are not realistic in the data center.

44:20 Pradeep – We’re having this discussion because general-purpose CPUs can no longer keep up. This is not black and white but a continuum: where does general-purpose processing end and the SmartNIC pick up? gRPC, as an example, needs to be offloaded. The correct interface is not TCP or RDMA; both are too low level. gRPC is the modern level for this communication interface. We need architectural innovation because scale-out is here to stay!

46:00 Gordon – One thing about being FPGA-based is that we can support tons of I/O. With FPGAs we don’t think in terms of cores; we look at I/O volumes. Several years ago we first started looking at 100GbE, figured out how to do that, and extended it to 400GbE. We can see the current approach scaling well into the terabit range. While we could likely provide terabit-range performance today, it would be far too costly; it’s a price-point issue, and nobody would buy it.

48:35 Michael – CPUs don’t manage data efficiently. We have dedicated hardware engines and TCAM along with caches to service these engines; that’s the way it works.

49:45 Pradeep – The person asking the question perhaps meant control flow and not flow control; while they sound the same, they mean different things. Control flow is what a CPU does, flow control is what networking does. A DPU or SmartNIC needs to do both well to be successful. [It appears, and I could be wrong, that Pradeep uses “pipeline” to refer to consecutive stages of execution on a single macro resource like a DPU, and “chain” for a collection of pipelines that provide a complete solution.]

54:00 Vipin – If you stick with fixed-function execution, then line rate is possible. We need to move away from focusing on processing TCP packets and shift focus to messages with a run-to-completion model. It is a general-purpose program running in the data path.

57:20 Vipin – When it came to selecting our computational architecture it was all about ecosystem, and widely available resources and tooling. We [Pensando] went with ARM.

58:20 Pradeep – The ISA is an utter detail; it’s the macro-architecture that matters, not the micro instruction architecture. We chose MIPS because of the implementation, a simultaneous multi-threaded design with far and away better fine-grained context switching, much better than anything else out there. There is also the economic price/performance to be considered.

1:00:12 Michael – I agree with Vipin, it’s a matter of ecosystem; we need to provide a platform for people to develop on. We’re not putting ARM cores in the data path, so the performance consideration Pradeep mentioned is not relevant. The key is providing an ecosystem that attracts as many developers as possible and makes their lives easier so they can produce great value on the device.

1:01:08 Andy – I agree 100%, that’s why we selected ARM; ecosystem drove our choice. With ARM there are enough Linux distributions, and you could be running containers on your NIC. The transition to ARM is trivial.

1:02:30 Gordon – Xilinx mixes ARM cores with programmable FPGA logic, and hard IP cores for things like encryption.

1:03:49 Pradeep – The real problem is the data path, but clearly ARM cores are not in the data path, so they are doing control plane functions. Everyone says they are using ARM cores because of the rich ecosystem, but I’d argue that x86 has a richer ecosystem. If that’s the case, then why not keep the control plane in the host? Why does the control plane need to be embedded inside the chip?

1:04:45 Vipin – The data path is NOT in ARM. We want it on a single die; we don’t want it hopping across many wires and killing performance. The kind of integration I can do by subsuming the ARM cores into my die is tremendous. That’s why it cannot be done with Intel. [Once you go off die, performance suffers, so what I believe Vipin means is that he can configure on the die whatever collection of ARM cores and hard logic he wants, and wire it together however best he sees fit to meet the needs of their customers. He can’t license x86 cores and integrate them on the same die as he can with ARM cores.] Plus, if he did throw an x86 chip on the card, it would blow his power budget [PCIe x16 lane cards are limited to 75W].

1:06:30 Michael – We don’t have as tight an integration of the data path and ARM cores as Pensando. If you want to segregate computing domains between the application tier and the infrastructure tier, you need another computer, and putting an x86 on a NIC just isn’t practical.

1:07:10 Andy – The air-gap, bare-metal-as-a-service use case is a very popular one. Moving control plane functions off the x86 to the NIC frees up x86 cores and enables a more deterministic environment for my applications.

1:08:50 Gordon – Having that programmable logic alongside the ARM cores gives you both control plane offload and the ability to dynamically modify the data plane locally.

1:10:00 Michael – We are all for users programming the NIC; we are providing an SDK and working with third parties to host their applications and services on our NICs.

1:10:15 Andy – One of the best things we do is outreach, where we provide NICs to university developers; they disappear for a few months, then return with completed applications or new use cases. Broadcom doesn’t want to tightly control how people use their devices; it isn’t open if it is limited by what’s available on the platform.

1:13:20 Vipin – Users should be allowed to own and define their own SDK to develop on the platform.

1:14:20 Pradeep – We provide programming stacks [libraries?] that are available to users through REST APIs.

1:15:38 Gordon – We took an early lead in helping define the P4 language for programming network devices. P4 became closely associated with Barefoot Networks’ switch chips, but we’ve embraced it since very early on. We actually have a P4-to-Verilog compiler, so you can turn your P4 code into logic. The main SmartNIC functions inside Xilinx are written in P4. Then there are plug-ins where others can add their own P4 functions into the pipeline.

1:17:35 Michael – Yes, an app-store for our NIC, certainly. It’s a matter of how it is organized. For me it is somewhere users can go where they can safely download containerized applications or services which can then run on the SmartNIC.

1:18:20 Vipin – The App Store is a little ways out there, but it is a good idea. We are working in the P4 community towards standards. He mentions PNA, the Portable NIC Architecture, as an abstraction. [OMG, this is huge, and I wish I wasn’t juggling the balls trying to keep the panel moving, as this would have been awesome to dig into. A PNA could enable containerized P4 applications that could potentially run across multiple vendors’ SmartNICs.] He also mentioned that you will need NIC-based applications, and a fabric with infrastructure applications, so that NICs on opposite sides of a fabric can be coordinated.

1:21:30 Pradeep – An App Store at this point may be premature. In the long term something like an App Store will happen.

1:22:25 Michael – Things are moving much faster these days; maybe just another year for SmartNICs and an App Store.

1:23:45 Gordon – We’ve been working with Pensando and others on the PNA concept with P4 for some time.

For those curious, the final vote was three for DPU and two for SmartNIC, but in the end the customer is the real winner.

User Level Networking (ULN) is Becoming an Over-Night Success

Kernel Bypass = User Level Networking

Rarely is an overnight success actually overnight. Often success comes as the result of years or even decades of hard work, refinement, and maturity. ULN is just such a technology: while it is only now becoming fashionable, as word leaks out that Google and Tencent have adopted it internally after proving significant performance gains, it has been nearly 25 years in the making. Since the mid-1990s we have seen many efforts that have advanced kernel bypass, otherwise known as ULN.

With the advent of both Gigabit Ethernet (GbE) and the Linux operating system, we saw the emergence of large (1,024 or more node) clusters of high-performance servers. These clusters were often designed to focus on particular computing tasks, typically single applications representing complex computational problems. These problems were particularly thorny because they involved very chatty, sophisticated programs that modeled fluid dynamics (e.g., Boeing and airflow over a wing), finite element analysis (e.g., Ford and GM with simulated car crash models), or seismic analysis (e.g., Saudi Aramco and oil production). Don’t get me wrong, there were also many more, like modeling nuclear weapons storage, but the above were just a few of dozens of classes of problems. So the HPC crowd was seeking networking that was even faster and more efficient than generic Transmission Control Protocol (TCP) over GbE. They’d also realized that the Linux kernel was beginning to bottleneck their overall performance, so they started to explore options for bypassing the kernel altogether.

This June the most popular kernel bypass communications stack, the Message Passing Interface (MPI), will celebrate its 25th anniversary. MPI represented the dawn of a new approach to networking, a ULN communications stack. For MPI to achieve its desired performance objectives, it required a lower-level networking device driver. In those early days, you could use the Virtual Interface Architecture (VIA), promoted by Intel, Microsoft, and Compaq, which eventually became Infiniband’s Remote Direct Memory Access (RDMA), or Myrinet, promoted by Myricom. It should be noted that these weren’t the only two options, just the two most highly utilized at the time. Since then Myrinet has faded away, and Infiniband has dominated HPC.
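To ground what a ULN communications stack looks like from the application’s point of view, here is a minimal MPI point-to-point exchange (a sketch only; the kernel bypass itself happens inside the MPI library and the interconnect driver beneath it, not in the application code):

```c
/* Minimal MPI point-to-point exchange: rank 0 sends, rank 1 receives.
 * The application only sees the MPI API; any kernel bypass happens inside
 * the MPI library and the interconnect driver beneath it. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        const char msg[] = "hello from rank 0";
        MPI_Send(msg, sizeof(msg), MPI_CHAR, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        char buf[64];
        MPI_Recv(buf, sizeof(buf), MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("rank 1 received: %s\n", buf);
    }

    MPI_Finalize();
    return 0;
}
```

Compile with mpicc and launch with mpirun -np 2; the same source runs over an RDMA-capable fabric or plain TCP depending on how the MPI library underneath is built.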

In parallel to the maturing of ULN, we’ve had an explosion in core counts on CPUs. This year Intel will begin rolling out premium server-class processor chips supporting up to 48 cores, while AMD counters with 64. On the surface this is excellent news, but it further complicates other system-wide server performance issues, most notably access to the network. Since most servers are dual socket, this brings the potential maximum core counts to 96 and 128 respectively. What we’ve noticed through internal testing, though, is that as the total number of processing cores on a server increases beyond ten, the operating system typically becomes the networking performance bottleneck. As mentioned previously, the High-Performance Computing (HPC) market anticipated this issue long ago.

In 2010 there was a move by several companies to bring HPC technology to markets outside HPC. With this, we saw the introduction of Myricom’s Datagram Bypass Layer (DBL), Solarflare’s OpenOnload, and Voltaire’s Messaging Accelerator (VMA). Both DBL and VMA were born from fifteen years of MPI experience, and they were crafted to provide kernel bypass on Linux. Initially DBL only supported the User Datagram Protocol (UDP), and it took Myricom nearly two more years to add Transmission Control Protocol (TCP) support. While Myricom was able to morph its Myrinet eXpress (MX) stack into DBL, the fact remained that it didn’t have its own ULN TCP stack and was torn between licensing one versus building its own. An interesting side note: the initial customer motivation to create DBL came from a storage company called SANBlaze, but Myricom quickly realized that it could also use DBL to accelerate stock market data for Chicago traders.

At that time 10GbE Network Interface Cards (NICs) had a half-round-trip latency for UDP-based market data of about 10-15 microseconds. The initial version of DBL brought that down to under five microseconds. In financial trading there is a direct correlation between time and money, and saving 5-10 microseconds on market data delivery can mean the difference between winning or losing a bid. At nearly the same time, Solarflare also appeared in Chicago promoting its new OpenOnload, which accelerated not only UDP but also the more complex TCP sessions. While market data comes in on UDP packets, orders into the exchanges are submitted using TCP. In parallel to this, one of the two biggest HPC Infiniband players, Voltaire (later acquired by Mellanox), had crafted its own ULN called VMA. It too had realized that the lucrative financial markets were demanding ULN technology, and the time was right to apply its kernel bypass solution to this problem as well.
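What makes stacks like OpenOnload and VMA attractive is that the application code doesn’t change; a perfectly ordinary POSIX socket program is what gets accelerated. Here is a minimal sketch of such a UDP receiver (the port number is purely illustrative):

```c
/* Ordinary POSIX UDP receiver; nothing bypass-specific in the code itself.
 * A transparent ULN stack such as OpenOnload intercepts these socket calls
 * in user space (typically via LD_PRELOAD), so the same binary runs either
 * way. The port number is illustrative. */
#include <arpa/inet.h>
#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
    int fd = socket(AF_INET, SOCK_DGRAM, 0);

    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family      = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port        = htons(14001);   /* hypothetical market-data port */
    bind(fd, (struct sockaddr *)&addr, sizeof(addr));

    char buf[2048];
    for (;;) {
        /* When intercepted by a bypass library, this recv() is serviced
         * in user space against the NIC's receive ring. */
        ssize_t n = recv(fd, buf, sizeof(buf), 0);
        if (n > 0) {
            /* Handle the datagram, e.g. decode a market-data tick. */
        }
    }
    close(fd);
    return 0;
}
```

With a transparent ULN stack installed, the same binary is typically launched under the bypass library, for example via an LD_PRELOAD-style wrapper such as Onload’s onload ./receiver, and the socket calls are then serviced in user space without touching the kernel’s data path.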

For four years it was a three-way horse race between DBL, OpenOnload, and VMA for the best ULN solution on Linux supporting both UDP and TCP. Since 2010 ULN for both UDP and TCP has come into production at nearly all of the worldwide financial exchanges, institutional banks, and high-frequency traders. While DBL and VMA still exist today, they make up less than 5% of ULN usage among financial customers. It turns out that in the fall of 2012 Myricom privately demonstrated to Google the value of using DBL to accelerate a Web2.0 application used extensively throughout Google called Memcached. By March of 2013 Google had acquired the necessary people and intellectual property from Myricom to bring both DBL and Myricom’s latest NIC technology in-house. With the core DBL development team gone, DBL’s utilization within the financial markets waned, and those customers have moved on to OpenOnload. Since then Google has dramatically expanded its use of this ULN technology in-house. Roughly four years ago, with VMA adoption falling below 2%, Mellanox open-sourced VMA and moved it to GitHub. Quietly over the past several years, as other cloud providers recognized Google’s ULN moves, these players have begun spawning their own ULN projects.

At the same time in 2013, as word leaked out that Google had its own internal ULN project, Intel released its Data Plane Development Kit (DPDK). With DPDK it became much easier for applications to gain direct access to the raw networking device. This did not go unnoticed by China’s Tencent Cloud team: they started with the open-source FreeBSD stack, carved out what they needed from it, then ported that on top of DPDK. The resulting project was called F-Stack, and it can be found on GitHub today. Other projects, like the OpenFastPath Foundation driven by Nokia, ARM, Cavium, and Marvell, are advancing their own ULN. So today, if you’re seeking out a ULN partner that supports both UDP and TCP, your top five options are Solarflare’s Cloud Onload, VMA, F-Stack, OpenFastPath, and Seastar. Only one of these, though, is commercially available and fully supported: Solarflare’s Onload.
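To give a feel for what "direct access to the raw networking device" means in practice, here is a heavily condensed DPDK receive-loop sketch (port 0, the pool sizing, and the omitted error handling are all illustrative; a real application would follow the DPDK sample applications much more closely):

```c
/* Condensed DPDK receive-loop sketch; illustrative only. Assumes DPDK is
 * installed, hugepages are configured, and the NIC is bound to a
 * DPDK-compatible driver. Error handling is omitted for brevity. */
#include <stdlib.h>
#include <string.h>
#include <rte_eal.h>
#include <rte_ethdev.h>
#include <rte_mbuf.h>

#define RX_RING_SIZE 1024
#define NUM_MBUFS    8191
#define BURST_SIZE   32

int main(int argc, char **argv)
{
    /* Bring up the Environment Abstraction Layer: cores, hugepages, PCI probe. */
    if (rte_eal_init(argc, argv) < 0)
        return EXIT_FAILURE;

    /* Packet-buffer pool that will back the receive ring. */
    struct rte_mempool *pool = rte_pktmbuf_pool_create("MBUF_POOL", NUM_MBUFS,
            250, 0, RTE_MBUF_DEFAULT_BUF_SIZE, rte_socket_id());

    /* Minimal device setup on port 0: one RX queue, default configuration. */
    struct rte_eth_conf port_conf;
    memset(&port_conf, 0, sizeof(port_conf));
    rte_eth_dev_configure(0, 1, 0, &port_conf);
    rte_eth_rx_queue_setup(0, 0, RX_RING_SIZE, rte_eth_dev_socket_id(0), NULL, pool);
    rte_eth_dev_start(0);

    /* Poll the NIC's receive ring directly from user space; the kernel's
     * networking stack never touches these packets. */
    for (;;) {
        struct rte_mbuf *bufs[BURST_SIZE];
        uint16_t nb_rx = rte_eth_rx_burst(0, 0, bufs, BURST_SIZE);
        for (uint16_t i = 0; i < nb_rx; i++) {
            /* A real application would parse or forward the packet here. */
            rte_pktmbuf_free(bufs[i]);
        }
    }
    return 0;
}
```

The point to notice is the poll loop: the application pulls packets straight from the NIC’s receive ring in user space, which is exactly the property F-Stack and the other ULN projects build on.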

As you consider how you might accelerate your network-intensive Web2.0 applications, like web servers, software load balancers, in-memory databases, micro-service frameworks, and distributed compute grids, you should consider Solarflare’s Cloud Onload. With Cloud Onload we’ve seen performance gains ranging from 50% to 400%, depending on how network-intensive an application is. Over the past decade, Solarflare’s Onload technology has accelerated electronic trading worldwide, and today over 90% of all exchanges, institutional banks, and high-frequency trading shops have installed Onload. The only other ULN technology that even comes close to the worldwide adoption of Onload is MPI, but that’s a ULN stack designed for HPC messaging, and it does not support UDP or TCP. If your enterprise relies on any of the Web2.0 classes mentioned above, consider reaching out to Solarflare to learn how they can accelerate your network traffic.

5 Reasons Infiniband Will Lose Relevance After 100G

Proprietary technologies briefly lead the market because they introduce disruptive features not found in the available standard offerings. Soon after, those features are merged into the standard. We’ve seen this many times in the interconnects used in High-Performance Computing (HPC). From 2001 through 2004, Myrinet adoption grew as rapidly in the Top500 as Ethernet, and if you were building a cluster at that time you likely used one or the other. Myrinet provided significantly lower latency, a higher-performance switching fabric, and double the effective bandwidth, but it came with a larger price tag. In the below graph, Myrinet made up nearly all of the declining gray line through 2010, by which time the Top500 was split between Infiniband and Ethernet. Today Myrinet is gone and Infiniband is on top, just edging out Ethernet, but its time in the sun has begun to fade as it faces challenges in five distinct areas.

1. Competition, in 2016 and beyond Infiniband EDR customers will have several attractive options: 25GbE, 50GbE, and by 2017 100GbE, along with Intel’s Omni-Path. For the past several generations Infiniband has raced so far ahead of Ethernet that it left little choice. Recently, though, 10GbE adoption within HPC has been growing rapidly and is responsible for much of Ethernet’s growth in the past six months. During the same time, 40GbE has seen little penetration; it’s often viewed as too expensive. In 2016 we will see IEEE-approved 25GbE and 50GbE standards emerge, along with new and affordable cabling and optics options. It should be noted that a single 50GbE link aligns very well with the most common host server bus connection, PCIe Gen3 x8, which delivers roughly 52Gbps unidirectionally (see the arithmetic sketch after this list). For 100GbE we’ll need PCIe Gen4 x8. While 100Gbps could be done today with PCIe Gen3 x16, HPC system architects often leave this slot open for I/O-hungry GPU cards. The second front Infiniband will be facing is Intel’s Omni-Path technology, which will also offer a 100Gbps solution, but directly off the host CPU complex and designed to be a routable, extensible interconnect fabric. Intel made a huge splash at SC15 with Omni-Path and its switching, a fusion of intellectual property Intel picked up from Cray, QLogic, and several other Infiniband acquisitions. Some view 2017 as the year when both 100GbE and Omni-Path will begin to chip away at Infiniband’s performance revenue, while 25/50GbE erodes the value-focused HPC and exascale customers Infiniband has been enjoying.

2. Bandwidth, if you’ve wanted something greater than 10GbE over a single link, you’ve pretty much had little choice up to this point. While 40GbE exists, many view it as an expensive alternative. Recent pushes by two groups to flesh out 25GbE and 50GbE ahead of the IEEE have resulted in that standards group stepping up its efforts. All of this has accelerated the industry’s approach toward a unified 100GbE server solution for 2017. Add to this Arista and others pushing Intel to provide CLR4 as an affordable 4-channel 25G, 100G optical transceiver, and things get even more interesting.

3. Latency has always been a strong reason for selecting Infiniband. Much of its advantage is the result of moving the communications stack into user space and accelerating the wire-to-PCIe-bus connection. These tricks are not unique to Infiniband; others have played them all for Ethernet, delivering high-performance Ethernet controllers and OS-bypass stacks that now offer similar latencies when compared at similar speeds. This is why nearly all securities worldwide are traded through systems using Solarflare adapters leveraging their OS-bypass stack, OpenOnload, while using standard UDP and TCP protocols. The domain of low latency is no longer exclusive to RDMA; it can now be achieved more easily and transparently using existing code, via UDP and TCP transport layers, over industry-standard Ethernet.

4. Single vendor, if you want Infiniband there really is only one vendor offering an end-to-end solution. End-to-end solution providers are great because they give you a single throat to choke when things eventually don’t work. Conversely, many customers will avoid adopting technologies where there is only a single provider because it removes competition and choice from the equation. Also, when that vendor stumbles, and they always do, you’re stuck. Ethernet, the open industry standard, affords you options while also providing interoperability.

5. Return to a Single Network, ever since Fiber Channel intruded into the data center nearly two decades ago, network engineers have been looking for ways to remove it. Then along came exascale, HPC by another name, and Infiniband was also pulled into the data center. Some will say Infiniband can do all three, but clearly those people have never dealt with bridging real-world Ethernet traffic with Infiniband traffic. At 100Gbps, Ethernet should have what it needs in both features and performance to provide a pipeline for all three protocols over a single generic network fabric.
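Regarding the 50GbE and PCIe Gen3 x8 pairing mentioned in the first point, the back-of-the-envelope arithmetic looks roughly like this (approximate figures; the ~52Gbps number folds in PCIe transaction-layer overhead):

```latex
8\,\mathrm{GT/s} \times 8\ \text{lanes} \times \tfrac{128}{130} \approx 63\,\mathrm{Gb/s}\ \text{raw}
\;\Longrightarrow\; \approx 52\,\mathrm{Gb/s}\ \text{effective unidirectional after protocol overhead}
```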

Given all the above it should be interesting to revisit this post in 2018 to see how the market reacted. For some perspective, in this blog back in December 2012, I wrote: “How Ethernet Won the West” where I predicted that both Fiber Channel and Infiniband would eventually disappear. Fiber Channel as a result of Fiber Channel over Ethernet (FCoE), which never really took off, and Infiniband because everyone else was abandoning it including Jim Cramer. Turns out while I’ve yet to be right about either, Cramer nailed it.  Since January 2013, adjusting for splits and dividends, Mellanox stock has dropped 14%.

Three Mellanox Marketing Misrepresentations

So Mellanox’s Connect-X 4 line of adapters is hitting the street, and as always tall tales are being told, or rather blogged, about the amazing performance of these adapters. As is Mellanox’s strategy, they intentionally position Infiniband numbers to imply that they are the same on Ethernet, which they’re not. Claims of 700 nanoseconds of latency, 100Gbps, and 150M messages per second. Wow, a triple threat: low latency, high bandwidth, and an awesome message rate. So where does this come from? How about the second paragraph of Mellanox’s own press release for this new product: “Mellanox’s ConnectX-4 VPI adapter delivers 10, 20, 25, 40, 50, 56 and 100Gb/s throughput supporting both the InfiniBand and the Ethernet standard protocols, and the flexibility to connect any CPU architecture – x86, GPU, POWER, ARM, FPGA and more. With world-class performance at 150 million messages per second, a latency of 0.7usec, and smart acceleration engines such as RDMA, GPUDirect, and SR-IOV, ConnectX-4 will enable the most efficient compute and storage platforms.” It’s easy to understand how one might actually think that all the above numbers also pertain to Ethernet, and by extension UDP and TCP. Nothing could be further from the truth.

From Mellanox’s own website on February 14, 2015: “Mellanox MTNIC Ethernet driver support for Linux, Microsoft Windows, and VMware ESXi are based on the ConnectX® EN 10GbE and 40GbE NIC only.” So clearly all the above numbers are INFINIBAND ONLY. Today, three months after the above press release, the fastest Ethernet Mellanox supports is still 40GbE, and only with their standard OS driver. This by design will always limit things like packet rate to 3-4Mpps and latency to somewhere around 10,000 nanoseconds, not 700. Bandwidth could be directly OS-limited, but I’ve yet to see that, so on these 100Gbps adapters Mellanox might support something approaching 40Gbps per port.

So let’s imagine that someday in the distant future the gang at Mellanox delivers an OS-bypass driver for the Connect-X 4 and that it does support 100Gbps. What we’ll see is that, like the prior versions of Connect-X, this is Mellanox’s answer to doing both Infiniband and Ethernet on the same adapter, a trick they picked up from the now-defunct Myricom, which achieved this back in 2005 by delivering both Myrinet and 10G Ethernet over the same Layer-1 media. This trick allows Mellanox to ship a single adapter that can be used with two totally different driver stacks to deliver Infiniband traffic over an Infiniband hardware fabric, or Ethernet over traditional switches, directly to applications or the OS kernel. This simplifies things for Mellanox, OEMs, and distributors, but not for customers.

Suppose I told you I had a car that could reach 330MPH in 1,000 feet. Pretty impressive. Would you expect that same car to work on the highway? Probably not. How about on a NASCAR track? No, because those who really know auto racing immediately realize I’m talking about a beast that burns five gallons of nitromethane in four seconds, yes, a 0.04MPG top-fuel dragster. This class of racing is analogous to High-Performance Computing (HPC), where Infiniband is king and the problem domain is extremely well defined. In HPC we measure latency using zero-byte packets and often attach adapters back to back, without a switch, to measure perceived network system latency. So while 700 nanoseconds of latency sounds impressive, it should be noted that no end-user data is passed during this test, just empty packets to prove the performance of the transport layer. In production you can’t actually use zero-byte packets because they’re simply the digital equivalent of sealed empty envelopes. Also, to see this 700 nanoseconds you’ll need to be running Infiniband on both ends, along with an Infiniband-supported driver stack that bypasses the operating system; note this DOES NOT support traditional UDP or TCP communications. And to get anything near 700 nanoseconds you have to be using Infiniband RDMA functions, back to back between two systems without a network switch, and with no real data transferred. It is a synthetic measurement of the fabric’s performance.

The world of performance Ethernet is more like NASCAR, where cars typically do 200MPH and run races measured in hundreds of miles around closed-loop tracks. Here the cars have to shift gears, brake, run for extended periods of time, refuel, handle rapid tire changes and maintenance during the race, and so on. This is not the same as running a top-fuel dragster once down a straight 1,000-foot track. The problem is that Mellanox is notorious for quoting their top-fuel dragster Infiniband HPC numbers to potential NASCAR-class high-performance Ethernet customers, believing many will NEVER know the difference. Several years ago Mellanox had its own high-performance OS-bypass Ethernet stack that supported UDP and TCP, called VMA (Voltaire Messaging Accelerator), but it was so fraught with problems that they spun it off as an open-source project in the fall of 2013. They had hoped the community might fix its problems, but since then it has seen little if any development (15 posts in as many months). So seeing 700-nanosecond-class half-round-trip UDP or TCP latency from Mellanox anytime in the near future would be very surprising.
Let’s attack misrepresentation number two, an actual Ethernet throughput of 100Gbps. This one is going to be a bit harder without an actual adapter in my hand to test, but just looking at the data sheet, several things jump out. First, ConnectX-4 uses a 16-lane PCIe Gen3 bus, which typically should have an effective unidirectional PCIe data throughput of 104Gbps. On the surface this looks good. There may be an issue under the covers, though, because when this adapter is plugged into a state-of-the-art Intel Haswell server, the PCIe slot maps to a single processor. You can send traffic from this adapter to the other CPU, but it first must go through the CPU it’s connected to. So, sticking to one CPU, the best Haswell processor has two 20-lane QPIs with an effective combined unidirectional transfer speed of 25.6GB/sec. Note that this serves all 40 PCIe lanes combined; the ConnectX-4 only has 16 lanes, so proportionally about 10.2GB/sec is available, which is only about 82Gbps. Maybe they could sustain 100Gbps, but on the surface this number appears somewhat dubious. These numbers should also limit Infiniband’s top-end performance for this adapter.
Finally, we have my favorite misrepresentation, 150M messages per second. Messages is an HPC term, and most people thinking in Ethernet terms will translate this to 150M packets per second. A 10GbE link has a theoretical maximum packet rate of 14.88Mpps (derived below). There is no way their Ethernet driver for the ConnectX-4 could ever support this packet rate; even if they had a really great OS-bypass driver, I’d be highly skeptical. This is analogous to saying you have an adapter capable of providing lossless Ethernet packet capture on ten 10GbE (14.88Mpps/link) links at the same time. Nobody today, even the best FPGA NICs that cost 10X this price, will claim this.
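For anyone wondering where 14.88Mpps comes from, it falls out of the minimum 64-byte frame plus the mandatory 8 bytes of preamble/SFD and 12 bytes of inter-frame gap on the wire:

```latex
\frac{10 \times 10^{9}\ \mathrm{b/s}}{(64 + 8 + 12)\ \text{bytes} \times 8\ \text{bits/byte}}
= \frac{10 \times 10^{9}}{672\ \text{bits}} \approx 14.88\ \text{Mpps}
```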
Let’s humor Mellanox, though, and buy into the fantasy; here is the reality that will creep back in. On Ethernet we often say the smallest packet is 64 bytes, so 150Mpps * 64 bytes/packet * 8 bits/byte is 76.8Gbps, which is less than the 82Gbps we mentioned above, so that’s good. There are a number of clever tricks that can be used to bring this many packets into the host CPU and up to user space while optimizing the use of the PCIe bus, but more often than not these require that the NIC firmware be tuned for packet capture, not generic TCP/UDP traffic flow. Let’s return to the Intel Haswell E5-2699 with 18 cores at 2.3GHz. Again, for performance, we’ll steer all 150Mpps into the single Intel socket supporting this Mellanox adapter. For peak performance we want to ensure that packets are going to extremely quiet cores, because we know that both OS and BIOS settings can create system jitter, which kills performance and determinism. So we profile this CPU and find the 15 least busy cores, those with NOTHING going on. Now, if we assume Mellanox were to have an OS-bypass UDP/TCP stack that supported a round-robin method for doling out a flood of 64-byte packets, this would mean 10Mpps/core, or 100 nanoseconds per packet to do something useful with each packet. That’s roughly 230 clock ticks at 2.3GHz on Intel’s best processor. Unless you’re hand-coding in assembler, it’s going to be very hard to get much done in that budget.
So when Mellanox begins talking about supporting 25GbE, 50GbE, or 100GbE, you need only remember one quote from their website: “Mellanox MTNIC Ethernet driver support for Linux, Microsoft Windows and VMware ESXi are based on the ConnectX® EN 10GbE and 40GbE NIC only.” So please don’t fall for the low-latency, high-bandwidth, or packet-rate Mellanox Ethernet hype; it’s just hogwash.

Update: on March 2, 2015, Mellanox posted an Ethernet-only press release claiming this adapter supported 100GbE, and that using the DPDK interface in testing they could achieve 90Gbps at 75Mpps over the 100G link (roughly wire rate for 128-byte packets).

The Mummy in the Datacenter

This article was originally published in November of 2008 at 10GbE.net.

While Brendan Fraser travels China in his latest quest to terminate yet another mummy, IT leaders are starting to wonder if they’ve got a mummy of their own haunting their raised floor. This mummy is easy to find: he’s wrapped in thick black copper cables, and his long fingers may be attached to many of your servers. It is Infiniband!
 
Once praised as the next-generation networking technology, having conquered High-Performance Computing, it continued its battle for world networking domination by attacking storage and now the data center. It promised you 20Gbps, hinted that it would soon offer 40Gbps, and shared its plans for 160Gbps! It claimed full bisection, the ability to use all the network capacity available, and low latency (the time it takes to actually move a packet of data around). It’s democratic: the software stack was developed by an “open” committee of great technological leaders, so it MUST be good for us. Everyone from HP to SGI has sung its praises whenever they’ve come by to peddle the latest in server technology. A corpse wrapped in rags, a centuries-old immortal Dragon Emperor, or a black-cable bandit, they all can be eradicated.
 
We will tear this black-cable bandit down to size one claim at a time. First, they assert that it’s 20Gbps; how about 12Gbps on its best day, with all the electrons flowing in the same direction? Infiniband employs what is known as 8b/10b encoding to put the bits on the wire. For every 10 signal bits, there are 8 useful data bits. Ethernet uses the same method; the difference is that Ethernet for the past 30 years has advertised the actual data rate, the 8, while Infiniband promotes the 25% larger and useless signal rate, the 10. Using Infiniband math, Ethernet would be 12.5Gbps instead of the 10Gbps it actually is. So using Ethernet math, Infiniband’s Double Data Rate (DDR) is actually only 16Gbps and not the 20Gbps they claim. But wait, there’s more! I said earlier that you will only get 12Gbps under ideal conditions; where did the other 4Gbps go? Today most servers use PCIe 1.1 8-lane I/O slots. Ideally these are 16Gbps slots, but once you add in PCIe overhead you only get about 12Gbps on the best of systems. So, with a straight face, they sell you 20Gbps knowing in their heart you’ll never get more than 12Gbps.
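Spelled out, the arithmetic looks roughly like this (approximate figures; PCIe 1.1 signals at 2.5GT/s per lane and uses the same 8b/10b encoding):

```latex
20\ \mathrm{Gb/s}\ \text{signal} \times \tfrac{8}{10} = 16\ \mathrm{Gb/s}\ \text{of data (IB DDR 4x)}
\qquad
8\ \text{lanes} \times 2.5\ \mathrm{GT/s} \times \tfrac{8}{10} = 16\ \mathrm{Gb/s}\ \text{raw PCIe 1.1 x8}
\;\Longrightarrow\; \approx 12\ \mathrm{Gb/s}\ \text{after PCIe protocol overhead}
```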
 
Full bisection, the ability for a network of servers to use all the network fabric available. Infiniband claims that using their architecture and switches you can leverage the ENTIRE network fabric under the right circumstances. On slides this might be true, but in the real world it’s impossible. Infiniband is statically routed, meaning that packets from server A to server X have only one fixed, predetermined path they can travel. One of the nation’s largest labs proved that on a 1,152-server Infiniband network, static routing was only 21% efficient and delivered on average 263MB/sec (2.1Gbps of the theoretical 10Gbps possible). So when they tell you full bisection, ask them why LLNL only saw 21%. In an IEEE paper presented last week, it was shown that a statically routed system cannot achieve greater than 38% efficiency. Now, some of the really savvy Mummy supporters will say that the latest incantation of Infiniband has adaptive routing; they do this with yet another shell game, redefining the term adaptive routing to mean more than one static route. Real adaptive routing and using a pair of static routes are vastly different things. Real adaptive routing can deliver 77% efficiency on 512 nodes and nearly 100% efficiency on clusters smaller than 512 nodes. If you want full bisection for more than a 16-node cluster, talk with Myricom or Quadrics; they do real adaptive routing.
 
Latency is the time it takes to move a packet from one application on a networked server to another application on a different server on the same network. Infiniband has always positioned itself as being low latency. Typically Infiniband advertises a latency of roughly three microseconds between two NICs, using zero-byte packets. Well, in the past year 10GbE NICs and switches have come onto the market that can achieve similar performance. Arista’s switches measure latency in a few hundred nanoseconds, while Cisco’s latest 10GbE switches are sub-four microseconds, compared to prior generations that were measured in the tens of microseconds or more. Now, when the Infiniband crowd crows about low-latency switching, ask them about Arista or BLADE Network Technologies 10GbE switches.
 
Infiniband claims 20Gbps and delivers less than 12Gbps. Infiniband claims full bisection yet beyond a small network they can’t exceed 38% efficiency. Infiniband claims low latency and now 10GbE can match it. Where is their value proposition in the data center?