Four Failings of RoCE

Recently someone suggested that I watch a rather informative video on how Microsoft Research had attempted to make RDMA over Converged Ethernet (RoCE) lossless. Unbelievably, this video exposes and documents several serious flaws in the design of RoCE. It also appears they've replaced the word "Converged" with "Commodity," to drive home the message that RoCE doesn't require anything special to run on regular old Ethernet. Here are the four points I took away from the video; please let me know your take:

  • RDMA Livelock: This is a simple retransmission problem. Since RDMA was architected for a lossless, deterministic local bus, no accommodation was ever made for dropped packets; they simply didn't happen on a bus. Ethernet, on the other hand, was designed to expect loss (remember vampire taps?). Livelock occurs when a message composed of multiple packets experiences a dropped packet somewhere in the middle. At that point, RDMA has to start over from the first packet and retransmit the whole message. If that message were a multi-megabyte frame of video, this retransmit approach would livelock a network. So what was Microsoft's solution? Rewrite the RDMA stack's retransmit logic to resend only from the dropped packet onward (this is what TCP does); see the sketch after this list. Good luck, who's got this action item?
  • Priority Flow Control (PFC) Deadlock: This happens when switches encounter incomplete ARP packets. Microsoft's solution is a call for more research, plus filtering of incomplete ARP packets. More to-dos, and this one lands on all the switch vendors.
  • NIC PFC Storm: It seems that the firmware in some RoCE NICs has bugs that create pause frame storms. Beyond NIC vendors fixing those bugs, they also suggest that NIC and switch vendors add new software to detect oncoming storms and shut them down. Great idea, another to-do for the anonymous NIC and switch providers.
  • Slow Receiver NICs: These generate excessive pause frames because their RDMA architecture relies on a second-level, host-based translation table to fetch the destination memory address. Oh my god, this is how you design an HPC NIC? Seriously, how cheap can you be? Make the lookup tables bigger; Myricom addressed this problem back in the 1990s. On some RoCE NICs it apparently doesn't take that many receivers of kernel-bypass packets before the NIC has to go off-chip to the host for destination memory address lookups.
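
To make the livelock point concrete, here is a minimal sketch, in Python and purely illustrative (the message size and drop probability are assumed, and this is not any vendor's actual retransmit logic), comparing the two strategies: restarting the whole message from the first packet on every drop versus resuming from the dropped packet the way TCP-style recovery does.

```python
import random

def packets_on_wire(num_packets, drop_prob, restart_from_zero, seed=1):
    """Count packets transmitted to deliver one multi-packet message."""
    rng = random.Random(seed)
    sent = 0
    delivered = 0                        # progress made in the current attempt
    while delivered < num_packets:
        start = 0 if restart_from_zero else delivered
        delivered = start
        for seq in range(start, num_packets):
            sent += 1
            if rng.random() < drop_prob:
                break                    # drop detected; this attempt ends here
            delivered = seq + 1          # this packet arrived intact
    return sent

msg_packets = 4096   # a large message split into packets (assumed size)
p_drop = 0.001       # per-packet drop probability (assumed)
for restart, label in ((True, "restart whole message"), (False, "resume at the drop")):
    print(f"{label}: {packets_on_wire(msg_packets, p_drop, restart)} packets sent")
```

Even with a modest drop rate, the restart-everything strategy puts many times more packets on the wire than resuming at the drop, which is how one congested link can turn into a livelocked network.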

As the speaker closes out the discussion, he says, "This experiment shows that even with RDMA, low latency and high throughput cannot be achieved at the same time, as network congestion can cause queues to build up in the network." Anyone who has done this for a while knows that low latency and high bandwidth are mutually exclusive. That's why High-Performance Computing (HPC) benchmarks often start with zero-byte packets and then scale up, to demonstrate how latency increases in proportion to packet size (a quick model of this follows below).
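
That proportionality is easy to see with a back-of-the-envelope model: the latency of a message is roughly a fixed per-message overhead plus the time to serialize the payload onto the wire, which is why zero-byte tests isolate the fixed overhead. The numbers below are illustrative assumptions for a 10GbE-class link, not measurements.

```python
# latency(n) ~= fixed overhead + n / line rate; all numbers are assumptions.
FIXED_OVERHEAD_US = 1.0           # NIC + stack + switch overhead, in microseconds
LINE_RATE_BYTES_PER_US = 1250.0   # 10 Gb/s is roughly 1250 bytes per microsecond

def one_way_latency_us(payload_bytes):
    return FIXED_OVERHEAD_US + payload_bytes / LINE_RATE_BYTES_PER_US

for size in (0, 64, 1500, 9000, 65536):
    print(f"{size:>6} bytes -> {one_way_latency_us(size):8.2f} us")
```

At zero bytes the result is essentially all overhead; at 64 KB the serialization time dominates, which is the trade-off the speaker is describing.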

All of the above aside, this important question remains: why would anyone map a protocol like RDMA, which was designed for use on a lossless local bus, onto a switched network and think it would work? A local lossless bus is very deterministic, and RDMA's requirements are bound to that lossless nature and predictable performance. Conversely, Ethernet was designed from the beginning to expect and accommodate loss, and performance has always been secondary to packet delivery. That's why Ethernet performance is non-deterministic. Resilience, not performance, was the primary design criterion behind the TCP/IP protocols DARPA mandated to ensure our military's network would remain functional at all costs.

Soon Solarflare will begin shipping ScaleOut Onload free with all of its 8000-series NICs, some of which sell for under $300 USD. With ScaleOut Onload, TCP gains all the kernel-bypass tricks RDMA offers, but with the benefits and compatibility of sockets-based TCP and no code changes. Furthermore, it delivers the performance of RDMA with much better reliability and availability than RoCE.

P.S. Mellanox just informed me that the NIC specific issues mentioned above were corrected some time ago in their ConnectX-4 series cards.

RoCE vs TCP for Low-Latency Apps

The effectiveness of our communication as a species is one of our defining characteristics. Earlier this week, while waiting in a customer's lobby in Chicago, I noticed four framed posters displaying all the hand signals used in the trading pits of four major markets. Having been focused on electronic trading for the past decade, I found this "ancient" form of communication an instant curiosity worthy of inspection. On reflection, I was amazed to think that trillions of dollars in transactions over decades had been conducted purely by people motioning with their hands.

About a decade ago in the High-Performance Computing (HPC) market, a precursor market for High-Frequency Trading (HFT), there was a dust-up regarding the effectiveness of Remote Direct Memory Access (RDMA). One of Myricom's senior researchers wrote an article for HPCWire titled "A Critique of RDMA" that set off a chain reaction of critical response articles.

At the time, Myricom was struggling to establish relevance for its new Myrinet-10G protocol against a competing technology, Infiniband, which was rapidly gaining traction. To be fair, at the time I was in sales at Myricom. The crux of the article was that the one-sided RDMA communications model, which rose from the ashes of the Virtual Interface Architecture (VIA), was still more of a problem than a solution compared to the existing two-sided Send/Recv model used by four other competing HPC protocols (QsNet, SeaStar, Infinipath & Myrinet Express).
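
For readers who didn't live through that debate, here is a minimal conceptual sketch, in plain Python rather than any verbs API, of what one-sided versus two-sided means: a Send/Recv transfer requires the receiver to post a buffer and be notified, while a one-sided RDMA write lands directly in memory the receiver registered and advertised in advance, which is exactly why the sender has to know the remote address up front. The class and method names are illustrative, not real RDMA calls.

```python
# Toy model of the two communication styles; not an RDMA implementation.

class TwoSidedEndpoint:
    """Send/Recv: the receiver posts buffers and is notified when data arrives."""
    def __init__(self):
        self.posted = []          # receive buffers posted by the application
        self.completions = []     # arrival notifications the app can poll

    def post_recv(self, buf):
        self.posted.append(buf)

    def deliver(self, data):
        buf = self.posted.pop(0)  # matching is explicit: no posted buffer, no delivery
        buf[:len(data)] = data
        self.completions.append(len(data))

class OneSidedWindow:
    """RDMA write: the sender places data directly into pre-registered remote memory."""
    def __init__(self, size):
        self.memory = bytearray(size)   # registered and advertised to the sender up front

    def rdma_write(self, offset, data):
        self.memory[offset:offset + len(data)] = data   # receiver CPU never involved

# Two-sided: the receiver must have posted a buffer first.
rx = TwoSidedEndpoint()
rx.post_recv(bytearray(16))
rx.deliver(b"hello")
print("two-sided completions:", rx.completions)

# One-sided: the sender needs the remote "address" (here, an offset) ahead of time.
win = OneSidedWindow(16)
win.rdma_write(0, b"hello")
print("one-sided memory:", bytes(win.memory[:5]))
```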

Now RDMA has had a decade to improve as it spread from Infiniband to Ethernet under the name RDMA over Converged Ethernet (RoCE), but it still has performance issues. The origin of RDMA is a closed, lossless, layer-2 Infiniband network with deterministic latency. Let's take a moment and adopt a NASCAR analogy: think of RDMA as the vehicle and Infiniband as the track. One can take a Sprint Cup Series vehicle tuned for Charlotte Motor Speedway out for a spin on the local roads, but is that really practical (it certainly isn't legal)? Yes, its origin is the stock car, but how well will it do in stop-and-go traffic, particularly on uphill grades? How about parallel parking? Oh wait, there's no reverse. Tight turns at low speeds, signaling, weather, and so on. Sprint Cup Series vehicles are designed for 200 MPH on a closed, extremely well-defined and well-maintained course. Ethernet, by contrast, is the road driven by everyone else: it's unpredictable, full of obstacles, and ever-changing.

Those familiar with Ethernet know that losslessness and deterministic latency are not characteristics normally associated with this network fabric. Some of us have been around the block and lived through Carrier Sense Multiple Access with Collision Detection (CSMA/CD), where packets often collided and random delays before retransmission attempts were common. TCP/IP was developed during those early days, and it was designed with packet loss as a key criterion. In the past three decades Ethernet has evolved considerably from its roots as a shared coax cable utilizing vampire taps to today's dedicated twisted-pair cabling and fiber optics, but on rare occasions packets are still dropped, and performance isn't always deterministic. Today most packet drops are the result of network congestion. As discussed, TCP/IP is equipped to handle this; unfortunately, RoCE is not.

For RoCE to perform properly it requires a lossless layer-2 network, essentially a NASCAR track overlaid onto our public roads. To accomplish this over a switched Ethernet network, a new set of standards was developed: Data Center Bridging (DCB), along with its capabilities-exchange protocol (DCBX). DCB is used at every hop of the network to negotiate and create a lossless layer-2 fabric on top of Ethernet. It achieves this by more tightly managing queue overflows and by adjusting network flow priorities as if they were traversing separate physical media. In essence, RoCE traffic is prioritized into its own carpool lane ahead of other traffic, in hopes of avoiding drops caused by congestion. While this all sounds great, in talking with several large Web 2.0 customers who've invested years in RoCE, we learned that the vast majority will never deploy it in production. There are far too many challenges to getting and keeping it working, and under high traffic volumes it suffers. Unlike Infiniband HPC clusters, which are stood up as self-contained networks (closed-course race tracks) to address specific computational problems, Ethernet networks are in a constant state of flux, with servers and switches being added and removed (our public road system) as the needs of the business change. To be clear, TCP/IP is resilient to packet loss, while RoCE is not.
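
To illustrate the carpool-lane idea, here is a toy model of per-priority flow control: when the queue for the lossless class crosses a high-water mark, the switch asks the sender to pause just that class, and it resumes once the queue drains. The thresholds and arrival rates below are made-up illustrations, not anything from a real switch.

```python
# Toy per-priority pause model; thresholds and sizes are illustrative assumptions.

class PriorityQueue:
    def __init__(self, xoff=80, xon=40, capacity=100):
        self.depth, self.xoff, self.xon, self.capacity = 0, xoff, xon, capacity
        self.paused = False
        self.dropped = 0

    def enqueue(self, pkts):
        if self.depth + pkts > self.capacity:
            self.dropped += self.depth + pkts - self.capacity
            pkts = self.capacity - self.depth
        self.depth += pkts
        if self.depth >= self.xoff:
            self.paused = True          # ask the sender to pause this priority only

    def drain(self, pkts):
        self.depth = max(0, self.depth - pkts)
        if self.depth <= self.xon:
            self.paused = False         # tell the sender to resume

roce_lane = PriorityQueue()
for step in range(10):
    if not roce_lane.paused:
        roce_lane.enqueue(30)           # bursty arrivals while unpaused
    roce_lane.drain(12)                 # slower drain toward a congested port
    print(f"step {step}: depth={roce_lane.depth:3d} paused={roce_lane.paused} "
          f"dropped={roce_lane.dropped}")
```

Even in the toy, the lossless class avoids drops only by spending much of its time paused, which is one reason the carpool lane suffers under heavy traffic.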

On the latency side of things, in the past decade we've achieved roughly one microsecond for a half round trip (a send plus a receive) with both TCP and UDP when using Solarflare's OpenOnload. This is in line with RoCE latency, which is also on the order of one microsecond. Keep in mind that normal TCP or UDP transactions over 10GbE typically run in the range of 5 to 15 microseconds, so one microsecond is a huge improvement. By now you're likely saying, "So what?" For most applications, like file sharing and databases, the difference between one microsecond and even fifteen microseconds is lost in the 10,000+ microseconds a whole transaction might take. It turns out, though, that there are new breeds of latency-sensitive applications built around technologies like Non-Volatile Memory Express (NVMe), neural networks, and high-volume compound web transactions that can see significant improvements when network latency is reduced. When low-latency TCP is applied to these problems, the performance gains are both measurable and significant.
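
Half-round-trip figures like these usually come from a ping-pong test: two processes bounce a small message back and forth, the initiator times many iterations, and the total is divided by two. Below is a bare-bones sketch using ordinary kernel sockets, so expect results in the 5 to 15 microsecond range rather than the kernel-bypass numbers; the host, port, and message size are placeholder assumptions.

```python
import socket, time

HOST, PORT, ITERS, MSG = "127.0.0.1", 9000, 10000, b"x" * 64   # assumed values

def server():
    with socket.create_server((HOST, PORT)) as srv:
        conn, _ = srv.accept()
        with conn:
            for _ in range(ITERS):
                data = conn.recv(len(MSG))   # small message; assume it arrives whole
                conn.sendall(data)           # echo straight back

def client():
    with socket.create_connection((HOST, PORT)) as sock:
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
        start = time.perf_counter()
        for _ in range(ITERS):
            sock.sendall(MSG)
            sock.recv(len(MSG))              # wait for the echo
        elapsed = time.perf_counter() - start
        print(f"~{elapsed / ITERS / 2 * 1e6:.1f} us per half round trip")

# Run server() in one process and client() in another (e.g. two terminals).
```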

So the next time someone suggests RoCE, ask if they've considered a little-known competing protocol called TCP/IP. While RoCE is the shiny new object, TCP/IP has had several decades of innovation behind it, which explains why it's the underlying "language of the Internet." Consider asking those promoting RoCE what their porting budget is, and whether they've factored in the cost of the new network switches required to support DCB. It's very likely that the application they want to deploy already supports TCP/IP, and if latency and throughput are key factors, consider contacting Solarflare about OpenOnload. OpenOnload accelerates existing sockets-based applications without requiring any modifications to them.