Four Failings of RoCE

Recently someone suggested that I watch this rather informative video of how Microsoft Research had attempted to make RDMA over Converged Ethernet (RoCE) lossless. Unbelievably, this video exposes and documents several serious flaws in the design of RoCE. Also, it appears they’ve replaced the word “Converged” with “Commodity,” presumably to send the message that RoCE doesn’t require anything special to run on regular old Ethernet. Here are the four points I got out of the video; please let me know your take:

RDMA Livelock: This is a simple retransmission problem. Since RDMA was architected for a lossless, deterministic local bus, accommodations were never made for dropped packets, as they just didn’t happen on a bus. Ethernet, on the other hand, was designed to expect loss; remember vampire taps. Livelock occurs when a message composed of multiple packets loses a packet somewhere in the middle. At that point, RDMA has to start over from the first packet and retransmit the whole message. If this was a multi-megabyte frame of video, this retransmit approach will livelock a network. So what was Microsoft’s solution? Rewrite the RDMA stack’s retransmit logic to retransmit on drop detection (this is what TCP does). Good luck; who’s got this action item?
Priority Flow Control (PFC) Deadlock: This happens when switches encounter incomplete ARP packets. Microsoft’s solution is a call for more research, and to filter incomplete ARP packets. More to-dos, and this one is on all the switch vendors.
NIC PFC Storm: It seems that the firmware in some RoCE NICs has bugs that create Pause Frame storms. Beyond NIC vendors fixing those bugs, they also suggest that NIC and switch vendors include extra new software to detect oncoming storms and shut them down. Great idea, another to-do for the anonymous NIC and switch providers.
Slow Receiver: Some NICs generate excessive pause frames because of their crappy RDMA architecture, which relies on second-level, host-based translation tables to fetch the destination memory address. Oh, my god, is this how you design an HPC NIC? Seriously, how cheap can you be? Make the lookup tables bigger; Myricom addressed this problem back in the 1990s. It appears that on some RoCE NICs it’s not that hard to have so many receivers of kernel-bypassed packets that the NIC must go off-NIC for the destination memory address lookups.
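
To make the livelock point concrete, here is a minimal Python sketch comparing the two retransmit strategies described in the first item: restarting the whole message after a drop versus resending from the dropped packet. Everything in it is an assumption for illustration, not anything taken from the video: the function name, the 4096-packet message, the deterministic "every 256th packet is dropped" loss pattern, and the simplification that the sender learns of a drop immediately.

```python
def packets_sent(message_packets, drop_every, restart_from_zero, max_sends=100_000):
    """Count packets put on the wire to deliver one message.

    Hypothetical model: every `drop_every`-th transmitted packet is lost,
    and the sender detects the loss immediately.
    Returns None if the message is still undelivered after `max_sends`.
    """
    sent = 0       # total packets transmitted so far
    next_seq = 0   # next packet of the message the receiver expects
    while next_seq < message_packets:
        if sent >= max_sends:
            return None              # gave up: the transfer is livelocked
        sent += 1
        if sent % drop_every == 0:   # this transmission is dropped
            if restart_from_zero:
                next_seq = 0         # start the whole message over
            # otherwise next_seq is unchanged: resend only the lost packet
        else:
            next_seq += 1            # packet delivered in order
    return sent


if __name__ == "__main__":
    # Assumed sizes: a message of 4096 packets, one drop per 256 packets sent.
    print("restart from first packet:", packets_sent(4096, 256, True))
    print("resend from dropped packet:", packets_sent(4096, 256, False))
```

Under these assumptions the restart-from-zero sender never gets 4096 packets through in a row, so it keeps the link busy without ever finishing, while the resend-from-drop sender finishes after paying one extra packet per drop.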

As the speaker closes out the discussion, he says, “This experiment shows that even with RDMA low latency and high throughput cannot be achieved at the same time as network congestion can cause queues to build up in the network.” Anyone who has done this for a while knows that low latency and high bandwidth are mutually exclusive. That’s why High-Performance Computing (HPC) tests often start with zero-byte packets and then scale up to demonstrate how latency increases with packet size.
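
As a rough illustration of why those HPC tests show latency climbing with packet size, the sketch below adds serialization time (bits divided by link rate) to a fixed base latency. The 10 Gb/s link speed and the 1 microsecond base latency are assumed round numbers, not measurements from the video or from any NIC, and the function name is hypothetical.

```python
def wire_latency_us(payload_bytes, link_gbps=10.0, base_us=1.0):
    """Estimate one-way latency in microseconds for a single packet.

    base_us is an assumed fixed cost (NIC, stack, propagation);
    the second term is the time needed to clock the bits onto the wire.
    """
    serialization_us = payload_bytes * 8 / (link_gbps * 1000)
    return base_us + serialization_us


if __name__ == "__main__":
    for size in (0, 64, 512, 1500, 9000):
        print(f"{size:>5} bytes -> {wire_latency_us(size):.3f} us")
```

Even with no congestion at all, the serialization term grows linearly with packet size, which is the baseline effect those zero-byte-and-up tests expose; queuing in a congested network only adds to it.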

All the above aside, this important question remains: why would anyone map a protocol like RDMA, which was designed for use on a lossless local bus, to a switched network and think that this would work? A local lossless bus is very deterministic, and it has requirements bound to its lossless nature and predictable performance. Conversely, Ethernet was designed from the beginning to expect and accommodate loss, and performance has always been secondary to packet delivery. That’s why Ethernet performance is non-deterministic. The resilience of Ethernet, not performance, was the primary design criterion DARPA had mandated to ensure our military’s network would remain functional at all costs.

Soon Solarflare will begin shipping ScaleOut Onload free with all their 8000 series NICs, some of which sell for under $300 USD. With ScaleOut Onload, TCP now has all the kernel bypass tricks RDMA offers, but with all the benefits and compatibility of sockets-based TCP and no code changes. Furthermore, it delivers the performance of RDMA, but with much better reliability and availability than RoCE.

P.S. Mellanox just informed me that the NIC specific issues mentioned above were corrected some time ago in their ConnectX-4 series cards.

3 thoughts on “Four Failings of RoCE”

Alex Shpiner says:

June 8, 2017 at 1:51 PM

I will be happy to inform the author that Mellanox solved all these four “failings” in its ConnectX cards of two generations ago. Moreover, the second issue is caused by incorrect behavior of another vendor’s switch.

Motti Beck says:

June 12, 2017 at 4:08 PM

Your blog includes inaccurate data. The fact is that the market acceptance of RoCE continues to grow exponentially and RoCE-enabled networking solutions have been massively used in ultra-large deployments, among them at Microsoft Azure. You may want to read my (Moti Beck) blog “Enabling Higher Azure Stack Efficiency – Networking Matters” (just do a simple search and you’ll find it).

Also, users can run RoCE over lossy fabric today, and using PFC isn’t a must anymore. You may want to read John Kim’s blog “Resilient RoCE Relaxes RDMA Requirements”.

There are other misleading statements you made, so I would encourage you to get the latest status and you may want to “refresh” your blog.

- scottcschweitzer says:
  
  June 12, 2017 at 10:52 PM
  
  The data in that blog entry came directly from Microsoft’s own video, which I referenced in the opening sentence. It has also been backed up by several customers with web infrastructures similar in size to Microsoft’s, all household names. Mellanox, your employer, is the primary company driving RoCE. Let’s be honest, Mellanox realized years ago that InfiniBand was never going mainstream as it’s a bus-based architecture, so they crafted RoCE. This was Mellanox’s way to co-opt Ethernet and grow their market outside of HPC. RoCE has serious foundational architectural flaws owing to it being ported from a bus to a switched network. How many years did it take Microsoft to get RoCE working? Answer: several. It’s not ready for prime time.