RoCE vs TCP for Low-Latency Apps

The effectiveness of our communication as a species is one of our defining characteristics. Earlier this week, while waiting in a customer’s lobby in Chicago, I noticed four framed posters displaying all the hand signals used in the trading pits of four major markets. Having been focused on electronic trading for the past decade, I found this “ancient” form of communication an instant curiosity worthy of inspection. On reflection, I was amazed to think that trillions of dollars in transactions over decades had been conducted purely by people motioning with their hands.

About a decade ago in the High-Performance Computing (HPC) market, a precursor to the High-Frequency Trading (HFT) market, there was a dust-up regarding the effectiveness of Remote Direct Memory Access (RDMA). One of Myricom’s senior researchers wrote an article for HPCWire titled “A Critique of RDMA” that set off a chain reaction of critical response articles.

At the time Myricom was struggling to establish relevance for its new Myrinet-10G protocol against a competing technology, Infiniband, which was rapidly gaining traction. To be fair, I was in sales at Myricom at the time. The crux of the article was that the one-sided RDMA communications model, which rose from the ashes of the Virtual Interface Architecture (VIA), was still more of a problem than a solution when compared to the existing two-sided Send/Recv model used by four other competing HPC protocols (QsNet, SeaStar, Infinipath & Myrinet Express).

Now RDMA has had a decade to improve as it spread from Infiniband to Ethernet under the name RDMA over Converged Ethernet (RoCE), but it still has performance issues. RDMA’s origins lie in a closed, lossless, layer-2 Infiniband network with deterministic latency. Let’s take a moment and adopt a NASCAR analogy: think of RDMA as the vehicle and Infiniband as the track. One can take a Sprint Cup Series vehicle tuned for the Charlotte Motor Speedway out for a spin on the local roads, but is that really practical (it certainly isn’t legal)? Yes, its origin is the stock car, but how well will it do in stop-and-go traffic, particularly on uphill grades? How about parallel parking? Oh wait, there’s no reverse. Then there are tight turns at low speeds, signaling, weather, and so on. Sprint Cup Series vehicles are designed for 200 MPH on a closed, extremely well defined and maintained course. Ethernet, by contrast, is the road driven by everyone else: it’s unpredictable, littered with thousands of obstacles, and ever changing.

Those familiar with Ethernet know that losslessness and deterministic latency are not characteristics normally associated with this network fabric. Some of us have been around the block and lived through Carrier Sense Multiple Access with Collision Detection (CSMA/CD), where packets often collided and random delays before retransmission attempts were common. TCP/IP was developed during those early days, and it was designed with packet loss as a key criterion. In the past three decades Ethernet has evolved considerably from its roots as a shared coax cable utilizing vampire taps to where we are today with dedicated twisted-pair cabling and fiber optics, but on rare occasion packets are still dropped, and performance isn’t always deterministic. Today most packet drops are the result of network congestion. As discussed, TCP/IP is equipped to handle this; unfortunately, RoCE is not.

For RoCE to perform properly it requires a lossless layer-2 network: essentially a NASCAR track overlaid onto our public roads. To accomplish this over an Ethernet network a new set of standards was developed: Data Center Bridging (DCB), along with its Capabilities Exchange protocol (DCBX). DCB is used at every hop of the network to negotiate and create a lossless layer-2 fabric on top of Ethernet. It achieves this by more tightly managing queue overflows and by adjusting network flow priorities as if they were traversing separate physical media. In essence, RoCE traffic is prioritized into its own carpool lane ahead of other traffic in the hope of avoiding drops caused by congestion. While this all sounds great, in talking with several large Web 2.0 customers who’ve invested years in RoCE we learned that the vast majority will never deploy it in production. There are far too many challenges to get it working and keep it working, and under heavy traffic it suffers. Unlike Infiniband HPC clusters, which are stood up as self-contained networks (closed-course race tracks) to address specific computational problems, Ethernet networks are in a constant state of flux, with servers and switches being added and removed (our public road system) as the needs of the business change. To be clear: TCP/IP is resilient to packet loss, while RoCE is not.
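To make the “carpool lane” idea concrete, here is a toy sketch, not a real switch implementation, of how a Priority Flow Control style lossless class behaves differently from ordinary best-effort queuing under congestion. The queue depths, thresholds, and traffic pattern are made-up numbers for illustration only: the lossless class pauses its sender instead of dropping, while best-effort traffic is simply dropped and left for TCP to recover.

```python
# Toy illustration (not a real switch): a PFC-style lossless class vs. ordinary
# best-effort queuing under congestion. All numbers are invented for the example.
from collections import deque

QUEUE_DEPTH = 8        # per-class buffer slots in our toy switch port
PAUSE_THRESHOLD = 6    # lossless class asks its sender to pause at this depth

class ToyPort:
    def __init__(self):
        self.lossless = deque()     # "carpool lane": RoCE-style priority class
        self.best_effort = deque()  # ordinary traffic class
        self.paused = False         # whether we've told the lossless sender to stop
        self.dropped = 0

    def ingress(self, pkt, lossless):
        if lossless:
            # The lossless class never drops: it pauses the upstream sender instead.
            self.lossless.append(pkt)
            self.paused = len(self.lossless) >= PAUSE_THRESHOLD
        else:
            # Best-effort traffic is simply dropped when the buffer is full;
            # TCP is designed to detect and recover from exactly this.
            if len(self.best_effort) < QUEUE_DEPTH:
                self.best_effort.append(pkt)
            else:
                self.dropped += 1

    def egress(self):
        # Strict priority: drain the lossless class first.
        if self.lossless:
            pkt = self.lossless.popleft()
            self.paused = len(self.lossless) >= PAUSE_THRESHOLD
            return pkt
        return self.best_effort.popleft() if self.best_effort else None

port = ToyPort()
for i in range(20):
    if not port.paused:                       # sender honors the pause signal
        port.ingress(f"roce-{i}", lossless=True)
    port.ingress(f"tcp-{i}", lossless=False)
    if i % 3 == 0:                            # egress link is slower than the offered load
        port.egress()
print("best-effort drops:", port.dropped, "| lossless drops: 0 by construction")
```

Even in this toy model the trade-off is visible: keeping one class strictly lossless shifts the congestion onto everything else sharing the port.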

On the latency side of things, over the past decade we’ve achieved roughly one microsecond for a half round trip (a send plus a receive) with both TCP and UDP when using Solarflare’s OpenOnload. This is in line with RoCE latency, which is also in the domain of one microsecond. Keep in mind that normal TCP or UDP transactions over 10GbE typically run in the range of 5 to 15 microseconds, so one microsecond is a huge improvement. By now you’re likely saying “So what?” For most applications, like file sharing and databases, the difference between one microsecond and even fifteen microseconds is lost in the 10,000+ microseconds a whole transaction might take. It turns out, though, that there is a new breed of latency-sensitive applications, from Non-Volatile Memory Express (NVMe) storage to neural networks and high-volume compound web transactions, that can see significant improvements when latency is reduced. When low-latency TCP is applied to these problems the performance gains are both measurable and significant.
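For readers who want to reproduce this kind of measurement, here is a minimal sketch of a sockets-based TCP ping-pong test that reports half-round-trip latency. The port number, message size, and iteration count are arbitrary illustrative choices, and the absolute numbers you get will depend on your NICs, switches, and kernel stack. Because it is plain sockets code, it is also exactly the sort of unmodified application a kernel-bypass layer such as OpenOnload is meant to accelerate.

```python
# Minimal TCP ping-pong latency sketch (illustrative; port and sizes are placeholders).
# Run "server" on one machine and "client <server-ip>" on another.
import socket
import sys
import time

PORT = 9999          # arbitrary example port
ITERATIONS = 10000   # number of round trips to average over
MSG = b"x" * 32      # tiny payload, typical for latency tests; arrives whole in practice

def server():
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as srv:
        srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        srv.bind(("0.0.0.0", PORT))
        srv.listen(1)
        conn, _ = srv.accept()
        with conn:
            conn.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
            for _ in range(ITERATIONS):
                data = conn.recv(len(MSG))   # echo each message straight back
                if not data:
                    break
                conn.sendall(data)

def client(host):
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.connect((host, PORT))
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
        start = time.perf_counter()
        for _ in range(ITERATIONS):
            sock.sendall(MSG)
            sock.recv(len(MSG))
        elapsed = time.perf_counter() - start
        # Half round trip = total elapsed time / iterations / 2
        print(f"half round trip: {elapsed / ITERATIONS / 2 * 1e6:.2f} microseconds")

if __name__ == "__main__":
    mode = sys.argv[1] if len(sys.argv) > 1 else "server"
    if mode == "server":
        server()
    else:
        client(sys.argv[2])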

So the next time someone suggests RoCE, ask if they’ve considered a little-known competing protocol called TCP/IP. While RoCE is the shiny new object, TCP/IP has had several decades of innovation behind it, which explains why it’s the underlying “language of the Internet.” Consider asking those promoting RoCE what their porting budget is, and whether they’ve factored in the cost of the new network switches that will be required to support DCB. It’s very likely that the application they want to deploy already supports TCP/IP, and if latency and throughput are key factors, consider contacting Solarflare about OpenOnload. OpenOnload accelerates existing sockets-based applications without requiring that they be modified.

The Mummy in the Datacenter

This article was originally published in November of 2008 at 10GbE.net.

While Brendan Fraser travels China in his latest quest to terminate yet another mummy, IT leaders are starting to wonder if they’ve got a mummy of their own haunting their raised floor. This mummy is easy to find: he’s wrapped in thick black copper cables, and his long fingers may be attached to many of your servers. It is Infiniband!
 
Once praised as the next-generation networking technology, having conquered High-Performance Computing, it continued its battle for world networking domination by attacking storage and now the data center. It promised you 20Gbps, hinted that it would soon offer 40Gbps, and shared with you its plans for 160Gbps! It claimed full bisection, the ability to use all the network capacity available, and low latency (the time it takes to actually move a packet of data around). It’s democratic: the software stack was developed by an “open” committee of great technological leaders, so it MUST be good for us. Everyone from HP to SGI has sung its praises whenever they’ve come by to peddle the latest in server technology. A corpse wrapped in rags, a centuries-old immortal Dragon Emperor, or a black cable bandit, they all can be eradicated.
 
We will tear this black cable bandit down to size one claim at a time. First, they assert that it’s 20Gbps; how about 12Gbps on its best day, with all the electrons flowing in the same direction? Infiniband employs what is known as 8b/10b encoding to put the bits on the wire. For every 10 signal bits, there are 8 useful data bits. Ethernet uses the same method; the difference is that Ethernet for the past 30 years has advertised the actual data rate, the 8, while Infiniband promotes the 25% larger and useless signal rate, the 10. Using Infiniband math, Ethernet would be 12.5Gbps instead of the 10Gbps it actually is. So using Ethernet math, Infiniband’s Double Data Rate (DDR) is actually only 16Gbps and not the 20Gbps they claim. But wait, there’s more! I said earlier that you will only get 12Gbps under ideal conditions, so where did the other 4Gbps go? Today most servers use PCIe 1.1 8-lane I/O slots. Ideally these are 16Gbps slots, but once you add in PCIe overhead you only get about 12Gbps on the best of systems. So with a straight face they sell you 20Gbps, knowing in their heart you’ll never get more than 12Gbps.
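The arithmetic behind those figures is easy to check; the short sketch below simply reproduces the numbers used above (the roughly 25% PCIe 1.1 protocol overhead is inferred from the article’s 16Gbps-to-12Gbps figures, not a measured value).

```python
# Reproducing the paragraph's arithmetic (all figures come from the article itself).

signal_rate_gbps = 20.0                        # Infiniband DDR "20Gbps" marketing number
data_rate_gbps = signal_rate_gbps * 8 / 10     # 8b/10b: 8 data bits per 10 signal bits
print(f"IB DDR data rate after 8b/10b: {data_rate_gbps:.0f} Gbps")          # 16 Gbps

ethernet_data_rate = 10.0
print(f"10GbE quoted in 'Infiniband math': {ethernet_data_rate * 10 / 8:.1f} Gbps")  # 12.5 Gbps

# PCIe 1.1 x8: the article's ~12 Gbps of usable host bandwidth after protocol overhead.
pcie_x8_after_encoding_gbps = 16.0   # 8 lanes x 2.5 GT/s x 8/10 encoding
pcie_overhead = 0.25                 # rough overhead implied by the article's 16 -> 12 Gbps
usable = pcie_x8_after_encoding_gbps * (1 - pcie_overhead)
print(f"Usable PCIe 1.1 x8 bandwidth: ~{usable:.0f} Gbps")                   # ~12 Gbps
```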
 
Full bisection: the ability for a network of servers to use all the network fabric available. Infiniband claims that using their architecture and switches you can leverage the ENTIRE network fabric under the right circumstances. On slides this might be true, but in the real world it’s impossible. Infiniband is statically routed, meaning that packets from server A to server X have only one fixed, predetermined path they can travel. One of the nation’s largest labs proved that on a 1,152-server Infiniband network, static routing was only 21% efficient and delivered on average 263MB/sec (2.1Gbps of the theoretical 10Gbps possible). So when they tell you full bisection, ask them why LLNL only saw 21%. In an IEEE paper presented last week, it was shown that a statically routed system cannot achieve greater than 38% efficiency. Now some of the really savvy Mummy supporters will say that the latest incarnation of Infiniband has adaptive routing; they do this by playing yet another shell game, redefining the term adaptive routing to mean more than one static route. Real adaptive routing and using a pair of static routes are vastly different things. Real adaptive routing can deliver 77% efficiency on 512 nodes and nearly 100% efficiency on clusters smaller than 512 nodes. If you want full bisection for more than a 16-node cluster, talk with Myricom or Quadrics; they do real adaptive routing.
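As a quick sanity check on the LLNL figure, here is the unit conversion behind “263MB/sec is roughly 21% of 10Gbps” (assuming decimal megabytes, as link rates are normally quoted).

```python
# Converting the LLNL measurement into an efficiency figure (the article's numbers).
measured_mb_per_sec = 263    # average per-server throughput reported
link_rate_gbps = 10          # theoretical per-server rate used in the article

measured_gbps = measured_mb_per_sec * 8 / 1000   # MB/s -> Gbps (decimal megabytes)
efficiency = measured_gbps / link_rate_gbps
print(f"{measured_gbps:.1f} Gbps delivered = {efficiency:.0%} of {link_rate_gbps} Gbps")
# -> 2.1 Gbps delivered = 21% of 10 Gbps
```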
 
Latency is the time it takes to move a packet from one application on a networked server to another application on a different server on the same network. Infiniband has always positioned itself as low latency. Typically, Infiniband advertises a latency of roughly three microseconds between two NICs using zero-byte packets. Well, in the past year 10GbE NICs and switches have come onto the market that can achieve similar performance. Arista’s switches measure latency in a few hundred nanoseconds, while Cisco’s latest 10GbE switches are sub four microseconds, compared to prior generations that were measured in the tens of microseconds or more. So when the Infiniband crowd crows about low-latency switching, ask them about Arista or BLADE Network Technologies 10GbE switches.
 
Infiniband claims 20Gbps and delivers less than 12Gbps. Infiniband claims full bisection yet beyond a small network they can’t exceed 38% efficiency. Infiniband claims low latency and now 10GbE can match it. Where is their value proposition in the data center?