User Level Networking (ULN) is Becoming an Over-Night Success

Kernel Bypass = User Level Networking

Rarely is an over-night success, over-night. Often success comes as a result of years or even decades of hard work, refinement, and maturity. ULN is just such a technology, while it is only now becoming fashionable as word leaks out that Google and Tencent have been adopting it internally because they’ve proven significant performance gains, it has been nearly 25 years in the making. Since the mid-1990s we have seen many efforts which have advanced kernel bypass otherwise known as ULN.   

With the advent of both Gigabit Ethernet (GbE) and the Linux operating system, we saw the emergence of large (1,024 or more) clusters of high-performance servers. These clusters were often designed to focus on particular computing tasks, typically single applications representing complex computational problems. These problems were particularly thorny because they involved very chatty sophisticated programs that modeled fluid dynamics (ex. Boeing and airflow over a wing) or finite particle analysis (ex. Ford and GM with simulated car crash models) or seismic analysis (ex. Saudi Aramco and oil production). Don’t get me wrong, there were also many more like modeling nuclear weapons storage, but the above were just a few of dozens of classes of problems. So, the HPC crowd was seeking networking which was even faster and more efficient than generic Transmission Control Protocol (TCP) over GbE. They’d also realized that the Linux kernel was beginning to bottleneck their overall performance, so they started to explore options for bypassing the Kernel altogether.  

This June the most popular Kernel bypass communications stack, the Message Passing Interface(MPI), will celebrate its 25th anniversary. MPI represented the dawn of a new approach to networking, a ULN communications stack. For MPI to achieve its desired performance objectives, it required a lower level networking device driver. In those early days, you could use the Virtual Interface Architecture(VIA) promoted by Intel, Microsoft and Compaq, which eventually became Infiniband’s Remote Direct Memory Access(RDMA), or Myrinetpromoted by Myricom. It should be noted that these weren’t the only two options, just the two most highly utilized at the time. Since then Myrinet has faded away, and Infiniband has dominated HPC.     

In parallel to the maturing of ULN, we’ve had an explosion in core counts on CPUs. This year Intel will begin rolling out premium server-based processor chips supporting up to 48-cores, while AMD counters with a 64. On the surface, this is excellent news, but it further complicates other system-wide server performance issues, most notably access to the network. Since most servers are a dual socket, this brings the potential maximum core counts to 96 and 128 respectively. What we’ve noticed though through internal testing is that often as the total number of processing cores on a server increases beyond ten the operating system typically becomes the networking performance bottleneck. As mentioned previously the High-Performance Computing (HPC) market anticipated this issue long ago.

In 2010 there was a move by several companies to bring HPC technology to markets outside HPC. With this, we saw the introduction of Myricom’s Datagram Bypass Layer(DBL), Solarflare’s OpenOnload, and Voltaire’s Messaging Accelerator(VMA). Both DBL and VMA were born from fifteen years of MPI experience, and they were crafted to provide kernel bypass on Linux. Initially, DBL only supported the Unreliable Datagram Protocol (UDP), and it took Myricom nearly two more years to add Transmission Control Protocol (TCP) support. While Myricom was able to morph their Myrinet eXpress (MX) stack into DBL, the fact remained that they didn’t have their own ULN TCP stack and were torn between licensing one versus building their own. An interesting side note, the initial customer motivation to create DBL was for a storage company called SANBlaze, but Myricom quickly realized that it could also use DBL to accelerate stock market data for Chicago traders. 

At that time 10GbE Network Interface Cards (NICs) had a 1/2 round trip for UDP based market data of about 10-15 microseconds. The initial version of DBL brought that down to under five microseconds. In financial trading, there is a direct correlation between time and money, and saving 5-10 microseconds on market data delivery means the difference between winning or losing a bid. At nearly the same time Solarflare also appeared in Chicago promoting its new OpenOnload that accelerated not only UDP but also the more complex TCP sessions. While market data comes in on UDP packets, orders into the exchanges are submitted using TCP. At the same time, and in parallel to this, one of the two biggest HPC Infiniband players Voltaire, later acquired by Mellanox, had crafted its own ULN called VMA. It too had realized that the lucrative financial markets were demanding ULN technology, and the time was right to apply their kernel bypass solution to this problem as well. 

For four years, it was a three-way horse race between DBL, OpenOnload, and VMA for the best ULN solution on Linux providing support for both UDP and TCP. Since 2010 ULN for both UDP and TCP has come into production at nearly all of the worldwide financial exchanges, institutional banks, and high-frequency traders. While DBL and VMA still exist today, they make up less than 5% of utilization of ULN technology within financial customers. It turns out that in the fall of 2012 Myricom privately demonstrated to Google the value of using DBL to accelerate a Web2.0 application used extensively throughout Google called Memcached. By March of 2013 Google had acquired the necessary people and intellectual property from Myricom to bring both DBL and Myricom’s latest NIC technology in-house. With the core DBL development team gone, DBL’s utilization within the financial markets waned, and those customers have moved on to OpenOnload. Since then Google has dramatically expanded its use of this ULN technology in-house. Roughly four years ago with the adoption of VMA falling off to less than 2% adoption, Mellanox open-sourced VMA and moved it out to Github. Quietly over the past several years as other cloud providers had recognized Google’s ULN moves, these other players have begun spawning their own ULN projects. 

At the same time in 2013 as word leaked out that Google had its own internal ULN project, Intel released their Data Plane Development Kit (DPDK). With DPDK it became much easier for applications to gain access directly to the raw networking device. This did not go unnoticed by China’s Tencent Cloud team as they started with the open source Free-BSD stack, carved out what they needed from it, then ported that on-top of DPDK. The resulting project was called F-Stack, and it can be found on Github today. Other projects like the OpenFastPath Foundation driven by Nokia, ARM, Cavium, and Marvell our advancing their own ULN. So today if you’re seeking out a ULN partner that supports both UDP and TCP your top five options are Solarflare’s Cloud Onload, VMA, F-Stack, OpenFastPath, and Seastar. Only one of these though is commercially available and fully supported, Solarflare’s Onload.  

As you consider how you might accelerate your network intensive Web2.0 applications like web servers, software load balancers, in-memory databases, micro-service frameworks, and distributed compute grids you should consider Solarflare’s Cloud Onload. With Cloud Onload we’ve seen performance gains ranging from 50%-400% depending on how network intensive an application is. Over the past decade, Solarflare’s Onload technology has accelerated electronic trading worldwide, and today over 90% of all exchanges, institutional banks, and high-frequency trading shops have installed Onload. The only other ULN technology that even comes close to the worldwide adoption of Onload is MPI, but that’s a ULN stack designed for HPC messaging and it does not support UDP or TCP. If your enterprise relies on any of the Web2.0 classes mentioned above, consider reaching out to Solarflare to learn how they can accelerate your network traffic.

Gone in 98 Nanoseconds

Imagine a daily race with hundreds of top fuel dragsters all lined up rumbling along in parallel waiting for the same green Christmas tree light before launching off the line. In some electronic markets, with specific products, every weekday morning this is exactly what happens. It’s a race where being the fastest is the primary attribute used to determine if you’re going to be doing business. On any given day only the top finishers are rewarded with trades. Those who transmit their first orders of the day the fastest receive a favorable position at the head of the queue and are likely to do some business that day. In this market, EVERY nanosecond (a billionth of a second) of delay matters, and can be monetized. Last week the new benchmark was set at 98 nanoseconds, plus your trading algorithm, in some cases 150 nanoseconds total tick to trade.

“Latency” is the industry term for the unavoidable network delays, and “Tick to Trade Latency” aggregates together the network travel time for a UDP market data signal to arrive at a trading system, and for that trading system to transmit a TCP order into the exchange. Last year Solarflare introduced Application Nanosecond TCP Send (ANTS) and lowered the “Tick to Trade Latency” bar to 350 Nanoseconds. ANTS executes in collaboration with Solarflare’s Application Onload Engine (AOE) based on an Altera Stratix FPGA. Solarflare further advanced this high-speed trading platform to achieve 250 Nanoseconds. Then in the spring of 2017 Solarflare collaborated with LDA Technologies. LDA brought their Lightspeed TCP cores to the table and replaced the AOE with a Xilinx FPGA board once again lowering the “Tick to Trade Latency” to 120 Nanoseconds. Now through further advances, and moving to the latest Penguin Computing Skylake computing platform, all three partners just announced a STAC-T0 qualified benchmark of 98 nanoseconds “Tick to Trade Latency!”

There was even a unique case in this STAC-T0 testing where the latency was measured at negative 68 nanoseconds, meaning that a trade could be injected into the exchange before the market data from the exchange had even been completely received. Compared to traditional trading systems which require that the whole market data network packet to be received before ANY processing can be done, these advanced FPGA systems receive the market data in the packet in four-byte chunks and can begin processing that data while it is arriving. Imagine showing up in the kitchen before you wife even finishes calling your name for dinner. There could be both good and bad side effects of such rapid action, you have a moment or two to taste a few things before the table is set, or you may get some last minute chores. The same holds true for such aggressive trading.

Last week, in a Podcast with the same name we had a discussion with Vahan Sardaryan, CEO of LDA Technologies, where we went into this in more detail.

Penguin Computing is also productizing the complete platform, including Solarflare’s ANTS technology and NIC, LDA Technologies Lightspeed TCP, along with a high-performance Xilinx FPGA to provide the Ultimate Trading Machine.

The Ultimate Trading Machine

Thinning the 10GbE Herd

This article was originally published in January of 2009 at 10GbE.net.

Thinning the herd.

In 2007 over one million 10GbE network ports were purchased. Many of those were for a switch to switch interconnects but some were to connect servers to networks via 10GbE. Natural selection is now taking effect in the 10GbE NIC market as the big dogs, Intel & Broadcom, start thrashing around in an effort to secure market share as 10GbE matures. Both want to dominate the 10GbE LAN on Motherboard (LoM) market. In the NIC market, four companies likely supply over 80% of the 10GbE NICs purchased and they are Chelsio, Intel, Myricom, and Neterion. The remaining 20% of NIC sales fall to companies like Broadcom, SMC, NetXen, ServerEngines, Tehuti, AdvancedIO, Endace, Napatech, etc… One should be wondering why Broadcom is in the second group, it’s because Broadcom’s focus is on selling 10GbE silicon to OEMs like IBM and HP for LoM projects positioning their silicon on high-end server mother boards and not retailing NIC cards. 

Officially the first documented victim is NetEffect, the leader in iWarp (Infiniband for 10GbE) NICs. NetEffect rose from the ashes of a failed Infiniband company, Banderacom, earlier this decade to apply their silicon development skills and Infiniband algorithms to the more stable Ethernet market as a new feature called iWarp. NetEffect in-fact led the iWarp charge, it was the self-proclaimed leader in low-latency iWarp 10GbE NICs. In August NetEffect filed for reorganization in US Bankruptcy Court. With the failure of NetEffect the market has cast its vote and drove a stake through the heart of iWarp, hopefully terminating this feature.
Rumors have been swirling around Teak Technologies, a maker of 10GbE NICs and a switch, for some time. It appears that Teak has not weathered the storm and has since faded away, their domain name is no longer resolving to an IP address. The domain was never transferred from the founder, and the founder announced this spring on Linkedin that he had moved on some time ago. Is it conclusive evidence, no, but would you buy technology from a tech company whose URL won’t resolve to a server?
It is a tough economic climate for start-up NIC companies, particularly those in the bottom 20% as they have likely never had a quarter in the black. Now is a challenging time to be out there seeking another round of capital from ones VCs. Several have been without an injection of new funding for over two years and lack the sales volume required to sustain their own existence much beyond year end. As such we’ve directly questioned one firm to see if they are alive, and another that is widely rumored in the industry to be in trouble, but their marketing departments are still bailing.