Google, Memcache, and How Solarflare May Have Come Out on Top

This post was originally made in January of 2015, but due to a take-down letter, I received a week later this story has remained unpublished for the last seven years.

January 2015 Take-Down

Memcached was released in May of 2003 by Danga Interactive. This free, open-source software layer provides a general-purpose distributed in-memory cache that can be used by both web & app servers. Five years later, Google released version 1.1.0 of its App Engine, which also included its version of an in-memory cache called Memcache. This capability to store objects in a huge pool of memory spread across a large distributed fabric of servers is integral to the performance inherent in many of Google’s products. Google showed in a presentation earlier this year on Memcache that a query from a typical data requires 60-100 milliseconds, while a similar Memcache query only needed 3-8 milliseconds, a 20X improvement. It’s no wonder Google is a big fan of Memcache. This is why in March of 2013, Google acquired both technology and people from a small networking company called Myricom, who cracked the code to accelerate the network path to Memcache.

To better understand what Google acquired, we need to roll back to 2009 when Myricom had ported its MX communications stack to Ethernet. Mx over Ethernet (MXoE) was originally crafted for the High-Performance Computing (HPC) market, to create a very limited, but extremely low latency, stack for their then two-year-old 10GbE NICs. MXoE was reformulated using UDP instead of MX, and this new low latency driver was then named DBL. It was then engineered to service the High-Frequency Traders (HFT) on Wall Street. Throughout 2010 Myricom had evolved this stack by further adding limited TCP functionality. At that time, half round trip performance (one send plus one receive) over 10GbE was averaging 10-15 microseconds, DBL did it 4 microseconds. Today (2015) by comparison Solarflare, the leader in ultra-low latency network adapters, does this in well under 2 microseconds. So in 2012, Google was searching for a method to apply ultra-low latency networking tricks to a new technology called Memcache. In the fall of 2012 Google invited several Myricom engineers to discuss how DBL might be altered to service Memcache directly. Also they were interested in known if Myricom’s new silicon, due out in early 2013, might also be a good fit for this application. By March of 2013 Google had tendered an offer to Myricom to acquire their latest 10/40GbE chip, which was about to go into production, along with 12 of their PhDs who handled both the hardware & software architecture. Little is publicly known if that chip or the underlying accelerated driver layer for Memcache has ever made it into production at Google.

Fast forward to today (2015), and earlier this week, Solarflare released a new whitepaper highlighting how they’ve accelerated the publicly available free open-source layer called Memcached using their ultra-low latency driver layer called OpenOnload. In this whitepaper, Solarflare demonstrated performance gains of 2-3 times that of a similar Intel 10GbE adapter. Imagine Google’s Memcache farm being only 1/3 the size it is today? We’re talking serious performance gains here. For example, leveraging a 20 core server, the Intel dual 10GbE adapter supported 7.4 million multiple get operations while Solarflare provided 21.9 million, nearly a 200% increase in the number of requests. If we look at mixed throughput (get/set in a ratio of 9:1), Intel delivered 6.3 million operations per second while Solarflare delivered 13.3 million, a 110% gain. That’s throughput, how about latency performance? Using all 20 cores, and batches of 48 get requests Solarflare clocked in at 2,000 microseconds, and Intel at 6,000 microseconds. Across all latency tests, Solarflare reduced network latency by, on average, 69.4% (the lowest reduction was 50%, and the highest 85%). Here is a link to the 10-page Solarflare whitepaper with all the details.

While Google was busy acquiring technology & staff to improve their own Memcache performance, Solarflare delivered it for their customers and documented the performance gains.

828ns – A Legacy of Low Latency

Electronic trading, like no other industry, can directly link time and money. A decade ago when I started selling 10GbE NICs to Wall Street traders, they often shared with me the value of a single microsecond (millionth of a second) improvement in trading. Today these same traders are measuring gains in nanoseconds (billionths of a second). With each passing quarter our financial markets evolve, and trade execution times decrease. Trading platforms leveraging older hardware and software often can’t remain competitive as other traders continue to invest in the latest products which further reduce trade execution latency and improve order determinism.

For the past decade, Solarflare has led the market in accelerating server-side UDP/TCP networking for electronic trading with our Onload® software acceleration stack. In addition, Solarflare has regularly delivered a new generation of 10GbE network adapters that have further reduced network latency by 20-30% while also reducing jitter. Often these advances were the result of improvements in the hardware, but there were many significant enhancements to the Onload stack that contributed substantially to the overall system performance increases. Keep in mind that Onload is fully compliant to the BSD Sockets standard, which means that developers don’t have to change their code to use Onload. The table below shows this reduction in Onload latency over time along with the gain from each new generation of Solarflare adapters.

In the below graph (click on it to enlarge) you’ll see how latency with Onload compares between Solarflare’s SFN8522 and X2522 as message size increases. We’ve also included our next closest competitor, Mellanox, with their ConnectX-5 adapter and VMA offload stack.

About five years ago, Solarflare saw an opportunity to revisit TCP/UDP networking stacks within Onload and determined that it is possible to squeeze another 35-50% in performance gains if developers were willing to use a new C language application programming interface (API). This new API was built from the ground up focused on performance, and it implements only a subset of the complete BSD Sockets API. Every API call has been highly tuned to deliver optimum performance. On the road to formulating this API Solarflare has patented several new innovations, and in 2016 it leaped forward again by introducing this API and branding it TCPDirect. Initially, TCPDirect improved latency on Solarflare’s SFN8522 adapter by an astonishing 38%!

Recently TCPDirect was tested with the Solarflare’s latest X2522 cards, and it delivered an improved 48% latency reduction over Onload on the same adapter (click on the graph below). Today TCPDirect with the X2522 provides an amazing 828ns of latency with TCP. So how does this compare with Mellanox? The X2522 with TCPDirect is 39% faster than the Mellanox ConnectX-5 with VMA and Exasock! This gain is shown in the graph below. It should be noted that this testing was done using an older more performant Intel Skylake processor with a 3.6Ghz clock. Intel’s newest Cascade Lake processors burst up to 4.4Ghz, but they were not available at the time of this testing. Recent testing indicates that they should produce even more impressive results.

Trading and Time are interwoven into a single fabric, one cannot exist without the other. When trades are executing with a precision measured in nanoseconds you need a technology partner that is leading the industry, not following it. Solarflare also provides a precision time protocol (PTP) daemon that includes both IEEE-1588 (2008) and enterprise profiles. Additionally, Solarflare makes available an optional PCIe bracket kit enabling the direct connection of an external hardware master clock that can deliver a highly accurate one pulse per second (1PPS) signal.  This kit and Solarflare’s PTP daemon enable the adapter to maintain system time synchronization to within 200ns of the external master clock. Mellanox has stated that their PTP implementation “can see time locked to reference well within 500 nanoseconds of variation.”

Numerous STAC reports over the past decade with all the major OEMs and the Linux distributions used in finance have validated that Solarflare networking technology is the standard by which all others are measured. Innovations like those discussed above are the reason why over 90% of the stock exchanges, global investment banks, hedge funds, and cutting-edge high-frequency traders’ architect their systems with Solarflare hardware and software. Outside of the Linux kernel’s own communications stack, no other TCP/UDP user-space communications stack is more heavily tested or in wider production than Solarflare’s Onload platform. Today the world economy exists across hundreds of thousands of servers spread throughout the globe, and nearly all of those servers depend on Solarflare to provide the industry’s best performance with the lowest jitter possible. Below are recent STAC Research reports from the past two years that back up our claims.

June 2018 – SFC180604b– UDP over 10GbE using Solarflare OpenOnload on Red Hat OpenShift 3.10 (pre-release) with RHEL 7.5 and Solarflare XtremeScale X2522 Adapters on Supermicro SYS-1029UX-LL1-S16 Servers

June 2018 – SFC180604a– UDP over 10 GbE Solarflare OpenOnload on Red Hat Enterprise Linux 7.5 with Solarflare XtremeScale X2522 adapters on Supermicro SYS-1029UX-LL1-S16Servers

October 2017 – SFC170831– STAC-T0: Solarflare SFN8522-ONLOAD NIC with LDA Technologies LightSpeed TCP on an Alpha Data FPGA in a Penguin Computing Relion XE1112 Server

Febuary 2017 – SFC170206– UDP over 10GbE using OpenOnload on RHEL 6.6 with Solarflare SFN 8522-PLUS Adapters on HPE ProLiant XL170r Gen9 Trade & Match Servers

User Level Networking (ULN) is Becoming an Over-Night Success

Kernel Bypass = User Level Networking

Rarely is an over-night success, over-night. Often success comes as a result of years or even decades of hard work, refinement, and maturity. ULN is just such a technology, while it is only now becoming fashionable as word leaks out that Google and Tencent have been adopting it internally because they’ve proven significant performance gains, it has been nearly 25 years in the making. Since the mid-1990s we have seen many efforts which have advanced kernel bypass otherwise known as ULN.   

With the advent of both Gigabit Ethernet (GbE) and the Linux operating system, we saw the emergence of large (1,024 or more) clusters of high-performance servers. These clusters were often designed to focus on particular computing tasks, typically single applications representing complex computational problems. These problems were particularly thorny because they involved very chatty sophisticated programs that modeled fluid dynamics (ex. Boeing and airflow over a wing) or finite particle analysis (ex. Ford and GM with simulated car crash models) or seismic analysis (ex. Saudi Aramco and oil production). Don’t get me wrong, there were also many more like modeling nuclear weapons storage, but the above were just a few of dozens of classes of problems. So, the HPC crowd was seeking networking which was even faster and more efficient than generic Transmission Control Protocol (TCP) over GbE. They’d also realized that the Linux kernel was beginning to bottleneck their overall performance, so they started to explore options for bypassing the Kernel altogether.  

This June the most popular Kernel bypass communications stack, the Message Passing Interface(MPI), will celebrate its 25th anniversary. MPI represented the dawn of a new approach to networking, a ULN communications stack. For MPI to achieve its desired performance objectives, it required a lower level networking device driver. In those early days, you could use the Virtual Interface Architecture(VIA) promoted by Intel, Microsoft and Compaq, which eventually became Infiniband’s Remote Direct Memory Access(RDMA), or Myrinetpromoted by Myricom. It should be noted that these weren’t the only two options, just the two most highly utilized at the time. Since then Myrinet has faded away, and Infiniband has dominated HPC.     

In parallel to the maturing of ULN, we’ve had an explosion in core counts on CPUs. This year Intel will begin rolling out premium server-based processor chips supporting up to 48-cores, while AMD counters with a 64. On the surface, this is excellent news, but it further complicates other system-wide server performance issues, most notably access to the network. Since most servers are a dual socket, this brings the potential maximum core counts to 96 and 128 respectively. What we’ve noticed though through internal testing is that often as the total number of processing cores on a server increases beyond ten the operating system typically becomes the networking performance bottleneck. As mentioned previously the High-Performance Computing (HPC) market anticipated this issue long ago.

In 2010 there was a move by several companies to bring HPC technology to markets outside HPC. With this, we saw the introduction of Myricom’s Datagram Bypass Layer(DBL), Solarflare’s OpenOnload, and Voltaire’s Messaging Accelerator(VMA). Both DBL and VMA were born from fifteen years of MPI experience, and they were crafted to provide kernel bypass on Linux. Initially, DBL only supported the Unreliable Datagram Protocol (UDP), and it took Myricom nearly two more years to add Transmission Control Protocol (TCP) support. While Myricom was able to morph their Myrinet eXpress (MX) stack into DBL, the fact remained that they didn’t have their own ULN TCP stack and were torn between licensing one versus building their own. An interesting side note, the initial customer motivation to create DBL was for a storage company called SANBlaze, but Myricom quickly realized that it could also use DBL to accelerate stock market data for Chicago traders. 

At that time 10GbE Network Interface Cards (NICs) had a 1/2 round trip for UDP based market data of about 10-15 microseconds. The initial version of DBL brought that down to under five microseconds. In financial trading, there is a direct correlation between time and money, and saving 5-10 microseconds on market data delivery means the difference between winning or losing a bid. At nearly the same time Solarflare also appeared in Chicago promoting its new OpenOnload that accelerated not only UDP but also the more complex TCP sessions. While market data comes in on UDP packets, orders into the exchanges are submitted using TCP. At the same time, and in parallel to this, one of the two biggest HPC Infiniband players Voltaire, later acquired by Mellanox, had crafted its own ULN called VMA. It too had realized that the lucrative financial markets were demanding ULN technology, and the time was right to apply their kernel bypass solution to this problem as well. 

For four years, it was a three-way horse race between DBL, OpenOnload, and VMA for the best ULN solution on Linux providing support for both UDP and TCP. Since 2010 ULN for both UDP and TCP has come into production at nearly all of the worldwide financial exchanges, institutional banks, and high-frequency traders. While DBL and VMA still exist today, they make up less than 5% of utilization of ULN technology within financial customers. It turns out that in the fall of 2012 Myricom privately demonstrated to Google the value of using DBL to accelerate a Web2.0 application used extensively throughout Google called Memcached. By March of 2013 Google had acquired the necessary people and intellectual property from Myricom to bring both DBL and Myricom’s latest NIC technology in-house. With the core DBL development team gone, DBL’s utilization within the financial markets waned, and those customers have moved on to OpenOnload. Since then Google has dramatically expanded its use of this ULN technology in-house. Roughly four years ago with the adoption of VMA falling off to less than 2% adoption, Mellanox open-sourced VMA and moved it out to Github. Quietly over the past several years as other cloud providers had recognized Google’s ULN moves, these other players have begun spawning their own ULN projects. 

At the same time in 2013 as word leaked out that Google had its own internal ULN project, Intel released their Data Plane Development Kit (DPDK). With DPDK it became much easier for applications to gain access directly to the raw networking device. This did not go unnoticed by China’s Tencent Cloud team as they started with the open source Free-BSD stack, carved out what they needed from it, then ported that on-top of DPDK. The resulting project was called F-Stack, and it can be found on Github today. Other projects like the OpenFastPath Foundation driven by Nokia, ARM, Cavium, and Marvell our advancing their own ULN. So today if you’re seeking out a ULN partner that supports both UDP and TCP your top five options are Solarflare’s Cloud Onload, VMA, F-Stack, OpenFastPath, and Seastar. Only one of these though is commercially available and fully supported, Solarflare’s Onload.  

As you consider how you might accelerate your network intensive Web2.0 applications like web servers, software load balancers, in-memory databases, micro-service frameworks, and distributed compute grids you should consider Solarflare’s Cloud Onload. With Cloud Onload we’ve seen performance gains ranging from 50%-400% depending on how network intensive an application is. Over the past decade, Solarflare’s Onload technology has accelerated electronic trading worldwide, and today over 90% of all exchanges, institutional banks, and high-frequency trading shops have installed Onload. The only other ULN technology that even comes close to the worldwide adoption of Onload is MPI, but that’s a ULN stack designed for HPC messaging and it does not support UDP or TCP. If your enterprise relies on any of the Web2.0 classes mentioned above, consider reaching out to Solarflare to learn how they can accelerate your network traffic.

Making the Fastest, Faster: Redis Performance Revisited

When you take something that is already considered to be the fastest and offer to make it another 50% faster people think you’re a liar. Those who built that fast thing couldn’t possibly have left that much slack in their design. Not every engineer is a “miracle worker” or notorious sand-bagger, like Scotty from the Star Ship Enterprise. So how is this possible?

A straightforward way to achieve such unbelievable gains is to alter the environment around how that fast thing is measured. Suppose the thing we’re discussing is Redis, an in-memory database. The engineers who wrote Redis rely on the Linux kernel for all network operations. When those Redis engineers measured the performance of their application what they didn’t know was that over 1/3 of the time a request spends in flight is consumed by the kernel, something they have no control over. What if they could regain that control?

Suppose we provided Redis’s direct access to the network. This would enable Redis to directly make calls to the network without any external software layers in the way. What sort of benefits might the Redis application see? There are three areas which would immediately see performance gains: latency, capacity, and determinism.

On the latency side, requests to the database would be processed faster. They are handled more quickly because the application is receiving data straight from the network directly into Redis’s memory without a detour through the kernel. This direct path reduces memory copies, eliminates kernel context switches, and removes other system overhead. The result is a dramatic reduction in time, and CPU cycles. Conversely, when Redis fulfills a database request, it can write that data directly to the network, again saving more time and reclaiming more CPU cycles. 

As more CPU cycles are freed up due to decreased latency, those compute resources go directly back into processing Redis database requests. When the Linux kernel is bypassed using Solarflare’s Cloud Onload Redis sees on average a 50% boost in the number of “Get” and “Set” commands it can process every second. Imagine Captain Kirk yelling down to Scotty to give him more power, and Scotty flips a switch, and instantly another 50% more power comes online, that’s Solarflare Cloud Onload. Below is a graph of the free version of Redis doing database GET commands using a single 25GbE link through the kernel in blue, and with an Onloaded 25GbE link in green. Solarflare Cloud Onload, is Scotty’s magic switch mentioned above. Note we scaled the number of Redis instance along the X-axis from 1 to 32 (on an x86 system with 32 cores) and the Y-axis is 0-15 million requests/second.

Finally, there is the elusive attribute of determinism. While computers are great at doing a great many things, that is also what makes them less than 100% predictable. Servers often have many sensors, fans and a control system designed to keep them operating at peak efficiency. The problem is that these devices generate events that require near-immediate attention. When a thermal sensor generates an interrupt, the CPU is alerted, it pushes the current process to the stack, services the interrupt, perhaps by turning a fan on, then returns to the previous process. When the interrupt occurs, and how long it takes the CPU to service it are both variables that hamper determinism. If a typical “Get” request takes a microsecond (millionth of a second) to service, but that CPU core is called away from processing that “Get” request in the middle by an interrupt, it could be 20 to 200 microseconds before it returns. Solarflare’s Cloud Onload communications stack moves these interrupts out of the critical path of Redis, thereby restoring determinism to the application.

So, if you’re looking to improve Redis performance by 50%, please consider Solarflare’s Cloud Onload running on one of their new X2 series NICs. Solarflare’s new X2 series NICs are available for 10GbE, 25GbE and now 100GbE. Soon we will be posting our Benchmarking Performance Guide and our Cloud Onload for Redis Cookbook that contains all the details. When these are available on Solarflare’s website then links will be added to this blog entry.  

*Update: Someone asked if I could clarify the graph a bit more. First, we focused our testing on both the GET and SET requests, as those are the two most common in-memory database commands. GET is simply used to fetch a value from the database while SET is used to store a value in the database, really basic stuff. Both graphs are very similar. For a single 25GbE link the size of the Redis GET and SET requests translates to about 11 million requests/second to fill the pipe.

It turns out that a quad-core server running four Redis instances can saturate a single 10GbE link, we’ve not tested multiple 10GbE links. Here is where Cloud Onload shines as it lifts the kernel limitations mentioned above. Note it will take you over 7 Redis instance on 7 Cores to achieve line rate 25GbE with Cloud Onload, while the kernel will require twice that or 14 instances on 14 cores to match this. Any Redis instances or CPU cores beyond this will be underutilized. The most important takeaway here though is that Cloud Onload delivers a substantial capacity gain for Redis over using the kernel, so if your server has more than a few cores Cloud Onload will enable you to get the full value out of them.

**Update: On March 23, 2019, an updated graph was posted above that focuses on 25GbE, as that’s where data centers are headed. The text was then aligned with the updated graph.

**Note: Credit to John Laroco for leading the Redis testing, and for noticing, and taking the opening picture at SJC airport earlier this month.

Container Performance Doesn’t Need to Suck

Recently the OpenShift team at Red Hat, working with Solarflare Engineering, rolled out new code that was benchmarked by a third party, STAC Research, which demonstrated networking performance from within a container that was equivalent to that of a bare metal server. We’re talking 1.2 microseconds for 99% of network traffic in a 1/2RT (half round trip), that’s a TCP receive to an application coupled with a TCP send from that application.

Network performance like this was considered leading edge in High-Performance Computing (HPC) a little more than a decade ago when Myricom rolled out Myrinet10G which debuted at 2.4 microseconds back in 2006. Both networks are 10Gbps so it’s sort of an apples to apples comparison. Today, this level of performance is available for containerized applications using generic network socket calls. It should be noted that the above numbers were for zero byte packets, a traditional HPC measurement. More realistic performance using 256-byte packets yielded a 1/2RT time for the 99th percentile of traffic which was still under 1.5 microseconds, that’s amazing! It should be noted that everything was done to both the bare metal server and the Pod configuration to optimize performance. A graph of the complete results of that testing is shown below.

Anytime we create abstractions to simplify application execution or management we introduce additional layers of code that can result in potentially unwanted delays, known as application latency. By running an application inside a container, then wrapping that container into a Pod we are increasing the distance between what we intend to do, and what is actually being executed. Docker containers are fast becoming all the rage and methods for orchestrating them using tools like Kubernetes are extremely popular. If you dive into this OpenShift blog post there are ways to cut through these layers of code for performance while still retaining the primary management benefits.

Onload Recovers Meltdown Lost Performance

The recently announced microprocessor architecture vulnerability known as Meltdown is focused on accessing memory that shouldn’t be available to the currently running program. Meltdown exploits a condition where the processor allows an unprivileged application the capability to continually harvest data unrestricted from anywhere in system memory. The flaw that enables Meltdown is based on microprocessor performance enhancements more than a decade old and are now common in Intel and some ARM processors. The solution to Meltdown is Kernel page-table isolation (KPTI), but it doesn’t come without a performance impact which ranges from 5-30%, every application behaves differently. Since Onload places the communications stack into that application’s userspace this dramatically reduces the number of kernel calls for network operations and as such avoids most of the performance impact brought on by KPTI. Redhat confirm this in a recent article on this topic. This means that applications leveraging Onload on KPTI patched kernels will see an even greater performance advantage.

By contrast Spectre tears down the isolation that exists between running applications. It allows a malicious application to trick error-free programs into leaking their secrets. It does this by scanning the process address space of those programs, and the kernel libraries on which they depend, looking for exploitable code. When this vulnerable code is executed it acts as a covert channel transmitting its secrets to the malicious application. This vulnerability affects a wider range of processors and requires both kernel and CPU microcode patches, and even then, the vulnerability hasn’t been 100% eliminated. More work remains to be done to shut down Spectre.