Making the Fastest, Faster: Redis Performance Revisited

When you take something that is already considered to be the fastest and offer to make it another 50% faster people think you’re a liar. Those who built that fast thing couldn’t possibly have left that much slack in their design. Not every engineer is a “miracle worker” or notorious sand-bagger, like Scotty from the Star Ship Enterprise. So how is this possible?

A straightforward way to achieve such unbelievable gains is to alter the environment around how that fast thing is measured. Suppose the thing we’re discussing is Redis, an in-memory database. The engineers who wrote Redis rely on the Linux kernel for all network operations. When those Redis engineers measured the performance of their application what they didn’t know was that over 1/3 of the time a request spends in flight is consumed by the kernel, something they have no control over. What if they could regain that control?

Suppose we provided Redis’s direct access to the network. This would enable Redis to directly make calls to the network without any external software layers in the way. What sort of benefits might the Redis application see? There are three areas which would immediately see performance gains: latency, capacity, and determinism.

On the latency side, requests to the database would be processed faster. They are handled more quickly because the application is receiving data straight from the network directly into Redis’s memory without a detour through the kernel. This direct path reduces memory copies, eliminates kernel context switches, and removes other system overhead. The result is a dramatic reduction in time, and CPU cycles. Conversely, when Redis fulfills a database request, it can write that data directly to the network, again saving more time and reclaiming more CPU cycles. 

As more CPU cycles are freed up due to decreased latency, those compute resources go directly back into processing Redis database requests. When the Linux kernel is bypassed using Solarflare’s Cloud Onload Redis sees on average a 50% boost in the number of “Get” and “Set” commands it can process every second. Imagine Captain Kirk yelling down to Scotty to give him more power, and Scotty flips a switch, and instantly another 50% more power comes online, that’s Solarflare Cloud Onload. Below is a graph of the free version of Redis doing database GET commands using a single 25GbE link through the kernel in blue, and with an Onloaded 25GbE link in green. Solarflare Cloud Onload, is Scotty’s magic switch mentioned above. Note we scaled the number of Redis instance along the X-axis from 1 to 32 (on an x86 system with 32 cores) and the Y-axis is 0-15 million requests/second.

Finally, there is the elusive attribute of determinism. While computers are great at doing a great many things, that is also what makes them less than 100% predictable. Servers often have many sensors, fans and a control system designed to keep them operating at peak efficiency. The problem is that these devices generate events that require near-immediate attention. When a thermal sensor generates an interrupt, the CPU is alerted, it pushes the current process to the stack, services the interrupt, perhaps by turning a fan on, then returns to the previous process. When the interrupt occurs, and how long it takes the CPU to service it are both variables that hamper determinism. If a typical “Get” request takes a microsecond (millionth of a second) to service, but that CPU core is called away from processing that “Get” request in the middle by an interrupt, it could be 20 to 200 microseconds before it returns. Solarflare’s Cloud Onload communications stack moves these interrupts out of the critical path of Redis, thereby restoring determinism to the application.

So, if you’re looking to improve Redis performance by 50%, please consider Solarflare’s Cloud Onload running on one of their new X2 series NICs. Solarflare’s new X2 series NICs are available for 10GbE, 25GbE and now 100GbE. Soon we will be posting our Benchmarking Performance Guide and our Cloud Onload for Redis Cookbook that contains all the details. When these are available on Solarflare’s website then links will be added to this blog entry.  

*Update: Someone asked if I could clarify the graph a bit more. First, we focused our testing on both the GET and SET requests, as those are the two most common in-memory database commands. GET is simply used to fetch a value from the database while SET is used to store a value in the database, really basic stuff. Both graphs are very similar. For a single 25GbE link the size of the Redis GET and SET requests translates to about 11 million requests/second to fill the pipe.

It turns out that a quad-core server running four Redis instances can saturate a single 10GbE link, we’ve not tested multiple 10GbE links. Here is where Cloud Onload shines as it lifts the kernel limitations mentioned above. Note it will take you over 7 Redis instance on 7 Cores to achieve line rate 25GbE with Cloud Onload, while the kernel will require twice that or 14 instances on 14 cores to match this. Any Redis instances or CPU cores beyond this will be underutilized. The most important takeaway here though is that Cloud Onload delivers a substantial capacity gain for Redis over using the kernel, so if your server has more than a few cores Cloud Onload will enable you to get the full value out of them.

**Update: On March 23, 2019, an updated graph was posted above that focuses on 25GbE, as that’s where data centers are headed. The text was then aligned with the updated graph.

**Note: Credit to John Laroco for leading the Redis testing, and for noticing, and taking the opening picture at SJC airport earlier this month.

Equifax & Micro Segmentation

Earlier this week it was reported that an Equifax web service was hacked creating a breach that existed for about 10 weeks. During that time the attackers used that breach to drain 143 million people’s private information. The precise technical details of the breach, which Equifax claims was detected and closed on July 29, has yet to be revealed. While it says it’s seen no other criminal activity on its main services since July 29th that’s of little concern as Elvis has left the building. At 143 million that means a majority of the adults in the US have been compromised. Outside of Equifax specific code vulnerabilities or further database hardening what could Equifax have done to thwart these attackers?

Most detection and preventative countermeasures that could have minimized Equifax’s exposure employ some variation of behavior detection at one network layer. They then shunt suspect traffic to a sideband queue for further detailed human analysis. Today the marketing trend to attract Venture Capital investment is to call these behavior detection algorithms Artificial Intelligence or Machine Learning. How intelligent they are, and to what degree they learn is something for a future blog post. While at the NGINX Conference this week we saw several companies selling NGINX layer-7 (application layer) plugins which analyzed traffic prior to passing it to NGINX’s HTML code evaluation engine. These plugins receive the entire HTML request after the OS stack has assembled it from multiple network packets. They then do a rapid analysis the request to determine if it poses a threat. If not then the request is passed back to NGINX for the web application to respond to. Along the way, the plugin abstracts metadata from the request and in parallel, it shoots that up to their cloud service for further evaluation. This metadata is then compared against prior history and other real-time customer data from with similar services to extract new potential threat vectors. As they are detected rules are then pushed back down into the plugin that can be applied to future packets.

Everything discussed above is layer-7, the application layer, traffic analysis, and mitigation. What does layer-7 have to do with network micro-segmentation? Nothing, what’s mentioned above is the current prevailing wisdom instantiated in several solutions that are all the rage today. There are several problems with a layer-7 solution. First, it competes with your web application for host CPU cycles. Second, if the traffic is determined to be malicious you’ve already invested tens of thousands of CPU instructions, perhaps even in excess of one hundred thousand instructions to make this determination, all that computer time is lost once the message is dropped. Third, the attack is now deep inside your web server and whose to say the attacker hasn’t learned what he needed to move to a lower layer attack vector to evade detection. Layer-7 while convenient, easy to use, and even easier to understand is very inefficient.

So what is network micro-segmentation, and how does it fit in? Network segmentation is the act of altering the flow of traffic such that only what you want is permitted to pass. Imagine the factory that makes M&Ms. These days they use high-speed cameras and other analytics that look for deformed M&Ms and when they see one they steer it away from the packaging system. They are in fact segmenting the flow of M&Ms to ensure that only perfect candy-coated pieces ever make it into our mouths. The same is true for network traffic, segmentation is the process of only allowing network packets to flow into or out of a given device via a specific policy or set of policies. Micro-segmentation is doing that down to the application level. At Layer-3, the network layer, that means separating traffic by the source and destination network address and port, while also taking into account the protocol (this is known as “the five-tuple”, a set of five elements). When we focus on filtering traffic by network port we can say that we are doing application level filtering because ports are used to map network traffic to applications. When we also take into account the local IP address for filtering then we can also say we filter by the local container (ex. Docker) or Virtual Machine (VM) as these can often get their own local IP address. Both of these items together can really define a very specific network micro-segmentation strategy.

So now imagine a firewall inside a smart network interface card (NIC) that can filter both inbound and outbound packets using this network micro-segmentation. This is at layer-3, the Network, micro-segmentation within the smart NIC. When detection is moved into the NIC no x86 CPU cycles are consumed when evaluating the traffic, and no host resources are lost if the packet is deemed malicious and is dropped. Furthermore, if it is a malicious packet and it’s stopped by a firewall in the NIC then the threat has never entered the host CPU complex, and as such, the system’s integrity is preserved. Consider how this can improve an enterprise’s security as it scales out both with new servers, as well as adding containers and VMs. So how can this be done?

Solarflare has been shipping its 8000 line of smart NICs since June of 2016, and later this fall they will release a new firmware called ServerLock(TM). ServerLock is a first generation firewall in the smart NIC that is centrally managed. Every second it sends a summary of network flows through the NIC, in both directions, to a central ServerLock Manager system. This system then allows administrators to view these network flows graphically and easily turn them into security roles and policies that can then be deployed. Policies can then be deployed to a specific local IP address, a collection of addresses (think Docker containers or VMs) called an “IP Set”, a host or host groups. When deployed policies can be placed in Monitor or Enforce mode. Monitor Mode will allow all traffic to flow, but it will generate alerts for all traffic outside of all the defined policies for a local IP address. In Enforce mode, ONLY traffic conforming to the defined policies will be permitted. Traffic outside of those policies will generate an alert and be dropped. Once a network device begins to drop traffic on purpose we say that that device is segmenting the network. So in Enforce mode, ServerLock smart NICs will actively segment that server’s network by only passing traffic for supported applications, only those for which a policy exists. This applies to traffic in both directions, so for example, if an administrator walks into the data center, grabs a keyboard and elects to Secure Copy (SCP) a file from a database server to his workstation things will get interesting. If the ServerLock smart NIC in that database server doesn’t have a policy supporting SCP (port 22) his outbound request from that database server to his workstation will be dropped in the NIC. Likely unknown to him an alert will be generated on the central ServerLock Manager console calling out the application and both the database server and his workstation, and he’ll have some explaining to do.

ServerLock begins shipping this fall so while it’s too late for Equifax it’s not too late for the next Equifax. So how would this help moving forward? Simple, if every server, including web servers and database servers, has a ServerLock smart NIC then every second these servers would report their flow data to the central Solarflare ServerLock Manager for further analysis. Solarflare is working with Cloudwick to do real-time analysis of this layer-3 traffic so that Cloudwick can then proactively suggest in real time back to ServerLock administrators new roles and policies to proactively protect servers against all sorts of threats. More to come as this product is released.

9/11/17 Update – It was released over the weekend that Equifax is now pointing the blame at an Apache Struts module. The exact module has yet to be disclosed, but it could be any one of the following that has been previously addressed. On Saturday The Apache group replied pointing to other sources that believe it might have been caused by exploiting a remote code execution bug in their REST plugin as outlined in CVE-2017-9805. More to come.

9/12/17 Update – Alert Logic has the best analysis thus far.

Podcast on iTunes, Google Play & Stitcher

logopodcastAfter several weeks we now have a critical mass of episodes of The Technology Evangelist Podcast to justify posting it on iTunes, Google Play, and Stitcher. So now you can subscribe from your favorite service or player and stay current. Recently we completed episodes on high-performance packet capture, electronic trading, NVMe, and Hadoop.

We’re now scheduling and recording episodes on FPGAs, digital currencies (bitcoin), Intel Skylake, TensorFlow, Containers, the race to zero, and much, much more. If you have ideas for topics or speakers, we always welcome suggestions, so please email The Technology Evangelist or comment to this blog post.

Moving On

movingEffective June 2, 2017, the primary hosting source for the 40GbE.net (including 10, 25, and 50GbE.net) blog is moving from Blogger (Google) to WordPress. This is being done to facilitate better management of the content, and to add the capability to support a newly spun up Podcast called the Technology Evangelist. Hopefully, you’ll renew your subscription to this blog on WordPress.

3X Better Performance with Nginx

Recently Solarflare concluded some testing with Nginx that measured the amount of traffic Nginx could respond to before it started dropping requests. We then scaled up the number of cores provided to Nginx to see how additional compute resources impacted the servicing of web page requests, and this is the resulting graph:

click for larger image

As you can see from the above graph most NIC implementations require about six cores to achieve 80% wire-rate. The major difference highlighted in this graph though is that with a Solarflare adapter, and their OpenOnload OS Bypass driver they can achieve 90% wire-rate performance utilizing ONLY two cores versus six. Note that this is with Intel’s most current 10G NIC the x710.

What’s interesting here though is that OpenOnload internally can bond together up to six 10G links, before a configuration file change is required to support more.  This could mean that a single 12 core server, running a single Nginx instance should be able to adequately service 90% wire-rate across all six 10G links, or theoretically 54Gbps of web page traffic. Now, of course, this is assuming everything is in memory, and the rest of the system is properly tuned. Viewed another way this is 4.5Gbps/core of web traffic serviced by Nginx running with OpenOnload on a Solarflare adapter compared to 1.4Gbps/core of web traffic with an Intel 10G NIC. This is a 3X gain in performance for Solarflare over Intel, how is the possible?

Simple, OpenOnload is a user space stack that communicates directly with the network adapter in the most efficient manner possible to service UDP & TCP requests. The latest version of OpenOnload has also been tuned to address the C10K problem. What’s important to note, is that by bypassing the Linux OS to service these communication requests Solarflare reduces the number of kernel context switches/core, memory copies, and can more effectively utilize the processor cache. All of this translates to more available cycles for Nginx on each and every core.

To further drive this point home we did an additional test just showing the performance gains OOL delivered to Nginx on 40GbE. Here you can see that the OS limits Nginx on a 10-core system to servicing about 15Gbps. With the addition of just OpenOnload to Nginx, that number jumps to 45Gbps. Again another 3X gain in performance.

If you have web servers today running Nginx, and you want to give them a gargantuan boost in performance please consider Solarflare and their OpenOnload technology. Imagine taking your existing web server today which has been running on a single Intel x520 dual port 10G card, replacing that with a Solarflare SFN7122F card, installing their OpenOnload drivers and seeing a 3X boost in performance. This is a fantastic way to breathe new life into existing installed web servers. Please consider contacting Solarflare today do a 10G OpenOnload proof of concept so you can see these performance gains for yourself first hand.