Making the Fastest, Faster: Redis Performance Revisited

When you take something that is already considered to be the fastest and offer to make it another 50% faster, people think you’re a liar. Those who built that fast thing couldn’t possibly have left that much slack in their design. Not every engineer is a “miracle worker” or notorious sandbagger like Scotty from the Starship Enterprise. So how is this possible?

A straightforward way to achieve such unbelievable gains is to alter the environment around how that fast thing is measured. Suppose the thing we’re discussing is Redis, an in-memory database. The engineers who wrote Redis rely on the Linux kernel for all network operations. When those Redis engineers measured the performance of their application, what they didn’t realize was that over a third of the time a request spends in flight is consumed by the kernel, something they have no control over. What if they could regain that control?

Suppose we gave Redis direct access to the network. This would enable Redis to make calls to the network directly, without any external software layers in the way. What sort of benefits might the Redis application see? Three areas would immediately see performance gains: latency, capacity, and determinism.

On the latency side, requests to the database would be processed faster because data arrives straight from the network into Redis’s memory without a detour through the kernel. This direct path reduces memory copies, eliminates kernel context switches, and removes other system overhead. The result is a dramatic reduction in both time and CPU cycles. Likewise, when Redis fulfills a database request, it can write that data directly to the network, again saving more time and reclaiming more CPU cycles.
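
If you want a feel for where your own deployment stands, a quick way to gauge request latency is to time individual round trips from a client. Below is a minimal sketch using the redis-py client against a Redis instance assumed to be listening on localhost:6379; the key name, value size, and sample count are arbitrary choices for illustration, not part of our test setup.

    # A minimal latency probe, assuming the redis-py package and a Redis
    # instance listening on localhost:6379 (hypothetical test setup).
    import time
    import redis

    r = redis.Redis(host="localhost", port=6379)
    r.set("probe", "x" * 100)                      # small value, typical of cache entries

    samples = []
    for _ in range(10_000):
        start = time.perf_counter()
        r.get("probe")                             # one request/response round trip
        samples.append(time.perf_counter() - start)

    samples.sort()
    print(f"median: {samples[len(samples) // 2] * 1e6:.1f} us")
    print(f"p99:    {samples[int(len(samples) * 0.99)] * 1e6:.1f} us")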

As more CPU cycles are freed up due to decreased latency, those compute resources go directly back into processing Redis database requests. When the Linux kernel is bypassed using Solarflare’s Cloud Onload, Redis sees on average a 50% boost in the number of “GET” and “SET” commands it can process every second. Imagine Captain Kirk yelling down to Scotty to give him more power; Scotty flips a switch, and instantly another 50% comes online. That’s Solarflare Cloud Onload. Below is a graph of the free version of Redis doing database SET commands using a single 10GbE (blue), 25GbE (green), and 100GbE (tan) port. The light versions of the lines are Redis running through the Linux kernel, and the darker lines are Redis using Solarflare Cloud Onload, Scotty’s magic switch. Note that we scaled the number of Redis instances along the X-axis from 1 to 32 (on an x86 system with 32 cores), and the Y-axis runs from 0 to 25 million requests/second.
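
Capacity here simply means requests per second. Dedicated load generators driving many parallel connections are what you would normally use to measure it, but the sketch below shows the idea on a single connection, again assuming redis-py and a Redis instance on the default local port; the value size and run duration are arbitrary.

    # Toy throughput probe: counts SET requests completed on one connection.
    # Real capacity tests drive many parallel connections; this is only a sketch,
    # assuming redis-py and a Redis instance on localhost:6379.
    import time
    import redis

    r = redis.Redis(host="localhost", port=6379)
    payload = "x" * 100
    duration = 5.0                                  # seconds to run the probe
    count, deadline = 0, time.perf_counter() + duration

    while time.perf_counter() < deadline:
        r.set(f"key:{count % 1000}", payload)       # cycle through a small key space
        count += 1

    print(f"~{count / duration:,.0f} SET requests/second on a single connection")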

Finally, there is the elusive attribute of determinism. Computers are great at juggling a great many tasks, but that is also what makes them less than 100% predictable. Servers often have many sensors, fans, and a control system designed to keep them operating at peak efficiency. The problem is that these devices generate events that require near-immediate attention. When a thermal sensor generates an interrupt, the CPU is alerted; it sets aside the current process, services the interrupt, perhaps by turning on a fan, and then returns to the previous process. When the interrupt occurs, and how long it takes the CPU to service it, are both variables that hamper determinism. If a typical “GET” request takes a microsecond (a millionth of a second) to service, but the CPU core handling it is called away mid-request by an interrupt, it could be 20 to 200 microseconds before it returns. Solarflare’s Cloud Onload communications stack moves these interrupts out of Redis’s critical path, thereby restoring determinism to the application.
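
One crude way to see this background noise on a Linux host is to count how many interrupts fire while your latency-critical loop runs. The sketch below snapshots /proc/interrupts before and after a timed busy loop; it is purely illustrative, and the exact layout of that file varies from system to system.

    # Rough illustration of interrupt "noise" on Linux: snapshot /proc/interrupts
    # before and after a timed loop and report how many interrupts fired in between.
    # The busy loop below stands in for latency-critical work such as serving GETs.
    import time

    def total_interrupts():
        total = 0
        with open("/proc/interrupts") as f:
            next(f)                                  # skip the header row of CPU names
            for line in f:
                fields = line.split()
                # per-CPU counts follow the "IRQn:" label; ignore trailing text fields
                total += sum(int(x) for x in fields[1:] if x.isdigit())
        return total

    before = total_interrupts()
    t0 = time.perf_counter()
    while time.perf_counter() - t0 < 1.0:
        pass                                         # stand-in for one second of work
    print(f"interrupts during 1s of 'work': {total_interrupts() - before:,}")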

So, if you’re looking to improve Redis performance by 50% on average, and up to 100% under specific circumstances, please consider Solarflare’s Cloud Onload running on one of their new X2 series NICs. Solarflare’s new X2 series NICs are available for 10GbE, 25GbE, and now 100GbE. Recent testing with 100GbE has shown that a single server with 32 CPU cores, running a single Redis instance per core, can process well over 20 million Redis requests per second. Soon we will be posting our Benchmarking Performance Guide and our Cloud Onload for Redis Cookbook, which contain all the details. When these are available on Solarflare’s website, links will be added to this blog entry.

*Update: Someone asked if I could clarify the graph a bit more. First, we focused our testing on both GET and SET requests, as those are the two most common in-memory database commands. GET is simply used to fetch a value from the database, while SET is used to store a value in the database; really basic stuff. Both graphs are very similar. For a single 10GbE link, the size of our Redis GET and SET requests translates to about 4 million requests/second to fill the pipe. Scaling this to 25GbE means 10M req/sec, and for 100GbE it means 40M req/sec.
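
For the curious, here is the back-of-the-envelope math behind those ceilings, worked out only from the figures quoted above: if roughly 4M requests/second saturate 10GbE, each request occupies about 2,500 bits (roughly 312 bytes) on the wire, and the ceiling scales linearly with link speed.

    # Link-saturation arithmetic from the figures above: ~4M req/sec fills 10GbE,
    # so each request occupies roughly 10e9 / 4e6 = 2,500 bits (~312 bytes) on the wire.
    bits_per_request = 10e9 / 4e6

    for name, gbps in [("10GbE", 10), ("25GbE", 25), ("100GbE", 100)]:
        ceiling = gbps * 1e9 / bits_per_request
        print(f"{name}: ~{ceiling / 1e6:.0f}M requests/second to fill the pipe")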

It turns out that a quad-core server running four Redis instances can saturate a single 10GbE link; we’ve not tested multiple 10GbE links. Today the kernel appears to hit its limit at around 5M req/sec, as can be seen from our 25GbE testing. This is in line with testing we did over a decade ago: when doing packet capture using libpcap, we noticed that the kernel at that time had a limit of around 3M packets/sec. Over the years, with new Linux kernels, we’ve seen that number increase, so 5M requests today is reasonable. As mentioned above, the theoretical limit for 25GbE SET requests should be about 10M req/sec. Using Redis through the kernel over a 25GbE link, we do in fact hit and sustain the 5M req/sec limit, regardless of how many Redis instances or CPU cores are deployed. Here is where Cloud Onload shines: it lifts that 5M kernel limit and enables your server to service the link at its full 10M req/sec potential, though note it will take over 12 Redis instances on 12 cores to achieve this. Any Redis instances or CPU cores beyond this will be underutilized. The most important takeaway, though, is that Cloud Onload delivers a 100% capacity gain for Redis over using the kernel, so if your server has more than six cores, Cloud Onload will enable you to get the full value out of them.

At 100GbE, things are still not fully understood. With 25GbE we saw the kernel hit its expected 5M req/sec limit, but in our 100GbE testing the kernel went well beyond this, in fact triple that number. We have some ideas about how this is possible, but more research is required. We’re currently exploring why this is happening, and also how Cloud Onload can do even better than the nearly 25M requests/second at 100GbE measured above.

**Note: Credit to John Laroco for leading the Redis testing, and for noticing and taking the opening picture at SJC airport earlier this month.
