Memcached is free, open-source software that provides a distributed in-memory object store, leveraged by popular sites such as Twitter, YouTube, Flickr, and Craigslist. Since these objects live in memory rather than on much slower disk, optimizing the path they travel can yield gains that are not just measurable but significant. A recent Solarflare whitepaper did just that: using an Intel X520-DA2 10GbE adapter as the reference, Solarflare compared it to its own SFN7122F running OpenOnload and measured performance gains of two to three times what the Intel adapter delivered. For example, maximum batched query throughput (get) rose from 7.4 Mops (million operations per second) on a single Intel X520-DA2 card to 21.9 Mops on a Solarflare SFN7122F card with OpenOnload. In other words, you would need three servers with Intel 10G cards to deliver the same performance as a single Memcached server with one Solarflare SFN7122F network adapter. How is this possible?
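The batched ("multi-get") pattern behind that throughput number fetches many keys in one request, amortizing the per-request network cost. A minimal sketch of the idea, using a hypothetical in-process stand-in for a Memcached client (no real server or client library is assumed):

```python
# Hypothetical stand-in client: illustrates why a batched get ("multi-get")
# costs one round trip for N keys instead of N round trips.
class FakeMemcacheClient:
    def __init__(self):
        self._store = {}      # in-process dict standing in for the cache
        self.round_trips = 0  # counts simulated network round trips

    def set(self, key, value):
        self.round_trips += 1
        self._store[key] = value

    def get(self, key):
        self.round_trips += 1          # one round trip per key
        return self._store.get(key)

    def get_multi(self, keys):
        self.round_trips += 1          # one round trip for the whole batch
        return {k: self._store[k] for k in keys if k in self._store}

client = FakeMemcacheClient()
for i in range(48):
    client.set(f"user:{i}", f"profile-{i}")

before = client.round_trips
batch = client.get_multi([f"user:{i}" for i in range(48)])
print(len(batch), client.round_trips - before)  # 48 values, 1 round trip
```

Real clients (e.g. pymemcache or libmemcached) expose the same multi-get shape; the whitepaper's batched test used batches of 48 gets, as described below.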
Solarflare’s adapter offers 1,024 virtualized network interfaces (VNICs) on each 10GbE port, each with its own receive and transmit hardware. Traditionally, Solarflare’s OpenOnload mapped application sockets to these VNICs; a newer version allows sockets to be moved dynamically between OpenOnload stacks. This lets a multi-threaded application like Memcached, which uses a single listener thread and many worker threads, map each worker thread to its own VNIC interface and move traffic between them.
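Because OpenOnload is a preloaded user-space sockets library, an unmodified Memcached binary can be accelerated at launch time. A hypothetical invocation is sketched below; the `onload` launcher and `EF_STACK_PER_THREAD` tuning variable come from OpenOnload's documentation, but the exact options used in the whitepaper are not stated, and the Memcached flags shown (user, memory, threads, port) are illustrative only:

```shell
# Hypothetical launch: run unmodified memcached under OpenOnload,
# requesting a separate Onload stack per thread so each worker thread
# gets its own VNIC resources. Verify option names against your
# OpenOnload version's documentation.
EF_STACK_PER_THREAD=1 onload --profile=latency \
    memcached -u memcache -m 2048 -t 10 -p 11211
```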
The test bed consisted of two Dell PowerEdge R620 servers, each with two 10-core Intel E5-2660 v2 processors and 32GB of memory, running RHEL 6.5 with Hyper-Threading enabled. The first server, the traffic generator, ran memslap and was configured with a pair of Solarflare SFN6122F adapters to create sufficient load. The second server, the one testing Memcached performance, had both an Intel X520-DA2 and a Solarflare SFN7122F with OpenOnload. To balance performance and scalability, each Memcached instance was capped at five cores (10 threads), so with 20 cores on the server we loaded four Memcached instances, two per CPU. We then tested from one core per instance up to five cores per instance. Here are some interesting data points:
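For readers unfamiliar with memslap: it ships with libmemcached and drives a configurable mix of set/get traffic against one or more servers. A hypothetical invocation is shown below; the flag names come from libmemcached's memslap, but the whitepaper's actual parameters (and the server address shown) are assumptions:

```shell
# Hypothetical load-generator command; actual whitepaper settings unknown.
memslap --servers=10.0.0.2:11211 --concurrency=128 --test=get
```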
- From four cores to 20 cores, scaling was roughly linear: Solarflare went from 5 Mops to 21.9 Mops, while Intel went from 1.7 Mops to 7.4 Mops.
- Latency for a single get request improved substantially: Solarflare dropped from 750us (4 cores) to 180us (20 cores), while Intel went from 3,000us down to 425us.
- For batches of 48 get requests, the latency numbers were also compelling: Solarflare went from 10,000us (4 cores) down to 2,000us (20 cores), compared to Intel's 30,000us down to 6,000us.
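The "three servers" claim in the opening follows directly from the 20-core throughput figures above; a quick check of the arithmetic:

```python
# Throughput figures quoted above for the 20-core configuration.
solarflare_mops = 21.9
intel_mops = 7.4

speedup = solarflare_mops / intel_mops
print(round(speedup, 2))  # 2.96 -- roughly three Intel-equipped servers

# Single-get latency at 20 cores, from the bullets above.
latency_ratio = 425 / 180
print(round(latency_ratio, 2))  # 2.36 -- Intel's latency vs. Solarflare's
```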
If you’re serious about Memcached, this 10-page whitepaper is well worth reading in full, as it goes into much deeper detail.