Half a Billion Req/Sec!

Can your In-Memory Key-Value Store handle a half-billion requests per second?

Many applications from biological to financial and Web2.0 utilize in-memory databases because of their cutting-edge performance, often delivering several orders of magnitude faster response time than traditional relational databases. When these in-memory databases are moved to their own machines in a multi-tier application environment, they often can serve 10 million requests per second, and that’s turning all the dials to 11 on a high-end dual-processor server. Much of this is due to how applications communicate with the kernel and the network.

Last year, my team used one of these databases, Redis, we then bypassed the kernel, connected up a 100Gbps network, and took that 10 million requests per second to almost 50 million. Earlier this year we began working with Algo-Logic, Dell, and CC Integration to blow way beyond that 50 million target. At RedisConf2020 in May, Algo-Logic announced a 1U Dell server they’ve customized that can service nearly a half-billion requests per second. To process these requests, we’re spreading the load across two AMD EPYC CPUs and three Xilinx FPGAs. All requests are serviced directly from local memory using an in-memory key-value store system. For those requests serviced by the FPGAs, their response time is measured in billionths of a second. Perhaps we should explain how Algo-Logic got here, and why this number is significant.

Some time ago, a new form of databases came back into everyday use, and they were classified as NoSQL, because they didn’t use Structured Query language, and were non-relational. These databases rely on clever algorithmic tricks to rapidly store and retrieve information in memory; this is very different from how traditional relational databases function. These NoSQL systems are sometimes referred to as key-value stores. With these systems, you pass in a key, and a value is returned. For example, pass in “12345,” and the value “2Z67890” might be returned. In this case, the key could be an order number, and the value returned is the tracking number or status, but the point is you made a simple request and got back a simple answer, perhaps in a few billionths of a second. What Algo-Logic has done is they wrote an application for the Xilinx Alveo U50 that turns the 40 Gbps Ethernet port on this card into four smoking fast key-value stores each with access to the cards 8GB of High Bandwidth Memory (HBM). Each Alveo U50 card with Algo-Logics KVS can service 150 million requests per second. Here is a high-level architectural diagram showing all the various components:

Architecture for 490M Requests/Second into a 1U Server

There are five production network ports on the back of the server, three 40Gbps and two 25Gbps. Each of the Xilinx Alveo U50 cards has a single 40Gbps port, and the dual 25Gbps ports are on an OCP-3 form factor card called the Xilinx XtremeScale X2562 which carries requests into the AMD EPYC CPU complex. Algo-Logic’s code running in each of the Xilinx Alveo cards breaks the 40Gbps channel into four 10Gbps channels and processes requests on each individually. This enables Algo-Logic to make the best use possible of the FPGA resources available to them.

Furthermore, to overcome network overhead issues, Algo-Logic has packed 44 get requests into a single 1408 byte packet. For those familiar with Redis, this is similar to an MGET, multiple get, request. Usually, a single 32-byte get request can easily fit into the smallest Ethernet payload, which is 40 bytes, but then Ethernet adds an additional 24-bytes of routing and a 12-byte frame gap — using a single request per-packet results in networking overhead consuming 58% of the available bandwidth. This is huge, and can clearly impact the total request per second rate. Packing 44 requests of 32 bytes each into a single packet means that the network overhead drops to 3% of the total bandwidth, which means significantly greater requests per second rates.

What Algo-Logic has done here is extraordinary. They’ve found a way to tightly link the 8GB of High Bandwidth Memory on the Xilinx Alveo U50 to four independent key-value store instances that can service requests in well under a micro-second. To learn more consider reaching out to John Hagerman at Algo-Logic Systems, Inc.