SmartNICs, the Next Wave in Server Acceleration

As system architects, we seriously contemplate and research the components to include in our next server deployment. First, we break the problem being solved into its essential parts; then, we size the components necessary to address each element. Is the problem compute-, memory-, or storage-intensive? How much of each element will be required to craft a solution today? How much of each will be needed in three years? As responsible architects, we have to design for the future, because what we purchase today our team will still be responsible for three years from now. Accelerators complicate this issue because they can either dramatically breathe new life into existing deployed systems or significantly skew the balance when designing new solutions.

Today foundational accelerator technology comes in four flavors: Graphics Processing Units (GPUs), Field Programmable Gate Arrays (FPGAs), Multi-Processor Systems on a Chip (MPSoCs), and most recently Smart Network Interface Cards (SmartNICs). In this market, GPUs are the 900-pound gorilla, but FPGAs have made serious market progress over the past few years, with significant deployments in Amazon Web Services (AWS) and Microsoft Azure. MPSoCs, and now SmartNICs, blend many different computational components into a single chip package, often utilizing a mix of ARM cores, GPU cores, Artificial Intelligence (AI) engines, FPGA logic, Digital Signal Processors (DSPs), as well as memory and network controllers. For now, we're going to skip MPSoCs and focus on SmartNICs.

SmartNICs place acceleration technology at the edge of the server, as close as possible to the network. When computational processing of network-intensive workloads can be accomplished at the network edge, within a SmartNIC, it can often relieve the host CPU of many mundane networking tasks. Under normal server loads, the host CPU spends, on average, 30% of its time managing network traffic, a figure jokingly referred to as the data center tax. Imagine how much more you could get out of a server if just that 30% were freed up, and what if even more could be made available?

SmartNICs that leverage ARM cores and/or FPGA logic cells exist today from a growing list of companies like Broadcom, Mellanox, Netronome, and Xilinx. SmartNICs can be designed to fit into a Software-Defined Networking (SDN) architecture. They can accelerate tasks like Network Function Virtualization (NFV), Open vSwitch (OvS), or overlay network tunneling protocols like Virtual eXtensible LAN (VXLAN) and Network Virtualization using Generic Routing Encapsulation (NVGRE). I know, networking alphabet soup, but the key here is that complex routing and packet encapsulation tasks can be handed off from the host CPU to a SmartNIC. In virtualized environments, significant amounts of host CPU cycles can be consumed by these tasks. While they are not necessarily computationally intensive, they are volumetrically intense. With data center networks moving to 25GbE and 50GbE, it's not uncommon for host CPUs to process millions of packets per second. Today that processing happens in the kernel or hypervisor networking stack. With a SmartNIC, packet routing and encapsulation can be handled at the edge, dramatically limiting the impact on the host CPU.
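To make that per-packet work concrete, here is a minimal sketch in C of building the eight-byte VXLAN header defined in RFC 7348. It assumes the outer Ethernet, IP, and UDP headers (UDP destination port 4789) are assembled elsewhere, and the function name and VNI value are purely illustrative; the point is simply that this is the kind of rote, per-packet bookkeeping a SmartNIC can absorb so the host CPU never touches it.

```c
/*
 * Minimal sketch of VXLAN header construction (RFC 7348). The surrounding
 * outer Ethernet/IP/UDP headers are assumed to be built elsewhere.
 */
#include <stdint.h>
#include <string.h>
#include <stdio.h>
#include <arpa/inet.h>   /* htonl() */

#define VXLAN_UDP_PORT 4789   /* IANA-assigned destination port */

struct vxlan_hdr {
    uint32_t flags_reserved;  /* first byte 0x08: the I flag, marking the VNI as valid */
    uint32_t vni_reserved;    /* 24-bit VNI in the upper three bytes */
};

/* Fill an 8-byte VXLAN header for the given Virtual Network Identifier. */
static void vxlan_build_header(struct vxlan_hdr *h, uint32_t vni)
{
    h->flags_reserved = htonl(0x08000000u);        /* I flag set, rest reserved */
    h->vni_reserved   = htonl((vni & 0xFFFFFFu) << 8);
}

int main(void)
{
    struct vxlan_hdr h;
    vxlan_build_header(&h, 5001);                  /* illustrative VNI */

    uint8_t wire[8];
    memcpy(wire, &h, sizeof(h));
    for (size_t i = 0; i < sizeof(wire); i++)
        printf("%02x ", wire[i]);
    printf("\n");                                  /* 08 00 00 00 00 13 89 00 */
    return 0;
}
```

Multiply that little bit of work by millions of packets per second, and the appeal of doing it in hardware at the edge becomes obvious.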

If all you were looking for from a SmartNIC is to offload networking from the host CPU, thereby recovering the 30% data center networking tax, that alone might be enough to justify the expense. Most of the SmartNIC product offerings from the companies mentioned above run in the $2K to $4K price range. So suppose you're considering a SmartNIC that costs $3K, and with the proper software and under load testing you've found that it returns 30% of your host CPU cycles. At what point does the ROI make sense? A simplistic approach would suggest that $3K divided by 30% yields a system cost of $10K. So if your servers cost north of $10K, adding a $3K SmartNIC is a wise decision. But wait, there's more.
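That back-of-the-envelope break-even math is simple enough to write down. Here it is as a tiny C program, using the $3K card and the 30% recovery figure from the paragraph above; both numbers are assumptions from this example, not quotes from any vendor.

```c
/* Back-of-the-envelope ROI check: a SmartNIC that costs $3,000 and returns
 * 30% of the host CPU pays for itself once the server it rides in costs
 * more than $3,000 / 0.30 = $10,000. */
#include <stdio.h>

int main(void)
{
    double nic_cost      = 3000.0;  /* SmartNIC price from the example */
    double cpu_recovered = 0.30;    /* fraction of host CPU cycles returned */

    double breakeven_server_cost = nic_cost / cpu_recovered;

    printf("Break-even server cost: $%.0f\n", breakeven_server_cost);
    /* Prints: Break-even server cost: $10000 */
    return 0;
}
```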

SmartNICs can also handle many complex tasks like key-value stores, encryption and decryption (IPsec, MACsec, and soon even SSL/TLS), next-generation firewalls, electronic trading, and much more. Frankly, the NIC industry is at an inflection point similar to when video cards evolved into GPUs to support the gaming and virtualization markets. While Sony coined the term GPU with the introduction of the PlayStation in 1994, it was Nvidia, five years later in 1999, who popularized the GPU with the introduction of the GeForce 256. I doubt that in the late 1990s, while Nvidia was designing the NV10 chip, the heart of the GeForce 256, its engineers were also pondering how it might be used a decade later in high-performance computing (HPC) applications that had nothing to do with graphics rendering. Today we can look at all the ground covered by GPU and FPGA accelerators over the past two decades and quickly see a path forward for SmartNICs where they may even begin offloading the primary computational tasks of a server. It's not inconceivable to envision a server with a half dozen SmartNICs all tasked with encoding video, acting as key-value stores or web caches, or even trading stocks on various exchanges. I can see a day soon when SmartNIC selection will eclipse server CPU selection when designing a new solution from the ground up.

Mining, and the Importance of Knowing What, How, When, and Where?

It doesn't matter if you're panning for gold, drilling for oil, or mining Bitcoin; your success is bounded by your best answers to what, how, when, and where. Often the "what" and "how" are tightly linked. If you own oil drilling equipment, you're probably going to continue drilling for oil. If you buy an ASIC-based Bitcoin mining rig, you can only mine Bitcoin. Traditionally "when" and "where" are the most fluid variables to address. A barrel of crude oil today is $57, but over the past year, it has fluctuated between $42 and $66. Similarly, Bitcoin, during the same year, has swung between $3,200 and $12,900, so answering the "when" can be very important. Fortunately, digital currencies can easily be mined and held, which allows us to artificially shift the "when" until the offer price of the commodity achieves the necessary profitability. In digital currency mining this is sometimes written HODL; originally a typo, it has since morphed into "Hold On for Dear Life," meaning you hold until the currency is worth more than it cost you. Finally, we have the "where," and I'm sure some are wondering why "where" matters in digital currency mining.

Moving backward through the above questions, and drilling down specifically into digital currency mining as the application, "where" is the easiest one: you want to install your mining equipment wherever you can get the cheapest power, manage the excess heat, and tolerate the noise. Recently two of the most extensive mining facilities, both around 300MW, have been or are being stood up in former aluminum plants. When making aluminum, the single most costly input is electricity, and production requires access to it in vast volumes. These facilities are often located near hydroelectric plants where electricity is below $0.03/kWh. Also, since every watt of power is ultimately converted into heat or sound, you need a method for cost-effectively dealing with these byproducts. One of the mining operations mentioned earlier is located in the far northern region of Russia, which makes cooling exceptionally easy. "Where" also requires a local government that is friendly to digital-currency mining. In the Russian example mentioned above, it took nearly two years to secure the proper legal support, and some countries, like China until recently, have not been supportive of digital-currency mining. Enthusiasts like myself locate our mining gear in out-of-the-way places like basements or closets, perhaps even insulating them for sound and channeling the excess heat somewhere useful.

Concerning "when," that should be now. The general strategy executed by most of us currently mining is known as "mine and hold." With the Bitcoin halving coming in May, the expectation is that Bitcoin will see a run-up to that point. In the prior two Bitcoin halvings, the price remained roughly the same before and after the event. The last halving was in July 2016, and since then, Bitcoin has gone from a niche commodity to a mainstream offering. Just this past week, Fidelity was awarded a trust license to operate its digital assets business, further proof that Bitcoin has gone mainstream. As Bitcoin is the dominant digital currency, it is believed that as it rises, so will many of the other currencies that use it as a benchmark. So other mainstream digital currencies, like Ethereum, should also benefit significantly from a substantial increase in the value of Bitcoin.

Back to the "what" and "how." With digital currency mining, you have two criteria to weigh when answering the "how": efficiency and flexibility. If you purchase a highly efficient solution, it will be an ASIC-based mining rig, and you will soon learn, if you haven't already, that it has been designed to mine a single currency, and that's ALL it can ever mine. Conversely, if you want flexibility, an FPGA or GPU miner affords you various degrees of freedom, but again the trade-off between efficiency and flexibility comes into play. FPGA mining rigs are often 5X more efficient per watt than GPU-based rigs, but the selection of FPGA bitstreams is finite, though growing monthly. Both FPGA and GPU rigs can switch from mining one coin to another with nominal effort; it's the efficiency, and what can be mined, that separate the two.

Finally, I've neglected to address the most obvious question: "why?" This is both the root of our motivation to mine and the fabric of our most social network. "Our only hope, our only peace is to understand it, to understand the 'why.' 'Why' is what separates us from them, you from me. 'Why' is the only real social power; without it you are powerless. And this is how you come to me: without 'why,' without power." – Merovingian, "The Matrix Reloaded," 2003

Size Matters, Especially in Computing

[Photo caption: Yes, this is a regular-size coffee cup.]

The only time someone says size doesn't matter is when they have an abundance of whatever it is that's being discussed. Back in the 1980s, some of us took logic design and used discrete 7400-series chips to build out our projects. A 7400 has four two-input NAND gates, with four corresponding outputs, plus power and ground pins. It is a simple 14-pin package, about three-quarters of an inch long and maybe a quarter-inch wide, that contains a grand total of sixteen transistors. Many of the basic gates we needed for our designs came in that same exact package, which made for great fun. Thankfully we had young eyes back then, because oftentimes we'd be up until all hours of the night breadboarding our projects. We knew it was too late when someone would invariably slip up, insert a chip backward, and we'd all enjoy the faint whiff of burnt silicon.

Earlier this month, Xilinx set a new world record by producing the largest FPGA chip to date, a distant cousin of the 7400, called the Virtex UltraScale+ VU19P. Instead of 16 transistors, it has 35 billion, with a "B." Also, instead of four simple two-input, one-output logic gates, it has nine million programmable system logic cells. A system logic cell is a "box" with six inputs and one output that is fully configurable and highly networked. Each individual little "box" is programmed by providing a logic table that maps every possible combination of the six inputs to the single output. So why does size matter?
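For the curious, here is a rough software model, in C, of what "programming" one of those six-input, one-output cells amounts to: the configuration is just a 64-entry truth table, one bit per possible input combination. This is only an illustration of the idea, not how the silicon is actually implemented.

```c
/* Software model of a single six-input, one-output logic cell: the "program"
 * is a 64-entry truth table, one bit per input combination, packed into a
 * 64-bit word. */
#include <stdint.h>
#include <stdio.h>

/* Evaluate the cell: the six inputs form an index 0..63 into the table. */
static int lut6_eval(uint64_t truth_table, unsigned inputs /* 6 bits */)
{
    return (truth_table >> (inputs & 0x3F)) & 1u;
}

int main(void)
{
    /* Example table: output is 1 only when all six inputs are 1, i.e. a
     * six-input AND gate. Only bit 63 of the table is set. */
    uint64_t and6 = 1ULL << 63;

    printf("AND6(0b111111) = %d\n", lut6_eval(and6, 0x3F)); /* prints 1 */
    printf("AND6(0b101111) = %d\n", lut6_eval(and6, 0x2F)); /* prints 0 */
    return 0;
}
```

Now imagine nine million of these cells, wired together and reprogrammable at will.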

Imagine you gave one child a quart-sized Ziploc bag of Legos and another several huge tackle boxes of pre-sorted bricks, including Lego's own robotics kit. Assuming both children have similar abilities and creativity, which do you think will create the more compelling model? The first child's solution wouldn't be much larger than an apple and would be entirely static. While it could be revolutionary, it is limited by the constraints of the set of blocks provided. By contrast, the second child could produce a two-foot-tall robot that senses distance and moves freely about the room without bumping into walls. Which solution would you find more compelling? In this case, size matters in both the number and the type of bricks available to the builder.

The system logic cells mentioned above are much like small Lego bricks in that they can easily replicate the capability of more complex bricks by combining several smaller ones. FPGAs are also like Legos in that you can quickly tear down a model and reuse the building blocks to assemble a new one. For the past 30 years, FPGAs have had limitations that prevented them from going mainstream. First it was their speed and size, then it was the complexity of programming them. FPGAs were hard to configure, but the companies behind the technology learned from the GPU market and realized they needed tools to make programming FPGAs easier. Today, new tools exist to port C/C++ programs into FPGA bitstreams, as sketched below. Some might say that the 2010s were the age of the GPU, while the 2020s are shaping up to become the age of the FPGA.
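As a rough illustration of what those tools consume, here is the sort of plain C loop a high-level synthesis (HLS) flow can map onto logic cells. The pipeline directive is shown as a Xilinx-style comment since exact pragma syntax varies by vendor, the function is purely illustrative, and the same code runs unchanged as ordinary C on the host.

```c
/* Sketch of HLS-friendly C: a fixed-trip-count loop with simple array access
 * that a synthesis tool can unroll or pipeline across logic cells. */
#include <stdio.h>

#define N 1024

void saxpy(float a, const float x[N], const float y[N], float out[N])
{
    for (int i = 0; i < N; i++) {
        /* #pragma HLS PIPELINE  -- hint to start a new iteration each clock */
        out[i] = a * x[i] + y[i];
    }
}

int main(void)
{
    static float x[N], y[N], out[N];
    for (int i = 0; i < N; i++) { x[i] = (float)i; y[i] = 1.0f; }

    saxpy(2.0f, x, y, out);
    printf("out[10] = %.1f\n", out[10]);   /* 2*10 + 1 = 21.0 */
    return 0;
}
```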