For the second year, I’ve had the pleasure of chairing the panel on SmartNICs. Chairing this panel is like being both an interviewer and a ringmaster. You have to ask thoughtful questions and respond to answers from six knowledgeable panelists while watching the Slack channel for real-time comments or questions that might further improve the event’s content. Recently I finished transcribing the talk and discovered the following seven gems sprinkled throughout the conversation.
7. Software today shapes the hardware of tomorrow. This is almost a direct quote from one of the speakers, and nearly half of the participants echoed it several times in different ways. One said that vertical integration means moving what is done today in code on an Arm core into gates tomorrow.
6. DPUs are evolving into storage accelerators. Perhaps the biggest vendor mentioned adding compression, which signals they are serious about positioning their DPU as a computational storage controller in the near future.
5. Side-Channel Attacks (SCA) are a consideration. Only one vendor brought up this topic, but it was on the minds of several. Applying countermeasures in silicon to thwart side-channel attacks nearly doubles the number of gates for that specific cryptographic block. I understand that the countermeasures essentially consume the inverse power while generating the inverse electromagnetic effects, so that external measurements of the chip package during cryptographic operations yield a completely fuzzed result.
4. Big Vendors are Cross-pollinating. We saw this last year with the announcement of the NVIDIA BlueField 2X, which includes a GPU on their future SmartNIC, but this appeared to be a bolt-on. NVIDIA’s roadmap didn’t integrate the GPU into the DPU until BlueField 4, several years out. Now Xilinx, who will soon be part of AMD, is hinting at similar things. Intel, who acquired Altera several years ago, is also bolting Xeons onto their Infrastructure Processing Unit (IPU).
3. Monterey will be the Data Center OS. VMware wasn’t on the panel, but some panelists had a lot to say on the topic. One mentioned that the data center is the new computer. This same panelist strongly implied that the future of DPUs lies in the control plane plugging into Monterey. Playing nicely with Monterey will likely become a requirement if you want to sell into future data centers.
2. The CPU on the DPU is fungible. The company seeking to acquire Arm mentioned they want to use a CPU micro-architecture in their DPU that they can tune; in other words, extending the Arm instruction set found in their DPU with additional instructions designed for processing network packets. Now that could be interesting.
Finally, there is a nerdy plumbing-type thing that will change the face of SmartNICs and bring them enormous advantages: the move to chiplets. Today ALL SmartNICs or DPUs rely on a single die in a single package, with one or perhaps two packages on a PCIe card. In the future, a single chip package will contain multiple dies, each with different functional components and possibly fabricated at different process nodes, so….
1. The inflection point for chiplet adoption is integrated photonics. Chiplets will become commonplace in DPU packages when there is a need to connect optics directly to the die in the package. This will enable very high-speed connections over extremely short distances.
Fifty-eight years ago last month, “The Jetsons” zipped into pop culture in a flying car and introduced us to many new fictional technologies like huge flat-screen TVs, tablet-based computing, and video chat. Some of these technologies had been popularized long before “The Jetsons”: video chat, previously known as videotelephony, dates as far back as the 1870s, but it took technology over a century to deliver the first commercially viable product. Today, with the pandemic separating many of us from our loved ones and our workplaces, video chat has become an instrumental part of our lives. What most people don’t realize, though, is that video is extremely data intensive, especially at the viewing resolutions we’ve all become accustomed to. Supporting most devices, including mobile, and various bandwidths requires translating a high-definition video (1080p) into a half dozen different resolutions, a computational task known as transcoding. This is often done in real time and typically requires one or more CPU cores on a standard server. Now consider all the digital video you consume daily, both in video chats and via streaming services; all of this content needs to be transcoded for your consumption.
Transcoding video can be very CPU intensive, and the program typically used, FFmpeg, is very efficient at utilizing all the computational resources available. On Ubuntu Linux 16.04, using FFmpeg 2.8.17, my experience is that unconstrained FFmpeg will consume 92% of the system’s available compute power. My test system is an AMD Ryzen 5 3300G clocked at 4.2Ghz, a hyperthreaded quad-core system, with a Xilinx Alveo U30 Video Accelerator card. For the testing I produced two sample videos: one from Trevor Noah’s October 15th, 2020 “Cow Hugging” episode, and the other from John Oliver’s October 11th, 2020 “Election 2020” episode. Here are the results, in seconds, of three successive runs with both files, first transcoding on the AMD processor and then offloading transcoding to the Xilinx Alveo U30.
Raw Data – Transcoding on AMD Ryzen 5 Versus Xilinx Alveo U30 Video Accelerator
From this, one can draw several conclusions, but the one I find most fitting is that the Xilinx Alveo U30 can transcode content 8X faster than a single AMD Ryzen 5 core at 4.2Ghz. Note that this is still development-level code; the general-availability code has not shipped yet. It is also only utilizing one of the two encoding engines on the U30, so additional capacity is available for parallel encoding. As more is understood, this blog post will be updated to reflect new developments.
Updates:
10/20/20 – It has been suggested that I share the options used when calling ffmpeg for both the AMD CPU execution and the Xilinx Alveo U30. Here are the two sets of command line options used.
The script that calls the AMD processor to do the work used the following options with ffmpeg:
-f rawvideo -b:v 10M -c:v h264 -y /dev/null
The script that calls the Xilinx Alveo U30 used the following options:
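The options themselves didn’t survive into this draft. As a placeholder, here is a hypothetical reconstruction, assuming the Xilinx Video SDK’s ffmpeg fork and its mpsoc_vcu_h264 encoder; the one option I can confirm from the note below is “-maxbitrate 10M”:

# hypothetical reconstruction - encoder name assumed from the Xilinx Video SDK
-f rawvideo -b:v 10M -maxbitrate 10M -c:v mpsoc_vcu_h264 -y /dev/null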
Dropping “-maxbitrate 10M” on one Alveo run later in the day yesterday didn’t seem to change much, but this will be explored further. Also, it has been suggested that I look into the impact of “-preset”, which affects quality, and how it might perform differently on the two platforms.
For the past three years, I’ve been writing about SmartNICs. One of my most popular blog posts is “What is a SmartNIC” from July 2017, which has been read over 6,000 times. This year, for the second time, I’ve presented at the Storage Developer Conference (SDC). The title of this blog post was also the title of my breakout session video, which runs 50 minutes and went live earlier this week. Here is the abstract for that session:
Since the advent of the Smart Phone over a decade ago, we’ve seen several new “Smart” technologies, but few have had a significant impact on the data center until now. SmartNICs and SmartSSDs will change the landscape of the data center, but what comes next? This talk will summarize the state of the SmartNIC market by classifying and discussing the technologies behind the leading products in the space. Then it will dive into the emerging technology of SmartSSDs and how they will change the face of storage and solutions. Finally, we’ll dive headfirst into the impact of PCIe 5 and Compute Express Link (CXL) on the future of Smart Acceleration on solution delivery.
In that talk, which has been seen by over 100 people on YouTube in just the first 24 hours (I’m told this doesn’t include conference attendees), I shared much of what I’ve learned over the past few months while producing the following new items on SmartNICs:
Hosting an IEEE Hot Interconnects panel on August 19th titled “SmartNICs vs. DPUs, Who Wins?” Over 225 people attended the live session, and the video has since been viewed another 300 times. We were fortunate enough to secure key people from all the major SmartNIC companies, except Intel, for an extremely lively discussion. I also did a blog post on this last month.
Last week I hosted an IEEE Hot Interconnects Panel with the above title. We were lucky enough to secure some time from the following luminaries, and it made for an excellent panel:
Clicking on the image below should take you to the 90-minute YouTube video of our panel discussion. For those just interested in the highlights, read on for some of the interesting facts pulled from our discussion.
IEEE Hot Interconnects Panel: “SmartNICs vs. DPUs, Who Wins?”
Here are some key points of significant value from the panel discussion above:
SmartNICs provide a second computing domain inside the server that can be used for security, orchestration, and control plane tasks. While some refer to this as an air-gapped domain, it isn’t, but it is far more secure than running inside the same x86 system domain. This can be used to securely enable bare-metal as a service. — Michael Kagan
Several vendors are actively collaborating on a Portable NIC Architecture (PNA) designed to execute P4 code. When available, it would then be possible to deliver containers with P4 code that could run on any NIC that supported this PNA model. — Vipin Jain
The control plane needs to execute in the NIC for two reasons: first, to offload the host CPU from what has quickly become a 30% overhead for processing network traffic, and second, to improve the determinism of the applications running on the server. — Vipin Jain
App stores are inevitable; when is the question. While some think it could be years, others believe it will happen within a year. Xilinx has partnered with a company that already has one for FPGA accelerators, so the leap to SmartNICs shouldn’t be that challenging. — Gordon Brebner
The ISA is unimportant; it’s the micro-architecture that matters. Fungible selected MIPS-64 because of its support for simultaneous multi-threaded execution with fine-grained context switching. — Pradeep Sindhu. Others feel that the ecosystem of tools and the wide access to developers matter most, and that is why they’ve selected ARM.
It should be noted that normally the ARM cores are NOT in the data plane.
The first 18 minutes are introductions and marketing messages. While these are educational, they are also somewhat canned. The purpose of the panel discussion was to ask questions the panelists hadn’t seen in advance, so we could draw out honest perspectives and feedback from their years of experience.
IMHO, here are some of the interesting comments, with who made them and where to find them:
18:50 Michael – The SmartNIC is a different computational domain, a computer in front of a computer, and ideal for security. It can supervise or oversee all system I/O; the key thing is that it is a real computer.
23:00 Gordon – Offloading the host CPU to the SmartNIC and enabling programmability of the device is critically important. We’ll also see functions and attributes of switches being merged into these SmartNICs.
24:50 Andy – Not only data plane offload, but control plane offload from the host is also critically important. Also, hardware, in the form of on-chip logic, should be applied to data plane offload whenever possible so that ARM cores are NOT placed in the data plane.
26:00 Andy – Dropped the three-letter string that makes most hardware providers cringe when we hear it: SDK. He stressed the importance of providing one. It should be noted that, as far as I know, Broadcom appears at this point to be the only SmartNIC OEM that provides a customer-facing SmartNIC SDK.
26:50 Vipin – A cloud-based device that is autonomous from the system and remotely manageable. It has its own brain and truly runs independently of the host CPU.
29:33 Pradeep – There is no golden rule, or rule of thumb like the 1Gb/sec/core figure AMD has cited. It’s important to determine what computations should be done in the DPU; multiplexing and stateful applications are ideal. General-purpose CPUs are made for processing single-threaded applications very fast and are horrible at multiplexing.
33:37 Andy – 1Gb/core is really low; I’d not be comfortable with that. I would consider DPDK or XDP, and either would blow that metric away. People shouldn’t settle for this metric.
35:24 Michael – The network needs to take care of the network on its own, so zero cores for an infinite number of gigabits.
36:45 Gordon – The SmartNIC is a kind of filtering device, where sophisticated functions like IPS can be offloaded into the NIC.
40:57 Andy – The TruFlow logic delivers a 4-5X improvement in packet processing. Very few people are really concerned with hitting line-rate packets per second at these speeds; in the data center these PPS requirements are not realistic.
42:25 Michael – I support what Andy said, these packet rates are not realistic in the data center.
44:20 Pradeep – We’re having this discussion because general-purpose CPUs can no longer keep up. This is not black and white but a continuum: where does general processing end and a SmartNIC pick up? gRPC, as an example, needs to be offloaded. The correct interface is not TCP or RDMA; both are too low level. gRPC is the modern level for this communication interface. We need architectural innovation because scale-out is here to stay!
46:00 Gordon – One thing about being FPGA based is that we can support tons of I/O. With FPGAs we don’t think in terms of cores; we look at I/O volumes. Several years ago we first started looking at 100GbE, figured out how to do that, and extended it to 400GbE. We can see the current approach scaling well into the terabit range. While we could likely provide terabit-range performance today, it would be far too costly; it’s a price-point issue, and nobody would buy it.
48:35 Michael – CPUs don’t manage data efficiently. We have dedicated hardware engines and TCAM along with caches to service these engines, that’s the way it works.
49:45 Pradeep – The person asking the question perhaps meant control flow and not flow control; while they sound the same, they mean different things. Control flow is what a CPU does; flow control is what networking does. A DPU or SmartNIC needs to do both well to be successful. It appears, and I could be wrong, that Pradeep uses “pipeline” to refer to consecutive stages of execution on a single macro resource like a DPU, and “chain” as a collection of pipelines that provide a complete solution.
54:00 Vipin – If you stick with fixed-function execution, then line rate is possible. We need to move away from focusing on processing TCP packets and shift focus to messages, with a run-to-completion model. It is a general-purpose program running in the data path.
57:20 Vipin – When it came to selecting our computational architecture it was all about ecosystem, and widely available resources and tooling. We [Pensando] went with ARM.
58:20 Pradeep – The ISA is an utter detail; it’s the macro-architecture that matters, not the micro instruction architecture. We chose MIPS because of the implementation, which is simultaneous multi-threaded and offers far and away better fine-grained context switching, much better than anything else out there. There is also the economic price/performance to be considered.
1:00:12 Michael – I agree with Vipin, it’s a matter of ecosystem; we need to provide a platform for people to develop on. We’re not putting ARMs on the data path, so the performance consideration Pradeep mentioned is not relevant. The key is providing an ecosystem that attracts as many developers as possible and makes their lives easier to produce great value on the device.
1:01:08 Andy – I agree 100%, that’s why we selected ARM; ecosystem drove our choice. With ARM there are enough Linux distributions, and you could be running containers on your NIC. The transition to ARM is trivial.
1:02:30 Gordon – Xilinx mixes ARM cores with programmable FPGA logic, and hard IP cores for things like encryption.
1:03:49 Pradeep – The real problem is the data path, but clearly ARM cores are not in the data path, so they are doing control plane functions. Everyone says they are using ARM cores because of the rich ecosystem, but I’d argue that x86 has a richer ecosystem. If that’s the case, then why NOT keep the control plane in the host? Why does the control plane need to be embedded inside the chip?
1:04:45 Vipin – The data path is NOT in ARM. We want it on a single die; we don’t want it hopping across many wires and killing performance. The kind of integration I can do by subsuming the ARM cores into my die is tremendous. That’s why it cannot be done with Intel. [Once you go off die, performance suffers, so what I believe Vipin means is that he can configure on the die whatever collection of ARM cores and hard logic he wants, and wire it together however best suits his customers’ needs. He can’t license x86 cores and integrate them on the same die as he can with ARM cores.] Plus, if he did throw an x86 chip on the card, it would blow his power budget [PCIe x16 cards are limited to 75W].
1:06:30 Michael – We don’t have as tight an integration between the data path and the ARMs as Pensando. If you want to segregate computing domains between an application tier and an infrastructure tier, you need another computer, and putting an x86 on a NIC just isn’t practical.
1:07:10 Andy – The air-gap, bare-metal-as-a-service use case is a very popular one. Moving control plane functions off the x86 to the NIC frees up x86 cores and enables a more deterministic environment for my applications.
1:08:50 Gordon – Having that programmable logic alongside the ARM cores gives you both control plane offload and the ability to dynamically modify the data plane locally.
1:10:00 Michael – We are all for users programming the NIC; we are providing an SDK and working with third parties to host their applications and services on our NICs.
1:10:15 Andy – One of the best things we do is outreach, where we provide NICs to university developers; they disappear for a few months, then return with completed applications or new use cases. Broadcom doesn’t want to tightly control how people use their devices; it isn’t open if it is limited by what’s available on the platform.
1:13:20 Vipin – Users should be allowed to own and define their own SDK to develop on the platform.
1:14:20 Pradeep – We provide programming stacks [libraries?] that are available to users through REST APIs.
1:15:38 Gordon – We took an early lead in helping define the P4 language for programming network devices. P4 became best known through Barefoot Networks’ switch chips, but we’ve embraced it since very early on. We actually have a P4-to-Verilog compiler, so you can turn your P4 code into logic. The main SmartNIC functions inside Xilinx are written in P4, and there are plug-ins where others can add their own P4 functions into the pipeline.
1:17:35 Michael – Yes, an app store for our NIC, certainly. It’s a matter of how it is organized. For me, it is somewhere users can go to safely download containerized applications or services that can then run on the SmartNIC.
1:18:20 Vipin – The app store is a little ways out, but it is a good idea. We are working in the P4 community towards standards. He mentions PNA, the Portable NIC Architecture, as an abstraction. [OMG, this is huge, and I wish I wasn’t juggling balls trying to keep the panel moving, as this would have been awesome to dig into. A PNA could enable containerized P4 applications that could potentially run across multiple vendors’ SmartNICs.] He also mentioned that you will need NIC-based applications, and a fabric with infrastructure applications, so that NICs on opposite sides of a fabric can be coordinated.
1:21:30 Pradeep – An app store at this point may be premature; in the long term, something like an app store will happen.
1:22:25 Michael – Things are moving much faster these days; maybe just another year for SmartNICs and an app store.
1:23:45 Gordon – We’ve been working with Pensando and others on the PNA concept with P4 for some time.
1:28:40 Vipin – …more coming as I listen again on Wednesday.
For those curious, the final vote was three for DPU and two for SmartNIC, but in the end the customer is the real winner.
As system architects, we seriously contemplate and research the components to include in our next server deployment. First, we break the problem being solved into its essential parts; then, we size the components necessary to address each element. Is the problem compute, memory, or storage-intensive? How much of each element will be required to craft a solution today? How much of each will be needed in three years? As responsible architects, we have to design for the future, because what we purchase today, our team will still be responsible for three years from now. Accelerators complicate this issue because they can both dramatically breathe new life into existing deployed systems and significantly skew the balance when designing new solutions.
Today foundational accelerator technology comes in four flavors: Graphical Processing Units (GPUs), Field Programmable Gate Arrays (FPGAs), Multi-Processor Systems on a Chip (MPSoCs), and most recently Smart Network Interface Cards (SmartNICs). In this market, GPUs are the 900-pound gorilla, but FPGAs have made serious market progress over the past few years, with significant deployments in Amazon Web Services (AWS) and Microsoft Azure. MPSoCs, and now SmartNICs, blend many different computational components into a single chip package, often utilizing a mix of ARM cores, GPU cores, Artificial Intelligence (AI) engines, FPGA logic, Digital Signal Processors (DSPs), as well as memory and network controllers. For now, we’re going to skip MPSoCs and focus on SmartNICs.
SmartNICs place acceleration technology at the edge of the server, as close as possible to the network. When computational processing of network-intensive workloads can be accomplished at the network edge, within a SmartNIC, it can often relieve the host CPU of many mundane networking tasks. Normal server processes require that the host CPU spend, on average, 30% of its time managing network traffic; this is jokingly referred to as the data center tax. Imagine how much more you could get out of a server if just that 30% were freed up, and what if more could be made available?
SmartNICs that leverage ARM cores and/or FPGA logic cells exist today from a growing list of companies like Broadcom, Mellanox, Netronome, and Xilinx. SmartNICs can be designed to fit into a Software-Defined Networking (SDN) architecture. They can accelerate tasks like Network Function Virtualization (NFV), Open vSwitch (OvS), or overlay network tunneling protocols like Virtual eXtensible LAN (VXLAN) and Network Virtualization using Generic Routing Encapsulation (NVGRE). I know, networking alphabet soup, but the key here is that complex routing and packet encapsulation tasks can be handed off from the host CPU to a SmartNIC. In virtualized environments, significant amounts of host CPU cycles can be consumed by these tasks. While they are not necessarily computationally intensive, they can be volumetrically intense. With data center networks moving to 25GbE and 50GbE, it’s not uncommon for host CPUs to process millions of packets per second. This processing happens today in the kernel or hypervisor networking stack. With a SmartNIC, packet routing and encapsulation can be handled at the edge, dramatically limiting the impact on the host CPU.
If all you were looking for from a SmartNIC is to offload the host CPU from having to do networking, thereby saving the 30% data center networking tax, this alone might justify the expense. Most of the SmartNIC product offerings from the companies mentioned above run in the $2K to $4K price range. So suppose you’re considering a SmartNIC that costs $3K, and with the proper software, under load testing, you’ve found that it returns 30% of your host CPU cycles; at what point does the ROI make sense? A simplistic approach would suggest that $3K divided by 30% yields a system cost of $10K. So if the cost of your servers is north of $10K, then adding a $3K SmartNIC is a wise decision. But wait, there’s more.
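For those who want to play with the numbers, here is a minimal sketch of that break-even arithmetic in Python; the $3K card price and the 30% of CPU cycles returned are the assumptions from the paragraph above:

# Break-even server cost for a SmartNIC purchase (simplistic model from above)
nic_cost = 3000.0              # price of the SmartNIC in dollars
cpu_fraction_returned = 0.30   # fraction of host CPU cycles the NIC frees up

# The NIC pays for itself when the CPU capacity it returns is worth its price.
breakeven_server_cost = nic_cost / cpu_fraction_returned
print(f"Break-even server cost: ${breakeven_server_cost:,.0f}")  # $10,000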
SmartNICs can also handle many complex tasks like key-value stores, encryption and decryption (IPsec, MACsec, soon even SSL/TLS), next-generation firewalls, electronic trading, and much more. Frankly, the NIC industry is at an inflection point similar to when video cards evolved into GPUs to support the gaming and virtualization markets. While Sony coined the term GPU with the introduction of the PlayStation in 1994, it was Nvidia five years later, in 1999, who popularized the GPU with the introduction of the GeForce 256. I doubt that in the mid-1990s, while Nvidia was designing the NV10 chip at the heart of the GeForce 256, its engineers were pondering how it might be used a decade later in high-performance computing (HPC) applications that had nothing to do with graphics rendering. Today we can look at all the ground covered by GPU and FPGA accelerators over the past two decades and quickly see a path forward for SmartNICs, where they may even begin offloading the primary computational tasks of a server. It’s not inconceivable to envision a server with a half dozen SmartNICs all tasked with encoding video, acting as key-value stores, web caches, or even trading stocks on various exchanges. I can see a day soon where SmartNIC selection will eclipse server CPU selection when designing a new solution from the ground up.
Accelerators are like calling in a special forces team to address a serious competitive threat. By design, a special forces team, known as an “A Detachment” or “A-Team,” consists of two officers and ten sergeants, all of whom are cross-trained in five different skill areas: weapons, engineering, medical, communications, and operations intelligence. This enables the detachment to survive for months or even years in a hostile area without any operational support. Accelerators are the computational equivalent.
A well-designed accelerator has different blocks of silicon to address each of the four primary computational workloads we discussed in part two:
Scalar, working with integers and letters
Floating-point, the real numbers with decimal points
Vector, one-dimensional arrays of floating-point numbers
Artificial Intelligence (AI), vectors of low-precision floating-point numbers mixed with integers
If workload types are much like special forces skills, then what types of physical computational cores can we leverage in an accelerator design, each optimized to address one of these specific tasks?
For scalar problems, Intel’s x86 platform has led for decades, as far back as the early 1980s. Quietly, over the last 25 years, the ARM architecture has evolved, and in the past five years ARM has demonstrated everything necessary to be a serious data center player. Add to that ARM’s architecture licensing model, which has led to third parties developing their own instruction-set-compatible cores. These factors have resulted in at least a dozen companies, from Apple to Samsung, developing their own ARM core designs. Today ARM cores can be found in everything from Nest thermostats to Apple iPhones. The most popular architecture for workload acceleration is the ARMv8-A, specifically the Cortex-A72 design, which supports both 32-bit and 64-bit computing with one to four computational cores. Today the Broadcom Stingray, Mellanox BlueField, NXP Layerscape, and Xilinx Versal all use the ARM Cortex-A72.
When it comes to accelerating floating point, the current trend is toward Graphical Processing Units (GPUs). While GPUs have been around for two decades, it wasn’t until the NVIDIA Tesla debuted that they were viewed as real computational accelerators. GPUs are also suitable for the third workload model, vector processing; in essence, GPUs can kill two birds with one computational stone. Another solution that can accelerate certain types of floating-point operations is the digital signal processing (DSP) engine. DSPs are very good at real-world computational problems that have a high degree of multiply-accumulates and matrix operations. Here is where some accelerator boards are stronger than others. The Broadcom Stingray only has a cryptographic engine designed to handle single-pass hashing and encryption/decryption (both scalar tasks); it lacks any sort of added acceleration for floating-point math. Mellanox’s BlueField chip also doesn’t include any silicon specifically dedicated to floating point. What they do promote is the fact that BlueField provides GPUDirect, so the processor can communicate directly with GPUs on another PCIe card. NXP only has ARM cores, so no additional floating-point support is provided. By contrast, Xilinx’s Versal architecture includes anywhere from 472 to 3,984 DSP engines, depending on the chip series and model.
Artificial Intelligence (AI) workloads leverage vector processing, but instead of high-precision floating point they only require low-precision floating point or integer numbers. Again, Broadcom, Mellanox, and NXP all fall short, as they don’t include any silicon to process these workloads directly. Mellanox, as mentioned earlier, does support GPUDirect for passing AI workloads to another PCIe board, but that’s a far cry from dedicated on-chip silicon. Xilinx’s Versal architecture includes anywhere from 128 to 400 AI engines for accelerating these workloads.
Finally, the most significant differentiator is the inclusion of FPGA logic, also known as adaptable engines, something unique to Xilinx accelerator cards. This is the capability to take frequently called routines written in C/C++ and port them over to dedicated logic, which can improve a routine’s performance by at least 8X.
In the case of Xilinx’s new Versal architecture, the senior officer is an ARM Cortex-R5 for real-time workloads. The junior officer, and the one who does much of the work, is a quad-core ARM Cortex-A72. The two ARM engines are primarily for control plane functions. Versal then has AI cores, DSP engines, and adaptable engines (FPGA logic) to accelerate the volumetric workloads. When it comes to application acceleration in hardware, the Xilinx Versal is the A-Team!
Before we return to accelerators as a solution, we need to make a pit stop and explore the how behind the why. The why is simple; we buy a product or service to solve a problem. We intellectually evaluate stories and experiences, distill out the solutions that apply then affix those to tangible objects or services we can acquire. Rarely does someone buy an iPad to own an iPad, they have a specific use case in mind as their justification for that expense. The same holds for servers and accelerator cards. At this point in our technological evolution, the how for most remains a mystery which needs some explanation.
When a technician visits your home to fix a broken appliance, they don’t just walk in with a lone flat-bladed screwdriver. They carry a pretty large toolbox explicitly assembled for repairing appliances. The contents of that toolbox are different from those of a carpenter’s or an automotive mechanic’s. While all three might have a screwdriver, only the carpenter would have a wood chisel, and only the mechanic a torque wrench. Different problems demand different tools. For the past several decades, many of us have viewed the x86 architecture as the computational tool to solve ALL our information processing issues. Guess what: a great many things don’t optimize well to the x86 model, but if you throw enough clock cycles and CPU cores at most problems, a solution will eventually be reached.
The High-Performance Computing (HPC) market realized this many years ago, so it built heterogeneous computing environments with schedulers for each type of problem. Problems were classified as scalar, floating-point, or vector; since then we’ve added Artificial Intelligence (AI), also known as Machine Learning (ML). Scalar problems are the ones that deal with integers (numbers without a decimal point), which is often how we represent text. So, for example, a database lookup of your name to fetch your address is entirely a scalar problem. Next, we have floating point, or calculations with a decimal point, the real numbers. These require different computational routines, and as far back as the early 1980s we introduced special numerical co-processors (early accelerators) in our PCs to handle this specific class of problems (e.g., the Intel 8087). Today we can farm this class of problems out to Graphical Processing Units (GPUs), as they have many parallel cores explicitly designed for this purpose.
Then there’s the mysterious class called vector computing. A vector is a one-dimensional array of numbers. Some might argue that vectors are just a special case of floating-point problems, and they are, but their treatment at the processing level sets them far apart. Consider the Pythagorean theorem: solving for C when you know A and B requires not only a floating-point processor but many steps to arrive at the value of C. For illustration, let’s say it takes ten CPU instructions to arrive at a value for C; it’s probably more. Now imagine you have a set of 256 values for A and a corresponding set of 256 values for B; this would take 2,560 instructions to produce the complete solution set C. A vector processor will load the entire sets of A and B values into CPU registers at the same time, square them in one instruction, sum them in another, square-root the result in another, then present the solution set C in a final instruction: a handful of instructions instead of 2,560. Problems like weather forecasting map extremely well onto the vector processing model.
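To make this concrete, here is a minimal sketch in Python using NumPy, whose whole-array operations behave much like the vector instructions described above; the 256-element sets are from the example in the paragraph:

import numpy as np

# Two sets of 256 values for A and B, as in the example above
a = np.random.rand(256)
b = np.random.rand(256)

# Scalar approach: one element at a time, many instructions per element
c_scalar = [(x**2 + y**2) ** 0.5 for x, y in zip(a, b)]

# Vector approach: whole arrays squared, summed, and square-rooted at once
c_vector = np.sqrt(a**2 + b**2)

print(np.allclose(c_scalar, c_vector))  # True: same answer, far fewer instructions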
Finally, there is the fourth, relatively new class of problems that falls into the realm of AI or ML. Here the math is vector based, a mix of both integer (scalar) and real numbers, but with intentionally low precision. The difference is that the computed value doesn’t always need to be perfect, just close enough. Much like when you do your taxes and leave off the change in your calculations: the IRS is okay with whole numbers because they’re good enough. Your self-driving car can drift an inch or so in any direction, and it won’t make any difference, as it will still be more accurate than your Grandma Nat behind the wheel.
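And a hedged sketch of the “close enough” idea, again in NumPy: quantize two vectors down to 8-bit integers, do the dot product in integer math, and compare it with the full-precision result. The sizes and the naive scaling scheme here are illustrative, not taken from any particular AI framework:

import numpy as np

rng = np.random.default_rng(42)
x = rng.standard_normal(256).astype(np.float32)
w = rng.standard_normal(256).astype(np.float32)

exact = float(np.dot(x, w))  # full-precision floating-point answer

# Naive symmetric quantization to int8: map the largest magnitude to 127
sx = np.abs(x).max() / 127.0
sw = np.abs(w).max() / 127.0
qx = np.round(x / sx).astype(np.int8)
qw = np.round(w / sw).astype(np.int8)

# Integer dot product, rescaled back to real units
approx = int(np.dot(qx.astype(np.int32), qw.astype(np.int32))) * sx * sw

print(exact, approx)  # not identical, but close enough for many AI workloads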
So now, back to the problem at hand: how do we accelerate today’s complicated workloads? For the past three decades, we’ve been taking a scalar platform, the x86 processor with floating-point capabilities, and using it as a double-ended screwdriver with both a flat and a Phillips head to address every problem we have. How do we move forward?
Stay tuned for part three, where we cover hardware acceleration platforms.
“… when you have access to the vastness of space, you realize there’s only one resource worth fighting over… even killing for: More time. Time is the single most precious commodity in the universe.”
— Kalique Abrasax, Jupiter Ascending (2015)
Computing is humanity’s purest quest to convert time into work. In 2000, IBM demonstrated slicing one second into 10 billion units (10GHz) and then squeezing computational work out of each unit. At the time, IBM had defined a new 130-nanometer process they called “CMOS 9S,” planned for future-generation PowerPC chips. In parallel, IBM was ramping up production of the POWER4 at 1.9GHz. Now you may be asking yourself, “but wait a minute, I’ve never seen any production 10GHz CPUs, especially not 20 years ago,” and you’re correct. IBM’s POWER6 was as close as we’ve gotten, with one version of that chip advertised at 5GHz; in the lab they achieved 6GHz. I’ve also heard IBM reps brag about 7GHz with POWER8 if you turn half the cores off. So why has computing hit the wall at 4-5GHz, never reaching 10GHz over the last twenty years?
Intel explained this five years ago in the blog post “Why has CPU frequency ceased to grow?” The problem has a name: the “conveyor level.” Imagine a CPU as a conveyor-belt-driven assembly line with four workstations labeled A through D. Since an assembly line is a serial process, the worker at station B can’t start until the worker at station A finishes. Ideally, each station is designed to take the same amount of time to finish its work, so the following station isn’t impacted. The slowest worker then defines the speed of the conveyor on any given day. So if the most time-consuming stage in the CPU pipeline takes 250 picoseconds, then the clock frequency is 4GHz. There is also the issue of heat.
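The arithmetic behind that 4GHz figure is simply the reciprocal of the slowest stage; a quick sketch:

# Slowest pipeline stage sets the clock period
stage_time_s = 250e-12        # 250 picoseconds
clock_hz = 1 / stage_time_s
print(clock_hz / 1e9, "GHz")  # 4.0 GHz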
As an electron races through a computer circuit, it experiences a form of friction known as resistance. Just like rubbing your hands together on a cold day produces heat, so does an electron zipping through a computer circuit. When designing any chip, heat is the enemy. The smaller the chip geometry, today seven nanometers, the more devices you can pack into a given space on a chip. More devices mean more heat. That same square centimeter of space at 7nm still has the same thermal limitations it did at 130nm twenty years ago. Sure, we can use fancy liquid systems to rapidly wick heat away from the chip instead of relying on airflow over an area-limited heat sink, but at the end of the day, every watt of power the chip consumes becomes heat. There are also individual circuits throughout the chip specifically designed to detect and respond to overheating; the last thing anyone wants is a smoldering piece of silicon where their CPU once was. In the 7GHz example above, the IBM representative said that if you viewed the POWER8 chip as a big chessboard and turned off all the CPU cores on the white squares, then all the cores on the black squares could be clocked at nearly twice the speed, or 7GHz. Why is this interesting?
For some computational problems, it’s much better to have two consecutive computations in the same unit of time than two unrelated ones. Electronic trading, also known as high-frequency trading (HFT), is the premier market-driven problem that benefits most from increasing clock frequency. Traders often ascribe a dollar value to a millionth of a second, and it varies from market to market based on the rules and volumes of each market. In the end, though, it always boils down to the trader’s speed and response to a market signal. If I’m faster than you at making the right decision, then I win the business and book the profit. Sticking with HFT, where do accelerators fit in?
Traders lease connections to exchanges. The closer and faster they can respond to signals from those connections, the more competitive they will be. Suppose my trading platform requires signals from the market to travel through my server, then another switch on my private network, back through a second server, and finally out to the market. The networking alone, even with kernel bypass, through two servers and a switch could easily take several microseconds. Add a few more microseconds for trading logic in both servers, and you could be looking at almost ten microseconds to submit a trade in response to a signal. Two years ago, Solarflare with LDA Technologies demonstrated a 98-nanosecond tick-to-trade. That was using accelerator technology, and compared to the trading platform described above it is roughly two orders of magnitude faster; the difference between walking from NYC to LAX and flying there on a commercial jet. Time matters, and acceleration is not just for HFTs anymore. Why do you think Google bought Myricom, Amazon picked up Annapurna Labs, Nvidia purchased Mellanox, or Xilinx acquired Solarflare?
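A quick back-of-the-envelope check on that comparison, using the numbers from the paragraph above:

# Latency ratio between the two trading paths described above
legacy_path_ns = 10_000   # ~10 microseconds through two servers and a switch
accelerated_ns = 98       # Solarflare/LDA tick-to-trade demonstration
print(legacy_path_ns / accelerated_ns)  # ~102x, roughly two orders of magnitude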
Please stay tuned, more to come in part two. In the meantime feel free to check out previous articles on this topic: