x86 Has Hit the Wall, and Now Come the Accelerators – Part 3

TV’s Original A-Team

Accelerators are like calling in a special forces team to address a serious competitive threat. By design, a special forces team, known as an “A Detachment” or “A-Team,” consists of two officers and ten sergeants, all of whom are cross-trained in five skill areas: weapons, engineering, medical, communications, and operations intelligence. This enables the detachment to survive for months or even years in hostile territory without operational support. Accelerators are the computational equivalent.

A well-designed accelerator has different blocks of silicon to address each of the four primary computational workloads we discussed in part two:

  • Scalar, working with integers and text
  • Floating-point, the real numbers with decimal points
  • Vector, one-dimensional arrays of floating-point numbers
  • Artificial Intelligence (AI), vectors of low-precision floating-point numbers mixed with integers

If workload types are much like special forces skills, then what types of physical computational cores, each optimized for one of these specific tasks, can we leverage in an accelerator design?

For scalar problems, Intel’s x86 platform has led the market since the early 1980s. Quietly, over the last 25 years, the ARM architecture has evolved, and in the past five years ARM has demonstrated everything necessary to be a serious data center player. Add to that ARM’s architecture licensing model, which has enabled third parties to develop their own instruction-set-compatible cores. Both factors have resulted in at least a dozen companies, from Apple to Samsung, developing their own ARM core designs, and today ARM cores can be found in everything from Nest thermostats to Apple iPhones. The most popular architecture for workload acceleration is ARMv8-A, specifically the Cortex-A72 design, which supports both 32-bit and 64-bit computing and is typically deployed with one to four computational cores. The Broadcom Stingray, Mellanox BlueField, NXP Layerscape, and Xilinx Versal all use the ARM Cortex-A72.

When it comes to accelerating floating point, the current trend is toward Graphics Processing Units (GPUs). While GPUs have been around since the late 1990s, it wasn’t until NVIDIA’s Tesla line debuted in 2007 that they were viewed as serious computational accelerators. GPUs are also well suited to the third workload class, vector processing; in essence, they kill two birds with one computational stone. Another solution that can accelerate certain types of floating-point work is the digital signal processing (DSP) engine. DSPs are very good at real-world computational problems with a high proportion of multiply-accumulate and matrix operations. Here is where some accelerator boards are stronger than others. The Broadcom Stingray has only a cryptographic engine designed to handle single-pass hashing and encryption/decryption (both scalar tasks); it lacks any added acceleration for floating-point math. Mellanox’s BlueField chip also doesn’t include any silicon specifically dedicated to floating point; what Mellanox does promote is that BlueField supports GPUDirect, so the processor can communicate directly with GPUs on another PCIe card. NXP’s Layerscape has only ARM cores, so no additional floating-point support is provided. By contrast, Xilinx’s Versal architecture includes anywhere from 472 to 3,984 DSP engines, depending on the chip series and model.
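To make the DSP niche concrete, here is a minimal C sketch of my own (not vendor code) of the multiply-accumulate (MAC) pattern those engines are built around: a dot product, the kernel at the heart of filters and matrix math. A DSP slice performs each multiply and accumulate as a single hardware operation; the plain C loop below just shows the math.

```c
/* Minimal illustration of the multiply-accumulate (MAC) pattern that
 * DSP engines accelerate: a dot product. Each loop iteration does one
 * multiply and one add; a DSP slice does both in a single operation. */
#include <stdio.h>

static float dot_product(const float *x, const float *w, int n)
{
    float acc = 0.0f;
    for (int i = 0; i < n; i++)
        acc += x[i] * w[i];   /* one multiply-accumulate per element */
    return acc;
}

int main(void)
{
    float x[4] = {1.0f, 2.0f, 3.0f, 4.0f};
    float w[4] = {0.5f, 0.25f, 0.125f, 0.0625f};
    printf("dot = %f\n", dot_product(x, w, 4));
    return 0;
}
```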

Artificial Intelligence (AI) workloads leverage vector processing, but instead of high-precision floating point they require only low-precision or integer numbers. Again, Broadcom, Mellanox, and NXP all fall short, as they don’t include any silicon to process these workloads directly. Mellanox, as mentioned earlier, does support GPUDirect for passing AI workloads to another PCIe board, but that’s a far cry from dedicated on-chip silicon. Xilinx’s Versal architecture includes anywhere from 128 to 400 AI engines for accelerating these workloads.
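For comparison with the DSP sketch above, an AI engine spends its time on the same multiply-accumulate pattern, only on low-precision 8-bit integers with a wider accumulator. A minimal sketch, again my own illustration rather than vendor code:

```c
/* The AI-workload version of the dot product: 8-bit integer inputs
 * with a 32-bit accumulator so intermediate sums don't overflow. */
#include <stdint.h>
#include <stdio.h>

static int32_t dot_i8(const int8_t *x, const int8_t *w, int n)
{
    int32_t acc = 0;                       /* wide accumulator */
    for (int i = 0; i < n; i++)
        acc += (int32_t)x[i] * (int32_t)w[i];
    return acc;
}

int main(void)
{
    int8_t x[4] = {12, -7, 3, 100};
    int8_t w[4] = {2, 5, -1, 1};
    printf("int8 dot product = %d\n", dot_i8(x, w, 4));
    return 0;
}
```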

Finally, the most significant differentiator is the inclusion of FPGA logic, also known as adaptable engines, something unique to the Xilinx accelerator cards. Adaptable engines make it possible to take frequently called routines written in C/C++ and port them over to dedicated logic, which can improve a routine’s performance by at least 8X.
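As a hedged illustration of the kind of routine that gets ported, here is a small C function in the style accepted by high-level synthesis (HLS) tools. The pragma follows Xilinx HLS conventions but should be treated as illustrative; a normal C compiler simply ignores it, while an HLS flow would turn the loop into a hardware pipeline in the FPGA fabric.

```c
/* A small, frequently called routine of the sort that HLS tools can
 * convert into dedicated FPGA logic. The pragma (Xilinx HLS style,
 * shown for illustration) asks for a pipeline producing one result per
 * clock; a regular C compiler ignores it and the code still runs. */
#include <stdio.h>

void saxpy(const float *x, const float *y, float *out, float a, int n)
{
    for (int i = 0; i < n; i++) {
#pragma HLS PIPELINE II=1
        out[i] = a * x[i] + y[i];
    }
}

int main(void)
{
    float x[4] = {1, 2, 3, 4}, y[4] = {10, 20, 30, 40}, out[4];
    saxpy(x, y, out, 2.0f, 4);
    for (int i = 0; i < 4; i++)
        printf("%.1f ", out[i]);
    printf("\n");
    return 0;
}
```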

In the case of Xilinx’s new Versal architecture, the senior officer is an ARM Cortex-R5 for real-time workloads. The junior officer, the one who does much of the work, is a dual-core ARM Cortex-A72. These two ARM engines are primarily for control plane functions. Versal then adds AI engines, DSP engines, and adaptable engines (FPGA logic) to accelerate the volumetric workloads. When it comes to application acceleration in hardware, the Xilinx Versal is the A-Team!

x86 Has Hit the Wall, and Now Come the Accelerators – Part 2

Before we return to accelerators as a solution, we need to make a pit stop and explore the how behind the why. The why is simple; we buy a product or service to solve a problem. We intellectually evaluate stories and experiences, distill out the solutions that apply, then affix those to tangible objects or services we can acquire. Rarely does someone buy an iPad just to own an iPad; they have a specific use case in mind as the justification for the expense. The same holds for servers and accelerator cards. At this point in our technological evolution, though, the how remains a mystery for most and needs some explanation.

When a technician visits your home to fix a broken appliance, they don’t just walk in with a lone flat-bladed screwdriver. They carry a fairly large toolbox that was explicitly assembled for repairing appliances. The contents of that toolbox are different from those of a carpenter’s or an automotive mechanic’s. While all three might have a screwdriver, only the carpenter would have a wood chisel, and only the mechanic a torque wrench. Different problems demand different tools. For the past several decades, many of us have viewed the x86 architecture as the computational tool to solve ALL our information processing issues. Guess what: a great many things don’t map well onto the x86 model, but if you throw enough clock cycles and CPU cores at most problems, a solution will eventually be reached.

The High-Performance Computing (HPC) market realized this many years ago, so it built heterogeneous computing environments with schedulers for each type of problem. It classified problems into scalar, floating-point, and vector. Since then we’ve added a fourth class, Artificial Intelligence (AI), also known as Machine Learning (ML). Scalar problems are the ones that deal with integers (numbers without a decimal point), which is also how we often represent text. So, for example, a database lookup of your name to fetch your address is entirely a scalar problem. Next we have floating point, or calculations with a decimal point, the real numbers. These require different computational routines, and as early as the start of the 1980s we introduced special numerical co-processors (early accelerators, such as the Intel 8087) into our PCs to handle this specific class of problems. Today we can farm this class of problems out to Graphics Processing Units (GPUs), as they have many parallel cores explicitly designed for this purpose.
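To ground those two classes, here is a tiny C sketch of my own (the names and data are invented for illustration): the first loop is the “database lookup,” pure scalar integer and text work, while the last line is the kind of real-number math that the 8087, and today a GPU, exists to accelerate.

```c
/* Scalar vs. floating-point workloads in miniature. */
#include <math.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    /* Scalar workload: integer/text comparison (a toy "lookup"). */
    const char *names[3]     = {"Ada", "Grace", "Linus"};
    const char *addresses[3] = {"10 Downing St", "1 Infinite Loop", "221B Baker St"};
    for (int i = 0; i < 3; i++)
        if (strcmp(names[i], "Grace") == 0)
            printf("address: %s\n", addresses[i]);

    /* Floating-point workload: real-number math. */
    double principal = 1000.0, rate = 0.05;
    printf("value after 10 years: %.2f\n", principal * pow(1.0 + rate, 10));
    return 0;
}
```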

Then there’s the mysterious class called vector computing. A vector is a one-dimensional array of numbers. Some might argue that vectors are just a special case of floating-point problems, and they are, but their treatment at the processing level sets them far apart. Consider the Pythagorean theorem: solving for C when you know A and B requires not only a floating-point processor but many steps to arrive at the value of C. For illustration, let’s say it takes ten CPU instructions to arrive at a value for C; it’s probably more. Now imagine you have a set of 256 values for A and a corresponding set of 256 values for B; computing them one at a time would take 2,560 instructions to produce the complete solution set C. A vector processor instead loads whole blocks of A and B values into wide registers at once, squares them in one instruction, sums them in another, square-roots the result in a third, and then writes out the solution set C, a handful of instructions instead of 2,560. Problems like weather forecasting map extremely well onto the vector processing model.
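Here is a minimal C sketch of that contrast, my own illustration using x86 AVX intrinsics simply because they are easy to run; a dedicated vector engine applies the same principle with wider registers. The scalar loop touches one element per iteration, while each AVX instruction operates on eight floats at once.

```c
/* Scalar vs. vector versions of the Pythagorean example above.
 * Compile with: gcc -mavx pythag.c -lm */
#include <immintrin.h>
#include <math.h>
#include <stdio.h>

#define N 256

int main(void)
{
    float A[N], B[N], C_scalar[N], C_vector[N];
    for (int i = 0; i < N; i++) { A[i] = i + 1.0f; B[i] = i + 2.0f; }

    /* Scalar: one element per pass through the loop body. */
    for (int i = 0; i < N; i++)
        C_scalar[i] = sqrtf(A[i] * A[i] + B[i] * B[i]);

    /* Vector: eight elements per pass; each AVX instruction works on a
     * whole 256-bit register (8 floats) at a time. */
    for (int i = 0; i < N; i += 8) {
        __m256 a   = _mm256_loadu_ps(&A[i]);
        __m256 b   = _mm256_loadu_ps(&B[i]);
        __m256 sum = _mm256_add_ps(_mm256_mul_ps(a, a), _mm256_mul_ps(b, b));
        _mm256_storeu_ps(&C_vector[i], _mm256_sqrt_ps(sum));
    }

    printf("C[0] = %f (scalar) vs %f (vector)\n", C_scalar[0], C_vector[0]);
    return 0;
}
```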

Finally, there is the fourth, relatively new, class of problems that fall into the realm of AI or ML. Here the math is vector based, a mix of both integer (scalar) and real numbers, but with intentionally low precision. The computed value doesn’t always need to be perfect, just close enough. Much like when you do your taxes and leave off the change in your calculations: the IRS is okay with whole numbers because they’re good enough. Your self-driving car can drift an inch or so in any direction, and it won’t make any difference, as it will still be more accurate than your Grandma Nat behind the wheel.
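A minimal sketch of the “close enough” idea, assuming a simple symmetric scale (my own illustration, not any particular framework’s scheme): 32-bit floating-point weights are mapped onto 8-bit integers, and the round trip shows how little precision is lost.

```c
/* Quantize a few float32 "weights" to int8 and back again. */
#include <math.h>
#include <stdio.h>

int main(void)
{
    const float weights[4] = {0.42f, -1.87f, 0.03f, 1.25f};

    /* Find the largest magnitude so the range maps onto [-127, 127]. */
    float max_abs = 0.0f;
    for (int i = 0; i < 4; i++)
        if (fabsf(weights[i]) > max_abs) max_abs = fabsf(weights[i]);
    float scale = max_abs / 127.0f;

    for (int i = 0; i < 4; i++) {
        signed char q = (signed char)lrintf(weights[i] / scale); /* quantize */
        printf("%+.4f -> %4d -> %+.4f\n", weights[i], q, q * scale);
    }
    return 0;
}
```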

So now, back to the problem at hand: how do we accelerate today’s complicated workloads? For the past three decades, we’ve been taking a scalar platform, the x86 processor with floating-point capabilities, and using it like a double-ended screwdriver, with both a flat and a Phillips head, to address every problem we have. How do we move forward?

Stay tuned for part three, where we cover hardware acceleration platforms.

7nm, Miniaturization to Integration

Last night while channel surfing I came across Men in Black III and was dropped right into the scene where a 1969 Tommy Lee Jones places Will Smith into the Neuralyzer pictured on the left. For those not familiar with the original 1997 MiB franchise, a Neuralyzer is a cigar-sized plot device, normally carried in a jacket pocket, for wiping people’s memories of an alien encounter. The writers were clearly poking fun at miniaturization and how much humanity has come to take it for granted.

Those of us who grew up in the 1960s and 70s lived through the miniaturization wave as the Japanese led the industry in shrinking radios and televisions from cabinet-sized living room appliances to handheld devices. One year for Father’s Day in the late 70s we bought my dad a portable, battery-powered black-and-white TV with a radio so he could watch it on the boat in the evenings. It was roughly the size of three laptops stacked on top of one another. It may sound corny now, but it was amazing back then. Today we watch theater-quality movies, in color, on a much larger screen, from a device that drops into our pocket, and we don’t think twice about it. We’ve grown accustomed to technology improving at a rapid rate; it’s now expected. But what happens when that rate is no longer sustainable?

Last year the industry began etching chips with a new seven-nanometer process, which is roughly equivalent to Intel’s 10nm process. Apple’s A12 Bionic chip, which powers the iPhone XR and XS series, is one of the first built on this new 7nm process. It contains 6.9 billion transistors and is arguably one of the most advanced devices ever produced by mankind. By contrast, my first computer in 1983 was a TRS-80 Model III powered by the Zilog Z80 processor. The Z80 used a 4,000nm process and contained only 8,500 transistors. So in 35 years we’ve reduced the process size by nearly three orders of magnitude (4,000nm ÷ 7nm ≈ 570) and increased the transistor count by roughly six orders of magnitude (6.9 billion ÷ 8,500 ≈ 800,000), wow! How do we top that, and where are we in the grand scheme of the physics of miniaturization?

In a 1965 paper, Gordon Moore, then at Fairchild Semiconductor (which he co-founded) and later CEO of Intel, observed that the density of integrated circuits would double every year, a prediction now known as Moore’s Law. From 1970 through 2014 this “law” essentially held true. Before Intel’s current 10nm geometry, their prior generation was 14nm, achieved in 2014, so it has taken them five years to accomplish 10nm. Not exactly Moore’s Law, but that’s just the tip of the iceberg. As the industry goes from 14nm to 7nm/10nm, physics is once again throwing up a roadblock; this isn’t the first one, but it could be the last. Chips are made from silicon, and silicon atoms have a diameter of about 0.2 nanometers. So at a seven-nanometer node size we’re talking about 35 or so silicon atoms across, which isn’t a very large number. It turns out that below seven nanometers, as we have fewer and fewer silicon atoms with which to manage electron flow, things get dicey. Chips begin to experience quantum effects; most notably, those pesky electrons, roughly a millionth of a nanometer in size, begin to exhibit something called quantum tunneling. They no longer behave the way they are supposed to and move between devices etched into the silicon with a sort of reckless disregard for the “normal” rules of physics. This has been known for some time, though.

Back in 2016 a team at Lawrence Berkeley National Laboratory demonstrated a one-nanometer transistor, but it leveraged a carbon nanotube to manage electron flow and stave off the quantum tunneling effect. For those not familiar with carbon nanotubes, think of teeny tiny straws rolled from graphene, where the wall of the straw is a single atom thick. While using carbon nanotubes to solve the problem is ingenious, it doesn’t fit into how we make chips today; you can’t etch a carbon nanotube using conventional chip fabrication processes. So while it’s a solution to the problem, it’s not one that can easily be put into production, and we may be working at 7nm for some time to come. That only means one aspect of miniaturization has ground to a halt. While I’ve used the term chip above to refer to an integrated circuit, the more precise term is actually a “die.”

Until recently it was common practice to place a single die inside a package. A package is what most of us think of as the chip; it’s the part with the metal pins coming out of the bottom or sides. In recent years the industry has developed new techniques that allow us to layer multiple dies onto one another within the same physical package, enabling the creation of very complex chips. It’s similar to a seven-layer cake, where a different type of cake can go in each layer and the icing carries flavors across the layers. This means that a chip can contain several, and eventually many, dies, or layers. A recent example of this is Xilinx’s new Versal chip line.

Within the Versal chip package there are multiple dies that contain two different pairs of ARM CPU cores, hundreds of Artificial Intelligence (AI) engines, thousands of Digital Signal Processing (DSP) engines, a huge Field Programmable Gate Array (FPGA) region, several classes of memory, and multiple programmable memory, PCIe, and Ethernet controllers. The Versal platform is a flexible toolbox of computational power: the ARM cores handle traditional CPU and real-time processing tasks, the AI cores churn through new machine learning workloads, the DSPs are leveraged for advanced signal processing (think 5G), and the FPGA serves as the versatile computational glue that pulls all these complex engines together. Finally, the memory, PCIe, and Ethernet controllers interface with the real world. So while Intel and AMD focus on scaling the number of CPU cores on a chip, and NVIDIA works to improve Graphics Processing Unit (GPU) density, Xilinx is the first to go all-in on chip-level workload integration. This is the key to accelerating the data center going forward.

So until we solve the quantum tunneling problem with new fabrication techniques, we can use advances in integration, as shown above, to move the industry forward.

Will AI Restore HFT Profitability?

Several years ago Michael Lewis wrote “Flash Boys: A Wall Street Revolt” and created a populist backlash that rocked the world of High-Frequency Trading (HFT). Lewis had publicly pulled back the curtain and shared his perspective on how our financial markets work. Right or wrong, he sensationalized an industry that most people don’t understand, and yet one in which nearly all of us have invested our life savings.

Several months after Flash Boys, Peter Kovac, a trader with EWT, authored “Flash Boys: Not So Fast,” in which he refuted a number of the sensationalist claims in Lewis’s book. While I’ve not yet read Kovac’s book (it’s behind “Dark Pools: The Rise of Machine Traders and the Rigging of the U.S. Stock Market” in my reading list), from the reviews it appears he did an excellent job of covering the other side of the argument.

All that aside, earlier today I came across an article by Joe Parsons in “The Trade” titled “HFT: Not so flashy anymore.” Parsons makes several very salient points:

  • Making money in HFT today is no longer as easy as it once was. As a result, big players are rapidly gobbling up smaller ones; for example, in April Virtu picked up KCG, and Quantlab recently acquired Teza Technologies.
  • Speed is no longer the primary differentiator between HFT shops. As someone in the speed business, I can tell you that today a one-microsecond network half round trip in trading is table stakes, and some shops are deploying leading-edge technologies that can reduce this to under 200 nanoseconds. Speed as an HFT shop’s only value proposition hasn’t been the case for a rather long time; this is one of the key assertions made five years ago in the “Dark Pools” book mentioned above.
  • MiFID II is having a greater impact on HFT than most people realize. It brings with it new and more stringent regulations governing dark pools and block trades.
  • Perhaps the most important point Parsons makes is that the shift today is away from arbitrage as a primary revenue source and towards analytics and artificial intelligence (AI). He downplays this point by offering it up as a possible future.

AI is the future of HFT as shops look for new and clever ways to push intelligence as close as possible to the exchanges. Companies like Xilinx, the leader in FPGAs, NVIDIA with their P40 GPU, and Google, yes Google, with their Tensor Processing Unit (TPU) will define the hardware roadmap going forward. How this will all unfold is still to be determined, but it will be one wild ride.