x86 Has Hit the Wall, and Now Come the Accelerators – Part 3

Accelerators are like calling in a special forces team to address a serious competitive threat. By design, a special forces team, known as “A Detachment” or “A-Team” consists of two officers and ten sergeants, all of which are cross-trained in five different skill areas: weapons, engineering, medical, communications, and operations intelligence. This enables the detachment to survive for months or even years in a hostile area without any operational support. Accelerators are the computational equivalent.

A well-designed accelerator has different blocks of silicon to address each of the four primary computational workloads we discussed in part two:

Scaler, working with integers, and letters
Floating-point, the real numbers with decimal points
Vector, one-dimensional arrays of floating-point numbers
Artificial Intelligence (AI), vectors with low precision floating point mixed with integers

If workload types are much like special forces skills then what types of physical computational cores can we leverage in an accelerator design that are optimized to address these specific tasks?

For scaler problems, Intel’s x86 platform has led for decades as far back as the early 1980s. Quietly over the last 25 years, the ARM architecture has evolved. In the past five years, ARM has demonstrated everything necessary for it to be a serious data center player. Add to that ARM’s architecture licensing model which has led to third parties developing their cores which are instruction set compatible. Both of these factors have resulted in at least a dozen companies from Apple to Samsung developing their ARM core designs. Today ARM cores can be found in everything from Nest Thermostats to Apple iPhones. Today the most popular architecture for workload acceleration is the ARMv8-A. Specifically, the Cortex-A72 design which supports both 32bit and 64bit computing, with 1-4 computational cores. Today the Broadcom Stingray, Mellanox Bluefield, NXP Layerscape, and Xilinx Versal all use the ARM Cortex-A72.

When it comes to accelerating floating point, the current trend has been towards Graphical Processing Units (GPUs). While GPUs have been around for about a decade now, it wasn’t until the NVIDIA Tesla debuted that they were viewed as a real computational accelerator. GPUs are also suitable for the third workload model vector processing. In essence, GPUs can kill two birds with one computational stone. Another solution that can also accelerate certain types of floating-point operations are digital signal processing (DSP) engines. DSPs are very good at real-world computational problems that have a high degree of multiple-accumulates and matrix operations. Here is where some accelerator boards are stronger than others. While the Broadcom Stingray only has a cryptographic engine designed to handle single-pass hashing and encryption/decryption (both scaler tasks) it lacks any sort of added acceleration for floating-point math. Mellanox’s Bluefield chip also doesn’t include any silicon specifically dedicated to floating-point. What they do promote is the fact that Bluefield provides GPUDirect so the processor can communicate directly with GPUs on another PCIe card. NXP only has ARM cores, so no additional floating-point support is provided. By contrast, Xilinx’s Versal architecture includes anywhere from 472 to 3,984 DSP engines depending on the chip series and model.

Artificial Intelligence (AI) workloads leverage vector processing but instead of high precision floating point, they only require low precision or integer numbers. Again Broadcom, Mellanox, and NXP all fall short as they don’t include any silicon to process these workloads directly. Mellanox as mentioned earlier, does support GPUDirect for passing AI workloads to another PCIe board but that’s a far cry from on-chip dedicated silicon. Xilinx’s Versal architecture includes anywhere from 128 to 400 AI engines for accelerating these workloads.

Finally, the most significant differentiator is the inclusion of FPGA logic, also known as adaptable engines. This is something unique to only Xilinx accelerator cards. This is the capability to take frequently called upon routines written in C/C++ and port them over to dedicated logic which can improve the performance of a routine by at least 8X.

In the case of Xilinx’s new Versal architecture, the senior officer is an ARM Cortex-R5 for real-time workloads. The junior one officer, and the one who does much of the work, is an ARM Cortex-A72 quad-core processor. The two ARM engines are primarily for control plane functions. Then Versal has AI cores, DSP engines, and adaptable engines (FPGA logic) to accelerate the volumetric workloads. When it comes to application acceleration in hardware the Xilinx Versal is the A-Team!