
Accelerators are like calling in a special forces team to address a serious competitive threat. By design, a special forces team, known as an “A Detachment” or “A-Team,” consists of two officers and ten sergeants, all of whom are cross-trained in five skill areas: weapons, engineering, medical, communications, and operations intelligence. This cross-training enables the detachment to survive for months or even years in a hostile area without operational support. Accelerators are the computational equivalent.
A well-designed accelerator has a different block of silicon to address each of the four primary computational workloads we discussed in part two (each illustrated in the code sketch after this list):
- Scalar, working with integers and characters
- Floating-point, the real numbers with decimal points
- Vector, one-dimensional arrays of floating-point numbers
- Artificial Intelligence (AI), vectors of low-precision floating-point numbers mixed with integers
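To make these categories concrete, here is a minimal C++ sketch with one toy routine per workload type. The function names and values are our own illustrations, not from any vendor's code:

```cpp
#include <cstdint>
#include <cstdio>

// Scalar: integers and characters, processed one value at a time.
int checksum(const char* s) {
    int sum = 0;
    while (*s) sum += *s++;
    return sum;
}

// Floating-point: real numbers with fractional parts.
double compound(double principal, double rate, int years) {
    for (int i = 0; i < years; ++i) principal *= (1.0 + rate);
    return principal;
}

// Vector: the same floating-point operation applied across an array;
// a vector unit (or a GPU) executes many iterations in parallel.
void scale(float* v, float k, int n) {
    for (int i = 0; i < n; ++i) v[i] *= k;
}

// AI: low-precision multiply-accumulate -- int8 inputs, int32 accumulator.
int32_t dot_i8(const int8_t* a, const int8_t* b, int n) {
    int32_t acc = 0;
    for (int i = 0; i < n; ++i) acc += int32_t(a[i]) * int32_t(b[i]);
    return acc;
}

int main() {
    float v[4] = {1.0f, 2.0f, 3.0f, 4.0f};
    scale(v, 2.0f, 4);
    int8_t a[2] = {100, -50}, b[2] = {2, 3};
    std::printf("%d %.2f %.1f %d\n",
                checksum("ACC"), compound(100.0, 0.05, 2), v[3],
                dot_i8(a, b, 2));
}
```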
If workload types are much like special forces skills, then what types of physical computational cores, each optimized for one of these specific tasks, can we leverage in an accelerator design?
For scalar problems, Intel’s x86 architecture has led the market since the early 1980s. Quietly, over the last 25 years, the ARM architecture has evolved, and in the past five years it has demonstrated everything necessary to be a serious data center player. Add to that ARM’s architecture licensing model, which has enabled third parties to develop their own instruction-set-compatible cores. Both factors have led at least a dozen companies, from Apple to Samsung, to develop their own ARM core designs, and today ARM cores can be found in everything from Nest thermostats to Apple iPhones. The most popular architecture for workload acceleration today is ARMv8-A, specifically the Cortex-A72 design, which supports both 32-bit and 64-bit computing in configurations of one to four computational cores. The Broadcom Stingray, Mellanox BlueField, NXP Layerscape, and Xilinx Versal all use the ARM Cortex-A72.
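For readers who want to target one of these parts directly, the 32-bit/64-bit split is visible at compile time. A small sketch, assuming a GCC or Clang toolchain (the predefined macros below are standard to both):

```cpp
#include <cstdio>

int main() {
#if defined(__aarch64__)
    std::puts("Built for 64-bit ARM (AArch64), e.g. a Cortex-A72.");
#elif defined(__arm__)
    std::puts("Built for 32-bit ARM.");
#elif defined(__x86_64__)
    std::puts("Built for 64-bit x86.");
#else
    std::puts("Built for another architecture.");
#endif
}
```

Cross-compiling it for the Cortex-A72 found on these cards looks something like `aarch64-linux-gnu-g++ -mcpu=cortex-a72 -O2 detect.cpp`; the exact toolchain name varies by distribution.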
When it comes to accelerating floating point, the current trend has been toward Graphics Processing Units (GPUs). While GPUs have been around since the late 1990s, it wasn’t until NVIDIA’s Tesla line debuted that they were viewed as real computational accelerators. GPUs are also well suited to the third workload, vector processing; in essence, they can kill two birds with one computational stone. Another solution that can also accelerate certain types of floating-point operations is FPGA logic, which we return to below.
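The “two birds” claim is easiest to see in code: the archetypal GPU kernel, SAXPY, is simultaneously floating-point work and vector work. Here is a minimal C++17 sketch using a standard parallel algorithm; sizes and values are arbitrary, and with GCC you typically link TBB (`-ltbb`) to build it. NVIDIA’s nvc++ compiler can offload this same code to a GPU via `-stdpar=gpu`:

```cpp
#include <algorithm>
#include <execution>
#include <vector>
#include <cstdio>

int main() {
    std::vector<float> x(1 << 20, 1.5f), y(1 << 20, 2.0f);
    const float a = 3.0f;

    // SAXPY (y = a*x + y): one floating-point multiply-add per element,
    // with no dependence between elements -- exactly the shape of work
    // a GPU's thousands of cores are built for.
    std::transform(std::execution::par_unseq,
                   x.begin(), x.end(), y.begin(), y.begin(),
                   [a](float xi, float yi) { return a * xi + yi; });

    std::printf("%.1f\n", y[0]); // 6.5
}
```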
Artificial Intelligence (AI) workloads leverage vector processing, but instead of high-precision floating point they require only low-precision floating-point or integer numbers. Again, Broadcom, Mellanox, and NXP all fall short here, as they don’t include any silicon to process these workloads directly. Mellanox, as mentioned earlier, does support GPUDirect for passing AI workloads to another PCIe board, but that’s a far cry from on-chip dedicated silicon. Xilinx’s Versal architecture includes anywhere from 128 to 400 AI Engines for accelerating these workloads.
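Why does low precision suffice? AI inference tolerates small rounding errors, so 32-bit weights can be squeezed into 8-bit integers before the multiply-accumulate work begins. A minimal sketch of symmetric int8 quantization (the scale-factor choice here is a simple illustration, not a production scheme):

```cpp
#include <cstdint>
#include <cmath>
#include <vector>
#include <cstdio>

// Map floats in [-max, max] onto the int8 range [-127, 127].
int8_t quantize(float v, float scale) {
    float q = std::round(v / scale);
    if (q > 127.0f)  q = 127.0f;
    if (q < -127.0f) q = -127.0f;
    return static_cast<int8_t>(q);
}

int main() {
    std::vector<float> weights = {0.42f, -1.30f, 0.07f, 0.99f};
    float scale = 1.30f / 127.0f;  // largest magnitude maps to 127

    for (float v : weights) {
        int8_t q = quantize(v, scale);
        // The round trip shows the small precision loss AI tolerates.
        std::printf("%+.4f -> %4d -> %+.4f\n", v, q, q * scale);
    }
}
```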
Finally, the most significant differentiator is the inclusion of FPGA logic, also known as adaptable engines, something unique to Xilinx accelerator cards. This is the capability to take frequently called routines written in C/C++ and port them over to dedicated logic, which can improve a routine’s performance by at least 8X.
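As a rough sketch of what that porting looks like, here is a small multiply-accumulate routine annotated in the style of Xilinx’s Vitis HLS tooling; the pragma asks the tool to build a pipeline that starts a new loop iteration every clock cycle, while an ordinary compiler simply ignores it. The routine and sizes are our own illustration:

```cpp
#include <cstdio>

constexpr int N = 1024;

// A frequently called routine written in plain C++: multiply-accumulate
// over two arrays. Under an HLS flow this loop becomes dedicated logic.
void mac_kernel(const int a[N], const int b[N], int& result) {
    int acc = 0;
mac_loop:  // loop labels show up by name in HLS reports
    for (int i = 0; i < N; ++i) {
        // Request a pipeline with initiation interval 1: one new
        // iteration enters the hardware pipeline per clock cycle.
        #pragma HLS PIPELINE II=1
        acc += a[i] * b[i];
    }
    result = acc;
}

int main() {
    static int a[N], b[N];
    for (int i = 0; i < N; ++i) { a[i] = 1; b[i] = 2; }
    int r = 0;
    mac_kernel(a, b, r);
    std::printf("%d\n", r);  // 2048
}
```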
In the case of Xilinx’s new Versal architecture, the senior officer is an ARM Cortex-R5 for real-time workloads. The junior officer, and the one who does much of the day-to-day work, is the ARM Cortex-A72 mentioned above, which handles the general-purpose scalar workloads.
Stay tuned for part three, where we cover hardware acceleration […]