x86 Has Hit the Wall, and Now Come the Accelerators – Part 3

TV’s Original A-Team

Accelerators are like calling in a special forces team to address a serious competitive threat. By design, a special forces team, known as an “A Detachment” or “A-Team,” consists of two officers and ten sergeants, all of whom are cross-trained in five skill areas: weapons, engineering, medical, communications, and operations intelligence. This enables the detachment to survive for months or even years in hostile territory without operational support. Accelerators are the computational equivalent.

A well-designed accelerator has different blocks of silicon to address each of the four primary computational workloads we discussed in part two:

  • Scalar, working with integers and text
  • Floating-point, the real numbers with decimal points
  • Vector, one-dimensional arrays of floating-point numbers
  • Artificial Intelligence (AI), vectors of low-precision floating-point numbers mixed with integers

If workload types are much like special forces skills, then what types of physical computational cores, each optimized for one of these specific tasks, can we leverage in an accelerator design?

For scalar problems, Intel’s x86 platform has led the market since the early 1980s. Quietly, over the last 25 years, the ARM architecture has evolved, and in the past five years ARM has demonstrated everything necessary to be a serious data center player. Add to that ARM’s architecture licensing model, which has enabled third parties to develop their own instruction-set-compatible cores. Both factors have resulted in at least a dozen companies, from Apple to Samsung, developing their own ARM core designs, and today ARM cores can be found in everything from Nest thermostats to Apple iPhones. The most popular architecture for workload acceleration is ARMv8-A, specifically the Cortex-A72 design, which supports both 32-bit and 64-bit computing and is typically deployed with one to four computational cores. The Broadcom Stingray, Mellanox BlueField, NXP Layerscape, and Xilinx Versal all use the ARM Cortex-A72.

When it comes to accelerating floating point, the current trend is toward Graphics Processing Units (GPUs). While GPUs have been around since the late 1990s, it wasn’t until NVIDIA’s Tesla line debuted in 2007 that they were viewed as serious computational accelerators. GPUs are also well suited to the third workload class, vector processing; in essence, they kill two birds with one computational stone. Another solution that can accelerate certain types of floating-point work is the digital signal processing (DSP) engine. DSPs are very good at real-world computational problems with a high proportion of multiply-accumulate and matrix operations. Here is where some accelerator boards are stronger than others. The Broadcom Stingray has only a cryptographic engine designed to handle single-pass hashing and encryption/decryption (both scalar tasks); it lacks any added acceleration for floating-point math. Mellanox’s BlueField chip also doesn’t include any silicon specifically dedicated to floating point; what Mellanox does promote is that BlueField supports GPUDirect, so the processor can communicate directly with GPUs on another PCIe card. NXP’s Layerscape has only ARM cores, so no additional floating-point support is provided. By contrast, Xilinx’s Versal architecture includes anywhere from 472 to 3,984 DSP engines, depending on the chip series and model.
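To make the DSP niche concrete, here is a minimal C sketch of my own (not vendor code) of the multiply-accumulate (MAC) pattern those engines are built around: a dot product, the kernel at the heart of filters and matrix math. A DSP slice performs each multiply and accumulate as a single hardware operation; the plain C loop below just shows the math.

```c
/* Minimal illustration of the multiply-accumulate (MAC) pattern that
 * DSP engines accelerate: a dot product. Each loop iteration does one
 * multiply and one add; a DSP slice does both in a single operation. */
#include <stdio.h>

static float dot_product(const float *x, const float *w, int n)
{
    float acc = 0.0f;
    for (int i = 0; i < n; i++)
        acc += x[i] * w[i];   /* one multiply-accumulate per element */
    return acc;
}

int main(void)
{
    float x[4] = {1.0f, 2.0f, 3.0f, 4.0f};
    float w[4] = {0.5f, 0.25f, 0.125f, 0.0625f};
    printf("dot = %f\n", dot_product(x, w, 4));
    return 0;
}
```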

Artificial Intelligence (AI) workloads leverage vector processing, but instead of high-precision floating point they require only low-precision or integer numbers. Again, Broadcom, Mellanox, and NXP all fall short, as they don’t include any silicon to process these workloads directly. Mellanox, as mentioned earlier, does support GPUDirect for passing AI workloads to another PCIe board, but that’s a far cry from dedicated on-chip silicon. Xilinx’s Versal architecture includes anywhere from 128 to 400 AI engines for accelerating these workloads.
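For comparison with the DSP sketch above, an AI engine spends its time on the same multiply-accumulate pattern, only on low-precision 8-bit integers with a wider accumulator. A minimal sketch, again my own illustration rather than vendor code:

```c
/* The AI-workload version of the dot product: 8-bit integer inputs
 * with a 32-bit accumulator so intermediate sums don't overflow. */
#include <stdint.h>
#include <stdio.h>

static int32_t dot_i8(const int8_t *x, const int8_t *w, int n)
{
    int32_t acc = 0;                       /* wide accumulator */
    for (int i = 0; i < n; i++)
        acc += (int32_t)x[i] * (int32_t)w[i];
    return acc;
}

int main(void)
{
    int8_t x[4] = {12, -7, 3, 100};
    int8_t w[4] = {2, 5, -1, 1};
    printf("int8 dot product = %d\n", dot_i8(x, w, 4));
    return 0;
}
```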

Finally, the most significant differentiator is the inclusion of FPGA logic, also known as adaptable engines, something unique to the Xilinx accelerator cards. Adaptable engines make it possible to take frequently called routines written in C/C++ and port them over to dedicated logic, which can improve a routine’s performance by at least 8X.
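As a hedged illustration of the kind of routine that gets ported, here is a small C function in the style accepted by high-level synthesis (HLS) tools. The pragma follows Xilinx HLS conventions but should be treated as illustrative; a normal C compiler simply ignores it, while an HLS flow would turn the loop into a hardware pipeline in the FPGA fabric.

```c
/* A small, frequently called routine of the sort that HLS tools can
 * convert into dedicated FPGA logic. The pragma (Xilinx HLS style,
 * shown for illustration) asks for a pipeline producing one result per
 * clock; a regular C compiler ignores it and the code still runs. */
#include <stdio.h>

void saxpy(const float *x, const float *y, float *out, float a, int n)
{
    for (int i = 0; i < n; i++) {
#pragma HLS PIPELINE II=1
        out[i] = a * x[i] + y[i];
    }
}

int main(void)
{
    float x[4] = {1, 2, 3, 4}, y[4] = {10, 20, 30, 40}, out[4];
    saxpy(x, y, out, 2.0f, 4);
    for (int i = 0; i < 4; i++)
        printf("%.1f ", out[i]);
    printf("\n");
    return 0;
}
```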

In the case of Xilinx’s new Versal architecture, the senior officer is an ARM Cortex-R5 for real-time workloads. The junior officer, the one who does much of the work, is a dual-core ARM Cortex-A72. These two ARM engines are primarily for control plane functions. Versal then adds AI engines, DSP engines, and adaptable engines (FPGA logic) to accelerate the volumetric workloads. When it comes to application acceleration in hardware, the Xilinx Versal is the A-Team!

x86 Has Hit the Wall, and Now Come the Accelerators – Part 2

Before we return to accelerators as a solution, we need to make a pit stop and explore the how behind the why. The why is simple; we buy a product or service to solve a problem. We intellectually evaluate stories and experiences, distill out the solutions that apply, then affix those to tangible objects or services we can acquire. Rarely does someone buy an iPad just to own an iPad; they have a specific use case in mind as the justification for the expense. The same holds for servers and accelerator cards. At this point in our technological evolution, though, the how remains a mystery for most and needs some explanation.

When a technician visits your home to fix a broken appliance, they don’t just walk in with a lone flat-bladed screwdriver. They carry a fairly large toolbox that was explicitly assembled for repairing appliances. The contents of that toolbox are different from those of a carpenter’s or an automotive mechanic’s. While all three might have a screwdriver, only the carpenter would have a wood chisel, and only the mechanic a torque wrench. Different problems demand different tools. For the past several decades, many of us have viewed the x86 architecture as the computational tool to solve ALL our information processing issues. Guess what: a great many things don’t map well onto the x86 model, but if you throw enough clock cycles and CPU cores at most problems, a solution will eventually be reached.

The High-Performance Computing (HPC) market realized this many years ago, so it built heterogeneous computing environments with schedulers for each type of problem. It classified problems into scalar, floating-point, and vector. Since then we’ve added a fourth class, Artificial Intelligence (AI), also known as Machine Learning (ML). Scalar problems are the ones that deal with integers (numbers without a decimal point), which is also how we often represent text. So, for example, a database lookup of your name to fetch your address is entirely a scalar problem. Next we have floating point, or calculations with a decimal point, the real numbers. These require different computational routines, and as early as the start of the 1980s we introduced special numerical co-processors (early accelerators, such as the Intel 8087) into our PCs to handle this specific class of problems. Today we can farm this class of problems out to Graphics Processing Units (GPUs), as they have many parallel cores explicitly designed for this purpose.
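To ground those two classes, here is a tiny C sketch of my own (the names and data are invented for illustration): the first loop is the “database lookup,” pure scalar integer and text work, while the last line is the kind of real-number math that the 8087, and today a GPU, exists to accelerate.

```c
/* Scalar vs. floating-point workloads in miniature. */
#include <math.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    /* Scalar workload: integer/text comparison (a toy "lookup"). */
    const char *names[3]     = {"Ada", "Grace", "Linus"};
    const char *addresses[3] = {"10 Downing St", "1 Infinite Loop", "221B Baker St"};
    for (int i = 0; i < 3; i++)
        if (strcmp(names[i], "Grace") == 0)
            printf("address: %s\n", addresses[i]);

    /* Floating-point workload: real-number math. */
    double principal = 1000.0, rate = 0.05;
    printf("value after 10 years: %.2f\n", principal * pow(1.0 + rate, 10));
    return 0;
}
```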

Then there’s the mysterious class called vector computing. A vector is a one-dimensional array of numbers. Some might argue that vectors are just a special case of floating-point problems, and they are, but their treatment at the processing level sets them far apart. Consider the Pythagorean theorem: solving for C when you know A and B requires not only a floating-point processor but many steps to arrive at the value of C. For illustration, let’s say it takes ten CPU instructions to arrive at a value for C; it’s probably more. Now imagine you have a set of 256 values for A and a corresponding set of 256 values for B; computing them one at a time would take 2,560 instructions to produce the complete solution set C. A vector processor instead loads whole blocks of A and B values into wide registers at once, squares them in one instruction, sums them in another, square-roots the result in a third, and then writes out the solution set C, a handful of instructions instead of 2,560. Problems like weather forecasting map extremely well onto the vector processing model.
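Here is a minimal C sketch of that contrast, my own illustration using x86 AVX intrinsics simply because they are easy to run; a dedicated vector engine applies the same principle with wider registers. The scalar loop touches one element per iteration, while each AVX instruction operates on eight floats at once.

```c
/* Scalar vs. vector versions of the Pythagorean example above.
 * Compile with: gcc -mavx pythag.c -lm */
#include <immintrin.h>
#include <math.h>
#include <stdio.h>

#define N 256

int main(void)
{
    float A[N], B[N], C_scalar[N], C_vector[N];
    for (int i = 0; i < N; i++) { A[i] = i + 1.0f; B[i] = i + 2.0f; }

    /* Scalar: one element per pass through the loop body. */
    for (int i = 0; i < N; i++)
        C_scalar[i] = sqrtf(A[i] * A[i] + B[i] * B[i]);

    /* Vector: eight elements per pass; each AVX instruction works on a
     * whole 256-bit register (8 floats) at a time. */
    for (int i = 0; i < N; i += 8) {
        __m256 a   = _mm256_loadu_ps(&A[i]);
        __m256 b   = _mm256_loadu_ps(&B[i]);
        __m256 sum = _mm256_add_ps(_mm256_mul_ps(a, a), _mm256_mul_ps(b, b));
        _mm256_storeu_ps(&C_vector[i], _mm256_sqrt_ps(sum));
    }

    printf("C[0] = %f (scalar) vs %f (vector)\n", C_scalar[0], C_vector[0]);
    return 0;
}
```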

Finally, there is the fourth, relatively new, class of problems that fall into the realm of AI or ML. Here the math is vector based, a mix of both integer (scalar) and real numbers, but with intentionally low precision. The computed value doesn’t always need to be perfect, just close enough. Much like when you do your taxes and leave off the change in your calculations: the IRS is okay with whole numbers because they’re good enough. Your self-driving car can drift an inch or so in any direction, and it won’t make any difference, as it will still be more accurate than your Grandma Nat behind the wheel.
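A minimal sketch of the “close enough” idea, assuming a simple symmetric scale (my own illustration, not any particular framework’s scheme): 32-bit floating-point weights are mapped onto 8-bit integers, and the round trip shows how little precision is lost.

```c
/* Quantize a few float32 "weights" to int8 and back again. */
#include <math.h>
#include <stdio.h>

int main(void)
{
    const float weights[4] = {0.42f, -1.87f, 0.03f, 1.25f};

    /* Find the largest magnitude so the range maps onto [-127, 127]. */
    float max_abs = 0.0f;
    for (int i = 0; i < 4; i++)
        if (fabsf(weights[i]) > max_abs) max_abs = fabsf(weights[i]);
    float scale = max_abs / 127.0f;

    for (int i = 0; i < 4; i++) {
        signed char q = (signed char)lrintf(weights[i] / scale); /* quantize */
        printf("%+.4f -> %4d -> %+.4f\n", weights[i], q, q * scale);
    }
    return 0;
}
```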

So now, back to the problem at hand: how do we accelerate today’s complicated workloads? For the past three decades, we’ve been taking a scalar platform, the x86 processor with floating-point capabilities, and using it like a double-ended screwdriver, with both a flat and a Phillips head, to address every problem we have. How do we move forward?

Stay tuned for part three, where we cover hardware acceleration platforms.

7nm, Miniaturization to Integration

Last night while channel surfing I came across Men in Black III and was dropped right into the scene where a 1969 Tommy Lee Jones places Will Smith into the Neuralyzer pictured on the left. For those not familiar with the original 1997 MiB franchise, a Neuralyzer is a cigar-sized plot device, normally carried in a jacket pocket, for wiping people’s memories of an alien encounter. The writers were clearly poking fun at miniaturization and how much humanity has come to take it for granted.

Those of us who grew up in the 1960s and 70s lived through the miniaturization wave as the Japanese led the industry in shrinking radios and televisions from cabinet-sized living room appliances to handheld devices. One year for Father’s Day in the late 70s we bought my dad a portable, battery-powered black-and-white TV with a radio so he could watch it on the boat in the evenings. It was roughly the size of three laptops stacked on top of one another. It may sound corny now, but it was amazing back then. Today we watch theater-quality movies, in color, on a much larger screen, from a device that drops into our pocket, and we don’t think twice about it. We’ve grown accustomed to technology improving at a rapid rate; it’s now expected. But what happens when that rate is no longer sustainable?

Last year the industry began etching chips with a new seven-nanometer process, which is roughly equivalent to Intel’s 10nm process. Apple’s A12 Bionic chip, which powers the iPhone XR and XS series, is one of the first built on this new 7nm process. It contains 6.9 billion transistors and is arguably one of the most advanced devices ever produced by mankind. By contrast, my first computer in 1983 was a TRS-80 Model III powered by the Zilog Z80 processor. The Z80 used a 4,000nm process and contained only 8,500 transistors. So in 35 years we’ve reduced the process size by nearly three orders of magnitude (4,000nm ÷ 7nm ≈ 570) and increased the transistor count by roughly six orders of magnitude (6.9 billion ÷ 8,500 ≈ 800,000), wow! How do we top that, and where are we in the grand scheme of the physics of miniaturization?

In a 1965 paper, Gordon Moore, then at Fairchild Semiconductor (which he co-founded) and later CEO of Intel, observed that the density of integrated circuits would double every year, a prediction now known as Moore’s Law. From 1970 through 2014 this “law” essentially held true. Before Intel’s current 10nm geometry, their prior generation was 14nm, achieved in 2014, so it has taken them five years to accomplish 10nm. Not exactly Moore’s Law, but that’s just the tip of the iceberg. As the industry goes from 14nm to 7nm/10nm, physics is once again throwing up a roadblock; this isn’t the first one, but it could be the last. Chips are made from silicon, and silicon atoms have a diameter of about 0.2 nanometers. So at a seven-nanometer node size we’re talking about 35 or so silicon atoms across, which isn’t a very large number. It turns out that below seven nanometers, as we have fewer and fewer silicon atoms with which to manage electron flow, things get dicey. Chips begin to experience quantum effects; most notably, those pesky electrons, roughly a millionth of a nanometer in size, begin to exhibit something called quantum tunneling. They no longer behave the way they are supposed to and move between devices etched into the silicon with a sort of reckless disregard for the “normal” rules of physics. This has been known for some time, though.

Back in 2016 a team at Lawrence Berkeley National Laboratory demonstrated a one-nanometer transistor, but it leveraged a carbon nanotube to manage electron flow and stave off the quantum tunneling effect. For those not familiar with carbon nanotubes, think of teeny tiny straws rolled from graphene, where the wall of the straw is a single atom thick. While using carbon nanotubes to solve the problem is ingenious, it doesn’t fit into how we make chips today; you can’t etch a carbon nanotube using conventional chip fabrication processes. So while it’s a solution to the problem, it’s not one that can easily be put into production, and we may be working at 7nm for some time to come. That only means one aspect of miniaturization has ground to a halt. While I’ve used the term chip above to refer to an integrated circuit, the more precise term is actually a “die.”

Until recently it was common practice to place a single die inside a package. A package is what most of us think of as the chip; it’s the part with the metal pins coming out of the bottom or sides. In recent years the industry has developed new techniques that allow us to layer multiple dies onto one another within the same physical package, enabling the creation of very complex chips. It’s similar to a seven-layer cake, where a different type of cake can go in each layer and the icing carries flavors across the layers. This means that a chip can contain several, and eventually many, dies, or layers. A recent example of this is Xilinx’s new Versal chip line.

Within the Versal chip package there are multiple dies that contain two different pairs of ARM CPU cores, hundreds of Artificial Intelligence (AI) engines, thousands of Digital Signal Processing (DSP) engines, a huge Field Programmable Gate Array (FPGA) region, several classes of memory, and multiple programmable memory, PCIe, and Ethernet controllers. The Versal platform is a flexible toolbox of computational power: the ARM cores handle traditional CPU and real-time processing tasks, the AI cores churn through new machine learning workloads, the DSPs are leveraged for advanced signal processing (think 5G), and the FPGA serves as the versatile computational glue that pulls all these complex engines together. Finally, the memory, PCIe, and Ethernet controllers interface with the real world. So while Intel and AMD focus on scaling the number of CPU cores on a chip, and NVIDIA works to improve Graphics Processing Unit (GPU) density, Xilinx is the first to go all-in on chip-level workload integration. This is the key to accelerating the data center going forward.

So until we solve the quantum tunneling problem with new fabrication techniques, we can use advances in integration, as shown above, to move the industry forward.

Will AI Restore HFT Profitability?

Several years ago Michael Lewis wrote “Flash Boys: A Wall Street Revolt” and created a populist backlash that rocked the world of High-Frequency Trading (HFT). Lewis had publicly pulled back the curtain and shared his perspective on how our financial markets work. Right or wrong, he sensationalized an industry that most people don’t understand, and yet one in which nearly all of us have invested our life savings.

Several months after Flash Boys, Peter Kovac, a trader with EWT, authored “Flash Boys: Not So Fast,” in which he refuted a number of the sensationalist claims in Lewis’s book. While I’ve not yet read Kovac’s book (it’s behind “Dark Pools: The Rise of Machine Traders and the Rigging of the U.S. Stock Market” in my reading list), from the reviews it appears he did an excellent job of covering the other side of the argument.

All that aside, earlier today I came across an article by Joe Parsons in “The Trade” titled “HFT: Not so flashy anymore.” Parsons makes several very salient points:

  • Making money in HFT today is no longer as easy as it once was. As a result, big players are rapidly gobbling up smaller ones; for example, in April Virtu picked up KCG, and Quantlab recently acquired Teza Technologies.
  • Speed is no longer the primary differentiator between HFT shops. As someone in the speed business, I can tell you that today a one-microsecond network half round trip in trading is table stakes, and some shops are deploying leading-edge technologies that can reduce this to under 200 nanoseconds. Speed as an HFT shop’s only value proposition hasn’t been the case for a rather long time; this is one of the key assertions made five years ago in the “Dark Pools” book mentioned above.
  • MiFID II is having a greater impact on HFT than most people realize. It brings with it new and more stringent regulations governing dark pools and block trades.
  • Perhaps the most important point Parsons makes is that the shift today is away from arbitrage as a primary revenue source and towards analytics and artificial intelligence (AI). He downplays this point by offering it up as a possible future.

AI is the future of HFT as shops look for new and clever ways to push intelligence as close as possible to the exchanges. Companies like Xilinx, the leader in FPGAs, NVIDIA with their P40 GPU, and Google, yes Google, with their Tensor Processing Unit (TPU) will define the hardware roadmap going forward. How this will all unfold is still to be determined, but it will be one wild ride.