Data Types, Computation & Electronic Trading

I’ll return to the “Expanded Role of HPC Accelerators” in my next post, but before doing so, we need to take a step back and look at how data is stored to understand how best to operate on and accelerate that data. When you look under the hood of your car, you’ll find that there are at least six different types of mechanical fasteners holding things together.

Artificial Intelligence and Electronic Trading

Hoses often require a slotted screwdriver to remove, while others need a Phillips head. Some panels require a star-like Torx wrench while others a hexagonal Allen key, but the most common ones you’ll find are hexagonal-headed bolts and nuts in both English and Metric sizes. So why can’t the engineers who design these products select a single fastener style? Simple: each problem has unique characteristics, and these engineers are choosing the most appropriate solution for each. The same is true for data types within a computer. Ask almost anyone, and they’ll say that data is stored in a computer as bytes, just like your engine has fasteners; both answers are correct but gloss over the variety underneath.

Computers process data in many end-user formats, from simple text and numbers to sounds, images, video, and much more. Ultimately, it all becomes bits organized, managed, and stored as bytes. We then wrap abstraction layers around these collections of bytes as we shape them into numbers that our computer can process. Some numbers are whole, called “integers,” meaning they have no fractional component, for example, one, two, and three. Other “real” numbers contain a decimal point and can have anywhere from one to an infinite number of digits to the right of it. Computers process data using both of these numerical formats.

Textual data is relatively easy to grasp as it is often stored as an unsigned integer, one without a positive or negative sign. For example, the letter “A” is stored using the ASCII standard, which assigns it the value 65, or “01000001” as a single byte. In computer science, how a number is represented can be just as important as the value of the number itself. One can store the number three as an integer, but when it is divided by the integer value two, some people would say the result is 1.5, and they could be wrong. Depending on the division operator used (some languages support several) and the data type assigned to the result, there could be several different answers, all of which would be correct within their context; the short sketch after the list below makes this concrete. As you move from integers to real numbers or, more specifically, floating-point numbers, these numerical representations expand considerably based on the degree of precision required by the problem at hand. Today, some of the more common numerical data types are:

  • Integers, which often take a single byte of storage and can be signed or unsigned. Integers come in at least seven distinct lengths: a nibble (four bits), a byte (eight bits), a half-word (sixteen bits), a word (thirty-two bits), a double word (sixty-four bits), an octaword (one hundred twenty-eight bits), and an arbitrary n-bit value.
  • Half Precision floating-point, also known as FP16, expresses the real number in 16 bits of storage: ten bits for the fractional component, five bits for the exponent, and a single sign bit.
  • Single Precision, also known as floating-point 32 or FP32, uses 32 bits: 23 for the fractional component, eight for the exponent, and one for the sign.
  • Double Precision, also known as floating-point 64 or FP64, uses 64 bits: 52 for the fractional component, 11 for the exponent, and one for the sign.
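
To make the division example and these bit layouts concrete, here is a minimal Python sketch (assuming Python 3 with NumPy installed; the values chosen are illustrative only) showing how the same two integers divide differently depending on the operator, and how a single-precision value breaks down into its sign, exponent, and fraction fields:

```python
import struct
import numpy as np

# The same two values divided with different operators and data types.
print(3 // 2)   # 1   -- integer (floor) division discards the fraction
print(3 / 2)    # 1.5 -- true division promotes the result to floating point

# The same real number stored at three precisions occupies 2, 4, and 8 bytes.
for dtype in (np.float16, np.float32, np.float64):
    print(dtype.__name__, np.dtype(dtype).itemsize, "bytes")

# Unpack an FP32 into its sign (1 bit), exponent (8 bits), and fraction (23 bits).
bits = struct.unpack(">I", struct.pack(">f", 12.34))[0]
sign = bits >> 31
exponent = (bits >> 23) & 0xFF
fraction = bits & 0x7FFFFF
print(f"sign={sign} exponent={exponent} fraction={fraction:023b}")
```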

There are other types, but these are the major ones. How computational units process data differs broadly based on the architecture of the processing unit. When we talk about processing units, we’re specifically talking about Central Processing Units (CPUs), Graphics Processing Units (GPUs), Digital Signal Processors (DSPs), and Field Programmable Gate Arrays (FPGAs). CPUs are the most general. They can process all of the above data types and much more; they are the BMW X5 sport utility vehicle of computation. They’ll get you pretty much anywhere you want to go, they’ll do it in style, and they’ll provide good performance. GPUs are the tricked-out Jeep Wrangler Rubicon, designed to crawl over pretty much anything you can throw at it while still carrying a family of four. DSPs, on the other hand, are dirt bikes: they’ll get one person over rough terrain faster than anything else available, but they’re not going to shine when they hit the asphalt. Finally, we have FPGAs, the Dodge Challenger SRT Demon of the pack; they never leave the asphalt, they just leave everything else on it behind. All that to say that GPUs and DSPs are designed to operate on floating-point data, while FPGAs do much better with integer data. So why does this matter?

Not every problem is suited to a slotted fastener; sometimes you’ll need a Torx, other times a hexagonal-headed bolt. The same is true in computer science. If your system’s primary application is weather forecasting, which is floating-point intensive, you might want vast quantities of GPUs. Conversely, if you’re doing genetic sequencing, where the data is entirely integer-based, you’ll find that FPGAs may outperform GPUs, delivering up to a 100X performance-per-watt advantage. Certain aspects of Artificial Intelligence (AI) have benefited more from FP16-based calculations than from FP32 or FP64. In this case, DSPs may outshine GPUs in calculations per watt. As AI expands into key computational markets, we’ll see more and more DSP applications; one of these will be electronic trading.

Today, cutting-edge electronic trading platforms utilize FPGA boards with many built-in ultra-high-performance networking ports. These boards, in some cases, have upwards of 16 external physical network connections. Trading data and orders into markets are sent via network messages, which are entirely character-based and hence integer-based. These FPGAs contain code blocks that rapidly process this integer data, but things slow down considerably when some of these integers must be converted to floating-point for computation. For example, some messages use a twelve-character format for the price, where the first six digits are the whole number and the last six digits represent the decimal fraction. So a price of $12.34 would be represented as the character string “000012340000.” Other fields also use twelve-character values for a number, but the first ten digits are the whole number and the last two the decimal value. In this case, 12,572.75 shares of a stock would be represented as “000001257275.” Now, of course, doing financial computations while maintaining the price or quantity as characters is possible, but it would be far more efficient if each were recast as a single-precision (FP32) number; then computations could be processed rapidly. Here’s where a new blend of FPGA processing, to rapidly move character data around, and DSP computing, to handle financial calculations using single-precision math, will shine. Furthermore, DSP engines are an ideal platform for executing trained AI-based algorithms that will drive financial trading going forward.
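
To ground those formats, here is a minimal Python sketch (the field widths come straight from the examples in this post, not from any real exchange specification, and the function names are my own) that recasts the twelve-character fields as single-precision values so the arithmetic can be done in FP32:

```python
import numpy as np

def parse_price(field: str) -> np.float32:
    """Price: first six characters are whole dollars, last six the fraction."""
    return np.float32(int(field[:6]) + int(field[6:]) / 1_000_000)

def parse_quantity(field: str) -> np.float32:
    """Quantity: first ten characters are whole shares, last two the fraction."""
    return np.float32(int(field[:10]) + int(field[10:]) / 100)

price = parse_price("000012340000")       # -> 12.34
shares = parse_quantity("000001257275")   # -> 12572.75
notional = price * shares                 # arithmetic stays in FP32
print(price, shares, notional)
```

On the kind of hardware described above, the character-to-FP32 recast would live in FPGA logic and the multiply in a DSP engine; the Python here only illustrates the arithmetic involved.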

Someday soon, we’ll see trading platforms that execute entirely on a single high-performance chip. This chip will contain a blend of large blocks of adaptable FPGA logic (we’re talking millions of logic tables), along with thousands of DSP engines and dozens of high-performance network connections. This will enable intelligent trading decisions to be made and orders generated in tens of billionths of a second!

The Expanded Role of HPC Accelerators, Part 1

When Lisa Su, CEO of AMD, presented a keynote talk at CES 2021 last week, she reminded me of a crucial aspect of High-Performance Computing (HPC) that often goes unnoticed: HPC is at the center of many computing innovations. Since SC04, I’ve attended the US Supercomputing conference (SC) in November pretty much religiously, every other year or so. SC is where everyone in technology hauls out their best and brightest ideas and technologies, and I’ve been thrilled over the years to be a part of it with NEC, Myricom, and Solarflare. Some of the most intelligent people I’ve ever met or had the pleasure of working with, I first met at an SC event or dinner. SC, though, is continuously changing; just today, I posted a reference to the Cerebras CS-1, which uses a single chip measuring 8.5″ on each side to achieve performance 200X faster than #466 on the Top500.org list of supercomputers. HPC is entering its fourth wave of innovation.

Cerebras CS-1 Single System SuperComputer Data Center in a Box

The first was defined by Seymour Cray in the early 1970s when he brought out the vector-based supercomputer. The second wave was the clustering of Linux computers, which started to become a dominant force in HPC in the late 1990s. When this began, Intel systems were all single-core, with some supporting multiple CPU sockets. The “free” Linux operating system and low-cost Gigabit Ethernet (GbE) were the catalysts that enabled universities to quickly and easily cobble together significantly robust systems. Simultaneously, the open development of the Message Passing Interface (MPI) made it much easier to port existing first-wave HPC applications over to clustered Linux systems without having to program directly against TCP/IP. This second wave brought about advancements in HPC networking and storage that further defined HPC as a unique market. Today we’re at the tail end of the third wave of innovation, driven by the Graphics Processing Unit (GPU). Some would say the dominant HPC brand today is NVIDIA because they’ve pushed the GPU envelope further and faster than anyone else, and they own Mellanox, the InfiniBand networking guys. Today, our focus is the expanding role of accelerators beyond GPUs in HPC, as they will define this new fourth wave of innovation.

Last week I thought this fourth wave would be defined by a whole new mode where all HPC computations are pushed off to special-purpose accelerators. These accelerators would then leverage the latest advances of the PCI Express bus, new protocols for this bus, and the addition of Non-Volatile Memory Express (NVMe) for storage. The fourth and, soon, fifth generations of the PCIe bus have provided dramatic speed improvements and support for two new protocols (CXL and CCIX). Then along came the Cerebras CS-1, utilizing an 8.5″-square chip package that holds a single gargantuan chip with over a trillion transistors. While I think Cerebras may stand alone for some time with this single-chip approach, it won’t be long before AMD considers pouring hundreds of Zen3 chiplets into a package with an Infinity Fabric that is MUCH larger than anything previously utilized. Imagine a single package that rivals Cerebras at 8.5″ square, with hundreds of Zen3 chiplets (each containing eight x86 cores sharing a common L3 cache), a large number of High Bandwidth Memory (HBM) chiplets, some FPGA chiplets contributed by Xilinx, Machine Learning (ML) chiplets from Xilinx’s latest Versal family, plus chiplets for encryption and 100GbE or faster networking. Talk about a system on a chip; this would be an HPC super in a single package, rivaling many multi-rack systems on the Top500.org list.

More to come in Part II, where I’ll explain in more detail what I’d been thinking about regarding accelerators.

Performance vs. Perception

Tesla Model 3 Performance Edition and Polaris Slingshot S

Our technology-focused world has jaded us, causing us to blur the line between performance and perception. It is often easier than you might think to take factual data and conflate it with how something feels, or our perception of what we experience. Performance is factual. A 2020 Tesla Model 3 Performance Edition accelerates from 0-60 MPH in 3.5 seconds. In comparison, a 2019 Polaris Slingshot Model S accelerates from 0-60 MPH in 5.5 seconds. On paper, the Slingshot takes 57% longer to reach 60 MPH than the Tesla Model 3 Performance Edition; the data can easily be looked up and confirmed; these are empirical facts. They’re also somewhat easy to verify in a large parking lot or on a back road. The Slingshot accelerates over two seconds slower than the Tesla, both pictured to the right, but if you were to sit in the passenger’s seat of each during the test, without a stopwatch, I’d bet serious money you’d say the Slingshot was faster.

Perception is a funny thing: all our senses are in play, and they help formulate our opinion of what we’ve experienced. The Tesla Model 3 Performance Edition is a superb vehicle; my cousin has one, and he can’t gush enough about how amazing it is. When you sit in the passenger’s seat, you experience comfortable black leather, plush cushioning, and 12-way adjustable heated seats. The cabin is climate-controlled, sound-dampened, and trimmed in wood accents with plenty of glass and steel, making you feel safe, secure, and comfortable. Accelerating from 0-60 MPH is a simple task; the driver stomps on the accelerator, and the car does all the work, shifting as it needs to, with little engine noise. The 3.5 seconds fly by as the vehicle blows past 60 MPH like a rocket sled on rails. So how can a three-wheeled ride like the Slingshot ever compare?

If you’ve not experienced the Slingshot, it’s something entirely different; it engages all your senses, much like riding a motorcycle. There are only three wheels, two seats, no doors, and even the windshield and roof are optional. The standard passenger’s seat has one position, all the way back, and it isn’t heated. The seat is made from rigid foam, covered in all-weather vinyl, with luxury and comfort not being design considerations. Did I mention there are no doors? The cabin is open to the world; you see, hear, and smell everything. There’s no wood trim, and the climate is the climate: no heat or A/C. With less than six inches of ground clearance, your bottom is 10″ off the surface of the road, and an average-height person can easily reach down and touch the road, although I wouldn’t recommend it while moving.

The Slingshot assaults each and every sense as it shapes your perception of accelerating from 0-60 MPH. The driver shifts from first through second and into third, slipping and chirping the back wheel with each of the three transitions from a standing start. The roar of the engine fills your ears as you’re shoved back into the hard seat; you grab for the roll bar, the knuckles of your right hand turn white, and you catch a whiff of the clutch and feel the air blow back your hair. It is an incredibly visceral experience, all while your smile grows into an ear-to-ear grin. Those 5.5 seconds could even be six, given the traction lost as you chirped the rear wheel, but it wouldn’t matter; your passenger would swear on a bible it was three seconds. How is this possible?

How could someone who’s been a passenger in both vehicles ever think the Slingshot was faster? Simple: perception. While the Tesla engages your eyes and your ability to sense acceleration, that’s it. The Tesla was designed to shield you from the outside world while wrapping you in comfort, and they’ve done a fantastic job. Conversely, the Slingshot is all about exposing every nerve and sense in your body to the world around you. As you go around some turns, you might feel the slightest amount of drift as the lone rear tire gently swings toward the outside of the turn.

The above example goes to show that feelings can sometimes overcome facts. We live in a technological world where facts, like hard performance data, can easily be misconstrued by our emotions and perceptions. We all need to keep this in mind as we evaluate new solutions.

P.S. Yes, for those who’ve asked, the 2019 Slingshot S pictured above has been my project car for the past two months. The guy I bought it from in early November had purchased it new in May and had installed the roll bars and canvas top. Since then, I’ve made a dozen other improvements, from adding a full-height windshield to a 500-watt Bluetooth amplifier with a pair of 6.5″ Kicker speakers (it didn’t come with a radio).