Data Types, Computation & Electronic Trading

I’ll return to the “Expanded Role of HPC Accelerators” in my next post, but before doing so, we need to take a step back and look at how data is stored to understand how best to operate on and accelerate that data. When you look under the hood of your car, you’ll find that there are at least six different types of mechanical fasteners holding things together.

Artificial Intelligence and Electronic Trading

Hoses often require a slotted screwdriver to remove, while others a Phillips’s head. Some panels require a star-like Torx wrench while others a hexagonal Allen, but the most common one you’ll find are hexagonal headed bolts and nuts in both English and Metric sizes. So why can’t the engineers who design these products select a single fastener style? Simple, each problem has unique characteristics, and these engineers are often choosing the most appropriate solution. The same is true for data types within a computer. Ask almost anyone, and they’ll say that data is stored in a computer in bytes, just like your engine has fasteners. 

Computers process data in many end-user formats, from simple text and numbers to sounds, images, video, and much more. Ultimately, it all becomes bits organized, managed, and stored as bytes. We then wrap abstraction layers around these collections of bytes as we shape them into numbers that our computer can then process. We know that some numbers are “natural,” they’re also called “integers,” meaning they have no fractional component, for example, one, two, and three. Simultaneously, other “real” numbers contain a decimal point that can have anywhere from one to an infinite collection of digits to the decimal’s right. Computers process data using both of these numerical formats.     

Textual data is relatively easy to grasp as it is often stored as an integer without a positive or negative sign. For example, the letter “A” is stored using a standard that has assigned it the value 65, which as a single byte is “01000001.” In Computer Science, how a number is represented can be just as important as the number itself’s value. One can store the number three as an integer, but when it is divided by the integer value two, some people would say the result is 1.5, but they could be wrong. Depending on the division operator used, some languages support several, and the data type assigned to the product there could be several different answers, all of which would be correct within their context. As you move from integer to real numbers or, more specifically, floating-point numbers, these numerical representations expand considerably based on the degree of precision required by the problem at hand. Today, some of the more common numerical data types are:

  • Integers, which often take a single byte of storage, and can be signed or unsigned. Integers can come in at least seven different distinct lengths from a nibble stored as four bits, a byte (eight bits), a half-word (sixteen bits), word (thirty-two bits), double word (sixty-four bits), octaword (one hundred and twenty-eight bits), and an n-bit value.
  • Half Precision floating-point, also known as FP16 is where the real number is expressed in 16 bits of storage. The numerical value is stored in ten bits; then there are five bits for the exponent and a single sign bit.
  • Single Precision, also known as floating-point 32 or FP32, for 32 bits. Here we have 23 bits for the fractional component, eight for the exponent, and one for the sign.
  • Double Precision, also known as floating-point 64 or FP64, for 64 bits. Finally, we have 52 bits for the fractional component, 11 for the exponent, and one for the sign.

There are other types, but these are the major ones. How computational units process data differs broadly based on the architecture of the processing unit. When we talk about processing units, we’re specifically talking about the Central Processing Unit (CPUs), Graphical Processing Unit (GPUs), Digital Signal Processor (DSPs), and Field Programmable Gate Array (FPGAs). CPUs are the most general. They can process all of the above data types and much more; they are the BMW X5 sport utility vehicle of computation. They’ll get you pretty much anywhere you want; they’ll do it in style and provide good performance. GPUs are that tricked out Jeep Wrangler Rubicon designed to crawl over pretty much anything you can throw at it while still carrying a family of four. On the other hand, DSPs are dirt bikes, they’ll get one person over rough terrain faster than anything available, but they’re not going to shine when they hit the asphalt. Finally, we have FPGAs, and they’re the Dodge Challenger SRT Demon in the pack; they never leave the asphalt; they just leave everything else on it behind. All that to say that GPUs and DSPs are designed to operate on floating-point data, while FPGAs do much better with integer data. So why does this matter?

Every problem is not suited to a slot headed fastener; sometimes you’ll need a Torx, while others a hexagonal headed bolt. The same is true in computer science. If your system’s primary application is weather forecasting, which is floating-point intense, you might want vast quantities of GPUs. Conversely, if you’re doing genetics sequencing, where data is entirely integer-based, you’ll find that FPGAs may outperform GPUs, delivering up to a 100X performance per watt advantage. Certain aspects of Artificial Intelligence (AI) have benefited more from using FP16 based calculations over using FP32 or FP64. In this case, DSPs may outshine GPUs in calculations per watt. As AI emerges in key computational markets moving forward, we’ll see more and more DSPs applications; one of these will be electronic trading.

Today the cutting edge of electronic trading platforms utilize FPGA boards that have many built-in ultra-high-performance networking ports. These boards, in some cases, have upwards of 16 external physical network connections. Trading data and orders into markets are sent via network messages, which are entirely character; hence integer, based. These FPGAs contain code blocks that rapidly process this integer data, but computations slow down considerably as some of these integers are converted to floating-point for computation. For example, some messages use a twelve-character format for the price where the first six digits are the whole number, and the second six digits represent the decimal number. So, a price of $12.34 would be represented as the character string “000012340000.” Other fields also use twelve-character values for a number, but the first ten digits are the whole number, and the last two the decimal value. In this case, 12,572.75 shares of a stock would be represented as “000001257275.” Now, of course, doing financial computations maintaining the price or quantity as characters is possible; it would be far more efficient if each were recast as single-precision (FP32) numbers. Then computation could be rapidly processed. Here’s where a new blend of FPGA processing, to rapidly move around character data, and DSP computing for handling financial calculations using single precision math will shine. Furthermore, DSP engines are an ideal platform for executing trained AI-based algorithms that will drive financial trading moving forward into the future.

Someday soon, we’ll see trading platforms that will execute entirely on a single high-performance chip. This chip will contain a blend of large blocks of adaptable FPGA logic; we’re talking millions of logic tables, along with thousands of DSP engines and dozens of high-performance network connections. This will enable intelligent trading decisions to be made and orders generated in tens of a billionth of a second!  

Analog Time in a Digital World

1880s Self-Winding Clock

Imagine buying a product today that your family would cherish and still be using in 2160? I’m not talking about a piece of quality furniture or artwork, but an analog clock with some parts that are in continuous motion. This past weekend I once again got my grandfather clock, pictured to the right, functioning. This is a pendulum clock initially made for and installed at the NY Stock Exchange, then moved into a law office, and later a private residence. I inherited it some forty years ago, before becoming a teen, and it has operated a few times since it was placed in my care. This timepiece was manufactured in the 1880s, and it is a self-winding, battery-powered unit. Batteries were a very new technology in 1880. This clock shipped with two wet cells that the new owner then had to set up. The instructions called for pouring powered Sulfuric Acid from paper envelopes into each of two glass bottles, adding water, then stirring. The lids of the glass bottles contained the anode and cathode of the cell. The two cells were then wired in series and produced a three-volt battery.

When it was installed at the Exchange, around the time of Thomas Edison, it was modified with the addition of a red button on the side designed to synchronize the clock with the others on the Exchange. The button on this clock, and all others like it at the Exchange, would be pressed on the hour, prior to the opening of the market. It has since been rewired, so the button triggers an out of sync winding. During my childhood, this clock ran for a few years until it fell silent as a result of dead batteries. Over my adult life, it has run continuously several times, often only for a year or two at a stretch until the batteries were depleted. The issue, more often than not, was simply access to replacement batteries. In the 1950s, the batteries this clock required, a pair of dry-cell No. 6, were available in most hardware stores. Many devices were designed to use this No.6 cell, including some of the earliest automobiles, early in the 1900s it was a very popular power source.

We moved recently, and one of my wife’s conditions on hanging my grandfather clock in the living room was that it function. In 2005 after an earlier move from California to North Carolina, we hired a local clock repairman to restore this family timepiece. He cleaned out the old lubrication, replaced a few worn parts, hung the clock, and sold us his last pair of original No. 6 dry-cell batteries. For those not familiar with the No. 6, it was a 1.5-volt battery the size of a large can of beer, but the standout attribute of this battery was that it could provide high instantaneous current for a brief period of time. In the 1990s, No. 6 cells were banned in the US because they used Mercury. The replacements offered had the same size and fit, but couldn’t produce the required instantaneous current.

Last week after some research, and a little math, I realized that four, dual D-cell battery boxes connected via terminal strips to limit current loss, could produce about 30% more instantaneous current at three volts than the original pair of No. 6 cells. So, I glued the boxes together to form a maintainable brick, added two five terminal strips, one positive and one negative, then tested all the wiring and batteries. After rehanging the clock, leveling the case, installing the new battery box, and pressing the wind button, I raised the pendulum and let it go. The escapement rocked back and forth, enabling the secondhand gear to creep ahead one tooth at a time, but after ten minutes, the clock fell silent once again.

Several more attempts, each roughly ten minutes, and the internal switch eventually kicked in, and the batteries did their job. The clock wound automatically for the first time in well over a decade; I was elated. Alas though another few minutes later, the clock came to rest once more. After some additional research into how the clock was losing energy, I came across a few suggestions. The hands were placed correctly, secured, and didn’t touch anything. I shoved my iPhone camera into the side of the case to get a view of the pendulum hanging on the escapement and found that it was hanging a bit askew. I then sprayed a small amount of synthetic lubricant on the escapement, crossed my fingers, and gave the pendulum another nudge. Fifteen minutes later, it coasted to a halt. Tinkering a few more times with the pendulum, over the next few hours, and a bit more lubricant, the clock was eventually sustaining movement. That was Sunday, it’s now Tuesday night, and the clock is going strong and hasn’t stopped since. As of this morning, the clock was losing about three minutes every twenty-four hours.

Now the chase is one to improve the accuracy. This is done by changing the pendulum length through a nut below the pendulum’s bob. If you loosen the nut, it lowers the bob making the pendulum longer, and the clock runs slower. Tighten the nut, and the pendulum is shorter, and the clock runs faster. Perhaps a successive series of turns over the next few days will get this 140-year-old device down to a few seconds a day!

Update Friday, June 12, after some very gentle tightening of the pendulum, thereby shortening its length and speeding up its swing, we’re down from three minutes to 61 seconds a day. There may be a few threads left, so hopefully, we can get this loss down to less than 30 seconds a day.

Update Monday, July 6, the clock is running strong on the original eight D-cells from last month, I’m expecting about a year on each set, and it still sounds strong. Some additional tweaks and we’re now only losing 25 seconds a day or one every hour. So I just need to add a minute every two days, not bad.

Update Sunday, June 27, 2021, the clock has run non-stop for the past year and is still running strong on the initial set of eight Duracell D-cells. At this time the clock is still winding regularly every hour, and it is losing less than 15 seconds/day so I often adjust it every week or so setting it two minutes ahead. After inspecting all the cells for leakage, they look fine, and testing each one I’ve found they are all 1.44V plus or minus 0.001V, which is impressive. The four unused, lets call them control cells, have all maintained a voltage of 1.61V. From my reading it appears that until the voltage drops below 1.3V I should still get good performance out of them. So they will remain in service for another year. What an engineering marvel.

Google, Memcache, and How Solarflare May Have Come Out on Top

This post was originally made in January of 2015, but due to a take-down letter, I received a week later this story has remained unpublished for the last seven years.

January 2015 Take-Down

Memcached was released in May of 2003 by Danga Interactive. This free, open-source software layer provides a general-purpose distributed in-memory cache that can be used by both web & app servers. Five years later, Google released version 1.1.0 of its App Engine, which also included its version of an in-memory cache called Memcache. This capability to store objects in a huge pool of memory spread across a large distributed fabric of servers is integral to the performance inherent in many of Google’s products. Google showed in a presentation earlier this year on Memcache that a query from a typical data requires 60-100 milliseconds, while a similar Memcache query only needed 3-8 milliseconds, a 20X improvement. It’s no wonder Google is a big fan of Memcache. This is why in March of 2013, Google acquired both technology and people from a small networking company called Myricom, who cracked the code to accelerate the network path to Memcache.

To better understand what Google acquired, we need to roll back to 2009 when Myricom had ported its MX communications stack to Ethernet. Mx over Ethernet (MXoE) was originally crafted for the High-Performance Computing (HPC) market, to create a very limited, but extremely low latency, stack for their then two-year-old 10GbE NICs. MXoE was reformulated using UDP instead of MX, and this new low latency driver was then named DBL. It was then engineered to service the High-Frequency Traders (HFT) on Wall Street. Throughout 2010 Myricom had evolved this stack by further adding limited TCP functionality. At that time, half round trip performance (one send plus one receive) over 10GbE was averaging 10-15 microseconds, DBL did it 4 microseconds. Today (2015) by comparison Solarflare, the leader in ultra-low latency network adapters, does this in well under 2 microseconds. So in 2012, Google was searching for a method to apply ultra-low latency networking tricks to a new technology called Memcache. In the fall of 2012 Google invited several Myricom engineers to discuss how DBL might be altered to service Memcache directly. Also they were interested in known if Myricom’s new silicon, due out in early 2013, might also be a good fit for this application. By March of 2013 Google had tendered an offer to Myricom to acquire their latest 10/40GbE chip, which was about to go into production, along with 12 of their PhDs who handled both the hardware & software architecture. Little is publicly known if that chip or the underlying accelerated driver layer for Memcache has ever made it into production at Google.

Fast forward to today (2015), and earlier this week, Solarflare released a new whitepaper highlighting how they’ve accelerated the publicly available free open-source layer called Memcached using their ultra-low latency driver layer called OpenOnload. In this whitepaper, Solarflare demonstrated performance gains of 2-3 times that of a similar Intel 10GbE adapter. Imagine Google’s Memcache farm being only 1/3 the size it is today? We’re talking serious performance gains here. For example, leveraging a 20 core server, the Intel dual 10GbE adapter supported 7.4 million multiple get operations while Solarflare provided 21.9 million, nearly a 200% increase in the number of requests. If we look at mixed throughput (get/set in a ratio of 9:1), Intel delivered 6.3 million operations per second while Solarflare delivered 13.3 million, a 110% gain. That’s throughput, how about latency performance? Using all 20 cores, and batches of 48 get requests Solarflare clocked in at 2,000 microseconds, and Intel at 6,000 microseconds. Across all latency tests, Solarflare reduced network latency by, on average, 69.4% (the lowest reduction was 50%, and the highest 85%). Here is a link to the 10-page Solarflare whitepaper with all the details.

While Google was busy acquiring technology & staff to improve their own Memcache performance, Solarflare delivered it for their customers and documented the performance gains.

Half a Billion Req/Sec!

Can your In-Memory Key-Value Store handle a half-billion requests per second?

Many applications from biological to financial and Web2.0 utilize in-memory databases because of their cutting-edge performance, often delivering several orders of magnitude faster response time than traditional relational databases. When these in-memory databases are moved to their own machines in a multi-tier application environment, they often can serve 10 million requests per second, and that’s turning all the dials to 11 on a high-end dual-processor server. Much of this is due to how applications communicate with the kernel and the network.

Last year, my team used one of these databases, Redis, we then bypassed the kernel, connected up a 100Gbps network, and took that 10 million requests per second to almost 50 million. Earlier this year we began working with Algo-Logic, Dell, and CC Integration to blow way beyond that 50 million target. At RedisConf2020 in May, Algo-Logic announced a 1U Dell server they’ve customized that can service nearly a half-billion requests per second. To process these requests, we’re spreading the load across two AMD EPYC CPUs and three Xilinx FPGAs. All requests are serviced directly from local memory using an in-memory key-value store system. For those requests serviced by the FPGAs, their response time is measured in billionths of a second. Perhaps we should explain how Algo-Logic got here, and why this number is significant.

Some time ago, a new form of databases came back into everyday use, and they were classified as NoSQL, because they didn’t use Structured Query language, and were non-relational. These databases rely on clever algorithmic tricks to rapidly store and retrieve information in memory; this is very different from how traditional relational databases function. These NoSQL systems are sometimes referred to as key-value stores. With these systems, you pass in a key, and a value is returned. For example, pass in “12345,” and the value “2Z67890” might be returned. In this case, the key could be an order number, and the value returned is the tracking number or status, but the point is you made a simple request and got back a simple answer, perhaps in a few billionths of a second. What Algo-Logic has done is they wrote an application for the Xilinx Alveo U50 that turns the 40 Gbps Ethernet port on this card into four smoking fast key-value stores each with access to the cards 8GB of High Bandwidth Memory (HBM). Each Alveo U50 card with Algo-Logics KVS can service 150 million requests per second. Here is a high-level architectural diagram showing all the various components:

Architecture for 490M Requests/Second into a 1U Server

There are five production network ports on the back of the server, three 40Gbps and two 25Gbps. Each of the Xilinx Alveo U50 cards has a single 40Gbps port, and the dual 25Gbps ports are on an OCP-3 form factor card called the Xilinx XtremeScale X2562 which carries requests into the AMD EPYC CPU complex. Algo-Logic’s code running in each of the Xilinx Alveo cards breaks the 40Gbps channel into four 10Gbps channels and processes requests on each individually. This enables Algo-Logic to make the best use possible of the FPGA resources available to them.

Furthermore, to overcome network overhead issues, Algo-Logic has packed 44 get requests into a single 1408 byte packet. For those familiar with Redis, this is similar to an MGET, multiple get, request. Usually, a single 32-byte get request can easily fit into the smallest Ethernet payload, which is 40 bytes, but then Ethernet adds an additional 24-bytes of routing and a 12-byte frame gap — using a single request per-packet results in networking overhead consuming 58% of the available bandwidth. This is huge, and can clearly impact the total request per second rate. Packing 44 requests of 32 bytes each into a single packet means that the network overhead drops to 3% of the total bandwidth, which means significantly greater requests per second rates.

What Algo-Logic has done here is extraordinary. They’ve found a way to tightly link the 8GB of High Bandwidth Memory on the Xilinx Alveo U50 to four independent key-value store instances that can service requests in well under a micro-second. To learn more consider reaching out to John Hagerman at Algo-Logic Systems, Inc.

SmartNICs, the Next Wave in Server Acceleration

As system architects, we seriously contemplate and research the components to include in our next server deployment. First, we break the problem being solved into its essential parts; then, we size the components necessary to address each element. Is the problem compute, memory, or storage-intensive? How much of each element will be required to craft a solution today? How much of each will be needed in three years? As responsible architects, we have to design for the future, because what we purchase today, our team will still be responsible for three years from now. Accelerators complicate this issue because they can both dramatically breath new life into existing deployed systems, or significantly skew the balance when designing new solutions.

Today foundational accelerator technology comes in four flavors: Graphical Processing Units (GPUs), Field Programmable Gate Arrays (FPGAs), Multi-Processor Systems on a Chip (MPSoCs) and most recently Smart Network Interface Cards (SmartNICs). In this market, GPUs are the 900-pound gorilla, but FPGAs have made serious market progress the past few years with significant deployments in Amazon Web Services (AWS) and Microsoft Azure. MPSoCs, and now SmartNICs, blend many different computational components into a single chip package, often utilizing a mix of ARM cores, GPU cores, Artificial Intelligence (AI) engines, FPGA logic, Digital Signal Processors (DSPs), as well as memory and network controllers. For now, we’re going to skip MPSoCs and focus on SmartNICs.

SmartNICs place acceleration technology at the edge of the server, as close as possible to the network. When computational processing of network intense workloads can be accomplished at the network edge, within a SmartNIC, it can often relieve the host CPU of many mundane networking tasks. Normal server processes require that the host CPU spend, on average, 30% of it’s time managing network traffic, this is jokingly referred to as the data center tax. Imagine how much more you could get out of a server if just that 30% were freed up, and what if more could be made available?

SmartNICs that leverage ARM cores and or FPGA logic cells exist today from a growing list of companies like Broadcom, Mellanox, Netronome, and Xilinx. SmartNICs can be designed to fit into a Software-Defined Networking (SDN) architecture. They can accelerate tasks like Network Function Virtualization (NVF), Open vSwitch (OvS), or overlay network tunneling protocols like Virtual eXtensible LAN (VXLAN) and Network Virtualization using Generic Routing Encapsulation (NVGRE). I know, networking alphabet soup, but the key here is that complex routing, and packet encapsulation tasks can be handed off from the host CPU to a SmartNIC. In virtualized environments, significant amounts of host CPU cycles can be consumed by these tasks. While they are not necessarily computationally intensive, they can be volumetrically intense. With datacenter networks moving to 25GbE and 50GbE, it’s not uncommon for host CPUs to process millions of packets per second. This processing is happening today in the kernel or hypervisor networking stack. With a SmartNIC packet routing and encapsulation can be handled at the edge, dramatically limiting the impact on the host CPU.

If all you were looking for from a SmartNICs is to offload the host CPU from having to do networking, thereby saving the datacenter networking tax of 30%, this might be enough to justify their expense. Most of the SmartNIC product offerings from the companies mentioned above run in the $2K to $4K price range. So suppose you’re considering a SmartNIC that costs $3K, with the proper software, and under load testing, you’ve found that it returns 30% of your host CPU cycles, what is the point at which the ROI makes sense? A simplistic approach would suggest that $3K divided by 30% yields a system cost of $10K. So if the cost of your servers is north of $10K, then adding a $3K SmartNIC is a wise decision, but wait, there’s more.

SmartNICs can also handle many complex tasks like key-value stores, encryption, and decryption (IPsec, MACsec, soon even SSL/TLS), next-generation firewalls, electronic trading, and much more. Frankly, the NIC industry is at an inflection point similar to when video cards evolved into GPUs to support the gaming and virtualization market. While Sony coined the term GPU with the introduction of the Playstation in 1994, it was Nvidia five years later in 1999 who popularized the GPU with the introduction of the GeForce 256. I doubt that in the mid-1990s, while Nvidia was designing the NV10 chip, the heart of the GeForce 256, that their engineers were also pondering how it might be used in high-performance computing (HPC) applications a decade later that had nothing to do with graphic rendering. Today we can look at all the ground covered by GPU and FPGA accelerators over the past two decades and quickly see a path forward for SmartNICs where they may even begin offloading the primary computational tasks of a server. It’s not inconceivable to envision a server with a half dozen SmartNICs all tasked with encoding video, or acting as key-value stores, web caches, or even trading stocks on various exchanges. I can see a day soon where the importance of SmartNIC selection will eclipse server CPU selection when designing a new solution from the ground up.

x86 Has Hit the Wall, and Now Come the Accelerators – Part 2

Before we return to accelerators as a solution, we need to make a pit stop and explore the how behind the why. The why is simple; we buy a product or service to solve a problem. We intellectually evaluate stories and experiences, distill out the solutions that apply then affix those to tangible objects or services we can acquire. Rarely does someone buy an iPad to own an iPad, they have a specific use case in mind as their justification for that expense. The same holds for servers and accelerator cards. At this point in our technological evolution, the how for most remains a mystery which needs some explanation. 

When a technician visits your home to fix a broken appliance, they don’t just walk in with a lone flat-bladed screwdriver. They carry a pretty large toolbox which was explicitly assembled to repair appliances. The contents of that tool box are different than those of a carpenter’s or automotive mechanic’s. While all three might have a screwdriver, only the carpenter would have a wood chisel, and the mechanic a torque wrench. Different problems demand different tools. For the past several decades, many of us have viewed the x86 architecture as the computational tool to solve ALL our information processing issues. Guess what, a great many things don’t optimize well to the x86 model, but if you throw enough clock cycles and CPU cores at most problems, a solution will eventually be reached.

The High-Performance Compute (HPC) market realized this many years ago, so they built heterogeneous computing environments with schedulers for each type of problem. They classified problems into scaler, floating-point, and vector. Since then we’ve added, Artificial Intelligence (AI), also known as Machine Learning (ML). Scaler problems are the ones that deal with integers (numbers without a decimal point) which is often how we represent text. So, for example, a database lookup of your name to fetch your address is entirely a scaler problem. Next, we have floating-point, or calculations with a decimal point, the real numbers. These require different computational routines, and as early as 1983, we introduced special numerical co-processors (early accelerators) in our PCs to handle this specific class of problems (ex. Intel 8087). Today we can farm these class of problems out to Graphical Processing Units (GPUs) as they have many parallel cores explicitly designed for this purpose.

Then there’s the mysterious class called vector computing. A vector is a one-dimensional array of numbers. Some might argue that vectors are just a special case of floating-point problems, and they are, but their treatment at the processing level sets them far apart. Consider the Pythagorean theorem. Solving for C when you know A and B requires not only a floating-point processor but many steps to arrive at the value for C. For illustration let’s say it takes ten CPU instructions to arrive at a value for C, it’s probably more. Now imagine you have a set of 256 values for A and a corresponding set of 256 values for B, this would take 2,560 instructions to produce a solution, the complete set C. A vector processor will load the entire set of A and B values at the same time into CPU registers, square the results in one instruction, sum them in another, square-root the last result in another then present the solution set C in a final instruction, a few instructions instead of 2,560. Problems like weather forecasting map extremely well into the vector processing model.

Finally, there is the fourth, relatively new, class of problems that fall into the realm of AI or ML. Here the math being done is vector based, its a mix of both integer (scaler) and real numbers, but with intentionally low precision. The difference being that the value computed doesn’t always need to be perfect, just close enough. Much like when you do your taxes, and you leave off the change in your calculations. The IRS is okay with whole numbers because they’re good enough. Your self-driving car can drift an inch or so in any direction, and it won’t make any difference as it will still be more accurate than your Grandma Nat behind the wheel.

So now, back to the problem at hand, how do we accelerate today’s complicated workloads? For the past three decades, we’ve been taking a scaler platform, the x86 processor with floating-point capabilities, and using it as a double-ended screwdriver with both a flat and a Philips head to address every problem we have. How do we move forward?

Stay tuned for part three, where we cover hardware acceleration platforms.

x86 Has Hit the Wall, and Now Come the Accelerators

“… when you have access to the vastness of space, you realize there’s only one resource worth fighting over… even killing for: More time. Time is the single most precious commodity in the universe.”

— Kalique Abrasax, Jupiter Ascending (2015)

Computing is humanities purest quest to convert time into work. In 2000 IBM demonstrated slicing one second into 10 billion units (10GHz) and then squeezing computational work out of each unit. At the time IBM had defined a new 130-nanometer process they called “CMOS 9S“. It was planned for future generation PowerPC chips. In parallel IBM was ramping up production of the POWER4 at 1.9GHz. Now you may be asking yourself, “but wait a minute I’ve never seen any production 10GHz CPUs, especially not 20 years ago,” and you’re correct. IBM’s POWER6 was as close as we’ve gotten with one version of that chip advertised at 5GHz, and in the lab they achieved 6GHz. I’ve also heard IBM reps brag about 7GHz with POWER8 if you turn half the cores off. So why has computing hit the wall at 4-5GHz and computation not reached 10GHz over the last twenty years?

Intel explained this five years ago in the blog post, “Why has CPU frequency ceased to grow?” The problem has a name called the “conveyor level.” Imagine a CPU as a conveyor belt driven assembly line with four workstations labeled A through D. Since an assembly line is a serial process the worker at station B can’t start until the worker at station A finishes. Ideally, each station is designed to take the same amount of time to finish their work, so the following station isn’t impacted. The slowest worker then defines the speed of the conveyor on any given day. So if the most time-consuming stage in the CPU pipeline is 250 picoseconds, then the clock frequency is 4GHz. There is also the issue of heat.

As an electron races through a computer circuit, it experiences a form of friction, known as resistance. Just like rubbing your hands together on a cold day produces heat, so does an electron zipping through a computer circuit. When designing any chip heat is the enemy. The smaller the chip geometry, today its seven nanometers, the more devices you can pack into a given space on a chip. More devices mean more heat. That same square centimeter of space at 7nm still has the same thermal limitations it did at 130nm 20 years ago. Sure we can use fancy liquid systems to rapidly wick heat away from the chip, instead of relying on airflow over an area limited heat sink, but at the end of the day, every watt of power the chip consumes becomes heat. Now there are individual circuits throughout the chip specifically designed to detect and respond to over-heating situations. The last thing anyone wants is a smoldering piece of silicon where their CPU once was. In the 7GHz example above, the IBM representative said that if you viewed the POWER8 chip as a big chessboard and you turned off all the CPU cores on the white squares than all the cores on the black squares could be clocked at nearly twice the speed or 7GHz. Why is this interesting?

For some computational problems its much better to have two consecutive computations in the same unit of time than two unrelated ones. Electronic trading, also known as high-frequency trading (HFT) is the premier market-driven problem that benefits most from increasing clock frequency. Traders often ascribe a dollar value to a millionth of a second, and it varies from market to market based on the rules and volumes of each market. In the end, though it always boils down to the trader’s speed and response to a market signal. If I’m faster than you at making the right decision, then I win the business and book the profit. Sticking with HFT, where do accelerators fit in?

Traders lease connections to exchanges. The closer and faster they can respond to signals from those connections, the more competitive they will be. Suppose my trading platform requires signals from the market to travel through my server, then another switch on my private network, back through a second server, then finally out to the market. The networking alone, even with kernel bypass through two servers and a switch could easily be several microseconds. Add a few more microseconds for trading logic in both servers, and you could be looking at almost ten microseconds to submit a trade in response to a signal. Two years ago Solarflare with LDA Technologies demonstrated 98 nanoseconds tick to trade. This was using accelerator technology and compared to the trading platform mentioned above; it is three orders of magnitude faster. That’s the difference between walking from NYC to LAX versus flying at Mach 5 and arriving in an hour. Time matters and acceleration is not just for HFTs anymore. Why do you think Google bought Myricom, Amazon picked up Annapurna Labs, Nvidia purchased Mellanox, or Xilinx acquired Solarflare?

Please stay tuned, more to come in part two. In the meantime feel free to check out previous articles on this topic:

828ns – A Legacy of Low Latency

Electronic trading, like no other industry, can directly link time and money. A decade ago when I started selling 10GbE NICs to Wall Street traders, they often shared with me the value of a single microsecond (millionth of a second) improvement in trading. Today these same traders are measuring gains in nanoseconds (billionths of a second). With each passing quarter our financial markets evolve, and trade execution times decrease. Trading platforms leveraging older hardware and software often can’t remain competitive as other traders continue to invest in the latest products which further reduce trade execution latency and improve order determinism.

For the past decade, Solarflare has led the market in accelerating server-side UDP/TCP networking for electronic trading with our Onload® software acceleration stack. In addition, Solarflare has regularly delivered a new generation of 10GbE network adapters that have further reduced network latency by 20-30% while also reducing jitter. Often these advances were the result of improvements in the hardware, but there were many significant enhancements to the Onload stack that contributed substantially to the overall system performance increases. Keep in mind that Onload is fully compliant to the BSD Sockets standard, which means that developers don’t have to change their code to use Onload. The table below shows this reduction in Onload latency over time along with the gain from each new generation of Solarflare adapters.

In the below graph (click on it to enlarge) you’ll see how latency with Onload compares between Solarflare’s SFN8522 and X2522 as message size increases. We’ve also included our next closest competitor, Mellanox, with their ConnectX-5 adapter and VMA offload stack.

About five years ago, Solarflare saw an opportunity to revisit TCP/UDP networking stacks within Onload and determined that it is possible to squeeze another 35-50% in performance gains if developers were willing to use a new C language application programming interface (API). This new API was built from the ground up focused on performance, and it implements only a subset of the complete BSD Sockets API. Every API call has been highly tuned to deliver optimum performance. On the road to formulating this API Solarflare has patented several new innovations, and in 2016 it leaped forward again by introducing this API and branding it TCPDirect. Initially, TCPDirect improved latency on Solarflare’s SFN8522 adapter by an astonishing 38%!

Recently TCPDirect was tested with the Solarflare’s latest X2522 cards, and it delivered an improved 48% latency reduction over Onload on the same adapter (click on the graph below). Today TCPDirect with the X2522 provides an amazing 828ns of latency with TCP. So how does this compare with Mellanox? The X2522 with TCPDirect is 39% faster than the Mellanox ConnectX-5 with VMA and Exasock! This gain is shown in the graph below. It should be noted that this testing was done using an older more performant Intel Skylake processor with a 3.6Ghz clock. Intel’s newest Cascade Lake processors burst up to 4.4Ghz, but they were not available at the time of this testing. Recent testing indicates that they should produce even more impressive results.

Trading and Time are interwoven into a single fabric, one cannot exist without the other. When trades are executing with a precision measured in nanoseconds you need a technology partner that is leading the industry, not following it. Solarflare also provides a precision time protocol (PTP) daemon that includes both IEEE-1588 (2008) and enterprise profiles. Additionally, Solarflare makes available an optional PCIe bracket kit enabling the direct connection of an external hardware master clock that can deliver a highly accurate one pulse per second (1PPS) signal.  This kit and Solarflare’s PTP daemon enable the adapter to maintain system time synchronization to within 200ns of the external master clock. Mellanox has stated that their PTP implementation “can see time locked to reference well within 500 nanoseconds of variation.”

Numerous STAC reports over the past decade with all the major OEMs and the Linux distributions used in finance have validated that Solarflare networking technology is the standard by which all others are measured. Innovations like those discussed above are the reason why over 90% of the stock exchanges, global investment banks, hedge funds, and cutting-edge high-frequency traders’ architect their systems with Solarflare hardware and software. Outside of the Linux kernel’s own communications stack, no other TCP/UDP user-space communications stack is more heavily tested or in wider production than Solarflare’s Onload platform. Today the world economy exists across hundreds of thousands of servers spread throughout the globe, and nearly all of those servers depend on Solarflare to provide the industry’s best performance with the lowest jitter possible. Below are recent STAC Research reports from the past two years that back up our claims.

June 2018 – SFC180604b– UDP over 10GbE using Solarflare OpenOnload on Red Hat OpenShift 3.10 (pre-release) with RHEL 7.5 and Solarflare XtremeScale X2522 Adapters on Supermicro SYS-1029UX-LL1-S16 Servers

June 2018 – SFC180604a– UDP over 10 GbE Solarflare OpenOnload on Red Hat Enterprise Linux 7.5 with Solarflare XtremeScale X2522 adapters on Supermicro SYS-1029UX-LL1-S16Servers

October 2017 – SFC170831– STAC-T0: Solarflare SFN8522-ONLOAD NIC with LDA Technologies LightSpeed TCP on an Alpha Data FPGA in a Penguin Computing Relion XE1112 Server

Febuary 2017 – SFC170206– UDP over 10GbE using OpenOnload on RHEL 6.6 with Solarflare SFN 8522-PLUS Adapters on HPE ProLiant XL170r Gen9 Trade & Match Servers

User Level Networking (ULN) is Becoming an Over-Night Success

Kernel Bypass = User Level Networking

Rarely is an over-night success, over-night. Often success comes as a result of years or even decades of hard work, refinement, and maturity. ULN is just such a technology, while it is only now becoming fashionable as word leaks out that Google and Tencent have been adopting it internally because they’ve proven significant performance gains, it has been nearly 25 years in the making. Since the mid-1990s we have seen many efforts which have advanced kernel bypass otherwise known as ULN.   

With the advent of both Gigabit Ethernet (GbE) and the Linux operating system, we saw the emergence of large (1,024 or more) clusters of high-performance servers. These clusters were often designed to focus on particular computing tasks, typically single applications representing complex computational problems. These problems were particularly thorny because they involved very chatty sophisticated programs that modeled fluid dynamics (ex. Boeing and airflow over a wing) or finite particle analysis (ex. Ford and GM with simulated car crash models) or seismic analysis (ex. Saudi Aramco and oil production). Don’t get me wrong, there were also many more like modeling nuclear weapons storage, but the above were just a few of dozens of classes of problems. So, the HPC crowd was seeking networking which was even faster and more efficient than generic Transmission Control Protocol (TCP) over GbE. They’d also realized that the Linux kernel was beginning to bottleneck their overall performance, so they started to explore options for bypassing the Kernel altogether.  

This June the most popular Kernel bypass communications stack, the Message Passing Interface(MPI), will celebrate its 25th anniversary. MPI represented the dawn of a new approach to networking, a ULN communications stack. For MPI to achieve its desired performance objectives, it required a lower level networking device driver. In those early days, you could use the Virtual Interface Architecture(VIA) promoted by Intel, Microsoft and Compaq, which eventually became Infiniband’s Remote Direct Memory Access(RDMA), or Myrinetpromoted by Myricom. It should be noted that these weren’t the only two options, just the two most highly utilized at the time. Since then Myrinet has faded away, and Infiniband has dominated HPC.     

In parallel to the maturing of ULN, we’ve had an explosion in core counts on CPUs. This year Intel will begin rolling out premium server-based processor chips supporting up to 48-cores, while AMD counters with a 64. On the surface, this is excellent news, but it further complicates other system-wide server performance issues, most notably access to the network. Since most servers are a dual socket, this brings the potential maximum core counts to 96 and 128 respectively. What we’ve noticed though through internal testing is that often as the total number of processing cores on a server increases beyond ten the operating system typically becomes the networking performance bottleneck. As mentioned previously the High-Performance Computing (HPC) market anticipated this issue long ago.

In 2010 there was a move by several companies to bring HPC technology to markets outside HPC. With this, we saw the introduction of Myricom’s Datagram Bypass Layer(DBL), Solarflare’s OpenOnload, and Voltaire’s Messaging Accelerator(VMA). Both DBL and VMA were born from fifteen years of MPI experience, and they were crafted to provide kernel bypass on Linux. Initially, DBL only supported the Unreliable Datagram Protocol (UDP), and it took Myricom nearly two more years to add Transmission Control Protocol (TCP) support. While Myricom was able to morph their Myrinet eXpress (MX) stack into DBL, the fact remained that they didn’t have their own ULN TCP stack and were torn between licensing one versus building their own. An interesting side note, the initial customer motivation to create DBL was for a storage company called SANBlaze, but Myricom quickly realized that it could also use DBL to accelerate stock market data for Chicago traders. 

At that time 10GbE Network Interface Cards (NICs) had a 1/2 round trip for UDP based market data of about 10-15 microseconds. The initial version of DBL brought that down to under five microseconds. In financial trading, there is a direct correlation between time and money, and saving 5-10 microseconds on market data delivery means the difference between winning or losing a bid. At nearly the same time Solarflare also appeared in Chicago promoting its new OpenOnload that accelerated not only UDP but also the more complex TCP sessions. While market data comes in on UDP packets, orders into the exchanges are submitted using TCP. At the same time, and in parallel to this, one of the two biggest HPC Infiniband players Voltaire, later acquired by Mellanox, had crafted its own ULN called VMA. It too had realized that the lucrative financial markets were demanding ULN technology, and the time was right to apply their kernel bypass solution to this problem as well. 

For four years, it was a three-way horse race between DBL, OpenOnload, and VMA for the best ULN solution on Linux providing support for both UDP and TCP. Since 2010 ULN for both UDP and TCP has come into production at nearly all of the worldwide financial exchanges, institutional banks, and high-frequency traders. While DBL and VMA still exist today, they make up less than 5% of utilization of ULN technology within financial customers. It turns out that in the fall of 2012 Myricom privately demonstrated to Google the value of using DBL to accelerate a Web2.0 application used extensively throughout Google called Memcached. By March of 2013 Google had acquired the necessary people and intellectual property from Myricom to bring both DBL and Myricom’s latest NIC technology in-house. With the core DBL development team gone, DBL’s utilization within the financial markets waned, and those customers have moved on to OpenOnload. Since then Google has dramatically expanded its use of this ULN technology in-house. Roughly four years ago with the adoption of VMA falling off to less than 2% adoption, Mellanox open-sourced VMA and moved it out to Github. Quietly over the past several years as other cloud providers had recognized Google’s ULN moves, these other players have begun spawning their own ULN projects. 

At the same time in 2013 as word leaked out that Google had its own internal ULN project, Intel released their Data Plane Development Kit (DPDK). With DPDK it became much easier for applications to gain access directly to the raw networking device. This did not go unnoticed by China’s Tencent Cloud team as they started with the open source Free-BSD stack, carved out what they needed from it, then ported that on-top of DPDK. The resulting project was called F-Stack, and it can be found on Github today. Other projects like the OpenFastPath Foundation driven by Nokia, ARM, Cavium, and Marvell our advancing their own ULN. So today if you’re seeking out a ULN partner that supports both UDP and TCP your top five options are Solarflare’s Cloud Onload, VMA, F-Stack, OpenFastPath, and Seastar. Only one of these though is commercially available and fully supported, Solarflare’s Onload.  

As you consider how you might accelerate your network intensive Web2.0 applications like web servers, software load balancers, in-memory databases, micro-service frameworks, and distributed compute grids you should consider Solarflare’s Cloud Onload. With Cloud Onload we’ve seen performance gains ranging from 50%-400% depending on how network intensive an application is. Over the past decade, Solarflare’s Onload technology has accelerated electronic trading worldwide, and today over 90% of all exchanges, institutional banks, and high-frequency trading shops have installed Onload. The only other ULN technology that even comes close to the worldwide adoption of Onload is MPI, but that’s a ULN stack designed for HPC messaging and it does not support UDP or TCP. If your enterprise relies on any of the Web2.0 classes mentioned above, consider reaching out to Solarflare to learn how they can accelerate your network traffic.

Container Performance Doesn’t Need to Suck

Recently the OpenShift team at Red Hat, working with Solarflare Engineering, rolled out new code that was benchmarked by a third party, STAC Research, which demonstrated networking performance from within a container that was equivalent to that of a bare metal server. We’re talking 1.2 microseconds for 99% of network traffic in a 1/2RT (half round trip), that’s a TCP receive to an application coupled with a TCP send from that application.

Network performance like this was considered leading edge in High-Performance Computing (HPC) a little more than a decade ago when Myricom rolled out Myrinet10G which debuted at 2.4 microseconds back in 2006. Both networks are 10Gbps so it’s sort of an apples to apples comparison. Today, this level of performance is available for containerized applications using generic network socket calls. It should be noted that the above numbers were for zero byte packets, a traditional HPC measurement. More realistic performance using 256-byte packets yielded a 1/2RT time for the 99th percentile of traffic which was still under 1.5 microseconds, that’s amazing! It should be noted that everything was done to both the bare metal server and the Pod configuration to optimize performance. A graph of the complete results of that testing is shown below.

Anytime we create abstractions to simplify application execution or management we introduce additional layers of code that can result in potentially unwanted delays, known as application latency. By running an application inside a container, then wrapping that container into a Pod we are increasing the distance between what we intend to do, and what is actually being executed. Docker containers are fast becoming all the rage and methods for orchestrating them using tools like Kubernetes are extremely popular. If you dive into this OpenShift blog post there are ways to cut through these layers of code for performance while still retaining the primary management benefits.