x86 Has Hit the Wall, and Now Come the Accelerators

“… when you have access to the vastness of space, you realize there’s only one resource worth fighting over… even killing for: More time. Time is the single most precious commodity in the universe.”

— Kalique Abrasax, Jupiter Ascending (2015)

Computing is humanity’s purest quest to convert time into work. In 2000, IBM demonstrated slicing one second into 10 billion units (10GHz) and squeezing computational work out of each unit. At the time, IBM had defined a new 130-nanometer process it called “CMOS 9S,” planned for future generations of PowerPC chips. In parallel, IBM was ramping up production of the POWER4 at 1.9GHz. Now you may be asking yourself, “But wait a minute, I’ve never seen any production 10GHz CPUs, especially not 20 years ago,” and you’re correct. IBM’s POWER6 was as close as we’ve gotten, with one version of that chip advertised at 5GHz and lab parts reaching 6GHz. I’ve also heard IBM reps brag about 7GHz with POWER8 if you turn half the cores off. So why has computing hit a wall at 4-5GHz instead of reaching 10GHz over the last twenty years?

Intel explained this five years ago in the blog post, “Why has CPU frequency ceased to grow?” The problem comes down to the pipeline, which the post explains with a conveyor-belt analogy. Imagine a CPU as a conveyor-belt-driven assembly line with four workstations labeled A through D. Since an assembly line is a serial process, the worker at station B can’t start until the worker at station A finishes. Ideally, each station is designed to take the same amount of time, so no station holds up the next. The slowest worker then sets the speed of the conveyor on any given day. The same rule governs a CPU: if the most time-consuming stage in the pipeline takes 250 picoseconds, then the clock can tick no faster than once every 250 picoseconds, which works out to 4GHz. There is also the issue of heat.
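To make that arithmetic concrete, here is a minimal sketch (my own illustration, not something from Intel’s post) showing how the slowest stage caps the clock. The stage latencies are made-up numbers chosen only so the slowest one is 250 picoseconds:

```python
# Minimal sketch: the slowest pipeline stage sets the maximum clock frequency.
# Stage latencies are illustrative, not measurements of any real CPU.

stage_latencies_ps = {"A": 200, "B": 250, "C": 225, "D": 240}  # picoseconds

slowest_ps = max(stage_latencies_ps.values())      # 250 ps
max_clock_hz = 1.0 / (slowest_ps * 1e-12)          # 1 / 250 ps = 4 GHz

print(f"Slowest stage: {slowest_ps} ps -> max clock {max_clock_hz / 1e9:.1f} GHz")
```

Speeding up stations A, C, and D does nothing for the clock; only shortening the 250-picosecond stage (or splitting it into more stages) lets the conveyor run faster.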

As an electron races through a computer circuit, it experiences a form of friction known as resistance. Just as rubbing your hands together on a cold day produces heat, so does an electron zipping through a circuit. When designing any chip, heat is the enemy. The smaller the chip geometry, today it’s seven nanometers, the more devices you can pack into a given space. More devices mean more heat, and that same square centimeter of silicon at 7nm has the same thermal limitations it did at 130nm twenty years ago. Sure, we can use fancy liquid-cooling systems to rapidly wick heat away from the chip instead of relying on airflow over an area-limited heat sink, but at the end of the day, every watt of power the chip consumes becomes heat. There are individual circuits throughout the chip specifically designed to detect and respond to overheating; the last thing anyone wants is a smoldering piece of silicon where their CPU once was. In the 7GHz example above, the IBM representative said that if you viewed the POWER8 chip as a big chessboard and turned off all the CPU cores on the white squares, then all the cores on the black squares could be clocked at nearly twice the speed, or 7GHz. Why is this interesting?
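Here is a rough back-of-the-envelope sketch of that chessboard trade-off. The power budget, core count, and base clock are my own illustrative assumptions, and I’m assuming dynamic power scales roughly linearly with frequency at a fixed voltage; real silicon also needs more voltage at higher clocks, so the actual gain is smaller than this simple model suggests:

```python
# Rough sketch of the "chessboard" trade-off: a fixed chip-level thermal budget
# shared among however many cores are switched on.
# Assumptions (mine, not IBM's): power per core scales linearly with frequency.

chip_power_budget_w = 200.0   # hypothetical thermal budget for the whole die
total_cores = 12              # illustrative core count

def max_freq_ghz(active_cores: int, base_freq_ghz: float = 3.5) -> float:
    """Frequency each active core could run at within the same chip power budget."""
    base_power_per_core = chip_power_budget_w / total_cores
    power_per_active_core = chip_power_budget_w / active_cores
    return base_freq_ghz * (power_per_active_core / base_power_per_core)

print(max_freq_ghz(total_cores))       # all cores on  -> 3.5 GHz
print(max_freq_ghz(total_cores // 2))  # half cores on -> ~7.0 GHz
```

Turn off half the squares on the board and each remaining core gets roughly twice the power budget, which is why the half-dark POWER8 could be pushed toward 7GHz.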

For some computational problems, it’s much better to have two consecutive computations in the same unit of time than two unrelated ones. Electronic trading, also known as high-frequency trading (HFT), is the premier market-driven problem that benefits most from increasing clock frequency. Traders often ascribe a dollar value to a millionth of a second, and that value varies from market to market based on the rules and volumes of each venue. In the end, though, it always boils down to the trader’s speed in responding to a market signal: if I’m faster than you at making the right decision, then I win the business and book the profit. Sticking with HFT, where do accelerators fit in?

Traders lease connections to exchanges. The closer and faster they can respond to signals from those connections, the more competitive they will be. Suppose my trading platform requires signals from the market to travel through my server, then a switch on my private network, back through a second server, and finally out to the market. The networking alone, even with kernel bypass through two servers and a switch, could easily be several microseconds. Add a few more microseconds for trading logic in both servers, and you could be looking at almost ten microseconds to submit a trade in response to a signal. Two years ago, Solarflare and LDA Technologies demonstrated a 98-nanosecond tick-to-trade. That was done with accelerator technology, and compared to the trading platform described above, it is roughly a hundred times faster: the difference between walking from NYC to LAX and flying there on a jet. Time matters, and acceleration is not just for HFTs anymore. Why do you think Google bought Myricom, Amazon picked up Annapurna Labs, Nvidia purchased Mellanox, or Xilinx acquired Solarflare?
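To show where the microseconds go, here is an illustrative latency budget. Every line item is my own assumed number, not a measured figure; only the 98-nanosecond total comes from the Solarflare/LDA demonstration mentioned above:

```python
# Illustrative latency budget (assumed numbers) for the multi-hop software
# platform described above, compared against an accelerated tick-to-trade path.

NS_PER_US = 1_000

software_path_ns = {
    "NIC + kernel-bypass stack, server 1": 1_500,
    "trading logic, server 1":             2_000,
    "switch hop on private network":         500,
    "NIC + kernel-bypass stack, server 2": 1_500,
    "trading logic, server 2":             2_000,
    "egress to exchange":                  1_500,
}

software_total_ns = sum(software_path_ns.values())   # ~9 microseconds
accelerated_ns = 98                                  # Solarflare/LDA demo figure

print(f"software path : {software_total_ns / NS_PER_US:.1f} us")
print(f"accelerated   : {accelerated_ns} ns")
print(f"speedup       : ~{software_total_ns / accelerated_ns:.0f}x")
```

Even with generous assumptions for each hop, the software path lands near ten microseconds, which is why shaving the whole response down to double-digit nanoseconds requires moving the work into the adapter itself.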

Please stay tuned; more to come in part two. In the meantime, feel free to check out previous articles on this topic:
