TE10: Gone in 98 Nanoseconds: A New Low for Tick to Trade

Tonight we spoke with Vahan Sardaryan, CEO & Founder of LDA Technologies, about their new STAC-T0 benchmark result, which set the bar surprisingly low for network tick-to-trade latency: 98 nanoseconds.

During that call, we reviewed the following:

  • What is the significance of 98 nanoseconds?
  • Who worked with LDA to make this possible?
  • In layman’s terms, how was the feat accomplished?
  • The jitter of the solution was six to nine nanoseconds.
  • What is the significance of having STAC validate this achievement?
  • This was months’ worth of work. What sort of ghost hunting was required?
  • Who could benefit from this technological advance?
  • The report highlights a measurement of -68 nanoseconds. Can we trade into the future?
  • Where do we go from here, and can the bar be set even lower?

Scott would like to apologize for the quality of this podcast, as both he and Vahan had to dial into the recording system due to, you guessed it, networking issues.

Interested in evaluating Solarflare’s ServerLock Firewall in the NIC technology?
Please send an email to sschweitzer@solarflare.com

120ns Half Round Trip Latency, OMG!

Everyone hates waiting, and as such we continually improve the performance of our technology to remove this waiting, often called latency. In markets like financial trading, this latency can be monetized. Several years ago a high-frequency trading shop told me that for them a one-microsecond (millionth of a second) improvement translated to $60K per network port per day. How is that even possible? Well, if my stock market trade gets into the exchange before yours, I win and you lose. It’s that simple. Prior to May 2017, Solarflare had a network latency solution for electronic markets that delivered 250 nanoseconds (billionths of a second) from tick to trade. The “tick” in “tick to trade” is the market data that arrives from the exchange via a network packet.

So when NASDAQ generates a stock market ticker signal indicating that IBM is now trading at $160/share, your order into NASDAQ could be as quick as 250 nanoseconds plus the time it takes your algorithm to decide to buy. The 250 nanoseconds is how long it takes for the market data, a UDP network packet, to be brought into the FPGA-based NIC, plus the time it takes to generate the order as a TCP packet and inject that order back into the exchange. To put this into perspective, 250 nanoseconds is the time required for a photon of light to travel 82 yards, less than the length of a football field.

If that doesn’t sound fast enough, this week Solarflare, LDA Technologies, and Xilinx announced LightSpeed TCP, which under the proper circumstances can reduce network latency for trades from 250 nanoseconds down to 120 nanoseconds. By contrast, 120 nanoseconds is the time required for light to travel roughly 40 yards. So they’ve taken trading from a kickoff return to a few first downs.
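The yardage comparisons above can be checked with a few lines of arithmetic. This is a minimal sketch that assumes light traveling in a vacuum (light in fiber is noticeably slower); the function name `light_distance_yards` is made up for illustration.

```python
# Speed of light in a vacuum, metres per second (exact by SI definition).
SPEED_OF_LIGHT_M_PER_S = 299_792_458
# One international yard in metres (exact by definition).
METRES_PER_YARD = 0.9144


def light_distance_yards(nanoseconds: float) -> float:
    """Distance light covers in a vacuum during the given number of nanoseconds."""
    metres = SPEED_OF_LIGHT_M_PER_S * nanoseconds * 1e-9
    return metres / METRES_PER_YARD


if __name__ == "__main__":
    for ns in (250, 120):
        print(f"{ns} ns -> about {light_distance_yards(ns):.0f} yards")
```

Under that assumption, 250 nanoseconds works out to just under 82 yards and 120 nanoseconds to a little over 39 yards, which matches the rounded figures in the text.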

1.44 us Full Round Trip Latency, Unlikely

Tuesday morning one of the guys on my team woke me with a text stating that a competitor was claiming 1.44 microseconds for a full round trip (RT) using UDP. Two things about this immediately struck me as strange: first, it was reported as a full round trip number, and second, the number (excluding units) was oddly close to what I’d thought the theoretical 1/2 RT limit might be. You see, in the ultra-low latency, high-frequency trading market, time is everything. One need only be a few nanoseconds faster than the competition to win the lion’s share of the business. So speed is everything, but in the end, physics sets the speed limit.

In an ideal world, if one were to measure the time required for a UDP packet to enter a network server adapter, traverse the Ethernet controller chip, travel the host PCIe bus, pass through the Intel CPU complex, and finally end up in memory, they’d find that this journey takes roughly 730 nanoseconds. Now it should be noted that this varies across Intel server families and clock rates. We could be off by as much as +/- 100 nanoseconds, since measuring at this level is pretty challenging, but 730 nanoseconds is a reasonable number to start with. It should also be noted that this is with Solarflare’s current 7000 series Ethernet Controller ASIC.

Breaking this down further, the most expensive part of this trip is the 500 nanoseconds or so the UDP packet will spend in Solarflare’s Ethernet controller chip. This chip is arguably the most popular low-latency Ethernet Controller ASIC on the market today. It includes a high-performance PHY layer, an L2 switch, and built-in PCIe controller logic, so everything happens within this single chip. Over 1,000 financial trading firms rely on this technology daily; most of the world’s financial exchanges and nearly all of their high-performance customers depend on Solarflare, and as such Solarflare has turned all the dials possible to squeeze out every available nanosecond. Add to this 150 nanoseconds, the time the packet will spend traveling across the PCIe bus using DMA to cache via DDIO (not RAM), and finally another 80 nanoseconds or so to store it in RAM, making your final total 730 nanoseconds to receive a packet to memory. Again, your mileage will vary considerably, so please only use these numbers as rough reference points. For a 1/2RT you’ll need to double this number (a receive plus a send), which brings the 1/2RT total to 1,460 nanoseconds, or 1.46 microseconds. It should also be noted that receives and sends have different costs. Sends often consume less time, so again your numbers will vary, and this number should, in fact, be smaller. That’s Solarflare physics. Solarflare has a new 8000 series Ethernet Controller ASIC coming out soon which will further trim down the 500 nanoseconds spent in the ASIC, but by exactly how much is still a closely guarded secret.
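The receive-path budget above can be written down as a small table and summed. This is only a sketch using the rough figures quoted in this post; the dictionary keys and function names are labels invented for illustration, and like the text it treats a send as costing the same as a receive.

```python
# Rough per-stage receive latency in nanoseconds, using the figures from the text.
RX_BUDGET_NS = {
    "ethernet_controller_asic": 500,  # time spent inside the controller chip
    "pcie_dma_to_cache_ddio": 150,    # PCIe bus, DMA to cache via DDIO
    "cache_to_ram": 80,               # storing the packet in RAM
}


def one_way_ns(budget: dict) -> int:
    """Total time to receive one packet to memory."""
    return sum(budget.values())


def half_round_trip_ns(budget: dict) -> int:
    """A receive plus a send, assuming the send costs the same as the receive."""
    return 2 * one_way_ns(budget)


if __name__ == "__main__":
    print(f"receive to memory: {one_way_ns(RX_BUDGET_NS)} ns")
    print(f"1/2RT estimate:    {half_round_trip_ns(RX_BUDGET_NS)} ns")
```

Because sends are often cheaper than receives, the doubled figure is best read as an upper estimate rather than a measurement.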

So is 1.44 microseconds for a conventional (through to user space vs. done completely in an FPGA) full round trip possible today? Well, the PCIe and memory components alone account for 920 nanoseconds of that total (150 nanoseconds for the PCIe bus plus 80 nanoseconds for CPU to memory, both multiplied by four to cover a full round trip). This leaves 520 nanoseconds to traverse the Ethernet controller logic four times, or 130 nanoseconds for each pass. Considering that the most popular low-latency Ethernet controller chip on the planet requires 500 nanoseconds, doing it in 130 nanoseconds with the same degree of utility is highly unlikely.
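The same sanity check can be applied to any claimed round-trip number. The sketch below assumes the PCIe and memory costs quoted earlier and four traversals per full round trip; the function name and parameters are hypothetical, chosen only to mirror the reasoning in the paragraph above.

```python
def implied_controller_pass_ns(
    claimed_round_trip_ns: float,
    pcie_ns: float = 150,    # PCIe bus cost per traversal (figure from the text)
    memory_ns: float = 80,   # CPU-to-memory cost per traversal (figure from the text)
    traversals: int = 4,     # two receives plus two sends in a full round trip
) -> float:
    """Time left for each pass through the Ethernet controller once the
    host-side PCIe and memory costs are subtracted from a claimed round trip."""
    host_side_ns = (pcie_ns + memory_ns) * traversals
    return (claimed_round_trip_ns - host_side_ns) / traversals


if __name__ == "__main__":
    per_pass = implied_controller_pass_ns(1440)
    print(f"implied controller time per pass: {per_pass:.0f} ns")
```

For the 1.44 microsecond claim this leaves 130 nanoseconds per controller pass, against the roughly 500 nanoseconds cited for the current ASIC.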

On checking the competitor’s data sheet for this product, we found that they have documented 1.82 microseconds for a UDP 1/2RT using 64-byte packets. Compare this to the 1.44 microseconds they claimed verbally for a full round trip, and one can see that they’ve significantly stretched the truth. If it sounds too good to be true, it probably is…