Stock Trading in 300ns: Low Latency Redefined

Henry Ford is quoted as having once said, “If I had asked people what they wanted, they would have said faster horses.” Not all innovations are as groundbreaking as the automobile, but when someone approaches an old problem with both a new strategy and improved technology, great things can happen. In June Solarflare released an update to OpenOnload (OOL) that introduced TCP Delegated Send, and in late September they will begin shipping the latest version of their Application Onload Engine (AOE) with an advanced FPGA. Together, these two make it possible to turn around a TCP send in 300ns, compared to roughly 1,700ns today. In latency-focused applications a savings of 1,400ns, an 82% improvement, is game changing. To understand how Solarflare pulls this off, let’s look at a much simpler example.

My son uses an online exchange to trade Magic cards, and traders are rated on how fast they fill orders. That’s not much different from NASDAQ processing an order for Apple stock. When my son started, he would receive an order on his computer and search through a heap of cards to find the one needed to fill it. He would then go down several flights of stairs to my office to fetch an envelope and a stamp, then head back up to his computer. Next, he would address the envelope, apply the stamp, run back down the stairs, and walk the completed trade to the mailbox. Today he has a cache of pre-stamped envelopes, return address already written, sitting beside his computer. All his cards are in a binder with an updated index. Filling a trade is a trivial matter: he simply checks the index, pulls the required card from the binder, updates the index, stuffs the card in an envelope, writes the final address on the front, and runs it out to the mailbox. Essentially, this is a Delegated Send. Everything that can be preprocessed in advance of the actual trade is prefetched & prepackaged.

When it comes to TCP and Delegated Send, at the start of the trading day the trading application, through OOL, establishes a TCP connection with the exchange. The trading application then calls a routine in OOL to take over control of the socket’s send path and to obtain the Ethernet, IP and TCP headers for the connection. The application adds a message template to these headers and passes the resulting packet template to the FPGA, where it remains cached, much like my son’s stack of pre-stamped envelopes. When an incoming packet arriving at the FPGA causes the RTL trading code to trigger a trade, the trade is inserted into the pre-formatted packet, the checksum is computed, and the packet is transferred to the exchange. The whole process takes approximately 300ns. When the ACK arrives from the exchange it is passed transparently back to the trading application through OOL. Now some will point out that other FPGA solutions exist today that enable you to trade at these speeds, but do any of them make it this simple? With some minor modifications to your existing trading application you can quickly take advantage of Delegated Send with the AOE; no other FPGA solution even comes close!
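To make the pre-stamped-envelope idea concrete, here is a minimal sketch in plain C of the prepare-once, patch-late pattern that Delegated Send is built on: the fixed parts of the message are assembled and partially checksummed up front, and on the hot path only the trade fields are filled in and the checksum finished. The message layout, field names and checksum placement below are illustrative assumptions, not Solarflare’s format; in the real flow the cached template holds the actual Ethernet, IP and TCP headers obtained from OOL and lives on the AOE’s FPGA rather than in host memory.

```c
/* delegated_send_sketch.c - illustrative only; not the OOL/AOE API.
 * Demonstrates the pattern Delegated Send relies on: prepare everything
 * that never changes before the trade, leaving only a small patch-and-send
 * step for the latency-critical moment.
 */
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical fixed-format exchange order message. */
struct order_msg {
    char     header[8];   /* constant protocol header                */
    char     symbol[8];   /* constant for this session's instrument  */
    uint32_t price;       /* patched in at trade time                */
    uint32_t quantity;    /* patched in at trade time                */
};

/* 16-bit ones'-complement sum, the same arithmetic IP/TCP checksums use. */
static uint32_t csum_partial(const void *buf, size_t len, uint32_t sum)
{
    const uint8_t *p = buf;
    while (len > 1) { sum += (uint32_t)p[0] << 8 | p[1]; p += 2; len -= 2; }
    if (len) sum += (uint32_t)p[0] << 8;
    return sum;
}

static uint16_t csum_fold(uint32_t sum)
{
    while (sum >> 16) sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)~sum;
}

int main(void)
{
    /* Session start: build the template and checksum the bytes that never
     * change (this is where the cached headers from OOL would also go). */
    struct order_msg tmpl;
    memset(&tmpl, 0, sizeof(tmpl));
    memcpy(tmpl.header, "NEWORDER", 8);
    memcpy(tmpl.symbol, "AAPL", 4);
    uint32_t fixed_sum = csum_partial(&tmpl, offsetof(struct order_msg, price), 0);

    /* Hot path: patch the trade fields and finish the checksum. */
    tmpl.price    = 15000;   /* illustrative price, in cents */
    tmpl.quantity = 100;
    uint16_t csum = csum_fold(csum_partial(&tmpl.price,
                                           sizeof(tmpl.price) + sizeof(tmpl.quantity),
                                           fixed_sum));

    printf("order ready to send, checksum 0x%04x\n", csum);
    return 0;
}
```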

So if latency in trading is important to you, and you’d like your orders moving along well over 1,000ns faster, then perhaps it’s time to take a serious look at Delegated Send on Solarflare’s AOE. To learn more, please consider checking out this whitepaper.

For those already familiar with Solarflare’s AOE, this new version of the product has several very substantial improvements:

  • It leverages the latest Solarflare high-performance ASIC with a PCIe Gen3 interface.
  • It offers a flexible, open choice of FPGA PCS/MAC.
  • All FDK modules and examples are delivered with full source code.
  • Sample implementations of 10GbE & 40GbE pass-through are included (requires a PCS/MAC).
  • Sample implementations of all four 1600MHz DDR3 AOE memory channels are included.

For additional information, please send me an email.

Rise of Heterogeneous Systems Architectures, and the Role of APUs

In his talk “The Race to Zero” last week at Flagg Management’s HPC for Wall Street show, Dr. Greg Rodgers of AMD Research discussed the rise of highly integrated Heterogeneous Systems Architectures (HSA). For the past six years I’ve exhibited at both of Russell Flagg’s annual shows, and during that time I’ve seen many different approaches to reducing latency & improving performance for the High Frequency Trading (HFT) market. Many companies have pitched custom FPGA solutions, GPUs, HPC RISC implementations and ultra-dense Intel solutions, but not until this talk had I heard anything that was truly innovative. In his brief 15-minute session, Dr. Rodgers proposed a heterogeneous architecture for addressing a wider range of computational problems by tightly integrating several different processing models onto the same chip; that integration is the innovation. The concept of a heterogeneous computing environment is not new; in fact it’s been around for at least two decades. While working at NEC in 2004, one of my colleagues at our US Research division demonstrated a new product that loosely coupled several different computing resource pools together. That way, jobs submitted with the tool could easily & efficiently be parceled out to leverage both scalar clusters & massively parallel systems (the Earth Simulator) without having to be broken up and submitted individually to specific systems. What Dr. Rodgers is proposing is a much higher level of integration on the same chip.

If this were anyone else I might have easily written off the concept as an intellectual exercise that would never see the light of day, but this was Greg Rodgers. I’ve known Greg for nearly eight years; when we first met he was carrying a not-yet-announced IBM JS21 PowerPC blade server under his arm between booths at SuperComputing 2005. He was evangelizing the need to build huge clusters using the latest in IBM’s arsenal of PowerPC workhorse chips in an ultra-dense form factor. Greg has built many large clusters during his career, and when he believes in an approach it will eventually be implemented in a very large cluster. It may end up at the Department of Energy, a university, or another government lab, but it will happen.

AMD is currently producing an ultra-dense cluster in a box with their SeaMicro SM15000-OP. This is a 10U enclosure that houses 512 cores, each 64-bit x86, at 2.0/2.3/2.8 GHz. To reach 512 cores they use 64 sockets, each housing a new octal-core Opteron. Each socket supports 64GB for a total of 4TB of system memory. AMD also provides 10GbE to each socket internally and exposes sixteen 10GbE uplinks externally. This is a true HPC cluster in a box, but because it’s all x86 cores it’s designed for scalar workloads. What Greg is proposing is to shift this architecture from pure x86 to “Accelerated Processing Units” (APUs) that marry a GPU with two x86 cores, caches and other I/O on the same die (chip). That way memory can be shared and data movement minimized. This would enable data-parallel workloads and serial/task-parallel workloads to coexist within the same chip and share memory when appropriate. Furthermore, Greg has proposed the following HSA concepts (a small illustrative sketch of the user-level queuing idea follows the list):

  • A unified programming model that enables task parallel and data parallel workloads while also supporting sequential workloads.
  • A single unified virtual address space addressable by all compute cores with well-defined memory regions supporting both global & private access.
  • User level queuing between the “Latency Compute Unit” (LCU) and the “Throughput Compute Unit” (TCU) without system calls.
  • Preemption & context switching, extending context management to the GPU.
  • HSA Intermediate Language (HSAIL) to split compilation between the front end and the finalizer to improve optimization & portability.
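Of these, user-level queuing is the one that most changes how application code is written, so here is a minimal, hedged sketch of the idea in plain C: a single-producer/single-consumer ring in ordinary process memory that one thread fills and another drains, with no system call on the enqueue or dequeue path. In actual HSA the consumer is the GPU pulling dispatch packets from a shared queue; a second pthread stands in for the TCU here purely so the sketch stays self-contained and runnable.

```c
/* hsa_queue_sketch.c - illustrative analogue of HSA user-level queuing.
 * One thread enqueues "work packets", another dequeues them; the shared
 * ring lives in ordinary process memory and neither side makes a system
 * call to hand work across. Build with: cc -std=c11 -pthread hsa_queue_sketch.c
 */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

#define RING_SIZE 64            /* power of two */
#define N_PACKETS 1000

struct work_packet { int task_id; };

static struct work_packet ring[RING_SIZE];
static atomic_uint head;        /* next slot the producer writes */
static atomic_uint tail;        /* next slot the consumer reads  */

static void *producer(void *arg)   /* stands in for the LCU (CPU cores) */
{
    (void)arg;
    for (int i = 0; i < N_PACKETS; i++) {
        unsigned h = atomic_load_explicit(&head, memory_order_relaxed);
        /* spin while the ring is full - still no system call */
        while (h - atomic_load_explicit(&tail, memory_order_acquire) == RING_SIZE)
            ;
        ring[h % RING_SIZE].task_id = i;
        atomic_store_explicit(&head, h + 1, memory_order_release);
    }
    return NULL;
}

static void *consumer(void *arg)   /* stands in for the TCU (GPU) */
{
    (void)arg;
    long sum = 0;
    for (int i = 0; i < N_PACKETS; i++) {
        unsigned t = atomic_load_explicit(&tail, memory_order_relaxed);
        while (atomic_load_explicit(&head, memory_order_acquire) == t)
            ;                       /* spin until work arrives */
        sum += ring[t % RING_SIZE].task_id;
        atomic_store_explicit(&tail, t + 1, memory_order_release);
    }
    printf("consumed %d packets, checksum %ld\n", N_PACKETS, sum);
    return NULL;
}

int main(void)
{
    pthread_t p, c;
    pthread_create(&c, NULL, consumer, NULL);
    pthread_create(&p, NULL, producer, NULL);
    pthread_join(p, NULL);
    pthread_join(c, NULL);
    return 0;
}
```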

Greg was actively promoting the HSA Foundation as a method for moving HPC for HFT forward. Furthermore, he discussed AMD Open 3.0, which is their next-generation open systems compute platform. Here is a link to Greg’s slides. It will be interesting to see how this approach plays within the market, especially at the SC13 show in November.

Hey HFT, 4.9 Microsecond TCP Latency on Windows!

To all those High-Frequency Trading (HFT) shops out there that believe you need Linux to secure the lowest latency: that’s no longer the case. Myricom & Emulex now bring you FastStack DBL for Windows Version 2.2, which boasts an impressive 4.9 microseconds for a half round trip. That’s the round trip divided by two, representing a send plus a receive. Oh, and that’s out of the box, using a simple transparent-mode wrapper (dblrun). If you want, you can also code to the DBL API, which now works for both UDP and TCP channels.
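To show how little is asked of the application in transparent mode, here is a plain sockets micro-benchmark of the half-round-trip kind quoted above. Nothing in it is DBL-specific; the whole point of the wrapper approach is that unmodified code like this picks up the acceleration simply by being launched under it. The sketch uses POSIX sockets for brevity; the same unmodified-sockets principle applies to Winsock code on Windows, and the exact dblrun command-line syntax is not assumed here.

```c
/* udp_hrt.c - plain Berkeley-sockets half-round-trip measurement.
 * Pair it with any UDP echo server on the far host; neither end needs
 * to know anything about the accelerated driver underneath it.
 * Usage: ./udp_hrt <echo-server-ip> <port>
 */
#include <arpa/inet.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/socket.h>
#include <time.h>

#define ITERS 100000

int main(int argc, char **argv)
{
    if (argc != 3) { fprintf(stderr, "usage: %s <ip> <port>\n", argv[0]); return 1; }

    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    struct sockaddr_in peer = { .sin_family = AF_INET,
                                .sin_port   = htons((uint16_t)atoi(argv[2])) };
    inet_pton(AF_INET, argv[1], &peer.sin_addr);

    char msg[32] = "ping";
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < ITERS; i++) {
        sendto(fd, msg, sizeof(msg), 0, (struct sockaddr *)&peer, sizeof(peer));
        recvfrom(fd, msg, sizeof(msg), 0, NULL, NULL);   /* wait for the echo */
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (double)(t1.tv_nsec - t0.tv_nsec);
    /* one iteration is a send plus a receive, i.e. a full round trip */
    printf("half round trip: %.1f ns\n", ns / ITERS / 2.0);
    return 0;
}
```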

In April of 2011, at the HPC Linux for Wall Street show, Myricom announced DBL 2.0, which included TCP and Windows support. Since then we’ve learned a few tricks and have further reduced the latency in transparent mode to 4.9us. Initially, to achieve out-of-the-box transparent mode we leveraged a technique called a Layered Service Provider (LSP). We realized, though, that this trick cost us 500ns, and after considerably more research and coding we have developed a non-LSP solution that saves those 500ns and brings some additional performance improvements.

If you need additional flexibility, or require that DBL be in your code path along with other 3rd party modules, then DBL also offers a rich Application Programming Interface (API).

Furthermore, FastStack DBL now supports both UDP and TCP, in both transparent & API mode, on both 32-bit and 64-bit runtime environments. So if you’re an HFT shop that uses Windows in production, or has been considering using it, you’ve no longer got an excuse not to give it a try. All you’ve got to lose is latency.

Low Latency is Just the Ante

This article was originally published in May 2012 at 10GbE.net.

For the past few years, all you needed was a low-latency network adapter to win business in High-Frequency Trading (HFT). Now that’s just the ante. Today most shops have a much greater appreciation for the role that NIC latency plays in their HFT infrastructure. They are also now aware of the other key requirements driving network adapter selection. One of the most obvious today is out-of-the-box integration: how quickly & easily can the low-latency drivers be installed and engaged with your existing production code? Does the low-latency driver also provide a Java interface? Are there low-latency drivers for Windows? There are a number of far more technical requirements involving multicast groups, polling models, etc., but this is not the place to be giving away any secrets.

Today only three low-latency NICs have anted up to deliver sub-four-microsecond performance for the HFT market: Myricom, Solarflare & Mellanox. Solarflare and Myricom dominate the market because both were early to market with a transparent acceleration mode, which enabled customers to quickly install the low-latency driver and engage existing production code with little or no modification. Furthermore, both support Java. In January Mellanox introduced VMA 6, which requires their latest ConnectX-3 adapters and now supports a transparent acceleration mode. This one feature has kept Mellanox at the table.
 
So what’s next? Solarflare’s CTO implies in his May blog post that the “world just got smaller,” says “one more down,” and then refers people to a link about the Emulex/Myricom partnership. Here he’s intimating that the partnership will remove Myricom from relevance in the HFT market. Nothing could be further from the truth: this partnership validates Myricom’s model and provides them with a substantial channel & manufacturing partner, enabling them to compete even more efficiently and aggressively moving forward. What Solarflare fails to understand is that, just as Myricom developed DBL for HFT, it has developed market-specific software for several other markets: Sniffer10G for network analytics, VideoPump for video content providers, and MX for high-performance clusters. Furthermore, Myricom has tuned their generic 10GbE driver to offer 50% better throughput than Solarflare’s and double Chelsio’s. In the HFT space Myricom is the only vendor to offer a transparent-mode low-latency option for Windows. In addition, Myricom has a roadmap of HFT features, built on customer requirements, that will dramatically improve DBL performance and functionality over the next 12 months. So is your NIC vendor focused on making the low-latency driver you run your business on today even better, or have they gone down the FPGA rabbit hole?

FPGAs on 10GbE NICs, An Idea Whose Time Has Passed

This article was originally published in April of 2012 on 10GbE.net.

A few months ago Solarflare announced a new class of Network Interface Card (NIC), a hybrid adapter, that will be available in June. This hybrid combines their generic 10GbE ASIC with a Field Programmable Gate Array (FPGA) chip and some RAM, and wraps all this hardware in a Software Development Kit (SDK). It will be marketed as a complete solution for the High-Frequency Trading (HFT) market. Rumors exist that they’ll also try to sell it into the network security market, and perhaps others.

At the time of this writing, high-performance dual-port NICs have a street price between $550 & $750; this new hybrid NIC is rumored to cost ten times that. So why would someone even consider this approach? Simple: to reduce load on the host CPU cores. The initial pitch is that this hybrid will take on the role of the feed handler. Typically a feed handler runs on several cores of a multi-core server today. It receives trading data from all the relevant exchanges, filters off all the unwanted information, normalizes what remains, and then passes the result on to the cores running the algorithmic trading code. By freeing up the feed handler cores through the use of a hybrid NIC, that processing power can be allocated to run more advanced algorithmic code.
On the surface, the pitch sounds like a great idea: use a proven low-latency ASIC to pull packets off the wire, send the boring packets on to the OS and the interesting financial stuff to the FPGA. It’s when you get into the details that you realize it’s little more than a marketing scheme. When this product was designed I’m sure it sounded like a good idea; most 1U and 2U servers had eight cores and systems were getting CPU bound. As this NIC hits the market, though, Intel has once again turned the crank, and vendors like IBM and HP are now delivering dual-socket, 16-core, 32-thread servers that will easily pick up the slack. A nicely configured HP DL360p with 16 cores, 32GB of memory, etc. is available today for $10K; adding one of these hybrid NICs will nearly double the hardware price of your trading platform. Note that this is before you even crack open the SDK and hire the small army of consultants you’ll need to program the FPGA.
 
Typically we’ve found that the normal packet flow from multiple exchanges into a trading server is roughly 200-300K packets per second, with very rare bursts up to 800K. So if one were to set aside four cores for feed handling, with an average feed load of 250Kpps, and assuming the feeds were evenly distributed, each core would have 16 microseconds per packet. On these new 2.2GHz Intel E5 systems that translates to roughly 8K instructions per packet to filter & normalize, assuming two threads per core and an average of four clock ticks per instruction.
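To make that budget explicit, here is the back-of-the-envelope arithmetic as a few lines of C; the 250Kpps average load, the even distribution across four cores, the 2.2GHz clock and the four clocks per instruction are the article’s own assumptions, not measured values.

```c
/* feed_budget.c - the per-packet budget arithmetic from the paragraph above. */
#include <stdio.h>

int main(void)
{
    const double pps          = 250e3;  /* average aggregate feed load       */
    const double cores        = 4;      /* cores set aside for feed handling */
    const double clock_hz     = 2.2e9;  /* Intel E5 clock                    */
    const double clocks_per_i = 4;      /* assumed average clocks/instruction */

    double us_per_pkt     = 1e6 / (pps / cores);           /* ~16 us           */
    double cycles_per_pkt = clock_hz * us_per_pkt / 1e6;   /* ~35,200 cycles   */
    double insns_per_pkt  = cycles_per_pkt / clocks_per_i; /* ~8,800, i.e. ~8K */

    printf("%.1f us, %.0f cycles, ~%.0f instructions per packet\n",
           us_per_pkt, cycles_per_pkt, insns_per_pkt);
    return 0;
}
```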
 
Like TCP Offload Engines (TOEs), these hybrid NICs sound great when they’re first proposed, but on in-depth analysis, and particularly after Moore’s law kicks in, they soon become solutions looking for a problem, a novelty. With Intel’s new E5s, why would anyone seriously invest their time, hardware & consulting budgets in an outdated approach?