Stratus and Solarflare for Capital Markets and Exchanges

by David Whitney, Director of Global Financial Services, Stratus

The partnership of Stratus, the global standard for fault-tolerant hardware solutions, and Solarflare, the unchallenged leader in application network acceleration for financial services, at face value seems like an odd one. Stratus ‘always on’ server technology removes all single points of failure, which eliminates the need to write and maintain costly code to ensure high availability and fast failover scenarios. Stratus and high performance have rarely been used in the same sentence.

Let’s go back further… Throughout the 1980s and ’90s, Stratus and its proprietary VOS operating system dominated financial services globally, from exchanges to investment banks. In those days, the priority for trading infrastructures was uptime, which was provided by resilient hardware and software architectures. With the advent of electronic trading, the needs of today’s capital markets have shifted. High-Frequency Trading (HFT) has resulted in an explosion in transactional volumes. Driven by the requirements of one of the largest stock exchanges in the world, Stratus realized that critical applications need to be not only highly available but also extremely focused on performance (low latency) and deterministic (zero jitter) behavior.

Stratus provides a solution that guarantees availability in mission-critical trading systems, without the costly overhead associated with today’s software-based High Availability (HA) solutions or the need for multiple physical servers. You could conceivably cut your server footprint in half by using a single Stratus server where before you’d need at least two physical servers. Stratus is also a “drop and go” solution. No custom code needs to be written; there is no concept of customer applications built specifically for Stratus FT. This isn’t just for Linux environments: Stratus also has hardened OS solutions for Windows and VMware.

Solarflare brings low latency networking to the relationship with their custom Ethernet controller ASIC and Onload Linux Operating System Bypass communications stack. Normally network traffic arrives at the server’s network interface card (NIC) and is passed to the Operating System through the host CPU. This process involves copying the network data several times and switching the CPU’s context from kernel to user mode one or more times. All of these events take both time and CPU cycles. With over a decade of R&D, Solarflare has considerably shortened this path. Under Solarflare’s control, applications often receive data in about 20% of the time it would typically take. The savings are measured in microseconds (millionths of a second), typically several or more. In trading, speed often matters most, so a dollar value can be placed on these savings. Back in 2010, one trader valued the savings at $50,000 per microsecond for each day of trading.

Both Stratus and Solarflare have worked together to dramatically reduce jitter to nearly zero. Jitter is caused by those seemingly inevitable events that distract a CPU core from its primary task of electronic trading. For example, the temperature at a thermal sensor somewhere in the system may exceed a predetermined level, raising a system interrupt. A CPU core is then assigned to handle that interrupt and determine which fan needs to be turned on or sped up. While this event, known as “Jitter”, sounds trivial, the distraction to process this interrupt and return to trading often results in a delay measured in the 100s of microseconds. Imagine your trading strategy normally executes in 10s of microseconds, network latency adds 1-2 microseconds, and then all of a sudden the system pauses your trading algorithm for 250 microseconds while it does some system housekeeping. By the time control is returned to your algorithm it’s possible that the value of what you’re trading has changed. Both Stratus and Solarflare have worked exceedingly hard to remove Jitter from the FT platform.

Going forward, Solarflare and Stratus will be adding Precision Time Protocol support to a new version of Onload for the Stratus FT Platform.

99.99999% Available + 2.7us = 1 Awesome Computer

What do you get when you put together a pair of dual-socket servers running in hardware lock-step with a pair of leading-edge, ultra-low latency OS Bypass network adapters, all running Red Hat Enterprise Linux? One awesome 24-core system that boasts 99.99999% uptime, zero jitter, 2.7 microseconds of 1/2 round trip UDP latency, and 2.9 microseconds for TCP.

How is this possible? First, we’ll cover what Stratus Technologies has done with Lock-Step, and how it makes the ftServer dramatically different than all others. Then we’ll explain what jitter is, and why removing it is so critical for deterministic systems like financial trading. Finally, we’ll cover these impressive Solarflare ultra-low latency numbers, and what they really mean.

We’ve all bought something with a credit card, flown through Chicago O’Hare, used public utilities, and possibly even called 9-1-1. What you may not know is that very often at the heart of each of these systems is a Stratus server. Stratus should adopt the old Timex slogan “It takes a licking and keeps on ticking” because that’s what it means to provide 99.99999% uptime: you’re allowed three seconds a year for unplanned outages. Three seconds is how long it takes me to say “99.99999% uptime.” How is this possible? Imagine running a three-legged race with a friend. Ideally, if you each compared your actions continuously with every step, you could run the race at the pace of the slower of the two of you. This is the key concept behind Lock-Step: comparing, then knowing what to do as one starts to stumble, to ensure the team continues moving forward no matter what happens. Stratus leverages the latest 12-core Intel Haswell E5-2670v3 server processors with support for up to 512GB of DDR4. If any hardware component in the server fails, the system as a whole continues moving forward and alerts an admin, who then replaces the failed component, and that subsystem is brought back online. I challenge you to find another computer in your life that has ever offered that level of availability over the typical 5-7 year lifecycle that Stratus servers often see.
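
As a quick sanity check on the "three seconds a year" figure, here is a minimal Python sketch that converts an availability percentage into allowed downtime. The function name and the non-leap-year assumption are mine, purely for illustration.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # assumes a non-leap year


def allowed_downtime_seconds(availability_percent: float) -> float:
    """Seconds per year a system may be down at the given availability."""
    return (1 - availability_percent / 100) * SECONDS_PER_YEAR


# Compare the familiar "five nines" with the "seven nines" quoted above.
for label, pct in [("five nines", 99.999), ("seven nines", 99.99999)]:
    print(f"{label} ({pct}%): {allowed_downtime_seconds(pct):.2f} seconds/year")
```

Seven nines works out to a little over three seconds a year, while five nines allows a bit more than five minutes.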

So what is Jitter? When a computer core becomes distracted from its primary task to go off and do some routine housekeeping (operating system or hardware driven), the impact of that temporary distraction is known as Jitter. With normal computing tasks, Jitter is hardly noticeable; it’s the computer equivalent of background noise. With certain VERY time-critical computing tasks though, like financial trading, even one Jitter event could be devastating. Suppose your server’s primary function is financial trading, and it receives a signal from market A that someone wants to buy IBM at $100, and on market B it sees a second signal that another entity wishes to sell IBM at $99. So the trading algorithm on your server buys the stock on B for $99, but the instant it has confirmation of your purchase a thermal sensor in your server generates an interrupt. The CPU core that is running your trading algorithm then goes off to service that interrupt, which results in it running some code to determine which fan to turn on. Eventually, say a millisecond or so later, control is returned to your trading algorithm, but by then the buyer on market A is gone, and the price of IBM has fallen to $99. That’s the impact of Jitter: brief, often totally random moments in the trading day stolen to do basic housekeeping. These stolen moments can quickly add up for traders, and for exchanges, they can be devastating. Imagine a delayed order missing an opportunity as a result of Jitter! Stratus Technologies has crawled through their server architecture and eliminated all potential sources of Jitter. Traders & exchanges using other platforms have to do all this by hand, and this is still as much art as it is science. That’s one reason why over 1,400 different customers regularly depend on Solarflare.

Finally, there’s ultra-low latency networking via generic TCP/IP and UDP networking. In the diagram below network latency is in blue. Market data arrives via UDP and orders are placed through the more reliable TCP/IP protocol. Here is a quick anatomy of part of the trading process showing one UDP receive and one TCP send. There are other components, but this is a distilled example.

Initially, the packet is received from the wire (the light blue block) and passes through the physical interface, where electrical networking signals are converted to layer-2 logical bits. From there the packet is passed to the on-chip layer-2 switch, which steers the packet to one of 2,048 virtualized NIC (vNIC) instances, also on the chip. The vNIC then uses DMA to transfer the packet into system memory, all of which takes 500 nanoseconds. The packet has now left the network adapter and is on its way to a communications stack somewhere in system memory, the dark blue box. Here is where Solarflare shines. In the top timeline, the dark blue box represents their host kernel device driver and the Linux communications stack. Solarflare’s kernel device driver is arguably one of the fastest in the industry, but most of this dark blue box is time spent working with the kernel. There are CPU task switches and several memory copies of the packet as it moves through the system, and thousands of CPU instructions are executed; all told this can be nearly 3,000 nanoseconds. In the bottom timeline, the packet is DMA’d directly into user space, where Solarflare’s very tight user-space stack sits. This is where the packet is quickly processed and handed off to the end user application via the traditional sockets interface, all without additional data copies or CPU task switches, and completed in just under 1,000 nanoseconds. That is a savings of about 2,000 nanoseconds, or roughly 4,600 CPU instructions for this processor at this speed. All this, and we’ve just received a packet into our application, represented by the green blocks.
So in the two bars above, the first represents market data coming in via Solarflare’s generic kernel device driver, then going through the normal Linux stack until the packet is handed off to the application. The response packet, in this case a trade via TCP, is sent back through the stack to the network adapter and eventually put on the wire, all told just over 9,000 nanoseconds. With Stratus & Solarflare, the second bar shows the latency of the same transaction, but traveling through Solarflare’s OS Bypass stack in both directions. The difference here is that the transaction hits the exchange over 4,000 nanoseconds sooner. This means you can trade at nearly twice the speed, a true competitive advantage. Now four millionths of a second isn’t something humans can easily grasp, so let’s jump to light speed: this is how long it takes a photon of light to cover nearly a mile.
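
For anyone who wants to check the arithmetic, here is a rough Python sketch using the rounded figures from the two bars (9,000 nanoseconds through the kernel, 4,000 nanoseconds saved) and the speed of light in a vacuum. The constant and function names are made up for illustration.

```python
SPEED_OF_LIGHT_M_PER_S = 299_792_458  # in a vacuum
METERS_PER_MILE = 1609.344

KERNEL_PATH_NS = 9_000  # "just over 9,000" in the text, rounded down
SAVINGS_NS = 4_000      # "over 4,000" in the text, rounded down


def light_distance_miles(nanoseconds: float) -> float:
    """Distance light covers in a vacuum in the given number of nanoseconds."""
    return SPEED_OF_LIGHT_M_PER_S * nanoseconds * 1e-9 / METERS_PER_MILE


bypass_path_ns = KERNEL_PATH_NS - SAVINGS_NS
speedup = KERNEL_PATH_NS / bypass_path_ns

print(f"OS Bypass path: about {bypass_path_ns} ns, {speedup:.1f}x the kernel path's speed")
print(f"Light covers about {light_distance_miles(SAVINGS_NS):.2f} miles in {SAVINGS_NS} ns")
```

With these rounded inputs the bypass path comes out at about 5,000 nanoseconds, and light covers roughly three quarters of a mile in the time saved.
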
So if you’re looking to build a financial trading system with ultra-high availability, zero jitter & extreme network performance, you have only one choice: Stratus’s new ftServer.

1.44 us Full Round Trip Latency, Unlikely

Tuesday morning one of the guys on my team woke me with a text stating a competitor was claiming 1.44 microseconds for a full round trip (RT) using UDP.  Two things about this immediately struck me as strange: first it was reported as a full round trip number, and second, the number (excluding units) was oddly close to what I’d thought the theoretical 1/2 RT limit might be. You see in the ultra-low latency, high-frequency trading market, time is everything. One need only be a few nanoseconds faster than their competition to win the lion’s share of the business. So speed is everything, but in the end, physics sets the speed limit.

In an ideal world if one were to measure the time required for a UDP packet to enter a network server adapter, traverse the Ethernet controller chip, travel the host PCIe bus, through the Intel CPU complex and finally end up in memory they’d find that this journey was roughly 730 nanoseconds. Now it should be noted that this varies across Intel server families & clock rates. We could be off by as much as +/- 100 nanoseconds, measuring at this level is pretty challenging, but 730 nanoseconds is a reasonable number to start with. Also, it should be noted that this is with Solarflare’s current 7000 series Ethernet Controller ASIC.

Breaking this down further, the most expensive part of this trip is the 500 nanoseconds or so the UDP packet will spend in Solarflare’s Ethernet controller chip. This chip is arguably the most popular low latency Ethernet Controller ASIC on the market today, it includes a high-performance PHY layer, an L2 switch, and built-in PCIe controller logic, everything happens within this single chip.  Over 1,000 financial trading firms rely on this technology daily, most of the world’s financial exchanges and nearly all of their high-performance customers depend on Solarflare, and as such they’ve turned all the dials possible to squeeze out every available nanosecond. Add to this 150 nanoseconds, the time the packet will spend traveling across the PCIe bus using DMA to cache via DDIO (not RAM), and finally another 80 nanoseconds or so to store it in RAM, making your final total 730 nanoseconds to receive a packet to memory. Again, your mileage will vary considerably so please only use these numbers as rough reference points. For a 1/2RT you’ll need to double this number (a receive plus a send) which brings the 1/2RT total to 1,460 nanoseconds, or 1.46 microseconds. It should also be noted that receives and sends have different costs, sends often consume less time, so again your numbers will vary, and this number should, in fact, be smaller. That’s Solarflare physics.  Solarflare has a new 8000 series Ethernet Controller ASIC coming out soon which will further trim down the 500 nanoseconds spent in the ASIC, but by exactly how much is still a closely guarded secret.

So is 1.44 microseconds for a conventional (through to user space vs. done completely in an FPGA) full round trip possible today? Well, the PCIe and memory components of this total 920 nanoseconds (150 nanoseconds for the PCIe bus plus 80 nanoseconds for CPU to memory, and both times 4 to address a full round trip). This leaves 520 nanoseconds to traverse the Ethernet Controller logic four times, or 130 nanoseconds for each pass. Considering that the most popular low-latency Ethernet controller chip on the planet requires 500 nanoseconds, doing it in 130 nanoseconds with the same degree of utility is highly unlikely.
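
The latency budget above is easy to reproduce. The following Python sketch simply replays the numbers from the text (500, 150 and 80 nanoseconds, and the claimed 1.44 microsecond round trip); the variable names are my own, and the inputs are the same rough reference points, not measurements.

```python
# Rough per-packet costs quoted in the text, in nanoseconds.
ASIC_NS = 500    # time spent in the Ethernet controller ASIC
PCIE_NS = 150    # PCIe bus, DMA to cache via DDIO
MEMORY_NS = 80   # CPU to RAM

one_way_ns = ASIC_NS + PCIE_NS + MEMORY_NS  # receive a packet to memory
half_rt_ns = 2 * one_way_ns                 # a receive plus a send

# Work backwards from the competitor's claimed full round trip.
CLAIMED_FULL_RT_NS = 1_440
PASSES = 4  # a full round trip crosses each component four times
non_asic_ns = PASSES * (PCIE_NS + MEMORY_NS)
asic_budget_per_pass_ns = (CLAIMED_FULL_RT_NS - non_asic_ns) / PASSES

print(f"one way: {one_way_ns} ns, 1/2RT: {half_rt_ns} ns")
print(f"PCIe + memory over a full round trip: {non_asic_ns} ns")
print(f"left for the controller per pass: {asic_budget_per_pass_ns:.0f} ns (vs. {ASIC_NS} ns)")
```

The claim would leave 130 nanoseconds per pass for the Ethernet controller, roughly a quarter of the 500 nanoseconds used as the reference point here.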

On checking this competitor’s data sheet for this product we found that they have documented 1.82 microseconds for a UDP 1/2RT using 64-byte packets. Compare this to the 1.44 microseconds they claimed verbally for a full round trip, and one could see that they’ve significantly stretched the truth. If it sounds too good to be true, it probably is…

Stock Trading in 300ns: Low Latency Redefined

Henry Ford is quoted as having once said “If I had asked people what they wanted, they would have said faster horses.” Not all innovations are as groundbreaking as the automobile, but when one approaches an old problem with both a new strategy and improved technology, great things can happen. In June Solarflare released an update to OpenOnload (OOL) that introduced TCP Delegated Send, and in late September they will begin shipping the latest version of their Application Onload Engine (AOE) with an advanced FPGA. The combination of these two will result in the capability of turning around a TCP Send in 300ns, compared to roughly 1,700ns today. In latency-focused applications a savings of 1,400ns, an 82% improvement, is game changing. To understand how Solarflare pulls this off let’s look at a much simpler example.

My son uses an online exchange to trade Magic cards, and traders are rated on how fast they fill orders. Not much different than NASDAQ processing an order for Apple stock. When my son started he would receive an order on his computer, and search through a heap of cards to find the one necessary to fill that order. He would then go down several flights of stairs to my office to fetch an envelope and a stamp, then go back up to his computer. Next, he would address the envelope, apply the stamp, run back down the stairs and walk the completed trade to the mailbox. Today he has a cache of pre-stamped envelopes with the return addresses pre-written sitting beside his computer. All his cards are in a binder with an updated index. Filling a trade is a trivial matter. He simply checks the index, pulls the required card from the binder, updates the index, stuffs the card in an envelope, writes the final address on the front, and runs it out to the mailbox. Essentially, this is a Delegated Send. Everything that can be preprocessed in advance of the actual trade is prefetched & prepackaged.

When it comes to TCP and Delegated Send, at the start of the trading day the trading application, through OOL, establishes a TCP connection with the exchange. The trading application then calls a routine in OOL to take over control of the socket’s send path, and to obtain the Ethernet, IP and TCP headers for the connection. The application adds to these a message template and passes the resulting packet template to the FPGA, where it remains cached, much like my son’s stack of pre-stamped envelopes. When incoming packets arriving at the FPGA cause the RTL trading code to trigger a trade, the trade is inserted into the pre-formatted packet, the checksum is computed, and the packet is transferred to the exchange. The whole process takes approximately 300ns. When the ACK arrives from the exchange it is then passed transparently back to the trading application through OOL. Now some will point out that other FPGA solutions exist today that enable you to possibly trade at these speeds, but do any of these solutions make it this simple? With some minor modifications to your existing trading application you can quickly take advantage of Delegated Send with the AOE. No other FPGA solution even comes close!
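
To make the pre-stamped envelope idea concrete, here is a toy software sketch in Python. It is not the OpenOnload API and not FPGA code: the `PacketTemplate` class, the order layout, and the trailing checksum are all hypothetical simplifications (real TCP keeps its checksum inside the header and includes a pseudo-header). The only standard piece is the RFC 1071 ones' complement checksum.

```python
import struct


def internet_checksum(data: bytes) -> int:
    """RFC 1071 ones' complement checksum over 16-bit big-endian words."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = sum(struct.unpack(f"!{len(data) // 2}H", data))
    while total >> 16:  # fold carries back into the low 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


class PacketTemplate:
    """Headers prepared ahead of time, plus a fixed-layout order to fill in."""

    # Hypothetical order layout: 8-byte symbol, 4-byte price in cents, 4-byte quantity.
    ORDER = struct.Struct("!8sII")

    def __init__(self, headers: bytes, symbol: bytes):
        self.headers = headers  # opaque stand-in for the Ethernet/IP/TCP headers
        self.symbol = symbol

    def fill(self, price_cents: int, quantity: int) -> bytes:
        """Drop the trade into the template and append a checksum (simplified)."""
        payload = self.ORDER.pack(self.symbol, price_cents, quantity)
        checksum = internet_checksum(self.headers + payload)
        return self.headers + payload + struct.pack("!H", checksum)


# Start of day: build the template once (54 zero bytes stand in for real headers).
template = PacketTemplate(headers=bytes(54), symbol=b"IBM")

# Trigger time: only the trade details and the checksum remain to be done.
packet = template.fill(price_cents=9900, quantity=100)
print(len(packet), "bytes ready to send")
```

The point of the design is that the expensive, connection-specific work happens once, up front, so the work left at trade time is a fixed-size fill and a checksum.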

So if latency in trading is important to you, and you’d like your orders moving along 1,000ns faster then perhaps it’s time to take a serious look at Delegated Send on Solarflare’s AOE. To learn more please consider checking out this whitepaper.

For those already familiar with Solarflare’s AOE, this new version of the product has several very substantial improvements:

  • It leverages the latest Solarflare high-performance ASIC with a PCIe Gen3 interface.
  • Flexible, open choice of FPGA PCS/MAC.
  • All FDK modules, and examples delivered with full source code.
  • Sample implementations of 10GbE & 40GbE pass-through (requires PCS/MAC).
  • Sample implementations of all four 1600MHz DDR3 AOE memory channels.
For additional information please send me an email.

Rise of Heterogeneous Systems Architectures, and the Role of APUs

In his talk “The Race to Zero” last week at Flagg Management’s HPC for Wall Street show, Dr. Greg Rodgers, a PhD from AMD Research, discussed the rise of highly integrated Heterogeneous Systems Architectures (HSA). For the past six years I’ve exhibited at both of Russell Flagg’s annual shows, and during that time I’ve seen many different approaches to reducing latency & improving performance for the High Frequency Trading (HFT) market. Many companies have pitched custom FPGA solutions, GPUs, HPC RISC implementations, and ultra-dense Intel solutions, but not until this talk had I heard anything that was truly innovative. In Dr. Rodgers’ brief 15-minute session he proposed a heterogeneous architecture for addressing a wider range of computational problems by tightly integrating several different processing models onto the same chip; that integration is the innovation. The concept of a heterogeneous computing environment is not new; in fact it’s been around for at least two decades. While I was working at NEC in 2004, one of my colleagues at our US Research division demonstrated a new product that loosely coupled several different computing resource pools together. That way jobs submitted with the tool could easily & efficiently be parceled out and leverage both scalar clusters & massively parallel systems (Earth Simulator) without having to be broken up and submitted individually to specific systems. What Dr. Rodgers is proposing is a much higher level of integration, on the same chip.

If this were anyone else I might have easily written off the concept as an intellectual exercise that would never see the light of day, but this was Greg Rodgers. I’ve known Greg for nearly eight years, and when we first met he was carrying around a pre-announced IBM JS21 PowerPC blade server under his arm between booths at SuperComputing 2005. He was evangelizing the need to build huge clusters using the latest in IBM’s arsenal of PowerPC workhorse chips in an ultra-dense form factor. Greg has built many large clusters during his career, and when he believes in an approach it will eventually be implemented in a very large cluster. It may end up at the Department of Energy, or a University or other Government lab, but it will happen.

AMD is currently producing an ultra-dense cluster in a box with their SeaMicro SM15000-OP. This is a 10U enclosure that houses 512 cores, each 64-bit, x86, at 2.0/2.3/2.8 GHz. To reach 512 cores they use 64 sockets, each housing a new octal-core Opteron. Each socket supports 64GB for a total of 4TB of system memory. AMD also provides 10GbE to each socket internally, and exposes 16 10GbE uplinks externally. This is a true HPC cluster in a box, but because it’s all x86 cores it’s designed for scalar workloads. What Greg is proposing is to shift this architecture from pure x86 to “Accelerated Processing Units” (APUs) that marry a GPU with two x86 cores, caches and other I/O on the same die (chip). That way memory can be shared, and data movement minimized. This would enable data parallel workloads and serial/task parallel workloads to coexist within the same chip, and be able to share memory when appropriate. Furthermore, Greg has proposed the following HSA concepts:

  • A unified programming model that enables task parallel and data parallel workloads while also supporting sequential workloads.
  • A single unified virtual address space addressable by all compute cores with well-defined memory regions supporting both global & private access.
  • User level queuing between the “Latency Compute Unit” (LCU) and the “Throughput Compute Unit” (TCU) without system calls.
  • Preemption & context switching, extending context management to the GPU.
  • HSA Intermediate Language (HSAIL) to split compilation between the front end and the finalizer to improve optimization & portability.

Greg was actively promoting the HSA Foundation as a method for moving HPC for HFT forward. Furthermore, he discussed AMD Open 3.0, which is their next generation open systems compute platform. Here is a link to Greg’s slides. It will be interesting to see how this approach plays within the market, especially at the SC13 show in November.

Hey HFT, 4.9 Microsecond TCP Latency on Windows!

To all those High-Frequency Trading (HFT) shops out there that believe you need Linux to secure the lowest latency, well that’s no longer the case. Myricom & Emulex now bring you FastStack DBL for Windows Version 2.2 which boasts an impressive 4.9 microseconds for a half round trip. That’s a round trip divided by two which represents a send plus a receive. Oh, and that’s out of the box using a simple transparent mode wrapper (dblrun). If you want you can also code to the DBL API which now works for both UDP and TCP channels.

In April of 2011 at the HPC Linux for Wall Street show Myricom announced DBL 2.0 which included TCP, and Windows support. Since then we’ve learned a few tricks and have further reduced the latency in transparent mode to 4.9us. Initially, to achieve out of the box transparent mode we leveraged a technique called Layered Service Provider (LSP). We realized though that this trick cost us 500ns, and after considerably more research and coding have developed a non-LSP solution that saves the 500ns and has some additional performance improvements.

If you need additional flexibility, or require that DBL be in your code path along with other 3rd party modules, then DBL also offers a rich Application Programming Interface (API).

Furthermore, FastStack DBL now supports both UDP and TCP, in both transparent & API mode, on both 32-bit and 64-bit runtime environments. So if you’re an HFT shop that uses Windows in production, or has been considering using it, you no longer have an excuse not to give it a try. All you’ve got to lose is latency.

Low Latency is Just the Ante

This article was originally published in May 2012 at 10GbE.net

For the past few years, all you needed was a low latency network adapter to win business in High-Frequency Trading (HFT).  Now that’s just the Ante. Today most shops have a much greater appreciation for the role that NIC latency plays in their HFT infrastructure. They are also now aware of the other key requirements driving network adapter selection. One of the most obvious today is out of the box integration. How quickly & easily can the low latency drivers be installed and engaged with your existing production code? Does their low latency driver also provide a Java interface? Do they have low latency drivers for Windows? There are a number of far more technical requirements involving multicast groups, polling models, etc…, but this is not the place to be giving away any secrets. 

Today only three low latency NICs have anted up to deliver sub four-microsecond performance for the HFT market: Myricom, Solarflare & Mellanox. Solarflare and Myricom dominate the market because both were early to market with a transparent acceleration mode which enabled customers to quickly install the low latency driver and engage existing production code with little or no modification. Furthermore, both support Java. In January Mellanox introduced VMA 6, which requires their latest ConnectX-3 adapters, that now supports a transparent acceleration mode. This one feature has kept Mellanox at the table.
 
So what’s next? Solarflare’s CTO implies in his May blog post that the “world just got smaller” and says “one more down”, then refers people to a link about the Emulex/Myricom partnership. Here he’s intimating that the partnership will remove Myricom from relevance in the HFT market. Nothing could be further from the truth. This partnership validates Myricom’s model and provides them with a substantial channel & manufacturing partner that is enabling them to compete even more efficiently and aggressively moving forward. What Solarflare fails to understand is that Myricom has developed market-specific software, like DBL for HFT, for several other markets: Sniffer10G for network analytics, VideoPump for video content providers, and MX for high performance clusters. Furthermore, Myricom has tuned their generic 10GbE driver to offer 50% better throughput than Solarflare’s and double Chelsio’s. In the HFT space Myricom is the only vendor to offer a transparent mode low latency option for Windows. In addition, Myricom has a roadmap of HFT features, built on customer requirements, that will dramatically improve DBL performance and functionality over the next 12 months. So is your NIC vendor focused on making the low latency driver you run your business on today even better, or have they gone down the FPGA rabbit hole?

FPGAs on 10GbE NICs, An Idea Whose Time Has Passed

This article was originally published in April of 2012 on 10GbE.net.

A few months ago Solarflare announced a new class of Network Interface Card (NIC), a hybrid adapter, that will be available in June. This hybrid combines their generic 10GbE ASIC with a Field Programmable Gate Array (FPGA) chip and some RAM, all wrapped in a Software Development Kit (SDK). This will then be marketed as a complete solution for the High-Frequency Trading (HFT) market. Rumors exist that they’ll also try to sell it into the network security market, and perhaps others.

At the time of this writing, high performance dual port NICs have a street price between $550 & $750, and this new hybrid NIC is rumored to cost ten times that. So why would someone even consider this approach? Simple: to reduce load on the host CPU cores. The initial pitch is that this hybrid will take on the role of the feed handler. Typically a feed handler runs on several cores of a multi-core server today. It receives trading data from all the relevant exchanges, filters off all the unwanted information, normalizes what remains and then passes this on to cores running the algorithmic trading code. By freeing up the feed handler cores, through the use of a hybrid NIC, this processing power can be allocated to run more advanced algorithmic codes.
On the surface, the pitch sounds like a great idea. Use a proven low-latency ASIC to pull packets off the wire, send the boring packets on to the OS and the interesting financial stuff to the FPGA. It’s when you get into the details that you realize it’s nothing more than a marketing scheme. When this product was designed I’m sure it sounded like a good idea: most 1U and 2U servers had eight cores and systems were getting CPU bound. As this NIC hits the market though, Intel has once again turned the crank, and vendors like IBM and HP are now delivering dual socket 16 core, 32 thread servers that will easily pick up the slack. A nicely configured HP DL360P with 16 cores, 32GB memory, etc… is available today for $10K, so adding one of these hybrid NICs will nearly double the hardware price of your trading platform. Note, this is before you even crack open the SDK and hire the small army of consultants you’ll need to program the FPGA.
 
Typically we’ve found that the normal packet flow from multiple exchanges into a trading server is roughly 200-300K packets per second, with very rare bursts up to 800K. So if one were to set aside four cores for feed handling, with an average feed load of 250Kpps, and assuming the feeds were evenly distributed, each core would have 16 microseconds per packet. On these new 2.2GHz Intel E5 systems this translates to roughly 8K instructions per packet to filter & normalize. This assumes two threads per core and an average of four clock ticks per instruction.
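
Here is a small Python sketch of that back-of-the-envelope calculation, using the figures quoted above (four cores, 2.2GHz, four clock ticks per instruction). The function names are mine, and the two-threads-per-core assumption is deliberately left out of the model, so treat the result as a rough estimate only.

```python
def microseconds_per_packet(packets_per_second: float, cores: int = 4) -> float:
    """Time each core has per packet if feeds are spread evenly across cores."""
    per_core_pps = packets_per_second / cores
    return 1e6 / per_core_pps


def instructions_per_packet(
    packets_per_second: float,
    cores: int = 4,
    clock_hz: float = 2.2e9,
    cycles_per_instruction: float = 4,
) -> float:
    """Rough instruction budget per packet; hyper-threading is not modeled."""
    seconds = microseconds_per_packet(packets_per_second, cores) * 1e-6
    return clock_hz * seconds / cycles_per_instruction


for label, pps in [("average load", 250_000), ("rare burst", 800_000)]:
    print(
        f"{label}: {microseconds_per_packet(pps):.1f} us/packet, "
        f"about {instructions_per_packet(pps):,.0f} instructions/packet"
    )
```

At the average load this lands at about 8,800 instructions per packet, in line with the roughly 8K quoted above, and even an 800Kpps burst still leaves a few thousand instructions per packet.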
 
Like TCP Offload Engines (TOEs), these hybrid NICs sound great when they’re first proposed, but on in-depth analysis, and particularly after Moore’s law kicks in, they soon become solutions looking for a problem, a novelty. With Intel’s new E5s, why would anyone seriously invest their time, hardware & consulting budgets in an outdated approach?