Ultra-low Latency Networking for Windows, Is There a Need?

At a Flagg Management HPC on Wall Street event back in 2011, Nan Boden, then CEO of Myricom, was pitching the concept of ultra-low latency networking for Windows. As a setup, she asked the audience how many used Linux for their low-latency trading platforms. I was sitting in the back, but could easily see that many, though not all, hands were in the air. She then asked how many used Windows, and from the back I saw a few lonely hands. Following the session, I commented on how few people seemed interested, and she said that was true only from my viewpoint in the rear. From her position on the stage, nearly half had hands up, held just in front of their bodies so few others could see. It was as if low-latency trading on Windows were some dirty little secret.

Later that year, in August 2011, Myricom released DBL for Windows. I remained there for another two years, handling sales for the Eastern region, and all those hidden raised hands led to very few sales. Why? Price was not the issue; DBL for Windows was priced very aggressively against Linux. Nor was it performance: while DBL for Windows was measurably slower than Linux, it was still considerably faster than default Windows. We were never able to ferret out what the actual issue was. If you're a Windows user craving ultra-low latency, please consider reaching out to me and sharing your story.

Thank you all for your time this year, and Happy Holidays.

Security: DARPA, HFT & Financial Markets

Today nearly half of all Americans are invested in the financial markets. This past October, Dow Jones published the article “Pentagon Turns to High-Speed Traders to Fortify Markets Against Cyberattack.” The reporter had talked with a number of High-Frequency Trading (HFT) shops that had consulted directly with the Defense Advanced Research Projects Agency (DARPA). The objective of these discussions was to determine how the US financial markets could be fortified against cyberattacks.

The reporter learned that the following possible scenarios were discussed as part of the “Financial Markets Vulnerability Project:”

  1. Inject false information into stock data feeds
  2. Flood the stock market with fake orders and trigger a market crash
  3. Cripple a widely used payroll system
  4. Disrupt the credit card processors
  5. Report fake news into systems used to algorithmically drive trading

While protecting the US financial markets is something we expect of our government, the markets themselves are already well insulated from outside attackers. The first two threats in the above list are essentially the same: placing fake orders into an exchange with no intent to honor them. To connect to an exchange’s servers, a trader must be a member in good standing of that exchange and pay significant connection fees for each server that participates. Traders place a very high value on their access to each exchange, and while HFT shops may hold a security for only a few millionths of a second, they understand the long-term cost of losing access to an exchange. Most HFT shops lease many 10GbE connections on multiple exchange servers, across multiple exchanges and big banks’ dark pools, and very often Solarflare NIC cards are on both sides of these connections. So while it is technically possible for an HFT shop to inject enormous volumes of orders into one or more exchanges through these ports, a type of Denial of Service attack, doing so would quickly amount to financial suicide for the trading firm. The exchanges and the Securities and Exchange Commission (SEC) don’t take kindly to trading partners seeking to game the system; the exchanges, and soon after the SEC, would step in and shut down inappropriate activity. *It should be noted that the above image was taken on December 6, 2017, in New York City’s Times Square.

To further improve security for its trading customers, later this month Solarflare will begin rolling out a beta of ServerLock™, a firmware update for these very same NICs powering the exchanges and HFT shops worldwide. With ServerLock™, the HFT shops and the exchanges themselves can rapidly pump the brakes on any given logical connection directly within the NIC hardware. This is the point at which DARPA and others should be interested. If the logic within the exchange were to detect and validate a threat, it could then, within a few millionths of a second, install a filter into the NIC hardware to drop all subsequent packets from that threat. At that point the threat would be eliminated, and it would no longer consume exchange CPU cycles. For HFT shops, if they were to detect that an algorithm had gone rogue, they could employ ServerLock™ to physically cut a trading platform off from the exchange without having to touch the platform's precious code. Much like throwing a cover over Schrödinger's box, applying the filter in the NIC hardware leaves the trading platform itself intact for later investigation.
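ServerLock's actual firmware interface isn't described here, but the idea of a per-connection drop filter can be sketched in software. Everything below (the `FiveTuple` and `FilterTable` names and their methods) is hypothetical and for illustration only:

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class FiveTuple:
    """A logical connection: protocol plus source/destination address and port."""
    proto: str
    src_ip: str
    src_port: int
    dst_ip: str
    dst_port: int

@dataclass
class FilterTable:
    """Software stand-in for a NIC's hardware filter table."""
    blocked: set = field(default_factory=set)

    def block(self, flow: FiveTuple) -> None:
        # With ServerLock this step would install a drop filter in NIC
        # firmware; here we simply record the flow to be dropped.
        self.blocked.add(flow)

    def admit(self, flow: FiveTuple) -> bool:
        # Evaluated per packet: True = deliver to the host, False = drop
        # before any host CPU cycles are spent.
        return flow not in self.blocked

table = FilterTable()
rogue = FiveTuple("tcp", "203.0.113.9", 40001, "198.51.100.1", 443)
table.block(rogue)
print(table.admit(rogue))  # False: the rogue connection is cut off
```

The key property the sketch captures is that the decision is keyed on the logical connection, so a single rogue flow can be severed while every other flow, and the application itself, is left untouched.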

Number three on the list above is crippling a widely used payroll processor like ADP, which processes payroll checks for one out of six Americans. First, ADP uses at least two different networks: one permits inbound payroll data from their client companies over the public Internet via SSL-secured connections, and the second is a private Automated Clearing House (ACH) network. The ACH network is a member network connecting banks to clearinghouses like the Federal Reserve. Much like the exchanges above, being a paid member of an ACH network and then attacking that same network would not be a wise move for a business. As for the public Internet-facing connections that ADP maintains, they are likely practicing the latest defense-in-depth technologies, coupled with least privilege, in an effort to avoid the issues faced earlier this year by Equifax.

Next, we have the credit card processors, also known as Payment Card Industry (PCI) players, from Amex to Square, who are fighting a never-ending battle to secure their systems against outside threats. Much like the ACH network, the PCI industry has its own collection of private networks for processing credit card transactions, e.g. the Mastercard network, the Visa network, etc. These networks, like the ACH networks, are member networks, and attacking them would also be counterproductive. The world economy would likely not be in jeopardy if, say, the Amex or Discover networks were to stop processing credit cards for a few hours. The public websites of these providers, Mastercard for example, have been targets of some of the most substantial Distributed DoS (DDoS) attacks the world has ever seen, and they've all fared pretty well. Most have learned from these assaults how to further harden their networks.

Who would have thought two years ago that “fake news” could turn the tide of a US Presidential election, or be used as a tool to dramatically shift a financial market? At DEFCON 2015 I watched as Charlie Miller and Chris Valasek presented their now-infamous hack of a Jeep Grand Cherokee. At the start of their talk, Charlie joked that had they thought the Wired article would move Chrysler stock more than a point or two, he would have partnered with a VC to fund shorting the stock. Had he done that, he said, he'd be sitting on the beach of his private island sipping his favorite frozen drink through a straw rather than lecturing us. Charlie explained that he expected their announcement to be similar to Google or Microsoft announcing a bug, but he was very wrong. It led to a recall of 1.4 million vehicles, and the stock dropped double-digit percentage points following the story and the recall. While this was real news, it was a controlled news release from someone outside the company, and they could easily have made hundreds of millions of US dollars shorting the stock. What most people aren't aware of is that there are electronic news systems to which some HFT algorithmic platforms subscribe. Some of these systems even “read” tweets from key people (e.g. our president) to determine whether their comments might move a particular security or market in one direction or another. Knowing this, these systems can be gamed by issuing false stories in the expectation that the HFT algorithms will “read” them and move stock prices accordingly. When retractions are issued later, the same actors might place orders that benefit from the retractions as well. So how do we suppress the impact of “fake news” on our financial markets?

These news services know that HFT systems trade on their output. Given that, they should be investing heavily in machine-learning-based systems to rapidly fact-check and score the potential truthfulness of a given story. Stories whose scores fall below a credibility threshold should be kicked to humans for validation, delayed until they are backed up by additional sources, or even held until after the US markets close to further limit their impact.
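A minimal sketch of that gating policy, assuming a hypothetical `route_story` helper and an illustrative threshold (the scoring model itself, the hard part, is stubbed out):

```python
# Route a scored story: publish it, or hold it for human review.
# The 0.8 threshold and the score values are purely illustrative.
def route_story(story: str, score: float, threshold: float = 0.8) -> str:
    # `score` would come from a fact-checking model (0.0 = certainly false,
    # 1.0 = certainly true); anything below the threshold is held back.
    if score >= threshold:
        return "publish"
    return "hold-for-review"

print(route_story("Chipmaker beats earnings estimates", 0.95))  # publish
print(route_story("Surprise acquisition announced", 0.35))      # hold-for-review
```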

Gone in 98 Nanoseconds

Imagine a daily race with hundreds of top-fuel dragsters all lined up rumbling along in parallel, waiting for the same green Christmas-tree light before launching off the line. In some electronic markets, with specific products, this is exactly what happens every weekday morning. It's a race where being the fastest is the primary attribute used to determine whether you're going to be doing business; on any given day only the top finishers are rewarded with trades. Those who transmit their first orders of the day fastest receive a favorable position at the head of the queue and are likely to do some business that day. In this market EVERY nanosecond (a billionth of a second) of delay matters and can be monetized. Last week a new benchmark was set at 98 nanoseconds plus your trading algorithm, in some cases 150 nanoseconds total tick to trade.

“Latency” is the industry term for unavoidable network delays, and “tick-to-trade latency” aggregates the network travel time for a UDP market-data signal to arrive at a trading system and for that trading system to transmit a TCP order to the exchange. Last year Solarflare introduced Application Nanosecond TCP Send (ANTS) and lowered the tick-to-trade latency bar to 350 nanoseconds. ANTS executes in collaboration with Solarflare's Application Onload Engine (AOE), based on an Altera Stratix FPGA. Solarflare further advanced this high-speed trading platform to achieve 250 nanoseconds. Then in the spring of 2017 Solarflare collaborated with LDA Technologies: LDA brought their Lightspeed TCP cores to the table and replaced the AOE with a Xilinx FPGA board, once again lowering the tick-to-trade latency, this time to 120 nanoseconds. Now, through further advances and a move to the latest Penguin Computing Skylake platform, all three partners have just announced a STAC-T0 qualified benchmark of 98 nanoseconds tick-to-trade latency!

There was even a unique case in this STAC-T0 testing where the latency was measured at negative 68 nanoseconds, meaning that a trade could be injected into the exchange before the market data from the exchange had even been completely received. Traditional trading systems require that the whole market-data network packet be received before ANY processing can be done; these advanced FPGA systems receive the market data in four-byte chunks and can begin processing that data while it is still arriving. Imagine showing up in the kitchen before your wife even finishes calling your name for dinner. There could be both good and bad side effects of such rapid action: you may have a moment or two to taste a few things before the table is set, or you may get some last-minute chores. The same holds true for such aggressive trading.
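The negative-latency result falls out of that streaming approach. The contrast with store-and-forward parsing can be sketched in software; the packet layout, the 4-byte chunking, and the `first_decision_offset` helper below are all illustrative, not the actual FPGA logic:

```python
import struct

def chunks(packet: bytes, size: int = 4):
    """Yield the packet as it would arrive on the wire: small sequential chunks."""
    for i in range(0, len(packet), size):
        yield packet[i:i + size]

def first_decision_offset(packet: bytes, price_offset: int = 8) -> int:
    """Return how many bytes had arrived when the price field became usable.

    A store-and-forward system would wait for len(packet) bytes; a streaming
    parser can act as soon as the bytes covering the price field are in.
    """
    buf = bytearray()
    for chunk in chunks(packet):
        buf.extend(chunk)
        if len(buf) >= price_offset + 4:                 # price field complete
            (price,) = struct.unpack_from(">I", buf, price_offset)
            return len(buf)       # decision point: rest of packet still in flight
    raise ValueError("packet too short to contain the price field")

packet = bytes(64)                        # a 64-byte market-data message
print(first_decision_offset(packet))      # 12 -- decided after 12 of 64 bytes
```

With a 64-byte message and the (assumed) price field at offset 8, the streaming parser is done after 12 bytes, which is the software analogue of a trade leaving before the tick has finished arriving.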

Last week, in a podcast of the same name, we discussed this in more detail with Vahan Sardaryan, CEO of LDA Technologies.

Penguin Computing is also productizing the complete platform, including Solarflare’s ANTS technology and NIC, LDA Technologies Lightspeed TCP, along with a high-performance Xilinx FPGA to provide the Ultimate Trading Machine.

The Ultimate Trading Machine

Security Entirely Chimerical, SEC

On September 20th, SEC Chairman Jay Clayton released a “Statement on Cybersecurity.” It is an extremely dry read, but those who suffer through it will find several interesting points.

“I recognize that even the most diligent cybersecurity efforts will not address all cyber risks that enterprises face. That stark reality makes adequate disclosure no less important.”

How does the SEC define “adequate disclosure”? The federal government has requirements that, in some extreme breach cases, mandate a report to DHS’s CERT within one hour. When recently faced with this class of breach, the SEC waited 14 days. Is this adequate disclosure? Much further down in the statement, the SEC disclosed the following.

“In August 2017, the Commission learned that an incident previously detected in 2016 may have provided the basis for illicit gain through trading. Specifically, a software vulnerability in the test filing component of our EDGAR system, which was patched promptly after discovery, was exploited and resulted in access to nonpublic information.”

So in the best case the SEC waited only eight months to inform the public of this breach, but it could have been as much as 20 months. Unlike the publicly traded companies it regulates, the SEC isn’t legally required to tell investors or the public if it is ever breached; it is ONLY required to inform a law enforcement agency. EDGAR was also breached in 2014, but that saw little attention.

Now it’s one thing to breach an entity and remove data, but how about intentionally leaving false data behind for the purpose of capitalizing on it? In at least two cases over the past few years, false business acquisition reports, for Avon and the Rocky Mountain Chocolate Factory, have been inserted into EDGAR. In the Avon case the stock ran up 10 points. Does the SEC own up to these? Well, kind of…

“As another example, our Division of Enforcement has investigated and filed cases against individuals who we allege placed fake SEC filings on our EDGAR system in an effort to profit from the resulting market movements.”

OK, so EDGAR is a 30-year-old piece of Swiss cheese riddled with potential attack surfaces, some by design, others from simply not keeping current on penetration testing. What about their physical assets?

“For example, a 2014 internal review by the SEC’s Office of Inspector General (“OIG”), an independent office within the agency, found that certain SEC laptops that may have contained nonpublic information could not be located.”

All the above quotes were from the Wednesday SEC statement, but a 2016 GAO report on the SEC stated that the agency:

“…wasn’t always using encryption, supported software, well-tuned firewalls, and other key security tools while going about its business.”

Banking, in fact our financial market structure as a whole, is based on a singular concept: TRUST. The SEC was created in 1934, in the wake of the Great Depression, as a way to restore trust in the markets. Technology-savvy individuals will always attempt to exploit this trust for their own gain; it’s part of how the game is played. In our financial system, the SEC plays the role of the gambling commission, ensuring that the players, dealers, pit bosses, and the house are all working from the same set of published public rules. To his credit, Chairman Clayton is working within the system in an attempt to shine daylight on an agency in trouble and out of touch with the technology driving the markets it’s charged with regulating. Today it is possible to trade a stock based on a tick (a signal that something moved) within 150 billionths of a second, yet it takes the SEC 1.2 million seconds (14 days) to report a serious breach of its own security to law enforcement. Clearly, work remains to be done.
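The closing comparison is easy to verify:

```python
# Working out the closing comparison: modern tick-to-trade latency versus
# the SEC's 14-day breach-reporting delay.
tick_to_trade = 150e-9              # 150 billionths of a second
disclosure = 14 * 24 * 60 * 60      # 14 days expressed in seconds
print(disclosure)                   # 1209600, i.e. about 1.2 million seconds
print(f"{disclosure / tick_to_trade:.2e}")  # 8.06e+12 ticks fit in one disclosure window
```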

R.I.P. TCP Offload Engine NICs (TOEs)

Solarflare Delivers Smart NICs for the Masses: Software Definable,  Ultra-Scalable, Full Network Telemetry with Built-in Firewall for True Application Segmentation, Standard Ethernet TCP/UDP Compliant

As this blog post by Michael C. Bazarewsky states, Microsoft quietly pulled support for TCP Chimney from its Windows 10 operating system. Chimney was an architecture for offloading the state and responsibility of a TCP connection to a NIC that supported it. The piece cites numerous technical issues and a lack of adoption, and Michael’s analysis hits the nail on the head. Goodbye, TOE NICs.

During the early years of this millennium, Silicon Valley venture capitalists dumped hundreds of millions of dollars into start-ups that would deliver the next generation of network interface cards at 10Gb/sec using TCP offload engines. Many of these companies failed under the weight of trying to develop expensive, complicated silicon that just did not work. Others received a big surprise in 2005 when Microsoft settled with Alacritech over patents Alacritech held describing Microsoft’s Chimney architecture. In a cross-license arrangement with Microsoft and Broadcom, Alacritech received many tens of millions of dollars in licensing fees, and would later collect tens of millions more from nearly every other NIC vendor implementing a TOE in their design. At the time, Broadcom was desperate to pave the way for their acquisition of Israel-based Siliquent. Due to server OEM pressure, the settlement was a small price to pay for the certain business Broadcom would garner from sales of the Siliquent device. At 1Gb/sec, Broadcom owned an astounding 100% of the server LAN-on-Motherboard (LOM) market, and yet their position was threatened by the onslaught of new, well-funded 10Gb start-ups.

In fact, the feature list for new “Ethernet” enhancements got so full of great ideas that most vendors’ designs relied on a complex “sea of cores” promising extreme flexibility, which ultimately proved very difficult to qualify at the server OEMs. Any minor change to one code set would cause the entire design to fail in ways that were extremely difficult to debug, not to mention performing miserably. Most notably, NetXen, another 10Gb TOE NIC vendor, quickly failed after winning major design-ins at the three big OEMs, ultimately ending in a fire sale to QLogic. Emulex saw the same pot of gold in its acquisition of ServerEngines.

The next impetus was a move by Cisco to introduce Fibre Channel over Ethernet (FCoE) as a standard to converge networking and storage traffic. Cisco let QLogic and Emulex (Q & E) inside the tent before their Unified Computing System (UCS) server introduction, but the setup took some time. It required a new set of Ethernet standards, now more commonly known as Data Center Bridging (DCB). DCB is a set of physical-layer requirements that attempts to emulate the reliability of TCP by injecting wire protocols that allow “lossless” transmission of packets. What a break for Q & E! Given the duopoly’s control over the Fibre Channel market, this would surely put both companies in pole position to take over the Ethernet NIC market. Even Broadcom spent untold millions to develop a Fibre Channel driver that would run on their NIC.

Q & E quickly released what many called the “Frankenstein NIC,” a kluge of Application-Specific Integrated Circuits (ASICs) designed to get a product to market while each company struggled to develop a single ASIC, a skill at which neither excelled. Barely achieving its targeted functionality, neither design saw much traction. Through all of our customer interactions (over 1,650), we could find only one customer that had implemented FCoE. That large bank has since retracted its support for FCoE and, in fact, showed a presentation slide several years ago stating they were “moving from FCoE to Ethernet,” an acknowledgment that FCoE was indeed NOT Ethernet.

In conjunction with TOEs, industry pundits believed that RDMA (Remote Direct Memory Access) was another feature required to reduce latency, and not just for High-Frequency Trading (HFT): an acknowledgment that lowering latency was critical to hyper-scale cloud, big data, and storage architectures. However, once again, while intellectually stimulating, using RDMA in any environment proved complex and simply not compatible with customers’ applications or existing infrastructures.

The latest RDMA push is to position it as the underlying fabric for NVMe over Fabrics (NVMe-oF). Why? Flash has already reduced the latency of storage access by an order of magnitude, and the next generation of flash devices will reduce latency and increase capacity even further. Whenever there’s a step function in the performance of a particular block of computer architecture, developers come up with new ways to use that capability to drive efficiencies and introduce new and more interesting applications. Meanwhile, rotating magnetic media is on its last legs: several of our most significant customers have already stopped buying rotating disks in favor of flash SSDs.

Well… here we go again. RDMA is NOT Ethernet. Despite the “fake news” about running RDMA, RoCE, and iWARP on Ethernet, the largest cloud companies and our large financial services customers have declared that they cannot and will not implement NVMe-oF using RDMA. It just doesn’t fit their infrastructures or applications. They want low-latency standard Ethernet.

Since our company’s beginning, we’ve never implemented TOEs, RDMA, FCoE, or any of the other great and technically sound ideas for changing Ethernet. Sticking to our guns, we decided to go directly to the market and create pull for our products. The first market to embrace our approach was High-Frequency Trading (HFT). Over 99% of the world’s volume of electronic trading, in all instruments, runs on our company’s NICs. Why? Customers could test and run our NICs without any application modifications or changes to their infrastructure and realize enormous benefits in latency, jitter, message rate, and robustness… it’s standard Ethernet, and our kernel-bypass software has become the industry’s de facto standard.

It’s not that there isn’t room for innovation in server networking; it’s that you have to consider the customer’s ability to adapt to and manage that change in a way that isn’t disruptive to their infrastructure while, at the same time, delivering highly valued capabilities.

If companies are looking for innovation in server networking, they should look for a company that can provide the following:

  • Best-in-class PTP synchronization
  • Ultra-high resolution time stamps for every packet at every line rate
  • A method for lossless, unobtrusive, packet capture and analysis
  • Significant performance improvement in NGINX and LXC Containers
  • A firewall NIC and Application Micro-Segmentation that can control every app, VM, or container with unique security profiles
  • Real, extensive Software Definable Networking (SDN) without agents

In summary, while it’s taken a long time for the industry to overcome its inertia, logic eventually prevailed. Today, companies can benefit from innovations in silicon and software architecture that are in deployment and have been validated by the market. Innovative approaches such as neural-scale networking, which is designed to respond to the high-bandwidth, ultra-low-latency, hardware-based security, telemetry, and massive connectivity needs of ultra-scale computing, are likely the only strategy for achieving a next-generation cloud and data center architecture that can scale, be easily managed, and, perhaps most importantly, be secured.

— Russell Stern, CEO Solarflare

Will AI Restore HFT Profitability?

Several years ago Michael Lewis wrote “Flash Boys: A Wall Street Revolt” and created a populist backlash that rocked the world of High-Frequency Trading (HFT). Lewis publicly pulled back the curtain and shared his perspective on how our financial markets work. Right or wrong, he sensationalized an industry that most people didn’t understand, yet one in which nearly all of us have invested our life savings.

Several months after Flash Boys, Peter Kovac, a trader with EWT, authored “Flash Boys: Not So Fast,” in which he refuted a number of the sensationalistic claims in Lewis’s book. While I’ve not yet read Kovac’s book (it’s behind “Dark Pools: The Rise of Machine Traders and the Rigging of the U.S. Stock Market” in my reading list), from the reviews it appears he did an excellent job of covering the other side of the argument.

All that aside, earlier today I came across an article by Joe Parsons in “The Trade” titled “HFT: Not so flashy anymore.” Parsons makes several very salient points:

  • Making money in HFT today is no longer as easy as it once was. As a result, big players are rapidly gobbling up little ones: in April Virtu picked up KCG, and Quantlab recently acquired Teza Technologies.
  • Speed is no longer the primary differentiator between HFT shops. As someone in the speed business, I can tell you that today one microsecond for a network 1/2 round trip is table stakes, and some shops are deploying leading-edge technologies that can reduce this to under 200 nanoseconds. Speed as an HFT shop’s only value proposition hasn’t been viable for a rather long time; this is one of the key assertions made five years ago in the “Dark Pools” book mentioned above.
  • MiFID II is having a greater impact on HFT than most people realize. It brings with it new and more stringent regulations governing dark pools and block trades.
  • Perhaps the most important point Parsons makes is that the shift today is away from arbitrage as a primary revenue source and toward analytics and artificial intelligence (AI). He downplays this point by offering it up as a possible future.

AI is the future of HFT as shops look for new and clever ways to push intelligence as close as possible to the exchanges. Companies like Xilinx, the leader in FPGAs, Nvidia with their P40 GPU, and Google, yes Google, with their Tensorflow Processing Unit (TPU) will define the hardware roadmap moving forward. How this will all unfold is still to be determined, but it will be one wild ride.

Near-Real-Time Analytics for HFT

Artificial Intelligence (AI) advances are finally progressing along a geometric curve thanks to cutting-edge technologies like Google’s new Tensor Processing Unit (TPU) and NVIDIA’s latest Tesla V100 GPU platform. Couple these with updated FPGAs from Xilinx, such as their Kintex line, refreshed 32-core Intel Purley CPUs, and advances in storage such as NVMe appliances from companies like X-IO, and computing has never been so exciting! Near-real-time analytics for High-Frequency Trading (HFT) is now possible. This topic will be thoroughly discussed at the upcoming STAC Summit in NYC this coming Monday, June 5th. Please consider joining Hollis Beall, Director of Performance Engineering at X-IO, at the 1 PM STAC Summit panel discussion titled “The Brave New World of Big I/O!” If you can’t make it, or wish to get some background, there is Bill Miller’s blog post titled “Big Data Analytics: From Sometime Later to Real-Time,” where he tips his hand at where Hollis will be heading.

Stratus and Solarflare for Capital Markets and Exchanges

by David Whitney, Director of Global Financial Services, Stratus

The partnership of Stratus, the global standard for fault-tolerant hardware solutions, and Solarflare, the unchallenged leader in application network acceleration for financial services, at face value seems like an odd one. Stratus’s ‘always on’ server technology removes all single points of failure, which eliminates the need to write and maintain costly code to handle high availability and fast failover. Stratus and high performance have rarely been used in the same sentence.

Let’s go back further… Throughout the 1980s and 90s Stratus, with their proprietary VOS operating system, globally dominated financial services, from exchanges to investment banks. In those days the priority for trading infrastructures was uptime, which was provided by resilient hardware and software architectures. With the advent of electronic trading, the needs of today’s capital markets have shifted: High-Frequency Trading (HFT) has driven an explosion in transactional volumes. Driven by the requirements of one of the largest stock exchanges in the world, Stratus realized that critical applications need to be not only highly available but also extremely focused on performance (low latency) and deterministic (zero jitter) behavior.

Stratus provides a solution that guarantees availability in mission-critical trading systems without the costly overhead associated with today’s software-based High Availability (HA) solutions, or the need for multiple physical servers. You could conceivably cut your server footprint in half by using a single Stratus server where before you’d have needed at least two physical servers. Stratus is also a “drop and go” solution: no custom code needs to be written, and there is no concept of applications built specifically for Stratus FT. This isn’t just for Linux environments; Stratus also has hardened OS solutions for Windows and VMware.

Solarflare brings low-latency networking to the relationship with their custom Ethernet controller ASIC and the Onload Linux operating-system-bypass communications stack. Normally network traffic arrives at the server’s network interface card (NIC) and is passed to the operating system through the host CPU. This process involves copying the network data several times and switching the CPU’s context from kernel to user mode one or more times. All of these events take both time and CPU cycles. With over a decade of R&D, Solarflare has considerably shortened this path. Under Solarflare’s control, applications often receive data in about 20% of the time it would typically take. The savings are measured in microseconds (millionths of a second), typically several or more. In trading, speed often matters most, so a dollar value can be placed on this savings: back in 2010, one trader valued it at $50,000 per microsecond for each day of trading.
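Part of Onload's appeal, as noted elsewhere in this post, is that it accelerates standard sockets applications without source changes. A sketch of exactly the kind of unmodified UDP code path involved (Onload itself is not invoked here; in practice the bypass library is preloaded at launch, leaving code like this untouched):

```python
import socket

# An ordinary, standard-sockets UDP round trip on localhost. Kernel-bypass
# stacks such as Onload accelerate precisely this style of code by
# intercepting the socket calls at load time -- no application changes.
def echo_once(payload: bytes) -> bytes:
    rx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    rx.bind(("127.0.0.1", 0))                 # OS-assigned port
    tx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    tx.sendto(payload, rx.getsockname())      # "market data" is sent...
    data, _ = rx.recvfrom(2048)               # ...and received via plain recvfrom
    tx.close()
    rx.close()
    return data

print(echo_once(b"tick"))  # b'tick'
```

The point is what is absent: no special API, no RDMA verbs, no custom framing, which is why such stacks could be dropped under existing trading applications.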

Both Stratus and Solarflare have worked together to reduce jitter to nearly zero. Jitter is caused by those seemingly inevitable events that distract a CPU core from its primary task of electronic trading. For example, a thermal sensor somewhere in the system may exceed a predetermined temperature and raise a system interrupt. A CPU core is then assigned to handle that interrupt and determine which fan needs to be turned on or sped up. While this event sounds trivial, the distraction of processing the interrupt and returning to trading often results in a delay measured in hundreds of microseconds. Imagine your trading strategy normally executes in tens of microseconds, network latency adds 1-2 microseconds, and then all of a sudden the system pauses your trading algorithm for 250 microseconds while it does some housekeeping. By the time control is returned to your algorithm, it’s possible that the value of what you’re trading has changed. Both Stratus and Solarflare have worked exceedingly hard to remove jitter from the FT platform.

Going forward, Solarflare and Stratus will be adding Precision Time Protocol support to a new version of Onload for the Stratus FT Platform.

99.99999% Available + 2.7us = 1 Awesome Computer

What do you get when you put together a pair of dual-socket servers running in hardware lock-step with a pair of leading-edge, ultra-low-latency OS-bypass network adapters, all running Red Hat Enterprise Linux? One awesome 24-core system that boasts 99.99999% uptime, zero jitter, 2.7 microseconds of 1/2-round-trip UDP latency, and 2.9 microseconds for TCP.

How is this possible? First, we’ll cover what Stratus Technologies has done with Lock-Step, and how it makes the ftServer dramatically different than all others. Then we’ll explain what jitter is, and why removing it is so critical for deterministic systems like financial trading. Finally, we’ll cover these impressive Solarflare ultra-low latency numbers, and what they really mean.

We’ve all bought something with a credit card, flown through Chicago O’Hare, used public utilities, and possibly even called 9-1-1. What you don’t know is that very often at the heart of each of these systems is a Stratus server. Stratus should adopt the old Timex slogan “It takes a licking and keeps on ticking,” because that’s what it means to provide 99.99999% uptime: you’re allowed about three seconds a year for unplanned outages. Three seconds is how long it takes me to say “99.99999% up time.” How is this possible? Imagine running a three-legged race with a friend. Ideally, if you each compared your actions continuously with every step, you could run the race at the pace of the slower of the two of you. This is the key concept behind Lock-Step: comparing, then knowing what to do as one partner starts to stumble, to ensure the team continues moving forward no matter what happens. Stratus leverages the latest 12-core Intel Haswell E5-2670v3 server processors with support for up to 512GB of DDR4. If any hardware component in the server fails, the system as a whole continues moving forward and alerts an admin, who replaces the failed component, and that subsystem is brought back online. I challenge you to find another computer in your life that has ever offered that level of availability over the typical 5-7 year lifecycle that Stratus servers often see.
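The “three seconds a year” figure falls straight out of the arithmetic; a quick sanity check:

```python
# Seven nines of availability leaves ~3.16 seconds of unplanned
# downtime per year.
SECONDS_PER_YEAR = 365.25 * 24 * 3600   # 31,557,600 s
availability = 0.9999999                # 99.99999%
downtime_s = (1 - availability) * SECONDS_PER_YEAR
print(f"{downtime_s:.2f} seconds/year")
```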

So what is Jitter? When a computer core becomes distracted from its primary task to go off and do some routine housekeeping (operating system or hardware driven), the impact of that temporary distraction is known as jitter. With normal computing tasks, jitter is hardly noticeable; it’s the computer equivalent of background noise. With certain VERY time-critical computing tasks, though, like financial trading, even one jitter event can be devastating. Suppose your server’s primary function is financial trading, and it receives a signal from market A that someone wants to buy IBM at $100, while on market B it sees a second signal that another entity wishes to sell IBM at $99. So the trading algorithm on your server buys the stock on B for $99, but the instant it has confirmation of the purchase, a thermal sensor in your server generates an interrupt. The CPU core running your trading algorithm goes off to service that interrupt, which results in running some code to determine which fan to turn on. Eventually, say a millisecond or so later, control is returned to your trading algorithm, but by then the buyer on market A is gone, and the price of IBM has fallen to $99. That’s the impact of jitter: brief, often totally random moments in the trading day stolen to do basic housekeeping. These stolen moments can quickly add up for traders, and for exchanges they can be devastating. Imagine a delayed order, the result of jitter, missing an opportunity! Stratus Technologies has crawled through their server architecture and eliminated all potential sources of jitter. Traders and exchanges using other platforms have to do all this by hand, and it is still as much art as science. That’s one reason why over 1,400 different customers regularly depend on Solarflare.
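You can observe jitter on any machine with a tight timing loop: spin, timestamp every iteration, and look for outlier gaps. A simple sketch (the iteration count is arbitrary; real latency-measurement tools are far more rigorous):

```python
import time

def worst_gap_ns(iterations=1_000_000):
    """Spin in a tight loop and record the largest gap between
    consecutive iterations. Gaps far above the typical loop time
    are jitter: the OS or hardware stealing the core briefly."""
    worst = 0
    prev = time.perf_counter_ns()
    for _ in range(iterations):
        now = time.perf_counter_ns()
        gap = now - prev
        if gap > worst:
            worst = gap
        prev = now
    return worst

if __name__ == "__main__":
    print(f"worst-case gap: {worst_gap_ns() / 1000:.1f} us")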

Finally, there’s ultra-low latency networking via generic TCP/IP and UDP networking. In the diagram below network latency is in blue. Market data arrives via UDP and orders are placed through the more reliable TCP/IP protocol. Here is a quick anatomy of part of the trading process showing one UDP receive and one TCP send. There are other components, but this is a distilled example.

Initially, the packet is received from the wire (the light blue block), and it passes through the physical interface, where electrical networking signals are converted to layer-2 logical bits. From there the packet is passed to the on-chip layer-2 switch, which steers it to one of 2,048 virtualized NIC (vNIC) instances, also on the chip. The vNIC then uses DMA to transfer the packet into system memory, all of which takes about 500 nanoseconds. The packet has now left the network adapter and is on its way to a communications stack somewhere in system memory, the dark blue box. Here is where Solarflare shines. In the top timeline, the dark blue box represents their host kernel device driver and the Linux communications stack. Solarflare’s kernel device driver is arguably one of the fastest in the industry, but most of this dark blue box is time spent working with the kernel. There are CPU task switches and several memory copies of the packet as it moves through the system, and thousands of CPU instructions are executed; all told this can be nearly 3,000 nanoseconds. In the bottom timeline, the packet is DMA’d directly into user space, where Solarflare’s very tight user-space stack sits. This is where the packet is quickly processed and handed off to the end user application via the traditional sockets interface. All without additional data copies or CPU task switches, and completed in just under 1,000 nanoseconds, a savings of about 2,000 nanoseconds, or roughly 4,600 CPU instructions for this processor at this speed. All this, and we’ve just received a packet into our application, represented by the green blocks.
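The instruction-count figure implies a clock rate. A quick back-of-envelope check, assuming roughly one instruction per cycle at 2.3 GHz (the clock rate is my assumption, chosen because it reproduces the ~4,600 figure):

```python
CLOCK_GHZ = 2.3          # assumed core clock, ~1 instruction/cycle
kernel_path_ns = 3000    # kernel driver + Linux stack (from the text)
bypass_path_ns = 1000    # Onload user-space stack (from the text)
saved_ns = kernel_path_ns - bypass_path_ns
cycles = round(saved_ns * CLOCK_GHZ)
print(saved_ns, cycles)  # 2000 ns saved, ~4600 cycles
```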

So in the two bars above, the first represents market data coming in via Solarflare’s generic kernel device driver, then going through the normal Linux stack until the packet is handed off to the application. The response packet, in this case a trade via TCP, is sent back through the stack to the network adapter and eventually put on the wire, all told just over 9,000 nanoseconds. With Stratus & Solarflare, the second bar shows the latency of the same transaction, but traveling through Solarflare’s OS Bypass stack in both directions; the difference is that the transaction hits the exchange over 4,000 nanoseconds sooner. This means you can trade at nearly twice the speed, a true competitive advantage. Now, four millionths of a second isn’t an interval humans can easily grasp, so let’s put it in terms of light speed: four microseconds is roughly how long it takes a photon of light to cover nearly a mile.
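That light-speed comparison is easy to verify:

```python
C = 299_792_458          # speed of light in vacuum, m/s
saved_s = 4_000e-9       # the ~4,000 ns advantage
distance_m = C * saved_s
print(f"{distance_m:.0f} m, {distance_m / 1609.344:.2f} miles")
```

Light covers about 1,200 meters, roughly three quarters of a mile, in those four microseconds.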

So if you’re looking to build a financial trading system with ultra-high availability, zero jitter, and extreme network performance, you have only one choice: Stratus’s new ftServer.

1.44 us Full Round Trip Latency, Unlikely

Tuesday morning one of the guys on my team woke me with a text stating a competitor was claiming 1.44 microseconds for a full round trip (RT) using UDP. Two things about this immediately struck me as strange: first, it was reported as a full round trip number, and second, the number (excluding units) was oddly close to what I’d thought the theoretical 1/2 RT limit might be. You see, in the ultra-low latency, high-frequency trading market, time is everything. One need only be a few nanoseconds faster than the competition to win the lion’s share of the business. So speed is everything, but in the end, physics sets the speed limit.

In an ideal world, if one were to measure the time required for a UDP packet to enter a network server adapter, traverse the Ethernet controller chip, travel the host PCIe bus, pass through the Intel CPU complex, and finally end up in memory, they’d find that this journey takes roughly 730 nanoseconds. It should be noted that this varies across Intel server families and clock rates; we could be off by as much as +/- 100 nanoseconds, as measuring at this level is pretty challenging, but 730 nanoseconds is a reasonable number to start with. Also note that this is with Solarflare’s current 7000 series Ethernet Controller ASIC.

Breaking this down further, the most expensive part of this trip is the 500 nanoseconds or so the UDP packet will spend in Solarflare’s Ethernet controller chip. This chip is arguably the most popular low latency Ethernet Controller ASIC on the market today; it includes a high-performance PHY layer, an L2 switch, and built-in PCIe controller logic, so everything happens within this single chip. Over 1,000 financial trading firms rely on this technology daily; most of the world’s financial exchanges and nearly all of their high-performance customers depend on Solarflare, and as such they’ve turned every dial possible to squeeze out each available nanosecond. Add to this 150 nanoseconds, the time the packet spends traveling across the PCIe bus using DMA to cache via DDIO (not RAM), and finally another 80 nanoseconds or so to store it in RAM, making your final total 730 nanoseconds to receive a packet to memory. Again, your mileage will vary considerably, so please use these numbers only as rough reference points. For a 1/2RT (a receive plus a send) you’ll need to double this number, which brings the 1/2RT total to 1,460 nanoseconds, or 1.46 microseconds. It should also be noted that receives and sends have different costs; sends often consume less time, so again your numbers will vary, and this number should in fact be smaller. That’s Solarflare physics. Solarflare has a new 8000 series Ethernet Controller ASIC coming out soon, which will further trim the 500 nanoseconds spent in the ASIC, but by exactly how much is still a closely guarded secret.
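The budget above is just addition, but laying it out explicitly makes the next section easier to follow:

```python
# Receive-path budget from the text (7000-series ASIC); the article
# itself cautions these are rough, +/- 100 ns reference points.
asic_ns = 500    # Ethernet controller ASIC
pcie_ns = 150    # PCIe DMA to cache via DDIO
mem_ns  = 80     # cache to RAM
rx_ns = asic_ns + pcie_ns + mem_ns     # 730 ns to receive
half_rt_ns = 2 * rx_ns                 # 1,460 ns = 1.46 us 1/2RT
print(rx_ns, half_rt_ns)
```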

So is 1.44 microseconds for a conventional (through to user space vs. done completely in an FPGA) full round trip possible today? Well, the PCIe and memory components of this total 920 nanoseconds (150 nanoseconds for the PCIe bus plus 80 nanoseconds for CPU to memory, and both times 4 to address a full round trip). This leaves 520 nanoseconds to traverse the Ethernet Controller logic four times, or 130 nanoseconds for each pass. Considering that the most popular low-latency Ethernet controller chip on the planet requires 500 nanoseconds, doing it in 130 nanoseconds with the same degree of utility is highly unlikely.
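The same arithmetic, worked explicitly:

```python
claim_rt_ns = 1440                  # competitor's claimed full RT
fixed_ns = 4 * (150 + 80)           # PCIe + memory, four traversals
asic_budget_ns = claim_rt_ns - fixed_ns   # 520 ns left for the ASIC
per_pass_ns = asic_budget_ns / 4          # 130 ns per ASIC pass
print(fixed_ns, asic_budget_ns, per_pass_ns)
```

Four ASIC passes at 130 nanoseconds each, against a chip that today needs 500 nanoseconds per pass, is the gap the claim would have to close.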

On checking this competitor’s data sheet for this product, we found that they document 1.82 microseconds for a UDP 1/2RT using 64-byte packets, which implies roughly 3.6 microseconds for a full round trip. Compare that to the 1.44 microseconds they claimed verbally for a full round trip, and one can see that they’ve significantly stretched the truth. If it sounds too good to be true, it probably is…