
Gone in 98 Nanoseconds

Imagine a daily race with hundreds of top fuel dragsters lined up in parallel, rumbling, all waiting for the same green Christmas-tree light before launching off the line. In some electronic markets, with specific products, this is exactly what happens every weekday morning. It’s a race where being the fastest is the primary attribute that determines whether you’ll be doing business. On any given day only the top finishers are rewarded with trades. Those who transmit their first orders of the day the fastest earn a favorable position at the head of the queue and are likely to do some business that day. In this market EVERY nanosecond (a billionth of a second) of delay matters and can be monetized. Last week a new benchmark was set at 98 nanoseconds; add your trading algorithm, and in some cases the total is 150 nanoseconds tick to trade.

“Latency” is the industry term for unavoidable network delays, and “tick-to-trade latency” aggregates the network travel time for a UDP market data signal to arrive at a trading system with the time for that trading system to transmit a TCP order into the exchange. Last year Solarflare introduced Application Nanosecond TCP Send (ANTS) and lowered the tick-to-trade bar to 350 nanoseconds. ANTS executes in collaboration with Solarflare’s Application Onload Engine (AOE), based on an Altera Stratix FPGA. Solarflare further advanced this high-speed trading platform to achieve 250 nanoseconds. Then in the spring of 2017 Solarflare collaborated with LDA Technologies. LDA brought their Lightspeed TCP cores to the table and replaced the AOE with a Xilinx FPGA board, once again lowering the tick-to-trade latency, this time to 120 nanoseconds. Now, through further advances and a move to the latest Penguin Computing Skylake platform, all three partners have just announced a STAC-T0 qualified benchmark of 98 nanoseconds tick-to-trade latency!
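To make “tick to trade” concrete, here is a minimal sketch of the control flow in plain Python sockets: one UDP receive for market data, one TCP send for the order. The `tick_to_trade` name, the port handling, and the wire formats are all invented for illustration; a real system keeps the TCP session warm well before the tick and, in Solarflare’s case, bypasses the kernel entirely.

```python
import socket

def tick_to_trade(md_sock, exchange_addr):
    """Wait for one UDP market-data tick, then fire one TCP order.

    md_sock: a bound UDP socket carrying market data.
    exchange_addr: (host, port) of the order-entry gateway.
    """
    order = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    order.connect(exchange_addr)         # session is up before the tick arrives
    tick, _ = md_sock.recvfrom(2048)     # the "tick" half: UDP market data in
    if b"IBM" in tick:                   # toy strategy: act on any IBM tick
        order.sendall(b"BUY IBM 100\n")  # the "trade" half: TCP order out
    order.close()
```

Everything between the `recvfrom` returning and the `sendall` hitting the wire is the tick-to-trade latency the post is describing; the 98 ns systems do this entire path in hardware.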

There was even a unique case in this STAC-T0 testing where the latency was measured at negative 68 nanoseconds, meaning that a trade could be injected into the exchange before the market data from the exchange had even been completely received. Traditional trading systems require that the whole market data network packet be received before ANY processing can be done; these advanced FPGA systems receive the market data in four-byte chunks and can begin processing it while it is still arriving. Imagine showing up in the kitchen before your wife even finishes calling your name for dinner. There could be both good and bad side effects of such rapid action: you may get a moment or two to taste a few things before the table is set, or you may get some last-minute chores. The same holds true for such aggressive trading.
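The in-flight processing idea can be sketched in software, assuming a made-up feed layout where the symbol and price arrive in the first two 4-byte words: the decision fires before the rest of the packet is even consumed. Real FPGA pipelines do this in hardware as the words come off the wire; this Python version is only an analogy.

```python
def stream_decide(words):
    """Decide on a trade from a stream of 4-byte chunks in wire order.

    Returns a decision as soon as symbol (word 0) and price (word 1)
    are in hand, ignoring whatever trails them in the packet.
    The field layout here is invented, not any real feed's format.
    """
    it = iter(words)
    symbol = next(it)                        # first 4 bytes: the symbol
    price = int.from_bytes(next(it), "big")  # next 4 bytes: the price
    if symbol == b"IBM " and price < 100:
        return "BUY"                         # decided mid-packet
    return None                              # trailing words never touched
```

Because the function returns after two words, a long packet's tail is never read, which is the software analogue of an order leaving before the market data has fully arrived.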

Last week, in a podcast of the same name, we discussed this in more detail with Vahan Sardaryan, CEO of LDA Technologies.

Penguin Computing is also productizing the complete platform, including Solarflare’s ANTS technology and NIC, LDA Technologies Lightspeed TCP, along with a high-performance Xilinx FPGA to provide the Ultimate Trading Machine.

The Ultimate Trading Machine


Security Entirely Chimerical, SEC

On September 20th, SEC Chairman Jay Clayton released a “Statement on Cybersecurity.” It is an extremely dry read, but those who suffer through it will find several interesting points.

“I recognize that even the most diligent cybersecurity efforts will not address all cyber risks that enterprises face. That stark reality makes adequate disclosure no less important.”

How does the SEC define “adequate disclosure”? The federal government requires that some extreme breach cases be reported within one hour to DHS’s CERT. When recently faced with this class of breach, the SEC waited 14 days. Is this adequate disclosure? Much further down in the statement the SEC disclosed the following.

“In August 2017, the Commission learned that an incident previously detected in 2016 may have provided the basis for illicit gain through trading. Specifically, a software vulnerability in the test filing component of our EDGAR system, which was patched promptly after discovery, was exploited and resulted in access to nonpublic information.”

So in the best case the SEC waited only eight months to inform the public of this breach, but it could have been as much as 20 months. Unlike the publicly traded companies it regulates, the SEC isn’t legally required to tell investors or the public if it is ever breached; it is ONLY required to inform a law enforcement agency. EDGAR was also breached in 2014, but that saw little attention.

Now it’s one thing to breach an entity and remove data, but how about intentionally leaving false data behind for the purpose of capitalizing on that deposit? In at least two cases over the past few years, false business acquisition reports for Avon and the Rocky Mountain Chocolate Factory were inserted into EDGAR. In the Avon case, the stock ran up 10 points. Does the SEC own up to these? Well, kind of…

“As another example, our Division of Enforcement has investigated and filed cases against individuals who we allege placed fake SEC filings on our EDGAR system in an effort to profit from the resulting market movements.”

OK, so EDGAR is a 30-year-old piece of Swiss cheese riddled with potential attack surfaces, some by design, others from simply not keeping current on penetration testing. What about the SEC’s physical assets?

“For example, a 2014 internal review by the SEC’s Office of Inspector General (“OIG”), an independent office within the agency, found that certain SEC laptops that may have contained nonpublic information could not be located.”

All the above quotes were from the Wednesday SEC Statement; a 2016 GAO report on the SEC, however, stated that the agency:

“…wasn’t always using encryption, supported software, well-tuned firewalls, and other key security tools while going about its business.”

Banking, in fact our financial market structure as a whole, is based on a singular concept: TRUST. The SEC was created in 1934, in the wake of the Great Depression, as a way to restore trust in the markets. Technology-savvy individuals will always attempt to exploit this trust for their own gain; it’s part of how the game is played. In our financial system the SEC plays the role of the gambling commission, ensuring that the players, dealers, pit bosses, and the house are all working from the same set of published public rules. To his credit, Chairman Clayton is working within the system in an attempt to shine daylight on an agency in trouble and out of touch with the technology driving the markets it’s charged with regulating. Today it is possible to trade a stock based on a tick (a signal that something moved) within 150 billionths of a second, yet it takes the SEC 1.2 million seconds (14 days) to report a serious breach of its own security to law enforcement. Clearly, work remains to be done.


R.I.P. TCP Offload Engine NICs (TOEs)

Solarflare Delivers Smart NICs for the Masses: Software Definable, Ultra-Scalable, Full Network Telemetry with Built-in Firewall for True Application Segmentation, Standard Ethernet TCP/UDP Compliant

As this blog post by Michael C. Bazarewsky states, Microsoft quietly pulled support for TCP Chimney in its Windows 10 operating system. Chimney was an architecture for offloading the state and responsibility of a TCP connection to a NIC that supported it. The piece cited numerous technical issues and lack of adoption, and Michael’s analysis hits the nail on the head. Goodbye TOE NICs.

During the early years of this millennium, Silicon Valley venture capitalists dumped hundreds of millions of dollars into start-ups that would deliver the next generation of network interface cards at 10Gb/sec using TCP offload engines. Many of these companies failed under the weight of trying to develop expensive, complicated silicon that just did not work. Others received a big surprise in 2005 when Microsoft settled with Alacritech over patents Alacritech held describing Microsoft’s Chimney architecture. In a cross-license arrangement with Microsoft and Broadcom, Alacritech received many tens of millions of dollars in licensing fees, and would later collect tens of millions more from nearly every other NIC vendor implementing a TOE in their design. At the time, Broadcom was desperate to pave the way for its acquisition of Israeli-based Siloquent. Given server OEM pressure, the settlement was a small price to pay for the certain business Broadcom would garner from sales of the Siloquent device. At 1Gb/sec, Broadcom owned an astounding 100% of the server LAN-on-Motherboard (LOM) market, and yet its position was threatened by the onslaught of new, well-funded 10Gb start-ups.

In fact, the feature list for new “Ethernet” enhancements got so full of great ideas that most vendors’ designs relied on a complex “sea of cores” promising extreme flexibility that ultimately proved very difficult to qualify at the server OEMs. Any minor change to one code set would cause the entire design to fail in ways that were extremely difficult to debug, not to mention being miserably poor in performance. Most notably, NetXen, another 10Gb TOE NIC vendor, quickly failed after winning major design-ins at the three big OEMs, ultimately ending in a fire sale to QLogic. Emulex saw the same pot of gold in its acquisition of ServerEngines.

The next impetus was a move by Cisco to introduce Fibre Channel over Ethernet (FCoE) as a standard to converge networking and storage traffic. Cisco let QLogic and Emulex (Q & E) inside the tent before its Unified Computing System (UCS) server introduction. But the setup took some time: it required a new set of Ethernet standards, now more commonly known as Data Center Bridging (DCB). DCB was a set of physical-layer requirements that attempted to emulate the reliability of TCP by injecting wire protocols that would allow “lossless” transmission of packets. What a break for Q & E! Given the duopoly’s control over the Fibre Channel market, this would surely put both companies in the pole position to take over the Ethernet NIC market. Even Broadcom spent untold millions to develop a Fibre Channel driver that would run on its NIC.

Q & E quickly released what many called the “Frankenstein NIC,” a kluge of Application-Specific Integrated Circuits (ASICs) designed to get a product to market even while both companies struggled to develop a single ASIC, a skill at which neither excelled. Barely achieving its targeted functionality, no design saw much traction. Through all of our customer interactions (over 1,650), we could find only one that had implemented FCoE. This large bank has since retracted its support for FCoE and, in fact, showed a presentation slide several years ago stating it was “moving from FCoE to Ethernet,” an acknowledgment that FCoE was indeed NOT Ethernet.

In conjunction with TOEs, industry pundits believed that RDMA (Remote Direct Memory Access) was another feature required to reduce latency, and not just for High-Frequency Trading (HFT); it was a further acknowledgment that lowering latency was critical to hyper-scale cloud, big data, and storage architectures. However, once again, while intellectually stimulating, using RDMA in any environment proved to be complex and simply not compatible with customers’ applications or existing infrastructures.

The latest RDMA push is to position it as the underlying fabric for Non-Volatile Memory Express (NVMeF). Why? Flash has already reduced the latency of storage access by an order of magnitude, and the next generation of flash devices will reduce latency and increase capacity even further. Whenever there’s a step function in the performance of a particular block of computer architecture, developers come up with new ways to use that capability to drive efficiencies and introduce new, and more interesting applications. Much like Moore’s Law, rotating magnetic memory is on its last legs. Several of our most significant customers have already stopped buying rotating memory in favor of Flash SSDs.

Well… here we go again. RDMA is NOT Ethernet. Despite the “fake news” about running RDMA, RoCE and iWARP on Ethernet, the largest cloud companies, and our large financial services customers have declared that they cannot and will not implement NVMeF using RDMA. It just doesn’t fit in their infrastructures or applications. They want low-latency standard Ethernet.

Since our company’s beginning, we’ve never implemented TOEs, RDMA, FCoE, or any of the other great and technically sound ideas for changing Ethernet. Sticking to our guns, we decided to go directly to the market and create the pull for our products. The first market to embrace our approach was High-Frequency Trading (HFT). Over 99% of the world’s volume of electronic trading, in all instruments, runs on our company’s NICs. Why? Customers could test and run our NICs without any application modifications or changes to their infrastructure and realize enormous benefits in latency, jitter, message rate, and robustness… it’s standard Ethernet, and our kernel bypass software has become the industry’s de facto standard.

It’s not that there isn’t room for innovation in server networking; it’s that you have to consider the customer’s ability to adapt to and manage that change in a way that isn’t disruptive to their infrastructure while, at the same time, delivering highly valued capabilities.

If companies are looking for innovation in server networking, they need to look for one that can provide the following:

  • Best-in-class PTP synchronization
  • Ultra-high resolution time stamps for every packet at every line rate
  • A method for lossless, unobtrusive, packet capture and analysis
  • Significant performance improvement in NGINX and LXC Containers
  • A firewall NIC and Application Micro-Segmentation that can control every app, VM, or container with unique security profiles
  • Real, extensive Software Definable Networking (SDN) without agents

In summary, while it’s taken a long time for the industry to overcome its inertia, logic eventually prevailed. Today, companies can benefit from innovations in silicon and software architecture that are in deployment and have been validated by the market. Innovative approaches such as neural-scale networking, designed to meet the high-bandwidth, ultra-low-latency, hardware-based security, telemetry, and massive connectivity needs of ultra-scale computing, are likely the only strategy for achieving a next-generation cloud and datacenter architecture that can scale, be easily managed, and, perhaps most importantly, be secured.

— Russell Stern, CEO Solarflare


Will AI Restore HFT Profitability?

Several years ago Michael Lewis wrote “Flash Boys: A Wall Street Revolt” and created a populist backlash that rocked the world of High-Frequency Trading (HFT). Lewis had publicly pulled back the curtain and shared his perspective on how our financial markets worked. Right or wrong, he sensationalized an industry that most people didn’t understand, and yet one in which nearly all of us have invested our life savings.

Several months after Flash Boys, Peter Kovac, a trader with EWT, authored “Flash Boys: Not So Fast,” in which he refuted a number of the sensationalistic claims in Lewis’s book. While I’ve not yet read Kovac’s book (it’s behind “Dark Pools: The Rise of Machine Traders and the Rigging of the U.S. Stock Market” in my reading list), from the reviews it appears he did an excellent job of covering the other side of this argument.

All that aside, earlier today I came across an article by Joe Parsons in “The Trade” titled “HFT: Not so flashy anymore.” Parsons makes several very salient points:

  • Making money in HFT today is no longer as easy as it once was. As a result, big players are rapidly gobbling up little ones; for example, in April Virtu picked up KCG, and Quantlab recently acquired Teza Technologies.
  • Speed is no longer the primary differentiator between HFT shops. As someone in the speed business, I can tell you that today one microsecond for a network 1/2 round trip in trading is table stakes, and some shops are deploying leading-edge technologies that can reduce this to under 200 nanoseconds. Speed as an HFT shop’s only value proposition hasn’t been viable for a rather long time; this is one of the key assertions made five years ago in the “Dark Pools” book mentioned above.
  • MiFID II is having a greater impact on HFT than most people are aware. It brings with it new and more stringent regulations governing dark pools and block trades.
  • Perhaps the most important point Parsons makes is that the shift today is away from arbitrage as a primary revenue source and toward analytics and artificial intelligence (AI). He downplays this point by offering it up as a possible future.

AI is the future of HFT as shops look for new and clever ways to push intelligence as close as possible to the exchanges. Companies like Xilinx, the leader in FPGAs, Nvidia with their P40 GPU, and Google, yes Google, with their Tensor Processing Unit (TPU) will define the hardware roadmap moving forward. How this will all unfold is still to be determined, but it will be one wild ride.


Near-Real-Time Analytics for HFT

Artificial Intelligence (AI) advances are finally progressing along a geometric curve thanks to cutting-edge technologies like Google’s new Tensor Processing Unit (TPU) and Nvidia’s latest Tesla V100 GPU platform. Couple these with updated products like Xilinx FPGAs such as the Kintex, refreshed 32-core Intel Purley CPUs, and advances in storage such as NVMe appliances from companies like X-IO, and computing has never been so exciting! Near-real-time analytics for High-Frequency Trading (HFT) is now possible. This topic will be thoroughly discussed at the upcoming STAC Summit in NYC this coming Monday, June 5th. Please consider joining Hollis Beall, Director of Performance Engineering at X-IO, at the 1 PM STAC Summit panel discussion titled “The Brave New World of Big I/O!” If you can’t make it or wish to get some background, there is Bill Miller’s blog post titled “Big Data Analytics: From Sometime Later to Real-Time,” where he tips his hand at where Hollis will be heading.


Stratus and Solarflare for Capital Markets and Exchanges

by David Whitney, Director of Global Financial Services, Stratus

The partnership of Stratus, the global standard for fault-tolerant hardware solutions, and Solarflare, the unchallenged leader in application network acceleration for financial services, at face value seems like an odd one. Stratus “always on” server technology removes all single points of failure, which eliminates the need to write and maintain costly code to handle high availability and fast failover scenarios. Stratus and high performance have rarely been used in the same sentence.

Let’s go back further. Throughout the 1980s and ’90s Stratus, with their proprietary VOSS operating system, globally dominated financial services from exchanges to investment banks. In those days the priority for trading infrastructures was uptime, provided by resilient hardware and software architectures. With the advent of electronic trading, the needs of today’s capital markets have shifted: High-Frequency Trading (HFT) has caused an explosion in transactional volumes. Driven by the requirements of one of the largest stock exchanges in the world, Stratus realized that critical applications need to be not only highly available but also extremely focused on performance (low latency) and deterministic (zero jitter) behavior.

Stratus provides a solution that guarantees availability in mission-critical trading systems without the costly overhead associated with today’s software-based High Availability (HA) solutions or the need for multiple physical servers. You could conceivably cut your server footprint in half by using a single Stratus server where before you’d need at least two physical machines. Stratus is also a “drop and go” solution: no custom code needs to be written, and there is no concept of applications built specifically for Stratus FT. This isn’t just for Linux environments; Stratus also has hardened OS solutions for Windows and VMware.

Solarflare brings low-latency networking to the relationship with their custom Ethernet controller ASIC and the Onload Linux operating-system-bypass communications stack. Normally network traffic arrives at the server’s network interface card (NIC) and is passed to the operating system through the host CPU. This process involves copying the network data several times and switching the CPU’s context from kernel to user mode one or more times. All of these events take both time and CPU cycles. With over a decade of R&D, Solarflare has considerably shortened this path. Under Solarflare’s control, applications often receive data in about 20% of the time it would typically take. The savings is measured in microseconds (millionths of a second), typically several or more. In trading, speed often matters most, so a dollar value can be placed on this savings: back in 2010, one trader valued it at $50,000 per microsecond for each day of trading.

Both Stratus and Solarflare have worked together to reduce jitter to nearly zero. Jitter is caused by those seemingly inevitable events that distract a CPU core from its primary task of electronic trading. For example, a thermal sensor somewhere in the system may exceed a predetermined level and raise a system interrupt. A CPU core is then assigned to handle that interrupt and determine which fan needs to be turned on or sped up. While this event sounds trivial, the distraction of processing the interrupt and returning to trading often results in a delay measured in hundreds of microseconds. Imagine your trading strategy normally executes in tens of microseconds, network latency adds 1-2 microseconds, and then all of a sudden the system pauses your trading algorithm for 250 microseconds while it does some housekeeping. By the time control is returned to your algorithm, it’s possible that the value of what you’re trading has changed. Both Stratus and Solarflare have worked exceedingly hard to remove jitter from the FT platform.
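For the curious, a rough user-space way to watch this effect, loosely inspired by the description above and not any vendor’s methodology, is to spin in a tight loop timestamping every iteration; any gap far above the loop’s normal cadence is time the core spent elsewhere (interrupts, housekeeping). The threshold below is an arbitrary assumption.

```python
import time

def jitter_gaps(iterations=200_000, threshold_ns=50_000):
    """Collect loop stalls longer than threshold_ns, in nanoseconds.

    A tight loop should advance by well under a microsecond per pass;
    anything above the threshold means the core was taken away from us.
    """
    prev = time.perf_counter_ns()
    gaps = []
    for _ in range(iterations):
        now = time.perf_counter_ns()
        if now - prev > threshold_ns:   # the loop was stalled by something
            gaps.append(now - prev)
        prev = now
    return gaps
```

On a busy desktop this typically reports a scattering of stalls in the tens to hundreds of microseconds, exactly the class of pause the FT platform is engineered to eliminate.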

Going forward, Solarflare and Stratus will be adding Precision Time Protocol support to a new version of Onload for the Stratus FT Platform.


99.99999% Available + 2.7us = 1 Awesome Computer

What do you get when you put together a pair of dual-socket servers running in hardware lock-step with a pair of leading-edge, ultra-low-latency OS-bypass network adapters, all running Red Hat Enterprise Linux? One awesome 24-core system that boasts 99.99999% uptime, zero jitter, 2.7 microseconds of 1/2 round trip UDP latency, and 2.9 microseconds for TCP.

How is this possible? First, we’ll cover what Stratus Technologies has done with Lock-Step, and how it makes the ftServer dramatically different than all others. Then we’ll explain what jitter is, and why removing it is so critical for deterministic systems like financial trading. Finally, we’ll cover these impressive Solarflare ultra-low latency numbers, and what they really mean.

We’ve all bought something with a credit card, flown through Chicago O’Hare, used public utilities, and possibly even called 9-1-1. What you may not know is that very often at the heart of each of these systems is a Stratus server. Stratus should adopt the old Timex slogan “It takes a licking and keeps on ticking,” because that’s what it means to provide 99.99999% uptime: you’re allowed three seconds a year for unplanned outages. Three seconds is how long it takes me to say “99.99999% uptime.” How is this possible? Imagine running a three-legged race with a friend. If you each compared your actions continuously with every step, you could run the race at the pace of the slower of the two of you. This is the key concept behind lock-step: comparing, then knowing what to do as one partner starts to stumble, ensuring the team continues moving forward no matter what happens. Stratus leverages the latest 12-core Intel Haswell E5-2670 v3 server processors with support for up to 512GB of DDR4. If any hardware component in the server fails, the system as a whole continues moving forward and alerts an admin, who replaces the failed component; then that subsystem is brought back online. I challenge you to find another computer in your life that has ever offered that level of availability over the typical 5-7 year lifecycle that Stratus servers often see.

So what is jitter? When a computer core becomes distracted from its primary task to go off and do some routine housekeeping (operating-system or hardware driven), the impact of that temporary distraction is known as jitter. With normal computing tasks, jitter is hardly noticeable; it’s the computer equivalent of background noise. With certain VERY time-critical computing tasks, though, like financial trading, even one jitter event can be devastating. Suppose your server’s primary function is financial trading, and it receives a signal from market A that someone wants to buy IBM at $100, while on market B it sees a second signal that another entity wishes to sell IBM at $99. So the trading algorithm on your server buys the stock on B for $99, but the instant it has confirmation of the purchase a thermal sensor in the server generates an interrupt. The CPU core that is running your trading algorithm goes off to service that interrupt, which results in it running some code to determine which fan to turn on. Eventually, say a millisecond or so later, control is returned to your trading algorithm, but by then the buyer on market A is gone, and the price of IBM has fallen to $99. That’s the impact of jitter: brief, often totally random moments in the trading day stolen to do basic housekeeping. These stolen moments can quickly add up for traders, and for exchanges they can be devastating. Imagine a delayed order as a result of jitter missing an opportunity! Stratus Technologies has crawled through their server architecture and eliminated all potential sources of jitter. Traders and exchanges using other platforms have to do all this by hand, and it is still as much art as science. That’s one reason why over 1,400 different customers regularly depend on Solarflare.

Finally, there’s ultra-low latency networking via generic TCP/IP and UDP networking. In the diagram below network latency is in blue. Market data arrives via UDP and orders are placed through the more reliable TCP/IP protocol. Here is a quick anatomy of part of the trading process showing one UDP receive and one TCP send. There are other components, but this is a distilled example.

Initially the packet is received from the wire (the light blue block) and passes through the physical interface, where electrical signals are converted to layer-2 logical bits. From there the packet is passed to the on-chip layer-2 switch, which steers it to one of 2,048 virtual NIC (vNIC) instances, also on the chip. The vNIC then uses DMA to transfer the packet into system memory; all of this takes 500 nanoseconds. The packet has now left the network adapter and is on its way to a communications stack somewhere in system memory, the dark blue box. Here is where Solarflare shines. In the top timeline, the dark blue box represents their host kernel device driver and the Linux communications stack. Solarflare’s kernel device driver is arguably one of the fastest in the industry, but most of this dark blue box is time spent working with the kernel. There are CPU task switches and several memory copies of the packet as it moves through the system, and thousands of CPU instructions are executed; all told this can be nearly 3,000 nanoseconds. In the bottom timeline, the packet is DMA’d directly into user space, where Solarflare’s very tight user-space stack sits. Here the packet is quickly processed and handed off to the end-user application via the traditional sockets interface, all without additional data copies or CPU task switches, and completed in just under 1,000 nanoseconds, a savings of about 2,000 nanoseconds, or roughly 4,600 CPU instructions for this processor at this speed. All this, and we’ve just received a packet into our application, represented by the green blocks.
So in the two bars above, the first represents market data coming in via Solarflare’s generic kernel device driver, then going through the normal Linux stack until the packet is handed off to the application. The response packet, in this case a trade via TCP, is sent back through the stack to the network adapter and eventually put on the wire; all told, just over 9,000 nanoseconds. With Stratus and Solarflare, the second bar shows the latency of the same transaction traveling through Solarflare’s OS-bypass stack in both directions; the difference is that the transaction hits the exchange over 4,000 nanoseconds sooner. This means you can trade at nearly twice the speed, a true competitive advantage. Now, four millionths of a second isn’t something humans can easily grasp, so let’s jump to light speed: four microseconds is how long it takes a photon of light to cover nearly a mile.
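The arithmetic behind those two bars, using the post’s approximate figures, works out as follows (the instructions-per-nanosecond rate is inferred from the 2,000 ns ≈ 4,600 instruction claim, so treat all of it as back-of-envelope):

```python
# Per-direction stack times from the timeline description above (approximate).
KERNEL_STACK_NS = 3_000   # kernel driver + Linux stack, per direction
BYPASS_STACK_NS = 1_000   # Onload user-space stack, per direction
INSTR_PER_NS = 2.3        # instruction rate implied by the post's figures

per_direction_saving = KERNEL_STACK_NS - BYPASS_STACK_NS
round_trip_saving = 2 * per_direction_saving   # one receive plus one send

print(per_direction_saving)                         # 2000 ns saved on the receive
print(round(per_direction_saving * INSTR_PER_NS))   # 4600 CPU instructions
print(round_trip_saving)                            # 4000 ns sooner to the exchange
```

The roughly 4,000 ns round-trip saving against a roughly 9,000 ns kernel-path total is where the “nearly twice the speed” claim comes from.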
So if you’re looking to build a financial trading system with ultra-high availability, zero jitter, and extreme network performance, you have only one choice: Stratus’s new ftServer.


1.44 us Full Round Trip Latency, Unlikely

Tuesday morning one of the guys on my team woke me with a text stating that a competitor was claiming 1.44 microseconds for a full round trip (RT) using UDP. Two things about this immediately struck me as strange: first, it was reported as a full round trip number, and second, the number (excluding units) was oddly close to what I’d thought the theoretical 1/2 RT limit might be. You see, in the ultra-low-latency, high-frequency trading market, time is everything. One need only be a few nanoseconds faster than the competition to win the lion’s share of the business. So speed is everything, but in the end, physics sets the speed limit.

In an ideal world if one were to measure the time required for a UDP packet to enter a network server adapter, traverse the Ethernet controller chip, travel the host PCIe bus, through the Intel CPU complex and finally end up in memory they’d find that this journey was roughly 730 nanoseconds. Now it should be noted that this varies across Intel server families & clock rates. We could be off by as much as +/- 100 nanoseconds, measuring at this level is pretty challenging, but 730 nanoseconds is a reasonable number to start with. Also, it should be noted that this is with Solarflare’s current 7000 series Ethernet Controller ASIC.

Breaking this down further, the most expensive part of this trip is the 500 nanoseconds or so the UDP packet will spend in Solarflare’s Ethernet controller chip. This chip is arguably the most popular low-latency Ethernet controller ASIC on the market today; it includes a high-performance PHY layer, an L2 switch, and built-in PCIe controller logic, so everything happens within this single chip. Over 1,000 financial trading firms rely on this technology daily; most of the world’s financial exchanges and nearly all of their high-performance customers depend on Solarflare, and as such Solarflare has turned all the dials possible to squeeze out every available nanosecond. Add to this 150 nanoseconds for the packet to travel across the PCIe bus, using DMA to cache via DDIO (not RAM), and finally another 80 nanoseconds or so to store it in RAM, making the final total 730 nanoseconds to receive a packet to memory. Again, your mileage will vary considerably, so please use these numbers only as rough reference points. For a 1/2 RT you’ll need to double this number (a receive plus a send), which brings the 1/2 RT total to 1,460 nanoseconds, or 1.46 microseconds. It should also be noted that receives and sends have different costs; sends often consume less time, so again your numbers will vary, and this figure should in fact be smaller. That’s Solarflare physics. Solarflare has a new 8000-series Ethernet controller ASIC coming out soon which will further trim the 500 nanoseconds spent in the ASIC, but by exactly how much is still a closely guarded secret.

So is 1.44 microseconds possible today for a conventional full round trip (through to user space, versus done completely in an FPGA)? Well, the PCIe and memory components of this total come to 920 nanoseconds (150 nanoseconds for the PCIe bus plus 80 nanoseconds for CPU to memory, both multiplied by four to cover a full round trip). That leaves 520 nanoseconds to traverse the Ethernet Controller logic four times, or 130 nanoseconds per pass. Considering that the most popular low-latency Ethernet controller chip on the planet requires 500 nanoseconds, doing it in 130 nanoseconds with the same degree of utility is highly unlikely.
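As a sanity check, the budget above can be tallied in a few lines (the figures are the rough estimates quoted in the text, not measurements, and carry the same +/- 100 nanosecond caveat):

```python
# Rough latency budget from the text, all figures in nanoseconds.
ASIC_NS = 500   # time spent inside the Solarflare 7000-series Ethernet controller
PCIE_NS = 150   # DMA across the PCIe bus to cache via DDIO
MEM_NS = 80     # cache to RAM

rx_path = ASIC_NS + PCIE_NS + MEM_NS
print(f"Receive path: {rx_path} ns")               # 730 ns

half_rt = 2 * rx_path                              # one receive plus one send
print(f"1/2RT estimate: {half_rt} ns")             # 1,460 ns (sends are usually cheaper)

# Feasibility of the claimed 1.44 us conventional full round trip:
claim_ns = 1440
bus_and_memory = 4 * (PCIE_NS + MEM_NS)            # four PCIe/memory crossings = 920 ns
asic_budget = claim_ns - bus_and_memory            # 520 ns left for four ASIC traversals
print(f"Per-pass ASIC budget: {asic_budget // 4} ns")  # 130 ns per pass
```

Stacked against the 500 nanoseconds the fastest shipping ASIC actually needs per pass, the 130 nanosecond budget implied by the claim is where the math falls apart.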

On checking this competitor's data sheet for the product, we found that they document 1.82 microseconds for a UDP 1/2RT using 64-byte packets. Compare this to the 1.44 microseconds they claimed verbally for a full round trip, and one can see that they've significantly stretched the truth. If it sounds too good to be true, it probably is…


Stock Trading in 300ns: Low Latency Redefined

Henry Ford is quoted as having once said, “If I had asked people what they wanted, they would have said faster horses.” Not all innovations are as groundbreaking as the automobile, but when one approaches an old problem with both a new strategy and improved technology, great things can happen. In June Solarflare released an update to OpenOnload (OOL) that introduced TCP Delegated Send, and in late September they will begin shipping the latest version of their Application Onload Engine (AOE) with an advanced FPGA. The combination of these two makes it possible to turn around a TCP Send in 300ns, compared to roughly 1,700ns today. In latency-focused applications a savings of 1,400ns, an 82% improvement, is game-changing. To understand how Solarflare pulls this off, let's look at a much simpler example.

My son uses an online exchange to trade Magic cards, and traders are rated on how fast they fill orders, not much different from NASDAQ processing an order for Apple stock. When my son started, he would receive an order on his computer and search through a heap of cards to find the one necessary to fill that order. He would then go down several flights of stairs to my office to fetch an envelope and a stamp, then head back up to his computer. Next, he would address the envelope, apply the stamp, run back down the stairs, and walk the completed trade to the mailbox. Today he has a cache of pre-stamped envelopes with the return addresses pre-written sitting beside his computer. All his cards are in a binder with an updated index. Filling a trade is now a trivial matter: he simply checks the index, pulls the required card from the binder, updates the index, stuffs the card in an envelope, writes the final address on the front, and runs it out to the mailbox. Essentially, this is a Delegated Send: everything that can be preprocessed in advance of the actual trade is prefetched and prepackaged.

When it comes to TCP and Delegated Send, at the start of the trading day the trading application, through OOL, establishes a TCP connection with the exchange. The trading application then calls a routine in OOL to take over control of the socket's send path and to obtain the Ethernet, IP, and TCP headers for the connection. The application adds a message template to these headers and passes the resulting packet template to the FPGA, where it remains cached, much like my son's stack of pre-stamped envelopes. When an incoming packet arriving at the FPGA causes the RTL trading code to trigger a trade, the trade is inserted into the pre-formatted packet, the checksum is computed, and the packet is transferred to the exchange. The whole process takes approximately 300ns. When the ACK arrives from the exchange, it is passed transparently back to the trading application through OOL. Now some will point out that other FPGA solutions exist today that may enable you to trade at these speeds, but do any of them make it this simple? With some minor modifications to your existing trading application you can quickly take advantage of Delegated Send with the AOE; no other FPGA solution even comes close!
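The template-and-patch idea can be sketched in a few lines of Python. This is purely illustrative: the header bytes, message layout, and function names below are invented for this sketch, the real fast path runs in FPGA logic via OOL's delegated-send API, and a real TCP checksum also covers an IP pseudo-header, which is omitted here for brevity.

```python
import struct

def ones_complement_sum(data: bytes) -> int:
    """Simplified 16-bit ones'-complement checksum, as used by IP/TCP
    (omits the pseudo-header a real TCP checksum requires)."""
    if len(data) % 2:
        data += b"\x00"
    total = sum(struct.unpack(f"!{len(data) // 2}H", data))
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return (~total) & 0xFFFF

# Session start: the app obtains the connection's wire headers (stand-in zeros
# here; a real system gets live header values from the delegated socket) and
# appends a message template with blanks to be patched at trade time.
HEADERS = b"\x00" * 54                         # Ethernet (14) + IP (20) + TCP (20)
ORDER_TEMPLATE = b"BUY |SYM=????|QTY=000000"   # hypothetical order layout
template = bytearray(HEADERS + ORDER_TEMPLATE) # the cached "pre-stamped envelope"

def fire_trade(symbol: bytes, qty: int) -> bytes:
    """On a trigger, patch the cached template and finish the checksum --
    the only work left on the critical path."""
    pkt = bytearray(template)
    pkt[54 + 9:54 + 13] = symbol.ljust(4)[:4]          # fill SYM field
    pkt[54 + 18:54 + 24] = b"%06d" % qty               # fill QTY field
    csum = ones_complement_sum(bytes(pkt[14 + 20:]))   # checksum the TCP segment
    struct.pack_into("!H", pkt, 14 + 20 + 16, csum)    # TCP checksum field offset
    return bytes(pkt)

wire_packet = fire_trade(b"AAPL", 100)
```

Everything above `fire_trade` happens once per session; only the patching and checksum happen per trade, which is why the critical path collapses to a few hundred nanoseconds in hardware.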

So if latency in trading is important to you, and you'd like your orders moving along 1,000ns faster, then perhaps it's time to take a serious look at Delegated Send on Solarflare's AOE. To learn more, please consider checking out this whitepaper.

For those already familiar with Solarflare’s AOE, this new version of the product has several very substantial improvements:

  • It leverages the latest Solarflare high-performance ASIC with a PCIe Gen3 interface.
  • Flexible, open choice of FPGA PCS/MAC.
  • All FDK modules, and examples delivered with full source code.
  • Sample implementations of 10GbE & 40GbE pass-through (requires PCS/MAC).
  • Sample implementations of all four 1600MHz DDR3 AOE memory channels.
For additional information please send me an email.


Rise of Heterogeneous Systems Architectures, and the Role of APUs

In his talk “The Race to Zero” last week at Flagg Management’s HPC for Wall Street show, Dr. Greg Rodgers of AMD Research discussed the rise of highly integrated Heterogeneous Systems Architectures (HSA). For the past six years I’ve exhibited at both of Russell Flagg’s annual shows, and during that time I’ve seen many different approaches to reducing latency and improving performance for the High Frequency Trading (HFT) market. Many companies have pitched custom FPGA solutions, GPUs, HPC RISC implementations, and ultra-dense Intel solutions, but not until this talk had I heard anything that was truly innovative. In his brief 15-minute session Dr. Rodgers proposed a heterogeneous architecture for addressing a wider range of computational problems by tightly integrating several different processing models onto the same chip; that integration is the innovation. The concept of a heterogeneous computing environment is not new; in fact it’s been around for at least two decades. While working at NEC in 2004, one of my colleagues at our US Research division demonstrated a new product that loosely coupled several different computing resource pools together. That way, jobs submitted with the tool could easily and efficiently be parceled out to leverage both scalar clusters and massively parallel systems (the Earth Simulator) without having to be broken up and submitted individually to specific systems. What Dr. Rodgers is proposing is a much higher level of integration on the same chip.

If this were anyone else I might have easily written off the concept as an intellectual exercise that would never see the light of day, but this was Greg Rodgers. I’ve known Greg for nearly eight years, and when we first met he was carrying around a pre-announced IBM JS21 PowerPC blade server under his arm between booths at SuperComputing 2005. He was evangelizing the need to build huge clusters using the latest in IBM’s arsenal of PowerPC workhorse chips in an ultra-dense form factor. Greg has built many large clusters during his career, and when he believes in an approach it will eventually be implemented in a very large cluster. It may end up at the Department of Energy, or a University or other Government lab, but it will happen.

AMD is currently producing an ultra-dense cluster in a box with their SeaMicro SM15000-OP. This is a 10U enclosure that houses 512 cores, each 64-bit x86, at 2.0/2.3/2.8 GHz. To reach 512 cores they use 64 sockets, each housing a new octal-core Opteron. Each socket supports 64GB, for a total of 4TB of system memory. AMD also provides 10GbE to each socket internally and exposes 16 10GbE uplinks externally. This is a true HPC cluster in a box, but because it’s all x86 cores it’s designed for scalar workloads. What Greg is proposing is to shift this architecture from pure x86 to “Accelerated Processing Units” (APUs) that marry a GPU with two x86 cores, caches, and other I/O on the same die (chip). That way memory can be shared and data movement minimized. This would enable data-parallel workloads and serial/task-parallel workloads to coexist within the same chip and to share memory when appropriate. Furthermore, Greg has proposed the following HSA concepts:

  • A unified programming model that enables task parallel and data parallel workloads while also supporting sequential workloads.
  • A single unified virtual address space addressable by all compute cores with well-defined memory regions supporting both global & private access.
  • User level queuing between the “Latency Compute Unit” (LCU) and the “Throughput Compute Unit” (TCU) without system calls.
  • Preemption & context switching, extending context management to the GPU.
  • HSA Intermediate Language (HSAIL) to split compilation between the front end and the finalizer to improve optimization & portability.

Greg was actively promoting the HSA Foundation as a method for moving HPC for HFT forward. Furthermore, he discussed AMD Open 3.0, their next-generation open systems compute platform. Here is a link to Greg’s slides. It will be interesting to see how this approach plays within the market, especially at the SC13 show in November.
