Artificial Intelligence (AI) advances are finally progressing along a geometric curve thanks to cutting-edge technologies like Google’s new Tensor Processing Unit (TPU) and NVIDIA’s latest Tesla V100 GPU platform. Couple these with updated products like Xilinx Kintex FPGAs, refreshed 32-core Intel Purley CPUs, and advances in storage such as NVMe appliances from companies like X-IO, and computing has never been so exciting! Near-real-time analytics for High-Frequency Trading (HFT) is now possible. This topic will be thoroughly discussed at the upcoming STAC Summit in NYC this coming Monday, June 5th. Please consider joining Hollis Beall, Director of Performance Engineering at X-IO, at the 1 PM STAC Summit panel discussion titled “The Brave New World of Big I/O!” If you can’t make it, or wish to get some background, Bill Miller’s blog post “Big Data Analytics: From Sometime Later to Real-Time” tips his hand at where Hollis will be heading.
Recently someone suggested that I watch a rather informative video about how Microsoft Research had attempted to make RDMA over Converged Ethernet (RoCE) lossless. Unbelievably, this video exposes and documents several serious flaws in the design of RoCE. Also, it appears they’ve replaced the word “Converged” with “Commodity,” to soften the message that RoCE doesn’t require anything special to run on regular old Ethernet. Here are the four points I took away from the video; please let me know your take:
- RDMA Livelock: This is a simple retransmission problem. Since RDMA was architected for a lossless, deterministic local bus, accommodations were never made for dropped packets; they just didn’t happen on a bus. Ethernet, on the other hand, was designed to expect loss (remember vampire taps?). Livelock occurs when a message composed of multiple packets experiences a dropped packet somewhere in the middle. At this point, RDMA has to start over from the first packet and retransmit the whole message. If this were a multi-megabyte frame of video, this retransmit approach would livelock a network. So what was Microsoft’s solution? Rewrite the RDMA stack’s retransmit logic to retransmit only on drop detection (this is what TCP does). Good luck; who’s got this action item?
- Priority Flow Control (PFC) Deadlock: This happens when switches encounter incomplete ARP packets. Microsoft’s solution is a call for more research, and to filter incomplete ARP packets. More to-dos, and this one is on all the switch vendors.
- NIC PFC Storm: It seems that the firmware in some RoCE NICs has bugs that create pause-frame storms. Beyond NIC vendors fixing those bugs, they also suggest that NIC and switch vendors add software to detect oncoming storms and shut them down. Great idea, another to-do for the anonymous NIC and switch providers.
- Slow Receiver NICs: These generate excessive pause frames because of a weak RDMA architecture that relies on second-level, host-based translation tables to fetch the destination memory address. Oh my god, this is how you design an HPC NIC? Seriously, how cheap can you be? Make the lookup tables bigger; Myricom addressed this problem back in the 1990s. It appears that on some RoCE NICs it’s not hard for the NIC to have so many receivers of kernel-bypassed packets that it must go off-NIC for destination memory address lookups.
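The livelock point above can be made concrete with a toy model. This is a hypothetical back-of-envelope sketch (the function names and packet counts are mine, not from the video) contrasting RDMA’s restart-from-the-beginning retransmit with TCP-style selective retransmit:

```python
# Hypothetical model: packets put on the wire to deliver one message
# when exactly one packet is dropped mid-message.

def go_back_to_start(total_packets, drop_index):
    """Packets sent when one drop forces a restart from packet 0 (RDMA-style)."""
    # First attempt delivers packets 0..drop_index, then the whole message resends.
    return (drop_index + 1) + total_packets

def selective_retransmit(total_packets, drop_index):
    """Packets sent when only the dropped packet is resent (TCP-style)."""
    return total_packets + 1

# A 4 MB video frame split into 1 KB packets, with one drop halfway through:
packets = 4096
assert go_back_to_start(packets, packets // 2) == 6145    # ~50% extra traffic
assert selective_retransmit(packets, packets // 2) == 4097
```

With repeated drops the go-back-to-start cost compounds on every pass, which is exactly how a busy network ends up livelocked.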
As the speaker closes out the discussion, he says, “This experiment shows that even with RDMA, low latency and high throughput cannot be achieved at the same time, as network congestion can cause queues to build up in the network.” Anyone who has done this for a while knows that low latency and high bandwidth are mutually exclusive. That’s why High-Performance Computing (HPC) benchmarks often start with zero-byte packets and then scale up, to demonstrate how latency increases proportionally with packet size.
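The zero-byte methodology works because, to first order, message latency is a fixed overhead plus a bandwidth term. A minimal illustration (the one-microsecond base and 10Gbps figures are assumptions for the example, not measurements of any particular NIC):

```python
def latency_us(payload_bytes, base_latency_us=1.0, bandwidth_gbps=10.0):
    """First-order latency model: fixed per-message cost plus serialization time.

    base_latency_us approximates the zero-byte (pure overhead) latency;
    the second term is the time to put the payload on the wire.
    """
    # bits divided by (Gbps expressed as bits per microsecond)
    serialization_us = payload_bytes * 8 / (bandwidth_gbps * 1000)
    return base_latency_us + serialization_us

# A zero-byte probe measures only the fixed overhead:
assert latency_us(0) == 1.0
# A full 1500-byte frame adds ~1.2us of serialization at 10GbE:
assert abs(latency_us(1500) - 2.2) < 0.01
```

Scaling the payload up from zero bytes traces out that line, which is why the benchmarks are run that way.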
All the above aside, this important question remains: why would anyone map a protocol like RDMA, which was designed for use on a lossless local bus, onto a switched network and think it would work? A local lossless bus is very deterministic, and it has requirements bound to its lossless nature and predictable performance. Conversely, Ethernet was designed from the beginning to expect and accommodate loss, and performance has always been secondary to packet delivery. That’s why Ethernet performance is non-deterministic. The resilience of Ethernet, not performance, was the primary design criterion DARPA had mandated to ensure our military’s network would remain functional at all costs.
Soon Solarflare will begin shipping ScaleOut Onload free with all their 8000-series NICs, some of which sell for under $300 USD. With ScaleOut Onload, TCP now has all the kernel-bypass tricks RDMA offers, but with all the benefits and compatibility of sockets-based TCP and no code changes. Furthermore, it delivers the performance of RDMA, but with much better reliability and availability than RoCE.
P.S. Mellanox just informed me that the NIC specific issues mentioned above were corrected some time ago in their ConnectX-4 series cards.
Effective June 2, 2017, the primary hosting source for the 40GbE.net (including 10, 25, and 50GbE.net) blog is moving from Blogger (Google) to WordPress. This is being done to facilitate better management of the content, and to add the capability to support a newly spun up Podcast called the Technology Evangelist. Hopefully, you’ll renew your subscription to this blog on WordPress.
I’d like to hear from those of you in the comments section who are deploying software in containers into production today. My interest is specifically in large deployments across a number of servers, and the issues you’re having with networking: people really using Kubernetes and Docker Swarm. Not those tinkering with containers on a single host, but DevOps folks who’ve suffered the real bruises and scrapes of setting up MACvlans, IPvlans, Calico, Flannel, Kuryr, Magnum, Weave, Contiv networking, etc…
Some will suggest I read the various mailing lists (check), join Slack channels (check), attend DockerCon (check), or even contribute to the projects they prefer (you really don’t want my code). I’m not looking for that sort of feedback, because in all those forums the problem I have, at my level of container networking experience, is separating the posers from the real doers. My hope is that those willing to suggest ideas can provide concrete examples of server-based container networking rough edges they’ve experienced, ones that, if improved, would make a significant difference for their company. If that’s you, then please comment publicly below, or use the private form to the right. Thank you for your time.
Everyone hates waiting, and as such we continually improve the performance of our technology to remove this waiting, often called latency. In markets like financial trading, this latency can be monetized. Several years ago a high-frequency trading shop told me that for them, a one-microsecond (millionth of a second) improvement translated to $60K per network port per day. How is that even possible? Well, if my stock market trade gets into the exchange before yours, I win and you lose; it’s that simple. Prior to May 2017, Solarflare had a network-latency solution for electronic markets that delivered 250 nanoseconds (billionths of a second) from tick to trade. The “tick” in “tick to trade” is the market data that arrives from the exchange via a network packet.
So when NASDAQ generates a stock market ticker signal indicating that IBM is now trading at $160/share, your order into NASDAQ could be placed in as little as 250 nanoseconds plus the time it takes your algorithm to decide to buy. The 250 nanoseconds is the time it takes for the market data, a UDP network packet, to be brought into the FPGA-based NIC, plus the time it takes to generate the order as a TCP packet and inject that order back into the exchange. To put this into perspective, 250 nanoseconds is the time required for a photon of light to travel 82 yards, less than a football field.
If that doesn’t sound fast enough, this week Solarflare, LDA Technology, and Xilinx announced LightSpeed TCP, which under the proper circumstances can reduce network latency for trades from 250 nanoseconds down to 120 nanoseconds. By contrast, 120 nanoseconds is the time required for light to travel 40 yards. So they’ve taken trading from a kickoff return to a few first downs.
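Those football-field figures are easy to sanity-check, since distance is just the speed of light multiplied by time:

```python
C_METERS_PER_SEC = 299_792_458   # speed of light in a vacuum
METERS_PER_YARD = 0.9144

def light_yards(nanoseconds):
    """Distance light travels in a vacuum over the given time, in yards."""
    return C_METERS_PER_SEC * nanoseconds * 1e-9 / METERS_PER_YARD

assert round(light_yards(250)) == 82   # tick to trade before LightSpeed TCP
assert round(light_yards(120)) == 39   # roughly the 40 yards quoted above
```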
Large Container Environments Need Connectivity for 1,000s of Micro-services
An epic migration is underway from hypervisors and virtual machines to containers and micro-services. The motivation is simple: there is far less overhead with containers, and the payback is huge. You get more apps per server as host operating systems, multiple guest operating systems, and hypervisors are replaced by a single operating system. Solarflare is seeking to advance the development of networking for containers. Our goal is to provide the best possible performance, with the highest degree of connectivity, and the easiest-to-deploy NICs for containers.
Solarflare’s first step in addressing the special networking requirements of containers is the delivery of the industry’s first Ethernet NIC with “ultra-scale connectivity.” This line of NICs has the ability to establish virtual connections from a container microservice to thousands of other containers and microservices. Ultra-scale network connectivity eliminates the performance penalty of vSwitch overhead, buffer copying, and Linux context switching. It gives application servers the capacity to provide each microservice with a dedicated network link. This ability to scale connectivity is critical to the success of deploying large container environments within a data center, across multiple data centers, and across multiple global regions.
Neural-Class Networks Require Ultra-Scale Connectivity
A “Neural Network” is a distributed, scale-out computing model that enables the AI deep learning which is emerging as the core of next-gen application software. Deep learning algorithms use huge neural networks, consisting of many layers of neurons (servers), to process massive amounts of data for instant facial and voice recognition, language translation, and hundreds of other AI applications.
“Neural-class“ networks are computing environments which may not be used for artificial intelligence, but share the same distributed scale-out architecture, and massive size. Neural-class networks can be found in the data centers of public cloud service providers, stock exchanges, large retailers, insurance providers, and carriers, to name a few. These neural-class networks need ultra-scale connectivity. For example, in a typical neural-class network, a single 80-inch rack houses 38 dual-processor servers, each server with 10 dual-threaded cores, for a total of 1,520 threads. In this example, in order for each thread to work together on a deep learning or trading algorithm without constant Linux context switching, virtual network connections are needed to over 1,000 other threads in the rack.
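The 1,520-thread figure above is straightforward arithmetic on the example’s numbers:

```python
# Rack-level thread count for the example neural-class network above.
servers_per_rack = 38
processors_per_server = 2
cores_per_processor = 10
threads_per_core = 2      # "dual-threaded" cores (SMT)

threads_per_server = processors_per_server * cores_per_processor * threads_per_core
rack_threads = servers_per_rack * threads_per_server

assert threads_per_server == 40
assert rack_threads == 1520   # so each thread may need links to 1,000+ peers
```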
Solarflare XtremeScale™ Family of Software-Defined NICs
XtremeScale Software-Defined NICs from Solarflare (SFN8000 series) are designed from the ground-up for neural-class networks. The result is a new class of Ethernet adapter with the ultra-high-performance packet processing and connectivity of expensive network processors, and the low-cost and power of general purpose NICs. There are six capabilities needed in neural-class networks which can be found only in XtremeScale software-defined NICs:
- Ultra-High Bandwidth – In 2017, Solarflare will provide high-frequency trading, CDN and cloud service provider applications with port speeds up to 100Gbps, backed by “cut-through” technology establishing a direct path between VMs and NICs to improve CPU efficiency.
- Ultra-Low Latency – Data centers are distributed environments with thousands of cores that need to constantly communicate with each other. Solarflare kernel bypass technologies provide sub-one microsecond latency with industry standard TCP/IP.
- Ultra-Scale Connectivity – A single densely-populated server rack easily exceeds over 1,000 cores. Solarflare can interconnect the cores to each other for distributed applications with NICs supporting 2,048 virtual connections.
- Software-Defined – Using well-defined APIs, network acceleration, monitoring, and security can be enabled and tuned, for thousands of separate vNIC connections, with software-defined NICs from Solarflare.
- Hardware-Based Security – Approximately 90% of network traffic is within a data center. With thousands of servers per data center, Solarflare can secure entry to each server with hardware-based firewalls.
- Instrumentation for Telemetry – Network acceleration, monitoring and hardware security is made possible by a new class of NIC from Solarflare which captures network packets at line speeds up to 100Gbps.
In May, Solarflare will release a family of kernel-bypass libraries called Universal Kernel Bypass (UKB). These range from an advanced version of DPDK, which delivers packets directly from the NIC to the container, to several versions of Onload, which provide higher-level sockets connections from the NIC directly to containers.
The effectiveness of our communication as a species is one of our defining characteristics. Earlier this week while waiting in a customer’s lobby in Chicago I noticed four framed posters displaying all the hand signals used in the trading pits of four major markets. Having been focused on electronic trading for the past decade this “ancient” form of communications became an instant curiosity worthy of inspection. On reflection, I was amazed to think that trillions of dollars in transactions over decades had been conducted purely by people motioning with their hands.
About a decade ago in the High-Performance Computing (HPC) market, a precursor market for High-Frequency Trading (HFT), there was a dust-up regarding the effectiveness of Remote Direct Memory Access (RDMA). One of Myricom’s senior researchers wrote an article for HPCWire titled “A Critique of RDMA” that set off a chain reaction of critical response articles:
- “Is RDMA Really That Bad“
- “A Tutorial of the RDMA Model“
- “A good example of RDMA people doing marketing without any technical clue..“
- “Why Compromise?“
- “Why Pretend?“
At the time Myricom was struggling to establish relevance for its new Myrinet-10G protocol against a competing technology, Infiniband, which was rapidly gaining traction. Now to be fair, at the time I was in sales at Myricom. The crux of the article was that the one-sided RDMA communications model, which rose from the ashes of the Virtual Interface Architecture (VIA), was still more of a problem than a solution when compared to the existing two-sided Send/Recv model used by four other competing HPC protocols (QsNet, SeaStar, Infinipath & Myrinet Express).
Now RDMA has had a decade to improve as it spread from Infiniband to Ethernet under the name RDMA over Converged Ethernet (RoCE), but it still has performance issues. The origin of RDMA is cast in a closed, lossless, layer-2 Infiniband network with deterministic latency. Let’s take a moment and adopt a NASCAR analogy. Think of RDMA as the vehicle and Infiniband as the track. One can take a Sprint Cup Series vehicle tuned for the Charlotte Motor Speedway out for a spin on the local roads, but is that really practical (it certainly isn’t legal)? Yes, its origin is in the stock car, but how well will it do in stop-and-go traffic, particularly on uphill grades? How about parallel parking? Oh wait, there’s no reverse. Tight turns at low speeds, signaling, weather, etc. Sprint Cup Series vehicles are designed for 200MPH on a closed, extremely well-defined and maintained course. Ethernet, by contrast, is the road driven by everyone else; it’s unpredictable, with thousands of obstacles, and is ever-changing.
Those familiar with Ethernet know that lossless behavior and deterministic latency are not two characteristics normally associated with this network fabric. Some of us have been around the block and lived through Carrier Sense Multiple Access with Collision Detection (CSMA/CD), where packets often collided and random delays before retransmission attempts were common. TCP/IP was developed during these early days, and it was designed with packet loss as a key criterion. In the past three decades Ethernet has evolved considerably from its roots as a shared coax cable utilizing vampire taps to where we are today with dedicated twisted-pair cabling and fiber optics, but on rare occasion packets are still dropped, and performance isn’t always deterministic. Today most packet drops are the result of network congestion. As discussed, TCP/IP is equipped to handle this; unfortunately, RoCE is not.
For RoCE to perform properly it requires a lossless layer-2 network: essentially a NASCAR track overlaid onto our public roads. To accomplish this over a routed Ethernet network, a new protocol was developed: Data Center Bridging Capabilities Exchange (shortened to DCB or DCBX). DCB is used at every hop of the network to negotiate and create a lossless layer-2 fabric on top of Ethernet. It achieves this by more tightly managing queue overflows and by adjusting network flow priorities as if they were traversing separate physical media. In essence, RoCE traffic is prioritized into its own carpool lane ahead of other traffic, in hopes of avoiding drops due to congestion. While this all sounds great, in talking with several large Web 2.0 customers who’ve invested years in RoCE, we learned that the vast majority will never deploy it in production. There are far too many challenges to get it working and keep it working, and at high traffic volumes it suffers. Unlike Infiniband HPC clusters, which are stood up as self-contained networks (closed-course race tracks) to address specific computational problems, Ethernets are in a constant state of flux, with servers and switches being added and removed (our public road system) as the needs of the business change. To be clear, TCP/IP is resilient to packet loss, while RoCE is not.
On the latency side of things, in the past decade we’ve achieved roughly one microsecond for a half round trip (a send plus a receive) with both TCP and UDP when using Solarflare’s OpenOnload. This is in line with RoCE latency, which is also in the domain of one microsecond. Keep in mind that normal TCP or UDP transactions over 10GbE typically run in the range of 5 to 15 microseconds, so 1 microsecond is a huge improvement. By now you’re likely saying “So what?” For most applications, like file sharing, databases, etc., the difference between one microsecond and even fifteen microseconds is lost in the 10,000+ microseconds a whole transaction might take. It turns out, though, that there are new breeds of network-latency-sensitive applications that depend on technologies like Non-Volatile Memory Express (NVMe), neural networks, and high-volume compound web transactions, and these can see significant improvements when latency is reduced. When low-latency TCP is applied to these problems, the performance gains are both measurable and significant.
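A rough sketch of why network latency matters for some workloads and not others; the hop counts and times below are illustrative assumptions, not benchmarks:

```python
def network_share(compute_us, sequential_hops, per_hop_latency_us):
    """Fraction of total transaction time spent waiting on the network,
    given a compute time and a number of sequential network round trips."""
    network_us = sequential_hops * per_hop_latency_us
    return network_us / (compute_us + network_us)

# A 10,000us database transaction barely notices 15us vs 1us per hop:
assert network_share(10_000, 2, 15.0) < 0.01
# A compound web transaction chaining 200 service calls is dominated by it:
assert network_share(1_000, 200, 15.0) == 0.75
assert network_share(1_000, 200, 1.0) < 0.17
```

The more a transaction decomposes into chained network calls, the more a 1-microsecond hop beats a 15-microsecond one.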
So the next time someone suggests RoCE, ask if they’ve considered a little-known competing protocol called TCP/IP. While RoCE is the shiny new object, TCP/IP has had several decades of innovation behind it, which explains why it’s the underlying “language of the Internet.” Consider asking those promoting RoCE what their porting budget is, and whether they’ve factored in the cost of the new network switches that will be required to support DCB. It’s very likely that the application they want to deploy already supports TCP/IP, and if latency and throughput are key factors, then consider contacting Solarflare about OpenOnload. OpenOnload accelerates existing sockets-based applications without having to modify them.
In 2003 we saw the emergence of the 10GbE server adapter market with only several players; we’ll call this the first wave. Early products by Neterion and Intel carried extremely high price tags, often approaching $10K. This led to a flood of companies jumping into the market in an effort to secure an early-mover advantage. High-Performance Computing (HPC) companies like Myricom, with its Myrinet 2G, and Mellanox, with Infiniband SDR 10G, were viewed by some as possibly having a competitive advantage, as they’d already developed silicon in this area. In August of 2005, I joined Myricom to help them transition from HPC to the wider Ethernet market. By March of 2006, we launched a single-port 10GbE product with a $595 price point, a 10X drop in market price in three years. That year the 10GbE market had grown to 18 different companies, all offering 10GbE server adapters; we’ll consider this the second wave. In my 2013 article “Crash & Boom: Inside the 10GbE Adapter Market” I explored what had happened up to that point to take the market from 18 players down to 10, you guessed it, the third wave. Today only six companies remain who are actually advancing the Ethernet controller market forward, and this is perhaps the fourth wave.
Intel is the dominant 10GbE adapter market player. They are viewed by many as the commodity option who checks the majority of the feature boxes while delivering reasonable performance. Both Mellanox and QLogic are the exascale players as their silicon carries Infiniband specific features which they’ve convinced this market are important. In storage Chelsio rules as they’ve focused considerable silicon towards offloading the computational requirements of iSCSI. For the low latency and performance over BSD compliant TCP and UDP sockets sought by the financial traders of the world, Solarflare is king. This leaves one remaining actor, Broadcom, and in fact, they were acquired by Avago who also picked up Emulex. The word is they’ve dramatically cut their Ethernet controller development staff right after having completed their 25GbE controller ASIC, which may be why we’ve not seen it reach the market.
So as the 10GbE market sees feature and performance gains while the silicon is migrated over the next several years to 25GbE and 50GbE, expect these five players to continue dominating their respective niches: Intel, Mellanox, QLogic, Solarflare & Chelsio. I view this final phase as the fifth wave.
by David Whitney, Director of Global Financial Services, Stratus
The partnership of Stratus, the global standard for fault-tolerant hardware solutions, and Solarflare, the unchallenged leader in application network acceleration for financial services, at face value seems like an odd one. Stratus ‘always on’ server technology removes all single points of failure, which eliminates the need to write and maintain costly code to ensure high availability and fast failover. Stratus and high performance have rarely been used in the same sentence.
Let’s go back further… Throughout the 1980s and 90s, Stratus and their proprietary VOS operating system globally dominated financial services, from exchanges to investment banks. In those days, the priority for trading infrastructures was uptime, which was provided by resilient hardware and software architectures. With the advent of electronic trading, the needs of today’s capital markets have shifted. High-Frequency Trading (HFT) has resulted in an explosion in transactional volumes. Driven by the requirements of one of the largest stock exchanges in the world, Stratus realized that critical applications need to be not only highly available but also extremely focused on performance (low latency) and deterministic (zero-jitter) behavior.
Stratus provides a solution that guarantees availability in mission-critical trading systems, without the costly overhead associated with today’s software-based High Availability (HA) solutions as well as the need for multiple physical servers. You could conceivably cut your server footprint in half by using a single Stratus server where before you’d need at least two physical servers. Stratus is also a “drop and go” solution. No custom code needs to be written, there is no concept of Stratus FT built customer applications. This isn’t just for Linux environments, Stratus also has hardened OS solutions for Windows and VMWare as well.
Solarflare brings low-latency networking to the relationship with their custom Ethernet controller ASIC and Onload, their Linux operating-system-bypass communications stack. Normally network traffic arrives at the server’s network interface card (NIC) and is passed to the operating system through the host CPU. This process involves copying the network data several times and switching the CPU’s context from kernel to user mode one or more times. All of these events take both time and CPU cycles. With over a decade of R&D, Solarflare has considerably shortened this path. Under Solarflare’s control, applications often receive data in about 20% of the time it would typically take. The savings are measured in microseconds (millionths of a second), typically several or more. In trading, speed often matters most, so a dollar value can be placed on this savings. Back in 2010, one trader valued the savings at $50,000 per microsecond for each day of trading.
Both Stratus and Solarflare have worked together to dramatically reduce jitter to nearly zero. Jitter is caused by those seemingly inevitable events that distract a CPU core from its primary task of electronic trading. For example, the temperature of a thermal sensor somewhere in the system may exceed a predetermined level and raise a system interrupt. A CPU core is then assigned to handle that interrupt and determine which fan needs to be turned on or sped up. While this event sounds trivial, the distraction to process this interrupt and return to trading often results in a delay measured in hundreds of microseconds. Imagine your trading strategy normally executes in tens of microseconds, network latency adds 1-2 microseconds, and then all of a sudden the system pauses your trading algorithm for 250 microseconds while it does some system housekeeping. By the time control is returned to your algorithm, it’s possible that the value of what you’re trading has changed. Both Stratus and Solarflare have worked exceedingly hard to remove jitter from the FT platform.
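The impact of a single jitter event is easy to model with the illustrative figures from the paragraph above (algorithm time, network latency, and housekeeping pause are the example’s numbers, not measurements):

```python
def trade_time_us(algo_us, network_us, jitter_us, jitter_hit):
    """Microseconds to complete one trade, with or without a jitter event
    stealing the core mid-trade."""
    return algo_us + network_us + (jitter_us if jitter_hit else 0.0)

normal = trade_time_us(10, 2, 250, jitter_hit=False)
jittered = trade_time_us(10, 2, 250, jitter_hit=True)

assert normal == 12
assert jittered == 262   # one housekeeping interrupt makes the trade ~22x slower
```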
Going forward, Solarflare and Stratus will be adding Precision Time Protocol support to a new version of Onload for the Stratus FT Platform.
Solarflare wants to talk with you at Black Hat in Las Vegas next month, and we’re raffling off a WiFi Pineapple to those who sign up for a meeting. What is a WiFi Pineapple, you ask? Perhaps one of the best tools available for diagnosing wireless security issues.
At Black Hat, Solarflare will be talking about their new line of SFN8xxx series adapters that support five-tuple packet filtering directly in hardware. The SFN8xxx series adapters support thousands of filters and an additional one thousand counters that can be applied to track filter usage. Along with filtering, we’ll be discussing the tamper-proof nature of this new line of adapters, and its capability to support over-the-wire firmware or filter-table updates via an SSL/TLS link directly to the controller on the adapter.
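To make “five-tuple filtering” concrete, here is a minimal software illustration of the matching semantics. The wildcard-as-None convention and the example rule are my assumptions for illustration; the actual in-hardware filter format is Solarflare’s:

```python
from collections import namedtuple

# The classic five-tuple that identifies a network flow. The SFN8xxx adapters
# match on these fields in hardware; the logic below just shows the idea.
FiveTuple = namedtuple("FiveTuple", "proto src_ip src_port dst_ip dst_port")

def matches(rule, packet):
    """A rule field of None acts as a wildcard (illustrative semantics)."""
    return all(r is None or r == p for r, p in zip(rule, packet))

# Example rule: drop any inbound TCP traffic to port 23 (telnet).
block_telnet = FiveTuple("tcp", None, None, None, 23)
pkt = FiveTuple("tcp", "10.0.0.5", 51514, "10.0.0.9", 23)

assert matches(block_telnet, pkt)
assert not matches(block_telnet, pkt._replace(dst_port=443))
```

With thousands of such rules and per-rule counters evaluated on the adapter, none of this filtering work lands on the host CPU.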
To learn more, or to set up a meeting for Wednesday, August 3rd or Thursday, August 4th at Black Hat, please send an email to firstname.lastname@example.org, and you’ll be automatically enrolled in our drawing for a WiFi Pineapple.