The August 2019 acquisition of Solarflare by Xilinx defined my role as transitional, with a fixed expiration date. I'm actively seeking new opportunities outside the company that would start in August.
After nearly seventeen years with IBM, in July of 2000, I left for a startup called Telleo founded by four IBM Researchers I knew and trusted. From 1983 through April 1994, I worked at IBM Research in NY and often dealt with colleagues at the Almaden Research Center in Silicon Valley. When they asked me to join, there was no interview; I already had impressed all four of them years earlier, this was in May of 2000. In March of 2001, the implosion of Telleo was evident. Although I’d not been laid off, I voluntarily quit just before Telleo stopped paying on their IBM lease, which I’d negotiated. The DotCom bubble burst in late 2000, so by early 2001, you were toast if you weren’t running on revenue. Now, if you didn’t live in Silicon Valley during 2001, imagine a large mining town where the mine had closed, this was close to what it was like, just on a much grander scale. Highway 101 had gone from packed during rush hour to what it typically looked like during the weekend. Venture Capitalists drew the purse strings closed, and if you weren’t running on revenue, you were out of business. Most dot-com startups bled red monthly and eventually expired.
Now imagine being an unemployed technology executive in the epicenter of the worst technology employment disaster in history, up until that point, with a wife who volunteered and two young kids. I was pretty motivated to find gainful employment. For the past few years, a friend of mine had run a small Internet Service Provider and had allowed me to host my Linux server there in return for some occasional consulting.
I’d set Nessus up on that server, along with several other tools, so it could be used to ethically hack client’s Internet servers, only by request, of course. One day when I was feeling particularly desperate, I wrote a small Perl script that sent a simple cover letter to jobs@X.com. Where “X” was a simple string starting with “aa” and eventually ending at “zzzzzzzz”. It would wait a few seconds between each email, and since these were to email@example.com I figured it was an appropriate email blast. Remember this was 2001, before SPAM was a widely used term. I thought “That’s what the “jobs” account is for anyway, right?” My email was very polite and requested a position and briefly highlighted my career.
Well, somewhere around 4,000 emails later, I got shut down, and my Internet domain, ScottSchweitzer.com was Black Holed. For those not familiar with the Internet version of this term, it essentially means no email from your domain even enters the Internet. If your ISP is a friend and he fixes it for you, he can run the risk of getting sucked in, and all the domains he hosts get sucked into the void as well. Death for an ISP. Fortunately, my friend that ran the ISP was a life-long IBMer, and he had networking connections at some of the highest levels in the Internet, so the ban stopped with my domain.
To clean this up required some emails and phone calls to fix the problem from the top down. It took two weeks and a fair amount of explaining to get my domain back online to the point where I could once again send out emails. Fortunately, I always have at least several active email accounts, and domains. Also, this work wasn’t in vain, as I’d received a few consulting gigs as a result of the email blast. So now you know someone who was banned from the Internet!
This post was originally made in January of 2015, but due to a take-down letter, I received a week later this story has remained unpublished for the last seven years.
Memcached was released in May of 2003 by Danga Interactive. This free, open-source software layer provides a general-purpose distributed in-memory cache that can be used by both web & app servers. Five years later, Google released version 1.1.0 of its App Engine, which also included its version of an in-memory cache called Memcache. This capability to store objects in a huge pool of memory spread across a large distributed fabric of servers is integral to the performance inherent in many of Google’s products. Google showed in a presentation earlier this year on Memcache that a query from a typical data requires 60-100 milliseconds, while a similar Memcache query only needed 3-8 milliseconds, a 20X improvement. It’s no wonder Google is a big fan of Memcache. This is why in March of 2013, Google acquired both technology and people from a small networking company called Myricom, who cracked the code to accelerate the network path to Memcache.
To better understand what Google acquired, we need to roll back to 2009 when Myricom had ported its MX communications stack to Ethernet. Mx over Ethernet (MXoE) was originally crafted for the High-Performance Computing (HPC) market, to create a very limited, but extremely low latency, stack for their then two-year-old 10GbE NICs. MXoE was reformulated using UDP instead of MX, and this new low latency driver was then named DBL. It was then engineered to service the High-Frequency Traders (HFT) on Wall Street. Throughout 2010 Myricom had evolved this stack by further adding limited TCP functionality. At that time, half round trip performance (one send plus one receive) over 10GbE was averaging 10-15 microseconds, DBL did it 4 microseconds. Today (2015) by comparison Solarflare, the leader in ultra-low latency network adapters, does this in well under 2 microseconds. So in 2012, Google was searching for a method to apply ultra-low latency networking tricks to a new technology called Memcache. In the fall of 2012 Google invited several Myricom engineers to discuss how DBL might be altered to service Memcache directly. Also they were interested in known if Myricom’s new silicon, due out in early 2013, might also be a good fit for this application. By March of 2013 Google had tendered an offer to Myricom to acquire their latest 10/40GbE chip, which was about to go into production, along with 12 of their PhDs who handled both the hardware & software architecture. Little is publicly known if that chip or the underlying accelerated driver layer for Memcache has ever made it into production at Google.
Fast forward to today (2015), and earlier this week, Solarflare released a new whitepaper highlighting how they’ve accelerated the publicly available free open-source layer called Memcached using their ultra-low latency driver layer called OpenOnload. In this whitepaper, Solarflare demonstrated performance gains of 2-3 times that of a similar Intel 10GbE adapter. Imagine Google’s Memcache farm being only 1/3 the size it is today? We’re talking serious performance gains here. For example, leveraging a 20 core server, the Intel dual 10GbE adapter supported 7.4 million multiple get operations while Solarflare provided 21.9 million, nearly a 200% increase in the number of requests. If we look at mixed throughput (get/set in a ratio of 9:1), Intel delivered 6.3 million operations per second while Solarflare delivered 13.3 million, a 110% gain. That’s throughput, how about latency performance? Using all 20 cores, and batches of 48 get requests Solarflare clocked in at 2,000 microseconds, and Intel at 6,000 microseconds. Across all latency tests, Solarflare reduced network latency by, on average, 69.4% (the lowest reduction was 50%, and the highest 85%). Here is a link to the 10-page Solarflare whitepaper with all the details.
While Google was busy acquiring technology & staff to improve their own Memcache performance, Solarflare delivered it for their customers and documented the performance gains.
Many applications from biological to financial and Web2.0 utilize in-memory databases because of their cutting-edge performance, often delivering several orders of magnitude faster response time than traditional relational databases. When these in-memory databases are moved to their own machines in a multi-tier application environment, they often can serve 10 million requests per second, and that’s turning all the dials to 11 on a high-end dual-processor server. Much of this is due to how applications communicate with the kernel and the network.
Last year, my team used one of these databases, Redis, we then bypassed the kernel, connected up a 100Gbps network, and took that 10 million requests per second to almost 50 million. Earlier this year we began working with Algo-Logic, Dell, and CC Integration to blow way beyond that 50 million target. At RedisConf2020 in May, Algo-Logic announced a 1U Dell server they’ve customized that can service nearly a half-billion requests per second. To process these requests, we’re spreading the load across two AMD EPYC CPUs and three Xilinx FPGAs. All requests are serviced directly from local memory using an in-memory key-value store system. For those requests serviced by the FPGAs, their response time is measured in billionths of a second. Perhaps we should explain how Algo-Logic got here, and why this number is significant.
Some time ago, a new form of databases came back into everyday use, and they were classified as NoSQL, because they didn’t use Structured Query language, and were non-relational. These databases rely on clever algorithmic tricks to rapidly store and retrieve information in memory; this is very different from how traditional relational databases function. These NoSQL systems are sometimes referred to as key-value stores. With these systems, you pass in a key, and a value is returned. For example, pass in “12345,” and the value “2Z67890” might be returned. In this case, the key could be an order number, and the value returned is the tracking number or status, but the point is you made a simple request and got back a simple answer, perhaps in a few billionths of a second. What Algo-Logic has done is they wrote an application for the Xilinx Alveo U50 that turns the 40 Gbps Ethernet port on this card into four smoking fast key-value stores each with access to the cards 8GB of High Bandwidth Memory (HBM). Each Alveo U50 card with Algo-Logics KVS can service 150 million requests per second. Here is a high-level architectural diagram showing all the various components:
There are five production network ports on the back of the server, three 40Gbps and two 25Gbps. Each of the Xilinx Alveo U50 cards has a single 40Gbps port, and the dual 25Gbps ports are on an OCP-3 form factor card called the Xilinx XtremeScale X2562 which carries requests into the AMD EPYC CPU complex. Algo-Logic’s code running in each of the Xilinx Alveo cards breaks the 40Gbps channel into four 10Gbps channels and processes requests on each individually. This enables Algo-Logic to make the best use possible of the FPGA resources available to them.
Furthermore, to overcome network overhead issues, Algo-Logic has packed 44 get requests into a single 1408 byte packet. For those familiar with Redis, this is similar to an MGET, multiple get, request. Usually, a single 32-byte get request can easily fit into the smallest Ethernet payload, which is 40 bytes, but then Ethernet adds an additional 24-bytes of routing and a 12-byte frame gap — using a single request per-packet results in networking overhead consuming 58% of the available bandwidth. This is huge, and can clearly impact the total request per second rate. Packing 44 requests of 32 bytes each into a single packet means that the network overhead drops to 3% of the total bandwidth, which means significantly greater requests per second rates.
What Algo-Logic has done here is extraordinary. They’ve found a way to tightly link the 8GB of High Bandwidth Memory on the Xilinx Alveo U50 to four independent key-value store instances that can service requests in well under a micro-second. To learn more consider reaching out to John Hagerman at Algo-Logic Systems, Inc.
As system architects, we seriously contemplate and research the components to include in our next server deployment. First, we break the problem being solved into its essential parts; then, we size the components necessary to address each element. Is the problem compute, memory, or storage-intensive? How much of each element will be required to craft a solution today? How much of each will be needed in three years? As responsible architects, we have to design for the future, because what we purchase today, our team will still be responsible for three years from now. Accelerators complicate this issue because they can both dramatically breath new life into existing deployed systems, or significantly skew the balance when designing new solutions.
Today foundational accelerator technology comes in four flavors: Graphical Processing Units (GPUs), Field Programmable Gate Arrays (FPGAs), Multi-Processor Systems on a Chip (MPSoCs) and most recently Smart Network Interface Cards (SmartNICs). In this market, GPUs are the 900-pound gorilla, but FPGAs have made serious market progress the past few years with significant deployments in Amazon Web Services (AWS) and Microsoft Azure. MPSoCs, and now SmartNICs, blend many different computational components into a single chip package, often utilizing a mix of ARM cores, GPU cores, Artificial Intelligence (AI) engines, FPGA logic, Digital Signal Processors (DSPs), as well as memory and network controllers. For now, we’re going to skip MPSoCs and focus on SmartNICs.
SmartNICs place acceleration technology at the edge of the server, as close as possible to the network. When computational processing of network intense workloads can be accomplished at the network edge, within a SmartNIC, it can often relieve the host CPU of many mundane networking tasks. Normal server processes require that the host CPU spend, on average, 30% of it’s time managing network traffic, this is jokingly referred to as the data center tax. Imagine how much more you could get out of a server if just that 30% were freed up, and what if more could be made available?
SmartNICs that leverage ARM cores and or FPGA logic cells exist today from a growing list of companies like Broadcom, Mellanox, Netronome, and Xilinx. SmartNICs can be designed to fit into a Software-Defined Networking (SDN) architecture. They can accelerate tasks like Network Function Virtualization (NVF), Open vSwitch (OvS), or overlay network tunneling protocols like Virtual eXtensible LAN (VXLAN) and Network Virtualization using Generic Routing Encapsulation (NVGRE). I know, networking alphabet soup, but the key here is that complex routing, and packet encapsulation tasks can be handed off from the host CPU to a SmartNIC. In virtualized environments, significant amounts of host CPU cycles can be consumed by these tasks. While they are not necessarily computationally intensive, they can be volumetrically intense. With datacenter networks moving to 25GbE and 50GbE, it’s not uncommon for host CPUs to process millions of packets per second. This processing is happening today in the kernel or hypervisor networking stack. With a SmartNIC packet routing and encapsulation can be handled at the edge, dramatically limiting the impact on the host CPU.
If all you were looking for from a SmartNICs is to offload the host CPU from having to do networking, thereby saving the datacenter networking tax of 30%, this might be enough to justify their expense. Most of the SmartNIC product offerings from the companies mentioned above run in the $2K to $4K price range. So suppose you’re considering a SmartNIC that costs $3K, with the proper software, and under load testing, you’ve found that it returns 30% of your host CPU cycles, what is the point at which the ROI makes sense? A simplistic approach would suggest that $3K divided by 30% yields a system cost of $10K. So if the cost of your servers is north of $10K, then adding a $3K SmartNIC is a wise decision, but wait, there’s more.
SmartNICs can also handle many complex tasks like key-value stores, encryption, and decryption (IPsec, MACsec, soon even SSL/TLS), next-generation firewalls, electronic trading, and much more. Frankly, the NIC industry is at an inflection point similar to when video cards evolved into GPUs to support the gaming and virtualization market. While Sony coined the term GPU with the introduction of the Playstation in 1994, it was Nvidia five years later in 1999 who popularized the GPU with the introduction of the GeForce 256. I doubt that in the mid-1990s, while Nvidia was designing the NV10 chip, the heart of the GeForce 256, that their engineers were also pondering how it might be used in high-performance computing (HPC) applications a decade later that had nothing to do with graphic rendering. Today we can look at all the ground covered by GPU and FPGA accelerators over the past two decades and quickly see a path forward for SmartNICs where they may even begin offloading the primary computational tasks of a server. It’s not inconceivable to envision a server with a half dozen SmartNICs all tasked with encoding video, or acting as key-value stores, web caches, or even trading stocks on various exchanges. I can see a day soon where the importance of SmartNIC selection will eclipse server CPU selection when designing a new solution from the ground up.
This is the first in a series designed to dispel the mystique
of digital currency, think Bitcoin, but trust me we’ll go way beyond that. My
goal is to explain in common language all you’ll need to know about digital currency,
so you can confidently answer questions when you sit down with your first
social network, your family.
As a species, our most significant evolutionary trait is our capability for building social networks. While social networks are found in many other high order mammals, we really do take it to the next level. Let’s face it for our first 36 months outside the womb we’re pretty much a defenseless bag of water rolling around wherever we’re placed. We can’t even effectively flee from even the most basic predator without assistance. From the moment we’re born we establish strong bonds with our parents and siblings, who become our first social network, our family. In our formative years, this network fills all our basic needs. It is this first network that introduces us to the second network we join, often very early in life, and that is the network of money.
Initially, we learn how to spend our first social network’s money as we roll down the aisle of the market and point out the foods we like. Soon members of our first network are sharing their money with us in return for our time when we do a task or achieve a milestone. Many of you up to this point possibly never viewed money as a social network, until recently it hadn’t occurred to me, but in fact, it is. On its face money is nothing more than a worthless token designed exclusively to be exchanged. It is these exchanges that form a network of commerce. If you closely examine your tightest social bonds, outside of your first network, they very likely were started or fueled by money. Outside of family one of my closest friends exists because over a decade ago I exchanged money in return for joining another social network. While I haven’t been associated with that network for years, this friend still shows up whenever I need him most, and it’s no longer about money.
Terminology is critical to our understanding. Mentally we often have trouble grasping something until we can assign it a name. For years you’ve carried money around in your pocket, but have you ever considered it as your fiat currency? Here in the US, our fiat currency is the dollar. It’s very possible you’ve taken it for granted so long you never even viewed it this way. A fiat currency is the one officially sanctioned and managed by the government, another network, under which you live. If you only need one currency, in my case the dollar, then the concept of fiat becomes mute, but what happens when you begin to use more than one?
Sitting here in North Carolina a month ago for the first time in my life I need to spend some Bitcoin which I’d been gifted a few years earlier. Bitcoin is the most well-known digital currency, but it is NOT a fiat currency because it hasn’t been issued by a government. The vendor I wished to purchase a small digital currency mining rig from in China ONLY accepted Bitcoin, NOT my US dollars via credit card. At the time Bitcoin was trading around $10,000 USD so buying something for $250 USD meant I was spending a fraction of a Bitcoin. I’m only aware of a single fractional unit of a Bitcoin and that’s called a Satoshi which is equal to one ten-millionth of a Bitcoin. So, I shelled out roughly 2.5M Satoshi for my rig and anxiously awaited its arrival. We’ll dive more into Bitcoin and Satoshi in future posts, but I thought it a prudent example where the fiat currency wasn’t accepted. Next year we’ll have Libra, Facebook’s digital currency, and that’s already giving those who manage our fiat currencies serious concerns.
In computer science, there’s a concept called Metcalfe’s law which states that the value of a network is equal to the square of the number of nodes or members in that network. My extended family has roughly fifty members, so its potential value is fifty squared or 2,500. My friend has nearly 100 in his extended family, so his family’s value is 10,000. Metcalfe’s law came along with computer networking, long after Alexander Graham Bell had invented the telephone, but Bell was aware that the value of his invention would only be truly realized once it had been widely adopted. Within 10 years of its invention over 100,000 phones had been installed in the US. Bell died 46 years after his invention knowing that his new network had changed the world.
Facebook is the largest social network our species has ever created. With over two billion active users this means that one person in four uses this network monthly. Right now, the clear majority of those two billion people live their lives with their fiat currency, and unless they travel they rarely if ever deal with another. Next year Facebook will issue its fiat currency, Libra, and it will change everything. Now it should be noted that Facebook isn’t a country so the concept of them issuing currency has raised some serious concerns from those who do issue currency. Countries around the globe are taking Facebook’s Libra head-on because they know it represents a Pandora’s box of problems for their own monetary systems, and here’s the main reason why.
Looking at Facebook as a network we could say its value is two squared or four, for now, let’s drop the all the billions as it just makes the numbers incomprehensibly large. The US has a population of 0.327 billion so the value of it as a network (again without the billions) is 0.1 and China has a population of 1.386 billion, so the value of its network is 1.9. If we now view these countries populations as networks, we see that Facebook’s value is twice that of the US and China combined. Extend that to currencies and you can why all the fuss, and why you might need to understand digital currency.
Next, we’ll dismiss the intrinsic value argument, that’s where grandpa says Bitcoin is worthless because there’s nothing behind it. You can then counter with the US went off the gold standard decades ago so why isn’t the dollar worthless? We’ll provide you with that side of the argument.
Accelerators are like calling in a special forces team to address a serious competitive threat. By design, a special forces team, known as “A Detachment” or “A-Team” consists of two officers and ten sergeants, all of which are cross-trained in five different skill areas: weapons, engineering, medical, communications, and operations intelligence. This enables the detachment to survive for months or even years in a hostile area without any operational support. Accelerators are the computational equivalent.
A well-designed accelerator has different blocks of silicon to address each of the four primary computational workloads we discussed in part two:
Scaler, working with integers, and letters
Floating-point, the real numbers with decimal points
Vector, one-dimensional arrays of floating-point numbers
Artificial Intelligence (AI), vectors with low precision floating point mixed with integers
If workload types are much like special forces skills then what types of physical computational cores can we leverage in an accelerator design that are optimized to address these specific tasks?
For scaler problems, Intel’s x86 platform has led for decades as far back as the early 1980s. Quietly over the last 25 years, the ARM architecture has evolved. In the past five years, ARM has demonstrated everything necessary for it to be a serious data center player. Add to that ARM’s architecture licensing model which has led to third parties developing their cores which are instruction set compatible. Both of these factors have resulted in at least a dozen companies from Apple to Samsung developing their ARM core designs. Today ARM cores can be found in everything from Nest Thermostats to Apple iPhones. Today the most popular architecture for workload acceleration is the ARMv8-A. Specifically, the Cortex-A72 design which supports both 32bit and 64bit computing, with 1-4 computational cores. Today the Broadcom Stingray, Mellanox Bluefield, NXP Layerscape, and Xilinx Versal all use the ARM Cortex-A72.
When it comes to accelerating floating point, the current trend has been towards Graphical Processing Units (GPUs). While GPUs have been around for about a decade now, it wasn’t until the NVIDIA Tesla debuted that they were viewed as a real computational accelerator. GPUs are also suitable for the third workload model vector processing. In essence, GPUs can kill two birds with one computational stone. Another solution that can also accelerate certain types of floating-point operations are digital signal processing (DSP) engines. DSPs are very good at real-world computational problems that have a high degree of multiple-accumulates and matrix operations. Here is where some accelerator boards are stronger than others. While the Broadcom Stingray only has a cryptographic engine designed to handle single-pass hashing and encryption/decryption (both scaler tasks) it lacks any sort of added acceleration for floating-point math. Mellanox’s Bluefield chip also doesn’t include any silicon specifically dedicated to floating-point. What they do promote is the fact that Bluefield provides GPUDirect so the processor can communicate directly with GPUs on another PCIe card. NXP only has ARM cores, so no additional floating-point support is provided. By contrast, Xilinx’s Versal architecture includes anywhere from 472 to 3,984 DSP engines depending on the chip series and model.
Artificial Intelligence (AI) workloads leverage vector processing but instead of high precision floating point, they only require low precision or integer numbers. Again Broadcom, Mellanox, and NXP all fall short as they don’t include any silicon to process these workloads directly. Mellanox as mentioned earlier, does support GPUDirect for passing AI workloads to another PCIe board but that’s a far cry from on-chip dedicated silicon. Xilinx’s Versal architecture includes anywhere from 128 to 400 AI engines for accelerating these workloads.
Finally, the most significant differentiator is the inclusion of FPGA logic, also known as adaptable engines. This is something unique to only Xilinx accelerator cards. This is the capability to take frequently called upon routines written in C/C++ and port them over to dedicated logic which can improve the performance of a routine by at least 8X.
In the case of Xilinx’s new Versal architecture, the senior officer is an ARM Cortex-R5 for real-time workloads. The junior one officer, and the one who does much of the work, is an ARM Cortex-A72 quad-core processor. The two ARM engines are primarily for control plane functions. Then Versal has AI cores, DSP engines, and adaptable engines (FPGA logic) to accelerate the volumetric workloads. When it comes to application acceleration in hardware the Xilinx Versal is the A-Team!
Before we return to accelerators as a solution, we need to make a pit stop and explore the how behind the why. The why is simple; we buy a product or service to solve a problem. We intellectually evaluate stories and experiences, distill out the solutions that apply then affix those to tangible objects or services we can acquire. Rarely does someone buy an iPad to own an iPad, they have a specific use case in mind as their justification for that expense. The same holds for servers and accelerator cards. At this point in our technological evolution, the how for most remains a mystery which needs some explanation.
When a technician visits your home to fix a broken appliance, they don’t just walk in with a lone flat-bladed screwdriver. They carry a pretty large toolbox which was explicitly assembled to repair appliances. The contents of that tool box are different than those of a carpenter’s or automotive mechanic’s. While all three might have a screwdriver, only the carpenter would have a wood chisel, and the mechanic a torque wrench. Different problems demand different tools. For the past several decades, many of us have viewed the x86 architecture as the computational tool to solve ALL our information processing issues. Guess what, a great many things don’t optimize well to the x86 model, but if you throw enough clock cycles and CPU cores at most problems, a solution will eventually be reached.
The High-Performance Compute (HPC) market realized this many years ago, so they built heterogeneous computing environments with schedulers for each type of problem. They classified problems into scaler, floating-point, and vector. Since then we’ve added, Artificial Intelligence (AI), also known as Machine Learning (ML). Scaler problems are the ones that deal with integers (numbers without a decimal point) which is often how we represent text. So, for example, a database lookup of your name to fetch your address is entirely a scaler problem. Next, we have floating-point, or calculations with a decimal point, the real numbers. These require different computational routines, and as early as 1983, we introduced special numerical co-processors (early accelerators) in our PCs to handle this specific class of problems (ex. Intel 8087). Today we can farm these class of problems out to Graphical Processing Units (GPUs) as they have many parallel cores explicitly designed for this purpose.
Then there’s the mysterious class called vector computing. A vector is a one-dimensional array of numbers. Some might argue that vectors are just a special case of floating-point problems, and they are, but their treatment at the processing level sets them far apart. Consider the Pythagorean theorem. Solving for C when you know A and B requires not only a floating-point processor but many steps to arrive at the value for C. For illustration let’s say it takes ten CPU instructions to arrive at a value for C, it’s probably more. Now imagine you have a set of 256 values for A and a corresponding set of 256 values for B, this would take 2,560 instructions to produce a solution, the complete set C. A vector processor will load the entire set of A and B values at the same time into CPU registers, square the results in one instruction, sum them in another, square-root the last result in another then present the solution set C in a final instruction, a few instructions instead of 2,560. Problems like weather forecasting map extremely well into the vector processing model.
Finally, there is the fourth, relatively new, class of problems that fall into the realm of AI or ML. Here the math being done is vector based, its a mix of both integer (scaler) and real numbers, but with intentionally low precision. The difference being that the value computed doesn’t always need to be perfect, just close enough. Much like when you do your taxes, and you leave off the change in your calculations. The IRS is okay with whole numbers because they’re good enough. Your self-driving car can drift an inch or so in any direction, and it won’t make any difference as it will still be more accurate than your Grandma Nat behind the wheel.
So now, back to the problem at hand, how do we accelerate today’s complicated workloads? For the past three decades, we’ve been taking a scaler platform, the x86 processor with floating-point capabilities, and using it as a double-ended screwdriver with both a flat and a Philips head to address every problem we have. How do we move forward?
Stay tuned for part three, where we cover hardware acceleration platforms.
Since the dawn of time humanity has needed to protect both people and things. Initial security methods were all “software based” in the sense that they relied on the user putting their trust in a process, people and social conventions. At first, it was cavemen hiding what they most valued, leveraging security through obscurity or they posted a trusted associate to watch the entrance. Finally, we expanded our security methods to include some form of “Keep Out” signs through writings and carvings. Then in 600BC along comes Theodorus of Samos, who invented the key. Warded locks had existed about three hundred years before Theodorus, but the “key” was just designed to bypass obstructions to its rotation making it slightly more challenging to access the hidden trip lever inside. For a Warded lock the “key” often looked like what we call a skeleton key today.
It could be argued that the lock represented our first “hardware based” security system as the user placed their trust in a physical token or key based system. Systems secured in hardware require that the user present their token in person, it is then validated, and if it passes, the security measures are removed. It should be noted that we trust this approach because it’s both the presence of the token and the accountability of a person in the vicinity who knows how to execute the exact process with the token to ensure success.
Now every system man invents can also be defeated. One of the first skills most hackers teach themselves is how to pick a lock. This allows us to dynamically replicate the function of the key using two very simple and compact tools (a torsion bar and a pick). Whenever we pick a lock we risk exposure, something we avoid at all cost, because the process of picking a lock looks visually different than that of using a key. Picking a lock using the tools mentioned above requires two hands. One provides a steady rotational force using the torsion bar. While the other manipulates the pick to raise the pins until each aligns with the cylinder and hangs up. Both hands require a very fine sense of touch, too heavy handed with the torsion bar and you can snap the last pin or two while freeing the lock. This will break it for future key users, and potentially expose your attempted tampering. Too light or heavy with the pick and you won’t feel the pins hanging up, it’s more skill than a science. The point is that while using a key takes seconds picking a lock takes much longer, somewhere between a few seconds to well over a minute, or never, depending on the complexity of the cylinder, and the person’s skill. The difference between defeating a software system and a hardware one is typically this aspect of presence. While it’s not always the case, often to defeat hardware-based systems it requires that the attacker be physically present because defeating hardware commonly requires hardware. Hackers often operate from countries far outside the reach of law enforcement, so physical presence is not an option. Attackers are driven by a risk-reward model, and showing up in person is considered very high risk, so the reward needs to be exponentially greater.
Today companies hide their most valuable assets in servers located in large secure data centers. There are plenty of excellent real-world hardware and software systems in place to ensure proper physical access to these systems. These security measures are so good that hackers rarely try to evade them because the risk of detection and capture is too high. Yet we need only look at the past month, April 2019, to see that companies like Microsoft,Starwood,Toyota, GA Tech and Questcare have all reported breaches. In Microsoft’s case, 6% of all MSN, HotMail, and Outlook accounts were breached, but they’ve not disclosed the details or the number of accounts. This is possible because attackers need to only break into a single system within the enterprise to reach the data center and establish a beachhead from which they can then land and expand. Attackers usually obtain a secure foothold through a phishing email or clickbait.
It takes only one undereducated employee to open a phishing email in outlook, launch a malicious attachment, or click on a rogue webpage link and it’s game over. Lockheed did extensive research in this area and they produced their now famous Cyber Kill Chain model. At a high level, it highlights the process by which attackers seize control of an enterprise. Anyone of these attack vectors can result in the installation of a remote access trojan (RAT) or a Zero-Day exploit that will give the attacker near unlimited access to the employee’s system. From there the attacker will seek out a poorly secured server in the office or data center to establish a beachhead from which they’ll launch their attack. The compromised employee system may not always be available, but it does makes for a great point to retreat back to in the event that the primary beachhead server system is discovered and sanitized.
Once an attacker has a foothold in the data center its game over. Very often they can easily move laterally, east-west, through the data center to other systems. The MITRE ATT&CK (Adversarial Tactics Techniques & Common Knowledge) framework, while similar to Lockheed’s approach, drills down much further. Specifically, on the lateral movement strategies, Mitre uncovered 17 different methods for compromising internal servers. This highlights the point that very few defenses exist in the traditional data center and those that do are often very well understood by attackers. These defenses are typically OS based firewalls that all seasoned hackers know how to disable. Hackers will disable logging, then tear down the firewall. They can also sometimes leverage an island hopping attack to a vendor or customer systems through private networks or gateways. Or in the case of the Starwood breach of Marriott the attackers got lucky and when their IT systems were merged so were the exploited systems. This is known as a data lemon, an acquisition that comes with infected and unsecured systems. Also, it should be noted that malicious insiders, employees that are aware of a pending termination or just seeking to augment their income, make up over 30% of the reported breaches. In this attack example, a malicious insider simply leverages their access and knowledge to drain all the value from their employer’s systems. So what hardware countermeasures can be put in place to limit east-west or lateral attacks within the data center? Today you have three hardware options to secure your data center servers against east-west attacks. We have switch access control lists (ACLs), top of rack firewalls or something uniquely innovative Solarflare’s ServerLock enabled NICs.
Often enterprises leverage ACLs in their top of rack 10/25/100G switches to protect east-west traffic within the data center. The problem with this approach is one of scale. IT teams can easily exhaust these resources when they attempt comprehensive application level segmentation at the server. These top of rack switches provide between 100 and 1,000 ACLs per port. By contrast, Solarflare’s ServerLock provides 5,000 ACLs per NIC, along with some foundational subnet level filtering.
In extreme cases, companies might leverage hardware firewalls internally to further zone off systems they are looking to secure. Here the problem is one of volume. Since these firewalls are used within the data center they will be tasked with filtering enormous amounts of network data. Typically the traffic inside a data center is 10X the traffic volume entering the data center. So for mission-critical clusters or server groups, they will demand high bandwidth, and these firewalls can become very expensive and directly impact application performance. Some of the fastest appliance-based firewalls designed to handle these kinds of high volumes are both expensive and add another 2.5 to 3.5 microseconds of latency in each direction. This means that if an intranet server were to fetch information from a database behind an internal firewall the transaction would see an additional delay of 5-6 microseconds. While this honestly doesn’t sound like much think of it like compound interest. If the transaction is simple and there’s only one request, then 5-6 microseconds will go unnoticed, but what happens when that employee’s request decomposes into hundreds or even thousands of database server calls? Delays then become seconds. By comparison, Solarflare’s ServerLock NIC based ACL approach adds only 0.25 to 0.75 microseconds of latency in each direction.
Finally, we have Solarflare’s ServerLock solution which executes entirely within the hardware of the server’s own Network Interface Card (NIC). There are NO server side services or agents, so there is no attackable software surface area of any kind. Think about that for a moment, a server-side security solution with ZERO ATTACKABLE SURFACE AREA. Once ServerLock is engaged through the binding process with a centralized ServerLock DirectorOne controller the local control plane for the NIC that manages security is torn down. This means that even if a hacker or malicious insider were to elevate their privilege to root they would NOT be able to see or affect the security settings on the NIC. ServerLock can test up to 5,000 ACLs against a network packet within the NIC in just over 250 nanoseconds. If your security policies leverage subnet wildcards the worst case latency is under 750 nanoseconds. Both inbound and outbound network traffic is checked in hardware. All of the Solarflare NICs within a data center can be managed by ServerLock DirectorOne controllers. Today a single ServerLock DirectorOne can manage up to 1,000 NICs.
ServerLock DirectorOne is a bundle of code that is delivered as an ISO image and can be installed onto a bare metal server, into a VM or a container. It is designed to manage all the ServerLock NICs within an infrastructure domain. To engage ServerLock on a system you run a simple binding process that facilitates an exchange of secrets between the DirectorOne controller and the ServerLock NIC. Once engaged the ServerLock NIC will begin sharing new network flows with the DirectorOne controller. DirectorOne provides visibility to all the network flows across all the ServerLock enabled systems within your infrastructure domain. At that point, you can then begin defining security policies and place them in compliance or enforcement mode. In compliance mode, no traffic through the NIC will be filtered, but any traffic that is not in compliance with the defined security policies for that NIC will generate alerts. Once a policy is moved into “enforcement” mode all out of policy packets will have the default action applied to them.
If you’re looking for the most secure solution to protect your companies servers you should consider Solarflare’s ServerLock. It is the most affordable, and secure way to protect your valuable corporate assets.
Electronic trading, like no other industry, can directly link time and money. A decade ago when I started selling 10GbE NICs to Wall Street traders, they often shared with me the value of a single microsecond (millionth of a second) improvement in trading. Today these same traders are measuring gains in nanoseconds (billionths of a second). With each passing quarter our financial markets evolve, and trade execution times decrease. Trading platforms leveraging older hardware and software often can’t remain competitive as other traders continue to invest in the latest products which further reduce trade execution latency and improve order determinism.
For the past decade, Solarflare has led the market in accelerating server-side UDP/TCP networking for electronic trading with our Onload® software acceleration stack. In addition, Solarflare has regularly delivered a new generation of 10GbE network adapters that have further reduced network latency by 20-30% while also reducing jitter. Often these advances were the result of improvements in the hardware, but there were many significant enhancements to the Onload stack that contributed substantially to the overall system performance increases. Keep in mind that Onload is fully compliant to the BSD Sockets standard, which means that developers don’t have to change their code to use Onload. The table below shows this reduction in Onload latency over time along with the gain from each new generation of Solarflare adapters.
In the below graph (click on it to enlarge) you’ll see how latency with Onload compares between Solarflare’s SFN8522 and X2522 as message size increases. We’ve also included our next closest competitor, Mellanox, with their ConnectX-5 adapter and VMA offload stack.
About five years ago, Solarflare saw an opportunity to revisit TCP/UDP networking stacks within Onload and determined that it is possible to squeeze another 35-50% in performance gains if developers were willing to use a new C language application programming interface (API). This new API was built from the ground up focused on performance, and it implements only a subset of the complete BSD Sockets API. Every API call has been highly tuned to deliver optimum performance. On the road to formulating this API Solarflare has patented several new innovations, and in 2016 it leaped forward again by introducing this API and branding it TCPDirect. Initially, TCPDirect improved latency on Solarflare’s SFN8522 adapter by an astonishing 38%!
Recently TCPDirect was tested with the Solarflare’s latest X2522 cards, and it delivered an improved 48% latency reduction over Onload on the same adapter (click on the graph below). Today TCPDirect with the X2522 provides an amazing 828ns of latency with TCP. So how does this compare with Mellanox? The X2522 with TCPDirect is 39% faster than the Mellanox ConnectX-5 with VMA and Exasock! This gain is shown in the graph below. It should be noted that this testing was done using an older more performant Intel Skylake processor with a 3.6Ghz clock. Intel’s newest Cascade Lake processors burst up to 4.4Ghz, but they were not available at the time of this testing. Recent testing indicates that they should produce even more impressive results.
Trading and Time are interwoven into a single fabric, one cannot exist without the other. When trades are executing with a precision measured in nanoseconds you need a technology partner that is leading the industry, not following it. Solarflare also provides a precision time protocol (PTP) daemon that includes both IEEE-1588 (2008) and enterprise profiles. Additionally, Solarflare makes available an optional PCIe bracket kit enabling the direct connection of an external hardware master clock that can deliver a highly accurate one pulse per second (1PPS) signal. This kit and Solarflare’s PTP daemon enable the adapter to maintain system time synchronization to within 200ns of the external master clock. Mellanox has stated that their PTP implementation “can see time locked to reference well within 500 nanoseconds of variation.”
Numerous STAC reports over the past decade with all the major OEMs and the Linux distributions used in finance have validated that Solarflare networking technology is the standard by which all others are measured. Innovations like those discussed above are the reason why over 90% of the stock exchanges, global investment banks, hedge funds, and cutting-edge high-frequency traders’ architect their systems with Solarflare hardware and software. Outside of the Linux kernel’s own communications stack, no other TCP/UDP user-space communications stack is more heavily tested or in wider production than Solarflare’s Onload platform. Today the world economy exists across hundreds of thousands of servers spread throughout the globe, and nearly all of those servers depend on Solarflare to provide the industry’s best performance with the lowest jitter possible. Below are recent STAC Research reports from the past two years that back up our claims.
Rarely is an over-night success, over-night. Often success comes as a result of years or even decades of hard work, refinement, and maturity. ULN is just such a technology, while it is only now becoming fashionable as word leaks out that Google and Tencent have been adopting it internally because they’ve proven significant performance gains, it has been nearly 25 years in the making. Since the mid-1990s we have seen many efforts which have advanced kernel bypass otherwise known as ULN.
With the advent of both Gigabit Ethernet (GbE) and the Linux operating system, we saw the emergence of large (1,024 or more) clusters of high-performance servers. These clusters were often designed to focus on particular computing tasks, typically single applications representing complex computational problems. These problems were particularly thorny because they involved very chatty sophisticated programs that modeled fluid dynamics (ex. Boeing and airflow over a wing) or finite particle analysis (ex. Ford and GM with simulated car crash models) or seismic analysis (ex. Saudi Aramco and oil production). Don’t get me wrong, there were also many more like modeling nuclear weapons storage, but the above were just a few of dozens of classes of problems. So, the HPC crowd was seeking networking which was even faster and more efficient than generic Transmission Control Protocol (TCP) over GbE. They’d also realized that the Linux kernel was beginning to bottleneck their overall performance, so they started to explore options for bypassing the Kernel altogether.
This June the most popular Kernel bypass communications stack, the Message Passing Interface(MPI), will celebrate its 25th anniversary. MPI represented the dawn of a new approach to networking, a ULN communications stack. For MPI to achieve its desired performance objectives, it required a lower level networking device driver. In those early days, you could use the Virtual Interface Architecture(VIA) promoted by Intel, Microsoft and Compaq, which eventually became Infiniband’s Remote Direct Memory Access(RDMA), or Myrinetpromoted by Myricom. It should be noted that these weren’t the only two options, just the two most highly utilized at the time. Since then Myrinet has faded away, and Infiniband has dominated HPC.
In parallel to the maturing of ULN, we’ve had an explosion in core counts on CPUs. This year Intel will begin rolling out premium server-based processor chips supporting up to 48-cores, while AMD counters with a 64. On the surface, this is excellent news, but it further complicates other system-wide server performance issues, most notably access to the network. Since most servers are a dual socket, this brings the potential maximum core counts to 96 and 128 respectively. What we’ve noticed though through internal testing is that often as the total number of processing cores on a server increases beyond ten the operating system typically becomes the networking performance bottleneck. As mentioned previously the High-Performance Computing (HPC) market anticipated this issue long ago.
In 2010 there was a move by several companies to bring HPC technology to markets outside HPC. With this, we saw the introduction of Myricom’s Datagram Bypass Layer(DBL), Solarflare’s OpenOnload, and Voltaire’s Messaging Accelerator(VMA). Both DBL and VMA were born from fifteen years of MPI experience, and they were crafted to provide kernel bypass on Linux. Initially, DBL only supported the Unreliable Datagram Protocol (UDP), and it took Myricom nearly two more years to add Transmission Control Protocol (TCP) support. While Myricom was able to morph their Myrinet eXpress (MX) stack into DBL, the fact remained that they didn’t have their own ULN TCP stack and were torn between licensing one versus building their own. An interesting side note, the initial customer motivation to create DBL was for a storage company called SANBlaze, but Myricom quickly realized that it could also use DBL to accelerate stock market data for Chicago traders.
At that time 10GbE Network Interface Cards (NICs) had a 1/2 round trip for UDP based market data of about 10-15 microseconds. The initial version of DBL brought that down to under five microseconds. In financial trading, there is a direct correlation between time and money, and saving 5-10 microseconds on market data delivery means the difference between winning or losing a bid. At nearly the same time Solarflare also appeared in Chicago promoting its new OpenOnload that accelerated not only UDP but also the more complex TCP sessions. While market data comes in on UDP packets, orders into the exchanges are submitted using TCP. At the same time, and in parallel to this, one of the two biggest HPC Infiniband players Voltaire, later acquired by Mellanox, had crafted its own ULN called VMA. It too had realized that the lucrative financial markets were demanding ULN technology, and the time was right to apply their kernel bypass solution to this problem as well.
For four years, it was a three-way horse race between DBL, OpenOnload, and VMA for the best ULN solution on Linux providing support for both UDP and TCP. Since 2010 ULN for both UDP and TCP has come into production at nearly all of the worldwide financial exchanges, institutional banks, and high-frequency traders. While DBL and VMA still exist today, they make up less than 5% of utilization of ULN technology within financial customers. It turns out that in the fall of 2012 Myricom privately demonstrated to Google the value of using DBL to accelerate a Web2.0 application used extensively throughout Google called Memcached. By March of 2013 Google had acquired the necessary people and intellectual property from Myricom to bring both DBL and Myricom’s latest NIC technology in-house. With the core DBL development team gone, DBL’s utilization within the financial markets waned, and those customers have moved on to OpenOnload. Since then Google has dramatically expanded its use of this ULN technology in-house. Roughly four years ago with the adoption of VMA falling off to less than 2% adoption, Mellanox open-sourced VMA and moved it out to Github. Quietly over the past several years as other cloud providers had recognized Google’s ULN moves, these other players have begun spawning their own ULN projects.
At the same time in 2013 as word leaked out that Google had its own internal ULN project, Intel released their Data Plane Development Kit (DPDK). With DPDK it became much easier for applications to gain access directly to the raw networking device. This did not go unnoticed by China’s Tencent Cloud team as they started with the open source Free-BSD stack, carved out what they needed from it, then ported that on-top of DPDK. The resulting project was called F-Stack, and it can be found on Github today. Other projects like the OpenFastPath Foundation driven by Nokia, ARM, Cavium, and Marvell our advancing their own ULN. So today if you’re seeking out a ULN partner that supports both UDP and TCP your top five options are Solarflare’s Cloud Onload, VMA, F-Stack, OpenFastPath, and Seastar. Only one of these though is commercially available and fully supported, Solarflare’s Onload.
As you consider how you might accelerate your network intensive Web2.0 applications like web servers, software load balancers, in-memory databases, micro-service frameworks, and distributed compute grids you should consider Solarflare’s Cloud Onload. With Cloud Onload we’ve seen performance gains ranging from 50%-400% depending on how network intensive an application is. Over the past decade, Solarflare’s Onload technology has accelerated electronic trading worldwide, and today over 90% of all exchanges, institutional banks, and high-frequency trading shops have installed Onload. The only other ULN technology that even comes close to the worldwide adoption of Onload is MPI, but that’s a ULN stack designed for HPC messaging and it does not support UDP or TCP. If your enterprise relies on any of the Web2.0 classes mentioned above, consider reaching out to Solarflare to learn how they can accelerate your network traffic.