Before we return to accelerators as a solution, we need to make a pit stop and explore the how behind the why. The why is simple: we buy a product or service to solve a problem. We intellectually evaluate stories and experiences, distill out the solutions that apply, then affix those to tangible objects or services we can acquire. Rarely does someone buy an iPad just to own an iPad; they have a specific use case in mind as their justification for that expense. The same holds for servers and accelerator cards. At this point in our technological evolution, the how remains a mystery for most, and it needs some explanation.
When a technician visits your home to fix a broken appliance, they don’t just walk in with a lone flat-bladed screwdriver. They carry a pretty large toolbox which was explicitly assembled to repair appliances. The contents of that toolbox are different than those of a carpenter’s or automotive mechanic’s. While all three might have a screwdriver, only the carpenter would have a wood chisel, and the mechanic a torque wrench. Different problems demand different tools. For the past several decades, many of us have viewed the x86 architecture as the computational tool to solve ALL our information processing issues. Guess what? A great many things don’t optimize well to the x86 model, but if you throw enough clock cycles and CPU cores at most problems, a solution will eventually be reached.
The High-Performance Computing (HPC) market realized this many years ago, so they built heterogeneous computing environments with schedulers for each type of problem. They classified problems into scalar, floating-point, and vector. Since then we’ve added Artificial Intelligence (AI), also known as Machine Learning (ML). Scalar problems are the ones that deal with integers (numbers without a decimal point), which is often how we represent text. So, for example, a database lookup of your name to fetch your address is entirely a scalar problem. Next, we have floating-point, or calculations with a decimal point, the real numbers. These require different computational routines, and as early as the early 1980s we introduced special numerical co-processors (early accelerators) in our PCs to handle this specific class of problems (e.g., the Intel 8087). Today we can farm this class of problems out to Graphics Processing Units (GPUs), as they have many parallel cores explicitly designed for this purpose.
Then there’s the mysterious class called vector computing. A vector is a one-dimensional array of numbers. Some might argue that vectors are just a special case of floating-point problems, and they are, but their treatment at the processing level sets them far apart. Consider the Pythagorean theorem. Solving for C when you know A and B requires not only a floating-point processor but many steps to arrive at the value for C. For illustration, let’s say it takes ten CPU instructions to arrive at a value for C, though it’s probably more. Now imagine you have a set of 256 values for A and a corresponding set of 256 values for B. This would take 2,560 instructions to produce a solution, the complete set C. A vector processor will load the entire set of A and B values into CPU registers at the same time, square them in one instruction, sum the squares in another, take the square root of those sums in another, then present the solution set C in a final instruction: a few instructions instead of 2,560. Problems like weather forecasting map extremely well onto the vector processing model.
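To make the instruction counting concrete, here is a minimal Python sketch of the comparison. It is a toy model, not a real vector processor: the ten-instructions-per-value figure is the illustrative number used above, and counting each whole-set operation (load, square, sum, square root, store) as a single instruction is an assumption made purely for the sake of the comparison.

```python
import math

# Illustrative figure from the text: ~10 scalar instructions per value of C.
SCALAR_INSTRUCTIONS_PER_ELEMENT = 10


def scalar_hypotenuses(a_values, b_values):
    """Solve C = sqrt(A^2 + B^2) one pair at a time, tallying assumed instructions."""
    results = []
    instructions = 0
    for a, b in zip(a_values, b_values):
        results.append(math.sqrt(a * a + b * b))
        instructions += SCALAR_INSTRUCTIONS_PER_ELEMENT
    return results, instructions


def vector_hypotenuses(a_values, b_values):
    """Model each whole-set step as ONE vector instruction (an assumption)."""
    instructions = 1  # load the full A and B sets into (imaginary) vector registers
    a_sq = [a * a for a in a_values]
    b_sq = [b * b for b in b_values]
    instructions += 1  # square every element
    sums = [x + y for x, y in zip(a_sq, b_sq)]
    instructions += 1  # add the squares
    roots = [math.sqrt(s) for s in sums]
    instructions += 1  # square root of every sum
    instructions += 1  # present (store) the solution set C
    return roots, instructions


a_set = [float(i) for i in range(1, 257)]    # 256 values for A
b_set = [float(i) for i in range(257, 513)]  # 256 values for B

scalar_c, scalar_count = scalar_hypotenuses(a_set, b_set)
vector_c, vector_count = vector_hypotenuses(a_set, b_set)
print(scalar_count, vector_count)
```

Both functions produce the same set C; only the tally differs. In this model the scalar path costs 2,560 instructions and the vector path five, which is the "few instructions instead of 2,560" point in miniature.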
Finally, there is the fourth, relatively new, class of problems that fall into the realm of AI or ML. Here the math being done is vector based; it’s a mix of both integer (scalar) and real numbers, but with intentionally low precision. The difference is that the value computed doesn’t always need to be perfect, just close enough. It’s much like when you do your taxes and leave off the change in your calculations. The IRS is okay with whole numbers because they’re good enough. Your self-driving car can drift an inch or so in any direction, and it won’t make any difference, as it will still be more accurate than your Grandma Nat behind the wheel.
So now, back to the problem at hand: how do we accelerate today’s complicated workloads? For the past three decades, we’ve been taking a scalar platform, the x86 processor with floating-point capabilities, and using it as a double-ended screwdriver with both a flat and a Phillips head to address every problem we have. How do we move forward?
Stay tuned for part three, where we cover hardware acceleration platforms.
“… when you have access to the vastness of space, you realize there’s only one resource worth fighting over… even killing for: More time. Time is the single most precious commodity in the universe.”
— Kalique Abrasax, Jupiter Ascending (2015)
Computing is humanity’s purest quest to convert time into work. In 2000 IBM demonstrated slicing one second into 10 billion units (10GHz) and then squeezing computational work out of each unit. At the time IBM had defined a new 130-nanometer process they called “CMOS 9S.” It was planned for future generation PowerPC chips. In parallel IBM was ramping up production of the POWER4 at 1.9GHz. Now you may be asking yourself, “but wait a minute, I’ve never seen any production 10GHz CPUs, especially not 20 years ago,” and you’re correct. IBM’s POWER6 was as close as we’ve gotten, with one version of that chip advertised at 5GHz, and in the lab they achieved 6GHz. I’ve also heard IBM reps brag about 7GHz with POWER8 if you turn half the cores off. So why has computing hit the wall at 4-5GHz, and why has computation not reached 10GHz over the last twenty years?
Intel explained this five years ago in the blog post, “Why has CPU frequency ceased to grow?” The problem has a name: the “conveyor level.” Imagine a CPU as a conveyor belt driven assembly line with four workstations labeled A through D. Since an assembly line is a serial process, the worker at station B can’t start until the worker at station A finishes. Ideally, each station is designed to take the same amount of time to finish its work, so the following station isn’t impacted. The slowest worker then defines the speed of the conveyor on any given day. So if the most time-consuming stage in the CPU pipeline takes 250 picoseconds, then the clock frequency is 4GHz. There is also the issue of heat.
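The arithmetic behind that last figure is just the reciprocal of the slowest stage’s delay. Here is a small Python sketch of it; the four stage timings are hypothetical values chosen so that the slowest one matches the 250 picosecond example.

```python
def max_clock_hz(stage_delays_ps):
    """The slowest pipeline stage sets the clock period, so f = 1 / t_slowest."""
    slowest_ps = max(stage_delays_ps)
    return 1.0 / (slowest_ps * 1e-12)  # picoseconds -> seconds, then invert


# Hypothetical delays for stations A through D, in picoseconds.
stage_delays_ps = {"A": 200, "B": 250, "C": 180, "D": 220}

frequency_hz = max_clock_hz(stage_delays_ps.values())
print(f"{frequency_hz / 1e9:.1f} GHz")  # prints "4.0 GHz"
```

Speeding up stations A, C, or D changes nothing here; only shortening the 250 picosecond stage raises the clock.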
As an electron races through a computer circuit, it experiences a form of friction, known as resistance. Just like rubbing your hands together on a cold day produces heat, so does an electron zipping through a computer circuit. When designing any chip, heat is the enemy. The smaller the chip geometry (today it’s seven nanometers), the more devices you can pack into a given space on a chip. More devices mean more heat. That same square centimeter of space at 7nm still has the same thermal limitations it did at 130nm 20 years ago. Sure, we can use fancy liquid systems to rapidly wick heat away from the chip, instead of relying on airflow over an area-limited heat sink, but at the end of the day, every watt of power the chip consumes becomes heat. Now there are individual circuits throughout the chip specifically designed to detect and respond to over-heating situations. The last thing anyone wants is a smoldering piece of silicon where their CPU once was. In the 7GHz example above, the IBM representative said that if you viewed the POWER8 chip as a big chessboard and you turned off all the CPU cores on the white squares, then all the cores on the black squares could be clocked at nearly twice the speed, or 7GHz. Why is this interesting?
For some computational problems it’s much better to have two consecutive computations in the same unit of time than two unrelated ones. Electronic trading, also known as high-frequency trading (HFT), is the premier market-driven problem that benefits most from increasing clock frequency. Traders often ascribe a dollar value to a millionth of a second, and it varies from market to market based on the rules and volumes of each market. In the end, though, it always boils down to the trader’s speed and response to a market signal. If I’m faster than you at making the right decision, then I win the business and book the profit. Sticking with HFT, where do accelerators fit in?
Traders lease connections to exchanges. The closer and faster they can respond to signals from those connections, the more competitive they will be. Suppose my trading platform requires signals from the market to travel through my server, then another switch on my private network, back through a second server, then finally out to the market. The networking alone, even with kernel bypass, through two servers and a switch could easily be several microseconds. Add a few more microseconds for trading logic in both servers, and you could be looking at almost ten microseconds to submit a trade in response to a signal. Two years ago Solarflare, with LDA Technologies, demonstrated 98 nanoseconds tick to trade. This was using accelerator technology, and compared to the trading platform mentioned above it is roughly 100 times, or two orders of magnitude, faster. That’s the difference between a four-day road trip from NYC to LAX and flying at Mach 5 and arriving in an hour. Time matters, and acceleration is not just for HFTs anymore. Why do you think Google bought Myricom, Amazon picked up Annapurna Labs, Nvidia purchased Mellanox, or Xilinx acquired Solarflare?
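For anyone who wants to check the comparison, here is the arithmetic as a short Python sketch. It takes "almost ten microseconds" as a round 10,000 nanoseconds, which is an assumption; the 98 nanosecond figure is the one quoted above. On those numbers the accelerated path comes out roughly 100 times faster, or about two orders of magnitude.

```python
import math

conventional_ns = 10_000  # "almost ten microseconds", rounded up (assumption)
accelerated_ns = 98       # tick-to-trade figure quoted in the text

speedup = conventional_ns / accelerated_ns
orders_of_magnitude = math.log10(speedup)

print(f"{speedup:.0f}x faster, about {orders_of_magnitude:.1f} orders of magnitude")
```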
Please stay tuned, more to come in part two.
We’ve all attended large industry international trade conferences hosting tens of thousands of people. These are spectacles designed to raise brand awareness, educate those in attendance about industry advances, network with colleagues you haven’t seen in a spell, all while promoting new products and services. By contrast there are also smaller regional industry trade shows that are scaled-down versions of these larger events with many of the same objectives, and then there are Security BSides events.
For those not familiar with BSides, they were started in 2009 to further educate folks on cybersecurity at the city and regional level. Think Black Hat, but on a Saturday at the local civic center, and with perhaps 200 people instead of 19,000. Let’s face it, most security engineers are introverts, so socializing at significant events like Black Hat is uncomfortable, while bringing a few coworkers or friends on a Saturday to a BSides event can be downright fun. Let’s face it, who doesn’t want to sit for 20-30 minutes in the lock-pick village with their friends to test their skills on some of MasterLock, Schlage or Kwikset’s most common products? It’s heartwarming to teach a NOOB (short for a newbie) how to pick a lock, then watch their excitement when the hasp clicks open for the first time.
Then there’s always the Capture the Flag (CTF) or wireless CTF for when you’re not interested in the session(s) being offered. If you’ve not played a security capture the flag event before, then you really are missing something. It is a challenging series of puzzles served up Jeopardy-style. Say 10 points if you can decrypt this phrase. Or 20 points if you can determine who’s attacking your machine on five different ports. Perhaps another 50 points if you can write a piece of code that can read a web page, unscramble five words, and post the five proper words back to the website in three seconds, before the clock expires and the words are no longer valid. It’s an intellectual problem-solving competition at its finest, and did I mention there is a leaderboard? Often projected high on the wall for all to see throughout the day are the teams with the highest scores. It really warms the heart when your team is the second on the board and it stays in the top five most of the day. While we were the second on the board at BSides Asheville, we didn’t stay in the top five for long.
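To give a flavor of that 50-point puzzle, here is a minimal Python sketch of just the unscrambling step. It is one common approach, not the solution to any particular CTF: it assumes you already hold a wordlist containing the answers, and it leaves out fetching the page and posting the result. The five-word list and the scrambled inputs are made up for illustration.

```python
def signature(word):
    """Two words are anagrams exactly when their sorted letters match."""
    return "".join(sorted(word.lower()))


def build_index(wordlist):
    """Map each letter signature to the dictionary words that share it."""
    index = {}
    for word in wordlist:
        index.setdefault(signature(word), []).append(word)
    return index


def unscramble(scrambled_words, index):
    """Look up each scrambled word; None means it is not in the wordlist."""
    return [index.get(signature(s), [None])[0] for s in scrambled_words]


# Hypothetical wordlist and challenge; a real CTF supplies its own.
WORDLIST = ["firewall", "packet", "kernel", "cipher", "socket"]
SCRAMBLED = ["llawerif", "tekcap", "lenrek", "rehpic", "tekcos"]

index = build_index(WORDLIST)
answers = unscramble(SCRAMBLED, index)
print(answers)
```

Building the index once up front is what makes the three-second limit comfortable: each lookup is then a single dictionary access, no matter how large the wordlist is.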
More seriously though, for a $20 entry fee (which includes a T-shirt) these BSides events offer an affordable local event for cybersecurity engineers and hobbyists. BSides gives socially challenged people the opportunity to step out of their shell and reach out to like-minded individuals while networking in a comfortable and technical space. You can bond over lock-picking or a CTF challenge, during lunch or between sessions. Bring one of your nerd friends as a wingman, or better yet several to form a CTF team, and make a day of it. If you’d like to check out an online CTF, one of our favorites is RingZer0. If you want to see the hacker side of the Technology Evangelist, W3bMind5, or read about his team’s experiences at BSides Asheville, then they can be found at RedstoneCTF.
The RedstoneCTF team may be attending BSidesCLT on September 28th and BSidesRDU on October 19th.
Last night while channel surfing I came across Men in Black III, and was dropped right into the scene where a 1969 Tommy Lee Jones was placing Will Smith into the Neuralizer pictured on the left. For those not familiar with the MiB franchise, which began in 1997, a Neuralizer is a cigar-sized plot device, normally carried inside an agent’s jacket pocket, for washing people’s memories of an alien encounter. The writers were clearly poking fun at miniaturization and how much humanity has come to take it for granted.
Those of us who grew up in the 1960s and 70s lived through the miniaturization wave as the Japanese led the industry by shrinking radios and televisions from cabinet-sized living room appliances to handheld devices. One year for Father’s Day in the late 70s we bought my dad a portable black and white TV with a radio that ran on batteries so he could watch it on the boat in the evenings. It was roughly the size of three laptops stacked on top of one another. It may sound corny now, but it was amazing back then. Today we watch theater quality movies in color, on a much larger screen, from a device that drops into our pocket, and don’t think twice about it. We’ve grown accustomed to technology improving at a rapid rate, and it’s now expected, but what happens when that rate is no longer sustainable?
Last year the industry began etching chips with a new seven nanometer process, which is equivalent to Intel’s 10nm process. Apple’s A12 Bionic chip that powers their XR and XS series iPhones is one of the first using this new 7nm process. This chip contains 6,900 million transistors and is arguably one of the most advanced devices ever produced by mankind. By contrast, my first computer in 1983 was a TRS-80 Model III powered by the Zilog Z80 processor. The Z80 used a 4,000nm process and only contained 8,500 transistors. So in 35 years we’ve reduced the process size by nearly three orders of magnitude, resulting in a transistor density improvement of nearly six orders of magnitude. Wow! How do we top that, and where are we in the grand scheme of the physics of miniaturization?
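Those orders of magnitude are easy to verify from the figures quoted above. The short Python sketch below does so, with one stated simplification: it compares raw transistor counts as a stand-in for density, since the die areas of the two chips are not given here.

```python
import math

# Figures as quoted in the text.
z80 = {"process_nm": 4_000, "transistors": 8_500}
a12 = {"process_nm": 7, "transistors": 6_900_000_000}

process_shrink = z80["process_nm"] / a12["process_nm"]       # about 571x
transistor_growth = a12["transistors"] / z80["transistors"]  # about 812,000x

print(f"process shrink:    10^{math.log10(process_shrink):.2f}")
print(f"transistor growth: 10^{math.log10(transistor_growth):.2f}")
```

The two exponents land just under three and just under six, and the second is about twice the first, which is what you would expect when shrinking in two dimensions at once.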
In a 1965 paper, Gordon Moore, a co-founder of Fairchild Semiconductor and later CEO of Intel, stated that the density of integrated circuits would double every year, a prediction now known as Moore’s Law. From 1970 through 2014 this “law” had essentially proved true. Before Intel’s current 10nm geometry their prior generation was 14nm, and that was achieved in 2014, so it’s taken them five years to accomplish 10nm. Not exactly Moore’s Law, but that’s the tip of the iceberg. As the industry goes from 14nm to 7nm/10nm, physics is once again throwing up a roadblock. This hasn’t been the first one, but it could be the last one. Chips are made using silicon, and silicon atoms have a diameter of about 0.2 nanometers. So at a seven nanometer node size, we’re talking 35 or so silicon atoms, which isn’t a very large number. It turns out that below seven nanometers, as we have fewer and fewer silicon atoms to manage electron flows, things get dicey. Chips begin to experience quantum effects; most notably those pesky electrons, which are about a millionth of a nanometer in size, begin to exhibit something called quantum tunneling. This means that they no longer behave like they are supposed to, and they move between devices etched into the silicon with a sort of reckless disregard for the “normal” rules of physics. This has been known for some time, though.
Back in 2016 a team at Lawrence Berkeley National Laboratory demonstrated a one nanometer transistor device, but that leveraged carbon nanotubes to manage electron flow and stave off the quantum tunneling effect. For those not familiar with carbon nanotubes, think teeny tiny diamond straws where the wall of the straw is one atom thick. While using carbon nanotubes to solve the problem is ingenious, it doesn’t fit into how we make chips today, as you can’t etch a carbon nanotube using conventional chip fabrication processes. So while it’s a solution to the problem, it’s one that can’t easily be utilized, and we may be working at 7nm for some time to come. This only means that one aspect of miniaturization has ground to a halt. When I’ve used the term chip above to represent an integrated circuit, the more precise term is actually a “die.”
Until recently it was common practice to place a single “die” inside a package. A package is what most of us think of as the chip as it has a bunch of metal pins coming out of the bottom or sides. In recent years the industry has developed new techniques that allow us to layer multiple dies onto one another within the same physical package enabling the creation of very complex chips. This is similar to a seven-layer cake where different types of cake can be in each layer and the icing can be used to convey flavors across the cake layers. This means that a chip can contain several and eventually many dies, or layers. A recent example of this is Xilinx’s new Versal chip line.
Within the Versal chip package there are multiple dies that contain two different pairs of ARM CPU cores, hundreds of Artificial Intelligence (AI) engines, thousands of Digital Signal Processors (DSPs), a huge Field Programmable Gate Array (FPGA) area, several classes of memory, and multiple programmable memory, PCIe and Ethernet controllers. The Versal platform is a flexible toolbox of computational power, with the ARM cores handling traditional CPU and real-time processing tasks. The AI cores churn through new machine learning workloads while the DSPs are leveraged for advanced signal processing, think 5G, and the FPGA can be used as the versatile computational glue to pull all these complex engines together. Finally, we have the memory, PCIe and Ethernet controllers to interface with the real world. So while Intel and AMD focus on scaling the number of CPU cores on the chip and NVidia works to improve Graphics Processing Unit (GPU) density, Xilinx is the first to go all-in on chip-level workload integration. This is the key to accelerating the data center going forward.
So until we solve the quantum tunneling problem, with new fabrication techniques, we can utilize advances in integration as shown above to move the industry forward.
While in Hawaii recently on vacation my millennial son tossed out a bucket list suggestion that we both go deep water Spearfishing. Immediately the iconic battle from the James Bond movie “Thunderball” leaped to mind. It’s the scene where the villain Largo’s minions in black wetsuits wage war against a platoon of US Navy Seals in red wetsuits. The whole sequence is fought with untethered spearguns and dive knives, safety first! Not one to back down from a challenge I arranged the dive and along the way we learned a few things worthy of sharing.
To further set the stage, back in 1992 I earned my PADI Open Water dive certification and have since made hundreds of dives, so pulling on a wetsuit and donning flippers, a mask and snorkel is nothing new, or so I thought. This was a 2mm one-piece wetsuit design which offered both thermal protection from the water and solar protection from burning exposed skin. The difference between this suit and my normal warm water one is that this one is decorated with an open water camouflage design. The purpose of the camouflage is to make the wearer look like a mass of seaweed, to attract the smaller fish to the shade. The mask and snorkel are typical, but the fins were a whole different game. When spearfishing, your objective is to not scare off the small fish, which then alert the larger game fish. To do this you must minimize ALL your movements, including your kicks. Most of your time is spent drifting on the surface and lying in wait for your prey. Did I mention the chum? Yes, cut up bait fish are introduced into the water near where you’re drifting to draw in larger game fish, and sometimes sharks. Toward this end, when spearfishing you use free diving fins which are nearly a meter long, three feet for my friends in the US. This enables the diver to make subtle ankle movements that gently propel them through the water.
When prey arrives, the hunter slowly moves the one-meter long wood speargun from their side into a position in front of them. They then lock out their dominant arm holding the gun, support the stock with their free hand, and slowly scan left and right to ensure that no other divers are in harm’s way. Finally, the hunter aligns the gun with the target and squeezes the trigger. The bolt travels a maximum of five meters, with the optimum killing distance between three and five meters. Yes, you have to be very close to the fish, move with extreme care, and you have to make your only shot count. If your shot is true and you hit the fish solidly in the head, then you’re instructed to drop the gun. Now there are a few caveats that I’ve not yet covered. The dive master instructed us to NOT shoot any fish that appears to be larger than 100 pounds. It turns out that connected to the back of the speargun is about 100 feet of floating line (1/2″ thick) that ends with a buoy. Divers can easily get tangled up in this line if they’re not careful while drifting. A 100-pound fish, with some room to run after being speared, can generate enough momentum to pull a fully grown diver under water, potentially resulting in their death. We were instructed that if a fish in the area is larger than 100 pounds, but less than 200 pounds, we should slowly pass the gun to the dive master so they could then double check the area before taking a more experienced shot. Death from accidentally being speared, or dragged under by a fish, was represented as a very tangible threat. We had two spear guns, five divers, and five hours of hunting, and yet there was only one clear shot, which proved fruitless. The fish felt the spear, but it did not penetrate its skin because the spear had reached the end of the line attaching it to the gun just as it touched the fish. So what does all this have to do with spear phishing?
Phishing is the process of using emails containing malware designed to compromise the computer on which these emails are read. Spear phishing is the act of specifically targeting a single individual using a very custom crafted email and phishing attachment. While generic phishing attacks are often “spray and pray” based assaults, sometimes aimed at the employees of a given company or industry, spear phishing attacks are laser-focused on a single person. The attacker thoroughly researches their target, combing the web and social media, and perhaps even doing some real-world social engineering and reconnaissance, to learn everything they can. The attacker’s objective is to select the most attractive strategy designed to elicit a response that results in the target opening an infected attachment. As in spearfishing, you may only get one shot, so it has to be your best.
In both of the above cases, the hunter thoroughly researches their prey, looking for the most opportune places to hunt, the proper times, and the most alluring baits. They then choose the appropriate weapon, and thoroughly practice the use of that weapon to ensure that they can make it function properly with the single shot they might get on their target. They then select and distribute the proper baits, and lie in wait for their prey.
Something that is common and often overlooked is that in both spearfishing and spear phishing the hunter is far more exposed, and hence significantly more vulnerable, than they might be had they used ANY other method of attack. In spearfishing the hunter is in the water only meters from their prey, and if they’re successful they need to move fast to land their catch on the boat before the arrival of sharks. A wounded fish instantly spills blood into the water and flails around in an effort to free itself. Sharks can detect blood in the water up to 1/3 of a mile away, and once they are near they sense the electrical impulses from a distressed fish’s muscles, along with its splashing, to zero in very quickly on what is now “their” prey. Sharks aren’t known for being discriminating eaters, so it is not uncommon at this point for the hunter to also become the hunted. In spear phishing, if the attacker isn’t meticulous in covering their tracks during their research, social engineering efforts, bait selection (phishing email), and weapon design (phishing exploit used within the email), these can often be used to uncover their identity.
So be ever vigilant as you approach your email, there will be times when you’re only one click away from being speared, and your system becoming compromised!
Since the dawn of time humanity has needed to protect both people and things. Initial security methods were all “software based” in the sense that they relied on the user putting their trust in a process, people and social conventions. At first, it was cavemen hiding what they most valued, leveraging security through obscurity, or posting a trusted associate to watch the entrance. Later, we expanded our security methods to include some form of “Keep Out” signs through writings and carvings. Then in 600BC along came Theodorus of Samos, who invented the key. Warded locks had existed for about three hundred years before Theodorus, but their “key” was just designed to bypass obstructions to its rotation, making it slightly more challenging to access the hidden trip lever inside. For a warded lock the “key” often looked like what we call a skeleton key today.
It could be argued that the lock represented our first “hardware based” security system as the user placed their trust in a physical token or key based system. Systems secured in hardware require that the user present their token in person, it is then validated, and if it passes, the security measures are removed. It should be noted that we trust this approach because it’s both the presence of the token and the accountability of a person in the vicinity who knows how to execute the exact process with the token to ensure success.
Now every system man invents can also be defeated. One of the first skills most hackers teach themselves is how to pick a lock. This allows us to dynamically replicate the function of the key using two very simple and compact tools (a torsion bar and a pick). Whenever we pick a lock we risk exposure, something we avoid at all cost, because the process of picking a lock looks visually different than that of using a key. Picking a lock using the tools mentioned above requires two hands. One provides a steady rotational force using the torsion bar, while the other manipulates the pick to raise the pins until each aligns with the cylinder and hangs up. Both hands require a very fine sense of touch. Too heavy-handed with the torsion bar and you can snap the last pin or two while freeing the lock. This will break it for future key users, and potentially expose your attempted tampering. Too light or heavy with the pick and you won’t feel the pins hanging up; it’s more a skill than a science. The point is that while using a key takes seconds, picking a lock takes much longer: anywhere from a few seconds to well over a minute, or never, depending on the complexity of the cylinder and the person’s skill. The difference between defeating a software system and a hardware one is typically this aspect of presence. While it’s not always the case, defeating a hardware-based system often requires that the attacker be physically present, because defeating hardware commonly requires hardware. Hackers often operate from countries far outside the reach of law enforcement, so physical presence is not an option. Attackers are driven by a risk-reward model, and showing up in person is considered very high risk, so the reward needs to be exponentially greater.
Today companies hide their most valuable assets in servers located in large secure data centers. There are plenty of excellent real-world hardware and software systems in place to ensure proper physical access to these systems. These security measures are so good that hackers rarely try to evade them because the risk of detection and capture is too high. Yet we need only look at the past month, April 2019, to see that companies like Microsoft, Starwood, Toyota, GA Tech and Questcare have all reported breaches. In Microsoft’s case, 6% of all MSN, Hotmail, and Outlook accounts were breached, but they’ve not disclosed the details or the number of accounts. This is possible because attackers need only break into a single system within the enterprise to reach the data center and establish a beachhead from which they can then land and expand. Attackers usually obtain a secure foothold through a phishing email or clickbait.
It takes only one undereducated employee to open a phishing email in Outlook, launch a malicious attachment, or click on a rogue webpage link, and it’s game over. Lockheed did extensive research in this area and produced their now famous Cyber Kill Chain model. At a high level, it highlights the process by which attackers seize control of an enterprise. Any one of these attack vectors can result in the installation of a remote access trojan (RAT) or a zero-day exploit that will give the attacker near unlimited access to the employee’s system. From there the attacker will seek out a poorly secured server in the office or data center to establish a beachhead from which they’ll launch their attack. The compromised employee system may not always be available, but it does make for a great point to retreat to in the event that the primary beachhead server system is discovered and sanitized.
Once an attacker has a foothold in the data center it’s game over. Very often they can easily move laterally, east-west, through the data center to other systems. The MITRE ATT&CK (Adversarial Tactics, Techniques & Common Knowledge) framework, while similar to Lockheed’s approach, drills down much further. Specifically, on the lateral movement strategies, MITRE uncovered 17 different methods for compromising internal servers. This highlights the point that very few defenses exist in the traditional data center, and those that do are often very well understood by attackers. These defenses are typically OS-based firewalls that all seasoned hackers know how to disable. Hackers will disable logging, then tear down the firewall. They can also sometimes leverage an island hopping attack to reach a vendor’s or customer’s systems through private networks or gateways. Or, as in the case of the Starwood breach at Marriott, the attackers got lucky: when the two companies’ IT systems were merged, so were the exploited systems. This is known as a data lemon, an acquisition that comes with infected and unsecured systems. Also, it should be noted that malicious insiders, employees who are aware of a pending termination or are just seeking to augment their income, make up over 30% of the reported breaches. In this attack example, a malicious insider simply leverages their access and knowledge to drain all the value from their employer’s systems. So what hardware countermeasures can be put in place to limit east-west or lateral attacks within the data center? Today you have three hardware options to secure your data center servers against east-west attacks: switch access control lists (ACLs), top of rack firewalls, or something uniquely innovative, Solarflare’s ServerLock enabled NICs.
Often enterprises leverage ACLs in their top of rack 10/25/100G switches to protect east-west traffic within the data center. The problem with this approach is one of scale. IT teams can easily exhaust these resources when they attempt comprehensive application level segmentation at the server. These top of rack switches provide between 100 and 1,000 ACLs per port. By contrast, Solarflare’s ServerLock provides 5,000 ACLs per NIC, along with some foundational subnet level filtering.
In extreme cases, companies might leverage hardware firewalls internally to further zone off systems they are looking to secure. Here the problem is one of volume. Since these firewalls are used within the data center they will be tasked with filtering enormous amounts of network data. Typically the traffic inside a data center is 10X the traffic volume entering the data center. Mission-critical clusters or server groups demand high bandwidth, so these firewalls can become very expensive and directly impact application performance. Some of the fastest appliance-based firewalls designed to handle these kinds of high volumes are both expensive and add another 2.5 to 3.5 microseconds of latency in each direction. This means that if an intranet server were to fetch information from a database behind an internal firewall, the transaction would see an additional delay of 5 to 7 microseconds. While this honestly doesn’t sound like much, think of it like compound interest. If the transaction is simple and there’s only one request, then 5 to 7 microseconds will go unnoticed, but what happens when that employee’s request decomposes into hundreds or even thousands of database server calls? The added delay then grows into milliseconds for that one request, and every request pays it again. By comparison, Solarflare’s ServerLock NIC based ACL approach adds only 0.25 to 0.75 microseconds of latency in each direction.
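The compounding can be made concrete with a little arithmetic. The sketch below uses only the per-direction latency figures quoted above; the function name and the assumption that the database calls happen one after another are mine, not part of any product.

```python
# Worked arithmetic for the per-direction latency figures quoted above.
# Assumption: the database calls are issued sequentially, so the delays add up.

def added_delay_us(one_way_latency_us: float, calls: int) -> float:
    """Extra delay in microseconds for `calls` sequential request/response pairs."""
    round_trip_us = 2 * one_way_latency_us  # request and reply each cross the filter
    return round_trip_us * calls


for calls in (1, 100, 1000):
    firewall = added_delay_us(3.5, calls)   # appliance firewall, worst case quoted
    nic_acl = added_delay_us(0.75, calls)   # NIC-based ACLs, worst case quoted
    print(f"{calls:>5} calls: firewall +{firewall:,.1f} us, NIC ACLs +{nic_acl:,.1f} us")
```

If the calls overlap rather than run back to back, the totals shrink, so treat these numbers as an upper bound for a single request.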
Finally, we have Solarflare’s ServerLock solution which executes entirely within the hardware of the server’s own Network Interface Card (NIC). There are NO server side services or agents, so there is no attackable software surface area of any kind. Think about that for a moment, a server-side security solution with ZERO ATTACKABLE SURFACE AREA. Once ServerLock is engaged through the binding process with a centralized ServerLock DirectorOne controller the local control plane for the NIC that manages security is torn down. This means that even if a hacker or malicious insider were to elevate their privilege to root they would NOT be able to see or affect the security settings on the NIC. ServerLock can test up to 5,000 ACLs against a network packet within the NIC in just over 250 nanoseconds. If your security policies leverage subnet wildcards the worst case latency is under 750 nanoseconds. Both inbound and outbound network traffic is checked in hardware. All of the Solarflare NICs within a data center can be managed by ServerLock DirectorOne controllers. Today a single ServerLock DirectorOne can manage up to 1,000 NICs.
ServerLock DirectorOne is a bundle of code that is delivered as an ISO image and can be installed onto a bare metal server, into a VM or a container. It is designed to manage all the ServerLock NICs within an infrastructure domain. To engage ServerLock on a system you run a simple binding process that facilitates an exchange of secrets between the DirectorOne controller and the ServerLock NIC. Once engaged the ServerLock NIC will begin sharing new network flows with the DirectorOne controller. DirectorOne provides visibility to all the network flows across all the ServerLock enabled systems within your infrastructure domain. At that point, you can then begin defining security policies and place them in compliance or enforcement mode. In compliance mode, no traffic through the NIC will be filtered, but any traffic that is not in compliance with the defined security policies for that NIC will generate alerts. Once a policy is moved into “enforcement” mode all out of policy packets will have the default action applied to them.
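To illustrate the difference between the two modes, here is a minimal sketch. It is not the ServerLock or DirectorOne API; the `Rule`, `Policy`, and `Mode` names, the rule format, and the choice of "drop" as the default action are all hypothetical and exist only to show the logic described above.

```python
from dataclasses import dataclass
from enum import Enum


class Mode(Enum):
    COMPLIANCE = "compliance"    # nothing is filtered, violations raise alerts
    ENFORCEMENT = "enforcement"  # out-of-policy packets get the default action


@dataclass(frozen=True)
class Rule:
    """Hypothetical allow rule: protocol plus destination port."""
    protocol: str
    dst_port: int


@dataclass
class Policy:
    rules: list
    mode: Mode = Mode.COMPLIANCE
    default_action: str = "drop"  # assumed default action, for illustration only

    def check(self, protocol: str, dst_port: int) -> str:
        """Decide what happens to one packet under this policy."""
        if Rule(protocol, dst_port) in self.rules:
            return "pass"                 # in policy: always allowed
        if self.mode is Mode.COMPLIANCE:
            return "pass+alert"           # traffic still flows, an alert is raised
        return self.default_action        # enforcement: default action applies


policy = Policy(rules=[Rule("tcp", 443)])
print(policy.check("tcp", 443))   # in-policy traffic
print(policy.check("tcp", 23))    # out of policy while in compliance mode
policy.mode = Mode.ENFORCEMENT
print(policy.check("tcp", 23))    # out of policy while in enforcement mode
```

Running in compliance mode first lets you watch which alerts fire against real traffic before any packet is actually filtered.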
If you’re looking for the most secure solution to protect your company’s servers, you should consider Solarflare’s ServerLock. It is the most affordable and secure way to protect your valuable corporate assets.
Electronic trading, like no other industry, can directly link time and money. A decade ago when I started selling 10GbE NICs to Wall Street traders, they often shared with me the value of a single microsecond (millionth of a second) improvement in trading. Today these same traders are measuring gains in nanoseconds (billionths of a second). With each passing quarter our financial markets evolve, and trade execution times decrease. Trading platforms leveraging older hardware and software often can’t remain competitive as other traders continue to invest in the latest products which further reduce trade execution latency and improve order determinism.
For the past decade, Solarflare has led the market in accelerating server-side UDP/TCP networking for electronic trading with our Onload® software acceleration stack. In addition, Solarflare has regularly delivered new generations of 10GbE network adapters, each of which has further reduced network latency by 20-30% while also reducing jitter. Often these advances were the result of improvements in the hardware, but there were many significant enhancements to the Onload stack that contributed substantially to the overall system performance increases. Keep in mind that Onload is fully compliant with the BSD Sockets standard, which means that developers don’t have to change their code to use Onload. The table below shows this reduction in Onload latency over time along with the gain from each new generation of Solarflare adapters.
In the graph below you’ll see how latency with Onload compares between Solarflare’s SFN8522 and X2522 as message size increases. We’ve also included our next closest competitor, Mellanox, with their ConnectX-5 adapter and VMA offload stack.
About five years ago, Solarflare saw an opportunity to revisit the TCP/UDP networking stacks within Onload and determined that it was possible to squeeze out another 35-50% in performance gains if developers were willing to use a new C language application programming interface (API). This new API was built from the ground up with a focus on performance, and it implements only a subset of the complete BSD Sockets API. Every API call has been highly tuned to deliver optimum performance. On the road to formulating this API Solarflare patented several innovations, and in 2016 it leaped forward again by introducing this API and branding it TCPDirect. Initially, TCPDirect improved latency on Solarflare’s SFN8522 adapter by an astonishing 38%!
Recently TCPDirect was tested with Solarflare’s latest X2522 cards, and it delivered an even better 48% latency reduction over Onload on the same adapter (see the graph below). Today TCPDirect with the X2522 provides an amazing 828ns of latency with TCP. So how does this compare with Mellanox? The X2522 with TCPDirect is 39% faster than the Mellanox ConnectX-5 with VMA and Exasock! This gain is shown in the graph below. It should be noted that this testing was done using an older, more performant Intel Skylake processor with a 3.6GHz clock. Intel’s newest Cascade Lake processors burst up to 4.4GHz, but they were not available at the time of this testing. Recent testing indicates that they should produce even more impressive results.
Trading and Time are interwoven into a single fabric, one cannot exist without the other. When trades are executing with a precision measured in nanoseconds you need a technology partner that is leading the industry, not following it. Solarflare also provides a precision time protocol (PTP) daemon that includes both IEEE-1588 (2008) and enterprise profiles. Additionally, Solarflare makes available an optional PCIe bracket kit enabling the direct connection of an external hardware master clock that can deliver a highly accurate one pulse per second (1PPS) signal. This kit and Solarflare’s PTP daemon enable the adapter to maintain system time synchronization to within 200ns of the external master clock. Mellanox has stated that their PTP implementation “can see time locked to reference well within 500 nanoseconds of variation.”
Numerous STAC reports over the past decade with all the major OEMs and the Linux distributions used in finance have validated that Solarflare networking technology is the standard by which all others are measured. Innovations like those discussed above are the reason why over 90% of the stock exchanges, global investment banks, hedge funds, and cutting-edge high-frequency traders architect their systems with Solarflare hardware and software. Outside of the Linux kernel’s own communications stack, no other TCP/UDP user-space communications stack is more heavily tested or in wider production than Solarflare’s Onload platform. Today the world economy exists across hundreds of thousands of servers spread throughout the globe, and nearly all of those servers depend on Solarflare to provide the industry’s best performance with the lowest jitter possible. Below are recent STAC Research reports from the past two years that back up our claims.
Rarely is an overnight success truly overnight. Often success comes as a result of years or even decades of hard work, refinement, and maturity. ULN is just such a technology. While it is only now becoming fashionable, as word leaks out that Google and Tencent have been adopting it internally because they’ve proven significant performance gains, it has been nearly 25 years in the making. Since the mid-1990s we have seen many efforts that have advanced kernel bypass, otherwise known as ULN.
With the advent of both Gigabit Ethernet (GbE) and the Linux operating system, we saw the emergence of large (1,024 or more) clusters of high-performance servers. These clusters were often designed to focus on particular computing tasks, typically single applications representing complex computational problems. These problems were particularly thorny because they involved very chatty, sophisticated programs that modeled fluid dynamics (ex. Boeing and airflow over a wing), finite particle analysis (ex. Ford and GM with simulated car crash models), or seismic analysis (ex. Saudi Aramco and oil production). Don’t get me wrong, there were also many more, like modeling nuclear weapons storage, but the above were just a few of dozens of classes of problems. So, the HPC crowd was seeking networking which was even faster and more efficient than generic Transmission Control Protocol (TCP) over GbE. They’d also realized that the Linux kernel was beginning to bottleneck their overall performance, so they started to explore options for bypassing the kernel altogether.
This June the most popular kernel bypass communications stack, the Message Passing Interface (MPI), will celebrate its 25th anniversary. MPI represented the dawn of a new approach to networking, a ULN communications stack. For MPI to achieve its desired performance objectives, it required a lower level networking device driver. In those early days, you could use the Virtual Interface Architecture (VIA) promoted by Intel, Microsoft and Compaq, which eventually became InfiniBand’s Remote Direct Memory Access (RDMA), or Myrinet promoted by Myricom. It should be noted that these weren’t the only two options, just the two most highly utilized at the time. Since then Myrinet has faded away, and InfiniBand has dominated HPC.
In parallel to the maturing of ULN, we’ve had an explosion in core counts on CPUs. This year Intel will begin rolling out premium server-based processor chips supporting up to 48 cores, while AMD counters with 64. On the surface, this is excellent news, but it further complicates other system-wide server performance issues, most notably access to the network. Since most servers are dual socket, this brings the potential maximum core counts to 96 and 128 respectively. What we’ve noticed through internal testing, though, is that as the total number of processing cores on a server increases beyond ten, the operating system typically becomes the networking performance bottleneck. As mentioned previously, the High-Performance Computing (HPC) market anticipated this issue long ago.
In 2010 there was a move by several companies to bring HPC technology to markets outside HPC. With this, we saw the introduction of Myricom’s Datagram Bypass Layer (DBL), Solarflare’s OpenOnload, and Voltaire’s Messaging Accelerator (VMA). Both DBL and VMA were born from fifteen years of MPI experience, and they were crafted to provide kernel bypass on Linux. Initially, DBL only supported the User Datagram Protocol (UDP), and it took Myricom nearly two more years to add Transmission Control Protocol (TCP) support. While Myricom was able to morph their Myrinet eXpress (MX) stack into DBL, the fact remained that they didn’t have their own ULN TCP stack and were torn between licensing one versus building their own. As an interesting side note, the initial customer motivation to create DBL came from a storage company called SANBlaze, but Myricom quickly realized that it could also use DBL to accelerate stock market data for Chicago traders.
At that time 10GbE Network Interface Cards (NICs) had a half round trip for UDP based market data of about 10-15 microseconds. The initial version of DBL brought that down to under five microseconds. In financial trading, there is a direct correlation between time and money, and saving 5-10 microseconds on market data delivery means the difference between winning and losing a bid. At nearly the same time Solarflare also appeared in Chicago promoting its new OpenOnload, which accelerated not only UDP but also the more complex TCP sessions. While market data comes in on UDP packets, orders into the exchanges are submitted using TCP. In parallel to this, Voltaire, one of the two biggest HPC InfiniBand players and later acquired by Mellanox, had crafted its own ULN called VMA. It too had realized that the lucrative financial markets were demanding ULN technology, and the time was right to apply its kernel bypass solution to this problem as well.
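For readers unfamiliar with the metric, a half round trip is simply the time for a message to go out and come back, divided by two. The sketch below shows one way to measure it with ordinary sockets calls over the loopback interface. It is an illustration of the method only: the function names are mine, and Python plus loopback adds overhead that has nothing to do with the NIC figures quoted above.

```python
import socket
import statistics
import threading
import time


def echo_server(sock: socket.socket, count: int) -> None:
    """Reflect `count` datagrams straight back to whoever sent them."""
    for _ in range(count):
        data, addr = sock.recvfrom(2048)
        sock.sendto(data, addr)


def half_round_trip_us(samples: int = 1000, payload: bytes = b"x" * 32) -> float:
    """Median half round trip, in microseconds, over the loopback interface."""
    server = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    server.bind(("127.0.0.1", 0))            # let the OS pick a free port
    client = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    client.settimeout(5.0)                   # fail loudly if a datagram is lost
    worker = threading.Thread(target=echo_server, args=(server, samples), daemon=True)
    worker.start()

    results = []
    for _ in range(samples):
        start = time.perf_counter_ns()
        client.sendto(payload, server.getsockname())
        client.recvfrom(2048)
        elapsed_ns = time.perf_counter_ns() - start
        results.append(elapsed_ns / 2 / 1000)   # half the round trip, ns to us

    worker.join()
    client.close()
    server.close()
    return statistics.median(results)


if __name__ == "__main__":
    print(f"median half round trip: {half_round_trip_us():.1f} us")
```

Reporting the median rather than the mean keeps a handful of slow outliers from dominating the result, which matters when the quantity of interest is only a few microseconds.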
For four years, it was a three-way horse race between DBL, OpenOnload, and VMA for the best ULN solution on Linux providing support for both UDP and TCP. Since 2010 ULN for both UDP and TCP has come into production at nearly all of the worldwide financial exchanges, institutional banks, and high-frequency traders. While DBL and VMA still exist today, they make up less than 5% of ULN utilization among financial customers. It turns out that in the fall of 2012 Myricom privately demonstrated to Google the value of using DBL to accelerate Memcached, a Web2.0 application used extensively throughout Google. By March of 2013 Google had acquired the necessary people and intellectual property from Myricom to bring both DBL and Myricom’s latest NIC technology in-house. With the core DBL development team gone, DBL’s utilization within the financial markets waned, and those customers have moved on to OpenOnload. Since then Google has dramatically expanded its use of this ULN technology in-house. Roughly four years ago, with VMA falling off to less than 2% adoption, Mellanox open-sourced VMA and moved it out to GitHub. Quietly over the past several years, as other cloud providers have recognized Google’s ULN moves, they have begun spawning their own ULN projects.
At the same time in 2013, as word leaked out that Google had its own internal ULN project, Intel released their Data Plane Development Kit (DPDK). With DPDK it became much easier for applications to gain access directly to the raw networking device. This did not go unnoticed by China’s Tencent Cloud team, who started with the open source FreeBSD stack, carved out what they needed from it, then ported that on top of DPDK. The resulting project was called F-Stack, and it can be found on GitHub today. Other projects, like the OpenFastPath Foundation driven by Nokia, ARM, Cavium, and Marvell, are advancing their own ULN. So today if you’re seeking out a ULN partner that supports both UDP and TCP, your top five options are Solarflare’s Cloud Onload, VMA, F-Stack, OpenFastPath, and Seastar. Only one of these, though, is commercially available and fully supported: Solarflare’s Onload.
As you consider how you might accelerate your network intensive Web2.0 applications like web servers, software load balancers, in-memory databases, micro-service frameworks, and distributed compute grids you should consider Solarflare’s Cloud Onload. With Cloud Onload we’ve seen performance gains ranging from 50%-400% depending on how network intensive an application is. Over the past decade, Solarflare’s Onload technology has accelerated electronic trading worldwide, and today over 90% of all exchanges, institutional banks, and high-frequency trading shops have installed Onload. The only other ULN technology that even comes close to the worldwide adoption of Onload is MPI, but that’s a ULN stack designed for HPC messaging and it does not support UDP or TCP. If your enterprise relies on any of the Web2.0 classes mentioned above, consider reaching out to Solarflare to learn how they can accelerate your network traffic.
When you take something that is already considered to be the fastest and offer to make it another 50% faster, people think you’re a liar. Those who built that fast thing couldn’t possibly have left that much slack in their design. Not every engineer is a “miracle worker” or notorious sand-bagger, like Scotty from the Starship Enterprise. So how is this possible?
A straightforward way to achieve such unbelievable gains is to alter the environment around how that fast thing is measured. Suppose the thing we’re discussing is Redis, an in-memory database. The engineers who wrote Redis rely on the Linux kernel for all network operations. When those Redis engineers measured the performance of their application what they didn’t know was that over 1/3 of the time a request spends in flight is consumed by the kernel, something they have no control over. What if they could regain that control?
Suppose we provided Redis with direct access to the network. This would enable Redis to make calls directly to the network without any external software layers in the way. What sort of benefits might the Redis application see? There are three areas which would immediately see performance gains: latency, capacity, and determinism.
On the latency side, requests to the database would be processed faster. They are handled more quickly because the application receives data straight from the network into Redis’s memory without a detour through the kernel. This direct path reduces memory copies, eliminates kernel context switches, and removes other system overhead. The result is a dramatic reduction in both time and CPU cycles. Likewise, when Redis fulfills a database request, it can write that data directly to the network, again saving more time and reclaiming more CPU cycles.
As more CPU cycles are freed up due to decreased latency, those compute resources go directly back into processing Redis database requests. When the Linux kernel is bypassed using Solarflare’s Cloud Onload, Redis sees on average a 50% boost in the number of “Get” and “Set” commands it can process every second. Imagine Captain Kirk yelling down to Scotty to give him more power, and Scotty flips a switch, and instantly another 50% more power comes online. That’s Solarflare Cloud Onload, Scotty’s magic switch. Below is a graph of the free version of Redis doing database GET commands using a single 25GbE link through the kernel in blue, and with an Onloaded 25GbE link in green. Note we scaled the number of Redis instances along the X-axis from 1 to 32 (on an x86 system with 32 cores), and the Y-axis runs from 0 to 15 million requests/second.
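The 50% figure lines up neatly with the one-third kernel share mentioned earlier, which a quick back-of-the-envelope calculation shows. This is my own simplification, not a Solarflare model: it assumes throughput is inversely proportional to the time spent per request and that the entire kernel share is eliminated.

```python
def throughput_gain(kernel_share: float) -> float:
    """Fractional throughput gain if the kernel's share of per-request time vanishes.

    Assumes throughput is inversely proportional to per-request time and that
    the whole kernel share disappears, which is a best case.
    """
    remaining = 1.0 - kernel_share    # fraction of the original time still spent
    return 1.0 / remaining - 1.0      # e.g. 2/3 of the time means 1.5x the throughput


print(f"{throughput_gain(1 / 3):.0%}")
```

In other words, if a third of every request is spent in the kernel and that third goes away, the same CPU can handle half again as many requests.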
Finally, there is the elusive attribute of determinism. While computers are great at doing a great many things, that is also what makes them less than 100% predictable. Servers often have many sensors, fans and a control system designed to keep them operating at peak efficiency. The problem is that these devices generate events that require near-immediate attention. When a thermal sensor generates an interrupt, the CPU is alerted, it pushes the current process to the stack, services the interrupt, perhaps by turning a fan on, then returns to the previous process. When the interrupt occurs, and how long it takes the CPU to service it are both variables that hamper determinism. If a typical “Get” request takes a microsecond (millionth of a second) to service, but that CPU core is called away from processing that “Get” request in the middle by an interrupt, it could be 20 to 200 microseconds before it returns. Solarflare’s Cloud Onload communications stack moves these interrupts out of the critical path of Redis, thereby restoring determinism to the application.
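One common way to put a number on determinism is to compare the median latency with the tail. The simulation below is a toy model built from the figures in the paragraph above (a 1 microsecond request, a 20 to 200 microsecond interrupt penalty). The 1% interrupt rate, the uniform distribution, and the function names are assumptions made purely for illustration.

```python
import random
import statistics


def simulate_latencies(requests: int, interrupt_rate: float, seed: int = 1) -> list:
    """Service times in microseconds; interrupted requests pay an extra 20-200 us."""
    rng = random.Random(seed)
    latencies = []
    for _ in range(requests):
        latency = 1.0                              # typical "Get" service time
        if rng.random() < interrupt_rate:
            latency += rng.uniform(20.0, 200.0)    # time lost servicing the interrupt
        latencies.append(latency)
    return latencies


def summarize(latencies: list) -> tuple:
    """Return (median, 99.9th percentile) in microseconds."""
    ordered = sorted(latencies)
    p999 = ordered[int(0.999 * (len(ordered) - 1))]
    return statistics.median(ordered), p999


noisy = simulate_latencies(100_000, interrupt_rate=0.01)  # assumed: 1% of requests interrupted
quiet = simulate_latencies(100_000, interrupt_rate=0.0)   # interrupts kept off the critical path
print("with interrupts   :", summarize(noisy))
print("without interrupts:", summarize(quiet))
```

The median barely moves in either case. It is the tail that gives the interrupts away, which is why averages alone say little about determinism.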
So, if you’re looking to improve Redis performance by 50%, please consider Solarflare’s Cloud Onload running on one of their new X2 series NICs. Solarflare’s new X2 series NICs are available for 10GbE, 25GbE and now 100GbE. Soon we will be posting our Benchmarking Performance Guide and our Cloud Onload for Redis Cookbook that contains all the details. When these are available on Solarflare’s website then links will be added to this blog entry.
*Update: Someone asked if I could clarify the graph a bit more. First, we focused our testing on both the GET and SET requests, as those are the two most common in-memory database commands. GET is simply used to fetch a value from the database while SET is used to store a value in the database, really basic stuff. Both graphs are very similar. For a single 25GbE link the size of the Redis GET and SET requests translates to about 11 million requests/second to fill the pipe.
It turns out that a quad-core server running four Redis instances can saturate a single 10GbE link; we’ve not tested multiple 10GbE links. Here is where Cloud Onload shines as it lifts the kernel limitations mentioned above. Note it will take you over 7 Redis instances on 7 cores to achieve line rate 25GbE with Cloud Onload, while the kernel will require twice that, or 14 instances on 14 cores, to match this. Any Redis instances or CPU cores beyond this will be underutilized. The most important takeaway here, though, is that Cloud Onload delivers a substantial capacity gain for Redis over using the kernel, so if your server has more than a few cores, Cloud Onload will enable you to get the full value out of them.
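For those who like to check the math, the figures above imply a rough per-request wire size and per-instance throughput. The numbers below are derived from the text, not measured. They ignore any difference between request and response bytes and round “over 7” instances down to 7, so treat them as approximations.

```python
LINK_BITS_PER_SECOND = 25e9   # one 25GbE link
LINE_RATE_REQUESTS = 11e6     # requests/second quoted as filling the pipe

# Implied bytes on the wire per request (assumption: the link is the only limit).
bytes_per_request = LINK_BITS_PER_SECOND / LINE_RATE_REQUESTS / 8

# Approximate per-instance throughput at line rate, using 7 and 14 instances.
onload_per_instance = LINE_RATE_REQUESTS / 7
kernel_per_instance = LINE_RATE_REQUESTS / 14

print(f"~{bytes_per_request:.0f} bytes on the wire per request")
print(f"~{onload_per_instance / 1e6:.2f}M requests/s per instance with Cloud Onload")
print(f"~{kernel_per_instance / 1e6:.2f}M requests/s per instance through the kernel")
```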
**Update: On March 23, 2019, an updated graph was posted above that focuses on 25GbE, as that’s where data centers are headed. The text was then aligned with the updated graph.
**Note: Credit to John Laroco for leading the Redis testing, and for noticing, and taking the opening picture at SJC airport earlier this month.
Many in corporate America still don’t view East-West attacks as a real, let alone significant, threat. Over the past several years while meeting with corporate customers to discuss our future security product, it wasn’t uncommon to encounter the occasional Ostrich. These are the 38% of people who responded to the June 2018 SANS Institute report stating that they’ve not yet been the victim of a breach. In security we have a saying: “There are only two types of companies, those that know they’ve been breached, and those that have yet to discover it.” While this sounds somewhat flippant, it’s a cold hard fact that thieves see themselves as the predators and they view your company as the prey. Much like a pride of female lions roaming the African savanna for a large herd, black-hat hackers go where the money is. If your company delivers value into a worldwide market, then rest assured there is someone out there looking to make an easy buck from the efforts of your company. It could be contractors hired by a competitor or nation-state actors looking to steal your product designs, a ransomware attacker seeking to extort money, or merely a freelancer surfing for financial records to access your corporate bank account. These threats are real, and if you take a close look at the network traffic attempting to enter your enterprise, you’ll see the barbarians at your gate.
A few months back my team had placed a test server on the Internet with a single “You shouldn’t be here” web page with a previously unused, unadvertised, network address. This server had all its network ports secured in hardware so that only port 80 traffic was permitted. No data of any value existed on the system, and it wasn’t networked back into our enterprise. Within one week we’d recorded over 48,000 attempts to compromise the server. Several even leveraged a family of web exploits I’d discovered and reported back in 1997 to the Lotus Notes Domino development team (it warmed my heart to see these in the logs). This specific IP address was assigned to our company by AT&T, but it doesn’t show up in any public external registry as belonging to our company, so there was no apparent value behind it, yet 48,000 attempts were made. So what’s the gizmo in the picture above?
In the January 2019 issue of “2600 Magazine, The Hacker Quarterly” a hacker with the handle “s0ke” wrote an article entitled “A Brief Tunneling Tutorial.” In it, s0ke describes how to set up a persistent SSH tunnel to a remote box under his control using a Raspberry Pi. This then enables the attacker to access the corporate network just as if he were sitting in the office. In many ways, this exploit is similar to sending someone a phishing email that then installs a Remote Access Trojan (RAT) on their laptop or desktop, but it’s even better as the device is always on and available. Yesterday I took this one step further. Knowing that most corporate networks leverage IP Phones for flexibility and that IP Phones require Power over Ethernet (PoE), I ordered a new Raspberry Pi accessory called a Pi PoE Switch Hat. This is a simple little board that snaps onto the top of the Pi and leverages the power found on the Ethernet port to power the entire server. The whole computer shown above is about the size of a pack of cigarettes with a good sized matchbook attached. When the case arrives, I’ll utilize our 3D printer to make matching black panels that will then be superglued in place to cover all the exposed ports and even the red cable. The only physically exposed parts will be a short black RJ45 cable designed to plug into a Power over Ethernet port and two tiny holes so light from the power and signal LEDs can escape (a tiny patch of black electrical tape will cover these once deployed).
When the Raspberry Pi software bundle is complete and functioning correctly, as outlined in s0ke’s article, I’ll layer in accessing my remote box via The Onion Router (Tor) and pushing my SSH tunnel out through port 80 or 443. This should make it transparent to any enterprise detection tools. Tor should mask the address of my remote box from their logs. In case my Pi is discovered, I’ll also install some countermeasures to wipe it clean when a local console is attached. At this point, with IT’s approval, I may briefly test it in our office to confirm it’s working correctly. Then it becomes a show-and-tell box, with a single PowerPoint slide outlining that east-west threats are real and that a determined hacker with $100 in hardware and less than one minute of unaccompanied access to a facility can own its network. The actual hardware may be too provocative to display, so I’ll lead with the slide. If someone calls me on it, though, I may pull the unit out of my bag and move the discussion from the hypothetical to the real. If you think this might be a bit much, I’m always open to suggestions on better ways to drive a point home, so please share your thoughts.
P.S. The build is underway; the Pi and Pi PoE Switch Hat have arrived. To keep the image as flexible as possible I’ve installed generic Raspbian on an 8GB Micro-SD card, applied all updates, and begun adding custom code. The system is generically named “printer” at this point. Also, a Power over Ethernet injector was ordered so the system could be tested in a “production like” power environment. It should be completed by the end of the month, perhaps in time for testing in my hotel during my next trip. Updated: 2019-01-20
A persistent automated SSH tunnel has been set up between the “printer” and the “dropbox” system, and I’ve logged into the “printer” by connecting via “ssh -p 9091 scott@localhost” on the “dropbox.” This is very cool. There is a flaw in the Pi PoE Switch board or its setup at this point: it is pulling the power off the Ethernet port, but it is NOT switching the traffic, so for now the solution utilizes two Ethernet cables, one for power and the second for the signal. This will be resolved shortly. Updated: 2019-01-23
But why risk the Ethernet port not being a powered Ethernet jack, and who wants to leave such a cool Raspberry Pi 3B+ platform behind when something with less horsepower could easily do the job? So shortly after the above intrusion device was functional I simply moved the Micro-SD card over to a Raspberry Pi Zero. A regular SD card is shown in the picture for the purpose of scale. The Pi Zero is awesome if you require a low power, small system on a chip (SoC) platform. For those not familiar with the Pi Zero, it’s a $5 single core 1GHz ARM platform that consumes on average 100mW, so it can run for days on a USB battery. Add to that a $14 Ethernet to MicroUSB dongle and again you have a single cable hacking solution that only requires a generic Ethernet port. Of course it still needs a tight black case to keep it neat, but that’s what 3D printers are for.
Now, this solution will burn through its battery in a couple of days, but as a hacker if you’ve not established a solid beachhead in that time then perhaps you should consider another line of work. Some might ask why I’m telling hackers how to do this, but frankly, they’ve known for years, ever since SoC computers first became mainstream. So IT managers beware, solutions like these are more common than you think, and they are leaking into pop culture through shows like Mr. Robot. This particular show has received high marks for technical excellence, and MythBusters would have a hard time finding a flaw. One need only rewatch Season 1 episode 5 to see how a Raspberry Pi could be used to destroy tapes in a facility like Iron Mountain. Sound unrealistic? Then watch this YouTube video, which validates that this specific hack is in fact plausible. The point is no network is safe from a determined hacker, from the CAN bus in your car, to building HVAC systems, to industrial air-gapped control networks. Strong security processes and policies, strict enforcement, and honeypot detection inside the enterprise are all methods to thwart and detect skilled hackers. Updated: 2019-01-27