R.I.P. TCP Offload Engine NICs (TOEs)

Solarflare Delivers Smart NICs for the Masses: Software Definable, Ultra-Scalable, Full Network Telemetry with Built-in Firewall for True Application Segmentation, Standard Ethernet TCP/UDP Compliant

As this blog post by Michael C. Bazarewsky states, Microsoft quietly pulled support for TCP Chimney in its Windows 10 operating system. Chimney was an architecture for offloading the state and responsibility of a TCP connection to a NIC that supported it. The piece cited numerous technical issues and lack of adoption, and Michael’s analysis hits the nail on the head. Goodbye TOE NICs.

During the early years of this millennium, Silicon Valley venture capitalists dumped hundreds of millions of dollars into start-ups that would deliver the next generation of network interface cards at 10Gb/sec using TCP offload engines. Many of these companies collapsed under the weight of trying to develop expensive, complicated silicon that just did not work. Others received a big surprise in 2005 when Microsoft settled with Alacritech over patents Alacritech held covering Microsoft’s Chimney architecture. In a cross-license arrangement with Microsoft and Broadcom, Alacritech received many tens of millions of dollars in licensing fees. Alacritech would later collect tens of millions more in fees from nearly every other NIC vendor implementing a TOE in their designs. At the time, Broadcom was desperate to pave the way for its acquisition of Israel-based Siliquent. Given server OEM pressure, the settlement was a small price to pay for the certain business Broadcom would garner from sales of the Siliquent device. At 1Gb/sec, Broadcom owned an astounding 100% of the server LAN-on-Motherboard (LOM) market, and yet its position was threatened by the onslaught of new, well-funded 10Gb start-ups.

In fact, the feature list for new “Ethernet” enhancements got so full of great ideas that most vendors’ designs relied on a complex “sea of cores” promising extreme flexibility that ultimately proved very difficult to qualify at the server OEMs. Any minor change to one code set would cause the entire design to fail in ways that were extremely difficult to debug, not to mention performance that was miserably poor. Most notably, NetXen, another 10Gb TOE NIC vendor, quickly failed after winning major design-ins at the three big OEMs, ultimately ending in a fire sale to QLogic. Emulex saw the same pot of gold in its acquisition of ServerEngines.

The impetus behind these acquisitions was a move by Cisco to introduce Fibre Channel over Ethernet (FCoE) as a standard to converge networking and storage traffic. Cisco let QLogic and Emulex (Q & E) inside the tent before its Unified Computing System (UCS) server introduction. But the setup took some time. It required a new set of Ethernet standards, now more commonly known as Data Center Bridging (DCB). DCB was a set of physical-layer requirements that attempted to emulate the reliability of TCP by injecting wire protocols that would allow “lossless” transmission of packets. What a break for Q & E! Given the duopoly’s control over the Fibre Channel market, this would surely put both companies in the pole position to take over the Ethernet NIC market. Even Broadcom spent untold millions to develop a Fibre Channel driver that would run on their NIC.

Q & E quickly released what many called the “Frankenstein NIC,” a kludge of Application-Specific Integrated Circuits (ASICs) designed to get a product to market even while both companies struggled to develop a single ASIC, a skill at which neither excelled. Barely achieving its targeted functionality, no design saw much traction. Through all of our customer interactions (over 1,650), we could find only one that had implemented FCoE. This large bank has since retracted its support for FCoE and, in fact, showed a presentation slide several years ago stating it was “moving from FCoE to Ethernet,” an acknowledgment that FCoE was indeed NOT Ethernet.

Alongside TOEs, industry pundits believed that RDMA (Remote Direct Memory Access) was another feature required to reduce latency, and not just for High-Frequency Trading (HFT); it was an acknowledgment that lowering latency was critical to hyper-scale cloud, big data, and storage architectures. However, once again, while intellectually stimulating, using RDMA in any environment proved to be complex and simply incompatible with customers’ applications and existing infrastructures.

The latest RDMA push is to position it as the underlying transport for NVMe over Fabrics (NVMeF). Why? Flash has already reduced the latency of storage access by an order of magnitude, and the next generation of flash devices will reduce latency and increase capacity even further. Whenever there’s a step function in the performance of a particular block of computer architecture, developers come up with new ways to use that capability to drive efficiencies and introduce new and more interesting applications. Much like Moore’s Law, rotating magnetic storage is on its last legs. Several of our most significant customers have already stopped buying rotating media in favor of flash SSDs.

Well… here we go again. RDMA is NOT Ethernet. Despite the “fake news” about running RDMA, RoCE, and iWARP on Ethernet, the largest cloud companies and our large financial services customers have declared that they cannot and will not implement NVMeF using RDMA. It just doesn’t fit their infrastructures or applications. They want low-latency standard Ethernet.

Since our company’s beginning, we’ve never implemented TOEs, RDMA, FCoE, or any of the other great and technically sound ideas for changing Ethernet. Sticking to our guns, we decided to go directly to the market and create the pull for our products. The first market to embrace our approach was High-Frequency Trading (HFT). Over 99% of the world’s volume of electronic trading, in all instruments, runs on our company’s NICs. Why? Customers could test and run our NICs without any application modifications or changes to their infrastructure and realize enormous benefits in latency, jitter, message rate, and robustness… it’s standard Ethernet, and our kernel bypass software has become the industry’s de facto standard.

It’s not that there isn’t room for innovation in server networking; it’s that you have to consider the customer’s ability to adopt and manage that change in a way that isn’t disruptive to their infrastructure while, at the same time, delivering highly valued capabilities.

If companies are looking for innovation in server networking, they need to look for a company that can provide the following:

  • Best-in-class PTP synchronization
  • Ultra-high resolution time stamps for every packet at every line rate
  • A method for lossless, unobtrusive, packet capture and analysis
  • Significant performance improvement in NGINX and LXC Containers
  • A firewall NIC and Application Micro-Segmentation that can control every app, VM, or container with unique security profiles
  • Real, extensive Software Definable Networking (SDN) without agents

In summary, while it’s taken a long time for the industry to overcome its inertia, logic eventually prevailed. Today, companies can benefit from innovations in silicon and software architecture that are in deployment and have been validated by the market. Innovative approaches such as neural-scale networking, designed to meet the high-bandwidth, ultra-low-latency, hardware-based security, telemetry, and massive connectivity needs of ultra-scale computing, are likely the only strategy for achieving a next-generation cloud and data center architecture that can scale, be easily managed, and, maybe most importantly, be secured.

— Russell Stern, CEO Solarflare

Cloaked Data Lakes

Jesse James, when asked why he robbed banks, answered: “Because that’s where the money is.” Today a corporation’s most valuable asset, aside from its people, is its data. For the Star Trek fans among us, imagine if you could engage your data lake’s network cloaking device just before deployment. It would waver out of view and then totally disappear from your enterprise network to all but those responsible for extracting value from it. Your key data scientists and applications could still see and interact with your cloaked data lakes, but to others exploring and scanning the network, it would be entirely invisible, as if it were not even there.

Imagine, if you will, that a Klingon Bird of Prey is cloaked and patrolling the Neutral Zone. Along comes the Federation starship Enterprise, also patrolling the Neutral Zone, but the Federation is actively scanning the quadrant. Since the Klingon ship is cloaked, the Federation can’t detect it, but the moment the Enterprise’s scanners pass over the Bird of Prey it automatically jumps to red alert, energizes its weapons systems, and alters course to shadow the Federation ship. Imagine if the same could be true of an insider threat, or an internal breach via, say, a phishing attack, seeking out your company’s data. The moment someone pings a system or executes a port scan of even one IP address of the servers within your data lake, alarm bells are set off, and no reply is returned. The scanner would see no answer and assume that nothing exists, little knowing the hell that would soon rain down on them.

Your network administrators would then be notified that their new server orchestration system had raised an alert. They’ll quickly see that the attacker is another admin’s workstation, someone who has been suspected of being an insider threat but has been too cagey to nail down. Now it’s 9 PM, and he’s port scanning the exact range of internal network addresses that were set aside a week earlier for this new data lake. He then moves on to softer targets, exfiltrating data from older systems. Little does he know, though, that every server he’s touched in the past week has been tracking and reporting every network flow back to his workstation. Management was just waiting for the perfect piece of evidence, and this attempted port scan, along with all the other network flows, was the final straw.

His plan had been to finish out the week, then quit on Friday and sell all his company’s data to its competitors. He had decided to stay on an extra two weeks when he heard they were standing up a new Hadoop cluster. He figured that it would make a juicy soft target with tons of the newest aggregated data, which could be enormously valuable. What he didn’t know, because he wasn’t invited to those planning meetings, was that the cluster included a new stealth security feature from Solarflare called Active Cloaking. He also wasn’t aware that this feature was the driving reason why many of his company’s servers had been upgraded over the past two weeks to new Solarflare 10GbE NICs with ServerLock.

Since he was a server administrator responsible for some of the older legacy systems, he wasn’t involved in the latest round of network upgrades. While he had noticed that lately some of the newer servers were no longer accessible to him via SSH, what he wasn’t aware of was that every server he touched was now reporting his every move. Worse still, some of those older servers, which had been upgraded with Solarflare ServerLock-enabled NICs, were left as internal SSH/SCP honeypots holding old legacy data of little real value, data that would yield damning evidence once compromised. Tonight proved to be his downfall: his manager and his VP, along with building security, had just entered his cubicle and stated that the police were on their way.

At Black Hat last month, both Solarflare and Cloudwick (CDL) demonstrated ServerLock and data lake cloaking. In September several huge enterprises will begin testing ServerLock, and if you’re an insider threat, consider yourself warned!

1st Ever Firewall in a NIC

Last week at Black Hat, Solarflare issued a press release debuting their SolarSecure solution, which places a firewall directly in your server’s Network Interface Card (NIC). This NIC-based firewall, called ServerLock, not only provides security but also offers agentless, server-based network visibility. This visibility enables you to see all the applications running on every ServerLock-enabled server. You can then quickly and easily develop security policies that can be used for compliance or enforcement. During the Black Hat show setup, we took a 10-minute break to have an on-camera interview with Security Guy Radio that covered some of the key aspects of SolarSecure.

SolarSecure has several unique features not found in any other solution:

  • Security and visibility are handled entirely by the NIC hardware and firmware; there are NO server-side software agents, so the solution is entirely OS-independent.
  • Once the NIC is bound to the centralized manager, it begins reporting traffic flows to the manager, which then displays those flows graphically for admins to easily turn into security policies. Policies can be created for specific applications, enabling application-level network segmentation.
  • Every NIC maintains separate firewall tables for each local IP address hosted on the NIC to avoid potential conflicts from multiple VMs or Containers sharing the same NIC.
  • Each NIC is capable of handling over 5,000 filter table rules along with another 1,000 packet counters that can be attached to rules.
  • Packets transit the rules engine in between 50 and 250 nanoseconds, so the latency hit is negligible.
  • The NIC filters both inbound and outbound packets. Packets dropped as the result of a match to a firewall rule generate an alert on the management console, and inbound packets that are dropped consume ZERO host CPU cycles.

Here is a brief animated explainer video, produced prior to the show, that sets up the problem and explains Solarflare’s solution. We also produced a one-minute demonstration of the management application and its capabilities.

Storage Over TCP in a Flash

By Patrick Dehkordi

Recently Solarflare delivered a TCP transport for Non-Volatile Memory Express (NVMe). The big deal with NVMe is that it’s flash-memory based and often multi-ported, so when these “disk blocks” are transferred over the network, even with TCP, they often arrive 100 times faster than they would coming off spinning media. We’re talking 100 microseconds versus average 15K RPM disk seek times measured in milliseconds. Unlike RoCE or iWARP, a TCP transport provides storage over Ethernet without requiring ANY network infrastructure changes.

It should be noted that this should work with ANY NIC and does not require RDMA, RoCE, iWARP, or any special NIC offload technology. Furthermore, since this is generic TCP/IP over Ethernet, you don’t need to touch your switches to set up Data Center Bridging. You don’t need Data Center Ethernet, Converged Ethernet, or Converged Enhanced Ethernet, just plain old Ethernet. Nor do you need to set things up to use Pause Frames or Priority Flow Control. This is industry-changing stuff, and yet not hard to implement for testing, so I’ve included a recipe below for how to make this work in your lab; it is also cross-posted on GitHub.

At present this is a fork of the v4.11 kernel, which adds two new kconfig options:

  • NVME_TCP : enable initiator support
  • NVME_TARGET_TCP : enable target support

The target requires the nvmet module to be loaded. Configuration is identical to RDMA, except "tcp" should be used for addr_trtype.

The host requires the nvme, nvme_core and nvme_fabrics modules to be loaded. Again, the configuration is identical to RDMA, except -t tcp should be passed to the nvme command line utility instead of -t rdma. This requires a patched version of nvme-cli.

Example assumptions

This is assuming a target IP of 10.0.0.1, a subsystem name of ramdisk, and an underlying block device /dev/ram0. This is further assuming an existing system with a Red Hat/CentOS distribution built on a 3.x kernel.
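
If your system doesn’t already expose /dev/ram0, the stock brd module can provide one. A minimal sketch, with an illustrative size (rd_size is in KB, so the value below yields a 1GB ramdisk):

modprobe brd rd_nr=1 rd_size=1048576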

Building the Linux kernel

For more info refer to https://kernelnewbies.org/KernelBuild

Install or confirm the following packages are installed

yum install gcc make git ctags ncurses-devel openssl-devel

Download, unzip or clone the repo into a local directory

git clone -b v4.11-nvme-of-tcp https://github.com/solarflarecommunications/nvme-of-tcp
cd nvme-of-tcp

Create a .config file or copy the existing .config file into the build directory

cp /boot/config-3.10.0-327.el7.x86_64 .config

Modify the .config to include the relevant NVMe modules

make menuconfig

Under “Block Devices” select, at a minimum, “NVM Express block device,” “NVMe over Fabrics TCP host support,” and “NVMe over Fabrics TCP target support,” then Save and Exit the text-based kernel configuration utility.

Confirm the changes

grep NVME_ .config
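
If the options took, the grep should return lines like the following; whether they read =y or =m depends on whether you selected built-in or module support, and the exact names are assumed from the two kconfig options listed above:

CONFIG_NVME_TCP=y
CONFIG_NVME_TARGET_TCP=y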

Compile and install the kernel

(To save time you can utilize multiple CPUs by including the -j option)

make -j 16
make -j 16 modules_install install 

Confirm that the build is included in the boot menu

(This is dependent on the bootloader being used; for GRUB2:)

cat /boot/grub2/grub.cfg | grep menuentry

Set the build as the default boot option

grub2-set-default 'Red Hat Enterprise Linux Server (4.11.0) 7.x (Maipo)'

Reboot the system

reboot

Confirm that the kernel has been updated:

uname -a 
Linux host.name 4.11.0 #1 SMP date  x86_64 GNU/Linux

NVMe CLI Update

Download the correct version of the NVMe CLI utility that includes TCP:

git clone https://github.com/solarflarecommunications/nvme-cli

Update the NVMe CLI utility:

cd nvme-cli
make
make install

Target setup

Load the target driver

This should automatically load its dependencies, including the nvmet module.

modprobe nvmet_tcp

Set up storage subsystem

mkdir /sys/kernel/config/nvmet/subsystems/ramdisk
echo 1 > /sys/kernel/config/nvmet/subsystems/ramdisk/attr_allow_any_host
mkdir /sys/kernel/config/nvmet/subsystems/ramdisk/namespaces/1
echo -n /dev/ram0 > /sys/kernel/config/nvmet/subsystems/ramdisk/namespaces/1/device_path
echo 1 > /sys/kernel/config/nvmet/subsystems/ramdisk/namespaces/1/enable

Set up port

mkdir /sys/kernel/config/nvmet/ports/1
echo "ipv4" > /sys/kernel/config/nvmet/ports/1/addr_adrfam
echo "tcp" > /sys/kernel/config/nvmet/ports/1/addr_trtype
echo "11345" > /sys/kernel/config/nvmet/ports/1/addr_trsvcid
echo "10.0.0.1" > /sys/kernel/config/nvmet/ports/1/addr_traddr

Associate subsystem with port

ln -s /sys/kernel/config/nvmet/subsystems/ramdisk /sys/kernel/config/nvmet/ports/1/subsystems/ramdisk

Initiator setup

Load the initiator driver

This should automatically load the dependencies: nvme, nvme_core, and nvme_fabrics.

modprobe nvme_tcp

Use the NVMe CLI utility to connect the initiator to the target:

nvme connect -t tcp -a 10.0.0.1 -s 11345 -n ramdisk

lsblk should now show an NVMe device.
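
To double-check the association from the initiator, and to tear it down when finished, the nvme-cli commands below should suffice; the subsystem name matches the ramdisk example used throughout this recipe:

nvme list
nvme disconnect -n ramdisk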

What’s a Smart NIC?

While reading these words, it’s not just your brain doing the processing required to make this feat possible. We’ve all seen over- and under-exposed photos and can appreciate the decision making necessary to achieve a perfectly light-balanced photo. In the laboratory, we’ve observed that the optic nerve connecting the eye to the brain is responsible for measuring the intensity of the light hitting the back of your eye. In response to this data, each optic nerve dynamically adjusts the aperture of the iris in the eye connected to that nerve to optimize these levels. For those with some photography experience, you might recall that there is a direct relationship between aperture (f-stop) and focal length. It also turns out that your optic nerve, after years of training as a child, has come to realize you’re reading text up close, so it is now also responsible for adjusting the muscles around that eye to sharpen your focus on this text. All this data processing is completed before your brain has even registered the first word in the title. Imagine if your brain were responsible for processing all the data and actions required for your body to function properly.

Much like your optic nerve, the difference between a standard Network Interface Card (NIC) and a Smart NIC is how much processing the Smart NIC offloads from the host CPU. Until recently, Smart NICs were designed around Field Programmable Gate Array (FPGA) platforms costing thousands of dollars. As their name implies, FPGAs are designed to accept localized programming that can be easily updated once installed. Now a new breed of Smart NIC is emerging that, while not nearly as flexible as an FPGA, contains several sophisticated capabilities not previously found in NICs and costs only a few hundred dollars. These new affordable Smart NICs can include a firewall for security, a layer 2/3 switch for traffic steering, several performance acceleration techniques, and network visibility, possibly with remote management.

The firewall mentioned above filters all network packets against a table built specifically for each local Internet Protocol (IP) address under its control. An application processing network traffic is required to register a numerical network port, and this port then becomes the internal address for sending and receiving that traffic. Filtering at the application level then becomes a simple process of only permitting traffic for specific numeric network ports. The industry has labeled this “application network segmentation,” and in this instance it is done entirely in the NIC. So how does this assist the host x86 CPU? It turns out that by the point at which operating system software filtering kicks in, the host CPU has often expended over 10K CPU cycles to process a packet. If the packet is then dropped, the cost of that drop is 10K lost host CPU cycles. If that filtering were done in the NIC and the packet dropped there, there would be NO host CPU impact.
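
As a rough analogy only, not Solarflare’s implementation (ServerLock enforces its rules in NIC hardware before the kernel ever sees a packet), the policy model resembles per-address, per-port rules like these standard Linux iptables entries, where the address and port are made-up examples:

iptables -A INPUT -d 10.0.0.5 -p tcp --dport 5432 -j ACCEPT  # permit the one registered application port
iptables -A INPUT -d 10.0.0.5 -j DROP                        # drop all other traffic bound for that IP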

Smart NICs also often have an internal switch used to rapidly steer packets within the server. This steering enables the NIC to move packets to and from interfaces and virtual NIC buffers that can be mapped to applications, virtual machines, or containers. Efficiently steering packets is another offload method that can dramatically reduce host CPU overhead.

Improving overall server performance, often through kernel bypass, has been the province of High-Performance Computing (HPC) for decades. Now it’s available for generic Ethernet and can be applied to existing and off-the-shelf applications. As an example, Solarflare has labeled its family of kernel bypass acceleration techniques Universal Kernel Bypass (UKB). There are two classes of traffic to accelerate: network packet and application sockets based. To speed up network packets, UKB includes an implementation of the Data Plane Development Kit (DPDK) and EtherFabric Virtual Interface (EF_VI); both are designed to deliver high volumes of packets, well into the tens of millions per second, to applications familiar with these Application Programming Interfaces (APIs). For more standard off-the-shelf applications there are several sockets-based acceleration libraries included with UKB: ScaleOut Onload, Onload, and TCPDirect. While ScaleOut Onload (SOO) is free and comes with all Solarflare 8000 series NICs, Onload (OOL) and TCPDirect require an additional license, as they provide microsecond and sub-microsecond 1/2 round-trip network latencies. By comparison, SOO delivers 2-3 microsecond latency, but the real value proposition of SOO is the dramatic reduction in host CPU resources required to move network data. SOO is classified as “zero-copy” because network data is copied once, directly into or out of your application’s buffer. SOO saves the host CPU thousands of instructions, multiple memory copies, and one or more CPU context switches, all of which dramatically improve application performance, often 2-3X, depending on how network-intense an application is.
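
To make the “no code changes” claim concrete: with the Onload family, acceleration is typically applied at launch time by running an existing, unmodified sockets application under the onload launcher or by preloading the library. A minimal sketch, assuming the OpenOnload package is installed; the binary name is a placeholder:

onload ./my_tcp_server                   # run an unmodified sockets app under the onload launcher
LD_PRELOAD=libonload.so ./my_tcp_server  # or preload the acceleration library explicitly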

Finally, Smart NICs can also securely report NIC network traffic flows and packet counts off the NIC to a centralized controller. This controller can then graphically display for network administrators everything that is going on within every server under its management. This is real enterprise visibility, and since only flow metadata and packet counts are shipped off the NIC over a secure TLS link, the impact on the enterprise network is negligible. Imagine all the NICs in all your servers reporting their traffic flows, allowing you to manage and secure those streams in real time, with ZERO host CPU impact. That’s one Smart NIC!

What are Neural Class Networks?

In 1971 Intel released the 4004, their first processor, a 4-bit single-core design, and for the next 34 years that’s pretty much how x86 computing progressed. Sure, they bumped up the architecture and speed as designs and processes improved, but one thing remained constant: a single processing engine. Under pressure in the 1990s from Unix workstations driven by pipelined Reduced Instruction Set Computing (RISC) architectures like IBM’s PowerPC, Sun’s UltraSPARC, and the MIPS R3000, Intel began exploring multi-core architectures.

So from 1971 until 2005, every x86 processor had a single core; life was simple. Intel even provided reference designs so system builders could put two of these processors into the same system. Sure, there were fringe companies that developed Symmetrical Multi-Processing (SMP) systems (e.g., Sequent and NEC) with more than two CPU sockets, but they were large, expensive custom servers not found in general mainstream use. So if you wanted to scale out your computational capacity to tackle a tough problem, you had to rack more servers.

This single-core challenge largely drove commodity Linux clustering, making it all the rage by the turn of the century, particularly in high-performance computing (HPC). It wasn’t uncommon to tightly couple 1,000 or more dual-socket, single-core systems together to tackle a tough computational problem. Government agencies leveraged large clusters to model our nuclear stockpile and computationally secure it. Auto companies crashed dozens of virtual car designs on a daily basis, and oil companies crunched seismic data to compute untapped oil reserves. Then the game changed, and x86 shifted to multi-core processors. Yesterday Intel announced the availability of their new Skylake server platform, but the industry is already refocusing on 2018’s Cascade Lake, a 32-core, 64-thread server chip. So why does all this matter?

As a general rule, Google doesn’t publish or confirm the computational capacity of its data centers. If we pick a specific example, their Oregon data center, we can apply particular assumptions and project that it’s roughly 100,000 servers. Again assuming dual-socket, eight-core, hyper-threaded processors, that’s 100,000 × 2 sockets × 8 cores × 2 threads, or 3.2 million parallel threads of computation tightly coupled in one physical location, potentially addressing a single problem. Structures like this are quickly coming to approximate what took nature millions of years to develop: an organic brain based on the neuron. Neurons on average have 7,000 connections to both local and remote neurons within their system; that’s a considerable amount of networking per single computational unit.

By contrast, the common cockroach has one million neurons, and that frog you played with as a kid sports 16 million. So if we equate a neuron to a single thread of execution on an x86, that would put a Google data center somewhere between the cockroach and the frog. If in 2018 Google were to upgrade the Oregon facility to Cascade Lakes, it would still be only 12.8 million threads (100,000 × 2 × 32 × 2), still less than a frog. Given the geometric growth in core counts, though, it will be only a few decades before Google is deploying data centers approaching the capability of the 86 billion neurons found in the human brain. Oh, and that’s assuming an x86 thread and a neuron are even computationally similar.

So what is Neural Class Networking? As mentioned above, a neuron on average is connected to 7,000 other neurons. Now imagine if every hardware thread of execution in your server were networked to even 64 external threads of execution on related systems working on the same problem. Today we have servers with typically 32 threads of execution, and Solarflare’s newest generation of XtremeScale Smart NICs provides 2,048 virtual NICs, so each of those 32 threads can sustain 64 dedicated hardware paths to other external threads. That’s the start of Neural Class Networking.

Capture, Analytics, and Solutions

My first co-op job at IBM Research back in 1984 was to help roll out the IBM PC to the company’s best and brightest. It wasn’t long into that position, perhaps a month, when we noticed a large number of monochrome monitors had a consistent burn-in pattern: a horizontal bar across the top and a vertical bar on the left. Now I was not new to personal computers, having purchased my own TRS-80 Model III a year earlier, but it was apparent that a vast number within Research were all being used to do the same thing. So I asked; the answer was VisiCalc (MS Excel’s great-grandfather).

At the time personal computers were still scarce, and seeing one outside a Fortune 100 company was akin to a unicorn sighting, so the subtlety between tool and solution was lost on this 21-year-old brain. Shortly after the cause of the burn-in was understood, I had the opportunity to have lunch with one of these researchers and discuss his use of VisiCalc. That single IBM PC in his lab existed to turn experimental raw data that it directly collected into comprehensible observations. He had a terminal in his office for email and document creation, so this system existed entirely to run two programs: one that collected raw data from his lab equipment, and VisiCalc, which translated that data into understandable information that helped him make sense of our world.

Recently Solarflare began adding analytics packages to their open network packet capture platform as they ready it for general availability later this month. The first of these packages is Trading Analytics, and it wasn’t until I recently saw an actual demo of this application that its real value kicked in, much like the VisiCalc ah-ha moment mentioned above. But perhaps we need some additional context.

Imagine a video surveillance system at a large national airport. Someone in security can easily track a single passenger from the curb to the gate without ever losing sight of them. Now extend that to a computer network where the network packets are the people. The key value proposition of a distributed open network packet capture platform is the capability to gather copies of what happened at various points in your network while preserving the exact moment in time that the data was at each location. Now imagine that data being how a company trades electronically on several different stock exchanges: the capability to visualize the exact moment an exchange notified you that IBM stock was trading at $150/share, and then the precise instant each of the various pieces of your trading infrastructure kicked in, every step of the way. The value of seeing the entire transaction with nanosecond resolution can then be monetized, especially when you can see this performance over hundreds, thousands, or even millions of trades. Given the proper visualization and reports, this data can empower you to revisit the slow stages in your trading platform to further wring out any latency that might result in lost market opportunities.

So what is analytics? Well, first it’s the programming responsible for decoding the market data, that is, translating all the binary data contained in a captured network packet into humanly meaningful data. While this may sound trivial, trust me, it isn’t: there are dozens of different network protocol formats and hundreds of exchanges worldwide that leverage them, so it can quickly become a can of worms. Solarflare’s package translates over two dozen different network protocols, for example NYSE ARCA, CME FIX, and FAST, into a single humanly readable framework. Next, it understands how to read and use the highly accurate timestamps attached to each network packet representing when those packets arrived at a given collection point. Then it correlates all the network sequence ID numbers and internal message ID numbers along with those timestamps so it can align a trade across your entire environment and show you where it was at each precise moment in time. Finally, and this is the most important part, it can report and display the results in many different ways so that the output fits easily into a methodology familiar to its consumer.

So what differentiates a tool from a solution? Some will argue that VisiCalc and Solarflare’s Trading Analytics are nothing more than a sophisticated set of fancy hammers and chisels, but they are much more. A solution adds significant, and measurable, value to a process by removing the mundane and menial tasks from it, thereby allowing us to focus on the real tasks that require our intellect.

Technology Evangelist Podcast

Well, after a few weeks of planning and preparation, the Technology Evangelist Podcast is finally available. This podcast will focus on bringing the engineers, marketing, and sales folks on the cutting edge of technology to the mic to explain it.

Our first episode features Ron Miller, the CTO of Cloudwick, talking about “Hadoop and Securing Hadoop Clusters.” Ron is an expert in cyber security, having founded Mirage Networks in 2003. We’re honored to have Ron share with us some background on Hadoop and how one might secure Hadoop clusters.

In our second episode, Mark Zeller joined us to talk about Non-Volatile Memory Express (NVMe) and how it will replace spinning disks over the years to come. We touch on the benefits of this technology, talk about erasure coding, and review where the technology is headed. This episode has been recorded and is pending final approval.

Yesterday, Saturday, June 10th, Bob Van Valzah had some time to stop by and discuss electronic trading. This episode covers such topics as what trading is, the race to zero, dark pools, and the book Flash Boys. This episode has also been recorded and is pending final approval.

Four Container Networking Benefits

Container networking is walking in the footsteps taken by virtualization over a decade ago. Still, networking is a non-trivial task, as there are both underlay and overlay networks one needs to consider. Underlay networks like bridge, MACVLAN, and IPVLAN are designed to map physical ports on the server to containers with as little overhead as possible. There are also overlay networks that require packet-level encapsulation using technologies like VXLAN and NVGRE to accomplish the same goals. Anytime network packets have to flow through hypervisors or layers of virtualization, performance will suffer. Towards that end, Solarflare is now providing the following four benefits for those leveraging containers.

  1. NGINX Plus running in a container can now utilize ScaleOut Onload. In doing so, NGINX Plus achieves a 40% improvement in performance over standard host networking. With the introduction of Universal Kernel Bypass (UKB), Solarflare is now including, for FREE, both DPDK and ScaleOut Onload with all their base 8000 series adapters. This means that people wanting to improve application performance should seriously consider testing ScaleOut Onload.
  2. For those looking to leverage orchestration platforms like Kubernetes, Solarflare has provided the kernel organization with an Advanced Receive Flow Steering driver. This new driver improves performance in all the above-mentioned underlay networking configurations by ensuring that packets destined for containers are quickly and efficiently delivered to that container.
  3. At the end of July, during the Black Hat Cyber Security conference, Solarflare will demonstrate a new security solution. This solution will secure all traffic to and from containers with enterprise-unique IP addresses via a hardware firewall in the NIC.
  4. Early this fall, as part of Solarflare’s Container Initiative, they will be delivering an updated version of ScaleOut Onload that leverages MACVLANs and supports multiple network namespaces (see the MACVLAN sketch after this list). This version should further improve both performance and security.
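
For context on what a MACVLAN underlay looks like in practice, here is a hedged sketch using stock Docker rather than any Solarflare-specific tooling; the subnet, gateway, parent interface, and network name are placeholders for your environment:

docker network create -d macvlan \
  --subnet=192.168.10.0/24 --gateway=192.168.10.1 \
  -o parent=eth0 macnet                # map the eth0 underlay to a container network
docker run --rm -it --network=macnet alpine ip addr  # a container on macnet gets an address on the underlay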

To learn more about all the above, and to gain NGINX, Red Hat, and Penguin Computing’s perspectives on containers, please consider attending Contain NY next Tuesday on Wall St. You can click here to learn more.

Four Failings of RoCE

Recently someone suggested that I watch this rather informative video on how Microsoft Research attempted to make RDMA over Converged Ethernet (RoCE) lossless. Unbelievably, this video exposes and documents several serious flaws in the design of RoCE. Also, it appears they’ve replaced the word “Converged” with “Commodity” to push the message that RoCE doesn’t require anything special to run on regular old Ethernet. Here are the four points I got out of the video; please let me know your take:

  • RDMA Livelock: This is a simple problem of retransmitting. Since RDMA was architected for a lossless, deterministic local bus architecture, accommodations were never made for dropped packets; they just didn’t happen on a bus. Ethernet, on the other hand, was designed to expect loss, remember vampire taps. Livelock occurs when a message composed of multiple packets experiences a dropped packet somewhere in the middle. At this point, RDMA has to start over from the first packet and retransmit the whole message. If this were a multi-megabyte frame of video, this retransmit approach would livelock a network. So what was Microsoft’s solution? Rewrite the RDMA stack’s retransmit logic to retransmit starting at the dropped packet (this is what TCP does). Good luck, who’s got this action item?
  • Priority Flow Control (PFC) Deadlock: This happens when switches encounter incomplete ARP packets. Microsoft’s solution is a call for more research and to filter incomplete ARP packets. More to-dos, and this one is on all the switch vendors.
  • NIC PFC Storm: It seems that the firmware in some RoCE NICs has bugs that create Pause Frame storms. Beyond NIC vendors fixing those bugs, they also suggest that NIC and switch vendors include extra new software to detect oncoming storms and shut them down. Great idea; another to-do for the anonymous NIC and switch providers.
  • Slow Receiver NICs: Some RoCE NICs generate excessive pause frames because their RDMA architecture relies on second-level, host-based translation tables to fetch destination memory addresses. Oh my god, this is how you design an HPC NIC? Seriously, how cheap can you be? Make the lookup tables bigger; Myricom addressed this problem back in the 1990s. It appears that on some RoCE NICs it’s not that hard to have so many receivers of kernel-bypassed packets that the NIC must go off-NIC for the destination memory address lookups.

As the speaker closes out the discussion, he says, “This experiment shows that even with RDMA, low latency and high throughput cannot be achieved at the same time, as network congestion can cause queues to build up in the network.” Anyone who has done this for a while knows that low latency and high bandwidth are mutually exclusive. That’s why High-Performance Computing (HPC) tests often start with zero-byte packets and then scale up to demonstrate how latency increases proportionally with packet size.

All the above aside, this important question remains: why would anyone map a protocol like RDMA, which was designed for use on a lossless local bus, onto a switched network and think it would work? A local lossless bus is very deterministic, with requirements bound to its lossless nature and predictable performance. Conversely, Ethernet was designed from the beginning to expect and accommodate loss, and performance has always been secondary to packet delivery. That’s why Ethernet performance is non-deterministic. The resilience of Ethernet, not performance, was the primary design criterion DARPA had mandated to ensure our military’s network would remain functional at all costs.

Soon Solarflare will begin shipping ScaleOut Onload free with all their 8000 series NICs, some of which sell for under $300 USD. With ScaleOut Onload, TCP now has all the kernel bypass tricks RDMA offers, but with the benefits and compatibility of sockets-based TCP and no code changes. Furthermore, it delivers the performance of RDMA, but with much better reliability and availability than RoCE.

P.S. Mellanox just informed me that the NIC specific issues mentioned above were corrected some time ago in their ConnectX-4 series cards.