Density Comes to 10GbE

Over the past 30 years of Ethernet, we’ve seen at least five major generations. As each generation matures, typically around the third major wave of silicon, we invariably see quad-port adapters emerge on the market. Over the past six months, each of the major 10GbE server adapter providers has offered a quad-port card. In Four Reasons Why 10GbE NIC Design Matters, I talked about the importance of design in the Ethernet controller chips that produce features like VNICs and Physical and Virtual Functions.

The street price for Solarflare & Intel adapters is now around $650, which is very reasonable; that is typically only 50% higher than the companies’ more popular dual-port 10GbE offerings. Note that due to the size of the SFP+ cages, all of these cards are offered ONLY as full-height adapters. Since every card uses at least a third-generation Ethernet controller chip, they are all single-chip PCIe Gen 3 solutions, and typically low power. What’s even more important, though, is the level of sophistication of these adapters. With the rise in popularity of virtualization, all four of them have poured substantial silicon resources into supporting this aspect of data center computing.

It should also be noted that multi-chip quad-port 10G solutions do exist from tier-2 Ethernet server adapter vendors: ATTO, Silicom, SmallTree and Interface Masters. All of these products are designed around Intel’s previous-generation 82599 Ethernet controller or its low-power variant. Most of these products were produced for the video editing market, where directly attaching editing workstations to servers is popular. In that industry port density on the video editing servers is the primary concern, with price and performance trailing. These adapters run anywhere from $1,100 to $1,700, roughly double the price of the single-chip cards discussed above. One vendor even chose to violate the PCIe faceplate form-factor spacing standards to produce a line of low-profile cards that only work in a select set of servers. This was done to support an appliance product, but they are also retailing them as non-conforming cards, so be careful.

So as you prepare for that future datacenter upgrade, now you know who the key players are in the dense 10G market. Please consider giving Solarflare an opportunity to earn your business. To request a Proof-of-Concept with the new SFN7004F please contact Solarflare.

Servers Can Protect Themselves From a DDoS Attack

Solarflare is completing SolarSecure Server Defense, a Docker container housing a state-of-the-art threat detection and mitigation system. This system dynamically detects new threats and updates the filters applied to all network packets traversing the kernel network device driver, in an effort to fend off future attacks in real time without direct human intervention. To do this, Solarflare has employed four technologies: OpenOnload, SolarCapture Live, Bro Network Security Monitor, and the SolarSecure Filter Engine.

OpenOnload provides an OS Bypass means of shunting copies of all packets that make it past the current filter set to SolarCapture. SolarCapture provides a Libpcap framework for packet capture, which then hands these copied packets on to Bro for analysis. Bro then applies a series of scripts to each packet, and if a script detects a hit it raises an event. Each class of event triggers a special SolarSecure Filter Engine script, which creates a new network packet filter. This filter is then loaded in real time into the packet filter engine of the network adapter’s kernel device driver, to be applied to all future network packets. Finally, Server Defense can alert your admins as new rules are created on each server across your infrastructure.
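
The flow from detection event to new filter can be pictured with a small sketch. The sketch below is illustrative Python only; the event fields, filter format, and load_filter() hook are hypothetical stand-ins for Bro's event scripts and the SolarSecure Filter Engine, not Solarflare's actual API.

    from dataclasses import dataclass

    @dataclass
    class DetectionEvent:
        kind: str        # e.g. "syn_flood", raised by a Bro script
        src_ip: str      # offending source that Bro observed
        dst_ip: str
        dst_port: int

    def event_to_filter(evt: DetectionEvent) -> dict:
        # Translate a detection event into a 5-tuple drop rule.
        return {"action": "drop", "src_ip": evt.src_ip, "dst_ip": evt.dst_ip,
                "dst_port": evt.dst_port, "protocol": "tcp"}

    def load_filter(rule: dict) -> None:
        # Hypothetical hook: the real product pushes the rule into the packet
        # filter engine of the adapter's kernel network device driver.
        print("loading filter into driver:", rule)

    # A script flags a SYN flood; a drop filter is created and loaded.
    load_filter(event_to_filter(DetectionEvent("syn_flood", "203.0.113.7", "10.0.0.5", 443)))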

SolarSecure Server Defense inspects all inbound, outbound, container-to-container, and VM-to-VM packets on the same physical server, and filters are applied to every packet. This uniquely positions Solarflare Server Defense as the only containerized cyber defense solution designed to protect each individual server, VM or container within an enterprise from a wide class of threats, ranging from a simple SYN flood to a sophisticated DDoS attack. Even more compelling, it can defend against attacks originating from inside the same physical network, behind your existing perimeter defenses. It can even defend one VM from an attack launched by another VM on the same physical server!

To learn more please contact Scott Schweitzer at Solarflare.

Performance Single Chip Quad-Port 10G SFP+ NIC

The ASIC currently driving Solarflare’s dual-port 40GbE card is actually a dual-core package, with each core supporting two physical network interfaces. This ASIC has now been placed on a new board designed to support four SFP+ 10GbE ports with a typical power consumption of 13 watts. In adherence with the PCIe physical specifications and the SFP+ cages available to designers, this had to be brought to market as a full-height PCIe adapter. This full-height PCI Express network server adapter utilizes a PCIe generation 3 interface with eight lanes. That PCIe bus configuration is good for roughly 52Gbps in each direction, so it’s a perfect match for Solarflare’s new quad-port 10GbE adapter, the SFN7004F. There is also a version available with Solarflare’s OS Bypass layer, the SFN7124F.
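
For those who like to see the math, here is a quick back-of-the-envelope sketch of that PCIe budget in Python; the ~18% protocol overhead figure is an assumption on my part, since real PCIe efficiency varies with payload size.

    lanes = 8
    raw_per_lane_gbps = 8.0 * (128 / 130)   # PCIe Gen 3: 8 GT/s with 128b/130b encoding
    raw_gbps = lanes * raw_per_lane_gbps    # ~63 Gbps raw, per direction
    effective_gbps = raw_gbps * 0.82        # assume ~18% lost to TLP/DLLP overhead
    print("effective PCIe Gen 3 x8: ~%.0f Gbps per direction" % effective_gbps)  # ~52 Gbps
    print("four 10GbE ports need  :  40 Gbps per direction")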

So if you’re doing ultra-dense 10G deployments in the data center, cloud, enterprise, etc., you now have a new, affordable quad-port option available in the SFN7004F.

To learn more contact Scott Schweitzer at Solarflare.

Detecting Data Breaches in Real Time

Not a week passes without news of another company announcing a data security breach. Many of these breaches start with the Point of Sale (POS) systems, but as we saw with Anthem, Sony and Edward Snowden, that isn’t always the case. Regardless of where the breach starts, nearly all of the valuable data lost flows through, and eventually out of, the enterprise. Imagine if a small team of clowns walked into your business in the middle of the day, went straight to your server room, pulled out big clown scissors, cut all the cables front and back on your servers, and proceeded to carry them out to their clown car. Certainly employees would question what was going on, and surely someone would stop them before the servers actually left the building. Today that’s exactly what’s happening; only the clowns are black-hat hackers acting remotely.

All companies have firewalls, many have intrusion detection systems, and some install intrusion prevention systems, but does your company capture and analyze all the traffic flows entering and leaving your enterprise? Even more daunting, imagine capturing all of the flows within your company, then scrubbing that data looking for unique traffic patterns, perhaps in real time? At the end of December Norse specifically identified the Sony employee who was laid off in May, and who departed with tens of gigabytes of Sony movies and digital assets. This employee was someone in IT, possibly very much like you, who had access to many of the digital security certificates, admin IDs and passwords within Sony; many of those items were included in files and spreadsheets that the Guardians of Peace released. Sony knew months before that they were separating people from the business; had they been looking for unusual internal network traffic patterns, they might very well have been able to thwart this digital theft.

For the rest of the story please visit the full article in this month’s issue of Cyber Defense Magazine.

3X Better Performance with Nginx

Recently Solarflare concluded some testing with Nginx that measured the amount of traffic Nginx could respond to before it started dropping requests. We then scaled up the number of cores provided to Nginx to see how additional compute resources impacted the servicing of web page requests, and this is the resulting graph:


As you can see from the above graph, most NIC implementations require about six cores to achieve 80% wire rate. The major difference highlighted in this graph, though, is that with a Solarflare adapter and its OpenOnload OS Bypass driver, Nginx can achieve 90% wire-rate performance utilizing ONLY two cores versus six. Note that the comparison is with Intel’s most current 10G NIC, the X710.

What’s interesting here, though, is that OpenOnload can internally bond together up to six 10G links before a configuration file change is required to support more. This could mean that a single 12-core server running a single Nginx instance should be able to adequately service 90% wire rate across all six 10G links, or theoretically 54Gbps of web page traffic. Now, of course, this assumes everything is in memory and the rest of the system is properly tuned. Viewed another way, this is 4.5Gbps/core of web traffic serviced by Nginx running with OpenOnload on a Solarflare adapter, compared to 1.4Gbps/core of web traffic with an Intel 10G NIC. That is a 3X gain in performance for Solarflare over Intel; how is this possible?

Simple: OpenOnload is a user-space stack that communicates directly with the network adapter in the most efficient manner possible to service UDP & TCP requests. The latest version of OpenOnload has also been tuned to address the C10K problem. What’s important to note is that by bypassing the Linux kernel to service these communication requests, Solarflare reduces the number of kernel context switches per core and memory copies, and can more effectively utilize the processor cache. All of this translates to more available cycles for Nginx on each and every core.
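
Going back to the per-core figures above, the arithmetic is simple enough to write out; the sketch below assumes 90% of six bonded 10G links on 12 cores for the OpenOnload case and roughly 80% of one 10G link on six cores for the kernel-stack case.

    onload_gbps, onload_cores = 6 * 10 * 0.90, 12   # six bonded 10G links at 90% wire rate
    kernel_gbps, kernel_cores = 10 * 0.80, 6        # one 10G link at ~80% wire rate
    per_core_onload = onload_gbps / onload_cores    # ~4.5 Gbps/core
    per_core_kernel = kernel_gbps / kernel_cores    # ~1.3-1.4 Gbps/core
    print("OpenOnload: %.1f Gbps/core, kernel stack: %.1f Gbps/core, ratio: %.1fx"
          % (per_core_onload, per_core_kernel, per_core_onload / per_core_kernel))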

To drive this point home further, we ran an additional test showing the performance gains OpenOnload (OOL) delivered to Nginx on 40GbE. Here you can see that the OS limits Nginx on a 10-core system to servicing about 15Gbps. With the addition of just OpenOnload to Nginx, that number jumps to 45Gbps. Again, another 3X gain in performance.

If you have web servers today running Nginx and you want to give them a gargantuan boost in performance, please consider Solarflare and their OpenOnload technology. Imagine taking an existing web server that has been running on a single Intel X520 dual-port 10G card, replacing it with a Solarflare SFN7122F card, installing the OpenOnload drivers, and seeing a 3X boost in performance. This is a fantastic way to breathe new life into existing installed web servers. Please consider contacting Solarflare today to do a 10G OpenOnload proof of concept so you can see these performance gains for yourself firsthand.

A 10GbE Capture Platform: Snort, Bro, Suricata & Wireshark

Perhaps you’re responsible for your company’s network security, or maybe you’re designing an appliance for your business? If so, then you’ve likely already become familiar with Snort, Bro, Suricata, and Wireshark. As you may have recently discovered, for real performance with these applications at 10GbE speeds you need the proper adapter and capture driver, or you risk dropping vast numbers of packets. Furthermore, it is now possible to capture a copy of not only the packets received on the server but also those transmitted, all while running your programs on other cores within the same server. And if you’re interested in capturing all the received, transmitted, and virtual machine (VM) to VM traffic within your server, you can designate one VM to capture a copy of all the network traffic for analysis. To further sweeten things, this can also be done from within a Docker container built to handle capture.

Some might ask why you would want to capture transmitted packets and run them through Snort, Bro or Suricata. Simple: to look for outbound traffic patterns that might indicate a breach. Perhaps a VM on one of your servers has been compromised, and it is sending out your company’s precious AutoCAD files in the middle of the night to a country in Asia you don’t do business with. If you’re not looking at transmitted packets, you may never detect, or stop, a breach of this nature. Setting up rules to look for file transfers of specific types, during specific times, or conforming to other criteria specific to outbound traffic is a fairly new trend. Also, this capture doesn’t have to be of packets on your own server; you can take a more traditional approach and dedicate a server for capture in every rack, then feed it from an optical tap or the spanning port off a switch. In fact, you can install multiple adapters and aggregate the ports together until you hit the performance limits of your system.
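
To make the outbound-rule idea above concrete, here is a generic illustration in Python rather than an actual Snort, Bro or Suricata rule; the watched extensions, networks and hours are hypothetical examples, not recommended values.

    from datetime import datetime
    from ipaddress import ip_address, ip_network

    WATCHED_EXTENSIONS = {".dwg", ".dxf"}                  # e.g. AutoCAD file types
    UNEXPECTED_NETWORKS = [ip_network("198.51.100.0/24")]  # networks we never ship data to
    OFF_HOURS = range(0, 6)                                # midnight to 6 a.m. local time

    def suspicious_outbound(filename, dst_ip, when):
        ext_hit = any(filename.lower().endswith(e) for e in WATCHED_EXTENSIONS)
        net_hit = any(ip_address(dst_ip) in net for net in UNEXPECTED_NETWORKS)
        time_hit = when.hour in OFF_HOURS
        return ext_hit and net_hit and time_hit            # all three criteria -> raise an alert

    print(suspicious_outbound("plant_layout.dwg", "198.51.100.20", datetime(2015, 3, 1, 2, 30)))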

Most accelerated 10G capture platforms require both a performance adapter and a special-purpose capture driver. Furthermore, to capture both received and transmitted packets in parallel you have only one choice, and that is an adapter and software from Solarflare. You can start with Solarflare’s Flareon Ultra SFN7122F adapter and a SolarCapture Live license, and, as your needs grow, scale to their dual-port 40G adapter, the SFN7142Q.

Solarflare provides this high-performance capture platform designed specifically for engineers looking to build leading-edge security solutions. Let’s take a closer look at the adapter and software. The network server adapter, the Solarflare SFN7122F, is a board built around a single-core Solarflare Ethernet controller chip. The Ethernet controller core on this chip has multiple packet engines, each dedicated to processing received or transmitted packets. This enables the SFN7122F adapter to support wire-rate lossless packet capture, even with huge bursts of the smallest-sized packets (64 bytes each) on a single port. This dedication of resources enables transmitting wire-rate 64-byte packets at the same time, on the same interface, and in parallel, without impacting capture performance. Furthermore, the SFN7142Q utilizes the same Ethernet controller, but with two of these cores on the same chip, so it can support capture on two 40G ports, or four 10G ports, or wire-rate lossless capture on two 10G ports.

The next component in this platform is SolarCapture Live (SCL), which provides a complete Libpcap replacement library and a Snort DAQ interface. This provides two fairly seamless methods for connecting to Snort. If SCL is initialized in cluster mode it can spawn multiple capture instances, up to one per core, and deliver all network packets in Libpcap format spread across these cores. SCL then uses advanced receive flow steering to flow-hash the packets across all of these capture nodes within the capture cluster. Flow-hashing is the process of looking at several key fields in the packet header and then consistently routing all the traffic from a given flow to the same cluster node (core), so security applications like Snort, Suricata and Bro always see all the data for that specific network flow.
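
Flow-hashing itself can be sketched in a few lines of Python; the hash function and tuple below are a simplified illustration of the idea, not SolarCapture's actual algorithm.

    import zlib

    NUM_CAPTURE_CORES = 8

    def flow_to_core(src_ip, dst_ip, src_port, dst_port, proto):
        # Hash the 5-tuple so every packet of a flow lands on the same capture core.
        key = ("%s|%s|%d|%d|%s" % (src_ip, dst_ip, src_port, dst_port, proto)).encode()
        return zlib.crc32(key) % NUM_CAPTURE_CORES

    # All packets of this TCP connection hash to the same core, so the Snort, Bro
    # or Suricata instance on that core sees the whole conversation.
    print(flow_to_core("10.0.0.5", "192.0.2.9", 51512, 443, "tcp"))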

This Solarflare capture platform also supports an optional Solarflare Precision Time Protocol (PTP) software license that can accept an external hardware Pulse Per Second (PPS) signal via an optional bracket kit, which provides the mini-BNC connectors needed to attach the adapter to an external master clock. Unlike similar adapters, this optional PCIe faceplate has a second mini-BNC connector to support daisy-chaining the clock signal out of the adapter into another adapter. These Solarflare adapters include a highly precise Stratum 3 clock chip, which ensures that time stamping is accurate to within 100 nanoseconds of the PTP master; precision time stamping of this kind is typically only available on much more expensive FPGA-based adapters. Furthermore, the PTP license enables time stamping for the capture of both received and transmitted packets, so you can use it to measure application performance. Additionally, Solarflare’s 100-nanosecond precision is 15X more precise than a competing adapter at a similar price point that only captures and time stamps inbound packets.

So if you’re looking to get into packet capture for security monitoring or performance analysis, please consider contacting Solarflare, and ask about their SFN7122F with SolarCapture Live. You’ll be pleasantly surprised at how well it performs when compared to the much more expensive FPGA based solutions which sell for 5X or more the price of this unique bundle.

10G/40G NIC Partitioning Using SR-IOV or PF-IOV Modes

Partitioning a network interface card (NIC) so that multiple virtual machines (VMs) can use it at the same time has always been challenging. Last week Solarflare released an updated network device driver for Linux that now supports Single Root I/O Virtualization (SR-IOV) and Physical Function I/O Virtualization (PF-IOV) modes. The actual details for setting up the adapter to leverage these modes can be found in the Solarflare Adapter User Guide.

Borrowing from the User Guide:

SR‐IOV enabled on Solarflare adapters provides accelerated cut‐through performance and is fully compatible with hypervisor based services and management tools. The advanced design of the Solarflare SFN7000 series adapter incorporates a number of features to support SR‐IOV. These features can be summarized as follows:

Multiple PCIe Physical Functions (PF).

Each physical port on the dual‐port 10G or 40G adapter can be exposed to the OS as multiple physical functions. A total of 16 PFs are supported per adapter with each having a unique MAC address.

PCIe Virtual Functions (VF).

A PF can support a configurable number of PCIe virtual functions. In total 240 VFs can be allocated between the PFs. The adapter can also support a total of 2048 MSI‐X interrupts.

Layer 2 Switching Capability.

A layer 2 switch configured in firmware supports the transport of network packets between PCI physical functions (PF) and Virtual functions (VF). This allows received packets to be replicated across multiple PFs/VFs and allows packets transmitted from one PF to be received on another PF or VF.

  • On a 10GbE dual-port adapter each physical port can be exposed as a maximum of 8 PFs.
  • On a 40GbE dual-port adapter (in 2*40G mode) each physical port can be exposed as a maximum of 8 PFs.
  • On a 40GbE dual-port adapter (in 4*10G mode) each physical port can be exposed as a maximum of 4 PFs.

All of this allows each VM to receive its own virtual NIC. This enables applications to communicate directly with the wire via Hypervisor Bypass, PCI Passthrough or DirectPath I/O, which restores native network performance to VMs. So with dual-port 10G adapters, Solarflare provides support for 8 PFs per physical port and a total of 240 VFs per adapter.
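
On a Linux host the usual way to carve out VFs is through sysfs. The sketch below assumes the Solarflare PF shows up as eth2 and that the driver honors the generic sriov_numvfs interface; consult the Solarflare Adapter User Guide for the supported procedure.

    from pathlib import Path

    PF_INTERFACE = "eth2"    # hypothetical name of the Solarflare physical function
    dev = Path("/sys/class/net") / PF_INTERFACE / "device"

    print("VFs supported:", (dev / "sriov_totalvfs").read_text().strip())
    (dev / "sriov_numvfs").write_text("8")   # request 8 VFs; hand them to VMs afterwards
                                             # via PCI passthrough / DirectPath I/O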

Furthermore, Solarflare also supports Kernel Bypass alongside Hypervisor Bypass to deliver near bare-metal, ultra-low latency from within a VM.

To learn more drop me an email.

Three Mellanox Marketing Misrepresentations

So Mellanox’s ConnectX-4 line of adapters is hitting the street, and as always tall tales are being told, or rather blogged, about the amazing performance of these adapters. As is Mellanox’s strategy, they intentionally position Infiniband’s numbers to imply that they are the same on Ethernet, which they’re not. Claims of 700 nanoseconds of latency, 100Gbps, and 150M messages per second. Wow, a triple threat: low latency, high bandwidth, and an awesome message rate. So where does this come from? How about the second paragraph of Mellanox’s own press release for this new product: “Mellanox’s ConnectX-4 VPI adapter delivers 10, 20, 25, 40, 50, 56 and 100Gb/s throughput supporting both the InfiniBand and the Ethernet standard protocols, and the flexibility to connect any CPU architecture – x86, GPU, POWER, ARM, FPGA and more. With world-class performance at 150 million messages per second, a latency of 0.7usec, and smart acceleration engines such as RDMA, GPUDirect, and SR-IOV, ConnectX-4 will enable the most efficient compute and storage platforms.” It’s easy to understand how one might actually think that all the above numbers also pertain to Ethernet, and by extension UDP & TCP. Nothing could be further from the truth.

From Mellanox’s own website on February 14, 2015: “Mellanox MTNIC Ethernet driver support for Linux, Microsoft Windows, and VMware ESXi are based on the ConnectX® EN 10GbE and 40GbE NIC only.” So clearly all the above numbers are INFINIBAND ONLY; today, three months after the above press release, the fastest Ethernet Mellanox supports is still 40GbE, and this is done with their own standard OS driver only. This by design will always limit things like packet rate to 3-4Mpps, and latency to somewhere around 10,000 nanoseconds, not 700. Bandwidth could be directly OS-limited, but I’ve yet to see that, so on these 100Gbps adapters Mellanox might support something approaching 40Gbps per port.

So let’s imagine that someday in the distant future the gang at Mellanox delivers an OS-bypass driver for the ConnectX-4 and that it does support 100Gbps. What we’ll see is that, like the prior versions of ConnectX, this is Mellanox’s answer to doing both Infiniband & Ethernet on the same adapter, a trick they picked up from the now-defunct Myricom, which achieved this back in 2005 by delivering both Myrinet & 10G Ethernet on the same Layer-1 media. This trick allows Mellanox to ship a single adapter that can be used with two totally different driver stacks to deliver Infiniband traffic over an Infiniband hardware fabric, or Ethernet over traditional switches, directly to applications or the OS kernel. This simplifies things for Mellanox, OEMs, and distributors, but not for customers.

Suppose I told you I had a car that could reach 330MPH in 1,000 feet. Pretty impressive. Would you expect that same car to work on the highway? Probably not. How about on a NASCAR track? No, because those who really know auto racing immediately realize I’m talking about a beast that burns five gallons of nitromethane in four seconds: yes, a 0.04MPG top-fuel dragster. This class of racing is analogous to High-Performance Computing (HPC), where Infiniband is king and the problem domain is extremely well defined. In HPC we measure latency using zero-byte packets and often attach adapters back to back without a switch to measure perceived network system latency. So while 700 nanoseconds of latency sounds impressive, it should be noted that no end-user data is passed during this test at this speed, just empty packets to prove the performance of the transport layer. In production, you can’t actually use zero-byte packets because they’re simply the digital equivalent of sealed empty envelopes. Also, to see this 700 nanoseconds you’ll need to be running Infiniband on both ends, along with an Infiniband-supported driver stack that bypasses the operating system; note this DOES NOT support traditional UDP or TCP communications. And to get anything near 700 nanoseconds you have to be using Infiniband RDMA functions, back to back between two systems without a network switch, and with no real data transferred; it is a synthetic measurement of the fabric’s performance.

The world of performance Ethernet is more like NASCAR, where cars typically do 200MPH and run races measured in the hundreds of miles around closed-loop tracks. Here the cars have to shift gears, brake, run for extended periods of time, refuel, and handle rapid tire changes and maintenance during the race. This is not the same as running a top-fuel drag racer once down a straight 1,000-foot track. The problem is that Mellanox is notorious for quoting their top-fuel-dragster Infiniband HPC numbers to potential NASCAR-class high-performance Ethernet customers, believing many will NEVER know the difference. Several years ago Mellanox had their own high-performance OS-bypass Ethernet stack that supported UDP & TCP, called VMA (Voltaire Messaging Accelerator), but it was so fraught with problems that they spun it off as an open source project in the fall of 2013. They had hoped the community might fix its problems, but since then it has seen little if any development (15 posts in as many months). So seeing 700-nanosecond-class 1/2-round-trip UDP or TCP latency from Mellanox anytime in the near future would be very surprising.

Let’s attack misrepresentation number two, an actual Ethernet throughput of 100Gbps. This one is going to be a bit harder without an actual adapter in my hand to test, so just looking at the data sheet, several things do jump out. First, ConnectX-4 uses a 16-lane PCIe Gen3 bus, which typically should have an effective unidirectional PCIe data throughput of about 104Gbps. On the surface, this looks good. There may be an issue under the covers, though, because when this adapter is plugged into a state-of-the-art Intel Haswell server the PCIe slot maps to a single processor. You can send traffic from this adapter to the other CPU, but it first must go through the CPU it’s connected to. So sticking to one CPU, the best Haswell processor has two 20-lane QPIs with an effective combined unidirectional transfer speed of 25.6GB/sec. Now note that this covers all 40 PCIe lanes combined; the ConnectX-4 only has 16 lanes, so proportionally about 10.2GB/sec is available, and that’s only 82Gbps. Maybe they could sustain 100Gbps, but on the surface this number appears somewhat dubious. These numbers should also limit Infiniband’s top-end performance for this adapter.
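
Restating my arithmetic in Python (the 25.6GB/sec combined figure is my working assumption, not a measured number):

    pcie_x16_raw_gbps = 16 * 8.0 * (128 / 130)   # ~126 Gbps raw; ~104 Gbps effective
    qpi_combined_gb_s = 25.6                     # assumed combined unidirectional figure
    share_for_x16 = qpi_combined_gb_s * 16 / 40  # proportional share for a 16-lane slot
    print("x16 raw: %.0f Gbps, x16 share of socket bandwidth: %.1f GB/s = %.0f Gbps"
          % (pcie_x16_raw_gbps, share_for_x16, share_for_x16 * 8))   # ~82 Gbps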

Finally, we have my favorite misrepresentation, 150M messages per second. “Messages” is an HPC term, and most people thinking in Ethernet terms will translate this to 150M packets per second. A 10GbE link has a theoretical maximum packet rate of 14.88Mpps. There is no way their Ethernet driver for the ConnectX-4 could ever support this packet rate; even if they had a really great OS-bypass driver I’d be highly skeptical. This is analogous to claiming you have an adapter capable of providing lossless Ethernet packet capture on ten 10GbE links (14.88Mpps per link) at the same time. Nobody today, not even the best FPGA NICs that cost 10X this price, will claim this.

Let’s humor Mellanox, though, and buy into the fantasy; here is the reality that will creep back in. On Ethernet, we often say the smallest packet is 64 bytes, so 150Mpps * 64 bytes/packet * 8 bits/byte is 76.8Gbps; that is less than the 82Gbps we mentioned above, so that’s good. There are a number of clever tricks that can be used to bring this many packets into host user space while optimizing the use of the PCIe bus, but more often than not these require that the NIC firmware be tuned for packet capture, not generic TCP/UDP traffic flow. Let’s return to the Intel Haswell E5-2699 with 18 cores at 2.3GHz. Again, for performance, we’ll steer all 150Mpps into the single Intel socket supporting this Mellanox adapter. Now for peak performance, we want to ensure that packets are going to extremely quiet cores, because we know that both OS & BIOS settings can create system jitter, which kills performance and determinism. So we profile this CPU and find the 15 least busy cores, those with NOTHING going on. Now if we assume Mellanox were to have an OS Bypass UDP/TCP stack that supported a round-robin method for doling out a flood of 64-byte packets, this would mean 10Mpps per core, or 100 nanoseconds per packet to do something useful with each packet. That’s roughly 230 clock ticks at 2.3GHz. Unless you’re hand-coding in assembler, it’s going to be very hard to get much done in that budget.
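
Here is the same packet-rate and cycle-budget math written out:

    line_rate_10g_pps = 10e9 / ((64 + 20) * 8)   # 64B frame + 20B wire overhead -> ~14.88 Mpps
    flood_gbps = 150e6 * 64 * 8 / 1e9            # 150 Mpps of 64-byte packets = 76.8 Gbps
    pps_per_core = 150e6 / 15                    # spread over 15 quiet cores = 10 Mpps/core
    ns_per_packet = 1e9 / pps_per_core           # 100 ns to do something with each packet
    cycles_per_packet = ns_per_packet * 2.3      # ~230 cycles at 2.3 GHz
    print(line_rate_10g_pps, flood_gbps, pps_per_core, ns_per_packet, cycles_per_packet)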

So when Mellanox begins talking about supporting 25GbE, 50GbE or 100GbE, you need only remember one quote from their website: “Mellanox MTNIC Ethernet driver support for Linux, Microsoft Windows and VMware ESXi are based on the ConnectX® EN 10GbE and 40GbE NIC only.” So please don’t fall for the low-latency, high-bandwidth or packet-rate Mellanox Ethernet hype; it’s just hogwash.

Update: on March 2, 2015, Mellanox posted an Ethernet-only press release claiming this adapter supports 100GbE and that, using the DPDK interface in testing, they could achieve 90Gbps with 75Mpps over the 100G link (roughly wire rate for 128-byte packets).

Performance 10GbE Capture & Time Stamping

From Sony to Park-n-Fly, we’re seeing the impact data breaches can have on corporations. To that end, many companies are beginning to look at all the data that both enters and leaves their enterprise, in an attempt to thwart breaches as they occur. At the heart of this effort lies wire-rate lossless 10GbE capture. Furthermore, some customers are seeking highly accurate time stamps on all captured packets so performance can be measured, and so issues can easily be tracked and collated across an enterprise against the backing of a solid, trusted temporal (time-based) infrastructure. So, for those looking for an adapter/software bundle under $5K that can capture both received and transmitted packets on two 10G interfaces in parallel, with time stamps, you have only one choice: Solarflare’s SFN7142Q-SCP. This is Solarflare’s SFN7142Q adapter with a PPS bracket kit, a Precision Time Protocol (PTP) license, and a SolarCapture Pro license. Note you will need to order a pair of QSA modules separately, due to agency certification issues.

So why has Solarflare gone to the trouble of bundling together all of these parts? Simple: to make it easier for potential customers to try out precision packet capture with highly accurate time stamps. Let’s take a moment and decompose this bundle into its component pieces to fully appreciate why this is so important.

First, we’ll start with the network server adapter, the Solarflare SFN7142Q. This board is driven by a single Solarflare dual-core Ethernet controller chip. Each Ethernet controller core on this chip has multiple packet engines for both receive and transmit queues. This enables the adapter to support wire-rate lossless packet capture even with huge bursts of the smallest-sized packets (64 bytes each). Furthermore, this adapter can also transmit wire-rate 64-byte packets at the same time, on the same interface. Solarflare’s capture bundle also includes the PPS (Pulse Per Second) bracket kit that provides the necessary mini-BNC connectors to attach the adapter to an external master clock. Unlike similar adapters, there is also a second mini-BNC connector to support daisy-chaining the clock signal out of the adapter into another adapter. The SFN7142Q includes a highly precise Stratum 3 clock chip, which ensures that time stamping is accurate to within 50 nanoseconds of the PTP master. This is 30X more precise than a competing adapter that only captures and time stamps inbound packets.

While the SFN7142Q sports two 40GbE QSFP ports, to ensure wire-rate lossless packet capture Solarflare provides two QSA modules (which you must order separately) that convert the QSFP sockets into 10G SFP+ sockets. This enables each of the two Ethernet controller cores on the adapter to focus on a single 10GbE interface.

Three software license keys are preloaded on the SFN7142Q adapter: OpenOnload (OOL), Precision Time Protocol (PTP), and SolarCapture Pro (SCP). OpenOnload is Solarflare’s user-space stack, which permits a zero-copy bypass of the operating system and places the captured data directly into the memory connected to the core processing that particular data flow. Precision Time Protocol is a method whereby the external pulse-per-second master clock can be rationalized to the real time of day and then distributed to other applications or servers. Finally, we have SolarCapture Pro. When it comes to capture, SolarCapture Pro is arguably the best. Unlike some other solutions, it also captures transmitted packets and can time stamp both inbound & outbound packets, features otherwise found only in higher-priced FPGA-based solutions. Also, SCP can be initialized in cluster mode to spawn multiple capture instances, one per core, each delivering data in Libpcap format, and then flow-hash the data across all of the cores within the cluster. Flow-hashing is the process of looking at several key fields in the packet header and then always routing the traffic from a given source & destination to the same core, so security applications like Snort, Bro & Suricata see all the data for a given network flow.

So if you’re looking to get into packet capture for performance monitoring or security programs, please consider contacting Solarflare and asking about their SFN7142Q-SCP. You’ll be pleasantly surprised how it performs when compared to much more expensive FPGA-based solutions. Finally, in a future post we’ll talk about Capture SolarSystem, which leverages all of the above to deliver an appliance tuned for high-volume packet capture.

Memcached 3X Faster with Solarflare

Memcached is a free, open source software layer that provides a distributed in-memory object store, which is leveraged by popular sites like Twitter, YouTube, Flickr, Craigslist, etc. Since these objects are stored in memory rather than on much slower disk, optimizing the path those objects travel can produce results that are not only measurable but significant. In a recent whitepaper, Solarflare did just this. Using an Intel X520-DA2 10GbE adapter as a reference, they compared it to a Solarflare SFN7122F with OpenOnload and saw performance gains of 2-3X over what the Intel adapter was capable of delivering. For example, maximum batched query throughput (get) went from 7.4 Mops (million operations per second) on a single Intel X520-DA2 card to 21.9 Mops on a Solarflare SFN7122F card with OpenOnload. This means you’d need three servers with Intel 10G cards to deliver the same performance as a single Memcached server with one Solarflare SFN7122F network adapter. How is this possible?

Solarflare’s adapter offers 1,024 virtualized network interfaces (VNICs) on each 10GbE port, each with its own receive and transmit hardware. Traditionally Solarflare’s OpenOnload mapped application sockets to these VNICs, but a new version allows sockets to be dynamically moved between OpenOnload stacks. This allows multi-threaded applications like Memcached, which use a single listener thread and many worker threads, to map each worker thread to its own VNIC and move traffic between them.
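
The single-listener, many-workers shape described above looks roughly like the sketch below. This is plain Python sockets and threads to show the model, not Solarflare's OpenOnload stack mapping; under OpenOnload each worker's sockets would ride on their own VNIC.

    import socket, threading, queue

    NUM_WORKERS = 4
    work_queues = [queue.Queue() for _ in range(NUM_WORKERS)]

    def worker(q):
        while True:
            conn = q.get()
            _request = conn.recv(4096)      # a real cache would parse get/set commands here
            conn.sendall(b"END\r\n")
            conn.close()

    for q in work_queues:
        threading.Thread(target=worker, args=(q,), daemon=True).start()

    listener = socket.socket()              # the single listener thread
    listener.bind(("0.0.0.0", 11211))
    listener.listen()
    i = 0
    while True:
        conn, _addr = listener.accept()
        work_queues[i % NUM_WORKERS].put(conn)   # hand each connection to a worker
        i += 1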

The test used two Dell PowerEdge R620 servers, each with two 10-core Intel E5-2660 v2 processors and 32GB of memory, running RHEL 6.5 with Hyper-Threading enabled. The first server, the traffic generator, ran memslap and was configured with a pair of Solarflare SFN6122F adapters to create sufficient load. The second server, the one testing Memcached performance, had both an Intel X520-DA2 and a Solarflare SFN7122F with OpenOnload. To balance performance and scalability, each Memcached instance had only five cores (10 threads), so with 20 cores on the server we loaded four Memcached instances, two per CPU. We then tested using one core per instance, scaling up to five cores per instance. Here are some interesting data points:

  • From four cores to 20 cores performance was pretty linear: Solarflare went from 5 Mops to 21.9 Mops, while Intel went from 1.7 Mops to 7.4 Mops.
  • Latency for a single get request saw substantial improvement, going from 750us (4 cores) down to 180us (20 cores), while Intel went from 3,000us down to 425us.
  • For batches of 48 get requests the latency was also very compelling, going from 10,000us (4 cores) down to 2,000us (20 cores), compared to Intel's 30,000us down to 6,000us.

If you’re serious about Memcached you should really read this 10-page whitepaper as it goes into much deeper detail.