R.I.P. TCP Offload Engine NICs (TOEs)

Solarflare Delivers Smart NICs for the Masses: Software Definable, Ultra-Scalable, Full Network Telemetry with Built-in Firewall for True Application Segmentation, Standard Ethernet TCP/UDP Compliant

As this blog post by Michael C. Bazarewsky states, Microsoft quietly pulled support for TCP Chimney in its Windows 10 operating system. Chimney was an architecture for offloading the state and responsibility of a TCP connection to a NIC that supported it. The piece cited numerous technical issues and lack of adoption, and Michael’s analysis hits the nail on the head. Goodbye TOE NICs.

During the early years of this millennium, Silicon Valley venture capitalists dumped hundreds of millions of dollars into start-ups that would deliver the next generation of network interface cards at 10Gb/sec using TCP offload engines. Many of these companies collapsed under the weight of trying to develop expensive, complicated silicon that just did not work. Others received a big surprise in 2005 when Microsoft settled with Alacritech over patents Alacritech held covering Microsoft’s Chimney architecture. In a cross-license arrangement with Microsoft and Broadcom, Alacritech received many tens of millions of dollars in licensing fees. Alacritech would later collect tens of millions more in fees from nearly every other NIC vendor implementing a TOE in their designs. At the time, Broadcom was desperate to pave the way for its acquisition of Israel-based Siliquent. Given server OEM pressure, the settlement was a small price to pay for the certain business Broadcom would garner from sales of the Siliquent device. At 1Gb/sec, Broadcom owned an astounding 100% of the server LAN-on-Motherboard (LOM) market, and yet its position was threatened by the onslaught of new, well-funded 10Gb start-ups.

In fact, the feature list for new “Ethernet” enhancements got so full of great ideas that most vendors’ designs relied on a complex “sea of cores” promising extreme flexibility that ultimately proved very difficult to qualify at the server OEMs. Any minor change to one code set would cause the entire design to fail in ways that were extremely difficult to debug, not to mention performance that was miserably poor. Most notably, NetXen, another 10Gb TOE NIC vendor, quickly failed after winning major design-ins at the three big OEMs, ultimately ending in a fire sale to QLogic. Emulex saw the same pot of gold in its acquisition of ServerEngines.

The impetus behind these acquisitions was a move by Cisco to introduce Fibre Channel over Ethernet (FCoE) as a standard to converge networking and storage traffic. Cisco let QLogic and Emulex (Q & E) inside the tent before its Unified Computing System (UCS) server introduction. But the setup took some time. It required a new set of Ethernet standards, now more commonly known as Data Center Bridging (DCB). DCB was a set of physical-layer requirements that attempted to emulate the reliability of TCP by injecting wire protocols that would allow “lossless” transmission of packets. What a break for Q & E! Given the duopoly’s control over the Fibre Channel market, this would surely put both companies in the pole position to take over the Ethernet NIC market. Even Broadcom spent untold millions to develop a Fibre Channel driver that would run on their NIC.

Q & E quickly released what many called the “Frankenstein NIC,” a kludge of Application-Specific Integrated Circuits (ASICs) designed to get a product to market even while both companies struggled to develop a single ASIC, a skill at which neither excelled. Barely achieving its targeted functionality, no design saw much traction. Through all of our customer interactions (over 1,650), we could find only one that had implemented FCoE. This large bank has since retracted its support for FCoE and, in fact, showed a presentation slide several years ago stating it was “moving from FCoE to Ethernet,” an acknowledgment that FCoE was indeed NOT Ethernet.

Alongside TOEs, industry pundits believed that RDMA (Remote Direct Memory Access) was another feature required to reduce latency, and not just for High-Frequency Trading (HFT); it was an acknowledgment that lowering latency was critical to hyper-scale cloud, big data, and storage architectures. However, once again, while intellectually stimulating, using RDMA in any environment proved to be complex and simply incompatible with customers’ applications and existing infrastructures.

The latest RDMA push is to position it as the underlying transport for NVMe over Fabrics (NVMeF). Why? Flash has already reduced the latency of storage access by an order of magnitude, and the next generation of flash devices will reduce latency and increase capacity even further. Whenever there’s a step function in the performance of a particular block of computer architecture, developers come up with new ways to use that capability to drive efficiencies and introduce new and more interesting applications. Much like Moore’s Law, rotating magnetic storage is on its last legs. Several of our most significant customers have already stopped buying rotating media in favor of flash SSDs.

Well… here we go again. RDMA is NOT Ethernet. Despite the “fake news” about running RDMA, RoCE, and iWARP on Ethernet, the largest cloud companies and our large financial services customers have declared that they cannot and will not implement NVMeF using RDMA. It just doesn’t fit their infrastructures or applications. They want low-latency standard Ethernet.

Since our company’s beginning, we’ve never implemented TOEs, RDMA, FCoE, or any of the other great and technically sound ideas for changing Ethernet. Sticking to our guns, we decided to go directly to the market and create the pull for our products. The first market to embrace our approach was High-Frequency Trading (HFT). Over 99% of the world’s volume of electronic trading, in all instruments, runs on our company’s NICs. Why? Customers could test and run our NICs without any application modifications or changes to their infrastructure and realize enormous benefits in latency, jitter, message rate, and robustness… it’s standard Ethernet, and our kernel bypass software has become the industry’s de facto standard.

It’s not that there isn’t room for innovation in server networking; it’s that you have to consider the customer’s ability to adopt and manage that change in a way that isn’t disruptive to their infrastructure while, at the same time, delivering highly valued capabilities.

If companies are looking for innovation in server networking, they need to look for a company that can provide the following:

  • Best-in-class PTP synchronization
  • Ultra-high resolution time stamps for every packet at every line rate
  • A method for lossless, unobtrusive, packet capture and analysis
  • Significant performance improvement in NGINX and LXC Containers
  • A firewall NIC and Application Micro-Segmentation that can control every app, VM, or container with unique security profiles
  • Real, extensive Software Definable Networking (SDN) without agents

In summary, while it’s taken a long time for the industry to overcome its inertia, logic eventually prevailed. Today, companies can benefit from innovations in silicon and software architecture that are in deployment and have been validated by the market. Innovative approaches such as neural-scale networking, designed to meet the high-bandwidth, ultra-low-latency, hardware-based security, telemetry, and massive connectivity needs of ultra-scale computing, are likely the only strategy for achieving a next-generation cloud and data center architecture that can scale, be easily managed, and, maybe most importantly, be secured.

— Russell Stern, CEO Solarflare

Cloaked Data Lakes

Jesse James, when asked why he robbed banks, answered: “Because that’s where the money is.” Today a corporation’s most valuable asset, aside from its people, is its data. For the Star Trek fans among us, imagine if you could engage your data lake’s network cloaking device just before deployment. It would waver out of view and then totally disappear from your enterprise network to all but those responsible for extracting value from it. Your key data scientists and applications could still see and interact with your cloaked data lakes, but to others exploring and scanning the network, it would be entirely invisible, as if it were not even there.

Imagine, if you will, that a Klingon Bird of Prey is cloaked and patrolling the Neutral Zone. Along comes the Federation starship Enterprise, also patrolling the Neutral Zone, but the Federation is actively scanning the quadrant. Since the Klingon ship is cloaked, the Federation can’t detect it, but the moment the Enterprise’s scanners pass over the Bird of Prey it automatically jumps to red alert, energizes its weapons systems, and alters course to shadow the Federation ship. Imagine if the same could be true of an insider threat, or an internal breach via, say, a phishing attack, seeking out your company’s data. The moment someone pings a system or executes a port scan of even one IP address of the servers within your data lake, alarm bells are set off, and no reply is returned. The scanner would see no answer and assume that nothing exists, little knowing the hell that would soon rain down on them.

Your network administrators would then be notified that their new server orchestration system had raised an alert. They’ll quickly see that the attacker is another admin’s workstation, someone who has been suspected of being an insider threat but has been too cagey to nail down. Now it’s 9 PM, and he’s port scanning the exact range of internal network addresses that were set aside a week earlier for this new data lake. He then moves on to softer targets, exfiltrating data from older systems. Little does he know, though, that every server he’s touched in the past week has been tracking and reporting every network flow back to his workstation. Management was just waiting for the perfect piece of evidence, and this attempted port scan, along with all the other network flows, was the final straw.

His plan had been to finish out the week, then quit on Friday and sell all his company’s data to its competitors. He had decided to stay on an extra two weeks when he heard they were standing up a new Hadoop cluster. He figured that it would make a juicy soft target with tons of the newest aggregated data, which could be enormously valuable. What he didn’t know, because he wasn’t invited to those planning meetings, was that the cluster included a new stealth security feature from Solarflare called Active Cloaking. He also wasn’t aware that this feature was the driving reason why many of his company’s servers had been upgraded over the past two weeks to new Solarflare 10GbE NICs with ServerLock.

Since he was a server administrator responsible for some of the older legacy systems, he wasn’t involved in the latest round of network upgrades. While he had noticed that lately some of the newer servers were no longer accessible to him via SSH, what he wasn’t aware of was that every server he touched was now reporting his every move. Worse still, some of those older servers, which had been upgraded with Solarflare ServerLock-enabled NICs, were left as internal SSH/SCP honeypots holding old legacy data of little real value, data that would yield damning evidence once compromised. Tonight proved to be his downfall: his manager and his VP, along with building security, had just entered his cubicle and stated that the police were on their way.

At Black Hat last month, both Solarflare and Cloudwick (CDL) demonstrated ServerLock and data lake cloaking. In September several huge enterprises will begin testing ServerLock, and if you’re an insider threat, consider yourself warned!

1st Ever Firewall in a NIC

Last week at Black Hat, Solarflare issued a press release debuting their SolarSecure solution, which places a firewall directly in your server’s Network Interface Card (NIC). This NIC-based firewall, called ServerLock, not only provides security but also offers agentless, server-based network visibility. This visibility enables you to see all the applications running on every ServerLock-enabled server. You can then quickly and easily develop security policies that can be used for compliance or enforcement. During the Black Hat show setup, we took a 10-minute break to have an on-camera interview with Security Guy Radio that covered some of the key aspects of SolarSecure.

SolarSecure has several unique features not found in any other solution:

  • Security and visibility are handled entirely by the NIC hardware and firmware; there are NO server-side software agents, so the solution is entirely OS-independent.
  • Once the NIC is bound to the centralized manager, it begins reporting traffic flows to the manager, which then displays those flows graphically for admins to easily turn into security policies. Policies can be created for specific applications, enabling application-level network segmentation.
  • Every NIC maintains separate firewall tables for each local IP address hosted on the NIC to avoid potential conflicts from multiple VMs or Containers sharing the same NIC.
  • Each NIC is capable of handling over 5,000 filter table rules along with another 1,000 packet counters that can be attached to rules.
  • Packets transit the rules engine in between 50 and 250 nanoseconds, so the latency hit is negligible.
  • The NIC filters both inbound and outbound packets. Packets dropped as the result of a match to a firewall rule generate an alert on the management console, and inbound packets that are dropped consume ZERO host CPU cycles.

Here is a brief animated explainer video, produced prior to the show, that sets up the problem and explains Solarflare’s solution. We also produced a one-minute demonstration of the management application and its capabilities.

Storage Over TCP in a Flash

By Patrick Dehkordi

Recently Solarflare delivered a TCP transport for Non-Volatile Memory Express (NVMe). The big deal with NVMe is that it’s flash-memory based and often multi-ported, so when these “disk blocks” are transferred over the network, even with TCP, they often arrive 100 times faster than they would coming off spinning media. We’re talking 100 microseconds versus average 15K RPM disk seek times measured in milliseconds. Unlike RoCE or iWARP, a TCP transport provides storage over Ethernet without requiring ANY network infrastructure changes.

It should be noted that this should work with ANY NIC and does not require RDMA, RoCE, iWARP, or any special NIC offload technology. Furthermore, since this is generic TCP/IP over Ethernet, you don’t need to touch your switches to set up Data Center Bridging. You don’t need Data Center Ethernet, Converged Ethernet, or Converged Enhanced Ethernet, just plain old Ethernet. Nor do you need to set things up to use Pause Frames or Priority Flow Control. This is industry-changing stuff, and yet not hard to implement for testing, so I’ve included a recipe below for how to make this work in your lab; it is also cross-posted on GitHub.

At present this is a fork of the v4.11 kernel, which adds two new kconfig options:

  • NVME_TCP : enable initiator support
  • NVME_TARGET_TCP : enable target support

The target requires the nvmet module to be loaded. Configuration is identical to RDMA, except "tcp" should be used for addr_trtype.

The host requires the nvme, nvme_core and nvme_fabrics modules to be loaded. Again, the configuration is identical to RDMA, except -t tcp should be passed to the nvme command line utility instead of -t rdma. This requires a patched version of nvme-cli.

Example assumptions

This is assuming a target IP of 10.0.0.1, a subsystem name of ramdisk, and an underlying block device /dev/ram0. This is further assuming an existing system with a Red Hat/CentOS distribution built on a 3.x kernel.
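
If your system doesn’t already expose /dev/ram0, the stock brd module can provide one. A minimal sketch, with an illustrative size (rd_size is in KB, so the value below yields a 1GB ramdisk):

modprobe brd rd_nr=1 rd_size=1048576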

Building the Linux kernel

For more info refer to https://kernelnewbies.org/KernelBuild

Install or confirm the following packages are installed

yum install gcc make git ctags ncurses-devel openssl-devel

Download, unzip or clone the repo into a local directory

git clone -b v4.11-nvme-of-tcp https://github.com/solarflarecommunications/nvme-of-tcp
cd nvme-of-tcp

Create a .config file or copy the existing .config file into the build directory

cp /boot/config-3.10.0-327.el7.x86_64 .config

Modify the .config to include the relevant NVMe modules

make menuconfig

Under “Block Devices” select, at a minimum, “NVM Express block device,” “NVMe over Fabrics TCP host support,” and “NVMe over Fabrics TCP target support,” then Save and Exit the text-based kernel configuration utility.

Confirm the changes

grep NVME_ .config
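
If the options took, the grep should return lines like the following; whether they read =y or =m depends on whether you selected built-in or module support, and the exact names are assumed from the two kconfig options listed above:

CONFIG_NVME_TCP=y
CONFIG_NVME_TARGET_TCP=y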

Compile and install the kernel

(To save time you can utilize multiple CPUs by including the -j option)

make -j 16
make -j 16 modules_install install 

Confirm that the build is included in the boot menu

(This is dependent on the bootloader being used; for GRUB2:)

cat /boot/grub2/grub.cfg | grep menuentry

Set the build as the default boot option

grub2-set-default 'Red Hat Enterprise Linux Server (4.11.0) 7.x (Maipo)'

Reboot the system

reboot

Confirm that the kernel has been updated:

uname -a 
Linux host.name 4.11.0 #1 SMP date  x86_64 GNU/Linux

NVMe CLI Update

Download the correct version of the NVMe CLI utility that includes TCP:

git clone https://github.com/solarflarecommunications/nvme-cli

Update the NVMe CLI utility:

cd nvme-cli
make
make install

Target setup

Load the target driver

This should automatically load its dependencies, including the nvmet module.

modprobe nvmet_tcp

Set up storage subsystem

mkdir /sys/kernel/config/nvmet/subsystems/ramdisk
echo 1 > /sys/kernel/config/nvmet/subsystems/ramdisk/attr_allow_any_host
mkdir /sys/kernel/config/nvmet/subsystems/ramdisk/namespaces/1
echo -n /dev/ram0 > /sys/kernel/config/nvmet/subsystems/ramdisk/namespaces/1/device_path
echo 1 > /sys/kernel/config/nvmet/subsystems/ramdisk/namespaces/1/enable

Set up port

mkdir /sys/kernel/config/nvmet/ports/1
echo "ipv4" > /sys/kernel/config/nvmet/ports/1/addr_adrfam
echo "tcp" > /sys/kernel/config/nvmet/ports/1/addr_trtype
echo "11345" > /sys/kernel/config/nvmet/ports/1/addr_trsvcid
echo "10.0.0.1" > /sys/kernel/config/nvmet/ports/1/addr_traddr

Associate subsystem with port

ln -s /sys/kernel/config/nvmet/subsystems/ramdisk /sys/kernel/config/nvmet/ports/1/subsystems/ramdisk

Initiator setup

Load the initiator driver

This should automatically load the dependencies: nvme, nvme_core, and nvme_fabrics.

modprobe nvme_tcp

Use the NVMe CLI utility to connect the initiator to the target:

nvme connect -t tcp -a 10.0.0.1 -s 11345 -n ramdisk

lsblk should now show an NVMe device.
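
To double-check the association from the initiator, and to tear it down when finished, the nvme-cli commands below should suffice; the subsystem name matches the ramdisk example used throughout this recipe:

nvme list
nvme disconnect -n ramdisk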

What’s a Smart NIC?

While reading these words, it’s not just your brain doing the processing required to make this feat possible. We’ve all seen over- and under-exposed photos and can appreciate the decision making necessary to achieve a perfectly light-balanced photo. In the laboratory, we’ve observed that the optic nerve connecting the eye to the brain is responsible for measuring the intensity of the light hitting the back of your eye. In response to this data, each optic nerve dynamically adjusts the aperture of the iris in the eye connected to that nerve to optimize these levels. For those with some photography experience, you might recall that there is a direct relationship between aperture (f-stop) and focal length. It also turns out that your optic nerve, after years of training as a child, has come to realize you’re reading text up close, so it is now also responsible for adjusting the muscles around that eye to sharpen your focus on this text. All this data processing is completed before your brain has even registered the first word in the title. Imagine if your brain were responsible for processing all the data and actions required for your body to function properly.

Much like your optic nerve, the difference between a standard Network Interface Card (NIC) and a Smart NIC is how much processing the Smart NIC offloads from the host CPU. Until recently, Smart NICs were designed around Field Programmable Gate Array (FPGA) platforms costing thousands of dollars. As their name implies, FPGAs are designed to accept localized programming that can be easily updated once installed. Now a new breed of Smart NIC is emerging that, while not nearly as flexible as an FPGA, contains several sophisticated capabilities not previously found in NICs and costs only a few hundred dollars. These new affordable Smart NICs can include a firewall for security, a layer 2/3 switch for traffic steering, several performance acceleration techniques, and network visibility, possibly with remote management.

The firewall mentioned above filters all network packets against a table built specifically for each local Internet Protocol (IP) address under its control. An application processing network traffic is required to register a numerical network port, and this port then becomes the internal address for sending and receiving that traffic. Filtering at the application level then becomes a simple process of only permitting traffic for specific numeric network ports. The industry has labeled this “application network segmentation,” and in this instance it is done entirely in the NIC. So how does this assist the host x86 CPU? It turns out that by the point at which operating system software filtering kicks in, the host CPU has often expended over 10K CPU cycles to process a packet. If the packet is then dropped, the cost of that drop is 10K lost host CPU cycles. If that filtering were done in the NIC and the packet dropped there, there would be NO host CPU impact.
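
As a rough analogy only, not Solarflare’s implementation (ServerLock enforces its rules in NIC hardware before the kernel ever sees a packet), the policy model resembles per-address, per-port rules like these standard Linux iptables entries, where the address and port are made-up examples:

iptables -A INPUT -d 10.0.0.5 -p tcp --dport 5432 -j ACCEPT  # permit the one registered application port
iptables -A INPUT -d 10.0.0.5 -j DROP                        # drop all other traffic bound for that IP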

Smart NICs also often have an internal switch used to rapidly steer packets within the server. This steering enables the NIC to move packets to and from interfaces and virtual NIC buffers that can be mapped to applications, virtual machines, or containers. Efficiently steering packets is another offload method that can dramatically reduce host CPU overhead.

Improving overall server performance, often through kernel bypass, has been the province of High-Performance Computing (HPC) for decades. Now it’s available for generic Ethernet and can be applied to existing and off-the-shelf applications. As an example, Solarflare has labeled its family of kernel bypass acceleration techniques Universal Kernel Bypass (UKB). There are two classes of traffic to accelerate: network packet and application sockets based. To speed up network packets, UKB includes an implementation of the Data Plane Development Kit (DPDK) and EtherFabric Virtual Interface (EF_VI); both are designed to deliver high volumes of packets, well into the tens of millions per second, to applications familiar with these Application Programming Interfaces (APIs). For more standard off-the-shelf applications there are several sockets-based acceleration libraries included with UKB: ScaleOut Onload, Onload, and TCPDirect. While ScaleOut Onload (SOO) is free and comes with all Solarflare 8000 series NICs, Onload (OOL) and TCPDirect require an additional license, as they provide microsecond and sub-microsecond 1/2 round-trip network latencies. By comparison, SOO delivers 2-3 microsecond latency, but the real value proposition of SOO is the dramatic reduction in host CPU resources required to move network data. SOO is classified as “zero-copy” because network data is copied once, directly into or out of your application’s buffer. SOO saves the host CPU thousands of instructions, multiple memory copies, and one or more CPU context switches, all of which dramatically improve application performance, often 2-3X, depending on how network-intense an application is.
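
To make the “no code changes” claim concrete: with the Onload family, acceleration is typically applied at launch time by running an existing, unmodified sockets application under the onload launcher or by preloading the library. A minimal sketch, assuming the OpenOnload package is installed; the binary name is a placeholder:

onload ./my_tcp_server                   # run an unmodified sockets app under the onload launcher
LD_PRELOAD=libonload.so ./my_tcp_server  # or preload the acceleration library explicitly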

Finally, Smart NICs can also securely report NIC network traffic flows and packet counts off the NIC to a centralized controller. This controller can then graphically display for network administrators everything that is going on within every server under its management. This is real enterprise visibility, and since only flow metadata and packet counts are shipped off the NIC over a secure TLS link, the impact on the enterprise network is negligible. Imagine all the NICs in all your servers reporting their traffic flows, allowing you to manage and secure those streams in real time, with ZERO host CPU impact. That’s one Smart NIC!

What are Neural Class Networks?

In 1971 Intel released the 4004, their first processor, a 4-bit single-core design, and for the next 34 years that’s pretty much how x86 computing progressed. Sure, they bumped up the architecture and speed as designs and processes improved, but one thing remained constant: a single processing engine. Under pressure in the 1990s from Unix workstations driven by pipelined Reduced Instruction Set Computing (RISC) architectures like IBM’s PowerPC, Sun’s UltraSPARC, and the MIPS R3000, Intel began exploring multi-core architectures.

So from 1971 until 2005, every x86 processor had a single core; life was simple. Intel even provided reference designs so system builders could put two of these processors into the same system. Sure, there were fringe companies that developed Symmetrical Multi-Processing (SMP) systems (e.g., Sequent and NEC) with more than two CPU sockets, but they were large, expensive custom servers not found in general mainstream use. So if you wanted to scale out your computational capacity to tackle a tough problem, you had to rack more servers.

This single-core challenge largely drove commodity Linux clustering, making it all the rage by the turn of the century, particularly in high-performance computing (HPC). It wasn’t uncommon to tightly couple 1,000 or more dual-socket, single-core systems together to tackle a tough computational problem. Government agencies leveraged large clusters to model our nuclear stockpile and computationally secure it. Auto companies crashed dozens of virtual car designs on a daily basis, and oil companies crunched seismic data to compute untapped oil reserves. Then the game changed, and x86 shifted to multi-core processors. Yesterday Intel announced the availability of their new Skylake server platform, but the industry is already refocusing on 2018’s Cascade Lake, a 32-core, 64-thread server chip. So why does all this matter?

As a general rule, Google doesn’t publish or confirm the computational capacity of its data centers. If we pick a specific example, their Oregon data center, we can apply particular assumptions and project that it’s roughly 100,000 servers. Again assuming dual-socket, eight-core, hyper-threaded processors, that’s 100,000 × 2 sockets × 8 cores × 2 threads, or 3.2 million parallel threads of computation tightly coupled in one physical location, potentially addressing a single problem. Structures like this are quickly coming to approximate what took nature millions of years to develop: an organic brain based on the neuron. Neurons on average have 7,000 connections to both local and remote neurons within their system; that’s a considerable amount of networking per single computational unit.

By contrast, the common cockroach has one million neurons, and that frog you played with as a kid sports 16 million. So if we equate a neuron to a single thread of execution on an x86, that would put a Google data center somewhere between the cockroach and the frog. If in 2018 Google were to upgrade the Oregon facility to Cascade Lakes, it would still be only 12.8 million threads (100,000 × 2 × 32 × 2), still less than a frog. Given the geometric growth in core counts, though, it will be only a few decades before Google is deploying data centers approaching the capability of the 86 billion neurons found in the human brain. Oh, and that’s assuming an x86 thread and a neuron are even computationally similar.

So what is Neural Class Networking? As mentioned above, a neuron on average is connected to 7,000 other neurons. Now imagine if every hardware thread of execution in your server were networked to even 64 external threads of execution on related systems working on the same problem. Today we have servers with typically 32 threads of execution, and Solarflare’s newest generation of XtremeScale Smart NICs provides 2,048 virtual NICs, so each of those 32 threads can sustain 64 dedicated hardware paths to other external threads. That’s the start of Neural Class Networking.

Capture, Analytics, and Solutions

My first co-op job at IBM Research back in 1984 was to help roll out the IBM PC to the company’s best and brightest. It wasn’t long into that position, perhaps a month, when we noticed a large number of monochrome monitors had a consistent burn-in pattern: a horizontal bar across the top and a vertical bar on the left. Now I was not new to personal computers, having purchased my own TRS-80 Model III a year earlier, but it was apparent that a vast number within Research were all being used to do the same thing. So I asked; the answer was VisiCalc (MS Excel’s great-grandfather).

At the time personal computers were still scarce, and seeing one outside a Fortune 100 company was akin to a unicorn sighting, so the subtlety between tool and solution was lost on this 21-year-old brain. Shortly after the cause of the burn-in was understood, I had the opportunity to have lunch with one of these researchers and discuss his use of VisiCalc. That single IBM PC in his lab existed to turn experimental raw data that it directly collected into comprehensible observations. He had a terminal in his office for email and document creation, so this system existed entirely to run two programs: one that collected raw data from his lab equipment, and VisiCalc, which translated that data into understandable information that helped him make sense of our world.

Recently Solarflare began adding analytics packages to their open network packet capture platform as they ready it for general availability later this month. The first of these packages is Trading Analytics, and it wasn’t until I recently saw an actual demo of this application that its real value kicked in, much like the VisiCalc ah-ha moment mentioned above. But perhaps we need some additional context.

Imagine a video surveillance system at a large national airport. Someone in security can easily track a single passenger from the curb to the gate without ever losing sight of them. Now extend that to a computer network where the network packets are the people. The key value proposition of a distributed open network packet capture platform is the capability to gather copies of what happened at various points in your network while preserving the exact moment in time that the data was at each location. Now imagine that data being how a company trades electronically on several different stock exchanges: the capability to visualize the exact moment an exchange notified you that IBM stock was trading at $150/share, and then the precise instant each of the various pieces of your trading infrastructure kicked in, every step of the way. The value of seeing the entire transaction with nanosecond resolution can then be monetized, especially when you can see this performance over hundreds, thousands, or even millions of trades. Given the proper visualization and reports, this data can empower you to revisit the slow stages in your trading platform to further wring out any latency that might result in lost market opportunities.

So what is analytics? Well, first it’s the programming responsible for decoding the market data, that is, translating all the binary data contained in a captured network packet into humanly meaningful data. While this may sound trivial, trust me, it isn’t: there are dozens of different network protocol formats and hundreds of exchanges worldwide that leverage them, so it can quickly become a can of worms. Solarflare’s package translates over two dozen different network protocols, for example NYSE ARCA, CME FIX, and FAST, into a single humanly readable framework. Next, it understands how to read and use the highly accurate timestamps attached to each network packet representing when those packets arrived at a given collection point. Then it correlates all the network sequence ID numbers and internal message ID numbers along with those timestamps so it can align a trade across your entire environment and show you where it was at each precise moment in time. Finally, and this is the most important part, it can report and display the results in many different ways so that the output fits easily into a methodology familiar to its consumer.

So what differentiates a tool from a solution? Some will argue that VisiCalc and Solarflare’s Trading Analytics are nothing more than a sophisticated set of fancy hammers and chisels, but they are much more. A solution adds significant, and measurable, value to a process by removing the mundane and menial tasks from it, thereby allowing us to focus on the real tasks that require our intellect.

Technology Evangelist Podcast

Well, after a few weeks of planning and preparation, the Technology Evangelist Podcast is finally available. This podcast will focus on bringing the engineers, marketing, and sales folks on the cutting edge of technology to the mic to explain it.

Our first episode features Ron Miller, the CTO of Cloudwick, talking about “Hadoop and Securing Hadoop Clusters.” Ron is an expert in cyber security, having founded Mirage Networks in 2003. We’re honored to have Ron share with us some background on Hadoop and how one might secure Hadoop clusters.

In our second episode, Mark Zeller joined us to talk about Non-Volatile Memory Express (NVMe) and how it will replace spinning disks over the years to come. We touch on the benefits of this technology, talk about erasure coding, and review where the technology is headed. This episode has been recorded and is pending final approval.

Yesterday, Saturday, June 10th, Bob Van Valzah had some time to stop by and discuss electronic trading. This episode covers such topics as what trading is, the race to zero, dark pools, and the book Flash Boys. This episode has also been recorded and is pending final approval.

Four Container Networking Benefits

Container networking is walking in the footsteps taken by virtualization over a decade ago. Still, networking is a non-trivial task, as there are both underlay and overlay networks one needs to consider. Underlay networks like bridge, MACVLAN, and IPVLAN are designed to map physical ports on the server to containers with as little overhead as possible. There are also overlay networks that require packet-level encapsulation using technologies like VXLAN and NVGRE to accomplish the same goals. Anytime network packets have to flow through hypervisors or layers of virtualization, performance will suffer. Towards that end, Solarflare is now providing the following four benefits for those leveraging containers.

  1. NGINX Plus running in a container can now utilize ScaleOut Onload. In doing so, NGINX Plus achieves a 40% improvement in performance over standard host networking. With the introduction of Universal Kernel Bypass (UKB), Solarflare is now including, for FREE, both DPDK and ScaleOut Onload with all their base 8000 series adapters. This means that people wanting to improve application performance should seriously consider testing ScaleOut Onload.
  2. For those looking to leverage orchestration platforms like Kubernetes, Solarflare has provided the kernel organization with an Advanced Receive Flow Steering driver. This new driver improves performance in all the above-mentioned underlay networking configurations by ensuring that packets destined for containers are quickly and efficiently delivered to that container.
  3. At the end of July, during the Black Hat Cyber Security conference, Solarflare will demonstrate a new security solution. This solution will secure all traffic to and from containers with enterprise-unique IP addresses via a hardware firewall in the NIC.
  4. Early this fall, as part of Solarflare’s Container Initiative, they will be delivering an updated version of ScaleOut Onload that leverages MACVLANs and supports multiple network namespaces (see the MACVLAN sketch after this list). This version should further improve both performance and security.
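
For context on what a MACVLAN underlay looks like in practice, here is a hedged sketch using stock Docker rather than any Solarflare-specific tooling; the subnet, gateway, parent interface, and network name are placeholders for your environment:

docker network create -d macvlan \
  --subnet=192.168.10.0/24 --gateway=192.168.10.1 \
  -o parent=eth0 macnet                # map the eth0 underlay to a container network
docker run --rm -it --network=macnet alpine ip addr  # a container on macnet gets an address on the underlay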

To learn more about all the above, and to gain NGINX, Red Hat, and Penguin Computing’s perspectives on containers, please consider attending Contain NY next Tuesday on Wall St. You can click here to learn more.

Four Failings of RoCE

Recently someone suggested that I watch this rather informative video on how Microsoft Research attempted to make RDMA over Converged Ethernet (RoCE) lossless. Unbelievably, this video exposes and documents several serious flaws in the design of RoCE. Also, it appears they’ve replaced the word “Converged” with “Commodity” to push the message that RoCE doesn’t require anything special to run on regular old Ethernet. Here are the four points I got out of the video; please let me know your take:

  • RDMA Livelock: This is a simple problem of retransmitting. Since RDMA was architected for a lossless, deterministic local bus architecture, accommodations were never made for dropped packets; they just didn’t happen on a bus. Ethernet, on the other hand, was designed to expect loss, remember vampire taps. Livelock occurs when a message composed of multiple packets experiences a dropped packet somewhere in the middle. At this point, RDMA has to start over from the first packet and retransmit the whole message. If this were a multi-megabyte frame of video, this retransmit approach would livelock a network. So what was Microsoft’s solution? Rewrite the RDMA stack’s retransmit logic to retransmit starting at the dropped packet (this is what TCP does). Good luck, who’s got this action item?
  • Priority Flow Control (PFC) Deadlock: This happens when switches encounter incomplete ARP packets. Microsoft’s solution is a call for more research and to filter incomplete ARP packets. More to-dos, and this one is on all the switch vendors.
  • NIC PFC Storm: It seems that the firmware in some RoCE NICs has bugs that create Pause Frame storms. Beyond NIC vendors fixing those bugs, they also suggest that NIC and switch vendors include extra new software to detect oncoming storms and shut them down. Great idea; another to-do for the anonymous NIC and switch providers.
  • Slow Receiver NICs: Some RoCE NICs generate excessive pause frames because their RDMA architecture relies on second-level, host-based translation tables to fetch destination memory addresses. Oh my god, this is how you design an HPC NIC? Seriously, how cheap can you be? Make the lookup tables bigger; Myricom addressed this problem back in the 1990s. It appears that on some RoCE NICs it’s not that hard to have so many receivers of kernel-bypassed packets that the NIC must go off-NIC for the destination memory address lookups.

As the speaker closes out the discussion, he says, “This experiment shows that even with RDMA, low latency and high throughput cannot be achieved at the same time, as network congestion can cause queues to build up in the network.” Anyone who has done this for a while knows that low latency and high bandwidth are mutually exclusive. That’s why High-Performance Computing (HPC) tests often start with zero-byte packets and then scale up to demonstrate how latency increases proportionally with packet size.

All the above aside, this important question remains: why would anyone map a protocol like RDMA, which was designed for use on a lossless local bus, onto a switched network and think it would work? A local lossless bus is very deterministic, with requirements bound to its lossless nature and predictable performance. Conversely, Ethernet was designed from the beginning to expect and accommodate loss, and performance has always been secondary to packet delivery. That’s why Ethernet performance is non-deterministic. The resilience of Ethernet, not performance, was the primary design criterion DARPA had mandated to ensure our military’s network would remain functional at all costs.

Soon Solarflare will begin shipping ScaleOut Onload free with all their 8000 series NICs, some of which sell for under $300 USD. With ScaleOut Onload, TCP now has all the kernel bypass tricks RDMA offers, but with the benefits and compatibility of sockets-based TCP and no code changes. Furthermore, it delivers the performance of RDMA, but with much better reliability and availability than RoCE.

P.S. Mellanox just informed me that the NIC specific issues mentioned above were corrected some time ago in their ConnectX-4 series cards.