Storage Over TCP in a Flash

By Patrick Dehkordi

Recently Solarflare delivered a TCP transport for Non-Volatile Memory Express (NVMe). The big deal with NVMe is that it’s FLASH memory based, and often multi-ported so when these “disk blocks” are transferred over the network, even with TCP, they often arrive 100 times faster than they would if they were coming off spinning media. We’re talking 100 microseconds versus average 15K RPM disk seek times measured in milliseconds. Unlike RoCE or iWARP a TCP transport provides storage over Ethernet without requiring ANY network infrastructure changes.

It should be noted that this should work for ANY NIC and does not require RDMA, RoCE, iWARP or any special NIC offload technology. Furthermore, since this is generic TCP/IP over Ethernet you don’t need to touch your switches to setup Data Center Bridging. Also, you don’t need Data Center Ethernet, Converged Ethernet, or Converged Enhanced Ethernet, just plain old Ethernet.  Nor do you need to set things up to use Pause Frames, or Priority Flow Control. This is industry changing stuff, and yet not hard to implement for testing so I’ve included a recipe for how to make this function in your lab below, it is also cross posted in the GitHub.

At present this is a fork of the v4.11 kernel. This adds two new kconfig options:

  • NVME_TCP : enable initiator support
  • NVME_TARGET_TCP : enable target support

The target requires the nvmet module to be loaded. Configuration is identical to RDMA, except "tcp" should be used for addr_trtype.

The host requires the nvme, nvme_core and nvme_fabrics modules to be loaded. Again, configuration is identical to RDMA, except -t tcp should be passed to the nvme command line utility instead of -t rdma. This requires a patched version of nvme-cli.

Example assumptions

This is assuming a target IP of 10.0.0.1, a subsytem name of ramdisk and an underlying block device /dev/ram0. This is further assuming an existing system with RedHat/Centos Distribution built on 3.x kernel.

Building the Linux kernel

For more info refer to: https://kernelnewbies.org/KernelBuild

Install or confirm the following packages are installed

yum install gcc make git ctags ncurses-devel openssl-devel

Download, unzip or clone the repo into a local directory

git clone https://github.com/solarflarecommunications/nvme-of-tcp/tree/v4.11-nvme-of-tcp
cd nvme-of-tcp-4.11

Create a .config file or copy the existing .config file into the build directory

scp /boot/config-3.10.0-327.el7.x86_64 .config

Modify the .config to include the relevant NVMe modules

make menuconfig

Under “Block Devices” at a minimum select “NVM Express block device” “NVMe over Fabrics TCP host support” “NVMe over Fabrics TCP target support” Save and Exit the text based kernel configuration utility.

Confirm the changes

grep NVME_ .config

Compile and install the kernel

(To save time you can utilize multiple CPUs by including the j option)

make -j 16
make -j 16 modules_install install 

Confirm that the build is included in the boot menu entry

(This is dependent on the bootloader being used, for GRUB2)

cat /boot/grub2/grub.cfg | grep menuentry

Set the build as the default boot option

grub2-set-default 'Red Hat Enterprise Linux Server (4.11.0) 7.x (Maipo)’

Reboot the system

reboot

Confirm that the kernel has been updated:

uname -a 
Linux host.name 4.11.0 #1 SMP date  x86_64 GNU/Linux

NVMe CLI Update

Download the correct version of the NVMe CLI utility that includes TCP:

git clone https://github.com/solarflarecommunications/nvme-cli

Update the NVMe CLI utility:

cd nvme-cli
make
make install

Target setup

Load the target driver

This should automatically load the dependencies,nvmenvme_core and nvme_fabrics

modprobe nvmet_tcp

Set up storage subsystem

mkdir /sys/kernel/config/nvmet/subsystems/ramdisk
echo 1 > /sys/kernel/config/nvmet/subsystems/ramdisk/attr_allow_any_host
mkdir /sys/kernel/config/nvmet/subsystems/ramdisk/namespaces/1
echo -n /dev/ram0 > /sys/kernel/config/nvmet/subsystems/ramdisk/namespaces/1/device_path
echo 1 > /sys/kernel/config/nvmet/subsystems/ramdisk/namespaces/1/enable

Setup port

mkdir /sys/kernel/config/nvmet/ports/1
echo "ipv4" > /sys/kernel/config/nvmet/ports/1/addr_adrfam
echo "tcp" > /sys/kernel/config/nvmet/ports/1/addr_trtype
echo "11345" > /sys/kernel/config/nvmet/ports/1/addr_trsvcid
echo "10.0.0.1" > /sys/kernel/config/nvmet/ports/1/addr_traddr

Associate subsystem with port

ln -s /sys/kernel/config/nvmet/subsystems/ramdisk /sys/kernel/config/nvmet/ports/1/subsystems/ramdisk

Initiator setup

Load the initiator driver

This should automatically load the dependencies,nvmenvme_core and nvme_fabrics.

modprobe nvme_tcp

Use the NVMe CLI utility to connect the initiator to the target:

nvme connect -t tcp -a 10.0.0.1 -s 11345 -n ramdisk

lsblk should now show an NVMe device.

What’s a Smart NIC?

While reading these words, it’s not just your brain doing the processing required to make this feat possible. We’ve all seen over and under exposed photos and can appreciate the decision making necessary to achieve a perfect light balanced photo. In the laboratory, we observed that the optic nerve connecting the eye to the brain is responsible for measuring the intensity of the light hitting the back of your eye. In response to this data, each optic nerve dynamically adjusts the aperture of the iris in your eye connected to this nerve to optimize these levels. For those with some photography experience, you might recall that there is a direct relationship between aperture (f-stop) and focal length. It also turns out that your optic nerve, after years of training as a child, has come to realize you’re reading text up close, so it is now also responsible for modifying the muscles around that eye to sharpen your focus on this text. All this data processing is completed before your brain has even registered the first word in the title. Imagine if your brain was responsible for processing all the data and actions that are required for your body to function properly?

Much like your optic nerve, the difference between a standard Network Interface Card (NIC) and a smart NIC is how much processing the Smart NIC offloads from the host CPU. Until recently Smart NICs were designed around Field Programmable Gate Array (FPGA) platforms costing thousands of dollars. As their name implies, FPGAs are designed to accept localized programming that can be easily updated once installed. Now a new breed of Smart NIC is emerging that while it isn’t nearly as flexible as an FPGA, they contain several sophisticated capabilities not previously found in NICs costing only a few hundred dollars. These new affordable Smart NICs can include a firewall for security, a layer 2/3 switch for traffic steering, several performance acceleration techniques, and network visibility with possibly remote management.

The firewall mentioned above filters all network packets against a table built specifically for each local Internet Protocol (IP) address under control. An application processing network traffic is required to register a numerical network port. This port then becomes the internal address to send and receive network traffic. Filtering at the application level then becomes a simple process of only permitting traffic for specific numeric network ports. The industry has labeled this “application network segmentation,” and in this instance, it is done entirely in the NIC. So How does this assist the host x86 CPU? It turns out that by the point at which operating system software filtering kicks in the host CPU has often expended over 10K CPU cycles to process a packet. If the packet is dropped the cost of that drop is 10K lost host CPU cycles. If that filtering was done in the NIC, and the packet was then dropped there would be NO host CPU impact.

Smart NICs also often have an internal switch which is used to steer packets within the server rapidly. This steering enables the NIC to move packets to and from interfaces and virtual NIC buffers which can be mapped to applications, virtual machines or containers. Efficiently steering packets is another offload method that can dramatically reduce host CPU overhead.

Improving overall server performance, often through kernel bypass, has been the providence of High-Performance Computing (HPC) for decades. Now it’s available for generic Ethernet and can be applied to existing and off the shelf applications. As an example, Solarflare has labeled its family of Kernel Bypass accelerations techniques Universal Kernel Bypass (UKB). There are two classes of traffic to accelerate: network packet and application sockets based. To speed up network packets UKB includes an implementation of the Data Plane Development Kit (DPDK) and EtherFabric VirtualInterface (EF_VI), both are designed to deliver high volumes of packets, well into the 10s of millions per second, to applications familiar with these Application Programming Interfaces (APIs). For more standard off-the-shelf applications there are several sockets based acceleration libraries included with UKB: ScaleOut Onload, Onload, and TCPDirect. While ScaleOut Onload (SOO) is free and comes with all Solarflare 8000 series NICs, Onload (OOL) and TCPDirect require an additional license as they provide micro-second and sub-microsecond 1/2 round trip network latencies. By comparison, SOO delivers 2-3 microsecond latency, but the real value proposition of SOO is the dramatic reduction in host CPU resources required to move network data. SOO is classified as “zero-copy” because network data is copied once directly into or out of your application’s buffer. SOO saves the host CPU thousands of instructions, multiple memory copies, and one or more CPU context switches, all dramatically improve application performance, often 2-3X, depending on how network intense an application is.

Finally, Smart NICs can also securely report NIC network traffic flows, and packet counts off the NIC to a centralized controller. This controller can then graphically display for network administrators everything that is going on within every server under its management. This is real enterprise visibility, and since only flow metadata and packet counts are being shipped off NIC over a secure TLS link the impact on the enterprise network is negligible. Imagine all the NICs in all your servers reporting in their traffic flows, and allowing you to manage and secure those streams in real time, with ZERO host CPU impact. That’s one Smart NIC!

What are Neural Class Networks?

In 1971 Intel released the 4004. The 4004 was their first 4-bit single core processor, and for the next 34 years, that’s pretty much how x86 computing progressed. Sure they bumped up the architecture and speed as designs and processes improved, but one thing remained constant a single processing engine. Under pressure in the 1990s from Unix workstations driven by Reduced Instruction Set Computing (RISC) with multi-core pipeline architectures like IBM’s PowerPC, Sun’s Ultra-SPARC and the MIPS R3000 Intel began exploring multi-core architectures.

So from 1971 until 2005, every x86 processor had a single core, life was simple. Intel even provided reference designs so system builders could even put two of these processors into the same system. Sure there were fringe companies that developed Symmetrical Multi-Processing (SMP) systems (ex. Sequent & NEC) with more than two CPU sockets, but they were large frame expensive custom servers not found in general mainstream use. So if you wanted to scale out your computational capacity to tackle a tough problem, you had to rack more servers.

This single core challenge largely drove commodity Linux clustering making it all the rage by the turn of the century, particularly in high-performance computing (HPC). It wasn’t uncommon to tightly couple 1,000 or more dual socket single core systems together to tackle a tough computational problem. Government agencies leveraged large clusters to model our nuclear stockpile and computationally secure it. Auto companies crashed dozens of virtual car designs on a daily basis, and oil companies crunched seismic data to compute untapped oil reserves. Then the game changed, and x86 shifted to multi-core processors. Yesterday Intel announced the availability of their new Skylake server platform, but the industry is already refocusing on 2018’s Cascade Lake 32 core, 64 thread, server chip. So why does all this matter?

As a general rule, Google doesn’t publish or confirm the computation capacity of their data centers. If we pick a specific example, their Oregon data center, we can apply particular assumptions and project that it’s roughly 100,000 servers. Again assuming they are using dual socket eight core hyper-threaded processors this translates to 3.2 million parallel threads of computation tightly coupled in one physical location potentially addressing a single problem. Structures like this are quickly approximating what took nature millions of years to develop, an organic brain based on the neuron. Neurons on average have 7,000 connections to both local and remote neurons within their system, that’s a considerable amount of networking per single computational unit.

By contrast, the common cockroach has one million neurons, and that frog you played with as a kid sports 16 million. So if we equate a neuron to a single thread of execution on an x86 that would put a Google data center somewhere between the cockroach and a frog. If in 2018 Google were to upgrade the Oregon facility to Cascade Lakes it would still be only 12.8 million threads, still less than a frog. Given the geometric core growth though it will only be a few decades before Google is deploying data centers approaching the capability of the 86 billion neurons found in the human brain. Oh, and that’s assuming an x86 thread, and a neuron is even computationally similar.

So what is Neural Class Networking? As mentioned above a neuron on average is connected to 7,000 other neurons. Imagine if every hardware thread of execution in your server were networked to even 64 external threads of execution on related systems working on the same problem, that’s the start of neural class networking. Today we have servers with typically 32 threads of execution. Solarflare’s newest generation of XtremeScale Smart NICs provides 2,048 virtual NICs, so each thread of those 32 threads has the capability to sustain 64 dedicated hardware paths to other external threads. That’s the start of Neural Class Networking.

Capture, Analytics, and Solutions

My first co-op job at IBM Research back in 1984 was to help roll out the IBM PC to the companies best and brightest. It wasn’t long into that position, perhaps a month, when we noticed a large number of monochrome monitors had a consistent burn-in pattern. A horizontal bar across the top, and a vertical bar on the left. Now I was not new to personal computers having purchased my own TRS-80 Model III a year earlier, but it was apparent that a vast number within Research were all being used to do the same thing. So I asked, VisiCalc (MS Excel’s great grandfather).

At the time personal computers were still scarce, and seeing one outside a fortune 100 company was akin to a unicorn sighting, so the subtlety between tool and solution was lost on this 21-year-old brain. Shortly after the cause of the burn-in was understood I had the opportunity to have lunch with one of these researchers and discuss his use of VisiCalc. That single IBM PC in his lab existed to turn experimental raw data that it directly collected into comprehensible observations. He had a terminal in his office for email, and document creation, so this system existed entirely to run two programs. One that collected raw data from his lab equipment, and VisiCalc, which translated that data into understandable information which helped him make sense of our world.

Recently Solarflare began adding Analytics packages to their open network packet capture platform as they ready it for general availability later this month. The first of these packages is Trading Analytics, and it wasn’t until I recently saw an actual demo of this application that the real value of these packages kicked in. The demo clarified for me this value much like the VisiCalc ah-ha moment mentioned above, but perhaps we need some additional context.

Imagine a video surveillance system at a large national airport. Someone in security can easily track a single passenger from the curb to the gate, without ever losing sight of them, now extend that to a computer network where the network packets are the people. The key value proposition of a distributed open network packet capture platform is the capability to gather copies of what happened at various points in your network while preserving the exact moment in time that that data was at that location. Now imagine that data being how a company trades electronically on several different stock exchanges. The capability to visualize the exact moment when an exchange notified you that IBM stock was trading at $150/share, then it updated you every step of the way showing you the precise instant that each of the various bits in your trading infrastructure kicked in. The value of being able to see the entire transaction with nanosecond resolution can then be monetized, especially when you can see this performance over hundreds, thousands or even millions of trades. Given the proper visualization and reports, this data can then empower you to revisit the slow stages in your trading platform to further wring out any latency that might result in lost market opportunities.

So what is analytics? Well, first it’s the programming that is responsible for decoding the market data. That means translating all the binary data contained in a captured network packet into humanly meaningful data. While this may sound trivial, trust me it isn’t, there are dozens of different network protocol formats and hundreds of exchanges worldwide that leverage these formats so it can quickly become a can of worms. Solarflare’s package translates over two dozen different network protocols, for example, NYSE ARCA, CME FIX and FAST into a single humanly readable framework. Next, it understands how to read and use the highly accurate time stamps attached to each network packet representing when those packets arrived at a given collection point. Then it correlates all the network sequence ID numbers and internal message ID numbers along with those time stamps so it can align a trade across your entire environment, and show you where it was at each precise moment in time. Finally, and this is the most important part, it can report and display the results in many different ways so that it easily fits into a methodology familiar to its consumer.

So what differentiates a tool from a solution? Some will argue that VisiCalc and Solarflare’s Trading Analytics are nothing more than a sophisticated set of fancy hammers and chisels, but they are much more. A solution adds significant, and measurable, value to the process by removing the mundane and menial tasks from that process thereby allowing us to focus on the real tasks that require our intellect.

Podcast on iTunes, Google Play & Stitcher

logopodcastAfter several weeks we now have a critical mass of episodes of The Technology Evangelist Podcast to justify posting it on iTunes, Google Play, and Stitcher. So now you can subscribe from your favorite service or player and stay current. Recently we completed episodes on high performance packet capture, electronic trading, NVMe, and Hadoop.

We’re now scheduling and recording episodes on FPGAs, digital currencies (bitcoin), Intel Skylake, TensorFlow, Containers, the race to zero, and much, much more. If you have ideas for topics or speakers, we always welcome suggestions, so please email The Technology Evangelist or comment to this blog post.

Will AI Restore HFT Profitability?

225x225bbSeveral years ago Michael Lewis wrote “Flash Boys: A Wall Street Revolt” and created a populist backlash that rocked the world of High-Frequency Trading (HFT). Lewis had publicly pulled back the curtain and shared his perspective on how our financial markets worked. Right or wrong, he sensationalized an industry that most people didn’t understand, and yet one in which nearly all of us have invested our life savings.

Several months after Flash Boys Peter Kovac, a Trader with EWT, authored “Flash Boys: Not So Fast” where he refuted a number of the sensationalistic claims in Lewis’s book. While I’ve not yet read Kovacs book, it’s behind “Dark Pools: The Rise of Machine Traders and the Rigging of the U.S. Stock Market” in my reading list, from the reviews it appears he did an excellent job of covering the other side of this argument.

All that aside, earlier today I came across an article by Joe Parsons in “The Trade” titled “HFT: Not so flashy anymore.” Parsons makes several very salient points:

  • Making money in HFT today is no longer as easy as it once was. As a result, big players are rapidly gobbling up little ones. For example in April Virtu picked up KCG and Quantlab recently acquired Teza Technologies.
  • Speed is no longer the primary driving differentiator between HFT shops. As someone in the speed business, I can tell you that today one microsecond for a network 1/2 round trip in trading is table stakes. Some shops are deploying leading edge technologies that can reduce this to under 200 nanoseconds. Speed as the only value proposition an HFT offers hasn’t been the case for a rather long time. This is one of the key assertions that was made five years ago in the “Dark Pools” book mentioned above.
  • MiFID II is having a greater impact on HFT then most people are aware. It brings with it new, and more stringent regulations governing dark pools and block trades.
  • Perhaps the most important point though that Parsons makes is that the shift today is away from arbitrage as a primary revenue source and towards analytics and artificial intelligence (AI). He down plays this point by offering it up as a possible future.

AI is the future of HFT as shops look for new and clever ways to push intelligence as close as possible to the exchanges. Companies like Xilinx, the leader in FPGAs, Nvidia with their P40 GPU, and Google, yes Google, with their Tensorflow Processing Unit (TPU) will define the hardware roadmap moving forward. How this will all unfold is still to be determined, but it will be one wild ride.

Container Networking is Hard

ContainerNetworkingisHardLast night I had the honor of speaking at Solarflare’s first “Contain NY” event. We were also very fortune to have two guest speakers, Shanna Chan from Red Hat, and Shaun Empie from NGINX. Shanna presented OpenShift then provided a live demonstration where she updated some code, rebuilt the application, constructed the container, then deployed the container into QA. Shaun followed that up by reviewing NGINX Plus with flawless application delivery and live action monitoring. He then rolled out and scaled up a website with shocking ease. I’m glad I went first as both would have been tough acts to follow.

While Shanna and Shaun both made container deployment appear easy, both of these deployments were focused on speed to deploy, not maximizing the performance of what was being deployed. As one dives into the details of how to extract the most from the resources we’re provided we quickly learn that container networking is hard, and performance networking from within containers is even an order of magnitude more challenging. Tim Hockin, a Google Engineering Manager, was quoted in the eBook “Networking, Security & Storage” by THENEWSTACK that “Every network in the world is a special snowflake; they’re all different, and there’s no way that we can build that into our system.”

Last night when I asked those assembled why container networking was hard no one offered what I thought was the obvious answer, that we expect to do everything we do on bare metal from within a container, and we expect that the container can be fully orchestrated. While that might not sound like “a big ask”, when you look at what is done to achieve performance networking today within the host, this actually is. Perhaps I should back up, when I say performance networking within a host I mean kernel bypass networking.

For kernel bypass to work it typically “binds” the server NIC’s silicon resources pretty tightly to one or more applications running in user space. This tight “binding” is accomplished using one of the at least several common methods: Solarflare’s Onload, Intel’s DPDK, or Mellanox’s RoCE. Each approach has its own pluses and minuses, but that’s not the point of this blog entry. When using any of the above it is this binding that establishes the fast path from the NIC to host memory. The objectives of these approaches, and this “binding” though runs counter to what one needs when doing orchestration, and that is a level of abstraction between the physical NIC hardware and the application/container. This level of abstraction can then be rewired so containers can easily be spun up, torn down, or migrated between hosts.

It is this abstraction layer where we all get knotted up. Do we use an underlay network leveraging MACVLANs or IPVLANs or an overlay network using VXLAN or NVGRE? Can we leverage a Container Network Interface (CNI) to do the trick? This is the part about container networking that is still maturing. While MACVLANs provide the closest link to the hardware and afford the best performance, they’re a layer-2 interface, and running unchecked in large scale deployments they could lead to a MAC explosion resulting in trouble with your switches. My understanding is that with this layer of connectivity there is no real entry point to abstract MACVLANs into say a CNI so one could use Kubernetes to orchestrate their deployment. Conversely, IPVLANs are a layer-3 interface and have already been abstracted to a CNI for Kubernetes orchestration. The real question is what is the performance penalty one can observe and measure between using a MACVLAN connected container and an IPvLAN connected one? All work to be done. Stay tuned…