Container Networking is Hard

Last night I had the honor of speaking at Solarflare’s first “Contain NY” event. We were also very fortunate to have two guest speakers, Shanna Chan from Red Hat and Shaun Empie from NGINX. Shanna presented OpenShift, then gave a live demonstration in which she updated some code, rebuilt the application, built the container, and deployed it into QA. Shaun followed that up by reviewing NGINX Plus, with flawless application delivery and live monitoring, then rolled out and scaled up a website with shocking ease. I’m glad I went first, as both would have been tough acts to follow.

While Shanna and Shaun both made container deployment look easy, their deployments focused on speed to deploy, not on maximizing the performance of what was being deployed. As one dives into the details of how to extract the most from the resources we’re given, we quickly learn that container networking is hard, and performance networking from within containers is an order of magnitude more challenging. Tim Hockin, a Google Engineering Manager, is quoted in The New Stack’s eBook “Networking, Security & Storage” as saying, “Every network in the world is a special snowflake; they’re all different, and there’s no way that we can build that into our system.”

Last night, when I asked those assembled why container networking is hard, no one offered what I thought was the obvious answer: we expect to do everything we do on bare metal from within a container, and we expect that the container can be fully orchestrated. While that might not sound like “a big ask,” when you look at what is done today to achieve performance networking within the host, it actually is. Perhaps I should back up: when I say performance networking within a host, I mean kernel bypass networking.

For kernel bypass to work, it typically “binds” the server NIC’s silicon resources fairly tightly to one or more applications running in user space. This tight binding is accomplished using one of several common methods: Solarflare’s Onload, Intel’s DPDK, or Mellanox’s RoCE. Each approach has its own pluses and minuses, but that’s not the point of this blog entry. Whichever you use, it is this binding that establishes the fast path from the NIC to host memory. The objective of these approaches, and of this binding, runs counter to what one needs for orchestration: a level of abstraction between the physical NIC hardware and the application/container. That level of abstraction can then be rewired so containers can easily be spun up, torn down, or migrated between hosts.
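
To make that binding concrete, below is a minimal sketch of a plain sockets-based echo server, assuming a host with a Solarflare adapter and Onload installed; the launch commands in the comments are illustrative, and nothing about the application itself is Onload-specific. The point is that the kernel bypass library intercepts standard socket calls at run time, so the same code runs over either the kernel path or the bypass path.

```python
# A plain sockets-based TCP echo server. Kernel bypass libraries like Onload
# intercept these standard socket calls at run time, so the application code
# does not change; only the launch command does, e.g. (illustrative):
#   onload python3 echo_server.py        # accelerated, kernel bypass path
#   python3 echo_server.py               # ordinary kernel networking path
import socket

def serve(host: str = "0.0.0.0", port: int = 9000) -> None:
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as srv:
        srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        srv.bind((host, port))
        srv.listen()
        while True:
            conn, _addr = srv.accept()
            with conn:
                while (data := conn.recv(4096)):
                    conn.sendall(data)   # echo the payload straight back

if __name__ == "__main__":
    serve()
```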

It is this abstraction layer where we all get knotted up. Do we use an underlay network leveraging MACVLANs or IPVLANs, or an overlay network using VXLAN or NVGRE? Can we leverage a Container Network Interface (CNI) plugin to do the trick? This is the part of container networking that is still maturing. While MACVLANs provide the closest link to the hardware and afford the best performance, they are a layer-2 interface, and running unchecked in large-scale deployments they could lead to a MAC address explosion that causes trouble for your switches. My understanding is that at this layer there is no real entry point to abstract MACVLANs into, say, a CNI plugin so that one could use Kubernetes to orchestrate their deployment. Conversely, IPVLANs are a layer-3 interface and have already been abstracted into a CNI plugin for Kubernetes orchestration. The real question is what performance penalty one can observe and measure between a MACVLAN-connected container and an IPVLAN-connected one. That work remains to be done. Stay tuned…
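
For readers who want to see what an underlay attachment looks like outside of any CNI plugin, here is a minimal sketch using the pyroute2 library; the parent interface name, address, and mode are illustrative assumptions, and root privileges are required. A container runtime would create a device like this and then move it into the container’s network namespace.

```python
# A minimal underlay sketch using pyroute2 (pip install pyroute2), assuming a
# parent interface named "eth0" and root privileges. An IPVLAN device is
# created the same way with kind="ipvlan"; it shares the parent's MAC address,
# which is what sidesteps the MAC explosion issue noted above.
from pyroute2 import IPRoute

PARENT = "eth0"          # assumption: the physical NIC to attach to

with IPRoute() as ipr:
    parent_idx = ipr.link_lookup(ifname=PARENT)[0]

    # MACVLAN in bridge mode: each sub-interface gets its own MAC address,
    # which is why large deployments risk exhausting switch MAC tables.
    ipr.link("add", ifname="mv0", kind="macvlan",
             link=parent_idx, macvlan_mode="bridge")

    # Bring it up and give it an address (illustrative values).
    mv_idx = ipr.link_lookup(ifname="mv0")[0]
    ipr.addr("add", index=mv_idx, address="192.168.10.50", prefixlen=24)
    ipr.link("set", index=mv_idx, state="up")
```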

Technology Evangelist Podcast

Well, after a few weeks of planning and preparation, the Technology Evangelist Podcast is finally available. This podcast focuses on bringing the engineers, marketing, and sales folks on the cutting edge of technology to the mic to explain it.

Our first episode features Ron Miller, the CTO of Cloudwick, talking about “Hadoop and Securing Hadoop Clusters.” Ron is an expert in cyber security, having founded Mirage Networks in 2003. We’re honored to have Ron share some background on Hadoop and how one might secure Hadoop clusters.

In our second episode, Mark Zeller joined us to talk about Non-Volatile Memory Express (NVMe) and how it will replace spinning disks over the years to come. We touch on the benefits of the technology, talk about erasure coding, and review where the technology is headed. This episode has been recorded and is pending final approval.

Yesterday, Saturday, June 10th, Bob Van Valzah had some time to stop by and discuss electronic trading. This episode covers such topics as what trading is, the race to zero, dark pools, and the book Flash Boys. This episode has also been recorded and is pending final approval.

Four Container Networking Benefits

Container networking is walking in the footsteps virtualization took over a decade ago. Still, networking is a non-trivial task, as there are both underlay and overlay networks one needs to consider. Underlay networks like bridge, MACVLAN, and IPVLAN are designed to map physical ports on the server to containers with as little overhead as possible. Conversely, overlay networks require packet-level encapsulation, using technologies like VXLAN and NVGRE, to accomplish the same goals. Any time network packets have to flow through hypervisors or layers of virtualization, performance will suffer. Toward that end, Solarflare is now providing the following four benefits for those leveraging containers.
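
As a counterpart to the underlay devices discussed earlier, here is a minimal sketch of the overlay side, again using pyroute2 and assuming a parent NIC named "eth0"; the VNI, multicast group, and TTL are illustrative. Every frame sent through a VXLAN device is wrapped in an outer UDP/IP header, roughly 50 extra bytes plus the work of adding and stripping them, which is the encapsulation overhead referred to above.

```python
# A minimal overlay sketch with pyroute2, assuming a parent NIC named "eth0"
# and root privileges. The VXLAN device encapsulates each Ethernet frame in an
# outer UDP/IP header, which is the per-packet cost underlay approaches like
# MACVLAN/IPVLAN avoid.
from pyroute2 import IPRoute

PARENT = "eth0"          # assumption: the physical NIC carrying the overlay

with IPRoute() as ipr:
    parent_idx = ipr.link_lookup(ifname=PARENT)[0]
    ipr.link("add", ifname="vx42", kind="vxlan",
             vxlan_id=42,                 # illustrative VXLAN Network Identifier
             vxlan_link=parent_idx,
             vxlan_group="239.1.1.42",    # illustrative multicast group for flooding
             vxlan_ttl=16)
    vx_idx = ipr.link_lookup(ifname="vx42")[0]
    ipr.link("set", index=vx_idx, state="up")
```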

  1. NGINX Plus running in a container can now utilize ScaleOut Onload. In doing so, NGINX Plus achieves a 40% improvement in performance over standard host networking. With the introduction of Universal Kernel Bypass (UKB), Solarflare now includes both DPDK and ScaleOut Onload for FREE with all of its base 8000 series adapters. This means that people wanting to improve application performance should seriously consider testing ScaleOut Onload.
  2. For those looking to leverage orchestration platforms like Kubernetes, Solarflare has provided the kernel organization with an Advanced Receive Flow Steering driver. This new driver improves performance in all of the above-mentioned underlay networking configurations by ensuring that packets destined for a container are quickly and efficiently delivered to that container (a sketch of the generic flow-steering knobs this builds on appears after this list).
  3. At the end of July, during the Black Hat cyber security conference, Solarflare will demonstrate a new security solution that secures all traffic to and from containers, each with an enterprise-unique IP address, via a hardware firewall in the NIC.
  4. Early this fall, as part of its Container Initiative, Solarflare will deliver an updated version of ScaleOut Onload that leverages MACVLANs and supports multiple network namespaces. This version should further improve both performance and security.
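
As referenced in item 2, here is a minimal sketch of the stock Linux Receive Flow Steering (RFS) knobs that accelerated, hardware-assisted flow steering builds on; the device name and table sizes are illustrative assumptions, root privileges are required, and this is not the Solarflare driver itself, just the generic kernel interface it plugs into.

```python
# A minimal sketch of the generic Linux RFS configuration that accelerated RFS
# builds on. Device name and table sizes are illustrative; run as root.
import glob

DEV = "eth0"                 # assumption: the adapter serving the containers
GLOBAL_ENTRIES = 32768       # illustrative global flow-table size

# Global flow table shared by all NICs.
with open("/proc/sys/net/core/rps_sock_flow_entries", "w") as f:
    f.write(str(GLOBAL_ENTRIES))

# Split the table evenly across this device's receive queues.
queues = glob.glob(f"/sys/class/net/{DEV}/queues/rx-*/rps_flow_cnt")
per_queue = GLOBAL_ENTRIES // max(len(queues), 1)
for path in queues:
    with open(path, "w") as f:
        f.write(str(per_queue))

# Hardware (accelerated) RFS additionally needs ntuple filtering, e.g.:
#   ethtool -K eth0 ntuple on
```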

To learn more about all of the above, and to gain NGINX, Red Hat, and Penguin Computing’s perspectives on containers, please consider attending Contain NY next Tuesday on Wall St. You can click here to learn more.

5 Petabytes of Public Hadoop Data

Back in 2015, I sat through a DOE presentation at a government cyber security conference on SCADA (Supervisory Control and Data Acquisition) systems accessible from the web. SCADA is used to allow computers to manage public utilities: water, gas, petroleum refineries, nuclear power plants, etc. The speaker did a live demo using Shodan in which he surfaced something like 65K open SCADA networks reachable from the Internet. This article backs up that presentation, though the author points out that the maps only show German-made SCADA systems; to be more precise, they show Siemens SCADA controllers, which dominate the market. Most of these systems were for industrial control, and they should have been air-gapped, physically disconnected from ANY external network, let alone the Internet. Last night a friend suggested I read “Hadoop Servers Expose Over 5 Petabytes of Data,” which shows that Hadoop clusters are no different.

Guess what? Shodan was leveraged again, this time to find Internet-accessible Hadoop clusters. In aggregate it found clusters containing upwards of 5 petabytes, which, for those without a computer science degree, is 5 million gigabytes. The article goes on to mention that over the past year nearly 500 Hadoop systems have been held for ransom, and it points out where to go to secure a Hadoop system. I bring all this up because very soon, at Black Hat in July, Solarflare will be demonstrating with Cloudwick how the server NIC hardware can be used to directly secure a Hadoop cluster. This can be done without changing a single line of code or altering the Hadoop configuration. Stay tuned…

Near-Real-Time Analytics for HFT

Artificial Intelligence (AI) advances are finally progressing along a geometric curve, thanks to cutting-edge technologies like Google’s new Tensor Processing Unit (TPU) and NVIDIA’s latest Tesla V100 GPU platform. Couple these with updated products like Xilinx FPGAs such as the Kintex, refreshed Intel Purley 32-core CPUs, and advances in storage such as NVMe appliances from companies like X-IO, and computing has never been so exciting! Near-real-time analytics for High-Frequency Trading (HFT) is now possible. This topic will be thoroughly discussed at the upcoming STAC Summit in NYC this coming Monday, June 5th. Please consider joining Hollis Beall, Director of Performance Engineering at X-IO, at the STAC Summit 1 PM panel discussion titled “The Brave New World of Big I/O!” If you can’t make it, or wish to get some background, there is Bill Miller’s blog post titled “Big Data Analytics: From Sometime Later to Real-Time,” where he tips his hand at where Hollis will be heading.

Four Failings of RoCE

Recently someone suggested that I watch this rather informative video of how Microsoft Research attempted to make RDMA over Converged Ethernet (RoCE) lossless. Unbelievably, this video exposes and documents several serious flaws in the design of RoCE. Also, it appears they’ve replaced the word “Converged” with “Commodity,” to drive home the message that RoCE doesn’t require anything special to run on regular old Ethernet. Here are the four points I got out of the video; please let me know your take:

  • RDMA Livelock: This is a simple problem of retransmission. Since RDMA was architected for a lossless, deterministic local bus architecture, accommodations were never made for dropped packets; they just didn’t happen on a bus. Ethernet, on the other hand, was designed to expect loss (remember vampire taps?). Livelock occurs when a message composed of multiple packets experiences a dropped packet somewhere in the middle. At this point, RDMA has to start over from the first packet and retransmit the whole message. If this were a multi-megabyte frame of video, this retransmit approach would livelock a network (the toy model after this list illustrates the cost). So what was Microsoft’s solution? Rewrite the RDMA stack’s retransmit logic to retransmit only from the dropped packet, which is what TCP does. Good luck; who’s got this action item?
  • Priority Flow Control (PFC) Deadlock: This happens when switches encounter incomplete ARP packets. Microsoft’s solution is a call for more research, and to filter incomplete ARP packets. More to-dos, and this one is on all the switch vendors.
  • NIC PFC Storm: It seems that the firmware in some RoCE NICs has bugs that create pause frame storms. Beyond NIC vendors fixing those bugs, they also suggest that NIC and switch vendors add new software to detect oncoming storms and shut them down. Great idea; another to-do for the unnamed NIC and switch providers.
  • Slow Receiver NICs: These generate excessive pause frames because their RDMA architecture relies on second-level, host-based translation tables to fetch the destination memory address. Oh my god, this is how you design an HPC NIC? Seriously, how cheap can you be? Make the lookup tables bigger; Myricom addressed this problem back in the 1990s. It appears that on some RoCE NICs it isn’t hard to have so many receivers of kernel-bypassed packets that the NIC must go off-chip for the destination memory address lookups.
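
The toy model below is my own illustration, not something from the talk: it estimates how many packets end up on the wire when a drop forces a full-message restart versus resuming from the dropped packet, for an illustrative 1,000-packet message and 0.1% per-packet loss.

```python
# A toy model of why restarting a multi-packet message from packet zero on
# every drop "livelocks" a network. Message size, loss rate, and trial count
# are illustrative.
import random

random.seed(1)
MSG_PACKETS = 1_000      # packets per message (illustrative)
LOSS = 0.001             # per-packet drop probability (illustrative)

def restart_from_zero() -> int:
    sent = 0
    while True:
        for _ in range(MSG_PACKETS):
            sent += 1
            if random.random() < LOSS:
                break                 # drop: start the whole message over
        else:
            return sent               # made it through without a drop

def resume_from_drop() -> int:
    sent, delivered = 0, 0
    while delivered < MSG_PACKETS:
        sent += 1
        if random.random() >= LOSS:
            delivered += 1            # a drop costs only one resent packet
    return sent

trials = 200
avg_restart = sum(restart_from_zero() for _ in range(trials)) / trials
avg_resume = sum(resume_from_drop() for _ in range(trials)) / trials
print(f"restart-from-zero: ~{avg_restart:.0f} packets sent per message")
print(f"resume-from-drop:  ~{avg_resume:.0f} packets sent per message")
```

Even at this modest loss rate, the restart-from-zero strategy puts noticeably more traffic on the wire, and the gap widens quickly as message size or loss rate grows.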

As the speaker closes out the discussion, he says, “This experiment shows that even with RDMA low latency and high throughput cannot be achieved at the same time as network congestion can cause queues to build up in the network.” Anyone who has done this for a while knows that low latency and high bandwidth are mutually exclusive. That’s why High-Performance Computing (HPC) benchmarks often start with zero-byte packets, then scale up to demonstrate how latency increases proportionately with packet size.

All the above aside, this important question remains: why would anyone map a protocol like RDMA, which was designed for use on a lossless local bus, onto a switched network and think it would work? A local lossless bus is very deterministic, and it has requirements bound to its lossless nature and predictable performance. Conversely, Ethernet was designed from the beginning to expect and accommodate loss, and performance has always been secondary to packet delivery. That’s why Ethernet performance is non-deterministic. Resilience, not performance, was the primary design criterion DARPA mandated to ensure our military’s network would remain functional at all costs.

Soon Solarflare will begin shipping ScaleOut Onload free with all of its 8000 series NICs, some of which sell for under $300 USD. With ScaleOut Onload, TCP gains all the kernel bypass tricks RDMA offers, but with the benefits and compatibility of sockets-based TCP and no code changes. Furthermore, it delivers the performance of RDMA with much better reliability and availability than RoCE.

P.S. Mellanox just informed me that the NIC specific issues mentioned above were corrected some time ago in their ConnectX-4 series cards.

Moving On

Effective June 2, 2017, the primary host for the 40GbE.net blog (including 10, 25, and 50GbE.net) is moving from Blogger (Google) to WordPress. This is being done to facilitate better management of the content and to support a newly spun-up podcast called the Technology Evangelist. Hopefully, you’ll renew your subscription to this blog on WordPress.