Container Networking is Hard

Last night I had the honor of speaking at Solarflare’s first “Contain NY” event. We were also very fortunate to have two guest speakers, Shanna Chan from Red Hat and Shaun Empie from NGINX. Shanna presented OpenShift, then provided a live demonstration where she updated some code, rebuilt the application, constructed the container, and deployed the container into QA. Shaun followed that up by reviewing NGINX Plus, with flawless application delivery and live-action monitoring. He then rolled out and scaled up a website with shocking ease. I’m glad I went first, as both would have been tough acts to follow.

While Shanna and Shaun both made container deployment appear easy, both demonstrations were focused on speed of deployment, not on maximizing the performance of what was being deployed. As we dive into the details of how to extract the most from the resources we’re given, we quickly learn that container networking is hard, and performance networking from within containers is an order of magnitude more challenging still. Tim Hockin, a Google Engineering Manager, was quoted in The New Stack’s eBook “Networking, Security & Storage” as saying, “Every network in the world is a special snowflake; they’re all different, and there’s no way that we can build that into our system.”

Last night, when I asked those assembled why container networking is hard, no one offered what I thought was the obvious answer: we expect to do everything we do on bare metal from within a container, and we expect that the container can be fully orchestrated. That might not sound like a big ask, but when you look at what it takes to achieve performance networking within a host today, it is. Perhaps I should back up: when I say performance networking within a host, I mean kernel bypass networking.

For kernel bypass to work, it typically “binds” the server NIC’s silicon resources fairly tightly to one or more applications running in user space. This tight “binding” is accomplished using one of several common methods: Solarflare’s Onload, Intel’s DPDK, or Mellanox’s RoCE. Each approach has its own pluses and minuses, but that’s not the point of this blog entry. Whichever you use, it is this binding that establishes the fast path from the NIC into host memory. This binding, however, runs counter to what orchestration needs: a level of abstraction between the physical NIC hardware and the application/container, one that can be rewired so containers can easily be spun up, torn down, or migrated between hosts.
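To make that “binding” concrete, here is a minimal sketch of how a bypass library such as Onload is typically engaged with an unmodified application: the library is preloaded so the application’s socket calls are intercepted and serviced by a user-space stack mapped onto the NIC, rather than by the kernel. The application binary name and the bare library filename below are illustrative assumptions, not anything from the original post.

```go
package main

import (
	"log"
	"os"
	"os/exec"
)

func main() {
	// Hypothetical latency-sensitive application we want accelerated.
	cmd := exec.Command("./my_latency_sensitive_app")

	// Preload the bypass library so socket calls are intercepted and the
	// process is bound to the NIC's user-space fast path at launch.
	cmd.Env = append(os.Environ(), "LD_PRELOAD=libonload.so")

	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr
	if err := cmd.Run(); err != nil {
		log.Fatal(err)
	}
}
```

The key point is that the acceleration is tied to a specific process and the specific NIC beneath it at launch time, which is exactly the coupling an orchestrator wants to hide behind an abstraction.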

It is this abstraction layer where we all get knotted up. Do we use an underlay network leveraging MACVLANs or IPVLANs, or an overlay network using VXLAN or NVGRE? Can we leverage a Container Network Interface (CNI) plugin to do the trick? This is the part of container networking that is still maturing. While MACVLANs provide the closest link to the hardware and afford the best performance, they’re a layer-2 interface, and running unchecked in large-scale deployments they could lead to a MAC explosion that causes trouble for your switches. My understanding is that with this layer of connectivity there is no real entry point to abstract MACVLANs into, say, a CNI so one could use Kubernetes to orchestrate their deployment. Conversely, IPVLANs are a layer-3 interface and have already been abstracted into a CNI for Kubernetes orchestration. The real question is: what performance penalty can one observe and measure between a MACVLAN-connected container and an IPVLAN-connected one? That work remains to be done. Stay tuned…
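For readers who want to see the difference in the plumbing, here is a minimal sketch of the kind of sub-interface creation a CNI plugin performs under the hood for the two options above. It assumes the github.com/vishvananda/netlink Go package and a parent NIC named “eth0” (both assumptions on my part), requires root, and omits the step of moving the new links into a container’s network namespace.

```go
package main

import (
	"log"

	"github.com/vishvananda/netlink"
)

func main() {
	// Hypothetical physical parent interface the sub-interfaces hang off.
	parent, err := netlink.LinkByName("eth0")
	if err != nil {
		log.Fatal(err)
	}

	// MACVLAN: a layer-2 sub-interface with its own MAC address, which is
	// what can lead to MAC table pressure on upstream switches at scale.
	macvlan := &netlink.Macvlan{
		LinkAttrs: netlink.LinkAttrs{Name: "macvlan0", ParentIndex: parent.Attrs().Index},
		Mode:      netlink.MACVLAN_MODE_BRIDGE,
	}
	if err := netlink.LinkAdd(macvlan); err != nil {
		log.Fatal(err)
	}

	// IPVLAN in L3 mode: sub-interfaces share the parent's MAC and are
	// distinguished by IP address, the form already wrapped in a CNI
	// plugin for Kubernetes orchestration.
	ipvlan := &netlink.IPVlan{
		LinkAttrs: netlink.LinkAttrs{Name: "ipvlan0", ParentIndex: parent.Attrs().Index},
		Mode:      netlink.IPVLAN_MODE_L3,
	}
	if err := netlink.LinkAdd(ipvlan); err != nil {
		log.Fatal(err)
	}
}
```

Either way, the sub-interface still points at one specific parent device, which is why the performance comparison between the two is the interesting measurement.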
