What are Neural Class Networks?

In 1971 Intel released the 4004. The 4004 was their first 4-bit single core processor, and for the next 34 years, that’s pretty much how x86 computing progressed. Sure they bumped up the architecture and speed as designs and processes improved, but one thing remained constant a single processing engine. Under pressure in the 1990s from Unix workstations driven by Reduced Instruction Set Computing (RISC) with multi-core pipeline architectures like IBM’s PowerPC, Sun’s Ultra-SPARC and the MIPS R3000 Intel began exploring multi-core architectures.

So from 1971 until 2005, every x86 processor had a single core, life was simple. Intel even provided reference designs so system builders could even put two of these processors into the same system. Sure there were fringe companies that developed Symmetrical Multi-Processing (SMP) systems (ex. Sequent & NEC) with more than two CPU sockets, but they were large frame expensive custom servers not found in general mainstream use. So if you wanted to scale out your computational capacity to tackle a tough problem, you had to rack more servers.

This single core challenge largely drove commodity Linux clustering making it all the rage by the turn of the century, particularly in high-performance computing (HPC). It wasn’t uncommon to tightly couple 1,000 or more dual socket single core systems together to tackle a tough computational problem. Government agencies leveraged large clusters to model our nuclear stockpile and computationally secure it. Auto companies crashed dozens of virtual car designs on a daily basis, and oil companies crunched seismic data to compute untapped oil reserves. Then the game changed, and x86 shifted to multi-core processors. Yesterday Intel announced the availability of their new Skylake server platform, but the industry is already refocusing on 2018’s Cascade Lake 32 core, 64 thread, server chip. So why does all this matter?

As a general rule, Google doesn’t publish or confirm the computation capacity of their data centers. If we pick a specific example, their Oregon data center, we can apply particular assumptions and project that it’s roughly 100,000 servers. Again assuming they are using dual socket eight core hyper-threaded processors this translates to 3.2 million parallel threads of computation tightly coupled in one physical location potentially addressing a single problem. Structures like this are quickly approximating what took nature millions of years to develop, an organic brain based on the neuron. Neurons on average have 7,000 connections to both local and remote neurons within their system, that’s a considerable amount of networking per single computational unit.

By contrast, the common cockroach has one million neurons, and that frog you played with as a kid sports 16 million. So if we equate a neuron to a single thread of execution on an x86 that would put a Google data center somewhere between the cockroach and a frog. If in 2018 Google were to upgrade the Oregon facility to Cascade Lakes it would still be only 12.8 million threads, still less than a frog. Given the geometric core growth though it will only be a few decades before Google is deploying data centers approaching the capability of the 86 billion neurons found in the human brain. Oh, and that’s assuming an x86 thread, and a neuron is even computationally similar.

So what is Neural Class Networking? As mentioned above a neuron on average is connected to 7,000 other neurons. Imagine if every hardware thread of execution in your server were networked to even 64 external threads of execution on related systems working on the same problem, that’s the start of neural class networking. Today we have servers with typically 32 threads of execution. Solarflare’s newest generation of XtremeScale Smart NICs provides 2,048 virtual NICs, so each thread of those 32 threads has the capability to sustain 64 dedicated hardware paths to other external threads. That’s the start of Neural Class Networking.

{ Add a Comment }

Capture, Analytics, and Solutions

My first co-op job at IBM Research back in 1984 was to help roll out the IBM PC to the companies best and brightest. It wasn’t long into that position, perhaps a month, when we noticed a large number of monochrome monitors had a consistent burn-in pattern. A horizontal bar across the top, and a vertical bar on the left. Now I was not new to personal computers having purchased my own TRS-80 Model III a year earlier, but it was apparent that a vast number within Research were all being used to do the same thing. So I asked, VisiCalc (MS Excel’s great grandfather).

At the time personal computers were still scarce, and seeing one outside a fortune 100 company was akin to a unicorn sighting, so the subtlety between tool and solution was lost on this 21-year-old brain. Shortly after the cause of the burn-in was understood I had the opportunity to have lunch with one of these researchers and discuss his use of VisiCalc. That single IBM PC in his lab existed to turn experimental raw data that it directly collected into comprehensible observations. He had a terminal in his office for email, and document creation, so this system existed entirely to run two programs. One that collected raw data from his lab equipment, and VisiCalc, which translated that data into understandable information which helped him make sense of our world.

Recently Solarflare began adding Analytics packages to their open network packet capture platform as they ready it for general availability later this month. The first of these packages is Trading Analytics, and it wasn’t until I recently saw an actual demo of this application that the real value of these packages kicked in. The demo clarified for me this value much like the VisiCalc ah-ha moment mentioned above, but perhaps we need some additional context.

Imagine a video surveillance system at a large national airport. Someone in security can easily track a single passenger from the curb to the gate, without ever losing sight of them, now extend that to a computer network where the network packets are the people. The key value proposition of a distributed open network packet capture platform is the capability to gather copies of what happened at various points in your network while preserving the exact moment in time that that data was at that location. Now imagine that data being how a company trades electronically on several different stock exchanges. The capability to visualize the exact moment when an exchange notified you that IBM stock was trading at $150/share, then it updated you every step of the way showing you the precise instant that each of the various bits in your trading infrastructure kicked in. The value of being able to see the entire transaction with nanosecond resolution can then be monetized, especially when you can see this performance over hundreds, thousands or even millions of trades. Given the proper visualization and reports, this data can then empower you to revisit the slow stages in your trading platform to further wring out any latency that might result in lost market opportunities.

So what is analytics? Well, first it’s the programming that is responsible for decoding the market data. That means translating all the binary data contained in a captured network packet into humanly meaningful data. While this may sound trivial, trust me it isn’t, there are dozens of different network protocol formats and hundreds of exchanges worldwide that leverage these formats so it can quickly become a can of worms. Solarflare’s package translates over two dozen different network protocols, for example, NYSE ARCA, CME FIX and FAST into a single humanly readable framework. Next, it understands how to read and use the highly accurate time stamps attached to each network packet representing when those packets arrived at a given collection point. Then it correlates all the network sequence ID numbers and internal message ID numbers along with those time stamps so it can align a trade across your entire environment, and show you where it was at each precise moment in time. Finally, and this is the most important part, it can report and display the results in many different ways so that it easily fits into a methodology familiar to its consumer.

So what differentiates a tool from a solution? Some will argue that VisiCalc and Solarflare’s Trading Analytics are nothing more than a sophisticated set of fancy hammers and chisels, but they are much more. A solution adds significant, and measurable, value to the process by removing the mundane and menial tasks from that process thereby allowing us to focus on the real tasks that require our intellect.

{ Add a Comment }

Podcast on iTunes, Google Play & Stitcher

logopodcastAfter several weeks we now have a critical mass of episodes of The Technology Evangelist Podcast to justify posting it on iTunes, Google Play, and Stitcher. So now you can subscribe from your favorite service or player and stay current. Recently we completed episodes on high performance packet capture, electronic trading, NVMe, and Hadoop.

We’re now scheduling and recording episodes on FPGAs, digital currencies (bitcoin), Intel Skylake, TensorFlow, Containers, the race to zero, and much, much more. If you have ideas for topics or speakers, we always welcome suggestions, so please email The Technology Evangelist or comment to this blog post.

{ 2 Comments }

Will AI Restore HFT Profitability?

225x225bbSeveral years ago Michael Lewis wrote “Flash Boys: A Wall Street Revolt” and created a populist backlash that rocked the world of High-Frequency Trading (HFT). Lewis had publicly pulled back the curtain and shared his perspective on how our financial markets worked. Right or wrong, he sensationalized an industry that most people didn’t understand, and yet one in which nearly all of us have invested our life savings.

Several months after Flash Boys Peter Kovac, a Trader with EWT, authored “Flash Boys: Not So Fast” where he refuted a number of the sensationalistic claims in Lewis’s book. While I’ve not yet read Kovacs book, it’s behind “Dark Pools: The Rise of Machine Traders and the Rigging of the U.S. Stock Market” in my reading list, from the reviews it appears he did an excellent job of covering the other side of this argument.

All that aside, earlier today I came across an article by Joe Parsons in “The Trade” titled “HFT: Not so flashy anymore.” Parsons makes several very salient points:

  • Making money in HFT today is no longer as easy as it once was. As a result, big players are rapidly gobbling up little ones. For example in April Virtu picked up KCG and Quantlab recently acquired Teza Technologies.
  • Speed is no longer the primary driving differentiator between HFT shops. As someone in the speed business, I can tell you that today one microsecond for a network 1/2 round trip in trading is table stakes. Some shops are deploying leading edge technologies that can reduce this to under 200 nanoseconds. Speed as the only value proposition an HFT offers hasn’t been the case for a rather long time. This is one of the key assertions that was made five years ago in the “Dark Pools” book mentioned above.
  • MiFID II is having a greater impact on HFT then most people are aware. It brings with it new, and more stringent regulations governing dark pools and block trades.
  • Perhaps the most important point though that Parsons makes is that the shift today is away from arbitrage as a primary revenue source and towards analytics and artificial intelligence (AI). He down plays this point by offering it up as a possible future.

AI is the future of HFT as shops look for new and clever ways to push intelligence as close as possible to the exchanges. Companies like Xilinx, the leader in FPGAs, Nvidia with their P40 GPU, and Google, yes Google, with their Tensorflow Processing Unit (TPU) will define the hardware roadmap moving forward. How this will all unfold is still to be determined, but it will be one wild ride.

{ Add a Comment }

Container Networking is Hard

ContainerNetworkingisHardLast night I had the honor of speaking at Solarflare’s first “Contain NY” event. We were also very fortune to have two guest speakers, Shanna Chan from Red Hat, and Shaun Empie from NGINX. Shanna presented OpenShift then provided a live demonstration where she updated some code, rebuilt the application, constructed the container, then deployed the container into QA. Shaun followed that up by reviewing NGINX Plus with flawless application delivery and live action monitoring. He then rolled out and scaled up a website with shocking ease. I’m glad I went first as both would have been tough acts to follow.

While Shanna and Shaun both made container deployment appear easy, both of these deployments were focused on speed to deploy, not maximizing the performance of what was being deployed. As one dives into the details of how to extract the most from the resources we’re provided we quickly learn that container networking is hard, and performance networking from within containers is even an order of magnitude more challenging. Tim Hockin, a Google Engineering Manager, was quoted in the eBook “Networking, Security & Storage” by THENEWSTACK that “Every network in the world is a special snowflake; they’re all different, and there’s no way that we can build that into our system.”

Last night when I asked those assembled why container networking was hard no one offered what I thought was the obvious answer, that we expect to do everything we do on bare metal from within a container, and we expect that the container can be fully orchestrated. While that might not sound like “a big ask”, when you look at what is done to achieve performance networking today within the host, this actually is. Perhaps I should back up, when I say performance networking within a host I mean kernel bypass networking.

For kernel bypass to work it typically “binds” the server NIC’s silicon resources pretty tightly to one or more applications running in user space. This tight “binding” is accomplished using one of the at least several common methods: Solarflare’s Onload, Intel’s DPDK, or Mellanox’s RoCE. Each approach has its own pluses and minuses, but that’s not the point of this blog entry. When using any of the above it is this binding that establishes the fast path from the NIC to host memory. The objectives of these approaches, and this “binding” though runs counter to what one needs when doing orchestration, and that is a level of abstraction between the physical NIC hardware and the application/container. This level of abstraction can then be rewired so containers can easily be spun up, torn down, or migrated between hosts.

It is this abstraction layer where we all get knotted up. Do we use an underlay network leveraging MACVLANs or IPVLANs or an overlay network using VXLAN or NVGRE? Can we leverage a Container Network Interface (CNI) to do the trick? This is the part about container networking that is still maturing. While MACVLANs provide the closest link to the hardware and afford the best performance, they’re a layer-2 interface, and running unchecked in large scale deployments they could lead to a MAC explosion resulting in trouble with your switches. My understanding is that with this layer of connectivity there is no real entry point to abstract MACVLANs into say a CNI so one could use Kubernetes to orchestrate their deployment. Conversely, IPVLANs are a layer-3 interface and have already been abstracted to a CNI for Kubernetes orchestration. The real question is what is the performance penalty one can observe and measure between using a MACVLAN connected container and an IPvLAN connected one? All work to be done. Stay tuned…

{ Add a Comment }

Technology Evangelist Podcast

Well after a few weeks of planning and preparation the Technology Evangelist Podcast is finally available. This podcast will focus on bringing the engineers, marketing, and sales folks on the cutting edge of technology to the mic to explain it.

Our first episode features Ron Miller, the CTO of Cloudwick, talking about “Hadoop and Securing Hadoop Clusters.” Ron is an expert in Cyber Security having founded Mirage Networks in 2003. We’re honored to have Ron share with us some background on Hadoop and how one might secure Hadoop clusters.

In our second episode, Mark Zeller joined us to talk about non-Volatile Memory Express (NVMe), and how it will replace spinning disks over the years to come. We touch on the benefits of this technology, talk about erasure coding, and review where the technology is headed. This episode has been recorded and is pending final approval.

Yesterday on Saturday, June 10th, Bob Van Valzah had some time to stop by and discuss Electronic Trading. This episode covers such topics as what is trading, the race to zero, dark pools, and the book Flash Boys.  This episode has also been recorded and is pending final approval.

{ Add a Comment }

Four Container Networking Benefits

ContainerContainer networking is walking in the footsteps taken by virtualization over a decade ago. Still, networking is a non-trivial task as there are both underlay and overlay networks one needs to consider. Underlay Networks like a bridge, MACVLAN and IPVLAN are designed to map physical ports on the server to containers with as little overhead as possible. Conversely, there are also Overlay networks that require packet level encapsulation using technologies like VXLAN and NVGRE to accomplish the same goals.  Anytime network packets have to flow through hypervisors or layers of virtualization performance will suffer. Towards that end, Solarflare is now providing the following four benefits for those leveraging containers.

  1. NGINX Plus running in a container can now utilize ScaleOut Onload. In doing so NGINX Plus will achieve 40% improvement in performance over using standard host networking. With the introduction of Universal Kernel Bypass (UKB) Solarflare is now including for FREE both DPDK and ScaleOut Onload for all their base 8000 series adapters. This means that people wanting to improve application performance should seriously consider testing ScaleOut Onload.
  2. For those looking to leverage orchestration platforms like Kubernetes, Solarflare has provided the kernel organization with an Advanced Receive Flow Steering driver. This new driver improves performance in all the above-mentioned underlay networking configurations by ensuring that packets destined for containers are quickly and efficiently delivered to that container.
  3. At the end of July during the Black Hat Cyber Security conference, Solarflare will demonstrate a new security solution. This solution will secure all traffic to and from containers with enterprise unique IP addresses via hardware firewall in the NIC.
  4. Early this fall, as part of Solarflare’s Container Initiative they will be delivering an updated version of ScaleOut Onload that leverages MACVLANs and supports multiple network namespaces. This version should further improve both performance and security.

To learn more about all the above, and to also gain NGINX, Red Hat & Penguin Computing’s perspectives on containers please consider attending Contain NY next Tuesday on Wall St. You can click here to learn more.

{ Add a Comment }

5 Petabytes of Public Hadoop Data

HadoopSecurityBack in 2015, I sat through a DOE presentation during a government cyber security conference on SCADA (Supervisory Control and Data Acquisition) systems accessible from the web. SCADA is used to allow computers to manage public utilities, water, gas, petroleum refineries, nuclear power plants, etc… The speaker did a live demo using Shodan where he was able to demonstrate something like over 65K open SCADA networks reachable from the Internet. This article backs up the above-mentioned presentation, though the author points out that the maps only show German made SCADA systems. To be more precise the maps show Seimens SCADA controllers, which dominate the market. Most of these systems were for industrial control, and they should have been air-gapped, physically not connected, to ANY external network, let alone the Internet. Last night a friend suggested I read “Hadoop Servers Expose Over 5 Petabytes of Data” which shows that Hadoop clusters are no different.

Guess what? Shodan was leveraged again, but this time to find Internet accessible Hadoop clusters. In aggregate it found clusters containing upwards of 5 Petabytes, which for those without a computer science degree that’s 5 million Gigabytes. The article goes on to mention that over the past year nearly 500 Hadoop systems have been held for ransom. The article then goes on to point out where to go to secure a Hadoop system.  I bring all this up because very soon at Black Hat in July Solarflare will be demonstrating with Cloudwick how we can use the server NIC hardware to directly secure a Hadoop cluster. This can be done without changing a single line of code or altering the Hadoop configuration, stay tuned…

{ Add a Comment }

Near-Real-Time Analytics for HFT

RobotTradingArtificial Intelligence (AI) advances are finally progressing along a geometric curve thanks to cutting edge technologies like Google’s new TenserFlow Processing Unit (TPU), and NVidia’s latest Tesla V100 GPU platform.  Couple these with updated products like FPGAs from Xilinx such as their Kintex, refreshed Intel Purley 32 core CPUs, and advances in storage such as NVMe appliances from companies like X-IO, computing has never been so exciting! Near-real-time analytics for High-Frequency Trading (HFT) is now possible. This topic will be thoroughly discussed at the upcoming STAC Summit in NYC this coming Monday, June 5th. Please consider joining Hollis Beall Director of Performance Engineering at X-IO at the STAC Summit 1 PM panel discussion titled “The Brave New World of Big I/O!”  If you can’t make it or wish to get some background, there is Bill Miller’s Blog titled “Big Data Analytics: From Sometime Later to Real-Time” where he tips his hand at what where Hollis will be heading.

{ Add a Comment }

Four Failings of RoCE

RoCERecently someone suggested that I watch this rather informative video of how Microsoft Research had attempted to make RDMA over Converged Ethernet (RoCE) lossless. Unbelievably this video exposes and documents several serious flaws in the design of RoCE. Also, it appears they’ve replaced the word “Converged” with “Commodity,” to down message that RoCE doesn’t require anything special to run on regular old Ethernet. Here are the four points I got out of the video, please let me know your take:

  • RDMA Livelock: This is a simple problem of retransmitting. Since RDMA was architected for a lossless deterministic local bus architecture accommodations were never made for dropped packets as they just didn’t happen on a bus. Ethernet, on the other hand, was designed to expect a loss, remember vampire taps. Livelock occurs when a message composed of multiple packets experiences a dropped packet somewhere in the middle. At this point, RDMA has to start over from the first packet and retransmit the whole message. If this was a multiple megabyte frame of video, this retransmit approach will Lovelock a network. So what was Microsoft’s solution rewrite the RDMA stack retransmit logic to retransmit on drop detection (this is what TCP does), good luck, who’s got this action item?
  • Programmable Flow Control (PFC) Deadlock: This happens when switches encounter incomplete ARP packets. Microsoft’s solution is a call for more research, and to filter incomplete ARP packets. More to-do’s and this one is on all the switch vendors.
  • NIC PFC Storm: It seems that the firmware in some RoCE NICs has bugs that create Pause Frame storms. Beyond NIC vendors fixing those bugs, they also suggest that NIC and switch vendors include extra new software to detect oncoming storms and shut them down. Great idea, another to-do for the anonymous NIC and switch providers.
  • Slow Receiver NICs which generate excessive pause frames because of their crappy RDMA architecture which relies on a second level host based translation tables so they can fetch the destination memory address. Oh, my god, this is how you design an HPC NIC, seriously, how cheap can you be? Make the lookup tables bigger, seriously, Myricom addressed this problem back the 1990s. It appears on some RoCE NICs that it’s not that hard for the NIC to have so many receivers of kernel bypassed packets that they must go off NIC for the destination memory address lookups.

As the speaker closes out the discussion, he says, “This experiment shows that even with RDMA low latency and high throughput cannot be achieved at the same time as network congestion can cause queues to build up in the network.” Anyone who has done this for a while knows that low-latency and high bandwidth are mutually exclusive. That’s why High-Performance Computing (HPC) tests often start the tests with zero byte packets then scale up to demonstrate how latency increase proportionately to packet size.

All the above aside, this important question remains, why would anyone map a protocol like RDMA, which was designed for use on a lossless local bus, to a switched network and think that this would work? A local lossless bus is very deterministic, and it has requirements bound to its lossless nature and predictable performance. Conversely, Ethernet was designed from the beginning to expect, and accommodate loss, and performance has always been secondary to packet delivery. That’s why Ethernet performance is non-deterministic. The resilience of Ethernet, not performance, was the primary design criteria DARPA had mandated to ensure our military’s network would remain functional at all cost.

Soon Solarflare will begin shipping ScaleOut Onload free with all their 8000 series NICs, some of which sell for under $300 USD. With ScaleOut Onload TCP now has all the kernel bypass tricks RDMA offers, but with all the benefits and compatibility of sockets based TCP, no code changes. Furthermore, it delivers the performance of RDMA, but with much better reliability and availability than RoCE.

P.S. Mellanox just informed me that the NIC specific issues mentioned above were corrected some time ago in their ConnectX-4 series cards.

{ 3 Comments }