Rise of the NPU – Network Processing Unit

In the 1970s Intel brought us the CPU. During the 1990s we saw the evolution of graphics processors, with Nvidia popularizing the term GPU in 1999. Now we’re witnessing the dawn of the Network Processing Unit, or NPU. Much like its multi-core graphics cousin, the NPU is a parallel processing architecture, but one tuned for manipulating network traffic.

This market is still rapidly evolving, and several approaches are progressing along parallel paths. On one end we have Tilera, founded in 2005 by engineers from MIT and Broadcom. Their approach is a many-core one, in which the cores are interconnected to each other and to substantial system I/O by an on-chip network. In the middle we have Myricom, founded in 1994 out of Caltech. Their architecture is multi-core: two buses let an intimate collection of cores share on-chip memory, I/O, and multiple network devices that are more tightly coupled to the wire. Finally, there is the FPGA approach advocated by Napatech, Endace, and Solarflare, where the focus is on using FPGAs to provide well-defined filtering and packet processing. It’s not exactly a processor, but it is an interesting transitional step. Each of these approaches has value, but over the coming decade the market will decide a winner; it always does. First, let’s take a deeper dive into the raw hardware.

Tilera attacks the NPU much more like a GPU than the other two strategies do. They pack up to 72 cores onto a single chip, then leverage a network mesh architecture to connect the cores to each other and to multiple DDR3 controllers, Ethernet SerDes (ports), PCIe buses, and a pair of MiCA encryption acceleration engines. On the high end these 64-bit cores are clocked at 1.0-1.2 GHz, and each typically has an L1 (32 KB) and L2 (256 KB) cache along with three execution pipelines. The mesh fabric has over 100 Tbps of aggregate bandwidth and uses non-blocking cut-through routing at one clock cycle per hop. Four independent 72-bit DDR3 controllers provide a total addressable memory capacity of 1 TB at 1,866 MT/s. On the Ethernet side of this high-end chip, Tilera offers eight 10GbE XAUI interfaces and 32 SGMII ports for legacy 10/100/1000 Ethernet. On the PCI Express side, they use six integrated PCIe controllers, each with four lanes, providing up to 96 Gbps of throughput. Finally, there are two MiCA acceleration engines, which provide encryption support for six popular protocols along with a public key accelerator supporting four more. This is an awesome amount of hardware packed into a single 45 mm square chip. The one-page data sheet doesn’t cover the power it requires, but it’s likely in the 60-80 W range. How, you might ask, could I come to this conclusion? Tilera produces a quad-port 10GbE PCIe card using the 36-core version of this chip, and it consumes 50 W of power, requiring a secondary ATX power connector along with an active fan and heat sink.
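The back-of-envelope arithmetic behind those two figures can be sketched as follows. The per-lane PCIe rate and the split between core power and board overhead are my assumptions, not Tilera’s published numbers:

```python
# Back-of-envelope numbers from the spec figures above. The Gen2 per-lane
# rate and the core-vs-board power split are assumptions, not vendor data.

# PCIe: six controllers x four lanes, ~4 Gb/s effective per Gen2 lane
# (5 GT/s raw, less 8b/10b encoding overhead)
pcie_lanes = 6 * 4
pcie_gbps = pcie_lanes * 4          # 96 Gb/s, matching the data sheet

# Power: the 36-core adapter draws ~50 W for the whole board. Assume
# roughly half of that is the cores, scale the cores linearly to 72,
# and keep the remaining ~25 W of uncore/board overhead fixed.
board_w = 50
core_w = board_w / 2                # assumed split
est_72_core_w = (board_w - core_w) + core_w * (72 / 36)
print(pcie_gbps, est_72_core_w)     # 96 Gb/s and ~75 W, inside 60-80 W
```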

As mentioned, we have Myricom in the middle. Since 1994 Myricom has been producing a family of single-core network processors for its line of programmable network adapters. In late 2005 they introduced their first 10GbE processor, and they’ve turned the crank once since, in early 2008, to produce their current Lanai Z8E. This is a single-core RISC processor clocked at 333+ MHz with 2 MB of on-chip SRAM, two 10GbE XAUI interfaces, and a single eight-lane PCIe controller supporting 8x, 4x, and 1x modes. This summer Myricom is expected to deliver its first multi-core chip, though no details have been officially released. We do know that it will be closer to Intel’s multi-core architecture than to Tilera’s many-core, mesh-interconnected approach. We also know they’re going beyond XAUI to connect more tightly to the wire and further reduce network latency, as this has been a key focus of theirs over the past few years.

Finally, on the hardware side, we have the FPGA crowd, headed up by Napatech, Endace, and Solarflare. Solarflare is the newest entrant in this space, but they appear to have the clearest vision of where this technology can really go. While Napatech and Endace, now an Emulex-owned company, have focused on packet capture and network analytics, Solarflare is focused on the High-Frequency Trading (HFT) market. Solarflare has taken their low-latency 10GbE ASIC and back-ended it with a powerful FPGA. Their intent is to keep fundamental packet processing decisions, such as intelligent filtering and payload normalization, on the network adapter. The filtering is based on some pattern of bytes within not only the header but potentially the payload as well. If a packet makes it past the filter code, they then work to normalize the payload contents so that trade data coming from multiple sources is formatted identically by the time it is passed up to the user-space application.
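To see what filter-then-normalize means in practice, here is a minimal software sketch of the idea. The symbol list and the two toy feed formats are invented for illustration; they are not Solarflare’s actual wire formats:

```python
# Hypothetical sketch of byte-pattern filtering followed by payload
# normalization. Symbols and feed layouts are illustrative only.

WATCHED_SYMBOLS = {b"IBM", b"AAPL"}   # assumed filter patterns

def filter_packet(payload: bytes) -> bool:
    """Pass a packet only if a watched symbol appears in its payload."""
    return any(sym in payload for sym in WATCHED_SYMBOLS)

def normalize(payload: bytes, source: str) -> dict:
    """Rewrite feed data from different sources into one common record.
    Assumes two toy formats: 'A' is 'SYM,PRICE' and 'B' is 'PRICE|SYM'."""
    text = payload.decode()
    if source == "A":
        sym, price = text.split(",")
    else:  # source "B"
        price, sym = text.split("|")
    return {"symbol": sym, "price": float(price)}

# The same trade arriving in two different layouts normalizes identically
a = normalize(b"IBM,191.25", "A")
b = normalize(b"191.25|IBM", "B")
print(a == b)  # True
```

The point of the normalization step is exactly this last comparison: downstream code sees one record shape no matter which exchange feed produced the packet.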

So what challenges lie ahead for those interested in squeezing the most out of these NPUs? Programmatic ones. Tilera’s many-core approach and the wealth of hardware devices on the die are compelling, but frankly, programming it is going to be like herding cats. The programmer will have to attach a pair of cores to each 10GbE port, one to handle receive and the other transmit. If we take the HFT problem Solarflare is working on as an example, we would use three additional layers of cores behind this pair. The first layer handles filtering: say four cores to spread the filtering workload out, each inspecting key strings within payloads for applicable securities symbols. The second layer handles data normalization: another four cores lined up directly with the first layer, altering packet payloads to conform to a predefined standard structure. The third layer is a single core that collects all the packets from the second layer and steers them to the proper user-space memory locations via a PCIe device. So in this simple feed-handler example we’ve used two cores for the 10GbE link, one to interface with the PCIe bus, and eight to handle feed processing, for a total of 11. For a single 10GbE link, that’s a lot of cats to herd. On the opposite side we have the FPGA guys, and we’ve all known for years that programming FPGAs is non-trivial. It has gotten better, but it’s still more magic than engineering. Myricom, being in the middle, is the wild card, as they’ve not yet said publicly how open their processing architecture will be.
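To make that core budget concrete, here is a toy software model of the layered feed handler, with Python threads standing in for cores and queues standing in for the on-chip mesh. The packet format, symbols, and normalization rule are all invented for illustration:

```python
# Toy model of the layered feed handler: four filter workers, four
# normalize workers, and a steering stage, joined by queues that stand
# in for the on-chip mesh. Packet format and symbols are illustrative.
import queue
import threading

STOP = object()                     # end-of-stream sentinel
WATCHED = {b"IBM", b"AAPL"}         # assumed symbols of interest

def run_stage(n_workers, in_q, out_q, fn):
    """Start n_workers threads (stand-ins for cores) applying fn to in_q."""
    def worker():
        while True:
            item = in_q.get()
            if item is STOP:
                in_q.put(STOP)      # re-post so sibling workers also stop
                break
            result = fn(item)
            if result is not None:  # None means the packet was filtered out
                out_q.put(result)
    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for t in threads:
        t.start()
    return threads

def feed_handler(packets):
    rx_q, norm_q, out_q = queue.Queue(), queue.Queue(), queue.Queue()
    # Layer 1: four filter cores drop packets without a watched symbol
    filters = run_stage(4, rx_q, norm_q,
                        lambda p: p if any(s in p for s in WATCHED) else None)
    # Layer 2: four normalization cores rewrite payloads to one format
    normals = run_stage(4, norm_q, out_q, lambda p: p.replace(b",", b"|"))
    for p in packets:               # the 'receive core' feeding the mesh
        rx_q.put(p)
    rx_q.put(STOP)
    for t in filters:
        t.join()
    norm_q.put(STOP)
    for t in normals:
        t.join()
    # Layer 3: one 'steering core' collects results for host delivery
    out = []
    while not out_q.empty():
        out.append(out_q.get())
    return out

print(sorted(feed_handler([b"IBM,191.25", b"MSFT,30.00", b"AAPL,600.10"])))
```

Even in this toy, the coordination burden the article describes is visible: sentinels, queue hand-offs, and join ordering all have to be managed by hand, and that is before any of it is pinned to real cores.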

So the NPUs are here, how will you leverage them to yield a competitive advantage for your enterprise?
