This article was originally published in April of 2012 on 10GbE.net.
At the time of this writing high performance dual port NICs have a street price between $550 & $750, this new hybrid NIC is rumored to cost ten times this. So why would someone even consider this approach? Simple to reduce load on the host CPU cores. The initial pitch is that this hybrid will take on the role of the feed handler. Typically a feed handler runs on several cores of a multi-core server today. It receives trading data from all the relevant exchanges, then filters off all the unwanted information, normalizes what remains and then passes this onto cores running the algorithmic trading code. By freeing up the feed handler cores, through the use of a hybrid NIC, this processing power can be allocated to run more advanced algorithmic codes.
On the surface, the pitch sounds like a great idea. Use a proven low-latency ASIC to pull packets off the wire, send the boring packets on to the OS and the interesting financial stuff to the FPGA. It’s when you get into the details that you realize it’s nothing more than a marketing scheme. When this product was designed I’m sure it sounded like a good idea, most 1U and 2U servers had eight cores and systems were getting CPU bound. As this NIC hits the market though Intel has once again turned the crank and vendors like IBM and HP are now delivering dual socket 16 core, 32 thread servers that will easily pickup the slack. A nicely configured HP DL360P with 16 cores, 32GB memory, etc… is available today for $10K, adding one of these hybrid NICs will nearly double the hardware price of your trading platform. Note, this before you even crack open the SDK and hire the small army of consultants you’ll need to program the FPGA.
Typically we’ve found that the normal packet flow from multiple exchanges into a trading server is roughly 200-300K packets per second, with very rare bursts up to 800K. So if one were to set aside four cores for feed handling, with an average feed load of 250Kpps, and assuming the feeds were evenly distributed each core would have 16 microseconds per packet. On these new 2.2Ghz Intel E5 systems this translates to roughly 8K instructions per packet to filter & normalize. This assumes two threads per core and an average of four clock ticks per instruction.
Like TCP Offload Engines (TOEs) these hybrid NICs sound great when they’re first proposed, but on in-depth analysis and particularly after Moore’s law kicks in, they soon become solutions looking for a problem, a novelty. With Intel’s new E5s why would anyone seriously invest their time hardware & consulting budgets on an outdated approach?
