25/50/100GbE Facts

Several years ago the mega data center and Web 2.0 companies started looking for an alternative to the approved 40GbE standard, the link speed that sits between 10GbE and 100GbE. They viewed the IEEE-approved implementations of both 40G and 100G, which were simply multiple 10G lanes, as very cumbersome. These mega data centers seek to leverage High-Performance Computing (HPC) concepts and want their exascale networks built on fabrics that scale in even multiples. Installing 25GbE (or 50GbE) at the servers, with 50GbE (2x25G) or 100GbE (4x25G) for the switch-to-switch links, is much more efficient. It turns out two groups formed in parallel, the 25 Gigabit Ethernet Consortium and the 2550100 Alliance (that's 25, 50, 100 for those that didn't see it), to develop & promote a single-lane 25GbE specification as the next Ethernet. This approach would then be extended to two 25G lanes for 50G, and four 25G lanes for 100G. It should be noted that today the IEEE, the industry standards body, has not yet ratified a 25GbE standard (when it does it will be referred to as IEEE 802.3by). Once approved, this standard will be used to create Ethernet controller NIC silicon and compatible switch silicon. This work is underway, but won't be completed until sometime in 2016.

The 25 Gigabit Ethernet Consortium was founded by Arista, Broadcom, Google, Mellanox & Microsoft. The 2550100 Alliance, by contrast, counts roughly fifty companies; ten of the more notable are Accton, Acer, Cavium, Finisar, Hitachi Data Systems, Huawei, Lenovo, NEC, QLogic, and Xilinx. Interestingly absent from both lists are key Ethernet product companies: Chelsio, Cisco, Emulex, HP, Intel, and Solarflare. The focus of this piece will be on the Ethernet NIC controller silicon, because if you can't connect the server then the whole discussion is just switch-to-switch interconnects, which is another class of problem for a different forum. Today there appear to be only two general-purpose Ethernet controller NIC chips that support a version of 25/50/100GbE: Mellanox's ConnectX-4 and QLogic's cLOM8514. For NIC silicon to actually be useful, though, it must be delivered on a PCI Express adapter. At this point QLogic has only demonstrated its 25GbE silicon publicly, and has not formally stated when it will produce adapters with the cLOM8514. This means that Mellanox is the solitary vendor shipping a production 25/50/100GbE adapter today.

In the home Wi-Fi networking market, hardware vendors typically race ahead of IEEE standards and produce products to secure "first mover advantage," otherwise known to end users as the bleeding edge. They can only do this because their products are, for the most part, standalone. Enterprise and data center markets are highly interconnected, and shipping a product ahead of the approved IEEE specification is inviting an avalanche of support calls. Today there remain significant open technical issues around 25/50/100GbE such as auto-negotiation, link training, and forward error correction. The IEEE has yet to resolve these, but they are being discussed. At the end-user level, interoperability is the key issue. If a company were to produce a standalone NIC product without an accompanying cable & switch ecosystem, it would be flooded with support requests. The converse is also true: if a company were to build a switch around the Broadcom silicon without offering a bundled server NIC, it too would quickly land in an interoperability quagmire. Those on the bleeding edge would surely come to understand the true meaning of the phrase.

So why haven't the more traditional 10GbE NIC vendors jumped on the 25/50/100GbE bandwagon? Simple: without an approved IEEE standard, the likelihood of profiting from an investment in 25/50/100GbE is fairly low. Today, exclusive of R&D, producing an Ethernet controller NIC chip is a multi-million dollar exercise. So to justify spinning a 25/50/100GbE NIC chip in early 2015 for a "first mover advantage," one would need a credible plan for it to produce well into the tens of millions in revenue. Couple this with the interoperability support nightmare of getting one vendor's NIC working with a second vendor's cable and a third vendor's switch, and any profit that might exist could quickly be consumed.

Enterprise customers want choice, which by definition implies multivendor interoperability based on mature standards. Once the IEEE 802.3by standard (25/50/100GbE) is ratified next year it is expected that all the NIC vendors will begin shipping 25/50/100GbE NIC products.

7/27 Update: Broadcom announced a 25/50GbE NIC controller today.

A Path from GbE to 10GbE

Recently folks have asked how they could squeeze more out of their Gigabit Ethernet (GbE) infrastructures while they work to secure funding for a future 10GbE upgrade. I've been selling 10GbE NICs for 10 years and blogging for the past six. What I've learned as a former system architect, IT manager, and server salesperson, and now in network sales, is that the least painful method for making the transition is to demonstrate the payback to your management in stages. First, I'd upgrade my existing servers with 10GbE adapters but run them on my existing GbE network, to demonstrate that I was pushing that infrastructure to its full potential. It's very likely that your existing multi-core servers are sometimes more CPU bound than bandwidth bound. Also, you may have some extra top-of-rack switch ports you can leverage. There are several interesting tricks worth considering. The first is to move to a current 10GbE controller, one that also supports GbE (1000Base-T is the formal name for GbE over the familiar RJ-45 modular connector). If this still doesn't give you the performance bang you're seeking, then you can consider testing an operating system bypass (OS Bypass) network driver.

Upgrading from the generic GbE port mounted on your server's motherboard to a PCI Express option card with dual 10G ports means you're moving from GbE chip technology designed 15 years ago to, very possibly, a state-of-the-art 10G Ethernet controller designed in the past year or two. As mentioned in other posts like "Why 10G" and "Four Reasons Why 10GbE NIC Design Matters," some of today's 10GbE chips internally offer thousands of virtual NIC interfaces, highly intelligent steering of network traffic to CPU cores, and a number of advanced stateless packet-processing offloads (meaning that more work is done on the NIC that would otherwise have to be done by your Intel server CPUs). Much of this didn't exist when your server's GbE chip was initially designed back in 2000. So what is the best way to make the jump?

There are two methods to plug your existing RJ-45 terminated networking cables into new 10GbE server-class NICs. The first, and easiest, is to use a native dual-port 10GBase-T card that also supports GbE, like Solarflare's SFN5161T, which runs roughly $420. The second approach, which provides a much better path to 10GbE, is to use a dual-port SFP+ card like Solarflare's SFN7002F with a pair of 1000Base-T modules. In this case the adapter is $395, and each module is roughly $40 (be careful here, because numerous modules offered as Cisco are often just "compatible"). When you get around to migrating to 10GbE, both approaches will require new switches and very likely new network wiring. The 10GBase-T standard, which uses the familiar RJ-45 networking connector, will require that you move to the more expensive Cat6 cabling, and these switches are often more expensive and draw more power. If you have to rewire with Cat6, then you should seriously consider using passive DirectAttach (DA) cables with bonded SFP+ connectors, which start at $20-$25 for 0.5-2M lengths. By the time your network admin custom-makes the Cat6 cables for your rack it'll likely be a break-even expense (especially when you have to spend time diagnosing bad or failing cables). DA cables should be considerably more trouble-free over time; frankly, 10GBase-T really pushes the limits of both Cat6 cables and RJ-45 connectors.
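Once the new adapter is in place, a quick sanity check that a 10GbE port has come up at GbE against your existing switch can be done with the standard Linux ethtool utility (a minimal sketch; the interface name is a placeholder and the exact output varies by driver):

ethtool eth2                      # show negotiated speed, duplex, and link detection
ethtool -s eth2 advertise 0x020   # if needed, advertise only 1000baseT/Full (0x020) so the link settles at GbE

Later, when your 10GbE switches arrive, re-advertising the adapter's full set of speeds lets the very same port renegotiate at 10G with no hardware change.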

Another thing to consider is leveraging an OS Bypass layer like Solarflare's OpenOnload (OOL) for network-intense applications like Nginx, Memcached, and HAProxy. We saw that OOL delivered a 3X performance gain over running packets through the Linux OS, which was documented in this whitepaper. In the testing for that whitepaper, we found that serving Nginx content from memory typically took six cores to keep up with a full 10G link; running OOL it required only two. Turning this around a bit, with OOL on a dual-port 10G card you should only need roughly four cores to serve static in-memory content at wire-rate 10G to both ports. So suppose you have an eight-core server today with a pair of GbE links, and during peak times it's typically running near capacity. By upgrading to a Solarflare adapter with OOL, still just utilizing both 10G ports as GbE ports, you could easily be buying yourself back some significant Intel CPU cycles. The above deserves a very serious "your mileage may vary" caveat, but if you're interested in giving it a try in your lab, Solarflare will work with you on a Proof of Concept (POC). It should be noted that adding OOL to an SFN7002F adapter will roughly double the price of the adapter, but compare that additional few hundred dollars of 10G software expense to the cost of replacing your server with a whole new one: installing all new software, perhaps additional software licenses, configuration, testing, etc. Replacing the NIC and adding an OS Bypass layer like OOL is actually pretty quick, easy & painless.
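For reference, trying OOL doesn't require any application changes; once the drivers are installed an application is simply launched under the onload wrapper (a minimal sketch, assuming OpenOnload is installed and your existing Nginx configuration is left as-is):

onload nginx -g 'daemon off;'     # run Nginx with its sockets accelerated by OpenOnload
LD_PRELOAD=libonload.so nginx     # equivalent approach for init scripts that can't use the wrapper

Anything OOL chooses not to accelerate simply falls back to the kernel, so the application behaves exactly as before, just with far fewer trips through the OS.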

If you’re interested in kicking off a GbE to 10GbE POC please send us a brief email.

Servers Can Protect Themselves From a DDoS Attack

Solarflare is completing SolarSecure Server Defense, a Docker container housing a state-of-the-art threat detection and mitigation system. This system dynamically detects new threats and updates the filters applied to all network packets traversing the kernel network device driver, in an effort to fend off future attacks in real time without direct human intervention. To do this Solarflare has employed four technologies: OpenOnload, SolarCapture Live, the Bro Network Security Monitor, and the SolarSecure Filter Engine.

OpenOnload provides an OS Bypass path that shunts copies of all packets making it past the current filter set over to SolarCapture. SolarCapture provides a libpcap framework for packet capture, which then hands these copied packets off to Bro for analysis. Bro applies a series of scripts to each packet, and if a script detects a hit it raises an event. Each class of event then triggers a SolarSecure Filter Engine script, which creates a new network packet filter. This filter is loaded in real time into the packet filter engine of the network adapter's kernel device driver, to be applied to all future network packets. Finally, Server Defense can alert your admins as new rules are created on each server across your infrastructure.
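To make that detect-then-filter loop concrete, here is a rough, hypothetical analogue built from ordinary open tools (tcpdump, awk, and iptables) rather than the actual OpenOnload/SolarCapture/Bro/Filter Engine components; it counts SYN-only packets per source over a capture window and then installs a drop rule for any source that exceeds a threshold:

tcpdump -i eth2 -nn -c 10000 'tcp[tcpflags] & (tcp-syn|tcp-ack) == tcp-syn' 2>/dev/null \
  | awk '{split($3,a,"."); print a[1]"."a[2]"."a[3]"."a[4]}' \
  | sort | uniq -c | awk '$1 > 500 {print $2}' \
  | while read src; do iptables -A INPUT -s "$src" -j DROP; done

The interface name, packet count, and 500-SYN threshold are all placeholders. The real product runs continuously, pushes the resulting filters down into the adapter's kernel driver rather than netfilter, and uses Bro's far richer event scripting instead of a simple counter, but the shape of the loop is the same: observe, classify, generate a filter, apply it.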

SolarSecure Server Defense inspects all inbound, outbound, container-to-container, and VM-to-VM packets on the same physical server, and filters are applied to every packet. This uniquely positions Server Defense as the only containerized cyber defense solution designed to protect each individual server, VM, or container within an enterprise from a wide class of threats, ranging from a simple SYN flood to a sophisticated DDoS attack. Even more compelling, it can defend against attacks originating from inside the same physical network, behind your existing perimeter defenses. It can even defend one VM from an attack launched by another VM on the same physical server!

To learn more please contact Scott Schweitzer at Solarflare.

3X Better Performance with Nginx

Recently Solarflare concluded some testing with Nginx that measured the amount of traffic Nginx could respond to before it started dropping requests. We then scaled up the number of cores provided to Nginx to see how additional compute resources impacted the servicing of web page requests, and this is the resulting graph:

[Graph: HTTP requests served versus the number of CPU cores allocated to Nginx, comparing a Solarflare adapter with OpenOnload against kernel-driver NICs]

As you can see from the graph, most NIC implementations require about six cores to achieve 80% wire-rate. The major difference highlighted here, though, is that with a Solarflare adapter and its OpenOnload OS Bypass driver you can achieve 90% wire-rate performance utilizing only two cores instead of six. Note that this comparison is against Intel's most current 10G NIC, the X710.

What's interesting here, though, is that OpenOnload can internally bond together up to six 10G links before a configuration file change is required to support more. This could mean that a single 12-core server running a single Nginx instance should be able to adequately service 90% wire-rate across all six 10G links, or theoretically 54Gbps of web page traffic. Now, of course, this assumes everything is in memory and the rest of the system is properly tuned. Viewed another way, this is 4.5Gbps/core of web traffic serviced by Nginx running with OpenOnload on a Solarflare adapter (54Gbps over 12 cores), compared to roughly 1.4Gbps/core with an Intel 10G NIC (80% of 10G spread across six cores). That is a 3X gain in performance for Solarflare over Intel. How is that possible?

Simple: OpenOnload is a user-space stack that communicates directly with the network adapter in the most efficient manner possible to service UDP & TCP requests. The latest version of OpenOnload has also been tuned to address the C10K problem. What's important to note is that by bypassing the Linux OS to service these communication requests, Solarflare reduces kernel context switches and memory copies on every core, and can make more effective use of the processor cache. All of this translates to more available cycles for Nginx on each and every core.
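If you're curious whether an application really is riding on the user-space stack rather than the kernel, OpenOnload ships with a stack inspection utility you can run alongside it (a quick check, assuming OOL is installed; exact subcommands and output vary by release):

onload_stackdump           # list the active Onload stacks and the processes using them
onload_stackdump lots      # dump detailed per-stack state and statistics

If your Nginx workers show up against an Onload stack, the bypass is in effect; if the list is empty, the sockets are still being serviced by the kernel.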

To further drive this point home we did an additional test showing the performance gains OOL delivered to Nginx on 40GbE. Here you can see that the OS limits Nginx on a 10-core system to servicing about 15Gbps. With the addition of just OpenOnload to Nginx, that number jumps to 45Gbps. Again, a 3X gain in performance.

If you have web servers today running Nginx, and you want to give them a gargantuan boost in performance, please consider Solarflare and their OpenOnload technology. Imagine taking an existing web server that has been running on a single Intel X520 dual-port 10G card, replacing that with a Solarflare SFN7122F card, installing the OpenOnload drivers, and seeing a 3X boost in performance. This is a fantastic way to breathe new life into installed web servers. Please consider contacting Solarflare today to do a 10G OpenOnload proof of concept so you can see these performance gains for yourself firsthand.

Beyond Gigabit Ethernet

Where wired connections can be made, they will always provide superior performance to wireless techniques. Since the commercialization of the telegraph over 175 years ago, mankind has been looking for ever faster ways to encode & transfer information. The wired standard we're all most familiar with today is Gigabit Ethernet (GbE). It runs throughout your office to your desktop, phone, printers, copiers, and wireless access points. It is the most pervasive method in the enterprise for reliably linking devices. So what's next?

Two weeks ago, if you'd asked most technology professionals, they would have answered 10 Gigabit Ethernet (10GbE). That was the commonly accepted plan. Then Cisco, Aquantia, Freescale & Xilinx announced an alliance to further develop & promote a proposed next-generation (NBase-T) wired standard supporting 2.5GbE & 5GbE speeds over existing installed Category 5e & 6 cables. We all know Cisco, and that's enough to get pretty much everyone's attention, but who are the other three? Aquantia is one of the leaders in producing the physical interface (PHY) chips that sit at both ends of the wire. Switch companies like Cisco use Aquantia, as do network interface card companies like Solarflare, Intel, and Chelsio. Aquantia has figured out how to take digital information and encode it into electrical signals designed to travel at very high speeds through very noisy wires; on the other end, their chips have the smarts to find the signal within the vast amount of noise created by the wires themselves. Freescale & Xilinx are a bit further up the food chain; they make more programmable chips that can be positioned between Aquantia's PHYs and Cisco's switch logic, or the Intel processor in your computer.

So why did Cisco push to form the NBase-T Alliance, and what does it gain from this investment? It turns out that improvements in wireless networking are behind this, and Cisco has a large wireless business. In commercial environments, wireless access points now use a wider range of frequencies in parallel so they can service more of our wireless devices. These access points are pushing the limits of what GbE is capable of on the back end. Since most enterprises are already wired with Cat5e or Cat6, rewiring to support 10GbE would be very expensive. Hence the drive towards NBase-T.

The question, though, is what about performance desktop users? Folks doing video editing, simulation, or anything else data intensive could easily push well beyond GbE. We're now starting to see Apple & others ship 4K-resolution desktop computers and displays. These devices can be huge data consumers. What's the plan for supporting them beyond GbE? The answer still appears to be 10GbE, but time will tell.

Your Server as the Last Line of Cyber Defense

Here is an excerpt from an article I wrote for Cyber Defense Magazine that was published earlier today:

Since the days of medieval castle design, architects have cleverly engineered concentric defensive layers, along with traps, to thwart attackers and protect the stronghold. Today many people still believe that the moat was a water obstacle designed to protect the outer wall, when in fact it was often inside the outer wall and structured as a reservoir to flood any attempt at tunneling in. Much like those kingdoms of old, companies today are leveraging similar design strategies to protect themselves from Internet attackers.

The last line of defense is always the structure of the wall and the guards of the castle keep itself. Today the keep is your network server, which provides customers with web content, partners with business data, and employees with remote access. All traffic that enters your servers comes in through a network interface card (NIC). The NIC represents both the wall and the guards for the castle keep. Your NIC should support a stateless packet-filtering firewall application that is authorized to drop all unacceptable packets. By operating within both the NIC and the kernel driver, this software can drop packets from known Internet marauders, rate limit all inbound traffic, filter off SYN floods, and only pass traffic on acceptable ports. By applying all of these techniques your server can be far more available for your customers, partners, and employees.
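As a simplified illustration of those techniques, here is roughly what the same policy looks like expressed as ordinary Linux iptables rules; the filters described in the article run in the NIC and its kernel driver rather than in netfilter, and the addresses and ports below are placeholders:

iptables -P INPUT DROP                                                    # default: drop anything not explicitly allowed
iptables -A INPUT -s 203.0.113.0/24 -j DROP                               # drop a known bad source range
iptables -A INPUT -m conntrack --ctstate ESTABLISHED,RELATED -j ACCEPT    # allow packets belonging to existing connections
iptables -A INPUT -p tcp --syn -m multiport --dports 80,443 \
         -m limit --limit 25/second --limit-burst 100 -j ACCEPT           # rate-limited new connections, acceptable ports only

Excess SYNs and traffic to any other port fall through to the default drop, which is the same effect a NIC-resident filter achieves, only without the packets ever touching the host's network stack.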

For the rest of the article, with several cool sections of code that explain how to protect your server, please visit Cyber Defense Magazine.

Building an Inexpensive Performance Packet Generator

You know that nice feeling you get when someone surprises you with a feature you weren't expecting, but that totally changes the way you use something? Like when your son told you about that free HBO app, so you can now watch HBO on your iPad. Well, recently Solarflare released an update to SolarCapture Pro (SCP V1.3) with just such a feature: replay.

On the surface replay sounds rather humdrum: you can replay libpcap files out to an Ethernet interface. So for the uber-nerds out there, yes, you can plug this into Ostinato and make a poor man's high-performance Ixia for under $2,250 (you'll need an SFN7122F & the SFS-SCP software) plus the cost of your server.

When you consider that Solarflare provides the highest-performance network adapters currently available for both 10GbE & 40GbE, this replay feature could be extremely powerful. For example, someone could load a server up with memory, stage one or more very large libpcap files, then use the following command to blast them into their network at wire-rate.

solar_replay pps=1.5e6 prebuffer repeat=512 eth2=play1.pcap eth3=play1.pcap

In this example replay will sustain 1.5 million packets per second (Mpps); note that this rate can be as high as 14.8Mpps if your pcap file is all small packets. Before the replay actually starts, play1.pcap will be "prebuffered," meaning it is loaded into memory first so that disk performance won't be a factor in the playback. The replay will then loop 512 times. Finally, it will replay the same buffer out both ports on the adapter, eth2 & eth3, whether they are 10GbE or 40GbE.

So what will this look like? Simple: a storm of packets on two interfaces that are hopefully attached to different switches in your infrastructure. Note that the packet rate is actually limited by the size of the packets.
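If you want to watch that storm arrive from a receiving host, a standard utility such as sar (from the sysstat package) will report per-interface packet and byte rates once a second; the rxpck/s column is the one to watch:

sar -n DEV 1

Interface names and counters are whatever your receiving hosts expose, but it's an easy way to confirm you're actually sustaining the rate you asked solar_replay for.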

Additionally, you can pin the replay to specific cores, increase the number of buffers, adjust port & time ranges of what you want to replay from the pcap files, and throttle the rate to a multiple of the initial capture speed.

This is by far the most advanced replay capability available today on an ASIC-based network adapter. SolarCapture is extremely powerful, and this example just scratches the surface of what it is capable of.

If you’re interested in taking SolarCapture out for a test drive, or just want to learn more feel free to contact me, or reach out directly to Solarflare.

Towards a More Secure Network Interface (SNI)

Many of the objects in our lives are Internet connected. Everything from watches to home thermostats, refrigerators & even septic systems is now wired to the Internet. All of these devices carry a certain expectation of trust when they connect to the Internet, and unfortunately therein lies the fundamental flaw. This "trust everything" model is inherent in nearly all network-connected hardware that individuals & corporations deploy, with the specific exception, of course, of security appliances.

Why do our networks work this way? Because it's easier for hardware engineers to assume trust than to require authentication. Take for example your car: it has hundreds of systems & sensors that are all interconnected. There is an assumed level of trust by every device that makes up your vehicle, because the automaker believed they controlled everything. Now suppose you're driving along at, say, 60MPH, and I were to reach in through your OnStar link & activate the ABS on the right side of the vehicle. How's that trust working for you now? Don't laugh, I'm serious. Automobile manufacturers are all facing this issue today thanks to several well-publicized hacks last summer.

Can you board a major airline in the US by simply walking into the airport, traversing the terminal, then boarding the plane? No. At a minimum, you have to go through a Transportation Security Administration (TSA) checkpoint, followed by a second, very simplistic validation of your ticket at the gate. The TSA, in essence, is a packet filter, where you are the packet. They look at you, check your ID, run you through a millimeter-wave scanner & your stuff through an X-ray, and if all this passes muster you're permitted to proceed.

Suppose there were a very bright, tiny TSA agent living just inside your computer, supervising your connection to the Internet and checking every bit of data coming in. This tiny TSA agent, seeing everything, applies some basic sanity checks to your inbound data; let's call this capability a Secure Network Interface (SNI). Here are some examples of the types of tests that this SNI might execute before allowing information to be handed off to your applications or operating system (a rough sketch of what such checks look like in practice follows the list):

  • Is the data coming from somewhere or someone I trust?
  • Is it coming in specifically to the application I know & trust?
  • Is this a request that I find acceptable?
  • Is there anything in the request I might find objectionable?

Today corporate networks rely on firewalls and other advanced filtering & security hardware to set up a demilitarized zone (DMZ) for all their Internet-facing servers. They then set up a second set of hardware firewalls with more restrictive rules to further protect internal systems & servers. Finally, we have the laptops, desktops & production servers; many of these also run software firewalls that do some basic network traffic filtering. Think of them as each having that gate agent checking your data just before you need it. This software firewall approach is flawed by design, because the offending network traffic has already entered your system and has had access to your device drivers and low-level OS stack functions. Imagine if the TSA only existed at the gate to your plane. Think of all the other doors & passages that would remain unprotected.

Imagine if every server had an SNI: actual hardware at the edge of your server or high-end workstation. Your network administrators could then explicitly and logically connect systems to each other, and the appropriate users to those systems, through each of these SNI-protected machines. The default would be that all outsiders are ignored, so if your network perimeter were breached the way Target's was last fall, it wouldn't make any difference. No logical connections would exist between, say, the unsecured HVAC system (yes, the thieves broke in through the server that controlled the AC) and any of the corporate servers. That HVAC system would only be known to the VPN server; all other servers would shun its existence, because the default action in their SNI would be deny. If you weren't on the approved IP list to connect with a given server, you'd be out of luck.

So does a Secure Network Interface (SNI) exist today? Yes. Solarflare has a brand-new software product called SolarSecure that installs a high-performance packet filter in the silicon of the server network adapter. For now, you can click on this link to learn more. In the near future, another blog entry will explain the amazing capabilities of this exciting new technology.

Crash and Boom: Inside the 10GbE Adapter Market

It may be hard to believe, but we’re coming up on ten years with 10GbE as an adapter option for servers and workstations.
In 2003 the first 10GbE network adapters based on a new breed of chips hit the market, and by 2006 the list had grown to include nearly twenty vendors (AdvancedIO, Broadcom, Chelsio, Intel, Emulex, Endace, Mellanox, Myricom, Napatech, NetEffect, Neterion, NetXen, QLogic, ServerEngines, SMC, Solarflare, Teak Technologies, and Tehuti Networks).
Designing & building a 10GbE ASIC is not a cheap undertaking; even on a shoestring budget it could easily run $7-10M for that first working chip. Some of these companies never made it past that initial functional 10GbE controller chip. The combined efforts above represent nearly one-quarter of a billion dollars spent to launch the 10GbE adapter market. To remain in this market long term… the full article is published over on HPCWire.

VMA – Voltaire Messaging Abandoned

This morning Mellanox announced that they are releasing the Voltaire Messaging Accelerator (VMA) as open source. Tom Thirer, the director of product management at Mellanox, said: "By opening VMA source code we enable our customers with the freedom to implement the acceleration product and more easily tailor it to their specific application needs." He then followed this up with "We encourage our customers to use the free and open VMA source package and to contribute back to the community." Now, to be fair, I work for a company that has been selling 10GbE NICs, along with delivering & supporting a competing open source kernel bypass stack, to customers for over five years.

So what does moving VMA into open source mean for Mellanox's customers who run their business on systems that use VMA in production? Well, any problems or issues you have now, or will ever have in the future, with VMA are now your problems, and you get the privilege of fixing them.

Open source is a great method for rapidly advancing a code base with broad appeal. We all know and love Linux, the perceived shining star of the open source community; it runs on everything from a $60 Raspberry Pi to IBM's System z mainframes. Open source works very well when there is significant interest in, and demand for, what the code offers. Mellanox's VMA isn't Linux; it's a very specific network driver that runs on only one company's network chips in a very niche set of markets. One of the main reasons Mellanox acquired Voltaire in 2011 for $208M was to gain control of VMA, one of the few unique features of Voltaire's product line. Ever since then Mellanox has been trying to stabilize the code base, reduce the jitter (unpredictable delays that can paralyze low latency systems), and exterminate some very pesky bugs. Those bugs, and the support issues attached to them, are the driving reason why Mellanox is now giving the source code away to the open source community.

Some might argue that they're doing the financial services, HPC, and Web 2.0 markets a huge favor by "donating" this code to the community. But Mellanox is a business: they spent many millions acquiring VMA in 2011, and likely much more over the past two years to further develop & maintain it. You don't just jettison an expensive piece of code because you want to give your customers "the freedom to implement the acceleration product and more easily tailor it to their specific application needs."

It's been known in the industry for at least six weeks that Mellanox was going in this direction; in fact, the source code has been on Google Code since August 12. So who's contributed changes? Well, Mellanox has, over 30 times in fact, in order to get ready for this announcement. This is big news, so how many people are following the code? Three, and two of them are the Mellanox employees who have submitted the code fixes, all but one submitted by the same employee. How about the discussion list? Perhaps users are commenting there. Nope, it's empty.

Finally, if Mellanox were serious about VMA moving forward, there would be one or more courses on this product in the Mellanox Academy; today there are zero! Check out the course catalog for yourself. If the catalog isn't enough to convince you that Mellanox's focus is on InfiniBand, then let's follow the numbers and look at their most recent financials. Toward the end of their last quarterly SEC 10-Q filing you'll see that Ethernet made up only 14% of their revenue, while FDR, QDR & DDR InfiniBand combined make up over 80%. Mellanox is InfiniBand, and more importantly, InfiniBand is Mellanox.

Now Mellanox has said that they will still provide a binary version of VMA that they will support, but they’ve not publicly stated what that support contract will cost.