Data Types, Computation & Electronic Trading

January 31, 2021January 31, 2021 scottcschweitzer AI, FPGA, HFT, networking, Performance

I’ll return to the “Expanded Role of HPC Accelerators” in my next post, but before doing so, we need to take a step back and look at how data is stored to understand how best to operate on and accelerate that data. When you look under the hood of your car, you’ll find that there are at least six different types of mechanical fasteners holding things together.

Artificial Intelligence and Electronic Trading

Hoses often require a slotted screwdriver to remove, while others a Phillips’s head. Some panels require a star-like Torx wrench while others a hexagonal Allen, but the most common one you’ll find are hexagonal headed bolts and nuts in both English and Metric sizes. So why can’t the engineers who design these products select a single fastener style? Simple, each problem has unique characteristics, and these engineers are often choosing the most appropriate solution. The same is true for data types within a computer. Ask almost anyone, and they’ll say that data is stored in a computer in bytes, just like your engine has fasteners.

Computers process data in many end-user formats, from simple text and numbers to sounds, images, video, and much more. Ultimately, it all becomes bits organized, managed, and stored as bytes. We then wrap abstraction layers around these collections of bytes as we shape them into numbers that our computer can then process. We know that some numbers are “natural,” they’re also called “integers,” meaning they have no fractional component, for example, one, two, and three. Simultaneously, other “real” numbers contain a decimal point that can have anywhere from one to an infinite collection of digits to the decimal’s right. Computers process data using both of these numerical formats.

Textual data is relatively easy to grasp as it is often stored as an integer without a positive or negative sign. For example, the letter “A” is stored using a standard that has assigned it the value 65, which as a single byte is “01000001.” In Computer Science, how a number is represented can be just as important as the number itself’s value. One can store the number three as an integer, but when it is divided by the integer value two, some people would say the result is 1.5, but they could be wrong. Depending on the division operator used, some languages support several, and the data type assigned to the product there could be several different answers, all of which would be correct within their context. As you move from integer to real numbers or, more specifically, floating-point numbers, these numerical representations expand considerably based on the degree of precision required by the problem at hand. Today, some of the more common numerical data types are:

Integers, which often take a single byte of storage, and can be signed or unsigned. Integers can come in at least seven different distinct lengths from a nibble stored as four bits, a byte (eight bits), a half-word (sixteen bits), word (thirty-two bits), double word (sixty-four bits), octaword (one hundred and twenty-eight bits), and an n-bit value.
Half Precision floating-point, also known as FP16 is where the real number is expressed in 16 bits of storage. The numerical value is stored in ten bits; then there are five bits for the exponent and a single sign bit.
Single Precision, also known as floating-point 32 or FP32, for 32 bits. Here we have 23 bits for the fractional component, eight for the exponent, and one for the sign.
Double Precision, also known as floating-point 64 or FP64, for 64 bits. Finally, we have 52 bits for the fractional component, 11 for the exponent, and one for the sign.

There are other types, but these are the major ones. How computational units process data differs broadly based on the architecture of the processing unit. When we talk about processing units, we’re specifically talking about the Central Processing Unit (CPUs), Graphical Processing Unit (GPUs), Digital Signal Processor (DSPs), and Field Programmable Gate Array (FPGAs). CPUs are the most general. They can process all of the above data types and much more; they are the BMW X5 sport utility vehicle of computation. They’ll get you pretty much anywhere you want; they’ll do it in style and provide good performance. GPUs are that tricked out Jeep Wrangler Rubicon designed to crawl over pretty much anything you can throw at it while still carrying a family of four. On the other hand, DSPs are dirt bikes, they’ll get one person over rough terrain faster than anything available, but they’re not going to shine when they hit the asphalt. Finally, we have FPGAs, and they’re the Dodge Challenger SRT Demon in the pack; they never leave the asphalt; they just leave everything else on it behind. All that to say that GPUs and DSPs are designed to operate on floating-point data, while FPGAs do much better with integer data. So why does this matter?

Every problem is not suited to a slot headed fastener; sometimes you’ll need a Torx, while others a hexagonal headed bolt. The same is true in computer science. If your system’s primary application is weather forecasting, which is floating-point intense, you might want vast quantities of GPUs. Conversely, if you’re doing genetics sequencing, where data is entirely integer-based, you’ll find that FPGAs may outperform GPUs, delivering up to a 100X performance per watt advantage. Certain aspects of Artificial Intelligence (AI) have benefited more from using FP16 based calculations over using FP32 or FP64. In this case, DSPs may outshine GPUs in calculations per watt. As AI emerges in key computational markets moving forward, we’ll see more and more DSPs applications; one of these will be electronic trading.

Today the cutting edge of electronic trading platforms utilize FPGA boards that have many built-in ultra-high-performance networking ports. These boards, in some cases, have upwards of 16 external physical network connections. Trading data and orders into markets are sent via network messages, which are entirely character; hence integer, based. These FPGAs contain code blocks that rapidly process this integer data, but computations slow down considerably as some of these integers are converted to floating-point for computation. For example, some messages use a twelve-character format for the price where the first six digits are the whole number, and the second six digits represent the decimal number. So, a price of $12.34 would be represented as the character string “000012340000.” Other fields also use twelve-character values for a number, but the first ten digits are the whole number, and the last two the decimal value. In this case, 12,572.75 shares of a stock would be represented as “000001257275.” Now, of course, doing financial computations maintaining the price or quantity as characters is possible; it would be far more efficient if each were recast as single-precision (FP32) numbers. Then computation could be rapidly processed. Here’s where a new blend of FPGA processing, to rapidly move around character data, and DSP computing for handling financial calculations using single precision math will shine. Furthermore, DSP engines are an ideal platform for executing trained AI-based algorithms that will drive financial trading moving forward into the future.

Someday soon, we’ll see trading platforms that will execute entirely on a single high-performance chip. This chip will contain a blend of large blocks of adaptable FPGA logic; we’re talking millions of logic tables, along with thousands of DSP engines and dozens of high-performance network connections. This will enable intelligent trading decisions to be made and orders generated in tens of a billionth of a second!

SmartNICs and SmartSSDs, the Future of Smart Acceleration

September 25, 2020September 25, 2020 scottcschweitzer Accelerators, FPGA, networking

For the past three years, I’ve been writing about SmartNICs. One of my most popular blog posts is “What is a SmartNIC” from July 2017, which has been read over 6,000 times. This year, for the second time, I’ve presented at the Storage Developer Conference (SDC). The title for this blog post was also the title of my breakout session video, which ran for 50 minutes, and went live earlier this week. Here is the abstract for that session:

Since the advent of the Smart Phone over a decade ago, we’ve seen several new “Smart” technologies, but few have had a significant impact on the data center until now. SmartNICs and SmartSSDs will change the landscape of the data center, but what comes next? This talk will summarize the state of the SmartNIC market by classifying and discussing the technologies behind the leading products in the space. Then it will dive into the emerging technology of SmartSSDs and how they will change the face of storage and solutions. Finally, we’ll dive headfirst into the impact of PCIe 5 and Compute Express Link (CXL) on the future of Smart Acceleration on solution delivery.
—Scott Schweitzer, The Technology Evangelist, Xilinx, Sept 2020

In that talk, which has been seen by over 100 people in just the first 24-hours alone on YouTube (I’m told this doesn’t include conference attendees), I shared much of what I’ve learned over the past few months while producing the following new items on SmartNICs:

My fourth Electronic Design (ED) article on September 13th is titled “How PCIe 5 with CXL, CCIX, and SmartNICs Will Change Solution Acceleration.” This piece touches on some of the new high-level protocols which will make Smart Accelerators even more performant.
Hosting an IEEE Hot Interconnects Panel on August 19th titled: “SmartNICs vs. DPUs, Who Wins?” Over 225 people originally attended this talk, and the video has since been viewed another 300 times. We were fortunate enough to secure critical people from all the major SmartNIC companies, except Intel, for an extremely lively discussion. I also did a blog post on this last month.
The third article for ED on SmartNICs published on July 10th with the title “SmartNIC Architectures: A Shift to Accelerators and Why FPGAs are Poised to Dominate.” This article offers up a pretty comprehensive overview of the technologies in the market today from the leading companies.
The second article for ED on June 29th, “Why is a SmartNIC Better than a Regular NIC?“
My first article for ED on June 4th, “What Makes a SmartNIC Smart?“

And there’s more to come…

SmartNICs vs. DPUs, Who Wins?

August 25, 2020February 15, 2021 scottcschweitzer 25GbE, Accelerators, FPGA, Infiniband, networking, TCP Offload Engine Accelerators, Broadcom, DPU, Fungible, IPU, Networking, NVIDIA, Pensando, SmartNICs, Xilinx

Last week I hosted an IEEE Hot Interconnects Panel with the above title. We were lucky enough to secure some time from the following luminaries, and it made for an excellent panel:

Andy Gospodarek, Broadcom, Open Sourcerer
Pradeep Sindhu, Fungible, CEO
Michael Kagan, NVIDIA, CTO
Vipin Jain, Pensando, CTO
Gordon Brebner, Xilinx, CTO Staff & Fellow

Clicking on the image below should take you to the 90 minute Youtube video of our panel discussion. For those who are just interested in the highlights you can read below for some of the interesting facts pulled from our discussion.

**IEEE Hot Interconnects Panel: “SmartNICs vs. DPUs, Who Wins?”**

Here are some key points that contain significant value from the above panel discussion:

SmartNICs provide a second computing domain inside the server that could be used for security, orchestration, and control plane tasks. While some refer to this as an air-gapped domain it isn’t, but it is far more secure than running inside the same x86 system domain. This can be used to securely enable bare-metal as a service. — Michael Kagan
Several vendors are actively collaborating on a Portable NIC Architecture (PNA) designed to execute P4 code. When available, it would then be possible to deliver containers with P4 code that could run on any NIC that supported this PNA model. — Vipin Jain
The control plane needs to execute in the NIC for two reasons, first to offload the host CPU from what is quickly become 30% overhead for processing network traffic, and second to improve the determinism of the applications running on the server. –Vipin Jain
App stores are inevitable, when is the question. While some think it could be years, others believe it will happen within a year. Xilinx has partnered with a company that already has one for FPGA accelerators so the leap to SmartNICs shouldn’t be that challenging. –Gordon Brebner
The ISA is un-important, it’s the micro-architecture that matters. Fungible selected MIPS-64 because of it’s support for simultaneous multi-threaded execution with fine-grained context switching. — Pradeep Sindhu. While others feel that the eco-system of tools and the wide access to developers is most important and that is why they’ve selected ARM.
It should be noted that normally the ARM cores are NOT in the data plane.

The first 18 minutes are introductions and marketing messages. While these are educational, they are also somewhat canned marketing messages. The purpose of a panel discussion was to ask questions that the panel hadn’t seen in advance so we could draw out of them honest perspectives and feedback from their years of experience.

IMHO, here are some of the interesting comments, with who made them and where to find them:

18:50 Michael – The SmartNIC is a different computational domain, a computer in-front of a computer, and ideal for security. It can supervise or oversee all system I/O, key thing is that it is a real computer.

23:00 Gordon – Offloading the host CPU to the SmartNIC and enabling programmability of the device is critically important. We’ll also see functions and attributes of switches being merged into these SmartNICs.

24:50 Andy – Not only data plane offload, but control plane offload from the host is also critically important. Also hardware, in the form of on chip logic, should be applied to data plane offload whenever possible so that ARM cores are NOT being placed in the data plane.

26:00 Andy – Dropped the three letter string that most hardware providers cringe when we hear it, SDK. He stressed the importance of providing one. It should be noted that Broadcom at this point, as far as I know, appears to be the only SmartNIC OEM that provides a customer facing SmartNIC SDK.

26:50 Vipin – A cloud based device that is autonomous from the system and remotely manageable. Has it’s own brain, and that truly runs independently of the host CPU.

29:33 Pradeep – There is no golden rule, or rule of thumb like 1Gb/sec/core like what AMD has said. It’s important to determine what computations should be done in the DPU, multiplexing and stateful applications are ideal. General purpose CPUs are made for processing single threaded applications very fast, horrible at multiplexing.

33:37 Andy – 1Gb/core is really low, I’d not be comfortable with that. I would consider DPDK, or XDP and it would blow that metric away. People shouldn’t settle for this metric.

35:24 Michael – Network needs to take care of the network on it’s own, so zero core for an infinite number of Gigabits.

36:45 Gordon – The SmartNIC is a kinda filtering device, where sophisticated functions like IPS, can be offloaded into the NIC.

40:57 Andy – The Trueflow logic delivers a 4-5X improvement in packet processing. There are a very limited number of people really concerned with hitting line rate packet per second at these speeds. In the data center these PPS requirements are not realistic.

42:25 Michael – I support what Andy said, these packet rates are not realistic in the data center.

44:20 Pradeep – We’re having this discussion because general purpose CPUs can no longer keep up. This is not black and white, but a continuum, where does general processing end and a SmartNIC pick up. GRPC as an example needs to be offloaded. The correct interface is not TCP or RDMA, both are too low level. GRPC is a modern level for this communication interface. We need to have architectural innovation because scale out is here to stay!

46:00 Gordon – One thing about being FPGA based is that we can support tons of I/O. With FPGAs we don’t think in terms of cores, we look at I/O volumes, several years ago we first started looking at 100GbE then figured out how to do that and extended it to 400GbE. We can see the current way scaling well into the Terabit range. While we could likely provide Terabit range performance today it would be far to costly, it’s a price point issue, and nobody would buy it, the cost of doing things is also an issue.

48:35 Michael – CPUs don’t manage data efficiently. We have dedicated hardware engines and TCAM along with caches to service these engines, that’s the way it works.

49:45 Pradeep – The person asking the question perhaps meant control flow and not flow control, while they sound the same they mean different things. Control flow is what a CPU does, flow control is what networking does. A DPU or SmartNIC needs to do both well to be successful. It appears, and I could be wrong, that Pradeep is using pipeline to refer to consecutive stages of execution on a single macro resource like a DPU then chain as a collection of pipelines that provide a complete solution.

54:00 Vipin – Sticking with fixed function execution than line rate is possible. We need to move away from focusing on processing TCP packets, and shift focus to messages with a run-to-completion model. It is a general purpose program running in the data path.

57:20 Vipin – When it came to selecting our computational architecture it was all about ecosystem, and widely available resources and tooling. We [Pensando] went with ARM.

58:20 Pradeep – The ISA is an utter detail, it’s the macro-architecture that matters, not the micro instruction architecture. We chose MIPS because of the implementation which is a simultaneous multi-threaded implementation which is far and away a much better fine grained context switching. Much much better than anything else out there. There is also the economic price/performance to be considered.

1:00:12 Michael – I agree with Vipin it’s a matter of ecosystem, we need to provide a platform for people to develop. We’re not putting ARMs on the data path. So this performance consideration Pradeep has mentioned is not relevant. The key is providing an ecosystem that attracts as many developers as possible, and making their lives easier to produce great value on the device.

1:01:08 Andy – I agree 100%, that’s why we selected ARM, ecosystem drove our choice. With ARM their are enough Linux distributions, and you could be running containers on your NIC. The transition to ARM is trivial.

1:02:30 Gordon – Xilinx mixes ARM cores with programmable FPGA logic, and hard IP cores for things like encryption.

1:03:49 Pradeep – The real problem is the data path, but clearly ARM cores are not in the data path so they are doing control plane functions. Everyone says they are using ARM cores because of the rich ecosystem, but I’d argue that x86 has a richer ecosystem. If that’s the case then why NOT keep the control plane then in the hosts? So why does the control plane need to be imbedded inside the chip?

1:04:45 Vipin – Data path is NOT in ARM. We want it on a single die, we don’t want it hoping across many wires and killing performance. The kind of integration I can do by subsuming the ARM cores into my die is tremendous. That’s why it can not be on Intel. [Once you go off die performance suffers, so what I believe Vipin means is that he can configure on the die whatever collection of ARM cores, and hard logic he wants, and wire it together how best he sees fit to meet the needs of their customers. He can’t license x86 cores and integrate them on the same die as he can with ARM cores.] Plus if he did throw an x86 chip on the card it would blow his power budget [PCIe x16 lane cards are limited to 75W].

1:06:30 Michael – We don’t have as tight an integration with data-path and ARMs as Pensando. If you want to segregate computing domains between application tier and infrastructure tier you need another computer and putting an x86 on a NIC just isn’t practical.

1:07:10 Andy – The air-gap, bare-metal as a service, use case is a very popular one. Moving control plane functions off the x86 to the NIC, frees up x86 cores and enables a more deterministic environment for my applications.

1:08:50 Gordon – Having that programable logic alongside the ARM cores gives you both the control plane offload as well as dynamically being able to modify the data plane locally.

1:10:00 Michael – We are all for users programming the NIC we are providing an SDK, and working with third parties to host their applications and services on our NICs.

1:10:15 Andy – One of the best things we do it outreach, where we provide NICs to university developers, they disappear for a few months then return with completed applications or new use cases. Broadcom doesn’t want to tightly control how people use their devices, it isn’t open if it is limited by what’s available on the platform.

1:13:20 Vipin – Users should be allowed to own and define their own SDK to develop on the platform.

1:14:20 Pradeep – We provide programming stacks [libraries?] that are available to users through RestAPIs.

1:15:38 Gordon – We took an early lead in helping define the P4 language for programming network devices. Which became Barefoot Networks switch chips, but we’ve embraced it since very early on. We actually have a P4 to Verilog compiler so you can turn your P4 code into logic. The main SmartNIC functions inside Xilinx are written in P4. Then there are plug-ins where others can add their own P4 functions into the pipeline.

1:17:35 Michael – Yes, an app-store for our NIC, certainly. It’s a matter of how it is organized. For me it is somewhere users can go where they can safely download containerized applications or services which can then run on the SmartNIC.

1:18:20 Vipin – The App Store is a little ways out there, it is a good idea. We are working in the P4 community towards standards. He mentions PNA, the Portable NIC Architecture as an abstraction. [OMG, this is huge, and I wish I wasn’t juggling the balls trying to keep the panel moving as this would have been awesome to dig into. A PNA could then enable the capability to have containerized P4 applications that could potentially run across multiple vendors SmartNICs.] He also mentioned that you will need NIC based applications, and a fabric with infrastrucutre applications so that NICs on opposite sides of a fabric can be coordinated

1:21:30 Pradeep, An App Store at this point may be premature. In the long term something like an App Store will happen.

1:22:25 Michael, things are moving much faster these days, maybe just another year for SmartNICs and an App Store.

1:23:45 Gordon, we’ve been working with Pensando and others on the PNA concept with P4 for some time.

1:28:40 Vipin, ..more coming as I listen again on Wednesday.

For those curious the final vote was three for DPU and two for SmartNIC, but in the end the customer is the real winner.

Half a Billion Req/Sec!

June 8, 2020June 8, 2020 scottcschweitzer FPGA, HFT, networking #Algologic, #KVS, #Redis, #Xilinx

Can your In-Memory Key-Value Store handle a half-billion requests per second?

Many applications from biological to financial and Web2.0 utilize in-memory databases because of their cutting-edge performance, often delivering several orders of magnitude faster response time than traditional relational databases. When these in-memory databases are moved to their own machines in a multi-tier application environment, they often can serve 10 million requests per second, and that’s turning all the dials to 11 on a high-end dual-processor server. Much of this is due to how applications communicate with the kernel and the network.

Last year, my team used one of these databases, Redis, we then bypassed the kernel, connected up a 100Gbps network, and took that 10 million requests per second to almost 50 million. Earlier this year we began working with Algo-Logic, Dell, and CC Integration to blow way beyond that 50 million target. At RedisConf2020 in May, Algo-Logic announced a 1U Dell server they’ve customized that can service nearly a half-billion requests per second. To process these requests, we’re spreading the load across two AMD EPYC CPUs and three Xilinx FPGAs. All requests are serviced directly from local memory using an in-memory key-value store system. For those requests serviced by the FPGAs, their response time is measured in billionths of a second. Perhaps we should explain how Algo-Logic got here, and why this number is significant.

Some time ago, a new form of databases came back into everyday use, and they were classified as NoSQL, because they didn’t use Structured Query language, and were non-relational. These databases rely on clever algorithmic tricks to rapidly store and retrieve information in memory; this is very different from how traditional relational databases function. These NoSQL systems are sometimes referred to as key-value stores. With these systems, you pass in a key, and a value is returned. For example, pass in “12345,” and the value “2Z67890” might be returned. In this case, the key could be an order number, and the value returned is the tracking number or status, but the point is you made a simple request and got back a simple answer, perhaps in a few billionths of a second. What Algo-Logic has done is they wrote an application for the Xilinx Alveo U50 that turns the 40 Gbps Ethernet port on this card into four smoking fast key-value stores each with access to the cards 8GB of High Bandwidth Memory (HBM). Each Alveo U50 card with Algo-Logics KVS can service 150 million requests per second. Here is a high-level architectural diagram showing all the various components:

Architecture for 490M Requests/Second into a 1U Server

There are five production network ports on the back of the server, three 40Gbps and two 25Gbps. Each of the Xilinx Alveo U50 cards has a single 40Gbps port, and the dual 25Gbps ports are on an OCP-3 form factor card called the Xilinx XtremeScale X2562 which carries requests into the AMD EPYC CPU complex. Algo-Logic’s code running in each of the Xilinx Alveo cards breaks the 40Gbps channel into four 10Gbps channels and processes requests on each individually. This enables Algo-Logic to make the best use possible of the FPGA resources available to them.

Furthermore, to overcome network overhead issues, Algo-Logic has packed 44 get requests into a single 1408 byte packet. For those familiar with Redis, this is similar to an MGET, multiple get, request. Usually, a single 32-byte get request can easily fit into the smallest Ethernet payload, which is 40 bytes, but then Ethernet adds an additional 24-bytes of routing and a 12-byte frame gap — using a single request per-packet results in networking overhead consuming 58% of the available bandwidth. This is huge, and can clearly impact the total request per second rate. Packing 44 requests of 32 bytes each into a single packet means that the network overhead drops to 3% of the total bandwidth, which means significantly greater requests per second rates.

What Algo-Logic has done here is extraordinary. They’ve found a way to tightly link the 8GB of High Bandwidth Memory on the Xilinx Alveo U50 to four independent key-value store instances that can service requests in well under a micro-second. To learn more consider reaching out to John Hagerman at Algo-Logic Systems, Inc.

SmartNICs, the Next Wave in Server Acceleration

April 4, 2020April 4, 2020 scottcschweitzer 25GbE, Accelerators, FPGA, HFT, networking, Security, TCP Offload Engine

As system architects, we seriously contemplate and research the components to include in our next server deployment. First, we break the problem being solved into its essential parts; then, we size the components necessary to address each element. Is the problem compute, memory, or storage-intensive? How much of each element will be required to craft a solution today? How much of each will be needed in three years? As responsible architects, we have to design for the future, because what we purchase today, our team will still be responsible for three years from now. Accelerators complicate this issue because they can both dramatically breath new life into existing deployed systems, or significantly skew the balance when designing new solutions.

Today foundational accelerator technology comes in four flavors: Graphical Processing Units (GPUs), Field Programmable Gate Arrays (FPGAs), Multi-Processor Systems on a Chip (MPSoCs) and most recently Smart Network Interface Cards (SmartNICs). In this market, GPUs are the 900-pound gorilla, but FPGAs have made serious market progress the past few years with significant deployments in Amazon Web Services (AWS) and Microsoft Azure. MPSoCs, and now SmartNICs, blend many different computational components into a single chip package, often utilizing a mix of ARM cores, GPU cores, Artificial Intelligence (AI) engines, FPGA logic, Digital Signal Processors (DSPs), as well as memory and network controllers. For now, we’re going to skip MPSoCs and focus on SmartNICs.

SmartNICs place acceleration technology at the edge of the server, as close as possible to the network. When computational processing of network intense workloads can be accomplished at the network edge, within a SmartNIC, it can often relieve the host CPU of many mundane networking tasks. Normal server processes require that the host CPU spend, on average, 30% of it’s time managing network traffic, this is jokingly referred to as the data center tax. Imagine how much more you could get out of a server if just that 30% were freed up, and what if more could be made available?

SmartNICs that leverage ARM cores and or FPGA logic cells exist today from a growing list of companies like Broadcom, Mellanox, Netronome, and Xilinx. SmartNICs can be designed to fit into a Software-Defined Networking (SDN) architecture. They can accelerate tasks like Network Function Virtualization (NVF), Open vSwitch (OvS), or overlay network tunneling protocols like Virtual eXtensible LAN (VXLAN) and Network Virtualization using Generic Routing Encapsulation (NVGRE). I know, networking alphabet soup, but the key here is that complex routing, and packet encapsulation tasks can be handed off from the host CPU to a SmartNIC. In virtualized environments, significant amounts of host CPU cycles can be consumed by these tasks. While they are not necessarily computationally intensive, they can be volumetrically intense. With datacenter networks moving to 25GbE and 50GbE, it’s not uncommon for host CPUs to process millions of packets per second. This processing is happening today in the kernel or hypervisor networking stack. With a SmartNIC packet routing and encapsulation can be handled at the edge, dramatically limiting the impact on the host CPU.

If all you were looking for from a SmartNICs is to offload the host CPU from having to do networking, thereby saving the datacenter networking tax of 30%, this might be enough to justify their expense. Most of the SmartNIC product offerings from the companies mentioned above run in the $2K to $4K price range. So suppose you’re considering a SmartNIC that costs $3K, with the proper software, and under load testing, you’ve found that it returns 30% of your host CPU cycles, what is the point at which the ROI makes sense? A simplistic approach would suggest that $3K divided by 30% yields a system cost of $10K. So if the cost of your servers is north of $10K, then adding a $3K SmartNIC is a wise decision, but wait, there’s more.

SmartNICs can also handle many complex tasks like key-value stores, encryption, and decryption (IPsec, MACsec, soon even SSL/TLS), next-generation firewalls, electronic trading, and much more. Frankly, the NIC industry is at an inflection point similar to when video cards evolved into GPUs to support the gaming and virtualization market. While Sony coined the term GPU with the introduction of the Playstation in 1994, it was Nvidia five years later in 1999 who popularized the GPU with the introduction of the GeForce 256. I doubt that in the mid-1990s, while Nvidia was designing the NV10 chip, the heart of the GeForce 256, that their engineers were also pondering how it might be used in high-performance computing (HPC) applications a decade later that had nothing to do with graphic rendering. Today we can look at all the ground covered by GPU and FPGA accelerators over the past two decades and quickly see a path forward for SmartNICs where they may even begin offloading the primary computational tasks of a server. It’s not inconceivable to envision a server with a half dozen SmartNICs all tasked with encoding video, or acting as key-value stores, web caches, or even trading stocks on various exchanges. I can see a day soon where the importance of SmartNIC selection will eclipse server CPU selection when designing a new solution from the ground up.

Mining, and the Importance of Knowing What, How, When, and Where?

December 3, 2019December 3, 2019 scottcschweitzer Digital Currency, FPGA

It doesn’t matter if you’re panning for gold, drilling for oil, or mining Bitcoin, your success is bounded by your best answers to what, how, when, and where? Often the “what” and “how” are tightly linked. If you own oil drilling equipment, you’re probably going to continue drilling for oil. If you buy an ASIC based Bitcoin mining rig, you can only mine Bitcoin. Traditionally “when” and “where” are the most fluid variables to address. A barrel of crude oil today is $57, but over the past year, it has fluctuated between $42 and $66. Similarly, Bitcoin, during the same year, has swung between $3,200 and $12,900, so answering the “when” can be very important. Fortunately, digital currencies can easily be mined and held, which allows us to artificially shift the “when” until the offer price of the commodity achieves the necessary profitability. In digital currency mining, the term is sometimes written HODL, originally a typo, but it has since morphed into “Hold On for Dear Life” until the currency is worth more than it cost you. Finally, we have the “where”, and I’m sure some are wondering why “where” matters in digital currency mining.

Moving backward through the above questions and drilling down specifically into digital currency mining as the application. “Where” is the easiest one, you want to install your mining equipment wherever you can get the cheapest power, manage the excess heat, and tolerate the noise. Recently two of the most extensive mining facilities, both around 300MW, have or are being stood up in former Aluminum plants. When making Aluminum, the single most costly component in the process is electricity, and it requires access to vast volumes of electricity. Often these facilities are located near hydro-electric plants where electricity is below $0.03 KW/h. Also, since every watt of power is converted into heat or sound, you need a method for cost-effectively dealing with these byproducts. One of the mining operations mentioned earlier is located in the far northern region of Russia, which makes cooling exceptionally easy. Also, with “where” you need a local government that is friendly to digital-currency mining. In the Russian example mentioned above, it took nearly two years to secure the proper legal support. Some countries like China, until recently, were not supportive of digital-currency mining. For enthusiasts like myself, we locate our mining gear in out of the way places like basements or closets, perhaps even insulating them for sound and channeling away the excess heat to somewhere useful.

Concerning “when,” that should be now. The general strategy executed by most of us currently mining is known as “mine and hold.” With the Bitcoin halving coming in May, the expectation is that Bitcoin will see a run-up to that point. In the prior two Bitcoin halvings, the price remained roughly the same before and after the event. The last halving was in July 2016, and since then, Bitcoin has gone from a niche commodity to a mainstream offering. In the previous week, Fidelity was awarded a trust license to operate its digital assets business; further proof Bitcoin has gone mainstream. As Bitcoin is the dominant digital currency, it is believed that as it rises, so shall many of the other currencies that use it as a benchmark. So, holding some of the other mainstream digital currencies like Ethereum should also see a significant benefit from a substantial increase in the value of Bitcoin.

Back to the “what” and “how.” With digital currency mining, you have two criteria to consider when answering the “how,” efficiency or flexibility. If you purchase a highly efficient solution, then it will be an ASIC based mining rig. You will then soon learn, if you haven’t already, that it has been designed to mine a single currency, and that’s ALL it can ever mine. Conversely, if you want flexibility, then an FPGA or GPU miner affords you various degrees of freedom, but again the choice between efficiency and flexibility comes into play. FPGA mining rigs are often 5X more efficient per watt than GPU based rigs, but the selection of FPGA bitstreams is finite, but growing monthly. Both FPGA and GPU rigs can easily switch from mining one coin to another with nominal effort; it’s the efficiency and what can be mined that separate the two.

Finally, I’ve neglected to address the most obvious questions “why?” This is both the root of our motivation to mine and the fabric of our most social network. “Our only hope, our only peace is to understand it, to understand the why. ‘Why’ is what separates us from them, you from me. Why’ is the only real social power, without it you are powerless. And this is how you come to me, without why,’ without power.” – Merovingian, “Matrix Reloaded” 2003

Size Matters, Especially in Computing

September 3, 2019September 4, 2019 scottcschweitzer FPGA

The only time someone says size doesn’t matter is when they have an abundance of what it is that’s being discussed. Back in the 1980s some of us took logic design and used discrete 7400 series chips to build out our projects. A 7400 has four two-input NAND gates, with four corresponding outputs, as well as power and ground pins. It is a simple 14 pin package about 3/4 of an inch long and maybe a quarter-inch wide that contains a grand total of sixteen transistors. Many of the basic gates we needed for our designs used that same exact package form factor which made for great fun. Thankfully we had young eyes back then because often times we’d be up till all hours of the night breadboarding our projects. We knew it was too late when someone would invariably slip up, insert a chip backward, and we’d all enjoy the faint whiff of burnt silicon.

Earlier this month Xilinx set a new world record by producing a field-programmable gate array (FPGA) chip which is a distant cousin of the 7400 called the Virtex Ultrascale+ VU19P. Instead of 16 transistors, it has 35 billion, with a “B”. Also, instead of four simple two-input, one output logic gates, it has nine million programmable system logic cells. A system logic cell is a “box” with six inputs and one output that is fully configurable and highly networked. Each individual little “box” is programmed by providing a logic table that maps all the possible six input combinations to the single output. So why does size matter?

Imagine you gave one child a quart-sized Ziplock bag of Legos and another several huge tackle boxes of pre-sorted bricks including Lego’s own robotics kit. Assuming both children have similar abilities and creativity which do you think will create the most compelling model? The first child’s solution wouldn’t be much larger than an apple and entirely static. While it could be revolutionary, it is limited to the constraints of the set of blocks provided. By contrast, the second child could produce a two-foot-tall robot that senses distance and moves freely about the room without bumping into walls. Which solution would you find compelling? In this case size matters in both the number and type of bricks available to the builder.

The system logic cells mentioned above are much like small Lego bricks in that they can easily replicate the capability of more complex bricks by combining several smaller ones. FPGAs are also like Legos in that you can quickly tear down a model and re-use the build blocks to assemble a new model. For the past 30 years, FPGAs have had limitations that have prevented them from going mainstream. First, it was their speed and size, then it was the complexity of programming them. FPGAs were hard to configure, but the companies behind this technology learned from the Graphical Processing Unit (GPU) market and realized they needed tools to make programming FPGAs easier. Today new tools exist to port C/C++ programs into FPGA bitstreams. Some might think that the decade of 2010 was the age of the GPU, while 2020 is shaping up to become the age of the FPGA.