Is Serverless Also Trafficless?

March 13, 2021March 13, 2021 scottcschweitzer Containers, Microsegmentation, Security

Recently I read the article “Why Now Is The Time To Go Serverless” by Romi Stein, the CEO of OpenLegacy, a composable platform company. While I agree with Romi on several points he made around the importance of APIs, micro-service architectures, and cloud computing. I agree that serverless doesn’t truly mean computing without a server, but rather computing on servers owned and provisioned by major cloud providers. My main point of contention is that large businesses executing mission-critical functions in public clouds may eventually come to regret this move to a “Serverless” architecture as it may also be “Trafficless.” Recently we’ve seen a rash of colossal security vulnerabilities from companies like Solarwinds and Microsoft (Outlook Server). Events like these should make us all pause and rethink how we handle security. Threat detection, and the resulting aftermath of a breach, especially in a composable enterprise highly dependent on a public cloud infrastructure, may be impossible because key data doesn’t exist or isn’t available.

In a traditional on-premises environment, it is generally understood that the volume of network traffic within the enterprise is often 10X that of the traffic entering and leaving the enterprise. One of the more essential strategies for detecting a potential breach within an enterprise is to examine; hopefully, in near-real-time, both the internal and external network flows looking for irregular traffic patterns. If you are notified of a breach, an analysis of these traffic patterns is often used to confirm a breach has occurred. To service both of these tasks copies are made of network traffic in flight, its called traffic capture. The data may then be reduced and eventually shipped off to Splunk, or run through a similar tool, hopefully locally. Honestly, I was never a big fan of shipping off copies of a company’s network traffic to a third party for analysis; many of a company’s trade secrets reside in these digital breadcrumbs.

Is a serverless environment also trafficless? Of course not, that’s ridiculous, but are private cloud providers willing to, or even capable of, sharing copies of all the network traffic your serverless architecture generates? If they were, what would you do with all that data? Wait, here’s another opportunity for the public cloud guys. They could sell everyone another service that captures and analyzes all your serverless network traffic to tell you when you’ve been breached! Seriously, this is something worthy of consideration.

Data Types, Computation & Electronic Trading

January 31, 2021January 31, 2021 scottcschweitzer AI, FPGA, HFT, networking, Performance

I’ll return to the “Expanded Role of HPC Accelerators” in my next post, but before doing so, we need to take a step back and look at how data is stored to understand how best to operate on and accelerate that data. When you look under the hood of your car, you’ll find that there are at least six different types of mechanical fasteners holding things together.

Artificial Intelligence and Electronic Trading

Hoses often require a slotted screwdriver to remove, while others a Phillips’s head. Some panels require a star-like Torx wrench while others a hexagonal Allen, but the most common one you’ll find are hexagonal headed bolts and nuts in both English and Metric sizes. So why can’t the engineers who design these products select a single fastener style? Simple, each problem has unique characteristics, and these engineers are often choosing the most appropriate solution. The same is true for data types within a computer. Ask almost anyone, and they’ll say that data is stored in a computer in bytes, just like your engine has fasteners.

Computers process data in many end-user formats, from simple text and numbers to sounds, images, video, and much more. Ultimately, it all becomes bits organized, managed, and stored as bytes. We then wrap abstraction layers around these collections of bytes as we shape them into numbers that our computer can then process. We know that some numbers are “natural,” they’re also called “integers,” meaning they have no fractional component, for example, one, two, and three. Simultaneously, other “real” numbers contain a decimal point that can have anywhere from one to an infinite collection of digits to the decimal’s right. Computers process data using both of these numerical formats.

Textual data is relatively easy to grasp as it is often stored as an integer without a positive or negative sign. For example, the letter “A” is stored using a standard that has assigned it the value 65, which as a single byte is “01000001.” In Computer Science, how a number is represented can be just as important as the number itself’s value. One can store the number three as an integer, but when it is divided by the integer value two, some people would say the result is 1.5, but they could be wrong. Depending on the division operator used, some languages support several, and the data type assigned to the product there could be several different answers, all of which would be correct within their context. As you move from integer to real numbers or, more specifically, floating-point numbers, these numerical representations expand considerably based on the degree of precision required by the problem at hand. Today, some of the more common numerical data types are:

Integers, which often take a single byte of storage, and can be signed or unsigned. Integers can come in at least seven different distinct lengths from a nibble stored as four bits, a byte (eight bits), a half-word (sixteen bits), word (thirty-two bits), double word (sixty-four bits), octaword (one hundred and twenty-eight bits), and an n-bit value.
Half Precision floating-point, also known as FP16 is where the real number is expressed in 16 bits of storage. The numerical value is stored in ten bits; then there are five bits for the exponent and a single sign bit.
Single Precision, also known as floating-point 32 or FP32, for 32 bits. Here we have 23 bits for the fractional component, eight for the exponent, and one for the sign.
Double Precision, also known as floating-point 64 or FP64, for 64 bits. Finally, we have 52 bits for the fractional component, 11 for the exponent, and one for the sign.

There are other types, but these are the major ones. How computational units process data differs broadly based on the architecture of the processing unit. When we talk about processing units, we’re specifically talking about the Central Processing Unit (CPUs), Graphical Processing Unit (GPUs), Digital Signal Processor (DSPs), and Field Programmable Gate Array (FPGAs). CPUs are the most general. They can process all of the above data types and much more; they are the BMW X5 sport utility vehicle of computation. They’ll get you pretty much anywhere you want; they’ll do it in style and provide good performance. GPUs are that tricked out Jeep Wrangler Rubicon designed to crawl over pretty much anything you can throw at it while still carrying a family of four. On the other hand, DSPs are dirt bikes, they’ll get one person over rough terrain faster than anything available, but they’re not going to shine when they hit the asphalt. Finally, we have FPGAs, and they’re the Dodge Challenger SRT Demon in the pack; they never leave the asphalt; they just leave everything else on it behind. All that to say that GPUs and DSPs are designed to operate on floating-point data, while FPGAs do much better with integer data. So why does this matter?

Every problem is not suited to a slot headed fastener; sometimes you’ll need a Torx, while others a hexagonal headed bolt. The same is true in computer science. If your system’s primary application is weather forecasting, which is floating-point intense, you might want vast quantities of GPUs. Conversely, if you’re doing genetics sequencing, where data is entirely integer-based, you’ll find that FPGAs may outperform GPUs, delivering up to a 100X performance per watt advantage. Certain aspects of Artificial Intelligence (AI) have benefited more from using FP16 based calculations over using FP32 or FP64. In this case, DSPs may outshine GPUs in calculations per watt. As AI emerges in key computational markets moving forward, we’ll see more and more DSPs applications; one of these will be electronic trading.

Today the cutting edge of electronic trading platforms utilize FPGA boards that have many built-in ultra-high-performance networking ports. These boards, in some cases, have upwards of 16 external physical network connections. Trading data and orders into markets are sent via network messages, which are entirely character; hence integer, based. These FPGAs contain code blocks that rapidly process this integer data, but computations slow down considerably as some of these integers are converted to floating-point for computation. For example, some messages use a twelve-character format for the price where the first six digits are the whole number, and the second six digits represent the decimal number. So, a price of $12.34 would be represented as the character string “000012340000.” Other fields also use twelve-character values for a number, but the first ten digits are the whole number, and the last two the decimal value. In this case, 12,572.75 shares of a stock would be represented as “000001257275.” Now, of course, doing financial computations maintaining the price or quantity as characters is possible; it would be far more efficient if each were recast as single-precision (FP32) numbers. Then computation could be rapidly processed. Here’s where a new blend of FPGA processing, to rapidly move around character data, and DSP computing for handling financial calculations using single precision math will shine. Furthermore, DSP engines are an ideal platform for executing trained AI-based algorithms that will drive financial trading moving forward into the future.

Someday soon, we’ll see trading platforms that will execute entirely on a single high-performance chip. This chip will contain a blend of large blocks of adaptable FPGA logic; we’re talking millions of logic tables, along with thousands of DSP engines and dozens of high-performance network connections. This will enable intelligent trading decisions to be made and orders generated in tens of a billionth of a second!

The Expanded Role of HPC Accelerators, Part 1

January 17, 2021January 17, 2021 scottcschweitzer Performance

When Lisa Su, CEO of AMD, presented a keynote talk at CES 2021 last week, she reminded me of a crucial aspect of High-Performance Computing (HPC) that often goes unnoticed. HPC is at the center of many computing innovations. Since SC04, I’ve attended the US SuperComputing conference in November pretty much religiously every other year or so. SC is where everyone in technology hauls out their best and brightest ideas and technologies, and I’ve been thrilled over the years to be a part of it with NEC, Myricom, and Solarflare. Some of the most intelligent people I’ve ever met or had the pleasure of working with, I first met at an SC event or dinner. SC though is continuously changing; just today, I posted a reference to the Cerebras CS-1, which uses a single chip that measures 8.5″ on each side to achieve SC performance that is 200X faster than #466 on the Top500.org list of supers. High-Performance Computing (HPC) is entering its fourth wave of innovation.

Cerebras CS-1 Single System SuperComputer Data Center in a Box

The first was defined by Seymour Cray in the early 1970s when he brought out the vector-based mainframe. The second wave was the clustering of Linux computers, which started to become a dominant force in HPC in the late 1990s. When this began, Intel systems were all single-core, with some supporting multiple CPU sockets. The “free” Linux operating system and low-cost Gigabit Ethernet (GbE) were the catalysts that enabled universities to quickly and easily cobble together significantly robust systems. Simultaneously, the open development of a Message Passing Interface (MPI) was completed that made it much easier to port existing first wave HPC applications over to clustered Linux systems without having to use TCP/IP. This second wave brought about advancements in HPC networking and storage that further defined it as a unique market. Today we’re at the tail end of the third wave of innovation driven by the Graphical Processing Unit (GPU). Some would say the dominant HPC brand today is NVIDIA because they’ve pushed GPUs’ envelope further and faster than anyone else, and they own Mellanox, the Infiniband networking guys. Today, our focus is the expanding role of accelerators, beyond GPUs, in HPC as they will define this new fourth wave of innovation.

Last week I thought this fourth wave would be defined by a whole new mode where all HPC computations are pushed off to special-purpose accelerators. These accelerators would then leverage the latest advances of the PCI express bus, new protocols for this bus, and the addition of Non-Volatile Memory express (NVMe) for storage. The fourth and soon fifth generation of the PCIe bus has provided dramatic speed improvements and support for two new protocols (CXL & CCIX) on this bus. Then along came the Cerebras CS-1 utilizing an 8.5″ square chip package that holds a single gargantuan chip with over a trillion transistors. While I think Cerebras may stand alone for some time using this single chip approach, it won’t be long before AMD considers the possibility of pouring hundreds of Zen3 chiplets into a package with an Infinity fabric that is MUCH larger than anything previously utilized. Imagine a single package that rivals Cerebras at 8.5″ square with hundreds of Zen3 chiplets (these are eight x86 cores sharing a common L3 cache), a large number of High Bandwidth Memory (HBM) chiplets, some FPGA chiplets contributed from Xilinx, along with Machine Learning (ML) chiplets from Xilinx latest Versal family, and chiplets for encryption, and 100GbE or faster networking. Talk about a system on a chip; this would be an HPC Super in a single package rivaling other many rack systems on the Top500.org.

More to come in Part II as we explain in more detail what I’d been thinking about regarding accelerators.

Performance vs. Perception

January 10, 2021January 10, 2021 scottcschweitzer Performance

Tesla Model 3 Performance Edition and Polaris Slingshot S

Our technology-focused world has jaded us, causing us to blur the line between our understanding of performance and perception. It is often easier than you might think to take factual data and conflate it with how something feels or our perception of what we experience. Performance is factual. A 2020 Tesla Model 3 Performance Edition accelerates from 0-60 MPH in 3.5 seconds. In comparison, a 2019 Polaris Slingshot Model S accelerates from 0-60 MPH in 5.5 seconds. On paper, the Slingshot accelerates 57% slower than the Tesla Model 3 Performance Edition; the data can be easily looked up and confirmed; these are empirical facts. Also, they’re somewhat easy to verify at a large parking lot or on a back road. The Slingshot accelerates over two seconds slower than the Tesla, both pictured to the right, but if you were to sit in the passenger’s seat of both, during the test, without a stopwatch, I’d bet serious money you’d say the Slingshot was faster.

Perception is a funny thing, all our senses are in play, and they help formulate our opinion of what we’ve experienced. The Tesla Model 3 Performance Edition is a superb vehicle, my cousin has one, and he can’t gush enough about how amazing it is. When you sit in the passenger’s seat, you experience comfortable black leather, plush cushioning, and 12-way adjustable heated seats. The cabin is climate-controlled, sound dampened, and trimmed in wood accents with plenty of glass and steel, making you feel safe, secure, and comfortable. Accelerating from 0-60 MPH is a simple task; the driver stomps on the accelerator, and the car does all the work, shifting as it needs to, with little engine noise. The 3.5 seconds fly by as the vehicle blows past 60 MPH like a rocket sled on rails. So how can a three-wheeled ride like the Slingshot ever compare?

If you’ve not experienced the Slingshot, it’s something entirely different as it engages all your senses, much like riding a motorcycle. There are only three wheels, two seats, no doors, and even the windshield and roof are optional. The standard passenger’s seat has one position, all the way back, and it isn’t heated. The seat is made from rigid foam, covered in all-weather vinyl, with luxury and comfort not being design considerations. Did I mention there are no doors, the cabin is open to the world, you see, hear and smell everything, there’s no wood trim, and the climate is the climate, no heat or A/C. With less than six inches of ground clearance, your bottom is 10” off the surface of the road, and an average height person can easily reach down and touch the road, although I wouldn’t recommend it while moving.

The Slingshot assaults each and every sense as it shapes your perception of accelerating from 0-60 MPH. The driver shifts from first through second and into third, slipping and chirping the back wheel with each of the three transitions from a standing start. The roar of the engine fills your ears; as you’re shoved back into the hard seat, you grab for the roll bar, then the knuckles of your right hand turn white, while you catch a whiff of the clutch and feel the air blow back your hair. It is an incredibly visceral experience, all while your smile grows to an ear to ear grin. Those 5.5 seconds could even be six, given the lost traction as you chirped the rear wheel, but it wouldn’t matter your passenger would swear on a bible; it was three seconds. How is this possible?

How could someone who’s been a passenger in both cars ever think the Slingshot was possibly faster? Simple perception. While Tesla engages your eyes and your ability to sense acceleration, that’s it. The Tesla was designed to shield you from the outside world while wrapping you in comfort, and they’ve done a fantastic job. Conversely, the Slingshot is all about exposing every nerve and sense in your body to the world around you. As we go around some turns, you might feel the slightest amount of drift as the lone rear tire gently swings around to the outside of the turn.

The above example goes to show that feelings can sometimes overcome facts. We live in a technological world where facts, like hard performance data, and our emotions, and perceptions, can easily be misconstrued. We all need to keep this in mind as we evaluate new solutions.

P.S. Yes, for those who’ve asked, the 2019 Slingshot S pictured above has been my project car for the past two months. The guy I bought it from in early November had purchased it new in May and had installed the roll bars and canvas top. Since then I’ve made a dozen other improvements from adding a full height windshield to a 500 watt Bluetooth amplifier with a pair of 6.5″ Kicker speakers (it didn’t come with a radio).

The Jetsons, Video & Acceleration

October 19, 2020October 20, 2020 scottcschweitzer Accelerators

Fifty-eight years ago, last month “The Jetsons” zipped into pop-culture in a flying car and introduced us to many new fictional technologies like huge flat screen TVs, tablet-based computing and video chat. Some of these technologies, had been popularized long before “The Jetson’s,” specifically video chat known previously as videotelephony was popularized as far back as the 1870s, but it took technology over a century to deliver the first commercially viable product. Today with the pandemic separating many of us from our loved ones and our workplaces, video chat has become an instrumental part of our lives. What most people don’t realize though is that video is extremely data intensive, especially at the viewing resolutions we’ve all become accustomed. Doing the computational work requires translating a high definition video (1080p) into a half dozen different resolutions, known as transcoding, to support most devices, including mobile, and to support various bandwidths. This is often done in real time and typically requires one or more CPU cores on a standard server. Now consider all the digital video you consume daily, both in video chats, and via streaming services, all this content needs to be transcoded for your consumption.

Transcoding video can be very CPU intensive, and the program typically used, FFmpeg, is very efficient at utilizing all the computational resources available. On Linux Ubuntu 16.04 using FFmpeg 2.8.17 my experience is that unconstrained FFmpeg will consume 92% of all the available compute power of the system. My test system is an AMD Ryzen 5 3300G clocked at 4.2Ghz with a Xilinx Alveo U30 Video Accelerator card. This is a hyperthreaded quad-core system. For the testing I produced two sample videos, one from Trevor Noah’s October 15^th, 2020 “Cow Hugging” episode and the other was John Oliver’s October 11^th, 2020 “Election 2020” episode. Using the code mentioned above here are the results in seconds of three successive runs using both files running through the AMD processor, and offloading transcoding into the Xilinx Alveo U30.

Raw data from testing. — Raw Data – Transcoding on AMD Ryzen 5 Versus Xilinx U30 Alveo Video Accelerator

From this, one can make several conclusions, but the one I see most fitting is that the Xilinx Alveo U30 can transcode content 8X faster than a single AMD Ryzen 5 core at 4.2Ghz. Now, this is still development level code; the general availability code has not shipped yet. It is also only utilizing one of the two encoding engines on the U30, so additional capacity is available for parallel encoding. As more is understood, this blog post will be updated to reflect new developments.

Updates:

10/20/20 – It has been suggested that I share the options used when calling ffmpeg for both the AMD CPU execution and the Xilinx Alveo U30. Here are the two sets of command line options used.

The script that calls the AMD processor to do the work used the following options with ffmpeg:

-f rawvideo -b:v 10M -c:v h264 -y /dev/null

The script that calls the Xilinx Alveo U30 used the following options:

-f rawvideo -b:v 10M -maxbitrate 10M -c:v mpsoc_vcu_h264 -y /dev/null

Dropping out the “-maxbitrate 10M” on one Alveo run later in the day yesterday didn’t seem to change much, but this will be further explored. Also it has been suggested that I look into the impact of using “-preset” which affects quality, and how that might perform differently on both platforms.

What is Confidential Computing

September 29, 2020September 29, 2020 scottcschweitzer Security

Data exists in three states, at rest, in-flight, and in-use. Over recent years the security industry has done an excellent job of providing solutions for securing data at rest, such as data stored on a hard drive and in-flight think web pages via HTTPS. Unfortunately, those looking to steal data are very familiar with these advances, so they probe the entire system searching for new vulnerabilities. They’ll look at code that hasn’t been touched in years, or even decades (Shellshock), and architectural elements of the system which were previously trusted like memory (Meltdown), cache, and CPU registers (Spectre). Confidential Computing address this third state, data in-use, by providing a hardware-based trusted execution environment (TEE). Last spring, the Linux Foundation realized that extensive reliance on public clouds demanded a more advanced holistic approach to security. Hence, they launched the Confidential Computing Consortium.

The key to Confidential Computing is building a TEE entirely in hardware. The three major CPU platforms all support some form of a TEE, and they are Intel’s Software Guard Extensions (SGX), AMD’s Secure Encrypted Virtualization (SEV), and ARM’s TrustZone. Developers can leverage these TEE platforms, but each is different, so code written for SGX will not work on an AMD processor. To defeat a TEE and access your most sensitive data, an attacker will need to profile the server hardware to determine which processor environment is in use. They will then need to find and deploy the appropriate vulnerability for that platform if one exists. They also need to ensure that their exploit has no digital or architectural fingerprints that would make attribution back to them a possibility when the exploit is eventually discovered.

Creating a trusted execution environment in hardware requires the host CPU to be intimately involved in the process. AMD, ARM, and Intel each provide their own hardware support for building a TEE, and each has its benefits. Two security researchers, one from Wayne State University and the other from the University of Houston, produced an excellent comparison of AMD and Intel’s platforms. For Intel, they stated:

“We conclude that Intel SGX is suited for highly security-sensitive but small workloads since it enforces the memory integrity protection and has a limited amount of secure resources.”

Concerning AMD

“AMD SME and SEV do not provide memory integrity protection. However, providing a greater amount of secure resources to applications, performing faster than Intel SGX (when an application requires a large amount of secure memory), and no code refactoring, make them more suitable for complex or legacy applications and services.”

Based on the work of these researches it would seem that AMD has a more comprehensive platform and that their solution is considerably more performant than Intel’s SGX.

So how does confidential computing establish a trusted execution environment? Today the Confidential Computing Consortium has three contributed projects, and each has its take on this as their objective:

Software Guard Extensions (SGX) from Intel
Open Enclave SDK
Enarx

For the past two decades, the US and the UK have used the construct of a Sensitive Compartmented Information Facility (SCIF but pronounced SKIF) to manage classified data. A SCIF is an enclave, a private space surrounded by public space, with a very well-defined set of procedures for securely using data within this private space and moving data into and out of the private space. Intel adopted some of these same concepts when they defined the Software Guard Extensions (SGX). SGX is a new set of processor instructions that first appeared in Skylake. When SGX instructions are used, it enables the processor to build a private enclave in memory where all the data and code in that memory region is encrypted. That region is further zoned off from all other processes, so they don’t have access to it, even those with a higher privilege. As the processor fetches instructions or data from that enclave, it then decrypts them in-flight, and if a result is to be stored back in the enclave, it will then be encrypted in-flight before it is stored.

When Intel rolled out SGX in 2015, it immediately became the safe that all safe crackers wanted to defeat. In computer science, safe crackers are security researchers, and in the five years since SGX was released, we’ve seen seven well-documented exploits. The two that exposed the most severe flaws in SGX were Prime+Probe and Foreshadow. Prime+Probe was able to grab the RSA keys that secured the encrypted code and data in the enclave. Within six months, a countermeasure was published to disable Prime+Probe. Foreshadow was a derivative of Spectre, and used flaws in speculative execution and buffer overflow to attack the secure enclave. SGX is a solid start with regard to building a trusted execution environment in hardware. WolfSSL also adopted SGX and tied it to its popular TLS/SSL stack to provide a secure connection into and out of an SGX enclave.

The Open Enclave SDK claims it is hardware agnostic, a software-only platform for creating a trusted execution environment. The Open Enclave SDK requires SGX with Flexible Launch Control (FLC) as a prerequisite for installation. It is an extension of SGX and only runs on Intel hardware. Recently, a technology preview was made available for the Open Portable Trusted Execution Environment OS available on ARM that leverages TrustZone. At this point, there appears to be no support for AMD’s platform.

Enarx is also a hardware-agnostic, but it is an application launcher designed to support Intel’s SGX and AMD’s Secure Encrypted Virtualization (SEV) platform. It does not require that applications be modified to use these trusted execution environments. When delivered, this would be a game-changer. “Enarx aims to make it simple to deploy workloads to a variety of different TEEs in the cloud, on your premises or elsewhere, and to allow you to have confidence that your application workload is as secure as possible.” At this point, Enarx hasn’t mentioned support for ARM’s TrustZone technology. There is tremendous promise in the work the Enarx team is doing, and they appear to be making some substantial progress.

The Confidential Computing Consortium is still less than a year old, and it has attracted all the major CPU and data center players as members. Their goal is an ambitious one, but with projects like Enarx well underway, it’s hopeful that securing data in-use will soon become commonplace throughout on-premises and cloud environments.

*Note this story was originally written for Linkedin on July 12, 2020.

SmartNICs and SmartSSDs, the Future of Smart Acceleration

September 25, 2020September 25, 2020 scottcschweitzer Accelerators, FPGA, networking

For the past three years, I’ve been writing about SmartNICs. One of my most popular blog posts is “What is a SmartNIC” from July 2017, which has been read over 6,000 times. This year, for the second time, I’ve presented at the Storage Developer Conference (SDC). The title for this blog post was also the title of my breakout session video, which ran for 50 minutes, and went live earlier this week. Here is the abstract for that session:

Since the advent of the Smart Phone over a decade ago, we’ve seen several new “Smart” technologies, but few have had a significant impact on the data center until now. SmartNICs and SmartSSDs will change the landscape of the data center, but what comes next? This talk will summarize the state of the SmartNIC market by classifying and discussing the technologies behind the leading products in the space. Then it will dive into the emerging technology of SmartSSDs and how they will change the face of storage and solutions. Finally, we’ll dive headfirst into the impact of PCIe 5 and Compute Express Link (CXL) on the future of Smart Acceleration on solution delivery.
—Scott Schweitzer, The Technology Evangelist, Xilinx, Sept 2020

In that talk, which has been seen by over 100 people in just the first 24-hours alone on YouTube (I’m told this doesn’t include conference attendees), I shared much of what I’ve learned over the past few months while producing the following new items on SmartNICs:

My fourth Electronic Design (ED) article on September 13th is titled “How PCIe 5 with CXL, CCIX, and SmartNICs Will Change Solution Acceleration.” This piece touches on some of the new high-level protocols which will make Smart Accelerators even more performant.
Hosting an IEEE Hot Interconnects Panel on August 19th titled: “SmartNICs vs. DPUs, Who Wins?” Over 225 people originally attended this talk, and the video has since been viewed another 300 times. We were fortunate enough to secure critical people from all the major SmartNIC companies, except Intel, for an extremely lively discussion. I also did a blog post on this last month.
The third article for ED on SmartNICs published on July 10th with the title “SmartNIC Architectures: A Shift to Accelerators and Why FPGAs are Poised to Dominate.” This article offers up a pretty comprehensive overview of the technologies in the market today from the leading companies.
The second article for ED on June 29th, “Why is a SmartNIC Better than a Regular NIC?“
My first article for ED on June 4th, “What Makes a SmartNIC Smart?“

And there’s more to come…

SmartNICs vs. DPUs, Who Wins?

August 25, 2020February 15, 2021 scottcschweitzer 25GbE, Accelerators, FPGA, Infiniband, networking, TCP Offload Engine Accelerators, Broadcom, DPU, Fungible, IPU, Networking, NVIDIA, Pensando, SmartNICs, Xilinx

Last week I hosted an IEEE Hot Interconnects Panel with the above title. We were lucky enough to secure some time from the following luminaries, and it made for an excellent panel:

Andy Gospodarek, Broadcom, Open Sourcerer
Pradeep Sindhu, Fungible, CEO
Michael Kagan, NVIDIA, CTO
Vipin Jain, Pensando, CTO
Gordon Brebner, Xilinx, CTO Staff & Fellow

Clicking on the image below should take you to the 90 minute Youtube video of our panel discussion. For those who are just interested in the highlights you can read below for some of the interesting facts pulled from our discussion.

**IEEE Hot Interconnects Panel: “SmartNICs vs. DPUs, Who Wins?”**

Here are some key points that contain significant value from the above panel discussion:

SmartNICs provide a second computing domain inside the server that could be used for security, orchestration, and control plane tasks. While some refer to this as an air-gapped domain it isn’t, but it is far more secure than running inside the same x86 system domain. This can be used to securely enable bare-metal as a service. — Michael Kagan
Several vendors are actively collaborating on a Portable NIC Architecture (PNA) designed to execute P4 code. When available, it would then be possible to deliver containers with P4 code that could run on any NIC that supported this PNA model. — Vipin Jain
The control plane needs to execute in the NIC for two reasons, first to offload the host CPU from what is quickly become 30% overhead for processing network traffic, and second to improve the determinism of the applications running on the server. –Vipin Jain
App stores are inevitable, when is the question. While some think it could be years, others believe it will happen within a year. Xilinx has partnered with a company that already has one for FPGA accelerators so the leap to SmartNICs shouldn’t be that challenging. –Gordon Brebner
The ISA is un-important, it’s the micro-architecture that matters. Fungible selected MIPS-64 because of it’s support for simultaneous multi-threaded execution with fine-grained context switching. — Pradeep Sindhu. While others feel that the eco-system of tools and the wide access to developers is most important and that is why they’ve selected ARM.
It should be noted that normally the ARM cores are NOT in the data plane.

The first 18 minutes are introductions and marketing messages. While these are educational, they are also somewhat canned marketing messages. The purpose of a panel discussion was to ask questions that the panel hadn’t seen in advance so we could draw out of them honest perspectives and feedback from their years of experience.

IMHO, here are some of the interesting comments, with who made them and where to find them:

18:50 Michael – The SmartNIC is a different computational domain, a computer in-front of a computer, and ideal for security. It can supervise or oversee all system I/O, key thing is that it is a real computer.

23:00 Gordon – Offloading the host CPU to the SmartNIC and enabling programmability of the device is critically important. We’ll also see functions and attributes of switches being merged into these SmartNICs.

24:50 Andy – Not only data plane offload, but control plane offload from the host is also critically important. Also hardware, in the form of on chip logic, should be applied to data plane offload whenever possible so that ARM cores are NOT being placed in the data plane.

26:00 Andy – Dropped the three letter string that most hardware providers cringe when we hear it, SDK. He stressed the importance of providing one. It should be noted that Broadcom at this point, as far as I know, appears to be the only SmartNIC OEM that provides a customer facing SmartNIC SDK.

26:50 Vipin – A cloud based device that is autonomous from the system and remotely manageable. Has it’s own brain, and that truly runs independently of the host CPU.

29:33 Pradeep – There is no golden rule, or rule of thumb like 1Gb/sec/core like what AMD has said. It’s important to determine what computations should be done in the DPU, multiplexing and stateful applications are ideal. General purpose CPUs are made for processing single threaded applications very fast, horrible at multiplexing.

33:37 Andy – 1Gb/core is really low, I’d not be comfortable with that. I would consider DPDK, or XDP and it would blow that metric away. People shouldn’t settle for this metric.

35:24 Michael – Network needs to take care of the network on it’s own, so zero core for an infinite number of Gigabits.

36:45 Gordon – The SmartNIC is a kinda filtering device, where sophisticated functions like IPS, can be offloaded into the NIC.

40:57 Andy – The Trueflow logic delivers a 4-5X improvement in packet processing. There are a very limited number of people really concerned with hitting line rate packet per second at these speeds. In the data center these PPS requirements are not realistic.

42:25 Michael – I support what Andy said, these packet rates are not realistic in the data center.

44:20 Pradeep – We’re having this discussion because general purpose CPUs can no longer keep up. This is not black and white, but a continuum, where does general processing end and a SmartNIC pick up. GRPC as an example needs to be offloaded. The correct interface is not TCP or RDMA, both are too low level. GRPC is a modern level for this communication interface. We need to have architectural innovation because scale out is here to stay!

46:00 Gordon – One thing about being FPGA based is that we can support tons of I/O. With FPGAs we don’t think in terms of cores, we look at I/O volumes, several years ago we first started looking at 100GbE then figured out how to do that and extended it to 400GbE. We can see the current way scaling well into the Terabit range. While we could likely provide Terabit range performance today it would be far to costly, it’s a price point issue, and nobody would buy it, the cost of doing things is also an issue.

48:35 Michael – CPUs don’t manage data efficiently. We have dedicated hardware engines and TCAM along with caches to service these engines, that’s the way it works.

49:45 Pradeep – The person asking the question perhaps meant control flow and not flow control, while they sound the same they mean different things. Control flow is what a CPU does, flow control is what networking does. A DPU or SmartNIC needs to do both well to be successful. It appears, and I could be wrong, that Pradeep is using pipeline to refer to consecutive stages of execution on a single macro resource like a DPU then chain as a collection of pipelines that provide a complete solution.

54:00 Vipin – Sticking with fixed function execution than line rate is possible. We need to move away from focusing on processing TCP packets, and shift focus to messages with a run-to-completion model. It is a general purpose program running in the data path.

57:20 Vipin – When it came to selecting our computational architecture it was all about ecosystem, and widely available resources and tooling. We [Pensando] went with ARM.

58:20 Pradeep – The ISA is an utter detail, it’s the macro-architecture that matters, not the micro instruction architecture. We chose MIPS because of the implementation which is a simultaneous multi-threaded implementation which is far and away a much better fine grained context switching. Much much better than anything else out there. There is also the economic price/performance to be considered.

1:00:12 Michael – I agree with Vipin it’s a matter of ecosystem, we need to provide a platform for people to develop. We’re not putting ARMs on the data path. So this performance consideration Pradeep has mentioned is not relevant. The key is providing an ecosystem that attracts as many developers as possible, and making their lives easier to produce great value on the device.

1:01:08 Andy – I agree 100%, that’s why we selected ARM, ecosystem drove our choice. With ARM their are enough Linux distributions, and you could be running containers on your NIC. The transition to ARM is trivial.

1:02:30 Gordon – Xilinx mixes ARM cores with programmable FPGA logic, and hard IP cores for things like encryption.

1:03:49 Pradeep – The real problem is the data path, but clearly ARM cores are not in the data path so they are doing control plane functions. Everyone says they are using ARM cores because of the rich ecosystem, but I’d argue that x86 has a richer ecosystem. If that’s the case then why NOT keep the control plane then in the hosts? So why does the control plane need to be imbedded inside the chip?

1:04:45 Vipin – Data path is NOT in ARM. We want it on a single die, we don’t want it hoping across many wires and killing performance. The kind of integration I can do by subsuming the ARM cores into my die is tremendous. That’s why it can not be on Intel. [Once you go off die performance suffers, so what I believe Vipin means is that he can configure on the die whatever collection of ARM cores, and hard logic he wants, and wire it together how best he sees fit to meet the needs of their customers. He can’t license x86 cores and integrate them on the same die as he can with ARM cores.] Plus if he did throw an x86 chip on the card it would blow his power budget [PCIe x16 lane cards are limited to 75W].

1:06:30 Michael – We don’t have as tight an integration with data-path and ARMs as Pensando. If you want to segregate computing domains between application tier and infrastructure tier you need another computer and putting an x86 on a NIC just isn’t practical.

1:07:10 Andy – The air-gap, bare-metal as a service, use case is a very popular one. Moving control plane functions off the x86 to the NIC, frees up x86 cores and enables a more deterministic environment for my applications.

1:08:50 Gordon – Having that programable logic alongside the ARM cores gives you both the control plane offload as well as dynamically being able to modify the data plane locally.

1:10:00 Michael – We are all for users programming the NIC we are providing an SDK, and working with third parties to host their applications and services on our NICs.

1:10:15 Andy – One of the best things we do it outreach, where we provide NICs to university developers, they disappear for a few months then return with completed applications or new use cases. Broadcom doesn’t want to tightly control how people use their devices, it isn’t open if it is limited by what’s available on the platform.

1:13:20 Vipin – Users should be allowed to own and define their own SDK to develop on the platform.

1:14:20 Pradeep – We provide programming stacks [libraries?] that are available to users through RestAPIs.

1:15:38 Gordon – We took an early lead in helping define the P4 language for programming network devices. Which became Barefoot Networks switch chips, but we’ve embraced it since very early on. We actually have a P4 to Verilog compiler so you can turn your P4 code into logic. The main SmartNIC functions inside Xilinx are written in P4. Then there are plug-ins where others can add their own P4 functions into the pipeline.

1:17:35 Michael – Yes, an app-store for our NIC, certainly. It’s a matter of how it is organized. For me it is somewhere users can go where they can safely download containerized applications or services which can then run on the SmartNIC.

1:18:20 Vipin – The App Store is a little ways out there, it is a good idea. We are working in the P4 community towards standards. He mentions PNA, the Portable NIC Architecture as an abstraction. [OMG, this is huge, and I wish I wasn’t juggling the balls trying to keep the panel moving as this would have been awesome to dig into. A PNA could then enable the capability to have containerized P4 applications that could potentially run across multiple vendors SmartNICs.] He also mentioned that you will need NIC based applications, and a fabric with infrastrucutre applications so that NICs on opposite sides of a fabric can be coordinated

1:21:30 Pradeep, An App Store at this point may be premature. In the long term something like an App Store will happen.

1:22:25 Michael, things are moving much faster these days, maybe just another year for SmartNICs and an App Store.

1:23:45 Gordon, we’ve been working with Pensando and others on the PNA concept with P4 for some time.

1:28:40 Vipin, ..more coming as I listen again on Wednesday.

For those curious the final vote was three for DPU and two for SmartNIC, but in the end the customer is the real winner.

Kobayashi Maru and Linkedin’s SSI

June 28, 2020July 6, 2020 scottcschweitzer Uncategorized

Fans of Star Trek immediately know the Kobayashi Maru as the no-win test given to all Starfleet officer candidates to see how they respond to a loss. After being one of Linkedin’s first million members, I recently found out that there is a score by which Linkedin determines how effectively you use their platform. This score is out of 100, and it is composed of four pillars, each with a value of 25 points. If you overachieve in any given pillar, you can’t earn more than 25 points; it’s a hard cap. Like the Kobayashi Maru, the only way to beat Linkedin’s Social Selling Index (SSI), is to learn as much as you can about the innards of how it works, then hack or more accurately “game the system.” Here is a link to your score. There are several articles out there that explain how the SSI is computed, some build on slides that Linkedin supplied at some point, but here are the basics that I’ve uncovered, and how you can “game the SSI.”

How Linkedin computes the SSI is extremely logical. Someone can effectively start with the platform and leverage it to become a successful sales professional in very little time. As mentioned earlier, the SSI is computed from four 25-point pillars which to some degree, build on each other, and they are:

Build your Brand
Grow your Network
Engage with your Network
Develop Relationships with your Network

The first pillar, “Building your Brand,” is almost entirely within your own control, and can be mastered with a free membership. There are four elements to building your brand, and these are: complete your profile, including video in your profile, write articles, and get endorsements. The first three require only elbow grease, basic video skills, and some creative writing. All of these elements are skills that most professionals should have some reasonable degree of competency with, and if not, can be quickly learned. Securing endorsements requires you to leverage your network’s closest elements to submit small fragments of text about your performance when you worked with them. If you want to be aggressive, you could write these for your former coworkers and offer them up to put in their voice and submit on your behalf. Scoring 25 in this area is within reach of most folks; I scored 24.61 when I learned about the SSI.

To pull off a 25 in the second pillar, “Growing your Network” requires a paid membership with Linkedin and for optimum success a “Sales Navigator” membership at $80/month. If you’re a free member and you buy up to Sales Navigator, some documentation implies that this will give you an immediate 10-point boost in this category. Once you have a Sales Navigator membership, it then requires that you use the tool, “Lead Builder,” and connect with recommendations. The “free” aspects of this pillar are doing people searches, viewing these profiles, especially 3rd-degree folks and people totally outside your network. While I had a paid membership, it was not a Sales Navigator membership when I discovered SSI, but when I bought up to Sales Navigator, my score in this pillar remained at 15.25. After going through the Sales Navigator training, my score did go up to 15.32, but clearly, I need to make effective use of Sales Navigator to pull my score up in this pillar. The expectation for those hitting 25 in this pillar is that you’ve used their tools to find leads and convert them into members of your network, and perhaps customers.

Engagement is the third pillar, and here Linkedin uses the following four metrics to determine your score. You need to share posts WITH pictures, give and get likes, repost content from others, comment and reshare on posts from others, join at least 50 groups, and finally send Inmails and get responses. Inmails only come with a paid membership, so again you can’t achieve 25 in this pillar without a paid membership. In this section, I started at 14.35. I never send Inmails, so that’s something that is going to change. Nor was I big on reposting content from others, or resharing posts by others. I do like posts from others and get likes from others, so perhaps that’s a good contributing factor. I was already a member of 52 groups, and from what I’ve read, adding more above 50 doesn’t contribute to increasing your score.

Finally, the last pillar is Relationships. This score is composed of the number of connections you have and the degree to which you interact with those connections. For a score of 25 in this group, it’s been said that you need at least 5,000 connections, this is not true. If you carefully curate who you invite, you can get close to 25 with under 2,000 quality connections. If you’re a VP or higher, you get additional bonus points, and connections in your network that are VP or higher earn you more points than entry-level connections. The SSI is all about the value of the network you’ve built and can sell to. If your network is made up of decision-makers versus contributors or influences, then it’s more effective and hence valuable. Here you get bonus points for connections with coworkers and for a high connection rate acceptance ratio. In other words, if you spam a bunch of people with connection requests that you have nothing in common with, then you’re wasting your time. These people will likely not accept your request, and if they do, Linkedin will know you were spamming and that those people who did accept were just being polite, but aren’t valuable network contacts. Here my score started at 22.8, and just over 24 hours, I was able to run it up to 24.05, a 1.25-point gain. Now It should be clear that I had 1,700 or so connections to start, so I skillfully ran it up to 1,815 connections knowing everything above, and it paid off. I went through my company and offered to connect with anyone that I shared at least five connections. Also, I ground through those in Linkedin who had jobs near me geographically that also shared five connections with me and invited those people. The combination of these two activities yielded just over two hundred open connection requests, and very nearly half accepted within 24-hours.

After 24 hours, some rapid course corrections, and a few hours working my network while on a car ride on a Saturday, I’ve brought my score up 1.35 points. Now that you know what I do about the SSI, I wish you all the best. Several people that have written articles about SSI are at or very close to 100. At 78, I’m still a rookie, but give me a few weeks.

SSI Score 79 – Sunday, June 28th, 2020

SSI Score 82 – Monday, June 29th, 2020 – Clearly what I learned above is working, five points in only a few days. Actually the score was 81.62, but Linkedin rounds.

SSI Score 82 – Tuesday, June 30th, 2020 – Actually 81.77, only a minor gain from yesterday, as I throttled back to see if there was “momentum.” Below is my current screenshot from today, here you can see that I’ve maxed out on “Build Relationships” at 25 and have nearly maxed “Establishing my Brand” at 24.78. Therefore my focus moving forward needs to be “Engage with Insight” and “Finding the Right People”. Engagement means utilizing all my Inmails with the intent of getting back a reply of some kind. To improve my Finding the Right People I need to leverage Sales Navigator to find leads to send Inmails to, perhaps two birds with one stone.

SSI Score 84 – Sunday, July 5th, 2020 – So the gain was five points in a week, but for the most part I took Thursday through Sunday off for the US holiday and had to move my mom out of the FL Keys (I live in Raleigh, so we had to fly down and back to Miami). Thankfully, there was clearly some momentum going into the weekend.

Banned from the Internet

June 10, 2020June 10, 2020 scottcschweitzer networking, Security, Website

After nearly seventeen years with IBM, in July of 2000, I left for a startup called Telleo founded by four IBM Researchers I knew and trusted. From 1983 through April 1994, I worked at IBM Research in NY and often dealt with colleagues at the Almaden Research Center in Silicon Valley. When they asked me to join, there was no interview; I already had impressed all four of them years earlier, this was in May of 2000. In March of 2001, the implosion of Telleo was evident. Although I’d not been laid off, I voluntarily quit just before Telleo stopped paying on their IBM lease, which I’d negotiated. The DotCom bubble burst in late 2000, so by early 2001, you were toast if you weren’t running on revenue. Now, if you didn’t live in Silicon Valley during 2001, imagine a large mining town where the mine had closed, this was close to what it was like, just on a much grander scale. Highway 101 had gone from packed during rush hour to what it typically looked like during the weekend. Venture Capitalists drew the purse strings closed, and if you weren’t running on revenue, you were out of business. Most dot-com startups bled red monthly and eventually expired.

Now imagine being an unemployed technology executive in the epicenter of the worst technology employment disaster in history, up until that point, with a wife who volunteered and two young kids. I was pretty motivated to find gainful employment. For the past few years, a friend of mine had run a small Internet Service Provider and had allowed me to host my Linux server there in return for some occasional consulting.

I’d set Nessus up on that server, along with several other tools, so it could be used to ethically hack client’s Internet servers, only by request, of course. One day when I was feeling particularly desperate, I wrote a small Perl script that sent a simple cover letter to jobs@X.com. Where “X” was a simple string starting with “aa” and eventually ending at “zzzzzzzz”. It would wait a few seconds between each email, and since these were to jobs@x.com I figured it was an appropriate email blast. Remember this was 2001, before SPAM was a widely used term. I thought “That’s what the “jobs” account is for anyway, right?” My email was very polite and requested a position and briefly highlighted my career.

Well, somewhere around 4,000 emails later, I got shut down, and my Internet domain, ScottSchweitzer.com was Black Holed. For those not familiar with the Internet version of this term, it essentially means no email from your domain even enters the Internet. If your ISP is a friend and he fixes it for you, he can run the risk of getting sucked in, and all the domains he hosts get sucked into the void as well. Death for an ISP. Fortunately, my friend that ran the ISP was a life-long IBMer, and he had networking connections at some of the highest levels in the Internet, so the ban stopped with my domain.

To clean this up required some emails and phone calls to fix the problem from the top down. It took two weeks and a fair amount of explaining to get my domain back online to the point where I could once again send out emails. Fortunately, I always have at least several active email accounts, and domains. Also, this work wasn’t in vain, as I’d received a few consulting gigs as a result of the email blast. So now you know someone who was banned from the Internet!

The Technology Evangelist

Since my first computer in 1983, a TRS80 Model III, I've been educating others on the promises of computing.

Author: scottcschweitzer