A Decade Old MacBook Pro

In 2012, I invested in my daughter, who was studying social work, by buying her a new MacBook Pro, the last one with a DVD drive. My road to Mac fanboy has been one of the longest. My first professional job in 1983 was as a programmer hired to write educational software for the Apple II+. That lasted almost nine months till I left for college to finish my degree. It wasn’t until late in the 1980s, while working at IBM Research in New York doing final assembly and distribution on the as-yet-unannounced IBM RT/PC, that I encountered one of Steve Jobs’s NeXT workstations in a lab. Like a Ferrari, only in black. It looked fast just sitting on the lab bench. I requested an account and spent a few evenings exploring the system. I wondered if the IBM RT/PC even stood a chance against it. At the time, I also had access to Silicon Graphics systems, and while the RT/PC had a much better processor architecture than the MIPS R3000 and far more advanced compilers, it still looked like a PC trying to be something more. The RT/PC eventually became the RS/6000 and later Power Systems. Until 1994, I had access to and regularly used many of these systems, and I taught Unix to hundreds of people at IBM Research over the years.

It was nearly ten years later, in 2004, before I purchased my first Apple product, a PowerBook G4, because it had the PowerPC G4 chip. It was, in my opinion, the first serious Unix laptop with a pleasant graphical software environment, one whose heritage went back to NeXT, and to Unix before that. I had run Linux on various systems in the 1990s and tried Linux on laptops until then. Still, at that time, it always seemed like a science project, and I just needed something that would work and run versions of MS Office that were compatible, at the file-format level, with their Windows counterparts. Since then, I’ve had all manner of MacBook and Mac mini. Occasionally, my current employer drags me back into the Windows world. Windows machines are acceptable at a hardware level, but the operating system WILL NEVER be as robust and secure as macOS or Linux.

Back to my daughter: two weeks ago, she mentioned that her laptop, the 2012 MacBook Pro, had crapped out, so I had her drop it off with me on Friday. Several years back, I replaced her 512GB spinning disk with a 1TB SSD, so I doubted that was the issue. It took forever for the system to boot, but once it was up, I was able to launch Activity Monitor and a Console window, and over two hours, as a superuser, I weeded out three pieces of crap-ware that were hogging all the CPU cycles. Once these applications were removed, it was like a new system. Sure, Apple suspended support for this platform precisely a year ago, after ten years, but with macOS Catalina, it is still running strong. This morning, I swapped out the two 4GB memory modules for two 8GB ones I had left over from another project and replaced the battery with a new one for $35. To avoid further problems, I hid Safari, which was still acting quirky, along with Chrome, and installed Firefox as the default browser. She can now get another two years, and if she’s careful, many more, from this system. Please don’t get on me about security updates. The only remaining moving part is the cooling fan. Aside from a few minor dents in the aluminum case and a charger that only works in one of its two orientations, the system performs like a champ. The display looks great, and the keyboard and trackpad are functioning excellently.

So why this post? We live in a consumable world. To remain relevant, those of us on the leading edge of technology replace our smartphones and laptops every two or three years, and our desktop computers, servers, and smartwatches perhaps every five. As an Apple shareholder, I expect, no, require, that our customers remain on this upgrade treadmill. We also have to recognize, though, that there is another tier of customer, one who needs a product that works for ten or more years. My 82-year-old mother has a MacBook Air that she bought at my request, about eight years ago, to replace her failing three-year-old Windows machine. She’s worn the letters off all the vowel keys and several consonants, but she knows what’s where, and it still performs as well as the day she bought it. I dread the day, in the coming years, when I’ll have to pry her iPhone 8, the last one with a home button, which she got six years ago, from her hands. Her mind is no longer agile enough to handle the change to a home-button-less iPhone 15.

I’m writing this on my 13” MacBook Air M1. I lament that the M1’s tightly integrated system-on-a-chip design has eliminated the ability to upgrade the MacBook Air’s memory and storage beyond how the system initially shipped. Therefore, I bought up at the time of purchase to 16GB of memory and 1TB of storage. Eventually, next year, I’ll break down and move up to a 15” MacBook Air M3. Then, next Christmas, my daughter will get my M1 MacBook Air, but she doesn’t know it and isn’t expecting it, so let’s keep this between us.

The Future of 3D Printing & Plant-Based Resin


A few months ago, I switched over to Eco UV resin because I wanted to be a good citizen of the planet and thought that moving to a plant-based, water-soluble product was the right thing to do. I’m on my third one-kilogram container of this product, and I’m beginning to have my doubts, but perhaps I should back up.

In March of 2020, I purchased an Elegoo Mars 2 Pro 3D resin printer and have since consumed at least 15kg of standard resin in at least four colors, so I’m not your typical 3D resin noob. I also picked up an Elegoo Mercury Plus wash station at the same time, and typically wash 20-40 prints per gallon of 97% alcohol. To be clear, I’m not printing small figures but, more often than not, 3-5″ models of things like custom Raspberry Pi cases, fan mounting brackets, and prototypes of future products. When using normal resin, the alcohol in the washbasin becomes discolored after a few washes, but I don’t switch colors much, and the prints are always clean, smooth, and free of any residual resin after washing. In fact, the rapid resin often stays in suspension in the alcohol between print jobs, sometimes a week or more, without settling. I’ve found that washing with less than 97% leads to stickier prints, and the alcohol becomes contaminated faster due to the additional water. Some suggest curing suspended resin in the wash tank by leaving it out in the sun for a bit, then filtering it. Recently, I started using translucent clear plant-based resin, and the experience is considerably different from that of rapid resin.

When I finish printing with translucent clear Eco resin, I put the build plate with all my printed parts into a tank of soapy water in the wash station and run it for twenty minutes. The build plate is then removed, and the parts are separated from it, then scrubbed to remove any trapped resin the wash station missed. Next, the supports are snapped off. One further scrub, some light sanding, and another rinse, then on to curing. Seriously, I’m probably consuming a few gallons of water per print, and I wash my hands thoroughly several times throughout the process. No, I don’t wear gloves; this has never been a problem for me, as I’m extremely careful, and what resin has come in contact with my skin has never caused a problem. After I’m done, I then need to wash and scrub out the wash tank and build plate. This is far more labor-intensive than alcohol-soluble rapid resin, where I never clean the model beyond what the wash tank has done. It should be mentioned that I often print hollow models and very carefully place the drain and vent holes so they are hidden yet offer the best functionality. Trapped resin is rarely an issue. While Eco resin sounds environmentally friendly, when one takes the volume of water consumed into account, it may not be as friendly as it appears.

Today my $500 setup has a limited print volume of 5” x 3.2” x 6.3” and can produce a full-size single-color print in under eight hours; careful placement often means two to three hours. Other products in this 3D printer line are bigger and capable of print volumes nearly triple mine, but they are still single color. Producing a final product that has multiple colors requires multiple prints for the various different-colored parts, which you then need to assemble. Between colors, the tank needs to be emptied and cleaned. Designing something from scratch that screws or snaps together requires some serious modeling and slicing skills, along with the professional version of the slicer. A slicer is a program that takes your 3D model and turns it into a file your printer can understand. Even more important, though, the slicer helps you lay out your model on the build plate, detect and fix print problems with the model, make some adjustments, like creating weep holes and vents so that resin isn’t trapped inside hollow parts, and define supports connecting it to the build plate. How and where to place supports for a 3D model with a resin printer requires different considerations than a traditional deposition printer. So where is 3D printing headed?

3D resin printers often utilize a 4K monochrome screen that faces up from underneath the build tank. A UV light then shines through the screen and cures resin wherever there isn’t a negative print image. Initially, the slicer produces a raft, a flat surface on the build plate with a lip around the edge that enables sliding a scraper underneath when printing is done to separate the print from the build plate. Supports then grow out of the raft and connect it with the model. Models are often suspended five to ten millimeters below the build plate with these supports, enabling them to be easily separated from the raft after the build is completed. Once the printed model is rinsed and separated from the build plate, you need to remove all the supports, further clean the model, and often sand the surfaces where supports were attached to remove any pitting. Does this sound user-friendly? It isn’t; right now, it’s something that only a devoted hobbyist or employee would leverage to achieve an objective.

Many of us can envision a day when we order a product on Amazon, and while it’s in the cart Amazon checks the supplies in our home “Amazon Replicator” to determine if the proper color resins and cleaning supplies are available to build the product. If so then the order is accepted, and sliced print files for my model replicator are automatically downloaded and queued into the printer. When the parts eventually emerge in the product hopper, along with one-page assembly instructions, the customer is then charged for the print and additional replacement supplies are shipped out when necessary. The customer then assembles their product and they have what they ordered in hours rather than days.

Today we pour 200 ml or so of a single-color resin into the tank, then print that single color until the job is complete. What if the printer instead sprayed an appropriately thick layer of a specifically colored resin onto the tank bottom for the layers needing to be printed next in that color? Then, between layers requiring a color change, the tank bottom could be cleaned, drained, dried, and the new color sprayed thick enough for the next layers requiring that color. When the print is completed, the tank would be drained, cleaned, and dried for the next job. The build plate and the print would then be fully washed. At this point, simple robotics would be needed to separate the print from the supports coming up from the raft, using one of several types of tools to clip, melt, or vaporize support resin where it meets the model. At this stage in the process, the model is dropped into the curing bin for a few minutes, then released into the output bin. Meanwhile, the raft and supports are recycled, perhaps back into supplies or for return to Amazon for reprocessing. With advances in modeling, slicing, and printing we may eventually reach this point for some simple products, but given my experience, this is still a number of years away.

Regardless, it’s awesome to think that even today we can take an idea, create from it a 3D model, then slice and print it all in the same day! I really love technology.

7 Things I Learned From the IEEE Hot Interconnects Panel on SmartNICs

For the second year, I’ve had the pleasure of chairing the panel on SmartNICs. Chairing this panel is like being both an interviewer and a ringmaster. You have to ask thoughtful questions and respond to answers from six knowledgeable panelists while watching the Slack channel for real-time comments or questions that might further improve the event’s content. Recently I finished transcribing the talk and discovered the following seven gems sprinkled throughout the conversation.

7. Software today shapes the hardware of tomorrow. This is almost a direct quote from one of the speakers, but nearly half of the participants echoed it several times in different ways. One said that vertical integration means moving stuff done today in code for an Arm core into gates tomorrow.

6. DPUs are evolving into storage accelerators. Perhaps the biggest vendor mentioned adding compression, which means they are serious about soon positioning their DPU as a computational storage controller. 

5. Side-Channel Attacks (SCA) are a consideration. Only one vendor brought up this topic, but it was on the mind of several. Applying countermeasures in silicon to thwart side-channel attacks nearly doubles the number of gates for that specific cryptographic block. I understand that the countermeasures essentially consume the inverse power while also generating the inverse electromagnetic effects so that external measurements of the chip package during cryptographic functions yield a completely fuzzed result. 

4. Big Vendors are Cross-pollinating. We saw this last year with the announcement of the NVIDIA BlueField 2X, which includes a GPU on their future SmartNIC, but this appeared to be a bolt-on. NVIDIA’s roadmap didn’t integrate the GPU into the DPU until BlueField 4, several years out. Now Xilinx, who will soon be part of AMD, is hinting at similar things. Intel, who acquired Altera several years ago, is also bolting Xeons onto their Infrastructure Processing Unit (IPU).

3. Monterey will be the Data Center OS. VMware wasn’t on the panel, but some panelists had a lot to say on the topic. One mentioned that the data center is the new computer. This same panelist strongly implied that the future of DPUs lies in the control plane plugging into Monterey. Playing nicely with Monterey will likely become a requirement if you want to sell into future data centers.

2. The CPU on the DPU is fungible. The company seeking to acquire Arm mentioned they want to use a CPU micro-architecture in their DPU that they can tune. In other words, extending the Arm instruction set found in their DPU with additional instructions designed to process network packets. Now that could be interesting.

Finally, here’s a nerdy, plumbing-level change, but one that will change the face of SmartNICs and bring them enormous advantages: the move to chiplets. Today ALL SmartNICs or DPUs rely on a single die in a single package, then one or perhaps two packages on a PCIe card. In the future, a single chip package will contain multiple dies, each with different functional components, and possibly fabricated at different process nodes, so…

1. The inflection point for chiplet adoption is integrated photonics. Chiplets will become commonplace in DPU packages when there is a need to connect optics directly to the die in the package. This will enable very high-speed connections over extremely short distances.

9 Practical Resin Printing Suggestions

Just over six weeks and three liters of resin ago I received my Elegoo Mars 2 Pro Mono and the strongly suggested Elegoo Mercury Plus 2 in 1 washing and cleaning station. I ordered both these on Amazon for about $500 and they were extremely easy to set up and get working. Along with this order, I added a five-pack of Elegoo Release Film, Elegoo 3D Rapid Resin in clear red, a gallon of 99% Isopropyl Alcohol, 400 Grit sandpaper, and an AiBob Gun Cleaning Pad 16”x60” (this is a must-have). I’ve printed in both translucent red (2.5L) and flat black (0.5L). Also, I’ve been careful to hollow out models in the slicer, Chitubox, so that I’m using the minimum amount of resin necessary to print my models, and I’ve printed many with very little waste.

My resin printer setup, and yes a magnifying glass.

This printer is amazing. My prior experience was a few months, two years ago, with my son’s Creality Ender 3, a Fused Deposition Modeling (FDM) printer, your typical 3D printer. Eventually, we got the Creality producing usable results, but the difference between the Creality and Elegoo units is night and day. It would often take several tries to get the Creality to produce a workable print, even though I’d installed the unit in a cabinet in my office so the temperature and airflow were strictly managed, and we’d modified the printer to reduce the noise, upgraded the print heads, and improved the fans. But this post is about resin printing. My first resin print, and nearly every one since, has come out as expected. So here are my nine suggestions for those interested in trying resin printing using the Elegoo Mars 2P.

  1. Don’t Print Flat. Never print your model flat on the build plate. Because the printer exposes a layer, rises a bit in the build tank, then lowers again, shearing forces act on the supports, and a flat model could fail early. Also, you always have to pry your model off the build plate, so having a raft and supports that may take damage on removal is always better than scratching up or breaking your model. I’ve found that rotating my model so it’s inclined 10 degrees from the build surface, then elevating it 10mm off the build plate, produces the best results. Chitubox will then create a raft to bond to the build plate, and it will raise the edges to make prying your model off easier.
  2. Supports, you can never have too many. Be generous, add more, but make sure you’re bridging them from existing supports, or adding supports that you can then bridge from. You can always sand your model with 400-grit paper later to remove support marks. You can sometimes skip supporting finished surfaces, provided the surface facing the build plate is fully supported.
  3. Models should drain down. Make certain you orient your model so that it drains down into the build tank. Also, be sure you hollow out your model and set your wall thickness to something like 3mm. This can save you considerable resin, and using translucent resins with somewhat hollow models can create some interesting effects when viewing the model.   
  4. Different color resins require different slicer settings. For example, black requires almost 30% more exposure time than the translucent resins. The Chitubox V1.8.1 slicer is very flexible and makes it easy to make these adjustments. Here is a table that is invaluable when switching between resins.
  5. Never run out of resin during printing. I had this happen once, this morning; now I’m soaking the tank with alcohol and will try to remove the resin that’s bonded to the clear film on the bottom. Otherwise, I’ll need to replace the film.
  6. Have a ceiling fan on during printing and curing. You can use your printer in an office environment with normal indoor temperatures, and if you work carefully, gloves and a mask can be avoided. This printer is very quiet and prints much more quickly than a traditional FDM printer. Resin printers use a single stepper motor installed in the base that drives a single screw to raise and lower the build plate. The printer is fully enclosed, and the 2P has a fan and carbon filter, so only a small amount of smell leaves the unit. I’ve had the printer, not the wash station, running in the background while on Zoom calls, and nobody has ever said anything about hearing it.
  7. Lay down a felt rubberized gun mat ($10) on your work surface before installing the printer and cleaning station. It makes for an ideal work surface and wicks up the few droplets of alcohol that often fall everywhere. Before transferring a model from the printer to the cleaning tank, I tilt the build plate a little to drain off the excess resin, and I carefully move the build plate from the printer to the cleaning station without dropping resin on the pad. I’ve found that two inches of spacing between the printer and the cleaning station is enough to make lifting the covers and reaching around back to turn things off easy, while also limiting the travel distance for models that may still drip.
  8. Removing the Cover. Lift the Mars 2P lid with your fingers wrapping under the cover edge as you lift it off the printer. There is a silicone gasket on the bottom of the cover, and it will often rub the supports for the build tank, which will result in it falling off. If you carefully lift the cover and roll your fingers under the silicone gasket, you can prevent this from happening. I’ve considered gluing the gasket in place on the cover, but I think that’ll create other issues.
  9. Make sure you configure Chitubox with your specific printer model so it scales the build plate size properly and sets the other defaults. Chitubox and the Elegoo will allow the raft to fall slightly outside the build area and still print, but be careful. Chitubox is a simple slicer, and I’ve used several in the past, but it is very capable and does a nice job.

Well, that’s it for now. I hope you have an awesome time with your new printer. I’m sure there will be people out there who will insist I wear a respirator mask and rubber gloves when printing, cleaning, etc., but I have the ceiling fan on high, and my office is extremely clean and clutter-free, so that’s what works for me.

Is Serverless Also Trafficless?

Recently I read the article “Why Now Is The Time To Go Serverless” by Romi Stein, the CEO of OpenLegacy, a composable platform company. I agree with several of the points Romi made around the importance of APIs, micro-service architectures, and cloud computing, and I agree that serverless doesn’t truly mean computing without a server, but rather computing on servers owned and provisioned by major cloud providers. My main point of contention is that large businesses executing mission-critical functions in public clouds may eventually come to regret this move to a “Serverless” architecture, as it may also be “Trafficless.” Recently we’ve seen a rash of colossal security vulnerabilities from companies like SolarWinds and Microsoft (Exchange Server). Events like these should make us all pause and rethink how we handle security. Threat detection, and the resulting aftermath of a breach, especially in a composable enterprise highly dependent on a public cloud infrastructure, may be impossible because key data doesn’t exist or isn’t available.


In a traditional on-premises environment, it is generally understood that the volume of network traffic within the enterprise is often 10X that of the traffic entering and leaving it. One of the more essential strategies for detecting a potential breach within an enterprise is to examine, hopefully in near-real-time, both the internal and external network flows, looking for irregular traffic patterns. If you are notified of a breach, an analysis of these traffic patterns is often used to confirm that a breach has occurred. To service both of these tasks, copies are made of network traffic in flight; this is called traffic capture. The data may then be reduced and eventually shipped off to Splunk, or run through a similar tool, hopefully locally. Honestly, I was never a big fan of shipping off copies of a company’s network traffic to a third party for analysis; many of a company’s trade secrets reside in these digital breadcrumbs.

Is a serverless environment also trafficless? Of course not, that’s ridiculous, but are public cloud providers willing to, or even capable of, sharing copies of all the network traffic your serverless architecture generates? If they were, what would you do with all that data? Wait, here’s another opportunity for the public cloud guys. They could sell everyone another service that captures and analyzes all your serverless network traffic to tell you when you’ve been breached! Seriously, this is something worthy of consideration.

Data Types, Computation & Electronic Trading

I’ll return to the “Expanded Role of HPC Accelerators” in my next post, but before doing so, we need to take a step back and look at how data is stored to understand how best to operate on and accelerate that data. When you look under the hood of your car, you’ll find that there are at least six different types of mechanical fasteners holding things together.

Artificial Intelligence and Electronic Trading

Hoses often require a slotted screwdriver to remove, while others a Phillips head. Some panels require a star-like Torx wrench, while others a hexagonal Allen key, but the most common fasteners you’ll find are hex-headed bolts and nuts in both English and metric sizes. So why can’t the engineers who design these products select a single fastener style? Simple: each problem has unique characteristics, and these engineers are choosing the most appropriate solution for each. The same is true for data types within a computer. Ask almost anyone, and they’ll say that data is stored in a computer in bytes, just like your engine has fasteners.

Computers process data in many end-user formats, from simple text and numbers to sounds, images, video, and much more. Ultimately, it all becomes bits organized, managed, and stored as bytes. We then wrap abstraction layers around these collections of bytes as we shape them into numbers that our computer can process. Some numbers are whole, called “integers,” meaning they have no fractional component, for example, one, two, and three. Other “real” numbers contain a decimal point and can have anywhere from one to an infinite collection of digits to the right of the decimal. Computers process data using both of these numerical formats.
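Both ideas, text ultimately stored as integer bytes, and integers versus real numbers behaving differently under arithmetic, are easy to see in a few lines (a minimal Python sketch):

```python
# Text ultimately becomes integer bytes: the letter "A" is the value 65
print(ord("A"))                  # 65
print(format(ord("A"), "08b"))   # 01000001, the same value as one byte of bits

# Integers and real (floating-point) numbers behave differently under
# division: same operands, different operators, different result types
print(3 / 2)    # true division  -> 1.5, a real (float) result
print(3 // 2)   # floor division -> 1, an integer result
```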

Textual data is relatively easy to grasp, as it is often stored as an unsigned integer. For example, the letter “A” is stored using a standard that assigns it the value 65, which as a single byte is “01000001.” In Computer Science, how a number is represented can be just as important as the value of the number itself. One can store the number three as an integer, but when it is divided by the integer value two, some people would say the result is 1.5, and they could be wrong. Depending on the division operator used (some languages support several) and the data type assigned to the result, there could be several different answers, all of which would be correct within their context. As you move from integers to real numbers, or more specifically floating-point numbers, these numerical representations expand considerably based on the degree of precision required by the problem at hand. Today, some of the more common numerical data types are:

  • Integers, which can be signed or unsigned, come in at least seven distinct lengths: a nibble (four bits), a byte (eight bits), a half-word (sixteen bits), a word (thirty-two bits), a double word (sixty-four bits), an octaword (one hundred and twenty-eight bits), and arbitrary n-bit values.
  • Half Precision floating-point, also known as FP16, where the real number is expressed in 16 bits of storage: ten bits for the fractional component, five bits for the exponent, and a single sign bit.
  • Single Precision, also known as floating-point 32 or FP32, for 32 bits. Here we have 23 bits for the fractional component, eight for the exponent, and one for the sign.
  • Double Precision, also known as floating-point 64 or FP64, for 64 bits. Finally, we have 52 bits for the fractional component, 11 for the exponent, and one for the sign.
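The field layouts above can be verified by pulling a single-precision value apart into its sign, exponent, and fraction bits; here is a short Python sketch using the standard struct module (the function name is my own):

```python
import struct

def fp32_fields(x: float):
    """Split an IEEE-754 single-precision value into (sign, exponent, fraction)."""
    # Reinterpret the 32-bit float's bytes as an unsigned integer
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    sign = bits >> 31               # 1 sign bit
    exponent = (bits >> 23) & 0xFF  # 8 exponent bits (biased by 127)
    fraction = bits & 0x7FFFFF      # 23 fraction bits
    return sign, exponent, fraction

print(fp32_fields(1.0))   # (0, 127, 0): an exponent of zero is stored as 127
print(fp32_fields(-2.5))  # (1, 128, 2097152): -1.25 x 2^1
```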

There are other types, but these are the major ones. How computational units process data differs broadly based on the architecture of the processing unit. When we talk about processing units, we’re specifically talking about the Central Processing Unit (CPUs), Graphical Processing Unit (GPUs), Digital Signal Processor (DSPs), and Field Programmable Gate Array (FPGAs). CPUs are the most general. They can process all of the above data types and much more; they are the BMW X5 sport utility vehicle of computation. They’ll get you pretty much anywhere you want; they’ll do it in style and provide good performance. GPUs are that tricked out Jeep Wrangler Rubicon designed to crawl over pretty much anything you can throw at it while still carrying a family of four. On the other hand, DSPs are dirt bikes, they’ll get one person over rough terrain faster than anything available, but they’re not going to shine when they hit the asphalt. Finally, we have FPGAs, and they’re the Dodge Challenger SRT Demon in the pack; they never leave the asphalt; they just leave everything else on it behind. All that to say that GPUs and DSPs are designed to operate on floating-point data, while FPGAs do much better with integer data. So why does this matter?

Not every problem is suited to a slotted fastener; sometimes you’ll need a Torx, other times a hex-headed bolt. The same is true in computer science. If your system’s primary application is weather forecasting, which is floating-point intense, you might want vast quantities of GPUs. Conversely, if you’re doing genetics sequencing, where data is entirely integer-based, you’ll find that FPGAs may outperform GPUs, delivering up to a 100X performance-per-watt advantage. Certain aspects of Artificial Intelligence (AI) have benefited more from FP16-based calculations than from FP32 or FP64. In this case, DSPs may outshine GPUs in calculations per watt. As AI takes hold in key computational markets, we’ll see more and more DSP applications; one of these will be electronic trading.

Today the cutting edge of electronic trading platforms utilizes FPGA boards with many built-in ultra-high-performance networking ports; in some cases, these boards have upwards of 16 external physical network connections. Trading data and orders into markets are sent via network messages, which are entirely character, hence integer, based. These FPGAs contain code blocks that rapidly process this integer data, but computations slow down considerably as some of these integers are converted to floating-point for computation. For example, some messages use a twelve-character format for the price, where the first six digits are the whole number and the second six digits represent the decimal portion. So, a price of $12.34 would be represented as the character string “000012340000.” Other fields also use twelve-character values, but with the first ten digits as the whole number and the last two as the decimal value. In this case, 12,572.75 shares of a stock would be represented as “000001257275.” While doing financial computations with the price or quantity kept as characters is possible, it would be far more efficient if each were recast as a single-precision (FP32) number; then computation could proceed rapidly. Here’s where a new blend of FPGA processing, to rapidly move around character data, and DSP computing, for handling financial calculations using single-precision math, will shine. Furthermore, DSP engines are an ideal platform for executing trained AI-based algorithms that will drive financial trading moving forward.
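Converting those character fields into numbers is straightforward in software; a minimal Python sketch of the two layouts described above (the field widths follow the examples in the text; the function names are my own, and in a real trading stack this conversion would live in FPGA or DSP logic, not Python):

```python
def parse_price(field: str) -> float:
    # Hypothetical layout: six whole digits followed by six decimal digits
    return int(field[:6]) + int(field[6:]) / 1_000_000

def parse_quantity(field: str) -> float:
    # Hypothetical layout: ten whole digits followed by two decimal digits
    return int(field[:10]) + int(field[10:]) / 100

print(parse_price("000012340000"))     # 12.34
print(parse_quantity("000001257275"))  # 12572.75
```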

Someday soon, we’ll see trading platforms that execute entirely on a single high-performance chip. This chip will contain a blend of large blocks of adaptable FPGA logic, we’re talking millions of logic tables, along with thousands of DSP engines and dozens of high-performance network connections. This will enable intelligent trading decisions to be made and orders generated in tens of billionths of a second!

The Expanded Role of HPC Accelerators, Part 1

When Lisa Su, CEO of AMD, presented a keynote talk at CES 2021 last week, she reminded me of a crucial aspect of High-Performance Computing (HPC) that often goes unnoticed. HPC is at the center of many computing innovations. Since SC04, I’ve attended the US SuperComputing conference in November pretty much religiously every other year or so. SC is where everyone in technology hauls out their best and brightest ideas and technologies, and I’ve been thrilled over the years to be a part of it with NEC, Myricom, and Solarflare. Some of the most intelligent people I’ve ever met or had the pleasure of working with, I first met at an SC event or dinner. SC though is continuously changing; just today, I posted a reference to the Cerebras CS-1, which uses a single chip that measures 8.5″ on each side to achieve SC performance that is 200X faster than #466 on the Top500.org list of supers. High-Performance Computing (HPC) is entering its fourth wave of innovation.

Cerebras CS-1 Single System SuperComputer Data Center in a Box

The first was defined by Seymour Cray in the mid-1970s when he brought out the vector-based supercomputer. The second wave was the clustering of Linux computers, which started to become a dominant force in HPC in the late 1990s. When this began, Intel systems were all single-core, with some supporting multiple CPU sockets. The “free” Linux operating system and low-cost Gigabit Ethernet (GbE) were the catalysts that enabled universities to quickly and easily cobble together significantly robust systems. Simultaneously, the open development of the Message Passing Interface (MPI) was completed, which made it much easier to port existing first-wave HPC applications over to clustered Linux systems without having to use TCP/IP. This second wave brought about advancements in HPC networking and storage that further defined HPC as a unique market. Today we’re at the tail end of the third wave of innovation, driven by the Graphics Processing Unit (GPU). Some would say the dominant HPC brand today is NVIDIA because they’ve pushed the GPU envelope further and faster than anyone else, and they own Mellanox, the InfiniBand networking guys. Today, our focus is the expanding role of accelerators, beyond GPUs, in HPC, as they will define this new fourth wave of innovation.

Last week I thought this fourth wave would be defined by a whole new mode where all HPC computations are pushed off to special-purpose accelerators. These accelerators would then leverage the latest advances in the PCI Express bus, new protocols for this bus, and the addition of Non-Volatile Memory Express (NVMe) for storage. The fourth, and soon fifth, generation of the PCIe bus has provided dramatic speed improvements and support for two new protocols (CXL & CCIX). Then along came the Cerebras CS-1, utilizing an 8.5″-square chip package that holds a single gargantuan chip with over a trillion transistors. While I think Cerebras may stand alone for some time with this single-chip approach, it won’t be long before AMD considers the possibility of pouring hundreds of Zen3 chiplets into a package with an Infinity Fabric that is MUCH larger than anything previously utilized. Imagine a single package that rivals Cerebras at 8.5″ square with hundreds of Zen3 chiplets (each of these is eight x86 cores sharing a common L3 cache), a large number of High Bandwidth Memory (HBM) chiplets, some FPGA chiplets contributed by Xilinx, Machine Learning (ML) chiplets from Xilinx’s latest Versal family, and chiplets for encryption and 100GbE or faster networking. Talk about a system on a chip; this would be an HPC super in a single package, rivaling many multi-rack systems on the Top500.org.

More to come in Part II, where I’ll explain in more detail what I’d been thinking about regarding accelerators.

Performance vs. Perception

Tesla Model 3 Performance Edition and Polaris Slingshot S

Our technology-focused world has jaded us, blurring the line between our understanding of performance and perception. It is easier than you might think to take factual data and conflate it with how something feels, our perception of what we experience. Performance is factual. A 2020 Tesla Model 3 Performance Edition accelerates from 0-60 MPH in 3.5 seconds. In comparison, a 2019 Polaris Slingshot Model S accelerates from 0-60 MPH in 5.5 seconds. On paper, the Slingshot takes 57% longer to reach 60 MPH than the Tesla Model 3 Performance Edition; the data can easily be looked up and confirmed; these are empirical facts. They’re also fairly easy to verify in a large parking lot or on a back road. The Slingshot accelerates over two seconds slower than the Tesla, both pictured to the right, but if you were to sit in the passenger’s seat of both, during the test, without a stopwatch, I’d bet serious money you’d say the Slingshot was faster.

Perception is a funny thing, all our senses are in play, and they help formulate our opinion of what we’ve experienced. The Tesla Model 3 Performance Edition is a superb vehicle, my cousin has one, and he can’t gush enough about how amazing it is. When you sit in the passenger’s seat, you experience comfortable black leather, plush cushioning, and 12-way adjustable heated seats. The cabin is climate-controlled, sound dampened, and trimmed in wood accents with plenty of glass and steel, making you feel safe, secure, and comfortable. Accelerating from 0-60 MPH is a simple task; the driver stomps on the accelerator, and the car does all the work, shifting as it needs to, with little engine noise. The 3.5 seconds fly by as the vehicle blows past 60 MPH like a rocket sled on rails. So how can a three-wheeled ride like the Slingshot ever compare?

If you’ve not experienced the Slingshot, it’s something entirely different, engaging all your senses much like riding a motorcycle. There are only three wheels, two seats, no doors, and even the windshield and roof are optional. The standard passenger’s seat has one position, all the way back, and it isn’t heated. The seat is made from rigid foam covered in all-weather vinyl; luxury and comfort were not design considerations. Did I mention there are no doors? The cabin is open to the world; you see, hear, and smell everything; there’s no wood trim; and the climate is the climate, no heat or A/C. With less than six inches of ground clearance, your bottom is 10″ off the surface of the road, and an average-height person can easily reach down and touch the road, although I wouldn’t recommend it while moving.

The Slingshot assaults each and every sense as it shapes your perception of accelerating from 0-60 MPH. The driver shifts from first through second and into third, slipping and chirping the back wheel with each of the three transitions from a standing start. The roar of the engine fills your ears; as you’re shoved back into the hard seat, you grab for the roll bar, the knuckles of your right hand turn white, you catch a whiff of the clutch, and you feel the air blow back your hair. It is an incredibly visceral experience, all while your smile grows into an ear-to-ear grin. Those 5.5 seconds could even be six, given the traction lost as you chirped the rear wheel, but it wouldn’t matter; your passenger would swear on a Bible it was three seconds. How is this possible?

How could someone who’s been a passenger in both cars ever think the Slingshot was faster? Simple: perception. While the Tesla engages your eyes and your ability to sense acceleration, that’s it. The Tesla was designed to shield you from the outside world while wrapping you in comfort, and they’ve done a fantastic job. Conversely, the Slingshot is all about exposing every nerve and sense in your body to the world around you. Going around some turns, you might feel the slightest amount of drift as the lone rear tire gently swings toward the outside of the turn.

The above example shows that feelings can sometimes overcome facts. We live in a technological world where facts, like hard performance data, and our emotions and perceptions can easily be conflated. We all need to keep this in mind as we evaluate new solutions.

P.S. Yes, for those who’ve asked, the 2019 Slingshot S pictured above has been my project car for the past two months. The guy I bought it from in early November had purchased it new in May and installed the roll bars and canvas top. Since then, I’ve made a dozen other improvements, from adding a full-height windshield to a 500-watt Bluetooth amplifier with a pair of 6.5″ Kicker speakers (it didn’t come with a radio).

The Jetsons, Video & Acceleration

The Jetsons – Hanna Barbera(c) 1962

Fifty-eight years ago last month, “The Jetsons” zipped into pop culture in a flying car and introduced us to many new fictional technologies like huge flat-screen TVs, tablet-based computing, and video chat. Some of these technologies had been popularized long before “The Jetsons”; video chat, previously known as videotelephony, was popularized as far back as the 1870s, but it took technology over a century to deliver the first commercially viable product. Today, with the pandemic separating many of us from our loved ones and our workplaces, video chat has become an instrumental part of our lives. What most people don’t realize, though, is that video is extremely data-intensive, especially at the viewing resolutions we’ve all become accustomed to. Serving that video requires translating a high-definition (1080p) video into a half dozen different resolutions, known as transcoding, to support most devices, including mobile, and to support various bandwidths. This is often done in real time and typically requires one or more CPU cores on a standard server. Now consider all the digital video you consume daily, both in video chats and via streaming services; all this content needs to be transcoded for your consumption.
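To see just how data-intensive, a quick back-of-the-envelope calculation helps; the parameters here (8-bit 4:2:0 sampling at 30 frames per second) are my assumptions, chosen as typical 1080p values:

```python
# Uncompressed 1080p video data rate, assuming 8-bit 4:2:0 chroma subsampling at 30 fps.
width, height, fps = 1920, 1080, 30
bytes_per_frame = width * height * 1.5          # 4:2:0 stores 1.5 bytes per pixel
raw_bytes_per_second = bytes_per_frame * fps    # what you'd move with no compression
print(raw_bytes_per_second / 1e6)               # ~93.3 megabytes every second
```

That roughly 93 MB/s, about 750 Mbit/s, is why codecs and hardware transcoders matter: a 10 Mbit/s H.264 stream carries the same picture in about 1/75th of the raw data.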

Transcoding video can be very CPU-intensive, and the program typically used, FFmpeg, is very efficient at utilizing all the computational resources available. On Ubuntu Linux 16.04 with FFmpeg 2.8.17, my experience is that unconstrained FFmpeg will consume 92% of the system’s available compute power. My test system is an AMD Ryzen 5 3300G clocked at 4.2GHz with a Xilinx Alveo U30 video accelerator card; it is a hyperthreaded quad-core system. For the testing, I produced two sample videos, one from Trevor Noah’s October 15th, 2020 “Cow Hugging” episode and the other from John Oliver’s October 11th, 2020 “Election 2020” episode. Using the command-line options shared below, here are the results, in seconds, of three successive runs of both files, once through the AMD processor and once offloading the transcoding to the Xilinx Alveo U30.

Raw Data – Transcoding on AMD Ryzen 5 Versus Xilinx U30 Alveo Video Accelerator

From this, one can draw several conclusions, but the one I find most fitting is that the Xilinx Alveo U30 can transcode content 8X faster than a single AMD Ryzen 5 core at 4.2GHz. Note that this is still development-level code; the general-availability code has not shipped yet. It is also only utilizing one of the two encoding engines on the U30, so additional capacity is available for parallel encoding. As more is understood, this blog post will be updated to reflect new developments.


10/20/20 – It has been suggested that I share the options used when calling ffmpeg for both the AMD CPU execution and the Xilinx Alveo U30. Here are the two sets of command line options used.

The script that calls the AMD processor to do the work used the following options with ffmpeg:

-f rawvideo -b:v 10M -c:v h264 -y /dev/null

The script that calls the Xilinx Alveo U30 used the following options:

-f rawvideo -b:v 10M -maxbitrate 10M -c:v mpsoc_vcu_h264 -y /dev/null

Dropping the “-maxbitrate 10M” option on one Alveo run later in the day yesterday didn’t seem to change much, but this will be explored further. It has also been suggested that I look into the impact of “-preset,” which affects quality, and how that might perform differently on the two platforms.
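For context, a complete invocation combining those options might look like the sketch below; everything on the input side (pixel format, resolution, frame rate, and the file name episode.yuv) is my assumption, since only the output options were recorded above:

```shell
# Sketch only: input-side flags and the file name are assumptions, not the original script.
# CPU path (software encode on the Ryzen):
time ffmpeg -f rawvideo -pix_fmt yuv420p -s 1920x1080 -r 30 -i episode.yuv \
    -b:v 10M -c:v h264 -y /dev/null

# Offload path (Xilinx Alveo U30 encode):
time ffmpeg -f rawvideo -pix_fmt yuv420p -s 1920x1080 -r 30 -i episode.yuv \
    -b:v 10M -maxbitrate 10M -c:v mpsoc_vcu_h264 -y /dev/null
```

Writing to /dev/null keeps disk I/O out of the measurement, so the `time` output reflects transcode throughput alone.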

What is Confidential Computing?

Confidential Computing Consortium

Data exists in three states: at rest, in flight, and in use. Over recent years, the security industry has done an excellent job of providing solutions for securing data at rest, such as data stored on a hard drive, and data in flight, think web pages delivered via HTTPS. Unfortunately, those looking to steal data are very familiar with these advances, so they probe the entire system searching for new vulnerabilities. They’ll look at code that hasn’t been touched in years, or even decades (Shellshock), and at architectural elements of the system that were previously trusted, like memory (Meltdown), cache, and CPU registers (Spectre). Confidential Computing addresses the third state, data in use, by providing a hardware-based trusted execution environment (TEE). Last spring, the Linux Foundation realized that extensive reliance on public clouds demanded a more advanced, holistic approach to security; hence, they launched the Confidential Computing Consortium.

The key to Confidential Computing is building a TEE entirely in hardware. The three major CPU platforms all support some form of a TEE, and they are Intel’s Software Guard Extensions (SGX), AMD’s Secure Encrypted Virtualization (SEV), and ARM’s TrustZone. Developers can leverage these TEE platforms, but each is different, so code written for SGX will not work on an AMD processor. To defeat a TEE and access your most sensitive data, an attacker will need to profile the server hardware to determine which processor environment is in use. They will then need to find and deploy the appropriate vulnerability for that platform if one exists. They also need to ensure that their exploit has no digital or architectural fingerprints that would make attribution back to them a possibility when the exploit is eventually discovered. 
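To illustrate that profiling step, here is a hedged Python sketch; the mapping of Linux /proc/cpuinfo flag names to TEE platforms is my assumption for illustration, and real reconnaissance (and defense) tooling is far more thorough:

```python
def detect_tee(cpuinfo_text: str) -> list:
    """Guess which TEE platforms a CPU advertises, from a /proc/cpuinfo dump."""
    # Flag-name-to-platform mapping is an assumption for illustration only.
    tee_flags = {"sgx": "Intel SGX", "sev": "AMD SEV", "sme": "AMD SME"}
    found = set()
    for line in cpuinfo_text.splitlines():
        if line.startswith("flags") or line.startswith("Features"):  # x86 / ARM styles
            for flag in line.split(":", 1)[1].split():
                found.add(tee_flags.get(flag))
    found.discard(None)
    return sorted(found)

sample = "flags\t: fpu vme sev sme"
print(detect_tee(sample))  # ['AMD SEV', 'AMD SME']
```

The point is that each vendor’s TEE announces itself differently, which is exactly why an exploit written for one platform is useless against another.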

Creating a trusted execution environment in hardware requires the host CPU to be intimately involved in the process. AMD, ARM, and Intel each provide their own hardware support for building a TEE, and each has its benefits. Two security researchers, one from Wayne State University and the other from the University of Houston, produced an excellent comparison of AMD and Intel’s platforms. For Intel, they stated:

“We conclude that Intel SGX is suited for highly security-sensitive but small workloads since it enforces the memory integrity protection and has a limited amount of secure resources.”

Concerning AMD:

“AMD SME and SEV do not provide memory integrity protection. However, providing a greater amount of secure resources to applications, performing faster than Intel SGX (when an application requires a large amount of secure memory), and no code refactoring, make them more suitable for complex or legacy applications and services.” 

Based on the work of these researchers, it would seem that AMD has a more comprehensive platform and that their solution is considerably more performant than Intel’s SGX.

So how does Confidential Computing establish a trusted execution environment? Today the Confidential Computing Consortium has three contributed projects, the Intel SGX SDK, the Open Enclave SDK, and Enarx, and each has its own take on this objective:

For the past two decades, the US and the UK have used the construct of a Sensitive Compartmented Information Facility (SCIF, pronounced “skiff”) to manage classified data. A SCIF is an enclave, a private space surrounded by public space, with a very well-defined set of procedures for securely using data within this private space and moving data into and out of it. Intel adopted some of these same concepts when it defined the Software Guard Extensions (SGX). SGX is a set of processor instructions that first appeared in Skylake. When SGX instructions are used, the processor builds a private enclave in memory where all the data and code in that memory region are encrypted. That region is further walled off from all other processes, so they don’t have access to it, even those with a higher privilege. As the processor fetches instructions or data from that enclave, it decrypts them in flight, and if a result is to be stored back in the enclave, it is encrypted in flight before it is stored.

When Intel rolled out SGX in 2015, it immediately became the safe that all safecrackers wanted to defeat. In computer science, the safecrackers are security researchers, and in the five years since SGX was released, we’ve seen seven well-documented exploits. The two that exposed the most severe flaws in SGX were Prime+Probe and Foreshadow. Prime+Probe was able to grab the RSA keys that secured the encrypted code and data in the enclave; within six months, a countermeasure was published to disable it. Foreshadow was a derivative of Spectre and used flaws in speculative execution and buffer overflow to attack the secure enclave. SGX is a solid start with regard to building a trusted execution environment in hardware. WolfSSL has also adopted SGX and tied it to its popular TLS/SSL stack to provide a secure connection into and out of an SGX enclave.

The Open Enclave SDK claims to be hardware-agnostic, a software-only platform for creating a trusted execution environment, yet it requires SGX with Flexible Launch Control (FLC) as a prerequisite for installation; it is an extension of SGX and only runs on Intel hardware. Recently, a technology preview was made available for the Open Portable Trusted Execution Environment OS on ARM that leverages TrustZone. At this point, there appears to be no support for AMD’s platform.

Enarx is also hardware-agnostic, but it is an application launcher designed to support Intel’s SGX and AMD’s Secure Encrypted Virtualization (SEV) platforms. It does not require that applications be modified to use these trusted execution environments; when delivered, this would be a game-changer. “Enarx aims to make it simple to deploy workloads to a variety of different TEEs in the cloud, on your premises or elsewhere, and to allow you to have confidence that your application workload is as secure as possible.” At this point, Enarx hasn’t mentioned support for ARM’s TrustZone technology. There is tremendous promise in the work the Enarx team is doing, and they appear to be making substantial progress.

The Confidential Computing Consortium is still less than a year old, and it has attracted all the major CPU and data center players as members. Its goal is an ambitious one, but with projects like Enarx well underway, there is hope that securing data in use will soon become commonplace throughout on-premises and cloud environments.

*Note: this story was originally written for LinkedIn on July 12, 2020.