When I get together with friends, questions about AI come up once in a while, and invariably they steer the discussion toward a reference to Skynet. For those not plugged into the zeitgeist, Skynet is the AI in “The Terminator” that is out to exterminate humanity. Now, as a chess player, I’ll acknowledge that while the possibility exists, the likelihood of humanity going down that path is extremely low. My latest concern is the coming battle over energy.
NVIDIA held its annual GTC (GPU Technology Conference) in San Jose earlier this week. Jensen Huang, their CEO, unveiled their next-generation DGX [48:24], an AI supercomputer in a single rack. For those not in technology, think of a cabinet-sized box six feet tall, two feet wide, and three feet deep that draws an astonishing 120 kW of power while performing 1.4 exaFLOPS. For contrast, the DGX consumes energy at a rate equal to 100 average American homes (a home consumes about 10,500 kWh per year). It does math at a rate equal to all eight billion people on the planet each doing one calculation per second on a calculator for roughly five and a half years, non-stop. That’s what this machine can complete in one second.
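A quick back-of-envelope check of those two comparisons, using only the figures cited above (the 120 kW and 1.4 exaFLOPS numbers are from the keynote; the arithmetic is my own):

```python
# Sanity-check the DGX comparisons; all input figures come from the text.
DGX_POWER_KW = 120            # continuous draw of one DGX rack
HOME_KWH_PER_YEAR = 10_500    # average US home
DGX_FLOPS = 1.4e18            # FP4 operations per second
PEOPLE = 8e9                  # world population, one calculation/sec each

HOURS_PER_YEAR = 24 * 365
dgx_kwh_per_year = DGX_POWER_KW * HOURS_PER_YEAR
homes = dgx_kwh_per_year / HOME_KWH_PER_YEAR

SECONDS_PER_YEAR = 3600 * HOURS_PER_YEAR
years = DGX_FLOPS / (PEOPLE * SECONDS_PER_YEAR)

print(f"One DGX rack = ~{homes:.0f} average homes")            # ~100
print(f"One DGX-second = ~{years:.1f} years for all humanity")  # ~5.5
```

Both comparisons hold up: the rack really is about 100 homes' worth of energy, and one second of its output takes humanity over five years to match.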
In November, there were precisely two publicly announced systems in the world capable of achieving an exaFLOP: Frontier and Aurora, both US Department of Energy supercomputer clusters. One is in Oak Ridge, TN, the other at Argonne National Laboratory, and each consumes roughly 200X the power of the DGX above. These are also massive systems, often in the 200-rack range, though the move to GPUs has helped: Frontier occupies only 74 racks, while Aurora has 166. The main point Jensen was making is that a single DGX approaches the computational power of these data center clusters.
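To put a number on that 200X, here is the same ratio computed against the power figures I recall from the November Top500 list (roughly 23 MW for Frontier and 25 MW for Aurora; treat these as ballpark values, not official numbers):

```python
# Approximate power draw in kW, from my recollection of the Nov 2023
# Top500 list -- ballpark figures, not official numbers.
FRONTIER_KW = 22_700
AURORA_KW = 24_700
DGX_KW = 120  # the new DGX rack, from the keynote

for name, kw in [("Frontier", FRONTIER_KW), ("Aurora", AURORA_KW)]:
    print(f"{name} draws ~{kw / DGX_KW:.0f}x the power of one DGX rack")
```

Under those assumptions, the ratio lands in the 190-210X range, consistent with the "200X" claim above.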
Those close to this technology would argue that NVIDIA is gaming its exaFLOPS number because its calculation differs from the one Frontier and Aurora must run to make the Top500 list. Frontier and Aurora report their numbers while running the Linpack benchmark using double-precision 64-bit floating-point numbers. They cannot employ mathematical tricks that shorten number formats, reduce results, or optimize matrix multiplication with innovative new algorithms. Jensen, on the other hand, is a magician performing unconstrained: he tosses out his exaFLOPS number using the FP4 data type. This is the absolute smallest number format defined today, just 1/16th the size of the numbers used for Linpack, and trust me, size matters in more ways than one. Furthermore, Jensen’s exaFLOPS metric benefits from many of the latest tricks, including ways to shrink number sizes and reduce the number of terms you need to operate on, some of them performed in the networking cards.
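To make the size difference concrete, here is a toy quantizer. I am assuming FP4 here means the E2M1 layout (1 sign bit, 2 exponent bits, 1 mantissa bit); if so, every positive value the format can represent fits in an eight-entry list:

```python
# Toy FP4 quantizer, assuming the E2M1 layout (an assumption on my
# part). The point stands regardless: 4 bits buy almost no precision.
FP4_POSITIVE = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def quantize_fp4(x: float) -> float:
    """Round x to the nearest representable E2M1 value (saturating)."""
    sign = -1.0 if x < 0 else 1.0
    mag = min(abs(x), 6.0)  # clamp to the largest representable value
    return sign * min(FP4_POSITIVE, key=lambda v: abs(v - mag))

print(quantize_fp4(3.14159))                  # pi collapses to 3.0
print(quantize_fp4(2.4), quantize_fp4(-0.7))  # 2.0 and -0.5
print(f"FP64 uses {64 // 4}x as many bits per number as FP4")
```

An FP64 value carries 15-16 significant decimal digits; the FP4 grid above cannot even distinguish pi from 3. That is why FLOPS counted in FP4 and FLOPS counted in FP64 are not the same currency.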
Let’s get back to power. The US energy grid is under intense pressure from rapid growth in demand driven by the widespread adoption of Electric Vehicles (EVs), including electric commercial trucks and, soon, tractor-trailers. Tesla is rolling out a new charging station every day, Shell and others are looking to jump into this market, and all this power for EVs must come from somewhere. Thankfully, while EV electricity demand is growing, homeowners are increasingly installing solar panels on their roofs to offset their use, particularly if they have an EV or two in the garage. The US federal government recently adopted new rules designed to shift the bulk of new vehicle production to EVs and hybrids by 2032. While the growth in solar may offset the drain EVs place on the grid, NVIDIA’s new DGX changes everything.
Current data centers were designed around racks consuming 10-20 kW each; some permit up to 70 kW per cabinet with a 300 kW commitment, but this is pretty new. All of this includes the matching cooling, which is just as vital but often overlooked. NVIDIA’s new DGX consumes 6-12X more power than the racks currently deployed, and that is the first part of the problem. As DGX systems begin shipping, we will see data center demand for power explode like never before. People have realized that the benefits of an AI grow as the models it is trained on grow, and those models are growing geometrically. These larger models need even larger DGX-class systems to run on. Unless something changes soon, at some point in the future we may be competing with AI systems for electrical power.
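The multiple in that first sentence is just the ratio of the rack figures cited above; a one-line check:

```python
# Ratio of the DGX's draw to a conventional data center rack,
# using only the figures cited in the text.
DGX_KW = 120
TYPICAL_RACK_KW_LOW, TYPICAL_RACK_KW_HIGH = 10, 20

print(f"{DGX_KW / TYPICAL_RACK_KW_HIGH:.0f}-"
      f"{DGX_KW / TYPICAL_RACK_KW_LOW:.0f}x a conventional rack")  # 6-12x
```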