Of course the company that figured out chiplets says the answer is more chiplets
Within the next 10 years, the world’s most powerful supercomputers won’t just simulate nuclear reactions, they may well be powered by them. That is, if we don’t take drastic steps to improve the efficiency of our compute architectures, AMD CEO Lisa Su said during her keynote at the International Solid-State Circuits Conference this week.
The root of the problem, Su says, is that while chipmakers like AMD and Intel have managed to roughly double the performance of their CPUs and GPUs every 2.4 years, and system builders like HPE, Atos, and Lenovo have achieved similar gains roughly every 1.2 years at the system level, power efficiency is lagging behind.
Citing performance and efficiency figures gleaned from the top supercomputers, AMD says gigaflops-per-watt is doubling roughly every 2.2 years, about half the pace at which the systems themselves are growing.
Assuming this trend continues unchanged, AMD estimates that we’ll reach a zettaflop-class supercomputer in about a decade, give or take. For reference, the US powered on its first exascale supercomputer, Oak Ridge National Laboratory’s Frontier system, last year. A supercomputer capable of a zettaflop of FP64 performance would be 1,000x more powerful.
To AMD’s credit, its estimate for when we’ll cross the zettaflop barrier is at least a little more conservative than Intel’s rather hyperbolic claim that it’d cross that threshold by 2027. What’s more, the AMD CEO says such a machine won’t exactly be practical unless compute architectures get drastically more efficient, and soon.
If things continue on their current trajectory, AMD estimates that a zettaflop-class supercomputer would need somewhere in the neighborhood of 500 megawatts of power. “That’s probably too much,” Su admits. “That’s on the scale of what a nuclear power plant would be.”
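For the curious, Su’s ballpark roughly pencils out. Here’s a back-of-the-envelope sketch in Python, assuming (our assumption, not the keynote’s) an exaflop-class baseline that draws a little over 20 megawatts, roughly what Frontier reportedly consumes, and that both doubling rates hold:

```python
# Back-of-the-envelope extrapolation of the doubling rates cited above.
# Assumption (not from the keynote): an exaflop-class baseline drawing ~20 MW.
import math

perf_doubling_years = 1.2   # system-level performance doubles every ~1.2 years
eff_doubling_years = 2.2    # gigaflops-per-watt doubles every ~2.2 years

# Years for system performance to grow 1,000x (exaflop -> zettaflop)
years_to_zetta = math.log2(1000) * perf_doubling_years   # ~12 years

# Efficiency gain accumulated over that same stretch
eff_gain = 2 ** (years_to_zetta / eff_doubling_years)    # ~43x

# Power scales with the performance gain divided by the efficiency gain
power_mw = 20 * 1000 / eff_gain                          # ~460 MW

print(f"~{years_to_zetta:.0f} years, ~{eff_gain:.0f}x efficiency, ~{power_mw:.0f} MW")
```

Call it half a gigawatt, which is indeed nuclear-power-plant territory.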
“This flattening of efficiency becomes the largest challenge that we have to solve, both from a technology standpoint as well as from a sustainability standpoint,” she said. “Our challenge is to figure out how over the next decade we think about compute efficiency as the number one priority.”
Correcting course
Part of the problem facing chipmakers is that the means they’ve traditionally relied on to achieve generational efficiency gains are becoming less effective.
Echoing Nvidia’s leather jacket aficionado and CEO Jensen Huang, Su admits Moore’s Law is slowing down. “It’s getting much, much harder to get density performance as well as efficiency” out of smaller process tech.
“As we get into the advanced nodes, we still see improvements, but those improvements are at a much slower pace,” she added, referencing efforts to shrink process tech much beyond 5nm or even 3nm.
But while improvements in process tech are slowing down, Su argues there are still opportunities to be had, and, perhaps unsurprisingly, most of them center around AMD’s chiplet-centric worldview. “The package is the new motherboard,” she said.
Over the past few years, several chipmakers have embraced this philosophy. In addition to AMD, which arguably popularized the approach with its Epyc datacenter chips and later brought the tech to its Instinct GPUs, chipmakers — including Intel, Apple, and Amazon — are now employing multi-die architectures to combat bottlenecks and accelerate workloads.
Chiplets, argues the AMD boss, will let chipmakers pick three pieces of low-hanging fruit when it comes to compute efficiency: compute energy, communications energy, and memory energy.
Modular chiplet or tile architectures have numerous advantages. For instance, they allow chipmakers to use the optimal process tech for each component. AMD uses some of TSMC’s densest process tech for its CPU and GPU dies, but often employs larger nodes for things like I/O and analog signaling, which don’t scale as efficiently.
Chiplets also help reduce the amount of power required for communications between the components since the compute, memory, and I/O can be packaged in closer proximity. And when stacked vertically, as AMD has done with SRAM on its X-series Epycs and Intel is doing with HBM on its Ponte Vecchio GPUs, the gains are even greater, the chipmakers claim.
AMD expects advanced 3D packaging techniques will yield 50x more efficient communications compared to conventional off-package memory and I/O.
This is no doubt why AMD, Intel, and Nvidia have started integrating CPUs, GPUs, and AI accelerators into their next-gen silicon. For example, AMD’s upcoming MI300 will integrate its Zen 4 CPU cores with its CDNA3 GPUs and a boatload of HBM memory. Intel’s Falcon Shores platform will follow a similar trajectory. Meanwhile, Nvidia’s Grace Hopper superchips, while not integrated to the same degree, still co-package an Arm CPU and 512GB of LPDDR5 alongside a Hopper GPU die and 80GB of HBM.
AMD isn’t stopping at CPUs, GPUs, or memory either. The company has thrown its support behind the Universal Chiplet Interconnect Express (UCIe) consortium, which is trying to establish standards for chiplet-to-chiplet communication, so a chiplet from one vendor can be packaged alongside one from another.
AMD is also actively working to integrate IP from its Xilinx and Pensando acquisitions into new products. During her keynote, Su highlighted co-packaged optical networking, stacked DRAM, and even in-memory compute as potential opportunities to further improve power efficiency.
Is it time to give AI a crack at HPC?
But while there’s opportunity to improve the architecture, Su also suggests that it may be time to reevaluate the way we go about conducting HPC workloads, which have traditionally relied on high-precision computational simulation using massive datasets.
Instead, the AMD CEO makes the case that it may be time to make heavier use of AI and machine learning in HPC. And she’s not alone in thinking this. Nvidia and Intel have both been pushing the advantages of lower precision compute, particularly for machine learning where trading a few decimal places of accuracy can mean the difference between days and hours for training.
Nvidia has arguably been the most egregious, claiming systems capable of multiple “AI exaflops.” What they conveniently leave out, or bury in the fine print, is the fact they’re talking about FP16, FP8, or Int8 performance, not the FP64 calculations typically used in most HPC workloads.
“Just taking a look at the relative performance over the last 10 years, as much as we’ve improved in traditional metrics around SpecInt Rate or flops, the AI flops have improved much faster,” the AMD chief said. “They’ve improved much faster because we’ve had all these mixed precision capabilities.”
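To make the trade-off concrete, here’s a minimal, hypothetical NumPy sketch (ours, not the keynote’s) of what giving up those decimal places looks like: summing the same 10,000 numbers with an FP64 accumulator and an FP16 one.

```python
# Toy illustration of "trading a few decimal places": the FP16 accumulator runs
# out of resolution long before the true total is reached. Real mixed-precision
# schemes avoid this by keeping accumulators in higher precision.
import numpy as np

rng = np.random.default_rng(0)
values = rng.random(10_000)          # uniform values in [0, 1)

fp64_sum = np.float64(0.0)
fp16_sum = np.float16(0.0)
for v in values:
    fp64_sum += np.float64(v)
    fp16_sum = np.float16(fp16_sum + np.float16(v))   # sequential FP16 accumulation

print(f"FP64 total: {fp64_sum:.2f}")          # ~5,000, good to ~15 significant digits
print(f"FP16 total: {float(fp16_sum):.2f}")   # stalls near 2,048, where adding a
                                              # value < 1 no longer moves the FP16 accumulator
print(f"relative error: {abs(float(fp16_sum) - fp64_sum) / fp64_sum:.1%}")
```

The point is simply that “AI flops” and FP64 flops are not interchangeable currencies.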
One of the first applications of AI/ML for HPC could be for what Su refers to as AI surrogate physics models. The general principle is that practitioners employ traditional HPC in a much more targeted way and use machine learning to help narrow the field and reduce the computational power required overall.
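What might that look like in practice? Here’s a deliberately tiny, hypothetical sketch, with an invented `expensive_simulation()` standing in for a real physics code: run the simulator on a handful of design points, fit a cheap surrogate, let the surrogate screen a large candidate set, and spend the expensive FP64 cycles only on a shortlist.

```python
# Hypothetical surrogate-assisted search. expensive_simulation() is a stand-in
# for a costly FP64 physics run, not any real HPC code.
import numpy as np

def expensive_simulation(x: float) -> float:
    """Stand-in for an expensive simulation (here: a cheap analytic function)."""
    return np.sin(3 * x) * np.exp(-0.5 * x) + 0.05 * x**2

rng = np.random.default_rng(1)

# 1. Run the real simulation on a small training set of design points.
train_x = rng.uniform(0, 5, size=20)
train_y = np.array([expensive_simulation(x) for x in train_x])

# 2. Fit a cheap surrogate (a low-order polynomial here; in practice a neural net).
surrogate = np.poly1d(np.polyfit(train_x, train_y, deg=6))

# 3. Screen a large candidate set with the surrogate instead of the simulator.
candidates = np.linspace(0, 5, 100_000)
predicted = surrogate(candidates)
shortlist = candidates[np.argsort(predicted)[:10]]   # 10 best designs (lower is better)

# 4. Confirm only the shortlist with the expensive, high-precision simulation.
confirmed = {x: expensive_simulation(x) for x in shortlist}
best = min(confirmed, key=confirmed.get)
print(f"best candidate x={best:.3f}, simulated value={confirmed[best]:.4f}")
```

Swap the polynomial for a neural network and the toy function for a real climate or materials code, and that’s roughly the pattern being described.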
Several DoE labs are already exploring the use of AI/ML to improve everything from climate models and drug discovery to simulated nuclear weapons testing and maintenance.
“It’s early. There is a lot of work to be done on the algorithms here, and there’s a lot of work to be done in how to partition the problems,” Su said.
Source: The Register