The history of the computing industry is one of constant progress. Processors get faster, storage gets cheaper, and memory gets denser. We see the consequences of this progress in every aspect of society, and it reaches all the way to the top, where national governments continue to invest in larger and better supercomputers. Part technological necessity and part technological race, the exascale era of supercomputing is about to begin, as orders for the first exaFLOP-capable systems are now going out. So it is only fitting that this morning the US Department of Energy is announcing the contract for its fastest supercomputer yet, the Frontier system, which will be built by Cray and AMD.
Frontier is scheduled for delivery in 2021
The new supercomputer is being built as part of the US DOE's CORAL-2 supercomputer program, with Frontier scheduled to replace Oak Ridge National Laboratory's current Summit supercomputer. Summit is the reigning champion of the supercomputer world, with 200 petaFLOPS of performance, and according to the US DOE and Oak Ridge, the performance improvement from the new computer will be significant. Frontier should be able to deliver over 7x the performance of Summit, and is expected to be the world's fastest supercomputer when it comes online.
Frontier, like Summit (and Titan before it), is an open science system, meaning it is available to academic researchers to run simulations and experiments. The laboratory therefore expects the supercomputer to be used for a wide range of projects across many disciplines, including not only traditional modeling and simulation tasks, but also data-driven techniques for artificial intelligence and data analysis. In fact, the latter is somewhat new ground for the laboratory and the system's likely users. Just as we have seen in the enterprise space in recent years, neural network AI is becoming an increasingly popular technique for solving problems and extracting data from large datasets, and researchers are now looking at how to refine those techniques on current-generation systems and apply them to exascale-level projects.
US Department of Energy Supercomputers
| | Frontier | Aurora | Summit |
|---|---|---|---|
| CPU Architecture | AMD EPYC | Intel Xeon Scalable | IBM POWER9 |
| GPU Architecture | Radeon Instinct | Intel Xe | NVIDIA Volta |
| Performance (RPEAK) | 1.5 EFLOPS | 1 EFLOPS | 200 PFLOPS |
| Power Consumption | ~30 MW | N/A | 13 MW |
| Nodes | 100 Cabinets | N/A | 3,400 |
| Laboratory | Oak Ridge | Argonne | Oak Ridge |
| Supplier | Cray | Intel | IBM |
Frontier: Powered by Cray & AMD
Officially, the prime contractor for Frontier will be Cray. But looking at the specifications, you could be excused for believing that it was AMD. Cray, for its part, is partnering with the chipmaker on the system, and as a result AMD is supplying most of the core hardware for the new supercomputer. Frontier is designed as a next-generation CPU + accelerator system, with a mix of CPUs and GPUs doing the heavy computational work, and AMD will deliver both. As the main processor supplier, AMD will also take on much of the responsibility for developing the software stack, with the company and Cray developing an enhanced version of the ROCm environment to optimize the performance of the massive cluster of CPUs and GPUs.
On the CPU side of things, AMD will deliver a custom next-generation EPYC processor. AMD has confirmed that it will use a future generation of its Zen CPU cores, and given the timing of the project, we are almost certainly looking at a Zen 3 or Zen 4 design here. Just how customized AMD's CPU will be remains to be seen, but the announcement has revealed that Frontier's CPUs will include new instructions for optimizing AI and supercomputing workloads.
Meanwhile, on the GPU side of things, AMD and Cray are holding their cards a little closer to the chest. Rather than naming a specific architecture or architectural generation, AMD says only that the GPUs are "based on the Radeon Instinct family" and "not yet announced." AMD's current public roadmap goes out to "Next Gen" in 2020, and with GPU development cycles averaging over 2 years, this may be the architecture we see. But given the special needs of a supercomputer, AMD may have something a little more tailored in mind.
What the company is confirming for now is that it is not holding back on features. The HPC-focused GPU is being designed with Frontier in mind and will include mixed precision compute support. Feeding the beast will be HBM memory, and AMD will use a version of Infinity Fabric to connect the CPUs and GPUs.
In fact, while AMD has kept the details on the technology light, it seems that this version of IF will be the most advanced yet. AMD specifically notes that it is a coherent fabric, calling it the first fully optimized CPU + GPU design for supercomputing. AMD's GPUs and CPUs will be arranged in a 4-to-1 ratio, with 4 GPUs for each EPYC processor. It is worth noting that AMD's slide shows a network with each GPU connected to the CPU and to two other GPUs, but I would not read too much into this yet, since AMD has not provided any other details about the IF setup.
With AMD supplying the hardware up to the blade level, connecting all of these nodes is Cray's job. For Frontier, the supercomputer vendor is rolling out its new Slingshot interconnect, an equally ambitious design that will support adaptive routing, congestion management, and quality-of-service features. Slingshot is capable of 200 Gb/sec per port, with individual blades containing a port for each GPU in the blade, allowing other nodes to read and write data directly to a GPU's memory. As a result, Frontier will have a significant amount of bandwidth, which is all but necessary to allow the system to scale to exaFLOP levels.
Overall, Frontier will be organized into over 100 Cray Shasta cabinets. And while Cray has not announced a specific power consumption figure for Frontier, with each cabinet rated at 300 kW, this would put the entire system at over 30 MW. To put things in context, this is over twice as much power as the 13 MW Summit. So while Frontier is a significantly faster system than the supercomputer it replaces, Cray, AMD, and the US DOE are all feeling the pinch of Moore's law slowing down, as power efficiency gains are becoming more difficult to achieve. All told, going by a passing comment in the press releases, it seems that Oak Ridge will install a total of 40 MW of capacity for Frontier, which is a significant amount of power to say the least.
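The arithmetic behind these figures can be checked with a short sketch. It uses only the numbers quoted in this article (100 cabinets at 300 kW each, 1.5 EFLOPS peak for Frontier, 200 PFLOPS and 13 MW for Summit). The variable and function names are illustrative, and because the cabinet count is "over 100" and the 300 kW figure is a cabinet rating rather than measured draw, the results should be read as rough estimates rather than official efficiency numbers.

```python
# Figures quoted in the article. The cabinet count is "over 100" and 300 kW is
# a rated value, so the derived power number is an estimate, not a measurement.
CABINETS = 100
KW_PER_CABINET = 300
FRONTIER_PEAK_PFLOPS = 1500  # 1.5 EFLOPS expressed in PFLOPS
SUMMIT_PEAK_PFLOPS = 200
SUMMIT_POWER_MW = 13


def system_power_mw(cabinets, kw_per_cabinet):
    """Total rated power in megawatts for a number of identical cabinets."""
    return cabinets * kw_per_cabinet / 1000


def gflops_per_watt(peak_pflops, power_mw):
    """Peak efficiency in GFLOPS/W.

    1 PFLOPS = 1e6 GFLOPS and 1 MW = 1e6 W, so PFLOPS per MW is
    numerically equal to GFLOPS per watt.
    """
    return peak_pflops / power_mw


frontier_power_mw = system_power_mw(CABINETS, KW_PER_CABINET)
speedup = FRONTIER_PEAK_PFLOPS / SUMMIT_PEAK_PFLOPS
power_ratio = frontier_power_mw / SUMMIT_POWER_MW
frontier_eff = gflops_per_watt(FRONTIER_PEAK_PFLOPS, frontier_power_mw)
summit_eff = gflops_per_watt(SUMMIT_PEAK_PFLOPS, SUMMIT_POWER_MW)

print(f"Frontier rated power: {frontier_power_mw:.0f} MW")
print(f"Peak performance vs. Summit: {speedup:.1f}x")
print(f"Power vs. Summit: {power_ratio:.1f}x")
print(f"Peak efficiency: {frontier_eff:.1f} vs. {summit_eff:.1f} GFLOPS/W")
```

On these assumptions, performance grows by about 7.5x while power grows by roughly 2.3x, so peak efficiency improves by a little over 3x. The gain is real, but it is not large enough to keep the power budget flat.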
Along with furthering the United States' own supercomputing leadership goals, securing the Frontier contract also represents a big win for Cray and AMD. Cray is now involved in both 2021 exascale systems, reinforcing its place in the supercomputing world. Meanwhile AMD, which has spent the current generation on the outside looking in, has now secured a huge and prestigious win for both its CPU and GPU divisions.
In fact, it is interesting to note that of the two 2021 exascale systems ordered so far, both come from full-service processor vendors that deliver both CPUs and GPUs. Current-generation systems like Summit use mixed vendors (e.g. IBM + NVIDIA), so the move to integrated vendors is a big shift for these CPU + accelerator systems. There are clearly both technological and purchasing advantages to using a single vendor for all processors, which benefits both AMD and Intel. It is also worth noting that the CORAL-2 program requires the DOE to purchase systems based on two different architectures, so if the future is integrated systems, AMD and Intel are the logical choices.
In any case, with the contract for Frontier awarded, the job is only half finished. AMD and Cray must continue to develop the hardware and software for the system, not to mention locking down the final specifications of the finished supercomputer. So expect news about Frontier to keep trickling out over the next couple of years, leading up to its installation in 2021.