- Aug 17, 2014
Building a supercomputer is always challenging, but creating the industry’s first exascale-class system means encountering wholly unexpected problems and doing a great deal of work on both hardware and software. That appears to be the case with Oak Ridge National Laboratory’s Frontier supercomputer, which reportedly can barely go a day without numerous hardware failures.
ORNL’s Frontier is the industry’s first system designed to deliver up to 1.685 FP64 ExaFLOPS of peak performance, using AMD’s 64-core EPYC ‘Trento’ processors, Instinct MI250X compute GPUs, and HPE’s Slingshot interconnect within a 21 MW power budget. HPE built the system on its Cray EX architecture, which is designed for scale-out applications, primarily ultra-fast supercomputers.
While the Frontier supercomputer looks exceptionally good on paper, and its hardware has been delivered, hardware problems appear to keep the machine from coming fully online and becoming available to researchers who need performance of around 1 FP64 ExaFLOPS.
“We are working through issues in hardware and making sure that we understand (what they are),” said Justin Whitt, program director for the Oak Ridge Leadership Computing Facility (OLCF), in an interview with InsideHPC. “You are going to have failures at this scale. Mean time between failure on a system this size is hours, it’s not days.”
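Whitt’s point about failure rates follows from simple arithmetic: under the common simplifying assumption of independent, exponentially distributed failures, system-wide MTBF is roughly the per-component MTBF divided by the component count. The sketch below uses hypothetical numbers (a 50-year component rating, 40,000 components), not actual Frontier specifications, purely to illustrate why “hours, not days” is the expected regime at this scale.

```python
# Illustrative only: why individually reliable parts still yield a
# system-wide MTBF measured in hours at exascale component counts.
# Assumes independent, exponentially distributed failures; the component
# count and per-component MTBF below are hypothetical, not Frontier specs.

HOURS_PER_YEAR = 365 * 24


def system_mtbf_hours(component_mtbf_years: float, n_components: int) -> float:
    """System MTBF under independent exponential failures:
    MTBF_system = MTBF_component / N."""
    return component_mtbf_years * HOURS_PER_YEAR / n_components


# A hypothetical part rated for a 50-year MTBF, replicated 40,000 times,
# gives a whole-system MTBF of around 11 hours.
print(f"{system_mtbf_hours(50, 40_000):.1f} hours")
```

Even doubling the per-component reliability in this model only pushes the system MTBF to roughly a day, which matches the order of magnitude Whitt describes.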
Rumors about potential hardware failures on Frontier have been floating around for quite a while. According to another InsideHPC story, some sources said the system experienced problems with the Slingshot interconnect, while others indicated that AMD’s Instinct MI250X compute GPUs were not as reliable as expected this year. Keep in mind that the ‘X’ version, with a higher stream-processor count and higher clocks, is available only to select customers.