Shock Report: ORNL Frontier Has Teething Problems

Building the fastest computer in the world is a heroic enterprise.  By definition, it hasn’t been done before, and if you are doing it right, you are pushing the limits of both engineering and physics. There is a good chance that things will go wrong.

As I put it thirty years ago, Bob’s First Law of Parallel Computing is,

“If you have enough processors, some of them aren’t working.”

Bob’s First Law Of Parallel Computing (circa 1989)

So, I was not surprised by Anton Shilov’s headline, “World’s Fastest Supercomputer Can’t Run a Day Without Failure”. [2]  He’s referring to ORNL’s Frontier, which has claimed 1 exaflop performance.

Shockingly enough, with some 60 million parts, in tens of thousands of units, and hundreds of kilometers of cables, there have been glitches.  Rumors suggest that the mean time to failure is a matter of hours.

Been there.  Done that.  [1]

As Bob’s Law implies, building large, complex systems soon enough becomes all about reliability.  No one can be surprised that Frontier is going to have to work hard to achieve its theoretical potential for substantial periods of time.
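For the arithmetically inclined, here is a back-of-the-envelope sketch of why Bob’s Law bites at this scale. It assumes parts fail independently at a constant rate, which is a textbook simplification and not a model of Frontier. The 60 million part count comes from the report above; the per-part reliability numbers are made up for illustration.

```python
import math


def p_any_down(n_parts: int, p_part_down: float) -> float:
    """Probability that at least one of n independent parts is down.

    Computes 1 - (1 - p)^n via log1p/expm1 so it stays accurate for tiny p.
    """
    return -math.expm1(n_parts * math.log1p(-p_part_down))


def system_mtbf_hours(n_parts: int, part_mtbf_hours: float) -> float:
    """Mean time between failures for the whole system.

    With independent parts and constant failure rates, the rates add,
    so the system MTBF is the per-part MTBF divided by the part count.
    """
    return part_mtbf_hours / n_parts


if __name__ == "__main__":
    n = 60_000_000             # part count quoted in the report
    part_mtbf = 120_000_000.0  # hypothetical: each part lasts ~13,700 years on average
    p_down = 1e-6              # hypothetical: one-in-a-million chance a given part is down

    print(f"System MTBF: {system_mtbf_hours(n, part_mtbf):.1f} hours")
    print(f"P(at least one part down): {p_any_down(n, p_down):.6f}")
```

Under these made-up numbers, parts that individually last more than ten thousand years still add up to a machine that fails about every two hours.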

So, let’s be patient.  This is perfectly normal.

As always, it’s not how well the dog dances, it’s that he dances at all.

A couple of other thoughts.

First of all, this is a great example of why we need to actually build experimental systems, not just talk about them.  There is a huge difference between the PowerPoint description of the system (AKA, The Hollywood Version) and what you can actually get working.  And the only way to find out how to do exascale computation is to actually try, several times, to do it.

Second, I can tell you one group that is secretly relieved by all this attention to hardware failures:  the software folks!

I’ve been in their shoes (more than once), and I’m pretty sure that the software isn’t ready yet.  Software is always late, always buggy, and never good enough. 

But as long as the hardware is still shaking and rattling and scaring the cat, no one can tell if the software is cruising like the PowerPoint says it will.

These hardware woes give the software groups a few precious months of extra time before the spotlight inevitably turns, and we start hearing about the bugs and shortfalls in Frontier’s software.

(And I can hear it now, “Why isn’t your code running faster?  We just got the hardware working yesterday, you’ve had 24 hours to set the world record. What’s wrong with you people?” )


  1. Perry A. Emrath, Mark S. Anderson, Richard R. Barton, and Robert E. McGrath, The Xylem Operating System, in ICPP 1. 1991. p. 67-70. http://dblp.uni-trier.de/db/conf/icpp/icpp1991-1.html#EmrathABM91
  2. Anton Shilov, World’s Fastest Supercomputer Can’t Run a Day Without Failure, in Tom’s Hardware, October 10, 2022. https://www.tomshardware.com/news/worlds-fastest-supercomputer-cant-run-a-day-without-failure
