How To Kill A Supercomputer

Or any computer, actually.

Having worked for many years in and with supercomputing centers, I have to say that there is a certain romantic appeal to working with these fragile, difficult giants. By definition, a supercomputer runs at the edge of what is possible, and it sits at the edges of many technical frontiers all at the same time. These days, there are so many components operating in parallel that it simply isn’t possible for humans to apprehend what is going on. It’s a miracle if things actually work, in my opinion.

Al Geist published a nice piece in IEEE Computer about “How To Kill A Supercomputer”, explaining some of the unexpected vulnerabilities of massive computing systems.

Building really fast, really large computers means building a lot of really tiny components, at every scale. Transistors, wires, chips, boards, racks, power supplies. Everything is out at the edge of what can be built.

As Geist discusses, electronics are vulnerable to cosmic rays, which slam into components and wires, causing voltage spikes that change the information recorded. In a system with a bazillion little pieces, this causes a zillion little errors, randomly throughout. As he says, “The surface area of all the silicon in a supercomputer functions somewhat like a large cosmic-ray detector.”
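To make that concrete, here is a tiny sketch (Python, purely illustrative, not from Geist’s article) of what a single-event upset does to a stored value: flip one bit of a 64-bit float and the number it represents can change by hundreds of orders of magnitude.

```python
import struct

def flip_bit(value: float, bit: int) -> float:
    """Return the float you get by flipping one bit of value's 64-bit encoding."""
    (bits,) = struct.unpack("<Q", struct.pack("<d", value))
    bits ^= 1 << bit                      # the "cosmic ray" strikes here
    (corrupted,) = struct.unpack("<d", struct.pack("<Q", bits))
    return corrupted

x = 3.14159
print(x, "->", flip_bit(x, 61))   # flipping an exponent bit makes the magnitude explode
```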

Expected time to failure: minutes.
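The arithmetic behind that one-liner is simple and brutal: failure rates of independent components add, so even extraordinarily reliable parts give you a short fuse once you have ten million of them. The numbers below are assumptions for illustration, not figures from Geist’s article.

```python
# Failure rates of independent components add, so system MTBF ~ component MTBF / count.
component_mtbf_hours = 5_000_000   # assumed mean time between upsets for one device
component_count = 10_000_000       # assumed number of such devices in the machine

system_mtbf_minutes = component_mtbf_hours / component_count * 60
print(f"expected time between errors: ~{system_mtbf_minutes:.0f} minutes")  # ~30 minutes
```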

This problem has actually been known for a long time, because some of the premier customers for high performance computing, such as Los Alamos National Lab, sit at high altitudes, where cosmic radiation is stronger than at sea level. More than a few systems worked fine in test labs, but developed serious problems when delivered to the high desert.

Geist points out that it’s not just component failures and transient errors that cause problems. Supercomputers are tied together by high speed networks, which are also vulnerable. The loss of a key router can render the whole system inoperable, even if everything else survives. Similarly, each component is itself a complex network, with on-chip buses, memory, and caches.
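A toy way to see the router problem: model the machine as a graph and knock out one well-placed node. This little sketch (plain Python, invented topology) checks whether the surviving nodes can still reach each other.

```python
from collections import deque

# A star-ish topology: every compute node reaches the others through router "R".
links = {"R": ["A", "B", "C", "D"], "A": ["R"], "B": ["R"], "C": ["R"], "D": ["R"]}

def still_connected(graph, dead):
    """Breadth-first search over the surviving nodes; True if they all reach each other."""
    alive = [n for n in graph if n != dead]
    seen, frontier = {alive[0]}, deque([alive[0]])
    while frontier:
        for nxt in graph[frontier.popleft()]:
            if nxt != dead and nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    return seen == set(alive)

print(still_connected(links, dead="A"))  # True: losing a compute node is survivable
print(still_connected(links, dead="R"))  # False: losing the router strands everyone
```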

Cosmic rays are hardly the only challenge. Geist cites a case where some components were assembled with solder that contained radioactive lead. All solder has trace amounts of radioactivity, but this batch emitted enough particles to corrupt the local cache memory. Sigh.

Besides these exotic failure modes, high performance computing also faces severe engineering barriers. High performance chips run hot, as hot, I am told, as the core of a nuclear reactor. Dissipating that heat isn’t easy, and any cooling problem requires swift shutdown.
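For a rough sense of “hot,” divide a chip’s power by its die area. The figures here are my own ballpark assumptions for a high-end HPC processor, not measurements.

```python
chip_power_watts = 300.0   # assumed power draw of one high-end HPC chip
die_area_cm2 = 6.0         # assumed die area

print(f"heat flux ~ {chip_power_watts / die_area_cm2:.0f} W per square cm")  # ~50 W/cm^2
```

Tens of watts per square centimeter, concentrated in hot spots, is why elaborate cooling and aggressive shutdown logic are non-negotiable.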

These systems suck electricity like mad. I was told that the day the University of Illinois turned on its Blue Waters system, the power bill for the campus would go up by 25%.  Whether that particular factoid is precisely accurate or not, it is absolutely true that high performance computers have no choice but to become much more efficient. No one can afford the juice.
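Whatever the exact number, the electricity bill is easy to estimate. Assuming, purely for illustration, a machine that draws about 10 megawatts around the clock at roughly $0.10 per kilowatt-hour:

```python
power_mw = 10.0          # assumed average draw, megawatts
price_per_kwh = 0.10     # assumed electricity price, dollars
hours_per_year = 24 * 365

annual_cost = power_mw * 1_000 * hours_per_year * price_per_kwh
print(f"~${annual_cost / 1e6:.1f} million per year just for electricity")  # ~$8.8M
```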

[Image: ‘Blue Waters’ panorama, National Center for Supercomputing Applications, University of Illinois at Urbana-Champaign]

Amazing beasts.

By the way, how in the world do you program them to do anything useful? They’ve got 10,000 or more processors, absurd amounts of storage, and a byzantine maze of interconnects, caches, and I/O. And when a run finishes, what were the results? Were they right? If there is an error, what caused it, and how can you fix it?
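For a flavor of what programming one of these machines even looks like, here is a minimal sketch using MPI (via the real mpi4py binding): the same program runs on every processor, each rank works on its own slice of the problem, and the partial answers are combined by passing messages. The problem is trivial on purpose.

```python
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()   # which copy of the program am I?
size = comm.Get_size()   # how many copies are running?

# Each rank sums its own stride of the range, then the partial sums are reduced to rank 0.
n = 100_000_000
local_sum = sum(range(rank, n, size))
total = comm.reduce(local_sum, op=MPI.SUM, root=0)

if rank == 0:
    print(f"{size} ranks agree: total = {total}")
```

Launch it with something like `mpirun -np 10000 python sum.py` and the questions above become real: did every rank finish, did a bit flip somewhere in the reduction, and would you even notice?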

No puny human can understand what is going on.
