Posted by in General

Will the mountain come to the Mohammad or Mohammad go to the mountain?

When we consider whether clouds can provide a suitable platform for high performance computing (HPC) we always talk about how cloud computing needs to evolve to suit the needs of HPC – in other words will the mountain come to Mohammad. But there are signs that there may also be movement in the other direction – transforming HPC so that it may work better in the cloud paradigm. Mohammad may have to go.

Discussions around this issue typically focus on performance: how the existing cloud hardware and software has to change. But those are not the only issues. I recently listened to a talk given by a colleague from the Joint Lab for Petascale Computing, Franck Cappello, who considered an often overlooked aspect of HPC – fault management. As it turns out, the way fault tolerance for HPC applications is handled is dramatically different from other applications and can have enormous influence over both its performance and the cost.

HPC applications are typically single program multiple data (SPMD) — tightly-coupled codes executing in lockstep and running on thousands or hundreds of thousands of processors. The assumption is that if just one node in the whole computation fails, the whole computation has to be repeated. To make such failures less catastrophic – potentially throwing out many weeks’ worth of computation — we use global restart based on checkpointing – application state is periodically saved and when the failure occurs the application is restarted from the last checkpoint data. How often do we checkpoint? The answer to this depends most on a quality called mean time between failures (MTBF) – if your checkpointing interval is greater than MTBF you’d have to be lucky for your computation to make much progress.  As the architectures evolved to support computations running on increasingly more nodes the probability of failure of at least one of those nodes during the computation started increasing, thus pushing MTBF down. To compensate, MTBF became an increasingly important factor in the design of both HPC hardware and the software that executes on it.

Before we go on let’s pause and reflect when have we last even heard of an MTBF of a cloud? Or MTBF of a virtual cluster deployed on that cloud for that matter? Likely never, because so far these systems tend to support applications that are more loosely coupled where the failure of one component does not affect all.

But here is the issue: global restart is expensive. You spend a lot of time saving state and occasionally you also have to read it and redo part of your calculation.  This affects both the overall time of your computation (when your code finishes in practice) and the cost of that computation. In fact, Franck and his colleagues estimate that global restart can range from 20% of the total HPC computation cost to as much as 50% in extreme cases  – and will of course go up as the MTBF goes down. In other words, if MTBF of a virtual cluster is low — as it is likely to be — HPC on a cloud will not only drag down the execution time but also be prohibitively expensive due to more frequent need for restarts. These factors combined could easily keep HPC out of clouds no matter how good their benchmark results are.

But do we really need global restart if only one component fails? Franck and his colleagues investigated this question and found that in most cases we do not. They are now working on leveraging this finding: formulating protocols that log less data and restart fewer nodes thus significantly reducing the cost of providing fault tolerance for SPMD style applications. The MTBF of clouds, while still an important factor may not be a deal-breaker after all.

It seems that the pay-per-use model of cloud computing sent us all on a global efficiency drive. Before it emerged, optimizing qualities such as fault-tolerance and the resulting power usage and cost was largely a global concern driven by the resource owner. The individual users had little incentive to optimize the cost of their specific run. For this reason, progress happened largely on a global scope, e.g., by driving architecture evolution. Pay-per-use changes this point of view: it now becomes important to individual users to ensure that their run costs as little as possible. It is therefore likely that the next wave of progress will arise out of optimizing individual runs.

It will be fascinating to watch as Mohammad and the mountain maneuver around each other during the next few years ;-) .

You can find more information about this and related issues on the Joint Lab publications page.