Cloud Highlights from CCGrid 2011

Posted by Paul Marshall | Posted in General | 06-13-2011 | One response

A few weeks ago (May 23rd-26th) I traveled to sunny Newport Beach, California to present a paper, Improving Utilization of Infrastructure Clouds, at CCGrid 2011. Our paper addresses one of the main challenges faced by infrastructure cloud providers: ensuring that resources are utilized efficiently while still providing resources on-demand. To solve this catch-22, we deployed backfill VMs on idle VMM nodes. For evaluation, we deployed Condor in the backfill VMs and demonstrated an increase to 100% utilization of the infrastructure resources. All of the details are in the paper, so I won’t elaborate here. You can also try backfill for yourself with Nimbus 2.7.

While at the conference I also attended a number of excellent presentations. In the interest of brevity, I’ll highlight only a few of them here. The keynote on the second day, Maximizing Profit and Pricing in Cloud Environments, given by Albert Zomaya of the University of Sydney, discussed the challenges of pay-as-you-go cloud computing and the often conflicting objectives of cloud service providers (maximize profit) and cloud users (minimize expenses). Dr. Zomaya proposed numerous models and algorithms for profit-driven scheduling of resources. He also proposed an application profiling technique that detects application patterns (e.g. IO intensive, memory intensive, or CPU intensive) and then applies a prediction model to achieve optimal VM placement in a cloud infrastructure.

Another interesting discussion at CCGrid was a panel on autonomic cloud computing. The panel emphasized the difficulties of managing large-scale cloud infrastructures. Much of the panel focused on the need for tools and techniques to manage these resources in an automatic and efficient manner, where resources dynamically scale (up or down) to match demand. The Chief of Research from Rackspace, Adrian Otto, highlighted the efforts at Rackspace to address this challenge. They analyzed various characteristics (e.g. identifying distribution models to describe bursts in website activity) to predict cloud resource needs.

There were many other excellent presentations; however, I’ll leave it to you to read the papers from the conference (the full program and paper listing is available here). Next year CCGrid is heading to Ottawa, Canada. Perhaps I’ll see you there!

Sunset dinner cruise in Newport Bay

Comparisons: Not so Odious as Once Thought

Posted by Kate Keahey | Posted in General | 02-03-2011 | 4 responses

I often get asked if there is any published work evaluating performance and cost of scientific applications on IaaS clouds and comparing them to using clusters – and I always say LOTS! …and then can’t remember more than a few off the top of my head ;-) . So I recently put together a list – included below – of various evaluation and comparison efforts I’ve been able to find. They look at all sorts of aspects of performance – from low-level benchmarks to applications of various types, from reliability to cost. They all tend to focus on somewhat different aspects of the issue and collectively paint a picture of the blessings and challenges of cloud computing for science.

My personal favorite is “Performance Analysis of High Performance Computing Applications on the Amazon Web Services Cloud” – on top of the list since, having just come out at CloudCom 2010 last December, it is the most recent. The authors evaluate the AWS IaaS offering based on the NERSC benchmarks framework – a comprehensive set of benchmarks capturing the typical workload in a scientific datacenter. They report not only the performance characteristics of scientific applications on virtual clusters created in the cloud but also note the mean time between failures (MTBF) of a virtual cluster deployed on cloud resources – the consequences of which I (coincidentally) blogged about around the time this paper was presented.

And finally, I have a favor to ask – if you know of papers evaluating various aspects of scientific applications on IaaS clouds, or have favorites in the field, or opinions on what you would like to see evaluated – please tell us about it. I will do a post of lessons learned.

Evaluation of IaaS clouds for scientific applications:

Mohammad and the Mountain

Posted by Kate Keahey | Posted in General | 12-07-2010 | 2 responses

Will the mountain come to Mohammad or Mohammad go to the mountain?

When we consider whether clouds can provide a suitable platform for high performance computing (HPC) we always talk about how cloud computing needs to evolve to suit the needs of HPC – in other words will the mountain come to Mohammad. But there are signs that there may also be movement in the other direction – transforming HPC so that it may work better in the cloud paradigm. Mohammad may have to go.

Discussions around this issue typically focus on performance: how the existing cloud hardware and software has to change. But those are not the only issues. I recently listened to a talk given by a colleague from the Joint Lab for Petascale Computing, Franck Cappello, who considered an often overlooked aspect of HPC – fault management. As it turns out, the way fault tolerance for HPC applications is handled is dramatically different from other applications and can have enormous influence over both its performance and the cost.

HPC applications are typically single program multiple data (SPMD) – tightly-coupled codes executing in lockstep and running on thousands or hundreds of thousands of processors. The assumption is that if just one node in the whole computation fails, the whole computation has to be repeated. To make such failures less catastrophic – potentially throwing out many weeks’ worth of computation – we use global restart based on checkpointing: application state is periodically saved, and when a failure occurs the application is restarted from the last checkpoint data. How often do we checkpoint? The answer depends most on a quantity called mean time between failures (MTBF) – if your checkpointing interval is greater than the MTBF you’d have to be lucky for your computation to make much progress. As architectures evolved to support computations running on ever more nodes, the probability of failure of at least one of those nodes during the computation started increasing, thus pushing MTBF down. To compensate, MTBF became an increasingly important factor in the design of both HPC hardware and the software that executes on it.
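
To make the relationship between node count, MTBF, and checkpoint interval concrete, here is a minimal sketch. It assumes independent node failures with a constant rate (so the MTBF of the whole job is roughly the node MTBF divided by the node count) and uses Young's first-order approximation for the checkpoint interval. Neither the model nor the numbers come from the talk, and the function names are purely illustrative.

```python
import math


def system_mtbf_hours(node_mtbf_hours: float, num_nodes: int) -> float:
    """MTBF of the whole job, assuming independent nodes with constant failure rates."""
    if num_nodes < 1:
        raise ValueError("num_nodes must be at least 1")
    return node_mtbf_hours / num_nodes


def young_checkpoint_interval_hours(checkpoint_cost_hours: float, mtbf_hours: float) -> float:
    """Young's first-order approximation: interval ~ sqrt(2 * C * MTBF)."""
    return math.sqrt(2.0 * checkpoint_cost_hours * mtbf_hours)


if __name__ == "__main__":
    node_mtbf = 5 * 365 * 24.0   # assumed: one failure per node every five years
    checkpoint_cost = 10 / 60.0  # assumed: ten minutes to write a checkpoint
    for nodes in (100, 1_000, 10_000, 100_000):
        mtbf = system_mtbf_hours(node_mtbf, nodes)
        interval = young_checkpoint_interval_hours(checkpoint_cost, mtbf)
        print(f"{nodes:>7} nodes: MTBF {mtbf:8.2f} h, checkpoint every {interval:5.2f} h")
```

Under these assumptions the suggested interval shrinks with the square root of the node count, which is why larger machines end up spending more of their time writing checkpoints.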

Before we go on, let’s pause and reflect: when have we last even heard of the MTBF of a cloud? Or the MTBF of a virtual cluster deployed on that cloud, for that matter? Likely never, because so far these systems tend to support applications that are more loosely coupled, where the failure of one component does not affect all the others.

But here is the issue: global restart is expensive. You spend a lot of time saving state and occasionally you also have to read it and redo part of your calculation. This affects both the overall time of your computation (when your code finishes in practice) and the cost of that computation. In fact, Franck and his colleagues estimate that the cost of global restart can range from 20% of the total HPC computation cost to as much as 50% in extreme cases – and will of course go up as the MTBF goes down. In other words, if the MTBF of a virtual cluster is low (as it is likely to be), HPC on a cloud will not only take longer to execute but also be prohibitively expensive due to the more frequent need for restarts. These factors combined could easily keep HPC out of clouds no matter how good their benchmark results are.
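
A back-of-the-envelope model shows why the overhead grows as MTBF falls. The sketch below assumes a common first-order waste model, in which the fraction of time lost is the checkpoint cost per interval plus, on average, half an interval of recomputation per failure. It ignores restart time and is not the model behind Franck's estimates; the price and durations are made-up inputs.

```python
import math


def wasted_fraction(checkpoint_cost_h: float, interval_h: float, mtbf_h: float) -> float:
    """First-order estimate of time lost to checkpoint writes plus recomputation."""
    return checkpoint_cost_h / interval_h + interval_h / (2.0 * mtbf_h)


def billed_hours(useful_hours: float, waste: float) -> float:
    """Wall-clock hours needed to get `useful_hours` of real progress."""
    if not 0.0 <= waste < 1.0:
        raise ValueError("waste must be in [0, 1); the job makes no progress otherwise")
    return useful_hours / (1.0 - waste)


if __name__ == "__main__":
    checkpoint_cost = 0.1  # assumed: six minutes per checkpoint
    useful = 1000.0        # assumed: hours of failure-free computation needed
    price = 1.60           # assumed: dollars per cluster-hour, purely illustrative
    for mtbf in (100.0, 20.0, 5.0, 1.0):
        interval = math.sqrt(2.0 * checkpoint_cost * mtbf)  # Young's approximation
        waste = wasted_fraction(checkpoint_cost, interval, mtbf)
        hours = billed_hours(useful, waste)
        print(f"MTBF {mtbf:6.1f} h: waste {waste:5.1%}, bill ${hours * price:,.0f}")
```

With pay-per-use pricing, every wasted hour in this model shows up directly on the bill, which is the point the paragraph above makes.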

But do we really need global restart if only one component fails? Franck and his colleagues investigated this question and found that in most cases we do not. They are now working on leveraging this finding: formulating protocols that log less data and restart fewer nodes, thus significantly reducing the cost of providing fault tolerance for SPMD-style applications. The MTBF of clouds, while still an important factor, may not be a deal-breaker after all.

It seems that the pay-per-use model of cloud computing sent us all on a global efficiency drive. Before it emerged, optimizing qualities such as fault-tolerance and the resulting power usage and cost was largely a global concern driven by the resource owner. The individual users had little incentive to optimize the cost of their specific run. For this reason, progress happened largely on a global scope, e.g., by driving architecture evolution. Pay-per-use changes this point of view: it now becomes important to individual users to ensure that their run costs as little as possible. It is therefore likely that the next wave of progress will arise out of optimizing individual runs.

It will be fascinating to watch as Mohammad and the mountain maneuver around each other during the next few years ;-) .

You can find more information about this and related issues on the Joint Lab publications page.

Cloud Highlights from Supercomputing 2010

Posted by Paul Marshall | Posted in General | 11-25-2010 | One response

Happy Thanksgiving!

Last week (Nov 13-19) was the annual Supercomputing (SC) conference. This year it was held in New Orleans, Louisiana. Cloud computing was featured by vendors and speakers throughout the conference. There were far too many cool products, talks, and papers to mention in a single post; however, a few of the highlights that we are thankful we caught in person include:

  • Two representatives from Platform Computing presented a large-scale cloud deployment being tested at the CERN laboratory in “Building the World’s Largest HPC Cloud.” CERN is testing Platform ISF to run scientific jobs in a virtualized environment. Results included reports of launching several thousand VMs and a comparison of image distribution techniques.
  • In “Virtualization for HPC”, members of the academic (Ohio State University, ORNL) and industrial (VMware, Univa UD, Deopli) communities shared their vision of a future for virtualization technologies in HPC. Topics discussed included pro-active fault tolerance using migration, virtualized access to high-performance interconnects, and new hypervisor technologies designed for exascale computing.
  • In “Low Latency, High Throughput, RDMA and the Cloud in Between” representatives from Mellanox, Dell, and AMD discussed the advantages of cloud computing and highlighted the importance of reducing latency and increasing throughput for scientific communities. RDMA over Converged Ethernet (RoCE) was emphasized as a specific effort toward reducing latency in virtualized environments.
  • The work in “Elastic Cloud Caches for Accelerating Service-Oriented Computations” demonstrated a dynamic and fast memory-based cache using IaaS resources, specifically for a geoinformatics cyberinfrastructure. The system responds to changes in demand by dynamically adding or removing IaaS nodes from the cache.
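
As an illustration of the last item, the sketch below shows one common way to let a cache grow and shrink with demand: a consistent-hash ring, where adding or removing a node remaps only a fraction of the keys. This is a generic technique chosen for illustration, not necessarily the design used in the paper, and the class, node, and key names are hypothetical.

```python
import bisect
import hashlib


class HashRing:
    """Consistent-hash ring mapping cache keys to nodes."""

    def __init__(self, replicas: int = 100):
        self.replicas = replicas  # virtual points per node, to smooth the distribution
        self._points = []         # sorted hash positions on the ring
        self._owner = {}          # hash position -> node name

    @staticmethod
    def _hash(value: str) -> int:
        return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)

    def add_node(self, node: str) -> None:
        for i in range(self.replicas):
            point = self._hash(f"{node}#{i}")
            self._owner[point] = node
            bisect.insort(self._points, point)

    def remove_node(self, node: str) -> None:
        for i in range(self.replicas):
            point = self._hash(f"{node}#{i}")
            if self._owner.get(point) == node:
                del self._owner[point]
                self._points.remove(point)

    def node_for(self, key: str) -> str:
        if not self._points:
            raise LookupError("ring has no nodes")
        # First ring position clockwise from the key's hash, wrapping around.
        idx = bisect.bisect(self._points, self._hash(key)) % len(self._points)
        return self._owner[self._points[idx]]


if __name__ == "__main__":
    ring = HashRing()
    for name in ("cache-a", "cache-b", "cache-c"):
        ring.add_node(name)
    keys = [f"tile-{i}" for i in range(10_000)]
    before = {k: ring.node_for(k) for k in keys}
    ring.add_node("cache-d")  # demand went up: lease one more IaaS node
    moved = sum(before[k] != ring.node_for(k) for k in keys)
    print(f"{moved / len(keys):.1%} of keys moved after adding a node")
```

The design choice here is that scaling events disturb as little cached data as possible: keys only ever move to the newly added node, and removing that node restores the previous mapping.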

In addition to some great cloud computing talks and sessions, cloud resources were also involved in a handful of demos and tutorials. In particular, Purdue demoed Springboard, a “hub” to work with NSF’s TeraGrid infrastructure. The hub provides a central point for researchers to collaborate and removes the need for researchers to rely strictly on the command line when interacting with the TeraGrid’s resources. Springboard also interfaces with the TeraGrid’s first cloud resource, Wispy, at Purdue. The National Center for Atmospheric Research (NCAR) and the University of Colorado at Boulder used 150 Amazon EC2 instances for the Linux Cluster Construction tutorial. The virtual machines were launched on-demand the morning of the tutorial. They provided participants with a realistic software environment for configuring and deploying a Linux cluster using a variety of open source tools such as OpenMPI, Torque, and Ganglia.

With cloud computing becoming ever more popular at SC it would be cool to see an HPC challenge category for cloud computing, perhaps running on Amazon’s cluster compute resource that just last week was officially included in the Top500 at 231.

Sky Computing

Posted by Kate Keahey | Posted in General | 11-10-2010 | No responses

I’ve been wanting to say a few words about sky computing for a while and eventually iSGTW forestalled me with a very nice article on the topic. It describes cool work by Nimbus committer Pierre Riteau, who created a virtual cluster of over a thousand cores over resources leased from six Nimbus clouds: three provided by Grid’5000 and three by FutureGrid.

“Sky computing” was a name we coined back in 2008 to describe the idea of operating in a multi-cloud environment. It addresses the issues of provider interoperability and comparison between providers (cloud markets) as well as end-user concerns: the abstractions and tools required to provide an integrated and secure environment over resources provisioned in multiple, potentially distributed clouds. The original paper focused on the latter.

Pierre’s work pushes the limits of this concept in terms of scale: distributing more images to more distributed clouds for longer sustained leases obtained faster. He started out with a Grid’5000 Large Scale Deployment Challenge award earlier this year for managing deployment of hundreds, and by mid-summer was investigating the properties of distributed clusters of thousands of cores. I can’t wait to see what happens next ;-) .

More information about Pierre’s sky computing ventures can be found in his TeraGrid 2010 poster and the recent ERCIM article.

Grids versus Clouds

Posted by Kate Keahey | Posted in General | 12-19-2009 | One response

The issue of how exactly cloud computing differs from grid computing was responsible for much controversy in the last year. Here are my two cents on how Infrastructure-as-a-Service (IaaS) cloud computing and grid computing are different (also discussed in the Sky Computing paper).

At some level, both cloud computing and grid computing represent the idea of using remote resources. However, grid computing is built on the assumption that control over the manner in which resources are used stays with the site, reflecting local software and policy choices. These choices are not always useful to remote users who might need a different operating system, or login access instead of a batch scheduler interface to a site. Reconciling those choices between multiple groups of users proved to be complex, time-consuming, and expensive. Looking back, leaving complete control over the resources with the site was a pragmatic choice that enabled very fast adoption of a radically transformative technology. On the other hand, once the technology became successful, this factor made it difficult for it to scale to many user groups with different (and sometimes conflicting) requirements of what the resource should provide.

IaaS cloud computing represents a fundamental change of assumption: when a remote user “leases” a resource, the control of that resource is turned over to that user. This change in assumption was enabled by the availability of a free and efficient virtualization technology: the Xen hypervisor. Before virtualization, turning over the control to the user was fraught with danger: the user could easily subvert a site. But virtualization provides a way of isolating the leased resource from a site in a secure way that mitigates this danger. Virtual machines can be deployed very fast (on the order of milliseconds); when, in addition, the overhead and the price associated with a reliable virtualization technology went down, it suddenly became viable and cost-effective to use them to lease resources to remote users.

The ability to turn the control of a remote resource over to the user makes it possible to develop tools, such as Nimbus, and provide services, such as Amazon’s EC2 or Science Clouds, that allow users to carve out their own custom “site” out of remote resources. At the same time, this change of assumption challenges the established notions of what it means to be a site as we continue to struggle with the new meaning and implications of domain names, site licenses, and established security practices.

Cloud Computing and Bioinformatics: Notes from a Workshop

Posted by Kate Keahey | Posted in General | 12-01-2009 | No responses

I recently attended an immensely interesting workshop on using cloud computing for systems biology computations. The workshop was co-held with SC09. The agenda and the presentations are available online from the workshop pages and are well worth a look. Here are some impressions from the workshop.

The workshop began with a discussion of current challenges in biosciences. One of the most compelling is personal medicine, which helps physicians tailor treatments to individual patients based on feedback obtained at the genetic and molecular level. For example, knowledge of genetic variations can now help physicians better assess treatment risks, manage dosing of drugs, better detect diseases in early stages, and optimize treatments such as breast cancer therapy. In his introductory talk, Eugene Kolker said that today there were already hundreds of patients treated based on information obtained from their genetic signatures as part of experimental programs. He also emphasized that the main obstacle to progress in this area is not obtaining the data but the response time and the ability to store, process, and analyze it to obtain the right information. And this brings us to cloud computing, in this workshop the “prime suspect” to process, analyze, and store on demand.

Simon Twigger from the Medical College of Wisconsin made a very compelling case for why bioscientists need cloud computing. He based his case on the pipette analogy – a common tool in molecular biology and medicine typically equipped with a disposable tip. The analogy was particularly apt as probably 90% of the audience was using the pipette on a daily basis. Simon proposed the following: “Imagine that you are running your lab with only one pipette tip to share.” [Huge laughter from the audience.] He then went on to explain how this assumption would change the work pattern in his laboratory. First, everybody would have to wait in line to use the pipette tip. Because of this waiting, they would do a lot less work. They would also do only small scale things because the imaginary pipette tip is small (moving large quantities of liquid would take weeks!). They would do fewer things because washing the pipette between uses is a pain. And finally they would not try to do something risky, because what if the pipette tip, e.g., becomes clogged? Having only one 16-node cluster for the lab, Simon explained, was exactly like having only one pipette tip – it was a bottleneck for the work in the lab. You queue your program and can’t make progress till the results become available. Because of that you do less work. Since the cluster is small, you also try only small scale things – as well as fewer types of things because different types of things may require configuration changes. And the risky stuff you don’t do at all.

The panel in the afternoon presented some options for cloud computing for science. Kathy Yelick from LBNL and our own Pete Beckman described the recently funded DOE Magellan project – a research project looking at how to build clouds for science. Afterwards, Owen White from the University of Maryland started a discussion on what makes cloud computing compelling to science. In addition to the issues brought up earlier by Simon, ease of use plays a very significant role. Owen described how his group was trying to use the TeraGrid and found it too complex to use both procedurally and technically – they were not able to overcome the entry barrier despite the significant resource incentive. The ease-of-use question has many aspects. Pete summed it up by saying that half the users tell him that they want to develop their own VM images and half that they don’t. A rough show of hands showed that in that particular audience everybody thought that developing their own image was much simpler than adapting their application to an environment provided by somebody else (because this is effectively the alternative). This does raise an issue, however: for some people the need to develop their own image may be too high a barrier.

As if to address this issue, the panel was followed by a presentation from Sam Angiuoli from the University of Maryland. Sam described an appliance for automated analysis of sequence data developed for the bio community. It seems that a model is emerging where some users take the initiative to develop appliances as a service to their community. This is similar to, e.g., the high-energy physics CernVM project that provides images supporting all four LHC experiments.

The workshop was wrapped up by a talk from Deepak Singh from Amazon Web Services who described AWS capabilities but also the different ways in which various projects use them. It’s fun to see new potential for science emerge!

Welcome to scienceclouds.org

Posted by Kate Keahey | Posted in General | 11-16-2009 | No responses

Today we are moving Science Clouds to its own web pages. In addition to enabling quite a few exploratory projects, the Science Clouds have to date served as a bit of a “cloud clinic” where various folks interested in using cloud computing for a scientific project would contact us and get advice and help on how to get started using Infrastructure-as-a-Service (IaaS) clouds.

Over time, these efforts resulted in shared images, papers, cloud evaluation projects, and other endeavors of which the following had the most impact:

The STAR “last minute” experiment: one of our most fascinating experiences last year was helping the scientists of the STAR nuclear physics experiment meet a conference deadline by conducting a significant last-minute run on Amazon’s EC2. The run took place over 300+ virtual nodes, deployed as virtual clusters with the help of the Nimbus Context Broker, and ran over roughly 10 days consuming more than 36,000 hours of compute time. More details and perspectives are available in articles in iSGTW, HPCwire, and Newsweek.

The ALICE project with CernVM: the challenge was to elastically extend the globally distributed set of resources available to the ALICE HEP experiment – one of the four Large Hadron Collider experiments at CERN – in such a way that users don’t even notice whether their code runs on cloud-provided resources or not. This was achieved by sensing resource demand and dynamically deploying virtual machine images developed by the CernVM project. More details are available in HPCwire.

Sky Computing: can you create an environment deployed over a federation of cloud resources that is configured and protected in the same way as a local cluster? We defined and evaluated it in a paper that appeared in the September/October 2009 IEEE Internet Computing issue.

We appreciate all efforts and ideas that we have seen in the last year. With this new site we’d like to broaden the potential for discussions and share not just resources and images, but also papers, thoughts, and ideas on all sorts of topics related to how cloud computing can help science. If you have ideas, experiences or thoughts, let us know.