Building a Great Testbed for Cloud Computing Research

Posted July 29, 2016 by futureplatforms in General, Uncategorized

We have been silent for a while – not for lack of something to say but for lack of time to say it ;-). But today is a special occasion: yesterday was the first anniversary of the day Chameleon, a cloud computing experimental instrument project that the Nimbus team is proud to lead, went public. Considering how busy we are, breaking the silence is a bit of a treat, but after all, that’s what anniversaries are for!

One lesson we all learned in over a decade of working on cloud computing research – or any type of systems research, in fact – is that without a reliable testbed such research is hard with gusts to impossible. Experiments with new solutions in operating systems, virtualization, and power management all require significant levels of control, where the user has access to a broad range of system-level tasks: everything from customizing the system kernel to accessing IPMI or even reconfiguring the BIOS. They also frequently need to be run in a controlled environment, without being impacted by other users. This was always the crux of the matter: such resources are hard to obtain – and even harder to obtain at the scale required by research on Big Data and Big Compute.

No longer. Thanks to NSF’s vision and support for experimental Computer Science, we now have the large-scale, deeply reconfigurable testbed that cloud computing research requires. The bulk of Chameleon hardware so far consists of over 550 nodes and 5 PB of storage distributed between the University of Chicago and TACC, with a 100 Gbps network between them. The nodes are distributed over 12 homogeneous racks, each consisting of 42 compute nodes (Intel Haswell processors) and 4 storage nodes with 16 2 TB disks each, which gives you very high-bandwidth I/O for your Big Data experiments. The remaining active storage is configured as an object store for experimental data and as an image server. Onto this relatively homogeneous framework we grafted heterogeneous elements: one of the racks has an InfiniBand network in addition to Ethernet; two nodes have higher memory, SSD, and HDD elements to facilitate experiments with storage hierarchies; and we have two K80 GPU nodes and two M40 GPU nodes. To this we are planning to add NVRAM, FPGAs, as well as a new cluster of ARMs and Atoms.
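For the curious, the rack numbers above can be tallied in a few lines of Python. This is only back-of-the-envelope arithmetic on the figures quoted in this post; the variable names are ours, decimal units (1 PB = 1000 TB) are an assumption, and the result is raw disk capacity rather than usable space.

```python
# Figures quoted in the post; names are illustrative only.
RACKS = 12
COMPUTE_NODES_PER_RACK = 42
STORAGE_NODES_PER_RACK = 4
DISKS_PER_STORAGE_NODE = 16
DISK_SIZE_TB = 2

compute_nodes = RACKS * COMPUTE_NODES_PER_RACK
storage_nodes = RACKS * STORAGE_NODES_PER_RACK
rack_nodes = compute_nodes + storage_nodes

# Raw capacity of the in-rack storage nodes (assumes 1 PB = 1000 TB).
rack_storage_tb = storage_nodes * DISKS_PER_STORAGE_NODE * DISK_SIZE_TB
rack_storage_pb = rack_storage_tb / 1000

print(f"compute nodes:   {compute_nodes}")
print(f"storage nodes:   {storage_nodes}")
print(f"nodes in racks:  {rack_nodes}")
print(f"in-rack storage: {rack_storage_tb} TB ({rack_storage_pb} PB raw)")
```

The total of nodes in the racks lines up with the "over 550 nodes" figure, and the in-rack disks account for roughly 1.5 PB of raw capacity, with the rest of the 5 PB being the object store and image server storage mentioned above.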

The really exciting bit is in the configuration though. We built this testbed for our colleagues and ourselves – anybody who needs an “as if it were in my lab” level of access to a system. Requirements from detailed interviews with almost 20 Computer Science research teams went into developing the vision. As a result of their insights, our Resource Discovery portal has painstakingly detailed hardware configuration information, ranging from cache levels of nodes to serial numbers of individual components, that is automatically discovered and updated as hardware and firmware on the testbed change. Moreover, every time anything on the testbed changes, we release a new testbed version (over 30 versions over the last year!) so that you can tell at a glance if the testbed today is the same as the testbed yesterday. You can ask for resources interactively (on demand) or place advance reservations – which you may have to do if you have an eye on hundreds of nodes for your Big Compute experiment! You can reconfigure nodes at the “bare metal” level – we provide a range of appliances to make it easier, but you can also roll your own – and from there you can add configuration, reboot into a custom kernel, or reconfigure to a completely new system as needed. Then you can snapshot, i.e., save your appliance so that you can move it to a different site, revisit it later, or point to the exact version of your environment in your paper.
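To give a flavor of the "new version whenever anything changes" idea, here is a minimal Python sketch. It is not how Chameleon implements versioning; it simply assumes a hardware description can be represented as a dictionary, fingerprints it with a hash, and records a new version only when the fingerprint differs from the last one. The `fingerprint` function, the `VersionLog` class, and the field names in the example are all hypothetical.

```python
import hashlib
import json


def fingerprint(description: dict) -> str:
    """Hash a hardware description so that key order does not matter."""
    canonical = json.dumps(description, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class VersionLog:
    """Hypothetical log that releases a new version only on a real change."""

    def __init__(self) -> None:
        self.history: list[str] = []  # one fingerprint per released version

    def record(self, description: dict) -> int:
        """Record a description and return the current version number."""
        fp = fingerprint(description)
        if not self.history or self.history[-1] != fp:
            self.history.append(fp)
        return len(self.history)


# Made-up node description; the field names are for illustration only.
node = {"node": "example-node-1", "cache_l2_kb": 256, "bios": "1.0"}

log = VersionLog()
first = log.record(node)                       # first description seen
same = log.record(dict(node))                  # unchanged, no new version
changed = log.record({**node, "bios": "1.1"})  # firmware change, new version
print(first, same, changed)
```

Comparing two version numbers is then enough to tell whether the description recorded today matches the one recorded yesterday, which is the property described above.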

And there is one more interesting thing: we built the cloud research testbed on top of a cloud, pulling ourselves up by our bootstraps as it were. While many pioneering research infrastructures, such as GENI and Grid’5000, gave us ideas, we took a gamble that today this type of infrastructure can be built using off-the-shelf software components – and it paid off with dividends! Most of Chameleon is built using OpenStack: we are using Ironic for bare metal reconfiguration, along with Nova, Glance, Swift, and all the other familiar OpenStack components. To be sure, we had to extend it to fit our requirements – adding advance reservations and snapshotting, for example – but we are very happy with the payoff. Using OpenStack allows us to leverage its ecosystem of tools and new features as they come out, make contributions that then serve a community broader than our users, and make our model very accessible to others. OpenStack is not everything, of course: our resource description and versioning system was borrowed from Grid’5000, and we added and populated the Appliance Catalog as well as other features that make things easier for the users.

It has been a busy, adventurous, and very exciting year: we pioneered a new way of building experimental systems, and of this we are very proud. But not as proud as of the fact that in this one year of operation Chameleon served as an experimental instrument for 190+ exciting and innovative research projects and was used by 800+ users working on everything from developing exascale operating systems, through security, to education projects. And not as proud as of the fact that five fantastic institutions, each contributing complementary expertise, could come together and work as one team to build a new experimental instrument for other cloud computing researchers.

Happy Birthday Chameleon!

P.S. If you are working on cloud computing research and need a testbed, check us out at www.chameleoncloud.org – we support Computer Science research nationwide and international collaborations as well. We will be happy to support your research!
