The history of High Performance Computing (HPC) can be told as a continuous search for the optimal use of the underlying hardware platform. As new architectures are delivered with impressive peak performance specifications, a parallel effort is made by the HPC community to exploit these new machines. This has been the case from the development of message-passing frameworks in the early 1980s to today's challenges of multi-core processors.
An important piece in this quest is the Distributed Resource Management System (DRMS). A DRMS can be roughly defined as a software component that provides a uniform, single view of a set of (possibly heterogeneous) computational resources. A DRMS can be found in any distributed system, from tightly coupled MPPs to Grid or even peer-to-peer environments. In all of these platforms, the goal of the DRMS is to find an optimal assignment (in terms of a given target, such as utilization or wall-time) between a computational workload and the computational resources.
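The assignment problem a DRMS solves can be illustrated with a minimal first-fit sketch. The job and node names and the policy below are purely illustrative; real schedulers weigh many more factors (priorities, data locality, fairness, backfilling):

```python
# Toy first-fit assignment of jobs to nodes, illustrating the
# workload-to-resources matching a DRMS performs. Names and policy
# are hypothetical; production schedulers are far more sophisticated.

def first_fit(jobs, nodes):
    """Assign each job (name, cpus_needed) to the first node with enough free CPUs."""
    free = dict(nodes)            # node name -> free CPU slots
    assignment = {}               # job name -> node name
    for job, need in jobs:
        for node, slots in free.items():
            if slots >= need:
                assignment[job] = node
                free[node] -= need
                break             # job placed; move on to the next one
    return assignment

jobs = [("sim-a", 2), ("sim-b", 4), ("sim-c", 1)]
nodes = [("node1", 4), ("node2", 4)]
print(first_fit(jobs, nodes))
# → {'sim-a': 'node1', 'sim-b': 'node2', 'sim-c': 'node1'}
```

Even this toy version shows the core tension: a different placement order or target metric (utilization versus wall-time) yields a different assignment.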
Virtualization has opened up new avenues for resource management techniques, such as server consolidation and isolation or the provisioning of custom execution environments. However, probably its most exciting feature is the ability to dynamically shape a hardware infrastructure: the same set of hardware blades can be configured as a web server and a database server, as a set of cluster worker nodes for scientific computing, or as a set of workstations for a virtual classroom.
In this way, virtualization has completely changed the habitat of the DRMS species. In fact, its goal can be reformulated the other way around: find the optimal computational resources to execute a given computational workload. Traditionally, a DRMS assigns pending jobs to local resources and, more recently with the advent of Grid computing, also to remote resources in other administration domains.
There are two alternatives today to extend the limits of a computing cluster:
- Moving the jobs to the resources. Grid computing enables the interoperation of different DRMSs, so a meta-scheduler can transfer jobs to another cluster. GridWay on top of Globus provides a solution to balance workload across several clusters. This federation of administration domains may even follow a typical customer-provider relationship, in a utility fashion.
- Moving the resources to the jobs. The cluster can be provided with additional worker nodes to execute the extra workload. A very interesting example of this approach, which combines Grid computing and virtualization (Workspace Service and Amazon EC2), has been made in the context of the STAR project. Another interesting project that implements this idea is the Hedeby project.
The generalization of this second approach and its application to any kind of service requires a new component class for resource management. This new component can be referred to as an Infrastructure Management System (IMS); there is no clear consensus on its name, and virtual environment manager, virtualization manager or VM manager are sometimes used. An IMS dynamically shapes a physical infrastructure by deploying virtual machines to adapt its configuration to the services it supports and their current load. Given the characteristics of VMs (flexibility, security, portability…), the IMS will be the key resource management component for the next-generation resource provisioning paradigm, i.e. cloud computing.
Some examples include commercial products like IBM Virtualization Manager, Platform Orchestrator or VMware VirtualCenter, and open source initiatives like the OpenNEbula Virtual Infrastructure Engine. OpenNEbula allows a physical cluster to dynamically execute multiple virtual clusters, thus providing on-demand resource provisioning: the number of worker nodes can grow according to user demand, so that there are always computing slots available. So the question is: are you going to move the jobs or the nodes?
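The "grow the worker nodes on demand" idea can be sketched as a simple elasticity rule: if pending jobs exceed the free slots, deploy enough extra virtual worker nodes to cover the deficit. This is an illustrative policy under assumed parameters (slots per node, a cap on new nodes), not OpenNEbula's actual algorithm:

```python
# Toy IMS-style elasticity rule: decide how many new VM worker nodes to
# start so pending jobs find computing slots. Hypothetical policy and
# parameter names; a real IMS applies richer placement and admission logic.

def nodes_to_deploy(pending_jobs, free_slots, slots_per_node, max_new_nodes):
    """Return the number of new VM worker nodes to deploy (0 if none needed)."""
    deficit = pending_jobs - free_slots
    if deficit <= 0:
        return 0                          # current capacity already suffices
    needed = -(-deficit // slots_per_node)  # ceiling division
    return min(needed, max_new_nodes)     # respect the physical/budget cap

# 10 pending jobs, 2 free slots, 4 slots per VM node → 2 new nodes
print(nodes_to_deploy(pending_jobs=10, free_slots=2,
                      slots_per_node=4, max_new_nodes=5))
# → 2
```

The complementary "move the jobs" alternative would instead hand the deficit to a meta-scheduler, which forwards the excess workload to another cluster rather than growing the local one.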