Autoscaling TORQUE Appliance | Cloud computing for science

TORQUE is a resource manager based on the Portable Batch System (PBS). It manages the scheduling and execution of jobs in a cluster. We provide an appliance containing TORQUE 3.0.6 and Phorque. Phorque is a tool that monitors a TORQUE cluster, applies a policy to decide how many instances to launch or terminate, and then provisions instances on infrastructure clouds via Phantom, our auto-scaling cloud platform.

The TORQUE appliance is available on the Hotel and Sierra Nimbus clouds on FutureGrid with the image name torque.gz.

To run the TORQUE appliance with Phantom, you need to create a launch configuration and a domain for the TORQUE server. Phorque will be started automatically and will take care of creating a launch configuration and a domain for TORQUE worker nodes.

First, create a launch configuration for your TORQUE server node. We will call it torque-server. Select the public image torque.gz and use the following user data (which is a Chef JSON configuration file):

{
  "phorque": {
    "price_per_hour": 5,
    "clouds": [
      {
        "name": "hotel",
        "cloud_uri": "svc.uc.futuregrid.org",
        "cloud_port": 8444,
        "autoscale_uri": "phantom.nimbusproject.org",
        "autoscale_port": "8445",
        "image_id": "torque.gz",
        "price": 0,
        "access_id": "$PHANTOM_ACCESS_ID",
        "secret_key": "$PHANTOM_SECRET_ID",
        "launch_config_name": "hotellc",
        "autoscale_group_name": "hotelasg",
        "cloud_type": "nimbus",
        "availability_zone": "us-east-1",
        "instance_type": "m1.small",
        "instance_cores": 1,
        "max_instances": 5,
        "charge_time_secs": 3600
      }
    ]
  },
  "run_list": [
    "recipe[torque::server]",
    "recipe[phorque]"
  ]
}

Note: You must replace $PHANTOM_ACCESS_ID and $PHANTOM_SECRET_ID with your Phantom Autoscale credentials (see how to retrieve them).
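A quick sanity check before pasting the user data into Phantom can catch a forgotten placeholder. The sketch below is not part of Phantom or Phorque. The function name find_placeholders is made up for illustration, and it assumes that any $UPPER_CASE token still present in the text is a placeholder that was not replaced.

```python
import json
import re
from typing import List

# Assumption: any remaining "$UPPER_CASE" token is an unreplaced placeholder.
PLACEHOLDER = re.compile(r"\$[A-Z][A-Z0-9_]*")


def find_placeholders(user_data: str) -> List[str]:
    """Return the sorted, de-duplicated placeholders still present in user_data.

    Raises json.JSONDecodeError if the text is not valid JSON.
    """
    json.loads(user_data)  # fail early on malformed JSON
    return sorted(set(PLACEHOLDER.findall(user_data)))
```

An empty list means no placeholder of that shape was found; it does not prove that the credentials themselves are valid.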

If you are using Sierra, you should also replace these four values:

"name": "hotel"
"cloud_uri": "svc.uc.futuregrid.org"
"launch_config_name": "hotellc"
"autoscale_group_name": "hotelasg"

with:

"name": "sierra"
"cloud_uri": "s83r.idp.sdsc.futuregrid.org"
"launch_config_name": "sierralc"
"autoscale_group_name": "sierraasg"

You must also make sure that the hotellc/sierralc launch configurations and the hotelasg/sierraasg domains do not already exist, because Phorque creates them itself for the worker nodes.
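The Hotel-to-Sierra substitution can also be applied programmatically instead of by hand. The following is a minimal sketch, assuming the user data has already been loaded into a Python dictionary with json.loads; the names SIERRA_OVERRIDES and to_sierra are hypothetical helpers and do not come from Phorque.

```python
import copy

# The four per-cloud values listed above; every other key stays unchanged.
SIERRA_OVERRIDES = {
    "name": "sierra",
    "cloud_uri": "s83r.idp.sdsc.futuregrid.org",
    "launch_config_name": "sierralc",
    "autoscale_group_name": "sierraasg",
}


def to_sierra(user_data: dict) -> dict:
    """Return a copy of the Hotel user data with the Sierra values applied."""
    result = copy.deepcopy(user_data)  # leave the caller's dictionary untouched
    for cloud in result["phorque"]["clouds"]:
        if cloud.get("name") == "hotel":
            cloud.update(SIERRA_OVERRIDES)
    return result
```

The result can be turned back into text with json.dumps(to_sierra(data), indent=2) and pasted into the launch configuration.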

For more information on the configuration parameters, see the Phorque documentation.

Enable your selected site and save the launch configuration.

Switch to the domains tab and create a domain called torque-server that uses the launch configuration we just created, with the number of VMs set to 1. Start the domain and wait until the VM is running. Once it is running, click on it to reveal its details and take note of its hostname.

SSH to this machine as root, and switch to the torque user:

su - torque

You can now submit jobs to TORQUE. For example:

echo "cat /etc/group" | qsub
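The same submission can be scripted, which is handy when queuing many jobs. This is a sketch only, assuming it runs as the torque user on the server node and that qsub reads the job script from standard input and prints the job identifier on standard output, as in the shell example above. The function name submit_job and its qsub_cmd parameter are hypothetical; the parameter exists so that the command can be swapped out for testing.

```python
import subprocess
from typing import Sequence


def submit_job(script: str, qsub_cmd: Sequence[str] = ("qsub",)) -> str:
    """Pipe script to qsub on standard input and return what qsub prints."""
    completed = subprocess.run(
        list(qsub_cmd),
        input=script,
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if the command exits with an error
    )
    return completed.stdout.strip()
```

Calling submit_job("cat /etc/group\n") would then be equivalent to the echo pipeline shown above.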

Phorque monitors the job queue (visible with qstat). When jobs are waiting for execution, Phorque uses Phantom to instantiate new virtual machines. These VMs are configured so that they can SSH to your server node, using an SSH key generated during the initial deployment. When a job finishes, its result is sent to your server node in /home/torque:

STDIN.e0
STDIN.o0

STDIN.e0 contains standard error while STDIN.o0 contains standard output (where 0 is the job ID assigned by TORQUE).
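The naming pattern above can be used to locate the results of a given job. The sketch below assumes that the job was submitted on standard input (hence the job name STDIN), that the job identifier has the form number.hostname, and that the files land in /home/torque as described. The function name output_files is hypothetical.

```python
import posixpath
from typing import Tuple


def output_files(
    job_id: str,
    job_name: str = "STDIN",
    directory: str = "/home/torque",
) -> Tuple[str, str]:
    """Return (stdout_path, stderr_path) for a finished job."""
    # Assumption: the identifier looks like "0.hostname"; keep only the number.
    sequence = job_id.split(".", 1)[0]
    return (
        posixpath.join(directory, f"{job_name}.o{sequence}"),
        posixpath.join(directory, f"{job_name}.e{sequence}"),
    )
```

A job submitted with a different name would need that name passed as job_name.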
