Warning! The cluster has been shut down. It is no more. It is an Ex-Cluster!

This page is kept for informational purposes only; the cluster itself has been shut down.

Math Department Computing Cluster

The Math Department operates a general-purpose computing cluster managed with the Torque/Maui job scheduling software. The cluster is available to all Math faculty and graduate students.

The cluster currently consists of 16 dedicated compute nodes built around quad-core Intel Xeon X5355 2.66 GHz CPUs running in 64-bit mode. Each node contains 2 CPUs (8 cores), giving a total of 32 CPUs in the cluster. Most nodes have 16 GB of RAM, though two nodes have 32 GB. Nodes are connected by a high-bandwidth, low-latency InfiniBand interconnect as well as 10/100 Ethernet.

The cluster runs a Linux-based operating system and supports a variety of software including MATLAB, Mathematica, Macaulay2, and an MPI implementation (MVAPICH2). Compiled C and Fortran programs can also be run in either 64-bit or 32-bit mode.

Contents

Quick Reference Guide
Cluster Reference
Unix/Linux help

Introduction

A cluster is a group of computers that are networked together and are managed by software so that they can be treated as one large machine. The cluster is managed by a program called the scheduler, which determines how best to use the resources (CPU, memory, disk space) provided by the cluster.

When you want to run something on the cluster, you need to let the scheduler know what you want to run and what resources it needs. This is called submitting a job. Your job may not run immediately; if the resources it needs are not currently available, the scheduler will hold your job in a queue until they become free.

To get access to the cluster, you will need to log in to the frontend server of the cluster. Use your favorite ssh program to connect to cluster.math.vt.edu with your Math PID and password. You will be placed in your cluster home directory, which is separate from your Math home directory. A file share for your cluster home directory is available; please see the file share help page for information on using it.
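
For example, from a machine with a command-line ssh client such as OpenSSH (replace PID with your Math PID):

ssh PID@cluster.math.vt.edu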

In the cluster, your Math home directory is available only on the frontend. It is located at /math/PID, where PID is your Math PID. If you want to make files available to the compute nodes, you will need to copy them to your cluster home directory, /home/PID.
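
For example, while logged in to the frontend you could copy a directory from your Math home into your cluster home (project_files is just a placeholder name):

cp -r /math/PID/project_files /home/PID/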

Once your job is running, it can also create or copy files in scratch space, /scratch. Scratch space is on faster drives and should be used for temporary files or frequently accessed data files your programs are working on. Anything you want to keep should be copied back to your home directory before your job finishes.
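
A common pattern inside a job script is to make your own scratch directory, work there, and copy the results home before the job ends. A minimal sketch (the directory, program, and file names are placeholders):

mkdir -p /scratch/PID/myjob
cp $HOME/input.dat /scratch/PID/myjob/
cd /scratch/PID/myjob
./myprogram input.dat > results.out
cp results.out $HOME/
rm -rf /scratch/PID/myjob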

Getting Started

You will need to do a few things before you are able to use the cluster: have access to your Math home directory, create an ssh key, and enable password-less logins to the cluster nodes.

Creating an ssh key

On cluster.math you will want to run the following command:

ssh-keygen

Just hit the return key at the passphrase prompt; do not set a passphrase. When it finishes, run the following:

cd $HOME/.ssh
cat id_rsa.pub >> authorized_keys

This will allow you to use your ssh key in place of your password.

Host keys for all the cluster servers are automatically copied to your cluster home directory (.ssh/known_hosts) when you first log in to the cluster. If this file becomes lost or corrupted, you can download a new copy here.

Matlab Parallel Configuration

To run MATLAB jobs using the Distributed Computing Environment, you need to set up a Parallel Configuration. A sample is provided here and is also located at /opt/matlab/Math_Cluster.settings on the frontend machine.

To use this configuration, import the settings file into MATLAB's parallel configuration manager.

Job Submission Files

A job submission file, referred to as a PBS file, is a simple text file that does two things: it tells the cluster what resources you need, and it tells the cluster what to run. A sample file that runs a job on a single CPU looks like this:

#PBS -N MYJOB
#PBS -S /bin/sh
#PBS -M PID@math.vt.edu
#PBS -m ea

cd $HOME
matlab -nodisplay -nojvm -r MATLAB_MFILE >& OUTPUT_FILE

PID is your Math PID; MATLAB_MFILE is the name of the MATLAB M-file to run, and OUTPUT_FILE is the file that will capture its output.

With the -m ea option, you will get an email message when your job finishes and if it aborts. Any output from the job will be in your HOME directory.

A more complicated job that runs 3 processes in parallel on each of 2 nodes would look something like this:

#PBS -N MYJOB
#PBS -S /bin/sh
#PBS -l nodes=2:ppn=3
#PBS -M PID@math.vt.edu
#PBS -m ea

cd $HOME
myMPIprogram >& OUTPUT_FILE

nodes is how many different machines (currently a maximum of 3 per job) you want your job to run on.
ppn is how many processors (cores) to use on each requested machine; each machine has up to 8 available (see the example below).
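
For example, based on the limits above, to run on all 8 cores of a single machine the resource request line would be:

#PBS -l nodes=1:ppn=8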

The Math cluster supports some specialty options, such as high-memory servers. Please see Cluster Reference for more information on making use of these.

Submitting Jobs

Jobs can be submitted to the cluster only from the frontend (cluster.math.vt.edu).

You need these things:

  1. A job submission file
  2. Program and/or data files copied to your cluster home directory

Assuming that your PBS file and all necessary program files are located properly under your home directory, you would log in to cluster.math.vt.edu and then run something like

cat job.pbs | qsub
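
You can also give qsub the file name directly. Either way, qsub prints the identifier assigned to your job (something like 1234.cluster.math.vt.edu; the ID shown here is only an example), which you will need if you want to check on or cancel the job later:

qsub job.pbs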

Job Status and Control

To check the status of your jobs, run the following command on cluster.math.vt.edu:

qstat
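
qstat with no arguments lists every job in the queue. To see only your own jobs, or to cancel one, the standard Torque commands below should work (1234 stands for the job ID that qsub printed):

qstat -u PID
qdel 1234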

MPI

MPI is available on the Math cluster through a package called MVAPICH2. This package supports communication over InfiniBand connections. The InfiniBand interfaces should be faster than Ethernet, but if you are having problems with MPI, it is worth trying the Ethernet interfaces.

In general you will use an mpd daemon running on each node. A helper script, mpisetup.sh, has been created to set up and run an mpd on each node the scheduler assigns to your job. Each mpd will use an InfiniBand interface for its communications by default. To use a different interface, specify eth (Ethernet) or ib1 (the alternative InfiniBand interface) as an argument to mpisetup.sh.

#PBS -l nodes=3:ppn=2

mpisetup.sh

mpdtrace -l

mpirun -np 6 ./a.out

mpdallexit

This should run using 2 processor cores on each of 3 nodes, for a total of 6 processor cores.

mpdtrace is an optional step that may help in debugging problems.
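
Putting the pieces together, a complete MPI job submission file might look like the following sketch, built from the examples above (MYJOB, PID, a.out, and OUTPUT_FILE are placeholders):

#PBS -N MYJOB
#PBS -S /bin/sh
#PBS -l nodes=3:ppn=2
#PBS -M PID@math.vt.edu
#PBS -m ea

cd $HOME
mpisetup.sh
mpirun -np 6 ./a.out >& OUTPUT_FILE
mpdallexit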

Don't forget to have a .mpd.conf file in your home directory with a line setting the MPD_SECRETWORD variable to some value. You will know you have to do this if you get an error from mpdboot_frontend like this:

mpdboot_frontend (handle_mpd_output 406): from mpd on 10.10.0.7, invalid port info:
no_port

To do this easily, try the following:
echo MPD_SECRETWORD=xxxxxxxxxxx > ~/.mpd.conf
chmod 600 ~/.mpd.conf

where xxxxxxxxxxx is some secret password that you want to use. Be careful if using characters like quotes, *, or other special characters in your password.
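
One way to avoid trouble is to single-quote the whole assignment so the shell does not expand anything before it is written to the file (a minimal sketch; my*secret*word is just a placeholder):

echo 'MPD_SECRETWORD=my*secret*word' > ~/.mpd.conf
chmod 600 ~/.mpd.conf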

Please see the MVAPICH Users Guide for more information on using MPI.