
Running Jobs on the Jazz Cluster

Access to Jazz compute nodes is provided via a scheduler and an associated batch queuing application (currently PBS is used for both). To run an application, the user submits a 'job' to the scheduler. The scheduler will then determine, based on LCRC policies, available and requested resources and other jobs waiting in the queue, when the job should run. At that time, the scheduler hands the job to the queuing application which initiates the job.

Please see the LCRC FAQ section on PBS for additional information.

When a user's job starts, it is given a list of compute nodes. Jazz compute nodes are named j[1-350], i.e. j1, j2, ... j350. The automatically generated node list is stored on the master node in the file named by the $PBS_NODEFILE environment variable. If the job is an interactive job, the user will be given a shell on the master node itself. The assigned nodes are only accessible by the user who submitted the job.

The general concept is to request some number of nodes and any required resources on those nodes. PBS will give you that many nodes immediately if possible; if there are not enough nodes available, the job will be scheduled for later. Use qsub to submit jobs. Use 'qstat -a' to see the current jobs and track your job. Use 'qstat -f jobid' to see what is happening with your job, which nodes are assigned, etc. Examples for each of these commands, and many others, are shown within this document, and are also available in the PBS command summary document.
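As a rough sketch of pulling the assigned nodes out of what 'qstat -f jobid' reports: PBS lists them in an exec_host attribute, typically as node/cpu pairs joined by '+'. The helper name hosts_from_exec_host and the sample value are hypothetical, the sketch is written in POSIX sh, and the exact exec_host layout may differ between PBS versions.

```shell
#!/bin/sh
# Hypothetical helper: reduce an exec_host value to unique node names.
# Assumes the value looks like "j12/0+j11/0" (node/cpu pairs joined by '+').
hosts_from_exec_host() {
    printf '%s\n' "$1" | tr '+' '\n' | sed 's|/.*$||' | sort -u
}

# Illustrative value only, not output captured from Jazz.
hosts_from_exec_host "j12/0+j11/0+j12/1"
```

Each node name is printed once, even if the job holds more than one processor on it.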

Working interactively with some nodes is by far the easiest approach, and you don't even need a PBS job script. For example, to get 10 nodes:

qsub -I -l nodes=10

Once the nodes are available, you will be put on the master node (usually the highest-numbered of your nodes). At that point, you can see which nodes you have been given by looking at the file named by $PBS_NODEFILE:

cat $PBS_NODEFILE
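If the raw list is long, a summary can be easier to read. The following is a minimal sketch in POSIX sh (not the tcsh used elsewhere on this page); summarize_nodes is a hypothetical helper, and the fallback sample file with made-up node names exists only so the sketch can be tried outside a PBS job.

```shell
#!/bin/sh
# Summarize a PBS node file: one line per distinct node with its entry count.
summarize_nodes() {
    sort "$1" | uniq -c | awk '{print $2 ": " $1}'
}

# Outside a PBS job $PBS_NODEFILE is unset, so fall back to a small
# sample file (hypothetical node names) to keep the sketch runnable.
nodefile="${PBS_NODEFILE:-}"
made_sample=no
if [ -z "$nodefile" ]; then
    nodefile=$(mktemp)
    made_sample=yes
    printf 'j12\nj11\nj12\n' > "$nodefile"
fi

echo "entries: $(wc -l < "$nodefile" | tr -d ' ')"
summarize_nodes "$nodefile"

# Remove the sample file if this sketch created it.
if [ "$made_sample" = yes ]; then
    rm -f "$nodefile"
fi
```

Inside an interactive job the same commands read the real node file, since $PBS_NODEFILE is already set there.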

You will be given the default walltime, which is 15 minutes. At the end of that walltime, your job will be terminated. If you exit from the node you were placed on, your job will also be terminated. You can request additional time with the walltime resource:

qsub -I -l nodes=10,walltime=0:60:00
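The walltime value is written as hours:minutes:seconds. If you think in minutes, a small helper can build the string for you. This is only a sketch in POSIX sh; walltime_from_minutes is a hypothetical name, not part of PBS.

```shell
#!/bin/sh
# Hypothetical helper: turn a number of minutes into an H:MM:SS walltime string.
walltime_from_minutes() {
    printf '%d:%02d:00\n' $(( $1 / 60 )) $(( $1 % 60 ))
}

walltime_from_minutes 60    # prints 1:00:00, the same duration as 0:60:00
walltime_from_minutes 40    # prints 0:40:00
```

The result could then be passed along, for example as 'qsub -I -l nodes=10,walltime=1:00:00'.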

If you want to submit a job and have it run everything for you, you need to use a job script. In the script, you specify all the resources, etc. with #PBS directives. You would make a file with the script in it, and then use 'qsub <filename>' to submit it. You can override any #PBS directives on the command line, for example: 'qsub -l nodes=10'. Output from your job will be stored in two files (unless you request just one, see below) called <jobname>.{o,e}<jobid>.
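To make the <jobname>.{o,e}<jobid> naming concrete, here is a small sketch in POSIX sh that prints the two file names to look for. pbs_output_files is a hypothetical helper, 12345 is a made-up job id, and the sketch assumes only the numeric part of the job id appears in the file name.

```shell
#!/bin/sh
# Hypothetical helper: print the stdout and stderr file names that follow
# the <jobname>.{o,e}<jobid> pattern described above.
# Assumption: the job id is the numeric part only.
pbs_output_files() {
    jobname=$1
    jobid=$2
    printf '%s.o%s\n%s.e%s\n' "$jobname" "$jobid" "$jobname" "$jobid"
}

# 12345 is a made-up job id for illustration.
pbs_output_files mpd2-test1 12345
```

The first line is the stdout file and the second is the stderr file.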

In the example script below, you will see the following #PBS directives:

-l nodes=50 (I'm requesting 50 nodes)
-l walltime=0:40:00 (for a maximum of 40 minutes)
-j oe (to join my stdout and stderr into one file)
-m abe (send me email when the job starts and stops and if it aborts)
-N mpd2-test1 (name the job mpd2-test1, restrictions: 15 chars max, no whitespace, 1st char must be alpha)
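The -N restrictions are easy to trip over, so it can help to check a name before submitting. A minimal sketch in POSIX sh follows; valid_job_name is a hypothetical helper that encodes only the three rules listed above (15 characters max, no whitespace, first character alphabetic), and PBS itself may apply further checks.

```shell
#!/bin/sh
# Hypothetical helper: return 0 if a job name meets the -N restrictions
# listed above, 1 otherwise.
valid_job_name() {
    name=$1
    # at most 15 characters
    [ "${#name}" -le 15 ] || return 1
    # first character must be alphabetic (this also rejects an empty name)
    case $name in
        [A-Za-z]*) ;;
        *) return 1 ;;
    esac
    # no whitespace anywhere in the name
    case $name in
        *[[:space:]]*) return 1 ;;
    esac
    return 0
}

for n in mpd2-test1 1st-run "my job" a-name-that-is-too-long; do
    if valid_job_name "$n"; then
        echo "ok:  $n"
    else
        echo "bad: $n"
    fi
done
```

Of the four sample names, only mpd2-test1 passes all three rules.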


Here's the script:
#!/bin/tcsh
#PBS -l nodes=50
#PBS -l walltime=0:40:00
#PBS -j oe
#PBS -m abe
#PBS -N mpd2-test1

# set NN to the number of nodes assigned to this job
setenv NN `wc -l $PBS_NODEFILE | awk '{print $1}'`

echo '========================================'

# print out the list of nodes
echo 'NODES: '
cat $PBS_NODEFILE

echo '========================================'

# run an mpirun job
mpirun -np $NN -machinefile $PBS_NODEFILE /home/smc/hello
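Before submitting, you may want to see the mpirun command line such a script would produce. The following dry-run sketch is written in POSIX sh rather than tcsh; mpirun_command is a hypothetical helper, the node names in the sample file are made up, and nothing is actually launched.

```shell
#!/bin/sh
# Hypothetical helper: print the mpirun command line for a node file and
# program, using one process per node-file entry as the script above does.
mpirun_command() {
    nodefile=$1
    program=$2
    nn=$(wc -l < "$nodefile" | tr -d ' ')
    echo "mpirun -np $nn -machinefile $nodefile $program"
}

# Dry run against a sample node file (hypothetical node names).
sample=$(mktemp)
printf 'j3\nj2\nj1\n' > "$sample"
mpirun_command "$sample" /home/smc/hello
rm -f "$sample"
```

With the three-entry sample file, the printed command uses '-np 3'; inside a real job you would pass $PBS_NODEFILE instead.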


