Running Jobs

LCRC clusters use the Torque resource manager and the Maui scheduler, which together allow users to submit jobs via PBS commands. The Portable Batch System (PBS) is a richly featured workload management system providing job scheduling and a job management interface on computing resources, including Linux clusters. With PBS, a user requests resources and submits a job to a queue. The system then takes jobs from the queues, allocates the necessary nodes, and executes them as efficiently as it can.

The Basics

Scheduler

The scheduler is a policy engine which allows sites control over when, where, and how resources such as processors, memory, and disk are allocated to jobs. In addition to this control, it also provides mechanisms which help to intelligently optimize the use of these resources, monitor system performance, help diagnose problems, and generally manage the system.

LCRC is using the Maui scheduler. Maui is developed by Cluster Resources and you can refer to the Maui documentation on their site for more information.

Resource Manager

Resource managers provide the low-level functionality to start, hold, cancel, and monitor jobs. Without these capabilities, a scheduler alone cannot control jobs.

LCRC is using the resource manager Torque, which is a derivative of PBS and may be considered synonymous in this context. Torque is developed by Cluster Resources and you can refer to the Torque Documentation on their site for more information.

Job Flow

The life cycle of a job can be divided into four stages: creation, submission, execution, and finalization.

Creation

Typically, a submit script is written to hold all of the parameters of a job. These parameters could include how long a job should run (walltime), what resources are necessary to run, and what to execute.

  • You may only submit jobs from the cluster login nodes.
  • The PBS command file does not need to be executable.
  • In the case of parallel jobs, the PBS command file is staged to, and executed on, the first allocated compute node only. Use MPI or SSH to run programs on multiple nodes.
  • The command script is always executed from your home directory. You can change directories from within the script so that log files are written to your project directory, and you can refer to the submission directory with the $PBS_O_WORKDIR environment variable (see the sketch after this list).
  • Please do NOT run large, long, multi-threaded, parallel, or CPU-intensive jobs on a front-end login host. All users share the front-end hosts, and running anything but the smallest test job will negatively impact everyone’s ability to use the clusters.
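
As a minimal sketch of these points (the program name my_program.x is hypothetical), a submit script that returns to the submission directory might look like:

#!/bin/sh
#PBS -N example
#PBS -l nodes=1:ppn=16
#PBS -l walltime=00:10:00
#PBS -j oe

# PBS starts this script in your home directory; change to the
# directory the job was submitted from so that input files are found
# and output lands where you expect it.
cd $PBS_O_WORKDIR
./my_program.x
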
Submission

A job is submitted with the qsub command. Once submitted, the policies set by the administration and technical staff of the site dictate the priority of the job and therefore, when it will start executing. A limit of 100 job submissions per user is currently enforced.
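
For example, to submit a script named myjob.pbs (both the script name and the returned job ID are illustrative):

$ qsub myjob.pbs
123456.blues.lcrc.anl.gov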

Execution

Jobs often spend most of their lifecycle executing. While a job is running, its status can be queried with the showq or qstat command. A limit of 32 jobs per user running at once is currently enforced.

Finalization

When a job completes, any leftover user processes are killed, and by default, the stdout and stderr files are copied to the directory where the job was submitted.
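
By default, Torque names these files <jobname>.o<jobid> for stdout and <jobname>.e<jobid> for stderr (or a single .o file if the streams are joined with #PBS -j oe). For example, after a hypothetical job 123456 named myjob finishes, you might see:

$ ls
myjob.pbs  myjob.e123456  myjob.o123456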

Priorities

Prioritization is the process of determining which of many options best fulfills overall goals. With the jobs prioritized, the scheduler can roughly fulfill site objectives by starting the jobs in priority order.

What changes a job’s priority value?

  • Assigned project priority (positive/negative for expired allocations)
  • Queue time (positive)
  • Node count (positive)
  • Fairshare (positive/negative)

Fairshare

Fairshare is a mechanism which allows historical resource utilization information to be incorporated into job feasibility and priority decisions.

Backfill

Backfill is a scheduling optimization which allows a scheduler to make better use of available resources by running small jobs out of order.

Available Queues

We have several public queues to choose from on each cluster. The table below gives details on the types of nodes in each public queue; it does not include information on the private condo nodes.

Blues Cluster Queues

Queue     Nodes   Cores   Memory   Processor                           Co-processors               Local Scratch Disk
shared    4       16      64 GB    Sandy Bridge Xeon E5-2670 2.6GHz    None                        15 GB
batch     306     16      64 GB    Sandy Bridge Xeon E5-2670 2.6GHz    None                        15 GB
haswell   40      32      128 GB   Haswell Xeon E5-2698v3 2.3GHz       None                        15 GB
biggpu    6       16      768 GB   Sandy Bridge Xeon E5-2670 2.6GHz    2x NVIDIA Tesla K40m GPU    1 TB

Submitting Jobs

Batch

Job submission is accomplished using the qsub command, which takes a number of command line arguments and integrates them into the specified PBS command file. The PBS command file may be specified as a filename on the qsub command line or may be entered via STDIN. PBS batch scripts can be created using your favorite editor.

As an example, let’s create a batch script called hello.pbs that expects an executable called hello.x, as shown below:

#!/bin/sh

#PBS -N hello
#PBS -l nodes=1:ppn=16
#PBS -l walltime=0:00:15
#PBS -j oe

cd $PBS_O_WORKDIR
mpiexec ./hello.x

As you can see, the first portion of the script sets the directives needed to submit the job and have it distributed to the machines assigned to you by the scheduler; see the qsub man page for the full list of options you can pass to qsub. The remainder of the script sets the working directory and the program to execute. In addition, check out our job scheduling policy listed here for a full understanding of the acceptable limits for jobs submitted to the queue.
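
For instance, the hello.pbs script can be submitted as-is, or with some of its directives overridden on the command line (the walltime value here is only an example):

$ qsub hello.pbs
$ qsub -l walltime=00:05:00 hello.pbs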

Interactive

Interactive jobs run on compute nodes. You can start an interactive job either with a specific time constraint (walltime=hh:mm:ss) or with the default time constraint of the queue to which you submit your job; PBS assigns a wall time limit to every job, even an interactive one.

If you request an interactive job without a wall time option, PBS assigns to your job the default wall time limit for the queue to which you submit. If this is shorter than the time you actually need, your job will terminate before completion. If, on the other hand, this time is longer than what you actually need, you are effectively withholding computing resources from other users. For this reason, it is best to always pass a reasonable wall time value to PBS for interactive jobs.

Once your interactive job starts, you may use that connection as an interactive shell and invoke whatever other programs or other commands you wish. To submit an interactive job with one minute of wall time, use the -I option to qsub:

$ qsub -I -l walltime=00:01:00
waiting for job 100.blues.lcrc.anl.gov to start
job 100.blues.lcrc.anl.gov ready

If you need to use a remote X11 display from within your job, add the -v DISPLAY option to qsub as well:

$ qsub -I -l walltime=00:01:00 -v DISPLAY
waiting for job 101.blues.lcrc.anl.gov to start
job 101.blues.lcrc.anl.gov ready

To quit your interactive job, run:

$ logout

Deleting Jobs

You can delete jobs using the Torque/PBS command qdel (note: you can only delete jobs you have submitted.)

$ qdel <jobid>

If for whatever reason you are unable to delete your job, contact us and we can delete it for you.

Special Queues (Blues)

LCRC offers several queues that differ from the normal compute queues. Outlined below are some of the queues and some caveats when working with them.

Running on the Blues ‘haswell’ Queue

In order to make optimal use of these nodes with mvapich2, you will need the latest version of the hydra job launcher, which is available under the softenv key +hydra-3.2. Place this key above the mvapich2 key so that the newer hydra launcher is used.

You will also need to add an additional flag to the mpiexec command to have MPI ranks bind to the correct processors, as shown below:

mpiexec -n 32 --bind-to=core

If using mpirun, use:

mpirun -n 32 --bind-to core

After those changes you are now ready to compute on the haswell queue. To submit to this queue you can simply add #PBS -q haswell to your batch script or submit your job with the -q haswell flag.
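
Putting these pieces together, a minimal haswell batch script might look like the following sketch; it reuses the hello.x executable from the earlier example, and the walltime is only illustrative:

#!/bin/sh
#PBS -N hello_haswell
#PBS -q haswell
#PBS -l nodes=1:ppn=32
#PBS -l walltime=00:10:00
#PBS -j oe

cd $PBS_O_WORKDIR
# With mvapich2 and the +hydra-3.2 key in place, bind ranks to cores.
mpiexec -n 32 --bind-to=core ./hello.x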

Be aware that, as with our other compute nodes, you will be charged for every processor on a node that you are allocated, regardless of whether you use all of the resources on that node.

Note that the options to bind processes to cores may vary between MPI packages. Please consult the documentation for the MPI package you use or send an email to support@lcrc.anl.gov if you require assistance.

Running on the Blues ‘biggpu’ Queue

The addition of big memory GPU nodes to Blues facilitates computations that require special capabilities. You can now take advantage of the power of graphical processors to further increase the amount of work you can accomplish on Blues with codes developed for GPUs as a computing resource.

In addition to the standard 16 Sandy Bridge cores, each node has 2 NVIDIA Tesla K40m GPUs and each GPU has 12GB of on-card memory and 2,880 CUDA cores to give an estimated double precision floating point performance of 1.66 Teraflops per GPU.

In addition to having GPUs, these Blues nodes also have 768GB of memory to help solve problems that are more memory bound. The large memory enables data analyses, visualizations and pre- and post-processing that require large, shared memory. Lastly, this queue offers 1 Terabyte of local scratch disk.

There are a total of 6 of these big memory/GPU nodes.

There is a separate queue for the big memory/GPU nodes named biggpu. To submit to this queue you can simply add #PBS -q biggpu to your batch script or submit your job with the -q biggpu flag.

Because of the limited number of these nodes, there is a different charge rate to use them. For each node you use, you will be charged at two times the normal rate, or 32 core hours for each node hour. These are dedicated nodes, so jobs will not be shared on them. For this reason, it is important that you run codes that will benefit greatly from either the added power of the GPUs or the increased memory of the nodes.

If you wish to use CUDA to use the GPUs, make sure to add the following softenv key in order to have the CUDA toolkit at your disposal: +cudatoolkit
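
As a minimal sketch (the executable name my_cuda_app.x is hypothetical), a biggpu batch script could look like:

#!/bin/sh
#PBS -N gpu_job
#PBS -q biggpu
#PBS -l nodes=1:ppn=16
#PBS -l walltime=01:00:00
#PBS -j oe

cd $PBS_O_WORKDIR
# nvidia-smi is a quick way to confirm both K40m GPUs are visible.
nvidia-smi
./my_cuda_app.x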

Everything else about these nodes is the same as any other Blues node, including access to the GPFS filesystems.

If you have any further questions about these nodes or using them please let us know by sending an email to support@lcrc.anl.gov.

Checking Queues and Jobs

Job Status

You can check the status of all jobs using the showq command:

$ showq
ACTIVE JOBS--------------------
JOBNAME            USERNAME      STATE  PROC   REMAINING            STARTTIME
536764             xxxxxxxx    Running     8     1:13:40  Mon Apr  4 03:40:16
536751             xxxxxxxx    Running     8     1:14:31  Mon Apr  4 03:41:07
537124             xxxxxxxx    Running    64     1:23:13  Mon Apr  4 09:19:49
536961             xxxxxxxx    Running    32     2:06:52  Mon Apr  4 06:34:28
536788             xxxxxxxx    Running     8     2:21:25  Mon Apr  4 04:48:01

...

  108 Active Jobs    2506 of 2928 Processors Active (85.59%)
                      315 of  320 Nodes Active      (98.44%)

IDLE JOBS----------------------
JOBNAME            USERNAME      STATE  PROC     WCLIMIT            QUEUETIME
536070             xxxxxxxx       Idle    96 12:12:00:00  Fri Apr  1 00:29:28
535002             xxxxxxxx       Idle     8 12:12:00:00  Mon Mar 28 18:28:24
535016             xxxxxxxx       Idle     8 12:12:00:00  Mon Mar 28 20:47:55
535042             xxxxxxxx       Idle     8 12:12:00:00  Mon Mar 28 22:24:35
535309             xxxxxxxx       Idle     8 12:12:00:00  Tue Mar 29 13:27:52

...

30 Idle Jobs

You can also check jobs in the queues for a specific user:

$ showq -u <userid>
ACTIVE JOBS--------------------
JOBNAME            USERNAME      STATE  PROC   REMAINING            STARTTIME
536310                xxxxxxxx    Running    32     5:25:16  Sat Apr  2 15:55:22
536318                xxxxxxxx    Running    32     6:39:51  Sat Apr  2 17:09:57
536259                xxxxxxxx    Running    48     7:31:18  Sat Apr  2 18:01:24
536266                xxxxxxxx    Running    48    16:38:17  Sun Apr  3 03:08:23
536280                xxxxxxxx    Running    48    20:37:05  Sun Apr  3 07:07:11
536138                xxxxxxxx    Running    48  1:01:32:59  Fri Apr  1 10:03:05
537134                xxxxxxxx    Running    48  1:22:50:42  Mon Apr  4 09:20:48
537173                xxxxxxxx    Running    48  1:23:23:49  Mon Apr  4 09:53:55
537182                xxxxxxxx    Running    48  1:23:37:49  Mon Apr  4 10:07:55
536150                xxxxxxxx    Running    48  2:06:08:09  Sat Apr  2 14:38:15
537187                xxxxxxxx    Running    72  3:05:56:20  Mon Apr  4 10:26:26

   11 Active Jobs    2530 of 2928 Processors Active (86.41%)
                      318 of  320 Nodes Active      (99.38%)

IDLE JOBS----------------------
JOBNAME            USERNAME      STATE  PROC     WCLIMIT            QUEUETIME
537189                xxxxxxxx       Idle    72  3:06:00:00  Mon Apr  4 10:29:32

1 Idle Job

BLOCKED JOBS----------------
JOBNAME            USERNAME      STATE  PROC     WCLIMIT            QUEUETIME

Total Jobs: 12   Active Jobs: 11   Idle Jobs: 1   Blocked Jobs: 0

You can use the showbf command to gauge how large of a backfill is available.
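
The output will look something like the following (the values shown are only illustrative):

$ showbf
backfill window (user: 'xxxxxxxx' group: 'xxxxxxxx' partition: ALL) Mon Apr  4 10:30:00

323 procs available for      1:36:20
160 procs available with no timelimit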

Checking for Free Nodes

You can check how many nodes are free by running:

$ nodes -c

total free nodes in the compute queue:  0

By default this will show you the count for the batch queue. You can specify a particular queue as follows:

$ nodes -q haswell

Determining when the next job will run

You can view the start time of the highest priority job using the Maui showstart command. The showstart command assumes any job you pass to it is the next to run, even if it isn't, so it is only accurate for the highest priority job in the queue.

You can view the highest priority job in the queue by running the Maui showq command. In the following example, job ID 410553 is the highest priority job in the queue:

$ showq -i

          JobName    Priority  XFactor  Q   User  Group Procs  WCLimit  Class  SystemQueueTime

           410553*       7380      1.3 pr     xxxxx earlygrp     48 16:16:00:00     batch  Wed Jun 23 10:03:56
           411250*       2762      1.6 pr    xxxxxx earlygrp     56  3:00:00:00     batch  Sat Jun 26 15:01:48
           411251*       2759      1.6 pr    xxxxxx earlygrp     56  3:00:00:00     batch  Sat Jun 26 15:04:57

Using showstart, we can see this job is estimated to start in roughly 12 days and 10 hours, on Saturday, July 10th at 19:00. Please note that showstart assumes the job ID you pass is the next to run; if you run showstart on a job that isn't next to run, the estimate will be inaccurate.

$ showstart 410553

job 410553 requires 48 procs for 16:16:00:00

Earliest start in      12:10:54:49 on Sat Jul 10 19:00:00
Earliest completion in 29:02:54:49 on Tue Jul 27 11:00:00

Best Partition: DEFAULT

Why isn’t my job running yet?

LCRC uses a batch system for users to submit and track jobs. The demand for compute hours on LCRC systems greatly exceeds the supply. As with many shared systems, complexities arise when attempting to use the available computing resources in a fair and effective manner. In a batch system, the scheduler determines when and how jobs are run so as to maximize the output of the cluster while sharing the compute resources fairly among users. Intelligent scheduling decisions significantly improve the effectiveness of the cluster, resulting in more jobs being run with shorter turnaround times.

Scheduling decisions are based on a number of factors, including the available cores and the duration of their availability, the total resources requested by each job (core-hours), the priority of the project (for instance, strategic LDRD project-based jobs have higher priority), and the number of jobs submitted by each user. Users should remember that jobs are NOT scheduled on a first-in-first-out basis. For instance, a job requesting 2 nodes for an hour could run earlier than a job that requested 4 nodes for 96 hours, even though the second job was submitted earlier. This can happen if the scheduler determines that resources are available to run the smaller job sooner.

The load on the system varies over time. Users can check the system load with the showq command; however, this can be deceiving, because the processor count includes the private condo nodes we host, which not everyone can run on. At the bottom of the Active Jobs section, the command shows a summary of processor and node usage, as shown below:

102 Active Jobs 5624 of 9668 Processors Active (58.17%)
313 of 483 Nodes Active (64.80%)

To get a better idea of the number of free nodes, you should use the nodes command as outlined here.

Jobs can also be “stalled” by upcoming maintenance periods. LCRC sets aside the second Monday of every month for system maintenance (software upgrades, hardware replacements, etc.). If the total wall time requested by a job is longer than the time between job submission and the maintenance date, the job will start only after the maintenance period ends. For instance, if the monthly maintenance day is set for Feb 8th and user A submits a job on Feb 2nd requesting a wall time of 168 hours (7 days), the job will not start until the maintenance period is over. In the meantime, other jobs that require less wall time (say, 2 days) might start, even if they were submitted after user A's job, provided the scheduler determines that they can complete before the start of maintenance. LCRC staff send out emails to users announcing the upcoming maintenance date(s). Users should pay attention to the maintenance schedule to ensure that there is no inordinate delay in the start of their jobs.

At times, some users submit tens or hundreds of jobs, all of which are listed by the showq or qstat commands. Users should not be alarmed by the large number of jobs; as mentioned earlier, jobs are not scheduled on a first-in-first-out basis. Furthermore, while the scheduler lists all of these jobs, only the first 30 jobs submitted by a user are actually in the queue. Additionally, the fairshare policy automatically lowers the priority of a user's successive jobs in a phased manner, in order to allow other users who are requesting fewer resources to run their jobs.

Occasionally, jobs are placed on ‘HOLD’ if the requested resources can never be met. For instance, a job submitted with ‘ppn=32’ on the Blues batch nodes (which have ppn=16) will never run. Users are asked to ensure that their batch scripts are set up correctly for the resource they are utilizing, as in the example below.
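
For example, to get 32 cores on the 16-core Blues batch nodes, request two full nodes:

#PBS -l nodes=2:ppn=16

rather than nodes=1:ppn=32, which can never be satisfied.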

Maintenance Day

As mentioned above, LCRC sets aside the second Monday of every month for system maintenance (software upgrades, hardware replacements, etc.). During this time, the clusters will be unavailable and inaccessible to all users. We generally have the clusters available again by the end of the same day; however, depending on the scope of the work to be done, this could take longer.

We always send out a reminder email (usually a week in advance) of the start date, and then another email immediately after the work is complete and the clusters are available again. These emails go to the LCRC Users email list; all LCRC users are subscribed to this list by default.

Job Commands Quick Reference Guide

Command                       Description
qsub <script_name>            Submit a Job
qdel <job_id>                 Delete a Job
nodes -c                      Get a count of available nodes in the default batch (compute) queue
nodes -q <queue_name>         List the hostnames of available nodes in a specific public queue
nodes -c -q <queue_name>      Get a count of available nodes in a specific public queue
showq                         Show queued jobs via the scheduler
showq -u <username>           Show queued jobs from a specific user
showbf                        Show backfill window
qstat                         View queues and jobs via the resource manager
qstat -u <username>           View jobs from a specific user
qstat <job_id>                View a specific job
qstat -f <job_id>             View detailed information about a specific job
qhold <job_id>                Put a hold on a job
qrls <job_id>                 Release a hold on a job
checkjob <job_id>             Provide a detailed status report for a specified job via the scheduler
showstart <job_id>            Show estimates of when a job can/will start