LCRC clusters use the Torque resource manager and the Maui scheduler to allow users to submit jobs via PBS commands. The Portable Batch System (PBS) is a richly featured workload management system that provides job scheduling and a job management interface on computing resources, including Linux clusters. With PBS, a user requests resources and submits a job to a queue. The system then takes jobs from the queues, allocates the necessary nodes, and executes them as efficiently as it can.
The scheduler is a policy engine which allows sites control over when, where, and how resources such as processors, memory, and disk are allocated to jobs. In addition to this control, it also provides mechanisms which help to intelligently optimize the use of these resources, monitor system performance, help diagnose problems, and generally manage the system.
LCRC uses the Maui scheduler. Maui is developed by Cluster Resources; refer to the Maui documentation on their site for more information.
Resource managers provide the low-level functionality to start, hold, cancel, and monitor jobs. Without these capabilities, a scheduler alone cannot control jobs.
LCRC uses the resource manager Torque, which is a derivative of PBS and may be considered synonymous with it in this context. Torque is developed by Cluster Resources; refer to the Torque documentation on their site for more information.
The life cycle of a job can be divided into four stages: creation, submission, execution, and finalization.
Typically, a submit script is written to hold all of the parameters of a job. These parameters can include how long the job should run (walltime), what resources it needs, and what to execute.
- You may only submit jobs from the cluster login nodes.
- The PBS command file does not need to be executable.
- In the case of parallel jobs, the PBS command file is staged to, and executed on, the first allocated compute node only. Use MPI or SSH to run programs on multiple nodes.
- The command script is executed from your home directory in all cases. You can change the directory from within the script so that log files are written to your project directory. You can also refer to the submission directory by using the `$PBS_O_WORKDIR` environment variable.
- Please do NOT run large, long, multi-threaded, parallel, or CPU-intensive jobs on a front-end login host. All users share the front-end hosts, and running anything but the smallest test job will negatively impact everyone’s ability to use the clusters.
A job is submitted with the qsub command. Once submitted, the policies set by the site's administrative and technical staff dictate the priority of the job and therefore when it will start executing. A limit of 100 job submissions per user is currently enforced.
Jobs often spend most of their lifecycle executing. While a job is running, its status can be queried with the showq or qstat command. A limit of 32 jobs per user running at once is currently enforced.
When a job completes, any leftover user processes are killed, and by default, the stdout and stderr files are copied to the directory where the job was submitted.
Prioritization is the process of determining which of many options best fulfills overall goals. With the jobs prioritized, the scheduler can roughly fulfill site objectives by starting the jobs in priority order.
What changes a job’s priority value?
Assigned project priority (positive/negative for expired allocations)
Queue time (positive)
Node count (positive)
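As a rough sketch of how these factors might combine, the snippet below computes a weighted sum in shell arithmetic. The weights (100, 10, 5) and the sample values are invented purely for illustration; Maui's actual factor weights are configured by the site and are not documented here.

```shell
# Hypothetical weighted-sum priority; all weights are made up for illustration.
project_priority=100   # assigned project priority (could be negative for an expired allocation)
queue_hours=10         # hours the job has waited in the queue
node_count=4           # nodes requested
priority=$(( project_priority + 10 * queue_hours + 5 * node_count ))
echo "priority: $priority"
```

Note how the priority grows as the job waits, so older jobs gradually overtake newer ones submitted under the same project.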
Fairshare is a mechanism which allows historical resource utilization information to be incorporated into job feasibility and priority decisions.
Backfill is a scheduling optimization which allows a scheduler to make better use of available resources by running small jobs out of order.
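The eligibility test behind backfill is simple: a waiting job can be started out of order if it fits into the current gap in both processors and time. A minimal sketch with invented numbers follows (on a real system, the free processor count and time window would come from `showbf`):

```shell
# Invented example values; showbf would report the real backfill window
free_procs=32        # processors idle until the next reservation
window_hours=5       # how long they will stay idle
job_procs=8          # a small queued job's processor request
job_walltime=2       # its requested walltime in hours

msg="job must wait its turn"
if [ "$job_procs" -le "$free_procs" ] && [ "$job_walltime" -le "$window_hours" ]; then
  msg="job is eligible for backfill"
fi
echo "$msg"
```

This is why accurate walltime requests matter: the smaller and shorter your job, the more backfill windows it fits into.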
We have several public queues to choose from on each cluster. The table below gives details on the types of nodes in each public queue; it does not include the private condo nodes.
| Blues Cluster Queues | Nodes | Cores | Memory | Processor | Co-processors | Local Scratch Disk |
|---|---|---|---|---|---|---|
| shared | 4 | 16 | 64 GB | Sandy Bridge Xeon E5-2670 2.6GHz | – | 15 GB |
| batch | 306 | 16 | 64 GB | Sandy Bridge Xeon E5-2670 2.6GHz | – | 15 GB |
| haswell | 40 | 32 | 128 GB | Haswell Xeon E5-2698v3 2.3GHz | – | 15 GB |
| biggpu | 6 | 16 | 768 GB | Sandy Bridge Xeon E5-2670 2.6GHz | 2x NVIDIA Tesla K40m GPU | 1 TB |
Job submission is accomplished using the qsub command, which takes a number of command line arguments and integrates them into the specified PBS command file. The PBS command file may be specified as a filename on the qsub command line or may be entered via STDIN. PBS batch scripts can be created using your favorite editor.
As an example, let's create a batch script called hello.pbs that expects an executable called hello.x, as shown below:
```shell
#!/bin/sh
#PBS -N hello
#PBS -l nodes=1:ppn=16
#PBS -l walltime=0:00:15
#PBS -j oe
cd $PBS_O_WORKDIR
mpiexec ./hello.x
```
As you can see in the script, the first portion sets the PBS directives needed to submit the job and have it distributed to the machines assigned to you by the scheduler. You can check the man page for qsub for a list of options to pass to qsub. The rest of the script sets up the working directory and the program to execute. In addition, you can check out our job scheduling policy listed here to get a full understanding of the acceptable limits for submitting jobs to the queue.
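Putting the pieces together, a typical session on a login node might look like the sketch below. This assumes the hello.pbs script above and a working Torque/PBS environment; the job id returned by qsub will differ.

```shell
# Sketch only: these commands require a cluster login node with Torque/PBS
qsub hello.pbs        # prints a job id such as 12345.blues.lcrc.anl.gov
showq -u $USER        # watch the job move from Idle to Running
# with "#PBS -j oe" and "#PBS -N hello", combined output lands in the
# submission directory as hello.o<jobid> once the job completes:
ls hello.o*
```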
Interactive jobs can run on compute nodes. You can start interactive jobs either with specific time constraints (`walltime=hh:mm:ss`) or with the default time constraints of the queue to which you submit your job. PBS assigns to all jobs, including interactive jobs, the maximum wall time of their queue.
If you request an interactive job without a wall time option, PBS assigns to your job the default wall time limit for the queue to which you submit. If this is shorter than the time you actually need, your job will terminate before completion. If, on the other hand, this time is longer than what you actually need, you are effectively withholding computing resources from other users. For this reason, it is best to always pass a reasonable wall time value to PBS for interactive jobs.
Once your interactive job starts, you may use that connection as an interactive shell and invoke whatever programs or commands you wish. To submit an interactive job with one minute of wall time, use the `-I` option to qsub:
```
$ qsub -I -l walltime=00:01:00
waiting for job 100.blues.lcrc.anl.gov to start
job 100.blues.lcrc.anl.gov ready
```
If you need to use a remote X11 display from within your job, add the `-v DISPLAY` option to qsub as well:
```
$ qsub -I -l walltime=00:01:00 -v DISPLAY
waiting for job 101.blues.lcrc.anl.gov to start
job 101.blues.lcrc.anl.gov ready
```
To quit your interactive job, exit the shell (for example, by running `exit`).
You can delete jobs using the Torque/PBS command qdel (note: you can only delete jobs that you submitted):
```
$ qdel <jobid>
```
If for whatever reason you are unable to delete your job, contact us and we can delete it for you.
Special Queues (Blues)
LCRC offers several queues that differ from the normal compute queues. Outlined below are some of the queues and some caveats when working with them.
Running on the Blues ‘haswell’ Queue
In order to use these nodes optimally with mvapich2, you will need the latest version of the hydra job launcher, available under the softenv key `+hydra-3.2`. Place this key above the mvapich2 key in order to pick up the latest hydra.
You will also need to add an additional flag to the mpiexec command so that MPI ranks bind to the correct processors, like below:

```shell
mpiexec -n 32 --bind-to=core
```
If using mpirun, use:

```shell
mpirun -n 32 --bind-to core
```
After those changes, you are ready to compute on the haswell queue. To submit to this queue, simply add `#PBS -q haswell` to your batch script or submit your job with the `-q haswell` flag.
Be aware that, as with our other compute nodes, you will be charged for every processor on a node that you are allocated, regardless of whether you use all of the resources on that node.
Note that the options to bind processes to cores may vary between MPI packages. Please consult the documentation for the MPI package you use or send an email to firstname.lastname@example.org if you require assistance.
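Putting the haswell-specific pieces together, a complete batch script might look like the sketch below. The executable name hello.x and the one-hour walltime are placeholders, and the `+hydra-3.2` softenv key is typically placed in your ~/.soft file rather than in the script.

```shell
#!/bin/sh
#PBS -N hello
#PBS -q haswell
#PBS -l nodes=1:ppn=32
#PBS -l walltime=1:00:00
#PBS -j oe
cd $PBS_O_WORKDIR
# bind MPI ranks to cores on the 32-core Haswell nodes (mvapich2/hydra flag from above)
mpiexec -n 32 --bind-to=core ./hello.x
```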
Running on the Blues ‘biggpu’ Queue
The addition of big memory GPU nodes to Blues facilitates computations that require special capabilities. You can now take advantage of the power of graphical processors to further increase the amount of work you can accomplish on Blues with codes developed for GPUs as a computing resource.
In addition to the standard 16 Sandy Bridge cores, each node has 2 NVIDIA Tesla K40m GPUs; each GPU has 12 GB of on-card memory and 2,880 CUDA cores, giving an estimated double-precision floating-point performance of 1.66 teraflops per GPU.
In addition to having GPUs, these Blues nodes also have 768GB of memory to help solve problems that are more memory bound. The large memory enables data analyses, visualizations and pre- and post-processing that require large, shared memory. Lastly, this queue offers 1 Terabyte of local scratch disk.
There are a total of 6 of these big memory/GPU nodes.
There is a separate queue for the big memory/GPU nodes named `biggpu`. To submit to this queue, simply add `#PBS -q biggpu` to your batch script or submit your job with the `-q biggpu` flag.
Because of the limited number of these nodes, there is a different charge rate for using them: each node is charged at twice the normal rate, or 32 core-hours per node-hour. These are dedicated nodes, so jobs will not be shared on them. For this reason, it is important to use codes that benefit greatly from either the added power of the GPUs or the increased memory of the nodes.
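In other words, a biggpu job is charged walltime × nodes × 32 core-hours. A quick sketch of the arithmetic, with example numbers only:

```shell
# biggpu bills each node at twice the 16-core rate, i.e. 32 core-hours per node-hour
nodes=1
walltime_hours=3
charge=$(( nodes * 32 * walltime_hours ))
echo "$charge core-hours charged"
```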
If you wish to use CUDA to use the GPUs, make sure to add the following softenv key in order to have the CUDA toolkit at your disposal:
Everything else about these nodes is the same as any other Blues node, including access to the GPFS filesystems.
If you have any further questions about these nodes or using them please let us know by sending an email to email@example.com.
Checking Queues and Jobs
You can check the status of all jobs using the showq command:
```
$ showq
ACTIVE JOBS--------------------
JOBNAME   USERNAME  STATE    PROC  REMAINING            STARTTIME
536764    xxxxxxxx  Running     8    1:13:40  Mon Apr  4 03:40:16
536751    xxxxxxxx  Running     8    1:14:31  Mon Apr  4 03:41:07
537124    xxxxxxxx  Running    64    1:23:13  Mon Apr  4 09:19:49
536961    xxxxxxxx  Running    32    2:06:52  Mon Apr  4 06:34:28
536788    xxxxxxxx  Running     8    2:21:25  Mon Apr  4 04:48:01
...

   108 Active Jobs    2506 of 2928 Processors Active (85.59%)
                       315 of  320 Nodes Active      (98.44%)

IDLE JOBS----------------------
JOBNAME   USERNAME  STATE    PROC      WCLIMIT            QUEUETIME
536070    xxxxxxxx  Idle       96  12:12:00:00  Fri Apr  1 00:29:28
535002    xxxxxxxx  Idle        8  12:12:00:00  Mon Mar 28 18:28:24
535016    xxxxxxxx  Idle        8  12:12:00:00  Mon Mar 28 20:47:55
535042    xxxxxxxx  Idle        8  12:12:00:00  Mon Mar 28 22:24:35
535309    xxxxxxxx  Idle        8  12:12:00:00  Tue Mar 29 13:27:52
...

30 Idle Jobs
```
You can also check jobs in the queues for a specific user:
```
$ showq -u <userid>
ACTIVE JOBS--------------------
JOBNAME   USERNAME  STATE    PROC   REMAINING            STARTTIME
536310    xxxxxxxx  Running    32     5:25:16  Sat Apr  2 15:55:22
536318    xxxxxxxx  Running    32     6:39:51  Sat Apr  2 17:09:57
536259    xxxxxxxx  Running    48     7:31:18  Sat Apr  2 18:01:24
536266    xxxxxxxx  Running    48    16:38:17  Sun Apr  3 03:08:23
536280    xxxxxxxx  Running    48    20:37:05  Sun Apr  3 07:07:11
536138    xxxxxxxx  Running    48  1:01:32:59  Fri Apr  1 10:03:05
537134    xxxxxxxx  Running    48  1:22:50:42  Mon Apr  4 09:20:48
537173    xxxxxxxx  Running    48  1:23:23:49  Mon Apr  4 09:53:55
537182    xxxxxxxx  Running    48  1:23:37:49  Mon Apr  4 10:07:55
536150    xxxxxxxx  Running    48  2:06:08:09  Sat Apr  2 14:38:15
537187    xxxxxxxx  Running    72  3:05:56:20  Mon Apr  4 10:26:26

    11 Active Jobs    2530 of 2928 Processors Active (86.41%)
                       318 of  320 Nodes Active      (99.38%)

IDLE JOBS----------------------
JOBNAME   USERNAME  STATE    PROC      WCLIMIT            QUEUETIME
537189    xxxxxxxx  Idle       72   3:06:00:00  Mon Apr  4 10:29:32

1 Idle Job

BLOCKED JOBS----------------
JOBNAME   USERNAME  STATE    PROC      WCLIMIT            QUEUETIME

Total Jobs: 12   Active Jobs: 11   Idle Jobs: 1   Blocked Jobs: 0
```
You can use the showbf command to gauge how large a backfill window is available.
Checking for Free Nodes
You can check how many nodes are free by running:
```
$ nodes -c
total free nodes in the compute queue: 0
```
By default this shows the count for the batch queue. You can specify a different queue as follows:
```
$ nodes -q haswell
```
Determining when the next job will run
You can view the start time of the highest-priority job using the Maui showstart command. The showstart command assumes any job you pass to it would be the next to run, even if it isn't, so it is only accurate for the highest-priority job in the queue.
You can view the highest priority job in queue by running the Maui showq command. In the following example, job id 410553 is the highest priority job in queue:
```
$ showq -i
JobName    Priority  XFactor  Q     User     Group  Procs      WCLimit  Class      SystemQueueTime
410553*        7380      1.3  pr   xxxxx  earlygrp     48  16:16:00:00  batch  Wed Jun 23 10:03:56
411250*        2762      1.6  pr  xxxxxx  earlygrp     56   3:00:00:00  batch  Sat Jun 26 15:01:48
411251*        2759      1.6  pr  xxxxxx  earlygrp     56   3:00:00:00  batch  Sat Jun 26 15:04:57
```
Using showstart, we can see that this job will start in 12 days and 10 hours, on Saturday, July 10th at 19:00. Please note that showstart assumes the given job ID is the next to run; if you run showstart on a job that isn't next to run, the data will be inaccurate.
```
$ showstart 410553
job 410553 requires 48 procs for 16:16:00:00
Earliest start in       12:10:54:49 on Sat Jul 10 19:00:00
Earliest completion in  29:02:54:49 on Tue Jul 27 11:00:00
Best Partition: DEFAULT
```
Why isn’t my job running yet?
LCRC uses a batch system for users to submit and track jobs. The demand for compute hours on LCRC systems greatly exceeds the supply. As with many shared systems, complexities arise when attempting to utilize the available computing resources in a fair and effective manner. In a batch system, a scheduler is given the job of determining when and how jobs are run, so as to maximize the output of the cluster while sharing the compute resources fairly among users. Intelligent scheduling decisions significantly improve the effectiveness of the cluster, resulting in more jobs being run with shorter job turnaround times.
The scheduling decisions are based on a number of factors, such as the available cores and the duration of their availability, the total resources requested by various jobs (core-hours), the priority of the project (for instance, strategic LDRD project-based jobs have higher priority), and the number of jobs submitted by each user, among other factors. Users should remember that jobs are NOT scheduled on a first-in-first-out basis. For instance, a job requesting 2 nodes for an hour could run earlier than a job that requested 4 nodes for 96 hours, even though the second job was submitted earlier. This can happen if the scheduler determines that resources are available to run the smaller job earlier.
The load on the system varies over time. Users can run the `showq` command to check the system load; however, this can be deceiving, because the processor count includes the private condo nodes we host, which not everyone can run on. At the bottom of the `Active Jobs` section, the command shows a usage summary like the following:
```
   102 Active Jobs    5624 of 9668 Processors Active (58.17%)
                       313 of  483 Nodes Active      (64.80%)
```
To get a better idea of the number of free nodes, you should use the `nodes` command as outlined here.
Jobs can also be “stalled” by upcoming maintenance periods. LCRC sets aside the second Monday of every month for system maintenance (software upgrades, hardware replacements, etc.). If the total wall time requested by a job is longer than the time between job submission and the maintenance date, the job will start only after the maintenance period ends. For instance, if the monthly maintenance day is set for Feb 8th and user A submits a job on Feb 2nd requesting a wall time of 168 hours (7 days), the job will not start until the maintenance period is over. In the meantime, other jobs that require shorter wall times (say, 2 days) might start, even if they were submitted after user A's job, if the scheduler determines that they can complete before the start of maintenance. LCRC staff sends out emails to users announcing upcoming maintenance dates. Users should pay attention to the maintenance schedule to ensure that there is no inordinate delay in the start of their jobs.
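The check the scheduler effectively performs can be sketched with the hypothetical dates from the example above (a Feb 2 submission against Feb 8 maintenance):

```shell
# Six days between submission and maintenance, versus a 7-day walltime request
hours_until_maintenance=$(( 6 * 24 ))   # 144 hours
requested_walltime=168                  # user A's 7-day request
result="can start before maintenance"
if [ "$requested_walltime" -gt "$hours_until_maintenance" ]; then
  result="waits until after maintenance"
fi
echo "job $result"
```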
At times, some users submit tens or hundreds of jobs, all of which are listed by the `showq` or `qstat` commands. Users should not be alarmed by the large number of jobs. As mentioned earlier, jobs are not scheduled on a first-in-first-out basis. Furthermore, while the scheduler lists all of these jobs, only the first 30 jobs submitted by a user are actually in the queue. Additionally, the fair-share policy automatically lowers the priority of successive jobs from the same user in a phased manner, in order to allow other users who request fewer resources to run their jobs.
Occasionally, jobs are placed on 'HOLD' if the requested resources can never be met. For instance, a job submitted with 'ppn=32' on Blues (whose nodes have ppn=16) will never run. Users are asked to ensure that their batch scripts are set up correctly for the resource they are utilizing.
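For example, a request for 32 cores on Blues must span two 16-core nodes rather than ask for 32 cores on a single node. The directive below sketches the corrected form:

```shell
#PBS -l nodes=2:ppn=16    # 32 cores total: two 16-core Blues nodes, a satisfiable request
# #PBS -l nodes=1:ppn=32  # unsatisfiable on Blues' 16-core nodes; the job would hold forever
```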
As mentioned above, LCRC sets aside the second Monday of every month for system maintenance (software upgrades, hardware replacements, etc.). During this time, the cluster will be unavailable and inaccessible to all users. We generally have the cluster available again by the end of the same day; however, depending on the scope of the work to be done, this could take longer.
We always send out a reminder email (usually a week before) of the start date and then another email immediately after the work is complete and the clusters are available again. These emails go to the LCRC Users email list. All LCRC users are subscribed to this list by default.
Job Commands Quick Reference Guide
| Command | Action |
|---|---|
| `qsub <script>` | Submit a job |
| `qdel <jobid>` | Delete a job |
| `nodes -c` | Get a count of available nodes in the default batch (compute) queue |
| | List the hostnames of available nodes in a specific public queue |
| `nodes -q <queue>` | Get a count of available nodes in a specific public queue |
| `showq` | Show queued jobs via the scheduler |
| `showq -u <userid>` | Show queued jobs from a specific user |
| `showbf` | Show the backfill window |
| `qstat` | View queues and jobs via the resource manager |
| `qstat -u <userid>` | View jobs from a specific user |
| `qstat <jobid>` | View a specific job |
| `qstat -f <jobid>` | View detailed information about a specific job |
| `qhold <jobid>` | Put a hold on a job |
| `qrls <jobid>` | Release a hold on a job |
| `checkjob <jobid>` | Provide a detailed status report for a specified job via the scheduler |
| `showstart <jobid>` | Show estimates of when a job can/will start |