Running Jobs on Bebop

Bebop utilizes the Slurm Workload Manager (formerly known as Simple Linux Utility for Resource Management or SLURM) for job management. Slurm is an open source, fault-tolerant, and highly scalable cluster management and job scheduling system for large and small Linux clusters.

As a cluster workload manager, Slurm has three key functions. First, it allocates exclusive and/or non-exclusive access to resources (compute nodes) to users for some duration of time so they can perform work. Second, it provides a framework for starting, executing, and monitoring work (normally a parallel job) on the set of allocated nodes. Finally, it arbitrates contention for resources by managing a queue of pending work.

The Basics

Bebop uses Slurm for the scheduling and resource management of its infrastructure. The simplest way to become familiar with Slurm and its basic commands is to follow their Quick Start User Guide. In the rest of this page, we’ll cover specific examples and commands. If at any time something becomes unclear, please do contact LCRC support.

Available Queues

Bebop has several partitions (queues) defined. Use the -p option with srun or sbatch to select a partition. The default partition is bdwall.
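For example, a partition can be chosen at submission time (the script name and partition choices here are illustrative):

```shell
# Submit a batch script to the bdw partition instead of the default bdwall
sbatch -p bdw myjob.sh

# Or request an interactive shell on a KNL node for 30 minutes
srun --pty -p knl -t 00:30:00 /bin/bash
```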

Bebop currently enforces the following limits on publicly available queues:

  • 32 Running Jobs per user.
  • 100 Queued Jobs per user.
  • 7 Days (168 Hours) Maximum Walltime.
  • 1 Hour Default Walltime if not specified.

We have several public queues to choose from. The table below gives details on the types of nodes in each public queue; it does not include the private condo nodes.
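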


Partition  Description                                              Nodes  CPU Type   Cores/Node  Memory/Node
bdwall     All Broadwell nodes                                      664    E5-2695v4  36          128GB DDR4
bdw        Broadwell with 15GB /scratch disk                        600    E5-2695v4  36          128GB DDR4
bdwd       Broadwell with 4TB /scratch disk                         64     E5-2695v4  36          128GB DDR4
bdws       Broadwell shared nodes (oversubscription/non-exclusive)  8      E5-2695v4  36          128GB DDR4
knlall     All KNL nodes                                            352    Phi 7230   64          96GB DDR4/16GB MCDRAM
knl        KNL with 15GB /scratch disk                              288    Phi 7230   64          96GB DDR4/16GB MCDRAM
knld       KNL with 4TB /scratch disk                               64     Phi 7230   64          96GB DDR4/16GB MCDRAM

Submitting Jobs

Batch Jobs

To submit a batch job in Slurm, use the sbatch command. It accepts a number of options, but in the simplest case you pass it only the name of your submit script:

$ sbatch <script_name>

Each sbatch script can set various options, but one of the most important is the account to charge your time to:

-A, --account=<account> Charge resources used by this job to specified account. The account is an arbitrary string.


Note:
In Slurm, an account is what LCRC normally calls a project. If the account option is not specified, the time will be charged to your default account. Every Argonne employee has a default account, which should be your new Bebop startup-<username> account with an initial 20K core hours. Non-Argonne employees need to be a member of an existing, active project. If your time must be charged to a project other than your default account, the account option must be specified as part of the sbatch script. You can also change your default project by following the commands here.
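As a sketch (the project name myproject and script name are illustrative), the account can be set either in the submit script or on the sbatch command line, which overrides the script:

```shell
# In the submit script:
#   #SBATCH -A myproject

# Or at submission time, overriding any #SBATCH -A line in the script:
sbatch -A myproject myscript.sh
```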

The sbatch script format is very similar to the one that was used in Torque/Maui on Blues.

Here are several examples that cover the most common use cases.


Example Submit Script (Simple)
#!/bin/bash
#SBATCH -J myjobname
#SBATCH -A myaccount
#SBATCH -p bdwall
#SBATCH -N 1
#SBATCH --ntasks-per-node=1
#SBATCH -t 00:15:00

/bin/hostname

This will run the hostname command on one node and on one core.


Example Submit Script (MPI)
#!/bin/bash
#SBATCH -J mympijobname
#SBATCH -A myaccount
#SBATCH -p bdwall
#SBATCH -N 2
#SBATCH --ntasks-per-node=36
#SBATCH -o myjob.%j.%N.out
#SBATCH -e myjob.%j.%N.error
#SBATCH --mail-user=abc@lcrc.anl.gov
#SBATCH --mail-type=BEGIN,END
#SBATCH -t 00:15:00

export I_MPI_FABRICS=shm:tmi
srun /bin/hostname

This script will start a job with name mympijobname in partition bdwall using project name myaccount.

The output file will be myjob.%j.%N.out and the error file myjob.%j.%N.error (by default, Slurm writes stdout and stderr to the same file, with the default filename "slurm-%j.out", where %j is substituted with the job number).

--ntasks-per-node=36 means that this job will use 36 cores on each node.

-t sets a walltime of 15 minutes.

The --mail-user option sets the address for job notifications. Note that Slurm only sends mail for the events selected with --mail-type (for example, --mail-type=BEGIN,END for job start and finish).

This will run the hostname command using the default Intel MPI across 2 nodes using 36 processes per node.

I_MPI_FABRICS=shm:tmi – use shared memory for communication within a single host, and the tmi (Omni-Path-optimized) tag matching interface for host-to-host communication.


Example KNL Submit Script (MPI)
Attention KNL users:
Please note that while you are able to change the KNL modes, doing so may require reconfiguring and rebooting the nodes to apply the changes.

Depending on the number of KNL nodes involved and the changes requested, this can take a significant amount of time. Because the reconfiguration occupies the resources, this time is also charged against your core-hour usage, in addition to the time the job itself takes to complete.

#!/bin/bash
#SBATCH -J mympijobname
#SBATCH -A myaccount
#SBATCH -p knlall
#SBATCH -C knl,quad,cache
#SBATCH -N 2
#SBATCH --ntasks-per-node=64
#SBATCH -t 00:15:00

export I_MPI_FABRICS=shm:tmi
srun /bin/hostname

This will run across two KNL nodes, using the Quadrant mode, and Cache MCDRAM mode.
The default setting is quad,cache.

A table of available settings, along with more detailed information about Slurm’s KNL support is available here.


Interactive Jobs

There are a couple of ways to run an interactive job on Bebop.

First, you can just get a session on a node by running:

$ srun --pty -p <partition> -t <walltime> /bin/bash

This will drop you onto one node. Once you exit the node, the allocation will be relinquished.


If you want more flexibility, you can instead have the system first allocate resources for the job using the salloc command:

$ salloc -N 2 -p knl -t 00:30:00

This will allocate 2 nodes from the knl partition for 30 minutes. The job number is printed in the output.

You can get a list of your allocated nodes, along with the many other Slurm settings set by the salloc command, by running:

$ export | grep SLURM

Once the resources are allocated and the session is granted, use the srun command to run your job:

$ srun -n 8 ./myprog

This will start 8 tasks on the allocated nodes. If you try to use more resources than you allocated (say three nodes' worth while you only asked for two), srun will create a separate allocation, and the original one will continue to run and consume hours as well.

When you allocate resources via salloc, you can also SSH directly to the nodes in your allocation if you prefer to run jobs from the nodes themselves.


Deleting Jobs

To delete a job, use scancel, which takes a job ID as an argument.

$ scancel 76543

For more information, please refer to the man pages of the commands mentioned above.

This Rosetta Stone of workload managers from SchedMD may be helpful if you're familiar with another workload manager:
https://slurm.schedmd.com/rosetta.pdf

Checking Queues and Jobs

squeue – view job and job step information.

Common options:

-a,--all Display information about all jobs in all partitions.
-u <user_list>,--user=<user_list> Request jobs or job steps from a comma separated list of users. The list can consist of user names or user id numbers.
-i <seconds>, --iterate=<seconds> Repeatedly gather and report the requested information at the interval specified (in seconds).
-l,--long Report more of the available information for the selected jobs or job steps.

For an extensive list of formatting options please consult squeue man page.
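The options above can be combined; for instance (the username is illustrative):

```shell
# Show detailed information about your own jobs, refreshed every 30 seconds
squeue -u myusername -l -i 30
```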


scontrol – can be used to report more detailed information about nodes, partitions, jobs, job steps, and configuration.

Common examples:

scontrol show node node-name Shows detailed information about the nodes.
scontrol show partition partition-name Shows detailed information about a specific partition.
scontrol show job job-id Shows detailed information about a specific job or all jobs if no job id is given.
scontrol update job job-id Change attributes of submitted job.

For an extensive list of formatting options please consult scontrol man page.
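As a sketch (the job ID and time limit are illustrative), scontrol update takes attribute=value pairs; note that regular users can typically only decrease a pending job's time limit, not increase it:

```shell
# Lower the time limit of pending job 12345 to one hour
scontrol update JobId=12345 TimeLimit=01:00:00
```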


sinfo – view information about nodes and partitions in the Slurm scheduling queue.

Common options:

-a, --all Display information about all partitions.
-t, --states <states> Display nodes in a specific state. Example: idle
-i <seconds>, --iterate=<seconds> Print the state on a periodic basis. Sleep for the indicated number of seconds between reports.
-l, --long Print more detailed information.
-n <nodes>, --nodes=<nodes> Print information only about the specified node(s). Multiple nodes may be comma separated or expressed using a node range expression. For example “bdw-[0001-0007]”
-o <output_format>, --format=<output_format> Specify the information to be displayed using an sinfo format string.

For an extensive list of formatting options please consult sinfo man page.


sacct – display accounting data for all jobs and job steps, including completed jobs.

Common options:

-S start_time, --starttime=start_time Select jobs in any state after the specified time.
-E end_time, --endtime=end_time Select jobs in any state before the specified time.

Valid time formats are:

HH:MM[:SS] [AM|PM]
MMDD[YY] or MM/DD[/YY] or MM.DD[.YY]
MM/DD[/YY]-HH:MM[:SS]
YYYY-MM-DD[THH:MM[:SS]]

Example:

$ sacct -S2014-07-03-11:40 -E2014-07-03-12:00 -X -ojobid,start,end,state
JobID   Start                 End                   State
------  --------------------  --------------------  ---------
2       2014-07-03T11:33:16   2014-07-03T11:59:01   COMPLETED
3       2014-07-03T11:35:21   Unknown               RUNNING
4       2014-07-03T11:35:21   2014-07-03T11:45:21   COMPLETED
5       2014-07-03T11:41:01   Unknown               RUNNING

For an extensive list of formatting options please consult sacct man page.


sprio – view the factors that comprise a job’s scheduling priority.

sprio is used to view the components of a job’s scheduling priority when the multi-factor priority plugin is installed. sprio is a read-only utility that extracts information from the multi-factor priority plugin. By default, sprio returns information for all pending jobs. Options exist to display specific jobs by job ID and user name.

For an extensive list of formatting options please consult sprio man page.
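For instance (the username and job ID are illustrative):

```shell
# Show the priority components of all of your own pending jobs
sprio -u myusername

# Show a specific pending job, with normalized priority factors
sprio -j 12345 -n
```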


Allocations/Time Management

Bebop and Blues currently use two separate allocation/time banking systems. Your time and balances on one cluster will not be the same on the other. If you need to check your current account's (project's) balance(s), change your default account, etc., please see our documentation here.

Eventually, we will migrate to a single banking system; in the meantime, we will continue to make useful checks for these actions available through the LCRC scripts.

Software Stack (Lmod)

Starting with Bebop, we have switched to Lmod for environment variable management. Blues is still using SoftEnv, but SoftEnv is no longer actively developed, and most other sites are using Environment Modules or Lmod instead. Lmod has several advantages over SoftEnv. While very similar to Environment Modules, Lmod also has different capabilities. For example, it prevents you from loading multiple versions of the same package at the same time. It also prevents you from having multiple compilers and MPI libraries loaded at the same time. For more information on how to use Lmod, click here.

By default your Lmod environment will load Intel Compilers, Intel MPI and Intel MKL.

Compute Node Scratch Space

Bebop currently writes all temporary files on the compute nodes to a 1GB partition at /tmp. We are not currently setting an alternate path by default. If you need more space to write output, you can temporarily store files in /scratch. This directory is a 15GB partition on the diskless nodes and a 4TB partition on the diskfull nodes (nodes ending in 'd'). Please note that all data will be deleted from this directory once your job completes. You can also set the TMPDIR environment variable in your job script as follows:

export TMPDIR=/scratch
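Because /scratch is wiped when the job finishes, a batch script that uses it should copy any results back before exiting. A minimal sketch (the partition, paths, and program name are illustrative):

```shell
#!/bin/bash
#SBATCH -J scratchjob
#SBATCH -A myaccount
#SBATCH -p bdwd
#SBATCH -N 1
#SBATCH -t 01:00:00

# Write temporary files to the node-local scratch disk
export TMPDIR=/scratch

# Run from scratch, then copy results back before the job ends
cd /scratch
./myprog > results.out
cp results.out "$SLURM_SUBMIT_DIR"/
```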

Why isn’t my job running yet?

Here are a few of the most common reasons your job may not be running.
First, check to see the reason code by querying your job number in Slurm:

$ squeue -j <job id>

Then, you can determine why the job has not started by deciphering this sample reason list:

Reason Code Description
AssocGrpBillingMinutes The job doesn’t have enough time in the banking account to begin.
BadConstraints The job’s constraints can not be satisfied.
BeginTime The job’s earliest start time has not yet been reached.
Cleaning The job is being requeued and still cleaning up from its previous execution.
Dependency This job is waiting for a dependent job to complete.
JobHeldAdmin The job is held by a system administrator.
JobHeldUser The job is held by the user.
NodeDown A node required by the job is down.
PartitionNodeLimit The number of nodes required by this job is outside of its partition's current limits. Can also indicate that required nodes are DOWN or DRAINED.
PartitionTimeLimit The job’s time limit exceeds its partition’s current time limit.
Priority One or more higher priority jobs exist for this partition or advanced reservation.
QOSMaxJobsPerUserLimit The job’s QOS has reached its maximum job count for the user at one time.
ReqNodeNotAvail Some node specifically required by the job is not currently available. The node may currently be in use, reserved for another job, in an advanced reservation, DOWN, DRAINED, or not responding. Nodes which are DOWN, DRAINED, or not responding will be identified as part of the job’s “reason” field as “UnavailableNodes”. Such nodes will typically require the intervention of a system administrator to make available.
Reservation The job is waiting for its advanced reservation to become available.
Resources The job is waiting for resources to become available.
TimeLimit The job exhausted its time limit.

While this is not every reason code, these are the most common on Bebop. You can view the full list of Slurm reason codes here.

Command Line Quick Reference Guide

Command Description
sbatch <script_name> Submit a job.
scancel <job_id> Delete a job.
squeue Show all queued jobs via the scheduler.
squeue -u <username> Show queued jobs from a specific user.
scontrol show job <job_id> Provide a detailed status report for a specified job via the scheduler.
sinfo -t idle Get a list of all free/idle nodes.
lcrc-sbank -q balance <proj_name> Query a specific project balance.
lcrc-sbank -q balance Query all of your project balances.
lcrc-sbank -q default Query your default project.
lcrc-sbank -s default <proj_name> Change your default project.
lcrc-sbank -q trans <proj_name> Query all transactions on a project.
lcrc-quota Query your global filesystem disk usage.


Notes

Attention MVAPICH2 users:
Bebop uses Intel MPI by default, but if you switch to MVAPICH2, please note that it was built with the slurm option, which means that mpiexec and mpirun are not available.
Use srun as the process manager instead.