Running Jobs on Bebop

Bebop utilizes the Slurm Workload Manager (formerly known as Simple Linux Utility for Resource Management or SLURM) for job management. Slurm is an open source, fault-tolerant, and highly scalable cluster management and job scheduling system for large and small Linux clusters.

As a cluster workload manager, Slurm has three key functions. First, it allocates exclusive and/or non-exclusive access to resources (compute nodes) to users for some duration of time so they can perform work. Second, it provides a framework for starting, executing, and monitoring work (normally a parallel job) on the set of allocated nodes. Finally, it arbitrates contention for resources by managing a queue of pending work.

The Basics

Over the next few weeks, we'll be documenting the basics of the Slurm scheduling process here in detail.

Available Queues

Bebop has several partitions defined; a partition is similar to a queue. Use the -p option with srun or sbatch to select a partition. The default partition is bdwall.

Bebop currently enforces a limit of 32 running jobs and 100 queued jobs per user.

Partition Name  Description                          Number of Nodes  CPU Type   Cores Per Node  Memory Per Node
debug           For administrative use only.         1024             Mixed      Mixed           128GB
knlall          All KNL Nodes.                       352              Phi 7230   64              128GB
knld            KNL with 4TB /scratch disk.          64               Phi 7230   64              128GB
knl             KNL with 15GB /scratch disk.         288              Phi 7230   64              128GB
bdwall          All Broadwell Nodes.                 672              E5-2695v4  36              128GB
bdwd            Broadwell with 4TB /scratch disk.    64               E5-2695v4  36              128GB
bdw             Broadwell with 15GB /scratch disk.   608              E5-2695v4  36              128GB
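For example, a job that needs the large node-local scratch disk can request the bdwd partition explicitly. A minimal job-script sketch (myaccount is a placeholder project name):

```shell
#!/bin/bash
# Sketch: request one Broadwell node from the bdwd partition, which carries
# the 4TB node-local /scratch disk. "myaccount" is a placeholder project name.
#SBATCH -p bdwd
#SBATCH -A myaccount
#SBATCH -N 1
#SBATCH -t 10:00
hostname   # the job payload; replace with your own commands
```

Submit it with sbatch as usual; Slurm reads the #SBATCH comment lines as options.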

Submitting Jobs

Batch Jobs

To submit a batch job in Slurm, use the sbatch command, which takes a number of different options. In the simplest case, you pass only the name of your submit script:

$ sbatch <script_name>

Each sbatch script can contain various options, but one of the most important ones you should set is the account name to charge your time to.

-A, --account=<account> Charge resources used by this job to specified account. The account is an arbitrary string.


In Slurm, an account is what we normally call a project in LCRC. If the account option is not specified, the time will be charged to your default account. Every Argonne employee has a default account, which should be your new Bebop startup-<username> account with an initial 20K core hours. Non-Argonne employees need to be a member of an existing, active project. If your time must be charged to a project other than your default account, specify the account option in your sbatch script or on the command line. You can also change your default project by following the commands here.
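As a rough sketch of how that startup allocation is consumed (assuming the usual accounting model where whole nodes are charged, so core-hours = nodes × cores per node × wall-clock hours):

```shell
#!/bin/sh
# Back-of-the-envelope core-hour accounting (assumed model: whole nodes are
# charged, so core-hours = nodes * cores_per_node * wall_hours).
# Example: a 2-node Broadwell job (36 cores/node) running 3 wall-clock hours.
nodes=2
cores_per_node=36
wall_hours=3
core_hours=$(( nodes * cores_per_node * wall_hours ))
echo "charged: $core_hours core-hours"   # charged: 216 core-hours
```

At that rate, a 20K core-hour startup allocation covers roughly 92 such runs.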

The sbatch script format is very similar to the one that was used in Torque/Maui on Blues.

Here are several examples that cover the most common usage.

Example Submit Script (Simple)
#!/bin/bash
#SBATCH --job-name=myjob
#SBATCH -p bdwall
#SBATCH -A myaccount
#SBATCH --ntasks-per-node=1
#SBATCH --time=15:00
srun hostname

This will run the hostname command on one node.

Example Submit Script (MPI)
#!/bin/bash
#SBATCH --job-name=mympijob
#SBATCH -p bdwall
#SBATCH -A myaccount
#SBATCH -N 2
#SBATCH -o myjob.%j.%N.out
#SBATCH -e myjob.%j.%N.error
#SBATCH --ntasks-per-node=36
#SBATCH -t 15:00
export I_MPI_FABRICS=shm:tmi
mpirun /bin/hostname

This script will start a job named mympijob in partition bdwall, charging project myaccount.

The output file will be myjob.%j.%N.out and the error file myjob.%j.%N.error, where %j is replaced by the job number and %N by the name of the first allocated node. (By default, Slurm writes stdout and stderr to the same file, named slurm-%j.out.)

--ntasks-per-node=36 means that this job will use 36 cores on each node.

-t sets the walltime.

The optional --mail-user=<email> option (combined with --mail-type) tells Slurm to email you when the job starts and finishes.

This will run the hostname command using the default Intel MPI across 2 nodes with 36 processes per node.
I_MPI_FABRICS=shm:tmi selects shared memory for communication within a single host, and the tmi (Omni-Path optimized) tag matching interface for host-to-host communication.
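The %j/%N filename substitution described above can be illustrated with a simulated expansion (the job id and node name below are made up; Slurm performs this substitution itself when the job starts):

```shell
#!/bin/sh
# Simulate Slurm's output-filename pattern expansion. The job id and node
# name here are hypothetical; Slurm substitutes the real values at job start.
pattern="myjob.%j.%N.out"
job_id=12345
node_name="bdw-0001"
expanded=$(echo "$pattern" | sed -e "s/%j/$job_id/" -e "s/%N/$node_name/")
echo "$expanded"   # myjob.12345.bdw-0001.out
```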

Example KNL Submit Script (MPI)
Attention KNL users:
Please note that while you are able to change the KNL modes, doing so may require reconfiguring and rebooting the nodes to apply the changes.

Depending on the number of KNL nodes involved and the changes requested, this can take a significant amount of time. Because the reconfiguration occupies the resources, that time is also charged against your core hour usage, in addition to the time it takes to complete the job.

#!/bin/bash
#SBATCH --job-name=mympijob
#SBATCH -p knlall
#SBATCH -A myaccount
#SBATCH -C knl,quad,cache
#SBATCH -N 2
#SBATCH --ntasks-per-node=64
#SBATCH --time=15:00
export I_MPI_FABRICS=shm:tmi
mpirun /bin/hostname

This will run across two KNL nodes using the Quadrant cluster mode and the Cache MCDRAM mode.
The default setting is quad,cache.

A table of available settings, along with more detailed information about Slurm’s KNL support is available here.

Attention MVAPICH2 users:
On Bebop, MVAPICH2 was built with Slurm support, which means that mpiexec and mpirun are not available.
Use srun as the process manager instead.

For example:

#!/bin/bash
#SBATCH -J MyJobName
#SBATCH -p bdw
#SBATCH -A myaccount
#SBATCH -o myjob.%j.%N.out
#SBATCH -e myjob.%j.%N.error
#SBATCH --nodes=16
#SBATCH --ntasks-per-node=36
#SBATCH -t 12:00:00

srun ./myprog

Interactive Jobs

There are a couple of ways to run an interactive job on Bebop.

First, you can just get a session on a node by running:

$ srun --pty -p <partition> -t <walltime> /bin/bash

If you want more flexibility, you can instead have the system first allocate resources for the job using the salloc command:

$ salloc -N 2 -p knl -t 00:30:00

This will allocate 2 nodes from the knl partition for 30 minutes. The job number is reported in the output.

You can get a list of your allocated nodes, along with the other Slurm environment variables set by the salloc command, by running:

$ export | grep SLURM

Once the resources are allocated and the session is granted, use the srun command to run your job:

$ srun -n 8 ./myprog

This will start 8 tasks (processes) on the allocated nodes.

Deleting a Job

To delete a job, use scancel, which takes a job id as an argument:

$ scancel 76543

For more information, please refer to the man pages of the commands mentioned above.

This Rosetta Stone of workload managers from SchedMD may be helpful if you're familiar with another workload manager.

Checking Queues and Jobs

squeue – view job and job step information.

Common options:

-a,--all Display information about all jobs in all partitions.
-u <user_list>,--user=<user_list> Request jobs or job steps from a comma separated list of users. The list can consist of user names or user id numbers.
-i <seconds>, --iterate=<seconds> Repeatedly gather and report the requested information at the interval specified (in seconds).
-l,--long Report more of the available information for the selected jobs or job steps.

For an extensive list of formatting options please consult squeue man page.

scontrol – can be used to report more detailed information about nodes, partitions, jobs, job steps, and configuration.

Common examples:

scontrol show node node-name Shows detailed information about a specific node, or all nodes if no name is given.
scontrol show partition partition-name Shows detailed information about a specific partition.
scontrol show job job-id Shows detailed information about a specific job or all jobs if no job id is given.
scontrol update job job-id Changes attributes of a submitted job (for example, scontrol update JobId=<job_id> TimeLimit=30:00).

For an extensive list of formatting options please consult scontrol man page.

sinfo – view information about nodes and partitions in the Slurm scheduling system

Common options:

-a, --all Display information about all partitions.
-t, --states <states> Display nodes in a specific state. Example: idle
-i <seconds>, --iterate=<seconds> Print the state on a periodic basis. Sleep for the indicated number of seconds between reports.
-l, --long Print more detailed information.
-n <nodes>, --nodes=<nodes> Print information only about the specified node(s). Multiple nodes may be comma separated or expressed using a node range expression. For example “bdw-[0001-0007]”
-o <output_format>, --format=<output_format> Specify the information to be displayed using an sinfo format string. For example, sinfo -o "%P %D %t" shows each partition with its node count and node state.

For an extensive list of formatting options please consult sinfo man page.

sacct – displays accounting data for all jobs and job steps, and can be used to query information about completed jobs.

Common options:

-S start_time, --starttime=start_time Select jobs in any state after the specified time.
-E end_time, --endtime=end_time Select jobs in any state before the specified time.

Valid time formats are:

HH:MM[:SS] [AM|PM]
MMDD[YY] or MM/DD[/YY] or MM.DD[.YY]
MM/DD[/YY]-HH:MM[:SS]
YYYY-MM-DD[THH:MM[:SS]]


$ sacct -S2014-07-03-11:40 -E2014-07-03-12:00 -X -ojobid,start,end,state
       JobID               Start                 End        State
------------ ------------------- ------------------- ------------
2            2014-07-03T11:33:16 2014-07-03T11:59:01    COMPLETED
3            2014-07-03T11:35:21 Unknown                  RUNNING
4            2014-07-03T11:35:21 2014-07-03T11:45:21    COMPLETED
5            2014-07-03T11:41:01 Unknown                  RUNNING

For an extensive list of formatting options please consult sacct man page.

Allocations/Time Management

Bebop and Blues currently use two separate allocation/time banking systems. Your time and balances on one cluster are not shared with the other. If you need to check your current account's (project's) balance(s), change your default account, etc., please see our documentation here.

Eventually, we will migrate to a single banking system; in the meantime, we will continue to make useful checks for these actions available through the LCRC scripts.

Software Stack (Lmod)

Starting with Bebop, we have switched to Lmod for environment management. Blues still uses SoftEnv, but SoftEnv is no longer actively developed, and most other sites use Environment Modules or Lmod instead. Lmod has several advantages over SoftEnv: for example, it prevents you from loading multiple versions of the same package simultaneously, or from having multiple compilers and MPI libraries loaded at once. For more information on how to use Lmod, click here.

By default your Lmod environment will load Intel Compilers, Intel MPI and Intel MKL.

We’ll be updating this documentation to be more user friendly over the next few months.

Compute Node Scratch Space

Bebop currently writes all temporary files on the compute nodes to a 1GB partition at /tmp. We are not currently setting an alternate path by default. If you need more space for output, you can temporarily store files in /scratch. This directory is a 15GB partition on the diskless nodes and 4TB on the diskfull nodes (nodes in partitions ending in 'd'). Please note that all data in this directory is deleted once your job completes. You can also set the TMPDIR environment variable in your job script as follows:

export TMPDIR=/scratch
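Tools that use the standard temporary-file utilities (mktemp and friends) pick the new location up automatically. A small sketch, using /tmp so it can run anywhere; on a Bebop compute node you would export TMPDIR=/scratch as above:

```shell
#!/bin/sh
# Demonstrate that mktemp honors TMPDIR. /tmp is used here only so the
# snippet runs outside the cluster; in a Bebop job script use /scratch.
export TMPDIR=/tmp
scratch_file=$(mktemp)
echo "intermediate data" > "$scratch_file"
case "$scratch_file" in
  "$TMPDIR"/*) echo "temp file created under $TMPDIR" ;;
esac
rm -f "$scratch_file"
```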

Why isn’t my job running yet?

We will be detailing this in depth in the next few weeks.

Command Line Quick Reference Guide

Command Description
sbatch <script_name> Submit a job.
scancel <job_id> Delete a job.
squeue Show queued jobs via the scheduler.
squeue -u <username> Show queued jobs from a specific user.
scontrol show job <job_id> Provide a detailed status report for a specified job via the scheduler.
sinfo -t idle Get a list of all free/idle nodes.
lcrc-sbank -q balance <proj_name> Query a specific project balance.
lcrc-sbank -q balance Query all of your project balances.
lcrc-sbank -q default Query your default project.
lcrc-sbank -s default <proj_name> Change your default project.
lcrc-sbank -q trans <proj_name> Query all transactions on a project.
lcrc-quota Query your global filesystem disk usage.