Running Jobs on Blues

On April 1, 2019, Blues was upgraded to CentOS 7, Slurm, and Lmod, among other changes. The environment is now very similar to Bebop.

Details of notable changes are documented here. If you have any questions that aren’t answered below, please contact [email protected]

Blues and Bebop utilize the Slurm Workload Manager (formerly known as Simple Linux Utility for Resource Management or SLURM) for job management. Slurm is an open source, fault-tolerant, and highly scalable cluster management and job scheduling system for large and small Linux clusters.

As a cluster workload manager, Slurm has three key functions. First, it allocates exclusive and/or non-exclusive access to resources (compute nodes) to users for some duration of time so they can perform work. Second, it provides a framework for starting, executing, and monitoring work (normally a parallel job) on the set of allocated nodes. Finally, it arbitrates contention for resources by managing a queue of pending work.

The simplest way to become familiar with Slurm and its basic commands is to follow the Slurm Quick Start User Guide. In the rest of this page, we’ll cover specific examples and commands. If anything is unclear at any point, please contact LCRC support.

Logging Into Blues

Please follow our Getting Started documentation to make sure you’ve completed the steps required to log in to the LCRC Blues cluster. Once you’ve done this, you can SSH to Blues by running the following:

ssh <your_lcrc_username>@blues.lcrc.anl.gov

Do not run jobs on the LCRC login nodes. Doing so may impact other users and may require the login nodes to be rebooted.

If you have not logged in for a while and need to add a new SSH key, please read through our documentation here.

As before, Blues and Bebop share the same global GPFS filesystem. All of your home and project directories noted in our storage documentation are available on both clusters.

Projects Used for Job Submission

LCRC resources require a valid project with an allocation to submit jobs. Projects keep track of your quarterly allocations. Please see the following page for more information about Projects in LCRC.

To see how much time will be deducted from your project when running jobs on Blues, please see the following on Core Hour Usage.

When logging into Blues for the first time, you’ll need to change your default project (as a reference, what LCRC calls projects are referred to as accounts in Slurm).

Blues and Bebop currently use two separate allocation/time databases. Your time and balances on one cluster will not be the same on the other. If you need to check your current account’s (project’s) balance(s), change your default account, etc., please see our documentation below or reference the information here: Project Allocation Queries and Management.

Upon first login, every Blues user’s default project is set to external, which has no time allocated, so you will not be able to submit jobs until you change it.

Blues no longer accepts negative allocations, so your project will need time in order to run a job. Project time will be added as always, either at the beginning of a new fiscal quarter or on request by following the instructions here: Requesting Additional Project Time.

Blues users of the partitions/queues sball, haswell, shared, ivy and biggpu will need to use a project that has a valid allocation.

All Blues condo node users (that is, users of any partition/queue that is NOT publicly available) need to use the project/account name condo. This project will allow you to submit jobs to your condo nodes free of charge. This project will not work on the shared, publicly available partitions and MUST be used to submit jobs to the condo nodes. You can get a list of all partition names on Blues that you have access to by running sinfo -o %P. Any partition that is not sball, haswell, shared, ivy or biggpu is considered a condo partition.
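
As a small illustration of the rule above, the following sketch classifies a partition name and suggests which account to use with it. The function names partition_kind and account_for_partition are hypothetical helpers written for this page, not LCRC tools, and the partition and project names in the example calls are placeholders.

```shell
# Classify a Blues partition as public or condo: anything other than
# sball, haswell, shared, ivy or biggpu is treated as a condo partition.
# (partition_kind is a hypothetical helper name, not an LCRC tool.)
partition_kind() {
    case "$1" in
        sball|haswell|shared|ivy|biggpu) echo "public" ;;
        *) echo "condo" ;;
    esac
}

# Suggest the account to pass to sbatch -A: "condo" for condo partitions,
# otherwise the project name given as the second argument.
account_for_partition() {
    local partition="$1" project="$2"
    if [ "$(partition_kind "$partition")" = "condo" ]; then
        echo "condo"
    else
        echo "$project"
    fi
}

partition_kind haswell                          # prints "public"
account_for_partition my-condo-part myproject   # prints "condo"
```

You could feed the output of sinfo -o %P through partition_kind to see which of your partitions are condo partitions.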

Setting a Default Project on Blues

You can set your default project on Blues with the following command:

lcrc-sbank -s default <project_name>

You can also specify the project name in your job submission if you’d like to use a project other than your default. With sbatch, this can be done by adding the following to your job script:

#SBATCH -A <project_name>

Query your Default Project on Blues

Once you set your default project on Blues, you can make sure this is set correctly with this command:

lcrc-sbank -q default

Query Project Balances on Blues

You can query your project balances on Blues to see how much time you have available and how much you have used.

Query all of your project balances on Blues:

lcrc-sbank -q balance

Query a specific project balance on Blues:

lcrc-sbank -q balance <project_name>

Query a Project Transaction History on Blues

If you’d like to see the transaction history for a project on Blues, run the following:

lcrc-sbank -q trans <project_name>

lcrc-sbank Help Menu

If you need to see the lcrc-sbank help menu at any time, run the following:

lcrc-sbank -h

Software Environment Using Lmod

Blues uses Lmod (Lua Environment Modules) for environment variable management. SoftEnv has been deprecated in LCRC, as most other sites use Environment Modules or Lmod instead. Lmod has several advantages over SoftEnv. For example, it prevents you from loading multiple versions of the same package at the same time. It also prevents you from having multiple compilers and MPI libraries loaded at the same time. See the Lmod User Guide for information on how to use Lmod. If you are used to SoftEnv and want to know the equivalent commands for Lmod, here is a handy cheat sheet.

By default, your Lmod environment will load the Intel compilers, Intel MPI and Intel MKL.

Using Slurm to Submit Jobs

Blues uses Slurm as the job resource manager and scheduler for the cluster.

The best sources of information on using Slurm are the Slurm quickstart guide here and the man pages.

Below we will outline some general information on the Blues Slurm partitions and supply some basic submission information to get you started using the new tools.

Partition Limits

Blues currently enforces the following limits on publicly available partitions:

  • 32 Running Jobs per user.
  • 100 Queued Jobs per user.
  • 7 Days (168 Hours) Maximum Walltime.
  • 1 Hour Default Walltime if not specified.
  • sball (Sandy Bridge Compute Nodes) is the default partition.
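
Before submitting, you may want to check a requested walltime against the 7 day limit above. The sketch below is one way to do that in the shell. The helpers walltime_seconds and within_public_limit are hypothetical names, and they only understand the HH:MM:SS and D-HH:MM:SS forms; other time formats accepted by Slurm are not handled.

```shell
# Convert a walltime given as HH:MM:SS or D-HH:MM:SS into seconds.
# Prints nothing for any other format. (Hypothetical helper.)
walltime_seconds() {
    echo "$1" | awk -F'[-:]' '
        NF == 3 { print $1 * 3600 + $2 * 60 + $3 }
        NF == 4 { print $1 * 86400 + $2 * 3600 + $3 * 60 + $4 }'
}

# Succeed if the walltime is at most 7 days (168 hours = 604800 seconds).
within_public_limit() {
    local secs
    secs="$(walltime_seconds "$1")"
    [ -n "$secs" ] && [ "$secs" -le 604800 ]
}

walltime_seconds 2-12:30:00    # prints 217800
within_public_limit 7-00:00:00 && echo "7-00:00:00 is within the limit"
within_public_limit 8-00:00:00 || echo "8-00:00:00 exceeds the limit"
```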

Available Partitions

Blues has several publicly available partitions defined (partitions are what were called queues before the switch to Slurm). Use the -p option with srun or sbatch to select a partition. The default partition is sball. Blues condo node partitions are not listed below. You can get a list of all partition names on Blues that you have access to by running sinfo -o %P. Any partition that is not sball, haswell, shared, ivy or biggpu is considered a condo partition.


Blues Partition Name | Description | Number of Nodes | CPU Type | Co-Processors | Cores Per Node | Memory Per Node | Local Scratch Disk
sball | Sandy Bridge Nodes | 300 | Intel Xeon E5-2670 2.6GHz | - | 16 | 64 GB | 15 GB
shared | Sandy Bridge Shared Nodes (Oversubscription / Non-Exclusive) | 4 | Intel Xeon E5-2670 2.6GHz | - | 16 | 64 GB | 15 GB
haswell | Haswell Nodes | 60 | Intel Xeon E5-2698v3 2.3GHz | - | 32 | 128 GB | 15 GB
ivy | Ivy Bridge Nodes | 1 | Intel Xeon E5-2670v2 2.5GHz | - | 20 | 64 GB | 15 GB
biggpu | Sandy Bridge Nodes | 6 | Intel Xeon E5-2670 2.6GHz | 2x NVIDIA Tesla K40m GPU | 16 | 768 GB | 1 TB

Job Submission Commands

The three most common tools you will use to submit jobs are sbatch, srun and salloc.

The table below is a quick cheat sheet of common Slurm job commands:

Slurm Command | Description
sbatch <job_script> | Submit <job_script> to the Scheduler
srun <options> | Run Parallel Jobs
salloc <options> | Request an Interactive Job
squeue | View Job Information
scancel <job_id> | Delete a Job

Example Sbatch Job Submission (Simple)

Below are two very simple submission scripts that you can use with sbatch to get started. For this example, the script can be named myjob.sh:

#!/bin/bash

#SBATCH --job-name=<my_job_name>
#SBATCH --account=<my_lcrc_project_name>
#SBATCH --partition=sball
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=16
#SBATCH --output=<my_job_name>.out
#SBATCH --error=<my_job_name>.error
#SBATCH --mail-user=<your email address> # Optional if you require email
#SBATCH --mail-type=ALL                  # Optional if you require email
#SBATCH --time=01:00:00

# Run My Program
srun /bin/hostname

Example Sbatch Job Submission (MPI)

#!/bin/bash

#SBATCH --job-name=<my_job_name>
#SBATCH --account=<my_lcrc_project_name>
#SBATCH --partition=sball
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=16
#SBATCH --output=<my_job_name>.out
#SBATCH --error=<my_job_name>.error
#SBATCH --mail-user=<your email address> # Optional if you require email
#SBATCH --mail-type=ALL                  # Optional if you require email
#SBATCH --time=01:00:00

# Setup My Environment
module load intel-parallel-studio/cluster.2018.4-ztml34f

# Run My Program
srun -n 32 ./helloworld

You can then submit this job from a Blues login node using:

sbatch myjob.sh

Please refer to the sbatch webpage for a full list of options, including environment variables.
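
If you script your submissions, you will often want the job ID that sbatch reports. On success, sbatch prints a line of the form "Submitted batch job <job_id>", and the sketch below pulls the ID out of that line. The helper name job_id_from_sbatch is hypothetical.

```shell
# Extract the job ID from the line sbatch prints on success.
# Prints nothing if the text does not look like a successful submission.
# (job_id_from_sbatch is a hypothetical helper name.)
job_id_from_sbatch() {
    echo "$1" | awk '/^Submitted batch job/ { print $4 }'
}

# On a login node you might use it like this:
#   job_id="$(job_id_from_sbatch "$(sbatch myjob.sh)")"
#   squeue -j "$job_id"

job_id_from_sbatch "Submitted batch job 12345"   # prints 12345
```

Alternatively, sbatch has a --parsable option that prints only the job ID (followed by the cluster name, if there is one), which avoids the parsing step.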

Example Interactive Job Submission

There are a couple of ways to run an interactive job on Blues.

First, you can just get a session on a node by using the srun command in the following way:

srun --pty -p <partition> -t <walltime> /bin/bash

This will drop you onto one node. Once you exit the node, the allocation will be relinquished.


If you want more flexibility, you can instead have the system first allocate resources for the job using the salloc command:

salloc -N 2 -p sball -t 00:30:00

This job will allocate 2 nodes from the sball partition for 30 minutes. The job number is shown in the output. This command will not log you into any of your allocated nodes by default.

You can get a list of your allocated nodes and many other Slurm settings set by the salloc command by running:

printenv | grep SLURM

Once the resources are allocated and the session is granted, use the srun command to run your job:

srun -n 8 ./myprog

This will start 8 tasks on the allocated nodes. If you try to use more resources than you allocated (say, 3 nodes’ worth of resources when you only asked for 2), this will create a separate reservation, and the original one will continue to run and use hours as well.

When you allocate resources via salloc, you can also freely SSH to the nodes in your allocation if you prefer to run jobs from the nodes themselves.

Checking Queues and Jobs

To view job and job step information use squeue.

Here’s a quick example of what the output may look like:

squeue
             JOBID PARTITION       NAME    USER ST       TIME  NODES NODELIST(REASON)
               999     sball  test-joba   user2  R    2:40:31      2 b[110-111]
               998     sball  test-job2   user1  R      45:20      1 b101
               997   haswell  test-job1   user1  R       3:04      1 b100

Here are also some common options for squeue:

-a Display information about all jobs in all partitions. This is the default when running squeue with no options.
-u <user_list> Request jobs or job steps from a comma separated list of users. The list can consist of user names or user id numbers.
-j <job_id_list> Requests a comma separated list of job IDs to display. Defaults to all jobs.
-l Report more of the available information for the selected jobs or job steps.
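
Because squeue prints plain columns, its output is easy to post-process. As a rough sketch, the hypothetical helper count_state below counts jobs in a given state (column 5, ST) from squeue's default output, which can be handy for comparing against the running and queued job limits. It assumes job names contain no spaces, since that would shift the columns.

```shell
# Count the jobs in a given state from squeue's default output on stdin.
# The first line (the header) is skipped. (Hypothetical helper.)
count_state() {
    awk -v st="$1" 'NR > 1 && $5 == st { n++ } END { print n + 0 }'
}

# Sample input copied from the example output above.
sample='             JOBID PARTITION       NAME    USER ST       TIME  NODES NODELIST(REASON)
               999     sball  test-joba   user2  R    2:40:31      2 b[110-111]
               998     sball  test-job2   user1  R      45:20      1 b101
               997   haswell  test-job1   user1  R       3:04      1 b100'

echo "$sample" | count_state R     # prints 3
echo "$sample" | count_state PD    # prints 0
```

On the cluster you would pipe real output into it, for example squeue -u <username> | count_state R.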

Deleting a Job

To delete a job, use scancel. This command takes the job ID as its argument. Your job ID is given to you when you submit the job. You can also retrieve it from the squeue command detailed above.

scancel <job_id>

Other Useful Slurm Commands

scontrol – can be used to report more detailed information about nodes, partitions, jobs, job steps, and configuration.

Common examples:

scontrol show node node-name Shows detailed information about a specific node or all nodes if no node name is given.
scontrol show partition partition-name Shows detailed information about a specific partition.
scontrol show job job-id Shows detailed information about a specific job or all jobs if no job id is given.
scontrol update job job-id Change attributes of submitted job.

For an extensive list of formatting options please consult the scontrol man page.


sinfo – view information about Slurm nodes and partitions.

Common options:

-a, --all Display information about all partitions.
-t, --states <states> Display nodes in a specific state. Example: idle
-i <seconds>, --iterate=<seconds> Print the state on a periodic basis. Sleep for the indicated number of seconds between reports.
-l, --long Print more detailed information.
-n <nodes>, --nodes=<nodes> Print information only about the specified node(s). Multiple nodes may be comma separated or expressed using a node range expression. For example “b[110-111]”
-o <output_format>, --format=<output_format> Specify the information to be displayed using an sinfo format string.

For an extensive list of formatting options please consult the sinfo man page.


sacct – displays accounting data for all jobs and job steps and can be used to display information about completed jobs.

Common options:

-S start_time, --starttime=start_time Select jobs in any state after the specified time.
-E end_time, --endtime=end_time Select jobs in any state before the specified time.

Valid time formats are:

HH:MM[:SS] [AM|PM]
MMDD[YY] or MM/DD[/YY] or MM.DD[.YY]
MM/DD[/YY]-HH:MM[:SS]
YYYY-MM-DD[THH:MM[:SS]]

Example:

# sacct -S2014-07-03-11:40 -E2014-07-03-12:00 -X -ojobid,start,end,state
                  JobID                 Start                  End        State
              --------- --------------------- -------------------- ------------
              2         2014-07-03T11:33:16   2014-07-03T11:59:01   COMPLETED
              3         2014-07-03T11:35:21   Unknown               RUNNING
              4         2014-07-03T11:35:21   2014-07-03T11:45:21   COMPLETED
              5         2014-07-03T11:41:01   Unknown               RUNNING

For an extensive list of formatting options please consult the sacct man page.
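
Typing timestamps by hand gets tedious, so here is a sketch that builds the -S and -E arguments for "the last N hours" in the YYYY-MM-DDTHH:MM:SS format listed above. The helper names sacct_time and last_hours_args are hypothetical, and the sketch assumes GNU date (for the -d option).

```shell
# Format a Unix timestamp in a form that sacct accepts. Assumes GNU date.
# (sacct_time and last_hours_args are hypothetical helper names.)
sacct_time() {
    date -d "@$1" +%Y-%m-%dT%H:%M:%S
}

# Print -S/-E arguments covering the last N hours.
# The optional second argument overrides "now" (as a Unix timestamp).
last_hours_args() {
    local hours="$1" now="${2:-$(date +%s)}"
    echo "-S $(sacct_time $(( now - hours * 3600 ))) -E $(sacct_time "$now")"
}

# Show the window for the last 24 hours.
last_hours_args 24
```

On the cluster, you could then run something like sacct $(last_hours_args 24) -X -ojobid,start,end,state.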


sprio – view the factors that comprise a job’s scheduling priority.

sprio is used to view the components of a job’s scheduling priority when the multi-factor priority plugin is installed. sprio is a read-only utility that extracts information from the multi-factor priority plugin. By default, sprio returns information for all pending jobs. Options exist to display specific jobs by job ID and user name.

For an extensive list of formatting options please consult the sprio man page.

Core Hour Usage

As mentioned, submitting jobs to Blues requires time allocated to a Project (or what Slurm calls an Account). Our documentation has an extensive write-up on this on the following page: Projects in LCRC

Whenever a computing job runs on any computing node, the time the job uses is counted and recorded as computing time used by the associated project. A job must have a project in order to run on the computing nodes and will be assigned to your default project if none has been specified in your job script. ALL jobs submitted via sbatch, srun or salloc will deduct computing core hours from your project.

On Blues, the nodes charge as follows for each job:

Sandy Bridge: # of Nodes * 16 (# Cores Per Node) * Time Used
Haswell: # of Nodes * 32 (# Cores Per Node) * Time Used
Ivy Bridge: # of Nodes * 20 (# Cores Per Node) * Time Used
GPU Nodes: # of Nodes * 16 (# Cores Per Node) * Time Used * 2

Projects will be charged for the entire node when a job is run, even if you don’t utilize all of the cores or don’t actually run a job when a node is allocated to you. Anytime a node is allocated, the resource is unavailable for anyone else to use, which is why the full node is charged.

The GPU nodes charge a factor of 2 because there is a limited number of them and they offer more resources than the other nodes.
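
To estimate a charge before you submit, the formulas above can be sketched as a small shell function. The name core_hours and its node type keywords are hypothetical, and the sketch assumes Time Used is expressed in hours (fractions are allowed).

```shell
# Estimate the core hours charged for a job, using the formulas above.
# Arguments: node type, number of nodes, hours used.
# (core_hours and its node type keywords are hypothetical.)
core_hours() {
    local kind="$1" nodes="$2" hours="$3" cores factor=1
    case "$kind" in
        sandybridge) cores=16 ;;
        haswell)     cores=32 ;;
        ivybridge)   cores=20 ;;
        gpu)         cores=16; factor=2 ;;
        *) echo "unknown node type: $kind" >&2; return 1 ;;
    esac
    awk -v n="$nodes" -v c="$cores" -v t="$hours" -v f="$factor" \
        'BEGIN { print n * c * t * f }'
}

core_hours sandybridge 2 1.5   # 2 * 16 * 1.5 = 48
core_hours gpu 1 10            # 1 * 16 * 10 * 2 = 320
```

Note that the full node is charged, so the number of cores your job actually uses does not appear in the calculation.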

As a reminder, jobs on any non-public condo queues that you belong to are NOT charged time.

Compute Node Scratch Space

Blues currently writes all temporary files on the compute nodes to a 15 GB tmpfs at /scratch. You can also write here to temporarily store your run files. Please note that all data will be deleted from this directory once your job completes. You can also change your environment’s TMPDIR variable in your job script if you want to set an alternate path.
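
As an illustration of setting TMPDIR, the sketch below creates a private, uniquely named directory and prints its path. The helper name make_job_tmpdir and the job-<id> naming scheme are assumptions made for this example. SLURM_JOB_ID is set by Slurm inside a job, and the sketch falls back to "nojob" when it is unset.

```shell
# Create a private temporary directory under a base path (default /scratch)
# and print its name. (make_job_tmpdir is a hypothetical helper, and the
# job-<id>-XXXXXX naming is only an illustration.)
make_job_tmpdir() {
    local base="${1:-/scratch}"
    mktemp -d "$base/job-${SLURM_JOB_ID:-nojob}-XXXXXX"
}

# In a job script you might then do:
#   export TMPDIR="$(make_job_tmpdir)"
#   srun ./myprog

# Demonstration using /tmp so it also works outside of a job:
demo_dir="$(make_job_tmpdir /tmp)"
echo "created $demo_dir"
rmdir "$demo_dir"
```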

Using the GPU Nodes

If you are looking to run jobs on a GPU, Blues offers 6 nodes in the biggpu partition. The number of cores and processor type are the same as the normal Sandy Bridge nodes.

The GPU nodes offer a few differences:

  • There is 1 TB of scratch disk available
  • There is 768 GB of RAM
  • There are 2x NVIDIA Tesla K40m GPUs

To schedule these nodes in Slurm, you can add the following to your submission script:

#SBATCH --partition=biggpu
#SBATCH --gres=gpu:2

This will schedule your job on the biggpu partition, allocating 2 GPUs per node. You can reduce this to 1 if you wish.

Why Isn’t My Job Running Yet?

If today is NOT LCRC Maintenance Day and your job is in the pending (PD) state after running squeue, Slurm provides a reason for this in the squeue output. Here are a few of the most common reasons your job may not be running.

First, check the reason code by querying your job number in Slurm:

squeue -j <job_id>

Then, you can determine why the job has not started by looking up the reason code in this list:

Reason Code | Description
AccountNotAllowed | The job isn’t using an account that is allowed on the partition. Condo node users must use the condo account on condo partitions. Publicly available partitions will not accept the condo account or the default account we set for users, which is external.
AssocGrpBillingMinutes | The job doesn’t have enough time in the banking account to begin.
BeginTime | The job’s earliest start time has not yet been reached.
Cleaning | The job is being requeued and still cleaning up from its previous execution.
Dependency | This job is waiting for a dependent job to complete.
JobHeldAdmin | The job is held by a system administrator.
JobHeldUser | The job is held by the user.
NodeDown | A node required by the job is down.
PartitionNodeLimit | The number of nodes required by this job is outside of its partition’s current limits. Can also indicate that required nodes are DOWN or DRAINED.
PartitionTimeLimit | The job’s time limit exceeds its partition’s current time limit.
Priority | One or more higher priority jobs exist for this partition or advanced reservation.
QOSMaxJobsPerUserLimit | The job’s QOS has reached its maximum job count for the user at one time.
ReqNodeNotAvail | During LCRC Maintenance Day, you may see this reason. Otherwise, some node specifically required by the job is not currently available. The node may currently be in use, reserved for another job, in an advanced reservation, DOWN, DRAINED, or not responding. Nodes which are DOWN, DRAINED, or not responding will be identified as part of the job’s “reason” field as “UnavailableNodes”. Such nodes will typically require the intervention of a system administrator to make available.
Reservation | The job is waiting for its advanced reservation to become available.
Resources | The job is waiting for resources to become available.
TimeLimit | The job exhausted its time limit.

While this is not every reason code, these are the most common on Blues. You can view the full list of Slurm reason codes here.
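
If you have many pending jobs, you can turn the list above into a quick lookup. The sketch below reads "job ID, reason" pairs, such as those produced by squeue -h -t PD -o "%i %r", and prints a short hint for a few of the codes. The helper name explain_pending and the wording of the hints are our own illustration, not Slurm output.

```shell
# Read "JOBID REASON" pairs on stdin and print a short hint for each one.
# Suitable input can be produced with: squeue -h -t PD -o "%i %r"
# (explain_pending is a hypothetical helper; the hints are illustrative.)
explain_pending() {
    while read -r job reason; do
        case "$reason" in
            AccountNotAllowed)        hint="use a project that is allowed on this partition" ;;
            AssocGrpBillingMinutes)   hint="the project is out of time" ;;
            Priority|Resources)       hint="waiting in the queue; see sprio" ;;
            JobHeldUser|JobHeldAdmin) hint="the job is held" ;;
            *)                        hint="see the reason code list" ;;
        esac
        echo "$job: $reason ($hint)"
    done
}

printf '101 Priority\n102 AssocGrpBillingMinutes\n' | explain_pending
# prints:
# 101: Priority (waiting in the queue; see sprio)
# 102: AssocGrpBillingMinutes (the project is out of time)
```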

Assuming your job is in the Priority/Resources state, you can use the sprio command to get a better idea of when your job may start based on the priorities of other pending jobs. The priority is the sum of age, fairshare, jobsize and QOS (quality of service).

Command Line Quick Reference Guide

Command | Description
sbatch <script_name> | Submit a job.
scancel <job_id> | Delete a job.
squeue | Show queued jobs via the scheduler.
squeue -u <username> | Show queued jobs from a specific user.
scontrol show job <job_id> | Provide a detailed status report for a specified job via the scheduler.
sinfo -t idle | Get a list of all free/idle nodes.
lcrc-sbank -q balance <project_name> | Query a specific project balance.
lcrc-sbank -q balance | Query all of your project balances.
lcrc-sbank -q default | Query your default project.
lcrc-sbank -s default <project_name> | Change your default project.
lcrc-sbank -q trans <project_name> | Query all transactions on a project.
lcrc-quota | Query your global filesystem disk usage.

Troubleshooting Notes

SSH Host Key Changes

Since Blues was upgraded to CentOS 7 on April 1, 2019, you may encounter the error message below, or something similar, when trying to log in to Blues, because we changed the SSH host keys. Follow the steps below to fix this issue if necessary.

### NOTE: SSH Host Keys ###

Please note that these are the same nodes and same hostnames that Blues was using before. We did, however, rebuild them entirely with a new image, and the SSH host key has changed! Because of this, you may see a message similar to the one below when trying to log in:


@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@    WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED!     @
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
IT IS POSSIBLE THAT SOMEONE IS DOING SOMETHING NASTY!
Someone could be eavesdropping on you right now (man-in-the-middle attack)!
It is also possible that the RSA host key has just been changed.
The fingerprint for the RSA key sent by the remote host is
SHA256:XXXXXXXXXXXXX
Please contact your system administrator.
Add correct host key in /home/<your_username>/.ssh/known_hosts to get rid of this message.
Offending key in /home/<your_username>/.ssh/known_hosts:<some_line_number>
RSA host key for blues.lcrc.anl.gov has changed and you have requested strict checking.
Host key verification failed.

Removing all of the old Blues host keys from /home/<your_username>/.ssh/known_hosts (replacing the path, your username and the hostname as necessary) should solve the issue, and you should then be able to log in. You may also be able to remove them by running:

ssh-keygen -f "/home/<your_username>/.ssh/known_hosts" -R blues.lcrc.anl.gov

Again, replace the path, your username and the hostname as necessary.

If you continue to have trouble, please contact [email protected].

Login Node Name Changes

When you log in to blues.lcrc.anl.gov after April 1, 2019, you will be dropped onto one of four Blues login nodes:

  • blueslogin1.lcrc.anl.gov
  • blueslogin2.lcrc.anl.gov
  • blueslogin3.lcrc.anl.gov
  • blueslogin4.lcrc.anl.gov

You will notice the login node names have changed. We have CNAME records implemented in order to not disturb your current workflow. They map as follows:

  • blogin1.lcrc.anl.gov > blueslogin1.lcrc.anl.gov
  • blogin2.lcrc.anl.gov > blueslogin2.lcrc.anl.gov
  • blogin3.lcrc.anl.gov > blueslogin3.lcrc.anl.gov
  • blogin4.lcrc.anl.gov > blueslogin4.lcrc.anl.gov

As always, do not run jobs on the login nodes. Doing so may impact other users and may require the login nodes to be rebooted.

Contact Information

Please contact [email protected] with any questions you may have regarding the upgrade.