Swing

Quick Facts

  • 6x public compute nodes
  • 2x public login nodes
  • 8x NVIDIA A100 GPUs per node
  • 1-2TB DDR4 and 320-640GB GPU memory per node
  • 128 CPU cores per compute node
  • InfiniBand HDR interconnect

Available Partitions

Swing has two partitions, with gpu as the default. You will be allocated 1/8th of a node's resources per GPU requested.

Nodes are shared: multiple jobs from multiple users may run on a node until its resources are fully consumed (8 jobs with 1 GPU each per node, 1 job with 8 GPUs per node, and everything in between).

You MUST request at least 1 GPU to run a job; otherwise, you will see the following error (a minimal example script is shown below the partition table):

srun: error: Please request at least 1 GPU in the partition 'gpu'
srun: error: e.g '#SBATCH --gres=gpu:1')
srun: error: Unable to allocate resources: Invalid generic resource (gres) specification
Partition Name | Number of Nodes | GPUs Per Node | GPU Memory Per Node | CPUs Per Node | DDR4 Memory Per Node | Local Scratch Disk | Operating System
gpu | 5 | 8x NVIDIA A100 40GB | 320GB | 2x AMD EPYC 7742 64-Core Processor (128 Total Cores) | 1TB | 14TB | Ubuntu 20.04.2 LTS
gpu-large | 1 | 8x NVIDIA A100 80GB | 640GB | 2x AMD EPYC 7742 64-Core Processor (128 Total Cores) | 2TB | 28TB | Ubuntu 20.04.2 LTS
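
As an illustration, a minimal batch script that satisfies the 1-GPU requirement on the default gpu partition might look like the sketch below; the account name, walltime, and application are placeholders rather than Swing-specific values.

#!/bin/bash
#SBATCH --job-name=gpu-test         # descriptive job name
#SBATCH --account=YOUR_PROJECT      # placeholder: your LCRC project name
#SBATCH --partition=gpu             # default Swing partition
#SBATCH --gres=gpu:1                # request 1 GPU (required on Swing)
#SBATCH --time=01:00:00             # placeholder walltime

# List the GPU(s) visible to this job
nvidia-smi

Requesting --gres=gpu:1 allocates roughly 1/8th of the node (16 of the 128 CPU cores and a proportional share of memory), consistent with the sharing model described above.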

File Storage

On Swing, users who want to take advantage of local scratch space have the option of using a small scratch space on the node's memory (located at /scratch, a 20GB tmpfs). Otherwise, users have access to the same GPFS filesystems as on our other resources, including home, project, and group space.
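
As a sketch, a job could stage a small, frequently read input into the in-memory /scratch space and read it from there; the file and application names below are hypothetical.

# Inside a batch job: stage a small input into the node-local tmpfs at /scratch.
# Keep staged data well under the 20GB tmpfs limit.
cp "$HOME/inputs/small_dataset.bin" /scratch/        # hypothetical input file
./my_app --input /scratch/small_dataset.bin          # hypothetical application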

Please see our detailed description of the file storage used in LCRC here.

Architecture

Each Swing node has 2x AMD EPYC 7742 64-Core Processors and 8x NVIDIA A100 GPUs.

Swing also uses an InfiniBand HDR interconnect for its network. This is relevant for MPI programs, which can use the InfiniBand libraries for inter-node communication.
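
For example, a multi-node MPI job might be submitted with a batch script like the sketch below, with the ranks communicating over the InfiniBand fabric; the account, walltime, and application names are placeholders, and the exact MPI modules to load depend on your software environment.

#!/bin/bash
#SBATCH --account=YOUR_PROJECT      # placeholder: your LCRC project name
#SBATCH --partition=gpu             # default Swing partition
#SBATCH --nodes=2                   # two nodes, communicating over InfiniBand
#SBATCH --ntasks-per-node=8         # one MPI rank per GPU
#SBATCH --gres=gpu:8                # all 8 GPUs on each node
#SBATCH --time=01:00:00             # placeholder walltime

# Launch the MPI ranks through Slurm
srun ./my_mpi_app                   # hypothetical MPI application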

Running Jobs on Swing

For detailed information on how to run jobs on Swing, you can follow our documentation by clicking here: Running Jobs on Swing.

Swing utilizes the Slurm Workload Manager for job management. Slurm is an open source, fault-tolerant, and highly scalable cluster management and job scheduling system for large and small Linux clusters.

As a cluster workload manager, Slurm has three key functions. First, it allocates exclusive and/or non-exclusive access to resources (compute nodes) to users for some duration of time so they can perform work. Second, it provides a framework for starting, executing, and monitoring work (normally a parallel job) on the set of allocated nodes. Finally, it arbitrates contention for resources by managing a queue of pending work.
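
In day-to-day use, these three functions correspond to a handful of standard Slurm commands:

sbatch job.sh        # submit a batch script to the queue
squeue -u $USER      # monitor your pending and running jobs
scancel <jobid>      # cancel a pending or running job
sinfo                # show partition and node status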

- - -

Allocations Note

Swing, unlike other LCRC clusters, charges allocation time based on GPU Hours instead of Core Hours. Please factor this in when applying for time on Swing.
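
For example, assuming charges are computed as GPUs requested multiplied by wall-clock hours, a job that runs on 4 GPUs for 10 hours would consume 4 × 10 = 40 GPU Hours.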