Storage

Overview

On all of our clusters, users have access to a global home space, a project space, and a group space (if they belong to a group that has purchased additional storage). This means that files created from compute nodes on one cluster can be used for calculations on any of our other clusters. All storage systems use GPFS as the filesystem, and certain filesystems are backed up nightly.

Filesystem   Location                        Soft Limit   Hard Limit
Home         /home/<username>                100 GB       1 TB
Project      /lcrc/project/<project name>    1+ TB        2+ TB
Group        /lcrc/group/<group name>        no quotas    no quotas

We also offer the ability for groups to purchase their own storage resources to be hosted with us. In doing so, you get access to the storage space across all of our clusters (unless otherwise specified), and we take care of supporting the system, replacing parts, and, when possible, tuning the storage resources to fit your data model.

If you are interested in learning more about purchasing additional storage resources, please contact us at support@lcrc.anl.gov.

You also have access to scratch space on a compute node while you have a job running on that node.

Filesystem Quotas

In order to prevent individual users from consuming all of the available storage space, a quota is enforced on home and project directories. Home directories have a quota of 100 GB, while project directories default to 1 TB, with larger quotas available upon request and approval.

These quotas are technically soft limits. If a running job outputs more data than you expected, you can continue writing to your home or project directories up to the hard limit, which is what keeps a runaway job (for example, one stuck in an infinite loop) from filling the filesystem. However, once you are over your soft limit, you have only a two-week grace period to get back below your quota.

Once you are over your quota and your grace period has expired, you can no longer write files to your home directory, including, for example, the cache file used by Lmod. This means that your software environment could become corrupted, preventing you from finding executables you’ve used in the past. If you are unexpectedly seeing error messages like “command not found”, or you see cryptic error messages upon login, check that you aren’t over your quota with the following command:

$ /soft/lcrc/bin/lcrc-quota
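
If you are over quota and need to find what is taking up the space, standard tools such as du can help; for example (a generic command, not an LCRC-specific tool):

$ du -sh ~/* ~/.[!.]* 2>/dev/null | sort -h

This prints the size of each top-level file and directory in your home directory, sorted from smallest to largest.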

If you believe your project requires additional storage space greater than 1 TB, a project PI can put in a request for us to expand your quota. To do this, log in to https://accounts.lcrc.anl.gov and, under your ‘Owned’ projects in the left sidebar, click on the project for which you wish to request an increase. Enter the desired number of TBs as an integer in the ‘Storage (TB)’ box, followed by a description of how you calculated your needs under ‘Storage Justification’. Save the page changes; this will send LCRC staff a request. A decision will be made, and you will be notified. At this time, we are only accepting requests for project quota increases. If you run out of room in your home directory, you’ll need to either delete some of the data or move it to your project or group directories.

Home, Project, and Group Disk

Your home, project, and group directories are located on separate GPFS filesystems that are shared by all nodes on the cluster. These filesystems are located on a RAID array and are served by multiple file servers. This provides both a performance increase and protection against the filesystems becoming inaccessible: if one server goes down, the other servers can continue to serve the filesystems.

Pros

  • Global namespace
  • Multi-TB filesystem
  • Large file support (> 2 GB)
  • Backed up
  • RAID protection
  • Stable hardware
  • Native InfiniBand support

Cons

  • Moderate performance

Local and Global Scratch Disk

Local

If you need a place to put temporary files that don’t need to be accessed by other nodes, we recommend that you put them into the local scratch disk on the nodes during job runs. All jobs create a job-specific directory with local storage, which can be referenced from your job submission script using the variable $TMPDIR. The normal publicly available nodes offer 15 GB of temporary scratch space, while the ‘biggpu’ queue offers 1 TB. Diskfull Bebop nodes also house a 4 TB disk on each node. Note that these spaces aren’t backed up and will be cleared out at the end of the job!
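
As an illustration of how $TMPDIR might be used, here is a minimal sketch of a job submission script. It assumes a Slurm-style batch script; the application name (my_app), file names, and resource requests are placeholders for illustration only and should be replaced with your own.

#!/bin/bash
#SBATCH --job-name=local-scratch-example
#SBATCH --nodes=1
#SBATCH --time=01:00:00

# Stage input data into the node-local scratch directory created for this job.
cp /lcrc/project/<project name>/input.dat "$TMPDIR"/

# Run against the local copy to take advantage of the faster local disk.
cd "$TMPDIR"
/lcrc/project/<project name>/bin/my_app input.dat > output.dat

# Copy results back to project space; $TMPDIR is cleared when the job ends.
cp output.dat /lcrc/project/<project name>/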

Pros

  • Fast access times
  • Large file support (> 2 GB)

Cons

  • Unique to each node; not shared between nodes
  • Small capacity (GBs on most nodes)
  • Not backed up
  • Cleared out at the end of your job
  • No RAID protection

Global

LCRC also has a global scratch space located at /lcrc/globalscratch. This space is a GPFS filesystem that is several TBs in size (the size may change over time). Unlike the local scratch space, it is shared by all LCRC nodes. Because it is a GPFS filesystem, I/O will not be as fast as local scratch.
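
For example, a workflow might stage large intermediate files in a per-user directory under global scratch and copy anything worth keeping back to project space before the job finishes (the directory and file names below are placeholders):

$ mkdir -p /lcrc/globalscratch/$USER/myrun
$ cp /lcrc/project/<project name>/big_input.dat /lcrc/globalscratch/$USER/myrun/
  (run your job against the copy in global scratch)
$ cp /lcrc/globalscratch/$USER/myrun/final_result.dat /lcrc/project/<project name>/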

NOTE: This global scratch space may be cleaned up during our maintenance days or at other intervals, and all files in this space may be deleted with or without notice. This space is not intended for long-term storage and is not backed up in any way. Files deleted either accidentally or on purpose are permanently deleted.

Pros

  • Shared between nodes
  • Very large capacity

Cons

  • Slower access time
  • Not backed up
  • Cleared out with or without notice
  • No RAID protection

Backups and Archives

As previously mentioned, all storage systems use GPFS as their filesystem. Currently we are backing up the home filesystem. We are also backing up the project and group filesystems on a best-effort basis with our current solution; unfortunately, files in these directories may not be available for restore under certain circumstances. Due to the volume of data, backups finish at varying intervals. We expect the home filesystem to fully back up AT LEAST once a week. The project and group filesystems may take much longer. Some files may not be available to restore during these times, depending on the time of the last backup and availability. If you have any files that you consider critical, we recommend keeping another copy outside of LCRC just in case.

Home backups are written to both disk and a tape enclosure in LCRC, while the rest go only to tape. Because of this, non-home restores may take longer to complete. Our backup policy is that we maintain ONLY the current version of each file for 90 days. If you delete a file, we will maintain the most recent copy of it for 90 days; after 90 days, the file will be completely removed from LCRC systems. Please note that if you make changes to a file, or the file becomes corrupt in some way, the backup will be overwritten and the previous version will no longer exist. This also means that if you delete a file and, within 90 days, recreate it with the original name and in the original path, the next backup that occurs will overwrite the copy stored in our backup system, making the old version unrecoverable as well.

If you need to restore a lost file, please contact support@lcrc.anl.gov and we will make a best effort to restore it for you. Until a new backup is completed on each individual filesystem or fileset, a backup may not be available for us to restore.

We currently do not offer the ability for users to archive their own files on demand for long term recovery, but we hope to re-implement this in the near future.