

NOTICE: Nodes belonging to the birthright, chem, cnms, ccsd, bsd, mstd, ntrc, nucphys, virtues, theory, qmc, and ccsi groups have now moved to the Slurm scheduler and can be accessed from or-slurm-login.ornl.gov. See the Running Jobs documentation for more details.


Job Submission with Slurm

This page can be used as an introductory guide to job submission or a Slurm command reference. Each section can be read sequentially if you are new to Slurm, or referenced individually if you are already familiar with job scheduling.

Accessing Slurm

Please see the Running Jobs page for information on how to access Slurm from the or-condo and mod-condo.


Introduction

Slurm is a job scheduler and resource management program for computer clusters, similar to both Moab and Torque. If you are already familiar with the Moab msub or Torque qsub command, adapting to Slurm is straightforward. A command reference for Slurm and PBS/Torque is available here to assist with translating existing job scripts.

Job Submission

Job submission in Slurm is split between two commands: srun and sbatch. sbatch submits job scripts to the scheduler, while srun runs programs directly from the command line as jobs. Both commands share the same command-line arguments.

The rest of this page contains examples explaining how to run various types of jobs using srun and sbatch. Reference sections for commonly used environment variables, job parameters, and filename patterns are available at the bottom of the page.

PBS Job Script Translation


Slurm provides its own qsub command, which attempts to seamlessly convert PBS job submission scripts to sbatch scripts. This is the fastest way to test your existing job scripts against the Slurm scheduler with minimal changes. There are a few differences in how the Slurm scheduler and Moab scheduler are configured, however, which require slight modifications to existing PBS scripts:

  • -A birthright-burst → -A birthright
    • The account used to submit the job may or may not need to be updated. Valid Slurm account names can be found using this command:
      sacctmgr show assoc where user=<uid> format=account
  • -q gpu → -q burst
    • The queue that the job is submitted to may need to be updated. Valid queue names can be found with the sinfo command.
  • -l walltime=0:0:10:0 → -l walltime=10:00
    • A walltime request needs to be included in the job script.
  • -l mem=<number>[unit]
    • If you are not already specifying memory per node in your script, you must add a memory request.
  • -l nodes=<number>:ppn=<number>
    • Node and Processors Per Node specifications need to be included in your script.

In addition to these changes, any references to environment variables provided by Moab/Torque must be updated to use the Slurm equivalent environment variables. A list of common variables is available in the Environment Variables section below. For the full variable listing, view the official documentation here.

Common Translation Errors

1. PBS_O_WORKDIR must become SLURM_SUBMIT_DIR

If you have been using PBS_O_WORKDIR in your batch script to set your file path, Slurm will not recognize it, and your job will complete without output or an error because it cannot find your executable or input files. Use the Slurm equivalent of PBS_O_WORKDIR, SLURM_SUBMIT_DIR, instead.
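
A minimal sketch of the change (the cd into the submission directory is a common script pattern, used here purely for illustration):

```shell
# PBS/Torque version -- under Slurm, PBS_O_WORKDIR is unset, so this cd silently fails:
#   cd $PBS_O_WORKDIR

# Slurm version:
cd $SLURM_SUBMIT_DIR
```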

2. Use mpirun instead of srun for MPI codes

Currently, CADES OpenMPI modules and programming environments have not been compiled with the Slurm pmix headers required to use the srun command. To run MPI codes, use mpirun instead of srun. Slurm will still schedule your batch script. See the OpenMPI Slurm FAQ here for more information.
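
In a batch script, the change is a one-line swap (mpi_app is a hypothetical program name standing in for your own executable):

```shell
# srun ./mpi_app      # fails here: the OpenMPI modules lack Slurm pmix support
mpirun ./mpi_app      # works: mpirun reads the node list from the Slurm allocation
```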

📝 Note: If you are submitting your job to CADES Condo resources, view the page on Slurm Resource Queues for more information on accounts and resource queues.

Srun


Getting Started

The example below shows how to run a program directly from the command line as a job using srun. This method of job submission is useful for quick one-off jobs like the one shown, but generally should not be used for complex jobs with many resource requirements.

HelloWorld.sh

#!/bin/bash

echo "Hello World!"

To submit the example HelloWorld.sh script as a job, run the following srun command:

srun -A <account_name> -p <partition_name> -N 1 -n 1 -c 1 --mem=8G -t 10:00 ./HelloWorld.sh

Each argument used in this example is detailed below:

Arguments

 -A : Account to run the job under
 -p : Partition to run the job in
 -N : Nodes requested for the job
 -n : Total number of tasks the job will run
 -c : CPU cores that each task requires
 --mem : Required memory per node
 -t : The requested walltime for the job

The last argument in the command lists the path and name of the program to run, which in this case is the example script located in the current directory.

If you are unsure of what account to specify for the -A argument, use the following command to list all of the accounts you are a member of:

sacctmgr show assoc where user=<uid> format=account

To find valid values for the -p argument, use the command sinfo to list all of the available partitions in the cluster.

Interactive Jobs

Interactive jobs allow direct access to node hardware to run jobs. This is typically used when testing or profiling a job, or when running a program that requires user input.

srun can be used to start an interactive job with one additional argument: --pty. An example command is shown below:

srun -A <account_name> -p <partition_name> -N 1 -n 1 -c 1 --mem=8G -t 1:00:00 --pty /bin/bash

In this example the --pty argument was added, and the last argument listing what program to run was changed from the test script to the bash shell.

New Arguments

 --pty : Run the program listed as the last argument in pseudo-terminal mode

Sbatch


Non-interactive Jobs

The sbatch command parses properly formatted script files and runs them as jobs. Job scripts are composed of three main components: the interpreter declaration, the sbatch arguments, and the job commands.

HelloWorld.sbatch

#!/bin/bash
# Interpreter declaration

#SBATCH -A <account_name>
#SBATCH -p <partition_name>
#SBATCH -N 1
#SBATCH -n 1
#SBATCH -c 1
#SBATCH -J test-job
#SBATCH --mem=1g
#SBATCH -t 10:00
#SBATCH -o ./test-output.txt
#SBATCH -e ./test-error.txt
#SBATCH --mail-type=FAIL
#SBATCH --mail-user=<your_email>
# sbatch arguments

./HelloWorld.sh
# Job commands -- this is the same HelloWorld.sh script used in the first example

This example is similar to the srun command line example above, but uses several more arguments to specify additional resources and constraints:

New Arguments

 -J          : Job name
 --mem       : Required memory per node. When this is set to zero, all available memory is requested
 -o          : File to redirect standard output to
 -e          : File to redirect standard error to
 --mail-type : Events to send email notifications for
 --mail-user : Email address to send job notifications

Submit the job script

Another advantage of the job script over the command-line srun example at the top is that it is much easier to submit.

To submit your job script to the compute nodes issue:

sbatch HelloWorld.sbatch

Example Output

Hello World!

Parallel Program Execution

The previous example demonstrated how to run a job on one core and one node. This example extends the previous one to run a program across multiple cores and multiple nodes. A few of the environment variables Slurm provides are also demonstrated.

HelloWorld.sh

#!/bin/bash

echo "Hello World! Node:${SLURMD_NODENAME} Core:${SLURM_PROCID}"

Two Slurm-controlled environment variables have been added to the HelloWorld.sh echo command. These variables will display the name of the node each job task is running on as well as the task number.

Multithread-HelloWorld.sbatch

#!/bin/bash

#SBATCH -A <account_name>
#SBATCH -p <partition_name>
#SBATCH -N 2
#SBATCH -n 4
#SBATCH -c 1
#SBATCH --ntasks-per-node=2
#SBATCH -J multithread-test-job
#SBATCH --mem=1g
#SBATCH -t 10:00
#SBATCH -o ./%j-multithread-output.txt
#SBATCH -e ./%j-multithread-error.txt
#SBATCH --mail-type=FAIL
#SBATCH --mail-user=<your_email>

srun ./HelloWorld.sh

There are five major differences between this job script and the one in the previous example:

  • -N 2
  • -n 4
  • --ntasks-per-node=2
  • -o ./%j-multithread-output.txt
  • srun ./HelloWorld.sh

The -N and -n parameters were changed to request two nodes and four tasks in total. Because the cores-per-task option (-c) is still set to one, this script requests four processors across two nodes.

The --ntasks-per-node option ensures that two of the four requested tasks run on each node. Without this option, the scheduler may place three tasks on one node and one on another, because the nodes have more cores than the job script requests.

The %j added to the output and error file names is a special string called a "filename pattern" that is recognized by Slurm. When the job is submitted, the %j will be replaced with the job ID number.

The addition of the srun command before the script file tells Slurm to automatically run the script once for each requested job task. Using srun to launch the script also causes the scheduler to set task-specific environment variables, such as SLURM_PROCID, before running each task.

New Arguments

 --ntasks-per-node : Number of tasks to run per node

Environment Variables

 SLURMD_NODENAME : The name of the node that the current job task is running on
 SLURM_PROCID    : The ID number of the current job task running on the node

Filename Patterns

 %j : Job ID

Submit the job script

To submit your job script to the compute nodes issue:

sbatch Multithread-HelloWorld.sbatch

Example Output

Hello World! Node:or-slurm-c01 Core:2
Hello World! Node:or-slurm-c00 Core:0
Hello World! Node:or-slurm-c01 Core:3
Hello World! Node:or-slurm-c00 Core:1

Job Arrays

Job arrays provide a simple way of running multiple instances of a job with different data sets. This example modifies the previous example to run two instances of the same job. Additional job array specific environment variables are also demonstrated.

HelloWorld.sh

#!/bin/bash

echo "Hello World! Node:${SLURMD_NODENAME} Core:${SLURM_PROCID} Array:${SLURM_ARRAY_JOB_ID} Task:${SLURM_ARRAY_TASK_ID}"

Two additional environment variables were added to the end of the echo command; the first prints the primary job array ID, and the second prints the secondary ID for each task in the array.

Array-HelloWorld.sbatch

#!/bin/bash

#SBATCH -A <account_name>
#SBATCH -p <partition_name>
#SBATCH -N 2
#SBATCH -n 4
#SBATCH -c 1
#SBATCH --ntasks-per-node=2
#SBATCH -J array-test-job
#SBATCH --mem=1g
#SBATCH -t 10:00
#SBATCH -o ./%A-%a-output.txt
#SBATCH -e ./%A-%a-multithread-error.txt
#SBATCH --mail-type=FAIL
#SBATCH --mail-user=<your_email>
#SBATCH -a 0-1%2
#SBATCH --exclusive

srun ./HelloWorld.sh

There are three major differences between this job and the previous one: the addition of the -a parameter, the --exclusive parameter, and the %A and %a filename patterns. -a specifies how many jobs to run in the array, and can optionally include a limit on the number of jobs to run at once. This example specifies two jobs in the array with the 0-1 range parameter, indicating that one job should run with ID 0 and another with ID 1. The %2 at the end specifies that up to two jobs in the array may run at once.

The --exclusive option tells the scheduler to schedule each job in the array exclusively on its own nodes. Without this option, Slurm will pack the jobs in an array onto as few nodes as possible.

The %A filename pattern is replaced by the primary job ID when the job is submitted, and the %a pattern is replaced by the secondary job array task ID. This will cause each job in the array to create its own output and error files.

New Arguments

 -a          : Create a job array with the specified range of job IDs
 --exclusive : Run each job in a job array exclusively on its own nodes

New Environment Variables

 SLURM_ARRAY_JOB_ID  : The primary ID of the job array
 SLURM_ARRAY_TASK_ID : The secondary ID of the running task in the job array

New Filename Patterns

 %A : Job array primary job ID
 %a : Job array task ID

Submit the job script

To submit your job script to the compute nodes issue:

sbatch Array-HelloWorld.sbatch

Example Output

Hello World! Node:or-slurm-c00 Core:0 Array:86 Task:0
Hello World! Node:or-slurm-c00 Core:1 Array:86 Task:0
Hello World! Node:or-slurm-c01 Core:2 Array:86 Task:0
Hello World! Node:or-slurm-c01 Core:3 Array:86 Task:0
Hello World! Node:or-slurm-c02 Core:1 Array:86 Task:1
Hello World! Node:or-slurm-c02 Core:0 Array:86 Task:1
Hello World! Node:or-slurm-c03 Core:2 Array:86 Task:1
Hello World! Node:or-slurm-c03 Core:3 Array:86 Task:1

MPI

This example demonstrates how to run MPI programs with the Slurm scheduler. The module commands in mpi-ring.sbatch may be unnecessary depending on how MPI is installed on the system you are running on. If you are running on the CADES Condos, the job script below can be copied verbatim.

mpi-ring.c

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char** argv) {
  int world_rank, world_size, token;
  MPI_Init(NULL, NULL);
  MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
  MPI_Comm_size(MPI_COMM_WORLD, &world_size);
  if (world_rank != 0) {
    MPI_Recv(&token, 1, MPI_INT, world_rank - 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    printf("Process %d received token %d from process %d\n", world_rank, token, world_rank - 1);
  } else {
    token = -1;
  }
  MPI_Send(&token, 1, MPI_INT, (world_rank + 1) % world_size, 0, MPI_COMM_WORLD);
  if (world_rank == 0) {
    MPI_Recv(&token, 1, MPI_INT, world_size - 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    printf("Process %d received token %d from process %d\n", world_rank, token, world_size - 1);
  }
  MPI_Finalize();
}

The MPI program used in this example was copied from Wes Kendall's MPI Tutorial website. A detailed explanation of the example program is available here:
"MPI Send and Receive": https://mpitutorial.com/tutorials/mpi-send-and-receive/
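
Before submitting the job script below, the program must be compiled. On the CADES Condos this would presumably use the same programming environment module the job script loads (the output name mpi_ring matches the binary the script runs):

```shell
module purge
module load PE-gnu
mpicc mpi-ring.c -o mpi_ring
```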

mpi-ring.sbatch

#!/bin/bash

#SBATCH -A <account_name>
#SBATCH -p <partition_name>
#SBATCH -N 2
#SBATCH -n 4
#SBATCH -c 1
#SBATCH --ntasks-per-node=2
#SBATCH -J mpi-test-job
#SBATCH --mem=1G
#SBATCH -t 10:00
#SBATCH -o ./%j-mpi-output.txt
#SBATCH -e ./%j-mpi-error.txt
#SBATCH --mail-type=FAIL
#SBATCH --mail-user=<your_email>

module purge
module load PE-gnu
mpirun --mca btl tcp,self ./mpi_ring

The only new additions to the sbatch script in this example are the job commands at the bottom. The module commands load MPI, then the mpi_ring program is launched with mpirun. The Slurm scheduler automatically passes the node names and process counts to the mpirun command so that we do not have to specify them manually.

If the version of MPI you are using has been compiled with the Slurm pmix headers using the --with-pmi option, it is possible to use the srun command to run MPI programs instead of mpirun.
See the OpenMPI Slurm FAQ here for more information.

The --mca btl tcp,self option is specific to the Slurm testbed nodes.

Submit the job script

To submit your job script to the compute nodes issue:

sbatch mpi-ring.sbatch

Example Output

Process 1 received token -1 from process 0
Process 2 received token -1 from process 1
Process 3 received token -1 from process 2
Process 0 received token -1 from process 3

GPUs

This example shows how to specify GPUs in an sbatch script.

HelloWorldGPU.sh

#!/bin/bash

#SBATCH -A <account_name>
#SBATCH -p <partition_name>
#SBATCH -N 1
#SBATCH -n 1
#SBATCH -c 1
#SBATCH --gres=gpu:2
#SBATCH -J gpu-test-job
#SBATCH --mem=1G
#SBATCH -t 10:00
#SBATCH -o ./%j-gpu-output.txt
#SBATCH -e ./%j-gpu-error.txt
#SBATCH --mail-type=FAIL
#SBATCH --mail-user=<your_email>

nvidia-smi

The --gres (Generic RESource) option above is used to specify the number of GPU resources your job needs. You must specify at least --gres=gpu:1 to gain access to a GPU. The first specifier indicates the type of resource needed (gpu) and the last indicates the count. An optional model specifier can be inserted between the two to request a particular model of GPU. To specify K80 GPUs, the option above would change to --gres=gpu:k80:2.
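
A few illustrative forms of the option (the k80 model string matches the hardware shown in the output below; valid model names depend on the cluster configuration):

```shell
#SBATCH --gres=gpu:1        # any single GPU
#SBATCH --gres=gpu:2        # any two GPUs
#SBATCH --gres=gpu:k80:2    # two K80 GPUs specifically
```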

New Arguments

 --gres : Reserve nodes with the specified generic resources

Submit the job script

To submit your job script to the compute nodes issue:

sbatch HelloWorldGPU.sh

Example Output

Mon Jul 29 13:32:45 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.67       Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           Off  | 00000000:05:00.0 Off |                    0 |
| N/A   35C    P0    61W / 149W |      0MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K80           Off  | 00000000:06:00.0 Off |                    0 |
| N/A   30C    P0    73W / 149W |      0MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla K80           Off  | 00000000:83:00.0 Off |                    0 |
| N/A   37C    P0    68W / 149W |      0MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla K80           Off  | 00000000:84:00.0 Off |                    0 |
| N/A   30C    P0    81W / 149W |      0MiB / 11441MiB |     63%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Containers

Singularity containers provide a simple way to run a containerized workflow in an HPC environment. This example demonstrates how to create a Singularity container from a Docker Hub image and run it using the Slurm scheduler.

As a first step, check that Singularity is installed by running singularity --help. If the singularity command is not found, you will first need to load a Singularity software module, or install Singularity yourself. If Singularity is present, start by creating a new container with the following command:

singularity build cuda-test.sif docker://nvidia/cuda:10.1-base-ubuntu16.04

This should create a file called cuda-test.sif in the current directory. Now that the test container has been created, we can run it using a Slurm job script. Two versions of the same job script are shown below, one for running on the CADES Condos, and one for running on the DGX systems.

CADESCondosingularity.sbatch

#!/bin/bash
#SBATCH -A <account_name>
#SBATCH -p <partition_name>
#SBATCH -N 1
#SBATCH -n 1
#SBATCH -c 1
#SBATCH -J singularity-test-job
#SBATCH --mem=1G
#SBATCH -t 10:00
#SBATCH --gres=gpu:k80:2

srun singularity exec --nv ./cuda-test.sif nvidia-smi

Submit the job script

To submit your job script to the compute nodes issue:

sbatch CADESCondosingularity.sbatch

DGXsingularity.sbatch

#!/bin/bash
#SBATCH -A <account_name>
#SBATCH -p <partition_name>
#SBATCH -N 1
#SBATCH -n 1
#SBATCH -c 1
#SBATCH -J singularity-test-job
#SBATCH --mem=1G
#SBATCH -t 10:00
#SBATCH --gres=gpu:2

srun singularity exec ./cuda-test.sif nvidia-smi

Submit the job script

To submit your job script to the compute nodes issue:

sbatch DGXsingularity.sbatch

These two job scripts use different singularity exec command arguments because different Singularity versions are installed on the DGX systems and the CADES Condos.

Example Output

Fri Jul 26 16:36:49 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.67       Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM3...  Off  | 00000000:34:00.0 Off |                    0 |
| N/A   27C    P0    52W / 350W |      0MiB / 32480MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM3...  Off  | 00000000:36:00.0 Off |                    0 |
| N/A   26C    P0    49W / 350W |      0MiB / 32480MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Monitoring the Queue

Below is a table of common Slurm commands and their Moab/Torque equivalents. Example runs for each command are also shown.

 Moab                | Slurm    | Usage
 qsub                | sbatch   | Job script submission. Examples shown above.
 qsub, pbsdsh        | srun     | Interactive job submission and running parallel processes. Examples shown above.
 qstat, showq        | squeue   | View the state of all jobs in the cluster.
 checknode, showbf   | sinfo    | View information on queues and nodes in the cluster.
 checkjob, mschedctl | scontrol | View detailed information on cluster jobs, accounts, queues, etc.
 canceljob           | scancel  | Cancel a running or pending job.

Slurm provides multiple tools to view queue, system, and job status. Below are the most common and useful of these tools.

squeue

To see all jobs currently in the queue:

$ squeue

To see the full output of all your queued jobs:

$ squeue -l -u <uid>

Example:

$ squeue -l -u <uid>
Wed Aug 21 16:44:37 2019
             JOBID PARTITION     NAME      USER    STATE       TIME TIME_LIMI  NODES NODELIST(REASON)
                11     batch mpi_hello    <uid> COMPLETI       0:01      1:00      2 or-slurm-c[00-01]

sinfo

To see how many nodes are available in each queue:

$ sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
testing*     up    4:00:00      6   idle or-slurm-c[00-03],or-condo-g[05-06]

scontrol

To check your job's status:

$ scontrol show job <job_id>

scancel

To cancel your job:

$ scancel <job_id>

Appendix

Below are references for all of the Slurm environment variables, job script parameters, and filename patterns used in the examples above. For full reference pages, see the official documentation linked in each sub-section.

Environment Variables

A list of frequently used environment variables that Slurm provides is given below. To view a complete list of all Slurm variables, check the official documentation here.

  • SLURM_JOB_NAME
    • Slurm job name
  • SLURM_JOB_ID
    • Slurm job ID
  • SLURMD_NODENAME
    • Hostname of the node the current task is running on
  • SLURM_TASKS_PER_NODE
    • Number of tasks allocated to each node for the job
  • SLURM_PROCID
    • The task number for the current task starting from zero
  • SLURM_JOB_NUM_NODES
    • The number of nodes allocated for the current job
  • SLURM_CPUS_ON_NODE
    • The number of CPU cores on the current node
  • SLURM_ARRAY_JOB_ID
    • The primary ID for the current job array
  • SLURM_ARRAY_TASK_ID
    • The secondary ID number for the current task in the job array
  • SLURM_ARRAY_TASK_COUNT
    • The total number of tasks in the current job array

Job Parameters

A list of frequently used sbatch and srun arguments is given below. To view a complete list of all arguments, check the official documentation here.

  • -A, --account=<account>
    • Account to run the job under
  • -p, --partition=<partition>
    • Partition to run the job in
  • -N, --nodes=<min[-max]>
    • Nodes requested for the job
  • -n, --ntasks=<tasks>
    • Total number of tasks the job will run
  • -c, --cpus-per-task=<cpus>
    • CPU cores that each task requires
  • -t, --time=<time>
    • The requested walltime for the job
  • -J, --job-name=<name>
    • Job name
  • --mem=<size[units]>
    • Required memory per node. When this is set to zero, all available memory is requested
  • --mem-per-cpu=<size[units]>
    • Required memory per CPU
  • --mem-per-gpu=<size[units]>
    • Required memory per GPU
  • -o, --output=<file>
    • File to redirect standard output to
  • -e, --error=<file>
    • File to redirect standard error to
  • --mail-type=<type>
    • Events to send email notifications for
  • --mail-user=<user>
    • Email address to send job notifications
  • --ntasks-per-node=<tasks>
    • Number of tasks to run per node
  • -a, --array=<start-end[:step][%max]>
    • Create a job array with each job in the array receiving a task ID in the specified range
  • --exclusive
    • Run each job in a job array exclusively on its own nodes
  • --gres=<type>:[<model>]<number>
    • Reserve nodes with the specified generic resources

Filename Patterns

A list of frequently used filename pattern strings is given below. To view a complete list of all available patterns, check the official documentation here.

  • %j
    • Job ID
  • %x
    • Job name
  • %A
    • Job array primary job ID
  • %a
    • Job array task ID
