Job management with SLURM

You should not run your compute code directly on the terminal you find when you log in. The login server ruche01 is not suited for computations.

In order to submit a job on the cluster, you need to describe the resources you need (cores, memory, time) to the task manager Slurm. The task manager will launch the job on a remote compute node as soon as the requested resources are available. The job will be executed in a virtual resource chunk called a CGROUP. See section on CGROUPS below for more information.

There are two ways to run a compute code on ruche :

  • using an interactive Slurm job : this will open a terminal on a compute node where you can execute your code. This method is well-suited for light tests and environment configuration (especially for GPU accelerated codes). See the section Interactive jobs.
  • using a Slurm script : this will submit your script to the scheduler, which will run it when the resources are available. This method is well-suited for "production" runs.

Slurm is configured with a "fairshare" policy among the users: the more resources you have requested in the past days, the lower the priority of your jobs will be when the task manager has several jobs to handle at the same time.

Slurm script

Most of the time, you will run your code through a Slurm script. This script has the following functions :

  • specify the resources you need for your code : partition, walltime, number of nodes, memory (mem), number of tasks (ntasks), local SSD disk space (tmp), etc.
  • specify other parameters for your job (project which your job belongs to, output files, mail information on your job status, job name, etc.)
  • if you use GPUs, the number of gpus requested (--gres=gpu:)
  • setup the batch environment (load modules, set environment variables)
  • run the code

The batch environment is set by loading the proper modules (see section Module command) and setting the proper bash variables (PATH, OMP_NUM_THREADS, etc.). We recommend unloading all your modules with module purge beforehand and loading exactly the same modules as in your tests and/or your code compilation.

How you run the code depends on your executable. Parallel codes may need to be launched with srun or require specific environment variables to be set.
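
As an illustration, the functions listed above can be combined into a minimal script like the sketch below. The job name, the resource values and the commented module and executable names are placeholders chosen for the example, not Ruche recommendations. The test around module is only there so that the sketch also runs on a machine without the module command.

```shell
#!/bin/bash
#SBATCH --job-name=my_job
#SBATCH --partition=cpu_short
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --time=00:10:00
#SBATCH --output=my_job_%j.out

# Start from a clean module environment (skipped if `module` is unavailable).
if command -v module >/dev/null 2>&1; then
    module purge
    # module load <modules used at compile time>   # placeholder, site-specific
fi

# Match the OpenMP thread count to the cores Slurm allocated (1 if unset).
export OMP_NUM_THREADS="${SLURM_CPUS_PER_TASK:-1}"

echo "Running with ${OMP_NUM_THREADS} OpenMP thread(s)"
# ./my_code   # hypothetical executable
```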

SLURM partitions

  • By default, the partition is set to cpu_short.
  • You can change this setting by choosing a partition that matches the resources you need from the table of queue names.

Slurm directives

You describe the resources you need in the submission script, using sbatch instructions (script lines beginning with #SBATCH). These options can be used directly with the sbatch command, or listed in a script. Using a script is the best solution if you want to submit the job several times, or several similar jobs.

How to describe your requested resources with SBATCH

nodes

Number of nodes :

#SBATCH --nodes=<nnodes>

ntasks

Number of tasks (MPI processes) :

#SBATCH --ntasks=<ntasks>

ntasks-per-node

Number of tasks (MPI processes) per node:

#SBATCH --ntasks-per-node=<ntpn>

cpus-per-task

Number of threads per process (Ex: OpenMP threads per MPI process):

#SBATCH --cpus-per-task=<ntpt>
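
As a worked example of how the directives above combine, the sketch below computes the number of MPI processes and the total number of cores for a hybrid MPI/OpenMP request. The values 2, 4 and 10 are hypothetical and only serve the arithmetic.

```shell
# Hypothetical hybrid MPI/OpenMP request:
#   #SBATCH --nodes=2
#   #SBATCH --ntasks-per-node=4
#   #SBATCH --cpus-per-task=10
nodes=2
ntasks_per_node=4
cpus_per_task=10

# Total MPI processes, then total cores (processes x threads per process).
ntasks=$(( nodes * ntasks_per_node ))
total_cores=$(( ntasks * cpus_per_task ))

echo "${ntasks} MPI processes x ${cpus_per_task} threads = ${total_cores} cores"
# prints: 8 MPI processes x 10 threads = 80 cores
```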

gres=gpu

Number of gpus :

#SBATCH --gres=gpu:<ngpus>

Note : In the job, the selected gpus will have IDs in a cgroup context. See section on CGROUPS below for more information.

exclusive

Allocated nodes are reserved exclusively in order to avoid sharing nodes with other running jobs. Do not use this directive unless the support team tells you to do so for a specific case.

#SBATCH --exclusive

mem

Memory per node :

#SBATCH --mem=<size[units]>
  • Default units are megabytes. Different units can be specified using the suffix [K|M|G|T]
  • Note : the default memory on Ruche is 4 GB per core (--mem-per-cpu=4G). You do not need to specify this directive if the default value is well-suited for your job.
  • Check your actual needs in terms of memory with seff (see below) and adapt your memory request accordingly for your next jobs. Reserving too much memory, in comparison with your actual needs, makes the job stay in queue longer and prevents other users from using the resource.
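
To judge whether the default is sufficient, you can estimate the memory a job receives without any --mem directive. The sketch below applies the 4 GB per core default stated above to a hypothetical 8-core job.

```shell
cores=8              # hypothetical job size, e.g. --ntasks=8
mem_per_core_gb=4    # default on Ruche (--mem-per-cpu=4G)

# Memory the job gets when no --mem directive is given.
default_mem_gb=$(( cores * mem_per_core_gb ))

echo "Default memory for ${cores} cores: ${default_mem_gb} GB"
# prints: Default memory for 8 cores: 32 GB
```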

time

Specify the walltime for your job. If your job is still running after the walltime duration, it will be killed :

#SBATCH --time=<hh:mm:ss> 

tmp

Use this directive only if you need a local SSD disk for IOs. Most of the time, performing IOs on the workdir is adequate and this directive is not needed.

To use a local SSD disk on a mem or gpu node:

1/ Specify a mem or a gpu partition in your script

2/ Reserve the amount of SSD disk space with the Slurm directive tmp:

#SBATCH --tmp=<size[units]>
  • By default, tmp will be 100M.
  • Since there is no SSD disk on cpu nodes, Slurm will produce the following error if the tmp directive is used with a cpu partition: sbatch: error: Temporary disk specification can not be satisfied

3/ Use $TMPDIR to refer to the directory Slurm creates for your job on the SSD (this variable is set to scratch/login-jobid, where login is your login and jobid is the ID of your job). Slurm automatically deletes this directory at the end of the job, so you must transfer the outputs your code wrote there back to your submission directory ($SLURM_SUBMIT_DIR) in your workdir at the end of the script.
  • Do not use $TMPDIR in a script with a cpu partition. For these partitions, $TMPDIR is set to /tmp, which is too small for IOs.
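
The three steps above can be sketched as the following script. The requested sizes, the file name result.dat and the echo standing in for a real code are assumptions made for the example. The fallbacks to /tmp and to the current directory only let the sketch run outside Slurm.

```shell
#!/bin/bash
#SBATCH --partition=gpu
#SBATCH --gres=gpu:1
#SBATCH --tmp=10G
#SBATCH --time=00:30:00

# Directory the job was submitted from (current directory outside Slurm).
submit_dir="${SLURM_SUBMIT_DIR:-$PWD}"

# Private working directory created under $TMPDIR
# (falls back to /tmp when run outside Slurm).
scratch_dir=$(mktemp -d "${TMPDIR:-/tmp}/run.XXXXXX")

# Hypothetical stand-in for a code that writes its output to the SSD;
# replace with your executable, e.g. ./my_code > "$scratch_dir/result.dat"
echo "example output" > "$scratch_dir/result.dat"

# Copy the results back before the job ends and Slurm deletes $TMPDIR.
cp "$scratch_dir/result.dat" "$submit_dir/"
```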

partition

Specify the Slurm partition your job will be assigned to :

#SBATCH --partition=<PartitionName>

where PartitionName is one of the names in the partition names list

SBATCH additional directives

job-name

Define the job's name :

#SBATCH --job-name=jobName

output

Define the standard output (stdout) for your job :

#SBATCH --output=outputJob.txt

error

Define the error output (stderr) for your job :

#SBATCH --error=errorJob.txt

By default both standard output and standard error are directed to the same file.

mail-user

Set an email address :

#SBATCH --mail-user=firstname.lastname@mywebserver.com 

mail-type

To be notified by email when a step has been reached :

#SBATCH --mail-type=ALL

Arguments for the --mail-type option are :

  • BEGIN : send an email when the job starts
  • END : send an email when the job stops
  • FAIL : send an email if the job fails
  • ALL : equivalent to BEGIN, END, FAIL.

export

Export user environment variables

  • By default all user environment variables will be loaded (--export=ALL).
  • To avoid dependencies and inconsistencies between submission environment and batch execution environment, disabling this functionality is highly recommended. In order to not export environment variables present at job submission time to the job's environment:
#SBATCH --export=NONE
  • To select explicitly exported variables from the caller's environment to the job environment:
#SBATCH --export=VAR1,VAR2

propagate

  • By default all resource limits (those reported by the ulimit command, such as stack size, open files, number of processes, ...) are propagated (--propagate=ALL).
  • To avoid propagating the limits of your interactive session, which would override the batch resource limits, we encourage you to disable this functionality:
#SBATCH --propagate=NONE

account

  • By default the compute time consumption is charged to your default project account
  • To indicate another project account, you can specify it with --account
  • To see the association between a job and the project, you can use squeue, scontrol or sacct commands.
#SBATCH --account=<MY_PROJECT>

Submit and monitor jobs

submit job

You need to submit your script job0 with :

$ sbatch job0
Submitted batch job 29509

which responds with the jobid attributed to the job. For example here, jobid is 29509. The jobid is a unique identifier that is used by many Slurm commands.
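
When a script needs the jobid for later commands, it can be extracted from the message printed by sbatch. The sketch below parses the sample message shown above instead of calling sbatch, so that it runs anywhere. In a real session the variable would be filled with $(sbatch job0). The sbatch option --parsable, which prints the jobid without the surrounding text, is an alternative.

```shell
# In a real session: submit_output=$(sbatch job0)
submit_output="Submitted batch job 29509"

# Keep only the last word of the message, which is the jobid.
jobid="${submit_output##* }"

echo "jobid = ${jobid}"
# prints: jobid = 29509
```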

monitor job

The squeue command shows the list of jobs :

$ squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
             29509 cpu_short     job0 username  R       0:02      1 node001

cancel job

The scancel command cancels jobs.

To cancel job job0 with jobid 29509 (obtained through squeue), you would use :

$ scancel 29509

interactive jobs

  • Example 1: access one node interactively for 30 minutes
$ srun --nodes=1 --time=00:30:00 -p cpu_short --pty /bin/bash
[user@node001 ~]$ hostname
node001
  • Example 2: access a node with a GPU for 30 minutes
[user@ruche01 ~]$ srun --nodes=1 --time=00:30:00 -p gpu --gres=gpu:1 --pty /bin/bash
  • Use --x11 option if you need X forwarding.

job arrays

Job arrays are only supported for batch jobs, and the array index values are specified using the --array or -a option of the sbatch command. The option argument can be specific array index values, a range of index values, and an optional step size as shown in the examples below. Each job that is part of a job array has the environment variable SLURM_ARRAY_TASK_ID set to its array index value.

# Submit a job array with index values between 0 and 31
[user@ruche01 ~]$ sbatch --array=0-31 job

# Submit a job array with index values of 1, 3, 5 and 7
[user@ruche01 ~]$ sbatch --array=1,3,5,7 job

# Submit a job array with index values between 1 and 7
# with a step size of 2 (i.e. 1, 3, 5 and 7)
[user@ruche01 ~]$ sbatch --array=1-7:2 job

The subjobs should not depend on each other. SLURM can start these jobs in any order, at the same time or not.
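
Inside the script, SLURM_ARRAY_TASK_ID is typically used to select the work of each subjob. The sketch below assumes a hypothetical naming scheme with one input file per index (input_0.dat, input_1.dat, ...) and a hypothetical executable my_code. The default value 0 only lets the sketch run outside Slurm.

```shell
#!/bin/bash
#SBATCH --job-name=array_demo
#SBATCH --array=0-31
#SBATCH --time=00:10:00

# Index of this array task (defaults to 0 when run outside Slurm).
task_id="${SLURM_ARRAY_TASK_ID:-0}"

# Hypothetical naming scheme: one input file per array index.
input_file="input_${task_id}.dat"

echo "Array task ${task_id} would process ${input_file}"
# ./my_code "${input_file}"   # hypothetical executable
```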

chain jobs

If you want to submit a job which must be executed after another job, you can use the chain function in slurm.

[username@ruche01 ~]$ sbatch slurm_script1.sh
Submitted batch job 74698
[username@ruche01 ~]$ squeue 
JOBID PARTITION     NAME     USER      ST    TIME    NODES  NODELIST(REASON)
74698  *******      *******  username  PD    0:00    *      *******
[username@ruche01 ~]$ sbatch --dependency=afterok:74698 slurm_script2.sh
Submitted batch job 74699
[username@ruche01 ~]$ sbatch --dependency=afterok:74698:74699 slurm_script3.sh
Submitted batch job 74700

Note that if one of the jobs in the sequence fails, the following jobs remain pending by default with the reason “DependencyNeverSatisfied” and can never be executed. You must then delete them using the scancel command. If you want these jobs to be automatically canceled on failure, you must specify the --kill-on-invalid-dep=yes option when submitting them.

Here are the common chaining rules :

  • after:<jobid> = job can start once job <jobid> has started execution
  • afterany:<jobid> = job can start once job <jobid> has terminated
  • afterok:<jobid> = job can start once job <jobid> has terminated successfully
  • afternotok:<jobid> = job can start once job <jobid> has terminated upon failure
  • singleton = job can start once any previous job with identical name and user has terminated
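
As a sketch of how such a rule is assembled, the lines below build the dependency argument for a third job that must wait for two earlier ones. The jobids are the sample values from the session above. The command is only printed, not executed, so that the sketch runs on a machine without Slurm.

```shell
# Jobids of the two jobs the third one depends on (sample values).
first_id=74698
second_id=74699

# Several jobids are joined with ":" after the rule name.
dependency="afterok:${first_id}:${second_id}"

# Print the submission command; remove "echo" to actually submit.
echo sbatch --dependency="${dependency}" --kill-on-invalid-dep=yes slurm_script3.sh
# prints: sbatch --dependency=afterok:74698:74699 --kill-on-invalid-dep=yes slurm_script3.sh
```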

Accounting

  • Use the command sacct to get info on your jobs. More detail on this page
[user@ruche01 ~]$ sacct -e
Account             AdminComment        AllocCPUS           AllocGRES          
AllocNodes          AllocTRES           AssocID             AveCPU             
AveCPUFreq          AveDiskRead         AveDiskWrite        AvePages           
AveRSS              AveVMSize           BlockID             Cluster            
Comment             ConsumedEnergy      ConsumedEnergyRaw   CPUTime            
CPUTimeRAW          DerivedExitCode     Elapsed             ElapsedRaw         
Eligible            End                 ExitCode            GID                
Group               JobID               JobIDRaw            JobName     
...

[user@ruche01 ~]$ sacct -j 1240028 --format=jobid,jobname,maxrss,elapsed,exitcode --unit=G
       JobID    JobName     MaxRSS    Elapsed ExitCode 
------------ ---------- ---------- ---------- -------- 
1240028        hydrossd              00:01:28      0:0 
1240028.bat+      batch      4.48G   00:01:28      0:0 
1240028.ext+     extern      0.00G   00:01:28      0:0 

The MaxRSS field (RAM used by the job) is very useful for diagnosing jobs that got cancelled due to out-of-memory problems.

  • Use the command seff to get info on a finished job.
[user@ruche01 ~]$ seff 1240028
Job ID: 1240028
Cluster: ruche
User/Group: user/lab
State: COMPLETED (exit code 0)
Cores: 1
CPU Utilized: 00:01:27
CPU Efficiency: 98.86% of 00:01:28 core-walltime
Job Wall-clock time: 00:01:28
Memory Utilized: 4.48 GB
Memory Efficiency: 89.62% of 5.00 GB

Note : on ruche, the accounting information is restricted to your jobs only.
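
The memory efficiency reported by seff is the ratio of the memory used to the memory requested. The sketch below recomputes it from the rounded values displayed in the example above (4.48 GB used, 5.00 GB requested). The small difference from the 89.62% printed by seff presumably comes from that rounding.

```shell
mem_used_gb=4.48        # "Memory Utilized" from the seff example
mem_requested_gb=5.00   # requested memory from the seff example

# Ratio in percent, computed with awk because the shell has no float arithmetic.
efficiency=$(LC_ALL=C awk -v used="$mem_used_gb" -v req="$mem_requested_gb" \
    'BEGIN { printf "%.1f", 100 * used / req }')

echo "Memory efficiency: ${efficiency}%"
# prints: Memory efficiency: 89.6%
```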

CGROUPS

Your slurm job will be executed in a virtual resource chunk called a CGROUP, formed with the allocated amount of RAM, cores and GPUS. In some cases, you will be allowed to see only the selected resources.

Example :

[user@ruche01 ~]$ srun --nodes=1 --time=00:30:00 -p gpu --gres=gpu:2 --export=none --pty /bin/bash
[user@ruche-gpu03 user]$ nvidia-smi # will only see 2 selected gpus

Important note if you are selecting GPUs : The GPU IDs displayed with nvidia-smi start at 0 in the cgroup. The actual GPU IDs are stored in the environment variable SLURM_JOB_GPUS or SLURM_STEP_GPUS. If you manually specify the GPU IDs to your framework using the wrong context, you can end up using the same GPU as another job. Be careful !

  • CUDA_VISIBLE_DEVICES and GPU_DEVICE_ORDINAL values contain the IDs in the cgroup context
  • SLURM_JOB_GPUS or SLURM_STEP_GPUS value contains the IDs in the global context

Example :

[user@ruche01 ~]$ srun --nodes=1 --time=00:30:00 -p gpu --gres=gpu:2 --export=none --pty /bin/bash
[user@ruche-gpu03 user]$ echo $CUDA_VISIBLE_DEVICES # cgroup context
0,1
[user@ruche-gpu03 user]$ echo $GPU_DEVICE_ORDINAL # cgroup context
0,1
[user@ruche-gpu03 user]$ echo $SLURM_STEP_GPUS # global context
1,2
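
To make the two contexts explicit, for instance in a job log, the two lists can be printed side by side. The helper gpu_id_mapping below is a hypothetical function written for this example. It assumes that both lists are comma-separated and that they enumerate the GPUs in the same order.

```shell
# Print "cgroup->global" pairs from two comma-separated ID lists.
gpu_id_mapping() {
    local_ids=$1
    global_ids=$2
    mapping=""
    position=1
    for local_id in $(echo "$local_ids" | tr ',' ' '); do
        # Pick the global ID found at the same position in the second list.
        global_id=$(echo "$global_ids" | cut -d',' -f"$position")
        mapping="${mapping}${mapping:+ }${local_id}->${global_id}"
        position=$(( position + 1 ))
    done
    echo "$mapping"
}

# Values from the example above; inside a job you would pass
# "$CUDA_VISIBLE_DEVICES" and "$SLURM_STEP_GPUS" instead.
gpu_id_mapping "0,1" "1,2"
# prints: 0->1 1->2
```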