Job management with SLURM
Warning
You should not run your compute code directly in the terminal you get when you log in. The login servers ruche01 and ruche02 are not suited for computations.
In order to submit a job on the cluster, you need to describe the resources (cores, memory, time) you need to the task manager Slurm. The task manager will launch the job on a remote compute node as soon as the resources you requested are available. The job will be executed in a virtual resource chunk called a CGROUP. See the section on CGROUPS below for more information.
There are two ways to run a compute code on ruche :
- using an interactive Slurm job : this will open a terminal on a compute node where you can execute your code. This method is well-suited for light tests and environment configuration (especially for GPU-accelerated codes). See the section Interactive jobs.
- using a Slurm script : this will submit your script to the scheduler, which will run it when the resources are available. This method is well-suited for "production" runs.
Slurm is configured with a "fairshare" policy among the users, which means that the more resources you have requested in recent days, the lower your priority will be when the task manager has several jobs to handle at the same time.
Slurm script
Most of the time, you will run your code through a Slurm script. This script has the following functions :
- specify the resources you need for your code : partition, walltime, number of nodes, memory (mem), number of tasks (ntasks), local SSD disk space (tmp), etc.
- specify other parameters for your job (project which your job belongs to, output files, mail information on your job status, job name, etc.)
- if you use GPUs, the number of gpus requested (--gres=gpu:)
- setup the batch environment (load modules, set environment variables)
- run the code
The batch environment is set by loading the proper modules (see section Module command) and setting the proper bash variables (PATH, OMP_NUM_THREADS, etc.).
We recommend unloading all your modules with module purge beforehand and loading the exact same modules as in your tests and/or your code compilation.
Running the code will depend on your executable. Parallel codes may have to be launched with srun or require specific environment variables to be set.
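As an illustration, here is a minimal sketch of such a script for an MPI code; the partition, resource values, module names and executable name (my_mpi_code) are placeholders to adapt to your own case:
#!/bin/bash
#SBATCH --job-name=my_run
#SBATCH --output=output.txt
#SBATCH --ntasks=40
#SBATCH --time=01:00:00
#SBATCH --partition=cpu_short

# Set up the batch environment: start clean, then load the same
# modules used to compile the code (placeholders below).
module purge
module load <compiler_module> <mpi_module>

# Run the parallel code on the allocated resources.
srun ./my_mpi_code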
SLURM partitions
Note
Default partition is set to cpu_short. A job submitted without any partition specified will run on the default partition.
You can change this setting by choosing the partition that matches the resources you need from the table of partition names.
Slurm directives
You describe the resources you need in the submission script, using sbatch instructions (script lines beginning with #SBATCH). These options can be used directly with the sbatch command, or listed in a script. Using a script is the best solution if you want to submit the job several times, or several similar jobs.
How to describe your requested resources with SBATCH
nodes
Number of nodes :
#SBATCH --nodes=<nnodes>
ntasks
Number of tasks (MPI processes) :
#SBATCH --ntasks=<ntasks>
ntasks-per-node
Number of tasks (MPI processes) per node:
#SBATCH --ntasks-per-node=<ntpn>
cpu-per-task
Number of threads per process (Ex: OpenMP threads per MPI process):
#SBATCH --cpus-per-task=<ntpt>
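For example, a sketch of a hybrid MPI/OpenMP request, with 2 nodes, 4 MPI tasks per node and 10 OpenMP threads per task (illustrative values to adapt to the core count of the nodes):
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=4
#SBATCH --cpus-per-task=10

# Make the OpenMP runtime match the Slurm request.
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK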
gres=gpu
Number of gpus :
#SBATCH --gres=gpu:<ngpus>
Note
In the job, the selected gpus will have IDs in a cgroup context. See section on CGROUPS below for more information.
exclusive
Allocated nodes are reserved exclusively in order to avoid sharing nodes with other running jobs. Do not use this directive unless the support team tells you to do so for a specific case.
#SBATCH --exclusive
mem
Memory per node :
#SBATCH --mem=<size[units]>
- Default units are megabytes. Different units can be specified using the suffix [K|M|G|T]
- Note that the default memory on Ruche is 4 GB per core (--mem-per-cpu=4G). You do not need to specify this directive if the default value is well-suited for your job.
- Check your actual memory needs with seff (see below) and adapt your memory request accordingly for your next jobs. Reserving too much memory, compared with your actual needs, makes the job stay in the queue longer and prevents other users from using the resource.
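For instance, a single-node job needing about 64 GB in total could request it either per node or per core (illustrative values, not recommendations):
#SBATCH --mem=64G
# or, with 8 tasks on the node, the equivalent per-core request:
#SBATCH --mem-per-cpu=8G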
time
Specify the walltime for your job. If your job is still running after the walltime duration, it will be killed :
#SBATCH --time=<hh:mm:ss>
tmp
Use this directive only if you need a local SSD disk for IOs. Most of the time, performing IOs on the workdir is relevant and this directive is not needed.
To use a local SSD disk on a mem or gpu node:
1/ Specify a mem or a gpu partition in your script
2/ Reserve the amount of SSD disk space with the Slurm directive tmp :
#SBATCH --tmp=<size[units]>
- By default, tmp will be 1G.
- Since there is no SSD disk on cpu nodes, Slurm will produce the following error if the tmp directive is used with a cpu partition: sbatch: error: Temporary disk specification can not be satisfied
3/ Use $TMPDIR to refer to the directory Slurm will create for your job on the SSD (this variable is set to scratch/login-jobid, where login is your login and jobid the jobid of your job). Note that Slurm automatically deletes this directory at the end of the job, therefore you must transfer the outputs your code wrote in this directory back to your submission directory ($SLURM_SUBMIT_DIR) in your workdir at the end of the script.
* Do not use $TMPDIR in a script with a cpu partition. For these partitions, $TMPDIR is set to /tmp, which is too small for IOs.
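A minimal sketch of the corresponding workflow (the partition, sizes, executable my_code and file names are placeholders):
#!/bin/bash
#SBATCH --partition=gpu
#SBATCH --gres=gpu:1
#SBATCH --time=01:00:00
#SBATCH --tmp=100G

# Copy the inputs from the submission directory to the local SSD.
cp $SLURM_SUBMIT_DIR/input.dat $TMPDIR/

# Run the code with its IOs on the fast local disk.
cd $TMPDIR
$SLURM_SUBMIT_DIR/my_code input.dat

# Copy the results back before Slurm deletes $TMPDIR at the end of the job.
cp $TMPDIR/output.dat $SLURM_SUBMIT_DIR/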
partition
Specify the Slurm partition your job will be assigned to :
#SBATCH --partition=<PartitionName>
With PartitionName taken from the partition names list.
SBATCH additional directives
job-name
Define the job's name :
#SBATCH --job-name=jobName
output
Define the standard output (stdout) for your job :
#SBATCH --output=outputJob.txt
error
Define the error output (stderr) for your job :
#SBATCH --error=errorJob.txt
By default both standard output and standard error are directed to the same file.
mail-type
To be notified by mail when a step has been reached :
#SBATCH --mail-type=ALL
Arguments for the --mail-type option are :
- BEGIN : send an email when the job starts
- END : send an email when the job stops
- FAIL : send an email if the job fails
- ALL : equivalent to BEGIN, END, FAIL.
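On a standard Slurm installation, the destination address is set with the --mail-user directive; it is not documented here, so check with the support team if ruche behaves differently. For instance:
#SBATCH --mail-type=END,FAIL
#SBATCH --mail-user=firstname.lastname@example.com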
export
Export user environment variables
- By default all user environment variables will be loaded (--export=ALL).
- To avoid dependencies and inconsistencies between submission environment and batch execution environment, disabling this functionality is highly recommended. In order to not export environment variables present at job submission time to the job's environment:
#SBATCH --export=NONE
- To select explicitly exported variables from the caller's environment to the job environment:
#SBATCH --export=VAR1,VAR2
propagate
- By default all resource limits (obtained with the ulimit command: stack, open files, number of processes, ...) are propagated (--propagate=ALL).
- To avoid propagating interactive limits and overriding the batch resource limits, it is encouraged to disable this functionality:
#SBATCH --propagate=NONE
account
- By default the compute time consumption is charged to your default project account
- To indicate another project account, you can specify it with --account
- To see the association between a job and the project, you can use the squeue, scontrol or sacct commands.
#SBATCH --account=<MY_PROJECT>
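Putting the additional directives together, a sketch of a header for a GPU job charged to a specific project (all values and names are placeholders):
#!/bin/bash
#SBATCH --job-name=gpu_training
# %j in the file names below is replaced by the jobid.
#SBATCH --output=gpu_training.%j.out
#SBATCH --error=gpu_training.%j.err
#SBATCH --partition=gpu
#SBATCH --gres=gpu:1
#SBATCH --time=04:00:00
#SBATCH --mail-type=END,FAIL
#SBATCH --account=my_project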
Submit and monitor jobs
submit job
You need to submit your script job0 with :
$ sbatch job0
Submitted batch job 29509
sbatch responds with the jobid attributed to the job; here, the jobid is 29509. The jobid is a unique identifier that is used by many Slurm commands.
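If you submit jobs from a shell script and need to reuse the jobid, the standard sbatch option --parsable prints only the jobid, for instance:
$ JOBID=$(sbatch --parsable job0)
$ echo $JOBID
29509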
monitor job
The squeue command shows the list of jobs :
$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
29509 cpu_short job0 username R 0:02 1 node001
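To restrict the list to your own jobs, filter by user with the standard -u option:
$ squeue -u $USER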
cancel job
The scancel command cancels jobs.
To cancel job job0 with jobid 29509 (obtained through squeue), you would use :
$ scancel 29509
interactive jobs
- Example 1: access one node interactively for 30 minutes
$ srun --nodes=1 --time=00:30:00 -p cpu_short --pty /bin/bash
[user@node001 ~]$ hostname
node001
- Example 2: access a node with a GPU for 30 minutes
[user@ruche01 ~]$ srun --nodes=1 --time=00:30:00 -p gpu --gres=gpu:1 --pty /bin/bash
- Use the --x11 option if you need X forwarding.
job arrays
Job arrays are only supported for batch jobs and the array index values are specified using the --array or -a option of the sbatch command. The option argument can be specific array index values, a range of index values, and an optional step size as shown in the examples below. Jobs which are part of a job array will have the environment variable SLURM_ARRAY_TASK_ID set to its array index value.
# Submit a job array with index values between 0 and 31
[user@ruche01 ~]$ sbatch --array=0-31 job
# Submit a job array with index values of 1, 3, 5 and 7
[user@ruche01 ~]$ sbatch --array=1,3,5,7 job
# Submit a job array with index values between 1 and 7
# with a step size of 2 (i.e. 1, 3, 5 and 7)
[user@ruche01 ~]$ sbatch --array=1-7:2 job
The subjobs should not depend on each other: Slurm may start them in any order, and possibly at the same time.
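A minimal sketch of an array job script, assuming input files named input_0.dat, input_1.dat, ... and a placeholder executable my_code:
#!/bin/bash
#SBATCH --job-name=array_example
#SBATCH --ntasks=1
#SBATCH --time=00:10:00
#SBATCH --partition=cpu_short
# %A is replaced by the array jobid and %a by the array index.
#SBATCH --output=array_%A_%a.out

# Each subjob processes the input file matching its own index.
./my_code input_${SLURM_ARRAY_TASK_ID}.dat
It would be submitted with, for example, sbatch --array=0-31 array_job.sh.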
chain jobs
If you want to submit a job which must be executed after another job, you can use the job chaining (dependency) mechanism in Slurm.
[username@ruche01 ~]$ sbatch slurm_script1.sh
Submitted batch job 74698
[username@ruche01 ~]$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
74698 ******* ******* username PD 0:00 * *******
[username@ruche01 ~]$ sbatch --dependency=afterok:74698 slurm_script2.sh
Submitted batch job 74699
[username@ruche01 ~]$ sbatch --dependency=afterok:74698:74699 slurm_script3.sh
Submitted batch job 74700
Note that if one of the jobs in the sequence fails, the following jobs remain pending by default with the reason "DependencyNeverSatisfied" and can never be executed. You must then delete them using the scancel command. If you want these jobs to be automatically cancelled on failure, you must specify the --kill-on-invalid-dep=yes option when submitting them.
Here are the common chaining rules :
- after:<jobid> = job can start once job <jobid> has started execution
- afterany:<jobid> = job can start once job <jobid> has terminated
- afterok:<jobid> = job can start once job <jobid> has terminated successfully
- afternotok:<jobid> = job can start once job <jobid> has terminated upon failure
- singleton = job can start once any previous job with identical name and user has terminated
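As an illustration, the chain above can be scripted by capturing each jobid with the standard --parsable option of sbatch (a sketch, reusing the script names from the example above):
#!/bin/bash
# Submit the first job and capture its jobid.
JOB1=$(sbatch --parsable slurm_script1.sh)
# The second job starts only if the first one finished successfully.
JOB2=$(sbatch --parsable --dependency=afterok:${JOB1} slurm_script2.sh)
# The third job waits for both previous jobs and is cancelled
# automatically if one of them fails.
sbatch --kill-on-invalid-dep=yes --dependency=afterok:${JOB1}:${JOB2} slurm_script3.sh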
Accounting
- Use the command sacct to get info on your jobs. More details on this page
[user@ruche01 ~]$ sacct -e
Account AdminComment AllocCPUS AllocGRES
AllocNodes AllocTRES AssocID AveCPU
AveCPUFreq AveDiskRead AveDiskWrite AvePages
AveRSS AveVMSize BlockID Cluster
Comment ConsumedEnergy ConsumedEnergyRaw CPUTime
CPUTimeRAW DerivedExitCode Elapsed ElapsedRaw
Eligible End ExitCode GID
Group JobID JobIDRaw JobName
...
[user@ruche01 ~]$ sacct -j 1240028 --format=jobid,jobname,maxrss,elapsed,exitcode --unit=G
JobID JobName MaxRSS Elapsed ExitCode
------------ ---------- ---------- ---------- --------
1240028 hydrossd 00:01:28 0:0
1240028.bat+ batch 4.48G 00:01:28 0:0
1240028.ext+ extern 0.00G 00:01:28 0:0
The MaxRSS field (RAM used by the job) is very useful for jobs that got cancelled due to out-of-memory problems.
- Use the command seff to get info on a finished job.
[user@ruche01 ~]$ seff 1240028
Job ID: 1240028
Cluster: ruche
User/Group: user/lab
State: COMPLETED (exit code 0)
Cores: 1
CPU Utilized: 00:01:27
CPU Efficiency: 98.86% of 00:01:28 core-walltime
Job Wall-clock time: 00:01:28
Memory Utilized: 4.48 GB
Memory Efficiency: 89.62% of 5.00 GB
Note
On ruche, the accounting information is restricted to your jobs only.
CGROUPS
Your Slurm job will be executed in a virtual resource chunk called a CGROUP, formed with the allocated amount of RAM, cores and GPUs. In some cases, you will only be able to see the selected resources.
Example :
[user@ruche01 ~]$ srun --nodes=1 --time=00:30:00 -p gpu --gres=gpu:2 --export=none --pty /bin/bash
[user@ruche-gpu03 user]$ nvidia-smi # will only see 2 selected gpus
Important note if you are selecting GPUs : the GPU IDs displayed with nvidia-smi start at 0 in the cgroup. The actual GPU IDs are stored in the environment variables SLURM_JOB_GPUS or SLURM_STEP_GPUS. If you manually pass GPU IDs from the wrong context to your framework, you can end up using the same GPU as another job. Be careful!
- CUDA_VISIBLE_DEVICES and GPU_DEVICE_ORDINAL values contain the IDs in the cgroup context
- SLURM_JOB_GPUS or SLURM_STEP_GPUS values contain the IDs in the global context
Example :
[user@ruche01 ~]$ srun --nodes=1 --time=00:30:00 -p gpu --gres=gpu:2 --export=none --pty /bin/bash
[user@ruche-gpu03 user]$ echo $CUDA_VISIBLE_DEVICES # cgroup context
0,1
[user@ruche-gpu03 user]$ echo $GPU_DEVICE_ORDINAL # cgroup context
0,1
[user@ruche-gpu03 user]$ echo $SLURM_STEP_GPUS # global context
1,2