Job management with SLURM
You should not run your compute code directly on the terminal you find when you log in. The login server ruche01 is not suited for computations.
In order to submit a job on the cluster, you need to describe the resources you need (cores, memory, time) to the task manager Slurm. The task manager launches the job on a remote compute node as soon as the requested resources are available. The job is executed in a virtual resource chunk called a CGROUP. See the section on CGROUPS below for more information.
There are two ways to run a compute code on ruche :
- using an interactive Slurm job : this opens a terminal on a compute node where you can execute your code. This method is well-suited for light tests and environment configuration (especially for GPU-accelerated codes). See the section Interactive jobs.
- using a Slurm script : this will submit your script to the scheduler, which will run it when the resources are available. This method is well-suited for "production" runs.
Slurm is configured with a "fairshare" policy among users : the more resources you have requested in the past days, the lower your priority will be when the task manager has several jobs to handle at the same time.
Most of the time, you will run your code through a Slurm script. This script has the following functions :
- specify the resources you need for your code : partition, walltime, number of nodes, memory (mem), number of tasks (ntasks), local SSD disk space (tmp), etc.
- specify other parameters for your job (project which your job belongs to, output files, mail information on your job status, job name, etc.)
- if you use GPUs, specify the number of GPUs requested (--gres=gpu:<n>)
- setup the batch environment (load modules, set environment variables)
- run the code
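Putting these roles together, a minimal submission script could look like the sketch below. The partition, module names and executable are illustrative assumptions, not Ruche specifics:

```shell
#!/bin/bash
# 1) Resource request (values are examples only)
#SBATCH --job-name=demo
#SBATCH --partition=cpu_short
#SBATCH --ntasks=4
#SBATCH --time=01:00:00

# 2) Batch environment setup
module purge                    # start from a clean environment
module load gcc openmpi         # hypothetical module names

# 3) Run the code
srun ./my_parallel_code         # hypothetical MPI executable
```

You would then submit this file with sbatch.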
The batch environment is set by loading the proper modules (see section Module command) and setting the proper bash variables (PATH, OMP_NUM_THREADS, etc.).
We recommend unloading all your modules with module purge beforehand, then loading exactly the same modules as in your tests and/or your code compilation.
Running the code depends on your executable. Parallel codes may have to be launched with srun or may require specific environment variables.
- By default, the partition is set to the cluster's default queue.
- You can change this setting by choosing a partition matching the resources you need in the table of queue names.
You describe the resources you need in the submission script, using sbatch instructions (script lines beginning with #SBATCH). These options can be used directly with the sbatch command, or listed in a script. Using a script is the best solution if you want to submit the job several times, or several similar jobs.
How to describe your requested resources with SBATCH
Number of nodes : #SBATCH --nodes=<n>
Number of tasks (MPI processes) : #SBATCH --ntasks=<n>
Number of tasks (MPI processes) per node : #SBATCH --ntasks-per-node=<n>
Number of threads per process (Ex: OpenMP threads per MPI process) : #SBATCH --cpus-per-task=<n>
Number of GPUs : #SBATCH --gres=gpu:<n>
Note : In the job, the selected gpus will have IDs in a cgroup context. See section on CGROUPS below for more information.
Exclusive reservation (#SBATCH --exclusive) : allocated nodes are reserved exclusively in order to avoid sharing nodes with other running jobs. Do not use this directive unless the support team tells you to do so for a specific case.
Memory per node : #SBATCH --mem=<size>
- Default units are megabytes. Different units can be specified using the suffix [K|M|G|T]
- Note : the default memory on Ruche is 4 GB per core (--mem-per-cpu=4G). You do not need to specify this directive if the default value is well-suited for your job.
- Check your actual memory needs with seff (see below) and adapt your memory request accordingly for your next jobs. Reserving much more memory than you actually need makes the job stay in the queue longer and prevents other users from using the resource.
Walltime (#SBATCH --time=HH:MM:SS) : specify the walltime for your job. If your job is still running after the walltime duration, it will be killed.
Use this directive only if you need a local SSD disk for IOs. Most of the time, performing IOs in the workdir is sufficient and this directive is not needed.
To use a local SSD disk on a compute node :
1/ Specify a mem or a gpu partition in your script
2/ Reserve the amount of SSD disk space with the Slurm directive --tmp (e.g. #SBATCH --tmp=100G; default units are megabytes)
- Since there is no SSD disk on cpu nodes, Slurm will produce the following error if the --tmp directive is used with a cpu partition :
sbatch: error: Temporary disk specification can not be satisfied
3/ Use $TMPDIR to refer to the directory Slurm will create for your job on the SSD (the path of this directory contains your login and the jobid of your job). Note that Slurm automatically deletes this directory at the end of the job, therefore you must transfer the outputs your code wrote in this directory back to your submission directory ($SLURM_SUBMIT_DIR) in your workdir at the end of the script.
* Do not use $TMPDIR in a script with a cpu partition. For these partitions, $TMPDIR is set to /tmp, which is too small for IOs.
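As a sketch, the copy-back pattern described above can look like the script below. The partition, --tmp size and the computation itself are placeholders, and the fallback values only serve to make the sketch runnable outside Slurm:

```shell
#!/bin/bash
#SBATCH --partition=gpu
#SBATCH --tmp=100G
# Outside a Slurm job these variables are unset; fall back to safe local values.
TMPDIR=${TMPDIR:-$(mktemp -d)}
SLURM_SUBMIT_DIR=${SLURM_SUBMIT_DIR:-$PWD}

cd "$TMPDIR"                          # do the IOs on the local SSD
echo "result" > output.dat            # stands in for the real computation
cp output.dat "$SLURM_SUBMIT_DIR"/    # copy back before Slurm wipes $TMPDIR
```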
Specify the Slurm partition your job will be assigned to : #SBATCH --partition=<PartitionName> (see the partition names list)
SBATCH additional directives
Define the job's name : #SBATCH --job-name=<name>
Define the standard output (stdout) for your job : #SBATCH --output=<file>
Define the error output (stderr) for your job : #SBATCH --error=<file>
By default both standard output and standard error are directed to the same file.
Set an email address : #SBATCH --mail-user=<email>
To be notified by mail when a step has been reached : #SBATCH --mail-type=<type>
Arguments for the --mail-type option are :
- BEGIN : send an email when the job starts
- END : send an email when the job stops
- FAIL : send an email if the job fails
- ALL : equivalent to BEGIN, END, FAIL.
Export user environment variables
- By default all user environment variables will be loaded (--export=ALL).
- To avoid dependencies and inconsistencies between the submission environment and the batch execution environment, disabling this functionality is highly recommended. In order not to export the environment variables present at job submission time to the job's environment, use --export=NONE.
- To explicitly select the variables exported from the caller's environment to the job environment, use --export=VAR1,VAR2,...
- By default all resource limits (obtained with the ulimit command : stack, open files, number of processes, ...) are propagated (--propagate=ALL).
- To avoid propagating interactive limits over the batch resource limits, it is encouraged to disable this functionality with --propagate=NONE.
- By default, the compute time consumption is charged to your default project account.
- To charge another project account, specify it with --account=<project>.
- To see the association between a job and a project, you can use sacct (Account field).
Submit and monitor jobs
You need to submit your script job0 with :
$ sbatch job0
Submitted batch job 29509
which responds with the jobid attributed to the job. For example here, jobid is 29509. The jobid is a unique identifier that is used by many Slurm commands.
squeue command shows the list of jobs :
$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
29509 cpu_short job0 username R 0:02 1 node001
scancel command cancels jobs.
To cancel job job0 with jobid 29509 (obtained through squeue), you would use :
$ scancel 29509
- Example 1: access one node interactively for 30 minutes
$ srun --nodes=1 --time=00:30:00 -p cpu_short --pty /bin/bash
[user@node001 ~]$ hostname
node001
- Example 2: access a node with a GPU for 30 minutes
[user@ruche01 ~]$ srun --nodes=1 --time=00:30:00 -p gpu --gres=gpu:1 --pty /bin/bash
Add the --x11 option if you need X forwarding.
Job arrays are only supported for batch jobs and the array index values are specified using the --array or -a option of the sbatch command. The option argument can be specific array index values, a range of index values, and an optional step size as shown in the examples below. Jobs which are part of a job array will have the environment variable SLURM_ARRAY_TASK_ID set to its array index value.
# Submit a job array with index values between 0 and 31
[user@ruche01 ~]$ sbatch --array=0-31 job
# Submit a job array with index values of 1, 3, 5 and 7
[user@ruche01 ~]$ sbatch --array=1,3,5,7 job
# Submit a job array with index values between 1 and 7
# with a step size of 2 (i.e. 1, 3, 5 and 7)
[user@ruche01 ~]$ sbatch --array=1-7:2 job
The subjobs should not depend on each other : Slurm may start them in any order, simultaneously or not.
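A typical use of SLURM_ARRAY_TASK_ID is to let each sub-job pick its own input. The sketch below assumes hypothetical input files named input_<index>.dat; the :-0 default only makes the script runnable outside a job array:

```shell
#!/bin/bash
#SBATCH --job-name=array_demo
#SBATCH --array=0-3
# Slurm sets SLURM_ARRAY_TASK_ID in each sub-job; default to 0 outside Slurm.
TASK_ID=${SLURM_ARRAY_TASK_ID:-0}
INPUT="input_${TASK_ID}.dat"
MSG="sub-job ${TASK_ID} processes ${INPUT}"
echo "$MSG"                           # stands in for the real computation
```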
If you want to submit a job which must be executed after another job, you can use the job chaining (dependency) mechanism in Slurm.
[username@ruche01 ~]$ sbatch slurm_script1.sh
Submitted batch job 74698
[username@ruche01 ~]$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
74698 ******* ******* username PD 0:00 * *******
[username@ruche01 ~]$ sbatch --dependency=afterok:74698 slurm_script2.sh
Submitted batch job 74699
[username@ruche01 ~]$ sbatch --dependency=afterok:74698:74699 slurm_script3.sh
Submitted batch job 74700
Note that if one of the jobs in the sequence fails, the following jobs remain pending by default with the reason “DependencyNeverSatisfied” and can never be executed. You must then delete them using the scancel command. If you want these jobs to be automatically cancelled on failure, specify the --kill-on-invalid-dep=yes option when submitting them.
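Since sbatch prints "Submitted batch job <jobid>", a chain can also be built in a shell script by parsing the jobid from that output. The sketch below uses a stub function in place of the real sbatch so that the parsing logic can be shown off-cluster:

```shell
#!/bin/bash
# Stub standing in for the real sbatch command (off-cluster illustration only).
sbatch() { echo "Submitted batch job 74698"; }

# The jobid is the 4th word of sbatch's output.
JOBID=$(sbatch slurm_script1.sh | awk '{print $4}')
DEP="--dependency=afterok:${JOBID}"
echo "next submission: sbatch ${DEP} slurm_script2.sh"
```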
Here are the common chaining rules :
- after:<jobid> = job can start once the referenced job has started execution
- afterany:<jobid> = job can start once the referenced job has terminated
- afterok:<jobid> = job can start once the referenced job has terminated successfully
- afternotok:<jobid> = job can start once the referenced job has terminated upon failure
- singleton = job can start once any previous job with identical name and user has terminated
- Use the command sacct to get info on your jobs. More details on this page.
[user@ruche01 ~]$ sacct -e
Account AdminComment AllocCPUS AllocGRES AllocNodes AllocTRES AssocID AveCPU AveCPUFreq AveDiskRead AveDiskWrite AvePages AveRSS AveVMSize BlockID Cluster Comment ConsumedEnergy ConsumedEnergyRaw CPUTime CPUTimeRAW DerivedExitCode Elapsed ElapsedRaw Eligible End ExitCode GID Group JobID JobIDRaw JobName ...
[user@ruche01 ~]$ sacct -j 1240028 --format=jobid,jobname,maxrss,elapsed,exitcode --unit=G
JobID        JobName    MaxRSS     Elapsed    ExitCode
------------ ---------- ---------- ---------- --------
1240028      hydrossd              00:01:28   0:0
1240028.bat+ batch      4.48G      00:01:28   0:0
1240028.ext+ extern     0.00G      00:01:28   0:0
The MaxRSS field (RAM used by the job) is very useful for diagnosing jobs that got cancelled due to out-of-memory problems.
- Use the command seff to get info on a finished job.
[user@ruche01 ~]$ seff 1240028
Job ID: 1240028
Cluster: ruche
User/Group: user/lab
State: COMPLETED (exit code 0)
Cores: 1
CPU Utilized: 00:01:27
CPU Efficiency: 98.86% of 00:01:28 core-walltime
Job Wall-clock time: 00:01:28
Memory Utilized: 4.48 GB
Memory Efficiency: 89.62% of 5.00 GB
Note : on ruche, the accounting information is restricted to your jobs only.
Your Slurm job is executed in a virtual resource chunk called a CGROUP, formed with the allocated amount of RAM, cores and GPUs. In some cases, you will only be able to see the selected resources.
[user@ruche01 ~]$ srun --nodes=1 --time=00:30:00 -p gpu --gres=gpu:2 --export=none --pty /bin/bash
[user@ruche-gpu03 user]$ nvidia-smi # will only see the 2 selected gpus
Important note if you are selecting GPUs : the GPU IDs displayed by nvidia-smi start at 0 in the cgroup context. The actual GPU IDs are stored in the environment variable SLURM_STEP_GPUS. If you manually pass GPU IDs from the wrong context to your framework, you can end up using the same GPU as another job. Be careful !
- GPU_DEVICE_ORDINAL contains the IDs in the cgroup context
- SLURM_STEP_GPUS contains the IDs in the global context
[user@ruche01 ~]$ srun --nodes=1 --time=00:30:00 -p gpu --gres=gpu:2 --export=none --pty /bin/bash
[user@ruche-gpu03 user]$ echo $CUDA_VISIBLE_DEVICES # cgroup context
0,1
[user@ruche-gpu03 user]$ echo $GPU_DEVICE_ORDINAL # cgroup context
0,1
[user@ruche-gpu03 user]$ echo $SLURM_STEP_GPUS # global context
1,2
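To translate a cgroup-local GPU index into its global ID, you can index into SLURM_STEP_GPUS. In this sketch the variable is set by hand to mimic the two-GPU session above:

```shell
#!/bin/bash
SLURM_STEP_GPUS="1,2"    # global IDs, as Slurm would set them for the session above
LOCAL_INDEX=0            # index reported inside the cgroup (what nvidia-smi shows)

# Split the comma-separated list into an array and index it.
IFS=',' read -ra GLOBAL <<< "$SLURM_STEP_GPUS"
echo "cgroup GPU ${LOCAL_INDEX} is global GPU ${GLOBAL[$LOCAL_INDEX]}"
```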