Slurm: Command-line Interface

Reference
HPC
Slurm
Published: October 31, 2014
Modified: January 16, 2025

Table 1: List of commands for users
Command Description
sinfo Information on cluster partitions and nodes
squeue Overview of jobs and their states
scontrol View configuration and states, (un-)suspend jobs
srun Run executable as job (blocks until the job is scheduled)
salloc Submit an interactive job (blocks until a prompt appears)
sbatch Submit a job script for batch scheduling
scancel Cancel a running or pending job

Partitions

sinfo lists partitions…

  • The default partition is marked with an asterisk (*) as suffix to its name
# partition state summary
sinfo -s

# comprehensive list of idle nodes
sinfo -Nel -t idle
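
To focus on a single partition, restrict the output with the -p option, e.g. for the debug partition:

# comprehensive list of nodes in a single partition
sinfo -Nel -p debug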

CPUs & Memory

Table 2: List of columns related to CPUs and Memory
Column Description
CPUS Count of CPUs (logical processors)
S:C:T Count of Sockets, Cores, Threads
CPUS(A/I/O/T) CPU states …the capital letters are abbreviations for Allocated, Idle, Other, and Total
MEMORY RAM configured per node, i.e. maximum allocatable memory (in MB)
>>> sinfo -o "%9P %6g %4c %10z %8m %5D %20C"
PARTITION GROUPS CPUS S:C:T      MEMORY   NODES CPUS(A/I/O/T)       
debug     all    128+ 2:32+:2    257500+  10    0/1664/384/2048     
main*     all    96+  2:24+:2    191388+  440   23056/33840/6144/630
high_mem  all    256  8:16:2     1031342  46    2296/4616/4864/11776
gpu       all    96   2:24:2     515451   50    1202/430/3168/4800  
long      all    96+  2:24+:2    191388+  342   19072/28576/6048/536
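
For a per-node view of memory, the format fields %m (configured memory) and %e (currently free memory) can be combined; a small sketch, values in megabytes:

# configured vs. currently free memory per node (in MB)
sinfo -h -N -o '%n %m %e' | sort -k3 -n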

Time Limits

Table 3: Run-time columns, format in “days-hours:minutes:seconds”
Column Description
DEFAULTTIME Default run-time if none is specified by option
TIMELIMIT Maximum run-time for a job (infinite if a partition supports this)
>>> sinfo -o "%9P  %6g %11L %10l %5D %20C" 
PARTITION  GROUPS DEFAULTTIME TIMELIMIT  NODES CPUS(A/I/O/T)       
debug      all    5:00        30:00      10    0/1664/384/2048     
main*      all    2:00:00     8:00:00    440   23058/33838/6144/630
high_mem   all    1:00:00     7-00:00:00 46    2296/4616/4864/11776
gpu        all    2:00:00     7-00:00:00 50    1202/430/3168/4800  
long       all    2:00:00     7-00:00:00 342   19074/28574/6048/536
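
Jobs should request a run-time below the partition limit with the --time option; a minimal sketch (job.sh is a placeholder for a job script):

# request 1 day and 12 hours on the high_mem partition
sbatch --time=1-12:00:00 --partition=high_mem job.sh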

Selection

Table 4: salloc, srun, and sbatch option to select a partition
Option Description
-p, --partition Request a specific partition for the resource allocation.
Table 5: List of environment variables to select a partition
Variable Description
SLURM_PARTITION Interpreted by the srun command
SALLOC_PARTITION Interpreted by the salloc command
SBATCH_PARTITION Interpreted by the sbatch command

For example, to request resources from the debug partition:

sbatch --partition=debug ...
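
Alternatively, select the partition with the corresponding input environment variable; a sketch for sbatch (job.sh is a placeholder):

# equivalent to the --partition option above
export SBATCH_PARTITION=debug
sbatch job.sh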

Jobs

Job details of all jobs from a user

for i in $(squeue -u $USER -o '%i' -h) ; do scontrol show job $i ; done

History of jobs for a particular user

  • List jobs after start time -S MM/DD[/YY]
  • List all users -a, or a particular user -u vpenso
» sacct -nX -o end,state,exitcode [] | uniq -f2 -c
[…]
      5 2015-09-14T20:08:48  COMPLETED      0:0 
      6 2015-09-14T20:08:48     FAILED      1:0 
     13 2015-09-14T22:35:01  CANCELLED      0:0 
      2 2015-09-15T09:50:35     FAILED      1:0 
     51 2015-09-15T10:22:51  COMPLETED      0:0 
      5 2015-09-15T12:32:10    TIMEOUT      1:0 
      1 2015-09-15T12:32:12  CANCELLED      0:0 
      5 2015-09-15T12:56:40    TIMEOUT      1:0 
      1 2015-09-15T13:01:01  CANCELLED      0:0 
      5 2015-09-15T18:38:10    TIMEOUT      1:0 

Run-Time

Run-time of currently executed jobs, and their limits

squeue -t r -o '%11M %11l %9P %8u %6g %10T' -S '-M' | uniq -f 1 -c

Estimated start time of jobs waiting in queue

squeue -t pd,s -o '%20S %.8u %4P %7a %.2t %R' -S 'S' | uniq -c

Read the man-page for a list of Job Reason Codes

man -P 'less -p "^JOB REASON CODES"' squeue

Failing

List failed jobs for users and/or accounts:

Option Description
-a, --allusers All users of the system
-A, --accounts $LIST List of Slurm accounts
-u, --user $NAME A specific Linux user name
start_time=$(date --date="3 days ago" +"%Y-%m-%d")
# i.e. for all users and all accounts
sacct --format jobid,user,state,start,end,elapsed,exitcode,nodelist \
      --starttime $start_time \
      --state failed \
      --allusers

Limit the output to a specific JOB_ID:

sacct --format jobid,account,user,start,elapsed,exitcode,nodelist --jobs $JOB_ID

Investigate a specific job by searching for its JOB_ID in the log-files on the resource manager. Use zgrep to also read log-files already compressed by log rotation.

zgrep $JOB_ID /var/log/slurmctld*

Exit Code

A non-zero exit code is assumed to indicate job failure

Exit code1 …preserved as job meta-data:

  • …value in the range of 0 to 255
  • Derived from…
    • sbatch — …exit code of the batch script
    • salloc — …return value of the exit call terminating the session
    • srun — …return value of the executed command
Table 6: List of Exit codes
Exit Code Description
0 success (≠0 failure)
1 general failure
2 misuse of shell built-ins
3-124 error in job (check software exit codes)
125 out of memory
126 command cannot execute
127 command not found
128 invalid argument to exit
129-192 terminated by host signals
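
As a small illustration of the sbatch case above, a batch script terminating with exit 3 is recorded with that code (a sketch; fail.sh is a throw-away script):

# the exit status of the batch script becomes the job exit code
cat > fail.sh <<'EOF'
#!/bin/bash
exit 3
EOF
job_id=$(sbatch --parsable fail.sh)

# after completion the accounting shows ExitCode 3:0
sacct -X -j $job_id -o jobid,state,exitcode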

Host Signal

When a host signal was responsible for the job termination…

  • …signal number will be displayed after the exit code
  • …for example 0:53, i.e. <exit_code>:<signal> separated by a colon
export SACCT_FORMAT="jobid,user,state,exitcode,nodelist"
sacct -j $job_id[,$job_id,…]

Derived Exit Code

Derived exit code — highest exit code returned from all job steps

  • sjobexitmod — view and modify the derived exit code and comment string
  • …allows users to annotate a job after completion …e.g. to describe what failed
# list exit codes for a job
sjobexitmod -l $job_id

# modify after completion
sjobexitmod $job_id -e $exit_code -r "$comment"

Priority

Job priority is an integer…

  • …ranges between 0 and 4294967295
  • …larger numbers = higher position in queue
# list jobs in priority order …highest priority at the bottom
sprio -l -S 'Y'

# put job on top of queue (aka set highest possible priority)
scontrol top $job_id

# set a specific priority (in relation to other users)
scontrol update jobid=$job_id priority=$priority

Operators and administrators can launch jobs with top priority:

srun --priority top #…
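
To inspect how the priority of a specific job is composed (assuming the multifactor priority plugin is in use):

# show the weighted priority factors for a single job
sprio -l -j $job_id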

Suspend

Suspend all running jobs of a user (option -t R)

» squeue -ho %A -t R -u $user | paste -sd' '
509854 509855 509856 509853
» scontrol suspend $(squeue -ho %A -t R -u $user | paste -sd ' ')

Resume all suspended jobs of a user (option -t S):

scontrol resume $(squeue -ho %A -t S -u $user | paste -sd ' ')

Other sub-commands of scontrol

Command Description
hold Prevent a pending job from being started
release Release a previously held job to begin execution
uhold Hold a job so that the job owner may release it
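
For example, to hold and later release a pending job:

# prevent a pending job from being scheduled
scontrol hold $job_id

# allow the job to be scheduled again
scontrol release $job_id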

Recurring Jobs

scrontab schedules recurring jobs on the cluster. It provides a cluster based equivalent to crontab (short for “cron table”), a system that specifies scheduled tasks to be run by the cron daemon2 on Unix-like systems. scrontab is used to configure Slurm to execute commands at specified intervals, allowing users to automate repetitive tasks.

All users can have their own scrontab file, allowing for personalized job scheduling without interfering with other users. Users can define jobs directly in the scrontab file, specifying the command to run, the schedule, and any Slurm options (like resource requests).

Format

The scrontab configuration format works similarly to the traditional cron format, allowing users to specify when and how often jobs should be executed. The configuration can have several crontab entries (jobs).

# create a simple example for scrontab
>>> cat > sleep.scrontab <<EOF
#SCRON --time=00:02:00
#SCRON --job-name=sleep-scrontab
#SCRON --chdir=/lustre/hpc/vpenso
#SCRON --output=sleep-scrontab-%j.log
#SCRON --open-mode=append
*/10 * * * * date && sleep 30
EOF

# install a new scrontab from a file
>>> scrontab sleep.scrontab

# check the queue
>>> squeue --me -O Jobid,EligibleTime,Name,State
JOBID               ELIGIBLE_TIME       NAME                STATE               
14938318            2024-10-31T10:20:00 sleep-scrontab      PENDING  

Time Fields

The first five fields specify the schedule for the job, and they represent from left to right:

Field Description
Minute (0-59) The minute of the hour when the job should be scheduled
Hour (0-23) The hour of the day when the job should be scheduled
Day of the Month (1-31) The specific day of the month when the job should run
Month (1-12) The month when the job should run
Day of the Week (0-7) The day of the week when the job should run (0 and 7 both represent Sunday).

Special characters are used to define more complex schedules:

Character Description
Asterisk (*) Represents “every” unit of time. For example, an asterisk in the minute field means the job will run every minute.
Comma (,) Used to specify multiple values. For example, 1,15 in the minute field means the job will run at the 1st and 15th minute of the hour.
Dash (-) Specifies a range of values. For example, 1-5 in the day of the week field means the job will run from Monday to Friday.
Slash (/) Specifies increments. For example, */5 in the minute field means the job will run every 5 minutes.
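
Two example schedules combining these fields and characters (path/to/task.sh is a placeholder):

# at minute 0 of every hour, Monday to Friday
0 * * * 1-5 path/to/task.sh

# every 15 minutes on the first day of each month
*/15 * 1 * * path/to/task.sh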

Some users may find it convenient to use a web-based crontab generator3 to prepare a custom configuration.

Shortcuts

Shortcuts to specify some common time intervals

Shortcut Description
@annually Job will become eligible at 00:00 Jan 01 each year
@monthly Job will become eligible at 00:00 on the first day of each month
@weekly Job will become eligible at 00:00 Sunday of each week
@daily Job will become eligible at 00:00 each day
@hourly Job will become eligible at the first minute of each hour.

Meta-Commands

Lines starting with #SCRON allow users to set Slurm options for the crontab entry that immediately follows. This means each crontab entry needs its own list of #SCRON meta-commands, for example:

#SCRON --job-name=sleep-scrontab
#SCRON --chdir /lustre/hpc/vpenso
@daily path/to/sleep.sh > sleep-$(date +%Y%m%dT%H%M).log

Options include most of those available to the sbatch command (make sure to read the manual pages for more details). In order to write the output of a recurring job into a single file, use the following option:

Option Description
--open-mode Appends output to an existing log-file (instead of overwriting it)
#SCRON --job-name=sleep-scrontab
#SCRON --chdir /lustre/hpc/vpenso
#SCRON --output=sleep-scrontab-%j.log
#SCRON --open-mode=append
0 8 * * * path/to/sleep.sh

Usage

Users can configure their scrontab in multiple ways:

# modify the configuration with your preferred text-editor
EDITOR=vim scrontab -e    # (1)

# read the configuration from a file
scrontab path/to/file     # (2)

# print the configuration
scrontab -l               # (3)

# clear the configuration
scrontab -r               # (4)

(1) Modify the configuration with a text-editor using option -e.
(2) Apply a configuration by passing a file as argument.
(3) Option -l prints the configuration to the terminal.
(4) Option -r removes the entire configuration (jobs already running continue, but will no longer recur).

Jobs have the same Job ID for every run (until the next time the configuration is modified).

# list cron jobs waiting in the queue
squeue --me -O Jobid,EligibleTime,Name,State    # (1)

# list all recurring jobs in the past
sacct --duplicates --jobs $job_id               # (2)

# skip the next run
scontrol requeue $job_id                        # (3)

# disable a cron job
scancel --cron $job_id                          # (4)

(1) List when cron jobs will be eligible for their next execution. Note that jobs are not guaranteed to execute at the preferred time.
(2) List all recurring executions of the cron job from the accounting.
(3) Skip the next execution of a cron job with scontrol and reschedule the job to the upcoming available time.
(4) Request to cancel a job submitted by scrontab with scancel. The corresponding entry in the scrontab will be preceded by the comment #DISABLED.

Reservations

Slurm has the ability to reserve resources4 for jobs being executed by select users and/or accounts. A resource reservation identifies the resources in that reservation and a time period during which the reservation is available. The resources which can be reserved include cores, nodes, licenses and/or burst buffers. A reservation that contains nodes or cores is associated with one partition, and can't span resources over multiple partitions. The only exception to this is when the reservation is created with explicitly requested nodes.

Reservations can be created, updated, and removed with the scontrol command

# Display an overview list for reservations
sinfo -T

# List all reservations with detailed specification
scontrol show reservation
  • ReservationName= — identifier used to allocate resources from the reservation
  • Users=, Accounts= — users/accounts with access to a reservation

Usage

salloc, srun and sbatch …reference the reservation

# request a specific reservation for allocation
sbatch --reservation=$name ...
  • -r, --reservation — job allocates resources from the specified reservation
  • -p, --partition
    • …if a resource reservation provides nodes from multiple partitions
    • …it is required to use the partition option in addition!

Alternatively, use the following input environment variables:

Environment Variable Description
SLURM_RESERVATION …reservation with srun
SALLOC_RESERVATION …reservation with salloc
SBATCH_RESERVATION …reservation with sbatch
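
A minimal job-script sketch, assuming a reservation named maintenance exists and the submitting user has access to it:

#!/bin/bash
#SBATCH --partition=main
#SBATCH --reservation=maintenance
srun hostname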

Duration & Flags

The following is a subset of the specifications (refer to the corresponding section in the scontrol manual page):

Option Description
starttime YYYY-MM-DD[THH:MM], or now[+count] where count has a time unit (minutes, hours, days, or weeks)
endtime YYYY-MM-DD[THH:MM] (alternatively use duration)
duration [[days-]hours:]minutes, or UNLIMITED/infinite
flags=<list> maint marks a system maintenance for the accounting, ignore_jobs ignores jobs running during the reserved time, daily or weekly creates a recurring reservation

Reserve an entire cluster at a particular time for system downtime:

scontrol create reservation starttime=$starttime \
   duration=120 user=root flags=maint,ignore_jobs nodes=ALL

Reserve a specific node to investigate a problem:

scontrol create reservation starttime=now \
    user=root duration=infinite flags=maint nodes=$node
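
Existing reservations can be modified with scontrol update, for example to extend the duration (a sketch; see the scontrol manual page for all options):

# extend the reservation to a duration of 240 minutes
scontrol update reservationname=$name duration=240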

Remove a reservation from the system:

scontrol delete reservation=$name

Resources

By default, reservations must not overlap. They must either include different nodes or operate at different times. If specific nodes are not specified when a reservation is created, Slurm will automatically select nodes to avoid overlap and ensure that the selected nodes are available when the reservation begins. … Note that a reservation having a maint or overlap flag will not have resources removed from it by a subsequent reservation also having a maint or overlap flag, so nesting of reservations only works to a depth of two.

Option Description
nodecnt=<num> Number of nodes… a suffix as in nodecnt=1k multiplies by 1024
nodes= Nodeset to use, or nodes=ALL to reserve all nodes in the cluster
feature= Only nodes with a specific feature
# specific set of nodes
scontrol ... nodes='node[0700-0720],node[1000-1002]' ...

# all nodes in a partition
scontrol ... partitionname=long nodes=all

Users & Accounts

Reservations can not only be created for the use of specific accounts and users, but specific accounts and/or users can be prevented from using them. If both Users and Accounts are specified, a job must match both in order to use the reservation:

You can add or remove individual accounts/users from an existing reservation by using the update command and adding a ‘+’ or ‘-’ sign before the ‘=’ sign. If accounts are denied access to a reservation (account name preceded by a ‘-’), then all other accounts are implicitly allowed to use the reservation and it is not possible to also explicitly specify allowed accounts.

# add an account to an existing reservation
scontrol update reservationname=$name accounts+=$account

Examples:

  • accounts= — configure accounts with access…
    • accounts=alice,bob — comma-separated list of allowed accounts
    • accounts=-bob — deny access for the listed accounts (all other accounts are implicitly allowed)
  • users= — configure users with access…
    • users=jane,joe — comma-separated list of allowed users
    • users-=ted — remove a listed user from an existing reservation (with update)
    • users=-troth — deny access for the listed users

Nodes

Get an overview of the resources:

  • sinfo -lNe
    • …one line per node & partition
    • …list resources (CPU, RAM) & node features (see the example below)
  • sinfo -R
    • …list the reason for nodes in state down, drained, or failing
    • …add option -d to limit the output to unresponsive (dead) nodes
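
The feature tags can be listed per node, e.g. to pick a value for the --constraint option of salloc, srun, and sbatch:

# node names with their feature tags
sinfo -h -N -o '%n %f' | sort -u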

Format the output to be piped into nodeset:

# node-list of drained nodes
sinfo -h -N -o '%n' -t drain,draining,drained | nodeset -f

# node-list of unresponsive nodes...
sinfo -h -N -o '%n' -t down,no_respond,power_down,unk,unknown | nodeset -f
Table 7: List of Node States
State Description
IDLE …not allocated
ALLOCATED …by one or more jobs
ALLOCATED+ …some jobs in process of completing
COMPLETING …all jobs completing
INVAL …node did not register to controller
FUTURE …node not available yet
MAINT …node in maintenance
DRAINING …node will become unavailable by admin request
DRAINED …node unavailable by admin request
DOWN …node unavailable for use
FAIL …node expected to fail …unavailable by admin request
FAILING …node expected to fail …jobs still running, no new allocations

Drain & Resume

Remove a node (temporarily) from operation…

# graceful drain nodes for maintenance
scontrol update state=drain nodename="$nodeset" reason="$reason"

# move a node back into operational state
scontrol update state=resume nodename="$nodeset"
  • state=drain
    • …state draining …no new jobs …running jobs continue
    • …state drained …node empty …returned to state idle manually
  • state=down
    • …state down …abort all running jobs (immediately)
    • …will interrupt service to the user (jobs may be requeued)
  • state=resume …state idle …accept new jobs
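
If a node has to be removed from service immediately, set its state to down (as noted above, this aborts running jobs):

# remove a node from service immediately, aborting running jobs
scontrol update state=down nodename="$nodeset" reason="$reason"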

Reboot

Reboot nodes using the resource manager scontrol reboot sub-command:

# reboot nodes as soon as they are idle (explicitly drain the nodes beforehand)
scontrol reboot ...            # defaults to ALL! (reboots all nodes in the cluster)
scontrol reboot $(hostname)... # reboot localhost
scontrol reboot "$nodeset" ... # reboot a nodeset

# drain & reboot the nodes
scontrol reboot ASAP "$nodeset"

# cancel pending reboots with
scontrol cancel_reboot "$nodeset"

# node clears its state and returns to service after reboot
scontrol reboot "$nodeset" nextstate=RESUME ...

Nodes with pending reboot…

>>> scontrol show node $node
#...
  State=MIXED+DRAIN+REBOOT_REQUESTED #...
#...
  Reason=Reboot ASAP [root@2023-10-18T09:50:18]

Nodes during reboot…

>>> scontrol show node $node
#...
  State=DOWN+DRAIN+REBOOT_ISSUED #...
#...
  Reason=Reboot ASAP : reboot issued [root@2023-10-20T07:05:07]

Footnotes

  1. Job Exit Codes, Slurm Documentation
    https://slurm.schedmd.com/job_exit_code.html

  2. cron, Wikipedia
    https://en.wikipedia.org/wiki/Cron

  3. Crontab Generator
    https://crontab-generator.org

  4. Advanced Resource Reservation Guide, SchedMD
    https://slurm.schedmd.com/reservations.html