Slurm: Command-line Interface

Reference
HPC
Slurm
Published: October 31, 2014
Modified: January 16, 2025

Table 1: List of commands for users
Command Description
sinfo Information on cluster partitions and nodes
squeue Overview of jobs and their states
scontrol View configuration and states, (un-)suspend jobs
srun Run executable as job (blocks until the job is scheduled)
salloc Submit an interactive job (blocks until a prompt appears)
sbatch Submit a job script for batch scheduling
scancel Cancel a running or pending job

Partitions

sinfo lists partitions…

  • The default partition is marked with an asterisk (*) as suffix to its name
# partition state summary
sinfo -s

# comprehensive list of idle nodes
sinfo -Nel -t idle
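
To focus on a single partition, restrict the output with the -p option, e.g. for the debug partition:

# comprehensive list of nodes in a single partition
sinfo -Nel -p debug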

CPUs & Memory

Table 2: List of columns related to CPUs and Memory
Column Description
CPUS Count of CPUs (logical processors)
S:C:T Count of Sockets, Cores, Threads
CPUS(A/I/O/T) CPU states …the capital letters are abbreviations for Allocated, Idle, Other, and Total
MEMORY RAM configured per node, i.e. maximum allocatable memory (in MB)
>>> sinfo -o "%9P %6g %4c %10z %8m %5D %20C"
PARTITION GROUPS CPUS S:C:T      MEMORY   NODES CPUS(A/I/O/T)       
debug     all    128+ 2:32+:2    257500+  10    0/1664/384/2048     
main*     all    96+  2:24+:2    191388+  440   23056/33840/6144/630
high_mem  all    256  8:16:2     1031342  46    2296/4616/4864/11776
gpu       all    96   2:24:2     515451   50    1202/430/3168/4800  
long      all    96+  2:24+:2    191388+  342   19072/28576/6048/536
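
For a per-node view of memory, the format fields %m (configured memory) and %e (currently free memory) can be combined; a small sketch, values in megabytes:

# configured vs. currently free memory per node (in MB)
sinfo -h -N -o '%n %m %e' | sort -k3 -n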

Time Limits

Table 3: Run-time columns, format in “days-hours:minutes:seconds”
Column Description
DEFAULTTIME Default run-time if none is specified by option
TIMELIMIT Maximum run-time for a job (infinite if a partition supports this)
>>> sinfo -o "%9P  %6g %11L %10l %5D %20C" 
PARTITION  GROUPS DEFAULTTIME TIMELIMIT  NODES CPUS(A/I/O/T)       
debug      all    5:00        30:00      10    0/1664/384/2048     
main*      all    2:00:00     8:00:00    440   23058/33838/6144/630
high_mem   all    1:00:00     7-00:00:00 46    2296/4616/4864/11776
gpu        all    2:00:00     7-00:00:00 50    1202/430/3168/4800  
long       all    2:00:00     7-00:00:00 342   19074/28574/6048/536
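
Jobs should request a run-time below the partition limit with the --time option; a minimal sketch (job.sh is a placeholder for a job script):

# request 1 day and 12 hours on the high_mem partition
sbatch --time=1-12:00:00 --partition=high_mem job.sh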

Selection

Table 4: salloc, srun, and sbatch option to select a partition
Option Description
-p, --partition Request a specific partition for the resource allocation.
Table 5: List of environment variables to select a partition
Variable Description
SLURM_PARTITION Interpreted by the srun command
SALLOC_PARTITION Interpreted by the salloc command
SBATCH_PARTITION Interpreted by the sbatch command

For example, to request resources from the debug partition:

sbatch --partition=debug ...
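
Alternatively, select the partition with the corresponding input environment variable; a sketch for sbatch (job.sh is a placeholder):

# equivalent to the --partition option above
export SBATCH_PARTITION=debug
sbatch job.sh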

Jobs

Job details of all jobs from a user

for i in $(squeue -u $USER -o '%i' -h) ; do scontrol show job $i ; done

History of jobs for a particular user

  • List jobs after start time -S MM/DD[/YY]
  • List all users -a, or a particular user -u vpenso
» sacct -nX -o end,state,exitcode [] | uniq -f2 -c
[…]
      5 2015-09-14T20:08:48  COMPLETED      0:0 
      6 2015-09-14T20:08:48     FAILED      1:0 
     13 2015-09-14T22:35:01  CANCELLED      0:0 
      2 2015-09-15T09:50:35     FAILED      1:0 
     51 2015-09-15T10:22:51  COMPLETED      0:0 
      5 2015-09-15T12:32:10    TIMEOUT      1:0 
      1 2015-09-15T12:32:12  CANCELLED      0:0 
      5 2015-09-15T12:56:40    TIMEOUT      1:0 
      1 2015-09-15T13:01:01  CANCELLED      0:0 
      5 2015-09-15T18:38:10    TIMEOUT      1:0 

Run-Time

Run-time of currently executed jobs, and their limits

squeue -t r -o '%11M %11l %9P %8u %6g %10T' -S '-M' | uniq -f 1 -c

Estimated start time of jobs waiting in queue

squeue -t pd,s -o '%20S %.8u %4P %7a %.2t %R' -S 'S' | uniq -c

Read the man-page for a list of Job Reason Codes

man -P 'less -p "^JOB REASON CODES"' squeue

Failing

List failed jobs for users and/or accounts:

Option Description
-a, --allusers All users of the system
-A, --accounts $LIST List of Slurm accounts
-u, --user $NAME A specific Linux user name
start_time=$(date --date="3 days ago" +"%Y-%m-%d")
# i.e. for all users and all accounts
sacct --format jobid,user,state,start,end,elapsed,exitcode,nodelist \
      --starttime $start_time \
      --state failed \
      --allusers

Limit the output to a specific JOB_ID:

sacct --format jobid,account,user,start,elapsed,exitcode,nodelist --jobs $JOB_ID

Investigate a specific job by searching for its JOB_ID in the log-files on the resource manager. Use zgrep to also read log-files already compressed by log rotation.

zgrep $JOB_ID /var/log/slurmctld*

Exit Code

A non-zero exit code is assumed to indicate job failure

Exit code1 …preserved as job meta-data:

  • …value in the range of 0 to 255
  • Derived from…
    • sbatch — …exit code of the batch script
    • salloc — …return value of the exit call terminating the session
    • srun — …return value of the executed command
Table 6: List of Exit codes
Exit Code Description
0 success (≠0 failure)
1 general failure
2 misuse of shell built-ins
3-124 error in job (check software exit codes)
125 out of memory
126 command cannot execute
127 command not found
128 invalid argument to exit
129-192 terminated by host signals
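
As a small illustration of the sbatch case above, a batch script terminating with exit 3 is recorded with that code (a sketch; fail.sh is a throw-away script):

# the exit status of the batch script becomes the job exit code
cat > fail.sh <<'EOF'
#!/bin/bash
exit 3
EOF
job_id=$(sbatch --parsable fail.sh)

# after completion the accounting shows ExitCode 3:0
sacct -X -j $job_id -o jobid,state,exitcode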

Host Signal

When a host signal was responsible for the job termination…

  • …signal number will be displayed after the exit code
  • …for example 0:53, i.e. <exit_code>:<signal> separated by a colon
export SACCT_FORMAT="jobid,user,state,exitcode,nodelist"
sacct -j $job_id[,$job_id,…]

Derived Exit Code

Derived exit code — highest exit code returned from all job steps

  • sjobexitmod — view and modify the derived exit code and comment string
  • …allows users to annotate a job after completion …e.g. to describe what failed
# list exit codes for a job
sjobexitmod -l $job_id

# modify after completion
sjobexitmod $job_id -e $exit_code -r "$comment"

Priority

Job priority is an integer…

  • …ranges between 0 and 4294967295
  • …larger numbers = higher position in queue
# list jobs in priority order …highest priority at the bottom
sprio -l -S 'Y'

# put job on top of queue (aka set highest possible priority)
scontrol top $job_id

# set a specific priority (in relation to other users)
scontrol update jobid=$job_id priority=$priority

Operators and administrators can launch jobs with top priority:

srun --priority top #…
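
To inspect how the priority of a specific job is composed (assuming the multifactor priority plugin is in use):

# show the weighted priority factors for a single job
sprio -l -j $job_id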

Suspend

Suspend all running jobs of a user (option -t R)

» squeue -ho %A -t R -u $user | paste -sd' '
509854 509855 509856 509853
» scontrol suspend $(squeue -ho %A -t R -u $user | paste -sd ' ')

Resume all suspended jobs of a user (option -t S):

scontrol resume $(squeue -ho %A -t S -u $user | paste -sd ' ')

Other sub-commands of scontrol

Command Description
hold Prevent a pending job from being started
release Release a previously held job to begin execution
uhold Hold a job so that the job owner may release it
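
For example, to hold and later release a pending job:

# prevent a pending job from being scheduled
scontrol hold $job_id

# allow the job to be scheduled again
scontrol release $job_id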

Recurring Jobs

scrontab schedules recurring jobs on the cluster. It provides a cluster based equivalent to crontab (short for “cron table”), a system that specifies scheduled tasks to be run by the cron daemon2 on Unix-like systems. scrontab is used to configure Slurm to execute commands at specified intervals, allowing users to automate repetitive tasks.

All users can have their own scrontab file, allowing for personalized job scheduling without interfering with other users. Users can define jobs directly in the scrontab file, specifying the command to run, the schedule, and any Slurm options (like resource requests).

Format

The scrontab configuration format works similarly to the traditional cron format, allowing users to specify when and how often jobs should be executed. The configuration can have several crontab entries (jobs).

# create a simple example for scrontab
>>> cat > sleep.scrontab <<EOF
#SCRON --time=00:02:00
#SCRON --job-name=sleep-scrontab
#SCRON --chdir=/lustre/hpc/vpenso
#SCRON --output=sleep-scrontab-%j.log
#SCRON --open-mode=append
*/10 * * * * date && sleep 30
EOF

# install a new scrontab from a file
>>> scrontab sleep.scrontab

# check the queue
>>> squeue --me -O Jobid,EligibleTime,Name,State
JOBID               ELIGIBLE_TIME       NAME                STATE               
14938318            2024-10-31T10:20:00 sleep-scrontab      PENDING  

Time Fields

The first five fields specify the schedule for the job, and they represent from left to right:

Field Description
Minute (0-59) The minute of the hour when the job should be scheduled
Hour (0-23) The hour of the day when the job should be scheduled
Day of the Month (1-31) The specific day of the month when the job should run
Month (1-12) The month when the job should run
Day of the Week (0-7) The day of the week when the job should run (0 and 7 both represent Sunday).

Special characters are used to define more complex schedules:

Character Description
Asterisk (*) Represents “every” unit of time. For example, an asterisk in the minute field means the job will run every minute.
Comma (,) Used to specify multiple values. For example, 1,15 in the minute field means the job will run at the 1st and 15th minute of the hour.
Dash (-) Specifies a range of values. For example, 1-5 in the day of the week field means the job will run from Monday to Friday.
Slash (/) Specifies increments. For example, */5 in the minute field means the job will run every 5 minutes.
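
Two example schedules combining these fields and characters (path/to/task.sh is a placeholder):

# at minute 0 of every hour, Monday to Friday
0 * * * 1-5 path/to/task.sh

# every 15 minutes on the first day of each month
*/15 * 1 * * path/to/task.sh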

Some users may find it convenient to use a web-based crontab generator3 to prepare a custom configuration.

Shortcuts

Shortcuts to specify some common time intervals

Shortcut Description
@annually Job will become eligible at 00:00 Jan 01 each year
@monthly Job will become eligible at 00:00 on the first day of each month
@weekly Job will become eligible at 00:00 Sunday of each week
@daily Job will become eligible at 00:00 each day
@hourly Job will become eligible at the first minute of each hour.

Meta-Commands

Lines starting with #SCRON allow users to set Slurm options for the crontab entry that immediately follows. This means each crontab entry needs its own list of #SCRON meta-commands, for example:

#SCRON --job-name=sleep-scrontab
#SCRON --chdir /lustre/hpc/vpenso
@daily path/to/sleep.sh > sleep-$(date +%Y%m%dT%H%M).log

Options include most of those available to the sbatch command (make sure to read the manual pages for more details). In order to write the output of a recurring job into a single file, use the following option:

Option Description
--open-mode Appends output to an existing log-file (instead of overwriting it)
#SCRON --job-name=sleep-scrontab
#SCRON --chdir /lustre/hpc/vpenso
#SCRON --output=sleep-scrontab-%j.log
#SCRON --open-mode=append
0 8 * * * path/to/sleep.sh

Usage

Users can configure their scrontab in multiple ways:

# modify the configuration with your preferred text-editor
EDITOR=vim scrontab -e    # (1)

# read the configuration from a file
scrontab path/to/file     # (2)

# print the configuration
scrontab -l               # (3)

# clear the configuration
scrontab -r               # (4)

(1) Modify the configuration with a text-editor using option -e.
(2) Apply a configuration by passing a file as argument.
(3) Option -l prints the configuration to the terminal.
(4) Option -r removes the entire configuration (jobs already running continue, but will no longer recur).

Jobs have the same Job ID for every run (until the next time the configuration is modified).

# list cron jobs waiting in the queue
squeue --me -O Jobid,EligibleTime,Name,State    # (1)

# list all recurring jobs in the past
sacct --duplicates --jobs $job_id               # (2)

# skip the next run
scontrol requeue $job_id                        # (3)

# disable a cron job
scancel --cron $job_id                          # (4)

(1) List when cron jobs will be eligible for their next execution. Note that jobs are not guaranteed to execute at the preferred time.
(2) List all recurring executions of the cron job from the accounting.
(3) Skip the next execution of a cron job with scontrol and reschedule the job to the upcoming available time.
(4) Request to cancel a job submitted by scrontab with scancel. The corresponding entry in the scrontab will be preceded by the comment #DISABLED.

Reservations

Slurm has the ability to reserve resources4 for jobs being executed by select users and/or accounts. A resource reservation identifies the resources in that reservation and a time period during which the reservation is available. The resources which can be reserved include cores, nodes, licenses and/or burst buffers. A reservation that contains nodes or cores is associated with one partition, and can't span resources over multiple partitions. The only exception to this is when the reservation is created with explicitly requested nodes.

Reservations can be created, updated, and removed with the scontrol command

# Display an overview list for reservations
sinfo -T

# List all reservations with detailed specification
scontrol show reservation
  • ReservationName= — identifier used to allocate resources from the reservation
  • Users=, Accounts= — users/accounts with access to a reservation

Usage

salloc, srun and sbatch …reference the reservation

# request a specific reservation for allocation
sbatch --reservation=$name ...
  • -r, --reservation — job allocates resources from the specified reservation
  • -p, --partition
    • …if a resource reservation provides nodes from multiple partitions
    • …it is required to use the partition option in addition!

Alternatively, use the following input environment variables:

Environment Variable Description
SLURM_RESERVATION …reservation with srun
SALLOC_RESERVATION …reservation with salloc
SBATCH_RESERVATION …reservation with sbatch
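
A minimal job-script sketch, assuming a reservation named maintenance exists and the submitting user has access to it:

#!/bin/bash
#SBATCH --partition=main
#SBATCH --reservation=maintenance
srun hostname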

Duration & Flags

The following is a subset of the specifications (refer to the corresponding section in the scontrol manual page):

Option Description
starttime YYYY-MM-DD[THH:MM], or now[+count] where count has a time unit (minutes, hours, days, or weeks)
endtime YYYY-MM-DD[THH:MM] (alternatively use duration)
duration [[days-]hours:]minutes, or UNLIMITED/infinite
flags=<list> maint marks a system maintenance for the accounting, ignore_jobs ignores jobs running during the reserved time, daily or weekly creates a recurring reservation

Reserve an entire cluster at a particular time for system downtime:

scontrol create reservation starttime=$starttime \
   duration=120 user=root flags=maint,ignore_jobs nodes=ALL

Reserve a specific node to investigate a problem:

scontrol create reservation starttime=now \
    user=root duration=infinite flags=maint nodes=$node
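
Existing reservations can be modified with scontrol update, for example to extend the duration (a sketch; see the scontrol manual page for all options):

# extend the reservation to a duration of 240 minutes
scontrol update reservationname=$name duration=240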

Remove a reservation from the system:

scontrol delete reservation=$name

Resources

By default, reservations must not overlap. They must either include different nodes or operate at different times. If specific nodes are not specified when a reservation is created, Slurm will automatically select nodes to avoid overlap and ensure that the selected nodes are available when the reservation begins. … Note that a reservation having a maint or overlap flag will not have resources removed from it by a subsequent reservation also having a maint or overlap flag, so nesting of reservations only works to a depth of two.

Option Description
nodecnt=<num> Number of nodes… a suffix as in nodecnt=1k multiplies by 1024
nodes= Nodeset to use, or nodes=ALL to reserve all nodes in the cluster
feature= Only nodes with a specific feature
# specific set of nodes
scontrol ... nodes='node[0700-0720],node[1000-1002]' ...

# all nodes in a partition
scontrol ... partitionname=long nodes=all

Users & Accounts

Reservations can not only be created for the use of specific accounts and users, but specific accounts and/or users can be prevented from using them. If both Users and Accounts are specified, a job must match both in order to use the reservation:

You can add or remove individual accounts/users from an existing reservation by using the update command and adding a ‘+’ or ‘-’ sign before the ‘=’ sign. If accounts are denied access to a reservation (account name preceded by a ‘-’), then all other accounts are implicitly allowed to use the reservation and it is not possible to also explicitly specify allowed accounts.

# add an account to an existing reservation
scontrol update reservationname=$name accounts+=$account

Examples:

  • accounts= — configure accounts with access…
    • accounts=alice,bob — comma-separated list of allowed accounts
    • accounts=-bob — deny access for the listed accounts (all other accounts are implicitly allowed)
  • users= — configure users with access…
    • users=jane,joe — comma-separated list of allowed users
    • users-=ted — remove a listed user from an existing reservation (with update)
    • users=-troth — deny access for the listed users

Nodes

Get an overview of the resources:

  • sinfo -lNe
    • …one line per node & partition
    • …list resources (CPU, RAM) & node features (see the example below)
  • sinfo -R
    • …list the reason for nodes in state down, drained, or failing
    • …add option -d to limit the output to unresponsive (dead) nodes
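
The feature tags can be listed per node, e.g. to pick a value for the --constraint option of salloc, srun, and sbatch:

# node names with their feature tags
sinfo -h -N -o '%n %f' | sort -u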

Format the output to be piped into nodeset:

# node-list of drained nodes
sinfo -h -N -o '%n' -t drain,draining,drained | nodeset -f

# node-list of unresponsive nodes...
sinfo -h -N -o '%n' -t down,no_respond,power_down,unk,unknown | nodeset -f
Table 7: List of Node States
State Description
IDLE …not allocated
ALLOCATED …by one or more jobs
ALLOCATED+ …some jobs in process of completing
COMPLETING …all jobs completing
INVAL …node did not register to controller
FUTURE …node not available yet
MAINT …node in maintenance
DRAINING …node will become unavailable by admin request
DRAINED …node unavailable by admin request
DOWN …node unavailable for use
FAIL …node expected to fail …unavailable by admin request
FAILING …node expected to fail …jobs still running, no new allocations

Drain & Resume

Remove a node (temporarily) from operation…

# graceful drain nodes for maintenance
scontrol update state=drain nodename="$nodeset" reason="$reason"

# move a node back into operational state
scontrol update state=resume nodename="$nodeset"
  • state=drain
    • …state draining …no new jobs …running jobs continue
    • …state drained …node empty …returned to state idle manually
  • state=down
    • …state down …abort all running jobs (immediately)
    • …will interrupt service to the user (jobs may be requeued)
  • state=resume …state idle …accept new jobs
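
If a node has to be removed from service immediately, set its state to down (as noted above, this aborts running jobs):

# remove a node from service immediately, aborting running jobs
scontrol update state=down nodename="$nodeset" reason="$reason"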

Reboot

Reboot nodes using the resource manager scontrol reboot sub-command:

# reboot nodes as soon as they are idle (explicitly drain the nodes beforehand)
scontrol reboot ...            # defaults to ALL! (reboots all nodes in the cluster)
scontrol reboot $(hostname)... # reboot localhost
scontrol reboot "$nodeset" ... # reboot a nodeset

# drain & reboot the nodes
scontrol reboot ASAP "$nodeset"

# cancel pending reboots with
scontrol cancel_reboot "$nodeset"

# node clears its state and returns to service after reboot
scontrol reboot "$nodeset" nextstate=RESUME ...

Nodes with pending reboot…

>>> scontrol show node $node
#...
  State=MIXED+DRAIN+REBOOT_REQUESTED #...
#...
  Reason=Reboot ASAP [root@2023-10-18T09:50:18]

Nodes during reboot…

>>> scontrol show node $node
#...
  State=DOWN+DRAIN+REBOOT_ISSUED #...
#...
  Reason=Reboot ASAP : reboot issued [root@2023-10-20T07:05:07]

Footnotes

  1. Job Exit Codes, Slurm Documentation
    https://slurm.schedmd.com/job_exit_code.html

  2. cron, Wikipedia
    https://en.wikipedia.org/wiki/Cron

  3. Crontab Generator
    https://crontab-generator.org

  4. Advanced Resource Reservation Guide, SchedMD
    https://slurm.schedmd.com/reservations.html