Slurm: Command-line Interface
Command | Description |
---|---|
`sinfo` | Information on cluster partitions and nodes |
`squeue` | Overview of jobs and their states |
`scontrol` | View configuration and states, (un-)suspend jobs |
`srun` | Run an executable as a job (blocks until the job is scheduled) |
`salloc` | Submit an interactive job (blocks until the prompt appears) |
`sbatch` | Submit a job script for batch scheduling |
`scancel` | Cancel a running or pending job |
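As a quick orientation, a typical session with these commands might look like the following sketch (`job.sh` and `$job_id` are placeholders):

srun hostname        # run a single command as a job
sbatch job.sh        # submit a job script for batch scheduling
squeue --me          # check the state of your own jobs
scancel $job_id      # cancel a job by its ID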
Partitions
`sinfo` lists the partitions…

- Default partition — marked with an asterisk `*` as suffix to the name
# partition state summary
sinfo -s
# comprehensive list of idle nodes
sinfo -Nel -t idle
CPUs & Memory
Column | Description |
---|---|
CPUS | Count of CPUs (logical processors) |
S:C:T | Count of Sockets, Cores, Threads |
CPUS(A/I/O/T) | CPU states …the capital letters are abbreviations for Allocated, Idle, Other and Total |
MEMORY | Maximum allocatable RAM |
>>> sinfo -o "%9P %6g %4c %10z %8m %5D %20C"
PARTITION GROUPS CPUS S:C:T MEMORY NODES CPUS(A/I/O/T)
debug all 128+ 2:32+:2 257500+ 10 0/1664/384/2048
main* all 96+ 2:24+:2 191388+ 440 23056/33840/6144/630
high_mem all 256 8:16:2 1031342 46 2296/4616/4864/11776
gpu all 96 2:24:2 515451 50 1202/430/3168/4800
long all 96+ 2:24+:2 191388+ 342 19072/28576/6048/536
Time Limits
Column | Description |
---|---|
DEFAULTTIME | Default run-time if none is specified by option |
TIMELIMIT | Maximum run-time for a job (infinite if a partition supports this) |
>>> sinfo -o "%9P %6g %11L %10l %5D %20C"
PARTITION GROUPS DEFAULTTIME TIMELIMIT NODES CPUS(A/I/O/T)
debug all 5:00 30:00 10 0/1664/384/2048
main* all 2:00:00 8:00:00 440 23058/33838/6144/630
high_mem all 1:00:00 7-00:00:00 46 2296/4616/4864/11776
gpu all 2:00:00 7-00:00:00 50 1202/430/3168/4800
long all 2:00:00 7-00:00:00 342 19074/28574/6048/536
Selection
Option | Description |
---|---|
`-p`, `--partition` | Request a specific partition for the resource allocation. |
Variable | Description |
---|---|
`SLURM_PARTITION` | Interpreted by the srun command |
`SALLOC_PARTITION` | Interpreted by the salloc command |
`SBATCH_PARTITION` | Interpreted by the sbatch command |
For example, request resources from the `debug` partition:
sbatch --partition=debug ...
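Alternatively, the partition can be selected via the environment for repeated submissions (a sketch assuming a Bash-like shell; `job.sh` is a placeholder):

export SBATCH_PARTITION=debug
sbatch job.sh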
Jobs
Job details of all jobs from a user
for i in $(squeue -u $USER -o '%i' -h) ; do scontrol show job $i ; done
History of jobs for a particular user:

- List jobs after a start time with `-S MM/DD[/YY]`
- List all users with `-a`, or a particular user with `-u vpenso` (a combined example follows the output below)
» sacct -nX -o end,state,exitcode […] | uniq -f2 -c
[…]
5 2015-09-14T20:08:48 COMPLETED 0:0
6 2015-09-14T20:08:48 FAILED 1:0
13 2015-09-14T22:35:01 CANCELLED 0:0
2 2015-09-15T09:50:35 FAILED 1:0
51 2015-09-15T10:22:51 COMPLETED 0:0
5 2015-09-15T12:32:10 TIMEOUT 1:0
1 2015-09-15T12:32:12 CANCELLED 0:0
5 2015-09-15T12:56:40 TIMEOUT 1:0
1 2015-09-15T13:01:01 CANCELLED 0:0
5 2015-09-15T18:38:10 TIMEOUT 1:0
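Putting the options together, the history for a particular user since a given start time might be queried as follows (user name and date are placeholders):

sacct -u vpenso -S 09/14/15 -nX -o jobid,end,state,exitcode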
Run-Time
Run-time of currently executed jobs, and their limits
squeue -t r -o '%11M %11l %9P %8u %6g %10T' -S '-M' | uniq -f 1 -c
Estimated start time of jobs waiting in queue
squeue -t pd,s -o '%20S %.8u %4P %7a %.2t %R' -S 'S' | uniq -c
Read the man-page for the list of Job Reason Codes:
man -P 'less -p "^JOB REASON CODES"' squeue
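For a quick look at why jobs are still waiting, the reason can be printed with the `%r` format field (a sketch; adjust the field widths as needed):

# pending jobs with their reason codes
squeue -t pd -o '%.10i %.8u %.9P %.2t %r'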
Failing
List failed jobs for users and/or accounts:
Option | Description |
---|---|
`-a`, `--allusers` | All users on the system |
`-A`, `--accounts $LIST` | List of Slurm accounts |
`-u`, `--user $NAME` | A specific Linux user name |
start_time=$(date --date="3 days ago" +"%Y-%m-%d")
# i.e. for all users, and all accounts
sacct --format jobid,user,state,start,end,elapsed,exitcode,nodelist \
--starttime $start_time \
--state failed \
--allusers
Limit the output to a specific `JOB_ID`:

sacct --format jobid,account,user,start,elapsed,exitcode,nodelist --jobs $JOB_ID
Investigate a specific job using its `JOB_ID` in the log-files on the resource manager. Make sure to use `zgrep` to read log-files already compressed by log rotation.
zgrep $JOB_ID /var/log/slurmctld*
Exit Code
A non-zero exit code is assumed to indicate a job failure.

Exit code1 …preserved as job meta-data:

- …value in the range of 0 to 255
- Derived from…
  - `sbatch` — …batch script exit code (see the sketch below the table)
  - `salloc` — …exit call terminating the session
  - `srun` — …return value of the executed command
Exit Code | Description |
---|---|
0 | success (≠0 failure) |
1 | general failure |
2 | misuse of shell builtins |
3-124 | error in job (check software exit codes) |
125 | out of memory |
126 | command can not execute |
127 | command not found |
128 | invalid argument |
129-192 | terminated by host signals |
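As an illustration of how a batch script's exit status becomes the job exit code, consider the following sketch (the job name and exit value are arbitrary):

#!/bin/bash
#SBATCH --job-name=exit-code-demo
# the exit status of this batch script is preserved as the job's exit code
exit 3

After completion, `sacct -j $job_id -o jobid,state,exitcode` should report the job as FAILED with exit code `3:0`.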
Host Signal
When a host signal was responsible for the job termination…

- …the signal number will be displayed after the exit code
- …for example `0:53` (separation by colon) — `<exit_code>:<signal>`
SACCT_FORMAT="jobid,user,state,exitcode,nodelist"
sacct -j $job_id[,$job_id,…]
Derived Exit Code
Derived exit code — the highest exit code returned from all job steps

- `sjobexitmod` — view and modify the derived exit code and comment string
- …allows users to annotate a job exit after completion …describe what failed
# list exit codes for a job
sjobexitmod -l $job_id
# modify after completion
sjobexitmod $job_id -e $exit_code -r "$comment"
Priority
Job priority is an integer…

- …ranges between `0` and `4294967295`
- …larger numbers = higher position in queue
# list jobs in priority order …highest priority at the bottom
sprio -l -S 'Y'
# put job on top of queue (aka set highest possible priority)
scontrol top $job_id
# set a specific priority (in relation to other users)
scontrol update job=$job_id priority=$priority
Operators and administrators can launch jobs with top priority:
srun --priority top #…
Suspend
Suspend all running jobs of a user (option `-t R`):
» squeue -ho %A -t R -u $user | paste -sd' '
509854 509855 509856 509853
» scontrol suspend $(squeue -ho %A -t R -u $user | paste -sd ' ')
Resume all suspended jobs of a user (option `-t S`):
scontrol resume $(squeue -ho %A -t S -u $user | paste -sd ' ')
Other sub-commands of scontrol
Command | Description |
---|---|
hold | Prevent a pending job from being started |
release | Release a previously held job to begin execution |
uhold | Hold a job so that the job owner may release it |
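For example (`$job_id` is a placeholder):

# prevent a pending job from being started
scontrol hold $job_id
# release the held job so it can be scheduled again
scontrol release $job_id
# hold the job such that only the job owner can release it
scontrol uhold $job_id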
Recurring Jobs
`scrontab` schedules recurring jobs on the cluster. It provides a cluster-based equivalent to `crontab` (short for “cron table”), a system that specifies scheduled tasks to be run by the cron daemon2 on Unix-like systems. `scrontab` is used to configure Slurm to execute commands at specified intervals, allowing users to automate repetitive tasks.

All users can have their own `scrontab` file, allowing for personalized job scheduling without interfering with other users. Users can define jobs directly in the `scrontab` file, specifying the command to run, the schedule, and any Slurm options (like resource requests).
Format
The `scrontab` configuration format works similarly to the traditional cron format, allowing users to specify when and how often jobs should be executed. The configuration can have several crontab entries (jobs).
# create a simple example for scrontab
>>> cat > sleep.scrontab <<EOF
#SCRON --time=00:02:00
#SCRON --job-name=sleep-scrontab
#SCRON --chdir=/lustre/hpc/vpenso
#SCRON --output=sleep-scrontab-%j.log
#SCRON --open-mode=append
*/10 * * * * date && sleep 30
EOF
# install a new scrontab from a file
>>> scrontab sleep.scrontab
# check the queue
>>> squeue --me -O Jobid,EligibleTime,Name,State
JOBID ELIGIBLE_TIME NAME STATE
14938318 2024-10-31T10:20:00 sleep-scrontab PENDING
Time Fields
The first five fields specify the schedule for the job, and they represent from left to right:
Field | Description |
---|---|
Minute (0-59) | The minute of the hour when the job should be scheduled |
Hour (0-23) | The hour of the day when the job should be scheduled |
Day of the Month (1-31) | The specific day of the month when the job should run |
Month (1-12) | The month when the job should run |
Day of the Week (0-7) | The day of the week when the job should run (0 and 7 both represent Sunday). |
Special characters are used to define more complex schedules:
Character | Description |
---|---|
Asterisk (`*`) | Represents “every” unit of time. For example, an asterisk in the minute field means the job will run every minute. |
Comma (`,`) | Used to specify multiple values. For example, `1,15` in the minute field means the job will run at the 1st and 15th minute of the hour. |
Dash (`-`) | Specifies a range of values. For example, `1-5` in the day of the week field means the job will run from Monday to Friday. |
Slash (`/`) | Specifies increments. For example, `*/5` in the minute field means the job will run every 5 minutes. |
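For example, combining the fields and special characters above (the scripts are placeholders):

# run at 02:30 from Monday to Friday
30 2 * * 1-5 path/to/backup.sh
# run at minute 0 and 30 of every hour during January
0,30 * * 1 * path/to/check.sh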
Some users may find it convenient to use a web-based crontab generator3 to prepare a custom configuration.
Shortcuts
Shortcuts to specify some common time intervals
Shortcut | Description |
---|---|
`@annually` | Job will become eligible at 00:00 Jan 01 each year |
`@monthly` | Job will become eligible at 00:00 on the first day of each month |
`@weekly` | Job will become eligible at 00:00 Sunday of each week |
`@daily` | Job will become eligible at 00:00 each day |
`@hourly` | Job will become eligible at the first minute of each hour |
Meta-Commands
Lines starting with `#SCRON` allow users to set Slurm options for the single following crontab entry. This means each crontab entry needs its own list of `#SCRON` meta-commands, for example:
#SCRON --job-name=sleep-scrontab
#SCRON --chdir /lustre/hpc/vpenso
@daily path/to/sleep.sh > sleep-$(date +%Y%m%dT%H%M).log
Options include most of those available to the `sbatch` command (make sure to read the manual pages for more details). In order to write the output of a recurring job into a single file, use the following option:
Options | Description |
---|---|
`--open-mode` | Appends output to an existing log-file (instead of overwriting) |
#SCRON --job-name=sleep-scrontab
#SCRON --chdir /lustre/hpc/vpenso
#SCRON --output=sleep-scrontab-%j.log
#SCRON --open-mode=append
0 8 * * * path/to/sleep.sh
Usage
Users can configure their `scrontab` in multiple ways:
# modify the configuration with your preferred text-editor
EDITOR=vim scrontab -e       # (1)
# read the configuration from a file
scrontab path/to/file        # (2)
# print the configuration
scrontab -l                  # (3)
# clear the configuration
scrontab -r                  # (4)
1. Modify the configuration with a text-editor using option `-e`.
2. Apply a configuration by passing a file as argument.
3. Option `-l` prints the configuration to the terminal.
4. Option `-r` removes the entire configuration (running jobs continue, but will no longer recur).
Jobs have the same Job ID for every run (until the next time the configuration is modified).
# list jobs with their eligible time
squeue --me -O Jobid,EligibleTime,Name,State        # (1)
# list all recurring jobs in the past
sacct --duplicates --jobs $job_id                   # (2)
# skip next run
scontrol requeue $job_id                            # (3)
# disable a cron job
scancel --cron $job_id                              # (4)
1. List when cronjobs will be eligible for next execution. Note that jobs are not guaranteed to execute at the preferred time.
2. List all recurring executions of the cronjob from the accounting.
3. Skip the next execution of a cronjob with `scontrol` and reschedule the job to the upcoming available time.
4. Request to cancel a job submitted by crontab with `scancel`. The job in the crontab will be preceded by the comment `#DISABLED`.
Reservations
Slurm has the ability to reserve resources4 for jobs being executed by select users and/or accounts. A resource reservation identifies the resources in that reservation and a time period during which the reservation is available. The resources which can be reserved include cores, nodes, licenses and/or burst buffers. A reservation that contains nodes or cores is associated with one partition, and can’t span resources over multiple partitions. The only exception to this is when the reservation is created with explicitly requested nodes.
Reservations can be created, updated, and removed with the `scontrol` command:
# Display an overview list for reservations
sinfo -T
# List all reservations with detailed specification
scontrol show reservation
- `ReservationName=` — identifier used to allocate resources from the reservation
- `Users=`, `Accounts=` — users/accounts with access to a reservation
Usage
`salloc`, `srun` and `sbatch` …reference the reservation:
# request a specific reservation for allocation
sbatch --reservation=$name ...
- `-r`, `--reservation` — job allocates resources from the specified reservation
- `-p`, `--partition`
  - …if a resource reservation provides nodes from multiple partitions
  - …it is required to use the partition option in addition! (see the example below)
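For example, assuming a reservation in `$name` that spans nodes from more than one partition, the target partition (here `main`, taken from the `sinfo` output above) has to be requested explicitly (`job.sh` is a placeholder):

sbatch --reservation=$name --partition=main job.sh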
Alternatively, use the following input environment variables:
Environment Variable | Description |
---|---|
`SLURM_RESERVATION` | …reservation with srun |
`SALLOC_RESERVATION` | …reservation with salloc |
`SBATCH_RESERVATION` | …reservation with sbatch |
Duration & Flags
Following is a subset of specifications (refer to the corresponding section in the `scontrol` manual page):
Option | Description |
---|---|
starttime | `YYYY-MM-DD[THH:MM]`, or `now[+time]` where time is a count with a time unit (minutes, hours, days, or weeks) |
endtime | `YYYY-MM-DD[THH:MM]` (alternatively use duration) |
duration | `[[days-]hours:]minutes` or `UNLIMITED`/`infinite` |
`flags=<list>` | `maint` identifies system maintenance for the accounting, `ignore_jobs` ignores jobs running during the reserved time, `daily` or `weekly` for a recurring reservation |
Reserve an entire cluster at a particular time for a system down time:
scontrol create reservation starttime=$starttime \
duration=120 user=root flags=maint,ignore_jobs nodes=ALL
Reserve a specific node to investigate a problem:

scontrol create reservation starttime=now \
    user=root duration=infinite flags=maint nodes=$node
Remove a reservation from the system:
# remove a reservation from the system
scontrol delete reservation=$name
Resources
By default, reservations must not overlap. They must either include different nodes or operate at different times. If specific nodes are not specified when a reservation is created, Slurm will automatically select nodes to avoid overlap and ensure that the selected nodes are available when the reservation begins. … Note a reservation having a `maint` or `overlap` flag will not have resources removed from it by a subsequent reservation also having a `maint` or `overlap` flag, so nesting of reservations only works to a depth of two.
Option | Description |
---|---|
`nodecnt=<num>` | Number of nodes… e.g. `nodecnt=1k` (multiplied by 1024) |
`nodes=` | Nodeset to use, or `nodes=ALL` to reserve all nodes in the cluster |
`feature=` | Only nodes with a specific feature |
# specific set of nodes
scontrol ... nodes='node[0700-0720],node[1000-1002]' ...
# all nodes in a partition
scontrol ... partitionname=long nodes=all
Users & Accounts
Reservations can not only be created for the use of specific accounts and users, but specific accounts and/or users can be prevented from using them. If both Users and Accounts are specified, a job must match both in order to use the reservation:
You can add or remove individual accounts/users from an existing reservation by using the update command and adding a ‘+’ or ‘-’ sign before the ‘=’ sign. If accounts are denied access to a reservation (account name preceded by a ‘-’), then all other accounts are implicitly allowed to use the reservation and it is not possible to also explicitly specify allowed accounts.
# add an account to an existing reservation
scontrol update reservation=$name account+=$account
Examples:

- `accounts=` — configure accounts with access…
  - `accounts=alice,bob` — comma-separated list of allowed accounts
  - `accounts-=bob` — allow all accounts except the listed accounts
- `users=` — configure users with access…
  - `users=jane,joe` — comma-separated list of allowed users
  - `users-=ted` — all users except listed
  - `users=-troth` — deny access for listed users
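A sketch of creating such a reservation with `scontrol` (the reservation name, users, node count and duration are arbitrary examples):

# reserve two nodes for one hour (60 minutes) for the users jane and joe
scontrol create reservation reservationname=user_debug \
    users=jane,joe starttime=now duration=60 nodecnt=2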
Nodes
Get an overview of the resources:

sinfo -lNe

- …one line per node & partition
- …lists resources (CPU, RAM) & node features

sinfo -rd

- …lists all unresponsive nodes
- …reason for node state: `down`, `drained`, or `failing`
Format the output to be piped into `nodeset`:
# node-list of drained nodes
sinfo -h -N -o '%n' -t drain,draining,drained | nodeset -f
# node-list of unresponsive nodes...
sinfo -h -N -o '%n' -t down,no_respond,power_down,unk,unknown | nodeset -f
State | Description |
---|---|
IDLE | …not allocated |
ALLOCATED | …by one or more jobs |
ALLOCATED+ | …some jobs in the process of completing |
COMPLETING | …all jobs completing |
INVAL | …node did not register with the controller |
FUTURE | …node not available yet |
MAINT | …node in maintenance |
DRAINING | …node will become unavailable by admin request |
DRAINED | …node unavailable by admin request |
DOWN | …node unavailable for use |
FAIL | …node expected to fail …unavailable by admin request |
FAILING | …jobs expected to fail soon |
Drain & Resume
Remove a node (temporarily) from operation…
# gracefully drain nodes for maintenance
scontrol update state=drain nodename="$nodeset" reason="$reason"
# move a node back into operational state
scontrol update state=resume nodename="$nodeset"
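To check which nodes are currently out of operation and why, `sinfo -R` lists the recorded reason, who set it, and when (the format string below is just one possible layout):

sinfo -R
# or with explicit fields: timestamp, user, reason, nodelist
sinfo -R -o '%20H %12u %40E %N'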
`state=drain`

- …state `draining` …no new jobs …running jobs continue
- …state `drained` …node empty …returned to state `idle` manually

`state=down`

- …state `down` …aborts all running jobs (immediately)
- …will interrupt service to the user (jobs may be requeued)

`state=resume`

- …state `idle` …accepts new jobs
Reboot
Reboot nodes using the resource manager `scontrol reboot` sub-command:
# reboot nodes as soon as they are idle (explicitly drain the nodes beforehand)
scontrol reboot ... # Defaults to ALL!!! reboots all nodes in the cluster
scontrol reboot $(hostname)... # reboot localhost
scontrol reboot "$nodeset" ... # reboot a nodeset
# drain & reboot the nodes
scontrol reboot ASAP "$nodeset"
# cancel pending reboots
scontrol cancel_reboot "$nodeset"
# node clears its state and returns to service after reboot
scontrol reboot "$nodeset" nextstate=RESUME ...
Nodes with pending reboot…
>>> scontrol show node $node
#...
State=MIXED+DRAIN+REBOOT_REQUESTED #...
#...
Reason=Reboot ASAP [root@2023-10-18T09:50:18]
Nodes during reboot…
>>> scontrol show node $node
#...
State=DOWN+DRAIN+REBOOT_ISSUED #...
#...
Reason=Reboot ASAP : reboot issued [root@2023-10-20T07:05:07]
Footnotes
1. Job Exit Codes, Slurm Documentation: https://slurm.schedmd.com/job_exit_code.html
2. cron, Wikipedia: https://en.wikipedia.org/wiki/Cron
3. Crontab Generator: https://crontab-generator.org
4. Advanced Resource Reservation Guide, SchedMD: https://slurm.schedmd.com/reservations.html