Slurm - Multifactor Priorities
scontrol & sprio commands, Reservation & Fair-Share
Multifactor priority plugin PriorityType=priority/multifactor…
- …ordering the queue of jobs based on several factors
- …highest priority jobs not necessarily evaluated first
- Job evaluation in the following order (top-down)…
- …preemption (preemptor has higher priority than preemptee)
- …reservation (jobs with advanced reservation have higher priority)
- …partition priority
- …job priority
- …submit time
- …job ID
sprio -w …lists configured weights for each factor
- …a weight can be assigned to each factor
- …enact a policy that blends a combination of factors
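For example, sprio shows both the configured weights and the per-job factor contributions:

# ...list per-factor priority contributions for pending jobs
sprio -l
# ...list the configured weight of each factor
sprio -w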
Reservation
Advanced Resource Reservation Guide, SchedMD:
- …reserve resources for jobs being executed by select users and/or accounts
- …identifies the resources in that reservation and a time period
- …resources reserved include cores, nodes, licenses and/or burst buffers
- …reservation contains nodes or cores associated with one partition
- …with the exception of a reservation created with explicitly requested nodes
List available reservations…
- ReservationName= …identifier used to allocate resources from the reservation
- Users= and Accounts= …access privileges associated to a reservation

# ...list all reservations in the system
sinfo -T
scontrol show reservations

Duration & Flags
Following is a subset of specifications (refer to the corresponding section in the scontrol manual page):
- starttime= …YYYY-MM-DD[THH:MM]
  - …or now[+time] where time is a count with a time unit (minutes, hours, days, or weeks)
- endtime= …YYYY-MM-DD[THH:MM] …alternatively use duration
- duration= …[[days-]hours:]minutes or UNLIMITED/infinite
- flags=<list>
  - maint …identify the reservation as system maintenance for accounting
  - ignore_jobs …ignore jobs running during the reserved time
  - daily or weekly …recurring reservation
Reserve an entire cluster at a particular time for a system down time:
scontrol create reservation starttime=$starttime \
         duration=120 user=root flags=maint,ignore_jobs nodes=ALL

Reserve a specific node to investigate a problem:

scontrol create reservation starttime=now \
         user=root duration=infinite flags=maint nodes=$node

Remove a reservation from the system:

scontrol delete reservation=$name

Resources
By default, reservations must not overlap. They must either include different nodes or operate at different times. If specific nodes are not specified when a reservation is created, Slurm will automatically select nodes to avoid overlap and ensure that the selected nodes are available when the reservation begins. … Note a reservation having a
maint or overlap flag will not have resources removed from it by a subsequent reservation also having a maint or overlap flag, so nesting of reservations only works to a depth of two.
Options…
- nodecnt=<num> …number of nodes to reserve (selected by the scheduler)
- nodecnt=1k …the k suffix multiplies by 1024
- nodes= …node set to use
- nodes=all …reserve all nodes in the cluster
- features= …limit selection to nodes with a specific feature
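A minimal sketch letting the scheduler pick the nodes (the feature name intel is hypothetical):

# ...reserve 16 nodes carrying a given feature, selected by the scheduler
scontrol create reservation starttime=now duration=60 \
         user=root nodecnt=16 features=intel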
# specific set of nodes
scontrol ... nodes='lxbk[0700-0720],lxbk[1000-1002]' ...
# all nodes in a partition
scontrol ... partitionname=long nodes=all

Accounts & Users
Reservations can be created for the use of specific accounts and/or users…
- …if users and accounts are specified
- …job must match both in order to use the reservation
Options…
- accounts=fire,ice …comma-separated list of allowed accounts
- accounts=-water,-earth …allow all accounts except the listed accounts
- users=alice,bob …comma-separated list of allowed users
- users=-zack …deny access for listed users
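A sketch combining both options (user and account names reused from the list above):

# ...only jobs matching a listed user and the account may use the reservation
scontrol create reservation starttime=now duration=120 \
         nodecnt=2 users=alice,bob accounts=fire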
Add/remove individual accounts/users from an existing reservation
- …adding a ‘+’ or ‘-’ sign before the ‘=’ sign.
- …if accounts are denied access to a reservation
- …account name preceded by a '-'
- …then all other accounts are implicitly allowed
- …not possible to also explicitly specify allowed accounts
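For example, denying a single account (account name reused from above):

# ...deny one account access, all other accounts are implicitly allowed
scontrol update reservation=$name accounts=-water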
# ...add an account to an existing reservation
scontrol update reservation=$name accounts+=fire

Usage
Reference a resource reservation with salloc, srun, and sbatch…
- …option --reservation=<name> …allocate resources in the specified reservation
- …if a resource reservation provides nodes from multiple partitions…
  - …required to use the --partition= option as well
  - …otherwise the scheduler can not determine which resources to use
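For example, with the partition name from the earlier snippet:

# ...reservation spanning multiple partitions, select one partition explicitly
sbatch --reservation=$name --partition=long ...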
# ...request a specific reservation for allocation
sbatch --reservation=$name ...

Alternatively use the following input environment variables:
| Environment Variable | Description |
|---|---|
| SLURM_RESERVATION | Use a reservation with srun. |
| SALLOC_RESERVATION | Use a reservation with salloc. |
| SBATCH_RESERVATION | Use a reservation with sbatch. |
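For example, the sbatch variable has the same effect as passing the option on the command line:

# ...equivalent to sbatch --reservation=$name
export SBATCH_RESERVATION=$name
sbatch ...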
Priority Factors
List of configurable priority factors…
| Factor | Description |
|---|---|
| age | …time the job has been waiting in the queue |
| association | …factor associated with the job's account association |
| fair-share | …relation to resources consumed in the past |
| size | …size of the resources a job allocates |
| nice | …factor controlled by users |
| partition | …priority associated with a partition |
| qos | …priority of the quality of service associated with the job |
| site | …factor dictated by admins |
| tres | …factor associated with the resources requested |
- …sum of all the factors that have been enabled
- …integer that ranges between 0 and 4294967295
- …the larger the number …the higher the job will be positioned in the queue
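Roughly, following the Slurm Multifactor Priority Plugin documentation, the priority is a weighted sum of normalized factors (each factor is a float between 0.0 and 1.0):

Job_priority = site_factor +
               PriorityWeightAge       * age_factor +
               PriorityWeightAssoc     * assoc_factor +
               PriorityWeightFairshare * fairshare_factor +
               PriorityWeightJobSize   * job_size_factor +
               PriorityWeightPartition * partition_factor +
               PriorityWeightQOS       * qos_factor +
               SUM(TRES_weight * TRES_factor, ...) -
               nice_factor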
# ...list jobs in priority order with requested resources
squeue --priority --format="%.10A %.8Q %.3D %.3H %.3I %.3J %.10l %.10m %n" --sort=-p,i --state=PD
# ...modify the priority of a job
scontrol update jobid=$job_id priority=$priority

Nice
Users can adjust the priority of their own jobs…
- …positive values negatively impact a job’s priority
- …negative values increase a job’s priority
- …ranges from +/-2147483645
- …the backfill algorithm may still run a lower-priority job before a higher-priority job
# ...put specified job first in queue for user
scontrol top $job_list
# ...specify a low-priority job
sbatch --nice=10000 #...