Slurm - Multifactor Priorities

scontrol, sprio command, Reservation & Fair-Share

HPC
Published

August 18, 2023

Modified

August 28, 2023

Multifactor priority plugin PriorityType=priority/multifactor

Reservation

Advanced Resource Reservation Guide, SchedMD:

  • …reserve resources for jobs being executed by select users and/or accounts
  • …identifies the resources in that reservation and a time period
  • …resources reserved include cores, nodes, licenses and/or burst buffers
  • …reservation contains nodes or cores associated with one partition
  • …with the exception of a reservation created with explicitly requested nodes

List available reservations…

  • ReservationName= …identifier used to allocate resources from the reservation
  • Users= and Accounts= …access privileges associated to a reservation
# ...list all reservations in the system
sinfo -T
scontrol show reservations

Duration & Flags

Following is a subset of specifications (refer to the corresponding section in the scontrol manual page):

  • starttime=
    • YYYY-MM-DD[THH:MM]
    • …or now[+time] where time is a count with a time unit (minutes, hours, days, or weeks)
  • endtime=YYYY-MM-DD[THH:MM] …alternatively use duration
  • duration=[[days-]hours:]minutes or UNLIMITED/infinite
  • flags=<list>
    • maint …identifies system maintenance for accounting purposes
    • ignore_jobs …ignore jobs already running when the reservation is created
    • daily or weekly …recurring reservation

Reserve an entire cluster at a particular time for a system down time:

scontrol create reservation starttime=$starttime \
   duration=120 user=root flags=maint,ignore_jobs nodes=ALL
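
The $starttime variable above is left undefined. A minimal sketch of one way to fill it in, assuming GNU date and a hypothetical maintenance window starting tomorrow at 08:00. The command is only printed here, not executed:

```shell
# Assumption: GNU date (the -d option with relative expressions is not POSIX).
# Hypothetical maintenance window: tomorrow at 08:00, lasting 120 minutes.
starttime=$(date -d 'tomorrow 08:00' +%Y-%m-%dT%H:%M)

# Dry run: assemble and print the command instead of executing it.
cmd="scontrol create reservation starttime=$starttime duration=120 user=root flags=maint,ignore_jobs nodes=ALL"
echo "$cmd"
```

Printing the command first makes it easy to check the timestamp format (YYYY-MM-DDTHH:MM) before a reservation is created on a production cluster.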

Reserve a specific node to investigate a problem:

scontrol create reservation starttime=now \
    user=root duration=infinite flags=maint nodes=$node

Remove a reservation from the system:

scontrol delete reservation=$name

Resources

By default, reservations must not overlap. They must either include different nodes or operate at different times. If specific nodes are not specified when a reservation is created, Slurm will automatically select nodes to avoid overlap and ensure that the selected nodes are available when the reservation begins. … Note a reservation having a maint or overlap flag will not have resources removed from it by a subsequent reservation also having a maint or overlap flag, so nesting of reservations only works to a depth of two.

Options…

  • nodecnt=<num>
    • …number of nodes to reserve (selected by the scheduler)
    • nodecnt=1k …a k suffix multiplies the count by 1024
  • nodes= …nodeset to use …nodes=all reserve all nodes in the cluster.
  • feature= …only nodes with a specific feature
# specific set of nodes
scontrol ... nodes='lxbk[0700-0720],lxbk[1000-1002]' ...

# all nodes in a partition
scontrol ... partitionname=long nodes=all

Accounts & Users

Reservations can be created for the use of specific accounts/users, and specific accounts/users can also be denied access…

  • …if users and accounts are specified
  • …job must match both in order to use the reservation

Options…

  • accounts=fire,ice comma-separated list of allowed accounts …accounts-=water,earth allows all accounts except the listed ones
  • users=alice,bob comma-separated list of allowed users …users=-zack denies access for the listed users

Add/remove individual accounts/users from an existing reservation

  • …adding a ‘+’ or ‘-’ sign before the ‘=’ sign.
  • …if accounts are denied access to a reservation
    • …account name preceded by a ‘-’
    • …then all other accounts are implicitly allowed
    • …not possible to also explicitly specify allowed accounts
# ...add an account to an existing reservation
scontrol update reservation=$name account+=fire

Usage

Reference a resource reservation with salloc, srun, and sbatch

  • option --reservation=<name> …allocate resources in specified reservation
  • …if a resource reservation provides nodes from multiple partitions…
    • …required to use the --partition= option as well
    • …otherwise the scheduler cannot determine which resources to use
# ...request a specific reservation for allocation
sbatch --reservation=$name ...

Alternatively use the following input environment variables:

Environment Variable   Description
SLURM_RESERVATION      Use a reservation with srun.
SALLOC_RESERVATION     Use a reservation with salloc.
SBATCH_RESERVATION     Use a reservation with sbatch.
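
A short sketch of the environment-variable route for sbatch. The reservation name debug_res and the script job.sh are hypothetical placeholders. An explicit --reservation on the command line overrides the variable:

```shell
# Hypothetical reservation name, used only for illustration.
name=debug_res

# Every sbatch call from this shell now targets the reservation,
# unless --reservation is given explicitly on the command line.
export SBATCH_RESERVATION="$name"

# Submit only where Slurm is installed; job.sh is a placeholder script.
if command -v sbatch >/dev/null 2>&1 && [ -f job.sh ]; then
    sbatch job.sh
fi
```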

Priority Factors

List of configurable priority factors…

Factor        Description
age           …time the job has been waiting in the queue
association   …priority assigned to the job's association
fair-share    …relation to resources consumed in the past
size          …size of the resources a job allocates
nice          …factor controlled by users
partition     …priority associated with a partition
qos           …quality of service associated with the job
site          …factor dictated by admins
tres          …factor associated with the resources requested

Job priority

  • …weighted sum of all the factors that have been enabled
  • …integer that ranges between 0 and 4294967295
  • …the larger the number …the higher the job will be positioned in the queue
# ...list jobs in priority order with requested resources
squeue --priority --format="%.10A %.8Q %.3D %.3H %.3I %.3J %.10l %.10m %n" --sort=-p,i --state=PD

# ...modify the priority of a job
scontrol update job=$job_id priority=$priority
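
The job priority is a weighted sum: each enabled factor (normalized to 0.0…1.0) is multiplied by its PriorityWeight* setting, and the nice value is subtracted at the end. A worked example follows. All numbers except the fair-share weight of 8000 are hypothetical, and only four factors are assumed to be enabled:

```shell
# Hypothetical weights and factors; real weights come from slurm.conf.
priority=$(awk 'BEGIN {
    # PriorityWeight* values (integers configured by the admin)
    w_age = 1000; w_fairshare = 8000; w_jobsize = 500; w_qos = 2000
    # Normalized factors, each between 0.0 and 1.0
    f_age = 0.25; f_fairshare = 0.5; f_jobsize = 0.5; f_qos = 1.0
    # Value passed with --nice, subtracted from the sum
    nice = 100

    p  = w_age * f_age               #  250
    p += w_fairshare * f_fairshare   # 4000
    p += w_jobsize * f_jobsize       #  250
    p += w_qos * f_qos               # 2000
    p -= nice                        # -100
    printf "%d", p
}')
echo "job priority: $priority"   # job priority: 6400
```

With these numbers the fair-share term dominates, which is the usual effect of giving PriorityWeightFairShare the largest weight.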

Nice

Users can adjust the priority of their own jobs…

  • …positive values negatively impact a job’s priority
  • …negative values increase a job’s priority (privileged users only)
  • …ranges from -2147483645 to +2147483645
  • …the backfill algorithm may still run a lower-priority job before a higher-priority job
# ...put specified job first in queue for user
scontrol top $job_list

# ...specify a low-priority job
sbatch --nice=10000 ...

Fair Share

Configuration…

  • PriorityWeightFairShare …weight the fair-share factor
  • PriorityDecayHalfLife …half-life after which recorded usage counts only half as much …if set to zero, no decay happens
  • PriorityUsageResetPeriod …reset the usage counters periodically
>>> scontrol show config | grep -e ^PriorityWeightFairShare -e ^PriorityDecayHalfLife -e ^PriorityUsageResetPeriod
PriorityDecayHalfLife   = 7-00:00:00
PriorityUsageResetPeriod = NONE
PriorityWeightFairShare = 8000
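
With a half-life of 7 days, as in the output above, usage recorded a week ago counts half as much as usage recorded now. A small sketch, assuming plain exponential decay and ignoring how often Slurm recalculates. The helper decay_weight is hypothetical, not a Slurm command:

```shell
# Weight that past usage still carries: weight = 0.5 ^ (age / half_life).
# decay_weight is a hypothetical helper for illustration only.
decay_weight() {
    awk -v a="$1" -v h="$2" 'BEGIN { printf "%.4f", 0.5 ^ (a / h) }'
}

half_life_days=7    # PriorityDecayHalfLife = 7-00:00:00
for age_days in 0 7 14 28; do
    echo "$age_days days old: $(decay_weight "$age_days" "$half_life_days")"
done
# 0 days old: 1.0000
# 7 days old: 0.5000
# 14 days old: 0.2500
# 28 days old: 0.0625
```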

Usage and Shares are the two components of the fair-share factor:

  • Shares are assigned to associations…
    • …representing its “part” of the system (similar to slices of a pie)
    • …normalized to 0.0…1.0
  • Usage …represents the association's proportional usage of the system …value between 0.0 and 1.0
  • If Shares == Usage, you have hit your fair-share target.
sacctmgr list accounts withassoc format=account,user,share
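
To illustrate how Shares and Usage interact, here is a sketch of the classic fair-share formula F = 2^(-Usage/Shares), assuming no dampening factor. This applies only to the classic algorithm. Fair Tree ranks associations instead of using this formula. The helper fairshare_factor and the input values are hypothetical:

```shell
# Classic formula: F = 2 ^ (-Usage / Shares), both normalized to 0.0-1.0.
# fairshare_factor is a hypothetical helper for illustration only.
fairshare_factor() {
    awk -v u="$1" -v s="$2" 'BEGIN { printf "%.3f", 2 ^ (-u / s) }'
}

# An association holding a quarter of the shares (0.25):
echo "unused:      $(fairshare_factor 0.00 0.25)"   # 1.000
echo "on target:   $(fairshare_factor 0.25 0.25)"   # 0.500
echo "over target: $(fairshare_factor 0.50 0.25)"   # 0.250
```

Under this formula a factor of 0.5 marks the point where Usage equals Shares. Values above 0.5 indicate an under-served association, values below 0.5 an over-served one.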

The algorithm used to determine the fair-share factor is highly configurable…

  • …default algorithm can be replaced by setting options with PriorityFlags
  • FAIR_TREE …Fair Tree is the default since Slurm 19.05, where the flag became NO_FAIR_TREE
  • DEPTH_OBLIVIOUS