Slurm - Configuration Files

HPC

Published February 17, 2016 · Modified October 17, 2023

Nodes

Slurm node configuration

  • …recommended to use Include nodes.conf …decouple the node configuration into a dedicated file (see the sketch after this list)
  • …changes require a restart of slurmctld and all slurmd
  • Only the NodeName must be supplied in the configuration
    • …other node configuration information is optional
    • …resources are checked at node registration time
    • …CPUs, RealMemory and TmpDisk …nodes are set DOWN if resources do not match
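A minimal sketch of the decoupling, assuming the node definitions live in a file nodes.conf next to slurm.conf:

# slurm.conf …pull the node definitions from a dedicated file
Include nodes.conf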

NodeName

Name that Slurm uses to refer to a node…

  • …string that hostname -s returns
  • …needs to be resolvable by DNS or /etc/hosts
  • …a single node name cannot appear more than once
  • …specification using hostlist expressions, for example lx[15,18,32-33]

Nodes require a specification of the hardware resources they provide…

  • Boards …number of Baseboards
  • SocketsPerBoard …physical processor sockets/chips on a baseboard
  • CoresPerSocket …cores in a single physical processor socket
  • ThreadsPerCore …logical threads in a single physical core
  • CPUs …logical processors
  • RealMemory …real memory on the node in megabytes …for example /proc/meminfo reports 195981480 kB / 1024 = 191388 MiB
  • Gres …comma-delimited list of generic resources
  • Features …comma-delimited list of characteristics associated with the node

slurmd -C prints the actual hardware configuration of a given node
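Its output has the same format as a NodeName line, so it can be pasted into the node configuration (values below are illustrative):

>>> slurmd -C
NodeName=lx01 CPUs=256 Boards=1 SocketsPerBoard=2 CoresPerSocket=64 ThreadsPerCore=2 RealMemory=515425
UpTime=12-06:33:59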

# examples
NodeName=lx[01-10] Feature=amd,epyc,7713 CPUs=256 ... ThreadsPerCore=2 RealMemory=515425
NodeName=lx[11,12] Feature=amd,epyc,7413,mi100 Gres=gpu:8 ...

Slurm checks if nodes provide the specified resources…

  • …otherwise it emits an error: Setting node $node state...
  • …followed by a reason similar to…
# a GPU is missing
gres/gpu count reported lower than configured (0 < 1)
# RAM is missing
Low RealMemory (reported:257558 < 100.00% of configured:257649)
  • sinfo shows the node with state INVALID_REG
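To investigate and clear such a state, commands like the following can be used (node name illustrative):

# show the reason recorded for all unavailable nodes
>>> sinfo -R
# after fixing the configuration or the hardware, return the node to service
>>> scontrol update NodeName=lx01 State=RESUME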

NodeSet

…allows you to define a name for a specific set of nodes

  • …used to simplify the partition configuration section
  • Each NodeSet
    • Nodes= …defined by an explicit list of nodes
    • Feature= …filtering the nodes by a particular feature
    • …can be a union of two sub-sets
  • …not usable outside of the partition configuration
NodeSet=all Nodes=lxb[1130-1168]
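A hypothetical sketch combining both variants, filtering nodes by a feature from the node configuration and referencing the set in a partition:

NodeSet=amd Feature=amd
PartitionName=amd Nodes=amd State=UP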

DownNodes

Record state of nodes which are temporarily…

  • …in DOWN, DRAIN or FAILING state
  • …without altering permanent configuration under a NodeName= specification
  • State=FUTURE
    • …node is defined for future use
    • …made available by changing the configuration and running scontrol reconfigure
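A sketch of both use-cases (node names and reason are illustrative):

# temporarily mark nodes down without touching their NodeName= definition
DownNodes=lx[05-06] State=DOWN Reason="PSU replacement"
# define a node now, make it available later
NodeName=lx99 State=FUTURE CPUs=256 RealMemory=515425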

Partitions

Establish job limits and access controls for groups of nodes…

  • …nodes may be in more than one partition
  • …jobs are allocated resources within a single partition

PartitionName= …specified by users when submitting jobs

  • Nodes= …comma-separated list of nodes or nodesets
    • …nodes associated with a partition provide the available resources
    • ALL maps to all nodes configured in the cluster
  • Default=YES …for jobs without partition specification
PartitionName=debug Nodes=all Default=YES MaxTime=INFINITE State=UP
PartitionName=gpu Nodes=lxb[001-100] State=UP DefaultTime=02:00:00 MaxTime=7-00:00:00 #... 

Apply changes with scontrol reconfigure as administrator

State

State of partitions

State=     Queue New Jobs   Jobs Continue Execution   New Job Allocations
UP         yes              yes                       yes
DOWN       yes              yes                       no
DRAIN      no               yes                       yes
INACTIVE   no               yes                       no
# alter the partitions configuration without modifying a configuration file
scontrol update PartitionName=debug State=drain

Specifications

Access control…

  • AllowAccounts …accounts which may execute jobs (default ALL)
  • DenyAccounts …accounts which may not execute jobs
  • AllowGroups …group names which may execute jobs
    • …unset by default …all groups are allowed
    • root & SlurmUser always allowed
  • AllowQos …Qos which may execute jobs
  • DenyQos …Qos which may not execute jobs
  • DisableRootJobs=YES …root will be prevented from running any jobs

Run-Time …format [days-]hours:minutes:seconds

  • DefaultTime …run time limit if not specified
  • MaxTime …maximum run time limit for jobs
  • OverTimeLimit …number of minutes jobs can exceed run-time
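A hypothetical partition line combining access controls with run-time limits:

PartitionName=batch Nodes=all AllowGroups=hpcusers DefaultTime=01:00:00 MaxTime=3-00:00:00 State=UP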

Scheduler

The scheduler determines what job to execute next.

  • …considers…
    • pending jobs
    • allocatable resources
    • resource constraints
    • account limits (defined by administrators/coordinators)
  • …loops through the jobs in the scheduler queue…
    • …grants resource allocations…
    • …over a period of time following priorities

Performance

slurmctld won't respond to client requests during scheduling…

  • …with a huge number of jobs the run-time of the scheduling mechanism becomes extensive
  • …may be completely unresponsive to user commands like sinfo or squeue
  • Optimization…
    • …balance between responsiveness…
    • …efficient allocation of resources for maximum utilization
    • Rule of thumb…
    • …utilization improves as more pending jobs are included in the priority calculation
    • …overall run-time of the priority calculation depends on the number of jobs

Designed to perform a quick-scheduling attempt at frequent intervals…

  • …when a job is submitted, completed or the job configuration changes
  • Slower and more comprehensive scheduling is performed less frequently
  • sdiag command shows information related to slurmctld scheduler performance

Quick Scheduling

Quick scheduling is designed to provide nearly instant response when possible. It considers recently added jobs and a limited number of jobs already prioritized and waiting in queue. Multiple configuration options are used to govern the depth of queued jobs to include during a scheduling cycle:

  • default_queue_depth defines how far down the job queue to test (default 100). Once any task for a job array is left pending, no other tasks in that job array are considered for scheduling. A user submitting hundreds of individual jobs at once may hamper the efficiency of quick scheduling.
  • partition_job_depth defines how many jobs are tested in any single partition (default 0, no limit). Once any job in a partition is left pending, no other jobs in that partition are considered for scheduling.

Two configuration options are available to adjust the timing of quick scheduling execution. Continuous execution of quick scheduling will lock slurmctld and make the system unresponsive. On systems where users submit a lot of individual jobs with a short run-time, delaying quick scheduling should therefore be considered:

  • batch_sched_delay sets the delay in seconds scheduling of jobs can be postponed. This can be useful in a high-throughput environments in which batch jobs are submitted at a very high rate (looping sbatch for example). For example, if many jobs are submitted each second, the overhead of trying to schedule each one will adversely impact the rate at which jobs can be submitted.
  • defer avoids attempting to schedule each job individually. Defer scheduling until a later time when scheduling multiple jobs simultaneously may be possible (disables quick scheduling). This option may improve system responsiveness when large numbers of jobs (many hundreds) are submitted at the same time, but it will delay the initiation time of individual jobs.
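A sketch of how these options appear in slurm.conf (values illustrative, not recommendations):

# postpone quick scheduling on a submission-heavy system
SchedulerParameters=batch_sched_delay=20,defer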

Main Scheduler Loop

The main scheduling loop includes all pending jobs and calculates priorities comprehensively. This is the most expensive (in terms of run-time) operation for slurmctld. The execution frequency of the main scheduling loop is influenced by many configuration parameters, however boundaries are defined with:

  • The value of default_queue_depth is ignored. The main scheduling loop runs until reaching the configured max_sched_time time limit (default value is half of MessageTimeout).
  • sched_interval configures how frequently, in seconds, the main scheduling loop will execute and test all pending jobs. The default value is 60 seconds.
  • sched_min_interval sets the minimum time, in microseconds, between the end of one scheduling cycle and the beginning of the next (defaults to a very small value; high throughput environments use much larger values, cf. the 2000000 µs in the example below). Triggering events do not start the scheduling logic immediately, but only within the configured sched_interval.
  • sched_max_job_start defines the maximum number of jobs to be started per scheduling cycle (defaults to zero, no limit)
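An illustrative tuning of the main loop in slurm.conf (values are assumptions):

# comprehensive loop every 2 minutes, at most 200 job starts and 4 seconds per cycle
SchedulerParameters=sched_interval=120,sched_max_job_start=200,max_sched_time=4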

The main scheduling mechanism has a dedicated block in the output of the sdiag command:

>>> sdiag
...
Main schedule statistics (microseconds):
    Last cycle:   28198
    Max cycle:    1831377
    Total cycles: 1219
    Mean cycle:   50729
    Mean depth cycle:  25
    Cycles per minute: 2
    Last queue length: 18
...

Global Queue limits

  • MaxJobCount maximum number of jobs active in slurmctld at one time
    • …prevents slurmctld from exhausting memory or other resources
    • …when the limit is reached, submitting additional jobs fails (default value is 10000)
    • …each task of a job array counts as one job
  • MaxSubmitJobs prevents a single user from filling the system (see the sketch after this list)
  • MaxArraySize maximum job array task index value
  • …value of MaxJobCount should be much larger than MaxArraySize
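MaxJobCount and MaxArraySize are set in slurm.conf, while MaxSubmitJobs is typically enforced as an association limit in the accounting database (values illustrative):

# slurm.conf
MaxJobCount=100000
MaxArraySize=10000
# accounting database, for example per user association
>>> sacctmgr modify user where name=jdoe set MaxSubmitJobs=500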

SchedulerType

…selects the scheduling mechanism…

  • sched/backfill (default)
    • …lower priority jobs can start earlier
    • …fill idle slots provided they finish before the next high priority jobs start
    • …used on the majority of systems.
  • sched/builtin jobs run in FIFO (first-in-first-out) mode.
  • sched/hold jobs are scheduled by administrators.
  • …additional configuration exists to connect external scheduling mechanisms

SchedulerParameters

…configures the scheduling mechanism…key=value pairs in a comma separated list

# Example from the _High Throughput Computing Administration Guide_
#     https://slurm.schedmd.com/high_throughput.html
batch_sched_delay=20
bf_continue
bf_interval=300
bf_min_age_reserve=10800
bf_resolution=600
bf_yield_interval=1000000
partition_job_depth=500
sched_max_job_start=200
sched_min_interval=2000000
  • sched_min_interval …a lower value means faster scheduling at the cost of higher CPU load
    • High throughput environments use values of 50000+
  • bf_yield_interval …more responsiveness
    • …how often the backfill scheduler relinquishes its locks in order to answer client requests
>>> scontrol show config | grep -i sched
...
SchedulerParameters     = bf_max_job_start=300,bf_max_job_test=400,default_queue_depth=200,max_rpc_cnt=100,defer
SchedulerType           = sched/backfill
...

Backfill Configuration

Research has shown that this algorithm can increase the density of supercomputer resource use by 20% and decrease the average waiting time before jobs are set for execution.

  • Requires all jobs to be submitted with --time
    • …since many users take the defaults and only use a fraction of that time
    • …encourage users to set time limits accurately (as small as possible)
  • Expected start time of pending jobs depends upon…
    • …expected completion time of running jobs
    • …reasonably accurate time limits are valuable
    • Otherwise backfill will not work efficiently
  • Partition configuration options DefaultTime and MaxTime
    • …define the boundaries for job run-times….
    • …used if the job owner does not specify limits when submitting the job
  • Global configuration option OverTimeLimit
    • …defines the amount by which a job can exceed its time limit before it is killed
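A sketch in slurm.conf terms (values illustrative):

# partition-level boundaries for job run-times
PartitionName=main Nodes=all DefaultTime=02:00:00 MaxTime=7-00:00:00 State=UP
# globally allow jobs to exceed their limit by 10 minutes before being killed
OverTimeLimit=10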

The sdiag command includes a dedicated block with backfilling-specific information:

>>> sdiag
...
Backfilling stats
        Total backfilled jobs (since last slurm start): 405486
        Total backfilled jobs (since last stats cycle start): 8115
        Total backfilled heterogeneous job components: 0
        Total cycles: 989
        Last cycle when: Fri Nov 19 09:56:01 2021 (1637312161)
        Last cycle: 1976492                 # (usec) run-time of last cycle
        Max cycle:  3751612                 # longest run-time (since last reset)
        Mean cycle: 1962912                 # mean run-time (since last reset)
        Last depth cycle: 1660              # jobs processed during last run
        Last depth cycle (try sched): 1660
        Depth Mean: 1624
        Depth Mean (try depth): 1624
        Last queue length: 5327             # number of jobs pending
        Queue length mean: 6248             # mean count of jobs pending
...

Timing & Frequency

Backfill scheduling is a time consuming operation…

  • …locks are periodically released briefly
  • …so that other operations can be processed (e.g. submit new jobs)

Options related to timing/frequency of the backfill mechanism execution:

  • bf_continue
    • …continues backfill scheduling after releasing locks
    • …permits consideration of more jobs…
    • …may result in the delayed scheduling of newly submitted jobs
  • bf_interval
    • …interval between backfill scheduling attempts
    • …default value is 30 seconds
  • bf_yield_sleep
    • …time that backfill scheduler sleeps for when locks are released
    • …default value 500000 usec (0.5 sec)
  • bf_yield_interval
    • …time between backfill scheduler lock release
    • …tells the scheduler how often to relinquish operations…
    • …to answer client requests (more responsiveness)
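Combined into a SchedulerParameters sketch (values illustrative):

SchedulerParameters=bf_continue,bf_interval=60,bf_yield_interval=1000000,bf_yield_sleep=1000000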

Depth

Options related to the depth (number of jobs) considered during backfill:

  • bf_window determines how long into the future to look. The default value is 1440 minutes (one day). Higher values result in more overhead, less responsiveness and higher memory consumption. Too small of a value will starve large jobs indefinitely. In order to limit the amount of data managed by the backfill scheduler, if the value of bf_window is increased, then it is generally advisable to also increase bf_resolution.
  • bf_resolution defines the time resolution of backfill scheduling. Should be increased proportionally when adjusting bf_window. For example: bf_window=11520,bf_resolution=600 (8 days, 10 minutes). A larger bf_resolution results in faster backfill scheduling due to reduced granularity in the time-slices considered during calculations. A resolution between 300-600 is the most common (default 60). Tiny jobs will not benefit as much from bf_resolution.
  • bf_max_job_test maximum number of jobs to consider for backfill scheduling (default 100).
  • bf_max_job_start maximum number of jobs the backfill scheduler may start (default value is 0, no limit).
  • bf_max_job_part maximum number of jobs per partition to consider for backfill scheduling (default value is 0, no limit).
  • bf_max_job_user maximum number of jobs per user to consider for backfill scheduling (default value is 0, no limit).
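A sketch widening the backfill window while keeping the overhead bounded, following the proportionality rule for bf_resolution above (other values illustrative):

SchedulerParameters=bf_window=11520,bf_resolution=600,bf_max_job_test=1000,bf_max_job_user=50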

Cgroups

Slurm uses cgroups to constrain different resources to jobs…

  • …and to get accounting about these resources
  • Supports two cgroup modes…
    • …cgroups v1 …legacy mode (rewritten in 21.08)
    • …cgroups v2 …unified mode (added in 22.05)
    • …nodes have either v1 or v2 enabled
    • …hybrid nodes with both v1 and v2 not supported
  • SchedMD documentation…

Plugins

Enable Cgroup plugins in slurm.conf

# ...process tracking and management with Cgroups
ProctrackType=proctrack/cgroup
# ...constraining resources with Cgroups
TaskPlugin=task/cgroup,task/affinity
# ...gather job statistics with Cgroups
JobAcctGatherType=jobacct_gather/cgroup

proctrack/cgroup

Keeps track of all processes in a job…

  • stores the PIDs in a specific hierarchy in the cgroup tree
  • …signal PIDs when instructed (for example to send SIGTERM)
  • …no specific options for this plugin in cgroup.conf
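The tracked PIDs of a job can be listed on the node executing it (job ID illustrative):

>>> scontrol listpids 1234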

task/cgroup

Constrains resources to a job/step/task…

  • ensures that the boundaries of an allocation are not violated
  • Confines to the…
    • …allocated CPUs
    • …specific memory resources
    • …allocated GRES (including GPUs)
  • …uses the Cgroups cpuset, memory and device sub-systems
  • …multiple options in cgroup.conf apply to this plugin

Recommended to stack TaskPlugin=task/affinity,task/cgroup

  • …when configuring ConstrainCores=yes in cgroup.conf
  • …enables --cpu-bind and/or --mem-bind
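For example (a sketch, actual binding options depend on the site configuration):

# bind tasks to physical cores and report the binding
>>> srun --cpu-bind=verbose,cores ./app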

jobacct_gather/cgroup

Collects CPU and memory statistics…

  • …uses the Cgroup cpuacct and memory sub-systems
  • …reads the cgroup statistics files (memory.stat and similar) for the entire sub-tree of PIDs
  • …no specific options for this plugin in cgroup.conf
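The gathered statistics end up in the accounting records and can be queried afterwards, for example (job ID illustrative):

>>> sacct -j 1234 -o JobID,MaxRSS,AveCPU,TotalCPU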

cgroup.conf

Configuration file for the cgroup support…

  • …located in the same directory as the slurm.conf
  • changes take effect upon restart of slurmd (unless otherwise noted)
CgroupPlugin=autodetect
ConstrainCores=yes        # constrain allowed cores to the subset of allocated resources
ConstrainDevices=yes      # constrain allowed devices based on GRES allocated resources 
ConstrainRAMSpace=yes     # set memory soft & hard limits

Select Plugin

Select plugin …responsible for selecting the resources to be allocated to a job…

  • …aware of the system's topology …data structures established by the topology plugin
  • Multiple plugins…
    • select/linear …allocates whole nodes to jobs
    • select/cons_res …allocate individual sockets, cores, threads, or CPUs within a node
    • select/cons_tres …expands cons_res functionality to allocate generic resources (like GPUs)
    • …communicating with an external entity to perform these actions

cons_res (consumable resources) and cons_tres (consumable trackable resources)

# ...excerpt from sample slurm.conf file
SelectType=select/cons_tres
  • …jobs can be co-scheduled on nodes when resources permit it …enabled/disabled cluster-wide
  • …plugin is enabled via SelectType and SelectTypeParameters in the slurm.conf
  • Related user options…
    • --exclusive …allows users to request nodes in exclusive mode if required
    • --oversubscribe …incompatible with consumable resources …will not be honored
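A sketch of the corresponding slurm.conf lines, treating cores and memory as consumable resources (parameter choice illustrative):

SelectType=select/cons_tres
SelectTypeParameters=CR_Core_Memory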

Trackable Resources

Trackable RESources (TRES) …monitor consumable resource

# ...default configuration
>>> scontrol show config \
      | grep -e ^AccountingStorageTRES -e ^PriorityWeightTRES -e ^PriorityFlags -e TRESBillingWeights
AccountingStorageTRES   = cpu,mem,energy,node
PriorityFlags           = 
PriorityWeightTRES      = (null)
  • AccountingStorageTRES
    • …defines which consumable resources are tracked
    • …by default CPU, Energy, Memory, and Node are tracked, whether specified or not
  • PriorityWeightTRES
    • …comma separated list of resource types and associated billing weights
    • …defines the degree to which each resource contributes to the job’s priority.
>>> scontrol show config | grep ^PriorityWeightTRES
PriorityWeightTRES      = CPU=1000,Mem=1000

TRESBillingWeights …weights for each partition

  • …contributing to the calculation of the job resource consumption
  • …specified as a comma-separated list of type-weight pairs
  • …base unit can be adjusted with the suffix K,M,G,T or P.
  • …by default the sum of all tracked resources multiplied by their weight
>>> scontrol show partition | grep -E -e 'PartitionName=(main|long)' -e TRESBillingWeights
PartitionName=main
   TRESBillingWeights=cpu=1.0,mem=.25G
PartitionName=long
   TRESBillingWeights=cpu=1.5,mem=.50G

Incorporates all resources into the aggregation of consumed resources for a particular job…

sum(<type>*<weight>,[])
  • …alternatively the MAX_TRES flag
  • …considers only the tracked resource with the biggest contribution to the resource consumption
  • …basically the most expensive resource determines the cost of a job with regard to its fair-share factor
max(<type>*<weight>,[])
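As a worked example with the TRESBillingWeights of partition main above, a job allocating 8 CPUs and 16 GB of memory (hypothetical numbers) is billed sum(8*1.0, 16*0.25) = 12 by default, but max(8*1.0, 16*0.25) = 8 with the MAX_TRES flag set.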