Slurm - Configuration Files

HPC

Published February 17, 2016 · Modified October 17, 2023

Nodes

Slurm node configuration

  • …recommended to use Include nodes.conf …decouple the node configuration into a dedicated file (see the sketch after this list)
  • …changes require a restart of slurmctld and all slurmd
  • Only the NodeName must be supplied in the configuration
    • …other node configuration information is optional
    • …resources are checked at node registration time
    • …CPUs, RealMemory and TmpDisk …nodes are set DOWN if resources do not match
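A minimal sketch of the decoupling, assuming the node definitions live in a file nodes.conf next to slurm.conf:

# slurm.conf …pull the node definitions from a dedicated file
Include nodes.conf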

NodeName

Name that Slurm uses to refer to a node…

  • …string that hostname -s returns
  • …needs to be resolvable by DNS or /etc/hosts
  • …a single node name cannot appear more than once
  • …specification using hostlist expressions, for example lx[15,18,32-33]

Nodes require a specification of the hardware resources they provide…

  • Boards …number of Baseboards
  • SocketsPerBoard …physical processor sockets/chips on a baseboard
  • CoresPerSocket …cores in a single physical processor socket
  • ThreadsPerCore …logical threads in a single physical core
  • CPUs …logical processors
  • RealMemory …real memory on the node in megabytes …for example /proc/meminfo reports 195981480 kB / 1024 = 191388 MiB
  • Gres …comma-delimited list of generic resources
  • Features …comma-delimited list of characteristics associated with the node

slurmd -C prints the actual hardware configuration of a given node
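Its output has the same format as a NodeName line, so it can be pasted into the node configuration (values below are illustrative):

>>> slurmd -C
NodeName=lx01 CPUs=256 Boards=1 SocketsPerBoard=2 CoresPerSocket=64 ThreadsPerCore=2 RealMemory=515425
UpTime=12-06:33:59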

# examples
NodeName=lx[01-10] Feature=amd,epyc,7713 CPUs=256 ... ThreadsPerCore=2 RealMemory=515425
NodeName=lx[11,12] Feature=amd,epyc,7413,mi100 Gres=gpu:8 ...

Slurm checks if nodes provide the specified resources…

  • …otherwise it emits an error: Setting node $node state...
  • …followed by a reason similar to…
# a GPU is missing
gres/gpu count reported lower than configured (0 < 1)
# RAM is missing
Low RealMemory (reported:257558 < 100.00% of configured:257649)
  • sinfo shows the node with state INVALID_REG
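To investigate and clear such a state, commands like the following can be used (node name illustrative):

# show the reason recorded for all unavailable nodes
>>> sinfo -R
# after fixing the configuration or the hardware, return the node to service
>>> scontrol update NodeName=lx01 State=RESUME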

NodeSet

…allows you to define a name for a specific set of nodes

  • …used to simplify the partition configuration section
  • Each NodeSet
    • Nodes= …defined by an explicit list of nodes
    • Feature= …filtering the nodes by a particular feature
    • …can be a union of two sub-sets
  • …not usable outside of the partition configuration
NodeSet=all Nodes=lxb[1130-1168]
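A hypothetical sketch combining both variants, filtering nodes by a feature from the node configuration and referencing the set in a partition:

NodeSet=amd Feature=amd
PartitionName=amd Nodes=amd State=UP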

DownNodes

Record state of nodes which are temporarily…

  • …in DOWN, DRAIN or FAILING state
  • …without altering permanent configuration under a NodeName= specification
  • State=FUTURE
    • …node is defined for future use
    • …made available by changing the configuration and running scontrol reconfigure
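A sketch of both use-cases (node names and reason are illustrative):

# temporarily mark nodes down without touching their NodeName= definition
DownNodes=lx[05-06] State=DOWN Reason="PSU replacement"
# define a node now, make it available later
NodeName=lx99 State=FUTURE CPUs=256 RealMemory=515425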

Partitions

Establish job limits and access controls for groups of nodes…

  • …nodes may be in more than one partition
  • …jobs are allocated resources within a single partition

PartitionName= …specified by users when submitting jobs

  • Nodes= …comma-separated list of nodes or nodesets
    • …nodes associated with a partition provide the available resources
    • ALL maps to all nodes configured in the cluster
  • Default=YES …for jobs without partition specification
PartitionName=debug Nodes=all Default=YES MaxTime=INFINITE State=UP
PartitionName=gpu Nodes=lxb[001-100] State=UP DefaultTime=02:00:00 MaxTime=7-00:00:00 #... 

Apply changes with scontrol reconfigure as administrator

State

State of partitions

State=     Queue New Jobs   Jobs Continue Execution   New Job Allocations
UP         yes              yes                       yes
DOWN       yes              yes                       no
DRAIN      no               yes                       yes
INACTIVE   no               yes                       no
# alter the partitions configuration without modifying a configuration file
scontrol update PartitionName=debug State=drain

Specifications

Access control…

  • AllowAccounts …accounts which may execute jobs (default ALL)
  • DenyAccounts …accounts which may not execute jobs
  • AllowGroups …group names which may execute jobs
    • …unset by default …all groups are allowed
    • root & SlurmUser always allowed
  • AllowQos …Qos which may execute jobs
  • DenyQos …Qos which may not execute jobs
  • DisableRootJobs=YES …root will be prevented from running any jobs

Run-Time …format [days-]hours:minutes:seconds

  • DefaultTime …run time limit if not specified
  • MaxTime …maximum run time limit for jobs
  • OverTimeLimit …number of minutes jobs can exceed run-time
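A hypothetical partition line combining access controls with run-time limits:

PartitionName=batch Nodes=all AllowGroups=hpcusers DefaultTime=01:00:00 MaxTime=3-00:00:00 State=UP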

Scheduler

The scheduler determines what job to execute next.

  • …considers…
    • pending jobs
    • allocatable resources
    • resource constraints
    • account limits (defined by administrators/coordinators)
  • …loops through the jobs in the scheduler queue…
    • …grants resource allocations…
    • …over a period of time following priorities

Performance

slurmctld won't respond to client requests during scheduling…

  • …with a huge number of jobs the run-time of the scheduling mechanism becomes extensive
  • …may be completely unresponsive to user commands like sinfo or squeue
  • Optimization…
    • …balance between responsiveness…
    • …efficient allocation of resources for maximum utilization
    • Rule of thumb…
    • …utilization improves as more pending jobs are included in the priority calculation
    • …overall run-time of the priority calculation depends on the number of jobs

Designed to perform a quick-scheduling attempt at frequent intervals…

  • …when a job is submitted, completed or the job configuration changes
  • Slower and more comprehensive scheduling is performed less frequently
  • sdiag command shows information related to slurmctld scheduler performance

Quick Scheduling

Quick scheduling is designed to provide nearly instant response when possible. It considers recently added jobs and a limited number of jobs already prioritized and waiting in queue. Multiple configuration options are used to govern the depth of queued jobs to include during a scheduling cycle:

  • default_queue_depth defines how far down the job queue to test (default 100). Once any task for a job array is left pending, no other tasks in that job array are considered for scheduling. A user submitting hundreds of individual jobs at once may hamper the efficiency of quick scheduling.
  • partition_job_depth defines how many jobs are tested in any single partition (default 0, no limit). Once any job in a partition is left pending, no other jobs in that partition are considered for scheduling.

Two configuration options are available to adjust the timing of quick scheduling execution. Continuous execution of quick scheduling will lock slurmctld and make the system unresponsive. On systems where users submit a lot of individual jobs with a short run-time, delaying quick scheduling should therefore be considered:

  • batch_sched_delay sets the delay in seconds scheduling of jobs can be postponed. This can be useful in a high-throughput environments in which batch jobs are submitted at a very high rate (looping sbatch for example). For example, if many jobs are submitted each second, the overhead of trying to schedule each one will adversely impact the rate at which jobs can be submitted.
  • defer avoids attempting to schedule each job individually. Defer scheduling until a later time when scheduling multiple jobs simultaneously may be possible (disables quick scheduling). This option may improve system responsiveness when large numbers of jobs (many hundreds) are submitted at the same time, but it will delay the initiation time of individual jobs.
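A sketch of how these options appear in slurm.conf (values illustrative, not recommendations):

# postpone quick scheduling on a submission-heavy system
SchedulerParameters=batch_sched_delay=20,defer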

Main Scheduler Loop

The main scheduling loop includes all pending jobs and calculates priorities comprehensively. This is the most expensive (in terms of run-time) operation for slurmctld. The execution frequency of the main scheduling loop is influenced by many configuration parameters, however boundaries are defined with:

  • The value of default_queue_depth is ignored. The main scheduling loop runs until reaching the configured max_sched_time time limit (default value is half of MessageTimeout).
  • sched_interval configures how frequently, in seconds, the main scheduling loop will execute and test all pending jobs. The default value is 60 seconds.
  • sched_min_interval sets the minimum time, in microseconds, between the end of one scheduling cycle and the beginning of the next (defaults to a very small value; high throughput environments use much larger values, cf. the 2000000 µs in the example below). Triggering events do not start the scheduling logic immediately, but only within the configured sched_interval.
  • sched_max_job_start defines the maximum number of jobs to be started per scheduling cycle (defaults to zero, no limit)
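An illustrative tuning of the main loop in slurm.conf (values are assumptions):

# comprehensive loop every 2 minutes, at most 200 job starts and 4 seconds per cycle
SchedulerParameters=sched_interval=120,sched_max_job_start=200,max_sched_time=4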

The main scheduling mechanism has a dedicated block in the output of the sdiag command:

>>> sdiag
...
Main schedule statistics (microseconds):
    Last cycle:   28198
    Max cycle:    1831377
    Total cycles: 1219
    Mean cycle:   50729
    Mean depth cycle:  25
    Cycles per minute: 2
    Last queue length: 18
...

Global Queue limits

  • MaxJobCount maximum number of jobs active in slurmctld at one time
    • …prevents slurmctld from exhausting memory or other resources
    • …when the limit is reached, submitting additional jobs fails (default value is 10000)
    • …each task of a job array counts as one job
  • MaxSubmitJobs prevents a single user from filling the system (see the sketch after this list)
  • MaxArraySize maximum job array task index value
  • …value of MaxJobCount should be much larger than MaxArraySize
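MaxJobCount and MaxArraySize are set in slurm.conf, while MaxSubmitJobs is typically enforced as an association limit in the accounting database (values illustrative):

# slurm.conf
MaxJobCount=100000
MaxArraySize=10000
# accounting database, for example per user association
>>> sacctmgr modify user where name=jdoe set MaxSubmitJobs=500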

SchedulerType

…selects the scheduling mechanism…

  • sched/backfill (default)
    • …lower priority jobs can start earlier
    • …fill idle slots provided they finish before the next high priority jobs start
    • …used on the majority of systems.
  • sched/builtin jobs run in FIFO (first-in-first-out) mode.
  • sched/hold jobs are scheduled by administrators.
  • …additional configuration exists to connect external scheduling mechanisms

SchedulerParameters

…configures the scheduling mechanism…key=value pairs in a comma separated list

# Example from the _High Throughput Computing Administration Guide_
#     https://slurm.schedmd.com/high_throughput.html
batch_sched_delay=20
bf_continue
bf_interval=300
bf_min_age_reserve=10800
bf_resolution=600
bf_yield_interval=1000000
partition_job_depth=500
sched_max_job_start=200
sched_min_interval=2000000
  • sched_min_interval …a lower value means faster scheduling at the cost of higher CPU load
    • High throughput environments use values of 50000+
  • bf_yield_interval …more responsiveness
    • …how often the backfill scheduler relinquishes its locks in order to answer client requests
>>> scontrol show config | grep -i sched
...
SchedulerParameters     = bf_max_job_start=300,bf_max_job_test=400,default_queue_depth=200,max_rpc_cnt=100,defer
SchedulerType           = sched/backfill
...

Backfill Configuration

Research has shown that this algorithm can increase the density of supercomputer resource use by 20% and decrease the average waiting time before jobs are set for execution.

  • Requires all jobs to be submitted with --time
    • …since many users take the defaults and only use a fraction of that time
    • …encourage users to set time limits accurately (as small as possible)
  • Expected start time of pending jobs depends upon…
    • …expected completion time of running jobs
    • …reasonably accurate time limits are valuable
    • Otherwise backfill will not work efficiently
  • Partition configuration options DefaultTime and MaxTime
    • …define the boundaries for job run-times….
    • …used if the job owner does not specify limits when submitting the job
  • Global configuration option OverTimeLimit
    • …defines the amount by which a job can exceed its time limit before it is killed
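A sketch in slurm.conf terms (values illustrative):

# partition-level boundaries for job run-times
PartitionName=main Nodes=all DefaultTime=02:00:00 MaxTime=7-00:00:00 State=UP
# globally allow jobs to exceed their limit by 10 minutes before being killed
OverTimeLimit=10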

The sdiag command includes a dedicated block with backfilling-specific information:

>>> sdiag
...
Backfilling stats
        Total backfilled jobs (since last slurm start): 405486
        Total backfilled jobs (since last stats cycle start): 8115
        Total backfilled heterogeneous job components: 0
        Total cycles: 989
        Last cycle when: Fri Nov 19 09:56:01 2021 (1637312161)
        Last cycle: 1976492                 # (usec) run-time of last cycle
        Max cycle:  3751612                 # longest run-time (since last reset)
        Mean cycle: 1962912                 # mean run-time (since last reset)
        Last depth cycle: 1660              # jobs processed during last run
        Last depth cycle (try sched): 1660
        Depth Mean: 1624
        Depth Mean (try depth): 1624
        Last queue length: 5327             # number of jobs pending
        Queue length mean: 6248             # mean count of jobs pending
...

Timing & Frequency

Backfill scheduling is a time consuming operation…

  • …locks are periodically released briefly
  • …so that other operations can be processed (e.g. submit new jobs)

Options related to timing/frequency of the backfill mechanism execution:

  • bf_continue
    • …continues backfill scheduling after releasing locks
    • …permits consideration of more jobs…
    • …may result in the delayed scheduling of newly submitted jobs
  • bf_interval
    • …interval between backfill scheduling attempts
    • …default value is 30 seconds
  • bf_yield_sleep
    • …time that backfill scheduler sleeps for when locks are released
    • …default value 500000 usec (0.5 sec)
  • bf_yield_interval
    • …time between backfill scheduler lock release
    • …tells the scheduler how often to relinquish operations…
    • …to answer client requests (more responsiveness)
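Combined into a SchedulerParameters sketch (values illustrative):

SchedulerParameters=bf_continue,bf_interval=60,bf_yield_interval=1000000,bf_yield_sleep=1000000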

Depth

Options related to the depth (number of jobs) considered during backfill:

  • bf_window determines how long into the future to look. The default value is 1440 minutes (one day). Higher values result in more overhead, less responsiveness and higher memory consumption. Too small of a value will starve large jobs indefinitely. In order to limit the amount of data managed by the backfill scheduler, if the value of bf_window is increased, then it is generally advisable to also increase bf_resolution.
  • bf_resolution defines the time resolution of backfill scheduling. Should be increased proportionally when adjusting bf_window. For example: bf_window=11520,bf_resolution=600 (8 days, 10 minutes). A larger bf_resolution results in faster backfill scheduling due to reduced granularity in the time-slices considered during calculations. A resolution between 300-600 is the most common (default 60). Tiny jobs will not benefit as much from bf_resolution.
  • bf_max_job_test maximum number of jobs to consider for backfill scheduling (default 100).
  • bf_max_job_start maximum number of jobs the backfill scheduler may start (default value is 0, no limit).
  • bf_max_job_part maximum number of jobs per partition to consider for backfill scheduling (default value is 0, no limit).
  • bf_max_job_user maximum number of jobs per user to consider for backfill scheduling (default value is 0, no limit).
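A sketch widening the backfill window while keeping the overhead bounded, following the proportionality rule for bf_resolution above (other values illustrative):

SchedulerParameters=bf_window=11520,bf_resolution=600,bf_max_job_test=1000,bf_max_job_user=50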

Cgroups

Slurm uses cgroups to constrain different resources to jobs…

  • …and to get accounting about these resources
  • Supports two cgroup modes…
    • …cgroups v1 …legacy mode (rewritten in 21.08)
    • …cgroups v2 …unified mode (added in 22.05)
    • …nodes have either v1 or v2 enabled
    • …hybrid nodes with both v1 and v2 not supported
  • SchedMD documentation…

Plugins

Enable Cgroup plugins in slurm.conf

# ...process tracking and management with Cgroups
ProctrackType=proctrack/cgroup
# ...constraining resources with Cgroups
TaskPlugin=task/cgroup,task/affinity
# ...gather job statistics with Cgroups
JobAcctGatherType=jobacct_gather/cgroup

proctrack/cgroup

Keeps track of all processes in a job…

  • stores the PIDs in a specific hierarchy in the cgroup tree
  • …signal PIDs when instructed (for example to send SIGTERM)
  • …no specific options for this plugin in cgroup.conf
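The tracked PIDs of a job can be listed on the node executing it (job ID illustrative):

>>> scontrol listpids 1234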

task/cgroup

Constrains resources to a job/step/task…

  • ensures that the boundaries of an allocation are not violated
  • Confines to the…
    • …allocated CPUs
    • …specific memory resources
    • …allocated GRES (including GPUs)
  • …uses the Cgroups cpuset, memory and device sub-systems
  • …multiple options in cgroup.conf apply to this plugin

Recommended to stack TaskPlugin=task/affinity,task/cgroup

  • …when configuring ConstrainCores=yes in cgroup.conf
  • …enables --cpu-bind and/or --mem-bind
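For example (a sketch, actual binding options depend on the site configuration):

# bind tasks to physical cores and report the binding
>>> srun --cpu-bind=verbose,cores ./app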

jobacct_gather/cgroup

Collects CPU and memory statistics…

  • …uses the Cgroup cpuacct and memory sub-systems
  • …reads the cgroup statistics files (memory.stat and similar) for the entire sub-tree of PIDs
  • …no specific options for this plugin in cgroup.conf
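The gathered statistics end up in the accounting records and can be queried afterwards, for example (job ID illustrative):

>>> sacct -j 1234 -o JobID,MaxRSS,AveCPU,TotalCPU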

cgroup.conf

Configuration file for the cgroup support…

  • …located in the same directory as the slurm.conf
  • changes take effect upon restart of slurmd (unless otherwise noted)
CgroupPlugin=autodetect
ConstrainCores=yes        # constrain allowed cores to the subset of allocated resources
ConstrainDevices=yes      # constrain allowed devices based on GRES allocated resources 
ConstrainRAMSpace=yes     # set memory soft & hard limits

Select Plugin

Select plugin …responsible for selecting the resources to be allocated to a job…

  • …aware of the system's topology …data structures established by the topology plugin
  • Multiple plugins…
    • select/linear …allocates whole nodes to jobs
    • select/cons_res …allocate individual sockets, cores, threads, or CPUs within a node
    • select/cons_tres …expands cons_res functionality to allocate generic resources (like GPUs)
    • …communicating with an external entity to perform these actions

cons_res (consumable resources) and cons_tres (consumable trackable resources)

# ...excerpt from sample slurm.conf file
SelectType=select/cons_tres
  • …jobs can be co-scheduled on nodes when resources permit it …enabled/disabled cluster-wide
  • …plugin is enabled via SelectType and SelectTypeParameters in the slurm.conf
  • Related user options…
    • --exclusive …allows users to request nodes in exclusive mode if required
    • --oversubscribe …incompatible with consumable resources …will not be honored
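A sketch of the corresponding slurm.conf lines, treating cores and memory as consumable resources (parameter choice illustrative):

SelectType=select/cons_tres
SelectTypeParameters=CR_Core_Memory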

Trackable Resources

Trackable RESources (TRES) …monitor consumable resource

# ...default configuration
>>> scontrol show config \
      | grep -e ^AccountingStorageTRES -e ^PriorityWeightTRES -e ^PriorityFlags -e TRESBillingWeights
AccountingStorageTRES   = cpu,mem,energy,node
PriorityFlags           = 
PriorityWeightTRES      = (null)
  • AccountingStorageTRES
    • …defines which consumable resources are tracked
    • …by default CPU, Energy, Memory, and Node are tracked, whether specified or not
  • PriorityWeightTRES
    • …comma separated list of resource types and associated billing weights
    • …defines the degree to which each resource contributes to the job’s priority.
>>> scontrol show config | grep ^PriorityWeightTRES
PriorityWeightTRES      = CPU=1000,Mem=1000

TRESBillingWeights …weights for each partition

  • …contributing to the calculation of the job resource consumption
  • …specified as a comma-separated list of type-weight pairs
  • …base unit can be adjusted with the suffix K,M,G,T or P.
  • …by default the sum of all tracked resources multiplied by their weight
>>> scontrol show partition | grep -E -e 'PartitionName=(main|long)' -e TRESBillingWeights
PartitionName=main
   TRESBillingWeights=cpu=1.0,mem=.25G
PartitionName=long
   TRESBillingWeights=cpu=1.5,mem=.50G

Incorporates all resources into the aggregation of consumed resources for a particular job…

sum(<type>*<weight>,[])
  • …alternatively the MAX_TRES flag
  • …considers only the tracked resource with the biggest contribution to the resource consumption
  • …basically the most expensive resource determines the cost of a job with regard to its fair-share factor
max(<type>*<weight>,[])
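As a worked example with the TRESBillingWeights of partition main above, a job allocating 8 CPUs and 16 GB of memory (hypothetical numbers) is billed sum(8*1.0, 16*0.25) = 12 by default, but max(8*1.0, 16*0.25) = 8 with the MAX_TRES flag set.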