Slurm - Cluster Control Plane

Configuration & Operation of Slurm Services

HPC

Published: November 3, 2015
Modified: January 25, 2024

sackd

auth/slurm and cred/slurm plugins (from Slurm 23.11)

  • Slurm internal authentication and job credential plugins
    • …alternative to MUNGE authentication service
    • …separate from existing auth/jwt plugin
  • Requires shared /etc/slurm/slurm.key throughout the cluster
  • Clients use local socket …managed by slurm{ctld,d,dbd}
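
A minimal sketch for switching from MUNGE to the internal plugins and creating the shared key mentioned above (assuming SlurmUser=slurm and the common /etc/slurm path):

# generate the shared key on one node ...distribute it to all others
dd if=/dev/random of=/etc/slurm/slurm.key bs=1024 count=1
chown slurm:slurm /etc/slurm/slurm.key
chmod 600 /etc/slurm/slurm.key

# slurm.conf
AuthType=auth/slurm
CredType=cred/slurm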

sackd on login nodes…

  • …provides authentication for client commands
  • …integrate into a “configless” environment
    • …manages cache of configuration files
    • …updates received automatically through scontrol reconfigure
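
A sketch of a sackd invocation on a login node pulling the configuration from the controller (the host name slurmctl01 and the default slurmctld port 6817 are placeholders):

# fetch the configuration from slurmctld ...serve authentication to client commands
sackd --conf-server slurmctl01:6817

# later ...push configuration updates to all daemons and sackd caches
scontrol reconfigure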

slurmctld

With version 23.11 1 scontrol reconfigure has been changed…

  • …systemd service units use Type=notify …new option --systemd for slurm{ctld,d}
  • …catches configuration mistakes …continues execution (instead of failing)
  • …reconfigure allows for almost any (supported) configuration changes to take place
    • …no explicit restart of the daemon required anymore
    • …SIGHUP now behaves similarly to restarting the slurm{ctld,d} processes
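
The Type=notify change mentioned above can be sketched as a systemd drop-in (paths are placeholders; the unit files packaged with 23.11 may already ship an equivalent):

# /etc/systemd/system/slurmctld.service.d/override.conf
[Service]
Type=notify
ExecStart=
ExecStart=/usr/sbin/slurmctld --systemd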

Scalability

Maximum job throughput and overall slurmctld responsiveness (under heavy load) are governed by the latency of reading/writing to the StateSaveLocation. In high-throughput environments (more than ~200,000 jobs/day) the local storage performance for the controller needs to be considered:

  • Fewer but faster cores (high clock frequency) on the slurmctld host are preferred
  • Fast storage for the StateSaveLocation (preferably NVMe)
    • IOPS to this location can become a major bottleneck to job throughput
    • At least two directories and two files created per job
    • Corresponding unlink() calls will add to the load
    • Use of array jobs significantly improves performance…
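
The last point can be sketched as follows (job.sh is a placeholder batch script); a job array keeps the load on the StateSaveLocation much lower than the equivalent number of individual submissions:

# 1,000 individual jobs ...1,000 sets of state files under StateSaveLocation
for i in $(seq 1000) ; do sbatch job.sh "$i" ; done

# a single array job covers the same work with far less state-save churn
# ...the task index is available as SLURM_ARRAY_TASK_ID inside each task
sbatch --array=1-1000 job.sh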

Hardware, example minimum system requirements for ~100,000 jobs/day with 500 nodes: 16 GB RAM, dual-core CPU, dedicated SSD/NVMe (for state-save). The amount of RAM required increases with larger workloads and the number of compute nodes.

slurmdbd should be hosted on a dedicated node, preferably with a dedicated SSD/NVMe for the relational database (of a local MariaDB instance). The RAM requirement goes up in relation to the number of jobs which query the database. A minimum system requirement to support 500 nodes with ~100,000 jobs/day is 16-32 GB RAM just on the database host.

Static & Dynamic Nodes

Recommended process for adding a static node…

  1. Stop slurmctld
  2. Update the configuration for Nodes=
  3. Restart slurmd daemons on all nodes
  4. (Re-)start slurmctld
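
A sketch of the same procedure as shell commands (pdsh and the node list are placeholders; without a “configless” setup the modified slurm.conf needs to be distributed to all nodes as well):

systemctl stop slurmctld                             # 1. on the controller
vim /etc/slurm/slurm.conf                            # 2. extend the NodeName= definitions
pdsh -w 'lxbk[0001-0500]' systemctl restart slurmd   # 3. on all compute nodes
systemctl start slurmctld                            # 4. back on the controller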

Dynamically add & delete nodes 2 3

  • …without restarting slurmctld & slurmd
  • …controller uses NodeAddr/NodeHostname for dynamic slurmd registrations
  • …only supported with SelectType=select/cons_tres
  • …set MaxNodeCount= in the configuration
  • Limitations…
    • …suboptimal internal representation of nodes
    • …inaccurate information for topology plugins
    • …requires scontrol reconfigure or service restart

Configuration

By default nodes aren’t added to any partition…

  • Nodes=ALL in the partition definition…
  • ⇒ have all nodes in the partition, even new dynamic nodes
PartitionName=open Nodes=ALL MaxTime=INFINITE Default=Yes State=Up
  • Nodeset= …create nodesets, add the nodeset to a partition…
  • ⇒ register dynamic nodes with a feature to add them to the nodeset
Nodeset=ns1 Feature=f1
Nodeset=ns2 Feature=f2
PartitionName=all Nodes=ALL
PartitionName=p1 Nodes=ns1
PartitionName=p2 Nodes=ns2
PartitionName=p3 Nodes=ns1,ns2

Run scontrol reconfigure after modifications to nodesets and partitions!

Node registration for example with slurmd -Z --conf="Feature=f1"

Operation

Dynamic registration requires option slurmd -Z

Two ways to add a dynamic node…

  • slurmd -Z --conf=
    • …option --conf …defines additional parameters of a dynamic node
    • NodeName= not allowed in --conf
    • …hardware topology optional …overwrites slurmd -C if specified
  • scontrol create state=FUTURE nodename= [conf syntax]
    • …allows overwriting the slurmd -C hardware topology
    • …append features= for association with a defined nodeset=
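
A sketch of the second method, creating a hypothetical dynamic node dyn01 with an explicit hardware description and a feature for nodeset association:

scontrol create nodename=dyn01 state=future cpus=16 realmemory=64000 features=f1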

Delete with scontrol delete nodename=

  • …needs to be idle …cleared from any reservation
  • Stop slurmd after the delete command
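
For example, removing the hypothetical dynamic node dyn01 again (ssh stands in for whatever remote execution is available):

scontrol delete nodename=dyn01     # node needs to be idle and not reserved
ssh dyn01 systemctl stop slurmd    # stop the daemon after the delete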

nss_slurm

…optional Slurm NSS plugin …password and group resolution

  • …serviced through the local slurmstepd process
  • …removes load from the directory services during launch of huge numbers of jobs/steps
  • …returns only results for processes within a given job step
  • …not meant as replacement for network directory services like LDAP, SSSD, or NSLCD
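
Enabling the plugin is a matter of NSS configuration on the compute nodes, for example (a sketch; the sss entry stands in for whatever directory service backs the system):

# /etc/nsswitch.conf ...query the local slurmstepd first
passwd: slurm files sss
group:  slurm files sss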

LDAP-less Control Plane

slurmctld without LDAP (Slurm 23.11)…

  • …enabled through auth/slurm credential format extensibility
  • …username, UID, GID captured alongside the job submission
  • auth/slurm permits the login node to securely provide these details
  • …set AuthInfo=use_client_ids in slurm{dbd}.conf
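
A minimal sketch of the relevant settings in slurm.conf and slurmdbd.conf:

AuthType=auth/slurm
AuthInfo=use_client_ids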

slurmdbd

SlurmDBD aka slurmdbd (slurm database daemon)…

  • …interface to the relational database storing accounting records
  • …configuration is available in slurmdbd.conf
    • …should be protected from unauthorized access …contains a database password
    • …file should be only on the computer where SlurmDBD executes
    • …only be readable by the user which executes SlurmDBD (e.g. slurm)
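
For example (assuming the daemon runs as user slurm):

chown slurm:slurm /etc/slurm/slurmdbd.conf
chmod 600 /etc/slurm/slurmdbd.conf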

Documentation …Slurm Database Daemon …slurmdbd.conf

  • …the host running the database (MySQL/MariaDB) is referred to as the back-end
  • …the node hosting the database daemon is called the front-end

Run the daemon in the foreground with verbose mode to debug the configuration:

# run in foreground with debugging enabled
slurmdbd -Dvvvvv

# follow the daemon logs
multitail /var/log/slurm{ctld,dbd}

Back-End

Provides the RDBMS back-end to store the accounting database …interfaced by slurmdbd

  • …dedicated database …typically called slurm_acct_db
  • …grant the corresponding permissions on the database server
cat > /tmp/slurm_user.sql <<EOF
grant all on slurm_acct_db.* TO 'slurm'@'node' identified by '12345678' with grant option;
grant all on slurm_acct_db.* TO 'slurm'@'node.fqdn' identified by '12345678' with grant option;
EOF
sudo mysql < /tmp/slurm_user.sql

On start slurmdbd will first try to connect to the back-end database…

  • StorageHost database hostname
  • StorageUser database user name
  • StoragePass database user password
  • StorageType database type
  • StorageLoc database name on the database server (defaults to slurm_acct_db)
# back-end database
#
StorageHost=lxrmdb04
#StoragePort=3306
StorageUser=slurm
StoragePass=12345678
StorageType=accounting_storage/mysql
#StorageLoc=slurm_acct_db

Launch the interactive mysql shell…

/* ...list databases */
show databases like 'slurm%';

/* ...check users access to databases */
select user,host from mysql.user where user='slurm';

Connect from a remote node…

  • …requires the MySQL client (dnf install -y mysql)
  • …use the password set with StoragePass in slurmdbd.conf
mysql_password=$(grep StoragePass /etc/slurm/slurmdbd.conf | cut -d= -f2)
database=$(grep StorageHost /etc/slurm/slurmdbd.conf | cut -d= -f2)

# connect to the database server
mysql --host $database --user slurm --password="$mysql_password" slurm_acct_db

Front-End

Configure the Slurm controller to write accounting records to a back-end SQL database using slurmdbd as interface:

AccountingStorageType The accounting storage mechanism type. Acceptable values at present include “accounting_storage/none” and “accounting_storage/slurmdbd”. The “accounting_storage/slurmdbd” value indicates that accounting records will be written to the Slurm DBD, which manages an underlying MySQL database. See “man slurmdbd” for more information. The default value is “accounting_storage/none” and indicates that account records are not maintained. Also see DefaultStorageType.

In order to enable all nodes to query the accounting database, make sure that the following configuration is correct:

AccountingStorageHost The name of the machine hosting the accounting storage database. Only used with systems using SlurmDBD, ignored otherwise.
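
A corresponding slurm.conf sketch (the host name matches the example output further below, the port is the slurmdbd default):

# accounting front-end ...records are sent to slurmdbd
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageHost=lxbk0263
#AccountingStoragePort=6819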

Note that the configuration above refers to the node hosting the Slurm database daemon, not the back-end database. An error similar to the following text is emitted by sacct if the connection cannot be established:

sacct: error: slurm_persist_conn_open_without_init: failed to open persistent connection to host:localhost:6819: Connection refused
sacct: error: Sending PersistInit msg: Connection refused
sacct: error: Problem talking to the database: Connection refused

Changes to this configuration require scontrol reconfigure to be propagated:

# check the configuration with...
>>> scontrol show config | grep AccountingStorage.*Host
AccountingStorageBackupHost = (null)
AccountingStorageHost   = lxbk0263
AccountingStorageExternalHost = (null)

Purge

The database can grow very large with time…

  • …depends on the job throughput
  • truncating the tables helps performance
  • …typically no need to access very old job metadata

Remove old data from the accounting database

# data retention
#
PurgeEventAfter=1month
PurgeJobAfter=12month
PurgeResvAfter=1month
PurgeStepAfter=1month
PurgeSuspendAfter=1month
PurgeTXNAfter=12month
PurgeUsageAfter=24month 

Sites requiring access to historic account data…

  • …separated from the archive options described in the next section
  • …may host a dedicated isolated instance of slurmdbd
  • …runs a copy or part of a copy of the production database
  • …provides quick access to query historical information

Archive

Archive accounting database:

# data archive
#
ArchiveDir=/var/spool/slurm/archive
ArchiveEvents=yes
ArchiveJobs=yes
ArchiveResvs=yes
ArchiveSteps=no
ArchiveSuspend=no
ArchiveTXN=no
ArchiveUsage=no

slurmrestd

The slurmrestd service translates JSON/YAML over HTTP requests into Slurm RPC requests…

  • Allows submitting and managing jobs through REST calls (for example via curl)
  • Launch and manage batch jobs from a (web-)service
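
A sketch of a REST call with curl (assuming slurmrestd listens on localhost port 6820, JWT authentication is configured, and the v0.0.40 OpenAPI plugin from 23.11 is loaded):

# request a JWT for the current user ...exports SLURM_JWT
export $(scontrol token)

# ping the REST service
curl -s -H "X-SLURM-USER-NAME: $USER" \
        -H "X-SLURM-USER-TOKEN: $SLURM_JWT" \
        http://localhost:6820/slurm/v0.0.40/ping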


slurmd

Each compute server (node) has a slurmd daemon…

  • …waits for work, executes that work…
  • …returns status, and waits for more work

Flags (appended to the node state in sinfo output)…

  • -…planned for backfill
  • *…not responding
  • $…maintenance
  • @…pending reboot
  • ^…rebooting
  • !…pending power down
  • %…powering down
  • ~…power off
  • #…power up & configuring

Node States

Node state codes…man sinfo

  • ALLOCATED…by one or more jobs
  • ALLOCATED+…some jobs in process of completing
  • COMPLETING…all jobs completing
  • IDLE…not allocated
  • INVAL…node did not register to controller
    • …allow invalid node resource with SlurmdParameters=config_overrides
    • …only useful for testing purposes
  • FUTURE…node not available yet
  • MAINT…node in maintenance
  • RESERVED…advanced reservation
  • …some more…

Drain & Resume

Non-production node states…

  • DRAINING…node will become unavailable by admin request
  • DRAINED…node unavailable by admin request
  • DOWN…node unavailable for use
  • FAIL…node expected to fail…unavailable by admin request
  • FAILING…jobs expected to fail soon

Get an overview…

# ...non-responding (dead) nodes
sinfo -d

# drain nodes with reason
sinfo -o '%4D %10T %20E %N' -S -D -t drain,draining,drained

# node-list of drained nodes
sinfo -h -N -o '%n' -t drain,draining,drained | nodeset -f

# node-list of unresponsive nodes...
sinfo -h -N -o '%n' -t down,no_respond,power_down,unk,unknown | nodeset -f

Drain and resume nodes…

scontrol update state=drain nodename="$nodeset" reason="$reason"

scontrol update state=resume nodename="$nodeset"

Configuration

# foreground debug mode...
slurmd -Dvvvvv

Configuration in slurm.conf

  • SlurmdUser…defaults to root
  • SlurmdPort…defaults to 6818
  • SlurmdParameters…see man-page
  • SlurmdTimeout…time in seconds (defaults to 300)…
    • …before slurmctld sets an unresponsive node to state DOWN
    • …ping by Slurm internal communication mechanisms
  • SlurmdPidFile…defaults to /var/run/slurmd.pid
  • SlurmdSpoolDir…defaults to /var/spool/slurmd
    • …daemon’s state information
    • …batch job script information
  • SlurmdLogFile…defaults to syslog
  • SlurmdDebug & SlurmdSyslogDebug
    • …during operations quiet, fatal, error or info
    • …for debugging verbose, debug or debug{2,3,4,5}

scontrol reboot

Reboot nodes using the resource manager scontrol reboot sub-command:

# reboot nodes as soon as they are idle (explicitly drain the nodes beforehand)
scontrol reboot ...            # Defaults to ALL!!! reboots all nodes in the cluster
scontrol reboot $(hostname)... # reboot localhost
scontrol reboot "$nodeset" ... # reboot a nodeset
# drain & reboot the nodes
scontrol reboot ASAP "$nodeset"
# cancel pending reboots with
scontrol cancel_reboot "$nodeset"
# node clears its state and returns to service after reboot
scontrol reboot "$nodeset" nextstate=RESUME ...

RebootProgram

The commands above will execute the configured RebootProgram:

>>> scontrol show config | grep -i reboot                       
RebootProgram           = /etc/slurm/libexec/reboot

Example…

#!/bin/bash
# prefer a hard IPMI power reset to overcome hanging Lustre mounts...
if IPMITOOL="$(command -v ipmitool)" ; then
    "$IPMITOOL" power reset
else
    /usr/bin/systemctl reboot --force
fi

States

Nodes with pending reboot…

>>> scontrol show node $node
#...
  State=MIXED+DRAIN+REBOOT_REQUESTED #...
#...
  Reason=Reboot ASAP [root@2023-10-18T09:50:18]

Nodes during reboot…

>>> scontrol show node $node
#...
  State=DOWN+DRAIN+REBOOT_ISSUED #...
#...
  Reason=Reboot ASAP : reboot issued [root@2023-10-20T07:05:07]

Footnotes

  1. Slurm Community BoF, SC23, November 2023
    https://slurm.schedmd.com/SC23/Slurm-SC23-BOF.pdf

  2. Dynamic Nodes, Slurm Administrator Documentation
    https://slurm.schedmd.com/dynamic_nodes.html

  3. Cloudy, With a Chance of Dynamic Nodes, SLUG’22
    https://slurm.schedmd.com/SLUG22/Dynamic_Nodes.pdf