Slurm - Cluster Control Plane
Configuration & Operation of Slurm Services
sackd

auth/slurm and cred/slurm plugins (from Slurm 23.11)…

- Slurm internal authentication and job credential plugins
- …alternative to MUNGE authentication service
- …separate from existing auth/jwt plugin
- Requires shared /etc/slurm/slurm.key throughout the cluster
- Clients use local socket …managed by slurm{ctld,d,dbd}

sackd on login nodes (see the sketch below)…

- …provides authentication for client commands
- …integrates into a “configless” environment
- …manages cache of configuration files
- …updates received automatically through scontrol reconfigure
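A minimal sketch of enabling the internal authentication and running sackd in configless mode; the controller hostname lxbk0263 is an assumption:

# create a shared key on the controller ...distribute to all nodes afterwards
dd if=/dev/random of=/etc/slurm/slurm.key bs=1024 count=1
chown slurm:slurm /etc/slurm/slurm.key
chmod 600 /etc/slurm/slurm.key

# slurm.conf ...switch from MUNGE to the internal plugins
#   AuthType=auth/slurm
#   CredType=cred/slurm

# login node ...start sackd against the controller (configless)
sackd --conf-server lxbk0263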
slurmctld
With version 23.11 [1] scontrol reconfigure has been changed…

- …systemd service units use Type=notify …new option --systemd for slurm{ctld,d} (unit sketch below)
- …catches configuration mistakes …continues execution (instead of failing)
- …reconfigure allows for almost any (supported) configuration change to take place
- …no explicit restart of the daemon required anymore
- …SIGHUP has similar behaviour to restarting slurm{ctld,d} processes
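For illustration, the relevant lines of a 23.11-style slurmctld.service unit (a sketch, the binary path is an assumption):

# slurmctld.service ...relevant lines of a 23.11-style unit (sketch)
[Service]
Type=notify
ExecStart=/usr/sbin/slurmctld --systemd $SLURMCTLD_OPTIONS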
Scalability
Maximum job throughput and overall slurmctld responsiveness (under heavy load) is governed by the latency of reading/writing to the StateSaveLocation. In high-throughput environments (more than ~200,000 jobs/day) the local storage performance of the controller needs to be considered:

- Fewer, fast cores (high clock frequency) on the slurmctld host are preferred
- Fast storage for the StateSaveLocation, preferably NVMe (see the sketch below)
  - IOPS to this location can become a major bottleneck to job throughput
  - At least two directories and two files are created per job
  - Corresponding unlink() calls will add to the load
- Use of array jobs significantly improves performance…
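A sketch of the corresponding slurm.conf setting, assuming a dedicated NVMe device is mounted at /var/spool/slurmctld:

# slurm.conf ...controller state on fast local storage (mount point is an assumption)
StateSaveLocation=/var/spool/slurmctld/state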
Hardware, example minimum system requirements for ~100,000 jobs/day with 500 nodes: 16 GB RAM, dual-core CPU, dedicated SSD/NVMe (for state-save). The amount of RAM required increases with larger workloads and the number of compute nodes.
slurmdbd should be hosted on a dedicated node, preferably with a dedicated SSD/NVMe for the relational database (of a local MariaDB instance). The RAM requirement goes up in relation to the number of jobs which query the database. A minimum system requirement to support 500 nodes with ~100,000 jobs/day is 16-32 GB RAM just on the database host.
Static & Dynamic Nodes
Recommended process for adding a static node…

- Stop slurmctld
- Update the configuration for Nodes=
- Restart slurmd daemons on all nodes
- (Re-)start slurmctld

Dynamically add & delete nodes [2][3]…

- …without restarting slurmctld & slurmd
- …controller uses NodeAddr/NodeHostname for dynamic slurmd registrations
- …only supported with SelectType=select/cons_tres
- …set MaxNodeCount= in the configuration
- Limitations…
  - …suboptimal internal representation of nodes
  - …inaccurate information for topology plugins
  - …requires scontrol reconfigure or service restart
Configuration
By default nodes aren’t added to any partition…

Nodes=ALL in the partition definition…

- ⇒ have all nodes in the partition, even new dynamic nodes

PartitionName=open Nodes=ALL MaxTime=INFINITE Default=Yes State=Up

Nodeset= …create nodesets, add the nodeset to a partition…

- ⇒ register dynamic nodes with a feature to add them to the nodeset

Nodeset=ns1 Feature=f1
Nodeset=ns2 Feature=f2
PartitionName=all Nodes=ALL
PartitionName=p1 Nodes=ns1
PartitionName=p2 Nodes=ns2
PartitionName=p3 Nodes=ns1,ns2

Run scontrol reconfigure after modifications to nodesets and partitions!

Node registration for example with slurmd -Z --conf="Feature=f1"
Operation
Dynamic registration requires option slurmd -Z

Two ways to add a dynamic node (examples in the sketch after this list)…

- …slurmd -Z --conf=
  - …option --conf …defines additional parameters of a dynamic node
  - …NodeName= not allowed in --conf
  - …hardware topology optional …overwrites slurmd -C if specified
- …option scontrol create state=FUTURE nodename= [conf syntax]
  - …allows to overwrite slurmd -C hardware topology
  - …appending a features= for association with a defined nodeset=

Delete with scontrol delete nodename= …

- …needs to be idle …cleared from any reservation
- Stop slurmd after the delete command
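A sketch of both registration paths and the removal; node name, feature and hardware values are hypothetical:

# register a dynamic node from the node itself
slurmd -Z --conf="Feature=f1 CPUs=16 RealMemory=64000"

# ...or pre-create it from the controller with explicit hardware topology
scontrol create nodename=dyn001 state=FUTURE cpus=16 realmemory=64000 features=f1

# remove the node once it is idle, then stop slurmd on it
scontrol delete nodename=dyn001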
nss_slurm
…optional Slurm NSS plugin …password and group resolution (configuration sketch below)

- …serviced through the local slurmstepd process
- …removes load from the directory services during launch of huge numbers of jobs/steps
- …returns only results for processes within a given job step
- …not meant as replacement for network directory services like LDAP, SSSD, or NSLCD
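A sketch of enabling nss_slurm, assuming the libnss_slurm package is installed; the lookup order is a choice, not a requirement:

# /etc/nsswitch.conf ...resolve job-step users/groups locally before the directory service
passwd: slurm files sss
group:  slurm files sss

# slurm.conf ...let step launch populate the plugin
LaunchParameters=enable_nss_slurm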
LDAP-less Control Plane
slurmctld without LDAP (Slurm 23.11)…

- …enabled through auth/slurm credential format extensibility
- …username, UID, GID captured alongside the job submission
- …auth/slurm permits the login node to securely provide these details
- …set AuthInfo=use_client_ids in slurm{dbd}.conf (sketch below)
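A sketch of the corresponding settings; the same lines go into slurm.conf and slurmdbd.conf:

# slurm.conf & slurmdbd.conf ...trust user identity provided by the client
AuthType=auth/slurm
AuthInfo=use_client_ids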
slurmdbd
SlurmDBD aka slurmdbd (Slurm database daemon)…

- …interface to the relational database storing accounting records
- …configuration is available in slurmdbd.conf
- …should be protected from unauthorized access …contains a database password
- …file should be only on the computer where SlurmDBD executes
- …should only be readable by the user which executes SlurmDBD (e.g. slurm)

Documentation …Slurm Database Daemon …slurmdbd.conf…

- …the host running the database (MySQL/MariaDB) is referred to as back-end
- …the node hosting the database daemon is called front-end
Run the daemon in foreground and verbose mode to debug the configuration:

# run in foreground with debugging enabled
slurmdbd -Dvvvvv
# follow the daemon logs
multitail /var/log/slurm{ctld,dbd}
Back-End
Provides the RDBMS back-end to store the accounting database …interfaced by slurmdbd

- …dedicated database …typically called slurm_acct_db
- …grant the corresponding permissions on the database server
cat > /tmp/slurm_user.sql <<EOF
grant all on slurm_acct_db.* TO 'slurm'@'node' identified by '12345678' with grant option;
grant all on slurm_acct_db.* TO 'slurm'@'node.fqdn' identified by '12345678' with grant option;
EOF
sudo mysql < /tmp/slurm_user.sql
On start slurmdbd will first try to connect to the back-end database…

StorageHost …database hostname
StorageUser …database user name
StoragePass …database user password
StorageType …database type
StorageLoc …database name on the database server (defaults to slurm_acct_db)
# back-end database
#
StorageHost=lxrmdb04
#StoragePort=3306
StorageUser=slurm
StoragePass=12345678
StorageType=accounting_storage/mysql
#StorageLoc=slurm_acct_db
Launch the interactive mysql shell…

/* ...list databases */
show databases like 'slurm%';
/* ...check users access to databases */
select user,host from mysql.user where user='slurm';
Connect from a remote node…

- …requires the MySQL client (dnf install -y mysql)
- …use the password set with StoragePass in slurmdbd.conf
mysql_password=$(grep StoragePass /etc/slurm/slurmdbd.conf | cut -d= -f2)
database=$(grep StorageHost /etc/slurm/slurmdbd.conf | cut -d= -f2)
# connect to the database server
mysql --host $database --user slurm --password="$mysql_password" slurm_acct_db
Front-End
Configure the Slurm controller to write accounting records to a back-end SQL database using slurmdbd as interface:
AccountingStorageType
The accounting storage mechanism type. Acceptable values at present include “accounting_storage/none” and “accounting_storage/slurmdbd”. The “accounting_storage/slurmdbd” value indicates that accounting records will be written to the Slurm DBD, which manages an underlying MySQL database. See “man slurmdbd” for more information. The default value is “accounting_storage/none” and indicates that account records are not maintained. Also see DefaultStorageType.
In order to enable all nodes to query the accounting database, make sure that the following configuration is correct:
AccountingStorageHost
The name of the machine hosting the accounting storage database. Only used with systems using SlurmDBD, ignored otherwise.
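For example, a sketch of the accounting section in slurm.conf (the hostname matches the example output further below):

# slurm.conf ...accounting through the Slurm database daemon
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageHost=lxbk0263
#AccountingStoragePort=6819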
Note that the configuration above refers to the node hosting the Slurm database daemon, not the back-end database. An error similar to the following text is emitted by sacct if the connection cannot be established:
sacct: error: slurm_persist_conn_open_without_init: failed to open persistent connection to host:localhost:6819: Connection refused
sacct: error: Sending PersistInit msg: Connection refused
sacct: error: Problem talking to the database: Connection refused
Changes to this configuration require scontrol reconfigure to be propagated:
# check the configuration with...
>>> scontrol show config | grep AccountingStorage.*Host
AccountingStorageBackupHost = (null)
AccountingStorageHost = lxbk0263
AccountingStorageExternalHost = (null)
Purge
The database can grow very large with time…
- …depends on the job throughput
- …truncating the tables helps performance
- …typically no need to access very old job metadata
Remove old data from the accounting database:
# data retention
#
PurgeEventAfter=1month
PurgeJobAfter=12month
PurgeResvAfter=1month
PurgeStepAfter=1month
PurgeSuspendAfter=1month
PurgeTXNAfter=12month
PurgeUsageAfter=24month
Sites requiring access to historic account data…

- …separated from the archive options described in the next section
- …may host a dedicated isolated instance of slurmdbd (see the sketch below)
- …runs a copy or part of a copy of the production database
- …provides quick access to query historical information
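A possible sketch of the slurmdbd.conf on such an isolated instance; host names, port and database name are assumptions:

# /etc/slurm/slurmdbd.conf ...archive instance (values are assumptions)
DbdHost=lxarchive01
DbdPort=6819
StorageHost=lxarchivedb01
StorageUser=slurm
StoragePass=12345678
StorageType=accounting_storage/mysql
StorageLoc=slurm_acct_db_archive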
Archive
Archive accounting database:
# data archive
#
ArchiveDir=/var/spool/slurm/archive
ArchiveEvents=yes
ArchiveJobs=yes
ArchiveResvs=yes
ArchiveSteps=no
ArchiveSuspend=no
ArchiveTXN=no
ArchiveUsage=no
slurmrestd
The slurmrestd service translates JSON/YAML over HTTP requests into Slurm RPC requests…

- Allows submitting and managing jobs through REST calls, for example via curl (see the sketch below)
- Launch and manage batch jobs from a (web-)service
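A sketch of a REST call, assuming auth/jwt is configured, slurmrestd listens on port 6820 of the local host, and the API version matches the installed release:

# request a token for the current user and export SLURM_JWT
unset SLURM_JWT; export $(scontrol token)

# query the diagnostics endpoint (port and API version are assumptions)
curl -s \
     -H "X-SLURM-USER-NAME: $USER" \
     -H "X-SLURM-USER-TOKEN: $SLURM_JWT" \
     http://localhost:6820/slurm/v0.0.40/diag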
References…
- Slurm REST API & JSON Web Tokens (JWT) Authentication
- REST API talks at SLUG ’19 & ’20
slurmd
Each compute server (node) has a slurmd daemon…

- …waits for work, executes that work…
- …returns status, and waits for more work
Flags…

- …planned for backfill
* …not responding
$ …maintenance
@ …pending reboot
^ …rebooting
! …pending power down
% …powering down
~ …power off
# …power up & configuring
Node States
Node state codes …man sinfo

ALLOCATED …by one or more jobs
ALLOCATED+ …some jobs in process of completing
COMPLETING …all jobs completing
IDLE …not allocated
INVAL …node did not register to controller
  - …allow invalid node resource with SlurmdParameters=config_overrides…
  - …only useful for testing purposes
FUTURE …node not available yet
MAINT …node in maintenance
RESERVED …advanced reservation
- …some more…
Drain & Resume
Non-production node states…

DRAINING …node will become unavailable by admin request
DRAINED …node unavailable by admin request
DOWN …node unavailable for use
FAIL …node expected to fail …unavailable by admin request
FAILING …jobs expected to fail soon
Get an overview…
# ...non-responding (dead) nodes
sinfo -d
# drain nodes with reason
sinfo -o '%4D %10T %20E %N' -S -D -t drain,draining,drained
# node-list of drained nodes
sinfo -h -N -o '%n' -t drain,draining,drained | nodeset -f
# node-list of unresponsive nodes...
sinfo -h -N -o '%n' -t down,no_respond,power_down,unk,unknown | nodeset -f
Drain and resume nodes…
scontrol update state=drain nodename="$nodeset" reason="$reason"
scontrol update state=resume nodename="$nodeset"
Configuration
# foreground debug mode...
slurmd -Dvvvvv
Configuration in slurm.conf (see the sketch after this list)…

SlurmdUser …defaults to root
SlurmdPort …defaults to 6818
SlurmdParameters …see man-page
SlurmdTimeout …time in seconds (defaults to 300)…
  - …before slurmctld sets an unresponsive node to state DOWN
  - …ping by Slurm internal communication mechanisms
SlurmdPidFile …defaults to /var/run/slurmd.pid
SlurmdSpoolDir …defaults to /var/spool/slurmd
  - …daemon’s state information
  - …batch job script information
SlurmdLogFile …defaults to syslog
SlurmdDebug & SlurmdSyslogDebug…
  - …during ops. quiet, fatal, error or info
  - …debug… verbose …debug{2,3,4,5}
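A sketch of these settings in slurm.conf; values shown are examples, not necessarily the defaults:

# slurm.conf ...slurmd related settings (example values)
SlurmdUser=root
SlurmdPort=6818
SlurmdTimeout=300
SlurmdPidFile=/var/run/slurmd.pid
SlurmdSpoolDir=/var/spool/slurmd
SlurmdLogFile=/var/log/slurmd.log
SlurmdDebug=info
SlurmdSyslogDebug=error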
scontrol reboot
Reboot nodes using the resource manager scontrol reboot sub-command:

# reboot nodes as soon as they are idle (explicitly drain the nodes beforehand)
scontrol reboot ...              # defaults to ALL!!! reboots all nodes in the cluster
scontrol reboot $(hostname) ...  # reboot localhost
scontrol reboot "$nodeset" ...   # reboot a nodeset
# drain & reboot the nodes
scontrol reboot ASAP "$nodeset"
# cancel pending reboots with
scontrol cancel_reboot "$nodeset"
# node clears its state and returns to service after reboot
scontrol reboot "$nodeset" nextstate=RESUME ...
RebootProgram
The commands above will execute a RebootProgram:
>>> scontrol show config | grep -i reboot
RebootProgram = /etc/slurm/libexec/reboot
Example…
#!/bin/bash
# use ipmitool if available (exit status of which is preserved by the assignment)
IPMITOOL="$(which ipmitool)"
if [ $? -eq 0 ]; then
    # hard power reset via the local BMC to overcome hanging Lustre mounts...
    "$IPMITOOL" power reset
else
    # fall back to an immediate reboot through systemd
    /usr/bin/systemctl reboot --force
fi
States
Nodes with pending reboot…
>>> scontrol show node $node
#...
State=MIXED+DRAIN+REBOOT_REQUESTED #...
#...
Reason=Reboot ASAP [root@2023-10-18T09:50:18]
Nodes during reboot…
>>> scontrol show node $node
#...
State=DOWN+DRAIN+REBOOT_ISSUED #...
#...
Reason=Reboot ASAP : reboot issued [root@2023-10-20T07:05:07]
Footnotes
[1] Slurm Community BoF, SC23, November 2023
    https://slurm.schedmd.com/SC23/Slurm-SC23-BOF.pdf
[2] Dynamic Nodes, Slurm Administrator Documentation
    https://slurm.schedmd.com/dynamic_nodes.html
[3] Cloudy, With a Chance of Dynamic Nodes, SLUG’22
    https://slurm.schedmd.com/SLUG22/Dynamic_Nodes.pdf