Slurm - Cluster Control Plane

Configuration & Operation of Slurm Services

HPC

Published: November 3, 2015
Modified: January 25, 2024

sackd

auth/slurm and cred/slurm plugins (from Slurm 23.11)

  • Slurm internal authentication and job credential plugins
    • …alternative to MUNGE authentication service
    • …separate from existing auth/jwt plugin
  • Requires shared /etc/slurm/slurm.key throughout the cluster
  • Clients use local socket …managed by slurm{ctld,d,dbd}
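
A minimal sketch for switching from MUNGE to the internal plugins and creating the shared key mentioned above (assuming SlurmUser=slurm and the common /etc/slurm path):

# generate the shared key on one node ...distribute it to all others
dd if=/dev/random of=/etc/slurm/slurm.key bs=1024 count=1
chown slurm:slurm /etc/slurm/slurm.key
chmod 600 /etc/slurm/slurm.key

# slurm.conf
AuthType=auth/slurm
CredType=cred/slurm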

sackd on login nodes…

  • …provides authentication for client commands
  • …integrate into a “configless” environment
    • …manages cache of configuration files
    • …updates received automatically through scontrol reconfigure
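
A sketch of a sackd invocation on a login node pulling the configuration from the controller (the host name slurmctl01 and the default slurmctld port 6817 are placeholders):

# fetch the configuration from slurmctld ...serve authentication to client commands
sackd --conf-server slurmctl01:6817

# later ...push configuration updates to all daemons and sackd caches
scontrol reconfigure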

slurmctld

With version 23.11 1 scontrol reconfigure has been changed…

  • …systemd service units use Type=notify …new option --systemd for slurm{ctld,d}
  • …catches configuration mistakes …continues execution (instead of failing)
  • …reconfigure allows for almost any (supported) configuration changes to take place
    • …no explicit restart of the daemon required anymore
    • …SIGHUP now behaves similarly to restarting the slurm{ctld,d} processes
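
The Type=notify change mentioned above can be sketched as a systemd drop-in (paths are placeholders; the unit files packaged with 23.11 may already ship an equivalent):

# /etc/systemd/system/slurmctld.service.d/override.conf
[Service]
Type=notify
ExecStart=
ExecStart=/usr/sbin/slurmctld --systemd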

Scalability

Maximum job throughput and overall slurmctld responsiveness (under heavy load) are governed by the latency of reading/writing to the StateSaveLocation. In high-throughput environments (more than ~200,000 jobs/day) the local storage performance for the controller needs to be considered:

  • Fewer but faster cores (high clock frequency) on the slurmctld host are preferred
  • Fast storage for the StateSaveLocation (preferably NVMe)
    • IOPS to this location can become a major bottleneck to job throughput
    • At least two directories and two files created per job
    • Corresponding unlink() calls will add to the load
    • Use of array jobs significantly improves performance…
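
The last point can be sketched as follows (job.sh is a placeholder batch script); a job array keeps the load on the StateSaveLocation much lower than the equivalent number of individual submissions:

# 1,000 individual jobs ...1,000 sets of state files under StateSaveLocation
for i in $(seq 1000) ; do sbatch job.sh "$i" ; done

# a single array job covers the same work with far less state-save churn
# ...the task index is available as SLURM_ARRAY_TASK_ID inside each task
sbatch --array=1-1000 job.sh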

Hardware, example minimum system requirements for ~100,000 jobs/day with 500 nodes: 16 GB RAM, dual-core CPU, dedicated SSD/NVMe (for state-save). The amount of RAM required increases with larger workloads and the number of compute nodes.

slurmdbd should be hosted on a dedicated node, preferably with a dedicated SSD/NVMe for the relational database (of a local MariaDB instance). The RAM requirement goes up in relation to the number of jobs which query the database. A minimum system requirement to support 500 nodes with ~100,000 jobs/day is 16-32 GB RAM just on the database host.

Static & Dynamic Nodes

Recommended process for adding a static node…

  1. Stop slurmctld
  2. Update the configuration for Nodes=
  3. Restart slurmd daemons on all nodes
  4. (Re-)start slurmctld
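
A sketch of the same procedure as shell commands (pdsh and the node list are placeholders; without a “configless” setup the modified slurm.conf needs to be distributed to all nodes as well):

systemctl stop slurmctld                             # 1. on the controller
vim /etc/slurm/slurm.conf                            # 2. extend the NodeName= definitions
pdsh -w 'lxbk[0001-0500]' systemctl restart slurmd   # 3. on all compute nodes
systemctl start slurmctld                            # 4. back on the controller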

Dynamically add & delete nodes 2 3

  • …without restarting slurmctld & slurmd
  • …controller uses NodeAddr/NodeHostname for dynamic slurmd registrations
  • …only supported with SelectType=select/cons_tres
  • …set MaxNodeCount= in the configuration
  • Limitations…
    • …suboptimal internal representation of nodes
    • …inaccurate information for topology plugins
    • …requires scontrol reconfigure or service restart

Configuration

By default nodes aren’t added to any partition…

  • Nodes=ALL in the partition definition…
  • ⇒ have all nodes in the partition, even new dynamic nodes
PartitionName=open Nodes=ALL MaxTime=INFINITE Default=Yes State=Up
  • Nodeset= …create nodesets, add the nodeset to a partition…
  • ⇒ register dynamic nodes with a feature to add them to the nodeset
Nodeset=ns1 Feature=f1
Nodeset=ns2 Feature=f2
PartitionName=all Nodes=ALL
PartitionName=p1 Nodes=ns1
PartitionName=p2 Nodes=ns2
PartitionName=p3 Nodes=ns1,ns2

Run scontrol reconfigure after modifications to nodesets and partitions!

Node registration for example with slurmd -Z --conf="Feature=f1"

Operation

Dynamic registration requires option slurmd -Z

Two ways to add a dynamic node…

  • slurmd -Z --conf=
    • …option --conf …defines additional parameters of a dynamic node
    • NodeName= not allowed in --conf
    • …hardware topology optional …overwrites slurmd -C if specified
  • scontrol create state=FUTURE nodename= [conf syntax]
    • …allows overwriting the slurmd -C hardware topology
    • …append features= for association with a defined nodeset=
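
A sketch of the second method, creating a hypothetical dynamic node dyn01 with an explicit hardware description and a feature for nodeset association:

scontrol create nodename=dyn01 state=future cpus=16 realmemory=64000 features=f1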

Delete with scontrol delete nodename=

  • …needs to be idle …cleared from any reservation
  • Stop slurmd after the delete command
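
For example, removing the hypothetical dynamic node dyn01 again (ssh stands in for whatever remote execution is available):

scontrol delete nodename=dyn01     # node needs to be idle and not reserved
ssh dyn01 systemctl stop slurmd    # stop the daemon after the delete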

nss_slurm

…optional Slurm NSS plugin …password and group resolution

  • …serviced through the local slurmstepd process
  • …removes load from the directory services during launch of huge numbers of jobs/steps
  • …returns only results for processes within a given job step
  • …not meant as replacement for network directory services like LDAP, SSSD, or NSLCD
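
Enabling the plugin is a matter of NSS configuration on the compute nodes, for example (a sketch; the sss entry stands in for whatever directory service backs the system):

# /etc/nsswitch.conf ...query the local slurmstepd first
passwd: slurm files sss
group:  slurm files sss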

LDAP-less Control Plane

slurmctld without LDAP (Slurm 23.11)…

  • …enabled through auth/slurm credential format extensibility
  • …username, UID, GID captured alongside the job submission
  • auth/slurm permits the login node to securely provide these details
  • …set AuthInfo=use_client_ids in slurm{dbd}.conf
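
A minimal sketch of the relevant settings in slurm.conf and slurmdbd.conf:

AuthType=auth/slurm
AuthInfo=use_client_ids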

slurmdbd

SlurmDBD aka slurmdbd (slurm database daemon)…

  • …interface to the relational database storing accounting records
  • …configuration is available in slurmdbd.conf
    • …should be protected from unauthorized access …contains a database password
    • …file should be only on the computer where SlurmDBD executes
    • …only be readable by the user which executes SlurmDBD (e.g. slurm)
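
For example (assuming the daemon runs as user slurm):

chown slurm:slurm /etc/slurm/slurmdbd.conf
chmod 600 /etc/slurm/slurmdbd.conf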

Documentation …Slurm Database Daemon …slurmdbd.conf

  • …the host running the database (MySQL/MariaDB) is referred to as the back-end
  • …the node hosting the database daemon is called the front-end

Run the daemon in the foreground with verbose mode to debug the configuration:

# run in foreground with debugging enabled
slurmdbd -Dvvvvv

# follow the daemon logs
multitail /var/log/slurm{ctld,dbd}

Back-End

Provides the RDBMS back-end to store the accounting database …interfaced by slurmdbd

  • …dedicated database …typically called slurm_acct_db
  • …grant the corresponding permissions on the database server
cat > /tmp/slurm_user.sql <<EOF
grant all on slurm_acct_db.* TO 'slurm'@'node' identified by '12345678' with grant option;
grant all on slurm_acct_db.* TO 'slurm'@'node.fqdn' identified by '12345678' with grant option;
EOF
sudo mysql < /tmp/slurm_user.sql

On start slurmdbd will first try to connect to the back-end database…

  • StorageHost database hostname
  • StorageUser database user name
  • StoragePass database user password
  • StorageType database type
  • StorageLoc database name on the database server (defaults to slurm_acct_db)
# back-end database
#
StorageHost=lxrmdb04
#StoragePort=3306
StorageUser=slurm
StoragePass=12345678
StorageType=accounting_storage/mysql
#StorageLoc=slurm_acct_db

Launch the interactive mysql shell…

/* ...list databases */
show databases like 'slurm%';

/* ...check users access to databases */
select user,host from mysql.user where user='slurm';

Connect from a remote node…

  • …requires the MySQL client (dnf install -y mysql)
  • …use the password set with StoragePass in slurmdbd.conf
mysql_password=$(grep StoragePass /etc/slurm/slurmdbd.conf | cut -d= -f2)
database=$(grep StorageHost /etc/slurm/slurmdbd.conf | cut -d= -f2)

# connect to the database server
mysql --host $database --user slurm --password="$mysql_password" slurm_acct_db

Front-End

Configure the Slurm controller to write accounting records to a back-end SQL database using slurmdbd as interface:

AccountingStorageType The accounting storage mechanism type. Acceptable values at present include “accounting_storage/none” and “accounting_storage/slurmdbd”. The “accounting_storage/slurmdbd” value indicates that accounting records will be written to the Slurm DBD, which manages an underlying MySQL database. See “man slurmdbd” for more information. The default value is “accounting_storage/none” and indicates that account records are not maintained. Also see DefaultStorageType.

In order to enable all nodes to query the accounting database, make sure that the following configuration is correct:

AccountingStorageHost The name of the machine hosting the accounting storage database. Only used with systems using SlurmDBD, ignored otherwise.
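
A corresponding slurm.conf sketch (the host name matches the example output further below, the port is the slurmdbd default):

# accounting front-end ...records are sent to slurmdbd
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageHost=lxbk0263
#AccountingStoragePort=6819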

Note that the configuration above refers to the node hosting the Slurm database daemon, not the back-end database. An error similar to the following text is emitted by sacct if the connection cannot be established:

sacct: error: slurm_persist_conn_open_without_init: failed to open persistent connection to host:localhost:6819: Connection refused
sacct: error: Sending PersistInit msg: Connection refused
sacct: error: Problem talking to the database: Connection refused

Changes to this configuration require scontrol reconfigure to be propagated:

# check the configuration with...
>>> scontrol show config | grep AccountingStorage.*Host
AccountingStorageBackupHost = (null)
AccountingStorageHost   = lxbk0263
AccountingStorageExternalHost = (null)

Purge

The database can grow very large with time…

  • …depends on the job throughput
  • truncating the tables helps performance
  • …typically no need to access very old job metadata

Remove old data from the accounting database

# data retention
#
PurgeEventAfter=1month
PurgeJobAfter=12month
PurgeResvAfter=1month
PurgeStepAfter=1month
PurgeSuspendAfter=1month
PurgeTXNAfter=12month
PurgeUsageAfter=24month 

Sites requiring access to historic account data…

  • …separated from the archive options described in the next section
  • …may host a dedicated isolated instance of slurmdbd
  • …runs a copy or part of a copy of the production database
  • …provides quick access to query historical information

Archive

Archive accounting database:

# data archive
#
ArchiveDir=/var/spool/slurm/archive
ArchiveEvents=yes
ArchiveJobs=yes
ArchiveResvs=yes
ArchiveSteps=no
ArchiveSuspend=no
ArchiveTXN=no
ArchiveUsage=no

slurmrestd

The slurmrestd service translates JSON/YAML over HTTP requests into Slurm RPC requests…

  • Allows submitting and managing jobs through REST calls (for example via curl)
  • Launch and manage batch jobs from a (web-)service
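
A sketch of a REST call with curl (assuming slurmrestd listens on localhost port 6820, JWT authentication is configured, and the v0.0.40 OpenAPI plugin from 23.11 is loaded):

# request a JWT for the current user ...exports SLURM_JWT
export $(scontrol token)

# ping the REST service
curl -s -H "X-SLURM-USER-NAME: $USER" \
        -H "X-SLURM-USER-TOKEN: $SLURM_JWT" \
        http://localhost:6820/slurm/v0.0.40/ping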


slurmd

Each compute server (node) has a slurmd daemon…

  • …waits for work, executes that work…
  • …returns status, and waits for more work

Flags (appended to the node state in sinfo output)…

  • -…planned for backfill
  • *…not responding
  • $…maintenance
  • @…pending reboot
  • ^…rebooting
  • !…pending power down
  • %…powering down
  • ~…power off
  • #…power up & configuring

Node States

Node state codes…man sinfo

  • ALLOCATED…by one or more jobs
  • ALLOCATED+…some jobs in process of completing
  • COMPLETING…all jobs completing
  • IDLE…not allocated
  • INVAL…node did not register to controller
    • …allow invalid node resource with SlurmdParameters=config_overrides
    • …only useful for testing purposes
  • FUTURE…node not available yet
  • MAINT…node in maintenance
  • RESERVED…advanced reservation
  • …some more…

Drain & Resume

Non-production node states…

  • DRAINING…node will become unavailable by admin request
  • DRAINED…node unavailable by admin request
  • DOWN…node unavailable for use
  • FAIL…node expected to fail…unavailable by admin request
  • FAILING…jobs expected to fail soon

Get an overview…

# ...non-responding (dead) nodes
sinfo -d

# drain nodes with reason
sinfo -o '%4D %10T %20E %N' -S -D -t drain,draining,drained

# node-list of drained nodes
sinfo -h -N -o '%n' -t drain,draining,drained | nodeset -f

# node-list of unresponsive nodes...
sinfo -h -N -o '%n' -t down,no_respond,power_down,unk,unknown | nodeset -f

Drain and resume nodes…

scontrol update state=drain nodename="$nodeset" reason="$reason"

scontrol update state=resume nodename="$nodeset"

Configuration

# foreground debug mode...
slurmd -Dvvvvv

Configuration in slurm.conf

  • SlurmdUser…defaults to root
  • SlurmdPort…defaults to 6818
  • SlurmdParameters…see man-page
  • SlurmdTimeout…time in seconds (defaults to 300)…
    • …before slurmctld sets an unresponsive node to state DOWN
    • …ping by Slurm internal communication mechanisms
  • SlurmdPidFile…defaults to /var/run/slurmd.pid
  • SlurmdSpoolDir…defaults to /var/spool/slurmd
    • …daemon’s state information
    • …batch job script information
  • SlurmdLogFile…defaults to syslog
  • SlurmdDebug & SlurmdSyslogDebug
    • …during operations quiet, fatal, error or info
    • …for debugging verbose, debug or debug{2,3,4,5}

scontrol reboot

Reboot nodes using the resource manager scontrol reboot sub-command:

# reboot nodes as soon as they are idle (explicitly drain the nodes beforehand)
scontrol reboot ...            # Defaults to ALL!!! reboots all nodes in the cluster
scontrol reboot $(hostname)... # reboot localhost
scontrol reboot "$nodeset" ... # reboot a nodeset
# drain & reboot the nodes
scontrol reboot ASAP "$nodeset"
# cancel pending reboots with
scontrol cancel_reboot "$nodeset"
# node clears its state and returns to service after reboot
scontrol reboot "$nodeset" nextstate=RESUME ...

RebootProgram

The commands above will execute the configured RebootProgram:

>>> scontrol show config | grep -i reboot                       
RebootProgram           = /etc/slurm/libexec/reboot

Example…

#!/bin/bash
# prefer a hard IPMI power reset to overcome hanging Lustre mounts...
if IPMITOOL="$(command -v ipmitool)" ; then
    "$IPMITOOL" power reset
else
    /usr/bin/systemctl reboot --force
fi

States

Nodes with pending reboot…

>>> scontrol show node $node
#...
  State=MIXED+DRAIN+REBOOT_REQUESTED #...
#...
  Reason=Reboot ASAP [root@2023-10-18T09:50:18]

Nodes during reboot…

>>> scontrol show node $node
#...
  State=DOWN+DRAIN+REBOOT_ISSUED #...
#...
  Reason=Reboot ASAP : reboot issued [root@2023-10-20T07:05:07]

Footnotes

  1. Slurm Community BoF, SC23, November 2023
    https://slurm.schedmd.com/SC23/Slurm-SC23-BOF.pdf

  2. Dynamic Nodes, Slurm Administrator Documentation
    https://slurm.schedmd.com/dynamic_nodes.html

  3. Cloudy, With a Chance of Dynamic Nodes, SLUG’22
    https://slurm.schedmd.com/SLUG22/Dynamic_Nodes.pdf