Slurm - Configless Compute & Login Nodes

HPC

Published: April 23, 2024

Modified: April 23, 2024

“Configless” Slurm¹… SLUG’20²

Configuration

Configuration files are stored in a cache directory

  • …sub-directory /conf-cache/ in the SlurmdSpoolDir
  • …symlink is automatically created in /run/slurm/conf
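
As a minimal sketch, the cache location can be derived from the spool directory. The helper name `conf_cache_dir` is hypothetical, and `/var/spool/slurm/d` is assumed as the SlurmdSpoolDir only because that path appears in the example further down; check the actual value on your system.

```shell
# Hypothetical helper: print the conf-cache path for a given SlurmdSpoolDir.
conf_cache_dir() {
    # Strip one trailing slash so the result has no double separator.
    printf '%s/conf-cache\n' "${1%/}"
}

# Example with an assumed spool directory:
conf_cache_dir /var/spool/slurm/d
```

On a running configless node, listing that directory and `/run/slurm/conf` should show the files fetched from the controller.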

Configuration…

  • slurm.conf
    • …requires SlurmctldParameters=enable_configless
    • …restart slurmctld after enabling, then use scontrol reconfigure to push later changes
  • slurmd
    • …make sure no configuration is present in /etc/slurm
    • …option --conf-server points to the slurmctld host
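
Before pointing nodes at the controller, it can help to confirm that the controller's slurm.conf really enables the mode. The following is a rough sketch; the helper name `has_configless` is hypothetical, and the sample file is written to a temporary path rather than to `/etc/slurm`.

```shell
# Hypothetical helper: succeed if the given slurm.conf enables configless mode.
has_configless() {
    # Match an uncommented SlurmctldParameters line that lists enable_configless,
    # possibly alongside other comma-separated parameters.
    grep -Eiq '^[[:space:]]*SlurmctldParameters[[:space:]]*=.*enable_configless' "$1"
}

# Demonstration against a throwaway file (contents are illustrative only).
conf=$(mktemp)
printf 'SlurmctldHost=wlm1\nSlurmctldParameters=enable_configless\n' > "$conf"
if has_configless "$conf"; then
    echo "configless enabled"
fi
rm -f "$conf"
```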

Limitations:

If any of the supported config files “Include” additional config files, the Included configs will ONLY be shipped if their “Include” filename reference has no path separators and the file is located adjacent to slurm.conf. Any additional config files will need to be shared a different way or added to the parent config.
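
The limitation above can be checked mechanically. This is a sketch under assumptions, not a Slurm tool: the helper name `check_includes` and its output format are hypothetical. It reads the `Include` lines of a slurm.conf-style file and reports which references have no path separator and sit next to that file.

```shell
# Hypothetical helper: list Include references in a slurm.conf-style file and
# flag those that contain a path separator or are not adjacent to the file.
check_includes() {
    conf=$1
    dir=$(dirname "$conf")
    # Include lines look like: Include <filename>
    awk 'tolower($1) == "include" { print $2 }' "$conf" |
    while IFS= read -r name; do
        case $name in
            */*) echo "not-shipped $name (contains a path separator)" ;;
            *)
                if [ -f "$dir/$name" ]; then
                    echo "shipped $name"
                else
                    echo "not-shipped $name (not found next to $conf)"
                fi ;;
        esac
    done
}
```

Running `check_includes /etc/slurm/slurm.conf` on the controller would then show which included files need to be distributed some other way.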

Example

The controller wlm1 has no configuration for a node ex3:

>>> sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
debug*       up   infinite      3   unk* ex[0-2]

>>> grep NodeName /etc/slurm/slurm.conf 
NodeName=ex[0-2] CPUs=1 State=UNKNOWN

# Configuration supports configless mode…
>>> scontrol show config | grep SlurmctldParameters
SlurmctldParameters     = enable_configless

Any node that can authenticate is able to communicate with slurmctld

  • …requires configuration of munged or sackd
  • …requires a corresponding configuration in /etc/slurm/slurm.conf

On a node unknown to the controller wlm1:

>>> grep SlurmctldHost /etc/slurm/slurm.conf 
SlurmctldHost=wlm1(192.168.200.2)

>>> sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
debug*       up   infinite      3   unk* ex[0-2]

Switching to a configless setup…

  • …remove any configuration from /etc/slurm
  • …use a command-line option to reference the controller server

>>> rm -rf /etc/slurm
>>> slurmd -Dvvvv --conf-server wlm1
slurmd: fatal: Unable to determine this slurmd's NodeName

slurmd can only register if its node is included in a NodeName definition in the controller's slurm.conf
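
A quick way to anticipate the fatal error shown above is to test whether a host name is part of the configured node list. The sketch below assumes the node list has already been expanded to one name per line; the helper name `node_is_configured` is hypothetical.

```shell
# Hypothetical helper: read expanded node names on stdin (one per line)
# and succeed only if the given name is among them.
node_is_configured() {
    # -F: fixed string, -x: match the whole line, -q: no output.
    grep -Fxq -- "$1"
}

# Node list from the example above, already expanded from ex[0-2]:
if printf 'ex0\nex1\nex2\n' | node_is_configured ex3; then
    echo "ex3 is configured"
else
    echo "ex3 is missing from NodeName"
fi
```

Where the Slurm client tools are available, `scontrol show hostnames 'ex[0-2]'` can produce the expanded list to pipe into the helper.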

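# Once the node is part of a NodeName definition, persist the option and start the service…
>>> echo 'SLURMD_OPTIONS=--conf-server wlm1' >> /etc/sysconfig/slurmd
>>> systemctl start slurmd
# Make the cached configuration visible under /etc/slurm…
>>> ln -sf /var/spool/slurm/d/conf-cache/ /etc/slurm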
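
Note that appending with `>>` adds a duplicate `SLURMD_OPTIONS` line every time the command is repeated. A possible idempotent variant is sketched below; the helper name `ensure_line` is hypothetical, and the demonstration writes to a temporary file instead of `/etc/sysconfig/slurmd`.

```shell
# Hypothetical helper: append a line to a file only if it is not already there.
ensure_line() {
    line=$1
    file=$2
    # -F: fixed string, -x: whole line, -q: quiet; "--" guards lines starting with "-".
    grep -Fxq -- "$line" "$file" 2>/dev/null || printf '%s\n' "$line" >> "$file"
}

# Demonstration: calling it twice still leaves a single entry.
demo=$(mktemp)
ensure_line 'SLURMD_OPTIONS=--conf-server wlm1' "$demo"
ensure_line 'SLURMD_OPTIONS=--conf-server wlm1' "$demo"
cat "$demo"
rm -f "$demo"
```

On a real node the call would be `ensure_line 'SLURMD_OPTIONS=--conf-server wlm1' /etc/sysconfig/slurmd`, run as root.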

Footnotes

  1. Configless Slurm, SchedMD Documentation
    https://slurm.schedmd.com/configless_slurm.html

  2. Field Notes 4: From The Frontlines of Slurm Support, SLUG’20
    https://www.youtube.com/watch?v=F8CZaqOQ4Sk