Howto: Slurm Configless Nodes
“Configless” Slurm [1]… SLUG’20 [2]
- …nodes' `slurmd` process pulls configuration from `slurmctld`
- …instead of a shared configuration directory mounted via NFS to `/etc/slurm`
- …run `slurmd` to manage configurations on login nodes
- …without `slurmd` to cache configurations, client commands cause an RPC storm
Configuration
Configuration files are stored in a cache directory
- …sub-directory `/conf-cache/` in the `SlurmdSpoolDir`
- …symlink is automatically created at `/run/slurm/conf`
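To make the two locations concrete, here is a minimal sketch that derives the cache path from a spool directory and reports where a symlink points. The helper names `conf_cache_dir` and `show_conf_link` are hypothetical, and `/var/spool/slurm/d` is only an assumed `SlurmdSpoolDir`; the real value is site-specific.

```shell
# Hypothetical helper: print the configless cache directory for a SlurmdSpoolDir.
conf_cache_dir() {
    # $1: SlurmdSpoolDir (assumed value, check your own slurm.conf)
    printf '%s/conf-cache\n' "${1%/}"
}

# Hypothetical helper: report where a symlink points, or that it is missing.
show_conf_link() {
    # $1: symlink path, /run/slurm/conf on a configless node
    if [ -L "$1" ]; then
        printf '%s -> %s\n' "$1" "$(readlink "$1")"
    else
        printf '%s is not a symlink\n' "$1"
    fi
}

conf_cache_dir /var/spool/slurm/d
show_conf_link /run/slurm/conf
```

On a node where `slurmd` is not running in configless mode, the second call should simply report that the path is not a symlink.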
Configuration…
- `slurm.conf`…
  - …requires `SlurmctldParameters=enable_configless`
  - …`scontrol reconfigure` to apply configuration
- `slurmd`…
  - …make sure no configuration is present in `/etc/slurm`
  - …option `--conf-server` points to the `slurmctld` host
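The two prerequisites above can be checked before starting `slurmd`. The following is a rough sketch, not an official tool: the helper names `configless_enabled` and `no_local_config` are hypothetical, and treating "no `*.conf` files in the directory" as "no configuration present" is a simplifying assumption. The demo runs against a throwaway file rather than a live controller.

```shell
# Hypothetical helper: succeed if the given slurm.conf enables configless mode.
configless_enabled() {
    # $1: path to slurm.conf on the controller
    grep -Eiq '^[[:space:]]*SlurmctldParameters[[:space:]]*=.*enable_configless' "$1"
}

# Hypothetical helper: succeed if a directory holds no local *.conf files.
# Assumption: only *.conf files count as "configuration".
no_local_config() {
    # $1: configuration directory on the compute node, e.g. /etc/slurm
    [ -z "$(find "$1" -maxdepth 1 -name '*.conf' 2>/dev/null)" ]
}

# Demo against a temporary file instead of a real /etc/slurm/slurm.conf.
demo_conf=$(mktemp)
printf 'SlurmctldHost=wlm1\nSlurmctldParameters=enable_configless\n' > "$demo_conf"
if configless_enabled "$demo_conf"; then
    echo "controller: configless enabled"
fi
rm -f "$demo_conf"
```

On a real node, `no_local_config /etc/slurm` would be the corresponding check on the `slurmd` side.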
Limitations:
If any of the supported config files “Include” additional config files, the Included configs will ONLY be shipped if their “Include” filename reference has no path separators and the file is located adjacent to slurm.conf. Any additional config files will need to be shared a different way or added to the parent config.
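One way to spot affected files is to scan a configuration for `Include` lines whose target contains a path separator. This is only a sketch: the helper name `unshipped_includes` and the file names in the demo are hypothetical, it matches the keyword case-insensitively as a precaution, and it does not verify that a plain filename actually sits next to `slurm.conf`.

```shell
# Hypothetical helper: list Include targets that contain a path separator
# and would therefore not be shipped to configless nodes.
unshipped_includes() {
    # $1: path to a Slurm configuration file
    awk 'tolower($1) == "include" && $2 ~ /\// { print $2 }' "$1"
}

# Demo with made-up file names in a temporary file.
demo_conf=$(mktemp)
cat > "$demo_conf" <<'EOF'
Include nodes.conf
Include /opt/site/partitions.conf
EOF
unshipped_includes "$demo_conf"   # prints /opt/site/partitions.conf
rm -f "$demo_conf"
```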
Example
Controller `wlm1` does not configure a node `ex3`:
>>> sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
debug* up infinite 3 unk* ex[0-2]
>>> grep NodeName /etc/slurm/slurm.conf
NodeName=ex[0-2] CPUs=1 State=UNKNOWN
# Configuration supports configless mode…
>>> scontrol show config | grep SlurmctldParameters
SlurmctldParameters = enable_configless
Any node which can authenticate is able to communicate with `slurmctld`…
- …requires configuration of `munged` or `sackd`
- …requires a corresponding configuration in `/etc/slurm/slurm.conf`
On a node unknown to the controller `wlm1`:
>>> grep SlurmctldHost /etc/slurm/slurm.conf
SlurmctldHost=wlm1(192.168.200.2)
>>> sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
debug* up infinite 3 unk* ex[0-2]
Switching to a configless setup…
- …remove any configuration from `/etc/slurm`
- …use a command-line option to reference the controller server
>>> rm -rf /etc/slurm
>>> slurmd -Dvvvv --conf-server wlm1
slurmd: fatal: Unable to determine this slurmd's NodeName
`slurmd` can only register if its node is included in a `NodeName` configuration.
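Whether a host is covered by a `NodeName` range can be checked before starting `slurmd`. The sketch below assumes the range has already been expanded to one name per line (on a live system `scontrol show hostnames 'ex[0-2]'` can produce such a list); the helper name `node_known` is hypothetical.

```shell
# Hypothetical helper: succeed if host $1 appears in the expanded node list on stdin.
node_known() {
    grep -Fxq -- "$1"
}

# The expansion of ex[0-2] is written out by hand here so the sketch
# runs without a Slurm installation.
if printf 'ex0\nex1\nex2\n' | node_known ex3; then
    echo "ex3 is configured"
else
    echo "ex3 is missing from NodeName"
fi
```

With the `ex[0-2]` range from the example, this reports that `ex3` is missing, which matches the fatal error shown by `slurmd`.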
echo 'SLURMD_OPTIONS=--conf-server wlm1' >> /etc/sysconfig/slurmd
systemctl start slurmd
ln -sf /var/spool/slurm/d/conf-cache/ /etc/slurm
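Appending with `>>` adds another `SLURMD_OPTIONS` line every time the command is repeated. A minimal sketch of an idempotent variant follows; the helper name `set_conf_server` is hypothetical, and the demo writes to a temporary file rather than `/etc/sysconfig/slurmd`.

```shell
# Hypothetical helper: add the --conf-server option to an environment file
# only if the exact line is not already present.
set_conf_server() {
    # $1: environment file, $2: controller host
    line="SLURMD_OPTIONS=--conf-server $2"
    grep -Fxq -- "$line" "$1" 2>/dev/null || printf '%s\n' "$line" >> "$1"
}

# Demo: calling the helper twice leaves a single line in the file.
demo_env=$(mktemp)
set_conf_server "$demo_env" wlm1
set_conf_server "$demo_env" wlm1
cat "$demo_env"   # prints SLURMD_OPTIONS=--conf-server wlm1
rm -f "$demo_env"
```

The sketch only guards against exact duplicates; a line with a different controller host would still be appended.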
Footnotes
[1] Configless Slurm, SchedMD Documentation, https://slurm.schedmd.com/configless_slurm.html
[2] Field Notes 4: From The Frontlines of Slurm Support, SLUG’20, https://www.youtube.com/watch?v=F8CZaqOQ4Sk