Howto: Slurm Configless Nodes
“Configless” Slurm [1]… SLUG’20 [2]
- …nodes slurmd process pulls configuration from slurmctld
- …instead of a shared configuration directory mounted via NFS to /etc/slurm
- …run slurmd to manage configurations on login nodes
- …without slurmd to cache configurations, client commands cause an RPC storm
Configuration
Configuration files are stored in a cache directory
- …sub-directory /conf-cache/ in the SlurmdSpoolDir
- …symlink is automatically created in /run/slurm/conf
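
To locate the cache on a given node, the spool directory can be read from the running configuration. The following is a minimal sketch, not an official tool: conf_cache_dir is a hypothetical helper name, and it assumes the usual "Key = value" layout printed by scontrol show config. It reads that output on stdin, so it can be tried without a cluster.

```shell
# Hypothetical helper: derive the configless cache directory from
# `scontrol show config` output supplied on stdin.
conf_cache_dir() {
    awk -F'=' '
        # Trim whitespace around the key so it can be compared exactly.
        { key = $1; gsub(/^[ \t]+|[ \t]+$/, "", key) }
        key == "SlurmdSpoolDir" {
            val = $2; gsub(/^[ \t]+|[ \t]+$/, "", val)
            sub(/\/+$/, "", val)      # drop trailing slashes
            print val "/conf-cache"
            exit
        }
    '
}

# Usage on a live node (assumes scontrol is in PATH):
#   ls -l "$(scontrol show config | conf_cache_dir)" /run/slurm/conf
```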
Configuration…
- slurm.conf…
  - …requires SlurmctldParameters=enable_configless
  - …scontrol reconfigure to apply configuration
- slurmd…
  - …make sure no configuration is present in /etc/slurm
  - …option --conf-server points to the slurmctld host
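
Both preconditions above can be checked before starting anything. This is a rough sketch under stated assumptions: has_configless and dir_is_empty are hypothetical helper names, and the check only looks at a single slurm.conf file without following Include directives.

```shell
# Hypothetical helper: return 0 if the given slurm.conf enables configless mode.
has_configless() {
    # Skip comment lines; keys in slurm.conf are matched case-insensitively.
    grep -v '^[[:space:]]*#' "$1" |
        grep -i '^[[:space:]]*SlurmctldParameters[[:space:]]*=' |
        grep -qi 'enable_configless'
}

# Hypothetical helper: return 0 if a directory is missing or has no entries,
# which is what a configless node expects of /etc/slurm.
dir_is_empty() {
    [ ! -d "$1" ] || [ -z "$(ls -A "$1")" ]
}

# Usage:
#   has_configless /etc/slurm/slurm.conf && scontrol reconfigure   # on the controller
#   dir_is_empty /etc/slurm && slurmd --conf-server wlm1           # on the node
```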
Limitations:
If any of the supported config files “Include” additional config files, the Included configs will ONLY be shipped if their “Include” filename reference has no path separators and the file is located adjacent to slurm.conf. Any additional config files will need to be shared a different way or added to the parent config.
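
To spot Include references affected by this limitation, a configuration file can be scanned for Include targets that contain a path separator. A minimal sketch, assuming one Include directive per line; unshipped_includes is a hypothetical helper name, not part of Slurm.

```shell
# Hypothetical helper: print Include targets that contain a "/" and would
# therefore not be shipped to configless nodes.
unshipped_includes() {
    awk '
        # Commented lines start with "#", so their first field never
        # equals "include" and they are skipped.
        tolower($1) == "include" && $2 ~ /\// { print $2 }
    ' "$1"
}

# Usage:
#   unshipped_includes /etc/slurm/slurm.conf
```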
Example
Controller wlm1 does not configure a node ex3:
>>> sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
debug* up infinite 3 unk* ex[0-2]
>>> grep NodeName /etc/slurm/slurm.conf
NodeName=ex[0-2] CPUs=1 State=UNKNOWN
# Configuration supports configless mode…
>>> scontrol show config | grep SlurmctldParameters
SlurmctldParameters = enable_configless

Any node which can authenticate is able to communicate to slurmctld…
- …requires configuration of munged or sackd
- …require a corresponding configuration in /etc/slurm/slurm.conf
On a node unknown to the controller wlm1
>>> grep SlurmctldHost /etc/slurm/slurm.conf
SlurmctldHost=wlm1(192.168.200.2)
>>> sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
debug* up infinite 3 unk* ex[0-2]

Switching to a configless setup…
- …remove any configuration from /etc/slurm
- …use a command-line option to reference the controller server
>>> rm -rf /etc/slurm
>>> slurmd -Dvvvv --conf-server wlm1
slurmd: fatal: Unable to determine this slurmd's NodeName

slurmd can only register if included in a NodeName configuration
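
Whether a host is covered by a NodeName range can be checked before starting slurmd by expanding the range and looking for the host name. A small sketch: node_is_listed is a hypothetical helper name, and the usage line assumes scontrol show hostnames is available on a host with a working Slurm installation.

```shell
# Hypothetical helper: return 0 if the node name ($1) appears in a
# newline-separated host list read from stdin.
node_is_listed() {
    # -x matches whole lines only, -F treats the name as a fixed string.
    grep -qxF -- "$1"
}

# Usage where scontrol is available:
#   scontrol show hostnames 'ex[0-2]' | node_is_listed "$(hostname -s)" \
#       || echo "node is not part of the NodeName configuration"
```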
echo 'SLURMD_OPTIONS=--conf-server wlm1' >> /etc/sysconfig/slurmd
systemctl start slurmd
ln -sf /var/spool/slurm/d/conf-cache/ /etc/slurm

Footnotes
1. Configless Slurm, SchedMD Documentation
   https://slurm.schedmd.com/configless_slurm.html
2. Field Notes 4: From The Frontlines of Slurm Support, SLUG’20
   https://www.youtube.com/watch?v=F8CZaqOQ4Sk