Slurm — Dynamic Nodes
Overview
Why Use the Slurm Dynamic Nodes Feature?
- Leverage Temporary Node Resources — Enables on-demand utilization of transient resources
- Shared Resource Pool — Facilitates multiple dynamic clusters sharing a common set of resources
- Faster Provisioning — Nodes can be registered in real time without disrupting running workloads
- Automation Friendly — Integrates cleanly with orchestration tools
| Aspect | Static Nodes | Dynamic Nodes |
|---|---|---|
| Multi-cluster | No | Yes |
| Configuration | `slurm.conf` | `scontrol create` & `slurmd -Z` |
| Config persistence | Yes | No |
| Operations | Predictable | More complex (non-persistent & invalid node states) |
| Node state | Obvious `DOWN`/`DRAIN` | External states (nodes transition between clusters) |
| Debugging | Easy | Harder |
| Model | Hardware-centric | Infrastructure-centric |
What Does Infrastructure-Centric Mean in the Context of Dynamic Nodes?
- The dynamic nodes feature is designed to work alongside infrastructure automation systems
- For example, slurm-operator[^1] enables dynamic Slurm cluster management on Kubernetes
- Major cloud providers (e.g., GCE, AWS) offer native integrations for dynamically scaling Slurm clusters
- Flurm[^2] (flexible Slurm) is a collection of scripts that leverages configless and dynamic nodes to fluidly reallocate resources across multiple clusters.
Static vs Dynamic
Static (Non-Dynamic) Nodes — Discovery via slurm.conf
- Predefined Node Configuration: Nodes must be explicitly listed in the `slurm.conf` file (see the sketch after this list)
- Synchronized Configuration: The `slurm.conf` file must be consistent across all cluster nodes
- Modifying the static node configuration requires restarting all services:
  - Stop the `slurmctld` service
  - Modify the `NodeName=` entries in `slurm.conf`
  - Restart the `slurmd` daemon on each node
  - Restart the `slurmctld` service
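For reference, a minimal static definition might look like the following sketch; node names, counts, and resource values are illustrative assumptions, not taken from any actual cluster:

```
# slurm.conf: static node definition (hypothetical names and sizes)
NodeName=node[01-04] CPUs=128 RealMemory=514000 State=UNKNOWN
# partitions reference the statically defined nodes
PartitionName=batch Nodes=node[01-04] Default=YES State=UP
```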
Dynamic Nodes[^3][^4] — Node registration via CLI or REST API
- Register Nodes on Demand: Use `scontrol create` and/or the REST API
- No Configuration File Changes: No modification of configuration files is required
- No Service Restarts: No restart of `slurmctld` or `slurmd` required
Slurm periodically probes all nodes defined in its configuration to establish a connection with the corresponding slurmd instances. If nodes are migrated between multiple cluster instances, they must either be removed from the Slurm configuration or assigned unique DNS names to prevent collisions between slurmctld daemons.
Failure to do so can result in nodes being claimed by the wrong controller, leading to scheduling errors, failed job launches, or unpredictable cluster behavior. In environments where node reuse or migration is common, it is therefore recommended to enforce strict separation between cluster configurations, for example by using distinct DNS zones, node name prefixes, or automated cleanup of Slurm state before reassignment.
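One way to enforce such separation is a per-cluster node name prefix when starting `slurmd`; the cluster prefixes and node names below are hypothetical:

```
# give each cluster its own node name space so controllers never
# claim each other's nodes (names are examples)
slurmd -Z -N clusterA-dyn01   # registers with cluster A's slurmctld
slurmd -Z -N clusterB-dyn01   # same hardware, later reused under cluster B
```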
Registration
Prerequisites for dynamic node registration:[^5]

- Node Registration Method: `slurmd` registers nodes using `NodeAddr`/`NodeHostname`
- Scheduler Configuration: Dynamic nodes are supported only when `SelectType=select/cons_tres` is configured in `slurm.conf` (see the snippet after this list)
- Maximum Node Limit: The `MaxNodeCount=` parameter must be set in `slurm.conf`
- Current Limitations:
  - Suboptimal Node Representation: Internal data structures for dynamic nodes are less efficient
  - Topology Plugin: Topology plugin configuration must be considered during node registration
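A minimal `slurm.conf` fragment covering these prerequisites might look as follows; the `MaxNodeCount` value is an illustrative assumption:

```
# slurm.conf: prerequisites for dynamic nodes
SelectType=select/cons_tres   # required scheduler plugin
MaxNodeCount=512              # upper bound on nodes in the cluster (example value)
```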
Register a node via `slurmd` with option `-Z`

```
# Start slurmd with an option to dynamically register
slurmd -Z #…

# persistent configuration
echo 'SLURMD_OPTIONS=-Z' >> /etc/default/slurmd
systemctl restart slurmd.service
```

- Option `--conf` — Defines additional parameters of a dynamic node (`NodeName=` not allowed)
- Hardware resources are optional; if configured, they override the values detected by `slurmd -C`
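After starting `slurmd -Z`, a query like the following can confirm that the node registered; the node name is a placeholder:

```
# check that the dynamic node appeared: name, state, CPUs, memory
sinfo -N -n dyn01 -o "%N %T %c %m"
```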
Pre-register a node using the admin CLI
```
# add a node to the cluster
scontrol create state=FUTURE nodename=$nodeset #…

# Only dynamic nodes that have no running jobs and that are not
# part of a reservation can be deleted
scontrol delete nodename=$nodeset
```
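As a concrete sketch, `FUTURE` placeholders can carry the expected hardware so the scheduler can plan for them before the hosts come online; node names and resource values here are hypothetical:

```
# pre-register two placeholder nodes with explicit resources;
# they become real once slurmd -Z starts on matching hosts
scontrol create NodeName=dyn[01-02] State=FUTURE CPUs=128 RealMemory=514000
```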
Node States

Check node state using `scontrol show node`:

```
>>> scontrol show node ccexe0100
NodeName=ccexe0100 Arch=x86_64 CoresPerSocket=64
#…
State=IDLE+DYNAMIC_NORM ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
#…
```

Dynamic node states describe where a node is in its lifecycle when using the dynamic nodes feature:
| State | Description |
|---|---|
| `DYNAMIC_FUTURE` | Placeholder representing capacity that may appear in the future |
| `DYNAMIC_NORM` | Dynamic node registered successfully via `slurmd`; treated as a real, active node |
Resources & Features
Slurm checks resources[^6] during node registration — `CPUs`, `RealMemory` and `TmpDisk`

- Missing resources automatically drain a node with state `INVAL_REG`
- Resources & features can be configured with the following methods:
  1. Statically in `slurm.conf`
  2. Via the `scontrol` admin CLI
  3. As an option to `slurmd --conf=` (typically via configuration management or ad-hoc scripts), as shown below
  4. Via the REST API, for programmatic integration with infrastructure automation
- Methods 2, 3 & 4 are non-persistent (dynamic); the settings are lost when the node or service restarts
```
slurmd -Z --conf "RealMemory=714000 Feature=amd,9654"
```
Partition Assignment

By default, nodes are not added to any partition:

```
# …unless using `Nodes=ALL` in the partition definition
PartitionName=open Nodes=ALL #…
```
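Alternatively, a registered node can be placed into a partition at runtime with `scontrol`; note that `Nodes=` sets the partition's complete node list, and the node name below is a placeholder:

```
# non-persistent: set the node list of an existing partition
scontrol update PartitionName=open Nodes=dyn01
```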
Static Configuration

ToDo
Dynamic Features
Use `NodeSet=` definitions and register dynamic nodes with a matching feature to add them to the nodeset:

```
NodeSet=ns1 Feature=f1
NodeSet=ns2 Feature=f2

PartitionName=p1 Nodes=ns1
PartitionName=p2 Nodes=ns2
PartitionName=p3 Nodes=ns1,ns2
```

Run `scontrol reconfigure` after modifications to nodesets and partitions!
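With the configuration above, a dynamic node registered with feature `f1` lands in nodeset `ns1` and thus becomes schedulable in partitions `p1` and `p3`; a minimal sketch:

```
# register a dynamic node carrying feature f1; it joins nodeset ns1
slurmd -Z --conf "Feature=f1"
```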
Node features can be changed by `scontrol update`:

- `AvailableFeatures=<features>` — Feature(s) available on the specified node
  - Features being removed via `scontrol` must not be active
  - The previously defined available feature specification is overwritten with the new value
- `ActiveFeatures=<features>` — Feature(s) currently active on the specified node
  - The previously active feature specification is overwritten with the new value
  - `ActiveFeatures` may be configured as a subset of the `AvailableFeatures`
```
# remove active features by omitting them from the list
scontrol update nodename=$nodeset activefeatures="amd,epyc,9654"

# add a new available feature
scontrol update nodename=$nodeset availablefeatures="amd,epyc,9654,debug"
```

Footnotes
[^1]: Kubernetes Operator for Slurm Clusters, GitHub. https://github.com/SlinkyProject/slurm-operator
[^2]: A novel approach to dynamic computing using Slurm, PSI. https://github.com/paulscherrerinstitute/flurm
[^3]: Dynamic Nodes, Slurm Administrator Documentation. https://slurm.schedmd.com/dynamic_nodes.html
[^4]: Cloudy, With a Chance of Dynamic Nodes, SLUG’22. https://slurm.schedmd.com/SLUG22/Dynamic_Nodes.pdf
[^5]: Dynamic Nodes - Slurm Configuration, SchedMD Documentation. https://slurm.schedmd.com/dynamic_nodes.html#config
[^6]: Node Configuration, `slurm.conf` Manual. https://slurm.schedmd.com/slurm.conf.html#SECTION_NODE-CONFIGURATION