Slurm — (Live-)Upgrades
Versions & Compatibility
Version numbers
- …major release …year.month 20.02 (2020 February) …every six months [1]
- …maintenance release …appends a suffix 20.02.2 …every 4 to 6 weeks
- Official releases are available at:
Compatibility between releases…
- RPCs (remote procedure calls) & state files
- …only modified with major versions
- …may require rebuilding applications (using Slurm MPI libraries)
- …& locally developed Slurm plugins
 
- Slurm daemons will support…
- …RPCs and state files from the two previous major releases
- …upgrading at least once each year is recommended [2]
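The two-release rule above can be checked mechanically before planning an upgrade. The following is a minimal sketch; the `RELEASES` list and the function names are illustrative assumptions and not part of Slurm, and the supported window should be confirmed against the release notes of the target version.

```shell
# Major releases in chronological order (an assumed list; extend it as needed).
RELEASES="20.02 20.11 21.08 22.05 23.02 23.11 24.05 24.11"

# Print the 0-based position of major release $1 in RELEASES; fail if unknown.
release_index() (
    i=0
    for r in $RELEASES; do
        if [ "$r" = "$1" ]; then
            echo "$i"
            exit 0
        fi
        i=$((i + 1))
    done
    exit 1
)

# Succeed if $2 is at most $3 (default 2) major releases newer than $1.
can_upgrade_directly() (
    max_back=${3:-2}
    from=$(release_index "$1") || exit 2
    to=$(release_index "$2") || exit 2
    [ "$to" -ge "$from" ] && [ $((to - from)) -le "$max_back" ]
)

# Example: 22.05 is three major releases behind 24.05, so this prints a hint.
can_upgrade_directly 22.05 24.05 ||
    echo "22.05 -> 24.05: upgrade via an intermediate release first"
```

An exit status of 2 signals a release that is missing from the list, which keeps "unknown" separate from "too old".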
 
Why Upgrade…
Why upgrade Slurm? [3]
- Upgrade Slurm to take advantage of…
- security patches
- performance improvements
- new features (e.g. support for recent hardware)
 
- …developers provide bug fixes only for recent releases
- …support contracts require staying on a current release
Related discussions on the Slurm mailing list…
Upgrade
Most sites do the upgrade only after draining the cluster [4]
- …SchedMD recommends opening a support request before live upgrades
- …note that this is only possible for paying customers
Recommendation
SchedMD discourages use of packages for deployment…
- …ships/supports the slurm.spec file …but does not recommend using RPMs
- …developers suggest structuring installs in version-specific directories
Usually a slide is included with the following proposed Slurm deployment structure [5]:
./configure --prefix=/apps/slurm/21.08.0/ --sysconfdir=/apps/slurm/etc/
ln -s /apps/slurm/21.08.0 /apps/slurm/dbd
ln -s /apps/slurm/21.08.0 /apps/slurm/ctld
ln -s /apps/slurm/21.08.0 /apps/slurm/d
ln -s /apps/slurm/21.08.0 /apps/slurm/current
- Use the appropriate symlink in each service file, and add the /apps/slurm/current symlink to $PATH (through /etc/profile.d/ or a module file).
- This makes a rolling upgrade much simpler: just move the symlink when ready to move that component forward onto the newer release.
Performing upgrades…
- …install the new version of Slurm to a unique directory
- …use a symbolic link to point to the directory in use
- …allows installing a new version before a maintenance period
- …easily switch between versions in case of roll back
- …avoids potential problems with library conflicts
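Moving a component to another release, or rolling it back, then amounts to re-pointing one symlink. The sketch below demonstrates this in a scratch directory rather than /apps/slurm; the `switch_version` helper and the 22.05.0 directory are hypothetical and stand in for any newer install.

```shell
# Point the symlink $2 at the version directory $1, replacing any existing link.
# -n keeps ln from following an existing link into the old version directory.
switch_version() {
    [ -d "$1" ] || { echo "missing install directory: $1" >&2; return 1; }
    ln -sfn "$1" "$2"
}

# Demonstration in a scratch directory instead of /apps/slurm.
demo_root=$(mktemp -d)
mkdir -p "$demo_root/21.08.0" "$demo_root/22.05.0"

switch_version "$demo_root/21.08.0" "$demo_root/ctld"   # initial deployment
switch_version "$demo_root/22.05.0" "$demo_root/ctld"   # move slurmctld forward
readlink "$demo_root/ctld"                              # shows the active version
switch_version "$demo_root/21.08.0" "$demo_root/ctld"   # roll back
```

The directory check refuses to point a service at a version that was never installed, which is the most likely mistake during a maintenance window.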
(Live-)Upgrades
“Live-Upgrade”…
- …restart slurmdbd (including database migration) and slurmctld within a small enough time frame to not interrupt service on the cluster
- Tolerances for service interrupts are defined by
- SlurmctldTimeout and SlurmdTimeout
- …if the Slurm daemons are down for longer than the specified timeout during an upgrade, nodes may be marked DOWN and their jobs killed.
 
- To further clarify: If slurmd daemons are not able to contact slurmctld within the specified tolerance, it is unavoidable that the payload of the entire cluster is killed.
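Before a live upgrade it helps to know the configured tolerances. The sketch below extracts both values from text in the format printed by `scontrol show config`. The sample input, its values, and the `get_timeout` helper are assumptions for illustration; on a live system the real command output would be piped in instead.

```shell
# Extract a timeout value (seconds) from `scontrol show config` style text on stdin.
# $1 is the parameter name, e.g. SlurmdTimeout.
get_timeout() {
    awk -v key="$1" '$1 == key && $2 == "=" { print $3; exit }'
}

# Sample input in the assumed output format. On a live system use:
#   scontrol show config | get_timeout SlurmdTimeout
sample_config='SlurmctldTimeout        = 120 sec
SlurmdTimeout           = 300 sec'

ctld_timeout=$(printf '%s\n' "$sample_config" | get_timeout SlurmctldTimeout)
slurmd_timeout=$(printf '%s\n' "$sample_config" | get_timeout SlurmdTimeout)

# Compare a planned downtime (hypothetical figure) against the smaller tolerance.
planned_downtime=90
budget=$ctld_timeout
if [ "$slurmd_timeout" -lt "$budget" ]; then
    budget=$slurmd_timeout
fi
if [ "$planned_downtime" -gt "$budget" ]; then
    echo "planned downtime of ${planned_downtime}s exceeds the ${budget}s tolerance"
else
    echo "planned downtime of ${planned_downtime}s fits within the ${budget}s tolerance"
fi
```

Some sites raise both timeouts temporarily for the duration of the upgrade; whether that is appropriate depends on the local configuration.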
Procedure
Database migration…
- …first time slurmdbd is started after an upgrade
- …will take some time to update existing records in the database
- …if slurmdbd is started with systemd, systemd may assume that slurmdbd is not responding and kill the process when it reaches its timeout value, which causes problems with the upgrade.
- …recommend starting slurmdbd by calling the command directly rather than using systemd when performing an upgrade [6]
 
- …run-time of the migration depends on the size of the accounting database
- …relevant to have a reasonable estimate of the run-time of the migration [7]
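Starting slurmdbd directly for the migration could look like the sketch below. It prints the commands by default (`DRY_RUN=1`) so the sequence can be reviewed first. The `run` wrapper and the function name are illustrative, and the systemd unit name `slurmdbd` is an assumption about the local setup. `slurmdbd -D` keeps the daemon in the foreground, and `-v` raises log verbosity.

```shell
# Print each command instead of running it unless DRY_RUN=0 is set explicitly.
DRY_RUN=${DRY_RUN:-1}
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

migrate_accounting_db() {
    # Stop the unit so systemd does not supervise (and time out) the migration.
    run systemctl stop slurmdbd
    # Foreground run with verbose logging; interrupt it once the log shows
    # that the conversion has finished.
    run slurmdbd -D -vvv
    # Hand the daemon back to systemd afterwards.
    run systemctl start slurmdbd
}

migrate_accounting_db
```

Running the same foreground command against a copy of the accounting database on a test host is one way to obtain the run-time estimate mentioned above.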
Upgrade for (mariadb,) slurmdbd, and slurmctld…
- Independent operations in sequential order
- …operators orchestrate the upgrade manually
- …stop slurmctld, stop slurmdbd, stop mariadb (if required)
- …upgrade Slurm (and MariaDB)
- …start mariadb, start slurmdbd, start slurmctld
 
- Once the control plane services are upgraded and back…
- …slurmd services on all compute nodes are upgraded incrementally
- …in groups rolling over all nodes (since slurmd requires a restart)
 
- …
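The ordering above can be written down as a plan before anything is executed. The sketch below only prints commands. The node names, the group size, and the use of `pdsh` for remote execution are assumptions for illustration; any parallel shell or configuration management tool would serve the same role.

```shell
# Control plane: stop services front to back, upgrade, start in reverse order.
control_plane_plan() {
    for svc in slurmctld slurmdbd mariadb; do
        echo "systemctl stop $svc"
    done
    echo "# upgrade Slurm (and MariaDB) here"
    for svc in mariadb slurmdbd slurmctld; do
        echo "systemctl start $svc"
    done
}

# Compute nodes: read hostnames from stdin, emit one line per group of $1 nodes.
node_groups() {
    xargs -n "$1" echo
}

control_plane_plan

# Roll over five hypothetical nodes in groups of two.
printf '%s\n' node01 node02 node03 node04 node05 | node_groups 2 |
while read -r group; do
    hosts=$(echo "$group" | tr ' ' ',')
    echo "# upgrade slurmd on $hosts, then:"
    echo "pdsh -w $hosts systemctl restart slurmd"
done
```

Waiting for each group to report back to slurmctld before starting the next one limits the impact of a failed node upgrade to a single group.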
Packages Repository
Dedicated Yum repositories per Slurm version…
- …use of a slurm-release.rpm package to install repository configuration
- …configuration file located at /etc/yum.repos.d/slurm.repo
>>> grep base /etc/yum.repos.d/slurm.repo 
baseurl=http://…/packages/slurm-24.05/el$releasever/slurm
baseurl=http://…/packages/slurm-24.05/el$releasever/slurm-debuginfo
baseurl=http://…/packages/slurm-24.05/el$releasever/slurm-source
>>> dnf remove -y slurm-release
# install Yum configuration for a new Slurm version
>>> dnf install -y …/packages/slurm-24.11/el9/slurm-release.rpm
>>> grep base /etc/yum.repos.d/slurm.repo 
baseurl=http://…/packages/slurm-24.11/el$releasever/slurm
baseurl=http://…/packages/slurm-24.11/el$releasever/slurm-debuginfo
baseurl=http://…/packages/slurm-24.11/el$releasever/slurm-source
Alternatively, it is possible to use DNF versionlock to install a specific Slurm version.
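A versionlock-based workflow might look like the sketch below, which only prints the commands for review. The plugin package name `python3-dnf-plugin-versionlock` and the `slurm*` glob are assumptions about an EL-style host with the Slurm packages already installed; check the plugin's man page for how the package spec is matched before relying on it.

```shell
# Print the DNF commands for pinning Slurm packages and for a later upgrade.
versionlock_plan() {
    echo "dnf install -y python3-dnf-plugin-versionlock"
    echo "dnf versionlock add 'slurm*'      # pin the installed Slurm packages"
    echo "dnf versionlock list              # verify the pins"
    echo "# later, when moving to a new release:"
    echo "dnf versionlock delete 'slurm*'   # release the pins"
    echo "dnf upgrade -y 'slurm*'"
    echo "dnf versionlock add 'slurm*'      # pin the new version"
}

versionlock_plan
```

With the lock in place, a routine `dnf upgrade` on the host leaves Slurm untouched until the pins are removed on purpose.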
Footnotes
1. Slurm releases move to a six-month cycle, SchedMD
 https://www.schedmd.com/slurm-releases-move-to-a-six-month-cycle
2. Slurm - Quick Start Administrator Guide, SchedMD Documentation
 https://slurm.schedmd.com/quickstart_admin.html#upgrade
3. Field Notes From the Frontlines of Support, SLUG 2021
 https://slurm.schedmd.com/SLUG21/Field_Notes_5.pdf
 https://www.youtube.com/watch?v=-YAW-PBvLJc
4. Field Notes From the Frontlines of Support, SLUG 2020
 https://slurm.schedmd.com/SLUG20/Field_Notes.pdf
 https://www.youtube.com/watch?v=F8CZaqOQ4Sk
5. Field Notes From the Frontlines of Support, SLUG 2021
 https://slurm.schedmd.com/SLUG21/Field_Notes_5.pdf
 https://www.youtube.com/watch?v=-YAW-PBvLJc
6. Slurm - Quick Start Administrator Guide, SchedMD Documentation
 https://slurm.schedmd.com/quickstart_admin.html#upgrade
7. Make a dry run database upgrade, Niflheim Supercomputing Center, Denmark
 https://wiki.fysik.dtu.dk/niflheim/Slurm_installation#make-a-dry-run-database-upgrade