Slurm — (Live-)Upgrades

HPC
Slurm
Published

May 21, 2021

Modified

July 15, 2025

Versions & Compatibility

Version numbers

Compatibility between releases…

  • RPCs (remote procedure calls) & state files
    • …only modified with major versions
    • …may require rebuilding applications (using Slurm MPI libraries)
    • …& locally developed Slurm plugins
  • Slurm daemons will support…
    • …RPCs and state files from the two previous major releases
    • …upgrading at least once each year is recommended [2]
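
The compatibility window above can be turned into a quick pre-flight check. The following is a minimal sketch, not an official tool: the release list is maintained by hand, and the function names `release_index` and `can_upgrade_directly` are hypothetical. The window defaults to the two previous major releases mentioned above and can be overridden, since the supported span should be confirmed in the release notes of the target version.

```shell
# Major releases in chronological order (assumption: list maintained by hand).
SLURM_RELEASES="20.02 20.11 21.08 22.05 23.02 23.11 24.05 24.11 25.05"

# release_index RELEASE
# Print the zero-based position of a major release in the list, or fail.
release_index() {
    idx=0
    for rel in $SLURM_RELEASES; do
        if [ "$rel" = "$1" ]; then
            echo "$idx"
            return 0
        fi
        idx=$((idx + 1))
    done
    return 1
}

# can_upgrade_directly FROM TO [WINDOW]
# Succeeds if TO is newer than FROM by at most WINDOW major releases.
# Returns 2 if either release is not in the list.
can_upgrade_directly() {
    from=$(release_index "$1") || return 2
    to=$(release_index "$2") || return 2
    window=${3:-2}
    distance=$((to - from))
    [ "$distance" -ge 1 ] && [ "$distance" -le "$window" ]
}

# Example: can_upgrade_directly 23.11 24.11 && echo "direct upgrade possible"
```

If the check fails, the upgrade has to pass through an intermediate release that is within the window.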

Why Upgrade…

Why upgrade Slurm? [3]

  • Upgrade Slurm to take advantage of…
    • security patches
    • performance improvements
    • new features (e.g. support for recent hardware)
  • …developers provide bug fixes only for recent releases
  • …support contracts require staying on a current release

Related discussions on the Slurm mailing list…

Upgrade

Most sites upgrade only after draining the cluster [4]

  • …SchedMD recommends opening a support request before live upgrades
  • …note that this is only possible for paying customers

Recommendation

SchedMD discourages use of packages for deployment…

  • …ships and supports the slurm.spec file, but does not recommend using RPMs
  • …developers suggest structuring installs in version-specific directories

Usually a slide is included with the following proposed Slurm deployment structure [5]:

./configure --prefix=/apps/slurm/21.08.0/ --sysconfdir=/apps/slurm/etc/
ln -s /apps/slurm/21.08.0 /apps/slurm/dbd
ln -s /apps/slurm/21.08.0 /apps/slurm/ctld
ln -s /apps/slurm/21.08.0 /apps/slurm/d
ln -s /apps/slurm/21.08.0 /apps/slurm/current
  • Use the appropriate symlink in each service file, and add the /apps/slurm/current symlink to $PATH (through /etc/profile.d/ or a module file).
  • This makes a rolling upgrade much simpler: just move the symlink when ready to move that component forward onto the newer release.

Performing upgrades…

  • …install the new version of Slurm to a unique directory
  • …use a symbolic link to point to the directory in use
  • …allows installing a new version before a maintenance period
  • …makes it easy to switch between versions in case of a rollback
  • …avoids potential problems with library conflicts
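
Moving a component to another release then comes down to re-pointing one symlink. A minimal sketch, assuming the directory layout from the slide above; the function name `switch_version` is hypothetical, and `ln -sfn` replaces the link in two steps (remove, then create), so it is not strictly atomic:

```shell
# switch_version ROOT LINK VERSION
# Re-point ROOT/LINK (e.g. /apps/slurm/ctld) to ROOT/VERSION.
switch_version() {
    root=$1
    link=$2
    version=$3
    # Refuse to point the link at a release that is not installed.
    if [ ! -d "$root/$version" ]; then
        echo "missing install directory: $root/$version" >&2
        return 1
    fi
    # -n: do not follow an existing symlink to a directory, so the link
    # itself is replaced instead of a new link being created inside it.
    ln -sfn "$root/$version" "$root/$link"
}

# Example (paths are illustrative):
#   switch_version /apps/slurm ctld 21.08.0
```

A rollback is the same call with the previous version directory, followed by a restart of the affected daemon.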

(Live-)Upgrades

“Live-Upgrade”…

  • …restart slurmdbd (including database migration) and slurmctld within a small enough time frame to not interrupt service on the cluster
  • Tolerances for service interrupts are defined by
    • SlurmctldTimeout and SlurmdTimeout
    • …if the Slurm daemons are down for longer than the specified timeout during an upgrade, nodes may be marked DOWN and their jobs killed.
  • To further clarify: If slurmd daemons are not able to contact slurmctld within the specified tolerance it is unavoidable that the payload of the entire cluster is killed.
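
Before a live upgrade it helps to compare the planned downtime with the configured timeouts. A minimal sketch, assuming `scontrol show config` prints lines of the form `SlurmdTimeout = 300 sec`; the helper names `timeout_value` and `downtime_ok` are hypothetical:

```shell
# timeout_value NAME
# Read `scontrol show config` output on stdin and print the value of
# NAME in seconds (assumed format: "NAME = <seconds> sec").
timeout_value() {
    awk -v key="$1" '$1 == key { print $3 }'
}

# downtime_ok PLANNED_SECONDS TIMEOUT_SECONDS
# Succeeds if the planned downtime stays below the timeout.
downtime_ok() {
    [ "$1" -lt "$2" ]
}

# Example:
#   limit=$(scontrol show config | timeout_value SlurmdTimeout)
#   downtime_ok 120 "$limit" || echo "planned downtime exceeds SlurmdTimeout"
```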

Procedure

Database migration

  • …first time slurmdbd is started after an upgrade
    • …will take some time to update existing records in the database
    • …if slurmdbd is started with systemd, it may think that slurmdbd is not responding and kill the process when it reaches its timeout value, which causes problems with the upgrade.
    • …the recommendation is to start slurmdbd by calling the command directly rather than using systemd when performing an upgrade [6]
  • …run-time of the migration depends on the size of the accounting database
  • …it is important to have a reasonable estimate of the run-time of the migration [7]

Upgrade for (mariadb,) slurmdbd, and slurmctld

  • Independent operations in sequential order
    • operators orchestrate the upgrade manually
    • …stop slurmctld, stop slurmdbd, stop mariadb (if required)
    • …upgrade Slurm (and MariaDB)
    • …start mariadb, start slurmdbd, start slurmctld
  • Once the upgraded control plane services are back…
    • slurmd services on all compute nodes are upgraded incrementally
    • …in groups rolling over all nodes (since slurmd requires restart)

Packages Repository

Dedicated Yum repositories per Slurm version…

  • …use of a slurm-release.rpm package to install repository configuration
  • …configuration file located at /etc/yum.repos.d/slurm.repo
>>> grep base /etc/yum.repos.d/slurm.repo 
baseurl=http://…/packages/slurm-24.05/el$releasever/slurm
baseurl=http://…/packages/slurm-24.05/el$releasever/slurm-debuginfo
baseurl=http://…/packages/slurm-24.05/el$releasever/slurm-source
>>> dnf remove -y slurm-release

# install Yum configuration for a new Slurm version
>>> dnf install -y …/packages/slurm-24.11/el9/slurm-release.rpm
>>> grep base /etc/yum.repos.d/slurm.repo 
baseurl=http://…/packages/slurm-24.11/el$releasever/slurm
baseurl=http://…/packages/slurm-24.11/el$releasever/slurm-debuginfo
baseurl=http://…/packages/slurm-24.11/el$releasever/slurm-source

Alternatively, it is possible to use the DNF versionlock plugin to pin a specific Slurm version.

Footnotes

  1. Slurm releases move to a six-month cycle, SchedMD
    https://www.schedmd.com/slurm-releases-move-to-a-six-month-cycle

  2. Slurm - Quick Start Administrator Guide, SchedMD Documentation
    https://slurm.schedmd.com/quickstart_admin.html#upgrade

  3. Field Notes From the Frontlines of Support, SLUG 2021
    https://slurm.schedmd.com/SLUG21/Field_Notes_5.pdf
    https://www.youtube.com/watch?v=-YAW-PBvLJc

  4. Field Notes From the Frontlines of Support, SLUG 2020
    https://slurm.schedmd.com/SLUG20/Field_Notes.pdf
    https://www.youtube.com/watch?v=F8CZaqOQ4Sk

  5. Field Notes From the Frontlines of Support, SLUG 2021
    https://slurm.schedmd.com/SLUG21/Field_Notes_5.pdf
    https://www.youtube.com/watch?v=-YAW-PBvLJc

  6. Slurm - Quick Start Administrator Guide, SchedMD Documentation
    https://slurm.schedmd.com/quickstart_admin.html#upgrade

  7. Make a dry run database upgrade, Niflheim Supercomputing Center, Denmark
    https://wiki.fysik.dtu.dk/niflheim/Slurm_installation#make-a-dry-run-database-upgrade