Slurm — (Live-)Upgrades
Versions & Compatibility
Version numbers
- …major release …year.month, e.g. 20.02 (2020 February) …every six months [1]
- …maintenance release …appends a suffix, e.g. 20.02.02 …every 4 to 6 weeks
- Official releases are available at:
Compatibility between releases…
- RPCs (remote procedure calls) & state files
- …only modified with major versions
- …may require rebuilding applications (using Slurm MPI libraries)
- …& locally developed Slurm plugins
- Slurm daemons will support…
- …RPCs and state files from the two previous major releases
- …upgrading at least once each year is recommended [2]
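The version scheme and the two-release compatibility window described above can be sketched in shell. The helper names `slurm_major` and `within_upgrade_window` are hypothetical (not part of Slurm), and `RELEASES` is an assumed, hand-maintained list of major releases that has to be extended as new releases appear.

```shell
# Hypothetical helpers for illustration; nothing here ships with Slurm.

# Ordered list of major releases (assumption: maintained by hand).
RELEASES="22.05 23.02 23.11 24.05 24.11"

# Print the major release (year.month) of a version string such as 24.05.3
slurm_major() {
    # keep only the first two dot-separated fields
    echo "$1" | cut -d. -f1-2
}

# Succeed if version $1 is at most two major releases behind version $2,
# i.e. daemons at $2 should still accept RPCs and state files from $1.
within_upgrade_window() {
    _from="$(slurm_major "$1")"
    _to="$(slurm_major "$2")"
    _i=0; _from_idx=-1; _to_idx=-1
    for _r in $RELEASES; do
        if [ "$_r" = "$_from" ]; then _from_idx=$_i; fi
        if [ "$_r" = "$_to" ]; then _to_idx=$_i; fi
        _i=$((_i + 1))
    done
    # both releases must be known to the list
    [ "$_from_idx" -ge 0 ] || return 1
    [ "$_to_idx" -ge 0 ] || return 1
    # the gap must be 0, 1 or 2 major releases
    _gap=$((_to_idx - _from_idx))
    [ "$_gap" -ge 0 ] && [ "$_gap" -le 2 ]
}
```

For example, `within_upgrade_window 23.11.4 24.11.0` succeeds with the list above, while an upgrade starting from 22.05 would need an intermediate step.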
Why Upgrade…
Why upgrade Slurm? [3]
- Upgrade Slurm to take advantage of…
- security patches
- performance improvements
- new features (e.g. support for recent hardware)
- …developers provide bug fixes only for the recent release
- …support contracts require staying on a current release
Related discussions on the Slurm mailing list…
Upgrade
Most sites do the upgrade only after draining the cluster [4]
- …SchedMD recommends opening a support request before live upgrades
- …note that this is only possible for paying customers
Recommendation
SchedMD discourages the use of packages for deployment…
- …ships/supports the slurm.spec file …however, does not recommend using RPMs
- …developers suggest structuring installs in version-specific directories
Usually a slide is included with the following proposed Slurm deployment structure [5]:
./configure --prefix=/apps/slurm/21.08.0/ --sysconfdir=/apps/slurm/etc/
ln -s /apps/slurm/21.08.0 /apps/slurm/dbd
ln -s /apps/slurm/21.08.0 /apps/slurm/ctld
ln -s /apps/slurm/21.08.0 /apps/slurm/d
ln -s /apps/slurm/21.08.0 /apps/slurm/current
- Use the appropriate symlink in each service file, and add the /apps/slurm/current symlink into $PATH (through /etc/profile.d/ or a module file).
- This makes a rolling upgrade much simpler: just move the symlink when ready to move that component forward onto the newer release.
Performing upgrades…
- …install the new version of Slurm to a unique directory
- …use a symbolic link to point to the directory in use
- …allows installing a new version before a maintenance period
- …easily switch between versions in case of a rollback
- …avoids potential problems with library conflicts
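Moving a component symlink to another versioned install directory, as described above, might look like the following sketch. The function name `switch_version` is hypothetical. The layout (`<prefix>/<version>` directories plus per-component symlinks such as `ctld`) follows the proposed structure from the slide.

```shell
# Hypothetical helper: repoint a component symlink to a versioned install.
# usage: switch_version <prefix> <component> <version>
switch_version() {
    _prefix="$1"; _component="$2"; _version="$3"
    # refuse to point at a version that has not been installed yet
    if [ ! -d "$_prefix/$_version" ]; then
        echo "not installed: $_prefix/$_version" >&2
        return 1
    fi
    # -s symbolic link, -f replace an existing link,
    # -n treat an existing symlink to a directory as a file (do not descend)
    ln -sfn "$_prefix/$_version" "$_prefix/$_component"
}
```

A rollback is the same call with the previous version, for example `switch_version /apps/slurm ctld 21.08.0`, followed by a restart of the affected service.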
(Live-)Upgrades
“Live-Upgrade”…
- …restart slurmdbd (including database migration) and slurmctld within a small enough time frame to not interrupt service on the cluster
- Tolerances for service interrupts are defined by SlurmctldTimeout and SlurmdTimeout
- …if the Slurm daemons are down for longer than the specified timeout during an upgrade, nodes may be marked DOWN and their jobs killed.
- To further clarify: If slurmd daemons are not able to contact slurmctld within the specified tolerance, it is unavoidable that the payload of the entire cluster is killed.
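Before a live upgrade it can help to compare the planned downtime against the configured timeouts. The sketch below reads the output of `scontrol show config` from standard input. The helper names `get_timeout` and `fits_timeout` are hypothetical, and the parsing assumes the usual `Name = value sec` layout of that output.

```shell
# Hypothetical helpers for illustration.

# Print the numeric value of a parameter from `scontrol show config` output
# supplied on stdin, e.g. a line like "SlurmdTimeout = 300 sec" yields 300.
get_timeout() {
    awk -v key="$1" '$1 == key { print $3 }'
}

# Succeed if the planned downtime (seconds) is shorter than the timeout (seconds).
# usage: fits_timeout <planned_downtime> <timeout>
fits_timeout() {
    [ "$1" -lt "$2" ]
}
```

On a running cluster this could be used as `scontrol show config | get_timeout SlurmdTimeout`. If the expected downtime does not fit, the timeouts would have to be raised temporarily or the upgrade done on a drained cluster.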
Procedure
Database migration…
- …first time slurmdbd is started after an upgrade
- …will take some time to update existing records in the database
- …if slurmdbd is started with systemd, systemd may decide that slurmdbd is not responding and kill the process when it reaches its timeout value, which causes problems with the upgrade.
- …recommend starting slurmdbd by calling the command directly rather than using systemd when performing an upgrade [6]
- …run-time of the migration depends on the size of the accounting database
- …relevant to have a reasonable estimate of the run-time of the migration [7]
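Starting slurmdbd directly instead of through systemd could be wrapped as in this sketch. The function name `run_dbd_foreground` and the `SLURMDBD` override variable are hypothetical. The options `-D` (stay in the foreground) and `-v` (increase verbosity) are regular slurmdbd options.

```shell
# Path to the slurmdbd binary; overridable for testing (assumption: on $PATH).
SLURMDBD="${SLURMDBD:-slurmdbd}"

# Run slurmdbd in the foreground so that systemd cannot kill it on a
# start timeout while the database records are being converted.
run_dbd_foreground() {
    # -D: do not daemonize, log to the terminal
    # -vvv: verbose logging, so migration progress is visible
    "$SLURMDBD" -D -vvv
}
```

The systemd unit should be stopped beforehand, and the command is normally run as the user configured to run slurmdbd. Once the log shows that the conversion has finished, the foreground process can be interrupted and the service started through systemd again.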
Upgrade for (mariadb,) slurmdbd, and slurmctld…
- Independent operations in sequential order
- …operators orchestrate the upgrade manually
- …stop slurmctld, stop slurmdbd, stop mariadb (if required)
- …upgrade Slurm (and MariaDB)
- …start mariadb, start slurmdbd, start slurmctld
- Once the control plane is upgraded and back…
- …slurmd services on all compute nodes are upgraded incrementally
- …in groups rolling over all nodes (since slurmd requires a restart)
- …
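Rolling over the compute nodes in groups can be sketched with a small batching helper. The function name `batch_nodes` is hypothetical. It only splits a node list into groups, and the actual upgrade and restart of slurmd on each group is left to whatever tooling the site uses.

```shell
# Hypothetical helper: read one node name per line on stdin and print
# comma-separated groups of at most <size> nodes, one group per line.
# usage: batch_nodes <size>
batch_nodes() {
    # xargs -n packs <size> names per line; tr joins them with commas
    xargs -n "$1" | tr ' ' ','
}
```

A possible use is `sinfo -N -h -o '%N' | sort -u | batch_nodes 50`, then processing one output line at a time and checking that the nodes of a group have rejoined the cluster before continuing with the next group.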
Packages Repository
Dedicated Yum repositories per Slurm version…
- …use of a slurm-release.rpm package to install repository configuration
- …configuration file located at /etc/yum.repos.d/slurm.repo
>>> grep base /etc/yum.repos.d/slurm.repo
baseurl=http://…/packages/slurm-24.05/el$releasever/slurm
baseurl=http://…/packages/slurm-24.05/el$releasever/slurm-debuginfo
baseurl=http://…/packages/slurm-24.05/el$releasever/slurm-source
>>> dnf remove -y slurm-release
# install Yum configuration for a new Slurm version
>>> dnf install -y …/packages/slurm-24.11/el9/slurm-release.rpm
>>> grep base /etc/yum.repos.d/slurm.repo
baseurl=http://…/packages/slurm-24.11/el$releasever/slurm
baseurl=http://…/packages/slurm-24.11/el$releasever/slurm-debuginfo
baseurl=http://…/packages/slurm-24.11/el$releasever/slurm-source
Alternatively, it is possible to use DNF versionlock to install a specific Slurm version.
Footnotes
1. Slurm releases move to a six-month cycle, SchedMD
   https://www.schedmd.com/slurm-releases-move-to-a-six-month-cycle
2. Slurm - Quick Start Administrator Guide, SchedMD Documentation
   https://slurm.schedmd.com/quickstart_admin.html#upgrade
3. Field Notes From the Frontlines of Support, SUG 2021
   https://slurm.schedmd.com/SLUG21/Field_Notes_5.pdf
   https://www.youtube.com/watch?v=-YAW-PBvLJc
4. Field Notes From the Frontlines of Support, SUG 2020
   https://slurm.schedmd.com/SLUG20/Field_Notes.pdf
   https://www.youtube.com/watch?v=F8CZaqOQ4Sk
5. Field Notes From the Frontlines of Support, SUG 2021
   https://slurm.schedmd.com/SLUG21/Field_Notes_5.pdf
   https://www.youtube.com/watch?v=-YAW-PBvLJc
6. Slurm - Quick Start Administrator Guide, SchedMD Documentation
   https://slurm.schedmd.com/quickstart_admin.html#upgrade
7. Make a dry run database upgrade, Niflheim Supercomputing Center, Denmark
   https://wiki.fysik.dtu.dk/niflheim/Slurm_installation#make-a-dry-run-database-upgrade