Slurm - Vagrant Cluster Test Environment
This article describes how to build a virtual machine environment for testing Slurm cluster installation and configuration. Nodes use the following names for the Vagrant instances as well as hostnames:
Name | Description
---|---
wlm[1,2] | (w)ork-(l)oad (m)anager …runs the slurm{ctld,dbd} service
db[1,2] | (d)ata(b)ase …hosts mariadb for setups with accounting database
ex[0-3] | (ex)ecution nodes …run slurmd instances
Note that the following example has been tested on Linux using Vagrant Libvirt 1 as the provider. Some statements in the Vagrantfile are Libvirt-specific and need adjustment in case another provider is used. All configuration files described below are available on GitHub 2.
Network
The following will create a bridged private IP network 192.168.200.0/24 using a Libvirt Networking 3 configuration. The virtual machines then use this network through configuration statements for the Vagrant Libvirt plugin 4.
The configuration file below will add the network to the Libvirt dnsmasq
service:
- …`<host>` elements configure MAC- and IP-addresses for the virtual machine instances
- …the `<dnsmasq:options>` element passes options to the underlying dnsmasq instance
- …`dhcp-ignore=tag:!known` prevents responses to unknown MAC-addresses
<network xmlns:dnsmasq='http://libvirt.org/schemas/network/dnsmasq/1.0'>
  <name>custom</name>
  <uuid>356e5daf-3b06-49d5-98fd-178e847cf559</uuid>
  <forward mode='nat'/>
  <bridge name='virbr9' stp='on' delay='0'/>
  <mac address='52:54:00:aa:da:73'/>
  <ip address='192.168.200.1' netmask='255.255.255.0'>
    <dhcp>
      <range start='192.168.200.2' end='192.168.200.254'/>
      <host mac='52:54:00:00:00:02' name='wlm1' ip='192.168.200.2'/>
      <host mac='52:54:00:00:00:12' name='wlm2' ip='192.168.200.3'/>
      <host mac='52:54:00:00:00:22' name='db1' ip='192.168.200.4'/>
      <host mac='52:54:00:00:00:32' name='db2' ip='192.168.200.5'/>
      <host mac='52:54:00:00:00:03' name='ex0' ip='192.168.200.20'/>
      <host mac='52:54:00:00:00:04' name='ex1' ip='192.168.200.21'/>
      <host mac='52:54:00:00:00:05' name='ex2' ip='192.168.200.22'/>
      <host mac='52:54:00:00:00:06' name='ex3' ip='192.168.200.23'/>
    </dhcp>
  </ip>
  <dnsmasq:options>
    <dnsmasq:option value='dhcp-ignore=tag:!known'/>
  </dnsmasq:options>
</network>
Create a file `custom.xml` with the content above in the working directory:
# load the configuration file and enable the custom network
virsh net-define custom.xml && virsh net-start custom
# list all DHCP leases
virsh net-dhcp-leases custom
# modify the network configuration
virsh net-edit custom && virsh net-destroy custom && virsh net-start custom
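To double-check the result, the network definition and the dnsmasq options generated from it can be inspected (a sketch; the dnsmasq configuration path below is the usual Libvirt default and may differ between distributions):
# confirm the network is defined and active
virsh net-list --all
# dump the live network definition including the dnsmasq options
virsh net-dumpxml custom
# Libvirt usually writes the per-network dnsmasq configuration below this path
cat /var/lib/libvirt/dnsmasq/custom.conf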
Nodes
The following Vagrantfile provides a skeleton configuration for a minimal cluster with a single service node and a couple of resource nodes. Adjust the example to a specific use case by adding configuration to provision the individual nodes, and additional VM definitions if required. Some configuration common to all nodes is included already:
- `firewalld` is disabled …SELinux is set to permissive
- Fedora EPEL & PowerTools repositories are enabled
- `/etc/hosts` provides hostname resolution
# vi: set ft=ruby :

# /etc/hosts configuration on all nodes
hosts = %q(
127.0.0.1 localhost localhost.localdomain
192.168.200.2 wlm1
192.168.200.3 wlm2
192.168.200.4 db1
192.168.200.5 db2
192.168.200.20 ex0
192.168.200.21 ex1
192.168.200.22 ex2
192.168.200.23 ex3
)

Vagrant.configure("2") do |config|

  # Make sure to have at least 1GB of memory ...otherwise dnf could crash
  config.vm.provider :libvirt do |libvirt|
    libvirt.memory = 1024
    libvirt.cpus = 1
    libvirt.qemu_use_session = false
  end

  config.vm.box = "almalinux/8"
  # sync the working-directory to the VM instances
  config.vm.synced_folder ".", "/vagrant", type: "rsync"

  # Runs on all nodes...
  config.vm.provision "shell" do |s|
    s.privileged = true
    s.inline = %Q(
      systemctl disable --now firewalld
      setenforce Permissive
      dnf install -y epel-release
      dnf config-manager --set-enabled powertools
      echo "#{hosts}" > /etc/hosts
    )
  end

  # TODO: ...add configurations common to all nodes here

  config.vm.define "wlm1" do |node|
    node.vm.hostname = "wlm1"
    node.vm.network :private_network,
      :ip => "192.168.200.2",
      :libvirt__network_name => "custom"
    # TODO: ...add configuration specific to the resource manager
  end

  # TODO: ...add configurations for more service nodes

  nodes = %w(ex0 ex1 ex2 ex3)
  (0..(nodes.length - 1)).each do |num|
    name = nodes[num]
    config.vm.define "#{name}" do |node|
      node.vm.hostname = name
      node.vm.network :private_network,
        :ip => "192.168.200.#{20+num}",
        :libvirt__network_name => "custom"
      # TODO: ...add configuration for the execution nodes here
    end
  end

end
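With the skeleton in place, a quick smoke test of the common provisioning could look like the following sketch (any subset of the defined nodes works):
# start the resource manager node and one execution node
vagrant up wlm1 ex0
# list the state of all defined VM instances
vagrant status
# verify hostname resolution and the enabled repositories on a node
vagrant ssh ex0 -c 'cat /etc/hosts ; dnf repolist'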
Configure an additional Yum RPM package repository on all nodes. This is a typical requirement since most users build custom RPM packages including dependencies like PMIx or UCX themselves.
# Configuration for the Yum repository
yum_repo = "
[site-packages]
name = site-packages
baseurl = # TODO: add a URL to the repository
enabled = 1
gpgcheck = 0
"

# Add a Yum repository configuration file on all nodes
config.vm.provision "shell", privileged: true,
  inline: %Q(echo "#{yum_repo}" > /etc/yum.repos.d/site-packages.repo)
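After filling in the baseurl, re-running the provisioning should make the repository visible on the nodes; one possible check:
# re-apply the provisioning to a running instance
vagrant provision ex0
# the site-packages repository should now be listed
vagrant ssh ex0 -c 'dnf repolist --enabled'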
Configuration
Depending on the test scenario, modify the corresponding section of the Vagrantfile above to install and configure services…
Common
Configure the MUNGE 5 authentication service on all nodes:
# Install and configure MUNGE on all nodes
config.vm.provision "shell" do |s|
s.privileged = true
s.inline = %Q(
dnf install -y munge
echo 123456789123456781234567812345678 > /etc/munge/munge.key
chown munge:munge /etc/munge/munge.key
chmod 600 /etc/munge/munge.key
systemctl enable --now munge
)
end
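Once two or more instances are provisioned with the shared key, MUNGE can be verified by encoding a credential on one node and decoding it on another. A rough sketch using the node names from above (the second command pipes the credential through the host; the pseudo-TTY Vagrant allocates may require adjustment):
# create and validate a credential locally
vagrant ssh wlm1 -c 'munge -n | unmunge'
# create a credential on wlm1 and validate it on ex0
vagrant ssh wlm1 -c 'munge -n' | vagrant ssh ex0 -c 'unmunge'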
Configure the slurm user on all nodes. This depends on the Slurm RPM packages used, and applies to packages built from the RPM Spec file 6 provided by SchedMD. Packages from the Fedora project 7 or the OpenSUSE 8 project won't need the configuration below:
# Create the `slurm` user and the directories required for the services
config.vm.provision "shell" do |s|
s.privileged = true
s.inline = %Q(
groupadd slurm --gid 900
useradd slurm --system --gid 900 --shell /bin/bash \
--no-create-home --home-dir /var/lib/slurm \
--comment "SLURM workload manager"
mkdir -p /etc/slurm /var/{lib,log,run}/slurm /var/spool/slurm/{d,ctld}
chown slurm --recursive /var/lib/slurm /var/spool/slurm
chgrp slurm --recursive /var/lib/slurm /var/spool/slurm
)
end
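A quick sanity check after provisioning, to confirm the account and directory layout match what the Slurm services expect:
vagrant ssh wlm1 -c 'id slurm ; ls -ld /var/lib/slurm /var/spool/slurm/ctld'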
slurmctld
Create a file `slurm.conf` with the following content in the working directory:
# vi: set ft=bash :
ClusterName=cluster
AuthType=auth/munge
CryptoType=crypto/munge
MpiDefault=pmix
ProctrackType=proctrack/cgroup
ReturnToService=1
InactiveLimit=0
KillWait=30
MinJobAge=300
SlurmctldTimeout=120
SlurmdTimeout=300
Waittime=0
SlurmUser=root
SlurmctldHost=wlm1(192.168.200.2)
SlurmctldPidFile=/var/run/slurm/slurmctld.pid
SlurmctldPort=6817
SlurmctldParameters=enable_configless
SlurmctldLogFile=/var/log/slurm/slurmctld.log
SlurmctldDebug=debug5
SlurmdPidFile=/var/run/slurm/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurm/d
SlurmdLogFile=/var/log/slurm/slurmd.log
SlurmdDebug=debug5
StateSaveLocation=/var/spool/slurm/ctld
SwitchType=switch/none
TaskPlugin=task/none
SchedulerType=sched/backfill
SelectType=select/linear
AccountingStorageType=accounting_storage/none
AccountingStoreFlags=job_comment
JobCompType=jobcomp/none
JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/none
NodeName=ex[0-3] CPUs=1 State=UNKNOWN
PartitionName=debug Nodes=ex[0-3] Default=YES MaxTime=INFINITE State=UP
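The NodeName line above assumes a single CPU per VM instance. Once slurmd is installed on an execution node (see the slurmd section below), `slurmd -C` prints a NodeName line matching the detected hardware, which helps to keep this definition in sync:
# print the node configuration detected on an execution node
vagrant ssh ex0 -c 'slurmd -C'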
`slurm.conf` is synced to the VM instance `/vagrant` directory after boot. Add the following to the `wlm1` configuration section in the Vagrantfile:
node.vm.provision "shell" do |s|
  s.privileged = true
  s.inline = %q(
    dnf install -y slurm-slurmctld
    ln -sf /vagrant/slurm.conf /etc/slurm/slurm.conf
    systemctl restart slurmctld
  )
end
# Start the VM instance hosting slurmcltd
vagrant up wlm1
# ...check if the service is working
vagrant ssh wlm1 -c 'systemctl status slurmctld ; sinfo'
# After modifying slurm.conf run
vagrant rsync wlm1
vagrant ssh wlm1 -c 'sudo systemctl restart slurmctld'
slurmd
Run slurmd in configless mode by adding the following provisioning configuration to the execution nodes section:
node.vm.provision "shell" do |s|
  s.privileged = true
  s.inline = %q(
    dnf install -y slurm-slurmd
    echo 'SLURMD_OPTIONS=--conf-server wlm1' >> /etc/sysconfig/slurmd
    systemctl restart slurmd
    rm -rf /etc/slurm
    ln -sf /var/spool/slurm/d/conf-cache/ /etc/slurm
  )
end
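After the execution nodes are up and have pulled their configuration from wlm1, the cluster can be exercised with a trivial job. A sketch (add --mpi=none to srun in case the PMIx plugin selected by MpiDefault is not installed):
# start all execution nodes
vagrant up ex0 ex1 ex2 ex3
# the nodes should appear in the debug partition
vagrant ssh wlm1 -c 'sinfo'
# run a simple test job across all four nodes
vagrant ssh wlm1 -c 'srun -N 4 hostname'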
Footnotes
1. Vagrant Libvirt Documentation, https://vagrant-libvirt.github.io/vagrant-libvirt
2. Example Configuration, GitHub, https://github.com/vpenso/vagrant-playground/tree/master/slurm
3. Virtual Networking, Libvirt Documentation, https://wiki.libvirt.org/VirtualNetworking.html
4. Network Section, Vagrant Libvirt Documentation, https://vagrant-libvirt.github.io/vagrant-libvirt/configuration.html#networks
5. MUNGE Uid ‘N’ Gid Emporium, GitHub, https://github.com/dun/munge
6. slurm.spec for Slurm RPM packages, SchedMD, GitHub, https://github.com/SchedMD/slurm/blob/master/slurm.spec
7. Slurm RPM Package, Fedora Project, https://src.fedoraproject.org/rpms/slurm
8. Slurm Package, OpenSUSE Project, https://build.opensuse.org/package/show/network:cluster/slurm