Howto: Build a Slurm Test Cluster with Vagrant
This article describes how to build a virtual machine environment for testing Slurm cluster installation and configuration. The nodes use the following names for the Vagrant instances as well as for their hostnames:
| Name | Description |
|---|---|
| wlm[1,2] | (w)ork-(l)oad (m)anager …runs the slurm{ctld,dbd} service |
| db[1,2] | (d)ata(b)ase …hosts mariadb for setups with accounting database |
| ex[0-3] | (ex)ecution nodes …run slurmd instances |
Note that the following example has been tested on Linux using Vagrant Libvirt [1] as provider. Some statements in the Vagrantfile are Libvirt-specific and need adjustment in case another provider is used. All configuration files described below are available on GitHub [2].
Network
The following will create a bridged private IP network 192.168.200.0/24 using a Libvirt Networking [3] configuration. This network is then used for the virtual machines via configuration statements for the Vagrant Libvirt plugin [4].
The configuration file below will add the network to the Libvirt dnsmasq service:
- …`<host>` elements configure MAC and IP addresses for the virtual machine instances
- …the `<dnsmasq:options>` element passes options to the underlying dnsmasq instance
- …`dhcp-ignore=tag:!known` prevents responses to unknown MAC addresses
<network xmlns:dnsmasq='http://libvirt.org/schemas/network/dnsmasq/1.0'>
<name>custom</name>
<uuid>356e5daf-3b06-49d5-98fd-178e847cf559</uuid>
<forward mode='nat'/>
<bridge name='virbr9' stp='on' delay='0'/>
<mac address='52:54:00:aa:da:73'/>
<ip address='192.168.200.1' netmask='255.255.255.0'>
<dhcp>
<range start='192.168.200.2' end='192.168.200.254'/>
<host mac='52:54:00:00:00:02' name='wlm1' ip='192.168.200.2' />
<host mac='52:54:00:00:00:12' name='wlm2' ip='192.168.200.3' />
<host mac='52:54:00:00:00:22' name='db1' ip='192.168.200.4' />
<host mac='52:54:00:00:00:32' name='db2' ip='192.168.200.5' />
<host mac='52:54:00:00:00:03' name='ex0' ip='192.168.200.20' />
<host mac='52:54:00:00:00:04' name='ex1' ip='192.168.200.21' />
<host mac='52:54:00:00:00:05' name='ex2' ip='192.168.200.22' />
<host mac='52:54:00:00:00:06' name='ex3' ip='192.168.200.23' />
</dhcp>
</ip>
<dnsmasq:options>
<dnsmasq:option value='dhcp-ignore=tag:!known'/>
</dnsmasq:options>
</network>

Create a file `custom.xml` with the content above in the working directory:
# load the configuration file and enable the custom network
virsh net-define custom.xml && virsh net-start custom
# list all DHCP leases
virsh net-dhcp-leases custom
# modify the network configuration
virsh net-edit custom && virsh net-destroy custom && virsh net-start custom
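DHCP host entries can also be added without editing the XML file by hand. The following is a sketch only; the MAC address, hostname and IP of a hypothetical node ex4 are assumptions:

# add a DHCP host entry for a hypothetical node ex4 to the running
# network and to its persistent configuration
virsh net-update custom add ip-dhcp-host \
  "<host mac='52:54:00:00:00:07' name='ex4' ip='192.168.200.24'/>" \
  --live --config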
Nodes

The following Vagrantfile provides a skeleton configuration for a minimal cluster with a single service node and a couple of resource nodes. The example needs to be adjusted to the specific use case by adding configuration to provision the individual nodes, and additional VM definitions if required. Some configuration common to all nodes is already included:
- `firewalld` is disabled …SELinux is set to permissive
- Fedora EPEL & PowerTools repositories are enabled
- `/etc/hosts` provides hostname resolution
# vi: set ft=ruby :
# /etc/hosts configuration on all nodes
hosts = %q(
127.0.0.1 localhost localhost.localdomain
192.168.200.2 wlm1
192.168.200.3 wlm2
192.168.200.4 db1
192.168.200.5 db2
192.168.200.20 ex0
192.168.200.21 ex1
192.168.200.22 ex2
192.168.200.23 ex3
)
Vagrant.configure("2") do |config|
# Make sure to have at least 1GB of memory ...otherwise dnf could crash
config.vm.provider :libvirt do |libvirt|
libvirt.memory = 1024
libvirt.cpus = 1
libvirt.qemu_use_session = false
end
config.vm.box = "almalinux/8"
# sync the working-directory to the VM instances
config.vm.synced_folder ".", "/vagrant", type: "rsync"
# Runs on all nodes...
config.vm.provision "shell" do |s|
s.privileged = true
s.inline = %Q(
systemctl disable --now firewalld
setenforce Permissive
dnf install -y epel-release
dnf config-manager --set-enabled powertools
echo "#{hosts}" > /etc/hosts
)
end
# TODO: ...add configurations common to all nodes here
config.vm.define "wlm1" do |node|
node.vm.hostname = "wlm1"
node.vm.network :private_network,
:ip => "192.168.200.2",
:libvirt__network_name => "custom"
# TODO: ...add configuration specific to the resource manager
end
# TODO: ...add configurations for more service nodes
nodes = %w(ex0 ex1 ex2 ex3)
(0..(nodes.length - 1)).each do |num|
name = nodes[num]
config.vm.define "#{name}" do |node|
node.vm.hostname = name
node.vm.network :private_network,
:ip => "192.168.200.#{20+num}",
:libvirt__network_name => "custom"
# TODO: ...add configuration for the execution nodes here
end
end
end
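With the skeleton in place, the cluster can be brought up with the usual Vagrant workflow. A minimal sketch; the node names refer to the definitions above:

# check the Vagrantfile for syntax errors
vagrant validate
# start all defined VM instances (individual names may be passed as well)
vagrant up
# show the state of all instances
vagrant status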
Configure an additional Yum RPM package repository on all nodes. This is a typical requirement since most users build custom RPM packages including dependencies like PMIx or UCX themselves.

# Configuration for the Yum repository
yum_repo = "
[site-packages]
name = site-packages
baseurl = # TODO: add a URL to the repository
enabled = 1
gpgcheck = 0
"
# Add a Yum repository configuration file on all nodes
config.vm.provision "shell", privileged: true,
inline: %Q(echo "#{yum_repo}" > /etc/yum.repos.d/site-packages.repo)
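Once a real baseurl has been added and the nodes have been provisioned with this configuration, the repository can be checked on any of the nodes, for example:

# re-run the provisioners on a running instance
vagrant provision wlm1
# list the enabled repositories ...site-packages should appear
vagrant ssh wlm1 -c 'dnf repolist'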
Configuration

Depending on the test scenario, modify the corresponding section of the Vagrantfile above to install and configure services…
Common
Configure the MUNGE [5] authentication service on all nodes:
# Install and configure MUNGE on all nodes
config.vm.provision "shell" do |s|
s.privileged = true
s.inline = %Q(
dnf install -y munge
echo 123456789123456781234567812345678 > /etc/munge/munge.key
chown munge:munge /etc/munge/munge.key
chmod 600 /etc/munge/munge.key
systemctl enable --now munge
)
end
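MUNGE can be verified by encoding and decoding a credential; since all nodes share the same key, a credential created on one node will also decode on the others. A quick check on wlm1, assuming the instance is running:

# create and immediately decode a MUNGE credential
vagrant ssh wlm1 -c 'munge -n | unmunge'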
Configure the slurm user on all nodes. This depends on the Slurm RPM packages used, and applies to packages built from the RPM Spec file [6] provided by SchedMD. Packages from the Fedora project [7] or the OpenSUSE project [8] won’t need the configuration below:

# Create the `slurm` user and the required directories for the services
config.vm.provision "shell" do |s|
s.privileged = true
s.inline = %Q(
groupadd slurm --gid 900
useradd slurm --system --gid 900 --shell /bin/bash \
--no-create-home --home-dir /var/lib/slurm \
--comment "SLURM workload manager"
mkdir -p /etc/slurm /var/{lib,log,run}/slurm /var/spool/slurm/{d,ctld}
chown slurm --recursive /var/lib/slurm /var/spool/slurm
chgrp slurm --recursive /var/lib/slurm /var/spool/slurm
)
end
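A short check that the account and directories were created as expected, for example on wlm1:

# verify the slurm account and the ownership of the spool directories
vagrant ssh wlm1 -c 'id slurm ; ls -ld /var/spool/slurm/d /var/spool/slurm/ctld'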
slurmctld

Create a file slurm.conf with the following content in the working directory:
# vi: set ft=bash :
ClusterName=cluster
AuthType=auth/munge
CryptoType=crypto/munge
MpiDefault=pmix
ProctrackType=proctrack/cgroup
ReturnToService=1
InactiveLimit=0
KillWait=30
MinJobAge=300
SlurmctldTimeout=120
SlurmdTimeout=300
Waittime=0
SlurmUser=root
SlurmctldHost=wlm1(192.168.200.2)
SlurmctldPidFile=/var/run/slurm/slurmctld.pid
SlurmctldPort=6817
SlurmctldParameters=enable_configless
SlurmctldLogFile=/var/log/slurm/slurmctld.log
SlurmctldDebug=debug5
SlurmdPidFile=/var/run/slurm/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurm/d
SlurmdLogFile=/var/log/slurm/slurmd.log
SlurmdDebug=debug5
StateSaveLocation=/var/spool/slurm/ctld
SwitchType=switch/none
TaskPlugin=task/none
SchedulerType=sched/backfill
SelectType=select/linear
AccountingStorageType=accounting_storage/none
AccountingStoreFlags=job_comment
JobCompType=jobcomp/none
JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/none
NodeName=ex[0-3] CPUs=1 State=UNKNOWN
PartitionName=debug Nodes=ex[0-3] Default=YES MaxTime=INFINITE State=UP

slurm.conf is synced to the /vagrant directory of the VM instances after boot. Add the following to the wlm1 configuration section in the Vagrantfile:
node.vm.provision "shell" do |s|
s.privileged = true
s.inline = %q(
dnf install -y slurm-slurmctld
ln -sf /vagrant/slurm.conf /etc/slurm/slurm.conf
systemctl restart slurmctld
)
end

# Start the VM instance hosting slurmctld
vagrant up wlm1
# ...check if the service is working
vagrant ssh wlm1 -c 'systemctl status slurmctld ; sinfo'
# After modifying slurm.conf run
vagrant rsync wlm1
vagrant ssh wlm1 -c 'sudo systemctl restart slurmctld'
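If slurmctld does not come up as expected, the log file configured in slurm.conf is the first place to look:

# follow the slurmctld log on wlm1
vagrant ssh wlm1 -c 'sudo tail -f /var/log/slurm/slurmctld.log'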
slurmd

Run slurmd in configless mode by adding the following provisioning configuration to the execution nodes section:
node.vm.provision "shell" do |s|
s.privileged = true
s.inline = %q(
dnf install -y slurm-slurmd
echo 'SLURMD_OPTIONS=--conf-server wlm1' >> /etc/sysconfig/slurmd
systemctl restart slurmd
rm -rf /etc/slurm
ln -sf /var/spool/slurm/d/conf-cache/ /etc/slurm
)
end
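Finally, start the execution nodes and run a test job from the workload manager node. A minimal sketch, assuming all four execution nodes boot and register with slurmctld; --mpi=none is used so that the example does not depend on a PMIx plugin being installed:

# start all execution nodes
vagrant up ex0 ex1 ex2 ex3
# check that the nodes registered with the controller
vagrant ssh wlm1 -c 'sinfo'
# run a simple test job across all four nodes
vagrant ssh wlm1 -c 'srun --mpi=none -N 4 hostname'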
Footnotes

1. Vagrant Libvirt Documentation, https://vagrant-libvirt.github.io/vagrant-libvirt
2. Example Configuration, GitHub, https://github.com/vpenso/vagrant-playground/tree/master/slurm
3. Virtual Networking, Libvirt Documentation, https://wiki.libvirt.org/VirtualNetworking.html
4. Network Section, Vagrant Libvirt Documentation, https://vagrant-libvirt.github.io/vagrant-libvirt/configuration.html#networks
5. MUNGE Uid 'N' Gid Emporium, GitHub, https://github.com/dun/munge
6. slurm.spec for Slurm RPM packages, SchedMD, GitHub, https://github.com/SchedMD/slurm/blob/master/slurm.spec
7. Slurm RPM Package, Fedora Project, https://src.fedoraproject.org/rpms/slurm
8. Slurm Package, OpenSUSE Project, https://build.opensuse.org/package/show/network:cluster/slurm