NHC - Cluster Node Health Check
Node Health Check (NHC) [1]
- …periodically determines node(s) health
  - …drain automatically with reason="NHC: …" on failing checks
  - …resume automatically if all checks pass without failure
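Since NHC tags drained nodes with a reason starting with "NHC:", those nodes can be picked out of the scheduler's node list. The following is a minimal sketch, not part of NHC: the helper name nhc_drained_nodes is hypothetical, and it assumes input lines of the form "hostname|reason".

```shell
# nhc_drained_nodes (hypothetical helper): read "hostname|reason" lines on
# STDIN and print the hostnames whose reason starts with "NHC:"
nhc_drained_nodes() {
    awk -F'|' '$2 ~ /^NHC:/ { print $1 }'
}

# On a Slurm cluster the input could come from sinfo, assuming the
# format specifiers %n (hostname) and %E (reason):
#   sinfo -h -R -o '%n|%E' | nhc_drained_nodes
```

Nodes drained for other reasons (maintenance, for instance) are skipped, because only the reason prefix is tested.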
Installation
Clone the source code repository from GitHub, and check out a specific version…
version=1.4.3
# dependencies
sudo dnf install -y git automake make
# get the source code from GitHub
git clone https://github.com/mej/nhc.git ; cd nhc
git checkout tags/$version
# configure the source tree
./autogen.sh
./configure --prefix=/usr --sysconfdir=/etc --libexecdir=/usr/libexec
# install locally
make test && sudo make install
Pre-built RPM packages [2] are available on GitHub. Otherwise, build custom RPM packages:
sudo dnf install -y @rpm-development-tools wget
# initialize the build environment
rpmdev-setuptree
cd ~/rpmbuild/SOURCES/
wget https://github.com/mej/nhc/releases/download/$version/lbnl-nhc-$version.tar.gz
cd -
rpmbuild -ba lbnl-nhc.spec
ls -r ~/rpmbuild/RPMS/*
Configuration
/etc/nhc/scripts/*.nhc # include files for checks
/etc/default/nhc # default configuration
/etc/nhc/nhc.conf # custom configuration file
/var/log/nhc.log # log file
grep -Ev '(^#|^$)' /etc/nhc/nhc.conf # list all checks
Integration with Slurm Cluster Scheduler:
>>> scontrol show config | grep HealthCheck
HealthCheckInterval = 600 sec
HealthCheckNodeState = ANY
HealthCheckProgram = /etc/slurm/nhc/nhc.sh
…HealthCheckProgram executes NHC with specific command options…
#!/usr/bin/env sh
/usr/sbin/nhc -c /etc/slurm/nhc/nhc.conf -l /var/log/nhc
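The values reported by scontrol above are normally set in the Slurm configuration. A fragment with the same values might look like the following. This is a sketch only: the file location is an assumption and depends on the site installation.

```
# slurm.conf (fragment; the location, e.g. /etc/slurm/slurm.conf, is an assumption)
HealthCheckProgram=/etc/slurm/nhc/nhc.sh
HealthCheckInterval=600
HealthCheckNodeState=ANY
```

The wrapper script referenced by HealthCheckProgram has to be executable on every node.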
Checks
- Logging…
  - …with -l - option to STDOUT
  - -v option enables verbose…print each check on execution
  - -d option enables debugging check internals
- Testing…
  - …MARK_OFFLINE=0 disables the mechanism to drain nodes
  - …use a copy of nhc.conf
# execute with verbose logging to STDOUT
nhc -l - -v -c /etc/nhc/nhc.conf
# ...copy nhc.conf to /tmp for development
MARK_OFFLINE=0 nhc -l - -d -c /tmp/nhc.conf # debugging, disable drain...
# run a specific single check rather than reading checks from a config file
nhc -e 'check_cmd_status -r 0 touch /tmp/nhc.test'
Match String
Match strings… specify the target for the check
- …against /proc/sys/kernel/hostname
- Multiple forms…
  - …glob expressions with wildcard
  - …regular expression
  - …node range expressions
# ...glob expression
* || valid_check1
!ln* || valid_check2
# ...regular expression
/n000[0-9]/ || valid_check3
!/\.(gpu|htc)/ || valid_check4
# ...node range expression
{n00[20-39]} || valid_check5
!{n03,n05,n0[7-9]} || valid_check6
{n00[10-21,23]} || this_target_is_invalid
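To get a feel for which hosts a match string selects, the glob and regular expression forms can be approximated in plain shell. The sketch below is not NHC's own implementation: the function name match_target is hypothetical, node range expressions ({…}) are deliberately left out, and regular expressions are handed to grep -E, which may differ in detail from NHC's matching.

```shell
# match_target HOSTNAME MATCH_STRING (hypothetical helper, not part of NHC)
# Returns 0 if the match string selects the hostname, 1 otherwise.
# Handles the glob and /regex/ forms plus a leading "!" negation.
match_target() {
    host=$1
    target=$2
    negate=0

    # a leading "!" inverts the result of the match
    case $target in
        '!'*) negate=1; target=${target#?} ;;
    esac

    case $target in
        /*/)
            # /regex/ form: strip the slashes, match as extended regex
            re=${target#/}
            re=${re%/}
            if printf '%s\n' "$host" | grep -Eq -- "$re"; then
                matched=1
            else
                matched=0
            fi
            ;;
        *)
            # glob form: use the shell's own pattern matching
            case $host in
                $target) matched=1 ;;
                *) matched=0 ;;
            esac
            ;;
    esac

    # selected when exactly one of "matched" and "negate" is set
    [ "$matched" -ne "$negate" ]
}

# e.g. test the local node against a match string:
#   match_target "$(cat /proc/sys/kernel/hostname)" '!ln*' && echo selected
```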
Examples
…using the built-in checks…
# check for an OS release version number
* || check_file_contents /etc/os-release '/^VERSION_ID.*8\.[8-9]/'
# check a specific kernel version
* || check_cmd_output -m '/^4\.18\.0\-477/' /usr/bin/uname -r
# check if the /tmp partition is effectively writable:
* || check_cmd_status -r 0 touch /tmp/nhc_tmp_writable
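Before adding such lines to nhc.conf, it can help to try the underlying test by hand. The helpers below are rough stand-alone approximations of the three checks above, for a dry run outside NHC. The function names are hypothetical and the semantics are simplified: no NHC options are parsed, and patterns are plain extended regular expressions without the surrounding slashes.

```shell
# Rough stand-alone approximations of three built-in checks
# (hypothetical helper names, not part of NHC)

# like: check_file_contents FILE '/REGEX/'
# succeeds if any line of FILE matches the extended regex
file_matches() {
    grep -Eq -- "$2" "$1"
}

# like: check_cmd_output -m '/REGEX/' CMD [ARGS...]
# succeeds if any output line of the command matches the extended regex
cmd_output_matches() {
    re=$1
    shift
    "$@" 2>/dev/null | grep -Eq -- "$re"
}

# like: check_cmd_status -r STATUS CMD [ARGS...]
# succeeds if the command exits with the expected status
cmd_status_is() {
    want=$1
    shift
    status=0
    "$@" >/dev/null 2>&1 || status=$?
    [ "$status" -eq "$want" ]
}

# e.g.
#   file_matches /etc/os-release '^VERSION_ID.*8\.[8-9]' || echo "unexpected OS release"
#   cmd_output_matches '^4\.18\.0-477' uname -r          || echo "unexpected kernel"
#   cmd_status_is 0 touch /tmp/nhc_tmp_writable          || echo "/tmp not writable"
```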
Footnotes
[1] Node Health Check, GitHub, https://github.com/mej/nhc
[2] RPM packages, GitHub, https://github.com/mej/nhc/releases