NHC - Cluster Node Health Check

HPC
Slurm
Published: November 3, 2015
Modified: August 19, 2024

Node Health Check (NHC) [1]

Installation

Clone the source code repository from GitHub, and check out a specific version…

version=1.4.3
# dependencies
sudo dnf install -y git automake make
# get the source code from GitHub
git clone https://github.com/mej/nhc.git ; cd nhc
git checkout tags/$version

# configure the source tree
./autogen.sh
./configure --prefix=/usr --sysconfdir=/etc --libexecdir=/usr/libexec

# install locally
make test && sudo make install

Pre-built RPM packages [2] are available on GitHub. Alternatively, build custom RPM packages from the release tarball:

sudo dnf install -y @rpm-development-tools wget
# initialize the build environment
rpmdev-setuptree

cd ~/rpmbuild/SOURCES/
wget https://github.com/mej/nhc/releases/download/$version/lbnl-nhc-$version.tar.gz
cd - 
rpmbuild -ba lbnl-nhc.spec
ls -r ~/rpmbuild/RPMS/*

Configuration

/etc/nhc/scripts/*.nhc                            # include files for checks
/etc/default/nhc                                  # default configuration
/etc/nhc/nhc.conf                                 # custom configuration file
/var/log/nhc.log                                  # log file
grep -Ev '(^#|^$)' /etc/nhc/nhc.conf              # list all active lines (no comments, no empty lines)
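
The listing above prints every active line as-is. As a minimal sketch, the helper below splits each active line into its target and its check. The name `list_checks` and the scratch file are hypothetical and only serve this illustration; the two sample lines reuse checks that appear elsewhere in these notes.

```shell
# Scratch file standing in for nhc.conf (sample content for illustration only)
conf=$(mktemp)
cat > "$conf" <<'EOF'
# sample configuration
   * || check_cmd_status -r 0 touch /tmp/nhc_tmp_writable

!ln* || check_file_contents /etc/os-release '/^VERSION_ID.*8\.[8-9]/'
EOF

# list_checks FILE (hypothetical helper): print "target<TAB>check" per active line
list_checks() {
    # drop comments and empty lines, then split at the first "||"
    grep -Ev '^[[:space:]]*(#|$)' "$1" |
        awk '{
            i = index($0, "||")
            if (i == 0) next
            target = substr($0, 1, i - 1)
            check = substr($0, i + 2)
            gsub(/^[ \t]+|[ \t]+$/, "", target)
            gsub(/^[ \t]+|[ \t]+$/, "", check)
            printf "%s\t%s\n", target, check
        }'
}

list_checks "$conf"
```

Pointing `list_checks` at /etc/nhc/nhc.conf instead of the scratch file gives a quick overview of which targets each check applies to.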

Integration with Slurm Cluster Scheduler:

>>> scontrol show config | grep HealthCheck
HealthCheckInterval     = 600 sec
HealthCheckNodeState    = ANY
HealthCheckProgram      = /etc/slurm/nhc/nhc.sh

HealthCheckProgram points to a wrapper script that executes NHC with specific command options…

#!/usr/bin/env sh
/usr/sbin/nhc -c /etc/slurm/nhc/nhc.conf -l /var/log/nhc
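
The three values reported by scontrol correspond to slurm.conf parameters of the same names. The following is a sketch under assumptions: the scratch file stands in for the real slurm.conf (its location is not stated in these notes), and `healthcheck_settings` is a hypothetical helper that prints only the relevant lines.

```shell
# Scratch file standing in for slurm.conf; values copied from the scontrol output above
slurm_conf=$(mktemp)
cat > "$slurm_conf" <<'EOF'
HealthCheckProgram=/etc/slurm/nhc/nhc.sh
HealthCheckInterval=600
HealthCheckNodeState=ANY
EOF

# healthcheck_settings FILE (hypothetical helper): print the HealthCheck* parameters
# parameter names in slurm.conf are case-insensitive, hence grep -i
healthcheck_settings() {
    grep -Ei '^[[:space:]]*HealthCheck(Program|Interval|NodeState)[[:space:]]*=' "$1"
}

healthcheck_settings "$slurm_conf"
```

Running the helper against the live configuration file and comparing with `scontrol show config` is one way to spot a change that has not been picked up by the daemons yet.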

Checks

  • Logging…
    • -l - option writes the log to STDOUT
    • -v option enables verbose output…prints each check as it is executed
    • -d option enables debugging of check internals
  • Testing…
    • MARK_OFFLINE=0 disables the mechanism that drains nodes
    • …use a copy of nhc.conf
# execute with verbose logging to STDOUT
nhc -l - -v -c /etc/nhc/nhc.conf

# ...copy nhc.conf to /tmp for development
MARK_OFFLINE=0 nhc -l - -d -c /tmp/nhc.conf      # debugging, disable drain...

# run a specific single check rather than reading checks from a config file
nhc -e 'check_cmd_status -r 0 touch /tmp/nhc.test'
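
The testing workflow above can be combined into one small script. This is a sketch, not a recommended procedure: it writes a scratch configuration with a single check (the one from the `-e` example), and runs NHC against it only if `nhc` is found in PATH. The scratch file name is an assumption made for this illustration.

```shell
# Scratch configuration with a single check, taken from the -e example above
test_conf=$(mktemp /tmp/nhc-test.XXXXXX)
cat > "$test_conf" <<'EOF'
* || check_cmd_status -r 0 touch /tmp/nhc.test
EOF

# Run NHC only where it is installed; MARK_OFFLINE=0 keeps the node from being drained
if command -v nhc >/dev/null 2>&1; then
    if MARK_OFFLINE=0 nhc -l - -v -c "$test_conf"; then
        echo "nhc run succeeded"
    else
        echo "nhc run returned non-zero status $?"
    fi
else
    echo "nhc not found in PATH, skipping" >&2
fi
```

Keeping the experiment in its own file leaves the production nhc.conf untouched until the check behaves as intended.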

Match String

A match string specifies the target nodes for a check

  • …matched against the node's hostname, read from /proc/sys/kernel/hostname
  • Multiple forms…
    • …glob expressions with wildcard
    • …regular expression
    • …node range expressions
# ...glob expression
                *  || valid_check1
              !ln* || valid_check2
# ...regular expression
       /n000[0-9]/ || valid_check3
    !/\.(gpu|htc)/ || valid_check4
# ...node range expression
      {n00[20-39]} || valid_check5
!{n03,n05,n0[7-9]} || valid_check6
# ...not a valid node range expression
   {n00[10-21,23]} || this_target_is_invalid
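
To get a feeling for how the glob and regular expression forms select nodes, they can be emulated in plain shell. This is an emulation under assumptions, not NHC's own matcher: `match_target` is a hypothetical helper, node range expressions are left out, and regular expressions are evaluated with `grep -E`, which may differ in detail from what NHC does internally.

```shell
# match_target HOSTNAME MATCH_STRING (hypothetical helper, emulation only)
# exit status 0 means a check with this match string would apply to HOSTNAME
match_target() {
    host=$1
    target=$2
    negate=0
    # a leading "!" inverts the match
    case $target in
        '!'*) negate=1; target=${target#\!} ;;
    esac
    case $target in
        /*/)
            # regular expression form: strip the surrounding slashes
            re=${target#/}
            re=${re%/}
            if printf '%s\n' "$host" | grep -Eq -- "$re"; then hit=1; else hit=0; fi
            ;;
        *)
            # glob form: use the shell's own pattern matching
            case $host in
                $target) hit=1 ;;
                *) hit=0 ;;
            esac
            ;;
    esac
    [ "$hit" -ne "$negate" ]
}

# try the glob and regex match strings from above against one hostname
for t in '*' '!ln*' '/n000[0-9]/' '!/\.(gpu|htc)/'; do
    if match_target n0005 "$t"; then
        echo "n0005 is targeted by $t"
    else
        echo "n0005 is skipped by $t"
    fi
done
```

Replacing `n0005` with the output of `hostname` shows which of the match strings would select the local node.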

Examples

…using the built-in checks

# check for an OS release version number
* || check_file_contents /etc/os-release '/^VERSION_ID.*8\.[8-9]/'

# check a specific kernel version
* || check_cmd_output -m '/^4\.18\.0\-477/' /usr/bin/uname -r

# check if the /tmp partition is effectively writable:
* || check_cmd_status -r 0 touch /tmp/nhc_tmp_writable
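
Before a pattern goes into nhc.conf, it can help to try it by hand. The sketch below does this with `grep -E` on sample data; the helpers `file_matches` and `output_matches` are hypothetical, the sample values are made up, and the surrounding slashes of the NHC match syntax are dropped because grep does not use them. Treat a match here as a hint only, since NHC evaluates the patterns itself.

```shell
# file_matches FILE REGEX (hypothetical helper): succeeds if any line of FILE matches
file_matches() {
    grep -Eq -- "$2" "$1"
}

# output_matches REGEX COMMAND... (hypothetical helper): succeeds if the command output matches
output_matches() {
    re=$1
    shift
    "$@" 2>/dev/null | grep -Eq -- "$re"
}

# sample data standing in for /etc/os-release (made up for this sketch)
sample=$(mktemp)
printf 'NAME="Example Linux"\nVERSION_ID="8.9"\n' > "$sample"

if file_matches "$sample" '^VERSION_ID.*8\.[8-9]'; then
    echo "os-release pattern matches the sample"
fi

# echo stands in for /usr/bin/uname -r so the sketch runs on any kernel
if output_matches '^4\.18\.0-477' echo 4.18.0-477.10.1.el8_8.x86_64; then
    echo "kernel pattern matches the sample"
fi
```

On a real node, `file_matches /etc/os-release ...` and `output_matches ... uname -r` test the same patterns against live data.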

Footnotes

  1. Node Health Check, GitHub
    https://github.com/mej/nhc

  2. NHC RPM packages, GitHub
    https://github.com/mej/nhc/releases