InfiniBand: Linux Configuration

HPC
Network
InfiniBand
Published

August 19, 2015

Modified

January 2, 2025

Packages

Packages built from the rdma-core spec file

Package Description
libibverbs …library that allows userspace processes to use RDMA “verbs”
libibverbs-utils …libibverbs example programs such as ibv_devinfo
infiniband-diags IB diagnostic programs and scripts needed to diagnose an IB subnet

NVIDIA packages…

Modules

Mellanox HCAs require at least the mlx?_core and mlx?_ib kernel modules.

  • Hardware drivers…
    • mlx4_* modules are used by ConnectX-3 and older adapters
    • mlx5_* modules are used by Connect-IB, ConnectX-4 and newer adapters
  • mlx?_core…generic driver used by
    • mlx?_ib for InfiniBand
    • mlx?_en for Ethernet
    • mlx4_fc for Fibre Channel (FCoE)
  • ib_* modules contain InfiniBand-specific functions…

Prior to the rdma-core package (see above)…

## find all infiniband modules
>>> find /lib/modules/$(uname -r)/kernel/drivers/infiniband -type f -name \*.ko
## load required modules
>>> for mod in mlx4_core mlx4_ib ib_umad ib_ipoib rdma_ucm ; do modprobe $mod ; done
## make sure modules get loaded on boot 
>>> for mod in mlx4_core mlx4_ib ib_umad ib_ipoib rdma_ucm ; do echo "$mod" >> /etc/modules-load.d/infiniband.conf ; done
## list loaded infiniband modules
>>> lsmod | egrep "^mlx|^ib|^rdma"
## check the version
>>> modinfo mlx4_core | grep -e ^filename -e ^version
## list module configuration parameters
>>> for i in /sys/module/mlx?_core/parameters/* ; do echo $i: $(cat $i); done
## module configuration
>>> cat /etc/modprobe.d/mlx4_core.conf
options mlx4_core log_num_mtt=20 log_mtts_per_seg=4

IPoIB

InfiniBand does not use the Internet Protocol (IP) by default…

  • IP over InfiniBand (IPoIB) provides an IP network emulation layer…
  • …on top of InfiniBand remote direct memory access (RDMA) networks
  • ARP over a specific multicast group to convert IP to IB addresses
  • TCP/UDP over IPoIB (IPv4/6)
    • TCP uses reliable-connected mode, MTU up to 65520 bytes
    • UDP uses unreliable-datagram mode, MTU limited to the IB packet size (4 KB)
  • MTUs should be synchronized between all components

IPoIB devices have a 20-byte hardware address…

netstat -g                                # IP group membership
saquery -g | grep MGID | tr -s '..' | cut -d. -f2
                                          # list multicast group GIDs
tail -n+1 /sys/class/net/ib*/mode         # connection mode
ibv_devinfo | grep _mtu                   # MTU of the hardware 
/sys/class/net/ib0/device/mlx4_port1_mtu
ip a | grep ib[0-9] | grep mtu | cut -d' ' -f2,4-5
                                          # MTU configuration for the interface
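
A minimal sketch for switching an IPoIB interface to connected mode and raising the MTU (the interface name ib0 is an assumption; on adapters running enhanced IPoIB the sysfs mode file may not be writable, and persistent configuration depends on the network manager in use):

## switch ib0 to connected mode (the alternative value is "datagram")
>>> echo connected > /sys/class/net/ib0/mode
## raise the MTU to the maximum supported by connected mode
>>> ip link set ib0 mtu 65520
## e.g. with NetworkManager the equivalent persistent settings would be
>>> nmcli connection modify ib0 infiniband.transport-mode connected infiniband.mtu 65520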

Network Boot

Boot over InfiniBand (BoIB) …two boot modes:

  • …UEFI boot…
    • …modern and recommended way to network boot
    • …expansion ROM implements the UEFI APIs
    • …supports any network boot method available in the UEFI reference specification
      • …UEFI PXE/LAN boot
      • …UEFI HTTP boot
  • …legacy boot…
    • …boot device ROM for traditional BIOS implementations
    • …HCAs use FlexBoot (an iPXE variant)
    • …enabled by an expansion ROM image (.mrom)
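
Whether the expansion ROM and boot options are enabled can be checked with mlxconfig from the NVIDIA firmware tools (MFT); the MST device path below is only an example, and the exact parameter names differ between adapter generations:

## query boot-related firmware settings (requires the MFT tools and a
## started mst service; the device path is an example)
>>> mlxconfig -d /dev/mst/mt4123_pciconf0 query | grep -i -e exp_rom -e boot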

Dracut

Dracut …early boot environment…

rd.driver.post=mlx5_ib,ib_ipoib,ib_umad,rdma_ucm rd.neednet=1 rd.timeout=0 rd.retry=160 

List of parameters:

  • rd.driver.post loads additional kernel modules
    • mlx4_ib supports ConnectX-3
    • mlx5_ib for Connect-IB, ConnectX-4 and newer
  • rd.neednet=1 forces start of network interfaces
  • rd.timeout=0 waits indefinitely until a network interface is activated
  • rd.retry=160 time in seconds to wait for the network to initialize and become operational
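
One way to persist these parameters (a sketch for RHEL-family systems using grubby) is to append them to the kernel command line of all installed kernels:

## append the dracut parameters to the kernel command line
>>> grubby --update-kernel=ALL \
      --args="rd.driver.post=mlx5_ib,ib_ipoib,ib_umad,rdma_ucm rd.neednet=1 rd.timeout=0 rd.retry=160"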

RDMA Subsystem

The RDMA subsystem relies on the kernel, udev and systemd to load modules…

rdma-core Package

  • Source code linux-rdma/rdma-core, GitHub
  • rdma-core package provides RDMA core user-space libraries and daemons…
  • udev loads the physical hardware driver
    • /usr/lib/udev/rules.d/*-rdma*.rules device manager rules
    • Once an RDMA device is created by the kernel…
    • …triggers module loading services
  • rdma-hw.target loads a protocol module…
    • …pull in rdma management daemons dynamically
    • …wants rdma-load-modules@rdma.service before network.target
    • …loads all modules from /etc/rdma/modules/*.conf
# list kernel modules to be loaded
grep -v ^# /etc/rdma/modules/*.conf
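
For example, to have an additional module loaded by rdma-load-modules@rdma.service on the next boot (the file name follows the packaged defaults, the module is just an illustration):

# add a module to the list loaded by the rdma-core service units
echo ib_ipoib >> /etc/rdma/modules/rdma.conf
systemctl status rdma-load-modules@rdma.service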

rdma Commands

# ...list RDMA devices and their state
>>> rdma dev
0: mlx5_0: node_type ca fw 20.31.1014 node_guid 9803:9b03:0067:ab58 sys_image_guid 9803:9b03:0067:ab58

# ...display the RDMA link
>>> rdma link
link mlx5_0/1 subnet_prefix fe80:0000:0000:0000 lid 817 sm_lid 762 lmc 0 state ACTIVE physical_state LINK_UP

Set up software RDMA on an existing interface…

modprobe $module
rdma link add $name type $type netdev $device
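
For example, a Soft-RoCE (rxe) link on a plain Ethernet interface (interface and link names are assumptions, purely for illustration):

## software RoCE (rxe) on an Ethernet interface
>>> modprobe rdma_rxe
>>> rdma link add rxe0 type rxe netdev eth0
>>> rdma link show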

ibv_* Commands

RDMA devices available for use from user space

ibv_devices lists devices with their GUID

>>> ibv_devices 
    device                 node GUID
    ------              ----------------
    mlx5_0              08c0eb0300f82cbc

ibv_devinfo -v shows device capabilities accessible to user space…
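
A quick user-space verbs test between two hosts can be run with ibv_rc_pingpong from libibverbs-utils (the device name mlx5_0 and host server01 are placeholders):

## on the server side
>>> ibv_rc_pingpong -d mlx5_0
## on the client side, connecting to the server
>>> ibv_rc_pingpong -d mlx5_0 server01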

Drivers

Kernel

  • Inbox drivers
    • …upstream kernel support
    • …RHEL/SLES release documentation
  • Linux drivers part of MLNX_OFED
    • kmod* packages
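
Whether an inbox or a MLNX_OFED driver is currently loaded can usually be told from the module path (inbox modules live under kernel/drivers/, MLNX_OFED modules typically under extra/ or weak-updates/):

## inbox vs. MLNX_OFED module path
>>> modinfo mlx5_core | grep ^filename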

iWARP

Implementation of iWARP (Internet Wide-area RDMA Protocol)…

  • …implements RDMA over IP networks …on top of the TCP/IP protocol
  • …works with all Ethernet network infrastructure
    • …offloads TCP/IP (from CPU) to RDMA-enabled NIC (RNIC)
    • …zero copy …direct data placement
      • …eliminates intermediate buffer copies
      • …reading and writing directly to application memory
    • …kernel bypass …removes the need for context switches from kernel to user space…
  • …block storage …iSER (iSCSI Extensions for RDMA)
  • …file storage (NFS over RDMA)
  • …NVMe over Fabrics
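
A software implementation of iWARP is available in the mainline kernel as the siw driver (since Linux 5.3) and can be attached to a regular Ethernet NIC for testing (interface and link names are assumptions):

## software iWARP (siw) on a regular Ethernet NIC
>>> modprobe siw
>>> rdma link add siw0 type siw netdev eth0
>>> rdma link show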

MLNX_OFED

# download and extract the MLNX_OFED distribution from NVIDIA
>>> tar -xvf MLNX_OFED_LINUX-5.3-1.0.0.1-rhel7.9-x86_64.tgz
>>> ls MLNX_OFED_LINUX-5.3-1.0.0.1-rhel7.9-x86_64/RPMS/*.rpm \
      | xargs -n 1 basename |sort
ar_mgr-1.0-5.8.2.MLNX20210321.g58d33bf.53100.x86_64.rpm
clusterkit-1.0.36-1.53100.x86_64.rpm
dapl-2.1.10.1.mlnx-OFED.4.9.0.1.4.53100.x86_64.rpm
dapl-devel-2.1.10.1.mlnx-OFED.4.9.0.1.4.53100.x86_64.rpm
dapl-devel-static-2.1.10.1.mlnx-OFED.4.9.0.1.4.53100.x86_64.rpm
dapl-utils-2.1.10.1.mlnx-OFED.4.9.0.1.4.53100.x86_64.rpm
dpcp-1.1.2-1.53100.x86_64.rpm
dump_pr-1.0-5.8.2.MLNX20210321.g58d33bf.53100.x86_64.rpm
fabric-collector-1.1.0.MLNX20170103.89bb2aa-0.1.53100.x86_64.rpm
#...
  • Duplicate packages…
    • …in conflict with the enterprise distribution are…
    • …prefixed with mlnx or include mlnx somewhere in the package name
  • Different installation profiles…
Package Name Description
mlnx-ofed-all Installs all available packages in MLNX_OFED
mlnx-ofed-basic Installs basic packages required for running the cards
mlnx-ofed-guest Installs packages required by guest OS
mlnx-ofed-hpc Installs packages required for HPC
mlnx-ofed-hypervisor Installs packages required by hypervisor OS
mlnx-ofed-vma Installs packages required by VMA
mlnx-ofed-vma-eth Installs packages required by VMA to work over Ethernet
mlnx-ofed-vma-vpi Installs packages required by VMA to support VPI
bluefield Installs packages required for BlueField
dpdk Installs packages required for DPDK
dpdk-upstream-libs Installs packages required for DPDK using RDMA-Core
kernel-only Installs packages required for a non-default kernel
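
A profile is selected by passing the corresponding option to the installer, e.g. (the option name is assumed to match the profile table above):

## install the HPC profile
>>> ./mlnxofedinstall --hpc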

Build

Example from CentOS 7.9

# extract the MLNX OFED archive
cp /lustre/hpc/vpenso/MLNX_OFED_LINUX-5.3-1.0.0.1-rhel7.9-x86_64.tgz .
tar -xvf MLNX_OFED_LINUX-5.3-1.0.0.1-rhel7.9-x86_64.tgz
cd MLNX_OFED_LINUX-5.3-1.0.0.1-rhel7.9-x86_64/
# dependencies
yum install -y \
      automake \
      autoconf \
      createrepo \
      gcc-gfortran \
      libtool \
      libusbx \
      python-devel \
      redhat-rpm-config \
      rpm-build 

# remove all previously installed artifacts...
./uninstall.sh

# run the generic installation
./mlnxofedinstall --skip-distro-check --add-kernel-support --kmp --force

# copy the new archive...
cp  /tmp/MLNX_OFED_LINUX-5.3-1.0.0.1-3.10.0-1160.21.1.el7.x86_64/MLNX_OFED_LINUX-5.3-1.0.0.1-rhel7.9-ext.tgz  ...

mlnxofedinstall will install the newly built RPM packages on the host.

>>> systemctl stop lustre.mount ; lustre_rmmod
# this will bring down the network interface, and disconnect your SSH session
>>> /etc/init.d/openibd restart
# new modules compatible with the kernel have been loaded
>>> modinfo mlx5_ib
filename:       /lib/modules/3.10.0-1160.21.1.el7.x86_64/extra/mlnx-ofa_kernel/drivers/infiniband/hw/mlx5/mlx5_ib.ko
license:        Dual BSD/GPL
description:    Mellanox 5th generation network adapters (ConnectX series) IB driver
author:         Eli Cohen <eli@mellanox.com>
retpoline:      Y
rhelversion:    7.9
srcversion:     DF39E5800D8C1EEB9D2B51C
depends:        mlx5_core,ib_core,mlx_compat,ib_uverbs
vermagic:       3.10.0-1160.21.1.el7.x86_64 SMP mod_unload modversions 
parm:           dc_cnak_qp_depth:DC CNAK QP depth (uint)

The new kernel packages have a timestamp within the version to distinguish them from the original versions:

[root@lxbk0718 ~]# yum --showduplicates list kmod-mlnx-ofa_kernel
Installed Packages
kmod-mlnx-ofa_kernel.x86_64                  5.3-OFED.5.3.1.0.0.1.202104140852.rhel7u9                   installed   
Available Packages
kmod-mlnx-ofa_kernel.x86_64                  5.3-OFED.5.3.1.0.0.1.rhel7u9                                gsi-internal

Loading the Lustre module back into the kernel will fail…

[root@lxbk0718 ~]# modprobe lustre
modprobe: ERROR: could not insert 'lustre': Invalid argument
[root@lxbk0718 ~]# dmesg -H | tail 
[  +0.000002] ko2iblnd: Unknown symbol ib_modify_qp (err -22)
[  +0.000025] ko2iblnd: Unknown symbol ib_destroy_fmr_pool (err 0)
[  +0.000007] ko2iblnd: disagrees about version of symbol rdma_destroy_id
[  +0.000001] ko2iblnd: Unknown symbol rdma_destroy_id (err -22)
[  +0.000004] ko2iblnd: disagrees about version of symbol __rdma_create_id
[  +0.000001] ko2iblnd: Unknown symbol __rdma_create_id (err -22)
[  +0.000042] ko2iblnd: Unknown symbol ib_dealloc_pd (err 0)
[  +0.000015] ko2iblnd: Unknown symbol ib_fmr_pool_map_phys (err 0)
[  +0.000364] LNetError: 70810:0:(api-ni.c:2283:lnet_startup_lndnet()) Can't load LND o2ib, module ko2iblnd, rc=256
[  +0.002136] LustreError: 70810:0:(events.c:625:ptlrpc_init_portals()) network initialisation failed

A rebuild of the Lustre kernel modules compatible with MLNX OFED 5.3 is required

# get the source code
git clone git://git.whamcloud.com/fs/lustre-release.git
# checkout the version supporting the kernel
# cf. https://www.lustre.org/lustre-2-12-6-released/
git checkout v2_12_6
# prepare the build environment
sh ./autogen.sh
# configure to build only the Lustre client
./configure --disable-server --disable-tests
# builds with (once configuration works)
make && make rpms
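
Once the build succeeds, the resulting client RPMs can be installed and the module loaded again (a sketch; the exact package names depend on the Lustre and kernel versions):

# install the freshly built client packages and reload the module
yum localinstall -y lustre-client-*.rpm kmod-lustre-client-*.rpm
modprobe lustre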