InfiniBand: Linux Configuration
Packages
Packages built from the rdma-core spec:

Package | Description |
---|---|
libibverbs | …library that allows userspace processes to use RDMA “verbs” |
libibverbs-utils | …libibverbs example programs such as ibv_devinfo |
infiniband-diags | IB diagnostic programs and scripts needed to diagnose an IB subnet |
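On RHEL-like distributions these can be installed with the package manager, for example (a sketch; package names may differ between distributions):

# install the user-space RDMA packages
>>> yum install -y libibverbs libibverbs-utils infiniband-diags
# verify that the verbs library sees the adapter
>>> ibv_devinfo | grep -e hca_id -e state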
NVIDIA packages…

- InfiniBand Management Tools
- InfiniBand diagnostic utilities (ibdiagnet, ibdiagpath, smparquery, etc.)
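ibdiagnet, for example, scans the whole subnet from the node it runs on (a sketch; the report directory /var/tmp/ibdiagnet2 is the usual default but may differ by version):

# scan the fabric and inspect the generated reports
>>> ibdiagnet
>>> ls /var/tmp/ibdiagnet2/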
Modules
Mellanox HCAs require at least the mlx?_core and mlx?_ib kernel modules.

- Hardware drivers…
  - mlx4_* modules are used by ConnectX-3 and older adapters
  - mlx5_* modules are used by Connect-IB, ConnectX-4 and newer adapters
  - mlx?_core …generic driver used by
    - mlx?_ib for InfiniBand
    - mlx?_en for Ethernet
    - mlx?_fc for Fibre Channel
- ib_* contains InfiniBand-specific functions…
Prior to the rdma-core package (see above)…
## find all infiniband modules
>>> find /lib/modules/$(uname -r)/kernel/drivers/infiniband -type f -name \*.ko
## load required modules
>>> for mod in mlx4_core mlx4_ib ib_umad ib_ipoib rdma_ucm ; do modprobe $mod ; done
## make sure modules get loaded on boot
>>> for mod in mlx4_core mlx4_ib ib_umad ib_ipoib rdma_ucm ; do echo "$mod" >> /etc/modules-load.d/infiniband.conf ; done
## list loaded infiniband modules
>>> lsmod | egrep "^mlx|^ib|^rdma"
## check the version
>>> modinfo mlx4_core | grep -e ^filename -e ^version
## list module configuration parameters
>>> for i in /sys/module/mlx?_core/parameters/* ; do echo $i: $(cat $i); done
## module configuration
>>> cat /etc/modprobe.d/mlx4_core.conf
options mlx4_core log_num_mtt=20 log_mtts_per_seg=4
IPoIB
InfiniBand does not use the internet protocol (IP) by default…
- IP over InfiniBand (IPoIB) provides an IP network emulation layer…
- …on top of InfiniBand remote direct memory access (RDMA) networks
- ARP over a specific multicast group to convert IP to IB addresses
- TCP/UDP over IPoIB (IPv4/6)
- TCP uses reliable-connected mode, MTU up to 65520 bytes
- UDP uses unreliable-datagram mode, MTU limited to the IB packet size of 4KB
- MTUs should be synchronized between all components
IPoIB devices have a 20-byte hardware address…
netstat -g # IP group membership
saquery -g | grep MGID | tr -s '..' | cut -d. -f2
# list multicast group GIDs
tail -n+1 /sys/class/net/ib*/mode # connection mode
ibv_devinfo | grep _mtu # MTU of the hardware
cat /sys/class/net/ib0/device/mlx4_port1_mtu # port MTU reported by the mlx4 driver
ip a | grep ib[0-9] | grep mtu | cut -d' ' -f2,4-5
# MTU configuration for the interface
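The connection mode and MTU can also be changed at run time, for example (a sketch; ib0 and the 65520-byte connected-mode MTU are assumptions for this setup):

# switch the interface to connected mode and raise the MTU
>>> echo connected > /sys/class/net/ib0/mode
>>> ip link set ib0 mtu 65520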
Network Boot
Boot over Infiniband (BoIB) …two boot modes:

- …UEFI boot…
  - …modern and recommended way to network boot
  - …expansion ROM implements the UEFI APIs
  - …supports any network boot method available in the UEFI reference specification
    - …UEFI PXE/LAN boot
    - …UEFI HTTP boot
- …legacy boot…
  - …boot device ROM for traditional BIOS implementations
  - …HCAs use FlexBoot (an iPXE variant)…
    - …enabled by an expansion ROM image .mrom
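Whether the expansion ROM and boot options are enabled in the adapter firmware can be checked with the NVIDIA firmware tools (a sketch; requires MFT, and the mst device path below is only an example):

# start the Mellanox software tools and query boot-related firmware settings
>>> mst start
>>> mlxconfig -d /dev/mst/mt4123_pciconf0 query | grep -i -e boot -e rom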
Dracut
Dracut …early boot environment…

- …requires loading additional kernel modules
- …kernel command-line parameters

rd.driver.post=mlx5_ib,ib_ipoib,ib_umad,rdma_ucm rd.neednet=1 rd.timeout=0 rd.retry=160

- rd.driver.post loads additional kernel modules
  - mlx4_ib supports ConnectX-3 and older
  - mlx5_ib for ConnectX-4 and newer
- rd.neednet=1 forces start of the network interfaces
- rd.timeout=0 waits until a network interface is activated
- rd.retry=160 time to wait for the network to initialize and become operational
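On RHEL-like systems the parameters can be made persistent and the drivers pulled into the initramfs, for example (a sketch; the module list is an assumption and should match the installed adapter):

# append the parameters to the kernel command line of all installed kernels
>>> grubby --update-kernel=ALL --args="rd.driver.post=mlx5_ib,ib_ipoib rd.neednet=1"
# rebuild the initramfs with the InfiniBand drivers included
>>> dracut --force --add-drivers "mlx5_ib ib_ipoib"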
RDMA Subsystem
The RDMA subsystem relies on the kernel, udev and systemd to load modules…
rdma-core Package

- Source code linux-rdma/rdma-core, GitHub
- rdma-core package provides RDMA core user-space libraries and daemons…
- udev loads the physical hardware driver
  - /usr/lib/udev/rules.d/*-rdma*.rules device manager rules
- Once an RDMA device is created by the kernel…
  - …triggers module loading services
  - …rdma-hw.target loads a protocol module…
  - …pulls in RDMA management daemons dynamically
- …wants rdma-load-modules@rdma.service before network.target
  - …loads all modules from /etc/rdma/modules/*.conf
# list kernel modules to be loaded
grep -v ^# /etc/rdma/modules/*.conf
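The corresponding service can be inspected, and additional modules queued for loading, for example (a sketch; which *.conf file to use depends on the fabric type):

# status of the module loading service for the common RDMA modules
>>> systemctl status rdma-load-modules@rdma.service
# request an extra module to be loaded on the next start, e.g. ib_ipoib
>>> echo ib_ipoib >> /etc/rdma/modules/rdma.conf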
rdma Commands
# ...view the state of all RDMA links
>>> rdma dev
0: mlx5_0: node_type ca fw 20.31.1014 node_guid 9803:9b03:0067:ab58 sys_image_guid 9803:9b03:0067:ab58
# ...display the RDMA link
>>> rdma link
link mlx5_0/1 subnet_prefix fe80:0000:0000:0000 lid 817 sm_lid 762 lmc 0 state ACTIVE physical_state LINK_UP
Set up software RDMA on an existing interface…
modprobe $module
rdma link add $name type $type netdev $device
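For example, a Soft-RoCE (rxe) link can be created on a plain Ethernet device (a sketch; eth0 and the link name rxe0 are placeholders):

# load the software RoCE driver and attach it to an Ethernet device
>>> modprobe rdma_rxe
>>> rdma link add rxe0 type rxe netdev eth0
>>> rdma link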
ibv_* Commands

RDMA devices available for use from user space…

ibv_devices lists devices with their GUID:
>>> ibv_devices
device node GUID
------ ----------------
mlx5_0 08c0eb0300f82cbc
ibv_devinfo -v shows device capabilities accessible to user-space…
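The libibverbs-utils examples also allow a quick end-to-end verbs test between two hosts (a sketch; the device name and the server host name are placeholders):

# server side
>>> ibv_rc_pingpong -d mlx5_0
# client side, pointing at the server host
>>> ibv_rc_pingpong -d mlx5_0 server01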
Drivers
Kernel
- Inbox drivers…
  - …upstream kernel support
  - …RHEL/SLES release documentation
- Linux drivers part of MLNX_OFED
  - …kmod* packages
  - …
iWARP
Implementation of iWARP (Internet Wide-area RDMA Protocol)…

- …implements RDMA over IP networks …on top of the TCP/IP protocol
- …works with all Ethernet network infrastructure
- …offloads TCP/IP (from the CPU) to an RDMA-enabled NIC (RNIC)
- …zero copy …direct data placement
  - …eliminates intermediate buffer copies
  - …reads and writes directly to application memory
- …kernel bypass …removes the need for context switches from kernel- to user-space
- Enables…
  - …block storage …iSER (iSCSI Extensions for RDMA)
  - …file storage (NFS over RDMA)
  - …NVMe over Fabrics
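Without an RNIC, the in-kernel soft-iWARP driver can provide an iWARP device on a regular NIC for testing (a sketch; requires a kernel with the siw module, eth0 is a placeholder):

# create a software iWARP (siw) link on an Ethernet device
>>> modprobe siw
>>> rdma link add siw0 type siw netdev eth0
>>> rdma link show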
MLNX_OFED
# download the MLNX_OFED distribution from NVIDIA
>>> tar -xvf MLNX_OFED_LINUX-5.3-1.0.0.1-rhel7.9-x86_64.tgz
>>> ls MLNX_OFED_LINUX-5.3-1.0.0.1-rhel7.9-x86_64/RPMS/*.rpm \
| xargs -n 1 basename |sort
ar_mgr-1.0-5.8.2.MLNX20210321.g58d33bf.53100.x86_64.rpm
clusterkit-1.0.36-1.53100.x86_64.rpm
dapl-2.1.10.1.mlnx-OFED.4.9.0.1.4.53100.x86_64.rpm
dapl-devel-2.1.10.1.mlnx-OFED.4.9.0.1.4.53100.x86_64.rpm
dapl-devel-static-2.1.10.1.mlnx-OFED.4.9.0.1.4.53100.x86_64.rpm
dapl-utils-2.1.10.1.mlnx-OFED.4.9.0.1.4.53100.x86_64.rpm
dpcp-1.1.2-1.53100.x86_64.rpm
dump_pr-1.0-5.8.2.MLNX20210321.g58d33bf.53100.x86_64.rpm
fabric-collector-1.1.0.MLNX20170103.89bb2aa-0.1.53100.x86_64.rpm
#...
- Duplicate packages …in conflict with the enterprise distribution are…
  - …prefixed with mlnx or include mlnx somewhere in the package name
- Different installation profiles…
Package Name | Description |
---|---|
mlnx-ofed-all | Installs all available packages in MLNX_OFED |
mlnx-ofed-basic | Installs basic packages required for running the cards |
mlnx-ofed-guest | Installs packages required by guest OS |
mlnx-ofed-hpc | Installs packages required for HPC |
mlnx-ofed-hypervisor | Installs packages required by hypervisor OS |
mlnx-ofed-vma | Installs packages required by VMA |
mlnx-ofed-vma-eth | Installs packages required by VMA to work over Ethernet |
mlnx-ofed-vma-vpi | Installs packages required by VMA to support VPI |
bluefield | Installs packages required for BlueField |
dpdk | Installs packages required for DPDK |
dpdk-upstream-libs | Installs packages required for DPDK using RDMA-Core |
kernel-only | Installs packages required for a non-default kernel |
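The profile is selected as an option to the installer script, for example (a sketch; the flag names follow the profile names in the table above, see the build section below for the other options used on this system):

# install only the basic profile required to run the cards
>>> ./mlnxofedinstall --basic
# or select e.g. the HPC profile
>>> ./mlnxofedinstall --hpc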
Build
Example from CentOS 7.9
# extract the MLNX OFED archive
cp /lustre/hpc/vpenso/MLNX_OFED_LINUX-5.3-1.0.0.1-rhel7.9-x86_64.tgz .
tar -xvf MLNX_OFED_LINUX-5.3-1.0.0.1-rhel7.9-x86_64.tgz
cd MLNX_OFED_LINUX-5.3-1.0.0.1-rhel7.9-x86_64/
# dependencies
yum install -y \
  automake \
  autoconf \
  createrepo \
  gcc-gfortran \
  libtool \
  libusbx \
  python-devel \
  redhat-rpm-config \
  rpm-build
# remove all previously installed artifacts...
./uninstall.sh
# run the generic installation
./mlnxofedinstall --skip-distro-check --add-kernel-support --kmp --force
# copy the new archive...
cp /tmp/MLNX_OFED_LINUX-5.3-1.0.0.1-3.10.0-1160.21.1.el7.x86_64/MLNX_OFED_LINUX-5.3-1.0.0.1-rhel7.9-ext.tgz ...
mlnxofedinstall will install the newly built RPM packages on the host.
>>> systemctl stop lustre.mount ; lustre_rmmod
# this will bring down the network interface, and disconnect your SSH session
>>> /etc/init.d/openibd restart
# new modules compatible with the kernel have been loaded
>>> modinfo mlx5_ib
filename: /lib/modules/3.10.0-1160.21.1.el7.x86_64/extra/mlnx-ofa_kernel/drivers/infiniband/hw/mlx5/mlx5_ib.ko
license: Dual BSD/GPL
description: Mellanox 5th generation network adapters (ConnectX series) IB driver
author: Eli Cohen <eli@mellanox.com>
retpoline: Y
rhelversion: 7.9
srcversion: DF39E5800D8C1EEB9D2B51C
depends: mlx5_core,ib_core,mlx_compat,ib_uverbs
vermagic: 3.10.0-1160.21.1.el7.x86_64 SMP mod_unload modversions
parm: dc_cnak_qp_depth:DC CNAK QP depth (uint)
The new kernel packages have a time-stamp in the version to distinguish them from the original versions:
[root@lxbk0718 ~]# yum --showduplicates list kmod-mlnx-ofa_kernel
Installed Packages
kmod-mlnx-ofa_kernel.x86_64 5.3-OFED.5.3.1.0.0.1.202104140852.rhel7u9 installed
Available Packages
kmod-mlnx-ofa_kernel.x86_64 5.3-OFED.5.3.1.0.0.1.rhel7u9 gsi-internal
Loading the Lustre module back into the kernel will fail…
[root@lxbk0718 ~]# modprobe lustre
modprobe: ERROR: could not insert 'lustre': Invalid argument
[root@lxbk0718 ~]# dmesg -H | tail
[ +0.000002] ko2iblnd: Unknown symbol ib_modify_qp (err -22)
[ +0.000025] ko2iblnd: Unknown symbol ib_destroy_fmr_pool (err 0)
[ +0.000007] ko2iblnd: disagrees about version of symbol rdma_destroy_id
[ +0.000001] ko2iblnd: Unknown symbol rdma_destroy_id (err -22)
[ +0.000004] ko2iblnd: disagrees about version of symbol __rdma_create_id
[ +0.000001] ko2iblnd: Unknown symbol __rdma_create_id (err -22)
[ +0.000042] ko2iblnd: Unknown symbol ib_dealloc_pd (err 0)
[ +0.000015] ko2iblnd: Unknown symbol ib_fmr_pool_map_phys (err 0)
[ +0.000364] LNetError: 70810:0:(api-ni.c:2283:lnet_startup_lndnet()) Can't load LND o2ib, module ko2iblnd, rc=256
[ +0.002136] LustreError: 70810:0:(events.c:625:ptlrpc_init_portals()) network initialisation failed
A rebuild of the Lustre kernel modules compatible with MLNX_OFED 5.3 is required:
# get the source code
git clone git://git.whamcloud.com/fs/lustre-release.git
# checkout the version supporting the kernel
# cf. https://www.lustre.org/lustre-2-12-6-released/
git checkout v2_12_6
# prepare the build environment
sh ./autogen.sh
# configure to build only the Lustre client
./configure --disable-server --disable-tests
# build (once the configuration succeeds)
make && make rpms
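The resulting client RPMs can then be installed and the module loaded again (a sketch; exact package names and their location in the build tree depend on the Lustre version):

# install the freshly built Lustre client packages and reload the module
yum localinstall -y kmod-lustre-client-*.rpm lustre-client-*.rpm
modprobe lustre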