# InfiniBand: Linux Configuration

## Packages

Packages built from the rdma-core spec file:
| Package | Description |
|---|---|
| libibverbs | Library that allows userspace processes to use RDMA “verbs” |
| libibverbs-utils | libibverbs example programs such as `ibv_devinfo` |
| infiniband-diags | IB diagnostic programs and scripts needed to diagnose an IB subnet |
NVIDIA packages…

- InfiniBand Management Tools
- InfiniBand diagnostic utilities (ibdiagnet, ibdiagpath, smparquery, etc.)
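On an RPM-based distribution, a quick way to verify which of these packages are present (package names may differ between distributions):

```
# query the rdma-core user-space packages
rpm -q libibverbs libibverbs-utils infiniband-diags
# list the example programs shipped with libibverbs-utils
rpm -ql libibverbs-utils | grep bin
```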
## Modules

Mellanox HCAs require at least the `mlx?_core` and `mlx?_ib` kernel modules.

- Hardware drivers…
  - `mlx4_*` modules are used by ConnectX adapters
  - `mlx5_*` modules are used by Connect-IB and newer adapters
- `mlx?_core`…generic driver used by `mlx?_ib` for InfiniBand, `mlx?_en` for Ethernet, `mlx?_fc` for Fibre Channel
- `ib_*` contains InfiniBand-specific functions…
Prior to the rdma-core package (see above), modules had to be loaded manually:

```
## find all infiniband modules
>>> find /lib/modules/$(uname -r)/kernel/drivers/infiniband -type f -name \*.ko
## load required modules
>>> for mod in mlx4_core mlx4_ib ib_umad ib_ipoib rdma_ucm ; do modprobe $mod ; done
## make sure modules get loaded on boot
>>> for mod in mlx4_core mlx4_ib ib_umad ib_ipoib rdma_ucm ; do echo "$mod" >> /etc/modules-load.d/infiniband.conf ; done
## list loaded infiniband modules
>>> lsmod | egrep "^mlx|^ib|^rdma"
## check the version
>>> modinfo mlx4_core | grep -e ^filename -e ^version
## list module configuration parameters
>>> for i in /sys/module/mlx?_core/parameters/* ; do echo $i: $(cat $i); done
## module configuration
>>> cat /etc/modprobe.d/mlx4_core.conf
options mlx4_core log_num_mtt=20 log_mtts_per_seg=4
```
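Changed module options only take effect after the driver is reloaded (or the node is rebooted); a minimal sketch, assuming no interface is currently using the driver:

```
# unload the dependent IB module first, then reload the core driver
modprobe -r mlx4_ib mlx4_core
modprobe mlx4_core
# verify the new parameter values are in effect
cat /sys/module/mlx4_core/parameters/log_num_mtt
cat /sys/module/mlx4_core/parameters/log_mtts_per_seg
```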
## IPoIB

InfiniBand does not use the internet protocol (IP) by default…
- IP over InfiniBand (IPoIB) provides an IP network emulation layer…
- …on top of InfiniBand remote direct memory access (RDMA) networks
- ARP over a specific multicast group to convert IP to IB addresses
- TCP/UDP over IPoIB (IPv4/6)
- TCP uses reliable-connected mode, MTU up to 64 KB (65520 bytes)
- UDP uses unreliable-datagram mode, MTU limited by the IB packet size to 4 KB
- MTUs should be synchronized between all components
IPoIB devices have a 20-byte hardware address…

```
# IP group membership
netstat -g
# list multicast group GIDs
saquery -g | grep MGID | tr -s '..' | cut -d. -f2
# connection mode
tail -n+1 /sys/class/net/ib*/mode
# MTU of the hardware
ibv_devinfo | grep _mtu
cat /sys/class/net/ib0/device/mlx4_port1_mtu
# MTU configuration for the interface
ip a | grep 'ib[0-9]' | grep mtu | cut -d' ' -f2,4-5
```
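Conversely, the connection mode and MTU can be set through the same interfaces; a sketch, assuming interface `ib0` should use connected mode:

```
# switch the interface to reliable-connected mode
echo connected > /sys/class/net/ib0/mode
# raise the MTU to the connected-mode maximum
ip link set ib0 mtu 65520
```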
## Network Boot

Boot over InfiniBand (BoIB) …two boot modes:

- …UEFI boot…
  - …modern and recommended way to network boot
  - …expansion ROM implements the UEFI APIs
  - …supports any network boot method available in the UEFI reference specification
    - …UEFI PXE/LAN boot
    - …UEFI HTTP boot
- …legacy boot…
  - …boot device ROM for traditional BIOS implementations
  - …HCAs use FlexBoot (an iPXE variant) …enabled by an expansion ROM image `.mrom`
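Whether the adapter firmware carries a UEFI and/or legacy (FlexBoot) expansion ROM can be checked with the firmware tools; a sketch, assuming the NVIDIA MFT tools and an illustrative device path:

```
# start the mst driver to expose the device nodes
mst start
# query the firmware; the ROM info lines list the embedded boot images
flint -d /dev/mst/mt4115_pciconf0 query | grep -i rom
```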
## Dracut

Dracut …early boot environment…

- …requires loading additional kernel modules
- …kernel command-line parameters:
```
rd.driver.post=mlx5_ib,ib_ipoib,ib_umad,rdma_ucm rd.neednet=1 rd.timeout=0 rd.retry=160
```

- `rd.driver.post` loads additional kernel modules
  - `mlx4_ib` for ConnectX-3 and older adapters
  - `mlx5_ib` for ConnectX-4 and newer
- `rd.neednet=1` forces start of network interfaces
- `rd.timeout=0` waits until a network interface is activated
- `rd.retry=160` time to wait for the network to initialize and become operational
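The listed modules also need to be present in the initramfs; a sketch using dracut's `--add-drivers` option (the module list is an example, adjust it to the installed HCA):

```
# rebuild the initramfs for the running kernel with the InfiniBand drivers included
dracut --force --add-drivers "mlx5_core mlx5_ib ib_umad ib_ipoib rdma_ucm" \
    /boot/initramfs-$(uname -r).img $(uname -r)
```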
## RDMA Subsystem

The RDMA subsystem relies on the kernel, udev and systemd to load modules…

### rdma-core Package
- Source code linux-rdma/rdma-core, GitHub
- `rdma-core` package provides RDMA core user-space libraries and daemons…
- `udev` loads the physical hardware driver
  - `/usr/lib/udev/rules.d/*-rdma*.rules` device manager rules
- Once an RDMA device is created by the kernel…
  - …triggers module loading services
  - …`rdma-hw.target` loads a protocol module…
  - …pulls in RDMA management daemons dynamically
- …wants `rdma-load-modules@rdma.service` before `network.target`
- …loads all modules from `/etc/rdma/modules/*.conf`
```
# list kernel modules to be loaded
grep -v ^# /etc/rdma/modules/*.conf
```
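Additional protocol modules can be made persistent by appending them to one of these files; a sketch, assuming the `infiniband.conf` module list shipped by rdma-core:

```
# add the IPoIB module to the list loaded by the module loading service
echo ib_ipoib >> /etc/rdma/modules/infiniband.conf
# load the listed modules immediately via the corresponding systemd instance
systemctl start rdma-load-modules@infiniband.service
```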
### rdma Commands

```
# ...view the state of all RDMA links
>>> rdma dev
0: mlx5_0: node_type ca fw 20.31.1014 node_guid 9803:9b03:0067:ab58 sys_image_guid 9803:9b03:0067:ab58
# ...display the RDMA link
>>> rdma link
link mlx5_0/1 subnet_prefix fe80:0000:0000:0000 lid 817 sm_lid 762 lmc 0 state ACTIVE physical_state LINK_UP
```

Set up software RDMA on an existing interface…

```
modprobe $module
rdma link add $name type $type netdev $device
```
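For example, soft-RoCE (kernel module `rdma_rxe`) can be layered over an ordinary Ethernet interface; a sketch, assuming interface `eth0`:

```
# load the soft-RoCE driver and attach it to the Ethernet device
modprobe rdma_rxe
rdma link add rxe0 type rxe netdev eth0
# the new software RDMA device should now appear
rdma link
```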
### ibv_* Commands

RDMA devices available for use from user space…

`ibv_devices` lists devices with their GUID:

```
>>> ibv_devices
    device                 node GUID
    ------              ----------------
    mlx5_0              08c0eb0300f82cbc
```

`ibv_devinfo -v` shows device capabilities accessible to user-space…
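For a quick health check, port state and MTU can be filtered from the verbose output; a sketch, assuming device `mlx5_0`:

```
# show port state and MTU of a specific device
ibv_devinfo -d mlx5_0 | grep -e state -e mtu
```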
## Drivers

### Kernel

- Inbox drivers…
  - …upstream kernel support
  - …RHEL/SLES release documentation
- Linux drivers part of MLNX_OFED
  - …`kmod*` packages
  - …
## iWARP

Implementation of iWARP (Internet Wide-Area RDMA Protocol)…

- …implements RDMA over IP networks …on top of the TCP/IP protocol
- …works with all Ethernet network infrastructure
- …offloads TCP/IP (from CPU) to RDMA-enabled NIC (RNIC)
- …zero copy …direct data placement
- …eliminates intermediate buffer copies
- …reading and writing directly to application memory
- …kernel bypass …removes the need for context switches from kernel- to user-space

Enables…
- …block storage …iSER (iSCSI Extensions for RDMA)
- …file storage (NFS over RDMA)
- …NVMe over Fabrics
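Linux also ships a software iWARP implementation (kernel module `siw`) which allows testing the protocol without an RNIC; a sketch, assuming Ethernet interface `eth0`:

```
# load the soft-iWARP driver and bind it to an Ethernet device
modprobe siw
rdma link add siw0 type siw netdev eth0
```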
## MLNX_OFED

```
# download the MLNX_OFED distribution from NVIDIA
>>> tar -xvf MLNX_OFED_LINUX-5.3-1.0.0.1-rhel7.9-x86_64.tgz
>>> ls MLNX_OFED_LINUX-5.3-1.0.0.1-rhel7.9-x86_64/RPMS/*.rpm \
| xargs -n 1 basename |sort
ar_mgr-1.0-5.8.2.MLNX20210321.g58d33bf.53100.x86_64.rpm
clusterkit-1.0.36-1.53100.x86_64.rpm
dapl-2.1.10.1.mlnx-OFED.4.9.0.1.4.53100.x86_64.rpm
dapl-devel-2.1.10.1.mlnx-OFED.4.9.0.1.4.53100.x86_64.rpm
dapl-devel-static-2.1.10.1.mlnx-OFED.4.9.0.1.4.53100.x86_64.rpm
dapl-utils-2.1.10.1.mlnx-OFED.4.9.0.1.4.53100.x86_64.rpm
dpcp-1.1.2-1.53100.x86_64.rpm
dump_pr-1.0-5.8.2.MLNX20210321.g58d33bf.53100.x86_64.rpm
fabric-collector-1.1.0.MLNX20170103.89bb2aa-0.1.53100.x86_64.rpm
#...
```

- Duplicate packages…
- …in conflict with the enterprise distribution are…
- …prefixed with `mlnx` or include `mlnx` somewhere in the package name
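On an installed system, the packages that replaced their distribution counterparts can be listed accordingly:

```
# list all installed packages originating from MLNX_OFED
rpm -qa | grep -i mlnx | sort
```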
- Different installation profiles…
| Package Name | Profile |
|---|---|
| mlnx-ofed-all | Installs all available packages in MLNX_OFED |
| mlnx-ofed-basic | Installs basic packages required for running the cards |
| mlnx-ofed-guest | Installs packages required by guest OS |
| mlnx-ofed-hpc | Installs packages required for HPC |
| mlnx-ofed-hypervisor | Installs packages required by hypervisor OS |
| mlnx-ofed-vma | Installs packages required by VMA |
| mlnx-ofed-vma-eth | Installs packages required by VMA to work over Ethernet |
| mlnx-ofed-vma-vpi | Installs packages required by VMA to support VPI |
| bluefield | Installs packages required for BlueField |
| dpdk | Installs packages required for DPDK |
| dpdk-upstream-libs | Installs packages required for DPDK using RDMA-Core |
| kernel-only | Installs packages required for a non-default kernel |
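A profile is selected with the matching installer option; a sketch (verify the exact flag names with `./mlnxofedinstall --help`):

```
# install only the basic packages required for running the cards
./mlnxofedinstall --basic
```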
### Build

Example from CentOS 7.9:

```
# extract the MLNX_OFED archive
cp /lustre/hpc/vpenso/MLNX_OFED_LINUX-5.3-1.0.0.1-rhel7.9-x86_64.tgz .
tar -xvf MLNX_OFED_LINUX-5.3-1.0.0.1-rhel7.9-x86_64.tgz
cd MLNX_OFED_LINUX-5.3-1.0.0.1-rhel7.9-x86_64/
# dependencies
yum install -y \
automake \
autoconf \
createrepo \
gcc-gfortran \
libtool \
libusbx \
python-devel \
redhat-rpm-config \
rpm-build
# remove all previously installed artifacts...
./uninstall.sh
# run the generic installation
./mlnxofedinstall --skip-distro-check --add-kernel-support --kmp --force
# copy the new archive...
cp /tmp/MLNX_OFED_LINUX-5.3-1.0.0.1-3.10.0-1160.21.1.el7.x86_64/MLNX_OFED_LINUX-5.3-1.0.0.1-rhel7.9-ext.tgz ...
```

`mlnxofedinstall` will install the newly built RPM packages on the host:

```
>>> systemctl stop lustre.mount ; lustre_rmmod
# this will bring down the network interface, and disconnect your SSH session
>>> /etc/init.d/openibd restart
# new modules compatible with the kernel have been loaded
>>> modinfo mlx5_ib
filename: /lib/modules/3.10.0-1160.21.1.el7.x86_64/extra/mlnx-ofa_kernel/drivers/infiniband/hw/mlx5/mlx5_ib.ko
license: Dual BSD/GPL
description: Mellanox 5th generation network adapters (ConnectX series) IB driver
author: Eli Cohen <eli@mellanox.com>
retpoline: Y
rhelversion: 7.9
srcversion: DF39E5800D8C1EEB9D2B51C
depends: mlx5_core,ib_core,mlx_compat,ib_uverbs
vermagic: 3.10.0-1160.21.1.el7.x86_64 SMP mod_unload modversions
parm: dc_cnak_qp_depth:DC CNAK QP depth (uint)
```

The new kernel packages have a time stamp in the version to distinguish them from the original versions:

```
[root@lxbk0718 ~]# yum --showduplicates list kmod-mlnx-ofa_kernel
Installed Packages
kmod-mlnx-ofa_kernel.x86_64 5.3-OFED.5.3.1.0.0.1.202104140852.rhel7u9 installed
Available Packages
kmod-mlnx-ofa_kernel.x86_64 5.3-OFED.5.3.1.0.0.1.rhel7u9 gsi-internal
```

Loading the Lustre module back into the kernel will fail…

```
[root@lxbk0718 ~]# modprobe lustre
modprobe: ERROR: could not insert 'lustre': Invalid argument
[root@lxbk0718 ~]# dmesg -H | tail
[ +0.000002] ko2iblnd: Unknown symbol ib_modify_qp (err -22)
[ +0.000025] ko2iblnd: Unknown symbol ib_destroy_fmr_pool (err 0)
[ +0.000007] ko2iblnd: disagrees about version of symbol rdma_destroy_id
[ +0.000001] ko2iblnd: Unknown symbol rdma_destroy_id (err -22)
[ +0.000004] ko2iblnd: disagrees about version of symbol __rdma_create_id
[ +0.000001] ko2iblnd: Unknown symbol __rdma_create_id (err -22)
[ +0.000042] ko2iblnd: Unknown symbol ib_dealloc_pd (err 0)
[ +0.000015] ko2iblnd: Unknown symbol ib_fmr_pool_map_phys (err 0)
[ +0.000364] LNetError: 70810:0:(api-ni.c:2283:lnet_startup_lndnet()) Can't load LND o2ib, module ko2iblnd, rc=256
[ +0.002136] LustreError: 70810:0:(events.c:625:ptlrpc_init_portals()) network initialisation failed
```

A rebuild of the Lustre kernel modules compatible with MLNX_OFED 5.3 is required:

```
# get the source code
git clone git://git.whamcloud.com/fs/lustre-release.git
# checkout the version supporting the kernel
# cf. https://www.lustre.org/lustre-2-12-6-released/
git checkout v2_12_6
# prepare the build environment
sh ./autogen.sh
# configure to build only the Lustre client
./configure --disable-server --disable-tests
# build the RPM packages (once configuration succeeds)
make && make rpms
```
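Afterwards the rebuilt client packages can be installed and the module loaded against the new MLNX_OFED stack; a sketch, assuming the RPMs were written to the build directory:

```
# install the rebuilt Lustre client packages (paths are illustrative)
yum install -y ./kmod-lustre-client-*.rpm ./lustre-client-*.rpm
# loading the module should succeed now
modprobe lustre
```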