InfiniBand - HPC Network Interconnect

HPC
Network
Published

August 19, 2015

Modified

July 22, 2024

Terminology

GUID Globally Unique Identifier

  • …64bit unique address assigned by vendor
  • …persistent through reboot
  • …3 types of GUIDs: Node, port(, and system image)

LID Local Identifier (48k unicast per subnet)

  • …16bit layer 2 address
  • …assigned by the SM when port becomes active
  • …each HCA port has a LID…
    • …all switch ports share the same LID
    • …director switches have one LID per ASIC

GID Global Identifier

  • …128bit address unique across multiple subnets
  • …based on the port GUID combined with 64bit subnet prefix
  • …Used in the Global Routing Header (GRH) (ignored by switches within a subnet)

PKEY Partition Identifier

  • …fabric segmentation of nodes into different partitions
  • …partitions unaware of each other
    • …limited membership (0) …members can’t communicate among themselves
    • …full membership (1)
  • …ports may be member of multiple partitions
  • …assigned by listing port GUIDs in partitions.conf
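
A minimal partitions.conf sketch for opensm (partition names, PKEY values and member GUIDs are illustrative)…

# default partition, all ports with full membership
Default=0x7fff, ipoib : ALL=full;
# dedicated partition with one full and one limited member
storage=0x0002 : 0x08c0eb0300f82cbc=full, 0xe41d2d0300dff630=limited;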

Hardware

Nvidia InfiniBand Networking Solutions

Switches

Switch Config. Ports Speed
SB7800 fixed 36 EDR
QM87xx fixed 40 HDR
CS8500 modular 800+ HDR
QM97xx fixed 64 NDR

Switches come in two configurations…

  • fixed …fixed number of ports
  • modular …gradually expandable port modules

Switches come in two flavors…

  • managed
    • …MLNX-OS features unlocked
    • …access over SSH, SNMP, HTTPS
    • …enables monitoring and configuration
  • unmanaged
    • …in-band management is possible
    • …status via chassis LEDs

Get information from unmanaged switches with ibswinfo.sh

# requires MST service
>>> ./ibswinfo.sh -d lid-647
=================================================
Quantum Mellanox Technologies
=================================================
part number        | MQM8790-HS2F
serial number      | MT2202X19243
product name       | Jaguar Unmng IB 200
revision           | AK
ports              | 80
PSID               | MT_0000000063
GUID               | 0x1070fd030003af98
firmware version   | 27.2008.3328
-------------------------------------------------
uptime (d-h:m:s)   | 26d-20:16:01
-------------------------------------------------
PSU0 status        | OK
     P/N           | MTEF-PSF-AC-C
     S/N           | MT2202X18887
     DC power      | OK
     fan status    | OK
     power (W)     | 165
PSU1 status        | OK
     P/N           | MTEF-PSF-AC-C
     S/N           | MT2202X18881
     DC power      | OK
     fan status    | OK
     power (W)     | 148
-------------------------------------------------
temperature (C)    | 63
max temp (C)       | 63
-------------------------------------------------
fan status         | OK
fan#1 (rpm)        | 5959
fan#2 (rpm)        | 5251
fan#3 (rpm)        | 6013
fan#4 (rpm)        | 5251
fan#5 (rpm)        | 5906
fan#6 (rpm)        | 5293
fan#7 (rpm)        | 6125
fan#8 (rpm)        | 5293
fan#9 (rpm)        | 5959
-------------------------------------------------

Ethernet Gateway

Skyway InfiniBand to Ethernet gateway…

  • MLNX-GW (gateway operating system) appliance
  • 16x ports (8x InfiniBand EDR/HDR + 8x Ethernet 100/200Gb/s)
  • Max. bandwidth 1.6Tb/s
  • High-availability & load-balancing

…achieved by leveraging Ethernet LAG (Link Aggregation). LACP (Link Aggregation Control Protocol) is used to establish the LAG and to verify connectivity…

Cables

Cable part numbers…

Cable Speed Type Split Length
MC2207130 FDR DAC no .5, 1, 1.5, 2
MC220731V FDR AOC no 3, 5, 10, 15, 20, 25, 30, 40, 50, 75, 100
MCP1600-E EDR DAC no .5, 1, 1.5, 2, 2.5, 3, 4, 5
MFA1A00-E EDR AOC no 3, 5, 10, 15, 20, 30, 50, 100
MCP1650-H HDR DAC no .5, 1, 1.5, 2
MCP7H50-H HDR DAC yes 1, 1.5, 2
MCA1J00-H HDR ACC no 3, 4
MCA7J50-H HDR ACC yes 3, 4
MFS1S00-HxxxE HDR AOC no 3, 5, 10, 15, 20, 30, 50, 100, 130, 150
MFS1S50-HxxxE HDR AOC yes 3, 5, 10, 15, 20, 30

LinkX product family for Mellanox cables and transceivers

  • DAC, (passive) direct attach copper
    • low price
    • up to 2 meters (at HDR)
    • simple copper wires
    • no electronics
    • consume (almost) zero power
    • lowest latency
  • ACC, active copper cables (aka active DAC)
    • consumes 4 to 5 Watts
    • include signal-boosting integrated circuits (ICs)
    • extend the reach up to 4 meters (at 200G HDR)
  • AOC, active optical cables

DAC-in-a-Rack connects servers and storage to top-of-rack (TOR) switches

(passive/active) splitter cables

  • DAC/ACC
    • typically used to connect HDR100 HCAs to an HDR TOR switch
    • enabling a 40-port HDR switch to support 80 ports of 100Gb/s HDR100
    • 1:2 splitter breakout cable in DAC copper… (QSFP56 to 2xQSFP56)
  • AOC …1:2 splitter optical breakout cable… (QSFP56 to 2xQSFP56)

Firmware

MFT (Mellanox firmware tools)…

Installation …MLNX_OFED includes the required packages…

dnf install -y mft kmod-kernel-mft-mlnx usbutils

…packages include an init-script…

systemctl start mst.service

Devices

…can be accessed by their PCI ID

# ...find the PCI ID using lspci
>>> lspci -d 15b3:
21:00.0 Infiniband controller: Mellanox Technologies MT28908 Family [ConnectX-6]

# ...query the firmware on a device using the PCI ID
>>> mstflint -d 21:00.0 query
Image type:            FS4
FW Version:            20.32.1010

…when the IB driver is loaded …access a device by its device name…

# ...find the device name
>>> ibv_devinfo | grep hca_id
hca_id: mlx5_0

# ...query the firmware on a device using the device name
>>> mstflint -d mlx5_0 query
...

PSID (Parameter-Set IDentification) of the channel adapter…

>>> mlxfwmanager --query | grep PSID
  PSID:             SM_2121000001000
  • …PSID used to download the correct firmware for a device
  • …Mellanox PSIDs start with MT_ …SM_ or AS_ prefixes indicate vendor re-labeled cards

mlxconfig

Reboot for configuration changes to take effect

Change device configurations without reburning the firmware…

# ...only a single device is present...
mlxconfig query | grep LINK
         PHY_COUNT_LINK_UP_DELAY             DELAY_NONE(0)   
         LINK_TYPE_P1                        IB(1)           
         KEEP_ETH_LINK_UP_P1                 True(1)         
         KEEP_IB_LINK_UP_P1                  False(0)        
         KEEP_LINK_UP_ON_BOOT_P1             True(1)         
         KEEP_LINK_UP_ON_STANDBY_P1          False(0)        
         AUTO_POWER_SAVE_LINK_DOWN_P1        False(0)        
         UNKNOWN_UPLINK_MAC_FLOOD_P1         False(0)
# ...set configuration
mlxconfig -d $device set KEEP_IB_LINK_UP_P1=0 KEEP_LINK_UP_ON_BOOT_P1=1

Reset the device configuration to default…

mlxconfig -d $device reset

mlxfwmanager

Updating Firmware After Installation

>>> mlxfwmanager --online -u
...
  Device Type:      ConnectX6 
  Part Number:      MCX653105A-ECA_Ax
  Description:      ConnectX-6 VPI adapter card; 100Gb/s (HDR100; EDR IB and 100GbE); single-port QSFP56; PCIe3.0 x16...
  PSID:             MT_0000000222
  PCI Device Name:  0000:21:00.0
  Base GUID:        08c0eb0300f0a5ec
  Versions:         Current        Available     
     FW             20.32.1010     20.35.1012    
     PXE            3.6.0502       3.6.0804      
     UEFI           14.25.0017     14.28.0015
...

mst

mst stops and starts the access driver for Linux

Example from updating the firmware on Super Micro boards (start the access driver and list devices):

>>> mst start && mst status -v
DEVICE_TYPE             MST                           PCI       RDMA    NET                 NUMA  
ConnectX2(rev:b0)       /dev/mst/mt26428_pciconf0     
ConnectX2(rev:b0)       /dev/mst/mt26428_pci_cr0      02:00.0   mlx4_0  net-ib0  

mlxcables

…work against the cables connected to the devices on the machine…

  • mst cable add…discover the cables that are connected to the local devices
  • mlxcables…access the cables…
    • …get cable IDs…
    • …upgrade firmware on the cables
>>> mlxcables -q
...
Cable name    : mt4123_pciconf0_cable_0
...
Identifier      : QSFP28 (11h)
Technology      : Copper cable unequalized (a0h)
Compliance      : 50GBASE-CR, ... HDR,EDR,FDR,QDR,DDR,SDR
...
Vendor          : Mellanox        
Serial number   : MT2214VS04725   
Part number     : MCP7H50-H01AR30 
...
Length [m]      : 1 m

Fabric

List of commands relevant to discover and debug the fabric…

Command Description
ibnetdiscover …scans fabric sub-network …generates topology information
iblinkinfo …list links in the fabric
ibnodes …list of nodes in the fabric
ibhosts …list channel adapters
ibportstate …state of a given port
ibqueryerrors …port error counters
ibroute …display forwarding table
ibdiagnet …complete fabric scan …all devices, ports, links, counters, etc.
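
A couple of illustrative invocations (LID and port numbers are placeholders)…

# state, width and speed of port 2 on the switch with LID 647
ibportstate 647 2 query

# scan the fabric and report ports with non-zero error counters
ibqueryerrors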

ibnetdiscover

Subnet discovery …outputs a human-readable topology file

List…

  • -l connected nodes
  • -H connected HCAs
  • -S connected switches
# switches...
>>> ibnetdiscover -S
Switch   : 0x7cfe90030097c8f0 ports 36 devid 0xc738 vendid 0x2c9 "SwitchX -  Mellanox Technologies"
#...

# host channel adapters
>>> ibnetdiscover -H
Ca       : 0x08c0eb0300af4fa2 ports 1 devid 0x101b vendid 0x2c9 "... mlx5_0"
Ca       : 0xe41d2d0300dff630 ports 2 devid 0x1003 vendid 0x2c9 "... mlx4_0"
Ca       : 0xe41d2d0300e013d0 ports 2 devid 0x1003 vendid 0x2c9 "... mlx4_0"
#...

Output by columns…

  • …GUID
  • …number of ports
  • devid device id …hexadecimal
  • vendid vendor ID …hexadecimal
  • "..." description

iblinkinfo

Reports link info for all links in the fabric…

# ...show switch with GUID
iblinkinfo -S 0x1070fd030003af98

# ...show only the next switch on the node up-link
iblinkinfo -n 1 --switches-only
  • …each switch with GUID is listed with…
    • …one port per line…
    • …left switch LID and port
    • …middle after == …connection width, speed and state
  • …right of ==> …down-link device…
    • …either a switch …or node HCA
    • …LID, port, node name and device type
# switch GUID ...name (if available)  ...type and model
Switch: 0x1070fd030003af98 Quantum Mellanox Technologies:
   647    1[  ] ==(                Down/ Polling)         ==>             [  ] "" ( )
   647    2[  ] ==( 2X        53.125 Gbps Active/  LinkUp)==>      23    1[  ] "localhost mlx5_0" ( )
#  LID    port     width ...speed ...physical state     down-link  LID   port   name ..device

List active ports on a specific switch…

>>> iblinkinfo -S 0x1070fd030003af98 -l | tr -s ' ' | cut -d'"' -f3- | grep -v -i down
 647 2[ ] ==( 2X 53.125 Gbps Active/ LinkUp)==> 0xe8ebd30300a6115e 23 1[ ] "localhost mlx5_0" ( )
 647 21[ ] ==( 4X 53.125 Gbps Active/ LinkUp)==> 0x1070fd03000f4b72 24 26[ ] "Quantum Mellanox" #...
 647 23[ ] ==( 4X 53.125 Gbps Active/ LinkUp)==> 0x1070fd03000f4a92 16 14[ ] "Quantum Mellanox" #...
#...
 647 80[ ] ==( 2X 53.125 Gbps Active/ LinkUp)==> 0xe8ebd30300a61cca 22 1[ ] "lxbk1149" ( )

ibdiagnet

ibdiagnet reports trouble in a form like:

...
Link at the end of direct route "1,1,19,10,9,17"
     Errors:
           -error noInfo -command {smNodeInfoMad getByDr {1 1 19 10 9 17}}
Errors types explanation:
     "noInfo"  : the link was ACTIVE during discovery but, sending MADs across it
                   failed 4 consecutive times
...

Use ibdiagpath to print all GUIDs on the route…

>>> ibdiagpath -d 1,1,19,10,9,17
...
-I- From: lid=0x0216 guid=0x7cfe90030097cef0 dev=51000 Port=17

…if needed, use archived output of ibnetdiscover to identify the corresponding host.
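
For example, grep an archived topology dump for the GUID reported by ibdiagpath (the file path is illustrative)…

# archive created periodically with: ibnetdiscover > /var/tmp/ibnetdiscover.out
grep -i 0x7cfe90030097cef0 /var/tmp/ibnetdiscover.out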

Otherwise check the end of the cable connected to the switch port identified.

mlxconfig

mlxconfig – Changing Device Configuration Tool

Query switch using its LID…

  • query supported configurations after reboot
  • …option -e show default and current configurations
>>> mlxconfig -d lid-0x287 -e query
Device #1:
----------

Device type:    Quantum         
Name:           MQM8790-HS2X_Ax 
Description:    Mellanox Quantum(TM) HDR InfiniBand Switch #[...]
Device:         lid-0x287       

Configurations:              Default              Current              Next Boot
*        SPLIT_MODE          NO_SPLIT_SUPPORT(0)  NO_SPLIT_SUPPORT(0)  SPLIT_2X(1)
         DISABLE_AUTO_SPLIT  ENABLE_AUTO_SPLIT(0) ENABLE_AUTO_SPLIT(0) ENABLE_AUTO_SPLIT(0)
         SPLIT_PORT          Array[1..64]         Array[1..64]         Array[1..64]
         GB_VECTOR_LENGTH    0                    0                    0
         GB_UPDATE_MODE      ALL(0)               ALL(0)               ALL(0)
         GB_VECTOR           Array[0..7]          Array[0..7]          Array[0..7]

The '*' shows parameters with next value different from default/current value.

show_confs displays information about all configurations…

>>> mlxconfig -d lid-0x287 show_confs
# [...]
SWITCH CONF:
  DISABLE_AUTO_SPLIT=<DISABLE_AUTO_SPLIT|ENABLE_AUTO_SPLIT>Disable Auto-Split:
    0x0: ENABLE_AUTO_SPLIT - if NV is split OR if cable is split then port is split.
    0x1: DISABLE_AUTO_SPLIT - if NV is split then port is split # [...]
  SPLIT_MODE=<NO_SPLIT_SUPPORT|SPLIT_2X>  Split ports mode of operation configured # [...]
    0x0: NO_SPLIT_SUPPORT
    0x1: SPLIT_2X - device supports splitting ports to two 2X ports
# [...]

Split Cables

Changes require a switch reboot!

Split a port on remotely managed switches…

  • …only for Quantum based switch systems
  • …single physical quad-lane QSFP port is divided into 2 dual-lane ports
  • …all system ports may be split into 2-lane ports
  • …splitting changes the notation of that port
    • …from x/y to x/y/z
    • z indicating the number of the resulting sub-physical port (1,2)
  • …each sub-physical port is then handled as an individual port

Enable port splits…

# enable split mode support
mlxconfig -d <device> set SPLIT_MODE=1

# split ports....
mlxconfig -d <device> set SPLIT_PORT[<port_num>/<port_range>]=1
  • SPLIT_MODE = SPLIT_2X(1) enable splits…
    • …should be equivalent to split-ready configuration
    • …on managed switches …system profile ib split-ready
  • SPLIT_PORT[1..64]=1 …split for all ports…
    • …should be equivalent to changing the module type to a split mode…
    • …on managed switches …module-type qsfp-split-2
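
For example, on the switch queried in this section (lid-0x287); reboot the switch afterwards for the split to take effect…

mlxconfig -d lid-0x287 set SPLIT_MODE=1
mlxconfig -d lid-0x287 set SPLIT_PORT[1..64]=1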

Query the configuration…

>>> mlxconfig -d lid-0x287 -e query SPLIT_PORT[1..64]

Device #1:
----------

Device type:    Quantum
Name:           MQM8790-HS2X_Ax
Description:    Mellanox Quantum(TM) HDR InfiniBand Switch #[...]
Device:         lid-0x287

Configurations:           Default         Current         Next Boot
         SPLIT_PORT[1]    NO_SPLIT(0)     NO_SPLIT(0)     NO_SPLIT(0)
         SPLIT_PORT[2]    NO_SPLIT(0)     NO_SPLIT(0)     NO_SPLIT(0)
         SPLIT_PORT[3]    NO_SPLIT(0)     NO_SPLIT(0)     NO_SPLIT(0)
         SPLIT_PORT[4]    NO_SPLIT(0)     NO_SPLIT(0)     NO_SPLIT(0)
#[...] 

Adapters (HCAs)

ibstat

ibstat without arguments lists all local adapters with state information

# list channel adapters (CAs)
>>> ibstat -l
mlx5_0

# port GUID...
>>> ibstat -p
0x08c0eb0300f82cbc

Operational State: Active & Physical state: LinkUp

>>> ibstat
CA 'mlx5_0'
# [...]
    Port 1:
        State: Active
        Physical state: LinkUp
        Rate: 10
# [...]
  • Physical state …(of the cable)
    • Polling …no connection …check cable (…and switch)
    • LinkUp …physical uplink connection (…does not mean it’s configured and ready to send data)
  • State (…of the HCA)
    • Down …no physical connection
    • Initializing …physical uplink connection …not discovered by the subnet manager
    • Active …port in a normal operational state
  • Rate
    • …speed at which the port is operating
    • …matches speed of the slowest device on the network path
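
A quick check of the fields above for the local adapter (a minimal filter, assuming a single HCA)…

# show state, physical state and rate of all local ports
ibstat | grep -E 'State:|Physical state:|Rate:'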

ibstatus displays similar information (however it belongs to outdated tooling)

ibaddr

Display the LID (and range) as well as the GID address of a port

# local GID and LID
>>> ibaddr
GID fe80::e8eb:d303:a6:1856 LID start 0x15 end 0x15

# LID (in decimal) of the local adapter
>>> ibaddr -L
LID start 21 end 21

Used for address conversion between GIDs and LIDs

# GID of given LID
>>> ibaddr -g 0x22e
GID fe80::8c0:eb03:f8:2cbc 

# LID (range) for a GID
>>> ibaddr -G 0x1070fd030003af98 -L
LID start 647 end 647

iblinkinfo

Identify the switch a node is connected to …

# ..GUID
>>> iblinkinfo -n 1 | grep -i switch | cut -d' ' -f2
0x1070fd030003af98

# ..LID
>>> ibaddr -G $(iblinkinfo -n 1 | grep -i switch | cut -d' ' -f2) -L
LID start 647 end 647

ibdev2netdev

ibdev2netdev prints a list of local devices mapped to network interfaces…

>>> ibdev2netdev 
mlx5_0 port 1 ==> ib0 (Up)

# ...verbose
>>> ibdev2netdev -v
mlx5_0 (mt4123 - MCX653105A-ECAT) ConnectX-6 VPI adapter card, 100Gb/s #... 

Error Counters

List of InfiniBand error counters…

Counter Description
LinkDowned Node reboot, failed connection (port flapping)
Linkspeed If not at full speed check the adapter and cable
Linkwidth If not at full speed check the adapter and cable
PortRcvErrors Physical errors, local buffer overruns, malformed packets
PortRcvRemotePhysicalErrors See above… packet EBP (End Bad Packet) flag set…
PortRcvSwitchRelayErrors Packets could not be forwarded by the switch
Port[Rcv Xmit]ConstraintErrors
PortXmitWait Large numbers indicate congestion (high congestion results in XmitDiscards)
RcvRemotePhys(ical)Errors Packet corruption occurred somewhere else in the fabric
SymbolErrors 99% of these errors are hardware related (small numbers can be ignored)
VL15Drop VL15 packets dropped due to resource limits (not enough space in the buffers)
XmtDiscards Packets to be transmitted get dropped (high congestion in the fabric)

Cf. Overview of Error Counters, OpenFabrics Alliance
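
Per-port counters can be read and reset with perfquery from infiniband-diags (LID and port number are placeholders)…

# read the error/performance counters of port 1 on LID 647
perfquery 647 1
# reset the counters
perfquery -R 647 1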

Network Layers

Physical Layer

  • Link Speed x Link Width = Link Rate
  • Bit Error Rate (BER) 10^-15
  • Virtual Lane (VL), multiple virtual links on single physical link
    • Mellanox 0-7 VLs each with dedicated buffers
    • Quality of Service, bandwidth management
  • Media for connecting two nodes
    • Passive Copper Cables FDR max. 3m, EDR max. 2m
    • Active Optical Cables (AOCs) FDR max. 300m, EDR max. 100m
    • Connector QSFP

Speeds

            Speed                       Width Rate     Latency   Encoding    Eff.Speed
---------------------------------------------------------------------------------------
1999   SDR  Single Data Rate     2.5Gbps   x4 10Gbps   5usec     NRZ 
2004   DDR  Double Data Rate     5Gbps     x4 20Gbps   2.5usec   NRZ 8/10    16Gbps
2008   QDR  Quadruple Data Rate  10Gbps    x4 40Gbps   1.3usec   NRZ 8/10    32Gbps
2011   FDR  Fourteen Data Rate   14Gbps    x4 56Gbps   0.7usec   NRZ 64/66   54.6Gbps
2014   EDR  Enhanced Data Rate   25Gbps    x4 100Gbps  0.5usec   NRZ 64/66   96.97Gbps 
2018   HDR  High Data Rate       50Gbps    x4 200Gbps <0.6usec   PAM-4 
2022   NDR  Next Data Rate       100Gbps   x4 400Gbps            PAM-4
?      XDR                       200Gbps   x4 800Gbps            PAM-4
?      GDR                                    1.6Tbps

Network Layer

  • Infiniband Routing
    • Fault isolation (e.g topology changes)
    • Increase security (limit attack scope within a network segment)
    • Inter-subnet packet routing (connect multiple topologies)
  • Uses GIDs for each port included in the Global Routing Header (GRH)
  • Mellanox Infiniband Router SB7788 (up to 6 subnets)

Transport Layer

  • Message segmentation into multiple packets by the sender, reassembly by the receiver
    • Maximum Transfer Unit (MTU), default 4096 bytes (openib.conf)
  • End-to-end communication service for applications (virtual channel)
  • Queue Pairs (QPs), dedicated per connection
    • Send/receive queue structure to enable application to bypass kernel
    • Mode: connected vs. datagram; reliable vs. unreliable
    • Datagram mode uses one QP for multiple connections
    • Identified by 24bit Queue Pair Number (QPN)

Upper Layer

  • Protocols
    • Native Infiniband RDMA Protocols
    • MPI, RDMA Storage (iSER, SRP, NFS-RDMA), SDP (Socket Direct), RDS (Reliable Datagram)
    • Legacy TCP/IP, transported by IPoIB
  • Software transport Verbs
    • Client interface to the transport layer, HCA
    • Most common implementation is OFED
  • Subnet Manager Interface (SMI)
    • Subnet Management Packets (SMP) (on QP0, VL15, no flow control)
    • LID routed or direct routed (before fabric initialisation using port numbers)
  • General Service Interface (GSI)
    • General Management Packets (GMP) (on QP1, subject to flow control)
    • LID routed

Topology

Roadmap of the network:

  • Critical aspect of any interconnection network
  • Defines how the channels and routers are connected
  • Sets performance bounds (network diameter, bisection bandwidth)
  • Determines the cost of the network
  • Keys to topology evaluation
    • Network throughput - for application traffic patterns
    • Network diameter - min/avg/max latency between hosts
    • Scalability - cost of adding new end-nodes
    • Cost per node - number of network routers/ports per end-node

Diameter defines the maximum distance between two nodes (hop count)

  • Lower network diameter
    • Better performance
    • Smaller cost (less cables & routers)
    • Less power consumption

Radix (or degree) of the router defines the number of ports per router

Nodal degree specifies how many links connect to each node

Demystifying DCN Topologies: Clos/Fat Trees
https://packetpushers.net/demystifying-dcn-topologies-clos-fat-trees-part1
https://packetpushers.net/demystifying-dcn-topologies-clos-fat-trees-part2

Clos Networks

Clos network is a multistage switching network

  • Enables connection of large number of nodes with small-size switches
    • 3 stages to switch from N inputs to N outputs
  • Exactly one connection between each spine and leaf switch

Fat-Trees (special case of folded Clos network)

  • Pros
    • simple routing
    • maximal network throughput
    • fault-tolerant (path diversity)
    • credit loop deadlock free routing
  • Cons
    • large diameter…
    • …more expensive
  • Alleviate the bandwidth bottleneck closer to the root with additional links
  • Multiple paths to the destination from the source towards the root
  • Consistent hop count, resulting in predictable latency.
  • Does not scale linearly with cluster size (max. 7 layers/tiers)
  • Switches at the top of the pyramid shape are called Spines/Core
  • Switches at the bottom of the pyramid are called Leafs/Lines/Edges
  • External connections connect nodes to edge switches.
  • Internal connections connect core with edge switches.
  • Constant bi-sectional bandwidth (CBB)
    • Non blocking (1:1 ratio)
    • Equal number of external and internal connections (balanced)
  • Blocking (x:1) …more external than internal connections (over-subscription), e.g. 30 node links over 10 uplinks gives 3:1

Dragonfly

  • Pros
    • Reduce number of (long) global links…without reducing performance
    • …smaller network diameter
    • Reduced total cost of network (since cabling is reduced)
    • More scalable (compared to fat-tree)
  • Cons
    • Requires adaptive routing
    • …effectively balance load across global channels…
    • …adding selective virtual-channel discrimination…

Hierarchical topology dividing groups of routers…

  • …connected into sub-network of collectively acting router groups
    • …as one high-radix virtual router
    • …all minimal routes traverse at most one global channel…
    • …to realize a very low global diameter
  • Channels/links…
    • terminal connections to nodes/systems
    • local (intra-group) connections to other routers in the same group
    • global (long, inter-group) connections to routers in other groups
  • All-to-all connection between each router group
    • (Avoids the need for external top level switches)
    • Each group has at least one global link to each other router group

Flavors diverge on group sub-topology

  • …intra-group interconnection network (local channels)
  • 1D flattened butterfly, completely connected (default recommendation)
  • 2D flattened butterfly
  • Dragonfly+ (benefits of Dragonfly and Fat Tree)

Dragonfly+

Extends Dragonfly topology by using Clos-like group topology

  • Higher scalability than Dragonfly with lower cost than Fat Tree
  • Group (pod) topology typically a 2-level fat tree
  • Pros… (compared to Dragonfly)
    • More scalable, allows larger number of nodes on the network
    • Similar or better bi-sectional bandwidth…
    • …smaller number of buffers to avoid credit loop deadlocks
    • At least 50% bi-sectional bandwidth for any router radix
    • Requires only two virtual lanes to prevent credit loop deadlock
  • Cons… (compared to Dragonfly)
    • Even more complex routing
    • Fully Progressive Adaptive Routing (FPAR)
    • Cabling complexity, intra-group routers connected as bipartite graph

Dragonfly+ is bipartite connected in the first intra-group level

  • Number of spine switches = number of leaf switches
  • Leaf router, first-layer
    • (terminal) connects to nodes
    • Intra-group (local) connection to spine routers
    • Only one uplink to each spine inside the group
  • Spine router, second-layer
    • intra-group (local) connection to leaf routers
    • inter-group (global) connections to spine routers of other groups
  • Supports a blocking factor in leaf switches and non-blocking spines

Locality, group size

  • With larger group size a larger amount of traffic is internal (intra-group)
  • Intra-group traffic does not use inter-group global links…
  • …hence does not contribute to network throughput bottleneck

How to Configure DragonFly, Mellanox, 2020/03
https://community.mellanox.com/s/article/How-to-Configure-DragonFly

Exascale HPC Fabric Topology, Mellanox, 2019/03
http://www.hpcadvisorycouncil.com/events/2019/APAC-AI-HPC/uploads/2018/07/Exascale-HPC-Fabric-Topology.pdf

Subnet Manager

Software defined network (SDN)

  • …configures and maintains fabric operations
  • …central repository of all information
    • …configures switch forwarding tables

Only one master SM allowed per subnet

  • …can run on any server (or a managed switch on small fabrics)
  • …master-slave setup for high-availability

Install opensm packages …start the subnet manager…

dnf install -y opensm
systemctl enable --now opensm

Configuration

Configuration in /etc/rdma/opensm.conf

  • opensm daemon…
    • -c $path …create configuration file if missing
    • -p $prio …change priority …when stopped!
    • -R $engine …change routing algorithm
    • /var/log/opensm.log …for logging
  • sminfo …show master subnet manager LID, GUID, priority
  • saquery -s …show all subnet managers
ibdiagnet -r                        # check for routing issues
smpquery portinfo $lid $port        # query port information
smpquery nodeinfo $lid              # query node information
smpquery -D nodeinfo $lid           # ^ using direct route
ibroute $lid                        # show switching table, LIDs in hex
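
For example, run the daemon in the background with an explicit priority and routing engine (values are placeholders)…

opensm -B -p 14 -R updn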

Initialization

  • Subnet discovery (…after wakeup)
    • …traverse the network beginning with close neighbors
    • …Subnet Management Packets (SMP) to initiate “conversation”
  • Information gathering…
    • …find all links/switches/hosts on all connected ports to map topology
    • …Subnet Manager Query Message: direct routed information gathering for node/port information
    • …Subnet Manager Agent (SMA) required on each node
  • LIDs assignment
  • Paths establishment
    • …best path calculation to identify Shortest Path Table (Min-Hop)
    • …calculate Linear Forwarding Table (LFT)
  • Ports and switch configuration
  • Subnet activation

Topology Changes

SM monitors the fabric for topology changes…

  • Light Sweep, every 10sec requests node/port information
    • …port status changes
    • …search for other SMs, change priority
  • Heavy Sweep triggered by light sweep changes
    • …fabric discovery from scratch
    • …can be triggered by an IB trap from a status change on a switch
    • …edge/host port state change impact is configurable
  • SM failover & handover with the SMInfo protocol
    • …election by priority (0-15) and lower GUID
    • …heartbeat for stand-by SM polling the master
    • …SMInfo attributes exchange information during discovery/polling to synchronize

Routing

Terms important to understand different algorithms…

  • tolerance …considered during path distance calculation
    • 0 …equal distance if the number of hops in the paths is the same
    • 1 …equal distance if the difference in hop count is less than or equal to one
  • contention …declared for every switch port on the path…
    • …that is already used for routing another LID…
    • …associated with the same host port

Algorithm…

  • …SPF, DOR, LASH….
  • Min-Hop minimal number of switch hops between nodes (cannot avoid credit loops)
  • ftree congestion-free symmetric fat-tree, shift communication pattern

Up-Down

  • …Min-Hop plus core/spine ranking
  • …for non pure fat-tree topologies
  • …down-up routes not allowed

Enable up-down routing engine:

>>> grep -e routing_engine -e root_guid_file /etc/opensm/opensm.conf    
#routing_engine (null)
routing_engine updn
#root_guid_file (null)
root_guid_file /etc/opensm/rootswitches.list
>>> head /etc/opensm/rootswitches.list
0xe41d2d0300e512c0
0xe41d2d0300e50bd0
0xe41d2d0300e51af0
0xe41d2d0300e52eb0
0xe41d2d0300e52e90

Adaptive

Avoid congestion with adaptive routing…

  • …supported on all types of topologies
  • …maximize network utilization
  • …spread traffic across all network links…
    • …determine optimal path for data packets
    • …allow packets to avoid congested areas
  • …redirect traffic to less occupied outgoing ports
  • …grading mechanism to select optimal ports considering
    • …egress port
    • …queue depth
    • …path priority (shorter paths have higher priority)

Requires ConnectX-5 or newer…

  • …packets can arrive out-of-order
  • …sender marks traffic as eligible for network re-ordering
  • …inter-message ordering can be enforced when required

Linux Configuration

Packages

Packages built from the rdma-core spec

Package Description
libibverbs …library that allows userspace processes to use RDMA “verbs”
libibverbs-utils …libibverbs example programs such as ibv_devinfo
infiniband-diags IB diagnostic programs and scripts needed to diagnose an IB subnet

NVIDIA packages…

Modules

Mellanox HCAs require at least the mlx?_core and mlx?_ib kernel modules.

  • Hardware drivers…
    • mlx4_* modules are used by ConnectX-3 and older ConnectX adapters
    • mlx5_* modules are used by Connect-IB and ConnectX-4 (and newer) adapters
  • mlx?_core …generic core driver used by
    • mlx?_ib for InfiniBand
    • mlx?_en for Ethernet
    • mlx?_fc for Fibre Channel
  • ib_* contains Infiniband specific functions…

Prior to rdma-core package (see above)…

## find all infiniband modules
>>> find /lib/modules/$(uname -r)/kernel/drivers/infiniband -type f -name \*.ko
## load required modules
>>> for mod in mlx4_core mlx4_ib ib_umad ib_ipoib rdma_ucm ; do modprobe $mod ; done
## make sure modules get loaded on boot 
>>> for mod in mlx4_core mlx4_ib ib_umad ib_ipoib rdma_ucm ; do echo "$mod" >> /etc/modules-load.d/infiniband.conf ; done
## list loaded infiniband modules
>>> lsmod | egrep "^mlx|^ib|^rdma"
## check the version
>>> modinfo mlx4_core | grep -e ^filename -e ^version
## list module configuration parameters
>>> for i in /sys/module/mlx?_core/parameters/* ; do echo $i: $(cat $i); done
## module configuration
>>> cat /etc/modprobe.d/mlx4_core.conf
options mlx4_core log_num_mtt=20 log_mtts_per_seg=4

IPoIB

InfiniBand does not use the internet protocol (IP) by default…

  • IP over InfiniBand (IPoIB) provides an IP network emulation layer…
  • …on top of InfiniBand remote direct memory access (RDMA) networks
  • ARP over a specific multicast group to convert IP to IB addresses
  • TCP/UDP over IPoIB (IPv4/6)
    • TCP uses reliable-connected mode, MTU up to 65520 bytes
    • UDP uses unreliable-datagram mode, MTU limited to the IB packet size (4KB)
  • MTUs should be synchronized between all components

IPoIB devices have a 20 byte hardware address…

netstat -g                                # IP group membership
saquery -g | grep MGID | tr -s '..' | cut -d. -f2
                                          # list multicast group GIDs
tail -n+1 /sys/class/net/ib*/mode         # connection mode
ibv_devinfo | grep _mtu                   # MTU of the hardware 
/sys/class/net/ib0/device/mlx4_port1_mtu
ip a | grep ib[0-9] | grep mtu | cut -d' ' -f2,4-5
                                          # MTU configuration for the interface
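
Connected mode and a larger MTU can be set per interface (a sketch; interface name and MTU value are examples)…

echo connected > /sys/class/net/ib0/mode
ip link set ib0 mtu 65520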

Network Boot

Boot over Infiniband (BoIB) …two boot modes:

  • …UEFI boot…
    • …modern and recommended way to network boot
    • …expansion ROM implements the UEFI APIs
    • …supports any network boot method available in the UEFI reference specification
      • …UEFI PXE/LAN boot
      • …UEFI HTTP boot
  • …legacy boot…
    • …boot device ROM for traditional BIOS implementations
    • …HCAs use FlexBoot (an iPXE variant)…
    • …enabled by an expansion ROM image .mrom

Dracut

Dracut …early boot environment…

rd.driver.post=mlx5_ib,ib_ipoib,ib_umad,rdma_ucm rd.neednet=1 rd.timeout=0 rd.retry=160 

List of parameters:

  • rd.driver.post load additional kernel modules
    • mlx4_ib supports ConnectX-3 and older adapters
    • mlx5_ib for Connect-IB, ConnectX-4 and newer
  • rd.neednet=1 forces start of network interfaces
  • rd.timeout=0 waits until a network interface is activated
  • rd.retry=160 time to wait for the network to initialize and become operational
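
Putting it together, a kernel command line for an IPoIB-booted node might look like this (the ip= interface name and the use of DHCP are assumptions)…

rd.driver.post=mlx5_ib,ib_ipoib,ib_umad,rdma_ucm rd.neednet=1 rd.timeout=0 rd.retry=160 ip=ib0:dhcp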

RDMA Subsystem

RDMA subsystem relies on the kernel, udev and systemd to load modules…

rdma-core Package

  • Source code linux-rdma/rdma-core, GitHub
  • rdma-core package provides RDMA core user-space libraries and daemons…
  • udev loading the physical hardware driver
    • /usr/lib/udev/rules.d/*-rdma*.rules device manager rules
    • Once an RDMA device is created by the kernel…
    • …triggers module loading services
  • rdma-hw.target load a protocol module…
    • …pull in rdma management daemons dynamically
    • …wants rdma-load-modules@rdma.service before network.target
    • …loads all modules from /etc/rdma/modules/*.conf
# list kernel modules to be loaded
grep -v ^# /etc/rdma/modules/*.conf

rdma Commands

# ...view the state of all RDMA links
>>> rdma dev
0: mlx5_0: node_type ca fw 20.31.1014 node_guid 9803:9b03:0067:ab58 sys_image_guid 9803:9b03:0067:ab58

# ...display the RDMA link
>>> rdma link
link mlx5_0/1 subnet_prefix fe80:0000:0000:0000 lid 817 sm_lid 762 lmc 0 state ACTIVE physical_state LINK_UP

Set up software RDMA on an existing interface…

modprobe $module
rdma link add $name type $type netdev $device
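
For example, Soft-RoCE (rxe) on an Ethernet interface (link and interface names are examples)…

modprobe rdma_rxe
rdma link add rxe0 type rxe netdev eth0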

ibv_* Commands

RDMA devices available for use from the user space

ibv_devices list devices with GUID

>>> ibv_devices 
    device                 node GUID
    ------              ----------------
    mlx5_0              08c0eb0300f82cbc

ibv_devinfo -v show device capabilities accessible to user-space…

Drivers

  • Inbox drivers
    • …upstream kernel support
    • …RHEL/SLES release documentation
  • Linux drivers part of MLNX_OFED
    • kmod* packages

iWARP

Implementation of iWARP (Internet Wide-area RDMA Protocol)…

  • …implements RDMA over IP networks …on top of the TCP/IP protocol
  • …works with all Ethernet network infrastructure
    • …offloads TCP/IP (from CPU) to RDMA-enabled NIC (RNIC)
    • …zero copy …direct data placement
      • …eliminates intermediate buffer copies
      • …reading and writing directly to application memory
    • …kernel bypass …removes the need for context switches from kernel- to user-space
  • …enables…
  • …block storage …iSER (iSCSI Extensions for RDMA)
  • …file storage (NFS over RDMA)
  • …NVMe over Fabrics

MLNX_OFED

# download the MLNX_OFED distribution from NVIDIA
>>> tar -xvf MLNX_OFED_LINUX-5.3-1.0.0.1-rhel7.9-x86_64.tgz
>>> ls MLNX_OFED_LINUX-5.3-1.0.0.1-rhel7.9-x86_64/RPMS/*.rpm \
      | xargs -n 1 basename |sort
ar_mgr-1.0-5.8.2.MLNX20210321.g58d33bf.53100.x86_64.rpm
clusterkit-1.0.36-1.53100.x86_64.rpm
dapl-2.1.10.1.mlnx-OFED.4.9.0.1.4.53100.x86_64.rpm
dapl-devel-2.1.10.1.mlnx-OFED.4.9.0.1.4.53100.x86_64.rpm
dapl-devel-static-2.1.10.1.mlnx-OFED.4.9.0.1.4.53100.x86_64.rpm
dapl-utils-2.1.10.1.mlnx-OFED.4.9.0.1.4.53100.x86_64.rpm
dpcp-1.1.2-1.53100.x86_64.rpm
dump_pr-1.0-5.8.2.MLNX20210321.g58d33bf.53100.x86_64.rpm
fabric-collector-1.1.0.MLNX20170103.89bb2aa-0.1.53100.x86_64.rpm
#...
  • Duplicate packages…
    • …in conflict with the enterprise distribution are…
    • …prefixed with mlnx or include mlnx somewhere in the package name
  • Different installation profiles…
Package Name Profile
mlnx-ofed-all Installs all available packages in MLNX_OFED
mlnx-ofed-basic Installs basic packages required for running the cards
mlnx-ofed-guest Installs packages required by guest OS
mlnx-ofed-hpc Installs packages required for HPC
mlnx-ofed-hypervisor Installs packages required by hypervisor OS
mlnx-ofed-vma Installs packages required by VMA
mlnx-ofed-vma-eth Installs packages required by VMA to work over Ethernet
mlnx-ofed-vma-vpi Installs packages required by VMA to support VPI
bluefield Installs packages required for BlueField
dpdk Installs packages required for DPDK
dpdk-upstream-libs Installs packages required for DPDK using RDMA-Core
kernel-only Installs packages required for a non-default kernel
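
The profiles are available as meta-packages, e.g. (assuming the RPMS/ directory has been set up as a package repository)…

dnf install -y mlnx-ofed-hpc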

Build

Example from CentOS 7.9

# extract the MLNX OFED archive
cp /lustre/hpc/vpenso/MLNX_OFED_LINUX-5.3-1.0.0.1-rhel7.9-x86_64.tgz .
tar -xvf MLNX_OFED_LINUX-5.3-1.0.0.1-rhel7.9-x86_64.tgz
cd MLNX_OFED_LINUX-5.3-1.0.0.1-rhel7.9-x86_64/
# dependencies
yum install -y \
      automake \
      autoconf \
      createrepo \
      gcc-gfortran \
      libtool \
      libusbx \
      python-devel \
      redhat-rpm-config \
      rpm-build 

# remove all previously installed artifacts...
./uninstall.sh

# run the generic installation
./mlnxofedinstall --skip-distro-check --add-kernel-support --kmp --force

# copy the new archive...
cp  /tmp/MLNX_OFED_LINUX-5.3-1.0.0.1-3.10.0-1160.21.1.el7.x86_64/MLNX_OFED_LINUX-5.3-1.0.0.1-rhel7.9-ext.tgz  ...

mlnxofedinstall will install the newly built RPM packages on the host.

>>> systemctl stop lustre.mount ; lustre_rmmod
# this will bring down the network interface, and disconnect your SSH session
>>> /etc/init.d/openibd restart
# new modules compatible to the kernel have been loaded
>>> modinfo mlx5_ib
filename:       /lib/modules/3.10.0-1160.21.1.el7.x86_64/extra/mlnx-ofa_kernel/drivers/infiniband/hw/mlx5/mlx5_ib.ko
license:        Dual BSD/GPL
description:    Mellanox 5th generation network adapters (ConnectX series) IB driver
author:         Eli Cohen <eli@mellanox.com>
retpoline:      Y
rhelversion:    7.9
srcversion:     DF39E5800D8C1EEB9D2B51C
depends:        mlx5_core,ib_core,mlx_compat,ib_uverbs
vermagic:       3.10.0-1160.21.1.el7.x86_64 SMP mod_unload modversions 
parm:           dc_cnak_qp_depth:DC CNAK QP depth (uint)

The new kernel packages have a time-stamp within the version to distinguish them from the original versions:

[root@lxbk0718 ~]# yum --showduplicates list kmod-mlnx-ofa_kernel
Installed Packages
kmod-mlnx-ofa_kernel.x86_64                  5.3-OFED.5.3.1.0.0.1.202104140852.rhel7u9                   installed   
Available Packages
kmod-mlnx-ofa_kernel.x86_64                  5.3-OFED.5.3.1.0.0.1.rhel7u9                                gsi-internal

Loading the Lustre module back into the kernel will fail…

[root@lxbk0718 ~]# modprobe lustre
modprobe: ERROR: could not insert 'lustre': Invalid argument
[root@lxbk0718 ~]# dmesg -H | tail 
[  +0.000002] ko2iblnd: Unknown symbol ib_modify_qp (err -22)
[  +0.000025] ko2iblnd: Unknown symbol ib_destroy_fmr_pool (err 0)
[  +0.000007] ko2iblnd: disagrees about version of symbol rdma_destroy_id
[  +0.000001] ko2iblnd: Unknown symbol rdma_destroy_id (err -22)
[  +0.000004] ko2iblnd: disagrees about version of symbol __rdma_create_id
[  +0.000001] ko2iblnd: Unknown symbol __rdma_create_id (err -22)
[  +0.000042] ko2iblnd: Unknown symbol ib_dealloc_pd (err 0)
[  +0.000015] ko2iblnd: Unknown symbol ib_fmr_pool_map_phys (err 0)
[  +0.000364] LNetError: 70810:0:(api-ni.c:2283:lnet_startup_lndnet()) Can't load LND o2ib, module ko2iblnd, rc=256
[  +0.002136] LustreError: 70810:0:(events.c:625:ptlrpc_init_portals()) network initialisation failed

A rebuild of the Lustre kernel modules compatible with MLNX OFED 5.3 is required

# get the source code
git clone git://git.whamcloud.com/fs/lustre-release.git
# checkout the version supporting the kernel
# cf. https://www.lustre.org/lustre-2-12-6-released/
git checkout v2_12_6
# prepare the build environment
sh ./autogen.sh
# configure to build only the Lustre client
./configure --disable-server --disable-tests
# build the RPM packages (once the configuration works)
make && make rpms

Application Interface

  • OpenFabrics Alliance (OFA)
    • Builds open-source software: OFED (OpenFabrics Enterprise Distribution)
    • Kernel-level drivers, channel-oriented RDMA and send/receive operations
    • Kernel and user-level application programming interface (API)
    • Services for parallel message passing (MPI)
    • Includes Open Subnet Manager with diagnostic tools
    • IP over Infiniband (IPoIB), Infiniband Verbs/API

RDMA

  • Remote Direct Memory Access (RDMA)
  • Linux kernel network stack limitations
    • system call API packet rates too slow for high-speed network fabrics with latencies in the nano-seconds
    • overhead copying data from user- to kernel-space
    • workarounds: packet aggregation, flow steering, pass NIC to user-space…
  • RDMA Subsystem: Bypass the kernel network stack to sustain full throughput
    • special Verbs library maps devices into user-space to allow direct data stream control
    • direct user-space to user-space memory data transfer (zero-copy)
    • offload of network functionality to the hardware device
    • messaging protocols implemented in RDMA
    • regular network tools may not work
    • bridging between common Ethernet networks and HPC network fabrics difficult
  • protocols implementing RDMA: Infiniband, Omnipath, Ethernet(RoCE)
  • future integration with the kernel network stack?
    • Integrate RDMA subsystem messaging with the kernel
    • Add Queue Pairs (QPs) concept to the kernel network stack to enable RDMA
    • Implement POSIX network semantics for Infiniband

RDMA over Ethernet

  • advances in Ethernet technology allow building “lossless” Ethernet fabrics
    • PFC (Priority-based Flow Control) prevents packet loss due to buffer overflow at switches
    • Enables FCoE (Fibre Channel over Ethernet), RoCE (RDMA over Converged Ethernet)
    • Ethernet NICs come with a variety of options for offloading
  • RoCE specification published as an annex to the IBTA InfiniBand specification
  • implements Infiniband Verbs over Ethernet (OFED >1.5.1)
    • use Infiniband transport & network layer, swaps link layer to use Ethernet frames
    • IPv4/6 addresses set over the regular Ethernet NIC
    • control path RDMA-CM API, data path Verbs API

OpenFabric

  • OpenFabrics Interfaces (OFI)
  • Developed by the OFI Working Group, a subgroup of OFA
    • Successor to IB Verbs, and RoCE specification
    • Optimizes software to hardware path by minimizing cache and memory footprint
    • Application-Centric and fabric implementation agnostic
  • libfabric core component of OFI
    • User-space API mapping applications to underlying fabric services
    • Hardware/protocol agnostic
  • Fabric hardware support implemented in OFI providers
    • Socket provider for development
    • Verbs provider allows running over hardware supporting libibverbs (InfiniBand)
    • usNIC (user-space NIC) provider supports Cisco Ethernet hardware
    • PSM (Performance Scale Messaging) provider for Intel Omni-Path and Cray Aries

References