InfiniBand - HPC Network Interconnect
- InfiniBand Architecture (IBA)
- Architecture for Interprocess Communication (IPC) networks
- Switch-based, point-to-point interconnection network
- low latency, high throughput, quality of service
- CPU offload, hardware based transport protocol, bypass of the kernel
- Mellanox Community
Terminology
GUID Globally Unique Identifier
- …64bit unique address assigned by vendor
- …persistent through reboot
- …3 types of GUIDs: Node, port(, and system image)
LID Local Identifier (48k unicast per subnet)
- …16bit layer 2 address
- …assigned by the SM when port becomes active
- …each HCA port has a LID…
- …all switch ports share the same LID
- …director switches have one LID per ASIC
GID Global Identifier
- …128bit address unique across multiple subnets
- …based on the port GUID combined with 64bit subnet prefix
- …Used in the Global Routing Header (GRH) (ignored by switches within a subnet)
PKEY Partition Key
- …fabric segmentation of nodes into different partitions
- …partitions unaware of each other
- …membership is either full (1) or limited (0)
- …limited members can't communicate with each other (only with full members)
- …ports may be member of multiple partitions
- …assigned by listing port GUIDs in partitions.conf
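A minimal sketch of an opensm partition configuration, assuming the file lives at /etc/opensm/partitions.conf (path and partition names vary by packaging; the GUIDs are placeholders taken from examples further down this page):

# default partition, all ports as full members
Default=0x7fff, ipoib : ALL=full, SELF=full ;
# additional partition with two full members and one limited member
storage=0x0002, ipoib : 0x08c0eb0300af4fa2=full, 0xe41d2d0300dff630=full, 0xe41d2d0300e013d0=limited ;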
Hardware
Nvidia InfiniBand Networking Solutions
Switches
Switch | Config. | Ports | Speed |
---|---|---|---|
SB7800 | fixed | 36 | EDR |
QM87xx | fixed | 40 | HDR |
CS8500 | modular | 800 | HDR |
QM97xx | fixed | 64 | NDR |
Switches come in two configurations…
- …fixed …fixed number of ports
- …modular …gradually expandable port modules
Switches come in two flavors…
- …managed …
- …MLNX-OS features unlocked
- …access over SSH, SNMP, HTTPS
- …enables monitoring and configuration
- …unmanaged …
- …in-band management is possible
- …status via chassis LEDs
Get information from unmanaged switches with ibswinfo.sh
# requires MST service
>>> ./ibswinfo.sh -d lid-647
=================================================
Quantum Mellanox Technologies
=================================================
part number | MQM8790-HS2F
serial number | MT2202X19243
product name | Jaguar Unmng IB 200
revision | AK
ports | 80
PSID | MT_0000000063
GUID | 0x1070fd030003af98
firmware version | 27.2008.3328
-------------------------------------------------
uptime (d-h:m:s) | 26d-20:16:01
-------------------------------------------------
PSU0 status | OK
P/N | MTEF-PSF-AC-C
S/N | MT2202X18887
DC power | OK
fan status | OK
power (W) | 165
PSU1 status | OK
P/N | MTEF-PSF-AC-C
S/N | MT2202X18881
DC power | OK
fan status | OK
power (W) | 148
-------------------------------------------------
temperature (C) | 63
max temp (C) | 63
-------------------------------------------------
fan status | OK
fan#1 (rpm) | 5959
fan#2 (rpm) | 5251
fan#3 (rpm) | 6013
fan#4 (rpm) | 5251
fan#5 (rpm) | 5906
fan#6 (rpm) | 5293
fan#7 (rpm) | 6125
fan#8 (rpm) | 5293
fan#9 (rpm) | 5959
-------------------------------------------------
Ethernet Gateway
Skyway InfiniBand to Ethernet gateway…
- MLNX-GW (gateway operating system) appliance
- 16 ports (8 InfiniBand EDR/HDR + 8 Ethernet 100/200Gb/s)
- Max. bandwidth 1.6Tb/s
- High-availability & load-balancing
…achieved by leveraging Ethernet LAG (Link Aggregation). LACP (Link Aggregation Control Protocol) is used to establish the LAG and to verify connectivity…
Cables
Cable part numbers…
Cable | Speed | Type | Split | Length (m) |
---|---|---|---|---|
MC2207130 | FDR | DAC | no | .5, 1, 1.5, 2 |
MC220731V | FDR | AOC | no | 3, 5, 10, 15, 20, 25, 30, 40, 50, 75, 100 |
MCP1600-E | EDR | DAC | no | .5, 1, 1.5, 2, 2.5, 3, 4, 5 |
MFA1A00-E | EDR | AOC | no | 3, 5, 10, 15, 20, 30, 50, 100 |
MCP1650-H | HDR | DAC | no | .5, 1, 1.5, 2 |
MCP7H50-H | HDR | DAC | yes | 1, 1.5, 2 |
MCA1J00-H | HDR | ACC | no | 3, 4 |
MCA7J50-H | HDR | ACC | yes | 3, 4 |
MFS1S00-HxxxE | HDR | AOC | no | 3, 5, 10, 15, 20, 30, 50, 100, 130, 150 |
MFS1S50-HxxxE | HDR | AOC | yes | 3, 5, 10, 15, 20, 30 |
LinkX product family for Mellanox cables and transceivers
- DAC, (passive) direct attach copper
- low price
- up to 2 meters (at HDR)
- simple copper wires
- no electronics
- consume (almost) zero power
- lowest latency
- ACC, active copper cables (aka active DAC)
- consumes 4 to 5 Watts
- include signal-boosting integrated circuits (ICs)
- extend the reach up to 4 meters (at 200G HDR)
- AOC, active optical cables
DAC-in-a-Rack connects servers and storage to top-of-rack (TOR) switches
(passive/active) splitter cables…
- DAC/ACC
- typically used to connect HDR100 HCAs to a HDR TOR switch
- …enabling a 40-port HDR switch to support 80 ports of 100G HDR100
- 1:2 splitter breakout cable in DAC copper… (QSFP56 to 2xQSFP56)
- AOC …1:2 splitter optical breakout cable… (QSFP56 to 2xQSFP56)
Firmware
MFT (Mellanox firmware tools)…
- Interface with the HCA firmware…
- …query firmware information
- …customize firmware images
- …burn firmware image to a device
- Configuration …/etc/mft
- References
Installation …MLNX_OFED includes the required packages…
dnf install -y mft kmod-kernel-mft-mlnx usbutils
…packages include an init-script…
systemctl start mst.service
Devices
…can be accessed by their PCI ID
# ...find PCI ID using lspci
>>> lspci -d 15b3:
21:00.0 Infiniband controller: Mellanox Technologies MT28908 Family [ConnectX-6]
# ...query the firmware on a device using the PCI ID
>>> mstflint -d 21:00.0 query
Image type: FS4
FW Version: 20.32.1010
…when the IB driver is loaded …access a device by device name…
# ...find the device name
>>> ibv_devinfo | grep hca_id
hca_id: mlx5_0
# ...query the firmware on a device using the device name
>>> mstflint -d mlx5_0 query
...
PSID (Parameter-Set IDentification) of the channel adapter…
>>> mlxfwmanager --query | grep PSID
PSID: SM_2121000001000
- …PSID used to download the correct firmware for a device
- …PSIDs of Mellanox-branded cards start with MT_ …prefixes like SM_ or AS_ indicate vendor re-labeled cards
mlxconfig
Reboot for configuration changes to take effect
Change device configurations without reburning the firmware…
# ...only a single device is present...
mlxconfig query | grep LINK
PHY_COUNT_LINK_UP_DELAY DELAY_NONE(0)
LINK_TYPE_P1 IB(1)
KEEP_ETH_LINK_UP_P1 True(1)
KEEP_IB_LINK_UP_P1 False(0)
KEEP_LINK_UP_ON_BOOT_P1 True(1)
KEEP_LINK_UP_ON_STANDBY_P1 False(0)
AUTO_POWER_SAVE_LINK_DOWN_P1 False(0)
UNKNOWN_UPLINK_MAC_FLOOD_P1 False(0)
# ...set configuration
mlxconfig -d $device set KEEP_IB_LINK_UP_P1=0 KEEP_LINK_UP_ON_BOOT_P1=1
Reset the device configuration to default…
mlxconfig -d $device reset
mlxfwmanager
Updating Firmware After Installation…
>>> mlxfwmanager --online -u
...
Device Type: ConnectX6
Part Number: MCX653105A-ECA_Ax
Description: ConnectX-6 VPI adapter card; 100Gb/s (HDR100; EDR IB and 100GbE); single-port QSFP56; PCIe3.0 x16...
PSID: MT_0000000222
PCI Device Name: 0000:21:00.0
Base GUID: 08c0eb0300f0a5ec
Versions: Current Available
FW 20.32.1010 20.35.1012
PXE 3.6.0502 3.6.0804
UEFI 14.25.0017 14.28.0015
...
mst
mst
stops and starts the access driver for Linux
Example: start MST and list device paths (first step when updating the firmware on Super Micro boards):
>>> mst start && mst status -v
DEVICE_TYPE MST PCI RDMA NET NUMA
ConnectX2(rev:b0) /dev/mst/mt26428_pciconf0
ConnectX2(rev:b0) /dev/mst/mt26428_pci_cr0 02:00.0 mlx4_0 net-ib0
mlxcables
…work against the cables connected to the devices on the machine…
- mst cable add …discover the cables that are connected to the local devices
- mlxcables …access the cables…
- …get cable IDs…
- …upgrade firmware on the cables
>>> mlxcables -q
...
Cable name : mt4123_pciconf0_cable_0
...
Identifier : QSFP28 (11h)
Technology : Copper cable unequalized (a0h)
Compliance : 50GBASE-CR, ... HDR,EDR,FDR,QDR,DDR,SDR
...
Vendor : Mellanox
Serial number : MT2214VS04725
Part number : MCP7H50-H01AR30
...
Length [m] : 1 m
mlxlink
mlxlink
…check and debug link status
>>> mlxlink -d mlx5_0 --show_device
Operational Info
----------------
State : Active
Physical state : LinkUp
Speed : IB-SDR
Width : 2x
FEC : No FEC
Loopback Mode : No Loopback
Auto Negotiation : ON
Supported Info
--------------
Enabled Link Speed : 0x00000001 (SDR)
Supported Cable Speed : 0x0000007f (HDR,EDR,FDR,FDR10,QDR,DDR,SDR)
...
Fabric
List of commands relevant to discover and debug the fabric…
Command | Description |
---|---|
ibnetdiscover | …scans fabric sub-network …generates topology information |
iblinkinfo | …list links in the fabric |
ibnodes | …list of nodes in the fabric |
ibhosts | …list channel adapters |
ibportstate | …state of a given port |
ibqueryerrors | …port error counters |
ibroute | …display forwarding table |
ibdiagnet | …complete fabric scan …all devices, ports, links, counters, etc. |
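A few usage sketches for the commands above (LID 647 and port 1 are example values taken from this page):

# list channel adapters (hosts) in the fabric
ibhosts
# state of port 1 on the device with LID 647
ibportstate 647 1
# report ports with error counters above the default threshold
ibqueryerrors
# dump the forwarding table of the switch with LID 647
ibroute 647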
ibnetdiscover
Subnet discovery …outputs a human readable topology file
List…
- -l …connected nodes
- -H …connected HCAs
- -S …connected switches
# switches...
>>> ibnetdiscover -S
Switch : 0x7cfe90030097c8f0 ports 36 devid 0xc738 vendid 0x2c9 "SwitchX - Mellanox Technologies"
#...
# host channel adapters
>>> ibnetdiscover -H
Ca : 0x08c0eb0300af4fa2 ports 1 devid 0x101b vendid 0x2c9 "... mlx5_0"
Ca : 0xe41d2d0300dff630 ports 2 devid 0x1003 vendid 0x2c9 "... mlx4_0"
Ca : 0xe41d2d0300e013d0 ports 2 devid 0x1003 vendid 0x2c9 "... mlx4_0"
#...
Output by columns…
- …GUID
- …number of ports
- …devid …device ID …hexadecimal
- …vendid …vendor ID …hexadecimal
- …"..." …description
iblinkinfo
Reports link info for all links in the fabric…
# ...show switch with GUID
iblinkinfo -S 0x1070fd030003af98
# ...show only the next switch on the node up-link
iblinkinfo -n 1 --switches-only
- …each switch with GUID is listed with…
- …one port per line…
- …left …switch LID and port
- …middle …after == …connection width, speed and state
- …right …after ==> …down-link device…
- …either a switch …or node HCA
- …LID, port, node name and device type
# switch GUID ...name (if available) ...type and model
Switch: 0x1070fd030003af98 Quantum Mellanox Technologies:
647 1[ ] ==( Down/ Polling) ==> [ ] "" ( )
647 2[ ] ==( 2X 53.125 Gbps Active/ LinkUp)==> 23 1[ ] "localhost mlx5_0" ( )
# LID port width ...speed ...physical state down-link LID port name ..device
List active ports on a specific switch…
>>> iblinkinfo -S 0x1070fd030003af98 -l | tr -s ' ' | cut -d'"' -f3- | grep -v -i down
647 2[ ] ==( 2X 53.125 Gbps Active/ LinkUp)==> 0xe8ebd30300a6115e 23 1[ ] "localhost mlx5_0" ( )
647 21[ ] ==( 4X 53.125 Gbps Active/ LinkUp)==> 0x1070fd03000f4b72 24 26[ ] "Quantum Mellanox" #...
647 23[ ] ==( 4X 53.125 Gbps Active/ LinkUp)==> 0x1070fd03000f4a92 16 14[ ] "Quantum Mellanox" #...
#...
647 80[ ] ==( 2X 53.125 Gbps Active/ LinkUp)==> 0xe8ebd30300a61cca 22 1[ ] "lxbk1149" ( )
ibdiagnet
ibdiagnet reports trouble in a form like:
...
Link at the end of direct route "1,1,19,10,9,17"
Errors:
-error noInfo -command {smNodeInfoMad getByDr {1 1 19 10 9 17}}
Errors types explanation:
"noInfo" : the link was ACTIVE during discovery but, sending MADs across it
failed 4 consecutive times ...
ibdiagpath prints all GUIDs on the route:
>>> ibdiagpath -d 1,1,19,10,9,17
...
-I- From: lid=0x0216 guid=0x7cfe90030097cef0 dev=51000 Port=17
…if necessary, use archived output of ibnetdiscover
to identify the corresponding host.
Otherwise check the end of the cable connected to the switch port identified.
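A sketch of working with an archived topology dump (the archive path is illustrative, the GUID is taken from the ibdiagpath output above):

# archive the current topology for later reference
ibnetdiscover > /var/tmp/ibnetdiscover-$(date +%F).out
# later: look up the node attached to the GUID reported by ibdiagpath
grep -B1 -A2 0x7cfe90030097cef0 /var/tmp/ibnetdiscover-*.out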
mlxconfig
mlxconfig
– Changing Device Configuration Tool
Query a switch using its LID…
- …query …supported configurations after reboot
- …option -e …show default and current configurations
>>> mlxconfig -d lid-0x287 -e query
Device #1:
----------
Device type: Quantum
Name: MQM8790-HS2X_Ax
Description: Mellanox Quantum(TM) HDR InfiniBand Switch #[...]
Device: lid-0x287
Configurations: Default Current Next Boot
* SPLIT_MODE NO_SPLIT_SUPPORT(0) NO_SPLIT_SUPPORT(0) SPLIT_2X(1)
DISABLE_AUTO_SPLIT ENABLE_AUTO_SPLIT(0) ENABLE_AUTO_SPLIT(0) ENABLE_AUTO_SPLIT(0)
SPLIT_PORT Array[1..64] Array[1..64] Array[1..64]
GB_VECTOR_LENGTH 0 0 0
GB_UPDATE_MODE ALL(0) ALL(0) ALL(0)
GB_VECTOR Array[0..7] Array[0..7] Array[0..7]
The '*' shows parameters with next value different from default/current value.
show_confs
displays information about all configurations…
>>> mlxconfig -d lid-0x287 show_confs
# [...]
SWITCH CONF:
DISABLE_AUTO_SPLIT=<DISABLE_AUTO_SPLIT|ENABLE_AUTO_SPLIT>Disable Auto-Split:
0x0: ENABLE_AUTO_SPLIT - if NV is split OR if cable is split then port is split.
0x1: DISABLE_AUTO_SPLIT - if NV is split then port is split # [...]
SPLIT_MODE=<NO_SPLIT_SUPPORT|SPLIT_2X> Split ports mode of operation configured # [...]
0x0: NO_SPLIT_SUPPORT
0x1: SPLIT_2X - device supports splitting ports to two 2X ports
# [...]
Split Cables
Changes require a switch reboot!
Split a port on remotely managed switches…
- …only for Quantum based switch systems
- …single physical quad-lane QSFP port is divided into 2 dual-lane ports
- …all system ports may be split into 2-lane ports
- …splitting changes the notation of that port …from x/y to x/y/z
- …z indicating the number of the resulting sub-physical port (1,2)
- …each sub-physical port is then handled as an individual port
Enable port splits…
# enable split mode support
mlxconfig -d <device> set SPLIT_MODE=1
# split ports....
mlxconfig -d <device> set SPLIT_PORT[<port_num>/<port_range>]=1
SPLIT_MODE=SPLIT_2X(1) enables splits…
- …should be equivalent to the split-ready configuration
- …on managed switches …system profile ib split-ready
SPLIT_PORT[1..64]=1 …split for all ports…
- …should be equivalent to changing the module type to a split mode…
- …on managed switches …module-type qsfp-split-2
Query the configuration…
>>> mlxconfig -d lid-0x287 -e query SPLIT_PORT[1..64]
Device #1:
----------
Device type: Quantum
Name: MQM8790-HS2X_Ax
Description: Mellanox Quantum(TM) HDR InfiniBand Switch #[...]
Device: lid-0x287
Configurations: Default Current Next Boot
SPLIT_PORT[1] NO_SPLIT(0) NO_SPLIT(0) NO_SPLIT(0)
SPLIT_PORT[2] NO_SPLIT(0) NO_SPLIT(0) NO_SPLIT(0)
SPLIT_PORT[3] NO_SPLIT(0) NO_SPLIT(0) NO_SPLIT(0)
SPLIT_PORT[4] NO_SPLIT(0) NO_SPLIT(0) NO_SPLIT(0)
#[...]
Adapters (HCAs)
ibstat
ibstat without arguments lists all local adapters with state information
# list channel adapters (CAs)
>>> ibstat -l
mlx5_0
# port GUID...
>>> ibstat -p
0x08c0eb0300f82cbc
Check for an operational State: Active & Physical state: LinkUp…
>>> ibstat
CA 'mlx5_0'
# [...]
Port 1:
State: Active
Physical state: LinkUp
Rate: 10
# [...]
- Physical state …(of the cable)
  - …Polling …no connection …check cable (…and switch)
  - …LinkUp …physical uplink connection (…does not mean it's configured and ready to send data)
- State (…of the HCA)
  - …Down …no physical connection
  - …Initializing …physical uplink connection …not discovered by the subnet manager
  - …Active …port in a normal operational state
- Rate…
  - …speed at which the port is operating
  - …matches speed of the slowest device on the network path
ibstatus
displays similar information (part of outdated tooling)
ibaddr
Display the LID (and range) as well as the GID address of a port
# local GID and LID
>>> ibaddr
GID fe80::e8eb:d303:a6:1856 LID start 0x15 end 0x15
# LID (in decimal) of the local adapter
>>> ibaddr -L
LID start 21 end 21
Used for address conversion between GIDs and LIDs
# GID of given LID
>>> ibaddr -g 0x22e
GID fe80::8c0:eb03:f8:2cbc
# LID (range) for a GID
>>> ibaddr -G 0x1070fd030003af98 -L
LID start 647 end 647
iblinkinfo
Identify the switch a node is connected to …
# ..GUID
>>> iblinkinfo -n 1 | grep -i switch | cut -d' ' -f2
0x1070fd030003af98
# ..LID
>>> ibaddr -G $(iblinkinfo -n 1 | grep -i switch | cut -d' ' -f2) -L
LID start 647 end 647
ibdev2netdev
ibdev2netdev
prints a list of local devices mapped to network interfaces…
>>> ibdev2netdev
mlx5_0 port 1 ==> ib0 (Up)
# ...verbose
>>> ibdev2netdev -v
mlx5_0 (mt4123 - MCX653105A-ECAT) ConnectX-6 VPI adapter card, 100Gb/s #...
Error Counters
List of InfiniBand error counters…
Counter | Description |
---|---|
LinkDowned | Node reboot, failed connection (port flapping) |
Linkspeed | If not at full speed check the adapter and cable |
Linkwidth | If not at full speed check the adapter and cable |
PortRcvErrors | Physical errors, local buffer overruns, malformed packets |
PortRcvRemotePhysicalErrors | See above… packet EBP (End Bad Packet) flag set… |
PortRcvSwitchRelayErrors | Packets could not be forwarded by the switch |
Port[Rcv/Xmit]ConstraintErrors | Packets not received/transmitted due to constraint checks (e.g. partition enforcement) |
PortXmitWait | Large numbers indicate congestion (high congestion results in XmitDiscards) |
RcvRemotePhys(ical)Errors | Packet corruption occurred somewhere else in the fabric |
SymbolErrors | 99% of these errors are hardware related (small numbers can be ignored) |
VL15Drop | VL15 (SMP) packets dropped due to resource limits (not enough space in the buffers) |
XmtDiscards | Packets to be transmitted get dropped (high congestion in the fabric) |
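The counters can be read per port, for example with perfquery from infiniband-diags (LID 647 and port 1 are example values):

# extended 64bit port counters for LID 647, port 1
perfquery -x 647 1
# reset the counters after inspection
perfquery -R 647 1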
Network Layers
Physical Layer
- Link Speed x Link Width = Link Rate
- Bit Error Rate (BER) 10^-15 (one bit error per 10^15 bits)
- Virtual Lane (VL), multiple virtual links on single physical link
- Mellanox 0-7 VLs each with dedicated buffers
- Quality of Service, bandwidth management
- Media for connecting two nodes
- Passive Copper Cables FDR max. 3m, EDR max. 2m
- Active Optical Cables (AOCs) FDR max. 300m, EDR max. 100m
- Connector QSFP
Speeds
Year | Speed | Name | Lane Rate | Width | Link Rate | Latency | Encoding | Eff. Speed |
---|---|---|---|---|---|---|---|---|
1999 | SDR | Single Data Rate | 2.5Gbps | x4 | 10Gbps | 5usec | NRZ | |
2004 | DDR | Double Data Rate | 5Gbps | x4 | 20Gbps | 2.5usec | NRZ 8/10 | 16Gbps |
2008 | QDR | Quadruple Data Rate | 10Gbps | x4 | 40Gbps | 1.3usec | NRZ 8/10 | 32Gbps |
2011 | FDR | Fourteen Data Rate | 14Gbps | x4 | 56Gbps | 0.7usec | NRZ 64/66 | 54.6Gbps |
2014 | EDR | Enhanced Data Rate | 25Gbps | x4 | 100Gbps | 0.5usec | NRZ 64/66 | 96.97Gbps |
2018 | HDR | High Data Rate | 50Gbps | x4 | 200Gbps | <0.6usec | PAM-4 | |
2022 | NDR | Next Data Rate | 100Gbps | x4 | 400Gbps | | PAM-4 | |
? | XDR | | 200Gbps | x4 | 800Gbps | | PAM-4 | |
? | GDR | | | | 1.6Tbps | | | |
Link Layer
- Subnet may contain: 48K unicast & 16k multicast addresses
- Local Routing Header (LRH) includes 16bit Destination LID (DLID) and port number
- LID Mask Controller (LMC), use multiple LIDs to load-balance traffic over multiple network paths
- Credit Based Flow Control between two nodes
- Independent for each virtual lane (to separate congestion/latency)
- Sender limited by credits granted by the receiver in 64byte units
- Service Level (SL) to Virtual Lane (VL) mapping defined in opensm.conf
- Priority & weight value 0-255 indicate the number of 64byte units transported by a VL
- Guarantees performance of data flows to provide QoS
- Data Integrity
- 16bit Variant CRC (VCRC) link level integrity between two hops
- 32bit Invariant CRC (ICRC) end-to-end integrity
- Link Layer Retransmission (LLR)
- Mellanox SwitchX only, up to FDR, enabled by default
- Recovers problems on the physical layer
- Slight increase in latency
- Should remove all symbol errors
- Forward Error Correction (FEC)
- Mellanox Switch-IB only, EDR forward
- Based on 64/66bit encoding error correction
- No bandwidth loss
Network Layer
- Infiniband Routing
- Fault isolation (e.g. topology changes)
- Increase security (limit attack scope within a network segment)
- Inter-subnet packet routing (connect multiple topologies)
- Uses GIDs for each port included in the Global Routing Header (GRH)
- Mellanox Infiniband Router SB7788 (up to 6 subnets)
Transport Layer
- Message segmentation into multiple packets by the sender, reassembly on the receiver
- Maximum Transfer Unit (MTU) default 4096 Byte …configured in openib.conf
- End-to-end communication service for applications (virtual channel)
- Queue Pairs (QPs), dedicated per connection
- Send/receive queue structure to enable application to bypass kernel
- Mode: connected vs. datagram; reliable vs. unreliable
- Datagram mode uses one QP for multiple connections
- Identified by 24bit Queue Pair Number (QPN)
Upper Layer
- Protocols
- Native Infiniband RDMA Protocols
- MPI, RDMA Storage (iSER, SRP, NFS-RDMA), SDP (Sockets Direct Protocol), RDS (Reliable Datagram Sockets)
- Legacy TCP/IP, transported by IPoIB
- Software transport Verbs
- Client interface to the transport layer, HCA
- Most common implementation is OFED
- Subnet Manager Interface (SMI)
  - Subnet Management Packets (SMP) (on QP0, VL15, no flow control)
  - LID routed or direct routed (before fabric initialisation, using port numbers)
- General Service Interface (GSI)
  - General Management Packets (GMP) (on QP1, subject to flow control)
  - LID routed
Topology
Roadmap of the network:
- Critical aspect of any interconnection network
- Defines how the channels and routers are connected
- Sets performance bounds (network diameter, bisection bandwidth)
- Determines the cost of the network
- Keys to topology evaluation
- Network throughput - for application traffic patterns
- Network diameter - min/avg/max latency between hosts
- Scalability - cost of adding new end-nodes
- Cost per node - number of network routers/ports per end-node
Diameter defines the maximum distance between two nodes (hop count)
- Lower network diameter
- Better performance
- Smaller cost (less cables & routers)
- Less power consumption
Radix (or degree) of the router defines the number of ports per router
Nodal degree specifies how many links connect to each node
Demystifying DCN Topologies: Clos/Fat Trees
https://packetpushers.net/demystifying-dcn-topologies-clos-fat-trees-part1
https://packetpushers.net/demystifying-dcn-topologies-clos-fat-trees-part2
Clos Networks
Clos network is a multistage switching network
- Enables connection of large number of nodes with small-size switches
- 3 stages to switch from N inputs to N outputs
- Exactly one connection between each spine and leaf switch
Fat-Trees (special case of folded Clos network)
- Pros
- simple routing
- maximal network throughput
- fault-tolerant (path diversity)
- credit loop deadlock free routing
- Cons
- large diameter…
- …more expensive
- Alleviate the bandwidth bottleneck closer to the root with additional links
- Multiple paths to the destination from the source towards the root
- Consistent hop count, resulting in predictable latency.
- does not scale linearly with cluster size (max. 7 layers/tiers)
- Switches at the top of the pyramid shape are called Spines/Core
- Switches at the bottom of the pyramid are called Leafs/Lines/Edges
- External connections connect nodes to edge switches.
- Internal connections connect core with edge switches.
- Constant bi-sectional bandwidth (CBB)
- Non blocking (1:1 ratio)
- Equal number of external and internal connections (balanced)
- Blocking (x:1), number of external connections is higher than internal connections (over-subscription)
Dragonfly
- Pros
- Reduce number of (long) global links…without reducing performance
- …smaller network diameter
- Reduced total cost of network (since cabling is reduced)
- More scalable (compared to fat-tree)
- Cons
- Requires adaptive routing…
- …effectively balance load across global channels…
- …adding selective virtual-channel discrimination…
Hierarchical topology dividing groups of routers…
- …connected into sub-network of collectively acting router groups…
- …as one high-radix virtual router
- …all minimal routes traverse at most one global channel…
- …to realize a very low global diameter
- Channels/links…
- …terminal connections to nodes/systems
- …local (intra-group) connections to other routers in the same group
- …global (long, inter-group) connections to routers in other groups
- All-to-all connection between each router group
- (Avoids the need for external top level switches)
- Each group has at least one global link to each other router group
Flavors diverge on group sub-topology…
- …intra-group interconnection network (local channels)
- 1D flattened butterfly, completely connected (default recommendation)
- 2D flattened butterfly
- Dragonfly+ (benefits of Dragonfly and Fat Tree)
Dragonfly+
Extends Dragonfly topology by using Clos-like group topology
- Higher scalability than Dragonfly with lower cost than Fat Tree
- Group (pod) topology typical 2-level fat tree
- Pros… (compared to Dragonfly)
- More scalable, allows larger number of nodes on the network
- Similar or better bi-sectional bandwidth…
- …smaller number of buffers to avoid credit loop deadlocks
- At least 50% bi-sectional bandwidth for any router radix
- Requires only two virtual lanes to prevent credit loop deadlock
- Cons… (compared to Dragonfly)
- Even more complex routing…
- Fully Progressive Adaptive Routing (FPAR)
- Cabling complexity, intra-group routers connected as bipartite graph
Dragonfly+ is bipartite connected in the first intra-group level
- Number of spine switches = number of leaf switches
- Leaf router, first-layer
- (terminal) connects to nodes
- Intra-group (local) connection to spine routers
- Only one uplink to each spine inside the group
- Spine router, second-layer
- intra-group (local) connection to leaf routers
- inter-group (global) connections to spine routers of other groups
- Support blocking factor in leaf switches and non-blocking on Spines
Locality, group size
- With larger group size a larger amount of traffic is internal (intra-group)
- Intra-group traffic does not use inter-group global links…
- …hence does not contribute to network throughput bottleneck
How to Configure DragonFly, Mellanox, 2020/03
https://community.mellanox.com/s/article/How-to-Configure-DragonFly
Exascale HPC Fabric Topology, Mellanox, 2019/03
http://www.hpcadvisorycouncil.com/events/2019/APAC-AI-HPC/uploads/2018/07/Exascale-HPC-Fabric-Topology.pdf
Subnet Manager
Software defined network (SDN)
- …configures and maintains fabric operations
- …central repository of all information
- …configures switch forwarding tables
Only one master SM allowed per subnet
- …can run on any server (or a managed switch on small fabrics)
- …master-slave setup for high-availability
Install opensm
packages …start the subnet manager…
dnf install -y opensm
systemctl enable --now opensm
Configuration
Configuration in /etc/rdma/opensm.conf…
opensm daemon…
- -c $path …create configuration file if missing
- -p $prio …change priority …when stopped!
- -R $engine …change routing algorithm
- /var/log/opensm.log …for logging
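A sketch of starting the daemon manually with these options (priority and routing engine are example values; the config file path follows the packaging above):

# run opensm as a daemon with priority 14 and the up-down routing engine
opensm -B -p 14 -R updn -F /etc/rdma/opensm.conf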
- sminfo …show master subnet manager LID, GUID, priority
- saquery -s …show all subnet managers
ibdiagnet -r # check for routing issues
smpquery portinfo $lid $port # query port information
smpquery nodeinfo $lid # query node information
smpquery -D nodeinfo $lid # ^ using direct route
ibroute $lid # show switching table, LIDs in hex
Initialization
- Subnet discovery (…after wakeup)
- …traverses the network beginning with close neighbors
- …Subnet Management Packets (SMP) to initiate “conversation”
- Information gathering…
- …find all links/switches/hosts on all connected ports to map topology
- …Subnet Manager Query Message: direct routed information gathering for node/port information
- …Subnet Manager Agent (SMA) required on each node
- LIDs assignment
- Paths establishment
- …best path calculation to identify Shortest Path Table (Min-Hop)
- …calculate Linear Forwarding Tables (LFT)
- Ports and switch configuration
- Subnet activation
Topology Changes
SM monitors the fabric for topology changes…
- …Light Sweep, every 10sec requests node/port information
- …port status changes
- …search for other SMs, change priority
- …Heavy Sweep triggered by light sweep changes
- …fabric discovery from scratch
- …can be triggered by an IB trap from a status change on a switch
- …edge/host port state change impact is configurable
- …SM failover & handover with the SMInfo protocol
- …election by priority (0-15) and lower GUID
- …heartbeat for stand-by SM polling the master
- …SMInfo attributes exchange information during discovery/polling to synchronize
Routing
Terms important to understand different algorithms…
- …tolerance …considered during path distance calculation
  - …0 …equal distance if the number of hops in the paths is the same
  - …1 …equal distance if the difference in hop count is less than or equal to one
- …contention …declared for every switch port on the path…
  - …that is already used for routing another LID…
  - …associated with the same host port
Algorithm…
- …SPF, DOR, LASH….
- Min-Hop minimal number of switch hops between nodes (cannot avoid credit loops)
- ftree congestion-free symmetric fat-tree, shift communication pattern
Up-Down
- …Min-Hop plus core/spine ranking
- …for non pure fat-tree topologies
- …down-up routes not allowed
Enable up-down routing engine:
>>> grep -e routing_engine -e root_guid_file /etc/opensm/opensm.conf
#routing_engine (null)
routing_engine updn
#root_guid_file (null)
root_guid_file /etc/opensm/rootswitches.list
>>> head /etc/opensm/rootswitches.list
0xe41d2d0300e512c0
0xe41d2d0300e50bd0
0xe41d2d0300e51af0
0xe41d2d0300e52eb0
0xe41d2d0300e52e90
Adaptive
Avoid congestion with adaptive routing…
- …supported on all types of topologies
- …maximize network utilization
- …spread traffic across all network links…
- …determine optimal path for data packets
- …allow packets to avoid congested areas
- …redirect traffic to less occupied outgoing ports
- …grading mechanism to select optimal ports considering
- …egress port
- …queue depth
- …path priority (shorter paths have higher priority)
Requires ConnectX-5 or newer…
- …packets can arrive out-of-order
- …sender mark traffic for eligibility to network re-ordering
- …inter-message ordering can be enforced when required
Linux Configuration
Packages
Packages built from rdma-core-spec
Package | Description |
---|---|
libibverbs | …library that allows userspace processes to use RDMA “verbs” |
libibverbs-utils | …libibverbs example programs such as ibv_devinfo |
infiniband-diags | …IB diagnostic programs and scripts needed to diagnose an IB subnet |
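A hedged install example on a RHEL-family system (package names may differ slightly per distribution):

# user-space RDMA libraries, example programs and IB diagnostics
dnf install -y rdma-core libibverbs libibverbs-utils infiniband-diags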
NVIDIA packages…
- InfiniBand Management Tools
- InfiniBand diagnostic utilities (ibdiagnet, ibdiagpath, smparquery, etc.)
Modules
Mellanox HCAs require at least the mlx?_core
and mlx?_ib
kernel modules.
- Hardware drivers…
  - mlx4_* modules are used by ConnectX adapters
  - mlx5_* modules are used by Connect-IB adapters
  - mlx?_core …generic driver used by…
    - mlx?_ib for Infiniband
    - mlx?_en for Ethernet
    - mlx?_fc for Fiber-Channel
- ib_* contains Infiniband specific functions…
Prior to rdma-core
package (see above)…
## find all infiniband modules
>>> find /lib/modules/$(uname -r)/kernel/drivers/infiniband -type f -name \*.ko
## load required modules
>>> for mod in mlx4_core mlx4_ib ib_umad ib_ipoib rdma_ucm ; do modprobe $mod ; done
## make sure modules get loaded on boot
>>> for mod in mlx4_core mlx4_ib ib_umad ib_ipoib rdma_ucm ; do echo "$mod" >> /etc/modules-load.d/infiniband.conf ; done
## list loaded infiniband modules
>>> lsmod | egrep "^mlx|^ib|^rdma"
## check the version
>>> modinfo mlx4_core | grep -e ^filename -e ^version
## list module configuration parameters
>>> for i in /sys/module/mlx?_core/parameters/* ; do echo $i: $(cat $i); done
## module configuration
>>> cat /etc/modprobe.d/mlx4_core.conf
options mlx4_core log_num_mtt=20 log_mtts_per_seg=4
IPoIB
InfiniBand does not use the internet protocol (IP) by default…
- IP over InfiniBand (IPoIB) provides an IP network emulation layer…
- …on top of InfiniBand remote direct memory access (RDMA) networks
- ARP over a specific multicast group to convert IP to IB addresses
- TCP/UDP over IPoIB (IPv4/6)
- TCP uses reliable-connected mode, MTU up to 65kb
- UDP uses unreliable-datagram mode, MTU limited to the IB packet size of 4kb
- MTUs should be synchronized between all components
IPoIB devices have a 20 byte hardware address…
netstat -g # IP group membership
saquery -g | grep MGID | tr -s '..' | cut -d. -f2
# list multicast group GIDs
tail -n+1 /sys/class/net/ib*/mode # connection mode
ibv_devinfo | grep _mtu # MTU of the hardware
/sys/class/net/ib0/device/mlx4_port1_mtu
ip a | grep ib[0-9] | grep mtu | cut -d' ' -f2,4-5
# MTU configuration for the interface
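A sketch of a manual IPoIB interface configuration (interface name ib0 and the address are assumptions; persistent configuration is usually done with the distribution's network manager):

# switch ib0 to connected mode and raise the MTU
echo connected > /sys/class/net/ib0/mode
ip link set ib0 mtu 65520
# assign an IPv4 address
ip addr add 192.168.100.10/24 dev ib0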
Network Boot
Boot over Infiniband (BoIB) …two boot modes:
- …UEFI boot…
- …modern and recommended way to network boot
- …expansion ROM implements the UEFI APIs
- …supports any network boot method available in the UEFI reference specification
- …UEFI PXE/LAN boot
- …UEFI HTTP boot
- …legacy boot…
- …boot device ROM for traditional BIOS implementations
- …HCAs use FlexBoot (an iPXE variant) to network boot
- …enabled by an expansion ROM image .mrom
Dracut
Dracut …early boot environment…
- …requires loading additional kernel modules
- …kernel command-line parameters
rd.driver.post=mlx5_ib,ib_ipoib,ib_umad,rdma_ucm rd.neednet=1 rd.timeout=0 rd.retry=160
- rd.driver.post …load additional kernel modules
  - mlx4_ib for ConnectX-3 and older
  - mlx5_ib for Connect-IB/ConnectX-4 and newer
- rd.neednet=1 …forces start of network interfaces
- rd.timeout=0 …waits until a network interface is activated
- rd.retry=160 …time to wait for the network to initialize and become operational
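One way to make these parameters persistent on a RHEL-family system (a sketch; the parameter values are taken from above, and additional dracut options may be needed to include the IB drivers in the initramfs):

# append the parameters to all installed kernels
grubby --update-kernel=ALL \
  --args="rd.driver.post=mlx5_ib,ib_ipoib,ib_umad,rdma_ucm rd.neednet=1 rd.timeout=0 rd.retry=160"
# regenerate the initramfs for the running kernel
dracut --force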
RDMA Subsystem
RDMA subsystem relies on the kernel, udev
and systemd
to load modules…
rdma-core Package
- Source code linux-rdma/rdma-core, GitHub
- rdma-core package provides RDMA core user-space libraries and daemons…
- udev loads the physical hardware driver
  - /usr/lib/udev/rules.d/*-rdma*.rules device manager rules
- Once an RDMA device is created by the kernel…
  - …triggers module loading services
  - …rdma-hw.target loads a protocol module…
  - …pulls in rdma management daemons dynamically
  - …wants rdma-load-modules@rdma.service before network.target
  - …loads all modules from /etc/rdma/modules/*.conf
# list kernel modules to be loaded
grep -v ^# /etc/rdma/modules/*.conf
rdma
Commands
# ...view the state of all RDMA links
>>> rdma dev
0: mlx5_0: node_type ca fw 20.31.1014 node_guid 9803:9b03:0067:ab58 sys_image_guid 9803:9b03:0067:ab58
# ...display the RDMA link
>>> rdma link
link mlx5_0/1 subnet_prefix fe80:0000:0000:0000 lid 817 sm_lid 762 lmc 0 state ACTIVE physical_state LINK_UP
Set up software RDMA on an existing interface…
modprobe $module
rdma link add $name type $type netdev $device
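For example, software RoCE (rxe) on an Ethernet interface (interface name eth0 and link name rxe0 are assumptions):

# load the soft-RoCE driver and attach it to the Ethernet interface
modprobe rdma_rxe
rdma link add rxe0 type rxe netdev eth0
# verify the new RDMA link
rdma link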
ibv_*
Commands
RDMA devices available for use from the user space
ibv_devices
list devices with GUID
>>> ibv_devices
device node GUID
------ ----------------
mlx5_0 08c0eb0300f82cbc
ibv_devinfo -v
show device capabilities accessible to user-space…
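A quick user-space RDMA sanity check with the libibverbs example programs (device name and the host server01 are placeholders):

# on the server side
ibv_rc_pingpong -d mlx5_0 -g 0
# on the client side, connect to the server
ibv_rc_pingpong -d mlx5_0 -g 0 server01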
Drivers
- Inbox drivers…
- …upstream kernel support
- …RHEL/SLES release documentation
- Linux drivers part of MLNX_OFED
  - …kmod* packages
iWARP
Implementation of iWARP (Internet Wide-area RDMA Protocol)…
- …implements RDMA over IP networks …on top of the TCP/IP protocol
- …works with all Ethernet network infrastructure
- …offloads TCP/IP (from CPU) to RDMA-enabled NIC (RNIC)
- …zero copy …direct data placement
- …eliminates intermediate buffer copies
- …reading and writing directly to application memory
- …kernel bypass …removes the need for context switches from kernel- to user-space
Enables…
- …block storage …iSER (iSCSI Extensions for RDMA)
- …file storage (NFS over RDMA)
- …NVMe over Fabrics
MLNX_OFED
# download the MLNX_OFED distribution from NVIDIA
>>> tar -xvf MLNX_OFED_LINUX-5.3-1.0.0.1-rhel7.9-x86_64.tgz
>>> ls MLNX_OFED_LINUX-5.3-1.0.0.1-rhel7.9-x86_64/RPMS/*.rpm \
| xargs -n 1 basename |sort
ar_mgr-1.0-5.8.2.MLNX20210321.g58d33bf.53100.x86_64.rpm
clusterkit-1.0.36-1.53100.x86_64.rpm
dapl-2.1.10.1.mlnx-OFED.4.9.0.1.4.53100.x86_64.rpm
dapl-devel-2.1.10.1.mlnx-OFED.4.9.0.1.4.53100.x86_64.rpm
dapl-devel-static-2.1.10.1.mlnx-OFED.4.9.0.1.4.53100.x86_64.rpm
dapl-utils-2.1.10.1.mlnx-OFED.4.9.0.1.4.53100.x86_64.rpm
dpcp-1.1.2-1.53100.x86_64.rpm
dump_pr-1.0-5.8.2.MLNX20210321.g58d33bf.53100.x86_64.rpm
fabric-collector-1.1.0.MLNX20170103.89bb2aa-0.1.53100.x86_64.rpm
#...
- Duplicate packages…
  - …in conflict with the enterprise distribution are…
  - …prefixed with mlnx or include mlnx somewhere in the package name
- Different installation profiles…
Package Name | Profile |
---|---|
mlnx-ofed-all | Installs all available packages in MLNX_OFED |
mlnx-ofed-basic | Installs basic packages required for running the cards |
mlnx-ofed-guest | Installs packages required by guest OS |
mlnx-ofed-hpc | Installs packages required for HPC |
mlnx-ofed-hypervisor | Installs packages required by hypervisor OS |
mlnx-ofed-vma | Installs packages required by VMA |
mlnx-ofed-vma-eth | Installs packages required by VMA to work over Ethernet |
mlnx-ofed-vma-vpi | Installs packages required by VMA to support VPI |
bluefield | Installs packages required for BlueField |
dpdk | Installs packages required for DPDK |
dpdk-upstream-libs | Installs packages required for DPDK using RDMA-Core |
kernel-only | Installs packages required for a non-default kernel |
Build
Example from CentOS 7.9
# extract the MLNX OFED archive
cp /lustre/hpc/vpenso/MLNX_OFED_LINUX-5.3-1.0.0.1-rhel7.9-x86_64.tgz .
tar -xvf MLNX_OFED_LINUX-5.3-1.0.0.1-rhel7.9-x86_64.tgz
cd MLNX_OFED_LINUX-5.3-1.0.0.1-rhel7.9-x86_64/
# dependencies
yum install -y \
    automake \
    autoconf \
    createrepo \
    gcc-gfortran \
    libtool \
    libusbx \
    python-devel \
    redhat-rpm-config \
    rpm-build
# remove all previously installed artifacts...
./uninstall.sh
# run the generic installation
./mlnxofedinstall --skip-distro-check --add-kernel-support --kmp --force
# copy the new archive...
cp /tmp/MLNX_OFED_LINUX-5.3-1.0.0.1-3.10.0-1160.21.1.el7.x86_64/MLNX_OFED_LINUX-5.3-1.0.0.1-rhel7.9-ext.tgz ...
mlnxofedinstall will install the newly built RPM packages on the host.
>>> systemctl stop lustre.mount ; lustre_rmmod
# this will bring down the network interface, and disconnect your SSH session
>>> /etc/init.d/openibd restart
# new modules compatible to the kernel have been loaded
>>> modinfo mlx5_ib
filename: /lib/modules/3.10.0-1160.21.1.el7.x86_64/extra/mlnx-ofa_kernel/drivers/infiniband/hw/mlx5/mlx5_ib.ko
license: Dual BSD/GPL
description: Mellanox 5th generation network adapters (ConnectX series) IB driver
author: Eli Cohen <eli@mellanox.com>
retpoline: Y
rhelversion: 7.9
srcversion: DF39E5800D8C1EEB9D2B51C
depends: mlx5_core,ib_core,mlx_compat,ib_uverbs
vermagic: 3.10.0-1160.21.1.el7.x86_64 SMP mod_unload modversions
parm: dc_cnak_qp_depth:DC CNAK QP depth (uint)
The new kernel packages have a time-stamp within the version to distinguish them from the original versions:
[root@lxbk0718 ~]# yum --showduplicates list kmod-mlnx-ofa_kernel
Installed Packages
kmod-mlnx-ofa_kernel.x86_64 5.3-OFED.5.3.1.0.0.1.202104140852.rhel7u9 installed
Available Packages
kmod-mlnx-ofa_kernel.x86_64 5.3-OFED.5.3.1.0.0.1.rhel7u9 gsi-internal
Loading the Lustre module back into the kernel will fail…
[root@lxbk0718 ~]# modprobe lustre
modprobe: ERROR: could not insert 'lustre': Invalid argument
[root@lxbk0718 ~]# dmesg -H | tail
[ +0.000002] ko2iblnd: Unknown symbol ib_modify_qp (err -22)
[ +0.000025] ko2iblnd: Unknown symbol ib_destroy_fmr_pool (err 0)
[ +0.000007] ko2iblnd: disagrees about version of symbol rdma_destroy_id
[ +0.000001] ko2iblnd: Unknown symbol rdma_destroy_id (err -22)
[ +0.000004] ko2iblnd: disagrees about version of symbol __rdma_create_id
[ +0.000001] ko2iblnd: Unknown symbol __rdma_create_id (err -22)
[ +0.000042] ko2iblnd: Unknown symbol ib_dealloc_pd (err 0)
[ +0.000015] ko2iblnd: Unknown symbol ib_fmr_pool_map_phys (err 0)
[ +0.000364] LNetError: 70810:0:(api-ni.c:2283:lnet_startup_lndnet()) Can't load LND o2ib, module ko2iblnd, rc=256
[ +0.002136] LustreError: 70810:0:(events.c:625:ptlrpc_init_portals()) network initialisation failed
Rebuild of the Lustre kernel modules compatible with MLNX OFED 5.3 is required
# get the source code
git clone git://git.whamcloud.com/fs/lustre-release.git
# checkout the version supporting the kernel
# cf. https://www.lustre.org/lustre-2-12-6-released/
git checkout v2_12_6
# prepare the build environment
sh ./autogen.sh
# configure to build only the Lustre client
./configure --disable-server --disable-tests
# build with (once the configuration works)
make && make rpms
Application Interface
- OpenFabrics Alliance (OFA)
- Builds open-source software: OFED (OpenFabrics Enterprise Distribution)
- Kernel-level drivers, channel-oriented RDMA and send/receive operations
- Kernel and user-level application programming interface (API)
- Services for parallel message passing (MPI)
- Includes Open Subnet Manager with diagnostic tools
- IP over Infiniband (IPoIB), Infiniband Verbs/API
RDMA
- Remote Direct Memory Access (RDMA)
- Linux kernel network stack limitations
- system call API packet rates too slow for high speed network fabrics with latencies in the nanoseconds
- overhead copying data from user- to kernel-space
- workarounds: packet aggregation, flow steering, pass NIC to user-space…
- RDMA Subsystem: Bypass the kernel network stack to sustain full throughput
- special Verbs library maps devices into user-space to allow direct data stream control
- direct user-space to user-space memory data transfer (zero-copy)
- offload of network functionality to the hardware device
- messaging protocols implemented in RDMA
- regular network tools may not work
- bridging between common Ethernet networks and HPC network fabrics difficult
- protocols implementing RDMA: Infiniband, Omni-Path, Ethernet (RoCE)
- future integration with the kernel network stack?
- Integrate RDMA subsystem messaging with the kernel
- Add Queue Pairs (QPs) concept to the kernel network stack to enable RDMA
- Implement POSIX network semantics for Infiniband
RDMA over Ethernet
- advances in Ethernet technology allow building “lossless” Ethernet fabrics
- PFC (Priority-based Flow Control) prevents packet loss due to buffer overflow at switches
- Enables FCoE (Fibre Channel over Ethernet), RoCE (RDMA over Converged Ethernet)
- Ethernet NICs come with a variety of options for offloading
- RoCE specification supported as an annex to the IBTA specification
- implements Infiniband Verbs over Ethernet (OFED >1.5.1)
- use Infiniband transport & network layer, swaps link layer to use Ethernet frames
- IPv4/6 addresses set over the regular Ethernet NIC
- control path RDMA-CM API, data path Verbs API
OpenFabrics
- OpenFabrics Interfaces (OFI)
- Developed by the OFI Working Group, a subgroup of OFA
- Successor to IB Verbs, and RoCE specification
- Optimizes software to hardware path by minimizing cache and memory footprint
- Application-Centric and fabric implementation agnostic
- libfabric core component of OFI
- User-space API mapping applications to underlying fabric services
- Hardware/protocol agnostic
- Fabric hardware support implemented in OFI providers
- Socket provider for development
- Verbs provider allows running over hardware supporting libibverbs (Infiniband)
- usNIC (user-space NIC) provider supports Cisco Ethernet hardware
- PSM (Performance Scale Messaging) provider for Intel Omni-Path …GNI provider for Cray Aries
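Available providers can be inspected with the libfabric fi_info utility (assuming libfabric is installed):

# list all available providers
fi_info -l
# show details of the verbs provider
fi_info -p verbs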
References
- NVIDIA Infrastructure & Networking Knowledge Base