InfiniBand: Discover & Debug
Fabric
List of commands relevant to discover and debug the fabric…
Command | Description |
---|---|
ibnetdiscover | …scans fabric sub-network …generates topology information |
iblinkinfo | …list links in the fabric |
ibnodes | …list of nodes in the fabric |
ibhosts | …list channel adapters |
ibportstate | …state of a given port |
ibqueryerrors | …port error counters |
ibroute | …display forwarding table |
ibdiagnet | …complete fabric scan …all devices, ports, links, counters, etc. |
ibnetdiscover
Subnet discovery …outputs a human-readable topology file
List…
- `-l` connected nodes
- `-H` connected HCAs
- `-S` connected switches
# switches...
>>> ibnetdiscover -S
Switch : 0x7cfe90030097c8f0 ports 36 devid 0xc738 vendid 0x2c9 "SwitchX - Mellanox Technologies"
#...
# host channel adapters
>>> ibnetdiscover -H
Ca : 0x08c0eb0300af4fa2 ports 1 devid 0x101b vendid 0x2c9 "... mlx5_0"
Ca : 0xe41d2d0300dff630 ports 2 devid 0x1003 vendid 0x2c9 "... mlx4_0"
Ca : 0xe41d2d0300e013d0 ports 2 devid 0x1003 vendid 0x2c9 "... mlx4_0"
#...
Output by columns…
- …GUID
- …number of ports
- …`devid` device ID …hexadecimal
- …`vendid` vendor ID …hexadecimal
- …`"..."` description
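The columns described above can be picked apart with a small parser. This is a minimal sketch in Python that assumes the single-line list format shown in the sample output on this page; the function name `parse_node_line` is made up for illustration, and other `ibnetdiscover` versions may format lines differently.

```python
import re

# Matches list-style lines as shown above, for example:
#   Ca : 0x08c0eb0300af4fa2 ports 1 devid 0x101b vendid 0x2c9 "... mlx5_0"
NODE_RE = re.compile(
    r'^(?P<type>\w+)\s*:\s*(?P<guid>0x[0-9a-fA-F]+)\s+ports\s+(?P<ports>\d+)'
    r'\s+devid\s+(?P<devid>0x[0-9a-fA-F]+)\s+vendid\s+(?P<vendid>0x[0-9a-fA-F]+)'
    r'\s+"(?P<description>[^"]*)"'
)


def parse_node_line(line):
    """Return a dict for one list line, or None if the line does not match."""
    m = NODE_RE.match(line.strip())
    if m is None:
        return None
    node = m.groupdict()
    node["ports"] = int(node["ports"])
    node["devid"] = int(node["devid"], 16)    # hexadecimal device ID
    node["vendid"] = int(node["vendid"], 16)  # hexadecimal vendor ID
    return node


sample = 'Ca : 0x08c0eb0300af4fa2 ports 1 devid 0x101b vendid 0x2c9 "... mlx5_0"'
node = parse_node_line(sample)
print(node["type"], node["guid"], node["ports"], node["description"])
```

Comment lines such as `#...` simply return `None`, so the function can be mapped over the whole output and the results filtered.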
iblinkinfo
Reports link info for all links in the fabric…
# ...show switch with GUID
iblinkinfo -S 0x1070fd030003af98
# ...show only the next switch on the node up-link
iblinkinfo -n 1 --switches-only
- …each switch with GUID is listed with…
  - …one port per line…
  - …left switch LID and port
  - …middle after `==` …connection width, speed and state
  - …right of `==>` …down-link device…
    - …either a switch …or node HCA
    - …LID, port, node name and device type
# switch GUID ...name (if available) ...type and model
Switch: 0x1070fd030003af98 Quantum Mellanox Technologies:
647 1[ ] ==( Down/ Polling) ==> [ ] "" ( )
647 2[ ] ==( 2X 53.125 Gbps Active/ LinkUp)==> 23 1[ ] "localhost mlx5_0" ( )
# LID port width ...speed ...physical state down-link LID port name ..device
List active ports on a specific switch…
>>> iblinkinfo -S 0x1070fd030003af98 -l | tr -s ' ' | cut -d'"' -f3- | grep -v -i down
647 2[ ] ==( 2X 53.125 Gbps Active/ LinkUp)==> 0xe8ebd30300a6115e 23 1[ ] "localhost mlx5_0" ( )
647 21[ ] ==( 4X 53.125 Gbps Active/ LinkUp)==> 0x1070fd03000f4b72 24 26[ ] "Quantum Mellanox" #...
647 23[ ] ==( 4X 53.125 Gbps Active/ LinkUp)==> 0x1070fd03000f4a92 16 14[ ] "Quantum Mellanox" #...
#...
647 80[ ] ==( 2X 53.125 Gbps Active/ LinkUp)==> 0xe8ebd30300a61cca 22 1[ ] "lxbk1149" ( )
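Instead of chaining `tr`, `cut` and `grep`, the port lines can also be parsed into fields. The following is a minimal Python sketch that assumes the line layout of the sample output on this page (LID, port, `==( … )==>`, then the peer); `parse_link` is a hypothetical helper, and real `iblinkinfo` output may need a more tolerant pattern.

```python
import re

# <LID> <port>[ ] ==( <width> <speed> <state>/ <physical state>)==> <peer...>
LINK_RE = re.compile(
    r'^\s*(?P<lid>\d+)\s+(?P<port>\d+)\[.*?\]\s*'
    r'==\(\s*(?P<link>.*?)\s*\)\s*==>\s*(?P<peer>.*)$'
)


def parse_link(line):
    """Split one port line into its fields; return None for other lines."""
    m = LINK_RE.match(line)
    if m is None:
        return None
    # "2X 53.125 Gbps Active/ LinkUp" -> before="2X 53.125 Gbps Active"
    before, _, physical = m["link"].rpartition("/")
    tokens = before.split()
    name = re.search(r'"([^"]*)"', m["peer"])
    return {
        "lid": int(m["lid"]),
        "port": int(m["port"]),
        "width": tokens[0] if len(tokens) > 1 else None,
        "speed": " ".join(tokens[1:-1]) or None,
        "state": tokens[-1] if tokens else None,
        "physical": physical.strip(),
        "peer": name.group(1) if name else None,
    }


lines = [
    '647 1[ ] ==( Down/ Polling) ==> [ ] "" ( )',
    '647 2[ ] ==( 2X 53.125 Gbps Active/ LinkUp)==> 23 1[ ] "localhost mlx5_0" ( )',
]
links = [parse_link(line) for line in lines]
active = [link for link in links if link and link["state"] == "Active"]
for link in active:
    print(link["port"], link["width"], link["speed"], link["peer"])
```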
ibdiagnet
`ibdiagnet` reports trouble in a form like:
...
Link at the end of direct route "1,1,19,10,9,17"
Errors:
-error noInfo -command {smNodeInfoMad getByDr {1 1 19 10 9 17}}
Errors types explanation:
"noInfo" : the link was ACTIVE during discovery but, sending MADs across it
failed 4 consecutive times ...
`ibdiagpath` to print all GUIDs on the route
>>> ibdiagpath -d 1,1,19,10,9,17
...
-I- From: lid=0x0216 guid=0x7cfe90030097cef0 dev=51000 Port=17
…if available, use archived output of `ibnetdiscover` to identify the corresponding host.
Otherwise check the end of the cable connected to the identified switch port.
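When this lookup has to be repeated for many failing routes, the `From:` line can be parsed programmatically. This is a minimal Python sketch based only on the sample line above; `parse_route_hop` and `parse_direct_route` are hypothetical helper names, and the `key=value` layout is assumed from that single example.

```python
import re


def parse_route_hop(line):
    """Extract lid, guid, dev and port from an '-I- From:' line."""
    fields = dict(re.findall(r'(\w+)=(\S+)', line))
    return {
        "lid": int(fields["lid"], 16),
        "guid": fields["guid"],
        "dev": fields["dev"],
        "port": int(fields["Port"]),
    }


def parse_direct_route(route):
    """Turn '1,1,19,10,9,17' into a list of port numbers, one per hop."""
    return [int(part) for part in route.split(",")]


hop = parse_route_hop(
    "-I- From: lid=0x0216 guid=0x7cfe90030097cef0 dev=51000 Port=17"
)
route = parse_direct_route("1,1,19,10,9,17")
print(hop["guid"], hop["port"], route[-1])
```

In the sample above the last element of the direct route and the reported `Port` coincide; the GUID is the value to search for in the archived `ibnetdiscover` output.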
mlxconfig
`mlxconfig` (Changing Device Configuration Tool)
Query switch using its LID…
- `query` supported configurations after reboot
- …option `-e` show default and current configurations
>>> mlxconfig -d lid-0x287 -e query
Device #1:
----------
Device type: Quantum
Name: MQM8790-HS2X_Ax
Description: Mellanox Quantum(TM) HDR InfiniBand Switch #[...]
Device: lid-0x287
Configurations: Default Current Next Boot
* SPLIT_MODE NO_SPLIT_SUPPORT(0) NO_SPLIT_SUPPORT(0) SPLIT_2X(1)
DISABLE_AUTO_SPLIT ENABLE_AUTO_SPLIT(0) ENABLE_AUTO_SPLIT(0) ENABLE_AUTO_SPLIT(0)
SPLIT_PORT Array[1..64] Array[1..64] Array[1..64]
GB_VECTOR_LENGTH 0 0 0
GB_UPDATE_MODE ALL(0) ALL(0) ALL(0)
GB_VECTOR Array[0..7] Array[0..7] Array[0..7]
The '*' shows parameters with next value different from default/current value.
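To spot pending changes without reading the table by eye, the `Default`, `Current` and `Next Boot` columns can be compared in a script. The sketch below is a minimal Python example that assumes the whitespace-separated layout of the output above, with values that contain no spaces; `parse_mlxconfig_query` is a hypothetical helper, not part of `mlxconfig`.

```python
def parse_mlxconfig_query(text):
    """Parse the table printed by 'mlxconfig -e query' into a list of rows."""
    rows = []
    in_table = False
    for line in text.splitlines():
        stripped = line.strip()
        if stripped.startswith("Configurations:"):
            in_table = True  # table rows follow this header line
            continue
        if not in_table or not stripped:
            continue
        tokens = stripped.split()
        marked = tokens[0] == "*"
        if marked:
            tokens = tokens[1:]
        if len(tokens) != 4:
            continue  # skips the trailing "The '*' shows ..." explanation
        name, default, current, next_boot = tokens
        rows.append({
            "name": name,
            "default": default,
            "current": current,
            "next_boot": next_boot,
            "marked": marked,
        })
    return rows


sample = """\
Configurations: Default Current Next Boot
* SPLIT_MODE NO_SPLIT_SUPPORT(0) NO_SPLIT_SUPPORT(0) SPLIT_2X(1)
DISABLE_AUTO_SPLIT ENABLE_AUTO_SPLIT(0) ENABLE_AUTO_SPLIT(0) ENABLE_AUTO_SPLIT(0)
GB_VECTOR_LENGTH 0 0 0
The '*' shows parameters with next value different from default/current value.
"""
rows = parse_mlxconfig_query(sample)
pending = [row["name"] for row in rows if row["current"] != row["next_boot"]]
print(pending)
```

Parameters listed in `pending` take effect only after the next reboot of the device.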
`show_confs` displays information about all configurations…
>>> mlxconfig -d lid-0x287 show_confs
# [...]
SWITCH CONF:
DISABLE_AUTO_SPLIT=<DISABLE_AUTO_SPLIT|ENABLE_AUTO_SPLIT>Disable Auto-Split:
0x0: ENABLE_AUTO_SPLIT - if NV is split OR if cable is split then port is split.
0x1: DISABLE_AUTO_SPLIT - if NV is split then port is split # [...]
SPLIT_MODE=<NO_SPLIT_SUPPORT|SPLIT_2X> Split ports mode of operation configured # [...]
0x0: NO_SPLIT_SUPPORT
0x1: SPLIT_2X - device supports splitting ports to two 2X ports
# [...]
Split Cables
Changes require a switch reboot!
Split a port in a remotely managed switch…
- …only for Quantum based switch systems
- …single physical quad-lane QSFP port is divided into 2 dual-lane ports
- …all system ports may be split into 2-lane ports
- …splitting a port changes the notation of that port
  - …from `x/y` to `x/y/z`
  - …`z` indicating the number of the resulting sub-physical port (1,2)
- …each sub-physical port is then handled as an individual port
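The change in port notation can be illustrated with a few lines of Python. This is only a sketch of the naming rule described above; `split_port_names` is a hypothetical helper and does not talk to any switch.

```python
def split_port_names(port, ways=2):
    """Return the sub-physical port names for a split port.

    'x/y' becomes 'x/y/1' and 'x/y/2' for a 2-way split.
    """
    if port.count("/") != 1:
        raise ValueError(f"expected notation x/y, got {port!r}")
    return [f"{port}/{z}" for z in range(1, ways + 1)]


print(split_port_names("1/5"))
```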
Enable port splits…
# enable split mode support
mlxconfig -d <device> set SPLIT_MODE=1
# split ports....
mlxconfig -d <device> set SPLIT_PORT[<port_num>/<port_range>]=1
`SPLIT_MODE=SPLIT_2X(1)` enable splits…
- …should be equivalent to split-ready configuration
- …on managed switches …`system profile ib split-ready`

`SPLIT_PORT[1..64]=1` …split for all ports…
- …should be equivalent to changing the module type to a split mode…
- …on managed switches …`module-type qsfp-split-2`
Query the configuration…
>>> mlxconfig -d lid-0x287 -e query SPLIT_PORT[1..64]
Device #1:
----------
Device type: Quantum
Name: MQM8790-HS2X_Ax
Description: Mellanox Quantum(TM) HDR InfiniBand Switch #[...]
Device: lid-0x287
Configurations: Default Current Next Boot
SPLIT_PORT[1] NO_SPLIT(0) NO_SPLIT(0) NO_SPLIT(0)
SPLIT_PORT[2] NO_SPLIT(0) NO_SPLIT(0) NO_SPLIT(0)
SPLIT_PORT[3] NO_SPLIT(0) NO_SPLIT(0) NO_SPLIT(0)
SPLIT_PORT[4] NO_SPLIT(0) NO_SPLIT(0) NO_SPLIT(0)
#[...]
Adapters (HCAs)
ibstat
`ibstat` without arguments lists all local adapters with state information
# list channel adapters (CAs)
>>> ibstat -l
mlx5_0
# port GUIDs...
>>> ibstat -p
0x08c0eb0300f82cbc
Operational State: Active & Physical state: LinkUp…
>>> ibstat
CA 'mlx5_0'
# [...]
Port 1:
State: Active
Physical state: LinkUp
Rate: 10
# [...]
- Physical state …(of the cable)
  - …`Polling` …no connection …check cable (…and switch)
  - …`LinkUp` …physical uplink connection (…does not mean it’s configured and ready to send data)
- State (…of the HCA)
  - …`Down` …no physical connection
  - …`Initializing` …physical uplink connection …not discovered by the subnet manager
  - …`Active` …port in a normal operational state
- Rate…
  - …speed at which the port is operating
  - …matches speed of the slowest device on the network path
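The state combinations listed above lend themselves to a small automated check. The following Python sketch parses `ibstat` output in the layout shown on this page and maps each port to one of the interpretations from the list; `parse_ibstat_ports` and `diagnose` are hypothetical helpers, and the wording of the diagnoses is taken from these notes rather than from the tool.

```python
def parse_ibstat_ports(text):
    """Collect the 'key: value' fields of every port in ibstat output."""
    ports = []
    ca = None
    port = None
    for line in text.splitlines():
        s = line.strip()
        if s.startswith("CA '"):
            ca = s.split("'")[1]
            port = None  # fields that follow belong to the adapter itself
        elif s.startswith("Port ") and s.endswith(":"):
            port = {"ca": ca, "port": int(s[5:-1])}
            ports.append(port)
        elif port is not None and ":" in s:
            key, _, value = s.partition(":")
            port[key.strip()] = value.strip()
    return ports


def diagnose(port):
    """Map state and physical state to the interpretation listed above."""
    physical = port.get("Physical state")
    state = port.get("State")
    if physical == "Polling":
        return "no connection: check cable and switch"
    if state == "Down":
        return "no physical connection"
    if state == "Initializing":
        return "link is up but not discovered by the subnet manager"
    if state == "Active" and physical == "LinkUp":
        return "normal operational state"
    return "unclassified"


sample = """\
CA 'mlx5_0'
    Port 1:
        State: Active
        Physical state: LinkUp
        Rate: 10
"""
ports = parse_ibstat_ports(sample)
for p in ports:
    print(p["ca"], p["port"], diagnose(p))
```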
`ibstatus` displays similar information (however, it belongs to outdated tooling)
ibaddr
Display the LID (and range) as well as the GID address of a port
# local GID and LID
>>> ibaddr
GID fe80::e8eb:d303:a6:1856 LID start 0x15 end 0x15
# LID (in decimal) of the local adapter
>>> ibaddr -L
LID start 21 end 21
Used for address conversion between GIDs and LIDs
# GID of given LID
>>> ibaddr -g 0x22e
GID fe80::8c0:eb03:f8:2cbc
# LID (range) for a GUID
>>> ibaddr -G 0x1070fd030003af98 -L
LID start 647 end 647
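The outputs above can also be related to each other offline. The sketch below, in Python, assumes that the GID is the default link-local one, where the lower 64 bits carry the port GUID (as the `ibstat -p` and `ibaddr -g` samples on this page suggest); `gid_to_guid` and `parse_lid_range` are hypothetical helper names.

```python
import ipaddress
import re


def gid_to_guid(gid):
    """Return the lower 64 bits of a GID as a zero-padded hex GUID."""
    low = int(ipaddress.IPv6Address(gid)) & 0xFFFFFFFFFFFFFFFF
    return f"0x{low:016x}"


def parse_lid_range(line):
    """Parse 'LID start <n> end <m>' with decimal or 0x-prefixed values."""
    m = re.search(r'LID start (\S+) end (\S+)', line)
    if m is None:
        raise ValueError(f"no LID range in {line!r}")
    return int(m.group(1), 0), int(m.group(2), 0)


print(gid_to_guid("fe80::8c0:eb03:f8:2cbc"))
print(parse_lid_range("GID fe80::e8eb:d303:a6:1856 LID start 0x15 end 0x15"))
```

The second call returns the LID in decimal, which matches what `ibaddr -L` prints for the same adapter.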
iblinkinfo
Identify the switch a node is connected to …
# ..GUID
>>> iblinkinfo -n 1 | grep -i switch | cut -d' ' -f2
0x1070fd030003af98
# ..LID
>>> ibaddr -G $(iblinkinfo -n 1 | grep -i switch | cut -d' ' -f2) -L
LID start 647 end 647
ibdev2netdev
`ibdev2netdev` prints a list of local devices mapped to network interfaces…
>>> ibdev2netdev
mlx5_0 port 1 ==> ib0 (Up)
# ...verbose
>>> ibdev2netdev -v
mlx5_0 (mt4123 - MCX653105A-ECAT) ConnectX-6 VPI adapter card, 100Gb/s #...
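For scripts that need the interface name of a given adapter, the non-verbose mapping lines can be parsed directly. This is a minimal Python sketch that assumes the `<device> port <n> ==> <netdev> (<state>)` layout shown above; `parse_mapping` is a hypothetical helper.

```python
import re

MAP_RE = re.compile(
    r'^(?P<device>\S+) port (?P<port>\d+) ==> (?P<netdev>\S+) \((?P<state>\w+)\)'
)


def parse_mapping(line):
    """Return device, port, netdev and state for one mapping line."""
    m = MAP_RE.match(line.strip())
    if m is None:
        return None
    entry = m.groupdict()
    entry["port"] = int(entry["port"])
    return entry


entry = parse_mapping("mlx5_0 port 1 ==> ib0 (Up)")
print(entry["device"], entry["netdev"], entry["state"])
```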
Port Counters
- Determine bottlenecks …`perfquery -x`
  - …option `-x`, `--extended` show all port counters
- Clear counters to remove historical data…
  - …option `-R` to reset port counters …option `-r` to reset after read
- Two types of counters: traffic & error [1]
Layer 1 (Physical)
- Symbol Errors …physical problems
  - 99% of these errors are hardware related (small numbers can be ignored)
  - …for example broken cables …dirt in the connector …cable bending
- Link Recovers …port training state
  - …port capabilities negotiated automatically …for example link speed
  - …“error on the line” …compatibility between switch & HCA (firmware compatibility)
- Related to Link Speed & Link Width …connection not at full speed
  - …check the adapter, cable & split cable configuration
- Link Down …training did not work…
  - …connection could not be established …failed connection (port flapping)
  - Note: Depending on the HCA configuration, a host reboot implies link down
Layer 2 (Link)
`Xmt` & `Rcv` traffic counters …service level specific
- Rcv Errors …CRC (checksum) errors
  - …issues with data integrity
  - …local buffer overruns, malformed packets (routing header)
- Xmit Wait …large numbers indicate congestion
  - …high congestion results in Xmt Discards
  - …packets to be transmitted get dropped (high congestion in the fabric)
  - …check port buffer, ingress overflow
  - …uses credit flow control to avoid loss of packets
  - …related to ExcessiveBufferOverrunErrors …too many flow control updates
- PortRcvSwitchRelayErrors …packets could not be forwarded by the switch
  - …or packets are waiting on the switch
  - Causes …wrong destination LID …QoS level mapping broken
- VL15Dropped …subnet management data
  - …(dedicated) virtual lane 15 …no flow control
  - …drops are not an issue …sends are repeated
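The interpretations above can be attached to counter values automatically. The Python sketch below assumes the dotted `Name:....value` layout commonly printed by `perfquery`, and it assumes a mapping between the counter labels used in these notes and `perfquery` field names; both the sample values and the `HINTS` table are illustrative, not taken from a real fabric.

```python
import re

# "SymbolErrorCounter:..............12" -> ("SymbolErrorCounter", "12")
COUNTER_RE = re.compile(r'^(\w+):\.*(\d+)$')

# Assumed mapping from perfquery field names to the notes above.
HINTS = {
    "SymbolErrorCounter": "physical problem (cable, connector, bending)",
    "LinkErrorRecoveryCounter": "link retraining (check firmware compatibility)",
    "LinkDownedCounter": "link went down (port flapping or host reboot)",
    "PortRcvErrors": "data integrity issue (CRC, buffer overrun, malformed packet)",
    "PortXmitDiscards": "packets dropped under high congestion",
    "PortRcvSwitchRelayErrors": "switch could not forward packets",
    "VL15Dropped": "dropped subnet management packets (usually not an issue)",
}


def parse_counters(text):
    """Return a dict of counter name to integer value."""
    counters = {}
    for line in text.splitlines():
        m = COUNTER_RE.match(line.strip())
        if m:
            counters[m.group(1)] = int(m.group(2))
    return counters


def report(counters):
    """Return hints for every known error counter that is non-zero."""
    return {
        name: HINTS[name]
        for name, value in counters.items()
        if value > 0 and name in HINTS
    }


sample = """\
# Port counters: Lid 21 port 1
SymbolErrorCounter:..............12
LinkDownedCounter:...............0
PortXmitDiscards:................3
"""
counters = parse_counters(sample)
for name, hint in report(counters).items():
    print(f"{name}={counters[name]}: {hint}")
```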
Monitor Counters
Debug counters fabric-wide with the following procedure…
- Reset all counters `ibdiagnet -pc`
- Wait 30~60 minutes, then `ibdiagnet -P all=1` …collect all counters with a threshold
- Check errors

`ibdiagnet -P all=1 --pm_pause_time 600`

The above allows observing production traffic in a specified time-frame
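The idea behind sampling over a time-frame is to look at counter increases rather than absolute values. The following Python sketch shows that calculation for two hypothetical snapshots taken 600 seconds apart; the counter values and the helper name `counter_deltas` are made up for illustration and are not produced by `ibdiagnet`.

```python
def counter_deltas(before, after, seconds):
    """Return the increase and per-second rate of every counter that grew."""
    deltas = {}
    for name, end in after.items():
        increase = end - before.get(name, 0)
        if increase > 0:
            deltas[name] = {
                "increase": increase,
                "per_second": increase / seconds,
            }
    return deltas


# Hypothetical snapshots of one port, taken 600 seconds apart.
first = {"SymbolErrorCounter": 0, "PortXmitWait": 1000, "LinkDownedCounter": 2}
second = {"SymbolErrorCounter": 3, "PortXmitWait": 61000, "LinkDownedCounter": 2}

deltas = counter_deltas(first, second, seconds=600)
for name, delta in deltas.items():
    print(name, delta["increase"], delta["per_second"])
```

Counters that did not change during the interval, such as `LinkDownedCounter` here, are omitted, which removes historical data from the picture.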
Bit Error Rate
Link Bit Error Rate (BER) is very important!
Bit Error Rate (BER) with correction mechanisms…
…raw BER does not include correction mechanisms:
- …FEC (Forward Error Correction)
- …PLR (Port Link Retransmission)
Goal: bit-errors as small as possible …\(10^{-15}\)
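To get a feeling for what such a BER means in practice, the expected time between bit errors can be computed from the link rate. This is a small Python sketch; the 100 Gb/s rate is an assumption chosen to match the adapter shown earlier on this page, and `seconds_between_bit_errors` is a hypothetical helper.

```python
def seconds_between_bit_errors(bit_rate, ber):
    """Expected seconds between bit errors at a given bit rate and BER."""
    return 1.0 / (bit_rate * ber)


rate = 100e9  # assumed link rate: 100 Gb/s
interval = seconds_between_bit_errors(rate, 1e-15)
print(f"about {interval / 3600:.1f} hours between bit errors")
```

At this rate and a BER of \(10^{-15}\), a link sees roughly one bit error every 10,000 seconds, so a link that is several orders of magnitude worse produces errors many times per second.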
Congestion
Recommended to create a congestion map by monitoring all ports!
Related counters are `XmitWait`, `XmitPkts` & `RcvPkts`
- `XmitWait` …packets in queue …“ticks” to measure waiting
- Note: Mixing generations (for example FDR & HDR) can lead to congestion

Pushback of flow control if congestion index is too high…
…congestion index by dividing the waiting by the transmitted packets
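The congestion index described above is a simple ratio, so a congestion map can be sketched in a few lines of Python. The port keys and counter values below are hypothetical, and `congestion_index` and `congestion_map` are illustrative names; at which index value flow control actually pushes back is not stated here.

```python
def congestion_index(xmit_wait, xmit_pkts):
    """Waiting ticks divided by transmitted packets (0.0 if nothing was sent)."""
    if xmit_pkts == 0:
        return 0.0
    return xmit_wait / xmit_pkts


def congestion_map(ports):
    """Return (port, index) pairs sorted from most to least congested."""
    indexed = [
        (port, congestion_index(c["XmitWait"], c["XmitPkts"]))
        for port, c in ports.items()
    ]
    return sorted(indexed, key=lambda item: item[1], reverse=True)


# Hypothetical counters keyed by (LID, port).
ports = {
    (647, 2): {"XmitWait": 500, "XmitPkts": 1000},
    (647, 21): {"XmitWait": 9000, "XmitPkts": 3000},
    (647, 23): {"XmitWait": 0, "XmitPkts": 0},
}
for port, index in congestion_map(ports):
    print(port, index)
```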
Footnotes
[1] Overview of Error Counters, OpenFabrics Alliance, https://www.openfabrics.org/mediawiki/index.php/Overview_of_Error_Counters#PortRcvSwitchRelayErrors