InfiniBand: Discover & Debug

HPC
Network
InfiniBand
Published

August 19, 2015

Modified

January 2, 2025

Fabric

Commands for discovering and debugging the fabric…

Command         Description
ibnetdiscover   …scans fabric sub-network …generates topology information
iblinkinfo      …list links in the fabric
ibnodes         …list of nodes in the fabric
ibhosts         …list channel adapters
ibportstate     …state of a given port
ibqueryerrors   …port error counters
ibroute         …display forwarding table
ibdiagnet       …complete fabric scan …all devices, ports, links, counters, etc.

ibnetdiscover

Subnet discovery …outputs a human-readable topology file

List…

  • -l connected nodes
  • -H connected HCAs
  • -S connected switches
# switches...
>>> ibnetdiscover -S
Switch   : 0x7cfe90030097c8f0 ports 36 devid 0xc738 vendid 0x2c9 "SwitchX -  Mellanox Technologies"
#...

# host channel adapters
>>> ibnetdiscover -H
Ca       : 0x08c0eb0300af4fa2 ports 1 devid 0x101b vendid 0x2c9 "... mlx5_0"
Ca       : 0xe41d2d0300dff630 ports 2 devid 0x1003 vendid 0x2c9 "... mlx4_0"
Ca       : 0xe41d2d0300e013d0 ports 2 devid 0x1003 vendid 0x2c9 "... mlx4_0"
#...

Output by columns…

  • …GUID
  • …number of ports
  • devid device id …hexadecimal
  • vendid vendor ID …hexadecimal
  • "..." description
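The column layout above lends itself to simple post-processing. The following is a minimal sketch, assuming the list format shown in the sample output (type, GUID, `ports N`, then a quoted description); the function name `parse_ibnetdiscover_list` is hypothetical and not part of the InfiniBand tooling.

```shell
# Sketch: reduce "ibnetdiscover -H" / "-S" list lines to GUID<TAB>ports<TAB>description.
# Assumed line layout: <type> : <GUID> ports <N> devid <X> vendid <Y> "<description>"
parse_ibnetdiscover_list() {
    awk -F'"' '
        /^(Ca|Switch)[ \t]*:/ {
            # $1 holds everything before the quoted description, $2 the description
            n = split($1, f, " ")
            guid = ""; ports = ""
            for (i = 1; i <= n; i++) {
                # the first hexadecimal token is the GUID (devid and vendid follow later)
                if (guid == "" && f[i] ~ /^0x[0-9a-fA-F]+$/) guid = f[i]
                if (f[i] == "ports") ports = f[i + 1]
            }
            printf "%s\t%s\t%s\n", guid, ports, $2
        }'
}
```

Usage: `ibnetdiscover -H | parse_ibnetdiscover_list`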

iblinkinfo

Reports link info for all links in the fabric…

# ...show switch with GUID
iblinkinfo -S 0x1070fd030003af98

# ...show only the next switch on the node up-link
iblinkinfo -n 1 --switches-only
  • …each switch with GUID is listed with…
    • …one port per line…
    • …left switch LID and port
    • …middle after == …connection width, speed and state
  • …right of ==> …down-link device…
    • …either a switch …or node HCA
    • …LID, port, node name and device type
# switch GUID ...name (if available)  ...type and model
Switch: 0x1070fd030003af98 Quantum Mellanox Technologies:
   647    1[  ] ==(                Down/ Polling)         ==>             [  ] "" ( )
   647    2[  ] ==( 2X        53.125 Gbps Active/  LinkUp)==>      23    1[  ] "localhost mlx5_0" ( )
#  LID    port     width ...speed ...physical state     down-link  LID   port   name ..device

List active ports on a specific switch…

>>> iblinkinfo -S 0x1070fd030003af98 -l | tr -s ' ' | cut -d'"' -f3- | grep -v -i down
 647 2[ ] ==( 2X 53.125 Gbps Active/ LinkUp)==> 0xe8ebd30300a6115e 23 1[ ] "localhost mlx5_0" ( )
 647 21[ ] ==( 4X 53.125 Gbps Active/ LinkUp)==> 0x1070fd03000f4b72 24 26[ ] "Quantum Mellanox" #...
 647 23[ ] ==( 4X 53.125 Gbps Active/ LinkUp)==> 0x1070fd03000f4a92 16 14[ ] "Quantum Mellanox" #...
#...
 647 80[ ] ==( 2X 53.125 Gbps Active/ LinkUp)==> 0xe8ebd30300a61cca 22 1[ ] "lxbk1149" ( )
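To get a quick overview of how many active links run at which width, the line-oriented output can be counted. This is a minimal sketch that assumes the `==( <width> <speed> Gbps Active/ LinkUp)==>` layout shown above; the function name `count_link_widths` is hypothetical.

```shell
# Sketch: count active links per width (for example 2X, 4X) from iblinkinfo output.
# Assumes the width token directly follows the "==(" token, as in the sample above.
count_link_widths() {
    awk '
        /LinkUp/ {
            for (i = 1; i < NF; i++)
                if ($i == "==(") { width[$(i + 1)]++; break }
        }
        END { for (w in width) printf "%s %d\n", w, width[w] }' | sort
}
```

Usage: `iblinkinfo -S 0x1070fd030003af98 -l | count_link_widths`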

ibdiagnet

ibdiagnet reports trouble in a form like:

...
Link at the end of direct route "1,1,19,10,9,17"
     Errors:
           -error noInfo -command {smNodeInfoMad getByDr {1 1 19 10 9 17}}
Errors types explanation:
     "noInfo"  : the link was ACTIVE during discovery but, sending MADs across it
                   failed 4 consecutive times
...

Use ibdiagpath to print all GUIDs on the route

>>> ibdiagpath -d 1,1,19,10,9,17
...
-I- From: lid=0x0216 guid=0x7cfe90030097cef0 dev=51000 Port=17

…if needed, use archived output of ibnetdiscover to identify the corresponding host.

Otherwise, check the end of the cable connected to the identified switch port.
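The direct routes reported by ibdiagnet can be pulled out of the report text and handed to ibdiagpath one by one. A minimal sketch, assuming the `direct route "…"` wording from the sample report above; the function name `extract_direct_routes` is hypothetical.

```shell
# Sketch: print each unique direct route mentioned in ibdiagnet-style messages.
# Assumes the message wording: ... direct route "1,1,19,10,9,17"
extract_direct_routes() {
    sed -n 's/.*direct route "\([0-9,]*\)".*/\1/p' | sort -u
}
```

Usage, assuming the report was saved to a file (the file name `ibdiagnet.log` is a placeholder): `extract_direct_routes < ibdiagnet.log | while read -r route; do ibdiagpath -d "$route"; done`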

mlxconfig

mlxconfig – Changing Device Configuration Tool

Query switch using its LID…

  • query supported configurations after reboot
  • …option -e show default and current configurations
>>> mlxconfig -d lid-0x287 -e query
Device #1:
----------

Device type:    Quantum         
Name:           MQM8790-HS2X_Ax 
Description:    Mellanox Quantum(TM) HDR InfiniBand Switch #[...]
Device:         lid-0x287       

Configurations:              Default              Current              Next Boot
*        SPLIT_MODE          NO_SPLIT_SUPPORT(0)  NO_SPLIT_SUPPORT(0)  SPLIT_2X(1)
         DISABLE_AUTO_SPLIT  ENABLE_AUTO_SPLIT(0) ENABLE_AUTO_SPLIT(0) ENABLE_AUTO_SPLIT(0)
         SPLIT_PORT          Array[1..64]         Array[1..64]         Array[1..64]
         GB_VECTOR_LENGTH    0                    0                    0
         GB_UPDATE_MODE      ALL(0)               ALL(0)               ALL(0)
         GB_VECTOR           Array[0..7]          Array[0..7]          Array[0..7]

The '*' shows parameters with next value different from default/current value.
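Since pending changes are flagged with a leading `*`, they can be filtered from the query output. A minimal sketch, assuming the three-column layout printed with `-e` and values without embedded spaces; the function name `pending_mlxconfig_changes` is hypothetical.

```shell
# Sketch: list parameters whose next-boot value differs, as flagged by "*".
# Assumes columns: * <name> <default> <current> <next boot>
pending_mlxconfig_changes() {
    awk '$1 == "*" { print $2 ": " $(NF - 1) " -> " $NF }'
}
```

Usage: `mlxconfig -d lid-0x287 -e query | pending_mlxconfig_changes`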

show_confs displays information about all configurations…

>>> mlxconfig -d lid-0x287 show_confs
# [...]
SWITCH CONF:
  DISABLE_AUTO_SPLIT=<DISABLE_AUTO_SPLIT|ENABLE_AUTO_SPLIT>Disable Auto-Split:
    0x0: ENABLE_AUTO_SPLIT - if NV is split OR if cable is split then port is split.
    0x1: DISABLE_AUTO_SPLIT - if NV is split then port is split # [...]
  SPLIT_MODE=<NO_SPLIT_SUPPORT|SPLIT_2X>  Split ports mode of operation configured # [...]
    0x0: NO_SPLIT_SUPPORT
    0x1: SPLIT_2X - device supports splitting ports to two 2X ports
# [...]

Split Cables

Changes require a switch reboot!

Split a port in a remotely managed switch…

  • …only for Quantum based switch systems
  • …single physical quad-lane QSFP port is divided into 2 dual-lane ports
  • …all system ports may be split into 2-lane ports
  • …splitting a port changes the notation of that port
    • …from x/y to x/y/z
    • z indicating the number of the resulting sub-physical port (1,2)
  • …each sub-physical port is then handled as an individual port

Enable port splits…

# enable split mode support
mlxconfig -d <device> set SPLIT_MODE=1

# split ports....
mlxconfig -d <device> set SPLIT_PORT[<port_num>/<port_range>]=1
  • SPLIT_MODE = SPLIT_2X(1) enable splits…
    • …should be equivalent to split-ready configuration
    • …on managed switches …system profile ib split-ready
  • SPLIT_PORT[1..64]=1 …split for all ports…
    • …should be equivalent to changing the module type to a split mode…
    • …on managed switches …module-type qsfp-split-2

Query the configuration…

>>> mlxconfig -d lid-0x287 -e query SPLIT_PORT[1..64]

Device #1:
----------

Device type:    Quantum
Name:           MQM8790-HS2X_Ax
Description:    Mellanox Quantum(TM) HDR InfiniBand Switch #[...]
Device:         lid-0x287

Configurations:           Default         Current         Next Boot
         SPLIT_PORT[1]    NO_SPLIT(0)     NO_SPLIT(0)     NO_SPLIT(0)
         SPLIT_PORT[2]    NO_SPLIT(0)     NO_SPLIT(0)     NO_SPLIT(0)
         SPLIT_PORT[3]    NO_SPLIT(0)     NO_SPLIT(0)     NO_SPLIT(0)
         SPLIT_PORT[4]    NO_SPLIT(0)     NO_SPLIT(0)     NO_SPLIT(0)
#[...] 
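With 64 ports the query output gets long, so a per-value summary of the Next Boot column can help to confirm that a change was registered. A minimal sketch, assuming the column layout shown above; the function name `summarize_split_ports` is hypothetical.

```shell
# Sketch: count SPLIT_PORT[n] entries per Next Boot value (last column).
# The port name is the first field, or the second if the line is flagged with "*".
summarize_split_ports() {
    awk '
        {
            for (i = 1; i <= 2; i++)
                if ($i ~ /^SPLIT_PORT\[[0-9]+\]$/) { count[$NF]++; break }
        }
        END { for (v in count) printf "%s %d\n", v, count[v] }' | sort
}
```

Usage: `mlxconfig -d lid-0x287 -e query SPLIT_PORT[1..64] | summarize_split_ports`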

Adapters (HCAs)

ibstat

ibstat without arguments lists all local adapters with state information

# list channel adapters (CAs)
>>> ibstat -l
mlx5_0

# port GUIDs...
>>> ibstat -p
0x08c0eb0300f82cbc

Operational State: Active & Physical state: LinkUp

>>> ibstat
CA 'mlx5_0'
# [...]
    Port 1:
        State: Active
        Physical state: LinkUp
        Rate: 10
# [...]
  • Physical state …(of the cable)
    • Polling …no connection …check cable (…and switch)
    • LinkUp …physical uplink connection (…does not mean it’s configured and ready to send data)
  • State (…of the HCA)
    • Down …no physical connection
    • Initializing …physical uplink connection …not discovered by the subnet manager
    • Active …port in a normal operational state
  • Rate
    • …speed at which the port is operating
    • …matches speed of the slowest device on the network path

ibstatus displays similar information (but belongs to outdated tooling)

ibaddr

Display the LID (and range) as well as the GID address of a port

# local GID and LID
>>> ibaddr
GID fe80::e8eb:d303:a6:1856 LID start 0x15 end 0x15

# LID (in decimal) of the local adapter
>>> ibaddr -L
LID start 21 end 21

Used for address conversion between GIDs and LIDs

# GID of given LID
>>> ibaddr -g 0x22e
GID fe80::8c0:eb03:f8:2cbc 

# LID (range) for a GUID
>>> ibaddr -G 0x1070fd030003af98 -L
LID start 647 end 647
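The tools mix hexadecimal and decimal LIDs (ibaddr prints 0x15 by default and 21 with -L; mlxconfig addresses the switch as lid-0x287 while iblinkinfo shows 647). A small sketch for converting between the two with the shell's printf; the function names are hypothetical helpers.

```shell
# Sketch: convert LIDs between hexadecimal and decimal notation.
lid_to_dec() { printf '%d\n' "$1"; }    # accepts 0x-prefixed input
lid_to_hex() { printf '0x%x\n' "$1"; }  # accepts decimal input

lid_to_dec 0x287   # prints 647
lid_to_hex 21      # prints 0x15
```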

iblinkinfo

Identify the switch a node is connected to …

# ..GUID
>>> iblinkinfo -n 1 | grep -i switch | cut -d' ' -f2
0x1070fd030003af98

# ..LID
>>> ibaddr -G $(iblinkinfo -n 1 | grep -i switch | cut -d' ' -f2) -L
LID start 647 end 647

ibdev2netdev

ibdev2netdev prints a list of local devices mapped to network interfaces…

>>> ibdev2netdev 
mlx5_0 port 1 ==> ib0 (Up)

# ...verbose
>>> ibdev2netdev -v
mlx5_0 (mt4123 - MCX653105A-ECAT) ConnectX-6 VPI adapter card, 100Gb/s #... 
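For scripting, the non-verbose mapping can be reduced to the names of the interfaces that are up. A minimal sketch, assuming the `<device> port <n> ==> <interface> (<state>)` layout from the first example; the function name `up_ib_interfaces` is hypothetical.

```shell
# Sketch: print network interface names whose mapping line ends in "(Up)".
up_ib_interfaces() {
    awk '$4 == "==>" && $6 == "(Up)" { print $5 }'
}
```

Usage: `ibdev2netdev | up_ib_interfaces`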

Port Counters

Determine bottlenecks …perfquery -x

  • …option -x,--extended show all port counters
  • Clear counters to remove historical data…
  • …option -R to reset port counters …option -r to reset after read
  • Two types of counters: traffic & error

Layer 1 (Physical)

  • Symbol Errors …physical problems
    • 99% of these errors are hardware related (small numbers can be ignored)
    • …for example broken cables …dirt in the connector …cable bending
  • Link Recovers …port training state
    • …port capabilities negotiated automatically …for example link speed
    • …“error on the line” …compatibility between switch & HCA (firmware compatibility)
    • Related to Link Speed & Link Width …connection not at full speed
    • …check the adapter, cable & split cable configuration
  • Link Down …training did not work…
    • …connection could not be established …failed connection (port flapping)
    • Note: Depending on the HCA configuration host reboot implies link down

Layer 2 (Link)

Xmt & Rcv traffic counters …service level specific

  • Rcv Errors …CRC (checksum) errors
    • …issues with data integrity
    • …local buffer overruns, malformed packets (routing header)
  • Xmit Wait …large numbers indicate congestion
    • …high congestion results in Xmt Discards
    • …packets to be transmitted get dropped (high congestion in the fabric)
    • …check port buffer, ingress overflow
    • …uses credit flow control to avoid loss of packets
    • …related to ExcessiveBufferOverrunErrors …too many flow control updates
  • PortRcvSwitchRelayErrors …packets could not be forwarded by the switch
    • …or packets are waiting on the switch
    • Causes …wrong destination LID …QoS level mapping broken
  • VL15Dropped …subnet management data
    • …(dedicated) virtual lane 15 …no flow control
    • …drops are not an issue …sends are repeated
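To focus on the counters discussed above, the perfquery output can be filtered for non-zero error and congestion counters. A minimal sketch, assuming the `Name:.....value` line format that perfquery typically prints; the name pattern and the function name `nonzero_problem_counters` are assumptions and may need adjusting for a given tool version.

```shell
# Sketch: print error/congestion counters with a value greater than zero.
# Assumed input format: "SymbolErrorCounter:..............0"
nonzero_problem_counters() {
    awk -F'[:.]+' '
        /^#/ { next }    # skip the header comment line
        $1 ~ /(Error|Discard|Downed|Dropped|Wait)/ && $2 + 0 > 0 {
            printf "%s %s\n", $1, $2
        }'
}
```

Usage: `perfquery -x | nonzero_problem_counters`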

Monitor Counters

Debug counters fabric wide with following procedure…

  1. Reset all counters …ibdiagnet -pc
  2. Wait 30~60 minutes …ibdiagnet -P all=1 collects all counters with a threshold
  3. Check errors …ibdiagnet -P all=1 --pm_pause_time 600

The procedure above makes it possible to observe production traffic over a specified time frame
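The reset, wait and collect sequence can be wrapped in a small script. A minimal sketch using only the ibdiagnet options named in the steps above; the function name `monitor_fabric_counters` and the `RUN` dry-run variable are hypothetical conveniences.

```shell
# Sketch: reset counters, wait, then collect counters above the threshold.
# Usage: monitor_fabric_counters [minutes]   (default: 30)
# Set RUN=echo to print the commands instead of executing them.
monitor_fabric_counters() {
    wait_minutes=${1:-30}
    run=${RUN:-}
    $run ibdiagnet -pc                  # 1. reset all counters
    $run sleep $((wait_minutes * 60))   # 2. let production traffic accumulate
    $run ibdiagnet -P all=1             # 3. collect counters with a threshold
}
```

A dry run with `RUN=echo monitor_fabric_counters 60` prints the three commands without touching the fabric.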

Bit Error Rate

The link Bit Error Rate (BER) is very important!

The BER can be given with correction mechanisms applied…

…raw BER does not include the correction mechanisms:

  • FEC (Forward Error Correction)
  • PLR (Port Link Retransmission)

Goal: BER as small as possible …\(10^{-15}\)
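
As a rough worked example, assume a fully utilized link at \(R = 200\) Gb/s (the nominal HDR 4X rate; an illustrative assumption, not a value from the notes above). A BER of \(10^{-15}\) then corresponds to an average time between bit errors of

\[ t = \frac{1}{\mathrm{BER} \cdot R} = \frac{1}{10^{-15} \cdot 2 \times 10^{11}\,\mathrm{s}^{-1}} = 5000\,\mathrm{s} \approx 83\,\mathrm{min} \]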

Congestion

Recommended to create a congestion map by monitoring all ports!

Related counters are XmitWait, XmitPkts & RcvPkts

  • XmitWait …packets in queue …“ticks” to measure waiting
  • Note: Mixing generations (for example FDR & HDR) can lead to congestion

Pushback of flow control if congestion index is too high…

Congestion index …XmitWait ticks divided by the number of transmitted packets (XmitPkts)
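
The division can be done with awk, since the shell itself only handles integers. A minimal sketch that takes the XmitWait and XmitPkts values of one port as arguments; the function name `congestion_index` is hypothetical, and what counts as "too high" is not specified here.

```shell
# Sketch: congestion index = XmitWait ticks / transmitted packets.
# Usage: congestion_index <XmitWait> <XmitPkts>
congestion_index() {
    awk -v w="$1" -v p="$2" 'BEGIN {
        if (p <= 0) { print "n/a"; exit }   # no packets sent, index undefined
        printf "%.4f\n", w / p
    }'
}

congestion_index 500 1000   # prints 0.5000
```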