InfiniBand: Subnet-Manager (SM)

HPC
Network
InfiniBand
Published: August 19, 2015
Modified: January 3, 2025

Software defined network (SDN)

Only one master SM allowed per subnet

Install opensm packages …start the subnet manager…

dnf install -y opensm
systemctl enable --now opensm

Initialization

…includes the following steps:

  • Subnet discovery (…after wakeup)
    • …traverse the network beginning with close neighbors
    • …Subnet Management Packets (SMP) to initiate “conversation”
  • Information gathering…
    • …find all links/switches/hosts on all connected ports to map topology
    • …Subnet Manager Query Message: direct routed information gathering for node/port information
    • …Subnet Manager Agent (SMA) required on each node
  • LIDs assignment
  • Paths establishment
    • …best path calculation to identify Shortest Path Table (Min-Hop)
    • …calculate Linear Forwarding Table (LFT)
  • Ports and switch configuration
  • Subnet activation

Topology Changes

SM monitors the fabric for topology changes…

  • Light Sweep, every 10 seconds, requests node/port information
    • …port status changes
    • …search for other SMs, change priority
  • Heavy Sweep triggered by light sweep changes
    • …fabric discovery from scratch
    • …can be triggered by an IB trap from a status change on a switch
    • …edge/host port state change impact is configurable
  • SM failover & handover with SMInfo protocol
    • …election by highest priority (0-15), then lowest GUID
    • …heartbeat for stand-by SM polling the master
    • …SMInfo attributes exchange information during discovery/polling to synchronize
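The election rule above (highest priority wins, ties go to the lowest GUID) can be sketched with `sort`. This is only an illustration: the function name `elect_master` and the GUIDs are made up, the input format (`priority guid` per line, GUIDs as fixed-width lowercase hex) is an assumption, and the SM state that a real election also considers is ignored.

```shell
# Pick the master from "priority guid" lines on stdin (hypothetical helper).
# Sort by priority, highest first, then by GUID, lowest first.
# Fixed-width lowercase hex GUIDs sort correctly as plain strings.
elect_master() {
    LC_ALL=C sort -k1,1nr -k2,2 | head -n 1
}

# Three candidate SMs (made-up GUIDs); two share the top priority.
printf '%s\n' \
    '10 0x0002c90300a1b2c3' \
    '15 0x0002c90300a1b2d4' \
    '15 0x0002c90300a1b2c9' | elect_master
# prints: 15 0x0002c90300a1b2c9
```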

Configuration

Configuration in /etc/rdma/opensm.conf

  • opensm daemon…
    • -c $path …create configuration file if missing
    • -p $prio …change priority …when stopped!
    • -R $engine …change routing algorithm
    • /var/log/opensm.log …for logging
  • sminfo …show master subnet manager LID, GUID, priority
  • saquery -s …show all subnet managers
ibdiagnet -r                        # check for routing issues
smpquery portinfo $lid $port        # query port information
smpquery nodeinfo $lid              # query node information
smpquery -D nodeinfo $lid           # ^ using direct route
ibroute $lid                        # show switching table, LIDs in hex

Partitions

Why use Partitions?

  • Different partitions for customers/applications
    • Prioritize traffic of latency-critical applications
    • Isolate traffic to a back-end storage system
  • Allows fabric partitioning for security & QoS
    • Secure the subnet-manager configuration…
    • …HCAs become partial members …cannot configure the SM
  • Similar to VLAN technology in Ethernet networks

Each partition has an identifier named PKEY

  • PKEY enforcement done by link layer (layer 2) at the receiving side (HCA)
    • …separation of physical connections
    • …each packet carries a PKEY …derived from the PKEY index
  • PKEYs are 16-bit integers configured in the SM port PKEY table…
    • 0x7FFF default partition …includes SM traffic (aka management packets)
    • …example PKEYs 0x0002, 0x0003, etc.
  • Partition membership (security mechanism) …full vs partial membership
    • MSB (most significant bit) defines nature of membership (set = full, clear = partial)
    • 0x8002 full membership …0x0003 partial membership
    • …the remaining 15 bits correspond to the PKEY for a partition
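The membership bit can be checked with plain shell arithmetic. A minimal sketch, assuming PKEYs are passed as hex strings; the helper names (`pkey_base`, `pkey_full`, `pkey_membership`) are hypothetical and not part of any InfiniBand tool.

```shell
# Lower 15 bits: the base PKEY that identifies the partition.
pkey_base() { printf '0x%04x\n' $(( $1 & 0x7fff )); }

# Set the most significant bit: the full-membership form of a PKEY.
pkey_full() { printf '0x%04x\n' $(( $1 | 0x8000 )); }

# Report the membership type encoded in the most significant bit.
pkey_membership() {
    if [ $(( $1 & 0x8000 )) -ne 0 ]; then echo full; else echo partial; fi
}

pkey_membership 0x8002   # full
pkey_membership 0x0003   # partial
pkey_base 0x8002         # 0x0002
pkey_full 0x0002         # 0x8002
```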

Configuration in partition.conf

  • …set a partition name to simplify logging!
  • …associate HCA GUIDs to a PKEY (15 bit) …set IPoIB flag
  • …set a default partition if a node is member of multiple partitions
  • Diagnostic tools:
    • smpquery PkeyTable on a switch to check ports
    • ibdiagnet.pkey files to list per node GUID
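As an illustration of the items above, the helper below prints one partition definition in the layout OpenSM's partition file uses, as far as I know it: `<name>=<pkey>, ipoib : <port GUID>=<membership>, … ;`, with `full` or `limited` as membership keywords (`limited` being OpenSM's term for partial membership). The function name, partition name, and GUIDs are made up for the example; treat it as a sketch and check the generated line against the OpenSM partition documentation before using it.

```shell
# Print one partition definition line (hypothetical helper).
# Usage: partition_line <name> <pkey> <guid>=<full|limited>...
partition_line() {
    name=$1
    pkey=$2
    shift 2
    members=
    for spec in "$@"; do
        # Join the port specifiers with ", ".
        members="${members:+$members, }$spec"
    done
    printf '%s=%s, ipoib : %s ;\n' "$name" "$pkey" "$members"
}

# Example with made-up port GUIDs.
partition_line storage 0x0002 \
    0x0002c90300a1b2c3=full \
    0x0002c90300a1b2d4=limited
# prints: storage=0x0002, ipoib : 0x0002c90300a1b2c3=full, 0x0002c90300a1b2d4=limited ;
```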

Multiple partitions require separate IP sub-networks to use IPoIB…

  • …Linux network child interfaces ib0.xxxx
  • …write the PKEY to /sys/class/net/ib0/create_child to create ib0.xxxx
  • M_KEY authentication between SM and fabric…
    • …deployed by the SM to each node…
    • …avoid fabric discovery by hostile SM
  • SM_KEY authenticates an SM to the master SM…
    • …configuration in opensm.conf
    • …hand-over control to another SM
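The IPoIB child-interface step from the list above can be sketched as follows. It assumes the parent interface is `ib0` and that the PKEY is written to the parent's `create_child` file; both function names are hypothetical. The second helper is only defined here, not called, because it needs root privileges and a loaded IPoIB driver.

```shell
# Name of the child interface: <parent>.<PKEY as four hex digits>.
ipoib_child_name() { printf '%s.%04x\n' "$1" $(( $2 )); }

# Ask the kernel to create the child interface for a PKEY.
# Usage: ipoib_create_child <parent> <pkey>
ipoib_create_child() { echo "$2" > "/sys/class/net/$1/create_child"; }

ipoib_child_name ib0 0x8002   # ib0.8002
```

Each child interface then gets an address from the IP sub-network assigned to its partition.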

Quality of Service

Why use QoS?

  • Support applications sensitive to latency, for example…
  • …configure different service levels for Lustre & MPI

QoS (Quality of Service) requires use of partitions…

  • …configure traffic priorities …control congestion
  • …only 2 levels of priority
  • Service Level (SL)
    • …field in LRH (local routing header) …each packet carries one of 16 SLs
    • …the node's communication manager negotiates with the SM
  • Virtual Lane up to 7
    • …SL to VL mapping configured by the SM (various limits, sets priority levels)
    • …VL arbiter configures priority/weight …either high or low
    • …the VL arbitration table should only have one high priority lane

QoS enabled in opensm.conf …requires restart of opensm daemon

  • …do not re-configure in production!
  • …tuning in /etc/opensm/qos-policy.conf
  • qos_vlarb_low
    • …VL range 0-14 (practical 7) weight range 0-255
    • …example 0:64,1:128 …notation <VL>:<weight>, always provide a weight!
  • qos_high_limit
    • …ratio of high- over low-priority packets
    • 0 single packet …255 unbound (low prio. VLs may be starved)
    • …use default if possible
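The `<VL>:<weight>` notation can be checked before it goes into the configuration. A minimal sketch using the ranges listed above (VL 0-14, weight 0-255); `check_vlarb` is a hypothetical helper, not an OpenSM tool, and it only validates the syntax and ranges.

```shell
# Return 0 if every entry is <VL>:<weight> with VL 0-14 and weight 0-255.
check_vlarb() {
    old_ifs=$IFS
    IFS=,
    set -- $1                    # split the list on commas
    IFS=$old_ifs
    [ $# -gt 0 ] || return 1     # empty list
    for entry in "$@"; do
        case $entry in
            *:*) vl=${entry%%:*}; weight=${entry#*:} ;;
            *)   return 1 ;;     # weight missing
        esac
        for n in "$vl" "$weight"; do
            case $n in ''|*[!0-9]*) return 1 ;; esac   # digits only
        done
        [ "$vl" -le 14 ] && [ "$weight" -le 255 ] || return 1
    done
}

check_vlarb 0:64,1:128 && echo valid     # valid
check_vlarb 0:64,1 || echo invalid       # invalid (weight missing)
```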

Verify with smpquery vlarb & smpquery sl2vl

perfquery -X displays counters for service level data

ULP (Upper Layer Protocol) …for example IPoIB

  • …QoS policy to prioritize ULP …configured in qos-policy.conf
  • Examples:
    • …MPI could be ULP/application with service ID (or PKEY)
    • …Lustre could use a service ID …targeting port GUIDs
    • …giving priority to MDS over OSTs

Congestion Control

…solves the following two issues:

  1. Head of queue blocking
    • …use QoS for performance isolation of applications
    • …avoid performance degradation between multiple applications
  2. Parking lot effect
    • …link saturation over multiple hops
    • …use rate limiting & CNP (Congestion Notification Packets)