InfiniBand: Subnet-Manager (SM)
HPC
Network
InfiniBand
Software defined network (SDN)
- …configures and maintains fabric operations
- …central repository of all information
- …configures switch forwarding tables
Only one master SM allowed per subnet
- …can run on any server (or a managed switch on small fabrics)
- …master-slave setup for high-availability
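A minimal sketch of such a master/standby setup, assuming two servers with opensm installed; the -p priority flag is described in the Configuration section below:
opensm -B -p 15    # primary server: run as daemon, highest priority, wins the master election
opensm -B -p 10    # backup server: lower priority, stays in standby until the master fails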
Install opensm packages …start the subnet manager…
dnf install -y opensm
systemctl enable --now opensm
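A quick verification, assuming the infiniband-diags tools are installed as well:
systemctl status opensm    # service should be active (running)
sminfo                     # prints the master SM's LID, GUID, priority and state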
Initialization
…include following steps:
- Subnet discovery (…after wakeup)
- …traverse the network beginning with close neighbors
- …Subnet Management Packets (SMP) to initiate the “conversation”
- Information gathering…
- …find all links/switches/hosts on all connected ports to map topology
- …Subnet Management query messages: directed-route gathering of node/port information
- …Subnet Manager Agent (SMA) required on each node
- LID assignment
- Path establishment
- …best path calculation to identify the shortest path table (Min-Hop)
- …calculate the Linear Forwarding Tables (LFT), see the command sketch after this list
- Port and switch configuration
- Subnet activation
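The same steps can be reproduced by hand with the standard diagnostic tools (a hedged sketch; assumes infiniband-diags and a running SM):
ibnetdiscover              # walk the fabric and print the discovered topology
smpquery -D nodeinfo 0     # directed-route SMP to the local node (route "0")
ibroute $switch_lid        # dump a switch's linear forwarding table (LIDs in hex)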
Topology Changes
SM monitors the fabric for topology changes…
- …Light Sweep, every 10 sec, requests node/port information (example configuration after this list)
- …port status changes
- …search for other SMs, change of priority
- …Heavy Sweep triggered by light sweep changes
- …fabric discovery from scratch
- …can be triggered by an IB trap from a status change on a switch
- …edge/host port state change impact is configurable
SM failover & handover with the SMInfo protocol
- …election by priority (0-15), ties broken by lower GUID
- …heartbeat: the stand-by SM polls the master
- …SMInfo attributes exchange information during discovery/polling to synchronize
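A hedged opensm.conf sketch for the sweep and failover behaviour above (option names as found in a default configuration dump; values are examples):
sweep_interval 10               # light sweep period in seconds, 0 disables sweeping
force_heavy_sweep FALSE         # TRUE forces a full re-discovery on every light sweep
sminfo_polling_timeout 10000    # interval (ms) at which a stand-by SM polls the master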
Configuration
Configuration in /etc/rdma/opensm.conf
…
opensm daemon…
- -c $path …create configuration file if missing
- -p $prio …change priority …when stopped!
- -R $engine …change routing algorithm
- /var/log/opensm.log …for logging
sminfo …show master subnet manager LID, GUID, priority
saquery -s …show all subnet managers
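Hedged examples of the options above (ftree is assumed to fit the fabric; other routing engines such as minhop or updn exist):
opensm -c /etc/opensm/opensm.conf    # write a default configuration file to this path
opensm -p 14 -R ftree                # start with priority 14 and the fat-tree routing engine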
ibdiagnet -r # check for routing issues
smpquery portinfo $lid $port # query port information
smpquery nodeinfo $lid # query node information
smpquery -D nodeinfo $lid # ^ using direct route
ibroute $lid # show switching table, LIDs in hex
Partitions
Why use Partitions?
- Different partitions for customers/applications
- Prioritize traffic of latency-critical applications
- Isolate traffic to a back-end storage system
- Allows fabric partitioning for security & QoS
- Secure the subnet-manager configuration…
- …HCAs become partial members …cannot configure the SM
- Similar to VLAN technology in Ethernet networks
Each partition has an identifier named PKEY
- PKEY enforcement done by link layer (layer 2) at the receiving side (HCA)
- …logical separation without separate physical connections
- …each packet carries a PKEY …derived from the PKEY index
- PKEYs are 16-bit integers configured in the SM port PKEY table…
- …0x7FFF default partition …includes SM traffic (aka management packets)
- …example PKEYs 0x0002, 0x0003, etc.
- Partition membership (security mechanism) …full vs. partial membership
- …MSB (most significant bit) defines the nature of membership
- …0x8002 full membership, 0x0003 partial membership
- …lsb (the other 15 bits) corresponds to the PKEY of a partition
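A small worked example of the membership bit in shell arithmetic (PKEY 0x0002 as in the text):
printf '0x%04x\n' $(( 0x0002 | 0x8000 ))   # -> 0x8002: full membership (MSB set)
printf '0x%04x\n' $(( 0x8002 & 0x7fff ))   # -> 0x0002: the 15-bit partition key itself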
Configuration in partition.conf
- …set a partition name to simplify logging!
- …associate HCA GUIDs to a PKEY (15 bit) …set IPoIB flag
- …set a default partition if a node is member of multiple partitions
- Diagnostic tools:
- …smpquery PkeyTable on a switch to check ports
- …ibdiagnet .pkey files to list membership per node GUID
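A hedged sketch of a partition file in opensm's syntax; the partition name, PKEY and port GUID are made up, and the exact file path depends on how opensm is packaged:
Default=0x7fff, ipoib, defmember=full : ALL, SELF ;
storage=0x0002, ipoib : 0x0002c90300a1b2c3=full, ALL=limited ;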
Multiple partitions require separate IP sub-networks when using IPoIB…
- …Linux network child interfaces ib0.xxxx
- …write the PKEY to /sys/class/net/ib0/create_child on the parent device to create the child interface
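A minimal sketch for PKEY 0x8002 (the address and prefix are made-up examples); the child interface comes up as ib0.8002:
echo 0x8002 > /sys/class/net/ib0/create_child    # parent device ib0 creates child ib0.8002
ip addr add 192.168.2.10/24 dev ib0.8002         # example address on the partition's subnet
ip link set ib0.8002 up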
M_KEY …authentication between SM and fabric…
- …deployed by the SM to each node…
- …avoid fabric discovery by a hostile SM
SM_KEY …authenticates an SM to the master SM…
- …configuration in opensm.conf
- …hand-over control to another SM
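A hedged opensm.conf sketch for both keys (the key values are placeholders, not recommendations):
m_key 0x0000000000001234     # M_KEY pushed to every port during initialization
m_key_lease_period 60        # lease time in seconds, 0 means the M_KEY never expires
sm_key 0x0000000000005678    # SM_KEY checked when SMs authenticate to the master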
Quality of Service
Why use QoS?
- Support applications sensitive to latency, for example…
- …configure different service levels for Lustre & MPI
QoS (Quality of Service) requires use of partitions…
- …configure traffic priorities …control congestion
- …only 2 levels of priority
- Service Level (SL)
- …field in the LRH (local routing header) …packets can use 16 SLs
- …the node's communication manager negotiates the SL with the SM
- Virtual Lanes (VL), up to VL7
- …SL to VL mapping configured by the SM (various limits, sets priority levels)
- …VL arbiter configures priority/weight …either high or low
- …the VL arbitration table should only have one high-priority lane
QoS enabled in opensm.conf …requires restart of the opensm daemon
- …do not re-configure in production!
- …tuning in /etc/opensm/qos-policy.conf
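A hedged opensm.conf sketch for enabling QoS and pointing at the policy file mentioned above:
qos TRUE
qos_policy_file /etc/opensm/qos-policy.conf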
qos_vlarb_low
- …VL range 0-14 (practical 7), weight range 0-255
- …example 0:64,1:128 …notation <VL>:<weight>, always provide a weight!
qos_high_limit
- …ratio of high- over low-priority packets
- …0 single packets, 255 unbound (low-priority VLs may be starved)
- …use the default if possible
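A hedged sketch of these knobs as qos_* options in opensm's configuration (values are examples only, not tuning advice):
qos_max_vls 8                        # number of VLs to use
qos_high_limit 4                     # high-priority credit limit (0-255, see above)
qos_vlarb_high 1:192                 # a single high-priority lane, as recommended above
qos_vlarb_low 0:64,2:64,3:64,4:64    # <VL>:<weight> pairs, weight range 0-255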
Verify with smpquery vlarb & smpquery sl2vl
…perfquery -X displays counters for service level data
ULP (Upper Layer Protocol) …for example IPoIB
- …QoS policy to prioritize ULPs …configured in qos-policy.conf
- Examples:
- …MPI could be ULP/application with service ID (or PKEY)
- …Lustre could use a service ID …targeting port GUIDs
- …giving priority to MDS over OSTs
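A hedged sketch of the simplified qos-ulps section in qos-policy.conf; the service ID, port GUID and PKEY are placeholders:
qos-ulps
    default                                   : 0   # default SL for unmatched traffic
    any, service-id 0x0000000000010000        : 1   # e.g. an MPI service ID
    any, target-port-guid 0x0002c90300a1b2c3  : 2   # e.g. a Lustre MDS port GUID
    ipoib, pkey 0x8002                        : 3   # IPoIB traffic on the example partition
end-qos-ulps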
Congestion Control
…solves the following two issues:
- Head of queue blocking
- …use QoS for performance isolation of applications
- …avoid performance degradation between multiple applications
- Parking lot effect
- …link saturation over multiple hops
- …use rate limiting & CNPs (Congestion Notification Packets)