InfiniBand: Subnet-Manager (SM)
Software defined network (SDN)
- …configures and maintains fabric operations
- …central repository of all information
- …configures switch forwarding tables
Only one master SM allowed per subnet
- …can run on any server (or a managed switch on small fabrics)
- …master-slave setup for high-availability
Install opensm packages …start the subnet manager…
dnf install -y opensm
systemctl enable --now opensm
Initialization
…includes the following steps:
- Subnet discovery (…after wakeup)
- …traverse the network beginning with close neighbors
- …Subnet Management Packets (SMPs) to initiate “conversation”
- Information gathering…
- …find all links/switches/hosts on all connected ports to map topology
- …Subnet Manager query messages: directed-route information gathering for node/port information
- …Subnet Manager Agent (SMA) required on each node
- LIDs assignment
- Paths establishment
- …best path calculation to identify Shortest Path Table (Min-Hop)
- …calculate Linear Forwarding Table (LFT)
- Ports and switch configuration
- Subnet activation
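The Min-Hop idea behind the path establishment step can be sketched with a breadth-first search over the discovered links. This is a toy illustration, not OpenSM's implementation: the function name min_hops and the node names are made up, and the input is assumed to be one "nodeA nodeB" link per line.

```shell
#!/bin/sh
# min_hops <source>: read "nodeA nodeB" links on stdin and print
# "<node> <hops>" for every node reachable from <source>.
min_hops() {
    awk -v src="$1" '
        # Build an undirected adjacency list from the link lines.
        NF == 2 { adj[$1] = adj[$1] " " $2; adj[$2] = adj[$2] " " $1 }
        END {
            # Breadth-first search: the first visit of a node is
            # always along a path with the fewest hops.
            dist[src] = 0; queue[1] = src; head = 1; tail = 1
            while (head <= tail) {
                node = queue[head++]
                n = split(adj[node], nbr, " ")
                for (i = 1; i <= n; i++) {
                    if (!(nbr[i] in dist)) {
                        dist[nbr[i]] = dist[node] + 1
                        queue[++tail] = nbr[i]
                    }
                }
            }
            for (node in dist) print node, dist[node]
        }
    ' | LC_ALL=C sort
}

# Hypothetical fabric: three switches in a chain, one host per switch.
printf '%s\n' 'sw1 sw2' 'sw1 hostA' 'sw2 hostB' 'sw2 sw3' 'sw3 hostC' |
    min_hops sw1
```

In this toy fabric hostC ends up three hops away from sw1. A real SM derives the per-switch forwarding table from such hop counts by picking, for each destination LID, an output port on a minimum-hop path.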
Topology Changes
SM monitors the fabric for topology changes…
- …Light Sweep, every 10 seconds, requests node/port information
- …port status changes
- …search for other SMs, change priority
- …Heavy Sweep triggered by light sweep changes
- …fabric discovery from scratch
- …can be triggered by an IB trap from a status change on a switch
- …edge/host port state change impact is configurable
- …SM failover & handover with SMInfo protocol
- …election by priority (0-15) and lower GUID
- …heartbeat for stand-by SM polling the master
- …SMInfo attributes exchange information during discovery/polling to synchronize
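The election rule above (highest priority wins, ties go to the lower GUID) can be sketched as a sort. This is only an illustration of the ordering, not the SMInfo protocol itself: the function name elect_master and the GUIDs are hypothetical, and GUIDs are assumed to be written as lowercase hex of equal length so that text order matches numeric order.

```shell
#!/bin/sh
# elect_master: read "<priority> <guid>" lines on stdin, print the winner.
# Sort by priority descending, then by GUID ascending, and take the first.
elect_master() {
    LC_ALL=C sort -k1,1nr -k2,2 | head -n 1
}

# Three hypothetical SM candidates; two share the top priority.
printf '%s\n' \
    '10 0x0002c90300a1b2c3' \
    '15 0x0002c90300ffee01' \
    '15 0x0002c903000000aa' | elect_master
```

Here the candidate with priority 15 and GUID 0x0002c903000000aa is selected, because both priority-15 candidates beat priority 10 and the tie is broken by the lower GUID.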
Configuration
Configuration in /etc/rdma/opensm.conf…
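A minimal sketch of settings in opensm.conf that relate to the topics on this page. The option names are regular OpenSM options; the values are illustrative assumptions, not recommendations.

```
# Election priority of this SM (0-15), higher wins
sm_priority 15
# Seconds between light sweeps
sweep_interval 10
# Routing algorithm, here Min-Hop
routing_engine minhop
# Log file location
log_file /var/log/opensm.log
```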
opensm daemon…
- …-c $path …create configuration file if missing
- …-p $prio …change priority …when stopped!
- …-R $engine …change routing algorithm
- …/var/log/opensm.log …for logging
sminfo …show master subnet manager LID, GUID, priority
saquery -s …show all subnet managers
ibdiagnet -r # check for routing issues
smpquery portinfo $lid $port # query port information
smpquery nodeinfo $lid # query node information
smpquery -D nodeinfo $lid # ^ using direct route
ibroute $lid # show switching table, LIDs in hex
Partitions
Why use Partitions?
- Different partitions for customers/applications
- Prioritize traffic of latency critical applications
- Isolate traffic to a back-end storage system
- Allows fabric partitioning for security & QoS
- Secure the subnet-manager configuration…
- …HCAs become partial members …cannot configure the SM
- Similar to VLAN technology in Ethernet networks
Each partition has an identifier named PKEY
- PKEY enforcement done by link layer (layer 2) at the receiving side (HCA)
- …separation of physical connections
- …each packet carries a PKEY …derived from the PKEY index
- PKEYs are 16-bit integers configured in the SM port PKEY table…
- …0x7FFF default partition …includes SM traffic (aka management packets)
- …example PKEYs 0x0002, 0x0003, etc.
- Partition membership (security mechanism) …full vs partial membership
- MSB (most significant bit) defines nature of membership
- …0x8002 full membership …0x0003 partial membership
- …LSBs (other 15 bits) correspond to the PKEY for a partition
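The membership bit and the 15-bit partition key can be separated with two bit masks. The following is a small sketch in plain shell arithmetic; the function names pkey_membership and pkey_base are illustrative and not part of any InfiniBand tool.

```shell
#!/bin/sh
# Print "full" if the most significant bit (0x8000) is set, else "partial".
pkey_membership() {
    if [ $(( $1 & 0x8000 )) -ne 0 ]; then
        echo full
    else
        echo partial
    fi
}

# Print the lower 15 bits, which identify the partition itself.
pkey_base() {
    printf '0x%04x\n' $(( $1 & 0x7fff ))
}

# Decode the two example keys from the text plus the default partition
# key with the full-membership bit set.
for key in 0x8002 0x0003 0xffff; do
    echo "$key: $(pkey_membership "$key") member of partition $(pkey_base "$key")"
done
```

Note that 0x8002 and 0x0002 decode to the same partition 0x0002 and differ only in the membership type.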
Configuration in partition.conf
- …set a partition name to simplify logging!
- …associate HCA GUIDs to a PKEY (15 bit) …set IPoIB flag
- …set a default partition if a node is member of multiple partitions
- Diagnostic tools:
- …smpquery PkeyTable on a switch to check ports
- …ibdiagnet.pkey files to list per node GUID
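A minimal sketch of a partition configuration in the OpenSM partition file syntax (name=PKEY, flags : port GUID=membership, … ;). The partition name "storage" and both GUIDs are placeholders; the default partition is shown with limited membership for all ports except the SM itself, which is one possible way to realise the partial membership described above.

```
# Default partition: all ports limited members, the SM port a full member
Default=0x7fff : ALL=limited, SELF=full ;

# Hypothetical storage partition with IPoIB enabled
storage=0x0002, ipoib : 0x0002c90300a1b2c3=full, 0x0002c90300a1b2c4=limited ;
```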
Multiple partitions require IP sub-networks to use IPoIB…
- …Linux network child interfaces ib0.xxxx
- …add PKEY to /sys/class/net/ib0/create_child
M_KEY authentication between SM and fabric…
- …deployed by the SM to each node…
- …avoid fabric discovery by hostile SM
SM_KEY authenticate SM to a master SM…
- …configuration in opensm.conf
- …hand-over control to another SM
- …configuration in
Quality of Service
Why use QoS?
- Support applications sensitive to latency, for example…
- …configure different service levels for Lustre & MPI
QoS (Quality of Service) requires use of partitions…
- …configure traffic priorities …control congestion
- …only 2 levels of priority
- Service Level (SL)
- …field in LRH (Local Route Header) …packets operate on 16 SLs
- …nodes communication manager negotiates with the SM
- Virtual Lane (VL) …up to 7
- …SL to VL mapping configured by the SM (various limits, sets priority levels)
- …VL arbiter configures priority/weight …either high or low
- …the VL arbitration table should only have one high priority lane
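The SL to VL mapping is essentially a 16-entry lookup table indexed by service level. A short sketch, assuming the table is written as a comma-separated list of 16 VLs (the form taken by the qos_sl2vl option in opensm.conf); the function name sl_to_vl and the mapping itself are hypothetical.

```shell
#!/bin/sh
# sl_to_vl <sl2vl-list> <SL>: print the VL that the given SL maps to.
# The list is indexed from SL 0, while cut counts fields from 1.
sl_to_vl() {
    printf '%s\n' "$1" | cut -d, -f "$(( $2 + 1 ))"
}

# Hypothetical mapping: SLs 0-3 on VL0, SLs 4-7 on VL1, the rest on VL2.
map='0,0,0,0,1,1,1,1,2,2,2,2,2,2,2,2'
sl_to_vl "$map" 5
```

With this mapping SL 5 is carried on VL1. On a live fabric the effective table is what smpquery sl2vl reports.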
QoS enabled in opensm.conf …requires restart of opensm daemon
- …do not re-configure in production!
- …tuning in /etc/opensm/qos-policy.conf
qos_vlarb_low
- …VL range 0-14 (practical 7), weight range 0-255
- …example 0:64,1:128 …notation <VL>:<weight>, always provide a weight!
qos_high_limit
- …ratio of high- over low-priority packets
- …0 single packet …255 unbound (low prio. VLs may be starved)
- …use default if possible
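Because every entry of a VL arbitration list needs both a VL and a weight, a quick sanity check before editing the configuration can catch typos. A sketch under the ranges stated above (VL 0-14, weight 0-255); check_vlarb is an illustrative helper, not an OpenSM tool.

```shell
#!/bin/sh
# check_vlarb "<VL>:<weight>,...": return 0 if every entry is well formed
# and within range, otherwise print a message to stderr and return 1.
check_vlarb() {
    for entry in $(printf '%s' "$1" | tr ',' ' '); do
        # Each entry must be exactly two decimal numbers joined by a colon.
        if ! printf '%s\n' "$entry" | grep -Eq '^[0-9]+:[0-9]+$'; then
            echo "malformed entry (expected <VL>:<weight>): $entry" >&2
            return 1
        fi
        vl=${entry%%:*}
        weight=${entry##*:}
        if [ "$vl" -gt 14 ] || [ "$weight" -gt 255 ]; then
            echo "out of range: $entry" >&2
            return 1
        fi
    done
}

check_vlarb '0:64,1:128' && echo "0:64,1:128 looks valid"
check_vlarb '0:64,1' || echo "0:64,1 rejected"
```

The check covers syntax and ranges only. Whether a given weighting suits the workload still has to be verified on the fabric, for example with smpquery vlarb.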
Verify with smpquery vlarb & smpquery sl2vl …perfquery -X displays counters for service level data
ULP (Upper Layer Protocol) …for example IPoIB
- …QoS policy to prioritize ULP …configured in qos-policy.conf
- Examples:
- …MPI could be ULP/application with service ID (or PKEY)
- …Lustre could use a service ID …targeting port GUIDs
- …giving priority to MDS over OSTs
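The examples above could be expressed roughly as follows, assuming OpenSM's simplified qos-ulps section in qos-policy.conf. The SL numbers, the PKEY and the port GUID are placeholders chosen for illustration.

```
qos-ulps
    default                                  : 0  # SL for all other traffic
    ipoib                                    : 1  # IPoIB on the default partition
    any, pkey 0x0002                         : 2  # hypothetical MPI partition
    any, target-port-guid 0x0002c90300a1b2c3 : 3  # hypothetical Lustre MDS port
end-qos-ulps
```

Each rule maps matching traffic to a service level; which VL and priority that SL receives is decided by the SL to VL mapping and the VL arbitration tables.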
Congestion Control
…solves the following two issues:
- Head of queue blocking
- …use QoS for performance isolation of applications
- …avoid performance degradation between multiple applications
- Parking lot effect
- …link saturation over multiple hops
- …use rate limiting & CNP (Congestion Notification Packets)