InfiniBand: HPC Network Interconnect
- InfiniBand Architecture (IBA)
- Architecture for Interprocess Communication (IPC) networks
- Switch-based, point-to-point interconnection network
- low latency, high throughput, quality of service
- CPU offload, hardware-based transport protocol, kernel bypass
- Mellanox Community
Terminology
GUID Globally Unique Identifier
- …64bit unique address assigned by vendor
- …persistent through reboot
- …3 types of GUIDs: node, port (and system image)
LID Local Identifier (48k unicast per subnet)
- …16bit layer 2 address
- …assigned by the SM when port becomes active
- …each HCA port has a LID…
- …all switch ports share the same LID
- …director switches have one LID per ASIC
GID Global Identifier
- …128bit address unique across multiple subnets
- …based on the port GUID combined with 64bit subnet prefix
- …Used in the Global Routing Header (GRH) (ignored by switches within a subnet)
PKEY Partition Identifier
- …fabric segmentation of nodes into different partitions
- …partitions unaware of each other
- …membership either full (1) or limited (0)
  - …limited members can't communicate among themselves (only with full members)
- …ports may be members of multiple partitions
- …assigned by listing port GUIDs in partitions.conf
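Port identifiers can be inspected with the standard diagnostic tools; the partition file below is a minimal hypothetical sketch (GUIDs and partition names are placeholders):
>>> ibstat mlx5_0 1                  # port state, LID, LMC, port GUID, link rate
>>> ibv_devinfo -v | grep -i guid    # node & system image GUIDs as seen by the verbs layer
>>> cat /etc/opensm/partitions.conf
Default=0x7fff, ipoib : ALL=full;
storage=0x0002 : 0x0002c90300a34321=full, 0x0002c90300a34322=limited;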
Network Layers
Physical Layer
- Link Speed x Link Width = Link Rate
- Bit Error Rate (BER) of 1 in 10^15 bits
- Virtual Lane (VL), multiple virtual links on single physical link
- Mellanox supports VLs 0-7, each with dedicated buffers
- Quality of Service, bandwidth management
- Media for connecting two nodes
- Passive Copper Cables FDR max. 3m, EDR max. 2m
- Active Optical Cables (AOCs) FDR max. 300m, EDR max. 100m
- Connector QSFP
Speeds
Year  Gen  Name                 Speed/Lane  Width  Rate     Latency   Encoding   Eff. Speed
--------------------------------------------------------------------------------------------
1999  SDR  Single Data Rate     2.5Gbps     x4     10Gbps   5usec     NRZ
2004  DDR  Double Data Rate     5Gbps       x4     20Gbps   2.5usec   NRZ 8/10   16Gbps
2008  QDR  Quadruple Data Rate  10Gbps      x4     40Gbps   1.3usec   NRZ 8/10   32Gbps
2011  FDR  Fourteen Data Rate   14Gbps      x4     56Gbps   0.7usec   NRZ 64/66  54.6Gbps
2014  EDR  Enhanced Data Rate   25Gbps      x4     100Gbps  0.5usec   NRZ 64/66  96.97Gbps
2018  HDR  High Data Rate       50Gbps      x4     200Gbps  <0.6usec  PAM-4
2022  NDR  Next Data Rate       100Gbps     x4     400Gbps            PAM-4
?     XDR                       200Gbps     x4     800Gbps            PAM-4
?     GDR                                          1.6Tbps
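To verify the negotiated speed and width of local ports and fabric links (device name mlx5_0 is a placeholder):
>>> ibstat mlx5_0 1 | grep -i rate   # active link rate, e.g. 100 for EDR x4
>>> iblinkinfo                       # per-link width (e.g. 4X) and speed across the fabric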
Link Layer
- Subnet may contain: 48K unicast & 16k multicast addresses
- Local Routing Header (LRH) includes 16bit Destination LID (DLID) and port number
- LID Mask Control (LMC), use multiple LIDs to load-balance traffic over multiple network paths
- Credit Based Flow Control between two nodes
- Independent for each virtual lane (to separate congestion/latency)
- Sender limited by credits granted by the receiver in 64byte units
- Service Level (SL) to Virtual Lane (VL) mapping defined in opensm.conf
- Priority & weight values (0-255) indicate the number of 64byte units transported by a VL
- Guarantees performance for data flows to provide QoS (see the opensm.conf sketch after this list)
- Data Integrity
- 16bit Variant CRC (VCRC) link level integrity between two hops
- 32bit Invariant CRC (ICRC) end-to-end integrity
- Link Layer Retransmission (LLR)
- Mellanox SwitchX only, up to FDR, enabled by default
- Recovers problems on the physical layer
- Slight increase in latency
- Should remove all symbol errors
- Forward Error Correction (FEC)
- Mellanox Switch-IB onwards, EDR and above
- Based on 64/66bit encoding error correction
- No bandwidth loss
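A minimal opensm.conf sketch of the SL-to-VL mapping and VL arbitration parameters referenced above (values are illustrative only, not a recommendation):
>>> grep -e '^qos' /etc/opensm/opensm.conf
qos TRUE
qos_max_vls 8
qos_high_limit 255
qos_vlarb_high 0:64,1:64
qos_vlarb_low 2:32,3:32,4:32,5:32,6:32,7:32
qos_sl2vl 0,1,2,3,4,5,6,7,7,7,7,7,7,7,7,7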
Network Layer
- Infiniband Routing
- Fault isolation (e.g. topology changes)
- Increase security (limit attack scope within a network segment)
- Inter-subnet packet routing (connect multiple topologies)
- Uses GIDs for each port included in the Global Routing Header (GRH)
- Mellanox Infiniband Router SB7788 (up to 6 subnets)
Transport Layer
- Message segmentation into multiple packets by the sender, reassembly by the receiver
- Maximum Transfer Unit (MTU), default 4096 bytes, configured in openib.conf
- End-to-end communication service for applications (virtual channel)
- Queue Pairs (QPs), dedicated per connection
- Send/receive queue structure to enable application to bypass kernel
- Mode: connected vs. datagram; reliable vs. unreliable
- Datagram mode uses one QP for multiple connections
- Identified by 24bit Queue Pair Number (QPN)
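Active and maximum MTU of a port, as well as device QP limits, can be read from user space (device name is a placeholder, output abbreviated and values illustrative):
>>> ibv_devinfo -d mlx5_0 -v | grep -i -e mtu -e max_qp
        max_qp: 131072
        max_mtu: 4096 (5)
        active_mtu: 4096 (5)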
Upper Layer
- Protocols
- Native Infiniband RDMA Protocols
- MPI, RDMA Storage (iSER, SRP, NFS-RDMA), SDP (Sockets Direct Protocol), RDS (Reliable Datagram Sockets)
- Legacy TCP/IP, transported by IPoIB
- Software transport Verbs
- Client interface to the transport layer, HCA
- Most common implementation is OFED
- Subnet Manager Interface (SMI)
  - Subnet Management Packets (SMP) on QP0, VL15 (no flow control)
  - LID routed, or direct routed using port numbers (before fabric initialisation)
- General Service Interface (GSI)
  - General Management Packets (GMP) on QP1 (subject to flow control)
  - LID routed
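SMI and GSI traffic can be generated directly with the infiniband-diags tools (LID 1 is a placeholder):
>>> sminfo                    # query the master subnet manager (SMInfo)
>>> smpquery nodeinfo 1       # NodeInfo SMP to LID 1
>>> smpquery portinfo 1 1     # PortInfo SMP to LID 1, port 1
>>> smpquery -D nodeinfo 0    # same query via direct route to the local node
>>> perfquery 1 1             # PortCounters GMP (GSI) to LID 1, port 1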
Topology
Roadmap of the network:
- Critical aspect of any interconnection network
- Defines how the channels and routers are connected
- Sets performance bounds (network diameter, bisection bandwidth)
- Determines the cost of the network
- Keys to topology evaluation
- Network throughput - for application traffic patterns
- Network diameter - min/avg/max latency between hosts
- Scalability - cost of adding new end-nodes
- Cost per node - number of network routers/ports per end-node
Diameter defines the maximum distance between two nodes (hop count)
- Lower network diameter
- Better performance
- Smaller cost (less cables & routers)
- Less power consumption
Radix (or degree) of the router defines the number of ports per router
Nodal degree specifies how many links connect to each node
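The physical topology of a fabric can be mapped with the discovery tools:
>>> ibnetdiscover > topology.out   # full fabric topology (switches, HCAs, links)
>>> ibswitches                     # all switches with GUIDs and LIDs
>>> ibhosts                        # all channel adapters
>>> iblinkinfo                     # per-port link state, width and speed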
Demystifying DCN Topologies: Clos/Fat Trees
https://packetpushers.net/demystifying-dcn-topologies-clos-fat-trees-part1
https://packetpushers.net/demystifying-dcn-topologies-clos-fat-trees-part2
Clos Networks
Clos network is a multistage switching network
- Enables connection of a large number of nodes with small-radix switches
- 3 stages to switch from N inputs to N outputs
- Exactly one connection between each spine and leaf switch
Fat-Trees (special case of folded Clos network)
- Pros
- simple routing
- maximal network throughput
- fault-tolerant (path diversity)
- credit loop deadlock free routing
- Cons
- large diameter…
- …more expensive
- Alleviate the bandwidth bottleneck closer to the root with additional links
- Multiple paths to the destination from the source towards the root
- Consistent hop count, resulting in predictable latency.
- does not scale linearly with cluster size (max. 7 layers/tiers)
- Switches at the top of the pyramid shape are called Spines/Core
- Switches at the bottom of the pyramid are called Leafs/Lines/Edges
- External connections connect nodes to edge switches.
- Internal connections connect core with edge switches.
- Constant bi-sectional bandwidth (CBB)
- Non-blocking (1:1 ratio): equal number of external and internal connections (balanced)
- Blocking (x:1): more external than internal connections (oversubscription); see the worked example below
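A worked example on a hypothetical 40-port leaf switch: 20 node ports and 20 uplinks give a non-blocking 1:1 fat tree; 30 node ports with only 10 uplinks give a blocking factor of external:internal = 30:10 = 3:1 (3x oversubscription).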
Dragonfly
- Pros
- Reduce number of (long) global links…without reducing performance
- …smaller network diameter
- Reduced total cost of network (since cabling is reduced)
- More scalable (compared to fat-tree)
- Cons
- Requires adaptive routing…
- …effectively balance load across global channels…
- …adding selective virtual-channel discrimination…
Hierarchical topology dividing groups of routers…
- …connected into sub-network of collectively acting router groups…
- …as one high-radix virtual router
- …all minimal routes traverse at most one global channel…
- …to realize a very low global diameter
- Channels/links…
- …terminal connections to nodes/systems
- …local (intra-group) connections to other routers in the same group
- …global (long, inter-group) connections to routers in other groups
- All-to-all connection between each router group
- (Avoids the need for external top level switches)
- Each group has at least one global link to each other router group
Flavors diverge on group sub-topology…
- …intra-group interconnection network (local channels)
- 1D flattened butterfly, completely connected (default recommendation)
- 2D flattened butterfly
- Dragonfly+ (benefits of Dragonfly and Fat Tree)
Dragonfly+
Extends Dragonfly topology by using Clos-like group topology
- Higher scalability than Dragonfly, with lower cost than Fat Tree
- Group (pod) topology is typically a 2-level fat tree
- Pros… (compared to Dragonfly)
- More scalable, allows larger number of nodes on the network
- Similar or better bi-sectional bandwidth…
- …smaller number of buffers to avoid credit loop deadlocks
- At least 50% bi-sectional bandwidth for any router radix
- Requires only two virtual lanes to prevent credit loop deadlock
- Cons… (compared to Dragonfly)
- Even more complex routing…
- Fully Progressive Adaptive Routing (FPAR)
- Cabling complexity, intra-group routers connected as bipartite graph
Dragonfly+ is bipartite connected in the first intra-group level
- Number of spine switches = number of leaf switches
- Leaf router, first-layer
- (terminal) connects to nodes
- Intra-group (local) connection to spine routers
- Only one uplink to each spine inside the group
- Spine router, second-layer
- intra-group (local) connection to leaf routers
- inter-group (global) connections to spine routers of other groups
- Supports a blocking factor in leaf switches and non-blocking spines
Locality, group size
- With larger group size, a larger amount of traffic is internal (intra-group)
- Intra-group traffic does not use inter-group global links…
- …hence does not contribute to network throughput bottleneck
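Mellanox OpenSM provides a Dragonfly+ routing engine (dfp); a hypothetical opensm.conf sketch, to be verified against the configuration article referenced below:
>>> grep -e '^routing_engine' /etc/opensm/opensm.conf
routing_engine dfp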
How to Configure DragonFly, Mellanox, 2020/03
https://community.mellanox.com/s/article/How-to-Configure-DragonFly
Exascale HPC Fabric Topology, Mellanox, 2019/03
http://www.hpcadvisorycouncil.com/events/2019/APAC-AI-HPC/uploads/2018/07/Exascale-HPC-Fabric-Topology.pdf
Routing
Terms important to understand different algorithms…
- …tolerance …considered during path distance calculation
  - …0 …equal distance if the number of hops in the paths is the same
  - …1 …equal distance if the difference in hop count is less than or equal to one
- …contention …declared for every switch port on the path…
- …that is already used for routing another LID…
- …associated with the same host port
Algorithm…
- …SPF, DOR, LASH….
- Min-Hop minimal number of switch hops between nodes (cannot avoid credit loops)
- ftree congestion-free symmetric fat-tree, shift communication pattern
Up-Down
- …Min-Hop plus core/spine ranking
- …for non pure fat-tree topologies
- …down-up routes not allowed
Enable up-down routing engine:
>>> grep -e routing_engine -e root_guid_file /etc/opensm/opensm.conf
#routing_engine (null)
routing_engine updn
#root_guid_file (null)
root_guid_file /etc/opensm/rootswitches.list
>>> head /etc/opensm/rootswitches.list
0xe41d2d0300e512c0
0xe41d2d0300e50bd0
0xe41d2d0300e51af0
0xe41d2d0300e52eb0
0xe41d2d0300e52e90
Adaptive
Avoid congestion with adaptive routing…
- …supported on all types of topologies
- …maximize network utilization
- …spread traffic across all network links…
- …determine optimal path for data packets
- …allow packets to avoid congested areas
- …redirect traffic to less occupied outgoing ports
- …grading mechanism to select optimal ports considering
- …egress port
- …queue depth
- …path priority (shorter paths have higher priority)
Requires ConnectX-5 or newer…
- …packets can arrive out-of-order
- …sender marks traffic as eligible for network re-ordering
- …inter-message ordering can be enforced when required
Application Interface
- OpenFabrics Alliance (OFA)
- Builds open-source software: OFED (OpenFabrics Enterprise Distribution)
- Kernel-level drivers, channel-oriented RDMA and send/receive operations
- Kernel and user-level application programming interface (API)
- Services for parallel message passing (MPI)
- Includes Open Subnet Manager with diagnostic tools
- IP over Infiniband (IPoIB), Infiniband Verbs/API
RDMA
- Remote Direct Memory Access (RDMA)
- Linux kernel network stack limitations
- system call API packet rates too slow for high-speed network fabrics with latencies in the nanosecond range
- overhead of copying data from user- to kernel-space
- workarounds: packet aggregation, flow steering, passing the NIC to user-space…
- RDMA Subsystem: Bypass the kernel network stack to sustain full throughput
- special Verbs library maps devices into user-space to allow direct data stream control
- direct user-space to user-space memory data transfer (zero-copy)
- offload of network functionality to the hardware device
- messaging protocols implemented in RDMA
- regular network tools may not work
- bridging between common Ethernet networks and HPC network fabrics difficult
- protocols implementing RDMA: Infiniband, Omnipath, Ethernet(RoCE)
- future integration with the kernel network stack?
- Integrate RDMA subsystem messaging with the kernel
- Add Queue Pairs (QPs) concept to the kernel network stack to enable RDMA
- Implement POSIX network semantics for Infiniband
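Kernel bypass and RDMA semantics can be exercised with the perftest and rdma-core example tools (device and host names are placeholders):
>>> ib_write_bw -d mlx5_0                             # server side
>>> ib_write_bw -d mlx5_0 server-hostname             # client: RDMA write bandwidth test
>>> ibv_rc_pingpong -d mlx5_0 -g 0                    # server side
>>> ibv_rc_pingpong -d mlx5_0 -g 0 server-hostname    # client: verbs ping-pong over a reliable connection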
RDMA over Ethernet
- advances in Ethernet technology allow building “lossless” Ethernet fabrics
- PFC (Priority-based Flow Control) prevents packet loss due to buffer overflow at switches
- Enables FCoE (Fibre Channel over Ethernet), RoCE (RDMA over Converged Ethernet)
- Ethernet NICs come with a variety of options for offloading
- RoCE specification maintained as an annex to the IBTA Infiniband specification
- implements Infiniband Verbs over Ethernet (OFED >1.5.1)
- uses the Infiniband transport & network layers, swaps the link layer to Ethernet frames
- IPv4/6 addresses set over the regular Ethernet NIC
- control path RDMA-CM API, data path Verbs API
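RoCE devices appear as RDMA devices bound to an Ethernet netdev; a quick check (device name is a placeholder):
>>> rdma link show                                    # RDMA devices and their link layer/netdev
>>> ibv_devinfo -d mlx5_1 | grep -i link_layer        # reports Ethernet for RoCE ports
>>> cat /sys/class/infiniband/mlx5_1/ports/1/gids/0   # first GID table entry of the RoCE port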
OpenFabric
- OpenFabrics Interfaces (OFI)
- Developed by the OFI Working Group, a subgroup of OFA
- Successor to the IB Verbs and RoCE specifications
- Optimizes software to hardware path by minimizing cache and memory footprint
- Application-Centric and fabric implementation agnostic
- libfabric core component of OFI
- User-space API mapping applications to underlying fabric services
- Hardware/protocol agnostic
- Fabric hardware support implemented in OFI providers
- Socket provider for development
- Verbs provider allows running over hardware supporting libibverbs (Infiniband)
- usNIC (user-space NIC) provider supports Cisco Ethernet hardware
- PSM (Performance Scale Messaging) provider for Intel Omni-Path, GNI provider for Cray Aries
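Available libfabric providers and their capabilities can be listed with fi_info (shipped with libfabric):
>>> fi_info -l            # list available providers, e.g. verbs, sockets, psm2
>>> fi_info -p verbs      # detailed capabilities of the verbs provider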
References
- NVIDIA Infrastructure & Networking Knowledge Base