InfiniBand: HPC Network Interconnect

HPC
Network
InfiniBand
Published

August 19, 2015

Modified

January 2, 2025

Terminology

GUID Globally Unique Identifier

  • …64-bit unique address assigned by the vendor
  • …persistent through reboots
  • …three types of GUIDs: node, port, and system image

LID Local Identifier (48k unicast per subnet)

  • …16-bit layer-2 address
  • …assigned by the SM when the port becomes active
  • …each HCA port has its own LID…
    • …all ports of a switch share the same LID
    • …director switches have one LID per ASIC

GID Global Identifier

  • …128-bit address unique across multiple subnets
  • …based on the port GUID combined with the 64-bit subnet prefix
  • …used in the Global Routing Header (GRH) (ignored by switches within a subnet)
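
The port GUID, LID, and GID of a local HCA can be queried with the standard OFED tools. A minimal sketch (device name and values are illustrative):

>>> ibstat mlx5_0 1 | grep -e 'Base lid' -e 'Port GUID'
Base lid: 42
Port GUID: 0xe41d2d0300e512c1
>>> # GID[0] = default subnet prefix fe80:: + port GUID
>>> ibv_devinfo -v | grep 'GID\['
GID[  0]: fe80:0000:0000:0000:e41d:2d03:00e5:12c1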

PKEY Partition Key

  • …fabric segmentation of nodes into different partitions
  • …partitions are unaware of each other
    • …limited (0) membership, can only communicate with full members
    • …full (1) membership, can communicate with all partition members
  • …ports may be members of multiple partitions
  • …assigned by listing port GUIDs in partitions.conf (sketch below)
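
A minimal sketch of /etc/opensm/partitions.conf, assuming the stock OpenSM partition syntax (partition name and P_Key, followed by member port GUIDs with full/limited membership); names and GUIDs are placeholders:

>>> cat /etc/opensm/partitions.conf
# default partition, all ports are full members
Default=0x7fff, ipoib : ALL=full ;
# example "storage" partition with one full and one limited member
storage=0x0002, ipoib : 0xe41d2d0300e512c1=full, 0xe41d2d0300e50bd1=limited ;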

Network Layers

Physical Layer

  • Link Speed x Link Width = Link Rate (worked example below the Speeds table)
  • Bit Error Rate (BER) of 10^-15
  • Virtual Lanes (VLs), multiple virtual links on a single physical link
    • Mellanox supports VL0-7 for data, each with dedicated buffers (VL15 is reserved for management)
    • Quality of Service, bandwidth management
  • Media for connecting two nodes
    • Passive Copper Cables FDR max. 3m, EDR max. 2m
    • Active Optical Cables (AOCs) FDR max. 300m, EDR max. 100m
    • Connector QSFP

Speeds

Year   Generation                 Lane Speed   Width   Link Rate   Latency    Encoding    Eff. Speed
----------------------------------------------------------------------------------------------------
1999   SDR  Single Data Rate      2.5Gbps      x4      10Gbps      5usec      NRZ 8/10    8Gbps
2004   DDR  Double Data Rate      5Gbps        x4      20Gbps      2.5usec    NRZ 8/10    16Gbps
2008   QDR  Quadruple Data Rate   10Gbps       x4      40Gbps      1.3usec    NRZ 8/10    32Gbps
2011   FDR  Fourteen Data Rate    14Gbps       x4      56Gbps      0.7usec    NRZ 64/66   54.6Gbps
2014   EDR  Enhanced Data Rate    25Gbps       x4      100Gbps     0.5usec    NRZ 64/66   96.97Gbps
2018   HDR  High Data Rate        50Gbps       x4      200Gbps     <0.6usec   PAM-4
2022   NDR  Next Data Rate        100Gbps      x4      400Gbps                PAM-4
?      XDR                        200Gbps      x4      800Gbps                PAM-4
?      GDR                                              1.6Tbps
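
The Eff. Speed column is simply the link rate multiplied by the encoding efficiency, e.g. for QDR (8/10 encoding) and EDR (64/66 encoding):

>>> echo "10*4*8/10" | bc -l     # QDR: 10Gbps per lane, x4 width, 8/10 encoding
32.00000000000000000000
>>> echo "25*4*64/66" | bc -l    # EDR: 25Gbps per lane, x4 width, 64/66 encoding
96.96969696969696969696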

Network Layer

  • Infiniband Routing
    • Fault isolation (e.g. topology changes)
    • Increased security (limit attack scope within a network segment)
    • Inter-subnet packet routing (connect multiple topologies)
  • Uses GIDs for each port included in the Global Routing Header (GRH)
  • Mellanox Infiniband Router SB7788 (up to 6 subnets)

Transport Layer

  • Message segmentation into multiple packets by the sender, reassembly at the receiver
    • Maximum Transfer Unit (MTU), default 4096 bytes (openib.conf, query example below)
  • End-to-end communication service for applications (virtual channel)
  • Queue Pairs (QPs), dedicated per connection
    • Send/receive queue structure enabling applications to bypass the kernel
    • Mode: connected vs. datagram; reliable vs. unreliable
    • Datagram mode uses one QP for multiple connections
    • Identified by a 24-bit Queue Pair Number (QPN)
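
The maximum and currently active MTU of a port can be checked with ibv_devinfo (device name and output values are illustrative):

>>> ibv_devinfo -d mlx5_0 | grep -e max_mtu -e active_mtu
max_mtu: 4096 (5)
active_mtu: 4096 (5)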

Upper Layer

  • Protocols
    • Native Infiniband RDMA Protocols
    • MPI, RDMA storage (iSER, SRP, NFS-RDMA), SDP (Sockets Direct Protocol), RDS (Reliable Datagram Sockets)
    • Legacy TCP/IP, transported by IPoIB
  • Software transport Verbs
    • Client interface to the transport layer of the HCA
    • Most common implementation is OFED
  • Subnet Management Interface (SMI)
    • Subnet Management Packets (SMPs) (on QP0, VL15, no flow control)
    • LID-routed or directed-route (before fabric initialisation, addressed using port numbers)
  • General Service Interface (GSI)
    • General Management Packets (GMPs) (on QP1, subject to flow control)
    • LID-routed
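
Both SMP addressing modes can be exercised with smpquery from infiniband-diags; the LID and the direct-route path below are placeholders:

>>> # LID-routed SMP: query node info of the node with LID 42
>>> smpquery nodeinfo 42
>>> # directed-route SMP: query the node reached through local port 1
>>> smpquery -D nodeinfo 0,1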

Topology

Roadmap of the network:

  • Critical aspect of any interconnection network
  • Defines how the channels and routers are connected
  • Sets performance bounds (network diameter, bisection bandwidth)
  • Determines the cost of the network
  • Keys to topology evaluation
    • Network throughput - for application traffic patterns
    • Network diameter - min/avg/max latency between hosts
    • Scalability - cost of adding new end-nodes
    • Cost per node - number of network routers/ports per end-node

Diameter defines the maximum distance between two nodes (hop count)

  • Lower network diameter
    • Better performance
    • Lower cost (fewer cables & routers)
    • Less power consumption
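
The hop count between two end-points can be inspected with ibtracert from infiniband-diags, which prints every switch traversed on the path the SM has programmed (LIDs are placeholders):

>>> ibtracert 17 42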

Radix (or degree) of the router defines the number of ports per router

Nodal degree specifies how many links connect to each node

Demystifying DCN Topologies: Clos/Fat Trees
https://packetpushers.net/demystifying-dcn-topologies-clos-fat-trees-part1
https://packetpushers.net/demystifying-dcn-topologies-clos-fat-trees-part2

Clos Networks

A Clos network is a multistage switching network

  • Enables connecting a large number of nodes with small-radix switches
    • 3 stages to switch from N inputs to N outputs
  • Exactly one connection between each spine and leaf switch

Fat-Trees (special case of folded Clos network)

  • Pros
    • simple routing
    • maximal network throughput
    • fault-tolerant (path diversity)
    • credit-loop deadlock-free routing
  • Cons
    • large diameter…
    • …more expensive
  • Alleviate the bandwidth bottleneck closer to the root with additional links
  • Multiple paths to the destination from the source towards the root
  • Consistent hop count, resulting in predictable latency.
  • Does not scale linearly with cluster size (max. 7 layers/tiers)
  • Switches at the top of the pyramid shape are called Spines/Core
  • Switches at the bottom of the pyramid are called Leafs/Lines/Edges
  • External connections connect nodes to edge switches.
  • Internal connections connect core with edge switches.
  • Constant bi-sectional bandwidth (CBB)
    • Non blocking (1:1 ratio)
    • Equal number of external and internal connections (balanced)
  • Blocking (x:1): the number of external connections is higher than the number of internal connections (oversubscription)
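
For example, on a hypothetical 36-port leaf switch, 18 external and 18 internal connections are non-blocking (1:1), while 24 external and 12 internal connections give a 2:1 blocking factor:

>>> echo "scale=1; 24/12" | bc   # external : internal connections
2.0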

Dragonfly

  • Pros
    • Reduce number of (long) global links…without reducing performance
    • …smaller network diameter
    • Reduced total cost of network (since cabling is reduced)
    • More scalable (compared to fat-tree)
  • Cons
    • Requires adaptive routing
    • …effectively balance load across global channels…
    • …adding selective virtual-channel discrimination…

Hierarchical topology dividing groups of routers…

  • …connected into sub-networks (groups) of routers that collectively act…
    • …as one high-radix virtual router
    • …all minimal routes traverse at most one global channel…
    • …to realize a very low global diameter
  • Channels/links…
    • terminal connections to nodes/systems
    • local (intra-group) connections to other routers in the same group
    • global (long, inter-group) connections to routers in other groups
  • All-to-all connection between each router group
    • (Avoids the need for external top level switches)
    • Each group has at least one global link to each other router group

Flavors differ in the group sub-topology

  • …intra-group interconnection network (local channels)
  • 1D flattened butterfly, completely connected (default recommendation)
  • 2D flattened butterfly
  • Dragonfly+ (benefits of Dragonfly and Fat Tree)

Dragonfly+

Extends Dragonfly topology by using Clos-like group topology

  • Higher scalability than Dragonfly with lower cost than fat-tree
  • Group (pod) topology is typically a 2-level fat-tree
  • Pros… (compared to Dragonfly)
    • More scalable, allows larger number of nodes on the network
    • Similar or better bi-sectional bandwidth…
    • …smaller number of buffers to avoid credit loop deadlocks
    • At least 50% bi-sectional bandwidth for any router radix
    • Requires only two virtual lanes to prevent credit loop deadlock
  • Cons… (compared to Dragonfly)
    • Even more complex routing
    • Fully Progressive Adaptive Routing (FPAR)
    • Cabling complexity, intra-group routers connected as bipartite graph

Dragonfly+ is bipartite connected in the first intra-group level

  • Number of spine switches = number of leaf switches
  • Leaf router, first-layer
    • (terminal) connects to nodes
    • Intra-group (local) connection to spine routers
    • Only one uplink to each spine inside the group
  • Spine router, second-layer
    • intra-group (local) connection to leaf routers
    • inter-group (global) connections to spine routers of other groups
  • Supports a blocking factor on leaf switches and non-blocking spines

Locality, group size

  • With a larger group size, a larger amount of traffic is internal (intra-group)
  • Intra-group traffic does not use inter-group global links…
  • …hence does not contribute to network throughput bottleneck
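
A minimal sketch of enabling Dragonfly+ routing in OpenSM, assuming the dfp routing engine described in the Mellanox article linked below (exact options may differ between OpenSM versions):

>>> grep -e '^routing_engine' /etc/opensm/opensm.conf
routing_engine dfp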

How to Configure DragonFly, Mellanox, 2020/03
https://community.mellanox.com/s/article/How-to-Configure-DragonFly

Exascale HPC Fabric Topology, Mellanox, 2019/03
http://www.hpcadvisorycouncil.com/events/2019/APAC-AI-HPC/uploads/2018/07/Exascale-HPC-Fabric-Topology.pdf

Routing

Terms important for understanding the different algorithms…

  • tolerance …considered during path distance calculation
    • 0 …equal distance if the number of hops in the paths is the same
    • 1 …equal distance if the difference in hop count is less than or equal to one
  • contention …declared for every switch port on the path…
    • …that is already used for routing another LID…
    • …associated with the same host port

Algorithm…

  • …SPF, DOR, LASH, …
  • Min-Hop: minimal number of switch hops between nodes (cannot avoid credit loops)
  • ftree: for symmetric fat-trees, congestion-free for shift communication patterns
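
OpenSM selects the routing algorithm via the routing_engine option in opensm.conf; an ordered, comma-separated list can be given so that the next engine is tried if the previous one fails, falling back to Min-Hop if all fail, for example:

>>> grep -e '^routing_engine' /etc/opensm/opensm.conf
routing_engine ftree,updn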

Up-Down

  • …Min-Hop plus core/spine ranking
  • …for non-pure fat-tree topologies
  • …down-up routes not allowed

Enable up-down routing engine:

>>> grep -e routing_engine -e root_guid_file /etc/opensm/opensm.conf    
#routing_engine (null)
routing_engine updn
#root_guid_file (null)
root_guid_file /etc/opensm/rootswitches.list
>>> head /etc/opensm/rootswitches.list
0xe41d2d0300e512c0
0xe41d2d0300e50bd0
0xe41d2d0300e51af0
0xe41d2d0300e52eb0
0xe41d2d0300e52e90

Adaptive

Avoid congestion with adaptive routing…

  • …supported on all types of topologies
  • …maximize network utilization
  • …spread traffic across all network links…
    • …determine optimal path for data packets
    • …allow packets to avoid congested areas
  • …redirect traffic to less occupied outgoing ports
  • …grading mechanism to select optimal ports considering
    • …egress port
    • …queue depth
    • …path priority (shorter paths have higher priority)

Requires ConnectX-5 or newer…

  • …packets can arrive out-of-order
  • …the sender marks traffic as eligible for network re-ordering
  • …inter-message ordering can be enforced when required

Application Interface

  • OpenFabrics Alliance (OFA)
    • Builds open-source software: OFED (OpenFabrics Enterprise Distribution)
    • Kernel-level drivers, channel-oriented RDMA and send/receive operations
    • Kernel and user-level application programming interface (API)
    • Services for parallel message passing (MPI)
    • Includes Open Subnet Manager with diagnostic tools
    • IP over Infiniband (IPoIB), Infiniband Verbs/API

RDMA

  • Remote Direct Memory Access (RDMA)
  • Linux kernel network stack limitations
    • system-call packet rates too slow for high-speed network fabrics with latencies in the nanosecond range
    • overhead of copying data from user- to kernel-space
    • workarounds: packet aggregation, flow steering, passing the NIC to user-space…
  • RDMA Subsystem: Bypass the kernel network stack to sustain full throughput
    • special Verbs library maps devices into user-space to allow direct data stream control
    • direct user-space to user-space memory data transfer (zero-copy)
    • offload of network functionality to the hardware device
    • messaging protocols implemented in RDMA
    • regular network tools may not work
    • bridging between common Ethernet networks and HPC network fabrics is difficult
  • protocols implementing RDMA: InfiniBand, Omni-Path, Ethernet (RoCE)
  • future integration with the kernel network stack?
    • Integrate RDMA subsystem messaging with the kernel
    • Add Queue Pairs (QPs) concept to the kernel network stack to enable RDMA
    • Implement POSIX network semantics for Infiniband
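
RDMA kernel bypass can be exercised directly with the perftest tools shipped with OFED, e.g. an RDMA-write bandwidth test between two hosts (hostname is a placeholder):

>>> # on the server: wait for a client connection
>>> ib_write_bw
>>> # on the client: connect to the server and run the benchmark
>>> ib_write_bw server01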

RDMA over Ethernet

  • advances in Ethernet technology allow building “lossless” Ethernet fabrics
    • PFC (Priority-based Flow Control) prevents packet loss due to buffer overflow at switches
    • Enables FCoE (Fibre Channel over Ethernet), RoCE (RDMA over Converged Ethernet)
    • Ethernet NICs come with a variety of options for offloading
  • RoCE is specified as an annex to the IBTA InfiniBand specification
  • implements Infiniband Verbs over Ethernet (OFED >1.5.1)
    • uses the InfiniBand transport & network layers, swaps the link layer to Ethernet frames
    • IPv4/6 addresses set over the regular Ethernet NIC
    • control path RDMA-CM API, data path Verbs API
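
On a RoCE-capable NIC, the RDMA device and its associated Ethernet interface can be listed with the rdma tool from iproute2 (device names and output are illustrative):

>>> rdma link show
link mlx5_0/1 state ACTIVE physical_state LINK_UP netdev eth0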

OpenFabric

  • OpenFabrics Interfaces (OFI)
  • Developed by the OFI Working Group, a subgroup of OFA
    • Successor to the IB Verbs and RoCE specifications
    • Optimizes the software-to-hardware path by minimizing cache and memory footprint
    • Application-centric and fabric-implementation agnostic
  • libfabric core component of OFI
    • User-space API mapping applications to underlying fabric services
    • Hardware/protocol agnostic
  • Fabric hardware support implemented in OFI providers
    • Socket provider for development
    • Verbs provider allows running over hardware supporting libibverbs (InfiniBand)
    • usNIC (user-space NIC) provider supports Cisco Ethernet hardware
    • PSM (Performance Scaled Messaging) provider for Intel Omni-Path; GNI provider for Cray Aries
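
The libfabric providers available on a host can be listed with fi_info (output depends on the installation):

>>> # list all providers and endpoint types detected by libfabric
>>> fi_info
>>> # restrict the query to the verbs provider
>>> fi_info -p verbs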

References