InfiniBand: HPC Network Interconnect
- InfiniBand Architecture (IBA)
- Architecture for Interprocess Communication (IPC) networks
- Switch-based, point-to-point interconnection network
- low latency, high throughput, quality of service
- CPU offload, hardware-based transport protocol, kernel bypass
- Mellanox Community
Terminology
GUID Globally Unique Identifier
- …64bit unique address assigned by vendor
- …persistent through reboot
- …3 types of GUIDs: node, port (and system image)
LID Local Identifier (48k unicast per subnet)
- …16bit layer 2 address
- …assigned by the SM when port becomes active
- …each HCA port has a LID…
- …all switch ports share the same LID
- …director switches have one LID per ASIC
GID Global Identifier
- …128bit address unique across multiple subnets
- …based on the port GUID combined with 64bit subnet prefix
- …Used in the Global Routing Header (GRH) (ignored by switches within a subnet)
PKEY Partition Identifier
- …fabric segmentation of nodes into different partitions
- …partitions unaware of each other
- …membership either full (1) or limited (0)
  - …limited members can't communicate among themselves (only with full members)
- …ports may be members of multiple partitions
- …assigned by listing port GUIDs in partitions.conf
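Port identifiers can be inspected with the standard diagnostic tools; the partition file below is a minimal hypothetical sketch (GUIDs and partition names are placeholders):
>>> ibstat mlx5_0 1                  # port state, LID, LMC, port GUID, link rate
>>> ibv_devinfo -v | grep -i guid    # node & system image GUIDs as seen by the verbs layer
>>> cat /etc/opensm/partitions.conf
Default=0x7fff, ipoib : ALL=full;
storage=0x0002 : 0x0002c90300a34321=full, 0x0002c90300a34322=limited;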
Network Layers
Physical Layer
- Link Speed x Link Width = Link Rate
- Bit Error Rate (BER) of 1 in 10^15 bits
- Virtual Lane (VL), multiple virtual links on single physical link
- Mellanox supports VLs 0-7, each with dedicated buffers
- Quality of Service, bandwidth management
- Media for connecting two nodes
- Passive Copper Cables FDR max. 3m, EDR max. 2m
- Active Optical Cables (AOCs) FDR max. 300m, EDR max. 100m
- Connector QSFP
Speeds
Year  Gen  Name                 Speed/Lane  Width  Rate     Latency   Encoding   Eff. Speed
--------------------------------------------------------------------------------------------
1999  SDR  Single Data Rate     2.5Gbps     x4     10Gbps   5usec     NRZ
2004  DDR  Double Data Rate     5Gbps       x4     20Gbps   2.5usec   NRZ 8/10   16Gbps
2008  QDR  Quadruple Data Rate  10Gbps      x4     40Gbps   1.3usec   NRZ 8/10   32Gbps
2011  FDR  Fourteen Data Rate   14Gbps      x4     56Gbps   0.7usec   NRZ 64/66  54.6Gbps
2014  EDR  Enhanced Data Rate   25Gbps      x4     100Gbps  0.5usec   NRZ 64/66  96.97Gbps
2018  HDR  High Data Rate       50Gbps      x4     200Gbps  <0.6usec  PAM-4
2022  NDR  Next Data Rate       100Gbps     x4     400Gbps            PAM-4
?     XDR                       200Gbps     x4     800Gbps            PAM-4
?     GDR                                          1.6Tbps
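To verify the negotiated speed and width of local ports and fabric links (device name mlx5_0 is a placeholder):
>>> ibstat mlx5_0 1 | grep -i rate   # active link rate, e.g. 100 for EDR x4
>>> iblinkinfo                       # per-link width (e.g. 4X) and speed across the fabric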
Link Layer
- Subnet may contain: 48K unicast & 16k multicast addresses
- Local Routing Header (LRH) includes 16bit Destination LID (DLID) and port number
- LID Mask Control (LMC), use multiple LIDs to load-balance traffic over multiple network paths
- Credit Based Flow Control between two nodes
- Independent for each virtual lane (to separate congestion/latency)
- Sender limited by credits granted by the receiver in 64byte units
- Service Level (SL) to Virtual Lane (VL) mapping defined in opensm.conf
- Priority & weight values (0-255) indicate the number of 64byte units transported by a VL
- Guarantees performance for data flows to provide QoS (see the opensm.conf sketch after this list)
- Data Integrity
- 16bit Variant CRC (VCRC) link level integrity between two hops
- 32bit Invariant CRC (ICRC) end-to-end integrity
- Link Layer Retransmission (LLR)
- Mellanox SwitchX only, up to FDR, enabled by default
- Recovers problems on the physical layer
- Slight increase in latency
- Should remove all symbol errors
- Forward Error Correction (FEC)
- Mellanox Switch-IB onwards, EDR and above
- Based on 64/66bit encoding error correction
- No bandwidth loss
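A minimal opensm.conf sketch of the SL-to-VL mapping and VL arbitration parameters referenced above (values are illustrative only, not a recommendation):
>>> grep -e '^qos' /etc/opensm/opensm.conf
qos TRUE
qos_max_vls 8
qos_high_limit 255
qos_vlarb_high 0:64,1:64
qos_vlarb_low 2:32,3:32,4:32,5:32,6:32,7:32
qos_sl2vl 0,1,2,3,4,5,6,7,7,7,7,7,7,7,7,7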
Network Layer
- Infiniband Routing
- Fault isolation (e.g. topology changes)
- Increase security (limit attack scope within a network segment)
- Inter-subnet packet routing (connect multiple topologies)
- Uses GIDs for each port included in the Global Routing Header (GRH)
- Mellanox Infiniband Router SB7788 (up to 6 subnets)
Transport Layer
- Message segmentation into multiple packets by the sender, reassembly by the receiver
- Maximum Transfer Unit (MTU), default 4096 bytes, configured in openib.conf
- End-to-end communication service for applications (virtual channel)
- Queue Pairs (QPs), dedicated per connection
- Send/receive queue structure to enable application to bypass kernel
- Mode: connected vs. datagram; reliable vs. unreliable
- Datagram mode uses one QP for multiple connections
- Identified by 24bit Queue Pair Number (QPN)
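Active and maximum MTU of a port, as well as device QP limits, can be read from user space (device name is a placeholder, output abbreviated and values illustrative):
>>> ibv_devinfo -d mlx5_0 -v | grep -i -e mtu -e max_qp
        max_qp: 131072
        max_mtu: 4096 (5)
        active_mtu: 4096 (5)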
Upper Layer
- Protocols
- Native Infiniband RDMA Protocols
- MPI, RDMA Storage (iSER, SRP, NFS-RDMA), SDP (Sockets Direct Protocol), RDS (Reliable Datagram Sockets)
- Legacy TCP/IP, transported by IPoIB
- Software transport Verbs
- Client interface to the transport layer, HCA
- Most common implementation is OFED
- Subnet Manager Interface (SMI)
  - Subnet Management Packets (SMP) on QP0, VL15 (no flow control)
  - LID routed, or direct routed using port numbers (before fabric initialisation)
- General Service Interface (GSI)
  - General Management Packets (GMP) on QP1 (subject to flow control)
  - LID routed
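SMI and GSI traffic can be generated directly with the infiniband-diags tools (LID 1 is a placeholder):
>>> sminfo                    # query the master subnet manager (SMInfo)
>>> smpquery nodeinfo 1       # NodeInfo SMP to LID 1
>>> smpquery portinfo 1 1     # PortInfo SMP to LID 1, port 1
>>> smpquery -D nodeinfo 0    # same query via direct route to the local node
>>> perfquery 1 1             # PortCounters GMP (GSI) to LID 1, port 1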
Topology
Roadmap of the network:
- Critical aspect of any interconnection network
- Defines how the channels and routers are connected
- Sets performance bounds (network diameter, bisection bandwidth)
- Determines the cost of the network
- Keys to topology evaluation
- Network throughput - for application traffic patterns
- Network diameter - min/avg/max latency between hosts
- Scalability - cost of adding new end-nodes
- Cost per node - number of network routers/ports per end-node
Diameter defines the maximum distance between two nodes (hop count)
- Lower network diameter
- Better performance
- Smaller cost (less cables & routers)
- Less power consumption
Radix (or degree) of the router defines the number of ports per router
Nodal degree specifies how many links connect to each node
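The physical topology of a fabric can be mapped with the discovery tools:
>>> ibnetdiscover > topology.out   # full fabric topology (switches, HCAs, links)
>>> ibswitches                     # all switches with GUIDs and LIDs
>>> ibhosts                        # all channel adapters
>>> iblinkinfo                     # per-port link state, width and speed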
Demystifying DCN Topologies: Clos/Fat Trees
https://packetpushers.net/demystifying-dcn-topologies-clos-fat-trees-part1
https://packetpushers.net/demystifying-dcn-topologies-clos-fat-trees-part2
Clos Networks
Clos network is a multistage switching network
- Enables connection of a large number of nodes with small-radix switches
- 3 stages to switch from N inputs to N outputs
- Exactly one connection between each spine and leaf switch
Fat-Trees (special case of folded Clos network)
- Pros
- simple routing
- maximal network throughput
- fault-tolerant (path diversity)
- credit loop deadlock free routing
- Cons
- large diameter…
- …more expensive
- Alleviate the bandwidth bottleneck closer to the root with additional links
- Multiple paths to the destination from the source towards the root
- Consistent hop count, resulting in predictable latency.
- does not scale linearly with cluster size (max. 7 layers/tiers)
- Switches at the top of the pyramid shape are called Spines/Core
- Switches at the bottom of the pyramid are called Leafs/Lines/Edges
- External connections connect nodes to edge switches.
- Internal connections connect core with edge switches.
- Constant bi-sectional bandwidth (CBB)
- Non-blocking (1:1 ratio): equal number of external and internal connections (balanced)
- Blocking (x:1): more external than internal connections (oversubscription); see the worked example below
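A worked example on a hypothetical 40-port leaf switch: 20 node ports and 20 uplinks give a non-blocking 1:1 fat tree; 30 node ports with only 10 uplinks give a blocking factor of external:internal = 30:10 = 3:1 (3x oversubscription).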
Dragonfly
- Pros
- Reduce number of (long) global links…without reducing performance
- …smaller network diameter
- Reduced total cost of network (since cabling is reduced)
- More scalable (compared to fat-tree)
- Cons
- Requires adaptive routing…
- …effectively balance load across global channels…
- …adding selective virtual-channel discrimination…
Hierarchical topology dividing groups of routers…
- …connected into sub-network of collectively acting router groups…
- …as one high-radix virtual router
- …all minimal routes traverse at most one global channel…
- …to realize a very low global diameter
- Channels/links…
- …terminal connections to nodes/systems
- …local (intra-group) connections to other routers in the same group
- …global (long, inter-group) connections to routers in other groups
- All-to-all connection between each router group
- (Avoids the need for external top level switches)
- Each group has at least one global link to each other router group
Flavors diverge on group sub-topology…
- …intra-group interconnection network (local channels)
- 1D flattened butterfly, completely connected (default recommendation)
- 2D flattened butterfly
- Dragonfly+ (benefits of Dragonfly and Fat Tree)
Dragonfly+
Extends Dragonfly topology by using Clos-like group topology
- Higher scalability than Dragonfly, with lower cost than Fat Tree
- Group (pod) topology is typically a 2-level fat tree
- Pros… (compared to Dragonfly)
- More scalable, allows larger number of nodes on the network
- Similar or better bi-sectional bandwidth…
- …smaller number of buffers to avoid credit loop deadlocks
- At least 50% bi-sectional bandwidth for any router radix
- Requires only two virtual lanes to prevent credit loop deadlock
- Cons… (compared to Dragonfly)
- Even more complex routing…
- Fully Progressive Adaptive Routing (FPAR)
- Cabling complexity, intra-group routers connected as bipartite graph
Dragonfly+ is bipartite connected in the first intra-group level
- Number of spine switches = number of leaf switches
- Leaf router, first-layer
- (terminal) connects to nodes
- Intra-group (local) connection to spine routers
- Only one uplink to each spine inside the group
- Spine router, second-layer
- intra-group (local) connection to leaf routers
- inter-group (global) connections to spine routers of other groups
- Supports a blocking factor in leaf switches and non-blocking spines
Locality, group size
- With larger group size, a larger amount of traffic is internal (intra-group)
- Intra-group traffic does not use inter-group global links…
- …hence does not contribute to network throughput bottleneck
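Mellanox OpenSM provides a Dragonfly+ routing engine (dfp); a hypothetical opensm.conf sketch, to be verified against the configuration article referenced below:
>>> grep -e '^routing_engine' /etc/opensm/opensm.conf
routing_engine dfp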
How to Configure DragonFly, Mellanox, 2020/03
https://community.mellanox.com/s/article/How-to-Configure-DragonFly
Exascale HPC Fabric Topology, Mellanox, 2019/03
http://www.hpcadvisorycouncil.com/events/2019/APAC-AI-HPC/uploads/2018/07/Exascale-HPC-Fabric-Topology.pdf
Routing
Terms important to understand different algorithms…
- …tolerance …considered during path distance calculation
  - …0 …equal distance if the number of hops in the paths is the same
  - …1 …equal distance if the difference in hop count is less than or equal to one
- …contention …declared for every switch port on the path…
- …that is already used for routing another LID…
- …associated with the same host port
Algorithm…
- …SPF, DOR, LASH….
- Min-Hop minimal number of switch hops between nodes (cannot avoid credit loops)
- ftree congestion-free symmetric fat-tree, shift communication pattern
Up-Down
- …Min-Hop plus core/spine ranking
- …for non pure fat-tree topologies
- …down-up routes not allowed
Enable up-down routing engine:
>>> grep -e routing_engine -e root_guid_file /etc/opensm/opensm.conf
#routing_engine (null)
routing_engine updn
#root_guid_file (null)
root_guid_file /etc/opensm/rootswitches.list
>>> head /etc/opensm/rootswitches.list
0xe41d2d0300e512c0
0xe41d2d0300e50bd0
0xe41d2d0300e51af0
0xe41d2d0300e52eb0
0xe41d2d0300e52e90
Adaptive
Avoid congestion with adaptive routing…
- …supported on all types of topologies
- …maximize network utilization
- …spread traffic across all network links…
- …determine optimal path for data packets
- …allow packets to avoid congested areas
- …redirect traffic to less occupied outgoing ports
- …grading mechanism to select optimal ports considering
- …egress port
- …queue depth
- …path priority (shorter paths have higher priority)
Requires ConnectX-5 or newer…
- …packets can arrive out-of-order
- …sender marks traffic as eligible for network re-ordering
- …inter-message ordering can be enforced when required
Application Interface
- OpenFabrics Alliance (OFA)
- Builds open-source software: OFED (OpenFabrics Enterprise Distribution)
- Kernel-level drivers, channel-oriented RDMA and send/receive operations
- Kernel and user-level application programming interface (API)
- Services for parallel message passing (MPI)
- Includes Open Subnet Manager with diagnostic tools
- IP over Infiniband (IPoIB), Infiniband Verbs/API
RDMA
- Remote Direct Memory Access (RDMA)
- Linux kernel network stack limitations
- system call API packet rates too slow for high-speed network fabrics with latencies in the nanosecond range
- overhead of copying data from user- to kernel-space
- workarounds: packet aggregation, flow steering, passing the NIC to user-space…
- RDMA Subsystem: Bypass the kernel network stack to sustain full throughput
- special Verbs library maps devices into user-space to allow direct data stream control
- direct user-space to user-space memory data transfer (zero-copy)
- offload of network functionality to the hardware device
- messaging protocols implemented in RDMA
- regular network tools may not work
- bridging between common Ethernet networks and HPC network fabrics difficult
- protocols implementing RDMA: Infiniband, Omnipath, Ethernet(RoCE)
- future integration with the kernel network stack?
- Integrate RDMA subsystem messaging with the kernel
- Add Queue Pairs (QPs) concept to the kernel network stack to enable RDMA
- Implement POSIX network semantics for Infiniband
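Kernel bypass and RDMA semantics can be exercised with the perftest and rdma-core example tools (device and host names are placeholders):
>>> ib_write_bw -d mlx5_0                             # server side
>>> ib_write_bw -d mlx5_0 server-hostname             # client: RDMA write bandwidth test
>>> ibv_rc_pingpong -d mlx5_0 -g 0                    # server side
>>> ibv_rc_pingpong -d mlx5_0 -g 0 server-hostname    # client: verbs ping-pong over a reliable connection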
RDMA over Ethernet
- advances in Ethernet technology allow building “lossless” Ethernet fabrics
- PFC (Priority-based Flow Control) prevents packet loss due to buffer overflow at switches
- Enables FCoE (Fibre Channel over Ethernet), RoCE (RDMA over Converged Ethernet)
- Ethernet NICs come with a variety of options for offloading
- RoCE specification maintained as an annex to the IBTA Infiniband specification
- implements Infiniband Verbs over Ethernet (OFED >1.5.1)
- uses the Infiniband transport & network layers, swaps the link layer to Ethernet frames
- IPv4/6 addresses set over the regular Ethernet NIC
- control path RDMA-CM API, data path Verbs API
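RoCE devices appear as RDMA devices bound to an Ethernet netdev; a quick check (device name is a placeholder):
>>> rdma link show                                    # RDMA devices and their link layer/netdev
>>> ibv_devinfo -d mlx5_1 | grep -i link_layer        # reports Ethernet for RoCE ports
>>> cat /sys/class/infiniband/mlx5_1/ports/1/gids/0   # first GID table entry of the RoCE port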
OpenFabric
- OpenFabrics Interfaces (OFI)
- Developed by the OFI Working Group, a subgroup of OFA
- Successor to the IB Verbs and RoCE specifications
- Optimizes software to hardware path by minimizing cache and memory footprint
- Application-Centric and fabric implementation agnostic
- libfabric core component of OFI
- User-space API mapping applications to underlying fabric services
- Hardware/protocol agnostic
- Fabric hardware support implemented in OFI providers
- Socket provider for development
- Verbs provider allows running over hardware supporting libibverbs (Infiniband)
- usNIC (user-space NIC) provider supports Cisco Ethernet hardware
- PSM (Performance Scale Messaging) provider for Intel Omni-Path, GNI provider for Cray Aries
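Available libfabric providers and their capabilities can be listed with fi_info (shipped with libfabric):
>>> fi_info -l            # list available providers, e.g. verbs, sockets, psm2
>>> fi_info -p verbs      # detailed capabilities of the verbs provider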
References
- NVIDIA Infrastructure & Networking Knowledge Base