HPC — Network Interconnects
- HPI (High Performance Interconnect) [^3]
- equipment designed for very high bandwidth and extremely low latency
- inter-node communication supporting clusters with large node counts
- technologies in the HPI market:
- Ethernet, RoCE (RDMA over Converged Ethernet)
- InfiniBand
- Intel Omni-Path
- Cray Aries XC
- SGI NUMALink
- HPI evaluation criteria
- reliability of inter-node communication
- sufficient link bandwidth to meet peak requirements
- sufficiently low latency
- load on node CPUs by the communication stack
- TCO (total cost of ownership) of the equipment in relation to overall performance
Network vs. Fabric
- network
- designed as universal interconnect
- vendor interoperability by design (for example, Ethernet)
- all-to-all communication for any application
- fabric
- designed as optimized interconnect
- single-vendor solution (Mellanox InfiniBand, Intel Omni-Path)
- single system build for a specific application
- spread network traffic across multiple physical links (multipath)
- scalable fat-tree and mesh topologies
- more sophisticated routing to allow redundancy and high throughput
- non-blocking interconnect (no over-subscription); see the fat-tree sizing sketch after this list
- low latency layer 2-type connectivity
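A rough way to see how far such fat-tree fabrics scale: in a non-blocking two-tier (leaf-spine) fat tree built from k-port switches, each leaf splits its ports evenly between hosts and uplinks, giving k²/2 hosts; a three-tier fat tree reaches k³/4. The sketch below is a minimal illustration of that arithmetic; the example switch radixes are hypothetical and not taken from the text above.

```c
#include <stdio.h>
#include <stddef.h>

/* Hosts in a non-blocking two-tier (leaf-spine) fat tree: each k-port leaf
 * uses k/2 ports for hosts and k/2 uplinks, and the k-port spines can attach
 * up to k leaves, so capacity = k * k/2 hosts. */
static unsigned hosts_two_tier(unsigned radix) {
    return radix * (radix / 2);
}

/* A full three-tier fat tree built from k-port switches reaches k^3/4 hosts. */
static unsigned hosts_three_tier(unsigned radix) {
    return radix * radix * radix / 4;
}

int main(void) {
    unsigned radixes[] = {36, 40, 64};   /* hypothetical switch port counts */
    for (size_t i = 0; i < sizeof radixes / sizeof radixes[0]; i++)
        printf("radix %2u: 2-tier %5u hosts, 3-tier %6u hosts\n",
               radixes[i], hosts_two_tier(radixes[i]), hosts_three_tier(radixes[i]));
    return 0;
}
```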
Modern AI/HPC networks are built around the GPU servers.
Offload vs. Onload
- network functions performed mostly in software “onload” (Ethernet, Omni-Path) [^2]
- requires CPU resources ⇒ decreases cycles available to hosted applications
- network functions performed by hardware “offload” (InfiniBand, RoCE), a.k.a. Intelligent Interconnect
- Network hardware performs communication operations (including data aggregation)
- Increases resource availability of the CPU (improves overall efficiency)
- Particularly advantageous for scatter/gather-type collective operations (see the MPI sketch after this section)
- trade-off
- more capable network infrastructure (offload) vs. incrementally more CPUs on servers (onload)
- advantage of offloading increases with the size of the interconnected clusters (higher node count = more messaging)
- comparison of InfiniBand & Omni-Path [^1]
- message rate test (excluding the overhead of data polling) to understand the impact of the network protocol on CPU utilization
- result: InfiniBand CPU resource utilization <1%, Omni-Path >40%
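To make the collective/offload point concrete, the sketch below times a plain MPI_Allreduce, the kind of aggregation that an offload-capable fabric (e.g., with in-network reduction) can take off the host CPUs. It is a generic, minimal MPI example under assumed parameters (1M doubles per rank), not one of the benchmarks from the cited comparison.

```c
#include <mpi.h>
#include <stdio.h>

/* Every rank contributes a vector and receives the element-wise global sum.
 * On "onload" stacks the reduction and protocol progress burn host CPU
 * cycles; an offload-capable fabric can aggregate in the network instead. */
int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    enum { N = 1 << 20 };                 /* 1M doubles per rank (assumed) */
    static double local[N], global[N];
    for (int i = 0; i < N; i++) local[i] = rank + 1.0;

    double t0 = MPI_Wtime();
    MPI_Allreduce(local, global, N, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    double t1 = MPI_Wtime();

    if (rank == 0)
        printf("allreduce of %d doubles across %d ranks: %.3f ms\n",
               N, size, (t1 - t0) * 1e3);

    MPI_Finalize();
    return 0;
}
```

Built with `mpicc` and run with e.g. `mpirun -np 4`, the same source runs over Ethernet, InfiniBand, or Omni-Path; only the underlying transport and its CPU cost differ.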
Ethernet vs InfiniBand
- Ethernet …widely in production …broad ecosystem
- …many manufacturers at all layers …rapid innovation
- …easy to deploy …widespread expert knowledge
- …many tools for operation, management, tests
- InfiniBand
- …mostly used in HPC, cf. TOP500
- …de-facto monopoly by NVIDIA/Mellanox
- Omni-Path …Intel proprietary
RDMA Transport
(R)DMA — Remote Direct Memory Access
- …extends DMA (Direct Memory Access) over networks
- …bypasses the CPU & OS kernel/caches …zero-copy data transfers (see the memory-registration sketch after this list)
- …hardware loss detection, retransmission, loss recovery
- Used for…
- …memory-to-memory communication over networks
- …communication between accelerators (GPUs)
- Performance improvements over TCP
- 85% increase in throughput
- 85% reduced latency
- 65% reduced CPU usage
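A minimal libibverbs sketch of the user-space side of RDMA: the application registers (pins) a buffer with the adapter, after which the NIC can move data into or out of it without kernel involvement or intermediate copies. Queue-pair setup and the actual RDMA read/write postings are omitted; using the first device and a 4 KiB buffer are arbitrary choices for illustration.

```c
#include <infiniband/verbs.h>
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    /* Enumerate RDMA-capable devices and open the first one. */
    struct ibv_device **devs = ibv_get_device_list(NULL);
    if (!devs || !devs[0]) { fprintf(stderr, "no RDMA device found\n"); return 1; }
    struct ibv_context *ctx = ibv_open_device(devs[0]);
    if (!ctx) { perror("ibv_open_device"); return 1; }

    /* A protection domain groups resources (QPs, MRs) that may work together. */
    struct ibv_pd *pd = ibv_alloc_pd(ctx);

    /* Register (pin) an application buffer with the adapter. The returned
     * lkey/rkey let the local and remote NICs access it directly
     * (zero-copy, kernel bypass) once queue pairs are connected. */
    size_t len = 4096;
    void *buf = malloc(len);
    struct ibv_mr *mr = ibv_reg_mr(pd, buf, len,
                                   IBV_ACCESS_LOCAL_WRITE |
                                   IBV_ACCESS_REMOTE_READ |
                                   IBV_ACCESS_REMOTE_WRITE);
    if (!mr) { perror("ibv_reg_mr"); return 1; }

    printf("registered %zu bytes: lkey=0x%x rkey=0x%x\n", len, mr->lkey, mr->rkey);

    /* Not shown: create CQ + QP, exchange QP number and rkey out of band,
     * then post RDMA WRITE/READ work requests that the NIC executes. */
    ibv_dereg_mr(mr);
    free(buf);
    ibv_dealloc_pd(pd);
    ibv_close_device(ctx);
    ibv_free_device_list(devs);
    return 0;
}
```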
History timeline…
- 1993 RDMA concept (patent by HP)
- 1994 onward, research and development
- 1999 InfiniBand Trade Association (IBTA) formed
- 2000 InfiniBand Architecture Spec 1.0
- 2000s (early) InfiniBand gains use in HPC (pushed by Mellanox)
- 2000s (mid) Intel/Microsoft shift to PCIe
- 2010 RoCE introduced
- 2014 RoCEv2 released to improve on RoCEv1 (routable over UDP/IP)
- 2010s (mid) NVMe over Fabrics
- 2010s (late) RoCE adoption in commercial data centers (cloud/finance)
- 2019 NVIDIA acquires Mellanox
RoCE — RDMA over Converged Ethernet
- RoCEv1 …no IP header …not routable
- RoCEv2 — RDMA over UDP/IP
- …routable over layer 3
- …requires a lossless network (e.g., PFC) …congestion control hard to tune
- …limited multi-pathing …relies on in-order packet delivery
- Uses ARP to resolve the remote MAC address (see the address-resolution sketch below)
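The routable, IP-based nature of RoCEv2 is visible in how connections are set up with librdmacm: rdma_resolve_addr() binds the destination IP to a local RDMA device and resolves the remote MAC (ARP/neighbor discovery), and rdma_resolve_route() resolves the layer-3 path. A minimal sketch; the target address 192.0.2.10 and port 7471 are hypothetical.

```c
#include <rdma/rdma_cma.h>
#include <arpa/inet.h>
#include <stdio.h>
#include <string.h>

int main(void) {
    /* Event channel + CM id: the librdmacm analogue of a socket. */
    struct rdma_event_channel *ec = rdma_create_event_channel();
    struct rdma_cm_id *id;
    if (rdma_create_id(ec, &id, NULL, RDMA_PS_TCP)) { perror("rdma_create_id"); return 1; }

    /* Hypothetical target; with RoCEv2 this is an ordinary routable IP. */
    struct sockaddr_in dst;
    memset(&dst, 0, sizeof dst);
    dst.sin_family = AF_INET;
    dst.sin_port = htons(7471);
    inet_pton(AF_INET, "192.0.2.10", &dst.sin_addr);

    /* Resolve IP -> RDMA device + remote MAC, then the layer-3 route.
     * Each step completes asynchronously via a CM event. */
    struct rdma_cm_event *ev;
    if (rdma_resolve_addr(id, NULL, (struct sockaddr *)&dst, 2000)) {
        perror("rdma_resolve_addr"); return 1;
    }
    rdma_get_cm_event(ec, &ev);                /* expect ADDR_RESOLVED */
    printf("event: %s\n", rdma_event_str(ev->event));
    rdma_ack_cm_event(ev);

    if (rdma_resolve_route(id, 2000)) { perror("rdma_resolve_route"); return 1; }
    rdma_get_cm_event(ec, &ev);                /* expect ROUTE_RESOLVED */
    printf("event: %s\n", rdma_event_str(ev->event));
    rdma_ack_cm_event(ev);

    /* Not shown: create a QP on id->verbs and rdma_connect() to the target. */
    rdma_destroy_id(id);
    rdma_destroy_event_channel(ec);
    return 0;
}
```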
Ultra Ethernet
- …runs over IPv4/IPv6 and Ethernet
- …multipath RDMA
End-to-End NVMe
For networks between hosts and storage systems
NVMe — Protocol command set for block storage
- …replaces SCSI …uses PCIe transmission channels
- …reduced latency & improved bandwidth (compared to SCSI/SAS); see the identify sketch below
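To show the NVMe command set itself, the sketch below issues an Identify Controller admin command (opcode 0x06, CNS 1) to a local PCIe NVMe controller through the Linux passthrough ioctl and prints the model string; the device path /dev/nvme0 is an assumption.

```c
#include <linux/nvme_ioctl.h>
#include <sys/ioctl.h>
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void) {
    int fd = open("/dev/nvme0", O_RDONLY);    /* assumed controller device */
    if (fd < 0) { perror("open /dev/nvme0"); return 1; }

    /* Identify Controller: admin opcode 0x06, CNS = 1, returns a 4 KiB page. */
    uint8_t data[4096] = {0};
    struct nvme_admin_cmd cmd;
    memset(&cmd, 0, sizeof cmd);
    cmd.opcode   = 0x06;
    cmd.addr     = (uint64_t)(uintptr_t)data;
    cmd.data_len = sizeof data;
    cmd.cdw10    = 1;                         /* CNS = 1 -> controller data */

    if (ioctl(fd, NVME_IOCTL_ADMIN_CMD, &cmd) < 0) {
        perror("NVME_IOCTL_ADMIN_CMD");
        return 1;
    }

    /* Bytes 24..63 of the Identify Controller data hold the model number. */
    char model[41] = {0};
    memcpy(model, data + 24, 40);
    printf("model: %s\n", model);

    close(fd);
    return 0;
}
```

With NVMe-oF the same command set is carried over a fabric transport instead of the local PCIe bus; the block-level semantics stay unchanged.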
NVMe-oF — NVMe over Fabrics
- Overcomes limits of NVMe over PCIe
- …limited bus addresses
- …connection distance limits
- Extends NVMe to various storage networks
- …map NVMe commands and data to multiple fabric links
- …Fibre Channel, InfiniBand, RoCE v2, iWARP, and TCP
- …reduces overhead for processing storage network protocol stacks
NVMe over RoCE
- …combine NVMe with low latency and low CPU usage of RDMA
- …converges the LANs and SANs of data centers
nvmetcli — configure an NVMe-over-Fabrics target