HPC Network Interconnects
- HPI (High Performance Interconnect) [^3]
- equipment designed for very high bandwidth and extremely low latency
- inter-node communication supporting clusters with large node counts
- technologies in the HPI market:
- Ethernet, RoCE (RDMA over Converged Ethernet)
- InfiniBand
- Intel Omni-Path
- Cray Aries XC
- SGI NUMALink
- HPI evaluation criteria (cf. the measurement sketch below)
- reliability of inter-node communication
- link bandwidth meeting peak requirements
- sufficiently low latency
- load on node CPUs by the communication stack
- TCO of the equipment in relation to overall performance
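The bandwidth and latency criteria above are usually quantified with a simple point-to-point microbenchmark between two nodes. Below is a minimal sketch of an MPI ping-pong test; the message size, iteration count and output format are illustrative assumptions, not part of any standard benchmark suite.

```c
/* Minimal MPI ping-pong sketch (illustrative, not a vendor benchmark).
 * Run with exactly 2 ranks on two different nodes to measure the inter-node
 * link: latency from small messages, bandwidth from large ones. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int iters = 1000;
    const int size  = 1 << 20;               /* 1 MiB payload; vary to sweep */
    char *buf = malloc(size);

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < iters; i++) {
        if (rank == 0) {
            MPI_Send(buf, size, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, size, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, size, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, size, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double t = MPI_Wtime() - t0;

    if (rank == 0) {
        double rtt = t / iters;                      /* average round-trip time */
        double bw  = 2.0 * size / rtt / 1e9;         /* GB/s, both directions counted */
        printf("half-RTT latency: %.2f us, bandwidth: %.2f GB/s\n",
               rtt / 2 * 1e6, bw);
    }

    free(buf);
    MPI_Finalize();
    return 0;
}
```

Sweeping the message size from a few bytes up to several MiB separates the latency-bound regime (small messages) from the bandwidth-bound regime (large messages).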
Network vs. Fabric
- network
- general-purpose, standards-based connectivity (typically Ethernet), multi-vendor equipment
- fabric
- designed as optimized interconnect
- single-vendor solution (Mellanox InfiniBand, Intel Omni-Path)
- single system build for a specific application
- spread network traffic across multiple physical links (multipath)
- scalable fat-tree and mesh topologies
- more sophisticated routing to provide redundancy and high throughput
- non-blocking interconnect (no over-subscription); cf. the fat-tree sizing sketch below
- low-latency, Layer-2-style connectivity
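As a rough illustration of how non-blocking fat-tree fabrics scale, the sketch below sizes a two-level leaf/spine topology built from k-port switches; the chosen port counts are illustrative assumptions, not tied to any specific product.

```c
/* Sizing sketch for a non-blocking two-level fat-tree (leaf/spine) built
 * from k-port switches. Each leaf uses k/2 ports for hosts and k/2 uplinks,
 * so downlink and uplink capacity match (no over-subscription). */
#include <stdio.h>

int main(void)
{
    int radix[] = {36, 48, 64};              /* common switch port counts */
    for (int i = 0; i < 3; i++) {
        int k      = radix[i];
        int leaves = k;                      /* one leaf per spine port */
        int spines = k / 2;                  /* one spine per leaf uplink */
        int hosts  = leaves * (k / 2);       /* k/2 host ports per leaf */
        printf("k=%2d ports: %3d leaf + %2d spine switches -> %5d hosts, non-blocking\n",
               k, leaves, spines, hosts);
    }
    return 0;
}
```

Matching host ports and uplinks on every leaf is what makes the design non-blocking; over-subscribed designs trade some of those uplinks for additional host ports.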
Offload vs. Onload
- network functions performed mostly in software “onload” (Ethernet, Omni-Path) [^2]
- requires CPU resources ⇒ decreases cycles available to hosted applications
- network functions performed by hardware “offload” (InfiniBand, RoCE), aka Intelligent Interconnect
- Network hardware performs communication operations (including data aggregation); cf. the overlap sketch below
- Increases resource availability of the CPU (improves overall efficiency)
- Particularly advantageous for scatter/gather-type collective operations
- trade-off
- more capable network infrastructure (offload) vs. incrementally more CPUs on servers (onload)
- advantage of offloading increases with the size of the interconnected clusters (higher node count = more messaging)
- comparison of InfiniBand & Omni-Path [^1]
- message rate test (excluding overhead of data polling) to understand the impact of the network protocol on CPU utilization
- result: InfiniBand CPU resource utilization <1%, Omni-Path >40%
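The practical payoff of offload is that communication can progress in the network hardware while the CPU keeps computing. The sketch below overlaps a non-blocking MPI collective with local work, assuming an MPI-3 library; the buffer sizes and the local_compute() placeholder are illustrative assumptions.

```c
/* Sketch of compute/communication overlap with a non-blocking collective.
 * With hardware offload, the reduction can progress on the NIC/switch while
 * the CPU runs local_compute(); with an onload stack, progressing it consumes
 * CPU cycles inside the MPI library instead. */
#include <mpi.h>
#include <stdio.h>

#define N (1 << 20)

static double sendbuf[N], recvbuf[N], work[N];

static void local_compute(void)
{
    /* placeholder for application work that does not touch the buffers
     * involved in the pending reduction */
    for (int i = 0; i < N; i++) work[i] = work[i] * 0.5 + 1.0;
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    MPI_Request req;
    /* start the reduction, then keep the CPU busy with independent work */
    MPI_Iallreduce(sendbuf, recvbuf, N, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD, &req);
    local_compute();
    MPI_Wait(&req, MPI_STATUS_IGNORE);   /* reduction result now in recvbuf */

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0) printf("allreduce of %d doubles complete\n", N);

    MPI_Finalize();
    return 0;
}
```

On an onload stack the same call pattern works, but driving the reduction forward tends to steal CPU cycles from local_compute(), which is the effect the message-rate comparison above is measuring.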
Comparison
- Ethernet 10/25/40/50/100G (200G in 2018/19)
- Widely in production, supported by many manufacturers (Cisco, Brocade, Juniper, etc.)
- Easy to deploy, widespread expert knowledge
- “High” latency (µs rather than ns)
- InfiniBand 40/56/100G (200G 2017)
- Widely used in HPC, cf. TOP500
- De facto led by Mellanox
- Omni-Path 100G (future roadmap?)
- Intel proprietary
- Still in its infancy (very few production installations)
- Claims better bandwidth/latency/message rate than InfiniBand