HPC — Network Interconnects
- HPI (High Performance Interconnect) [^3]
- equipment designed for very high bandwidth and extremely low latency
- inter-node communication supporting clusters with large node counts
- technologies in the HPI market:
- Ethernet, RoCE (RDMA over Converged Ethernet)
- InfiniBand
- Intel Omni-Path
- Cray Aries XC
- SGI NUMALink
- HPI evaluation criteria
- reliability of inter-node communication
- sufficient link bandwidth to meet peak requirements
- sufficiently low latency
- load on node CPUs by the communication stack
- TCO (total cost of ownership) of the equipment in relation to overall performance
Network vs. Fabric
- network
- designed as universal interconnect
- vendor interoperability by design (for example, Ethernet)
- all-to-all communication for any application
- fabric
- designed as optimized interconnect
- single-vendor solution (Mellanox InfiniBand, Intel Omni-Path)
- single system build for a specific application
- spread network traffic across multiple physical links (multipath)
- scalable fat-tree and mesh topologies
- more sophisticated routing to allow redundancy and high throughput
- non-blocking interconnect (no over-subscription); see the fat-tree sizing sketch after this list
- low latency layer 2-type connectivity
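A rough way to see how far such fat-tree fabrics scale: in a non-blocking two-tier (leaf-spine) fat tree built from k-port switches, each leaf splits its ports evenly between hosts and uplinks, giving k²/2 hosts; a three-tier fat tree reaches k³/4. The sketch below is a minimal illustration of that arithmetic; the example switch radixes are hypothetical and not taken from the text above.

```c
#include <stdio.h>
#include <stddef.h>

/* Hosts in a non-blocking two-tier (leaf-spine) fat tree: each k-port leaf
 * uses k/2 ports for hosts and k/2 uplinks, and the k-port spines can attach
 * up to k leaves, so capacity = k * k/2 hosts. */
static unsigned hosts_two_tier(unsigned radix) {
    return radix * (radix / 2);
}

/* A full three-tier fat tree built from k-port switches reaches k^3/4 hosts. */
static unsigned hosts_three_tier(unsigned radix) {
    return radix * radix * radix / 4;
}

int main(void) {
    unsigned radixes[] = {36, 40, 64};   /* hypothetical switch port counts */
    for (size_t i = 0; i < sizeof radixes / sizeof radixes[0]; i++)
        printf("radix %2u: 2-tier %5u hosts, 3-tier %6u hosts\n",
               radixes[i], hosts_two_tier(radixes[i]), hosts_three_tier(radixes[i]));
    return 0;
}
```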
Modern AI/HPC networks are built around the GPU servers.
Offload vs. Onload
- network functions performed mostly in software “onload” (Ethernet, Omni-Path) [^2]
- requires CPU resources ⇒ decreases cycles available to hosted applications
- network functions performed by hardware “offload” (InfiniBand, RoCE), a.k.a. Intelligent Interconnect
- Network hardware performs communication operations (including data aggregation)
- Increases resource availability of the CPU (improves overall efficiency)
- Particularly advantageous for scatter/gather-type collective operations (see the MPI sketch after this section)
- trade-off
- more capable network infrastructure (offload) vs. incrementally more CPUs on servers (onload)
- advantage of offloading increases with the size of the interconnected clusters (higher node count = more messaging)
- comparison of InfiniBand & Omni-Path [^1]
- message rate test (excluding the overhead of data polling) to understand the impact of the network protocol on CPU utilization
- result: InfiniBand CPU resource utilization <1%, Omni-Path >40%
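To make the collective/offload point concrete, the sketch below times a plain MPI_Allreduce, the kind of aggregation that an offload-capable fabric (e.g., with in-network reduction) can take off the host CPUs. It is a generic, minimal MPI example under assumed parameters (1M doubles per rank), not one of the benchmarks from the cited comparison.

```c
#include <mpi.h>
#include <stdio.h>

/* Every rank contributes a vector and receives the element-wise global sum.
 * On "onload" stacks the reduction and protocol progress burn host CPU
 * cycles; an offload-capable fabric can aggregate in the network instead. */
int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    enum { N = 1 << 20 };                 /* 1M doubles per rank (assumed) */
    static double local[N], global[N];
    for (int i = 0; i < N; i++) local[i] = rank + 1.0;

    double t0 = MPI_Wtime();
    MPI_Allreduce(local, global, N, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    double t1 = MPI_Wtime();

    if (rank == 0)
        printf("allreduce of %d doubles across %d ranks: %.3f ms\n",
               N, size, (t1 - t0) * 1e3);

    MPI_Finalize();
    return 0;
}
```

Built with `mpicc` and run with e.g. `mpirun -np 4`, the same source runs over Ethernet, InfiniBand, or Omni-Path; only the underlying transport and its CPU cost differ.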
Ethernet vs InfiniBand
- Ethernet …widely in production …broad ecosystem
- …many manufacturers at all layers …rapid innovation
- …easy to deploy …widespread expert knowledge
- …many tools for operation, management, tests
- InfiniBand
- …mostly used in HPC, cf. TOP500
- …de-facto monopoly by NVIDIA/Mellanox
- Omni-Path …Intel proprietary
RDMA Transport
(R)DMA — Remote Direct Memory Access
- …extends DMA (Direct Memory Access) over networks
- …bypasses the CPU & OS kernel/caches …zero-copy data transfers (see the memory-registration sketch after this list)
- …hardware loss detection, retransmission, loss recovery
- Used for…
- …memory-to-memory communication over networks
- …communication between accelerators (GPUs)
- Performance improvements over TCP
- 85% increase in throughput
- 85% reduced latency
- 65% reduced CPU usage
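A minimal libibverbs sketch of the user-space side of RDMA: the application registers (pins) a buffer with the adapter, after which the NIC can move data into or out of it without kernel involvement or intermediate copies. Queue-pair setup and the actual RDMA read/write postings are omitted; using the first device and a 4 KiB buffer are arbitrary choices for illustration.

```c
#include <infiniband/verbs.h>
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    /* Enumerate RDMA-capable devices and open the first one. */
    struct ibv_device **devs = ibv_get_device_list(NULL);
    if (!devs || !devs[0]) { fprintf(stderr, "no RDMA device found\n"); return 1; }
    struct ibv_context *ctx = ibv_open_device(devs[0]);
    if (!ctx) { perror("ibv_open_device"); return 1; }

    /* A protection domain groups resources (QPs, MRs) that may work together. */
    struct ibv_pd *pd = ibv_alloc_pd(ctx);

    /* Register (pin) an application buffer with the adapter. The returned
     * lkey/rkey let the local and remote NICs access it directly
     * (zero-copy, kernel bypass) once queue pairs are connected. */
    size_t len = 4096;
    void *buf = malloc(len);
    struct ibv_mr *mr = ibv_reg_mr(pd, buf, len,
                                   IBV_ACCESS_LOCAL_WRITE |
                                   IBV_ACCESS_REMOTE_READ |
                                   IBV_ACCESS_REMOTE_WRITE);
    if (!mr) { perror("ibv_reg_mr"); return 1; }

    printf("registered %zu bytes: lkey=0x%x rkey=0x%x\n", len, mr->lkey, mr->rkey);

    /* Not shown: create CQ + QP, exchange QP number and rkey out of band,
     * then post RDMA WRITE/READ work requests that the NIC executes. */
    ibv_dereg_mr(mr);
    free(buf);
    ibv_dealloc_pd(pd);
    ibv_close_device(ctx);
    ibv_free_device_list(devs);
    return 0;
}
```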
History timeline…
- 1993 RDMA concept (patent by HP)
- 1994 onward, research and development
- 1999 InfiniBand Trade Association (IBTA) formed
- 2000 InfiniBand Architecture Spec 1.0
- 2000s (early) InfiniBand gains use in HPC (pushed by Mellanox)
- 2000s (mid) Intel/Microsoft shift to PCIe
- 2010 RoCE introduced
- 2014 RoCEv2 released to improve on RoCEv1 (routable over UDP/IP)
- 2010s (mid) NVMe over Fabrics
- 2010s (late) RoCE adoption in commercial data centers (cloud/finance)
- 2019 NVIDIA acquires Mellanox
RoCE — RDMA over Converged Ethernet
- RoCEv1 …no IP header …not routable
- RoCEv2 — RDMA over UDP/IP
- …routable over layer 3
- …requires a lossless network (e.g., PFC) …congestion control hard to tune
- …limited multi-pathing …relies on in-order packet delivery
- Uses ARP to resolve the remote MAC address (see the address-resolution sketch below)
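The routable, IP-based nature of RoCEv2 is visible in how connections are set up with librdmacm: rdma_resolve_addr() binds the destination IP to a local RDMA device and resolves the remote MAC (ARP/neighbor discovery), and rdma_resolve_route() resolves the layer-3 path. A minimal sketch; the target address 192.0.2.10 and port 7471 are hypothetical.

```c
#include <rdma/rdma_cma.h>
#include <arpa/inet.h>
#include <stdio.h>
#include <string.h>

int main(void) {
    /* Event channel + CM id: the librdmacm analogue of a socket. */
    struct rdma_event_channel *ec = rdma_create_event_channel();
    struct rdma_cm_id *id;
    if (rdma_create_id(ec, &id, NULL, RDMA_PS_TCP)) { perror("rdma_create_id"); return 1; }

    /* Hypothetical target; with RoCEv2 this is an ordinary routable IP. */
    struct sockaddr_in dst;
    memset(&dst, 0, sizeof dst);
    dst.sin_family = AF_INET;
    dst.sin_port = htons(7471);
    inet_pton(AF_INET, "192.0.2.10", &dst.sin_addr);

    /* Resolve IP -> RDMA device + remote MAC, then the layer-3 route.
     * Each step completes asynchronously via a CM event. */
    struct rdma_cm_event *ev;
    if (rdma_resolve_addr(id, NULL, (struct sockaddr *)&dst, 2000)) {
        perror("rdma_resolve_addr"); return 1;
    }
    rdma_get_cm_event(ec, &ev);                /* expect ADDR_RESOLVED */
    printf("event: %s\n", rdma_event_str(ev->event));
    rdma_ack_cm_event(ev);

    if (rdma_resolve_route(id, 2000)) { perror("rdma_resolve_route"); return 1; }
    rdma_get_cm_event(ec, &ev);                /* expect ROUTE_RESOLVED */
    printf("event: %s\n", rdma_event_str(ev->event));
    rdma_ack_cm_event(ev);

    /* Not shown: create a QP on id->verbs and rdma_connect() to the target. */
    rdma_destroy_id(id);
    rdma_destroy_event_channel(ec);
    return 0;
}
```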
Ultra Ethernet
- …runs over IPv4/IPv6 and Ethernet
- …multipath RDMA
End-to-End NVMe
For networks between hosts and storage systems
NVMe — Protocol command set for block storage
- …replaces SCSI …uses PCIe transmission channels
- …reduced latency & improved bandwidth (compared to SCSI/SAS); see the identify sketch below
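To show the NVMe command set itself, the sketch below issues an Identify Controller admin command (opcode 0x06, CNS 1) to a local PCIe NVMe controller through the Linux passthrough ioctl and prints the model string; the device path /dev/nvme0 is an assumption.

```c
#include <linux/nvme_ioctl.h>
#include <sys/ioctl.h>
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void) {
    int fd = open("/dev/nvme0", O_RDONLY);    /* assumed controller device */
    if (fd < 0) { perror("open /dev/nvme0"); return 1; }

    /* Identify Controller: admin opcode 0x06, CNS = 1, returns a 4 KiB page. */
    uint8_t data[4096] = {0};
    struct nvme_admin_cmd cmd;
    memset(&cmd, 0, sizeof cmd);
    cmd.opcode   = 0x06;
    cmd.addr     = (uint64_t)(uintptr_t)data;
    cmd.data_len = sizeof data;
    cmd.cdw10    = 1;                         /* CNS = 1 -> controller data */

    if (ioctl(fd, NVME_IOCTL_ADMIN_CMD, &cmd) < 0) {
        perror("NVME_IOCTL_ADMIN_CMD");
        return 1;
    }

    /* Bytes 24..63 of the Identify Controller data hold the model number. */
    char model[41] = {0};
    memcpy(model, data + 24, 40);
    printf("model: %s\n", model);

    close(fd);
    return 0;
}
```

With NVMe-oF the same command set is carried over a fabric transport instead of the local PCIe bus; the block-level semantics stay unchanged.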
NVMe-oF — NVMe over Fabrics
- Overcomes limits of NVMe over PCIe
- …limited bus addresses
- …connection distance limits
- Extends NVMe to various storage networks
- …map NVMe commands and data to multiple fabric links
- …Fibre Channel, InfiniBand, RoCE v2, iWARP, and TCP
- …reduces overhead for processing storage network protocol stacks
NVMe over RoCE
- …combine NVMe with low latency and low CPU usage of RDMA
- …converges the LANs and SANs of data centers
nvmetcli — configure an NVMe-over-Fabrics target