HPC — Network Interconnects

HPC
Network
Published

March 26, 2018

Modified

July 23, 2025

Network vs Fabric

  • network
    • designed as universal interconnect
    • vendor interoperability by design (for example Ethernet)
    • all-to-all communication for any application
  • fabric
    • designed as optimized interconnect
    • single-vendor solution (Mellanox InfiniBand, Intel Omni-Path)
    • single system build for a specific application
    • spread network traffic across multiple physical links (multipath)
    • scalable fat-tree and mesh topologies (sizing sketch after this list)
    • more sophisticated routing to allow redundancy and high-throughput
    • non-blocking interconnect (no over-subscription)
    • low latency layer 2-type connectivity
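
A quick way to see why fat-trees scale: a non-blocking k-ary fat-tree built from identical k-port switches supports k^3/4 hosts on 5k^2/4 switches. A minimal sizing sketch in C (standard k-ary fat-tree arithmetic, not tied to any particular vendor):

    /* Non-blocking k-ary fat-tree from identical k-port switches:
     *   k pods, each with k/2 edge and k/2 aggregation switches,
     *   (k/2)^2 core switches, and k^3/4 attachable hosts. */
    #include <stdio.h>

    int main(void)
    {
        for (int k = 4; k <= 64; k *= 2) {
            int hosts    = k * k * k / 4;
            int core     = (k / 2) * (k / 2);
            int switches = k * k + core;   /* pod switches + core switches */
            printf("k=%2d ports  hosts=%6d  switches=%5d\n", k, hosts, switches);
        }
        return 0;
    }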

Modern AI/HPC networks are built around the GPU servers.

Offload vs. Onload

  • network functions performed mostly in software “onload” (Ethernet, Omni-Path) [^2]
    • requires CPU resources ⇒ decreases cycles available to hosted applications
  • network functions performed by hardware “offload” (InfiniBand, RoCE), aka Intelligent Interconnect
    • Network hardware performs communication operations (including data aggregation)
    • Increases resource availability of the CPU (improves overall efficiency)
    • Particularly advantageous in scatter/gather-type collective problems (see the MPI sketch after this list)
  • trade-off
    • more capable network infrastructure (offload) vs. incrementally more CPUs on servers (onload)
    • advantage of offloading increases with the size of the interconnected clusters (higher node count = more messaging)
  • comparison of InfiniBand & Omni-Path [^1]
    • message rate test (excluding overhead of data polling) to understand the impact of the network protocol on CPU utilization
    • result: InfiniBand CPU resource utilization <1%, Omni-Path >40%
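
To make the offload advantage concrete: collective operations such as all-reduce combine contributions from every rank, and an offload-capable fabric (e.g. switch-side aggregation) can perform the reduction in the network instead of on the host CPUs. A minimal MPI all-reduce in C (illustrative; whether it actually runs offloaded depends on the fabric and the MPI stack):

    /* Every rank contributes one value; all ranks receive the global sum. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        double local = (double)rank;            /* this rank's contribution */
        double global = 0.0;
        MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

        if (rank == 0)
            printf("sum over %d ranks: %.0f\n", size, global);

        MPI_Finalize();
        return 0;
    }

Build with mpicc and launch with mpirun; the communication pattern is the same whether the reduction happens on the hosts (onload) or in the network (offload).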

Ethernet vs InfiniBand

  • Ethernet …widely in production …broad ecosystem
    • …many manufacturers at all layers …rapid innovation
    • …easy to deploy …widespread expert knowledge
    • …many tools for operation, management, tests
  • InfiniBand
    • …mostly used in HPC, cf. TOP500
    • …de-facto monopoly by NVIDIA/Mellanox
  • Omni-Path …Intel proprietary

RDMA Transport

(R)DMA — Remote Direct Memory Access

  • …extends DMA (Direct Memory Access) over networks
  • …bypasses the CPU & OS kernel/cache …zero-copy data transfers (verbs sketch after this list)
  • …hardware loss detection, retransmission, loss recovery
  • Used for…
    • …memory-to-memory communication over networks
    • …communication between accelerators (GPUs)
  • Performance improvements over TCP
    • 85% increase in throughput
    • 85% reduced latency
    • 65% reduced CPU usage
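
A minimal sketch of the RDMA setup path using libibverbs (rdma-core): open a device, allocate a protection domain, and register a buffer so the NIC can read/write it directly. Queue pairs, connection management and the actual transfers are omitted; error handling is shortened.

    #include <stdio.h>
    #include <stdlib.h>
    #include <infiniband/verbs.h>

    int main(void)
    {
        struct ibv_device **devs = ibv_get_device_list(NULL);
        if (!devs || !devs[0]) { fprintf(stderr, "no RDMA device found\n"); return 1; }

        struct ibv_context *ctx = ibv_open_device(devs[0]);   /* first HCA */
        struct ibv_pd *pd = ibv_alloc_pd(ctx);                /* protection domain */

        /* Register a buffer for direct NIC access (zero-copy): the pages are
         * pinned and the HCA learns the virtual-to-physical mapping, so data
         * moves without involving the CPU or the kernel on the data path. */
        size_t len = 4096;
        void *buf = malloc(len);
        struct ibv_mr *mr = ibv_reg_mr(pd, buf, len,
                                       IBV_ACCESS_LOCAL_WRITE |
                                       IBV_ACCESS_REMOTE_READ |
                                       IBV_ACCESS_REMOTE_WRITE);

        printf("registered %zu bytes, lkey=0x%x rkey=0x%x\n", len, mr->lkey, mr->rkey);

        ibv_dereg_mr(mr);
        free(buf);
        ibv_dealloc_pd(pd);
        ibv_close_device(ctx);
        ibv_free_device_list(devs);
        return 0;
    }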

History timeline…

  • 1993 RDMA concept (patent by HP)
  • 1994 onward, research and development
  • 1999 InfiniBand Trade Association (IBTA) formed
  • 2000 InfiniBand Architecture Spec 1.0
  • 2000s (early) InfiniBand gains use in HPC (pushed by Mellanox)
  • 2000s (mid) Intel/Microsoft shift to PCIe
  • 2010 RoCE introduced
  • 2010s (mid) NVMe over Fabric
  • 2010s (late) RoCE adoption in commercial data center (cloud/finance)
  • 2014 RoCEv2 to improve performance
  • 2019 NVIDIA acquires Mellanox

RoCE — RDMA over Converged Ethernet

  • RoCEv1 …no IP header …not routable
  • RoCEv2 — RDMA over UDP/IP (framing sketch after this list)
    • …routable over layer 3
    • …requires a lossless network …congestion control hard to tune
    • …limited multi-pathing …in-order packet delivery required
  • Uses ARP to get remote MAC address
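
A rough sketch of the RoCEv2 framing in C (layout simplified, byte order and bit packing not handled; the IBTA specification is authoritative):

    /* RoCEv2: Ethernet | IP (v4/v6) | UDP dport 4791 | IB BTH | payload | ICRC
     * RoCEv1: Ethernet (ethertype 0x8915) | IB GRH | BTH | ...  -> not routable */
    #include <stdint.h>
    #include <stdio.h>

    #define ROCEV2_UDP_DPORT 4791u       /* IANA-assigned UDP port for RoCEv2 */

    /* InfiniBand Base Transport Header (BTH), 12 bytes, carried as UDP payload */
    struct ib_bth {
        uint8_t  opcode;         /* operation, e.g. RDMA WRITE / RDMA READ / SEND */
        uint8_t  se_m_pad_tver;  /* solicited event, migration, pad count, version */
        uint16_t pkey;           /* partition key */
        uint32_t dest_qp;        /* 8 reserved bits + 24-bit destination queue pair */
        uint32_t apsn;           /* ack-request bit + 24-bit packet sequence number */
    };

    int main(void)
    {
        printf("BTH size: %zu bytes, RoCEv2 UDP dport: %u\n",
               sizeof(struct ib_bth), ROCEV2_UDP_DPORT);
        return 0;
    }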

Ultra Ethernet

  • …runs over IPv4/IPv6 and Ethernet
  • …multipath RDMA

End-to-End NVMe

For networks between hosts and storage systems

NVMe — Protocol command set for block storage

  • …replaces SCSI …uses PCIe transmission channels
  • …reduced latency & improved bandwidth (compared to SCSI/SAS) …Identify command sketch below
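
To make "protocol command set" concrete, the sketch below issues an NVMe Identify Controller admin command through the Linux passthrough ioctl and prints the model and serial strings. The device path is an assumption (a local controller at /dev/nvme0) and it needs sufficient privileges.

    #include <stdio.h>
    #include <stdint.h>
    #include <string.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/ioctl.h>
    #include <linux/nvme_ioctl.h>

    int main(void)
    {
        int fd = open("/dev/nvme0", O_RDONLY);
        if (fd < 0) { perror("open /dev/nvme0"); return 1; }

        uint8_t id[4096] = {0};              /* Identify Controller data buffer */
        struct nvme_admin_cmd cmd = {0};
        cmd.opcode   = 0x06;                 /* Identify */
        cmd.addr     = (uintptr_t)id;
        cmd.data_len = sizeof(id);
        cmd.cdw10    = 1;                    /* CNS=1: Identify Controller */

        if (ioctl(fd, NVME_IOCTL_ADMIN_CMD, &cmd) < 0) {
            perror("NVME_IOCTL_ADMIN_CMD");
            close(fd);
            return 1;
        }

        char sn[21] = {0}, mn[41] = {0};
        memcpy(sn, id + 4, 20);              /* serial number, bytes 4..23  */
        memcpy(mn, id + 24, 40);             /* model number,  bytes 24..63 */
        printf("model: %.40s\nserial: %.20s\n", mn, sn);

        close(fd);
        return 0;
    }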

NVMe-oF — NVMe over Fabric

  • Overcomes limits of NVMe over PCIe
    • …limited bus addresses
    • …connection distance limits
  • Extends NVMe to various storage networks
    • …map NVMe commands and data to multiple fabric links
    • …Fibre Channel, InfiniBand, RoCE v2, iWARP, and TCP
  • …reduces overhead for processing storage network protocol stacks

NVMe over Fabric Host Driver

NVMe over RoCE

  • …combine NVMe with low latency and low CPU usage of RDMA
  • …converges the LANs and SANs of data centers
  • nvmetcli — Configure NVMe-over-Fabrics Target