RoCE — RDMA over Converged Ethernet

Tags: Network, HPC
Published: October 16, 2025
Modified: October 17, 2025

Overview

Version  Description
RoCEv1   Deprecated (limited to OSI layer 2, therefore not routable)
RoCEv2   RDMA over UDP/IP (layer 3), therefore routable

RoCE (RDMA over Converged Ethernet):

  • Extends the benefits of RDMA to Ethernet networks
  • Enables a convergence of storage and regular data traffic on a single network plane

Give me the short version!

  • RoCE requires a lossless network …switches with DCB (Data Center Bridging)
  • PFC (Priority Flow Control) is used to create lossless priority queues
  • PFC can cause head-of-line (HoL) blocking and congestion spreading
  • DCQCN-based congestion control is the safety net for PFC deadlocks
  • However, Ethernet congestion control mechanisms are hard to tune

Network Congestion

Ethernet operates on best-effort layer-2 delivery (no native prioritization)…

  • Congestion - Network traffic overwhelms available capacity
  • Congestion indicators:
    • Increased latency & reduced bandwidth
    • High packet loss & jitter (variability in packet arrival times)

By default all traffic has an equal chance of being dropped in Ethernet

Lossless Ethernet

Lossless Ethernet is based on two pillars:

  1. Traffic management (QoS)
    • …prioritize or isolate traffic classes during congestion
    • How a single device handles traffic when it’s overwhelmed
  2. Congestion control: how senders avoid overwhelming the network
    • Prevent congestion …dynamically adjusting sender rates based on network feedback
    • …cannot create bandwidth …ensure physical capacity > demand

Prerequisite

QoS [1][2] configuration for Mellanox HCAs requires mlnx_qos (MLNX Tools package).

# package is included in the NVIDIA DOCA distribution
wget https://www.mellanox.com/downloads/DOCA/DOCA_v3.1.0/host/doca-host-3.1.0-091000_25.07_rhel94.x86_64.rpm
rpm2cpio doca-host-3.1.0-091000_25.07_rhel94.x86_64.rpm | cpio -idmv

# uplink port on the switch
>>> networkctl status eth4 | grep Connected
              Connected To: Leaf2 on port 100GE1/2/1 (Linkto_Client6)

# HCA needs to be in Ethernet mode
>>> mlxconfig -d mlx5_0 query | grep LINK_TYPE
        LINK_TYPE_P1                                ETH(2) 

# print the current RoCE mode for a device port
>>> cma_roce_mode -d mlx5_0 -p 1
RoCE v2

# display the RDMA links
>>> rdma link show
link mlx5_0/1 state ACTIVE physical_state LINK_UP netdev eth4
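The last check can be scripted when several ports have to be verified. The following is a minimal sketch, assuming the `rdma link show` line layout shown above; `active_rdma_netdevs` is a hypothetical helper name, not part of any NVIDIA tooling.

```shell
# Hypothetical helper: list RDMA links that are ACTIVE and bound to a netdev.
# Reads `rdma link show` output on stdin.
active_rdma_netdevs() {
    awk '$1 == "link" {
        state = ""; netdev = ""
        # walk the key/value pairs that follow the link name
        for (i = 3; i < NF; i += 1) {
            if ($i == "state")  state  = $(i + 1)
            if ($i == "netdev") netdev = $(i + 1)
        }
        if (state == "ACTIVE" && netdev != "") print $2, netdev
    }'
}

# sample line taken from the output above
echo 'link mlx5_0/1 state ACTIVE physical_state LINK_UP netdev eth4' | active_rdma_netdevs
```

On a live system the input would come from `rdma link show | active_rdma_netdevs`.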

Configuration

Traffic Management

QoS (Quality of Service) allows selecting specific network traffic and prioritizing it

  • Not a single solution – it’s a toolbox
  • Implemented using extensions and higher-layer protocols

Core QoS Frameworks for Ethernet

  • 802.1p, CoS (Class of Service)
    • PCP (Priority Code Point), VLAN header …pause traffic on any of eight priorities
    • Limited to a single broadcast domain (VLAN) …intra-VLAN prioritization
    • Priority is lost when traffic crosses routers
    • Untagged traffic: CoS 0 …cannot be paused
  • DSCP (Differentiated Services Code Point), Layer 3
    • DSCP value must be mapped explicitly in the configuration to a PFC priority
    • End-to-end across IP networks (routers must be configured)
    • DSCP-based PFC [3] values in the Layer 3 IP header
    • Untagged traffic - DSCP 0 …best-effort

Start with DSCP for end-to-end control, use CoS for local VLAN segments

ToS  +----> DSCP +------> PFC +----> TC
                           +          +
                           |          |
                           v          v
Hardware                 Buffer     Queue

DSCP

DSCP extends IP Precedence: the ToS (Type of Service) [4] byte (1981) evolved into the DiffServ [5] (Differentiated Services) field (1998)

  • DSCP reuses the 8-bit ToS field: 6 bits for the DSCP value (64 possible values), the remaining 2 bits for ECN
  • Depending on the configuration format and tooling, check the DSCP to ToS conversion table [6]
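The conversion is plain bit arithmetic, so it can also be computed instead of looked up. A minimal sketch; `dscp_to_tos` and `tos_to_dscp` are hypothetical helper names introduced here for illustration.

```shell
# ToS byte = (DSCP << 2) | ECN, where ECN is the 2-bit field (0..3).
dscp_to_tos() {
    # $1 = DSCP (0..63), $2 = ECN bits (0..3, defaults to 0)
    echo $(( ($1 << 2) | ${2:-0} ))
}

tos_to_dscp() {
    # $1 = ToS byte (0..255)
    echo $(( $1 >> 2 ))
}

dscp_to_tos 26      # prints 104: DSCP 26 with the ECN bits cleared
dscp_to_tos 26 2    # prints 106: DSCP 26 with ECN bits 0b10
tos_to_dscp 106     # prints 26
```

This explains why DSCP 26 shows up as ToS 104 in some tables and as ToS 106 in configurations that set an ECN-capable codepoint.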

Force DSCP [7] on a Mellanox device:

# use L3 PFC, default=pcp (L2 PFC)
mlnx_qos -i eth0 --trust dscp
# verify configuration
>>> mlnx_qos -i eth0
#...
Priority trust state: dscp
#...

Mapping

Remap a DSCP value to a specific PFC priority (see below)

# DSCP 26 (ToS 106 = 26 << 2, plus ECN bits 0b10) to priority 4
mlnx_qos -i eth0 --dscp2prio='set,26,4'

(Optional) Set default ToS for all RoCE traffic (non persistent)

cma_roce_tos -d mlx5_0 -t 106

# double check
cat /sys/kernel/config/rdma_cm/mlx5_0/ports/1/default_roce_tos 

PFC

PFC (Priority Flow Control) [9] prevents packet drops due to buffer overflow

  • RoCEv2 relies on PFC-enabled TC to create lossless Ethernet
  • Prevent traffic loss when congestion occurs on Layer 2 (PCP) or Layer 3 (DSCP) interfaces
  • Works in conjunction with QoS queues to enhance Ethernet pause frame function
  • Can pause traffic on a per-application basis by associating applications with a priority value

When network congestion occurs PFC (alone) may impose back-pressure on an upstream port

  • HoL (Head-of-Line) blocking:
    • Single congested queue (e.g. due to PFC pause frames) stalls all traffic on that priority
    • PFC pauses entire priority channel (not individual flows) …unrelated traffic waits
  • Mitigate the cascade reaction (spreading to remote switches) with congestion control

PFC behavior is unpredictable if VLAN-tagged packets are received on an interface with DSCP-based PFC enabled.
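Whether pause frames are actually exchanged on a priority can be read from the NIC statistics. The sketch below assumes the mlx5 driver exposes per-priority counters named `rx_prio<N>_pause` and `tx_prio<N>_pause` in `ethtool -S`; other drivers may use different names. `pause_counters` is a hypothetical helper, and the sample values are made up.

```shell
# Hypothetical helper: print the pause frame counters of one PFC priority.
# Reads `ethtool -S <netdev>` style output on stdin.
pause_counters() {
    # $1 = PFC priority (0..7)
    awk -v prio="$1" '
        $1 ~ ("^(rx|tx)_prio" prio "_pause:$") { sub(":", "", $1); print $1, $2 }
    '
}

# made-up sample input in the "name: value" layout of ethtool -S
printf '     rx_prio4_pause: 12\n     tx_prio4_pause: 0\n     rx_prio3_pause: 7\n' | pause_counters 4
```

On a live system the input would come from `ethtool -S eth4 | pause_counters 4`. Steadily growing counters on the lossless priority suggest that PFC is doing the work that congestion control should prevent.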

Traffic Classes

TC (Traffic Class, TClass) — Intermediate value between PFC Priority and Queue ID

  • TCs determine:
    • Which hardware queue the packet goes into
    • Whether PFC (Priority Flow Control) applies
    • Scheduling priority (e.g., strict priority vs. weighted)
  • Device driver maps TC to DSCP field in IP header
  • DSCP provides end-to-end marking, TCs provide hop-by-hop forwarding behavior
  • Without DSCP-to-TC mapping, RoCEv2 traffic won’t get the required lossless treatment.

Switches/NICs map DSCP values to TCs (Layer 2, 3-bit priority)

  • DSCP 40 (CS5) → TC 5 → PFC enabled → Lossless queue for RoCEv2
  • DSCP 0 (CS0) → TC 0 → No PFC → Best-effort queue

Priorities

Set the desired priority:

  • Each number corresponds to a priority, 0 through 7.
  • Setting the 5th number to 1 enables priority 4, the 5th entry in the sequence (0,1,2,3,4,5,6,7).

# enable PFC priority 4
mlnx_qos -i eth4 --pfc 0,0,0,0,1,0,0,0
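The off-by-one between list position and priority number is easy to get wrong, so the list can be generated. A minimal sketch; `pfc_list` is a hypothetical helper, not an mlnx_qos feature.

```shell
# Hypothetical helper: build the eight-entry list for `mlnx_qos --pfc`
# from the priorities (0..7) that should be lossless.
pfc_list() (
    list=""
    for prio in 0 1 2 3 4 5 6 7; do
        bit=0
        for want in "$@"; do
            [ "$want" -eq "$prio" ] && bit=1
        done
        # append the bit, comma-separated after the first entry
        list="${list}${list:+,}${bit}"
    done
    echo "$list"
)

pfc_list 4      # prints 0,0,0,0,1,0,0,0
pfc_list 3 4    # prints 0,0,0,1,1,0,0,0
```

The result can be passed on directly, for example `mlnx_qos -i eth4 --pfc "$(pfc_list 4)"`.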

Display the current priorities

# verify configuration
>>> mlnx_qos -i eth4
PFC configuration:
        priority    0   1   2   3   4   5   6   7
        enabled     0   0   0   0   1   0   0   0    
        buffer      0   0   0   0   1   0   0   0
#                                   ^-------- enabled

Headroom

Tune PFC Headroom Size [10]

  • The system uses the cable length and the maximum receive unit (MRU) to calculate the amount of buffer headroom reserved to support PFC.
  • The shorter the cable and the lower the MRU, the less headroom buffer space is required for PFC.

# tune PFC Headroom Size
mlxlink -d mlx5_0 -m -c -e | grep 'Transfer Distance'
mlnx_qos -i eth4 --cable_len=5

# verify configuration
>>> mlnx_qos -i eth4
#...
Cable len: 5
#...

Congestion Control

DCQCN [11] (Data Center Quantized Congestion Notification)

  • Manages congestion for PFC-enabled priority queues …coexists with QoS
    • Congestion control algorithm designed specifically for lossless Ethernet networks
    • DCQCN does not replace CoS/DSCP (It’s an add-on for lossless traffic)
    • Switches must support PFC and ECN on PFC queues

Per-flow congestion control protocol …before PFC triggers …enabled by two features: ECN & PFC

  • ECN (Explicit Congestion Notification)
    • Used between hosts separated by multiple network devices & different routed segments
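    • ECN uses the two low-order bits of the ToS byte in the IP header: senders set one of them to mark a packet ECN-capable, a congested switch sets both (CE, Congestion Experienced)
  • CNP (Congestion Notification Packet): the receiver sends a CNP back to the sender if…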
    • …packet with an ECN mark (added by a switch) received
    • …out-of-order packet received (packet loss occurred)
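The ECN state of a packet can be read from the ToS byte, for example when inspecting a capture. A minimal sketch following the codepoints defined in RFC 3168; `ecn_codepoint` is a hypothetical helper name.

```shell
# Hypothetical helper: decode the ECN field (low two bits) of a ToS byte.
ecn_codepoint() {
    # $1 = ToS byte (0..255)
    case $(( $1 & 3 )) in
        0) echo "Not-ECT" ;;   # sender is not ECN-capable
        1) echo "ECT(1)" ;;    # ECN-capable transport
        2) echo "ECT(0)" ;;    # ECN-capable transport
        3) echo "CE" ;;        # congestion experienced, marked by a switch
    esac
}

ecn_codepoint 104   # prints Not-ECT
ecn_codepoint 106   # prints ECT(0)
ecn_codepoint 107   # prints CE
```

A RoCE packet sent with ToS 106 that arrives with ToS 107 has been marked by a congested switch on the way, which is what triggers a CNP.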

Note that ConnectX-6 (and newer) supports RTTCC [12] (Round-Trip Time Congestion Control)

Verification

Tool                  Description
rdma_{server,client}  Basic connectivity demo (simple client-server communication)
rping                 Simple RDMA ping-pong
ucmatose              Like rping …allows using a specific Type-of-Service (ToS)
ib_send_bw            Bandwidth tests (traditional message passing pattern)
ib_send_lat           Latency (measures round-trip time)

# install test tools
dnf install -y librdmacm-utils perftest

Connection

# Does RDMA work?
rdma_server              # node a
rdma_client -s $server   # node b

Performs RDMA transfers between two nodes:

# server side (persistent, multiple clients can connect)
rping -PvVs

# client side …`-C` message count
rping -vVcC 10 -a $server_ip

# server
ucmatose -t $tos

# client
ucmatose -t $tos -s $server

Bandwidth

Point to Point Bandwidth Test

Example:

# server (receiver)
ib_send_bw -d mlx5_0

# client (sender)
ib_send_bw -d mlx5_0 --report_gbits $server_ip -F

Options:

  • --report_gbits
  • --run_infinitely (report every 5 seconds)
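To compare runs, for example before and after a QoS change, the average bandwidth can be extracted from the report. This is a minimal sketch, assuming the usual perftest result row layout (#bytes, #iterations, BW peak, BW average, MsgRate); `bw_average` is a hypothetical helper and the sample row is made up.

```shell
# Hypothetical helper: print the "BW average" column of each result row.
# Reads ib_send_bw output on stdin; result rows start with the message size.
bw_average() {
    awk '$1 ~ /^[0-9]+$/ && NF >= 5 { print $4 }'
}

# made-up sample: header row followed by one result row
printf ' #bytes #iterations BW peak[Gb/sec] BW average[Gb/sec] MsgRate[Mpps]\n 65536 1000 97.50 97.45 0.185861\n' | bw_average
```

On a live system the input would come from the client side, for example `ib_send_bw -d mlx5_0 --report_gbits $server_ip -F | bw_average`.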

Footnotes

  1. Understanding QoS Configuration for RoCE, NVIDIA
    https://enterprise-support.nvidia.com/s/article/understanding-qos-configuration-for-roce

  2. Quality of Service (QoS), MLNX OFED Documentation, NVIDIA
    https://docs.nvidia.com/networking/display/mlnxenv543750lts/quality+of+service+(qos)

  3. Understanding PFC Using DSCP at Layer 3 for Untagged Traffic, Juniper
    https://www.juniper.net/documentation/us/en/software/junos/traffic-mgmt-qfx/cos/topics/concept/cos-lossless-l3-dscp-pfc-understanding.html

  4. IP Precedence and DSCP Values
    https://networklessons.com/quality-of-service/ip-precedence-dscp-values

  5. Definition of the Differentiated Services Field (DS Field), RFC 2474
    https://www.ietf.org/rfc/rfc2474.txt

  6. DSCP to ToS conversion table
    https://bytesolutions.com/dscp-tos-cos-precedence-conversion-chart

  7. Lossless RoCE Configuration for Linux Drivers in DSCP-Based QoS Mode, NVIDIA
    https://enterprise-support.nvidia.com/s/article/lossless-roce-configuration-for-linux-drivers-in-dscp-based-qos-mode

  8. Assured Forwarding, RFC 2597
    https://datatracker.ietf.org/doc/html/rfc2597

  9. Data Center Storage and Lossless Ethernet, HPE
    https://arubanetworking.hpe.com/techdocs/VSG/docs/040-dc-design/esp-dc-design-025-lossless-ethernet/#priority-flow-control

  10. Enable L3 PFC + DCQCN for RoCE on Mellanox ConnectX NICs, 2023/07/24
    https://blog.mylab.cc/2023/07/24/Enable-L3-PFC-DCQCN-for-RoCE-on-Mellanox-ConnectX-NICs

  11. Understanding RoCEv2 Congestion Management, NVIDIA
    https://enterprise-support.nvidia.com/s/article/understanding-rocev2-congestion-management

  12. Scaling Zero Touch RoCE Technology with Round Trip Time Congestion Control, NVIDIA
    https://developer.nvidia.com/blog/scaling-zero-touch-roce-technology-with-round-trip-time-congestion-control