RoCE — RDMA over Converged Ethernet
Overview
| Version | Description | 
|---|---|
| RoCEv1 | Deprecated (limited to OSI layer 2, therefore not routable) | 
| RoCEv2 | RDMA over UDP/IP (layer 3), therefore routable | 
RoCE (RDMA over Converged Ethernet):
- Extends the benefits of RDMA to Ethernet networks
- Enables a convergence of storage and regular data traffic on a single network plane
Give me the short version!
- RoCE requires a lossless network …switches with DCB (Data Center Bridging)
- PFC (Priority Flow Control) is used to create lossless priority queues
- PFC can cause head-of-line (HoL) blocking and congestion spreading
- DCQCN-based congestion control is the safety net for PFC deadlocks
- However, Ethernet congestion control mechanisms are hard to tune
Network Congestion
Ethernet operates on best-effort layer-2 delivery (no native prioritization)…
- Congestion - Network traffic overwhelms available capacity
- Congestion indicators:
- Increased latency & reduced bandwidth
- High packet loss & jitter (variability in packet arrival times)
 
By default all traffic has an equal chance of being dropped in Ethernet
Lossless Ethernet
Lossless Ethernet is based on two pillars:
- Traffic management (QoS)
- …prioritize or isolate traffic classes during congestion
- How a single device handles traffic when it’s overwhelmed
 
- Congestion control — How senders avoid overwhelming the network
- Prevent congestion …dynamically adjusting sender rates based on network feedback
- …cannot create bandwidth …ensure physical capacity > demand
 
Prerequisite
QoS[1][2] configuration for Mellanox HCAs requires mlnx_qos from the MLNX Tools package.
# package is included in the NVIDIA DOCA distribution
wget https://www.mellanox.com/downloads/DOCA/DOCA_v3.1.0/host/doca-host-3.1.0-091000_25.07_rhel94.x86_64.rpm
rpm2cpio doca-host-3.1.0-091000_25.07_rhel94.x86_64.rpm | cpio -idmv
# uplink port on the switch
>>> networkctl status eth4 | grep Connected
              Connected To: Leaf2 on port 100GE1/2/1 (Linkto_Client6)
# HCA needs to be in Ethernet mode
>>> mlxconfig -d mlx5_0 query | grep LINK_TYPE
        LINK_TYPE_P1                                ETH(2) 
# print the current RoCE mode for a device port
>>> cma_roce_mode -d mlx5_0 -p 1
RoCE v2
# display the RDMA links
>>> rdma link show
link mlx5_0/1 state ACTIVE physical_state LINK_UP netdev eth4
Configuration
Traffic Management
QoS (Quality of Service) selects specific network traffic and prioritizes it
- Not a single solution – it’s a toolbox
- Implemented using extensions and higher-layer protocols
Core QoS Frameworks for Ethernet
- 802.1p, CoS (Class of Service)
- PCP (Priority Code Point), VLAN header …pause traffic on any of eight priorities
- Limited to a single broadcast domain (VLAN) …intra-VLAN prioritization
- Priority is lost when traffic crosses routers
- Untagged traffic: CoS 0…cannot be paused
 
- DSCP (Differentiated Services Code Point), Layer 3
- DSCP value must be mapped explicitly in the configuration to a PFC priority
- End-to-end across IP networks (routers must be configured)
- DSCP-based PFC[3] values in the Layer 3 IP header
- Untagged traffic - DSCP 0 …best-effort
 
Start with DSCP for end-to-end control, use CoS for local VLAN segments
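As a small illustration of the 802.1p item above: the 3-bit PCP occupies the top bits of the 16-bit tag control field (TCI) in the VLAN header, followed by one DEI bit and the 12-bit VLAN ID. The sketch below packs and unpacks that field with plain shell arithmetic. The function names `vlan_tci` and `tci_pcp` are hypothetical helpers for this example only, not part of any tool.

```shell
# Build a 16-bit 802.1Q TCI from PCP (0-7), DEI (0-1) and VLAN ID (0-4095).
vlan_tci() {
    pcp=$1; dei=$2; vid=$3
    echo $(( (pcp << 13) | (dei << 12) | vid ))
}

# Extract the 3-bit PCP (priority) from a TCI value.
tci_pcp() {
    echo $(( ($1 >> 13) & 7 ))
}

vlan_tci 4 0 100   # priority 4 on VLAN 100 -> 32868
tci_pcp 32868      # -> 4
```

Because the priority lives in the VLAN tag, it disappears as soon as the tag is stripped, which is why untagged or routed traffic needs DSCP instead.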
ToS  +----> DSCP +------> PFC +----> TC
                           +          +
                           |          |
                           v          v
Hardware                 Buffer     Queue
DSCP
Extended IP Precedence, ToS (Type of Service)[4] (1981) evolved into the DiffServ[5] (Differentiated Services) field (1998)
- DiffServ reuses the 8-bit ToS field: 6 bits for the DSCP value (64 possible values), and the remaining 2 bits for ECN
- Depending on the configuration format and tooling, check the DSCP to ToS conversion table[6]
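The relationship between DSCP, ECN and the ToS byte is plain bit arithmetic: ToS = (DSCP << 2) | ECN. The following is a minimal sketch with hypothetical helper names (`dscp_to_tos`, `tos_to_dscp`). It also shows why DSCP 26 is listed as ToS 104 in conversion tables but appears as ToS 106 once the ECN bits are set to ECT(0) (binary 10).

```shell
# ToS byte = 6-bit DSCP in the upper bits, 2-bit ECN in the lower bits.
dscp_to_tos() {
    dscp=$1; ecn=${2:-0}      # ECN defaults to 0 (Not-ECT)
    echo $(( (dscp << 2) | ecn ))
}

# Drop the two ECN bits to recover the DSCP value.
tos_to_dscp() {
    echo $(( $1 >> 2 ))
}

dscp_to_tos 26      # -> 104
dscp_to_tos 26 2    # -> 106, ECN bits = 10 (ECT(0))
tos_to_dscp 106     # -> 26
```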
Force DSCP[7] on a Mellanox device:
# use L3 PFC, default=pcp (L2 PFC)
mlnx_qos -i eth0 --trust dscp
# verify configuration
>>> mlnx_qos -i eth0
#...
Priority trust state: dscp
#...
Header
| DSCP (dec) | DSCP (bin) | Class | Usage | 
|---|---|---|---|
| 0 | 000000 | CS0 | Default traffic, no special QoS treatment | 
| 8 | 001000 | CS1 | Background traffic, less priority than best effort | 
| 16 | 010000 | CS2 | Low priority data, less sensitive to delay | 
| 24 | 011000 | CS3 | Medium priority data, moderate sensitivity to delay | 
| 32 | 100000 | CS4 | High priority data, sensitive to delay | 
| 40 | 101000 | CS5 | Critical data, very sensitive to delay | 
| 46 | 101110 | EF | Expedited Forwarding, real-time traffic like voice and video | 
| 48 | 110000 | CS6 (NC) | Reserved for network control traffic | 
| 56 | 111000 | CS7 (NC) | Network control with the highest priority | 
CS (Class-Selector) code points are a subset of DSCP:
- Use for compatibility with IP precedence ToS
- Leftmost 3 bits define the class (CS0–CS7)
AF[8] (Assured Forwarding): provides assured delivery within each class…
- …with different drop probabilities during congestion
- Class selectors alone define only priorities; AF adds class-based traffic handling with congestion-aware dropping
- Uses four classes (AF1x to AF4x, sharing the upper three bits of CS1 to CS4) …each class placed in a different queue
- Within each class three levels for drop probability
- Queue full …drop packets marked “high drop” first
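The AF code points defined in RFC 2597 follow a fixed pattern: the class x (1 to 4) occupies the upper three bits and the drop precedence y (1 to 3) the next two, so the DSCP value of AFxy is 8x + 2y. A short sketch, with `af_to_dscp` as a hypothetical helper name:

```shell
# DSCP value of AF<class><drop>: class in bits 5-3, drop precedence in bits 2-1.
af_to_dscp() {
    class=$1; drop=$2
    echo $(( (class << 3) | (drop << 1) ))
}

af_to_dscp 1 1   # AF11 -> 10
af_to_dscp 3 1   # AF31 -> 26
af_to_dscp 4 3   # AF43 -> 38
```

AF31 evaluates to DSCP 26, the value that the mapping example in this document assigns to a PFC priority.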
 
Mapping
Remap a DSCP value to a specific PFC priority (see below)
# DSCP 26 (ToS 106) to priority 4
mlnx_qos -i eth0 --dscp2prio='set,26,4'
(Optional) Set default ToS for all RoCE traffic (non-persistent)
cma_roce_tos -d mlx5_0 -t 106
# double check
cat /sys/kernel/config/rdma_cm/mlx5_0/ports/1/default_roce_tos
PFC
PFC (Priority Flow Control)[9]: prevents packet drops due to buffer overflow
- RoCEv2 relies on PFC-enabled TC to create lossless Ethernet
- Prevent traffic loss when congestion occurs on Layer 2 (PCP) or Layer 3 (DSCP) interfaces
- Works in conjunction with QoS queues to enhance Ethernet pause frame function
- Can pause traffic on a per-application basis by associating applications with a priority value
When network congestion occurs PFC (alone) may impose back-pressure on an upstream port
- HoL (Head-of-Line) blocking:
- Single congested queue (e.g. due to PFC pause frames) stalls all traffic on that priority
- PFC pauses entire priority channel (not individual flows) …unrelated traffic waits
 
- Mitigate the cascade reaction (congestion spreading to remote switches) with congestion control
PFC behavior is unpredictable if VLAN-tagged packets are received on an interface with DSCP-based PFC enabled.
Traffic Classes
TC (Traffic Class, TClass) — Intermediate value between PFC Priority and Queue ID
- TCs determine:
- Which hardware queue the packet goes into
- Whether PFC (Priority Flow Control) applies
- Scheduling priority (e.g., strict priority vs. weighted)
 
- Device driver maps TC to DSCP field in IP header
- DSCP provides end-to-end marking, TCs provide hop-by-hop forwarding behavior
- Without DSCP-to-TC mapping, RoCEv2 traffic won’t get the required lossless treatment.
Switches/NICs map DSCP values to TCs (Layer 2, 3-bit priority)
- DSCP 40 (CS5) → TC 5 → PFC enabled → Lossless queue for RoCEv2
- DSCP 0 (CS0) → TC 0 → No PFC → Best-effort queue
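Both examples above are consistent with a mapping that takes the upper three bits of the DSCP value as the priority (DSCP >> 3). The sketch below assumes that mapping purely for illustration. The mapping actually in effect on a device is whatever `mlnx_qos -i <interface>` reports, and `dscp_to_prio` is a hypothetical helper.

```shell
# Assumed mapping for illustration only: priority = upper 3 bits of the DSCP.
dscp_to_prio() {
    echo $(( $1 >> 3 ))
}

for dscp in 0 26 40; do
    printf 'DSCP %2d -> priority %d\n' "$dscp" "$(dscp_to_prio "$dscp")"
done
```

Under this assumed mapping DSCP 26 lands on priority 3, so moving it to priority 4 takes an explicit `--dscp2prio` remap like the one in the Mapping section.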
Priorities
Set the desired priority:
- Each number corresponds to a priority, 0 through 7.
- Setting the 5th number to 1 enables the 5th priority in the sequence, which is 4 (0,1,2,3,4,5,6,7).
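To avoid off-by-one mistakes when counting positions, the comma-separated list can be generated from the priority number. This is a minimal sketch. `pfc_mask` is a hypothetical helper that enables exactly one priority.

```shell
# Build the comma-separated 8-value list with a single priority enabled.
pfc_mask() {
    enabled=$1
    mask=""
    for prio in 0 1 2 3 4 5 6 7; do
        if [ "$prio" -eq "$enabled" ]; then bit=1; else bit=0; fi
        mask="${mask}${mask:+,}${bit}"   # add a comma only between values
    done
    echo "$mask"
}

pfc_mask 4   # -> 0,0,0,0,1,0,0,0
```

The result could then be passed on as `mlnx_qos -i eth4 --pfc "$(pfc_mask 4)"`.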
# enable PFC priority 4
mlnx_qos -i eth4 --pfc 0,0,0,0,1,0,0,0
Display the current priorities
# verify configuration
>>> mlnx_qos -i eth4
PFC configuration:
        priority    0   1   2   3   4   5   6   7
        enabled     0   0   0   0   1   0   0   0    
        buffer      0   0   0   0   1   0   0   0
#                                   ^-------- enabled
Headroom
Tune PFC Headroom Size[10]
- The system uses the cable length and the maximum receive unit (MRU) to calculate the amount of buffer headroom reserved to support PFC.
- The shorter the cable and the lower the MRU, the less headroom buffer space is required for PFC.
# tune PFC Headroom Size
mlxlink -d mlx5_0 -m -c -e | grep 'Transfer Distance'
mlnx_qos -i eth4 --cable_len=5
# verify configuration
>>> mlnx_qos -i eth4
#...
Cable len: 5
#...
Congestion Control
DCQCN[11] (Data Center Quantized Congestion Notification)
- Manages congestion for PFC-enabled priority queues …coexists with QoS
- Congestion control algorithm designed specifically for lossless Ethernet networks
- DCQCN does not replace CoS/DSCP (It’s an add-on for lossless traffic)
- Switches must support PFC and ECN on PFC queues
 
Per-flow congestion control protocol …before PFC triggers …enabled by two features: ECN & PFC
- ECN (Explicit Congestion Notification)
- Used between hosts separated by multiple network devices & different routed segments
- ECN uses the two low-order bits of the ToS byte in the IP header; a congested switch marks a packet by setting both bits to 1 (CE)
 
- CNP (Congestion Notification Packet) is sent back to the sender if…
- …packet with an ECN mark (added by a switch) received
- …out-of-order packet received (packet loss occurred)
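The ECN field is the lowest two bits of the ToS byte, with the four code points defined in RFC 3168. The sketch below decodes them from a ToS value. `ecn_name` is a hypothetical helper for this example.

```shell
# Decode the two ECN bits (lowest bits of the ToS byte), per RFC 3168.
ecn_name() {
    case $(( $1 & 3 )) in
        0) echo "Not-ECT" ;;   # sender is not ECN-capable
        1) echo "ECT(1)" ;;    # ECN-capable transport
        2) echo "ECT(0)" ;;    # ECN-capable transport
        3) echo "CE" ;;        # congestion experienced, set by a switch
    esac
}

ecn_name 106   # -> ECT(0)
ecn_name 107   # -> CE
```

A packet sent with ToS 106 is ECN-capable. If a switch on the path rewrites it to 107, the receiver sees the CE mark and responds with a CNP.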
 
Note that ConnectX-6 (and newer) supports RTTCC[12] (Round‑Trip Time Congestion Control)
Verification
| Tool | Description | 
|---|---|
| rdma_{server,client} | Basic connectivity demo (simple client-server communication) | 
| rping | Simple RDMA ping-pong | 
| ucmatose | Like rping, but allows setting a specific Type-of-Service (ToS) | 
| ib_send_bw | Bandwidth tests (traditional message passing pattern) | 
| ib_send_lat | Latency (measures round-trip time) | 
# install test tools
dnf install -y librdmacm-utils perftest
Connection
# Does RDMA work?
rdma_server              # node a
rdma_client -s $server   # node b
Performs RDMA transfers between two nodes:
# server side (persistent, multiple clients can connect)
rping -PvVs
# client side …`-C` message count
rping -vVcC 10 -a $server_ip
# server
ucmatose -t $tos
# client
ucmatose -t $tos -s $server
Bandwidth
Point to Point Bandwidth Test
Example:
# server (receiver)
ib_send_bw -d mlx5_0
# client (sender)
ib_send_bw -d mlx5_0 --report_gbits $server_ip -F
Options:
- --report_gbits
- --run_infinitely (report every 5 seconds)
Footnotes
1. Understanding QoS Configuration for RoCE, NVIDIA
   https://enterprise-support.nvidia.com/s/article/understanding-qos-configuration-for-roce
2. Quality of Service (QoS), MLNX OFED Documentation, NVIDIA
   https://docs.nvidia.com/networking/display/mlnxenv543750lts/quality+of+service+(qos)
3. Understanding PFC Using DSCP at Layer 3 for Untagged Traffic, Juniper
   https://www.juniper.net/documentation/us/en/software/junos/traffic-mgmt-qfx/cos/topics/concept/cos-lossless-l3-dscp-pfc-understanding.html
4. IP Precedence and DSCP Values
   https://networklessons.com/quality-of-service/ip-precedence-dscp-values
5. Definition of the Differentiated Services Field (DS Field), RFC 2474
   https://www.ietf.org/rfc/rfc2474.txt
6. DSCP to ToS conversion table
   https://bytesolutions.com/dscp-tos-cos-precedence-conversion-chart
7. Lossless RoCE Configuration for Linux Drivers in DSCP-Based QoS Mode, NVIDIA
   https://enterprise-support.nvidia.com/s/article/lossless-roce-configuration-for-linux-drivers-in-dscp-based-qos-mode
8. Assured Forwarding, RFC 2597
   https://datatracker.ietf.org/doc/html/rfc2597
9. Data Center Storage and Lossless Ethernet, HPE
   https://arubanetworking.hpe.com/techdocs/VSG/docs/040-dc-design/esp-dc-design-025-lossless-ethernet/#priority-flow-control
10. Enable L3 PFC + DCQCN for RoCE on Mellanox ConnectX NICs, 2023/07/24
    https://blog.mylab.cc/2023/07/24/Enable-L3-PFC-DCQCN-for-RoCE-on-Mellanox-ConnectX-NICs
11. Understanding RoCEv2 Congestion Management, NVIDIA
    https://enterprise-support.nvidia.com/s/article/understanding-rocev2-congestion-management
12. Scaling Zero Touch RoCE Technology with Round Trip Time Congestion Control, NVIDIA
    https://developer.nvidia.com/blog/scaling-zero-touch-roce-technology-with-round-trip-time-congestion-control