
Revisiting RDMA Reliability for Lossy Fabrics

Wenxue Li∗ (Hong Kong University of Science and Technology and Huawei; Hong Kong, China)
Xiangzhou Liu (Hong Kong University of Science and Technology; Hong Kong, China)
Yunxuan Zhang (Hong Kong University of Science and Technology; Hong Kong, China)
Zihao Wang (Hong Kong University of Science and Technology; Hong Kong, China)
Wei Gu (Huawei; Nanjing, China)
Tao Qian (Huawei; Beijing, China)
Gaoxiong Zeng (Huawei; Shenzhen, China)
Shoushou Ren (Huawei; Beijing, China)
Xinyang Huang (Hong Kong University of Science and Technology; Hong Kong, China)
Zhenghang Ren (Hong Kong University of Science and Technology; Hong Kong, China)
Bowen Liu (Hong Kong University of Science and Technology; Hong Kong, China)
Junxue Zhang (Hong Kong University of Science and Technology; Hong Kong, China)
Kai Chen (Hong Kong University of Science and Technology; Hong Kong, China)
Bingyang Liu (Huawei; Shenzhen, China)

Abstract

Due to the high operational complexity and limited deployment scale of lossless RDMA networks, the community has been exploring efficient RDMA communication over lossy fabrics. State-of-the-art (SOTA) lossy RDMA solutions implement a simplified selective repeat mechanism in RDMA NICs (RNICs) to enhance loss recovery efficiency. However, these solutions still face performance challenges, such as unavoidable ECMP hash collisions and excessive retransmission timeouts (RTOs). In this paper, we revisit RDMA reliability with the goals of being independent of PFC, compatible with packet-level load balancing, free from RTO, and friendly to hardware offloading. To this end, we propose DCP, a transport architecture that co-designs both the switch and RNICs, fully meeting the design goals. At its core, DCP-Switch introduces a simple yet effective lossless control plane, which is leveraged by DCP-RNIC to enhance reliability support for high-speed lossy fabrics, primarily including header-only-based retransmission and bitmap-free packet tracking. We prototype DCP-Switch using a P4 switch and DCP-RNIC using FPGA. Extensive experiments demonstrate that DCP achieves 1.6× and 2.1× performance improvements, compared to SOTA lossless and lossy RDMA solutions, respectively.

CCS Concepts

• Networks → Transport protocols; Data center networks.

Keywords

RDMA NICs, Reliability, Lossy Fabrics, Lossless Control Plane

ACM Reference Format:
Wenxue Li, Xiangzhou Liu, Yunxuan Zhang, Zihao Wang, Wei Gu, Tao Qian, Gaoxiong Zeng, Shoushou Ren, Xinyang Huang, Zhenghang Ren, Bowen Liu, Junxue Zhang, Kai Chen, and Bingyang Liu. 2025. Revisiting RDMA Reliability for Lossy Fabrics. In ACM SIGCOMM 2025 Conference (SIGCOMM ’25), September 8–11, 2025, Coimbra, Portugal. ACM, New York, NY, USA, 14 pages. https://doi.org/10.1145/3718958.3750480

∗This work was done while Wenxue Li was an intern at Huawei.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
SIGCOMM ’25, Coimbra, Portugal
© 2025 Copyright held by the owner/author(s). Publication rights licensed to ACM.
ACM ISBN 979-8-4007-1524-2/25/09
https://doi.org/10.1145/3718958.3750480

1 Introduction

Remote Direct Memory Access (RDMA) is a widely adopted high-speed networking technique in modern datacenters (DCs), driven by the demanding performance requirements of applications [16, 22, 23, 41, 42, 53]. RDMA delivers high performance by offloading the entire network stack to RDMA NICs (RNICs). Traditional RNICs (RNIC-GBN) adopt a Go-Back-N (GBN) retransmission mechanism to handle packet loss, which significantly degrades performance in lossy Ethernet fabrics [42, 53, 56], prompting operators to rely on Priority Flow Control (PFC) [1] to ensure lossless transmission [16, 22, 23, 41, 56]. However, PFC is a coarse-grained mechanism that introduces several issues (§2.1), such as head-of-line (HoL) blocking, congestion spreading, deadlock, and restrictions on deployment distance [16, 23, 25, 29, 52].

Recently, both industry and academia have been exploring efficient RDMA communication over lossy fabrics (without PFC enabled) to avoid the limitations and drawbacks of PFC [5, 11, 40, 42, 49]. The primary goal of lossy RDMA transport is to maintain efficiency in lossy fabrics while preserving the offloading capabilities

of RNICs. Many previous works have focused on software-based transports, which do not benefit from RNIC offloading [18, 26, 47]. SOTA lossy RDMA solutions (RNIC-SR) implement a simplified selective repeat (SR) mechanism in RNICs to enhance loss recovery efficiency and avoid significant performance degradation due to packet loss [8, 9, 42, 53]. However, even with RNIC-SR, performance issues persist in lossy fabrics (§2.2).

First, RNIC-SR assumes flow-level single-path transmission, with Equal-Cost Multi-Path (ECMP) as the default load balancing (LB) scheme in the network. However, ECMP hashing collisions are inevitable, and their impact on throughput degradation is increasingly significant in today's datacenter workloads [38, 50]. Second, certain lost packets (such as retransmitted and tail packets) cannot be recovered through fast retransmission in RNIC-SR; instead, they rely on retransmission timeouts (RTOs). This leads to an increase in RTO events, significantly prolonging tail latency.

Packet-level LB techniques (such as packet spraying [21] and adaptive routing (AR) [3]) are promising alternatives to ECMP and are increasingly gaining attention in various scenarios, such as large-scale LLM training [22]. However, RNIC-SR is natively incompatible with packet-level LB. The root cause of this incompatibility is that packet-level LB can cause out-of-order (OOO) packet arrivals even in the absence of packet loss, whereas RNIC-SR assumes that all OOO packets are due to packet loss. As a result, combining packet-level LB with RNIC-SR leads to excessive spurious retransmissions, significantly degrading throughput.

Based on these issues, we pose the question: Can we revisit RDMA reliability to meet the following requirements?
R1 Independence from PFC to avoid its drawbacks, extending network scale and communication length.
R2 Compatibility with packet-level LB, eliminating hotspots and increasing network throughput.
R3 Ability to quickly retransmit any lost packet without relying on RTO, reducing network latency.
R4 A hardware-oriented design, with the feasibility of low memory and processing overhead.

We answer this question optimistically with DCP, a transport architecture consisting of DCP-RNIC, which revisits the reliability support for high-speed lossy fabrics, and DCP-Switch, which introduces a simple yet effective innovation to enhance DCP-RNIC. DCP conceptually defines the data plane (DP) for payload transfer and the control plane (CP) for header transfer. While lossless RDMA ensures that both DP and CP are lossless via PFC, DCP-Switch ensures a lossless CP while allowing the DP to operate in a lossy manner. The key design point of DCP is to leverage the lossless CP feature to enhance the RNICs' reliability, enabling compatibility with packet-level LB, precise retransmission, and minimal memory and processing overhead.

DCP-Switch implements a packet trimming mechanism similar to [18] (R1). Specifically, when there is no congestion, data packets are queued in the normal queue (a.k.a. data queue) and forwarded to the receiver. When the data queue becomes congested, the switch trims the payload from packets, modifies flags in the remaining header, and enqueues the header-only (HO) packet into another queue (a.k.a. control queue). To ensure lossless CP transfer (§4.2) while avoiding starvation of DP delivery, the switch uses a weighted round-robin (WRR) scheduler to prioritize the control queue and ensure lossless delivery of HO packets¹.

¹More precisely, "HO packet loss is very rare," as there is no mechanism that can prevent losses under heavy load.

Leveraging the lossless CP transfer, DCP-RNIC employs a precise and fast HO-based retransmission (§4.3). Specifically, upon receiving an HO packet, the receiver swaps the source and destination fields in the HO packet and forwards it to the sender. The sender then retransmits the lost packet precisely, based on the PSN carried by the HO packet (R3). This mechanism is unaffected by OOO packets caused by packet-level LB (R2). DCP-RNIC incorporates microarchitecture innovations to minimize PCIe transactions, enhancing retransmission efficiency. Furthermore, since HO packets are stateless, the retransmission rate is inherently tied to their arrival rate. To address this, DCP-RNIC adds a retransmission queue to enable a controllable retransmission rate.

DCP-RNIC supports order-tolerant packet reception at the receiver, enabling compatibility with packet-level LB (R2) (§4.4). This is achieved through a thoughtfully designed RDMA header extension that allows each packet to specify its corresponding memory address. Consequently, the receiver-side RNIC can directly write any packet, whether in-order or out-of-order, to its appropriate application memory location, eliminating the need for reordering buffers.

Traditionally, RNICs maintain bitmaps to track which packets have been received or lost. This bitmap-based approach introduces trade-offs between RNIC memory overhead and packet processing efficiency, which limit connection scalability [39, 53] and packet rate (packets per second) [30, 42], respectively. DCP-RNIC leverages the "exactly once" feature of the lossless CP to eliminate the need for packet-level bitmaps. Specifically, it employs a bitmap-free packet tracking scheme (§4.5), using packet counting to track aggregated message-level information, significantly reducing both memory overhead and packet processing cycles (R4). Furthermore, DCP-RNIC incorporates a coarse-grained timeout mechanism as a fallback to ensure reliability if the lossless control plane assumption is violated (e.g., link/switch crashes).

We prototype DCP-Switch using a P4 programmable switch [2] and DCP-RNIC using FPGA [4] (§5). The DCP-RNIC prototype consumes only 1.7% and 1.1% more computation and memory resources compared to RNIC-GBN. We evaluate DCP through extensive testbed experiments and simulations (§6). The testbed experiments show that DCP maintains consistent throughput when combined with AR² and achieves 1.6×∼72× higher loss recovery efficiency and 42% lower completion time for AI workloads, compared to the Mellanox CX5 RNIC. Large-scale simulations demonstrate that DCP achieves 2.1× and 1.6× improvements in realistic workloads, compared to IRN [42] and MP-RDMA [38], respectively. Moreover, DCP achieves greater improvement in cross-DC scenarios, and its lossless CP remains robust under severe incast congestion.

²We implement an in-network adaptive routing mechanism in this work.

2 Background and Motivation

2.1 Lossless RDMA Network

RDMA was originally designed for lossless InfiniBand networks. Thus, RNICs were initially equipped with a Go-Back-N retransmission mechanism to simplify their processing logic. RDMA over

ASIC                                  | Tomahawk 3    | Tomahawk 5    | Tofino 1      | Tofino 2      | Spectrum      | Spectrum-4
Capacity (ports × bandwidth per port) | 32 × 400 Gbps | 64 × 800 Gbps | 32 × 100 Gbps | 32 × 400 Gbps | 32 × 100 Gbps | 64 × 800 Gbps
Total buffer                          | 64 MB         | 165 MB        | 20 MB         | 64 MB         | 16 MB         | 160 MB
Buffer per port per 100 Gbps          | 0.5 MB        | 0.32 MB       | 0.62 MB       | 0.5 MB        | 0.5 MB        | 0.31 MB
Max. lossless length (1 queue)        | 4.1 km        | 2.62 km       | 5.08 km       | 4.1 km        | 4.1 km        | 2.56 km
Max. lossless length (8 queues)       | 512 m         | 327 m         | 634 m         | 512 m         | 512 m         | 320 m

Table 1: The maximum lossless communication distance with PFC enabled of various commodity switching ASICs.
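The lossless-length rows follow from dividing the per-port buffer by the round-trip in-flight traffic of the hop (the distance formula of §2.1, with roughly 5 µs of one-hop delay per km of fiber). A minimal sketch that approximately reproduces the single-queue row; small residuals against the table presumably come from rounding in the per-port buffer figures:

```python
# Max lossless distance with PFC: the reserved headroom must absorb the
# traffic still in flight during one round trip over the hop, i.e.
#   L = buffer / (bandwidth * one_hop_delay * 2)
# with one_hop_delay ~= 5 us per km (light travels ~2e8 m/s in fiber).

DELAY_PER_KM_S = 1e3 / 2e8  # = 5e-6 s per km of fiber

def max_lossless_km(buffer_bytes: float, bandwidth_bps: float) -> float:
    """Distance in km whose round-trip in-flight bits fit in the buffer."""
    return buffer_bytes * 8 / (bandwidth_bps * DELAY_PER_KM_S * 2)

# Per-port buffer per 100 Gbps from Table 1 (MB), single lossless queue.
per_port_mb = {"Tomahawk 3": 0.5, "Tomahawk 5": 0.32, "Tofino 1": 0.62,
               "Tofino 2": 0.5, "Spectrum": 0.5, "Spectrum-4": 0.31}

for asic, mb in per_port_mb.items():
    km = max_lossless_km(mb * 2**20, 100e9)
    # With 8 lossless queues the headroom is split 8 ways.
    print(f"{asic}: {km:.2f} km (1 queue), {km / 8 * 1000:.0f} m (8 queues)")
```

For Tomahawk 3, for instance, this lands near the table's 4.1 km single-queue figure.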

Converged Ethernet version 2 (RoCEv2) encapsulates the InfiniBand header within a UDP layer, enabling RDMA to operate over Ethernet. Traditional RoCEv2 RNICs (i.e., RNIC-GBN) inherit the Go-Back-N mechanism and depend on PFC [1] to transform Ethernet into a lossless fabric [42, 53, 56]. Despite its utility, PFC is a coarse-grained mechanism that introduces several challenges in performance degradation, operational complexity, and scalability limitations, inhibiting its broader deployment.

PFC operates as follows: when the ingress port/queue length exceeds a specified threshold, a PAUSE signal is sent to all related upstream egress ports/queues, instructing them to halt data transmission until a RESUME signal is received. As a result, PFC causes issues such as HoL blocking, PFC storms, and deadlocks, significantly degrading overall network performance [23, 25, 29, 56]. Moreover, because PFC spreads hop-by-hop, a single failure can have a cascading effect, potentially impacting the entire network. To mitigate this, operators employ various monitoring schemes to limit the scope of a failure. For example, a PFC watchdog detects if a queue has remained in the paused state for an abnormally long duration; if this occurs, the system disables PFC and drops all packets [16, 25]. However, this inevitably complicates network operations and cannot fully avoid all accidents.

Additionally, PFC requires switches to reserve sufficient buffer space (i.e., headroom) for in-flight packets between hops, making RoCEv2 suitable primarily for short-distance communication but impractical for long-distance scenarios (e.g., cross-datacenter) [16, 52]. Meanwhile, switch buffers are scarce resources, further exacerbating this situation. We list the capacity and buffer information of several commodity datacenter switching chips in Table 1. We calculate the maximum lossless communication distance (L) that these switches can support with PFC enabled as follows:

    L = buffer / (bandwidth × one-hop-delay × 2)    (1)

where, for example, the one-hop delay of a 1 km fiber is approximately 5 µs³. The results show that commodity switches face challenges in scaling the distance to tens of kilometers. Some works propose adopting off-chip DRAM to store in-flight packets, successfully enabling adaptation to 100 km distances [16]. However, this approach inevitably complicates network operations and degrades communication performance, as DRAM bandwidth is significantly lower than that of on-chip switch SRAM.

³5 µs = 10³ m / (2 × 10⁸ m/s), where 2 × 10⁸ m/s is the transmission speed of light in fiber.

2.2 Existing Solutions over Lossy Fabrics

Due to the inherent limitations of PFC, both industry and academia have been exploring efficient RDMA communication over lossy fabrics (without PFC enabled) to bypass its drawbacks [5, 11, 12, 40, 42, 49, 53]. On one hand, since traditional RNICs (i.e., RNIC-GBN) suffer significant throughput degradation upon packet loss due to their GBN retransmission logic, a class of works aims to develop efficient congestion control (CC) schemes that minimize overall switch queueing to reduce packet loss [12, 34, 37]. However, they do not address the root issue, as packet loss remains inevitable in lossy fabrics. On the other hand, many works focus on software-based efficient loss recovery mechanisms, which unfortunately cannot leverage RNIC offloading capabilities [18, 26, 47].

Therefore, one of the primary goals of lossy RDMA transport should be achieving efficient loss recovery while preserving the offloading capabilities of RNICs. SOTA lossy RDMA solutions, such as IRN [42] and new-generation ConnectX RNICs [8, 9], implement a simplified selective repeat (SR) mechanism in RNICs (RNIC-SR) to improve loss recovery efficiency. However, even with RNIC-SR, performance issues persist in lossy fabrics (we take IRN as an example⁴).

⁴We adopt IRN as a representative example of RNIC-SR solutions and primarily analyze IRN in this paper.

Issue #1: Incompatibility with Packet-level LB. The loss recovery mechanism of IRN operates as follows: upon every out-of-order (OOO) packet arrival, the IRN receiver sends a selective acknowledgment (SACK), which carries both the cumulative acknowledgment (ePSN) and the PSN of the OOO packets. When a SACK is received or when a timeout occurs, the IRN sender enters loss recovery mode. It maintains a bitmap to track which packets have been cumulatively and selectively acknowledged. When in loss recovery mode, the sender retransmits lost packets as indicated by the bitmap. A packet is considered lost only if another packet with a higher PSN has been SACKed.

This mechanism functions correctly in single-path transmission and is thus often used with ECMP in the network. However, ECMP collisions are unavoidable, and their impact on throughput degradation has become increasingly significant [38, 50]. Packet-level LB, such as adaptive routing [3], is gaining increasing interest from the community because it offers substantial advantages, such as avoiding ECMP collisions, eliminating network hotspots, and dynamically adapting to path failures. These benefits are particularly crucial in emerging scenarios such as LLM training and clouds [22, 27, 35, 49]. Although non-packet-level LBs, such as flowlet-based LB [14, 32, 51] and congestion-aware path switching [31, 46, 55], exist, they represent compromises that trade load balancing granularity for lower OOO degrees. Consequently, these approaches are generally less efficient than packet-level LB schemes.

IRN suffers from spurious retransmissions when combined with packet-level LBs. The root cause is that packet-level LB inherently causes OOO packet arrivals even in the absence of actual packet

[Figure 1: IRN generates significant spurious retransmissions, whereas DCP avoids them. (a) Retransmission ratio vs. flow size for IRN and DCP. (b) CDF of IRN's retransmission ratio for small (0~50KB), medium (50KB~2MB), and large (>2MB) flows.]

[Figure 2: IRN experiences excessive RTOs in both background and incast flows, whereas DCP avoids them. (a) Number of timeouts vs. flow size for background flows (IRN-ECMP, IRN-AR, DCP). (b) Number of timeouts for incast flows (IRN-ECMP, IRN-AR).]
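The failure mode behind Fig. 1 can be reproduced in miniature. The sketch below is a toy model, not the IRN implementation, of the loss-detection rule described in §2.2 (a packet is considered lost once another packet with a higher PSN has been SACKed), applied to an arrival trace that is reordered but loss-free:

```python
def spurious_retransmissions(arrival_order):
    """Count packets that the rule would retransmit in a loss-free trace:
    a PSN is declared lost once some higher PSN has arrived (and thus
    been SACKed) while this PSN is still missing at the receiver."""
    arrived, marked_lost = set(), set()
    for psn in arrival_order:
        arrived.add(psn)
        for lower in range(psn):
            if lower not in arrived:    # gap below a SACKed PSN
                marked_lost.add(lower)  # -> retransmitted, spuriously
    return len(marked_lost)

# Single-path (in-order) delivery: no spurious retransmissions.
print(spurious_retransmissions([0, 1, 2, 3, 4, 5]))  # 0

# Packet-level LB: same six packets, mildly reordered across two paths;
# PSNs 1 and 3 arrive late and get retransmitted although nothing dropped.
print(spurious_retransmissions([0, 2, 1, 4, 3, 5]))  # 2
```

DCP sidesteps this failure mode because retransmission is driven by trimmed HO packets, which signal an actual drop, rather than by inferring loss from arrival order.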
loss. These OOO arrivals prompt the receiver to send SACKs, which in turn trigger the sender to incorrectly enter loss recovery mode and unnecessarily retransmit a large number of packets. Thus, integrating them leads to excessive spurious retransmissions.

We conducted an experiment using NS3 to validate this observation. The topology is a two-layer CLOS network with 256 endhosts and 32 switches. We used a WebSearch [15] workload with an average load of 0.3. IRN and DCP are deployed at the endhosts, while adaptive routing is applied in the switches. During the experiment, we monitored loss events and found no packet loss in either setup. However, IRN generated a significant number of spurious retransmissions. Fig. 1a illustrates the ratio of retransmission packets across various flow sizes. The results reveal that retransmissions occur across all flow sizes, with a ratio of up to 100%. We further categorized the flow sizes into three classes: small, medium, and large. The cumulative distribution function (CDF) of their retransmission ratios is shown in Fig. 1b. The results indicate that approximately 50%, 80%, and 90% of small, medium, and large flows, respectively, experience spurious retransmissions. In contrast, all flows in DCP avoid incorrect retransmissions entirely.

Issue #2: Excessive RTOs. Certain lost packets in IRN cannot be recovered through fast retransmission and instead rely on retransmission timeouts (RTOs). Specifically, the IRN sender requires a SACK to trigger the loss recovery mode. Consequently, if the tail packet of a flow is lost, no SACK is generated, preventing recovery through fast retransmission and necessitating reliance on an RTO. Additionally, to avoid retransmission ambiguities, the sender enters the loss recovery mode only once and remains in this state until it exits. The IRN sender exits loss recovery mode only when it receives an ePSN greater than the largest PSN of outstanding packets prior to entering the loss recovery mode. As a result, if the retransmitted packets are dropped again, they can only be recovered through an RTO. The increase in RTO events can significantly degrade the performance of IRN.

The loss of tail and retransmitted packets is common in modern datacenters [24, 28, 36]. The reasons are two-fold. First, the average flow size in modern datacenters is becoming increasingly "shorter" [23, 33, 48], resulting in a higher frequency of tail packets. Second, packet loss often exhibits locality, as congestion tends to persist for a period. During this interval, all passing packets, including rapidly retransmitted ones, are likely to be dropped.

To validate this analysis, we conducted an experiment using NS3 with a CLOS topology. The workload consists of a WebSearch workload at an average load of 0.3 (as background traffic) combined with 128-to-1 incast traffic at an average load of 0.1. We evaluated IRN under both ECMP and adaptive routing (AR) and measured the number of timeouts. The results, illustrated in Fig. 2, indicate that IRN experiences excessive timeouts in both background and incast flows. Furthermore, IRN-AR encounters even more timeouts due to the spurious retransmission traffic it generates, which increases network load and exacerbates congestion. In contrast, all flows in DCP experience no timeouts.

3 Goals

Based on the above analysis, we aim to revisit RDMA reliability to meet the following requirements.

R1 Independence from PFC. The proposed solution must eliminate the dependence on PFC to completely avoid its associated drawbacks. This requires the ability to efficiently handle packet loss in Ethernet-based datacenters.

R2 Compatibility with packet-level LBs. RNIC-SR suffers from significant spurious retransmissions when combined with packet-level LBs. The proposed solution must be inherently compatible with packet-level LBs, which necessitates accurately distinguishing real packet losses from OOO packet arrivals.

R3 Fast retransmission for lost packets. RNIC-SR often fails to recover certain packet losses via fast retransmission, leading to excessive RTOs. The proposed solution should enable prompt retransmission of any lost packets. This necessitates an explicit loss notification scheme.

R4 Hardware-oriented design. While software solutions can bypass hardware resource limitations due to the greater resource availability in software, they cannot leverage RNIC offloading capabilities and face inherent performance limitations. Therefore, the proposed solution should adopt a hardware-oriented design, ensuring minimal memory and computational overhead simultaneously.

DCP vs. closely related works. We propose DCP, which satisfies all the above requirements. DCP revisits RDMA reliability to maintain high goodput under significant packet loss (R1), avoid spurious retransmissions when combined with packet-level LBs (R2), and ensure fast recovery for any lost packets (R3). DCP is hardware-friendly; we implemented a fully functional DCP prototype using FPGA and a P4 switch, demonstrating its line-rate efficiency and low memory/computational overhead (R4).

Table 2 summarizes the differences between DCP and its closely related works, including RNIC-GBN and RNIC-SR. Among these, MP-RDMA [38] redesigns RNICs to support packet-level multipath

Requirements           | R1 | R2 | R3 | R4
RNIC-GBN [7]           | ✗  | ✗  | ✗  | ✓
RNIC-SR [8, 9, 42, 53] | ✓  | ✗  | ✗  | ✓
MPTCP [47]             | ✓  | ✓  | ✗  | ✗
NDP [26]               | ✓  | ✓  | ✓  | ✗
CP [18]                | ✓  | ✓  | ✓  | ✗
MP-RDMA [38]           | ✗  | ✓  | ✗  | ✓
DCP                    | ✓  | ✓  | ✓  | ✓

Table 2: Comparison of DCP and closely related works.

[Figure 3: DCP workflow. The sender performs packet retransmission; the switch performs packet trimming and schedules the data and control queues with a WRR scheduler; the receiver performs packet reception & tracking. Data packets, HO packets, ACKs, and retransmitted data packets flow between sender, switch, and receiver.]
transmission. However, MP-RDMA still uses GBN as its loss recovery scheme, which is inefficient in the presence of packet loss and therefore still requires PFC to create a lossless environment. MPTCP [47], NDP [26], and CP [18] are software solutions designed primarily for TCP networks, not for RDMA networks.

Note that DCP focuses primarily on reliability, while specific LB and congestion control (CC) designs are orthogonal to its design goals. Regarding LB, DCP is compatible with any packet-level LB schemes, as it supports order-tolerant packet reception and avoids spurious retransmissions. For CC, although DCP currently integrates DCQCN [56], it is microarchitecturally compatible with any CC scheme [12, 34, 37], as DCP's retransmission and CC modules are designed to operate in a decoupled manner.

4 Design

4.1 DCP Overview

DCP co-designs the switch and RNICs to fully meet the design requirements (§3). It consists of DCP-Switch and DCP-RNIC, with its workflow illustrated in Fig. 3.

Upon receiving a DCP data packet, the switch⁵ determines the egress port based on load balancing requirements. It then checks whether the length of the egress data queue exceeds a specified threshold. If so, the switch trims the payload from the packet, modifies specific flags in the remaining header, and enqueues this header-only (HO) packet into a separate egress queue, i.e., the control queue. The switch uses a weighted round-robin (WRR) scheduler to prioritize the control queue over the data queue, ensuring a lossless control plane (CP) (§4.2) while preventing starvation of the data plane (DP) ( 1 ). Upon receiving the HO packet, the receiver swaps its source and destination IP and Queue Pair Number (QPN) fields and forwards it back to the sender ( 2 ).

⁵In §4, "switch" refers to DCP-Switch, and "receiver" and "sender" refer to the receiver- and sender-side DCP-RNICs, respectively.

Leveraging the lossless control plane, the sender employs an HO-based retransmission (§4.3) triggered by the HO packets ( 3 ). Since the HO packet explicitly carries the PSN of the lost packet, the sender precisely retransmits the lost packets indicated by it. We integrate RNIC micro-architecture innovations to further improve the efficiency of the retransmission phase. Furthermore, based on the ACK packet, the sender implements a coarse-grained timeout mechanism as a fallback to ensure reliability if the lossless control plane assumption is violated (e.g., link/switch crashes).

If the length of the egress data queue does not exceed the threshold, the data packet is enqueued in the egress data queue. Upon reaching the receiver, the data packet, whether in-order or out-of-order, is directly written to the appropriate location in application memory ( 4 ), based on an extended RDMA header that supports order-tolerant packet reception (§4.4). The receiver-side DCP-RNIC maintains the metadata of data packets in a bitmap-free packet tracking manner (§4.5) and sends acknowledgments (ACKs) to the sender based on this metadata, if necessary ( 5 ).

4.2 Lossless Control Plane

As shown in Fig. 4, two bits in the Type of Service (ToS) field of the IP header are reserved as the DCP tag to distinguish four types of packets classified by DCP.

• Non-DCP packets (DCP tag = 00): They are dropped by the switch when the buffer exceeds the defined threshold.
• DCP data packets (DCP tag = 10): This category includes both normal and retransmitted data packets. These packets are processed by the Packet Trimming module when the switch buffer exceeds the defined threshold.
• HO packets (DCP tag = 11): A DCP data packet is converted into an HO packet after its payload is trimmed by the Packet Trimming module in the switch.
• DCP ACK packets (DCP tag = 01): These packets contain acknowledgments and are used to expire sender messages.

Packet Trimming Module. Upon receiving a packet, the DCP-Switch first determines its egress port based on specific load-balancing schemes (e.g., AR or ECMP). If the packet is an HO packet, it is enqueued directly into the control queue. Otherwise, it is enqueued into the data queue when the data queue length is below a given threshold. When the data queue length exceeds the threshold, the switch handles the packet based on its type: if it is a non-DCP or DCP ACK packet, the packet is dropped; if it is a DCP data packet, the switch trims the payload from the packet, modifies the DCP tag in the remaining header to 11, and enqueues the remaining header into the control queue. As illustrated in Fig. 4(a), the remaining header in our design is 57 bytes⁶.

⁶57 bytes = 14 bytes MAC header + 20 bytes IP header + 8 bytes UDP header + 12 bytes BTH header + 3 bytes for the MSN field.

WRR Scheduling. The DCP-Switch employs WRR scheduling to manage the data and control queues, ensuring a lossless control queue. The weight of WRR (w) depends on the switch radix (N) and the ratio between the HO and data packet sizes (1 : r). Assuming the worst-case scenario where there is an (N−1)-to-1 incast burst in a switch and all data packets are trimmed, this will generate B × (N−1)/r traffic to the control queue, where B represents the port bandwidth. The draining rate of the control queue is B × w/(1+w). To ensure a lossless control queue, the draining rate must be at least equal to the input rate. Therefore, the weight w can be set

Figure 4: DCP extends the traditional RDMA header with specific fields (indicated with red bold text). [(a) DCP data packet header: Ethernet | IP | UDP | BTH | MSN | SSN | RETH, plus the DCP tag (ToS) and sRetryNo; the SSN exists only in two-sided (e.g., Send) operations, and the RETH exists only in one-sided (e.g., Write) operations. (b) DCP ACK packet header: Ethernet | IP | UDP | BTH | AETH, plus the DCP tag (ToS) and the eMSN (in the MSN field).]

Figure 5: Working steps of HO-based retransmission. The gray modules are those that DCP modifies. [Host memory holds the application and the QPs (RetransQ, SQ, RQ, CQ); across PCIe, the RNIC hardware comprises the DMA Engine, QP Scheduler, MTT, Rx path, CC, QPC, Retransmission, Tx path, and MAC modules; HO packets enter on the Rx path and retransmitted data packets leave on the Tx path.]

Therefore, the weight w can be set to (N-1)/(r-N+1) (theoretical value), meaning that the ratio of scheduled traffic volume between the control and data queues is (N-1)/(r-N+1) : 1. Note that this equation is valid only in scenarios where r > (N-1). In some cases where r < (N-1), we cannot theoretically guarantee lossless HO packet transmission by simply configuring the weight. However, we evaluated in §6.3 that a small w can effectively handle extreme incast scales. Note that when an HO packet is accidentally lost, DCP relies on a timeout for loss recovery instead of HO-based retransmission (detailed in §4.5).
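The weight setting above can be checked numerically. The sketch below is illustrative only: the function name wrr_weight and the example values (a 16-port switch, ~1 KB data packets trimmed to ~57 B headers, i.e., r = 18) are our own assumptions, not values fixed by the paper.

```python
def wrr_weight(n_ports, r):
    """Smallest WRR weight w whose drain rate B*w/(1+w) covers the
    worst-case control-queue input B*(N-1)/r: an (N-1)-to-1 incast
    with every data packet trimmed to a 1:r-sized header."""
    if r <= n_ports - 1:
        raise ValueError("lossless guarantee needs r > N - 1")
    return (n_ports - 1) / (r - n_ports + 1)

# Illustrative: N = 16 ports, r = 18 gives w = 15 / 3 = 5.
w = wrr_weight(16, 18)
# Losslessness check: drain-rate fraction covers the input-rate fraction.
assert w / (1 + w) >= (16 - 1) / 18 - 1e-12
```

With a larger header-to-data ratio r, the required weight shrinks quickly, which matches the paper's observation that a small w suffices in practice.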
4.3 Efficient HO-based Retransmission

We first describe the current packet-sending strategy of the transmit (Tx) path in DCP-RNIC. Then we outline the challenges of implementing an efficient retransmission mechanism based on HO packets. Finally, we present our solution.

DCP-RNIC Tx Path. As shown in Fig. 5, during the Tx path of DCP-RNIC, the Queue Pair (QP) Scheduler first determines which QP will be chosen to send Work Request Elements (WQEs)[7] in the next round. Then, the DMA Engine fetches the data of the selected QP and encapsulates the data into multiple packets. To reduce memory footprint, DCP-RNIC does not cache WQEs for active QPs that have unfinished WQEs in the Send Queue (SQ). Instead, it adopts a fetch-and-drop strategy for scheduling QPs and WQEs, which is commonly used in RNIC microarchitecture designs [53]. Specifically, the QP Scheduler first selects an active QP with an available window (awin), which is determined by the CC module. The DMA Engine then fetches up to n WQEs from its SQ. The RNIC processes the fetched WQEs and fetches up to min(round_quota, awin) bytes of data from application memory. If there are unused WQEs left in the RNIC after a scheduling round[8], these unused WQEs are dropped rather than cached in the RNIC and will be fetched again the next time this QP is scheduled. Adjusting the values of n and round_quota affects the tradeoff between PCIe bandwidth utilization and scheduling granularity. In our design, we set n to 8 and round_quota to 16 KB (≈ the PCIe BDP), to balance performance and scheduling granularity.
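The fetch-and-drop round described above can be sketched as follows. The data structures (an SQ modeled as a list of remaining message byte counts) and function names are illustrative assumptions, not the RNIC's actual microarchitecture.

```python
N_WQE = 8                # WQEs fetched per scheduling round
ROUND_QUOTA = 16 * 1024  # bytes, roughly the PCIe BDP

def schedule_round(sq, sq_head, awin):
    """One fetch-and-drop round: fetch up to N_WQE WQEs and send up to
    min(ROUND_QUOTA, awin) bytes; unused WQEs are dropped rather than
    cached and are re-fetched the next time this QP is scheduled.
    Each SQ entry is the remaining byte count of one message.
    Returns (bytes_sent, new_sq_head)."""
    quota = min(ROUND_QUOTA, awin)
    sent = 0
    for i, msg_bytes in enumerate(sq[sq_head:sq_head + N_WQE]):
        take = min(msg_bytes, quota - sent)
        sent += take
        sq[sq_head + i] = msg_bytes - take  # progress stays in the host SQ
        if sent == quota:
            break
    while sq_head < len(sq) and sq[sq_head] == 0:
        sq_head += 1  # advance past fully transmitted messages
    return sent, sq_head
```

For example, with a 20 KB message at the SQ head and an open window, one round sends 16 KB; the remaining 4 KB is re-fetched in the next round rather than cached on the NIC.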
Challenges. Unlike transmitting normal data packets in the Tx path, HO-based retransmission faces challenges in efficiency and compatibility. This is because HO packets have a unique feature: they are independent and stateless.

• #1: Inefficient retransmission: Since HO packets are independent, one way to apply the fetch-and-drop strategy during the loss recovery phase would be as follows: for each HO, the RNIC first fetches its corresponding WQE, processes the WQE, and fetches the associated data. This approach, however, results in significantly low throughput because each HO packet requires two PCIe transactions[9].

• #2: Incompatible with CC module: Since HO packets are stateless, the retransmission rate is tied to the receiving rate of the HO packets. Consequently, CC cannot regulate the retransmission rate, which may worsen congestion in certain scenarios. For instance, if there exists severe congestion that causes substantial packet loss, the generated HO packets may trigger excessive retransmissions, further aggravating the congestion.

Solution: HO-based Retransmission. We present our HO-based retransmission design, with Fig. 5 illustrating its working steps. As shown, the retransmission process is separated into the receive (Rx) and Tx paths, adopting a batched-fetch strategy to reduce PCIe transactions during retransmissions and incorporating a retransmission queue (RetransQ) in host memory for each QP to enable the CC module to regulate the retransmission rate.

In the Rx path, upon receiving an HO packet, the RNIC extracts metadata from its header and packages it into a retransmission entry, which consists of (MSN, PSN). Then the DMA Engine writes this retransmission entry into the corresponding QP's RetransQ located in the host memory. The RetransQ is allocated along with the SQ, RQ, and CQ during QP creation. Once allocated, it is exclusively managed by the RNIC without involving any software manipulation. Therefore, no additional CPU overhead is introduced.

In the Tx path, when a QP is scheduled, the RNIC first checks if its RetransQ is empty by examining the Queue Pair Context (QPC) status (①).

[7] In RNICs, QP and WQE are descriptors for connections and messages. Each QP typically consists of a Send Queue (SQ), a Receive Queue (RQ), and a Completion Queue (CQ) on both sides.
[8] This occurs when the total message size associated with the n WQEs exceeds the min(round_quota, awin) bytes.
[9] Assuming the PCIe round-trip latency between the RNIC and host is 1 μs, the throughput during the loss recovery period is 1KB/2μs = 4Gbps.
[10] 16 × 1KB = 16KB, equals the previously configured round_quota.
If the RetransQ is not empty, the DMA Engine retrieves the awin value from the CC module (②) and fetches min(16, len, awin/MTU)[10] retransmission entries from the RetransQ, where len represents the length of the RetransQ, which is maintained in the QPC. Simultaneously, the DMA Engine fetches up to n WQEs from the SQ (③). We configure n to 8, consistent with the previous setting. For each retransmission entry, the RNIC calculates its virtual address based on the fetched WQE information (which contains the starting address) and translates it into a physical address using the Memory Translation Table (MTT) (④). If the required WQE is not present in the RNIC, it re-fetches the targeted WQE and replaces a randomly selected existing WQE in the RNIC. Subsequently, the DMA Engine fetches the payload from application memory and encapsulates it into a packet (⑤). Once the retransmitted packet is sent, the RNIC updates the length of the RetransQ in the QPC and adjusts the awin in the CC module (⑥). After processing all fetched retransmission entries, the RNIC begins transmitting normal data packets if there is an available sending quota. Note that throughout the entire process, the retransmission and CC modules operate in a decoupled manner, making DCP microarchitecturally compatible with any CC scheme.

Figure 6: Three approaches of packet tracking. [(a) a fixed BDP-sized bitmap (QPC head); (b) linked chunks drawn from a chunk pool (QPC head and tail); (c) DCP's bitmap-free way: a multi-bit packet counter, a message completion flag (mcf, 1 bit), and a CQE flag (cf, 1 bit) per message, with the QPC head pointing to the eMSN.]
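A minimal sketch of the batched Tx-path fetch, under the assumption that the RetransQ can be modeled as a host-side list of (MSN, PSN) entries; the names and the 1 KB MTU are illustrative.

```python
MTU = 1024  # illustrative packet payload size

def fetch_retrans_entries(retransq, awin):
    """Batched fetch on the Tx path: pull min(16, len, awin/MTU)
    retransmission entries (MSN, PSN) from a QP's RetransQ in one
    PCIe batch, so the CC window (awin, in bytes) caps how much
    retransmitted traffic enters the fabric."""
    n = min(16, len(retransq), awin // MTU)
    batch = retransq[:n]
    del retransq[:n]  # consumed entries leave the host-side queue
    return batch

# 40 lost packets of message MSN=0, but only an 8-packet CC window:
q = [(0, psn) for psn in range(40)]
batch = fetch_retrans_entries(q, 8 * MTU)
# Only 8 entries are fetched; the CC window, not the HO arrival rate,
# paces retransmission, which is what decouples retransmission from CC.
```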

4.4 Order-tolerant Packet Reception

Both packet loss and load balancing can lead to out-of-order (OOO) packet arrivals. The standard RDMA header is designed for in-order packet arrivals and is not well-suited for handling OOO packet reception. A feasible way to handle OOO packets with the standard RDMA header is to allocate a large reorder buffer in the RNIC or host, where OOO packets are temporarily stored. Once all packets are received, the RNIC or host reorders them before pushing them into application memory [30, 39]. However, this approach introduces significant memory and CPU overhead.

To address this issue, we extend the standard RDMA header, allowing the RNIC to write all packets, whether in-order or OOO, directly to the correct locations in application memory. This eliminates the need for a reorder buffer. Here, we focus on the widely-used Send, Write, and Write-with-Immediate operations.

Write. In the standard specifications, the RDMA Extended Transport Header (RETH), which contains the remote memory location, is included only in the first packet of a Write message. As a result, during OOO arrivals, if the middle packets arrive first, they cannot determine the remote memory location. To address this, as shown in Fig. 4(a), DCP includes the RETH header in all packets (including first, middle, and last) of the Write message.
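A sketch of order-tolerant placement. We assume here, purely for illustration, that the per-packet RETH resolves to that packet's own remote address, so the receiver can write each arriving packet straight to its destination without any reorder buffer.

```python
MTU = 1024  # illustrative payload size per packet

def place_packet(memory, reth_addr, payload):
    """Write an arriving packet, in-order or not, directly to its target
    location; with a RETH in every packet there is nothing to reorder."""
    memory[reth_addr:reth_addr + len(payload)] = payload

mem = bytearray(4 * MTU)
msg = bytes(range(256)) * 16  # a 4-packet, 4 KB message
# Packets arrive out of order: 2, 0, 3, 1.
for idx in (2, 0, 3, 1):
    place_packet(mem, idx * MTU, msg[idx * MTU:(idx + 1) * MTU])
```

After all four arrivals the buffer holds the message intact, regardless of arrival order.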
Send and Write-with-Immediate. Two-sided operations require each arriving packet to be matched with a corresponding Receive WQE at the responder. This matching is implicit for in-order packet arrivals but fails in the case of OOO arrivals. For example, if a packet of a latter message arrives before the preceding message completes, it cannot find a matching Receive WQE. Since Receive WQEs must be consumed by Send and Write-with-Immediate requests in the same order they are posted, we introduce a Send Sequence Number (SSN) for these operations, as shown in Fig. 4(a). The SSN indicates the posting order and is included in all Send packets and the last Write-with-Immediate packet. It is used to identify the appropriate Receive WQE for processing. Without the SSN, the RNIC would need to buffer packets from out-of-order messages, which incurs significant memory overhead.

Other specific fields for DCP. Besides the above extensions, DCP introduces additional fields to correlate metadata at RNICs, as shown in Fig. 4. The first is the Message Sequence Number (MSN), which specifies the posting order of all requests in the SQ. Furthermore, DCP includes the sRetryNo in data packets and the eMSN in ACK packets. The usage of these fields is detailed in §4.5.

4.5 Bitmap-free Packet Tracking

DCP-RNIC avoids the need for a reordering buffer via RDMA header extension (§4.4). However, traditionally, the RNIC still needs to maintain bitmaps to track which packets have been received or lost. The bitmap introduces undesired trade-offs between memory overhead and processing efficiency.

One approach for maintaining bitmaps is to pre-allocate fixed-length, typically BDP-sized (BDP/MTU bits), bitmaps for all active QPs [42, 53], as shown in Fig. 6(a). This method exhibits constant packet processing latency, as accessing any slot in the bitmap requires constant steps: 1) calculating the address by adding the bitmap head and PSN offset, and 2) accessing the address. Although promising, this approach leads to significant memory overhead. Table 3 illustrates the memory overhead of bitmaps in typical intra-DC scenarios (400Gbps bandwidth, 10μs RTT). As the number of QPs increases, the footprint of BDP-sized bitmaps can easily exceed the typical RNIC SRAM capacity (usually ∼2MB).

Another common approach is to maintain a chunk pool [30, 39], e.g., each chunk with 128 bits. Each QP is pre-allocated with only one chunk and is linked with additional chunks as needed (Fig. 6(b)). This approach reduces bitmap memory overhead when the degree of OOO packets is low (Table 3). However, as the OOO degree increases, not only does the memory overhead eventually reach that of the BDP-sized approach, but this method also introduces another issue: high access latency. Accessing bits in the n-th chunk requires O(n) steps: determining whether the PSN is within the current chunk; if not, retrieving the next chunk's address and proceeding to the next chunk. The access latency impacts the packet rate (pps)[11]. Fig. 7 illustrates the theoretical packet rate under various OOO degrees for a clock frequency of 300MHz. As shown, the packet rate of the linked chunk degrades as the OOO degree increases.

Opportunities. DCP's lossless control plane and HO-based retransmission ensure that only truly lost packets are retransmitted. This guarantees that for any given packet, exactly one copy arrives at the receiver. This "exactly-once" property allows us to move away from traditional packet-level tracking strategies.

[11] Bandwidth = pps × MTU. 50 Mpps amounts to 400 Gbps with a 1KB MTU.
SIGCOMM ’25, September 8–11, 2025, Coimbra, Portugal Wenxue Li et al.

Schemes BDP-sized Linked chunk DCP Schemes LUT Registers BRAM URAM
Per-QP, Intra-DC 320B 80B∼320B 32B RNIC-GBN 66k (5.4%) 102k (3.5%) 408 (20%) 38 (3.9%)
10k QPs, Intra-DC 3MB 0.76MB∼3MB 0.3MB DCP-RNIC 67k (5.5%) 103k (3.6%) 412 (20%) 37 (3.8%)
Table 3: Memory overhead for packet tracking. Table 4: Resource usage of our prototype. The number inside
the bracket is the ratio of total FPGA resources.
BDP-sized DCP Linked chunk
Theoretical packet

60
Otherwise, the timer continues. If the timer expires, the sender re-
rate (Mpps)

40 transmits all packets of the unaMSN𝑡ℎ message.


20 However, this approach disrupts the ”exactly once” delivery guar-
0 antee, which affects the correctness of the receiver’s packet count-
0 64 128 192 256 320 384 448 ing. To address this, the sender maintains a value called sRetryNo
Out-of-order (OOO) degree (initialized to 0) for each QP, which tracks the number of timeouts
Figure 7: Theoretical packet rate, i.e., packet per second (pps), for the unaMSN𝑡ℎ message. The data packet includes the sRetryNo
under various out-of-order degrees. value in its header. On the receiver side, the receiver maintains
Instead, we can simply count the number of packet arrivals for each message and compare this count to the message size to determine message completion. By adopting this approach, we reduce the memory requirement at the receiver from n bits to log2(n) bits, where n is the number of in-flight packets. Note that the sender-side bitmap is already eliminated in HO-based retransmission.

Solution: Bitmap-free Packet Tracking. The receiver-side DCP-RNIC employs packet counting to track the aggregated message-level information, rather than tracking at the packet level. As shown in Fig. 6(c), it maintains a multi-bit counter for each message. Specifically, upon the arrival of a packet, the corresponding message's counter is incremented by 1. When the packet counter equals the message size, the receiver determines that the message is complete and sets the message completion flag (mcf) to 1. If the message requires a CQ Element (CQE), the CQE flag (cf) is also set to 1. For each QP, the receiver maintains an expected message sequence number (eMSN) state. In cases where messages are completed out of order, the receiver waits until the eMSN-th message finishes, then updates the eMSN and generates one or multiple CQEs for the application. This behavior aligns with the common assumption in upper-level application programming that messages are completed in order [38]. When the eMSN is updated, the receiver generates an ACK that carries the updated eMSN value (as shown in Fig. 4(b)).

The memory overhead of DCP-RNIC is determined by the outstanding message size in upper-layer applications. For example, AI applications typically use NCCL [44] as the communication backend, where the number of outstanding messages per QP is 8, and the typical message size is several MBs. Therefore, we allow each QP to track 8 messages and set the counter to 14 bits, resulting in a tracking status of 2 bytes per message. Table 3 shows the associated memory overhead. Furthermore, since the processing latency for each packet in DCP is constant (merely incrementing the corresponding counter by 1), the packet rate remains constant (Fig. 7).
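The counting scheme can be sketched as follows. The class name and list-based state are illustrative; a real RNIC would keep this state in on-chip SRAM, and the sketch relies on the "exactly-once" delivery property described above.

```python
class MsgTracker:
    """Bitmap-free receiver-side tracking (sketch): one counter per
    message instead of one bit per packet, valid only because exactly
    one copy of every packet arrives."""
    def __init__(self, sizes):
        self.sizes = sizes              # packets per message, by MSN
        self.count = [0] * len(sizes)   # multi-bit packet counters
        self.mcf = [False] * len(sizes) # message completion flags
        self.eMSN = 0                   # next expected completed MSN

    def on_packet(self, msn):
        self.count[msn] += 1            # constant time, any OOO degree
        if self.count[msn] == self.sizes[msn]:
            self.mcf[msn] = True
        cqes = 0
        while self.eMSN < len(self.sizes) and self.mcf[self.eMSN]:
            self.eMSN += 1              # completions reported in order
            cqes += 1
        return cqes                     # CQEs generated by this arrival
```

If a later message finishes first, no CQE is produced until the eMSN-th message also completes, matching the in-order completion assumption of upper-layer applications.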
Coarse-grained Timeout as a Fallback. The assumption of a lossless control plane may be violated due to link/switch failures or accidental HO packet losses. In these cases, HO-based retransmission fails to recover the lost packets, so DCP-RNIC falls back to using a coarse-grained timeout mechanism to ensure reliability. Specifically, the sender maintains the smallest unacknowledged MSN (unaMSN) for each QP and keeps a timer associated with it. If the received ACK's eMSN is greater than the unaMSN, the timer is reset and the unaMSN is updated to the received ACK's eMSN. Otherwise, the timer continues. If the timer expires, the sender retransmits all packets of the unaMSN-th message.

However, this approach disrupts the "exactly once" delivery guarantee, which affects the correctness of the receiver's packet counting. To address this, the sender maintains a value called sRetryNo (initialized to 0) for each QP, which tracks the number of timeouts for the unaMSN-th message. The data packet includes the sRetryNo value in its header. On the receiver side, the receiver maintains a rRetryNo value for the eMSN-th message. When a packet of the eMSN-th message is received, the receiver first checks if the sRetryNo field in its header matches rRetryNo. If they are equal, the eMSN-th message's packet counter is incremented by 1. If sRetryNo is greater than rRetryNo, the receiver resets the packet counter to 0, updates the rRetryNo to this sRetryNo, and starts recounting for the newest timeout round.
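The receiver-side retry check can be sketched as below. The state layout and names are illustrative, and counting the triggering packet immediately after a reset is our assumption about "starts recounting".

```python
def on_data_packet(state, s_retry_no):
    """Receiver retry-round check (sketch): `state` holds the packet
    counter and rRetryNo for the eMSN-th message. A larger sRetryNo
    means a timeout round resent the whole message, so counting
    restarts for the newest round."""
    if s_retry_no == state["rRetryNo"]:
        state["counter"] += 1           # same round: keep counting
    elif s_retry_no > state["rRetryNo"]:
        state["rRetryNo"] = s_retry_no  # newer timeout round observed
        state["counter"] = 1            # recount, including this packet
    # s_retry_no < rRetryNo: stale packet from an old round, ignored
    return state

st = {"counter": 0, "rRetryNo": 0}
on_data_packet(st, 0)
on_data_packet(st, 0)
on_data_packet(st, 1)  # a timeout occurred: counting restarts at 1
```

Duplicates created by a timeout retransmission thus can no longer inflate the counter past the message size.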
Orthogonality. This bitmap-free design is orthogonal to the rest of DCP-RNIC's architecture. If we choose to maintain traditional packet-level bitmaps at the receiver, the remaining components of DCP-RNIC, such as HO-based retransmission, still function correctly. We believe this bitmap-free approach is an innovative direction that contributes to rethinking next-generation high-speed RNIC design. At the same time, we acknowledge that it faces tremendous challenges in real-world deployment.

5 Implementation

DCP-Switch. We implement the lossless control plane, including packet trimming and WRR scheduling, using a P4 switch [2]. When the egress queue length exceeds a given threshold, DCP-Switch generates a mirrored packet and sets its packet length to the length of the header during mirroring, while dropping the original packet; this process utilizes the Mirror function. After the mirrored packet is re-enqueued, we modify its DCP tag and the packet_len in the IP header, then push it to the control queue of the egress port. We also implement adaptive routing in the switch, where the ingress pipeline monitors the egress queue length and, based on this information, selects the egress port with the lowest queue length. We enable WRR scheduling in the egress pipeline and assign it an appropriate weight according to §4.2. Discussions with multiple vendors confirm that supporting the lossless control plane in switch ASICs is feasible, as it is simple, stateless, and does not interfere with any existing switch functionalities.

DCP-RNIC. We build a fully functional prototype of DCP-RNIC using an FPGA board [4] with PCIe Gen3 x16 and 100 Gbps Ethernet ports, running at a clock frequency of 300MHz. First, we implement an RNIC-GBN prototype as a baseline. We then implement DCP-RNIC by modifying specific modules of the RNIC-GBN baseline. Table 4 illustrates the resource consumption for the RNIC-GBN and DCP-RNIC prototypes. As shown, DCP-RNIC consumes only 5.5%, 3.6%, 20%, and 3.8% of the FPGA board's total LUT, registers, BRAM, and URAM resources, respectively.
Moreover, DCP-RNIC consumes only 1.7%, 0.4%, and 1.1% more LUT, registers, and BRAM, respectively, compared to RNIC-GBN.

Figure 8: Basic validation of DCP-RNIC prototype. [Throughput (Gbps) and latency (μs) of RNIC-GBN, DCP-RNIC, and TCP.]

Figure 9: Testbed topology used in this paper. [Switch 1 and Switch 2 connected by 8 × 100 Gb links, each switch attached to 8 FPGA RNICs.]

Figure 10: DCP's superior loss recovery efficiency. [Goodput (Gbps, log scale) of DCP and CX5 under loss rates from 0.01% to 5%.]

Figure 11: DCP adapts to unequal paths via AR. [Goodput of DCP and CX5 under average bandwidth ratios of 1:1, 1:4, and 1:10 between two paths.]

Figure 12: DCP vs. CX5 under AI workloads. [Completion time (ms) of (a) AllReduce and (b) AllToAll across four groups.]

We evaluate the basic performance of the DCP-RNIC prototype by connecting two DCP-RNICs directly and using the perftest benchmark to measure throughput and latency. DCP-RNIC is compared with RNIC-GBN and TCP. For the throughput test, we launch a long-running flow consisting of multiple 512 KB messages, and for the latency test, we launch a 64 B message. The results in Fig. 8 demonstrate that DCP-RNIC successfully maintains hardware offloading capabilities, achieving throughput and latency comparable to RNIC-GBN, both of which outperform TCP significantly.

6 Evaluation

We evaluate DCP through extensive testbed experiments and simulations, which reveal the following key results:
• DCP achieves a 1.6×-72× improvement in loss recovery efficiency and a 42% reduction in completion time for AI workloads, compared to Mellanox CX5. Moreover, DCP maintains consistent throughput, whether with adaptive routing or over a 10 km communication distance (§6.1).
• DCP achieves 16% and 10% lower tail FCT under general workloads, and 38% and 45% lower completion time under AI workloads, compared to MP-RDMA and IRN (i.e., SOTA lossless and lossy solutions), respectively. Additionally, DCP achieves even greater performance improvements in cross-datacenter (cross-DC) scenarios (§6.2).
• DCP+CC outperforms all other comparisons under high-load scenarios, and DCP's lossless control plane remains robust under severe incast congestion (§6.3).

6.1 Testbed Benchmarks

Setup. Fig. 9 illustrates the testbed topology, consisting of two P4 programmable switches (Switch1 and Switch2), each connected to 8 FPGAs. All links operate at 100 Gbps. The two switches are connected via 8 parallel cross-switch links. We compare DCP-RNIC with the Mellanox CX5 RNIC (i.e., RNIC-GBN).

Loss recovery efficiency. We select two RNICs as sender and receiver, respectively, where the sender transmits a long-running flow to the receiver. We manipulate the P4 programmable switch to drop packets with a given loss rate. Upon packet loss, the P4 switch executes the packet trimming module for DCP traffic, while it simply drops packets for CX5 traffic. We vary the loss rate from 0.01% to 5% and measure the goodput for both DCP and CX5 traffic. Fig. 10 shows the results. As demonstrated, DCP achieves a 1.6× to 72× improvement in loss recovery efficiency under loss rates ranging from 0.01% to 5%, compared to CX5.

Compatibility with AR. We select two RNICs from Switch1 as senders and two RNICs from Switch2 as receivers. The senders transmit two long-running cross-switch flows. We enable two cross-switch paths and modify their port capacities, setting their capacity ratios to 1:1, 1:4, and 1:10. We implement adaptive routing (AR) on the switches, which forwards traffic according to the capacity ratio of the links. We measure the average goodput of the two flows. Fig. 11 illustrates the results. As shown, DCP maintains stable goodput under all capacity ratios, as it is natively compatible with packet-level LBs. In contrast, the goodput of CX5 significantly decreases under non-equal port capacities.

AI workload. We implement an AllReduce and AllToAll benchmark using the verbs API [6] and OpenMPI [10]. The 16 RNICs in the testbed are arranged into four groups, each consisting of 4 RNICs. Each group executes an AllReduce/AllToAll operation, with all groups starting execution simultaneously. DCP and CX5 are integrated with AR and ECMP in the network, respectively. We measure the job completion time (JCT) for each group and present the results of 4 groups in Fig. 12. As shown, DCP reduces the JCT of AllReduce and AllToAll by up to 33% and 42%, respectively.

Long-haul communication. We replace one cross-switch link with a 10 km optical link (one-hop delay is 50 µs) and use 100G-LR transceivers at the endpoints. We select one RNIC from Switch1 as the sender and one RNIC from Switch2 as the receiver. The sender transmits a long-running flow to the receiver. We measure the throughput of this flow and observe that DCP can stably operate at around 85 Gbps. This experiment serves as a first-step validation that DCP can adapt to long-haul communication scenarios.

6.2 Large-scale Simulations

Setup. We use NS3 for simulations. The topology consists of a two-layer CLOS network with 16 spine switches, 16 leaf switches, and 256 servers (16 per rack). Each server has a single NIC connected to a single leaf switch. All links operate at 100 Gbps. In all experiments, except for cross-datacenter (DC) scenarios, the propagation delay of all links is set to 1 µs. In cross-DC experiments, the propagation delay of links between servers and leaf switches is 1 µs, while the propagation delay between leaf and spine switches is set to 500 µs and 5 ms.
SIGCOMM ’25, September 8–11, 2025, Coimbra, Portugal Wenxue Li et al.

2.5 9
PFC (ECMP) 4 PFC (ECMP) PFC (ECMP)
PFC (ECMP)
FCT Slowdown

FCT Slowdown

FCT Slowdown

FCT Slowdown
IRN (AR) IRN (AR) 8 IRN (AR)
2.0 IRN (AR) 4 MP-RDMA MP-RDMA
3 MP-RDMA
MP-RDMA DCP (AR) 7 DCP (AR)
DCP (AR)
DCP (AR)
1.5 3 2 6

1 5
1.0
2
3
6
9
20
24
29
40
50
61
73
11
217
618
104
1521
1907
3491
5194
8609
2974

3
6
9
20
24
29
40
50
61
73
11
217
618
104
1521
1907
3491
5194
8609
2974

3
6
10
19
24
30
39
49
61
73
11
205
614
105
1514
2016
3525
4917
8694
2902

3
6
9
20
24
29
40
50
61
73
11
217
618
104
1521
1907
3491
5194
8609
2974
99

99

98

99
Flow size (KB) Flow size (KB) Flow size (KB) Flow size (KB)
5

5
(a) Load 0.3, P50. (b) Load 0.3, P95. (c) Load 0.5, P50. (d) Load 0.5, P95.
Figure 13: Comparision between DCP, PFC, IRN, and MP-RDMA under the WebSearch workload.
0.05
PFC DCP 1.0 1.0
0.15 IRN Ideal
MP-RDMA 0.04 PFC DCP
JCT (s)

JCT (s)
PFC IRN Ideal
CDF

CDF
0.10 0.5 IRN MP-RDMA 0.5 PFC
MP-RDMA 0.03 IRN
DCP MP-RDMA
0.05 DCP
0 0.02 0
5 10 15 0.002 0.005 0.01 5 10 15 0.01 0.02 0.03 0.04
Group index Time (s) Group index Time (s)

(a) AllReduce, JCT. (b) AllReduce, CDF of FCT. (c) AllToAll, JCT. (d) AllToAll, CDF of FCT.
Figure 14: Comparision between DCP, PFC, IRN, and MP-RDMA under the AllReduce and AllToAll workloads.

PFC 50 PFC PFC 1000 PFC MP-RDMA


FCT Slowdown

FCT Slowdown

FCT Slowdown

FCT Slowdown
IRN IRN 100 IRN IRN DCP
MP-RDMA 20 MP-RDMA MP-RDMA
10 100
DCP 10 DCP DCP
10
5 10

1 2 1
1
3
6
9
20
24
29
40
50
61
73
11
217
618
104
1521
1907
3491
5194
8609
2974

3
6
9
20
24
29
40
50
61
73
11
217
618
104
1521
1907
3491
5194
8609
2974

3
6
9
20
24
29
40
50
61
73
11
217
618
104
1521
1907
3491
5194
8609
2974

3
6
9
20
24
29
40
50
61
73
11
217
618
104
1521
1907
3491
5194
8609
2974
99

99

99

99
Flow size (KB) Flow size (KB) Flow size (KB) Flow size (KB)
5

5
(a) 500us delay, P50. (b) 500us delay, P95. (c) 5ms delay, P50. (d) 5ms delay, P95.
Figure 15: Comparision between DCP, PFC, IRN, and MP-RDMA under the cross-DC (100 and 1000 km) scenarios.
network is a single RDMA domain. missions when combined with AR. Additionally, we observe that
We compare DCP with PFC, IRN [42], and MP-RDMA [38]. MP- MP-RDMA fails to effectively control the out-of-order degree be-
RDMA includes its own CC component, i.e., an adaptive conges- low its expected threshold, which leads to its inferior performance.
tion window, while IRN only employs a BDP-based flow control. AI workload. We arrange 256 servers into 16 groups, with 16
By default, DCP, PFC, and IRN are combined with AR, ECMP, and servers per group. Each group executes an AllReduce or AllToAll
AR as the load balancing schemes in the network. We select IRN+AR operation, starting execution at the same time. The total traffic for
as the default configuration because we observe that IRN+AR out- one AllReduce/AllToAll operation is 300 MB. For AllReduce, the
performs IRN+ECMP in most of our experiments. We also compare total traffic is partitioned into 16 slices (i.e., flows), and then trans-
DCP, PFC, and IRN with CC integrated (we choose DCQCN [56] mitted following a RingAllReduce procedure. For AllToAll, the to-
as it is representative). The traffic loads used in the simulations tal traffic is partitioned into 16 slices, with one slice transmitted
include general workloads and AI workloads. to each group member. We measure the time of the last completed
General workload. We evaluate the WebSearch [15] workload, flow within each group as the Job Completion Time (JCT) for that
which consists of 60% of flows below 200 KB, 37% of flows between group, and we also measure the CDF of individual flows’ FCT.
200 KB and 10 MB, and 3% of flows exceeding 10 MB. Fig. 13 il- The results for AllReduce and AllToAll are shown in Fig. 14. As
lustrates the results under the WebSearch workload with average shown, DCP achieves average 38%, 44% and 61% lower JCT under
loads of 0.3 and 0.5. At each load, we measure the P50 and P95 (tail) the AllReduce workload (Fig. 14a), and average 5%, 45% and 46%
flow completion time (FCT). lower JCT under the AllToAll workload (Fig. 14c), compared to
As shown, fine-grained LB solutions, such as MP-RDMA and MP-RDMA, IRN and PFC, respectively. As shown in Fig. 14b and
AR, consistently outperform ECMP, as they can better balance traf- Fig. 14d, DCP achieves the best tail FCT for individual flows, which
fic. Among the fine-grained LB solutions, DCP achieves the best explains why DCP achieves the best JCT. This is because AI work-
performance, with an average of 5% and 16% lower tail FCT at loads are synchronized, meaning that if just one flow is delayed, it
0.3 load, and 10% and 12% lower tail FCT at 0.5 load, compared impacts the entire collective communication performance.
to IRN and MP-RDMA, respectively. The performance advantage Cross-DC scenarios. Fig. 15a and Fig. 15b illustrate the P50 and
of DCP over IRN is due to IRN’s susceptibility to spurious retrans- P95 (tail) FCT slowdown under the 100 km cross-DC scenario (i.e.,
Figure 16: FCT slowdown of DCP, MP-RDMA, and IRN under the incast workload w/ and w/o CC. [FCT slowdown vs. flow size (KB); (a) w/o CC, P50; (b) with CC, P50; (c) w/o CC, P99; (d) with CC, P99; the CC panels include IRN+CC and DCP+CC.]

Figure 17: Loss recovery efficiency of DCP, RACK-TLP, IRN, and the Timeout-based scheme under various loss rates. [Goodput (Gbps) at loss rates from 0% to 5%.]

Table 5: Loss rate of HO packets under severe incast degree.
Settings            N=22, 128-to-1   N=22, 255-to-1   N=16, 128-to-1   N=16, 255-to-1
Loss rate w/o CC    0                0                0                0.16%
Loss rate w/ CC     0                0                0                0

Cross-DC scenarios. Fig. 15a and Fig. 15b illustrate the P50 and P95 (tail) FCT slowdown under the 100 km cross-DC scenario (i.e., the propagation delay between leaf and spine switches is 500 us), while Fig. 15c and Fig. 15d show the evaluation results under the 1000 km cross-DC scenario (i.e., 5 ms propagation delay). The workload consists of WebSearch traffic with an average load of 0.5. For PFC and MP-RDMA, we increase the switch buffer from 32 MB to 600 MB and 6 GB for the 100 km and 1000 km distances, respectively, ensuring the buffer is larger than the PFC headroom. In contrast, for IRN and DCP, the switch buffer remains at 32 MB. As shown, DCP achieves approximately 89%, 81%, and 46% lower tail FCT under the 100 km distance, and 84%, 95%, and 51% lower tail FCT under the 1000 km distance, compared to PFC, MP-RDMA, and IRN, respectively. Compared to intra-DC scenarios (Fig. 13), DCP achieves a larger improvement in cross-DC scenarios. This is because servers generate more traffic in cross-DC scenarios due to the larger BDP capacity. As a result, congestion is more severe in cross-DC scenarios than in intra-DC scenarios, making the performance improvement of DCP more pronounced.

6.3 Deep Dive

DCP+CC achieves the best performance under high loads. We observe that in highly congested situations, such as extreme incast, severe congestion leads to significant packet loss. This triggers numerous HO packets arriving at the sender, which in turn triggers a large number of retransmitted packets that further exacerbate congestion and ultimately degrade overall performance. In contrast, MP-RDMA includes a native adaptive congestion window, which reduces the traffic load when severe congestion occurs. Similarly, in IRN, each packet loss prompts only one retransmission from the sender. If the retransmitted packet is dropped again, the IRN sender will not retransmit it but wait for a timeout. During this timeout period, no additional traffic is generated, which helps reduce the average traffic load.

As noted, DCP focuses solely on the reliability aspect, leaving rate control to the CC modules. Therefore, we integrate DCQCN into DCP and IRN and evaluate their performance. Fig. 16d shows the P99 FCT slowdown after CC integration. As demonstrated, DCP achieves the best P99 FCT after CC integration, with a reduction of about 31% and 29% compared to MP-RDMA and IRN, respectively.

Robustness of the lossless control plane. DCP controls the maximum affordable incast scale by adjusting the scheduling weight ((N-1)/(r-N+1) : 1), as described in §4.2. N-1 represents the maximum incast scale that can be handled at local switches. Ideally, N should equal the switch radix, allowing DCP to handle any incast scale. However, if the ratio r is small, N cannot be set to the switch radix. We evaluate extreme incast scales of 128-to-1 and 255-to-1 with N of 22 and 16 and measure the ratio of lost HO packets over all HO packets. The background traffic is WebSearch with an average load of 0.3. We evaluate DCP both with and without CC enabled. Table 5 shows the results. As shown, when CC is disabled, no HO packets are lost with N = 22 at any incast scale, and only 0.16% of HO packets are lost under the extreme 255-to-1 incast scale with N = 16. When CC is enabled, there are no HO packet losses in any case. This demonstrates that DCP's lossless control plane maintains robustness even under severe incast scales.

Comparison with Timeout and RACK-TLP. The NVIDIA Spectrum platform [13] supports adaptive routing (AR), where the Spectrum Switch dynamically changes packet paths, and the SuperNIC relies on timeouts for loss recovery to avoid spurious retransmis-
incast, DCP alone (i.e., without CC) struggles with tail FCT per- sions. However, Spectrum AR only supports RDMA Write opera-
formance. For example, we evaluate a workload comprising Web- tions, and using timeouts results in significant performance degra-
Search traffic at 50% average load and 128-to-1 incast traffic at 5% dation upon packet loss. Falcon [5] introduces RACK-TLP [19] for
average load. Fig. 16a and Fig. 16b illustrates the P50 FCT slow- TCP networks to address packet reordering and retransmission/tail
down of DCP, IRN, and MP-RDMA, with and without CC (specifi- loss issues. RACK-TLP maintains transmission timestamps for ev-
cally DCQCN) integration. As shown, DCP exhibits the lowest P50 ery data packet (including retransmissions). If a packet remains
FCT in both cases. However, Fig. 16c highlights the P99 FCT slow- unacknowledged for an estimated RTT after being sent, it is con-
down when DCP and IRN are evaluated without any CC integra- sidered lost. This mechanism tolerates a reordering window of one
tion, indicating that DCP alone exhibits the worst P99 FCT. RTT and helps avoid timeouts for retransmitted packet losses. How-
DCP’s poor tail FCT performance occurs because large-scale in- ever, it delays retransmissions by one RTT and incurs significant
SIGCOMM ’25, September 8–11, 2025, Coimbra, Portugal Wenxue Li et al.

memory overhead due to the need to maintain per-packet timestamps, making it impractical for hardware offloading.

We conducted a simulation to evaluate the loss recovery efficiency of DCP, IRN, RACK-TLP, and timeout-based mechanisms. We measured the goodput of a long-running flow under ECMP with various packet loss rates artificially enforced at switches. As shown in Fig. 17, DCP achieves up to 22%, 98%, and 99% higher goodput than RACK-TLP, IRN, and timeout, respectively. As packet loss increases, the performance of the timeout-based scheme degrades sharply. IRN suffers due to increased timeouts caused by the loss of retransmitted packets. RACK-TLP performs better than IRN by avoiding such timeouts, but this comes at the cost of high memory overhead from maintaining timestamps. If timestamps are removed, RACK-TLP would likely perform worse than IRN due to delayed retransmissions. Note that our simulations do not emulate the end-host overhead of retransmitting packets, so all schemes appear more efficient than they would in testbed implementations. However, the relative performance comparison remains consistent.

7 Discussion
Differences with NDP and CP. While the DCP-Switch shares similarities with the switch-side logic of both NDP [26] and CP [18], the end-host design in DCP (i.e., DCP-RNIC) differs from both of them. Specifically, NDP primarily focuses on leveraging packet trimming to realize a receiver-driven CC mechanism. Its end-host implementation is based on DPDK in software, while its Tofino and FPGA implementations are limited to switch-side logic only. On the other hand, CP places more emphasis on reliability for TCP. It introduces PACK packets and relies on sender- and receiver-side bitmaps to track losses and determine retransmissions. In contrast, DCP-RNIC is specifically tailored for RDMA networks. Our DCP-RNIC design is grounded in an in-depth analysis of existing RNIC architectures, with design decisions made incrementally based on the needs of RDMA. The key differences in DCP-RNIC, such as HO-based retransmission, an extended packet header, and bitmap-free packet tracking, are not present in either NDP or CP.

Distinction of SR- and HO-based retransmission. HO-based retransmission cannot be handled in the same way as SR-based retransmission. SR operates in a stateful manner: it uses a sender-side bitmap to track lost packets, where gaps in the bitmap indicate packet loss. Based on this information, SR can retrieve and retransmit specific payloads at any later time. In contrast, DCP adopts a stateless design without bitmaps, and loss events are indicated through individual HO packets. As a result, DCP must maintain a queue of these loss events to trigger further retransmissions. Since the number of HO packets can be large, this queue must be implemented in software rather than hardware.

Congestion Control for DCP. Theoretically, DCP is compatible with any reactive [12, 37, 41, 56] and proactive [17, 20, 43, 45] CC design. Currently, we adopt DCQCN at RNICs, which reduces the sending rate upon receiving CNPs. The received CNPs of a flow may result from multiple paths, but the sender does not currently distinguish between them and simply reacts in the standard DCQCN manner. The question of what CC strategy should be used when combined with in-network packet-level LBs is beyond the scope of this paper. We plan to explore this in future work.

Onloading bitmaps to host memory? SRNIC [53] focuses on single-path transmission, so all OOO packets (i.e., those that trigger bitmap access) are assumed to be caused by packet loss. Consequently, the access frequency to the bitmap is low, making it acceptable for SRNIC to place the bitmap in host memory. In contrast, DCP employs packet-level LB, which can cause OOO packets even in the absence of packet loss. As a result, the frequency of bitmap access is much higher than what SRNIC assumes. Therefore, placing the bitmap in host memory is not feasible in our case.

Back-to-sender for HO packets. Currently, header-only (HO) packets must be sent to the destination before being sent back to the sender. The reason is as follows. The RDMA RC mode relies on QPs, which consist of a sender QP and a receiver QP. The packet's header includes only the destination QP number (QPN) and does not carry the source QPN. Moreover, a packet is accepted by the RNIC only if its header contains the correct destination QPN. Therefore, when a switch generates an HO packet and attempts to send it back to the sender directly, it must set the destination QPN to that of the sender QP. However, the switch does not know the sender's QPN; only the receiver's QP context contains this information. As a result, the HO packet must first be forwarded to the receiver to obtain the correct destination QPN before it can be sent back to the sender. In theory, if the switch maintained a mapping table between sender and receiver QPNs, it could extract the data packet's destination QPN, find the corresponding sender QPN, and modify the HO packet accordingly, allowing it to be returned directly to the sender. However, maintaining such a table would introduce significant state overhead and computational complexity.

8 Related Work
Load balancing. Various congestion-aware rerouting solutions [31, 46] and flowlet-based LBs [14, 32, 51] require a sufficiently large packet interval within flows to trigger path changes. However, such intervals are rare in RDMA traffic [38]. Moreover, finer-grained load balancing typically yields better performance in terms of path utilization and latency. DCP is designed to support per-packet load balancing to fully exploit this potential. ConWeave [50] reorders packets in the network to deliver an in-order packet sequence to RNICs, enabling packet-level load balancing as well. However, ConWeave restricts a flow to at most two paths, limiting flexibility, and imposes non-trivial queuing overhead on switches. In contrast, DCP's packet trimming module is lightweight and stateless. In summary, DCP is orthogonal to any specific LB approach; it focuses on reliability and is natively compatible with all LB schemes.

Multipath in lossless fabrics. To be compatible with Spectrum AR [13], the NVIDIA SuperNIC supports OOO reception for RDMA Write operations. It achieves this by converting all Write packets (e.g., Write First, Write Middle, Write Last) into Write-Only packets, each of which contains a destination memory address and can thus be directly written to application memory. However, this mechanism applies only to RDMA Write and does not generalize to other RDMA operations. In contrast, our proposed OOO-tolerant reception aims to provide a unified solution that supports both one-sided and two-sided RDMA operations. Similarly, the recently proposed LEFT [30] also targets compatibility with packet-level load balancing by enabling the correct delivery of OOO payloads to

application memory. However, both NVIDIA Spectrum and LEFT work only in lossless fabrics, as they focus solely on data placement and rely solely on timeouts for loss recovery, which leads to significant performance degradation upon packet loss (§6.3).

Industrial lossy solutions. RACK-TLP [19] trades loss recovery efficiency for packet reordering tolerance. Specifically, for each ACK received, the sender calculates the latest RTT measurement and checks whether there are any packets that are unacknowledged for an estimated RTT (i.e., "reordering window"). If this condition is met, RACK marks the packet as lost and retransmits it. This approach can avoid most spurious retransmissions, but the retransmission is delayed by a full RTT, resulting in inefficient loss recovery. SRD [49] provides reliable datagram semantics but requires applications to handle reordering themselves. Solar [40] implements an ordering-resilient network stack, using a one-packet-one-block approach, in FPGAs for specific storage applications. Solar maintains four paths in its control plane and relies on OOO packet arrival to detect packet loss within each path. In contrast, DCP achieves efficient loss recovery and reordering tolerance simultaneously, delivers reliable connection semantics without placing any additional burden on the application, and natively adapts to packet-level LBs.

UEC [11] mentions the packet trimming technique as well but does not discuss any RNIC solutions based on packet trimming. To the best of our knowledge, DCP is the first work to propose a comprehensive RNIC design leveraging the packet trimming technique. Note that if UEC is eventually supported by switch vendors, DCP can directly leverage the UEC-defined trimming functionality and eliminate the switch-side implementation overhead.

9 Conclusion
We present DCP, a transport architecture that rethinks RDMA reliability for lossy networks. By integrating a lightweight lossless control plane in switches with a hardware-efficient RNIC design, DCP eliminates dependence on PFC, supports packet-level load balancing, and avoids RTOs. Our prototype and evaluation show that DCP significantly outperforms existing RDMA solutions, advancing the practicality of high-performance RDMA over lossy fabrics. This work does not raise any ethical issues.

Acknowledgments
We thank the anonymous SIGCOMM reviewers and our shepherd, Prof. Gianni Antichi, for their constructive suggestions. We also thank Zeke Wang and Xuzheng Chen for their valuable discussions during this project. This work is supported in part by the Hong Kong RGC TRS T41-603/20R, the GRF 16213621, the ITC ACCESS, the TACC [54], and the NSFC 62402407. Bingyang Liu and Kai Chen are the corresponding authors.

References
[1] 2020. 802.1Qbb – Priority-based Flow Control. https://1.ieee802.org/dcb/802-1qbb/.
[2] 2023. EdgeCore AS9516. https://www.edge-core.com/_upload/images/2023-061-DCS810_AS9516-32D_DS_R07_20230503.pdf.
[3] 2023. NVIDIA InfiniBand Adaptive Routing Technology - Accelerating HPC and AI Applications. https://resources.nvidia.com/en-us-cloud-native-supercomputing-dpus-campaign/infiniband-white-paper-adaptive-routing.
[4] 2024. AMD Alveo™ U250 Data Center Accelerator Card. https://www.amd.com/en/products/accelerators/alveo/u250/a-u250-a64g-pq-g.html.
[5] 2024. Google Falcon. https://cloud.google.com/blog/topics/systems/introducing-falcon-a-reliable-low-latency-hardware-transport.
[6] 2024. Libibverbs. https://github.com/linux-rdma/rdma-core/blob/master/Documentation/libibverbs.md.
[7] 2024. NVIDIA ConnectX-5. https://www.nvidia.com/en-sg/networking/ethernet/connectx-5/.
[8] 2024. NVIDIA ConnectX-6 Dx. https://resources.nvidia.com/en-us-accelerated-networking-resource-library/networking-overal-dp.
[9] 2024. NVIDIA ConnectX-7. https://resources.nvidia.com/en-us-accelerated-networking-resource-library/connectx-7-datasheet.
[10] 2024. OpenMPI. https://www.open-mpi.org/.
[11] 2024. Ultra Ethernet Consortium. https://ultraethernet.org/wp-content/uploads/sites/20/2023/10/23.07.12-UEC-1.0-Overview-FINAL-WITH-LOGO.pdf.
[12] 2024. Zero Touch RoCE. https://docs.nvidia.com/networking/display/winof2v237/ethernet+network.
[13] 2025. NVIDIA Spectrum Platform. https://www.nvidia.com/en-us/networking/spectrumx/.
[14] Mohammad Alizadeh, Tom Edsall, Sarang Dharmapurikar, Ramanan Vaidyanathan, Kevin Chu, Andy Fingerhut, Vinh The Lam, Francis Matus, Rong Pan, Navindra Yadav, et al. 2014. CONGA: Distributed congestion-aware load balancing for datacenters. In Proceedings of the 2014 ACM conference on SIGCOMM. 503–514.
[15] Mohammad Alizadeh, Albert Greenberg, David A Maltz, Jitendra Padhye, Parveen Patel, Balaji Prabhakar, Sudipta Sengupta, and Murari Sridharan. 2010. Data center tcp (dctcp). In Proceedings of the ACM SIGCOMM 2010 Conference. 63–74.
[16] Wei Bai, Shanim Sainul Abdeen, Ankit Agrawal, Krishan Kumar Attre, Paramvir Bahl, Ameya Bhagat, Gowri Bhaskara, Tanya Brokhman, Lei Cao, Ahmad Cheema, et al. 2023. Empowering azure storage with RDMA. In 20th USENIX Symposium on Networked Systems Design and Implementation (NSDI 23). 49–67.
[17] Qizhe Cai, Mina Tahmasbi Arashloo, and Rachit Agarwal. 2022. dcPIM: Near-optimal proactive datacenter transport. In Proceedings of the ACM SIGCOMM 2022 Conference. 53–65.
[18] Peng Cheng, Fengyuan Ren, Ran Shu, and Chuang Lin. 2014. Catch the whole lot in an action: Rapid precise packet loss notification in data center. In 11th USENIX Symposium on Networked Systems Design and Implementation (NSDI 14). 17–28.
[19] Yuchung Cheng, Neal Cardwell, Nandita Dukkipati, and Priyaranjan Jha. 2021. The RACK-TLP Loss Detection Algorithm for TCP. RFC 8985. doi: 10.17487/RFC8985
[20] Inho Cho, Keon Jang, and Dongsu Han. 2017. Credit-scheduled delay-bounded congestion control for datacenters. In Proceedings of the Conference of the ACM Special Interest Group on Data Communication. 239–252.
[21] Advait Dixit, Pawan Prakash, Y Charlie Hu, and Ramana Rao Kompella. 2013. On the impact of packet spraying in data center networks. In 2013 proceedings ieee infocom. IEEE, 2130–2138.
[22] Adithya Gangidi, Rui Miao, Shengbao Zheng, Sai Jayesh Bondu, Guilherme Goes, Hany Morsy, Rohit Puri, Mohammad Riftadi, Ashmitha Jeevaraj Shetty, Jingyi Yang, et al. 2024. Rdma over ethernet for distributed training at meta scale. In Proceedings of the ACM SIGCOMM 2024 Conference. 57–70.
[23] Yixiao Gao, Qiang Li, Lingbo Tang, Yongqing Xi, Pengcheng Zhang, Wenwen Peng, Bo Li, Yaohui Wu, Shaozong Liu, Lei Yan, et al. 2021. When cloud storage meets RDMA. In 18th USENIX Symposium on Networked Systems Design and Implementation (NSDI 21). 519–533.
[24] Prateesh Goyal, Preey Shah, Naveen Kr Sharma, Mohammad Alizadeh, and Thomas E Anderson. 2019. Backpressure flow control. In Proceedings of the 2019 Workshop on Buffer Sizing. 1–3.
[25] Chuanxiong Guo, Haitao Wu, Zhong Deng, Gaurav Soni, Jianxi Ye, Jitu Padhye, and Marina Lipshteyn. 2016. RDMA over commodity ethernet at scale. In Proceedings of the 2016 ACM SIGCOMM Conference. 202–215.
[26] Mark Handley, Costin Raiciu, Alexandru Agache, Andrei Voinescu, Andrew W Moore, Gianni Antichi, and Marcin Wójcik. 2017. Re-architecting datacenter networks and stacks for low latency and high performance. In Proceedings of the Conference of the ACM Special Interest Group on Data Communication. 29–42.
[27] Jinbin Hu, Wenxue Li, Xiangzhou Liu, Junfeng Wang, Bowen Liu, Ping Yin, Jianxin Wang, Jiawei Huang, and Kai Chen. 2025. FLB: Fine-grained Load Balancing for Lossless Datacenter Networks. In 2025 USENIX Annual Technical Conference (USENIX ATC 25). 365–380.
[28] Shuihai Hu, Wei Bai, Gaoxiong Zeng, Zilong Wang, Baochen Qiao, Kai Chen, Kun Tan, and Yi Wang. 2020. Aeolus: A building block for proactive transport in datacenters. In Proceedings of the Annual conference of the ACM Special Interest Group on Data Communication on the applications, technologies, architectures, and protocols for computer communication. 422–434.
[29] Shuihai Hu, Yibo Zhu, Peng Cheng, Chuanxiong Guo, Kun Tan, Jitendra Padhye, and Kai Chen. 2017. Tagger: Practical PFC deadlock prevention in data center networks. In Proceedings of the 13th International Conference on emerging Networking EXperiments and Technologies. 451–463.
SIGCOMM ’25, September 8–11, 2025, Coimbra, Portugal Wenxue Li et al.

[30] Peihao Huang, Xin Zhang, Zhigang Chen, Can Liu, and Guo Chen. 2024. LEFT: LightwEight and FasT packet Reordering for RDMA. In Proceedings of the 8th Asia-Pacific Workshop on Networking. 67–73.
[31] Abdul Kabbani, Balajee Vamanan, Jahangir Hasan, and Fabien Duchene. 2014. Flowbender: Flow-level adaptive routing for improved latency and throughput in datacenter networks. In Proceedings of the 10th ACM International on Conference on emerging Networking Experiments and Technologies. 149–160.
[32] Naga Katta, Mukesh Hira, Changhoon Kim, Anirudh Sivaraman, and Jennifer Rexford. 2016. Hula: Scalable load balancing using programmable data planes. In Proceedings of the Symposium on SDN Research. 1–12.
[33] Jongyul Kim, Insu Jang, Waleed Reda, Jaeseong Im, Marco Canini, Dejan Kostić, Youngjin Kwon, Simon Peter, and Emmett Witchel. 2021. Linefs: Efficient smartnic offload of a distributed file system with pipeline parallelism. In Proceedings of the ACM SIGOPS 28th Symposium on Operating Systems Principles. 756–771.
[34] Yanfang Le, Rong Pan, Peter Newman, Jeremias Blendin, Abdul Kabbani, Vipin Jain, Raghava Sivaramu, and Francis Matus. 2024. STrack: A Reliable Multipath Transport for AI/ML Clusters. arXiv preprint arXiv:2407.15266 (2024).
[35] Wenxue Li, Xiangzhou Liu, Yuxuan Li, Yilun Jin, Han Tian, Zhizhen Zhong, Guyue Liu, Ying Zhang, and Kai Chen. 2024. Understanding communication characteristics of distributed training. In Proceedings of the 8th Asia-Pacific Workshop on Networking. 1–8.
[36] Wenxue Li, Chaoliang Zeng, Jinbin Hu, and Kai Chen. 2023. Towards fine-grained and practical flow control for datacenter networks. In 2023 IEEE 31st International Conference on Network Protocols (ICNP). IEEE, 1–11.
[37] Yuliang Li, Rui Miao, Hongqiang Harry Liu, Yan Zhuang, Fei Feng, Lingbo Tang, Zheng Cao, Ming Zhang, Frank Kelly, Mohammad Alizadeh, et al. 2019. HPCC: High precision congestion control. In Proceedings of the ACM special interest group on data communication. 44–58.
[38] Yuanwei Lu, Guo Chen, Bojie Li, Kun Tan, Yongqiang Xiong, Peng Cheng, Jiansong Zhang, Enhong Chen, and Thomas Moscibroda. 2018. Multi-Path transport for RDMA in datacenters. In 15th USENIX symposium on networked systems design and implementation (NSDI 18). 357–371.
[39] Yuanwei Lu, Guo Chen, Zhenyuan Ruan, Wencong Xiao, Bojie Li, Jiansong Zhang, Yongqiang Xiong, Peng Cheng, and Enhong Chen. 2017. Memory efficient loss recovery for hardware-based transport in datacenter. In Proceedings of the First Asia-Pacific Workshop on Networking. 22–28.
[40] Rui Miao, Lingjun Zhu, Shu Ma, Kun Qian, Shujun Zhuang, Bo Li, Shuguang Cheng, Jiaqi Gao, Yan Zhuang, Pengcheng Zhang, et al. 2022. From luna to solar: the evolutions of the compute-to-storage networks in alibaba cloud. In Proceedings of the ACM SIGCOMM 2022 Conference. 753–766.
[41] Radhika Mittal, Vinh The Lam, Nandita Dukkipati, Emily Blem, Hassan Wassel, Monia Ghobadi, Amin Vahdat, Yaogong Wang, David Wetherall, and David Zats. 2015. TIMELY: RTT-based congestion control for the datacenter. ACM SIGCOMM Computer Communication Review 45, 4 (2015), 537–550.
[42] Radhika Mittal, Alexander Shpiner, Aurojit Panda, Eitan Zahavi, Arvind Krishnamurthy, Sylvia Ratnasamy, and Scott Shenker. 2018. Revisiting network support for RDMA. In Proceedings of the 2018 Conference of the ACM Special Interest Group on Data Communication. 313–326.
[43] Behnam Montazeri, Yilong Li, Mohammad Alizadeh, and John Ousterhout. 2018. Homa: A receiver-driven low-latency transport protocol using network priorities. In Proceedings of the 2018 Conference of the ACM Special Interest Group on Data Communication. 221–235.
[44] NCCL. 2024. https://github.com/NVIDIA/nccl.
[45] Jonathan Perry, Amy Ousterhout, Hari Balakrishnan, Devavrat Shah, and Hans Fugal. 2014. Fastpass: A centralized "zero-queue" datacenter network. In Proceedings of the 2014 ACM conference on SIGCOMM. 307–318.
[46] Mubashir Adnan Qureshi, Yuchung Cheng, Qianwen Yin, Qiaobin Fu, Gautam Kumar, Masoud Moshref, Junhua Yan, Van Jacobson, David Wetherall, and Abdul Kabbani. 2022. PLB: congestion signals are simple and effective for network load balancing. In Proceedings of the ACM SIGCOMM 2022 Conference. 207–218.
[47] Costin Raiciu, Sebastien Barre, Christopher Pluntke, Adam Greenhalgh, Damon Wischik, and Mark Handley. 2011. Improving datacenter performance and robustness with multipath TCP. ACM SIGCOMM Computer Communication Review 41, 4 (2011), 266–277.
[48] Arjun Roy, Hongyi Zeng, Jasmeet Bagga, George Porter, and Alex C Snoeren. 2015. Inside the social network's (datacenter) network. In Proceedings of the 2015 ACM Conference on Special Interest Group on Data Communication. 123–137.
[49] Leah Shalev, Hani Ayoub, Nafea Bshara, and Erez Sabbag. 2020. A cloud-optimized transport protocol for elastic and scalable hpc. IEEE micro 40, 6 (2020), 67–73.
[50] Cha Hwan Song, Xin Zhe Khooi, Raj Joshi, Inho Choi, Jialin Li, and Mun Choon Chan. 2023. Network load balancing with in-network reordering support for rdma. In Proceedings of the ACM SIGCOMM 2023 Conference. 816–831.
[51] Erico Vanini, Rong Pan, Mohammad Alizadeh, Parvin Taheri, and Tom Edsall. 2017. Let it flow: Resilient asymmetric load balancing with flowlet switching. In 14th USENIX Symposium on Networked Systems Design and Implementation (NSDI 17). 407–420.
[52] Zirui Wan, Jiao Zhang, Mingxuan Yu, Junwei Liu, Jun Yao, Xinghua Zhao, and Tao Huang. 2024. Bicc: Bilateral congestion control in cross-datacenter rdma networks. In IEEE INFOCOM 2024-IEEE Conference on Computer Communications. IEEE, 1381–1390.
[53] Zilong Wang, Layong Luo, Qingsong Ning, Chaoliang Zeng, Wenxue Li, Xinchen Wan, Peng Xie, Tao Feng, Ke Cheng, Xiongfei Geng, et al. 2023. SRNIC: A Scalable Architecture for RDMA NICs. In 20th USENIX Symposium on Networked Systems Design and Implementation (NSDI 23). 1–14.
[54] Kaiqiang Xu, Decang Sun, Hao Wang, Zhenghang Ren, Xinchen Wan, Xudong Liao, Zilong Wang, Junxue Zhang, and Kai Chen. 2025. Design and Operation of Shared Machine Learning Clusters on Campus. In Proceedings of the 30th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 1. 295–310.
[55] Hong Zhang, Junxue Zhang, Wei Bai, Kai Chen, and Mosharaf Chowdhury. 2017. Resilient datacenter load balancing in the wild. In Proceedings of the Conference of the ACM Special Interest Group on Data Communication. 253–266.
[56] Yibo Zhu, Haggai Eran, Daniel Firestone, Chuanxiong Guo, Marina Lipshteyn, Yehonatan Liron, Jitendra Padhye, Shachar Raindel, Mohamad Haj Yahia, and Ming Zhang. 2015. Congestion control for large-scale RDMA deployments. ACM SIGCOMM Computer Communication Review 45, 4 (2015), 523–536.

A Additional Information
Appendices are supporting material that has not been peer-reviewed.

A.1 More Explanations for Results in §6
Performance fluctuation of IRN w/ AR. In general, the performance disadvantage of IRN w/ AR is less pronounced under low-load or non-congested scenarios, where the degree of packet reordering is minimal. However, under high-load or congested conditions, IRN w/ AR suffers more significantly. This trend is consistently observed across all our experiments. Specifically, in Fig. 13a and Fig. 13b, the workload load is 0.3, a relatively light load. As a result, the P50 and P95 latencies of IRN w/ AR are comparable to DCP, showing no obvious disadvantage. In contrast, Fig. 13c and Fig. 13d have a higher load of 0.5. The P50 latency typically reflects performance under non-congested conditions, so IRN w/ AR and DCP still show similar P50 latency. By contrast, the P95 latency better captures performance under congestion, where IRN w/ AR shows a noticeable performance gap compared to DCP.

AI workloads inherently consist of coflows, and the reported JCT reflects the tail latency among these coflows. Furthermore, AI workloads are often highly synchronized, which leads to bursts of traffic and transient high load. Under these conditions, the limitations of IRN w/ AR in handling reordering and congestion are more pronounced, resulting in significantly worse JCT compared to DCP under the AI workloads (as shown in Fig. 14).

PFC's oscillating behavior in cross-DC experiments. In cross-DC/long-distance scenarios (as shown in Fig. 15), all lossless schemes exhibit obvious performance variability across flows. For example, both PFC and MP-RDMA show oscillating behavior in Fig. 15b and Fig. 15d, although the effect is less pronounced for MP-RDMA. This variability arises because we increase the switch buffer sizes for lossless schemes from 32 MB to 600 MB and 6 GB for the 100 km and 1000 km distances, respectively. The large buffers can cause some large flows to perform exceptionally well when PAUSE is not triggered, while others perform poorly.
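As a back-of-the-envelope illustration of why lossless schemes require such large buffers over long distances, the sketch below computes per-port bandwidth-delay products (BDP) for the 500 us (100 km) and 5 ms (1000 km) one-way delays used in the evaluation. The 100 Gbps link speed is an assumption for illustration only; PFC headroom must additionally cover PAUSE reaction time and scales with the number of ports, which is consistent in order of magnitude with the 600 MB and 6 GB figures above.

```python
# Illustrative sketch only: link speed (100 Gbps) is an assumption,
# not a value stated in the paper.

def bdp_bytes(link_gbps: float, one_way_delay_s: float) -> float:
    """Bandwidth-delay product in bytes for one round trip (2x one-way delay)."""
    rtt_s = 2 * one_way_delay_s
    return link_gbps * 1e9 / 8 * rtt_s

# 100 km leaf-spine span: 500 us one-way; 1000 km: 5 ms one-way.
for km, delay in [(100, 500e-6), (1000, 5e-3)]:
    bdp = bdp_bytes(100, delay)
    print(f"{km} km: per-port BDP ~ {bdp / 1e6:.1f} MB")
# 100 km  -> ~12.5 MB per port
# 1000 km -> ~125.0 MB per port
```

Multiplying the per-port BDP by the number of ports that may be paused simultaneously gives the order of magnitude of headroom a lossless fabric must provision, which is exactly the cost DCP avoids by keeping only the control plane lossless.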

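To make the RACK-TLP behavior discussed in §6.3 and §8 concrete, here is a simplified, illustrative sketch of its loss-detection rule: per-packet transmission timestamps plus a reordering window of one estimated RTT. The data structures and names are ours, and the actual RFC 8985 algorithm adapts the reordering window dynamically rather than fixing it at one RTT.

```python
# Simplified sketch of RACK-style loss detection (not the full RFC 8985
# algorithm): a packet is declared lost once a later-sent packet has been
# ACKed and the packet has been outstanding longer than ~1 RTT.
from dataclasses import dataclass

@dataclass
class SentPacket:
    seq: int
    sent_time: float     # RACK keeps a timestamp per packet, which is
    acked: bool = False  # the memory overhead noted in the text.

def rack_detect_losses(packets, newest_acked_sent_time, now, srtt):
    """Return seqs considered lost under a one-RTT reordering window."""
    reordering_window = srtt  # simplification: window == estimated RTT
    lost = []
    for p in packets:
        if p.acked:
            continue
        # Sent no later than the most recently delivered packet...
        if p.sent_time <= newest_acked_sent_time:
            # ...and outstanding longer than the reordering window.
            if now - p.sent_time > reordering_window:
                lost.append(p.seq)
    return lost

# Example: pkt 1 (sent t=0) is unACKed while pkt 2 (sent t=0.1) was ACKed;
# at t=1.2 with srtt=1.0, pkt 1 is declared lost.
pkts = [SentPacket(1, 0.0), SentPacket(2, 0.1, acked=True)]
print(rack_detect_losses(pkts, newest_acked_sent_time=0.1, now=1.2, srtt=1.0))
# -> [1]
```

The sketch makes the trade-off in the text visible: no retransmission fires until a full reordering window has elapsed, so retransmission is delayed by roughly one RTT, and every in-flight packet must carry a timestamp.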