
AccelUPF: Accelerating the 5G user plane using programmable hardware

Abhik Bose∗, Shailendra Kirtikar∗, Shivaji Chirumamilla∗, Rinku Shah+, Mythili Vutukuru∗
Indian Institute of Technology Bombay∗, Indraprastha Institute of Information Technology Delhi+
India
{abhik,shailendra,shivaji}@cse.iitb.ac.in, [email protected], [email protected]
ABSTRACT
The latest generation of 5G telecommunication networks are expected to provide high throughput and low latency while catering to diverse applications like mobile broadband, dense IoT, and self-driving cars. A high performance User Plane Function (UPF), the main element in the 5G user plane, is critical to achieving these performance goals. This paper presents AccelUPF, a 5G UPF that offloads functionality to programmable dataplane hardware for performance acceleration. While prior work has proposed accelerating the UPF by offloading its data forwarding functionality to programmable hardware, the Packet Forwarding Control Protocol (PFCP) messages from the control plane that configure the hardware data forwarding rules were still processed in software. We show that only offloading data forwarding and not PFCP message processing leads to suboptimal performance in the UPF for applications like IoT that have a much higher ratio of PFCP messages to data traffic, due to a bottleneck at the software control plane that configures the hardware packet forwarding rules. In contrast to prior work, AccelUPF offloads both PFCP message processing as well as data forwarding to programmable hardware. AccelUPF overcomes several technical challenges pertaining to the processing of the complex variable-sized PFCP messages within the memory and compute constraints of programmable hardware platforms. Our evaluation of AccelUPF implemented over a Netronome programmable NIC and an Intel Tofino programmable switch demonstrates performance gains over the state-of-the-art UPFs for real-world traffic scenarios.

CCS CONCEPTS
• Networks → In-network processing; Programmable networks; Network performance analysis; Mobile networks.

KEYWORDS
5G core, 5G user plane, programmable networks, in-network computation

ACM Reference Format:
Abhik Bose, Shailendra Kirtikar, Shivaji Chirumamilla, Rinku Shah, Mythili Vutukuru. 2022. AccelUPF: Accelerating the 5G user plane using programmable hardware. In The ACM SIGCOMM Symposium on SDN Research (SOSR) (SOSR '22), October 19–20, 2022, Virtual Event, USA. ACM, New York, NY, USA, 15 pages. https://doi.org/10.1145/3563647.3563651

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
SOSR '22, October 19–20, 2022, Virtual Event, USA
© 2022 Association for Computing Machinery.
ACM ISBN 978-1-4503-9892-3/22/10...$15.00
https://doi.org/10.1145/3563647.3563651

1 INTRODUCTION
The mobile packet core connects the wireless radio access network (with base stations and mobile users) to external networks. The packet core consists of several control plane components that process signaling messages from mobile users (e.g., for authentication, setting up sessions to transfer data, handling mobility-related events) and the User Plane Function (UPF) on the data plane that forwards user traffic to and from external networks. The two planes communicate using PFCP (Packet Forwarding Control Protocol) messages that are sent by the control plane to establish, modify, and delete packet forwarding rules in the user plane, as shown in Figure 1. The most recent fifth generation (5G) telecommunication networks aim to support use cases with high throughput (∼1 Gbps/user), very low processing latencies (<1 ms), stringent quality of service (QoS), and diverse traffic characteristics, e.g., enhanced mobile broadband, dense deployments of IoT devices, self-driving cars, AR/VR, high-speed entertainment in a moving vehicle, and delay-sensitive video applications [27, 42]. A high performance and low cost UPF is necessary for meeting these requirements.

Most state-of-the-art UPFs today are built as multicore-scalable software packet processing appliances running over


commodity servers, and process traffic using a high performance packet I/O mechanism like the Data Plane Development Kit (DPDK) [11]. However, given the stringent performance requirements of 5G networks, and ever increasing network speeds running into a few hundreds of Gbps, prior work has proposed accelerating the UPF by offloading data forwarding functionality to programmable hardware [5–8, 18, 19, 22, 24, 31, 33], leveraging the availability of high-level programming languages like P4 [21] to program such hardware. Prior work has quantified the performance, cost, and power savings of such offload [20]. Our evaluation comparing a production-grade DPDK-based software UPF with a state-of-the-art programmable hardware accelerated UPF (Table 1 in §5) also shows that hardware acceleration improves performance per unit cost by ∼31%, and performance per unit power consumed by ∼92%.

Figure 1: 5G Architecture.

It is important to note that the hardware-accelerated UPFs in prior work offload only the user data forwarding to hardware. PFCP messages are still processed in software, and standard APIs exposed by hardware vendors are then used to configure packet forwarding rules in hardware. These APIs have a limited capacity for installing packet forwarding rules, which in turn limits the PFCP message processing capacity of today's offload-based UPFs that only offload data forwarding. Our measurements (Table 1 in §5) show that such an offload-based UPF can comfortably forward traffic at the 40Gbps linerate in our setup, but can process only a few hundred PFCP messages/sec.

But is the low rate of PFCP message processing in today's offload-based UPFs a bottleneck for real-life traffic? Some 5G use cases like IoT [34] or high mobility vehicular communication [32] are expected to frequently set up and reconfigure the data plane via signaling messages, roughly in the order of tens of seconds. For a UPF handling a few hundred thousand users [26, 39], this can translate to a few tens of thousands of PFCP messages to be processed every second, which is too high for today's offload-based UPFs to handle. Note that in our experiments, the DPDK-based software UPF was able to process a few thousand PFCP messages/sec on a single core, but we have already shown that software message processing is inefficient with respect to cost and power consumption. Therefore, for 5G use cases that generate significant amounts of signaling messages and require frequent reconfigurations of the user plane, neither the DPDK-based UPF nor today's offload-based UPF can deliver high performance, low power consumption, and low cost, all together.

So, why not simply offload the processing of PFCP messages to programmable hardware as well? PFCP message processing in programmable hardware is challenging for several reasons, as observed by prior work [20, 33]. PFCP message headers are variable in size, with multiple levels of nesting and several optional fields. Parsing such complex headers within programmable hardware is difficult because hardware is designed to run at linerate and is therefore constrained with respect to the instruction set and memory resources available. Further, the match-action tables of the programmable hardware that are used to store packet forwarding rules in today's hardware accelerated UPFs are configurable only from the software control plane via standard APIs, and cannot be directly modified from the switch data plane. While some switch memory (in the form of stateful register arrays) can be modified directly from within the data plane, such memory can only be accessed via a restricted interface of index-based access and is not as versatile as the match-action tables. Further, this limited switch memory may not accommodate the packet forwarding rules of all users, and is also not persistent across switch failures. Therefore, offloading the processing of PFCP messages to programmable hardware is non-trivial, and has not been attempted before in prior work to the best of our knowledge.

This paper proposes the design, implementation, and evaluation of AccelUPF, a programmable hardware accelerated 5G UPF that offloads most UPF functionality, including the processing of most types of PFCP messages and user data forwarding, to programmable hardware. Our key insight is to offload the processing of the more common and simpler patterns of PFCP messages to the fastpath on hardware, while handling the more complex and infrequent PFCP messages in software. Our design incorporates several novel ideas to handle common PFCP messages in the hardware fastpath. First, the hardware PFCP parser in AccelUPF identifies the mandatory and optional fields in the variable-sized, nested PFCP headers, and chooses the parser states dynamically based on the fields present in the received header (§3.3). Second, AccelUPF stores packet forwarding rules in stateful register arrays within the switch hardware. The register arrays allow only index-based access, so we use the hash of the header fields of received packets as an index to access the array, handling hash collisions and switch memory overflows on the slowpath in software. Further, because PFCP messages and data traffic contain different header fields, we maintain packet forwarding rules across multiple register arrays indexed in different ways, in order to access them correctly across PFCP


and data traffic (§3.4). Third, we deploy a regular software UPF in the slowpath for traffic that cannot be handled in the fastpath within the programmable hardware, and we ensure that the UPF state is shared correctly across the software and hardware processing (§3.5). Finally, we use in-network replication of the register array state to protect against switch failures (§3.6).

We implemented AccelUPF on two different P4 programmable data plane hardware platforms: an Agilio CX Netronome smart NIC [46] and an Intel Tofino programmable switch [25], using an existing production-grade standards-compliant software UPF [9] on the slowpath. Experiments with our AccelUPF prototypes show that AccelUPF achieves significantly higher PFCP message processing and data forwarding throughput, especially when normalized by cost or power consumed, as compared to both a pure software UPF as well as a hardware accelerated UPF where only user data forwarding has been offloaded. For use cases with a significantly higher fraction of PFCP messages like IoT, AccelUPF provides up to 56% higher throughput than the best performing UPFs in prior work.

Our work makes the following contributions. (i) We show that prior work on programmable hardware accelerated 5G UPFs, where only data forwarding is offloaded to hardware, does not perform well for use cases which generate a high rate of signaling messages, because of the limited capacity of the software control plane APIs that install packet forwarding rules in the hardware. (ii) We design AccelUPF, a programmable hardware accelerated UPF that offloads not just the data forwarding but also most PFCP message processing to programmable hardware, and experiments with our implementation show significant performance gains over existing state-of-the-art UPFs for real-world traffic. (iii) Our design illustrates how one can offload complex control plane message processing to programmable hardware, and our techniques are broadly applicable to other applications as well. (iv) By identifying the challenges in offloading PFCP processing to programmable hardware, our work can better inform future standardization efforts in 6G and beyond.

The rest of the paper is organized as follows. We begin with the background required to understand our work (§2), and then proceed to describe the design (§3), implementation (§4), and evaluation (§5) of AccelUPF. We then present related work (§6) and conclusions (§7).

Figure 2: PDU session establishment callflow.

2 BACKGROUND
This section provides the relevant background on 5G network architecture and programmable data plane hardware that is required to follow the rest of the paper.

5G architecture and procedures. Figure 1 shows the 5G architecture [4]. The 5G mobile packet core connects the wireless radio access network (RAN)—which includes the User Equipment (UE) and the base station (gNB)—to external networks. The packet core executes several signaling procedures on behalf of the UE. When a UE connects to a mobile network for the first time, it triggers an initial registration procedure via the base station at the Access and Mobility Function (AMF), which communicates with other components in the control plane of the packet core to authenticate and register the user. A registered UE that wishes to send data through the packet core must set up one or more PDU sessions, each with possibly different QoS requirements, using the PDU session establishment procedure. Such session-related procedures are coordinated by the Session Management Function (SMF) in the packet core. Once the sessions are set up, the actual traffic in the user plane is forwarded via the base station through one or more User Plane Functions (UPFs) in the packet core. The user plane traffic is encapsulated in GPRS Tunnelling Protocol (GTP) headers inside the packet core, and this tunnelling helps manage mobility of the UE in the network. A PDU session has two tunnels (identified by tunnel identifiers, or TEIDs, in the GTP header), one to carry uplink traffic from the base station and one to carry downlink traffic from external networks to the UE. On the uplink, the base station encapsulates the incoming UE IP traffic into GTP headers and the last UPF in the packet core performs the decapsulation. For downlink traffic, the UPF performs the encapsulation and the base station does the decapsulation. During its lifetime, a UE will trigger many other signaling procedures in the packet core, e.g., an AN (access network) release procedure to move to an idle state after inactivity, a service request procedure to reactivate itself when it wishes to communicate again, a handover procedure to move to a different location, and so on.

PFCP messages. The SMF and the UPF communicate using Packet Forwarding Control Protocol (PFCP) messages that are exchanged over a UDP connection between both nodes [3]. A few important PFCP messages sent by the SMF to the UPF include the PFCP session establishment request, PFCP session modification request, and the PFCP session


deletion request, to establish, modify, and delete sessions at the UPF respectively. After processing these messages, the UPF sends back the corresponding response messages over PFCP as well, indicating the status (success or failure) of the request. A single signaling procedure of the UE such as a PDU session establishment can trigger multiple PFCP request/response exchanges between the SMF and the UPF. For example, we show a simplified UE initial PDU session establishment callflow in Figure 2. During this procedure, the SMF first sends a PFCP session establishment request to the UPF to set up the uplink GTP tunnel, and later, after further communication with the base station, sends a PFCP session modification message to set up the downlink GTP tunnel. Several other signaling procedures like the AN release, service request, and handover will also involve one or more PFCP request/response messages exchanged between the SMF and UPF, e.g., to mark a session as idle/active or to switch the tunnel to another base station. The rate of PFCP messages received at a UPF will depend on the applications running on the UEs being served by the UPF. Several new use cases of 5G like dense IoT or high speed mobility are expected to generate a high rate of PFCP messages. For example, a UE running an IoT application will frequently establish sessions, go idle, and become active again, while transferring small amounts of data in between, leading to a relatively higher proportion of PFCP messages in its generated traffic.

UPF processing. The UPF primarily handles two types of incoming traffic: PFCP messages that set up, modify, and delete various packet forwarding rules corresponding to UE data sessions at the UPF, and user plane (GTP) traffic that is then handled as per these established rules. There are several types of rules at the UPF, as shown in Figure 2. Packet Detection Rules (PDRs) help match the traffic of a session based on packet header fields, e.g., source/destination IP address/port number, or GTP TEIDs. With each PDR, we have other associated rules that specify the action to be taken on the traffic that matches the PDR: Forward Action Rules (FARs) specify the forwarding action to be applied on a packet (e.g., GTP TEIDs to use for encapsulation and decapsulation), QoS Enforcement Rules (QERs) specify the QoS that must be enforced (e.g., maximum bit rate allowed for the session), Buffering Action Rules (BARs) specify buffering requirements when the UE is idle, and Usage Reporting Rules (URRs) specify how usage reporting should be performed for billing and charging. These various PDRs and their associated FARs, QERs, BARs, and URRs are established, modified, and deleted via PFCP messages from the SMF to the UPF. Once these rules are in place at the UPF, user plane GTP traffic is handled by finding a PDR that matches the received packet, and executing the actions specified by the associated FARs, QERs, BARs, and URRs. Prior work that uses programmable data plane hardware to accelerate the UPF [5, 7, 8, 18, 22, 24, 33] processes PFCP messages in the software control plane, installs the various rules in the hardware, and offloads only the GTP user plane traffic handling to the programmable hardware.

PFCP message structure. A PFCP message has a highly complex structure. A PFCP message has several Information Elements (IEs), which are used to create, modify, or delete the packet forwarding rules at the UPF. For example, a PFCP session establishment message contains the following IEs [3]: a node identifier of the SMF which sent this message, a unique session identifier (SEID) that identifies the session, followed by one or more IEs to create PDRs, FARs, BARs, QERs, and URRs. Now, while some IEs like the node ID, SEID, and the IEs to create a PDR and FAR are mandatory, some other IEs are optional and need not always be specified. Furthermore, a PFCP session establishment message can have a variable number of IEs to create PDRs, and each of these PDRs can cross-reference the same or different FARs, BARs, and so on. Each of these IEs to create PDRs and other rules has a nested structure with several smaller IEs contained within, which can further have mandatory and optional elements. To complicate things further, the 3GPP standards allow IEs to be present in any order inside a message. Therefore, parsing and processing a PFCP message is a highly complicated operation that is hard to fully implement within the restricted processing available in hardware. This is the reason why no prior work that uses programmable hardware to accelerate the UPF proposes processing PFCP messages in hardware.

Programmable hardware. Before the introduction of programmable data plane hardware, a high performance packet processing network element like the UPF was either developed as a fixed function hardware appliance or as a software packet processing application running over commodity hardware. While a hardware implementation provided higher and more deterministic performance, a software implementation had the benefit of easy programmability to add new features. In contrast to fixed function hardware, programmable dataplane hardware can be easily programmed (and quickly reprogrammed) to perform complex packet processing functions, via code written in a high-level language like P4 [21]. Therefore, programmable data planes provide the best of both worlds, with the performance of a hardware implementation and the flexibility of a software implementation. Packet processing specifications written in a high-level language like P4 are compiled to a variety of targets, e.g., programmable hardware ASICs [1, 2, 36, 37], NPUs [13, 45], and FPGAs [10, 17]. Languages like P4 have several limitations put in place in order to ensure linerate processing of the software specification. They have limited expressiveness in terms of the supported instruction set and programming constructs. Packets cannot stall during the switch pipeline processing—they have to be either forwarded or dropped. The amount of on-board memory on such hardware is limited in capacity


(∼few tens of MBs) [46], and provides a restricted storage model. Despite these limitations, researchers have observed substantial performance benefits by offloading applications to programmable hardware, though the functionality that is offloaded is often something that is simple enough to be executed within the limitations imposed by the hardware.

Figure 3: AccelUPF design.

3 DESIGN
The goal of AccelUPF is to accelerate the performance of the 5G UPF using programmable data plane hardware. However, unlike prior work in this area, we aim to offload not just the GTP user plane forwarding but also the PFCP processing. We begin by describing the technical challenges that make this goal non-trivial to achieve.

3.1 Challenges
Complexity of PFCP processing. PFCP messages have a very complex structure, due to factors such as a variable number of information elements (IEs), a large number of optional IEs in each message, the nested structure of IEs, and the flexibility of ordering of the IEs within each message that is allowed by the 3GPP standards specification. Given this complexity, it is not easy to fully process all PFCP messages within the programmable hardware platforms available today.

Updating switch state from the data plane. Most applications that offload functionality to programmable hardware use the match-action tables available within the hardware platforms to store application state. Incoming packets are matched against these rules using the various fields in the packet header as keys, and the action corresponding to the matched rule is executed on the packet. While this key-based matching is often implemented efficiently using fast specialised hardware that can perform exact as well as ternary (wildcard) matches, today's programmable data planes only allow the match-action tables to be configured from the software control plane via standard APIs, which can become a performance bottleneck under high PFCP traffic.

Hardware memory and compute limitations. The memory available to store application state in programmable hardware is limited, and may not be enough to accommodate the state of all users being served by a given UPF. The limited memory can also make some data traffic processing (e.g., buffering of data packets for idle users or for sessions that exceed their rate limit) hard to do within the hardware. Finally, the failure of the switch can result in the loss of application state stored in switch memory.

3.2 Design overview
We now provide an overview of AccelUPF's design (Figure 3), and the key ideas that help us address the challenges above.

Fastpath PFCP processing (§3.3). Given the complexity of processing PFCP messages in hardware, AccelUPF splits the PFCP processing into a fastpath in hardware and a slowpath in software. We identify the most frequent and simple patterns of PFCP messages that can be handled in hardware on the fastpath, but even these simple messages have a large number of optional IEs and several levels of nesting. To parse such messages correctly, AccelUPF identifies the finest granularity of IEs in the various PFCP headers, and chooses parser states dynamically based on the presence or absence of the various optional IEs.

Hardware data structures (§3.4). AccelUPF aims to avoid software control plane involvement in the fastpath of PFCP processing, and uses in-switch stateful memory called register arrays (which can be read and written from within the data plane itself) to store the packet forwarding rules present in the PFCP messages. However, unlike match-action tables, register array access is only index-based and not key-based. So, AccelUPF uses the hash of header fields in received packets as an index to access packet forwarding rules within the register arrays. Computing this index is complicated by the fact that PFCP messages and GTP traffic have different sets of fields in the packet headers. Therefore, the data structures that store the packet forwarding rules in the register arrays are carefully designed so that the rules can be looked up via different indices for different types of traffic.

Software fallback (§3.5). AccelUPF uses the software slowpath as a fallback for sessions that cannot be handled in hardware. We ensure that the UPF state is shared correctly across the hardware and software components of AccelUPF, with clear state ownership to avoid race conditions.

Replication of switch state (§3.6). Unlike state stored in match-action tables, which can easily be made fault tolerant using replication at the software layer, application state stored in register arrays in AccelUPF can be lost due to switch failures. To avoid losing the forwarding rule state of users, AccelUPF replicates the switch state created due to hardware


processing of PFCP messages across other switches in the


data plane using chain replication.

3.3 PFCP processing


AccelUPF handles the most common PFCP messages in the
hardware fastpath, and redirects the remaining PFCP mes-
sages (and their associated data traffic) to the slowpath in
software. PFCP messages that pertain to setting up and main-
taining the PFCP association between SMFs and UPFs (also
called PFCP node messages) are infrequent, happen outside
the critical path of UE applications, and do not impact user-
perceived performance in any way. Therefore, we handle all Figure 4: IEs implemented in the AccelUPF fastpath.
such messages only in the slowpath. Besides these messages, M, C and O represent Mandatory, Conditional and Op-
there are three main PFCP messages related to UE session tional IEs respectively.
management received at the UPF, which are the PFCP session
establishment, modification, and deletion messages. such messages correctly in hardware, we make the assump-
While a general PFCP session message can have a complex tion that the various IEs within a PFCP message appear in
structure, with a variable number of packet forwarding rules the exact order in which they are specified in the standards
per message, we argue that the message structure is simpler documents. Without this assumption, the complexity of pars-
in the common case of a single application or service at ing a PFCP message in hardware becomes intractable as the
a UE (e.g., mobile broadband or IoT) establishing a PDU various IEs cannot be represented as a directed acyclic graph
session to transfer data. To see why, consider the initial (DAG) anymore. Even if this assumption does not hold in
PDU establishment callflow shown in Figure 2. Here, the some existing 5G implementations, it is easy to enforce with
SMF first sends a PFCP session establishment request to the minimal changes to control plane components like SMF.
UPF, which creates one uplink PDR (containing the UE’s Second, PFCP messages contain a large number IEs (321 in
source IP address, GTP TEID, QoS flow identifier or QFI, and the latest specification version 17.5.0 [3]). Of these, some are
other packet header fields that identify a session’s uplink classified as mandatory, some optional, and some conditional
traffic) and its actions. Later, the SMF sends a PFCP session (i.e., mandatory or optional, depending on some condition
modification request, which associates with the same session being satisfied at the time of processing the message). Further,
a downlink PDR (containing the UE’s destination IP address many IEs are nested IEs, with each nested level having its
and other packet header fields to identify downlink traffic) own set of mandatory, optional, and conditional IEs. For
and its corresponding actions. Beyond these default rules, example, in a PFCP session establishment request handled
more PDRs for other applications can be added later to the
same session via PFCP session modification messages, with
each PDR created by a separate PFCP message. Therefore,
PFCP messages in this common case do not necessarily need
to contain more than one PDR in a single message.

Given this observation, the hardware fastpath of AccelUPF
handles PFCP session establishment messages that create
exactly one PDR, exactly one FAR, and at most one (optional)
QER. Other PFCP session establishment messages, e.g., those
creating more than one PDR, are handled in the slowpath. We
place similar restrictions on the PFCP session modification
request as well. Note that while we can extend our design to
handle messages that create two (or any such small, fixed
number of) PDRs in the hardware fastpath, handling an
open-ended number of PDRs is infeasible in hardware.

Now, we note that designing a parser that can parse even
this restricted set of PFCP messages is non-trivial, for several
reasons. First, the 3GPP specifications do not mandate a fixed
order to the various IEs within a PFCP message. To parse […]
by AccelUPF fastpath (Figure 4), the session identifier is a
mandatory IE, while the UE IP address is an optional IE. The
IE to create a QoS rule is a nested, conditional IE, which
will be present for UEs specifying QoS rules, and absent
otherwise. To parse such messages correctly, we identify the
smallest units (simple non-recursive IEs, or even parts of IEs)
that may be present or absent in a PFCP message. We classify
conditional IEs as mandatory or optional suitably, according
to the PFCP message structure being handled in the fastpath.
Now, between every pair of mandatory IEs, we generate
multiple parser states and state transitions, guided by the
absence/presence of the various optional IEs. At each stage
of the parsing, if the next IE is optional, the parser looks at
the first 16 bits of the next IE, and decides the next parser
state based on which IE it finds next. For example, consider
a sequence of 4 IEs, say, A, B, C, and D, in a message. Out of
these IEs, suppose IEs A and D are mandatory, and B and C
are optional. To correctly parse these optional IEs, we create
the following parser states and state transitions: A → B →
C → D, A → C → D, A → D, A → B → D. The actual parser
state transitions will be guided by whether IEs B and C are
present in the received message or not. We create multiple
parser states in this manner for all optional IEs in the message.
Across the entire PFCP session establishment message shown
in Figure 4, this results in 28 parser states, 19 distinct paths
in the directed acyclic parse graph (DAG), and 42 transitions
between parse states.

AccelUPF: Accelerating the 5G user plane using programmable hardware SOSR ’22, October 19–20, 2022, Virtual Event, USA

Figure 5: AccelUPF hardware data structure.
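The path enumeration for optional IEs described in §3.3 can be sketched in Python as follows. This is a toy model with hypothetical names, not the AccelUPF code (the actual fastpath parser is generated as P4 parser states): every subset of the optional IEs may be absent, and each such subset yields one path through the parse DAG.

```python
from itertools import combinations

def parser_paths(ies):
    """Enumerate the distinct parse paths for a sequence of IEs.

    `ies` is a list of (name, mandatory) pairs in wire order. Mandatory
    IEs appear on every path; each subset of the optional IEs may be
    absent from a message, giving one path per subset.
    """
    optional = [name for name, mandatory in ies if not mandatory]
    paths = []
    for k in range(len(optional) + 1):
        for absent in combinations(optional, k):
            paths.append([name for name, mandatory in ies
                          if mandatory or name not in absent])
    return paths

# The 4-IE example from the text: A and D mandatory, B and C optional.
paths = parser_paths([("A", True), ("B", False), ("C", False), ("D", True)])
```

For the example above this yields the four paths A → B → C → D, A → B → D, A → C → D, and A → D. A generator for a real hardware parser would additionally emit, at every state, a transition keyed on the 16-bit IE type peeked from the next bytes, as described in the text.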
Given the challenges in parsing PFCP messages in hardware
identified above, we believe that future standardization
efforts must incorporate simplifications to the PFCP message
format, e.g., enforcement of a fixed IE order, or a limit on the
length and/or number of IEs supported in each message, in
order to enable faster processing on hardware accelerators.

3.4 Hardware data structures
After parsing PFCP messages, AccelUPF stores the resulting
packet forwarding rules inside stateful memory available
within most modern programmable hardware platforms,
called register arrays. A register array is a storage abstraction
for an array of P4 registers, where the width of the register
and the length of the array are configurable. The register
array entries can be read or written during the various stages
of a packet processing pipeline, directly from the switch data
plane itself. While one can look up entries in a match-action
table using a key (either exact or ternary), a register array
entry can be accessed only using its index.

A strawman approach to store packet forwarding rules
in register arrays would be to compute a hash over some
packet header fields and use this hash as an index to access
the register array entry required to process this packet. But
which packet header fields can we use? Every PFCP session
message has a unique session identifier (SEID), which
uniquely identifies a session. Therefore, one would think that
all session-related state, including the PDRs, FARs, and other
rules associated with a session, can be stored and retrieved
in a register array using the hash of a SEID as an index.
However, user plane traffic received from the UE (downlink IP
traffic that must be encapsulated in GTP headers, or uplink
GTP traffic that must be decapsulated) does not contain this
session identifier in the GTP or IP layer headers. For example,
given only the information within a GTP data packet
(GTP TEIDs, source/destination IP addresses, etc.), how does
one identify the exact session that this packet belongs to,
without knowing the index in the array at which this session
is stored, and without having the ability to loop over all
the entries in the array due to the constraints of hardware
linerate processing?

To overcome this challenge, AccelUPF stores packet forwarding
rules across multiple different register arrays, each
accessed via a different index, as shown in Figure 5. The first
data structure is called a session array, which is accessed
using the hash of the session identifier as an index. When
a PFCP session message (establishment, modification, or
deletion) is received, we compute the hash of the unique SEID,
and use this as an index to locate the register array entry at
which this session’s information is stored. This register array
entry is then updated with information about the session as
requested in the PFCP message. The second data structure is
the uplink match array, which stores information about the
PDR and the associated action rules, like the FAR, corresponding
to uplink traffic. The index into this array is the hash of
the various packet header fields that can be obtained from
the GTP-encapsulated packets in uplink traffic (we use the UE’s
source IP, TEID, and QoS flow identifier, but other choices are
possible too). An analogous data structure is the downlink
match array, which stores packet rules for the downlink traffic of a
session, and is indexed using the hash of the packet header
fields present in downlink traffic (the UE’s destination IP in our
case). Note that we will require different match arrays if we
wish to use a different set of packet header fields for index
computation. The entry of a session in the session array does
not store the PDRs directly within the entry itself; instead,
it stores the indices of the uplink/downlink match arrays at
which the uplink/downlink packet forwarding rules of a session
are stored. This cross-linkage helps us avoid duplication
of information across the various register arrays.

When AccelUPF receives a PFCP message, it computes
the hash of the SEID, locates the entry corresponding to
this session, and uses the stored indices to access all the
packet forwarding rules (PDRs, FARs, etc.) of the session that
are stored in the uplink and downlink match arrays. These
various register array entries are then suitably updated to
add/delete/modify rules based on the received PFCP request.
A PFCP response indicating the status (success/failure) of
the request is generated by modifying the contents of the
request packet itself. When AccelUPF receives a data packet,
it identifies whether it is an uplink or downlink packet, and
uses the hash of the suitable packet header fields to index
into the corresponding match array. It then verifies that the
packet matches the PDR, and executes the actions specified
within the FAR and other rules in case of a match.

AccelUPF currently supports only an exact matching of
packet header fields with the information in the PDR.
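The cross-linked register-array layout of §3.4 can be summarized in a small Python model. This is a hypothetical sketch with invented field names — the real state lives in P4 register arrays, and the hashes are computed in the switch pipeline:

```python
# Toy model of AccelUPF's three cross-linked register arrays.
ARRAY_SIZE = 65536  # 64K entries per array, as in the prototype

session_array  = [None] * ARRAY_SIZE  # indexed by hash(SEID)
uplink_match   = [None] * ARRAY_SIZE  # indexed by hash(UE src IP, TEID, QFI)
downlink_match = [None] * ARRAY_SIZE  # indexed by hash(UE dst IP)

def idx(*fields):
    """Hash-derived index: register arrays support access by index only."""
    return hash(fields) % ARRAY_SIZE

def establish_session(seid, ue_ip, teid, qfi, far):
    """Fastpath PFCP session establishment: one PDR and one FAR."""
    up, down = idx(ue_ip, teid, qfi), idx(ue_ip)
    uplink_match[up] = {"match": (ue_ip, teid, qfi), "far": far}
    downlink_match[down] = {"match": ue_ip, "far": far}
    # The session entry stores only the match-array indices
    # (cross-linkage), so rules are not duplicated across arrays.
    session_array[idx(seid)] = {"seid": seid, "up": up, "down": down}

def uplink_lookup(ue_ip, teid, qfi):
    """Data-plane lookup for an uplink GTP packet."""
    entry = uplink_match[idx(ue_ip, teid, qfi)]
    if entry is not None and entry["match"] == (ue_ip, teid, qfi):
        return entry["far"]  # exact PDR match: apply the FAR actions
    return None              # miss or hash collision: software slowpath
```

The exact-match check in `uplink_lookup` mirrors the PDR verification in the text: a packet whose header fields hash to an occupied index but do not match the stored rule falls through to the software slowpath.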


If one has to perform other kinds of complex matching, e.g., a
PDR specifies a prefix and we must perform a longest prefix
match, such matching cannot be performed using register
arrays. AccelUPF handles sessions with such packet rules
in the software slowpath. In addition to forwarding actions,
AccelUPF must also enforce other rules corresponding to
QoS enforcement, buffering data for idle users, and usage
reporting. Of such rules, our implementation currently
supports the enforcement of a session-wide aggregate maximum
bit rate (AMBR) by computing per-flow rates using ingress
timestamps and interpacket gaps. Support for more complex
policies is deferred to future work. If a session exceeds its
configured AMBR, its data packets have to be buffered, which
is once again handled in the slowpath.

What if packets belonging to two different sessions hash to
the same index within the match array? Our current
implementation stores two entries in a hash bucket using dual-width
registers and other such mechanisms available via P4
extern units on most programmable switches. We can also
use techniques such as multiple hash functions to find
alternate indices [38]. However, we will eventually face a hash
collision, where two different sessions are contending for the
same entry in the register array. Hash collisions are handled
in AccelUPF by processing all traffic of the colliding session in
the software slowpath.

3.5 Software fallback
Any PFCP message that cannot be handled within the hardware
fastpath in AccelUPF is redirected to a software UPF
running in host userspace, as shown in Figure 3. Examples
of such PFCP messages include: (i) node-related PFCP messages
that are not on the critical path of user-perceived performance;
(ii) PFCP messages that contain a large number
of PDRs and associated action rules, which cannot be easily
parsed in hardware; (iii) sessions that require complex
algorithms like longest prefix matching to match incoming
data traffic to packet forwarding rules; (iv) sessions which
hash to the same index in the register arrays. All such PFCP
messages are redirected to the slowpath software UPF, which
handles them normally, and creates suitable state in the form
of packet forwarding rules in software. All subsequent PFCP
session modification/deletion messages or GTP user plane
packets of this session will also not find a matching rule
in hardware and will thus be forwarded to, and correctly
processed in, the software slowpath.

How are the session state and packet forwarding rules shared
correctly across the hardware fastpath and software slowpath?
Note that for most sessions, the state is created, modified, and
deleted exclusively either within the hardware fastpath or in
software. Therefore, the question of ownership of state is trivial
to resolve in most cases. The only tricky scenario is when
a session is initially created in hardware, but needs to fall back
to the software slowpath midway due to reasons such as: (i)
a session that was being handled by the hardware fastpath
starts sending data at a rate beyond its configured maximum
bit rate, and must fall back to the software for buffering, or
(ii) we receive a PFCP session modification request for an
already established session that was being handled in the
hardware, and this PFCP message has a complex structure,
e.g., a rule that requires longest prefix matching or one where
multiple PDRs refer to the same FAR, and must therefore
be processed in software. For such scenarios, all subsequent
PFCP message processing and GTP forwarding of the session
must migrate from hardware to software.

This state migration is accomplished as follows. When
the hardware realizes that it can no longer process a certain
session in the fastpath, it marks all the session state in the
register arrays as invalid and under migration. All subsequent
PFCP messages and user plane packets are forwarded
to the software slowpath as a fallback. When the software
slowpath receives a PFCP session modification or deletion
message, or a GTP user plane packet, but does not find
corresponding state in its data structures, it probes the hardware
to find out if this is a case of a session being migrated from
hardware to software after initially being created in hardware. If
it finds the corresponding state in the hardware register arrays
marked for migration, it copies this state to software,
and deletes the corresponding invalid entry in the hardware
register array data structures. All subsequent packets of this
session will find the session and packet forwarding state in
software and will be correctly handled in the slowpath. If the
software UPF does not find the state to process a PFCP/GTP
packet either in its own software data structures or after
probing the hardware data structures, it drops the packet.

We note that an alternate design is possible where we
handle complex PFCP messages in software and install the
session rules directly into the hardware, allowing us to
process future traffic of such sessions in the fastpath. However,
this approach requires more frequent updates to the
hardware rules from the software slowpath. Considering the
hardware-software communication bottleneck and the limited
performance gains for only a small set of complex PFCP
sessions, AccelUPF has not implemented this hybrid approach.

3.6 Fault tolerance of switch state
AccelUPF stores packet forwarding rules in register arrays,
which are not persistent across switch failures. While a software
UPF, or even a hardware-accelerated UPF that only
offloads GTP user plane forwarding, can use software-based
mechanisms to replicate and persist session state across
switch/host failures, AccelUPF cannot rely on software
replication for hardware switch state. Therefore, AccelUPF relies


on techniques similar to those proposed in prior work for
replication of switch state [23, 29, 30, 49]. To maintain packet
forwarding rules in register arrays in a fault-tolerant manner,
every update to the register arrays is replicated at K + 1
switches arranged in a linear fashion, to achieve a K-fault-tolerant
system. The last switch in the replication chain acts
as the primary UPF, and sends a response to the PFCP request.
We use K = 1 in our current implementation. What
if the primary switch fails before replication is completed?
In such cases, we will not generate a response back to the
SMF, and the PFCP reliability feature [3] will ensure that the
SMF will detect the loss of the PFCP message via sequence
numbers embedded in PFCP messages, and will retry the
PFCP request once again at a new switch.

Note that our replication only increases the time required
for PFCP message processing, and does not impact the
performance of user traffic. This is because the GTP user plane
traffic is only routed to the primary UPF from the RAN, and
is re-routed to the backup switches in the replication chain
only when the primary fails. Figure 6 shows the path taken
by PFCP messages and GTP packets in our fault-tolerant
AccelUPF design.

Figure 6: AccelUPF fastpath fault tolerance.

4 IMPLEMENTATION
This section describes the implementation of AccelUPF. The
fastpath of AccelUPF is implemented in P4, and compiled
to run on two targets: an Agilio CX Netronome 2x40GbE
smart NIC [46] and an Intel Tofino based Edgecore Wedge
100BF-32X programmable switch [36]. For the smart NIC-based
programmable data plane platform, the fastpath runs on the
NIC and the software slowpath runs in the host userspace.
For the switch-based platform, the fastpath runs inside the
switch packet processing pipeline, and the slowpath can
either run on the switch CPU or on a host connected to the
switch.

We used two different hardware platforms to implement
our hardware fastpath because these platforms have different
internal implementations of abstractions like register arrays,
on which our design depends heavily. The Netronome smart
NIC platform uses multiple packet processing engines that
process packets in a run-to-completion manner. Large register
arrays are stored in memory that is shared across all
engines, and one engine accessing a register may cause
another engine to stall [47]. On the other hand, the Tofino
programmable switch uses a pipelined design where the
register array can be partitioned across multiple pipeline
stages, resulting in better performance when accessing registers
from the packet processing pipeline [28]. AccelUPF uses
register arrays available at different parts of the P4 pipeline
(ingress, egress), and across different pipelines, provided they
are independent from each other. AccelUPF distributes the
forwarding sessions among the different sets of register
arrays by matching the most significant bits of the register
array indices, which are generated by hashing the match
fields in the incoming PFCP and GTP packets.

Our slowpath implementation builds upon a production-grade,
standards-compliant software UPF [9]. We worked
with two different flavors of this UPF for our slowpath, one
running on the kernel network stack, and another built on
top of the high-speed DPDK packet I/O framework. Both
the kernel-based and DPDK-based UPFs came with a multicore
scalable design, with separate CPU cores processing
PFCP messages and GTP user plane traffic, though the GTP
forwarding throughput of the DPDK prototype was much
higher. We made minor modifications to the software UPFs
to work with our fastpath, e.g., to support migration of session
state from the fastpath in case of fallback to software.
We also use the unmodified pure software DPDK-based UPF
to serve as a baseline in our evaluation.

As another baseline, we modified the software UPFs to
offload the GTP user plane forwarding functionality to a
programmable NIC/switch. In this GTP-offloaded UPF prototype,
the UPF processes PFCP messages normally, and in
addition to installing session state locally, also installs the
packet forwarding rules within the NIC/switch hardware
using the programmable data plane hardware APIs provided
by the hardware platform vendors. We optimized this rule
installation to work at the maximum rate supported by the
hardware, by using multiple threads in the software control
plane to install rules in parallel.

The software UPF also came with a load generator, a
multi-threaded DPDK application that emulates the functionality
of the UEs, RAN, and the packet core control plane. The load
generator generates PFCP messages and GTP user plane
traffic on behalf of emulated UEs, and provides knobs to vary
the relative mix of PFCP messages and GTP traffic in the
generated traffic.

5 EVALUATION
Setup. We use two different programmable hardware platforms
to evaluate AccelUPF, shown in Figure 7. The setup


with the Netronome smart NIC contains three servers (AMD
Ryzen 9 5950X CPU @ 3.4GHz, 16 cores, 32GB RAM),
each connected to an Agilio CX 1/2x40GbE programmable
smart NIC [46]. The first server hosts the load generator that
generates PFCP and data traffic on behalf of emulated UEs,
the second hosts the various UPFs we test, and the third
hosts a sink that serves as the destination for the traffic
generated from the load generator. The setup with the Tofino
programmable switch consists of two servers (Intel Xeon
Gold 6234 CPU @ 3.3GHz processors, 24 cores, 128GB RAM)
with 40Gbps NICs connected via an Intel Tofino Edgecore
Wedge 100BF-32X programmable switch [25]. In this setup,
the first server runs the load generator and the second server
runs the sink. The slowpath software UPF runs on an internal
Intel Pentium D1517 CPU @ 1.60GHz connected to
the data plane of the switch over the PCI bus.

Figure 7: Experimental setup.

Parameters and metrics. Across all experiments, we vary
the mix of PFCP messages to GTP data packets in the UPF
traffic by varying the knobs provided in the load generator.
We measure the peak PFCP message processing throughput
(messages/sec) and the peak GTP data forwarding throughput
from the load generator. We also measure the average PFCP
processing latency and GTP forwarding latency at saturation.
All experiments were run for a duration of 300 seconds, and
we ensured that the load generator and the sink were not
the performance bottleneck.

UPF variants. We compare the performance of AccelUPF
(running on Netronome/Tofino) with the following baseline
UPFs described in §4: a pure software DPDK-based UPF, and a
GTP-offloaded UPF (with data forwarding offloaded to the
Netronome/Tofino programmable hardware platforms).

5.1 Microbenchmarks
We first conduct several simple microbenchmarking experiments
on all our UPF variants, and compare the PFCP message
processing throughput/latency and GTP data forwarding
throughput/latency. We also normalize the PFCP and
GTP throughputs by the cost and power consumption of
the corresponding commodity server or programmable hardware
platforms [12, 14, 15], in order to measure performance
per unit cost or power consumed. For the normalization,
we scaled the cost of the server/switch/NIC according to
the number of cores/ports actually used. All the results are
shown in Table 1. We highlight some observations below.

PFCP throughput and latency. We first generate session
establishment and deletion requests from emulated UEs in
the load generator, and measure the maximum PFCP message
processing throughput and average message processing
latency or RTT for all UPFs, for a workload consisting only
of PFCP messages. We find that AccelUPF has much higher
PFCP performance (4.3M PFCP messages/sec on Tofino) as
compared to the baseline UPFs, which could process only a
few thousand PFCP messages/sec.

A few observations on the results shown in Table 1. (i) We
note that the software UPF in our experiment was processing
PFCP messages on a single core. One could argue that the
performance of the software UPF can scale to a higher PFCP
processing capacity by adding more CPUs, but this scaling
would increase the overall cost and power consumption of
the system. For example, one would need over 500 cores on
our server hosting the software UPF for it to match the
performance of AccelUPF. The PFCP performance normalized
by cost or power consumed shows that AccelUPF is still more
efficient than the software UPF, even if one were to scale the
software UPF to a higher number of CPU cores. (ii) AccelUPF
also performs much better than the GTP-offloaded UPF, both
in terms of throughput and latency, because processing PFCP
messages in hardware is much more efficient than processing
them in software and installing the packet forwarding
rules in hardware. To confirm that hardware rule installation
is indeed the bottleneck in the GTP-offloaded UPFs, we
installed packet forwarding rules from a multi-threaded software
controller program and measured the maximum rate
at which the programmable hardware platform can install
packet forwarding rules. This rule installation capacity turned
out to be the equivalent of 4448 PFCP msg/s for Netronome
and 11406 PFCP msg/s for Tofino, which provides an
optimistic upper bound for the PFCP processing capacity on
these platforms, even if one were to disregard all other
processing overheads. (iii) AccelUPF performs much better on
Tofino than on Netronome, because of the higher overhead
of register access in Netronome (see §4). Reading and writing
a single register takes around 150-590 clock cycles on
the Netronome platform [47], and processing a PFCP message
involves a few tens of register accesses. This explains
the higher latency (and lower throughput) of AccelUPF on
Netronome as compared to Tofino, which allows much
faster access to registers via its pipelined design [28].

GTP forwarding throughput and latency. Next, we
establish sessions for 1K users ahead of time, and measure only
the maximum GTP data forwarding throughput and average
forwarding latency or RTT for all UPFs. We use the IMIX packet
size [16] and only uplink traffic (results for other packet


UPF design           | PFCP Tput (msg/s) | msg/s/USD | msg/s/Watt | RTT (us) | GTP Tput (Mpps) | Kpps/USD | Kpps/Watt | RTT (us)
SoftwareUPF          | 8309              | 85.51     | 949.60     | 40       | 11.93           | 17.53    | 194.77    | 85
GTPOffload Netronome | 1953              | 6.39      | 78.12      | 1470     | 10.51           | 17.20    | 210.20    | 71
GTPOffload Tofino    | 499               | 1.91      | 31.23      | 447      | 11.94           | 22.91    | 373.12    | 49
AccelUPF Netronome   | 794849            | 2601.80   | 31793.96   | 114      | 4.83            | 7.91     | 96.60     | 115
AccelUPF Tofino      | 4389254           | 16841.26  | 274328.37  | 35       | 11.94           | 22.91    | 373.12    | 49

Table 1: PFCP processing and GTP forwarding performance of UPFs.

sizes and downlink forwarding were similar). We find that
all UPFs, with the exception of AccelUPF on Netronome, are
able to process packets at the linerate of 40Gbps. Furthermore,
the hardware-accelerated UPFs have a much better
packet processing performance per unit cost or power consumed,
as compared to the pure software UPF. The only
outlier is the poor GTP forwarding throughput of AccelUPF
on Netronome, which is once again due to the higher overhead
of register access on the Netronome smart NIC platform.
While the GTP-offloaded UPF on Netronome stores its packet
forwarding rules in match-action tables, AccelUPF uses
in-NIC register arrays which are shared across all packet
forwarding engines in the smart NIC, leading to frequent
stalls and poor packet processing performance. However,
AccelUPF on Tofino has no such issues, and performs on par
with the other state-of-the-art UPF variants. This experiment
highlights the importance of choosing a good programmable
data plane hardware platform to deploy AccelUPF on. If the
underlying hardware platform cannot support efficient register
access, AccelUPF may not be suitable for UPFs that see
a large share of GTP traffic and relatively little PFCP traffic.

Overhead of chain replication. The above microbenchmarks
of AccelUPF were obtained with the replication of
switch state (along a chain of K+1 switches for a K-fault-tolerant
system) turned off. We measured performance with
replication turned on, and found no noticeable degradation
in the PFCP or GTP processing throughputs of AccelUPF.
However, the PFCP latency of AccelUPF when replicating
switch state at one other backup switch was 60% higher than
that of AccelUPF with no fault tolerance. Given the
extremely low latencies of message processing in AccelUPF
(a few tens of microseconds in most cases), we do not expect
this overhead to be a big concern.

Maximum number of user sessions. The maximum number
of user sessions that can be supported by AccelUPF
depends on the size of the register array memory available to
store the packet forwarding rules of a session, the number of
separate pipelines available on the hardware platform with
distinct register arrays, and the size of the hash computed
by the hardware platform. Across both hardware platforms,
we found that we could support 64K entries in each register
array. This translates to a maximum of 64K user sessions for
our Netronome platform, considering it has a single functional
copy of each register array. Tofino has disjoint registers
for its ingress and egress pipelines, so we can support 128K
users in our current implementation. Note that for workloads
generating frequent PFCP messages (say, every 10 seconds,
which is the common value of the inactivity timer after which
a session is marked as idle), the PFCP traffic generated by
128K users can be comfortably and efficiently handled only
by AccelUPF, and not by any other UPF design we have
evaluated. It is possible to increase the capacity of AccelUPF further
by using the multiple pipelines present in switches. The Tofino
switches support up to 4 pipelines (2 in our model), so it is
possible to increase the capacity of AccelUPF to 512K users,
which we plan to explore as part of future work.

Figure 8: Rate of hash collisions.

However, as the number of users approaches capacity,
hash collisions can become a problem. Figure 8 shows the
percentage of sessions that see a hash collision as the number
of active sessions at the UPF increases. These measurements
were obtained from the AccelUPF prototype running on
Tofino in two different cases. First, we store a single entry in
a hash bucket using single-width Tofino registers. In this case,
we see that the number of hash collisions is relatively low
until the system has about 32K users, and the hash collisions
increase afterwards, reaching as high as 43% at the full
capacity of 65K users. This would be a significant load for
the slowpath software UPF. However, using dual-width
registers, storing 2 entries in each hash bucket, we can bring
the hash collisions down to a low value (under 10%), and we can
comfortably support 64K users in each register array.
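This two-slot bucketing can be modelled with a short Python sketch. The hash function is injectable purely so that collisions can be forced in a test; the prototype implements the two slots with Tofino dual-width registers, not Python lists:

```python
BUCKETS = 65536  # 64K buckets, two slots each (dual-width registers)
table = [[None, None] for _ in range(BUCKETS)]

def install(key, rule, h=hash):
    """Install a rule; returns False when both slots are taken by other
    sessions, in which case the session is handled in the slowpath."""
    bucket = table[h(key) % BUCKETS]
    for slot in range(2):
        if bucket[slot] is None or bucket[slot][0] == key:
            bucket[slot] = (key, rule)
            return True
    return False  # bucket full: demote this session to software

def lookup(key, h=hash):
    """Return the rule for `key`, or None on a miss (slowpath case)."""
    for entry in table[h(key) % BUCKETS]:
        if entry is not None and entry[0] == key:
            return entry[1]
    return None
```

With a single slot per bucket, the second colliding session would already be demoted; with two slots, only the third colliding session falls back to the slowpath, which matches the drop in collision rate reported above.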


Overhead of software fallback. AccelUPF needs to migrate
an established session from hardware to software in
certain scenarios, e.g., when a session exceeds its configured
maximum bit rate. In such cases, when the processing
needs to fall back to software, the first packet of the flow after
migration will incur a high overhead, due to the software
probing the hardware state and migrating it to the slowpath.
We measured this overhead and found that the first packet
processed in a session immediately after software fallback
incurs a processing latency of around 2.4 milliseconds, as
compared to the average-case latency of around 100 microseconds.
We argue that this migration overhead is not a major
concern. The two main cases where a user session has to be
migrated are when a user sends traffic beyond the configured
rate limit, or when a session is modified using complex PFCP
messages that cannot be parsed in the fastpath. In the former
case, we argue that the user’s traffic will suffer long delays
due to buffering anyway, and the extra overhead of migration
will not adversely impact performance. In the latter case,
we expect future implementations of the SMF working with
AccelUPF to evolve towards simpler PFCP messages, at least
for UEs with stringent performance guarantees.

5.2 Real world traffic
We now present an evaluation of AccelUPF on real-world
traces. We choose an IoT application that is likely to see
a larger fraction of PFCP messages to GTP data, because
an IoT device performs frequent signaling while sending
small amounts of data intermittently. We obtained IoT packet
traces from 5 different IoT applications [44]. These traces
contain the packet sizes and the timestamps of when packets
were generated by various IoT applications. We then extrapolate
the traces to add PFCP messages, to simulate the scenario
where these IoT devices would be connected over a mobile
telecom network, as follows. We add PFCP messages
corresponding to initial PDU session establishment for each IoT
device at the start of the trace. Further, when an IoT device
goes inactive after an idle period, we add PFCP messages
corresponding to the AN release procedure that transitions
a user from a connected state to the idle state. We also add
PFCP messages corresponding to a service request when
the user resumes activity once again. The inactivity timer
is usually set to a few seconds [34] by network operators to
reclaim radio resources of inactive users. We use a value of
10 seconds. After adding the emulated PFCP messages, the
relative mix of PFCP and data traffic in the 5 traces (named
trace A to trace E) is as shown in Table 2. We now generate
PFCP and GTP traffic from our load generator in these ratios,
obtaining other metrics like average packet sizes also from
the trace. We measure the total packet processing throughput
of the various UPF variants for these IoT workloads.

Figure 9: Comparison of UPFs for IoT traffic.

IoT Trace | A    | B     | C     | D     | E
PFCP %    | 3.65 | 12.53 | 19.63 | 28.86 | 35.79

Table 2: PFCP messages in IoT traces.

Figure 9 shows the PFCP and GTP processing throughputs
(in pps) for the various UPFs (we omit GTPOffloadUPF
Netronome and AccelUPF Netronome for clarity). We note
that the average packet size in the IoT traces decreases from
trace A to trace E, which resulted in an increasing throughput
(in Mpps) from left to right. The PFCP processing capacity
of the GTPOffload designs and the software UPF was too low to be
visible on top of the GTP throughput in our representation
in Figure 9. We find that AccelUPF has around 57% higher
throughput than SoftUPF or GTPOffloadUPF on Tofino for
trace E, which had 35.79% PFCP messages, and the increased
throughput was primarily due to the higher PFCP
message processing capacity of AccelUPF compared to the
other two designs.

We note that under high PFCP traffic coming from a large
number of UEs (beyond the capacity of the hardware fastpath),
the AccelUPF slowpath will still have to support a high
PFCP throughput. The software UPF on the slowpath (whose
PFCP throughput per core is much lower than that of the AccelUPF
hardware, see Table 1) is scaled to run on multiple cores in
such cases, to effectively handle the traffic coming to the
slowpath. However, the number of cores required will be
much lower than in the case of a pure software UPF, because
the fastpath is expected to handle the bulk of the traffic.

6 RELATED WORK
State-of-the-art UPFs. Most production grade UPFs are
built over kernel-bypass techniques like DPDK to achieve
high user plane throughput in software. Neutrino [19] proposes
a DPDK-based edge solution that replaces the 5G
standards-based components by a 5G non-compliant control
plane solution which is fast, reliable, and fault-tolerant.
Metaswitch [6] uses a specialized processing engine (CNAP)
in the software itself to achieve high throughput. Some UPFs

Some UPFs also use programmable hardware or specialized processing engines to offload some part of the UPF processing to hardware. A few proposals [7, 8, 18, 22, 24] offload GTP encap/decap-based forwarding to hardware, while some [31] offload packet steering to cores via deep packet inspection (DPI) of the inner IP header. Kaloom [5] offloads a subset of QoS processing (bit rate policing) along with GTP processing to the programmable hardware. TurboEPC [40] offloads a subset of 4G core signaling messages to the programmable hardware, but the proposed changes are not standards compliant. uP4 [33] offloads the 5G UPF user plane processing to programmable hardware and uses microservices that run on commodity hardware to process the corresponding PFCP signaling messages.

Our previous position paper [20] evaluated the costs and benefits of multiple 5G UPF designs (with and without hardware offload) and quantified the performance gains of user plane traffic offload. The work also identified the PFCP processing bottleneck of the programmable data plane accelerated UPF when only data handling is offloaded, and proposed offloading PFCP processing to programmable hardware as well.

Control plane offload. Much like AccelUPF, prior work has also proposed offloading the control plane logic of network functions (and not just the data plane) to programmable hardware, and highlighted the challenges in doing so. Mantis [48] designs a control plane architecture over programmable switches that can react to data center network conditions within tens of µs to resolve congestion events that are microscopic in duration. Molero et al. [35] achieve line rate for internet routing by processing failure detection, distributed path-vector computations (shortest-path and BGP-like policies), and forwarding state updates entirely within the data plane. D2R [43] implements fast reroute during network failure by performing route computation without control plane intervention. Lucid [41] presents a framework that simplifies the in-network implementation of control plane constructs such as stateful table data structures, periodic event triggers and event handler processing, packet buffering, traffic shaping, and synchronized state writes. Lucid also proposes a high-level language for writing control function code, and its compiler translates this code into optimized target code for Intel Tofino switches. AccelUPF is complementary to, and strengthens the case for, frameworks like Lucid.

Fault-tolerance of switch state. With many stateful applications offloaded to programmable data planes, protecting application state under switch failure conditions and concurrent state access is essential. Prior work has proposed state replication and fault-tolerance solutions for such applications, some of which we leverage for fault tolerance of switch state in AccelUPF. NetChain [29] proposes protocols and algorithms that ensure strong consistency and fault-tolerance for an in-network key-value store. Choi et al. [23] and SwiSh [49] introduce new replication protocols for in-network state. RedPlane [30] implements a fault-tolerant state store that ensures consistent application state access even if the switch fails or traffic is rerouted to another switch, while offering two consistency modes: strong consistency and bounded inconsistency.

7 CONCLUSION

This paper presented the design, implementation, and evaluation of AccelUPF, a programmable data plane hardware accelerated 5G user plane function. Prior work on using programmable hardware to accelerate the mobile packet core user plane was restricted to offloading only the GTP user data forwarding functionality to hardware, while continuing to process the PFCP messages that configure the packet forwarding rules in software. These designs perform badly when applications frequently reconfigure packet forwarding rules while sending little data in between (e.g., IoT applications), because the software control plane APIs that reconfigure hardware rules have a limited capacity. To overcome this bottleneck, AccelUPF offloads the processing of most PFCP messages to the programmable hardware as well, carefully working around the memory and compute constraints of the hardware platforms when processing the complex PFCP messages. Our experiments show that AccelUPF significantly improves UPF packet processing performance as compared to previous offload-based UPF designs, especially when the traffic has a high proportion of PFCP messages. Our work highlights the challenges in processing a complex protocol like PFCP in programmable hardware. Given the significant performance gains that accrue from processing PFCP messages in programmable hardware at the UPF, our work provides guidance on how future versions of PFCP for 6G and beyond can evolve to make them amenable to acceleration using programmable dataplane platforms.

ACKNOWLEDGEMENTS

We thank our shepherd Hyojoon Kim, and the anonymous reviewers, for their insightful feedback. We thank the 5G testbed project, funded by the Department of Telecommunications, Govt. of India, for access to the various 5G core components. We thank Dr. Venkanna U. and his research team at IIIT Naya Raipur, especially Suvrima Datta, for providing access to their hardware setup during our initial work. We also thank the Fast Forward Initiative Hardware Grant Program by Intel® Connectivity Research Program (ICRP) for their grant of a programmable switch.
REFERENCES

[1] 2013. Cisco highlights next big switch. https://www.biztechafrica.com/article/cisco-announces-next-big-switch/5448
[2] 2015. Cavium Xpliant ethernet switch product line. https://people.ucsc.edu/~warner/Bufs/Xpliant-cavium.pdf
[3] 3GPP Ref #: 29.244. 2017. Interface between the Control Plane and the User Plane Nodes. https://www.3gpp.org/ftp/Specs/archive/29_series/29.244
[4] 3GPP Ref #: 23.501. 2017. System architecture for the 5G System (5GS). https://www.3gpp.org/ftp/Specs/archive/23_series/23.501
[5] 2019. The Kaloom 5G User Plane Function (UPF). https://www.mbuzzeurope.com/wp-content/uploads/2020/02/Product-Brief-Kaloom-5G-UPF-v1.0.pdf
[6] 2019. Lighting Up the 5G Core with a High-Speed User Plane on Intel Architecture. https://builders.intel.com/docs/networkbuilders/lighting-up-the-5g-core-with-a-high-speed-user-plane-on-intel-architecture.pdf
[7] 2020. 5G User Plane Function (UPF) - Performance with ASTRI. https://networkbuilders.intel.com/solutionslibrary/5g-user-plane-function-upf-performance-with-astri-solution-brief
[8] 2020. Optimizing UPF performance using SmartNIC offload. https://www.mavenir.com/app/uploads/2020/11/Mavenir_UPF_Solution_Brief.pdf
[9] 2022. 5G testbed at IIT Bombay. https://www.cse.iitb.ac.in/~5gtestbed
[10] 2022. Altera. https://www.mouser.in/manufacturer/altera
[11] 2022. DPDK Overview. https://doc.dpdk.org/guides/prog_guide/overview.html
[12] 2022. Edgecore Wedge 100BF-32X 32-Port 100GbE Bare Metal Switch with ONIE - Part ID: Wedge100BF-32X-O-AC-F-US. https://www.colfaxdirect.com/store/pc/viewPrd.asp?idproduct=3485&idcategory=
[13] 2022. EZchip. https://www.radisys.com/partners/ez-chip
[14] 2022. Intel XL710-BM2 Dual-Port 40G QSFP+ PCIe 3.0 x8, Ethernet Network Interface Card. https://www.fs.com/products/75604.html
[15] 2022. Intel(R) Xeon(R) CPU E5-2650 v4 @ 2.20GHz. https://www.intel.com/content/www/us/en/products/sku/91767/intel-xeon-processor-e52650-v4-30m-cache-2-20-ghz/specifications.html
[16] 2022. Internet Mix (IMIX) Traffic. https://en.wikipedia.org/wiki/Internet_Mix
[17] 2022. Xilinx. https://www.xilinx.com
[18] Ashkan Aghdai et al. 2018. Transparent Edge Gateway for Mobile Networks. In IEEE 26th International Conference on Network Protocols (ICNP).
[19] Mukhtiar Ahmad, Syed Usman Jafri, Azam Ikram, Wasiq Noor Ahmad Qasmi, Muhammad Ali Nawazish, Zartash Afzal Uzmi, and Zafar Ayyub Qazi. 2020. A Low Latency and Consistent Cellular Control Plane. In Proceedings of the Annual Conference of the ACM Special Interest Group on Data Communication on the Applications, Technologies, Architectures, and Protocols for Computer Communication.
[20] Abhik Bose, Diptyaroop Maji, Prateek Agarwal, Nilesh Unhale, Rinku Shah, and Mythili Vutukuru. 2021. Leveraging Programmable Dataplanes for a High Performance 5G User Plane Function. In 5th Asia-Pacific Workshop on Networking (APNet).
[21] Pat Bosshart, Dan Daly, Glen Gibb, Martin Izzard, Nick McKeown, Jennifer Rexford, Cole Schlesinger, Dan Talayco, Amin Vahdat, George Varghese, and David Walker. 2014. P4: Programming Protocol-independent Packet Processors. SIGCOMM Computer Communication Review 44 (2014).
[22] Carmelo Cascone and Uyen Chau. 2018. Offloading VNFs to programmable switches using P4. In ONS North America.
[23] Sean Choi, Seo Jin Park, Muhammad Shahbaz, Balaji Prabhakar, and Mendel Rosenblum. 2019. Toward Scalable Replication Systems with Predictable Tails Using Programmable Data Planes. In Proceedings of the 3rd Asia-Pacific Workshop on Networking (APNet).
[24] Zhou Cong, Zhao Baokang, Wang Baosheng, and Yuan Yulei. 2022. CeUPF: Offloading 5G User Plane Function to Programmable Hardware Base on Co-Existence Architecture. In Proceedings of the ACM International Conference on Intelligent Computing and Its Emerging Applications.
[25] Edge-core. 2022. Quick Start Guide 32-Port 100G Ethernet Switch Wedge100BF-32X. https://www.edge-core.com/_upload/images/Wedge100BF-32X_QSG-R01_EN-SC_0114.pdf
[26] Michaela Goss. 2022. Macrocell vs. small cell vs. femtocell: A 5G introduction. https://www.techtarget.com/searchnetworking/feature/Macrocell-vs-small-cell-vs-femtocell-A-5G-introduction
[27] R. E. Hattachi. 2015. Next Generation Mobile Networks, NGMN. https://www.ngmn.org/wp-content/uploads/NGMN_5G_White_Paper_V1_0.pdf
[28] Intel. 2021. P4₁₆ Intel® Tofino™ Native Architecture – Public Version. https://github.com/barefootnetworks/Open-Tofino/blob/master/PUBLIC_Tofino-Native-Arch.pdf
[29] Xin Jin, Xiaozhou Li, Haoyu Zhang, Nate Foster, Jeongkeun Lee, Robert Soulé, Changhoon Kim, and Ion Stoica. 2018. NetChain: Scale-Free Sub-RTT Coordination. In 15th USENIX Symposium on Networked Systems Design and Implementation (NSDI).
[30] Daehyeok Kim, Jacob Nelson, Dan R. K. Ports, Vyas Sekar, and Srinivasan Seshan. 2021. RedPlane: Enabling Fault-Tolerant Stateful In-Switch Applications. In Proceedings of the ACM SIGCOMM Conference.
[31] DongJin Lee, JongHan Park, Chetan Hiremath, John Mangan, and Michael Lynch. 2018. Towards achieving high performance in 5G mobile packet core's user plane function. https://builders.intel.com/docs/networkbuilders/towards-achieving-high-performance-in-5g-mobile-packet-cores-user-plane-function.pdf
[32] Yuanjie Li, Qianru Li, Zhehui Zhang, Ghufran Baig, Lili Qiu, and Songwu Lu. 2020. Beyond 5G: Reliable Extreme Mobility Management. In Proceedings of the Annual Conference of the ACM Special Interest Group on Data Communication on the Applications, Technologies, Architectures, and Protocols for Computer Communication (SIGCOMM).
[33] Robert MacDavid, Carmelo Cascone, Pingping Lin, Badhrinath Padmanabhan, Ajay Thakur, Larry Peterson, Jennifer Rexford, and Oguz Sunay. 2021. A P4-Based 5G User Plane Function. In Proceedings of the ACM SIGCOMM Symposium on SDN Research (SOSR).
[34] Foivos Michelinakis, Anas Saeed Al-Selwi, Martina Capuzzo, Andrea Zanella, Kashif Mahmood, and Ahmed Elmokashfi. 2021. Dissecting Energy Consumption of NB-IoT Devices Empirically. IEEE Internet of Things Journal 8, 2 (2021), 1224–1242.
[35] Edgar Costa Molero, Stefano Vissicchio, and Laurent Vanbever. 2018. Hardware-Accelerated Network Control Planes. In Proceedings of the 17th ACM Workshop on Hot Topics in Networks (HotNets).
[36] Barefoot Networks. 2018. NoviWare 400.5 for Barefoot Tofino chipset. https://noviflow.com/wp-content/uploads/NoviWare-Tofino-Datasheet.pdf
[37] Recep Ozdag. 2012. Intel Ethernet Switch FM6000 Series - Software Defined Networking. https://people.ucsc.edu/~warner/Bufs/ethernet-switch-fm6000-sdn-paper.pdf
[38] Rasmus Pagh and Flemming Friche Rodler. 2004. Cuckoo hashing. Journal of Algorithms 51 (2004).
[39] Javan Erfanian and Rachid El Hattachi. 2015. NGMN 5G white paper. https://www.ngmn.org/wp-content/uploads/NGMN_5G_White_Paper_V1_0.pdf
[40] Rinku Shah, Vikas Kumar, Mythili Vutukuru, and Purushottam Kulkarni. 2020. TurboEPC: Leveraging Dataplane Programmability to Accelerate the Mobile Packet Core. In Proceedings of the Symposium on SDN Research (SOSR).
[41] John Sonchack, Devon Loehr, Jennifer Rexford, and David Walker. 2021. Lucid: A Language for Control in the Data Plane. In Proceedings of the ACM SIGCOMM Conference.
[42] Gábor Soós, Ferenc Nándor Janky, and Pál Varga. 2019. Distinguishing 5G IoT Use-Cases through Analyzing Signaling Traffic Characteristics. In 2019 42nd International Conference on Telecommunications and Signal Processing (TSP).
[43] Kausik Subramanian, Anubhavnidhi Abhashkumar, Loris D'Antoni, and Aditya Akella. 2021. D2R: Policy-Compliant Fast Reroute. In Proceedings of the ACM SIGCOMM Symposium on SDN Research (SOSR).
[44] UNSW Sydney. 2021. IoT Traffic Traces. https://iotanalytics.unsw.edu.au/iottraces.html
[45] Netronome Systems. 2020. Agilio CX 2x10GbE SmartNIC. https://www.netronome.com/media/documents/PB_Agilio_CX_2x10GbE-7-20.pdf
[46] Netronome Systems. 2022. Agilio CX 2x40GbE SmartNIC. https://colfaxdirect.com/store/pc/viewPrd.asp?idproduct=2871
[47] Pablo B. Viegas, Ariel G. de Castro, Arthur F. Lorenzon, Fábio D. Rossi, and Marcelo C. Luizelli. 2021. The Actual Cost of Programmable SmartNICs: Diving into the Existing Limits. In Advanced Information Networking and Applications.
[48] Liangcheng Yu, John Sonchack, and Vincent Liu. 2020. Mantis: Reactive Programmable Switches. In Proceedings of the Annual Conference of the ACM Special Interest Group on Data Communication on the Applications, Technologies, Architectures, and Protocols for Computer Communication.
[49] Lior Zeno, Dan R. K. Ports, Jacob Nelson, Daehyeok Kim, Shir Landau-Feibish, Idit Keidar, Arik Rinberg, Alon Rashelbach, Igor De-Paula, and Mark Silberstein. 2022. SwiSh: Distributed Shared State Abstractions for Programmable Switches. In 19th USENIX Symposium on Networked Systems Design and Implementation (NSDI 22).