User-space Network Processing
2015/12/18
• Ported memcached to run on a user-space TCP/IP stack (mTCP)
  – roughly 35% higher single-thread throughput than the Linux kernel TCP/IP stack
2
• mTCP is a user-level TCP/IP stack that runs over Intel DPDK
  • mTCP + DPDK are available on GitHub
• Key-Value Stores normally run on the Linux kernel network stack
  • Redis
  • Memcached → the target of this work
3
Facebook's memcached workload
4
[Figure 2 panels: Key size CDF by appearance; Value size CDF by appearance; Value size CDF by total — for the USR, APP, ETC, VAR, and SYS traces]
Figure 2: Key and value size distributions for all traces. The leftmost CDF shows the sizes of keys.
B. Atikoglu, et al., “Workload Analysis of a Large-Scale Key-Value Store,” ACM SIGMETRICS 2012.
here. It is important to note, however, that all Memcached
instances in this study ran on identical hardware.
2.3 Tracing Methodology
Our analysis called for complete traces of traffic passing
through Memcached servers for at least a week. This task
is particularly challenging because it requires nonintrusive
instrumentation of high-traffic volume production servers.
Standard packet sniffers such as tcpdump have too much
overhead to run under heavy load. We therefore imple-
mented an efficient packet sniffer called mcap. Implemented
as a Linux kernel module, mcap has several advantages over
standard packet sniffers: it accesses packet data in kernel
space directly and avoids additional memory copying; it in-
troduces only 3% performance overhead (as opposed to tcp-
dump’s 30%); and unlike standard sniffers, it handles out-
of-order packets correctly by capturing incoming traffic af-
ter all TCP processing is done. Consequently, mcap has a
complete view of what the Memcached server sees, which
eliminates the need for further processing of out-of-order
packets. On the other hand, its packet parsing is optimized
for Memcached packets, and would require adaptations for
other applications.
The captured traces vary in size from 3 TB to 7 TB each.
This data is too large to store locally on disk, adding another
challenge: how to offload this much data (at an average rate
of more than 80,000 samples per second) without interfering
with production traffic. We addressed this challenge by com-
bining local disk buffering and dynamic offload throttling to
take advantage of low-activity periods in the servers.
Finally, another challenge is this: how to effectively pro-
cess these large data sets? We used Apache HIVE to ana-
lyze Memcached traces. HIVE is part of the Hadoop frame-
work that translates SQL-like queries into MapReduce jobs.
We also used the Memcached “stats” command, as well as
Facebook’s production logs, to verify that the statistics we
computed, such as hit rates, are consistent with the aggre-
gated operational metrics collected by these tools.
3. WORKLOAD CHARACTERISTICS
This section describes the observed properties of each trace
[Bar chart: requests (millions), 0–70,000, for pools USR, APP, ETC, VAR, SYS, split into DELETE, UPDATE, and GET]
Figure 1: Distribution of request types per pool,
over exactly 7 days. UPDATE commands aggregate
all non-DELETE writing operations, such as SET,
REPLACE, etc.
operations. DELETE operations occur when a cached
database entry is modified (but not required to be
set again in the cache). SET operations occur when
the Web servers add a value to the cache. The rela-
tively high number of DELETE operations show that
this pool represents database-backed values that are
affected by frequent user modifications.
ETC has similar characteristics to APP, but with an even
higher rate of DELETE requests (of which some may
not be currently cached). ETC is the largest and least
specific of the pools, so its workloads might be the most
representative to emulate. Because it is such a large
and heterogenous workload, we pay special attention
to this workload throughout the paper.
VAR is the only pool sampled that is write-dominated. It
stores short-term values such as browser-window size
[the left column of the excerpted page is cut off in this capture]
Table 1: Memcached pools sampled (in one cluster).
These pools do not match their UNIX namesakes,
but are used for illustrative purposes here instead
of their internal names.
Pool Size Description
USR few user-account status information
APP dozens object metadata of one application
ETC hundreds nonspecific, general-purpose
VAR dozens server-side browser information
SYS few system data on service location
A new item arriving after the heap is exhausted requires
the eviction of an older item in the appropriate slab. Mem-
cached uses the Least-Recently-Used (LRU) algorithm to
select the items for eviction. To this end, each slab class
has an LRU queue maintaining access history on its items.
Although LRU decrees that any accessed item be moved to
the top of the queue, this version of Memcached coalesces
repeated accesses of the same item within a short period
(one minute by default) and only moves this item to the top
the first time, to reduce overhead.
2.2 Deployment
Facebook relies on Memcached for fast access to frequently-
accessed values. Web servers typically try to read persistent
values from Memcached before trying the slower backend
databases. In many cases, the caches are demand-filled,
meaning that generally, data is added to the cache after
a client has requested it and failed.
Modifications to persistent data in the database often
propagate as deletions (invalidations) to the Memcached
tier. Some cached data, however, is transient and not backed
by persistent storage, requiring no invalidations.
Memcached traffic characteristics
• USR keys are 16B or 21B
• 90% of VAR keys are 31B
• USR values are only 2B
• 90% of values are smaller than 500B
→ most requests and responses fit in small, near-minimum-size packets
• At 10 GbE line rate with minimum-size frames: roughly 15M packets/s
• A 2 GHz core then has only about 134 clock cycles per packet (arithmetic below)
5
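Writing the cycle budget out (a worked check, assuming minimum-size 64 B Ethernet frames plus 8 B preamble and 12 B inter-frame gap, and the 2.0 GHz Xeon used later in the evaluation):

\[
\frac{10\times 10^{9}\ \mathrm{bit/s}}{(8+64+12)\ \mathrm{B}\times 8\ \mathrm{bit/B}} \approx 14.88\times 10^{6}\ \mathrm{packets/s},
\qquad
\frac{2\times 10^{9}\ \mathrm{cycles/s}}{14.88\times 10^{6}\ \mathrm{packets/s}} \approx 134\ \mathrm{cycles/packet}
\]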
Memory hierarchy of an x86 server
6
[Figure (slides 6–7): each CPU core has a private L1 (64KB, ~4 cycles) and L2 (256KB, ~12 cycles); cores share an LLC (12MB, ~44 cycles); main memory (xx GB) costs ~300 cycles. Matching tables (> xx MB) do not fit in the LLC. Copyright 2014 NTT Corporation]
7
2010 Sep.
Per-Packet CPU Cycles for 10G
8
Cycles needed (10G, min-sized packets, dual quad-core 2.66GHz CPUs):
  IPv4:  Packet I/O (1,200) + IPv4 lookup (600)                = 1,800 cycles
  IPv6:  Packet I/O (1,200) + IPv6 lookup (1,600)              = 2,800 cycles
  IPsec: Packet I/O (1,200) + Encryption and hashing (5,400 …) = 6,600 cycles
Your budget: 1,400 cycles
(in x86, cycle numbers are from RouteBricks [Dobrescu09] and ours)
S. Han, et al., “PacketShader: a GPU-accelerated Software Router,” SIGCOMM 2010.
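The 1,400-cycle budget follows from the same line-rate arithmetic as above, now spread over the eight 2.66 GHz cores (a back-of-the-envelope check, using the ≈14.88 Mpps minimum-size packet rate):

\[
\frac{8 \times 2.66\times 10^{9}\ \mathrm{cycles/s}}{14.88\times 10^{6}\ \mathrm{packets/s}} \approx 1{,}430\ \mathrm{cycles/packet}
\]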
2010 Sep.
PacketShader: psio I/O Optimization
9
• Packet I/O: 1,200 reduced to 200 cycles per packet
• Main ideas
  – Huge packet buffer
  – Batch processing
  IPv4:  Packet I/O (1,200) + IPv4 lookup (600)                = 1,800 cycles
  IPv6:  Packet I/O (1,200) + IPv6 lookup (1,600)              = 2,800 cycles
  IPsec: Packet I/O (1,200) + Encryption and hashing (5,400 …) = 6,600 cycles
S. Han, et al., “PacketShader: a GPU-accelerated Software Router,” SIGCOMM 2010.
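A minimal sketch of the batch-processing idea. This is not the psio API: nic_rx_burst()/nic_tx_burst()/process_packet() are hypothetical stand-ins for a poll-mode driver call (cf. DPDK's rte_eth_rx_burst). The point is that one driver call delivers a whole burst, so per-packet system calls and allocations disappear:

    /* batch_rx_sketch.c — illustrative only; the driver hooks are stubs. */
    #include <stdint.h>
    #include <stdio.h>

    #define BURST 64                      /* packets pulled per driver call */

    struct pkt { uint8_t *data; uint16_t len; };

    /* Placeholder driver hooks: a real driver fills descriptors that point
     * into one large, preallocated packet buffer that is reused forever.  */
    static int nic_rx_burst(int port, struct pkt *pkts, int max) {
        (void)port; (void)pkts; (void)max; return 0;  /* no packets in stub */
    }
    static void nic_tx_burst(int port, struct pkt *pkts, int n) {
        (void)port; (void)pkts; (void)n;
    }
    static void process_packet(struct pkt *p) { (void)p; /* e.g. IPv4 lookup */ }

    /* One call per burst: no per-packet system call, no per-packet
     * allocation, and packets are processed back to back (warm caches). */
    static void rx_loop(int port, int iterations) {
        struct pkt burst[BURST];
        while (iterations-- > 0) {
            int n = nic_rx_burst(port, burst, BURST);
            for (int i = 0; i < n; i++)
                process_packet(&burst[i]);
            if (n > 0)
                nic_tx_burst(port, burst, n);
        }
    }

    int main(void) { rx_loop(0, 1000); puts("done"); return 0; }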
2010 Sep.
PacketShader: GPU Offloading
10
• GPU offloading for
  – Memory-intensive or
  – Compute-intensive operations
• Main topic of this talk
  IPv4:  Packet I/O + IPv4 lookup (600)
  IPv6:  Packet I/O + IPv6 lookup (1,600)
  IPsec: Packet I/O + Encryption and hashing (5,400 …)
S. Han, et al., “PacketShader: a GPU-accelerated Software Router,” SIGCOMM 2010.
Kernel Uses the Most CPU Cycles
CPU usage breakdown of a web server (Lighttpd serving a 64-byte file, Linux 3.10):
  Kernel (without TCP/IP) 45%, Packet I/O 4%, TCP/IP 34%, Application 17%
  → 83% of CPU usage spent inside the kernel!
Performance bottlenecks
  1. Shared resources
  2. Broken locality
  3. Per-packet processing
Bottleneck removed by mTCP:
  1) Efficient use of CPU cycles for TCP/IP processing
     → 2.35x more CPU cycles for the application
  2) 3x ~ 25x better performance
11
E. Jeong, et al., “mTCP: A Highly Scalable User-level TCP Stack for Multicore Systems,” NSDI 2014.
12
Inefficiencies in Kernel from Shared FD
1. Shared resources
  – Shared listening queue
  – Shared file descriptor space
[Figure: Receive-Side Scaling (H/W) feeds per-core packet queues on Cores 0–3, but all cores contend for a single lock-protected listening queue, and the shared file descriptor space needs a linear search to find an empty slot]
E. Jeong, et al., “mTCP: A Highly Scalable User-level TCP Stack for Multicore Systems,” NSDI 2014.
(A kernel-side workaround for the shared listening queue is sketched below.)
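On stock Linux the usual workaround for the shared listening queue (and one of the baselines labeled SO_REUSEPORT in the evaluation slide later) is to give each core its own listening socket bound to the same port, so each has a private accept queue. A minimal sketch (error handling trimmed; port 8080 is an arbitrary choice):

    /* reuseport_sketch.c — one listening socket per worker via SO_REUSEPORT
     * (Linux >= 3.9).  Build with: cc reuseport_sketch.c -pthread          */
    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <pthread.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <unistd.h>

    #define NWORKERS 4
    #define PORT     8080

    static void *worker(void *arg)
    {
        (void)arg;
        int one = 1;

        int lfd = socket(AF_INET, SOCK_STREAM, 0);
        setsockopt(lfd, SOL_SOCKET, SO_REUSEPORT, &one, sizeof(one));

        struct sockaddr_in addr;
        memset(&addr, 0, sizeof(addr));
        addr.sin_family      = AF_INET;
        addr.sin_addr.s_addr = htonl(INADDR_ANY);
        addr.sin_port        = htons(PORT);
        bind(lfd, (struct sockaddr *)&addr, sizeof(addr));
        listen(lfd, 4096);            /* this worker's own accept queue    */

        for (;;) {
            int cfd = accept(lfd, NULL, NULL);   /* no shared lock to fight */
            if (cfd < 0) continue;
            /* ... per-worker request handling would go here ... */
            close(cfd);
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t tid[NWORKERS];
        for (long i = 0; i < NWORKERS; i++)
            pthread_create(&tid[i], NULL, worker, (void *)i);
        for (int i = 0; i < NWORKERS; i++)
            pthread_join(tid[i], NULL);
        return 0;
    }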
13
Inefficiencies in Kernel from Broken Locality
2. Broken locality
[Figure: Receive-Side Scaling (H/W) steers packets into per-core queues, but the interrupt is handled on one core (Core 0–3) while accept()/read()/write() run on another — interrupt handling core != accepting core]
E. Jeong, et al., “mTCP: A Highly Scalable User-level TCP Stack for Multicore Systems,” NSDI 2014.
(A thread-pinning sketch follows below.)
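One kernel-side mitigation (a sketch of the general technique, not what mTCP does internally — mTCP instead pairs a TCP thread with each application core, see the following slides) is to pin each worker thread to a core and steer the matching NIC queue's interrupts to the same core, so accept()/read()/write() run where the packets arrive. The IRQ/RSS steering itself happens outside the program (e.g. via /proc/irq/<n>/smp_affinity); the sketch shows only the thread side:

    /* pin_thread_sketch.c — keep a worker on one core so its socket work
     * shares cache state with the NIC queue serviced on that core.
     * Assumes the corresponding queue's IRQ affinity is set to the same
     * core externally (e.g. /proc/irq/<n>/smp_affinity).
     * Build with: cc pin_thread_sketch.c -pthread                        */
    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>
    #include <stdio.h>

    static int pin_to_core(int core)
    {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(core, &set);
        /* Affinitize the calling thread; returns 0 on success. */
        return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
    }

    static void *worker(void *arg)
    {
        int core = (int)(long)arg;
        if (pin_to_core(core) != 0)
            fprintf(stderr, "core %d: affinity failed\n", core);
        /* The accept()/read()/write() loop for this core's connections
         * would run here, on the same core that receives its packets.  */
        return NULL;
    }

    int main(void)
    {
        pthread_t t;
        pthread_create(&t, NULL, worker, (void *)2L);   /* pin to core 2 */
        pthread_join(t, NULL);
        return 0;
    }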
14
Inefficiencies in Kernel from Lack of Support for Batching
3. Per-packet, per-system-call processing
  • Inefficient per-packet processing
    – Frequent mode switching
    – Cache pollution
    – Per-packet memory allocation
  • Inefficient per-system-call processing: accept(), read(), write()
[Figure: the application thread issues BSD socket / Linux epoll calls; every call crosses the user/kernel boundary into the kernel TCP stack and packet I/O]
E. Jeong, et al., “mTCP: A Highly Scalable User-level TCP Stack for Multicore Systems,” NSDI 2014.
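mTCP removes these costs by batching inside its user-level stack (next slide); on the kernel stack an application can only amortize them, for example by pulling many events per epoll_wait() and draining accept()/read() until EAGAIN. A minimal sketch of that amortization (an ordinary non-blocking echo server, not mTCP code; port 8080 is arbitrary):

    /* drain_sketch.c — amortizing system-call overhead on the kernel stack:
     * fetch many events per epoll_wait() and drain accept()/read() until
     * EAGAIN instead of paying one user/kernel crossing per packet.
     * Build: cc drain_sketch.c                                             */
    #include <arpa/inet.h>
    #include <errno.h>
    #include <fcntl.h>
    #include <netinet/in.h>
    #include <sys/epoll.h>
    #include <sys/socket.h>
    #include <unistd.h>

    #define MAX_EVENTS 1024

    static void set_nonblock(int fd) {
        fcntl(fd, F_SETFL, fcntl(fd, F_GETFL, 0) | O_NONBLOCK);
    }

    int main(void)
    {
        int lfd = socket(AF_INET, SOCK_STREAM, 0);
        struct sockaddr_in a = { .sin_family = AF_INET,
                                 .sin_addr.s_addr = htonl(INADDR_ANY),
                                 .sin_port = htons(8080) };
        bind(lfd, (struct sockaddr *)&a, sizeof(a));
        listen(lfd, 4096);
        set_nonblock(lfd);

        int epfd = epoll_create1(0);
        struct epoll_event ev = { .events = EPOLLIN, .data.fd = lfd };
        epoll_ctl(epfd, EPOLL_CTL_ADD, lfd, &ev);

        struct epoll_event evs[MAX_EVENTS];
        char buf[16 * 1024];
        for (;;) {
            int n = epoll_wait(epfd, evs, MAX_EVENTS, -1); /* batch of events */
            for (int i = 0; i < n; i++) {
                int fd = evs[i].data.fd;
                if (fd == lfd) {                     /* drain the accept queue */
                    int cfd;
                    while ((cfd = accept(lfd, NULL, NULL)) >= 0) {
                        set_nonblock(cfd);
                        struct epoll_event cev = { .events = EPOLLIN,
                                                   .data.fd = cfd };
                        epoll_ctl(epfd, EPOLL_CTL_ADD, cfd, &cev);
                    }
                } else {                             /* drain the socket       */
                    for (;;) {
                        ssize_t r = read(fd, buf, sizeof(buf));
                        if (r > 0) { write(fd, buf, r); continue; }   /* echo */
                        if (r < 0 && (errno == EAGAIN || errno == EWOULDBLOCK))
                            break;                   /* nothing left this round */
                        close(fd);                   /* EOF or error            */
                        break;
                    }
                }
            }
        }
    }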
15
Overview of mTCP Architecture
1. Thread model: pairwise, per-core threading
2. Batching from packet I/O to application
3. mTCP API: easily portable API (BSD-like) — sketched below
[Figure: on each core (Core 0, Core 1) an application thread is paired with an mTCP thread; the application talks to it through mTCP sockets and mTCP epoll; the mTCP threads run on a user-level packet I/O library (PSIO) above the kernel-level NIC device driver]
• [SIGCOMM’10] PacketShader: A GPU-accelerated software router,
  http://shader.kaist.edu/packetshader/io_engine/index.html
E. Jeong, et al., “mTCP: A Highly Scalable User-level TCP Stack for Multicore Systems,” NSDI 2014.
Intel DPDK — used in this work as the packet I/O layer under mTCP (in place of PSIO)
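To give a flavor of point 3, here is the shape of a single-core mTCP application written against the BSD-like API described in the paper (mtcp_socket/mtcp_accept/mtcp_epoll_wait, ...). The exact signatures, constants, and the required mtcp.conf should be checked against mtcp_api.h / mtcp_epoll.h in the mTCP repository; treat this as a sketch, not the library's documented interface:

    /* mtcp_echo_sketch.c — shape of an mTCP application (one core shown).
     * Assumes mtcp_init("mtcp.conf") was already called by the main thread;
     * non-blocking setup and error handling are omitted.                   */
    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <mtcp_api.h>
    #include <mtcp_epoll.h>

    #define MAX_EVENTS 1024
    #define PORT       11211              /* memcached's usual port */

    void run_core(int core)
    {
        /* 1. one mTCP context (and thus one paired mTCP thread) per core */
        mctx_t mctx = mtcp_create_context(core);

        int lfd = mtcp_socket(mctx, AF_INET, SOCK_STREAM, 0);
        struct sockaddr_in addr;
        memset(&addr, 0, sizeof(addr));
        addr.sin_family      = AF_INET;
        addr.sin_addr.s_addr = htonl(INADDR_ANY);
        addr.sin_port        = htons(PORT);
        mtcp_bind(mctx, lfd, (struct sockaddr *)&addr, sizeof(addr));
        mtcp_listen(mctx, lfd, 4096);

        int ep = mtcp_epoll_create(mctx, MAX_EVENTS);
        struct mtcp_epoll_event ev = { .events = MTCP_EPOLLIN,
                                       .data.sockid = lfd };
        mtcp_epoll_ctl(mctx, ep, MTCP_EPOLL_CTL_ADD, lfd, &ev);

        struct mtcp_epoll_event evs[MAX_EVENTS];
        char buf[8192];
        for (;;) {
            /* 2. events arrive in batches from the per-core mTCP thread */
            int n = mtcp_epoll_wait(mctx, ep, evs, MAX_EVENTS, -1);
            for (int i = 0; i < n; i++) {
                int sock = evs[i].data.sockid;
                if (sock == lfd) {                       /* new connection */
                    int c = mtcp_accept(mctx, lfd, NULL, NULL);
                    struct mtcp_epoll_event cev = { .events = MTCP_EPOLLIN,
                                                    .data.sockid = c };
                    mtcp_epoll_ctl(mctx, ep, MTCP_EPOLL_CTL_ADD, c, &cev);
                } else {                                 /* echo the data  */
                    int r = mtcp_read(mctx, sock, buf, sizeof(buf));
                    if (r > 0) mtcp_write(mctx, sock, buf, r);
                    else       mtcp_close(mctx, sock);
                }
            }
        }
    }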
memcached on mTCP
• Port memcached's thread model onto mTCP's pairwise per-core threads (c.f. SeaStar, which takes a similar per-core approach)
16
[Figure: thread models. Left — stock memcached on the kernel stack: a main thread accept()s connections and hands them to worker threads over a pipe, and the workers then read()/write(). Right — memcached ported to mTCP: each worker thread is paired with its own mTCP thread and does its own accept()/read()/write(), with no dispatcher pipe.]
Experimental setup
• Two machines connected back to back (10 GbE)
• Workloads
  – Web server: lighttpd + Apache benchmark (ab)
  – Key-value store: memcached 1.4.24 + mc-benchmark
17
Hardware
  CPU       Intel Xeon E5-2430L / 2.0 GHz (6 cores) x 2 sockets
  Memory    48 GB PC3-12800
  Ethernet  Intel X520-SR1 (10 GbE)
Software
  OS        Debian GNU/Linux 8.1
  kernel    Linux 3.16.0-4-amd64
  Intel DPDK 2.0.0
  mTCP (4603a1a, June 7 2015)
Web server (lighttpd) performance
[Graph: throughput in 1000 requests/second (0–180) vs. number of cores (0–12) for Linux, Linux with SO_REUSEPORT, and mTCP; higher is better. Annotations on the graph mark 3.3x and 5.5x gains for mTCP.]
• Apache benchmark
• 64B message
• 1000 concurrency
• 100K requests
18
memcached performance
• mc-benchmark against memcached on the kernel TCP stack and on mTCP
• With a single thread, mTCP improves throughput: SET +35%, GET +2%
19
Throughput (ratio to TCP w/ 1 thread in parentheses):
        TCP w/ 1 thread   TCP w/ 3 threads   mTCP w/ 1 thread
  SET    85,404           146,351 (1.71)     115,166 (1.35)
  GET   115,079           139,575 (1.21)     116,838 (1.02)
• mc-benchmark
• 64B message
• 500 concurrency
• 100K requests
Discussion
[Most of this slide's text is garbled in the capture; legible fragments mention mTCP, CPU usage, and an epoll-style API.]
20
Future work
• Evaluating on platforms other than x86
  – Emulating slower / simpler cores on x86:
    • cpufreq-info(1): frequency scaling only goes down to about 1/2
    – cgroups CPU throttling (sketched below)
    – Xeon Phi?
• FLARE (Tilera-based many-core)
21
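A concrete sketch of the cgroups CPU-throttling option (assuming a cgroup-v1 cpu controller mounted at /sys/fs/cgroup/cpu and root privileges; cgroup v2 would use cpu.max instead): confining a process to a fraction of one core's CFS bandwidth roughly emulates a slower core.

    /* throttle_sketch.c — cap this process at ~25% of one core via cgroup v1
     * CFS bandwidth control.  Run as root; stop with Ctrl-C.                */
    #include <stdio.h>
    #include <sys/stat.h>
    #include <unistd.h>

    static int write_file(const char *path, const char *val)
    {
        FILE *f = fopen(path, "w");
        if (!f) { perror(path); return -1; }
        fprintf(f, "%s\n", val);
        return fclose(f);
    }

    int main(void)
    {
        mkdir("/sys/fs/cgroup/cpu/slowcore", 0755);          /* new group   */
        write_file("/sys/fs/cgroup/cpu/slowcore/cpu.cfs_period_us", "100000");
        write_file("/sys/fs/cgroup/cpu/slowcore/cpu.cfs_quota_us",  "25000");
                                                             /* 25ms/100ms  */
        char pid[32];
        snprintf(pid, sizeof(pid), "%d", (int)getpid());
        write_file("/sys/fs/cgroup/cpu/slowcore/cgroup.procs", pid);

        /* Burn CPU so the effect is visible in top(1): the process should
         * now be held to roughly 25% of one core.                         */
        volatile unsigned long x = 0;
        for (;;) x++;
    }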
Acknowledgments
• Joint work with Supachai Thongprasit [1]
[1] S. Thongprasit, V. Visoottiviseh, and R. Takano, “Toward Fast and Scalable Key-Value Stores Based on User Space TCP/IP Stack,” AINTEC 2015.
22