Software Zero-Copy for Web Caching
Figure 3: An overview of the architecture of ZCopy

protected can be separated from other application data. Figure 3 shows the general architecture of ZCopy. The application running on ZCopy can use the original memory allocator (e.g., glibc) to allocate memory for normal data, or use the twin memory allocator, named ZC_alloc, to allocate memory for zero-copying data. A ZCopy proxy is added to the UDP and TCP package processing paths to distinguish the network data that will be zero-copied from other data. If the data is allocated from ZC_alloc, ZCopy bypasses the data copy path; otherwise, ZCopy handles the data as usual. The proxy also cooperates with the ZCopy data protection module to provide basic write protection on the zero-copying data.

3.2 Supporting Zero-copy

3.2.1 Isolating Zero-copying Data with a Twin Memory Allocator

To isolate zero-copying data from other data, ZCopy provides a twin memory allocator alongside the original one. It allocates memory for network data that is guaranteed to be insulated from the data allocated by a generic memory allocator (e.g., glibc). Restricted by the minimal memory protection granularity of one page and the resulting address alignment requirement, ZC_alloc has to pay special attention to small memory blocks (e.g., blocks smaller than 1024 bytes). A naive way to handle them is to allocate one page for each request; however, this may waste a lot of memory. ZC_alloc instead aggregates memory blocks of similar sizes into the same basic memory unit, the pageblock. A pageblock is treated as a basic protection chunk and usually consists of several pages (16 pages by default). It is write protected only when it is full of zero-copying data. Because ZC_alloc aggregates zero-copying data together to provide memory protection, it minimizes the amount of wasted memory (e.g., when objects smaller than one page are aggregated into a default pageblock, at most one page is wasted, which is less than 6.25%).

If the allocation request is for a large data block, ZC_alloc directly allocates a memory chunk rounded up from the requested size. A threshold (4096 bytes by default) in ZC_alloc decides whether a request is for a large data block; this threshold can be tuned by the programmer if needed.

The twin memory allocator is especially friendly to reusable data: once a data block is allocated, it is sent to the network multiple times before it is modified or freed. One representative usage scenario is allocating value data for Memcached. A Memcached server caches many key/value pairs in memory to serve quick key/value queries; every time the server receives a request containing a key, it responds with the value corresponding to that key. Over a long execution, the key/value pairs are not expected to be modified or freed. Hence, in most cases we can zero-copy the values during data transfer without worrying about modifications to them.

3.2.2 Zero-copying Network I/O Data

ZCopy supports two common network protocols: UDP and TCP. We add a proxy to the UDP and TCP package processing paths to distinguish the network data that will be zero-copied from other data. At the very beginning, ZCopy checks whether the current process wants to use the zero-copy mechanism at all. If so, it checks whether any memory blocks still need to be write protected, and invokes the ZCopy data protection module if write protection is needed.

Figure 4: The structure of a normal package and a ZCopy package

ZCopy handles zero-copying data at the time the network data is organized into a network package. Figure 4 shows the structures of a normal network package and a ZCopy package. In the normal case (shown in the top half of Figure 4), a network package consists of several protocol headers followed by network data. The network data can be organized as a single data buffer or a list of data buffers, and is copied from user address space into the package in order. If the package buffer is not large enough to hold all the network data, the kernel allocates new empty pages to hold the rest of the data and attaches them to the package's page fragment list. Each entry in the list contains the starting address of the data
and its length. When the package is passed to the NIC driver, the driver first transfers the package content and the fragments to the NIC hardware through the DMA engine.
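The page fragment list described above can be sketched as follows. This is our own simplified illustration, not ZCopy's actual kernel code: the entry layout and the names `PAGE_SIZE`, `struct frag`, and `fill_frag_list` are hypothetical. The sketch shows how a user buffer that straddles page boundaries is split into per-page (address, length) entries:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SIZE 4096UL

/* One entry in the package's page fragment list: where the data
 * starts and how many bytes it covers (simplified, hypothetical form). */
struct frag {
    uintptr_t addr;
    size_t    len;
};

/* Split a user buffer into per-page fragments, mirroring how the kernel
 * attaches data to a package's page fragment list. Returns the number
 * of entries written to `out` (at most `max`). */
static size_t fill_frag_list(uintptr_t buf, size_t len,
                             struct frag *out, size_t max)
{
    size_t n = 0;
    while (len > 0 && n < max) {
        /* Bytes remaining on the page that `buf` currently points into. */
        size_t on_page = PAGE_SIZE - (buf & (PAGE_SIZE - 1));
        size_t chunk = len < on_page ? len : on_page;
        out[n].addr = buf;
        out[n].len  = chunk;
        buf += chunk;
        len -= chunk;
        n++;
    }
    return n;
}
```

For example, a 600-byte buffer starting 4000 bytes into a page yields two entries: 96 bytes on the first page and 504 bytes on the next.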
Figure 5: Zero-copy in UDP package processing

ZCopy treats zero-copying data differently from normal data. Each pageblock is identified by a magic string, so zero-copy data buffers can be distinguished from other buffers. We use UDP package processing as an example to illustrate how zero-copying data is handled. Figure 5 shows the UDP package processing path in ZCopy, and the bottom half of Figure 4 shows the structure of a ZCopy package. ZCopy first scans the user data buffer lists and copies all leading normal data into the network package buffer, including the protocol headers (step 1). It then iteratively processes the remaining user buffers, handling zero-copying data and normal data separately (steps 2-5), checking the pageblock magic string to discover zero-copying user buffers. For zero-copying data (step 3), it first obtains the starting address and length of the data buffer, then finds all pages covered by the buffer, and finally organizes those pages as fragments and adds them to the package's page fragment list. For normal data (step 4), it allocates new empty pages, copies the buffer content into them, and likewise organizes the pages as fragments on the package's page fragment list. The package is then passed down to the lower levels of the network stack.

One optimization to the ZCopy proxy is to treat read-only data buffers as zero-copying buffers even though they are not allocated from ZC_alloc. This can simply be done by filling in the offset and length in the fragment list.

3.2.3 Protection of Zero-copying Data

ZCopy must provide a protection mechanism for zero-copying data in case it is mutated while being sent out. To do this, ZCopy adds a simple data protection module to the native memory management system.

Based on the kernel's page-level protection granularity, small data blocks allocated from ZC_alloc are batched in a group and treated as a whole for write protection; the minimal protection unit is one pageblock. When a pageblock is full, ZC_alloc requests the kernel to protect it. To avoid the cost of context switches between user space and kernel space, as well as the false protection problems that early write protection could cause, ZCopy batches the requests from ZC_alloc and delays protecting the pageblock until the system enters the network package processing path. The protection is done by walking the page table of the target range and changing the protection bit of the corresponding page table entries. While a pageblock is not yet full, data allocated from ZC_alloc is still sent through the normal path without being zero-copied.

ZCopy protects zero-copying data blocks aggressively: it does not remove the write protection of a data block even after the block has been completely sent out by the hardware. The write protection is removed only when a write operation is trapped by the kernel. At that point, the reference count of the page corresponding to the faulting address is checked first. If the count is larger than one, a copy-on-write mechanism is used to protect the network data from being modified; otherwise, the write must come from the application itself, and we simply remove the write protection. Note that since the basic protection unit is a pageblock, any write to a write-protected pageblock causes all data blocks belonging to that pageblock to lose their write protection. However, we do not expect this to happen frequently, as mutation of zero-copying data is rare.

4 Experimental Results

All experiments were conducted on an Intel machine with two 1.87 GHz six-core Intel Xeon E7 chips running Debian GNU/Linux 6.0 with kernel version 2.6.38. The NIC is an Intel 82576 Gigabit Network Controller. Another Intel machine with the same hardware and software configuration serves as the client. To minimize the interaction between different cores of a multi-core system (e.g., cache thrashing), experiments were conducted using only one CPU core.

We use two widely used web-caching applications, Memcached 1.4.5 [8] and Varnish 3.0.0 [4], to demonstrate the performance improvements. All applications in the experiments use ZC_alloc to allocate memory for network data, to eliminate the effect of using different memory allocators.

4.1 Memcached

Figure 6: The throughput of Memcached in ZCopy and vanilla Linux with UDP, and the speedup of ZCopy

Figure 7: The time spent on UDP package processing for Memcached in ZCopy and vanilla Linux

Memcached [8] caches multiple key/value pairs in memory. Each time it receives a request containing a key, it responds with the corresponding value. From a long run's perspective, the key/value pairs are not expected to be modified or freed. However, the metadata (e.g., the data expiry time and the item links) stored alongside the cached pairs may change. We modify Memcached to allocate memory for the values from ZC_alloc. This takes only 10 lines of modification to the original Memcached.
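The paper does not show the patch itself. A minimal sketch of what such a modification might look like follows; `zc_alloc` is a stub standing in for ZC_alloc (it only mimics page-aligned placement, not pageblock aggregation), and the `item` layout is a hypothetical simplification of Memcached's:

```c
#define _POSIX_C_SOURCE 200112L
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define PAGE_SIZE 4096

/* Stand-in for ZCopy's ZC_alloc: the real allocator aggregates small
 * blocks into write-protectable pageblocks; this stub only reproduces
 * the page-aligned placement so the sketch is self-contained. */
static void *zc_alloc(size_t size)
{
    void *p = NULL;
    if (posix_memalign(&p, PAGE_SIZE, size) != 0)
        return NULL;
    return p;
}

/* Simplified Memcached-style item: the metadata stays in ordinary
 * malloc memory (it mutates), and only the value bytes, which are
 * read-mostly and sent repeatedly, go through zc_alloc. */
struct item {
    size_t nval;   /* value length */
    char  *value;  /* zero-copyable value data */
};

static struct item *item_store(const char *val, size_t nval)
{
    struct item *it = malloc(sizeof(*it));  /* mutable metadata */
    if (!it)
        return NULL;
    it->value = zc_alloc(nval);             /* reusable value data */
    if (!it->value) {
        free(it);
        return NULL;
    }
    memcpy(it->value, val, nval);
    it->nval = nval;
    return it;
}
```

The split matches the paper's observation: the item metadata that mutates stays in ordinary memory, while the values that are sent repeatedly live in zero-copyable memory.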
We use the memaslap test suite from the libmemcached library [3] as the Memcached client. The client first warms up Memcached with a user-defined number of key/value pairs and then randomly issues get and set operations over several concurrent connections.

UDP: Figure 6 shows the average throughput of Memcached in ZCopy and vanilla Linux. Memcached is warmed up with ten thousand key/value pairs. The memaslap client is configured to issue pure get operations through 36 concurrent connections from 12 threads using the UDP protocol. We adjust the number of Memcached worker threads to achieve the best performance. The CPU usage in all cases is above 99%. Vanilla Linux performs slightly better when the value size is smaller than 256 bytes. However, when the value size reaches 512 bytes, ZCopy starts to outperform vanilla Linux: in the 512 byte cases, ZCopy has a 28.7% performance improvement, and at 768 bytes the improvement increases to 41.1%. When the value size is 1024 bytes, ZCopy and vanilla Linux have nearly the same throughput, as the network reaches its hardware limit.

The performance improvement comes from two parts: 1) minimized data copying, and 2) reduced cache thrashing. Figure 7 compares the time spent on UDP package processing in ZCopy and vanilla Linux. In ZCopy, the package processing time is around 3000 cycles in all cases; in vanilla Linux, the time increases with the package size and reaches 4400 cycles in the 1024 byte cases. Table 1 shows the L2 cache miss rate of Memcached in Linux and ZCopy: ZCopy reduces L2 cache misses by more than 10% in the UDP cases, and the hottest function in Linux, copy_user_generic_string, disappears in ZCopy. Another reason for the notable improvement in the 512 and 768 byte cases is that the shorter package sending time in ZCopy causes the NIC interrupt handler to switch frequently to polling mode, which is more effective than interrupt mode under heavy network stress; in vanilla Linux, the network load triggers such switches less frequently.

              L2 Cache Miss Rate (1 miss/K cycles)
              512 bytes    768 bytes    1024 bytes
  UDP Linux   4.89         5.17         6.11
  UDP ZCopy   4.17         4.57         4.73
  TCP Linux   8.08         9.06         10.86
  TCP ZCopy   7.73         8.22         9.46

Table 1: The L2 cache miss rate in vanilla Linux and ZCopy in the 512 byte, 768 byte and 1024 byte cases

TCP: Figure 8 shows the average throughput of Memcached in ZCopy and vanilla Linux, and the speedup of ZCopy over vanilla Linux. We use the same evaluation method as in the UDP experiments, except that for each TCP connection we issue only a single request and then close the connection. Vanilla Linux performs better when the value size is smaller than 256 bytes. However, when the value size reaches 512 bytes, ZCopy starts to outperform vanilla Linux, by 40.8%. When the value size is 1024 bytes, ZCopy outperforms vanilla Linux by 30.8%. The performance of Memcached reaches the hardware limit when the value size is 2048 bytes.

As with UDP, the improvement comes from copy avoidance and reduced cache thrashing. Because the code for TCP package processing and data sending is mixed together, we measure the time spent in tcp_sendmsg instead of the TCP package processing time. Figure 9 shows the profiling results: ZCopy does reduce the time spent in tcp_sendmsg in all cases. Table 1 also shows the L2 cache miss rate with TCP: ZCopy reduces L2 cache misses by 10.2% in the 768 byte cases and by 14.8% in the 1024 byte cases.

Figure 8: The throughput of Memcached in ZCopy and vanilla Linux with TCP, and the speedup of ZCopy

Figure 9: The time spent in tcp_sendmsg for Memcached in ZCopy and vanilla Linux

Figure 10: The throughput of the Varnish server in ZCopy and vanilla Linux

4.2 Varnish

Varnish [4] is an open-source web application accelerator. It caches web content into memory objects and
returns web objects according to the network request. We modify Varnish to allocate object memory from ZC_alloc with a change of only 3 lines of code.

We test Varnish using ab (the Apache benchmark) with web page sizes ranging from 1 KByte to 8 KBytes (the average individual response size on the web ranges from 3 KBytes to 15 KBytes [1]). Figure 10 compares the performance of ZCopy and vanilla Linux. The Varnish server saturates the CPU on both ZCopy and vanilla Linux. Vanilla Linux performs slightly better with small web pages (1 KByte). However, as the web page size increases, ZCopy starts to outperform Linux, with the improvement reaching 7.8% at 6 KBytes. Both configurations reach the network limit when the web page size grows to 8 KBytes. The improvement is much smaller than for Memcached because the single-request processing time in Varnish is much longer than in Memcached, which amortizes the savings from ZCopy.

4.3 ZCopy Primitive

Overhead of Write Protection: We also evaluate the cost of triggering write protection faults for zero-copied data. Table 2 shows the execution time of invoking the getpid system call, triggering a ZCopy write protection fault, and triggering a native page fault, respectively. Triggering a ZCopy write protection fault is much cheaper than triggering a native page fault. This is because ZCopy usually only removes the write protection of the faulting address from the page table, which is much less expensive.

                                  CPU cycles
  getpid                          1149.9
  ZCopy write protection fault    2802.5
  native page fault               6247.4

Table 2: The execution time of invoking the getpid system call, triggering a ZCopy write protection fault, and triggering a native page fault

5 Conclusion and Future Work

This paper revisited existing software zero-copy mechanisms and presented a new zero-copy system named ZCopy, based on the observation that it is usually the metadata around the network data that gets mutated. Experiments with two applications on an Intel machine show that ZCopy outperforms vanilla Linux when sending relatively large network data packages.

In future work, we plan to extend this work in two directions. First, although this paper focuses on web-caching applications, ZCopy places few constraints on applications and is applicable to other networking applications; we plan to study and evaluate the performance benefit of ZCopy on other network-intensive applications. Second, ZCopy was evaluated using a single core; we plan to extend ZCopy to run efficiently on multicore machines.

References

[1] Average web response size. [Link]
[2] InfiniBand. [Link]
[3] LibMemcached. [Link]
[4] Varnish web cache system. [Link]
[5] J. C. Brustoloni and P. Steenkiste. Effects of buffering semantics on I/O performance. In Proc. OSDI, 1996.
[6] J. Chu. Zero-copy TCP in Solaris. In Proc. USENIX ATC, 1996.
[7] P. Druschel and L. L. Peterson. Fbufs: A high-bandwidth cross-domain transfer facility. In Proc. SOSP, pages 189-202. ACM, 1993.
[8] R. Lerner. Memcached integration in Rails. Linux Journal, 2009.
[9] L. McVoy. The splice I/O model, 1998.
[10] Myricom. Myrinet. [Link]
[11] V. S. Pai, P. Druschel, and W. Zwaenepoel. IO-Lite: A unified I/O buffering and caching system. ACM TOCS, 18(1):37-66, 2000.
[12] J. Pasquale, E. Anderson, and P. K. Muller. Container shipping: Operating system support for I/O-intensive applications. Computer, 27(3):84-93, 1994.
[13] S. Schneider, C. D. Antonopoulos, and D. S. Nikolopoulos. Scalable locality-conscious multithreaded memory allocation. In Proc. ISMM, pages 84-94, 2006.
[14] D. Stancevic. Zero copy I: User-mode perspective. Linux Journal, 2003(105):3, 2003.