Disaggregated Memory for Expansion and Sharing in Blade Servers
ABSTRACT
Analysis of technology and application trends reveals a growing imbalance in the peak compute-to-memory-capacity ratio for future servers. At the same time, the fraction contributed by memory systems to total datacenter costs and power consumption during typical usage is increasing. In response to these trends, this paper re-examines traditional compute-memory co-location on a single system and details the design of a new general-purpose architectural building block—a memory blade—that allows memory to be "disaggregated" across a system ensemble. This remote memory blade can be used for memory capacity expansion to improve performance and for sharing memory across servers to reduce provisioning and power costs. We use this memory blade building block to propose two new system architecture solutions—(1) page-swapped remote memory at the virtualization layer, and (2) block-access remote memory with support in the coherence hardware—that enable transparent memory expansion and sharing on commodity-based systems. Using simulations of a mix of enterprise benchmarks supplemented with traces from live datacenters, we demonstrate that memory disaggregation can provide substantial performance benefits (on average 10X) in memory constrained environments, while the sharing enabled by our solutions can improve performance-per-dollar by up to 87% when optimizing memory provisioning across multiple servers.

Categories and Subject Descriptors
C.0 [Computer System Designs]: General – system architectures; B.3.2 [Memory Structures]: Design Styles – primary memory, virtual memory.

General Terms
Design, Management, Performance.

Keywords
Memory capacity expansion, disaggregated memory, power and cost efficiencies, memory blades.

1. INTRODUCTION
Recent trends point to the likely emergence of a new memory wall—one of memory capacity—for future commodity systems. On the demand side, current trends point to increased number of cores per socket, with some studies predicting a two-fold increase every two years [1]. Concurrently, we are likely to see an increased number of virtual machines (VMs) per core (VMware quotes 2-4X memory requirements from VM consolidation every generation [2]), and increased memory footprint per VM (e.g., the footprint of Microsoft® Windows® has been growing faster than Moore's Law [3]). However, from a supply point of view, the International Technology Roadmap for Semiconductors (ITRS) estimates that the pin count at a socket level is likely to remain constant [4]. As a result, the number of channels per socket is expected to be near-constant. In addition, the rate of growth in DIMM density is starting to wane (2X every three years versus 2X every two years), and the DIMM count per channel is declining (e.g., two DIMMs per channel on DDR3 versus eight for DDR) [5]. Figure 1(a) aggregates these trends to show historical and extrapolated increases in processor computation and associated memory capacity. The processor line shows the projected trend of cores per socket, while the DRAM line shows the projected trend of capacity per socket, given DRAM density growth and DIMM per channel decline. If the trends continue, the growing imbalance between supply and demand may lead to memory capacity per core dropping by 30% every two years, particularly for commodity solutions. If not addressed, future systems are likely to be performance-limited by inadequate memory capacity.

At the same time, several studies show that the contribution of memory to the total costs and power consumption of future systems is trending higher from its current value of about 25% [6][7][8]. Recent trends point to an interesting opportunity to address these challenges—namely that of optimizing for the ensemble [9]. For example, several studies have shown that there is significant temporal variation in how resources like CPU time or power are used across applications. We can expect similar trends in memory usage based on variations across application types, workload inputs, data characteristics, and traffic patterns. Figure 1(b) shows how the memory allocated by TPC-H queries can vary dramatically, and Figure 1(c) presents an eye-chart illustration of the time-varying memory usage of 10 randomly-chosen servers from a 1,000-CPU cluster used to render a recent animated feature film [10]. Each line illustrates a server's memory usage varying from a low baseline when idle to the peak memory usage of the application. Rather than provision each system for its worst-case memory usage, a solution that provisions for the
Figure 1: (a) Trends leading toward the memory capacity wall: relative cores per socket (#Core) versus DRAM capacity per socket, 2004-2017. (b) Memory allocated by TPC-H queries Q1-Q22. (c) Time-varying memory usage of 10 servers from an animation rendering cluster.

In this paper, we propose a new architectural building block to provide transparent memory expansion and sharing for commodity-based designs. Specifically, we revisit traditional memory designs in which memory modules are co-located with processors on a system board, restricting the configuration and scalability of both compute and memory resources. Instead, we argue for a disaggregated design in which a separate memory blade provides capacity that can be shared across the ensemble, and we propose two new system architectures to achieve transparent expansion and sharing. Our first solution requires no changes to existing system hardware, using support at the virtualization layer to provide page-level access to a memory blade across the standard PCI Express® (PCIe®) interface. Our second solution proposes minimal hardware support on every compute blade, but provides finer-grained access to a memory blade across a coherent network fabric for commodity software stacks. We demonstrate the validity of our approach through simulations of a mix of enterprise benchmarks supplemented with traces from live datacenters.
Hardware shared-memory systems typically require specialized interconnects and non-commodity components that add costs; in addition, signaling, electrical, and design complexity increase rapidly with system size. Software DSMs [24][25][26][27] can avoid these costs by managing the operations to send, receive, and maintain coherence in software, but come with practical limitations to functionality, generality, software transparency, total costs, and performance [28]. A recent commercial design in this space, Versatile SMP [29], uses a virtualization layer to chain together commodity x86 servers to provide the illusion of a single larger system, but the current design requires specialized motherboards, I/O devices, and non-commodity networking, and there is limited documentation on performance benefits, particularly with respect to software DSMs.

To increase the compute-to-memory ratio directly, researchers have proposed compressing memory contents [30][31] or augmenting/replacing conventional DRAM with alternative devices or interfaces. Recent startups like Virident [32] and Texas Memory [33] propose the use of solid-state storage, such as NAND Flash, to improve memory density, albeit with higher access latencies than conventional DRAM. From a technology perspective, fully-buffered DIMMs [34] have the potential to increase memory capacity but with significant trade-offs in power consumption. 3D die-stacking [35] allows DRAM to be placed on-chip as different layers of silicon; in addition to the open architectural issues on how to organize 3D-stacked main memory, this approach further constrains the extensibility of memory capacity. Phase change memory (PCM) is emerging as a promising alternative to increase memory density. However, current PCM devices suffer from several drawbacks that limit their straightforward use as a main memory replacement, including high energy requirements, slow write latencies, and finite endurance. In contrast to our work, none of these approaches enable memory capacity sharing across nodes. In addition, many of these alternatives provide only a one-time improvement, thus delaying but failing to fundamentally address the memory capacity wall.

A recent study [36] demonstrates the viability of a two-level memory organization that can tolerate increased access latency due to compression, heterogeneity, or network access to second-level memory. However, that study does not discuss a commodity implementation for x86 architectures or evaluate sharing across systems. Our prior work [8] employs a variant of this two-level memory organization as part of a broader demonstration of how multiple techniques, including the choice of processors, new packaging design, and use of Flash-based storage, can help improve performance in warehouse computing environments. The present paper follows up on our prior work by: (1) extending the two-level memory design to support x86 commodity servers; (2) presenting two new system architectures for accessing the remote memory; and (3) evaluating the designs on a broad range of workloads and real-world datacenter utilization traces.

As is evident from this discussion, there is currently no single architectural approach that simultaneously addresses memory-to-compute-capacity expansion and memory capacity sharing, and does it in an application/OS-transparent manner on commodity-based hardware and software. The next section describes our approach to define such an architecture.

3. DISAGGREGATED MEMORY
Our approach is based on four observations: (1) The emergence of blade servers with fast shared communication fabrics in the enclosure enables separate blades to share resources across the ensemble. (2) Virtualization provides a level of indirection that can enable OS-and-application-transparent memory capacity changes on demand. (3) Market trends towards commodity-based solutions require special-purpose support to be limited to the non-volume components of the solution. (4) The footprints of enterprise workloads vary across applications and over time; but current approaches to memory system design fail to leverage these variations, resorting instead to worst-case provisioning.

Given these observations, our approach argues for a re-examination of conventional designs that co-locate memory DIMMs in conjunction with computation resources, connected through conventional memory interfaces and controlled through on-chip memory controllers. Instead, we argue for a disaggregated multi-level design where we provision an additional separate memory blade, connected at the I/O or communication bus. This memory blade comprises arrays of commodity memory modules assembled to maximize density and cost-effectiveness, and provides extra memory capacity that can be allocated on-demand to individual compute blades. We first detail the design of a memory blade (Section 3.1), and then discuss system architectures that can leverage this component for transparent memory extension and sharing (Section 3.2).

3.1 Memory Blade Architecture
Figure 2(a) illustrates the design of our memory blade. The memory blade comprises a protocol engine to interface with the blade enclosure's I/O backplane interconnect, a custom memory-controller ASIC (or a light-weight CPU), and one or more channels of commodity DIMM modules connected via on-board repeater buffers or alternate fan-out techniques. The memory controller handles requests from client blades to read and write memory, and to manage capacity allocation and address mapping. Optional memory-side accelerators can be added to support functions like compression and encryption.

Although the memory blade itself includes custom hardware, it requires no changes to volume blade-server designs, as it connects through standard I/O interfaces. Its costs are amortized over the entire server ensemble. The memory blade design is straightforward compared to a typical server blade, as it does not have the cooling challenges of a high-performance CPU and does not require local disk, Ethernet capability, or other elements (e.g., management processor, SuperIO, etc.). Client access latency is dominated by the enclosure interconnect, which allows the memory blade's DRAM subsystem to be optimized for power and capacity efficiency rather than latency. For example, the controller can aggressively place DRAM pages into active power-down mode, and can map consecutive cache blocks into a single memory bank to minimize the number of active devices at the expense of reduced single-client bandwidth. A memory blade can also serve as a vehicle for integrating alternative memory technologies, such as Flash or phase-change memory, possibly in a heterogeneous combination with DRAM, without requiring modification to the compute blades.

To provide protection and isolation among shared clients, the memory controller translates each memory address accessed by a
client blade into an address local to the memory blade, called the Remote Machine Memory Address (RMMA). In our design, each client manages both local and remote physical memory within a single System Memory Address (SMA) space. Local physical memory resides at the bottom of this space, with remote memory mapped at higher addresses. For example, if a blade has 2 GB of local DRAM and has been assigned 6 GB of remote capacity, its total SMA space extends from 0 to 8 GB. Each blade's remote SMA space is mapped to a disjoint portion of the RMMA space. This process is illustrated in Figure 2(b). We manage the blade's memory in large chunks (e.g., 16 MB) so that the entire mapping table can be kept in SRAM on the memory blade's controller. For example, a 512 GB memory blade managed in 16 MB chunks requires only a 32K-entry mapping table. Using these "superpage" mappings avoids complex, high-latency DRAM page table data structures and custom TLB hardware. Note that providing shared-memory communications among client blades (as in distributed shared memory) is beyond the scope of this paper.

Figure 2: Design of the memory blade. (a) The memory blade connects to the compute blades via the enclosure backplane, and comprises a protocol agent, a memory controller with address-mapping logic, optional accelerators, and DIMMs (data, dirty, ECC). (b) The data structures that support memory access and allocation/revocation operations: per-client base/limit map registers, superpage/offset address decomposition by blade ID, RMMA maps with permissions, and an RMMA free list.
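To make the mapping concrete, the sketch below shows one plausible software model of the translation and permission check implied by Figure 2(b)'s per-client base/limit map registers and superpage map. The structure layouts, field names, and 16-client limit are illustrative assumptions for this sketch, not the paper's hardware design.

    /* Illustrative sketch of the memory blade's SMA-to-RMMA translation.
     * Superpages are 16 MB, so even a 512 GB blade needs only
     * 512 GB / 16 MB = 32K map entries, small enough for controller SRAM.
     * All names and layouts here are assumptions made for this sketch. */
    #include <stdint.h>
    #include <stdbool.h>
    #include <stdio.h>

    #define SUPERPAGE_BITS  24                     /* 16 MB superpages       */
    #define SUPERPAGE_SIZE  (1u << SUPERPAGE_BITS)
    #define MAP_ENTRIES     (32 * 1024)            /* 512 GB / 16 MB         */

    typedef struct {
        uint32_t rmma_superpage;   /* superpage number on the memory blade  */
        uint8_t  perm;             /* bit0 = read, bit1 = write (assumed)   */
        bool     valid;
    } map_entry_t;

    typedef struct {               /* per-client map registers, Fig. 2(b)   */
        uint32_t base;             /* first map entry owned by this client  */
        uint32_t limit;            /* number of superpages allocated to it  */
    } client_regs_t;

    static map_entry_t   rmma_map[MAP_ENTRIES];    /* kept in SRAM          */
    static client_regs_t client[16];               /* one per compute blade */

    /* Translate (blade_id, offset within the blade's remote SMA region)
     * into an RMMA, with a permission check. */
    static bool translate(unsigned blade_id, uint64_t remote_sma_off,
                          uint8_t required_perm, uint64_t *rmma)
    {
        uint32_t spage  = (uint32_t)(remote_sma_off >> SUPERPAGE_BITS);
        uint32_t offset = (uint32_t)(remote_sma_off & (SUPERPAGE_SIZE - 1));
        client_regs_t *c = &client[blade_id];

        if (spage >= c->limit)
            return false;                          /* outside allocation    */
        map_entry_t *e = &rmma_map[c->base + spage];
        if (!e->valid || (e->perm & required_perm) != required_perm)
            return false;                          /* unmapped/no permission */
        *rmma = ((uint64_t)e->rmma_superpage << SUPERPAGE_BITS) | offset;
        return true;
    }

    int main(void)
    {
        /* Example: blade 2 owns map entries [100, 100+384), i.e. 6 GB. */
        client[2] = (client_regs_t){ .base = 100, .limit = 384 };
        rmma_map[100] = (map_entry_t){ .rmma_superpage = 7, .perm = 0x3,
                                       .valid = true };
        uint64_t rmma;
        if (translate(2, 0x123456, /*read*/ 0x1, &rmma))
            printf("RMMA = 0x%llx\n", (unsigned long long)rmma);
        return 0;
    }

Because the map is indexed by superpage rather than by 4 KB page, the whole table fits in on-controller SRAM, which is what lets the design avoid DRAM-resident page tables and custom TLB hardware.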
Allocation and revocation: The memory blade's total capacity is partitioned among the connected clients through the cooperation of the virtual machine monitors (VMMs) running on the clients, in conjunction with enclosure-, rack-, or datacenter-level management software. The VMMs in turn are responsible for allocating remote memory among the virtual machine(s) (VMs) running on each client system. The selection of capacity allocation policies, both among blades in an enclosure and among VMs on a blade, is a broad topic that deserves separate study. Here we restrict our discussion to designing the mechanisms for allocation and revocation.

Allocation is straightforward: privileged management software on the memory blade assigns one or more unused memory blade superpages to a client, and sets up a mapping from the chosen blade ID and SMA range to the appropriate RMMA range.

In the case where there are no unused superpages, some existing mapping must be revoked so that memory can be reallocated. We assume that capacity reallocation is a rare event compared to the frequency of accessing memory using reads and writes. Consequently, our design focuses primarily on correctness and transparency and not performance.

When a client is allocated memory on a fully subscribed memory blade, management software first decides which other clients must give up capacity, then notifies the VMMs on those clients of the amount of remote memory they must release. We propose two general approaches for freeing pages. First, most VMMs already provide paging support to allow a set of VMs to oversubscribe local memory. This paging mechanism can be invoked to evict local or remote pages. When a remote page is to be swapped out, it is first transferred temporarily to an empty local frame and then paged to disk. The remote page freed by this transfer is released for reassignment.

Alternatively, many VMMs provide a "balloon driver" [37] within the guest OS to allocate and pin memory pages, which are then returned to the VMM. The balloon driver increases memory pressure within the guest OS, forcing it to select pages for eviction. This approach generally provides better results than the VMM's paging mechanisms, as the guest OS can make a more informed decision about which pages to swap out and may simply discard clean pages without writing them to disk. Because the newly freed physical pages can be dispersed across both the local and remote SMA ranges, the VMM may need to relocate pages within the SMA space to free a contiguous 16 MB remote superpage.

Once the VMMs have released their remote pages, the memory blade mapping tables may be updated to reflect the new allocation. We assume that the VMMs can generally be trusted to release memory on request; the unlikely failure of a VMM to release memory promptly indicates a serious error and can be resolved by rebooting the client blade.
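The end-to-end flow can be summarized as a small piece of control logic. The sketch below is only a schematic of the sequence described above (pick victims, have their VMMs release superpages by paging or ballooning, then grant and map the freed capacity); the function names and the victim-selection policy are hypothetical stand-ins, not the actual management software.

    /* Simplified sketch of memory-blade capacity allocation/revocation.
     * choose_victim(), vmm_release_remote(), and map_superpages() are
     * hypothetical hooks into management software and the client VMMs. */
    #include <stdio.h>

    #define NUM_CLIENTS      4
    #define TOTAL_SUPERPAGES 2048              /* e.g., 32 GB / 16 MB */

    static int free_superpages = TOTAL_SUPERPAGES;
    static int allocated[NUM_CLIENTS];         /* superpages per client */

    /* Ask a client's VMM to give back n superpages.  The VMM frees them by
     * paging VM memory to disk or by inflating a balloon driver in the
     * guests, then the blade can reuse the range. */
    static void vmm_release_remote(int c, int n)
    {
        printf("client %d releases %d superpages\n", c, n);
        allocated[c]    -= n;
        free_superpages += n;
    }

    /* Placeholder policy: revoke from the client holding the most remote
     * memory.  Real allocation policies are a separate study. */
    static int choose_victim(void)
    {
        int victim = 0;
        for (int c = 1; c < NUM_CLIENTS; c++)
            if (allocated[c] > allocated[victim])
                victim = c;
        return victim;
    }

    /* Set up blade-ID/SMA-to-RMMA mappings for newly granted superpages. */
    static void map_superpages(int c, int n)
    {
        printf("client %d granted %d superpages\n", c, n);
        allocated[c]    += n;
        free_superpages -= n;
    }

    static void allocate(int c, int n)
    {
        while (free_superpages < n)    /* fully subscribed: revoke first */
            vmm_release_remote(choose_victim(), n - free_superpages);
        map_superpages(c, n);
    }

    int main(void)
    {
        allocate(0, 1500);
        allocate(1, 1000);             /* forces revocation from client 0 */
        printf("free superpages: %d\n", free_superpages);
        return 0;
    }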
3.2 System Architecture with Memory Blades
Whereas our memory-blade design enables several alternative system architectures, we discuss two specific designs, one based on page swapping and another using fine-grained remote access. In addition to providing more detailed examples, these designs also illustrate some of the tradeoffs in the multi-dimensional design space for memory blades. Most importantly, they compare the method and granularity of access to the remote blade (page-based versus block-based) and the interconnect fabric used for communication (PCI Express versus HyperTransport).

3.2.1 Page-Swapping Remote Memory (PS)
Our first design avoids any hardware changes to the high-volume compute blades or enclosure; the memory blade itself is the only non-standard component. This constraint implies a conventional I/O backplane interconnect, typically PCIe. This basic design is illustrated in Figure 3(a).

Because CPUs in a conventional system cannot access cacheable memory across a PCIe connection, the system must bring remote memory locations into the client blade's local physical memory before they
can be accessed. We leverage standard virtual-memory mechanisms to detect accesses to remote memory and relocate the targeted locations to local memory on a page granularity. In addition to enabling the use of virtual memory support, page-based transfers exploit locality in the client's access stream and amortize the overhead of PCIe memory transfers.

To avoid modifications to application and OS software, we implement this page management in the VMM. The VMM detects accesses to remote data pages and swaps those data pages to local memory before allowing a load or store to proceed.

Figure 3: Page-swapping remote memory system design. (a) No changes are required to compute servers and networking on existing blade designs. Our solution adds minor modules (shaded block) to the virtualization layer. (b) The address mapping design places the extended capacity at the top of the address space.

In our design, we assume page swapping is performed on a 4 KB granularity, a common page size used by operating systems. Page swaps logically appear to the VMM as a swap from high SMA addresses (beyond the end of local memory) to low addresses (within local memory). To decouple the swap of a remote page to local memory and eviction of a local page to remote memory, we maintain a pool of free local pages for incoming swaps. The software fault handler thus allocates a page from the local free list and initiates a DMA transfer over the PCIe channel from the remote memory blade. The transfer is performed synchronously (i.e., the execution thread is stalled during the transfer, but other threads may execute). Once the transfer is complete, the fault handler updates the page table entry to point to the new, local SMA address and puts the prior remote SMA address into a pool of remote addresses that are currently unused.

To maintain an adequate supply of free local pages, the VMM must occasionally evict local pages to remote memory, effectively performing the second half of the logical swap operation. The VMM selects a high SMA address from the remote page free list and initiates a DMA transfer from a local page to the remote memory blade. When complete, the local page is unmapped and placed on the local free list. Eviction operations are performed asynchronously, and do not stall the CPU unless a conflicting access to the in-flight page occurs during eviction.
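As a concrete illustration of these mechanics, the sketch below shows the shape of a remote-page fault handler and the background eviction path. It is a schematic under our own assumptions: the DMA helpers, free-list functions, and page-table update are hypothetical stand-ins for the corresponding hypervisor facilities, not an actual VMM implementation.

    /* Sketch of the PS design's page-swap paths inside the VMM.
     * Remote pages live at high SMA addresses, local pages at low ones.
     * All helper names below are hypothetical. */
    #include <stdint.h>
    #include <stdio.h>

    #define PAGE_SIZE 4096u
    typedef uint64_t sma_t;

    /* ---- stand-ins for hypervisor facilities --------------------------- */
    static sma_t pop_local_free(void)      { static sma_t p = 0x1000;
                                             return p += PAGE_SIZE; }
    static sma_t pop_remote_free(void)     { static sma_t p = 0x80000000u;
                                             return p += PAGE_SIZE; }
    static void  push_local_free(sma_t p)  { printf("local 0x%llx freed\n",
                                             (unsigned long long)p); }
    static void  push_remote_free(sma_t p) { printf("remote 0x%llx freed\n",
                                             (unsigned long long)p); }
    static void  dma_page(sma_t src, sma_t dst)
    { printf("DMA 0x%llx -> 0x%llx\n",
             (unsigned long long)src, (unsigned long long)dst); }
    static void  remap_pte(sma_t old_sma, sma_t new_sma)
    { printf("PTE 0x%llx -> 0x%llx\n",
             (unsigned long long)old_sma, (unsigned long long)new_sma); }

    /* Fault path: a guest access touched a remote SMA page.  The faulting
     * thread stalls while the 4 KB page is brought in synchronously. */
    static void remote_page_fault(sma_t remote_sma)
    {
        sma_t local = pop_local_free();  /* pre-reserved free-frame pool   */
        dma_page(remote_sma, local);     /* synchronous PCIe transfer      */
        remap_pte(remote_sma, local);    /* later accesses hit locally     */
        push_remote_free(remote_sma);    /* remote frame becomes reusable  */
    }

    /* Background path: keep the local free pool stocked by evicting a
     * local victim to remote memory; runs asynchronously to the guest. */
    static void evict_local_page(sma_t victim_local)
    {
        sma_t remote = pop_remote_free();
        dma_page(victim_local, remote);  /* async unless a conflicting
                                            access occurs mid-flight      */
        remap_pte(victim_local, remote); /* marked not-present so a later
                                            access traps back to the VMM  */
        push_local_free(victim_local);
    }

    int main(void)
    {
        remote_page_fault(0x80001000u);
        evict_local_page(0x2000u);
        return 0;
    }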
3.2.2 Fine-Grained Remote Memory Access (FGRA)
The previous solution avoids any hardware changes to the commodity compute blade, but at the expense of trapping to the VMM and transferring full pages on every remote memory access. In our second approach, we examine the effect of a few minimal hardware changes to the high-volume compute blade to enable an alternate design that has higher performance potential. In particular, this design allows CPUs on the compute blade to access remote memory directly at cache-block granularity.
FGRA adds a filter to the compute blade's interface to the coherent interconnect. First, unlike conventional shared-memory multiprocessors, the filter ensures that the memory blade is a home agent but not a cache agent. Second, the filter can optionally translate coherence messages destined for the memory blade into an alternate format. For example, HyperTransport-protocol read and write requests can be translated into generic PCIe commands, allowing the use of commodity backplanes and decoupling the memory blade from specific cache-coherence protocols and processor technologies.
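The translation step can be pictured as follows. This is only an illustrative sketch: the request and command structures are invented for the example and deliberately abstract away the real HyperTransport and PCIe packet layouts, and the local/remote SMA boundary is an assumed constant.

    /* Illustrative sketch of the filter turning a coherent read/write aimed
     * at remote SMA space into a generic I/O command for the backplane.
     * Formats here are invented; real HT/PCIe encodings differ. */
    #include <stdint.h>
    #include <stdbool.h>
    #include <stdio.h>

    #define CACHE_BLOCK   64u                /* assumed cache-block size   */
    #define LOCAL_SMA_TOP 0x80000000ull      /* assumed start of remote SMA */

    typedef struct { bool is_write; uint64_t sma; } coh_req_t;   /* from CPU */
    typedef struct { bool is_write; uint64_t remote_off;
                     uint32_t len; } io_cmd_t;                   /* to blade */

    /* Returns true if the request targets the memory blade, filling in a
     * generic command that carries only the offset within this blade's
     * remote region (the blade itself maps it to an RMMA). */
    static bool filter_translate(const coh_req_t *req, io_cmd_t *cmd)
    {
        if (req->sma < LOCAL_SMA_TOP)
            return false;                    /* local DRAM: handled on-blade */
        cmd->is_write   = req->is_write;
        cmd->remote_off = req->sma - LOCAL_SMA_TOP;
        cmd->len        = CACHE_BLOCK;
        return true;
    }

    int main(void)
    {
        coh_req_t req = { .is_write = false, .sma = 0x80000040ull };
        io_cmd_t  cmd;
        if (filter_translate(&req, &cmd))
            printf("read %u B at remote offset 0x%llx\n",
                   cmd.len, (unsigned long long)cmd.remote_off);
        return 0;
    }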
Because this design allows the remote SMA space to be accessed directly by CPUs, VMM support is not required; an unmodified OS can treat both local and remote addresses uniformly. However, a VMM or additional OS support is required to enable dynamic allocation or revocation of remote memory. Performance can also potentially be improved by migrating the most frequently accessed remote pages into local memory, swapping them with infrequently accessed local pages—a task that could be performed by a VMM or by extending the NUMA support available in many OSes.

4. EVALUATION
4.1 Methodology
We compare the performance of our memory-blade designs to a conventional system primarily via trace-based simulation. Using traces rather than a detailed execution-driven CPU model makes it practical to process the billions of main-memory references needed to exercise a multi-gigabyte memory system. Although we forgo the ability to model overlap between processor execution and remote memory accesses with our trace-based simulations, our memory reference traces are collected from a simulator that does model overlap of local memory accesses. Additionally, we expect overlap for remote accesses to be negligible due to the relatively high latencies to our remote memory blade.

We collected memory reference traces from a detailed full-system simulator, used and validated in prior studies [39], modified to record the physical address, CPU ID, timestamp and read/write status of all main-memory accesses. To make it feasible to run the workloads to completion, we use a lightweight CPU model for this simulation. (Each simulation still took 1-2 weeks to complete.) The simulated system has four 2.2 GHz cores, with per-core dedicated 64KB L1 and 512 KB L2 caches, and a 2 MB L3 shared cache.

The common simulation parameters for our remote memory blade are listed in Table 1. For the baseline PS, we assume that the memory blade interconnect has a latency of 120 ns and bandwidth of 1 GB/s (each direction), based loosely on a PCIe 2.0 x2 channel. For the baseline FGRA, we assume a more aggressive channel, e.g., based on HyperTransport™ or a similar technology, with 60 ns latency and 4 GB/s bandwidth. Additionally, for PS, each access to remote memory results in a trap to the VMM, and VMM software must initiate the page transfer. Based on prior work [40], we assume a total of 330 ns (roughly 1,000 cycles on a 3 GHz processor) for this software overhead, including the trap itself, updating page tables, TLB shootdown, and generating the request message to the memory blade. All of our simulated systems are modeled with a hard drive with 8 ms access latency and 50 MB/s sustained bandwidth. We perform initial data placement using a first-touch allocation policy.

We validated our model on a real machine to measure the impact of reducing the physical memory allocation in a conventional server. We use an HP c-Class BL465c server with 2.2GHz AMD Opteron 2354 processors and 8 GB of DDR2-800 DRAM. To model a system with less DRAM capacity, we force the Linux kernel to reduce physical memory capacity using a boot-time kernel parameter.

The workloads used to evaluate our designs include a mixture of Web 2.0-based benchmarks (nutch, indexer), traditional server benchmarks (pgbench, TPC-H, SPECjbb®2005), and traditional computational benchmarks (SPEC® CPU2006 – zeusmp, gcc, perl, bwaves, mcf). Additionally, we developed a multi-programmed workload, spec4p, by combining the traces from zeusmp, gcc, perl, and mcf. Spec4p offers insight into multiple workloads sharing a single server's link to the memory blade. Table 1 describes these workloads in more detail. We further broadly classify the workloads into three groups—low, medium, and high—based on their memory footprint sizes.
Table 1. Memory blade parameters, workloads, and real-world traces.

Memory blade parameters:
  DRAM latency: 120 ns        Map table access: 5 ns      Request packet processing: 60 ns
  DRAM bandwidth: 6.4 GB/s    Transfer page size: 4 KB    Response packet processing: 60 ns

Workloads (footprint size):
  SPEC CPU2006 - 5 large-memory benchmarks: zeusmp, perl, gcc, bwaves, and mcf, as well as a combination of four of them, spec4p. Low (zeusmp, gcc, perl, bwaves), Medium (mcf), High (spec4p)
  nutch4p - Nutch 0.9.1 search engine with Resin and Sun JDK 1.6.0, 5 GB index hosted on tempfs. Medium
  tpchmix - TPC-H running on MySQL 5.0 with a scaling factor of 1; 2 copies of query 17 mixed with query 1 and query 3 (representing balanced, scan, and join heavy queries). Medium
  pgbench - TPC-B-like benchmark running PostgreSQL 8.3 with pgbench and a scaling factor of 100. High
  indexer - Nutch 0.9.1 indexer, Sun JDK 1.6.0 and HDFS hosted on one hard drive. High
  SPECjbb - 4 copies of SPECjbb 2005, each with 16 warehouses, using Sun JDK 1.6.0. High

Real-world traces:
  Animation - Resource utilization traces collected on 500+ animation rendering servers over a year, 1-second sample interval. We present data from traces from a group of 16 representative servers.
  VM consolidation - VM consolidation traces of 16 servers based on enterprise and web2.0 workloads, maximum resource usage reported every 10-minute interval.
  Web2.0 - Resource utilization traces collected on 290 servers from a web2.0 company; we use sar traces with 1-second sample interval for 16 representative servers.
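To give a feel for how these parameters combine, the short calculation below composes them into rough per-access costs for PS (a VMM trap plus a 4 KB page transfer) and FGRA (a single cache-block access). The way the terms are summed and the 64-byte block size are our own simplifying assumptions for intuition only; the evaluation itself uses trace-driven simulation, not this closed-form model.

    /* Back-of-the-envelope composition of the Table 1 / Section 4.1
     * parameters.  A simplification for intuition, not the simulator. */
    #include <stdio.h>

    int main(void)
    {
        /* Shared memory-blade parameters (Table 1). */
        const double dram_ns = 120, map_ns = 5, req_pkt_ns = 60, rsp_pkt_ns = 60;

        /* PS baseline: 120 ns / 1 GB/s PCIe-like link, 330 ns VMM
         * overhead, 4 KB page transfer (1 GB/s == 1 byte per ns). */
        const double ps_link_ns = 120, ps_bw_gbps = 1.0, ps_trap_ns = 330;
        const double ps_xfer_ns = 4096 / ps_bw_gbps;           /* ~4096 ns */
        const double ps_fault_ns = ps_trap_ns + req_pkt_ns + ps_link_ns
                                 + map_ns + dram_ns + rsp_pkt_ns + ps_xfer_ns;

        /* FGRA baseline: 60 ns / 4 GB/s coherent link, 64 B block (assumed). */
        const double fg_link_ns = 60, fg_bw_gbps = 4.0;
        const double fg_xfer_ns = 64 / fg_bw_gbps;             /* 16 ns    */
        const double fg_access_ns = req_pkt_ns + fg_link_ns + map_ns
                                  + dram_ns + rsp_pkt_ns + fg_xfer_ns;

        /* Disk paging, for comparison: 8 ms seek + 4 KB at 50 MB/s. */
        const double disk_ns = 8e6 + 4096 / 0.05;

        printf("PS   first touch of a remote page: ~%.0f ns "
               "(subsequent accesses run at local-DRAM speed)\n", ps_fault_ns);
        printf("FGRA each remote block access:     ~%.0f ns\n", fg_access_ns);
        printf("Disk page fault, for comparison:   ~%.0f ns\n", disk_ns);
        return 0;
    }

Even under this crude composition, either remote-memory path is roughly three orders of magnitude faster than paging to disk, which is the intuition behind the capacity-expansion results that follow.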
The low group consists of benchmarks whose footprint is less than 1 GB, medium ranges from 1 GB to 1.75 GB, and high includes those with footprints between 1.75GB and 3GB. In addition to these workloads, we have also collected traces of memory usage in three real-world, large-scale datacenter environments. These environments include Animation, VM consolidation, and web2.0, and are described in Table 1. These traces were each gathered for over a month across a large number of servers and are used to guide our selection of workloads to mimic the time-varying memory requirements of applications seen in real-world environments.

To quantify the cost benefits of our design, we developed a cost model for our disaggregated memory solutions and the baseline servers against which we compare. Because our designs target the memory system, we present data specific to the memory system. We gathered price data from public and industry sources for as many components as possible. For components not available off the shelf, such as our remote memory blade controller, we estimate a cost range. We further include power and cooling costs, given a typical 3-year server lifespan. We used DRAM power calculators to evaluate the power consumption of DDR2 devices. Estimates for the memory contributions towards power and cooling are calculated using the same methodology as in [8].

Figure 5: Capacity expansion results. (a) and (b) show the performance improvement for our two designs over memory-capacity-constrained baselines (speedup of PS and FGRA over M-app-75% and over M-median provisioning, respectively); (c) shows performance and costs relative to worst-case (M-max) provisioning.

4.2 Results
4.2.1 Memory expansion for individual benchmarks
We first focus on the applicability of memory disaggregation to address the memory capacity wall for individual benchmarks. To illustrate scenarios where applications run into memory capacity limitations due to a core-to-memory ratio imbalance, we perform an experiment where we run each of our benchmarks on a baseline system with only 75% of that benchmark's memory footprint (M-app-75%). The baseline system must swap pages to disk to accommodate the full footprint of the workload. We compare these with our two disaggregated-memory architectures, PS and FGRA. In these cases, the compute nodes continue to have local DRAM capacity corresponding to only 75% of the benchmark's memory footprint, but have the ability to exploit capacity from a remote memory blade. We assume 32GB of memory on the memory blade, which is sufficient to fit any application's footprint. Figure 5(a) summarizes the speedup for the PS and FGRA designs relative to the baseline. Both of our new solutions achieve significant improvements, ranging from 4X to 320X higher performance. These improvements stem from the much lower latency of our remote memory solutions compared to OS-based disk paging. In particular, zeusmp, bwaves, mcf, specjbb, and spec4p show the highest benefits due to their large working sets.

Interestingly, we also observe that PS outperforms FGRA in this experiment, despite our expectations for FGRA to achieve better performance due to its lower access latency. Further investigation reveals that the page swapping policy in PS, which transfers pages from remote memory to local memory upon access, accounts for its performance advantage. Under PS, although the initial access to a remote memory location incurs a high latency due to the VMM trap and the 4 KB page transfer over the slower PCIe interconnect, subsequent accesses to that address consequently incur only local-memory latencies. The FGRA design, though it has lower remote latencies compared to PS, continues to incur these latencies for every access to a frequently used remote location. Nevertheless, FGRA still outperforms the baseline. We examine the addition of page swapping to FGRA in Section 4.2.5.

Figure 5(b) considers an alternate baseline where the compute server memory is set to approximate the median-case memory footprint requirements across our benchmarks (M-median = 1.5GB). This baseline models a realistic scenario where the server
is provisioned for the common-case workload, but can still see a mix of different workloads. Figure 5(b) shows that our proposed solutions now achieve performance improvements only for benchmarks with high memory footprints. For other benchmarks, the remote memory blade is unused, and does not provide any benefit. More importantly, it does not cause any slowdown.

Finally, Figure 5(c) considers a baseline where the server memory is provisioned for the worst-case application footprint (M-max = 4GB). This baseline models many current datacenter scenarios where servers are provisioned in anticipation of the worst-case load, either across workloads or across time. We configure our memory disaggregation solutions as in the previous experiment, with M-median provisioned per-blade and additional capacity in the remote blade. Our results show that, for workloads with small footprints, our new solutions perform comparably. For workloads with larger footprints, going to remote memory causes a slowdown compared to local memory; however, PS provides comparable performance in some large-footprint workloads (pgbench, indexer), and on the remaining workloads its performance is still within 30% of M-max. As before, FGRA loses performance as it does not exploit locality patterns to ensure most accesses go to local memory.

4.2.2 Power and cost analysis
Using the methodology described in 4.1, we estimate the memory power draw of our baseline M-median system as 10 W, and our M-max system as 21 W. To determine the power draw of our disaggregated memory solutions, we assume local memory provisioned for median capacity requirements (as in M-median) and a memory blade with 32 GB shared by 16 servers. Furthermore, because the memory blade can tolerate increased DRAM access latency, we assume it aggressively employs DRAM low-power sleep modes. For a 16-server ensemble, we estimate the amortized per-server memory power of the disaggregated solution (including all local and remote memory and the memory blade interface hardware, such as its controller, and I/O connections) at 15 W.

Figure 6 illustrates the cost impact of the custom designed memory blade, showing the changes in the average performance-per-memory cost improvement over the baseline M-max system as memory blade cost varies. To put the memory blade cost into context with the memory subsystem, the cost is calculated as a percentage of the total remote DRAM costs (memory blade cost divided by remote DRAM costs), using 32 GB of remote memory. Note that for clarity, the cost range on the horizontal axis refers only to the memory blade interface/packaging hardware excluding DRAM costs (the fixed DRAM costs are factored into the results). The hardware cost break-even points for PS and FGRA are high, implying a sufficiently large budget envelope for the memory blade implementation. We expect that the overhead of a realistic implementation of a memory blade could be below 50% of the remote DRAM cost (given current market prices). This overhead can be reduced further by considering higher capacity memory blades; for example, we expect the cost to be below 7% of the remote DRAM cost of a 256 GB memory blade.

Figure 6: Memory blade cost analysis. Average performance-per-memory-dollar improvement over M-max for PS and FGRA versus memory blade cost, expressed as a percentage (0% to 300%) of the total cost of remote DRAM.

4.2.3 Server consolidation
Viewed as a key application for multi-core processors, server consolidation improves hardware resource utilization by hosting multiple virtual machines on a single physical platform. However, memory capacity is often the bottleneck to server consolidation because other resources (e.g., processor and I/O) are easier to multiplex, and the growing imbalance between processor and memory capacities exacerbates the problem. This effect is evident in our real-world web2.0 traces, where processor utilization rates are typically below 30% (rarely over 45%) while more than 80% of memory is allocated, indicating limited consolidation opportunities without memory expansion. To address this issue, current solutions either advocate larger SMP servers for their memory capacity or sophisticated hypervisor memory management policies to reduce workload footprints, but they incur performance penalties, increase costs and complexity, and do not address the fundamental processor-memory imbalance.

Memory disaggregation enables new consolidation opportunities by supporting processor-independent memory expansion. With memory blades to provide the second-level memory capacity, we can reduce each workload's processor-local memory allocation to less than its total footprint (M-max) while still maintaining comparable performance (i.e., <3% slowdown). This workload-specific local vs. remote memory ratio determines how much memory can be freed on a compute server (and shifted onto the memory blade) to allow further consolidation. Unfortunately, it is not possible to experiment in production datacenters to determine these ratios. Instead, we determine the typical range of local-to-remote ratios using our simulated workload suite. We can then use this range to investigate the potential for increased consolidation using resource utilization traces from production systems.

We evaluate the consolidation benefit using the web2.0 workload (CPU, memory and IO resource utilization traces for 200+ servers) and a sophisticated consolidation algorithm similar to that used by Rolia et al. [41]. The algorithm performs multi-dimensional bin packing to minimize the number of servers needed for given resource requirements. We do not consider the other two traces for this experiment. Animation is CPU-bound and runs out of CPU before it runs out of memory, so memory disaggregation does not help. However, as CPU capacity increases in the future, we may likely encounter a similar situation as web2.0. VM consolidation, on the other hand, does run out of memory before it runs out of CPU, but these traces already represent the result of consolidation, and in the absence of information on the prior consolidation policy, it is hard to make a fair determination of the baseline and the additional benefits from memory disaggregation over existing approaches.
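For readers unfamiliar with this class of algorithm, the sketch below shows the basic shape of multi-dimensional bin packing as a greedy first-fit pass over per-VM CPU and memory demands (listed here in decreasing CPU order; a full first-fit-decreasing pass would sort first). It is not the Rolia et al. algorithm [41], and the demand values are fabricated; it only illustrates how a memory-blade-backed increase in effective per-server memory (the mem_cap parameter) changes how many physical servers the packing needs.

    /* Minimal greedy 2-D (CPU, memory) first-fit packing sketch.
     * Workload numbers are fabricated for illustration; the real study
     * uses the web2.0 utilization traces and a more sophisticated packer. */
    #include <stdio.h>

    #define N_VMS 8
    typedef struct { double cpu, mem; } demand_t;

    /* Pack demands into hosts of capacity (1.0 CPU, mem_cap memory);
     * remote-blade capacity effectively raises mem_cap above what the
     * local DIMMs alone would allow. */
    static int pack(const demand_t *d, int n, double mem_cap)
    {
        double host_cpu[N_VMS] = {0}, host_mem[N_VMS] = {0};
        int hosts = 0;
        for (int i = 0; i < n; i++) {
            int h = 0;
            while (h < hosts && (host_cpu[h] + d[i].cpu > 1.0 ||
                                 host_mem[h] + d[i].mem > mem_cap))
                h++;                           /* first host that fits     */
            if (h == hosts) hosts++;           /* otherwise open a new one */
            host_cpu[h] += d[i].cpu;
            host_mem[h] += d[i].mem;
        }
        return hosts;
    }

    int main(void)
    {
        /* CPU fraction and memory fraction per VM (fabricated). */
        demand_t vms[N_VMS] = {
            {0.25, 0.6}, {0.20, 0.5}, {0.15, 0.5}, {0.15, 0.4},
            {0.10, 0.4}, {0.10, 0.3}, {0.05, 0.3}, {0.05, 0.2},
        };
        printf("hosts, local DIMMs only (mem_cap=1.0): %d\n",
               pack(vms, N_VMS, 1.0));
        printf("hosts, with memory blade  (mem_cap=2.0): %d\n",
               pack(vms, N_VMS, 2.0));
        return 0;
    }

In this toy instance the packing is memory-bound at mem_cap=1.0 and CPU-bound at mem_cap=2.0, mirroring the observation above that memory, not CPU, limits consolidation in the web2.0 traces.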
Figure 7: Mixed workload and ensemble results. (a) Hardware reductions from improved VM consolidation made possible by remote memory. (b) Performance-per-dollar as remote memory capacity is varied for the Animation, web2.0, and VM consolidation traces. (c) Slowdown relative to per-blade worst-case provisioning (M-max) at cost-optimal provisioning.

As shown in Figure 7(a), without memory disaggregation, the state-of-the-art algorithm ("Current") achieves only modest hardware reductions (5% processor and 13% memory); limited memory capacity precludes further consolidation. In contrast, page-swapping–based memory disaggregation corrects the time-varying imbalance between VM memory demands and local capacity, allowing a substantial reduction of processor count by a further 68%.

A consolidated server can also be allocated the additional memory it needs from the memory blade. (In a task-scheduling environment, this could be based on prior knowledge of the memory footprint of the new task that will be scheduled.) For the cost of the memory blade, we conservatively estimated the price to be approximately that of a low-end system. We expect this estimate to be conservative because of the limited functionality and hardware requirements of the memory blade versus that of a general purpose server.
Figure 8: Alternate FGRA designs. (a) shows the normalized performance when FGRA is supplemented by NUMA-type optimizations; (b) shows the performance loss from tunneling FGRA accesses over a commodity interconnect.

FGRA's main drawback, namely that every access to a frequently used remote location pays the remote latency (Section 4.2.1), can be addressed by adding page migration to FGRA, similar to existing CC-NUMA optimizations (e.g., Linux's memory placement optimizations [42]). To study the potential impact of this enhancement, we modeled a hypothetical system that tracks page usage and, at 10 ms intervals, swaps the most highly used pages into local memory. Figure 8(a) summarizes the speedup of this system over the base FGRA design for M-median compute blades. For the high-footprint workloads that exhibit the worst performance with FGRA (mcf, SPECjbb, and SPEC4p), page migration achieves 3.3-4.5X improvement, with smaller (5-8%) benefit on other high-footprint workloads. For all workloads, the optimized FGRA performs similarly to, and in a few cases better than, PS. These results motivate further examination of data placement policies for FGRA.
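A minimal version of the page-usage tracking and periodic swap described above might look like the following. The counters, page counts, and helper names are illustrative assumptions for this sketch rather than the modeled system's actual mechanism.

    /* Sketch of NUMA-style hot-page migration layered on FGRA: count
     * accesses to remote pages and, at each interval, swap the hottest
     * remote pages with the coldest local pages. */
    #include <stdio.h>

    #define N_LOCAL  4
    #define N_REMOTE 4

    static unsigned local_hits[N_LOCAL]   = { 5,  2, 90, 40 };
    static unsigned remote_hits[N_REMOTE] = { 70, 1,  3, 55 };

    static void swap_pages(int local_idx, int remote_idx)
    {
        /* Placeholder for the actual copy + remap (cf. the PS swap path). */
        printf("migrate remote page %d <-> local page %d\n",
               remote_idx, local_idx);
        unsigned tmp = local_hits[local_idx];
        local_hits[local_idx]   = remote_hits[remote_idx];
        remote_hits[remote_idx] = tmp;
    }

    /* Invoked every 10 ms in the modeled system. */
    static void migration_tick(void)
    {
        for (;;) {
            int hot = 0, cold = 0;
            for (int r = 1; r < N_REMOTE; r++)
                if (remote_hits[r] > remote_hits[hot]) hot = r;
            for (int l = 1; l < N_LOCAL; l++)
                if (local_hits[l] < local_hits[cold]) cold = l;
            if (remote_hits[hot] <= local_hits[cold])
                break;                   /* nothing left worth migrating */
            swap_pages(cold, hot);
        }
        for (int i = 0; i < N_LOCAL; i++)  local_hits[i]  = 0;
        for (int i = 0; i < N_REMOTE; i++) remote_hits[i] = 0;
    }

    int main(void)
    {
        migration_tick();
        return 0;
    }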
The hardware cost of FGRA can be reduced by using a standard PCIe backplane (as PS does) rather than a coherent interconnect, as discussed in Section 3.2.2. This change incurs a latency and bandwidth penalty as the standardized PCIe interconnect is less aggressive than a more specialized interconnect such as cHT. Figure 8(b) shows the change in performance relative to the baseline FGRA. Performance is comparable, decreasing by at most 20% on the higher memory usage workloads. This performance loss may be acceptable if the cost of extending a high-performance interconnect like cHT across the enclosure backplane is high.

Though not shown here (due to space constraints), we have also studied sensitivity of our results to the VMM overhead and memory latency parameters in Table 1. Our results show no qualitative change to our conclusions.

5. DISCUSSION
Evaluation assumptions. Our evaluation does not model interconnect routing, arbitration, buffering, and QoS management in detail. Provided interconnect utilization is not near saturation, these omissions will not significantly impact transfer latencies. We have confirmed that per-blade interconnect bandwidth consumption falls well below the capabilities of PCIe and HT. However, the number of channels to the memory blade may need to be scaled with the number of supported clients.

Disaggregation also has implications for enterprise system reliability, availability, security, and manageability. From a reliability perspective, dynamic reprovisioning provides an inexpensive means to equip servers with hot-spare DRAM; in the event of a DIMM failure anywhere in the ensemble, memory can be remapped and capacity reassigned to replace the lost DIMM. However, the memory blade also introduces additional failure modes that impact multiple servers. A complete memory-blade failure might impact several blades, but this possibility can be mitigated by adding redundancy to the blade's memory controller. We expect that high availability could be achieved at a relatively low cost, given the controller's limited functionality. To provide security and isolation, our design enforces strict assignment of capacity to specific blades, prohibits sharing, and can optionally erase memory content prior to reallocation to ensure confidentiality. From a manageability perspective, disaggregation allows management software to provision memory capacity across blades, reducing the need to physically relocate DIMMs.

Memory blade scalability and sharing. There are several obvious extensions to our designs. First, to provide memory scaling beyond the limits of a single memory blade, a server ensemble might include multiple memory blades. Second, prior studies of consolidated VMs have shown substantial opportunities to reduce memory requirements via copy-on-write content-based page sharing across VMs [37]. Disaggregated memory offers an even larger scope for sharing content across multiple compute blades. Finally, in some system architectures, subsets of processors/blades share a memory coherence domain, which we might seek to extend via disaggregation.

Synergy with emerging technologies. Disaggregated memory extends the conventional virtual memory hierarchy with a new layer. This layer introduces several possibilities to integrate new technologies into the ensemble memory system that might prove latency- or cost-prohibitive in conventional blade architectures. First, we foresee substantial opportunity to leverage emerging interconnect technologies (e.g., optical interconnects) to improve communication latency and bandwidth and allow greater physical distance between compute and memory blades. Second, the memory blade's controller provides a logical point in the system hierarchy to integrate accelerators for capacity and reliability enhancements, such as memory compression [30][31]. Finally, one might replace or complement memory blade DRAM with higher-density, lower-power, and/or non-volatile memory
technologies, such as NAND Flash or phase change memory. Unlike conventional memory systems, where it is difficult to integrate these technologies because of large or asymmetric access latencies and lifetime/wearout challenges, disaggregated memory is more tolerant of increased access latency, and the memory blade controller might be extended to implement wear-leveling and other lifetime management strategies [43]. Furthermore, disaggregated memory offers the potential for transparent integration. Because of the memory interface abstraction provided by our design, Flash or phase change memory can be utilized on the memory blade without requiring any further changes on the compute blade.

6. CONCLUSIONS
Constraints on per-socket memory capacity and the growing contribution of memory to total datacenter costs and power consumption motivate redesign of the memory subsystem. In this paper, we discuss a new architectural approach—memory disaggregation—which uses dedicated memory blades to provide OS-transparent memory extension and ensemble sharing for commodity-based blade-server designs. We propose an extensible design for the memory blade, including address remapping facilities to support protected dynamic memory provisioning across multiple clients, and unique density optimizations to address the compute-to-memory capacity imbalance. We discuss two different system architectures that incorporate this blade: a page-based design that allows memory blades to be used on current commodity blade server architectures with small changes to the virtualization layer, and an alternative that requires small amounts of extra hardware support in current compute blades but supports fine-grained remote accesses and requires no changes to the software layer. To the best of our knowledge, our work is the first to propose a commodity-based design that simultaneously addresses compute-to-memory capacity extension and cross-node memory capacity sharing. We are also the first to consider dynamic memory sharing across the I/O communication network in a blade enclosure and quantitatively evaluate design tradeoffs in this environment.

Simulations based on detailed traces from 12 enterprise benchmarks and three real-world enterprise datacenter deployments show that our approach has significant potential. The ability to extend and share memory can achieve orders of magnitude performance improvements in cases where applications run out of memory capacity, and similar orders of magnitude improvement in performance-per-dollar in cases where systems are overprovisioned for peak memory usage. We also demonstrate how this approach can be used to achieve higher levels of server consolidation than currently possible. Overall, as future server environments gravitate towards more memory-constrained and cost-conscious solutions, we believe that the memory disaggregation approach we have proposed in the paper is likely to be a key part of future system designs.

7. ACKNOWLEDGEMENTS
We thank the anonymous reviewers for their feedback. This work was partially supported by NSF grant CSR-0834403, and an Open Innovation grant from HP. We would also like to acknowledge Andrew Wheeler, John Bockhaus, Eric Anderson, Dean Cookson, Niraj Tolia, Justin Meza, the Exascale Datacenter team and the COTSon team at HP Labs, and Norm Jouppi for their support and useful comments.

8. REFERENCES
[1] K. Asanovic et al. The Landscape of Parallel Computing Research: A View from Berkeley. UC Berkeley EECS Tech Report UCB/EECS-2006-183, Dec. 2006.
[2] VMWare Performance Team Blogs. Ten Reasons Why Oracle Databases Run Best on VMWare "Scale up with Large Memory." http://tinyurl.com/cudjuy
[3] J. Larus. Spending Moore's Dividend. Microsoft Tech Report MSR-TR-2008-69, May 2008.
[4] SIA. International Technology Roadmap for Semiconductors, 2007 Edition, 2007.
[5] HP. Memory technology evolution: an overview of system memory technologies. http://tinyurl.com/ctfjs2
[6] A. Lebeck, X. Fan, H. Zheng and C. Ellis. Power Aware Page Allocation. In Proc. of the 9th Int. Conf. on Architectural Support for Programming Languages and Operating Systems (ASPLOS-IX), Nov. 2000.
[7] V. Pandey, W. Jiang, Y. Zhou and R. Bianchini. DMA-Aware Memory Energy Conservation. In Proc. of the 12th Int. Sym. on High-Performance Computer Architecture (HPCA-12), 2006.
[8] K. Lim, P. Ranganathan, J. Chang, C. Patel, T. Mudge and S. Reinhardt. Understanding and Designing New Server Architectures for Emerging Warehouse-Computing Environments. In Proc. of the 35th Int. Sym. on Computer Architecture (ISCA-35), June 2008.
[9] P. Ranganathan and N. Jouppi. Enterprise IT Trends and Implications for Architecture Research. In Proc. of the 11th Int. Sym. on High-Performance Computer Architecture (HPCA-11), 2005.
[10] http://apotheca.hpl.hp.com/pub/datasets/animation-bear/
[11] L. Barroso, J. Dean and U. Hoelzle. Web Search for a Planet: The Google Cluster Architecture. IEEE Micro, 23(2), March/April 2003.
[12] E. Felten and J. Zahorjan. Issues in the implementation of a remote memory paging system. University of Washington CSE TR 91-03-09, March 1991.
[13] M. Feeley, W. Morgan, E. Pighin, A. Karlin, H. Levy and C. Thekkath. Implementing global memory management in a workstation cluster. In Proc. of the 15th ACM Sym. on Operating System Principles (SOSP-15), 1995.
[14] M. Flouris and E. Markatos. The network RamDisk: Using remote memory on heterogeneous NOWs. Cluster Computing, Vol. 2, Issue 4, 1999.
[15] M. Dahlin, R. Wang, T. Anderson and D. Patterson. Cooperative caching: Using remote client memory to improve file system performance. In Proc. of the 1st USENIX Sym. on Operating Systems Design and Implementation (OSDI '94), 1994.
[16] M. Hines, L. Lewandowski and K. Gopalan. Anemone: Adaptive Network Memory Engine. Florida State University TR-050128, 2005.
[17] L. Iftode, K. Li and K. Peterson. Memory servers for multicomputers. IEEE Spring COMPCON '93, 1993.
[18] S. Koussih, A. Acharya and S. Setia. Dodo: A user-level system for exploiting idle memory in workstation clusters. In Proc. of the 8th IEEE Int. Sym. on High Performance Distributed Computing (HPDC-8), 1999.
[19] A. Agarwal et al. The MIT Alewife Machine: Architecture and Performance. In Proc. of the 23rd Int. Sym. on Computer Architecture (ISCA-23), 1995.
[20] D. Lenoski et al. The Stanford DASH Multiprocessor. IEEE Computer, 25(3), Mar. 1992.
[21] E. Hagersten and M. Koster. WildFire: A Scalable Path for SMPs. In Proc. of the 5th Int. Sym. on High-Performance Computer Architecture (HPCA-5), 1999.
[22] J. Laudon and D. Lenoski. The SGI Origin: A ccNUMA Highly Scalable Server. In Proc. of the 25th Int. Sym. on Computer Architecture (ISCA-25), 1997.
[23] W. Bolosky, M. Scott, R. Fitzgerald, R. Fowler and A. Cox. NUMA Policies and their Relationship to Memory Architecture. In Proc. of the 4th Int. Conf. on Architectural Support for Programming Languages and Operating Systems (ASPLOS-IV), 1991.
[24] K. Li and P. Hudak. Memory coherence in shared virtual memory systems. ACM Transactions on Computer Systems (TOCS), 7(4), Nov. 1989.
[25] D. Scales, K. Gharachorloo and C. Thekkath. Shasta: A Low Overhead, Software-Only Approach for Supporting Fine-Grain Shared Memory. In Proc. of the 7th Int. Conf. on Architectural Support for Programming Languages and Operating Systems (ASPLOS-VII), 1996.
[26] C. Amza et al. TreadMarks: Shared Memory Computing on Networks of Workstations. IEEE Computer, 29(2), 1996.
[27] I. Schoinas, B. Falsafi, A. Lebeck, S. Reinhardt, J. Larus and D. Wood. Fine-grain Access Control for Distributed Shared Memory. In Proc. of the 6th Int. Conf. on Architectural Support for Programming Languages and Operating Systems (ASPLOS-VI), 1994.
[28] K. Gharachorloo. The Plight of Software Distributed Shared Memory. Invited talk at the 1st Workshop on Software Distributed Shared Memory (WSDSM '99), 1999.
[29] ScaleMP. The Versatile SMP™ (vSMP) Architecture and Solutions Based on vSMP Foundation™. White paper at http://www.scalemp.com/prod/technology/how-does-it-work/
[30] F. Douglis. The compression cache: using online compression to extend physical memory. In Proc. of the 1993 Winter USENIX Conference, 1993.
[31] M. Ekman and P. Stenström. A Robust Main Memory Compression Scheme. In Proc. of the 32nd Int. Sym. on Computer Architecture (ISCA-32), 2005.
[32] Virident. Virident's GreenGateway™ technology and Spansion® EcoRAM. http://www.virident.com/solutions.php
[33] Texas Memory Systems. TMS RamSan-440 Details. http://www.superssd.com/products/ramsan-440/
[34] Intel. Intel Fully Buffered DIMM Specification Addendum. http://www.intel.com/technology/memory/FBDIMM/spec/Intel_FBD_Spec_Addendum_rev_p9.pdf
[35] T. Kgil et al. PicoServer: using 3D stacking technology to enable a compact energy efficient chip multiprocessor. In Proc. of the 12th Int. Conf. on Architectural Support for Programming Languages and Operating Systems (ASPLOS-XII), 2006.
[36] M. Ekman and P. Stenström. A Cost-Effective Main Memory Organization for Future Servers. In Proc. of the 19th Int. Parallel and Distributed Processing Symposium, 2005.
[37] C. Waldspurger. Memory Resource Management in VMware ESX Server. In Proc. of the 5th USENIX Sym. on Operating System Design and Implementation (OSDI '02), 2002.
[38] D. Ye, A. Pavuluri, C. Waldspurger, B. Tsang, B. Rychlik and S. Woo. Prototyping a Hybrid Main Memory Using a Virtual Machine Monitor. In Proc. of the 26th Int. Conf. on Computer Design (ICCD), 2008.
[39] E. Argollo, A. Falcón, P. Faraboschi, M. Monchiero and D. Ortega. COTSon: Infrastructure for System-Level Simulation. ACM Operating Systems Review, 43(1), 2009.
[40] J. R. Santos, Y. Turner, G. Janakiraman and I. Pratt. Bridging the gap between software and hardware techniques for I/O virtualization. USENIX Annual Technical Conference, 2008.
[41] J. Rolia, A. Andrzejak and M. Arlitt. Automating Enterprise Application Placement in Resource Utilities. 14th IFIP/IEEE Int. Workshop on Distributed Systems: Operations and Management (DSOM 2003), 2003.
[42] R. Bryant and J. Hawkes. Linux® Scalability for Large NUMA Systems. In Proc. of the Ottawa Linux Symposium 2003, July 2003.
[43] T. Kgil, D. Roberts and T. Mudge. Improving NAND Flash Based Disk Caches. In Proc. of the 35th Int. Sym. on Computer Architecture (ISCA-35), June 2008.
AMD, the AMD Arrow Logo, AMD Opteron and combinations thereof are trademarks of Advanced Micro Devices, Inc.
HyperTransport is a trademark of the HyperTransport Consortium.
Microsoft and Windows are registered trademarks of Microsoft Corporation.
PCI Express and PCIe are registered trademarks of PCI-SIG.
Linux is a registered trademark of Linus Torvalds.
SPEC is a registered trademark of the Standard Performance Evaluation Corporation (SPEC).