Disaggregated Memory for Expansion and Sharing in Blade Servers
ABSTRACT
Analysis of technology and application trends reveals a growing imbalance in the peak compute-to-memory-capacity ratio for future servers. At the same time, the fraction contributed by memory systems to total datacenter costs and power consumption during typical usage is increasing. In response to these trends, this paper re-examines traditional compute-memory co-location on a single system and details the design of a new general-purpose architectural building block—a memory blade—that allows memory to be "disaggregated" across a system ensemble. This remote memory blade can be used for memory capacity expansion to improve performance and for sharing memory across servers to reduce provisioning and power costs. We use this memory blade building block to propose two new system architecture solutions—(1) page-swapped remote memory at the virtualization layer, and (2) block-access remote memory with support in the coherence hardware—that enable transparent memory expansion and sharing on commodity-based systems. Using simulations of a mix of enterprise benchmarks supplemented with traces from live datacenters, we demonstrate that memory disaggregation can provide substantial performance benefits (on average 10X) in memory constrained environments, while the sharing enabled by our solutions can improve performance-per-dollar by up to 87% when optimizing memory provisioning across multiple servers.

Categories and Subject Descriptors
C.0 [Computer System Designs]: General – system architectures; B.3.2 [Memory Structures]: Design Styles – primary memory, virtual memory.

General Terms
Design, Management, Performance.

Keywords
Memory capacity expansion, disaggregated memory, power and cost efficiencies, memory blades.

1. INTRODUCTION
Recent trends point to the likely emergence of a new memory wall—one of memory capacity—for future commodity systems. On the demand side, current trends point to increased number of cores per socket, with some studies predicting a two-fold increase every two years [1]. Concurrently, we are likely to see an increased number of virtual machines (VMs) per core (VMware quotes 2-4X memory requirements from VM consolidation every generation [2]), and increased memory footprint per VM (e.g., the footprint of Microsoft® Windows® has been growing faster than Moore's Law [3]). However, from a supply point of view, the International Technology Roadmap for Semiconductors (ITRS) estimates that the pin count at a socket level is likely to remain constant [4]. As a result, the number of channels per socket is expected to be near-constant. In addition, the rate of growth in DIMM density is starting to wane (2X every three years versus 2X every two years), and the DIMM count per channel is declining (e.g., two DIMMs per channel on DDR3 versus eight for DDR) [5]. Figure 1(a) aggregates these trends to show historical and extrapolated increases in processor computation and associated memory capacity. The processor line shows the projected trend of cores per socket, while the DRAM line shows the projected trend of capacity per socket, given DRAM density growth and DIMM per channel decline. If the trends continue, the growing imbalance between supply and demand may lead to memory capacity per core dropping by 30% every two years, particularly for commodity solutions. If not addressed, future systems are likely to be performance-limited by inadequate memory capacity.

At the same time, several studies show that the contribution of memory to the total costs and power consumption of future systems is trending higher from its current value of about 25% [6][7][8]. Recent trends point to an interesting opportunity to address these challenges—namely that of optimizing for the ensemble [9]. For example, several studies have shown that there is significant temporal variation in how resources like CPU time or power are used across applications. We can expect similar trends in memory usage based on variations across application types, workload inputs, data characteristics, and traffic patterns. Figure 1(b) shows how the memory allocated by TPC-H queries can vary dramatically, and Figure 1(c) presents an eye-chart illustration of the time-varying memory usage of 10 randomly-chosen servers from a 1,000-CPU cluster used to render a recent animated feature film [10]. Each line illustrates a server's memory usage varying from a low baseline when idle to the peak memory usage of the application. Rather than provision each system for its worst-case memory usage, a solution that provisions for the
Figure 1: (a) Trends leading toward the memory capacity wall: relative cores per socket (#Core) versus DRAM capacity per socket, 2004-2017. (b) Memory allocated by TPC-H queries Q1-Q22. (c) Time-varying memory usage of 10 servers from an animation rendering cluster.

In this paper, we propose a new architectural building block to provide transparent memory expansion and sharing for commodity-based designs. Specifically, we revisit traditional memory designs in which memory modules are co-located with processors on a system board, restricting the configuration and scalability of both compute and memory resources. Instead, we argue for a disaggregated design in which a separate memory blade provides capacity that can be shared across the ensemble, and we propose two new system architectures to achieve transparent expansion and sharing. Our first solution requires no changes to existing system hardware, using support at the virtualization layer to provide page-level access to a memory blade across the standard PCI Express® (PCIe®) interface. Our second solution proposes minimal hardware support on every compute blade, but provides finer-grained access to a memory blade across a coherent network fabric for commodity software stacks. We demonstrate the validity of our approach through simulations of a mix of enterprise benchmarks supplemented with traces from live datacenters.
Hardware shared-memory systems typically require specialized interconnects and non-commodity components that add costs; in addition, signaling, electrical, and design complexity increase rapidly with system size. Software DSMs [24][25][26][27] can avoid these costs by managing the operations to send, receive, and maintain coherence in software, but come with practical limitations to functionality, generality, software transparency, total costs, and performance [28]. A recent commercial design in this space, Versatile SMP [29], uses a virtualization layer to chain together commodity x86 servers to provide the illusion of a single larger system, but the current design requires specialized motherboards, I/O devices, and non-commodity networking, and there is limited documentation on performance benefits, particularly with respect to software DSMs.

To increase the compute-to-memory ratio directly, researchers have proposed compressing memory contents [30][31] or augmenting/replacing conventional DRAM with alternative devices or interfaces. Recent startups like Virident [32] and Texas Memory [33] propose the use of solid-state storage, such as NAND Flash, to improve memory density, albeit with higher access latencies than conventional DRAM. From a technology perspective, fully-buffered DIMMs [34] have the potential to increase memory capacity but with significant trade-offs in power consumption. 3D die-stacking [35] allows DRAM to be placed on-chip as different layers of silicon; in addition to the open architectural issues on how to organize 3D-stacked main memory, this approach further constrains the extensibility of memory capacity. Phase change memory (PCM) is emerging as a promising alternative to increase memory density. However, current PCM devices suffer from several drawbacks that limit their straightforward use as a main memory replacement, including high energy requirements, slow write latencies, and finite endurance. In contrast to our work, none of these approaches enable memory capacity sharing across nodes. In addition, many of these alternatives provide only a one-time improvement, thus delaying but failing to fundamentally address the memory capacity wall.

A recent study [36] demonstrates the viability of a two-level memory organization that can tolerate increased access latency due to compression, heterogeneity, or network access to second-level memory. However, that study does not discuss a commodity implementation for x86 architectures or evaluate sharing across systems. Our prior work [8] employs a variant of this two-level memory organization as part of a broader demonstration of how multiple techniques, including the choice of processors, new packaging design, and use of Flash-based storage, can help improve performance in warehouse computing environments. The present paper follows up on our prior work by: (1) extending the two-level memory design to support x86 commodity servers; (2) presenting two new system architectures for accessing the remote memory; and (3) evaluating the designs on a broad range of workloads and real-world datacenter utilization traces.

As is evident from this discussion, there is currently no single architectural approach that simultaneously addresses memory-to-compute-capacity expansion and memory capacity sharing, and does it in an application/OS-transparent manner on commodity-based hardware and software. The next section describes our approach to define such an architecture.

3. DISAGGREGATED MEMORY
Our approach is based on four observations: (1) The emergence of blade servers with fast shared communication fabrics in the enclosure enables separate blades to share resources across the ensemble. (2) Virtualization provides a level of indirection that can enable OS-and-application-transparent memory capacity changes on demand. (3) Market trends towards commodity-based solutions require special-purpose support to be limited to the non-volume components of the solution. (4) The footprints of enterprise workloads vary across applications and over time; but current approaches to memory system design fail to leverage these variations, resorting instead to worst-case provisioning.

Given these observations, our approach argues for a re-examination of conventional designs that co-locate memory DIMMs in conjunction with computation resources, connected through conventional memory interfaces and controlled through on-chip memory controllers. Instead, we argue for a disaggregated multi-level design where we provision an additional separate memory blade, connected at the I/O or communication bus. This memory blade comprises arrays of commodity memory modules assembled to maximize density and cost-effectiveness, and provides extra memory capacity that can be allocated on-demand to individual compute blades. We first detail the design of a memory blade (Section 3.1), and then discuss system architectures that can leverage this component for transparent memory extension and sharing (Section 3.2).

3.1 Memory Blade Architecture
Figure 2(a) illustrates the design of our memory blade. The memory blade comprises a protocol engine to interface with the blade enclosure's I/O backplane interconnect, a custom memory-controller ASIC (or a light-weight CPU), and one or more channels of commodity DIMM modules connected via on-board repeater buffers or alternate fan-out techniques. The memory controller handles requests from client blades to read and write memory, and to manage capacity allocation and address mapping. Optional memory-side accelerators can be added to support functions like compression and encryption.

Although the memory blade itself includes custom hardware, it requires no changes to volume blade-server designs, as it connects through standard I/O interfaces. Its costs are amortized over the entire server ensemble. The memory blade design is straightforward compared to a typical server blade, as it does not have the cooling challenges of a high-performance CPU and does not require local disk, Ethernet capability, or other elements (e.g., management processor, SuperIO, etc.). Client access latency is dominated by the enclosure interconnect, which allows the memory blade's DRAM subsystem to be optimized for power and capacity efficiency rather than latency. For example, the controller can aggressively place DRAM pages into active power-down mode, and can map consecutive cache blocks into a single memory bank to minimize the number of active devices at the expense of reduced single-client bandwidth. A memory blade can also serve as a vehicle for integrating alternative memory technologies, such as Flash or phase-change memory, possibly in a heterogeneous combination with DRAM, without requiring modification to the compute blades.

To provide protection and isolation among shared clients, the memory controller translates each memory address accessed by a
client blade into an address local to the memory blade, called the Remote Machine Memory Address (RMMA). In our design, each client manages both local and remote physical memory within a single System Memory Address (SMA) space. Local physical memory resides at the bottom of this space, with remote memory mapped at higher addresses. For example, if a blade has 2 GB of local DRAM and has been assigned 6 GB of remote capacity, its total SMA space extends from 0 to 8 GB. Each blade's remote SMA space is mapped to a disjoint portion of the RMMA space. This process is illustrated in Figure 2(b). We manage the blade's memory in large chunks (e.g., 16 MB) so that the entire mapping table can be kept in SRAM on the memory blade's controller. For example, a 512 GB memory blade managed in 16 MB chunks requires only a 32K-entry mapping table. Using these "superpage" mappings avoids complex, high-latency DRAM page table data structures and custom TLB hardware. Note that providing shared-memory communications among client blades (as in distributed shared memory) is beyond the scope of this paper.

Figure 2: Design of the memory blade. (a) The memory blade connects to the compute blades via the enclosure backplane, and comprises a protocol agent, a memory controller with address-mapping logic, optional accelerators, and DIMMs (data, dirty, ECC). (b) The data structures that support memory access and allocation/revocation operations: per-client base/limit map registers, superpage/offset address decomposition by blade ID, RMMA maps with permissions, and an RMMA free list.
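To make the mapping concrete, the sketch below shows one plausible software model of the translation and permission check implied by Figure 2(b)'s per-client base/limit map registers and superpage map. The structure layouts, field names, and 16-client limit are illustrative assumptions for this sketch, not the paper's hardware design.

    /* Illustrative sketch of the memory blade's SMA-to-RMMA translation.
     * Superpages are 16 MB, so even a 512 GB blade needs only
     * 512 GB / 16 MB = 32K map entries, small enough for controller SRAM.
     * All names and layouts here are assumptions made for this sketch. */
    #include <stdint.h>
    #include <stdbool.h>
    #include <stdio.h>

    #define SUPERPAGE_BITS  24                     /* 16 MB superpages       */
    #define SUPERPAGE_SIZE  (1u << SUPERPAGE_BITS)
    #define MAP_ENTRIES     (32 * 1024)            /* 512 GB / 16 MB         */

    typedef struct {
        uint32_t rmma_superpage;   /* superpage number on the memory blade  */
        uint8_t  perm;             /* bit0 = read, bit1 = write (assumed)   */
        bool     valid;
    } map_entry_t;

    typedef struct {               /* per-client map registers, Fig. 2(b)   */
        uint32_t base;             /* first map entry owned by this client  */
        uint32_t limit;            /* number of superpages allocated to it  */
    } client_regs_t;

    static map_entry_t   rmma_map[MAP_ENTRIES];    /* kept in SRAM          */
    static client_regs_t client[16];               /* one per compute blade */

    /* Translate (blade_id, offset within the blade's remote SMA region)
     * into an RMMA, with a permission check. */
    static bool translate(unsigned blade_id, uint64_t remote_sma_off,
                          uint8_t required_perm, uint64_t *rmma)
    {
        uint32_t spage  = (uint32_t)(remote_sma_off >> SUPERPAGE_BITS);
        uint32_t offset = (uint32_t)(remote_sma_off & (SUPERPAGE_SIZE - 1));
        client_regs_t *c = &client[blade_id];

        if (spage >= c->limit)
            return false;                          /* outside allocation    */
        map_entry_t *e = &rmma_map[c->base + spage];
        if (!e->valid || (e->perm & required_perm) != required_perm)
            return false;                          /* unmapped/no permission */
        *rmma = ((uint64_t)e->rmma_superpage << SUPERPAGE_BITS) | offset;
        return true;
    }

    int main(void)
    {
        /* Example: blade 2 owns map entries [100, 100+384), i.e. 6 GB. */
        client[2] = (client_regs_t){ .base = 100, .limit = 384 };
        rmma_map[100] = (map_entry_t){ .rmma_superpage = 7, .perm = 0x3,
                                       .valid = true };
        uint64_t rmma;
        if (translate(2, 0x123456, /*read*/ 0x1, &rmma))
            printf("RMMA = 0x%llx\n", (unsigned long long)rmma);
        return 0;
    }

Because the map is indexed by superpage rather than by 4 KB page, the whole table fits in on-controller SRAM, which is what lets the design avoid DRAM-resident page tables and custom TLB hardware.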
Allocation and revocation: The memory blade's total capacity is partitioned among the connected clients through the cooperation of the virtual machine monitors (VMMs) running on the clients, in conjunction with enclosure-, rack-, or datacenter-level management software. The VMMs in turn are responsible for allocating remote memory among the virtual machine(s) (VMs) running on each client system. The selection of capacity allocation policies, both among blades in an enclosure and among VMs on a blade, is a broad topic that deserves separate study. Here we restrict our discussion to designing the mechanisms for allocation and revocation.

Allocation is straightforward: privileged management software on the memory blade assigns one or more unused memory blade superpages to a client, and sets up a mapping from the chosen blade ID and SMA range to the appropriate RMMA range.

In the case where there are no unused superpages, some existing mapping must be revoked so that memory can be reallocated. We assume that capacity reallocation is a rare event compared to the frequency of accessing memory using reads and writes. Consequently, our design focuses primarily on correctness and transparency and not performance.

When a client is allocated memory on a fully subscribed memory blade, management software first decides which other clients must give up capacity, then notifies the VMMs on those clients of the amount of remote memory they must release. We propose two general approaches for freeing pages. First, most VMMs already provide paging support to allow a set of VMs to oversubscribe local memory. This paging mechanism can be invoked to evict local or remote pages. When a remote page is to be swapped out, it is first transferred temporarily to an empty local frame and then paged to disk. The remote page freed by this transfer is released for reassignment.

Alternatively, many VMMs provide a "balloon driver" [37] within the guest OS to allocate and pin memory pages, which are then returned to the VMM. The balloon driver increases memory pressure within the guest OS, forcing it to select pages for eviction. This approach generally provides better results than the VMM's paging mechanisms, as the guest OS can make a more informed decision about which pages to swap out and may simply discard clean pages without writing them to disk. Because the newly freed physical pages can be dispersed across both the local and remote SMA ranges, the VMM may need to relocate pages within the SMA space to free a contiguous 16 MB remote superpage.

Once the VMMs have released their remote pages, the memory blade mapping tables may be updated to reflect the new allocation. We assume that the VMMs can generally be trusted to release memory on request; the unlikely failure of a VMM to release memory promptly indicates a serious error and can be resolved by rebooting the client blade.
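The end-to-end flow can be summarized as a small piece of control logic. The sketch below is only a schematic of the sequence described above (pick victims, have their VMMs release superpages by paging or ballooning, then grant and map the freed capacity); the function names and the victim-selection policy are hypothetical stand-ins, not the actual management software.

    /* Simplified sketch of memory-blade capacity allocation/revocation.
     * choose_victim(), vmm_release_remote(), and map_superpages() are
     * hypothetical hooks into management software and the client VMMs. */
    #include <stdio.h>

    #define NUM_CLIENTS      4
    #define TOTAL_SUPERPAGES 2048              /* e.g., 32 GB / 16 MB */

    static int free_superpages = TOTAL_SUPERPAGES;
    static int allocated[NUM_CLIENTS];         /* superpages per client */

    /* Ask a client's VMM to give back n superpages.  The VMM frees them by
     * paging VM memory to disk or by inflating a balloon driver in the
     * guests, then the blade can reuse the range. */
    static void vmm_release_remote(int c, int n)
    {
        printf("client %d releases %d superpages\n", c, n);
        allocated[c]    -= n;
        free_superpages += n;
    }

    /* Placeholder policy: revoke from the client holding the most remote
     * memory.  Real allocation policies are a separate study. */
    static int choose_victim(void)
    {
        int victim = 0;
        for (int c = 1; c < NUM_CLIENTS; c++)
            if (allocated[c] > allocated[victim])
                victim = c;
        return victim;
    }

    /* Set up blade-ID/SMA-to-RMMA mappings for newly granted superpages. */
    static void map_superpages(int c, int n)
    {
        printf("client %d granted %d superpages\n", c, n);
        allocated[c]    += n;
        free_superpages -= n;
    }

    static void allocate(int c, int n)
    {
        while (free_superpages < n)    /* fully subscribed: revoke first */
            vmm_release_remote(choose_victim(), n - free_superpages);
        map_superpages(c, n);
    }

    int main(void)
    {
        allocate(0, 1500);
        allocate(1, 1000);             /* forces revocation from client 0 */
        printf("free superpages: %d\n", free_superpages);
        return 0;
    }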
3.2 System Architecture with Memory Blades
Whereas our memory-blade design enables several alternative system architectures, we discuss two specific designs, one based on page swapping and another using fine-grained remote access. In addition to providing more detailed examples, these designs also illustrate some of the tradeoffs in the multi-dimensional design space for memory blades. Most importantly, they compare the method and granularity of access to the remote blade (page-based versus block-based) and the interconnect fabric used for communication (PCI Express versus HyperTransport).

3.2.1 Page-Swapping Remote Memory (PS)
Our first design avoids any hardware changes to the high-volume compute blades or enclosure; the memory blade itself is the only non-standard component. This constraint implies a conventional I/O backplane interconnect, typically PCIe. This basic design is illustrated in Figure 3(a).

Because CPUs in a conventional system cannot access cacheable memory across a PCIe connection, the system must bring remote memory locations into the client blade's local physical memory before they
can be accessed. We leverage standard virtual-memory mechanisms to detect accesses to remote memory and relocate the targeted locations to local memory on a page granularity. In addition to enabling the use of virtual memory support, page-based transfers exploit locality in the client's access stream and amortize the overhead of PCIe memory transfers.

To avoid modifications to application and OS software, we implement this page management in the VMM. The VMM detects accesses to remote data pages and swaps those data pages to local memory before allowing a load or store to proceed.

Figure 3: Page-swapping remote memory system design. (a) No changes are required to compute servers and networking on existing blade designs. Our solution adds minor modules (shaded block) to the virtualization layer. (b) The address mapping design places the extended capacity at the top of the address space.

In our design, we assume page swapping is performed on a 4 KB granularity, a common page size used by operating systems. Page swaps logically appear to the VMM as a swap from high SMA addresses (beyond the end of local memory) to low addresses (within local memory). To decouple the swap of a remote page to local memory and eviction of a local page to remote memory, we maintain a pool of free local pages for incoming swaps. The software fault handler thus allocates a page from the local free list and initiates a DMA transfer over the PCIe channel from the remote memory blade. The transfer is performed synchronously (i.e., the execution thread is stalled during the transfer, but other threads may execute). Once the transfer is complete, the fault handler updates the page table entry to point to the new, local SMA address and puts the prior remote SMA address into a pool of remote addresses that are currently unused.

To maintain an adequate supply of free local pages, the VMM must occasionally evict local pages to remote memory, effectively performing the second half of the logical swap operation. The VMM selects a high SMA address from the remote page free list and initiates a DMA transfer from a local page to the remote memory blade. When complete, the local page is unmapped and placed on the local free list. Eviction operations are performed asynchronously, and do not stall the CPU unless a conflicting access to the in-flight page occurs during eviction.
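As a concrete illustration of these mechanics, the sketch below shows the shape of a remote-page fault handler and the background eviction path. It is a schematic under our own assumptions: the DMA helpers, free-list functions, and page-table update are hypothetical stand-ins for the corresponding hypervisor facilities, not an actual VMM implementation.

    /* Sketch of the PS design's page-swap paths inside the VMM.
     * Remote pages live at high SMA addresses, local pages at low ones.
     * All helper names below are hypothetical. */
    #include <stdint.h>
    #include <stdio.h>

    #define PAGE_SIZE 4096u
    typedef uint64_t sma_t;

    /* ---- stand-ins for hypervisor facilities --------------------------- */
    static sma_t pop_local_free(void)      { static sma_t p = 0x1000;
                                             return p += PAGE_SIZE; }
    static sma_t pop_remote_free(void)     { static sma_t p = 0x80000000u;
                                             return p += PAGE_SIZE; }
    static void  push_local_free(sma_t p)  { printf("local 0x%llx freed\n",
                                             (unsigned long long)p); }
    static void  push_remote_free(sma_t p) { printf("remote 0x%llx freed\n",
                                             (unsigned long long)p); }
    static void  dma_page(sma_t src, sma_t dst)
    { printf("DMA 0x%llx -> 0x%llx\n",
             (unsigned long long)src, (unsigned long long)dst); }
    static void  remap_pte(sma_t old_sma, sma_t new_sma)
    { printf("PTE 0x%llx -> 0x%llx\n",
             (unsigned long long)old_sma, (unsigned long long)new_sma); }

    /* Fault path: a guest access touched a remote SMA page.  The faulting
     * thread stalls while the 4 KB page is brought in synchronously. */
    static void remote_page_fault(sma_t remote_sma)
    {
        sma_t local = pop_local_free();  /* pre-reserved free-frame pool   */
        dma_page(remote_sma, local);     /* synchronous PCIe transfer      */
        remap_pte(remote_sma, local);    /* later accesses hit locally     */
        push_remote_free(remote_sma);    /* remote frame becomes reusable  */
    }

    /* Background path: keep the local free pool stocked by evicting a
     * local victim to remote memory; runs asynchronously to the guest. */
    static void evict_local_page(sma_t victim_local)
    {
        sma_t remote = pop_remote_free();
        dma_page(victim_local, remote);  /* async unless a conflicting
                                            access occurs mid-flight      */
        remap_pte(victim_local, remote); /* marked not-present so a later
                                            access traps back to the VMM  */
        push_local_free(victim_local);
    }

    int main(void)
    {
        remote_page_fault(0x80001000u);
        evict_local_page(0x2000u);
        return 0;
    }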
3.2.2 Fine-Grained Remote Memory Access (FGRA)
The previous solution avoids any hardware changes to the commodity compute blade, but at the expense of trapping to the VMM and transferring full pages on every remote memory access. In our second approach, we examine the effect of a few minimal hardware changes to the high-volume compute blade to enable an alternate design that has higher performance potential. In particular, this design allows CPUs on the compute blade to access remote memory directly at cache-block granularity.
FGRA adds a filter to the compute blade's interface to the coherent interconnect. First, unlike conventional shared-memory multiprocessors, the filter ensures that the memory blade is a home agent but not a cache agent. Second, the filter can optionally translate coherence messages destined for the memory blade into an alternate format. For example, HyperTransport-protocol read and write requests can be translated into generic PCIe commands, allowing the use of commodity backplanes and decoupling the memory blade from specific cache-coherence protocols and processor technologies.
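The translation step can be pictured as follows. This is only an illustrative sketch: the request and command structures are invented for the example and deliberately abstract away the real HyperTransport and PCIe packet layouts, and the local/remote SMA boundary is an assumed constant.

    /* Illustrative sketch of the filter turning a coherent read/write aimed
     * at remote SMA space into a generic I/O command for the backplane.
     * Formats here are invented; real HT/PCIe encodings differ. */
    #include <stdint.h>
    #include <stdbool.h>
    #include <stdio.h>

    #define CACHE_BLOCK   64u                /* assumed cache-block size   */
    #define LOCAL_SMA_TOP 0x80000000ull      /* assumed start of remote SMA */

    typedef struct { bool is_write; uint64_t sma; } coh_req_t;   /* from CPU */
    typedef struct { bool is_write; uint64_t remote_off;
                     uint32_t len; } io_cmd_t;                   /* to blade */

    /* Returns true if the request targets the memory blade, filling in a
     * generic command that carries only the offset within this blade's
     * remote region (the blade itself maps it to an RMMA). */
    static bool filter_translate(const coh_req_t *req, io_cmd_t *cmd)
    {
        if (req->sma < LOCAL_SMA_TOP)
            return false;                    /* local DRAM: handled on-blade */
        cmd->is_write   = req->is_write;
        cmd->remote_off = req->sma - LOCAL_SMA_TOP;
        cmd->len        = CACHE_BLOCK;
        return true;
    }

    int main(void)
    {
        coh_req_t req = { .is_write = false, .sma = 0x80000040ull };
        io_cmd_t  cmd;
        if (filter_translate(&req, &cmd))
            printf("read %u B at remote offset 0x%llx\n",
                   cmd.len, (unsigned long long)cmd.remote_off);
        return 0;
    }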
Because this design allows the remote SMA space to be accessed directly by CPUs, VMM support is not required; an unmodified OS can treat both local and remote addresses uniformly. However, a VMM or additional OS support is required to enable dynamic allocation or revocation of remote memory. Performance can also potentially be improved by migrating the most frequently accessed remote pages into local memory, swapping them with infrequently accessed local pages—a task that could be performed by a VMM or by extending the NUMA support available in many OSes.

4. EVALUATION
4.1 Methodology
We compare the performance of our memory-blade designs to a conventional system primarily via trace-based simulation. Using traces rather than a detailed execution-driven CPU model makes it practical to process the billions of main-memory references needed to exercise a multi-gigabyte memory system. Although we forgo the ability to model overlap between processor execution and remote memory accesses with our trace-based simulations, our memory reference traces are collected from a simulator that does model overlap of local memory accesses. Additionally, we expect overlap for remote accesses to be negligible due to the relatively high latencies to our remote memory blade.

We collected memory reference traces from a detailed full-system simulator, used and validated in prior studies [39], modified to record the physical address, CPU ID, timestamp and read/write status of all main-memory accesses. To make it feasible to run the workloads to completion, we use a lightweight CPU model for this simulation. (Each simulation still took 1-2 weeks to complete.) The simulated system has four 2.2 GHz cores, with per-core dedicated 64KB L1 and 512 KB L2 caches, and a 2 MB L3 shared cache.

The common simulation parameters for our remote memory blade are listed in Table 1. For the baseline PS, we assume that the memory blade interconnect has a latency of 120 ns and bandwidth of 1 GB/s (each direction), based loosely on a PCIe 2.0 x2 channel. For the baseline FGRA, we assume a more aggressive channel, e.g., based on HyperTransport™ or a similar technology, with 60 ns latency and 4 GB/s bandwidth. Additionally, for PS, each access to remote memory results in a trap to the VMM, and VMM software must initiate the page transfer. Based on prior work [40], we assume a total of 330 ns (roughly 1,000 cycles on a 3 GHz processor) for this software overhead, including the trap itself, updating page tables, TLB shootdown, and generating the request message to the memory blade. All of our simulated systems are modeled with a hard drive with 8 ms access latency and 50 MB/s sustained bandwidth. We perform initial data placement using a first-touch allocation policy.

We validated our model on a real machine to measure the impact of reducing the physical memory allocation in a conventional server. We use an HP c-Class BL465c server with 2.2GHz AMD Opteron 2354 processors and 8 GB of DDR2-800 DRAM. To model a system with less DRAM capacity, we force the Linux kernel to reduce physical memory capacity using a boot-time kernel parameter.

The workloads used to evaluate our designs include a mixture of Web 2.0-based benchmarks (nutch, indexer), traditional server benchmarks (pgbench, TPC-H, SPECjbb®2005), and traditional computational benchmarks (SPEC® CPU2006 – zeusmp, gcc, perl, bwaves, mcf). Additionally, we developed a multi-programmed workload, spec4p, by combining the traces from zeusmp, gcc, perl, and mcf. Spec4p offers insight into multiple workloads sharing a single server's link to the memory blade. Table 1 describes these workloads in more detail. We further broadly classify the workloads into three groups—low, medium, and high—based on their memory footprint sizes.
Table 1. Memory blade parameters, workloads, and real-world traces.

Memory blade parameters:
  DRAM latency: 120 ns        Map table access: 5 ns      Request packet processing: 60 ns
  DRAM bandwidth: 6.4 GB/s    Transfer page size: 4 KB    Response packet processing: 60 ns

Workloads (footprint size):
  SPEC CPU2006 - 5 large-memory benchmarks: zeusmp, perl, gcc, bwaves, and mcf, as well as a combination of four of them, spec4p. Low (zeusmp, gcc, perl, bwaves), Medium (mcf), High (spec4p)
  nutch4p - Nutch 0.9.1 search engine with Resin and Sun JDK 1.6.0, 5 GB index hosted on tempfs. Medium
  tpchmix - TPC-H running on MySQL 5.0 with a scaling factor of 1; 2 copies of query 17 mixed with query 1 and query 3 (representing balanced, scan, and join heavy queries). Medium
  pgbench - TPC-B-like benchmark running PostgreSQL 8.3 with pgbench and a scaling factor of 100. High
  indexer - Nutch 0.9.1 indexer, Sun JDK 1.6.0 and HDFS hosted on one hard drive. High
  SPECjbb - 4 copies of SPECjbb 2005, each with 16 warehouses, using Sun JDK 1.6.0. High

Real-world traces:
  Animation - Resource utilization traces collected on 500+ animation rendering servers over a year, 1-second sample interval. We present data from traces from a group of 16 representative servers.
  VM consolidation - VM consolidation traces of 16 servers based on enterprise and web2.0 workloads, maximum resource usage reported every 10-minute interval.
  Web2.0 - Resource utilization traces collected on 290 servers from a web2.0 company; we use sar traces with 1-second sample interval for 16 representative servers.
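To give a feel for how these parameters combine, the short calculation below composes them into rough per-access costs for PS (a VMM trap plus a 4 KB page transfer) and FGRA (a single cache-block access). The way the terms are summed and the 64-byte block size are our own simplifying assumptions for intuition only; the evaluation itself uses trace-driven simulation, not this closed-form model.

    /* Back-of-the-envelope composition of the Table 1 / Section 4.1
     * parameters.  A simplification for intuition, not the simulator. */
    #include <stdio.h>

    int main(void)
    {
        /* Shared memory-blade parameters (Table 1). */
        const double dram_ns = 120, map_ns = 5, req_pkt_ns = 60, rsp_pkt_ns = 60;

        /* PS baseline: 120 ns / 1 GB/s PCIe-like link, 330 ns VMM
         * overhead, 4 KB page transfer (1 GB/s == 1 byte per ns). */
        const double ps_link_ns = 120, ps_bw_gbps = 1.0, ps_trap_ns = 330;
        const double ps_xfer_ns = 4096 / ps_bw_gbps;           /* ~4096 ns */
        const double ps_fault_ns = ps_trap_ns + req_pkt_ns + ps_link_ns
                                 + map_ns + dram_ns + rsp_pkt_ns + ps_xfer_ns;

        /* FGRA baseline: 60 ns / 4 GB/s coherent link, 64 B block (assumed). */
        const double fg_link_ns = 60, fg_bw_gbps = 4.0;
        const double fg_xfer_ns = 64 / fg_bw_gbps;             /* 16 ns    */
        const double fg_access_ns = req_pkt_ns + fg_link_ns + map_ns
                                  + dram_ns + rsp_pkt_ns + fg_xfer_ns;

        /* Disk paging, for comparison: 8 ms seek + 4 KB at 50 MB/s. */
        const double disk_ns = 8e6 + 4096 / 0.05;

        printf("PS   first touch of a remote page: ~%.0f ns "
               "(subsequent accesses run at local-DRAM speed)\n", ps_fault_ns);
        printf("FGRA each remote block access:     ~%.0f ns\n", fg_access_ns);
        printf("Disk page fault, for comparison:   ~%.0f ns\n", disk_ns);
        return 0;
    }

Even under this crude composition, either remote-memory path is roughly three orders of magnitude faster than paging to disk, which is the intuition behind the capacity-expansion results that follow.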
The low group consists of benchmarks whose footprint is less than 1 GB, medium ranges from 1 GB to 1.75 GB, and high includes those with footprints between 1.75GB and 3GB. In addition to these workloads, we have also collected traces of memory usage in three real-world, large-scale datacenter environments. These environments include Animation, VM consolidation, and web2.0, and are described in Table 1. These traces were each gathered for over a month across a large number of servers and are used to guide our selection of workloads to mimic the time-varying memory requirements of applications seen in real-world environments.

To quantify the cost benefits of our design, we developed a cost model for our disaggregated memory solutions and the baseline servers against which we compare. Because our designs target the memory system, we present data specific to the memory system. We gathered price data from public and industry sources for as many components as possible. For components not available off the shelf, such as our remote memory blade controller, we estimate a cost range. We further include power and cooling costs, given a typical 3-year server lifespan. We used DRAM power calculators to evaluate the power consumption of DDR2 devices. Estimates for the memory contributions towards power and cooling are calculated using the same methodology as in [8].

Figure 5: Capacity expansion results. (a) and (b) show the performance improvement for our two designs over memory-capacity-constrained baselines (speedup of PS and FGRA over M-app-75% and over M-median provisioning, respectively); (c) shows performance and costs relative to worst-case (M-max) provisioning.

4.2 Results
4.2.1 Memory expansion for individual benchmarks
We first focus on the applicability of memory disaggregation to address the memory capacity wall for individual benchmarks. To illustrate scenarios where applications run into memory capacity limitations due to a core-to-memory ratio imbalance, we perform an experiment where we run each of our benchmarks on a baseline system with only 75% of that benchmark's memory footprint (M-app-75%). The baseline system must swap pages to disk to accommodate the full footprint of the workload. We compare these with our two disaggregated-memory architectures, PS and FGRA. In these cases, the compute nodes continue to have local DRAM capacity corresponding to only 75% of the benchmark's memory footprint, but have the ability to exploit capacity from a remote memory blade. We assume 32GB of memory on the memory blade, which is sufficient to fit any application's footprint. Figure 5(a) summarizes the speedup for the PS and FGRA designs relative to the baseline. Both of our new solutions achieve significant improvements, ranging from 4X to 320X higher performance. These improvements stem from the much lower latency of our remote memory solutions compared to OS-based disk paging. In particular, zeusmp, bwaves, mcf, specjbb, and spec4p show the highest benefits due to their large working sets.

Interestingly, we also observe that PS outperforms FGRA in this experiment, despite our expectations for FGRA to achieve better performance due to its lower access latency. Further investigation reveals that the page swapping policy in PS, which transfers pages from remote memory to local memory upon access, accounts for its performance advantage. Under PS, although the initial access to a remote memory location incurs a high latency due to the VMM trap and the 4 KB page transfer over the slower PCIe interconnect, subsequent accesses to that address consequently incur only local-memory latencies. The FGRA design, though it has lower remote latencies compared to PS, continues to incur these latencies for every access to a frequently used remote location. Nevertheless, FGRA still outperforms the baseline. We examine the addition of page swapping to FGRA in Section 4.2.5.

Figure 5(b) considers an alternate baseline where the compute server memory is set to approximate the median-case memory footprint requirements across our benchmarks (M-median = 1.5GB). This baseline models a realistic scenario where the server
is provisioned for the common-case workload, but can still see a mix of different workloads. Figure 5(b) shows that our proposed solutions now achieve performance improvements only for benchmarks with high memory footprints. For other benchmarks, the remote memory blade is unused, and does not provide any benefit. More importantly, it does not cause any slowdown.

Finally, Figure 5(c) considers a baseline where the server memory is provisioned for the worst-case application footprint (M-max = 4GB). This baseline models many current datacenter scenarios where servers are provisioned in anticipation of the worst-case load, either across workloads or across time. We configure our memory disaggregation solutions as in the previous experiment, with M-median provisioned per-blade and additional capacity in the remote blade. Our results show that, for workloads with small footprints, our new solutions perform comparably. For workloads with larger footprints, going to remote memory causes a slowdown compared to local memory; however, PS provides comparable performance in some large-footprint workloads (pgbench, indexer), and on the remaining workloads its performance is still within 30% of M-max. As before, FGRA loses performance as it does not exploit locality patterns to ensure most accesses go to local memory.

4.2.2 Power and cost analysis
Using the methodology described in 4.1, we estimate the memory power draw of our baseline M-median system as 10 W, and our M-max system as 21 W. To determine the power draw of our disaggregated memory solutions, we assume local memory provisioned for median capacity requirements (as in M-median) and a memory blade with 32 GB shared by 16 servers. Furthermore, because the memory blade can tolerate increased DRAM access latency, we assume it aggressively employs DRAM low-power sleep modes. For a 16-server ensemble, we estimate the amortized per-server memory power of the disaggregated solution (including all local and remote memory and the memory blade interface hardware, such as its controller, and I/O connections) at 15 W.

Figure 6 illustrates the cost impact of the custom designed memory blade, showing the changes in the average performance-per-memory cost improvement over the baseline M-max system as memory blade cost varies. To put the memory blade cost into context with the memory subsystem, the cost is calculated as a percentage of the total remote DRAM costs (memory blade cost divided by remote DRAM costs), using 32 GB of remote memory. Note that for clarity, the cost range on the horizontal axis refers only to the memory blade interface/packaging hardware excluding DRAM costs (the fixed DRAM costs are factored into the results). The hardware cost break-even points for PS and FGRA are high, implying a sufficiently large budget envelope for the memory blade implementation. We expect that the overhead of a realistic implementation of a memory blade could be below 50% of the remote DRAM cost (given current market prices). This overhead can be reduced further by considering higher capacity memory blades; for example, we expect the cost to be below 7% of the remote DRAM cost of a 256 GB memory blade.

Figure 6: Memory blade cost analysis. Average performance-per-memory-dollar improvement over M-max for PS and FGRA versus memory blade cost, expressed as a percentage (0% to 300%) of the total cost of remote DRAM.

4.2.3 Server consolidation
Viewed as a key application for multi-core processors, server consolidation improves hardware resource utilization by hosting multiple virtual machines on a single physical platform. However, memory capacity is often the bottleneck to server consolidation because other resources (e.g., processor and I/O) are easier to multiplex, and the growing imbalance between processor and memory capacities exacerbates the problem. This effect is evident in our real-world web2.0 traces, where processor utilization rates are typically below 30% (rarely over 45%) while more than 80% of memory is allocated, indicating limited consolidation opportunities without memory expansion. To address this issue, current solutions either advocate larger SMP servers for their memory capacity or sophisticated hypervisor memory management policies to reduce workload footprints, but they incur performance penalties, increase costs and complexity, and do not address the fundamental processor-memory imbalance.

Memory disaggregation enables new consolidation opportunities by supporting processor-independent memory expansion. With memory blades to provide the second-level memory capacity, we can reduce each workload's processor-local memory allocation to less than its total footprint (M-max) while still maintaining comparable performance (i.e., <3% slowdown). This workload-specific local vs. remote memory ratio determines how much memory can be freed on a compute server (and shifted onto the memory blade) to allow further consolidation. Unfortunately, it is not possible to experiment in production datacenters to determine these ratios. Instead, we determine the typical range of local-to-remote ratios using our simulated workload suite. We can then use this range to investigate the potential for increased consolidation using resource utilization traces from production systems.

We evaluate the consolidation benefit using the web2.0 workload (CPU, memory and IO resource utilization traces for 200+ servers) and a sophisticated consolidation algorithm similar to that used by Rolia et al. [41]. The algorithm performs multi-dimensional bin packing to minimize the number of servers needed for given resource requirements. We do not consider the other two traces for this experiment. Animation is CPU-bound and runs out of CPU before it runs out of memory, so memory disaggregation does not help. However, as CPU capacity increases in the future, we may likely encounter a similar situation as web2.0. VM consolidation, on the other hand, does run out of memory before it runs out of CPU, but these traces already represent the result of consolidation, and in the absence of information on the prior consolidation policy, it is hard to make a fair determination of the baseline and the additional benefits from memory disaggregation over existing approaches.
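For readers unfamiliar with this class of algorithm, the sketch below shows the basic shape of multi-dimensional bin packing as a greedy first-fit pass over per-VM CPU and memory demands (listed here in decreasing CPU order; a full first-fit-decreasing pass would sort first). It is not the Rolia et al. algorithm [41], and the demand values are fabricated; it only illustrates how a memory-blade-backed increase in effective per-server memory (the mem_cap parameter) changes how many physical servers the packing needs.

    /* Minimal greedy 2-D (CPU, memory) first-fit packing sketch.
     * Workload numbers are fabricated for illustration; the real study
     * uses the web2.0 utilization traces and a more sophisticated packer. */
    #include <stdio.h>

    #define N_VMS 8
    typedef struct { double cpu, mem; } demand_t;

    /* Pack demands into hosts of capacity (1.0 CPU, mem_cap memory);
     * remote-blade capacity effectively raises mem_cap above what the
     * local DIMMs alone would allow. */
    static int pack(const demand_t *d, int n, double mem_cap)
    {
        double host_cpu[N_VMS] = {0}, host_mem[N_VMS] = {0};
        int hosts = 0;
        for (int i = 0; i < n; i++) {
            int h = 0;
            while (h < hosts && (host_cpu[h] + d[i].cpu > 1.0 ||
                                 host_mem[h] + d[i].mem > mem_cap))
                h++;                           /* first host that fits     */
            if (h == hosts) hosts++;           /* otherwise open a new one */
            host_cpu[h] += d[i].cpu;
            host_mem[h] += d[i].mem;
        }
        return hosts;
    }

    int main(void)
    {
        /* CPU fraction and memory fraction per VM (fabricated). */
        demand_t vms[N_VMS] = {
            {0.25, 0.6}, {0.20, 0.5}, {0.15, 0.5}, {0.15, 0.4},
            {0.10, 0.4}, {0.10, 0.3}, {0.05, 0.3}, {0.05, 0.2},
        };
        printf("hosts, local DIMMs only (mem_cap=1.0): %d\n",
               pack(vms, N_VMS, 1.0));
        printf("hosts, with memory blade  (mem_cap=2.0): %d\n",
               pack(vms, N_VMS, 2.0));
        return 0;
    }

In this toy instance the packing is memory-bound at mem_cap=1.0 and CPU-bound at mem_cap=2.0, mirroring the observation above that memory, not CPU, limits consolidation in the web2.0 traces.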
Figure 7: Mixed workload and ensemble results. (a) Hardware reductions from improved VM consolidation made possible by remote memory. (b) Performance-per-dollar as remote memory capacity is varied for the Animation, web2.0, and VM consolidation traces. (c) Slowdown relative to per-blade worst-case provisioning (M-max) at cost-optimal provisioning.

As shown in Figure 7(a), without memory disaggregation, the state-of-the-art algorithm ("Current") achieves only modest hardware reductions (5% processor and 13% memory); limited memory capacity precludes further consolidation. In contrast, page-swapping–based memory disaggregation corrects the time-varying imbalance between VM memory demands and local capacity, allowing a substantial reduction of processor count by a further 68%.

A consolidated server can also be allocated the additional memory it needs from the memory blade. (In a task-scheduling environment, this could be based on prior knowledge of the memory footprint of the new task that will be scheduled.) For the cost of the memory blade, we conservatively estimated the price to be approximately that of a low-end system. We expect this estimate to be conservative because of the limited functionality and hardware requirements of the memory blade versus that of a general purpose server.
Figure 8: Alternate FGRA designs. (a) shows the normalized performance when FGRA is supplemented by NUMA-type optimizations; (b) shows the performance loss from tunneling FGRA accesses over a commodity interconnect.

FGRA's main drawback, namely that every access to a frequently used remote location pays the remote latency (Section 4.2.1), can be addressed by adding page migration to FGRA, similar to existing CC-NUMA optimizations (e.g., Linux's memory placement optimizations [42]). To study the potential impact of this enhancement, we modeled a hypothetical system that tracks page usage and, at 10 ms intervals, swaps the most highly used pages into local memory. Figure 8(a) summarizes the speedup of this system over the base FGRA design for M-median compute blades. For the high-footprint workloads that exhibit the worst performance with FGRA (mcf, SPECjbb, and SPEC4p), page migration achieves 3.3-4.5X improvement, with smaller (5-8%) benefit on other high-footprint workloads. For all workloads, the optimized FGRA performs similarly to, and in a few cases better than, PS. These results motivate further examination of data placement policies for FGRA.
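A minimal version of the page-usage tracking and periodic swap described above might look like the following. The counters, page counts, and helper names are illustrative assumptions for this sketch rather than the modeled system's actual mechanism.

    /* Sketch of NUMA-style hot-page migration layered on FGRA: count
     * accesses to remote pages and, at each interval, swap the hottest
     * remote pages with the coldest local pages. */
    #include <stdio.h>

    #define N_LOCAL  4
    #define N_REMOTE 4

    static unsigned local_hits[N_LOCAL]   = { 5,  2, 90, 40 };
    static unsigned remote_hits[N_REMOTE] = { 70, 1,  3, 55 };

    static void swap_pages(int local_idx, int remote_idx)
    {
        /* Placeholder for the actual copy + remap (cf. the PS swap path). */
        printf("migrate remote page %d <-> local page %d\n",
               remote_idx, local_idx);
        unsigned tmp = local_hits[local_idx];
        local_hits[local_idx]   = remote_hits[remote_idx];
        remote_hits[remote_idx] = tmp;
    }

    /* Invoked every 10 ms in the modeled system. */
    static void migration_tick(void)
    {
        for (;;) {
            int hot = 0, cold = 0;
            for (int r = 1; r < N_REMOTE; r++)
                if (remote_hits[r] > remote_hits[hot]) hot = r;
            for (int l = 1; l < N_LOCAL; l++)
                if (local_hits[l] < local_hits[cold]) cold = l;
            if (remote_hits[hot] <= local_hits[cold])
                break;                   /* nothing left worth migrating */
            swap_pages(cold, hot);
        }
        for (int i = 0; i < N_LOCAL; i++)  local_hits[i]  = 0;
        for (int i = 0; i < N_REMOTE; i++) remote_hits[i] = 0;
    }

    int main(void)
    {
        migration_tick();
        return 0;
    }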
The hardware cost of FGRA can be reduced by using a standard PCIe backplane (as PS does) rather than a coherent interconnect, as discussed in Section 3.2.2. This change incurs a latency and bandwidth penalty as the standardized PCIe interconnect is less aggressive than a more specialized interconnect such as cHT. Figure 8(b) shows the change in performance relative to the baseline FGRA. Performance is comparable, decreasing by at most 20% on the higher memory usage workloads. This performance loss may be acceptable if the cost of extending a high-performance interconnect like cHT across the enclosure backplane is high.

Though not shown here (due to space constraints), we have also studied sensitivity of our results to the VMM overhead and memory latency parameters in Table 1. Our results show no qualitative change to our conclusions.

5. DISCUSSION
Evaluation assumptions. Our evaluation does not model interconnect routing, arbitration, buffering, and QoS management in detail. Provided interconnect utilization is not near saturation, these omissions will not significantly impact transfer latencies. We have confirmed that per-blade interconnect bandwidth consumption falls well below the capabilities of PCIe and HT. However, the number of channels to the memory blade may need to be scaled with the number of supported clients.

Disaggregation also has implications for enterprise system reliability, availability, security, and manageability. From a reliability perspective, dynamic reprovisioning provides an inexpensive means to equip servers with hot-spare DRAM; in the event of a DIMM failure anywhere in the ensemble, memory can be remapped and capacity reassigned to replace the lost DIMM. However, the memory blade also introduces additional failure modes that impact multiple servers. A complete memory-blade failure might impact several blades, but this possibility can be mitigated by adding redundancy to the blade's memory controller. We expect that high availability could be achieved at a relatively low cost, given the controller's limited functionality. To provide security and isolation, our design enforces strict assignment of capacity to specific blades, prohibits sharing, and can optionally erase memory content prior to reallocation to ensure confidentiality. From a manageability perspective, disaggregation allows management software to provision memory capacity across blades, reducing the need to physically relocate DIMMs.

Memory blade scalability and sharing. There are several obvious extensions to our designs. First, to provide memory scaling beyond the limits of a single memory blade, a server ensemble might include multiple memory blades. Second, prior studies of consolidated VMs have shown substantial opportunities to reduce memory requirements via copy-on-write content-based page sharing across VMs [37]. Disaggregated memory offers an even larger scope for sharing content across multiple compute blades. Finally, in some system architectures, subsets of processors/blades share a memory coherence domain, which we might seek to extend via disaggregation.

Synergy with emerging technologies. Disaggregated memory extends the conventional virtual memory hierarchy with a new layer. This layer introduces several possibilities to integrate new technologies into the ensemble memory system that might prove latency- or cost-prohibitive in conventional blade architectures. First, we foresee substantial opportunity to leverage emerging interconnect technologies (e.g., optical interconnects) to improve communication latency and bandwidth and allow greater physical distance between compute and memory blades. Second, the memory blade's controller provides a logical point in the system hierarchy to integrate accelerators for capacity and reliability enhancements, such as memory compression [30][31]. Finally, one might replace or complement memory blade DRAM with higher-density, lower-power, and/or non-volatile memory
technologies, such as NAND Flash or phase change memory. Unlike conventional memory systems, where it is difficult to integrate these technologies because of large or asymmetric access latencies and lifetime/wearout challenges, disaggregated memory is more tolerant of increased access latency, and the memory blade controller might be extended to implement wear-leveling and other lifetime management strategies [43]. Furthermore, disaggregated memory offers the potential for transparent integration. Because of the memory interface abstraction provided by our design, Flash or phase change memory can be utilized on the memory blade without requiring any further changes on the compute blade.

6. CONCLUSIONS
Constraints on per-socket memory capacity and the growing contribution of memory to total datacenter costs and power consumption motivate redesign of the memory subsystem. In this paper, we discuss a new architectural approach—memory disaggregation—which uses dedicated memory blades to provide OS-transparent memory extension and ensemble sharing for commodity-based blade-server designs. We propose an extensible design for the memory blade, including address remapping facilities to support protected dynamic memory provisioning across multiple clients, and unique density optimizations to address the compute-to-memory capacity imbalance. We discuss two different system architectures that incorporate this blade: a page-based design that allows memory blades to be used on current commodity blade server architectures with small changes to the virtualization layer, and an alternative that requires small amounts of extra hardware support in current compute blades but supports fine-grained remote accesses and requires no changes to the software layer. To the best of our knowledge, our work is the first to propose a commodity-based design that simultaneously addresses compute-to-memory capacity extension and cross-node memory capacity sharing. We are also the first to consider dynamic memory sharing across the I/O communication network in a blade enclosure and quantitatively evaluate design tradeoffs in this environment.

Simulations based on detailed traces from 12 enterprise benchmarks and three real-world enterprise datacenter deployments show that our approach has significant potential. The ability to extend and share memory can achieve orders of magnitude performance improvements in cases where applications run out of memory capacity, and similar orders of magnitude improvement in performance-per-dollar in cases where systems are overprovisioned for peak memory usage. We also demonstrate how this approach can be used to achieve higher levels of server consolidation than currently possible. Overall, as future server environments gravitate towards more memory-constrained and cost-conscious solutions, we believe that the memory disaggregation approach we have proposed in the paper is likely to be a key part of future system designs.

7. ACKNOWLEDGEMENTS
We thank the anonymous reviewers for their feedback. This work was partially supported by NSF grant CSR-0834403, and an Open Innovation grant from HP. We would also like to acknowledge Andrew Wheeler, John Bockhaus, Eric Anderson, Dean Cookson, Niraj Tolia, Justin Meza, the Exascale Datacenter team and the COTSon team at HP Labs, and Norm Jouppi for their support and useful comments.

8. REFERENCES
[1] K. Asanovic et al. The Landscape of Parallel Computing Research: A View from Berkeley. UC Berkeley EECS Tech Report UCB/EECS-2006-183, Dec. 2006.
[2] VMWare Performance Team Blogs. Ten Reasons Why Oracle Databases Run Best on VMWare "Scale up with Large Memory." http://tinyurl.com/cudjuy
[3] J. Larus. Spending Moore's Dividend. Microsoft Tech Report MSR-TR-2008-69, May 2008.
[4] SIA. International Technology Roadmap for Semiconductors, 2007 Edition, 2007.
[5] HP. Memory technology evolution: an overview of system memory technologies. http://tinyurl.com/ctfjs2
[6] A. Lebeck, X. Fan, H. Zheng and C. Ellis. Power Aware Page Allocation. In Proc. of the 9th Int. Conf. on Architectural Support for Programming Languages and Operating Systems (ASPLOS-IX), Nov. 2000.
[7] V. Pandey, W. Jiang, Y. Zhou and R. Bianchini. DMA-Aware Memory Energy Conservation. In Proc. of the 12th Int. Sym. on High-Performance Computer Architecture (HPCA-12), 2006.
[8] K. Lim, P. Ranganathan, J. Chang, C. Patel, T. Mudge and S. Reinhardt. Understanding and Designing New Server Architectures for Emerging Warehouse-Computing Environments. In Proc. of the 35th Int. Sym. on Computer Architecture (ISCA-35), June 2008.
[9] P. Ranganathan and N. Jouppi. Enterprise IT Trends and Implications for Architecture Research. In Proc. of the 11th Int. Sym. on High-Performance Computer Architecture (HPCA-11), 2005.
[10] http://apotheca.hpl.hp.com/pub/datasets/animation-bear/
[11] L. Barroso, J. Dean and U. Hoelzle. Web Search for a Planet: The Google Cluster Architecture. IEEE Micro, 23(2), March/April 2003.
[12] E. Felten and J. Zahorjan. Issues in the implementation of a remote memory paging system. University of Washington CSE TR 91-03-09, March 1991.
[13] M. Feeley, W. Morgan, E. Pighin, A. Karlin, H. Levy and C. Thekkath. Implementing global memory management in a workstation cluster. In Proc. of the 15th ACM Sym. on Operating System Principles (SOSP-15), 1995.
[14] M. Flouris and E. Markatos. The network RamDisk: Using remote memory on heterogeneous NOWs. Cluster Computing, Vol. 2, Issue 4, 1999.
[15] M. Dahlin, R. Wang, T. Anderson and D. Patterson. Cooperative caching: Using remote client memory to improve file system performance. In Proc. of the 1st USENIX Sym. on Operating Systems Design and Implementation (OSDI '94), 1994.
[16] M. Hines, L. Lewandowski and K. Gopalan. Anemone: Adaptive Network Memory Engine. Florida State University TR-050128, 2005.
[17] L. Iftode, K. Li and K. Peterson. Memory servers for multicomputers. IEEE Spring COMPCON '93, 1993.
[18] S. Koussih, A. Acharya and S. Setia. Dodo: A user-level system for exploiting idle memory in workstation clusters. In Proc. of the 8th IEEE Int. Sym. on High Performance Distributed Computing (HPDC-8), 1999.
[19] A. Agarwal et al. The MIT Alewife Machine: Architecture and Performance. In Proc. of the 23rd Int. Sym. on Computer Architecture (ISCA-23), 1995.
[20] D. Lenoski et al. The Stanford DASH Multiprocessor. IEEE Computer, 25(3), Mar. 1992.
[21] E. Hagersten and M. Koster. WildFire: A Scalable Path for SMPs. In Proc. of the 5th Int. Sym. on High-Performance Computer Architecture (HPCA-5), 1999.
[22] J. Laudon and D. Lenoski. The SGI Origin: A ccNUMA Highly Scalable Server. In Proc. of the 25th Int. Sym. on Computer Architecture (ISCA-25), 1997.
[23] W. Bolosky, M. Scott, R. Fitzgerald, R. Fowler and A. Cox. NUMA Policies and their Relationship to Memory Architecture. In Proc. of the 4th Int. Conf. on Architectural Support for Programming Languages and Operating Systems (ASPLOS-IV), 1991.
[24] K. Li and P. Hudak. Memory coherence in shared virtual memory systems. ACM Transactions on Computer Systems (TOCS), 7(4), Nov. 1989.
[25] D. Scales, K. Gharachorloo and C. Thekkath. Shasta: A Low Overhead, Software-Only Approach for Supporting Fine-Grain Shared Memory. In Proc. of the 7th Int. Conf. on Architectural Support for Programming Languages and Operating Systems (ASPLOS-VII), 1996.
[26] C. Amza et al. TreadMarks: Shared Memory Computing on Networks of Workstations. IEEE Computer, 29(2), 1996.
[27] I. Schoinas, B. Falsafi, A. Lebeck, S. Reinhardt, J. Larus and D. Wood. Fine-grain Access Control for Distributed Shared Memory. In Proc. of the 6th Int. Conf. on Architectural Support for Programming Languages and Operating Systems (ASPLOS-VI), 1994.
[28] K. Gharachorloo. The Plight of Software Distributed Shared Memory. Invited talk at the 1st Workshop on Software Distributed Shared Memory (WSDSM '99), 1999.
[29] ScaleMP. The Versatile SMP™ (vSMP) Architecture and Solutions Based on vSMP Foundation™. White paper at http://www.scalemp.com/prod/technology/how-does-it-work/
[30] F. Douglis. The compression cache: using online compression to extend physical memory. In Proc. of the 1993 Winter USENIX Conference, 1993.
[31] M. Ekman and P. Stenström. A Robust Main Memory Compression Scheme. In Proc. of the 32nd Int. Sym. on Computer Architecture (ISCA-32), 2005.
[32] Virident. Virident's GreenGateway™ technology and Spansion® EcoRAM. http://www.virident.com/solutions.php
[33] Texas Memory Systems. TMS RamSan-440 Details. http://www.superssd.com/products/ramsan-440/
[34] Intel. Intel Fully Buffered DIMM Specification Addendum. http://www.intel.com/technology/memory/FBDIMM/spec/Intel_FBD_Spec_Addendum_rev_p9.pdf
[35] T. Kgil et al. PicoServer: using 3D stacking technology to enable a compact energy efficient chip multiprocessor. In Proc. of the 12th Int. Conf. on Architectural Support for Programming Languages and Operating Systems (ASPLOS-XII), 2006.
[36] M. Ekman and P. Stenström. A Cost-Effective Main Memory Organization for Future Servers. In Proc. of the 19th Int. Parallel and Distributed Processing Symposium, 2005.
[37] C. Waldspurger. Memory Resource Management in VMware ESX Server. In Proc. of the 5th USENIX Sym. on Operating System Design and Implementation (OSDI '02), 2002.
[38] D. Ye, A. Pavuluri, C. Waldspurger, B. Tsang, B. Rychlik and S. Woo. Prototyping a Hybrid Main Memory Using a Virtual Machine Monitor. In Proc. of the 26th Int. Conf. on Computer Design (ICCD), 2008.
[39] E. Argollo, A. Falcón, P. Faraboschi, M. Monchiero and D. Ortega. COTSon: Infrastructure for System-Level Simulation. ACM Operating Systems Review, 43(1), 2009.
[40] J. R. Santos, Y. Turner, G. Janakiraman and I. Pratt. Bridging the gap between software and hardware techniques for I/O virtualization. USENIX Annual Technical Conference, 2008.
[41] J. Rolia, A. Andrzejak and M. Arlitt. Automating Enterprise Application Placement in Resource Utilities. 14th IFIP/IEEE Int. Workshop on Distributed Systems: Operations and Management (DSOM 2003), 2003.
[42] R. Bryant and J. Hawkes. Linux® Scalability for Large NUMA Systems. In Proc. of the Ottawa Linux Symposium 2003, July 2003.
[43] T. Kgil, D. Roberts and T. Mudge. Improving NAND Flash Based Disk Caches. In Proc. of the 35th Int. Sym. on Computer Architecture (ISCA-35), June 2008.
AMD, the AMD Arrow Logo, AMD Opteron and combinations thereof are trademarks of Advanced Micro Devices, Inc.
HyperTransport is a trademark of the HyperTransport Consortium.
Microsoft and Windows are registered trademarks of Microsoft Corporation.
PCI Express and PCIe are registered trademarks of PCI-SIG.
Linux is a registered trademark of Linus Torvalds.
SPEC is a registered trademark of the Standard Performance Evaluation Corporation (SPEC).