In-Memory Pointer Chasing Accelerator - ICCD16
Abstract—Pointer chasing is a fundamental operation, used by many important data-intensive applications (e.g., databases, key-value stores, graph processing workloads) to traverse linked data structures. This operation is both memory bound and latency sensitive, as it (1) exhibits irregular access patterns that cause frequent cache and TLB misses, and (2) requires the data from every memory access to be sent back to the CPU to determine the next pointer to access. Our goal is to accelerate pointer chasing by performing it inside main memory, thereby avoiding inefficient and high-latency data transfers between main memory and the CPU. To this end, we propose the In-Memory PoInter Chasing Accelerator (IMPICA), which leverages the logic layer within 3D-stacked memory for linked data structure traversal.

This paper identifies the key design challenges of designing a pointer chasing accelerator in memory, describes new mechanisms employed within IMPICA to solve these challenges, and evaluates the performance and energy benefits of our accelerator. IMPICA addresses the key challenges of (1) how to achieve high parallelism in the presence of serial accesses in pointer chasing, and (2) how to effectively perform virtual-to-physical address translation on the memory side without requiring expensive accesses to the CPU's memory management unit. We show that the solutions to these challenges, address-access decoupling and a region-based page table, respectively, are simple and low-cost. We believe these solutions are also applicable to many other in-memory accelerators, which are likely to also face the two challenges.

Our evaluations on a quad-core system show that IMPICA improves the performance of pointer chasing operations in three commonly-used linked data structures (linked lists, hash tables, and B-trees) by 92%, 29%, and 18%, respectively. This leads to a significant performance improvement in applications that utilize linked data structures: on a real database application, DBx1000, IMPICA improves transaction throughput and response time by 16% and 13%, respectively. IMPICA also significantly reduces overall system energy consumption (by 41%, 23%, and 10% for the three commonly-used data structures, and by 6% for DBx1000).

1. Introduction

Linked data structures, such as trees, hash tables, and linked lists, are commonly used in many important applications [21, 25, 28, 33, 60, 61, 89]. For example, many databases use B/B+-trees to efficiently index large data sets [21, 28], key-value stores use linked lists to handle collisions in hash tables [25, 60], and graph processing workloads [1, 2, 81] use pointers to represent graph edges. These structures link nodes using pointers, where each node points to at least one other node by storing its address. Traversing the link requires serially accessing consecutive nodes by retrieving the address(es) of the next node(s) from the pointer(s) stored in the current node. This fundamental operation is called pointer chasing in linked data structures.
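To make the dependence concrete, the following minimal C sketch (ours, not taken from any of the cited applications) traverses a singly linked list; the address of each node is known only after the previous node's data has been loaded, so the memory accesses cannot be overlapped:

    struct node {
        int          key;
        struct node *next;   /* pointer (virtual address) to the next node */
    };

    /* Each iteration dereferences the pointer fetched by the previous
       iteration, so the loads are fully serialized. */
    struct node *list_find(struct node *head, int key)
    {
        for (struct node *n = head; n != NULL; n = n->next)
            if (n->key == key)
                return n;
        return NULL;
    }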
Pointer chasing is currently performed by the CPU cores, as part of an application thread. While this approach eases the integration of pointer chasing into larger programs, pointer chasing can be inefficient within the CPU, as it introduces several sources of performance degradation: (1) dependencies exist between memory requests to the linked nodes, resulting in serialized memory accesses and limiting the available instruction-level and memory-level parallelism [33, 61, 62, 67, 75]; (2) irregular allocation or rearrangement of the connected nodes leads to access pattern irregularity [18, 43, 45, 61, 93], causing frequent cache and TLB misses; and (3) link traversals in data structures that diverge at each node (e.g., hash tables, B-trees) frequently go down different paths during different iterations, resulting in little reuse, further limiting cache effectiveness [59]. Due to these inefficiencies, a significant memory bottleneck arises when executing pointer chasing operations in the CPU, which stalls on a large number of memory requests that suffer from the long round-trip latency between the CPU and the memory.

Many prior works (e.g., [14-16, 18, 36, 37, 43, 45, 55, 58, 59, 61, 75, 76, 83, 92, 93, 95, 99]) proposed mechanisms to predict and prefetch the next node(s) of a linked data structure early enough to hide the memory latency. Unfortunately, prefetchers for linked data structures suffer from several shortcomings: (1) they usually do not provide significant benefit for data structures that diverge at each node [45], due to low prefetcher accuracy and low miss coverage; (2) aggressive prefetchers can consume too much of the limited off-chip memory bandwidth and, as a result, slow down the system [18, 43, 84]; and (3) a prefetcher that works well for some pointer-based data structure(s) and access patterns (e.g., a Markov prefetcher designed for mostly-static linked lists [43]) usually does not work efficiently for different data structures and/or access patterns. Thus, it is important to explore new solution directions to alleviate performance and efficiency loss due to pointer chasing.

Our goal in this work is to accelerate pointer chasing by directly minimizing the memory bottleneck caused by pointer chasing operations. To this end, we propose to perform pointer chasing inside main memory by leveraging processing-in-memory (PIM) mechanisms, avoiding the need to move data to the CPU. In-memory pointer chasing greatly reduces (1) the latency of the operation, as an address does not need to be brought all the way into the CPU before it can be dereferenced; and (2) the reliance on caching and prefetching in the CPU, which are largely ineffective for pointer chasing.

Early work on PIM proposed to embed general-purpose logic in main memory [20, 30, 44, 48, 69, 70, 80, 85], but was not commercialized due to the difficulty of fabricating logic and memory on the same die. The emergence of 3D die-stacked memory, where memory layers are stacked on top of a logic layer [38, 39, 41, 42, 50], provides a unique opportunity to embed simple accelerators or cores within the logic layer. Several recent works recognized and explored this opportunity (e.g., [1-3, 6, 7, 10, 12, 22, 27, 31, 34, 46, 51, 56, 71, 72, 96, 97]) for various purposes. For the first time, in this work, we propose an in-memory accelerator for chasing pointers in any linked data structure, called the In-Memory PoInter Chasing Accelerator (IMPICA). IMPICA leverages the low memory access latency at the logic layer of 3D-stacked memory to speed up pointer chasing operations.

We identify two fundamental challenges that we believe exist for a wide range of in-memory accelerators, and evaluate them as part of a case study in designing a pointer chasing accelerator in memory. These fundamental challenges are (1) how to achieve high parallelism in the accelerator (in the presence of serial accesses in pointer chasing), and (2) how to effectively perform virtual-to-physical address translation on the memory side without performing costly accesses to the CPU's memory management unit. We call these, respectively, the parallelism challenge and the address translation challenge.

The Parallelism Challenge. Parallelism is challenging to exploit in an in-memory accelerator even with the reduced latency and higher bandwidth available within 3D-stacked memory, as the performance of pointer chasing is limited by dependent sequential accesses. The serialization problem can be exacerbated when the accelerator traverses multiple streams of links: while traditional out-of-order or multicore CPUs can service memory requests from multiple streams in parallel due to their ability to exploit high levels of instruction- and memory-level parallelism [29, 33, 62, 64-67, 86], simple accelerators (e.g., [1, 22, 72, 97]) are unable to exploit such parallelism unless they are carefully designed.

We observe that accelerator-based pointer chasing is primarily bottlenecked by memory access latency, and that the address generation computation for link traversal takes only a small fraction of the total traversal time, leaving the accelerator idle for a majority of the traversal time. In IMPICA, we exploit this idle time by decoupling link address generation from the issuing and servicing of a memory request, which allows the accelerator to generate addresses for one link traversal stream while waiting on the request associated with a different link traversal stream to return from memory. We call this design address-access decoupling. Note that this form of decoupling bears resemblance to the decoupled access/execute architecture [82], and we in fact take inspiration from past works [17, 49, 82], except our design is specialized for building a pointer chasing accelerator in 3D-stacked memory, and this paper solves specific challenges within the context of pointer chasing acceleration.
The Address Translation Challenge. An in-memory pointer chasing accelerator must be able to perform address translation, as each pointer in a linked data structure node stores the virtual address of the next node, even though main memory is physically addressed. To determine the next address in the pointer chasing sequence, the accelerator must resolve the virtual-to-physical address mapping. If the accelerator relies on existing CPU-side address translation mechanisms, any performance gains from performing pointer chasing in memory could easily be nullified, as the accelerator needs to send a long-latency translation request to the CPU via the off-chip channel for each memory access. The translation can sometimes require a page table walk, where the CPU must issue multiple memory requests to read the page table, which further increases traffic on the memory channel. While a naive solution is to simply duplicate the TLB and page walker within memory, this is prohibitively difficult for three reasons: (1) coherence would have to be maintained between the CPU and memory-side TLBs, introducing extra complexity and off-chip requests; (2) the duplication is very costly in terms of hardware; and (3) a memory module can be used in conjunction with many different processor architectures, which use different page table implementations and formats, and ensuring compatibility between the in-memory TLB/page walker and all of these designs is difficult.

We observe that traditional address translation techniques do not need to be employed for pointer chasing, as link traversals are (1) limited to linked data structures, and (2) touch only certain data structures in memory. We exploit this in IMPICA by allocating data structures accessed by IMPICA into contiguous regions within the virtual memory space, and designing a new translation mechanism, the region-based page table, which is optimized for in-memory acceleration. Our approach provides translation within memory at low latency and low cost, while minimizing the cost of maintaining TLB coherence.

Evaluation. By solving both key challenges, IMPICA provides significant performance and energy benefits for pointer chasing operations and applications that use such operations. First, we examine three microbenchmarks, each of which performs pointer chasing on a widely used data structure (linked list, hash table, B-tree), and find that IMPICA improves their performance by 92%, 29%, and 18%, respectively, on a quad-core system over a state-of-the-art baseline. Second, we evaluate IMPICA on a real database workload, DBx1000 [94], on a quad-core system, and show that IMPICA increases overall database transaction throughput by 16% and reduces transaction latency by 13%. Third, IMPICA reduces overall system energy, by 41%, 23%, and 10% for the three microbenchmarks and by 6% for DBx1000. These benefits come at a very small hardware cost: our evaluations show that IMPICA comprises only 7.6% of the area of a small embedded core (the ARM Cortex-A57 [4]).

We make the following major contributions in this paper:

• This is the first work to propose an in-memory accelerator for pointer chasing. Our proposal, IMPICA, accelerates linked data structure traversal by chasing pointers inside the logic layer of 3D-stacked memory, thereby eliminating inefficient, high-latency serialized data transfers between the CPU and main memory.

• We identify two fundamental challenges in designing an efficient in-memory pointer chasing accelerator (Section 3). These challenges can greatly hamper performance if the accelerator is not designed carefully to overcome them. First, multiple streams of link traversal can unnecessarily get serialized at the accelerator, degrading performance (the parallelism challenge). Second, an in-memory accelerator needs to perform virtual-to-physical address translation for each pointer, but this critical functionality does not exist on the memory side (the address translation challenge).

• IMPICA solves the parallelism challenge by decoupling link address generation from memory accesses, and utilizes the idle time during memory accesses to service multiple pointer chasing streams simultaneously. We call this approach address-access decoupling (Section 4.1).

• IMPICA solves the address translation challenge by allocating data structures it accesses into contiguous virtual memory regions, and using an optimized and low-cost region-based page table structure for address translation (Section 4.2).

• We evaluate IMPICA extensively using both microbenchmarks and a real database workload. Our results (Section 7) show that IMPICA improves both system performance and energy efficiency for all of these workloads, while requiring only very modest hardware overhead in the logic layer of 3D-stacked DRAM.

2. Motivation

To motivate the need for a pointer chasing accelerator, we first examine the usage of pointer chasing in contemporary workloads. We then discuss opportunities for acceleration within 3D-stacked memory.

2.1. Pointer Chasing in Modern Workloads

Pointers are ubiquitous in fundamental data structures such as linked lists, trees, and hash tables, where the nodes of the data structure are linked together by storing the addresses (i.e., pointers) of neighboring nodes. Pointers make it easy to dynamically add/delete nodes in these data structures, but link traversal is often serialized, as the address of the next node can be known only after the current node is fetched. The serialized link traversal is commonly referred to as pointer chasing.

Due to the flexibility of insertion/deletion, pointer-based data structures and link traversal algorithms are essential building blocks in programming, and they enable a very wide range of workloads. For instance, at least six different types of modern data-intensive applications rely heavily on linked data structures: (1) databases and file systems use B/B+-trees for indexing tables or metadata [21, 28]; (2) in-memory caching applications based on key-value stores, such as Memcached [25] and Masstree [60], use linked lists to resolve hash table collisions and trie-like B+-trees as their main data structures; (3) graph processing workloads use pointers to represent the edges that connect the vertex data structures together [1, 81]; (4) garbage collectors in high-level languages typically maintain reference relations using trees [89]; (5) 3D video games use binary space partitioning trees to determine the objects that need to be rendered [68]; and (6) dynamic routing tables in networks employ balanced search trees for high-performance IP address lookups [88].

While linked data structures are widely used in many modern applications, chasing pointers is very inefficient in general-purpose processors. There are three major reasons behind the inefficiency. First, the inherent serialization that occurs when accessing consecutive nodes limits the available instruction-level and memory-level parallelism [43, 55, 59, 61-67, 75, 76]. As a result, out-of-order execution provides only limited performance benefit when chasing pointers [61, 62, 64]. Second, as nodes can be inserted and removed dynamically, they can get allocated to different regions of memory. The irregular allocation causes pointer chasing to exhibit irregular access patterns, which lead to frequent cache and TLB misses [18, 43, 45, 61, 93]. Third, for data structures that diverge at each node, such as B-trees, link traversals often go down different paths during different iterations, as the inputs to the traversal function change. As a result, lower-level nodes that were recently referenced during a link traversal are unlikely to be reused in subsequent traversals, limiting the effectiveness of many caching policies [47, 55, 59], such as LRU replacement.

To quantify the performance impact of chasing pointers in real-world workloads, we profile two popular applications that heavily depend on linked data structures, using a state-of-the-art Intel Xeon system:1 (1) Memcached [25], using a real Twitter dataset [23] as its input; and (2) DBx1000 [94], an in-memory database system, using the TPC-C benchmark [87] as its input. We profile the pointer chasing code within the application separately from other parts of the application code. Figure 1 shows how pointer chasing compares to the rest of the application in terms of execution time, cycles per instruction (CPI), and the ratio of last-level cache (LLC) miss cycles to the total cycles.

1 We use the Intel VTune profiling tool on a machine with a Xeon W3550 processor (3 GHz, 8-core, 8 MB LLC) [40] and 18 GB memory. We profile each application for 10 minutes after it reaches steady state.
Fig. 1. Profiling results of pointer chasing portions of code vs. the rest of the application code in Memcached and DBx1000 (execution time, cycles per instruction, and LLC miss cycle ratio).

Fig. 2. Pointer chasing in (a) an example binary tree, (b) a traditional architecture, and (c) the IMPICA architecture with 3D-stacked memory.
We make three major observations. First, both Memcached and DBx1000 spend a significant fraction of their total execution time (7% and 19%, respectively) on pointer chasing, as a result of dependent cache misses [33, 61, 75]. Though these percentages might sound small, real software often does not have a single type of operation that consumes this significant a fraction of the total time. Second, we find that pointer chasing is significantly more inefficient than the rest of the application, as it requires much higher cycles per instruction (6x in Memcached, and 1.6x in DBx1000). Third, pointer chasing is largely memory-bound, as it exhibits much higher cache miss rates than the rest of the application and, as a result, spends a much larger fraction of cycles waiting for LLC misses (16x in Memcached, and 1.5x in DBx1000). From these observations, we conclude that (1) pointer chasing consumes a significant fraction of execution time in two important and complicated applications, (2) pointer chasing operations are bound by memory, and (3) executing pointer chasing code in a modern general-purpose processor is very inefficient and thus can lead to a large performance overhead. Other works made similar observations for different workloads [33, 61, 75].

Prior works (e.g., [14-16, 18, 36, 37, 43, 45, 55, 58, 59, 61, 75, 76, 83, 92, 93, 95, 99]) proposed specialized prefetchers that predict and prefetch the next node of a linked data structure to hide memory latency. While prefetching can mitigate part of the memory latency problem, it has three major shortcomings. First, the efficiency of prefetchers degrades significantly when the traversal of linked data structures diverges into multiple paths and the access order is irregular [45]. Second, prefetchers can sometimes slow down the entire system due to contention caused by inaccurate prefetch requests [18, 19, 43, 84]. Third, these hardware prefetchers are usually designed for specific data structure implementations, and tend to be very inefficient when dealing with other data structures. It is difficult to design a prefetcher that is efficient and effective for all types of linked data structures. Our goal in this work is to improve the performance of pointer chasing applications without relying on prefetchers, regardless of the types of linked data structures used in an application.

2.2. Accelerating Pointer Chasing in 3D-Stacked Memory

We propose to improve the performance of pointer chasing by leveraging processing-in-memory (PIM) to alleviate the memory bottleneck. Instead of sequentially fetching each node from memory and sending it to the CPU when an application is looking for a particular node, PIM-based pointer chasing consists of (1) traversing the linked data structures in memory, and (2) returning only the final node found to the CPU.

Unlike prior works that proposed general architectural models for in-memory computation by embedding logic in main memory [20, 30, 44, 48, 69, 70, 80, 85], we propose to design a specialized In-Memory PoInter Chasing Accelerator (IMPICA) that exploits the logic layer of 3D-stacked memory [38, 39, 41, 42, 50]. 3D die-stacked memory achieves low latency (and high bandwidth) by stacking memory dies on top of a logic die, and interconnecting the layers using through-silicon vias (TSVs). Figure 2 shows a binary tree traversal using IMPICA, compared to a traditional architecture where the CPU traverses the binary tree. The traversal sequentially accesses the nodes from the root to a particular node (e.g., H->E->A in Figure 2a). In a traditional architecture (Figure 2b), these serialized accesses to the nodes miss in the caches, and three memory requests are sent to memory serially across a high-latency off-chip channel. In contrast, IMPICA traverses the tree inside the logic layer of 3D-stacked memory, and as Figure 2c shows, only the final node (A) is sent from the memory to the host CPU in response to the traversal request. Doing the traversal in memory minimizes both traversal latency (as queuing delays in the on-chip interconnect and the CPU-to-memory bus are eliminated) and off-chip bandwidth consumption, as shown in Figure 2c.

Our accelerator architecture has three major advantages. First, it improves performance and reduces bandwidth consumption by eliminating the round trips required for memory accesses over the CPU-to-memory bus. Second, it frees the CPU to execute work other than linked data structure traversal, increasing system throughput. Third, it minimizes the cache contention caused by pointer chasing operations.

3. Design Challenges

We identify and describe two new challenges that are crucial to the performance and functionality of our new pointer chasing accelerator in memory: (1) the parallelism challenge, and (2) the address translation challenge. Section 4 describes our IMPICA architecture, which centers around two key ideas that solve these two challenges.

3.1. Challenge 1: Parallelism in the Accelerator

A pointer chasing accelerator supporting a multicore system needs to handle multiple link traversals (from different cores) in parallel at low cost. A simple accelerator that can handle only one request at a time (which we call a non-parallel accelerator) would serialize the requests and could potentially be slower than using multiple CPU cores to perform the multiple traversals. As depicted in Figure 3a, while a non-parallel accelerator speeds up each individual pointer chasing operation done by one of the CPU cores due to its shorter memory latency, the accelerator is slower overall for two pointer chasing operations, as multiple cores can operate in parallel on independent pointer chasing operations.

Fig. 3. Execution time of two independent pointer chasing operations, broken down into address computation time (Comp) and memory access time: (a) pointer chasing on two CPU cores vs. one non-parallel accelerator; (b) pointer chasing using IMPICA.

To overcome this deficiency, an in-memory accelerator needs to exploit parallelism when it services requests. However, the accelerator must do this at low cost, due to its placement within the logic layer of 3D-stacked memory, where complex logic, such as out-of-order execution circuitry, is not currently feasible. The straightforward solution of adding multiple accelerators to service independent pointer chasing operations (e.g., [47]) does not scale well, and also can lead to excessive energy dissipation and die area usage in the logic layer.

A key observation we make is that pointer chasing operations are bottlenecked by memory stalls, as shown in Figure 1. In our evaluation, the memory access time is 10-15x the computation time. As a result, the accelerator spends a significant amount of time waiting for memory, causing its compute resources to sit idle. This makes typical in-order or out-of-order execution engines inefficient for an in-memory pointer chasing accelerator. If we utilize the hardware resources in a more efficient manner, we can enable parallelism by handling multiple pointer chasing operations within a single accelerator.
Based on our observation, we decouple address generation from memory accesses in IMPICA using two engines (address engine and access engine), allowing the accelerator to generate addresses from one pointer chasing operation while it concurrently performs memory accesses for a different pointer chasing operation (as shown in Figure 3b). We describe the details of our decoupled accelerator design in Section 4.

3.2. Challenge 2: Virtual Address Translation

A second challenge arises when pointer chasing is moved out of the CPU cores, which are equipped with facilities for address translation. Within the program data structures, each pointer is stored as a virtual address, and requires translation to a physical address before its memory access can be performed. This is a challenging task for an in-memory accelerator, which has no easy access to the virtual address translation engine that sits in the CPU core. While sequential array operations could potentially be constrained to work within page boundaries or directly in physical memory, indirect memory accesses that come with pointer-based data structures require some support for virtual memory translation, as they might touch many parts of the virtual address space.

There are two major issues when designing a virtual address translation mechanism for an in-memory accelerator. First, different processor architectures have different page table implementations and formats. This lack of compatibility makes it very expensive to simply replicate the CPU page table walker in the in-memory accelerator, as this approach requires replicating TLBs and page walkers for many architecture formats. Second, a page table walk tends to be a high-latency operation involving multiple memory accesses due to the heavily layered format of a conventional page table. As a result, TLB misses are a major performance bottleneck in data-intensive applications [8]. If the accelerator relies on the CPU's address translation mechanisms for its many page table walks, each of which requires a high-latency off-chip access, its performance can degrade greatly.

To address these issues, we completely decouple the page table of IMPICA from that of the CPUs, obviating the need for compatibility between the two tables. This presents us with an opportunity to develop a new page table design that is much more efficient for our in-memory accelerator. We make two key observations about the behavior of a pointer chasing accelerator. First, the accelerator operates only on certain data structures that can be mapped to contiguous regions in the virtual address space, which we refer to as IMPICA regions. As a result, it is possible to map contiguous IMPICA regions with a smaller, region-based page table without needing to duplicate the page table mappings for the entire address space. Second, we observe that if we need to map only IMPICA regions, we can collapse the hierarchy present in conventional page tables, allowing us to limit the overhead of the IMPICA page table. We describe the IMPICA page table in detail in Section 4.2.

4. IMPICA Architecture

We propose a new in-memory accelerator, IMPICA, that addresses the two design challenges facing accelerators for pointer chasing. The IMPICA architecture consists of a single specialized core designed to decouple address generation from memory accesses. Our approach, which we call address-access decoupling, allows us to efficiently overcome the parallelism challenge (Section 4.1). The IMPICA core uses a novel region-based page table design to perform efficient address translation locally in the accelerator, allowing us to overcome the address translation challenge (Section 4.2).

4.1. IMPICA Core Architecture

Our IMPICA core uses what we call address-access decoupling, where we separate the core into two parts: (1) an address engine, which generates the address specified by the pointer; and (2) an access engine, which performs memory access operations using addresses generated by the address engine. The key advantage of this design is that the address engine supports fast context switching between multiple pointer chasing operations, allowing it to utilize the idle time during memory access(es) to compute addresses from a different pointer chasing operation. As Figure 3b depicts, an IMPICA core can process multiple pointer chasing operations faster than multiple cores because it has the ability to overlap address generation with memory accesses.

Our address-access decoupling has similarities to, and is in fact inspired by, the decoupled access-execute (DAE) architecture [82], with two key differences. First, the goal of DAE is to exploit instruction-level parallelism (ILP) within a single thread, whereas our goal is to exploit thread-level parallelism (TLP) across pointer chasing operations from multiple threads. Second, unlike DAE, the decoupling in IMPICA does not require any programmer or compiler effort. Our approach is much simpler than both general-purpose DAE and out-of-order execution, as it can switch between different independent execution streams without the need for dependency checking [35].

Figure 4 shows the architecture of the IMPICA core. The host CPU initializes a pointer chasing operation by moving its code to main memory, and then enqueuing the request in the request queue (1 in Figure 4). Section 5.1 describes the details of the CPU interface.

Fig. 4. IMPICA core architecture (request queue, address engine with instruction RAM and data RAM, access queue, access engine, IMPICA cache, and response queue).

The address engine services the enqueued request by loading the pointer chasing code into its instruction RAM (2). This engine contains all of IMPICA's functional units, and executes the code in its instruction RAM while using its data RAM (3) as a stack. All instructions that do not involve memory accesses, such as ALU operations and control flow, are performed by the address engine. The number of pointer chasing operations that can be processed in parallel is limited by the size of the stack in the data RAM [35].

When the address engine encounters a memory instruction, it enqueues the address (along with the data RAM stack pointer) into the access queue (4), and then performs a context switch to an independent stream. For the switch, the engine pushes the hardware context (i.e., architectural registers and the program counter) onto the data RAM stack. When this is done, the address engine can work on a different pointer chasing operation.

The access engine services requests waiting in the access queue. This engine translates the enqueued address from a virtual address to a physical address, using the IMPICA page table (see Section 4.2). It then sends the physical address to the memory controller, which performs the memory access. Since the memory controller handles data retrieval, the access engine can issue multiple requests to the controller without waiting on the data, just as the CPU does today, thus quickly servicing queued requests [35]. Note that the access engine does not contain any functional units.

When the access engine receives data back from the memory controller, it stores this data in the IMPICA cache (5), a small cache that contains data destined for the address engine. The access queue entry corresponding to the returned data is moved from the access queue to the response queue (6).

The address engine monitors the response queue. When a response queue entry is ready, the address engine reads it, and uses the stack pointer to access and reload the registers and PC that were pushed onto the data RAM stack. It then resumes execution for the pointer chasing operation, continuing until it encounters the next memory instruction.
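The following C sketch is our own software model of this flow; the context and queue structures are simplifications we assume for illustration, not IMPICA's actual hardware interface. It shows the key idea: a traversal that issues a fetch is parked, and the address engine immediately moves on to another traversal.

    #include <stddef.h>

    struct node { long key; struct node *next; };

    /* One context per in-flight pointer chasing operation. */
    struct context {
        struct node *cur;      /* node currently being examined              */
        long         target;   /* key being searched for                     */
        struct node *result;   /* set when the traversal finishes            */
        int          waiting;  /* 1 while a node fetch is outstanding        */
    };

    /* Access engine: translate the address and fetch the node. Modeled as
       instantaneous here; in hardware it overlaps with the address engine
       working on other contexts. */
    static void access_engine(struct context *c)
    {
        c->waiting = 0;        /* fetched node is now in the IMPICA cache    */
    }

    /* Address engine: round-robin over contexts, doing only the cheap
       address computation; a context that issues a fetch is parked and
       another context runs, so independent traversals overlap. */
    static void address_engine(struct context ops[], size_t n)
    {
        for (size_t live = n; live > 0; ) {
            live = 0;
            for (size_t i = 0; i < n; i++) {
                struct context *c = &ops[i];
                if (c->result != NULL || c->cur == NULL)
                    continue;                   /* traversal already done     */
                live++;
                if (c->waiting) {               /* response queue entry ready */
                    access_engine(c);
                } else if (c->cur->key == c->target) {
                    c->result = c->cur;         /* final node: return to CPU  */
                } else {
                    c->cur     = c->cur->next;  /* generate next link address */
                    c->waiting = 1;             /* enqueue fetch, switch ctx  */
                }
            }
        }
    }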
4.2. IMPICA Page Table

IMPICA uses a region-based page table (RPT) design optimized for in-memory pointer chasing, leveraging the continuous ranges of accesses (IMPICA regions) discussed in Section 3.2. Figure 5 shows the structure of the RPT in IMPICA. The RPT is split into three levels: (1) a first-level region table, which needs to map only a small number of the contiguously-allocated IMPICA regions; (2) a second-level flat page table for each region with a larger (e.g., 2MB) page size; and (3) third-level small page tables that use conventional small (e.g., 4KB) pages. In the example in Figure 5, when a 48-bit virtual memory address arrives for translation, bits 47-41 of the address are used to index the region table (1 in Figure 5) to find the corresponding flat page table. Bits 40-21 are used to index the flat page table (2), providing the location of the small page table, which is indexed using bits 20-12 (3). The entry in the small page table provides the physical page number of the page, and bits 11-0 specify the offset within the physical page (4).
Fig. 5. IMPICA virtual memory architecture: bits [47:41] of the virtual address index the region table (region address, flat table address, valid bit), bits [40:21] index a per-region flat page table (2^20 entries, 2MB pages), bits [20:12] index a 4KB page table (2^9 entries), and bits [11:0] are the offset within the physical page.

The RPT is optimized to take advantage of the properties of pointer chasing. The region table is almost always cached in the IMPICA cache, as the total number of IMPICA regions is small, requiring small storage (e.g., a 4-entry region table needs only 68B of cache space). We employ a flat table with large (e.g., 2MB) pages at the second level in order to reduce the number of page misses, though this requires more memory capacity than the conventional 4-level page table structure. As the number of regions touched by the accelerator is limited, this additional capacity overhead remains constrained. Our page table can optionally use traditional smaller page sizes to maximize memory management flexibility. The OS can freely choose large (2MB) pages or small (4KB) pages at the last level. Thanks to this design, a page walk in the RPT usually results in only two misses, one for the flat page table and another for the last-level small page table. This represents a 2x improvement over a conventional four-level page table, while our flattened page table still provides coverage for a 2TB memory range. The size of the IMPICA region is configurable and can be increased to cover more virtual address space [35]. We believe that our RPT design is general enough for use in a variety of in-memory accelerators that operate on a specific range of memory regions.

We discuss how the OS manages the IMPICA RPT in Section 5.2.
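As a minimal illustration of the walk in Figure 5, the C sketch below uses the bit ranges given above; the in-memory layout of the table entries is our own simplification, and the optional 2MB large-page path at the second level is omitted.

    #include <stdint.h>

    struct small_table  { uint64_t pfn[1u << 9]; };              /* 4KB pages    */
    struct flat_table   { struct small_table *small[1u << 20]; };
    struct region_entry { struct flat_table *flat; int valid; }; /* region table */

    /* Walk the region-based page table: region table -> flat page table ->
       small page table. Returns 0 if the address is outside every IMPICA
       region. */
    uint64_t rpt_translate(const struct region_entry region_table[128],
                           uint64_t vaddr)
    {
        unsigned region = (vaddr >> 41) & 0x7f;      /* bits 47-41 */
        unsigned flat   = (vaddr >> 21) & 0xfffff;   /* bits 40-21 */
        unsigned small  = (vaddr >> 12) & 0x1ff;     /* bits 20-12 */
        uint64_t offset =  vaddr        & 0xfff;     /* bits 11-0  */

        const struct region_entry *re = &region_table[region];
        if (!re->valid)
            return 0;                                 /* not an IMPICA region  */
        struct small_table *st = re->flat->small[flat];  /* possible miss #1   */
        return (st->pfn[small] << 12) | offset;          /* possible miss #2   */
    }

When the 32-entry IMPICA TLB (Table 1) hits, this walk is skipped entirely; otherwise the two possible misses noted above correspond to the flat page table and small page table accesses described in the text.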
5. Interface and Design Considerations

In this section, we discuss how we expose IMPICA to the CPU and the OS. Section 5.1 describes the communication interface between the CPU and IMPICA. Section 5.2 discusses how the OS manages the page tables in IMPICA. In Section 5.3, we discuss how cache coherence is maintained between the CPU and IMPICA caches.

5.1. CPU Interface and Programming Model

We use a packet-based interface between the CPU and IMPICA. Instead of communicating individual operations or operands, the packet-based interface buffers requests and sends them in a burst to minimize the communication overhead. Executing a function in IMPICA consists of four steps on the interface. (1) The CPU sends to memory a packet comprising the function call and parameters. (2) This packet is written to a specific location in memory, which is memory-mapped to the data RAM in IMPICA and triggers IMPICA execution. (3) IMPICA loads the specific function into the inst RAM with appropriate parameters, by reading the values from predefined memory locations. (4) Once IMPICA finishes the function execution, it writes the return value back to the memory-mapped locations in the data RAM. The CPU periodically polls these locations and receives the IMPICA output. Note that the IMPICA interface is similar to the interface proposed for the Hybrid Memory Cube (HMC) [38, 39].

The programming model for IMPICA is similar to the CPU programming model. An IMPICA program can be written as a function in the application code with a compiler directive. The compiler compiles these functions into IMPICA instructions and wraps the function calls with communication code that utilizes the CPU-IMPICA interface.
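A hypothetical host-side wrapper for these four steps might look as follows; the data RAM base address, packet word layout, and function id are placeholders we invented for illustration, not the actual IMPICA interface (in practice the compiler emits such a wrapper around the annotated function).

    #include <stdint.h>

    #define IMPICA_DATA_RAM ((volatile uint64_t *)0x100000000ull) /* assumed MMIO base */
    enum { PKT_FUNC, PKT_ARG0, PKT_ARG1, PKT_STATUS, PKT_RET };   /* assumed layout    */

    /* Steps 1-2: write the call packet into the memory-mapped data RAM,
       which triggers IMPICA execution. Step 4: poll for the return value. */
    static uint64_t impica_call(uint64_t func_id, uint64_t arg0, uint64_t arg1)
    {
        IMPICA_DATA_RAM[PKT_ARG0]   = arg0;
        IMPICA_DATA_RAM[PKT_ARG1]   = arg1;
        IMPICA_DATA_RAM[PKT_STATUS] = 0;
        IMPICA_DATA_RAM[PKT_FUNC]   = func_id;  /* last write starts execution */
        while (IMPICA_DATA_RAM[PKT_STATUS] == 0)
            ;                                   /* CPU periodically polls      */
        return IMPICA_DATA_RAM[PKT_RET];
    }

    /* Example use: offload a lookup previously loaded as function 0.
       uint64_t node = impica_call(0, (uint64_t)tree_root, key);          */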
5.2. Page Table Management

In order for the RPT to identify IMPICA regions, the regions must be tagged by the application. For this, the application uses a special API to allocate pointer-based data structures. This API allocates memory to a contiguous virtual address space. To ensure that all API allocations are contiguous, the OS reserves a portion of the unused virtual address space for IMPICA, and always allocates memory for IMPICA regions from this portion. The use of such a special API requires minimal changes to applications, and it allows the system to provide more efficient virtual address translation. This also allows us to ensure that when multiple memory stacks are present within the system, the OS can allocate all IMPICA regions belonging to a single application (along with the associated IMPICA page table) into one memory stack, thereby avoiding the need for the accelerator to communicate with a remote memory stack.
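For example (hypothetical API name and signature; the paper does not give the actual interface at this level of detail), an application would build its offloaded data structure entirely out of this reserved range:

    #include <stddef.h>

    /* Hypothetical allocator provided by the IMPICA runtime: returns memory
       carved out of the contiguous virtual address range that the OS
       reserves for IMPICA regions (name and signature are our assumption). */
    void *impica_alloc(size_t size);

    struct node { long key; struct node *next; };

    /* All nodes of the offloaded list live in one IMPICA region, so the RPT
       only ever needs to map that single contiguous range. */
    struct node *make_node(long key, struct node *next)
    {
        struct node *n = impica_alloc(sizeof *n);
        n->key  = key;
        n->next = next;
        return n;
    }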
The OS maintains coherence between the IMPICA RPT and the CPU page table. When memory is allocated in the IMPICA region, the OS allocates the IMPICA page table. The OS also shoots down TLB entries in IMPICA if the CPU performs any updates to IMPICA regions. While this makes the OS page fault handler more complex, the additional complexity does not cause a noticeable performance impact, as page faults occur rarely and take a long time to service in the CPU.

5.3. Cache Coherence

Coherence must be maintained between the CPU and IMPICA caches, and with memory, to avoid using stale data and thus ensure correct execution. We maintain coherence by executing every function that operates on the IMPICA regions in the accelerator. This solution guarantees that no data is shared between the CPU and IMPICA, and that IMPICA always works on up-to-date data. Other PIM coherence solutions (e.g., [2, 10, 26]) can also be used to allow the CPU to update the linked data structures, but we choose not to employ these solutions in our evaluation, as our workloads do not perform any such updates.

6. Methodology

We use the gem5 [9] full-system simulator with DRAMSim2 [74] to evaluate our proposed design. We choose the 64-bit ARMv8 architecture, the accuracy of which has been validated against real hardware [32]. We model the internal memory bandwidth of the memory stack to be 4x that of the external bandwidth, similar to the configuration used in prior works [22, 96]. Our simulation parameters are summarized in Table 1. Our technical report [35] provides more detail on the IMPICA configuration.

Table 1. Major simulation parameters.
Processor
  ISA: ARMv8 (64-bit)
  Core Configuration: 4 OoO cores, 2 GHz, 8-wide, 128-entry ROB
  Operating System: 64-bit Linux from Linaro [54]
  L1 I/D Cache: 32KB / 2-way each, 2-cycle
  L2 Cache: 1MB / 8-way, shared, 20-cycle
DRAM Parameters
  Memory Configuration: DDR3-1600, 8 banks/device, FR-FCFS scheduler
  DRAM Bus Bandwidth: 12.8 GB/s for CPU, 51.2 GB/s for IMPICA
IMPICA Configuration
  Accelerator Core: 500 MHz, 16 entries for each queue
  Cache: 32KB / 2-way
  Address Translator: 32 TLB entries with region-based page table
  RAM: 16KB data RAM and 16KB inst RAM

2 We sweep the size of the IMPICA cache from 32KB to 128KB, and find that it has negligible effect on our results [35].

6.1. Workloads

We use three data-intensive microbenchmarks, which are essential building blocks in a wide range of workloads, to evaluate the native performance of pointer chasing operations: linked lists, hash tables, and B-trees. We also evaluate the performance improvement in a real data-intensive workload, measuring the transaction latency and throughput of DBx1000 [94], an in-memory OLTP database. We modify all four workloads to offload each pointer chasing request to IMPICA. To minimize communication overhead, we map the IMPICA registers to user-mode address space, thereby avoiding the need for costly kernel code intervention.

Linked list. We use the linked list traversal microbenchmark [98] derived from the health workload in the Olden benchmark suite [73]. The parameters are configured to approximate the performance of the health workload. We measure the performance of the linked list traversal after 30,000 iterations.

Hash table. We create a microbenchmark from the hash table implementation of Memcached [25]. The hash table in Memcached resolves hash collisions using chaining via linked lists. When there are more than 1.5n items in a table of n buckets, it doubles the number of
buckets. We follow this rule by inserting 1.5 x 2^20 random keys into a hash table with 2^20 buckets. We run evaluations for 100,000 random key look-ups.
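The core of the lookup that this microbenchmark stresses can be sketched as follows (illustrative C, not Memcached's actual code; it assumes the bucket count is a power of two, as in the 2^20-bucket configuration above):

    #include <stdint.h>
    #include <string.h>

    struct item { const char *key; struct item *h_next; void *value; };

    /* Chained lookup: hashing picks a bucket, then the collision chain is
       pointer-chased until the key matches. */
    void *assoc_find(struct item *const buckets[], uint64_t nbuckets,
                     const char *key, uint64_t hash)
    {
        for (struct item *it = buckets[hash & (nbuckets - 1)];
             it != NULL; it = it->h_next)
            if (strcmp(it->key, key) == 0)
                return it->value;
        return NULL;
    }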
B-tree. We use the B-tree implementation of DBx1000 for our B-tree microbenchmark. It is a 16-way B-tree that uses a 64-bit integer as the key of each node. We randomly generate 3,000,000 keys and insert them into the B-tree. After the insertions, we measure the performance of the B-tree traversal with 100,000 random keys. This is the most time-consuming operation in the database index lookup.

DBx1000. We run DBx1000 [94] with the TPC-C benchmark [87]. We set up the TPC-C tables with 2,000 customers and 100,000 items. For each run, we spawn 4 threads and bind them to 4 different CPUs to achieve maximum throughput. We run each thread for a warm-up period for the duration of 2,000 transactions, and then record the software and hardware statistics for the next 5,000 transactions per thread,3 which takes 300-500 million CPU cycles.

6.2. Die Area and Energy Estimation

We estimate the die area of the IMPICA processing logic at the 40nm process node based on recently-published work [57]. We include the most important components: processor cores, L1/L2 caches, and the memory controller. We use the area of the ARM Cortex-A57 [4, 24], a small embedded processor, for the main CPU. We conservatively estimate the die area of IMPICA using the area of the Cortex-R4 [5], an 8-stage dual-issue RISC processor with 32 KB I/D caches. Table 2 lists the area estimate of each component.

Table 2. Die area estimates using a 40nm process.
  Baseline CPU (Cortex-A57): 5.85 mm^2 per core
  L2 Cache: 5 mm^2 per MB
  Memory Controller: 10 mm^2
  IMPICA Core (including 32 KB I/D caches): 0.45 mm^2

IMPICA comprises only 7.6% of the area of a single baseline CPU core, or only 1.2% of the total area of the baseline chip (which includes four CPU cores, 1MB of L2 cache, and one memory controller). Note that we conservatively model IMPICA as a RISC core. A much more specialized engine can be designed for IMPICA to solely execute pointer chasing code. Doing so would reduce the area and energy overheads of IMPICA greatly, but can reduce the generality of the pointer chasing access patterns that IMPICA can accelerate. We leave this for future work.

We use McPAT [52] to estimate the energy consumption of the CPU, caches, memory controllers, and IMPICA. We conservatively use the configuration of the Cortex-R4 to estimate the energy consumed by IMPICA. We use DRAMSim2 [74] to analyze DRAM energy.

7. Evaluation

We first evaluate the effect of IMPICA on system performance, using both our microbenchmarks (Section 7.1) and the DBx1000 database (Section 7.2). We investigate the impact of different IMPICA page table designs in Section 7.3, and examine system energy consumption in Section 7.4. We compare a system containing IMPICA to an accelerator-free baseline that includes an additional 128KB of L2 cache (which is equivalent to the area of IMPICA) to ensure area-equivalence across evaluated systems.

7.1. Microbenchmark Performance

Figure 6 shows the speedup of IMPICA and of the baseline with an extra 128KB of L2 cache over the baseline for each microbenchmark. IMPICA achieves significant speedups across all three data structures: 1.92x for the linked list, 1.29x for the hash table, and 1.18x for the B-tree. In contrast, the extra 128KB of L2 cache provides very small speedups (1.03x, 1.01x, and 1.02x, respectively). We conclude that IMPICA is much more effective than the area-equivalent additional L2 cache for pointer chasing operations.

To provide insight into why IMPICA improves performance, we present total (CPU and IMPICA) TLB misses per kilo instructions (MPKI), cache miss latency, and total memory bandwidth usage for these microbenchmarks in Figure 7. We make three observations.

3 Based on our experiments on a real Intel Xeon machine, we find that this is large enough to satisfactorily represent the behavior of 1,000,000 transactions [35].

Fig. 6. Microbenchmark performance with IMPICA: speedup over the baseline for the baseline with an extra 128KB of L2 cache and for IMPICA, on the linked list, hash table, and B-tree.

First, a major factor contributing to the performance improvement is the reduction in TLB misses. The TLB MPKI in Figure 7a depicts the total (i.e., combined CPU and IMPICA) TLB misses in both the baseline system and IMPICA. The pointer chasing operations have low locality and pollute the CPU TLB. This leads to a higher overall TLB miss rate in the application. With IMPICA, the pointer chasing operations are offloaded to the accelerator. This reduces the pollution and contention at the CPU TLB, reducing the overall number of TLB misses. The linked list has a significantly higher TLB MPKI than the other data structures because linked list traversal requires far fewer instructions in an iteration. It simply accesses the next pointer, while a hash table or a B-tree traversal needs to compare the keys in the node to determine the next step.

Fig. 7. Key architectural statistics for the microbenchmarks: (a) total TLB misses per kilo instructions (MPKI), (b) average cache miss latency, (c) total memory bandwidth utilization (GB/s).

Second, we observe a significant reduction in cache miss latency with IMPICA. Figure 7b compares the average cache miss latency between the baseline last-level cache and the IMPICA cache. On average, the cache miss latency of IMPICA is only 60-70% of the baseline cache miss latency. This is because IMPICA leverages the faster and wider TSVs in 3D-stacked memory, as opposed to the high-latency and narrow DRAM interface used by the CPU.

Third, as Figure 7c shows, IMPICA effectively utilizes the internal memory bandwidth in 3D-stacked memory, which is cheap and abundant. There are two reasons for the high bandwidth utilization: (1) IMPICA runs much faster than the baseline, so it generates more traffic within the same amount of time; and (2) IMPICA always accesses memory at a larger granularity, retrieving each full node in a linked data structure with a single memory request, while a CPU issues multiple requests for each node as it can fetch only one cache line at a time. The CPU can avoid using some of its limited memory bandwidth by skipping some fields in the data structure that are not needed for the current loop iteration. For example, some keys and pointers in a B-tree node can be skipped whenever a match is found. In contrast, IMPICA utilizes the wide internal bandwidth of 3D-stacked memory to retrieve a full node on each access (more detail is in our tech report [35]).

We conclude that IMPICA is effective at significantly improving the performance of important linked data structures.

7.2. Real Database Throughput and Latency

Figure 8 presents two key performance metrics for our evaluation of DBx1000: database throughput and database latency. Database throughput represents how many transactions are completed within a certain period, while database latency is the average time to complete a transaction. We normalize the results of three configurations to the baseline. As mentioned earlier, the die area increase of IMPICA is similar to a 128KB cache. To better understand the effect of additional LLC space, we also show the results of adding 1MB of cache, which takes about 8x the area of IMPICA, to the baseline. We make two observations from our analysis of DBx1000.

Fig. 8. Performance results for DBx1000 (baseline + extra 128KB L2, baseline + extra 1MB L2, and IMPICA), normalized to the baseline: (a) database throughput, (b) database latency.
First, IMPICA improves the overall database throughput by 16% and reduces the average database transaction latency by 13%. The performance improvement is due to three reasons: (1) database indexing becomes faster with IMPICA, (2) offloading database indexing to IMPICA reduces the TLB and cache contention due to pointer chasing in the CPU, and (3) the CPU can do other useful tasks in parallel while waiting for IMPICA. Note that our profiling results in Figure 1 show that DBx1000 spends 19% of its time on pointer chasing. Therefore, a 16% overall improvement is very close to the upper bound that any pointer chasing accelerator can achieve for this database.

Second, IMPICA yields much higher database throughput than simply providing additional cache capacity. IMPICA improves the database throughput by 16%, while an extra 128KB of cache (with a similar area overhead as IMPICA) does so by only 2%, and an extra 1MB of cache (8x the area of IMPICA) by only 5%.

We conclude that by accelerating the fundamental pointer chasing operation, IMPICA can efficiently improve the performance of a sophisticated real workload.

7.3. Sensitivity to the IMPICA TLB Size & Page Table Design

To understand the effect of different TLB sizes and page table designs in IMPICA, we evaluate the speedup in the amount of time spent on address translation for IMPICA when different IMPICA TLB sizes (32 and 64 entries) and accelerator page table structures (the baseline 4-level page table, and the region-based page table, or RPT) are used inside the accelerator. Figure 9 shows the speedup in address translation time relative to IMPICA with a 32-entry TLB and the conventional 4-level page table. Two observations are in order.

Fig. 9. Speedup in address translation time for different IMPICA TLB sizes (32/64 entries) and page table designs (4-level page table vs. RPT), for the linked list, hash table, B-tree, and DBx1000.

Fig. 10. Effect of IMPICA on energy consumption.

8. Related Work

To our knowledge, this is the first work to (1) leverage the logic layer in 3D-stacked memory to accelerate pointer chasing in linked data structures, (2) propose an address-access decoupled architecture to tackle the parallelism challenge for in-memory accelerators, and (3) propose a general and efficient page table mechanism to tackle the address translation challenge for in-memory accelerators.

Many prior works investigate the pointer chasing problem and propose solutions using software or hardware prefetching mechanisms (e.g., [14-16, 18, 36, 37, 43, 45, 55, 58, 59, 61, 75, 76, 83, 92, 93, 95, 99]). The effectiveness of this approach is fundamentally dependent on the accuracy and generality of address prediction. In this work, we improve the performance of pointer chasing operations with an in-memory accelerator, inspired by the concept of processing-in-memory (PIM) (e.g., [20, 30, 44, 48, 69, 70, 80, 85]). There are many recent proposals that aim to improve the performance of data-intensive workloads using accelerators in CPUs or in memory (e.g., [1-3, 6, 7, 11-13, 22, 27, 31, 34, 46, 47, 51, 53, 72, 77-79, 90, 91, 96, 97]). Here, we briefly discuss these related works. None of these works provide an accelerator for pointer chasing in memory.

Prefetching for Linked Data Structures. Many works propose mechanisms to prefetch data in linked data structures to hide memory latency. These proposals are hardware-based (e.g., [14, 16, 36, 37, 43, 61, 76, 95]), software-based (e.g., [55, 59, 75, 92, 93]), pre-execution-based (e.g., [15, 58, 83, 99]), or software/hardware-cooperative (e.g., [18, 63, 76]) mechanisms. These approaches have two major drawbacks. First, they rely on predictable traversal sequences to prefetch accurately. These mechanisms can become very inefficient if the linked data structure is complex or when access patterns are less regular. Second, the pointer chasing is performed at the CPU cores or at the memory controller, which likely leads to pollution of the CPU caches and TLBs

9. Conclusion

We introduce the design and evaluation of an in-memory accelerator, called IMPICA, for performing pointer chasing operations in 3D-stacked memory. We identify two major challenges in the design
of such an in-memory accelerator: (1) the parallelism challenge and [40] Intel, “Intel Xeon Processor W3550,” 2009.
(2) the address translation challenge. We provide new solutions to [41] J. Jeddeloh and B. Keeth, “Hybrid memory cube: New DRAM architecture increases
density and performance,” in VLSIT, 2012.
these two challenges: (1) address-access decoupling solves the paral- [42] JEDEC, “High Bandwidth Memory (HBM) DRAM,” Standard No. JESD235, 2013.
lelism challenge by decoupling the address generation from memory [43] D. Joseph and D. Grunwald, “Prefetching using Markov predictors,” in ISCA, 1997.
[44] Y. Kang et al., “FlexRAM: Toward an advanced intelligent memory system,” in
accesses in pointer chasing operations and exploiting the idle time ICCD, 1999.
during memory accesses to execute multiple pointer chasing opera- [45] M. Karlsson et al., “A prefetching technique for irregular accesses to linked data
structures,” in HPCA, 2000.
tions in parallel, and (2) the region-based page table in 3D-stacked [46] D. Kim et al., “Neurocube: A programmable digital neuromorphic architecture with
memory solves the address translation challenge by tracking only those high-density 3D memory,” in ISCA, 2016.
[47] Y. O. Koçberber et al., “Meet the walkers: Accelerating index traversals for in-
limited set of virtual memory regions that are accessed by pointer memory databases,” in MICRO, 2013.
chasing operations. Our evaluations show that for both commonly- [48] P. M. Kogge, “EXECUBE–A new architecture for scaleable MPPs,” in ICPP, 1994.
[49] L. Kurian et al., “Memory latency effects in decoupled architectures with a single
used linked data structures and a real database application, IMPICA data memory module,” in ISCA, 1992.
significantly improves both performance and energy efficiency. We [50] D. Lee et al., “Simultaneous Multi-Layer Access: Improving 3D-stacked memory
conclude that IMPICA is an efficient and effective accelerator design bandwidth at low cost,” TACO, 2016.
[51] J. H. Lee et al., “BSSync: Processing near memory for machine learning workloads
for pointer chasing. We also believe that the two challenges we identify with bounded staleness consistency models,” in PACT, 2015.
(parallelism and address translation) exist in various forms in other in- [52] S. Li et al., “The McPAT framework for multicore and manycore architectures:
Simultaneously modeling power, area, and timing,” TACO, 2013.
memory accelerators (e.g., for graph processing), and, therefore, our [53] K. T. Lim et al., “Thin servers with smart pipes: Designing SoC accelerators for
solutions to these challenges can be adapted for use by a broad class of Memcached,” in ISCA, 2013.
[54] Linaro, “Linux gem5 support,” 2014.
(in-memory) accelerators. [55] M. H. Lipasti et al., “SPAID: Software prefetching in pointer- and call-intensive
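To make these two mechanisms concrete, the sketch below is a minimal software analogy in C, not the accelerator's actual hardware design; all identifiers (region_entry, rpt_translate, chase_interleaved, and so on) are hypothetical, and an identity-mapped region is assumed purely for illustration. The flat region table stands in for the region-based page table, so translation is a single range check plus an offset addition over the few regions touched by pointer chasing rather than a 4-level walk, and the interleaved loop stands in for address-access decoupling by keeping several independent traversals in flight and advancing whichever one has its node available.

/* A minimal software analogy (not IMPICA's actual hardware) of the two
 * mechanisms above: region-based translation and address-access decoupling. */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Region-based translation: one range check + offset add over the few
 * regions used by the linked data structure, instead of a 4-level walk. */
typedef struct { uintptr_t vbase, pbase; size_t size; } region_entry;
#define MAX_REGIONS 4
static region_entry rpt[MAX_REGIONS];
static int nregions = 0;

static void rpt_add(uintptr_t vbase, uintptr_t pbase, size_t size) {
    rpt[nregions++] = (region_entry){ vbase, pbase, size };
}

static uintptr_t rpt_translate(uintptr_t va) {
    for (int i = 0; i < nregions; i++)
        if (va >= rpt[i].vbase && va < rpt[i].vbase + rpt[i].size)
            return rpt[i].pbase + (va - rpt[i].vbase);
    return 0; /* miss: a real design would fall back to the CPU-side page table */
}

/* Address-access decoupling (software analogy): keep several independent
 * traversals in flight and advance whichever one has its node available,
 * so one traversal's memory access can overlap with another's progress. */
typedef struct node { uint64_t key; struct node *next; } node;
typedef struct { node *cur; uint64_t key; node *hit; } traversal;

static void chase_interleaved(traversal *t, int n) {
    int live = n;
    while (live > 0) {
        for (int i = 0; i < n; i++) {        /* round-robin over traversals */
            if (t[i].cur == NULL) continue;  /* this traversal is finished  */
            node *p = (node *)rpt_translate((uintptr_t)t[i].cur);
            if (p == NULL) { t[i].cur = NULL; live--; continue; }
            if (p->key == t[i].key) { t[i].hit = p; t[i].cur = NULL; live--; }
            else if (p->next == NULL) { t[i].cur = NULL; live--; }
            else t[i].cur = p->next;         /* next address generated immediately */
        }
    }
}

int main(void) {
    /* Build a small list inside one region and map that region identically. */
    node *pool = malloc(16 * sizeof(node));
    for (int i = 0; i < 16; i++) {
        pool[i].key = (uint64_t)i;
        pool[i].next = (i < 15) ? &pool[i + 1] : NULL;
    }
    rpt_add((uintptr_t)pool, (uintptr_t)pool, 16 * sizeof(node));

    traversal t[3] = { { pool, 5, NULL }, { pool, 11, NULL }, { pool, 2, NULL } };
    chase_interleaved(t, 3);
    for (int i = 0; i < 3; i++)
        printf("lookup key %llu -> %s\n",
               (unsigned long long)t[i].key, t[i].hit ? "found" : "not found");
    free(pool);
    return 0;
}

In the actual accelerator, the interleaving is performed by dedicated address and access engines in the logic layer and the latency being hidden is the DRAM access latency; the loop above only illustrates the scheduling and single-lookup translation ideas.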
Acknowledgments

We thank the anonymous reviewers and the SAFARI Research Group for their feedback. Special thanks to David G. Andersen and Vivek Seshadri for feedback. We acknowledge the support of our industrial partners: Google, Intel, NVIDIA, Samsung and Seagate. This research was partially supported by NSF (grants 1212962, 1320531, 1409723), ISTC-CC, SRC, and DSSC.
References

[1] J. Ahn et al., “A scalable processing-in-memory accelerator for parallel graph processing,” in ISCA, 2015.
[2] J. Ahn et al., “PIM-Enabled Instructions: A low-overhead, locality-aware processing-in-memory architecture,” in ISCA, 2015.
[3] B. Akin et al., “Data reorganization in memory using 3D-stacked DRAM,” in ISCA, 2015.
[4] ARM, “ARM Cortex-A57,” http://www.arm.com/products/processors/cortex-a/cortex-a57-processor.php.
[5] ARM, “ARM Cortex-R4,” http://www.arm.com/products/processors/cortex-r/cortex-r4.php.
[6] H. Asghari-Moghaddam et al., “Chameleon: Versatile and practical near-DRAM acceleration architecture for large memory systems,” in MICRO, 2016.
[7] O. O. Babarinsa and S. Idreos, “JAFAR: Near-data processing for databases,” in SIGMOD, 2015.
[8] A. Basu et al., “Efficient virtual memory for big memory servers,” in ISCA, 2013.
[9] N. Binkert et al., “The gem5 simulator,” CAN, 2011.
[10] A. Boroumand et al., “LazyPIM: An efficient cache coherence mechanism for processing-in-memory,” CAL, 2016.
[11] K. K. Chang et al., “Low-Cost Inter-Linked Subarrays (LISA): Enabling fast inter-subarray data movement in DRAM,” in HPCA, 2016.
[12] P. Chi et al., “A novel processing-in-memory architecture for neural network computation in ReRAM-based main memory,” in ISCA, 2016.
[13] E. S. Chung et al., “LINQits: Big data on little clients,” in ISCA, 2013.
[14] J. D. Collins et al., “Pointer cache assisted prefetching,” in MICRO, 2002.
[15] J. D. Collins et al., “Speculative precomputation: Long-range prefetching of delinquent loads,” in ISCA, 2001.
[16] R. Cooksey et al., “A stateless, content-directed data prefetching mechanism,” in ASPLOS, 2002.
[17] N. C. Crago and S. J. Patel, “OUTRIDER: Efficient memory latency tolerance with decoupled strands,” in ISCA, 2011.
[18] E. Ebrahimi et al., “Techniques for bandwidth-efficient prefetching of linked data structures in hybrid prefetching systems,” in HPCA, 2009.
[19] E. Ebrahimi et al., “Coordinated control of multiple prefetchers in multi-core systems,” in MICRO, 2009.
[20] D. G. Elliott et al., “Computational RAM: A memory-SIMD hybrid and its application to DSP,” in CICC, 1992.
[21] R. Elmasri, Fundamentals of database systems. Pearson, 2007.
[22] A. Farmahini-Farahani et al., “NDA: Near-DRAM acceleration architecture leveraging commodity DRAM devices and standard memory modules,” in HPCA, 2015.
[23] M. Ferdman et al., “Clearing the clouds: A study of emerging scale-out workloads on modern hardware,” in ASPLOS, 2012.
[24] M. Filippo, “Technology preview: ARM next generation processing,” ARM Techcon, 2012.
[25] B. Fitzpatrick, “Distributed caching with Memcached,” Linux Journal, 2004.
[26] M. Gao et al., “Practical near-data processing for in-memory analytics frameworks,” in PACT, 2015.
[27] M. Gao and C. Kozyrakis, “HRL: Efficient and flexible reconfigurable logic for near-data processing,” in HPCA, 2016.
[28] D. Giampaolo, Practical file system design with the BE file system. Morgan Kaufmann Publishers Inc., 1998.
[29] A. Glew, “MLP yes! ILP no!” in ASPLOS WACI, 1998.
[30] M. Gokhale et al., “Processing in memory: The Terasys massively parallel PIM array,” IEEE Computer, 1995.
[31] B. Gu et al., “Biscuit: A framework for near-data processing of big data workloads,” in ISCA, 2016.
[32] A. Gutierrez et al., “Sources of error in full-system simulation,” in ISPASS, 2014.
[33] M. Hashemi et al., “Accelerating dependent cache misses with an enhanced memory controller,” in ISCA, 2016.
[34] K. Hsieh et al., “Transparent offloading and mapping (TOM): Enabling programmer-transparent near-data processing in GPU systems,” in ISCA, 2016.
[35] K. Hsieh et al., “A pointer chasing accelerator for 3D-stacked memory,” CMU SAFARI Technical Report No. 2016-007, 2016.
[36] Z. Hu et al., “TCP: Tag correlating prefetchers,” in HPCA, 2003.
[37] C. J. Hughes and S. V. Adve, “Memory-side prefetching for linked data structures for processor-in-memory systems,” JPDC, 2005.
[38] Hybrid Memory Cube Consortium, “HMC Specification 1.1,” 2013.
[39] Hybrid Memory Cube Consortium, “HMC Specification 2.0,” 2014.
[40] Intel, “Intel Xeon Processor W3550,” 2009.
[41] J. Jeddeloh and B. Keeth, “Hybrid memory cube: New DRAM architecture increases density and performance,” in VLSIT, 2012.
[42] JEDEC, “High Bandwidth Memory (HBM) DRAM,” Standard No. JESD235, 2013.
[43] D. Joseph and D. Grunwald, “Prefetching using Markov predictors,” in ISCA, 1997.
[44] Y. Kang et al., “FlexRAM: Toward an advanced intelligent memory system,” in ICCD, 1999.
[45] M. Karlsson et al., “A prefetching technique for irregular accesses to linked data structures,” in HPCA, 2000.
[46] D. Kim et al., “Neurocube: A programmable digital neuromorphic architecture with high-density 3D memory,” in ISCA, 2016.
[47] Y. O. Koçberber et al., “Meet the walkers: Accelerating index traversals for in-memory databases,” in MICRO, 2013.
[48] P. M. Kogge, “EXECUBE–A new architecture for scaleable MPPs,” in ICPP, 1994.
[49] L. Kurian et al., “Memory latency effects in decoupled architectures with a single data memory module,” in ISCA, 1992.
[50] D. Lee et al., “Simultaneous Multi-Layer Access: Improving 3D-stacked memory bandwidth at low cost,” TACO, 2016.
[51] J. H. Lee et al., “BSSync: Processing near memory for machine learning workloads with bounded staleness consistency models,” in PACT, 2015.
[52] S. Li et al., “The McPAT framework for multicore and manycore architectures: Simultaneously modeling power, area, and timing,” TACO, 2013.
[53] K. T. Lim et al., “Thin servers with smart pipes: Designing SoC accelerators for Memcached,” in ISCA, 2013.
[54] Linaro, “Linux gem5 support,” 2014.
[55] M. H. Lipasti et al., “SPAID: Software prefetching in pointer- and call-intensive environments,” in MICRO, 1995.
[56] G. H. Loh et al., “A processing in memory taxonomy and a case for studying fixed-function PIM,” in WoNDP, 2013.
[57] P. Lotfi-Kamran et al., “Scale-out processors,” in ISCA, 2012.
[58] C. Luk, “Tolerating memory latency through software-controlled pre-execution in simultaneous multithreading processors,” in ISCA, 2001.
[59] C. Luk and T. C. Mowry, “Compiler-based prefetching for recursive data structures,” in ASPLOS, 1996.
[60] Y. Mao et al., “Cache craftiness for fast multicore key-value storage,” in EuroSys, 2012.
[61] O. Mutlu et al., “Address-value delta (AVD) prediction: Increasing the effectiveness of runahead execution by exploiting regular memory allocation patterns,” in MICRO, 2005.
[62] O. Mutlu et al., “Techniques for efficient processing in runahead execution engines,” in ISCA, 2005.
[63] O. Mutlu et al., “Address-value delta (AVD) prediction: A hardware technique for efficiently parallelizing dependent cache misses,” IEEE TC, 2006.
[64] O. Mutlu et al., “Efficient runahead execution: Power-efficient memory latency tolerance,” IEEE Micro, 2006.
[65] O. Mutlu and T. Moscibroda, “Parallelism-aware batch scheduling: Enhancing both performance and fairness of shared DRAM systems,” in ISCA, 2008.
[66] O. Mutlu et al., “Runahead execution: An alternative to very large instruction windows for out-of-order processors,” in HPCA, 2003.
[67] O. Mutlu et al., “Runahead execution: An effective alternative to large instruction windows,” IEEE Micro, 2003.
[68] B. Naylor et al., “Merging BSP trees yields polyhedral set operations,” in SIGGRAPH, 1990.
[69] M. Oskin et al., “Active pages: A computation model for intelligent memory,” in ISCA, 1998.
[70] D. Patterson et al., “A case for intelligent RAM,” IEEE Micro, 1997.
[71] A. Pattnaik et al., “Kernel scheduling techniques for PIM-assisted GPU architectures,” in PACT, 2016.
[72] S. H. Pugsley et al., “NDC: Analyzing the impact of 3D-stacked memory+logic devices on MapReduce workloads,” in ISPASS, 2014.
[73] A. Rogers et al., “Supporting dynamic data structures on distributed-memory machines,” TOPLAS, 1995.
[74] P. Rosenfeld et al., “DRAMSim2: A cycle accurate memory system simulator,” CAL, 2011.
[75] A. Roth et al., “Dependence based prefetching for linked data structures,” in ASPLOS, 1998.
[76] A. Roth and G. S. Sohi, “Effective jump-pointer prefetching for linked data structures,” in ISCA, 1999.
[77] V. Seshadri et al., “Gather-Scatter DRAM: In-DRAM address translation to improve the spatial locality of non-unit strided accesses,” in MICRO, 2015.
[78] V. Seshadri et al., “Fast bulk bitwise AND and OR in DRAM,” CAL, 2015.
[79] V. Seshadri et al., “RowClone: Fast and energy-efficient in-DRAM bulk data copy and initialization,” in MICRO, 2013.
[80] D. E. Shaw et al., “The NON-VON database machine: A brief overview,” IEEE Database Eng. Bull., 1981.
[81] J. Shun and G. E. Blelloch, “Ligra: A lightweight graph processing framework for shared memory,” in PPoPP, 2013.
[82] J. E. Smith, “Decoupled access/execute computer architectures,” in ISCA, 1982.
[83] Y. Solihin et al., “Using a user-level memory thread for correlation prefetching,” in ISCA, 2002.
[84] S. Srinath et al., “Feedback directed prefetching: Improving the performance and bandwidth-efficiency of hardware prefetchers,” in HPCA, 2007.
[85] H. S. Stone, “A logic-in-memory computer,” IEEE TC, 1970.
[86] R. M. Tomasulo, “An efficient algorithm for exploiting multiple arithmetic units,” IBM JRD, 1967.
[87] TPC, “Transaction processing performance council,” http://www.tpc.org.
[88] M. Waldvogel et al., “Scalable high speed IP routing lookups,” in SIGCOMM, 1997.
[89] P. R. Wilson, “Uniprocessor garbage collection techniques,” in IWMM, 1992.
[90] L. Wu et al., “Navigating big data with high-throughput, energy-efficient data partitioning,” in ISCA, 2013.
[91] L. Wu et al., “Q100: The architecture and design of a database processing unit,” in ASPLOS, 2014.
[92] Y. Wu, “Efficient discovery of regular stride patterns in irregular programs,” in PLDI, 2002.
[93] C. Yang and A. R. Lebeck, “Push vs. pull: Data movement for linked data structures,” in ICS, 2000.
[94] X. Yu et al., “Staring into the abyss: An evaluation of concurrency control with one thousand cores,” VLDB, 2014.
[95] X. Yu et al., “IMP: Indirect memory prefetcher,” in MICRO, 2015.
[96] D. P. Zhang et al., “TOP-PIM: Throughput-oriented programmable processing in memory,” in HPDC, 2014.
[97] Q. Zhu et al., “Accelerating sparse matrix-matrix multiplication with 3D-stacked logic-in-memory hardware,” in HPEC, 2013.
[98] C. B. Zilles, “Benchmark health considered harmful,” CAN, 2001.
[99] C. B. Zilles and G. S. Sohi, “Execution-based prediction using speculative slices,” in ISCA, 2001.