
2017 IEEE International Symposium on High Performance Computer Architecture

Supporting Address Translation for Accelerator-Centric Architectures

Jason Cong, Zhenman Fang, Yuchen Hao, Glenn Reinman
University of California, Los Angeles
{cong, zhenman, haoyc, reinman}@cs.ucla.edu

Abstract—While emerging accelerator-centric architectures offer orders-of-magnitude performance and energy improvements, use cases and adoption can be limited by their rigid programming model. A unified virtual address space between the host CPU cores and customized accelerators can largely improve the programmability, which necessitates hardware support for address translation. However, supporting address translation for customized accelerators with low overhead is nontrivial. Prior studies either assume an infinite-sized TLB and zero page walk latency, or rely on a slow IOMMU for correctness and safety—which penalizes the overall system performance.

To provide efficient address translation support for accelerator-centric architectures, we examine the memory access behavior of customized accelerators to drive the TLB augmentation and MMU designs. First, to support bulk transfers of consecutive data between the scratchpad memory of customized accelerators and the memory system, we present a relatively small private TLB design to provide low-latency caching of translations to each accelerator. Second, to compensate for the effects of the widely used data tiling techniques, we design a shared level-two TLB to serve private TLB misses on common virtual pages, eliminating duplicate page walks from accelerators working on neighboring data tiles that are mapped to the same physical page. This two-level TLB design effectively reduces page walks by 75.8% on average. Finally, instead of implementing a dedicated MMU which introduces additional hardware complexity, we propose simply leveraging the host per-core MMU for efficient page walk handling. This mechanism is based on our insight that the existing MMU cache in the CPU MMU satisfies the demand of customized accelerators with minimal overhead. Our evaluation demonstrates that the combined approach incurs only a 6.4% performance overhead compared to the ideal address translation.

Fig. 1. Problems in current address translation support for accelerator-centric architectures in an IOMMU-only configuration

I. INTRODUCTION

In light of the failure of Dennard scaling and the recent slowdown of Moore's law, the computer architecture community has proposed many heterogeneous systems that combine conventional processors with a rich set of customized accelerators onto the same die [1], [2], [3], [4], [5], [6], [7], [8], [9], [10]. Such accelerator-centric architectures trade dark, unpowered silicon [11], [12] area for customized accelerators that offer orders-of-magnitude performance and energy gains compared to general-purpose cores. These accelerators are usually application-specific implementations of a particular functionality, and can range from simple tasks (e.g., a multiply-accumulate operation) to complex applications (e.g., medical imaging [13], database management [14], Memcached [15], [16]).

While such architectures promise tremendous performance/watt targets, system architects face a multitude of new problems, including but not limited to 1) how to integrate customized accelerators into the existing memory hierarchies and operating systems, and 2) how to efficiently offload algorithm kernels from general-purpose cores to customized accelerators. One of the key challenges involved is the memory management between the host CPU cores and accelerators. For conventional physically addressed accelerators, if the application lives in the user space, an offload process requires copying data across different privilege levels to/from the accelerator and manually maintaining data consistency. Additional overhead in data replication and OS intervention is inevitable, which may diminish the gain of customization [17]. Zero-copy avoids copying buffers via operating system support. However, programming with special APIs and carefully managing buffers can be a giant pain for developers.

Although accelerators in current heterogeneous systems have limited support for virtual addresses, industry initiatives, such as the Heterogeneous System Architecture (HSA) foundation, are proposing a shift towards a unified virtual address between the host CPU and accelerators [18]. In this model, instead of maintaining two copies of data in both host and device address spaces, only a single allocation is necessary. Consequently, an offload process simply requires passing the virtual pointer to the shared data to/from the accelerator. This has a variety of benefits, including the elimination of explicit data copying, increased performance of fine-grained memory accesses, and support for cache coherence and memory protection.

Unfortunately, the benefits of a unified virtual address also come at a cost. A key requirement of virtually addressed accelerators is the hardware support for virtual-to-physical address translation. Commercial CPUs and SoCs have introduced I/O memory management units (IOMMUs) [19], [20],

[21], [22] to allow loosely coupled devices to handle virtual addresses, as shown in Figure 1. These IOMMUs have I/O translation lookaside buffers (IOTLBs) and logic to walk the page table, which can provide address translation support for customized accelerators. However, a naive IOMMU configuration cannot meet the requirement of today's high-performance customized accelerators as it lacks efficient TLB support, and excessively long latency is incurred to walk the page table on IOTLB misses. Figure 2 shows that the performance of the baseline IOMMU approach achieves only 12.3% of the ideal address translation, where all translation requests hit in an ideal TLB;1 this leaves a huge performance improvement gap. Recent advances in IOMMU enable translation caching in devices [22]. However, designing efficient TLBs for high-performance accelerators is nontrivial and should be carefully studied. Prototypes in prior studies encounter the challenge of virtual address support [15], [16], [23], [24] as well. However, their focus is mainly on the design and performance tuning of accelerators, with either the underlying address translation approach not detailed or the performance impact not evaluated.

Fig. 2. Performance of the baseline IOMMU approach relative to ideal address translation

In this paper, our goal is to provide an efficient address translation support for heterogeneous customized accelerator-centric architectures. The hope is that such a design can enable a unified virtual address space between host CPU cores and accelerators with modest hardware modification and low performance overhead compared to the ideal address translation.

By examining the memory access behavior of customized accelerators, we propose an efficient hardware support for address translation tailored to the specific challenges and opportunities of accelerator-centric architectures that includes:

1) Private TLBs. Unlike conventional CPUs and GPUs, customized accelerators typically exhibit bulk transfers of consecutive data when loading data into the scratchpad memory and writing data back to the memory system. Therefore, a relatively small (16-32 entries) and low-latency private TLB can not only allow accelerators to save trips to the IOMMU, but can also capture the page access locality. On average, a private TLB with 32 entries can reduce 30.4% of the page walks compared to the IOMMU-only baseline, and improves the performance from 12.3% (IOMMU baseline) to 23.3% of the ideal address translation.

2) A Shared TLB. Data tiling techniques are widely used in customized accelerators to improve the data reuse within each tile and the parallelism between tiles. Due to the capacity limit of each accelerator's scratchpad memory, this usually breaks the contiguous memory region within a physical page into multiple data tiles that are mapped to different accelerator instances for parallelism. In light of this, we present a shared level-two TLB design to filter translation requests on common pages so that duplicate page walks will not be triggered from accelerator instances working on neighboring data tiles. Our evaluation shows that a two-level TLB design with a 512-entry shared TLB can reduce page walks by 75.8% on average, and improves the performance to 51.8% of the ideal address translation.

3) Host Page Walks. As accelerators are sensitive to long memory latency, the excessively long latency of page walks that cannot be filtered by TLBs degrades the system performance. While enhancing the IOMMU with MMU caches or introducing a dedicated MMU for accelerators are viable approaches [25], [26], better opportunities lie in the coordination between the host core and the accelerators invoked by it. The idea is that by extending the per-core MMU to provide an interface, accelerators operating within the same application's virtual address space can offload TLB misses to the page walker of the host core MMU. This gives us three different benefits: first, the page walk latency can be significantly reduced due to the presence of MMU caches [27], [28], [29] in the host core MMU; second, prefetching effects can be achieved due to the support of data cache inasmuch as loading one cache line effectively brings in multiple page table entries; third, cold misses in the MMU cache and data cache can be minimized since it is likely that the host core has already touched the data structure before offloading, so that corresponding resources have been warmed up. The experimental results show the host page walk reduces the average page walk latency to 58 cycles across different benchmarks, and the combined approach bridges the performance gap to 6.4% compared to the ideal address translation.

The remainder of this paper is organized as follows. Section II characterizes address translation behaviors of customized accelerators to motivate our design; Section III explains our simulation methodology and workloads; Section IV details the proposed design and evaluation; Section V discusses more use cases; Section VI summarizes related work; and Section VII concludes the paper.

1 Detailed experimental setup is described in Section III. More analysis of this gap is presented in Section IV-A.

II. CHARACTERIZATION OF CUSTOMIZED ACCELERATORS

A. Accelerator-Centric Architectures

We present an overview of the baseline accelerator-centric architecture used throughout this paper in Figure 1. In this architecture, CPU cores and loosely coupled accelerators share the physical memory. Each CPU core has its own TLB and MMU, while all accelerators share an IOMMU that has an IOTLB inside it. A CPU core can launch one or more accelerators by offloading a task to them for superior performance and energy efficiency. Launching multiple accelerators can exploit data parallelism by assigning accelerators to different data partitions, which we call tiles.

Fig. 3. A detailed look at an accelerator connecting with system IOMMU (baseline)

The details of a customized accelerator are shown in Figure 3. In contrast to general-purpose CPU cores or GPUs, accelerators do not use instructions and feature customized registers and datapaths with deep pipelines [8]. Scratchpad memory (SPM) is predominantly used by customized accelerators instead of hardware-managed caches, and data layout optimization techniques such as data tiling are often applied for increased performance. A memory interface such as a DMA (direct memory access) is often used to transfer data between the SPM and the memory system.

Due to these microarchitectural differences, customized accelerators exhibit distinct memory access behaviors compared to CPUs and GPUs. To drive our design, we characterize such behaviors in the following subsections: the bulk transfer of consecutive data, the impact of data tiling, and the sensitivity to address translation latency.

B. Bulk Transfer of Consecutive Data

The performance and energy gains of customized accelerators are largely due to the removal of instructions through specialization and deep pipelining [8]. To guarantee a high throughput for such customized pipelines—processing one input data every II (pipeline initialization interval) cycles, where II is usually one or two—the entire input data must be available in the SPM to provide register-like accessibility. Therefore, the execution process of customized accelerators typically has three phases: reading data from the memory system to the SPM in bulk for local handling, pipelined processing on local data, and then writing output data back to the memory system. Such bulky reads and writes appear as multiple streams of consecutive accesses in the memory system, which exhibit good memory page locality and high memory bandwidth utilization.
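To make the access pattern concrete, the following C++ sketch (our illustration, not code from the paper; the buffer size and the kernel are hypothetical) mimics the three-phase execution of one task: a bulk read into the SPM, pipelined processing on local data, and a bulk write-back. Phases 1 and 3 stream through consecutive virtual addresses, so each 4KB page is touched many times in a row before the next page is needed.

```cpp
#include <cstddef>

// Hypothetical sketch of one accelerator task: bulk read -> pipelined compute -> bulk write.
// TILE_ELEMS and the kernel are illustrative; the point is that phases 1 and 3 stream
// through consecutive virtual addresses, touching each 4KB page many times in a row.
constexpr std::size_t TILE_ELEMS = 4096;               // one data tile held in the SPM
static float spm_in[TILE_ELEMS], spm_out[TILE_ELEMS];  // scratchpad memory (SPM) buffers

void run_tile(const float* in, float* out) {
  // Phase 1: bulk transfer of consecutive data into the SPM (stand-in for a DMA read).
  for (std::size_t i = 0; i < TILE_ELEMS; ++i) spm_in[i] = in[i];

  // Phase 2: pipelined processing on local SPM data only (no external memory traffic).
  for (std::size_t i = 0; i < TILE_ELEMS; ++i) spm_out[i] = spm_in[i] * 2.0f + 1.0f;

  // Phase 3: bulk write-back of the results to the memory system.
  for (std::size_t i = 0; i < TILE_ELEMS; ++i) out[i] = spm_out[i];
}
```

With 4-byte elements, the 4096-element tile above spans four 4KB pages, so a warm TLB sees only a few misses clustered at the start of each read or write phase—the bursty pattern visible in Figures 4 and 5.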
Fig. 4. TLB miss behavior of BlackScholes

To demonstrate such characteristics, we plot the trace of virtual pages that trigger TLB misses in BlackScholes in Figure 4 (our simulation methodology and workloads are detailed in Section III). We can see that the TLB miss behavior is extremely regular, which is different from the more random accesses in CPU or GPU applications. Since accelerators feature customized deep pipelines without multithreading or context switching, the page divergence is only determined by the number of input data arrays and the dimensionality of each array. Figure 5 confirms this by showing the TLB miss trace in a single execution of BlackScholes, which accesses six one-dimensional input arrays and one output array. In addition, we can see that TLB misses typically happen at the beginning of the bulky data read and write phases, followed by a large number of TLB hits. Therefore, high hit rates can be expected from TLBs with sufficient capacity.

Fig. 5. TLB miss trace of a single execution from BlackScholes

This type of regularity is also observed for a string-matching application and is reported to be common for a wide range of applications such as image processing and graphics [30]. We think that this characteristic is determined by the fundamental microarchitecture rather than the application domain. Such regular access behavior presents opportunities for relatively simple designs that support address translation for accelerator-centric architectures.

C. Impact of Data Tiling

Data tiling techniques are widely used on customized accelerators, which group data points into tiles that are executed atomically. As a consequence, each data tile can be mapped to a different accelerator to maximize the parallelism. Also, data tiling can improve data locality for the accelerator pipeline, leading to an increased computation to communication ratio. This also enables the use of double (ping-pong) buffering.

Fig. 6. Rectangular tiling on a 32 × 32 × 32 data array into 16 × 16 × 16 tiles. Each tile accesses 16 pages and can be mapped to a different accelerator for parallel processing.

While the input data array could span several memory pages, the tile size of each input data is usually smaller than a memory page due to limited SPM resources, especially for high-dimensional arrays. As a result, neighboring tiles are likely to be in the same memory page. These tiles, once mapped to different accelerators, will trigger multiple address translation requests on the same virtual page. Figure 6 shows a simple example of tiling on a 32×32×32 input float array with 16 × 16 × 16 tile size, producing 8 tiles in total. This example is derived from the medical imaging applications, while the sizes are picked for illustration purposes only. Since the first two dimensions exactly fit into a 4KB page, 32 pages in total are allocated for the input data. Processing one tile requires accessing 16 of the 32 pages. However, mapping each tile to a different accelerator will trigger 16 × 8 = 128 address translation requests, which is 4 times more than the minimum 32 requests. This duplication in address translation requests must be resolved so that additional translation service latency can be avoided. The simple coalescing logic used in GPUs will not be sufficient because concurrently running accelerators are not designed to execute in lockstep.
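The request counts in this example can be checked with a short calculation. The sketch below (illustrative only; it uses the array, tile, and page sizes from Figure 6) counts the distinct 4KB pages touched by each 16 × 16 × 16 tile of a row-major 32 × 32 × 32 float array, and sums the per-tile counts that independently translating accelerators would issue.

```cpp
#include <cstdio>
#include <set>

int main() {
  constexpr int N = 32, T = 16, PAGE = 4096;            // array dim, tile dim, page size
  constexpr long ELEM = sizeof(float);                  // 4-byte elements
  std::set<long> all_pages;                             // unique pages over the whole array
  long per_tile_total = 0;

  for (int tz = 0; tz < N; tz += T)                     // enumerate the 8 tiles
    for (int ty = 0; ty < N; ty += T)
      for (int tx = 0; tx < N; tx += T) {
        std::set<long> tile_pages;                      // unique pages touched by this tile
        for (int z = tz; z < tz + T; ++z)
          for (int y = ty; y < ty + T; ++y)
            for (int x = tx; x < tx + T; ++x) {
              long addr = ((long(z) * N + y) * N + x) * ELEM;   // row-major offset
              tile_pages.insert(addr / PAGE);
              all_pages.insert(addr / PAGE);
            }
        per_tile_total += tile_pages.size();            // 16 pages per tile
      }
  // Prints: per-tile translations = 128, unique pages = 32 (a 4x duplication).
  std::printf("per-tile translations = %ld, unique pages = %zu\n",
              per_tile_total, all_pages.size());
}
```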
D. Address Translation Latency Sensitivity

While CPUs expose memory-level parallelism (MLP) using large instruction windows, and GPUs leverage their extensive multithreading to issue bursts of memory references, accelerators generally lack architectural support for fine-grained latency hiding. As discussed earlier, the performance of customized accelerators relies on predictable accesses to the local SPM. Therefore, the computation pipeline cannot start until the entire input data tile is ready. To alleviate this problem, double buffering techniques are commonly used to overlap communication with computation—processing on one buffer while transferring data on the other. However, such coarse-grained techniques require a careful design to balance communication and computation, and can be ineffective in tolerating long-latency memory operations, especially page walks on TLB misses.
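As a concrete illustration of double buffering (a generic software sketch under our own assumptions, not the paper's hardware implementation), the loop below computes on tile t from one SPM buffer while the transfer of tile t+1 fills the other. The overlap only hides latencies that fit within one tile's compute time; a page walk of several hundred cycles at the head of a transfer shows up directly as pipeline idle time.

```cpp
#include <algorithm>
#include <cstddef>
#include <future>
#include <vector>

// Hypothetical double (ping-pong) buffering sketch: compute on spm[cur]
// while the next tile is being loaded into spm[1 - cur].
// 'out' must hold at least (in.size() / tile_elems) * tile_elems elements.
void process_all_tiles(const std::vector<float>& in, std::vector<float>& out,
                       std::size_t tile_elems) {
  std::vector<float> spm[2] = {std::vector<float>(tile_elems),
                               std::vector<float>(tile_elems)};
  const std::size_t num_tiles = in.size() / tile_elems;
  int cur = 0;

  // Preload the first tile (stand-in for a DMA bulk read).
  std::copy(in.begin(), in.begin() + tile_elems, spm[cur].begin());

  for (std::size_t t = 0; t < num_tiles; ++t) {
    // Kick off the transfer of the next tile into the other buffer.
    std::future<void> next_load;
    if (t + 1 < num_tiles)
      next_load = std::async(std::launch::async, [&, t] {
        std::copy(in.begin() + (t + 1) * tile_elems,
                  in.begin() + (t + 2) * tile_elems, spm[1 - cur].begin());
      });

    // Compute on the current buffer while the transfer is in flight.
    for (std::size_t i = 0; i < tile_elems; ++i)
      out[t * tile_elems + i] = spm[cur][i] * 2.0f + 1.0f;

    if (next_load.valid()) next_load.get();  // wait for the overlapped transfer
    cur = 1 - cur;                           // swap ping-pong buffers
  }
}
```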
Fig. 7. Geometric mean slowdown over all benchmarks with varied address translation latencies, with LPCIP being the most sensitive benchmark

To further demonstrate latency sensitivity, we run simulations with varied address translation latencies added to each memory reference. Figure 7 presents the performance slowdown of LPCIP and the geometric mean slowdown over all benchmarks from additional latency. In general, address translation latency within 8 cycles can be tolerated by double buffering. Any additional latency beyond 16 cycles significantly degrades overall system performance. LPCIP shows the highest sensitivity to additional cycles among all benchmarks since the accelerator issues dynamic memory accesses during the pipelined processing, which is beyond the coverage of double buffering. While GPUs are reportedly able to tolerate 600 additional memory access cycles with a maximum slowdown of only 5% [32], the performance of accelerators will be decreased by 5x with the same additional cycles.

Such immense sensitivity poses serious challenges to designing an efficient address translation support for accelerators: 1) TLBs must be carefully designed to provide low access latency, 2) since page walks incur long latency which could be a few hundred cycles, TLB structures must be effective in reducing the number of page walks, and 3) for page walks that cannot be avoided, the page walker must be optimized for lower latency.

III. SIMULATION METHODOLOGY

Simulation. We use PARADE [33], an open-source cycle-accurate full-system simulator, to evaluate the accelerator-centric architecture. PARADE extends the gem5 [34] simulator with high-level synthesis support [35] to accurately model the accelerator module, including the customized data path, the associated SPM and the DMA interface. We use CACTI [36] to estimate the area of TLB structures based on 32nm process technology.

We model an 8-issue out-of-order X86-64 CPU core at 2GHz with 32KB L1 instruction and data cache, 2MB L2 cache and a per-core MMU. We implement a wide spectrum of accelerators, as shown in Table I, where each accelerator can issue 64 outstanding memory requests and has double buffering support to overlap communication with computation. The host core and accelerators share 2GB DDR3 DRAM on four memory channels. We extend PARADE to model an IOMMU with a 32-entry IOTLB [37]. To study the overhead of address translation, we model an ideal address translation in the simulator with infinite-sized TLBs and zero page walk latency for accelerators. We assume that a 4KB page size is used in the system for the best compatibility. The impact of using large pages will be discussed in Section V-B. Table II summarizes the major parameters used in our simulation.

TABLE I
BENCHMARK DESCRIPTIONS WITH INPUT SIZES AND NUMBER OF HETEROGENEOUS ACCELERATORS [13]

Domain                      | Application        | Algorithmic Functionality                                                                      | Input Size                                                   | Acc Types2
Medical Imaging             | Deblur             | Total variation minimization and deconvolution                                                 | 128 slices of images, each image of size 128 × 128           | 4
Medical Imaging             | Denoise            | Total variation minimization                                                                   | 128 slices of images, each image of size 128 × 128           | 2
Medical Imaging             | Registration       | Linear algebra and optimizations                                                               | 128 slices of images, each image of size 128 × 128           | 2
Medical Imaging             | Segmentation       | Dense linear algebra, and spectral methods                                                     | 128 slices of images, each image of size 128 × 128           | 1
Commercial from PARSEC [31] | BlackScholes       | Stock option price prediction using floating point math                                        | 256K sets of option data                                     | 1
Commercial from PARSEC [31] | StreamCluster      | Clustering and vector arithmetic                                                               | 64K 32-dimensional streams                                   | 5
Commercial from PARSEC [31] | Swaptions          | Computing swaption prices by Monte Carlo simulation                                            | 8K sets of option data                                       | 4
Computer Vision             | Disparity Map      | Calculate sums of absolute differences and integral image representations using vector arithmetic | Image pairs of size 64 × 64, 8 × 8 window, 64 max. disparity | 4
Computer Vision             | LPCIP Desc         | Log-polar forward transformation of image patch around each feature                            | 128K features from images of size 640 × 480                  | 1
Computer Navigation         | EKF SLAM           | Partial derivative, covariance, and spherical coordinate computations                          | 128K sets of sensor data                                     | 2
Computer Navigation         | Robot Localization | Monte Carlo Localization using probabilistic model and particle filter                         | 128K sets of sensor data                                     | 1

TABLE II
PARAMETERS OF THE BASELINE ARCHITECTURE

Component   | Parameters
CPU         | 1 X86-64 OoO core @ 2GHz, 8-wide issue, 32KB L1, 2MB L2
Accelerator | 4 instances of each accelerator, 64 outstanding memory references, double buffering enabled
IOMMU       | 4KB page, 32-entry IOTLB
DRAM        | 2GB, 4 channels, DDR3-1600

Workloads. To provide a quantitative evaluation of our address translation proposal, we use a wide range of applications that are open-sourced together with PARADE. These applications are categorized into four domains: medical imaging, computer vision, computer navigation and commercial benchmarks from PARSEC [31]. A brief description of each application, its input size, and the number of different accelerator types involved is specified in Table I. Each application may call one or more types of accelerators to perform different functionalities corresponding to the algorithm in various phases. In total, we implement 25 types of accelerators.2 To achieve maximum performance, multiple instances of the same type can be invoked by the host to process in parallel. By default, we use four instances of each type in our evaluation unless otherwise specified. There will be no more than 20 active accelerators; others will be powered off depending on which applications are running.

2 Some of the types are shared across different applications. For example, the Rician accelerator is shared by Deblur and Denoise; the Gaussian accelerator is shared by Deblur and Registration.

IV. DESIGN AND EVALUATION OF ADDRESS TRANSLATION SUPPORT

The goal of this paper is to design an efficient address translation support for accelerator-centric architectures. After carefully examining the distinct memory access behaviors of customized accelerators in Section II, we propose the corresponding TLB and MMU designs with quantitative evaluations step by step.

A. Gap between the Baseline IOMMU Approach and the Ideal Address Translation

Figure 2 shows the performance of the baseline IOMMU approach relative to the ideal address translation with infinite-sized TLBs and zero page walk latency. Since an IOMMU-only configuration requires each memory reference to be translated by the centralized hardware interface, performance suffers from frequent trips to the IOMMU. On one hand, benchmarks with large page reuse distances, such as the medical imaging applications, experience IOTLB thrashing due to the limited capacity. In such cases, the IOTLB cannot provide effective translation caching, leading to a large number of long-latency page walks. On the other hand, while the IOTLB may satisfy the demand of some benchmarks with small page reuse distances, such as computer navigation applications, the IOMMU lacks efficient page walk handling; this significantly degrades system performance on IOTLB misses. As a result, the IOMMU approach achieves only an average of 12.3% relative to the performance of the ideal address translation, which leaves a huge performance gap.

In order to bridge the performance gap, we propose to reduce the address translation overhead in three steps: 1) provide low-latency access to translation caching by allowing customized accelerators to store physical addresses locally in TLBs, 2) reduce the number of page walks by exploiting page sharing between accelerators resulting from data tiling, and 3) minimize the page walk latency by offloading the page walk request to the host core MMU. We detail our designs and evaluations in the following subsections.

B. Private TLBs

To enable more capable devices, such as accelerators, recent IOMMU proposals allow IO devices to cache address translation in devices [22]. This reduces the address translation latency on TLB hits and relies on the page walker in the IOMMU on TLB misses. However, the performance impact and design tradeoffs are not scrutinized in the literature.

1) Implementation

TLB sizes. While a large TLB may have a higher hit rate, smaller TLB sizes are preferable in providing lower access latency, since customized accelerators are very sensitive to the address translation latency. Moreover, TLBs are reportedly power-hungry and even TLB hits consume a significant amount of dynamic energy [38]. Thus, TLB sizes must be carefully chosen.

Commercial CPUs currently implement 64-entry per-core L1 TLBs, and recent GPU studies [25], [26] introduce 64-128 entry post-coalescer TLBs. As illustrated in Section II-B, customized accelerators have much more regular and less divergent access patterns compared to general-purpose CPUs and GPUs; this fact endorses a relatively small private TLB size for shorter access latency and lower energy consumption. Next we quantitatively evaluate the performance effects of various private TLB sizes. We assume a least-recently-used (LRU) replacement policy to capture locality.

Fig. 8. Performance for benchmarks other than medical imaging with various private TLB sizes, assuming fixed access latency.

a) Private TLB size for all benchmarks except medical imaging applications: Figure 8 illustrates that all seven evaluated benchmarks greatly benefit from adding private TLBs. In general, small TLB sizes such as 16-32 entries are sufficient to achieve most of the improved performance. The cause of this gain is twofold: 1) accelerators benefit from reduced access time to locally cached translations; 2) even though the capacity is not enlarged compared to the 32-entry IOTLB, accelerators enjoy private TLB resources rather than sharing the IOTLB. LPCIP receives the largest performance improvement from having a private TLB. This matches the observation that it has the highest sensitivity to address translation latency due to dynamic memory references during pipelined processing, since providing a low-latency private TLB greatly reduces pipeline stalls.

Fig. 9. Performance for medical imaging benchmarks with various private TLB sizes, assuming fixed access latency.

b) Private TLB size for medical imaging applications: Figure 9 shows the evaluation of the four medical imaging benchmarks. These benchmarks have a larger memory footprint with more input array references, and the three-dimensional access pattern (accessing multiple pages per array reference as demonstrated in Figure 6) further stresses the capacity of private TLBs. While apparently the 256-entry achieves the best performance of the four, the increased TLB access time would decrease performance for other benchmarks, especially latency-sensitive ones such as LPCIP. In addition, a large TLB will also consume more energy.

Non-blocking design. Most CPUs and GPUs use blocking TLBs since the latency can be hidden with wide MLP. In contrast, accelerators are sensitive to long TLB miss latencies. Blocking accesses on TLB misses will stall the data transfer, reducing the memory bandwidth utilization, which results in performance degradation. In light of this, our design provides non-blocking hit-under-miss support to overlap a TLB miss with hits to other entries.

Correctness issues. In practice, correctness issues, including page faults and TLB shootdowns [39], have negligible effects on the experimental results. We discuss them here for implementation purposes. While page faults can be handled by the IOMMU, accelerator private TLBs must support TLB shootdowns from the system. In a multicore accelerator system, if the mapping of memory pages is changed, all sharers of the virtual memory are notified to invalidate TLBs using TLB shootdown inter-processor interrupts (IPIs). We assume shootdowns are supported between CPU TLBs and the IOTLB based on this approach. We also extend the IOMMU to send invalidations to accelerator private TLBs to flush stale values.
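A behavioral sketch of such a private TLB is given below (our illustration, not RTL from the paper): a small, fully associative table with LRU replacement, matching how we model the 16-32 entry design points. The non-blocking hit-under-miss behavior and shootdown support are only indicated through the interface and comments.

```cpp
#include <cstddef>
#include <cstdint>
#include <list>
#include <optional>
#include <unordered_map>
#include <utility>

// Illustrative model of a small per-accelerator TLB: fully associative,
// LRU replacement, virtual page number (VPN) -> physical page number (PPN).
class PrivateTlb {
 public:
  explicit PrivateTlb(std::size_t entries) : capacity_(entries) {}

  // Returns the cached PPN on a hit; std::nullopt on a miss. A miss would be
  // forwarded to the shared TLB / IOMMU while hits to other entries proceed
  // (non-blocking hit-under-miss), which this functional model does not time.
  std::optional<uint64_t> lookup(uint64_t vpn) {
    auto it = map_.find(vpn);
    if (it == map_.end()) return std::nullopt;
    lru_.splice(lru_.begin(), lru_, it->second);    // move the entry to the MRU position
    return it->second->second;
  }

  // Fills a translation returned by the shared TLB or a page walk.
  void insert(uint64_t vpn, uint64_t ppn) {
    invalidate(vpn);                                // drop any stale copy first
    if (lru_.size() == capacity_) {                 // evict the LRU entry when full
      map_.erase(lru_.back().first);
      lru_.pop_back();
    }
    lru_.emplace_front(vpn, ppn);
    map_[vpn] = lru_.begin();
  }

  // TLB shootdown support: invalidate one page mapping.
  void invalidate(uint64_t vpn) {
    if (auto it = map_.find(vpn); it != map_.end()) {
      lru_.erase(it->second);
      map_.erase(it);
    }
  }

 private:
  std::size_t capacity_;
  std::list<std::pair<uint64_t, uint64_t>> lru_;    // MRU at front: (vpn, ppn)
  std::unordered_map<uint64_t, std::list<std::pair<uint64_t, uint64_t>>::iterator> map_;
};
```

A 32-entry instance of this structure corresponds to the design point chosen in the evaluation below.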
2) Evaluation

We find that low TLB access latency provided by local translation caching is key to the performance of customized accelerators. While the optimal TLB size appears to be application-specific, we choose 32 entries for a balance between latency and capacity for our benchmarks, and also for lower area and power consumption. On average, the 32-entry private TLB achieves 23.3% of the ideal address translation performance, one step up from the 12.3% IOMMU baseline, with an area overhead of around 0.7% to each accelerator. Further improvements are possible by customizing TLB sizes and supporting miss-under-miss targeting individual applications. We leave these for future work.

C. A Shared Level-Two TLB

Fig. 10. The structure of two-level TLBs

Figure 10 depicts the basic structure of our two-level TLB design, and illustrates two orthogonal benefits provided by adding a shared TLB. First, the requested entry may have been previously inserted into the shared TLB by a request from the same accelerator. This is the case when the private TLB size is not sufficient to capture the reuse distance, so that the requested entry is evicted from the private TLB earlier. Specifically, medical imaging benchmarks would benefit from having a large shared TLB. Second, the requested entry may have been previously inserted into the shared TLB by a request from another accelerator. This case is common when the data tile size is smaller than a memory page, and neighboring tiles within a memory page are mapped to different accelerators (illustrated in Section II-C). Once an entry is brought into the shared TLB by one accelerator, it is immediately available to other accelerators, leading to shared TLB hits. Requests that also miss in the shared TLB need to access the IOMMU for page table walking.

1) Implementation

TLB size. While our design trades capacity for lower access latency in private TLBs, we provide a relatively larger capacity in the shared TLB to avoid thrashing. Based on the evaluation of the performance impact of private TLB sizes, we assume a 512-entry shared TLB for the four-accelerator-instances case, where an LRU replacement policy is used. Though it virtually provides a 128-entry level-two TLB for each sharer, a shared TLB is more flexible in allocating resources to a specific sharer, resulting in improved performance.

Non-blocking design. Similar to the private TLBs, our shared TLB design also provides non-blocking hit-under-miss support to overlap a TLB miss with hit accesses to other entries.

Inclusion policy. In order to enable the aforementioned use cases, entries requested by an accelerator must be inserted in both the private and the shared TLB. We adopt the approach in [40] to implement a mostly-inclusive policy, where each TLB is allowed to make independent replacement decisions. This relaxes the coordination between private and shared TLBs and simplifies the control logic.

Placement. We provide a centralized shared level-two TLB not tied to any of the accelerators. This requires each accelerator to send requests through the interconnect to access the shared TLB, which adds additional latency. However, we find that the benefit completely outweighs the added access latency for the current configuration. Much larger TLB sizes or more sharers (we will discuss this in Section V-A) could benefit from a banked design, but such use cases are not well established.

Correctness issues. In addition to the TLB shootdown support in private TLBs, the shared TLB also needs to be checked for invalidation. The reason is that in a mostly-inclusive policy, entries that are previously brought in can be present in both levels.
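The lookup and fill path of the two-level organization can be summarized with the sketch below, which reuses the PrivateTlb model from the previous subsection (the shared level is simply a larger instance of the same structure; the page-walk callback is a placeholder). It shows the mostly-inclusive fill: a translation obtained from a page walk is inserted into both the shared TLB and the requesting accelerator's private TLB, so a later request from a neighboring accelerator on the same page hits in the shared level.

```cpp
#include <cstdint>
#include <functional>

// Illustrative translation path for one accelerator request in the two-level design.
// 'PrivateTlb' is the LRU structure sketched in Section IV-B; the shared level uses
// the same structure with 512 entries. 'walk_page_table' stands in for the page walk
// (performed by the IOMMU here; offloaded to the host core MMU in Section IV-D).
uint64_t translate(uint64_t vpn,
                   PrivateTlb& private_tlb,             // this accelerator's 32-entry TLB
                   PrivateTlb& shared_tlb,              // 512-entry TLB shared by all instances
                   const std::function<uint64_t(uint64_t)>& walk_page_table) {
  if (auto ppn = private_tlb.lookup(vpn)) return *ppn;  // L1 hit: low-latency local access

  if (auto ppn = shared_tlb.lookup(vpn)) {              // L2 hit: entry filled earlier by this
    private_tlb.insert(vpn, *ppn);                      // or another accelerator on the same page
    return *ppn;
  }

  uint64_t ppn = walk_page_table(vpn);                  // miss in both levels: one page walk
  shared_tlb.insert(vpn, ppn);                          // mostly-inclusive fill into both levels;
  private_tlb.insert(vpn, ppn);                         // each level replaces independently (LRU)
  return ppn;
}
```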
2) Evaluation

In contrast to private TLBs where low access latency is the key, the shared TLB mainly aims to reduce the number of page walks in two ways: 1) providing a larger capacity to capture the page locality for applications, which is difficult to achieve in private TLBs without sacrificing access latency, and 2) reducing TLB misses on common virtual pages by enabling translation sharing between concurrent accelerators.

Fig. 11. Page walk reduction compared to the IOMMU baseline3

Figure 11 sheds light upon the page walk reduction3 in our two-level TLB design. Compared to private TLBs only, adding a shared TLB consistently reduces the overall number of page walk requests. For medical imaging benchmarks, especially Deblur, Denoise and Registration, which suffer from insufficient private TLB capacity, the shared TLB significantly cuts the number of page walks by providing more resources. For benchmarks that already find enough entries in private TLBs, such as Segmentation and BlackScholes, the shared TLB reduces the number of page walks by decreasing TLB misses on common virtual pages, which are due to data tiling effects. In general, the two-level TLB design achieves 76.8% page walk reduction compared to the IOMMU-only approach, leaving only a small gap to an ideal two-level TLB. The 512-entry shared TLB only incurs around 0.3% area overhead to the four sharing accelerators. StreamCluster and DisparityMap involve multiple iterations over the same input data using different types of accelerators. The result shows a gap between the 512-entry case and the ideal case because the size of the input data exceeds the reach of the 512-entry TLB, but has no problem fitting in the infinite-sized TLB, which eliminates cold misses at the beginning of each iteration.

3 Note that the number of page walks does not equal the number of translation requests even in the IOMMU case, since the IOTLB can filter part of them.

Fig. 12. Page walk reduction from adding a 512-entry shared TLB to infinite-sized private TLBs

To further isolate the effect of page sharing caused by data tiling, we run simulations with infinite-sized private TLBs so that the capacity issue is eliminated. Figure 12 shows the page walk reduction by adding a 512-entry shared TLB to infinite-sized private TLBs. As infinite-sized private TLBs leave only cold misses, the shared TLB exploits page sharing among those misses and filters duplicate ones, resulting in a 41.7% reduction on average. Notice that not all benchmarks greatly benefit from having a shared TLB, which is due to the different tiling mechanism of each application. While developing a TLB-aware tiling mechanism could potentially

reduce duplicate TLB misses, it is not easy to do so when the input data size and tile size are user-defined and thus can be arbitrary. We leave this for future work.

The remainder of the page walks are due to cold TLB misses, where altering TLB sizes or organization cannot make a difference. Therefore, we propose an efficient page walk handling mechanism to minimize the latency penalty introduced by those page walks.

D. Host Page Walks

As the IOMMU is not capable of delivering efficient page walks, the performance of accelerators still suffers from excessively long page walk latency even with a reduced number of page walks. While providing dedicated full-blown MMU support for accelerators could potentially alleviate this problem, there may not be a need to establish a new piece of hardware—especially when off-the-shelf resources can be readily leveraged by accelerators.

This opportunity lies in the coordination between the accelerators and the host core that launches them. After the computation task has been offloaded from the host core to accelerators, a common practice is to put the host core into spinning so that the core can react immediately to any status change of the accelerators. As a result, during the execution of accelerators, the host core MMU and data cache are less stressed; this can be used to service translation requests from the accelerators invoked by this core. By offloading page walk operations to the host core MMU, the following benefits can be achieved:

First, MMU cache support is provided by the host core MMU. Commercial CPUs have introduced MMU caches to store upper-level entries in page walks [27], [29]. The page walker accesses the MMU cache to determine if one or more levels of walks can be skipped before issuing memory references. As characterized in Section II-B, accelerators have extremely regular page access behaviors with small page divergence. Therefore, the MMU cache can potentially work very well with accelerators by capturing the good locality in upper levels of the page table. We expect that the MMU cache is able to skip all three non-leaf page table accesses most of the time, leaving only one memory reference required for each page walk. We assume an AMD-style page walk cache [27] in this paper, which stores entries in a data cache fashion. However, other implementations, such as Intel's paging structure cache [29], could provide similar benefits.

Second, PTE (page table entry) locality within a cache line provides an opportunity to amortize the cost of memory accesses over more table walks. Unlike the IOMMU, the CPU MMU has data cache support, which means a PTE is first brought from the DRAM to the data cache and then to the MMU. A future access to the same entry, if missed in the MMU cache, could still hit in the data cache with much lower latency than a DRAM access. More importantly, as one cache line could contain eight PTEs, one DRAM access for a PTE potentially prefetches seven consecutive ones, so that future references to these PTEs could be cache hits. While this may not benefit CPU or GPU applications with large page divergence, we have shown that the regularity of accelerator TLB misses could permit improvement through prefetching.

Third, resources are likely warmed up by previous host core operations within the same application's virtual address space. Since a unified virtual address space permits close coordination between the host core and the accelerators, both can work on the same data with either general-purpose manipulation or high-performance specialization. Therefore, the host core operations could very well warm up the resources for accelerators. Specifically, a TLB miss triggered by the host core brings both upper-level entries to the MMU cache and PTEs to the data cache, leading to reduced page walk latency for accelerators in the near future. While the previous two benefits can also be obtained through any dedicated MMU with an MMU cache and data cache, this benefit is unique to host page walks.
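The benefit of the MMU cache can be illustrated with a simplified cost model of a four-level x86-64 walk (a sketch under our own assumptions; the structure and latencies are illustrative, with the leaf access standing in for a data cache hit). When the three upper levels hit in the MMU cache, only the leaf PTE reference is issued—the common case we expect for the low-divergence access streams of accelerators.

```cpp
#include <cstdint>
#include <unordered_set>

// Simplified cost model of a 4-level x86-64 page walk with an MMU cache that
// holds upper-level (non-leaf) entries. Latencies are illustrative only.
struct PageWalkModel {
  std::unordered_set<uint64_t> mmu_cache;   // tags of cached upper-level entries
  int mmu_cache_hit_cycles = 3;
  int memory_ref_cycles = 20;               // leaf PTE reference (data cache hit at best)

  // Returns the modeled latency of one page table walk for virtual address vaddr.
  int walk_latency(uint64_t vaddr) {
    int latency = 0;
    // Levels 4..2 (PML4, PDPT, PD): resolved from the MMU cache when possible.
    for (int level = 4; level >= 2; --level) {
      // Tag = the virtual-address bits that index the tree down to this level (9 bits/level).
      uint64_t tag = (uint64_t(level) << 60) | (vaddr >> (12 + 9 * (level - 1)));
      if (mmu_cache.count(tag)) {
        latency += mmu_cache_hit_cycles;     // upper level found in the MMU cache
      } else {
        latency += memory_ref_cycles;        // fetch the non-leaf entry from memory
        mmu_cache.insert(tag);               // cache it for later walks
      }
    }
    return latency + memory_ref_cycles;      // level 1: the leaf PTE is always fetched
  }
};
```

With all three upper levels hitting, this model gives 3 × 3 + 20 = 29 cycles before interconnect and queueing delays; the 58-cycle average reported below additionally includes those effects and occasional data cache misses.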
1) Implementation

Modifications to accelerator TLBs. In addition to the accelerator ID bits in each TLB entry, the shared TLB also needs to store the host process (or context) ID within each entry. On a shared TLB miss, a translation service request with the virtual address and the process ID is sent through the interconnect to the host core operating within the same virtual address space.

Modifications to host MMUs. The host core MMU must be able to distinguish accelerator page walk requests from the core requests, so that PTEs can be sent back to the accelerator shared TLB instead of being inserted into the host core TLBs after page walking. As CPU MMUs are typically designed to handle one single page walk at a time, a separate port and request queue for accelerator page walk requests is required to buffer multiple requests. An analysis of the number of outstanding shared TLB misses is presented in Section V-A. Demultiplexer logic is also required for the MMU to send responses with the requested PTE back to the accelerator shared TLB.

Correctness issues. In contrast to implementing a dedicated MMU for accelerators, where coordination with the host core MMU is required on page fault handling, our approach requires no additional support for system-level correctness issues.

If a page walk returns a NULL pointer on the virtual address requested by accelerators, the faulting address is written to the core's CR2 register and an interrupt is raised. The core can proceed with the normal page fault handling process without the knowledge of the requester of the faulting address. The MMU is signaled once the OS has written the page table with the correct translation. Then, the MMU finishes the page walk to send the requested PTE to the accelerator-shared TLB. The support for TLB shootdowns works the same as in the IOMMU case.
2) Evaluation

To evaluate the effects of host page walks, we simulate an 8KB page walk cache with 3-cycle access latency, and a 2MB data cache with 20-cycle latency. If the PTE request misses in the data cache, it is forwarded to the off-chip DRAM, which typically takes more than 200 cycles. We faithfully simulate the interconnect delays in a mesh topology.

Fig. 13. Average page walk latency when offloading page walks to the host core MMU

We first evaluate the capability of the host core MMU by showing the average latency of page walks that are triggered by accelerators. Figure 13 shows that the host core MMU consistently provides low-latency page walks across all benchmarks, with an average of only 58 cycles. Given that the latency of four consecutive data cache accesses is 80 cycles plus interconnect delays, most page walks should be a combination of MMU cache hits and data cache hits, with DRAM accesses only in rare cases. This is partly due to the warm-up effects where cold misses in both the MMU cache and data cache are minimized. Based on this, it is difficult for a dedicated MMU to provide even lower page walk latency than the host core MMU.

Fig. 14. Average translation latency of (a) all requests; (b) requests that actually trigger page walks

We further analyze the average translation latency of each design to relate to our latency sensitivity study. As shown in Figure 14(a), the average translation latency across all benchmarks for designs with private TLBs and two-level TLBs is 101.1 and 27.7 cycles, respectively. This level of translation latency, if uniformly distributed, should not result in more than a 50% performance slowdown according to Figure 7. However, as shown in Figure 14(b), the average translation latency of the requests that trigger page walks is well above 1000 cycles for the two designs that use an IOMMU. This is due to both long page walk latency and queueing latency when there are multiple outstanding page walk requests. With such long latencies added to the runtime, accelerators become completely ineffective in latency hiding, even on shorter latencies which could otherwise be tolerated by double buffering. In contrast, host page walks reduce page walk latencies and meanwhile minimize the variance of address translation latency. Therefore, the overall performance benefits from a much lower average address translation latency (3.2 cycles) and a decreased level of variation.

E. Summary: Two-level TLBs and Host Page Walks

Overall design. In summary, to provide efficient address translation support for accelerator-centric architectures, we first enhance the IOMMU approach by designing a low-latency private TLB for each accelerator. Second, we present a shared level-two TLB design to enable page sharing between accelerators, reducing duplicate TLB misses. The two-level TLB design effectively reduces the number of page walks by 76.8%. Finally, we propose to offload page walk requests to the host core MMU so that we can efficiently handle page walks with an average latency of 58 cycles. Table III summarizes the parameters of key components in our design.

TABLE III
CONFIGURATION OF OUR PROPOSED ADDRESS TRANSLATION SUPPORT

Component    | Parameters
Private TLBs | 32-entry, 1-cycle access latency
Shared TLB   | 512-entry, 3-cycle access latency
Host MMU     | 4KB page, 8KB page walk cache [28]
Interconnect | Mesh, 4-stage routers

Fig. 15. Total execution time normalized to ideal address translation

Overall system performance. Figure 15 compares the performance of different designs against the ideal address translation. Note that the first three designs rely on the IOMMU for page walks, which could take more than 900 cycles. Our three proposed designs, as shown in Figure 15, achieve 23.3%, 51.8% and 93.6% of the ideal address translation performance, respectively, while the IOMMU baseline only achieves 12.3% of the ideal performance. The performance gap between our combined approach (two-level TLBs with host page walks) and the ideal address translation is reduced to 6.4% on average, which is in the range deemed acceptable in the CPU world (5-15% overhead of runtime [27], [41], [40], [42]).

V. DISCUSSION

A. Impact of More Accelerators

While we have shown that significant performance improvement can be achieved for four accelerator instances by sharing resources including the level-two TLB and the host MMU, it is possible that resource contention with too many sharers results in performance slowdown. Specifically, since CPU MMUs typically handle one page walk at a time, the host core MMU can potentially become a bottleneck as the number of outstanding shared TLB misses increases. To evaluate the impact of launching more accelerators by the same host core, we run simulations with 16 accelerator instances in the system with the same configuration summarized in Table III.

Fig. 16. The average number of outstanding shared TLB misses of the 4-instance and 16-instance cases

We compare the average number of outstanding shared TLB misses4 for the 4-instance and 16-instance cases in Figure 16. Our shared TLB provides a consistent filtering effect, requiring on average only 1.3 and 4.9 outstanding page walks at the same time in the 4-instance and 16-instance cases, respectively. While more outstanding requests lead to a longer waiting time, subsequent requests are likely to hit in the page walk cache and data cache due to the regular page access pattern, thus requiring less service time. Using a dedicated MMU with a threaded page walker [26] could reduce the waiting time. However, the performance improvement may not justify the additional hardware complexity, even for GPUs [25].

4 As TLB misses are generally sparse during the execution, we only sample the number when there is at least one TLB miss.

Fig. 17. Performance of launching 16 accelerator instances relative to ideal address translation

Figure 17 presents the overall performance of our proposed address translation support relative to the ideal address translation when there are 16 accelerator instances. We can see that even with the same amount of shared resources, launching 16 accelerator instances does not have a significant impact on the efficiency of address translation, with the overhead being 7.7% on average. While even more active accelerators promise greater parallelism, we already observe diminishing returns in the 16-instance case, as the interconnect and memory bandwidth are saturating.

Another way of having more active accelerators in the system is by launching multiple accelerators using multiple CPU cores. However, the page walker in each core MMU will not experience higher pressure in such scenarios, since our mechanism requires that accelerators only offload TLB misses to the host core that operates within the same application's virtual address space. A larger shared TLB may be required for more sharers, where a banked placement could be more efficient. We leave this for future work.

B. Large Pages

Large pages [43] can potentially reduce TLB misses by enlarging the TLB reach, and speed up misses by requiring fewer memory accesses while walking the page table. To reduce memory management overhead, an OS with Transparent Huge Page [44] support can automatically construct large pages by allocating contiguous baseline pages aligned at the large page size. As a result, developers no longer need to identify the data that could benefit from using large pages and explicitly request the allocation of large pages.

As we have shown that accelerators typically feature bulk transfers of consecutive data and are sensitive to long memory latencies, large pages are expected to improve the overall performance of accelerators by reducing TLB misses and page walk latencies. We believe this approach is orthogonal to ours and can be readily applied to the proposed two-level TLBs and host page walk design. It is worthwhile to note that the page sharing effect that results from tiling high-dimensional data will become more significant under large pages, leading to an increased number of TLB misses on common pages. Our shared TLB design is shown to be effective in alleviating this issue.
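For completeness, the sketch below shows one way a user-space application could opt a shared buffer into large pages on Linux (our example, not part of the paper's system; error handling is omitted). With Transparent Huge Page support, the kernel can also promote such regions automatically without any application change, which is the scenario described above.

```cpp
#include <sys/mman.h>
#include <cstddef>
#include <cstdlib>

// Allocate a 2MB-aligned buffer and hint the kernel to back it with huge pages.
// With THP enabled, an accelerator sharing this virtual address space would then
// need far fewer TLB entries to cover the same data footprint.
void* alloc_large_page_buffer(std::size_t bytes) {
  constexpr std::size_t kLargePage = 2 * 1024 * 1024;        // x86-64 2MB large page
  std::size_t rounded = (bytes + kLargePage - 1) & ~(kLargePage - 1);
  void* buf = nullptr;
  if (posix_memalign(&buf, kLargePage, rounded) != 0) return nullptr;
  madvise(buf, rounded, MADV_HUGEPAGE);                      // request THP backing (Linux)
  return buf;
}
```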

VI. R ELATED W ORK Such a static TLB approach requires allocation of pinned
memory and kernel driver intervention on TLB refills. As
Address Translation on CPUs. To meet the ever-increasing a result, programmers need to work with special APIs and
memory demands of applications, commercial CPUs have manually manage various buffers, which can be a giant pain.
included one or more levels of TLBs [4], [45] and pri- Xilinx Zynq SoC provides a coherent interface between the
vate low-latency caches [28], [29] in the per-core MMU ARM Cortex-A9 processor and FPGA programmable logic
to accelerate address translation. These MMU caches, with through the accelerator coherency port [53]. While prototypes
different implementations, have been shown to greatly increase in [23], [24] are based on this platform, the address translation
performance for CPU applications [27]. Prefetching [46], [47], mechanism is not detailed. [16] assumes a system MMU
[48] techniques are proposed to speculate on PTEs that will be support for the designed hardware accelerator. However, the
referenced in the future. While such techniques benefit appli- impact on performance is not studied. [54] studies system co-
cations with regular page access patterns, additional hardware design of cache-based accelerators, but only with a simplified
such as a prefetching table is typically required. Shared last- address translation model.
level TLBs [40] and shared MMU caches [49] are proposed for Modern processors are equipped with IOMMUs [19], [21],
multicores to accelerate multithreaded applications by sharing [22] to provide address translation support for loosely cou-
translations between cores. The energy overheads of TLB resources have also been studied [38], advocating for energy-efficient TLB designs. A software mechanism has also been proposed for address translation on CPUs [50].

Address Translation on GPUs. A recent IOMMU tutorial [22] presents a detailed introduction to the IOMMU design within the AMD fused CPU-GPU architecture, with a key focus on its functionality and security. Although it also enables translation caching in devices, no design details or quantitative evaluation are revealed. To provide hardware support for virtual memory and page faults on GPUs, [25], [26] propose GPU MMU designs consisting of post-coalescer TLBs and logic to walk the page table. As GPUs can potentially require hundreds of translations per cycle due to the high parallelism in the architecture, [25] uses 4-ported private TLBs and improved page walk scheduling, while [26] uses a highly threaded page walker to serve bursts of TLB misses. ActivePointers [51] introduces a software address translation layer on GPUs that supports page fault handling without CPU involvement; however, it requires additional system abstractions for GPUs.

Based on our characterization of customized accelerators, we differentiate the address translation requirements of customized accelerators from those of GPUs in three ways. First, accelerators do not use instructions and have much more regular access patterns than GPUs, which enables a simpler private TLB design. Second, the page sharing effect between accelerators cannot be resolved using the same coalescing structure as in GPUs, since accelerators are not designed to execute in lockstep; instead, a shared TLB design is tailored to compensate for the impact of data tiling. Third, while GPUs average 60 concurrent TLB misses [26], we show that accelerators have far fewer outstanding TLB misses. Therefore, host page walks with the existing MMU cache and data cache support are sufficient to provide low page walk latency.

Current Virtual Memory Support for Accelerators. Some initial efforts have been made to support address translation for customized accelerators. The prototype of the Intel-Altera heterogeneous architecture research platform [52] uses a static 1024-entry TLB with a 2MB page size to support virtual addressing for user-defined accelerators. A similar approach is also adopted in the design of a Memcached accelerator [15]. Commercial IOMMUs provide address translation for loosely coupled devices, including customized accelerators and GPUs. rIOMMU [55] improves the throughput for devices that employ circular ring buffers, such as network and PCIe SSD controllers, but it is not intended for customized accelerators with more complex memory behaviors. With a unified address space, [56] proposes a sandboxing mechanism to protect the system against improper memory accesses. While we choose an IOMMU configuration as the baseline in this paper for its generality, the key insights of this work are applicable to other platforms with modest adjustments.

VII. CONCLUSION

The goal of this paper is to provide simple but efficient address translation support for accelerator-centric architectures. We propose a two-level TLB design and host page walks tailored to the specific challenges and opportunities of customized accelerators. We find that a relatively small and low-latency private TLB with 32 entries for each accelerator reduces page walks by 30.4% compared to the IOMMU baseline. Adding a shared 512-entry TLB eliminates 75.8% of total page walks by exploiting the page sharing resulting from data tiling. Moreover, by simply offloading page walk requests to the host core MMU, the average page walk latency can be reduced to 58 cycles. Our evaluation shows that the combined approach achieves 93.6% of the performance of ideal address translation.

This paper is the first to provide hardware support for a unified virtual address space between the host CPU and customized accelerators with marginal overhead. We hope that this paper will stimulate future research in this area and facilitate the adoption of customized accelerators.

ACKNOWLEDGMENT

We thank the anonymous reviewers for their feedback. This work is partially supported by the Center for Domain-Specific Computing under the NSF InTrans Award CCF-1436827; funding from CDSC industrial partners including Baidu, Fujitsu Labs, Google, Huawei, Intel, IBM Research Almaden and Mentor Graphics; and C-FAR, one of the six centers of STARnet, a Semiconductor Research Corporation program sponsored by MARCO and DARPA.
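To make the lookup path summarized above concrete, the sketch below is a minimal functional model of the proposed flow: a per-accelerator private TLB backed by a shared second-level TLB, with misses in both levels offloaded to the host core MMU for a page walk. The 32-entry and 512-entry capacities and the 4KB page size follow the configuration reported in this paper, but the fully associative LRU organization, the class and function names, and the placeholder page walk are illustrative assumptions for this sketch only, not the evaluated hardware design.

```cpp
// Minimal sketch of the two-level TLB lookup path (assumed organization).
#include <cstddef>
#include <cstdint>
#include <iostream>
#include <list>
#include <unordered_map>
#include <utility>

constexpr uint64_t kPageShift = 12;                       // assume 4KB pages
constexpr uint64_t kPageMask  = (1ULL << kPageShift) - 1;

// One fully associative, LRU-replaced translation cache level.
class TlbLevel {
 public:
  explicit TlbLevel(std::size_t entries) : capacity_(entries) {}

  // On a hit, writes the physical page number and refreshes the LRU order.
  bool lookup(uint64_t vpn, uint64_t* ppn) {
    auto it = map_.find(vpn);
    if (it == map_.end()) return false;
    lru_.splice(lru_.begin(), lru_, it->second.second);   // move to MRU
    *ppn = it->second.first;
    return true;
  }

  // Fills a translation, evicting the least recently used entry when full.
  void insert(uint64_t vpn, uint64_t ppn) {
    if (map_.count(vpn)) return;
    if (map_.size() == capacity_) {
      map_.erase(lru_.back());
      lru_.pop_back();
    }
    lru_.push_front(vpn);
    map_.emplace(vpn, std::make_pair(ppn, lru_.begin()));
  }

 private:
  std::size_t capacity_;
  std::list<uint64_t> lru_;                               // front = most recently used VPN
  std::unordered_map<uint64_t,
                     std::pair<uint64_t, std::list<uint64_t>::iterator>> map_;
};

// Placeholder for forwarding a miss to the host per-core MMU; a timing model
// would charge the measured page walk latency here (58 cycles on average).
uint64_t host_page_walk(uint64_t vpn) { return vpn + 0x1000; }  // fake mapping

// Per-accelerator MMU front end: private TLB first, then the shared TLB.
class AcceleratorMmu {
 public:
  explicit AcceleratorMmu(TlbLevel* shared) : shared_tlb_(shared) {}

  uint64_t translate(uint64_t vaddr) {
    const uint64_t vpn = vaddr >> kPageShift;
    uint64_t ppn;
    if (!private_tlb_.lookup(vpn, &ppn)) {
      if (shared_tlb_->lookup(vpn, &ppn)) {
        // Shared hit: another accelerator already touched a tile on this page.
        private_tlb_.insert(vpn, ppn);
      } else {
        ppn = host_page_walk(vpn);                        // both levels missed
        shared_tlb_->insert(vpn, ppn);
        private_tlb_.insert(vpn, ppn);
      }
    }
    return (ppn << kPageShift) | (vaddr & kPageMask);
  }

 private:
  TlbLevel private_tlb_{32};   // small, low-latency, per accelerator
  TlbLevel* shared_tlb_;       // 512-entry level shared by all accelerators
};

int main() {
  TlbLevel shared(512);
  AcceleratorMmu acc0(&shared), acc1(&shared);
  uint64_t p0 = acc0.translate(0x40001000);  // cold miss -> host page walk
  uint64_t p1 = acc1.translate(0x40001040);  // same page -> shared TLB hit
  std::cout << std::hex << p0 << " " << p1 << std::endl;
  return 0;
}
```

The second translation illustrates the page sharing effect of data tiling: a neighboring tile handled by another accelerator maps to the same physical page, so the request is served by the shared TLB without another page walk.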
REFERENCES

[1] H. Park, Y. Park, and S. Mahlke, "Polymorphic pipeline array: A flexible multicore accelerator with virtualized execution for mobile multimedia applications," in MICRO-42, 2009.
[2] R. Hameed, W. Qadeer, M. Wachs, O. Azizi, A. Solomatnikov, B. C. Lee, S. Richardson, C. Kozyrakis, and M. Horowitz, "Understanding sources of inefficiency in general-purpose chips," in ISCA-37, 2010.
[3] C. Johnson, D. Allen, J. Brown, S. VanderWiel, R. Hoover, H. Achilles, C.-Y. Cher, G. May, H. Franke, J. Xenedis, and C. Basso, "A wire-speed powertm processor: 2.3ghz 45nm soi with 16 cores and 64 threads," in ISSCC, 2010.
[4] M. Shah, R. Golla, G. Grohoski, P. Jordan, J. Barreh, J. Brooks, M. Greenberg, G. Levinsky, M. Luttrell, C. Olson, Z. Samoail, M. Smittle, and T. Ziaja, "Sparc t4: A dynamically threaded server-on-a-chip," IEEE Micro, vol. 32, pp. 8–19, Mar. 2012.
[5] V. Govindaraju, C.-H. Ho, T. Nowatzki, J. Chhugani, N. Satish, K. Sankaralingam, and C. Kim, "Dyser: Unifying functionality and parallelism specialization for energy-efficient computing," IEEE Micro, vol. 32, pp. 38–51, Sept. 2012.
[6] A. Krishna, T. Heil, N. Lindberg, F. Toussi, and S. VanderWiel, "Hardware acceleration in the ibm poweren processor: Architecture and performance," in PACT-21, 2012.
[7] J. Cong, M. A. Ghodrat, M. Gill, B. Grigorian, and G. Reinman, "Architecture support for accelerator-rich cmps," in DAC-49, 2012.
[8] W. Qadeer, R. Hameed, O. Shacham, P. Venkatesan, C. Kozyrakis, and M. A. Horowitz, "Convolution engine: Balancing efficiency and flexibility in specialized computing," in ISCA-40, 2013.
[9] H. Esmaeilzadeh, A. Sampson, L. Ceze, and D. Burger, "Neural acceleration for general-purpose approximate programs," in MICRO-45, 2012.
[10] R. Sampson, M. Yang, S. Wei, C. Chakrabarti, and T. F. Wenisch, "Sonic millip3de: A massively parallel 3d-stacked accelerator for 3d ultrasound," in HPCA-19, 2013.
[11] H. Esmaeilzadeh, E. Blem, R. St. Amant, K. Sankaralingam, and D. Burger, "Dark silicon and the end of multicore scaling," in ISCA-38, 2011.
[12] G. Venkatesh, J. Sampson, N. Goulding, S. Garcia, V. Bryksin, J. Lugo-Martinez, S. Swanson, and M. B. Taylor, "Conservation cores: Reducing the energy of mature computations," in ASPLOS-XV, 2010.
[13] J. Cong, M. A. Ghodrat, M. Gill, B. Grigorian, and G. Reinman, "Architecture support for domain-specific accelerator-rich cmps," ACM Trans. Embed. Comput. Syst., vol. 13, pp. 131:1–131:26, Apr. 2014.
[14] O. Kocberber, B. Grot, J. Picorel, B. Falsafi, K. Lim, and P. Ranganathan, "Meet the walkers: Accelerating index traversals for in-memory databases," in MICRO-46, 2013.
[15] M. Lavasani, H. Angepat, and D. Chiou, "An fpga-based in-line accelerator for memcached," Computer Architecture Letters, vol. 13, no. 2, pp. 57–60, 2014.
[16] K. Lim, D. Meisner, A. G. Saidi, P. Ranganathan, and T. F. Wenisch, "Thin servers with smart pipes: Designing soc accelerators for memcached," in ISCA-40, 2013.
[17] Y.-k. Choi, J. Cong, Z. Fang, Y. Hao, G. Reinman, and P. Wei, "A quantitative analysis on microarchitectures of modern cpu-fpga platforms," in DAC-53, 2016.
[18] HSA Foundation, HSA Platform System Architecture Specification 1.0, 2015.
[19] Advanced Micro Devices, Inc., AMD I/O Virtualization Technology (IOMMU) Specification, 2011.
[20] ARM Ltd., ARM System Memory Management Unit Architecture Specification, 2015.
[21] Intel Corporation, Intel Virtualization Technology for Directed I/O Architecture, 2014.
[22] A. Kegel, P. Blinzer, A. Basu, and M. Chan, "Virtualizing io through io memory management unit (iommu)," in ASPLOS-XXI Tutorials, 2016.
[23] E. S. Chung, J. D. Davis, and J. Lee, "Linqits: Big data on little clients," in ISCA-40, 2013.
[24] T. Moreau, M. Wyse, J. Nelson, A. Sampson, H. Esmaeilzadeh, L. Ceze, and M. Oskin, "Snnap: Approximate computing on programmable socs via neural acceleration," in HPCA-21, 2015.
[25] B. Pichai, L. Hsu, and A. Bhattacharjee, "Architectural support for address translation on gpus: Designing memory management units for cpu/gpus with unified address spaces," in ASPLOS-XIX, 2014.
[26] J. Power, M. D. Hill, and D. A. Wood, "Supporting x86-64 address translation for 100s of gpu lanes," in HPCA-20, 2014.
[27] T. W. Barr, A. L. Cox, and S. Rixner, "Translation caching: Skip, don't walk (the page table)," in ISCA-37, 2010.
[28] R. Bhargava, B. Serebrin, F. Spadini, and S. Manne, "Accelerating two-dimensional page walks for virtualized systems," in ASPLOS-XIII, 2008.
[29] Intel Corporation, TLBs, Paging-Structure Caches, and Their Invalidation, 2008.
[30] Y. S. Shao, S. Xi, V. Srinivasan, G.-Y. Wei, and D. Brooks, "Toward cache-friendly hardware accelerators," in Proc. Sensors to Cloud Architectures Workshop (SCAW), in conjunction with HPCA-21, 2015.
[31] C. Bienia, S. Kumar, J. P. Singh, and K. Li, "The parsec benchmark suite: Characterization and architectural implications," in PACT-17, 2008.
[32] J. Hestness, S. W. Keckler, and D. A. Wood, "A comparative analysis of microarchitecture effects on cpu and gpu memory system behavior," in IISWC, 2014.
[33] J. Cong, Z. Fang, M. Gill, and G. Reinman, "Parade: A cycle-accurate full-system simulation platform for accelerator-rich architectural design and exploration," in ICCAD, 2015.
[34] N. Binkert, B. Beckmann, G. Black, S. K. Reinhardt, A. Saidi, A. Basu, J. Hestness, D. R. Hower, T. Krishna, S. Sardashti, et al., "The gem5 simulator," ACM SIGARCH Computer Architecture News, vol. 39, no. 2, pp. 1–7, 2011.
[35] J. Cong, B. Liu, S. Neuendorffer, J. Noguera, K. Vissers, and Z. Zhang, "High-level synthesis for fpgas: From prototyping to deployment," Trans. Comp.-Aided Des. Integ. Cir. Sys., vol. 30, pp. 473–491, Apr. 2011.
[36] N. Muralimanohar, R. Balasubramonian, and N. P. Jouppi, CACTI 6.0: A tool to model large caches. HP Laboratories, 2009.
[37] N. Amit, M. Ben-Yehuda, and B.-A. Yassour, "Iommu: Strategies for mitigating the iotlb bottleneck," in Workshop on Interaction between Operating Systems and Computer Architecture (WIOSCA), 2010.
[38] V. Karakostas, J. Gandhi, A. Cristal, M. D. Hill, K. S. McKinley, M. Nemirovsky, M. M. Swift, and O. S. Unsal, "Energy-efficient address translation," in HPCA-22, 2016.
[39] D. L. Black, R. F. Rashid, D. B. Golub, and C. R. Hill, "Translation lookaside buffer consistency: A software approach," in ASPLOS-III, 1989.
[40] A. Bhattacharjee, D. Lustig, and M. Martonosi, "Shared last-level tlbs for chip multiprocessors," in HPCA-17, 2011.
[41] A. Basu, J. Gandhi, J. Chang, M. D. Hill, and M. M. Swift, "Efficient virtual memory for big memory servers," in ISCA-40, 2013.
[42] C. McCurdy, A. L. Cox, and J. Vetter, "Investigating the tlb behavior of high-end scientific applications on commodity microprocessors," in ISPASS, 2008.
[43] M. Talluri and M. D. Hill, "Surpassing the tlb performance of superpages with less operating system support," in ASPLOS-VI, 1994.
[44] A. Arcangeli, "Transparent hugepage support," in KVM Forum, 2010.
[45] P. Hammarlund, "4th generation intel core processor, codenamed haswell," in Hot Chips, 2013.
[46] B. L. Jacob and T. N. Mudge, "A look at several memory management units, tlb-refill mechanisms, and page table organizations," in ASPLOS-VIII, 1998.
[47] A. Saulsbury, F. Dahlgren, and P. Stenström, "Recency-based tlb preloading," in ISCA-27, 2000.
[48] G. B. Kandiraju and A. Sivasubramaniam, "Going the distance for tlb prefetching: An application-driven study," in ISCA-29, 2002.
[49] A. Bhattacharjee, "Large-reach memory management unit caches," in MICRO-46, 2013.
[50] B. Jacob and T. Mudge, "Software-managed address translation," in HPCA-3, 1997.
[51] S. Shahar, S. Bergman, and M. Silberstein, "Activepointers: A case for software address translation on gpus," in ISCA-43, 2016.
[52] Intel Corporation, Accelerator Abstraction Layer Software Programmers Guide.
[53] S. Neuendorffer and F. Martinez-Vallina, "Building zynq accelerators with vivado high level synthesis," in FPGA, 2013.
[54] Y. S. Shao, S. L. Xi, V. Srinivasan, G.-Y. Wei, and D. Brooks, "Co-designing accelerators and soc interfaces using gem5-aladdin," in MICRO-49, 2016.
[55] M. Malka, N. Amit, M. Ben-Yehuda, and D. Tsafrir, "riommu: Efficient iommu for i/o devices that employ ring buffers," in ASPLOS-XX, 2015.
[56] L. E. Olson, J. Power, M. D. Hill, and D. A. Wood, "Border control: Sandboxing accelerators," in MICRO-48, 2015.