the physical memory. Each CPU core has its own TLB and MMU, while all accelerators share an IOMMU that has an IOTLB inside it. A CPU core can launch one or more accelerators by offloading a task to them for superior performance and energy efficiency. Launching multiple accelerators can exploit data parallelism by assigning accelerators to different data partitions, which we call tiles.

The details of a customized accelerator are shown in Figure 3. In contrast to general-purpose CPU cores or GPUs, accelerators do not use instructions and feature customized registers and datapaths with deep pipelines [8]. Scratchpad memory (SPM) is predominantly used by customized accelerators instead of hardware-managed caches, and data layout optimization techniques such as data tiling are often applied for increased performance. A memory interface such as a DMA (direct memory access) engine is often used to transfer data between the SPM and the memory system.

Due to these microarchitectural differences, customized accelerators exhibit distinct memory access behaviors compared to CPUs and GPUs. To drive our design, we characterize such behaviors in the following subsections: the bulk transfer of consecutive data, the impact of data tiling, and the sensitivity to address translation latency.

B. Bulk Transfer of Consecutive Data

The performance and energy gains of customized accelerators are largely due to the removal of instructions through specialization and deep pipelining [8]. To guarantee a high throughput for such customized pipelines—processing one input data element every II (pipeline initiation interval) cycles, where II is usually one or two—the entire input data must be available in the SPM to provide register-like accessibility. Therefore, the execution process of customized accelerators typically has three phases: reading data from the memory system into the SPM in bulk for local handling, pipelined processing on the local data, and then writing the output data back to the memory system. Such bulky reads and writes appear as multiple streams of consecutive accesses in the memory system, which exhibit good memory page locality and high memory bandwidth utilization.
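To make the three-phase structure concrete, the following sketch models one accelerator invocation under the assumptions above. All parameters (tile size, DMA bandwidth, pipeline depth) are hypothetical placeholders chosen for illustration; they are not values from our simulator.

```python
# Illustrative sketch of the three-phase execution of a customized accelerator:
# bulk DMA read into the SPM, fully pipelined compute, bulk DMA write back.
# All parameters are hypothetical; II is the pipeline initiation interval.

def accelerator_tile_cycles(tile_elems, bytes_per_elem, dma_bytes_per_cycle, ii, pipeline_depth):
    """Return (read, compute, write) cycle estimates for one tile."""
    tile_bytes = tile_elems * bytes_per_elem
    dma_read = tile_bytes / dma_bytes_per_cycle       # bulk, consecutive reads
    compute = pipeline_depth + ii * (tile_elems - 1)  # one element per II cycles once full
    dma_write = tile_bytes / dma_bytes_per_cycle      # bulk, consecutive writes
    return dma_read, compute, dma_write

if __name__ == "__main__":
    r, c, w = accelerator_tile_cycles(tile_elems=16 * 16 * 16,
                                      bytes_per_elem=4,
                                      dma_bytes_per_cycle=16,
                                      ii=1,
                                      pipeline_depth=40)
    print(f"read {r:.0f}, compute {c:.0f}, write {w:.0f} cycles")
```

The read and write phases each touch a long run of consecutive addresses, which is the source of the page locality discussed above.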
To demonstrate such characteristics, we plot the trace of virtual pages that trigger TLB misses in BlackScholes in Figure 4 (our simulation methodology and workloads are detailed in Section III). We can see that the TLB miss behavior is extremely regular, which is different from the more random accesses in CPU or GPU applications. Since accelerators feature customized deep pipelines without multithreading or context switching, the page divergence is determined only by the number of input data arrays and the dimensionality of each array. Figure 5 confirms this by showing the TLB miss trace of a single execution of BlackScholes, which accesses six one-dimensional input arrays and one output array. In addition, we can see that TLB misses typically happen at the beginning of the bulky data read and write phases, followed by a large number of TLB hits. Therefore, high hit rates can be expected from TLBs with sufficient capacity.

Fig. 5. TLB miss trace of a single execution of BlackScholes.

This type of regularity is also observed for a string-matching application and is reported to be common for a wide range of applications such as image processing and graphics [30]. We think that this characteristic is determined by the fundamental microarchitecture rather than the application domain. Such regular access behavior presents opportunities for relatively simple designs that support address translation for accelerator-centric architectures.

C. Impact of Data Tiling

Data tiling techniques are widely used on customized accelerators; they group data points into tiles that are executed atomically. As a consequence, each data tile can be mapped to a different accelerator to maximize parallelism. Data tiling can also improve data locality for the accelerator pipeline, leading to an increased computation-to-communication ratio, and it enables the use of double (ping-pong) buffering.
While the input data array could span several memory pages, the tile size of each input is usually smaller than a memory page due to limited SPM resources, especially for high-dimensional arrays. As a result, neighboring tiles are likely to reside in the same memory page. These tiles, once mapped to different accelerators, will trigger multiple address translation requests for the same virtual page. Figure 6 shows a simple example of tiling a 32×32×32 input float array with a 16×16×16 tile size, producing 8 tiles in total. This example is derived from the medical imaging applications, while the sizes are picked for illustration purposes only. Since the first two dimensions exactly fit into a 4KB page, 32 pages in total are allocated for the input data. Processing one tile requires accessing 16 of the 32 pages. However, mapping each tile to a different accelerator will trigger 16 × 8 = 128 address translation requests, which is 4 times more than the minimum of 32 requests. This duplication in address translation requests must be resolved so that additional translation service latency can be avoided. The simple coalescing logic used in GPUs will not be sufficient because concurrently running accelerators are not designed to execute in lockstep.
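The page arithmetic of this example can be reproduced in a few lines. The sketch below assumes 4-byte floats, 4KB pages, and a row-major layout in which each 32×32 plane occupies exactly one page, as stated above.

```python
# Reproduce the page arithmetic of the 32x32x32 tiling example (4-byte floats, 4KB pages).
ARRAY = (32, 32, 32)          # input float array
TILE = (16, 16, 16)           # tile mapped to one accelerator
PAGE_BYTES = 4096
ELEM_BYTES = 4

plane_bytes = ARRAY[0] * ARRAY[1] * ELEM_BYTES                    # one 32x32 plane = 4096 B = 1 page
pages_total = ARRAY[0] * ARRAY[1] * ARRAY[2] * ELEM_BYTES // PAGE_BYTES
tiles = (ARRAY[0] // TILE[0]) * (ARRAY[1] // TILE[1]) * (ARRAY[2] // TILE[2])
pages_per_tile = TILE[2]      # a tile spans 16 of the 32 planes, i.e., 16 distinct pages
requests_naive = tiles * pages_per_tile                           # each accelerator translates on its own

print(plane_bytes, pages_total, tiles, pages_per_tile, requests_naive)
# -> 4096 32 8 16 128: four times the minimum of 32 unique pages
```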
D. Address Translation Latency Sensitivity

While CPUs expose memory-level parallelism (MLP) using large instruction windows, and GPUs leverage their extensive multithreading to issue bursts of memory references, accelerators generally lack architectural support for fine-grained latency hiding. As discussed earlier, the performance of customized accelerators relies on predictable accesses to the local SPM. Therefore, the computation pipeline cannot start until the entire input data tile is ready. To alleviate this problem, double buffering techniques are commonly used to overlap communication with computation—processing on one buffer while transferring data on the other. However, such coarse-grained techniques require a careful design to balance communication and computation, and can be ineffective in tolerating long-latency memory operations, especially page walks on TLB misses.
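A back-of-the-envelope model (a sketch, not our simulator) shows why double buffering tolerates only a limited amount of added translation latency: extra latency is hidden only while the transfer of the next tile, including translation stalls, still finishes within the compute time of the current tile. The cycle counts below are hypothetical.

```python
# Simplified double-buffering model: steady-state cycles per tile with
# communication/computation overlap. All numbers are hypothetical; the point
# is the max() structure, not the specific values.

def tile_time(compute_cycles, transfer_cycles, extra_translation_cycles):
    """The next tile's transfer (plus translation stalls) overlaps the current
    tile's computation; whichever takes longer determines the tile time."""
    return max(compute_cycles, transfer_cycles + extra_translation_cycles)

base = tile_time(4096, 4000, 0)
for added_cycles in (128, 256, 1024, 9600):
    slowdown = tile_time(4096, 4000, added_cycles) / base
    print(f"+{added_cycles} translation cycles per tile -> {slowdown:.2f}x tile time")
```

Once the translation stalls exceed the slack between compute and transfer, every additional cycle lands directly on the critical path.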
To further demonstrate latency sensitivity, we run simulations with varied address translation latencies added to each memory reference. Figure 7 presents the performance slowdown of LPCIP and the geometric mean slowdown over all benchmarks under the additional latency. In general, address translation latency within 8 cycles can be tolerated by double buffering. Any additional latency beyond 16 cycles significantly degrades overall system performance. LPCIP shows the highest sensitivity to additional cycles among all benchmarks since the accelerator issues dynamic memory accesses during the pipelined processing, which is beyond the coverage of double buffering. While GPUs are reportedly able to tolerate 600 additional memory access cycles with a maximum slowdown of only 5% [32], the performance of accelerators is degraded by 5x with the same additional cycles.

Such extreme sensitivity poses serious challenges to designing efficient address translation support for accelerators: 1) TLBs must be carefully designed to provide low access latency; 2) since page walks incur long latency, which could be a few hundred cycles, TLB structures must be effective in reducing the number of page walks; and 3) for page walks that cannot be avoided, the page walker must be optimized for lower latency.

III. SIMULATION METHODOLOGY

Simulation. We use PARADE [33], an open-source cycle-accurate full-system simulator, to evaluate the accelerator-centric architecture. PARADE extends the gem5 [34] simulator with high-level synthesis support [35] to accurately model the accelerator module, including the customized data path, the associated SPM and the DMA interface. We use CACTI [36] to estimate the area of TLB structures based on 32nm process technology.

We model an 8-issue out-of-order X86-64 CPU core at 2GHz with 32KB L1 instruction and data caches, a 2MB L2 cache and a per-core MMU. We implement a wide spectrum of accelerators, as shown in Table I, where each accelerator can issue 64 outstanding memory requests and has double buffering support to overlap communication with computation. The host core and accelerators share 2GB of DDR3 DRAM on four memory channels. We extend PARADE to model an IOMMU with a 32-entry IOTLB [37]. To study the overhead of address translation, we also model an ideal address translation in the simulator with infinite-sized TLBs and zero page walk latency for accelerators. We assume that a 4KB page size is used in the system for the best compatibility. The impact of using large pages will be discussed in Section V-B. Table II summarizes the major parameters used in our simulation.
TABLE I. Benchmark descriptions with input sizes and number of heterogeneous accelerators [13].

TABLE II. Parameters of the baseline architecture

Component     Parameters
CPU           1 X86-64 OoO core @ 2GHz; 8-wide issue; 32KB L1; 2MB L2
Accelerator   4 instances of each accelerator; 64 outstanding memory references; double buffering enabled
IOMMU         4KB page; 32-entry IOTLB
DRAM          2GB, 4 channels, DDR3-1600

Workloads. To provide a quantitative evaluation of our address translation proposal, we use a wide range of applications that are open-sourced together with PARADE. These applications are categorized into four domains: medical imaging, computer vision, computer navigation, and commercial benchmarks from PARSEC [31]. A brief description of each application, its input size, and the number of different accelerator types involved is given in Table I. Each application may call one or more types of accelerators to perform different functionalities corresponding to the algorithm in its various phases. In total, we implement 25 types of accelerators.² To achieve maximum performance, multiple instances of the same type can be invoked by the host to process in parallel. By default, we use four instances of each type in our evaluation unless otherwise specified. There will be no more than 20 active accelerators; the others are powered off depending on which applications are running.

² Some of the types are shared across different applications. For example, the Rician accelerator is shared by Deblur and Denoise; the Gaussian accelerator is shared by Deblur and Registration.

IV. DESIGN AND EVALUATION OF ADDRESS TRANSLATION SUPPORT

The goal of this paper is to design efficient address translation support for accelerator-centric architectures. After carefully examining the distinct memory access behaviors of customized accelerators in Section II, we propose the corresponding TLB and MMU designs, with quantitative evaluations, step by step.

A. Gap between the Baseline IOMMU Approach and the Ideal Address Translation

Figure 2 shows the performance of the baseline IOMMU approach relative to the ideal address translation with infinite-sized TLBs and zero page walk latency. Since an IOMMU-only configuration requires each memory reference to be translated by the centralized hardware interface, performance suffers from frequent trips to the IOMMU. On one hand, benchmarks with large page reuse distances, such as the medical imaging applications, experience IOTLB thrashing due to its limited capacity. In such cases, the IOTLB cannot provide effective translation caching, leading to a large number of long-latency page walks. On the other hand, while the IOTLB may satisfy the demand of some benchmarks with small page reuse distances, such as the computer navigation applications, the IOMMU lacks efficient page walk handling; this significantly degrades system performance on IOTLB misses. As a result, the IOMMU approach achieves only 12.3% of the performance of the ideal address translation on average, which leaves a huge performance gap.

In order to bridge this gap, we propose to reduce the address translation overhead in three steps: 1) provide low-latency access to translation caching by allowing customized accelerators to store physical addresses locally in TLBs, 2) reduce the number of page walks by exploiting the page sharing between accelerators that results from data tiling, and 3) minimize the page walk latency by offloading page walk requests to the host core MMU. We detail our designs and evaluations in the following subsections.

B. Private TLBs

To enable more capable devices, such as accelerators, recent IOMMU proposals allow IO devices to cache address translations locally [22]. This reduces the address translation latency on TLB hits and relies on the page walker in the IOMMU on TLB misses. However, the performance impact and design tradeoffs are not scrutinized in the literature.
Fig. 8. Performance for benchmarks other than medical imaging with various private TLB sizes, assuming fixed access latency.

Fig. 9. Performance for medical imaging benchmarks with various private TLB sizes, assuming fixed access latency.

1) Implementation

TLB sizes. While a large TLB may have a higher hit rate, smaller TLB sizes are preferable for providing lower access latency, since customized accelerators are very sensitive to address translation latency. Moreover, TLBs are reportedly power-hungry, and even TLB hits consume a significant amount of dynamic energy [38]. Thus, TLB sizes must be chosen carefully.

Commercial CPUs currently implement 64-entry per-core L1 TLBs, and recent GPU studies [25], [26] introduce 64-128 entry post-coalescer TLBs. As illustrated in Section II-B, customized accelerators have much more regular and less divergent access patterns than general-purpose CPUs and GPUs; this fact endorses a relatively small private TLB size for shorter access latency and lower energy consumption. Next we quantitatively evaluate the performance effects of various private TLB sizes. We assume a least-recently-used (LRU) replacement policy to capture locality.
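For reference, the following sketch is a minimal fully associative TLB model with LRU replacement of the kind assumed here; it illustrates the policy only and is neither the simulator nor the RTL implementation.

```python
from collections import OrderedDict

class LruTlb:
    """Minimal fully associative TLB model with LRU replacement (illustrative only)."""

    def __init__(self, entries=32, page_bytes=4096):
        self.entries = entries
        self.page_shift = page_bytes.bit_length() - 1
        self.cache = OrderedDict()          # virtual page number -> physical page number

    def access(self, vaddr):
        """Return True on a hit; on a miss, insert the translation, evicting the LRU entry."""
        vpn = vaddr >> self.page_shift
        if vpn in self.cache:
            self.cache.move_to_end(vpn)     # refresh LRU position
            return True
        if len(self.cache) >= self.entries:
            self.cache.popitem(last=False)  # evict the least recently used entry
        self.cache[vpn] = None              # in reality the translation is filled by a page walk
        return False

# Example: a bulk, consecutive access stream hits after the first touch of each page.
tlb = LruTlb(entries=32)
hits = sum(tlb.access(addr) for addr in range(0, 64 * 4096, 64))
print(hits, "hits out of", 64 * 4096 // 64, "accesses")
```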
a) Private TLB size for all benchmarks except medical imaging applications: Figure 8 illustrates that all seven evaluated benchmarks greatly benefit from adding private TLBs. In general, small TLB sizes such as 16-32 entries are sufficient to achieve most of the improved performance. The cause of this gain is twofold: 1) accelerators benefit from reduced access time to locally cached translations; 2) even though the capacity is not enlarged compared to the 32-entry IOTLB, accelerators enjoy private TLB resources rather than sharing the IOTLB. LPCIP receives the largest performance improvement from having a private TLB. This matches the observation that it has the highest sensitivity to address translation latency due to dynamic memory references during pipelined processing, so providing a low-latency private TLB greatly reduces its pipeline stalls.

b) Private TLB size for medical imaging applications: Figure 9 shows the evaluation of the four medical imaging benchmarks. These benchmarks have a larger memory footprint with more input array references, and their three-dimensional access pattern (accessing multiple pages per array reference, as demonstrated in Figure 6) further stresses the capacity of private TLBs. While the 256-entry configuration apparently achieves the best performance of the four, the increased TLB access time would decrease performance for other benchmarks, especially latency-sensitive ones such as LPCIP. In addition, a large TLB will also consume more energy.

Non-blocking design. Most CPUs and GPUs use blocking TLBs since the latency can be hidden with wide MLP. In contrast, accelerators are sensitive to long TLB miss latencies. Blocking accesses on TLB misses will stall the data transfer, reducing memory bandwidth utilization, which results in performance degradation. In light of this, our design provides non-blocking hit-under-miss support to overlap a TLB miss with hits to other entries.

Correctness issues. In practice, correctness issues, including page faults and TLB shootdowns [39], have negligible effects on the experimental results. We discuss them here for implementation purposes. While page faults can be handled by the IOMMU, accelerator private TLBs must support TLB shootdowns from the system. In a multicore accelerator system, if the mapping of memory pages is changed, all sharers of the virtual memory are notified to invalidate TLBs using TLB shootdown inter-processor interrupts (IPIs). We assume shootdowns are supported between CPU TLBs and the IOTLB based on this approach. We also extend the IOMMU to send invalidations to accelerator private TLBs to flush stale values.

2) Evaluation

We find that the low TLB access latency provided by local translation caching is key to the performance of customized accelerators. While the optimal TLB size appears to be application-specific, we choose 32 entries as a balance between latency and capacity for our benchmarks, which also lowers area and power consumption. On average, the 32-entry private TLB achieves 23.3% of the ideal address translation performance, one step up from the 12.3% IOMMU baseline, with an area overhead of around 0.7% for each accelerator. Further improvements are possible by customizing TLB sizes and supporting miss-under-miss for individual applications. We leave these for future work.

C. A Shared Level-Two TLB

Figure 10 depicts the basic structure of our two-level TLB design and illustrates two orthogonal benefits provided by adding a shared TLB. First, the requested entry may have been previously inserted into the shared TLB by a request from the same accelerator. This is the case when the private TLB size is not sufficient to capture the reuse distance, so the requested entry is evicted from the private TLB earlier. Specifically, the medical imaging benchmarks would benefit from having a large shared TLB.
In contrast to private TLBs, where low access latency is the key, the shared TLB mainly aims to reduce the number of page walks.
Fig. 12. Page walk reduction from adding a 512-entry shared TLB to infinite-sized private TLBs.

reduce duplicate TLB misses, it is not easy to do so when the input data size and tile size are user-defined and thus can be arbitrary. We leave this for future work.

The remaining page walks are due to cold TLB misses, where altering TLB sizes or organizations cannot make a difference. Therefore, we propose an efficient page walk handling mechanism to minimize the latency penalty introduced by those page walks.

D. Host Page Walks

As the IOMMU is not capable of delivering efficient page walks, the performance of accelerators still suffers from excessively long page walk latency even with a reduced number of page walks. While providing dedicated full-blown MMU support for accelerators could potentially alleviate this problem, there may not be a need to establish a new piece of hardware—especially when off-the-shelf resources can be readily leveraged by accelerators.

This opportunity lies in the coordination between the accelerators and the host core that launches them. After the computation task has been offloaded from the host core to the accelerators, a common practice is to put the host core into spinning so that the core can react immediately to any status change of the accelerators. As a result, during the execution of the accelerators, the host core MMU and data cache are less stressed; they can be used to service translation requests from the accelerators invoked by this core. By offloading page walk operations to the host core MMU, the following benefits can be achieved.

First, MMU cache support is provided by the host core MMU. Commercial CPUs have introduced MMU caches to store upper-level entries of page walks [27], [29]. The page walker accesses the MMU cache to determine whether one or more levels of the walk can be skipped before issuing memory references. As characterized in Section II-B, accelerators have extremely regular page access behaviors with small page divergence. Therefore, the MMU cache can potentially work very well with accelerators by capturing the good locality in the upper levels of the page table. We expect the MMU cache to skip all three non-leaf page table accesses most of the time, leaving only one memory reference required for each page walk. We assume an AMD-style page walk cache [27] in this paper, which stores entries in a data cache fashion. However, other implementations, such as Intel's paging structure cache [29], could provide similar benefits.

Second, PTE (page table entry) locality within a cacheline provides an opportunity to amortize the cost of memory accesses over more table walks. Unlike the IOMMU, the CPU MMU has data cache support, which means a PTE is first brought from DRAM into the data cache and then to the MMU. A future access to the same entry, if it misses in the MMU cache, can still hit in the data cache with much lower latency than a DRAM access. More importantly, as one cache line can contain eight PTEs, one DRAM access for a PTE potentially prefetches seven consecutive ones, so that future references to these PTEs can be cache hits. While this may not benefit CPU or GPU applications with large page divergence, we have shown that the regularity of accelerator TLB misses permits improvement through this prefetching effect.
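The first two benefits can be summarized with a small latency model. The sketch below assumes a 4-level x86-64 walk and uses round numbers similar to the latencies quoted later in Section IV-D2 (a few cycles for the MMU cache, roughly 20 cycles for a data cache hit, and more than 200 cycles for DRAM); it is an illustration, not our simulator.

```python
# Illustrative latency model for a 4-level x86-64 page walk serviced by the host MMU.
# Assumed latencies (hypothetical round numbers): MMU cache ~3 cycles, data cache hit
# ~20 cycles, DRAM ~200 cycles. A 64-byte line holds eight 8-byte PTEs, so one DRAM
# fetch of a leaf PTE also brings in the PTEs of seven neighboring 4KB pages.

def walk_latency(levels_skipped, leaf_in_data_cache,
                 mmu_cache_cycles=3, dcache_cycles=20, dram_cycles=200):
    """Estimate page walk cycles given MMU cache coverage and the leaf PTE location."""
    non_leaf_refs = 3 - levels_skipped          # a 4-level walk has 3 non-leaf levels
    leaf_cycles = dcache_cycles if leaf_in_data_cache else dram_cycles
    return mmu_cache_cycles + non_leaf_refs * dcache_cycles + leaf_cycles

print(walk_latency(levels_skipped=3, leaf_in_data_cache=True))   # common case: ~23 cycles
print(walk_latency(levels_skipped=3, leaf_in_data_cache=False))  # first page of a PTE line
print(walk_latency(levels_skipped=0, leaf_in_data_cache=False))  # cold walk: ~263 cycles
```

Under the regular, consecutive page accesses characterized in Section II-B, only about one in eight walks would pay the DRAM cost for the leaf PTE in this model.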
Third, resources are likely warmed up by previous host core operations within the same application's virtual address space. Since a unified virtual address space permits close coordination between the host core and the accelerators, both can work on the same data with either general-purpose manipulation or high-performance specialization. Therefore, the host core operations can very well warm up these resources for the accelerators. Specifically, a TLB miss triggered by the host core brings both upper-level entries into the MMU cache and PTEs into the data cache, leading to reduced page walk latency for the accelerators in the near future. While the previous two benefits can also be obtained through any dedicated MMU with an MMU cache and data cache, this benefit is unique to host page walks.

1) Implementation

Modifications to accelerator TLBs. In addition to the accelerator ID bits in each TLB entry, the shared TLB also needs to store the host process (or context) ID within each entry. On a shared TLB miss, a translation service request with the virtual address and the process ID is sent through the interconnect to the host core operating within the same virtual address space.

Modifications to host MMUs. The host core MMU must be able to distinguish accelerator page walk requests from core requests, so that PTEs can be sent back to the accelerator shared TLB instead of being inserted into the host core TLBs after page walking. As CPU MMUs are typically designed to handle a single page walk at a time, a separate port and request queue for accelerator page walk requests is required to buffer multiple requests. An analysis of the number of outstanding shared TLB misses is presented in Section V-A. Demultiplexer logic is also required for the MMU to send responses with the requested PTE back to the accelerator shared TLB.
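A behavioral sketch of this request path is shown below. The class and field names (for example, WalkRequest and queue_depth) are illustrative choices for exposition, not names used by the hardware or the simulator.

```python
from collections import deque
from dataclasses import dataclass

@dataclass
class WalkRequest:
    """Page walk request sent from the accelerator-shared TLB to the host core MMU."""
    accel_id: int     # identifies which accelerator's private TLB receives the response
    process_id: int   # host process (context) the accelerator is working for
    vpn: int          # virtual page number to translate

class TwoLevelTlbFrontend:
    """Behavioral model: private TLB lookup, then shared TLB, then a queued host page walk."""

    def __init__(self, private_tlbs, shared_tlb, queue_depth=16):
        self.private_tlbs = private_tlbs   # dict: accel_id -> set of cached VPNs
        self.shared_tlb = shared_tlb       # dict: (process_id, vpn) -> physical page number
        self.queue_depth = queue_depth     # separate request queue at the host MMU port
        self.walk_queue = deque()

    def translate(self, accel_id, process_id, vpn):
        if vpn in self.private_tlbs[accel_id]:
            return "private-hit"
        if (process_id, vpn) in self.shared_tlb:
            self.private_tlbs[accel_id].add(vpn)       # refill the private TLB
            return "shared-hit"
        if len(self.walk_queue) >= self.queue_depth:
            return "stall"                             # hardware would apply backpressure here
        # Miss in both levels: enqueue a tagged request; the host MMU demultiplexes
        # its response back to the accelerator-shared TLB rather than its own TLBs.
        self.walk_queue.append(WalkRequest(accel_id, process_id, vpn))
        return "walk-queued"
```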
Correctness issues. In contrast to implementing a dedicated MMU for accelerators, where coordination with the host core MMU would be required for page fault handling, our approach requires no additional support for system-level correctness issues.
If a page walk returns a NULL pointer for the virtual address requested by accelerators, the faulting address is written to the core's CR2 register and an interrupt is raised. The core can then proceed with the normal page fault handling process without any knowledge of the requester of the faulting address. The MMU is signaled once the OS has written the page table with the correct translation. The MMU then finishes the page walk and sends the requested PTE to the accelerator shared TLB. The support for TLB shootdowns works the same as in the IOMMU case.

2) Evaluation

To evaluate the effects of host page walks, we simulate an 8KB page walk cache with 3-cycle access latency and a 2MB data cache with 20-cycle latency. If a PTE request misses in the data cache, it is forwarded to the off-chip DRAM, which typically takes more than 200 cycles. We faithfully simulate the interconnect delays in a mesh topology.

Fig. 13. Average page walk latency when offloading page walks to the host core MMU.

We first evaluate the capability of the host core MMU by showing the average latency of the page walks that are triggered by accelerators. Figure 13 shows that the host core MMU consistently provides low-latency page walks across all benchmarks, with an average of only 58 cycles. Given that the latency of four consecutive data cache accesses is 80 cycles plus interconnect delays, most page walks should be a combination of MMU cache hits and data cache hits, with DRAM accesses only in rare cases. This is partly due to the warm-up effects, whereby cold misses in both the MMU cache and the data cache are minimized. Based on this, it is difficult for a dedicated MMU to provide even lower page walk latency than the host core MMU.

Fig. 14. Average translation latency of (a) all requests; (b) requests that actually trigger page walks.

We further analyze the average translation latency of each design to relate it to our latency sensitivity study. As shown in Figure 14(a), the average translation latency across all benchmarks for the designs with private TLBs and with two-level TLBs is 101.1 and 27.7 cycles, respectively. This level of translation latency, if uniformly distributed, should not result in more than a 50% performance slowdown according to Figure 7. However, as shown in Figure 14(b), the average translation latency of the requests that actually trigger page walks is well above 1000 cycles for the two designs that use the IOMMU. This is due to both the long page walk latency and the queueing latency when there are multiple outstanding page walk requests. With such long latencies added to the runtime, accelerators become completely ineffective at latency hiding, even for the shorter latencies that could otherwise be tolerated by double buffering. In contrast, host page walks reduce page walk latencies and meanwhile minimize the variance of the address translation latency. Therefore, the overall performance benefits from a much lower average address translation latency (3.2 cycles) and reduced variation.

E. Summary: Two-level TLBs and Host Page Walks

Overall design. In summary, to provide efficient address translation support for accelerator-centric architectures, we first enhance the IOMMU approach by designing a low-latency private TLB for each accelerator. Second, we present a shared level-two TLB design to enable page sharing between accelerators, reducing duplicate TLB misses. The two-level TLB design effectively reduces the number of page walks by 76.8%. Finally, we propose to offload page walk requests to the host core MMU so that we can efficiently handle page walks with an average latency of 58 cycles. Table III summarizes the parameters of the key components in our design.

TABLE III. Configuration of our proposed address translation support

Component      Parameters
Private TLBs   32-entry, 1-cycle access latency
Shared TLB     512-entry, 3-cycle access latency
Host MMU       4KB page, 8KB page walk cache [28]
Interconnect   Mesh, 4-stage routers

Overall system performance. Figure 15 compares the performance of the different designs against the ideal address translation. Note that the first three designs rely on the IOMMU for page walks, which could take more than 900 cycles. Our three proposed designs, as shown in Figure 15, achieve 23.3%, 51.8% and 93.6% of the ideal address translation performance, respectively, while the IOMMU baseline achieves only 12.3% of the ideal performance. The performance gap between our combined approach (two-level TLBs with host page walks) and the ideal address translation is reduced to 6.4% on average, which is in the range deemed acceptable in the CPU world (5-15% of runtime [27], [41], [40], [42]).
Fig. 15. Total execution time normalized to ideal address translation.

V. DISCUSSION

A. Impact of More Accelerators

While we have shown that significant performance improvement can be achieved for four accelerator instances by sharing resources, including the level-two TLB and the host MMU, it is possible that resource contention with too many sharers results in a performance slowdown. Specifically, since CPU MMUs typically handle one page walk at a time, the host core MMU can potentially become a bottleneck as the number of outstanding shared TLB misses increases. To evaluate the impact of launching more accelerators from the same host core, we run simulations with 16 accelerator instances in the system, using the same configuration summarized in Table III.

Fig. 16. The average number of outstanding shared TLB misses of the 4-instance and 16-instance cases.

We compare the average number of outstanding shared TLB misses⁴ for the 4-instance and 16-instance cases in Figure 16. Our shared TLB provides a consistent filtering effect, requiring on average only 1.3 and 4.9 outstanding page walks at a time in the 4-instance and 16-instance cases, respectively. While more outstanding requests lead to a longer waiting time, subsequent requests are likely to hit in the page walk cache and the data cache due to the regular page access pattern, and thus require less service time. Using a dedicated MMU with a threaded page walker [26] could reduce the waiting time. However, the performance improvement may not justify the additional hardware complexity, even for GPUs [25].

Fig. 17. Performance of launching 16 accelerator instances relative to ideal address translation.

Figure 17 presents the overall performance of our proposed address translation support relative to the ideal address translation when there are 16 accelerator instances. We can see that, even with the same amount of shared resources, launching 16 accelerator instances does not have a significant impact on the efficiency of address translation, with the overhead being 7.7% on average. While even more active accelerators promise greater parallelism, we already observe diminishing returns in the 16-instance case, as the interconnect and memory bandwidth are saturating.

Another way of having more active accelerators in the system is to launch multiple accelerators from multiple CPU cores. However, the page walker in each core MMU will not experience higher pressure in such scenarios, since our mechanism requires that accelerators only offload TLB misses to the host core that operates within the same application's virtual address space. A larger shared TLB may be required for more sharers, where a banked placement could be more efficient. We leave this for future work.

⁴ As TLB misses are generally sparse during the execution, we only sample the number when there is at least one TLB miss.

B. Large Pages

Large pages [43] can potentially reduce TLB misses by enlarging the TLB reach, and can speed up misses by requiring fewer memory accesses while walking the page table. To reduce memory management overhead, an OS with Transparent Huge Page [44] support can automatically construct large pages by allocating contiguous baseline pages aligned at the large page size. As a result, developers no longer need to identify the data that could benefit from using large pages and explicitly request their allocation.

As we have shown that accelerators typically feature bulk transfers of consecutive data and are sensitive to long memory latencies, large pages are expected to improve the overall performance of accelerators by reducing TLB misses and page walk latencies. We believe this approach is orthogonal to ours and can be readily applied to the proposed two-level TLB and host page walk design. It is worth noting that the page sharing effect that results from tiling high-dimensional data will become more significant under large pages, leading to an increased number of TLB misses on common pages. Our shared TLB design is shown to be effective in alleviating this issue.
VI. RELATED WORK

Address Translation on CPUs. To meet the ever-increasing memory demands of applications, commercial CPUs have included one or more levels of TLBs [4], [45] and private low-latency caches [28], [29] in the per-core MMU to accelerate address translation. These MMU caches, with different implementations, have been shown to greatly increase performance for CPU applications [27]. Prefetching techniques [46], [47], [48] have been proposed to speculate on PTEs that will be referenced in the future. While such techniques benefit applications with regular page access patterns, additional hardware such as a prefetching table is typically required. Shared last-level TLBs [40] and shared MMU caches [49] have been proposed for multicores to accelerate multithreaded applications by sharing translations between cores. The energy overheads of TLB resources have also been studied [38], advocating energy-efficient TLBs. A software mechanism has also been proposed for address translation on CPUs [50].

Address Translation on GPUs. A recent IOMMU tutorial [22] presents a detailed introduction to the IOMMU design within AMD fused CPU-GPUs, with a key focus on its functionality and security. Though it also enables translation caching in devices, no details or quantitative evaluation are revealed. To provide hardware support for virtual memory and page faults on GPUs, [25], [26] propose GPU MMU designs consisting of post-coalescer TLBs and logic to walk the page table. As GPUs can potentially require hundreds of translations per cycle due to the high parallelism of the architecture, [25] uses 4-ported private TLBs and improved page walk scheduling, while [26] uses a highly threaded page walker to serve bursts of TLB misses. ActivePointers [51] introduces a software address translation layer on GPUs that supports page fault handling without CPU involvement; however, it requires new system abstractions for GPUs.

Based on our characterization of customized accelerators, we differentiate the address translation requirements of customized accelerators from those of GPUs in three ways. First, accelerators do not use instructions and have much more regular access patterns than GPUs, which enables a simpler private TLB design. Second, the page sharing effect between accelerators cannot be resolved using the same coalescing structure as in GPUs, since accelerators are not designed to execute in lockstep; instead, a shared TLB design is tailored to compensate for the impact of data tiling. Third, while GPUs average 60 concurrent TLB misses [26], we show that accelerators have far fewer outstanding TLB misses. Therefore, host page walks with the existing MMU cache and data cache support are sufficient to provide low page walk latency.

Current Virtual Memory Support for Accelerators. Some initial efforts have been made to support address translation for customized accelerators. The prototype of the Intel-Altera heterogeneous architecture research platform [52] uses a static 1024-entry TLB with a 2MB page size to support virtual addresses for user-defined accelerators. A similar approach is also adopted in the design of a Memcached accelerator [15]. Such a static TLB approach requires the allocation of pinned memory and kernel driver intervention on TLB refills. As a result, programmers need to work with special APIs and manually manage various buffers, which imposes a significant programming burden. The Xilinx Zynq SoC provides a coherent interface between the ARM Cortex-A9 processor and the FPGA programmable logic through the accelerator coherency port [53]. While the prototypes in [23], [24] are based on this platform, the address translation mechanism is not detailed. [16] assumes system MMU support for the designed hardware accelerator, but the impact on performance is not studied. [54] studies the system co-design of cache-based accelerators, but only with a simplified address translation model.

Modern processors are equipped with IOMMUs [19], [21], [22] to provide address translation support for loosely coupled devices, including customized accelerators and GPUs. rIOMMU [55] improves the throughput for devices that employ circular ring buffers, such as network and PCIe SSD controllers, but it is not intended for customized accelerators with more complex memory behaviors. With a unified address space, [56] proposes a sandboxing mechanism to protect the system against improper memory accesses. While we choose an IOMMU configuration as the baseline in this paper for its generality, the key insights of this work are applicable to other platforms with modest adjustments.

VII. CONCLUSION

The goal of this paper is to provide simple but efficient address translation support for accelerator-centric architectures. We propose a two-level TLB design and host page walks tailored to the specific challenges and opportunities of customized accelerators. We find that a relatively small, low-latency private TLB with 32 entries for each accelerator reduces page walks by 30.4% compared to the IOMMU baseline. Adding a shared 512-entry TLB eliminates 75.8% of the total page walks by exploiting the page sharing that results from data tiling. Moreover, by simply offloading page walk requests to the host core MMU, the average page walk latency can be reduced to 58 cycles. Our evaluation shows that the combined approach achieves 93.6% of the performance of the ideal address translation.

This paper is the first to provide hardware support for a unified virtual address space between the host CPU and customized accelerators with marginal overhead. We hope that this paper will stimulate future research in this area and facilitate the adoption of customized accelerators.

ACKNOWLEDGMENT

We thank the anonymous reviewers for their feedback. This work is partially supported by the Center for Domain-Specific Computing under the NSF InTrans Award CCF-1436827; funding from CDSC industrial partners including Baidu, Fujitsu Labs, Google, Huawei, Intel, IBM Research Almaden and Mentor Graphics; and C-FAR, one of the six centers of STARnet, a Semiconductor Research Corporation program sponsored by MARCO and DARPA.
REFERENCES

[1] H. Park, Y. Park, and S. Mahlke, "Polymorphic pipeline array: A flexible multicore accelerator with virtualized execution for mobile multimedia applications," in MICRO-42, 2009.
[2] R. Hameed, W. Qadeer, M. Wachs, O. Azizi, A. Solomatnikov, B. C. Lee, S. Richardson, C. Kozyrakis, and M. Horowitz, "Understanding sources of inefficiency in general-purpose chips," in ISCA-37, 2010.
[3] C. Johnson, D. Allen, J. Brown, S. VanderWiel, R. Hoover, H. Achilles, C.-Y. Cher, G. May, H. Franke, J. Xenedis, and C. Basso, "A wire-speed powertm processor: 2.3ghz 45nm soi with 16 cores and 64 threads," in ISSCC, 2010.
[4] M. Shah, R. Golla, G. Grohoski, P. Jordan, J. Barreh, J. Brooks, M. Greenberg, G. Levinsky, M. Luttrell, C. Olson, Z. Samoail, M. Smittle, and T. Ziaja, "Sparc t4: A dynamically threaded server-on-a-chip," IEEE Micro, vol. 32, pp. 8–19, Mar. 2012.
[5] V. Govindaraju, C.-H. Ho, T. Nowatzki, J. Chhugani, N. Satish, K. Sankaralingam, and C. Kim, "Dyser: Unifying functionality and parallelism specialization for energy-efficient computing," IEEE Micro, vol. 32, pp. 38–51, Sept. 2012.
[6] A. Krishna, T. Heil, N. Lindberg, F. Toussi, and S. VanderWiel, "Hardware acceleration in the ibm poweren processor: Architecture and performance," in PACT-21, 2012.
[7] J. Cong, M. A. Ghodrat, M. Gill, B. Grigorian, and G. Reinman, "Architecture support for accelerator-rich cmps," in DAC-49, 2012.
[8] W. Qadeer, R. Hameed, O. Shacham, P. Venkatesan, C. Kozyrakis, and M. A. Horowitz, "Convolution engine: Balancing efficiency and flexibility in specialized computing," in ISCA-40, 2013.
[9] H. Esmaeilzadeh, A. Sampson, L. Ceze, and D. Burger, "Neural acceleration for general-purpose approximate programs," in MICRO-45, 2012.
[10] R. Sampson, M. Yang, S. Wei, C. Chakrabarti, and T. F. Wenisch, "Sonic millip3de: A massively parallel 3d-stacked accelerator for 3d ultrasound," in HPCA-19, 2013.
[11] H. Esmaeilzadeh, E. Blem, R. St. Amant, K. Sankaralingam, and D. Burger, "Dark silicon and the end of multicore scaling," in ISCA-38, 2011.
[12] G. Venkatesh, J. Sampson, N. Goulding, S. Garcia, V. Bryksin, J. Lugo-Martinez, S. Swanson, and M. B. Taylor, "Conservation cores: Reducing the energy of mature computations," in ASPLOS-XV, 2010.
[13] J. Cong, M. A. Ghodrat, M. Gill, B. Grigorian, and G. Reinman, "Architecture support for domain-specific accelerator-rich cmps," ACM Trans. Embed. Comput. Syst., vol. 13, pp. 131:1–131:26, Apr. 2014.
[14] O. Kocberber, B. Grot, J. Picorel, B. Falsafi, K. Lim, and P. Ranganathan, "Meet the walkers: Accelerating index traversals for in-memory databases," in MICRO-46, 2013.
[15] M. Lavasani, H. Angepat, and D. Chiou, "An fpga-based in-line accelerator for memcached," Computer Architecture Letters, vol. 13, no. 2, pp. 57–60, 2014.
[16] K. Lim, D. Meisner, A. G. Saidi, P. Ranganathan, and T. F. Wenisch, "Thin servers with smart pipes: Designing soc accelerators for memcached," in ISCA-40, 2013.
[17] Y.-k. Choi, J. Cong, Z. Fang, Y. Hao, G. Reinman, and P. Wei, "A quantitative analysis on microarchitectures of modern cpu-fpga platforms," in DAC-53, 2016.
[18] HSA Foundation, HSA Platform System Architecture Specification 1.0, 2015.
[19] Advanced Micro Devices, Inc., AMD I/O Virtualization Technology (IOMMU) Specification, 2011.
[20] ARM Ltd., ARM System Memory Management Unit Architecture Specification, 2015.
[21] Intel Corporation, Intel Virtualization Technology for Directed I/O Architecture, 2014.
[22] A. Kegel, P. Blinzer, A. Basu, and M. Chan, "Virtualizing io through io memory management unit (iommu)," in ASPLOS-XXI Tutorials, 2016.
[23] E. S. Chung, J. D. Davis, and J. Lee, "Linqits: Big data on little clients," in ISCA-40, 2013.
[24] T. Moreau, M. Wyse, J. Nelson, A. Sampson, H. Esmaeilzadeh, L. Ceze, and M. Oskin, "Snnap: Approximate computing on programmable socs via neural acceleration," in HPCA-21, 2015.
[25] B. Pichai, L. Hsu, and A. Bhattacharjee, "Architectural support for address translation on gpus: Designing memory management units for cpu/gpus with unified address spaces," in ASPLOS-XIX, 2014.
[26] J. Power, M. D. Hill, and D. A. Wood, "Supporting x86-64 address translation for 100s of gpu lanes," in HPCA-20, 2014.
[27] T. W. Barr, A. L. Cox, and S. Rixner, "Translation caching: Skip, don't walk (the page table)," in ISCA-37, 2010.
[28] R. Bhargava, B. Serebrin, F. Spadini, and S. Manne, "Accelerating two-dimensional page walks for virtualized systems," in ASPLOS-XIII, 2008.
[29] Intel Corporation, TLBs, Paging-Structure Caches, and Their Invalidation, 2008.
[30] Y. S. Shao, S. Xi, V. Srinivasan, G.-Y. Wei, and D. Brooks, "Toward cache-friendly hardware accelerators," in Proc. Sensors to Cloud Architectures Workshop (SCAW), in conjunction with HPCA-21, 2015.
[31] C. Bienia, S. Kumar, J. P. Singh, and K. Li, "The parsec benchmark suite: characterization and architectural implications," in PACT-17, 2008.
[32] J. Hestness, S. W. Keckler, and D. A. Wood, "A comparative analysis of microarchitecture effects on cpu and gpu memory system behavior," in IISWC, 2014.
[33] J. Cong, Z. Fang, M. Gill, and G. Reinman, "Parade: A cycle-accurate full-system simulation platform for accelerator-rich architectural design and exploration," in ICCAD, 2015.
[34] N. Binkert, B. Beckmann, G. Black, S. K. Reinhardt, A. Saidi, A. Basu, J. Hestness, D. R. Hower, T. Krishna, S. Sardashti, et al., "The gem5 simulator," ACM SIGARCH Computer Architecture News, vol. 39, no. 2, pp. 1–7, 2011.
[35] J. Cong, B. Liu, S. Neuendorffer, J. Noguera, K. Vissers, and Z. Zhang, "High-level synthesis for fpgas: From prototyping to deployment," Trans. Comp.-Aided Des. Integ. Cir. Sys., vol. 30, pp. 473–491, Apr. 2011.
[36] N. Muralimanohar, R. Balasubramonian, and N. P. Jouppi, CACTI 6.0: A tool to model large caches. HP Laboratories, 2009.
[37] N. Amit, M. Ben-Yehuda, and B.-A. Yassour, "Iommu: Strategies for mitigating the iotlb bottleneck," in Workshop on Interaction between Operating Systems and Computer Architecture (WIOSCA), 2010.
[38] V. Karakostas, J. Gandhi, A. Cristal, M. D. Hill, K. S. McKinley, M. Nemirovsky, M. M. Swift, and O. S. Unsal, "Energy-efficient address translation," in HPCA-22, 2016.
[39] D. L. Black, R. F. Rashid, D. B. Golub, and C. R. Hill, "Translation lookaside buffer consistency: A software approach," in ASPLOS-III, 1989.
[40] A. Bhattacharjee, D. Lustig, and M. Martonosi, "Shared last-level tlbs for chip multiprocessors," in HPCA-17, 2011.
[41] A. Basu, J. Gandhi, J. Chang, M. D. Hill, and M. M. Swift, "Efficient virtual memory for big memory servers," in ISCA-40, 2013.
[42] C. McCurdy, A. L. Cox, and J. Vetter, "Investigating the tlb behavior of high-end scientific applications on commodity microprocessors," in ISPASS, 2008.
[43] M. Talluri and M. D. Hill, "Surpassing the tlb performance of superpages with less operating system support," in ASPLOS-VI, 1994.
[44] A. Arcangeli, "Transparent hugepage support," in KVM Forum, 2010.
[45] P. Hammarlund, "4th generation intel core processor, codenamed haswell," in Hot Chips, 2013.
[46] B. L. Jacob and T. N. Mudge, "A look at several memory management units, tlb-refill mechanisms, and page table organizations," in ASPLOS-VIII, 1998.
[47] A. Saulsbury, F. Dahlgren, and P. Stenström, "Recency-based tlb preloading," in ISCA-27, 2000.
[48] G. B. Kandiraju and A. Sivasubramaniam, "Going the distance for tlb prefetching: An application-driven study," in ISCA-29, 2002.
[49] A. Bhattacharjee, "Large-reach memory management unit caches," in MICRO-46, 2013.
[50] B. Jacob and T. Mudge, "Software-managed address translation," in HPCA-3, 1997.
[51] S. Shahar, S. Bergman, and M. Silberstein, "Activepointers: a case for software address translation on gpus," in ISCA-43, 2016.
[52] Intel Corporation, Accelerator abstraction layer software programmers guide.
[53] S. Neuendorffer and F. Martinez-Vallina, "Building zynq accelerators with vivado high level synthesis," in FPGA, 2013.
[54] Y. S. Shao, S. L. Xi, V. Srinivasan, G.-Y. Wei, and D. Brooks, "Co-designing accelerators and soc interfaces using gem5-aladdin," in MICRO-49, 2016.
[55] M. Malka, N. Amit, M. Ben-Yehuda, and D. Tsafrir, "riommu: Efficient iommu for i/o devices that employ ring buffers," in ASPLOS-XX, 2015.
[56] L. E. Olson, J. Power, M. D. Hill, and D. A. Wood, "Border control: sandboxing accelerators," in MICRO-48, 2015.