Compression Aware DCR
Article history: Received 30 September 2011; Received in revised form 17 January 2012; Accepted 30 January 2012

Keywords: Cache reconfiguration; Code compression; Embedded systems

Abstract

Optimization techniques are widely used in embedded systems design to improve overall area, performance and energy requirements. Dynamic cache reconfiguration is very effective in reducing the energy consumption of the cache subsystem, which accounts for about half of the total energy consumption in embedded systems. Various studies have shown that code compression can significantly reduce memory requirements and may improve performance in many scenarios. In this paper, we study the challenges and associated opportunities in integrating dynamic cache reconfiguration with code compression to retain the advantages of both approaches. We developed efficient heuristics to explore the large configuration space of a two-level cache hierarchy in order to study its effect on energy consumption. Experimental results demonstrate that a synergistic combination of cache reconfiguration and code compression can significantly reduce both energy consumption (61% on average) and memory requirements while drastically improving overall performance (up to 75%) compared to dynamic cache reconfiguration alone.
2.2. Code compression in embedded systems

Various code compression algorithms are suitable for embedded systems, i.e., they provide good compression efficiency with minor (acceptable) or no decompression overhead. Wolfe and Chanin [11] were among the first to propose an embedded processor design that incorporates code compression. Xie et al. [12] introduced a compression technique capable of compressing flexible instruction formats in VLIW architectures. Seong and Mishra [13] modified the dictionary-based compression technique using bitmasks (BMC), which improved compression efficiency without introducing any additional decompression overhead. Lin et al. [14] proposed LZW-based algorithms to compress branch blocks. Recently, Rawlins and Gordon-Ross [15] used compressed programs in their approach of combining loop caching with DCR. Their approach has several limitations. They primarily focus on loop caching, which may not be applicable in many embedded systems due to the intrusive addition of another level of cache. Furthermore, due to the emphasis on loop caching, the interactions between compression and DCR were not explored in detail. In this paper we provide a comprehensive analysis of how compression and DCR synergistically interact with each other, as well as of the energy-performance trade-offs available to the system designer.

The traditional code compression and decompression flow is illustrated in Fig. 2: compression is done offline (prior to execution) and the compressed program is loaded into memory. Decompression is done during program execution (online) and, as shown in Fig. 7, the decompression unit can be placed before or after the cache. It is also possible to place the decompression unit between two levels of cache if the system has a multi-level cache hierarchy.

Fig. 2. Traditional code compression methodology.

In this paper we explore three compression techniques: dictionary-based compression (DC), bitmask-based compression (BMC) [13], and Huffman coding. DC and Huffman coding represent two extremes. DC is a simple compression technique and therefore produces moderate compression, but decompression is very fast. On the other hand, Huffman coding is considered one of the most efficient compression techniques but has higher decompression overhead/latency. DC and Huffman are widely used, while BMC is a recent enhancement of DC that enables more matching patterns. Fig. 3 shows the generic encoding formats of the bitmask-based compression technique for various numbers of bitmasks. Compressed data stores information regarding the bitmask type, the bitmask location, and the mask pattern itself. The bitmask can be applied at different places in a vector, and the number of bits required for indicating the position varies depending on the bitmask type. Bitmasks may be sliding or fixed. A fixed bitmask can be applied only at fixed locations, such as byte boundaries, whereas a sliding bitmask can be applied anywhere in the code vector.

The main advantage of bitmask-based compression over traditional dictionary-based compression is the increased number of matching patterns. In dictionary-based compression, a vector is compressed only if it completely matches a dictionary entry. Fig. 4 illustrates an example in which bitmask-based compression can compress six data entries, whereas dictionary-based compression alone would compress only four entries. The example in Fig. 4 uses only one bitmask. In this case, vectors that exactly match a dictionary entry are compressed to 3 bits: the first bit indicates whether the vector is compressed (0) or not (1), the second bit indicates whether it is compressed using a bitmask (0) or not (1), and the last bit holds the dictionary index. Data compressed using a bitmask requires 8 bits: the first two bits, as before, indicate whether the data is compressed and whether a bitmask is used; the next three bits indicate the bitmask position, followed by two bits that indicate the bitmask pattern.

Fig. 4. An example of bitmask-based code compression.

In this example, the compression ratio is 80%. The compression ratio (CR), widely accepted as the primary metric for measuring the efficiency of code compression, is defined as:

CR = Compressed program size / Original program size

Bitmask selection and dictionary selection are two major challenges in bitmask-based code compression. Seong and Mishra [13] have shown that the profitable bitmasks to select for code compression are 1s, 2s, 2f, 4s, and 4f (where s and f stand for sliding and fixed bitmasks, respectively). Since the decompression engine must be able to start execution from any jump target, branch targets should be aligned in the compressed code. In addition, the mapping of old addresses (in the original uncompressed code) to new addresses (in the compressed code) is kept in a jump table.
3. Compression-aware DCR

It is a major challenge to optimize both performance and energy consumption simultaneously. In the case of DCR, the tradeoff between performance and energy consumption should be considered in order to choose the most profitable cache configuration for each application. Fig. 5 shows an example of performance-energy tradeoffs.
(Fig. 5 plot: execution time in millions of cycles versus energy consumption in milli J.)
Fig. 7 shows two different placements of the decompression unit. In pre-cache placement, the memory contains compressed code and instructions are stored in the cache in their original form, whereas in post-cache placement the decompression unit is placed between the cache and the processor, so both memory and cache contain compressed instructions.

Our studies show that pre-cache placement has very little effect on the energy and performance of the cache. In this case uncompressed instructions are stored in the cache, and when a cache miss occurs, the cache controller asks the decompression unit to provide a block of instructions. In the majority of cases the decompression hardware requires one clock cycle in pipelined mode (as shown in Fig. 7), so one clock cycle is added to the latency of the entire block fetch. In rare cases, e.g., when the first instruction of the block is not compressed, it introduces a two-cycle penalty, since it takes two cycles to fetch and decompress the instruction [17]. As demonstrated in Fig. 8, the energy consumption of the cache in pre-cache placement is almost the same as when no compression is involved. The best choice is therefore post-cache placement.
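As a rough illustration of the penalty just described (our own back-of-envelope model, not the authors' simulator), the block-fetch latency through the decompression unit can be expressed as:

```python
def block_fetch_cycles(base_cycles, first_insn_compressed=True):
    """Block-fetch latency through a pipelined decompression unit.

    Cycle counts follow the text: +1 cycle in pipelined mode, +2 in the
    rare case where the first instruction of the block is uncompressed.
    """
    return base_cycles + (1 if first_insn_compressed else 2)

assert block_fetch_cycles(10) == 11                               # common case
assert block_fetch_cycles(10, first_insn_compressed=False) == 12  # rare case
```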
(Fig. 8 plot: energy consumption in milli J, comparing BMC pre-cache placement against uncompressed code.)

4. Two-level cache tuning

In this section, we study the effect of a two-level cache hierarchy on compression and DCR. We consider a system with a unified (instruction/data) level two cache (L2). We compress only instructions; in other words, we do not consider data compression in this paper. However, selecting an energy efficient configuration for the L2 cache depends on both level one instruction and data caches (IL1 and DL1). Also, the number of L2 cache accesses directly depends on the number of L1 cache misses. Therefore we consider the energy consumption of the entire cache subsystem, including IL1, DL1, and L2.

We present efficient heuristics to generate profile tables with profitable cache configurations. Tuning a two-level cache faces the difficulty of exploring an enormous configuration space. In this paper, we examine typical exploration parameters of a two-level cache in conventional embedded systems. As discussed in Section 3.2, there are 18 (=3 + 6 + 9) configuration candidates for each L1 cache. Let Sil1 and Sdl1 denote the sizes of the exploration spaces for the IL1 and DL1 caches, respectively; thus Sil1 = 18 and Sdl1 = 18. For the L2 cache, we choose 8 KB, 16 KB and 32 KB as cache sizes; 32, 64 and 128 bytes as line sizes; and 4-, 8- and 16-way set associativity, with a 32 KB cache architecture composed of four separate banks. Similarly, there are 18 possible L2 configurations (Sul2 = 18). For comparison, we have chosen a base cache hierarchy, which reflects a globally optimal configuration for all the tasks, consisting of two 2 KB, 2-way set associative L1 caches with a 32-byte line size, and a 16 KB, 8-way set associative unified L2 cache with a 64-byte line size. The remainder of this section describes our proposed exploration techniques.
4.1. Exhaustive exploration

The obvious way to find the optimal configuration is to search the entire space exhaustively. Since the instruction and data caches can have different configurations, there are 324 (=Sil1 * Sdl1) possible configurations for the L1 caches. Adding the L2 cache increases the design space to 4752 configurations (not equal to Sil1 * Sdl1 * Sul2, because candidates in which the L2 cache's line size is smaller than that of either L1 cache are eliminated). We use the exhaustive method for comparison with the heuristics presented in the following sections. The design of these heuristics is motivated by the exploration heuristics of Wang and Mishra [18]. However, our approach also considers the effect of compression during exploration.
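The space sizes quoted above can be checked with a short enumeration. The sketch below assumes that, in the banked organization, associativity grows with cache size (1 KB/1-way, 2 KB/1-2-way, 4 KB/1-2-4-way for L1, and the analogous 8/16/32 KB pattern for L2); this reading of the "18 (=3 + 6 + 9)" breakdown is our assumption, but it reproduces the published counts exactly.

```python
from itertools import product

# (size KB, ways, line bytes) candidates for each cache level
L1 = [(s, w, l)
      for s, ways in ((1, (1,)), (2, (1, 2)), (4, (1, 2, 4)))
      for w in ways
      for l in (16, 32, 64)]          # 3 + 6 + 9 = 18 configurations
L2 = [(s, w, l)
      for s, ways in ((8, (4,)), (16, (4, 8)), (32, (4, 8, 16)))
      for w in ways
      for l in (32, 64, 128)]         # likewise 18 configurations

def valid(il1, dl1, ul2):
    # drop candidates whose L2 line size is smaller than an L1 line size
    return ul2[2] >= il1[2] and ul2[2] >= dl1[2]

pairs = list(product(L1, L1))         # L1 (instruction, data) combinations
space = [(i, d, u) for i, d in pairs for u in L2 if valid(i, d, u)]
print(len(pairs), len(space))         # -> 324 4752
```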
4.2. Independent L1 cache tuning – ICT

While different cache levels are dependent on each other, our initial results demonstrate that the instruction and data caches are relatively independent. In this study, we fix one cache's configuration while changing the other's to see whether varying one impacts the fixed one. We observe that the profiling statistics for the instruction cache remain almost identical across different data caches, and vice versa. This is mainly due to the fact that the access pattern of an L1 cache is purely determined by the application's characteristics, and the instruction and data streams are relatively independent of each other. Furthermore, the factors affecting the instruction cache's energy consumption as well as performance (such as hit energy, miss energy and miss penalty cycles) have very little dependency on the data cache, and vice versa.

This observation offers an opportunity to reduce the exploration space. We propose ICT – the Independent L1 Tuning heuristic – in which the IL1 and DL1 caches always use the same configuration while exploring all L2 cache configurations. This method results in a total of 288 configurations – a considerable reduction from the original quantity, though still not small. Throughout the static analysis, we keep bookkeeping of the energy consumption and miss cycles of each cache individually. The energy-optimal IL1 cache is the one with the lowest energy consumption by itself (and likewise for the DL1 cache and the L2 cache). We choose the cache configuration combination composed of the three locally energy-optimal caches as the energy-optimal cache hierarchy to be stored in the profile table.
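The ICT selection step can be sketched as follows, under the assumption that profiling has produced a per-cache energy figure for each simulated point; the function name and the shape of the profiling data are hypothetical, not the paper's interface.

```python
def ict_select(profile):
    """Pick the locally energy-optimal IL1, DL1 and L2 from ICT bookkeeping.

    `profile` maps (l1_config, l2_config) -> {"il1": e, "dl1": e, "ul2": e},
    one entry per simulated point where IL1 and DL1 share `l1_config`.
    """
    best = {}  # cache name -> (config, lowest energy seen so far)
    for (l1, l2), energies in profile.items():
        for cache in ("il1", "dl1", "ul2"):
            cfg = l2 if cache == "ul2" else l1
            if cache not in best or energies[cache] < best[cache][1]:
                best[cache] = (cfg, energies[cache])
    # profile-table entry: the three locally optimal caches combined
    return best["il1"][0], best["dl1"][0], best["ul2"][0]
```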
4.3. Interlaced tuning – ILT

We adapt the strategy used in TCaT [2] and propose ILT – the Interlaced Tuning heuristic – which finds energy-optimal parameters throughout the exploration. The basic idea is to tune the cache parameters in order of their importance to the overall energy consumption: cache size, followed by line size, and finally associativity. In order to increase the chances of finding the optimal L2 cache size, which we believe has the highest importance, we combine the exploration of the L2 cache's size and associativity. ILT is described below:
1. First, tune by cache size. Hold IL1's line size and associativity, as well as the entire DL1 configuration, at the smallest values, and set L2 to the base cache. Explore all three instruction cache sizes (1 KB, 2 KB and 4 KB) and find the energy-optimal one(s). Perform the same exploration for the DL1 cache size. In the L2 size exploration, we try all the associativities for each cache size. We set the L1 sizes to the energy-optimal ones in the process of finding the energy-optimal L2 size(s).
2. Next, tune by line size. We set the cache sizes to the energy-optimal ones, and L2's associativity to the one found in the first step, while exploring the energy-optimal line size for each cache. These two tasks are repeated for both L1 caches and L2.
3. Finally, tune by associativity. We set the cache sizes and line sizes to the energy-optimal ones while exploring the energy-optimal associativity. Note that we only explore associativities for the L1 caches in this step. During the process of finding DL1's optimal associativity, we already have all the other parameters needed to compute the total number of execution cycles required in the profile table.

In the worst case, ILT explores 30 configurations. The first step explores 6 candidates for the L1 caches and 9 for the L2 cache; the second step explores 9 (=3*3) candidates; the final step explores 6 (=3*2) candidates. However, in most cases there are many repeated configurations throughout the process that only have to be executed once. In practice, ILT has an exploration space of around 19 configurations. The steps are summarized in the sketch below.
ones and L2’s associativity found in the first step in exploring bits and 32 bits word sizes. We found out that dictionary size of
5. Experiments

In order to quantify compression-aware cache configuration tradeoffs, we have applied our methodology to selected embedded system benchmarks. Following the same flow as in Sections 3 and 4, we first investigate the integration of code compression with DCR for systems with one level of cache. In Section 5.3, we extend our experiments to evaluate our method in the presence of a two-level cache.

5.1. Experimental setup

We examined the cjpeg, djpeg, epic, adpcm (rawcaudio) and g.721 (encode, decode) benchmarks from MediaBench [19], and dijkstra and patricia from MiBench [20], compiled for the Alpha target architecture. These benchmarks are all specially designed for embedded systems and are suitable for the cache configuration parameters described in Section 3.2. All applications were executed with the default input sets provided with the benchmark suites.

Three different code compression techniques were used: bitmask-based, dictionary-based and Huffman code compression. To achieve the best attainable compression ratios in bitmask-based compression, for each application we examined dictionaries of 1 KB, 2 KB, 4 KB, and 8 KB. Similar to Seong and Mishra [13], we tried three mask sets, including one 2-bit sliding, 1-bit sliding and 2-bit fixed, and 1-bit sliding and 2-bit fixed masks. Similarly, for dictionary-based and Huffman compression we used 0.5 KB, 1 KB, 2 KB, 4 KB, and 8 KB dictionary sizes with 8-bit, 16-bit and 32-bit word sizes. We found that a dictionary size of 2 KB and a word size of 16 bits are the best choices for this set of benchmarks. The reason is that using 8-bit words increases the number of compression decision bits, while using a 32-bit word size decreases the word frequencies significantly. Hence, as the simulation results showed, a 16-bit word size is the best choice.

Code compression is performed offline. In order to extract the code (instruction) part from the executable binaries, we used the ECOFF (Extended Common Object File Format) header files provided in the SimpleScalar toolset [16]. We placed the compressed code back into the binary files so that they can be loaded into the simulator.
We utilized the configurable cache architecture developed by Zhang et al. [6] with a four-bank cache of base size 4 KB, which offers sizes of 1 KB, 2 KB, and 4 KB, line sizes ranging from 16 bytes to 64 bytes, and associativities of 1-way, 2-way, and 4-way. For comparison purposes, we set the base L1 cache configuration to a 4 KB, 4-way set associative cache with a 32-byte line size, a reasonably common configuration that meets the average needs of the studied benchmarks.

To obtain cache hit and miss statistics, we modified the SimpleScalar toolset [16] to decode and simulate compressed applications. We implemented and placed the required decompression routines for the respective compression algorithms in the SimpleScalar simulator. We modeled the latency of the decompression unit carefully. The decompression unit can decompress the next instruction in one cycle (in pipelined mode) if it finds all the needed bits in its buffer. Otherwise, it takes one cycle (or more, if a cache miss occurs) to fetch the needed bits into its buffer and one more cycle to decompress the next instruction. The correctness of the compression and decompression algorithms was verified by comparing the outputs of compressed applications with their uncompressed versions. The performance overhead of decompression includes the decompression unit buffer flush overhead due to jumps, and the variable latency of memory reads in each block fetch (because of variable-length compressed code). These overheads are negligible according to the experimental results.
We applied the same energy model used in [6], which calculates both dynamic and static energy consumption, memory latency, CPU stall energy, and main memory fetch energy. The energy model was modified to include decompression energy. We updated the dynamic energy consumption for each cache configuration using CACTI 4.2 [21].
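The structure of such a model can be sketched as follows. The breakdown into these terms follows the description above, but the variable names and exact form are our illustration, not the authors' exact model; per-configuration coefficients for the dynamic terms would come from CACTI-style tables.

```python
def cache_energy(stats, coeff):
    """Total energy for one cache configuration, in the spirit of [6].

    stats: profiled counts (accesses, misses, total_cycles, decompressions)
    coeff: per-configuration coefficients, e.g. dynamic access energy from
    CACTI 4.2; all field names here are our own.
    """
    dynamic = stats["accesses"] * coeff["access_energy"]
    miss = stats["misses"] * (coeff["mem_fetch_energy"] + coeff["stall_energy"])
    static = stats["total_cycles"] * coeff["static_energy_per_cycle"]
    decompression = stats["decompressions"] * coeff["decomp_energy"]  # added term
    return dynamic + miss + static + decompression
```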
5.2. One-level cache tuning

Fig. 9. Energy consumption of the selected "minimal-energy cache" normalized to the base cache.

Energy consumption for several benchmarks from MediaBench and MiBench is analyzed under different approaches: a fixed base cache configuration, bitmask-based compression without DCR (BMC only), DCR without compression (DCR only), dictionary-based compression with DCR (DC + DCR), Huffman coding with DCR (Huffman + DCR), and bitmask-based compression with DCR (BMC + DCR). The most energy efficient cache configuration found by exploration under each technique is used for comparison. Fig. 9 presents energy savings for the instruction cache subsystem. Energy consumption is normalized to the fixed base cache configuration, such that a value of 100% represents our baseline. Energy savings in the instruction cache subsystem range from 10% to 76%, with an average of 45%, when utilizing only DCR. As expected, due to its higher decompression overhead, Huffman (when combined with DCR) achieves lower energy savings than BMC for virtually all benchmarks. Energy savings of the DC + DCR approach are even lower than Huffman + DCR as a result of DC's moderate compression ratio. Incorporating BMC into DCR increases energy savings by up to 48% – on top of the 10–76% energy savings obtained by DCR alone – without any performance degradation. Our methodology achieves on average 61% energy savings in the cache subsystem.
Energy consumption of some benchmarks is reduced drastically when using BMC. For example, the energy consumption of the cjpeg benchmark is decreased by nearly 50% when applying BMC on top of DCR compared to using DCR alone. Fig. 10(a) shows the number of cache misses per thousand dynamic instructions for the cjpeg benchmark. It shows that for smaller cache sizes, cache misses are drastically reduced when incorporating compression. In other words, by using compression, smaller cache sizes are capable of containing the critical portion of the cjpeg benchmark, keeping the number of misses low (maintaining performance) while reducing static energy consumption. Fig. 10(b) presents the same statistics for the rawcaudio benchmark. It should be noted that although integrating compression with DCR reduces the number of cache misses for small cache sizes (similar to the cjpeg behavior), it does not drastically decrease energy consumption. The extremely low rate of cache misses, usually less than 0.05 (0.45 in the extreme case) misses per thousand dynamic instructions, leads to a nominal contribution of cache misses to the overall energy consumption of the cache. For this reason, dynamic energy consumption is nearly the same for all configurations, and DCR chooses the smallest possible cache configuration to minimize static energy. In this case, incorporating compression into DCR with the selection of a small cache size can only reduce the cache misses and therefore the dynamic energy, and thus has a small impact on the overall cache energy consumption of the rawcaudio benchmark.

Fig. 10. Number of cache misses using DCR only and with various compression techniques: (a) cjpeg; (b) rawcaudio.

Fig. 11 illustrates an example of the performance-energy consumption tradeoffs for both the uncompressed and compressed (using BMC) cases of the rawcaudio (adpcm-enc) benchmark. It can be observed that for every possible configuration of the uncompressed program there is an alternative that has better performance and a lower energy requirement if the program is compressed. This observation shows that compression-aware DCR leads to better design choices.

Another observation we have made is that without DCR, applying compression to an application (which executes using a base cache configuration that already fits the critical portion of the application) does not gain noticeable energy savings. However, compression-aware DCR effectively uses the reduced program size achieved by compression to choose a smaller cache size, associativity, or line size and still fit the critical portion of the program. Therefore, compression-aware DCR can achieve more energy savings than DCR alone. Fig. 12 illustrates a comparison of the energy profiles of different caches for the compressed (using BMC) and uncompressed cjpeg benchmark. Using a 4 KB cache with 4-way associativity and a 64-byte line size, the energy consumption of the cjpeg benchmark is nearly the same for the compressed and uncompressed programs.

Fig. 12. The impact of cache/line size on the energy profile of the cache using the cjpeg benchmark.

In post-cache placement, compression has a significant effect when combined with small cache sizes. In this case compressed instructions are stored in the cache. Since the compressed code is 30–45 percent smaller than the uncompressed code, it can fit in smaller cache sizes. However, when the size of the selected cache increases, the critical portion of the program (regardless of whether it is compressed or not) fits into the cache entirely. Therefore, with large cache sizes the energy consumption of the compressed code is very close to that of the uncompressed one. It should be noted that the main objective of exploration is to find the most energy efficient cache configurations, so we are not interested in large cache sizes, since they require more energy.

Fig. 13 shows the performance of the applications under the different schemes, normalized to the base cache. Applying DCR alone for the purpose of energy saving results in a 12% performance loss on average. We observe that code compression can improve performance in many scenarios while achieving a significant reduction in energy consumption. For instance, in the case of the application patricia, applying only DCR would result in 12% performance degradation with 34% energy savings. However, incorporating BMC boosts performance by 33% while gaining an extra 17% energy savings on top of DCR, achieving 51% energy savings compared to the base cache. The results show that the synergistic integration of BMC with DCR achieves as much as 75% performance improvement for g721_enc (27% improvement on average) compared to DCR alone. Thus it is possible to have a cache architecture that is tuned for applications to achieve both increased performance and lower energy consumption. Fig. 14 shows the performance and miss statistics for the g721_enc benchmark. Further analysis of the g721_enc benchmark reveals that its numerous small if-then-else and switch clauses lead to a large number of misses due to overlapping addresses (conflict misses). In this case, compression reduces the number of misses by decreasing the amount of address overlap among these small code sections. Fig. 14 confirms that compression drastically improves the performance of g721_enc for most of the available cache configurations. Fig. 15 shows the performance trend across all cache configurations.
Fig. 13. Performance of the selected "minimal-energy cache" normalized to the base cache.

Fig. 14. Number of cache misses using DCR only and DCR + BMC for the g721_enc benchmark.
5.3. Two-level cache tuning

To evaluate the effect of a two-level cache hierarchy using our exploration heuristics, we selected the cjpeg, djpeg and epic benchmarks from MediaBench [19] and crc32 from the MiBench [20] benchmark suite. For the L2 cache, we choose 8 KB, 16 KB and 32 KB as possible cache sizes; 32, 64 and 128 bytes as line sizes; and 4-, 8- and 16-way set associativity, with a 32 KB cache architecture composed of four separate banks. The L2 cache is unified; in other words, it contains both instructions and data. We define the L2 base cache to be a 16 KB, 8-way set associative L2 cache with a 64-byte line size. We quantify the cache subsystem energy savings of our approach by comparing against the base cache scenario. We use four cache exploration methods – exhaustive, ICT, ILT, and HIT – to generate profile tables. Fig. 16 presents the total cache hierarchy energy consumption normalized to the base cache for the cjpeg benchmark using each exploration technique. It can be observed that, for the cjpeg benchmark, the best results obtained by the heuristics are very close to the optimal value obtained by exhaustive search. As we explained in Section 4.1, exploring all possible configurations exhaustively results in 4752 simulations. Performing this set of simulations for the cjpeg benchmark (which takes the lowest simulation time among the benchmarks in the suites) on a system with a 4-core AMD Opteron (an x86 server processor) running at 3.0 GHz takes more than three days. Clearly, this will take longer for other benchmarks. Although these heuristics take significantly less time than exhaustive exploration, they provide energy savings very close to optimal. Table 1 presents the total number of cache configurations explored by each exploration heuristic. Our experience is that it may take several days to profile a task using the exhaustive method, but only a few minutes if ILT is employed. Designers can decide which heuristic to use based on the static profiling time and the overall energy savings. Therefore, we only perform heuristic space exploration for the remaining benchmarks.

Fig. 16. Cache hierarchy energy consumption using heuristics for the cjpeg benchmark.

Table 1. Cache hierarchy configurations explored using different exploration methods for the cjpeg benchmark.

Fig. 17 presents the total cache hierarchy energy consumption normalized to the base cache for the cjpeg, djpeg, epic, patricia and dijkstra benchmarks using each exploration technique for the uncompressed and compressed scenarios. ICT achieves the best results, obtaining 67% average energy saving when applying DCR only.

Fig. 17. Cache hierarchy energy consumption using three heuristics.
6. Conclusion

This paper presented the synergistic integration of DCR and code compression for embedded systems. Our methodology employs an ideal combination of code compression and dynamic tuning of two-level cache parameters, with minor or no impact on timing constraints. Our experimental results demonstrated a 61% reduction on average in the overall energy consumption of the cache subsystem, as well as up to 75% performance improvement (compared to DCR only) in embedded systems.
References
[1] A. Malik, B. Moyer, D. Cermak, A low power unified cache architecture providing power and performance flexibility, ISLPED (2000).
[2] A. Gordon-Ross, F. Vahid, N. Dutt, Automatic tuning of two-level caches to embedded applications, DATE (2004).
[3] C. Lefurgy, Efficient execution of compressed programs, Ph.D. Thesis, University of Michigan, 2000.