0% found this document useful (0 votes)
39 views

Compression Aware DCR

Compression aware DCR
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
39 views

Compression Aware DCR

Compression aware DCR
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 10

Sustainable Computing: Informatics and Systems 2 (2012) 71–80

Contents lists available at SciVerse ScienceDirect

Sustainable Computing: Informatics and Systems


journal homepage: www.elsevier.com/locate/suscom

Compression-aware dynamic cache reconfiguration for embedded systems夽


Hadi Hajimiri ∗ , Kamran Rahmani, Prabhat Mishra
Department of Computer & Information Science & Engineering, University of Florida, Gainesville, FL, USA

a r t i c l e i n f o a b s t r a c t

Article history: Optimization techniques are widely used in embedded systems design to improve overall area, per-
Received 30 September 2011 formance and energy requirements. Dynamic cache reconfiguration is very effective to reduce energy
Received in revised form 17 January 2012 consumption of cache subsystems which accounts for about half of the total energy consumption in
Accepted 30 January 2012
embedded systems. Various studies have shown that code compression can significantly reduce memory
requirements, and may improve performance in many scenarios. In this paper, we study the challenges
Keywords:
and associated opportunities in integrating dynamic cache reconfiguration with code compression to
Cache reconfiguration
retain the advantages of both approaches. We developed efficient heuristics to explore large space of
Code compression
Embedded systems
two-level cache hierarchy in order to study the effect of a two-level cache on energy consumption. Experi-
mental results demonstrate that synergistic combination of cache reconfiguration and code compression
can significantly reduce both energy consumption (61% on average) and memory requirements while
drastically improve the overall performance (up to 75%) compared to dynamic cache reconfiguration
alone.
© 2012 Elsevier Inc. All rights reserved.

1. Introduction to have both increased performance as well as lower energy con-


sumption. Since too many cache configurations are possible, the
Energy conservation has been a primary optimization objec- challenge is to determine the best cache configuration (in terms of
tive in designing embedded systems as these systems are generally total size, associativity, and line size) for a particular application.
limited by battery lifetime. Several studies have shown that mem- Studies have shown that cache tuning can achieve 53% memory-
ory hierarchy accounts for as much as 50% of the total energy access-related energy savings and 30% performance improvement
consumption in many embedded systems [1]. Dynamic cache [2].
reconfiguration (DCR) and code compression are two of the exten- The use of high-level programming languages coupled with RISC
sively studied approaches in order to achieve energy savings as well instruction sets leads to a larger memory footprint and increased
as area and performance gains. area/cost and power requirements, all of which are important
Different applications require highly diverse cache configura- design constraints in most embedded applications. Code compres-
tions for optimal energy consumption in the memory hierarchy. sion is clearly beneficial for memory size reduction because it
Unlike desktop-based systems, embedded systems are designed to reduces the static memory size of executable code. Several code
run a specific set of well-defined applications. Thus it is possible compression techniques have been proposed for reducing instruc-
to have a cache architecture that is tuned for those applications tion memory size in low cost embedded applications [3]. The basic
idea is to store instructions in compressed form and decompress
them on-the-fly at execution time. More importantly, code com-
夽 This is an extended version of the paper that has appeared in the proceedings of pression could also be beneficial for energy by reducing memory
International Green Computing Conference (IGCC) 2011 [22]. The IGCC paper pre- size and the communication between memory and the processor
sented some initial results for employing dynamic cache reconfiguration and code core [4].
compression for embedded systems with one level (L1) cache. This article considers Design of efficient compression techniques needs to con-
the effects of code compression on DCR in the presence of both level one (L1 data and
L1 instruction) and unified level two (L2) caches. Specifically the following major
sider two important aspects. First, the compressed code has
contributions are new: (i) we have used a highly configurable two-level cache hier- to support the possibility of starting the decompression dur-
archy and added Section 4 to propose multi-level cache tuning heuristics for DCR; ing execution at several points inside the program (i.e., branch
(ii) we have added Section 5.3 to present new experimental results for exploration targets). Second, since decompression is performed on-line, dur-
of two-level cache hierarchy.
∗ Corresponding author. ing program execution, decompression algorithms should be fast
E-mail addresses: [email protected]fl.edu, [email protected] (H. Hajimiri), and power efficient to achieve savings in memory size and
[email protected]fl.edu (K. Rahmani), [email protected]fl.edu (P. Mishra). power, without compromising performance. We explore various

2210-5379/$ – see front matter © 2012 Elsevier Inc. All rights reserved.
doi:10.1016/j.suscom.2012.01.003
72 H. Hajimiri et al. / Sustainable Computing: Informatics and Systems 2 (2012) 71–80

a traditional system and Fig. 1(b) depicts a system with a reconfig-


urable cache. For the ease of illustration let’s assume cache size is
the only reconfigurable parameter of cache (associativity and line
size are ignored). In this example, Task1 starts its execution at time
P1. Task2 and Task3 start at P2 and P3 respectively. In a traditional
approach, the system always executes using a 4096-byte cache. We
call this cache as base cache throughout the paper. Base cache is the
best possible cache configuration optimized for all the tasks. With the
option of reconfigurable cache, Task1, Task2, and Task3 execute
using 1024-byte cache starting at P1, 8192-byte cache starting at
P2, and 4096-byte cache starting at P3 respectively. Through proper
selection of cache size for each task the system can achieve sig-
nificant amount of energy savings as well as performance gains
compared to using only the base cache.
The inter-task DCR problem is defined as follows. Consider a
set of n applications (tasks) A = {a1 , a2 , a3 , . . ., an } intended to run
on a configurable cache architecture capable of supporting m pos-
sible cache configurations C = {c1 , c2 , c3 , . . ., cm }. We define e(cj ,
ai ) as the total energy consumed by running application ai on the
architecture with cache configuration cj . We also define co ∈ C as
Fig. 1. DCR for a system with three tasks. the optimal cache configuration for application ai , such that e(co ,
ai ) ≤ e(cj , ai ), ∀cj ∈ C. Through exhaustive exploration of all possible
configurations of C = {c1 , c2 , c3 , . . ., cm }, best energy optimal cache
configuration for each application can be found.
compression techniques (including dictionary-based compression, Dynamic cache reconfiguration has been extensively studied in
bitmask-based compression and Huffman coding) that represent a several works [5–8]. The reconfigurable cache architecture pro-
trade-off between compression performance and decompression posed by Zhang et al. [6] determines the best cache parameters by
overhead. using Pareto-optimal points trading off energy consumption and
It is expected that by compressing instructions the cache behav- performance. Their method imposes no overhead to the critical
ior of programs is no longer the same. Thus in order to have path, thus cache access time does not increase. Chen et al. [9] intro-
the optimal cache configuration, more analysis should be done duced a novel reconfiguration management algorithm to efficiently
including hit/miss behavior of the compressed programs. In other search the large design space of possible cache configurations for
words, cache reconfiguration needs to be aware of code compres- the optimal one. None of these approaches consider the effects of
sion to obtain best possible area, power and performance results. compressed code on cache reconfiguration.
In this paper, we present an elaborate analysis of combining two DCR can be viewed as a technique that tries to squeeze cache
optimization techniques: dynamic cache reconfiguration and code size with other cache parameters to reduce energy consump-
compression. In addition, we propose efficient heuristics to explore tion without (or with minor) performance degradation. Smaller
large design space of two-level cache hierarchy in order to find caches contribute less static power but may increase cache misses
energy efficient cache configurations. Our experimental results which can lead to increased dynamic power and performance
demonstrate that the combination is synergistic and achieves more degradation (longer execution time thus higher energy consump-
energy savings as well as overall performance improvement com- tion). Therefore, the smallest possible cache may not be a feasible
pared to DCR and code compression alone. solution in many cases. DCR techniques find the best cache
The rest of the paper is organized as follows. Section 2 provides that fits the application by exploring cache configurations using
an overview of related research activities. In Section 3, we describe various schemes. In this paper, we show that code compres-
our compression-aware cache reconfiguration methodology. Sec- sion which significantly reduces the code size can also help the
tion 4 presents efficient heuristics to explore two-level cache cache reconfiguration technique to choose relatively smaller cache
hierarchy. Section 5 presents our experimental results. Finally, Sec- sizes, smaller associativity, or smaller line size without perfor-
tion 6 concludes the paper. mance degradation, therefore, reduces cache energy consumption
significantly.
2. Background and related work The configurable caches used in our work are based on the
architecture described in [10]. The underlying cache architecture
2.1. Dynamic cache reconfiguration (DCR) contains four separate banks that can operate as four separate
ways. Special configuration registers are used to inform the cache
In power constrained embedded systems, nearly half of the tuner – a custom hardware or a lightweight process – to con-
overall power consumption is attributed to the cache subsystem catenate ways such that the associativity can be altered. The
[1]. Applications require vastly different cache requirements in special registers may also be configured to shut down ways to
terms of cache size, line size, and associativity. Research shows vary the cache size. Similarly, by configuring the fetch unit to
that specializing the cache to application’s needs can significantly fetch cache lines in various lengths, we can adjust the line sizes.
reduce energy consumption [2]. Fig. 1 illustrates how energy con- The area overhead for this architecture is 3%. In addition, search-
sumption can be reduced by using inter-task (application-based) ing an average of 5.4 configurations to find the best configuration
cache reconfiguration in a simple system supporting three tasks. has a very low energy consumption of 11.9 nJ on average. This
In application-based cache tuning, DCR happens when a task starts energy is negligible compared to the energy consumption in bench-
its execution or it resumes from an interrupt (either by preemp- marks that is 2.34 J on average. In this paper the only part that
tion or when execution of another task completes) and the same we used is dynamic cache reconfiguration. It means our archi-
cache for the application gets chosen no matter if it is starting from tecture is not self-tuning and has much less overhead compared
the beginning or resuming anywhere in between. Fig. 1(a) depicts to [10].
H. Hajimiri et al. / Sustainable Computing: Informatics and Systems 2 (2012) 71–80 73

Format for Uncompressed Code


Application
Program Compression Decision
Algorithm (1-bit) Uncompressed Data
(binary)

Format for Compressed Code


Fetch and Decompression Compressed Decision # of mask Mask Mask Dictionary
Execute Hardware Code Location …
(1-bit) patterns Type pattern Index
Processor + Memory
Cache Hierarchy Extra bits for considering mismatches

Fig. 3. Encoding format for incorporating mismatches.

Fig. 2. Traditional code compression methodology. compressed only if it completely matches with a dictionary entry.
Fig. 4 illustrates an example of bitmask-based compression in
which it can compress up to six data entries using bitmask-based
2.2. Code compression in embedded systems compression, whereas using only dictionary-based compression
would compress only four entries. The example in Fig. 4 uses only
Various code compression algorithms are suitable for embed- one bitmask. In this case, vectors that match exactly a dictionary
ded systems, i.e., provide good compression efficiency with minor entry are compressed with 3 bits. The first bit represents whether
(acceptable) or no decompression overhead. Wolfe and Chanin [11] it is compressed (using 0) or not (using 1). The second bit indicates
were among the first to propose an embedded processor design whether it is compressed using bitmask (using 0) or not (using 1).
that incorporates code compression. Xie et al. [12] introduced a The last bit indicates the dictionary index. Data that are compressed
compression technique capable of compressing flexible instruction using bitmask requires 8 bits. The first two bits, as before, repre-
formats in VLIW architectures. Seong and Mishra [13] modified sent if the data is compressed, and whether the data is compressed
dictionary-based compression (BMC) technique using bitmasks using bitmasks. The next three bits indicate the bitmask position
which improved compression efficiency without introducing any and followed by two bits that indicate the bitmask pattern.
additional decompression overhead. Lin et al. [14] proposed LZW- In this example, the compression ratio is 80%. Compression ratio
based algorithms to compress branch blocks. Recently, Rawlins and (CR), widely accepted as a primary metric for measuring the effi-
Gordon-Ross [15] used compressed programs in their approach ciency of code compression, is defined as:
of combined loop caching with DCR. Their approach has several
limitations. They primarily focus on loop caching which may not Compressed program size
CR =
be applicable in many embedded systems due to intrusive addi- Original program size
tion of another level of cache. Furthermore, due to emphasis on Bitmask selection and dictionary selection are two major chal-
loop caching, interactions between compression and DCR was not lenges in bitmask-based code compression. Seong and Mishra [13]
explored in detail. In this paper we provide comprehensive analy- have shown that the profitable bitmasks to be selected for code
sis of how compression and DCR synergistically interact with each compression are 1s, 2s, 2f, 4s, and 4f (s and f stand for sliding and
other as well as energy-performance trade-offs available for system fixed bitmasks respectively). Since the decompression engine must
designer. be able to start execution from any of jump targets, branch targets
Traditional code compression and decompression flow is illus- should be aligned in the compressed code. In addition, the map-
trated in Fig. 2 where the compression is done offline (prior to ping of old addresses (in the original uncompressed code) to new
execution) and the compressed program is loaded into the memory. addresses (in the compressed code) is kept in a jump table.
The decompression is done during the program execution (online)
and as shown in Fig. 7 it can be placed before or after cache. It is pos-
3. Compression-aware DCR
sible to place the decompression unit between two levels of cache
as well, if the system has multi-level cache hierarchy.
It is a major challenge to optimize both performance and
In this paper we explore three compression techniques:
energy consumption simultaneously. In case of DCR, tradeoffs
dictionary-based compression (DC), bitmask-based compression
between performance and energy consumption should be consid-
(BMC) [13], and Huffman coding. DC and Huffman coding represent
ered in order to choose the most profitable cache configuration for
two extremes. DC is a simple compression technique and therefore
each application. Fig. 5 shows an example of performance-energy
produces moderate compression but decompression is very fast.
On the other hand, Huffman coding is considered to be one of the
most efficient compression techniques but has higher decompres-
sion overhead/latency. DC and Huffman are widely used but BMC is
a recent enhancement of DC that enables more matching patterns.
Fig. 3 shows the generic encoding formats of bitmask-based com-
pression technique for various numbers of bitmasks. Compressed
data stores information regarding the bitmask type, bitmask loca-
tion, and the mask pattern itself. The bitmask can be applied in
different places in a vector and the number of bits required for indi-
cating the position varies depending on the bitmask type. Bitmasks
may be sliding or fixed. A fixed bitmask can be applied to fixed
locations, such as byte boundaries. However, sliding bitmasks can
be applied anywhere in the code vector.
The main advantage of bitmask-based compression over
traditional dictionary-based compression is the increased match-
ing patterns. In dictionary-based compression, each vector is Fig. 4. An example of bitmask-based code compression.
74 H. Hajimiri et al. / Sustainable Computing: Informatics and Systems 2 (2012) 71–80

120
Execuon Time (Millions of cycles)

100

80

60

40

20

0
1.8 3.8 5.8 7.8 9.8 11.8
energyconsumpon (milli J)

Fig. 5. An example of performance-energy consumption tradeoff using Anagram


benchmark (Pareto optimal alternatives are connected using dashed lines).
Fig. 6. Different caches used in different scenarios. Cache1: conventional system
without reconfiguration, Cache2: only dynamic reconfiguration (no compression),
Cache3: both dynamic reconfiguration and compression.
consumption tradeoff using Anagram benchmark. Each dot rep-
resents a cache configuration showing its corresponding energy
consumption and total execution time of the task. By plotting all Incorporating compression into DCR would lead to selection of
cache configurations in performance-energy consumption graph Cache3. Applying compression will help dynamic reconfiguration
(based on time and energy consumption from simulation results) to perfectly fit the critical portion of the application into smaller
we can determine Pareto optimal points representing feasible cache thus gaining even more energy savings without increasing
alternatives. For instance, increasing cache line or associativity the execution time.
can improve performance and may increase energy consump-
tion as well. High performance alternatives will sacrifice some
3.2. Compression-aware DCR
amounts of energy while selecting energy saving options would
have lower performance. The remainder of this section describes
Here, we consider systems with one level cache. In Section 4 we
how to combine the advantages of both compression and dynamic
extend our approach for systems with two-level cache. Algorithm 1
reconfiguration.
outlines the major steps in our cache configuration selection in the
presence of compressed applications. The algorithm collects simu-
3.1. Motivation lation results for all possible cache configurations (cache sizes of 1
KB, 2 KB, 4 KB, and 8 KB; associativity of 1, 2, 4-way; cache line sizes
A reconfigurable cache can be viewed as an elastic cache with of 16, 32, 64). It finds the best energy optimal cache configuration
flexible parameters such as cache size, line size, and associativ- for each application through exhaustive exploration of all possible
ity. The dynamic reconfiguration technique exploits the elasticity cache configurations of C = {c1 , c2 , c3 , . . ., cm }. Number of simulation
of such caches by selecting a profitable cache configuration which cycles for each run is collected based on the simulation results. The
is capable of maintaining the critical portion of the application to energy model of [6] is used to calculate the energy consumption
reduce energy consumption. Choosing smaller caches that fail to using the cache hit and miss statistics. The algorithm finally con-
store the critical portion of the program may lead to increased cache structs the Pareto optimal alternatives and returns it in a list. The
misses thus longer execution time and eventually escalation in most energy efficient cache configuration among all Pareto optimal
energy consumption. However, it is possible that the cache recon- alternatives which satisfies timing requirements of the application
figuration method may find a cache configuration that increases is chosen next. Suppose there are two cache configurations, C1 with
the execution time of the application in spite of reduced energy execution time of 2 million cycles and energy consumption of 5 mJ
consumption. This may not be an issue for systems without real- and C2 with execution time of 1.8 million cycles and energy con-
time constraints but timing constraints in real-time applications sumption of 6 mJ, available in the Pareto optimal list of alternatives.
limit use of such cache reconfiguration techniques. Integrating code If the task has to be done in 1.9 million cycles, the faster alterna-
compression with cache reconfiguration resolves this problem by tive (C2) gets chosen. If the timing requirement of the task is not
effectively shrinking the program size in order to fit the critical constrained by 2 million cycles, the more energy efficient cache
portion of the application into a smaller cache. alternative (C1) gets selected.
Fig. 6 illustrates different caches for a real-time embedded sys-
tem with a set of applications. Associativity is ignored for the ease Algorithm 1.
Finding Pareto optimal cache configurations
of illustration. The horizontal and vertical axis show different pos-
Input: compressed code
sibilities of cache size and line size, respectively. The base cache Output: List of Pareto optimal cache alternatives
is a globally optimized cache is used for all applications and has Begin
the minimal aggregate energy consumption while ensuring that li = an empty list to store cache alternatives
no deadlines are missed. As an illustrative example, Fig. 6 shows for s = cache sizes of 1 KB, 2 KB, 4 KB, and 8 KB do
for a = associativity of 1,2,4-way do
one application in this set in three scenarios: no reconfiguration
for l = cache lines of 16,32,64 do
or compression, reconfiguration without compression, and recon- do cycle accurate simulation for cache Cs,a,l ;
figuration + compression. Cache1 is used for this application when ts,a,l = simulation cycles;
no compression or cache reconfiguration is available. Cache2 is es,a,l = energy consumption of the cache subsystem;
add the triple (Cs,a,l , ts,a,l , es,a,l ) to li;
the cache selected by dynamic reconfiguration technique (with no
end for
compression) to reduce the energy consumption of this applica- end for
tion. But to ensure real-time (deadline) constraints, low energy end for
cache alternatives may get rejected because of longer execution return Pareto optimal points in li;
times (critical portion of applications may not fit, for example). end
H. Hajimiri et al. / Sustainable Computing: Informatics and Systems 2 (2012) 71–80 75

Instruction Main placement to achieve maximum performance as well as minimum


Processor Decompression Unit
Cache Memory energy consumption.
Incorporating compression, cache miss penalty caused by mem-
a)pre-cache placement ory fetch latency is reduced because of improved bandwidth (since
compressed code is smaller). In addition, off-chip access energy (the
Processor Decompression Unit Instruction Main buses to main memory and memory access) is also reduced since
Cache Memory
the decompression engine reads compressed code from memory
b)post-cache placement resulting in lower traffic to main memory. However, post-cache
placement can introduce significant performance overhead to the
Fig. 7. Different placement of decompression unit. system. Seong and Mishra [13] presented a bitmask-based com-
pression technique that adds no penalty to the system performance
The algorithm is similar to traditional DCR but uses compressed using pipelined one-cycle decompression engine with negligible
code. Therefore the simulation/profiling infrastructure needs to power requirement. Using this decompression engine makes it
have decompression unit to provide the ability of decoding com- practical to place the decompression unit after cache (post-cache
pressed instructions. For example, in our case, we implemented and placement) and benefit from the compressed code stored in the
placed the required decompression routines/functions for respec- cache.
tive compression algorithms in Simplescalar simulator [16]. In the context of embedded systems one of the main goals is
In this section, we consider systems with only one level of maximizing energy savings while ensuring the system will meet
reconfigurable cache architecture; therefore number of cache con- applications requirements. Usually, choosing a cache configuration
figurations is small. So we can exhaustively explore all possible for energy savings may result in performance degradation. How-
configurations in a reasonable time. Since the reconfiguration of ever, the synergistic combination of cache reconfiguration and code
associativity is achieved by way concatenation, 1 KB L1 cache can compression enables energy savings without loss of performance.
only be direct-mapped as other three banks are shut down. For the Our proposed methodology provides an efficient and optimal strat-
same reason, 2 KB cache can only be configured to direct-mapped egy for cache tuning based on static profiling using compressed
or 2-way associativity. Therefore, there are 18 (=3 + 6 + 9) configu- programs.
ration candidates for L1.

3.3. Placement of decompression hardware 4. Tuning of two-level caches

Fig. 7 shows two different placement of the decompression unit. In this section, we study the effect of a two-level cache hierar-
In pre-cache placement the memory contains compressed code and chy on compression and DCR. We consider a system with a unified
instructions are stored in cache in original form. Whereas, in the (instruction/data) level two cache (L2). We compress only instruc-
post-cache placement the decompression unit is placed between tions. In other words, we do not consider data compression in this
cache and processor thus both memory and cache contain com- paper. However, selecting energy efficient cache configuration for
pressed instructions. L2 cache is dependent on both level one instruction and data caches
Our studies show that having the pre-cache placement has very (IL1 and DL1). Therefore we consider the energy consumption of the
little effect on energy and performance of cache. In this case uncom- entire cache subsystem including IL1, DL1, and L2.
pressed instructions are stored in the cache and when cache miss We present efficient heuristics to generate profile tables with
occurs, the cache controller asks the decompression unit to provide profitable cache configurations. Tuning a two-level cache faces the
a block of instructions. In majority of the cases the decompression difficulty of exploring an enormous configuration space. In this
hardware requires one clock cycle in pipelined mode (as shown paper, we examine typical exploration parameters of a two-level
in Fig. 7), so one clock cycle will be added to the latency of entire cache in conventional embedded systems. As discussed in Section
block fetch. In rare cases, e.g., when the first instruction of the block 3.2, there are 18 (=3 + 6 + 9) configuration candidates for L1 caches.
is not compressed, it will introduce two cycle penalty since it will Let Sil1 and Sdl1 denote the size of exploration space for IL1 cache
take two cycles to fetch and decompress the instruction [17]. As and DL1 caches, respectively. So we have Sil1 = 18 and Sdl1 = 18. For
demonstrated in Fig. 8, the energy consumption of cache in the L2 cache, we choose 8 KB, 16 KB and 32 KB as cache sizes; 32, 64 and
pre-cache placement is almost the same as the case when there is 128 bytes as line sizes; 4-, 8- and 16-way set associativity with a 32
no compression involved. So the best choice is to use post-cache KB cache architecture composed of four separate banks. Similarly,
there are 18 possible configurations (Sul2 = 18). For comparison, we
3 have chosen a base cache hierarchy, which reflects a global optimal
BMC pre-cache placement Uncompressed configuration for all the tasks, consisting of two 2 KB, 2-way set
Energy Consumpon (milli J)

2.5 associative L1 caches with a 32 byte line size, and a 16 KB, 8-way set
associative unified L2 cache with a 64 byte line size. The remainder
2
of this section describes our proposed exploration techniques.
1.5

1 4.1. Exhaustive exploration

0.5 Intuitively, if the two levels of caches can be explored indepen-


dently, one can easily profile one level at a time while holding the
0
other level to a typical configuration, which will result in a much
smaller exploration space. However, there is no certainty that the
combination of three independently found energy-optimal config-
urations would be close to the global optimal one. The two cache
Fig. 8. The impact of pre-cache placement of decompression engine on cache energy levels affect each other’s behavior in various ways. For instance,
– djpeg benchmark. L2 cache’s configuration determines the miss penalty of the L1
76 H. Hajimiri et al. / Sustainable Computing: Informatics and Systems 2 (2012) 71–80

caches. Also, the number of L2 cache accesses directly depends on energy-optimal line sizes for each cache. These two tasks are
the number of L1 cache misses. repeated for both L1 caches and L2.
The obvious way to find the optimal configuration is to search 3. Finally, tune by associativity. We set the cache sizes and line sizes
the entire space exhaustively. Since the instruction and data caches to the energy-optimal ones in exploring energy-optimal associa-
could have different configurations, there are 324 (=Sil1 *Sdl1 ) pos- tivity. Note that we only explore associativities for L1 caches in
sible configurations for L1 cache. Addition of the L2 cache increases this step. During the process of finding DL1’s optimal associa-
the design space size to 4752 (Not equal to Sil1 *Sdl1 *Sul2 because tivities, we already have all the other parameters we needed to
the candidates in which L2 cache’s line size is smaller than any of compute the total numbers of execution cycles that are required
the L1 caches are eliminated). We use the exhaustive method for in the profile table.
comparison with the heuristics presented in the following sections.
Design of these heuristics is motivated by the exploration heuris- In the worst case, ILT explores 30 configurations. The first step
tics of Wang and Mishra [18]. However, our approach also considers explores 6 for L1 caches and 9 for L2 cache. The second step explores
the effect of compression during exploration. 9 (=3*3) candidates. Final step explores 6 (=3*2) candidates. How-
ever, in most cases, there are a lot of repetitive configurations
4.2. Independent L1 cache tuning – ICT throughout the process that we only have to execute once. In prac-
tice, ILT has exploration space size of around 19 configurations.
While different cache levels are dependent on each other, our
4.4. Hierarchy level independent tuning – HIT
initial results demonstrate that instruction and data caches are
relatively independent. In this study, we fix one’s configuration
Although we stated that IL1 and DL1 can be selected indepen-
while changing the other’s to see whether varying one impacts the
dently, in some cases it is better to explore the two level one caches
fixed one. We observe that the profiling statistics for the instruc-
together. Suppose for a particular benchmark, there is a large varia-
tion cache almost remain identical with different data caches and
tion in the require L2 size for data/instruction when changing IL1 or
vice versa. It is mainly due to the fact that access pattern of L1
DL1. In this case, using a large portion of L2 for instruction for a spe-
cache is purely determined by the application’s characteristics, and
cific IL1 configuration can affect DL1 indirectly (may increase data
the instruction and data streams are relatively independent from
access miss ratio in L2). Since ICT finds energy-optimal caches for
each other. Furthermore, factors affecting the instruction cache’s
IL1 and DL1 independently without considering the effect of each
energy consumption as well as performance (such as hit energy,
on L2 cache behavior, it may produce suboptimal results. We pro-
miss energy and miss penalty cycles) have very little dependency
pose HIT – Hierarchy Level Independent Tuning – in which we first
on the data cache and vice versa.
find the optimal cache configurations for level one caches fixing L2
This observation offers an opportunity to reduce the exploration
to the base cache. We explore all possible 324 (=18*18) combina-
space. We propose ICT – Independent L1 Tuning heuristic – during
tions for IL1 and DL1 caches and select the energy optimal ones.
which IL1 and DL1 caches always use the same configuration while
Next we fix L1 caches to the found energy optimal caches in the
exploring with all L2 cache configurations. This method results in a
first step and try all 18 candidates for L2 cache. In summary, ILT
total of 288 configurations – a considerable cut down of the original
explores only 30 configurations, whereas ICT and HIT explore 288
quantity, though still not small. Throughout the static analysis, we
and 342 (=324 + 18) configurations, respectively.
make book keeping including the energy consumptions and miss
cycles of each cache individually. The energy-optimal IL1 cache is
the one with the lowest energy consumption of itself (and same 5. Experiments
for DL1 cache and L2 cache). We choose the cache configuration
combination composed of the three locally energy-optimal caches In order to quantify compression-aware cache configuration
as the energy-optimal cache hierarchy to be stored in the profile tradeoffs, we have applied our methodology to select embedded
table. system benchmarks. Following the same flow in Sections 3 and
4, we first investigate integration of code compression with DCR
for systems with one level of cache. In subsection 0, we extend
4.3. Interlaced tuning – ILT our experiments to evaluate our method with the presence of a
two-level cache.
We adapt the strategy used in TCaT [2] and propose ILT –
Interlaced Tuning heuristic – which finds energy-optimal param- 5.1. Experimental setup
eters throughout the exploration. The basic idea is to tune cache
parameters in the order of their importance to the overall energy We examined cjpeg, djpeg, epic, and adpcm (rawcaudio), g.721
consumption, which is cache size followed by line size and finally (encode, decode) benchmarks from the MediaBench [19] and dijk-
associativity. In order to increase the chances of finding optimal L2 stra, patricia from MiBench [20] compiled for the Alpha target
cache size, which we believe has the highest importance, we com- architecture. These benchmarks are all specially designed for
bine the exploration of L2 cache’s size and associativity together. embedded systems and suitable for the cache configuration param-
ILT is described below: eters described in Section 3.2. All applications were executed with
the default input sets provided with the benchmarks suites.
1. First, tune by cache size. Hold the IL1’s line size, associativity Three different code compression techniques including
as well as DL1 to the smallest configuration. L2 is set to the bitmask-based, dictionary-based and Huffman code compression
base cache. Explore all three instruction cache sizes (1 KB, 2 KB were used. To achieve the best attainable compression ratios, in
and 4 KB) and find out the energy-optimal one(s). Perform same bitmask-based compression, for each application we examined
explorations for DL1 cache size. In L2 size exploration, we try dictionaries of 1 KB, 2 KB, 4 KB, and 8 KB. Similar to Seong and
all the associativities for each cache size. We set L1 sizes to the Mishra [13] we tried three mask sets including one 2-bit sliding,
energy-optimal ones in the process of finding energy-optimal L2 1-bit sliding and 2-bit fixed, and 1-bit sliding and 2-bit fixed masks.
size(s). Similarly for dictionary-based and Huffman compression we used
2. Next, tune by line size. We set cache sizes to the energy-optimal 0.5 KB, 1 KB, 2 KB, 4 KB, and 8 KB dictionary sizes with 8 bits, 16
ones and L2’s associativity found in the first step in exploring bits and 32 bits word sizes. We found out that dictionary size of
H. Hajimiri et al. / Sustainable Computing: Informatics and Systems 2 (2012) 71–80 77

BMC only DCR only DC+DCR Huffman+DCR BMC+DCR


100%
Normalized Energy Consumpon 90%
80%
70%
60%
50%
40%
30%
20%
10%
0%
cjpeg djpeg epic_encode rawcaudio patricia dijkstra g721_enc g721_dec

Fig. 9. Energy consumption of the selected “minimal-energy cache” normalized to the base cache.

2 KB and word size of 16 bits are the best choices for this set of subsystem. Energy consumption is normalized to the fixed base
benchmarks. The reason is that using 8 bits words increases the cache configuration such that value of 100% represents our baseline.
number of compression decision bits and using 32 bits word size Energy savings in the instruction cache subsystem ranges from 10%
decreases the words frequencies significantly. Hence, as simulation to 76% with an average of 45% for utilizing only DCR. As we expected,
results showed, 16 bits word size is the best choice. due to higher decompression overhead, Huffman (when combined
Code compression is performed offline. In order to extract the with DCR) achieves lower energy savings compared to BMC virtu-
code (instruction) part from executable binaries, we used ECOFF ally for all benchmarks. Energy savings in DC + DCR approach are
(Extended Common Object File Format) header files provided in even lower than Huffman + DCR as a result of moderate compres-
SimpleScalar toolset [16]. We placed the compressed code back into sion ratio by DC. Incorporating BMC in DCR increases energy savings
binary files so that they can be loaded into the simulator. up to 48% – on top of 10–76% energy savings obtained by DCR only –
We utilized the configurable cache architecture developed by without any performance degradation. Our methodology achieves
Zhang et al. [6] with a four-bank cache of base size 4 KB, which on average 61% energy savings of the cache subsystem.
offers sizes of 1 KB, 2 KB, and 4 KB, line sizes ranging from 16 bytes Energy consumption of some benchmarks is reduced drasti-
to 64 bytes, and associativity of 1-way, 2-way, and 4-way. For com- cally when using BMC. For example, energy consumption of cjpeg
parison purposes, we used the base cache configuration for L1 set benchmark is decreased by nearly 50% when applying BMC on DCR
to be a 4 KB, 4-way set associative cache with a 32-byte line size, compared to using DCR alone. Fig. 10(a) shows the number of cache
a reasonably common configuration that meets the average needs misses per thousand dynamic instructions for cjpeg benchmark.
of the studied benchmarks. It shows that for smaller cache sizes, cache misses are drastically
To obtain cache hit and miss statistics, we modified the reduced when incorporating compression. In other words, by using
SimpleScalar toolset [16] to decode and simulate compressed compression, smaller cache sizes are capable of containing the crit-
applications. We implemented and placed the required decom- ical portion of cjpeg benchmark and keep the number of misses
pression routines/functions for respective compression algorithms low (maintaining performance) while reducing static energy con-
in Simplescalar simulator. We considered the latency of decom- sumption. Fig. 10(b) presents the same statistics for rawcaudio
pression unit carefully. Decompression unit can decompress the benchmark. It should be noticed that although integrating com-
next instruction in one cycle (in pipelined mode) if it finds the pression with DCR reduces the number of cache misses when using
entire needed bits in its buffer. Otherwise, it takes one cycle (or small cache sizes (similar to cjpeg behavior) it does not drastically
more cycles, if cache miss occurs) to fetch the needed bits into its decrease energy consumption. The extremely low range of cache
buffer and on more cycle to decompress the next instruction. Cor- misses, usually less than 0.05 (0.45 in the extreme case) misses per
rectness of the compression and decompression algorithms was thousand dynamic instructions, leads to nominal contribution of
verified by comparing the outputs of compressed applications with cache misses to the overall energy consumption of the cache. For
uncompressed versions. The performance overhead of decompres- this reason dynamic energy consumption is nearly the same for all
sion includes decompression unit buffer flush overhead due to configurations and DCR chooses the smallest possible cache con-
jumps, and variable latency of memory reads in each block fetch figuration to minimize the static energy. In this case, incorporating
(because of variable length compressed code). These overhead are compression in DCR with the selection of small cache size can only
negligible according to the experimental results. reduce the cache misses and therefore dynamic energy and thus has
We applied the same energy model used in [6], which calculates a small impact on overall cache energy consumption for rawcaudio
both dynamic and static energy consumption, memory latency, CPU benchmark.
stall energy, and main memory fetch energy. The energy model Fig. 11 illustrates an example of performance-energy consump-
was modified to include decompression energy. We updated the tion tradeoffs for both uncompressed and compressed (using BMC)
dynamic energy consumption for each cache configuration using cases for rawcaudio (adpcm-enc) benchmark. It can be observed that
CACTI 4.2 [21]. for every possible configuration for the uncompressed program
there is an alternative which has a better performance and lower
5.2. One-level cache tuning energy requirement if the program is compressed. This observation
shows that compression-aware DCR leads to better design choices.
Energy consumption for several benchmarks from the Media- Another observation we have made is that without DCR, apply-
Bench and MiBench in different approaches are analyzed: a fixed ing compression on an application (which executes using base cache
base cache configuration, bitmask-based compression without uti- configuration that already fits the critical portion of the application)
lizing DCR (BMC only), DCR without compression (DCR only), will not gain noticeable energy savings. However, compression-
dictionary-based compression with DCR (DC + DCR), Huffman cod- aware DCR effectively uses the advantage of reduced program size
ing with DCR (Huffman + DCR), and bitmask-based compression achieved by compression to choose smaller cache size, associativ-
with DCR (BMC + DCR). The most energy efficient cache configu- ity, or line size and yet fit critical portion of programs. Therefore,
ration found by exploration in each technique is considered for compression aware-DCR can achieve more energy savings com-
comparison. Fig. 9 presents energy savings for the instruction cache pared to DCR alone. Fig. 12 illustrates comparison of energy profile
78 H. Hajimiri et al. / Sustainable Computing: Informatics and Systems 2 (2012) 71–80

DCR only DC+DCR Huffman+DCR BMC+DCR 10


Uncompressed BMC post-cache placement
60 9

Energy consumpon (milli J)


Number of cache misses per thousand

50 7
dynamic instrucons

40 5

4
30 3

2
20
1

0
10

0
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18

a) cjpeg Fig. 12. The impact of cache/line size on energy profile of cache using cjpeg bench-
mark.
Number of cache misses per thousand

0.45

0.4
increases, the critical portion of program (regardless of whether
dynamic instrucons

0.35
compressed or not) will fit into cache entirely. Therefore by utiliz-
0.3
ing large cache sizes energy consumption of the compressed code
0.25 is very close to uncompressed one. It should be noted that the main
0.2 objective of exploration is to find the most energy efficient cache
configurations so we are not interested in large cache sizes since
0.15
they require more energy.
0.1
Fig. 13 shows performance of applications for different schemes
0.05 normalized to the base cache. Applying DCR alone for the purpose
0 of energy saving, results in 12% performance loss on average. We
observe that code compression can improve performance in many
scenarios while achieving significant reduction in energy consump-
tion. For instance, in the case of the application patricia, applying
only DCR would result in 12% performance degradation with 34%
b) rawcaudio energy savings. However, incorporating BMC boosts performance
by 33% while gaining extra 17% energy savings on top of DCR achiev-
Fig. 10. Number of cache misses using DCR only and with various compression
techniques.
ing 51% energy savings compared to the base cache. Results show
that synergistic integration of BMC with DCR achieves as much as
75% performance improvement for g721 enc (27% improvement on
for different caches for compressed (using BMC) and uncompressed
average) compared to DCR alone. Thus it is possible to have a cache
cjpeg benchmark. Using a 4 KB cache with associativity of 4 and 64-
architecture that is tuned for applications to have both increased
bit line size, energy consumption of cjpeg benchmark is nearly the
performance as well as lower energy consumption. Fig. 14 shows
same for compressed and uncompressed programs.
performance and miss statistics for g721 enc benchmark. Further
In the post-cache placement, compression has a significant
analysis of g721 enc benchmark reveals that having numerous
effect when combined with small cache sizes. In this case com-
small if then else and switch clauses leads to large number of misses
pressed instructions are stored in the cache. Since the compressed
due to overlapping addresses (conflict miss). In this case, compres-
code size is 30–45 percent less than uncompressed code it can fit
sion reduces the number of misses by decreasing the amount of
in smaller cache sizes. However, when size of the selected cache
overlap the address of these small code sections. Fig. 14 confirms
that compression drastically improves the performance of g721 enc
3.5 benchmark for most of available cache configurations.
Fig. 15 shows performance trend of all cache configurations for
Energy consumption (milli J)

3 BMC both uncompressed and compressed codes for cjpeg benchmark. It


uncompressed
is interesting to note that compression also improves performance.
2.5
The compressed program can fit in smaller cache because of 30–45%
2 reduction in code size. This decreases cache misses significantly
for small caches. Reduced number of misses can lead to reduced
1.5 stalls and improved performance. As it can be observed in Fig. 15,
without compression, reducing the cache size may lead to major
1
performance degradation so DCR is forced to discard many cache
0.5 alternatives due to timing constraints. For instance, having tim-
ing constraint of 25 million cycles for cjpeg benchmark will force
0 to discard all cache configurations of size 2048 KB or lower. How-
6.4 6.6 6.8 7 7.2 7.4 7.6 ever, compression improves the performance significantly when
Execution time (Millions of cycles) small cache sizes are used. Thus combination of cache reconfigura-
Fig. 11. Performance-energy consumption tradeoff for compressed and uncom-
tion and code compression enables energy savings while improving
pressed codes using rawcaudio (adpcm-enc) benchmark. overall performance.
H. Hajimiri et al. / Sustainable Computing: Informatics and Systems 2 (2012) 71–80 79

180% BMC only DCR only DC+DCR Huffman+DCR BMC+DCR


Normalized Performance
160%
140%
120%
100%
80%
60%
40%
20%
0%
cjpeg djpeg epic_encode rawcaudio patricia dijkstra g721_enc g721_dec

Fig. 13. Performance of the selected “minimal-energy cache” normalized to the base cache.

#misses_DCRonly #misses_BMC+DCR 0.64


DCR only BMC + DCR

Energy Consumpon Normalized to the


perf_DCRonly perf_DCR+BMC
0.62

Performance normalized to base cache


160 1.8
Number of cache misses per thousand

1.6
0.6
140

120
1.4 0.58

Base Ccache
dynamic instrucons

1.2
100 0.56
1
80 0.54
0.8
60
0.6
0.52
40
0.4 0.5
20 0.2
0.48
0 0
0.46
ICT ILT HIT Exhaust

Fig. 16. Cache hierarchy energy consumption using heuristics for cjpeg benchmark.
Fig. 14. Number of cache misses using DCR only and DCR + BMC for g721 enc bench-
mark.
obtained by heuristics are very close to the optimal value obtained
5.3. Two-level cache tuning by exhaustive search. As we explained in Section 4.1, exploring all
possible configurations exhaustively results in 4752 simulations.
To evaluate the effect of two-level cache hierarchy using our Performing these set of simulations for cjpeg benchmark (which
exploration heuristics, we selected cjpeg, djpeg, epic benchmarks takes the lowest simulation time among others in the benchmark
from MediaBench [19] and crc32 from MiBench [20] benchmark suites) on a system with a 4-core AMD Opteron (an ×86 server
suites. For L2 cache, we choose 8 KB, 16 KB and 32 KB as possible processor) running at 3.0 GHz takes more than three days. Clearly,
cache sizes; 32, 64 and 128 bytes as line sizes; 4-, 8- and 16-way this will take longer for other benchmarks. Although, these heuris-
set associativity with a 32 KB cache architecture composed of four tics take significantly less time than exhaustive exploration, they
separate banks. L2 cache is unified; in other words, it contains both provide very close to optimal energy savings. Table 1 presents the
instructions and data. We define L2 base cache to be a 16 KB, 8-way total number of cache configurations explored by each exploration
set associative L2 cache with a 64 byte line size. We quantify the heuristic. Our experience is that it may take several days to profile a
cache subsystem energy savings using our approach by comparing task using exhaustive method while few minutes if ILT is employed.
to the base cache scenario. We use four cache exploration methods Designers can decide which heuristic to use based on the static
– exhaustive, ICT, ILT, and HIT – to generate profile tables. Fig. 16 profiling time and the overall energy savings. Therefore, we only
presents the total cache hierarchy energy consumption normalized perform heuristic space exploration for the remaining benchmarks.
to the base cache for cjpeg benchmark using each exploration tech- Fig. 17 presents the total cache hierarchy energy consump-
nique. It can be observed that, for cjpeg benchmark, the best results tion normalized to the base cache for cjpeg, djpeg, epic, patricia
and dijkstra benchmarks using each exploration technique for
uncompressed and compressed scenarios. ICT achieves best results
90
Uncompressed BMC post-cache placement obtaining 67% average energy saving when applying DCR only. It
Execuon me (Millions of cycles)

80 achieves up to 22% (patricia benchmark) more energy savings (11%


70 on average) incorporating compression. ILT reduces the number of
60 simulations significantly but presents results that are slightly infe-
rior. It achieves 62% energy savings using only DCR and up to 20%
50
extra savings adding compression to DCR. HIT outperforms ICT and
40
ILT in djpeg benchmark but on average saves 61% of the energy con-
30 sumption. Integrating compression boosts energy savings achieved
20 by HIT by 7%. The reason for ICT not finding the optimal
10

0 Table 1
Cache hierarchy configuration explored using different exploration methods for
cjpeg benchmark.

Exhaust ICT ILT HIT

4752 288 18 342


Fig. 15. Performance trend of different cache configurations using cjpeg benchmark.
80 H. Hajimiri et al. / Sustainable Computing: Informatics and Systems 2 (2012) 71–80

0.7 significantly reduce memory requirements, and may improve


performance in many scenarios. In this paper, we presented a syn-
Energy consumpon normalized to

0.6
ergistic integration of DCR and code compression for embedded
0.5
DCRonly-ICT systems. Our methodology employs an ideal combination of code
compression and dynamic tuning of two-level cache parameters
base cache

DCRonly-ILT
0.4
DCRonly-HIT with minor or no impact on timing constraints. Our experimental
0.3 BMC+DCR-ICT results demonstrated 61% reduction on average in overall energy
BMC+DCR-ILT consumption of the cache subsystem as well as up to 75% per-
0.2
BMC+DCR-HIT formance improvement (compared to DCR only) in embedded
0.1 systems.

0
cjpeg djpeg epic-encode patricia dijkstra
References

Fig. 17. Cache hierarchy energy consumption using three heuristics. [1] A. Malik, B. Moyer, D. Cermak, A low power unified cache architecture providing
power and performance flexibility, ISLPED (2000).
[2] A. Gordon-Ross, F. Vahid, N. Dutt, Automatic tuning of two-level caches to
embedded applications, DATE (2004).
1.8
[3] C. Lefurgy, Efficient execution of compressed programs, Ph.D. Thesis, University
Performance normalized to base cache

1.6 of Michigan, 2000.


[4] L. Benini, F. Menichelli, M. Olivieri, A class of code compression schemes
1.4
for reducing power consumption in embedded microprocessor systems, IEEE
1.2 DCRonly-ICT Transactions on Computers (April) (2004) 467–482.
DCRonly-ILT [5] A. Gordon-Ross, F. Vahid, N. Dutt, Fast configurable-cache tuning with a unified
1 second level cache, in: International Symposium on Low Power Electronics and
DCRonly-HIT
0.8 Design, 2005.
BMC+DCR-ICT
[6] C. Zhang, F. Vahid, W. Najjar, A highly-configurable cache architecture for
0.6 BMC+DCR-ILT embedded systems, in: 30th Annual International Symposium on Computer
0.4 BMC+DCR-HIT Architecture, June 2003.
[7] P. Vita, Configurable Cache Subsetting for Fast Cache Tuning, in: Design
0.2 Automation Conference, DAC, 2006.
[8] D.H. Albonesi, Selective Cache Ways: On-Demand Cache Resource Allocation,
0
2000.
cjpeg djpeg epic-encode patricia dijkstra
[9] L. Chen, X. Zou, J. Lei, Z. Liu, Dynamically reconfigurable cache for low-power
embedded system, in: Third International Conference on Natural Computation,
Fig. 18. Performance of the selected “minimal-energy cache” using different heuris- 2007.
tics normalized to the base cache. [10] C. Zhang, F. Vahid, R. Lysecky, A self-tuning cache architecture for embedded
systems, DATE (2004).
[11] A. Wolfe, A. Chanin, Executing compressed programs on an embedded RISC
configurations is that though L1 caches are relatively independent,
architecture, in: Proc. of the Intl. Symposium on Microarchitecture, 1992, pp.
they both have impact on the L2 cache which has effect back on L1 81–91.
caches. So they are essentially indirectly dependent on each other [12] Y. Xie, W. Wolf, H. Lekatsas, A code decompression architecture for VLIW
through the L2 cache. HIT only considers Pareto-optimal configura- processors, in: 34th ACM/IEEE International Symposium on Microarchitecture
(MICRO), 2001.
tions at the cost of losing the chance of finding more efficient cache [13] S. Seong, P. Mishra, Bitmask-based code compression for embedded systems,
combinations which actually consists of non-beneficial ones. One of IEEE Trans. CAD (2008) 673–685.
the reasons is that a less energy efficient (due to oversize) L1 cache [14] C. Lin, Y. Xie, W. Wolf, LZW-based code compression for VLIW embedded sys-
tems, Proc. DATE (2004) 76–81.
may cause fewer accesses to L2 cache. Hence an appropriate L2 [15] M. Rawlins, A. Gordon-Ross, On the interplay of loop caching, code compres-
cache may make this non-beneficial L1 cache overall better. Since sion, and cache configuration, ASP-DAC (2011).
ILT is least expensive, it is expected to produce worst results. In [16] D. Burger, T. Austin, S. Bennet, Evaluating future microprocessors:the
simplescalar toolset, University of Wisconsin-Madison, Computer Science
reality, it produces comparable, sometimes even better, than some Department Technical Report CS-TR-1308, July 2000.
expensive heuristics (HIT). [17] C. Murthy, P. Mishra, Lossless Compression Using Efficient Encoding of Bit-
Fig. 18 shows the performance of selected energy-optimal masks, isvlsi, IEEE Computer Society Annual Symposium on VLSI, 2009, pp.
163–168.
caches using each heuristic normalized to the base cache configu-
[18] W. Wang, P. Mishra, Dynamic reconfiguration of two-level caches in soft real-
ration. On ICT, ILT, and HIT gain up to 16%, 15% and 2% performance time embedded systems, in: IEEE International Symposium on VLSI, 2009.
improvements when compression is added to DCR. [19] C. Lee, M. Potkonjak, W.H. Mangione-smith., MediaBench: a tool for evaluating
and synthesizing multimedia and communications systems, in: International
Symposium on Microarchitecture, 1997.
6. Conclusion [20] M.R. Guthaus, J.S. Ringenberg, D. Ernst, T.M. Austin, T. Mudge, R.B. Brown,
MiBench: a free, commercially representative embedded benchmark suite,
International Workshop on Workload Characterization (WWC), 2001.
Optimization techniques are widely used in embedded systems
[21] CACTI, HP Labs, CACTI 4.2, http://www.hpl.hp.com/.
to improve overall area, energy and performance requirements. [22] H. Hajimiri, K. Rahmani, P. Mishra, Synergistic Integration of Dynamic Cache
Dynamic cache reconfiguration (DCR) is very effective to reduce Reconfiguration and Code Compression in Embedded Systems, International
energy consumption of cache subsystem. Code compression can Green Computing Conference (IGCC), 2011.

You might also like