Heterogeneous Multicore Processor Technologies for Embedded Systems

Kunio Uchiyama, Research and Development Group, Hitachi, Ltd., 1-6-1 Marunouchi, Chiyoda-ku, Tokyo 100-8220, Japan
Fumio Arakawa, Renesas Electronics Corp., 5-20-1 Josuihon-cho, Kodaira-shi, Tokyo 187-8588, Japan
Hironori Kasahara, Green Computing Systems R&D Center, Waseda University, 27 Waseda-machi, Shinjuku-ku, Tokyo 162-0042, Japan
Tohru Nojiri, Central Research Lab., Hitachi, Ltd., 1-280 Higashi-koigakubo, Kokubunji-shi, Tokyo 185-8601, Japan
Hideyuki Noda, Renesas Electronics Corp., 4-1-3 Mizuhara, Itami-shi, Hyogo 664-0005, Japan
Yasuhiro Tawara, Renesas Electronics Corp., 5-20-1 Josuihon-cho, Kodaira-shi, Tokyo 187-8588, Japan
Akio Idehara, Nagoya Works, Mitsubishi Electric Corp., 1-14 Yada-minami 5-chome, Higashi-ku, Nagoya 461-8670, Japan
Kenichi Iwata, Renesas Electronics Corp., 5-20-1 Josuihon-cho, Kodaira-shi, Tokyo 187-8588, Japan
Hiroaki Shikano, Central Research Lab., Hitachi, Ltd., 1-280 Higashi-koigakubo, Kokubunji-shi, Tokyo 185-8601, Japan

Preface
The expression “Digital Convergence” was coined in the mid-1990s and became a
topic of discussion. Now, in the twenty-first century, the “Digital Convergence” era
of various embedded systems has begun. This trend is especially noticeable in digi-
tal consumer products such as cellular phones, digital cameras, digital players, car
navigation systems, and digital TVs. That is, various kinds of digital applications
are now converged and executed on a single device. For example, several video
standards such as MPEG-2, MPEG-4, H.264, and VC-1 exist, and digital players
need to encode and decode these multiple formats. There are even more standards
for audio, and newer ones are continually being proposed. In addition, recognition
and synthesis technologies have recently been added. The latest digital TVs and
DVD recorders can even extract goal-scoring scenes from soccer matches using
audio and image recognition technologies. Therefore, a System-on-a-Chip (SoC)
embedded in the digital-convergence system needs to execute countless tasks such
as media, recognition, information, and communication processing.
Digital convergence requires, and will continue to require, higher performance in
various kinds of applications such as media and recognition processing. The prob-
lem is that any improvements in the operating frequency of current embedded CPUs,
DSPs, or media processors will not be sufficient in the future because of power
consumption limits. We cannot expect a single processor with an acceptable level of
power consumption to run applications at high performance. One solution that achieves high performance with low power consumption is to develop special hardware accelerators for limited applications, such as the processing of standardized formats like MPEG video. However, the hardware-accelerator approach is not
efficient enough for processing many of the standardized formats. Furthermore, we
need to find a more flexible solution for processing newly developed algorithms
such as those for media recognition.
To satisfy the higher requirements of digitally converged embedded systems, this
book proposes heterogeneous multicore technology that uses various kinds of low-
power embedded processor cores on a single chip. With this technology, heteroge-
neous parallelism can be implemented on an SoC, and we can then achieve greater
flexibility and superior performance per watt. This book defines the heterogeneous
multicore architecture and explains in detail several embedded processor cores
including CPU cores and special-purpose processor cores that achieve a high degree of arithmetic-level parallelism. We developed three multicore chips (called RP-1, RP-2,
and RP-X) according to the defined architecture with the introduced processor
cores. The chip implementations, software environments, and applications running
on the chips are also explained in the book.
We, the authors, hope that this book is helpful to all readers who are interested in
embedded-type multicore chips and the advanced embedded systems that use these
chips.
A book like this cannot be written without the help, in one way or another, of many people and organizations.
First, part of the research and development on the heterogeneous multicore pro-
cessor technologies introduced in this book was supported by three NEDO (New
Energy and Industrial Technology Development Organization) projects: “Advanced
heterogeneous multiprocessor,” “Multicore processors for real-time consumer elec-
tronics,” and “Heterogeneous multicore technology for information appliances.”
The authors greatly appreciate this support.
The R&D process on heterogeneous multicore technologies involved many
researchers and engineers from Hitachi, Ltd., Renesas Electronics Corp., Waseda
University, Tokyo Institute of Technology, and Mitsubishi Electric Corp. The
authors would like to express sincere gratitude to all the members of these organiza-
tions associated with the projects. We give special thanks to Prof. Hideo Maejima
of Tokyo Institute of Technology, Prof. Keiji Kimura of Waseda University,
Dr. Toshihiro Hattori, Mr. Osamu Nishii, Mr. Masayuki Ito, Mr. Yusuke Nitta,
Mr. Yutaka Yoshida, Mr. Tatsuya Kamei, Mr. Yasuhiko Saito, Mr. Atsushi Hasegawa
of Renesas Electronics Corp., Mr. Shiro Hosotani of Mitsubishi Electric Corp., and
Mr. Toshihiko Odaka, Dr. Naohiko Irie, Dr. Hiroyuki Mizuno, Mr. Masaki Ito,
Mr. Koichi Terada, Dr. Makoto Satoh, Dr. Tetsuya Yamada, Dr. Makoto Ishikawa,
Mr. Tetsuro Hommura, and Mr. Keisuke Toyama of Hitachi, Ltd. for their efforts in
leading the R&D process.
Finally, the authors thank Mr. Charles Glaser and the team at Springer for their
efforts in publishing this book.
Contents

1 Background
1.1 Era of Digital Convergence
1.2 Heterogeneous Parallelism Based on Embedded Processors
References

Chapter 1
Background

1.1 Era of Digital Convergence
Since the mid-1990s, the concept of “digital convergence” has been proposed and
discussed from both technological and business viewpoints [1]. In the twenty-first
century, “digital convergence” has become stronger and stronger in various digital
fields. It is especially notable in the recent trend in digital consumer products such
as cellular phones, car information systems, and digital TVs (Fig. 1.1) [2, 3]. This
trend will become more widespread in various embedded systems, and it will expand
the conventional market due to the development of new functional products and also
lead to the creation of new markets for goods such as robots.
In a digitally converged product, various applications are combined and executed
on a single device. For example, several video formats such as MPEG-4 and H.264
and several audio formats such as MP3 and AAC are decoded and encoded in a cel-
lular phone. In addition, recognition and synthesis technologies have recently been
added. The latest digital TVs and DVD recorders can even extract goal-scoring
scenes from soccer matches using audio and image recognition technologies. Thus,
an embedded SoC in the “digital-convergence” product needs to execute countless
tasks such as media, recognition, information, and communication processing.
Figure 1.2 shows the required performance of various current and future digital-convergence applications, expressed in giga operations per second (GOPS) [2, 3].
Digital convergence requires and will continue to require higher performance in
various kinds of media and recognition processes. The problem is that the improve-
ments made in the frequency of embedded CPUs, DSPs, or media processors will
not be sufficient in the future because of power consumption limits. In our estima-
tion, only applications that require performance of less than several GOPS can be
executed by a single processor at a level of power consumption acceptable for
embedded systems. We therefore need to find a solution for applications that require
higher GOPS performance. A special hardware accelerator is one solution [4, 5].
It is suitable for processing standardized formats like MPEG videos. However, the
[Figure: application domains converging on a single SoC — video (MotionJPEG, MPEG-2, MPEG-4, H.264, VC-1), still image (JPEG, JPEG2000), graphics (2D/3D, image-based and multipath rendering), information and communication (web browser, XML, Java, database, DLNA), recognition and synthesis (voice, audio, image, biometrics), audio (MP3, AAC, AAC Plus, Dolby 5.1, WMA, RealAudio), security (AES, DES, RSA, ElGamal, DRM), and media/storage (Flash, HDD, DVD, Blu-ray Disc).]
hardware-accelerator approach is not always flexible. Better solutions that can exe-
cute a wide variety of high-GOPS applications should therefore be studied.
A photo of a ball-catching robot is shown in Fig. 1.3. This is an example of
media-recognition and motion-control convergence [6, 7]. In this system, a ball
image is extracted and recognized from video images of two cameras. The trajec-
tory of the ball is predicted by calculating its three-dimensional position. Based
on the trajectory projection, the joint angles of the robot manipulators are calcu-
lated, and the robot catches the ball. The four stages of the media recognition and
the motion control need to be executed every 30 ms, and this requires over
10-GOPS performance. Like this example, a variety of functions, which may
[Figure: ball-catching processing stages — ball extraction, 3D-position calculation, and trajectory prediction, repeated every 30 ms.]
[Figure: integration and power consumption versus process technology (250 nm to 45 nm).]
[Figure: performance per watt versus power consumption — power-efficient heterogeneous multicore (embedded systems) versus high-performance multicore (PC/server).]
[Fig. 1.6a, b: power efficiency of embedded SH processors from 1992 onward (SH-1/2, SH-3, SH-4, SH-Mobile; about 30 to 6,000 MIPS/W, MIPS based on Dhrystone 2.1) and embedded versus PC/server performance against power consumption.]
[Fig. 1.7: processor cores (CPU, DSP, media processor, dynamically reconfigurable processor, other special-purpose processors) positioned by flexibility.]
6,000 MIPS/W, which was 200 times higher than that of 15 years ago. When we
compare this with the other types of processors in Fig. 1.6b, we can see the excellent
power efficiency of the embedded processor [2].
Our other policy is to effectively use heterogeneous parallelism to attain high
power efficiency in various digital-convergence applications. Now, various types of
processor cores other than CPU cores have been developed. Figure 1.7 shows exam-
ples of these processor cores, which are positioned in terms of flexibility and performance per watt or per area.
The CPU is a general-purpose processor core and has the most flexibility. The
other processor cores are developed for special purposes. They have less flexibility
[Fig. 1.8: dynamically reconfigurable processor core — a sequence manager and a configuration manager controlling an arithmetic array of 24 ALU cells and 8 multiplier cells, connected through a crossbar switch to 10 load/store cells and 10 local RAM banks.]
but high power/area efficiency. The DSP is for signal processing applications,
and the media processor is for effectively processing various media data such as
audio and video. There are also special-purpose processor cores that are suitable
for arithmetic-intensive applications. These include the dynamically reconfigurable core and the highly parallel SIMD (single instruction multiple data)-type core.
Figure 1.8 depicts an example of a dynamically reconfigurable processor core
[25], which is described in Sect. 3.2 in detail. It includes an arithmetic array consist-
ing of 24 ALU cells and 8 multiply cells, each of which executes a 16-bit arithmetic
operation. The array is connected to ten load/store cells with dual-ported local
memories via a crossbar switch. The core can achieve a high degree of arithmetic-level parallelism using the two-dimensional array. When an algorithm such as an FFT or FIR
filter is executed in the core, the configurations of the cells and their connections are
determined, and the data in the local RAMs are processed very quickly according to
the algorithm.
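As a point of reference, the following plain-C FIR kernel is the kind of loop whose multiplies and accumulations would be mapped onto the FE's ALU and multiplier cells; it is only a sketch of the workload, not the FE configuration itself, and the Q15 data format is an assumed choice.

/* Reference C model of an FIR filter of the kind mapped onto the arithmetic
 * array: 16-bit multiplies and additions that the FE would spread over its
 * ALU/MLT cells, with data streamed from the local RAMs. */
void fir_q15(const short *x, const short *coef, short *y, int n, int taps)
{
    for (int i = 0; i + taps <= n; i++) {
        long acc = 0;
        for (int t = 0; t < taps; t++)
            acc += (long)x[i + t] * coef[t];   /* 16-bit multiply, wide accumulate */
        y[i] = (short)(acc >> 15);             /* scale the Q15 result back to 16 bits */
    }
}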
Figure 1.9 is an example of a highly SIMD-type processor core [26], which is
described in Sect. 3.3 in detail. The core has 2,048 processing elements, each of
which includes two 2-bit full adders and some logic circuits. The processing ele-
ments are directly connected to two data register arrays, which are composed of
single-port SRAM cells. The processor core can execute arithmetic-intensive appli-
cations such as image and signal processing by operating 2,048 processing elements
in the SIMD manner.
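For orientation, the C loop below models the kind of per-pixel operation such a core executes: every iteration applies the same operation to different data, which is exactly what the 2,048 processing elements do in parallel in SIMD fashion. The thresholding operation itself is just an illustrative choice.

/* Reference C model of a SIMD-friendly image operation: one identical
 * operation per pixel, written sequentially here but spread across the
 * processing elements on the SIMD core. */
void threshold_row(const unsigned char *in, unsigned char *out,
                   int width, unsigned char th)
{
    for (int x = 0; x < width; x++)
        out[x] = (unsigned char)((in[x] >= th) ? 255 : 0);
}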
The hardware accelerator is a core that has been developed for a dedicated application.

[Fig. 1.9: massively parallel SIMD-type processor core — an instruction RAM and processor controller driving 2,048 processing elements (PEs) placed between data-register arrays of 2,048 entries.]
[Fig. 1.10: full HD H.264 CODEC accelerator — a stream processor, symbol codec, CABAC accelerator, TRF, FME, CME, and DEB blocks with local memory (L-MEM), connected by a shift-register-based bus.]

To achieve high power and area efficiency, the internal architecture of the
accelerator is highly optimized for the target applications. The full HD H.264 video
CODEC accelerator described in Sect. 3.4 is a good example [5]. The accelerator
(Fig. 1.10), which is fabricated using 65-nm CMOS technology and operates at
162 MHz, consists of dedicated processing elements, hardware logics, and proces-
sors which are suitably designed to execute each CODEC stage. The accelerator
decodes full HD (high definition) H.264 video at 172 mW. If a high-end CPU core were used for this decoding, a frequency of at least 2–3 GHz would be necessary at 100% CPU load. This means the CODEC core achieves 200–300 times higher performance per watt than a high-end CPU core.
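As a rough back-of-the-envelope check of that ratio, the sketch below divides the two power figures for the same decoding job; the CODEC power is from the text, while the high-end CPU power is an assumed illustrative value.

/* Rough sanity check of the performance-per-watt ratio quoted above. */
#include <stdio.h>

int main(void)
{
    double p_codec_w = 0.172;   /* full HD H.264 decode on the accelerator (from the text) */
    double p_cpu_w   = 40.0;    /* assumed power of a 2-3 GHz high-end CPU doing the same decode */
    printf("perf/W ratio ~ %.0fx\n", p_cpu_w / p_codec_w);   /* ~233x, within the 200-300x range */
    return 0;
}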
In our heterogeneous multicore approach, both general-purpose CPU cores and
special-purpose processor cores described above are used effectively. When a pro-
gram is executed, it is divided into small parts, and each part is executed in the most
suitable processor core. This should achieve a very power-efficient and cost-effective
References
17. Kamei T, et al. (2004) A resume-standby application processor for 3G cellular phones. ISSCC Dig Tech Papers, pp 336–337, 531
18. Ishikawa M, et al. (2004) A resume-standby application processor for 3G cellular phones with low power clock distribution and on-chip memory activation control. COOL Chips VII Proceedings, vol I, pp 329–351
19. Arakawa F, et al. (2004) An embedded processor core for consumer appliances with 2.8 GFLOPS and 36 M polygons/s FPU. IEICE Trans Fundamentals E87-A(12):3068–3074
20. Ishikawa M, et al. (2005) A 4500 MIPS/W, 86 mA resume-standby, 11 mA ultra-standby application processor for 3G cellular phones. IEICE Trans Electron E88-C(4):528–535
21. Arakawa F, et al. (2005) SH-X: an embedded processor core for consumer appliances. ACM SIGARCH Computer Architecture News 33(3):33–40
22. Yamada T, et al. (2005) Low-power design of 90-nm SuperH™ processor core. Proceedings of the 2005 IEEE International Conference on Computer Design (ICCD), pp 258–263
23. Arakawa F, et al. (2005) SH-X2: an embedded processor core with 5.6 GFLOPS and 73 M polygons/s FPU. 7th Workshop on Media and Streaming Processors (MSP-7), pp 22–28
24. Yamada T, et al. (2006) Reducing consuming clock power optimization of a 90-nm embedded processor core. IEICE Trans Electron E89-C(3):287–294
25. Kodama T, Tsunoda T, Takada M, Tanaka H, Akita Y, Sato M, Ito M (2006) Flexible Engine: a dynamic reconfigurable accelerator with high performance and low power consumption. In: Proc of the IEEE Symposium on Low-Power and High-Speed Chips (COOL Chips IX)
26. Noda H, et al. (2007) The design and implementation of the massively parallel processor based on the matrix architecture. IEEE J Solid-State Circuits 42(1):183–192
Chapter 2
Heterogeneous Multicore Architecture
[Fig. 2.1: typical heterogeneous multicore architecture model — CPU and special-purpose processor (SPP) cores, each with its own frequency and voltage controller (FVC), connected to the on-chip shared memory (CSM) and the off-chip main memory through the on-chip interconnect.]
command chaining, where multiple commands are executed in order. The frequency
and voltage controller (FVC) connected to each core controls the frequency, voltage,
and power supply of each core independently and reduces the total power con-
sumption of the chip. If the frequencies or power supplies of the core’s PU, DTU,
and LM can be independently controlled, the FVC can vary their frequencies and
power supplies individually. For example, the FVC can stop the clock of the PU and keep the DTU and LM clocks running when the core is executing only data
transfers. The on-chip shared memory (CSM) is a medium-sized on-chip memory
that is commonly used by cores. Each core is connected to the on-chip interconnect,
which may be several types of buses or crossbar switches. The chip is also con-
nected to the off-chip main memory, which has a large capacity but high latency.
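A minimal sketch of the per-core control idea is given below, assuming a hypothetical memory-mapped FVC register block; the register names, fields, and encodings are illustrative only and are not taken from the chips described in this book.

#include <stdint.h>

/* Hypothetical per-core FVC register block: one clock-divider field per
 * controllable unit (PU, DTU, LM) plus a supply-voltage selector. */
typedef volatile struct {
    uint32_t pu_clk;    /* PU clock divider; 0 = clock stopped */
    uint32_t dtu_clk;   /* DTU clock divider */
    uint32_t lm_clk;    /* LM clock divider */
    uint32_t vdd_sel;   /* supply-voltage selection */
} fvc_regs_t;

/* While the core only transfers data, stop the PU clock and keep the DTU
 * and LM running, as described in the text. */
static void fvc_transfer_only_mode(fvc_regs_t *fvc)
{
    fvc->pu_clk  = 0;   /* stop the PU */
    fvc->dtu_clk = 1;   /* DTU at full rate */
    fvc->lm_clk  = 1;   /* LM at full rate */
}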
Figure 2.1 illustrates a typical model of a heterogeneous multicore architecture.
A number of variations based on this architecture model are possible. Several varia-
tions of an LM structure are shown in Fig. 2.2. Case (a) is a hierarchical structure
where the LM has two levels. LM1 is a first-level, small-size, low-latency local
memory. LM2 is a second-level, medium-sized, not-so-low-latency local memory.
For example, the latency from the PU to LM1 is one processor cycle, and the latency
to LM2 is a few processor cycles. Case (b) is a Harvard type. The LM is divided into
an LMi that stores instructions and an LMd that stores data. The PU has an indepen-
dent access path to each LM. This structure allows parallel accesses to instructions
and data and enhances processing performance. Case (c) is a combination of (a) and
(b). The LMi and LMd are first-level local memories for instructions and data,
respectively. LM2 is a second-level local memory that stores both instructions and
data. In each case, each LM is mapped on a different address area; that is, the PU
accesses each LM with different addresses.
[Fig. 2.2: local memory structure variations — (a) hierarchical (LM1 and LM2), (b) Harvard (LMi and LMd), (c) hierarchical Harvard (LMi and LMd backed by a second-level LM2).]
In Fig. 2.3, we can see other configurations of a DTU, CSM, FVC, and an on-chip
interconnect. First, processor cores are divided into two clusters. The CPU cores,
the CSMl, and the off-chip main memory are tightly connected in the left cluster.
The SPP cores, the CSMr, and the DMAC are tightly connected in the right cluster.
Not every SPP core has a DTU inside. Instead, the DMAC that has multiple chan-
nels is commonly used for data transfer between an LM and a memory outside an
SPP core. For example, when data are transferred from an LM to the CSMr, the
DMAC reads data in the LM via the right on-chip bus, and the data are written on
the CSMr from the DMAC. We need two bus transactions for this data transfer. On
the other hand, if a DTU in a CPU core on the left cluster is used for the same
transfer, data are read from an LM by the DTU in the core, and the data are written
on the CSMl via the on-chip bus by the DTU. Only one transaction on the on-chip
bus is necessary in this case, and the data transfer is more efficient compared with
the case using the off-core DMAC. Although each CPU core in the left cluster has
an individual FVC, the SPP cores in the right cluster share an FVC. With this simpler
FVC configuration, all SPP cores operate at the same voltage and the same fre-
quency, which are controlled simultaneously.
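The difference can be summarized by counting on-chip bus transactions for the same LM-to-CSM copy; the small model below just encodes the two cases described above, and the type and function names are made up for illustration.

/* On-chip bus transactions needed for one LM -> CSM copy, per the two paths in the text. */
typedef enum { XFER_VIA_OFF_CORE_DMAC, XFER_VIA_IN_CORE_DTU } xfer_path_t;

static int bus_transactions(xfer_path_t path)
{
    switch (path) {
    case XFER_VIA_OFF_CORE_DMAC:
        return 2;   /* the DMAC reads the LM over the bus, then writes the CSM over the bus */
    case XFER_VIA_IN_CORE_DTU:
        return 1;   /* the DTU reads its own LM locally and writes the CSM over the bus */
    }
    return 0;
}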
[Figure: example of parallel execution on the heterogeneous multicore — processing tasks (P1–P11), data transfers (T1–T8), and waits (W1, W2) scheduled over time across CPU #0, CPU #1, SPPa #0, and SPPb #0.]
[Fig. 2.6: RP-X block diagram — two clusters of four CPU cores (CPU #0–#7, each with a local clock pulse generator (LCPG), local memories, and a DTU), two 256-KB CSMs, a DMA controller (DMAC), a video processing unit (VPU), flexible engines (FEs), matrix processors (MXs), and two off-chip DDR3 DRAM ports.]
Three types of SPPs are embedded on the chip. The first SPP is a video processing
unit (VPU, see Sect. 3.4) which is specialized for video processing such as MPEG-4
and H.264 codecs. The VPU has a 300-KB LM and a built-in DTU. The second and third types of SPP are four flexible engines (FEs, see Sect. 3.2) and two matrix processors (MXs, see Sect. 3.3), which are included in the other cluster. The FE is a dynami-
cally reconfigurable processor which is suitable for data-parallel processing such as
digital signal processing. The FE has an internal 30-KB LM but does not have a DTU. The on-chip DMA controller (DMAC), which can be used in common by the on-chip units, or the DTU of another core is used to transfer data between the FE's LM and other
memories. The MX has 1,024-way single instruction multiple data (SIMD) architec-
ture that is suitable for highly data-intensive processing such as video recognition.
The MX has an internal 128-KB LM but does not have its own DTU, just like the FE.
In the chip photograph in Fig. 2.5, the upper-left island includes four CPUs, and the
lower-left island has the VPU with other blocks. The left cluster in Fig. 2.6 includes
these left islands and a DDR3 port depicted at the lower-left side. The lower-right
island in the photo in Fig. 2.5 includes another four CPUs, the center-right island has
four FEs, and the upper-right has two MXs. The right cluster in Fig. 2.6 includes
these right islands and a DDR3 port depicted at the upper-right side. With these 15
on-chip heterogeneous cores, the chip can execute a wide variety of multimedia and
digital-convergence applications at high-speed and low-power consumption. The
details of the chip and its applications are described in Chaps. 4–6.
There are two types of address spaces defined for a heterogeneous multicore chip.
One is a public address space where all major memory resources on and off the
chip are mapped and can be accessed by processor cores and DMA controllers in
common. The other is a private address space, which defines the addresses seen from inside a processor core. The thread of a program on a processor core
runs on the private address space of the processor core. The private address space of
each processor core is defined independently.
Figure 2.7a shows a public address space of the heterogeneous multicore chip
depicted in Fig. 2.1. The CSM, the LMs of CPU #0 to CPU #m, the LMs of SPPa #0
to SPPa #n, and the LMs of SPPb #0 to SPPb #k are mapped in the public address
space, as well as the off-chip main memory. Each DTU in each processor core can
access the off-chip main memory, the CSM, and the LMs in the public address
space and can transfer data between various kinds of memories. A private address
space is independently defined per processor core. The private addresses are gener-
ated by the PU of each processor core. For a CPU core, the address would be
generated during the execution of a load or store instruction in the PU. Figure 2.7b, c
shows examples of private address spaces of a CPU and SPP. The PU of the CPU
core accesses data of the off-chip main memory, the CSM, and its own LM mapped
on the private address space of Fig. 2.7b. If the LM of another processor core is not
mapped on this private address space, the load/store instructions executed by the PU
of the CPU core cannot access data on the other processor core’s LM. Instead, the
DTU of the CPU core transfers data from the other processor core’s LM to its own
LM, the CSM, or the off-chip main memory using the public address space, and the
PU accesses the data in its private address space. In the SPP example (Fig. 2.7c),
the PU of the SPP core can access only its own LM in this case.

[Fig. 2.7: (a) public address space — off-chip main memory, CSM, the LMs of CPU #0 to CPU #m, SPPa #0 to SPPa #n, and SPPb #0 to SPPb #k, and other resources; (b) private address space of a CPU core — off-chip main memory, CSM, and its own LM; (c) private address space of an SPP core — its own LM.]

The data transfer
between its own LM and memories outside the core is done by its own DTU on the
public address space.
The address mapping of a private address space varies according to the structure
of the local memory. Figure 2.8 illustrates the case of the hierarchical Harvard
structure of Fig. 2.2c. The LMi and LMd are first-level local memories for instruc-
tions and data, respectively. The LM2 is a second-level local memory that stores
both instructions and data. The LMi, LMd, and LM2 are mapped on different
address areas in the private address space. The PU accesses each LM with different
addresses.
The size of the address spaces depends on the implementation of the heteroge-
neous multicore chip and its system. For example, a 40-bit address is assigned for a
public address space, a 32-bit address for a CPU core’s private address space, a
16-bit address for the SPP’s private address space, and so on. In this case, the sizes
of each space are 1 TB, 4 GB, and 64 KB, respectively. Concrete examples of this
are described in Chaps. 3 and 4.
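The sketch below puts these ideas into C under stated assumptions: the bit widths follow the 40-bit public / 32-bit private example above, while every base address and the DTU driver function are hypothetical placeholders rather than addresses of the chips described later.

#include <stdint.h>
#include <stddef.h>

/* Public address space (40-bit example): main memory, the CSM, and every LM are visible. */
#define PUB_CSM_BASE      0x00F0000000ULL   /* hypothetical */
#define PUB_LM_CPU0_BASE  0x00F8000000ULL   /* hypothetical */
#define PUB_LM_SPPA0_BASE 0x00F8100000ULL   /* hypothetical */

/* Private address space of CPU #0 (32-bit example): only its own LM is mapped here. */
#define PRIV_LM_BASE      0xE0000000UL      /* hypothetical */

/* Hypothetical DTU driver call: copies between two public addresses. */
static void dtu_transfer(uint64_t dst_pub, uint64_t src_pub, size_t bytes)
{
    /* would program the DTU with source, destination, and size */
    (void)dst_pub; (void)src_pub; (void)bytes;
}

void fetch_from_spp_lm(void)
{
    /* The DTU moves data from SPPa #0's LM into CPU #0's own LM using public addresses. */
    dtu_transfer(PUB_LM_CPU0_BASE, PUB_LM_SPPA0_BASE, 4096);

    /* The PU then reads the data through its private address space. */
    volatile uint32_t *lm = (volatile uint32_t *)PRIV_LM_BASE;
    uint32_t first_word = lm[0];
    (void)first_word;
}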
Chapter 3
Processor Cores
The processor cores described in this chapter are well tuned for embedded systems. They are the SuperH™ RISC engine family processor cores (SH cores) as typical embedded CPU cores, the flexible engine/generic ALU array (FE-GA, or simply FE) as a reconfigurable processor core, the MX core as a massively parallel SIMD-type processor, and the video processing unit (VPU) as a video processing accelerator. Heterogeneous multicore processor chips can be implemented with these cores, and the three prototype chips that were implemented, RP-1, RP-2, and RP-X, are introduced in Chap. 4.
Since the beginning of microprocessor history, processors for PCs and servers continuously advanced their performance while maintaining a price range from hundreds to thousands of dollars [1, 2]. On the other hand, single-chip microcontrollers continuously reduced their price, down to a range from dozens of cents to several dollars, while maintaining their performance, and were built into various products [3]. As a result, there was little demand for processors in the middle price range of tens to hundreds of dollars.
However, with the introduction of home game consoles in the late 1980s and the digitization of home electronic appliances from the 1990s, demand arose in this price range for processors suitable for multimedia processing. Instead of seeking the highest performance, such a processor attaches great importance to high efficiency. For example, its performance may be 1/10 that of a PC processor at 1/100 of the price, or it may equal a PC processor for the important functions of a product at 1/10 of the price. The improvement of area efficiency thus became an important issue for such processors.
In the late 1990s, high-performance processors consumed too much power for mobile devices such as cellular phones and digital cameras, and demand was increasing for processors with higher performance and lower power for multimedia processing. Therefore, improving power efficiency also became an important issue.
Furthermore, when the 2000s began, more functions were integrated thanks to ever finer processes, but the increase in initial and development costs became a serious problem. As a result, flexible specifications and cost reduction came to be important issues. In addition, the finer processes suffered from more leakage current.
Against this background, embedded processors were introduced to meet these requirements and have improved area, power, and development efficiency. In this section, the SuperH™ RISC (reduced instruction set computer) engine family processor cores are introduced as an example of highly efficient CPU cores.
A multicore SoC is one of the most promising approaches to realizing high efficiency, which is the key factor in achieving high performance under fixed power and cost budgets. As a result, embedded systems are employing multicore architectures more and more. A multicore is good for multiplying single-core performance while maintaining the core efficiency, but it does not enhance the efficiency of the core itself. Therefore, we must use highly efficient cores. In this section, SuperH™ RISC engine family (SH) processors are introduced as typical highly efficient embedded CPU cores for both single- and multicore chips.
The first SH processor was developed based on the SuperH™ architecture as an embedded processor in 1993. Since then, the SH processors have been developed to offer performance suitable for multimedia processing together with area and power efficiency. In general, performance improvement causes degradation of efficiency, as Pollack's rule indicates [4]. However, we can find ways to improve both performance and efficiency. Even if each way contributes only a small improvement, the total improvement can be meaningful.
The first-generation product, the SH-1, was manufactured using a 0.8-µm process, operated at 20 MHz, and achieved a performance of 16 MIPS at 500 mW. It was a high-performance single-chip microcontroller and integrated a ROM, a RAM, a direct memory access controller (DMAC), and an interrupt controller.
MIPS is an abbreviation of million instructions per second and a popular integer-performance measure for embedded processors. Processors with the same performance should take the same time for the same program, but the native MIPS value varies, reflecting the number of instructions executed for a program. Therefore, Dhrystone benchmark performance relative to that of a VAX 11/780 minicomputer is broadly used [5], because that machine achieved 1 MIPS; the relative performance value is called VAX MIPS, DMIPS, or simply MIPS.
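As a concrete illustration, the DMIPS figure is simply the measured Dhrystone score divided by the VAX 11/780 reference score; the sketch below assumes the customary reference value of 1,757 Dhrystones per second, and the measured score and frequency are hypothetical numbers.

#include <stdio.h>

int main(void)
{
    const double vax_ref = 1757.0;     /* customary VAX 11/780 Dhrystone score (Dhrystones/s) */
    double dhry_per_s = 3.2e5;         /* hypothetical measured Dhrystone score */
    double freq_mhz   = 100.0;         /* hypothetical operating frequency */
    double dmips = dhry_per_s / vax_ref;
    printf("%.0f DMIPS, %.2f DMIPS/MHz\n", dmips, dmips / freq_mhz);
    return 0;
}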
The second-generation product, the SH-2, was manufactured in 1994 using the same 0.8-µm process as the SH-1 [6]. It operated at 28.5 MHz and achieved
The SH-X3, the third-generation core, supported multicore features for both SMP and AMP [29, 30]. It was developed using a 90-nm generic process and achieved 600 MHz and 1,080 MIPS with 360 mW, resulting in 3,000 MIPS/W and 3.2 GIPS²/W. The first prototype chip of the SH-X3 was the RP-1, which integrated four SH-X3 cores [31–34], and the second was the RP-2, which integrated eight SH-X3 cores [35–37]. The core was then ported to a 65-nm low-power process and used for product chips [38]. The design is discussed in Sect. 3.1.7.
The SH-X4, the latest, fourth-generation core of the SH-4A processor core series, achieved 648 MHz and 1,717 MIPS with 106 mW, resulting in 16,240 MIPS/W and 28 GIPS²/W using a 45-nm process [39–41]. The design is discussed in Sect. 3.1.8.
The SH-4 enhanced its performance and efficiency mainly with a superscalar architecture, which suits multimedia processing with its high parallelism and makes an embedded processor suitable for digital appliances. However, a conventional superscalar processor put the first priority on performance, and efficiency was not considered seriously, because it was a high-end processor for a PC/server [42–46]. Therefore, a highly efficient superscalar architecture was developed and adopted for the SH-4. The design target was to apply the superscalar architecture to an embedded processor while maintaining its efficiency, which was already much higher than that of a high-end processor.
A high-end general-purpose processor was designed to enhance general performance for PC/server use, and because it faced no serious power or cost restrictions, its efficiency was low. A program with low parallelism cannot use the parallelism of a highly parallel superscalar processor, and the efficiency of the processor degrades. Therefore, the target parallelism of the superscalar architecture was set for programs with relatively low parallelism, and performance enhancement of multimedia processing was accomplished in another way (see Sect. 3.1.5).
The superscalar architecture enhances peak performance by issuing plural instructions simultaneously. However, the effective performance of a real application falls further from the peak performance as the number of instructions issued increases. The gap between peak and effective performance is caused by hazards, that is, waiting cycles. A branch operation is the main cause of waiting cycles for a fetched instruction, so it is important to speed up branches efficiently. A resource conflict, which causes waiting cycles until a resource becomes available, can be reduced by adding resources. However, efficiency will decrease if the performance enhancement does not compensate for the added hardware. Therefore, balanced resource addition is necessary to maintain efficiency. A register conflict, which causes waiting cycles until a register value becomes available, can be reduced by shortening instruction execution time and by forwarding data from a data-defining instruction to a data-using one at the appropriate timing.
Since the beginning of the RISC architecture, all RISC processors had adopted a 32-bit fixed-length instruction set architecture (ISA). However, such a RISC ISA requires larger code than a conventional CISC (complex instruction set computer) ISA, making it necessary to increase the capacity of the program memories and the instruction cache, so efficiency decreased. The SH architecture, with its 16-bit fixed-length ISA, was defined in this situation to achieve compact code sizes. The 16-bit fixed-length ISA later spread to other processors such as ARM Thumb and MIPS16.
On the other hand, a CISC ISA is variable length so that it can define instructions of various complexities, from simple to complicated ones. Variable length is good for compact code sizes, but it is not suitable for the parallel decoding of plural instructions required by superscalar issue. Therefore, the 16-bit fixed-length ISA is good for both compact code sizes and the superscalar architecture.
As always, the selection has pros and cons, and the 16-bit fixed-length ISA has some drawbacks: a restricted number of operands and a short literal length in the code. For example, an instruction for a binary operation modifies one of its operands, and an extra data transfer instruction is necessary if the original value of the modified operand must be kept. A literal load instruction is necessary to use a literal longer than the one that fits in an instruction. Further, there are instructions that use an implicitly defined register, which increases the number of operands with no extra operand field but requires special treatment to identify them and spoils the orthogonality of register-number decoding. Therefore, careful implementation is necessary to handle such special features.
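The first two drawbacks can be seen in a small C fragment, given below; the SH mnemonics in the comments are only for orientation, and the register assignment is of course up to the compiler.

/* A binary operation overwrites one of its source operands, so keeping the
 * original value costs an extra transfer instruction. */
int keep_both_operands(int r1, int r2)
{
    int r3 = r1;        /* extra transfer (e.g., MOV R1,R3) so that r1 survives */
    r3 = r3 + r2;       /* two-operand add (e.g., ADD R2,R3) overwrites r3      */
    return r3;          /* r1 and r2 are still intact                           */
}

/* A literal longer than the short immediate field of a 16-bit instruction
 * needs a separate literal load (on SH, typically a PC-relative MOV.L). */
unsigned long long_literal(void)
{
    return 0x12345678UL;
}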
the duplicated resources were not often used simultaneously, and the architecture would
not achieve high efficiency.
All the instructions were categorized to reduce the pipeline hazards caused by resource conflicts, which would not occur in a symmetric architecture at the expense of resource duplication. In particular, a transfer instruction of a literal or register value is important for the 16-bit fixed-length ISA, and the transfer instructions were categorized as a type that can properly use both the execution and load/store pipelines. Further, a zero-cycle transfer operation was implemented for the transfer instructions and contributes to reducing the hazards.
As for memory architecture, the Harvard architecture was popular for PC/server processors because it enables simultaneous accesses to the instruction and data caches, whereas a unified cache architecture was popular for embedded processors to reduce hardware cost and to use a relatively small cache efficiently. The SH-4 adopted the Harvard architecture, which was necessary to avoid the memory-access conflicts increased by the superscalar issue.
The SH architecture adopted a delayed branch to reduce the branch penalty cycles. In addition, the SH-4 adopted an early-stage branch to reduce the penalty further. The penalty cycles increased with the superscalar issue, but not as much as in a superpipelined processor with deep pipeline stages, and the SH-4 did not adopt more expensive approaches such as a branch target buffer (BTB), out-of-order issue of branch instructions, or branch prediction. The SH-4 kept backward compatibility and did not adopt methods requiring ISA changes, such as using plural instructions for a branch.
As a result of these selections, the SH-4 adopted an in-order, dual-issue, asymmetric, five-stage superscalar pipeline and the Harvard architecture, with special treatment of transfer instructions including a zero-cycle transfer method.
Figure 3.1 illustrates the pipeline structure to realize the asymmetric superscalar
architecture described above. The pipeline consists of five stages: instruction fetch (IF), instruction decoding (ID), instruction execution (EX), memory access (MA), and write-back (WB).
Two consecutive 16-bit instructions (32 bits) are fetched every cycle at the IF stage to sustain the two-way superscalar issue and are provided to the input latch of the ID stage. The fetched instructions are stored in an instruction queue (IQ) when the latch is occupied by instructions whose issue has been suspended. An instruction fetch is issued after checking that either the input latch or the IQ has room, to avoid discarding the fetched instructions.
At the ID stage, instruction decoders decode the two instructions at the input latch, judge their groups, assign pipelines, read registers as source operands, forward an operand value if it is available but not yet stored in a register, judge whether each instruction can be issued immediately, and provide instruction execution information to the following stages. Further, the BR pipeline starts branch processing of a BR-group instruction. The details of the branch processing are described in the next section.
The INT, LS, BR, and FE pipelines are assigned to instructions of the INT, LS, BR, and FE groups, respectively. The second of the two simultaneously decoded instructions is not issued if the pipeline to be assigned is occupied; it is kept at the input latch and decoded again in the next cycle. A BO-group instruction is assigned to the LS pipeline if the other simultaneously decoded instruction is in the INT group; otherwise, it is assigned to the INT pipeline, except when both instructions are in the BO group, in which case they are assigned to the INT and LS pipelines. An NS-group instruction is assigned to a proper pipeline or pipelines if it is the first instruction; otherwise, it is kept at the input latch and decoded again in the next cycle.
The issue possibility is judged by checking the operand-value availability in parallel with the execution-pipeline assignment. An operand is an immediate value or a register value, and an immediate value is always available. Therefore, the register-value availability is checked for the judgment. A register value is defined by some instruction and used by a following instruction. A read-after-write register conflict, in other words a true dependency, occurs if the distance between the defining and using instructions is less than the latency of the defining instruction, and the defined register value is not available until the distance becomes equal to or greater than the latency.
The parallel operation of the register-conflict check and the other ID-stage operations is realized by comparing a register-field candidate of the instruction before identifying whether the field is a real register field; the comparison result is then judged to be meaningful or not after the identification, which requires the instruction-format type from the instruction-decoding logic. These parallel operations reduce the time of the ID stage and enhance the operating frequency.
After the ID stage, the operation depends on the pipeline and is executed accord-
ing to the instruction information provided from the ID stage. The INT pipeline
executes the operation at the EX stage using an ALU, a shifter, and so on; forwards
the operation result to the WB stage at the MA stage; and writes back the result to the
register at the WB stage. The LS pipeline calculates the memory-access address at the EX stage; loads or stores data at the calculated address in the data cache at the MA stage; and writes back the loaded data and/or the calculated address, if any, to the register at the WB stage. If a cache miss occurs, all the pipelines stall to wait for the external memory access. The FE pipeline operations are described later in detail.
The SH-4 adopted the Harvard architecture, which required simultaneous access to translation lookaside buffers (TLBs) for instructions and data, and a conventional Harvard-architecture processor separates the TLBs symmetrically. However, the SH-4 enhanced the efficiency of the TLBs by breaking the symmetry. The addresses of instruction fetches are localized, and a four-entry instruction TLB (ITLB) was enough to suppress ITLB misses. On the contrary, the addresses of data accesses are not so localized and require more entries. Therefore, a 64-entry unified TLB (UTLB) was integrated and used both for data accesses and for ITLB miss handling. The ITLB miss handling is supported by hardware and takes only a few cycles if the ITLB-missed entry is in the UTLB. If a UTLB miss occurs for either type of access, a TLB miss exception occurs, and proper software miss handling is invoked.
The caches of the SH-4 are also asymmetric to enhance the efficiency. Since the code size of the SH-4 is smaller than that of a conventional processor, the size of the instruction cache is half that of the data cache. The cache sizes are 8 and 16 KB.
Since the number of transfer instructions in an SH-4 program is larger than in other architectures, the transfer instructions were categorized into the BO group. The transfer instructions can then be inserted into any unused issue slots. Further, a zero-cycle transfer operation was implemented for the transfer instructions and contributes to reducing the hazards.
The result of a transfer instruction already exists at the beginning of the operation, as an immediate value in the instruction code, a value in a source operand register, or a value on the fly in a pipeline. It is provided to the pipeline at the ID stage, and the value is simply forwarded in the pipeline to the WB stage. Therefore, an instruction executed simultaneously in another pipeline right after the transfer instruction can use the result of the transfer instruction, if the result is properly forwarded by the source-operand forwarding network.
[Figs. 3.2–3.4: branch sequences of a scalar processor (branch, delay slot, and target executed in 4 cycles), a conventional superscalar processor (4 cycles, with more empty issue slots), and the SH-4 with the early-stage branch (compare, branch, delay slot, and target in 3 cycles).]
The SH-4 adopted an early-stage branch to reduce the increased branch penalty by
the superscalar architecture. Figures 3.2–3.4 illustrate branch sequences of a scalar
processor, a superscalar processor, and the SH-4 with the early-stage branch, respec-
tively. The sequence consists of branch, delay slot, and target instructions. In the
SH-4 case, a compare instruction, which is often right before the conditional branch
instruction, is also shown to clarify the define-use distance of a branch condition
between the EX and ID stages of the compare and branch instructions.
Both the scalar and superscalar processors execute the three instructions in the
same four cycles. There is no performance gain by the superscalar architecture, and
the empty issue slot becomes three or four times more. On the other hand, the SH-4
executes the three instructions in three cycles with one or two empty issue slots.
The branch without a delay slot requires one more empty issue slot for all the cases.
As shown by the example sequences, the SH-4 performance was enhanced, and the
empty issue slots decreased.
The branch address calculation at the ID stage was the key method for the early-stage
branch and realized by the parallel operations of the calculation and the instruction
decoding. The early-stage branch was adopted to six frequently used branch instruc-
tions summarized in Table 3.4. The calculation was 8-bit or 12-bit offset addition,
and a 1-bit check of the instruction code could identify the offset size of the six
branch instructions. The first code of the two instruction codes at the ID stage was
chosen to process if the first code was a branch; otherwise, the second code was
chosen. However, this judgment took more time than the above 1-bit check, and
some part of calculation was done before the selection by duplicating required hard-
ware to realize the parallel operations.
The SH-4 performance was measured using the Dhrystone benchmark, which was popular for evaluating the integer performance of embedded processors [5]. The Dhrystone
benchmark is small enough to fit all the program and data into the caches and to
use at the beginning of the processor development. Therefore, only the processor
core architecture can be evaluated without the influence from the system level archi-
tecture, and the evaluation result can be fed back to the architecture design. On the
contrary, the system level performance cannot be measured considering cache miss
rates, external memory access throughput and latencies, and so on. The evaluation
result includes compiler performance because the Dhrystone benchmark is described
in C language. The optimizing compiler tuned up for SH-4 was used for compiling
the benchmark.
The optimizing compiler for a superscalar processor must have new optimization items that are not necessary for a scalar processor. For example, the distance between a load instruction and an instruction using the loaded data must be two cycles or more to avoid a pipeline stall. A scalar processor requires one instruction inserted between them, but the superscalar processor requires two or three instructions. Therefore, the optimizing compiler must insert more independent instructions than the compiler for a scalar processor.
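In C terms, the constraint looks like the fragment below: the compiler tries to move independent work in between a load and its first use, and a two-way superscalar pipeline needs roughly twice as many such filler instructions as a scalar one. The variable names are arbitrary.

int scheduled(const int *p, int x, int y, int z, int w)
{
    int a  = p[0];       /* load                                                 */
    int t1 = x + y;      /* independent instruction moved between load and use   */
    int t2 = z + w;      /* a superscalar issue needs a second (or third) filler */
    int b  = a + 1;      /* first use of the loaded value                        */
    return b + t1 + t2;
}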
[Fig. 3.5: cycle performance (MIPS/MHz) — SH-3: 1.00; + SH-4 compiler: 1.10; + Harvard: 1.27; + superscalar: 1.49; + BO type: 1.59; + early branch: 1.77; + 0-cycle MOV: 1.81.]
Figure 3.5 shows the result of the cycle performance evaluation. Starting from
the SH-3, five major enhancements were adopted to construct the SH-4 microarchi-
tecture. The SH-3 achieved 1.0 MIPS/MHz when it was released, and the SH-4
compiler enhanced its performance to 1.1 MIPS/MHz. The cycle performance was
enhanced to 1.27 MIPS/MHz by the Harvard architecture, to 1.49 MIPS/MHz by
the superscalar architecture, to 1.59 MIPS/MHz by adding the BO group, to 1.77
MIPS/MHz by the early-stage branch, and to 1.81 MIPS/MHz by the zero-cycle
transfer operation. As a result, the SH-4 achieved 1.81 MIPS/MHz. The SH-4
enhanced the cycle performance by 1.65 times from the SH-3, excluding the compiler effect.
The SH-3 was a 60-MHz processor in a 0.5-µm process and was estimated to be a 133-MHz processor in a 0.25-µm process. The SH-4 achieved 200 MHz in the same 0.25-µm process. Therefore, the SH-4 enhanced the frequency by 1.5 times from the SH-3. As a result, the architectural performance of the SH-4 is 1.65 × 1.5 = 2.47 times as high as that of the SH-3.
Efficiency is a more important feature than performance for an embedded processor. Therefore, the area and power efficiencies of the SH-4 were also evaluated, and it was confirmed that the SH-4 achieved excellent efficiencies.
The area of the SH-3 was 7 mm² in a 0.5-µm process and estimated to be 3 mm² in a 0.25-µm process, whereas the area of the SH-4 was 4.9 mm² in a 0.25-µm process. Therefore, the SH-4 was 1.63 times as large as the SH-3. As described above, the cycle and architectural performances of the SH-4 were 1.65 and 2.47 times as high as those of the SH-3. As a result, the SH-4 kept the area efficiency of the cycle performance, which was calculated as 1.65/1.63 = 1.01, and enhanced the area efficiency of the performance, which was calculated as 2.47/1.63 = 1.52. The actual efficiencies including the process contribution were 60 MIPS/7 mm² = 8.6 MIPS/mm² for the SH-3 and 360 MIPS/4.9 mm² = 73.5 MIPS/mm² for the SH-4.
The SH-3 and SH-4 were ported to a 0.18-µm process and tuned while keeping their major architecture. Since they adopted the same five-stage pipeline, the achievable frequency was also the same after the tuning. The ported SH-3 and SH-4 consumed 170 and 240 mW, respectively, at 133 MHz and a 1.5-V power supply. Therefore, the power of the SH-4 was 240/170 = 1.41 times as high as that of the SH-3. As a result, the SH-4 kept the power efficiency of the cycle performance, which is calculated as 1.65/1.41 = 1.17. The actual efficiencies including the process contribution were 147 MIPS/0.17 W = 865 MIPS/W for the SH-3 and 240 MIPS/0.24 W = 1,000 MIPS/W for the SH-4. Although a conventional superscalar processor was thought to be less efficient than a scalar processor, the SH-4 was more efficient than a scalar one. Under other conditions, the SH-4 achieved 166 MHz at 1.8 V with 400 mW and 240 MHz at 1.95 V with 700 mW, and the corresponding efficiencies were 300 MIPS/0.4 W = 750 MIPS/W and 432 MIPS/0.7 W = 617 MIPS/W.
The asymmetric superscalar architecture of the SH-4 achieved high performance and efficiency. However, further parallelism would not have contributed to the performance because of the limited parallelism of general programs. On the other hand, the operating frequency would be limited by the applied process unless the architecture or microarchitecture was fundamentally changed. Although conventional superpipeline architecture was thought to be inefficient, as the conventional superscalar architecture had been before the SH-4 [47, 48], the SH-X embedded processor core was developed with a superpipeline architecture to enhance the operating frequency while maintaining the high efficiency of the SH-4.
[Fig. 3.6: conventional superpipeline structure — I1/I2 instruction fetch with early branch, ID with branch and FPU instruction decoding, and execution stages E1–E6 across the BR, INT, LS, and FE pipelines, including address calculation, data load/store, FPU data transfer, FPU arithmetic execution, and write-back (WB).]
The load/store latencies were also a serious problem, and the out-of-order issue
was effective to hide the latencies, but too inefficient to adopt as mentioned above.
The SH-X adopted a delayed execution and a store buffer as more efficient methods.
The selected methods were effective to reduce the pipeline hazard caused by the
superpipeline architecture, but not effective to avoid a long-cycle stall caused by a
cache miss for an external memory access. Such a stall could be avoided by an out-
of-order architecture with large-scale buffers, but was not a serious problem for
embedded systems.
[Fig. 3.7: SH-X seven-stage superpipeline structure — I1/I2, ID, and E1–E7 stages across the BR, INT, LS, and FE pipelines, with out-of-order branch issue, delayed execution, a store buffer, and flexible forwarding.]
frequency can be 1.4 times as high as the SH-4. The degradation from the 1.5 times is
caused by the increase of pipeline latches for the extra stage.
Control signals and processing data flow backward as well as forward through the pipeline. The backward flows convey various information and the execution results of preceding instructions to control and execute the following instructions. This information includes whether preceding instructions were issued or are still occupying resources, where the latest value of a source operand is flowing in the pipeline, and so on. Such information is used for instruction issue every cycle, and the latest information must be collected within a cycle. This information gathering and handling become difficult when the cycle time is shortened for the superpipeline architecture, and the issue-control logic tends to become complicated and large. However, the quantity of hardware is determined mainly by the major microarchitecture, and the hardware increase was expected to be less than 1.4 times.
A conventional seven-stage pipeline had 20% less cycle performance than a five-stage one. This means the performance gain of the superpipeline architecture was only 1.4 × 0.8 = 1.12 times, which would not compensate for the hardware increase. The branch penalty increased because of the extra instruction-fetch cycles of the I1 and I2 stages, and the load-use conflict penalty increased because of the extra data-load cycles of the E1 and E2 stages. These were the main reasons for the 20% degradation.
Figure 3.7 illustrates the seven-stage superpipeline structure of the SH-X with
delayed execution, store buffer, out-of-order branch, and flexible forwarding.
Compared to the conventional pipeline shown in Fig. 3.6, the INT pipeline starts its
execution one cycle later at the E2 stage, store data are buffered in the store buffer at the E4 stage and stored to the data cache at the E5 stage, and the data transfer of
the FPU supports flexible forwarding. The BR pipeline starts at the ID stage, but is
not synchronized to the other pipelines for an out-of-order branch issue.
The delayed execution is effective to reduce the load-use conflict as Fig. 3.8
illustrates. It also lengthens the decoding stages into two except for the address
calculation and relaxes the decoding time. With the conventional architecture shown
in Fig. 3.6, a load instruction, MOV.L, sets up an R0 value at the ID stage, calculates
a load address at the E1 stage, loads a data from the data cache at the E2 and E3
stages, and the load data is available at the end of the E3 stage. An ALU instruction,
ADD, sets up R1 and R2 values at the ID stage and adds the values at the E1 stage.
Then the load data is forwarded from the E3 stage to the ID stage, and the pipeline
stalls two cycles. With the delayed execution, the load instruction execution is the
same, and the add instruction sets up R1 and R2 values at E1 stage and adds the
values at the E2 stage. Then the load data is forwarded from the E3 stage to the E1
stage, and the pipeline stalls only one cycle, which is the same number of cycles as in a five-stage pipeline like that of the SH-4.
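The sequence discussed above corresponds to the following C fragment; the SH mnemonics in the comments mirror the MOV.L/ADD example in the text.

int load_use(const int *r0, int r2)
{
    int r1 = *r0;    /* MOV.L @R0,R1 : loaded data ready at the end of E3               */
    r2 = r2 + r1;    /* ADD R1,R2    : stalls two cycles on the conventional pipeline,  */
    return r2;       /*                only one cycle with delayed execution (EX at E2) */
}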
There was another choice: starting the delayed execution at the E3 stage to avoid the pipeline stall of the load-use conflict. However, the E3 stage was a bad place to define results. For example, if an ALU result were defined at E3 and an address calculation used the result at E1, a three-cycle issue distance would be required between the ALU instruction and the address calculation. On the other hand, programs for the SH-4 already took the one-cycle stall into account. Therefore, the E2-start type of the SH-X was considered better. In particular, we could expect a program optimized for the SH-4 to run properly on the SH-X.
As illustrated in Fig. 3.7, a store instruction performs an address calculation,
TLB and cache-tag accesses, a store-data latch, and a data store to the cache at the
E1, E2, E4, and E5 stages, respectively, whereas a load instruction performs a cache
access at the E2 stage. This means the three-stage gap of the cache access timing
between the E2 and the E5 stages of a load and a store. However, a load and a store
use the same port of the cache. Therefore, a load instruction gets the priority to a
store instruction if the access is conflicted, and the store instruction must wait the
timing with no conflict. In the N-stage gap case, N entries are necessary for the store
buffer to treat the worst case, which is a sequence of N consecutive store issues fol-
lowed by N consecutive load issues, and the SH-X implemented three entries.
The flexible forwarding enables both an early register release and a late register
allocation and eases the optimization of a program. Figure 3.9 shows the examples
of both the cases. In the early register release case, a floating-point addition instruc-
tion (FADD) generates a result at the end of the E4 stage, and a store instruction
(FMOV) gets the result forwarded from the E5 stage of the FADD. Then the FR1 is
released only one cycle after the allocation, although the FADD takes three cycles
to generate the result. In the late register allocation case, an FADD forwards a result
at the E6 stage, and a transfer instruction (FMOV) gets the forwarded result at the
E1 stage. Then the FR2 allocation is five cycles after the FR1 allocation.
[Figure: branch execution timing of the SH-X — compare, branch, delay-slot, and target instruction flows with and without the out-of-order branch issue, showing the empty issue slots removed and a 2-cycle stall on a prediction miss (fall through).]
direction that the branch is taken or not taken. However, this is not early enough to
make the number of empty issue slots zero. Therefore, the SH-X adopted out-of-order issue for branches that use no general-purpose register.
The SH-X fetches four instructions per cycle and issues two instructions at most.
Therefore, instructions are buffered in an instruction queue (IQ) as illustrated. A branch
instruction is searched for in the IQ or the instruction-cache output at the I2 stage and provided to the ID stage of the branch pipeline for out-of-order issue, earlier than the other instructions, which are provided to the ID stage in order. Then a conditional branch instruction is issued right after it is fetched, while the preceding instructions are still in the IQ, and the issue becomes early enough to make the number of empty issue slots zero. As a result,
the target instruction is fetched and decoded at the ID stage right after the delay-slot
instruction. This means no branch penalty occurs in the sequence when the preceding
or delay-slot instructions stay two or more cycles in the IQ.
The compare result is available at the E3 stage, and the prediction is then checked as a hit or a miss. In the miss case, the instruction of the correct flow is decoded at the ID stage right after the E3 stage, and a two-cycle stall occurs. If the correct flow is not held in the IQ, the misprediction recovery starts from the I1 stage and takes two more cycles.
Historically, dynamic branch prediction started from a BHT with a 1-bit history per entry, which recorded whether the branch was taken the last time and predicted the same direction. Then a BHT with a 2-bit history per entry became popular, and four states of strongly taken, weakly taken, weakly not taken, and strongly not taken were used to reflect the history of several executions. There were several types of state transitions, including a simple up-down transition. Since each entry held only one or two bits, it was too expensive to attach a tag consisting of part of the branch-instruction address, which would usually be about 20 bits for 32-bit addressing. Therefore, we could increase
the number of entries by about 10 or 20 times by omitting the tag. Although different branch instructions could not be distinguished without the tag, so that false hits occurred, the merit of the larger number of entries exceeded the demerit of the false hits. A global-history method was also popular for prediction and was usually used with a 2-bit/entry BHT.
The SH-X stalled for only two cycles on a prediction miss, so the performance was not very sensitive to the hit ratio. Further, the 1-bit method required a state change only on a prediction miss, and the change could be made during the stall. Therefore,
the SH-X adopted a dynamic branch prediction method with a 4 K-entry 1-bit/entry
BHT and a global history. The size was much smaller than the instruction and data
caches of 32 KB each.
The SH-X achieved excellent power efficiency by using various low-power tech-
nologies. Among them, hierarchical clock gating and pointer controlled pipeline are
explained in this section.
Figure 3.12 illustrates a conventional clock-gating method. In this example, the
clock tree has four levels with A-, B-, C-, and D-drivers. The A-driver receives the
clock from the clock generator and distributes the clock to each module in the processor.
Then, the B-driver of each module receives the clock and distributes it to various sub-
modules including 128–256 flip-flops (F/Fs). The B-driver gates the clock with the
signal from the clock control register, whose value is statically written by software to
stop and start the modules. Next, the C- and D-drivers distribute the clock hierarchi-
cally to the leaf F/Fs with a Control Clock Pin (CCP). The leaf F/Fs are gated by
hardware with the CCP to avoid activating them unnecessarily. However, the clock
tree in the module is always active while the module is activated by software.
Figure 3.13 illustrates the clock-gating method of the SH-X. In addition to the
clock gating at the B-driver, the C-drivers gate the clock with the signals dynamically
generated by hardware to reduce the clock tree activity. As a result, the clock power
is 30% less than that of the conventional method.
The superpipeline architecture improved the operating frequency but increased the number of F/Fs and the power. Therefore, one of the key design considerations was
to reduce the activity ratio of the F/Fs. To address this issue, a pointer-controlled
pipeline was developed. It realizes a pseudopipeline operation with a pointer control.
As shown in Fig. 3.14, three pipeline F/Fs are connected in parallel, and the pointer
is used to show which F/Fs correspond to which stages. Then, only one set of F/Fs
is updated in the pointer-controlled pipeline, while all pipeline F/Fs are updated
every cycle in the conventional pipeline as shown in Fig. 3.15.
Table 3.6 shows the relationship between F/Fs FF0–FF2 and pipeline stages E2–E4
for each pointer value. For example, when the pointer indexes zero, the FF0 holds an
input value at E2 and keeps it for three cycles as E2, E3, and E4 latches until the
pointer indexes zero again and the FF0 holds a new input value. This method is good
for a short latency operation in a long pipeline. The power of pipeline F/Fs decreases
to 1/3 for transfer instructions and decreases by an average of 25% as measured using
Dhrystone 2.1.
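As a rough illustration of this mechanism, the following C sketch (an assumption for illustration only, not the actual SH-X logic) models three flip-flops and a rotating pointer: each cycle only the flip-flop currently playing the E2 role is written, and the E3 and E4 roles are derived from the pointer, following the rotation of Table 3.6.

    #include <stdio.h>

    /* Illustrative model of the pointer-controlled pipeline: one value enters
     * the E2 "stage" each cycle, is written into a single flip-flop selected
     * by the pointer, and then stays there while the pointer rotation makes
     * that flip-flop represent E3 and E4 in the following cycles. */
    int main(void)
    {
        int ff[3] = {0, 0, 0};
        int ptr = 0;                        /* flip-flop acting as E2 this cycle */

        for (int cycle = 0; cycle < 6; cycle++) {
            ff[ptr] = 100 + cycle;          /* only ONE flip-flop is updated     */
            printf("cycle %d: E2=FF%d(%d) E3=FF%d(%d) E4=FF%d(%d)\n",
                   cycle,
                   ptr,           ff[ptr],
                   (ptr + 2) % 3, ff[(ptr + 2) % 3],
                   (ptr + 1) % 3, ff[(ptr + 1) % 3]);
            ptr = (ptr + 1) % 3;            /* rotate the stage assignment       */
        }
        return 0;
    }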
The SH-X performance was measured using the Dhrystone benchmark, as the SH-4 performance had been. The popular version had changed from 1.1, used when the SH-4 was developed, to 2.1, because advances in compiler optimization meant that version 1.1, with its excessive elimination of unused results, no longer reflected the features of real applications [49]. The compiler advances and the increased optimization difficulty of version 2.1 were well balanced, so the continuity of the measured performance was maintained by using a proper optimization level of the compiler.
Figure 3.16 shows the evaluated result of the cycle performance. The improve-
ment from the SH-3 to the SH-4 in the figure was already explained in Sect. 3.1.2.7.
Adopting a conventional seven-stage superpipeline to the SH-4 decreased the cycle performance by 18% to 1.47 MIPS/MHz. Branch prediction, out-of-order branch issue, the store buffer, and delayed execution improved the cycle performance by 23% and recovered it to 1.8 MIPS/MHz. Since a 1.4-times-higher operating frequency was achieved by the superpipeline architecture, the architectural performance was also 1.4 times as high as that of the SH-4. The actual performance was 720 MIPS at 400 MHz in a 0.13-µm process, twice that of the SH-4 in a 0.25-µm process. The improvement from each method is shown in Fig. 3.16.
Figures 3.17 and 3.18 show the area- and power-efficiency improvements, respectively. The upper three graphs of both figures show the architectural performance, relative area/power, and architectural area-/power-performance ratio. The lower three graphs show the actual performance, area/power, and area-/power-performance ratio. The area of the SH-X core was 1.8 mm2 in a 0.13-µm process, and the area of the SH-4 was estimated as 1.3 mm2 if ported to a 0.13-µm process. Therefore, the relative area of the SH-X was 1.4 times that of the SH-4 and 2.26 times that of the SH-3. The architectural area efficiency of the SH-X was thus nearly equal to that of the SH-4 and 1.53 times as high as that of the SH-3. The actual area efficiency of the SH-X reached 400 MIPS/mm2, which was 8.5 times as high as the 74 MIPS/mm2 of the SH-4.
SH-4 was estimated to achieve 200 MHz, 360 MIPS with 140 mW at 1.15 V, and
280 MHz, 504 MIPS with 240 mW at 1.25 V. The power efficiencies were 2,500
and 2,100 MIPS/W, respectively. On the other hand, SH-X achieved 200 MHz,
360 MIPS with 80 mW at 1.0 V, and 400 MHz, 720 MIPS with 250 mW at 1.25 V.
The power efficiencies were 4,500 and 2,880 MIPS/W, respectively. As a result,
the power efficiency of the SH-X improved by 1.8 times from that of the SH-4 at the
same frequency of 200 MHz, and by 1.4 times at the same supply voltage while enhancing the performance by 1.4 times. These were architectural improvements; the actual improvements were multiplied by the effect of the process porting.
According to the analysis of the SH-X, the ID stage was the most timing-critical part, and the branch acceleration had successfully reduced the branch penalty. Therefore, we added a third instruction-fetch stage (I3) to the SH-X2 pipeline to relax the ID-stage timing. The cycle-performance degradation was negligibly small thanks to the successful branch architecture, and the SH-X2 achieved the same cycle performance of 1.8 MIPS/MHz as the SH-X.
[Figure 3.19: SH-X2 pipeline structure — the added I3 stage (branch search / instruction predecoding) across the BR, INT, LS, and FE pipelines, with out-of-order branch, store buffer, and flexible forwarding.]
Figure 3.19 illustrates the pipeline structure of the SH-X2. The I3 stage was
added and performs branch search and instruction predecoding. Then the ID stage
timing was relaxed, and the achievable frequency increased.
Another critical timing path was in the first-level (L1) memory-access logic. The SH-X had L1 memories consisting of a local memory and instruction and data caches, and the local memory was unified for both instruction and data accesses. Since all the memories could not be placed close together, separating the instruction and data memories was a good way to relax the critical timing path. Therefore, the SH-X2 split the unified L1 local memory of the SH-X into instruction and data local memories (ILRAM and OLRAM).
With various other timing tunings, the SH-X2 achieved 800 MHz in a 90-nm generic process, up from the SH-X's 400 MHz in a 130-nm process. The improvement was far greater than the effect of the process porting alone.
The SH-X2 enhanced the low-power technologies of the SH-X explained in Sect. 3.1.3.4. Figure 3.20 shows the clock-gating method of the SH-X2. The D-drivers also gate the clock with signals dynamically generated by hardware, so the leaf F/Fs require no CCP. As a result, the clock-tree and total powers are 14% and 10% lower, respectively, than with the SH-X method.
The SH-X2 applied a way-prediction method to the instruction cache. The SH-X2 aggressively fetched instructions using branch prediction and early-branch techniques to compensate for the branch penalty caused by the long pipeline. The power consumption of the instruction cache reached 17% of the SH-X2 power, and 64% of the instruction-cache power was consumed by the data arrays. The way-prediction misses were less than 1% in most cases and 0% for Dhrystone 2.1. Consequently, 56% of the array accesses were eliminated by the prediction for the Dhrystone. As a result, the instruction-cache power was reduced by 33%, and the SH-X2 power was reduced by 5.5%.
[Figure 3.20: clock-gating method of the SH-X2 — the B-, C-, and D-drivers gate the clock (software-static and hardware-dynamic control), and the leaf F/Fs need no control clock pin (CCP).]
In 1995, the SH-3E, the first embedded processor with an on-chip floating-point unit (FPU), was developed by Hitachi mainly for a home game console. It operated at 66 MHz and achieved a peak performance of 132 MFLOPS with a floating-point multiply-accumulate instruction (FMAC). At that time, on-chip FPUs were popular for PC/server processors, but there was no demand for them on embedded processors, mainly because they were too expensive to integrate. However, programming game consoles was becoming difficult as they had to support higher resolutions and advanced 3D-graphics features. In particular, it was difficult to avoid overflow and underflow of fixed-point data with its small dynamic range, so there was a demand to use floating-point data. Since a four-way parallel operation was easy to implement with 16-bit fixed-point data, equivalent performance had to be realized at reasonable cost in order to change the data type to the floating-point format.
Since an FPU was about three times as large as a fixed-point unit, and a four-way SIMD datapath was four times as large as a normal one, a four-way SIMD FPU was too expensive to adopt. Further, the FPU architecture of the SH-3E was limited by the 16-bit fixed-length ISA. The latencies of floating-point operations were long and required more registers than fixed-point operations, but the ISA could define only 16 registers. A popular 3D-graphics transformation matrix was four by four and occupied 16 registers, leaving no register for other values. Therefore, an efficient FPU parallelization method had to be developed that solved the above issues.
Sixteen was the limit on the number of registers directly specifiable by the 16-bit fixed-length ISA. Therefore, the registers were extended to 32 as two banks of 16 registers. The two banks are the front and back banks, named FR0–FR15 and XF0–XF15, respectively, and they are switched by changing a control bit, FPSCR.FR, in the floating-point status and control register (FPSCR). Most instructions use
only the front bank, but some newly defined instructions use both the front and back
banks. The SH-4 uses the front-bank registers as eight pairs or four length-4 vectors
as well as 16 registers and uses the back-bank registers as eight pairs or a four-by-
four matrix. They were defined as follows:
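(The following is a reconstruction based on the SH-4 register naming — DRn for front-bank pairs, FVn for length-4 vectors, XDn for back-bank pairs, and XMTRX for the back-bank matrix — rather than a verbatim copy of the original listing.)

$$\mathrm{DR}n = (\mathrm{FR}n,\ \mathrm{FR}(n{+}1)),\qquad n = 0, 2, \ldots, 14,$$
$$\mathrm{FV}n = (\mathrm{FR}n,\ \mathrm{FR}(n{+}1),\ \mathrm{FR}(n{+}2),\ \mathrm{FR}(n{+}3)),\qquad n = 0, 4, 8, 12,$$
$$\mathrm{XD}n = (\mathrm{XF}n,\ \mathrm{XF}(n{+}1)),\qquad n = 0, 2, \ldots, 14,$$
$$\mathrm{XMTRX} = \begin{pmatrix}
\mathrm{XF}0 & \mathrm{XF}4 & \mathrm{XF}8 & \mathrm{XF}12\\
\mathrm{XF}1 & \mathrm{XF}5 & \mathrm{XF}9 & \mathrm{XF}13\\
\mathrm{XF}2 & \mathrm{XF}6 & \mathrm{XF}10 & \mathrm{XF}14\\
\mathrm{XF}3 & \mathrm{XF}7 & \mathrm{XF}11 & \mathrm{XF}15
\end{pmatrix}.$$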
Since an ordinary SIMD extension of an FPU was too expensive for an embedded processor, as described above, another form of parallelism was applied to the SH-4. The large hardware of an FPU is needed for the mantissa alignment before an operation and the normalization and rounding after it. Further, a popular FPU instruction, the FMAC, requires three read ports and one write port. Consecutive FMAC operations are a popular sequence for accumulating plural products. For example, the inner product of two length-4 vectors is one such sequence and is popular in 3D graphics programs. Therefore, a floating-point inner-product instruction (FIPR) was defined to accelerate the sequence with smaller hardware than the SIMD approach would need. It uses two of the four length-4 vectors as input operands and modifies the last register of one of the input vectors to store the result. The defining formula is as follows:
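(A reconstruction based on the SH-4 FIPR definition, FIPR FVm, FVn with m, n ∈ {0, 4, 8, 12}.)

$$\mathrm{FR}[n{+}3] \leftarrow \mathrm{FR}[m]\,\mathrm{FR}[n] + \mathrm{FR}[m{+}1]\,\mathrm{FR}[n{+}1] + \mathrm{FR}[m{+}2]\,\mathrm{FR}[n{+}2] + \mathrm{FR}[m{+}3]\,\mathrm{FR}[n{+}3].$$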
However, the exact value is 1.FFFFFE × 2^103, and the error of the formula is also 1.FFFFFE × 2^103, which corresponds to a worst-case error of 2^−23 times the maximum term. We can get the exact value if we change the operation order properly. The floating-point standard defines the rule for each operation but does not define the result of the whole formula, and either result is fine for conformance. Since the FIPR operation is not defined by the standard, we defined its maximum error as "2^(E−25) + rounding error of the result" to make it better than or equal to the average and worst-case errors of the equivalent sequence that conforms to the standard, where E is the maximum exponent of the four products.
A length-4 vector transformation was also a popular 3D-graphics operation, and a floating-point transform-vector instruction (FTRV) was defined. It required 20 registers to specify the operands in a modifying-type definition. Therefore, the defining formula is as follows, using a four-by-four matrix of all the back-bank registers, XMTRX, and one of the four front-bank vector registers, FV0–FV3:
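(A reconstruction based on the SH-4 FTRV definition, FTRV XMTRX, FVn.)

$$\mathrm{FV}n \leftarrow \mathrm{XMTRX}\cdot\mathrm{FV}n,\qquad n \in \{0, 4, 8, 12\}.$$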
operations. However, it would have required four more registers and would have been useful only as a replacement for the FTRV, so the FTRV was the simpler and better approach.
The newly defined FIPR and FTRV enhanced the performance, but the data-transfer ability became a bottleneck for realizing the enhancement. Therefore, a pair load/store/transfer mode was defined to double the data-move ability. In the pair mode, floating-point move instructions (FMOVs) treat the 32 front- and back-bank floating-point registers as 16 pairs and directly access all the pairs without the bank switch controlled by the FPSCR.FR bit. The switch between the pair and normal modes is controlled by a move-size bit, FPSCR.SZ, in the FPSCR. Further, floating-point register-bank and move-size change instructions (FRCHG and FSCHG) were defined for fast changes of the modes defined above.
The 3D graphics required high performance but used only single precision. On the other hand, the double-precision format was popular in the server/PC market and would ease porting PC applications to a handheld PC, although the performance requirement was not as high as for 3D graphics. However, software emulation was several hundred times slower than a hardware implementation. Therefore, the SH-4 adopted hardware emulation with minimum additional hardware on top of the single-precision hardware. The difference between hardware emulation and a full implementation is not visible in the architecture; it appears as a performance difference reflecting the microarchitecture.
The SH-4 introduced single- and double-precision modes, which were controlled
by a precision bit, FPSCR.PR, of the FPSCR. Some conversion operations between the precisions were necessary but did not fit the mode separation. Therefore, the SH-4 supported two conversion instructions in the double-precision mode. An FCNVSD
converts a single-precision data to a double-precision one, and an FCNVDS con-
verts vice versa.
In the double-precision mode, eight pairs of the front-bank registers are used for
double-precision data, and one 32-bit register, FPUL, is used for a single-precision
or integer data, mainly for the conversion, but the back-bank registers are not used.
This is because the register-file extension is an option as well as the new instructions
of FIPR and FTRV. Table 3.7 summarizes all the floating-point instructions includ-
ing the new ones.
Figure 3.21 illustrates the pipeline structure of the FPU, which corresponds to the
FPU part of the LS pipeline and the FE pipeline of Fig. 3.1. This structure enables
the zero-cycle transfer of the LS-category instructions except load/store ones,
two-cycle latency of the FCMP, four-cycle latency of the FIPR and FTRV, and
three-cycle latency of the other FE-category instructions. On the latter half of the
ID stage, register reads and forwarding of on-the-fly data in the LS pipeline are
performed. The forwarding destinations include the FE pipeline. In particular, a source-operand value of an LS-pipeline instruction is forwarded to the FE pipeline as the destination-operand value of that instruction in order to realize the zero-cycle transfer.
[Figure 3.21: FPU pipeline structure — the FLS block in the LS pipeline and the MAIN, FDS, and VEC blocks in the FE pipeline, with the E0 stage and register write-back.]
A floating-point load/store block (FLS) is the main part of the LS pipeline. At the EX stage, it outputs the store data for an FMOV with a store operation, changes the sign for the FABS and FNEG, and outputs on-the-fly data for forwarding. At the MA stage, it gets the load data for an FMOV with a load operation and outputs on-the-fly data for forwarding. It writes back the result in the middle of the WB stage at the negative edge of the clock. The written data can then be read in the latter half of the ID stage, so no forwarding path from the WB stage is necessary.
The FE pipeline consists of three blocks of MAIN, FDS, and VEC. An E0 stage
is inserted to execute the vector instructions of FIPR and FTRV. The VEC block is
the special hardware to execute the vector instructions of FIPR and FTRV, and the
FDS block is for the floating-point divide and square-root instructions (FDIV and
FSQRT). Both the blocks will be explained later. The MAIN block executes the
other FE-category instructions and the postprocessing of all the FE-category ones.
The MAIN block executes the arithmetic operations for two and a half cycles across the EX, MA, and WB stages.
Figure 3.22 illustrates the structure of the MAIN block. It is constructed to exe-
cute the FMAC, whose three operands are named A, B, and C, and a formula
A + B × C is calculated. Other instructions of FADD, FSUB, and FMUL are treated
by setting one of the inputs to 1.0, −1.0 or 0.0 appropriately.
A floating-point format includes special numbers of zero, denormalized number,
infinity, and not-a-number (NaN) as well as normalized numbers. The inputs are checked by the Type Check part, and if there is a special number, a proper special-number output is generated in parallel with the normal calculation and selected at the Rounder parts of the WB stage instead of the calculation result.
The compare instructions are handled by the Compare part. The comparison is simple, like an integer comparison, except for some special numbers. The input-check result of the Type Check part is used for the exceptional cases and selected instead of the simple comparison result if necessary. The final result is transferred to the EX pipeline to set or clear the T-bit according to the result at the MA stage.
[Figure 3.22: structure of the MAIN block — Type Check, Exp. Diff. and Exp. Adder, Multiplier Array, Aligner, and Compare at EX; carry-propagate adder (CPA), leading-nonzero (LNZ) detector, and Mantissa Normalizer at MA; Mantissa and Exp. Rounders at WB producing the MAIN output, with a feedback path and T-bit path.]
There are two FMAC definitions. One calculates a sequence of an FMUL and an FADD and is good for conforming to the ANSI/IEEE standard, but it requires extra normalization and rounding between the multiply and the add. The extra operations require extra time and cause inaccuracy. The other calculates the exact multiply-and-add value and then normalizes and rounds it. It was not defined by the standard at that time, but it is in the standard now. The SH-4 adopted the latter, fused definition.
The FMAC processing flow is as follows. At the EX stage, Exp. Diff. and Exp. Adder calculate the exponent difference between "A" and "B×C" and the exponent of "B×C," respectively, and the Aligner aligns "A" according to the exponent difference. Then the Multiplier Array calculates the mantissa of "A + B×C": the "B×C" part is calculated in parallel with the above operations, and the aligned "A" is added in the final reduction logic. At the MA stage, the CPA adds the Multiplier Array outputs, the LNZ detector finds the leading nonzero position of the absolute value of the CPA output from the Multiplier Array outputs in parallel with the CPA calculation, and the Mantissa Normalizer normalizes the CPA output using the LNZ output. At the WB stage, the Mantissa Rounder rounds the Mantissa Normalizer output, the Exp. Rounder normalizes and rounds the Exp. Adder output, and both Rounders replace the rounded result with the special result if necessary to produce the final MAIN-block output.
Figure 3.23 illustrates the VEC block. The FTRV reads its inputs over four cycles to calculate the four transformed vector elements. This means the last read is in the fourth cycle, which is too late to cancel the FTRV even if an input value causes an exception. Therefore, the VEC block must treat all the data types appropriately for the FTRV, and all denormalized numbers are detected and adjusted differently from normalized numbers. As illustrated in Fig. 3.23, the VEC block can start its operation at the ID stage by eliminating the input-operand forwarding, and the above adjustment can be done at the ID stage.
[Figure 3.23: structure of the VEC block — input adjustment at ID for Vector-A and Vector-B; Multiplier Arrays 0–3, Exp. Adders 0–3, Exp. Diffs. 01/02/03/12/13/23, Max. Exp., CPA0–3, and MUX/EMUX selection at E0; Aligners 0–3 and a 4-to-2 Reduction Array at EX producing the VEC output (mantissa and exponent).]
At the E0 stage, Multiplier Arrays 0–3 and Exp. Adders 0–3 produce the mantissas and exponents of the four intermediate products, respectively. Since the FIPR and FTRV definitions allow an error of "2^(E−25) + rounding error of the result," the multipliers need not produce exact values, and we can use smaller multipliers that allow this error by properly eliminating the lower-bit calculations. Then, Exp. Diffs. 01, 02, 03, 12, 13, and 23 generate all six combinations of the exponent differences, Max. Exp. judges the maximum exponent from the signs of the six differences, and MUX0–3 select four differences from the six, or zero, to align the mantissas to the mantissa of the maximum-exponent product. Zero is selected for the maximum-exponent product itself. Further, EMUX selects the maximum exponent as the exponent of the VEC output.
At the EX stage, Aligners 0–3 align the mantissas by the four selected differences. Each difference can be positive or negative depending on which product has the maximum exponent, but the shift direction for the alignment is always right, and a proper adjustment is made when the difference is decoded. A 4-to-2 Reduction Array reduces the four aligned mantissas into two, the sum and carry of the mantissa of the VEC output. The VEC output is received by the MAIN block at the MUX of the EX stage.
The vector instructions FIPR and FTRV were defined as optional instructions, and the hardware should also be optimized for configurations without the optional instructions. Further, if we had optimized the hardware for all the instructions together, we could not have shared hardware properly because of the latency difference between FIPR/FTRV and the others. Therefore, the E0 stage is inserted only when an FIPR or FTRV is executed, as a variable-length pipeline, although this causes a one-cycle stall when an FE-category instruction other than FIPR and FTRV is issued right after an FIPR or an FTRV, as illustrated in Fig. 3.24.
[Figure 3.24: one-cycle stall when an FE-category instruction (e.g., FMUL) is issued right after an FIPR or FTRV, which uses the E0 stage.]
[Figure 3.25: out-of-order completion of a single-precision FDIV — the FDS block is occupied for many cycles while following FADD, FSUB, and FMUL instructions complete, and the FDIV postprocess finishes last.]
The FDS block is for the FDIV and FSQRT. The SH-4 adopts an SRT method with carry-save adders, and the FDS block generates three bits of the quotient or square-root value per cycle. The single- and double-precision mantissas are 24 and 53 bits, respectively, and two extra bits, the guard and round bits, are required to generate the final result. The FDS block therefore takes 9 and 19 cycles to generate the mantissas, and the pitches are 10 and 23 cycles for the single- and double-precision FDIVs, respectively; the differences come from some extra cycles before and after the mantissa generation. The pitches of the FSQRTs are one cycle shorter than those of the FDIVs owing to a special treatment at the beginning. These pitches are much longer than those of the other instructions and degrade performance even though the FDIV and FSQRT are much less frequent than the others. For example, if one of every ten instructions is an FDIV and the pitches of the other instructions are one, the total pitch is 19 cycles. Therefore, out-of-order completion of the FDIV and FSQRT is adopted to hide their long pitches; then only the FDS block is occupied for a long time. Figure 3.25 illustrates the out-of-order completion of a single-precision FDIV.
The single-precision FDIV and FSQRT use the MAIN block for two cycles at the
beginning and ending of the operations to minimize the dedicated hardware for the
FDIV and FSQRT. The double-precision ones use it for five cycles, two cycles at
the beginning and three cycles at the ending. Then, the MAIN block is released to
the following instructions for the other cycles of the FDIV and FSQRT.
The double-precision instructions other than the FDIV and FSQRT are emulated by the hardware for the single-precision instructions with a small amount of additional hardware for the emulation. Since the SH-4 merged an integer multiplier into the FPU, it supports 32-bit multiplication and 64-bit addition for the integer multiply-and-accumulate
instruction as well as 24-bit multiplication and 73-bit addition for the FMAC. The 73 bits are necessary to align the addend to the product even when the exponent of the addend is larger than that of the product. The FPU thus supports 32-bit multiplication and 73-bit addition. For the emulation, the 53-bit input mantissas are divided into a higher 21 bits and a lower 32 bits. Figure 3.26 illustrates the FMUL emulation. Four products of lower-by-lower, lower-by-higher, higher-by-lower, and higher-by-higher are calculated and accumulated properly. FPU exception checking is done at the first step, the calculation is done at the second to fifth steps, and the lower and higher parts are output at the fifth and last steps, respectively.
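The splitting and accumulation can be sketched in C as below (an illustration of the mantissa arithmetic only, with assumed names; the real emulation also handles exponents, rounding, and exceptions): each 53-bit mantissa is divided into a lower 32-bit part and a higher 21-bit part, and the four partial products are accumulated into a 106-bit result.

    #include <stdio.h>

    typedef unsigned long long u64;

    /* Multiply two 53-bit mantissas using only the four partial products
     * named in the text: lower-by-lower, lower-by-higher, higher-by-lower,
     * and higher-by-higher.  The 106-bit result is returned as hi:lo. */
    static void mul53(u64 a, u64 b, u64 *hi, u64 *lo)
    {
        u64 al = a & 0xFFFFFFFFull, ah = a >> 32;   /* lower 32 / higher 21 bits */
        u64 bl = b & 0xFFFFFFFFull, bh = b >> 32;

        u64 ll = al * bl;                           /* lower-by-lower   */
        u64 lh = al * bh;                           /* lower-by-higher  */
        u64 hl = ah * bl;                           /* higher-by-lower  */
        u64 hh = ah * bh;                           /* higher-by-higher */

        u64 mid = (ll >> 32) + (lh & 0xFFFFFFFFull) + (hl & 0xFFFFFFFFull);
        *lo = (mid << 32) | (ll & 0xFFFFFFFFull);
        *hi = hh + (lh >> 32) + (hl >> 32) + (mid >> 32);
    }

    int main(void)
    {
        u64 hi, lo;
        mul53((1ull << 52) | 1, (1ull << 52) | 3, &hi, &lo); /* example mantissas */
        printf("product = 0x%llx%016llx\n", hi, lo);
        return 0;
    }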
Figure 3.27 illustrates the FADD and FSUB emulation. The smaller operand is aligned to the greater operand by comparing the two at the first step, along with the exception check. Only the higher halves of the input operands are compared
because the exponents are in the higher halves, and the alignment shift is not necessary if the higher halves are the same. The read operands are then swapped if necessary at the third and later steps. The alignment and addition are done at the third to fifth steps, and the lower and higher parts are output at the fifth and last steps.
As a result, the FMUL, FADD, and FSUB take six steps. The conversion instructions FLOAT, FTRC, FCNVSD, and FCNVDS take two steps, mainly because a double-precision operand requires two cycles to read or write.
$$V' = AV,\quad S_x = \frac{V'_x}{V'_z},\quad S_y = \frac{V'_y}{V'_z},\quad N' = AN,\quad I = \frac{(L,\,N')}{(N',\,N')},$$
$$N = \begin{pmatrix} N_x\\ N_y\\ N_z\\ 0 \end{pmatrix},\quad
N' = \begin{pmatrix} N'_x\\ N'_y\\ N'_z\\ 0 \end{pmatrix},\quad
L = \begin{pmatrix} L_x\\ L_y\\ L_z\\ 0 \end{pmatrix}.$$
The numbers of arithmetic instructions per polygon with the above formula are 17
FMULs, 40 FMACs, 4 FDIVs, and an FSQRT without the architecture extension,
and 4 FTRVs, 2 FIPRs, 7 FMULs, 4 FDIVs, and an FSQRT with the extension.
Figure 3.29 shows the resource-occupying cycles for the benchmark. (1) With the conventional architecture, it took 166 cycles to execute the benchmark, determined by the execution cycles of the load, store, and transfer instructions; the arithmetic operations took 121 cycles and did not affect the performance. (2) The load/store/transfer execution cycles were halved by the pair load/store/transfer instructions, and the arithmetic operations were reduced to 67 cycles by the out-of-order completion of the FDIV and FSQRT. The execution then took 83 cycles. (3) Furthermore, the register extension with the banked register file made it possible to keep the transformation matrix in the back bank and reduced the reloading and save/restore of data; only the light vector was reloaded. The number of load/store/transfer instructions then decreased to 25 and was no longer a performance bottleneck. In addition, the arithmetic operations decreased to 35 cycles thanks to the FIPR and FTRV. As explained with Fig. 3.24, a one-cycle stall occurs after using the E0 stage, and three cycles of such stalls occurred in the benchmark, as well as two cycles of stalls from normal register conflicts. As a result, the benchmark execution was reduced by 76%, from 166 cycles to 40 cycles.
Figure 3.30 shows the benchmark performance of the SH-4 at 200 MHz. The
performance was enhanced from 1.2-M polygons/s of the conventional superscalar
architecture to 2.4-M polygons/s by the pair load/store/transfer instructions and
out-of-order completion of the FDIV and FSQRT and to 5.0-M polygons/s by the
register extension and the extended instructions of the FIPR and FTRV. The corre-
sponding scalar performances would be 0.7, 1.3, and 3.1-M polygons/s at 200 MHz
for 287, 150, and 64 cycles, respectively, and the superscalar performances were about 70% higher than the scalar ones, whereas the gain was 30% for the Dhrystone benchmark. This showed that the superscalar architecture was more effective for multimedia applications than for general integer applications. Since the SH-3E was a scalar processor without the SH-4's enhancements, it took 287 cycles, the slowest case of the above performance evaluations. Therefore, the SH-4 achieved 287/40 = 7.2 times the cycle performance of the SH-3E for media processing such as 3D graphics.
The SH-4 achieved excellent media-processing efficiency. Its cycle performance and frequency were 7.2 and 1.5 times those of the SH-3E in the same process. Therefore, the media performance in the same process was 7.2 × 1.5 = 10.8 times as high. The FPU area of the SH-3E was estimated to be 3 mm2 and that of the SH-4 was 8 mm2 in a 0.25-µm process, so the SH-4 FPU was 8/3 = 2.7 times as large as that of the SH-3E. As a result, the SH-4 achieved 10.8/2.7 = 4.0 times the area efficiency of the SH-3E for media processing.
The SH-3E consumed similar power for both the Dhrystone and the 3D benchmark. On the other hand, the SH-4 consumed 2.2 times as much power for the 3D benchmark as for the Dhrystone. As described in Sect. 3.1.2.7, the power consumptions of the SH-3 and SH-4 ported to a 0.18-µm process were 170 and 240 mW at 133 MHz and a 1.5-V power supply for the Dhrystone. Therefore, the power of the SH-4 was 240 × 2.2/170 = 3.3 times that of the SH-3. The corresponding performance ratio is 7.2 because they run at the same frequency after the porting. As a result, the SH-4 achieved 7.2/3.3 = 2.18 times the power efficiency of the SH-3E.
The actual efficiencies including the process contribution are 60 MHz/287 cycles = 0.21-M polygons/s and 0.21-M polygons/s/0.6 W = 0.35-M polygons/s/W for the SH-3E, and 5.0-M polygons/s/2 W = 2.5-M polygons/s/W for the SH-4.
FTRV, the out-of-order completions of the FDIV and FSQRT, and proper extensions of the register files and the load/store/transfer width. Further parallelization could have been one of the next approaches, but we took another approach: enhancing the operating frequency. The main reason was that the CPU side had to take this approach for general applications with low parallelism, as described in Sect. 3.1.2. However, simply allowing 1.5-times-longer latencies of the FPU instructions would have caused serious performance degradation. Therefore, we enhanced the architecture and microarchitecture to reduce the latencies efficiently.
The FDIV and FSQRT of the SH-4 were already long-latency instructions, and the 1.5-times-longer latencies of the SH-X could have caused serious performance degradation. The long latencies came mainly from the strict operation definitions of the ANSI/IEEE 754 floating-point standard: we had to keep an exact value before rounding. However, there was another way if we allowed appropriate inaccuracies.
A floating-point square-root reciprocal approximate (FSRRA) instruction was defined as an elementary-function instruction to replace the FDIV, the FSQRT, or their combination, so that the long-latency instructions need not be used. In particular, 3D graphics applications require many reciprocal and square-root-reciprocal values, and the FSRRA is highly effective for them. Further, 3D graphics require less accuracy, and single precision without strict rounding is accurate enough. The maximum error of the FSRRA is ±2^(E−21), where E is the exponent value of the FSRRA result. The FSRRA definition is as follows:
$$\mathrm{FR}n = \frac{1}{\sqrt{\mathrm{FR}n}}.$$
The SH-X FPU achieved 1.4 times the SH-4 frequency in the same process while maintaining or enhancing the cycle performance. Table 3.8 shows the pitches and latencies of the FE-category instructions of the SH-3E, SH-4, and SH-X. As for the
SH-X, the simple single-precision instructions of FADD, FSUB, FLOAT, and FTRC
have three-cycle latencies. Both single- and double-precision FCMPs have two-
cycle latencies. Other single-precision instructions of FMUL, FMAC, and FIPR and
the double-precision instructions except FMUL, FCMP, FDIV, and FSQRT have
five-cycle latencies. All the above instructions have one-cycle pitches. The FTRV consists of four FIPR-like operations, resulting in a four-cycle pitch and an eight-cycle
latency. The FDIV and FSQRT are out-of-order completion instructions having
two-cycle pitches for the first and last cycles to initiate a special resource operation
and to perform postprocesses of normalizing and rounding of the result. Their
pitches of the special hardware expressed in the parentheses are about halves of the
mantissa widths, and the latencies are four cycles more than the special-hardware
pitches. The FSRRA has a one-cycle pitch, a three-cycle pitch of the special hardware, and a five-cycle latency. The FSCA, a floating-point sine and cosine approximate instruction, has a three-cycle pitch, a five-cycle pitch of the special hardware, and a seven-cycle latency. The double-precision FMUL has a three-cycle pitch and a seven-cycle latency.
Multiply–accumulate (MAC) is one of the most frequent operations in intensive
computing applications. The use of four-way SIMD would achieve the same
throughput as the FIPR, but the latency was longer, and the register file had to be
larger. Figure 3.31 illustrates an example of the differences according to the pitches
and latencies of the FE-category SH-X instructions shown in Table 3.8. In this
example, each box shows an operation issue slot. Since FMUL and FMAC have
five-cycle latencies, we must issue 20 independent operations for peak throughput
in the case of four-way SIMD. The result is available 20 cycles after the FMUL
issue. On the other hand, five independent operations are enough to get the peak
throughput of a program using FIPRs. Therefore, FIPR requires one-quarter of the
program’s parallelism and latency.
Figure 3.32 compares the pitch and latency of an FSRRA and the equivalent
sequence of an FSQRT and an FDIV according to Table 3.8. Each of the FSQRT
and FDIV occupies 2 and 13 cycles of the MAIN FPU and special resources, respec-
tively, and takes 17 cycles to get the result, and the result is available 34 cycles after
the issue of the FSQRT. In contrast, the pitch and latency of the FSRRA are one and
five cycles, respectively, which are only one-quarter and approximately one-fifth of those of the equivalent sequence. The FSRRA is thus much faster while using a similar amount of hardware resources.
The FSRRA can compute a reciprocal as shown in Fig. 3.33. The FDIV occupies
2 and 13 cycles of the MAIN FPU and special resources, respectively, and takes 17
cycles to get the result. On the other hand, the FSRRA and FMUL sequence occu-
pies two and three cycles of the MAIN FPU and special resources, respectively, and
takes ten cycles to get the result. Therefore, the FSRRA-and-FMUL sequence is better than using the FDIV if an application does not require a result conforming to the IEEE standard, and 3D graphics is one such application.
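The idea can be sketched in C as follows (the standard-library sqrtf stands in for the FSRRA approximation; this only illustrates the FSRRA-and-FMUL sequence, not the hardware behavior): the reciprocal of a positive x is obtained by squaring its square-root reciprocal.

    #include <math.h>
    #include <stdio.h>

    /* Compute 1/x without a divide, as the FSRRA + FMUL sequence does:
     * r = x^(-1/2), then r*r = 1/x.  sqrtf() is only a stand-in for the
     * FSRRA result, which has a bounded error instead of a correctly
     * rounded value. */
    static float recip_via_rsqrt(float x)
    {
        float r = 1.0f / sqrtf(x);   /* stands in for FSRRA */
        return r * r;                /* one FMUL            */
    }

    int main(void)
    {
        float x = 3.0f;
        printf("1/x: by divide %.7f, via rsqrt %.7f\n", 1.0f / x, recip_via_rsqrt(x));
        return 0;
    }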
[Figure 3.34: FPU arithmetic execution pipeline — the FLS, short, main, FDS, and FPOLY pipelines from E2 to E7, with register writes at E5 and E7 across the LS and FE pipelines.]
We decided to make the vector instructions, which were optional in the SH-4, standard in the SH-X, and the SH-X merged the vector hardware and optimized the merged hardware. The latencies of most instructions then became less than 1.5 times those of the SH-4, and all the instructions could use the vector hardware if necessary. The requirements for high-speed double-precision operations were weak when the SH-4 was developed, so hardware emulation was chosen to implement them. In the SH-X implementation, however, they could use the vector hardware and became faster, mainly owing to the wider read/write register ports and the additional multipliers.
Figure 3.34 illustrates the FPU arithmetic execution pipeline. With the delayed-execution architecture, the register-operand read and forwarding are done at the E1 stage, and the arithmetic operation starts at E2. The short arithmetic pipeline handles the three-cycle-latency instructions. All the arithmetic pipelines share one register-write port to reduce the number of ports. There are four forwarding source points to provide the specified latencies for any cycle distance between defining and using instructions. The FDS pipeline is occupied for 13/28 cycles to execute a single-/double-precision FDIV or FSQRT, so these instructions cannot be issued frequently. The FPOLY pipeline is three cycles long and is occupied three or five times to execute an FSRRA or FSCA instruction, respectively. Therefore, the third E4 stage and the E6 stage of the main pipeline are synchronized for the FSRRA, and the FPOLY-pipeline output merges with the main pipeline at this point. The FSCA produces two outputs: the first is produced at the same timing as that of the FSRRA, and the second is produced two cycles later; the main pipeline is occupied for three cycles, although the second cycle is not used. The FSRRA and FSCA are implemented by evaluating cubic polynomials over properly divided intervals. The width of the third-order term is eight bits, which adds only a small area overhead while enhancing accuracy and reducing latency.
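A minimal numerical sketch of this piecewise-cubic idea is shown below in C (the interval count, the Taylor expansion around each interval centre, and the choice of 1/sqrt(x) are assumptions for illustration; the actual FSRRA/FSCA tables and term widths are not reproduced here).

    #include <math.h>
    #include <stdio.h>

    #define INTERVALS 64                            /* assumed number of intervals */

    /* Approximate 1/sqrt(x) for x in [1, 2) with a cubic expansion around the
     * centre of the interval containing x, illustrating the evaluation of
     * cubic polynomials over properly divided intervals. */
    static double rsqrt_cubic(double x)
    {
        int    i  = (int)((x - 1.0) * INTERVALS);   /* interval index         */
        double c  = 1.0 + (i + 0.5) / INTERVALS;    /* interval centre        */
        double d  = x - c;                          /* offset from the centre */
        double f  = 1.0 / sqrt(c);                  /* f(c)   = c^(-1/2)      */
        double f1 = -0.5   * f / c;                 /* f'(c)                  */
        double f2 =  0.75  * f / (c * c);           /* f''(c)                 */
        double f3 = -1.875 * f / (c * c * c);       /* f'''(c)                */
        return f + d * (f1 + d * (f2 / 2.0 + d * f3 / 6.0));
    }

    int main(void)
    {
        double x = 1.7;
        printf("1/sqrt(%g): exact %.9f, piecewise cubic %.9f\n",
               x, 1.0 / sqrt(x), rsqrt_cubic(x));
        return 0;
    }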
Figure 3.35 illustrates the structure of the main FPU pipeline. There are four
single-precision multiplier arrays at E2 to execute FIPR and FTRV and to emulate
accumulated to the lower-by-lower product by the reduction array, and its lower 23 bits are also used to generate a sticky bit. At the third step, the remaining four products of middle-by-middle, upper-by-middle, middle-by-upper, and upper-by-upper are produced and accumulated onto the already accumulated intermediate values. Then the CPA adds the sum and carry of the final product, and the 53-bit result and the guard/round/sticky bits are produced. There are ten accumulated terms in the second and third steps because each product consists of a sum and a carry, but the bit positions of some terms do not overlap. Therefore, the eight-term reduction array is enough to accumulate them.
$$V'' = TV,\quad V' = \frac{V''}{V''_W},\quad S_x = \frac{V'_x}{V'_z},\quad S_y = \frac{V'_y}{V'_z},\quad N' = TN,\quad I = \frac{(L,\,N')}{(N',\,N')},$$
$$N = \begin{pmatrix} N_x\\ N_y\\ N_z\\ 0 \end{pmatrix},\quad
N' = \begin{pmatrix} N'_x\\ N'_y\\ N'_z\\ 0 \end{pmatrix},\quad
L = \begin{pmatrix} L_x\\ L_y\\ L_z\\ 0 \end{pmatrix}.$$
not used. Similarly, when the intensity is also calculated, the execution cycles are 19 and 52 with and without the special instructions, respectively, that is, 63% shorter using the special instructions than without them.
Figure 3.38 shows the 3D graphics benchmark performance at 400 MHz, accord-
ing to the cycles shown in Fig. 3.37. Without special instructions, the coordinate and
perspective transformation performance is 15-M polygons/s. With special instruc-
tions, the performance is accelerated 2.4 times to 36-M polygons/s. Similarly, with
intensity calculation, but without any special instructions, 7.7-M polygons/s is
achieved. Using special instructions, the performance is accelerated 2.7 times to
21-M polygons/s.
It is useful to compare the SH-3E, SH-4, and SH-X performance on the same benchmark. Figure 3.39 shows the resource-occupying cycles of the SH-3E, SH-4, and SH-X. The main difference between the SH-4 and the SH-X is the newly defined FSRRA and FSCA, and the effect of the FSRRA is clearly shown in the figure.
The conventional SH-3E architecture took 68 cycles for the coordinate and perspective transformations, 74 cycles for the intensity calculation, and 142 cycles in total. Applying the superscalar architecture and the SRT method for the FDIV/FSQRT while keeping the SH-3E ISA, these became 39, 42, and 81 cycles, respectively. The SH-4 architecture, with the FIPR/FTRV and the out-of-order FDIV/FSQRT, made them 20, 19, and 39 cycles, respectively. The performance was good, but only the FDIV/FSQRT resource was busy in this case. Further, applying the superpipeline architecture while keeping the SH-4 ISA, they became 26, 26, and 52 cycles, respectively. Although the operating frequency grew higher with the superpipeline architecture, the cycle-performance degradation was serious, and almost no performance gain was achieved. In the SH-X ISA case with the FSRRA, they became 11, 8, and 19 cycles, respectively. Clearly, the FSRRA solved the long-pitch problem of the FDIV/FSQRT.
Since we emphasized the importance of efficiency, we evaluated the area and power efficiencies. Figure 3.40 shows the area efficiencies of the SH-3E, SH-4, and SH-X. The upper half shows the architectural performance, relative area, and architectural
Fig. 3.39 Resource-occupying cycles of SH-3E, SH-4, and SH-X for a 3D benchmark
The SH cores have continuously achieved high efficiency, as described above. The SH-X3 core was developed as the third generation of the SH-4A processor-core series to achieve higher performance while keeping the high efficiency maintained throughout the SH core series.
The multicore architecture was the next approach for the series. In this section, the multicore support features of the SH-X3 are described, whereas the multicore cluster of the SH-X3 and the snoop controller (SNC) are described in the chip-implementation sections for the RP-1 (Sect. 4.2) and RP-2 (Sect. 4.3).
Table 3.9 shows the specifications of an SH-X3 core, designed based on the SH-X2 core (see Sect. 3.1.4). As its successor, most of its specifications are the same as those of the SH-X2 core. In addition to these inherited specifications, the core supports both symmetric and asymmetric multiprocessor (SMP and AMP) features, with interrupt distribution and interprocessor interrupts, in cooperation with an interrupt controller of an SoC such as the RP-1 or RP-2. Each core of a cluster can be set to either the SMP or the AMP mode individually. The core also supports three low-power modes, light sleep, sleep, and resume standby, which can differ for each core, as the operating frequency can. The sizes of the RAMs and caches are flexible depending on requirements, within the ranges shown in the table.
The supported SMP data-cache coherency protocols are standard MESI (Modified,
Exclusive, Shared, Invalid) and ESI modes for copy-back and write-through modes,
respectively. The copy-back and MESI modes are good for performance, and the write-through and ESI modes are suitable for controlling accelerators that cannot manipulate the data cache of the SH-X3 cores properly.
The SH-X3 outputs one of the following snoop requests of the cache line to the
SNC with the line address and write-back data if any:
1. Invalidate request for write and shared case
2. Fill-data request for read and cache-miss case
3. Fill-data and invalidate request for write and cache-miss case
4. Write-back request to replace a dirty line
The SNC transfers each request other than a write-back to the proper cores by checking its duplicated address array (DAA), and the requested SH-X3 cores process the request.
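The mapping from cache-access cases to these four requests can be summarized in a small C sketch (the flag encoding and function names are illustrative assumptions, not the SH-X3 interface):

    #include <stdio.h>

    /* Snoop requests an SH-X3 core sends to the SNC, as bit flags so that a
     * miss which replaces a dirty line can issue a fill and a write-back
     * together; the encoding here is an assumption for illustration. */
    enum {
        SNOOP_INVALIDATE          = 1 << 0,  /* 1. write to a shared line      */
        SNOOP_FILL                = 1 << 1,  /* 2. read miss                   */
        SNOOP_FILL_AND_INVALIDATE = 1 << 2,  /* 3. write miss                  */
        SNOOP_WRITE_BACK          = 1 << 3   /* 4. replacement of a dirty line */
    };

    static unsigned snoop_requests(int is_write, int hit, int shared, int victim_dirty)
    {
        unsigned req = 0;
        if (hit) {
            if (is_write && shared) req |= SNOOP_INVALIDATE;
        } else {
            req |= is_write ? SNOOP_FILL_AND_INVALIDATE : SNOOP_FILL;
            if (victim_dirty)       req |= SNOOP_WRITE_BACK;
        }
        return req;
    }

    int main(void)
    {
        printf("write miss replacing a dirty line -> requests 0x%x\n",
               snoop_requests(1, 0, 0, 1));
        return 0;
    }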
In a chip multiprocessor, the core loads are not equal, and each SH-X3 core can operate at a different operating frequency and in a different low-power mode to minimize the power consumption for its load. The SH-X3 core can support the SMP features even in such heterogeneous operation modes of the cores. The SH-X3 supports a new low-power mode, "light sleep," in order to respond to a snoop request from the SNC while the core is inactive. In this mode, the data cache is active for the snoop operation, but the other modules are inactive. The detailed snoop processes, including the SNC actions, are described in Sect. 4.2.
The on-chip RAMs and the data transfer among the various memories are the key
features for the AMP support. The use of on-chip RAM makes it possible to control
the data access latency, which cannot be controlled well in systems with on-chip
caches. Therefore, each core integrates L1 instruction and data RAMs and a second-
level (L2) unified RAM. The RAMs are globally addressed to transfer data to/from
the other globally addressed memories. Application software can then place data at the proper timing and location.
The SH-X3 integrates a data transfer unit (DTU) to accelerate memory data transfers between the SH-X3 and other modules. The details of the DTU are explained in Sect. 3.1.8.4.
Embedded systems continuously expand their application fields and enhance their performance and functions in each field. As a key component of such systems, embedded processors must enhance their performance and functions while maintaining or improving their efficiency. As the latest SH processor core, the SH-X4 extended its ISA and address space efficiently for this purpose.
The SH-X4 was integrated on the RP-X heterogeneous multicore chip as two four-core clusters together with four FE–GAs, two MX-2s, a VPU5, and various peripheral modules. The SH-X4 core features are described in this section, and the chip integration
and evaluation results are described in Sect. 4.4. Further, software environments are
described in Chap. 5, and application programs and systems are described in Chap. 6.
Table 3.10 shows the specifications of an SH-X4 core, designed based on the SH-X3 core (see Sect. 3.1.7). As its successor, most of its specifications are the same as those of the SH-X3 core, and the identical parts are not shown. The SH-X4 extended the ISA with some prefixes, and the cycle performance was enhanced from
2.23 to 2.65 MIPS/MHz. As a result, the SH-X4 achieved 1,717 MIPS at 648 MHz. The 648 MHz is not much higher than the 600 MHz of the SH-X3, but the SH-X4 achieved it in a low-power process. The typical power consumption is then 106 mW, and the power efficiency reached as high as 16 GIPS/W.
The 16-bit fixed-length ISA of the SH cores is an excellent feature, enabling a higher code density than the 32-bit fixed-length ISAs of conventional RISCs. However, we made some trade-offs to establish the 16-bit ISA. The operand fields were carefully shortened to fit the instructions into 16 bits according to a code analysis of typical embedded programs in the early 1990s. The 16-bit ISA was the best choice at that time and for the following two decades. However, the required performance grew higher and higher, and program sizes and the data to be handled grew larger and larger. Therefore, we decided to extend the ISA with some prefix codes.
The weak points of the 16-bit ISA are (1) short immediate operands, (2) the lack of three-operand operation instructions, and (3) implicit fixed-register operands. With short immediates, a long immediate requires a two-instruction sequence of a long-immediate load and a use of the loaded data instead of a single long-immediate instruction. A three-operand operation becomes a two-instruction sequence of a move instruction and a two-operand instruction. An implicit fixed-register operand makes register allocation difficult and causes inefficient allocations.
A popular way to extend a 16-bit ISA is a variable-length ISA. For example, IA-32 is a famous variable-length ISA, and ARM Thumb-2 is a variable-length ISA of 16 and 32 bits. However, a variable-length instruction consists of plural unit-length codes, and each unit-length code has plural meanings depending on the preceding codes. Therefore, a variable-length ISA makes parallel issue complicated, large, and slow because it requires serial code analysis.
Another way is to use prefix codes. The IA-32 uses some prefixes as well as variable-length instructions, and using prefix codes is one of the conventional ways. However, if we use prefix codes but not variable-length instructions, we can implement parallel instruction decoding easily. The SH-X4 introduced some 16-bit prefix codes to extend the 16-bit fixed-length ISA.
Figure 3.42 shows some examples of the ISA extension. The first example is an
operation “Rc = Ra + Rb (Ra, Rb, Rc: registers),” which requires a two-instruction
sequence of “MOV Ra, Rc (Rc = Ra)” and “ADD Rb, Rc (Rc + = Rb)” before extension,
but only one instruction “ADD Ra, Rb, Rc” after the extension. The new instruction is
made of the “ADD Ra, Rb” by a prefix to change a destination register operand Rb to
a new register operand Rc. The code sizes are the same, but the number of issue slots
reduces from two to one. Then the next instruction can be issued simultaneously if
there is no other pipeline stall factor.
The second example is an operation “Rc = @(Ra + Rb),” which requires a two-
instruction sequence of “MOV Rb, R0 (R0 = Rb)” and “MOV.L @(Ra, R0), Rc
(Rc = @(Ra + R0))” before extension, but only an instruction “MOV.L @(Ra, Rb),
Rc” after the extension. The new instruction is made of the “MOV @(Ra, R0), Rc”
by a prefix to change the R0 to a new register operand. Then we do not need to use
the R0, which is the third implicit fixed operand with no operand field to specify. It
makes the R0 busy and register allocation inefficient to use the R0-fixed operand,
but the above extension solves the problem.
The third example is an operation “Rc = @(Ra + lit8) (lit8: 8-bit literal),” which
requires a two-instruction sequence of “MOV lit8, R0 (R0 = lit8)” and “MOV.L @
(Ra, R0), Rc (Rc = @(Ra + R0))” before extension, but only an instruction “MOV.L
@(Ra, lit8), Rc” after the extension. The new instruction is made of the “MOV.L @
(Ra, lit4), Rc (lit4: 4-bit literal)” by a prefix to extend the lit4 to lit8. The prefix can
specify the loaded data size in memory and the extension type of signed or unsigned
if the size is 8 or 16 bits as well as the extra 4-bit literal.
Figure 3.43 illustrates the instruction decoder of the SH-X4, which enables a dual issue including instructions extended by prefix codes. The gray parts are the extra logic for the extended ISA. The instruction registers at the I3 stage hold the first four 16-bit codes, whereas two codes were sufficient for the conventional 16-bit fixed-length ISA. The simultaneous dual issue of instructions with prefixes consumes the four codes per cycle at peak throughput. A predecoder checks in parallel whether each code is a prefix or not and outputs the control signals of the multiplexers (MUX) to select the inputs of the prefix and normal decoders properly. Table 3.11 summarizes all cases of the input patterns and the corresponding selections. A code following a prefix code is always a normal code, so the hardware need not check it. Each prefix decoder decodes the provided prefix code and overrides the output of the normal decoder appropriately. As a result, the instruction decoder performs the dual issue of instructions with prefixes.
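To make the selection concrete, the following C sketch models the predecode step under stated assumptions: the is_prefix() helper and the code layout are hypothetical, and the real SH-X4 logic is implemented in hardware rather than software.

    #include <stdbool.h>
    #include <stdint.h>

    /* Hypothetical prefix detector; the actual SH-X4 prefix encodings are not
     * reproduced here. */
    extern bool is_prefix(uint16_t code);

    typedef struct {
        bool     has_prefix0, has_prefix1;
        uint16_t prefix0, code0;      /* issue slot 0 */
        uint16_t prefix1, code1;      /* issue slot 1 */
    } issue_pair_t;

    /* Model of the predecode: scan up to four 16-bit codes per cycle and form
     * two issue slots, each of which may carry a prefix plus a normal code. */
    static issue_pair_t predecode(const uint16_t c[4])
    {
        issue_pair_t p = {0};
        int i = 0;

        p.has_prefix0 = is_prefix(c[i]);
        if (p.has_prefix0)
            p.prefix0 = c[i++];
        p.code0 = c[i++];             /* a code after a prefix is always normal */

        p.has_prefix1 = is_prefix(c[i]);
        if (p.has_prefix1)
            p.prefix1 = c[i++];
        p.code1 = c[i];
        return p;
    }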
Figure 3.44 shows evaluation results of the extended ISA with four benchmark
programs. The performance of Dhrystone 2.1 was accelerated from 2.24 to 2.65
MIPS/MHz by 16%. The performance of FFT, FIR, and JPEG encoding was
improved by 23%, 34%, and 10%, respectively. On the other hand, the area overhead of the prefix-code implementation was less than 2% of the SH-X4. This means that the ISA extension by prefix codes enhanced both performance and efficiency.
The 32-bit address can define an address space of 4 GB. The space consists of the main memory, on-chip memories, various I/O spaces, and so on, and the maximum linearly addressed space for the main memory is then 2 GB. However, the total memory size required by embedded systems continues to grow, and the SH-X4 therefore supports an extended address space.
(Address-map figure: in the 32-bit space, the P0/U0 (TLB), P1 (PMB), P2 (PMB), and P3 (TLB) regions provide a linear space of 2^32 - 2^29 bytes (3.5 GB) below the P4 region at E0000000-FFFFFFFF; the extended space provides a 1-TB linear space of 2^40 - 2^29 bytes below the P4 region at FF E0000000-FF FFFFFFFF.)
High-speed and efficient data transfer is one of the keys to multicore performance, and the SH-X4 core integrates a DTU for this purpose. A DMAC is the conventional hardware for such data transfers. However, the DTU has some advantages over a DMAC
Fig. 3.46 DTU operation example of transfer between SH-X4 and FE–GA
because the DTU is a part of the SH-X4 core. For example, when a DMAC transfers data between a memory in an SH-X4 core and the main memory, the DMAC must initiate two SuperHyway bus transactions: one between the SH-X4 core and the DMAC and one between the DMAC and the main memory. In contrast, the DTU can perform the transfer with a single SuperHyway bus transaction between the SH-X4 core and the main memory. In addition, the DTU uses the initiator port of the SH-X4 core, whereas a DMAC must have its own initiator port; even if every SH-X4 core has a DTU, no extra initiator port is necessary. Another merit is that the DTU can share the UTLB of the SH-X4 core, so the DTU can handle logical addresses.
Figure 3.46 shows an example of a data transfer between an SH-X4 core and an FE–GA. The DTU has a TTLB, a micro-TLB that caches UTLB entries of the CPU so that the DTU can run independently; when a translation misses the TTLB, the DTU fetches the corresponding UTLB entry. The DTU action is defined by a command chain in a local memory, and the DTU can execute a command chain of multiple commands without CPU control. In the example, the DTU transfers data from a local memory of the SH-X4 to a memory in the FE–GA. The source data specified by the source address of the command are read from the local memory, and the destination address specified by the command is translated by the TTLB. Then the address and data are output to the SuperHyway via the bus interface, and the data are transferred to the destination memory of the FE–GA.
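As a rough picture of such a command chain, a DTU command can be thought of as a small descriptor placed in the local memory; the field names and layout below are assumptions for illustration only and do not reproduce the documented SH-X4 format.

    #include <stdint.h>

    /* Hypothetical descriptor for one DTU command in a command chain. */
    typedef struct dtu_cmd {
        uint32_t src_addr;       /* source logical address                        */
        uint32_t dst_addr;       /* destination logical address (TTLB-translated) */
        uint32_t size;           /* transfer size in bytes                        */
        uint32_t flags;          /* e.g., interrupt-on-completion, end-of-chain   */
        struct dtu_cmd *next;    /* next command in the chain, or NULL            */
    } dtu_cmd_t;

    /* A two-command chain could, for example, copy an SH-X4 local-memory buffer
     * to an FE-GA memory and then fetch the results back without CPU control. */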
(Fig. 3.47: FE–GA block diagram: a sequence manager (SEQM) handling interruptions and DMA requests, an array of operation cells (24 ALU cells + 8 MLT cells), load/store (LS) cells, local-memory banks of compiled RAM (CRAM, 4-16 KB, 2-port), and an I/O port controller, connected through the internal bus and a cell control bus.)
Figure 3.47 illustrates the architecture of the FE–GA, which consists of an operation block and a control block. The operation block is composed of two-dimensionally arrayed arithmetic logic unit (ALU)/multiplication (MLT) cells whose functions and connections to neighboring cells are dynamically changeable, a multiple-banked local memory (LM) for data storage, load/store (LS) cells that generate addresses for the LM, and a crossbar (XB) network supporting internal data transfer between the LS cells and the LM. The LM is divided into multiple banks (CRAMs). The control block consists of a configuration manager (CFGM) that manages the configuration data for the operation block and a sequence manager (SEQM) that controls the state of the operation block. The FE–GA is highly optimized in terms of power and performance for media processing in embedded systems.
(Figures: block diagrams of the operation cells: each cell connects to its four neighboring cells through input/output switches, transfer registers with delay control, and through (THR) paths, exchanging 8-bit data with a valid bit and 1-bit carry data with a valid bit; the ALU cell provides arithmetic, logical, shift (SFT), and flow-control operations.)
The FE–GA has a 10-bank local memory (CRAMs) to store both the operands for the operation cell array and the operation results. Each bank can be accessed from both the operation cell array and the outside CPUs in units of 16-bit data. The maximum size of a memory bank is 16 KB, or 8 K words. Each bank is a dual-port type; therefore, data transfers to/from the memory and operations on the cell array can be executed simultaneously.
To utilize the multiple banks of the local memory easily and flexibly, the FE–GA has load/store (LS) cells that can be configured exclusively for access control of each bank.
Figure 3.51 shows a block diagram of the LS cell.
(Fig. 3.51: LS cell block diagram: crossbar ports, a bus interface, and memory interfaces with read/write control for the two local-memory ports, with 8-bit data paths carrying a valid bit and 1-bit carry paths carrying a valid bit.)
The LS cells generate addresses,
arbitrate multiple accesses, and control the access protocols to the local memory in response to memory accesses from the cell array or the outside CPUs. The LS cells can generate various addressing patterns that match the characteristics of the application by selecting appropriate addressing and timing-control methods. The addressing methods include direct supply of addresses from the cell array and generation of modulo addresses in the LS cells, and both methods can use bit reversing. The timing-control methods include designation by the cell array and generation in the LS cells. Table 3.13 gives the instruction set, which includes ten instructions for the LS cells. The instructions support data widths of 16 and 8 bits; no suffix is attached to instructions for 16-bit data, and the suffix ".B" is attached for 8-bit data.
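The following C sketch illustrates the kind of address sequence such an LS cell can generate; it is a generic model of modulo addressing with optional bit reversal under stated assumptions, not the exact FE–GA hardware behavior.

    #include <stdint.h>

    /* Reverse the lowest 'bits' bits of x (used, e.g., for FFT data reordering). */
    static uint32_t bit_reverse(uint32_t x, unsigned bits)
    {
        uint32_t r = 0;
        for (unsigned i = 0; i < bits; i++) {
            r = (r << 1) | (x & 1u);
            x >>= 1;
        }
        return r;
    }

    /* i-th address of a stream: base + ((start + i*step) mod size), with the
     * offset optionally bit-reversed within log2(size) bits. */
    static uint32_t ls_addr(uint32_t base, uint32_t start, uint32_t step,
                            uint32_t size, unsigned log2_size, int reverse,
                            uint32_t i)
    {
        uint32_t offset = (start + i * step) % size;
        if (reverse)
            offset = bit_reverse(offset, log2_size);
        return base + offset;
    }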
The crossbar is a network of switches that connects the 16 operation cells on the left and right sides of the cell array with the 10 LS cells according to the crossbar configuration. It supports various connections, such as point to point, multiple points to point (broadcast of data loaded by an LS cell to several operation cells), and point to multiple points (storing data from an operation cell to multiple banks of the local memory via LS cells), for efficient memory usage. It also supports separate transfers of the upper and lower bits on a load data bus from multiple banks of the local memory.
(Figures: the FE–GA control structure, in which the sequence manager and configuration registers control the operation cells, LS cells, crossbar, and local memory over the system bus, and the operation flowchart of Fig. 3.54: set up configuration control registers, set up sequence control registers, transfer data, switch threads, and execute operations until the processing is finished.)
The FE–GA carries out various processes on a single hardware platform by setting
up configurations of the operation cell array, the LS cells, and the crossbar network
and by changing the configurations dynamically. Figure 3.54 shows an operation
flowchart of an FE–GA.
The operation steps of the FE–GA are as follows:
1. Set up configuration control registers.
The FE–GA executes a specified arithmetic process by having each cell and the crossbar operate according to the configurations corresponding to CPU commands. This specified processing is called a thread, which is identified by a logical thread number. At this stage, an outside CPU or a DMA controller sets up the controlling resources in the configuration manager, such as the registers that define the buffers storing configuration data and the correspondence between a logical thread number and the data stored in the configuration buffer.
2. Set up sequence control registers.
The FE–GA defines its internal states as combinations of the configuration state of each cell and the crossbar, identified by the logical thread number, and parameters such as an operation mode and an operation state. A transition from one internal state to another is called a thread switch, and a series of such switchings is called a sequence. At this stage, an outside CPU or a DMA controller sets up a sequence control table that defines the switching conditions and the states before and after each switch and initializes the internal state.
3. Transfer data.
An outside CPU or a DMA controller transfers the data necessary for the operation from an outside buffer memory or another bank of the FE–GA's local memory to the specified bank of the local memory. It also transfers the operation results to memories inside and outside the FE–GA.
4. Thread switch (reconfiguration).
After completion of the setups, an outside CPU triggers the FE–GA, and the FE–GA starts its operation under the control of the sequence manager. The sequence manager observes both the internal state and the trigger events that establish the condition for thread switching. When the condition is satisfied, it updates the internal state and executes the thread switch. A thread switch takes two cycles. When the processing is finished or an error occurs, the sequence manager halts the processing and issues an interrupt to an outside CPU for service.
5. Execute operations.
When a thread switch is completed, the FE–GA starts the processing defined by the configurations identified by the logical thread number of the newly selected thread. The processing continues until the next thread-switch condition is satisfied.
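A minimal control-flow sketch of these steps, as seen from a host CPU, is given below. All register names and helper functions are hypothetical assumptions for illustration; the real interface goes through the configuration and sequence managers and is not reproduced here.

    #include <stdint.h>

    /* Hypothetical register offsets and helpers for the sketch only. */
    enum { FEGA_CFG_TABLE = 0x00, FEGA_SEQ_TABLE = 0x04, FEGA_TRIGGER = 0x08 };

    extern void fega_write_reg(uint32_t offset, uint32_t value);  /* hypothetical MMIO write   */
    extern void dtu_copy(void *dst, const void *src, uint32_t n); /* hypothetical DTU transfer */

    void fega_run(const uint32_t *cfg_table, const uint32_t *seq_table,
                  const int16_t *input, int16_t *lm_bank0, uint32_t nsamples)
    {
        fega_write_reg(FEGA_CFG_TABLE, (uint32_t)(uintptr_t)cfg_table); /* step 1: configurations */
        fega_write_reg(FEGA_SEQ_TABLE, (uint32_t)(uintptr_t)seq_table); /* step 2: sequence table */
        dtu_copy(lm_bank0, input, nsamples * sizeof(int16_t));          /* step 3: operand data   */
        fega_write_reg(FEGA_TRIGGER, 1);                                /* steps 4-5: the sequence */
        /* manager then performs thread switching and execution and raises an    */
        /* interrupt to the CPU when the processing finishes or an error occurs. */
    }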
The programming of the FE–GA involves defining a mapping onto the operation cell array, called a thread, and a sequence of multiple threads, as depicted in Fig. 3.52. The FE–GA has a dedicated assembly-like programming language called the Flexible-Engine Description Language, or FDL. There are two types of FDL: Thread-FDL (T-FDL), which describes a cell-array mapping, and Sequence-FDL (S-FDL), which describes a sequence of threads. Users first create both T-FDL and S-FDL files with the FE–GA editor and convert them into binary form using the FE–GA tools, as shown in Fig. 3.55. The tool chain includes an editor, a constraint checker, an assembler, and a linker. The editor is a graphical tool with which users set up the function of each operation cell, the data allocation in the local memory, and the sequence definition of the threads. It includes a simulator with which users can verify their FE–GA programs, and it can also generate the FDL files.
The constraint checker verifies both types of FDL file against the grammar and the specifications and generates verified FDL files. The assembler then converts the verified S-FDL and T-FDL files into a sequence object and a thread object, respectively. Finally, the FDL linker combines both object files into a linked object with section information that includes the address at which it is placed in memory. The linker also compresses the object by merging instructions common to several operation cells so that the object can be placed in the configuration buffer of the FE–GA.
The software development process for a system with an FE–GA is shown in Fig. 3.56. The process is rather involved because it aims at optimal performance. Users first create a reference program implementing the target application, which is executable on a CPU. Then, the FE–GA-executable parts of the program are determined by considering whether those parts can be mapped onto the operation array of the FE–GA in both a parallel and a pipelined manner. Because the operation resources, such as the operation cells and the local memory, are limited, users need to divide an FE–GA-executable part into multiple threads. Next, the data flows in each thread are extracted to create a data flow graph (DFG). The data placement on the multiple banks of the local memory is also studied so that the data are provided to the operation cells continuously and in parallel. Users then program the functions of the operation cells and the intercell wiring with the FE–GA editor, taking into account the timing of data arrival at each cell according to the DFG and the data placement. The program is debugged using the FE–GA simulator in the next step.
Then the object is generated using the assembler and the linker. Since the FE–GA
is managed by CPUs, users need to create FE–GA control codes and attach them
to the reference program. Finally, the combined program for CPUs and FE–GA is
debugged on the integrated CPU and FE–GA simulator or on a real chip.
The fast Fourier transform (FFT), a common algorithm in media processing, was implemented on the FE–GA for evaluation. This subsection describes the details of the implementation. The algorithm used for the mapping and evaluation was a radix-2 decimation-in-time FFT, the most common form of the Cooley–Tukey algorithm [53, 54]. We used this algorithm because the radix-2 FFT is simple, and the decimation-in-time form multiplies the data by the twiddle factor in the first part of the butterfly calculation. This avoids having to route the twiddle factor into the middle of the cell array and therefore preserves the cell resources for the remaining fixed-point processing. The format of the input and output data is 16-bit fixed point (Q-15 format).
The FFT is calculated by repeating the butterfly calculation as follows (a decimation-in-time algorithm):
a = x + y × W,  b = x − y × W,
where a, b, x, and y are complex data and W is a twiddle factor. The equation can be divided into a real part and an imaginary part as follows:
ar = xr + yr × Wr − yi × Wi,  ai = xi + yr × Wi + yi × Wr,
br = xr − yr × Wr + yi × Wi,  bi = xi − yr × Wi − yi × Wr.
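For reference, a minimal C sketch of this butterfly in Q-15 fixed point is shown below; it assumes plain C arithmetic without saturation and is not the FE–GA cell mapping. The ">> 15" step corresponds to taking the upper 16 bits of the 32-bit product and shifting them 1 bit to the left, as described next.

    #include <stdint.h>

    typedef struct { int16_t re, im; } cplx16_t;

    /* Q15 multiply: 32-bit product, keep the upper bits (overflow handling omitted). */
    static inline int16_t q15_mul(int16_t a, int16_t b)
    {
        return (int16_t)(((int32_t)a * b) >> 15);
    }

    /* Radix-2 decimation-in-time butterfly: a = x + y*W, b = x - y*W. */
    static void butterfly(cplx16_t x, cplx16_t y, cplx16_t w,
                          cplx16_t *a, cplx16_t *b)
    {
        int16_t tr = q15_mul(y.re, w.re) - q15_mul(y.im, w.im); /* Re(y*W) */
        int16_t ti = q15_mul(y.re, w.im) + q15_mul(y.im, w.re); /* Im(y*W) */
        a->re = x.re + tr;  a->im = x.im + ti;
        b->re = x.re - tr;  b->im = x.im - ti;
    }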
Figure 3.57a shows a data flow graph of the above equations. The circled "×," "<<," "+," and "−" indicate multiplication, 1-bit left shift, addition, and subtraction, respectively. Because the data are 16-bit fixed point, the upper 16 bits of each product with W must be shifted 1 bit to the left. Figure 3.57b depicts a mapping of the data flow graph onto the 4 × 4 cells in the upper half of the operation cell array. A rectangle in each cell indicates an operation, and an arrow between rectangles shows 16-bit data (a dashed arrow is 1-bit data). In the rectangles, "D" inserts a 1-cycle delay, "ROTL" shifts the data 1 bit to the left and inserts an input 1-bit value into the LSB, "MSB" outputs the MSB as 1-bit data, which is realized by an "add with carry" operation, and "~MSB" outputs the complement of the MSB as 1-bit data. Note that the MLT cells, normally placed in the second row of the cell array, are placed in the first row in order to map the FFT. After the multiplications in the first row, the MSB of the lower 16-bit data is extracted, and the upper 16-bit datum, after a 1-cycle delay, is shifted 1 bit to the left with the MSB of the lower data attached to the LSB.
Fig. 3.57 Data flow graph and mapping of FFT butterfly calculation
After subtraction and addition are applied, the calculation results are obtained and
stored in the local memory.
The FFT algorithm is modified to obtain identical flow graphs at each stage. This makes it possible to reduce the number of configurations and to avoid the port-number constraint of the local memory. Figure 3.58 shows both the original flow (a) and the modified flow (b) of an 8-point FFT. A square in each stage shows the twiddle factor W_a^b = exp(−2πib/a), where "a" is positioned higher and "b" lower in the squares.
Two butterfly calculations can be mapped and executed on the cell array.
Therefore, for efficient use of the local memory, one butterfly calculation is applied
to the data with even numbers, and the other butterfly is applied to those with odd
numbers. In other words, the input data are divided into two groups, with even and odd indices, and stored in different banks of the local memory (banks 0 and 2 for the even indices, banks 1 and 3 for the odd indices) (Fig. 3.59). Also, the two different inputs to the butterfly, x and y, are stored in the first half and the latter half of the same bank, respectively. Since each bank is a dual-port memory, these two data items can be read simultaneously, and they are provided to two operation cells at the same time by the crossbar's multicast operation. The operation results are stored in different banks (banks 4–7) of the local memory.
Since the FFT algorithm is modified to obtain an identical mapping of the butterfly calculation, the total number of threads depends on the cell configurations related to data input and output. Figure 3.60 describes the defined threads and their sequence for the 1,024-point FFT. The 1,024-point FFT has ten stages of butterfly calculations. The configuration of the cell array, including the ALU and MLT cells, is common to all the stages. The input data and output data are divided so that they can be stored in the ten banks of the local memory. One stage places its output data in a bank of the local memory, and the next stage uses the output data in that bank as its input data. In other words, two types of configurations for the LS cells (L1 and L2 in the figure) are defined and used alternately. The twiddle factors are placed in
different banks, and their location in the banks differs at each FFT stage. Therefore, the total number of threads is the same as the number of FFT stages, as illustrated in Fig. 3.60.
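The resulting sequence can be pictured as the following loop; the run_stage() helper and the numbering of the LS configurations are assumptions for illustration only.

    extern void run_stage(int ls_config, int stage);  /* hypothetical helper */

    /* One thread per stage; the ALU/MLT configuration is shared, and the two
     * LS-cell configurations (L1, L2) alternate so that the output banks of
     * one stage become the input banks of the next. */
    void fft_1024_sequence(void)
    {
        for (int stage = 0; stage < 10; stage++) {
            int ls_config = (stage % 2 == 0) ? 1 : 2;   /* 1 = L1, 2 = L2 */
            run_stage(ls_config, stage);
        }
    }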
The performance of 1,024-point and 2,048-point FFTs on the FE–GA was evaluated. For this evaluation, all the data, including the input data and twiddle factors, were placed in the local memory, and the configurations were stored in the configuration buffer. The evaluated execution cycles therefore include the operations, data loads and stores from/to the local memory, thread switching, and configuration preloading to the operation cells. Note that the cycles exclude the bit-reversing process. Table 3.14 gives the evaluation results. The operations account for most of the total cycles, and there are relatively few overhead cycles, which consist of the initial delay (data and configuration load) and thread switching.
3.3 Matrix Engine (MX)

(Fig. 3.61: Overview of the MX-1: an MX processor controller (MPC) with an instruction RAM supplies control signals to the MX processor array (MPA), which contains 2,048 PE entries connected to the data register arrays through the horizontal channel (H-ch) and vertical channel (V-ch), together with an I/O interface.)
3.3.1 MX-1
Applications such as image processing and recognition employed in portable devices demand processing capabilities of up to several tens of GOPS, which is far beyond the capabilities of conventional CPUs or DSPs. In these areas requiring high performance, hard-wired logic LSIs are commonly used to realize both high performance and low power dissipation. However, hard-wired solutions have problems in cost efficiency because the algorithms for media processing are improved at short intervals. Therefore, powerful and programmable devices are desired for these multimedia applications. Against this background, our motivation is to improve the energy efficiency and flexibility of SIMD architectures while realizing performance that is sufficient for multimedia applications.
Figure 3.61 shows an overview of the MX-1 architecture [55, 56]. The MX-1 is the first version of the MX core. It consists of a matrix processor array (MPA), a matrix processor controller (MPC), which is a dedicated controller with an instruction memory, and an I/O interface for data input and output. The main components of the MPA are two planes of data register array matrices and 2,048 fine-grained (2-bit) processing elements (PEs). The data register array matrices are composed of single-port SRAM cells to enhance the area efficiency. Each PE adopts a 2-bit-grained structure, which includes two full adders and some logic circuits, to minimize the PE size.
As shown in Fig. 3.61, there are two directional channels for data processing. One is the horizontal channel (H-ch), which connects the data register array matrices and the PEs. The other is the vertical channel (V-ch), which realizes flexible data transfer between the entries.
(Fig. 3.62: H-ch operation example: in each cycle, the PE reads 2-bit slices of the two operands selected by the MPC's pointers, adds them in its ALU, and writes the 2-bit result back to the data register array.)
(Fig. 3.63: Read–modify–write operation of the SRAM: (a) operation flow from read-out through the sense amplifier and RS latch to execution (modify) in the PE and write-back through the write driver; (b) timing diagram in which the SRAM word line is activated every clock cycle.)
Through the H-ch, each PE can thus execute a 2-bit addition per cycle. With these techniques, if 2,048 sets of 16-bit additions are executed with the 2,048 entries in parallel, the MX-1 can process all the data in ten cycles (including the overhead of the pipelined operation); therefore, a set of operands stored in one entry is processed in approximately 0.005 cycle (ten cycles/2,048 entries). Note that the practical implementation of the PEs and the double-sided memory is completely symmetrical: the temporary registers and PEs have connections to both sides of the data register array (the required selectors are not shown in Fig. 3.62). The design concept of the H-ch proposed here significantly enhances the processing throughput while maintaining the area efficiency.
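The following C sketch models this behavior: the outer loop over entries represents what the hardware does in parallel, and the inner loop over 2-bit slices corresponds to the H-ch cycles. It is a behavioral illustration only, not the MX-1 circuit.

    #include <stdint.h>

    #define ENTRIES 2048

    /* 16-bit addition performed 2 bits at a time, for all entries "in parallel". */
    void simd_add16(const uint16_t a[ENTRIES], const uint16_t b[ENTRIES],
                    uint16_t sum[ENTRIES])
    {
        for (int e = 0; e < ENTRIES; e++) {         /* conceptually parallel PEs  */
            unsigned carry = 0;
            uint16_t r = 0;
            for (int slice = 0; slice < 8; slice++) { /* one H-ch cycle per slice */
                unsigned sa = (a[e] >> (2 * slice)) & 0x3u;
                unsigned sb = (b[e] >> (2 * slice)) & 0x3u;
                unsigned s  = sa + sb + carry;
                r |= (uint16_t)((s & 0x3u) << (2 * slice));
                carry = s >> 2;                      /* carry kept in the PE      */
            }
            sum[e] = r;
        }
    }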
Figure 3.63 shows the proposed design technique employed in this work, which is based on the read–modify–write (RMW) operation of the SRAM. The main feature of this design is that the sequential operations required for the H-ch processing, namely readout, execution, and write-back, can be completed in one clock cycle. An asynchronous RS latch located next to the sense amplifier holds the readout data until the write-back operation is completed. As shown in the timing diagram of Fig. 3.63b, the word line of the SRAM can be activated every clock cycle, which yields a high data processing throughput. In addition, by adopting the proposed design methodology, the size of the PE can be minimized by eliminating unnecessary pipeline registers. Although the proposed scheme reduces the maximum operating frequency, portable multimedia devices do not require a high-frequency system clock, and reducing the number of clock cycles required for data processing is more important for building a high-performance engine.
(Figures: intra-bank and inter-bank V-ch structures: the vertical channel connects PEs within and across SRAM banks of 64 entries.)
The V-ch networks are implemented mainly with wiring in the metal layers, and the V-ch circuit shown in Fig. 3.66 is simple and small enough; therefore, these powerful networks are realized with negligibly small silicon area overhead.
3.3.1.2 PE Design
(Figure: PE circuit of the MX-1: a Booth encoder, a valid register (V), D/F/N registers, temporary registers (XH, X), a carry register (C), two full adders (FA), a shift-compensate register (S), and left/right multiplexers connecting to the neighboring PEs over 2-bit data paths.)
Each PE contains a Booth encoder, which operates according to Table 3.16. The D/F/N registers, which store the encoded results of the Booth encoder, control the way the partial products are generated: D selects whether the multiplicand is shifted by 1 bit, F selects whether the multiplicand is inverted for complementing, and N selects whether the partial product is valid. In addition, S is a shift-compensation register that functions when the D register is set to 1, and the V register validates the function of each PE. Figure 3.68 shows the proposed operation flow of a MAC operation. First, 2 bits of the multiplier are loaded into the temporary registers of the PE, XH and X, and the values of the F/D/N registers are fixed by Booth's encoding. Next, 2 bits of the multiplicand are loaded into the XH and X registers, and 2 bits of the accumulator region are added to the data in the XH, X, and S registers under the conditions given by the D/F/N registers. These sequences are realized by programming the
microprograms stored in the instruction RAM of the controller. With the proposed circuit configuration, a 16-bit fixed-point signed MAC operation costs about 100 cycles in each PE, which is 56% fewer cycles than a non-Booth circuit configuration. The MAC cycle cost of 100 cycles is normalized to 0.05 cycle per MAC operation because the MX-1 executes 2,048 MAC operations in parallel. In this way, fast MAC operations based on Booth's algorithm can be realized even with the 2-bit-grained PE configuration of the MX-1.
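For illustration, a generic radix-4 (modified) Booth recoding that produces flags with the same roles as D, F, and N is sketched below; the exact MX-1 encoding is defined by Table 3.16 and may differ in detail.

    #include <stdint.h>

    /* For multiplier bit pair (b1, b0) and the previous bit b_prev, the Booth
     * digit is d = -2*b1 + b0 + b_prev, in the range -2..+2. */
    typedef struct { int d, f, n; } booth_flags_t;

    static booth_flags_t booth_recode(unsigned b1, unsigned b0, unsigned b_prev)
    {
        int d = -2 * (int)b1 + (int)b0 + (int)b_prev;
        booth_flags_t fl;
        fl.n = (d != 0);            /* N: partial product is valid (nonzero digit)   */
        fl.f = (d < 0);             /* F: multiplicand is complemented (negative)    */
        fl.d = (d == 2 || d == -2); /* D: multiplicand is shifted left by 1 (|d|==2) */
        return fl;
    }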
Figure 3.69 shows the micrograph of the MX-1 core, and the performance of
MX-1 is summarized in Table 3.17.
3.3.2 MX-2
The required performance of image processing keeps increasing; therefore, the second version of the MX core (MX-2), whose architecture is improved from that of the MX-1, was developed [63]. The main technologies for the enhancement are as follows:
1. Expanding the processing elements from a 2-bit grain to a 4-bit grain
2. Improving the pipeline architecture of the MX controller
3. Equipping a double frequency mode
Hereafter, these technologies are described in detail.
Figure 3.70 shows a block diagram of the PE of the MX-2. The PE contains a 4-bit temporary register (XREG) and a 4-bit-grained ALU. The XREG loads data from the data registers through the horizontal channels (H-ch0, H-ch1) or the vertical channel (V-ch). The PE loads data from the data registers through H-ch0 or H-ch1 and operates
with data from the XREG and stores the result back to the data register in one cycle in the same way as the MX-1. The ALU and XREG operate in parallel when they access different banks of the data register. The ALU contains two Booth encoders, an adder, and their peripherals. In a MAC operation, the multiplicand data in the XREG are combined with the outputs of the two Booth encoders, and the results are summed in the adder. Therefore, two 4-bit partial products are handled in one cycle, and the 4-bit ALU calculates partial products four times faster than the 2-bit ALU of the MX-1, which would need to repeat 2-bit calculations four times.
In addition to the improvement of the PEs in the MPA, the architecture of the MPC is enhanced to extract the maximum parallel processing performance of the MX processor. Figure 3.71 shows the block diagram of the MPC. It basically consists of the instruction RAM, the control registers, and the control logic. The control logic decodes the microinstructions stored in the instruction RAM and generates the control commands for the control registers in the MPC and for the PEs and SRAMs in the MPA. The control registers store the data for controlling the MPA, such as address pointers. With this architecture, when the MPC is occupied with maintaining its own registers, for example setting immediate data into the control registers, the operation rate of the MPA degrades. To avoid this degradation, FIFO circuits were newly added to the MPC.
Figure 3.72 shows an example operation of the controller and the MPA when an application program is executed. A1–A4 are instructions that operate the MPA, and C1–C3 are instructions that operate only the controller, leaving the MPA idle. The MPA needs multiple cycles for the A1–A3 instructions; such multicycle execution happens when the same bank of the data register is accessed by both the XREG and the ALU in a PE operation. Without the FIFO, the controller must wait for the completion of the MPA operation, and the MPA must also stay idle until the A4 instruction in the controller is completed. These "WAIT" and "IDLE" cycles are absorbed by the FIFO. While the MPA executes A1, the controller can operate without waiting, and the next instructions (A2–A3) are stored in the FIFO.
The instructions in the FIFO are executed by the MPA in parallel with the controller operations C2–A4.
In addition to the above technologies, the MX-2 is equipped with a double frequency mode, which enhances the maximum operating frequency of the MX-2 and thus its performance. The high throughput of the ALU operations in the MX core is realized by the read–modify–write (RMW) operation of the SRAM [55]. The RMW operation also realizes low-power operation because a set of read and write operations of the SRAM activates the word line only once, so it is useful for power-efficient ALU operation. However, the operating frequency of the MX core is limited by this RMW operation. The MX-2 core therefore has a normal frequency (NF) mode and a double frequency (DF) mode. In the NF mode, the RMW operation is executed in one cycle. In the DF mode, the RMW operation is divided into two cycles, and the MX-2 can be operated at a higher frequency. This mode is used when high performance is required rather than low power consumption. The operating cycles of an 8-bit addition and an 8-bit MAC increase to 6 cycles and 18 cycles, respectively, in the DF mode. In image processing applications, the operating cycles in the DF mode increase by up to 40% compared with the NF mode, whereas the maximum operating frequency of the MX-2 can be almost doubled. Therefore, the processing performance for real applications can be improved with the DF mode.
Figure 3.73 compares the performance of the MX-1 and MX-2 when various application programs are executed. To clarify the effect of each improvement, Case A, which uses the 4-bit PE with the conventional MPC, is added to the graph. An improvement of about 20–40% is confirmed with the 4-bit-grained PE alone, and a further improvement of about 20–40% is realized with the improved MPC.
Figure 3.74 shows the micrograph of the MX-2 core, and the performance of the MX-2 is summarized in Table 3.18.
(Fig. 3.73: Normalized operating cycles for an FFT (2,048-point and 8-point), a 3×3 convolution filter, a 3×3 median filter, optical flow, a look-up table, and the Harris operator.)

3.4 Video Processing Unit
This section introduces the architecture and circuit techniques for video encoding/
decoding processors. This video codec processor is embedded in the heterogeneous
multicore chip as a special-purpose processor (SPP), which is described in Chap. 2.
3.4.1 Introduction
Consumer audiovisual devices such as digital video cameras, mobile handsets, and
home entertainment equipment have become major drivers for raising the perfor-
mance and lowering the power consumption of signal processing circuits. Market
trends in the field of consumer video demand larger picture sizes, higher bit rates,
and more complex video processing. In video coding, the wide range of consumer
applications requires the ability to handle video resolutions across the range from
standard definition (SD, i.e., 720 pixels by 480 lines) to full high definition (full HD,
i.e., 1,920 pixels by 1,080 lines) encoded in multiple video coding standards such as
H.264, MPEG-2, MPEG-4, and VC-1. H.264 [64] is one of the latest standards for
motion-estimation-based codecs. It contains a number of new features [65, 66] that
allow it to compress video much more effectively than older standards, but it requires
more processing power. The availability of context-adaptive binary arithmetic coding
(CABAC) is considered one of the primary advantages of the H.264 encoding
scheme, since it provides more efficient data compression than other entropy encoding
schemes, including context-adaptive variable-length coding (CAVLC). However, it
also requires considerably more processing. The trade-off between high performance and low power consumption is a key focus of video codec design for advanced embedded systems, especially for mobile application processors [28, 67–69].
Many video coding processors have been proposed. Generally, these codecs use one of two approaches. The first approach implements video encoding and decoding in software on homogeneous high-performance processor cores [67, 68]. This approach, which handles multiple video coding standards by changing the software or firmware, suffers from large power consumption and insufficient performance. A dual-
core DSP operating at 216 MHz [67] offers up to SD video, and an eight-core media
processor operating at 324 MHz [68] supports high definition (HD, i.e., 1,280 pixels
by 720 lines) at most. The second approach aims to develop dedicated video coding
hardware. While dedicated circuits can minimize power consumption, the dedicated
encoders and decoders described in previous reports [70–73] have difficulty in per-
forming all of the media processing that is indispensable for an embedded device
such as a modern smart phone [28, 67–69]. In addition, few of these video codecs
can handle video streams at more than 20 megabits per second (Mbps), so they have
difficulty in supporting full HD high-quality video.
In response to these issues, a video processing unit (VPU) based on a heterogeneous multicore processor has been designed to achieve both high performance and low power consumption for multiple video formats. In full HD video processing, dynamic current is still the dominant form of power consumption in low-power CMOS technology. Therefore, the focus was on achieving lower dynamic power in the video codec design by exploiting the characteristics of video signal processing.
Subsection 3.4.2 describes an overview of the video codec architecture. A two-
domain (stream-rate and pixel-rate) processing approach raises the performance of
both stream and image processing units for a given operating frequency. In the
image-processing unit, a sophisticated dual macroblock-level pipeline processing
with a shift-register-based ring bus is introduced. This circuit is simple yet provides
high throughput and a reasonable latency for video coding. Subsection 3.4.3
describes the stream processor and media processor architecture. The media proces-
sor is applied to transformations, subpixel motion compensation, and an in-loop
deblocking filter. Including the single stream processor, a total of seven application-
specific processors are integrated on the proposed video codec. Subsection 3.4.4
discusses the results of implementing the VPU from the viewpoints of performance
and power consumption. Subsection 3.4.5 concludes with a brief summary.
Figure 3.75 shows the basic architecture of the VPU, which is based on a heterogeneous multicore approach whose concept is the same as that of the heterogeneous multicore chip for embedded systems described in Chap. 2. To satisfy both the high-performance and low-power requirements of advanced embedded systems with greater flexibility, it is necessary to exploit parallel processing in the video processing unit by taking advantage of the data dependencies in the video coding process.
Several low-power special-purpose processor (SPP) cores, several high-performance
application-specific hard-wired circuits (HWC), shared memory, and a global data
transfer unit (DTU) are embedded on a VPU. There are two types of SPPs, a stream
processor and a media processor. Each processing core includes local memories
(LM) and a local DTU. These are embedded in the processing core to achieve paral-
lel execution of internal operation in the core and data transfer operations between
cores and memories. Each core processes the data on its LM, and the DTU simulta-
neously executes memory-to-memory data transfer between cores, shared memory,
or off-chip memory via a global DTU. The dynamic clock controller (DCC), which
is connected to each core, controls the clock supply of each core independently and
reduces the dynamic power consumption of the VPU. The shared memory is a
middle-sized on-chip memory which is used as a line buffer in vertical deblocking
processing or as a reference image buffer for motion estimation/compensation. Each
core is connected to the on-chip interconnect called the shift-register-based bus
(SBUS), which is suitable for block-level pipeline processing. Frequency and voltage
control (FVC) is applied to the top level of the video processing unit only.
Figure 3.76 is a block diagram of the video processing unit, which is a heteroge-
neous multicore processing unit that applies our architecture model shown in
Fig. 3.75.
The architecture consists of a stream-rate domain and a pixel-rate domain [74]. These units operate independently in a picture-level pipeline manner to achieve full HD performance while lowering the operating frequency.
Fig. 3.76 Block diagram of video processing unit. The stream-rate domain and pixel-rate domain can access the intermediate stream via the global DMAC
At a given time, this video
codec performs either encoding or decoding. In decoding mode, the stream processing
unit (SPU) reads bit streams from off-chip memory and outputs a transformed inter-
mediate stream. The image processing units (IPU) read the intermediate streams
produced by the stream processing unit and generate the final decoded image.
The space for the intermediate streams in the off-chip memory serves as a buffer
between the stream-rate domain and the pixel-rate domain. Variable-length coding
inherently lacks fixed processing times, and CABAC processing times vary particularly widely. Up to 384 transform-coefficient symbols can be defined in a macroblock, but the maximum number of bits changes according to the probability of a syntax element in the given context. If the stream processing unit takes more time to process a frame than is available at the frame rate, the operating frequency must be raised.
Figure 3.77 shows an example of the decoding time and the number of bits for
each picture in an H.264 40-Mbps video stream. As the figure shows, when the
number of bits in the pictures around picture #30 is large, the stream-rate domain’s
decoding time is longer than that of the pixel-rate domain. When the number of bits
assigned to the pictures around picture #5 is small, the stream-rate domain’s decoding
time is shorter than that of the pixel-rate domain.
The intermediate stream buffer fills the performance gap between the stream
processing unit and the image processing unit in the picture-level pipeline.
Figure 3.78 is the stream and pixel decoding time chart in the picture-level pipeline.
The time slot is defined as the decoding time of image processing in the pixel-rate
Fig. 3.77 Stream processing unit’s decoding time for an H.264 40-Mbps full HD video stream
running at 30 frames per second (fps)
Fig. 3.78 Parallel operation in picture-level pipeline in stream-rate domain and pixel-rate domain
domain. Except for the stream decoding S0 for picture 0, stream and pixel decoding are processed in parallel. In Fig. 3.78a, that is, without an intermediate stream buffer, if the decoding time of S2 is longer than the decoding time of P1, the start of P2 is delayed. To shorten the S2 decoding time, we could increase the operating frequency; however, this would increase the power consumption of the stream processing unit in the stream-rate domain. To meet the performance requirements without increasing the operating frequency, we introduce an intermediate stream buffer. With the intermediate stream buffering depicted in Fig. 3.78b, the outputs of S0 and S1 are stored in the intermediate stream buffer. As this time chart shows, S1 and S2 can start processing independently of the time slot, and S2 finishes before the end of P1. Therefore, the start of P2 is not delayed from the defined time slot, and with intermediate stream buffering every picture can start its pixel decoding at its time slot. Thus, the two-domain structure with the intermediate stream buffer can handle all pictures at the average frequency, and this helps to keep the required operating frequency, and hence the power consumption, low.
The intermediate stream format has two segments, one in fixed-length and the
other in variable-length coding, and the two parts are processed per symbol (not
per bit). The fixed-length part consists of information on the macroblocks, including
the slice boundaries, coded block pattern, quantization scale parameter, and several
other items. The variable-length part of the intermediate stream contains the other
syntax elements (motion vectors and transform coefficients) in exponential-Golomb
coding, which is a common, simple, and highly structured technique.
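As a minimal sketch of this coding, an unsigned exponential-Golomb encoder can be written as follows; the put_bits() bit writer is a hypothetical helper, not part of the VPU specification.

    #include <stdint.h>

    extern void put_bits(uint32_t value, int nbits);   /* hypothetical bit writer */

    /* Unsigned exp-Golomb code for v: (nbits - 1) leading zeros followed by
     * the binary representation of (v + 1), where nbits is its bit length. */
    static void put_ue_golomb(uint32_t v)
    {
        uint32_t code = v + 1;
        int nbits = 0;
        for (uint32_t t = code; t != 0; t >>= 1)
            nbits++;
        put_bits(0, nbits - 1);           /* leading zeros        */
        put_bits(code, nbits);            /* (v + 1) in binary    */
    }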
We evaluated the memory bandwidth between the stream and pixel domains. Although the stream processing unit and image processing units access the intermediate stream in the external synchronous DRAM (SDRAM), the required memory bandwidth is less than would be required by the conventional method (directly storing 16-bit-per-pixel transform coefficients). Figure 3.79 plots the bit rate of the intermediate stream against that of the original stream for individual pictures of the H.264 conformance-test streams [75], other than those for I_PCM. The ratios are around 1.6 and 1.5 for CABAC and CAVLC, respectively. Although a portion of the intermediate stream is in fixed-length coding, the expansion was within 1.6 in the case of CABAC. Relative to the conventional method, the intermediate stream corresponds to a 95% reduction in the required memory capacity and memory bandwidth for the processing of a 40-Mbps full HD stream (64 Mbps for the intermediate stream and 90 Mpixels/s for the transform coefficients). Table 3.19 lists the bandwidths of all DMA channels in the video decoding process. The ratio of the bandwidth for the intermediate stream buffer is only 4.8% and is small even in the worst case. Therefore, the use of the stream buffer has only a small impact on power consumption.
As shown in Fig. 3.76, all submodules of the video codec are connected in a ring
structure by a bidirectional 64-bit shift-register-based bus (SBUS). Figure 3.80
shows the architecture of the SBUS and the data flow in the macroblock-pipeline
stages. The clockwise SBUS is the path for data readout from the external SDRAM.
The counterclockwise SBUS is used for intermodule data transfer to the next stage of the macroblock pipeline.
Fig. 3.79 Bit rate increase of the intermediate stream: (a) CABAC, fitted by y = 1.5871x + 0.0045, and (b) CAVLC, fitted by y = 1.4946x + 0.0093, where y is the intermediate stream bit rate and x is the original stream bit rate (Mbps)
Fig. 3.80 Shift-register-based bus network and depiction of how it works in macroblock-level
pipeline processing
The data transfer latency of the SBUS depends on the number of stages from the source module to the destination module. For a video coding process, however, the major form of data transfer is to the next stages of the macroblock pipeline. Transactions between individual modules and the line memory (L-MEM) are the only exception, but we avoid this problem by scheduling them in a time slot occupying the first few tens of clock cycles before the processing of each macroblock begins. This keeps the latency of the SBUS from affecting the performance of the codec. The SBUS architecture provides an easy way to connect an additional image processing unit for larger screens or a higher frame rate without having to increase the bandwidth, as would be required with a conventional bus. The SBUS thus provides excellent video-size scalability.
The two image processing units work cooperatively as two macroblock-based
pipelines. Processing proceeds as shown in Fig. 3.81. Most state-of-the-art video-
coding standards, including H.264, utilize context correlation between adjacent
macroblocks. For example, macroblock X is coded by using the context information
from macroblocks A, B, C, or D in Fig. 3.81. We can take advantage of this charac-
teristic in the sophisticated dual macroblock-pipeline architecture. The delay and
parallelization for the two image processing units (IPU #0, #1) that handle the
respective pipelines are controlled accordingly. As shown in Fig. 3.82, context
information processed by IPU #1 is directly transferred to IPU #0, and context
information from IPU #0 is transferred to L-MEM. The two macroblock lines share
L-MEM. This halves the requirement for L-MEM to store context information.
Figure 3.83 shows how the clock domains are divided at the submodule level. The power management architecture consists of multiple domains, which are controlled independently by exploiting the processing characteristics of the video codec. A new layer of control is set up between the static module stop controlled by software and the bit-grained clock gating. We defined a clock domain for each submodule of the video codec. Each clock domain corresponds to a macroblock processing pipeline stage and is also a unit of hierarchical synthesis and layout. The clock signals in the respective domains are switched in accordance with the time reference defined by the macroblock processing.
Figure 3.84 illustrates the dynamic power management in the pipeline for macroblock processing. Data are processed in a macroblock-based pipeline manner. The time reference is defined as 1,200 clock cycles at 162 MHz. Variable-length coding inherently lacks a fixed processing time, so the clock supply for each block is independently cut off as soon as it finishes its required processing. This scheme reduces the amount of power consumed by the clock drivers. Dynamic power consumption is reduced by 27% by using this technique [76].
The DMA reads and DMA writes depicted in Fig. 3.84 could cause a delay in the macroblock-level pipeline processing. To prevent this, we have to improve the efficiency of the image-data transfer, especially of the reference-image reads. To achieve an efficient 2D data transfer, an address transformation scheme is introduced in the memory management unit for the VPU and other media IPs in order to avoid page misses in the external SDRAM.
Most video codec standards require small, sub-macroblock-level 2D data transfers for reference reads in the decoding mode. Without any particular technique for such transfers, a page miss occurs at every line. The penalty for a page miss, around ten cycles or more, consumes a high proportion of the memory bandwidth, so efficiency in the 2D data transfer is critically important. In embedded systems such as mobile applications, in which various kinds of software are executed, it is not feasible to adopt a particular form of memory allocation such as a bank-interleave operation for each pixel line.
To avoid page misses, tile-linear address translation (TLAT) [76] is introduced
between the video codec and the on-chip interconnect. Figure 3.85a shows the
TLAT circuits and memory allocation in the virtual address (VADR) and physical
address (PADR) space. The lower-order bits of the VADR issued by the video codec
are rearranged into the corresponding PADR. As shown in Fig. 3.85b, 32 × 32 tile
access from the video codec is mapped to linear addressing in the PADR space.
When the lower address of the VADR is defined as VADR [m: 0], the PADR is
described as follows:
PADR[m : TB+VB+HB] = VADR[m : TB+VB+HB];
PADR[TB+VB+HB-1 : TB+VB] = VADR[TB+HB-1 : TB];
PADR[TB+VB-1 : TB] = VADR[TB+HB+VB-1 : TB+HB];
PADR[TB-1 : 0] = VADR[TB-1 : 0].
In these equations, TB, HB, and VB are calculated by the following equations:
TB = log2(Blk_h),
HB = log2(stride) - TB,
VB = log2(Blk_v);
stride, Blk_h, and Blk_v should be powers of two.
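Written as plain C bit manipulation, the translation defined by these equations looks as follows; the function is an illustrative sketch, and the example parameter values are assumptions based on the 32 × 32 tile and 2,048-byte stride shown in Fig. 3.85.

    #include <stdint.h>

    /* Tile-linear address translation: rearrange the low-order VADR bits. */
    static uint32_t tlat(uint32_t vadr, unsigned TB, unsigned HB, unsigned VB)
    {
        uint32_t low   = vadr & ((1u << TB) - 1);                 /* PADR[TB-1:0]            */
        uint32_t h     = (vadr >> TB) & ((1u << HB) - 1);         /* VADR[TB+HB-1:TB]        */
        uint32_t v     = (vadr >> (TB + HB)) & ((1u << VB) - 1);  /* VADR[TB+HB+VB-1:TB+HB]  */
        uint32_t upper = vadr >> (TB + HB + VB);                  /* unchanged upper bits    */

        return (upper << (TB + VB + HB)) | (h << (TB + VB)) | (v << TB) | low;
    }

    /* Example (assumption): a 32-byte x 32-line tile with a 2,048-byte stride
     * gives TB = 5, VB = 5, and HB = log2(2048) - 5 = 6. */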
With this address translation scheme, the codec performance improved by a maximum of 47% for bipredictive prediction pictures (B-pictures), and the power consumption of the video codec core was reduced by 16% [76]. This scheme is also well suited to image rotation and block-based filter processing.
To provide flexibility for handling multiple video coding standards, a stream processor and six media processors are implemented in the video processing unit.
(Fig. 3.85: (a) the TLAT circuits between the media IPs/VPU and the off-chip memory, including address-space judgment; (b) a memory-allocation example in which 32-byte × 32-line tiles in the tile-based VADR space are mapped to linear addressing in the PADR space with a 2,048-byte stride and 1,024-byte pages.)
Fig. 3.86 Stream-processing unit architecture. The stream processor and CABAC accelerator are
connected to the internal bus, allowing them to access the intermediate stream buffer in an external
memory via the global DMAC
Figure 3.86 shows the stream processing unit architecture, which consists of a two-
way very long instruction word (VLIW) stream processor and an H.264 context-
adaptive binary arithmetic coding (CABAC) accelerator with 3,220-bit context
flip-flops. These parts and a common SBUS interface are connected to the internal
bus so that each part can access the intermediate stream buffer in an external
SDRAM via the global DMAC. The stream processing unit can support various
video coding standards by changing the firmware, which consists of the decoder or
encoder program and the table data. The video codec loads the firmware from the
external SDRAM to the stream processing unit’s internal memories before the unit
starts decoding or encoding video streams. The program in the firmware is loaded
to the stream processor’s instruction memory, and the table data are loaded to the
table memory. The CABAC accelerator, which the stream processor controls,
includes context flip-flops to achieve high performance.
Figure 3.87 shows the architecture of the proposed stream processor (STX). We
employ the 32-bit 2-way VLIW, 3-stage pipelined architecture as the stream proces-
sor architecture [77].
Stream encoding/decoding is divided into variable length coding, syntax analysis,
and context calculation. Variable length coding is further divided into coding/
decoding with table data (table encoding/decoding) and Golomb encoding/decoding,
which is employed in H.264. Table encoding/decoding in various video coding stan-
dards can be easily developed by changing the data in the table memory in the STX.
Also, the Golomb encoding/decoding process does not change from one video coding standard to another. Therefore, the variable-length coding unit in the STX was developed as dedicated variable-length coding hardware. In contrast, syntax analysis and context calculation have complicated data flows and vary with each video coding standard; thus, they are implemented in the firmware for each standard. These processes also contain many branch operations. In general, a VLIW architecture is not good at handling branch operations, and branch-stall cycles increase in proportion to the number of pipeline stages. Thus, the number of pipeline stages in the STX is kept as small as possible.
The STX also has an out-of-order execution feature. If the instruction decoder in the STX judges that there is no data dependency between a variable-length coding instruction and the following instructions, the pipeline executes the next instruction even though the execution of the variable-length coding instruction has not finished. This feature enables the symbol-level processing of variable-length coding and of syntax analysis/context calculation to be pipelined. This pipeline processing is effective for improving the performance when processing stream data that have high bit rates and include a lot of residual data.
When calculating the context for a symbol in a video stream, various previously
decoded symbols are required. For efficient access to these symbols, they are located
in the register file. Before designing the STX, we estimated the number of entries
required in the register file from specifications of several video coding standards.
Based on this estimation, it was determined that 128 entries were sufficient for
storing previously decoded symbols while encoding or decoding various video
streams. However, 128 entries × 32 bits (4,096 bits) of flip-flops require large hardware.
To reduce the number of flip-flops, the symbols are categorized into three types by
bit width: type 1 (1–4 bits), type 2 (4–16 bits), and type 3 (16–32 bits). As a result,
the number of entries belonging to type 1 is about 1.8 times larger than that belong-
ing to the other categories. Based on this result, our register file architecture consists
of three partitions: 64 type 1 entries (4 bits), 32 type 2 entries (16 bits), and 32 type
3 entries (32 bits) as shown in Fig. 3.87. Compared with a 32-bit nonpartitioned
register file, a 57% reduction in the number of flip-flops is achieved.
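As a quick check of that figure, the partitioned file holds (64 × 4) + (32 × 16) + (32 × 32) = 256 + 512 + 1,024 = 1,792 bits of flip-flops, whereas the nonpartitioned file holds 128 × 32 = 4,096 bits; the reduction is therefore about 56%, consistent with the reported 57%.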
The CABAC accelerator achieves a performance of two cycles per bit of the bin
string (an intermediate binary representation of the syntax elements), which corre-
sponds to three cycles per bit of the stream. This is assuming that the compression
rate for the arithmetic coding is 1.5 and that single-cycle-access flip-flops are used
to update the context information. Taking the several cycles of processing overhead
into account, the performance is 40 Mbps at 162-MHz operation.
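These numbers follow from a simple cycle budget (a rough estimate using only the figures quoted above):

\[
2 \ \tfrac{\text{cycles}}{\text{bin}} \times 1.5 \ \tfrac{\text{bins}}{\text{bit}} = 3 \ \tfrac{\text{cycles}}{\text{bit}}, \qquad
\frac{162 \ \text{Mcycles/s}}{3 \ \text{cycles/bit}} = 54 \ \text{Mbps},
\]

so the measured 40 Mbps implies that roughly a quarter of the cycles go to the processing overhead mentioned above.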
To provide flexibility for handling multiple video standards, the following six
submodules of the image processing units are implemented as low-power media
processors [74]: two fine motion estimators/motion compensators (FME), two
transformers (TRF), and two in-loop deblocking filters (DEB). These modules are
shown in Fig. 3.76.
Figure 3.88 is a block diagram of the programmable image processing element
(PIPE). The PIPE is a tightly coupled multiprocessing unit (PU) system which con-
sists of three PUs (the loading PU, media PU, and storage PU), a local data memory,
and a shared instruction memory. Each PIPE is capable of simultaneously loading
data, performing image processing, and storing data. Arrays of data can be specified
as the operands of several single PU instructions, so the PUs can handle
multiple horizontal data as vectors. This aspect of the PIPE reduces the
number of cycles required for operations such as pre-/post-transposition processing,
as well as the code size and the number of instruction fetches. Reducing instruction
fetches from the shared instruction memory reduces power consumption. Overall, the PIPE
improves the efficiency with which 2D data and instructions are supplied.
A single instruction multiple data (SIMD) architecture performs the same operation
on multiple data simultaneously using multiple processing elements. In general,
however, SIMD can handle only one pair of source vectors at a time, corresponding
to a row of horizontal pixels in an image. A major cause of performance degradation
in 2D image processing is that one instruction handles only this single set of
source data. To solve this issue, the PIPE allows 2D vector data to be specified in a single instruction.
Figure 3.89a shows a single instruction with arrayed data (SIAD) instruction
format. The width and count fields specify multiple source data as multiple vector data.
Fig. 3.90 Evaluation of performance and efficiency in instruction fetching of PIPE acting as modules of the image processing unit in H.264 video processing
The hardware steps the source and destination register pointers over multiple cycles.
This architectural concept provides parallelism for vertical data. Figure 3.89b shows a
basic SIAD ALU structure. Its dataflow goes through mapping logic, multipliers,
sigma adders, and barrel shifters in a pipeline. Each data path is similar to a general
SIMD structure, but the whole structure differs in how source data are supplied.
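The effect of the count, width, and pitch fields can be modeled in software. The following C sketch is purely illustrative: the field names follow Fig. 3.89a, while the 16-bit element type, the pointer arithmetic, and the add operation are assumptions rather than a description of the actual PIPE datapath.

#include <stdint.h>
#include <stddef.h>

/* Software model of one SIAD-style operation: the count, width, and pitch
 * fields describe a 2D block, so a single "instruction" covers what plain
 * SIMD would need count separate instructions (plus loop control) for.    */
typedef struct {
    size_t count;  /* number of rows (vertical extent)            */
    size_t width;  /* elements per row (horizontal vector length) */
    size_t pitch;  /* distance between row starts, in elements    */
} siad_operand;

static void siad_add(int16_t *dst, const int16_t *src1, const int16_t *src2,
                     siad_operand op)
{
    for (size_t row = 0; row < op.count; row++) {   /* hardware steps the pointers */
        const int16_t *a = src1 + row * op.pitch;   /* over multiple cycles        */
        const int16_t *b = src2 + row * op.pitch;
        int16_t       *d = dst  + row * op.pitch;
        for (size_t col = 0; col < op.width; col++) /* one SIMD-like row per step  */
            d[col] = (int16_t)(a[col] + b[col]);
    }
}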
Each PIPE also has a local DMA controller for communication with the other
PIPE modules and with the hard-wired modules (e.g., coarse motion estimator,
symbol coder). Connecting multiple PIPEs in series to form the macroblock-based
pipeline modules provides strong parallel computing performance and scalability
for the video codec (as described in Fig. 3.82).
As shown in Fig. 3.90, the number of processing cycles per macroblock is less than 1,200 in H.264 encoding and less than 1,000 in H.264 decoding. In addition
to the H.264 processing, the average macroblock-processing cycle count for MPEG-2,
MPEG-4, and VC-1 is less than 1,200, which means the video codec is capable of
full HD real-time processing at an operating frequency of 162 MHz.
Table 3.20 lists specifications of the video codec and the measured results for
power consumption in the processing of full HD video at 30 fps. With 45-nm CMOS
technology, the codec consumed 162 mW in encoding and 95 mW in decoding of
H.264 High Profile at 1.10 V at room temperature. Figure 3.91 is a micrograph
of the test chip, which is overlaid with the layout of the video processing unit.
3.4.5 Conclusion
References
17. Arakawa F et al (2005) An exact leading non-zero detector for a floating-point unit. IEICE
Trans Electron E88-C(4):570–575
18. Arakawa F et al (2005) SH-X: an embedded processor core for consumer appliances. ACM
SIGARCH Comput Architect News 33(3):33–40
19. Kamei T, et al (2004) A resume-standby application processor for 3G cellular phones. ISSCC
Dig Tech Papers:336–337, 531
20. Ishikawa M, et al (2004) A resume-standby application processor for 3G cellular phones with
low power clock distribution and on-chip memory activation control. COOL Chips VII
Proceedings, vol. I:329–351
21. Ishikawa M et al (2005) A 4500 MIPS/W, 86 mA resume-standby, 11 mA ultra-standby appli-
cation processor for 3 G cellular phones. IEICE Trans Electron E88-C(4):528–535
22. Yamada T, et al (2005) Low-power design of 90-nm SuperH™ processor core. Proceedings
of 2005 IEEE International Conference on Computer Design (ICCD), pp 258–263
23. Arakawa F, et al (2005) SH-X2: An embedded processor core with 5.6 GFLOPS and 73 M
Polygons/s FPU, 7th Workshop on Media and Streaming Processors (MSP-7):22–28
24. Yamada T et al (2006) Reducing consuming clock power optimization of a 90 nm embedded
processor core. IEICE Trans Electron E89–C(3):287–294
25. Hattori T, et al (2006) A power management scheme controlling 20 power domains for a single-
chip mobile processor. ISSCC Dig Tech Papers, Session 29.5
26. Ito M, et al (2007) A 390 MHz single-chip application and dual-mode baseband processor in
90 nm Triple-Vt CMOS. ISSCC Dig Tech Papers, Session 15.3
27. Naruse M, et al (2008) A 65 nm single-chip application and dual-mode baseband processor
with partial clock activation and IP-MMU. ISSCC Dig Tech Papers, Session 13.3
28. Ito M et al (2009) A 65 nm single-chip application and dual-mode baseband processor with
partial clock activation and IP-MMU. IEEE J Solid-State Circuits 44(1):83–89
29. Kamei T (2006) SH-X3: Enhanced SuperH core for low-power multi-processor systems. Fall
Microprocessor Forum 2006
30. Arakawa F (2007) An embedded processor: is it ready for high-performance computing?
IWIA 2007:101–109
31. Yoshida Y, et al (2007) A 4320MIPS four-processor core SMP/AMP with Individually managed
clock frequency for low power consumption. ISSCC Dig Tech Papers, Session 5.3
32. Shibahara S, et al (2007) SH-X3: Flexible SuperH multi-core for high-performance and low-
power embedded systems. HOT CHIPS 19, Session 4, no 1
33. Nishii O, et al (2007) Design of a 90 nm 4-CPU 4320 MIPS SoC with individually managed
frequency and 2.4 GB/s multi-master on-chip interconnect. Proc 2007 A-SSCC, pp 18–21
34. Takada M, et al (2007) Performance and power evaluation of SH-X3 multi-core system. Proc
2007 A-SSCC, pp 43–46
35. Ito M, et al (2008) An 8640 MIPS SoC with independent power-off control of 8 CPUs and 8
RAMs by an automatic parallelizing compiler. ISSCC Dig Tech Papers, Session 4.5
36. Yoshida Y, et al (2008) An 8 CPU SoC with independent power-off control of CPUs and
multicore software debug function. COOL Chips XI Proceedings, Session IX, no. 1
37. Arakawa F (2008) Multicore SoC for embedded systems. International SoC Design Conference
(ISOCC) 2008, pp.I-180–I-183
38. Kido H, et al (2009) SoC for car navigation systems with a 53.3 GOPS image recognition
engine. HOT CHIPS 21, Session 6, no. 3
39. Yuyama Y, et al (2010) A 45 nm 37.3GOPS/W heterogeneous multi-core SoC. ISSCC
Dig:100–101
40. Nito T, et al (2010) A 45 nm heterogeneous multi-core SoC supporting an over 32-bits physical
address space for digital appliance. COOL Chips XIII Proceedings, Session XI, no. 1
41. Arakawa F (2011) Low power multicore for embedded systems. CMOS Emerging Technologies,
Session 5B, no. 1
42. Song SP et al (1994) The PowerPC 604 RISC microprocessor. IEEE Micro 14(5):8–22
43. Levitan D, et al (1995) The PowerPC 620™ microprocessor: a high performance superscalar
RISC microprocessor. Compcon '95 'Technologies for the Information Superhighway', Digest
of Papers, pp 285–291
44. Edmondson JH et al (1995) Superscalar instruction execution in the 21164 alpha microprocessor.
IEEE Micro 15(2):33–43
45. Gronowski PE et al (1998) High-performance microprocessor design. IEEE J Solid-State
Circuit 33(5):676–686
46. Yeager KC (1996) The MIPS R10000 superscalar microprocessor. IEEE Micro 16(2):28–40
47. Golden M et al (1999) A seventh-generation x86 microprocessor. IEEE J Solid-State Circuit
34(11):1466–1477
48. Hinton G, et al (2001) A 0.18-µm CMOS IA-32 processor with a 4-GHz integer execution
unit. IEEE J Solid-State Circuits 36(11)
49. Weicker RP (1988) Dhrystone benchmark: rationale for version 2 and measurement rules.
ACM SIGPLAN Notices 23(8):49–62
50. Kodama T, et al (2006) Flexible engine: a dynamic reconfigurable accelerator with high
performance and low power consumption, In: Proc of the IEEE Symposium on Low-Power
and High-Speed Chips (COOL Chips IX)
51. Motomura M (2002) A dynamically reconfigurable processor architecture. Microprocessor
Forum 2002, Session 4-2
52. Fujii T, et al (1999) A dynamically reconfigurable logic engine with a multi-Context/multi-mode
unified-cell architecture. Proc Intl Solid-State Circuits Conf, pp 360–361
53. Cooley JW, Tukey JW (1965) An algorithm for the machine calculation of complex Fourier
series. Mathemat Comput, 19
54. Pease MC (1968) An adaptation of the fast Fourier transform for parallel processing. J ACM,
15(2)
55. Noda H et al (2007) The design and implementation of the massively parallel processor based
on the matrix architecture. IEEE J Solid-State Circuits 42(1):183–192
56. Noda H et al (2007) The circuits and robust design methodology of the massively parallel
processor based on the matrix architecture. IEEE J Solid-State Circuits 42(4):804–812
57. Kuang JB, et al (2005) A double-precision multiplier with fine-grained clock-gating support
for a first-generation CELL processor. In: IEEE Int Solid-State Circuits Conf Dig Tech Papers,
378–379
58. Flachs B, et al (2005) A streaming processor unit for a CELL processor. IEEE Int Solid-State
Circuits Conf Dig Tech Papers 134–135
59. Kyo S et al (2003) A 51.2GOPS scalable video recognition processor for intelligent cruise
control based on a linear array of 128 four-way VLIW processing elements. IEEE J Solid-State
Circuits 38(11):1992–2000
60. Hillis D (1985) The connection machine. MIT, Cambridge, MA
61. Swan RJ et al (1977) The implementation of the CM multiprocessor. Proc NCC 46:645–655
62. Amano H (1996) Parallel computers. Tokyo, Shoukoudou
63. Kurafuji T, et al (2010) A scalable massively parallel processor for real-time image processing.
IEEE Int Solid-State Circuits Conf Dig Tech Papers:334–335
64. Joint Video Team (JVT) of ISO/IEC MPEG & ITU-T VCEG, Text of International Standard
of Joint Video Specification, ITU-T Rec. H.264 | ISO/IEC 14496-10 Advanced Video Coding,
Dec. 2003
65. Richardson IEG (2003) H.264 and MPEG-4 video compression: video coding for next-generation
multimedia. Wiley, New York
66. Wiegand T et al (2003) Overview of the H.264/AVC video coding standard. IEEE Trans
Circuits Syst Video Technol 13(7):560–576
67. Shirasaki M, et al (2009) A 45 nm Single-Chip Application-and-Baseband Processor Using an
Intermittent Operation Technique. IEEE ISSCC Dig Tech Papers:156–157
68. Nomura S, et al (2008) A 9.7 mW AAC-decoding, 620 mW H.264 720p 60fps decoding,
8-core media processor with embedded forward-body-biasing and power-gating circuit in
65 nm CMOS technology. IEEE ISSCC Dig Tech Papers:262–263
69. Mair H, et al (2007) A 65-nm mobile multimedia applications processor with an adaptive
power management scheme to compensate for variations. Dig Symp VLSI Circuits:224–225
70. Chien CD et al (2007) A 252 kgate/71 mW multi-standard multi-channel video decoder for high
definition video applications. IEEE ISSCC Dig Tech Papers:282–283
71. Liu TM et al (2007) A 125 mW, fully scalable MPEG-2 and H.264/AVC video decoder for
mobile applications. IEEE J Solid-State Circuits 42(1):161–169
72. Lin YK, et al (2008) A 242 mW 10 mm2 1080p H.264/AVC high-profile encoder chip. IEEE
ISSCC Dig Tech Papers:314–315
73. Chen YH, et al (2008) An H.264/AVC scalable extension and high profile HDTV 1080p
encoder chip. Symp VLSI Circuits Dig:104–105
74. Iwata K et al (2009) 256 mW 40 Mbps Full-HD H.264 high-profile codec featuring a dual-
macroblock pipeline architecture in 65 nm CMOS. IEEE J Solid-State Circuits 44(4):
1184–1191
75. ITU-T, ITU-T Recommendation H.264.1, Conformance Specification for H.264 Advanced
Video Coding, 2005
76. Iwata K et al (2010) A 342 mW mobile application processor with full-hd multi-standard video
codec and tile-based address-translation circuits. IEEE J Solid-State Circuits 45(1):59–68
77. Kimura M et al (2009) A full HD multistandard video codec for mobile applications. IEEE
Micro 29(6):18–27
78. Wiegand T et al (2010) Special Section on the Joint Call for Proposals on High Efficiency
Video Coding (HEVC) Standardization. IEEE Trans Circuits Syst Video Technol 20(12):
1661–1666
Chapter 4
Chip Implementations
Three prototype multicore chips, RP-1, RP-2, and RP-X, were implemented with
the highly efficient cores described in Chap. 3. The details of the chips are described
in this chapter. The multicore architecture makes it possible to enhance the performance
while maintaining the efficiency, but it cannot enhance the efficiency itself. Therefore,
a multicore with inefficient cores is still inefficient, and highly efficient cores are
the key components for realizing a high-performance and highly efficient SoC.
However, a multicore requires different technologies from those of a single core to
maximize its capabilities. The prototype chips are useful for researching and devel-
oping such technologies and have been utilized for developing and evaluating soft-
ware environments, application programs, and systems (see Chaps. 5 and 6).
4.1 Multicore SoC with Highly Efficient Cores
Fig. 4.1 Performance (DGIPS) and power efficiency (DGIPS/W) of various chips, grouped into the sensor, controller/mobile-device, equipped-device/mobile-PC, and server/PC categories (DGIPS: Dhrystone GIPS)
The power of chips in the server/PC category is limited at around 100 W, and the
chips above the 100-W oblique line must be used. Similarly, the chips roughly above
the 10- or 1-W oblique line must be used for equipped devices/mobile PCs, or control-
lers/mobile devices, respectively. Further, some sensors must use the chips above
the 0.1-W oblique line, and new categories may grow from this region. Consequently,
we must develop high-DGIPS²/W chips to achieve high performance under these
power limitations.
Figure 4.2 maps various processors on a graph whose horizontal and vertical
axes represent, respectively, operating frequency (MHz) and frequency-to-power ratio
(MHz/W) on logarithmic scales. Figure 4.2 uses MHz or GHz instead of the DGIPS of
Fig. 4.1 because DGIPS values are disclosed for only a few of the server/PC processors.
Some power values include leakage current, whereas others do not; some are measured
under worst-case conditions, while others are not. Although the MHz value does not directly
represent the performance, and the power measurement conditions are not identical,
they roughly represent the order of performance and power. The triangles and circles
represent embedded and server/PC processors, respectively. The dark gray, light
gray, and white plots represent the periods up to 1998, after 2003, and in between,
respectively. The GHz²/W improved roughly ten times from 1998 to 2003, but only
three times from 2003 to 2008. The enhancement of single cores is clearly slowing
down; instead, processor chips now typically adopt a multicore architecture.
Figure 4.3 summarizes the multicore chips presented at the International Solid-
State Circuits Conference (ISSCC) from 2005 to 2008. All the processor chips presented
at ISSCC since 2005 have been multicore ones. The axes are similar to those of
Fig. 4.2, although the horizontal axis reflects the number of cores. The start and
end points of each arrow represent a single core and the corresponding multicore, respectively.
The performance of multicore chips has continued to improve, which has compensated for the slowdown in the performance gains of single cores in both the embedded and server/PC processor categories.
Fig. 4.3 Multicore chips presented at ISSCC from 2005 to 2008: power efficiency (MHz/W) versus 1x, 2x, 4x, 8x, or 16x of operating frequency (MHz)
There are two types of multicore
chips. One type integrates multiple-chip functions into a single chip, resulting in a
multicore SoC. This integration type has been popular for more than ten years. Cell
phone SoCs have integrated various types of hardware intellectual properties
(HW-IPs) that were formerly spread over multiple chips. For example, the
SH-Mobile G1 integrated the functions of both the application and baseband processor
chips [3], followed by the SH-Mobile G2 [4] and SH-Mobile G3 [5, 6], which
enhanced both the application and baseband functionalities and performance. The
other type increases the number of cores to meet the requirements of performance
and functionality enhancement. The RP-1, RP-2, and RP-X are the prototype SoCs,
and an SH2A-DUAL [7] and an SH-Navi3 [8] are the multicore SoC products of
this enhancement type. The transition from single core chips to multicore ones
seems to have been successful on the hardware side, and various multicore products
are already on the market. However, various issues still need to be addressed for
future multicore systems.
The first issue concerns memories and interconnects. Flat memory and interconnect
structures are best for software but are hardly possible in hardware, so some
hierarchical structure is necessary. The power consumed by on-chip interconnects
for communications and data transfers degrades power efficiency, and more
effective transfer schemes must be established. Maintaining the external I/O performance
per core is more difficult than increasing the number of cores, because the
number of pins per transistor decreases as processes become finer. Therefore, a
breakthrough is needed in order to maintain the I/O performance.
The second issue concerns runtime environments. The performance scalability
was supported by the operating frequency in single core systems, but it should be
supported by the number of cores in multicore systems. Therefore, the number of
cores must be invisible or virtualized with small overhead when using a runtime
environment. A multicore system will integrate different subsystems called domains.
The domain separation improves system reliability by preventing interference
between domains. On the other hand, the well-controlled domain interoperation
results in an efficient integrated system.
The third issue relates to the software development environments. Multicore sys-
tems will not be efficient unless the software can extract application parallelism and
utilize parallel hardware resources. We have already accumulated a huge amount of
legacy software for single cores. Some legacy software can successfully be ported,
especially for the integration type of multicore SoCs like the SH-Mobile G series.
However, it is more difficult with the enhancement type. We must either make a single
program run on multiple cores or distribute functions now running on a single core
across the cores. Therefore, we must improve the portability of legacy software to the
multicore systems. Developing new highly parallel software is another issue. An
application or parallelization specialist could do this, although it might be necessary
to have specialists in both areas. Some excellent research has been done on auto-
matic parallelization compilers, and the products of such compilers are expected to
be released in the future. Further, we need a paradigm shift in the development, for
example, a higher level of abstraction, new parallel languages, and assistant tools
for effective parallelization.
4.2 RP-1 Prototype Chip
The RP-1 is the first multicore chip with four SH-X3 CPU cores (see Sect. 3.1.7)
[9–13]. It was fabricated as a prototype chip using a 90-nm CMOS process to accel-
erate the research and development of various embedded multicore systems. The
RP-1 achieved a total of 4,320 MIPS at 600 MHz by the four SH-X3 cores measured
using the Dhrystone 2.1 benchmark. It supports both symmetric and asymmetric
multiprocessor (SMP and AMP) features for embedded applications, and the SMP and
AMP modes can be mixed to construct a hybrid SMP/AMP system. Each
core can operate at a different frequency and can stop individually while maintaining
its data cache coherency and while the other processors keep running, in order to achieve
both the maximum processing performance and the minimum operating power for
various applications.
Table 4.1 summarizes the RP-1 specifications. The RP-1 integrates four SH-X3
cores with a snoop controller (SNC) to maintain the data cache coherency among
the cores, DDR2-SDRAM and SRAM memory interfaces, a PCI-Express interface,
some HW-IPs for various types of processing, and some peripheral modules. The
HW-IPs include a DMA controller, a display unit, and accelerators. Each SH-X3
core includes a CPU, an FPU, 32-KB 4-way set-associative instruction and data
caches, a 4-entry instruction TLB, a 64-entry unified TLB, an 8-KB instruction
local RAM (ILRAM), a 16-KB operand local RAM (OLRAM), and a 128-KB user
RAM (URAM).
Figure 4.4 illustrates a block diagram of the RP-1. The four SH-X3 cores, a snoop
controller (SNC), and a debug module (DBG) constitute a cluster. The HW-IPs are
connected to an on-chip system bus (SuperHyway). The arrows to/from the SuperHyway
indicate connections from/to initiator/target ports, respectively. The details of the
SH-X3 cluster and SuperHyway are described in the following sections.
Fig. 4.4 Block diagram of the RP-1
The four SH-X3 cores constitute a cluster sharing an SNC and a DBG to support
symmetric-multiprocessor (SMP) and multicore-debug features. The SNC has a
duplicated address array (DAA) of data caches of all the four cores and is connected
to the cores by a dedicated snoop bus separated from the SuperHyway to avoid both
deadlock and interference by some cache coherency protocol operations. The DAA
minimizes the number of data cache accesses of the cores for the snoop operations,
resulting in the minimum coherency maintenance overhead. Each core can operate
at different CPU clock (ICLK) frequencies and can stop individually to minimize
the power (see Sect. 4.2.3). The coherency protocol was optimized to avoid the
interference caused by a slow core stalling a fast core (see Sect. 4.2.4).
Each core can operate at different CPU clock (ICLK) frequencies and can stop indi-
vidually, while the other processors are running with a short switching time in order
to achieve both the maximum processing performance and the minimum operating
power for various applications. A data cache coherency is maintained during opera-
tions at different frequencies, including frequencies lower than the on-chip system
bus clock (SCLK). The following four schemes make it possible to change each
ICLK frequency individually while maintaining data cache coherency:
1. Each core has its own clock divider for an individual clock frequency change.
2. A handshake protocol is executed before the frequency change to avoid conflicts
in bus access, while keeping the other cores running.
3. Each core supports various ICLK frequency ratios to SCLK including a lower
frequency than that of SCLK.
4. Each core has a light-sleep mode to stop its ICLK while maintaining data cache
coherency.
The global ICLK and the SCLK that run up to 600 and 300 MHz, respectively,
are generated by a global clock pulse generator (GCPG) and distributed to each
core. Both the global ICLK and SCLK are programmable by setting the frequency
control register in the GCPG. Each local ICLK is generated from the global ICLK
by the clock divider of each core. The local CPG (LCPG) of a core executes a hand-
shake sequence dynamically when the frequency control register of the LCPG is
changed so that it can keep the other cores running and can maintain coherency in
data transfers of the core. In contrast, the previous approach assumed a low frequency
during a clock frequency change and stopped all the cores whenever a frequency was changed.
The core supports “light-sleep mode” to stop its ICLK except for its data cache in
order to maintain the data cache coherency. This mode is effective for reducing the
power of an SMP system.
Each core should operate at the proper frequency for its load, but in some cases of
the SMP operation, a low frequency core can cause a long stall of a high frequency
core. We optimized the cache snoop sequences for the SMP mode to minimize such
stalls. Table 4.2 summarizes the coherency overhead cycles. These cycles vary
according to various conditions; the table indicates a typical case. The bold values
indicate optimized cases explained below.
Figure 4.5 (i), (ii) shows examples of core snoop sequences before and after the
optimization. The case shown is a “write access to a shared line,” which is the third
case in the table.
Fig. 4.5 Examples of core snoop sequences (i) before and (ii) after the optimization
The operating frequencies of cores #0, #1, and #2 are 600, 150, and 600 MHz,
respectively. Initially, all the data caches of the cores hold a common cache line, and
all the cache-line states are “shared.” Sequence (i) is as follows:
1. Core Snoop Request: Core #0 stores data in the cache, changes the stored-line
state from “Shared” to “Modified,” and sends a “Core Snoop Request” of the
store address to the SNC.
2. DAA Update: The SNC searches the DAA of all the cores and changes the states
of the hit lines from “Shared” to “Modified” for core #0 and “Invalid” for cores
#1 and #2. The SNC runs at SCLK frequency (300 MHz).
3. Invalidate Request: The SNC sends “Invalidate Request” to cores #1 and #2.
4. Data Cache Update: Cores #1 and #2 change the states of the corresponding
cache lines from “Shared” to “Invalid.” The processing time depends on each
core’s ICLK.
5. Invalidate Acknowledge: Cores #1 and #2 return “Invalidate Acknowledge” to
the SNC.
6. Snoop Acknowledge: The SNC returns “Snoop Acknowledge” to core #0.
As shown in Fig. 4.5 (i), the return from core #1 is late due to its low frequency,
resulting in long snoop latency.
With the optimization, sequence (ii) is as follows:
1. Core Snoop Request
2. DAA Update
3. Snoop Acknowledge: The SNC returns "Snoop Acknowledge" to core #0 as soon as the DAA update is completed, without waiting for the data cache updates and "Invalidate Acknowledge" of the other cores, so the snoop latency no longer depends on the slowest core, as shown in Fig. 4.5 (ii).
It would require too much time and money to design an SoC consisting entirely of
newly designed modules. Therefore, we make modules reusable and refer to such a
reusable module as a HW-IP. A standard and highly efficient method is needed to connect
the HW-IPs. An on-chip system bus called SuperHyway is a packet-based split
transaction bus used to connect the HW-IPs, and transactions may contain up to 32
bytes of data. The bus is compatible with Virtual Socket Interface (VSI) protocols.
It seamlessly connects to VSI virtual-component libraries.
Effective support of high-speed, multi-initiator, multi-target data transfer is
important for cost-effective SoC implementations, and the SuperHyway provides
such data transfer mechanisms.
Fig. 4.7 Logical organization of the SuperHyway interconnect (64-bit, 2.4 GB/s at 300 MHz): initiator and target ports of the cores, SNC, DBG, CSM, and memory interfaces connected through routers
Fig. 4.8 Physical organization of the SuperHyway interconnect logic
Each path starts at the F/Fs holding an initiator HW-IP address and ends at many F/Fs of target HW-IPs. The 300-MHz 64-bit SuperHyway
achieved a throughput of 2.4 GB/s, which is the same as that of the 600-MHz
32-bit DDR2 interface.
Figure 4.8 shows the physical organization of the interconnect logic, where each
arrow includes 29-bit-address and 128-bit-data lines corresponding to Fig. 4.7. The
routing block was synthesized as a net list without actual wire lengths. A long wire
path caused an unacceptable CR delay that could be calculated after place and route,
so we inserted repeater cells to improve the path delay. As a result, a one-cycle path
from an initiator to a target could reach the shaded area.
Figure 4.9 shows the chip micrograph of the RP-1. The chip was integrated in
two steps to minimize the design period of the physical integration, and successfully
fabricated: (1) first, a single core was laid out as a hard macro and its timing closure
was completed, and (2) the whole chip was then laid out by instantiating the core four times.
Fig. 4.10 Normalized execution time of the SPLASH-2 programs (FFT, LU, Radix, Water) on the RP-1 with one, two, and four threads
Figure 4.10 plots the normalized execution time of the SPLASH-2 programs with
one, two, and four threads; for ideal performance scalability, the time would be 50%
and 25% of the single-thread time with two and four threads, respectively. The major
overhead was synchronization and snoop time. The SNC improved cache coherency
performance, and the performance overhead caused by snoop transactions was no
more than 0.1% when SPLASH-2 was executed.
Figure 4.11 shows the power consumption of the SPLASH-2 suite. The suite ran
at 600 MHz and at 1.0 V. The average power consumption of one, two, and four
threads was 251, 396, and 675 mW, respectively. This included 104 mW of active
power for the idle tasks of SMP Linux. The results of the performance and power
evaluation showed that the power efficiency was maintained or enhanced when the
number of threads increased.
Figure 4.12 shows the energy consumption with low power modes. These modes
were implemented to save power when fewer threads were running than available
on CPU cores. As a benchmark, two threads of FFT were running on two CPU
cores, and two CPU cores were idle. The energy consumed in the light-sleep, sleep,
and module-stop modes at 600 MHz was 4.5%, 22.3%, and 44.0% lower than in the
normal mode, respectively, although these modes take some time to stop and restart
the CPU core and to save and restore the cache.
79.5% at 300 MHz, but the power consumption decreased, and the required energy
decreased by 5.2%.
4.3 RP-2 Prototype Chip
The RP-2 is a prototype multicore chip with eight SH-X3 CPU cores (see Sect.
3.1.7) [15–17]. It was fabricated in a 90-nm CMOS process that was the same pro-
cess used for the RP-1. The RP-2 achieved a total of 8,640 MIPS at 600 MHz by the
eight SH-X3 cores measured with the Dhrystone 2.1 benchmark. Because it is
difficult to lay out the eight cores close to each other, we did not select a tightly
coupled cluster of eight cores. Instead, the RP-2 consists of two clusters of four
cores, and the cache coherency is maintained in each cluster. Therefore, the inter-
cluster cache coherency must be maintained by software if necessary.
Table 4.3 summarizes the RP-2 specifications. The RP-2 integrates eight SH-X3
cores as two clusters of four cores, DDR2-SDRAM and SRAM memory interfaces,
DMA controllers, and some peripheral modules. Figure 4.13 illustrates a block dia-
gram of the RP-2. The arrows to/from the SuperHyway indicate connections from/
to initiator/target ports, respectively.
Fig. 4.13 Block diagram of the RP-2
Fig. 4.14 Power switches for the cores and URAMs of the RP-2
The RP-2 has barrier registers to support CPU core synchronization for multipro-
cessor systems. Software can use these registers for fast synchronization between
the cores. In the synchronization, one core waits for other cores to reach a specific
point in a program. Figure 4.15 illustrates the barrier registers for the synchroni-
zation. In a conventional software solution, the cores have to test and set a specific
memory location, but this requires long cycles. We provide three sets of barrier
registers to accelerate the synchronization. Each CPU core has a one-bit BARW
register to notify when its program flow reaches a specific point. The BARW
values of all the cores are gathered by hardware to form an 8-bit BARR register
of each core so that each core can obtain all the BARW values from its BARR
register with a single instruction. As a result, the synchronization is fast and does
not disturb other transactions on the SuperHyway bus.
Figure 4.16 shows an example of the barrier register usage. In the beginning, all
the BARW values are initialized to zero. Then each core inverts its BARW value
when it reaches a specific point, and it checks and waits until all its BARR values
are ones reflecting the BARW values. The synchronization is complete when all the
BARW values are inverted to ones. The next synchronization can start immediately
with the BARWs being ones and is complete when all the BARW values are inverted
to zeros.
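In software, the synchronization described above reduces to a toggle-and-spin loop on the barrier registers. The following C sketch is a minimal illustration; the register addresses are placeholders, not the actual RP-2 memory map, and only the BARW/BARR behavior described above is assumed.

#include <stdint.h>

#define BARW ((volatile uint32_t *)0xFE400000u)  /* placeholder address: this core's 1-bit notify register */
#define BARR ((volatile uint32_t *)0xFE400004u)  /* placeholder address: gathered BARW bits of all 8 cores */

/* Wait until every core's BARW bit has taken the expected value (0x00 or 0xFF). */
static void rp2_barrier(uint32_t expected)
{
    *BARW ^= 1u;                          /* notify: this core reached the barrier              */
    while ((*BARR & 0xFFu) != expected)   /* a single BARR read sees all cores                  */
        ;                                 /* spin; does not disturb other SuperHyway transfers  */
}

The first synchronization would call rp2_barrier(0xFF) and the next rp2_barrier(0x00), matching the alternating inversion shown in Fig. 4.16.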
Table 4.5 compares the results of eight-core synchronizations with and without
the barrier registers. The average number of clock cycles required for a certain task
to be completed with and without barrier registers is 8,510 and 52,396 cycles,
respectively. The average differences in the synchronizing cycles between the first
and last cores are 10 and 20,120 cycles with and without the barrier registers, respec-
tively. These results show that the barrier registers effectively improve the
synchronization.
Fig. 4.17 Interrupt controller (INTC) of the RP-2: external and on-chip peripheral interrupt control, interrupt distribution to the eight cores, and inter-processor interrupt (IPI) registers
The RP-2 has an autorotating dynamic distribution mode to reduce the interrupt-handling overhead.
In this mode, the INTC asserts an interrupt request to only one core for a number of cycles
specified by software, at most 24 cycles, so the other cores need not spend time
on redundant interrupt handling. This mode is best in terms of computing
throughput and keeps the worst-case response time of the conventional mode,
which is important in order to guarantee the response time. The average number of
clock cycles for an evaluated task is 73,028 cycles in the newly added mode, which
is 1,900 cycles fewer on average than in the conventional mode.
In the multicore system, each core needs to interrupt other cores, and an inter-
processor interrupt (IPI) is supported by the RP-2. There are eight IPI registers in
the IPI control block for eight cores. Each core can generate an interrupt to other
cores by writing to its IPI register in the INTC. Each IPI register consists of eight
fields corresponding to the target cores.
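As a concrete illustration of the register interface, writing one word is enough to raise an IPI; the base address and field layout below are placeholders, since the actual INTC register map of the RP-2 is not reproduced here.

#include <stdint.h>

/* Placeholder: one 32-bit IPI register per sending core, with eight fields
 * (here, bits 0-7) selecting the destination cores.                        */
#define IPI_REG(sender) ((volatile uint32_t *)(0xFE410000u + 4u * (sender)))

static void send_ipi(unsigned sender, unsigned target_mask)  /* bit n selects core n */
{
    *IPI_REG(sender) = target_mask & 0xFFu;  /* a single write raises the inter-processor interrupt */
}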
The RP-2 was fabricated using the same 90-nm CMOS process as that for the RP-1.
Figure 4.18 is the chip micrograph of the RP-2. It achieved a total of 8,640 MIPS at
600 MHz by the eight SH-X3 cores measured with the Dhrystone 2.1 benchmark
and consumed 2.8 W at 1.0 V including leakage power.
The fabricated RP-2 chip was evaluated using the SPLASH-2 benchmarks on an
SMP Linux operating system. Figure 4.19 plots the RP-2 execution time on one
cluster based on the number of POSIX threads. The processing time was reduced to
51–63% with two threads and to 27–41% with four or eight threads running on one
cluster. Since there were fewer cores than threads, the eight-thread case showed
performance similar to the four-thread one. Furthermore, in some cases, increasing
the number of threads increased the processing time because of the synchronization overhead.
The autorotating dynamic interrupt distribution mode was evaluated and com-
pared to a conventional one by SPLASH-2 with four threads on SMP Linux using
one cluster of a real chip. Figure 4.20 shows the number of interrupts acknowl-
edged by the CPU cores during the SPLASH-2 execution. The total acknowledged
interrupts by all the cores in the autorotating mode decreased by 7% for Water, 31%
for Radix, and 57% for Barnes from the conventional mode. As a result, it avoided
the redundant interrupt handling. This improvement leads to a reduced processing
time in Linux kernel mode. Figure 4.21 shows the processing time reduction in
kernel mode. The reduction was 8% for Water, 11% for Radix, and 21% for Barnes,
respectively.
In addition to the improved performance, the reduction in acknowledged
interrupts is expected to be effective for saving power. In sleep mode in particular,
redundant interrupt handling wastes the power needed to wake the cores up
and put them back into sleep mode.
4.4 RP-X Prototype Chip
The RP-X specifications are summarized in Table 4.6. It was fabricated using a
45-nm CMOS process and integrates eight SH-X4 cores, four FE–GAs, two MX-2s,
one VPU5, one SPU, and various peripheral modules as a heterogeneous multicore
SoC for consumer electronics and scientific applications, one of the most promising
approaches to attaining high performance at low frequency and power.
The eight SH-X4 cores achieved 13.7 GIPS at 648 MHz measured using the
Dhrystone 2.1 benchmark. Four FE–GAs, which are dynamically reconfigurable processors,
were integrated and attained a total performance of 41.5 GOPS at a power consumption
of 0.76 W. Two 1,024-way MX-2s were integrated and attained a total
performance of 36.9 GOPS at a power consumption of 1.10 W. Overall, the
efficiency of the RP-X was 37.3 GOPS/W at 1.15 V excluding the special-purpose
VPU5 and SPU cores; this was the highest among comparable processors.
The operation granularities of the SH-X4, FE–GA, and MX-2 processors are 32 bits,
16 bits, and 4 bits, respectively, so the most appropriate processor
cores can be assigned to each task.
Figure 4.22 illustrates the structure of the RP-X. The processor cores of the
SH-X4, FE–GA, and MX-2; the programmable special-purpose cores of the
VPU5 and SPU; and the various modules are connected by three SuperHyway
buses to handle high-volume and high-speed data transfers. SuperHyway-0 con-
nects the modules for an OS, general tasks, and video processing, SuperHyway-1
connects the modules for media acceleration, and SuperHyway-2 connects media
IPs except for the VPU5. Some peripheral buses and modules are not shown in
the figure.
A data transfer unit (DTU) was implemented in each SH-X4 core to transfer data
to and from the special-purpose cores or various memories without using CPU
instructions. In this kind of system, multiple OSes are used to control various func-
tions, and thus, high-volume and high-speed memories are required.
Fig. 4.22 Structure of the RP-X: two SH-X4 clusters, the VPU5, SPU2, FE–GAs, MX-2s, DDR3 interfaces, CSMs, and media IPs connected by the three SuperHyway buses
Fig. 4.23 Structure of the FE–GA: an array of ALU and MLT cells with load/store cells and local memories (CRAMs), controlled by a sequence manager (SEQM)
Fig. 4.24 Structure of the MX-2: a 1,024-way SIMD PE array with SRAM data registers, H-ch and V-ch connections, and an I/O interface
Figure 4.24 illustrates the structure of the MX-2. It is a massively parallel processor
consisting of 1,024-way-SIMD 4-bit PEs, each with an ALU and a Booth encoder, two
SRAMs as data registers, a controller with an instruction memory, and an I/O interface
to the subsystem and media buses. The number of PEs can be any multiple of 256, and each
MX-2 of the RP-X integrates 1,024 PEs. The PEs and SRAMs are connected through
horizontal channels (H-ch), and the PEs are connected to each other by a shifter that forms
vertical channels (V-ch). The MX-2 performs efficient massively parallel arithmetic
processing. It is especially good for data whose width is a multiple of 4 bits, such as image data,
which are mainly 8 or 12 bits wide. The details of the MX-2 are described in Sect. 3.3.
Figure 4.25 illustrates the structure of the VPU5. It is a programmable video pro-
cessing core consisting of two codec elements for pixel-rate domain and a variable
length coding for stream-rate domain (VLCS) codec. They are connected by a
shift-register-based bus for fast and efficient transfer of processing data. Each
codec element consists of a DMAC and three programmable image processing
elements (PIPEs) for transform prediction, motion compensation, and a deblock
filter. Each PIPE consists of a load module, a two-dimensional ALU, and a store
module. They are controlled by a microprogram, and the load/store modules use a
data I/O to connect to the bus. The VPU5 can handle various formats such as
MPEG-1/2/4, H.263, and H.264 and various resolutions from QCIF to full HD. The
programmability is a convenient feature that allows a new algorithm to be applied
or a previous algorithm to be updated. The details of the VPU5 are described in
Sect. 3.4.
Because the RP-X integrated various modules, it was important to reduce the power
consumption of unused modules by clock gating. The power consumption of clock
buffers was particularly large. Figure 4.26 shows the clock buffer deactivation cir-
cuits. In the conventional clock tree (i), global clock trees from a clock generator
were divided logically into CLK0, CLK1, and CLK2, and the clock of Modules A,
B, and C was provided by the same clock tree CLK0. However, the Module C was
located further away from the Modules A and B, and the clock tree of the Module C
became a dedicated tree from a point near the clock generator, which had to be acti-
vated even when the Module C was not used. In contrast, the Modules A and B
successfully shared the clock tree and saved clock-tree capacitance.
After optimizing the power (ii), the clock tree of the Module C was separated and
gated at the clock generator as CLK0_1, whereas the Modules A and B shared
the clock tree CLK0_0. In this way, the clock tree CLK0_1 can be stopped when
the Module C is not used. In a large-scale chip, it is not easy to lay out all the mod-
ules using the same clock close together, and proper tree separation is effective for
reducing the power. A gate-level simulation showed that by applying this method,
the deactivation of all clock buffers related to MX-2 and PCI-Express saved 41.5 mW
of power at 1.15 V.
The RP-X contains two 2-GB DDR3-SDRAM interfaces. Figure 4.27 illustrates the
DDR3-SDRAM interface. The latency of this interface was reduced to improve the
performance and power efficiency by removing unnecessary data buffering and
masking invalid data. No F/Fs except retiming F/Fs were used in the DDR3 PHY, to
reduce write latency. To reduce read latency, the DDR3 interface includes an asynchronous
FIFO and an invalid-level mask circuit for latching valid strobe signals from the
bidirectional interface. Overall, the DDR3 interface including the I/O buffer and data
sampling requires four cycles (10 ns), and the total latency is nine cycles including
the memory latency.
The RP-X was fabricated using a 45-nm low-power CMOS process. A chip micro-
graph of the RP-X is in Fig. 4.28. It achieved a total of 13,738 MIPS at 648 MHz by
the eight SH-X4 cores measured using the Dhrystone 2.1 benchmark, and consumed
3.07 W at 1.15 V including leakage power.
The RP-X is a prototype chip for consumer electronics or scientific applications.
As an example, we produced a digital TV prototype system with IP networks (IP-
TV) including image recognition and database search. Its system configuration and
memory usage are shown in Fig. 4.29. The system is capable of decoding 1080i
audio/video data using the VPU and the SPU on OS#1. For image recognition, the
MX-2s are used for image detection and feature quantity calculation, and the
FE–GAs are used for optical flow calculation of a VGA (640 × 480) video at 15 fps
on OS#2. These operations required 30.6 and 0.62 GOPS of the MX-2s and
FE–GAs, respectively. The SH-X4 cores are used for database search on OS#3 using the
results of the above operations, as well as for supporting all the processing, including
OS#1, OS#2, OS#3, and the data transfers between the cores. Main
memories of 0.4, 0.6, 1.6, and 1.8 GB are assigned to OS#1, OS#2, OS#3, and PCI,
respectively, for a total of 4.4 GB. The details of the prototype system are described
in Chap. 6.
Table 4.7 lists the total performance and power consumption at 1.15 V when
eight CPU cores, four FE–GAs, and two MX-2s are used at the same time. The
power efficiency of the CPU cores, FE–GAs, and MX-2s reached 42.9 GFLOPS/W,
41.5 GOPS/W, and 36.9 GOPS/W, respectively. The power consumption of the
other components was reduced to 0.40 W by clock gating of 31 out of 44 modules.
In total, if we count 1 GFLOPS as 1 GOPS, the RP-X achieved 37.3 GOPS/W at
1.15 V excluding I/O area power consumption.
References
1. Gelsinger PP (2001) Microprocessors for the new millennium: challenges, opportunities,
and new frontiers. ISSCC Dig Tech Papers, Session 1.3
2. Arakawa F (2008) Multicore SoC for embedded systems. International SoC Design Conference
(ISOCC) 2008: I-180–I-183
3. Hattori T, et al (2006) A power management scheme controlling 20 power domains for a
single-chip mobile processor. ISSCC Dig Tech Papers, Session 29.5
4. Ito M, et al (2007) A 390 MHz single-chip application and dual-mode baseband processor
in 90 nm Triple-Vt CMOS. ISSCC Dig Tech Papers, Session 15.3
5. Naruse M, et al (2008) A 65 nm single-chip application and dual-mode baseband processor
with partial clock activation and IP-MMU. ISSCC Dig Tech Papers, Session 13.3
6. Ito M et al (2009) A 65 nm single-chip application and dual-mode baseband processor with
partial clock activation and IP-MMU. IEEE J Solid-State Circuits 44(1):83–89
7. Hagiwara K, et al (2008) High performance and low power SH2A-DUAL core for embedded
microcontrollers. COOL Chips XI Proceedings, Session XI, no. 2
8. Kido H, et al (2009) SoC for car navigation systems with a 53.3 GOPS image recognition
engine. HOT CHIPS 21, Session 6, no 3
9. Kamei T (2006) SH-X3: Enhanced SuperH core for low-power multi-processor systems. Fall
Microprocessor Forum 2006
Chapter 5
Software Environments
5.1 Linux® on Multicore Processor
5.1.1.1 Introduction
The LL instruction sets the LL/SC private flag to 1 and reads 4-byte data into a general register. If, however, an interrupt, an exception,
or another core's data access occurs, the LL/SC private flag is cleared to 0.
Storage by the SC instruction only proceeds when the instruction is executed
after the LL/SC private flag has been set by the LL instruction and not cleared by an
interrupt, other exception, or other core’s data access. When the LL/SC private flag
has been cleared to 0, the SC instruction clears the T-bit in the status register and
does not execute its storage.
The difference between a TAS instruction and LL/SC instructions is indicated in
Table 5.2.
5.1.1.2 Implementation
We describe the atomic add operation using LL/SC instructions. The atomic add
operation needs two arguments: address (call by reference) and value (call by value).
The address argument specifies an address of the variable to be accessed. The value
argument is an immediate value to be added to the variable. The sequence of the atomic
add is as follows (a code sketch follows the list):
1. Load the data referenced by the address argument to a temporary register using
the LL instruction.
2. Add the value passed by the value argument to the temporary register using a
normal add instruction.
3. Try to store the value of the temporary register to the referenced address using
the SC instruction.
4. Check the condition of the SC instruction’s result (T-bit in status register). If
another core’s data access occurs between (1) and (3), the SC instruction will
fail, and the value will not be stored to the address. In this case, the above
sequence is retried from the first step.
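The four steps above can be written compactly with GCC inline assembly. The sketch below assumes that the LL/SC pair corresponds to the SH-4A movli.l/movco.l instructions and that the T-bit reports the result of the store-conditional, as described above; it is an illustration of the sequence, not a copy of the actual kernel source.

static inline void atomic_add_llsc(int value, volatile int *addr)
{
    unsigned long tmp;

    __asm__ __volatile__(
        "1:  movli.l @%2, %0  \n"   /* (1) LL: set the LL/SC private flag and load    */
        "    add     %1, %0   \n"   /* (2) add the value argument in a normal way     */
        "    movco.l %0, @%2  \n"   /* (3) SC: store only if the flag is still set    */
        "    bf      1b       \n"   /* (4) T-bit cleared -> SC failed, retry from (1) */
        : "=&z" (tmp)
        : "r" (value), "r" (addr)
        : "t", "memory");
}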
5.1.1.3 Evaluation
We evaluate SMP Linux performance of each implementation, using the TAS atomic
operation and LL/SC atomic operation. We use LMBench [2], which is widely used as
a benchmark program on UNIX-like systems, to compare the performance. The Linux
version on which we have implemented the SMP extension is shown in Table 5.3.
To compare each implementation, we calculate performance ratio using the follow-
ing formula:
5.1.1.4 Considerations
As described before, the LL/SC instructions do not lock the bus. On the other hand,
the TAS instruction locks the bus and causes huge overhead. Moreover, atomic oper-
ation using the TAS instruction requires complex implementation because the TAS
instruction only compares binary data (zero or nonzero). For example, the number of
CPU instructions to implement an atomic add operation is shown in Table 5.7.
The atomic add using TAS requires four times the number of CPU instructions com-
pared with LL/SC. This is also the case with other atomic operations. LMBench
results show this overhead. The advantage of LL/SC compared to TAS increases with
the number of atomic operations used in the benchmark [3].
5.1.2.1 Introduction
The RP-2 is a multicore chip with the following enhanced power-saving features:
• Power on/off control to each core
• Frequency control to each core
• Voltage control of the chip
Linux already has the power-saving frameworks CPU hot-plug and CPUfreq for
multicore processors [4, 5], but these frameworks have the following problems:
• No coordination between CPU hot-plug and CPUfreq: CPUfreq has governors that
control voltage and frequency dynamically according to the system load, but CPU
hot-plug does not have such a feature.
• No coordination between voltage control and core frequency control: the input
voltage to the chip limits the maximum frequency of the RP-2 CPU cores, so each
core frequency should be controlled in coordination with the input voltage, but
the CPUfreq governors do not have such a feature.
To resolve these issues, we have developed a new power-saving framework called
“idle reduction” based on CPU hot-plug and CPUfreq.
5.1.2.2 Implementation
Figure 5.1 and Table 5.8 show the structure of the idle reduction framework.
Figure 5.1 shows the major components of idle reduction that reduce power con-
sumption coordinately using CPUfreq and CPU hot-plug. CPUfreq can be con-
trolled by userspace governors. The available governors are listed in Table 5.9.
An example of idle reduction is as follows. If the multicore chip has no execution
load, idle reduction forces a CPU hot remove of all cores except the primary core
and automatically drops the primary core frequency to the minimum. After that, if two
threads become runnable, idle reduction forces a CPU hot add of one core (so that two cores are
alive) and raises the core frequencies step by step.
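The mechanisms that idle reduction drives can also be exercised by hand through sysfs. The C sketch below only illustrates those primitives, under the assumption of the standard CPU hot-plug and CPUfreq sysfs paths and the 75-MHz minimum frequency quoted in the evaluation below; the idle reduction governor logic itself (load monitoring and thresholds) is omitted.

#include <stdio.h>

static int sysfs_write(const char *path, const char *val)  /* write a string to a sysfs file */
{
    FILE *f = fopen(path, "w");
    if (!f)
        return -1;
    int ok = (fputs(val, f) >= 0);
    fclose(f);
    return ok ? 0 : -1;
}

int main(void)
{
    /* No load: hot remove a secondary core and drop cpu0 to the minimum frequency (kHz). */
    sysfs_write("/sys/devices/system/cpu/cpu3/online", "0");
    sysfs_write("/sys/devices/system/cpu/cpu0/cpufreq/scaling_governor", "userspace");
    sysfs_write("/sys/devices/system/cpu/cpu0/cpufreq/scaling_setspeed", "75000");

    /* Load returns: hot add the core again and raise the frequency step by step. */
    sysfs_write("/sys/devices/system/cpu/cpu3/online", "1");
    sysfs_write("/sys/devices/system/cpu/cpu0/cpufreq/scaling_setspeed", "150000");
    return 0;
}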
5.1.2.3 Evaluation
An evaluation was done using the following steps: first, we evaluated the instanta-
neous power consumption of CPU hot-plug and CPUfreq, and second, we evaluated
the total energy consumption of idle reduction with application loads. The Linux
version on which we implemented idle reduction is given in Table 5.10:
1. Power consumption of CPU hot-plug and CPUfreq.
The total power consumption of up to four CPU cores of the RP-2 using CPUfreq
and CPU hot add/remove is shown in Fig. 5.2. Each CPU core executes the Dhrystone
2.1 program at 75-MHz, 150-MHz, 300-MHz, or 600-MHz frequency controlled by
CPUfreq, or is stopped and powered off using CPU hot remove.
This figure shows that even when the total frequency, which can be translated into total
instructions per second (IPS), is the same, the power consumption differs. For example, if
each core frequency is set to 600, 300, 150, and 150 MHz, and the total is 1,200 MHz,
the power consumption is about 3.2 W where the chip voltage is 1.4 V. In another
case, all four core frequencies are set to 300 MHz and the total is 1,200 MHz; the
power consumption is about 1.9 W where the chip voltage is 1.2 V [6].
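This behavior is what the usual first-order dynamic-power model predicts; as a rough, hedged estimate that ignores leakage and simply uses the quoted chip voltages:

\[
P_{\mathrm{dyn}} \propto \Bigl(\sum_i f_i\Bigr) V^2 , \qquad
\frac{(4 \times 300\ \mathrm{MHz}) \times (1.2\ \mathrm{V})^2}{(600 + 300 + 150 + 150\ \mathrm{MHz}) \times (1.4\ \mathrm{V})^2} \approx 0.73 .
\]

Running all four cores at the lower common frequency allows the lower 1.2-V supply and is therefore expected to be cheaper at the same total frequency; the measured ratio (about 1.9 W / 3.2 W ≈ 0.6) is lower still, because leakage and other voltage-dependent components also shrink.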
2. Energy consumption of idle reduction.
We evaluated the energy consumption of idle reduction using a multi-thread
benchmark (Splash2 RAYTRACE) [7]. The evaluation was done by comparing it
with the old governors.
Table 5.11 and Fig. 5.3 show the energy consumption with no load (all cores are
in an idle state). The table presents a comparison of 10 s of energy consumption of
each governor. Idle reduction clearly had the lowest energy consumption, and
powersave had the second lowest. Idle reduction was able to reduce energy con-
sumption by 16% compared to the powersave governor.
Table 5.12 lists the energy consumption of the two-thread RAYTRACE bench-
mark; two cores have high loads (executing threads) and the other two cores have no
load (sleep or CPU hot remove). Idle reduction had the lowest energy consumption (103.3 Ws), and the conservative governor had the second lowest (111.1 Ws). Idle reduction was able to reduce energy consumption by 7% compared to the conservative governor.
[Fig. 5.4 Elapsed time and energy consumption with mixed loads: energy consumption (Ws) versus elapsed time (s) for idle reduction and the ondemand, conservative, powersave, and performance governors]
Figure 5.4 plots the elapsed time and energy consumption. The graph also shows
the execution time of the RAYTRACE benchmark. Idle reduction showed very good
performance per unit of energy consumed [8].
5.1.3.1 Introduction
The heterogeneous multicore RP-X described in Sect. 4.4 has the physical address
extension (AE) feature, which extends its physical address to 40 bits. We have
decided to extend the Linux HIGHMEM framework to support AE on RP-X.
1. Address Extension
Even if a CPU core has only a 32-bit virtual address space, it can access a physi-
cal address space of over 4 GB with the address extension (AE) feature. It enables
use of more than 4 GB of memory without having to modify the application
software.
2. HIGHMEM
Figure 5.5 is an overview of the Linux HIGHMEM framework. In the figure,
direct translation means that H/W translates from a virtual address to a physical
address. Indirect translation means S/W manages address translation, and a vir-
tual address is translated to a physical address on a memory page basis by using
a translation look-aside buffer (TLB).
The HIGHMEM framework separates the physical memory into two regions: straight-mapped memory, which corresponds to the straight-mapped area in the kernel virtual address space, and HIGHMEM memory, which is mapped into the kernel virtual address space only when needed, through the HIGHMEM area.
[Fig. 5.5 Overview of the Linux HIGHMEM framework: direct (hardware) translation of the straight-mapped kernel space and indirect (software-managed, TLB-based) translation of the HIGHMEM area, with the architecture-independent physical page allocation and the architecture-dependent HIGHMEM area allocation, PTE update, and TLB handling]
The RP-X has no separate I/O address space, so we do not need to worry about one. The
address of the memory-mapped I/O space must not be changed regardless of the
existence or nonexistence of HIGHMEM, in order to keep the driver’s source
code compatible. However, if I/O devices use DMA, which accesses the physical
address space directly, it is necessary to modify DMA’s physical address in the
driver’s source code.
5. Application Layer
By implementing the above design, application programs can run without any
modifications.
In summary, we have designed a new HIGHMEM framework according to
the following policy:
a. Place the HIGHMEM area in the virtual address space.
b. Leave the memory-mapped I/O area as it is.
c. Leave the user area as it is.
5.1.3.2 Implementation
The calling sequence of physical page allocation to use the HIGHMEM area is
shown in Fig. 5.6. This function returns the physical page address. In step (1), the
physical page allocator calls the HIGHMEM area allocator to access the HIGHMEM
area. In step (2), the HIGHMEM area allocator allocates the virtual address for a
HIGHMEM page and calls the page table entry (PTE) updater. In step (3), the PTE
updater updates the PTE. In step (5), a TLB miss occurs when Linux accesses a
virtual address, and the address is not registered in the TLB. At this time, the TLB
miss resolver calls the physical page allocator and calculates the physical address
from the TLB-missed virtual address and the PTE in step (4). In step (6), the TLB
miss resolver registers the combination of the virtual address and physical address
to the TLB.
As just described, the HIGHMEM feature implemented by the Linux kernel saves the combination of the physical address (over 32 bits) and the virtual address (under 32 bits) as the PTE, and the TLB is updated when a TLB miss occurs (Table 5.13).
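For reference, the same sequence is what a kernel programmer sees through the generic Linux HIGHMEM interface. The fragment below is a generic kernel sketch assuming the stock alloc_page()/kmap()/kunmap() API; it is not the RP-X-specific code.

#include <linux/gfp.h>
#include <linux/highmem.h>
#include <linux/mm.h>
#include <linux/string.h>

/* Generic Linux HIGHMEM usage: a page allocated from the HIGHMEM zone has
 * no permanent kernel mapping, so it is mapped into the HIGHMEM area before
 * the kernel touches it. */
static void highmem_touch_example(void)
{
    struct page *page = alloc_page(GFP_HIGHUSER);  /* may lie above 4 GB      */
    void *vaddr;

    if (!page)
        return;

    vaddr = kmap(page);          /* steps (1)-(3): allocate a virtual address
                                    in the HIGHMEM area and update the PTE    */
    memset(vaddr, 0, PAGE_SIZE); /* step (5): the access may miss in the TLB;
                                    the miss handler, steps (4) and (6), loads
                                    the over-32-bit physical address from the
                                    PTE into the TLB                          */
    kunmap(page);
    __free_page(page);
}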
5.1.3.3 Evaluation
Our HIGHMEM framework was evaluated to determine whether the existing appli-
cations and driver could be run without any modifications. The Linux version on
which we have extended the HIGHMEM framework is shown in the following table
(Table 5.14).
To evaluate application compatibility, LMBench and IOzone [9] were used for testing. Both benchmarks are suitable for verifying that Linux runs correctly. Development tools such as the cross compiler and cross linker were also tested using these benchmarks.
To maintain compatibility between drivers, none of the devices for RP-X needed
changes in their drivers’ source code except for the serial-ATA driver, which needed
a change in the DMA access function. This is because the RP-X’s serial-ATA device
cannot handle physical addresses over 32 bits (with HIGHMEM, Linux may allocate a DMA buffer at a physical address above 32 bits). This limitation is not RP-X specific; the same issue will occur when supporting PCI or PCI-Express devices that are limited to 32-bit DMA addressing [10].
5.2.1 Introduction
The application fields of embedded systems are rapidly expanding, and the func-
tionality and complexity of these systems are increasing dramatically [11]. Today’s
embedded systems require not only real-time control functions of traditional embed-
ded systems but also IT functions, such as multimedia computing, multiband net-
work connectivity, and extensive processing for database transactions. Facilitating
embedded systems’ many requirements calls for new system architectures.
One approach for designing system architectures is to integrate multiple operat-
ing systems on a multicore processor. In this approach,
• Heterogeneous operating systems run different types of applications within the
multicore processor.
• A real-time operating system delivers real-time behavior such as low latency and
predictable control function performance.
• A versatile operating system processes applications developed for IT systems.
However, this system architecture has a drawback. An unintentional failure of one
operating system could overwrite important data and codes and bring down not only
that operating system but others as well. This can occur because a CPU core, which
executes operating system codes, can access any hardware resource on a multicore
processor. We therefore need a partitioning mechanism to isolate any unintentional
operating system failure within a domain to prevent it from affecting systems in other
domains. A domain is a virtual resource-management entity that executes operating
system codes in a multi-operating system integrated on a multicore processor.
System engineers have developed several partitioning mechanisms for servers
and high-end desktop systems [12, 13]. However, these mechanisms are unsuitable
for an embedded multicore processor equipped with only the minimally required
resources. Rather, they are for multiprocessor systems in which many processors
share large amounts of memory and I/O devices. The mechanisms cannot divide a
small memory system into areas nor segment a device into groups of channels to be
assigned to multidomains on the multicore processor. We have therefore developed
a low-overhead domain-partitioning mechanism for a multidomain embedded sys-
tem architecture that protects a domain from being affected by other domains on an
embedded multicore processor. Additionally, we fabricated a multicore processor
that incorporates a physical partitioning controller (PPC), which is a hardware sup-
port for the domain-partitioning mechanism.
Figure 5.7 shows our multidomain system, built on a multicore processor. The system
architecture lets the designer assign domains to the different CPU cores and imple-
ment them independently in each core. Applications and operating systems in both
domains are largely unaware of each other. Domains might exchange information and coordinate tasks, but there is no dynamic load balancing. The task assignment is fixed, and hardware resources can be dedicated to the domains, resulting in more deterministic performance. Despite some possible memory overhead due to multiple operating system images in the main memory, this feature is one of the system architecture's most significant advantages for embedded system developers.

[Fig. 5.7 Multidomain system on a multicore processor: an RT domain on CPU #0 (transmission and brake control) and an IT domain on CPU #1, with the DMAC, INTC, timer, SCIF, GPIO, ROM and RAM controllers, DU, and PCI partitioned between them]
As the size and complexity of embedded systems increase, so do the chances that a
system will break down because of software malfunctions or attacks over the network.
Although operating systems isolate software failures within an application, a failure
could affect the operating system itself, causing it to bring down all applications run-
ning on it because operating systems, especially versatile ones, are becoming larger
and more complex.
In developing control subsystems whose failure might endanger a person’s life,
such as an automobile’s brake control system, an engineer tries to achieve a high
level of safety by every conceivable means. However, even a safe and secure control
subsystem can be affected if it is incorporated with IT subsystems into a multido-
main system on a multicore processor.
Our domain-partitioning approach helps to isolate failures within unreliable IT
domains rather than let them affect control domains on the multidomain embedded
system. This domain partitioning protects a domain from being affected by other
domains in the multidomain system and maintains the system’s safety and security by
• Allocating multicore processor resources for each domain to let it run its own
operating system and applications [16]
• Protecting a domain from the effects of software failure in other domains and
ensuring that only the domain causing the failure is affected
• Resetting the domain and rebooting its operating system without letting the other
domains observe any of the failure’s effects
• Lowering system performance overhead to less than 5% to implement fault
isolation
Partitioning techniques with hardware support fall into two categories: physical par-
titioning and logical partitioning [12].
With physical partitioning, each domain uses dedicated processor resources. In the
multidomain system in Fig. 5.8, system designers allocate each CPU core and each
group of channels in multichannel devices—DMA controllers (DMAC), timer units
(TMU), and serial communication interfaces (SCIF)—and other devices, such as
the display unit (DU), PCI, and general-purpose I/O (GPIO), to one of the domains.
Each allocated resource is physically distinct from the resources used by the other
domain. Although the domains share the on-chip system bus, each transaction belongs to a single domain, so one domain cannot affect the other's transactions in any respect other than bandwidth.
In physical partitioning, each partition’s configuration—that is, the resources
assigned to a domain—is controlled in the hardware (such as the partition con-
troller in Fig. 5.8), because physical partitioning does not require sophisticated
algorithms to schedule and manage resources. When the system boots up, the
partition controller sets up the hardware resources to be used in a partition according to the partition-configuration settings.
[Fig. 5.9 Logical partitioning: the RT domain and the IT domain share hardware resources through a hypervisor]
With logical partitioning, domains share some physical resources, usually in a time-
multiplexed manner. Thus, logical partitioning makes it possible to run multiple
operating system images on a single hardware system, which enables dynamic
workload balancing. Logical partitioning is used to implement virtual machines on
PC servers and mainframes to optimize utilization of hardware resources.
Logical partitioning is more flexible than physical partitioning but requires
additional mechanisms to provide the services needed to share resources safely and
efficiently. Usually, a hypervisor—that is, a programming layer lower than the
operating system and hidden from general system users (see Fig. 5.9)—controls
each partition’s configuration. When the system boots up, the hypervisor sets up
the hardware resources for use in a partition according to the partition-configuration
commands. Once a partition is configured, the hypervisor loads the operating sys-
tems and applications into each partition, and a domain on each partition starts to
run them. During the execution of the operating system and applications on the
partition, the hypervisor traps every hardware resource access request that the
operating system and applications generate, in order to check their authenticity and
to provide the requested resource services if authorized.
Optimizing hardware utilization is one of the main goals of logical partitioning.
To achieve this goal, logical partitioning sacrifices the partition’s physical isolation
in exchange for greater flexibility in dynamically allocating resources to partitions.
It also imposes performance penalties because the hypervisor is implemented in
software layers.
Figure 5.10 is a block diagram of the multicore processor we used to implement the
proposed domain-partitioning technique. The processor is a multicore chip contain-
ing four SH-4A processor cores, each of which is a 32-bit RISC microprocessor
containing an instruction cache, data cache in write-back mode, and memory man-
agement unit (MMU) with a translation look-aside buffer (TLB), which supports a
32-bit virtual address space. The SH-4A cores maintain consistency between data
caches and share instruction/data unified L2 cache in write-through mode. The pro-
cessor incorporates a DDR3-SDRAM memory controller (DBSC), a local bus state controller (LBSC), and the other peripheral modules shown in Fig. 5.10.
[Fig. 5.10 Block diagram of the multicore processor: CPU cores, DMAC, TMU, SCIF, SDIF, SSI, HSPI, I2C, HAC, USB, DU, PCIe, GPIO, CPG, WDT, INTC, and the ROM and RAM controllers. Fig. 5.11 Example access checklist: entries such as "CPU #2 Mem A#0 RW," "DMAC 0-5 Mem A#0 RW," "DU Mem A#0 R," and "CPU #0 GPIO RW" define which initiator may read or write which target]
The PPC is located between the access initiator modules and the access target mod-
ules. It checks every access request and blocks requests that are not authentic. The
PPC contains an access checklist (ACL) to set access authorization rules, and the
ACL defines the processor’s partition configuration. The ACL consists of several
register entries, each having three fields:
• An SRC field, which specifies an access initiator
• A DEST field, which specifies an access target
• An AUTH field, which specifies authorized operation for both the SRC and DEST fields
The processor has multiple-channel devices, such as a DMAC, PCIe, TMU,
SCIF, audio codec I/F (HAC), serial sound I/F (SSI), and I2C. The PPC segments
them into groups of channels and recognizes each group as a separate module so
that each domain uses one group’s function exclusively. The PPC also segments
RAM and ROM into several memory areas so that a domain can use them as private
memory. Moreover, several domains can access a shared RAM area to communicate
with each other. Therefore, the PPC recognizes each initiator and target, indicated
in Fig. 5.11, as separate modules.
[Fig. 5.12 PPC structure: each ACL entry holds an address and address mask, an SrcID and SrcID mask, and read/write permission bits; an incoming set of operation, address, and SrcID is compared with all entries, and a deny signal is generated when no entry matches]
We assigned an SrcID to each initiator module and used a control register address
as an identifier for the DEST field. The size of the address range should be a power
of 2, and its start address should be a multiple of the alignment, which must be a
power of 2 and a multiple of the size of the address range.
Figure 5.12 shows the PPC structure. The SRC field consists of an SrcID and an
SrcID mask; the DEST field consists of an address and address mask; and the AUTH
field consists of two bits—one for read permission and one for write permission.
The PPC checks every access request by comparing a set consisting of an operation,
target address, and SrcID with all ACL entries using the logical circuit shown in the
figure. When the PPC finds a match for an ACL entry, it authorizes the access
request, and the PPC passes the access request to the target module. When the PPC
finds no match for an ACL entry, it does not authorize the access request; rather, the
PPC blocks it and generates a deny signal to start error handling.
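The check can be modeled in a few lines of C. The sketch below is only a simplified software model of one ACL entry and of the matching rule in Fig. 5.12; the field widths, the mask semantics, and the layout are assumptions made for illustration and do not reproduce the RP-X register map.

#include <stdbool.h>
#include <stdint.h>

/* Simplified model of an ACL entry (SRC, DEST, and AUTH fields). */
struct acl_entry {
    uint32_t addr;         /* DEST: base of the target address range          */
    uint32_t addr_mask;    /* DEST: mask selecting the compared address bits  */
    uint8_t  src_id;       /* SRC:  initiator identifier                      */
    uint8_t  src_id_mask;  /* SRC:  mask selecting the compared SrcID bits    */
    bool     rd, wr;       /* AUTH: read and write permission bits            */
};

/* Returns true when at least one ACL entry authorizes the access; otherwise
 * the PPC blocks the request and generates the deny signal. */
static bool ppc_check(const struct acl_entry *acl, int n,
                      uint8_t src_id, uint32_t addr, bool is_write)
{
    for (int i = 0; i < n; i++) {
        bool src_match  = ((src_id ^ acl[i].src_id) & acl[i].src_id_mask) == 0;
        bool dest_match = ((addr   ^ acl[i].addr)   & acl[i].addr_mask)   == 0;
        bool op_allowed = is_write ? acl[i].wr : acl[i].rd;

        if (src_match && dest_match && op_allowed)
            return true;   /* pass the request to the target module */
    }
    return false;          /* no match: block the request */
}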
The PPC modules are located between the internal system bus and the bus-target
modules—that is, the DBSC, SHPB and HPB bus bridges, PCIe, and DMAC (see
Fig. 5.13). The PPC has six subblocks—DBSC-PPC, LBSC-PPC, SHPB-PPC, HPB-PPC, DMAC-PPC, and PCI-PPC—each of which has its own set of registers and ACLs. Table 5.15 lists the number of ACL entries of each PPC. For SHPB-PPC, HPB-PPC, DMAC-PPC, and PCI-PPC, the initiators are CPU cores, and the targets are modules connected to the bus-target modules; therefore, an ACL entry is needed for each target to authenticate an access from the CPU cores of the domain. For LBSC-PPC, CPU cores gain access to two ROM areas that are each allocated to a domain; therefore, LBSC-PPC needs an ACL entry for each ROM area. For DBSC-PPC, the initiators, which are two CPU cores and nine initiator modules, access two dedicated RAM areas and a shared RAM area; therefore, this subblock needs 11 entries for the dedicated RAM areas and 11 entries for the shared RAM area.

[Figure: software configuration of the two domains: an RTOS with real-time control applications and the PPC error handler in the RT domain, and a versatile OS with IT applications in the IT domain, each with its own interrupt handler, over the shared ROM and RAM]

Table 5.15 Number of access checklist (ACL) entries for each physical partitioning controller

PPC        Initiator                          Target                              No. of ACLs needed   No. of ACLs implemented
LBSC-PPC   2 CPU cores                        2 ROM areas                         2                    4
DBSC-PPC   2 CPU cores, 9 initiator modules   2 RAM areas allocated to a domain,  11 + 11              32
                                              1 RAM area shared by domains
SHPB-PPC   2 CPU cores                        4 target modules                    4                    8
HPB-PPC    2 CPU cores                        27 target modules                   27                   32
DMAC-PPC   2 CPU cores                        2 target modules                    2                    4
PCI-PPC    2 CPU cores                        3 target modules                    3                    4
When an access request matches no ACL entry, the PPC judges the access to be inauthentic and rejects it. The PPC then sends an error response to the internal system
bus instead of passing the access request to the target module. The PPC also gener-
ates an access-violation interrupt signal, which is transmitted to the INTC.
The interrupt controller (INTC) prioritizes interrupt sources and controls the
flow of interrupt requests to the CPU. The INTC has registers for prioritizing each
interrupt, and it processes interrupt requests following the priority order set in
these registers by the program. Most of these registers are system registers, so they
cannot be physically partitioned. Therefore, we assumed that the real-time domain
would be more reliable than the IT domain, so we decided that CPU #0, which
houses the real-time domain, should be allowed to access the INTC registers and
that the IT domain should send requests to the real-time domain for operation on
the registers.
When the INTC receives an access-violation interrupt signal, the execution
jumps to the start address of the PPC error-handling routine. Each PPC subblock has
registers that determine the access-violation interrupt signal’s behavior and hold the
information on rejected access requests. Based on this information, the PPC error-
handling routine classifies the access violation’s seriousness and decides whether
the system should be rebooted.
5.2.6 Evaluation
Table 5.16 Performance evaluation results for memory latency and context switching times using LMBench

                            Memory latency (ns)                      Context switching times (ms)
                                                                     (no. of processes / process image size in bytes)
Operating system            L1 cache   Main memory   Random memory   2p/64 k   8p/64 k   16p/64 k
Linux                       4.99       148.30        1,842.05        11.10     43.50     48.05
Linux + PPC error handler   4.99       150.50        1,912.95        11.50     45        49.40
Overhead (%)                0.00       1.48          3.85            3.60      3.45      2.81
Table 5.17 Performance evaluation results for processing times using LMBench

                            Process times (ms)
Operating system            Null call   Null I/O   Signal install   Signal handling   Fork processing   Execution processing
Linux                       0.41        0.79       1.63             13.80             3,906.50          4,432
Linux + PPC error handler   0.48        0.87       1.70             13.95             3,911             4,452.50
Overhead (%)                17.07       10.13      4.29             1.09              0.12              0.46
The PPC error handler checked whether an interrupt or exception was related to the PPC and passed the processing to the appropriate normal service routine in Linux if it was unrelated; when a serious access-violation error occurred, the PPC error handler rebooted Linux.
To observe the domain partitioning using PPC, we injected access-violation
errors by configuring PPC so that it did not allow applications running on Linux to
access a small memory area assigned to Linux. When an application wrote some
data into the small memory area, the PPC rejected the write access request so that
no data were written in the memory area, and the PPC generated an access-violation
interrupt signal to initiate the PPC error handler to reboot Linux.
Tables 5.16 and 5.17 indicate the overhead of the domain partitioning using the
PPC. The average performance penalty was 2.49%, and the overheads were typi-
cally less than 5%. In memory latency cases, the overheads were due only to the
additional bus access cycle generated by implementing PPC because the PPC error
handler was not initiated during the tests; thus, the overhead was 0.0% for “L1
cache” of the LMBench, 1.48% for “main memory,” and 3.85% for “random mem-
ory.” We presumed that the difference in overhead between the main and random
memories was due to the effect of the CPU core store buffers. The worst cases of
overhead were 17.07% for null call and 10.13% for null I/O. We attributed this
overhead to the PPC error handler because “null call” and “null I/O” are system
calls that only generate exceptions that trigger the PPC error handler’s execution.
Implementing the PPC error handler into the Linux service routine using a paravir-
tualization approach [18] could reduce the overhead.
References
1. Yamamoto H, Takata H (2004) Porting Linux to a Single Chip Multi-processor. The 66th
National Convention of Information Processing Society of Japan, Kanagawa, Japan
2. LMbench: http://lmbench.sourceforge.net/
3. Idehara A, Tawara Y, Yamamoto H, Ochiai S (2007) Development of SMP Linux for embed-
ded multicore processor. Embedded Systems Symposium 2007, Tokyo, Japan, pp 226–232
4. Brock B, Rajamani K (2003) Dynamic power management for embedded systems. Proceedings
of the IEEE International SOC Conference 2003, Portland, USA, pp 416–419
5. IBM, MontaVista (2002) Dynamic power management for embedded systems. http://www.research.ibm.com/arl/publications/papers/DPM_V1.1.pdf, accessed July 2008
6. Idehara A, Tawara Y, Yamamoto H, Sugai N, Iizuka T (2008) An evaluation of dynamic power
management support of SMP Linux for embedded multicore processor. Embedded Systems
Symposium 2008, Tokyo, Japan, pp 115–123
7. Woo SC, Ohara M, Torrie E, Singh JP, Gupta A (1995) The SPLASH-2 programs: character-
ization and methodological considerations. Proceedings of the 22nd International Symposium
on Computer Architecture, Santa Margherita Ligure, Italy, pp 24–36
8. Idehara A, Tawara Y, Yamamoto H, Ohtani H, Ochiai S (2009) Idle reduction: dynamic power
manager for embedded multicore processor. Embedded Systems Symposium 2009, Tokyo,
Japan, Oct 2009, pp 5–12
9. IOzone: http://www.iozone.org/
10. Idehara A, Tawara Y, Yamamoto H, Motai H, Ochiai S, Matsumoto T (2010) Design and
implementation of Linux HIGHMEM extension for the embedded processor. Embedded
Systems Symposium 2010, Tokyo, Japan, Oct (2010), pp 75–80
11. Ebert C, Jones C (2009) Embedded software: facts, figures, and future. Computer 42(4):42–52
12. Smith JE, Nair R (2005) Virtual machines: versatile platforms for systems and processors.
Morgan Kaufmann, MA, USA
13. Sun Microsystems (1999) Sun Enterprise 10000 Server: Dynamic System Domains, white
paper; http://www.sun.com/datacenter/docs/domainswp.pdf
14. Takada H, Honda S (2006) Real-time operating system for function distributed multiproces-
sors. J Inform Proc Soc Jpn 47(1):41–47
15. Kruger C (1992) Software reuse. ACM Comput Surv 24(2):131–183
16. Nesbit KJ et al (2008) Multi-core resource management. IEEE Micro 28(3):6–16
17. Uhlig R et al (2005) Intel virtualization technology. Computer 38(5):48–56
18. Barham P et al (2003) Xen and the art of virtualization. Proc 19th ACM Symp Operating
Systems Principles ACM, pp 164–177
19. McVoy LW, Staelin C (1996) Lmbench: portable tools for performance analysis. Proc. Usenix
Ann. Technical Conf., Usenix Assoc., pp 279–294
20. Nojiri T et al (2009) Domain partitioning technology for embedded multi core processors.
IEEE Cool Chips XII:273–286
21. Nojiri T et al (2010) Domain partitioning technology for embedded multicore processors.
IEEE Micro 29(6):7–17
Chapter 6
Application Programs and Systems
The evaluated chip is equipped with two homogeneous CPU cores and two accel-
erators (FE-GA) [2], which are described in Sect. 3.2. Figures 6.1 and 6.2 show a
block diagram and micrograph of the chip, respectively [3, 4]. The chip has two
SH-4A (SH) cores capable of multicore functions such as cache snooping, a 128-
KB on-chip shared memory (CSM), a DMAC, and two FE-GAs. The SH cores have
several types of local memories and a data transfer unit (DTU). The local memories
include a 128-KB users’ RAM (URAM) as a distributed shared memory, a 16-KB
operand local RAM (OLRAM) as a local data memory, and an 8-KB instruction
local RAM (ILRAM) as a local program memory. The FE-GAs also have a 40-KB
local memory (4 KB × 10 banks) that can be accessed from its internal load/store
cells as well as other processor cores. All the memories are distributed shared types,
which means they are address-mapped globally.
The SH cores are also equipped with an instruction cache and a coherent data cache, corresponding to the ILRAM and OLRAM, respectively. In our use model, the data cache is normally utilized for non-real-time applications. In contrast, the OLRAM is used for real-time applications because data placement on the OLRAM can be controlled explicitly.
[Fig. 6.1 Block diagram of the evaluated chip: two SH cores with local memories (LRAM, URAM) and DTUs, two FE-GAs each with a 40-KB local memory (LM) and an ALU array, the CSM, the DMAC, and a split-transaction bus (STB), with a 128-MB off-chip SDRAM as main memory. Fig. 6.3 AAC encoding: (a) processing flow (frame read, filter bank, M/S stereo, quantization, Huffman coding, and bit-stream generation) and (b) profiling results on a CPU (Huffman coding 4%, bit-stream generation 8%)]
The encoding process for the AAC consists of the use of a filter bank and mid-side
(M/S) stereo, quantization, Huffman coding, and bit-stream generation. The process
is performed frame by frame, which is a unit of sampled points in input pulse-code
modulation (PCM) data. Figure 6.3 outlines the process flow and profiling results
for the AAC encoding. The profiling results in (b) indicate that the filter bank, M/S
stereo, and quantization account for 89% of the total encoding time. Table 6.2 lists
the specifications of the encoder used for the evaluation.
The encoder program was investigated thoroughly to confirm its suitability for
processing on the FE-GA at every encoding stage. The filter bank is a band-pass
filter separating the input audio signal into several components of frequency sub-
bands. The calculation of the filter bank is composed of additions to and multiplica-
tions of the streaming data, which is suitable for processing on the FE-GA. The M/S
stereo extracts parts of the frequency sub-bands that appear in both left and right
channels. The calculation consists of additions to and subtractions of the left and
right sub-bands, and it is thus implemented on the FE-GA. Quantization constrains
the output value of the filter bank to a discrete set of values in accordance with the
specified bit rate. The calculation is a power of 3/4 to the data. The evaluated pro-
gram contains a table reference, which is implemented on the FE-GA. Huffman
coding assigns shorter coding symbols to more frequently appearing bit strings for
compression. In the implementation, quantization and Huffman coding iterate after
the step value for quantization is increased, until the amount of encoded data satisfies
a given bit rate. Since the coding length of bit strings is not fixed, it is difficult to
improve the performance with the FE-GA, and thus, a CPU is used for the Huffman
coding. Bit-stream generation arranges coded symbols in compliance with the AAC
stream format. A CPU is used to generate bit streams.
We developed the configurations of the FE-GAs for the filter bank, M/S stereo,
and quantization for the evaluation. The configurations for the filter bank and M/S
stereo were merged because the M/S stereo continuously follows the filter bank
process. The execution cycles were measured both on an FE-GA and on a single
CPU, as indicated in Table 6.3. Note that the FE-GA cycles are converted to CPU cycles since the FE-GAs operate at 300 MHz, half of the CPUs' 600-MHz frequency. Introducing FE-GAs for the merged filter bank and M/S stereo and for quantization yields 24- and 7.7-fold speedups over sequential execution on a CPU.
Each processor core has a data transfer unit (DTU) attached to an internal bus con-
nected to the local memories. The DTU simultaneously transfers data between local
memories on different processor cores, between a local memory and on-chip CSM or
off-chip main memory, or between the on-chip CSM and the off-chip main memory, in the background of task execution on the processor cores. The DTU is also equipped with flag-set and flag-check commands. The DTU sets a flag with a number specified in a command in the flag-set mode. In the flag-check mode, it reads the value of the flag and checks its correspondence with the number specified in the command.

[Fig. 6.4 Example of a DTU transfer list: a FLAG CHECK command (flag address, check value, check interval, next-command pointer) linked to TRANSFER commands (source address, destination address, transfer size, next-command pointer)]
The DTU interprets a transfer list, which is a set of DTU commands placed on the
local memory. Different types of transfers can be defined in advance, and thus, the
DTU can operate independently behind a CPU. Furthermore, the setup time, such as
that to register operations, is also reduced, which results in improved performance.
Figure 6.4 shows an example of a DTU transfer list. Each command is linked by a
pointer. The following explains how the list is interpreted. First, a CPU initiates a DTU
by setting up a start-up register for the DTU and specifying an address for the first com-
mand to be interpreted. The DTU starts to perform a flag-check. It checks a flag on a
memory and compares it to the specified value in the command. The correspondence
of the two values enables the DTU to read the next command, which is specified as a
pointer address in the original command. The flag-check interval cycles are optionally
specified with the aim of restraining bus traffic or reducing power consumption. In the
next command, data are transferred from the source address to the destination address
of the specified size. As soon as the transfer has finished, the DTU reads the next com-
mand, which is another data transfer in this example. After the transfer, the DTU then
sets a flag with the specified value on the specified address.
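The transfer list can be pictured as a linked list of command records placed in local memory. The C sketch below is a schematic reconstruction based only on the fields named in Fig. 6.4 (flag address, check value, check interval, source and destination addresses, transfer size, and next-command pointer); the real binary layout of the DTU commands is not given in the text and is therefore an assumption.

#include <stdint.h>

/* Schematic model of a DTU transfer list: commands linked by pointers. */
enum dtu_op { DTU_FLAG_CHECK, DTU_TRANSFER, DTU_FLAG_SET };

struct dtu_cmd {
    enum dtu_op op;
    union {
        struct {                      /* FLAG CHECK                          */
            uint32_t flag_addr;       /* address of the flag to poll         */
            uint32_t check_value;     /* proceed when the flag equals this   */
            uint32_t check_interval;  /* polling interval (cycles)           */
        } flag_check;
        struct {                      /* TRANSFER                            */
            uint32_t src_addr;
            uint32_t dst_addr;
            uint32_t size;
        } transfer;
        struct {                      /* FLAG SET                            */
            uint32_t flag_addr;
            uint32_t set_value;
        } flag_set;
    } u;
    struct dtu_cmd *next;             /* pointer to the next command         */
};

/* The CPU only writes the address of the first command into the DTU's
 * start-up register; the DTU then walks the list by itself, so the CPU can
 * keep executing its own task in parallel. */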
The DTU also supports data packing for nonaligned burst transfers. Users do not need to be concerned about the alignment of data placement in memory, and the utilization of the bus is also improved. In addition, it supports a stride transfer mode that enables
gathered/scattered data transfers. This is effective for applications using transfers of
rectangular regions on a memory, for example, image handling, because these transfers
can be completed with one transfer command.
[Fig. 6.5 DTU implementation on the evaluated chip: the DTU of CPU#0 (SH core) interprets command lists in CPU#0's local memory (LM) and moves data and flags between CPU#0's LM and FE-GA#0's LM through the bus interfaces; CPU#1 and FE-GA#1 are organized in the same way]
Figure 6.5 outlines the DTU implementation on the evaluated chip with an example
diagram of its operation. Transfer lists, data, and flags are placed in the local mem-
ory (LM). The example shows that the DTU interprets the command on CPU#0’s
LM and transfers data from CPU#0's LM to FE-GA#0's LM.
In order to maximize the performance of the encoding process, the on-chip and off-
chip memories are used as follows. The encoding is done frame by frame. Input PCM
data and output AAC streams are stored in the off-chip main memory (SDRAM). Before
every frame is processed, the PCM frame data are transferred to the URAM of a target
CPU. Intermediately generated data are also placed on the URAM. For processes on an
FE-GA, data are transferred from the URAM to the local memory of a target FE-GA
before they are executed, and processed data stored on the local memory are transferred
to the URAM of the target CPU after they are executed.
The processing time for AAC encoding was evaluated for the following data trans-
fer methods: by a CPU, by a DMAC, by a DTU without transfer lists, and by a DTU
with the lists on a configuration of one CPU and one FE-GA. The encoding options
and conditions are described in Table 6.2 with music-1 adopted for the evaluation.
Figure 6.6 shows the improved performance with various data transfer methods as a
result. Encoding on one CPU resulted in 58.2 s of execution time. The encoding
speedup rate is 3.3, which is calculated from the length of input music, which is
192 s. By introducing an FE-GA with data transferred by a CPU, the encoding time
is 14.1 s, which is 13.6 times the encoding speed; compared with the CPU-only execution, the FE-GA thus contributes a 4.1-fold speedup. Next, encoding with DMAC
transfers resulted in an encoding time of 10.1 s, which is 20.1 times the encoding
speed. Furthermore, with DTU transfers without transfer lists, the encoding time
was 7.9 s, which is 24.2 times the encoding speed. Finally, evaluation with DTU
transfers operated by transfer lists resulted in an encoding time of 7.5 s, which is
25.6 times the encoding speed.
[Fig. 6.6 Execution time and encoding speedup for the different data transfer methods: CPU transfer on one CPU, and CPU, DMAC, DTU (without transfer lists), and DTU (with transfer lists) transfers on one CPU plus one FE-GA]
The evaluation results indicate that the efficient use of accelerators for process
executions and DTUs for data transfer plays an active role in improving performance.
Performance with the DTU was better than that with the DMAC because twice as many transactions on the interconnection bus are required with the DMAC as with the DTU, and this bus is slower than the CPU-internal bus connected directly to the
URAM and DTU. The beneficial effect of the DTU transfer lists is due to a reduction
in the number of DTU register setups for multiple transfers to the banks of the local
memory in the FE-GA. Since the FE-GA has multiple banks of memory, divided data
are placed in different banks, and transfers are done multiple times. As a result, the
number of DTU operations is reduced by utilizing transfer lists.
We measured the performance of AAC encoding on the evaluated chip. The evalu-
ation included the execution time and average power consumed in the encoding.
The encoding process was mapped to the four processor cores as outlined in Fig. 6.7.
For simple implementation of parallel processes, two streams of encoding were
individually assigned to a pair of one CPU and one FE-GA. However, processing
tasks of the encoding on both a CPU and an FE-GA in parallel will be achieved by
utilizing inter-frame parallelism.
The evaluation was done under the conditions listed in Table 6.2. The perfor-
mance was measured with double input streams of music-2. In other words, the
input stream was encoded twice on one CPU and one FE-GA, and the two streams of the same input music were encoded simultaneously on two CPUs and two FE-GAs. Input PCM and output AAC stream data were placed in the off-chip main memory. The DTU transferred data by using transfer lists.

[Fig. 6.8 Encoding speedup and measured power consumption: CPU×1 (1 stream) ×4.0 at 1.17 W, CPU×2 (2 streams) ×8.0 at 1.36 W, CPU×1+FE×1 (1 stream) ×27.1 at 1.22 W, and CPU×2+FE×2 (2 streams) ×54.1 at 1.46 W; the corresponding encoding speeds per watt are 3.4, 5.8, 22.2, and 37.1 xEnc/W]
Figure 6.8 plots the evaluation results. The speedup was 4.0 and the average
power consumption was 1.17 W with encoding on a single CPU. The encoding
speedup was 8.0 and the power consumption was 1.36 W on the homogeneous
multicore with two CPUs. The encoding speedup was 27.1 and the power consump-
tion was 1.22 W on the heterogeneous multicore with one CPU and one FE-GA.
Finally, the encoding speedup was 54.1 and the power consumption was 1.46 W on
two CPUs and two FE-GAs. The heterogeneous multicore configuration outper-
formed the homogeneous multicores. Even though the power consumption increases as the number of processor cores increases, the encoding speedup grows much faster. To evaluate the power-performance efficiency of the heterogeneous multi-
core configuration, the index of encoding speed per W [xEnc/W] was calculated for
all the evaluated configurations as listed in the bottom of Fig. 6.8. Sequential execu-
tion on a single CPU resulted in 3.4 xEnc/W. Parallel execution on a configuration
of two CPUs and two FE-GAs resulted in 37.1 xEnc/W, which is 10.9 times better
power-performance efficiency.
Figure 6.9 is a Gantt chart of one-frame encoding on one CPU and one FE-GA.
The filter bank and quantization were processed on the FE-GA, and DTU data transfers
were performed between executions on the CPU and the FE-GA.
6.2.1 MX Library
[Figure: software stack of the MX library: user applications call an API over an image/signal processing library (filters, FFT, etc.) and application-specific library functions, which run over the on-chip bus and the SDRAM interface]

[Figure: from the tracking information produced by the S-T MRF algorithm, semantic information such as "Count," "Run," and "Congestion" is derived]
6.2.2 MX Application
[Figure: motion vector map and object map; blocks carrying the same object number (e.g., 41 or 17) belong to the same tracked object]
An overview of the S-T MRF algorithm is shown in Fig. 6.14. This algorithm estimates the boundary of each object based on the motion vectors. First, the motion
vectors of each object are extracted by comparing the previous and current image
frames. The extracted motion vectors are mapped onto the motion vector maps.
With these motion vector maps, the object maps including the boundary informa-
tion of objects are also generated. In the process of generating these object maps,
the boundaries of objects are evaluated by a high-level algorithm and then opti-
mized. In this way, robust and stable object tracking is achieved by updating each
map in every image frame.
As shown in Fig. 6.15, the S-T MRF can be divided into two layers, that is, object
map creation and motion vector extraction. In applications that use the S-T MRF
algorithm, the event detection algorithm is added in the application layer. The vol-
ume of data is reduced as the software level is elevated because the higher-level
layer does not need to handle the pixel data. These operations on each layer are
executed independently; therefore, the thread parallel processing can be applied.
Figure 6.16 is an overview of the thread parallel processing of the S-T MRF
application on the proposed SoC architecture. The motion vector extraction that
processes large-volume data is assigned to the MX core. The object map creation is
assigned to CPU#0, and event detection is assigned to CPU#1. Each thread com-
municates with the other threads by giving and receiving the information. For exam-
ple, the object map creation gives the pixel pointer to the motion vector extraction
and receives the motion vector map. Thus, effective thread parallel processing can
be achieved with this scheme.
[Fig. 6.15 Software layers of the S-T MRF application: the data volume decreases from the motion vector extraction (SAD operation) at the bottom, through object map creation, up to event detection in the application layer. Fig. 6.16 Thread mapping onto the hardware: motion vector extraction on the MX core, object map creation on CPU#0, and event detection on CPU#1, exchanging frame pointers, pixel pointers, and maps]
Figure 6.17 illustrates the frame pipelining technique for the thread parallel pro-
cessing shown in Fig. 6.16. The horizontal axis is the time, and the time unit refers
to the processing time of each frame. Each output of the thread is given to another
thread that is processed one time unit later. Parallel execution of the threads is
achieved with this frame pipelining technique.
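One minimal way to express this pipeline in software is to run the stages as threads that synchronize once per frame time, each stage lagging its producer by one frame. The pthread sketch below only illustrates that scheduling pattern; the stage bodies are placeholders, and in the real system the stages are bound to the MX core, CPU#0, and CPU#1 as described above.

#include <pthread.h>
#include <stdio.h>

#define FRAMES 8
static pthread_barrier_t frame_barrier;

/* Each stage lags the input by its stage index, so at any time unit the
 * three stages work on three consecutive frames. */
static void *stage(void *arg)
{
    int id = *(int *)arg;  /* 0: motion vector extraction, 1: object map
                              creation, 2: event detection               */
    for (int t = 0; t < FRAMES + 2; t++) {
        int frame = t - id;
        if (frame >= 0 && frame < FRAMES)
            printf("time %d: stage %d processes frame %d\n", t, id, frame);
        pthread_barrier_wait(&frame_barrier);   /* end of one time unit */
    }
    return NULL;
}

int main(void)
{
    pthread_t th[3];
    int ids[3] = {0, 1, 2};

    pthread_barrier_init(&frame_barrier, NULL, 3);
    for (int i = 0; i < 3; i++)
        pthread_create(&th[i], NULL, stage, &ids[i]);
    for (int i = 0; i < 3; i++)
        pthread_join(th[i], NULL);
    pthread_barrier_destroy(&frame_barrier);
    return 0;
}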
In the S-T MRF algorithm, the motion vector extraction is based on the sum of absolute differences (SAD) between sequential frames. The SAD algorithm is very useful for evaluating the similarity between two frames, as depicted in Fig. 6.18.
[Fig. 6.17 Frame pipelining: each stage processes one frame per time unit, and its output (ultimately the traffic situation, e.g., "Run" or "Congestion") is consumed by the next stage one time unit later]
[Fig. 6.18 SAD operation: the per-pixel absolute differences between an 8 × 8 block of image data and an 8 × 8 block of template data are summed into a single SAD value]
The absolute difference of each paired pixel between the two pixel blocks is first
determined and then their sum is calculated.
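In scalar C, the SAD of two 8 × 8 blocks can be written as follows. This reference form only defines the arithmetic; on the MX core the eight differences of one line are computed by the PEs in parallel and summed over the inter-PE network, as described next.

#include <stdint.h>
#include <stdlib.h>

/* Reference form of the SAD of an 8x8 image block and an 8x8 template. */
static uint32_t sad_8x8(const uint8_t image[8][8], const uint8_t tmpl[8][8])
{
    uint32_t sum = 0;

    for (int y = 0; y < 8; y++)
        for (int x = 0; x < 8; x++)
            sum += (uint32_t)abs((int)image[y][x] - (int)tmpl[y][x]);
    return sum;
}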
Figure 6.19 shows the implementation overview of SAD to the MX core. The
absolute difference between the sequential frames is processed line by line with the
PEs of the MX core in parallel. The MX core has a powerful data network between
PEs; therefore, inter-PE operations such as the summation are also easily imple-
mented. With these implementation techniques, effective and high-performance
SAD operations are realized with the MX core.
[Fig. 6.19 SAD implementation on the MX core: the absolute differences of one image line are computed in parallel across the PE columns (×8 columns)]

[Fig. 6.20 Speed performance of the S-T MRF application. Case A: event detection on CPU#1; object map creation and motion vector extraction on CPU#0. Case B: event detection on CPU#1; object map creation on CPU#0; motion vector extraction on the MX-2]

The performance evaluation of the S-T MRF application using the proposed SoC architecture is illustrated in Fig. 6.20. This graph shows the speed performance; the results
were obtained under the conditions where one image frame was processed at an
operating frequency of 648 MHz in each CPU and 324 MHz in the MX-2. In terms of speed, the proposed system achieves a processing time of 20.4 ms, which is 10.8 times faster than the CPU-only configuration. As shown here, high-
performance image recognition applications can be achieved by implementing the
heterogeneous architecture with the MX core and the CPU cores.
Three Linux applications running on the RP-1, RP-2, and RP-X multicore chips (as
described in Chap. 4) have been developed. The first application program visualizes
the load balancing mechanism of Linux on the RP-1, which has four CPU cores
with the cache coherency protocol among them. A monolithic Linux kernel runs on
the four cores, and the load balancer of Linux balances the loads among the cores.
The second application program on the RP-2 visualizes the power-saving mechanisms
for multiple cores using Linux. Two mechanisms are implemented in Linux. One
is dynamic voltage and frequency scaling (DVFS) of multiple cores, and the other
is dynamic plugging or unplugging of each CPU core. The two mechanisms
are controlled by the newly introduced “power control manager” daemon. The third
application program performs image processing of magnetic resonance imaging
(MRI) images using the RP-X chip. These three applications are described in
detail below.
6.3.1.1 Introduction
The RP-1 chip has four SH-4A cores. The main memory is shared by the four cores.
A pair of caches—an instruction cache and an operand cache—is placed between
each core and the main memory. Each operand cache is kept coherent with the other
operand caches using the directory-based write invalidation cache coherency protocol.
The write invalidation protocol is either the MESI cache coherency protocol or the
MSI cache coherency protocol. Eight channels of the inter-CPU interrupts (ICIs)
are implemented. The communication between cores inside Linux is mapped to one
or more channels of the ICIs. An interrupt caused by an event outside a core can
be either bound to a specific core or distributed to an arbitrary core so that the core
that receives an interrupt first serves the interrupt.
Some problems in scalability have been found with multiple processors in Linux
2.4. The problems have become obvious as multi-thread application programs have
become popular. Even on a single processor, the scheduler in Linux 2.4 runs in O(n)
time, where n is the size of the run queue. In symmetric multiprocessing (SMP) on
Linux 2.4, there is a single global run queue protected by a global spinlock. Only
one processor that has acquired the global lock may handle the run queue [9]. To
designate a task to run, the scheduler searches the run queue looking for the highest
dynamic priority of processes. That results in an O(n) time algorithm and causes the
scalability problem in SMP.
Linux 2.6 has been improved for SMP and has a per-CPU run queue, which
avoids the global spinlock with multiple CPUs and provides SMP scalability. The
scheduler on Linux 2.6, before 2.6.23, is called the O(1) scheduler [10], which was
designed and implemented by Ingo Molnar.
The load balancer on Linux 2.6 supports SMP. Balancing within a schedul-
ing domain occurs among groups. The RP-1 Linux has one scheduling domain
with four groups, each of which consists of one CPU. The scheduler works
independently on each CPU. To maintain an equal load in multiple processors,
a load balancer is run periodically to equalize the workload among the proces-
sors. The four-core multiprocessor system has four schedulers. Each CPU has
a run queue. Each run queue maintains a variable called cpu_load, which represents
the CPU’s load. When run queues are initialized, their cpu_loads are set at zero
and updated periodically afterward. The number of runnable tasks on each run
queue is represented by the nr_running variable. The current run queue’s cpu_
load variable is roughly set to the average of the current load and the previous
load using the statement shown below:
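The following is an approximation of that statement as it appears in the Linux 2.6 O(1) scheduler; the exact expression varies slightly between kernel versions, and the structure shown here is pared down to the two fields involved.

#define SCHED_LOAD_SCALE 128UL          /* fixed-point scaling factor    */

struct runqueue {
    unsigned long nr_running;           /* runnable tasks on this CPU    */
    unsigned long cpu_load;             /* smoothed, scaled load history */
};

static void update_cpu_load(struct runqueue *this_rq)
{
    unsigned long this_load = this_rq->nr_running * SCHED_LOAD_SCALE;

    /* cpu_load becomes roughly the average of the previous and current load. */
    this_rq->cpu_load = (this_rq->cpu_load + this_load) / 2;
}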
The constant 128 is used to increase the resolution of load calculations and to
produce a fixed-point number. The above statement means that the cpu_load vari-
able accumulates the recent load history. The load balancing is done at a certain
appropriate timing. The load balancer looks for the busiest CPU. If the busiest
CPU is the current CPU, it does nothing because it is busy. If the load of the current
CPU is less than the average, and the difference in loads of two CPUs exceeds a
certain threshold, the current CPU will pull a certain number of tasks from the
busiest CPU. The number of tasks pulled is the smaller of the following two calcu-
lations. One is the difference between the busiest load and the average load of the
four CPU’s, and the other is the difference between the average load of four CPU’s
and the current load [11].
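A condensed sketch of this pull decision is given below; it is not the kernel's actual load-balancing code, and the imbalance threshold is left as a parameter.

/* Number of tasks the current CPU should pull from the busiest CPU,
 * following the rules described above (0 means no balancing is done). */
#define NCPU 4

static unsigned long min_ul(unsigned long a, unsigned long b)
{
    return a < b ? a : b;
}

static unsigned long tasks_to_pull(const unsigned long load[NCPU],
                                   int this_cpu, unsigned long threshold)
{
    unsigned long avg = 0, busiest_load = 0;
    int busiest = 0;

    for (int i = 0; i < NCPU; i++) {
        avg += load[i];
        if (load[i] > busiest_load) {
            busiest_load = load[i];
            busiest = i;
        }
    }
    avg /= NCPU;

    if (busiest == this_cpu)                         /* this CPU is the busiest */
        return 0;
    if (load[this_cpu] >= avg)                       /* not below the average   */
        return 0;
    if (busiest_load - load[this_cpu] <= threshold)  /* imbalance too small     */
        return 0;

    return min_ul(busiest_load - avg, avg - load[this_cpu]);
}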
The purpose of the first application program is to visualize the load balancing
mechanism of Linux. The application program shows that the number of processes
on each CPU core is averaged among the four CPU cores on the RP-1 chip.
When the application creates several processes, they will be distributed to the four
CPU cores according to the load balancing mechanism of the Linux kernel. This
mechanism should work effectively when the number of processes is both increasing
and decreasing.
A system diagram of the RP-1 application is shown in Fig. 6.21, and the software
architecture of the RP-1 application is in Fig. 6.22. The display unit (“DU” hereafter)
on the RP-1 chip has been used for visualization. The DU converts the contents of a
frame buffer located in the main memory into a video signal. The size of the display
is fixed to VGA, 640 × 480 pixels. The display is divided into four sections. They are
assigned to CPU #0, CPU #1, CPU #2, and CPU #3 exclusively, as shown in
Fig. 6.23. The location of the frame buffer can be an arbitrary address. If the system
has a dedicated memory area for the frame buffer, the DU driver uses the virtual
address after mapping by the ioremap() function of Linux. In this system, the DU
driver allocates the frame buffer in the main memory, DRAM, using the dma_alloc_
coherent() function of Linux. This function allocates one or more physical pages
which can be written or read by the processor or device without worrying about
cache effects, and returns a virtual address. Finally, a frame buffer of plane 0 of the
DU can be accessed by a user program as a file, “/dev/fb0.”
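A user program can therefore reach plane 0 of the DU with the ordinary Linux frame-buffer interface. The sketch below maps /dev/fb0 and fills one CPU section with a solid color; the 640 × 480 size and the 16-bit RGB 5:6:5 pixel format follow the DU setup described in this section, while the drawing itself is only a placeholder and not the penguin application's code.

#include <fcntl.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    const int width = 640, height = 480;          /* VGA display            */
    int fd = open("/dev/fb0", O_RDWR);            /* plane 0 of the DU      */
    if (fd < 0)
        return 1;

    uint16_t *fb = mmap(NULL, (size_t)width * height * 2,
                        PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (fb == MAP_FAILED) {
        close(fd);
        return 1;
    }

    /* Fill the CPU #3 section (lower-right quadrant) with one color. */
    for (int y = height / 2; y < height; y++)
        for (int x = width / 2; x < width; x++)
            fb[y * width + x] = 0x07E0;           /* green in RGB 5:6:5     */

    munmap(fb, (size_t)width * height * 2);
    close(fd);
    return 0;
}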
[Fig. 6.21 System diagram of the RP-1 application: four CPU cores (#0 to #3) on the on-chip interconnect with a DRAM controller and the display unit (DU), which drives a display through a video encoder. Fig. 6.22 Software architecture: the DU initialization, background painting, and penguin drawing applications run on SMP Linux over the RP-1 hardware and DRAM. Fig. 6.23 The display is divided into four sections assigned to CPU #0, CPU #1, CPU #2, and CPU #3]

The application program creates some processes. One process shows a bitmap image of a penguin on the display. When a penguin process is assigned to CPU #3,
the bitmap image of a penguin appears in the CPU #3 section on the display as
shown in Fig. 6.23.
In the same way, when a penguin process is assigned to CPU #1, the penguin
appears in the CPU #1 section. Likewise, when a penguin process is assigned to
CPU #2, the penguin appears in the CPU #2 section. This application consists of
three sub-applications. They are initiated out of a shell script. First, the DU initial-
ization sub-application is initiated. It disables the DU, sets the pixel format to the
16-bit RGB 5:6:5 format, fills all of the pixels with the black color, and enables
the DU. Second, the background painting sub-application is initiated. It paints
the contents of the 640 × 480 bitmap file, in which the CPU #0, CPU #1, CPU #2,
and CPU #3 sections are drawn, onto “/dev/fb0,” which is plane 0 of the DU. Third,
several penguin drawing sub-applications are created and killed after a while. The
penguin drawing sub-application clears the penguin image at the previous position
whose initial position is given arbitrarily, obtains the CPU ID number from the /proc/xxxxxx/stat file, where xxxxxx is the decimal process ID (PID) of that sub-
application process, calculates the position inside the corresponding CPU section
randomly using the rand() of <stdlib.h>, and draws the 82 × 102 pixel bitmap image
of a penguin at that position. This sub-application repeats the above procedures
continuously until an interrupt is signaled and clears the penguin image upon
receiving an interrupt. The Bourne shell script below creates and kills some pen-
guin sub-applications:
Line 0001: ./penguin 0 /dev/fb0 &
Line 0002: sleep 1
Line 0003: ./penguin 1 /dev/fb0 &
Line 0004: sleep 1
…
…
Line 1001: kill -2 `ps a | grep 'penguin 0' | grep -v grep | awk '{print $1}'`
Line 1002: sleep 1
Line 1003: kill -2 `ps a | grep 'penguin 1' | grep -v grep | awk '{print $1}'`
Line 1004: sleep 1
…
Initially, no penguin images are on the display. The line 0001 above creates a
penguin drawing sub-application process whose image is named “0” and draws it
on /dev/fb0, plane 0 of the DU. This process will be created on the same CPU
where the parent shell script process exists. The load balancer of a less busy CPU
might pull one or more processes from the busiest CPU. Line 0002 waits for 1 s.
Line 0003 creates a penguin drawing sub-application process whose image is named
“1” and draws it on /dev/fb0. The load balancer of a less busy CPU might pull one
or more processes from the busiest CPU. Line 0004 waits for 1 s. After creating
several penguin sub-application processes, they are killed in turn. Line 1001 kills
the penguin drawing sub-application process named “0.” The “kill -2 [PID]” command sends an interrupt signal (SIGINT) to the specified process.
6.3.2.1 Introduction
The second application program has been designed to instantiate the power man-
agement capabilities of the RP-2 chip and RP-2 Linux and to visualize the power
consumption and performance of the system. The RP-2 has two capabilities that
support power saving. One is dynamic voltage and frequency scaling (DVFS), and
the other is power gating. The DVFS of the RP-2 allows each CPU core to change
the frequency independently and allows the whole chip to change the voltage to one
of three voltage sources. The voltage supplied to the whole chip is determined by
the highest frequency of all the CPU cores on the chip as indicated in Table 6.4.
The power gating of the RP-2 chip allows each CPU core to turn off or on the
power supplied to the core. Each CPU is inside an independent power domain. The
power supplied to a CPU core can be turned off either by itself or by another CPU
core through manipulation of a memory-mapped register. The power supplied to a
CPU core can be turned on either by an interrupt to the CPU core or by another CPU
core also by manipulating the memory-mapped register.
The RP-2 has two clusters each of which consists of four CPU cores. The four
CPU cores are cache coherent within a cluster. The SMP Linux kernel works on
only one cluster with each operand cache turned on. We have used only one cluster
in the application program.
The RP-2 Linux kernel supports DVFS with the CPUfreq framework. The RP-2
Linux kernel supports power gating with the CPU Hot-plug framework. Both
CPUfreq and CPU Hot-plug are controlled with the power control manager daemon
that realizes the “Idle Reduction” framework described in Sect. 5.1.2. The original
CPUfreq has the “ondemand,” “conservative,” “powersave,” “performance,” and
“userspace” governors which represent power management policies. The power
control manager daemon supports the “Idle Reduction” framework which coordi-
nates both the CPUfreq and the CPU Hot-plug frameworks. The purpose of the
second application program is to control the power consumption using the “Idle
Reduction” framework, to protect the system from heat or battery life shortages, and
to visualize the status of the system.
The CPUfreq framework of Linux uses the ratio of idle time per sampling time to
increase or decrease the frequency of a CPU. The Idle Reduction framework takes
advantage of this process and samples the idle time every 2,000 ms. The kernel runs
the idle loop when a CPU has no workload. If the idle time ratio is less than 20%
in the sampling period, the workload is dense. If the idle time ratio is more than
80%, the workload is sparse. The Idle Reduction framework increases or decreases
the frequency of a CPU when the workload is dense or sparse, respectively. If the
workload of a CPU at the lowest frequency is sparse in two consecutive sampling
periods, the CPU will be turned off using the CPU Hot Remove of the CPU Hot-plug
framework. Because the voltage is the dominant factor in the power consumption on
the RP-2 board, and because the voltage is determined by the highest frequency of
the four CPUs, the Idle Reduction framework tries to level the frequencies of the
four CPUs. If one CPU has been turned off and another CPU has a dense workload,
the Idle Reduction framework will turn on a CPU using the CPU Hot Add of the
CPU Hot-plug framework rather than increasing the frequency of the CPU with the
dense workload.
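The sampling decision can be sketched as follows: the idle ratio of one CPU over a sampling period is derived from the per-CPU counters in /proc/stat and compared with the 20% and 80% thresholds. Only the decision is shown; the resulting actions (frequency changes and CPU hot add or remove) would be issued through sysfs as in Sect. 5.1.2, and this sketch is an illustration rather than the power control manager's actual code.

#include <stdio.h>
#include <string.h>

enum pm_action { RAISE_PERFORMANCE, LOWER_PERFORMANCE, KEEP };

/* Read the accumulated total and idle jiffies of one CPU from /proc/stat. */
static int read_cpu_times(int cpu, unsigned long long *total,
                          unsigned long long *idle)
{
    char tag[16], line[256];
    FILE *f = fopen("/proc/stat", "r");

    if (!f)
        return -1;
    snprintf(tag, sizeof(tag), "cpu%d ", cpu);
    while (fgets(line, sizeof(line), f)) {
        unsigned long long v[8] = {0};
        if (strncmp(line, tag, strlen(tag)) == 0 &&
            sscanf(line + strlen(tag),
                   "%llu %llu %llu %llu %llu %llu %llu %llu",
                   &v[0], &v[1], &v[2], &v[3],
                   &v[4], &v[5], &v[6], &v[7]) >= 4) {
            *idle  = v[3];
            *total = v[0] + v[1] + v[2] + v[3] + v[4] + v[5] + v[6] + v[7];
            fclose(f);
            return 0;
        }
    }
    fclose(f);
    return -1;
}

/* Compare the idle ratio of one sampling period with the thresholds. */
static enum pm_action decide(unsigned long long d_total,
                             unsigned long long d_idle)
{
    if (d_total == 0)
        return KEEP;
    unsigned long idle_pct = (unsigned long)(d_idle * 100 / d_total);

    if (idle_pct < 20)
        return RAISE_PERFORMANCE;   /* dense workload  */
    if (idle_pct > 80)
        return LOWER_PERFORMANCE;   /* sparse workload */
    return KEEP;
}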
The software decoder of MPEG-2 was selected to evaluate the Idle Reduction
framework because it can be multi-threaded and because the workload can be
specified by the frame rate in frames per second (fps). The original software of the
MPEG-2 decoder was downloaded from the web site of the ALPBench [12] bench-
mark program suite. The program of the MPEG-2 decoder has already been multi-
threaded, and the number of threads can be specified by the user when initiating the
decoder. The screen image of MPEG-2 is divided horizontally into nearly equal
areas, and the number of areas is equal to the number of threads. The load balancer
of SMP Linux balances the loads among the CPU cores. In the four-CPU SMP Linux,
the performance in fps scales nearly linearly with the number of threads, up to four threads.
The DVFS and power-gating controls have been evaluated by changing the
workload. The workload of the MPEG-2 decode application was changed by speci-
fying the fps. The number of threads of the MPEG-2 decode application was four
for the four-CPU SMP in this evaluation. The workload was decreased from the
highest to the lowest workload, and then increased from the lowest to the highest
workload. The power consumption and the status of each CPU core changed as
shown in Table 6.5.
The CPUfreq framework has been used in general for laptop personal computers
with an AC adapter or a battery. The CPUfreq receives data on the activation status
of the AC adapter and the remaining battery life. If the AC adapter is activated, the
CPUfreq disregards the battery life. If the AC adapter is not activated, the CPUfreq
takes the battery life into consideration when choosing a governor.
The CPUfreq also takes the temperature around the board into consideration in
the choice of a governor. Heat is generated by the power consumption of the semi-
conductors, and the temperature of the semiconductors can exceed the upper bound
above which normal operation is no longer guaranteed. A processor is usually the
main source of heat on the board. However, a processor with the DVFS capability
and multiple power domains can control the amount of heat radiation and
power consumption.
The power control manager daemon controls both the CPUfreq and the CPU
Hot-plug frameworks depending on the activation status of the AC adapter,
remaining battery life, and the temperature around the board. The activation status
of the AC adapter is represented by the value of a DIP switch on the RP-2 board.
The value, 0 or 1, of the DIP switch is read from one bit of a general-purpose
input/output (GPIO) port. If the AC adapter is activated, the battery life is ignored.
The battery life is translated from the voltage of the battery. A battery manufac-
turer supplies a datasheet on which a graph shows the correspondence between
the remaining battery life and the output voltage. We have developed a battery life
and output voltage model from a battery currently on the market. We used a DC
power unit with variable voltage output instead of a battery because the charge or
discharge time of a battery takes too long to test or demonstrate the control that
depends on the battery life.
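The translation from measured voltage to remaining life can be done with a simple interpolation over the datasheet curve, roughly as sketched below; the table values are placeholders, not the actual battery model used in the book.

```c
/* Sketch: translate measured battery voltage into remaining life (%)
 * by linear interpolation over a discharge curve. The points below are
 * made-up placeholders, not data from the real battery. */
struct v2life { double volt; double life_pct; };

static const struct v2life curve[] = {
    { 12.6, 100.0 },   /* placeholder points only */
    { 12.0,  60.0 },
    { 11.5,  20.0 },
    { 11.0,   0.0 },
};

double battery_life(double volt)
{
    const int n = sizeof(curve) / sizeof(curve[0]);

    if (volt >= curve[0].volt)     return 100.0;
    if (volt <= curve[n - 1].volt) return 0.0;
    for (int i = 1; i < n; i++) {
        if (volt >= curve[i].volt) {
            double t = (volt - curve[i].volt) /
                       (curve[i - 1].volt - curve[i].volt);
            return curve[i].life_pct +
                   t * (curve[i - 1].life_pct - curve[i].life_pct);
        }
    }
    return 0.0;
}
```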
The temperature is measured by a heat sensor, which can be either inside or out-
side the chip. We used a heat sensor outside the RP-2 chip. The power unit of the
RP-2 board is compatible with that of the Advanced Technology extended (ATX)
PC motherboard. The power unit used here, which was made for automobile PCs,
takes DC current from a battery and generates outputs compatible with an ATX PC
power unit. The advantage of this power unit is that it has both a voltage sensor and
a heat sensor.
(Fig. 6.25 Temperature control thresholds: temperature vs. time, with upward and downward thresholds)
Both the measured voltage of the input DC current and the measured
temperature of the power unit board are output via the USB cable. The USB human
interface device (HID) class driver of the RP-2 Linux obtains the voltage and tem-
perature data from the USB host device, and the power control manager daemon
requests and reads the data via the “/dev/hiddev0” driver interface.
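A hedged sketch of such a read loop over the Linux hiddev interface is shown below; the HID usage codes that identify the voltage and temperature fields are assumptions, since they depend on the power unit's HID report descriptor.

```c
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>
#include <linux/hiddev.h>

/* Sketch: read voltage and temperature events from the HID power unit
 * via /dev/hiddev0. VOLTAGE_USAGE and TEMP_USAGE are placeholders. */
#define VOLTAGE_USAGE 0x00840030u   /* assumed usage code */
#define TEMP_USAGE    0x00840036u   /* assumed usage code */

int main(void)
{
    struct hiddev_event ev;
    int fd = open("/dev/hiddev0", O_RDONLY);

    if (fd < 0)
        return 1;
    while (read(fd, &ev, sizeof(ev)) == sizeof(ev)) {
        if (ev.hid == VOLTAGE_USAGE)
            printf("voltage: %d\n", ev.value);
        else if (ev.hid == TEMP_USAGE)
            printf("temperature: %d\n", ev.value);
    }
    close(fd);
    return 0;
}
```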
The temperature control changes the power management policy to “powersave”
if the temperature goes above the user-specified upward threshold temperature.
Likewise, the temperature control changes the power management policy from
“powersave” to another mode if the temperature goes below the user-specified
downward threshold temperature. In the “powersave” policy, CPU#0 operates at
75 MHz, and the three other CPUs are turned off to reduce the amount of heat radi-
ating from the CPUs. Chattering, in which the power management policy frequently
goes into and comes out of the “powersave” mode, might occur if the temperature
fluctuates around a threshold. This is inefficient because turning a CPU off or on
takes much more time than changing a CPU’s frequency. The temperature control
therefore has two separate thresholds, upward and downward, as shown in Fig. 6.25;
if the downward threshold were the same as the upward threshold, chattering might occur.
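The hysteresis behavior can be captured in a few lines, as in the sketch below; set_policy() stands in for the governor change made through CPUfreq, and the fallback governor ("ondemand") is an assumption, since the text only says "another mode."

```c
/* Sketch of the two-threshold (hysteresis) temperature control: enter
 * "powersave" above the upward threshold, leave it only below the
 * downward threshold, so the policy does not chatter when the
 * temperature hovers around a single threshold. */
extern void set_policy(const char *name);

void temperature_control(double temp, double up_thresh, double down_thresh)
{
    static int powersave = 0;       /* remembers the current mode */

    if (!powersave && temp > up_thresh) {
        set_policy("powersave");
        powersave = 1;
    } else if (powersave && temp < down_thresh) {
        set_policy("ondemand");     /* assumed fallback governor */
        powersave = 0;
    }
}
```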
The battery life control changes the power management policy to “powersave” if
the battery life goes below the user-specified downward threshold. On the other
hand, the battery life control changes the policy from “powersave” to another mode
if the battery life goes above the user-specified upward threshold. In the “power-
save” policy, the power consumed by the CPUs is reduced in order to prolong the
battery life. Chattering may occur if the remaining battery life fluctuates around
the threshold, so the battery life control also has two thresholds, a downward one and
an upward one, as shown in Fig. 6.26; if the upward and downward thresholds were
the same, chattering might occur.
Figure 6.27 shows a system diagram of the RP-2 application.
The application program uses the X-window system. To show the MPEG-2 image
on a window of the X-window system, the DU driver uses two planes of the XGA
size or 1,024 × 768. One plane is used to display the whole screen of the X-window
system. That plane is accessed via the “/dev/fb0” frame buffer device. The other
plane is used to display the MPEG-2 image on a window. That plane is accessed via
the “/dev/fb1” frame buffer device. The image on the screen is the graphical user
interface (GUI) program implemented using the X toolkit of the X-window system.
A mouse is used as a pointing device. Figure 6.28 shows the software architecture
of the RP-2 application.
(Fig. 6.26 Battery life control thresholds: battery life vs. time, with upward and downward thresholds)
(Fig. 6.27 System diagram of the RP-2 application: frame buffer planes 0 and 1 in DRAM and the display)
(Fig. 6.28 Software architecture of the RP-2 application: SMP Linux and memory (DRAM) on CPU #0–#3)
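As an illustration of how an application can write pixels into the second plane, the sketch below maps "/dev/fb1" with the standard Linux frame buffer ioctl and mmap calls; error handling is reduced to the minimum.

```c
#include <fcntl.h>
#include <stddef.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <unistd.h>
#include <linux/fb.h>

/* Sketch: open the second plane ("/dev/fb1") and map it so decoded
 * MPEG-2 frames can be written directly into the frame buffer. */
void *map_fb1(size_t *len_out)
{
    struct fb_var_screeninfo vinfo;
    int fd = open("/dev/fb1", O_RDWR);
    size_t len;
    void *fb;

    if (fd < 0)
        return NULL;
    if (ioctl(fd, FBIOGET_VSCREENINFO, &vinfo) < 0) {
        close(fd);
        return NULL;
    }
    len = (size_t)vinfo.xres * vinfo.yres * (vinfo.bits_per_pixel / 8);
    fb = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);                      /* the mapping stays valid after close */
    if (fb == MAP_FAILED)
        return NULL;
    *len_out = len;
    return fb;
}
```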
A display image of the application is shown in Fig. 6.29. It consists of three
windows: the main application, the system monitor, and “xeyes.”
Figure 6.30 shows the main application window, which has two parts. One is the
area to display both the MPEG-2 video and a histogram to show the current speed in
fps up to 40 fps. This area is an instance of a custom widget class of the X toolkit. The
contents of the “/dev/fb1” frame buffer device are mapped to this area. The other part
is the area for the control buttons. A button on the X-window screen is associated
with a shell script by means of the X toolkit. Pushing the button executes the shell
script. The MPEG-2 decode program is executed from one of the shell scripts.
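A sketch of how a button can be wired to a shell script with the X toolkit is given below; the widget is assumed to be an already-created command widget, and "decode.sh" is a placeholder name rather than the actual script used by the application.

```c
#include <stdlib.h>
#include <X11/Intrinsic.h>
#include <X11/StringDefs.h>

/* Sketch: associate a shell script with an X toolkit button so that
 * pushing the button runs the script. */
static void run_script(Widget w, XtPointer client_data, XtPointer call_data)
{
    (void)w; (void)call_data;
    system((const char *)client_data);   /* runs the associated script */
}

void attach_script(Widget button)
{
    /* "button" is assumed to be a command widget created elsewhere. */
    XtAddCallback(button, XtNcallback, run_script, (XtPointer)"./decode.sh");
}
```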
The system monitor window is shown in Fig. 6.31. This system monitor was
developed by modifying the “xosview” [13] program which runs on the X-window
system. The program continuously updates system-related statistics obtained
from the “/proc” file system. The source code of “xosview” has been downloaded
from the Internet and modified to show the statistics listed in Table 6.6. The first
four items, “CPU0,” “CPU1,” “CPU2,” and “CPU3,” display information based on
“/proc/stat” and “/proc/cpuinfo.” The original “xosview” does not work correctly
with the CPU Hot Remove or CPU Hot Add of CPU Hot-plug because a removed
CPU disappears from “/proc/stat.” The “xosview” has therefore been modified
to gray out the area of the removed CPU. The other items, “BTRY,” “THER,” “POLI,”
“FREQ,” and “WATT” display the information obtained from the power control
manager daemon.
6.3.3.1 Introduction
Image recognition technology involves several kinds of analyses at the same time.
Image processing is one of the research fields that can benefit from multicore pro-
cessors. This subsection describes a system that takes images captured by a camera
and displays them after carrying out some filtering processes.
The source code of the SUSAN benchmark package is available in the MiBench
benchmark suite, which is a set of commercially representative embedded applica-
tions. The package includes three visual effect algorithms. They find corners, find
edges, and smooth shapes in the images. The three algorithms are independent.
(Fig. 6.32 System diagram of the RP-X application: CPU #0–#3, SATA controller, bus controller, on-chip and off-chip interconnects, DRAM, LCD controller, video encoder, frame buffer plane 0, display, USB host controller with USB hub, and a hard disk drive holding the OS, libraries, and application)
The original SUSAN application applies only one of the three visual effects.
However, the application presented here has been modified to perform the
three visual effects in parallel to take advantage of a multicore processor.
The original SUSAN program accepts an image file stored in the portable gray
map (PGM) format as input, reads the size of the image and the subsequent 8-bit
grayscale image data, and passes the input to one of the visual effect algorithms. In this
implementation, however, the visual images are captured via the USB video class
(UVC) device and stored in the YUY2 format once and then converted into the 8-bit
grayscale format as the input to each visual effect algorithm. The size of the image
is smaller than 320 × 240 pixels and is one of the sizes supported by the USB cam-
era. Figure 6.32 shows the system diagram of the RP-X application.
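The YUY2-to-grayscale conversion mentioned above is essentially a matter of keeping the luminance bytes, as the sketch below shows.

```c
#include <stdint.h>
#include <stddef.h>

/* Sketch of the "input" conversion step: YUY2 packs pixels as
 * Y0 U Y1 V, so an 8-bit grayscale image is obtained by keeping every
 * luminance (Y) byte and discarding the chrominance bytes. */
void yuy2_to_gray(const uint8_t *yuy2, uint8_t *gray, int width, int height)
{
    size_t npix = (size_t)width * height;

    for (size_t i = 0; i < npix; i++)
        gray[i] = yuy2[2 * i];   /* Y component of each pixel */
}
```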
The software architecture of the RP-X application is shown in Fig. 6.33. The
SUSAN process creates four threads: “input,” “smoothing,” “edges,” and “corners.”
The “input” thread converts the captured image in the YUY2 format into the 8-bit
grayscale format. The “smoothing” thread smoothes shapes. The “edges” thread
finds edges in the image. The “corners” thread finds corners in the image. The system
is built to take advantage of the X-window system on the Linux operating system.
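A minimal sketch of this thread structure with POSIX threads is shown below; the thread bodies are placeholders for the actual capture and SUSAN routines.

```c
#include <pthread.h>

/* Sketch: spawn the four SUSAN threads so the three visual effects
 * run in parallel on the SMP cores. The bodies are placeholders. */
extern void *input_thread(void *arg);      /* capture + YUY2-to-gray */
extern void *smoothing_thread(void *arg);
extern void *edges_thread(void *arg);
extern void *corners_thread(void *arg);

int start_susan_threads(void)
{
    pthread_t tid[4];
    void *(*body[4])(void *) = {
        input_thread, smoothing_thread, edges_thread, corners_thread
    };

    for (int i = 0; i < 4; i++)
        if (pthread_create(&tid[i], NULL, body[i], NULL) != 0)
            return -1;
    for (int i = 0; i < 4; i++)
        pthread_join(tid[i], NULL);
    return 0;
}
```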
(Fig. 6.33 Software architecture of the RP-X application: SMP Linux and memory (DRAM) on CPU #0–#3)
The X toolkit is used to lay out the images. The “luvcview” [16] package is a web
camera viewer based on the UVC. The source code of the “luvcview” package is
available on the Internet. The “v4l2uvc.c” file and the related header files have been
extracted from the package and integrated into the application. The “v4l2uvc.c”
captures the UVC camera images using the Video4Linux2 driver.
Figure 6.34 shows the display image of the SUSAN application on the X-window.
There are four video images in the figure. The upper left image shows the gray-scaled
image of the original input image from the USB camera. The size of the input image
is 320 × 240 pixels. A smoothed image is shown at the upper right. The lower left
image shows the edge detection effect, and the lower right image shows the corner
detection effect. The images are written on the frame buffer in Linux using the
Xgraphics functions of the Xlib library.
One example of a system utilizing the multicore chip is a video image search
system. A detailed implementation of the system with the multicore chip RP-X
[17] is described here. It offers video-stream playback with a graphical
operation interface, as well as a similar-image search [18] that recognizes faces
while playing back video. It makes the most use of the heterogeneous cores, such
as the video processing unit (VPU) in playing video streams and SH-4A in per-
forming image recognition. Figure 6.35 shows a block diagram of the implemented
video image search system on the chip. The system runs two different operating
systems, uITRON and Linux, over a hypervisor that manages the physical
resources of the chip. The hypervisor is a software layer below the operating
systems [19]. The two operating systems use a common shared memory for their
intercommunication.
(Block-diagram residue: application layer with the video image search GUI, face detection, and face recognition over a hypervisor)
Fig. 6.37 Data flow of uITRON system and utilized hardware IP cores (frame buffer of graphics, image–graphics synthesis in the BEU, frame buffer of synthesized data (YCbCr, 1,024 × 768), image output to the display via LCDC and DVI; memory areas dedicated to uITRON and shared by both uITRON and Linux)
The system on uITRON plays back motion pictures, carries out the image scaling
and synthesis, and outputs the image to a monitor; these are the main functions of
the video image search. Figure 6.37 illustrates the data flow of the system on uITRON.
It also shows the utilized hardware IP cores. The VPU that decodes video streams
supports multiple video codecs such as H.264, MPEG-2, and MPEG-4. The codec
used by the system is MPEG-2. The VEU reads an image placed on the specified area
of the memory, enlarges/reduces the size of the image, and writes it to the specified
area of the memory. The BEU reads three images placed on the specified areas of the
memory, blends them, and writes the blended image to a specified area. The imple-
mented system uses the BEU’s blending of two images. The LCDC reads an image on
the specified area of the memory and transmits it to a display device. The system uses
a DVI interface for the transmission.
The implementation details of the five main functions on the uITRON system are
described as follows:
1. MPEG-2 decoding
2. Still-image capturing
3. Image scaling
4. Video image and graphics synthesizing
5. Output image controlling
First, the MPEG-2 decoding is processed on the VPU using a frame buffer of
decoding data, whose size corresponds to four frames of the video image. The VPU
starts the decoding frame-by-frame when one frame of an input data stream is
obtained from the memory, and it stores the decoded image to one of the four frames
in the frame buffer.
The still-image capturing copies the decoded image to the frame buffer of the
captured image at every decoded frame. The buffer of the captured image is placed
in the memory shared between uITRON and Linux; therefore, a program on Linux
can obtain a decoded image at any time.
The image scaling also copies the decoded image to the frame buffer of the
decoded data at every decoded frame. Since the target size of the scaled images is
set to 720 × 480, the horizontal and vertical scaling factors for the decoded
image are calculated and set in the VEU. For example, when the size of the image is
720 × 480, the scaling factors are set to 1.00 and 1.00 in the horizontal and vertical direc-
tions, respectively. In the same manner, when the size is 960 × 540, the scaling factors are
set to 0.75 and 0.89. When the size is 320 × 240, the factors are 2.25 and 2.00. After the
start-up of the VEU, it reads an image from the frame buffer of the decoded data, adjusts
the size of the image according to specified scaling factors, and writes the scaled image
whose size is 720 × 480 to the frame buffer of the scaled data.
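The scaling-factor calculation reduces to two divisions, as sketched below for the fixed 720 × 480 target.

```c
/* Sketch of the VEU scaling-factor calculation: the target size is fixed
 * at 720 x 480, so the factors are the ratios between the target and the
 * decoded image size (e.g., 0.75 and 0.89 for 960 x 540, and 2.25 and
 * 2.00 for 320 x 240). */
void veu_scaling_factors(int src_w, int src_h,
                         double *h_factor, double *v_factor)
{
    const int dst_w = 720, dst_h = 480;

    *h_factor = (double)dst_w / src_w;
    *v_factor = (double)dst_h / src_h;
}
```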
The video image and graphics synthesizing process uses image data in the frame
buffer of the scaled data, as well as graphics data in the frame buffer of graphics and
blends them in the BEU. The size of the frame buffers is 1,024 × 768. When a scaled
image is stored in the frame buffer of scaled data, the BEU starts the blending and
writes the synthesized image to the frame buffer of the synthesized data. The graphics
frame buffer is placed in the memory area shared by both uITRON and Linux and
can therefore be updated on Linux at any time.
Finally, the output image control sets up the LCDC and a DVI transmitter to con-
vert the synthesized image stored in the frame buffer into video signals that are trans-
mitted to the monitor via the DVI interface. Figure 6.38 illustrates the processing
flow of the uITRON system. The process is repeated from supplying the video stream
to copying the frame buffer of decoding data to that of a still-captured image.
(Fig. 6.38 Processing flow of the uITRON system: start LCDC, initialize VPU, start VEU, start BEU; still image, α plane, and graphics in memory (DDR3-SDRAM))
(Fig. 6.40 Processing flow of the Linux application programs: start, initialization, wait until an event occurs, then face detection, feature calculation and search, detected-faces display, and thumbnail display)
The similar-image search consists of feature calculation, in which the feature value
of a face image is calculated; registering, in which faces are registered in a database
created on a hard disk drive; deletion, where a face entry in the database is deleted;
and an image search, where similar face images are searched for in the database.
The face detection utilizes a face detection function offered by Intel’s
OpenCV [20], which is a general image processing library.
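A hedged sketch of such a call sequence with the classic OpenCV C API (1.x-era signatures) is shown below; the cascade file name and detection parameters are typical values, not those of the book's implementation.

```c
#include <stdio.h>
#include <opencv/cv.h>
#include <opencv/highgui.h>

/* Sketch: Haar-cascade face detection with the classic OpenCV C API. */
int detect_faces(const char *image_file)
{
    IplImage *img = cvLoadImage(image_file, CV_LOAD_IMAGE_GRAYSCALE);
    CvHaarClassifierCascade *cascade =
        (CvHaarClassifierCascade *)cvLoad("haarcascade_frontalface_alt.xml",
                                          0, 0, 0);
    CvMemStorage *storage = cvCreateMemStorage(0);
    CvSeq *faces;
    int i, n;

    if (!img || !cascade || !storage)
        return -1;
    faces = cvHaarDetectObjects(img, cascade, storage,
                                1.1, 3, CV_HAAR_DO_CANNY_PRUNING,
                                cvSize(30, 30));
    n = faces ? faces->total : 0;
    for (i = 0; i < n; i++) {
        CvRect *r = (CvRect *)cvGetSeqElem(faces, i);
        printf("face at (%d, %d), size %dx%d\n",
               r->x, r->y, r->width, r->height);
    }
    cvReleaseMemStorage(&storage);
    cvReleaseImage(&img);
    return n;
}
```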
The event processing consists of mouse event detection that detects the operation
of a pointing device and internal event generation that starts the face detection by
the detected mouse event.
The image object management manages objects of the still image obtained from
uITRON via the shared memory and the image generated by the face detection. It
also manages the depiction of mouse trails detected by the event processing and
generation of the α plane that determines the synthesizing position of the video
plane and the graphics plane.
Finally, the image processing performs trimming, which trims a specified range
of an image; scaling, which enlarges or reduces the size of an image; YUV–RGB
conversion, which converts the color format of an image; and frame depiction,
which makes it possible to draw a shape on a face-detected area.
Figure 6.40 shows the processing flow of the Linux application programs. First,
image objects displayed on the graphics plane are initialized. Then the operation of
a mouse connected via the USB interface is detected by a device driver embedded
in the Linux kernel. The device driver outputs on/off values of each mouse button
and the distance of mouse movement. The mouse event detection classifies three
events of mouse button operations: PUSH, REPEAT, and RELEASE. Furthermore,
it converts the movement distance into coordinates. The internal event generation is
processed in accordance with the values generated by the mouse event detection. The
defined internal events include no events, still-image capturing, face detection, sim-
ilar-image display, similar-image search, similar-image registering, and similar-
image deletion.
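The button-event classification described above can be expressed as a small state check, roughly as sketched below; the enum names are ours, chosen to match the PUSH, REPEAT, and RELEASE events in the text.

```c
/* Sketch: classify a mouse-button transition from the on/off values
 * delivered by the device driver into PUSH, REPEAT, and RELEASE. */
enum mouse_event { EV_NONE, EV_PUSH, EV_REPEAT, EV_RELEASE };

enum mouse_event classify_button(int prev_state, int curr_state)
{
    if (!prev_state && curr_state)  return EV_PUSH;     /* up   -> down */
    if (prev_state && curr_state)   return EV_REPEAT;   /* held down    */
    if (prev_state && !curr_state)  return EV_RELEASE;  /* down -> up   */
    return EV_NONE;
}
```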
When a mouse event is detected on a video-plane area, the still-image capturing
event is generated, and a still image captured from decoded video images is
obtained as a still-image object. The graphics plane is updated in order to display
the newly captured image. Then the area range of the image selected by the mouse
is trimmed. The trimmed image is treated as a face-region image object, and the
graphics plane is updated again. The face detection uses the face-region image
object, and a frame shape is drawn on the area of the detected face. When a mouse
event is detected on the still-image object or the similar-image object, the face
detection is carried out by using these two objects. When a mouse event is detected
on the thumbnail image object, a thumbnail image shown in the event is displayed
as a similar image. When one is detected on a framed face of the face image object,
the face-framed part of the image is trimmed. The trimmed image is converted into
the image format required to calculate the feature value, and the calculation is per-
formed. Then the similar-image search is carried out by using the calculated fea-
ture value, and the top ten similar images are displayed. When a mouse event is
detected on the framed face, the face image is registered on the similar-image data-
base. When one is detected on a thumbnail image, an entry of the image is deleted
from the database.
The execution time of each process on the Linux system was measured. Table 6.7
lists the average time for each process. The face detection required 1.6 s, and the
access of the similar-image database took more than 0.5 s. The time for such processes
References
Index

A
AAC. See Advanced audio codec (AAC)
Access checklist (ACL), 172–175
ACL. See Access checklist (ACL)
Address extension, 153, 161–165
Advanced audio codec (AAC), 1, 179–187
Affine transformation, 54, 63
ALPBench, 200
ALU. See Arithmetic logical unit (ALU)
AMP. See Asymmetric multiprocessor (AMP)
ANSI/IEEE 754, 46, 57, 62
Area efficiency, 6, 19, 31, 41, 56, 65, 66, 73, 89, 91, 93, 94
Arithmetic logical unit (ALU), 6, 19, 25, 28, 35, 74–88, 90, 97–99, 117, 143, 145–147
Asymmetric multiprocessor (AMP), 22, 67, 69, 127
Atomic operation, 154–157

B
BARR, 138, 139
BARW, 138, 139
BEU. See Blend engine unit (BEU)
BHT. See Branch history table (BHT)
Blend engine unit (BEU), 210–214
Bourne shell, 197
Branch history table (BHT), 32, 33, 37, 38
Branch prediction, 24, 32, 33, 36–38, 41, 43
Branch target buffer (BTB), 24, 32, 33
BTB. See Branch target buffer (BTB)
Butterfly calculation, 85–87

C
CABAC. See Context-adaptive binary arithmetic coding (CABAC)
Cache coherency, 68, 127–129, 135–137, 193, 194
CAVLC. See Context-adaptive variable-length coding (CAVLC)
Centralized shared memory (CSM), 12–17, 127, 128, 136, 137, 139, 180, 182, 183
CFGM. See Configuration manager (CFGM)
CISC. See Complicated instruction set computer (CISC)
Clock gating, 38, 39, 43, 44, 110, 147, 150
Cluster, 13–16, 67, 69, 127, 128, 132, 133, 136, 139, 141, 142, 198
CODEC, 7, 15, 21, 101–111, 113, 117–119, 146, 147, 172, 212
Coherency, 68, 127–129, 135–137, 193, 194
Complicated instruction set computer (CISC), 23
COMS technology, 7
Configuration manager (CFGM), 6, 74–77, 80–82, 145
Context-adaptive binary arithmetic coding (CABAC), 7, 101, 104, 106, 107, 113, 115
Context-adaptive variable-length coding (CAVLC), 101, 106, 107
Cooley–Tukey algorithm, 85
CPU, 1, 5–7, 14, 16, 57, 74–76, 78, 80–85, 89, 127, 137, 144, 154, 157, 165, 166, 170, 175, 176, 179, 180, 182–190, 194–195, 197, 200–202
FTRV. See Floating-point transform vector (FTRV)
Full HD, 7, 101–103, 105, 106, 109, 110, 118, 119, 147
FVC. See Frequency and voltage controller (FVC)

G
Giga operations per second (GOPS), 1, 2, 89, 143, 144, 150
Global history, 32, 33, 38
Golomb, 106, 113, 114
GOPS. See Giga operations per second (GOPS)
GUI control, 213–217

H
H.264, 1, 2, 7, 15, 101, 103–106, 108, 113, 117–119, 144, 147, 212
Hardware emulation, 47, 61
Harvard architecture, 24, 28, 31
H-ch. See Horizontal channel (H-ch)
Heterogeneous multicore, 3, 4, 7, 8, 11–17, 19, 69, 101–103, 143, 161, 166, 179, 187, 189
Heterogeneous parallelism, 3–8
HEVC. See High Efficiency Video Coding (HEVC)
High Efficiency Video Coding (HEVC), 119
HIGHMEM, 161–165
Horizontal channel (H-ch), 89–92, 146
Hypervisor, 169, 170, 175, 210

I
ICIs. See Inter-CPU interrupts (ICIs)
Idle reduction, 157–161, 199–201
ILRAM. See Instruction local RAM (ILRAM)
Image filtering, 206–210
Image processing, 89, 97, 99, 102, 104, 106, 108, 109, 112, 115–118, 147, 189, 194, 206, 211, 213, 215
In-order, 23, 24, 32
Instruction categorization, 25, 33
Instruction local RAM (ILRAM), 14, 43, 127, 136, 179, 180
Instruction predecoding, 43
Instruction set architecture (ISA), 21, 23–26, 33, 44, 65, 68–74
INTC. See Interrupt controller (INTC)
Inter-CPU interrupts (ICIs), 194
Inter-frame parallelism, 185
Interrupt controller (INTC), 20, 67, 140, 141, 171, 175
I/O device, 154, 163, 166
I/O space, 163
IOzone, 164
ISA. See Instruction set architecture (ISA)

J
JCT-VC. See Joint Collaborative Team on Video Coding (JCT-VC)
Joint Collaborative Team on Video Coding (JCT-VC), 119

L
Latency, 11, 12, 24, 27, 33, 40, 44, 47, 51, 57–59, 61, 69, 102, 107, 108, 130, 132, 148, 165, 166, 176
LCDC. See Display controller (LCDC)
LCPG. See Local clock pulse generator (LCPG)
Leading nonzero (LNZ) detector, 50, 62
Leakage current, 3, 20, 137, 138
Legacy software, 126
Linux, 134, 135, 141, 142, 153–165, 175, 176, 193–215, 217
Linux kernel, 140, 142, 162, 164, 193, 195, 198, 199, 216
LL/SC instructions, 154–155, 157
LM. See Local memory (LM)
LMBench, 155–157, 164, 175, 176
LNZ. See Leading nonzero (LNZ) detector
Load balancing, 167, 169, 193–199
Local clock pulse generator (LCPG), 14, 15, 128, 129
Local memory (LM), 6, 11–17, 43, 74–76, 78–80, 83, 84, 86–88, 102, 107, 145, 179, 180, 182–185
Logical partitioning, 169–170

M
Macroblock, 102–104, 106–110, 117–119
Magnetic resonance imaging (MRI), 194, 207
Matrix Engine (MX), 15, 16, 19, 69, 88–100, 187–193
Matrix processor array (MPA), 89, 98, 99, 191
Matrix processor controller (MPC), 89, 90, 98, 99, 191
Memory management unit (MMU), 21, 73, 111, 170, 210
MESI. See Modified, Exclusive, Shared, Invalid (MESI) modes
RP-1 prototype chip, 19, 22, 67, 123, 125–136, 141, 153, 154, 175, 193–198
RP-2 prototype chip, 19, 22, 67, 123, 125, 136–143, 153, 157, 159, 193, 194, 198–206
RP-X prototype chip, 14, 19, 69, 123, 125, 143–150, 153, 161, 162, 164, 165, 193, 194, 206–211
RTOS. See Real-time operating system (RTOS)

S
SAD. See Sum of absolute difference (SAD)
SEQM. See Sequence manager (SEQM)
Sequence manager (SEQM), 6, 75–77, 80–81, 83, 145, 146
SH-1, 4, 20, 21
SH-2, 20, 21
SH-3, 21, 31, 32, 40, 41
SH-4, 21–36, 40–42, 44–59, 61–63, 65–67
SH-5, 21
SH-4A, 21, 22, 67, 170, 179, 194, 210
SH core, 19, 67, 70, 179, 184
SH-3E, 44, 56, 58, 59, 65, 66
SH processor, 4, 20, 21, 69
SH-X, 21, 32–43, 56–67
SH-X2, 21, 42–44, 67
SH-X3, 22, 67–70, 126–128, 132, 133, 136, 141
SH-X4, 22, 69–74, 143, 144, 149, 150
SIMD. See Single instruction multiple data (SIMD)
Single instruction multiple data (SIMD), 6, 7, 16, 19, 45, 58, 60, 88–90, 92, 94, 116, 117, 146, 188
Smallest Univalue Segment Assimilating Nucleus (SUSAN), 207–209
SMP. See Symmetric Multiprocessor (SMP)
SNC. See Snoop controller (SNC)
Snoop, 68, 128–131, 135
Snoop controller (SNC), 67, 68, 127, 128, 130, 131, 135
SoC. See System on a chip (SoC)
Spatiotemporal Markov random field model (S-T MRF), 189–192
Special purpose processor (SPP), 11–17, 101, 102
SPLASH-2, 134, 135, 140–142
Split transaction, 131, 132
SPP. See Special purpose processor (SPP)
SRAM, 6, 89–91, 93, 96, 98, 99, 127, 128, 133, 136, 146
S-T MRF. See Spatiotemporal Markov random field model (S-T MRF)
Store buffer, 33–35, 41
Store with extension (STX), 80, 113, 114
STX. See Store with extension (STX)
Sum of absolute difference (SAD), 94, 191–193
SuperH™, 19–22, 68, 69
SuperHyway, 21, 74, 127, 128, 131–133, 136, 138, 144–146
Superpipeline, 24, 32–36, 38, 41, 43, 65
Superscalar, 22–27, 29–32, 55, 56, 65, 68
SUSAN. See Smallest Univalue Segment Assimilating Nucleus (SUSAN)
Symmetric multiprocessor (SMP), 22, 67, 68, 127–129, 134, 135, 141, 142, 153–156, 193–210
Synchronization, 116, 138–139
System on a chip (SoC), 1, 3, 20, 67, 76, 123–126, 131, 137, 143, 189, 190, 192

T
TAS instruction, 154, 155, 157
Thread, 16, 80–84, 87, 88, 134, 135, 141, 142, 154, 158–160, 190, 191, 194, 200, 208
TLBs. See Translation look aside buffer (TLBs)
Transformer (TRF), 7, 104, 112, 115, 117
Translation look aside buffer (TLBs), 28, 154, 161, 164, 170
TRF. See Transformer (TRF)

U
uITRON, 210–215
URAM. See User RAM (URAM)
USB, 171, 202, 203, 208, 210, 215
User RAM (URAM), 14, 127, 137, 138, 179, 180, 184, 185

V
VC-1, 101, 118, 119, 144
V-ch. See Vertical channel (V-ch)
Vertical channel (V-ch), 89, 90, 92–94, 97, 146
VEU. See Video engine unit (VEU)
Video codec, 7, 21, 101–111, 113, 117–119, 212
Video engine unit (VEU), 210–213
Video image search, 210–217
Video processing unit (VPU), 15, 16, 19, 101–119, 143, 144, 146, 147, 149, 210–214
Virtualization, 170
Virtual Socket Interface (VSI), 131
VPU. See Video processing unit (VPU)
VSI. See Virtual Socket Interface (VSI)

W
Way prediction, 43

X
Xeyes, 204
Xosview, 205–206
XREG, 97–98

Z
Zero-cycle transfer, 24, 28, 31, 47