
Heterogeneous Multicore Processor
Technologies for Embedded Systems

Kunio Uchiyama · Fumio Arakawa · Hironori Kasahara · Tohru Nojiri ·
Hideyuki Noda · Yasuhiro Tawara · Akio Idehara · Kenichi Iwata · Hiroaki Shikano

Kunio Uchiyama
Research and Development Group
Hitachi, Ltd.
1-6-1 Marunouchi, Chiyoda-ku
Tokyo 100-8220, Japan

Fumio Arakawa
Renesas Electronics Corp.
5-20-1 Josuihon-cho, Kodaira-shi
Tokyo 187-8588, Japan

Hironori Kasahara
Green Computing Systems R&D Center
Waseda University
27 Waseda-machi, Shinjuku-ku
Tokyo 162-0042, Japan

Tohru Nojiri
Central Research Lab.
Hitachi, Ltd.
1-280 Higashi-koigakubo, Kokubunji-shi
Tokyo 185-8601, Japan

Hideyuki Noda
Renesas Electronics Corp.
4-1-3 Mizuhara, Itami-shi
Hyogo 664-0005, Japan

Yasuhiro Tawara
Renesas Electronics Corp.
5-20-1 Josuihon-cho, Kodaira-shi
Tokyo 187-8588, Japan

Akio Idehara
Nagoya Works, Mitsubishi Electric Corp.
1-14 Yada-minami 5-chome, Higashi-ku
Nagoya 461-8670, Japan

Kenichi Iwata
Renesas Electronics Corp.
5-20-1 Josuihon-cho, Kodaira-shi
Tokyo 187-8588, Japan

Hiroaki Shikano
Central Research Lab.
Hitachi, Ltd.
1-280 Higashi-koigakubo, Kokubunji-shi
Tokyo 185-8601, Japan

ISBN 978-1-4614-0283-1 ISBN 978-1-4614-0284-8 (eBook)


DOI 10.1007/978-1-4614-0284-8
Springer New York Heidelberg Dordrecht London

Library of Congress Control Number: 2012932273

© Springer Science+Business Media New York 2012


This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is
concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on
microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation,
computer software, or by similar or dissimilar methodology now known or hereafter developed. Exempted from this legal
reservation are brief excerpts in connection with reviews or scholarly analysis or material supplied specifically for the
purpose of being entered and executed on a computer system, for exclusive use by the purchaser of the work. Duplication
of this publication or parts thereof is permitted only under the provisions of the Copyright Law of the Publisher’s location,
in its current version, and permission for use must always be obtained from Springer. Permissions for use may be obtained
through RightsLink at the Copyright Clearance Center. Violations are liable to prosecution under the respective Copyright
Law.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply,
even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations
and therefore free for general use.
While the advice and information in this book are believed to be true and accurate at the date of publication, neither the
authors nor the editors nor the publisher can accept any legal responsibility for any errors or omissions that may be made.
The publisher makes no warranty, express or implied, with respect to the material contained herein.

Printed on acid-free paper

Springer is part of Springer Science+Business Media (www.springer.com)


Preface

The expression “Digital Convergence” was coined in the mid-1990s and became a
topic of discussion. Now, in the twenty-first century, the “Digital Convergence” era
of various embedded systems has begun. This trend is especially noticeable in digi-
tal consumer products such as cellular phones, digital cameras, digital players, car
navigation systems, and digital TVs. That is, various kinds of digital applications
are now converged and executed on a single device. For example, several video
standards such as MPEG-2, MPEG-4, H.264, and VC-1 exist, and digital players
need to encode and decode these multiple formats. There are even more standards
for audio, and newer ones are continually being proposed. In addition, recognition
and synthesis technologies have recently been added. The latest digital TVs and
DVD recorders can even extract goal-scoring scenes from soccer matches using
audio and image recognition technologies. Therefore, a System-on-a-Chip (SoC)
embedded in the digital-convergence system needs to execute countless tasks such
as media, recognition, information, and communication processing.
Digital convergence requires, and will continue to require, higher performance in
various kinds of applications such as media and recognition processing. The prob-
lem is that any improvements in the operating frequency of current embedded CPUs,
DSPs, or media processors will not be sufficient in the future because of power
consumption limits. We cannot expect a single processor with an acceptable level of
power consumption to run applications at high performance. One solution that
achieves high performance at low-power consumption is to develop special hard-
ware accelerators for limited applications such as the processing of standardized
formats such as MPEG videos. However, the hardware-accelerator approach is not
efficient enough for processing many of the standardized formats. Furthermore, we
need to find a more flexible solution for processing newly developed algorithms
such as those for media recognition.
To satisfy the higher requirements of digitally converged embedded systems, this
book proposes heterogeneous multicore technology that uses various kinds of low-
power embedded processor cores on a single chip. With this technology, heteroge-
neous parallelism can be implemented on an SoC, and we can then achieve greater
flexibility and superior performance per watt. This book defines the heterogeneous
multicore architecture and explains in detail several embedded processor cores
including CPU cores and special-purpose processor cores that achieve high arith-
metic-level parallelism. We developed three multicore chips (called RP-1, RP-2,
and RP-X) according to the defined architecture with the introduced processor
cores. The chip implementations, software environments, and applications running
on the chips are also explained in the book.
We, the authors, hope that this book is helpful to all readers who are interested in
embedded-type multicore chips and the advanced embedded systems that use these
chips.

Kokubunji, Japan Kunio Uchiyama


Acknowledgments

A book like this cannot be written without the help in one way or another of many
people and organizations.
First, part of the research and development on the heterogeneous multicore pro-
cessor technologies introduced in this book was supported by three NEDO (New
Energy and Industrial Technology Development Organization) projects: “Advanced
heterogeneous multiprocessor,” “Multicore processors for real-time consumer elec-
tronics,” and “Heterogeneous multicore technology for information appliances.”
The authors greatly appreciate this support.
The R&D process on heterogeneous multicore technologies involved many
researchers and engineers from Hitachi, Ltd., Renesas Electronics Corp., Waseda
University, Tokyo Institute of Technology, and Mitsubishi Electric Corp. The
authors would like to express sincere gratitude to all the members of these organiza-
tions associated with the projects. We give special thanks to Prof. Hideo Maejima
of Tokyo Institute of Technology, Prof. Keiji Kimura of Waseda University,
Dr. Toshihiro Hattori, Mr. Osamu Nishii, Mr. Masayuki Ito, Mr. Yusuke Nitta,
Mr. Yutaka Yoshida, Mr. Tatsuya Kamei, Mr. Yasuhiko Saito, Mr. Atsushi Hasegawa
of Renesas Electronics Corp., Mr. Shiro Hosotani of Mitsubishi Electric Corp., and
Mr. Toshihiko Odaka, Dr. Naohiko Irie, Dr. Hiroyuki Mizuno, Mr. Masaki Ito,
Mr. Koichi Terada, Dr. Makoto Satoh, Dr. Tetsuya Yamada, Dr. Makoto Ishikawa,
Mr. Tetsuro Hommura, and Mr. Keisuke Toyama of Hitachi, Ltd. for their efforts in
leading the R&D process.
Finally, the authors thank Mr. Charles Glaser and the team at Springer for their
efforts in publishing this book.

Contents

1 Background ............................................................................................... 1
1.1 Era of Digital Convergence ................................................................ 1
1.2 Heterogeneous Parallelism Based on Embedded Processors ............. 3
References ................................................................................................... 8

2 Heterogeneous Multicore Architecture ................................................... 11


2.1 Architecture Model ............................................................................ 11
2.2 Address Space .................................................................................... 16
References ................................................................................................... 18

3 Processor Cores ......................................................................................... 19


3.1 Embedded CPU Cores ....................................................................... 19
3.1.1 SuperH™ RISC Engine Family Processor Cores................... 20
3.1.2 Efficient Parallelization of SH-4 ............................................ 22
3.1.3 Efficient Frequency Enhancement of SH-X ........................... 32
3.1.4 Frequency and Efficiency Enhancement of SH-X2 ............... 42
3.1.5 Efficient Parallelization of SH-4 FPU .................................... 44
3.1.6 Efficient Frequency Enhancement of SH-X FPU .................. 56
3.1.7 Multicore Architecture of SH-X3 .......................................... 67
3.1.8 Efficient ISA and Address-Space
Extension of SH-X4 ............................................................... 69
3.2 Flexible Engine/Generic ALU Array (FE–GA) ................................. 74
3.2.1 Architecture Overview ........................................................... 75
3.2.2 Arithmetic Blocks .................................................................. 77
3.2.3 Memory Blocks and Internal Network ................................... 78
3.2.4 Sequence Manager and Configuration Manager .................... 80
3.2.5 Operation Flow of FE–GA ..................................................... 82
3.2.6 Software Development Environment ..................................... 83
3.2.7 Implementation of Fast Fourier
Transform on FE–GA............................................................. 85


3.3 Matrix Engine (MX) .......................................................................... 88


3.3.1 MX-1 ...................................................................................... 89
3.3.2 MX-2 ...................................................................................... 97
3.4 Video Processing Unit........................................................................ 101
3.4.1 Introduction ............................................................................ 101
3.4.2 Video Codec Architecture ...................................................... 102
3.4.3 Processor Elements ................................................................ 111
3.4.4 Implementation Results.......................................................... 117
3.4.5 Conclusion.............................................................................. 118
References ................................................................................................... 119

4 Chip Implementations .............................................................................. 123


4.1 Multicore SoC with Highly Efficient Cores....................................... 123
4.2 RP-1 Prototype Chip .......................................................................... 126
4.2.1 RP-1 Specifications ................................................................ 127
4.2.2 SH-X3 Cluster ........................................................................ 128
4.2.3 Dynamic Power Management ................................................ 128
4.2.4 Core Snoop Sequence Optimization ...................................... 129
4.2.5 SuperHyway Bus .................................................................... 131
4.2.6 Chip Integration ..................................................................... 132
4.2.7 Performance Evaluations........................................................ 134
4.3 RP-2 Prototype Chip .......................................................................... 136
4.3.1 RP-2 Specifications ................................................................ 136
4.3.2 Power Domain and Partial Power-Off .................................... 137
4.3.3 Synchronization Support Hardware ....................................... 138
4.3.4 Interrupt Handling for Multicore ........................................... 140
4.3.5 Chip Integration and Evaluation ............................................ 141
4.4 RP-X Prototype Chip ......................................................................... 143
4.4.1 RP-X Specifications ............................................................... 143
4.4.2 Dynamically Reconfigurable Processor FE–GA.................... 145
4.4.3 Massively Parallel Processor MX-2 ....................................... 146
4.4.4 Programmable Video Processing Core VPU5........................ 146
4.4.5 Global Clock Tree Optimization ............................................ 147
4.4.6 Memory Interface Optimization ............................................. 148
4.4.7 Chip Integration and Evaluation ............................................ 149
References ................................................................................................... 150

5 Software Environments ............................................................................ 153


5.1 Linux® on Multicore Processor .......................................................... 153
5.1.1 Porting SMP Linux ................................................................ 153
5.1.2 Power-Saving Features ........................................................... 157
5.1.3 Physical Address Extension ................................................... 161
5.2 Domain-Partitioning System .............................................................. 165
5.2.1 Introduction ............................................................................ 165
5.2.2 Trends in Embedded Systems ................................................ 166

5.2.3 Programming Model on Multicore Processors ...................... 166


5.2.4 Partitioning of Multicore Processor Systems ......................... 168
5.2.5 Multicore Processor with Domain-Partitioning
Mechanism ............................................................................. 170
5.2.6 Evaluation............................................................................... 175
References ................................................................................................... 177

6 Application Programs and Systems......................................................... 179


6.1 AAC Encoding ................................................................................... 179
6.1.1 Target System ......................................................................... 179
6.1.2 Processing Flow of AAC Encoding ....................................... 181
6.1.3 Process Mapping on FE-GA .................................................. 182
6.1.4 Data Transfer Optimization with DTU .................................. 182
6.1.5 Performance Evaluation on CPU and FE-GA ........................ 184
6.1.6 Performance Evaluation in
Parallelized Processing........................................................... 185
6.2 Real-Time Image Recognition ........................................................... 187
6.2.1 MX Library ............................................................................ 187
6.2.2 MX Application ..................................................................... 189
6.3 Applications on SMP Linux............................................................... 193
6.3.1 Load Balancing on RP-1 ........................................................ 194
6.3.2 Power Management on RP-2.................................................. 198
6.3.3 Image Filtering on RP-X ........................................................ 206
6.4 Video Image Search ........................................................................... 210
6.4.1 Implementation of Main Functions ........................................ 212
6.4.2 Implementation of Face Recognition
and GUI Controls ................................................................... 213
References ................................................................................................... 217

Index ................................................................................................................. 219


Chapter 1
Background

1.1 Era of Digital Convergence

Since the mid-1990s, the concept of “digital convergence” has been proposed and
discussed from both technological and business viewpoints [1]. In the twenty-first
century, “digital convergence” has become stronger and stronger in various digital
fields. It is especially notable in the recent trend in digital consumer products such
as cellular phones, car information systems, and digital TVs (Fig. 1.1) [2, 3]. This
trend will become more widespread in various embedded systems, and it will expand
the conventional market due to the development of new functional products and also
lead to the creation of new markets for goods such as robots.
In a digitally converged product, various applications are combined and executed
on a single device. For example, several video formats such as MPEG-4 and H.264
and several audio formats such as MP3 and AAC are decoded and encoded in a cel-
lular phone. In addition, recognition and synthesis technologies have recently been
added. The latest digital TVs and DVD recorders can even extract goal-scoring
scenes from soccer matches using audio and image recognition technologies. Thus,
an embedded SoC in the “digital-convergence” product needs to execute countless
tasks such as media, recognition, information, and communication processing.
Figure 1.2 shows the required performance of various current and future digital-
convergence applications, measured in giga operations per second (GOPS) [2, 3].
Digital convergence requires and will continue to require higher performance in
various kinds of media and recognition processes. The problem is that the improve-
ments made in the frequency of embedded CPUs, DSPs, or media processors will
not be sufficient in the future because of power consumption limits. In our estima-
tion, only applications that require performance of less than several GOPS can be
executed by a single processor at a level of power consumption acceptable for
embedded systems. We therefore need to find a solution for applications that require
higher GOPS performance. A special hardware accelerator is one solution [4, 5].
It is suitable for processing standardized formats like MPEG videos. However, the
hardware-accelerator approach is not always flexible. Better solutions that can exe-
cute a wide variety of high-GOPS applications should therefore be studied.

Fig. 1.1 Digital convergence: an SoC converges still-image (JPEG, JPEG2000, MotionJPEG), video (MPEG-2, MPEG-4, H.264, VC-1), graphics (2D/3D, image-based, multipath rendering), information/communication (WEB browser, XML, Java, database, DLNA), recognition/synthesis (voice, audio, image, biometrics), security (AES, DES, RSA, ElGamal, DRM), audio (MP3, AAC, AAC Plus, Dolby 5.1, WMA, RealAudio), and media (Flash, HDD, DVD, Blu-ray Disc) processing

Fig. 1.2 Required performance of digital-convergence applications on a 0.01-100-GOPS scale: video (MPEG-1, MPEG-2 MP/ML to MP/HL, JPEG, MPEG-4, H.264, DivX), graphics (2D/3D rendering from 10 to 100 Mpps, 3D image extraction), audio and voice (AAC, Dolby AC-3, MPEG, WMA, sentence/voice translation), recognition (word, voice, face, voice-print, eye, and video recognition), and communication (data modem, VoIP modem, FAX)
A photo of a ball-catching robot is shown in Fig. 1.3. This is an example of
media-recognition and motion-control convergence [6, 7]. In this system, a ball
image is extracted and recognized from video images of two cameras. The trajec-
tory of the ball is predicted by calculating its three-dimensional position. Based
on the trajectory projection, the joint angles of the robot manipulators are calcu-
lated, and the robot catches the ball. The four stages of the media recognition and
the motion control need to be executed every 30 ms, and this requires over
10-GOPS performance. Like this example, a variety of functions, which may
require high performance, will be converged in future embedded systems and
will need to be achieved on an embedded system-on-a-chip (SoC) at low power
consumption.

Fig. 1.3 Ball-catching robot: images from two cameras feed four stages (ball extraction, 3D-position calculation, trajectory prediction, and joint-angle calculation for the robot manipulator), executed every 30 ms (courtesy: Tohoku Univ., Kameyama & Hariyama Lab.)

1.2 Heterogeneous Parallelism Based on Embedded Processors

To satisfy the digital-convergence requirements described in the previous section,
i.e., high performance, low power, and flexibility, we need to develop a power-
efficient computing platform for advanced digital-convergence embedded systems.
When we analyze the trends in semiconductor technology from a design advantage
viewpoint (Fig. 1.4), there seems to have been a turning point around the 90-nm
technology node.
Because voltage scaling was possible before the 90-nm era, frequency, integra-
tion, and power consumption could all be improved. After the 90-nm era, it has
been and will be difficult to reduce the voltage because of transistor leakage current.
This means that it is very difficult to both increase the operating frequency of a
processor core and reduce or maintain the same level of power consumption. The
only remaining advantage is the one relating to the advances in integration accord-
ing to Moore’s law. Taking these facts into account, we have been developing het-
erogeneous multicore technologies that combine various types of processor cores
and that achieve heterogeneous parallel processing on a chip.
Fig. 1.4 Trend in semiconductor technology: design merit of frequency, power consumption, and integration across technology nodes from 250 nm to 45 nm

Fig. 1.5 Target of our heterogeneous multicore chip: performance/W versus power consumption (W), contrasting a power-efficient heterogeneous multicore for embedded systems with high-performance multicores for PCs/servers

In our heterogeneous multicore technologies, we focus not only on high perfor-
mance but also on low power consumption. Figure 1.5 shows the positioning of our
heterogeneous multicore chip, compared with multicore chips in PCs or servers. We
are aiming at a few-watt multicore solution instead of 100-W high-performance
multicores. Under natural air-cooling conditions, we aim at high performance and
maximizing performance per watt to satisfy the digital-convergence requirements of
the embedded systems. Our heterogeneous multicore technology is based on an
embedded processor core to achieve high power efficiency. In the embedded proces-
sor field, increasing the performance per watt has been one of the main objectives
since the 1990s [8–16]. The MIPS (million instructions per second)-per-watt index
was created and has been used to try to increase those values for single CPU cores
[17–24]. Figure 1.6a presents an example that shows the MIPS-per-watt improve-
ment of SuperH™ microprocessors (SH) that have been used in advanced embed-
ded systems. The first value for SH-1, which was developed using 0.8-µm technology,
was 30 MIPS/W in 1992. The 90-nm core used in SH-Mobile achieved over
6,000 MIPS/W, which was 200 times higher than that of 15 years ago. When we
compare this with the other types of processors in Fig. 1.6b, we can see the excellent
power efficiency of the embedded processor [2].

Fig. 1.6 MIPS/W of embedded processors: (a) MIPS/W of SH microprocessors, rising from 30 MIPS/W (SH-1/2, 1992) through the SH-3 and SH-4 generations to 4,500-6,000 MIPS/W (SH-Mobile); (b) performance (MIPS) versus power consumption (W), comparing embedded processors with PC/server processors (MIPS based on Dhrystone 2.1)
Our other policy is to effectively use heterogeneous parallelism to attain high
power efficiency in various digital-convergence applications. Now, various types of
processor cores other than CPU cores have been developed. Figure 1.7 shows exam-
ples of these processor cores, which are positioned in terms of flexibility and perfor-
mance per watt/performance per area.
Fig. 1.7 Various processor cores positioned by flexibility versus performance per watt and performance per mm²: hardware accelerators and highly SIMD cores at the high-efficiency end, dynamically reconfigurable cores, media processors, and DSPs in between, and the CPU at the high-flexibility end

Fig. 1.8 Dynamically reconfigurable processor core: a sequence manager and a configuration manager control an arithmetic array of 24 ALU cells and 8 multiplier cells (16-bit each), connected via a crossbar switch to 10 load/store cells and 10 local RAM banks (4 KB, 2-port each)

The CPU is a general-purpose processor core and has the most flexibility. The
other processor cores are developed for special purposes. They have less flexibility
but high power/area efficiency. The DSP is for signal processing applications,
and the media processor is for effectively processing various media data such as
audio and video. There are also special-purpose processor cores that are suitable
for arithmetic-intensive applications. These include the dynamically reconfigurable
core and highly SIMD (single instruction multiple data)-type core.
Figure 1.8 depicts an example of a dynamically reconfigurable processor core
[25], which is described in Sect. 3.2 in detail. It includes an arithmetic array consist-
ing of 24 ALU cells and 8 multiply cells, each of which executes a 16-bit arithmetic
operation. The array is connected to ten load/store cells with dual-ported local
memories via a crossbar switch. The core can achieve highly arithmetic-level paral-
lelism using the two-dimensional array. When an algorithm such as an FFT or FIR
filter is executed in the core, the configurations of the cells and their connections are
determined, and the data in the local RAMs are processed very quickly according to
the algorithm.
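
To make the mapping concrete, the following C fragment is a minimal sketch of the kind of 16-bit FIR inner loop that would be laid out on the array; the function and tap count are hypothetical, and on the FE-GA the per-tap multiplies and additions are spatially unrolled onto the multiplier and ALU cells rather than iterating sequentially as written here:

    #include <stdint.h>

    #define TAPS 8  /* hypothetical tap count chosen to fit the 8 multiplier cells */

    /* 16-bit FIR filter kernel of the kind mapped onto the FE-GA array.
     * Each tap's multiply corresponds to a multiplier cell and each
     * accumulation to an ALU cell; once configured, the array streams
     * data from the local RAM banks through this dataflow graph. */
    void fir16(const int16_t *in, int16_t *out, const int16_t coef[TAPS], int n)
    {
        for (int i = 0; i + TAPS <= n; i++) {
            int32_t acc = 0;
            for (int t = 0; t < TAPS; t++)
                acc += (int32_t)in[i + t] * coef[t];
            out[i] = (int16_t)(acc >> 15);  /* Q15 scaling back to 16 bits */
        }
    }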
Figure 1.9 is an example of a highly SIMD-type processor core [26], which is
described in Sect. 3.3 in detail. The core has 2,048 processing elements, each of
which includes two 2-bit full adders and some logic circuits. The processing ele-
ments are directly connected to two data register arrays, which are composed of
single-port SRAM cells. The processor core can execute arithmetic-intensive appli-
cations such as image and signal processing by operating 2,048 processing elements
in the SIMD manner.
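
As a rough illustration of this SIMD manner of operation, the C loop below emulates one SIMD step: each iteration stands for one processing element, and on the MX core all 2,048 elements would perform the loop body simultaneously on their own register data. The saturating 8-bit addition is an illustrative operation, not part of the actual MX instruction set:

    #include <stdint.h>

    #define NUM_PE 2048  /* one loop iteration per processing element */

    /* Sequential emulation of a single SIMD operation: every PE adds
     * its own pixel pair with 8-bit saturation. On the real core the
     * 2,048 PEs execute this in lockstep in one step. */
    void simd_add_sat_u8(const uint8_t *a, const uint8_t *b, uint8_t *dst)
    {
        for (int pe = 0; pe < NUM_PE; pe++) {
            uint16_t sum = (uint16_t)a[pe] + (uint16_t)b[pe];
            dst[pe] = (sum > 255) ? 255 : (uint8_t)sum;
        }
    }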
Fig. 1.9 Highly SIMD-type processor core: a controller with an instruction RAM drives 2,048 2-bit processing elements (PEs), each directly connected to data-register arrays of 256-word single-port SRAM on both sides

Fig. 1.10 Full HD H.264 video CODEC accelerator: a stream processing unit (stream processor, CABAC accelerator, DMAC) and an image processing unit with two macroblock pipelines (#0 and #1), each consisting of a symbol codec plus transformer (TRF), fine motion estimator/compensator (FME), and de-blocking filter (DEB) programmable image processing elements (PIPEs), together with a coarse motion estimator (CME) and line memory (L-MEM), connected by a shift-register-based bus (CABAC: context-based adaptive binary arithmetic coding)

The hardware accelerator is a core that has been developed for a dedicated appli-
cation. To achieve high power and area efficiency, the internal architecture of the
accelerator is highly optimized for the target applications. The full HD H.264 video
CODEC accelerator described in Sect. 3.4 is a good example [5]. The accelerator
(Fig. 1.10), which is fabricated using 65-nm CMOS technology and operates at
162 MHz, consists of dedicated processing elements, hardware logic, and proces-
sors which are suitably designed to execute each CODEC stage. The accelerator
decodes full HD (high definition) H.264 video at 172 mW. If we use a high-end
CPU core for this decoding, at least a 2–3 GHz frequency is necessary with the
100% load of the CPU. This means this CODEC core achieves 200-300 times higher
performance per watt than a high-end CPU core.
In our heterogeneous multicore approach, both general-purpose CPU cores and
special-purpose processor cores described above are used effectively. When a pro-
gram is executed, it is divided into small parts, and each part is executed in the most
suitable processor core. This should achieve a very power-efficient and cost-effective
solution. In the following chapters, we introduce heterogeneous multicore technolo-
gies which have been developed according to the above-described policies from the
hardware and software viewpoints.

References

1. Negroponte N (1995) Being digital. Knopf, New York
2. Uchiyama K (2008) Power-efficient heterogeneous parallelism for digital convergence, digest
of technical papers of 2008 Symposium on VLSI Circuits, Honolulu, USA, pp 6–9
3. Uchiyama K (2010) Power-efficient heterogeneous multicore for digital convergence,
Proceedings of 10th International Forum on Embedded MPSoC and Multicore, Gifu, Japan,
pp 339–356
4. Liu T-M, Lin T-A, Wang S-Z, Lee W-P, Hou K-C, Yang J-Y, Lee C-Y (2006) A 125 µW, Fully
Scalable MPEG-2 and H.264/AVC Video Decoder for Mobile Application, Digest of Technical
Papers of 2006 IEEE International Solid-State Circuits Conference, San Francisco, USA,
pp 402–403
5. Iwata K, Mochizuki S, Shibayama T, Izuhara F, Ueda H, Hosogi K, Nakata H, Ehama M,
Kengaku T, Nakazawa T, Watanabe H (2008) A 256 mW Full-HD H.264 High-Profile CODEC
Featuring Dual Macroblock-Pipeline Architecture in 65 nm CMOS, Digest of Technical Papers
of 2008 Symposium on VLSI Circuits, Honolulu, USA, pp 102–103
6. Hariyama M, Kazama H, Kameyama M (2000) VLSI Processor for Hierarchical Template
Matching and Its Application to a Ball-Catching Robot System, IEEE International Symposium
on Intelligent Signal Processing and Communication Systems (ISPACS), vol 2, pp 613–618
7. Kazama H, Hariyama M, Kameyama M (2000) Design of a VLSI processor based on an
immediate output generation scheduling for ball-trajectory prediction. J Robot Mechatron
12(5):534–540
8. Kawasaki S (1994) SH-II: a low power RISC micro for consumer applications. Hot Chips
VI:79–103
9. Narita S, Ishibashi K, Tachibana S, Norisue K, Shimazaki Y, Nishimoto J, Uchiyama K,
Nakazawa T, Hirose K, Kudoh I, Izawa R, Matsui S, Yoshioka S, Yamamoto M, Kawasaki I
(1995) A low-power single-chip microprocessor with multiple page-size MMU for nomadic
computing, 1995 Symposium on VLSI Circuits, Dig. Tech. Papers, pp 59–60
10. Hasegawa A, Kawasaki I, Yamada K, Yoshioka S, Kawasaki S, Biswas P (1995) SH3: high
code density, low power. IEEE Micro 15(6):11–19
11. Maejima H, Kainaga M, Uchiyama K (1997) Design and architecture for low-power/high-
speed RISC microprocessor: SuperH. IEICE Trans Electron E80-C(12):1539–1545
12. Arakawa F, Nishii O, Uchiyama K, Nakagawa N (1997) SH4 RISC microprocessor for multi-
media. HOT Chips IX:165–176
13. Uchiyama K (1998) Low-power, high-performance Microprocessors for Multimedia
Applications, Cool Chips I, An International Symposium on Low-Power and High-Speed
Chips, pp 83–98
14. Arakawa F, Nishii O, Uchiyama K, Nakagawa N (1998) SH4 RISC multimedia microproces-
sor. IEEE Micro 18(2):26–34
15. Nishii O, Arakawa F, Ishibashi K, Nakano S, Shimura T, Suzuki K, Tachibana M, Totsuka Y,
Tsunoda T, Uchiyama K, Yamada T, Hattori T, Maejima H, Nakagawa N, Narita S, Seki M,
Shimazaki Y, Satomura R, Takasuga T, Hasegawa A (1998) A 200 MHz 1.2 W 1.4GFLOPS
Microprocessor with Graphic Operation Unit, 1998 IEEE International Solid-State Circuits
Conference Dig. Tech. Papers, pp 288–289
16. Mizuno H, Ishibashi K, Shimura T, Hattori T, Narita S, Shiozawa K, Ikeda S, Uchiyama K
(1999) An 18-mA standby current 1.8 V 200-MHz microprocessor with self-substrate-biased
data-retention mode. IEEE J Solid-State Circuits 34(11):1492–1500

17. Kamei T, et al (2004) A resume-standby application processor for 3G cellular phones, ISSCC
Dig Tech Papers:336–337, 531
18. Ishikawa M, et al (2004) A resume-standby application processor for 3G cellular phones with
low power clock distribution and on-chip memory activation control, COOL Chips VII
Proceedings, vol I, pp 329–351
19. Arakawa F, et al (2004) An embedded processor core for consumer appliances with 2.8GFLOPS
and 36 M Polygons/s FPU. IEICE Trans Fundamentals, E87-A(12):3068–3074
20. Ishikawa M, et al (2005) A 4500 MIPS/W, 86 mA resume-standby, 11 mA ultra-standby appli-
cation processor for 3G cellular phones. IEICE Trans Electron E88-C(4):528–535
21. Arakawa F, et al (2005) SH-X: An Embedded Processor Core for Consumer Appliances, ACM
SIGARCH Computer Architecture News 33(3), pp 33–40
22. Yamada T, et al (2005) Low-Power Design of 90-nm SuperH™ Processor Core, Proceedings
of 2005 IEEE International Conference on Computer Design (ICCD), pp 258–263
23. Arakawa F, et al (2005) SH-X2: An Embedded Processor Core with 5.6 GFLOPS and 73 M
Polygons/s FPU, 7th Workshop on Media and Streaming Processors (MSP-7), pp 22–28
24. Yamada T et al (2006) Reducing Consuming Clock Power Optimization of a 90nm Embedded
Processor Core. IEICE Trans Electron E89–C(3):287–294
25. Kodama T, Tsunoda T, Takada M, Tanaka H, Akita Y, Sato M, Ito M (2006) Flexible Engine:
A dynamic reconfigurable accelerator with high performance and low power consumption, in
Proc. of the IEEE Symposium on Low-Power and High-Speed Chips (COOL Chips IX)
26. Noda H et al (2007) The design and implementation of the massively parallel processor based
on the matrix architecture. IEEE J Solid-State Circuits 42(1):183–192
Chapter 2
Heterogeneous Multicore Architecture

2.1 Architecture Model

In order to satisfy the high-performance and low-power requirements for advanced
embedded systems with greater flexibility, it is necessary to develop parallel pro-
cessing on chips by taking advantage of the advances being made in semiconductor
integration. Figure 2.1 illustrates the basic architecture of our heterogeneous multi-
core [1, 2]. Several low-power CPU cores and special purpose processor (SPP)
cores, such as a digital signal processor, a media processor, and a dynamically
reconfigurable processor, are embedded on a chip. In the figure, the number of CPU
cores is m. There are two types of SPP cores, SPPa and SPPb, on the chip. The values
n and k represent the respective number of SPPa and SPPb cores. Each processor
core includes a processing unit (PU), a local memory (LM), and a data transfer unit
(DTU) as the main elements. The PU executes various kinds of operations. For
example, in a CPU core, the PU includes arithmetic units, register files, a program
counter, control logic, etc., and executes machine instructions. With some SPP cores
like the dynamic reconfigurable processor, the PU executes a large quantity of data
in parallel using its array of arithmetic units. The LM is a small-size and low-latency
memory and is mainly accessed by the PU in the same core during the PU’s execu-
tion. Some cores may have caches as well as an LM or may only have caches with-
out an LM. The LM is necessary to meet the real-time requirements of embedded
systems. The access time to a cache is non-deterministic because of cache misses.
On the other hand, the access to an LM is deterministic. By putting a program and
data in the LM, we can accurately estimate the execution cycles of a program that
has hard real-time requirements. A data transfer unit (DTU) is also embedded in the
core to achieve parallel execution of internal operation in the core and data transfer
operations between cores and memories. Each PU in a core processes the data on its
LM or its cache, and the DTU simultaneously executes memory-to-memory data
transfer between cores. The DTU is like a direct memory access controller (DMAC): it
executes a command that transfers data between several kinds of memories, then
checks and waits for the end of the data transfer, etc. Some DTUs are capable of
command chaining, where multiple commands are executed in order. The frequency
and voltage controller (FVC) connected to each core controls the frequency, voltage,
and power supply of each core independently and reduces the total power con-
sumption of the chip. If the frequencies or power supplies of the core's PU, DTU,
and LM can be independently controlled, the FVC can vary their frequencies and
power supplies individually. For example, the FVC can stop the clock of the PU
while keeping the DTU and LM running when the core is executing only data
transfers. The on-chip shared memory (CSM) is a medium-sized on-chip memory
that is commonly used by cores. Each core is connected to the on-chip interconnect,
which may be several types of buses or crossbar switches. The chip is also con-
nected to the off-chip main memory, which has a large capacity but high latency.

Fig. 2.1 Heterogeneous multicore architecture: CPU cores #0-#m and two types of special purpose processor cores (SPPa #0-#n, SPPb #0-#k), each containing a processing unit (PU), local memory (LM), data transfer unit (DTU), and frequency and voltage controller (FVC), connected by an on-chip interconnect to the on-chip shared memory (CSM) and the off-chip main memory
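
To sketch what such a DTU command interface might look like in C, the structure below chains transfer descriptors; every name here is hypothetical, since the concrete command format is implementation specific, but it reflects the behavior just described: memory-to-memory transfers executed in order, with a completion flag the PU can check or wait on:

    #include <stdbool.h>
    #include <stdint.h>

    /* Hypothetical DTU transfer command; real formats vary by implementation. */
    typedef struct dtu_cmd {
        uint64_t src;           /* source address (public address space)      */
        uint64_t dst;           /* destination address (public address space) */
        uint32_t bytes;         /* transfer length                            */
        volatile bool done;     /* set by the DTU when the transfer finishes  */
        struct dtu_cmd *next;   /* command chaining: next command, or NULL    */
    } dtu_cmd_t;

    extern void dtu_start(dtu_cmd_t *head);  /* assumed driver call: kick off a chain */

    /* The PU can keep computing on its LM and join the chain later. */
    static inline void dtu_wait(const dtu_cmd_t *last)
    {
        while (!last->done)
            ;  /* spin until the DTU flags completion */
    }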
Figure 2.1 illustrates a typical model of a heterogeneous multicore architecture.
A number of variations based on this architecture model are possible. Several varia-
tions of an LM structure are shown in Fig. 2.2. Case (a) is a hierarchical structure
where the LM has two levels. LM1 is a first-level, small-size, low-latency local
memory. LM2 is a second-level, medium-sized, not-so-low-latency local memory.
For example, the latency from the PU to LM1 is one processor cycle, and the latency
to LM2 is a few processor cycles. Case (b) is a Harvard type. The LM is divided into
an LMi that stores instructions and an LMd that stores data. The PU has an indepen-
dent access path to each LM. This structure allows parallel accesses to instructions
and data and enhances processing performance. Case (c) is a combination of (a) and
(b). The LMi and LMd are first-level local memories for instructions and data,
respectively. LM2 is a second-level local memory that stores both instructions and
data. In each case, each LM is mapped on a different address area; that is, the PU
accesses each LM with different addresses.
Fig. 2.2 Structures of various local memories: (a) hierarchical, with a first-level LM1 backed by a second-level LM2; (b) Harvard, with separate LMi for instructions and LMd for data; (c) hierarchical Harvard, with LMi and LMd backed by a shared LM2

Fig. 2.3 Example of other heterogeneous multicore configurations: CPU cores with DTUs, the CSMl, and the off-chip main memory on the left on-chip bus; SPP cores without DTUs, the CSMr, and a shared DMAC on the right on-chip bus; each CPU core has its own FVC, while the SPP cores share one

In Fig. 2.3, we can see other configurations of a DTU, CSM, FVC, and an on-chip
interconnect. First, processor cores are divided into two clusters. The CPU cores,
the CSMl, and the off-chip main memory are tightly connected in the left cluster.
The SPP cores, the CSMr, and the DMAC are similarly connected in the right cluster.
Not every SPP core has a DTU inside. Instead, the DMAC that has multiple chan-
nels is commonly used for data transfer between an LM and a memory outside an
SPP core. For example, when data are transferred from an LM to the CSMr, the
DMAC reads data in the LM via the right on-chip bus, and the data are written on
the CSMr from the DMAC. We need two bus transactions for this data transfer. On
the other hand, if a DTU in a CPU core on the left cluster is used for the same
transfer, data are read from an LM by the DTU in the core, and the data are written
on the CSMl via the on-chip bus by the DTU. Only one transaction on the on-chip
bus is necessary in this case, and the data transfer is more efficient compared with
the case using the off-core DMAC. Although each CPU core in the left cluster has
an individual FVC, the SPP cores in the right cluster share an FVC. With this simpler
FVC configuration, all SPP cores operate at the same voltage and the same fre-
quency, which are controlled simultaneously.
Fig. 2.4 Parallel operation: a timeline of processing periods (P1-P11) on CPU #0, CPU #1, SPPa #0, and SPPb #0 overlapped with data-transfer periods (T1-T8), including idle slots W1 on CPU #0 and W2 on SPPa #0

When a program is executed on a heterogeneous multicore, it is divided into


small parts, and each is executed in parallel in the most suitable processor core, as
shown in Fig. 2.4. Each core processes data on its LM or cache in a Pi period, and
the DTU of a core simultaneously executes a memory–memory data transfer in a Ti
period. For example, CPU #1 processes data on its LM at a P2 period, and its DTU
transfers processed data from the LM of CPU #1 to the LM of SPPb #0 at the
T1 period. After the data transfer, SPPb #0 starts to process data on its LM at a
P6 period. CPU #1 also starts a P5 process that overlaps with the T1 period. In the
parallel operation of Fig. 2.4, there is a time slot like W1 when the corresponding
core CPU #0 does not need to process or transfer data from the core. During this
time slot, the frequencies of the PU and DTU of CPU #0 can be slowed down or
stopped, or their power supplies can be cut off by control of the connected FVC. As
there are no internal operations of SPPa #0 during the time slot W2, the power of
SPPa #0 can be cut off during this time slot. This FVC control reduces redundant
power consumption of cores and can result in lowering the power consumption of a
heterogeneous multicore chip.
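
A minimal double-buffering sketch in C shows how this overlap is typically programmed: while the PU processes one LM-resident buffer, the DTU fills the other, mirroring the overlapped Pi and Ti periods of Fig. 2.4 (the dtu_fetch/dtu_sync helpers are hypothetical stand-ins for the DTU command interface sketched above):

    #include <stdint.h>

    #define BUF_WORDS 1024

    extern int16_t lm_buf[2][BUF_WORDS];            /* two buffers in this core's LM  */
    extern void process(int16_t *buf);              /* PU work on one LM buffer (Pi)  */
    extern void dtu_fetch(int16_t *dst, int chunk); /* hypothetical: start a DTU fill */
    extern void dtu_sync(void);                     /* hypothetical: wait for the DTU */

    /* Double buffering: the PU computes on buffer 'cur' while the DTU
     * concurrently transfers the next chunk into the other buffer. */
    void run(int chunks)
    {
        int cur = 0;
        dtu_fetch(lm_buf[cur], 0);          /* prime the first buffer */
        dtu_sync();
        for (int c = 1; c < chunks; c++) {
            dtu_fetch(lm_buf[cur ^ 1], c);  /* Ti: fetch next chunk      */
            process(lm_buf[cur]);           /* Pi: process current chunk */
            dtu_sync();                     /* join before swapping      */
            cur ^= 1;
        }
        process(lm_buf[cur]);               /* last chunk: nothing left to fetch */
    }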
Here, we show an example of our architecture model applied to a heterogeneous
multicore chip. Figure 2.5 is a photograph of the RP-X chip (see Sect. 4.4) [3–5].
Figure 2.6 depicts the internal block diagram. The chip includes eight CPU cores
and seven SPP cores of three types. The CPU (see Sect. 3.1) includes a two-level LM
as well as a 32-KB instruction cache and a 32-KB operand cache. The LM consists
of a 16-KB ILRAM for instruction storage, a 16-KB OLRAM for data storage, and
a 64-KB URAM for instruction and data storage. Each CPU has a local clock pulse
generator (LCPG) that corresponds to the FVC and controls the CPU’s clock
frequency independently. The eight CPUs are divided into two clusters. Each
cluster of four CPUs is connected to independent on-chip buses. Additionally,
each cluster has a 256-KB CSM and a DDR3 port which is connected to off-chip
DDR3 DRAMs.
Fig. 2.5 Heterogeneous multicore chip (photograph of the RP-X chip)

Fig. 2.6 Block diagram of the heterogeneous multicore chip: two clusters of four CPUs (CPU #0-#3 and CPU #4-#7), each CPU with a local clock pulse generator (LCPG), local memories, and a DTU; a 256-KB CSM per cluster on on-chip buses #0 and #1; a video processing unit (VPU, 300-KB LM and DTU) and a DMA controller on one bus; four flexible engines (FEs) and two matrix processors (MXs) on the other; and a DDR3 port per cluster to off-chip DDR3 DRAMs

Three types of SPPs are embedded on the chip. The first SPP is a video processing
unit (VPU, see Sect. 3.4) which is specialized for video processing such as MPEG-4
and H.264 codec. The VPU has a 300-KB LM and a DTU built-in. The second and
third SPP types are four flexible engines (FEs, see Sect. 3.2) and two matrix processors
(MXs, see Sect. 3.3), which are included in the other cluster. The FE is a dynami-
cally reconfigurable processor which is suitable for data-parallel processing such as
digital signal processing. The FE has an internal 30-KB LM but does not have a
DTU. The on-chip DMA controller (DMAC) that can be used in common by on-chip
units or a DTU of another core is used to transfer data between the LM and other
memories. The MX has 1,024-way single instruction multiple data (SIMD) architec-
ture that is suitable for highly data-intensive processing such as video recognition.
The MX has an internal 128-KB LM but does not have its own DTU, just like the FE.
In the chip photograph in Fig. 2.5, the upper-left island includes four CPUs, and the
lower-left island has the VPU with other blocks. The left cluster in Fig. 2.6 includes
these left islands and a DDR3 port depicted at the lower-left side. The lower-right
island in the photo in Fig. 2.5 includes another four CPUs, the center-right island has
four FEs, and the upper-right has two MXs. The right cluster in Fig. 2.6 includes
these right islands and a DDR3 port depicted at the upper-right side. With these 15
on-chip heterogeneous cores, the chip can execute a wide variety of multimedia and
digital-convergence applications at high-speed and low-power consumption. The
details of the chip and its applications are described in Chaps. 4–6.

2.2 Address Space

There are two types of address spaces defined for a heterogeneous multicore chip.
One is a public address space where all major memory resources on and off the
chip are mapped and can be accessed by processor cores and DMA controllers in
common. The other is a private address space where the addresses looked for from
inside the processor core are defined. The thread of a program on a processor core
runs on the private address space of the processor core. The private address space of
each processor core is defined independently.
Figure 2.7a shows a public address space of the heterogeneous multicore chip
depicted in Fig. 2.1. The CSM, the LMs of CPU #0 to CPU #m, the LMs of SPPa #0
to SPPa #n, and the LMs of SPPb #0 to SPPb #k are mapped in the public address
space, as well as the off-chip main memory. Each DTU in each processor core can
access the off-chip main memory, the CSM, and the LMs in the public address
space and can transfer data between various kinds of memories. A private address
space is independently defined per processor core. The private addresses are gener-
ated by the PU of each processor core. For a CPU core, the address would be
generated during the execution of a load or store instruction in the PU. Figure 2.7b, c
shows examples of private address spaces of a CPU and SPP. The PU of the CPU
core accesses data of the off-chip main memory, the CSM, and its own LM mapped
on the private address space of Fig. 2.7b. If the LM of another processor core is not
mapped on this private address space, the load/store instructions executed by the PU
of the CPU core cannot access data on the other processor core’s LM. Instead, the
DTU of the CPU core transfers data from the other processor core’s LM to its own
LM, the CSM, or the off-chip main memory using the public address space, and the
PU accesses the data in its private address space. In the SPP example (Fig. 2.7c),
the PU of the SPP core can access only its own LM in this case. The data transfer
between its own LM and memories outside the core is done by its own DTU on the
public address space.

Fig. 2.7 Public/private address spaces: (a) the public address space maps the off-chip main memory, the CSM, and the LMs of all CPU and SPP cores; (b) a CPU core's private address space maps the off-chip main memory, the CSM, and its own LM; (c) an SPP core's private address space maps only its own LM

The address mapping of a private address space varies according to the structure
of the local memory. Figure 2.8 illustrates the case of the hierarchical Harvard
structure of Fig. 2.2c. The LMi and LMd are first-level local memories for instruc-
tions and data, respectively. The LM2 is a second-level local memory that stores
both instructions and data. The LMi, LMd, and LM2 are mapped on different
address areas in the private address space. The PU accesses each LM with different
addresses.

Fig. 2.8 Private address space (hierarchical Harvard): the LMi, LMd, and LM2 occupy separate address areas in the core's private address space
The size of the address spaces depends on the implementation of the heteroge-
neous multicore chip and its system. For example, a 40-bit address is assigned for a
public address space, a 32-bit address for a CPU core’s private address space, a
16-bit address for the SPP’s private address space, and so on. In this case, the sizes
of each space are 1 TB, 4 GB, and 64 KB, respectively. Concrete examples of this
are described in Chaps. 3 and 4.
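
The following C sketch ties the two spaces together; all base addresses and the descriptor layout are hypothetical, but it shows the division of labor described above: a DTU descriptor carries public addresses (and so can reach another core's LM), while the PU touches the transferred data only through its own private address space:

    #include <stdbool.h>
    #include <stdint.h>

    /* Hypothetical bases in the 40-bit public address space (cf. Fig. 2.7a). */
    #define PUB_LM_CPU0   0x9000000000ULL  /* LM of CPU #0, public alias */
    #define PUB_LM_SPPA0  0x9100000000ULL  /* LM of SPPa #0              */

    /* Hypothetical base of CPU #0's own LM in its 32-bit private space. */
    #define PRIV_LM_CPU0  0xE0000000UL

    typedef struct {               /* same hypothetical descriptor as in Sect. 2.1 */
        uint64_t src, dst;         /* public addresses (up to 40 bits used) */
        uint32_t bytes;
        volatile bool done;
    } dtu_cmd_t;

    extern void dtu_start(dtu_cmd_t *cmd);  /* assumed driver entry point */

    void fetch_from_sppa0(void)
    {
        dtu_cmd_t cmd = {
            .src = PUB_LM_SPPA0, .dst = PUB_LM_CPU0, .bytes = 4096, .done = false
        };
        dtu_start(&cmd);           /* transfer runs on the public address space */
        while (!cmd.done)
            ;

        /* The PU now reads the data through the private alias of its LM. */
        volatile int16_t *local = (volatile int16_t *)PRIV_LM_CPU0;
        (void)local;
    }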

References

1. Uchiyama K (2008) Power-efficient heterogeneous parallelism for digital convergence, digest of
technical papers of 2008 Symposium on VLSI Circuits, Honolulu, USA, pp 6–9
2. Uchiyama K (2010) Power-efficient heterogeneous multicore for digital convergence,
Proceedings of 10th International Forum on Embedded MPSoC and Multicore, Gifu, Japan,
pp 339–356
3. Yuyama Y, et al (2010) A 45 nm 37.3GOPS/W heterogeneous multi-core SoC, ISSCC Dig
Tech Papers: 100–101
4. Nito T, et al (2010) A 45 nm heterogeneous multi-core SoC supporting an over 32-bits physical
address space for digital appliance, COOL Chips XIII Proceedings, Session XI, no. 1
5. Arakawa F (2011) Low power multicore for embedded systems, CMOS Emerging Technology
2011, Session 5B, no. 1
Chapter 3
Processor Cores

The processor cores described in this chapter are well tuned for embedded systems.
They are the SuperH™ RISC engine family processor cores (SH cores) as typical
embedded CPU cores, the flexible engine/generic ALU array (FE–GA, or simply FE
for flexible engine) as a reconfigurable processor core, the MX core as a massively
parallel SIMD-type processor, and the video processing unit (VPU) as a video processing
accelerator. Heterogeneous multicore processor chips can be implemented with these
cores, and three implemented prototype chips, RP-1, RP-2, and RP-X, are introduced in
Chap. 4.

3.1 Embedded CPU Cores

Since the beginning of microprocessor history, processors for PCs and
servers have continuously advanced in performance while maintaining a price range
from hundreds to thousands of dollars [1, 2]. On the other hand, single-chip micro-
controllers have continuously come down in price, reaching a range from dozens of
cents to several dollars while maintaining their performance, and have been built into
various products [3]. As a result, there was no demand for processors in the
middle price range from tens to hundreds of dollars.
However, with the introduction of home game consoles in the late 1980s and
the digitization of home electronic appliances from the 1990s onward, demand arose
in this price range for processors suitable for multimedia processing.
Rather than seeking the highest performance, such processors attach great impor-
tance to high efficiency. For example, the performance may be 1/10 that of a PC
processor at 1/100 of the price, or the performance may equal that of a PC processor
for the important function of the product at 1/10 of the price. Improving area
efficiency thus became an important issue for such processors.
In the late 1990s, high-performance processors consumed too much power for
mobile devices such as cellular phones and digital cameras, and demand increased
for processors with higher performance and lower power for multimedia processing.
Therefore, improving power efficiency became an important issue.
Furthermore, in the 2000s, ever finer processes allowed more functions to be
integrated, but the increase in initial and development costs became a serious
problem. As a result, flexible specifications and cost reduction became important
issues. In addition, the finer processes suffered from greater leakage current.
Against this background, embedded processors were introduced to meet these
requirements and have improved area, power, and development efficiencies.
In this section, SuperH™ RISC (reduced instruction set computer) engine family
processor cores are introduced as examples of highly efficient CPU cores.

3.1.1 SuperH™ RISC Engine Family Processor Cores

A multicore SoC is one of the most promising approaches to realizing high efficiency,
which is the key factor in achieving high performance under fixed power and
cost budgets. As a result, embedded systems are employing multicore architecture
more and more. A multicore is good for multiplying single-core performance
while maintaining core efficiency, but it does not enhance the efficiency of the core
itself. Therefore, we must use highly efficient cores. In this section, SuperH™ RISC
engine family (SH) processors are introduced as typical highly efficient embedded
CPU cores for both single- and multicore chips.
The first SH processor was developed based on the SuperH™ architecture as an
embedded processor in 1993. Since then, SH processors have been developed to
combine performance suitable for multimedia processing with area and power
efficiency. In general, performance improvement degrades efficiency, as
Pollack's rule indicates [4]. However, we can find ways to improve both the
performance and the efficiency. Even if each way contributes only a small
improvement, the total improvement can be significant.
The first-generation product SH-1 was manufactured using a 0.8-µm process,
operated at 20 MHz, and achieved performance of 16 MIPS at 500 mW. It was a
high-performance single-chip microcontroller and integrated a ROM, a RAM, a
direct memory access controller (DMAC), and an interrupt controller.
MIPS is an abbreviation of million instructions per second and is a popular integer-performance measure for embedded processors. Processors of equal performance should take the same time to run the same program, but native MIPS values vary with the number of instructions executed for that program, which depends on the ISA. Therefore, Dhrystone benchmark performance relative to that of a VAX 11/780 minicomputer, which achieved 1 MIPS, is broadly used [5]; the relative value is called VAX MIPS, DMIPS, or simply MIPS.
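As a minimal sketch of how such a relative value is computed, assuming the commonly cited VAX 11/780 score of 1,757 Dhrystone loops per second (the constant is an assumption for illustration, not a value from this book):

    #include <stdio.h>

    /* Commonly cited Dhrystone score of the VAX 11/780 (the 1-MIPS
       reference); an assumption for illustration. */
    #define VAX_DHRYSTONES_PER_SEC 1757.0

    /* DMIPS = (Dhrystone loops/s of the processor) / (VAX 11/780 loops/s) */
    double dmips(double dhrystones_per_sec) {
        return dhrystones_per_sec / VAX_DHRYSTONES_PER_SEC;
    }

    int main(void) {
        /* e.g., a core scoring 632,520 Dhrystones/s rates 360 DMIPS */
        printf("%.1f DMIPS\n", dmips(632520.0));
        return 0;
    }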
The second-generation product, the SH-2, was manufactured in 1994 using the same 0.8-μm process as the SH-1 [6]. It operated at 28.5 MHz and achieved a
performance of 25 MIPS at 500 mW through optimization during the redesign from the SH-1. The SH-2 integrated a cache memory and an SDRAM controller instead of the ROM and RAM of the SH-1, as it was designed for systems using external memories. An integrated SDRAM controller was not common at that time, but it eliminated external circuitry and contributed to system cost reduction. In addition, the SH-2 integrated a 32-bit multiplier and a divider to accelerate multimedia processing. It was used in a home game console, one of the most popular digital appliances at the time, and thus extended the application field of the SH processors to digital appliances with multimedia processing.
The third-generation product, the SH-3, was manufactured in a 0.5-μm process in 1995 [7]. It operated at 60 MHz and achieved a performance of 60 MIPS at 500 mW. Its power efficiency was improved for mobile devices. For example, the clock power was reduced by dividing the chip into multiple clock regions and operating each region at the most suitable clock frequency. In addition, the SH-3 integrated a memory management unit (MMU) for devices such as personal organizers and handheld PCs. An MMU is necessary for a general-purpose operating system (OS) that allows various application programs to run on the system.
The fourth-generation product, the SH-4, was manufactured in a 0.25-μm process in 1997 [8–10]. It operated at 200 MHz and achieved a performance of 360 MIPS at 900 mW. The SH-4 was later ported to a 0.18-μm process, and its power efficiency was further improved. The power efficiency and the product of performance and efficiency reached 400 MIPS/W and 0.14 GIPS²/W, respectively, which were among the best values at that time. This product roughly indicates the attained degree of the design, because there is a trade-off relationship between performance and efficiency. The design is discussed in Sects. 3.1.2 and 3.1.5.
The fifth-generation processor, the SH-5, was developed with a newly defined instruction set architecture (ISA) in 2001 [11–13], and the SH-4A, an advanced version of the SH-4, was developed in 2003 while keeping ISA compatibility. The compatibility was important, and the SH-4A was used in various products. The SH-5 and the SH-4A were developed as CPU cores connected to various other hardware intellectual properties (HW-IPs) on the same chip via a SuperHyway standard internal bus. This approach became practical with the fine 0.13-μm process and made it possible to integrate more functions on a chip, such as video codecs, 3D graphics, and global positioning systems (GPS).
The SH-X, the first generation of the SH-4A processor core series, achieved a performance of 720 MIPS at 250 mW using a 0.13-μm process [14–18]. The power efficiency and the product of performance and efficiency reached 2,880 MIPS/W and 2.1 GIPS²/W, respectively, which were among the best values at that time. The low-power version achieved a performance of 360 MIPS and a power efficiency of 4,500 MIPS/W [19–21]. The design is discussed in Sects. 3.1.3 and 3.1.6.
The SH-X2, the second-generation core, achieved a performance of 1,440 MIPS using a 90-nm process, and its low-power version achieved a power efficiency of 6,000 MIPS/W [22–24]. It was then integrated on product chips [25–28]. The design is discussed in Sect. 3.1.4.

The SH-X3, the third-generation core, supported multicore features for both SMP and AMP [29, 30]. It was developed in a 90-nm generic process and achieved 600 MHz and 1,080 MIPS at 360 mW, resulting in 3,000 MIPS/W and 3.2 GIPS²/W. The first prototype chip with the SH-X3 was the RP-1, which integrated four SH-X3 cores [31–34], and the second was the RP-2, which integrated eight SH-X3 cores [35–37]. The core was then ported to a 65-nm low-power process and used in product chips [38]. The design is discussed in Sect. 3.1.7.
The SH-X4, the latest, fourth generation of the SH-4A processor core series, achieved 648 MHz and 1,717 MIPS at 106 mW, resulting in 16,240 MIPS/W and 28 GIPS²/W, using a 45-nm process [39–41]. The design is discussed in Sect. 3.1.8.

3.1.2 Efficient Parallelization of SH-4

The SH-4 enhanced its performance and efficiency mainly with a superscalar architecture, which suits multimedia processing with its high parallelism and makes an embedded processor suitable for digital appliances. However, conventional superscalar processors put the first priority on performance, and efficiency was not considered seriously, because they were high-end processors for PCs/servers [42–46]. Therefore, a highly efficient superscalar architecture was developed and adopted for the SH-4. The design target was to apply the superscalar architecture to an embedded processor while maintaining its efficiency, which was already much higher than that of a high-end processor.
A high-end general-purpose processor is designed to enhance general performance for PC/server use, but with no serious power or area restrictions, the resulting efficiency is low. A program with low parallelism cannot exploit a highly parallel superscalar processor, and the efficiency of the processor degrades. Therefore, the target parallelism of the superscalar architecture was set for programs with relatively low parallelism, and performance enhancement of multimedia processing was accomplished in another way (see Sect. 3.1.5).
A superscalar architecture enhances peak performance by issuing plural instructions simultaneously. However, the effective performance of real applications diverges from the peak performance as the instruction-issue width increases. The gap between peak and effective performance is caused by hazards, i.e., waiting cycles. Branch operations are a main cause of waiting cycles for fetched instructions, so it is important to speed up branches efficiently. A resource conflict, which causes waiting cycles until a resource becomes available, can be reduced by adding resources. However, efficiency decreases if the performance enhancement does not compensate for the hardware of the additional resources; balanced resource addition is therefore necessary to maintain efficiency. A register conflict, which causes waiting cycles until a register value becomes available, can be reduced by shortening instruction execution time and by forwarding data from a data-defining instruction to a data-using one at the appropriate timing.

3.1.2.1 Highly Efficient Instruction Set Architecture

From the beginning of RISC architecture, all RISC processors had adopted a 32-bit fixed-length instruction set architecture (ISA). However, such a RISC ISA produces larger code than a conventional CISC (complicated instruction set computer) ISA, requiring larger program memories and instruction caches, which lowers efficiency. The SH architecture, with its 16-bit fixed-length ISA, was defined in this situation to achieve compact code. The 16-bit fixed-length approach later spread to other processors, such as ARM Thumb and MIPS16.
On the other hand, CISC ISAs have variable-length instructions to define operations of various complexities, from simple to complicated ones. Variable length is good for compact code but is not suitable for the parallel decoding of plural instructions required by superscalar issue. The 16-bit fixed-length ISA is therefore good for both compact code and superscalar architecture.
As always, the selection has pros and cons, and the 16-bit fixed-length ISA has some drawbacks: a restricted number of operands and short literal lengths in the code. For example, a binary-operation instruction overwrites one of its operands, so an extra data transfer instruction is necessary if the original value of the overwritten operand must be kept. A literal load instruction is necessary to use a literal longer than the one an instruction can hold. Further, some instructions use an implicitly defined register, which increases the number of operands with no extra operand field but requires special treatment to identify it and spoils the orthogonality of register-number decoding. Therefore, careful implementation is necessary to handle such special features.
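As a small illustration of the two-operand constraint, here is a hypothetical sketch in C; the register names in the comments follow the SH mnemonics used in this chapter, but the sequence itself is not taken from the book:

    /* With a two-operand ISA, "ADD Rm, Rn" computes Rn += Rm and
       destroys the old Rn. Keeping both inputs alive needs an extra
       transfer:
           MOV R1, R2    ; extra transfer instruction
           ADD R0, R2    ; R2 = R0 + R1; R0 and R1 survive
       The same pattern at C level: */
    int add_keeping_inputs(int r0, int r1) {
        int r2 = r1;   /* MOV R1, R2 -- the extra transfer */
        r2 += r0;      /* ADD R0, R2 -- binary op overwrites its operand */
        return r2;
    }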

3.1.2.2 Microarchitecture Selections

Since conventional superscalar processors gave priority to performance, superscalar architecture was considered inefficient, and scalar architecture was still popular for embedded processors. However, this is not always true. For the SH-4 design, the superscalar architecture was tuned by selecting an appropriate microarchitecture while considering efficiency seriously, as befits an embedded processor. Table 3.1 summarizes the selected microarchitecture.
First, a dual-issue superscalar architecture was chosen, because it is difficult for a general-purpose program to utilize the simultaneous issue of more than two instructions effectively. Then, in-order issue was chosen, although out-of-order issue was popular in high-end processors, because the performance enhancement would not compensate for the hardware increase of out-of-order issue. The in-order dual-issue architecture could maintain the efficiency of the conventional scalar-issue one. Further, an asymmetric superscalar architecture was chosen, duplicating as few resources as possible to minimize overhead and maximize efficiency. A symmetric architecture was not chosen because it required duplicating execution resources even though

Table 3.1 Microarchitecture selections of SH-4

Item | Selection | Other candidates | Merits
Number of issues | Dual | Scalar, triple, quad | Maintaining high efficiency
Issue order | In-order | Out-of-order | Maintaining high efficiency
Resource duplication | Asymmetric | Duplicated (symmetric) | Maintaining high efficiency
Important category | Transfer | Memory access, arithmetic | Good for two-operand ISA
Latency concealing | Zero-cycle transfer | Delayed execution, store buffers | Good for two-operand ISA
Internal memories | Harvard architecture | Unified cache | Simultaneous access
Branch acceleration | Delayed branch, early-stage branch | Branch prediction, out-of-order issue, branch target buffer, separated instructions | Simple, small, compatible

the duplicated resources would not often be used simultaneously, so the architecture would not achieve high efficiency.
All instructions were categorized to reduce pipeline hazards from the resource conflicts that would not occur in a symmetric architecture at the expense of resource duplication. In particular, transfer instructions of a literal or register value are important for the 16-bit fixed-length ISA, so the transfer instructions were categorized as a type that can properly utilize either the execution or the load/store pipeline. Further, a zero-cycle transfer operation was implemented for the transfer instructions, contributing to hazard reduction.
As for memory architecture, the Harvard architecture, which enables simultaneous access to instruction and data caches, was popular in PC/server processors, whereas a unified cache was popular in embedded processors to reduce hardware cost and to use a relatively small cache efficiently. The SH-4 adopted the Harvard architecture, which was necessary to avoid the memory access conflicts increased by superscalar issue.
The SH architecture adopted a delayed branch to reduce branch penalty cycles. In addition, the SH-4 adopted an early-stage branch to reduce the penalty further. The penalty cycles increased with superscalar issue, but not as much as in a superpipelined processor with deep pipeline stages, so the SH-4 did not adopt more expensive methods such as a branch target buffer (BTB), out-of-order issue of branch instructions, or branch prediction. The SH-4 kept backward compatibility and did not adopt methods requiring ISA changes, such as using plural instructions for a branch.
As a result of these selections, the SH-4 adopted an in-order, dual-issue, asymmetric, five-stage superscalar pipeline and the Harvard architecture, with special treatment of transfer instructions including a zero-cycle transfer method.

3.1.2.3 Asymmetric Superscalar Architecture

The asymmetric superscalar architecture is sensitive to the instruction categorization, because instructions of the same category cannot be issued simultaneously. For example, if all floating-point instructions were placed in the same category, the number of floating-point register ports could be reduced, but a floating-point arithmetic instruction and a floating-point load/store/transfer instruction could not be issued at the same time, degrading performance. Therefore, the categorization requires a careful trade-off between performance and hardware cost.
First of all, integer and load/store instructions are the most frequently used and were categorized into the separate integer (INT) and load/store (LS) groups, respectively. This categorization required an address calculation unit in addition to the conventional arithmetic logical unit (ALU). Branch instructions make up about one fifth of a program on average. However, it was difficult to use the ALU or the address calculation unit to implement the early-stage branch, which calculates branch addresses one stage earlier than the other types of operations. Therefore, branch instructions were categorized into another group, branch (BR), with their own branch-address calculation unit. As a result, the SH-4 has three calculation units, but the performance enhancement compensated for the additional hardware.
Even a RISC processor has special instructions that do not fit superscalar issue. For example, instructions that change the processor state were categorized into a nonsuperscalar (NS) group, because most instructions cannot be issued together with them.
Because of the 16-bit fixed-length ISA, the SH-4 frequently uses instructions that transfer a literal or register value to a register. Therefore, the transfer instructions were categorized into a BO group, executable on both the integer and load/store (INT and LS) pipelines, which serve the INT and LS groups. A transfer instruction can then be issued with no resource conflict. A usual program cannot utilize all the instruction issue slots of a conventional RISC architecture, which has three-operand instructions and uses transfer instructions less frequently; the extra transfer instructions of the SH-4 can be inserted easily, with no resource conflict, into issue slots that would be empty on a conventional RISC.
As mentioned above, placing all the floating-point instructions in a single group would increase pipeline hazards. Therefore, the floating-point load/store/transfer instructions and arithmetic instructions were categorized into the LS group and a floating-point execution (FE) group, respectively. This categorization increased the number of ports of the floating-point register file, but the performance enhancement justified the increase.
The floating-point transfer instructions were not categorized into the BO group, because neither the INT nor the FE group suits them: the INT pipeline cannot use the floating-point register file, and the FE pipeline is too complicated for a simple transfer operation. Further, a transfer instruction is often issued together with an FE-group instruction, so categorization outside the FE group was a sufficient condition for performance.

Table 3.2 Categories of SH-4 instructions

INT: MOV imm, Rn; MOVA; MOVT; ADD; ADDC; ADDV; SUB; SUBC; SUBV; DIV0U; DIV0S; DIV1; DT; NEG; NEGC; EXTU; EXTS; AND Rm, Rn; AND imm, R0; OR Rm, Rn; OR imm, R0; XOR Rm, Rn; XOR imm, R0; ROTL; ROTR; ROTCL; ROTCR; SHAL; SHAR; SHLL; SHLR; SHLL2; SHLR2; SHLL8; SHLR8; SHLL16; SHLR16; SHAD; SHLD; NOT; SWAP; XTRCT
BO: MOV Rm, Rn; CMP; TST imm, R0; TST Rm, Rn; CLRT; SETT; NOP
LS: MOV (load/store); MOVCA; OCBI; PREF; FMOV; FLDS; FSTS; FLDI0; FLDI1; FABS; FNEG; LDS Rm, FPUL; STS FPUL, Rn
BR: BRA; BSR; BT; BF; BT/S; BF/S
FE: FADD; FSUB; FMUL; FDIV; FSQRT; FCMP; FLOAT; FTRC; FCNVSD; FCNVDS; FMAC; FIPR; FTRV
NS: MUL; MULU; MULS; DMULU; DMULS; MAC; CLRMAC; AND imm, @(R0,GBR); OR imm, @(R0,GBR); XOR imm, @(R0,GBR); TST imm, @(R0,GBR); TAS; BRAF; BSRF; JMP; JSR; RTS; CLRS; SETS; SLEEP; LDC; STC; LDS (except FPUL); STS (except FPUL); LDTLB; TRAPA

Table 3.3 Simultaneous issue of instructions (Y: the pair can be issued simultaneously)

First \ second | BO | INT | LS | BR | FE | NS
BO  | Y | Y | Y | Y | Y | -
INT | Y | - | Y | Y | Y | -
LS  | Y | Y | - | Y | Y | -
BR  | Y | Y | Y | - | Y | -
FE  | Y | Y | Y | Y | - | -
NS  | - | - | - | - | - | -

The SH ISA supports floating-point sign-negation and absolute-value (FNEG and FABS) instructions. Although these instructions would seem to fit the FE group, they were categorized into the LS group. Their operations are simple enough to execute in the LS pipeline, and their combination with another arithmetic instruction yields a useful operation. For example, the FNEG and floating-point multiply-accumulate (FMAC) instructions together perform a multiply-and-subtract operation.
Table 3.2 summarizes the categories of the SH-4 instructions, and Table 3.3 shows which pairs of instructions can be issued simultaneously. As an asymmetric superscalar processor, the SH-4 has one pipeline each for the INT, LS, BR, and FE groups, and simultaneous issue is limited to a pair of instructions from different groups, except that a pair of BO-group instructions can be issued simultaneously using both the INT and LS pipelines. An NS-group instruction cannot be issued with any other instruction.
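The issue rule can be stated compactly. The following is a minimal sketch in C of the rule in Table 3.3 (the function and type names are illustrative, not from the SH-4 design):

    #include <stdbool.h>

    typedef enum { BO, INT, LS, BR, FE, NS } Group;

    /* Dual-issue legality per Table 3.3: NS issues alone; two BO
       instructions may pair (using the INT and LS pipelines);
       otherwise the two instructions must come from different groups,
       since each group has only one pipeline. */
    bool can_dual_issue(Group first, Group second) {
        if (first == NS || second == NS) return false;
        if (first == BO && second == BO) return true;
        return first != second;
    }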

Fig. 3.1 Pipeline structure of SH-4

3.1.2.4 Pipeline Structure of Asymmetric Superscalar Architecture

Figure 3.1 illustrates the pipeline structure that realizes the asymmetric superscalar architecture described above. The pipeline consists of five stages: instruction fetch (IF), instruction decoding (ID), instruction execution (EX), memory access (MA), and write-back (WB).
Two consecutive 16-bit instructions (32 bits) are fetched every cycle at the IF stage to sustain the two-way superscalar issue and are provided to the input latch of the ID stage. The fetched instructions are stored in an instruction queue (IQ) when the latch is occupied by instructions whose issue has been suspended. An instruction fetch is issued only after checking that either the input latch or the IQ has room, to avoid discarding fetched instructions.
At the ID stage, instruction decoders decode the two instructions at the input latch, judge their groups, assign pipelines, read registers as source operands, forward an operand value if it is available but not yet stored in a register, judge whether the instructions can be issued immediately, and provide instruction execution information to the following stages. Further, the BR pipeline starts branch processing for a BR-group instruction; the details of the branch processing are described in Sect. 3.1.2.6.
The INT, LS, BR, and FE pipelines are assigned to instructions of the INT, LS, BR, and FE groups, respectively. The second of the two simultaneously decoded instructions is not issued if its assigned pipeline is occupied; it is kept at the input latch and decoded again in the next cycle. A BO-group instruction is assigned to the LS pipeline if the instruction decoded with it is in the INT group; otherwise, it is assigned to the INT pipeline, except that when both instructions are in the BO group, they are assigned to the INT and LS pipelines. An NS instruction is assigned to the proper pipeline or pipelines only if it is the first instruction; otherwise, it is kept at the input latch and decoded again in the next cycle.
Issue possibility is judged by checking operand-value availability in parallel with the execution pipeline assignment. An operand is an immediate or a register value, and an immediate value is always available, so the judgment checks register-value availability. A register value is defined by some instruction and used by a following one. A read-after-write register conflict, i.e., a true dependency, occurs if the distance between the defining and using instructions is less than the latency of the defining instruction, and the defined register value is not available until the distance becomes equal to or greater than the latency.
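In other words, the stall count follows directly from the latency and the issue distance. A minimal sketch of this relationship (the names are illustrative):

    /* Stall cycles caused by a true dependency: the using instruction
       must wait until the issue distance reaches the defining
       instruction's latency. */
    int dependency_stalls(int def_latency, int issue_distance) {
        int stalls = def_latency - issue_distance;
        return stalls > 0 ? stalls : 0;  /* no stall once far enough apart */
    }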

The register conflict check runs in parallel with the other ID-stage operations by comparing a candidate register field of the instruction before that field is identified as a real register field; the comparison result is then judged meaningful or not after the identification, which requires the instruction format type from the instruction decoding logic. These parallel operations reduce the ID-stage delay and enhance the operating frequency.
After the ID stage, the operation depends on the pipeline and is executed according to the instruction information provided by the ID stage. The INT pipeline executes its operation at the EX stage using an ALU, a shifter, and so on; passes the result toward the WB stage during the MA stage; and writes the result back to the register at the WB stage. The LS pipeline calculates the memory access address at the EX stage; loads or stores data at the calculated address in the data cache at the MA stage; and writes back the loaded data and/or the calculated address, if any, to the register at the WB stage. If a cache miss occurs, all the pipelines stall to wait for the external memory access. The FE pipeline operations are described later in detail.
The SH-4's Harvard architecture requires simultaneous access to the translation lookaside buffers (TLBs) for instructions and data, and a conventional Harvard-architecture processor separates the TLBs symmetrically. The SH-4, however, enhanced TLB efficiency by breaking the symmetry. Instruction fetch addresses are well localized, so a four-entry instruction TLB (ITLB) was enough to suppress TLB misses. In contrast, data access addresses are less localized and require more entries. Therefore, a 64-entry unified TLB (UTLB) was integrated and used both for data accesses and for ITLB miss handling. ITLB miss handling is supported by hardware and takes only a few cycles if the ITLB-missed entry is in the UTLB. If a UTLB miss occurs for either type of access, a TLB miss exception is raised, and proper software miss handling is invoked.
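A minimal sketch of this two-level, asymmetric lookup follows. The data layout, the linear search, and the replacement policy are assumptions for illustration; the real hardware searches associatively:

    #include <stdint.h>
    #include <stdbool.h>

    typedef struct { uint32_t vpn, ppn; bool valid; } TlbEntry;

    static TlbEntry itlb[4];    /* 4-entry instruction TLB  */
    static TlbEntry utlb[64];   /* 64-entry unified TLB     */

    static TlbEntry *lookup(TlbEntry *t, int n, uint32_t vpn) {
        for (int i = 0; i < n; i++)
            if (t[i].valid && t[i].vpn == vpn) return &t[i];
        return 0;  /* miss */
    }

    /* Instruction-address translation: on an ITLB miss, hardware
       refills from the UTLB; only a UTLB miss raises the TLB miss
       exception handled by software. */
    bool translate_inst(uint32_t vpn, uint32_t *ppn) {
        TlbEntry *e = lookup(itlb, 4, vpn);
        if (!e) {
            e = lookup(utlb, 64, vpn);
            if (!e) return false;  /* TLB miss exception */
            itlb[0] = *e;          /* refill; slot choice is an assumption */
            e = &itlb[0];
        }
        *ppn = e->ppn;
        return true;
    }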
The caches of the SH-4 are also asymmetric to enhance efficiency. Since SH-4 code is smaller than that of conventional processors, the instruction cache is half the size of the data cache: 8 KB and 16 KB, respectively.

3.1.2.5 Zero-Cycle Data Transfer

Since an SH-4 program contains more transfer instructions than programs for other architectures, the transfer instructions were categorized into the BO group so that they can be inserted into any unused issue slot. Further, a zero-cycle transfer operation was implemented for the transfer instructions, contributing to hazard reduction.
The result of a transfer instruction already exists at the beginning of the operation, as an immediate value in the instruction code, a value in the source operand register, or a value in flight in a pipeline. It is provided to the pipeline at the ID stage and simply forwarded through the pipeline to the WB stage. Therefore, an instruction issued simultaneously in the other pipeline right after the transfer instruction can use the transfer result, provided the result is properly forwarded by the source-operand forwarding network.

Fig. 3.2 Branch sequence of a scalar processor (four cycles)

Fig. 3.3 Branch sequence of a superscalar processor (four cycles)

Fig. 3.4 Branch sequence of SH-4 with early-stage branch (three cycles)

3.1.2.6 Early-Stage Branch

The SH-4 adopted an early-stage branch to reduce the branch penalty increased by the superscalar architecture. Figures 3.2–3.4 illustrate the branch sequences of a scalar processor, a superscalar processor, and the SH-4 with the early-stage branch, respectively. Each sequence consists of branch, delay-slot, and target instructions. In the SH-4 case, a compare instruction, which is often placed right before the conditional branch instruction, is also shown to clarify the define-use distance of the branch condition between the EX stage of the compare and the ID stage of the branch.
Both the scalar and superscalar processors execute the three instructions in the same four cycles. There is no performance gain from the superscalar architecture, and the number of empty issue slots becomes three to four times larger. In contrast, the SH-4 executes the three instructions in three cycles with one or two empty issue slots. A branch without a delay slot requires one more empty issue slot in all these cases. As the example sequences show, the SH-4's performance was enhanced and its empty issue slots decreased.

Table 3.4 Early-stage branch instructions

Instruction | Code | Displacement | Function
BT label | 10001001 | 8 bits | if (T==1) PC = PC + 4 + disp*2
BF label | 10001011 | 8 bits | if (T==0) PC = PC + 4 + disp*2
BT/S label | 10001101 | 8 bits | if (T==1) PC = PC + 4 + disp*2; execute delay slot
BF/S label | 10001111 | 8 bits | if (T==0) PC = PC + 4 + disp*2; execute delay slot
BRA label | 1010 | 12 bits | PC = PC + 4 + disp*2; execute delay slot
BSR label | 1011 | 12 bits | PR = PC + 4; PC = PC + 4 + disp*2; execute delay slot

The branch address calculation at the ID stage was the key method of the early-stage branch and was realized by performing the calculation in parallel with instruction decoding. The early-stage branch was applied to the six frequently used branch instructions summarized in Table 3.4. The calculation is an 8-bit or 12-bit offset addition, and a 1-bit check of the instruction code is enough to identify the offset size among the six branch instructions. The first of the two instruction codes at the ID stage is chosen for processing if it is a branch; otherwise, the second code is chosen. However, this judgment takes more time than the 1-bit check above, so part of the calculation is performed before the selection, duplicating the required hardware to realize the parallel operation.

3.1.2.7 Performance Evaluations

The SH-4 performance was measured with the Dhrystone benchmark, which was popular for evaluating the integer performance of embedded processors [5]. The Dhrystone benchmark is small enough that all of its program and data fit in the caches, so it can be used from the beginning of processor development. Only the processor core architecture is then evaluated, without influence from the system-level architecture, and the evaluation results can be fed back into the architecture design. Conversely, the system-level performance cannot be measured, since cache miss rates, external memory access throughput and latencies, and so on are not exercised. The evaluation result includes compiler performance, because the Dhrystone benchmark is written in C. An optimizing compiler tuned for the SH-4 was used to compile the benchmark.
An optimizing compiler for a superscalar processor must implement optimizations that are unnecessary for a scalar processor. For example, the distance between a load instruction and an instruction using the loaded data must be two cycles or more to avoid a pipeline stall. A scalar processor requires one instruction inserted between them, but a superscalar processor requires two or three. Therefore, the optimizing compiler must find and insert more independent instructions than a compiler for a scalar processor.

Fig. 3.5 Dhrystone performance evaluation result (cycle performance in MIPS/MHz)

Figure 3.5 shows the result of the cycle performance evaluation. Starting from the SH-3, five major enhancements were applied to construct the SH-4 microarchitecture. The SH-3 achieved 1.0 MIPS/MHz when it was released, and the SH-4 compiler enhanced this to 1.1 MIPS/MHz. The cycle performance was then enhanced to 1.27 MIPS/MHz by the Harvard architecture, 1.49 MIPS/MHz by the superscalar architecture, 1.59 MIPS/MHz by adding the BO group, 1.77 MIPS/MHz by the early-stage branch, and 1.81 MIPS/MHz by the zero-cycle transfer operation. As a result, the SH-4 achieved 1.81 MIPS/MHz, enhancing the cycle performance 1.65 times over the SH-3, excluding the compiler effect.
The SH-3 was a 60-MHz processor in a 0.5-μm process and was estimated to be a 133-MHz processor in a 0.25-μm process. The SH-4 achieved 200 MHz in the same 0.25-μm process, enhancing the frequency 1.5 times over the SH-3. As a result, the architectural performance of the SH-4 is 1.65 × 1.5 = 2.47 times that of the SH-3.
For an embedded processor, efficiency is an even more important feature than performance. Therefore, the area and power efficiencies of the SH-4 were also evaluated, and it was confirmed that the SH-4 achieved excellent efficiency.
The area of the SH-3 was 7 mm² in a 0.5-μm process, estimated at 3 mm² in a 0.25-μm process, whereas the area of the SH-4 was 4.9 mm² in a 0.25-μm process. Therefore, the SH-4 was 1.63 times as large as the SH-3. As described above, the cycle and architectural performances of the SH-4 were 1.65 and 2.47 times those of the SH-3. As a result, the SH-4 maintained the area efficiency of the cycle performance, calculated as 1.65/1.63 = 1.01, and enhanced the area efficiency of the performance, calculated as 2.47/1.63 = 1.52. The actual efficiencies, including the process contribution, were 60 MIPS/7 mm² = 8.6 MIPS/mm² for the SH-3 and 360 MIPS/4.9 mm² = 73.5 MIPS/mm² for the SH-4.
The SH-3 and SH-4 were ported to a 0.18-μm process and tuned while keeping their major architecture. Since they adopt the same five-stage pipeline, the achievable frequency was also the same after the tuning. The ported SH-3 and SH-4 consumed 170 and 240 mW, respectively, at 133 MHz and a 1.5-V supply. Therefore, the power of the
SH-4 was 240/170 = 1.41 times that of the SH-3. As a result, the SH-4 maintained the power efficiency of the cycle performance, calculated as 1.65/1.41 = 1.17. The actual efficiencies, including the process contribution, were 147 MIPS/0.17 W = 865 MIPS/W for the SH-3 and 240 MIPS/0.24 W = 1,000 MIPS/W for the SH-4. Although a conventional superscalar processor was thought to be less efficient than a scalar processor, the SH-4 was more efficient. Under other conditions, the SH-4 achieved 166 MHz at 1.8 V with 400 mW and 240 MHz at 1.95 V with 700 mW; the corresponding efficiencies were 300 MIPS/0.4 W = 750 MIPS/W and 432 MIPS/0.7 W = 617 MIPS/W.

3.1.3 Efficient Frequency Enhancement of SH-X

The asymmetric superscalar architecture of the SH-4 achieved high performance and efficiency. However, further issue-width parallelism would not contribute to performance, because of the limited parallelism of general programs. On the other hand, the operating frequency would be limited by the applied process without a fundamental change of the architecture or microarchitecture. Although conventional superpipeline architectures were thought inefficient, as conventional superscalar architectures were before the SH-4 [47, 48], the SH-X embedded processor core was developed with a superpipeline architecture to enhance the operating frequency while maintaining the high efficiency of the SH-4.

3.1.3.1 Microarchitecture Selections

The SH-X adopted a seven-stage superpipeline to maintain efficiency, among the various stage counts adopted by various processors, up to the 20 stages of highly superpipelined designs [48]. A seven-stage pipeline degrades the cycle performance compared to a five-stage one. Therefore, appropriate methods were chosen to enhance and recover the cycle performance, with careful trade-offs between performance and efficiency. Table 3.5 summarizes the selected microarchitecture.
Out-of-order issue was a popular method in high-end processors for enhancing cycle performance. However, it requires considerable hardware and is too inefficient, especially for general-purpose register handling. The SH-X adopted in-order issue, except for some branches that use no general-purpose register.
The branch penalty is a serious problem for a superpipeline architecture. In addition to the methods of the SH-4, the SH-X adopted branch prediction and out-of-order branch issue, but did not adopt the more expensive BTB or the compatibility-breaking use of plural instructions per branch. Branch prediction is categorized into static and dynamic methods, and static methods require an architecture change to embed the static prediction result in the instruction. Therefore, the SH-X adopted a dynamic method with a branch history table (BHT) and a global history.

Table 3.5 Microarchitecture selections of SH-X

Item | Selection | Other candidates | Merits
Pipeline stages | 7 | 5, 6, 8, 10, 15, 20 | 1.4 times frequency enhancement
Branch acceleration | Out-of-order issue | BTB, branch with plural instructions | Compatibility, small area
Branch prediction | Dynamic (BHT, global history) | Static (fixed direction, hint bit in instruction) | Good for low-frequency branches
Latency concealing | Delayed execution, store buffers | Out-of-order issue | Simple, small

Fig. 3.6 Conventional seven-stage superpipeline structure

The load/store latencies were also a serious problem; out-of-order issue is effective at hiding such latencies but, as mentioned above, too inefficient to adopt. The SH-X adopted delayed execution and a store buffer as more efficient methods. The selected methods were effective in reducing the pipeline hazards caused by the superpipeline architecture, but not in avoiding the long stall caused by a cache miss to external memory. Such a stall could be avoided by an out-of-order architecture with large-scale buffers, but it was not a serious problem for embedded systems.

3.1.3.2 Improved Superpipeline Architecture

Figure 3.6 illustrates a conventional seven-stage superpipeline structure based on the ISA and instruction categorization of the SH-4. The seven stages consist of first and second instruction fetch (I1 and I2) stages and an instruction decoding (ID) stage for all the pipelines, plus first to fourth execution (E1-E4) stages for the INT, LS, and FE pipelines. The FE pipeline has nine stages, with two extra execution stages, E5 and E6.
The I1, I2, and ID stages correspond to the IF and ID stages of the SH-4, and the E1, E2, and E3 stages correspond to its EX and MA stages. Therefore, the same processing time is divided into 1.5 times as many stages as in the SH-4, and the operating

Fig. 3.7 Seven-stage superpipeline structure of SH-X

frequency can be 1.4 times that of the SH-4. The shortfall from 1.5 times is caused by the pipeline latches added for the extra stages.
Control signals and processing data flow backward as well as forward through the pipeline. The backward flows convey various information and execution results of preceding instructions for controlling and executing the following instructions. The information includes whether preceding instructions have been issued or still occupy resources, where the latest value of a source operand is flowing in the pipeline, and so on. Such information is used for instruction issue every cycle, so the latest information must be collected within a cycle. This information gathering and handling become difficult as the superpipeline architecture shortens the cycle time, and the issue control logic tends to become complicated and large. However, the amount of hardware is determined mainly by the major microarchitecture, and the hardware increase was expected to be less than 1.4 times.
A conventional seven-stage pipeline has a cycle performance about 20% lower than that of a five-stage one. This means the performance gain of the superpipeline architecture would be only 1.4 × 0.8 = 1.12 times, which would not compensate for the hardware increase. The branch penalty grew with the added instruction fetch cycles of the I1 and I2 stages, and the load-use conflict penalty grew with the added data load cycles of the E1 and E2 stages; these were the main reasons for the 20% degradation.
Figure 3.7 illustrates the seven-stage superpipeline structure of the SH-X, with delayed execution, a store buffer, out-of-order branch, and flexible forwarding. Compared to the conventional pipeline shown in Fig. 3.6, the INT pipeline starts its execution one cycle later, at the E2 stage; store data is buffered in the store buffer at the E4 stage and stored to the data cache at the E5 stage; and the data transfer of the FPU supports flexible forwarding. The BR pipeline starts at the ID stage but is not synchronized with the other pipelines, enabling an out-of-order branch issue.
Delayed execution is effective in reducing load-use conflicts, as Fig. 3.8 illustrates. It also lengthens the decoding into two stages, except for the address calculation, relaxing the decoding time. With the conventional architecture shown in Fig. 3.6, a load instruction, MOV.L, sets up an R0 value at the ID stage, calculates the load address at the E1 stage, loads data from the data cache at the E2 and E3

Fig. 3.8 Load-use conflict reduction by delayed execution (two-cycle stall with the conventional architecture; one-cycle stall with delayed execution)

stages, and the load data is available at the end of the E3 stage. An ALU instruction, ADD, sets up the R1 and R2 values at the ID stage and adds them at the E1 stage. The load data must therefore be forwarded from the E3 stage to the ID stage, and the pipeline stalls two cycles. With delayed execution, the load instruction executes identically, but the add instruction sets up the R1 and R2 values at the E1 stage and adds them at the E2 stage. The load data is then forwarded from the E3 stage to the E1 stage, and the pipeline stalls only one cycle, the same number as in a five-stage pipeline like the SH-4's.
Another choice was to start the delayed execution at the E3 stage to avoid the load-use pipeline stall entirely. However, the E3 stage was a bad place to define results. For example, if an ALU result were defined at E3 and an address calculation used the result at E1, a three-cycle issue distance would be required between the two instructions. On the other hand, programs for the SH-4 already accounted for the one-cycle stall. Therefore, the E2-start design of the SH-X was considered better; in particular, programs optimized for the SH-4 could be expected to run properly on the SH-X.
As illustrated in Fig. 3.7, a store instruction performs its address calculation, TLB and cache-tag accesses, store-data latch, and data store to the cache at the E1, E2, E4, and E5 stages, respectively, whereas a load instruction accesses the cache at the E2 stage. This means a three-stage gap in cache access timing between the E2 stage of a load and the E5 stage of a store. However, loads and stores use the same cache port. Therefore, a load instruction gets priority over a store instruction when their accesses conflict, and the store instruction must wait for a conflict-free cycle. With an N-stage gap, N store-buffer entries are necessary to handle the worst case, a sequence of N consecutive store issues followed by N consecutive load issues; accordingly, the SH-X implemented three entries.
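A minimal sketch of such a store buffer follows, as a software model under the stated assumptions (a single cache port, load priority, and three entries); the names and the drain-one-per-cycle policy are illustrative:

    #include <stdint.h>
    #include <stdbool.h>

    #define SB_ENTRIES 3                       /* three entries, as in SH-X */
    typedef struct { uint32_t addr, data; } Store;
    typedef struct { Store q[SB_ENTRIES]; int head, count; } StoreBuf;

    /* Buffer a store at E4; if the buffer is full, the pipeline stalls. */
    bool sb_push(StoreBuf *sb, uint32_t addr, uint32_t data) {
        if (sb->count == SB_ENTRIES) return false;
        sb->q[(sb->head + sb->count++) % SB_ENTRIES] = (Store){addr, data};
        return true;
    }

    /* Once per cycle: loads own the cache port, so a buffered store
       drains only on a cycle with no load access. */
    void sb_cycle(StoreBuf *sb, bool load_this_cycle,
                  void (*cache_write)(uint32_t addr, uint32_t data)) {
        if (!load_this_cycle && sb->count > 0) {
            Store s = sb->q[sb->head];
            cache_write(s.addr, s.data);
            sb->head = (sb->head + 1) % SB_ENTRIES;
            sb->count--;
        }
    }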
The flexible forwarding enables both early register release and late register allocation, easing program optimization. Figure 3.9 shows examples of both cases. In the early-register-release case, a floating-point addition instruction (FADD) generates its result at the end of the E4 stage, and a store instruction (FMOV) gets the result forwarded from the E5 stage of the FADD. FR1 is thus released only one cycle after its allocation, although the FADD takes three cycles to generate the result. In the late-register-allocation case, an FADD forwards its result at the E6 stage, and a transfer instruction (FMOV) receives the forwarded result at the E1 stage. The FR2 allocation is then five cycles after the FR1 allocation.

Fig. 3.9 Example of flexible forwarding (early register release and late register allocation)

Fig. 3.10 Branch execution sequence of superpipeline architecture

3.1.3.3 Branch Prediction and Out-of-Order Branch Issue

Figure 3.10 illustrates the branch performance degradation of a superpipeline architecture, using a program sequence consisting of compare, conditional-branch, delay-slot, and branch-target instructions. The architecture is assumed to be the same superpipeline architecture as the SH-X, except that the branch architecture is that of the SH-4.
The conditional-branch and delay-slot instructions are issued three cycles after the compare instruction, and the branch-target instruction three cycles after the branch. The compare operation starts at the E2 stage because of the delayed execution, and its result is available at the middle of the E3 stage. The conditional-branch instruction then checks the result in the latter half of the ID stage and generates the target address in the same ID stage, followed by the I1 and I2 stages of the target instruction. As a result, eight empty issue slots, or four stall cycles, arise as illustrated; only one third of the issue slots are used for the sequence.
The SH-4 could execute the same four-instruction sequence with two empty issue slots, or a one-cycle stall, using four of six issue slots, as described in Sect. 3.1.2.6. The branch performance was thus seriously degraded and required recovery of the cycle performance.
Figure 3.11 illustrates the execution sequence of the SH-X. The branch operation can start with no pipeline stall thanks to branch prediction, which predicts the

Fig. 3.11 Branch execution sequence of SH-X (two-cycle stall on a prediction miss)

branch direction, i.e., whether the branch is taken or not. However, this alone is not early enough to reduce the empty issue slots to zero. Therefore, the SH-X adopted an out-of-order issue for branches that use no general-purpose register.
The SH-X fetches four instructions per cycle and issues at most two. Instructions are therefore buffered in an instruction queue (IQ), as illustrated. A branch instruction is searched for in the IQ or the instruction-cache output at the I2 stage and provided to the ID stage of the branch pipeline out of order, earlier than the other instructions, which are provided to the ID stage in order. The conditional branch instruction is thus issued right after it is fetched, while the preceding instructions are still in the IQ, and the issue becomes early enough to reduce the empty issue slots to zero. As a result, the target instruction is fetched and decoded at the ID stage right after the delay-slot instruction. This means no branch penalty occurs in the sequence when the preceding or delay-slot instructions stay two or more cycles in the IQ.
The compare result is available at the E3 stage, where the prediction is checked as a hit or a miss. On a miss, the instruction of the correct flow is decoded at the ID stage right after the E3 stage, causing a two-cycle stall. If the correct flow is not held in the IQ, the misprediction recovery starts from the I1 stage and takes two more cycles.
Historically, dynamic branch prediction started with a BHT holding a 1-bit history per entry, which recorded the branch direction, taken or not taken, of the last execution and predicted the same direction. A BHT with a 2-bit history per entry then became popular, using the four states strongly taken, weakly taken, weakly not-taken, and strongly not-taken to reflect the history of several executions. There were several types of state transitions, including a simple up-down transition. Since each entry holds only one or two bits, it is too expensive to attach a tag consisting of part of the branch-instruction address, which is usually about 20 bits for 32-bit addressing. Therefore, the number of entries can be increased

Fig. 3.12 Conventional clock-gating method

about 10-20 times without the tag. Although different branch instructions cannot be distinguished without the tag, so false hits occur, the merit of the increased entries exceeded the demerit of the false hits. A global history method was also popular for prediction and was usually used with a 2-bit/entry BHT.
The SH-X stalls only two cycles on a prediction miss, so its performance is not very sensitive to the hit ratio. Further, the 1-bit method requires a state change only on a prediction miss, and the change can be made during the stall. Therefore, the SH-X adopted dynamic branch prediction with a 4K-entry, 1-bit/entry BHT and a global history. The size is much smaller than the 32-KB instruction and data caches.

3.1.3.4 Low-Power Technologies of SH-X

The SH-X achieved excellent power efficiency by using various low-power technologies. Among them, hierarchical clock gating and the pointer-controlled pipeline are explained in this section.
Figure 3.12 illustrates a conventional clock-gating method. In this example, the clock tree has four levels, with A-, B-, C-, and D-drivers. The A-driver receives the clock from the clock generator and distributes it to each module in the processor. The B-driver of each module then receives the clock and distributes it to various submodules including 128-256 flip-flops (F/Fs). The B-driver gates the clock with a signal from the clock control register, whose value is statically written by software to stop and start the modules. Next, the C- and D-drivers distribute the clock hierarchically to the leaf F/Fs, which have a control clock pin (CCP). The leaf F/Fs are gated by hardware through the CCP to avoid activating them unnecessarily. However, the clock tree in a module remains active the whole time the module is activated by software.
Figure 3.13 illustrates the clock-gating method of the SH-X. In addition to the clock gating at the B-driver, the C-drivers gate the clock with signals dynamically generated by hardware to reduce the clock-tree activity. As a result, the clock power is 30% less than with the conventional method.
The superpipeline architecture improved the operating frequency but increased the number of F/Fs and the power. Therefore, one of the key design considerations was

Fig. 3.13 Clock-gating method of SH-X

Fig. 3.14 Pointer-controlled pipeline F/Fs of SH-X



Fig. 3.15 Conventional pipeline F/Fs

to reduce the activity ratio of the F/Fs. To address this issue, a pointer-controlled pipeline was developed. It realizes a pseudo-pipeline operation with pointer control. As shown in Fig. 3.14, three pipeline F/Fs are connected in parallel, and a pointer indicates which F/F corresponds to which stage. Only one set of F/Fs is then updated per cycle in the pointer-controlled pipeline, while all the pipeline F/Fs are updated every cycle in the conventional pipeline shown in Fig. 3.15.
Table 3.6 shows the relationship between the F/Fs FF0-FF2 and the pipeline stages E2-E4 for each pointer value. For example, when the pointer indexes zero, FF0 captures an input value at E2 and keeps it for three cycles, serving as the E2, E3, and E4 latch, until

Table 3.6 Relationship of F/Fs and pipeline stages

Pointer | FF0 | FF1 | FF2
0 | E2 | E4 | E3
1 | E3 | E2 | E4
2 | E4 | E3 | E2

Fig. 3.16 Performance improvement of SH-4 and SH-X: cycle performance (MIPS/MHz), architectural performance (relative to SH-3), and performance (MIPS).
SH-3: 1.00, 1.00, 60.
SH-4: compiler + porting 1.10, 1.10, 146; + Harvard 1.27, 1.91, 255; + superscalar 1.49, 2.24, 298; + BO type 1.59, 2.39, 319; + early branch 1.77, 2.65, 354; + 0-cycle MOV 1.81, 2.71, 361.
SH-X: compiler + porting 1.80, 2.70, 504; + superpipeline 1.47, 3.07, 584; + branch prediction 1.50, 3.16, 602; + out-of-order branch 1.60, 3.37, 642; + store buffer 1.69, 3.55, 677; + delayed execution 1.80, 3.78, 720.

the pointer indexes zero again and FF0 captures a new input value. This method is good for short-latency operations in a long pipeline. The power of the pipeline F/Fs decreases to 1/3 for transfer instructions and decreases by an average of 25%, as measured with Dhrystone 2.1.
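A minimal software model of the idea follows (the names and the read helper are illustrative): instead of shifting data through three F/Fs every cycle, one F/F is written per cycle and the pointer rotates, so stage outputs are found by offset from the pointer.

    /* Pointer-controlled pipeline model: three F/Fs stand in for the
       E2-E4 pipeline latches; only one F/F toggles per cycle. */
    typedef struct {
        int ff[3];
        int ptr;              /* index of the F/F written most recently */
    } PtrPipe;

    void clock_tick(PtrPipe *p, int e2_input) {
        p->ptr = (p->ptr + 1) % 3;  /* rotate the pointer (Table 3.6) */
        p->ff[p->ptr] = e2_input;   /* single F/F update per cycle */
    }

    /* stage: 0 = E2 (newest), 1 = E3, 2 = E4 (oldest) */
    int stage_value(const PtrPipe *p, int stage) {
        return p->ff[(p->ptr + 3 - stage) % 3];
    }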

3.1.3.5 Performance Evaluations

The SH-X performance was measured with the Dhrystone benchmark, as the SH-4's was. The popular version had changed to 2.1 from the 1.1 used when the SH-4 was developed, because advances in compiler optimization meant version 1.1 no longer reflected the features of real applications, owing to excessive elimination of unused results in the program [49]. The compiler advances and the increased difficulty of optimizing version 2.1 were well balanced, so the continuity of the measured performance was maintained by using a proper optimization level of the compiler.
Figure 3.16 shows the evaluation result of the cycle performance. The improvement from the SH-3 to the SH-4 in the figure was already explained in Sect. 3.1.2.7.

Fig. 3.17 Area efficiency improvement of SH-4 and SH-X. Architectural values (relative to SH-3) for the SH-3/SH-4/SH-X: performance 1.00/2.47/3.45, area 1.00/1.63/2.26, area-performance ratio 1.00/1.52/1.53. Actual values: performance 60/361/720 MIPS, area 7.0/4.9/1.8 mm², area-performance ratio 8.6/74/400 MIPS/mm².

The cycle performance decreased by 18%, to 1.47 MIPS/MHz, when a conventional seven-stage superpipeline was applied to the SH-4 design. Branch prediction, out-of-order branch issue, the store buffer, and delayed execution improved the cycle performance by 23%, recovering 1.8 MIPS/MHz. Since the superpipeline architecture achieved a 1.4-times higher operating frequency, the architectural performance was also 1.4 times that of the SH-4. The actual performance was 720 MIPS at 400 MHz in a 0.13-μm process, an improvement of two times over the SH-4 in a 0.25-μm process. The improvement from each method is shown in Fig. 3.16.
Figures 3.17 and 3.18 show the improvements in area efficiency and power efficiency, respectively. The upper three graphs in each figure show the architectural performance, the relative area or power, and the architectural area- or power-performance ratio; the lower three graphs show the actual performance, area or power, and area- or power-performance ratio. The area of the SH-X core was 1.8 mm² in a 0.13-μm process, and the area of the SH-4 was estimated at 1.3 mm² if ported to a 0.13-μm process. Therefore, the relative area of the SH-X was 1.4 times that of the SH-4 and 2.26 times that of the SH-3. The architectural area efficiency of the SH-X was thus nearly equal to that of the SH-4 and 1.53 times that of the SH-3. The actual area efficiency of the SH-X reached 400 MIPS/mm², 8.5 times the 74 MIPS/mm² of the SH-4.
The SH-4 was estimated to achieve 200 MHz and 360 MIPS with 140 mW at 1.15 V, and 280 MHz and 504 MIPS with 240 mW at 1.25 V; the power efficiencies were 2,500 and 2,100 MIPS/W, respectively. The SH-X, on the other hand, achieved 200 MHz and 360 MIPS with 80 mW at 1.0 V, and 400 MHz and 720 MIPS with 250 mW at 1.25 V; the power efficiencies were 4,500 and 2,880 MIPS/W, respectively. As a result, the power efficiency of the SH-X improved by 1.8 times over the SH-4 at the

Fig. 3.18 Power efficiency improvement of SH-4 and SH-X. Architectural values (relative to SH-3) for the SH-3/SH-4/SH-X: performance 1.00/2.47/3.45, power 1.00/2.12/2.10, power-performance ratio 1.00/1.17/1.64. Actual values: SH-3 (3.30 V, 60 MHz) 60 MIPS, 600 mW, 100 MIPS/W; SH-4 (1.95 V, 240 MHz) 430 MIPS, 700 mW, 610 MIPS/W; (1.80 V, 166 MHz) 300 MIPS, 400 mW, 750 MIPS/W; (1.50 V, 133 MHz) 240 MIPS, 240 mW, 1,000 MIPS/W; SH-X (1.25 V, 400 MHz) 720 MIPS, 250 mW, 2,880 MIPS/W; (1.00 V, 200 MHz) 360 MIPS, 80 mW, 4,500 MIPS/W.

Fig. 3.18 Power efficiency improvement of SH-4 and SH-X

same frequency of 200 MHz, and by 1.4 times at the same supply voltage while enhancing the performance by 1.4 times. These were architectural improvements; the actual improvements were further multiplied by the process porting.

3.1.4 Frequency and Efficiency Enhancement of SH-X2

The SH-X2 was developed as the second-generation core and achieved a performance of 1,440 MIPS at 800 MHz using a 90-nm process; its low-power version achieved a power efficiency of 6,000 MIPS/W. The performance and efficiency were greatly enhanced over the SH-X by both architecture and microarchitecture tuning and by process porting.

3.1.4.1 Frequency Enhancement of SH-X2

Analysis of the SH-X showed that the ID stage was the most timing-critical part and that the branch acceleration had successfully reduced the branch penalty. Therefore, a third instruction fetch stage (I3) was added to the SH-X2 pipeline to relax the ID-stage timing. The cycle performance degradation was negligibly small thanks to the successful branch architecture, and the SH-X2 achieved the same cycle performance of 1.8 MIPS/MHz as the SH-X.

Fig. 3.19 Eight-stage superpipeline structure of SH-X2

Figure 3.19 illustrates the pipeline structure of the SH-X2. The I3 stage was
added and performs branch search and instruction predecoding. Then the ID stage
timing was relaxed, and the achievable frequency increased.
Another critical timing path was in the first-level (L1) memory access logic. The SH-X
had three L1 memories, a local memory and I- and D-caches, and the local memory was
unified for both instruction and data accesses. Since all the memories could not be
placed close together, separating the memories for instruction and data was effective
to relax the critical timing path. Therefore, the SH-X2 separated the unified L1 local
memory of the SH-X into instruction and operand local memories (ILRAM and OLRAM).
With various other timing tunings, the SH-X2 achieved 800 MHz in a 90-nm generic
process, up from the SH-X's 400 MHz in a 130-nm process. The improvement was far
greater than the process-porting effect alone.

3.1.4.2 Low-Power Technologies of SH-X2

The SH-X2 enhanced the low-power technologies of the SH-X explained in
Sect. 3.1.3.4. Figure 3.20 shows the clock-gating method of the SH-X2. The D-drivers
also gate the clock with signals dynamically generated by hardware, so the leaf
F/Fs require no control clock pins (CCPs). As a result, the clock-tree and total
powers are 14% and 10% lower, respectively, than with the SH-X method.
The SH-X2 adopted a way-prediction method for the instruction cache. The SH-X2
aggressively fetches instructions using the branch prediction and early-stage branch
techniques to compensate for the branch penalty caused by the long pipeline. The
instruction cache accounted for 17% of the SH-X2 power, and 64% of the
instruction cache power was consumed by the data arrays. The way-prediction misses
were less than 1% in most cases and 0% for the Dhrystone 2.1. The prediction
therefore eliminated 56% of the array accesses for the Dhrystone. As a result,
the instruction cache power was reduced by 33%, and the SH-X2 power was reduced
by 5.5%.
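The effect of way prediction is easy to see in a short conceptual model (a C sketch
under an assumed cache geometry, not the SH-X2 logic): only the predicted way's data
array is read on the first access, and the remaining ways are read, and the predictor
retrained, only when the prediction misses.

#include <stdint.h>
#include <stdbool.h>

#define WAYS 4
#define SETS 256                      /* assumption: 4-way, 32-byte lines */

typedef struct {
    uint32_t tag[WAYS];
    uint32_t data[WAYS][8];           /* 32-byte line = 8 words */
    bool     valid[WAYS];
    uint8_t  predicted_way;           /* trained on every access */
} icache_set_t;

static icache_set_t icache[SETS];

/* Returns true on a cache hit; *word receives the fetched instruction word. */
bool icache_read(uint32_t addr, uint32_t *word)
{
    uint32_t set = (addr >> 5) & (SETS - 1);
    uint32_t tag = addr >> 13;
    icache_set_t *s = &icache[set];

    /* First access: only the predicted way's data array is read. */
    uint8_t w = s->predicted_way;
    if (s->valid[w] && s->tag[w] == tag) {
        *word = s->data[w][(addr >> 2) & 7];
        return true;                  /* prediction hit: 1 of 4 arrays active */
    }
    /* Prediction miss: read the remaining ways and retrain the predictor. */
    for (uint8_t i = 0; i < WAYS; i++) {
        if (i != w && s->valid[i] && s->tag[i] == tag) {
            s->predicted_way = i;
            *word = s->data[i][(addr >> 2) & 7];
            return true;
        }
    }
    return false;                     /* cache miss */
}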

[Figure: clock-gating structure with A-, B-, C-, and D-drivers and gated clock
driver (GCKD) cells; CCP, control clock pin]
Fig. 3.20 Clock-gating method of SH-X2

3.1.5 Efficient Parallelization of SH-4 FPU

In 1995, the SH-3E, the first embedded processor with an on-chip floating-point unit
(FPU), was developed by Hitachi, mainly for a home game console. It operated
at 66 MHz and achieved a peak performance of 132 MFLOPS with a floating-point
multiply–accumulate instruction (FMAC). At that time, on-chip FPUs were popular
in PC/server processors, but there was no demand for an FPU in embedded
processors, mainly because it was too expensive to integrate. However, the programming
of game consoles became difficult as they had to support higher resolutions and
advanced 3D-graphics features. In particular, it was difficult to avoid overflow and
underflow of fixed-point data with its small dynamic range, so there was a demand to
use floating-point data. Since it was easy to implement a four-way parallel operation
with 16-bit fixed-point data, equivalent performance had to be realized at a
reasonable cost when changing the data type to the floating-point format.
Since an FPU was about three times as large as a fixed-point unit, and a four-way
SIMD data path was four times as large as a normal one, it was too expensive to
adopt a four-way SIMD FPU. Further, the FPU architecture of the SH-3E was
limited by the 16-bit fixed-length ISA. The latency of the floating-point operations
was long and required more registers than the fixed-point operations, but
the ISA could define only 16 registers. A popular transformation matrix of 3D
graphics was four by four and occupied 16 registers, leaving no register for
other values. Therefore, an efficient FPU parallelization method had to be developed
that solved the above issues.

3.1.5.1 Floating-Point Architecture Extension

Sixteen was the limit of the number of registers directly specified by the 16-bit
fixed-length ISA. Therefore, the registers were extended to 32 as two banks of 16
registers. The two banks are the front and back banks, named FR0–FR15 and
XF0–XF15, respectively, and they are switched by changing a control bit
FPSCR.FR in the floating-point status and control register (FPSCR). Most instructions use

only the front bank, but some newly defined instructions use both the front and back
banks. The SH-4 uses the front-bank registers as eight pairs or four length-4 vectors
as well as 16 registers and uses the back-bank registers as eight pairs or a four-by-
four matrix. They were defined as follows:

$$\mathrm{DR}n = (\mathrm{FR}n,\ \mathrm{FR}[n+1]) \quad (n: 0, 2, 4, 6, 8, 10, 12, 14),$$

$$\mathrm{FV}0 = \begin{pmatrix} \mathrm{FR}0 \\ \mathrm{FR}1 \\ \mathrm{FR}2 \\ \mathrm{FR}3 \end{pmatrix},\quad
\mathrm{FV}4 = \begin{pmatrix} \mathrm{FR}4 \\ \mathrm{FR}5 \\ \mathrm{FR}6 \\ \mathrm{FR}7 \end{pmatrix},\quad
\mathrm{FV}8 = \begin{pmatrix} \mathrm{FR}8 \\ \mathrm{FR}9 \\ \mathrm{FR}10 \\ \mathrm{FR}11 \end{pmatrix},\quad
\mathrm{FV}12 = \begin{pmatrix} \mathrm{FR}12 \\ \mathrm{FR}13 \\ \mathrm{FR}14 \\ \mathrm{FR}15 \end{pmatrix},$$

$$\mathrm{XD}n = (\mathrm{XF}n,\ \mathrm{XF}[n+1]) \quad (n: 0, 2, 4, 6, 8, 10, 12, 14),$$

$$\mathrm{XMTRX} = \begin{pmatrix}
\mathrm{XF}0 & \mathrm{XF}4 & \mathrm{XF}8 & \mathrm{XF}12 \\
\mathrm{XF}1 & \mathrm{XF}5 & \mathrm{XF}9 & \mathrm{XF}13 \\
\mathrm{XF}2 & \mathrm{XF}6 & \mathrm{XF}10 & \mathrm{XF}14 \\
\mathrm{XF}3 & \mathrm{XF}7 & \mathrm{XF}11 & \mathrm{XF}15
\end{pmatrix}.$$

Since an ordinary SIMD extension of an FPU was too expensive for an embedded
processor as described above, another kind of parallelism was applied to the SH-4.
A large part of FPU hardware is for the mantissa alignment before the operation and
the normalization and rounding after the operation. Further, a popular FPU instruction,
the FMAC, requires three read ports and one write port. Consecutive FMAC operations
are a popular sequence to accumulate plural products. For example, an inner product of
two length-4 vectors is one such sequence and is popular in 3D graphics programs.
Therefore, a floating-point inner-product instruction (FIPR) was defined to
accelerate the sequence with smaller hardware than that for the SIMD. It uses
two of the four length-4 vectors as input operands and modifies the last register of
one of the input vectors to store the result. The defining formula is as follows:

$$\mathrm{FR}[n+3] = \mathrm{FV}m \times \mathrm{FV}n \quad (m, n: 0, 4, 8, 12).$$

This modifying-type definition is similar to that of the other instructions. However,
for a length-3 vector operation, which is also popular, the result can be obtained
without destroying the inputs by setting one of the fourth elements of the input
vectors to zero. The FIPR produces only one result, which is one-fourth of a
four-way SIMD, and can save the normalization and rounding hardware. It requires
eight input registers and one output register, which is less than the 12 input and
four output registers for a four-way SIMD FMAC. Further, the FIPR takes much less
time than the equivalent sequence of one FMUL and three FMACs and requires only a
small number of registers to sustain the peak performance. As a result, the hardware
was estimated to be half that of the four-way SIMD.
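The contrast between the FIPR and the sequence it replaces can be sketched in C
as follows (an illustrative model, not the hardware definition); the scalar version
is a chain of dependent operations, which is why the FIPR needs only about a quarter
of the parallelism and latency:

/* What FIPR computes: FR[n+3] = FVm x FVn as a single instruction. */
float fipr(const float a[4], const float b[4])
{
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2] + a[3] * b[3];
}

/* The equivalent FMUL + 3-FMAC sequence; each step depends on the last. */
float fipr_scalar(const float a[4], const float b[4])
{
    float acc;
    acc  = a[0] * b[0];   /* FMUL */
    acc += a[1] * b[1];   /* FMAC */
    acc += a[2] * b[2];   /* FMAC */
    acc += a[3] * b[3];   /* FMAC: serialized on acc */
    return acc;
}

For a length-3 vector, a[3] or b[3] is simply set to zero, as described above.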

The rounding rule of conventional floating-point operations is strictly defined
by the ANSI/IEEE 754 floating-point standard. The rule is to keep the accurate value
before rounding. However, each instruction performs rounding, and the accumulated
rounding error sometimes becomes very serious. Therefore, a program must
avoid such a serious rounding error without relying on hardware if necessary. The
sequence of one FMUL and three FMACs can also cause a serious rounding error.
For example, the following formula results in zero if we add the terms in the order
of the formula by FADD instructions:

$$1.0 \times 2^{127} + \mathrm{1.FFFFFE} \times 2^{102} + \mathrm{1.FFFFFE} \times 2^{102} - 1.0 \times 2^{127}.$$

However, the exact value is $\mathrm{1.FFFFFE} \times 2^{103}$, and this is also the error
of the formula, which corresponds to the worst error of $2^{-23}$ times the maximum term.
We can get the exact value if we change the operation order properly. The floating-point
standard defines the rule of each operation but does not define the result of the
formula, and either result conforms to the standard. Since the FIPR operation
is not defined by the standard, we defined its maximum error as "$2^{E-25}$ + rounding
error of result" to make it better than or equal to the average and worst-case
errors of the equivalent standard-conforming sequence, where E is the maximum
exponent of the four products.
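This behavior can be reproduced on any IEEE 754 single-precision implementation;
the following C sketch (using C99 hexadecimal float literals) evaluates the formula
in both orders:

#include <stdio.h>

int main(void)
{
    float big = 0x1.0p127f;          /* 1.0      x 2^127 */
    float mid = 0x1.FFFFFEp102f;     /* 1.FFFFFE x 2^102 */

    /* Left to right: each add rounds back to big, so the result is 0.0. */
    float in_order  = big + mid + mid - big;
    /* Reordered: both partial sums are exact, giving 1.FFFFFE x 2^103. */
    float reordered = (big - big) + (mid + mid);

    printf("%a\n%a\n", in_order, reordered);  /* 0x0p+0, 0x1.fffffep+103 */
    return 0;
}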
A length-4 vector transformation was also a popular operation in 3D graphics,
and a floating-point transform-vector instruction (FTRV) was defined. It would have
required 20 registers to specify the operands in a modification-type definition.
Therefore, the defining formula is as follows, using a four-by-four matrix of all the
back-bank registers, XMTRX, and one of the four front-bank vector registers, FV0,
FV4, FV8, and FV12:

$$\mathrm{FV}n = \mathrm{XMTRX} \times \mathrm{FV}n \quad (n: 0, 4, 8, 12).$$

Since a 3D object consists of a lot of polygons expressed by length-4 vectors,
and the same XMTRX is applied to many vectors of a 3D object, the XMTRX
is not changed often and is suitable for the back bank.
The FTRV operation was implemented as four inner-product operations by dividing
the XMTRX into four vectors properly, and its maximum error is the same as that of
the FIPR. It could be replaced by four inner-product instructions if the input and
output registers were made different to keep the input values during the transformation.
The formula would then become as follows:

$$\begin{aligned}
\mathrm{FR}n &= (\mathrm{XF}0\ \mathrm{XF}4\ \mathrm{XF}8\ \mathrm{XF}12) \cdot \mathrm{FV}m,\\
\mathrm{FR}[n+1] &= (\mathrm{XF}1\ \mathrm{XF}5\ \mathrm{XF}9\ \mathrm{XF}13) \cdot \mathrm{FV}m,\\
\mathrm{FR}[n+2] &= (\mathrm{XF}2\ \mathrm{XF}6\ \mathrm{XF}10\ \mathrm{XF}14) \cdot \mathrm{FV}m,\\
\mathrm{FR}[n+3] &= (\mathrm{XF}3\ \mathrm{XF}7\ \mathrm{XF}11\ \mathrm{XF}15) \cdot \mathrm{FV}m.
\end{aligned}$$
The above inner-product operations differ from that of the FIPR in the register usage,
and another inner-product instruction could have been defined to fit the above

operations. However, it would have required four more registers and would be useful
only to replace the FTRV, so the FTRV was the simpler and better approach.
The newly defined FIPR and FTRV enhanced the performance, but the data transfer
ability became a bottleneck to realizing the enhancement. Therefore, a pair load/store/
transfer mode was defined to double the data move ability. In the pair mode, floating-
point move instructions (FMOVs) treat the 32 front- and back-bank floating-point
registers as 16 pairs and directly access all the pairs without the bank switch
controlled by the FPSCR.FR bit. The mode switch between the pair and normal modes is
controlled by a move-size bit FPSCR.SZ in the FPSCR. Further, floating-point
register-bank and move-size change instructions (FRCHG and FSCHG) were defined for
fast changes of the modes defined above.
3D graphics required high performance but used only single precision. On
the other hand, the double-precision format was popular in the server/PC market and
would ease porting PC applications to a handheld PC, but the performance requirement
was not as high as for 3D graphics. However, software emulation was several hundred
times slower than a hardware implementation. Therefore, the SH-4 adopted hardware
emulation with minimal hardware added to the single-precision hardware. The
difference between the hardware emulation and a full implementation is not visible
from the architecture; it appears as a performance difference reflecting the
microarchitecture.
The SH-4 introduced single- and double-precision modes, which were controlled
by a precision bit FPSCR.PR of the FPSCR. Some conversion operations between
the precisions were necessary but did not fit the mode separation. Therefore, the SH-4
supported two conversion instructions in the double-precision mode: an FCNVSD
converts single-precision data to double-precision data, and an FCNVDS converts
vice versa.
In the double-precision mode, eight pairs of the front-bank registers are used for
double-precision data, and one 32-bit register, FPUL, is used for single-precision
or integer data, mainly for the conversions, but the back-bank registers are not used.
This is because the register-file extension is an option, as are the new instructions
FIPR and FTRV. Table 3.7 summarizes all the floating-point instructions including
the new ones.

3.1.5.2 Implementation of Extended Floating-Point Architecture

Figure 3.21 illustrates the pipeline structure of the FPU, which corresponds to the
FPU part of the LS pipeline and the FE pipeline of Fig. 3.1. This structure enables
the zero-cycle transfer of the LS-category instructions except the load/store ones,
a two-cycle latency of the FCMP, a four-cycle latency of the FIPR and FTRV, and
a three-cycle latency of the other FE-category instructions. In the latter half of the
ID stage, register reads and forwarding of on-the-fly data in the LS pipeline are
performed. The forwarding destinations include the FE pipeline. In particular, a
source operand value of an LS-pipeline instruction is forwarded to the FE pipeline
as a destination operand value of the LS-pipeline instruction in order to realize
the zero-cycle transfer.

Table 3.7 Floating-point instructions

LS category              SZ/PR   Operation in C-like expression
FMOV.S @Rm, FRn          0*      FRn = *Rm
FMOV.S @Rm+, FRn         0*      FRn = *Rm; Rm += 4
FMOV.S @(Rm,R0), FRn     0*      FRn = *(Rm + R0)
FMOV.S FRm, @Rn          0*      *Rn = FRm
FMOV.S FRm, @-Rn         0*      Rn -= 4; *Rn = FRm
FMOV.S FRm, @(Rn,R0)     0*      *(Rn + R0) = FRm
FMOV FRm, FRn            0*      FRn = FRm
FSTS FPUL, FRn           0*      FRn = FPUL
FLDS FRm, FPUL           0*      FPUL = FRm
FLDI0 FRn                *0      FRn = 0.0
FLDI1 FRn                *0      FRn = 1.0
FNEG FRn                 *0      FRn = -FRn
FABS FRn                 *0      FRn = |FRn|
FNEG DRn                 01      DRn = -DRn
FABS DRn                 01      DRn = |DRn|
FMOV.S @Rm, DRn          10      DRn = *Rm
FMOV.S @Rm+, DRn         10      DRn = *Rm; Rm += 4
FMOV.S @(Rm,R0), DRn     10      DRn = *(Rm + R0)
FMOV.S DRm, @Rn          10      *Rn = DRm
FMOV.S DRm, @-Rn         10      Rn -= 4; *Rn = DRm
FMOV.S DRm, @(Rn,R0)     10      *(Rn + R0) = DRm
FMOV DRm, DRn            10      DRn = DRm

FE category              SZ/PR   Operation in C-like expression
FADD FRm, FRn            *0      FRn += FRm
FSUB FRm, FRn            *0      FRn -= FRm
FMUL FRm, FRn            *0      FRn *= FRm
FDIV FRm, FRn            *0      FRn /= FRm
FCMP/EQ FRm, FRn         *0      T = (FRn == FRm)
FCMP/GT FRm, FRn         *0      T = (FRn > FRm)
FMAC FR0, FRm, FRn       *0      FRn += FR0 × FRm
FSQRT FRn                *0      FRn = √FRn
FLOAT FPUL, FRn          *0      FRn = (float) FPUL
FTRC FRm, FPUL           *0      FPUL = (long) FRm
FIPR FVm, FVn            *0      FR[n + 3] = FVm × FVn
FTRV XMTRX, FVn          *0      FVn = XMTRX × FVn
FRCHG                    *0      FR = ~FR
FSCHG                    *0      SZ = ~SZ
FADD DRm, DRn            01      DRn += DRm
FSUB DRm, DRn            01      DRn -= DRm
FMUL DRm, DRn            01      DRn *= DRm
FDIV DRm, DRn            01      DRn /= DRm
FCMP/EQ DRm, DRn         01      T = (DRn == DRm)
FCMP/GT DRm, DRn         01      T = (DRn > DRm)
FMAC DR0, DRm, DRn       01      DRn += DR0 × DRm
FSQRT DRn                01      DRn = √DRn
FLOAT FPUL, DRn          01      DRn = (double) FPUL
FTRC DRm, FPUL           01      FPUL = (long) DRm
FCNVSD FPUL, DRn         01      DRn = (double) FPUL
FCNVDS DRm, FPUL         01      FPUL = (float) DRm

Fig. 3.21 Pipeline structure of SH-4 FPU

A floating-point load/store block (FLS) is the main part of the LS pipeline. At the EX
stage, it outputs store data for an FMOV with a store operation, changes the sign for
the FABS and FNEG, and outputs on-the-fly data for the forwarding. At the MA stage, it
gets load data for an FMOV with a load operation and outputs on-the-fly data for
the forwarding. It writes back the result in the middle of the WB stage at the negative
edge of the clock pulse. The written data can then be read in the latter half of the ID
stage, so no forwarding path from the WB stage is necessary.
The FE pipeline consists of three blocks: MAIN, FDS, and VEC. An E0 stage
is inserted to execute the vector instructions FIPR and FTRV. The VEC block is
the special hardware that executes these vector instructions, and the
FDS block is for the floating-point divide and square-root instructions (FDIV and
FSQRT). Both blocks will be explained later. The MAIN block executes the
other FE-category instructions and the postprocessing of all the FE-category ones.
The MAIN block executes the arithmetic operations for two and a half cycles over the
EX, MA, and WB stages.
Figure 3.22 illustrates the structure of the MAIN block. It is constructed to execute
the FMAC, whose three operands are named A, B, and C, calculating the formula
A + B × C. The other instructions FADD, FSUB, and FMUL are treated
by setting one of the inputs to 1.0, −1.0, or 0.0 appropriately.
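A minimal sketch of this operand mapping, viewing the MAIN block as a single
A + B × C datapath (a conceptual C model, not the actual selector logic):

/* Every MAIN-block instruction is executed as A + B*C. */
float main_block(float A, float B, float C) { return A + B * C; }

/* FMAC FR0,FRm,FRn:  FRn = main_block(FRn,  FR0,  FRm);   */
/* FADD FRm,FRn:      FRn = main_block(FRn,  FRm,  1.0f);  */
/* FSUB FRm,FRn:      FRn = main_block(FRn,  FRm, -1.0f);  */
/* FMUL FRm,FRn:      FRn = main_block(0.0f, FRn,  FRm);   */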
A floating-point format includes the special numbers zero, denormalized number,
infinity, and not a number (NaN) as well as the normalized numbers. The inputs are
checked by the Type Check part, and if there is a special number, a proper special-
number output is generated in parallel with the normal calculation and selected at
the Rounder parts of the WB stage instead of the calculation result.
The compare instructions are treated by the Compare part. The comparison is simple,
like an integer comparison, except for some special numbers. The input-check result
of the Type Check part is used for the exceptional cases and selected instead of the
simple comparison result if necessary. The final result is transferred to the EX
pipeline to set or clear the T-bit at the MA stage.

Fig. 3.22 Structure of FPU MAIN block

There are two FMAC definitions. One calculates a sequence of FMUL and FADD
and is good for conforming to the ANSI/IEEE standard, but it requires extra
normalization and rounding between the multiply and add. The extra operations require
extra time and cause inaccuracy. The other calculates an accurate multiply-and-add
value and then normalizes and rounds it. It was not defined by the standard at that
time, but it is now in the standard. The SH-4 adopted the latter, fused definition.
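The two definitions can be modeled in C (C99), where fmaf() rounds once after an
exact multiply-add like the fused definition, while the unfused variant rounds twice
(assuming float expressions are evaluated in single precision, i.e.,
FLT_EVAL_METHOD == 0):

#include <math.h>

/* Fused: round(a + b*c), a single rounding of the exact value. */
float fmac_fused(float a, float b, float c)
{
    return fmaf(b, c, a);
}

/* Unfused: round(round(b*c) + a), two roundings. */
float fmac_unfused(float a, float b, float c)
{
    return a + b * c;
}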
The FMAC processing flow is as follows. At the EX stage, Exp. Diff. and Exp.
Adder calculate the exponent difference of "A" and "B×C" and the exponent of B×C,
respectively, and the Aligner aligns "A" according to the exponent difference. Then the
Multiplier Array calculates the mantissa of "A + B×C." The "B×C" is calculated in
parallel with the above executions, and the aligned "A" is added at the final reduction
logic. At the MA stage, the CPA adds the Multiplier Array outputs, the LNZ detects
the leading nonzero position of the absolute value of the CPA output from the
Multiplier Array outputs in parallel with the CPA calculation, and the Mantissa
Normalizer normalizes the CPA outputs with the LNZ output. At the WB stage, the
Mantissa Rounder rounds the Mantissa Normalizer output, the Exp. Rounder normalizes
and rounds the Exp. Adder output, and both Rounders replace the rounded
result with the special result if necessary to produce the final MAIN block output.
Figure 3.23 illustrates the VEC block. The FTRV reads its inputs over four cycles
to calculate the four transformed vector elements. This means the last read is at the
fourth cycle, which is too late to cancel the FTRV even if an input value causes an
exception. Therefore, the VEC block must treat all the data types appropriately for
the FTRV, and all the denormalized numbers are detected and adjusted differently from
the normalized numbers. As illustrated in Fig. 3.23, the VEC block can start the
operation at the ID stage by eliminating the input operand forwarding, and the above
adjustment can be done at the ID stage.

Fig. 3.23 Structure of FPU VEC block

At the E0 stage, Multiplier Arrays 0–3 and Exp. Adders 0–3 produce the mantissas
and exponents of the four intermediate products, respectively. Since the FIPR and
FTRV definitions allow an error of "$2^{E-25}$ + rounding error of result," the
multipliers need not produce an accurate value, and the multipliers can be made
smaller within the allowed error by properly eliminating the lower-bit calculations.
Then, Exp. Diffs. 01, 02, 03, 12, 13, and 23 generate all six combinations of the
exponent differences, Max. Exp. judges the maximum exponent from the signs of the six
differences, and MUX0–3 select four differences from the six, or zero, to align the
mantissas to the mantissa of the maximum-exponent product. The zero is selected for
the maximum-exponent one. Further, the EMUX selects the maximum exponent as the
exponent of the VEC output.
At the EX stage, Aligners 0–3 align the mantissas by the four selected differences.
Each difference can be positive or negative depending on which is the maximum-exponent
product, but the shift direction for the alignment is always right, and a proper
adjustment is made when the difference is decoded. A 4-to-2 Reduction Array reduces
the four aligned mantissas into two as the sum and carry of the mantissa of the VEC
output. The VEC output is received by the MAIN block at the MUX of the EX stage.
The vector instructions FIPR and FTRV were defined as optional instructions,
and the hardware had to be optimized for the configuration without the optional
instructions. Further, if the hardware had been optimized for all of the instructions,
it could not have been shared properly because of the latency difference between
FIPR/FTRV and the others. Therefore, the E0 stage is inserted only when an FIPR or
FTRV is executed, using a variable-length pipeline architecture, although this causes
a one-cycle stall when an FE-category instruction other than FIPR and FTRV is issued
right after an FIPR or an FTRV, as illustrated in Fig. 3.24.

Fig. 3.24 Pipeline stall after E0 stage use

Fig. 3.25 Out-of-order completion of single-precision FDIV

The FDS block is for the FDIV and FSQRT. The SH-4 adopts an SRT method with
carry-save adders, and the FDS block generates three bits of the quotient or
square-root value per cycle. The numbers of mantissa bits are 24 and 53 for single
and double precision, respectively, and two extra bits, the guard and round bits, are
required to generate the final result. The FDS block thus takes 9 and 19 cycles to
generate the mantissas, and the pitches are 10 and 23 for the single- and
double-precision FDIVs, respectively; the differences come from some extra cycles
before and after the mantissa generation. The pitches of the FSQRTs are one cycle
shorter than those of the FDIVs owing to a special treatment at the beginning. These
pitches are much longer than those of the other instructions and degrade performance
even though the FDIV and FSQRT appear much less frequently than the other
instructions. For example, if one in ten instructions is an FDIV and the other
instructions have one-cycle pitches, the total pitch is 19 cycles. Therefore, an
out-of-order completion of the FDIV and FSQRT was adopted to hide their long
pitches; then only the FDS block is occupied for a long time. Figure 3.25
illustrates the out-of-order completion of a single-precision FDIV.
The single-precision FDIV and FSQRT use the MAIN block for two cycles, at the
beginning and ending of the operations, to minimize the dedicated hardware for the
FDIV and FSQRT. The double-precision ones use it for five cycles: two cycles at
the beginning and three cycles at the ending. The MAIN block is then released to
the following instructions for the other cycles of the FDIV and FSQRT.
The double-precision instructions other than the FDIV and FSQRT are emulated
by the hardware for single-precision instructions with a small amount of additional
hardware for the emulation. Since the SH-4 merged an integer multiplier into the FPU,
it supports 32-bit multiplication and 64-bit addition for an integer multiply-and-accumulate

Fig. 3.26 Double-precision FMUL emulation

Fig. 3.27 Double-precision FADD/FSUB emulation

instruction as well as 24-bit multiplication and 73-bit addition for the FMAC.
The 73 bits are necessary to align the addend to the product even when the exponent
of the addend is larger than that of the product. The FPU thus supports 32-bit
multiplication and 73-bit addition. The 53-bit input mantissas are divided into
higher 21 bits and lower 32 bits for the emulation. Figure 3.26 illustrates the FMUL
emulation. Four products, lower-by-lower, lower-by-higher, higher-by-lower, and
higher-by-higher, are calculated and accumulated properly. The FPU exception checking
is done at the first step, the calculation is done at the second to fifth steps, and
the lower and higher parts are output at the fifth and last steps, respectively.
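The arithmetic of the split can be sketched with 64-bit integers (an illustrative
C model of the partial-product accumulation in Fig. 3.26, not the actual
reduction-array hardware):

#include <stdint.h>

typedef struct { uint64_t hi, lo; } u128;   /* 106-bit product container */

u128 mul53(uint64_t a, uint64_t b)          /* a, b: 53-bit mantissas */
{
    uint64_t al = a & 0xFFFFFFFFu, ah = a >> 32;  /* lower 32 / higher 21 bits */
    uint64_t bl = b & 0xFFFFFFFFu, bh = b >> 32;

    uint64_t ll = al * bl;                  /* lower  x lower  (64 bits) */
    uint64_t lh = al * bh;                  /* lower  x higher (53 bits) */
    uint64_t hl = ah * bl;                  /* higher x lower  (53 bits) */
    uint64_t hh = ah * bh;                  /* higher x higher (42 bits) */

    /* Accumulate the partial products at their bit positions. */
    u128 p = { .hi = hh, .lo = ll };
    uint64_t mid = lh + hl;                 /* cannot overflow: both < 2^53 */
    uint64_t old = p.lo;
    p.lo += mid << 32;
    p.hi += (mid >> 32) + (p.lo < old);     /* carry into the high half */
    return p;                               /* low bits feed the sticky bit */
}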
Figure 3.27 illustrates the FADD and FSUB emulation. The lesser operand is
aligned to the greater operand by comparing which is larger at the first step,
together with the exception check. Only the higher halves of the input operands are
compared

Fig. 3.28 Simple 3D graphics benchmark

because the exponents are in the higher halves, and the alignment shift is not
necessary if the higher halves are the same. The read operands are then swapped if
necessary at the third and later steps. The alignment and addition are done at the
third to fifth steps, and the lower and higher parts are output at the fifth and
last steps.
As a result, the FMUL, FADD, and FSUB take six steps. The conversion instructions
FLOAT, FTRC, FCNVSD, and FCNVDS take two steps, mainly because a
double-precision operand requires two cycles to read or write.

3.1.5.3 Performance Evaluation with 3D Graphics Benchmark

The extended floating-point architecture was evaluated by the simple 3D graphics
benchmark shown in Fig. 3.28. It consists of coordinate transformations, perspective
transformations, and intensity calculations of a parallel beam of light in
Cartesian coordinates. A 3D-object surface is divided into triangular or quadrangular
polygons to be treated by the 3D graphics. The benchmark uses triangular polygons
and affine transformations, which consist of a rotation and a parallel
displacement. The perspective transformation assumes a flat screen expressed as
z = 1. The benchmark is expressed as follows, where A represents an affine
transformation matrix; V and N represent the vertex and normal vectors of a triangle
before the coordinate transformations, respectively; V′ and N′ represent the ones
after the transformations, respectively; Sx and Sy represent the x and y coordinates
of the projection of V′, respectively; L represents the vector of the parallel beam
of light; and I represents the intensity of a triangle surface:

$$V' = AV,\quad S_x = V'_x / V'_z,\quad S_y = V'_y / V'_z,\quad N' = AN,\quad I = \frac{(L, N')}{\sqrt{(N', N')}},$$

$$A = \begin{pmatrix}
A_{xx} & A_{xy} & A_{xz} & A_{xw} \\
A_{yx} & A_{yy} & A_{yz} & A_{yw} \\
A_{zx} & A_{zy} & A_{zz} & A_{zw} \\
0 & 0 & 0 & 1
\end{pmatrix},\quad
V = \begin{pmatrix} V_x \\ V_y \\ V_z \\ 1 \end{pmatrix},\quad
V' = \begin{pmatrix} V'_x \\ V'_y \\ V'_z \\ 1 \end{pmatrix},$$

$$N = \begin{pmatrix} N_x \\ N_y \\ N_z \\ 0 \end{pmatrix},\quad
N' = \begin{pmatrix} N'_x \\ N'_y \\ N'_z \\ 0 \end{pmatrix},\quad
L = \begin{pmatrix} L_x \\ L_y \\ L_z \\ 0 \end{pmatrix}.$$

The numbers of arithmetic instructions per polygon with the above formula are 17
FMULs, 40 FMACs, 4 FDIVs, and an FSQRT without the architecture extension,
and 4 FTRVs, 2 FIPRs, 7 FMULs, 4 FDIVs, and an FSQRT with the extension.
Figure 3.29 shows the resource-occupying cycles for the benchmark. (1) It took
166 cycles to execute the benchmark with the conventional architecture, determined by
the execution cycles of the load, store, and transfer instructions. The arithmetic
operations took 121 cycles and did not affect the performance. (2) The load/store/
transfer execution cycles were reduced to half with the pair load/store/transfer
instructions, and the arithmetic operations were reduced to 67 cycles with the
out-of-order completion of the FDIV and FSQRT. The execution then took 83 cycles.
(3) Furthermore, the register extension with the banked register file made it possible
to keep the transformation matrix in the back bank and reduced the reloading or
save/restore of data. Only the light vector was reloaded. The number of load/store/
transfer instructions then decreased to 25 and was no longer a bottleneck. In
addition, the arithmetic operations decreased to 35 cycles with the FIPR and FTRV.
As explained with Fig. 3.24, a one-cycle stall occurs after the E0 stage use, and
three cycles of such stalls occurred for the benchmark, as well as two cycles of
stalls from normal register conflicts. As a result, the benchmark execution was
reduced by 76%, from 166 cycles to 40 cycles.
Figure 3.30 shows the benchmark performance of the SH-4 at 200 MHz. The
performance was enhanced from 1.2-M polygons/s of the conventional superscalar
architecture to 2.4-M polygons/s by the pair load/store/transfer instructions and
the out-of-order completion of the FDIV and FSQRT, and to 5.0-M polygons/s by the

Fig. 3.29 Resource-occupying cycles of SH-4 for a 3D benchmark



Fig. 3.30 Benchmark performance of SH-4 at 200 MHz

register extension and the extended instructions FIPR and FTRV. The corresponding
scalar performances would be 0.7, 1.3, and 3.1-M polygons/s at 200 MHz
for 287, 150, and 64 cycles, respectively; the superscalar performances were
about 70% higher than the scalar ones, whereas the gain was 30% for the Dhrystone
benchmark. This showed that the superscalar architecture was more effective for
multimedia applications than for general integer applications. Since the SH-3E was a
scalar processor without the SH-4's enhancements, it took 287 cycles, the slowest
case of the above performance evaluations. Therefore, the SH-4 achieved 287/40 = 7.2
times the cycle performance of the SH-3E for media processing like 3D graphics.
The SH-4 achieved excellent media processing efficiency. Its cycle performance
and frequency were 7.2 and 1.5 times as high as those of the SH-3E in the
same process. Therefore, the media performance in the same process was
7.2 × 1.5 = 10.8 times as high. The FPU area of the SH-3E was estimated to be 3 mm²
and that of the SH-4 was 8 mm² in a 0.25-µm process, so the SH-4 FPU was 8/3 = 2.7
times as large as that of the SH-3E. As a result, the SH-4 achieved 10.8/2.7 = 4.0
times the area efficiency of the SH-3E for media processing.
The SH-3E consumed similar power for both the Dhrystone and the 3D benchmark.
On the other hand, the SH-4 consumed 2.2 times as much power for the 3D benchmark
as for the Dhrystone. As described in Sect. 3.1.2.7, the power consumptions of the
SH-3 and SH-4 ported to a 0.18-µm process were 170 and 240 mW at 133 MHz and
a 1.5-V power supply for the Dhrystone. Therefore, the power of the SH-4 was
240 × 2.2/170 = 3.3 times as high as that of the SH-3. The corresponding performance
ratio is 7.2 because they run at the same frequency after the porting. As a result,
the SH-4 achieved 7.2/3.3 = 2.18 times the power efficiency of the SH-3E.
The actual efficiencies including the process contribution are 60 MHz/287 cycles
= 0.21-M polygons/s and 0.21-M polygons/s/0.6 W = 0.35-M polygons/s/W for the
SH-3E, and 5.0-M polygons/s/2 W = 2.5-M polygons/s/W for the SH-4.

3.1.6 Efficient Frequency Enhancement of SH-X FPU

The floating-point architecture and microarchitecture extension of the SH-4
achieved high multimedia performance and efficiency as described in Sect. 3.1.5.
This came mainly from the parallelization by the vector instructions FIPR and

FTRV, the out-of-order completions of FDIV and FSQRT, and proper exten-
sions of the register files and load/store/transfer width. Further parallelization
could be one of the next approaches, but we took another approach to enhance
the operating frequency. Main reason was that the CPU side had to take this
approach for the general applications with low parallelism as described in
Sect. 3.1.2. However, it caused serious performance degradation to allow 1.5
times long latencies of the FPU instructions. Therefore, we enhanced the archi-
tecture and microarchitecture to reduce the latencies efficiently.

3.1.6.1 Floating-Point Architecture Extension

The FDIV and FSQRT of the SH-4 were already long-latency instructions, and the
1.5-times-longer latencies on the SH-X could have caused serious performance
degradation. The long latencies were mainly due to the strict operation definitions of
the ANSI/IEEE 754 floating-point standard, which requires keeping the accurate value
before rounding. However, there was another way if proper inaccuracies were allowed.
A floating-point square-root reciprocal approximate (FSRRA) was defined as an
elementary function instruction to replace the FDIV, FSQRT, or their combination,
so that the long-latency instructions need not be used. In particular, 3D graphics
applications require a lot of reciprocal and square-root reciprocal values, and the
FSRRA is highly effective for them. Further, 3D graphics require less accuracy, and
single precision without strict rounding is accurate enough. The maximum error of the
FSRRA is $\pm 2^{E-21}$, where E is the exponent value of the FSRRA result. The FSRRA
definition is as follows:

$$\mathrm{FR}n = \frac{1}{\sqrt{\mathrm{FR}n}}.$$

A floating-point sine and cosine approximate (FSCA) was defined as another
popular elementary function instruction. Once the FSRRA was introduced, the extra
hardware for the FSCA was not so large. The most popular definition of a
trigonometric function uses the radian as the angular unit. However, the period in
radians is 2π and cannot be expressed by a simple binary number. Therefore, the
FSCA uses a fixed-point number of rotations as the angular expression. The number
consists of a 16-bit integer part and a 16-bit fraction part. The integer part is then
unnecessary for calculating the sine and cosine values owing to their periodicity,
and the 16-bit fraction part can express a sufficient resolution of
360/65,536 = 0.0055°. The angular source operand is set in the CPU–FPU communication
register FPUL because the angular value is a fixed-point number. The maximum error of
the FSCA is $\pm 2^{-22}$, which is an absolute value not related to the result value.
The FSCA definition is as follows:

$$\mathrm{FR}n = \sin(2\pi \cdot \mathrm{FPUL}), \quad \mathrm{FR}[n+1] = \cos(2\pi \cdot \mathrm{FPUL}).$$
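The angular encoding can be modeled in C as follows (a sketch: angle_to_fpul() and
fsca_model() are illustrative helpers built on the C library, not the instruction
itself):

#include <math.h>
#include <stdint.h>

#define TWO_PI 6.28318530717958647692

/* Degrees -> 16.16 fixed-point rotations for FPUL (illustrative helper). */
uint32_t angle_to_fpul(double degrees)
{
    return (uint32_t)(int64_t)llround(degrees / 360.0 * 65536.0);
}

/* C-library model of FSCA FPUL, DRn (the instruction itself is approximate). */
void fsca_model(uint32_t fpul, float *sin_out, float *cos_out)
{
    double turns = (fpul & 0xFFFFu) / 65536.0;  /* periodicity: fraction only */
    *sin_out = (float)sin(TWO_PI * turns);
    *cos_out = (float)cos(TWO_PI * turns);
}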



The double-precision implementation will be explained later; it was implemented to be
faster than that of the SH-4, so the load/store/transfer instructions also had to be
faster for the performance balance. Therefore, a double-precision mode was
defined, as well as the normal and pair modes of the single precision, by using the
FPSCR.PR and SZ bits for the FMOV to treat double-precision data. Further, a
floating-point precision change instruction (FPCHG) was defined for a fast
precision-mode change, like the FRCHG and FSCHG described in Sect. 3.1.5.1.

3.1.6.2 High-Frequency Implementation of the SH-X FPU

The SH-X FPU achieved 1.4 times the SH-4 frequency in the same process while
maintaining or enhancing the cycle performance. Table 3.8 shows the pitches and
latencies of the FE-category instructions of the SH-3E, SH-4, and SH-X. As for the
SH-X, the simple single-precision instructions FADD, FSUB, FLOAT, and FTRC
have three-cycle latencies. Both single- and double-precision FCMPs have two-cycle
latencies. The other single-precision instructions FMUL, FMAC, and FIPR and
the double-precision instructions except FMUL, FCMP, FDIV, and FSQRT have
five-cycle latencies. All the above instructions have one-cycle pitches. The FTRV
consists of four FIPR-like operations, resulting in a four-cycle pitch and an
eight-cycle latency. The FDIV and FSQRT are out-of-order completion instructions
having two-cycle pitches, for the first and last cycles, to initiate a special
resource operation and to perform the postprocessing of normalizing and rounding
the result. Their pitches on the special hardware, expressed in the parentheses, are
about half of the mantissa widths, and the latencies are four cycles more than the
special-hardware pitches. The FSRRA has a one-cycle pitch, a three-cycle pitch on the
special hardware, and a five-cycle latency. The FSCA has a three-cycle pitch, a
five-cycle pitch on the special hardware, and a seven-cycle latency. The
double-precision FMUL has a three-cycle pitch and a seven-cycle latency.
Multiply–accumulate (MAC) is one of the most frequent operations in
computation-intensive applications. The use of a four-way SIMD would achieve the same
throughput as the FIPR, but the latency would be longer, and the register file would
have to be larger. Figure 3.31 illustrates an example of the differences according to
the pitches and latencies of the FE-category SH-X instructions shown in Table 3.8. In
this example, each box shows an operation issue slot. Since FMUL and FMAC have
five-cycle latencies, we must issue 20 independent operations for peak throughput
in the case of a four-way SIMD; the result is available 20 cycles after the FMUL
issue. On the other hand, five independent operations are enough to get the peak
throughput of a program using FIPRs. Therefore, the FIPR requires one-quarter of
the program parallelism and latency.
Figure 3.32 compares the pitch and latency of an FSRRA with those of the equivalent
sequence of an FSQRT and an FDIV according to Table 3.8. Each of the FSQRT
and FDIV occupies the MAIN FPU and special resources for 2 and 13 cycles,
respectively, and takes 17 cycles to get its result, so the final result is available
34 cycles after the issue of the FSQRT. In contrast, the pitch and latency of the
FSRRA are one and
Table 3.8 Pitch/latency of FE-category instructions of SH-3E, SH-4, and SH-X

Single precision        SH-3E    SH-4        SH-X
FADD FRm, FRn           1/2      1/3         1/3
FSUB FRm, FRn           1/2      1/3         1/3
FMUL FRm, FRn           1/2      1/3         1/5
FDIV FRm, FRn           13/14    2 (10)/12   2 (13)/17
FSQRT FRn               13/14    2 (9)/11    2 (13)/17
FCMP/EQ FRm, FRn        1/1      1/2         1/2
FCMP/GT FRm, FRn        1/1      1/2         1/2
FLOAT FPUL, FRn         1/2      1/3         1/3
FTRC FRm, FPUL          1/2      1/3         1/3
FMAC FR0, FRm, FRn      1/2      1/3         1/5
FIPR FVm, FVn           –        1/4         1/5
FTRV XMTRX, FVn         –        4/7         4/8
FSRRA FRn               –        –           1 (3)/5
FSCA FPUL, DRn          –        –           3 (5)/7

Double precision        SH-4        SH-X
FADD DRm, DRn           6/8         1/5
FSUB DRm, DRn           6/8         1/5
FMUL DRm, DRn           6/8         3/7
FDIV DRm, DRn           5 (23)/25   2 (28)/32
FSQRT DRn               5 (22)/24   2 (28)/32
FCMP/EQ DRm, DRn        2/2         1/2
FCMP/GT DRm, DRn        2/2         1/2
FLOAT FPUL, DRn         2/4         1/5
FTRC DRm, FPUL          2/4         1/5
FCNVSD FPUL, DRn        2/4         1/5
FCNVDS DRm, FPUL        2/4         1/5

Fig. 3.31 Four-way SIMD vs. FIPR

Fig. 3.32 FSRRA vs. the equivalent sequence of FSQRT and FDIV

Fig. 3.33 FDIV vs. the equivalent sequence of FSRRA and FMUL

five cycles, which are only one-quarter and approximately one-fifth of those of the
equivalent sequence, respectively. The FSRRA is thus much faster while using a similar
amount of hardware resources.
The FSRRA can also compute a reciprocal, as shown in Fig. 3.33. The FDIV occupies
the MAIN FPU and special resources for 2 and 13 cycles, respectively, and takes 17
cycles to get the result. On the other hand, the FSRRA and FMUL sequence occupies
the MAIN FPU and special resources for two and three cycles, respectively, and
takes ten cycles to get the result. Therefore, the FSRRA and FMUL sequence is better
than using the FDIV if an application does not require a result conforming to the
IEEE standard, and 3D graphics is one such application.
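Modeled in C, the reciprocal sequence is simply the square of the square-root
reciprocal (illustrative; the real FSRRA is an approximation with the error bound
given in Sect. 3.1.6.1):

#include <math.h>

float recip_via_fsrra(float x)      /* valid for x > 0 */
{
    float r = 1.0f / sqrtf(x);      /* FSRRA FRn (approximate) */
    return r * r;                   /* FMUL: 1/x = (1/sqrt(x))^2 */
}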

Fig. 3.34 Arithmetic execution pipeline of SH-X FPU

We decided to make the vector instructions, which were optional in the SH-4, standard
in the SH-X, and the SH-X merged the vector hardware and optimized the merged
hardware. The latencies of most instructions then became less than 1.5 times those of
the SH-4, and all the instructions could use the vector hardware if necessary.
When the SH-4 was developed, the requirements for high-speed double-precision
operations were weak, so hardware emulation was chosen to implement them. In the
SH-X implementation, however, they could use the vector hardware and became faster,
mainly with the wider read/write register ports and the additional multipliers.
Figure 3.34 illustrates the FPU arithmetic execution pipeline. With the delayed
execution architecture, the register-operand read and forwarding are done at the E1
stage, and the arithmetic operation starts at E2. The short arithmetic pipeline treats
the three-cycle-latency instructions. All the arithmetic pipelines share one register
write port to reduce the number of ports. There are four forwarding source points to
provide the specified latencies for any cycle distance of the define-and-use
instructions. The FDS pipeline is occupied for 13/28 cycles to execute a
single-/double-precision FDIV or FSQRT, so these instructions cannot be issued
frequently. The FPOLY pipeline is three cycles long and is occupied three or five
times to execute an FSRRA or FSCA instruction, respectively. The third stage (E4) of
the FPOLY pipeline and the E6 stage of the main pipeline are synchronized for the
FSRRA, and the FPOLY pipeline output merges with the main pipeline at this
point. The FSCA produces two outputs: the first output is produced at the same
timing as that of the FSRRA, and the second one is produced two cycles later, so the
main pipeline is occupied for three cycles, although the second cycle is not used. The
FSRRA and FSCA are implemented by calculating cubic polynomials of properly
divided periods. The width of the third-order term is eight bits, which adds only a
small area overhead while enhancing accuracy and reducing latency.
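The evaluation scheme can be sketched as a segment table of cubic coefficients
evaluated in Horner form (the segment count and table layout below are assumptions
for illustration, not the SH-X values):

#include <stdint.h>

#define SEGMENTS 128                 /* assumed number of divided periods */

typedef struct { float c0, c1, c2, c3; } cubic_t;
static cubic_t table[SEGMENTS];      /* filled per target function */

float eval_piecewise_cubic(float x)  /* x in [0, 1) */
{
    int i = (int)(x * SEGMENTS);              /* segment from the upper bits */
    float d = x - (float)i / SEGMENTS;        /* offset within the segment */
    const cubic_t *c = &table[i];
    /* Horner form: ((c3*d + c2)*d + c1)*d + c0 */
    return ((c->c3 * d + c->c2) * d + c->c1) * d + c->c0;
}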
Figure 3.35 illustrates the structure of the main FPU pipeline. There are four
single-precision multiplier arrays at E2 to execute FIPR and FTRV and to emulate

Fig. 3.35 Main pipeline of SH-X FPU
double-precision multiplication. Their total area is less than that of a
double-precision multiplier array. The calculation of exponent differences is also
done at E2 for the alignment operations by the four aligners at E3. The four aligners
align the eight terms consisting of four sum-and-carry pairs of the four products
generated by the four multiplier arrays, and a reduction array reduces the aligned
eight terms to two at E3. The exponent value before normalization is also calculated
by an exponent adder at E3. A carry-propagate adder (CPA) adds the two terms from the
reduction array, and a leading nonzero (LNZ) detector finds the LNZ position of the
absolute value of the CPA result from the two CPA inputs, precisely and at the same
speed as the CPA, at E4. Therefore, the result of the CPA can be normalized
immediately after the CPA operation with no correction of position errors, which is
often necessary when using a conventional 1-bit-error LNZ detector. The mantissa and
exponent normalizers normalize the CPA and exponent-adder outputs at E5, controlled
by the LNZ detector output. Finally, the rounder rounds the normalized results into
the ANSI/IEEE 754 format. The extra hardware required for the special FPU instructions
FIPR, FTRV, FSRRA, and FSCA is about 30% of the original FPU hardware, and the FPU
area is about 10–20% of the processor core depending on the size of the first and
second on-chip memories. Therefore, the extra hardware is about 3–6% of the
processor core.
The SH-4 used the FPU multiplier for integer multiplications, so the multiplier could
calculate a 32-by-32 multiplication, and the double-precision multiplication
could be divided into four parts. On the other hand, the SH-X separated the integer
and FPU multipliers to make the FPU optional. The FPU thus had four 24-by-24
multipliers for the double-precision FMUL emulation. Since the double-precision
mantissa width is more than twice the single-precision one, the multiplication had to
be divided into nine parts, and three cycles are needed to emulate the nine partial
multiplications with four multipliers.
Figure 3.36 illustrates the flow of the emulation. At the first step, a lower-by-lower
product is produced, and its lower 23 bits are added by the CPA. Then the CPA
output is ORed to generate a sticky bit. At the second step, four products of middle-
by-lower, lower-by-middle, upper-by-lower, and lower-by-upper are produced and

Fig. 3.36 Double-precision FMUL emulation by four multipliers

accumulated to the lower-by-lower product by the reduction array, and their lower 23
bits are also used to generate the sticky bit. At the third step, the remaining four
products, middle-by-middle, upper-by-middle, middle-by-upper, and upper-by-upper,
are produced and accumulated to the already accumulated intermediate values.
The CPA then adds the sum and carry of the final product, and the 53-bit result and
the guard/round/sticky bits are produced. The accumulated terms of the second and
third steps number ten because each product consists of a sum and a carry, but the
bit positions of some terms do not overlap. Therefore, the eight-term reduction array
is enough to accumulate them.

3.1.6.3 Performance Evaluation with 3D Graphics Benchmark

The SH-X floating-point architecture was evaluated with a simple 3D graphics
benchmark. The differences from the benchmark in Sect. 3.1.5.3 are mainly the
transformation matrix type and the adoption of a strip model. The affine
transformation was used for the SH-4 evaluation, but the general transformation was
used for the SH-X; it can express scaling as well as rotation and parallel
displacement, but it requires more calculations. The strip model is a 3D-object
expression method that reduces the number of vertex vectors. In the model, each
triangle has three vertexes, but each vertex is shared by three triangles, so the
number of vertexes per triangle is effectively one. The benchmark is expressed as
follows, where T represents a general transformation matrix; V and N represent the
vertex and normal vectors of a triangle before the coordinate transformations,
respectively; V′ and N′ represent the ones after the transformations, respectively;
Sx and Sy represent the x and y coordinates of the projection of V′, respectively;
L represents the vector of the parallel beam of light; I represents the intensity of
a triangle surface; and V″ is an intermediate value of the coordinate
transformations:

Fig. 3.37 Resource-occupying cycles of SH-X for a 3D benchmark

$$V'' = TV,\quad V' = \frac{V''}{V''_w},\quad S_x = V'_x / V'_z,\quad S_y = V'_y / V'_z,\quad N' = TN,\quad I = \frac{(L, N')}{\sqrt{(N', N')}},$$

$$T = \begin{pmatrix}
T_{xx} & T_{xy} & T_{xz} & T_{xw} \\
T_{yx} & T_{yy} & T_{yz} & T_{yw} \\
T_{zx} & T_{zy} & T_{zz} & T_{zw} \\
T_{wx} & T_{wy} & T_{wz} & T_{ww}
\end{pmatrix},\quad
V = \begin{pmatrix} V_x \\ V_y \\ V_z \\ 1 \end{pmatrix},\quad
V' = \begin{pmatrix} V'_x \\ V'_y \\ V'_z \\ 1 \end{pmatrix},\quad
V'' = \begin{pmatrix} V''_x \\ V''_y \\ V''_z \\ V''_w \end{pmatrix},$$

$$N = \begin{pmatrix} N_x \\ N_y \\ N_z \\ 0 \end{pmatrix},\quad
N' = \begin{pmatrix} N'_x \\ N'_y \\ N'_z \\ 0 \end{pmatrix},\quad
L = \begin{pmatrix} L_x \\ L_y \\ L_z \\ 0 \end{pmatrix}.$$

The coordinate and perspective transformations require 7 FMULs, 12 FMACs,
and 2 FDIVs without the special instructions (FTRV, FIPR, and FSRRA) and 1 FTRV,
5 FMULs, and 2 FSRRAs with the special instructions. The intensity calculation
requires 1 FMUL, 12 FMACs, 1 FSQRT, and 1 FDIV without the special instructions
and 1 FTRV, 2 FIPRs, 1 FSRRA, and 1 FMUL with the special instructions.
Figure 3.37 illustrates the resource-occupying cycles of the 3D graphics bench-
mark. After program optimization, no register conflict occurs, and performance is
restricted only by the floating-point resource-occupying cycles. The gray areas of
the graph represent the cycles of the coordinate and perspective transformations.
Without the special instructions, the FDIV/FSQRT resources are occupied for the
longest cycles, and these cycles determine the number of execution cycles, that is,
26. Using the special instructions enables some of these instructions to be replaced.
In this case, the arithmetic resource-occupying cycles determine the number of
execution cycles, that is, 11, which is 58% shorter than when the special instructions are

Fig. 3.38 Benchmark performance of SH-X at 400 MHz

not used. Similarly, when the intensity is also calculated, the execution cycles are
19 and 52 with and without the special instructions, respectively; that is, 63%
shorter with the special instructions than without them.
Figure 3.38 shows the 3D graphics benchmark performance at 400 MHz, accord-
ing to the cycles shown in Fig. 3.37. Without special instructions, the coordinate and
perspective transformation performance is 15-M polygons/s. With special instruc-
tions, the performance is accelerated 2.4 times to 36-M polygons/s. Similarly, with
intensity calculation, but without any special instructions, 7.7-M polygons/s is
achieved. Using special instructions, the performance is accelerated 2.7 times to
21-M polygons/s.
It is useful to compare the SH-3E, SH-4, and SH-X performance with the same
benchmark. Figure 3.39 shows the resource-occupying cycles of the SH-3E, SH-4,
and SH-X. The main difference between the SH-4 and the SH-X is the newly defined
FSRRA and FSCA, and the effect of the FSRRA is clearly shown in the figure.
The conventional SH-3E architecture took 68 cycles for the coordinate and
perspective transformations, 74 cycles for the intensity calculation, and 142 cycles
in total. Applying the superscalar architecture and the SRT method for the
FDIV/FSQRT while keeping the SH-3E ISA, they became 39, 42, and 81 cycles,
respectively. The SH-4 architecture, having the FIPR/FTRV and the out-of-order
FDIV/FSQRT, made them 20, 19, and 39 cycles, respectively. The performance was good,
but only the FDIV/FSQRT resource was busy in this case. Further, applying the
superpipeline architecture while keeping the SH-4 ISA, they became 26, 26, and 52
cycles, respectively. Although the operating frequency grew higher with the
superpipeline architecture, the cycle performance degradation was serious, and almost
no performance gain was achieved. In the SH-X ISA case with the FSRRA, they became
11, 8, and 19 cycles, respectively. Clearly, the FSRRA solved the long-pitch problem
of the FDIV/FSQRT.
Since we emphasized the importance of efficiency, we evaluated the area and
power efficiencies. Figure 3.40 shows the area efficiencies of the SH-3E, SH-4, and
SH-X. The upper half shows the architectural performance, relative area, and architectural

Fig. 3.39 Resource-occupying cycles of SH-3E, SH-4, and SH-X for a 3D benchmark

Fig. 3.40 Area efficiencies of SH-3E, SH-4, and SH-X

area–performance ratio to compare the area efficiencies with no process-porting
effect. Although the relative areas increased, the performance improvements were
much higher, and the efficiency was greatly enhanced. The lower half shows the actual
performance, area, and area–performance ratio. The efficiency was further enhanced
by the finer processes. Similarly, the power efficiency was also greatly enhanced, as
shown in Fig. 3.41.

Fig. 3.41 Power efficiencies of SH-3E, SH-4, and SH-X

3.1.7 Multicore Architecture of SH-X3

The SH cores continuously achieved high efficiency, as described above. The SH-X3
core was developed as the third generation of the SH-4A processor core series to
achieve higher performance while keeping the high efficiency maintained throughout
the SH core series.
The multicore architecture was the next approach for the series. In this section,
the multicore support features of the SH-X3 are described, whereas the multicore
cluster of the SH-X3 and a snoop controller (SNC) are described in the chip
implementation sections of the RP-1 (Sect. 4.2) and RP-2 (Sect. 4.3).

3.1.7.1 SH-X3 Core Specifications

Table 3.9 shows the specifications of the SH-X3 core, designed based on the SH-X2
core (see Sect. 3.1.4). As its successor, most of the specifications are the same as
those of the SH-X2 core. In addition to such inherited specifications, the
core supports both symmetric and asymmetric multiprocessor (SMP and AMP)
features with interrupt distribution and interprocessor interrupts, in cooperation
with the interrupt controller of an SoC such as the RP-1 and RP-2. Each core of a
cluster can be set to one of the SMP and AMP modes individually. It also supports
three low-power modes, light sleep, sleep, and resume standby, which can differ for
each core, as can the operating frequency. The sizes of the RAMs and caches are
flexible depending on requirements, within the ranges shown in the table.

Table 3.9 SH-X3 processor core specifications

ISA                      SuperH™ 16-bit encoded ISA
Pipeline structure       Dual-issue superscalar, 8-stage pipeline
Operating frequency      600 MHz (90-nm generic CMOS process)
Performance              Dhrystone 2.1: 1,080 MIPS
                         FPU (peak): 4.2/0.6 GFLOPS (single/double)
Caches                   8–64 KB I/D each
Local memories           First level: 4–128 KB I/D each
                         Second level: 128 KB–1 MB
Power/power efficiency   360 mW / 3,000 MIPS/W
Multiprocessor support   SMP: coherency for data caches (up to four cores)
                         AMP: data transfer unit for local memories
                         Interrupt: interrupt distribution and interprocessor
                         interrupt
                         Low-power modes: light sleep, sleep, and resume standby
                         Power management: operating frequency and low-power
                         mode can be different for each core

3.1.7.2 Symmetric Multiprocessor (SMP) Support

The supported SMP data-cache coherency protocols are the standard MESI (Modified,
Exclusive, Shared, Invalid) and ESI modes for the copy-back and write-through modes,
respectively. The copy-back and MESI modes are good for performance, and the
write-through and ESI modes are suitable for controlling accelerators that cannot
properly handle the data caches of the SH-X3 cores.
The SH-X3 outputs one of the following snoop requests for a cache line to the
SNC, together with the line address and the write-back data, if any:
1. Invalidate request for write and shared case
2. Fill-data request for read and cache-miss case
3. Fill-data and invalidate request for write and cache-miss case
4. Write-back request to replace a dirty line
The SNC transfers a request other than a write-back one to proper cores by
checking its DAA (duplicated address array), and the requested SH-X3 core
processes the requests.
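As an illustration only, the dispatch rule can be modeled in C as below. This is a
minimal sketch of the behavior described above; the types and the helper functions
daa_hits() and forward_to_core() are hypothetical, not part of the actual SNC design.

/* Illustrative C model of the snoop-request dispatch described above.
   daa_hits() and forward_to_core() are hypothetical helpers. */
typedef enum {
    SNOOP_INVALIDATE,       /* 1. write to a shared line      */
    SNOOP_FILL,             /* 2. read miss                   */
    SNOOP_FILL_INVALIDATE,  /* 3. write miss                  */
    SNOOP_WRITE_BACK        /* 4. replacement of a dirty line */
} snoop_req_t;

extern int  daa_hits(int core, unsigned long line_addr);
extern void forward_to_core(int core, snoop_req_t req, unsigned long line_addr);

void snc_dispatch(int requester, snoop_req_t req, unsigned long line_addr)
{
    if (req == SNOOP_WRITE_BACK)
        return;  /* write-back data go to memory, not to the other cores */
    for (int core = 0; core < 4; core++)  /* up to four cores */
        if (core != requester && daa_hits(core, line_addr))
            forward_to_core(core, req, line_addr);
}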
In a chip multiprocessor, the core loads are not equal, and each SH-X3 core can
operate at a different operating frequency and in a different low-power mode to
minimize the power consumption for its load. The SH-X3 core supports the SMP
features even in such heterogeneous operation modes. The SH-X3 supports a new
low-power mode, "light sleep," in order to respond to a snoop request from the
SNC while the core is inactive. In this mode, the data cache stays active for
snoop operations, but the other modules are inactive. The detailed snoop processes,
including the SNC actions, are described in Sect. 4.2.

Table 3.10 SH-X4 processor core specifications


ISA SuperH™ 16-bit ISA with prefix extension
Operating frequency 648 MHz (45-nm low-power CMOS process)
Performance Dhrystone 2.1 1,717 MIPS (2.65 MIPS/MHz)
FPU (peak) 4.5/0.6 GFLOPS (single/double)
Power, power efficiency 106 mW, 16 GIPS/W
Address space Logical 32 bits, 4 GB
Physical 40 bits, 1 TB

3.1.7.3 Asymmetric Multiprocessor Support

The on-chip RAMs and the data transfer among the various memories are the key
features for the AMP support. The use of on-chip RAM makes it possible to control
the data access latency, which cannot be controlled well in systems with on-chip
caches. Therefore, each core integrates L1 instruction and data RAMs and a second-
level (L2) unified RAM. The RAMs are globally addressed so that data can be
transferred to/from the other globally addressed memories. Application software
can thus place data at the proper time and location.
The SH-X3 integrated a data transfer unit (DTU) to accelerate the memory data
transfer between the SH-X3 and other modules. The details of the DTU will be
explained in Sect. 3.1.8.4.

3.1.8 Efficient ISA and Address-Space Extension of SH-X4

Embedded systems continuously expand their application fields and enhance their
performance and functions in each field. As key components of such systems,
embedded processors must enhance their performance and functions while maintaining
or enhancing their efficiency. As the latest SH processor core, the SH-X4 extended
its ISA and address space efficiently for this purpose.
The SH-X4 was integrated on the RP-X heterogeneous multicore chip as two four-
core clusters with four FE–GAs, two MX-2s, a VPU5, and various peripheral mod-
ules. The SH-X4 core features are described in this section, and the chip integration
and evaluation results are described in Sect. 4.4. Further, software environments are
described in Chap. 5, and application programs and systems are described in Chap. 6.

3.1.8.1 SH-X4 Core Specifications

Table 3.10 shows the specifications of an SH-X4 core designed based on the SH-X3
core (see Sect. 3.1.7). Most of the specifications are the same as those of the
SH-X3 core, of which the SH-X4 is the successor, and the common parts are not
shown. The SH-X4 extended the ISA with some prefixes, and the cycle performance
was enhanced from 2.23 to 2.65 MIPS/MHz. As a result, the SH-X4 achieved 1,717
MIPS at 648 MHz. The 648 MHz is not much higher than the 600 MHz of the SH-X3,
but the SH-X4 achieved it in a low-power process. The typical power consumption
is therefore 106 mW, and the power efficiency reaches as high as 16 GIPS/W.

3.1.8.2 Efficient ISA Extension

The 16-bit fixed-length ISA of the SH cores is an excellent feature enabling a
higher code density than that of the 32-bit fixed-length ISAs of conventional
RISCs. However, some trade-offs were made to establish the 16-bit ISA. Operand
fields were carefully shortened to fit the instructions into 16 bits according
to code analysis of typical embedded programs in the early 1990s. The 16-bit ISA
was the best choice at that time and for the following two decades. However, the
required performance grew higher and higher, and programs and the data they treat
grew larger and larger. Therefore, we decided to extend the ISA with some prefix
codes.
The weak points of the 16-bit ISA are (1) short immediate operands, (2) lack of
three-operand operation instructions, and (3) implicit fixed-register operands.
With short immediates, a long immediate value requires a two-instruction sequence
of a long-immediate load and a use of the loaded data, instead of a single
long-immediate instruction. A three-operand operation becomes a two-instruction
sequence of a move instruction and a two-operand instruction. The implicit
fixed-register operand makes register allocation difficult and causes inefficient
register allocations.
A popular way to extend a 16-bit ISA is a variable-length ISA. For example, IA-32
is a famous variable-length ISA, and ARM Thumb-2 is a variable-length ISA of 16-
and 32-bit instructions. However, a variable-length instruction consists of plural
unit-length codes, and each unit-length code has plural meanings depending on the
preceding codes. Therefore, a variable-length ISA leads to complicated, large, and
slow parallel issue logic requiring serial code analysis.
Another way is to use prefix codes. The IA-32 uses some prefixes as well as
variable-length instructions, so using prefix codes is also a conventional
approach. However, if we use prefix codes but not variable-length instructions,
we can implement parallel instruction decoding easily. The SH-X4 therefore
introduced some 16-bit prefix codes to extend the 16-bit fixed-length ISA.
Figure 3.42 shows some examples of the ISA extension. The first example is the
operation "Rc = Ra + Rb (Ra, Rb, Rc: registers)," which requires a two-instruction
sequence of "MOV Ra, Rc (Rc = Ra)" and "ADD Rb, Rc (Rc += Rb)" before the
extension, but only one instruction, "ADD Ra, Rb, Rc," after the extension. The
new instruction is made from "ADD Ra, Rb" by a prefix that changes the destination
register operand Rb to a new register operand Rc. The code sizes are the same,
but the number of issue slots is reduced from two to one. The next instruction
can then be issued simultaneously if there is no other pipeline stall factor.
The second example is the operation "Rc = @(Ra + Rb)," which requires a two-
instruction sequence of "MOV Rb, R0 (R0 = Rb)" and "MOV.L @(Ra, R0), Rc
(Rc = @(Ra + R0))" before the extension, but only one instruction, "MOV.L
@(Ra, Rb), Rc," after the extension.

[Figure: instruction encodings of the three examples: (1) "MOV Rb, Rc; ADD Ra, Rc"
replaced by the prefixed "ADD Ra, Rb, Rc (Rc = Ra + Rb)"; (2) "MOV Rb, R0;
MOV.L @(Ra, R0), Rc" replaced by the prefixed "MOV.L @(Ra, Rb), Rc
(Rc = @(Ra + Rb))"; (3) "MOV lit8, R0; MOV.L @(Ra, R0), Rc" replaced by the
prefixed "MOV.L @(Ra, lit8), Rc (Rc = @(Ra + lit8))". In each case, the prefixed
form occupies the same code size as the two-instruction sequence.]

Fig. 3.42 Examples of ISA extension

The new instruction is made from "MOV @(Ra, R0), Rc" by a prefix that changes
the R0 to a new register operand. Then we no longer need to use R0, which was
the implicit third operand with no operand field to specify it. The R0-fixed
operand keeps R0 busy and makes register allocation inefficient, but the above
extension solves this problem.
The third example is the operation "Rc = @(Ra + lit8) (lit8: 8-bit literal)," which
requires a two-instruction sequence of "MOV lit8, R0 (R0 = lit8)" and "MOV.L @
(Ra, R0), Rc (Rc = @(Ra + R0))" before the extension, but only one instruction,
"MOV.L @(Ra, lit8), Rc," after the extension. The new instruction is made from
"MOV.L @(Ra, lit4), Rc (lit4: 4-bit literal)" by a prefix that extends the lit4 to
lit8. The prefix can specify the loaded data size in memory and the extension
type, signed or unsigned, if the size is 8 or 16 bits, as well as the extra
4-bit literal.
Figure 3.43 illustrates the instruction decoder of the SH-X4, which enables dual
issue including instructions extended by prefix codes. The gray parts are the
extra logic for the extended ISA. The instruction registers at the I3 stage hold
the first four 16-bit codes, compared with two codes for the conventional 16-bit
fixed-length ISA; the simultaneous dual issue of instructions with prefixes
consumes four codes per cycle at peak throughput. A predecoder checks in parallel
whether each code is a prefix and outputs control signals for the multiplexers
(MUX) that select the inputs of the prefix and normal decoders properly. Table
3.11 summarizes all cases of the input patterns and the corresponding selections.
A code after a prefix code is always a normal code, so the hardware need not
check it. Each prefix decoder decodes the provided prefix code and overrides the
output of the corresponding normal decoder appropriately. As a result, the
instruction decoder performs dual issue of instructions with prefixes.
Figure 3.44 shows the evaluation results of the extended ISA with four benchmark
programs. The performance of Dhrystone 2.1 was accelerated from 2.28 to 2.65
MIPS/MHz, a 16% improvement. The performance of FFT, FIR, and JPEG encoding was

[Figure: the four fetched codes C0-C3 feed a predecoder, whose outputs control
MUXes routing the codes to the prefix decoders (PD0/PD1 → Prefix Dec. 0/1) and
the normal decoders (ID0/ID1 → Normal Dec. 0/1); each prefix decoder's result
overrides its normal decoder's output to form Output 0 and Output 1.]

Fig. 3.43 Instruction decoder of SH-X4

Table 3.11 Input patterns and selections


Input Output
C0 C1 C2 C3 PD0 ID0 PD1 ID1
N N – – – C0 – C1
N P – – – C0 C1 C2
P – N – C0 C1 – C2
P – P – C0 C1 C2 C3
P prefix, N normal, – arbitrary value
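The selection rule of Table 3.11 can be written compactly in C as a minimal,
illustrative sketch; the types and the function name are assumptions, not the
actual decoder logic.

/* C sketch of the code-routing rule in Table 3.11. is_prefix[i] tells
   whether fetched code Ci is a prefix; the result gives the code index
   (0..3) routed to each decoder, or -1 for "no code". */
typedef struct { int pd0, id0, pd1, id1; } issue_sel_t;

issue_sel_t select_codes(const int is_prefix[4])
{
    issue_sel_t s;
    if (!is_prefix[0]) {                 /* first instruction: C0 alone */
        s.pd0 = -1; s.id0 = 0;
        s.pd1 = is_prefix[1] ? 1 : -1;   /* second instruction starts at C1 */
        s.id1 = is_prefix[1] ? 2 : 1;
    } else {                             /* first instruction: prefix C0 + C1 */
        s.pd0 = 0;  s.id0 = 1;
        s.pd1 = is_prefix[2] ? 2 : -1;   /* second instruction starts at C2 */
        s.id1 = is_prefix[2] ? 3 : 2;
    }
    return s;
}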

[Figure: bar chart of improvement ratios relative to a 100% baseline without
prefix codes: Dhrystone v2.1 116% (2.28 → 2.65 MIPS/MHz), FFT 123%, FIR 134%,
JPEG Encode 110%.]

Fig. 3.44 Performance improvement ratio by prefix codes

improved by 23%, 34%, and 10%, respectively. On the other hand, the area overhead
of the prefix code implementation was less than 2% of the SH-X4. This means the
ISA extension by prefix codes enhanced both performance and efficiency.

3.1.8.3 Address-Space Extension

The 32-bit address can define an address space of 4 GB. The space consists of the
main memory, on-chip memories, various I/O spaces, and so on. The maximum linearly
addressed space for the main memory is then 2 GB.

[Figure: the 32-bit logical space comprises P0/U0 (00000000-7FFFFFFF, translated
by TLB), P1 and P2 (translated by PMB), P3 (translated by TLB), and P4
(E0000000-FFFFFFFF). It maps onto either a 32-bit physical space with a 3.5-GB
(2^32 - 2^29 bytes) linear space or a 40-bit physical space with a 1-TB
(2^40 - 2^29 bytes) linear space, where the P4 control area occupies
FF E0000000-FF FFFFFFFF.]

Fig. 3.45 Example of logical and physical address spaces of SH-X4

However, the total memory size is continuously increasing and will soon exceed
2 GB even in embedded systems. Therefore, we extended the number of physical
address bits to 40 bits, which can define a 1-TB address space. The logical
address space remained 32 bits, and the programming model was not changed, so
binary compatibility was maintained. A logical address space extension would have
required costly 32-to-64-bit extensions of the register files, integer execution
units, branch operations, and so on.
Figure 3.45 illustrates an example of the extension. The 32-bit logical address
space is compatible with the predecessors of the SH-X4 in this example. The MMU
translates a logical address into a 32- or 40-bit physical address by the TLB or
the privileged mapping buffer (PMB) in the 32- or 40-bit physical address mode,
respectively. The TLB translation is a well-known dynamic method, but the original
PMB translation is a static method that avoids the exceptions possible with TLB
translation. The PMB page sizes are therefore larger than those of the TLB in
order to cover the PMB area efficiently.
The logical space is divided into five regions, and the attributes of each region
can be specified as user-mode accessible or inaccessible, translated by TLB or
PMB, and so on. In the example, the P0/U0 region is user-mode accessible and
translated by the TLB, the P1 and P2 regions are user-mode inaccessible and
translated by the PMB, and the P3 region is user-mode inaccessible and translated
by the TLB. The P4 region includes a control register area that is mapped at the
bottom of the physical space so that the linear physical space is not divided by
the control register area.

3.1.8.4 Data Transfer Unit

High-speed and efficient data transfer is one of the key features for multicore
performance, and the SH-X4 core integrates a DTU for this purpose. A DMAC is the
conventional hardware for such data transfers, but the DTU has some advantages
over the DMAC because it is part of the SH-X4 core.

[Figure: the DTU inside the SH-X4 core reads source data from the local memory
under control of a command (source and destination addresses), translates the
destination address through its TTLB (backed by the CPU's UTLB), and sends the
address and data over the SuperHyway bus interface to the destination memory in
the FE-GA.]

Fig. 3.46 DTU operation example of transfer between SH-X4 and FE–GA

For example, when a DMAC transfers data between a memory in an SH-X4 core and the
main memory, the DMAC must initiate two SuperHyway bus transactions: one between
the SH-X4 core and the DMAC, and one between the DMAC and the main memory. In
contrast, the DTU can perform the transfer with one SuperHyway bus transaction
between the SH-X4 core and the main memory. In addition, the DTU can use the
initiator port of the SH-X4 core, whereas a DMAC must have its own initiator
port; even if all the SH-X4 cores have a DTU, no extra initiator port is
necessary. Another merit is that the DTU can share the UTLB of the SH-X4 core,
so the DTU can handle logical addresses.
Figure 3.46 shows an example of a data transfer between an SH-X4 core and an
FE–GA. The DTU has a TTLB, a micro-TLB that caches UTLB entries of the CPU for
independent execution; the DTU fetches a UTLB entry when a translation misses the
TTLB. The DTU's actions are defined by a command chain in a local memory, and the
DTU can execute a chain of plural commands without CPU control. In the example,
the DTU transfers data from a local memory of the SH-X4 to a memory in the
FE–GA. The source data specified by the source address of the command are read
from the local memory, and the destination address specified by the command is
translated by the TTLB. The address and data are then output to the SuperHyway
via the bus interface, and the data are transferred to the destination memory of
the FE–GA.
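For illustration, a command chain could be viewed in C roughly as below. This is
only a sketch under the description above; the field names and layout are
hypothetical, as the actual command format is not specified in the text.

/* Hypothetical C view of a DTU command-chain entry. */
#include <stdint.h>

typedef struct dtu_cmd {
    uint32_t src;           /* logical source address                   */
    uint32_t dst;           /* logical destination address (via TTLB)   */
    uint32_t size;          /* transfer size in bytes                   */
    struct dtu_cmd *next;   /* next command in the chain; NULL ends it  */
} dtu_cmd_t;

/* The CPU builds the chain in a local memory and starts the DTU once;
   the DTU then walks the chain without further CPU control, fetching
   UTLB entries into its TTLB on a translation miss. */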

3.2 Flexible Engine/Generic ALU Array (FE–GA)

The Flexible Engine/Generic ALU Array (FE–GA, or simply FE for Flexible
Engine) [50], a type of dynamically reconfigurable processor [51, 52], is
equipped with a 2D operation cell array composed of general-purpose ALUs with
dynamically changeable functionality, a crossbar network enabling wideband and
flexible internal data transfer, and a multiple-bank/port local memory for
temporary data storage. Further, the FE–GA integrates peripheral functionalities
such as a configuration manager with hierarchical configuration data management
and a background transfer mechanism for efficient configuration control, and a
sequence manager for autonomous sequence control, which enables it to operate as
a highly independent subsystem.

[Figure: the FE-GA comprises a sequence manager (SEQM) and a configuration
manager (CFGM) on a cell control bus and internal bus; an 8 × 4 operation cell
array (24 ALU and 8 MLT cells) with an I/O port controller for external I/O;
10 load/store (LS) cells; a crossbar network (XB); a local memory of 10 banks of
4-16-KB 2-port compiled RAM (CRAM); and a system bus interface.]

Fig. 3.47 Block diagram of flexible engine/generic ALU array (FE–GA)

With the FE–GA, various general-purpose accelerators can be realized for media
processing in combination with appropriate control CPUs (sub-CPUs).

3.2.1 Architecture Overview

Figure 3.47 illustrates the architecture of the FE–GA, which consists of an
operation block and a control block. The operation block is composed of a
two-dimensional array of arithmetic logic unit (ALU) and multiplication (MLT)
cells whose functions and connections to neighboring cells are dynamically
changeable, a multiple-bank local memory (LM) for data storage, load/store (LS)
cells that generate addresses for the LM, and a crossbar (XB) network supporting
internal data transfer between the LS cells and the LM. The LM is divided into
plural banks (CRAMs). The control block consists of a configuration manager
(CFGM) that manages the configuration data for the operation block and a sequence
manager (SEQM) that controls the state of the operation block. The FE–GA is
highly optimized in terms of power and performance for media processing in
embedded systems.

[Figure: two CPUs, a DMA controller, an interruption controller, a memory, two
FE-GAs, and peripherals connected to a common system bus.]

Fig. 3.48 Example of system configuration incorporating FE–GAs

The features of the FE–GA are as follows. It has:


• A 2D, nonuniform array structure formed with different types of operation cells
(24 ALU cells and 8 MLT cells) whose functions and connections are changeable
every cycle
• Simplified data transfer between four neighboring cells, achieving short
wiring and a high operating frequency
• Flexible memory addressing using dedicated memory-access cells (10 LS cells)
• A middle-capacity, multiple-bank/port LM for temporary operation data storage
(maximum of 16 KB × 10 banks, 2 ports)
• A wideband XB network enabling flexible connection between the operation cell
array and the LS cells
• A configuration manager (CFGM) that supports hierarchical configuration data
management and background data transfer, which can be performed in parallel with
cell-array operation
• A sequence manager (SEQM) supporting autonomous sequence control, making the
FE-GA a highly independent subsystem
• An interruption and direct memory access (DMA) transfer request feature to
control handover and synchronized data transfer for collaborative processing
with a CPU or DMA controller (DMAC) outside the FE–GA
• Input/output ports handling streaming data with no impact on the system bus,
and scalable performance by cascading multiple FE–GAs based on a remote
memory-access feature
The FE–GAs are attached to the system bus, which is usually connected to the
CPUs, a DMA controller, an interruption controller, peripherals, and a memory,
as illustrated in Fig. 3.48. The CPUs control the FE–GAs and the DMA controller
and execute program parts that are not suitable for processing on the FE–GAs.
The FE–GAs execute processes previously accelerated by dedicated special logic
circuits. Today's applications for SoCs (systems on a chip) that incorporate
CPUs and accelerators demand ever more complicated functions. However, due to
limitations in area, power dissipation, and development cost, no existing SoC
has sufficient space to mount the increasing number of special logic circuits.
The FE–GAs can execute multiple operations that are not necessarily executed
simultaneously by switching their functions. This makes it possible to save area
on such SoCs and to use the space efficiently.

[Figure: the ALU cell receives 8-bit data (with valid bits) and 1-bit carries
from its four neighbors through an input switch, processes them in an ALU unit
(arithmetic, logical, and flow-control operations), an SFT unit (shift
operations), and a THR unit (data control) with delay-controlled transfer
registers, and drives its four neighbors through an output switch; cell operation
and configuration are controlled over the cell control bus.]

Fig. 3.49 Block diagram of arithmetic operation (ALU) cell

3.2.2 Arithmetic Blocks

The 8 × 4 2D operation-cell array consists of two types of cells: 24 arithmetic
operation (ALU) cells that mainly execute arithmetic, logical, and shift
operations and 8 multiplication (MLT) cells that mainly execute multiplication
and multiply-and-accumulate operations. Figures 3.49 and 3.50 show block diagrams
of the ALU cell and the MLT cell, respectively. The number of data inputs and
outputs is the same in all of the cells. The position of the MLT cells is
selectable in either the first or second row from the left. Every cell is
connected only to its four neighboring cells; therefore, the FE–GA can operate
at a high frequency due to its short-distance wiring. Calculations and data
transfers are executable simultaneously. Consequently, data can be transferred
by relaying them through multiple cells without lowering the operation efficiency.
The ALU and MLT cells are equipped with operation units, operand-delay registers
for input data and accumulation, output registers, and configuration control
logic that cooperates with the sequence manager and configuration manager placed
outside the cell array. The three operation units of the ALU cell are an ALU
unit executing arithmetic operations, logical operations, and flow control; an
SFT unit executing shift operations; and a THR unit executing data control. The
two operation units of the MLT cell are a MAC unit executing multiplication,
multiply-accumulation, accumulation, and addition, and a THR unit executing data
control.

[Figure: the MLT cell mirrors the ALU cell structure, with a MAC unit
(multiplication, multiply-and-accumulate, accumulation, and addition) and a THR
unit (data control) between the input and output switches; 8-bit data and 1-bit
carries, each with valid bits, connect to the four neighboring cells.]

Fig. 3.50 Block diagram of multiplication (MLT) cell

The configuration control circuits include configuration registers that store
configuration data corresponding to CPU commands, which determine the operation
of the cell. Each cell can execute as many as four operations at the same time,
and the number of cycles consumed varies from one to three depending on the
operation. Table 3.12 lists the instruction set, which includes 49 instructions
for the ALU and MLT cells. The instructions support data widths of 16 bits,
8 bits, and 1 bit, where no suffix is attached to instructions for 16-bit data,
the suffix ".B" is attached for 8-bit data, and the suffix ".C" is attached for
1-bit data.

3.2.3 Memory Blocks and Internal Network

The FE–GA has a 10-bank local memory (CRAMs) to store both operands for the
operation cell array and operation results. Each bank can be accessed from both
the operation cell array and the outside CPUs in units of 16-bit data. The
maximum size of a memory bank is 16 KB, or 8 K words. Each bank is a dual-port
type; therefore, data transfers to/from the memory and operations on the cell
array can be executed simultaneously.
To utilize the multiple banks of the local memory easily and flexibly, the FE–GA
has load-store (LS) cells that can be configured exclusively for access control
of each bank. Figure 3.51 shows a block diagram of the LS cell. The LS cells
generate addresses, arbitrate multiple accesses, and control access protocols to
the local memory in response to memory accesses from the cell array or the
outside CPUs.

Table 3.12 Instruction set for arithmetic cells


Type Operations
Arithmetic operation [MLT] ACC[S;signed/U;unsigned] (accumulation)
[ALU] ADD, ADDC(with carry) (addition)
[MLT] ADDSAT[S/U], ADDSAT[S/U]C (addition)
[MLT] ADDSUB (addition and subtraction)
[MLT] MAC[S/U], MACSU, MACUS (multiply and
accumulation)
[MLT] MULC[S/U], MUL[S/U], MULSU, MULUS (multiply)
[ALU] SUBB, SUB (subtraction)
(supports 16-bit data with no suffix and 8-bit data with suffix of .B)
Logical operation [ALU] AND, NOT, OR, RED (reduction), XOR
(supports 16-bit, 8-bit, and 1-bit data with suffix of .C except
for reduction)
Shift operation [ALU] EXTS (sign extension), EXTU (zero extension)
ROTL, ROTR (rotation), ROTCL, ROTCR (rotation with carry)
SHAL, SHAR (arithmetic shift), SHLL, SHLR (logical shift)
SWAP (swap)
(supports 8-bit and 1-bit data for extension and swap, 16-bit
data for rotation and shift)
Data control [ALU/MLT] NOP (no operation), STC (immediate value store)
THR (data forwarding)
(supports 16-bit, 8-bit, and 1-bit data)
Flow control [ALU] CNT (count), GATE (data forwarding with condition)
GES, GEU, GTS, GTU (comparison)
JOIN (join), MUX (multiplexing), TEST (equal comparison)
(supports 16-bit and 8-bit data)

[Figure: the LS cell connects the crossbar (ports 0 and 1) to the two ports of a
local memory bank; read and write controls with memory interfaces serve each
port, and a bus interface provides outside access, all under operation cell
control/configuration control on the cell control bus.]

Fig. 3.51 Block diagram of load-store (LS) cell



Table 3.13 Instruction set for load/store cells


Type Operations
Load/store LD (load)
LDINC, LDINCA, LDINCP (load with address increment)
LDX (load with extension)
ST (store)
STINC, STINCA, STINCP (store with address increment)
STX (store with extension)
(supports 16-bit data with no suffix and 8-bit data with .B suffix)

The LS cells can generate various addressing patterns satisfying the
applications' characteristics by selecting appropriate addressing and timing
control methods. The addressing methods include direct supply from the cell
array and generation of modulo addresses in the LS cells, and both methods can
use bit reversing. The timing control methods include designation by the cell
array and generation in the LS cells. Table 3.13 gives the instruction set,
which includes ten instructions for the LS cells. The instructions support data
widths of 16 bits and 8 bits, where no suffix is attached to instructions for
16-bit data, and the suffix ".B" is attached for 8-bit data.
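As a rough illustration of the addressing methods, the following C sketch
generates a modulo address with optional bit reversal (useful, e.g., for FFT data
reordering). The function is an assumption for illustration, not part of the LS
cell or FDL specification.

/* Illustrative model of LS-cell address generation: modulo addressing
   with optional bit reversal. Assumes modulo > 0 and 0 < width <= 16. */
#include <stdint.h>

uint16_t ls_address(uint16_t base, uint16_t index,
                    uint16_t modulo, int bit_reverse, int width)
{
    uint16_t offset = index % modulo;            /* modulo addressing */
    if (bit_reverse) {                           /* reverse the low 'width' bits */
        uint16_t r = 0;
        for (int i = 0; i < width; i++)
            r |= (uint16_t)(((offset >> i) & 1u) << (width - 1 - i));
        offset = r;
    }
    return (uint16_t)(base + offset);            /* 16-bit word address */
}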
The crossbar is a network of switches that connects the 16 operation cells on the
left and right sides of the cell array and the 10 LS cells according to the
crossbar configuration. It supports various connections such as point to point,
multiple points to point (broadcast of data loaded by an LS cell to operation
cells), and point to multiple points (stores of data from an operation cell to
multiple banks of the local memory via LS cells) for efficient memory usage. It
also supports separate transfers of the upper and lower bits on a load data bus
from multiple banks of the local memory.

3.2.4 Sequence Manager and Configuration Manager

The sequence manager consists of a state controller, a sequence controller,
control registers handling interruptions and errors, and a sequencer. The
sequencer performs thread switching according to these registers' settings and
trigger information stemming from operation results of the operation cell array,
such as those of the ALU cells. Figure 3.52 illustrates a sample thread state
diagram describing a sequence definition of thread switching. Two types of
thread states are defined: a state without a branch and one with branches
specified by switching conditions. Once an outside CPU starts the first thread,
the FE–GA autonomously performs thread execution and switching repeatedly in
accordance with the defined sequence, which achieves dynamic reconfiguration
with no CPU operations.

[Figure: a chain of seven threads (Thr. 1-7); thread states without branches
proceed to a single successor, while thread states with branches select one of
several successor threads according to switching conditions.]

Fig. 3.52 Sample thread state diagram

[Figure: configuration data flow from the memory on the system bus through the
bus interface into the configuration buffer of the configuration manager, and
from there, on request of the sequence manager, into the configuration registers
of the operation cell array, the LS cells, and the crossbar.]

Fig. 3.53 Block diagram explaining configuration loading mechanism

The configuration manager consists of a configuration buffer, write control
registers, and write control logic. Figure 3.53 shows a block diagram of an
FE–GA that illustrates its configuration loading mechanism. The configuration
buffer stores configuration data that have been transferred from the memory by
an outside CPU or DMA controller before thread execution on the FE–GA. The
buffer enables a configuration that is commonly used among multiple operation
cells to be shared. Consequently, it reduces the amount of configuration data
and therefore both the configuration transfer time and the area of the
configuration buffer.
The configuration manager loads the configuration data into the registers of the
operation cell array, the LS cells, and the crossbar on request from the sequence
manager when thread switching occurs. The configuration loading can also be done
in advance of a thread switching; therefore, the overhead of a configuration
load, which consumes about 100 cycles, can be concealed by performing it in the
background of a thread execution.

[Figure: flowchart from Start through "set up configuration control registers,"
"set up sequence control registers," "transfer data," and "thread switching"
into "execute operations," looping through the "thread switch?" and "operation
finished?" decisions, then "transfer data" and End.]

Fig. 3.54 Operation flowchart of FE–GA

3.2.5 Operation Flow of FE–GA

The FE–GA carries out various processes on a single hardware platform by setting
up configurations of the operation cell array, the LS cells, and the crossbar
network and by changing the configurations dynamically. Figure 3.54 shows an
operation flowchart of the FE–GA.
The operation steps of the FE–GA are as follows:
1. Set up configuration control registers.
The FE–GA executes specified arithmetic processing in such a way that each cell
and the crossbar operate according to their configurations corresponding to CPU
commands. This specified processing is called a thread, which is identified by a
logical thread number. At this stage, an outside CPU or a DMA controller sets up
the controlling resources in the configuration manager, such as the registers
that define the buffers storing configuration data and the correspondence of a
logical thread number to data stored in the configuration buffer.
2. Set up sequence control registers.
The FE–GA defines states by combining the configuration state of each cell and
the crossbar, identified by the logical thread number, and parameters such as an
operation mode and an operation state. A transition from one internal state to
another is called a thread switch, and a series of switchings is called a
sequence. At this stage, an outside CPU or a DMA controller sets up a sequence
control table defining the switching conditions and the states before and after
each switching and initializes the internal state.

[Figure: the FE-GA editor generates sequence (S-FDL) and thread (T-FDL)
descriptions; the FDL constraint checker verifies them; the FDL assembler
converts the verified S-FDL and T-FDL into a sequence object and a thread
object; and the FDL linker combines the objects into a linked object.]

Fig. 3.55 Toolchain of FE–GA software development

3. Transfer data.
An outside CPU or a DMA controller transfers the data necessary for the operation
from an outside buffer memory or another bank of the FE–GA's local memory to the
specified bank of the local memory. It also transfers the operation results to
memories inside and outside the FE–GA.
4. Thread switch (reconfiguration).
After completion of the setups, an outside CPU triggers the FE–GA, and the FE–GA
starts its operation under the sequence manager. The sequence manager observes
both the internal state and the trigger events that establish the conditions for
thread switching. When a condition for thread switching is satisfied, it updates
the internal state and executes the thread switching, which consumes two cycles.
When the processing is finished or an error occurs, it halts the processing and
issues an interruption to an outside CPU for service.
5. Execute operations.
When thread switching is completed, the FE–GA starts the processing defined by
the configurations identified by the switched-to logical thread number. The
processing continues until the next thread-switch condition is satisfied.
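Putting the five steps together, the CPU-side control could look like the
following minimal C sketch. All function names are hypothetical placeholders for
the real register-level operations, which are not specified here.

/* Hypothetical CPU-side control sequence for the five steps above. */
extern void fega_setup_config(const void *cfg);          /* step 1 */
extern void fega_setup_sequence(const void *seq_table);  /* step 2 */
extern void fega_copy_to_local_mem(const void *data, unsigned size); /* step 3 */
extern void fega_trigger(void);                          /* step 4 */
extern void fega_wait_interrupt(void);

void run_fega_job(const void *cfg, const void *seq_table,
                  const void *input, unsigned input_size)
{
    fega_setup_config(cfg);                   /* configuration control registers */
    fega_setup_sequence(seq_table);           /* sequence control table/state    */
    fega_copy_to_local_mem(input, input_size);/* operand data to local memory    */
    fega_trigger();                           /* kick the first thread           */
    fega_wait_interrupt();                    /* steps 4-5 run autonomously; the */
                                              /* FE-GA interrupts when finished  */
}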

3.2.6 Software Development Environment

Programming the FE–GA involves mapping the operation cell array, called a thread,
and a sequence of multiple threads, as depicted in Fig. 3.52. The FE–GA has a
dedicated assembly-like programming language called Flexible-Engine Description
Language, or FDL. There are two types of FDL: Thread-FDL (T-FDL), which describes
a cell-array mapping, and Sequence-FDL (S-FDL), which describes a sequence of
threads. Users first create both T-FDL and S-FDL with the FE–GA editor and
convert them into binary using the FE–GA tools shown in Fig. 3.55. The toolchain
includes an editor, a constraint checker, an assembler, and a linker. The editor
is a graphical tool on which users set up the functions of each operation cell,
the data allocation of the local memory, and the sequence definition of threads.
It has a simulator for verifying users' FE–GA programming, and it can also
generate FDLs.

[Figure: flowchart of the development steps: create a reference program for the
CPU → determine program parts suitable for FE-GA processing → divide a process
into threads → create a data flow graph (DFG) → map the DFG with the FE-GA
editor → create FDLs → check them with the FDL checker and debug on the FE-GA
simulator → convert the FDLs into objects with the FDL assembler → compress and
combine the objects with the FDL linker → create FE-GA controlling codes → debug
on the FE-GA and CPU integrated simulator or on a real chip.]

Fig. 3.56 Software development steps for system with FE–GA

The constraint checker verifies both types of FDL files in terms of grammar and
specifications and generates verified FDL files. The assembler then converts the
FDL files into a sequence object and a thread object, respectively. Finally, the
FDL linker combines both object files into a linked object with section
information that includes the address of its placement in memory. It also
compresses the object by combining common instructions among the operation cells
so that the object fits in the configuration buffer of the FE–GA.
The software development process for a system with an FE–GA is shown in
Fig. 3.56. The process is rather involved because it aims at optimal
performance. Users first create a reference program implementing the target
application, which is executable on a CPU. Then, the FE–GA-executable parts of
the program are determined by considering whether such parts can be mapped on
the operation array of the FE–GA in both a parallel and a pipelined manner.
Because operation resources, such as the operation cells and the local memory,
are limited, users need to divide an FE–GA-executable part into multiple
threads. Data flows are then extracted from each thread to create a data flow
graph (DFG). Data placement on the multiple banks of the local memory is also
studied so that the data are provided to the operation cells continuously in
parallel. Users then program the operation cells' functions and inter-cell
wirings, taking into consideration the timing of data arrival at each cell,
according to the DFG and the data placement, using the FE–GA editor. The program
is debugged using the FE–GA simulator in the next step.

The object is then generated using the assembler and the linker. Since the FE–GA
is managed by CPUs, users need to create FE–GA control codes and attach them to
the reference program. Finally, the combined program for the CPUs and the FE–GA
is debugged on the integrated CPU and FE–GA simulator or on a real chip.

3.2.7 Implementation of Fast Fourier Transform on FE–GA

Fast Fourier transform (FFT), a common algorithm in media processing, was
implemented on the FE–GA for evaluation. In this subsection, details of the
implementation are described. The algorithm used for mapping and evaluation was
a radix-2 decimation-in-time FFT, the most common form of the Cooley–Tukey
algorithm [53, 54]. We used this algorithm because the radix-2 FFT is simple,
and the decimation-in-time FFT multiplies the data by a twiddle factor in the
first part of the calculation, before the fixed-point processing. This avoids
having to use wiring for supplying a twiddle factor into the middle of the cell
array and therefore preserves the cell resources for the fixed-point processing.
The format of the input and output data is 16-bit fixed point (Q-15 format).
The FFT is calculated by repeating the butterfly calculation (in the
decimation-in-time algorithm):

a = x + y × W, b = x − y × W,

where a, b, x, and y are complex data and W is a complex twiddle factor. The
equations can be divided into real and imaginary parts as follows:

ar = xr + yr × Wr − yi × Wi, ai = xi + yr × Wi + yi × Wr,
br = xr − yr × Wr + yi × Wi, bi = xi − yr × Wi − yi × Wr.
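For concreteness, a minimal C model of this Q15 butterfly follows. It is an
illustrative sketch, not FE-GA code; overflow saturation is omitted, and an
arithmetic right shift is assumed. The >> 15 corresponds to taking the upper
16 bits of the Q30 product after the 1-bit left shift described below.

/* Minimal C model of the radix-2 DIT butterfly in Q15 fixed point. */
#include <stdint.h>

void butterfly_q15(int16_t xr, int16_t xi, int16_t yr, int16_t yi,
                   int16_t wr, int16_t wi,
                   int16_t *ar, int16_t *ai, int16_t *br, int16_t *bi)
{
    /* y * W: 32-bit products, then renormalize Q30 -> Q15 */
    int16_t tr = (int16_t)(((int32_t)yr * wr - (int32_t)yi * wi) >> 15);
    int16_t ti = (int16_t)(((int32_t)yr * wi + (int32_t)yi * wr) >> 15);
    *ar = (int16_t)(xr + tr);  *ai = (int16_t)(xi + ti);
    *br = (int16_t)(xr - tr);  *bi = (int16_t)(xi - ti);
}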

Figure 3.57a shows a data flow graph of the above equations. The circled "×,"
"<<," "+," and "−" indicate multiplication, 1-bit left shift, addition, and
subtraction, respectively. Because the data are 16-bit fixed point, the upper
16 bits of the product with W must be shifted left by 1 bit. Figure 3.57b
depicts a mapping of the data flow graph on 4 × 4 cells (the upper half of the
operation cell array). A rectangle in each cell indicates an operation, and an
arrow between rectangles shows 16-bit data (a dashed arrow is 1-bit data). In
the rectangles, "D" inserts a 1-cycle delay, "ROTL" shifts 1 bit to the left
and inserts an input 1-bit datum into the LSB, "MSB" outputs the MSB as 1-bit
data, which is realized by an "add with carry" operation, and "~MSB" outputs
the complement of the MSB as 1-bit data. Note that the MLT cells, normally set
in the second row of the cell array, are set in the first row in order to map
the FFT. After the multiplications at the first row, the MSB of the lower
16-bit datum is extracted, and the upper 16-bit datum, with a 1-cycle delay
applied, is shifted 1 bit to the left with the MSB of the lower datum attached
to the LSB.

[Figure: (a) data flow graph of the fixed-point butterfly: multiplications of
yr/yi by wr/wi, 1-bit left shifts, then additions and subtractions producing ar,
ai, br, and bi; (b) its mapping on the 4 × 4 cells, with the MLT cells in the
first row followed by D/ROTL/MSB handling and add/subtract cells.]

Fig. 3.57 Data flow graph and mapping of FFT butterfly calculation

[Figure: (a) original and (b) transformed data flow of the eight-point FFT over
three stages; in the transformed flow, every stage uses an identical butterfly
pattern, with the twiddle factors redistributed accordingly.]

Fig. 3.58 Transformation of eight-point FFT data flow graph

After the subtractions and additions are applied, the calculation results are
obtained and stored in the local memory.
The FFT algorithm is modified to obtain identical flow graphs at each stage. This
makes it possible to reduce the number of configurations and to avoid the
port-number constraint of the local memory. Figure 3.58 shows both the original
flow (a) and the modified flow (b) of the 8-point FFT. A square in each stage
shows the twiddle factor Wab = exp(−2πib/a), where "a" is positioned higher and
"b" lower in the squares.
Two butterfly calculations can be mapped and executed on the cell array at a
time. Therefore, for efficient use of the local memory, one butterfly calculation
is applied to the data with even numbers, and the other butterfly is applied to
those with odd numbers.

[Figure: two butterfly pipelines mapped side by side on the operation cell
array; the even- and odd-numbered operands (yer/yei/xer/xei and yor/yoi/xor/xoi)
are read from local-memory banks 0-3 through the LS cells (LDINC), the twiddle
factors wr and wi in banks 8-9 are broadcast to both pipelines, and the results
ar, br, ai, and bi are stored to banks 4-7 (STINC).]

Fig. 3.59 Mapping of parallelized eight-point FFT on FE–GA

In other words, the input data are divided into two groups, with even numbers
and odd numbers, and they are stored in different banks of the local memory
(banks 0 and 2 for the even numbers, banks 1 and 3 for the odd numbers)
(Fig. 3.59). Also, the two different input data of the butterfly, x and y, are
respectively stored in the first half and the latter half of the same bank of
the local memory. Since each bank is a dual-port memory, these two data items
can be read simultaneously, and they are provided to two operation cells at the
same time by the crossbar's multicast operation. Operation results are stored
in different banks (banks 4-7) of the local memory.
Since the FFT algorithm is modified to obtain an identical mapping of the
butterfly calculation, the total number of threads depends on the cell
configurations related to data input and output. Figure 3.60 describes the
defined threads and their sequence for the 1,024-point FFT. The 1,024-point FFT
has 10 stages of butterfly calculations. The configuration of the cell array,
which includes the ALU and MLT cells, is common among all the stages. The input
and output data are divided so as to be stored in the ten banks of the local
memory. One stage places its output data in banks of the local memory, and the
next stage uses those data as its input; in other words, two types of
configurations for the LS cells (L1 and L2 in the figure) are defined and used
alternately. The twiddle factors are placed in different banks, and their
location in the banks differs at each FFT stage.

[Figure: ten threads, one per FFT stage, all sharing a single ALU/MLT mapping
(A1); the crossbar alternates between two mappings (X1, X2) and the LS cells
between two mappings (L1, L2), while the twiddle-factor sets (W2, W4, W8, ...,
W512) change from stage to stage.]

Fig. 3.60 Thread definition and sequence for 1,024-point FFT

Table 3.14 Evaluated cycles of 1,024- and 2,048-point FFT on FE–GA

Number of    Total    Breakdown of the cycles
FFT points   cycles   Operation   Load/store init. delay   Config. preloading
1,024        2,747    2,560       80                        107
2,048        5,838    5,632       88                        118

Therefore, the total number of threads is the same as the number of FFT stages,
as illustrated in Fig. 3.60.
The performance of the 1,024-point and 2,048-point FFT on the FE–GA was
evaluated. All the data, including the input data and twiddle factors, were
placed in the local memory, and the configurations were stored in the
configuration buffer. The evaluated execution cycles therefore include the
operations, data loads and stores from/to the local memory, thread switching,
and configuration preloading to the operation cells. Note that the cycles
exclude a bit-reversing process. Table 3.14 gives the evaluation results. The
operations account for most of the total cycles, and there are relatively few
overhead cycles, consisting of the initial delay (data and configuration load)
and thread switching.

3.3 Matrix Engine (MX)

The Matrix Engine (MX) core, which has a massively parallel SIMD architecture,
was developed as a special-purpose processor core suitable for
arithmetic-intensive applications like image and signal processing. There are
two versions of the MX core, MX-1 and MX-2, and they are described in the
following subsections.

[Figure: the MX processor controller (MPC) with its instruction RAM drives SIMD
control signals to the MX processor array (MPA), where 2,048 PEs sit between two
planes of SRAM data registers (2,048 entries, 256-bit interfaces); the H-ch
connects the data registers to the PEs, the V-ch connects the PEs vertically,
and an I/O interface handles data input/output.]

Fig. 3.61 MX-1 architecture

3.3.1 MX-1

3.3.1.1 Architecture Overview

Applications like image processing and recognition employed in portable devices
demand processing performance of up to several tens of GOPS, which is far beyond
the capabilities of conventional CPUs or DSPs. In these areas requiring high
performance, hard-wired logic LSIs are commonly used to realize both high
performance and low power dissipation. However, hard-wired solutions have
problems in cost efficiency because the algorithms for media processing are
improved at short intervals. Therefore, powerful and also programmable devices
are desired for these multimedia applications. Against this background, our
motivation is to improve the energy efficiency and flexibility of SIMD
architectures while realizing performance sufficient for multimedia
applications.
Figure 3.61 shows an overview of the MX-1 architecture [55, 56]. MX-1 is the
first version of the MX core. MX-1 consists of the matrix processor array (MPA),
the matrix processor controller (MPC), which is a dedicated controller with an
instruction memory, and the I/O interface for data I/O. The main components of
the MPA are two planes of data register array matrices and 2,048 fine-grained
(2-bit) processing elements (PEs). The data register array matrices are composed
of single-port SRAM cells to enhance the area efficiency. Each PE adopts a
2-bit-grained structure, which includes two full adders and some logic circuits,
to minimize its size.
As shown in Fig. 3.61, there are channels in two directions for data processing.
One is the horizontal channel (H-ch), which connects the data register array
matrices and the PEs. The other is the vertical channel (V-ch), which realizes
flexible data communication among the PEs.

[Figure: in each entry (one data-register column plus one PE), two pointers from
the MPC address Operand_A and Operand_B; at cycle k, 2 bits of Operand_A are
read, at cycle k + 1 they are latched in the temporary register and added to
2 bits of Operand_B by the 2-bit ALU, and the sum is written back within the
same cycle, proceeding 2 bits per cycle from the LSB.]

Fig. 3.62 Operation flow of H-ch

The powerful processing performance of the MX is realized through the cooperation
of the H-ch and the V-ch. The design concept of the MX is based on the SIMD
architecture; therefore, all PEs and SRAM data registers operate in the same
way. The operations of the processing array portion are controlled by SIMD
control signals generated by the MPC. The MPC generates the control signals by
decoding the instructions stored in its instruction RAM. That is, all the
operations of MX-1 can be controlled by sequence programs loaded in the
instruction RAM, as in conventional DSPs.
The operation flow of an addition utilizing the H-ch is explained in Fig. 3.62.
One column of the data register array and one PE are grouped as one entry, which
is the operation unit of the H-ch. Each PE is equipped with a 2-bit ALU and
temporary registers that hold the data read out from the data registers. As the
PE is 2-bit grained, the operands stored in the data registers are processed in
a bit-serial way. At cycle k, 2 bits (the LSBs) of Operand_A are read out, and
they are stored in the temporary registers at the next cycle, k + 1. At this
k + 1 cycle, 2 bits of Operand_B are read out from the other plane of the data
register array and added to the stored data in the temporary registers.
The output data of the PE are written back to the data registers within the same
cycle k + 1, utilizing the read-modify-write operation of the SRAM. We adopted
this two-operand type of operation, such as B = A + B; therefore, each SRAM
memory cell behaves as if it were an accumulator. In addition, with the
double-sided memory structure shown in Fig. 3.62, the processing throughput is
enhanced up to one 2-bit-grained addition per cycle.

[Figure: (a) operation flow of one SRAM column: sense amplifier → RS latch → PE
execution (modify) → write driver → write-back; (b) timing diagram: the SRAM
word line is activated every clock cycle, and each cycle completes a full
read-modify-write (RMW) sequence of read, execute, and write.]

Fig. 3.63 Read-modify-write based data-path design (© 2007 IEEE)

With these techniques, if 2,048 sets of 16-bit additions are executed with
2,048 entries in parallel, MX-1 can process all the data in ten cycles
(including the overhead of the pipelined operation); therefore, a set of
operands stored in one entry is processed in approximately 0.005 cycle (ten
cycles/2,048 entries). Note that the practical implementation of the PEs and
the double-sided memory is completely symmetrical: the temporary registers and
PEs have connections to both sides of the data register array (the required
selectors are not shown in Fig. 3.62). The design concept of the H-ch proposed
here contributes significantly to the enhancement of the processing throughput
while maintaining the area efficiency.
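The entry-parallel, bit-serial behavior of the H-ch can be modeled functionally
in C as below. This is an illustrative sketch that runs sequentially on a CPU;
on MX-1 the outer loop runs in parallel across all 2,048 entries.

/* Functional C model of the H-ch addition B = A + B: every entry adds
   its operands 2 bits per cycle with a per-entry carry register. */
#include <stdint.h>
#define ENTRIES 2048

void hch_add16(const uint16_t a[ENTRIES], uint16_t b[ENTRIES])
{
    for (int e = 0; e < ENTRIES; e++) {             /* parallel on MX-1 */
        unsigned carry = 0;
        uint16_t sum = 0;
        for (int k = 0; k < 16; k += 2) {           /* 2 bits per cycle */
            unsigned s = ((a[e] >> k) & 3u) + ((b[e] >> k) & 3u) + carry;
            sum |= (uint16_t)((s & 3u) << k);
            carry = s >> 2;                         /* PE carry register */
        }
        b[e] = sum;                                 /* read-modify-write back */
    }
}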
Figure 3.63 shows the proposed design technique employed in this work, which is
based on the read-modify-write (RMW) operation of the SRAM. The main feature of
this design is that the sequential operations required for the H-ch processing
(readout, execution, and write-back) are completed in one clock cycle. The
asynchronous RS latch located next to the sense amplifier holds the readout data
until the write-back operation is completed. As shown in the timing diagram of
Fig. 3.63b, the word line of the SRAM can be activated at every clock cycle,
which brings a high data-processing throughput. In addition, the proposed design
methodology keeps each PE as small as possible by eliminating unnecessary
pipeline registers. Although the proposed scheme reduces the maximum operating
frequency, portable multimedia devices do not require a high-frequency system
clock, and reducing the clock cycles required for data processing is more
important in building a high-performance engine.

[Figure: a three-step example on entries #0-#3: step 1 loads operands a0-a3 into
the temporary registers (H-ch operation); step 2 shifts the temporary registers
by one entry (V-ch operation); step 3 adds the shifted values to operands b0-b3
and stores the results (H-ch operation).]

Fig. 3.64 Operation flow of V-ch

SIMD processors need an efficient mechanism for communicating data among the PEs
because large data sets and complex algorithms must be processed using multiple
data entries. The V-ch in Fig. 3.61 is designed for this purpose, and the
operation flow using the V-ch is shown in Fig. 3.64. Figure 3.64 shows how data
are added to data stored in neighboring entries. In the first step, the operands
are loaded into the temporary registers of the PEs. In the second step, all the
data in the temporary registers are moved by one entry, simply as in shift
registers; this is the V-ch operation. In the third step, the other operands are
added to the data in the temporary registers, and the results are written back.
The proposed simple PE network with the V-ch enables flexible processing and is
quite effective for many applications, such as convolutions, FFT, and so on.
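As a functional illustration, a V-ch move can be modeled in C as follows; the
function is a sketch, not MX-1 microcode.

/* Functional C model of a V-ch move: the temporary registers of all
   entries shift by a signed step with wrap-around at the boundaries. */
#include <stdint.h>
#define ENTRIES 2048

void vch_move(int16_t tmp[ENTRIES], int step)  /* step: +/-1, 2, 4, ..., 256 */
{
    int16_t next[ENTRIES];
    for (int e = 0; e < ENTRIES; e++)
        next[(e + step + ENTRIES) % ENTRIES] = tmp[e];  /* circular shift */
    for (int e = 0; e < ENTRIES; e++)
        tmp[e] = next[e];
}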
Although some kinds of PE networks have been reported [57-62] for massively
parallel processors, those circuits have substantial area overhead, or their
operations are too complex to be controlled by simple SIMD control signals.
Considering this, we adopted the simple shift-register-type network shown in
Fig. 3.65. The temporary registers in the PEs are utilized to form the shift
registers. As shown in the example of a "+1 entry move," the data in each entry
move to the neighboring entry. A feature of this implementation is that the
entries located at the boundaries, such as entry #0 and entry #2,047, exchange
data with each other. Any movement can be realized with only ±1-entry shift
steps; however, long-distance data moves would then cost many cycles. To reduce
the cycle overhead in the long-distance case, MX-1 supports several shift steps,
such as ±1, 2, 4, ..., 256. Of course, any movement can also be realized by
combining these steps (e.g., a +5-entry movement is realized by executing +1
and +4 movements one after another).

[Figure: the temporary-register contents of entries #2046, #2047, and #0-#5 are
traced through an initial state, a "+1 entry move" (each value moves to the next
entry; entry #0 receives a[2047]), and a "-2 entry move" (each value moves two
entries backward), with wrap-around at the boundaries.]

Fig. 3.65 Conceptual diagram of V-ch operation

[Figure: the V-ch wiring is built from intra-bank connections within each
64-entry bank of SRAM and PEs and inter-bank connections among the banks, laid
out symmetrically across the 32 banks.]

Fig. 3.66 Overview of wire implementations of V-ch

The required clock cycles for a movement can be reduced by increasing the
variety of shift steps; however, the supported steps were decided by considering
the trade-off between cycle reduction and area overhead.
Figure 3.66 shows a physical overview of the wire implementation of the V-ch.
Increasing the variety of shift steps leads to an area overhead; however, the
V-ch realizes an area-efficient, powerful network by utilizing its symmetrical
layout property and multi-metal-layer technology. The V-ch wire networks are
implemented with upper metal layers, and the V-ch circuits shown in Fig. 3.66
are simple and small; therefore, these powerful networks have been realized with
negligibly small silicon area overhead.

Table 3.15 Comparison of PE configurations

Configuration item         Coarse grained with       Fine grained in general   Fine grained in this
                           multiplier (8-32 bits)    (1 bit or 2 bits)         work (2 bits)
Parallelism per unit area  Moderate                  Very good                 Very good
Performance of addition    Moderate                  Very good                 Very good
Performance of MAC         Moderate                  Not good                  Good


3.3.1.2 PE Design

Several kinds of PE configurations are candidates for building a massively parallel SIMD processor like MX-1. Table 3.15 compares various PE configurations. In general, a finer-grained PE configuration has an advantage in area efficiency because the circuit structure of each PE is simple and small. MX-1 utilizes this feature and maximizes the parallelism, up to 2,048 PEs in a small silicon area of 3.1 mm2 in 90-nm process technology. On the other hand, conventional coarse-grained configurations [57–59] require a large silicon area, so the realized parallelism is moderate, for example, up to 128. Because a coarse-grained PE is usually equipped with a dedicated multiplier, both simple additions and MAC operations can be processed with moderate performance. Each PE of MX-1 is basically composed of 2-bit-grained full adders. Therefore, MX-1 gives the best performance in applications that mainly consist of simple additions or subtractions, for example, pixel interpolation and SAD (sum of absolute differences). In contrast, MAC operations cost many clock cycles because they are broken down into simple additions. Our motivation is to enhance MAC performance by adopting the fine-grained (2-bit) PE configuration without reducing the massive parallelism of 2,048. However, it is difficult to equip the 2-bit-grained PEs employed in MX-1 with dedicated multipliers; therefore, some contrivances in both the PE circuit configuration and the operation flow are required. Because a MAC operation is realized by breaking it down into simple additions, the best approach is to reduce the total number of additions by decreasing the number of partial products generated in the MAC operation flow. Booth's algorithm is a well-known method of enhancing MAC performance by decreasing the number of generated partial products. Looking at the radix-4 Booth encoding table shown in Table 3.16, three characteristic operations applied to the multiplicand can be found: one-bit shifting, complementing, and NOP (no operation). Therefore, by adding some control circuits to support these operations, Booth's algorithm can be applied to the 2-bit-grained processing elements.
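Viewed in software terms, the encoding in Table 3.16 is a simple 3-bit lookup. The following C sketch is an illustration of the radix-4 Booth rule, not the MX-1 encoder circuit itself; it derives the D/F/N control flags from three multiplier bits:

```c
/* Radix-4 Booth encoding of one 2-bit multiplier digit, following
 * Table 3.16. Inputs are the multiplier bits b[i+1], b[i], b[i-1];
 * outputs are the control flags D (shift), F (complement), N (valid).
 * A software illustration, not the MX-1 encoder circuit. */
typedef struct { int d, f, n; } BoothCtrl;

BoothCtrl booth_encode(int b_hi, int b_mid, int b_lo)
{
    int code = (b_hi << 2) | (b_mid << 1) | b_lo;   /* 0..7, one table row */
    BoothCtrl c;
    c.d = (code == 3 || code == 4);   /* +/-2A rows: shift the multiplicand */
    c.f = (code >= 4);                /* lower half of table sets F, per Table 3.16 */
    c.n = (code != 0 && code != 7);   /* 000 and 111 encode 0 (NOP)         */
    return c;
}

/* The selected partial product is then
 *   pp = N ? (F ? -(D ? 2*A : A) : (D ? 2*A : A)) : 0,
 * which the PE realizes with shifting (D), inversion plus carry-in (F),
 * and validity gating (N). */
```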
Table 3.16 Booth encoding table

B[i+1] (XH) | B[i] (X) | B[i−1] (F−1) | Operation | Shift (D) | Complement (F) | Nop (N)
0 | 0 | 0 | 0 (NOP) | 0 | 0 | 0
0 | 0 | 1 | +A (ADD) | 0 | 0 | 1
0 | 1 | 0 | +A (ADD) | 0 | 0 | 1
0 | 1 | 1 | +2A (SHIFT & ADD) | 1 | 0 | 1
1 | 0 | 0 | −2A (SHIFT & SUB) | 1 | 1 | 1
1 | 0 | 1 | −A (SUB) | 0 | 1 | 1
1 | 1 | 0 | −A (SUB) | 0 | 1 | 1
1 | 1 | 1 | 0 (NOP) | 0 | 1 | 0

Fig. 3.67 Circuit diagram of processing element: two full adders (FAs), a Booth encoder producing the D/F/N flags, temporary, carry, and shift-compensate (S) registers, a valid register (V), and left/right multiplexers for inter-PE input and output

Figure 3.67 shows a circuit diagram of the PE adopted in this work. It is quite simply configured, consisting mainly of two full adders (FAs), eight flip-flops, and some logic. This circuit is designed to support the radix-4 Booth's algorithm

which operates according to Table 3.16. The D/F/N registers, which store the encoded results of the Booth encoder, are implemented to control the way partial products are generated: D switches whether the multiplicand is shifted (a 1-bit shift), F switches whether the multiplicand is inverted for complementing, and N switches whether the partial product is valid. In addition, S is a register for shift compensation, which functions when the D register is set to 1, and the V register is implemented for validating the function of each PE. Figure 3.68 shows the proposed operation flow of a MAC operation. First, 2 bits of the multiplier are loaded into the temporary registers of the PE, XH and X, and the values of the D/F/N registers are fixed by Booth encoding. Next, 2 bits of the multiplicand are loaded into the XH and X registers, and 2 bits of the accumulator region are added to the data in the XH, X, and S registers under the conditions set by the D/F/N registers. These sequences are realized by microprograms stored in the instruction RAM in the controller. With the proposed circuit configuration, a 16-bit fixed-point signed MAC operation costs about 100 cycles in each PE, which is 56% less than with a non-Booth circuit configuration. The MAC cycle cost of 100 cycles is normalized to 0.05 cycles per PE because MX-1 executes 2,048 MAC operations in parallel. In this way, fast MAC operations based on Booth's algorithm are realized even with the 2-bit-grained PE configuration of MX-1.

Fig. 3.68 Operation flow of a MAC operation: multiplier digits are loaded and Booth-encoded into the D/F/N registers, the multiplicand is repeatedly added to the accumulator region in the data registers (SRAM) through the 2-bit FA, the result is sign-extended, and the sequence repeats for the next multiplier digit

Figure 3.69 shows the micrograph of the MX-1 core, and the performance of MX-1 is summarized in Table 3.17.

Fig. 3.69 Micrograph of MX-1 core
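As a quick check of the normalization quoted above (a worked calculation using only the figures given in the text):

\[
\frac{100\ \text{cycles per MAC}}{2{,}048\ \text{PEs in parallel}} \approx 0.049 \approx 0.05\ \text{cycles per MAC},
\]

so the ideal aggregate throughput at 200 MHz is \(200\ \text{MHz} \times 2{,}048 / 100 \approx 4.1\) GMAC/s; the 3.6 GOPS reported in Table 3.17 is of this order, the gap presumably reflecting control and data-movement overhead.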
Table 3.17 Performance of MX-1

Process technology | 90-nm 7Cu CMOS low-standby-power process
Core size | 3.1 mm2
Operation frequency | 200 MHz
Power consumption | 250 mW
Maximum performance | 40 GOPS (16-bit addition); 3.6 GOPS (fixed-point signed 16-bit MAC)

Fig. 3.70 Circuit diagram of 4-bit-grained PE: a 4-bit temporary register (XREG) loaded from the H-ch0/H-ch1 and V-ch channels feeds a 4-bit ALU in which two Booth encoders, one per 2-bit multiplier digit, select multiplicand multiples (−2x, −x, 0, x, 2x and −8x, −4x, 0, 4x, 8x) that are summed in an adder

3.3.2 MX-2

The performance required of image processing keeps rising; therefore, a second version of the MX core (MX-2), whose architecture is improved over that of MX-1, was developed [63]. The main technologies for the enhancement are:
1. Expanding the processor elements from 2-bit grained to 4-bit grained
2. Improving the pipeline architecture of the MX controller
3. Equipping a double-frequency mode
Hereafter, these technologies are described in detail.
Figure 3.70 shows a block diagram of the PE of MX-2. The PE contains a 4-bit temporary register (XREG) and a 4-bit-grained ALU. The XREG loads data from the data registers through the horizontal channels (H-ch0, H-ch1) or the vertical channel (V-ch). The PE loads data from the data registers through H-ch0 or H-ch1, operates on it with data from the XREG, and stores the result back to the data register in 1 cycle, in the same way as MX-1. The ALU and the XREG operate in parallel when they access different banks of the data register. The ALU contains two Booth encoders, an adder, and their peripherals. In a MAC operation, the multiplicand data in the XREG are combined with the outputs of the two Booth encoders and summed in the adder; therefore, two 4-bit partial products are handled in 1 cycle. The 4-bit ALU thus calculates partial products four times faster than the 2-bit ALU of MX-1, which must repeat 2-bit calculations four times to do the same work.
In addition to the improvement of the PEs of the MPA, the architecture of the MPC is also enhanced to extract the maximum parallel processing performance from the MX processor. Figure 3.71 shows the block diagram of the MPC. It basically consists of the instruction RAM, the control registers, and the control logic. The control logic decodes the microinstructions stored in the instruction RAM and generates the control commands for the control registers in the MPC and for the PEs and SRAMs in the MPA. The control registers store the data for controlling the MPA, such as address pointers. With this architecture, when the MPC is occupied with maintaining its own registers, such as setting immediate data into the control registers, the operation rate of the MPA degrades. To avoid this degradation, FIFO circuits are newly added in the MPC.

Fig. 3.71 Block diagram of MPC: instruction RAM, control logic, and control registers, with a FIFO between instruction fetch and the MPA control commands issued to the PEs and SRAMs
Figure 3.72 shows an example operation of the controller and the MPA when an
application program is executed. A1–A4 are instructions for the MPA operation.
C1–C3 are instructions that operate only the controller, not the MPA. The MPA needs multiple cycles for the A1–A3 instructions. The multiple
cycle execution happens when the same bank of the data register is accessed by
both of the XREG and the ALU in the PE operation. Without FIFO, the controller
must wait for completion of the MPA operation. The MPA also needs to stay in idle
states until A4 instruction in the controller is completed. These “WAIT” and
“IDLE” cycles are absorbed by FIFO. While the MPA executes A1, the controller
can operate without waiting, and the next instructions (A2–A3) are stored in FIFO.
These instructions in the FIFO are executed by the MPA in parallel with the controller operations C2–A4.

Fig. 3.72 FIFO operation (A1–A4: MPA instructions; C1–C3: MPC instructions). Without the FIFO, the MPC waits for each multicycle MPA instruction (A1–A3) to complete and the MPA then idles for three cycles while C1–C3 execute; with the FIFO, the MPC issues A1–A3 back to back and proceeds to C1–C3 while the MPA drains the FIFO, so the WAIT and IDLE cycles are eliminated
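The decoupling can be pictured with a small software model. The C sketch below uses hypothetical types and depth, since the text does not give the hardware FIFO parameters; it shows the controller-side enqueue and the array-side dequeue:

```c
#include <stdint.h>

/* Toy model of the MPC-to-MPA command FIFO (hypothetical depth and
 * command encoding). The controller enqueues MPA commands and continues
 * with its own register maintenance; the MPA dequeues when ready. */
#define FIFO_DEPTH 4

typedef struct {
    uint32_t cmd[FIFO_DEPTH];
    unsigned head, tail, count;
} CmdFifo;

static int fifo_push(CmdFifo *f, uint32_t cmd)   /* MPC side */
{
    if (f->count == FIFO_DEPTH) return 0;        /* full: MPC must wait  */
    f->cmd[f->tail] = cmd;
    f->tail = (f->tail + 1) % FIFO_DEPTH;
    f->count++;
    return 1;
}

static int fifo_pop(CmdFifo *f, uint32_t *cmd)   /* MPA side */
{
    if (f->count == 0) return 0;                 /* empty: MPA idles     */
    *cmd = f->cmd[f->head];
    f->head = (f->head + 1) % FIFO_DEPTH;
    f->count--;
    return 1;
}
```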
In addition to the above-mentioned technologies, MX-2 is equipped with a double-frequency mode, which raises the maximum operating frequency of MX-2 and hence its performance. The high throughput of the ALU operations of the MX core is realized by the read-modify-write (RMW) operation of the SRAM [55]. The RMW operation also realizes low-power operation because a set of read and write operations of the SRAM activates the word line only once. The RMW operation is thus useful for power-efficient ALU operation; however, it limits the operating frequency of the MX core. The MX-2 core therefore has a normal-frequency (NF) mode and a double-frequency (DF) mode. In the NF mode, the RMW operation is executed in 1 cycle. In the DF mode, the RMW operation is divided into two cycles, and MX-2 can be operated at a higher frequency. This mode is used when high performance matters more than low power consumption. The operating cycles of an 8-bit addition and an 8-bit MAC increase to 6 and 18 cycles, respectively, in the DF mode, and in image processing applications the cycle count in the DF mode increases by up to 40% over the NF mode. However, the maximum operating frequency in the DF mode is almost double that of the NF mode, so the processing performance on real applications can still be improved with the DF mode.
Figure 3.73 shows performance comparisons of MX-1 and MX-2 when various application programs are executed. To clarify the effect of each improvement, Case A, a 4-bit PE combined with the conventional MPC, is added to the graph. An improvement of about 20–40% is confirmed with the 4-bit-grained PE alone, and a further 20–40% improvement is realized with the improved MPC.
Figure 3.74 shows the micrograph of the MX-2 core, and the performance of MX-2 is summarized in Table 3.18.
Fig. 3.73 Comparison of MX-1's and MX-2's performance: normalized operating cycles (a.u.) for FFT (2,048-point and 8-point), 3 × 3 convolution filter, 3 × 3 median filter, optical flow, look-up table, and Harris operator, comparing MX-1 (2-bit-grained PE with conventional MPC), Case A (4-bit-grained PE with conventional MPC), and MX-2 (4-bit-grained PE with improved MPC)

Fig. 3.74 Micrograph of MX-2 core

Table 3.18 Performance of MX-2

Process technology | 65-nm 7Cu CMOS low-standby-power process
Core size | 5.29 mm2
Operation frequency | 300 MHz (normal frequency); 560 MHz (double frequency)
Power consumption | 330 mW
Maximum performance | 61.5 GOPS (16-bit addition); 10.4 GOPS (fixed-point signed 16-bit MAC)

3.4 Video Processing Unit

This section introduces the architecture and circuit techniques for video encoding/
decoding processors. This video codec processor is embedded in the heterogeneous
multicore chip as a special-purpose processor (SPP), which is described in Chap. 2.

3.4.1 Introduction

Consumer audiovisual devices such as digital video cameras, mobile handsets, and
home entertainment equipment have become major drivers for raising the perfor-
mance and lowering the power consumption of signal processing circuits. Market
trends in the field of consumer video demand larger picture sizes, higher bit rates,
and more complex video processing. In video coding, the wide range of consumer
applications requires the ability to handle video resolutions across the range from
standard definition (SD, i.e., 720 pixels by 480 lines) to full high definition (full HD,
i.e., 1,920 pixels by 1,080 lines) encoded in multiple video coding standards such as
H.264, MPEG-2, MPEG-4, and VC-1. H.264 [64] is one of the latest standards for
motion-estimation-based codecs. It contains a number of new features [65, 66] that
allow it to compress video much more effectively than older standards, but it requires
more processing power. The availability of context-adaptive binary arithmetic coding
(CABAC) is considered one of the primary advantages of the H.264 encoding
scheme, since it provides more efficient data compression than other entropy encoding
schemes, including context-adaptive variable-length coding (CAVLC). However, it
also requires considerably more processing. The trade-off between high performance
and low-power consumption is a key focus of video codec design for advanced
embedded systems, especially for mobile application processors [28, 67–69].
Many video coding processors have been proposed. Generally, these codecs use
one of two approaches. The first approach constructs video encoding and decoding
software on homogenous high-performance processor cores [67, 68]. This approach,
which handles multiple video coding standards by changing the software or
firmware, suffers from large power consumption and lack of performance. A dual-
core DSP operating at 216 MHz [67] offers up to SD video, and an eight-core media
processor operating at 324 MHz [68] supports high definition (HD, i.e., 1,280 pixels
by 720 lines) at most. The second approach aims to develop dedicated video coding
hardware. While dedicated circuits can minimize power consumption, the dedicated
encoders and decoders described in previous reports [70–73] have difficulty in per-
forming all of the media processing that is indispensable for an embedded device
such as a modern smart phone [28, 67–69]. In addition, few of these video codecs
can handle video streams at more than 20 megabits per second (Mbps), so they have
difficulty in supporting full HD high-quality video.
In response to these issues, a video processing unit (VPU) has been designed based
on a heterogeneous multicore processor in order to achieve both high performance

and low power consumption with multiple video formats. In full HD video processing,
dynamic current is still a dominant form of power consumption in low-power CMOS
technology. Therefore, the focus was on achieving lower dynamic power in the video
codec design using video signal processing characteristics.
Subsection 3.4.2 describes an overview of the video codec architecture. A two-
domain (stream-rate and pixel-rate) processing approach raises the performance of
both stream and image processing units for a given operating frequency. In the
image-processing unit, a sophisticated dual macroblock-level pipeline processing
with a shift-register-based ring bus is introduced. This circuit is simple yet provides
high throughput and a reasonable latency for video coding. Subsection 3.4.3
describes the stream processor and media processor architecture. The media proces-
sor is applied to transformations, subpixel motion compensation, and an in-loop
deblocking filter. Including the single stream processor, a total of seven application-
specific processors are integrated on the proposed video codec. Subsection 3.4.4
discusses the results of implementing the VPU from the viewpoints of performance
and power consumption. Subsection 3.4.5 concludes with a brief summary.

3.4.2 Video Codec Architecture

3.4.2.1 Architecture Model

Figure 3.75 shows the basic architecture of the VPU based on a heterogeneous mul-
ticore approach, the concept of which is the same as the heterogeneous multicore
chip for embedded systems described in Chap. 2. To satisfy both the high-performance
and low-power requirements for advanced embedded systems with greater flexibility,
it is necessary to develop parallel processing on a video processing unit by taking
advantage of the data dependency in video coding process.
Several low-power special-purpose processor (SPP) cores, several high-performance
application-specific hard-wired circuits (HWC), shared memory, and a global data
transfer unit (DTU) are embedded on a VPU. There are two types of SPPs, a stream
processor and a media processor. Each processing core includes local memories
(LM) and a local DTU. These are embedded in the processing core to achieve paral-
lel execution of internal operation in the core and data transfer operations between
cores and memories. Each core processes the data on its LM, and the DTU simulta-
neously executes memory-to-memory data transfer between cores, shared memory,
or off-chip memory via a global DTU. The dynamic clock controller (DCC), which
is connected to each core, controls the clock supply of each core independently and
reduces the dynamic power consumption of the VPU. The shared memory is a
middle-sized on-chip memory which is used as a line buffer in vertical deblocking
processing or as a reference image buffer for motion estimation/compensation. Each
core is connected to the on-chip interconnect called the shift-register-based bus
(SBUS), which is suitable for block-level pipeline processing. Frequency and voltage
control (FVC) is applied to the top level of the video processing unit only.
Fig. 3.75 Architecture model of video processing unit (VPU): special-purpose processor cores SPPa #0–#m and SPPb #0–#n and hard-wired circuits HWC #0–#k, each containing local memories (LM), a local DTU, and a dynamic clock controller (DCC), are connected by the shift-register-based bus (SBUS) together with the shared memory and the global DTU; CPU #0 and off-chip memory are reached through the on-chip interconnect, with frequency and voltage control (FVC) applied at the top level

When a program is executed on the heterogeneous multicore video processing unit, the work is divided into two structural units: a frame of a picture and a macroblock. The
macroblock is a video compression component whose size is fixed at 16 × 16 pixels
in modern video coding standards. Each macroblock contains four luminance blocks
(Y), one blue color difference (Cb) block, and one red color difference (Cr) block in
a 4:2:0 YCbCr format. Macroblocks can be subdivided further into smaller blocks
called partitions. H.264, for example, supports block sizes as small as 4 × 4.
Each video component is executed in the most suitable processor core in parallel
as shown in Fig. 3.75. Each core processes the data on its LM, and the DTU simul-
taneously executes memory–memory transfer. In the parallel operation, there are
time slots when the corresponding cores do not need to process or transfer data. During these time slots, each such core directs its connected DCC to stop the clocks automatically. This control eliminates redundant power consumption in a core, resulting in lower power consumption for the heterogeneous multicore chip.

3.4.2.2 Stream Domain and Image Domain Processing

Figure 3.76 is a block diagram of the video processing unit, which is a heteroge-
neous multicore processing unit that applies our architecture model shown in
Fig. 3.75.
The architecture consists of a stream-rate domain and a pixel-rate domain [74].
These units operate independently in a picture-level pipeline manner to achieve full
HD performance while lowering the operating frequency. At a given time, this video
codec performs either encoding or decoding.

Fig. 3.76 Block diagram of video processing unit. The stream-rate domain contains the stream processing unit (a stream processor with symbol codecs #0 and #1 and a CABAC accelerator); the pixel-rate domain contains two image processing units (n = 2), each with a coarse motion estimator (CME), transformer (TRF), fine motion estimator/compensator (FME), and deblocking filter (DEB) implemented as programmable image processing elements (PIPE), plus a shared line memory (L-MEM). All are connected by the shift-register-based bus (SBUS) and, via the global DMAC and on-chip interconnect, to CPU #0, media IPs, and off-chip memory; the stream-rate and pixel-rate domains exchange the intermediate stream via the global DMAC

In decoding mode, the stream processing
unit (SPU) reads bit streams from off-chip memory and outputs a transformed inter-
mediate stream. The image processing units (IPU) read the intermediate streams
produced by the stream processing unit and generate the final decoded image.
The space for the intermediate streams in the off-chip memory serves as a buffer
between the stream-rate domain and the pixel-rate domain. Variable-length coding
inherently lacks fixed processing times. CABAC times have particularly large varia-
tion. Up to 384 symbols of transform coefficients are definable in a macroblock, but
the maximum number of bits changes according to the probability of a syntactic
element in the given context. If the stream processing unit takes more time to process
a frame than is available at the frame rate, the operating frequency must be raised.
Figure 3.77 shows an example of the decoding time and the number of bits for
each picture in an H.264 40-Mbps video stream. As the figure shows, when the
number of bits in the pictures around picture #30 is large, the stream-rate domain’s
decoding time is longer than that of the pixel-rate domain. When the number of bits
assigned to the pictures around picture #5 is small, the stream-rate domain’s decoding
time is shorter than that of the pixel-rate domain.
The intermediate stream buffer fills the performance gap between the stream
processing unit and the image processing unit in the picture-level pipeline.
Figure 3.78 is the stream and pixel decoding time chart in the picture-level pipeline.
The time slot is defined as the decoding time of image processing in the pixel-rate
domain.

Fig. 3.77 Stream processing unit's decoding time for an H.264 40-Mbps full HD video stream running at 30 frames per second (fps): the per-picture decoding time (ms) of the stream-rate domain tracks the number of bits per picture, crossing above and below the 33.3-ms decoding time of the image processing unit in the pixel-rate domain

Fig. 3.78 Parallel operation in picture-level pipeline in stream-rate domain and pixel-rate domain: (a) without intermediate stream buffering, stream decoding Sn and pixel decoding Pn are chained within each time slot, so a long S2 delays the start of P2; (b) with intermediate stream buffering, Sn runs ahead of the time slots and no delay occurs (Sn: stream decoding for picture n; Pn: pixel decoding for picture n)

Except for stream decoding S0 for Picture 0, stream and pixel decoding are
processed in parallel. In Fig. 3.78a, that is, without an intermediate stream buffer, if
the decoding time of S2 is larger than the decoding time of P1, it causes a delay in
the start of P2. To shorten the S2 decoding time, we can increase the operating

frequency. However, this increases the power consumption of the stream processing
unit in the stream-rate domain. To meet the performance requirements without
increasing the operating frequency, we introduce an intermediate stream buffer.
By using the intermediate stream buffering depicted in Fig. 3.78b, the outputs of S0
and S1 are stored in the intermediate stream buffer. As this time chart shows, S1 and
S2 can start processing independently of the time slot, and S2 is finished before the
end of P1. Therefore, the start of P2 is not delayed beyond the defined time slot; in other words, with intermediate stream buffering, each picture can start its pixel decoding at its own time slot. Thus, the two-domain structure with the intermediate stream buffer can handle all pictures at the average frequency, which helps to keep the required operating frequency, and hence the power consumption, low.
The intermediate stream format has two segments, one in fixed-length and the
other in variable-length coding, and the two parts are processed per symbol (not
per bit). The fixed-length part consists of information on the macroblocks, including
the slice boundaries, coded block pattern, quantization scale parameter, and several
other items. The variable-length part of the intermediate stream contains the other
syntax elements (motion vectors and transform coefficients) in exponential-Golomb
coding, which is a common, simple, and highly structured technique.
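For concreteness, unsigned exponential-Golomb coding can be sketched in a few lines of C. This is a generic illustration of the coding scheme, not the VPU firmware's implementation:

```c
#include <stdint.h>
#include <stddef.h>

/* Minimal unsigned exponential-Golomb (ue(v)) codec over a bit buffer. */
typedef struct {
    uint8_t *buf;
    size_t   bitpos;   /* next bit index, MSB-first within each byte */
} BitStream;

static void put_bit(BitStream *bs, int b) {
    size_t byte = bs->bitpos >> 3, off = 7 - (bs->bitpos & 7);
    if (b) bs->buf[byte] |=  (uint8_t)(1u << off);
    else   bs->buf[byte] &= (uint8_t)~(1u << off);
    bs->bitpos++;
}

static int get_bit(BitStream *bs) {
    size_t byte = bs->bitpos >> 3, off = 7 - (bs->bitpos & 7);
    bs->bitpos++;
    return (bs->buf[byte] >> off) & 1;
}

/* Encode v as: n leading zeros, a 1, then the remaining n bits of (v+1). */
void ue_encode(BitStream *bs, uint32_t v) {
    uint32_t code = v + 1;
    int n = 0;                                /* n = floor(log2(v+1)) */
    for (uint32_t t = code; t > 1; t >>= 1) n++;
    for (int i = 0; i < n; i++) put_bit(bs, 0);
    for (int i = n; i >= 0; i--) put_bit(bs, (int)((code >> i) & 1));
}

uint32_t ue_decode(BitStream *bs) {
    int n = 0;
    while (get_bit(bs) == 0) n++;             /* count leading zeros */
    uint32_t code = 1;
    for (int i = 0; i < n; i++) code = (code << 1) | (uint32_t)get_bit(bs);
    return code - 1;
}
```

For example, the values 0, 1, 2 encode to the bit strings 1, 010, and 011, which is what makes the format simple and highly structured to parse.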
We evaluated the memory bandwidth between the stream and pixel domains.
Although access to the intermediate stream by the stream processing unit and image
processing units takes the form of access to the external synchronous DRAM
(SDRAM), the required memory bandwidth is less than would be required for the
conventional method (directly applying 16-bit-per-pixel transform coefficients).
Figure 3.79 plots the compression ratio of the intermediate stream relative to the
original stream for individual pictures of the H.264 conformance-test streams [75],
other than those for I_PCM. The compression ratios are around 1.6 and 1.5 for CABAC
and CAVLC, respectively. Although a portion of the intermediate stream is in fixed-
length coding, the coding efficiency was within 1.6 in the case of CABAC. The com-
pression effect of the intermediate stream relative to the conventional method
corresponds to a 95% reduction in required memory capacity and memory bandwidth
for the processing of a 40-Mbps full HD stream (64 Mbps for the intermediate stream
and 90 Mpixels/s for the transform coefficients). Table 3.19 lists the bandwidths of all
DMA channels in the video decoding process. The ratio of bandwidth for the interme-
diate stream buffer is only 4.8% and is small even in the worst case. Therefore, the use
of a stream buffer has only a small impact on power consumption.

Fig. 3.79 Bit-rate increase of the intermediate stream: intermediate stream bit rate versus original stream bit rate (Mbps) for individual pictures, with linear fits (a) y = 1.5871x + 0.0045 for CABAC and (b) y = 1.4946x + 0.0093 for CAVLC

Table 3.19 Bandwidths used in video processing unit

DMA | Bandwidth (MByte/s) | Note
Reference image (read) | 180–806 | 1,920 × 1,088 × 30 × 1.5 × 2~
Reference image (write) | 90 | 1,920 × 1,088 × 30 × 1.5
Parameters of macroblock | 45 | Motion vector, etc.
Stream (read) | 5 | 40 Mbps
Intermediate stream (read + write) | 16 | Stream × 1.6 × 2 ch
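The 95% figure quoted above can be cross-checked with a worked calculation, assuming 16-bit transform coefficients as in the conventional method:

\[
40\ \text{Mbps} \times 1.6 = 64\ \text{Mbps (intermediate stream)};\qquad
90\ \text{Mpixels/s} \times 16\ \text{bits} = 1{,}440\ \text{Mbps (coefficients)};
\]
\[
\frac{64}{1{,}440} \approx 4.4\%,
\]

that is, roughly a 95% reduction in the memory capacity and bandwidth needed between the two domains.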

3.4.2.3 Shift-Register-Based Bus Network and Macroblock-Level Pipeline Processing

As shown in Fig. 3.76, all submodules of the video codec are connected in a ring
structure by a bidirectional 64-bit shift-register-based bus (SBUS). Figure 3.80
shows the architecture of the SBUS and the data flow in the macroblock-pipeline
stages. The clockwise SBUS is the path for data readout from the external SDRAM.
The counterclockwise SBUS is used for intermodule data transfer to the next stage

rightward of the macroblock pipeline. This bidirectional SBUS efficiently enables


high throughput. Data are transferred by simply being shifted through shift register slots (SRS) along the SBUS. Each SRS is assigned identification data (ID). Target
IDs are shifted along the SBUS with address, data, and enable signals, and an indi-
vidual module takes the data set into local memory when the target ID matches its
own ID, at which point the flow of that data set is terminated (not transferred to the
next module). Since there is a path traveling through the modules in sequence in
each direction, this bus architecture does not require arbitration. When the destina-
tion is not the next module to the left or right, the latency is simply proportional to
108 3 Processor Cores

(4)
SPP/HWC SPP or HWC SPP/HWC Global SD
(PSn-1) Pipeline stage (PSn) (PSn+1) DMAC RAM

Counter- (1) (2) (3) (0)


clockwise
Decoder

Clockwise

Shift register Shift register slot (SRS)

(0) DMA read a b c d

(1) PS n-1 a b c d

(2) PS n a b c d

(3) PS n+1 a b c d

(4) DMA write a b c d

Fig. 3.80 Shift-register-based bus network and depiction of how it works in macroblock-level
pipeline processing

the number of stages from the source module to the destination module. For a video
coding process, however, the major form of data transfer will be to the next stages
of the macroblock pipeline. Transactions between individual modules and the line
memory (L-MEM) constitute the only exception, but we avoid this problem by
scheduling this in a time slot taking up the first few tens of clock cycles before the
processing of each macroblock begins. This keeps the latency of the SBUS from
affecting the performance of the codec. The SBUS architecture provides an easy
way to connect an additional image processing unit for larger screens or a higher
frame rate without having to increase bandwidth, as would be required with a
conventional bus. The SBUS thus provides excellent video-size scalability.

Fig. 3.80 Shift-register-based bus network and how it works in macroblock-level pipeline processing: data items a–d advance one pipeline stage per time slot, flowing from DMA read through stages PSn−1, PSn, and PSn+1 to DMA write via the shift register slots (SRS) of the clockwise and counterclockwise rings
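The slot-forwarding rule is simple enough to capture in a toy software model (hypothetical types and ring size; the real SBUS is a 64-bit hardware ring): each occupied slot either delivers its payload when the target ID matches the attached module or shifts one position onward.

```c
#include <stdint.h>
#include <string.h>

#define N_SLOTS 8

typedef struct {
    int      valid;
    uint8_t  target_id;   /* ID of the destination module */
    uint32_t addr;
    uint64_t data;
} Slot;

typedef struct {
    Slot slot[N_SLOTS];   /* slot[i] sits in front of module with ID i */
} Sbus;

/* One clock tick of one ring direction: deliver on an ID match
 * (terminating the transfer), otherwise shift one position onward.
 * No arbitration is needed because every slot moves independently. */
void sbus_tick(Sbus *bus, void (*deliver)(uint8_t id, uint32_t a, uint64_t d))
{
    Slot next[N_SLOTS];
    memset(next, 0, sizeof(next));
    for (int i = 0; i < N_SLOTS; i++) {
        Slot s = bus->slot[i];
        if (!s.valid) continue;
        if (s.target_id == (uint8_t)i)
            deliver((uint8_t)i, s.addr, s.data);  /* data set terminates here */
        else
            next[(i + 1) % N_SLOTS] = s;          /* shift toward destination */
    }
    memcpy(bus->slot, next, sizeof(next));
}
```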
The two image processing units work cooperatively as two macroblock-based
pipelines. Processing proceeds as shown in Fig. 3.81. Most state-of-the-art video-
coding standards, including H.264, utilize context correlation between adjacent
macroblocks. For example, macroblock X is coded by using the context information
from macroblocks A, B, C, or D in Fig. 3.81. We can take advantage of this charac-
teristic in the sophisticated dual macroblock-pipeline architecture. The delay and
parallelization for the two image processing units (IPU #0, #1) that handle the
respective pipelines are controlled accordingly. As shown in Fig. 3.82, context
information processed by IPU #1 is directly transferred to IPU #0, and context
information from IPU #0 is transferred to L-MEM. The two macroblock lines share
L-MEM. This halves the requirement for L-MEM to store context information.
Fig. 3.81 Dual macroblock-level pipeline processing: macroblock X is coded using context information from adjacent macroblocks A, B, C, and D; IPU #0 and IPU #1 handle adjacent macroblock lines with a controlled delay

Fig. 3.82 Configuration of image processing unit and data flows: with 1, 2, or n IPUs, each IPU's SPPa #0, SPPb #0, and HWC #0–#n modules connect through shift register slots (SRS) on the SBUS, and the IPUs share the single line memory (L-MEM) for context data

To ensure real-time processing, we specified 1,200 as the upper bound on the


number of clock cycles for processing each macroblock. The value 1,200 ensures
that the overall image processing unit is capable of handling full HD operations at
less than 162 MHz. By varying the combinations of clock frequency and the number
of image processing units, our approach provides reasonable scalability across the
range from SD to full HD and even larger screen sizes while still only requiring
the single shared-line memory. Figure 3.82 depicts examples of IPU configurations
and data flows between the IPUs and L-MEM along with the SBUS. For example,
an LSI with a single IPU is a good option for applications that require the handling
of SD video and is capable of doing so at an operating frequency of 54 MHz.
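Both frequency figures follow from a short cycle-budget calculation (assuming 16 × 16 macroblocks, 30 fps, and the two parallel IPUs of Fig. 3.76 for full HD):

\[
\frac{1{,}920}{16} \times \frac{1{,}088}{16} = 8{,}160\ \text{macroblocks/frame};\quad
\frac{8{,}160}{2\ \text{IPUs}} \times 30\ \text{fps} \times 1{,}200\ \text{cycles} \approx 147\ \text{M cycles/s} < 162\ \text{MHz}.
\]

For SD on a single IPU:

\[
\frac{720}{16} \times \frac{480}{16} = 1{,}350\ \text{macroblocks/frame};\quad
1{,}350 \times 30 \times 1{,}200 = 48.6\ \text{M cycles/s} < 54\ \text{MHz}.
\]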

3.4.2.4 Hierarchical Power Management

Figure 3.83 shows how the clock domains are divided at the submodule level. The
power management architecture consists of multiple domains, which are controlled
independently using video codec processing characteristics. A new layer of control
is set up between the static module stop controlled by software and the bit-grained
clock gating. We defined clock domains for each submodule of the video codec.
Each clock domain corresponds to a macroblock processing pipeline stage and is
also a unit of hierarchical synthesis and layout. The clock signals in the respective
domains are switched in accordance with the time reference defined by the macroblock processing.

Fig. 3.83 Hierarchical clock control for full HD video processing: a PLL feeds per-module clock drivers under software-controlled static module stop, and per-submodule clock drivers under dynamic, macroblock-level clock start/stop requests from the signal-processing registers
Figure 3.84 illustrates the dynamic power management in the pipeline for
macroblock processing. Data are processed in a macroblock-based pipeline manner.
The time reference is defined as 1,200 clock cycles at 162 MHz. Variable-length coding
inherently lacks a fixed processing time. The clock supply for each block is inde-
pendently cut off as soon as it finishes its required processing. This scheme reduces
the amount of power consumed by the clock drivers. Dynamic power consumption
is reduced by 27% by using this technique [76].

Fig. 3.84 Dynamic power control in macroblock-level pipeline processing: in each time slot, the pipeline stages (DMA read, coarse ME, fine ME, transform, symbol coding, deblocking, and DMA write) process successive macroblocks, and the clock supply to each stage is stopped as soon as its macroblock finishes
3.4 Video Processing Unit 111

3.4.2.5 Memory Management

The DMA read and DMA write depicted in Fig. 3.84 could cause a delay in the
macroblock-level pipeline processing. To prevent this, we have to improve the efficiency
of the image-data transfer especially in the reference-image read. To achieve an
efficient 2D data transfer, an address transformation scheme is introduced in the mem-
ory management unit for VPU and other media IPs in order to avoid a page miss in the
external SDRAM.
Most video codec standards require small, submacroblock level 2D data transfer
for reference reads in the decoding mode. Without using any particular techniques
to achieve such transfers, this will result in a page miss at every line. The penalty for
a page miss, which is around ten cycles or more, consumes a high proportion of the memory bandwidth. Efficiency in the 2D data transfer is thus critically important.
With embedded systems such as mobile applications, in which various kinds of
software are executed, it is not feasible to adopt a particular form of memory alloca-
tion such as using a bank-interleave operation for each pixel line.
To avoid page misses, tile-linear address translation (TLAT) [76] is introduced
between the video codec and the on-chip interconnect. Figure 3.85a shows the
TLAT circuits and memory allocation in the virtual address (VADR) and physical
address (PADR) space. The lower-order bits of the VADR issued by the video codec
are rearranged into the corresponding PADR. As shown in Fig. 3.85b, 32 × 32 tile
access from the video codec is mapped to linear addressing in the PADR space.
When the lower address of the VADR is defined as VADR[m:0], the PADR is described as follows:

PADR[m : TB+VB+HB] = VADR[m : TB+VB+HB]
PADR[TB+VB+HB−1 : TB+VB] = VADR[TB+HB−1 : TB]
PADR[TB+VB−1 : TB] = VADR[TB+HB+VB−1 : TB+HB]
PADR[TB−1 : 0] = VADR[TB−1 : 0]

In these equations, TB, HB, and VB are given by

TB = log2(Blk_h), HB = log2(Stride) − TB, VB = log2(Blk_v),

where Stride, Blk_h, and Blk_v must each be a power of two.
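The bit rearrangement above is straightforward to express in software. The following sketch is illustrative only (the real TLAT is a hardware circuit between the codec and the interconnect); it implements exactly the field swap given by the equations:

```c
#include <stdint.h>

/* Software model of the TLAT bit rearrangement. Parameters follow the
 * text: TB = log2(Blk_h), VB = log2(Blk_v), HB = log2(Stride) - TB,
 * with Stride, Blk_h, and Blk_v all powers of two. */
static inline uint32_t tlat(uint32_t vadr, unsigned TB, unsigned VB, unsigned HB)
{
    uint32_t tile   =  vadr & ((1u << TB) - 1);               /* VADR[TB-1:0]           */
    uint32_t hfield = (vadr >> TB) & ((1u << HB) - 1);        /* VADR[TB+HB-1:TB]       */
    uint32_t vfield = (vadr >> (TB + HB)) & ((1u << VB) - 1); /* VADR[TB+HB+VB-1:TB+HB] */
    uint32_t upper  =  vadr >> (TB + HB + VB);                /* VADR[m:TB+VB+HB]       */

    /* Swap the HB and VB fields: the PADR keeps the tile offset, places
     * the VB field just above it, and the HB field above that. */
    return (upper  << (TB + VB + HB))
         | (hfield << (TB + VB))
         | (vfield << TB)
         | tile;
}

/* Example: a 2,048-byte stride with a 32-byte x 32-line tile gives
 * TB = 5, VB = 5, HB = 6, matching Fig. 3.85. */
```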
With this address translation scheme, codec performance improved by a maximum of 47% for bipredictive prediction pictures (B-pictures), and power consumption in the video codec core was reduced by 16% [76]. This scheme is also well suited for image rotation and block-based filter processing.

Fig. 3.85 Tile-linear address translation (TLAT): (a) tile-linear address translation circuits between the media IPs/VPU and the on-chip interconnect, with a 16-entry PMB for address-space judgment over 16-MB pages and the VADR-to-PADR field rearrangement (VB5, HB6, TB5 for a 2,048-byte stride); (b) memory allocation in virtual (tile-based addressing, 32-byte × 32-line tiles) and physical (linear addressing) address space

3.4.3 Processor Elements

To provide flexibility for handling multiple video coding standards, a stream processor
and six media processors are implemented in the video processing unit. Two fine

motion estimators/motion compensators (FME), two transformers (TRF), and two


in-loop deblocking filters (DEB), which are depicted in Fig. 3.76, are implemented
as low-power media processors called programmable image processing elements
(PIPE).
Fig. 3.86 Stream-processing unit architecture: the two-way VLIW stream processor (with instruction, data, and table memories) and the CABAC accelerator with 3,220-bit context flip-flops are connected to the internal bus, allowing them to access the firmware, table data, initial parameters, and the intermediate stream buffer in an external memory via the global DMAC and on-chip interconnect

3.4.3.1 Stream Processor

Figure 3.86 shows the stream processing unit architecture, which consists of a two-
way very long instruction word (VLIW) stream processor and an H.264 context-
adaptive binary arithmetic coding (CABAC) accelerator with 3,220-bit context
flip-flops. These parts and a common SBUS interface are connected to the internal
bus so that each part can access the intermediate stream buffer in an external
SDRAM via the global DMAC. The stream processing unit can support various
video coding standards by changing the firmware, which consists of the decoder or
encoder program and the table data. The video codec loads the firmware from the
external SDRAM to the stream processing unit’s internal memories before the unit
starts decoding or encoding video streams. The program in the firmware is loaded
to the stream processor’s instruction memory, and the table data are loaded to the
table memory. The CABAC accelerator, which the stream processor controls,
includes context flip-flops to achieve high performance.
Figure 3.87 shows the architecture of the proposed stream processor (STX). We
employ the 32-bit 2-way VLIW, 3-stage pipelined architecture as the stream proces-
sor architecture [77].
Stream encoding/decoding is divided into variable length coding, syntax analysis,
and context calculation. Variable length coding is further divided into coding/
decoding with table data (table encoding/decoding) and Golomb encoding/decoding,
which is employed in H.264. Table encoding/decoding in various video coding stan-
dards can be easily developed by changing the data in the table memory in the STX.
Fig. 3.87 Stream processor architecture: a 32-bit, 2-way, 3-stage pipeline with instruction fetch, two instruction decoders, a load/store unit and ALU (execution 0), a variable-length coding unit (Golomb enc/dec and table lookup) with a second ALU (execution 1), and a partitioned register file of 4 bits × 64 entries (type 1), 16 bits × 32 entries (type 2), and 32 bits × 32 entries (type 3) with four 32-bit read ports and two 32-bit write ports

Also, the Golomb encoding/decoding process does not change with each video
coding standard. Therefore, we developed the variable-length coding unit in the
STX as dedicated variable-length coding hardware. By contrast, syntax
analysis and context calculation have complicated data flows, and they vary with
each video coding standard. Thus, they are implemented into the firmware for each
standard. These processes also have a lot of branch operations. In general, VLIW
architecture is not good at handling branch operations, and branch-stall cycles
increase in proportion to the number of pipeline stages. Thus, the number of stages
in the STX is reduced to as few as possible.
The STX also has an out-of-order execution feature. If the instruction decoder in
the STX judges that there is no data dependency with a variable-length coding
instruction and the following instructions, then the pipeline executes the next
instruction, even though the execution of the variable-length coding instruction is
not finished. This feature enables the symbol-level processing of variable length
coding and syntax analysis/context calculation to be pipelined. This pipeline pro-
cessing is effective for improving the performance in processing stream data that
have large bit rates and include a lot of residual data.
When calculating the context for a symbol in a video stream, various previously
decoded symbols are required. For efficient access to these symbols, they are located
in the register file. Before designing the STX, we estimated the number of entries
required in the register file from specifications of several video coding standards.
Based on this estimation, it was determined that 128 entries were sufficient for
storing previously decoded symbols while encoding or decoding various video
streams. However, 128 entries × 32 bits (4,096 bits) of flip-flops require large hardware.

To reduce the number of flip-flops, the symbols are categorized into three types by
bit width: type 1 (1–4 bits), type 2 (4–16 bits), and type 3 (16–32 bits). As a result,
the number of entries belonging to type 1 is about 1.8 times larger than that belong-
ing to the other categories. Based on this result, our register file architecture consists
of three partitions: 64 type 1 entries (4 bits), 32 type 2 entries (16 bits), and 32 type
3 entries (32 bits) as shown in Fig. 3.87. Compared with a 32-bit nonpartitioned
register file, a 57% reduction in the number of flip-flops is achieved.
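The reduction can be checked directly (a worked calculation from the partition sizes above):

\[
4 \times 64 + 16 \times 32 + 32 \times 32 = 256 + 512 + 1{,}024 = 1{,}792\ \text{bits},
\]

compared with \(128 \times 32 = 4{,}096\) bits for a nonpartitioned file, i.e., \(1 - 1{,}792/4{,}096 \approx 56\%\), consistent with the reported 57% reduction (the exact figure depends on implementation details).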
The CABAC accelerator achieves a performance of two cycles per bit of the bin
string (an intermediate binary representation of the syntax elements), which corre-
sponds to three cycles per bit of the stream. This is assuming that the compression
rate for the arithmetic coding is 1.5 and that single-cycle-access flip-flops are used
to update the context information. Taking the several cycles of processing overhead
into account, the performance is 40 Mbps at 162-MHz operation.
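As a rough consistency check of that figure, using only the cycle counts quoted above:

\[
\frac{162\ \text{MHz}}{3\ \text{cycles per stream bit}} = 54\ \text{Mbps peak},
\]

which leaves headroom above the sustained 40 Mbps once the per-symbol processing overhead is included.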

3.4.3.2 Programmable Image Processing Element

To provide flexibility for handling multiple video standards, the following six
submodules of the image processing units are implemented as low-power media
processors [74]: two fine motion estimators/motion compensators (FME), two
transformers (TRF), and two in-loop deblocking filters (DEB). These modules are
shown in Fig. 3.76.
Figure 3.88 is a block diagram of the programmable image processing element
(PIPE). The PIPE is a tightly coupled multiprocessing unit (PU) system which con-
sists of three PUs (the loading PU, media PU, and storage PU), a local data memory,
and a shared instruction memory. Each PIPE is capable of simultaneously loading

data, performing image processing, and storing data. Arrays of data are specifiable
as the operands for several single instructions of the PUs, so they are capable of
handling multiple horizontal data as vectors. This aspect of the PIPE can reduce the
number of cycles required for operations such as pre-/post-transposition processing,
as well as the code size and instruction fetches. Reducing instruction fetches from
the shared instruction memory reduces power consumption. Overall, the PIPE
improves the efficiency with which 2D data and instructions are supplied.

Fig. 3.88 Architecture of programmable image processing element (PIPE): three PUs (a loading PU, a media PU, and a storing PU), each with registers and an ALU, share an instruction memory and a local data memory; a local DMAC connects the PIPE to the shift-register-based bus, and processing is pipelined at the sub-macroblock (4 × 4) level with bit extension and transposition on load and bit rounding and transposition on store
A single instruction multiple data (SIMD) architecture performs the same operation on multiple data simultaneously using multiple processing elements. In general, SIMD handles only a single pair of source vectors per instruction, taken from horizontally adjacent pixels of an image. A major cause of performance degradation in 2D image processing is that one instruction can handle only this single set of source data. To solve this issue, 2D vector data are specified in a single instruction.
Figure 3.89a shows a single instruction with arrayed data (SIAD) instruction
format. The width and count fields specify multiple source data as multiple vector data.

The hardware controls source and destination register pointers with multiple cycles.
This architectural concept provides parallelism for vertical data. Figure 3.89b shows a
basic SIAD ALU structure. This dataflow goes through mapping logics, multipliers,
sigma adders, and barrel shifters in a pipeline. Each data path is similar to the general
SIMD structure, but the total structure differs in how source data are supplied.

Fig. 3.89 SIAD processor architecture and instruction format: (a) an example SIAD instruction format with sync, source/destination, opcode, count, width, and pitch fields controlling arrayed transfers between the loading PU and the media PU registers; (b) the SIAD ALU structure, in which a shifter/extender, multipliers, adders, and barrel shifters are fed from the register file under a common decoder
Each PIPE also has a local DMA controller for communication with the other
PIPE modules and with the hard-wired modules (e.g., coarse motion estimator,
symbol coder). Connecting multiple PIPEs in series to form the macroblock-based
pipeline modules provides strong parallel computing performance and scalability
for the video codec (as described in Fig. 3.82).
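The SIAD idea can be pictured in software as one "instruction" whose operands are strided 2D arrays. The sketch below is a hypothetical helper, not the PIPE instruction set; it applies one ALU operation across count rows of width elements, stepping the source and destination pointers by pitch, so vertical parallelism costs no extra instruction fetches:

```c
#include <stdint.h>
#include <stddef.h>

typedef int16_t pel;   /* pixel-sized sample, assumed 16-bit here */

/* One SIAD-style operation: the width/count/pitch fields describe a 2D
 * array, and the hardware (modeled by the loops) advances the register
 * pointers over multiple cycles so a single issued instruction covers
 * the whole array. */
void siad_add(pel *dst, const pel *src1, const pel *src2,
              unsigned width, unsigned count, size_t pitch)
{
    for (unsigned row = 0; row < count; row++) {
        for (unsigned i = 0; i < width; i++)
            dst[i] = (pel)(src1[i] + src2[i]);   /* one ALU op per element */
        dst  += pitch;                           /* pointer stepping that  */
        src1 += pitch;                           /* the PIPE hardware does */
        src2 += pitch;                           /* automatically          */
    }
}
```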

3.4.4 Implementation Results

Figure 3.90 shows the performance and instruction-fetching efficiency of PIPEs


acting as the TRF, FME, and DEB modules of an image processing unit. As the
figure indicates, the average time for fetching from the shared instruction memory is
around 6% (FME processing for H.264 encoding) to 19% (TRF processing for H.264
encoding) of the PU processing cycles. Each PU fetches an instruction every 5–16
cycles. This corresponds to 18–58% of macroblock-processing cycles by the three PUs.
This helps to achieve lower power consumption than would be the case for a typical
RISC processor, which would basically fetch an instruction every cycle. Note that
Fig. 3.90 also indicates that the average number of cycles to process a macroblock is

less than 1,200 in H.264 encoding and less than 1,000 in H.264 decoding. In addition to the H.264 processing, the average macroblock-processing cycle for MPEG-2, MPEG-4, and VC-1 is less than 1,200, which means the video codec is capable of full HD real-time processing at an operating frequency of 162 MHz.

Fig. 3.90 Evaluation of performance and efficiency in instruction fetching of PIPE acting as modules of the image processing unit in H.264 video processing: processing cycles per macroblock, instruction-fetch cycles of the three PUs, and the ratio of fetch cycles to overall cycles per PU for the TRF (transform/quantization), FME (motion estimation/compensation), and DEB (deblocking filter) stages across I-, P-, and B-pictures in encoding and decoding
Table 3.20 lists specifications of the video codec and the measured results for power consumption in the processing of full HD video at 30 fps. With 45-nm CMOS technology, the codec consumed 162 mW in encoding and 95 mW in decoding of H.264 High Profile at 1.10 V at room temperature. Figure 3.91 is a micrograph of the test chip, which is overlaid with the layout of the video processing unit.

Table 3.20 Specifications of VPU and measured power consumption

Technology | 45-nm, 8-layer, triple-Vth CMOS
Circuit size | 4.0-MGate logic and 300-kB SRAM
Supply voltage | 1.1 V (1.0–1.2 V)
Clock frequency | 162 MHz
Performance | 1,920 × 1,080 × 30 fps, 40 Mbps

Standard | Profile | Level | Decoding | Encoding
H.264 | High | 4.1 | 95 mW | 162 mW
MPEG-2 | Main | High | 70 mW | 130 mW
MPEG-4 | Advanced simple | 5 | 77 mW | 134 mW
VC-1 | Simple/Main/Advanced | Medium/High/3 | 107 mW | n/a

Fig. 3.91 Micrograph of test chip in 45-nm CMOS

3.4.5 Conclusion

A multistandard, size-scalable, low-power video codec including one stream processor and six image-processing processors has been integrated in 45-nm CMOS. With two-domain (stream-rate and pixel-rate) processing, two parallel pipelines for macroblock processing, and tile-based address-translation circuits, the video processing unit consumed 95 mW of power in real-time decoding of a full HD H.264 stream at an operating frequency of 162 MHz at 1.1 V.
The video processing unit in the test chip supports four video coding standards:
H.264, MPEG-2, MPEG-4, and VC-1. Moreover, by changing the firmware and
system software, this unit can support other coding standards such as AVS or H.263
or some proprietary video coding technologies based on a supported video coding
standard as noted above.
A successor to the H.264 standard is currently being developed by the Joint
Collaborative Team on Video Coding (JCT-VC) and will be called the High
Efficiency Video Coding (HEVC) standard [78]. HEVC aims to substantially
improve coding efficiency compared to the H.264 High Profile, that is, reduce bit-rate
requirements by half with comparable visual quality at the expense of increased
computational complexity. Thus, efficient parallel operation is strongly desired in
order to satisfy both high performance and low-power consumption for video codec
design at the architecture level.

References

1. Daniels RG (1996) A participant’s perspective. IEEE Micro 16(2):8–15


2. Gwennap L (1996) CPU technology has deep roots. Microprocessor Report 10(10):9–13
3. Nakamura H et al (1983) A Circuit Methodology for CMOS Microcomputer LSIs. ISSCC Dig
Tech Papers:134–135
4. Gelsinger PP (2001) Microprocessors for the new millennium: challenges, opportunities, and new frontiers. ISSCC Dig Tech Papers, Session 1.3
5. Weicker RP (1984) Dhrystone: a synthetic programming benchmark. Commun ACM 27(10):
1013–1030
6. Kawasaki S (1994) SH-II a low power RISC microprocessor for consumer applications. HOT
Chips VI:79–103
7. Hasegawa A et al (1995) SH-3: high code density, low power. IEEE Micro 15(6):11–19
8. Arakawa F, et al (1997) SH4 RISC Multimedia Microprocessor. HOT Chips IX Symposium
Record:165–176
9. Nishii O, et al (1998) A 200 MHz 1.2 W 1.4GFLOPS microprocessor with graphic operation
unit. ISSCC Dig Tech Papers:288–289, 447
10. Arakawa F et al (1998) SH4 RISC multimedia microprocessor. IEEE Micro 18(2):26–34
11. Biswas P et al (2000) SH-5: the 64 bit SuperH architecture. IEEE Micro 20(4):28–39
12. Uchiyama K et al (2001) Embedded processor core with 64-bit architecture and its system-
on-chip integration for digital consumer products. IEICE Trans Electron E84-C(2):139–149
13. Arakawa F (2001) SH-5: A First 64-bit SuperH Core with Multimedia Extension. HOT Chips
13 Conference Record
14. Arakawa F et al (2004) An embedded processor core for consumer appliances with 2.8GFLOPS
and 36 M Polygons/s FPU. ISSCC Dig Tech Papers 1:334–335, 531
15. Ozawa M, et al (2004) Pipeline Structure of SH-X Core for Achieving High Performance and
Low Power, COOL Chips VII Proceedings, vol. I:239–254
16. Arakawa F et al (2004) An embedded processor core for consumer appliances with 2.8GFLOPS
and 36 M polygons/s FPU. IEICE Trans Fundamentals E87-A(12):3068–3074

17. Arakawa F et al (2005) An exact leading non-zero detector for a floating-point unit. IEICE
Trans Electron E88-C(4):570–575
18. Arakawa F et al (2005) SH-X: an embedded processor core for consumer appliances. ACM
SIGARCH Comput Architect News 33(3):33–40
19. Kamei T, et al (2004) A resume-standby application processor for 3G cellular phones. ISSCC
Dig Tech Papers:336–337, 531
20. Ishikawa M, et al (2004) A resume-standby application processor for 3G cellular phones with
low power clock distribution and on-chip memory activation control. COOL Chips VII
Proceedings, vol. I:329–351
21. Ishikawa M et al (2005) A 4500 MIPS/W, 86 mA resume-standby, 11 mA ultra-standby appli-
cation processor for 3 G cellular phones. IEICE Trans Electron E88-C(4):528–535
22. Yamada T, et al (2005) Low-Power Design of 90-nm SuperH™ Processor Core. Proceedings
of 2005 IEEE International Conference on Computer Design (ICCD), pp 258–263
23. Arakawa F, et al (2005) SH-X2: An embedded processor core with 5.6 GFLOPS and 73 M
Polygons/s FPU, 7th Workshop on Media and Streaming Processors (MSP-7):22–28
24. Yamada T et al (2006) Reducing consuming clock power optimization of a 90 nm embedded
processor core. IEICE Trans Electron E89–C(3):287–294
25. Hattori T, et al (2006) A power management scheme controlling 20 power domains for a single-
chip mobile processor. ISSCC Dig Tech Papers, Session 29.5
26. Ito M, et al (2007) A 390 MHz single-chip application and dual-mode baseband processor in
90 nm Triple-Vt CMOS. ISSCC Dig Tech Papers, Session 15.3
27. Naruse M, et al (2008) A 65 nm single-chip application and dual-mode baseband processor
with partial clock activation and IP-MMU. ISSCC Dig Tech Papers, Session 13.3
28. Ito M et al (2009) A 65 nm single-chip application and dual-mode baseband processor with
partial clock activation and IP-MMU. IEEE J Solid-State Circuits 44(1):83–89
29. Kamei T (2006) SH-X3: Enhanced SuperH core for low-power multi-processor systems. Fall
Microprocessor Forum 2006
30. Arakawa F (2007) An embedded processor: is it ready for high-performance computing?
IWIA 2007:101–109
31. Yoshida Y, et al (2007) A 4320MIPS four-processor core SMP/AMP with Individually managed
clock frequency for low power consumption. ISSCC Dig Tech Papers, Session 5.3
32. Shibahara S, et al (2007) SH-X3: Flexible SuperH multi-core for high-performance and low-
power embedded systems. HOT CHIPS 19, Session 4, no 1
33. Nishii O, et al (2007) Design of a 90 nm 4-CPU 4320 MIPS SoC with individually managed
frequency and 2.4 GB/s multi-master on-chip interconnect. Proc 2007 A-SSCC, pp 18–21
34. Takada M, et al (2007) Performance and power evaluation of SH-X3 multi-core system. Proc
2007 A-SSCC, pp 43–46
35. Ito M, et al (2008) An 8640 MIPS SoC with independent power-off control of 8 CPUs and 8
RAMs by an automatic parallelizing compiler. ISSCC Dig Tech Papers, Session 4.5
36. Yoshida Y, et al (2008) An 8 CPU SoC with independent power-off control of CPUs and
multicore software debug function. COOL Chips XI Proceedings, Session IX, no. 1
37. Arakawa F (2008) Multicore SoC for embedded systems. International SoC Design Conference
(ISOCC) 2008, pp.I-180–I-183
38. Kido H, et al (2009) SoC for car navigation systems with a 53.3 GOPS image recognition
engine. HOT CHIPS 21, Session 6, no. 3
39. Yuyama Y, et al (2010) A 45 nm 37.3GOPS/W heterogeneous multi-core SoC. ISSCC
Dig:100–101
40. Nito T, et al (2010) A 45 nm heterogeneous multi-core SoC supporting an over 32-bits physical
address space for digital appliance. COOL Chips XIII Proceedings, Session XI, no. 1
41. Arakawa F (2011) Low power multicore for embedded systems. CMOS Emerg Technol, Session 5B, no. 1
42. Song SP et al (1994) The PowerPC 604 RISC microprocessor. IEEE Micro 14(5):8–22
43. Levitan D, et al (1995) The PowerPC 620™ microprocessor: a high performance superscalar RISC microprocessor. Compcon '95, 'Technologies for the Information Superhighway', Digest of Papers, pp 285–291

44. Edmondson JH et al (1995) Superscalar instruction execution in the 21164 alpha microprocessor.
IEEE Micro 15(2):33–43
45. Gronowski PE et al (1998) High-performance microprocessor design. IEEE J Solid-State
Circuit 33(5):676–686
46. Yeager KC (1996) The MIPS R10000 superscalar microprocessor. IEEE Micro 16(2):28–40
47. Golden M et al (1999) A seventh-generation x86 microprocessor. IEEE J Solid-State Circuit
34(11):1466–1477
48. Hinton G, et al (2001) A 0.18-µm CMOS IA-32 processor with a 4-GHz integer execution unit. IEEE J Solid-State Circuits 36(11)
49. Weicker RP (1988) Dhrystone benchmark: rationale for version 2 and measurement rules.
ACM SIGPLAN Notices 23(8):49–62
50. Kodama T, et al (2006) Flexible engine: a dynamic reconfigurable accelerator with high
performance and low power consumption, In: Proc of the IEEE Symposium on Low-Power
and High-Speed Chips (COOL Chips IX)
51. Motomura M (2002) A dynamically reconfigurable processor architecture. Microprocessor
Forum 2002, Session 4-2
52. Fujii T, et al (1999) A dynamically reconfigurable logic engine with a multi-Context/multi-mode
unified-cell architecture. Proc Intl Solid-State Circuits Conf, pp 360–361
53. Cooley JW, Tukey JW (1965) An algorithm for the machine calculation of complex Fourier series. Math Comput 19:297–301
54. Pease MC (1968) An adaptation of the fast Fourier transform for parallel processing. J ACM,
15(2)
55. Noda H et al (2007) The design and implementation of the massively parallel processor based
on the matrix architecture. IEEE J Solid-State Circuits 42(1):183–192
56. Noda H et al (2007) The circuits and robust design methodology of the massively parallel
processor based on the matrix architecture. IEEE J Solid-State Circuits 42(4):804–812
57. Kuang JB, et al (2005) A double-precision multiplier with fine-grained clock-gating support
for a first-generation CELL processor. In: IEEE Int Solid-State Circuits Conf Dig Tech Papers,
378–379
58. Flachs B, et al (2005) A streaming processor unit for a CELL processor. IEEE Int Solid-State
Circuits Conf Dig Tech Papers 134–135
59. Kyo S et al (2003) A 51.2GOPS scalable video recognition processor for intelligent cruise
control based on a linear array of 128 four-way VLIW processing elements. IEEE J Solid-State
Circuits 38(11):1992–2000
60. Hillis D (1985) The connection machine. MIT, Cambridge, MA
61. Swan RJ et al (1977) The implementation of the CM multiprocessor. Proc NCC 46:645–655
62. Amano H (1996) Parallel computers. Tokyo, Shoukoudou
63. Kurafuji T, et al (2010) A scalable massively parallel processor for real-time image processing.
IEEE Int Solid-State Circuits Conf Dig Tech Papers:334–335
64. Joint Video Team (JVT) of ISO/IEC MEPG & ITU-T VCEG, Text of International Standard
of Joint Video Specification, ITU-T Rec. H.264 | ISO/IEC 14496-10 Advanced Video Coding,
Dec. 2003
65. Richardson IEG (2003) H.264 and MPEG-4 video compression: video coding for next-generation
multimedia. Wiley, New York
66. Wiegand T et al (2003) Overview of the H.264/AVC video coding standard. IEEE Trans
Circuits Syst Video Technol 13(7):560–576
67. Shirasaki M, et al (2009) A 45 nm Single-Chip Application-and-Baseband Processor Using an
Intermittent Operation Technique. IEEE ISSCC Dig Tech Papers:156–157
68. Nomura S, et al (2008) A 9.7 mW AAC-decoding, 620 mW H.264 720p 60fps decoding,
8-core media processor with embedded forward-body-biasing and power-gating circuit in
65 nm CMOS technology. IEEE ISSCC Dig Tech Papers:262–263
69. Mair H, et al (2007) A 65-nm mobile multimedia applications processor with an adaptive
power management scheme to compensate for variations. Dig Symp VLSI Circuits:224–225
70. Chien CD et al (2007) A 252kgate/7lmW multi-standard multi-channel video decoder for high
definition video applications. IEEE ISSCC Dig Tech Papers:282–283
122 3 Processor Cores

71. Liu TM et al (2007) A 125 mW, fully scalable MPEG-2 and H.264/AVC video decoder for
mobile applications. IEEE J Solid-State Circuits 42(1):161–169
72. Lin YK, et al (2008) A 242 mW 10 mm2 1080p H.264/AVC high-profile encoder chip. IEEE
ISSCC Dig Tech Papers:314–315
73. Chen YH, et al (2008) An H.264/AVC scalable extension and high profile HDTV 1080p
encoder chip. Symp VLSI Circuits Dig:104–105
74. Iwata K et al (2009) 256 mW 40 Mbps Full-HD H.264 high-profile codec featuring a dual-
macroblock pipeline architecture in 65 nm CMOS. IEEE J Solid-State Circuits 44(4):
1184–1191
75. ITU-T, ITU-T Recommendation H.264.1, Conformance Specification for H.264 Advanced
Video Coding, 2005
76. Iwata K et al (2010) A 342 mW mobile application processor with full-hd multi-standard video
codec and tile-based address-translation circuits. IEEE J Solid-State Circuits 45(1):59–68
77. Kimura M et al (2009) A full HD multistandard video codec for mobile applications. IEEE
Micro 29(6):18–27
78. Wiegand T et al (2010) Special Section on the Joint Call for Proposals on High Efficiency
Video Coding (HEVC) Standardization. IEEE Trans Circuits Syst Video Technol 20(12):
1661–1666
Chapter 4
Chip Implementations

Three prototype multicore chips, RP-1, RP-2, and RP-X, were implemented with the highly efficient cores described in Chap. 3. The details of the chips are described in this chapter. The multicore architecture makes it possible to enhance the performance while maintaining the efficiency, but not to enhance the efficiency itself. Therefore, a multicore with inefficient cores is still inefficient, and highly efficient cores are the key components of a high-performance and highly efficient SoC. However, a multicore requires different technologies from those of a single core to maximize its capabilities. The prototype chips are useful for researching and developing such technologies and have been utilized for developing and evaluating software environments, application programs, and systems (see Chaps. 5 and 6).

4.1 Multicore SoC with Highly Efficient Cores

A multicore system on a chip (SoC) is one of the most promising approaches to achieving high performance. Formerly, frequency scaling was the best approach. However, frequency scaling has hit the power wall, and frequency enhancement is slowing down. Further, the performance of a single processor core is proportional to the square root of its area, a relation known as Pollack’s rule [1], whereas the power is roughly proportional to the area. Therefore, lower-performance processors can achieve higher power efficiency. As a result, we should make use of a multicore SoC with relatively low-performance, highly power-efficient processors.
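To make this trade-off concrete, Pollack’s rule can be stated as a rough model (an approximation, not an exact law). Splitting a die of area $A$ into $n$ smaller cores keeps the total power roughly constant while raising the ideal aggregate performance:

$$\mathrm{Perf}_{\mathrm{single}} \propto \sqrt{A}, \qquad \mathrm{Power} \propto A, \qquad \mathrm{Perf}_{\mathrm{multi}}(n) \propto n\sqrt{A/n} = \sqrt{n}\,\sqrt{A}.$$

That is, under the same area and power budget, $n$ cores ideally deliver $\sqrt{n}$ times the performance, and hence $\sqrt{n}$ times the power efficiency, provided the application parallelizes well.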
The power wall is not a problem only for high-end server systems. Embedded systems also face this problem [2]. Figure 4.1 roughly illustrates the power budgets of chips for various application categories. The horizontal and vertical axes represent performance measured in Dhrystone GIPS (DGIPS) and efficiency (DGIPS/W) on logarithmic scales, respectively. The oblique lines represent constant-power (W) lines and constant lines of the product of the performance and the power efficiency (DGIPS²/W). This product roughly indicates the attained degree of the design, and there is a trade-off relationship between power efficiency and performance.


Fig. 4.1 Power budgets of chips for various application categories (performance in DGIPS vs. efficiency in DGIPS/W, log–log; plotted regions: server/PC near 100 W, equipped devices/mobile PC near 10 W, controllers/mobile devices near 1 W, and sensors/new categories near 0.1 W)

The power of chips in the server/PC category is limited at around 100 W, and the
chips above the 100-W oblique line must be used. Similarly, the chips roughly above
the 10- or 1-W oblique line must be used for equipped devices/mobile PCs, or control-
lers/mobile devices, respectively. Further, some sensors must use the chips above
the 0.1-W oblique line, and new categories may grow from this region. Consequently, we must develop high-DGIPS²/W chips to achieve high performance under these power limitations.
Figure 4.2 maps various processors on a graph whose horizontal and vertical axes represent operating frequency (MHz) and the frequency-to-power ratio (MHz/W), respectively, on logarithmic scales. Figure 4.2 uses MHz or GHz instead of the DGIPS of Fig. 4.1 because DGIPS values are disclosed for few of the server/PC processors. Some power values include leakage current, whereas others do not; some were measured under worst-case conditions, while others were not. Although the MHz value does not directly represent the performance, and the power measurement conditions are not identical, they roughly indicate the order of performance and power. The triangles and circles represent embedded and server/PC processors, respectively. The dark gray, light gray, and white plots represent the periods up to 1998, after 2003, and in between, respectively. The GHz²/W improved roughly ten times from 1998 to 2003, but only three times from 2003 to 2008. The enhancement of single cores is apparently slowing down; instead, processor chips now typically adopt a multicore architecture.
Figure 4.3 summarizes the multicore chips presented at the International Solid-
State Circuit Conference (ISSCC) from 2005 to 2008. All the processor chips pre-
sented at ISSCC since 2005 have been multicore ones. The axes are similar to
Fig. 4.2, although the horizontal axis reflects the number of cores. Each plot at the
start and end points of an arrow represents single core and multicore, respectively.
The performance of multicore chips has continued to improve, which has compensated for the slowdown in the performance gains of single cores in both the

Fig. 4.2 Performance and efficiency of various processors (operating frequency in MHz vs. frequency-to-power ratio in MHz/W, log–log; triangles: embedded processors, circles: server/PC processors; shading indicates the period: up to 1998, 1998–2003, and after 2003)

Fig. 4.3 Some multicore chips presented at ISSCC 2005–2008 (1×, 2×, 4×, 8×, or 16× of operating frequency in MHz vs. MHz/W; each arrow runs from a single-core chip to its multicore successor)

embedded and server/PC processor categories. There are two types of multicore
chips. One type integrates multiple-chip functions into a single chip, resulting in a
multicore SoC. This integration type has been popular for more than ten years. Cell
phone SoCs have integrated various types of hardware intellectual properties
(HW-IPs), which were formerly integrated into multiple chips. For example, an
SH-Mobile G1 integrated the function of both the application and baseband proces-
sor chips [3], followed by SH-Mobile G2 [4] and SH-Mobile G3 [5, 6], which
enhanced both the application and baseband functionalities and performance. The other type increases the number of cores to meet growing performance and functionality requirements. The RP-1, RP-2, and RP-X are the prototype SoCs,

and an SH2A-DUAL [7] and an SH-Navi3 [8] are the multicore SoC products of
this enhancement type. The transition from single core chips to multicore ones
seems to have been successful on the hardware side, and various multicore products
are already on the market. However, various issues still need to be addressed for
future multicore systems.
The first issue concerns memories and interconnects. Flat memory and interconnect structures are the best for software but are hardly possible in terms of hardware. Therefore, some hierarchical structures are necessary. The power consumed by on-chip interconnects for communication and data transfers degrades power efficiency, and a more effective transfer scheme must be established. Maintaining the external I/O performance per core is more difficult than increasing the number of cores, because the number of pins per transistor decreases with finer processes. Therefore, a breakthrough is needed in order to maintain the I/O performance.
The second issue concerns runtime environments. Performance scalability was provided by the operating frequency in single-core systems, but it must be provided by the number of cores in multicore systems. Therefore, the number of cores must be made invisible or virtualized with small overhead by the runtime environment. A multicore system will integrate different subsystems called domains.
The domain separation improves system reliability by preventing interference
between domains. On the other hand, the well-controlled domain interoperation
results in an efficient integrated system.
The third issue relates to the software development environments. Multicore sys-
tems will not be efficient unless the software can extract application parallelism and
utilize parallel hardware resources. We have already accumulated a huge amount of
legacy software for single cores. Some legacy software can successfully be ported,
especially for the integration type of multicore SoCs like the SH-Mobile G series.
However, it is more difficult with the enhancement type. We must either make a single program run on multiple cores or distribute functions that now run on a single core across the cores. Therefore, we must improve the portability of legacy software to multicore systems. Developing new highly parallel software is another issue. An
application or parallelization specialist could do this, although it might be necessary
to have specialists in both areas. Some excellent research has been done on auto-
matic parallelization compilers, and the products of such compilers are expected to
be released in the future. Further, we need a paradigm shift in the development, for
example, a higher level of abstraction, new parallel languages, and assistant tools
for effective parallelization.

4.2 RP-1 Prototype Chip

The RP-1 is the first multicore chip with four SH-X3 CPU cores (see Sect. 3.1.7)
[9–13]. It was fabricated as a prototype chip using a 90-nm CMOS process to accel-
erate the research and development of various embedded multicore systems. The
RP-1 achieved a total of 4,320 MIPS at 600 MHz by the four SH-X3 cores measured

Table 4.1 RP-1 specifications

Process technology: 90-nm, 8-layer Cu, triple-Vth CMOS
Chip size: 97.6 mm² (9.88 mm × 9.88 mm)
Supply voltage: 1.0 V (internal), 1.8/3.3 V (I/O)
Clock frequency: 600 MHz
SH-X3 core — size: 2.60 mm × 2.80 mm
  I/D-cache: 32-KB 4-way set-associative (each)
  ILRAM/OLRAM: 8 KB/16 KB
  URAM: 128 KB (unified)
Snoop controller (SNC): duplicated address array (DAA) of four D-caches
Centralized shared memory (CSM): 128 KB
External interfaces: DDR2-SDRAM, SRAM, PCI-Express
Performance — CPU: 4,320 MIPS (Dhrystone 2.1, 4-core total)
  FPU: 16.8 GFLOPS (peak, 4-core total)
Package: 554-pin FCBGA, 29 mm × 29 mm
Chip power: 3 W (typical, 1.0 V)

using the Dhrystone 2.1 benchmark. It supports both symmetric and asymmetric multiprocessor (SMP and AMP) features for embedded applications, and the SMP and AMP modes can be mixed to construct a hybrid SMP/AMP system. Each core can operate at a different frequency and can stop individually while maintaining its data cache coherency, even while the other processors are running, in order to achieve both the maximum processing performance and the minimum operating power for various applications.

4.2.1 RP-1 Specifications

Table 4.1 summarizes the RP-1 specifications. The RP-1 integrates four SH-X3
cores with a snoop controller (SNC) to maintain the data cache coherency among
the cores, DDR2-SDRAM and SRAM memory interfaces, a PCI-Express interface,
some HW-IPs for various types of processing, and some peripheral modules. The
HW-IPs include a DMA controller, a display unit, and accelerators. Each SH-X3
core includes a CPU, an FPU, 32-KB 4-way set-associative instruction and data
caches, a 4-entry instruction TLB, a 64-entry unified TLB, an 8-KB instruction
local RAM (ILRAM), a 16-KB operand local RAM (OLRAM), and a 128-KB user
RAM (URAM).
Figure 4.4 illustrates a block diagram of the RP-1. The four SH-X3 cores, a snoop
controller (SNC), and a debug module (DBG) constitute a cluster. The HW-IPs are
connected to an on-chip system bus (SuperHyway). The arrows to/from the SuperHyway
indicate connections from/to initiator/target ports, respectively. The details of the
SH-X3 cluster and SuperHyway are described in the following sections.

Fig. 4.4 Block diagram of RP-1 (SNC: snoop controller; DAA: duplicated address array; CRU: cache RAM control unit; I$/D$: instruction/data cache; IL/DL: instruction/data local memory; URAM: user RAM; DBG: debug module; GCPG/LCPG: global/local clock pulse generator; INTC: interrupt controller; SHPB/HPB: peripheral bus bridges; CSM: centralized shared memory; DMAC: direct memory access controller; PCIe: PCI-Express interface; SCIF: serial communication interface; TMU: timer unit; GPIO: general-purpose I/O)

4.2.2 SH-X3 Cluster

The four SH-X3 cores constitute a cluster sharing an SNC and a DBG to support
symmetric-multiprocessor (SMP) and multicore-debug features. The SNC has a
duplicated address array (DAA) of data caches of all the four cores and is connected
to the cores by a dedicated snoop bus separated from the SuperHyway to avoid both
deadlock and interference by some cache coherency protocol operations. The DAA
minimizes the number of data cache accesses of the cores for the snoop operations,
resulting in the minimum coherency maintenance overhead. Each core can operate
at different CPU clock (ICLK) frequencies and can stop individually to minimize
the power (see Sect. 4.2.3). The coherency protocol was optimized to avoid the
interference that results from a slow core to a fast core (see Sect. 4.2.4).

4.2.3 Dynamic Power Management

Each core can operate at a different CPU clock (ICLK) frequency and can stop individually with a short switching time while the other processors keep running, in order to achieve both the maximum processing performance and the minimum operating power for various applications. Data cache coherency is maintained during operation at different frequencies, including frequencies lower than the on-chip system bus clock (SCLK). The following four schemes make it possible to change each ICLK frequency individually while maintaining data cache coherency:
1. Each core has its own clock divider for an individual clock frequency change.
2. A handshake protocol is executed before the frequency change to avoid conflicts
in bus access, while keeping the other cores running.

Table 4.2 Coherency overhead cycles (SCLK cycles; typical case)

Access type    | Accessed-core | Snooped-core | Snooped core: 600 MHz       | Snooped core: 150 MHz
               | line state    | line state   | not optimized / optimized   | not optimized / optimized
Read           | S, E, M       | –            | 0 / 0                       | 0 / 0
Write          | E, M          | –            | 0 / 0                       | 0 / 0
Write          | S             | S            | 10 / 4                      | 19 / 4
Read or write  | Miss          | Miss         | 5 / 5                       | 5 / 5
Read or write  | Miss          | S            | 10 / 5                      | 19 / 5
Read or write  | Miss          | E            | 10 / 10                     | 19 / 19
Read or write  | Miss          | M            | 13 / 13                     | 22 / 22

3. Each core supports various ICLK frequency ratios to SCLK including a lower
frequency than that of SCLK.
4. Each core has a light-sleep mode to stop its ICLK while maintaining data cache
coherency.
The global ICLK and the SCLK that run up to 600 and 300 MHz, respectively,
are generated by a global clock pulse generator (GCPG) and distributed to each
core. Both the global ICLK and SCLK are programmable by setting the frequency
control register in the GCPG. Each local ICLK is generated from the global ICLK
by the clock divider of each core. The local CPG (LCPG) of a core executes a hand-
shake sequence dynamically when the frequency control register of the LCPG is
changed so that it can keep the other cores running and can maintain coherency in
data transfers of the core. In contrast, the previous approach assumed a low frequency during a clock frequency change and stopped all the cores whenever a frequency was changed.
The core supports “light-sleep mode” to stop its ICLK except for its data cache in
order to maintain the data cache coherency. This mode is effective for reducing the
power of an SMP system.
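As a concrete illustration, software would change one core’s ICLK ratio roughly as follows. This is a minimal sketch: the register names, addresses, and bit layout are hypothetical, since the actual LCPG register map is not given here; only the behavior (write the frequency control register and let the LCPG complete its handshake) follows the description above.

#include <stdint.h>

/* Hypothetical LCPG register map (illustrative only) */
#define LCPG_BASE(core)   (0xFF800000u + 0x100u * (uint32_t)(core))
#define LCPG_FRQCR(core)  (*(volatile uint32_t *)(LCPG_BASE(core) + 0x0u))
#define LCPG_STATUS(core) (*(volatile uint32_t *)(LCPG_BASE(core) + 0x4u))
#define LCPG_BUSY         (1u << 0)

/* Change one core's ICLK divider while the other cores keep running.
 * The LCPG hardware performs the handshake that avoids bus-access
 * conflicts and maintains data cache coherency, as described above. */
static void set_iclk_divider(int core, uint32_t divider)
{
    LCPG_FRQCR(core) = divider;            /* request the new ratio    */
    while (LCPG_STATUS(core) & LCPG_BUSY)  /* wait until the handshake */
        ;                                  /* sequence has completed   */
}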

4.2.4 Core Snoop Sequence Optimization

Each core should operate at the proper frequency for its load, but in some cases of
the SMP operation, a low frequency core can cause a long stall of a high frequency
core. We optimized the cache snoop sequences for the SMP mode to minimize such
stalls. Table 4.2 summarizes the coherency overhead cycles. These cycles vary according to various conditions; the table indicates a typical case. The optimized values that improve on the non-optimized ones correspond to the cases explained below.
Figure 4.5 (i), (ii) shows examples of core snoop sequences before and after the
optimization. The case shown is a “write access to a shared line,” which is the third
case in the table.

Fig. 4.5 Core snoop sequences before (i) and after (ii) optimization for a write to a shared line (cores #0 and #2 at 600 MHz, core #1 at 150 MHz; in (i) the snoop acknowledge follows the invalidate acknowledges, whereas in (ii) it immediately follows the DAA update, shortening the snoop latency)

The operating frequencies of cores #0, #1, and #2 are 600, 150, and 600 MHz,
respectively. Initially, all the data caches of the cores hold a common cache line, and
all the cache-line states are “shared.” Sequence (i) is as follows:
1. Core Snoop Request: Core #0 stores data in the cache, changes the stored-line
state from “Shared” to “Modified,” and sends a “Core Snoop Request” of the
store address to the SNC.
2. DAA Update: The SNC searches the DAA of all the cores and changes the states
of the hit lines from “Shared” to “Modified” for core #0 and “Invalid” for cores
#1 and #2. The SNC runs at SCLK frequency (300 MHz).
3. Invalidate Request: The SNC sends “Invalidate Request” to cores #1 and #2.
4. Data Cache Update: Cores #1 and #2 change the states of the corresponding
cache lines from “Shared” to “Invalid.” The processing time depends on each
core’s ICLK.
5. Invalidate Acknowledge: Cores #1 and #2 return “Invalidate Acknowledge” to
the SNC.
6. Snoop Acknowledge: The SNC returns “Snoop Acknowledge” to core #0.
As shown in Fig. 4.5 (i), the return from core #1 is late due to its low frequency,
resulting in long snoop latency.
After the optimization, sequence (ii) is as follows:
1. Core Snoop Request
2. DAA Update

3. Snoop Acknowledge and Invalidate Request
4. Data Cache Update
5. Invalidate Acknowledge
The “Snoop Acknowledge” is moved from the sixth to the third step by eliminating the wait for the “Invalidate Acknowledge,” so the late response of the slow core does not affect the operation of the fast core. In the optimized sequence, the SNC is busy for some cycles after the “Snoop Acknowledge,” and the next “Core Snoop Request” must wait if the SNC is still busy. However, this is rare in ordinary programs.
The sequence of another case, a “read miss and hit to another core’s modified
line,” which is the last case in the table, is as follows:
1. Core Snoop Request: A data read of core #0 misses its cache and sends a “Core
Snoop Request” of the access address to the SNC.
2. DAA Update: The SNC searches the DAA of all the cores and changes the states
of the hit lines from “Modified” to “Shared.”
3. Data Transfer Request: The SNC sends a “Data Transfer Request” to the core of
the hit line for the cache fill data of core #0.
4. Data Cache Update: The requested core reads the requested data and changes
the states of the corresponding line of the DAA to “Shared.” The processing time
depends on each core’s ICLK.
5. Data Transfer Response and Write Back Request: The requested core returns the
requested data and requests a write back to the SNC.
6. Snoop Acknowledge and Write Back Request: The SNC returns “Snoop
Acknowledge” to core #0 with the fill data and requests a write back of the
returned data to the main memory.
7. Data Cache Update 2: Core #0 completes the “Read” operation by replacing a
cache line with the fill data.
In this case, core #0 must wait for the fill data, and the early “Snoop Acknowledge”
is impossible.
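The effect of moving the acknowledge can be seen in the following toy software model of the optimized write-to-shared-line sequence. It is purely illustrative: the state names follow Table 4.2, but the data structures and function are inventions for this sketch, not the actual SNC logic.

/* Toy model of the optimized "write to a shared line" sequence of
 * Fig. 4.5 (ii).  One DAA entry per core tracks the line state. */
typedef enum { LINE_I, LINE_S, LINE_E, LINE_M } line_state_t;

#define NCORES 4

typedef struct { line_state_t daa[NCORES]; } snc_t;

/* Core `w` writes a line that the cores hold in state S. */
static void snoop_write_shared(snc_t *snc, int w)
{
    snc->daa[w] = LINE_M;              /* (2) DAA update for the writer */

    /* (3) The snoop acknowledge is returned to core w HERE, before the
     * slower cores have invalidated their copies, so the fast writer
     * is never stalled by a slow core's ICLK. */

    for (int c = 0; c < NCORES; c++)   /* (3)-(5) invalidate the rest   */
        if (c != w && snc->daa[c] == LINE_S)
            snc->daa[c] = LINE_I;      /* each core at its own pace     */
    /* The invalidate acknowledges only free the SNC for the next
     * core snoop request; they no longer gate the acknowledge. */
}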

4.2.5 SuperHyway Bus

It would require too much time and money to design an SoC consisting entirely of originally designed modules. Therefore, we make modules reusable and refer to them as HW-IPs. A standard and highly efficient method is needed to connect the HW-IPs. The on-chip system bus called SuperHyway is a packet-based split-transaction bus used to connect the HW-IPs, and a transaction may contain up to 32 bytes of data. The bus is compatible with Virtual Socket Interface (VSI) protocols.
It seamlessly connects to VSI virtual-component libraries.
Effective support of high-speed, multi-initiator, multi-target data transfer is
important for cost-effective SoC implementations. Such data transfer mechanisms

Fig. 4.6 Improvement in utilization ratio achieved by split bus transactions ((i) non-split transactions: requests R1–R3, grants G1–G3, and external bus accesses D1–D3 serialize, taking 42 cycles at 52% external bus utilization; (ii) split transactions: 30 cycles at 80% utilization)

must also be flexible to support different configurations. The SuperHyway can perform two routing jobs at any instant. It can receive data transaction requests from HW-IP-module initiator ports and route one of the requests to the target port of a HW-IP module that is ready to receive the transaction request. At the same time, it can route a response, such as the read data of a read transaction request, from a target port. The response is automatically directed back to the original initiator port. Target ports can receive requests and save them in an internal service queue.
The upper eight bits of the 32-bit address are used to specify a module connected
to the SuperHyway, and a multiple of 16-MB spaces are assigned to each module.
Then an initiator can access any location mapped on the 4-GB address space.
A request packet has a field consisting of an 8-bit transaction ID so that SuperHyway
modules can initiate 256 outstanding transactions. A module can request a transac-
tion after the previous transaction is granted by the SuperHyway. This split transac-
tion effectively hides the long latency of a slow response such as an external memory
access or peripheral module access. As a result, the bus utilization ratio is greatly improved. Figure 4.6 compares how a non-split transaction bus and the SuperHyway access an external DRAM. The utilization ratio improves from 52% to 80% in this case.
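As an illustration of this address map and transaction tagging, the sketch below decodes an initiator address into the fields described above. The struct layout and helper are inventions for this sketch, not the actual SuperHyway packet encoding.

#include <stdint.h>

/* Illustrative request-packet fields implied by the text: the upper
 * 8 address bits select the target module (a 16-MB region each), and
 * an 8-bit transaction ID allows 256 outstanding split transactions. */
struct shw_request {
    uint8_t  target;  /* address[31:24]: module select            */
    uint32_t offset;  /* address[23:0]: offset within the region  */
    uint8_t  tid;     /* transaction ID matched by the response   */
    uint8_t  nbytes;  /* a transaction carries up to 32 bytes     */
};

static struct shw_request shw_make_request(uint32_t addr, uint8_t tid,
                                           uint8_t nbytes)
{
    struct shw_request r;
    r.target = (uint8_t)(addr >> 24);
    r.offset = addr & 0x00FFFFFFu;
    r.tid    = tid;
    r.nbytes = (nbytes <= 32u) ? nbytes : 32u;
    return r;
}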

4.2.6 Chip Integration

The RP-1 integrated many HW-IPs, and a flat interconnection by a single SuperHyway was not feasible. We connected the major HW-IPs, including the SH-X3 cluster, with one SuperHyway and connected them to the other HW-IPs via bus bridges. Figure 4.7 illustrates the connection. Ten HW-IPs and a bridge are connected within a single SCLK cycle. “I” and “T” indicate initiator and target ports. A SuperHyway routing block connects these ports for the single-cycle transfer among the HW-IPs at 300 MHz. The critical timing path starts at an F/F of an initiator

Fig. 4.7 SuperHyway connection of HW-IPs including the SH-X3 cluster (64-bit, 300-MHz router, 2.4 GB/s, initiator-to-target in 3.3 ns; request: 29-bit address and 64-bit data; response: 64-bit data; targets include the CSM, SRAM interface, and 32-bit 600-MHz DDR2 interface)

Fig. 4.8 Physical organization of the SuperHyway connection (router at the center of the four cores; a one-cycle initiator-to-target path reaches the shaded area; scale bar: 1 mm)

IP address and ends at many F/Fs of target HW-IPs. The 300-MHz, 64-bit SuperHyway achieved a throughput of 2.4 GB/s, the same as that of the 600-MHz, 32-bit DDR2 interface.
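The two throughputs match because the buses transfer eight and four bytes per cycle, respectively:

$$300\ \mathrm{MHz} \times 8\ \mathrm{B} = 2.4\ \mathrm{GB/s} = 600\ \mathrm{MHz} \times 4\ \mathrm{B}.$$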
Figure 4.8 shows the physical organization of the interconnect logic, where each
arrow includes 29-bit-address and 128-bit-data lines corresponding to Fig. 4.7. The
routing block was synthesized as a netlist without actual wire lengths. A long wire path could cause an unacceptable RC delay, which could only be calculated after place and route, so we inserted repeater cells to improve the path delay. As a result, a one-cycle path from an initiator to a target could reach the shaded area.
Figure 4.9 shows the chip micrograph of the RP-1. The chip was integrated in
two steps to minimize the design period of the physical integration, and successfully
fabricated: (1) First, a single core was laid out as a hard macro and completed

Fig. 4.9 Chip micrograph of RP-1

Fig. 4.10 Execution time of the SPLASH-2 suite (normalized; FFT, LU, Radix, and Water with one, two, and four threads; “Barrier” marks the synchronization portion)

timing closure of the core, and (2) the whole chip was laid out by instantiating the core four times.

4.2.7 Performance Evaluations

We evaluated the processing performance and power reduction of parallel processing on the RP-1. Figure 4.10 plots the time required to execute the SPLASH-2 suite [14] depending on the number of threads on an SMP Linux system. The RP-1 reduced the processing time to 50.5–52.6% and 27.1–36.9% with two and four

Fig. 4.11 Active power of the SPLASH-2 suite (mW; FFT, LU, Radix, and Water with one, two, and four threads; the “Linux” level marks the active power of the SMP Linux idle tasks)

Fig. 4.12 Energy consumption with low power modes (two CPUs running FFT while the other two CPUs are idle, in light-sleep, sleep, or module-stop mode, at 600 and 300 MHz)

threads, respectively, whereas the ideal times for perfect scalability would be 50% and 25%. The major overhead was synchronization and snoop time. The SNC improved cache coherency performance, and the performance overhead of snoop transactions was at most 0.1% when SPLASH-2 was executed.
Figure 4.11 shows the power consumption of the SPLASH-2 suite. The suite ran
at 600 MHz and at 1.0 V. The average power consumption of one, two, and four
threads was 251, 396, and 675 mW, respectively. This included 104 mW of active
power for the idle tasks of SMP Linux. The results of the performance and power
evaluation showed that the power efficiency was maintained or enhanced when the
number of threads increased.
Figure 4.12 shows the energy consumption with the low power modes. These modes were implemented to save power when fewer threads were running than the number of available CPU cores. As a benchmark, two threads of FFT were run on two CPU cores while the other two CPU cores were idle. The energy consumed in the light-sleep, sleep, and module-stop modes at 600 MHz was 4.5%, 22.3%, and 44.0% lower than in the

Table 4.3 RP-2 specifications

Process technology: 90-nm, 8-layer Cu, triple-Vth CMOS
Chip size: 104.8 mm² (10.61 mm × 9.88 mm)
Supply voltage: 1.0 V (internal), 1.8/3.3 V (I/O)
Clock frequency: 600 MHz
SH-X3 core — size: 6.6 mm² (3.36 mm × 1.96 mm)
  I/D-cache: 16-KB 4-way set-associative (each)
  ILRAM/OLRAM: 8 KB/32 KB
  URAM: 64 KB
Centralized shared memory (CSM): 128 KB
External interfaces: DDR2-SDRAM, SRAM
Performance — CPU: 8,640 MIPS (Dhrystone 2.1, 8-core total)
  FPU: 33.6 GFLOPS (peak, 8-core total)
Chip power: 2.8 W (600 MHz, 1.0 V, room temperature, Dhrystone 2.1)

normal mode, respectively, although these modes take some time to stop and restart the CPU core and to save and restore the cache. At 300 MHz, the execution time increased by 79.5%, but the power consumption decreased, and the required energy decreased by 5.2%.
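These figures are consistent with the relation $E = Pt$. Normalizing the 600-MHz values to 1, the measured ratios give

$$\frac{P_{300}}{P_{600}} = \frac{E_{300}/t_{300}}{E_{600}/t_{600}} = \frac{0.948}{1.795} \approx 0.53,$$

i.e., halving the frequency roughly halved the power, so the runtime grew by 79.5% while the energy saving remained a modest 5.2%.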

4.3 RP-2 Prototype Chip

The RP-2 is a prototype multicore chip with eight SH-X3 CPU cores (see Sect.
3.1.7) [15–17]. It was fabricated in a 90-nm CMOS process that was the same pro-
cess used for the RP-1. The RP-2 achieved a total of 8,640 MIPS at 600 MHz by the
eight SH-X3 cores measured with the Dhrystone 2.1 benchmark. Because it is
difficult to lay out the eight cores close to each other, we did not select a tightly
coupled cluster of eight cores. Instead, the RP-2 consists of two clusters of four
cores, and the cache coherency is maintained in each cluster. Therefore, the inter-
cluster cache coherency must be maintained by software if necessary.

4.3.1 RP-2 Specifications

Table 4.3 summarizes the RP-2 specifications. The RP-2 integrates eight SH-X3
cores as two clusters of four cores, DDR2-SDRAM and SRAM memory interfaces,
DMA controllers, and some peripheral modules. Figure 4.13 illustrates a block dia-
gram of the RP-2. The arrows to/from the SuperHyway indicate connections from/
to initiator/target ports, respectively.

Fig. 4.13 Block diagram of RP-2 (two clusters of four SH-X3 cores, each cluster with an SNC and a DBG; DDR2 and SRAM interfaces, two DMACs, CSM, GCPG/LCPGs, INTC, JTAG interface, and peripherals on the SuperHyway)

4.3.2 Power Domain and Partial Power-Off

Power-efficient SoC design for embedded applications requires several independent power domains so that the power of unused domains can be turned off. Power domains were first introduced in an SoC for mobile phones [3], which defined 20 hierarchical power domains, but most of those domains were assigned to peripheral IPs using low-leakage, high-Vth transistors. In contrast, high-performance multicore SoCs use leaky low-Vth transistors for the CPU cores, and reducing the leakage power of such cores is the primary goal. The RP-2 was developed targeting power-efficient, high-performance embedded applications. Sixteen power domains were defined so that they can be powered off independently. A resume-standby mode was also defined for fast resume operation; in this mode, the power of a core’s CPU is off while the power of its URAM stays on. Each processor core can operate at a different frequency or even dynamically stop its clock to maintain processing performance while reducing the average operating power consumption.
Figure 4.14 illustrates the power domain structure of eight CPU cores with eight
URAMs. Each core is allocated to a separate power domain so that the power supply
can be cut off while unused. Two power domains (Cn and Un, for n ranging from
0 to 7) are assigned to each core, where Un is allocated only for URAM. By keeping
the power of Un on, the CPU status is saved to URAM before the Cn power is
turned off, and restored from URAM after Cn power is turned on. This shortens the
restart time compared with a power-off mode in which both Cn and Un are powered
off together. Each power domain is surrounded by power switches and controlled by
a power switch controller (VSWC).
Table 4.4 summarizes the power modes of each CPU. Light-sleep mode is suit-
able for dynamic power saving while cache coherency is maintained. In sleep mode,
almost all clocks for the CPU core are stopped. In resume-standby mode, the leak-
age current for eight cores is reduced to 22 mA from 162 mA in sleep mode, and
leakage power was reduced by 86%.

Fig. 4.14 Power domain structure of the eight CPU cores with eight URAMs (each core has separate power domains Cn and Un with their own power switches, controlled by the power switch controller (VSWC) via a power control register; VSSM is the virtual ground)

Table 4.4 Power modes of CPU cores

CPU power mode          | Normal | Light sleep | Sleep | Resume | Power-off
Clock for CPU and URAM  | On     | Off         | Off   | Off    | Off
Clock for I/D-cache     | On     | On          | Off   | Off    | Off
Power supply for CPU    | On     | On          | On    | Off    | Off
Power supply for URAM   | On     | On          | On    | On     | Off
Leakage current (mA)*   | 162    | 162         | 162   | 22     | 0

*Measured at room temperature at 1.0 V, eight-core total
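In software, entering and leaving resume-standby could look roughly like the sketch below. This is a minimal sketch under stated assumptions: the helper functions are hypothetical platform hooks standing in for the VSWC register accesses and the context save/restore code, which are not specified here.

/* Hypothetical platform hooks (not the actual RP-2 API) */
extern void save_cpu_context_to_uram(int core);
extern void restore_cpu_context_from_uram(int core);
extern void flush_data_cache(int core);
extern void vswc_power_off(int core);   /* open the Cn power switches  */
extern void vswc_power_on(int core);    /* close the Cn power switches */

/* Enter resume-standby: domain Un (URAM) stays on, domain Cn goes off. */
void enter_resume_standby(int core)
{
    save_cpu_context_to_uram(core);   /* URAM keeps its power        */
    flush_data_cache(core);           /* caches lose power with Cn   */
    vswc_power_off(core);
}

/* Exit resume-standby: restart quickly from the retained URAM image. */
void exit_resume_standby(int core)
{
    vswc_power_on(core);
    restore_cpu_context_from_uram(core);
}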

4.3.3 Synchronization Support Hardware

The RP-2 has barrier registers to support CPU core synchronization for multiprocessor systems. Software can use these registers for fast synchronization between the cores, in which one core waits for the other cores to reach a specific point in a program. Figure 4.15 illustrates the barrier registers for the synchronization. In a conventional software solution, the cores have to test and set a specific memory location, which takes many cycles. We provide three sets of barrier registers to accelerate the synchronization. Each CPU core has a one-bit BARW register to notify the others when its program flow reaches a specific point. The BARW values of all the cores are gathered by hardware to form the 8-bit BARR register of each core, so that each core can obtain all the BARW values from its BARR register with a single instruction. As a result, the synchronization is fast and does not disturb other transactions on the SuperHyway bus.
Figure 4.16 shows an example of the barrier register usage. In the beginning, all
the BARW values are initialized to zero. Then each core inverts its BARW value

Fig. 4.15 Barrier registers for synchronization (each core’s 1-bit BARW is gathered by hardware into the 8-bit BARR visible to every core in both clusters)

Fig. 4.16 Synchronization example using the barrier registers (barrier initialization: each core clears its BARW to zero; execution: each core sets its BARW to one at a specific point; barrier synchronization: each core waits until its BARR is all ones; the next round inverts the values back to zero)

Table 4.5 Eight-core synchronization cycles

                      | Conventional method    | RP-2 method
                      | (via external memory)  | (via BARW/BARR registers)
Average clock cycles  | 52,396                 | 8,510
Average difference    | 20,120                 | 10

when it reaches a specific point, and it checks and waits until all its BARR values
are ones reflecting the BARW values. The synchronization is complete when all the
BARW values are inverted to ones. The next synchronization can start immediately
with the BARWs being ones and is complete when all the BARW values are inverted
to zeros.
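In software, this alternating use of the registers amounts to a sense-reversing barrier, sketched below. The register addresses are hypothetical; only the BARW/BARR behavior follows the description above.

#include <stdint.h>

/* Hypothetical memory-mapped addresses of this core's registers */
#define BARW (*(volatile uint8_t *)0xFF810000u)  /* 1-bit notify   */
#define BARR (*(volatile uint8_t *)0xFF810004u)  /* 8-bit gather   */

/* Sense-reversing barrier: call with sense = 1, then 0, then 1, ...
 * so no reinitialization is needed between barriers (cf. Fig. 4.16). */
void barrier_wait(int sense)
{
    uint8_t all = sense ? 0xFFu : 0x00u;
    BARW = (uint8_t)sense;   /* notify: this core reached the point  */
    while (BARR != all)      /* single-instruction read of all BARWs */
        ;                    /* spin until all eight cores arrive    */
}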
Table 4.5 compares the results of eight-core synchronizations with and without
the barrier registers. The average number of clock cycles required for a certain task
to be completed with and without barrier registers is 8,510 and 52,396 cycles,
respectively. The average differences in the synchronizing cycles between the first
and last cores are 10 and 20,120 cycles with and without the barrier registers, respec-
tively. These results show that the barrier registers effectively improve the
synchronization.

Fig. 4.17 Block diagram of the interrupt controller (INTC): non-maskable and maskable external interrupt control, on-chip peripheral interrupt control, per-core interrupt mask control, interrupt distribution control, CPU core interfaces, and inter-core interrupt control with eight IPI registers

4.3.4 Interrupt Handling for Multicore

In a multicore system, multiple cores can handle interrupts, so a mechanism is necessary to select the core that handles each interrupt while keeping the overhead low. We added an autorotating interrupt distribution scheme to the processor cores for this purpose, and the processing time in the Linux kernel was reduced by up to 21% when SPLASH-2 was executed.
The RP-2 integrates many peripherals and has to handle interrupt requests from the peripherals and the processor cores. Figure 4.17 illustrates a block diagram of the
interrupt controller (INTC). The INTC handles maskable/non-maskable interrupts,
on-chip peripheral interrupts, and inter-core interrupts. A request is received by the
corresponding control block, masked at the interrupt mask control block for each core,
distributed by the interrupt distribution control block, and output by the CPU core
interface for each core. An inter-core interrupt is handled by the inter-core interrupt
control block via the CPU core interface.
The interrupts have fixed and dynamic distribution modes. In the fixed distribution
mode, the interrupt request is distributed to the specific core configured by setting up
an interrupt mask register. The RP-2 has two dynamic distribution modes. In the con-
ventional dynamic distribution mode, the interrupt request is distributed simultane-
ously to all the cores and will be served by the first acknowledging one. All the cores
jump to an interrupt handling routine, check the interrupt acknowledgment register
(INTACK) in the INTC, and determine whether they should process the interrupt or
return from the handler routine. The first acknowledging core reads the INTACK
value “1” and processes the interrupt handling. The INTC clears the INTACK value
to “0” and invalidates the interrupt request. Then the other cores read “0” and return
to their tasks. This mode is best in terms of response time because the earliest respond-
ing core serves the interrupt request. However, the other cores consume redundant
operation time in the interrupt handling routine and context save/restore. The system
must pay for the overhead of the redundant time multiplied by the number of cores.

Fig. 4.18 Chip micrograph of RP-2

The RP-2 has an autorotating dynamic distribution mode to reduce this overhead. In this mode, the INTC asserts an interrupt request to one core at a time for a number of cycles specified by the software, at most 24 cycles. The other cores then need not consume redundant interrupt handling time. This mode is best in terms of computing throughput while keeping the worst-case response time of the conventional mode, which is important in order to guarantee the response time. The average number of clock cycles for an evaluated task was 73,028 cycles in the newly added mode, which is 1,900 cycles fewer on average than in the conventional mode.
In the multicore system, each core needs to interrupt other cores, and an inter-
processor interrupt (IPI) is supported by the RP-2. There are eight IPI registers in
the IPI control block for eight cores. Each core can generate an interrupt to other
cores by writing to its IPI register in the INTC. Each IPI register consists of eight
fields corresponding to the target cores.
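For illustration, a handler for the conventional dynamic distribution mode might look like the sketch below; the INTACK address and the service hook are hypothetical, but the check-and-return behavior follows the description above.

#include <stdint.h>

#define INTACK (*(volatile uint32_t *)0xFF814000u)  /* hypothetical */

extern void service_device(void);  /* platform-specific service hook */

/* Conventional dynamic distribution: every core enters this handler,
 * but only the first responder reads INTACK = 1 and serves the
 * request; the INTC then clears INTACK and the other cores read 0. */
void irq_handler(void)
{
    if (INTACK == 1u) {
        service_device();
        return;
    }
    /* The other cores return immediately -- the wasted entry/exit and
     * context save/restore here is exactly the overhead that the
     * autorotating mode eliminates. */
}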

4.3.5 Chip Integration and Evaluation

The RP-2 was fabricated using the same 90-nm CMOS process as that for the RP-1.
Figure 4.18 is the chip micrograph of the RP-2. It achieved a total of 8,640 MIPS at
600 MHz by the eight SH-X3 cores measured with the Dhrystone 2.1 benchmark
and consumed 2.8 W at 1.0 V including leakage power.
The fabricated RP-2 chip was evaluated using the SPLASH-2 benchmarks on an
SMP Linux operating system. Figure 4.19 plots the RP-2 execution time on one
cluster based on the number of POSIX threads. The processing time was reduced to

Fig. 4.19 RP-2 execution time according to the number of POSIX threads (relative time for Water, FFT, LU, Radix, Barnes, and Ocean with 1, 2, 4, 8, and 16 threads)

Fig. 4.20 Number of interrupts acknowledged by cores 0–3 during SPLASH-2 execution (conventional vs. autorotating distribution; −7% for Water, −31% for Radix, −57% for Barnes)

51–63% with two threads and to 27–41% with four or eight threads running on one cluster. Since a cluster has only four cores, the eight-thread case showed performance similar to the four-thread case. Furthermore, in some cases, increasing the number of threads increased the processing time due to the synchronization overhead.
The autorotating dynamic interrupt distribution mode was evaluated and com-
pared to a conventional one by SPLASH-2 with four threads on SMP Linux using
one cluster of a real chip. Figure 4.20 shows the number of interrupts acknowl-
edged by the CPU cores during the SPLASH-2 execution. The total acknowledged
interrupts by all the cores in the autorotating mode decreased by 7% for Water, 31%
for Radix, and 57% for Barnes from the conventional mode. As a result, it avoided
the redundant interrupt handling. This improvement leads to a reduced processing
time in Linux kernel mode. Figure 4.21 shows the processing time reduction in

Fig. 4.21 Processing time reduction in kernel mode (relative processing time, conventional vs. autorotating; −8% for Water, −11% for Radix, −21% for Barnes)

kernel mode. The reduction was 8% for Water, 11% for Radix, and 21% for Barnes.
In addition to the improved performance, the reduction in the acknowledged
interrupts is expected to be effective for saving power. In sleep mode in particular,
the redundant interrupt handling wastes power waking up the cores and putting them back into sleep mode.

4.4 RP-X Prototype Chip

A heterogeneous multicore is one of the most promising approaches to attaining high performance with low frequency and power for consumer electronics or scientific applications. The RP-X is the latest prototype multicore chip, with eight SH-X4 cores [18–20] (see Sect. 3.1.8), four Flexible Engine/Generic ALU Arrays (FE–GAs) [21, 22], two MX-2 matrix processors [23], a video processing unit 5 (VPU5) [24, 25], and various peripheral modules. It was fabricated using a 45-nm CMOS process. The RP-X achieved 13.7 GIPS at 648 MHz by the eight SH-X4 cores measured using the Dhrystone 2.1 benchmark and a total of 114.7 GOPS at 3.07 W, attaining a power efficiency of 37.3 GOPS/W.

4.4.1 RP-X Specifications

The RP-X specifications are summarized in Table 4.6. It was fabricated using a 45-nm CMOS process, integrating eight SH-X4 cores, four FE–GAs, two MX-2s, one VPU5, one SPU, and various peripheral modules as a heterogeneous multicore SoC for consumer electronics or scientific applications.

Table 4.6 RP-X specifications

Process technology: 45-nm, 8-layer Cu, triple-Vth CMOS
Chip size: 153.76 mm² (12.4 mm × 12.4 mm)
Supply voltage: 1.0–1.2 V (internal), 1.2/1.5/1.8/2.5/3.3 V (I/O)
Clock frequency: 648 MHz (SH-X4), 324 MHz (FE–GA, MX-2)
Total power consumption: 3.07 W (648 MHz, 1.15 V)
Processor cores and performances:
  8× SH-X4 — CPU: 13.7 GIPS (Dhrystone 2.1, 8-core total); FPU: 36.3 GFLOPS (8-core total)
  4× FE–GA — 41.5 GOPS (4-core total)
  2× MX-2 — 36.9 GOPS (2-core total)
Programmable special-purpose cores: VPU5 (video processing unit) for MPEG-2, H.264, VC-1; SPU (sound processing unit) for AAC, MP3
Total performance and power: 114.7 GOPS, 3.07 W, 37.3 GOPS/W (648 MHz, 1.15 V)
External interfaces: 2× DDR3-SDRAM (32-bit, 800 MHz), SRAM, PCI-Express (rev 2.0, 2.5 GHz, 4 lanes), Serial ATA

The eight SH-X4 cores achieved 13.7 GIPS at 648 MHz measured using the Dhrystone 2.1 benchmark. Four FE–GAs, dynamically reconfigurable processors, attained a total performance of 41.5 GOPS at a power consumption of 0.76 W. Two 1,024-way MX-2s attained a total performance of 36.9 GOPS at a power consumption of 1.10 W. Overall, the efficiency of the RP-X was 37.3 GOPS/W at 1.15 V, excluding the special-purpose VPU5 and SPU cores. This was the highest among comparable processors. The operation granularities of the SH-X4, FE–GA, and MX-2 processors are 32 bits, 16 bits, and 4 bits, respectively, and thus, we can assign the appropriate processor core to each task in an effective manner.
Figure 4.22 illustrates the structure of the RP-X. The processor cores of the
SH-X4, FE–GA, and MX-2; the programmable special-purpose cores of the
VPU5 and SPU; and the various modules are connected by three SuperHyway
buses to handle high-volume and high-speed data transfers. SuperHyway-0 con-
nects the modules for an OS, general tasks, and video processing, SuperHyway-1
connects the modules for media acceleration, and SuperHyway-2 connects media
IPs except for the VPU5. Some peripheral buses and modules are not shown in
the figure.
A data transfer unit (DTU) was implemented in each SH-X4 core to transfer data
to and from the special-purpose cores or various memories without using CPU
instructions. In this kind of system, multiple OSes are used to control various func-
tions, and thus, high-volume and high-speed memories are required.

Fig. 4.22 Block diagram of RP-X (two SH-X4 clusters, each with an SNC, L2, and per-core DTUs; four FE–GAs; two MX-2s; the VPU5 video processing unit; SPU2; CSMs; two DDR3 interfaces; LBSC; PCI-Express; Serial ATA; and media IPs, connected by the SuperHyway-0/1/2 buses)


Fig. 4.23 Structure of the dynamically reconfigurable processor FE–GA (arithmetic cell array of ALU and MLT cells; LS cells paired with CRAM local memories; crossbar network (XB); sequence manager (SEQM); configuration manager (CFGM); I/O port control; and bus interface to the system bus)

4.4.2 Dynamically Reconfigurable Processor FE–GA

Figure 4.23 illustrates the structure of the FE–GA. It is a dynamically reconfigurable processor consisting of an arithmetic cell array of 24 16-bit ALUs and eight 16-bit multipliers, and ten pairs, each of a load/store (LS) cell and a local memory (CRAM). The cell array and LS cells are connected by a crossbar network (XB). The array is configured by a configuration manager (CFGM) and controlled by a sequence manager (SEQM) via an array control bus. The SEQM outputs interrupts and DMA

Fig. 4.24 Structure of the massively parallel processor MX-2 (256 × n PEs, n = 1, 2, …, 8, between data-register SRAMs on the H-ch, with a V-ch shifter, an instruction memory controller, and an I/O interface to the subsystem and media buses)

requests to communicate with a host processor or a DMA controller. The LS cells and the SEQM are connected to a bus interface (I/F) via an internal bus, and the CFGM is also connected to the bus I/F, which is the interface to the SuperHyway system bus. An I/O port is connected to the XB by an I/O port controller and enables data transfers independent of the SuperHyway.
The FE–GA is suitable for signal processing or recognition of image and sound data, which consist of massive amounts of 16-bit data, and is effective for processes with middle-grain parallelism. The details of the FE–GA are described in Sect. 3.2.

4.4.3 Massively Parallel Processor MX-2

Figure 4.24 illustrates the structure of the MX-2. It is a massively parallel processor consisting of 1,024-way-SIMD 4-bit PEs, each with an ALU and a Booth encoder; two SRAMs as data registers; a controller with an instruction memory; and an I/O interface to the subsystem and media buses. The number of PEs can be a multiple of 256, and each MX-2 of the RP-X integrates 1,024 PEs. The PEs and SRAMs are connected by horizontal channels (H-ch), and the PEs are connected by a shifter that forms vertical channels (V-ch). The MX-2 performs efficient massively parallel arithmetic processing. It is especially good for data whose width is a multiple of 4 bits, such as image data, which are mainly 8 or 12 bits wide. The details of the MX-2 are described in Sect. 3.3.

4.4.4 Programmable Video Processing Core VPU5

Figure 4.25 illustrates the structure of the VPU5. It is a programmable video processing core consisting of two codec elements for the pixel-rate domain and a variable-length coding for stream-rate domain (VLCS) codec.

Fig. 4.25 Structure of the programmable video processing core VPU5 (two codec elements, each with a DMAC and PIPEs for transform prediction, motion compensation, and the deblocking filter, plus the VLCS codec, connected by a shift-register-based bus; each PIPE has a load module, a 2D ALU, and a store module controlled by a microprogram)

They are connected by a shift-register-based bus for fast and efficient transfer of processing data. Each codec element consists of a DMAC and three programmable image processing elements (PIPEs) for transform prediction, motion compensation, and a deblocking filter. Each PIPE consists of a load module, a two-dimensional ALU, and a store module. They are controlled by a microprogram, and the load/store modules use a data I/O to connect to the bus. The VPU5 can handle various formats such as MPEG-1/2/4, H.263, and H.264 and various resolutions from QCIF to full HD. The programmability is a convenient feature that allows a new algorithm to be applied or a previous algorithm to be updated. The details of the VPU5 are described in Sect. 3.4.

4.4.5 Global Clock Tree Optimization

Because the RP-X integrates various modules, it was important to reduce the power consumption of unused modules by clock gating; the power consumed by the clock buffers was particularly large. Figure 4.26 shows the clock buffer deactivation circuits. In the conventional clock tree (i), the global clock trees from a clock generator were divided logically into CLK0, CLK1, and CLK2, and the clock of Modules A, B, and C was provided by the same tree, CLK0. However, Module C was located far away from Modules A and B, so the clock tree of Module C became a dedicated tree branching near the clock generator, which had to remain active even when Module C was not used. In contrast, Modules A and B successfully shared the clock tree and saved the clock tree’s capacitance. After the power optimization (ii), the clock tree of Module C was separated and gated at the clock generator as CLK0_1, whereas Modules A and B share the clock tree CLK0_0. In this way, the clock tree CLK0_1 can be stopped when

Fig. 4.26 Power optimization of the global clock tree ((i) conventional tree: Modules A, B, and C share CLK0; (ii) after optimization: Module C’s branch is separated and gated at the clock generator as CLK0_1, while Modules A and B share CLK0_0)

Fig. 4.27 DDR3-SDRAM interface (clock generator, DDR3 memory controller, and PHY with retiming F/Fs for command/address/write data, plus an asynchronous FIFO and 90°-shift mask logic on the DQ/DQS read path)

Module C is not used. In a large-scale chip, it is not easy to lay out all the modules using the same clock close together, and proper tree separation is effective for reducing the power. A gate-level simulation showed that applying this method to deactivate all the clock buffers related to the MX-2 and PCI-Express saved 41.5 mW of power at 1.15 V.

4.4.6 Memory Interface Optimization

The RP-X contains two DDR3-SDRAM interfaces, each supporting 2 GB. Figure 4.27 illustrates the DDR3-SDRAM interface. The latency of this interface was reduced, improving performance and power efficiency, by deleting unnecessary data buffering and invalid data masking. No F/Fs except retiming F/Fs were used in the DDR3 PHY, to reduce the write latency. To reduce the read latency, the DDR3 interface includes an asynchronous FIFO and an invalid-level mask circuit for latching valid strobe signals from the bidirectional interface. Overall, the DDR3 interface including the I/O buffer and data sampling requires four cycles (10 ns), and the total read latency is nine cycles including the memory latency.
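A quick check of the quoted numbers against the 800-MHz data rate of Table 4.6 (i.e., a 400-MHz interface clock):

$$t_{\mathrm{cyc}} = \frac{10\ \mathrm{ns}}{4} = 2.5\ \mathrm{ns} \;\Rightarrow\; f = 400\ \mathrm{MHz}, \qquad t_{\mathrm{read}} = 9 \times 2.5\ \mathrm{ns} = 22.5\ \mathrm{ns}.$$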

Fig. 4.28 Chip micrograph of RP-X

Fig. 4.29 System configuration and memory usage of the prototype digital TV (1080i audio/video decoding on the VPU and SPU; image detection and feature quantity calculation on the MX-2s (30.6 GOPS); optical flow calculation on the FE–GAs (0.62 GOPS) for VGA (640 × 480) video at 15 fps; database search and I/O on the CPU cores; main memory of 0.4, 0.6, 1.6, and 1.8 GB per OS and PCI, 4.4 GB in total)

4.4.7 Chip Integration and Evaluation

The RP-X was fabricated using a 45-nm low-power CMOS process. A chip micrograph of the RP-X is shown in Fig. 4.28. It achieved a total of 13,738 MIPS at 648 MHz by the eight SH-X4 cores measured using the Dhrystone 2.1 benchmark and consumed 3.07 W at 1.15 V including leakage power.
The RP-X is a prototype chip for consumer electronics or scientific applications. As an example, we produced a prototype digital TV system with IP networks (IP-TV) including image recognition and database search. Its system configuration and memory usage are shown in Fig. 4.29. The system is capable of decoding 1080i audio/video data using the VPU and the SPU on OS#1. For image recognition, the MX-2s are used for image detection and feature quantity calculation, and the

Table 4.7 Performance and power consumption of RP-X

        | Operating frequency | Performance  | Power  | Power efficiency
SH-X4   | 648 MHz             | 36.3 GFLOPS  | 0.74 W | 49.1 GFLOPS/W
MX-2    | 324 MHz             | 36.9 GOPS    | 0.81 W | 45.6 GOPS/W
FE–GA   | 324 MHz             | 41.5 GOPS    | 1.12 W | 37.1 GOPS/W
Others  | 324/162/81 MHz      | –            | 0.40 W | –
Total   | –                   | 114.7 GOPS   | 3.07 W | 37.3 GOPS/W

FE–GAs are used for optical flow calculation of a VGA (640 × 480) video at 15 fps on OS#2. These operations required 30.6 and 0.62 GOPS of the MX-2s and FE–GAs, respectively. The SH-X4 cores are used for database search using the results of the above operations on OS#3, as well as supporting all the processing including OS#1, OS#2, OS#3, and the data transfers between the cores. Main memories of 0.4, 0.6, 1.6, and 1.8 GB are assigned to OS#1, OS#2, OS#3, and PCI, respectively, for a total of 4.4 GB. The details of the prototype system are described in Chap. 6.
Table 4.7 lists the total performance and power consumption at 1.15 V when the eight CPU cores, four FE–GAs, and two MX-2s are used at the same time. The power efficiency of the CPU cores, MX-2s, and FE–GAs reached 49.1 GFLOPS/W, 45.6 GOPS/W, and 37.1 GOPS/W, respectively. The power consumption of the other components was reduced to 0.40 W by clock gating 31 out of 44 modules. In total, counting 1 GFLOPS as 1 GOPS, the RP-X achieved 37.3 GOPS/W at 1.15 V, excluding I/O area power consumption.

Chapter 5
Software Environments

5.1 Linux® on Multicore Processor

Linux¹ is one of the operating systems capable of symmetric multiprocessing (SMP). In this section, we describe the work of porting SMP Linux to the RP-1, RP-2, and RP-X multicore chips, which are explained in Chap. 4, and of extending Linux to exploit the new processors' features. Each chip has the following distinctive features:
• RP-1: the first SMP-ready multicore processor of the SuperH™² (SH) architecture
• RP-2: the second SMP-ready multicore processor, which has enhanced power-saving features
• RP-X: an SMP-ready multicore processor with 40-bit physical addressing
We have extended Linux in the following steps:
• Porting SMP Linux to the RP-1
• Extending a power-saving feature of Linux for the RP-2
• Developing a physical address extension feature of Linux for the RP-X
The details of this work are described in the following subsections.

5.1.1 Porting SMP Linux

5.1.1.1 Introduction

Linux source code consists of architecture-dependent and architecture-independent parts. The architecture-independent part is common among all processor architectures.

¹ Linux® is the registered trademark of Linus Torvalds in the USA and other countries.
² SuperH™ is a trademark of Renesas Electronics.


Table 5.1 Atomic operations of Linux

  Operation type               Details
  Spinlock                     Lock operation using a busy loop
  Read/write lock              Lock operation for read/write access
  Bit operation                Atomic set and clear of bits in a variable
  Add/dec operation            Atomic increment/decrement of a variable
  Compare/exchange operation   Atomic compare and exchange of two values

To port SMP Linux to a new processor architecture, it is necessary to implement the architecture-dependent part related to SMP, which amounts to about 5 K lines of source code in Linux version 2.6.16. The major parts to be implemented are as follows [1]:
• Boot sequence
• Timer functions
• CPU cache controls
• TLB (translation look-aside buffer) controls
• Inter-core communications
• Atomic operations
Atomic operations are key primitives to support an SMP-type operating system.
They enable exclusive access to shared hardware resources such as memories and
I/O devices. They also strongly affect SMP performance. Therefore, we mainly
describe the implementation and evaluation of atomic operations.
SMP Linux uses the atomic operations described in Table 5.1.
In the conventional Linux for a single CPU, all atomic operations are realized by
disabling interrupts to the CPU core. However, SMP Linux cannot use this method.
For example, while one CPU core disables interrupts for exclusive access to shared
objects, other CPU cores continue to run and touch them. To avoid such illegal
access, SMP Linux must use special CPU instructions to ensure atomic access
among CPU cores.
The RP-1 has two types of CPU instructions for atomic operations: the TAS (test-and-set) instruction and the LL/SC (load-linked/store-conditional) instructions. The TAS instruction is used in the thread library of the conventional single-CPU Linux. The LL/SC instructions were newly added to the RP-1 to support an SMP-type operating system:
1. TAS instruction.
The TAS instruction reads the data indicated by the address and sets the T-bit in the status register to 1 if that datum is zero, or clears the T-bit to 0 if it is nonzero. The instruction then sets bit 7 of the data to 1 and writes it back to the same address. The bus is not released during this period.
2. LL/SC instructions.
The LL instruction is used in combination with an SC instruction to realize an
atomic read-modify-write operation. This instruction sets the LL/SC private flag to
1 and reads 4-byte data into a general register. If, however, an interrupt, an exception, or another core's data access occurs, the LL/SC private flag is cleared to 0.
The store by the SC instruction proceeds only when the instruction is executed after the LL/SC private flag has been set by the LL instruction and has not been cleared by an interrupt, another exception, or another core's data access. When the LL/SC private flag has been cleared to 0, the SC instruction clears the T-bit in the status register and does not perform its store.
The difference between the TAS instruction and the LL/SC instructions is summarized in Table 5.2.

Table 5.2 Difference between TAS and LL/SC

                       Compared data size         Bus lock
  TAS instruction      Binary (zero or nonzero)   Used
  LL/SC instructions   32 bits                    Not used

5.1.1.2 Implementation

We describe the atomic add operation using the LL/SC instructions. The atomic add operation takes two arguments: an address (passed by reference) and a value (passed by value). The address argument specifies the address of the variable to be accessed, and the value argument is an immediate value to be added to that variable. The sequence of the atomic add is as follows (a code sketch follows the list):
1. Load the data referenced by the address argument to a temporary register using
the LL instruction.
2. Add the value passed by the value argument to the temporary register using a
normal add instruction.
3. Try to store the value of the temporary register to the referenced address using
the SC instruction.
4. Check the condition of the SC instruction’s result (T-bit in status register). If
another core’s data access occurs between (1) and (3), the SC instruction will
fail, and the value will not be stored to the address. In this case, the above
sequence is retried from the first step.
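As a concrete illustration, the following is a minimal sketch of this sequence in GNU C inline assembly, modeled on the Linux/SH style of implementation; movli.l and movco.l are the SH-4A LL and SC instructions, while the function name and exact register constraints are our own:

    /* Sketch of the atomic add using LL/SC; "i" and "v" correspond to the
     * value and address arguments described in the text. */
    static inline void atomic_add_llsc(int i, volatile int *v)
    {
        int tmp;

        __asm__ __volatile__(
            "1: movli.l @%2, %0 \n\t"  /* (1) LL: load *v into R0, set the LL/SC flag */
            "   add     %1, %0  \n\t"  /* (2) add the value in the register           */
            "   movco.l %0, @%2 \n\t"  /* (3) SC: store only if the flag survived     */
            "   bf      1b"            /* (4) T-bit clear means the SC failed: retry  */
            : "=&z" (tmp)              /* "z" = R0, which movli.l/movco.l require     */
            : "r" (i), "r" (v)
            : "t", "memory");
    }

In the common uncontended case the loop body executes exactly once, which is why the LL/SC version needs only four CPU instructions (see Table 5.7 below).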

5.1.1.3 Evaluation

We evaluated the SMP Linux performance of each implementation, one using the TAS atomic operations and one using the LL/SC atomic operations. We used LMBench [2], which is widely used as a benchmark program on UNIX-like systems, to compare the performance. The Linux version on which we implemented the SMP extension is shown in Table 5.3. To compare the two implementations, we calculated the performance ratio using the following formula:

Performance ratio [%] = (TAS version / LL/SC version) × 100

A ratio above 100% therefore means that the LL/SC version is faster.



Table 5.3 Software versions

  Software   Version number
  Linux      2.6.16
  glibc      2.3.3

Table 5.4 Process-related performance

               TAS (µs)   LL/SC (µs)   Performance ratio (%)
  Null call    0.44       0.44         100
  Null I/O     0.83       0.83         100
  Stat         12.20      8.69         140
  Open/close   16.4       11.6         141
  Select TCP   59.3       42.4         140
  Sig inst     1.91       1.81         106
  Sig handle   92.1       84.8         109
  Fork proc    3,932      3,578        110
  Exec proc    10,801     9,981        108
  Shell proc   32 k       30 k         107

Table 5.5 Inter-process communication performance

             TAS (µs)   LL/SC (µs)   Performance ratio (%)
  Pipe       17.5       14.3         122
  AF UNIX    42         36.3         116
  UDP        62.8       60.6         104
  TCP        115        88.4         130
  TCP conn   289        241          120

Table 5.6 File system and virtual memory related performance

                     TAS (µs)   LL/SC (µs)   Performance ratio (%)
  0 K file create    57.6       41.8         138
  0 K file delete    32.6       24.4         134
  10 K file create   281.3      238.5        118
  10 K file delete   75.1       52.7         143
  Mmap latency       4,904      4,418        111
  Page fault         50         45.3         110
  100 fds select     39.5       25.1         157

The results are presented in Tables 5.4–5.6. In summary, SMP Linux using the LL/SC atomic operations achieves better performance than SMP Linux using the TAS atomic operations in all of the LMBench results.

Table 5.7 Number of CPU instructions to implement atomic add

                             Number of CPU instructions
  Using TAS instruction      16
  Using LL/SC instructions   4

5.1.1.4 Considerations

As described before, the LL/SC instructions do not lock the bus. On the other hand,
the TAS instruction locks the bus and causes huge overhead. Moreover, atomic oper-
ation using the TAS instruction requires complex implementation because the TAS
instruction only compares binary data (zero or nonzero). For example, the number of
CPU instructions to implement an atomic add operation is shown in Table 5.7.
The atomic add using TAS requires four times the number of CPU instructions com-
pared with LL/SC. This is also the case with other atomic operations. LMBench
results show this overhead. The advantage of LL/SC compared to TAS increases with
the number of atomic operations used in the benchmark [3].
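To see where the extra instructions come from, consider the following hedged sketch of a TAS-based atomic add; because tas.b can only test and set a byte against zero, the variable has to be guarded by a separate lock byte (the lock variable and function names are ours):

    /* Illustrative only: a lock byte guards the variable, since the TAS
     * instruction cannot add to a word by itself. */
    static volatile unsigned char tas_lock;   /* 0 = free, bit 7 set = held */

    static inline void atomic_add_tas(int i, volatile int *v)
    {
        unsigned int acquired;

        do {                                  /* spin until the lock is ours */
            __asm__ __volatile__(
                "tas.b @%1 \n\t"              /* T=1 if the byte was 0; sets bit 7 */
                "movt  %0"                    /* acquired = T                      */
                : "=r" (acquired)
                : "r" (&tas_lock)
                : "t", "memory");
        } while (!acquired);

        *v += i;                              /* the add itself, now exclusive */

        tas_lock = 0;                         /* release the lock */
    }

Each tas.b also locks the bus for its read-modify-write, which is the source of the overhead discussed above.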

5.1.2 Power-Saving Features

5.1.2.1 Introduction

RP-2 is a multicore chip with the following enhanced power-saving features:
• Power on/off control of each core
• Frequency control of each core
• Voltage control of the chip
Linux already has the power-saving frameworks CPU hot-plug and CPUfreq for multicore processors [4, 5], but these frameworks have the following problems:
• No coordination between CPU hot-plug and CPUfreq: CPUfreq has governors that control voltage and frequency dynamically according to system load, but CPU hot-plug has no such feature.
• No coordination between voltage control and core frequency control: the input voltage to the chip limits the maximum frequency of the RP-2's CPU cores, so each core's frequency must be controlled in coordination with the input voltage, but no CPUfreq governor has such a feature.
To resolve these issues, we have developed a new power-saving framework called
“idle reduction” based on CPU hot-plug and CPUfreq.

5.1.2.2 Implementation

Figure 5.1 and Table 5.8 show the structure of the idle reduction framework.
Fig. 5.1 Structure of the idle reduction framework

Table 5.8 Components of idle reduction framework

  Component              Details
  Idle reduction         Power-saving manager
  procfs                 File system to get kernel status
  sysfs                  File system to set kernel parameters
  CPU hot-plug           Framework for core hot-plug control
  CPU hot add            Power-on control of each core (architecture-dependent part)
  CPU hot remove         Power-off control of each core (architecture-dependent part)
  CPUfreq (userspace)    Framework for CPU frequency control
  CPU frequency change   Frequency control of each core (architecture-dependent part)

Table 5.9 CPUfreq governors

  Governor       Details
  On demand      Controls core frequency to adapt it to each core's load
  Conservative   Same as on demand, but controls core frequency conservatively
  Powersave      Uses minimum core frequency at all times
  Performance    Uses maximum core frequency at all times
  Userspace      Controls core frequency by user commands

Figure 5.1 shows the major components of idle reduction, which reduces power consumption by coordinating CPUfreq and CPU hot-plug. CPUfreq is driven through its userspace governor; the available governors are listed in Table 5.9.
An example of idle reduction behavior is as follows (see the sketch after this paragraph). If the multicore chip has no execution load, idle reduction forces CPU hot remove on all cores except the primary core and automatically drops the primary core's frequency to the minimum. After that, if two threads become runnable, idle reduction forces CPU hot add on one core (so that two cores are alive) and raises the core frequencies step by step.
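For illustration, the per-core controls that idle reduction drives can also be exercised from user space through the standard sysfs interfaces of CPU hot-plug and the CPUfreq userspace governor; the sketch below uses these generic Linux interfaces, not an RP-2-specific API:

    #include <stdio.h>

    /* Write a string to a sysfs file; returns 0 on success. */
    static int write_sysfs(const char *path, const char *value)
    {
        FILE *f = fopen(path, "w");
        if (f == NULL)
            return -1;
        fprintf(f, "%s\n", value);
        fclose(f);
        return 0;
    }

    int main(void)
    {
        /* Hot-remove CPU3, i.e., stop and power it off. */
        write_sysfs("/sys/devices/system/cpu/cpu3/online", "0");

        /* Select the userspace governor on CPU0 and drop it to 75 MHz
         * (CPUfreq expects the frequency in kHz). */
        write_sysfs("/sys/devices/system/cpu/cpu0/cpufreq/scaling_governor",
                    "userspace");
        write_sysfs("/sys/devices/system/cpu/cpu0/cpufreq/scaling_setspeed",
                    "75000");
        return 0;
    }

Idle reduction performs the equivalent operations from inside the kernel, coordinating them with the chip's voltage control.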

5.1.2.3 Evaluation

An evaluation was done in the following steps: first, we evaluated the instantaneous power consumption under CPU hot-plug and CPUfreq control, and second, we evaluated the total energy consumption of idle reduction under application loads. The Linux version on which we implemented idle reduction is given in Table 5.10.

Table 5.10 Software versions

  Software   Version number
  Linux      2.6.16
  glibc      2.3.5

Fig. 5.2 Power consumption with CPU hot-plug and CPUfreq control

1. Power consumption under CPU hot-plug and CPUfreq.
The total power consumption of up to four CPU cores of the RP-2 using CPUfreq and CPU hot add/remove is shown in Fig. 5.2. Each CPU core executes the Dhrystone 2.1 program at 75, 150, 300, or 600 MHz as controlled by CPUfreq, or is stopped and powered off using CPU hot remove.
This figure shows configurations whose total frequency, which translates into total instructions per second (IPS), is the same but whose power consumption differs. For example, if the core frequencies are set to 600, 300, 150, and 150 MHz for a total of 1,200 MHz, the power consumption is about 3.2 W with a chip voltage of 1.4 V. If instead all four cores are set to 300 MHz, again a total of 1,200 MHz, the power consumption is about 1.9 W with a chip voltage of 1.2 V [6]. The difference arises because dynamic power scales roughly as αCV²f: running every core at the lower frequency permits a lower supply voltage for the whole chip.
2. Energy consumption of idle reduction.
We evaluated the energy consumption of idle reduction using a multi-threaded benchmark (SPLASH-2 RAYTRACE) [7] and compared it with the existing governors.
Table 5.11 and Fig. 5.3 show the energy consumption with no load (all cores in an idle state). The table compares 10 s of energy consumption under each governor. Idle reduction clearly had the lowest energy consumption, and powersave had the second lowest. Idle reduction was able to reduce energy consumption by 16% compared to the powersave governor.

Table 5.11 Energy consumption with no load (10 s)

  Governor         Energy consumption (Ws)   Comparison with the performance governor (%)   Core condition
  Idle reduction   5.2                        26                                             75 MHz with sleep instruction × 1, CPU hot remove × 3
  On demand        6.3                        31.5                                           75 MHz with sleep instruction × 4
  Conservative     6.3                        31.5                                           75 MHz with sleep instruction × 4
  Powersave        6.2                        31                                             75 MHz with sleep instruction × 4
  Performance      20                         100                                            600 MHz with sleep instruction × 4

Fig. 5.3 Comparison of energy consumption

Table 5.12 lists the energy consumption for the two-thread RAYTRACE benchmark, in which two cores have high loads (executing threads) and the other two cores have no load (sleep or CPU hot remove). Idle reduction has the lowest energy consumption (103.3 Ws), and the conservative governor has the second lowest (111.1 Ws). Idle reduction was able to reduce energy consumption by 7% compared to the conservative governor.

Table 5.12 Energy consumption with mixed loads

  Governor         Energy consumption (Ws)   Comparison with performance governor (%)   Execution time (s)   Average power consumption (W)
  Idle reduction   103.3                      89                                          52                   2
  On demand        121.6                      104.7                                       52                   2.3
  Conservative     111.1                      95.7                                        73                   1.5
  Powersave        126.2                      108.7                                       187                  0.7
  Performance      116.1                      100                                         44                   2.6

Fig. 5.4 Elapsed time and energy consumption with mixed loads

Figure 5.4 plots energy consumption against elapsed time; the curves also show the execution time of the RAYTRACE benchmark under each governor. Idle reduction shows very good performance per unit of energy consumed [8].

5.1.3 Physical Address Extension

5.1.3.1 Introduction

The heterogeneous multicore RP-X described in Sect. 4.4 has the physical address extension (AE) feature, which extends its physical address to 40 bits. We decided to extend the Linux HIGHMEM framework to support AE on the RP-X.
1. Address Extension
Even if a CPU core has only a 32-bit virtual address space, it can access a physical address space of over 4 GB with the AE feature. This enables the use of more than 4 GB of memory without having to modify the application software.
2. HIGHMEM
Figure 5.5 gives an overview of the Linux HIGHMEM framework. In the figure, direct translation means that H/W translates a virtual address to a physical address; indirect translation means that S/W manages the address translation, and a virtual address is translated to a physical address on a memory-page basis using a translation look-aside buffer (TLB).
The HIGHMEM framework separates the physical memory into two regions: straight-mapped memory, corresponding to the straight-mapped area in the
virtual address space, and HIGHMEM-mapped memory, corresponding to the HIGHMEM area. The size of the straight-mapped memory is the same as that of the straight-mapped area, and this fixed translation is done by H/W. When the Linux kernel uses the straight-mapped memory, it accesses the straight-mapped area of the virtual address space; when it uses the HIGHMEM-mapped memory, it accesses the HIGHMEM area. Because the HIGHMEM area of the virtual address space is smaller than the HIGHMEM-mapped memory, the Linux kernel manages the address translation with a page translation table. Through this mechanism, the Linux kernel is able to use the entire physical memory.

Fig. 5.5 RP-X's Linux address space with HIGHMEM framework
The HIGHMEM framework for the RP-X has been newly implemented. In this case, we needed to implement the HIGHMEM framework without any modification of application software (modifying all application software is unrealistic because there is a large amount of it). To address this design issue, we separated the software components into three layers: a kernel layer, a driver layer, and an application layer.
3. Kernel Layer
To implement the HIGHMEM framework, it is necessary to place the HIGHMEM area in the virtual address space (Fig. 5.5). To place this area, we need to modify a memory access function that is architecture dependent.
4. Driver Layer
In the driver layer, we need to consider two device spaces: the I/O space and the memory-mapped I/O space. The I/O space is not related to the virtual address space because dedicated CPU instructions are used to access it. Therefore, I/O-space drivers can be used without any modifications.
The SH architecture does not have an I/O space, so we do not need to worry about it. The address of the memory-mapped I/O space must not be changed, regardless of the existence or nonexistence of HIGHMEM, in order to keep the drivers' source code compatible. However, if I/O devices use DMA, which accesses the physical address space directly, it is necessary to modify the DMA physical addresses in the drivers' source code.
5. Application Layer
By implementing the above design, application programs can run without any modifications.
In summary, we have designed the new HIGHMEM framework according to the following policy:
a. Place the HIGHMEM area in the virtual address space.
b. Leave the memory-mapped I/O area as it is.
c. Leave the user area as it is.

Fig. 5.6 Calling sequence of HIGHMEM memory access

5.1.3.2 Implementation

The calling sequence of physical page allocation to use the HIGHMEM area is shown in Fig. 5.6. This function returns the physical page address. In step (1), the physical page allocator calls the HIGHMEM area allocator to access the HIGHMEM area. In step (2), the HIGHMEM area allocator allocates a virtual address for a HIGHMEM page and calls the page table entry (PTE) updater. In step (3), the PTE updater updates the PTE. In step (5), a TLB miss occurs when Linux accesses a virtual address that is not registered in the TLB. At this time, the TLB
miss resolver calls the physical page allocator and, in step (4), calculates the physical address from the TLB-missed virtual address and the PTE. In step (6), the TLB miss resolver registers the combination of the virtual address and the physical address in the TLB.
As just described, the HIGHMEM feature implemented in the Linux kernel saves the combination of a physical address (over 32 bits) and a virtual address (under 32 bits) in the PTE, and the TLB is updated when a memory access violation occurs (Table 5.13).

Table 5.13 Components of HIGHMEM framework

  Component                             Details
  Physical page allocation              Allocates a physical page at the kernel's request
  HIGHMEM area allocation               Allocates a virtual address for HIGHMEM
  Memory access                         Accesses a virtual address allocated in a previous step
  Update PTE                            Registers the combination of a virtual address and a physical address in the PTE
  Resolve TLB miss                      On a memory access violation, obtains the combination of the (missed) virtual address and the physical address from the PTE and updates the TLB
  PTE (page table entry)                Software table for translation between physical and virtual addresses
  TLB (translation look-aside buffer)   Hardware table for translation between physical and virtual addresses
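From a kernel programmer's point of view, the machinery above is hidden behind the standard Linux HIGHMEM API; the following minimal sketch shows the usual calling pattern (alloc_page, kmap, and kunmap are the standard kernel interfaces, while the function itself is our example):

    #include <linux/gfp.h>
    #include <linux/highmem.h>

    /* Allocate a page that may live above the straight-mapped region,
     * map it into the HIGHMEM area, touch it, and release the mapping. */
    static void highmem_example(void)
    {
        struct page *page = alloc_page(GFP_HIGHUSER);  /* physical page */
        unsigned char *va;

        if (page == NULL)
            return;

        va = kmap(page);   /* steps (1)-(3): a HIGHMEM virtual address is
                            * allocated and the PTE is updated            */
        va[0] = 0xa5;      /* steps (5)-(6): a first access may miss in the
                            * TLB, which is then refilled from the PTE    */
        kunmap(page);      /* release the HIGHMEM virtual address         */
        __free_page(page);
    }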

5.1.3.3 Evaluation

Our HIGHMEM framework was evaluated to determine whether the existing applications and drivers could run without any modifications. The Linux version on which we extended the HIGHMEM framework is shown in Table 5.14.

Table 5.14 Software versions

  Software   Version number
  Linux      2.6.27
  glibc      2.3.5
To evaluate the applications' compatibility, LMBench and IOzone [9] were used for testing. Both benchmarks are suitable for verifying that Linux runs correctly. Development tools such as the cross compiler and cross linker were also tested using these benchmarks.
To maintain driver compatibility, none of the devices on the RP-X needed changes to their drivers' source code except for the serial-ATA driver, which needed a change in its DMA access function. This is because the RP-X's serial-ATA device cannot handle physical addresses over 32 bits, whereas Linux HIGHMEM may allocate a physical address for DMA access above 32 bits. This limitation is not RP-X specific; the same issue occurs when supporting PCI or PCI-Express devices that are limited to 32-bit DMA addressing [10].
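The standard way for a driver to declare such a limit is the Linux DMA-mapping API, as in the following sketch; the function is our example, but dma_set_mask and DMA_BIT_MASK are the standard kernel interfaces:

    #include <linux/dma-mapping.h>

    /* Tell the DMA-mapping layer that this device can address only the low
     * 32 bits; the kernel then keeps (or bounces) DMA buffers below 4 GB
     * even when HIGHMEM pages above the 32-bit boundary exist. */
    static int serial_ata_limit_dma(struct device *dev)
    {
        return dma_set_mask(dev, DMA_BIT_MASK(32));
    }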

5.2 Domain-Partitioning System

5.2.1 Introduction

The application fields of embedded systems are rapidly expanding, and the func-
tionality and complexity of these systems are increasing dramatically [11]. Today’s
embedded systems require not only real-time control functions of traditional embed-
ded systems but also IT functions, such as multimedia computing, multiband net-
work connectivity, and extensive processing for database transactions. Facilitating
embedded systems’ many requirements calls for new system architectures.
One approach for designing system architectures is to integrate multiple operat-
ing systems on a multicore processor. In this approach,
• Heterogeneous operating systems run different types of applications within the
multicore processor.
• A real-time operating system delivers real-time behavior such as low latency and
predictable control function performance.
• A versatile operating system processes applications developed for IT systems.
However, this system architecture has a drawback. An unintentional failure of one
operating system could overwrite important data and codes and bring down not only
that operating system but others as well. This can occur because a CPU core, which
executes operating system codes, can access any hardware resource on a multicore
processor. We therefore need a partitioning mechanism to isolate any unintentional
operating system failure within a domain to prevent it from affecting systems in other
domains. A domain is a virtual resource-management entity that executes operating
system codes in a multi-operating system integrated on a multicore processor.
System engineers have developed several partitioning mechanisms for servers
and high-end desktop systems [12, 13]. However, these mechanisms are unsuitable
for an embedded multicore processor equipped with only the minimally required
resources. Rather, they are for multiprocessor systems in which many processors
share large amounts of memory and I/O devices. The mechanisms cannot divide a
small memory system into areas nor segment a device into groups of channels to be
assigned to multidomains on the multicore processor. We have therefore developed
a low-overhead domain-partitioning mechanism for a multidomain embedded sys-
tem architecture that protects a domain from being affected by other domains on an
embedded multicore processor. Additionally, we fabricated a multicore processor
that incorporates a physical partitioning controller (PPC), which is a hardware sup-
port for the domain-partitioning mechanism.

5.2.2 Trends in Embedded Systems

Embedded systems were originally cost-sensitive control systems with a fixed function—namely, to make the machine or mechanical parts operate in a specific and safe way to meet real-time performance constraints for reasons such as safety and usability. Embedded processors have the following characteristics:
• A simple CPU optimized for code efficiency and low power.
• Low and predictable interrupt latency to support real-time control.
• Occasional use of application-specific processors such as digital signal proces-
sors and media processors.
• Integration of on-chip ROM, RAM, and peripheral I/O devices to reduce system
cost.
Newer embedded systems are going to incorporate IT as well as control functions
[14], and heterogeneous multicore processors will be widely used in embedded sys-
tems soon because it is difficult for a single-processor architecture to support opera-
tion throughput for both IT functions and real-time response for control functions.
Automobiles are increasingly using embedded electronic subsystems to control
mechanical parts such as the engine, transmission, brakes, and steering. Each subsys-
tem has an electronic control unit (ECU) and a control application, which usually
have a one-to-one relationship. Manufacturers enhance and network these subsys-
tems together with the car area network (CAN) to implement new functions that
maximize control and safety, such as antilock braking systems, electronic stability
control, traction control, and automatic four-wheel drive. Furthermore, navigation
systems and telematics, which provide IT functions such as databases and network-
ing, will soon be integrated into car control systems, and control and IT functions
will cooperate to increase automobiles’ safety, security, comfort, and usability.
Because of the dramatically increasing complexity and functionality of embed-
ded systems, their development could cost billions of dollars and take a decade to
complete. Through software reuse [15], organizations are attempting to save devel-
opment time and energy. Engineers have a library of software modules, many of
which they use in multiple applications. Modern embedded systems will be based
on two-domain system architectures: a real-time domain, consisting of real-time
control applications on a real-time operating system (RTOS) and an IT domain,
consisting of IT applications on versatile operating systems.

5.2.3 Programming Model on Multicore Processors

Figure 5.7 shows our multidomain system, built on a multicore processor. The system
architecture lets the designer assign domains to the different CPU cores and imple-
ment them independently in each core. Applications and operating systems in both
domains are largely unaware of each other. Domains might exchange information and
coordinate tasks, but there is no dynamic load balancing. The task assignment is fixed, and hardware resources can be dedicated to the domains, resulting in more deterministic performance. Despite some possible memory overhead due to multiple operating-system images in the main memory, this feature is one of the system architecture's most significant advantages for embedded system developers.

Fig. 5.7 Multidomain system architecture for embedded multicore processors
As the size and complexity of embedded systems increase, so do the chances that a
system will break down because of software malfunctions or attacks over the network.
Although operating systems isolate software failures within an application, a failure
could affect the operating system itself, causing it to bring down all applications run-
ning on it because operating systems, especially versatile ones, are becoming larger
and more complex.
In developing control subsystems whose failure might endanger a person’s life,
such as an automobile’s brake control system, an engineer tries to achieve a high
level of safety by every conceivable means. However, even a safe and secure control
subsystem can be affected if it is incorporated with IT subsystems into a multido-
main system on a multicore processor.
Our domain-partitioning approach helps to isolate failures within unreliable IT
domains rather than let them affect control domains on the multidomain embedded
system. This domain partitioning protects a domain from being affected by other
domains in the multidomain system and maintains the system’s safety and security by
• Allocating multicore processor resources for each domain to let it run its own
operating system and applications [16]
• Protecting a domain from the effects of software failure in other domains and
ensuring that only the domain causing the failure is affected
• Resetting the domain and rebooting its operating system without letting the other domains observe any of the failure's effects
• Lowering the system performance overhead of implementing fault isolation to less than 5%

Fig. 5.8 Physical partitioning

5.2.4 Partitioning of Multicore Processor Systems

Partitioning techniques with hardware support fall into two categories: physical par-
titioning and logical partitioning [12].

5.2.4.1 Physical Partitioning

With physical partitioning, each domain uses dedicated processor resources. In the
multidomain system in Fig. 5.8, system designers allocate each CPU core and each
group of channels in multichannel devices—DMA controllers (DMAC), timer units
(TMU), and serial communication interfaces (SCIF)—and other devices, such as
the display unit (DU), PCI, and general-purpose I/O (GPIO), to one of the domains.
Each allocated resource is physically distinct from the resources used by the other
domain. Although the domains share the on-chip system bus, each transaction is dedicated to a domain; this prevents one domain from interfering with the other's transactions in any respect other than bandwidth.
In physical partitioning, each partition’s configuration—that is, the resources
assigned to a domain—is controlled in the hardware (such as the partition con-
troller in Fig. 5.8), because physical partitioning does not require sophisticated
algorithms to schedule and manage resources. When the system boots up, the
partition controller sets up the hardware resources to use in a partition according

to partition-configuration commands. Once a partition is configured, the operating systems are loaded into each partition, and a domain on each partition starts
to run the operating system and applications. The partition controller checks
every access request for hardware resources that the operating system and appli-
cations generate and invalidates any unauthorized requests.
Failure isolation and predictable resource allocation are the two most important
goals of physical partitioning. Although physical partitioning sacrifices flexibility in
allocating resources to partitions to achieve these goals, it is generally easy to imple-
ment with hardware and imposes little overhead on application execution.

5.2.4.2 Logical Partitioning

With logical partitioning, domains share some physical resources, usually in a time-
multiplexed manner. Thus, logical partitioning makes it possible to run multiple
operating system images on a single hardware system, which enables dynamic
workload balancing. Logical partitioning is used to implement virtual machines on
PC servers and mainframes to optimize utilization of hardware resources.
Logical partitioning is more flexible than physical partitioning but requires
additional mechanisms to provide the services needed to share resources safely and
efficiently. Usually, a hypervisor—that is, a programming layer lower than the
operating system and hidden from general system users (see Fig. 5.9)—controls
each partition’s configuration. When the system boots up, the hypervisor sets up
the hardware resources for use in a partition according to the partition-configuration
commands. Once a partition is configured, the hypervisor loads the operating sys-
tems and applications into each partition, and a domain on each partition starts to
run them. During the execution of the operating system and applications on the
partition, the hypervisor traps every hardware resource access request that the operating system and applications generate, in order to check their authenticity and to provide the requested resource services if authorized.

Fig. 5.9 Logical partitioning
Optimizing hardware utilization is one of the main goals of logical partitioning.
To achieve this goal, logical partitioning sacrifices the partition’s physical isolation
in exchange for greater flexibility in dynamically allocating resources to partitions.
It also imposes performance penalties because the hypervisor is implemented in
software layers.

5.2.4.3 Our Approach

In terms of hardware design simplicity, implementing the partition controller for physical partitioning of embedded processors involves a small amount of memory
and a simple logical circuit and does not require any architectural changes to the
CPU core. However, to mitigate logical partitioning’s original virtualization perfor-
mance overhead, logical partitioning imposes several challenges on the CPU core
architecture, such as introducing a new execution mode on the CPU core [17].
In terms of the simplicity of software development, physical partitioning
requires a new error-handling routine for the partition controller that can be imple-
mented as a simple interrupt handler in embedded systems and does not require the
modification of guest operating systems. However, logical partitioning requires a
hypervisor to arbitrate accesses to the underlying physical hardware resources so
that multiple guest operating systems can share them. Its paravirtualization
approach improves virtualization performance but requires modification of the
guest operating systems [18].
In typical multidomain embedded systems, each domain’s CPU, memory area,
and peripheral devices are physically different, and physical isolation and low over-
head are more important than partitioning flexibility. Therefore, we based our domain
partitioning on physical partitioning techniques and used the PPC hardware module
to implement physical isolation with low overhead.

5.2.5 Multicore Processor with Domain-Partitioning Mechanism

Figure 5.10 is a block diagram of the multicore processor we used to implement the
proposed domain-partitioning technique. The processor is a multicore chip contain-
ing four SH-4A processor cores, each of which is a 32-bit RISC microprocessor
containing an instruction cache, data cache in write-back mode, and memory man-
agement unit (MMU) with a translation look-aside buffer (TLB), which supports a
32-bit virtual address space. The SH-4A cores maintain consistency between data
caches and share instruction/data unified L2 cache in write-through mode. The pro-
cessor incorporates a DDR3-SDRAM memory controller (DBSC), local bus state

controller (LBSC) supporting connection to burst ROM, a DMAC, a sophisticated interrupt controller (INTC), and several on-chip peripherals including the display
unit (DU), Ethernet controller, general-purpose I/O (GPIO), and serial communica-
tion interfaces (SCIF).
These hardware modules are connected through an internal system bus and a
peripheral bus. SHPB and HPB are bus bridges that connect the internal system bus
with the peripheral bus. The CPU cores, display unit, Ether, and USB are initiator
modules that can request access to other modules connected to the internal system
bus. HPB, SHPB, LBSC, and DBSC are target modules that initiators can access
through the internal system bus. DMAC and PCI Express (PCIe) are initiator mod-
ules as well as target modules on the internal system bus. These hardware modules
are mapped onto the processor’s physical address space, which the initiators use to
gain access to processor resources.

Fig. 5.10 Block diagram of the multicore processor
In a multidomain system architecture, system designers assign domains to the
different CPU cores and processor resources, such as memory and peripherals.
They might also allocate shared memory resources that the domains can use to
communicate with each other. The operating system in each domain controls the
CPU cores and the other initiator modules so that they use the assigned processor
resources. Thus, the applications running on the operating system cannot use the
processor resources assigned to another domain. However, as we mentioned pre-
viously, the system could break down due to unintentional software malfunctions
in the operating system because the CPU core can access any processor resources.
An access-control mechanism in the multicore processor can help prevent such
access.

Fig. 5.11 Physical partitioning controller (PPC)

5.2.5.1 Access Control of Physical Partitioning Controller

The PPC is located between the access-initiator modules and the access-target modules. It checks every access request and blocks requests that are not authentic. The PPC contains an access checklist (ACL) that holds the access-authorization rules, and the ACL defines the processor's partition configuration. The ACL consists of several register entries, each having three fields:
• an SRC field, which specifies an access initiator;
• a DEST field, which specifies an access target; and
• an AUTH field, which specifies the operations authorized for that SRC/DEST pair.
The processor has multiple-channel devices, such as a DMAC, PCIe, TMU,
SCIF, audio codec I/F (HAC), serial sound I/F (SSI), and I2C. The PPC segments
them into groups of channels and recognizes each group as a separate module so
that each domain uses one group’s function exclusively. The PPC also segments
RAM and ROM into several memory areas so that a domain can use them as private
memory. Moreover, several domains can access a shared RAM area to communicate
with each other. Therefore, the PPC recognizes each initiator and target, indicated
in Fig. 5.11, as separate modules.

Fig. 5.12 Access control of physical partitioning controller

We assigned an SrcID to each initiator module and used a control register address
as an identifier for the DEST field. The size of the address range should be a power
of 2, and its start address should be a multiple of the alignment, which must be a
power of 2 and a multiple of the size of the address range.

5.2.5.2 Implementation of PPC

Figure 5.12 shows the PPC structure. The SRC field consists of an SrcID and an SrcID mask; the DEST field consists of an address and an address mask; and the AUTH field consists of two bits, one for read permission and one for write permission. The PPC checks every access request by comparing the set consisting of the operation, target address, and SrcID with all ACL entries, using the logical circuit shown in the figure. When the PPC finds a matching ACL entry, it authorizes the access request and passes it to the target module. When it finds no match, it does not authorize the access request; instead, it blocks the request and generates a deny signal to start error handling.
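To make the matching rule concrete, the following is a small software model of the check, our own C rendering of Fig. 5.12 rather than the hardware's actual register interface:

    #include <stdbool.h>
    #include <stdint.h>

    struct acl_entry {
        uint32_t addr;        /* DEST: base address of the target range */
        uint32_t addr_mask;   /* DEST: mask selecting the compared bits */
        uint8_t  srcid;       /* SRC: initiator identifier              */
        uint8_t  srcid_mask;  /* SRC: mask for grouped initiators       */
        bool     read_ok;     /* AUTH: read permission                  */
        bool     write_ok;    /* AUTH: write permission                 */
    };

    /* Returns true if any ACL entry authorizes the request; otherwise the
     * PPC blocks the access and raises the deny signal. */
    static bool ppc_authorize(const struct acl_entry *acl, int n,
                              uint32_t addr, uint8_t srcid, bool is_write)
    {
        for (int i = 0; i < n; i++) {
            const struct acl_entry *e = &acl[i];

            if ((addr & e->addr_mask) != (e->addr & e->addr_mask))
                continue;   /* DEST field does not match */
            if ((srcid & e->srcid_mask) != (e->srcid & e->srcid_mask))
                continue;   /* SRC field does not match  */
            if (is_write ? e->write_ok : e->read_ok)
                return true;
        }
        return false;       /* no match: deny and start error handling */
    }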
The PPC modules are located between the internal system bus and the bus-target
modules—that is, the DBSC, SHPB and HPB bus bridges, PCIe, and DMAC (see
Fig. 5.13). The PPC has six subblocks—DBSC-PPC, LBSC-PPC, SHPB-PPC,
HPB-PPC, DMAC-PPC, and PCI-PPC—each of which has its own set of registers

and ACLs (see Fig. 5.13). Table 5.15 lists the number of ACL entries of each PPC subblock. For SHPB-PPC, HPB-PPC, DMAC-PPC, and PCI-PPC, the initiators are the CPU cores and the targets are the modules connected behind the bus-target modules; therefore, one ACL entry is needed per target to authenticate accesses from the CPU cores of a domain. For LBSC-PPC, the CPU cores access two ROM areas, each allocated to a domain; therefore, LBSC-PPC needs an ACL entry for each ROM area. For DBSC-PPC, the initiators (two CPU cores and nine initiator modules) access two dedicated RAM areas and a shared RAM area; therefore, this subblock needs 11 entries for the dedicated RAM areas and 11 entries for the shared RAM area.

Fig. 5.13 Implementation of the physical partitioning controller

Table 5.15 Number of access checklist (ACL) entries for each physical partitioning controller

  PPC        Initiators                         Targets                              ACLs needed   ACLs implemented
  LBSC-PPC   2 CPU cores                        2 ROM areas                          2             4
  DBSC-PPC   2 CPU cores, 9 initiator modules   2 RAM areas allocated to a domain,   11 + 11       32
                                                1 RAM area shared by domains
  SHPB-PPC   2 CPU cores                        4 target modules                     4             8
  HPB-PPC    2 CPU cores                        27 target modules                    27            32
  DMAC-PPC   2 CPU cores                        2 target modules                     2             4
  PCI-PPC    2 CPU cores                        3 target modules                     3             4

5.2.5.3 PPC Error Handling

When the matching conditions described above are not satisfied, the PPC judges the access to be inauthentic and rejects it. The PPC then sends an error response to the internal system bus instead of passing the access request to the target module. The PPC also generates an access-violation interrupt signal, which is transmitted to the INTC.
The interrupt controller (INTC) prioritizes interrupt sources and controls the
flow of interrupt requests to the CPU. The INTC has registers for prioritizing each
interrupt, and it processes interrupt requests following the priority order set in
these registers by the program. Most of these registers are system registers, so they
cannot be physically partitioned. Therefore, we assumed that the real-time domain
would be more reliable than the IT domain, so we decided that CPU #0, which
houses the real-time domain, should be allowed to access the INTC registers and
that the IT domain should send requests to the real-time domain for operation on
the registers.
When the INTC receives an access-violation interrupt signal, the execution
jumps to the start address of the PPC error-handling routine. Each PPC subblock has
registers that determine the access-violation interrupt signal’s behavior and hold the
information on rejected access requests. Based on this information, the PPC error-
handling routine classifies the access violation’s seriousness and decides whether
the system should be rebooted.
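The flow of that routine can be sketched as follows; the accessor and action names (ppc_read_violation_info, reboot_domain, and so on) are hypothetical stand-ins for the subblock registers and system services, not documented interfaces of the chip:

    #include <stdbool.h>
    #include <stdint.h>

    /* Hypothetical record of a rejected access, read from PPC registers. */
    struct ppc_violation {
        uint32_t addr;     /* target address of the rejected request */
        uint8_t  srcid;    /* initiator that issued it               */
        bool     serious;  /* classified as serious by system policy */
    };

    extern void ppc_read_violation_info(struct ppc_violation *v);
    extern void log_violation(uint8_t srcid, uint32_t addr);
    extern void reboot_domain(void);

    /* Invoked through the INTC on an access-violation interrupt. */
    void ppc_error_handler(void)
    {
        struct ppc_violation v;

        ppc_read_violation_info(&v);   /* read the rejected-access information */
        log_violation(v.srcid, v.addr);

        if (v.serious)
            reboot_domain();           /* reset and reboot the faulting domain */
        /* Otherwise simply return: the initiator has already received an
         * error response on the internal system bus. */
    }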

5.2.6 Evaluation

We implemented the embedded multicore processor, "RP-1," as an experimental chip using 90-nm CMOS process technology with a typical-case design methodology, operating at a 600-MHz clock frequency. The PPC design is so simple that its impact on chip area was negligible, and we found no critical path due to the PPC in the timing analysis during the chip's development.
To evaluate the performance overhead of PPC-based domain partitioning, we used
the LMBench [19] benchmark program and compared bare Linux with Linux on
top of a PPC error handler, run on the multicore processor. In this evaluation,
we implemented the PPC error handler as an interrupt/exception service routine that
handled the access-violation error and PPC-related system calls. So that we did not
have to modify Linux, we implemented the PPC handler like a hypervisor outside of
the operating system. Therefore, the PPC error handler checked whether the event

was related to the PPC in response to interrupts or exceptions and, if the event was unrelated, passed the processing to the appropriate normal service routine in Linux; otherwise, on a serious access-violation error, the PPC error handler rebooted Linux.
To observe the domain partitioning using PPC, we injected access-violation
errors by configuring PPC so that it did not allow applications running on Linux to
access a small memory area assigned to Linux. When an application wrote some
data into the small memory area, the PPC rejected the write access request so that
no data were written in the memory area, and the PPC generated an access-violation
interrupt signal to initiate the PPC error handler to reboot Linux.
Tables 5.16 and 5.17 show the overhead of domain partitioning using the PPC. The average performance penalty was 2.49%, and the overheads were typically less than 5%. In the memory-latency cases, the overheads were due only to the additional bus access cycle introduced by the PPC, because the PPC error handler was not initiated during these tests; thus, the overhead was 0.00% for the LMBench "L1 cache" case, 1.48% for "main memory," and 3.85% for "random memory." We presume that the difference in overhead between the main and random memories was due to the effect of the CPU core's store buffers. The worst cases of overhead were 17.07% for null call and 10.13% for null I/O. We attribute this overhead to the PPC error handler, because "null call" and "null I/O" are system calls that only generate exceptions, which trigger the PPC error handler's execution. Implementing the PPC error handler in the Linux service routine using a paravirtualization approach [18] could reduce the overhead.

Table 5.16 Performance evaluation results for memory latency and context-switching times using LMBench

                              Memory latency (ns)                     Context-switching times (ms)
                                                                      (no. of processes/process image size in bytes)
  Operating system            L1 cache   Main memory   Random memory   2p/64 k   8p/64 k   16p/64 k
  Linux                       4.99       148.30        1,842.05        11.10     43.50     48.05
  Linux + PPC error handler   4.99       150.50        1,912.95        11.50     45        49.40
  Overhead (%)                0.00       1.48          3.85            3.60      3.45      2.81

Table 5.17 Performance evaluation results for processing times using LMBench

                              Process times (ms)
  Operating system            Null call   Null I/O   Signal install   Signal handling   Fork processing   Execution processing
  Linux                       0.41        0.79       1.63             13.80             3,906.50          4,432
  Linux + PPC error handler   0.48        0.87       1.70             13.95             3,911             4,452.50
  Overhead (%)                17.07       10.13      4.29             1.09              0.12              0.46

References

1. Yamamoto H, Takata H (2004) Porting Linux to a Single Chip Multi-processor. The 66th
National Convention of Information Processing Society of Japan, Kanagawa, Japan
2. LMBench: http://lmbench.sourceforge.net/
3. Idehara A, Tawara Y, Yamamoto H, Ochiai S (2007) Development of SMP Linux for embed-
ded multicore processor. Embedded Systems Symposium 2007, Tokyo, Japan, pp 226–232
4. Brock B, Rajamani K (2003) Dynamic power management for embedded systems. Proceedings
of the IEEE International SOC Conference 2003, Portland, USA, pp 416–419
5. IBM, MontaVista (2002) Dynamic power management for embedded systems. http://www.research.ibm.com/arl/publications/papers/DPM_V1.1.pdf. Accessed July 2008
6. Idehara A, Tawara Y, Yamamoto H, Sugai N, Iizuka T (2008) An evaluation of dynamic power
management support of SMP Linux for embedded multicore processor. Embedded Systems
Symposium 2008, Tokyo, Japan, pp 115–123
7. Woo SC, Ohara M, Torrie E, Singh JP, Gupta A (1995) The SPLASH-2 programs: character-
ization and methodological considerations. Proceedings of the 22nd International Symposium
on Computer Architecture, Santa Margherita Ligure, Italy, pp 24–36
8. Idehara A, Tawara Y, Yamamoto H, Ohtani H, Ochiai S (2009) Idle reduction: dynamic power
manager for embedded multicore processor. Embedded Systems Symposium 2009, Tokyo,
Japan, Oct 2009, pp 5–12
9. IOzone: http://www.iozone.org/
10. Idehara A, Tawara Y, Yamamoto H, Motai H, Ochiai S, Matsumoto T (2010) Design and
implementation of Linux HIGHMEM extension for the embedded processor. Embedded
Systems Symposium 2010, Tokyo, Japan, Oct (2010), pp 75–80
11. Ebert C, Jones C (2009) Embedded software: facts, figures, and future. Computer 42(4):42–52
12. Smith JE, Nair R (2005) Virtual machines: versatile platforms for systems and processors.
Morgan Kaufmann, MA, USA
13. Sun Microsystems (1999) Sun Enterprise 10000 Server: Dynamic System Domains, white paper. http://www.sun.com/datacenter/docs/domainswp.pdf
14. Takada H, Honda S (2006) Real-time operating system for function distributed multiproces-
sors. J Inform Proc Soc Jpn 47(1):41–47
15. Kruger C (1992) Software reuse. ACM Comput Surv 24(2):131–183
16. Nesbit KJ et al (2008) Multi-core resource management. IEEE Micro 28(3):6–16
17. Uhlig R et al (2005) Intel virtualization technology. Computer 38(5):48–56
18. Barham P et al (2003) Xen and the art of virtualization. Proc 19th ACM Symp Operating
Systems Principles ACM, pp 164–177
19. McVoy LW, Staelin C (1996) Lmbench: portable tools for performance analysis. Proc. Usenix
Ann. Technical Conf., Usenix Assoc., pp 279–294
20. Nojiri T et al (2009) Domain partitioning technology for embedded multi core processors.
IEEE Cool Chips XII:273–286
21. Nojiri T et al (2010) Domain partitioning technology for embedded multicore processors.
IEEE Micro 29(6):7–17
Chapter 6
Application Programs and Systems

6.1 AAC Encoding

This section describes the evaluation of a heterogeneous multicore architecture by means of a widely used advanced audio codec (AAC) [1] audio encoder implemented on a fabricated chip. AAC is supported for audio playback by various embedded systems. A processing scheme for the heterogeneous multicore architecture, supported by hierarchical memories and data transfer units, was newly investigated, and the execution time and power consumption of the encoding were measured.

6.1.1 Target System

The evaluated chip is equipped with two homogeneous CPU cores and two accel-
erators (FE-GA) [2], which are described in Sect. 3.2. Figures 6.1 and 6.2 show a
block diagram and micrograph of the chip, respectively [3, 4]. The chip has two
SH-4A (SH) cores capable of multicore functions such as cache snooping, a 128-
KB on-chip shared memory (CSM), a DMAC, and two FE-GAs. The SH cores have
several types of local memories and a data transfer unit (DTU). The local memories
include a 128-KB users’ RAM (URAM) as a distributed shared memory, a 16-KB
operand local RAM (OLRAM) as a local data memory, and an 8-KB instruction
local RAM (ILRAM) as a local program memory. The FE-GAs also have a 40-KB local memory (4 KB × 10 banks) that can be accessed from their internal load/store cells as well as from other processor cores. All the memories are of the distributed shared type, which means they are address-mapped globally.
The SH cores are also equipped with an instruction cache and a coherent data
cache corresponding to ILRAM and OLRAM, respectively. In our use model, the data
cache is normally utilized for non-real-time applications. In contrast, the OLRAM
is used for real-time applications because data placement on the OLRAM can be


Fig. 6.1 Block diagram of evaluated chip: two SH-4A CPUs (each with 8-KB + 16-KB LRAM, cache, 128-KB URAM, and a DTU) and two FE-GAs (each with a 40-KB local memory and ALU array) connect via the split-transaction bus to a 128-KB CSM, a DMAC, and a memory controller leading to 128 MB of off-chip SDRAM used as CSM. LRAM, URAM: local memories; DTU: data transfer unit; CSM: centralized shared memory; DMAC: direct memory access controller; STB: split-transaction bus

Fig. 6.2 Micrograph of evaluated chip

managed by software in advance of program execution. In the evaluation, an instruction cache was used instead of the ILRAM, and a data cache was also utilized. The array data for the encoding program, such as the input audio-frame data and intermediately generated data, are placed on the URAM. These data are sizable and frequently accessed by the processors, so placing them in a local memory improves performance very effectively. In contrast, small amounts of data, such as scalar variables, are placed in an off-chip memory (SDRAM) and cached in the data cache. Table 6.1 lists the specifications for the evaluated chip. It is fabricated using 90-nm 8-layer CMOS technology. The operational clock frequency is 600 MHz for the CPU cores and 300 MHz for the FE-GAs and the interconnection network, with the power supplied at 1.0 V.

Table 6.1 Specifications for evaluated chip

Process technology       90-nm 8-layer CMOS
Supply voltage           1.0 V (internal), 1.8 V/3.3 V (I/O)
Operating frequency      600 MHz for CPU, 300 MHz for FE-GA/bus
Performance              19.2 GOPS (FE-GA maximum)
Local memories           128 KB + 8 KB per CPU, 40 KB per FE-GA
On-chip shared memory    128 KB

Fig. 6.3 Processing flow and profiling result of AAC encoder. (a) Processing flow: frame read, filter bank, M/S stereo, quantization, Huffman coding, and stream generation (filter bank, M/S stereo, and quantization mapped to the FE-GA; the rest to the CPU). (b) Profiling results on CPU: filter bank & M/S stereo 57%, quantization 32%, Huffman coding 8%, bitstream generation 4%

Table 6.2 Conditions for AAC encoding

Profile             AAC encoder LC (low complexity)
Bit rate            128 kbps
Input data          16 bits, 44.1 kHz, PCM-formatted; Music-1 (192 s), Music-2 (87.9 s)
Memory allocation   Input PCM and output AAC streams placed on off-chip shared memory

6.1.2 Processing Flow of AAC Encoding

The AAC encoding process consists of a filter bank, mid-side (M/S) stereo processing, quantization, Huffman coding, and bit-stream generation. The process is performed frame by frame, where a frame is a unit of sampled points in the input pulse-code modulation (PCM) data. Figure 6.3 outlines the process flow and profiling results for the AAC encoding. The profiling results in (b) indicate that the filter bank, M/S stereo, and quantization account for 89% of the total encoding time. Table 6.2 lists the specifications of the encoder used for the evaluation.

Table 6.3 Improved performance by FE-GA for AAC encoding processes

Process                      On CPU           On FE-GA^a      Speedup
Filter bank and M/S stereo   2,400 K cycles   100 K cycles    24.0×
Quantization                 240 K cycles     31 K cycles     7.7×
^a Data transfer time by the DTU is included

6.1.3 Process Mapping on FE-GA

The encoder program was investigated thoroughly to confirm its suitability for processing on the FE-GA at every encoding stage. The filter bank is a band-pass filter separating the input audio signal into several frequency sub-bands. The filter bank calculation is composed of additions and multiplications on the streaming data, which is suitable for processing on the FE-GA. The M/S stereo extracts parts of the frequency sub-bands that appear in both the left and right channels. Its calculation consists of additions and subtractions of the left and right sub-bands, and it is thus also implemented on the FE-GA. Quantization constrains the output values of the filter bank to a discrete set of values in accordance with the specified bit rate. The calculation raises the data to the power of 3/4; the evaluated program implements this with a table reference (sketched below), which is mapped onto the FE-GA. Huffman coding assigns shorter coding symbols to more frequently appearing bit strings for compression. In the implementation, quantization and Huffman coding iterate, with the quantization step value increased each time, until the amount of encoded data satisfies the given bit rate. Since the coding length of the bit strings is not fixed, it is difficult to improve the performance with the FE-GA, and thus a CPU is used for the Huffman coding. Bit-stream generation arranges the coded symbols in compliance with the AAC stream format; a CPU is also used to generate the bit streams.
We developed the FE-GA configurations for the filter bank, M/S stereo, and quantization for the evaluation. The configurations for the filter bank and M/S stereo were merged because the M/S stereo immediately follows the filter bank process. The execution cycles were measured both on an FE-GA and on a single CPU, as indicated in Table 6.3. Note that the FE-GA cycles are converted to CPU cycles, since the FE-GAs operate at 300 MHz, half the CPU frequency of 600 MHz. Introducing FE-GAs for the merged filter bank and M/S stereo and for quantization yields 24.0- and 7.7-fold speedups against sequential execution on a CPU.

6.1.4 Data Transfer Optimization with DTU

Each processor core has a data transfer unit (DTU) attached to an internal bus con-
nected to the local memories. The DTU simultaneously transfers data between local
memories on different processor cores, between a local memory and on-chip CSM or

Fig. 6.4 Transfer list function in data transfer units (DTU). Commands are linked by next-command pointers: #1 FLAG CHECK (flag address, check value, check interval), #2 TRANSFER (source address, destination address, transfer size), #3 TRANSFER, #4 FLAG SET (flag address, flag value)

off-chip main memory, or between the on-chip CSM and the off-chip main memory, all concurrently with task execution on the processor cores. The DTU is also equipped with flag-set and flag-check commands. In flag-set mode, the DTU sets a flag to the number specified in a command. In flag-check mode, it reads the value of the flag and checks whether it matches the number specified in the command.
The DTU interprets a transfer list, which is a set of DTU commands placed in the local memory. Different types of transfers can be defined in advance, so the DTU can operate independently behind a CPU. Furthermore, the setup time, such as that for register setup, is also reduced, which improves performance. Figure 6.4 shows an example of a DTU transfer list in which each command is linked by a pointer. The list is interpreted as follows. First, a CPU initiates the DTU by setting up its start-up register and specifying the address of the first command to be interpreted. The DTU starts by performing a flag check: it reads a flag in memory and compares it with the value specified in the command. When the two values match, the DTU reads the next command, whose address is given by the pointer in the current command. The flag-check interval cycles can optionally be specified to restrain bus traffic or reduce power consumption. In the next command, data are transferred from the source address to the destination address in the specified size. As soon as the transfer has finished, the DTU reads the next command, which is another data transfer in this example. After that transfer, the DTU sets a flag to the specified value at the specified address.
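As a concrete picture of such a list, the following C sketch models the four commands of Fig. 6.4 as a linked structure in local memory; the field layout and names are assumptions for illustration, not the DTU's actual register format:

#include <stdint.h>

enum dtu_op { DTU_FLAG_CHECK, DTU_TRANSFER, DTU_FLAG_SET };

/* One DTU command; arg0-arg2 play the roles shown in Fig. 6.4
 * (flag addr./check value/check interval, or source/destination/size). */
struct dtu_cmd {
    enum dtu_op     op;
    uint32_t        arg0, arg1, arg2;
    struct dtu_cmd *next;   /* next command pointer */
};

/* Build: wait until the flag reads 1, do two transfers to two memory
 * banks, then set the flag to 2 to signal completion. */
void build_list(struct dtu_cmd c[4], uint32_t flag_adr,
                uint32_t src, uint32_t dst, uint32_t size)
{
    c[0] = (struct dtu_cmd){ DTU_FLAG_CHECK, flag_adr, 1, 64, &c[1] };
    c[1] = (struct dtu_cmd){ DTU_TRANSFER, src, dst, size, &c[2] };
    c[2] = (struct dtu_cmd){ DTU_TRANSFER, src + size, dst + size,
                             size, &c[3] };
    c[3] = (struct dtu_cmd){ DTU_FLAG_SET, flag_adr, 2, 0, 0 };
}

The CPU then only has to write the address of c[0] into the DTU's start-up register; the DTU walks the list on its own.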
The DTU also supports data packing for nonaligned burst transfers. Users do not need to be concerned about the alignment of data placement in memory, and bus utilization is also improved. In addition, the DTU supports a stride transfer mode that enables gathered/scattered data transfers. This is effective for applications that transfer rectangular regions of memory, for example, image handling, because such transfers can be completed with one transfer command.

Fig. 6.5 Implementation of DTU on evaluated chip and its operation: command lists, data, and flags reside in the local memories (LM), and the DTU of CPU#0 transfers data over the bus interfaces to FE-GA#0's LM

Figure 6.5 outlines the DTU implementation on the evaluated chip with an example of its operation. Transfer lists, data, and flags are placed in the local memory (LM). In the example, the DTU interprets the command list on CPU#0's LM and transfers data from CPU#0's LM to FE-GA#0's LM.
In order to maximize the performance of the encoding process, the on-chip and off-chip memories are used as follows. The encoding is done frame by frame. Input PCM data and output AAC streams are stored in the off-chip main memory (SDRAM). Before every frame is processed, the PCM frame data are transferred to the URAM of a target CPU. Intermediately generated data are also placed on the URAM. For processes on an FE-GA, data are transferred from the URAM to the local memory of the target FE-GA before execution, and the processed data in the local memory are transferred back to the URAM of the target CPU after execution.

6.1.5 Performance Evaluation on CPU and FE-GA

The processing time for AAC encoding was evaluated for the following data transfer methods: by a CPU, by a DMAC, by a DTU without transfer lists, and by a DTU with the lists, on a configuration of one CPU and one FE-GA. The encoding options and conditions are described in Table 6.2, with music-1 adopted for the evaluation. Figure 6.6 shows the resulting performance improvements with the various data transfer methods. Encoding on one CPU resulted in an execution time of 58.2 s; the encoding speedup rate is 3.3, calculated from the 192-s length of the input music. By introducing an FE-GA with data transferred by the CPU, the encoding time is 14.1 s, or 13.6 times the encoding speed; the FE-GA thus contributes a further 4.1-fold speedup over the CPU-only case. Next, encoding with DMAC transfers resulted in an encoding time of 10.1 s, or 20.1 times the encoding speed. Furthermore, with DTU transfers without transfer lists, the encoding time was 7.9 s, or 24.2 times the encoding speed. Finally, DTU transfers operated by transfer lists resulted in an encoding time of 7.5 s, or 25.6 times the encoding speed.
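For reference, the encoding speed here is simply the ratio of the input duration to the execution time; for example, for the single-CPU case:

encoding speed = input length / execution time = 192 s / 58.2 s ≈ 3.3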
Fig. 6.6 Performance improvements with various data transfer methods (execution time [s], left axis; encoding speed [times], right axis): CPU only, 3.3×; CPU transfer with 1 CPU + 1 FE-GA, 13.6×; DMAC transfer, 20.1×; DTU transfer without list, 24.2×; DTU transfer with list, 25.6×

The evaluation results indicate that the efficient use of accelerators for process execution and of DTUs for data transfer plays an active role in improving performance. Performance with the DTU was better than with the DMAC because the DMAC requires twice as many transactions on the interconnection bus as the DTU, and this bus is slower than the CPU-internal bus connected directly to the URAM and DTU. The benefit of the DTU transfer lists comes from a reduction in the number of DTU register setups for the multiple transfers to the banks of the local memory in the FE-GA. Since the FE-GA has multiple memory banks, divided data are placed in different banks and transferred in multiple operations; utilizing transfer lists reduces the number of DTU setups required.

6.1.6 Performance Evaluation in Parallelized Processing

We measured the performance of AAC encoding on the evaluated chip. The evaluation included the execution time and the average power consumed in the encoding. The encoding process was mapped to the four processor cores as outlined in Fig. 6.7. For a simple implementation of parallel processing, two streams of encoding were individually assigned to a pair of one CPU and one FE-GA. Processing tasks of the encoding on both a CPU and an FE-GA in parallel could also be achieved by utilizing inter-frame parallelism.
The evaluation was done under the conditions listed in Table 6.2. The performance was measured with double input streams of music-2. In other words, the
Fig. 6.7 Heterogeneous parallelization of AAC encoding: streams #1 and #2 run on the CPU#0/FE-GA#0 and CPU#1/FE-GA#1 pairs; around each frame's filter bank & M/S stereo, target bit rate calculation, quantization, Huffman coding, bit-rate adjustment, and stream generation steps, the DTUs move data between the off-chip CSM, the URAM, and the FE-GA local memory (CRAM)

Fig. 6.8 Performance and power results with various configurations (encoding speed and measured power consumption [W]): CPU×1 [1 stream], 4.0× at 1.17 W (3.4 xEnc/W); CPU×2 [2 streams], 8.0× at 1.36 W (5.8 xEnc/W); CPU×1 + FE×1 [1 stream], 27.1× at 1.22 W (22.2 xEnc/W); CPU×2 + FE×2 [2 streams], 54.1× at 1.46 W (37.1 xEnc/W)

input stream was encoded twice on one CPU and one FE-GA, and the two streams
of the same input music were encoded simultaneously on two CPUs and two
FE-GAs. Input PCM and output AAC stream data were placed in the off-chip main
memory. The DTU transferred data by using transfer lists.
Figure 6.8 plots the evaluation results. The speedup was 4.0 and the average
power consumption was 1.17 W with encoding on a single CPU. The encoding
Fig. 6.9 Trace Gantt chart of one-frame encoding on CPU×1 + FE-GA×1 (time in K cycles, left and right channels): the filter bank and quantization run on the FE-GA with DTU transfers in between, quantization iterates 2.03 times on average, and one frame takes 502 K cycles, corresponding to the 27.1× encoding speed

speedup was 8.0 and the power consumption was 1.36 W on the homogeneous multicore with two CPUs. The encoding speedup was 27.1 and the power consumption was 1.22 W on the heterogeneous multicore with one CPU and one FE-GA. Finally, the encoding speedup was 54.1 and the power consumption was 1.46 W on two CPUs and two FE-GAs. The heterogeneous multicore configurations outperformed the homogeneous multicores: even though the power consumption increases as the number of processor cores is increased, the encoding speedup grows much faster. To evaluate the power-performance efficiency of the heterogeneous multicore configuration, the index of encoding speed per watt [xEnc/W] was calculated for all the evaluated configurations, as listed at the bottom of Fig. 6.8. Sequential execution on a single CPU resulted in 3.4 xEnc/W. Parallel execution on a configuration of two CPUs and two FE-GAs resulted in 37.1 xEnc/W, a 10.9-fold better power-performance efficiency.
Figure 6.9 is a Gantt chart of one-frame encoding on one CPU and one FE-GA.
The filter bank and quantization were processed on the FE-GA, and DTU data transfers
were performed between executions on the CPU and the FE-GA.

6.2 Real-Time Image Recognition

6.2.1 MX Library

To extract the maximum parallel processing performance of the MX, a dedicated MX library consisting of more than 100 microcode functions is prepared. As shown in Fig. 6.10, these library functions are stored in the MX controller.

Fig. 6.10 MX microcode library: (1) the CPU calls an MX library function (e.g., MX_ADD(), MX_SUB()) from its C main program; (2) the controller issues the corresponding SIMD instruction from its instruction memory; (3) the PEs of the SIMD data path process all elements in parallel (e.g., B[i] + A[i] on each PE)

Fig. 6.11 Software interface of MX core: a user application calls the APIs of an application-specific image/signal processing library (filters, FFT, etc.), which is built on the microcode basic library of arithmetic/logical functions (ADD, AND, etc.) running on the CPU and MX core hardware

The MX system normally employs a CPU as a general-purpose controller. The CPU calls MX microcode functions from the main program written in the C language. The MX controller decodes the microcode functions called by the CPU and generates control signals for the SIMD data path, in which all PEs operate simultaneously under these control signals. Overall, such simple programming realizes massively parallel operations on the MX core [5, 6].
The software interface of the MX core is shown in Fig. 6.11. The CPU and MX core constitute the hardware layer. The primitive microcode library is implemented to support simple arithmetic and logical operations. In addition, an application-specific library layer, for example, an image and signal processing library, is prepared above it. This library layer offers code optimized for the MX core and CPU, and it conceals the unique hardware structure of the MX core. Therefore, users only need to call the simple APIs of the specific library and do not have to be concerned with the massively parallel structure of the MX core.
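The following C sketch illustrates this calling style with a scalar reference model of one basic-library function; MX_ADD is a hypothetical name standing in for the real microcode functions:

#include <stdint.h>
#include <stdio.h>

/* Scalar reference model: on the MX core, the controller decodes the
 * call and each PE computes one element of this loop in parallel. */
static void MX_ADD(uint8_t *dst, const uint8_t *a,
                   const uint8_t *b, int n)
{
    for (int i = 0; i < n; i++)
        dst[i] = (uint8_t)(a[i] + b[i]);
}

int main(void)
{
    uint8_t a[4] = {1, 2, 3, 4}, b[4] = {10, 20, 30, 40}, out[4];

    MX_ADD(out, a, b, 4);   /* the CPU just calls the library API */
    printf("%u %u %u %u\n", out[0], out[1], out[2], out[3]);
    return 0;
}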

Fig. 6.12 Example of SoC system architecture including MX core: the MX core, two CPUs, a DMAC, and an SDRAM interface connected by the on-chip bus to external SDRAM

Fig. 6.13 Spatiotemporal Markov random field model: the S-T MRF layer produces tracking information, which the semantic layer above interprets as information such as "count," "run," or "congestion"

6.2.2 MX Application

The MX core is an embedded processor core for SoCs. Therefore, a suitable SoC system architecture needs to be designed in order to extract the full performance of the MX core. A heterogeneous multicore system architecture is a promising option for real-time image recognition applications.
Figure 6.12 schematically shows an example of such an SoC system architecture: a heterogeneous multicore structure consisting of multiple CPUs, the MX core, and the required peripheral IPs. It also includes the DMA controller that manages the data transfers between the MX core and the external memory. With this architecture, high-performance image processing and recognition applications can be achieved by optimizing the software architecture. Hereafter, an application example and its implementation on this system are described [7].
Figure 6.13 shows an application example of the MX core using the spatiotem-
poral Markov random field model (S-T MRF) [8]. The S-T MRF is a powerful
tracking algorithm for camera applications such as traffic-sensing cameras and sur-
veillance cameras. The algorithm has the following features:
• It is independent of object shapes.
• It can deal with occluded objects.
The software layers are also shown in Fig. 6.13. The bottom layer is the S-T
MRF layer, which extracts the tracking information and passes it to the upper layer.
The upper layer is the semantic layer, which interprets the tracking information as
traffic conditions, for example, congested or clear.

Fig. 6.14 Overview of S-T MRF algorithm: similar textures are searched between the previous and current frames (x, y, time) to build the motion vector map, and the ID distribution is evaluated to build the object map holding the object boundaries

An overview of the S-T MRF algorithm is shown in Fig. 6.14. The algorithm estimates the boundary of each object based on the motion vectors. First, the motion vectors of each object are extracted by comparing the previous and current image frames. The extracted motion vectors are mapped onto motion vector maps. From these motion vector maps, object maps including the boundary information of the objects are generated. In the process of generating these object maps, the boundaries of the objects are evaluated by a high-level algorithm and then optimized. In this way, robust and stable object tracking is achieved by updating each map in every image frame.
As shown in Fig. 6.15, the S-T MRF can be divided into two layers, that is, object map creation and motion vector extraction. In applications that use the S-T MRF algorithm, an event detection algorithm is added in the application layer. The volume of data decreases as the software level rises because the higher-level layers do not need to handle the pixel data. The operations of each layer are executed independently; therefore, thread-level parallel processing can be applied.
Figure 6.16 gives an overview of the thread parallel processing of the S-T MRF application on the proposed SoC architecture. The motion vector extraction, which processes a large volume of data, is assigned to the MX core. The object map creation is assigned to CPU#0, and event detection is assigned to CPU#1. Each thread communicates with the others by passing information; for example, the object map creation thread gives the pixel pointer to the motion vector extraction thread and receives the motion vector map. Thus, effective thread parallel processing is achieved with this scheme.

Fig. 6.15 Structure of S-T MRF application: from large data volume to small, the layers are motion vector extraction (SAD operation), object map creation, and event detection in the application layer, all candidates for thread parallel processing

Fig. 6.16 Thread parallel processing scheme: in the software layer, event detection (CPU#1), object map creation (CPU#0), and motion vector extraction (MX core) exchange the frame pointer, pixel pointer, object map, and vector map

Figure 6.17 illustrates the frame pipelining technique for the thread parallel processing shown in Fig. 6.16; a sketch of the thread structure follows below. The horizontal axis is time, where one time unit is the processing time of one frame. The output of each thread is given to another thread, which processes it one time unit later. Parallel execution of the threads is achieved with this frame pipelining technique.
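A minimal pthread sketch of this pipeline is shown below; the per-frame work is elided, and the barrier-based hand-off is an illustrative simplification of the pointer passing described above:

#include <pthread.h>

#define N_FRAMES 10
static pthread_barrier_t frame_sync;

static void *vector_extraction(void *arg)   /* runs on the MX core */
{
    for (int frame = 0; frame < N_FRAMES; frame++) {
        /* ... SAD-based motion vector extraction for this frame ... */
        pthread_barrier_wait(&frame_sync);   /* hand vectors downstream */
    }
    return NULL;
}

static void *object_map_creation(void *arg)  /* runs on CPU#0 */
{
    for (int frame = 0; frame < N_FRAMES; frame++) {
        /* ... build object map from the previous frame's vectors ... */
        pthread_barrier_wait(&frame_sync);
    }
    return NULL;
}

static void *event_detection(void *arg)      /* runs on CPU#1 */
{
    for (int frame = 0; frame < N_FRAMES; frame++) {
        /* ... interpret the object map: run, congestion, ... */
        pthread_barrier_wait(&frame_sync);
    }
    return NULL;
}

int main(void)
{
    pthread_t t[3];
    pthread_barrier_init(&frame_sync, NULL, 3);
    pthread_create(&t[0], NULL, vector_extraction, NULL);
    pthread_create(&t[1], NULL, object_map_creation, NULL);
    pthread_create(&t[2], NULL, event_detection, NULL);
    for (int i = 0; i < 3; i++)
        pthread_join(t[i], NULL);
    pthread_barrier_destroy(&frame_sync);
    return 0;
}

Between two barrier waits, each stage works on a different frame, which is exactly the one-time-unit offset of Fig. 6.17.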
In the S-T MRF algorithm, the motion vector extraction is based on the sum of absolute differences (SAD) between sequential frames. The SAD algorithm is very useful for evaluating the similarity between two frames, as depicted in Fig. 6.18.

Fig. 6.17 Frame pipelining scheme: in each one-frame processing period, the MX core runs vector extraction, CPU#0 runs object map creation on the previous frame's vectors, and CPU#1 runs event detection, outputting the traffic situation (e.g., run, congestion)

Fig. 6.18 Overview of SAD calculation: the absolute difference of each pixel between the 8 × 8 image data and the 8 × 8 template data is summed into the SAD value, SAD = Σ |Image(x, y) − Template(x, y)|

The absolute difference of each paired pixel between the two pixel blocks is first
determined and then their sum is calculated.
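In C, the 8 × 8 SAD of Fig. 6.18 is the following double loop; this scalar version is the reference against which the MX mapping below can be read:

#include <stdint.h>
#include <stdlib.h>

/* Sum of absolute differences over two 8x8 pixel blocks. */
unsigned sad8x8(const uint8_t img[8][8], const uint8_t tpl[8][8])
{
    unsigned sum = 0;
    for (int y = 0; y < 8; y++)
        for (int x = 0; x < 8; x++)
            sum += (unsigned)abs((int)img[y][x] - (int)tpl[y][x]);
    return sum;
}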
Figure 6.19 shows an overview of the SAD implementation on the MX core. The absolute differences between the sequential frames are processed line by line by the PEs of the MX core in parallel. The MX core has a powerful data network between PEs; therefore, inter-PE operations such as the summation are also easily implemented. With these implementation techniques, effective and high-performance SAD operations are realized on the MX core.
The performance evaluation of the S-T MRF application using the proposed SoC
architecture is illustrated in Fig. 6.20. This graph shows the speed performance; the results

Fig. 6.19 SAD operation using MX core: the image data, template data, and the work area holding the result are mapped across the PEs, eight columns at a time

Fig. 6.20 Performance evaluation (per frame, at 648 MHz for each CPU and 324 MHz for the MX-2): Case A (event detection on CPU#1; object map creation and motion vector extraction on CPU#0) takes 220.1 ms; Case B (motion vector extraction offloaded to the MX-2) takes 20.4 ms, 10.8 times faster

were obtained under conditions where one image frame was processed at an operating frequency of 648 MHz in each CPU and 324 MHz in the MX-2. In terms of speed, the proposed system exhibits a 20.4-ms processing time, which is 10.8 times faster than the CPU-only configuration. As shown here, high-performance image recognition applications can be achieved by implementing the heterogeneous architecture with the MX core and the CPU cores.

6.3 Applications on SMP Linux

Three Linux applications running on the RP-1, RP-2, and RP-X multicore chips (as
described in Chap. 4) have been developed. The first application program visualizes
the load balancing mechanism of Linux on the RP-1, which has four CPU cores
with the cache coherency protocol among them. A monolithic Linux kernel runs on
the four cores, and the load balancer of Linux balances the loads among the cores.

The second application program on the RP-2 visualizes the power-saving mechanisms
for multiple cores using Linux. Two mechanisms are implemented in Linux. One
is dynamic voltage and frequency scaling (DVFS) of multiple cores, and the other
is dynamic plugging or unplugging of each CPU core. The two mechanisms
are controlled by the newly introduced “power control manager” daemon. The third
application program performs image processing of magnetic resonance imaging
(MRI) images using the RP-X chip. These three applications are described in
detail below.

6.3.1 Load Balancing on RP-1

6.3.1.1 Introduction

The RP-1 chip has four SH-4A cores. The main memory is shared by the four cores.
A pair of caches—an instruction cache and an operand cache—is placed between
each core and the main memory. Each operand cache is kept coherent with the other
operand caches using the directory-based write invalidation cache coherency protocol.
The write invalidation protocol is either the MESI cache coherency protocol or the
MSI cache coherency protocol. Eight channels of the inter-CPU interrupts (ICIs)
are implemented. The communication between cores inside Linux is mapped to one
or more channels of the ICIs. An interrupt caused by an event outside a core can
be either bound to a specific core or distributed to an arbitrary core so that the core
that receives an interrupt first serves the interrupt.
Some problems in scalability have been found with multiple processors in Linux
2.4. The problems have become obvious as multi-thread application programs have
become popular. Even on a single processor, the scheduler in Linux 2.4 runs in O(n)
time, where n is the size of the run queue. In symmetric multiprocessing (SMP) on
Linux 2.4, there is a single global run queue protected by a global spinlock. Only
one processor that has acquired the global lock may handle the run queue [9]. To designate a task to run, the scheduler searches the run queue looking for the process with the highest dynamic priority. This results in an O(n)-time algorithm and causes a scalability problem in SMP.
Linux 2.6 has been improved for SMP and has a per-CPU run queue, which
avoids the global spinlock with multiple CPUs and provides SMP scalability. The
scheduler on Linux 2.6, before 2.6.23, is called the O(1) scheduler [10], which was
designed and implemented by Ingo Molnar.
The load balancer on Linux 2.6 supports SMP. Balancing within a schedul-
ing domain occurs among groups. The RP-1 Linux has one scheduling domain
with four groups, each of which consists of one CPU. The scheduler works
independently on each CPU. To maintain an equal load in multiple processors,
a load balancer is run periodically to equalize the workload among the proces-
sors. The four-core multiprocessor system has four schedulers. Each CPU has
a run queue. Each run queue maintains a variable called cpu_load, which represents

the CPU’s load. When run queues are initialized, their cpu_loads are set at zero
and updated periodically afterward. The number of runnable tasks on each run
queue is represented by the nr_running variable. The current run queue’s cpu_
load variable is roughly set to the average of the current load and the previous
load using the statement shown below:

cpu_load = (cpu_load + nr_running * 128) / 2

The constant 128 is used to increase the resolution of the load calculation and to produce a fixed-point number. The statement above means that the cpu_load variable accumulates the recent load history. The load balancing is done at certain appropriate timings. The load balancer looks for the busiest CPU. If the busiest CPU is the current CPU, it does nothing, because the current CPU is busy. If the load of the current CPU is less than the average, and the difference in the loads of the two CPUs exceeds a certain threshold, the current CPU will pull a certain number of tasks from the busiest CPU. The number of tasks pulled is the smaller of two quantities: the difference between the busiest load and the average load of the four CPUs, and the difference between the average load of the four CPUs and the current load [11].
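The pull-count decision can be summarized by the following sketch, using the fixed-point loads defined above (the variable names are illustrative, not the kernel's):

/* Called only when this_load < avg_load <= max_load, as described
 * in the text; all loads are fixed-point values scaled by 128. */
static unsigned long tasks_to_pull(unsigned long max_load,
                                   unsigned long avg_load,
                                   unsigned long this_load)
{
    unsigned long over  = max_load - avg_load;  /* busiest above average */
    unsigned long under = avg_load - this_load; /* current below average */
    unsigned long imbalance = (over < under) ? over : under;
    return imbalance / 128;                     /* undo the x128 scaling */
}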
The purpose of the first application program is to visualize the load balancing
mechanism of Linux. The application program shows that the number of processes
on each CPU core is averaged among the four CPU cores on the RP-1 chip.

6.3.1.2 Design and Implementation

When the application creates several processes, they will be distributed to the four CPU cores according to the load balancing mechanism of the Linux kernel. This mechanism should work effectively both when the number of processes is increasing and when it is decreasing.
A system diagram of the RP-1 application is shown in Fig. 6.21, and the software
architecture of the RP-1 application is in Fig. 6.22. The display unit (“DU” hereafter)
on the RP-1 chip has been used for visualization. The DU converts the contents of a
frame buffer located in the main memory into a video signal. The size of the display
is fixed to VGA, 640 × 480 pixels. The display is divided into four sections. They are
assigned to CPU #0, CPU #1, CPU #2, and CPU #3 exclusively, as shown in
Fig. 6.23. The location of the frame buffer can be an arbitrary address. If the system
has a dedicated memory area for the frame buffer, the DU driver uses the virtual
address after mapping by the ioremap() function of Linux. In this system, the DU
driver allocates the frame buffer in the main memory, DRAM, using the dma_alloc_
coherent() function of Linux. This function allocates one or more physical pages
which can be written or read by the processor or device without worrying about
cache effects, and returns a virtual address. Finally, a frame buffer of plane 0 of the
DU can be accessed by a user program as a file, “/dev/fb0.”
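A minimal user-space sketch of this access path, querying the geometry and clearing plane 0 through /dev/fb0, is shown below (error handling abbreviated):

#include <fcntl.h>
#include <linux/fb.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/dev/fb0", O_RDWR);
    if (fd < 0)
        return 1;

    struct fb_var_screeninfo vi;
    ioctl(fd, FBIOGET_VSCREENINFO, &vi);     /* 640x480, 16 bpp here */

    size_t len = (size_t)vi.xres * vi.yres * (vi.bits_per_pixel / 8);
    uint16_t *fb = mmap(NULL, len, PROT_READ | PROT_WRITE,
                        MAP_SHARED, fd, 0);
    if (fb == MAP_FAILED)
        return 1;

    for (size_t i = 0; i < len / 2; i++)
        fb[i] = 0x0000;                      /* RGB 5:6:5 black */

    munmap(fb, len);
    close(fd);
    return 0;
}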
The application program creates some processes. One process shows a bitmap
image of a penguin on the display. When a penguin process is assigned to CPU #3,

Fig. 6.21 RP-1 application system diagram: the four CPUs, DRAM controller, and display unit (DU) share the on-chip interconnect; frame buffer plane 0 in DRAM is output through the DU and video encoder to the display, which is divided into four CPU sections

Fig. 6.22 RP-1 application software architecture: the DU initialization, background painting, and penguin drawing 0…N applications run on SMP Linux with UART, LAN, and DU drivers over the four CPUs and DRAM

Fig. 6.23 Depiction of penguin process in CPU #3: the display is divided into the CPU #0, #1, #2, and #3 sections, and penguin "0" appears in the CPU #3 section

the bitmap image of a penguin appears in the CPU #3 section of the display, as shown in Fig. 6.23.
In the same way, when a penguin process is assigned to CPU #1, the penguin appears in the CPU #1 section, and likewise for CPU #2. This application consists of three sub-applications, initiated from a shell script. First, the DU initialization sub-application is initiated. It disables the DU, sets the pixel format to the 16-bit RGB 5:6:5 format, fills all of the pixels with black, and enables the DU. Second, the background painting sub-application is initiated. It paints the contents of a 640 × 480 bitmap file, in which the CPU #0, CPU #1, CPU #2, and CPU #3 sections are drawn, onto "/dev/fb0," which is plane 0 of the DU. Third, several penguin drawing sub-applications are created and killed after a while. A penguin drawing sub-application clears the penguin image at its previous position (the initial position is given arbitrarily), obtains the CPU ID number from the /proc/xxxxxx/stat file, where xxxxxx is the decimal process ID (PID) of that sub-application process, randomly calculates a position inside the corresponding CPU section using rand() of <stdlib.h>, and draws the 82 × 102 pixel bitmap image of a penguin at that position. The sub-application repeats the above procedure continuously until an interrupt is signaled, upon which it clears the penguin image. The Bourne shell script below creates and kills some penguin sub-applications:
Line 0001: ./penguin 0 /dev/fb0 &
Line 0002: sleep 1
Line 0003: ./penguin 1 /dev/fb0 &
Line 0004: sleep 1


Line 1001: kill -2 `ps a | grep 'penguin 0' | grep -v grep | awk '{print $1}'`
Line 1002: sleep 1
Line 1003: kill -2 `ps a | grep 'penguin 1' | grep -v grep | awk '{print $1}'`
Line 1004: sleep 1

Initially, no penguin images are on the display. Line 0001 above creates a penguin drawing sub-application process whose image is named "0" and draws it on /dev/fb0, plane 0 of the DU. This process will be created on the same CPU where the parent shell script process exists. The load balancer of a less busy CPU might then pull one or more processes from the busiest CPU. Line 0002 waits for 1 s. Line 0003 creates a penguin drawing sub-application process whose image is named "1" and draws it on /dev/fb0; again, the load balancer of a less busy CPU might pull one or more processes from the busiest CPU. Line 0004 waits for 1 s. After several penguin sub-application processes have been created, they are killed in turn. Line 1001 kills the penguin drawing sub-application process named "0." The "kill -2 [PID]" command

sends the "SIGINT" signal, which corresponds to an interrupt from a keyboard, to the process specified by [PID]. That [PID] is obtained through the commands between the backquotes. The "ps a" command outputs, line by line, the status of the processes of all users. The "grep 'penguin 0'" command extracts any line that matches 'penguin 0.' The "grep -v grep" command removes any line that matches "grep." The "awk '{print $1}'" command extracts the first word of the filtered line in order to get the PID of the "./penguin 0 /dev/fb0" process. After a process is killed, the load balancer of a less busy CPU might pull one or more processes from the busiest CPU. Line 1002 waits for 1 s. Line 1003 kills the penguin drawing sub-application process named "1." Again, the load balancer of a less busy CPU might pull one or more processes from the busiest CPU. Line 1004 waits for 1 s. Finally, no penguin images remain on the display. This sequence is one routine of the application program, and the same routine is repeated again and again until the power is turned off.
Figure 6.24 illustrates the moment when the number of penguins has increased from three to four. The number of penguins on the four CPU cores is initially unbalanced, and the penguins are equally distributed among the four sections after a while. Even when the number of penguins decreases from five to four, the number of penguins in each section is balanced among the four sections, as shown in the same figure, although the names of the penguins are not the same. In this process, we were able to verify that the load balancing mechanism of the Linux kernel works well with the four CPU cores.

6.3.2 Power Management on RP-2

6.3.2.1 Introduction

The second application program has been designed to demonstrate the power management capabilities of the RP-2 chip and RP-2 Linux and to visualize the power consumption and performance of the system. The RP-2 has two capabilities that support power saving. One is dynamic voltage and frequency scaling (DVFS), and the other is power gating. The DVFS of the RP-2 allows each CPU core to change its frequency independently and allows the whole chip to switch among three voltage sources. The voltage supplied to the whole chip is determined by the highest frequency of all the CPU cores on the chip, as indicated in Table 6.4.
The power gating of the RP-2 chip allows the power supplied to each CPU core to be turned off or on. Each CPU is inside an independent power domain. The power supplied to a CPU core can be turned off either by itself or by another CPU core through manipulation of a memory-mapped register. The power supplied to a CPU core can be turned on either by an interrupt to the CPU core or by another CPU core, also by manipulating the memory-mapped register.
The RP-2 has two clusters, each of which consists of four CPU cores. The four CPU cores are cache coherent within a cluster. The SMP Linux kernel works on only one cluster with each operand cache turned on. We have used only one cluster in the application program.

Fig. 6.24 Example of Linux load balancing (states at time T and time T + d): the Linux load balancer migrates the process of penguin 0 from CPU #3 to CPU #2 to balance the loads among the four CPU cores

Table 6.4 Frequency–voltage relationship of RP-2

Highest frequency of four cores (MHz)   Voltage supplied (V)
600                                     1.4
300                                     1.2
150                                     1.0
75                                      1.0

The RP-2 Linux kernel supports DVFS with the CPUfreq framework. The RP-2
Linux kernel supports power gating with the CPU Hot-plug framework. Both
CPUfreq and CPU Hot-plug are controlled with the power control manager daemon
that realizes the “Idle Reduction” framework described in Sect. 5.1.2. The original
CPUfreq has the “ondemand,” “conservative,” “powersave,” “performance,” and
“userspace” governors which represent power management policies. The power

control manager daemon supports the “Idle Reduction” framework which coordi-
nates both the CPUfreq and the CPU Hot-plug frameworks. The purpose of the
second application program is to control the power consumption using the “Idle
Reduction” framework, to protect the system from heat or battery life shortages, and
to visualize the status of the system.

6.3.2.2 Design and Implementation

DVFS and Power-Gating Controls by Idle Reduction Framework

The CPUfreq framework of Linux uses the ratio of idle time per sampling period to increase or decrease the frequency of a CPU. The Idle Reduction framework takes advantage of this process and samples the idle time every 2,000 ms. The kernel runs the idle loop when a CPU has no workload. If the idle time ratio is less than 20% in the sampling period, the workload is dense; if the idle time ratio is more than 80%, the workload is sparse. The Idle Reduction framework increases or decreases the frequency of a CPU when the workload is dense or sparse, respectively. If the workload of a CPU at the lowest frequency is sparse in two consecutive sampling periods, the CPU will be turned off using the CPU Hot Remove function of the CPU Hot-plug framework. Because the voltage is the dominant factor in the power consumption of the RP-2 board, and because the voltage is determined by the highest frequency of the four CPUs, the Idle Reduction framework tries to level the frequencies of the four CPUs. If one CPU has been turned off and another CPU has a dense workload, the Idle Reduction framework will turn a CPU back on using the CPU Hot Add function of the CPU Hot-plug framework rather than increasing the frequency of the CPU with the dense workload.
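The per-CPU decision can be condensed into the following sketch; the thresholds and the two-period rule come from the description above, while the function shape is our own simplification (in particular, the real framework prefers CPU Hot Add only when another CPU is already off):

/* One sampling decision of the Idle Reduction policy for one CPU. */
enum pm_action { PM_KEEP, PM_FREQ_UP, PM_FREQ_DOWN,
                 PM_CPU_ADD, PM_CPU_REMOVE };

enum pm_action idle_reduction(int idle_pct, int at_max_freq,
                              int at_min_freq, int *sparse_periods)
{
    if (idle_pct < 20) {                    /* dense workload */
        *sparse_periods = 0;
        return at_max_freq ? PM_CPU_ADD : PM_FREQ_UP;
    }
    if (idle_pct > 80) {                    /* sparse workload */
        if (!at_min_freq) {
            *sparse_periods = 0;
            return PM_FREQ_DOWN;
        }
        if (++*sparse_periods >= 2)         /* two consecutive periods */
            return PM_CPU_REMOVE;
        return PM_KEEP;
    }
    *sparse_periods = 0;
    return PM_KEEP;
}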
The software decoder of MPEG-2 was selected to evaluate the Idle Reduction framework because it can be multi-threaded and because the workload can be specified by the frame rate in frames per second (fps). The original software of the MPEG-2 decoder was downloaded from the web site of the ALPBench [12] benchmark program suite. The MPEG-2 decoder program is already multi-threaded, and the number of threads can be specified by the user when initiating the decoder. The screen image of MPEG-2 is divided horizontally into nearly equal areas, and the number of areas is equal to the number of threads. The load balancer of SMP Linux balances the loads among the CPU cores. On the four-CPU SMP Linux, the performance in fps scales nearly linearly with the number of threads, up to four threads.
The DVFS and power-gating controls have been evaluated by changing the workload. The workload of the MPEG-2 decode application was changed by specifying the fps. The number of threads of the MPEG-2 decode application was four for the four-CPU SMP in this evaluation. The workload was decreased from the highest to the lowest workload, and then increased from the lowest to the highest workload. The power consumption and the status of each CPU core changed as shown in Table 6.5.

Table 6.5 Status transition by fps control


Workload (fps) 25 20 15 10 5 10 15 20 25
CPU #0 (MHz) 600 300 300 150 150 150 300 300 600
CPU #1 (MHz) 600 300 300 300 75 300 300 300 600
CPU #2 (MHz) 600 600 150 150 75 150 150 600 600
CPU #3 (MHz) 600 300 300 150 75 150 300 300 600
Total (MHz) 2,400 1,500 1,050 750 375 750 1,050 1,500 2,400
Voltage (V) 1.4 1.4 1.2 1.2 1.0 1.2 1.2 1.4 1.4
Power (W) 4.4 3.5 1.8 1.7 0.8 1.7 1.8 3.5 4.4

Battery Life and Temperature Controls Using Idle Reduction Framework

The CPUfreq framework has been used in general for laptop personal computers
with an AC adapter or a battery. The CPUfreq receives data on the activation status
of the AC adapter and the remaining battery life. If the AC adapter is activated, the
CPUfreq disregards the battery life. If the AC adapter is not activated, the CPUfreq
takes the battery life into consideration when choosing a governor.
The CPUfreq also takes the temperature around the board into consideration in the choice of a governor. Heat is generated by the power consumption of the semiconductors, and the temperature of the semiconductors can exceed the upper bound above which normal operation is not guaranteed. A processor is usually the main source of heat on the board. However, a processor with the DVFS capability and multiple power domains can control its amount of heat radiation and power consumption.
The power control manager daemon controls both the CPUfreq and the CPU Hot-plug frameworks depending on the activation status of the AC adapter, the remaining battery life, and the temperature around the board. The activation status of the AC adapter is represented by the value of a DIP switch on the RP-2 board. The value, 0 or 1, of the DIP switch is read from one bit of a general-purpose input/output (GPIO) port. If the AC adapter is activated, the battery life is ignored. The battery life is translated from the voltage of the battery: a battery manufacturer supplies a datasheet with a graph showing the correspondence between the remaining battery life and the output voltage, and we developed a battery life and output voltage model from a battery currently on the market. We used a DC power unit with variable voltage output instead of a battery because the charge or discharge time of a battery takes too long for testing or demonstrating the control that depends on the battery life.
The temperature is measured by a heat sensor, which can be either inside or outside the chip; we used a heat sensor outside the RP-2 chip. The power unit of the RP-2 board is compatible with that of the Advanced Technology extended (ATX) PC motherboard. This power unit, made for automobile PCs, takes DC current from a battery and generates output compatible with an ATX PC power unit. The advantage of this power unit is that it has both a voltage sensor and

Fig. 6.25 Temperature control: the temperature over time is compared against an upward threshold for entering "powersave" and a lower downward threshold for leaving it

a heat sensor. Both the measured voltage of the input DC current and the measured
temperature of the power unit board are output via the USB cable. The USB human
interface device (HID) class driver of the RP-2 Linux obtains the voltage and tem-
perature data from the USB host device, and the power control manager daemon
requests and reads the data via the “/dev/hiddev0” driver interface.
The temperature control changes the power management policy to "powersave" if the temperature goes above the user-specified upward threshold temperature. Likewise, the temperature control changes the power management policy from "powersave" to another mode if the temperature goes below the user-specified downward threshold temperature. In the "powersave" policy, CPU #0 operates at 75 MHz, and the three other CPUs are turned off to reduce the amount of heat radiating from the CPUs. Chattering, in which the power management policy frequently goes into and comes out of the "powersave" mode, might occur if the temperature fluctuates around a threshold. This is inefficient because turning a CPU off or on takes much more time than changing a CPU's frequency. The temperature control therefore has two thresholds, upward and downward, as shown in Fig. 6.25; if the downward threshold were the same as the upward threshold, chattering might occur.
The battery life control changes the power management policy to "powersave" if the battery life goes below the user-specified downward threshold. On the other hand, the battery life control changes the policy from "powersave" to another mode if the battery life goes above the user-specified upward threshold. In the "powersave" policy, the power consumed by the CPUs is reduced in order to prolong the battery life. Chattering may occur if the remaining battery life fluctuates around a threshold; therefore, the battery life control also has two thresholds, a downward one and an upward one, as shown in Fig. 6.26.
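Both controls are instances of the same two-threshold (hysteresis) pattern; a sketch for the temperature case follows, and the battery case is the same logic with the comparisons inverted:

/* Enter "powersave" above the upward threshold; leave it only after
 * dropping below the lower, downward threshold.  The gap between the
 * two thresholds is what suppresses chattering. */
struct hysteresis {
    double upward, downward;    /* upward > downward */
    int    powersave;           /* current state     */
};

int powersave_by_temperature(struct hysteresis *h, double temp)
{
    if (!h->powersave && temp > h->upward)
        h->powersave = 1;
    else if (h->powersave && temp < h->downward)
        h->powersave = 0;
    return h->powersave;
}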
Figure 6.27 shows a system diagram of the RP-2 application.
The application program uses the X-window system. To show the MPEG-2 image
on a window of the X-window system, the DU driver uses two planes of the XGA
size or 1,024 × 768. One plane is used to display the whole screen of the X-window
system. That plane is accessed via the “/dev/fb0” frame buffer device. The other
plane is used to display the MPEG-2 image on a window. That plane is accessed via
the “/dev/fb1” frame buffer device. The image on the screen is the graphical user

Fig. 6.26 Battery life control: the remaining battery life over time is compared against a downward threshold for entering "powersave" and a higher upward threshold for leaving it

Fig. 6.27 RP-2 application system diagram: the RP-2's four CPUs connect via the on-chip interconnect to the DRAM controller and display unit (DU), whose frame buffer planes 0 and 1 in DRAM feed the video encoder and display, and via the off-chip interconnect to the ATA controller (hard disk drive holding the OS, libraries, application, and MPEG-2 data), the USB host controller (USB hub and mouse), and the bus controller reading the voltage and thermal sensors of the automobile ATX power unit



Fig. 6.28 RP-2 application software architecture: the mpeg2, xosview, xeyes, and power control manager applications run over the X server on SMP Linux with UART, LAN, ATA, DU, USB, and GPIO drivers on the four CPUs and DRAM

Fig. 6.29 Display image of RP-2 application

interface (GUI) program implemented using the X toolkit of the X-window system. A mouse is used as a pointing device. Figure 6.28 shows the software architecture of the RP-2 application.
A display image of the application is shown in Fig. 6.29. It consists of three windows: those of the main application, the system monitor, and "xeyes."
Figure 6.30 shows the main application window, which has two parts. One is the area that displays both the MPEG-2 video and a histogram showing the current speed in fps, up to 40 fps. This area is an instance of a custom widget class of the X toolkit. The contents of the "/dev/fb1" frame buffer device are mapped to this area. The other part

Fig. 6.30 Main application window

is the area for the control buttons. A button on the X-window screen is associated with a shell script via the X toolkit; pushing the button executes the shell script. The MPEG-2 decode program is executed from one of these shell scripts.
The system monitor window is shown in Fig. 6.31. This system monitor was developed by modifying the "xosview" [13] program, which runs on the X-window system. The program continuously updates system-related statistics obtained from the "/proc" file system. The source code of "xosview" was downloaded from the Internet and modified to show the statistics listed in Table 6.6. The first items, "CPU0," "CPU1," "CPU2," and "CPU3," display information based on "/proc/stats" and "/proc/cpuinfo." The original "xosview" does not work correctly with the CPU Hot Remove or CPU Hot Add of CPU Hot-plug because a removed CPU disappears from "/proc/stats"; "xosview" has therefore been modified to gray out the area of the removed CPU. The other items, "BTRY," "THER," "POLI," "FREQ," and "WATT," display information obtained from the power control manager daemon.

Fig. 6.31 “xosview” window

6.3.3 Image Filtering on RP-X

6.3.3.1 Introduction

Image recognition technology involves several kinds of analyses performed at the same time, and image processing is thus one of the research fields that can benefit from multicore processors. This subsection describes a system that takes images captured by a camera and displays them after carrying out several filtering processes.

Table 6.6 Description of system monitor window


Title Description
CPU0 CPU loads of CPU #0, CPU #1, CPU #2, and CPU #3 are shown as
CPU1 percentages. US/NI/SY/ID/WA/INT/OFF mean the following:
CPU2 US: CPU user time
CPU3 NI: CPU nice time
SY: CPU system time
ID: CPU idle time
WA: CPU wait time
INT: CPU interrupt time + software interrupt time
OFF: CPU power cutoff time
BTRY The battery life remaining is shown by the percentage of battery capacity. AC/
BATTERY mean the following:
AC: an AC adapter is in use
BATTERY: a battery is in use
THER The thermal sampling data measured around the board are shown in degrees
Celsius
POLI The management policy of the power control manager daemon is shown.
POWERSAVE/PERFORMANCE/OPTIMIZATION/USER/
IMMEDIATE mean the following:
POWERSAVE: power-saving policy
PERFORMANCE: performance pursuing policy
OPTIMIZATION: idle reduction policy
USER: user customized policy
IMMEDIATE: application-specified power setting
These are identified by the color
FREQ The sum of frequencies (MHz) of CPU #0, CPU #1, CPU #2, and CPU #3.
The frequency of each CPU is identified by a specific color. The maximum
value is 600 MHz × 4 = 2,400 MHz
WATT The power consumption is shown by the wattage (mW). The maximum value is
4,500 mW

The purpose of the third application program is to create a real-world computing application that processes images. SUSAN [14] is an image recognition package adopted in the MiBench [15] benchmark suite. SUSAN implements the Smallest Univalue Segment Assimilating Nucleus (SUSAN) principle, which was developed to perform edge and corner detection and structure-preserving noise reduction or smoothing. SUSAN is employed for real-world computing such as magnetic resonance imaging of the brain, vision-based quality assurance applications, and so on.

6.3.3.2 Design and Implementation

The source code of the SUSAN benchmark package is available in the MiBench benchmark suite, which is a set of commercially representative embedded applications. The package includes three visual effect algorithms: they find corners, find edges, and smooth shapes in the images. The three algorithms are independent. The original SUSAN application performs only one of the three visual

Fig. 6.32 RP-X application system diagram: a USB camera and mouse connect through a USB hub to the USB host controller; the RP-X's four CPUs connect via the off-chip interconnect to the SATA controller (hard disk drive holding the OS, libraries, and application) and bus controller, and via the on-chip interconnect to the DRAM controller and LCD controller, whose frame buffer plane 0 in DRAM feeds the video encoder and display

effects. However, the application presented here has been modified to perform the three visual effects in parallel to take advantage of a multicore processor.
The original SUSAN program accepts an image file stored in the portable gray map (PGM) format as input, reads the size of the image and the subsequent 8-bit grayscale image, and passes the input to one of the visual effect algorithms. In this implementation, however, the visual images are captured via a USB video class (UVC) device, stored in the YUY2 format, and then converted into the 8-bit grayscale format as the input to each visual effect algorithm. The size of the image is 320 × 240 pixels or smaller and is one of the sizes supported by the USB camera. Figure 6.32 shows the system diagram of the RP-X application.
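The YUY2-to-grayscale step of the "input" thread reduces to taking the luma bytes; a sketch is shown below (YUY2 stores pixel pairs as Y0 U Y1 V):

#include <stdint.h>

/* Extract the 8-bit grayscale plane from a YUY2 frame: every other
 * byte is a Y (luma) sample, which is exactly what SUSAN consumes. */
void yuy2_to_gray(const uint8_t *yuy2, uint8_t *gray,
                  int width, int height)
{
    int npix = width * height;
    for (int i = 0; i < npix; i++)
        gray[i] = yuy2[2 * i];
}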
The software architecture of the RP-X application is shown in Fig. 6.33. The
SUSAN process creates four threads: “input,” “smoothing,” “edges,” and “corners.”
The “input” thread converts the captured image in the YUY2 format into the 8-bit
grayscale format. The “smoothing” thread smoothes shapes. The “edges” thread
finds edges in the image. The “corners” thread finds corners in the image. The system
is built to take advantage of the X-window system on the Linux operating system.

Fig. 6.33 RP-X application software architecture: the SUSAN application's input, smoothing, edges, and corners threads run over the X server on SMP Linux with Video4Linux2, SATA, LCD, and USB drivers on the four CPUs and DRAM

Fig. 6.34 SUSAN application window

The X toolkit is used to lay out the images. The "luvcview" [16] package is a web camera viewer based on the UVC; its source code is available on the Internet. The "v4l2uvc.c" file and the related header files have been extracted from the package and integrated into the application. The "v4l2uvc.c" file captures the UVC camera images using the Video4Linux2 driver.
Figure 6.34 shows the display image of the SUSAN application on the X-window. There are four video images in the figure. The upper left image shows the gray-scaled

image of the original input image from the USB camera. The size of the input image
is 320 × 240 pixels. A smoothed image is shown at the upper right. The lower left
image shows the edge detection effect, and the lower right image shows the corner
detection effect. The images are written to the frame buffer in Linux using the graphics functions of the Xlib library, as illustrated below.
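As a hedged illustration of that last step (not the application's actual code), a grayscale result can be pushed to the X server roughly as follows, assuming a 24-bit TrueColor visual and an already created window and GC:

#include <stdlib.h>
#include <X11/Xlib.h>

/* Replicate each grayscale byte into R, G, and B of a 32-bit padded pixel
 * and hand the result to the X server. XDestroyImage also frees the data. */
void show_gray(Display *dpy, Window win, GC gc,
               const unsigned char *gray, int w, int h)
{
    char *rgb = malloc((size_t)w * h * 4);
    for (int i = 0; i < w * h; i++) {
        rgb[4 * i] = rgb[4 * i + 1] = rgb[4 * i + 2] = (char)gray[i];
        rgb[4 * i + 3] = 0;
    }
    XImage *img = XCreateImage(dpy, DefaultVisual(dpy, DefaultScreen(dpy)),
                               24, ZPixmap, 0, rgb, w, h, 32, 0);
    XPutImage(dpy, win, gc, img, 0, 0, 0, 0, w, h);
    XDestroyImage(img);
}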

6.4 Video Image Search

One example of a system utilizing the multicore chip is a video image search system. A detailed implementation of the system on the multicore chip RP-X [17] is described in this section. It offers video-stream playback with a graphical operation interface, as well as a similar-image search [18] that recognizes faces while playing back video. It makes full use of the heterogeneous cores, employing the video processing unit (VPU) to play video streams and the SH-4A cores to perform image recognition. Figure 6.35 shows a block diagram of the video image search system implemented on the chip. The system runs two different operating systems, uITRON and Linux, over a hypervisor that manages the physical resources of the chip. The hypervisor is a software layer below the operating systems [19]. The two operating systems use a common shared memory for their intercommunication.

[Fig. 6.35 Block diagram of the developed video image search system: the video image search GUI and the face detection/recognition applications run on uITRON and SMP SH-Linux over a hypervisor; the chip integrates the VPU, VEU, BEU, LCDC, MMU, and five SH-4A cores; and DVI, SATA, USB, and Ethernet interfaces connect devices such as the display, memory, HDD, and HID. Abbreviations: GUI, graphical user interface; VPU, video processing unit; VEU, video engine unit; BEU, blend engine unit; LCDC, display controller; MMU, memory management unit; DVI, digital visual interface; SATA, serial advanced technology attachment; USB, universal serial bus; HDD, hard disk drive; HID, human interface device; IPTV, Internet protocol television]



Fig. 6.36 Processing flow of image synthesis

Programs running on uITRON process the playback of motion pictures by utilizing image processing cores such as the VPU, video engine unit (VEU), blend engine unit
(BEU), and display controller (LCDC) on RP-X. They also perform image syn-
thesis of a graphics plane and the motion pictures and generate output images to a
monitor connected to the digital visual interface (DVI). Programs operating on
Linux perform a similar-image search of detected faces, user interface control, and
graphic plane depiction that is synthesized with an output image of the similar-
image search. Figure 6.36 shows a processing flow of the image synthesis. First, the
decoded image of an input video stream is generated, and the size and position of
the image are adjusted to create a video plane on uITRON. Then, images used for
the similar-image search and a mouse-pointer trail are generated to create a graphics
plane on Linux. The synthesis of the video plane and the graphics plane is based on an α plane that specifies the transparent parts of the graphics plane synthesized with the video plane. The α plane is created on Linux, and it is stored in the DDR3 memory shared with Linux.
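Concretely, the per-pixel blend specified by the α plane can be written as out = (α × graphics + (255 − α) × video) / 255. The following software sketch of that operation is for illustration only and is not the BEU's hardware interface; 8-bit α values are assumed:

#include <stddef.h>
#include <stdint.h>

/* out = (alpha*graphics + (255-alpha)*video) / 255 per pixel component:
 * alpha = 255 selects the graphics plane, alpha = 0 the video plane. */
void alpha_blend(const uint8_t *video, const uint8_t *graphics,
                 const uint8_t *alpha, uint8_t *out, size_t n)
{
    for (size_t i = 0; i < n; i++)
        out[i] = (uint8_t)((alpha[i] * graphics[i] +
                            (255 - alpha[i]) * video[i]) / 255);
}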

[Fig. 6.37 Data flow of uITRON system and utilized hardware IP cores: the VPU decodes the MPEG-2 video stream into a four-frame YCbCr frame buffer of decoding data, which is copied into single-frame buffers of the captured image and of the decoded data (720 × 480); the VEU scales the decoded data into the frame buffer of scaled data; the BEU synthesizes it with the 1,024 × 768 frame buffer of graphics into the frame buffer of synthesized data (1,024 × 768); and the LCDC outputs the result to the display via DVI. The captured-image and graphics buffers lie in memory shared by both uITRON and Linux; the remaining buffers are dedicated to uITRON]

6.4.1 Implementation of Main Functions

The system on uITRON plays back motion pictures, carries out the image scaling
and synthesis, and outputs the image to a monitor, which are the main functions of
the video image search. Figure 6.37 illustrates the data flow of the system on uITRON.
It also shows the utilized hardware IP cores. The VPU that decodes video streams
supports multiple video codecs such as H.264, MPEG-2, and MPEG-4. The codec
used by the system is MPEG-2. The VEU reads an image placed on the specified area
of the memory, enlarges/reduces the size of the image, and writes it to the specified
area of the memory. The BEU reads three images placed on specified areas of the memory, blends them, and writes the result to a specified area. The implemented system uses the BEU's blending of two images. The LCDC reads an image on
the specified area of the memory and transmits it to a display device. The system uses
a DVI interface for the transmission.
The implementation details of the five main functions on the uITRON system are
described as follows:
1. MPEG-2 decoding
2. Still-image capturing
3. Image scaling
4. Video image and graphics synthesizing
5. Output image controlling

First, the MPEG-2 decoding is processed on the VPU using a frame buffer of
decoding data, whose size corresponds to four frames of the video image. The VPU
starts the decoding frame-by-frame when one frame of an input data stream is
obtained from the memory, and it stores the decoded image to one of the four frames
in the frame buffer.
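For illustration only (the actual buffer management resides in the uITRON middleware), the rotation through the four decoding frames could be kept with bookkeeping as simple as this:

#define DEC_FRAMES 4

/* Four YCbCr slots; each newly decoded frame goes to the next slot
 * in rotation (an assumed policy, shown only for illustration). */
struct frame_ring {
    unsigned char *slot[DEC_FRAMES];
    int next;
};

static unsigned char *next_slot(struct frame_ring *r)
{
    unsigned char *buf = r->slot[r->next];
    r->next = (r->next + 1) % DEC_FRAMES;
    return buf;
}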
The still-image capturing duplicates the decoded image to the frame buffer of the
captured image at every decoding frame. The buffer of the captured image is placed in memory shared between uITRON and Linux; therefore, a program on Linux can obtain a decoded image at any time.
The image scaling also duplicates the decoded image to the frame buffer of the
decoded data at every decoding frame. Since the target size after scaling is set to 720 × 480, the horizontal and vertical scaling factors for the decoded image are calculated and set in the VEU. For example, when the size of the image is
720 × 480, the scaling factors are set to 1.00 and 1.00 in the horizontal and vertical direc-
tions, respectively. In the same manner, when the size is 960 × 540, the scaling factors are
set to 0.75 and 0.89. When the size is 320 × 240, the factors are 2.25 and 2.00. After the
start-up of the VEU, it reads an image from the frame buffer of the decoded data, adjusts
the size of the image according to specified scaling factors, and writes the scaled image
whose size is 720 × 480 to the frame buffer of the scaled data.
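In other words, each scaling factor is simply the ratio of the 720 × 480 target to the corresponding input dimension. A one-function sketch that reproduces the examples above, with an illustrative name:

/* Scaling factors for the VEU's fixed 720 x 480 target (illustrative). */
void veu_scale_factors(unsigned int in_w, unsigned int in_h,
                       double *fh, double *fv)
{
    *fh = 720.0 / in_w;   /* 720 -> 1.00, 960 -> 0.75, 320 -> 2.25 */
    *fv = 480.0 / in_h;   /* 480 -> 1.00, 540 -> 0.89, 240 -> 2.00 */
}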
The video image and graphics synthesizing process uses image data in the frame
buffer of the scaled data, as well as graphics data in the frame buffer of graphics and
blends them in the BEU. The size of the frame buffers is 1,024 × 768. When a scaled
image is stored in the frame buffer of scaled data, the BEU starts the blending and
writes the synthesized image to the frame buffer of the synthesized data. The graphics
frame buffer is placed in the memory area shared by both uITRON and Linux and
can therefore be updated on Linux at any time.
Finally, the output image control sets up the LCDC and a DVI transmitter to con-
vert the synthesized image stored in the frame buffer into video signals that are trans-
mitted to the monitor via the DVI interface. Figure 6.38 illustrates the processing
flow of the uITRON system. The process is repeated from supplying the video stream
to copying the frame buffer of decoding data to that of a still-captured image.

6.4.2 Implementation of Face Recognition and GUI Controls

The system on Linux performs face recognition by utilizing the similar-image search, pointing device detection, and GUI controls to create a graphics plane generated by the face recognition. Figure 6.39 depicts a block diagram of the Linux
system that comprises the following five functions:
1. Similar-image search
2. Face detection
3. Event processing
4. Image object management
5. Image processing

[Fig. 6.38 Processing flow of uITRON system: start → initialize VEU/BEU/LCDC → start LCDC → initialize VPU → decode MPEG-2 video on the VPU as the MPEG-2 video stream is supplied → copy the frame buffer of decoding data to that of decoded data → start the VEU → start the BEU → copy the frame buffer of decoding data to that of the captured image, then loop back to the decode step]

[Fig. 6.39 Block diagram of Linux system: the still image, α plane, and graphics pass through the DDR3-SDRAM memory shared with the uITRON video image search middleware; on Linux, the image object management block handles still-image, thumbnail, face-region, face, and similar-image objects plus mouse-pointer depiction, alongside the face detection; event processing comprises mouse event detection and interior event generation; image processing comprises trimming, scaling, YUV-RGB conversion, and frame depiction; and the similar-image search comprises feature calculation, registering, image search, and deletion]



[Fig. 6.40 Processing flow of Linux application programs: after initialization, mouse event detection loops until an event occurs; the event then dispatches to one of the branches: still-image capturing → still-image display → face detection → detected-faces display; still-image trimming → face-region display; similar-image trimming → similar-image display → feature calculation and search → thumbnail display; registering; or deletion]

The similar-image search consists of feature calculation, in which the feature value of a face image is calculated; registering, in which faces are registered in a database created on a hard disk drive; deletion, in which a face entry in the database is deleted; and image search, in which similar face images are searched for in the database.
The face detection utilizes a face detection function offered by Intel’s
OpenCV [20], which is a general image processing library.
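A hedged sketch of what such a call looks like with the OpenCV 1.x-era C interface [20] follows; the cascade file name and parameter values are assumptions, and newer OpenCV releases add a max_size argument to cvHaarDetectObjects:

#include <opencv/cv.h>

/* Returns a sequence of face rectangles found in a grayscale frame. */
CvSeq *detect_faces(IplImage *gray)
{
    static CvHaarClassifierCascade *cascade;   /* loaded once */
    static CvMemStorage *storage;

    if (!cascade) {
        cascade = (CvHaarClassifierCascade *)
                  cvLoad("haarcascade_frontalface_alt.xml", 0, 0, 0);
        storage = cvCreateMemStorage(0);
    }
    cvClearMemStorage(storage);
    return cvHaarDetectObjects(gray, cascade, storage,
                               1.1,   /* scale step between search passes */
                               3,     /* neighbors needed to accept a hit */
                               CV_HAAR_DO_CANNY_PRUNING,
                               cvSize(20, 20));  /* smallest face searched */
}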
The event processing consists of mouse event detection that detects the operation
of a pointing device and internal event generation that triggers processing such as the face detection according to the detected mouse event.
The image object management manages objects of the still image obtained from
uITRON via the shared memory and the image generated by the face detection. It
also manages the depiction of mouse trails detected by the event processing and
generation of the α plane that determines the synthesizing position of the video plane and the graphics plane.
Finally, the image processing performs trimming, which trims a specified range
of an image; scaling, which enlarges or reduces the size of an image; YUV–RGB
conversion, which converts the color format of an image; and frame depiction,
which makes it possible to draw a shape on a face-detected area.
Figure 6.40 shows the processing flow of the Linux application programs. First,
image objects displayed on the graphics plane are initialized. Then the operation of
a mouse connected via the USB interface is detected by a device driver embedded
in the Linux kernel. The device driver outputs on/off values of each mouse button
and the distance of mouse movement. The mouse event detection classifies mouse button operations into three events: PUSH, REPEAT, and RELEASE (see the sketch after this paragraph). Furthermore, it converts the movement distance into screen coordinates. The internal event generation is processed in accordance with the values generated by the mouse event detection. The
defined internal events include no events, still-image capturing, face detection, sim-
ilar-image display, similar-image search, similar-image registering, and similar-
image deletion.
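The classification sketched here reduces to comparing the button state reported by the driver with the previous state, while relative motion is accumulated into absolute coordinates; all names are illustrative:

enum mouse_event { EV_NONE, EV_PUSH, EV_REPEAT, EV_RELEASE };

struct pointer_state { int pressed; int x, y; };

/* Derive PUSH/REPEAT/RELEASE from the current and previous button state
 * and accumulate the reported movement distance into coordinates. */
enum mouse_event mouse_update(struct pointer_state *s,
                              int pressed_now, int dx, int dy)
{
    enum mouse_event ev = EV_NONE;

    if (pressed_now && !s->pressed)
        ev = EV_PUSH;
    else if (pressed_now && s->pressed)
        ev = EV_REPEAT;
    else if (!pressed_now && s->pressed)
        ev = EV_RELEASE;
    s->pressed = pressed_now;
    s->x += dx;
    s->y += dy;
    return ev;
}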
When a mouse event is detected on a video-plane area, the still-image capturing
event is generated, and a still image captured from decoded video images is
obtained as a still-image object. The graphics plane is updated in order to display
the newly captured image. Then the area range of the image selected by the mouse
is trimmed. The trimmed image is treated as a face-region image object, and the
graphics plane is updated again. The face detection uses the face-region image
object, and a frame shape is drawn on the area of the detected face. When a mouse
event is detected on the still-image object or the similar-image object, the face
detection is carried out by using these two objects. When a mouse event is detected
on the thumbnail image object, a thumbnail image shown in the event is displayed
as a similar image. When one is detected on a framed face of the face image object,
the face-framed part of the image is trimmed. The trimmed image is converted into the image format required for the feature calculation, and the calculation is performed. Then the similar-image search is carried out by using the calculated fea-
ture value, and the top ten similar images are displayed. When a mouse event is
detected on the framed face, the face image is registered on the similar-image data-
base. When one is detected on a thumbnail image, an entry of the image is deleted
from the database.
The execution time of each process on the Linux system was measured. Table 6.7 lists the average time for the processes.

Table 6.7 Measured average execution time of Linux system processes

Process                                              Time consumed (s)
Initialization                                       0.0506
Still-image capturing                                0.0088
Face-region display                                  0.0381
Face detection                                       1.6072
Thumbnail display                                    0.0176
Similar-image display (top ten images)               0.0798
Similar-image database access: registering           0.6348
Similar-image database access: feature
  calculation and search                             0.5857
Similar-image database access: deletion              0.2530

The face detection required 1.6 s, and access to the similar-image database took more than 0.5 s. The time for such processes
depends on parameters related to the detection accuracy of faces, and a subjective evaluation of the system determined the parameter for practical use. Figure 6.41 shows the appearance of the developed video image search system.

Fig. 6.41 Appearance of developed video image search system

References

1. ISO/IEC 13818-7:1997 (1997) Information technology—Generic coding of moving pictures and associated audio information—Part 7: advanced audio coding (AAC), ISO
2. Kodama T, Tsunoda T, Takada M, Tanaka H, Akita Y, Sato M, Ito M (2006) Flexible engine: a
dynamic reconfigurable accelerator with high performance and low power consumption. Proc
IEEE Symp. Low-Power and High-Speed Chips (COOL Chips IX) pp 393–408
3. Yoshida Y, Kamei T, Hayase K, Shibahara S, Nishii O, Hattori T, Hasegawa A, Takada M, Irie
N, Uchiyama K, Odaka T, Takada K, Kimura K, Kasahara H (2007) A 4320 MIPS four-
processor core SMP/AMP with individually managed clock frequency for low power con-
sumption. IEEE Int Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, pp 100–101
4. Shikano H, Ito M, Todaka T, Tsunoda T, Kodama T, Onouchi M, Uchiyama K, Odaka T, Kamei
T, Nagahama E, Kusaoke M, Wada Y, Kimura K, Kasahara H (2008) Heterogeneous multi-
core architecture that enables 54x AAC-LC stereo encoding. IEEE J Solid-State Circuits
43(4)
5. Sugimura T, et al (2008) High performance and low-power FFT on super parallel processor (MX) for mobile multimedia applications. Digest of ISPACS 2008, pp 146–149
6. Sato Y, et al (2009) Integral-image Based Implementation of U-SURF Algorithm for Embedded
Super Parallel Processor. Digest of ISPACS 2009, pp 485–488
7. Yamazaki H, et al (2010) An energy-efficient massively parallel embedded processor core for realtime image processing SoC. Proceedings of IEEE Symposium on Low-Power and High-Speed Chips, pp 398–409
8. Kamijo S et al (2000) Traffic monitoring and accident detection at intersections. IEEE Trans
ITS 1(2):108–118
9. Lenharth A (2003) Linux scheduler in kernel 2.4 and 2.5, May 26, 2003
10. Molnar I, O(1) scheduler README, http://people.redhat.com/mingo/O(1)-scheduler/README
11. Aas J (2005) Understanding the Linux 2.6.8.1 CPU scheduler, February 17, 2005
12. ALPBench, http://rsim.cs.illinois.edu/alp/alpbench/
13. xosview, http://sourceforge.net/projects/xosview/
14. Smith SM, Brady JM (1997) SUSAN—a new approach to low level image processing. Int J Computer Vision 23(1)
15. Guthaus MR, et al (2001) MiBench: a free, commercially representative embedded benchmark suite. Proc IEEE 4th Annual Workshop on Workload Characterization (WWC-4)
16. luvcview, http://mxhaard.free.fr/spca50x/Investigation/uvc/luvcview-20070512.tar.gz
17. Yuyama Y, Ito M, Kiyoshige Y, Nitta Y, Matsui S, Nishii O, Hasegawa A, Ishikawa M, Yamada T, Miyakoshi J, Terada K, Nojiri T, Satoh M, Mizuno H, Uchiyama K, Wada Y, Kimura K, Kasahara H, Maejima H (2010) A 45 nm 37.3GOPS/W heterogeneous multi-core SoC. IEEE International Solid-State Circuits Conference (ISSCC 2010), San Francisco, Feb 8, 2010
18. Matsubara D, Hiroike A (2009) High-speed similarity-based image retrieval with data-alignment optimization using self-organization algorithm. Proc of the 11th IEEE International Symposium on Multimedia (ISM 2009), pp 312–317
19. Nojiri T, Kondo Y, Irie N, Ito M, Sasaki H, Maejima H (2009) Domain partitioning technology
for embedded multicore processors. IEEE Micro 29(6):7–17
20. OpenCV library, http://sourceforge.net/projects/opencvlibrary
