1 s2.0 S1434841116309037 Main
Regular paper
Article info

Article history:
Received 3 October 2016
Revised 11 December 2016
Accepted 25 December 2016

Keywords:
H.265/HEVC
Two-dimensional discrete cosine transform (2D-DCT)
FPGA platform
Hardware architecture

Abstract

This study presents a design of two-dimensional (2D) discrete cosine transform (DCT) hardware architecture dedicated for High Efficiency Video Coding (HEVC) in field programmable gate array (FPGA) platforms. The proposed methodology efficiently proceeds 2D-DCT computation to fit internal components and characteristics of FPGA resources. A four-stage circuit architecture is developed to implement the proposed methodology. This architecture supports variable size of DCT computation, including 4×4, 8×8, 16×16, and 32×32. The proposed architecture has been implemented in System Verilog and synthesized in various FPGA platforms. Compared with existing related works in literature, this proposed architecture demonstrates significant advantages in hardware cost and performance improvement. The proposed architecture is able to sustain 4K@30 fps ultra high definition (UHD) TV real-time encoding applications with a reduction of 31–64% in hardware cost.
© 2016 Elsevier GmbH. All rights reserved.
⇑ Corresponding author.
E-mail addresses: [email protected] (M. Chen), [email protected] (Y. Zhang), [email protected] (C. Lu).
http://dx.doi.org/10.1016/j.aeue.2016.12.024
1434-8411/© 2016 Elsevier GmbH. All rights reserved.

1. Introduction

Rapid advances in consumer electronics have resulted in a variety of emerging video coding applications. Typical examples include ultra-high definition (UHD) 4K/8K TV [1] and unmanned aerial vehicle (UAV) reconnaissance and surveillance [2,3], which demand aggressive video compression. Despite its success in the last decade, the video compression efficiency of the H.264 standard cannot satisfy these stringent requirements [4]. The recently established H.265/HEVC standard has great potential to improve video compression efficiency by around 50%, while retaining the same video quality as H.264 [5,6]. As a result, HEVC has been viewed as one of the most promising standards to overcome these challenges [7,8].

The improved coding efficiency of HEVC comes with increased computational complexity [9]. For example, the Discrete Cosine Transform (DCT) and Inverse Discrete Cosine Transform (IDCT) are indispensable building blocks of HEVC hardware implementation [10]. A previous study has reported that DCT and IDCT computation in HEVC is estimated as 11% of total computational complexity in hardware implementations [6]. Because DCT and IDCT are computationally similar, they can share the same circuit architecture by altering the coefficient matrix. Large block sizes of DCT and IDCT (i.e., 16×16 and 32×32) are supported in the HEVC standard, while H.264 only accepts smaller block sizes (i.e., 4×4 and 8×8). Large sizes of DCT and IDCT help to improve coding efficiency. For example, the use of 16×16 and 32×32 DCTs and IDCTs in HEVC reduces bit rate by up to 10.1% [11]. However, the associated hardware cost rises significantly. For instance, a transpose buffer of 32×32×16 bits is needed to store one-dimensional transform coefficients for 32×32 2D-DCT, while a transpose buffer of 8×8×16 bits is sufficient for 8×8 2D-DCT. The hardware cost of the transpose buffer will continue to increase in the next-generation video coding standard, since it will include 64×64 and 128×128 DCT/IDCT operations [12]. Therefore, it is necessary to investigate efficient circuit architectures to reduce hardware implementation cost and computational complexity [13–15].

Nowadays, computational resources in FPGA are adequate to implement HEVC codecs, such as an FPGA implementation of a 4K real-time HEVC decoder [16] and a full HD real-time HEVC main profile decoder [17]. The use of FPGA instead of an application-specific integrated circuit (ASIC) shortens design time to market, and hence is a preferred approach for small volume production. Therefore, the study of HEVC FPGA implementation is gaining more and more attention. In order to satisfy real-time and high-efficiency coding in these emerging video applications, a few design methodologies and circuit architectures have been developed [11,18–24]. In [11], the authors presented an IDCT implementation with a zero-column skipping technique to boost energy and area efficiency. However, the use of single-port static random access memory (SRAM) does not allow pipeline operation, hence the throughput of this design is impeded. The proposed architecture in [18] focuses on reduction
2 M. Chen et al. / Int. J. Electron. Commun. (AEÜ) 73 (2017) 1–8
of hardware utilization. A series of hardware minimization techniques was applied, including operation reordering and conversion of multiplications to shift-adds. In [19–21], the proposed designs utilized distributed arithmetic hardware to perform multiplications in 2D-DCT. These approaches are efficient for smaller DCT computation (e.g., 8×8). The architectures proposed in [19–21] did not consider internal features and characteristics of FPGA platforms. In [22], a new algorithm and processing architecture for 2D-DCT were presented to achieve higher energy efficiency. Recently, the researchers in [23,24] proposed FPGA-based 2D-DCT with improved area-speed efficiency. The works in [23,24] are initial attempts to efficiently utilize features and dedicated components of FPGA platforms. However, the design strategy of allocating FPGA resources to fit DCT architectures is not elaborated in detail.

These existing architectures are inefficient when implementing HEVC 2D-DCT in FPGA platforms, because larger DCTs (e.g., 32×32) involve a great number of transpose buffers and cascaded additions and subtractions. When a design is synthesized towards FPGA platforms, many general-purpose logic elements (i.e., Look-up Tables (LUTs)) are utilized. Thus, the critical path delay of a synthesized design is longer, and the maximum operation frequency is degraded. On the other hand, if a designer is aware of internal resources of FPGA, the resultant architecture may fit with FPGA components and features. Thus, the synthesized design efficiently utilizes FPGA resources, such as digital signal processor (DSP) blocks, bus width, and on-chip memory. An efficient hardware architecture should always make every effort to fit FPGA resources, which is the focus of this paper.

This paper makes the following contributions: (1) Our proposed design methodology takes into account hardware resources of FPGA platforms, and efficiently utilizes bus width, DSP blocks, BRAM blocks, and on-chip memory bandwidth. Thus, the required general programmable logic (e.g., LUTs) is significantly reduced, and video processing throughput is largely improved. (2) A hardware architecture is proposed to support variable DCT sizes from 4×4 to 32×32, which can also be extended to larger DCT sizes (e.g., 64×64 and 128×128). This architecture facilitates hardware sharing and reusing among different DCT sizes. The design details are described and illustrated through a timing diagram. (3) The proposed architecture has been synthesized in various FPGA platforms. The benefits are presented through comparisons with existing designs in literature. The proposed architecture is able to sustain 4K@30 fps ultra high definition (UHD) TV real-time encoding applications with a reduction of 31–64% in hardware cost.

The rest of this paper is organized as follows. Section 2 reviews the basic DCT algorithm, hardware components and FPGA characteristics. Section 3 describes the proposed design methodology and system architecture. In Section 4, system implementation results are provided. An in-depth comparison with related design architectures in literature is presented. Finally, Section 5 concludes the paper.

2. Related work

2.1. Basic DCT algorithm

DCT is widely used in image coding and signal processing applications. DCT transforms images from the spatial domain into the frequency domain, and provides a more efficient representation of information. A 1D-DCT computation of an N×N block size can be expressed as [10,24]

$$Y(i,j) = \sum_{k=0}^{N-1} X(k,i) \, C(j,k) \qquad (1)$$

Here X and Y are the input and output data matrix, respectively. C is an N×N transform matrix. N can be 4, 8, 16 or 32 in the HEVC/H.265 standard. For simplicity, integer values are usually chosen in the transform matrix C in VLSI implementation. Thus, the hardware implementation achieves a finite precision DCT approximation [9]. Thanks to the separability feature, a 2D-DCT is usually decomposed into two 1D-DCTs. A 1D-DCT is first applied on each individual row of the input data matrix X, then another 1D-DCT is applied to the results of the first 1D-DCT [18]. The two 1D-DCTs are connected through a transpose buffer, which temporarily stores the results of the first 1D-DCT. Because direct hardware implementation of matrix multiplication in 1D-DCT requires intensive computation, 1D-DCT based on even-odd decomposition techniques is widely accepted to minimize computational complexity [9].

2.2. Hardware components and features of FPGA platform

FPGA is one type of pre-fabricated integrated circuit designed for rapid prototyping and functional verification. Nowadays, an FPGA platform consists of five main elements: DSP blocks, look-up tables (LUTs), flip-flops, random access memory (RAM) blocks and a routing matrix. The DSP blocks, including pre-adder, multiplier, accumulator, etc., are dedicated to accelerating complex arithmetic computation. A look-up table is a collection of logic gates that implements an arbitrary user-defined Boolean function. A look-up table is composed of register arrays and works as a combinational logic of inputs. Users can program look-up tables to realize any combinational logic function. Each look-up table may connect with flip-flops, which are indispensable components of sequential logic modules. A RAM block is an embedded storage element, whose type could be single-port, dual-port or quad-port. For example, a dual-port RAM enables simultaneous access (i.e., write or read) by two agents. The routing matrix is used to route signals and interconnect FPGA processing resources.

Different FPGA companies implement these five main elements differently. For example, Zynq is one FPGA of the Xilinx 7-series families. This FPGA has plentiful hardware resources, such as LUTs, DSP48s, and BRAMs [25,26]. The LUTs of the Zynq FPGA can be configured as either six inputs with one output, or as five inputs with multiple separate outputs. Some LUTs could be configured as 64-bit distributed RAMs or as 32-bit shift registers. A DSP48 block consists of a 25-bit pre-adder, a 25×18 two's complement multiplier, and a 48-bit accumulator. A DSP48 block may be configured as a single-instruction-multiple-data (SIMD) arithmetic unit. The DSP48 block is optimized for a short critical path, and hence reaches a clock frequency as high as 741 MHz. In addition, on-chip dual-port block RAMs (BRAM) with port width up to 72 bits are embedded in the Zynq FPGA. This BRAM also supports asymmetric read or write operations with variable port width.

3. Proposed design methodology and circuit architecture

As reviewed in Section 2, FPGA owns rich on-chip high performance resources. It is highly desirable to create HEVC architectures that fit FPGA components and characteristics. For example, 2D-DCT involves extensive matrix multiplications. Multiplications could be implemented either by LUTs or by DSP blocks inside FPGA platforms. If a design is implemented using LUTs, due to the distributed locations of LUT components and long wire routing, the resultant design will consume many LUTs and suffer from slower system operation. In contrast, if dedicated DSP blocks are selected to implement multiplications, the operating frequency and hardware efficiency will be improved over the LUT-based design scheme.

From the above discussion, it is clear that existing 2D-DCT designs in literature do not fully explore internal resources and characteristics of FPGA platforms. In this section, we propose an FPGA-friendly DCT design methodology and circuit architecture
Fig. 1. Proposed 2D-DCT 32×32 algorithm with row and column separation.
Fig. 2. FPGA four-stage architecture of the proposed 2D-DCT transformation (the surviving stage label is "4th Stage").
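The row and column separation named in Fig. 1 can be illustrated with a small Python sketch. It is not the authors' four-stage architecture: it only applies Eq. (1) twice, with the intermediate result standing in for the transpose buffer, and checks the outcome against a direct evaluation of C·X·Cᵀ. The function names (`dct_1d_pass`, `dct_2d`, `dct_2d_reference`) are hypothetical, the 4×4 integer matrix is the HEVC core transform matrix, and the intermediate scaling and rounding that a real HEVC datapath performs are deliberately omitted.

```python
# HEVC 4x4 core transform matrix (integer approximation of the DCT).
C4 = [
    [64,  64,  64,  64],
    [83,  36, -36, -83],
    [64, -64, -64,  64],
    [36, -83,  83, -36],
]


def dct_1d_pass(X, C):
    """One pass of Eq. (1): Y(i, j) = sum_k X(k, i) * C(j, k).

    The output is indexed (i, j), so the transposition needed between
    the two passes is already built into the indexing.
    """
    n = len(C)
    return [[sum(X[k][i] * C[j][k] for k in range(n)) for j in range(n)]
            for i in range(n)]


def dct_2d(X, C):
    """2D transform as two 1D passes (no scaling or rounding)."""
    transpose_buffer = dct_1d_pass(X, C)   # result of the first 1D pass
    return dct_1d_pass(transpose_buffer, C)


def dct_2d_reference(X, C):
    """Direct evaluation of C * X * C^T, used only as a cross-check."""
    n = len(C)
    CX = [[sum(C[i][k] * X[k][j] for k in range(n)) for j in range(n)]
          for i in range(n)]
    return [[sum(CX[i][k] * C[j][k] for k in range(n)) for j in range(n)]
            for i in range(n)]


block = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
    [13, 14, 15, 16],
]
flat = [[1] * 4 for _ in range(4)]

coeffs = dct_2d(block, C4)
flat_coeffs = dct_2d(flat, C4)
```

For a constant block such as `flat`, all energy ends up in the DC coefficient `flat_coeffs[0][0]`, which is the behaviour expected of a DCT-like transform.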
[Figure residue. Recoverable labels: Odd Part; inputs O00, O01, ..., O15; coefficient multipliers; Adder + Round; Odd rows of 2nd Column.]
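The even-odd decomposition mentioned in Section 2.1, and the "odd part" of the datapath labelled in the figure, can be sketched as follows. This is a generic single-level illustration in Python, not the circuit proposed in the paper. It uses the vector form y = C·x rather than the matrix indexing of Eq. (1), the function names are hypothetical, and `C8` is the HEVC 8×8 core transform matrix. The decomposition relies on the even rows of C being symmetric and the odd rows antisymmetric, so each output needs only N/2 multiplications instead of N.

```python
# HEVC 8x8 core transform matrix (integer approximation of the DCT).
C8 = [
    [64,  64,  64,  64,  64,  64,  64,  64],
    [89,  75,  50,  18, -18, -50, -75, -89],
    [83,  36, -36, -83, -83, -36,  36,  83],
    [75, -18, -89, -50,  50,  89,  18, -75],
    [64, -64, -64,  64,  64, -64, -64,  64],
    [50, -89,  18,  75, -75, -18,  89, -50],
    [36, -83,  83, -36, -36,  83, -83,  36],
    [18, -50,  75, -89,  89, -75,  50, -18],
]


def dct_1d_direct(x, C):
    """Plain matrix-vector product: N multiplications per output."""
    return [sum(c * v for c, v in zip(row, x)) for row in C]


def dct_1d_even_odd(x, C):
    """Single-level even-odd decomposition of the same product."""
    n = len(x)
    half = n // 2
    # Butterfly stage: sums feed the even outputs, differences the odd ones.
    even_in = [x[k] + x[n - 1 - k] for k in range(half)]
    odd_in = [x[k] - x[n - 1 - k] for k in range(half)]
    y = [0] * n
    for m in range(half):
        y[2 * m] = sum(C[2 * m][k] * even_in[k] for k in range(half))
        y[2 * m + 1] = sum(C[2 * m + 1][k] * odd_in[k] for k in range(half))
    return y


def multiplication_counts(n):
    """(direct, even-odd) multiplications for one N-point 1D transform."""
    return n * n, n * (n // 2)


sample = [1, 2, 3, 4, 5, 6, 7, 8]
direct = dct_1d_direct(sample, C8)
decomposed = dct_1d_even_odd(sample, C8)
```

Applying the same split recursively to the even part reduces the count further; the sketch stops at one level to keep the correspondence with the odd-part multipliers easy to see.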
architecture of 2D-DCT in this work is applicable to different technology or process nodes. It is well known that an advanced fabrication technology usually leads to shorter propagation delay and a higher clock frequency, but a constant number of 500 clock cycles is required to accomplish a 32×32 2D-DCT.

The related references in literature are included in Table 2, which lists key performance metrics of 2D-DCT architectures, such as FPGA model name, utilization of LUTs/ALMs and DSP blocks, the number of required clock cycle, clock frequency, and output throughput. Our proposed architecture enables variable block size
Table 2
FPGA Performance Results Summary.

Table 3
Hardware Resource Results for Variable Size of 2D-DCT Computation.
Column groups: DCT Block Size (4×4, 8×8, 16×16, 32×32); [24] (ALM/Frequency); This work (ALM or LUT/Frequency); Performance Comparison at Arria II GX FPGA.

[24], Arria II GX    This work, Arria II GX    This work, Xilinx Zynq    ALM number reduction    Clock freq. degradation
7269/200 MHz         5034/138 MHz              5806/222 MHz              31%                     31%
6928/200 MHz         4108/138 MHz              5726/222 MHz              41%                     31%
6821/200 MHz         3424/143 MHz              4733/225 MHz              50%                     29%
6792/200 MHz         2967/150 MHz              3898/237 MHz              56%                     25%
5014/200 MHz         2586/179 MHz              3155/261 MHz              48%                     11%
3436/200 MHz         2097/206 MHz              2478/289 MHz              39%                     3%
4921/200 MHz         1781/185 MHz              2745/263 MHz              64%                     8%
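The percentage columns of Table 3 and the real-time requirement discussed in this section can be cross-checked with a few lines of Python. The sketch below is only a consistency check on the published numbers, not part of the authors' flow. All variable names are hypothetical, half-up rounding is an assumption about how the table was rounded, and the 4K frame size (3840×2160 at 30 fps) and the 500 cycles per 32×32 block are the figures quoted in the text.

```python
# Table 3, Arria II GX rows: (ALMs in [24], ALMs in this work, MHz in this work).
# The reference design [24] is listed at 200 MHz in every row.
REF_MHZ = 200
ROWS = [
    (7269, 5034, 138),
    (6928, 4108, 138),
    (6821, 3424, 143),
    (6792, 2967, 150),
    (5014, 2586, 179),
    (3436, 2097, 206),
    (4921, 1781, 185),
]
# Clock frequencies of this work on Xilinx Zynq, same row order.
ZYNQ_MHZ = [222, 222, 225, 237, 261, 289, 263]


def percent_half_up(part, whole):
    """Integer percentage of part/whole, rounded half up (assumed rounding)."""
    return (200 * part + whole) // (2 * whole)


# Relative ALM saving of this work versus [24].
alm_reduction = [percent_half_up(ref - ours, ref) for ref, ours, _ in ROWS]

# Magnitude of the clock-frequency difference versus 200 MHz. The 206 MHz
# row is actually faster than the reference, so only the magnitude matches.
freq_change = [percent_half_up(abs(REF_MHZ - mhz), REF_MHZ) for _, _, mhz in ROWS]

# Real-time requirement for 4K@30 fps with 32x32 blocks, 500 cycles per block.
blocks_per_second = 3840 * 2160 * 30 // (32 * 32)
required_hz = blocks_per_second * 500

# Slowest reported clock across both platforms.
slowest_mhz = min([mhz for _, _, mhz in ROWS] + ZYNQ_MHZ)
meets_real_time = slowest_mhz * 1_000_000 >= required_hz
```

Under these assumptions the recomputed percentages agree with the table, and the slowest reported clock (138 MHz) stays above the 121.5 MHz requirement.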
(4×4, 8×8, 16×16 and 32×32), while the references [18,20,21] only support one smaller block size (8×8 or 16×16). The references [18,20,21] do not utilize on-chip DSP blocks. Due to the use of on-chip DSPs, our proposed work results in a much shorter critical path and significant improvement in terms of frequency and throughput. Because [20] and this proposed work utilize different FPGA platforms (as well as different fabrication technology), the number of required clock cycles should be compared instead of frequency. Under the same supported DCT size (8×8), both works utilize the same amount of hardware resources (2.5 K LUT/ALM). However, the required number of clock cycle in this work is 7, which is equivalent to only 5.5% of that in [20]. Therefore, our proposed work is more advantageous than [20]. Xilinx and Altera are the two dominant FPGA companies. Their FPGA platforms are distinctive in terms of design tools, reconfigurable logic cells and system architecture. As a result, there is no intention to compare these key performance metrics of the proposed design between Arria II GX and Xilinx Zynq. The purpose of including Xilinx Zynq results in Tables 2 and 3 is to demonstrate that our proposed design is applicable to both primary FPGA manufacturers. The results in Tables 2 and 3 indicate that our proposed idea is generic and not limited to any specific FPGA company. Both the reference [24] and this work support variable DCT sizes. Using the same FPGA platform, our proposed architecture saves LUT resources by 32% and DSP blocks by 75%. As a result, our proposed architecture excels in hardware cost, while the clock frequency is slightly lower than that of [24]. Alternatively, if Xilinx Zynq is the implemented FPGA platform, our proposed architecture operates at a clock rate of 222 MHz and achieves a throughput of 453 M pixels/second. Note that the required number of clock cycles in our proposed design is the same in both Arria II and Xilinx Zynq platforms.

Power consumption is another important factor to compare among these designs. Yet, because there are no power consumption values available in references [18,21], we cannot include power consumption for a quantitative comparison in Table 2. Alternatively, we offer a qualitative analysis as below. According to the statement of the primary FPGA company Altera [27], the DSP blocks in modern FPGA platforms are very power efficient. These power-efficient DSP blocks enable the use of modern FPGA platforms in high definition video coding applications (e.g., our targeted application, the HEVC encoder) [27]. Moreover, as reported in [28], the DSP blocks only account for 1% of total dynamic power consumption in Stratix III devices, while 70% of power consumption comes from user logic and signal routing. Therefore, the amount of user logic (i.e., LUT/ALM) is a good indicator for energy consumption estimation. Table 2 exhibits that our proposed design consists of much less user logic than [18,21,24] under the same FPGA platform. Under different FPGA platforms, the amount of user logic in the proposed work is the same as that in [20], but the required clock cycle is reduced largely. Hence, it is highly possible that our proposed design results in lower energy consumption due to significant savings in user logic and signal routing. Even though our proposed design uses 16 or 64 more DSP blocks than [18] or [21], the power consumption resulting from DSP blocks is negligible compared with that from user logic and signal routing. In all, considering the above two reasons, we estimate that the use of DSP blocks in modern FPGA platforms probably does not lead to significant power consumption. Our proposed design is acceptable in handheld consumer electronics from an energy consumption point of view.

Moreover, we focus on Altera "Arria II GX" and Xilinx "Zynq" to thoroughly discuss system performance. Arria II GX and Zynq are
based on eight-input adaptive-logic-modules (ALM8) and six-input look-up tables (LUT6), respectively. Table 3 summarizes hardware resource results for variable size of 2D-DCT computation. Note that even though Arria II and Zynq are implemented with 40 nm and 28 nm processes respectively, the required number of clock cycles in our proposed design is the same (i.e., 500 for a 32×32 2D-DCT in Table 2). Our proposed architecture in Zynq requires at least 15% more hardware resources than in Arria II GX. Our proposed architecture demonstrates a higher clock frequency (i.e., 222–289 MHz) at Xilinx Zynq, which is 30–38% faster than the system implementations at Arria II GX (i.e., 138–206 MHz). This is because our architecture is designed inherently to best fit characteristics of the Zynq platform, where the distributed RAMs/ROMs help to improve operation speed and reduce the amount of LUTs for logic synthesis. Table 3 also demonstrates the benefit of this work over the reference [24]. Using the same FPGA platform, this work achieves 31–64% reduction in the number of ALMs, while the clock frequency overhead is no more than 31%.

Let us take a look at a 4K@30 fps UHD TV video encoding application. The minimum throughput to accomplish a 32×32 DCT is calculated as 3840 × 2160 × 30/(32 × 32) = 243,000 blocks/second. Since our proposed 2D-DCT architecture needs 500 clock cycles to complete one 32×32 block, our proposed architecture demands 243,000 × 500 = 121.5 million cycles/second, which is equivalent to a clock frequency of 121.5 MHz. As shown in Table 2, no matter whether the FPGA platform is Arria II GX or Xilinx Zynq, the clock rate of our synthesized solution reaches at least 138 MHz. This number indicates that our proposed architecture is able to sustain 4K@30 fps UHD TV real-time encoding applications, while achieving lower hardware cost.

5. Conclusion

This paper presents an FPGA-friendly architecture design of variable size 2D-DCT for the HEVC standard. 4×4, 8×8, 16×16 and 32×32 sizes of 2D-DCT are embedded in one architecture. This property enables multiple DCT sizes to share and reuse hardware resources. The proposed methodology efficiently proceeds 2D-DCT computation to fit internal components and characteristics of FPGA platforms. Details of circuit architecture and timing diagram are described in this work. The proposed architecture has been implemented in several FPGA platforms. Synthesis and simulation results demonstrate that the proposed architecture has great advantages in hardware cost, operating frequency and throughput, in contrast with prior works in literature. The proposed architecture is able to sustain 4K@30 fps UHD TV real-time encoding applications with a reduction of 31–64% in hardware cost.

References

[5] Bossen F, Bross B, Suhring K, Flynn D. HEVC complexity and implementation analysis. IEEE Trans Circuits Syst Video Technol December 2012;22(12):1685–96.
[6] Kalali E, Ozcan E, Yalcinkaya O, Hamzaoglu I. A low energy HEVC inverse transform hardware. IEEE Trans Consum Electron 2014;60(4):754–61.
[7] Kessentini A, Samet A, Ayed M, Masmoudi N. Performance analysis of inter-layer prediction module for H.264/SVC. Int J Electron Commun 2015;69(1):344–50.
[8] Samcovic A. Mathematical modeling of coding gain and rate-distortion function in multihypothesis motion compensation for video signals. Int J Electron Commun 2015;69(2):487–91.
[9] Budagavi M, Fuldseth A, Bjontegaard G, Sze V, Sadafale M. Core transform design in the high efficiency video coding (HEVC) standard. IEEE J Sel Top Signal Process 2013;7(6):1029–41.
[10] Rao KR, Yip P. Discrete cosine transform: algorithms, advantages, applications. Academic Press Inc; 1990.
[11] Tikekar M, Huang C, Sze V, Chandrakasan A. Energy and area efficient hardware implementation of HEVC inverse transform and dequantization. In: IEEE international conference on image processing (ICIP). p. 2100–4.
[12] JVET document, Joint video exploration team of ITU-T SG 16 WP 3 and ISO/IEC JTC 1/SC 29/WG 11 meeting.
[13] Meher P, Park S, Mohanty B, Lim K, Yeo C. Efficient integer DCT architectures for HEVC. IEEE Trans Circuits Syst Video Technol 2014;24(1):168–78.
[14] Zhu J, Liu Z, Wang D. Fully pipelined DCT/IDCT/Hadamard unified transform architecture for HEVC codec. In: IEEE international symposium on circuits and systems. p. 677–80.
[15] Budagavi M, Sze V. Unified forward+inverse transform architecture for HEVC. In: IEEE international conference on image processing. p. 209–12.
[16] Abeydeera M, Karunaratne M, Karunaratne G, Silva KD, Pasqual A. 4K real-time HEVC decoder on an FPGA. IEEE Trans Circuits Syst Video Technol January 2016;26(1):236–49.
[17] Engelhardt D, Moller J, Hahlbeck J, Stabernack B. FPGA implementation of a full HD real-time HEVC main profile decoder. IEEE Trans Consum Electron August 2014;60(3):476–84.
[18] Conceicao R, Souza J, Jeske R, Zatt B, Porto M, Agostini L. Low-cost and high throughput hardware design for the HEVC 16x16 2-D DCT transform. J Integr Circuits Syst 2014;9:25–35.
[19] Huang J, Parris M, Lee J, Demara R. Scalable FPGA architecture for DCT computation using dynamic partial reconfiguration. ACM transactions on embedded computing systems 2009;9(1).
[20] Yusof Z, Suleiman I, Aspar Z. Implementation of two dimensional forward DCT and inverse DCT using FPGA. Int Conf Electr Electron Technol 2000;3:242–5.
[21] Kitsos P, Voros N, Dagiuklas T, Skodras A. A high speed FPGA implementation of the 2D DCT for ultra-high definition video coding. In: International conference on digital signal processing. p. 1–5.
[22] Scrofano R, Jang J, Prasanna V. Energy-efficient discrete cosine transform on FPGAs. Eng Reconfigurable Syst Algorithms 2003:215–21.
[23] Atitallah A, Kadionik P, Ghozzi F, Nouel P, Masmoudi N, Marchegay P. Optimization and implementation on FPGA of the DCT/IDCT algorithm. In: International conference on acoustics speech and signal processing proceedings. p. 928–31.
[24] Pastuszak G. Hardware architectures for the H.265/HEVC discrete cosine transform. IET Image Process 2015;9(6):468–77.
[25] Xilinx FPGA document, <http://www.xilinx.com/support/documentation/white_papers/wp406-DSP-Design-Productivity.pdf>.
[26] Xilinx 7 Series DSP48E1 Slice User Guide, <http://www.xilinx.com/support/documentation/user_guides/ug479_7Series_DSP48E1.pdf>.
[27] DSP blocks in Stratix series FPGAs, <https://www.altera.com/products/fpga/features/stx-dsp-block.html>.
[28] Blasinski H, Amiel F, Ea T. Impact of different power reduction techniques at architectural level on modern FPGAs. LASCAS; 2010.
Yuanzhi Zhang received the B.S. and M.S. degrees in Electronic Engineering from Shandong University, China in 2011 and 2014, respectively. He has been working towards his Ph.D. degree at Southern Illinois University Carbondale, IL, United States since August 2015. His research interests include HEVC/H.265 video/image processing circuit and system optimization, low power SRAM VLSI design and methodology, and 3D-IC system design.

Chao Lu received the B.S. degree in electrical engineering from the Nankai University, Tianjin, China in 2004 and the M.S. degree in the Department of Electronic and Computer Engineering from the Hong Kong University of Science and Technology, Hong Kong, in 2007. He obtained his Ph.D. degree at Purdue University, West Lafayette, Indiana, in 2012. From 2013 to 2015, he worked as an R&D circuit design engineer at Arctic Sand Technologies Inc. and Tezzaron Semiconductors. Now he works as an assistant professor in the Electrical and Computer Engineering Department of Southern Illinois University Carbondale. His research interests include design of micro-scale energy harvesting systems, HEVC/H.265 video/image processor, power efficient memory design, and power management IC design for ultra-low power applications. Mr. Lu was the recipient of the Best Paper Award of the International Symposium on Low Power Electronics and Design (2007).