PLATFORM-BASED MPEG-4 VIDEO ENCODER SOC DESIGN
Yung-Chi Chang, Wei-Min Chao and Liang-Gee Chen
DSP/IC Design Lab
Department of Electrical Engineering and Graduate Institute of Electronics Engineering
National Taiwan University, Taipei, Taiwan, R.O.C.
0-7803-8504-7/04/$20.00 ©2004 IEEE 251 SIPS 2004

ABSTRACT

An MPEG-4 video coding SOC design is presented in this paper. We adopt a
platform-based architecture with an embedded RISC core and an efficient
memory organization. A motion estimator supporting predictive diamond
search and spiral full search is implemented as a compromise between
compression performance and design cost. The proposed data reuse scheme
reduces the required memory access bandwidth. Several key modules are
integrated into an efficient platform in a hardware/software co-design
fashion. The cost-efficient video encoder SOC consumes 256.8 mW at 40 MHz
and achieves real-time encoding of 30 CIF (352x288) frames per second.

1. INTRODUCTION

The MPEG-4 standard is becoming the main technique for mobile devices and
streaming video applications such as smart phones and handheld PDAs. The
improved coding efficiency and advanced functionalities of MPEG-4 come
with much higher computational complexity compared with previous
standards. Several MPEG-4 video chips have been reported. To satisfy the
rich functionality of future multimedia, some are implemented in software
[3] on a low-power DSP platform. They have the highest flexibility but
degraded quality due to the fast algorithms used for ME and DCT. Some [4]
use a dedicated hardware methodology to achieve low power and low area
cost; their disadvantages are the lack of potential for future
modification toward advanced algorithms and the higher design effort.
Hence, some [5] [6] adopt hybrid software/hardware co-design as a
compromise between performance and flexibility.

According to the computational complexity analysis reported in [1] and
[2], the dominant computation-intensive tasks in MPEG-4 core profile
coding are motion estimation (ME) and shape encoding, which together
contribute more than 90% of the overall complexity. For the simple
profile without shape coding tools, ME becomes the most significant one.
It is a highly regular low-level task that also requires a huge amount of
data access through the frame buffer. Dedicated architectures and local
buffers are therefore heavily relied on for efficient implementation and
data access reduction. The other highly regular coding tasks, including
DCT/IDCT, Q/IQ, and MC, can also adopt dedicated architectures.
Programmable architectures are suitable for the remaining less demanding
but high-level tasks, such as system control.

In this paper, a RISC-based platform with hardware accelerators is
presented to implement MPEG-4 video encoding algorithms. Optimization is
applied at both the algorithm and the architecture level. Not only the
key components but also the connection optimization and the memory
organization are discussed. The whole system is divided into three main
subsystems. In the motion subsystem, a hybrid motion estimator supporting
both predictive diamond search and spiral full search with halfway
termination, for real-time or high compression quality applications, is
proposed to reduce the dominant cost of a typical coding system. In the
texture subsystem, an efficient interleaving schedule and a substructure
sharing technique among quantization and DC/AC prediction are proposed
[7] to reduce the cost further. In the bitstream subsystem, a
hardware/software cooperation scheme is applied to bitstream generation
to handle the complex bitstream syntax and avoid inefficient bit-level
storage. By applying these optimizations, a low-cost and high-performance
MPEG-4 video encoder SOC is implemented.

This paper is organized as follows. Sec. 2 explores the whole system
architecture. Sec. 3 discusses the system memory organization. Sec. 4
presents the algorithm, architecture, and performance of the motion
estimator. Sec. 5 shows the implementation results, and Sec. 6 gives a
brief conclusion.

2. SYSTEM ARCHITECTURE

Fig. 1. System Architecture (block diagram: RISC with cache, DMA, and
memory interface; motion estimator, motion compensator, share memory,
texture block engine, and bitstream generator; connected by the RISC,
DATA, SHARE, PROGRAM, and BITSTREAM buses)

Fig. 2. Heterogeneous memory organizations (physical address space of the
external memory: frame-based map for the source frame, block-based map
for the reconstructed frame, word-based map for the bitstream and other
information)

Fig. 1 depicts the proposed platform-based MPEG-4 video coding system.
The RISC takes responsibility for MB-level hardware scheduling, coding
mode decision, motion vector coding, and other high-level procedures. The
hardware accelerators improve the system performance by parallel
processing according to the parallelism of the algorithms.
Motion estimator (ME) carries out motion estimation with the
search range -16.0 to +15.5 pixel unit. Motion compensator (MC)
interpolates pixels in reference frames into compensated blocks by
specified motion vectors. Texture block engine (TBE) carries out discrete
cosine transform (DCT), inverse discrete cosine transform (IDCT),
quantization (Q), inverse quantization (IQ), and AC/DC prediction on
texture pixels in block units. Bitstream generator (BTS) produces
headers, motion information, and texture information in the format of
variable length codes. In addition, the share memory builds direct
channels from MC to TBE and from TBE to BTS to decrease the traffic on
the data bus. The DMA, driven by dedicated commands issued by the RISC,
efficiently generates the proper addresses. Four global bus channels are
used in this system. First, the RISC bus broadcasts control information
to each hardware module. After carrying out the operations issued by the
RISC, the hardware modules return processed side information to the RISC
for the MB coding mode decision. At the same time, the source, reference,
and reconstructed frames required by the hardware modules are passed
through the DMA and provided on the DATA bus. The hardware modules access
the data automatically according to a pre-determined schedule. These
parts are integrated into a single chip, with the firmware stored outside
and loaded through the PROGRAM bus so that the chip remains programmable
after tape-out. The SHARE bus can transfer DCT coefficients, quantized
coefficients, or other intermediate information in the testing mode. This
information reduces the development time and effort.

The RISC core has a four-stage pipeline with separate program and data
memories. Its instruction word is 21 bits. Special 2-operand MAX and MIN
instructions are included for the median operations of the MV predictor
decision. A hardwired datapath for multiplication and division is also
provided. We also propose an immediate store instruction (SWI) that sends
a specified data value to memory. Compared to the traditional approach,
which requires one instruction to move the data into a register and
another to store it from the register to the memory, it results in a
significant reduction in code size. To achieve cycle-accurate control, an
inner timer and a polling technique are introduced. A special
instruction, WAIT, supports this functionality. When the RISC encounters
a WAIT instruction, it waits until the next trigger event.

3. MEMORY ORGANIZATION

We have off-chip memory and several on-chip memory blocks. The off-chip
memory contains source frames, reconstructed frames, and AC/DC
information. The on-chip memory is used as local buffers to reduce the
bus bandwidth. Because irregular accesses to and from off-chip RAM carry
a penalty, we make the off-chip accesses more consecutive and serve
random accesses from on-chip RAM. For MPEG-4 video coding, a block-based
memory organization is efficient for burst reading a block of data for
video processing. However, common video input/output devices usually
adopt the raster scan direction, for which addressing is more regular if
the frame data is arranged in a frame-based scheme. Therefore, we use a
heterogeneous memory organization for the off-chip RAM, as shown in
Fig. 2. The source frames are stored in the frame-based way, while the
reconstructed frames are stored in the block-based way for later
processing. The bitstream data and AC/DC information are arranged with
traditional 1-D addressing. With this arrangement, the data accesses
to/from the off-chip memory become more consecutive.

Fig. 3 shows four types of off-chip memory access. Each store unit (word)
consists of four pixels. In case (a), the search window (SW) data of
48x48 pixels for ME are loaded from the previous reconstructed frame. In
case (b), the 9x9 reference blocks are required for half-pixel MC.
Reading in the vertical direction can reduce the frequency of crossing
neighboring blocks. In case (c), data are read out from source frames in
the frame-based organization. In case (d), the 8x8 reconstructed block is
burst written to the block-based organization region of external memory
without any segmentation.

Fig. 3. Memory Access Scheme (a) SW for ME (b) Block for MC (c) Block to
TBE (d) Block from TBE
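
The benefit of the block-based map for an 8x8 block transfer can be
illustrated with a small address calculation. The sketch below is not
taken from the design. It assumes four pixels per word (as stated above),
blocks numbered in raster order, and row-by-row storage inside each 8x8
block; the paper does not give the exact layout. The names
frame_word_addr, block_word_addr, burst_count, and block_bursts are
hypothetical.

```python
PIXELS_PER_WORD = 4   # from the text: each store unit (word) holds four pixels
BLOCK = 8             # 8x8 blocks, as written back in case (d)


def frame_word_addr(x, y, width):
    """Word address of pixel (x, y) in a raster-scan (frame-based) map."""
    return (y * width + x) // PIXELS_PER_WORD


def block_word_addr(x, y, width):
    """Word address of pixel (x, y) when each 8x8 block is stored contiguously.

    Assumption: blocks are numbered in raster order and pixels inside a
    block are stored row by row.
    """
    blocks_per_row = width // BLOCK
    block_index = (y // BLOCK) * blocks_per_row + (x // BLOCK)
    offset_in_block = (y % BLOCK) * BLOCK + (x % BLOCK)
    return (block_index * BLOCK * BLOCK + offset_in_block) // PIXELS_PER_WORD


def burst_count(addresses):
    """Number of runs of consecutive word addresses (one burst per run)."""
    ordered = sorted(set(addresses))
    return 1 + sum(1 for a, b in zip(ordered, ordered[1:]) if b != a + 1)


def block_bursts(addr_fn, bx, by, width):
    """Bursts needed to transfer the 8x8 block whose top-left pixel is (bx, by)."""
    addrs = [addr_fn(bx + i, by + j, width)
             for j in range(BLOCK) for i in range(BLOCK)]
    return burst_count(addrs)


CIF_WIDTH = 352
print(block_bursts(frame_word_addr, 16, 8, CIF_WIDTH))  # 8
print(block_bursts(block_word_addr, 16, 8, CIF_WIDTH))  # 1
```

Under these assumptions the frame-based map needs one burst per block row
(eight in total), while the block-based map needs a single 16-word burst,
which corresponds to the unsegmented burst write described for case (d).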
The input video source, reconstructed frames, and transformed
coefficients for AC/DC prediction are stored in the external memory.
Direct Memory Access (DMA) controls the memory interface (MIF) to read
data from or write data to the external memory in a specified sequence
after being initialized by the RISC. In this kind of data-intensive
application, the DMA always carries a heavy load in handling the traffic
through the data bus. Therefore, three special functions are built into
the DMA to reduce addressing overhead and to provide pixel data more
efficiently. They not only improve the data access but also decrease the
complexity of address generation in the other hardware modules. First,
the address generation includes the conversion from 2-D to 1-D addresses.
Second, the advanced prediction mode allows motion vectors to point
outside the VOP, in which case the data is padded from the boundary
pixels. The DMA handles this boundary data so that the ME and MC units
can focus on the MB currently being processed. Third, special addressing
for half-pixel precision compensation is supported. Because of the
half-pixel precision of motion compensation, the compensated block is
read out as 9 by 9 pixels and may span four blocks in the block-based
memory organization. This kind of fixed addressing is designed into the
control unit of the DMA to improve the performance.

4. MOTION ESTIMATOR DESIGN

4.1. Algorithm

To meet the requirements of various applications at an acceptable cost,
we adopt two kinds of algorithms for the motion estimation of 16x16
blocks at integer-pixel precision. One is the spiral full search with
halfway termination (called fast full search, FFS), which can achieve the
same compression efficiency as the full search algorithm. The other is
the diamond search starting from the predictor derived from neighboring
MBs (called predictive diamond search, PDS), which meets the real-time
specification at the cost of some visual quality degradation. Afterwards,
a hierarchy scheme is applied for the motion estimation of the four 8x8
pixel blocks in an MB, searching the -2 to +2 positions around the
previous best motion vector. Half-pixel refinement is also applied to all
integer-pixel motion vectors found.

Fig. 4 depicts the whole motion estimation flow, which proceeds in the
following phases. (1) The predictor is determined from neighboring MBs.
(2) The PDS mode or FFS mode is employed to find the integer-pixel motion
vector. (3) Half-pixel refinement is applied around the motion vector
found in phase 2. (4) For the four 8x8 pixel blocks in an MB, a spiral
search around -2 to +2 is applied to obtain four optimal motion vectors.
(5) Half-pixel refinement is applied four times, around the motion
vectors found in the previous phases.

Fig. 4. Algorithms of motion estimation (panels: (1) determine the motion
vector predictor, where the predictor of macroblock X is the median of
the motion vectors of macroblocks A, B, and C; (2) integer-pel motion
estimation (16x16 block size) and then half-pel refinement, using FFS
(spiral to all search area) or PDS (initial phase, refinement phase with
large diamond, last phase with small diamond); (3) local motion
estimation (8x8 block size) and then half-pel refinement)
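
The predictor decision and the FFS integer-pixel search described above
can be sketched in software as follows. This is a minimal model, not the
hardware implementation. The candidate order is approximated by sorting
positions by their ring distance from the search center rather than by
the chip's actual spiral pattern, halfway termination is assumed to mean
abandoning a candidate as soon as its partial SAD reaches the current
minimum, and the frame and block sizes are small illustrative values. The
names median3, mv_predictor, sad_with_termination, and fast_full_search
are hypothetical.

```python
def median3(a, b, c):
    """Median of three values using only two-operand max/min operations."""
    return max(min(a, b), min(max(a, b), c))


def mv_predictor(mv_a, mv_b, mv_c):
    """Component-wise median of the motion vectors of three neighboring MBs."""
    return (median3(mv_a[0], mv_b[0], mv_c[0]),
            median3(mv_a[1], mv_b[1], mv_c[1]))


def sad_with_termination(cur, ref, rx, ry, best):
    """Row-by-row SAD; returns None once the partial sum cannot beat `best`."""
    total = 0
    for j, row in enumerate(cur):
        ref_row = ref[ry + j]
        for i, pixel in enumerate(row):
            total += abs(pixel - ref_row[rx + i])
        if total >= best:
            return None  # halfway termination (assumed criterion)
    return total


def fast_full_search(cur, ref, cx, cy, search):
    """Check every in-frame candidate within +/-search, center first."""
    size = len(cur)
    max_x, max_y = len(ref[0]) - size, len(ref) - size
    candidates = [(dx, dy)
                  for dy in range(-search, search + 1)
                  for dx in range(-search, search + 1)
                  if 0 <= cx + dx <= max_x and 0 <= cy + dy <= max_y]
    # Ring distance from the center stands in for the spiral scan order.
    candidates.sort(key=lambda d: max(abs(d[0]), abs(d[1])))
    best_sad, best_mv = float("inf"), None
    for dx, dy in candidates:
        sad = sad_with_termination(cur, ref, cx + dx, cy + dy, best_sad)
        if sad is not None:
            best_sad, best_mv = sad, (dx, dy)
    return best_mv, best_sad


# Toy data: a 16x16 reference frame and a 4x4 block copied from position (7, 4).
ref = [[x + 16 * y for x in range(16)] for y in range(16)]
cur = [row[7:11] for row in ref[4:8]]

print(mv_predictor((2, -1), (4, 3), (3, 0)))       # (3, 0)
print(fast_full_search(cur, ref, 6, 6, search=2))  # ((1, -2), 0)
```

Because a candidate is abandoned only when it can no longer improve on
the current minimum, the sketch returns the same minimum SAD as an
exhaustive search while skipping part of the accumulation for most
candidates.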
4.2. Architecture

Fig. 5 depicts the hardware architecture of the motion estimator
supporting PDS and FFS. This architecture mainly includes three
processing stages and two buffers that store the current MB and the
search window. Before performing motion estimation, the video coding
system transfers data from external memory into these buffers so that the
subsequent sum of absolute difference (SAD) calculations require no bus
bandwidth. Meanwhile, the adder tree accumulates the sum of the pixels in
the current MB and saves it into a register for the later mode decision.
To speed up the data loading and reduce the bus traffic, the search
window buffer can be loaded using a column-by-column data reuse scheme.
After motion estimation starts, the pattern generation (PG) stage
generates the valid candidate positions. These positions are then passed
through the FIFO stage and fetched by the distortion calculation (DC)
stage. The DC stage is responsible for calculating the SAD of the
candidate positions and finding the minimum one. The accumulation
comparison elimination (ACE) unit performs the PDE algorithm to reduce
the computational complexity.

Fig. 5. Architecture of motion estimator (block diagram: controller with
mode input and data loading path; pattern generation from an adjustable
spiral pattern and a ROM-based diamond pattern through a MUX and range
checker, producing (id,u,v); FIFO with valid, fetch, and terminate
signals; address generator (AG), MB RAM, and SW MEM with 64-bit outputs;
distortion calculation with adder tree, accumulator, rate-biased
termination detector, and minimum motion vector with SAD)

4.3. Data Reuse Scheme

An eight-way interleaved memory organization is used to fetch eight
pixels dynamically in one cycle without read collisions in the same
memory bank. Fig. 6 depicts this organization with the search range -16
to +15 pixels. Before performing motion estimation for each MB, labeled
A, B, or C, the colocated 48 by 48 pixel search window in the reference
frame must be loaded into the search window memory (SWMEM), which is
divided into three strips, each of which holds a distinct
sixteen-pixel-wide column of the window. For the leftmost MB of the
frame, all data located in these three strips must be loaded. However,
the remaining MBs in the same row can reuse two-thirds of the search
window of the immediately previous MB.

With rotation and modulo operations for addressing, this column-by-column
data reuse scheme is applied to the motion estimation architecture. The
bus traffic for loading the search window is thereby reduced from 26.10
to 9.49 Mbytes per second for CIF format with the search range of -16 to
+15 pixels. In each strip, eight horizontally neighboring pixels (a half
row) are stored in eight separate memories with linear addressing. When a
half row of pixels is read at a random position, two consecutive
addresses are first calculated from the two-dimensional coordinates. Then
the proper circular rotation is applied to the data read out from the
memory banks.

Fig. 6. Memory organization of motion estimator (48x48 search window in
the reference frame divided into strips 0, 1, and 2 for MBs A, B, C, and
D; grid of pixel locations labeled #A, meaning the pixel in RAM # with
address A; example of a half row for address (3,2))

4.4. Performance

The PDS mode can satisfy the real-time specification, while the FFS mode
can achieve the same compression quality as the MPEG-4 software
verification model (VM) [8]. To explore the degradation in the PDS mode,
four sequences with different features are used as test patterns. The
average difference in PSNR between PDS and VM is only 0.136 dB, and the
maximum PSNR drop over the test sequences is only 0.618 dB. Even in the
frames where the difference in PSNR is largest, the two are still
subjectively indistinguishable. When encoding in the FFS mode, the PSNR
and bit rate of the reconstructed frames are almost the same as those
encoded by VM; the average PSNR is even 0.00625 dB better. The general
R-D curves for the test sequences are simulated and shown in Fig. 7.

Fig. 7. RD curves with PDS and FFS modes (axes: PSNR Y (dB) versus bit
rate (bps); curves for the weather, foreman, stefan, and coastguard
sequences, each in PDS and FFS mode)

Fig. 8. Configurable platform (host computer connected through a PCI
connector and PCI controller with arbiter to a Xilinx FPGA holding the
MPEG-4 video encoder; four SRAMs with MEMIF and DMA on the PROGRAM,
FRAME, SM, and BITSTREAM buses)
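
The search-window traffic figures quoted in Sec. 4.3 can be reproduced
with a short calculation. The sketch below assumes one byte per pixel, a
full 48x48 window load for the first MB of each MB row and a single new
16x48 strip for every other MB in the row, no special treatment of frame
boundaries, and 1 Mbyte = 2^20 bytes. These assumptions are not stated in
the text; they are chosen because they reproduce the quoted numbers. The
function name search_window_traffic is hypothetical.

```python
def search_window_traffic(reuse, width=352, height=288, fps=30,
                          mb=16, window=48):
    """Bytes per second needed to load the ME search windows."""
    mbs_per_row = width // mb          # 22 MBs per row for CIF
    mb_rows = height // mb             # 18 MB rows for CIF
    full_load = window * window        # whole 48x48 window, 1 byte per pixel
    strip_load = mb * window           # one new 16-pixel-wide strip
    if reuse:
        # First MB of a row loads three strips, the others load one new strip.
        per_row = full_load + (mbs_per_row - 1) * strip_load
    else:
        per_row = mbs_per_row * full_load
    return per_row * mb_rows * fps


MBYTE = 1024 * 1024
print(round(search_window_traffic(reuse=False) / MBYTE, 2))  # 26.1
print(round(search_window_traffic(reuse=True) / MBYTE, 2))   # 9.49
```

Each MB after the first in a row loads one strip out of three, which is
the two-thirds reuse described in Sec. 4.3; the full load at the start of
each row is why the overall reduction is slightly less than a factor of
three.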
5. IMPLEMENTATION

A configurable platform, shown in Fig. 8, is used to verify the
functionality of our architecture design. This prototyping board is
connected to the host computer through the PCI interface. Four separate
memories with DMA modules handle the PROGRAM, DATA, SHARE, and BITSTREAM
buses of our design. An arbiter is responsible for the memory access
through PCI and memory. The MPEG-4 video encoder design is synthesized
and placed on the FPGA chip. The RISC program is compiled to machine code
by the host computer and then sent to the program memory. Raw image data
is transferred from the host computer to the frame memory on the
prototyping board. Video encoding is processed concurrently. Afterwards,
the bitstream data are stored in the bitstream memory and then read by
the host computer. In addition, the share memory can record intermediate
information for debugging in the testing mode.

Fig. 9 shows a micrograph of the encoder and Table 1 lists its
characteristics. It contains 828K transistors and is fabricated on a
5.02 x 5.13 mm² die in a 0.35 µm single-poly quadruple-metal CMOS
process. The chip has been tested and works successfully. With a supply
voltage of 3.3 V, it consumes 256.8 mW at a 40 MHz working frequency.
Table 2 shows the number of transistors, the area, and the size ratio to
the chip of each unit.

Fig. 9. Micrograph of this encoder

Table 1. Characteristics of the encoder chip

Technology            TSMC 0.35 µm 1P4M CMOS
Die size              5.02 x 5.13 mm²
Transistor count      828,692
On-chip memory        39,080 bits
Off-chip memory       2,027,527 bits
Clock frequency       40 MHz
Voltage               3.3 V
Power consumption     256.8 mW
Package               208 CQFP
ME algorithm          PDS/FFS, 4MV mode
Search range          -16.0 to +15.5
Encoding complexity   352 x 288 at 30 fps

Table 2. Cost distribution

Unit                 Trans. (k)   Area (mm²)   Size ratio (%)
ME                      288          5.8           22.6
MC                       53          0.3            1.2
DCT/IDCT in TBE         126          1.6            6.2
Q/IQ in TBE              64          0.7            2.9
ACDCP in TBE             22          0.8            3.0
RISC                    112          1.8            7.0
DMA                      19          0.3            1.2
VLC                      95          0.7            2.7
Share MEM                68          2.8           10.9
Others (PAD etc.)        49         10.9           42.3
Total                   829         25.8          100.0

Table 3 compares this work with previously proposed MPEG-4 video codecs.
The design in [4] is a fully dedicated hardware video codec that uses
MVFAST for ME with a search range of -16 to +15.5. The design in [5] is a
platform-based video/speech codec that uses a 3-step hierarchical search
for ME with a search range of -32 to +31.5. The design in [6] is a
platform-based video codec with ARM/AMBA that uses a coarse ME with a
search range of -8 to +7.5. All of these chip designs adopt fast
algorithms for motion estimation. Considering the video encoder part
only, our work has the highest encoding complexity and, at the same time,
the lowest cost.

Table 3. Architectures Comparison

Designer              [4]          [5]              [6]          Proposed
Encoding complexity   CIF, 15fps   QCIF, 15fps      CIF, 15fps   CIF, 30fps
Frequency (MHz)       13.5         60               27           40
Power (mW)            29           240              500          256.8
Transistor (K)        3,150        20,500 (DRAM)    1,700        829
Process (µm)          0.18         0.25             0.35         0.35
Chip area (mm²)       28.048       117.506          110.25       25.801

6. CONCLUSION

An efficient platform architecture design with hardware accelerators for
an MPEG-4 Simple Profile@Level 3 video encoder SOC is proposed in this
paper. With the proposed hybrid motion estimation and efficient memory
organization, the system is implemented in 0.35 µm CMOS technology. It
works at 40 MHz and consumes 256.8 mW with a 5.03x5.13 mm² die size to
meet the real-time encoding specification. The proposed design achieves
high performance with low design cost, which shows that a cost-effective
MPEG-4 coding system implementation is realized.

7. REFERENCES

[1] P. M. Kuhn and W. Stechele, “Complexity analysis of the emerging
MPEG-4 standard as a basis for VLSI implementation,” in International
Conference on Visual Communications and Image Processing, 1998.

[2] H. C. Chang, L. G. Chen, M. Y. Hsu, and Y. C. Chang, “Performance
analysis and architecture evaluation of MPEG-4 video codec system,” in
IEEE International Symposium on Circuits and Systems (ISCAS), 2000,
vol. 2, pp. 449–452.

[3] A. Hatabu, T. Miyazaki, and I. Kuroda, “QVGA/CIF resolution MPEG-4
video codec based on a low-power and general-purpose DSP,” in IEEE
Workshop on Signal Processing Systems (SiPS), 2002, pp. 15–20.

[4] H. Nakayama, T. Yoshitake, H. Komazaki, Y. Watanabe, H. Araki,
K. Morioka, J. Li, L. Peilin, S. Lee, H. Kubosawa, and Y. Otobe, “An
MPEG-4 Video LSI with an Error-Resilient Codec Core Based on a Fast
Motion Estimation Algorithm,” in IEEE International Solid-State Circuits
Conference (ISSCC), 2002, vol. 1, pp. 368–474.

[5] M. Takahashi, T. Nishikawa, M. Hamada, T. Takayanagi, H. Arakida,
N. Machida, H. Yamamoto, T. Fujiyoshi, Y. Ohashi, O. Yamagishi,
T. Samata, A. Asano, T. Terazawa, K. Ohmori, Y. Watanabe, H. Nakamura,
S. Minami, T. Kuroda, and T. Furuyama, “A 60-MHz 240-mW MPEG-4
Videophone LSI with 16-Mb Embedded DRAM,” IEEE Journal of Solid-State
Circuits, vol. 35, no. 11, pp. 1713–1721, Nov 2000.

[6] J. H. Park, I. K. Kim, S. M. Kim, S. M. Park, B. T. Koo, K. S. Shin,
K. B. Seo, and J. J. Cha, “MPEG-4 Video Codec on an ARM core and AMBA,”
in Workshop and Exhibition on MPEG-4, 2001, pp. 95–98.

[7] C. W. Hsu, W. M. Chao, Y. C. Chang, and L. G. Chen, “Cost-Effective
Scheduling Of Texture Coding For MPEG-4 Video,” IEEE International
Conference on Multimedia and Expo (ICME’02), Aug 2002.

[8] T. Sikora, “The MPEG-4 Video Standard Verification Model,” IEEE
Trans. on Circuits and Systems for Video Technology, vol. 7, no. 1,
pp. 19–31, Feb 1997.