2016 European Modelling Symposium

An Efficient ASIC Design of Variable-Length Discrete Cosine Transform for HEVC

Niras C. Vayalil, Joshua Haddrill and Yinan Kong


Department of Engineering
Macquarie University
Sydney, NSW, 2109 Australia

Abstract—The latest video coding standard introduced by the joint collaborative team on video coding (JCT-VC) is known as high-efficiency video coding (HEVC) or H.265. HEVC/H.265 is mainly targeted at high-definition video and offers more compression than its predecessor. The discrete cosine transform (DCT) is widely used for image and video compression, including HEVC. This paper proposes a variable-length DCT architecture for encoding video according to the HEVC/H.265 specifications. The architecture is optimized for the most likely block sizes in ultra-high-definition (UHD) video, and eliminates unnecessary complexities found in many proposed architectures. Synthesis results with Synopsys design tools show that the proposed method can encode 8K UHD video at 60 fps in real time and achieves more than 60% in hardware savings.

Keywords–Discrete cosine transform (DCT); H.265; high efficiency video coding (HEVC); high definition

I. INTRODUCTION

As the demand for high-definition (HD) video content increases, so does the need for efficient compression techniques. The high efficiency video coding (HEVC/H.265) standard [1] is a relatively new codec that is poised to replace advanced video coding (AVC/H.264) [2] as the standard for high-definition video encoding. HEVC/H.265 offers more compression than its predecessor AVC/H.264, approximately a 50% bit-rate reduction for an equivalent subjective reproduction quality [3]. The discrete cosine transform (DCT) is used in several previous codecs and could be a key factor in the development of compression techniques for HEVC, due to its near-optimal efficiency for this task. To be compatible with HEVC/H.265, the DCT needs to be computed for matrices of varying size.

To accommodate the varying transform size, it would be ideal to develop components that can be reused by other lengths, so that the architecture is more area efficient. The common method of multiplying by a constant matrix is not effective in this case, because that architecture cannot be reused for other lengths.

Meher et al. proposed a reusable integer DCT architecture [4] providing the same throughput for all supported transform lengths, but resulting in a higher area or gate count. An approximated DCT architecture in [5], built from the Walsh-Hadamard transform (WHT) followed by a set of Givens rotations, reduces gate count. Another approximate DCT architecture is proposed in [6] and offers better peak signal-to-noise ratio (PSNR). High-resolution video such as ultra-high-definition (UHD) video is more likely to have large smooth regions [7], thus transforms of larger size are mostly used. Hence the design is targeted mainly at the most likely block sizes instead of all possible sizes, and this assumption can reduce the hardware complexity significantly. This paper proposes a 2D-DCT architecture which has substantial throughput, with block sizes 16 × 16 and 32 × 32 for real-time encoding of 8K UHD. This architecture leads to a simple memory and DCT hardware structure and thus is smaller (lower gate count) and faster than many proposals in the literature.

II. HARDWARE ARCHITECTURE FOR DCT COMPUTATION

The DCT is a Fourier-related transform that uses only real numbers to represent a finite set of discrete data points within a signal; unlike the discrete Fourier transform (DFT), the DCT uses only cosine functions to represent the data points [8]. There are multiple versions of the DCT, ranging from DCT-I to DCT-IV; the most common is DCT-II, which is referred to as ‘the DCT’ and defined as

$$X_k = \sum_{n=0}^{N-1} \cos\left[\frac{\pi}{N}\left(n + \frac{1}{2}\right)k\right] x_n, \qquad 0 \le k < N \tag{1}$$

For processing two-dimensional signals such as images, a two-dimensional version of the DCT (2D-DCT) is used; it is a straightforward extension of the standard DCT, given as

$$X_{k_1,k_2} = \sum_{n_1=0}^{N_1-1} \sum_{n_2=0}^{N_2-1} \cos\left[\frac{\pi}{N_1}\left(n_1 + \frac{1}{2}\right)k_1\right] \cos\left[\frac{\pi}{N_2}\left(n_2 + \frac{1}{2}\right)k_2\right] x_{n_1,n_2} \tag{2}$$

where $0 \le k_1 < N_1$, $0 \le k_2 < N_2$.

One property of the 2D-DCT is separability [9], i.e. the 2-D DCT can be computed in two steps, a column-wise 1-D DCT followed by a row-wise 1-D DCT, or vice versa. This procedure of calculating a multidimensional

2473-3539/16 $31.00 © 2016 IEEE


DOI 10.1109/EMS.2016.43
separable transform is called row-column decomposition, which reduces the number of computations.

The DCT has become a staple in image compression, specifically in the JPEG format, due to the lossy compression that results from the transform, allowing larger image data to be compressed. This is done by applying the DCT to blocks of an image's pixels and quantizing the resulting coefficients to obtain an approximation that requires less data to be stored. The DCT possesses a strong energy compaction property [10]; most of the signal information tends to be concentrated in a few low-frequency components, making DCTs useful for image compression.

The Fourier-related properties of the DCT make it possible to use the butterfly multiplication approach described by Budagavi et al., in such a way that the overall transformation completes in sections that effectively ‘fold’ into the next section [11]. This method is more efficient than brute-force matrix multiplication, which is very costly in terms of computing time [11]. A stripped-down representation of this process can be seen in Fig. 1, where the horizontal lines represent the input and the manipulated data after operations, while each dot represents an addition, or a subtraction where a (−) appears beneath it.

Figure 1. Stick diagram of the butterfly technique applied to the DCT (inputs x[0] to x[7], outputs X[0] to X[7], three stages).

A. Four-point DCT architecture

The four-point DCT module is based on the algorithm by Meher et al. [4]. The algorithm used to implement the DCT for a 4 × 4 matrix is outlined in stages in Table I. For efficient implementation the algorithm is divided into an input adder unit (IAU), a shift adder unit (SAU) and an output adder unit (OAU), such that each stage can be examined and implemented individually to ensure that the necessary values are available as each stage completes. This algorithm replicates the kernel matrix of equation (3) to perform the transform without directly performing matrix multiplication. This is done to improve the computational speed and efficiency of the architecture.

$$C_4 = \begin{bmatrix} 64 & 64 & 64 & 64 \\ 83 & 36 & -36 & -83 \\ 64 & -64 & -64 & 64 \\ 36 & -83 & 83 & -36 \end{bmatrix} \tag{3}$$

B. Architecture for higher length 1-D DCTs

Higher-length DCTs with N = 8, 16 and 32 are built upon four-point DCTs recursively. A generalized structure of the N = 8, 16, 32 point integer DCTs is shown in Fig. 2. At each stage of this process the input data is first manipulated by an IAU to create intermediate data that is then used as the input for further operations. The even-numbered rows/columns, including zero, are processed as an N/2-point DCT to obtain the corresponding output values. The odd-numbered rows and columns are passed through a shift adder unit (SAU), which is specific to the point length of the DCT being performed, to produce the corresponding values in the output matrix.

Figure 2. A generalized structure of higher radix DCTs, where N = 8, 16, 32 [4]. The N-point DCT is built upon the N/2-point DCT with adder units and shift units.

Once the lower-level transforms and operations have been completed, the resulting data is further manipulated by an OAU to complete the transformation. The complete architecture operates recursively by calling N/2-point DCTs until it reaches the four-point DCT.

III. PROPOSED HARDWARE ARCHITECTURE FOR VARIABLE-LENGTH TWO-DIMENSIONAL DCT

The separability property is used to design the 2-D DCT because the row-column decomposition results in computational savings, but this introduces another problem: memory is needed to store the first-step results. The second step (row/column 1D-DCT) can only begin after the first step (column/row 1D-DCT) is complete, thus it is necessary to save all data and retrieve it in transposed order. In the proposed architecture we use a two-dimensional register array to save and transpose the first 1-D DCT results, which results in an efficient 2D-DCT architecture.

The architecture of the proposed method is shown in Fig. 3, where the transpose module has a size of 32 × 32 words (the 1-D DCT input and output have a 16-bit word length), which can hold all column transforms of a 32 × 32 block of pixels. Initially, the 2D shift registers are set into ‘shift-in mode’ and

TABLE I
FOUR-POINT DCT ALGORITHM BY STAGE

Stage            Computation               Binary expression                          Notes

Stage 1 (IAU)    a(i) = x(i) + x(3 − i)                                               for i = 0, 1
                 b(i) = x(i) − x(3 − i)

Stage 2 (SAU)    m_{i,9}  = 9b(i)          (b(i) << 3) + b(i)                         for i = 0, 1
                 t_{i,64} = 64a(i)         a(i) << 6
                 t_{i,83} = 83b(i)         (b(i) << 6) + (m_{i,9} << 1) + b(i)
                 t_{i,36} = 36b(i)         m_{i,9} << 2

Stage 3 (OAU)    y(0) = t_{0,64} + t_{1,64}
                 y(1) = t_{0,83} + t_{1,36}
                 y(2) = t_{0,64} − t_{1,64}
                 y(3) = t_{0,36} − t_{1,83}
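
The three stages of Table I can be traced in a few lines of software. The following is a minimal Python sketch, not the authors' VHDL: the function names `dct4_shift_add` and `dct4_matrix` are hypothetical, and the shift used for 36b(i) and the signs in the output stage are chosen here so that the result agrees with the kernel matrix C4 of equation (3), which the sketch uses as its reference.

```python
# Kernel matrix C4 from equation (3).
C4 = [
    [64, 64, 64, 64],
    [83, 36, -36, -83],
    [64, -64, -64, 64],
    [36, -83, 83, -36],
]


def dct4_matrix(x):
    """Reference result: direct multiplication of C4 by the 4-element input."""
    return [sum(c * v for c, v in zip(row, x)) for row in C4]


def dct4_shift_add(x):
    """Four-point integer DCT using only shifts and additions."""
    # Stage 1 (IAU): butterfly sums and differences.
    a = [x[i] + x[3 - i] for i in range(2)]
    b = [x[i] - x[3 - i] for i in range(2)]

    # Stage 2 (SAU): constant multiplications expressed as shift-adds.
    m9 = [(bi << 3) + bi for bi in b]                           # 9 * b(i)
    t64 = [ai << 6 for ai in a]                                 # 64 * a(i)
    t83 = [(bi << 6) + (m << 1) + bi for bi, m in zip(b, m9)]   # 83 * b(i)
    t36 = [m << 2 for m in m9]                                  # 36 * b(i)

    # Stage 3 (OAU): combine the partial products into the outputs.
    return [
        t64[0] + t64[1],
        t83[0] + t36[1],
        t64[0] - t64[1],
        t36[0] - t83[1],
    ]


if __name__ == "__main__":
    sample = [1, 2, 3, 4]
    print(dct4_shift_add(sample))
    print(dct4_matrix(sample))
```

Both calls print the same vector for any integer input, which is a quick way to confirm that a shift-add decomposition reproduces the constant-matrix product before committing it to hardware.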

input data is given into the 1D-DCT module column-wise. During ‘shift-in’ mode, the results of the first 1D-DCTs are stored into the leftmost column of the 2D register array, and each column of data in the register array is shifted rightward at every clock cycle. A detailed diagram of the 2D shift register arrangement is shown in Fig. 4.

After completion of all column transformations, the 2D shift register changes into ‘shift-out’ mode and the DCT module input connects to the 2D shift register outputs with the help of a multiplexer (MUX). In ‘shift-out’ mode the shift register's data shifts in the upward direction, and the output is taken from the top row. This write and read arrangement facilitates the transpose operation. The shift registers do not load data in this mode of operation. Since the DCT module completes a row transform in each cycle, the proposed architecture requires 2N clock cycles to complete an N-point 2D-DCT. As an example, for a 32-point 2D-DCT, the first 32 clock cycles are required to complete all column transforms, which are stored into the shift registers, and another 32 clock cycles are required to shift out these data row-wise and complete all row transformations.

Figure 3. The proposed 2D-DCT architecture, transposition memory implemented using a 2-D register array.

Figure 4. The proposed 2D shift register architecture, showing 4 inputs and 4 outputs. Data is shifted in the horizontal direction from left to right, and shifted out in the up direction; all MUX selection changes accordingly.

In HEVC/H.265, the transform is performed after intra and inter prediction, on the residues obtained as the differences between the original pixels and the predicted pixels. For residual coding, HEVC/H.265 employs recursive quad-tree-structured partitioning of coding blocks [1]. The HEVC/H.265 specification supports four transform sizes: 4 × 4, 8 × 8, 16 × 16 and 32 × 32. The different block sizes are introduced to accommodate varying space-frequency characteristics of the residuals. The rate distortion (RD) cost must be computed for all coding unit (CU) sizes to select the best among the various block sizes. However, this ‘trial and error’ method has a very high computational cost. Several algorithms have been proposed for early transform unit (TU) decision to reduce this complexity. Choi et al. propose a method for early TU decision that uses the number of nonzero DCT coefficients as a threshold to stop further RD cost evaluation in the quad-tree structure [12]. This method still has considerable complexity, especially for sequences with active motion or rich textures, thus further optimizations are proposed in [13]. Termination of the quad-tree TU encoding process based on the residual coefficients is proposed in [14]–[17].

One of the options in the HEVC test model (HM) [18] to reduce the computational complexity is to use the largest

TABLE II
COMPARISON OF 2D-DCT ARCHITECTURES

Design                         Technology   Gate count   Max. freq.   Throughput   Supported video format
TCSVT’14 [4] Architecture-1    90 nm        347 k        187 MHz      5.984 G      8K UHD @ 60 fps
TCSVT’14 [4] Architecture-2    90 nm        208 k        187 MHz      2.992 G      8K UHD @ 60 fps
TCSVT’16 [5] Architecture-1    90 nm        243 k        250 MHz      3.212 G      8K UHD @ 64 fps
TCSVT’16 [5] Architecture-2    90 nm        157 k        250 MHz      1.302 G      8K UHD @ 26 fps
Proposed                       32 nm        96 k         450 MHz      3.600 G      8K UHD @ 60 fps
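
The figures in the Proposed row of Table II can be cross-checked with simple arithmetic. The sketch below is illustrative only and assumes, as the paper states, that an N × N block is finished in 2N clock cycles, that 16 × 16 blocks are taken as the average case, and that 4:2:0 YUV carries 1.5 samples per pixel; the function names `pixels_per_cycle` and `required_clock_hz` are hypothetical.

```python
def pixels_per_cycle(n):
    """Pixels processed per clock when an n x n block takes 2n cycles."""
    return (n * n) // (2 * n)


def required_clock_hz(width, height, fps, samples_per_pixel, px_per_cycle):
    """Clock frequency needed to keep up with a given video format."""
    return width * height * fps * samples_per_pixel / px_per_cycle


if __name__ == "__main__":
    avg = pixels_per_cycle(16)  # average case assumed in the paper
    need = required_clock_hz(7680, 4320, 60, 1.5, avg)
    print(need / 1e6)           # required clock in MHz
    print(450e6 * avg / 1e9)    # throughput at 450 MHz, in giga-samples/s
```

Under these assumptions the required clock comes out at roughly 373 MHz, in line with the approximately 374 MHz quoted in the results, and the 450 MHz synthesized clock gives the 3.600 G throughput listed in the table. The same function returns 16 and 2 pixels per cycle for 32 × 32 and 4 × 4 blocks, matching the best and worst cases reported.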

available transform size. The homogeneity of the transform block residuals has a strong relation to the homogeneity of the input block; when the TU covers multiple prediction units (PUs), the transform residues may not be consistent, and there is also a chance of introducing blocking artifacts, which in turn increase the high-frequency energy in the residuals. To cope with the computational complexity and the aforementioned problems, this architecture uses the maximum TU size that fits in the PU as the TU size. This decreases the computational complexity, but the Bjøntegaard delta (BD)-rate [19] increases by 3.02% in the low-delay P configuration [20].

IV. RESULTS AND COMPARISON

The proposed architecture is written in the VHDL hardware description language and is verified by simulating the design in ModelSim. The design is synthesized using Synopsys Design Compiler version K-2015.06 with the Synopsys Armenia Educational Department (SAED) design kit 32 nm standard logic cell libraries, for operating conditions of 1.16 V and a worst-case temperature of 125 °C. The highest throughput of the architecture is 16 pixels per clock cycle, when processing a 32 × 32 block within 64 clock cycles, and falls to a worst case of 2 pixels per clock cycle when processing 4 × 4 blocks. In high-resolution video, especially 8K UHD video, small block sizes are rarely expected; hence, as an average, the throughput for 16 × 16 blocks (8 pixels per clock cycle) is taken for calculation purposes. Thus encoding 8K UHD at 60 Hz in 4:2:0 YUV format requires 7680 × 4320 × 60 × 1.5/8 clock cycles per second, or approximately 374 MHz. The design can operate at up to 450 MHz, a much higher clock frequency than is required, and the synthesized results are in Table II.

The synthesized design has an area of 0.2443 mm² or a 96 k standard 2-input NAND equivalent gate count. Two architectures are proposed in each of [4] and [5] for the 2D-DCT, based on unfolded and folded 1D-DCT modules, referred to as Architecture-1 and Architecture-2 respectively. The unfolded or fully parallel structures have higher throughput at the expense of larger area or gate count. A comparison with the proposed architecture is given in the table, and it is clear that the proposed method has approximately 61% of the gate count of [5] Architecture-2 and is twice as fast. For an equivalent throughput, the proposed method saves more than 60% of the gate count of the designs in the table.

V. CONCLUSION

In this paper we propose a 2-D DCT architecture for encoding UHD video in the HEVC/H.265 standard. The hardware has substantial throughput for the block sizes that are more likely to be found in HD or UHD video. This assumption removes several unnecessary complexities found in many other architectures. The proposed method has efficient and fast DCT structures as well as transposition memory. Thus the synthesized results show a lower gate count, or a smaller area, than the architectures in the literature.

REFERENCES

[1] G. Sullivan, J. Ohm, W. J. Han, and T. Wiegand, “Overview of the
high efficiency video coding (HEVC) standard,” IEEE Transactions
on Circuits and Systems for Video Technology, vol. 22, no. 12, pp.
1649–1668, Dec. 2012.
[2] ITU-T and ISO/IEC JTC, Advanced video coding for generic
audiovisual services, ITU-T Rec. H.264 and ISO/IEC 14496-10 (AVC),
2003.
[3] J. R. Ohm, G. J. Sullivan, H. Schwarz, T. K. Tan, and T. Wiegand,
“Comparison of the coding efficiency of video coding standards –
including high efficiency video coding (HEVC),” IEEE Transactions
on Circuits and Systems for Video Technology, vol. 22, no. 12, pp.
1669–1684, Dec. 2012.
[4] P. K. Meher, S. Y. Park, B. K. Mohanty, K. S. Lim, and C. Yeo,
“Efficient integer DCT architectures for HEVC,” IEEE Transactions
on Circuits and Systems for Video Technology, vol. 24, no. 1, pp.
168–178, Jan. 2014.
[5] M. Masera, M. Martina, and G. Masera, “Adaptive approximated DCT
architectures for HEVC,” IEEE Transactions on Circuits and Systems
for Video Technology, no. 99, pp. 1–1, 2016.
[6] M. Jridi and P. Meher, “A scalable approximate DCT architectures
for efficient HEVC compliant video coding,” IEEE Transactions on
Circuits and Systems for Video Technology, no. 99, pp. 1–1, 2016.
[7] M. T. Pourazad, C. Doutre, M. Azimi, and P. Nasiopoulos, “HEVC:
The new gold standard for video compression: How does HEVC
compare with H.264/AVC?” IEEE Consumer Electronics Magazine,
vol. 1, no. 3, pp. 36–46, July 2012.
[8] N. Ahmed, T. Natarajan, and K. R. Rao, “Discrete cosine transform,”
IEEE Transactions on Computers, vol. C-23, no. 1, pp. 90–93, Jan
1974.
[9] N. C. Vayalil, A. Safari, and Y. Kong, “Overlapped block-processing
VLSI architecture for separable 2D filters,” in Electronics,
Communications and Networks IV, Jun 2015, pp. 1355–1358.
[10] K. R. Rao and P. Yip, Discrete Cosine Transform: Algorithms,
Advantages, Applications. Academic Press, Boston, 1990.
[11] M. Budagavi, A. Fuldseth, G. Bjøntegaard, V. Sze, and M. Sadafale,
“Core transform design in the high efficiency video coding (HEVC)
standard,” IEEE Journal of Selected Topics in Signal Processing,
vol. 7, no. 6, pp. 1029–1041, Dec 2013.

[12] K. Choi and E. S. Jang, “Early TU decision method for fast video
encoding in high efficiency video coding,” Electronics Letters, vol. 48,
no. 12, pp. 689–691, June 2012.
[13] C. C. Wang, Y. C. Liao, J. W. Wang, and C. W. Tung, “An effective
TU size decision method for fast HEVC encoders,” in Computer,
Consumer and Control (IS3C), 2014 International Symposium on,
June 2014, pp. 1195–1198.
[14] J. Su, K. Nitta, M. Ikeda, and A. Shimizu, “Residue role assignment
based transform partition predetermination on HEVC,” in 2013 IEEE
International Conference on Image Processing, Sept 2013, pp. 2019–
2023.
[15] J. Kang, H. Choi, and J. G. Kim, “Fast transform unit decision
for HEVC,” in Image and Signal Processing (CISP), 2013 6th
International Congress on, vol. 01, Dec 2013, pp. 26–30.
[16] Z. Pan, J. Lei, Y. Zhang, W. Yan, and S. Kwong, “Fast transform unit
depth decision based on quantized coefficients for HEVC,” in Systems,
Man, and Cybernetics (SMC), 2015 IEEE International Conference
on, Oct 2015, pp. 1127–1132.
[17] J. T. Fang, Y. C. Tsai, J. X. Lee, and P. S. Yu, “Computation
reduction in transform unit of high efficiency video coding based
on zero-coefficients,” in 2016 International Symposium on Computer,
Consumer and Control (IS3C), July 2016, pp. 797–800.
[18] HEVC reference software 16.3. [Online]. Available:
https://hevc.hhi.fraunhofer.de/svn/svn_HEVCSoftware/
[19] G. Bjøntegaard, Calculation of average PSNR differences between
RD-curves, ITU-T SG16 Document VCEG-M33, Joint Collaborative
Team on Video Coding (JCTVC), Apr. 2001.
[20] V. Sze, M. Budagavi, and G. J. Sullivan, Eds., High Efficiency Video
Coding (HEVC) Algorithms and Architectures. Springer, 2014.
