ASIC BASED DCT2016
Abstract: The latest video coding standard introduced by the joint collaborative team on video coding (JCT-VC) is known as high-efficiency video coding (HEVC) or H.265. HEVC/H.265 is mainly targeted at high-definition video and offers more compression than its predecessor. The discrete cosine transform (DCT) is widely used for image and video compression, including HEVC. This paper proposes a variable-length DCT architecture for encoding video according to the HEVC/H.265 specification. The architecture is optimized for the most likely block sizes in ultra-high-definition (UHD) video, and eliminates unnecessary complexity found in many previously proposed architectures. Synthesis results obtained with Synopsys design tools show that the proposed method can encode 8K UHD video at 60 fps in real time and achieves more than 60% hardware savings.

Keywords: Discrete cosine transform (DCT); H.265; high efficiency video coding (HEVC); high definition

I. INTRODUCTION

As the demand for high-definition (HD) video content increases, so does the need for efficient compression techniques. The high efficiency video coding (HEVC/H.265) standard [1] is a relatively new codec that is poised to replace advanced video coding (AVC/H.264) [2] as the standard for high-definition video encoding. HEVC/H.265 offers more compression than its predecessor AVC/H.264, approximately a 50% bit-rate reduction for an equivalent subjective reproduction quality [3]. The discrete cosine transform (DCT) is used in several previous codecs and remains a key component of compression in HEVC because of its near-optimal efficiency for this task. To be compatible with HEVC/H.265, the DCT needs to be computed for matrices of varying size.

To accommodate the varying transform size, it would be ideal to develop components that can be shared among transform lengths so that the architecture is more area efficient. The common method of multiplying by a constant matrix is not effective in this case, because hardware built for one length cannot be reused for the others.

Meher et al. proposed a reusable integer DCT architecture [4] that provides the same throughput for all supported transform lengths, but at the cost of a higher area or gate count. An approximated DCT architecture based on the Walsh-Hadamard transform (WHT) followed by a set of Givens rotations [5] reduces the gate count. Another approximate DCT architecture is proposed in [6] and offers a better peak signal-to-noise ratio (PSNR). High-resolution video such as ultra-high-definition (UHD) video is more likely to contain large smooth regions [7], so transforms of larger size are mostly used. Hence the design is targeted mainly at the most likely block sizes instead of all possible sizes, and this assumption can reduce the hardware complexity significantly. This paper proposes a 2D-DCT architecture that has substantial throughput for block sizes 16 × 16 and 32 × 32, for real-time encoding of 8K UHD video. This architecture leads to a simple memory and DCT hardware structure and is thus smaller (lower gate count) and faster than many proposals in the literature.

II. HARDWARE ARCHITECTURE FOR DCT COMPUTATION

The DCT is a Fourier-related transform that uses only real numbers to represent a finite set of discrete data points within a signal; unlike the discrete Fourier transform (DFT), the DCT uses only cosine functions to represent the data points [8]. There are multiple versions of the DCT, ranging from DCT-I to DCT-IV. The most common is DCT-II, usually referred to simply as 'the DCT' and defined as

$$X_k = \sum_{n=0}^{N-1} \cos\left[\frac{\pi}{N}\left(n + \frac{1}{2}\right)k\right] x_n, \qquad 0 \le k < N \tag{1}$$

For processing two-dimensional signals such as images, a two-dimensional version of the DCT (2D-DCT) is used; it is a straightforward extension of the standard DCT, given as

$$X_{k_1,k_2} = \sum_{n_1=0}^{N_1-1} \sum_{n_2=0}^{N_2-1} \cos\left[\frac{\pi}{N_1}\left(n_1 + \frac{1}{2}\right)k_1\right] \cos\left[\frac{\pi}{N_2}\left(n_2 + \frac{1}{2}\right)k_2\right] x_{n_1,n_2} \tag{2}$$

where 0 ≤ k1 < N1 and 0 ≤ k2 < N2.

One property of the 2D-DCT is separability [9], i.e. the 2-D DCT can be computed in two steps, a column-wise 1-D DCT followed by a row-wise 1-D DCT, or vice versa. This procedure of calculating a multidimensional DCT from successive 1-D DCTs is commonly called the row-column method.
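The separability property can be checked numerically. The following is a minimal sketch in plain Python, assuming the unnormalized DCT-II of Eq. (1) and Eq. (2); the function names (`dct_1d`, `dct_2d_direct`, `dct_2d_separable`) are illustrative and are not taken from the paper. It evaluates the double sum of Eq. (2) directly and compares it with a column-wise 1-D pass followed by a row-wise 1-D pass.

```python
import math


def dct_1d(x):
    """Unnormalized DCT-II of a sequence, following Eq. (1)."""
    n_pts = len(x)
    return [
        sum(x[n] * math.cos(math.pi / n_pts * (n + 0.5) * k) for n in range(n_pts))
        for k in range(n_pts)
    ]


def transpose(m):
    """Swap rows and columns of a list-of-lists matrix."""
    return [list(col) for col in zip(*m)]


def dct_2d_direct(x):
    """Evaluate the double sum of Eq. (2) term by term; x is indexed x[n1][n2]."""
    n1_pts, n2_pts = len(x), len(x[0])
    return [
        [
            sum(
                x[n1][n2]
                * math.cos(math.pi / n1_pts * (n1 + 0.5) * k1)
                * math.cos(math.pi / n2_pts * (n2 + 0.5) * k2)
                for n1 in range(n1_pts)
                for n2 in range(n2_pts)
            )
            for k2 in range(n2_pts)
        ]
        for k1 in range(n1_pts)
    ]


def dct_2d_separable(x):
    """Column-wise 1-D DCTs, a transposition, then row-wise 1-D DCTs."""
    # Pass 1: transform every column (index n1 becomes k1).
    cols = [dct_1d(col) for col in transpose(x)]  # cols[n2][k1]
    stage1 = transpose(cols)                      # stage1[k1][n2]
    # Pass 2: transform every row of the intermediate result (n2 becomes k2).
    return [dct_1d(row) for row in stage1]        # result[k1][k2]


def max_abs_diff(a, b):
    """Largest element-wise absolute difference between two matrices."""
    return max(abs(p - q) for ra, rb in zip(a, b) for p, q in zip(ra, rb))


# Arbitrary 4 x 4 test block (illustrative values only).
block = [
    [52, 55, 61, 66],
    [63, 59, 55, 90],
    [62, 59, 68, 113],
    [63, 58, 71, 122],
]

difference = max_abs_diff(dct_2d_direct(block), dct_2d_separable(block))
print("max |direct - separable| =", difference)
```

The two results agree to within floating-point rounding, which is why a single 1-D DCT module plus a transposition step is sufficient to compute the 2-D transform.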
TABLE I
FOUR-POINT DCT ALGORITHM BY STAGE
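As an illustration of what a staged four-point transform can look like, the sketch below uses the integer coefficients of the HEVC 4-point core transform (64, 83, 36) and splits the computation into a butterfly stage and a multiply-accumulate stage. This staging is an assumption made for illustration, and it may differ from the algorithm tabulated by the authors; the names `HEVC_DCT4`, `dct4_matrix` and `dct4_staged` are hypothetical.

```python
# HEVC 4-point core transform matrix (integer approximation of the DCT).
HEVC_DCT4 = [
    [64, 64, 64, 64],
    [83, 36, -36, -83],
    [64, -64, -64, 64],
    [36, -83, 83, -36],
]


def dct4_matrix(x):
    """Reference result: plain matrix-vector product (16 multiplications)."""
    return [sum(c * v for c, v in zip(row, x)) for row in HEVC_DCT4]


def dct4_staged(x):
    """Same transform computed in two stages (assumed staging, for illustration)."""
    # Stage 1: butterflies split the input into even and odd parts.
    e0, e1 = x[0] + x[3], x[1] + x[2]
    o0, o1 = x[0] - x[3], x[1] - x[2]
    # Stage 2: even outputs come from the even part, odd outputs from the odd part.
    y0 = 64 * (e0 + e1)
    y2 = 64 * (e0 - e1)
    y1 = 83 * o0 + 36 * o1
    y3 = 36 * o0 - 83 * o1
    return [y0, y1, y2, y3]


sample = [1, 2, 3, 4]
print(dct4_staged(sample), dct4_matrix(sample))
```

The staged form needs 6 multiplications instead of 16, and the same even-odd split can be applied recursively to the even part of longer transforms, which is what makes hardware sharing between transform lengths attractive.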
Figure 3. The proposed 2D-DCT architecture, transposition memory implemented using a 2-D register array. (Block labels: 2D-DCT input, rows of 4/8/16/32; 1D-DCT; transposition memory; shift in/shift out mode select; 2D-DCT output.)

Figure 4. The proposed 2D shift register architecture, showing 4 inputs and 4 outputs. Data is shifted in horizontally from left to right and shifted out in the upward direction; all MUX selections change accordingly. (Labels: inputs x0 to x3, outputs y0 to y3, registers R00 to R33.)

The input data is fed into the 1D-DCT module column-wise. During the 'shift-in' mode, the results of the first 1D-DCTs are stored into the leftmost column of the 2D register array, and each column of data in the register array is shifted rightward at every clock cycle. A detailed diagram of the 2D shift register arrangement is shown in Fig. 4.

After completion of all column transformations, the 2D shift register changes into 'shift-out' mode and the DCT module takes its input from the shift register with the help of a multiplexer (MUX). In 'shift-out' mode the shift register's data shifts in the upward direction, and the output is taken from the top row. This write and read arrangement facilitates the transpose operation. The shift registers do not load data in this mode of operation. Since the DCT module completes one 1-D transform in each cycle, the proposed architecture requires 2N clock cycles to complete an N-point 2D-DCT. As an example, for a 32-point 2D-DCT, the first 32 clock cycles are required to complete all column transforms, which are stored into the shift registers, and another 32 clock cycles are required to shift out these data row-wise and complete all row transformations.

In HEVC/H.265, the transform is performed after intra and inter prediction, on the residues obtained as the differences between the original pixels and the predicted pixels. For the residual coding, HEVC/H.265 employs recursive quad-tree-structured partitioning of coding blocks [1]. The HEVC/H.265 specification supports four transform sizes: 4 × 4, 8 × 8, 16 × 16 and 32 × 32. The different block sizes in the specification are introduced to accommodate the varying space-frequency characteristics of the residuals. The rate-distortion (RD) cost must be computed for all coding unit (CU) sizes to select the best among the various block sizes. However, this 'trial and error' method has a very high computational cost. Several algorithms have been proposed for early transform unit (TU) decision to reduce this complexity. Choi and Jang propose a method for early TU decision that uses the number of nonzero DCT coefficients as a threshold to stop further RD cost evaluation in the quad-tree structure [12]. This method still has considerable complexity, especially for sequences with active motion or rich textures, so further optimizations are proposed in [13]. Termination of the quad-tree TU encoding process based on the residual coefficients is proposed in [14]–[17].

One of the options in the HEVC test model (HM) [18] to reduce the computational complexity is to use the largest
TABLE II
COMPARISON OF 2D-DCT ARCHITECTURES

Design                        Technology  Gate count  Max. freq.  Throughput (pixels/s)  Supported video format
TCSVT’14 [4] Architecture-1   90 nm       347 k       187 MHz     5.984 G                8K UHD @ 60 fps
TCSVT’14 [4] Architecture-2   90 nm       208 k       187 MHz     2.992 G                8K UHD @ 60 fps
TCSVT’16 [5] Architecture-1   90 nm       243 k       250 MHz     3.212 G                8K UHD @ 64 fps
TCSVT’16 [5] Architecture-2   90 nm       157 k       250 MHz     1.302 G                8K UHD @ 26 fps
Proposed                      32 nm       96 k        450 MHz     3.600 G                8K UHD @ 60 fps
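The figures in Table II can be cross-checked with a few lines of arithmetic. The sketch below is a rough check rather than part of the original evaluation: it assumes 4:2:0 sampling (1.5 samples per luma pixel), an average of 8 pixels per clock cycle (a 16 × 16 block processed in 2N = 32 cycles), and throughput expressed in pixels per second. The helper names are illustrative.

```python
# Gate count (thousands) and throughput (G pixels/s), copied from Table II.
DESIGNS = {
    "[4] Architecture-1": (347, 5.984),
    "[4] Architecture-2": (208, 2.992),
    "[5] Architecture-1": (243, 3.212),
    "[5] Architecture-2": (157, 1.302),
    "Proposed": (96, 3.600),
}

LUMA_PIXELS_8K = 7680 * 4320  # luma pixels per 8K UHD frame
SAMPLES_PER_PIXEL = 1.5       # 4:2:0 YUV: luma plus two quarter-size chroma planes


def required_clock_hz(fps, pixels_per_cycle):
    """Clock frequency needed to process every sample of every frame."""
    return LUMA_PIXELS_8K * SAMPLES_PER_PIXEL * fps / pixels_per_cycle


def max_fps(throughput_gpixels):
    """Highest 8K UHD frame rate a given throughput can sustain."""
    return throughput_gpixels * 1e9 / (LUMA_PIXELS_8K * SAMPLES_PER_PIXEL)


def gate_saving(reference_kgates, proposed_kgates):
    """Fraction of gates saved by the proposed design relative to a reference."""
    return 1.0 - proposed_kgates / reference_kgates


proposed_gates, _ = DESIGNS["Proposed"]
print(f"required clock for 8K @ 60 fps: {required_clock_hz(60, 8) / 1e6:.1f} MHz")
for name, (kgates, throughput) in DESIGNS.items():
    print(
        f"{name:20s} max {max_fps(throughput):6.1f} fps, "
        f"gate saving of proposed design: {gate_saving(kgates, proposed_gates):6.1%}"
    )
```

With these assumptions, the frame rates listed for the two architectures of [5] follow directly from their throughput, and the saving against [5] Architecture-1, the design with the closest throughput, comes out just above 60%.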
available transform size. The homogeneity of the transform block residuals is strongly related to the homogeneity of the input block; when the TU covers multiple prediction units (PUs), the transform residues may not be consistent, and blocking artifacts may be introduced, which in turn increases the high-frequency energy in the residuals. To cope with the computational complexity and the aforementioned problems, the proposed architecture uses the maximum TU size that fits in the PU as the TU size. This decreases the computational complexity, but the Bjøntegaard delta (BD)-rate [19] increases by 3.02% in the low-delay P configuration [20].

IV. RESULTS AND COMPARISON

The proposed architecture is written in the VHDL hardware description language and is verified by simulating the design in ModelSim. The design is synthesized using Synopsys Design Compiler version K-2015.06 with the Synopsys Armenia Educational Department (SAED) design kit 32 nm standard logic cell libraries, for operating conditions of 1.16 V and a worst-case temperature of 125 °C. The highest throughput of the architecture is 16 pixels per clock cycle, when processing a 32 × 32 block within 64 clock cycles, and it drops to a worst case of 2 pixels per clock cycle when processing 4 × 4 blocks. In high-resolution video, especially 8K UHD video, small block sizes are rarely expected, so the throughput for 16 × 16 blocks (8 pixels per clock cycle) is taken as the average for calculation purposes. Encoding 8K UHD at 60 Hz in 4:2:0 YUV format thus requires 7680 × 4320 × 60 × 1.5/8 clock cycles per second, or approximately 373 MHz. The design can operate at up to 450 MHz, a higher clock frequency than is required, and the synthesis results are given in Table II.

The synthesized design has an area of 0.2443 mm², or a gate count of 96 k standard 2-input NAND equivalents. Two architectures are proposed in each of [4] and [5] for the 2D-DCT, based on unfolded and folded 1D-DCT modules, and they are referred to as Architecture-1 and Architecture-2 respectively. The unfolded or fully parallel structures have higher throughput at the expense of larger area or gate count. A comparison with the proposed architecture is given in the table, and it is clear that the proposed method has approximately 61% of the gate count of [5] Architecture-2 and is twice as fast. For an equivalent throughput, the proposed method saves more than 60% of the gate count of the designs in the table.

V. CONCLUSION

In this paper we propose a 2-D DCT architecture for encoding UHD video in the HEVC/H.265 standard. The hardware has substantial throughput for the block sizes that are more likely to be found in HD or UHD video. This assumption removes several unnecessary complexities present in many other architectures. The proposed method has an efficient and fast DCT structure as well as an efficient transposition memory. As a result, the synthesis results show a lower gate count, and hence a smaller area, than the architectures in the literature.

REFERENCES

[1] G. Sullivan, J. Ohm, W. J. Han, and T. Wiegand, “Overview of the high efficiency video coding (HEVC) standard,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 22, no. 12, pp. 1649–1668, Dec. 2012.
[2] ITU-T and ISO/IEC JTC, Advanced video coding for generic audiovisual services, ITU-T Rec. H.264 and ISO/IEC 14496-10 (AVC), 2003.
[3] J. R. Ohm, G. J. Sullivan, H. Schwarz, T. K. Tan, and T. Wiegand, “Comparison of the coding efficiency of video coding standards – including high efficiency video coding (HEVC),” IEEE Transactions on Circuits and Systems for Video Technology, vol. 22, no. 12, pp. 1669–1684, Dec. 2012.
[4] P. K. Meher, S. Y. Park, B. K. Mohanty, K. S. Lim, and C. Yeo, “Efficient integer DCT architectures for HEVC,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 24, no. 1, pp. 168–178, Jan. 2014.
[5] M. Masera, M. Martina, and G. Masera, “Adaptive approximated DCT architectures for HEVC,” IEEE Transactions on Circuits and Systems for Video Technology, no. 99, pp. 1–1, 2016.
[6] M. Jridi and P. Meher, “A scalable approximate DCT architectures for efficient HEVC compliant video coding,” IEEE Transactions on Circuits and Systems for Video Technology, no. 99, pp. 1–1, 2016.
[7] M. T. Pourazad, C. Doutre, M. Azimi, and P. Nasiopoulos, “HEVC: The new gold standard for video compression: How does HEVC compare with H.264/AVC?” IEEE Consumer Electronics Magazine, vol. 1, no. 3, pp. 36–46, July 2012.
[8] N. Ahmed, T. Natarajan, and K. R. Rao, “Discrete cosine transform,” IEEE Transactions on Computers, vol. C-23, no. 1, pp. 90–93, Jan. 1974.
[9] N. C. Vayalil, A. Safari, and Y. Kong, “Overlapped block-processing VLSI architecture for separable 2D filters,” in Electronics, Communications and Networks IV, June 2015, pp. 1355–1358.
[10] K. R. Rao and P. Yip, Discrete Cosine Transform: Algorithms, Advantages, Applications. Academic Press, Boston, 1990.
[11] M. Budagavi, A. Fuldseth, G. Bjøntegaard, V. Sze, and M. Sadafale, “Core transform design in the high efficiency video coding (HEVC) standard,” IEEE Journal of Selected Topics in Signal Processing, vol. 7, no. 6, pp. 1029–1041, Dec. 2013.
[12] K. Choi and E. S. Jang, “Early TU decision method for fast video encoding in high efficiency video coding,” Electronics Letters, vol. 48, no. 12, pp. 689–691, June 2012.
[13] C. C. Wang, Y. C. Liao, J. W. Wang, and C. W. Tung, “An effective TU size decision method for fast HEVC encoders,” in 2014 International Symposium on Computer, Consumer and Control (IS3C), June 2014, pp. 1195–1198.
[14] J. Su, K. Nitta, M. Ikeda, and A. Shimizu, “Residue role assignment based transform partition predetermination on HEVC,” in 2013 IEEE International Conference on Image Processing, Sept. 2013, pp. 2019–2023.
[15] J. Kang, H. Choi, and J. G. Kim, “Fast transform unit decision for HEVC,” in 2013 6th International Congress on Image and Signal Processing (CISP), vol. 01, Dec. 2013, pp. 26–30.
[16] Z. Pan, J. Lei, Y. Zhang, W. Yan, and S. Kwong, “Fast transform unit depth decision based on quantized coefficients for HEVC,” in 2015 IEEE International Conference on Systems, Man, and Cybernetics (SMC), Oct. 2015, pp. 1127–1132.
[17] J. T. Fang, Y. C. Tsai, J. X. Lee, and P. S. Yu, “Computation reduction in transform unit of high efficiency video coding based on zero-coefficients,” in 2016 International Symposium on Computer, Consumer and Control (IS3C), July 2016, pp. 797–800.
[18] HEVC reference software 16.3. [Online]. Available: https://hevc.hhi.fraunhofer.de/svn/svn_HEVCSoftware/
[19] G. Bjøntegaard, Calculation of average PSNR differences between RD-curves, ITU-T SG16 Document VCEG-M33, Joint Collaborative Team on Video Coding (JCTVC), Apr. 2001.
[20] V. Sze, M. Budagavi, and G. J. Sullivan, Eds., High Efficiency Video Coding (HEVC) Algorithms and Architectures. Springer, 2014.