International Journal of Power Control and Computation(IJPCSC)
Vol 6. No.1 – Jan-March 2014 Pp. 29-35
©gopalax Journals, Singapore
available at : [Link]
ISSN: 0976-268X
COMMON SHARING DISTRIBUTED ARITHMETIC METHOD WITH
EIGHT PARALLEL COMPUTATION PATHS USED EFFECTIVE MULTI
STANDARD TRANSFORM CORE SUPPORTING THE STANDARDS
MPEG, H.264, VC-1
[Link] [Link],
Department Of ECE, SRM University.
Email: sudhaselvaraj1991@[Link], nivashs09@[Link]
ABSTRACT
This paper proposes a low-cost, high-throughput multistandard transform (MST) core that supports MPEG-1/2/4 (8 × 8), H.264 (8 × 8, 4 × 4) [2][3], and VC-1 (8 × 8, 8 × 4, 4 × 8, 4 × 4) transforms [4]. Common sharing distributed arithmetic (CSDA) combines factor sharing and distributed arithmetic sharing techniques, efficiently reducing the number of adders for high hardware-sharing capability [5][7]. This achieves a reduction in adders in the proposed MST compared with the direct implementation method. With eight parallel computation paths, the proposed MST core has a throughput rate eight times its operating frequency. The CSDA-MST core [8][9] thus achieves a high throughput rate supporting multistandard transformations at low cost.
KEYWORDS: Common sharing distributed arithmetic (CSDA), discrete cosine transform (DCT), integer transform, multistandard transform (MST).
1. INTRODUCTION
H.264 and VC-1 are popular video compression standards. The VC-1 codec is designed to achieve state-of-the-art compressed video quality at bit rates that may range from very low to very high [6]. VC-1 Advanced Profile is also transport and container independent. This provides even greater flexibility for device manufacturers and content services.

Innovations:
VC-1 includes a number of innovations that enable it to produce high-quality content. This section provides brief descriptions of some of these features.

Adaptive Block Size Transform:
Traditionally, 8 × 8 transforms have been used for image and video coding. However, there is evidence to suggest that 4 × 4 transforms can reduce ringing artifacts at edges and discontinuities [6]. VC-1 is capable of coding an 8 × 8 block using either an 8 × 8 transform, two 8 × 4 transforms, two 4 × 8 transforms, or four 4 × 4 transforms. This feature enables coding that takes advantage of the different transform sizes as needed for optimal image quality.

16-Bit Transforms:
In order to minimize the computational complexity of the decoder, VC-1 uses 16-bit transforms. This also has the advantage of easy implementation on the large amount of digital signal processing (DSP) hardware [2][11] built with 16-bit processors. Among the constraints put on VC-1 transforms is the requirement that the 16-bit values used produce results that can fit in 16 bits. The constraints on transforms ensure that decoding is as efficient as possible on a wide range of devices.

Motion compensation:
Motion compensation is the process of generating a prediction of a video frame by displacing the reference frame. Typically, the prediction is formed for a block (an 8 × 8 pixel tile) or a macro block (a 16 × 16 pixel tile) of data [9][10]. The displacement of
data due to motion is defined by a motion vector, which captures the shift along both the x- and y-axes. The efficiency of the codec is affected by the size of the predicted block, the granularity of sub-pixel data that can be captured, and the type of filter used for generating sub-pixel predictors [11][12]. VC-1 uses 16 × 16 blocks for prediction, with the ability to generate mixed frames of 16 × 16 and 8 × 8 blocks. The finest granularity of sub-pixel information supported by VC-1 is 1/4 pixel. Two sets of filters are used by VC-1 for motion compensation. The first is an approximate bicubic filter with four taps. The second is a bilinear filter with two taps.

VC-1 combines the motion vector settings defined by the block size, sub-pixel granularity, and filter type into modes. The result is four motion compensation modes that suit a range of different situations [12][13]. This classification of settings into modes also helps compact decoder implementations.

Loop Filtering:
VC-1 uses an in-loop de-blocking filter that attempts to remove block-boundary discontinuities introduced by quantization errors in interpolated frames. These discontinuities can cause visible artifacts in the decompressed video frames and can impact the quality of the frame as a predictor for future interpolated frames. The loop filter takes into account the adaptive block size transforms. The filter is also optimized to reduce the number of operations required.

Interlace Coding:
Interlaced video content is widely used in television broadcasting. When encoding interlaced content, the VC-1 codec can take advantage of the characteristics of interlaced frames [4][5] to improve compression. This is achieved by using data from both fields to predict motion compensation in interpolated frames.

Advanced B Frame Coding:
A bi-directional or B frame is a frame that is interpolated from data in both previous and subsequent frames. B frames are distinct from I frames (also called key frames), which are encoded without reference to other frames. B frames are also distinct from P frames, which are interpolated from previous frames only. VC-1 includes several optimizations that make B frames more efficient.

Fading compensation:
Due to the nature of compression that uses motion compensation, encoding of video frames that contain fades to or from black is very inefficient. With a uniform fade, every macro block needs adjustments to luminance. VC-1 includes fading compensation, which detects fades and uses alternate methods to adjust luminance. This feature improves compression efficiency for sequences with fading and other global illumination changes.

Differential quantization:
Differential quantization, or dquant, is an encoding method in which multiple quantization steps are used within a single frame. Rather than quantizing the entire frame with a single quantization level, the encoder identifies macro blocks within the frame that might benefit from lower quantization levels and a greater number of preserved AC coefficients. Such macro blocks are then encoded at lower quantization levels than the one used for the remaining macro blocks in the frame [5]. The simplest and typically most efficient form of differential quantization involves only two quantizer levels (bi-level dquant), but VC-1 supports multiple levels, too.

2. EXISTING SYSTEM
A significant amount of research has been conducted to efficiently combine and implement the transform units for multiple codecs [5][7]. On the other hand, little research has focused on the implementation of a multi-quantization unit. Among the multiple-transform units, a unified Inverse Discrete Cosine Transform (IDCT) architecture supporting five standards (AVS, H.264, VC-1, MPEG-2/4, and JPEG) has been presented. Another work offers an area-efficient architecture that performs a DCT-based transform for JPEG, MPEG-4, VC-1, and H.264 using delta mapping. A further design is an IDCT and IQ circuit for H.264, MPEG-4, and VC-1. The MJPEG standard defines quantization as the division of the DCT coefficient coming from the transform unit by the corresponding Q value (specified by the quantization matrix). MJPEG allows the specification of Q-matrices, which facilitates the allocation of more bits to the coefficients that are visually more significant.

Intercommunication between video devices using different standards is very
inconvenient [8][10]; thus, video codecs supporting multiple standards are more useful and more attractive. In this brief, a low-cost very large scale integration (VLSI) architecture is designed for a multistandard inverse discrete cosine transform. It is used in a multistandard decoder for MPEG-2, MPEG-4 ASP, and VC-1 [4][6]. Two circuit-sharing strategies, factor share (FS) and adder share (AS), are applied to the inverse transform architecture to save circuit resources. Pipeline stages are used in this multistandard inverse transform to increase the operating speed.

Existing work also offers: the possibility to employ hierarchical prediction structures for providing temporal scalability with several layers while improving the coding efficiency and increasing the effectiveness of quality and spatial scalable coding; new methods for inter-layer prediction of motion and residual that improve the coding efficiency of spatial scalable and quality scalable coding; the concept of key pictures for efficiently controlling the drift in packet-based quality scalable coding with hierarchical prediction structures; and single motion compensation loop decoding for spatial and quality scalable coding, which provides a decoder complexity close to that of single-layer coding.

3. PROPOSED SYSTEM
The proposed CSDA algorithm combines the FS and DA methods. By expanding the coefficient matrix at the bit level, the FS method first shares the same factor in each coefficient; the DA method is then applied to share the same combination of the input among each coefficient position. In a matrix inner product, the proposed CSDA proceeds as follows [1][2]. The FS method is adopted first to identify the factors that can achieve higher capability in hardware resource sharing, where the hardware resource is defined as the number of adders used. Next, the DA method is used to find the shared coefficient based on the results of the FS method. The proposed CSDA circuit is followed by adder-tree circuits [4][8]. Thus, the CSDA method aims to reduce the nonzero elements to as few as possible. The CSDA shared coefficient is then used for estimating and comparing the number of adders in the CSDA loop.

The proposed 2-D CSDA-MST core consists of two 1-D CSDA-MST cores (Core-1 and Core-2) with a transposed memory (TMEM). Core-1 and Core-2 differ in the word length of each arithmetic unit, MUX, and register, and the TMEM is designed using sixty-four 12-bit registers, where the output data from Core-1 can be transposed and fed into Core-2. The registers in TMEM are designed the same way as the architecture. Each core has four pipeline stages: two in the even and odd part CSDA circuits, and two in the ECATs. Therefore, the proposed 2-D CSDA-MST core has a latency of 16 clock cycles (= 4 + 8 + 4), and the TMEM executes the transpose operation after 12 clock cycles (= 4 + 8) when 8 pixels are input.

The CSDA-MST core achieves high performance, with a high throughput rate and a low-cost VLSI design, supporting MPEG-1/2/4, H.264, and VC-1 MSTs. By using the proposed CSDA method, the number of adders and MUXs in the MST core is reduced efficiently. Measured results show that the CSDA-MST core reaches a throughput rate of 1.28 G-pels/s, which can support the 4928 × 2048 @ 24 Hz digital cinema format with only 30 k logic gates. Because visual media technology has advanced rapidly, this approach will help meet rising high-resolution specifications and future needs as well.

4. MODULES
• 1-D Common sharing distributed arithmetic-MST
• Even part Common sharing distributed arithmetic circuit
• Odd part Common sharing distributed arithmetic circuit
• 2-D Common sharing distributed arithmetic-MST core

5. MODULE DESCRIPTION
1-D Common Sharing Distributed Arithmetic-MST:
Based on the proposed CSDA algorithm, the coefficients for MPEG-1/2/4, H.264, and VC-1 transforms are chosen to achieve high sharing capability for arithmetic resources. To carry out the searching flow, software code runs the iterative searching loops under a constraint on the minimum number of nonzero elements. In this paper, the constraint of minimum nonzero elements is set to five. After the software search, the coefficients are written in canonical signed digit (CSD) expression, in which a 1 with an overbar indicates −1. Note that the choice of shared coefficient is obtained under these constraints. Thus, the chosen CSDA coefficient is not a globally optimal solution; it is only a local or suboptimal solution. Besides, the chosen CSD codes are not necessarily the optimal expression with the minimum number of nonzero bits.
However, with the proposed CSDA algorithm, the chosen CSD coefficients for the MPEG-1/2/4, H.264, and VC-1 transforms can still achieve high sharing capability for arithmetic resources.
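To make the CSD notation concrete, the following is a minimal Python sketch of the standard canonical signed digit (non-adjacent form) recoding of an integer. It is not the authors' coefficient search software, which the paper does not list. The function names (`to_csd`, `csd_string`, `nonzero_count`) are hypothetical, and the sketch assumes a coefficient has already been scaled to an integer fixed-point value.

```python
def to_csd(n: int) -> list[int]:
    """Return the signed digits (-1, 0, 1) of n, least significant first.

    Standard non-adjacent form: no two neighbouring digits are both nonzero.
    """
    digits = []
    while n != 0:
        if n & 1:
            # Pick +1 when n % 4 == 1 and -1 when n % 4 == 3, so that the
            # remaining value is divisible by 4 and the next digit is 0.
            d = 2 - (n & 3)
            n -= d
        else:
            d = 0
        digits.append(d)
        n >>= 1
    return digits


def csd_string(n: int) -> str:
    """Most significant digit first; '-' stands in for the overbarred 1 (-1)."""
    symbols = {1: "1", 0: "0", -1: "-"}
    return "".join(symbols[d] for d in reversed(to_csd(n))) or "0"


def nonzero_count(n: int) -> int:
    """Number of nonzero signed digits, i.e. terms in a shift-and-add network."""
    return sum(1 for d in to_csd(n) if d != 0)


if __name__ == "__main__":
    for value in (7, 23):
        print(value, csd_string(value), nonzero_count(value))
```

For example, 7 is recoded as `100-`, that is 8 − 1, which needs two nonzero digits instead of the three ones in binary 111. Fewer nonzero digits mean fewer adders, which is why the search in the text constrains the number of nonzero elements.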
Even part common sharing distributed arithmetic circuit:
The SBF module executes for the eight-point transform and bypasses the input data for two four-point transforms. After the SBF module, the CSDA_E and CSDA_O execute, fed with input data a and b, respectively [10][11]. The CSDA_E calculates the even part of the eight-point transform, which is similar to the four-point transform for the H.264 and VC-1 standards. The architecture of CSDA_E contains two pipeline stages (12-bit and 13-bit). The first stage executes as a four-input butterfly matrix circuit, and the second stage of CSDA_E then uses the proposed CSDA algorithm to share hardware resources among the various standards.

6. ODD PART COMMON SHARING DISTRIBUTED ARITHMETIC CIRCUIT
Similar to the CSDA_E, the CSDA_O also has two pipeline stages. Based on the proposed CSDA algorithm, the CSDA_O efficiently shares the hardware resources among the odd part of the eight-point transform and the four-point transform for the various standards. It contains selection signals of multiplexers (MUXs) for the different standards. Eight adder trees with error compensation (ECATs) follow the CSDA_E and CSDA_O and add the nonzero CSDA coefficients with their corresponding weights in a tree-like architecture [11][12]. The ECAT circuits alleviate truncation error efficiently in a small-area design when summing the nonzero data.
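The data path described above can be checked numerically. The following is a minimal Python sketch, not the authors' hardware description. It assumes the unscaled cosine coefficients C_k = cos(kπ/16), since the paper neglects scaling and gives no numeric values. It also assumes that the SBF stage forms the sums a_i = x_i + x_(7−i) and the differences b_i = x_i − x_(7−i). The function names (`sbf`, `csda_e`, `csda_o`, `split_transform`) are hypothetical. The sketch computes the eight-point transform once directly and once through the even/odd split, and the two results should agree to rounding error.

```python
import math

# Assumption: unscaled DCT cosine coefficients, C[k] = cos(k*pi/16).
C = [math.cos(k * math.pi / 16) for k in range(8)]


def direct_transform(x):
    """Reference result: multiply the full 8x8 coefficient matrix by x."""
    z = [C[4] * sum(x)]  # row 0 holds C4 in every column
    for k in range(1, 8):
        # Entry (k, n) is cos(k*(2n+1)*pi/16); row 1 is, for example,
        # C1, C3, C5, C7, -C7, -C5, -C3, -C1.
        z.append(sum(math.cos(k * (2 * n + 1) * math.pi / 16) * x[n]
                     for n in range(8)))
    return z


def sbf(x):
    """Butterfly stage (assumed role of SBF): sums go to the even part,
    differences go to the odd part."""
    a = [x[i] + x[7 - i] for i in range(4)]
    b = [x[i] - x[7 - i] for i in range(4)]
    return a, b


def csda_e(a):
    """Even part: a four-input butterfly, then two 2x2 products."""
    A0, A1 = a[0] + a[3], a[1] + a[2]
    B0, B1 = a[0] - a[3], a[1] - a[2]
    z0 = C[4] * A0 + C[4] * A1
    z4 = C[4] * A0 - C[4] * A1
    z2 = C[2] * B0 + C[6] * B1
    z6 = C[6] * B0 - C[2] * B1
    return z0, z2, z4, z6


def csda_o(b):
    """Odd part: 4x4 product with the odd-row coefficients."""
    z1 = C[1] * b[0] + C[3] * b[1] + C[5] * b[2] + C[7] * b[3]
    z3 = C[3] * b[0] - C[7] * b[1] - C[1] * b[2] - C[5] * b[3]
    z5 = C[5] * b[0] - C[1] * b[1] + C[7] * b[2] + C[3] * b[3]
    z7 = C[7] * b[0] - C[5] * b[1] + C[3] * b[2] - C[1] * b[3]
    return z1, z3, z5, z7


def split_transform(x):
    """Eight-point transform via the SBF, even-part and odd-part stages."""
    a, b = sbf(x)
    z0, z2, z4, z6 = csda_e(a)
    z1, z3, z5, z7 = csda_o(b)
    return [z0, z1, z2, z3, z4, z5, z6, z7]


if __name__ == "__main__":
    x = [3, -1, 4, 1, -5, 9, 2, -6]
    ref = direct_transform(x)
    fast = split_transform(x)
    # Largest absolute difference between the two computations.
    print(max(abs(p - q) for p, q in zip(ref, fast)))
```

Floating-point multiplications are used here only to verify the algebra. In the proposed core these products would be realized by the shared shift-and-add networks of the CSDA circuits, and the word lengths and pipeline registers are not modeled.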
7. 2-D COMMON SHARING DISTRIBUTED ARITHMETIC-MST CORE
This section provides a discussion of the hardware resources and system accuracy for the proposed 2-D CSDA-MST core and also presents a comparison with previous works. Finally, the characteristics of the chip implementation are described.

8. COMMON SHARING DISTRIBUTED ARITHMETIC
Distributed Arithmetic (DA) has been widely adopted for its computational efficiency in many digital signal processing applications such as DCT (Discrete Cosine Transform), DFT (Discrete Fourier Transform), FIR (Finite Impulse Response), and DHT (Discrete Hartley Transform) [4][7]. These applications involve the computation of inner products between two vectors, one of which is a constant. The general method to generate the products is to use a MAC (multiply and accumulate) unit, which is fast but has a high cost in the case of long inner products.

The idea behind conventional DA, called ROM-based DA, is to replace multiplication operations by pre-computing all possible values and storing these in a ROM. ROM-based DA has been reported to reduce the circuit size by 50-80% on average. Custom reconfigurable technology emerged to satisfy the simultaneous demand for flexibility and efficiency. Custom/domain-specific reconfigurable arrays can be programmed to adapt to different applications, so the efficiency of the hardware and the flexibility of the whole system are improved. Earlier works show good performance in area, power consumption, and speed. Since a domain-specific reconfigurable architecture targets few application fields, it achieves better performance than a general-purpose FPGA device.

9. PROPOSED 2-D CSDA-MST CORE DESIGN
This section introduces the proposed 2-D CSDA-MST core implementation. Neglecting the scaling factor, the one-dimensional (1-D) eight-point transform can be defined as follows:

$$
\begin{bmatrix} Z_0 \\ Z_1 \\ Z_2 \\ Z_3 \\ Z_4 \\ Z_5 \\ Z_6 \\ Z_7 \end{bmatrix}
= C \cdot
\begin{bmatrix} x_0 \\ x_1 \\ x_2 \\ x_3 \\ x_4 \\ x_5 \\ x_6 \\ x_7 \end{bmatrix}
\tag{8}
$$

where

$$
C = \begin{bmatrix}
C_4 & C_4 & C_4 & C_4 & C_4 & C_4 & C_4 & C_4 \\
C_1 & C_3 & C_5 & C_7 & -C_7 & -C_5 & -C_3 & -C_1 \\
C_2 & C_6 & -C_6 & -C_2 & -C_2 & -C_6 & C_6 & C_2 \\
C_3 & -C_7 & -C_1 & -C_5 & C_5 & C_1 & C_7 & -C_3 \\
C_4 & -C_4 & -C_4 & C_4 & C_4 & -C_4 & -C_4 & C_4 \\
C_5 & -C_1 & C_7 & C_3 & -C_3 & -C_7 & C_1 & -C_5 \\
C_6 & -C_2 & C_2 & -C_6 & -C_6 & C_2 & -C_2 & C_6 \\
C_7 & -C_5 & C_3 & -C_1 & C_1 & -C_3 & C_5 & -C_7
\end{bmatrix}
$$

Because the eight-point coefficient structures in the MPEG-1/2/4, H.264, and VC-1 standards are the same, the eight-point transform for these standards can use the same mathematical derivation. According to the symmetry property, the 1-D eight-point transform in (8) can be divided into two four-point transforms, the even part $Z_e$ and the odd part $Z_o$, as listed in (9) and (10), respectively:

$$
Z_e = \begin{bmatrix} Z_0 \\ Z_2 \\ Z_4 \\ Z_6 \end{bmatrix}
= \begin{bmatrix}
C_4 & C_4 & C_4 & C_4 \\
C_2 & C_6 & -C_6 & -C_2 \\
C_4 & -C_4 & -C_4 & C_4 \\
C_6 & -C_2 & C_2 & -C_6
\end{bmatrix}
\begin{bmatrix} a_0 \\ a_1 \\ a_2 \\ a_3 \end{bmatrix}
= C_e \cdot a
\tag{9}
$$

$$
Z_o = \begin{bmatrix} Z_1 \\ Z_3 \\ Z_5 \\ Z_7 \end{bmatrix}
= \begin{bmatrix}
C_1 & C_3 & C_5 & C_7 \\
C_3 & -C_7 & -C_1 & -C_5 \\
C_5 & -C_1 & C_7 & C_3 \\
C_7 & -C_5 & C_3 & -C_1
\end{bmatrix}
\begin{bmatrix} b_0 \\ b_1 \\ b_2 \\ b_3 \end{bmatrix}
= C_o \cdot b
\tag{10}
$$
where

$$
a = \begin{bmatrix} x_0 + x_7 \\ x_1 + x_6 \\ x_2 + x_5 \\ x_3 + x_4 \end{bmatrix},
\qquad
b = \begin{bmatrix} x_0 - x_7 \\ x_1 - x_6 \\ x_2 - x_5 \\ x_3 - x_4 \end{bmatrix}
$$

The even part of the operation in (9) is the same as that of the four-point H.264 and VC-1 transformations. Moreover, the even part $Z_e$ can be further decomposed into even and odd parts, $Z_{ee}$ and $Z_{eo}$:

$$
Z_{ee} = \begin{bmatrix} Z_0 \\ Z_4 \end{bmatrix}
= \begin{bmatrix} C_4 & C_4 \\ C_4 & -C_4 \end{bmatrix}
\begin{bmatrix} A_0 \\ A_1 \end{bmatrix}
= C_{ee} \cdot A
$$

$$
Z_{eo} = \begin{bmatrix} Z_2 \\ Z_6 \end{bmatrix}
= \begin{bmatrix} C_2 & C_6 \\ C_6 & -C_2 \end{bmatrix}
\begin{bmatrix} B_0 \\ B_1 \end{bmatrix}
= C_{eo} \cdot B
$$

where

$$
A = \begin{bmatrix} a_0 + a_3 \\ a_1 + a_2 \end{bmatrix},
\qquad
B = \begin{bmatrix} a_0 - a_3 \\ a_1 - a_2 \end{bmatrix}
$$

Applications:
• Video and image applications

10. CONCLUSION
The CSDA-MST core achieves high performance, with a high throughput rate and a low-cost VLSI design, supporting MPEG-1/2/4, H.264, and VC-1 MSTs. By using the proposed CSDA method, the number of adders and MUXs in the MST core is reduced efficiently. Measured results show that the CSDA-MST core reaches a throughput rate of 1.28 G-pels/s, which can support the 4928 × 2048 @ 24 Hz digital cinema format with only 30 k logic gates. Because visual media technology has advanced rapidly, this approach will help meet rising high-resolution specifications and future needs as well.

FUTURE ENHANCEMENT
In future work, the proposed system will be modified to reduce the area needed to convert the one-dimensional core into a two-dimensional core, and to further reduce the total number of adders by using the Common Sharing Distributed Arithmetic (CSDA) method.

REFERENCES
[1] S. Uramoto, Y. Inoue, A. Takabatake, J. Takeda, Y. Yamashita, H. Terane, and M. Yoshimoto, “A 100-MHz 2-D discrete cosine transform core processor,” IEEE J. Solid-State Circuits, vol. 27, no. 4, pp. 492–499, Apr. 1992.

[2] S. Yu and E. E. Swartzlander, “DCT implementation with distributed arithmetic,” IEEE Trans. Comput., vol. 50, no. 9, pp. 985–991, Sep. 2001.

[3] A. M. Shams, A. Chidanandan, W. Pan, and M. A. Bayoumi, “NEDA: A low-power high-performance DCT architecture,” IEEE Trans. Signal Process., vol. 54, no. 3, pp. 955–964, Mar. 2006.

[4] C. Peng, X. Cao, D. Yu, and X. Zhang, “A 250 MHz optimized distributed architecture of 2D 8×8 DCT,” in Proc. 7th Int. Conf. ASIC, Oct. 2007, pp. 189–192.

[5] C. Y. Huang, L. F. Chen, and Y. K. Lai, “A high-speed 2-D transform architecture with unique kernel for multi-standard video applications,” in Proc. IEEE Int. Symp. Circuits Syst., May 2008, pp. 21–24.

[6] Y. H. Chen, T. Y. Chang, and C. Y. Li, “High throughput DA-based DCT with high accuracy error-compensated adder tree,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 19, no. 4, pp. 709–714, Apr. 2011.

[7] Y. H. Chen, T. Y. Chang, and C. Y. Li, “A high performance video transform engine by using space-time scheduling strategy,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 20, no. 4, pp. 655–664, Apr. 2012.

[8] Y. K. Lai and Y. F. Lai, “A reconfigurable IDCT architecture for universal video decoders,” IEEE Trans.
Consum. Electron., vol. 56, no. 3, pp. 1872–1879, Aug. 2010.

[9] H. Chang, S. Kim, S. Lee, and K. Cho, “Design of area-efficient unified transform circuit for multi-standard video decoder,” in Proc. IEEE Int. SoC Design Conf., Nov. 2009, pp. 369–372.

[10] S. Lee and K. Cho, “Circuit implementation for transform and quantization operations of H.264/MPEG-4/VC-1 video decoder,” in Proc. Int. Conf. Design Technol. Integr. Syst. Nanosc., Sep. 2007, pp. 102–107.

[11] S. Lee and K. Cho, “Architecture of transform circuit for video decoder supporting multiple standards,” Electron. Lett., vol. 44, no. 4, pp. 274–275, Feb. 2008.

[12] H. Qi, Q. Huang, and W. Gao, “A low-cost very large scale integration architecture for multistandard inverse transform,” IEEE Trans. Circuits Syst., vol. 57, no. 7, pp. 551–555, Jul. 2010.