
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication.

Paper ID # J-STSP-VCHB-00081-2013

An Overview of Tiles in HEVC


Kiran Misra, Member, IEEE, Andrew Segall, Member, IEEE, Michael Horowitz, Shilin Xu, Arild Fuldseth, Minhua Zhou

Abstract - Tiles is a new feature in the High Efficiency Video Coding (HEVC) standard that divides a picture into independent, rectangular regions. This division provides a number of advantages. Specifically, it increases the parallel friendliness of the new standard by enabling improved coding efficiency for parallel architectures, as compared to previous slice-based methods. Additionally, tiles facilitate improved maximum transmission unit (MTU) size matching, reduced line buffer memory, and additional region-of-interest functionality. In this paper, we introduce the tiles feature and survey the performance of the tool. Coding efficiency is reported for different parallelization factors and MTU size requirements. Additionally, a tile-based region-of-interest coding method is developed.

Index Terms - Video coding, multicore processing, high efficiency video coding, tiles.

I. INTRODUCTION

The ISO/IEC Moving Pictures Experts Group (MPEG) and the International Telecommunication Union's (ITU-T) Video Coding Experts Group (VCEG) have recently concluded work on the first edition of the High Efficiency Video Coding (HEVC) standard [3][4][5]. This standard was developed collaboratively by the Joint Collaborative Team on Video Coding (JCT-VC). For consumer applications, HEVC has been reported to achieve a 50% improvement in coding efficiency when compared to previous coding standards such as MPEG-4 AVC/ITU-T H.264 [1][5]. These coding gains are achieved through a number of improvements that result in an increase in computational complexity for both encoder and decoder. Here, computational complexity refers to a combination of algorithmic operations and memory transfers.

Algorithmic operations correspond to the calculations required in a decoder to convert bit-stream information to reconstructed pixel values, or in an encoder to convert the original pixel values to a bit-stream. For hardware, this corresponds to logic gates; for software, it corresponds to calculations on a CPU, GPU, or other processing unit. Memory transfers represent the amount of data that must be stored and accessed to perform the required calculations. Typical architectures contain multiple memory types, ranging from high-speed memory that is on-chip (including caches near a CPU core) to lower-speed memory that is off-chip or farther from the core. In general, on-chip memory is more expensive and therefore relatively small. Additionally, for many architectures, the critical bottleneck is the bandwidth necessary to transfer data from off-chip to on-chip memory in time to complete the required calculations.

The increase in computational complexity in HEVC compared with earlier standards directly impacts implementation and design. For systems with a single-core processor, the increased complexity requires higher clock speeds, with the additional cost of increased power consumption and heat dissipation. For many applications of interest today, the increased clock rate is not desirable. An alternative solution for addressing the increased computational complexity is parallelism. Parallelism in a video system is not a new concept. For example, today's software-based video conferencing systems that operate at resolutions up to 1080p (1920x1080 pixels) and frame rates of 60 frames per second (fps) rely on high-level parallelism (i.e., encoders and decoders that can process different portions of a video picture in a relatively independent fashion), despite using the less computationally complex H.264/AVC and its scalable extension SVC.

Copyright (c) 2013 IEEE. Personal use of this material is permitted. However, permission to use this material for any other purposes must be obtained from the IEEE by sending a request to [email protected]. Manuscript received January 30, 2013; revised May 10, 2013; camera-ready version submitted June 21, 2013. Kiran Misra and Andrew Segall are with Sharp Laboratories of America, Inc., 5750 NW Pacific Rim Blvd, Camas, WA 98607, USA (e-mail: {misrak,asegall}@sharplabs.com). Michael Horowitz and Shilin Xu are with eBrisk Video, Inc., Suite 1450, 1055 West Hastings Street, Vancouver, BC, V6E 2E9, Canada (e-mail: {michael,shilin}@ebriskvideo.com). Arild Fuldseth is with Cisco Systems, Oslo, Norway (e-mail: [email protected]). Minhua Zhou is with Texas Instruments Inc., 12500 TI Blvd, Dallas, TX 75243, USA (e-mail: [email protected]).
With previous standards, high-level parallelism within a picture may be realized by partitioning the source frames using slices and assigning each slice to one of several processing cores. Slices were originally designed to map a bit-stream into smaller independently decodable chunks for transmission. The size of a coded slice was typically determined by the network characteristics; for example, the size is often selected to be less than the maximum transmission unit (MTU) size of the network being considered.


Figure 1 - Example illustrating rectangular picture partitioning and coded tree block (CTB) scanning order within a picture that is divided into nine tiles.

In practice, using slices for parallelization results in a number of disadvantages. For example, the pixel segmentation achieved by slices using only network constraints often results in a partitioning where the correlation existing in the pixel data is reduced. This lowers the achievable coding efficiency. Moreover, slices contain header information to facilitate independent processing of pixel data. With the higher coding efficiency of HEVC, this becomes problematic: it is possible to transmit high-resolution video at low bit rates, such that the overhead introduced by a slice header is not negligible. Finally, for applications that require both parallelization and packetization, it is difficult to use slices to achieve a partitioning that is optimal for both goals.

Tiles provide an alternative partitioning that divides a picture into rectangular sections that are processed in a relatively independent fashion. Figure 1 illustrates an example where a picture is partitioned into rectangular processing units called tiles [7]. HEVC also provides additional tools for parallelism, including both low-level and high-level methods. For high-level parallelism, one alternative approach is wavefront processing, which is further described in [9].

As described in this section, tiles have a number of desirable properties for next-generation video coding devices. In the rest of this paper, we provide a detailed description of the tiles feature in HEVC. The paper is organized as follows: Section II provides background on slices and tiles. In Section III, we discuss constraints on tiles as described in the HEVC standard. In Section IV, an example usage of tiles is provided. Section V reports the coding efficiency improvement associated with the use of tiles in MTU size matching and high-level parallelism applications, and further demonstrates the efficacy of tiles in lightweight bit-stream rewriting. Finally, Section VI provides concluding remarks.

Figure 2 - Example illustrating slice-based picture partitioning, with coded tree blocks (CTBs) following a raster scan order within the picture.

II. BACKGROUND

A. Slices

A video decoder consists of two fundamental processes: (a) bit-stream parsing, carried out by the entropy decoder, and (b) picture reconstruction, carried out by the pixel processing engine. The video bit-stream is typically organized in a causal fashion, where both the parsing and the reconstruction steps for the current location depend on information occurring earlier in the bit-stream. In practical applications, a video bit-stream may be transmitted over lossy channels before it arrives at the decoder. Loss of a part of the video bit-stream would lead to an inability to parse and/or reconstruct information later within the bit-stream. This causal dependency propagates, and therefore a single error may lead to an inability to process a significant portion of the bit-stream occurring after the error. To limit the propagation of errors, it is important to break dependencies in processing. Earlier video coding standards [1][2][6] achieved this by organizing the bit-stream into independently parsable units called slices. Within a picture, individual slices can be independently reconstructed.

In HEVC, slices define groups of independently parsable coded tree blocks (CTBs). Slices contain CTBs that follow raster scan order within a picture, as shown in Figure 2. Previous standards such as H.264/AVC include tools such as flexible macroblock ordering (FMO) that enable defining arbitrarily shaped slices. However, while FMO provides excellent capabilities in defining slice shapes, it unfortunately requires frame-level decoupling of the deblocking filter from the rest of the decoding process. Thus, in the context of H.264/AVC, it is not possible to perform macroblock-based deblocking filtering with FMO.
Largely as a result of this property, FMO is not included in HEVC, as the frame-level deblocking significantly increases memory bandwidth and leads to a decrease in decoder performance. By contrast, with tiles, the in-loop filtering process can be performed at the CTB level with the use of vertical column buffers. Moreover, tiles in HEVC provide the additional benefit of lower overhead since, unlike FMO, they do not have associated slice headers.

Figure 4 - Illustration of sample data from reconstructed frames to be stored in on-chip memory for three different CTB rows. The dashed lines show the sample data stored in memory.

In HEVC, slice partitioning may be based on network constraints, such as the maximum transmission unit (MTU), or on pixel processing constraints, such as the maximum number of CTBs to be contained within a slice. As can be seen in Figure 2, following the raster-scan order within a picture results in a partitioning that has a lower level of spatial correlation within the picture. Additionally, every slice contains an associated slice header, which adds a non-negligible overhead at lower bit rates. As a result of the reduced spatial correlation and the additional slice header overhead, video coding efficiency suffers.

B. Tiles

A picture in HEVC is partitioned into coded tree blocks. In addition, each picture may be partitioned into rows and columns of CTBs, and the intersection of a row and a column results in a tile. Note that tiles are always aligned with CTB boundaries. As a result of this partitioning flexibility, a tile may be spatially more compact than a slice containing the same number of coded tree blocks. This has the benefit of higher correlation between pixels compared to slices. As an additional advantage, tiles do not contain headers, which improves coding efficiency. The CTBs within a tile are processed in raster scan order, and the tiles within a picture are themselves processed in raster scan order. An example of the above-described partitioning is shown in Figure 1, where a picture is partitioned into three tile columns and three tile rows. The CTBs within the first (upper-left) tile, depicted as numbered squares 1-12, are scanned in raster scan order. After scanning the first tile, the second tile follows (in tile raster scan order).
Specifically, as shown in Figure 1, CTB #13 in the second tile follows CTB #12 in the first tile. Tiles allow the column and row boundaries to be specified either with or without uniform spacing.
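The tile scan order just described (CTBs in raster order within a tile, tiles in raster order within the picture) can be sketched as follows. This is an illustrative sketch only; the function name and the boundary-list representation are our own, not HEVC syntax.

```python
def ctb_scan_order(pic_w_ctbs, pic_h_ctbs, col_starts, row_starts):
    """Yield (x, y) CTB coordinates in tile scan order: tiles in
    raster order within the picture, CTBs in raster order within
    each tile.  col_starts/row_starts list the starting CTB column
    and row of each tile, e.g. [0, 3, 6] for three tile columns."""
    cols = list(col_starts) + [pic_w_ctbs]
    rows = list(row_starts) + [pic_h_ctbs]
    for ty in range(len(row_starts)):          # tile rows, top to bottom
        for tx in range(len(col_starts)):      # tile columns, left to right
            for y in range(rows[ty], rows[ty + 1]):
                for x in range(cols[tx], cols[tx + 1]):
                    yield (x, y)
```

For a 9x9-CTB picture split into nine 3x3-CTB tiles, the tenth CTB visited is (3, 0): the scan finishes the first tile before jumping to the top-left CTB of the second tile, mirroring the CTB #12 to CTB #13 transition in Figure 1.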

Figure 3 - Example illustrating tile partitioning of a picture to identify a region of interest. In the above example, the tiles in the center column of the first two rows form the region of interest.

The modified scan pattern has the advantage of reduced line buffer requirements for motion estimation. Specifically, the prediction of a CTB requires storing, in on-chip memory, reconstructed pixel data (from previously coded frames) that are candidates for motion compensation. This data is loaded into on-chip memory and retained until no longer needed. Without tiles, raster scanning of a picture results in storing sample data equal to PicW*(2*SRy + CtbHeight), where PicW is the width of the picture, SRy is the maximum vertical size of a motion vector in full-sample units, and CtbHeight is the height of a CTB in full-sample units. With tiles, the modified scan pattern results in a sample data storage requirement of approximately (TileW + 2*SRx)*(2*SRy + CtbHeight), where TileW is the width of a tile and SRx is the maximum horizontal size of a motion vector. Using tiles, sample data storage can be substantially reduced when PicW is significantly larger than (TileW + 2*SRx), which is typical. Note that the above analysis assumes a single-core encoder that processes tiles sequentially. A graphical illustration of the memory required for motion estimation using tiles is shown in Figure 4.

In addition to changing the CTB scanning process, tile boundaries denote a break in coding dependencies. Dependencies between tiles are disabled in the same manner as between slices. Specifically, entropy coding and reconstruction dependencies are not allowed across a tile boundary; this includes motion vector prediction, intra prediction, and context selection. In-loop filtering is the only exception: it is allowed across tile boundaries but can be disabled by a flag in the bit-stream. Thus, separate tiles may be encoded on different processors with little inter-processor communication.
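The storage expressions PicW*(2*SRy + CtbHeight) and (TileW + 2*SRx)*(2*SRy + CtbHeight) can be evaluated with a short script. The concrete numbers below (1920-sample picture width, 640-sample tiles, 64x64 CTBs, a +/-64 sample search range) are illustrative assumptions of ours, not values from this paper.

```python
def raster_line_buffer(pic_w, sr_y, ctb_h):
    # Without tiles: the full picture width must be buffered.
    return pic_w * (2 * sr_y + ctb_h)

def tile_line_buffer(tile_w, sr_x, sr_y, ctb_h):
    # With tiles (single core, tiles processed sequentially):
    # only the tile width plus the horizontal search range matters.
    return (tile_w + 2 * sr_x) * (2 * sr_y + ctb_h)

no_tiles = raster_line_buffer(1920, 64, 64)      # 1920 * 192 = 368640 samples
with_tiles = tile_line_buffer(640, 64, 64, 64)   # 768 * 192  = 147456 samples
```

Under these assumptions, the tiled scan needs roughly 40% of the untiled storage, consistent with the observation that the savings grow as PicW becomes large relative to (TileW + 2*SRx).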
The breaking of coding dependencies at tile boundaries implies that a decoder can process tiles in parallel. To facilitate parallel decoding, the locations of tiles must be signaled in the bit-stream. In HEVC, the bit-stream offsets of all but the first tile are explicitly transmitted in the slice header [10][18][19]. The location of the first tile, immediately following the slice header, is known to the decoder.

Tiles have been successfully demonstrated to be an effective parallelism tool for pixel-load balancing in ultra-high-definition (UHD) video [8]. In addition to high-level parallelism, tiles also facilitate improved MTU size matching. Additionally, for some applications, the rectangular pixel data partitioning afforded by tiles facilitates region-of-interest (ROI) based coding. Figure 3 illustrates an example where two tiles are used to represent the ROI within a video source. Tile-based ROI identification can be used to facilitate asymmetric processing of pictures, where the tile corresponding to the ROI is processed by a more computationally capable core. This is a desirable trait in applications where the ROI's encoder rate-distortion decision process is computationally more demanding due to the use of advanced search algorithms and distortion metrics. Additionally, when tiles lying within an ROI are coded independently, the subset of the bit-stream corresponding to the ROI can be easily extracted and reconstituted into another bit-stream with lower bit rate requirements. An example application where tile partitioning and ROI-based coding are used to perform lightweight bit-stream rewriting is demonstrated in Section V.C.

An equally strong benefit of tiles is the reduction of memory bandwidth and on-chip memory requirements. Specifically, with a rectangular partitioning, the size of the line buffers required for motion compensation and in-loop filtering is dramatically reduced. (The line buffer width is reduced from the width of the picture to approximately the width of the tile.
Thus, for even two tiles, the reduction is nearly 50%.) For an encoder with significant memory restrictions, this has the additional advantage of higher coding efficiency, as the reduced memory requirements of tiles enable a larger vertical search range for such an encoder [12].

C. Complexity properties

For system designers, the worst-case performance of a tool is an important measure that needs to be considered when designing solutions that are guaranteed to meet a minimum constraint. With this in mind, we describe the worst-case per-picture execution time for slices and tiles. It is assumed that the degree of parallelism afforded by slices and tiles is the same, so as to enable comparisons between them. We consider a video coding environment with N independent encoders and N independent decoders, all with equal computational capability. Let T_enc(n) represent the encoding time for encoder n, n = 1, ..., N. A picture consists of a total of R rows of CTBs and C columns of CTBs. If a picture is partitioned into N slices, then the worst-case encoding time for the system would be:

  T_slice = max_n T_enc(n) + T_dbf(s) + T_sao(s)    (1)

where T_dbf(s) and T_sao(s) represent the deblocking and sample adaptive offset filter encoding times at slice boundaries, respectively. If a picture is partitioned into N uniform tiles rather than N uniform slices, then the number of CTBs at the boundaries of the tiles can always be made smaller than or equal to the number of CTBs at the slice boundaries. For tiles, the worst-case encoding time for the system would be:

  T_tile = max_n T_enc(n) + T_dbf(t) + T_sao(t)    (2)

where T_dbf(t) and T_sao(t) represent the deblocking and sample adaptive offset filter encoding times at tile boundaries, respectively. Similar expressions for the worst-case decoding time can be derived for the system.

As an example, if we assume that (a) the only difference in execution times between slice- and tile-based processing lies in the processing of pixels at the boundaries, (b) the number of CTBs in a picture is large compared to N, and (c) tiles take on square shapes, then the number of boundary edges shared by the slice- and tile-based parallelism approaches is at most min(N-1, R-1)*C and 2*(sqrt(N)-1)*sqrt(R*C), respectively. Here the function min(x, y) returns the smaller of the two values x and y; if the two values are equal, the function returns x. The second expression, representing the shared boundary for tiles, is strictly smaller than the first, indicating that, under the stated assumptions, tile-based parallel processing may be preferred. In practice, however, the non-square nature of tiles and the coding complexity of the video scene make it necessary to refine these complexity considerations further.

In theory, it is possible to perform load balancing between different processing cores by estimating the coding complexity of different regions of a picture and redefining tile boundaries accordingly. A good load-balancing algorithm would increase resource utilization while reducing average processing times. However, frequent changes in tile boundaries, and the associated changes in scan pattern and buffering requirements, make software/hardware optimizations difficult to achieve. In practice, system designers need to determine a good trade-off between variable tile structures and optimized implementations for their individual applications.
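A numeric sketch of this boundary-edge comparison, where N denotes the number of parallel workers and the picture measures R rows by C columns of CTBs (square tile grid assumed). The worked numbers are illustrative choices of ours, not results from the paper.

```python
import math

def slice_boundary_ctbs(n, r, c):
    # Slices: up to N-1 horizontal boundaries, each spanning C CTBs.
    return min(n - 1, r - 1) * c

def tile_boundary_ctbs(n, r, c):
    # Square tile grid of sqrt(N) x sqrt(N) tiles in an R x C CTB picture.
    return 2 * (math.isqrt(n) - 1) * math.sqrt(r * c)

# N = 9 workers, 30x30-CTB picture:
slices = slice_boundary_ctbs(9, 30, 30)   # min(8, 29) * 30 = 240
tiles = tile_boundary_ctbs(9, 30, 30)     # 2 * 2 * 30 = 120.0
```

For this example, tile-based partitioning shares half as many boundary CTB edges as slice-based partitioning, in line with the claim that the tile expression is strictly smaller under the stated assumptions.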


III. CONSTRAINTS ON TILES

In this section, we begin by listing the constraints related to tiles in HEVC. Supporting tiles in the HEVC system requires the transmission of the tile configuration information from an encoder to a decoder. This includes column and row locations, loop filter control, and the bit-stream location information for the start of all but the first tile in a picture. With uniform spacing, the tile boundaries are automatically distributed uniformly across the picture, balancing the pixel load approximately evenly amongst the tiles. Alternatively, tile boundaries may be explicitly specified, for example based on picture coding complexity. When more than one tile exists within a picture, the tile column widths and tile row heights are required to be greater than or equal to 256 and 64 luma samples, respectively. This constraint ensures that tile sizes cannot be too small. Additionally, the total number of tiles within a picture is limited by constraining the maximum number of tile columns and tile rows allowed within a picture based on the level of the bit-stream under consideration. These bounds are specified in Table A-1 of the HEVC standard and monotonically increase with increasing level.

In HEVC, slice boundaries can also be introduced by the encoder and need not be coincident with tile boundaries. However, to manage decoder implementation complexity, the combination of slices and tiles is constrained. Specifically, either all coded blocks in a tile must belong to the same slice, or all coded blocks in a slice must belong to the same tile. Figure 5 illustrates the two constraints: in Figure 5a, all coded blocks within the three tiles belong to a single slice, illustrating the former constraint, while in Figure 5b, all the coded blocks in each slice belong to the same tile. As a consequence of these constraints, a slice that does not start at the beginning of a tile cannot span multiple tiles.

Figure 5 - Interaction between tiles and slices: (a) three complete tiles contained within a single slice; (b) two complete slices contained within each tile.

An additional constraint placed on the tile system in HEVC is that all tile locations within a bit-stream are provided to the decoder. Conceptually, a single-core decoder could ignore this location information while processing a bit-stream containing tiles. This would result in a single-core decoder following the memory access pattern described in Section II.B. However, the HEVC design requires the transmission of entry points for all but the first tile, with the goal of realizing two key benefits. The first benefit is enabling the maximum amount of parallelization at the decoder; the second is allowing a bit-stream containing tiles to be decoded in raster-scan order. More specifically, a decoder may receive a bit-stream in tile raster-scan order but choose to decode it in the (alternative) raster scan of the frame. This is achieved by decoding the CTBs in the first row of the first tile, saving the entropy coding state, and then resetting the entropy coder to decode the CTBs in the first row of the neighboring tile. For the example illustrated in Figure 5a, this corresponds to decoding CTBs 1, 2, 3, 4 (in that order); saving the entropy coding state for the current tile; and resetting the entropy coder before continuing with CTB 33 in the adjacent tile. This benefits single-core decoders, as a single-core device can decode a bit-stream containing tiles without significant changes in processing or memory access pattern. Entry points also allow a bit-stream containing independently decodable tiles to be extracted and reconstituted into a lower bit rate stream. This requires further non-normative (encoder-side) constraints on the bit-stream being generated. These non-normative constraints, with the corresponding lightweight rewriting experiments, are listed in Section V.C.

IV. EXAMPLE USAGE FOR TILES

We now take a more detailed look at an example use case for tiles: video conferencing applications that stand to benefit from the parallelism afforded by tiles. Software-based interactive video applications run on platforms ranging from laptop and desktop computers to tablets and smart phones. Modern desktop and laptop computers use CPUs with four or more processing cores, and many tablets and smartphones with dual-core and even quad-core ARM processors are commercially available. One way to leverage multi-core computational power for HEVC video encoding and decoding is to use tiles. The following example describes the use of the tiles syntax in the context of a software-based 720p (1280x720) interactive video application operating at 60 frames per second (fps). The example application is designed for a hardware platform containing an Intel Core i7 CPU which, accounting for hyper-threading, has eight virtual processing cores. The application consists of several components including an HEVC encoder, HEVC decoder, audio processing, user interface, etc. With the relative computational complexity associated with each component in mind, four virtual cores are allocated for HEVC encoding, two for decoding (HEVC decoding generally requires fewer computational resources than encoding), and the remaining two cores are reserved for all other application components. To take better advantage of the processing capacity of the four cores allocated for encoding, the input picture is partitioned into four tiles. Since each core has identical computational capabilities, it is desirable to partition a given picture so that the encoding of each resulting tile requires the same processing power. To achieve a proper processing load balance, tiles in active picture regions requiring more processing power are made smaller than tiles in less computationally demanding regions. One simple load-balancing strategy starts by partitioning the picture into tiles of roughly equal size and adapting the size of the tiles over time depending on source content.
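The four-tile 720p partition used in this example can be derived with an integer-division split in the spirit of HEVC's uniform spacing rule. This is a sketch under our own naming (the variables below are not HEVC syntax elements), assuming 64x64 CTBs and a 2x2 tile grid.

```python
def uniform_tile_split(size_in_ctbs, num_tiles):
    # Integer-division split: any remainder CTBs are spread so that
    # tile sizes differ by at most one CTB.
    return [(i + 1) * size_in_ctbs // num_tiles - i * size_in_ctbs // num_tiles
            for i in range(num_tiles)]

CTB = 64
cols = uniform_tile_split(1280 // CTB, 2)             # [10, 10] CTB columns
rows = uniform_tile_split((720 + CTB - 1) // CTB, 2)  # [6, 6] CTB rows
widths = [w * CTB for w in cols]                      # [640, 640] luma samples
# The bottom tile row is clipped at the 720-sample picture boundary:
heights = [rows[0] * CTB, 720 - rows[0] * CTB]        # [384, 336] luma samples
```

The resulting 640x384 and 640x336 tile sizes match the dimensions worked out in the prose of this section; the same split function applied to, say, 17 CTB rows across 3 tile rows yields sizes differing by one CTB.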
Tile locations and dimensions are specified in the picture parameter set (PPS). The use of the PPS facilitates picture-to-picture tile configuration changes that may be made in order to load balance. In this example, the picture is partitioned into four uniformly spaced tiles (to facilitate load balancing). The resulting tiles are more spatially compact than those resulting from other partitioning strategies (e.g., four tiles side by side). In general, tile compactness results in improved coding efficiency, as discussed earlier. The resulting left and right tile columns each have a width of 640 luma samples. Assuming the coded tree blocks comprise 64x64 luma samples, setting the first tile row height to 6 CTBs results in the top tile row having a height of 384 luma samples, while the bottom row has a height of 336 samples. In this way, four tiles having dimensions 640x384, 640x384, 640x336, and 640x336, counting clockwise from the upper left, are specified. Having specified the tile dimensions, the encoder partitions the input picture into four tiles and sends the picture data associated with each tile to a separate processing core for encoding. In this way, the encoder may achieve full processor utilization with very low delay. After encoding, the bits produced by each core must be assembled into a coded slice in decoding order (tiles are decoded in raster scan order within a picture) prior to being placed in a data packet and sent to the network for transport. For the sake of clarity, we shall assume the bits for all tiles in a picture are contained within a single coded slice. This assumption is not unreasonable for 720p, 60 fps HEVC encoding in the context of video conferencing.

Parallel decoding requires a different approach. Whereas an encoder has the flexibility to choose how to partition and allocate portions of an input picture for parallel processing, the decoder, due to dependencies in the decoding process, cannot arbitrarily partition an input bit-stream. To facilitate parallel decoding, HEVC inserts information into the slice header to signal the entry points associated with the tiles contained within that slice. In this example, three entry points per slice are signaled to mark the locations in the bit-stream of the start of the second, third, and fourth tiles. The decoder receives the coded slice, parses the slice header, and determines the associated PPS, which is mandated to occur earlier in the bit-stream than any slice referencing it. From the PPS, the decoder may derive the number of tiles as well as the location and spatial dimensions of each tile. In addition, the decoder determines the locations in the bit-stream of the tile entry points from the slice header. The tile substreams may then be sent to the two independent cores for decoding; in this example, each core is assigned two tiles for processing at the decoder. An important benefit of signaling entry points in the slice header is that it facilitates raster-scan-based decoding, as described in Section III.

V. EXPERIMENTS

To assess the benefit of the tiles feature, we report the coding efficiency improvement of the approach in a number of configurations.
These configurations include the cases of high-level parallelization and network maximum transmission unit (MTU) size matching. The efficacy of tiles is further demonstrated using a lightweight bit-stream rewriting example. Another experiment, considering motion estimation with limited on-chip memory, is reported in [7]. The experiments reported here are conducted on test sequences of different resolutions. The sequences used in the experiments are classified into five groups based on their resolution. Class A sequences have the highest resolution of 2560x1600 and are cropped versions of ultra-high-definition (UHD) 4K sequences. Class B sequences correspond to full high-definition (HD) sequences with a resolution of 1920x1080. Class C and Class D sequences correspond to WVGA and WQVGA resolutions of 832x480 and 416x240, respectively. Finally, Class E sequences correspond to sequences typically seen in video conferencing applications, with 720p (1280x720) resolution.
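These class resolutions map to CTB grids as follows. The sketch assumes 64x64 CTBs (the size used in the example of Section IV) with ceiling division at the right and bottom picture edges; the dictionary layout and names are ours.

```python
import math

RESOLUTIONS = {  # test class -> (width, height) in luma samples
    "A": (2560, 1600), "B": (1920, 1080), "C": (832, 480),
    "D": (416, 240), "E": (1280, 720),
}

def ctb_grid(width, height, ctb=64):
    # Picture dimensions in CTBs, rounding up at the edges.
    return (math.ceil(width / ctb), math.ceil(height / ctb))

grids = {cls: ctb_grid(*res) for cls, res in RESOLUTIONS.items()}
# e.g. Class A -> (40, 25) CTBs; Class E -> (20, 12) CTBs
```

These grid dimensions are what the per-class slice and tile partitions in Table 1 are carved from.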

Copyright (c) 2013 IEEE. Personal use is permitted. For any other purposes, permission must be obtained from the IEEE by emailing [email protected].

This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication.

Paper ID # J-STSP-VCHB-00081-2013

Table 1 - Slice and tile partitioning for experiment 1

| Class | Number of slices | Slice size (in CTBs) | Tiles (horizontal x vertical) | Tile dimensions (in CTBs, horizontal x vertical) |
|-------|------------------|----------------------|-------------------------------|--------------------------------------------------|
| A     | 25               | 40                   | 5 x 5                         | 8 x 5                                            |
| B     | 17               | 30                   | 6 x 3                         | 5 x 6                                            |
| C     | 4                | 26                   | 2 x 2                         | 7 x 4                                            |
| E     | 12               | 20                   | 4 x 3                         | 5 x 4                                            |

For the experiments, Class A includes the Traffic and PeopleOnStreet sequences; Class B includes the Kimono, ParkScene, Cactus, BasketballDrive and BQTerrace sequences; Class C includes the BasketballDrill, BQMall, PartyScene and RaceHorses sequences; Class D includes the BasketballPass, BQSquare, BlowingBubbles and RaceHorses sequences; Class E includes the FourPeople, Johnny and KristenAndSara sequences. Note that for the random access configuration Class E sequences are not tested, while for the low delay configurations Class A sequences are not tested. This is consistent with the test conditions defined during the standardization process of HEVC [15].

A. Comparing parallelism using Tiles versus Slices (Experiment 1)

In a first experiment, we compare the high-level parallelization performance of tiles to traditional slices. Here, we select the tile size to be approximately equal to the size of one WQVGA image frame (i.e., 416x240 pixels). The exact slice and tile partitioning used for the different classes of sequences is listed in Table 1. For reference, we select the slice size to have the same number of CTBs as a tile. Choosing tile sizes approximately equal to WQVGA results in a single tile, and correspondingly a single slice, for Class D sequences. This would lead to the anchor and test data having identical rate-distortion performance; consequently, Class D sequences are not tested in this experiment. Experiments are conducted using HM 9.2 [14] and the JCT-VC main configuration common conditions [15] for Class A to E test sequences. Results from the comparison appear in Table 2. As can be seen from the table, the tiles system provides average luma BD-rate improvements of 2.2%, 2.2%, 5.4% and 5.5% for the main configuration of the All Intra, Random Access, Low Delay B and Low Delay P scenarios, respectively, compared to slices and for the same amount of parallelization.

Table 2 - Encoder parallelization performance results for experiment 1

All Intra Main:
| Class   | Y     | U     | V     |
|---------|-------|-------|-------|
| A       | -1.5% | -1.1% | -1.0% |
| B       | -1.9% | -1.6% | -1.5% |
| C       | -0.9% | -0.8% | -0.9% |
| E       | -4.5% | -4.0% | -4.1% |
| Overall | -2.2% | -1.9% | -1.9% |

Random Access Main:
| Class   | Y     | U     | V     |
|---------|-------|-------|-------|
| A       | -2.1% | -1.9% | -1.8% |
| B       | -3.3% | -3.8% | -2.8% |
| C       | -1.3% | -1.6% | -2.0% |
| E       | *     | *     | *     |
| Overall | -2.2% | -2.4% | -2.2% |

Low Delay B Main:
| Class   | Y      | U      | V      |
|---------|--------|--------|--------|
| A       | *      | *      | *      |
| B       | -3.5%  | -3.3%  | -2.9%  |
| C       | -1.5%  | -1.4%  | -1.5%  |
| E       | -11.3% | -10.3% | -10.1% |
| Overall | -5.4%  | -5.0%  | -4.8%  |

Low Delay P Main:
| Class   | Y      | U      | V      |
|---------|--------|--------|--------|
| A       | *      | *      | *      |
| B       | -4.1%  | -3.9%  | -3.4%  |
| C       | -1.8%  | -1.9%  | -2.0%  |
| E       | -10.7% | -9.7%  | -10.1% |
| Overall | -5.5%  | -5.2%  | -5.2%  |
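The tile partitions of Table 1 can be reproduced with the uniform-spacing rule that HEVC uses to derive column widths and row heights when tiles are evenly spaced; the helper below is our own illustration of that integer derivation (function names are ours, not standard API):

```python
import math

def uniform_sizes(pic_size_in_ctbs, num_tiles):
    # HEVC-style uniform spacing: size_i = ((i+1)*W)//N - (i*W)//N,
    # so sizes differ by at most one CTB and always sum to W.
    return [((i + 1) * pic_size_in_ctbs) // num_tiles
            - (i * pic_size_in_ctbs) // num_tiles
            for i in range(num_tiles)]

def tile_grid(width_px, height_px, cols, rows, ctb_size=64):
    w = math.ceil(width_px / ctb_size)   # picture width in CTBs
    h = math.ceil(height_px / ctb_size)  # picture height in CTBs
    return uniform_sizes(w, cols), uniform_sizes(h, rows)

# Class A (2560x1600) with a 5x5 tile grid, as in Table 1: 8x5-CTB tiles.
col_widths, row_heights = tile_grid(2560, 1600, 5, 5)
# Class E (1280x720) with a 4x3 grid: 5x4-CTB tiles.
col_e, row_e = tile_grid(1280, 720, 4, 3)
```

These sizes match the tile dimensions listed in Table 1 for Classes A and E.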

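The BD-rate figures reported in Table 2, and in the experiments that follow, are computed with the Bjøntegaard measurement [16]: each rate-distortion curve is sampled at four QPs, the log rate is fit as a cubic polynomial in PSNR, and the integrated gap between the two fits over the overlapping PSNR range gives the average rate difference. A minimal sketch of that computation, with made-up rate-PSNR points:

```python
import numpy as np

def bd_rate(anchor_rates, anchor_psnr, test_rates, test_psnr):
    """Average bit rate difference (%) between two RD curves per [16].

    Negative values mean the test curve needs fewer bits than the
    anchor for the same quality.
    """
    fit_a = np.polyfit(anchor_psnr, np.log10(anchor_rates), 3)
    fit_t = np.polyfit(test_psnr, np.log10(test_rates), 3)
    lo = max(min(anchor_psnr), min(test_psnr))   # overlapping PSNR range
    hi = min(max(anchor_psnr), max(test_psnr))
    int_a = np.polyval(np.polyint(fit_a), hi) - np.polyval(np.polyint(fit_a), lo)
    int_t = np.polyval(np.polyint(fit_t), hi) - np.polyval(np.polyint(fit_t), lo)
    avg_log_diff = (int_t - int_a) / (hi - lo)   # mean log10 rate difference
    return (10.0 ** avg_log_diff - 1.0) * 100.0

# A test codec spending 10% fewer bits at identical PSNR reports about -10%.
psnr = [32.0, 34.5, 36.8, 38.6]            # one value per QP (37, 32, 27, 22)
anchor = [1000.0, 1800.0, 3200.0, 6000.0]  # kbps, illustrative
test = [r * 0.9 for r in anchor]
saving = bd_rate(anchor, psnr, test, psnr)
```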
B. MTU size matching using Tiles (Experiment 2)

In a second experiment, we compare the performance of the tiles system for MTU size matching to traditional slices. Here, an encoder divides a picture into slices that do not exceed 1500 bytes. This slice size is consistent with the MTU size of an Ethernet v2 network. Tiles are used to improve the coding efficiency of the system. We use column boundaries to divide the picture, since we observe that columns result in more square-like slice shapes, leading to higher correlations. The column widths used for each sequence class and encoder configuration are listed in Table 3. The higher correlations improve intra-prediction, mode prediction and motion vector coding, for example. Experiments are conducted using HM 9.2 [14] and the JCT-VC main configuration common conditions [15] for Class A to E test sequences. A coding tree block size of 32x32 was used in lieu of 64x64. Additionally, the HM-9.2 encoder is modified to allow byte-limited slices that begin at the start of a tile to extend to the end of another tile. Results for the experiment are reported in Table 4.

Table 3 - Column widths in units of 32x32 CTBs used for experiment 2

| Class | QP | All Intra Main | Random Access Main | Low Delay B Main | Low Delay P Main |
|-------|----|----------------|--------------------|------------------|------------------|
| A     | 22 | 40             | 5                  | *                | *                |
| A     | 27 | 40             | 7                  | *                | *                |
| A     | 32 | 4              | 10                 | *                | *                |
| A     | 37 | 4              | 10                 | *                | *                |
| B     | 22 | 30             | 6                  | 4                | 4                |
| B     | 27 | 5              | 6                  | 8                | 8                |
| B     | 32 | 6              | 8                  | 10               | 10               |
| B     | 37 | 8              | 10                 | 15               | 15               |
| C     | 22 | 13             | 4                  | 3                | 3                |
| C     | 27 | 13             | 7                  | 4                | 4                |
| C     | 32 | 3              | 7                  | 7                | 7                |
| C     | 37 | 4              | 13                 | 13               | 13               |
| D     | 22 | 7              | 4                  | 4                | 4                |
| D     | 27 | 7              | 7                  | 7                | 7                |
| D     | 32 | 2              | 7                  | 7                | 7                |
| D     | 37 | 4              | 7                  | 7                | 7                |
| E     | 22 | 4              | *                  | 7                | 7                |
| E     | 27 | 5              | *                  | 20               | 20               |
| E     | 32 | 5              | *                  | 20               | 20               |
| E     | 37 | 7              | *                  | 20               | 20               |

Table 4 - MTU size matching performance results for experiment 2

All Intra Main:
| Class   | Y     | U     | V     |
|---------|-------|-------|-------|
| A       | -0.9% | -0.3% | -0.3% |
| B       | -2.7% | -2.3% | -2.2% |
| C       | -1.8% | -1.4% | -1.5% |
| D       | -0.4% | -0.3% | -0.3% |
| E       | -4.8% | -4.2% | -4.0% |
| Overall | -2.1% | -1.8% | -1.7% |

Random Access Main:
| Class   | Y     | U     | V     |
|---------|-------|-------|-------|
| A       | -1.3% | -1.4% | -1.4% |
| B       | -1.9% | -2.1% | -1.9% |
| C       | -1.2% | -1.1% | -1.1% |
| D       | 0.0%  | -0.1% | 0.0%  |
| E       | *     | *     | *     |
| Overall | -1.1% | -1.2% | -1.1% |

Low Delay B Main:
| Class   | Y     | U     | V     |
|---------|-------|-------|-------|
| A       | *     | *     | *     |
| B       | -0.9% | -1.2% | -1.1% |
| C       | -0.6% | -0.4% | -0.3% |
| D       | 0.0%  | -0.1% | -0.1% |
| E       | 0.5%  | 1.2%  | 0.7%  |
| Overall | -0.4% | -0.2% | -0.3% |

Low Delay P Main:
| Class   | Y     | U     | V     |
|---------|-------|-------|-------|
| A       | *     | *     | *     |
| B       | -1.0% | -1.2% | -1.2% |
| C       | -0.5% | -0.4% | -0.7% |
| D       | 0.0%  | -0.1% | -0.1% |
| E       | 0.4%  | 0.5%  | 1.1%  |
| Overall | -0.4% | -0.4% | -0.4% |

As can be seen from the table, tiles improve the coding efficiency of HEVC in the MTU size matching scenario. Specifically, average luma BD-rate [16] improvements of 2.1%, 1.1%, 0.4% and 0.4% are reported for the main configuration of the All Intra, Random Access, Low Delay B and Low Delay P scenarios, respectively. As the CTB size decreases, the coding gain realized by using tiles increases. For example, for 16x16 CTBs the gains due to tiles have been shown to be 4.7%, 2.5%, and 0.9% luma BD-rate (on average) for the Intra, Random Access and Low Delay scenarios, respectively [17]. Note that at extremely low bit rates, where a single slice exists per picture, the coding efficiency benefits of the compact representation enabled by tiles are sometimes exceeded by the losses incurred due to breaking prediction dependencies at tile boundaries. This is evidenced by the coding efficiency losses observed for the Class E sequences in the Low Delay B and Low Delay P configurations. Based on the above rate-distortion results, it is fair to conclude that the utility of tiles for encoder parallelization and MTU size matching is low for smaller resolution sequences such as Class C and Class D sequences.

C. Lightweight bit-stream rewriting using tiles based region of interest coding (Experiment 3)

In a third experiment, we partition pictures into tiles and identify one tile as containing the region-of-interest (ROI). To ensure that the ROI is independently decodable from non-ROI tiles, temporal predictions within the ROI tile are prevented, using encoder restrictions, from referring to pixels outside the ROI within reference pictures. Additionally, the application of the deblocking and sample adaptive offset filters is disabled at tile boundaries. Each picture in the video source is coded as a single slice. The slice header contains location information identifying the start of each tile. Using the entry point information, a lightweight rewriting process extracts the tile corresponding to the ROI from each picture and rewrites the slice header and parameter sets to re-constitute a bit-stream containing only the ROI tile. Note that [20], which was recently adopted into the working draft of version 2 of the HEVC standard, describes a way to constrain the encoding process so that the decoder can correctly decode specific set(s) of tiles. It also describes an encoder constraint which avoids the need for ROI applications to disable deblocking and sample adaptive offset filtering across tile boundaries.
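To make the rewriting step concrete, here is a heavily simplified sketch: the slice payload is cut into tile substreams using the signaled entry point sizes (in HEVC these are carried as entry_point_offset_minus1 syntax elements; here they are modeled as plain byte counts), and only the ROI substream is kept. The data layout and field names below are illustrative only; real rewriting must also emit a new PPS describing a one-tile grid and recompute the remaining slice header fields.

```python
def split_substreams(slice_payload, entry_point_sizes):
    # entry_point_sizes[i]: bytes in substream i; the final tile's
    # substream is implicit and runs to the end of the slice payload.
    subs, pos = [], 0
    for size in entry_point_sizes:
        subs.append(slice_payload[pos:pos + size])
        pos += size
    subs.append(slice_payload[pos:])
    return subs

def extract_roi(pictures, roi_tile_index):
    """Keep only the ROI tile substream of each coded picture."""
    out = []
    for header, payload in pictures:
        subs = split_substreams(payload, header["entry_point_sizes"])
        # One remaining tile, so the rewritten header carries no entry points.
        new_header = {"poc": header["poc"], "num_tiles": 1}
        out.append((new_header, subs[roi_tile_index]))
    return out

# Two pictures, three tiles each; the ROI is tile index 1, as in Table 5.
coded = [({"poc": poc, "entry_point_sizes": [5, 3]},
          b"AAAAA" + b"ROI" + b"BBBB") for poc in range(2)]
roi_only = extract_roi(coded, roi_tile_index=1)
```

Because the operation is pure byte slicing plus header rewriting, no entropy decoding or re-encoding is needed, which is what makes the middle-box processing lightweight.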


Table 5 - Tile heights and widths in units of 64x64 CTBs for lightweight bit-stream rewriting using tiles based ROI (experiment 3)

| Sequence       | Tile column widths | Tile row heights | ROI tile index (indexing starts with zero) |
|----------------|--------------------|------------------|--------------------------------------------|
| FourPeople     | 20                 | 3, 5, 4          | 1                                          |
| Johnny         | 4, 12, 4           | 12               | 1                                          |
| KristenAndSara | 2, 18              | 12               | 1                                          |

For this experiment, Class E sequences were used. The tile partitioning and the ROI tile index used in the experiments are listed in Table 5. The performance is measured using BD-rate. The anchor bit rates correspond to the sum of the bit rates for transmitting the full resolution Class E sequence and a cropped version of the Class E sequence corresponding to the ROI, each using a single tile per picture. The quantization parameters used for the experiment are 22, 27, 32 and 37. The anchor peak signal-to-noise ratio (PSNR) corresponds to the PSNR of the full resolution Class E sequence with one tile. For the test data, the bit rate of the full resolution Class E sequences with the tile configuration listed in Table 5 is used. The test PSNR also corresponds to the full resolution Class E sequence. The BD-rate measured using this set of anchor and test data is listed in Table 6. This BD-rate measure represents the bit rate savings achieved by a mechanism where only a single resolution bit-stream is transmitted to a network middle box capable of performing the lightweight rewriting process, versus transmitting two separate resolution bit-streams. Note that this BD-rate reflects the bit rate savings from the point of view of an end-point device which receives the full resolution Class E bit-stream, and represents average bandwidth savings of 43.9%, 28.5%, 21.1% and 23.0% for the main configuration of the All Intra, Random Access, Low Delay B and Low Delay P scenarios, respectively.

Table 6 - Lightweight stream splitting based on ROI using tiles (experiment 3)

All Intra Main:
| Sequence       | Y      | U      | V      |
|----------------|--------|--------|--------|
| FourPeople     | -39.4% | -39.9% | -39.7% |
| Johnny         | -46.5% | -46.9% | -47.0% |
| KristenAndSara | -45.7% | -46.1% | -46.1% |
| Overall        | -43.9% | -44.3% | -44.3% |

Random Access Main:
| Sequence       | Y      | U      | V      |
|----------------|--------|--------|--------|
| FourPeople     | -10.8% | -18.4% | -22.0% |
| Johnny         | -30.9% | -34.0% | -36.8% |
| KristenAndSara | -43.7% | -51.0% | -52.2% |
| Overall        | -28.5% | -34.5% | -37.0% |

Low Delay B Main:
| Sequence       | Y      | U      | V      |
|----------------|--------|--------|--------|
| FourPeople     | -6.1%  | -8.2%  | -9.2%  |
| Johnny         | -17.9% | -21.0% | -23.5% |
| KristenAndSara | -39.2% | -40.0% | -41.1% |
| Overall        | -21.1% | -23.0% | -24.6% |

Low Delay P Main:
| Sequence       | Y      | U      | V      |
|----------------|--------|--------|--------|
| FourPeople     | -8.4%  | -10.4% | -11.6% |
| Johnny         | -20.5% | -22.8% | -25.7% |
| KristenAndSara | -40.1% | -40.9% | -42.3% |
| Overall        | -23.0% | -24.7% | -26.5% |

VI. CONCLUSION

The tiles based design of HEVC provides multiple benefits for managing the computational complexity of video encoding and decoding. This is especially true for high resolution video data. By breaking dependencies within a picture, high-level parallelism for both the encoder and decoder can be achieved without the overhead of traditional slices. It was demonstrated that a tiles based parallelism approach results in an average luma bit rate saving of 2.2% to 5.5% over a slice based approach. It was also shown that the compact CTB scan pattern afforded by tiles can be used to improve the coding efficiency of MTU size matching within HEVC. Moreover, by altering the CTB scan pattern within an image, on-chip memory requirements are reduced and coding efficiency improvements can be achieved. Tiles can also be used to perform region-of-interest based lightweight bit-stream rewriting. The combination of high-level parallelism, resource reduction and coding efficiency makes tiles a very useful tool within the HEVC system.

ACKNOWLEDGEMENTS

The authors would like to thank the anonymous reviewers for their valuable comments and feedback, which were extremely helpful in improving the quality of the paper.

REFERENCES

[1] Advanced Video Coding for Generic Audiovisual Services, ITU-T Rec. H.264 | ISO/IEC 14496-10, May 2003; Version 2: May 2004; Version 3: Mar. 2005; Version 4: Sept. 2005; Version 5: June 2006; Version 7: Apr. 2007; Version 8 (with SVC extension): consented July 2007.
[2] T. Wiegand, G. J. Sullivan, G. Bjøntegaard, and A. Luthra, "Overview of the H.264/AVC video coding standard," IEEE Trans. Circuits Syst. Video Technol., vol. 13, no. 7, pp. 560-576, Jul. 2003.
[3] Joint Call for Proposals on Video Compression Technology, ITU-T SG16/Q.6 Doc. VCEG-AM91, Kyoto, Japan, 2010.
[4] ITU-T Rec. H.265 and ISO/IEC 23008-2: High Efficiency Video Coding, ITU-T and ISO/IEC, April 2013.
[5] G. J. Sullivan, J.-R. Ohm, W.-J. Han, and T. Wiegand, "Overview of the High Efficiency Video Coding (HEVC) Standard," IEEE Trans. Circuits Syst. Video Technol., vol. 22, no. 12, pp. 1649-1668, 2012.
[6] S. Wenger and M. Horowitz, "FMO: Flexible Macroblock Ordering," JVT-C089, May 2002.
[7] A. Fuldseth, M. Horowitz, S. Xu, A. Segall, and M. Zhou, "Tiles," JCTVC-F335, 6th Meeting of the Joint Collaborative Team on Video Coding (JCT-VC) of ITU-T SG16 WP3 and ISO/IEC JTC1/SC29/WG11, Torino, Italy, 2011.
[8] M. Zhou, V. Sze, and M. Budagavi, "Parallel tools in HEVC for high-throughput processing," Proc. SPIE 8499, Applications of Digital Image Processing XXXV, 849910, October 15, 2012.
[9] Chi Ching Chi, Mauricio Alvarez-Mesa, Ben Juurlink, Gordon Clare, Félix Henry, Stéphane Pateux, and Thomas Schierl, "Parallel Scalability and Efficiency of HEVC Parallelization Approaches," IEEE Trans. Circuits Syst. Video Technol., vol. 22, no. 12, pp. 1827-1838, 2012.
[10] K. Misra and A. Segall, "Parallel decoding with Tiles," JCTVC-F594, 6th JCT-VC Meeting, Torino, Italy, 2011.
[11] A. Fuldseth, "Replacing slices with tiles for high level parallelism," JCTVC-D227, 4th JCT-VC Meeting, Daegu, January 2011.
[12] M. Zhou, "Sub-picture based raster scanning coding order for HEVC UHD video coding," JCTVC-B062, 2nd JCT-VC Meeting, Geneva, July 2010.
[13] M. Horowitz and S. Xu, "Generalized slices," JCTVC-D378, 4th JCT-VC Meeting, Daegu, January 2011.
[14] High efficiency test model software SVN repository: https://hevc.hhi.fraunhofer.de/svn/svn_HEVCSoftware/tags/HM-9.2/
[15] F. Bossen, "Common HM test conditions and software reference configurations," JCTVC-I1100, 9th JCT-VC Meeting, Geneva, May 2012.
[16] G. Bjøntegaard, "Calculation of average PSNR differences between RD-curves," VCEG-M33, March 2001.
[17] M. Horowitz, S. Xu, E. S. Rye, and Y. Ye, "The effect of LCU size on coding efficiency in the context of MTU size matching," JCTVC-F596, 6th JCT-VC Meeting, Torino, Italy, 2011.
[18] Hendry, S. Jeong, S. W. Park, B. M. Jeon, K. Misra, and A. Segall, "AHG4: Harmonized method for signalling entry points of tiles and WPP substreams," JCTVC-H0556, 8th JCT-VC Meeting, San Jose, February 2012.
[19] Y.-K. Wang, A. Segall, M. Horowitz, Hendry, W. Wade, F. Henry, and T. Lee, "Text for tiles, WPP and entropy slices," JCTVC-H0737, 8th JCT-VC Meeting, San Jose, February 2012.
[20] Y. Wu, G. J. Sullivan, and Y. Zhang, "Motion-constrained tile sets SEI message," JCTVC-M0235, 13th JCT-VC Meeting, Incheon, KR, April 2013.

Kiran Misra (M'09) received his B.E. degree in Electronics Engineering from Mumbai University, India, in 1998. He received his M.S. and Ph.D. degrees in Electrical and Computer Engineering in 2002 and 2010 from Michigan State University (MSU), East Lansing. He has been a member of the IEEE since 2009. He was also the recipient of MSU's Graduate School Research Enhancement and Summer Program Fellowship in 2008. Dr. Misra joined Sharp Laboratories of America Inc. as a Post-doctoral Researcher in 2010, where he is currently a Senior Researcher in the video coding group. His research interests include video coding and image compression, network coding, joint source and channel code design, wireless networking, and stochastic modeling.

Andrew Segall (S'00-M'05) received the B.S. and M.S. degrees in electrical engineering from Oklahoma State University, Stillwater, in 1995 and 1997, respectively, and the Ph.D. degree in electrical engineering from Northwestern University, Evanston, IL, in 2002. He is currently a Manager at Sharp Laboratories of America, Camas, WA, where he leads groups performing research on video coding and video processing algorithms for next generation display devices. From 2002 to 2004, he was a Senior Engineer at Pixcise, Inc., Palo Alto, CA, where he developed scalable compression methods for high definition video. His research interests are in image and video processing and include video coding, super resolution and scale space theory.

Michael Horowitz received an A.B. degree with distinction in physics from Cornell University, Ithaca, NY, in 1986, an M.S. in electrical engineering from Columbia University, New York City, NY, in 1988 and a Ph.D. in electrical engineering from The University of Michigan, Ann Arbor, in 1998. He is Chief Technology Officer at eBrisk Video. Prior to eBrisk, he led the engineering team at Vidyo that developed the first commercially available H.264 SVC video codec. Earlier, at Polycom, he led the engineering team that developed the first commercially available in-product H.264/AVC video codec. Dr. Horowitz is Managing Partner at Applied Video Compression and is a member of the Technical Advisory Board of Vivox, Inc. Dr. Horowitz has served as chair for several ad hoc groups, including the ad hoc group on High-level Parallelism during the ITU-T | ISO/IEC Joint Collaborative Team on Video Coding's (JCT-VC) development of HEVC.

Shilin Xu received a B.E. degree in Communication Engineering in 2004 and a Ph.D. degree in Electrical and Information Engineering in 2009, both from Huazhong University of Science and Technology, Wuhan, China. He has been a research engineer at eBrisk Video since 2010 and actively participates in the standardization of HEVC. Prior to eBrisk, he was an assistant professor at Wuhan Institute of Technology, China, from 2009 to 2010.

Arild Fuldseth received his B.Sc. degree from the Norwegian Institute of Technology in 1988, his M.Sc. degree from North Carolina State University in 1989, and his Ph.D. degree from the Norwegian University of Science and Technology in 1997, all degrees in Signal Processing. From 1989 to 1994, he was a Research Scientist at SINTEF, Trondheim, Norway. From 1997 to 2002, he was a Manager of the signal processing group of Fast Search and Transfer, Oslo, Norway. Since 2002, he has been with Tandberg Telecom, Oslo, Norway (now part of Cisco Systems), where he is currently a Principal Engineer working with video compression technology.

Minhua Zhou received his B.E. degree in Electronic Engineering and M.E. degree in Communication & Electronic Systems from Shanghai Jiao Tong University, Shanghai, P.R. China, in 1987 and 1990, respectively. He received his Ph.D. degree in Electronic Engineering from Technical University Braunschweig, Germany, in 1997. He received the Rudolf-Urtel Prize in 1997 from the German Society for Film and Television Technologies in recognition of his Ph.D. thesis work on the optimization of MPEG-2 video encoding. From 1993 to 1998, he was a Researcher at Heinrich-Hertz-Institute (HHI) Berlin, Germany. Since 1998, he has been with Texas Instruments Inc., where he is currently a research manager of video coding technology. His research interests include video compression, video pre- and post-processing, end-to-end video quality, joint algorithm and architecture optimization, and 3D video.
