Programming Hive
   Reading #4

    @just_do_neet
Chapters 11 and 15

            •Chapter 11. ‘Other File Formats and
             Compression’

                •Choosing / Enabling / Action / HAR / etc...

            •Chapter 15. ‘Customizing Hive File and Record
             Formats’

                •Demystifying DML / File Formats / etc...

                •exclude "SerDe" related topics at this
                 presentation...


#11 Determining Installed Codecs

        $ hive -e "set io.compression.codecs"
        io.compression.codecs=
         org.apache.hadoop.io.compress.GzipCodec,
         org.apache.hadoop.io.compress.DefaultCodec,
         com.hadoop.compression.lzo.LzoCodec,
         org.apache.hadoop.io.compress.SnappyCodec




#11 Choosing a Compression Codec

            •Advantages :

                •less network I/O, less disk space.

            •Disadvantages :

                •CPU overhead.

            •In short : it's a trade-off.




#11 Choosing a Compression Codec

            •“why do we need different compression
             schemes?”

                •speed

                •minimizing size

                •whether the output is ‘splittable’ or not.




#11 Choosing a Compression Codec

            •“why do we need different compression
             schemes?”




                              http://comphadoop.weebly.com/




take a break : algorithm

            •lossless compression

                •LZ77(LZSS), LZ78, etc...

                     •DEFLATE (LZ77 with Huffman coding)

                     •LZH (LZ77 with Static Huffman coding)

                •BZIP2 (Burrows–Wheeler transform, Move-to-
                 Front, Huffman coding)

            •lossy compression

                •for JPEG, MPEG, etc. (snip.)
take a break : algorithm




                        http://www.slideshare.net/moaikids/ss-2638826



take a break : algorithm




                        http://www.slideshare.net/moaikids/ss-2638826



take a break : algorithm

            •Burrows–Wheeler Transform(BWT)

                •block sorting

            •bwt(“abracadabra$”) = “ard$rcaaaabb”
                  rotation       sorted         last first input
                  abracadabra$   $abracadabra   a    $     a
                  bracadabra$a   a$abracadabr   r    a     b
                  racadabra$ab   abra$abracad   d    a     r
                  acadabra$abr   abracadabra$   $    a     a
                  cadabra$abra   acadabra$abr   r    a     c
                  adabra$abrac   adabra$abrac   c    a     a
                  dabra$abraca   bra$abracada   a    b     d
                  abra$abracad   bracadabra$a   a    b     a
                  bra$abracada   cadabra$abra   a    c     b
                  ra$abracadab   dabra$abraca   a    d     r
                  a$abracadabr   ra$abracadab   b    r     a
                  $abracadabra   racadabra$ab   b    r     $



take a break : algorithm

            •BWT with Suffix Array

                •ref. http://d.hatena.ne.jp/naoya/20081016/1224173077

                •ref. http://hillbig.cocolog-nifty.com/do/files/2005-12-compInd.ppt




take a break : algorithm

            •LZO

                •“Compression is comparable in speed to
                 DEFLATE compression.”

                •“Very fast decompression”
                • http://www.oberhumer.com/opensource/lzo/




take a break : algorithm

            •Google Snappy

                •“very high speeds and reasonable
                 compression”
                • https://code.google.com/p/snappy/


            •ref. http://www.slideshare.net/KeigoMachinaga/snappy-servay-8665889




take a break : algorithm

            •LZ4

                •“very fast lossless compression algorithm”
                • https://code.google.com/p/lz4/


            •ref. http://www.slideshare.net/komiyaatsushi/dsirnlp-3-lz4




take a break : algorithm

            •“Add support for LZ4 compression”

                •fix versions : 0.23.1, 0.24.0 (CDH4)

                •ref. https://issues.apache.org/jira/browse/HADOOP-7657




take a break : implementing a codec

  public class HogeCodec implements CompressionCodec {

    // bufferSize and compressionOverhead are fields initialized elsewhere (elided)
    @Override
    public CompressionOutputStream createOutputStream(OutputStream out,
                                                      Compressor compressor)
        throws IOException {
      return new BlockCompressorStream(out, compressor, bufferSize,
          compressionOverhead);
    }

    @Override
    public Class<? extends Compressor> getCompressorType() {
      return HogeCompressor.class;
    }

    @Override
    public CompressionOutputStream createOutputStream(OutputStream out)
        throws IOException {
      return createOutputStream(out, createCompressor());
    }

    @Override
    public Compressor createCompressor() {
      return new HogeCompressor();
    }

    @Override
    public CompressionInputStream createInputStream(InputStream in)
        throws IOException {
      return createInputStream(in, createDecompressor());
    }
    ............

  ref. http://hadoop.apache.org/docs/current/api/org/apache/hadoop/io/compress/CompressionCodec.html

#11 Enabling Compression

            •Intermediate Compression(hive, mapred)

            •Final Output Compression(hive, mapred)




#11 Enabling Compression

            •Intermediate Compression(hive, mapred)

                •setting the enable flag (example below)
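
                A minimal sketch, assuming a Hive CLI session; the property keys
                below are the MRv1-era names and may differ in newer Hadoop/Hive
                versions:

                  -- Hive : compress intermediate data between the stages of a query
                  SET hive.exec.compress.intermediate=true;
                  -- Hadoop (MRv1) : compress map output within a job
                  SET mapred.compress.map.output=true;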




#11 Enabling Compression

            •Intermediate Compression(hive, mapred)

                •setting the codec (example below)
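
                A minimal sketch of picking the intermediate codec; SnappyCodec is
                just one choice from the installed codec list shown earlier:

                  SET mapred.map.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;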




#11 Enabling Compression

            •Final Output Compression(hive, mapred)

                •setting the enable flag (example below)
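
                A minimal sketch; again, the Hadoop key is the MRv1-era name:

                  -- Hive : compress the final output of a query
                  SET hive.exec.compress.output=true;
                  -- Hadoop (MRv1) : compress job output in general
                  SET mapred.output.compress=true;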




#11 Enabling Compression

            •Final Output Compression(hive, mapred)

                •setting the codec (example below)
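
                A minimal sketch, with GzipCodec as an example final-output codec:

                  SET mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;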




#11 Sequence File

            •Sequence File Format


                • Header
                • Record
                     • Record length
                     • Key length
                     • Key
                     • Value
                • A sync-marker every few 100 bytes or so.
                  http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/io/
                  SequenceFile.html




#11 Sequence File

            •Compression Type

                •NONE : no compression

                •RECORD : compresses each record

                •BLOCK : compresses blocks of records (example below)
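
                A minimal sketch of writing a block-compressed sequence file from
                Hive (table and column names are hypothetical):

                  SET hive.exec.compress.output=true;
                  SET mapred.output.compression.type=BLOCK;
                  SET mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;
                  CREATE TABLE hoge_seq (key STRING, value STRING) STORED AS SEQUENCEFILE;
                  INSERT OVERWRITE TABLE hoge_seq SELECT key, value FROM hoge;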




#11 Compression in Action

            •(DEMO)
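
            The demo itself isn't captured in the deck; a minimal sketch of the kind
            of session chapter 11 walks through (table names and warehouse path are
            assumptions):

                  CREATE TABLE hoge_gz (key STRING, value STRING);
                  SET hive.exec.compress.output=true;
                  SET mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;
                  INSERT OVERWRITE TABLE hoge_gz SELECT key, value FROM hoge;
                  -- the output files now carry the codec's extension (.gz)
                  dfs -ls /user/hive/warehouse/hoge_gz;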




#11 Archive Partition

            •Using ‘HAR’

                •ref. http://hadoop.apache.org/docs/r1.0.4/hadoop_archives.html

            •Archiving
              SET hive.archive.enabled=true;
              ALTER TABLE hoge ARCHIVE PARTITION (folder='fuga');


            •Unarchiving
              ALTER TABLE hoge UNARCHIVE PARTITION (folder='fuga');




Break :)
#15 Record Format

            •TEXTFILE

            •SEQUENCEFILE

            •RCFILE

              CREATE TABLE hoge (
              ........
              )
              STORED AS [TEXTFILE|SEQUENCEFILE|RCFILE];




#15 Record Format

            •RCFile(Record Columnar File)

                •fast data loading

                •fast query processing

                •highly efficient storage space utilization

                •a strong adaptivity to dynamic data access
                 patterns.

            •ref. "A Fast and Space-efficient Data Placement Structure in
              MapReduce-based Warehouse Systems (ICDE’11)"
              http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/
              TR-11-4.pdf
#15 Record Format

            •RCFile Format
                •1 file = one or more Row Groups

                •1 HDFS block = one or more Row Groups

                •Row Group
                     •a sync marker
                     •metadata header
                     •table data

                •the ‘metadata header’ section is compressed
                 with the RLE algorithm.
#15 Record Format

            •Implementation of RCFile (see the DDL sketch below)

                •Input Format

                     •o.a.h.h.ql.io.RCFileInputFormat

                •Output Format

                     •o.a.h.h.ql.io.RCFileOutputFormat

                •SerDe

                     •o.a.h.h.serde2.columnar.ColumnarSerDe
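
                ‘o.a.h.h’ abbreviates org.apache.hadoop.hive. STORED AS RCFILE is
                shorthand for spelling these classes out; a sketch with a
                hypothetical column list:

                  CREATE TABLE hoge_rc (key STRING, value STRING)
                  ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.columnar.ColumnarSerDe'
                  STORED AS
                    INPUTFORMAT 'org.apache.hadoop.hive.ql.io.RCFileInputFormat'
                    OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.RCFileOutputFormat';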

#15 Record Format

            •Tuning of RCFile

                •“hive.io.rcfile.record.buffer.size”

                     •defines the Row Group buffer size (default: 4 MB; example below)
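
                A minimal sketch; 8388608 bytes (8 MB) is just an illustrative value:

                  SET hive.io.rcfile.record.buffer.size=8388608;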




#15 Record Format

            •ref. “HDFS and Hive storage - comparing file
             formats and compression methods”
                • http://www.adaltas.com/blog/2012/03/13/hdfs-hive-storage-format-
                  compression/


            •"In term of file size, the “RCFILE” format with
             the “default” and “gz” compression achieve the
             best results."

            •"In term of speed, the “RCFILE” formats with the
             “lzo” and “snappy” are very fast while preserving
             a high compression rate."

#Appendix - trevni

            •ref. https://github.com/cutting/trevni/

            •ref. http://avro.apache.org/docs/current/trevni/spec.html




#Appendix - trevni

            •File layout (diagram in the original deck):

                •file = file header, then one column per column of data

                •file header = magic, number of rows, number of columns,
                 file metadata, per-column metadata (name, type, codec, etc.),
                 column start positions

                •column = number of blocks, then the blocks

                •block = block descriptor, then rows

                •block descriptor = number of rows, uncompressed bytes,
                 compressed bytes

#Appendix - ORCFile


            •ref. http://hortonworks.com/blog/100x-faster-hive/


            •ref. https://issues.apache.org/jira/browse/HIVE-3874


            •ref. https://issues.apache.org/jira/secure/attachment/12564124/OrcFileIntro.pptx




#Appendix - ORCFile


            •ref. data size




#Appendix - ORCFile


            •ref. comparison




#Appendix - Column-Oriented Storage


            •ref. http://arxiv.org/pdf/1105.4252.pdf




#Appendix - more information




          http://scholar.google.co.jp/scholar?hl=ja&q=hdfs+columnar&btnG=&lr=

Thanks for listening :)
