CH 8 and CH 9: MapReduce Types, Formats and Features
Partition function - determines which reducer each intermediate key-value pair is sent to; conceptually, partition: (key, value, number of partitions) -> partition index
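For reference, a sketch of the default partitioner (HashPartitioner), reproduced from memory of the Hadoop library, so treat details as approximate:

    import org.apache.hadoop.mapreduce.Partitioner;

    // The default partition function: hash the key, mask off the sign bit,
    // and take the result modulo the number of reducers.
    public class HashPartitioner<K, V> extends Partitioner<K, V> {
      @Override
      public int getPartition(K key, V value, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
      }
    }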
Each map processes a single split, which is divided into records (key-value pairs) that are individually processed by the map
InputFormat - responsible for creating input splits and dividing them into records, so you will not usually deal directly with the InputSplit class
A sequence file can be used to merge small files into larger files to avoid a large number of small files
Preventing splitting - you might want to prevent splitting if you want a single mapper to process each input file as an entire file
1. Increase the minimum split size to be larger than the largest file in the system
2. Subclass a concrete subclass of FileInputFormat and override the isSplitable() method to return false (see the sketch below)
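A minimal sketch of option 2, assuming plain text input (the class name NonSplittableTextInputFormat is ours, not Hadoop's):

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.JobContext;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

    public class NonSplittableTextInputFormat extends TextInputFormat {
      @Override
      protected boolean isSplitable(JobContext context, Path file) {
        return false; // each file becomes exactly one split, hence one mapper
      }
    }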
RecordReader - delivers the file contents as the value of the record; a custom InputFormat must implement createRecordReader() to supply its own RecordReader implementation, as in WholeFileInputFormat (sketched below)
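A sketch modeled on the book's WholeFileInputFormat example; the WholeFileRecordReader it returns (not shown) is assumed to read the entire file into a single BytesWritable value:

    import java.io.IOException;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.mapreduce.InputSplit;
    import org.apache.hadoop.mapreduce.JobContext;
    import org.apache.hadoop.mapreduce.RecordReader;
    import org.apache.hadoop.mapreduce.TaskAttemptContext;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

    public class WholeFileInputFormat
        extends FileInputFormat<NullWritable, BytesWritable> {

      @Override
      protected boolean isSplitable(JobContext context, Path file) {
        return false; // the whole file is one record, so never split it
      }

      @Override
      public RecordReader<NullWritable, BytesWritable> createRecordReader(
          InputSplit split, TaskAttemptContext context)
          throws IOException, InterruptedException {
        // Custom reader that delivers the whole file as the record's value
        WholeFileRecordReader reader = new WholeFileRecordReader();
        reader.initialize(split, context);
        return reader;
      }
    }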
Input Formats - File Input
FileInputFormat - the base class for all implementations of InputFormat that use a file as the source for data
Provides a place to define what files are included as input to a job and an implementation for generating splits for the input files
CombineFileInputFormat - Java class designed to work well with small files in Hadoop
Each split will contain many of the small files so that each mapper has more to process
Takes node and rack locality into account when deciding what blocks to place into the same split
WholeFileInputFormat - defines a format where the keys are not used and the values are the file contents
Text Input:
TextInputFormat - the default InputFormat. Key - byte offset within the file of the beginning of the line; Value - the contents of the line, not including any line terminators, packaged as a Text object
KeyValueTextInputFormat - used to interpret files in which each line is a key-value pair separated by a delimiter, such as the output of TextOutputFormat (the default output format)
NLineInputFormat - used when the mappers need to receive a fixed number of lines of input
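A hypothetical driver fragment showing how these text input formats might be configured, assuming a Job named job (the separator property name is from Hadoop 2, and the line count is arbitrary):

    import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;

    // KeyValueTextInputFormat: each line is split into key and value at
    // the first occurrence of the separator (tab by default)
    job.getConfiguration().set(
        "mapreduce.input.keyvaluelinerecordreader.key.value.separator", ",");
    job.setInputFormatClass(KeyValueTextInputFormat.class);

    // Alternatively, NLineInputFormat: each mapper gets exactly N lines
    // NLineInputFormat.setNumLinesPerSplit(job, 1000);
    // job.setInputFormatClass(NLineInputFormat.class);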
Binary Input:
SequenceFileInputFormat - reads sequence files, which store sequences of binary key-value pairs
SequenceFileAsBinaryInputFormat - retrieves the sequence file’s keys and values as binary objects
FixedLengthInputFormat - reading fixed-width binary records from a file where the records are not separated by delimiters
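A hedged fragment, assuming a Job named job (the 64-byte record length is arbitrary):

    import org.apache.hadoop.mapreduce.lib.input.FixedLengthInputFormat;

    // Every record is exactly 64 bytes; no delimiters between records
    FixedLengthInputFormat.setRecordLength(job.getConfiguration(), 64);
    job.setInputFormatClass(FixedLengthInputFormat.class);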
Multiple Inputs:
By default, all input is interpreted by a single InputFormat and a single Mapper
MultipleInputs - allows programmer to specify which InputFormat and Mapper to use on a per-path basis
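A sketch along the lines of the book's weather example; the paths and mapper class names here are hypothetical placeholders (both mappers must emit the same output key/value types):

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

    // Two datasets in different layouts feed the same job; each path
    // gets its own Mapper, and the reducer sees one merged key space.
    MultipleInputs.addInputPath(job, new Path("/ncdc/input"),
        TextInputFormat.class, MaxTemperatureMapper.class);
    MultipleInputs.addInputPath(job, new Path("/metoffice/input"),
        TextInputFormat.class, MetOfficeMaxTemperatureMapper.class);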
Database Input/Output:
DBInputFormat - input format for reading data from a relational database
Binary Output:
SequenceFileAsBinaryOutputFormat - writes keys and values in binary format into a sequence file container
Multiple Outputs:
MultipleOutputs - allows the programmer to write data to multiple output files whose names are derived from the output keys and values
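A reducer sketch; the key/value types and the use of the key as the file name are illustrative, and the class name PartitionByKeyReducer is ours:

    import java.io.IOException;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

    public class PartitionByKeyReducer
        extends Reducer<Text, Text, NullWritable, Text> {

      private MultipleOutputs<NullWritable, Text> multipleOutputs;

      @Override
      protected void setup(Context context) {
        multipleOutputs = new MultipleOutputs<>(context);
      }

      @Override
      protected void reduce(Text key, Iterable<Text> values, Context context)
          throws IOException, InterruptedException {
        for (Text value : values) {
          // Third argument is the base output path: one file per distinct key
          multipleOutputs.write(NullWritable.get(), value, key.toString());
        }
      }

      @Override
      protected void cleanup(Context context)
          throws IOException, InterruptedException {
        multipleOutputs.close(); // flush and close all the extra outputs
      }
    }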
Lazy Output: LazyOutputFormat - a wrapper output format that ensures the output file is created only when the first record is emitted for a given partition
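Enabling it is a one-line change in the driver (assuming a Job named job):

    import org.apache.hadoop.mapreduce.lib.output.LazyOutputFormat;
    import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

    // Wrap the real output format so empty part files are never created
    LazyOutputFormat.setOutputFormatClass(job, TextOutputFormat.class);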
Counters
Useful for gathering statistics about a job, for quality control, and for problem diagnosis
Task Counters - gather information about tasks as they are executed; results are aggregated over all the tasks in a job
Maintained by each task attempt and sent to the application master on a regular basis to be globally aggregated
Job Counters - measure job-level statistics and are maintained by the application master, so they do not need to be sent across the network
Dynamic counters (not defined by a Java enum) can be created by the user at runtime
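A mapper/reducer fragment showing both styles; the Temperature enum and the qualityCode variable are illustrative:

    // Given an enum defined alongside the job class:
    // enum Temperature { MISSING, MALFORMED }

    // Enum counter (statically defined):
    context.getCounter(Temperature.MISSING).increment(1);

    // Dynamic counter: group and counter names are arbitrary strings
    context.getCounter("TemperatureQuality", qualityCode).increment(1);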
Sorting
Partial Sort - does not produce a globally sorted output file
Total Sort - produces a globally sorted output file
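Ex: a total sort sketch following the book's approach - sample the input to pick partition boundaries, then use TotalOrderPartitioner so each partition's keys are strictly less than the next partition's (key/value types here are illustrative):

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.lib.partition.InputSampler;
    import org.apache.hadoop.mapreduce.lib.partition.TotalOrderPartitioner;

    job.setPartitionerClass(TotalOrderPartitioner.class);

    // Sample ~10% of records (up to 10,000 samples from at most 10 splits)
    // to estimate the key distribution and write the partition boundaries
    InputSampler.Sampler<IntWritable, Text> sampler =
        new InputSampler.RandomSampler<>(0.1, 10000, 10);
    InputSampler.writePartitionFile(job, sampler);
    // The partition file is then shared with all tasks (e.g., via the
    // distributed cache) so every mapper partitions consistently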
Joins - Map-Side vs Reduce-Side
Map-Side Join - the join is performed before the data reaches the map function; requires the inputs to be sorted and partitioned in the same way
Reduce-Side Join - more general but less efficient; the mapper tags each record with its source and uses the join key as the map output key, so records to be joined meet in the same reducer
Side Data Distribution - side data is extra read-only data needed by a job to process the main dataset; the main challenge is to make it available to all the map or reduce tasks (which are spread across the cluster) in a way that is convenient and efficient
Configuration - the setter methods of the job's Configuration object can be used to set key-value pairs of (small amounts of) side data in the job configuration
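A minimal sketch (the property name myjob.side.message is invented for illustration):

    // Driver: stash a small piece of side data in the job configuration
    job.getConfiguration().set("myjob.side.message", "value computed by driver");

    // Mapper or Reducer setup(): read it back on the task side
    String msg = context.getConfiguration().get("myjob.side.message");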
Distributed Cache
Instead of serializing side data in the job config, it is preferred to distribute the datasets using Hadoop’s distributed cache
Provides a service for copying files and archives to the task nodes in time for the tasks to use them when they run
Files - regular read-only files copied to the task node
Archives - ZIP files, tar files, and gzipped tar files, which are unarchived on the task node
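A sketch of the file case, assuming a Job named job (the HDFS path and symlink name are hypothetical):

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.net.URI;

    // Driver: ship a lookup file to every task node; the "#lookup" fragment
    // makes it appear as a symlink named "lookup" in the task's working dir
    job.addCacheFile(new URI("/data/stations.txt#lookup"));

    // Task setup(): open the localized copy by its symlink name
    BufferedReader in = new BufferedReader(new FileReader("lookup"));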
MapReduce Library Classes
Mappers/Reducers for commonly used functions, e.g., InverseMapper, TokenCounterMapper, RegexMapper, IntSumReducer, and LongSumReducer
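For example, an entire word count can plausibly be wired from library classes alone, which pairs with the video below (assumes a Job named job):

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.lib.map.TokenCounterMapper;
    import org.apache.hadoop.mapreduce.lib.reduce.IntSumReducer;

    job.setMapperClass(TokenCounterMapper.class); // emits (token, 1) per word
    job.setCombinerClass(IntSumReducer.class);    // local pre-aggregation
    job.setReducerClass(IntSumReducer.class);     // sums the 1s per word
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);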
Video - Example MapReduce WordCount: https://youtu.be/aelDuboaTqA