MapReduce
WHAT IS MAPREDUCE?
Input Splits: The input to a MapReduce job is divided into fixed-size
pieces called input splits. An input split is the chunk of the input that is
consumed by a single map task.
Mapping: This is the first phase in the execution of a MapReduce program. In
this phase, the data in each split is passed to a mapping function to produce
output values. In our example, the job of the mapping phase is to count the
number of occurrences of each word in its input split (input splits are
discussed in more detail below) and prepare a list in the form of <word, frequency>.
Shuffling: This phase consumes the output of the Mapping phase. Its task is to
consolidate related records from the Mapping phase output. In our example,
identical words are grouped together along with their respective frequencies.
Reducing: In this phase, output values from the Shuffling phase are aggregated.
This phase combines the values from the Shuffling phase and returns a single
output value per key. In short, this phase summarizes the complete dataset.
STEPS:
One map task is created for each split, which then executes the map
function for each record in the split.
It is beneficial to have multiple splits, because the time taken to
process a single split is small compared to the time taken to process
the whole input. With smaller splits, the processing is better
load-balanced, since the splits are processed in parallel.
However, it is also not desirable to have splits that are too small.
When splits are too small, the overhead of managing the splits and
creating map tasks begins to dominate the total job execution
time.
For most jobs, it is better to make the split size equal to the size of
an HDFS block (64 MB by default in Hadoop 1; 128 MB in Hadoop 2).
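The block and split sizes above are configurable. As an illustration, the following configuration fragment uses the Hadoop 2.x property names (earlier Hadoop versions used different names, e.g. dfs.block.size):

```xml
<!-- hdfs-site.xml: HDFS block size, here 128 MB -->
<property>
  <name>dfs.blocksize</name>
  <value>134217728</value>
</property>

<!-- mapred-site.xml: cap the input split size at one block -->
<property>
  <name>mapreduce.input.fileinputformat.split.maxsize</name>
  <value>134217728</value>
</property>
```

Keeping the split size equal to the block size means each map task reads its entire split from a single block, usually on the local node.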
STEPS:
Hadoop divides the job into tasks. There are two types of tasks:
Map tasks (Splits & Mapping)
Reduce tasks (Shuffling, Reducing)
The complete execution process (execution of both Map and Reduce tasks)
is controlled by two types of entities:
Jobtracker: acts as the master (responsible for complete execution of the
submitted job)
Multiple Task Trackers: act as slaves, each of them performing part of the
job.
For every job submitted for execution in the system, there is one
Jobtracker, which resides on the Namenode, and there are multiple
Tasktrackers, which reside on Datanodes.
WORKING
A job is divided into multiple tasks, which are then run on multiple
data nodes in a cluster.
It is the responsibility of the job tracker to coordinate this activity by
scheduling tasks to run on different data nodes.
Execution of an individual task is then looked after by a task tracker,
which resides on every data node executing part of the job.
The task tracker's responsibility is to send progress reports to the job
tracker.
In addition, the task tracker periodically sends a 'heartbeat' signal to the
Jobtracker so as to notify it of the current state of the system.
Thus the job tracker keeps track of the overall progress of each job. In the
event of task failure, the job tracker can reschedule the task on a different
task tracker.
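The rescheduling behaviour described above can be sketched as a toy simulation. This is our own simplified model for illustration, not Hadoop's actual scheduler or API:

```python
def run_job(tasks, trackers, run_task):
    """Assign each task to a tracker; on failure, reschedule it on the next one."""
    completed = {}
    for task in tasks:
        for tracker in trackers:
            if run_task(tracker, task):   # tracker reports success back
                completed[task] = tracker
                break                     # task done, move to the next one
        else:
            # every tracker failed this task
            raise RuntimeError(f"task {task} failed on every tracker")
    return completed

# Simulate 'tracker-1' failing every task: the job tracker falls back
# to 'tracker-2', and the job still completes.
flaky = lambda tracker, task: tracker != "tracker-1"
result = run_job(["t1", "t2"], ["tracker-1", "tracker-2"], flaky)
print(result)  # {'t1': 'tracker-2', 't2': 'tracker-2'}
```

The key point mirrored here is that a task failure is not a job failure: the master simply picks another worker.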
MAPREDUCE EXAMPLE: WORD COUNT
Example: Counting the frequency of words in a document:
1. The Map function generates key-value pairs where the keys are
words and the values are 1.
2. The Shuffle and Sort step groups identical words together.
3. The Reduce function sums the values to get the word count.
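The three steps above can be sketched in pure Python. The function names are our own, chosen for illustration; a real Hadoop job would implement the mapper and reducer against the Hadoop API or Hadoop Streaming:

```python
from collections import defaultdict

def map_phase(document):
    # Step 1: emit a <word, 1> pair for every word in the input.
    return [(word, 1) for word in document.split()]

def shuffle_phase(pairs):
    # Step 2: group the values of identical words together.
    grouped = defaultdict(list)
    for word, count in pairs:
        grouped[word].append(count)
    return grouped

def reduce_phase(grouped):
    # Step 3: sum the values to get the count for each word.
    return {word: sum(counts) for word, counts in grouped.items()}

text = "the quick brown fox jumps over the lazy dog the end"
counts = reduce_phase(shuffle_phase(map_phase(text)))
print(counts["the"])  # 3
```

In a real cluster, map_phase runs in parallel across input splits and reduce_phase runs in parallel across groups of keys; the single-process pipeline here only shows the data flow.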
BENEFITS:
1. Scalability
2. Fault tolerance
3. Simplicity in handling large datasets