MAPREDUCE

WHAT IS MAPREDUCE?

MapReduce is a software framework used for processing vast data sets in a distributed computing environment. It is composed of two key phases: Map and Reduce. Map tasks deal with splitting and mapping the data, while Reduce tasks shuffle and reduce it.
COMPONENTS OF MAPREDUCE

 Map Function: Processes the input data and generates key-value pairs.
 Shuffle and Sort: Organizes the key-value pairs so that all values for the same key are sent to the same reducer.
 Reduce Function: Aggregates the output by performing operations such as sum or count on the grouped key-value pairs.
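
Conceptually, the two user-supplied functions can be summarized with the signatures from the original MapReduce model (this notation is not part of the slides; the concrete key and value types depend on the job):

    map    (k1, v1)        ->  list(k2, v2)
    reduce (k2, list(v2))  ->  list(v2)

In the word-count example used later in this document, k1 is a byte offset into the input, v1 is a line of text, k2 is a word, and v2 is an integer count.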
MAPREDUCE ARCHITECTURE

Input Splits: The input to a MapReduce job is divided into fixed-size pieces called input splits. An input split is the chunk of the input that is consumed by a single map task.
Mapping: This is the very first phase in the execution of a MapReduce program. In this phase, the data in each split is passed to a mapping function to produce output values. In our example, the job of the mapping phase is to count the number of occurrences of each word in its input split and prepare a list in the form of <word, frequency> pairs (more details about input splits are given below).
Shuffling: This phase consumes the output of the Mapping phase. Its task is to consolidate the relevant records from the Mapping phase output. In our example, identical words are clubbed together along with their respective frequencies.
Reducing: In this phase, the output values from the Shuffling phase are aggregated. It combines the values from the Shuffling phase and returns a single output value per key. In short, this phase summarizes the complete dataset.
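
To make these phases concrete, here is a small worked trace of the word-count flow on a hypothetical two-line input (the split boundaries and key ordering shown are illustrative only):

    Input splits:  "deer bear river"               |  "car car river"
    Mapping:       <deer,1> <bear,1> <river,1>     |  <car,1> <car,1> <river,1>
    Shuffling:     <bear,[1]>  <car,[1,1]>  <deer,[1]>  <river,[1,1]>
    Reducing:      <bear,1>  <car,2>  <deer,1>  <river,2>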
STEPS:

One map task is created for each split, and it then executes the map function for each record in the split.
It is always beneficial to have multiple splits, because the time taken to process a split is small compared to the time taken to process the whole input. When the splits are smaller, the processing is better load-balanced, since we process the splits in parallel.
However, it is also not desirable to have splits that are too small. When splits are too small, the overhead of managing the splits and of creating map tasks begins to dominate the total job execution time.
For most jobs, it is better to make the split size equal to the size of an HDFS block (64 MB by default in Hadoop 1; 128 MB in Hadoop 2 and later).
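
As an illustrative sketch (not part of the original slides), the split size can be influenced from the job driver using the Hadoop new-API FileInputFormat helpers; the 128 MB value and class name below are assumptions for demonstration, and in practice the cluster's HDFS block size is the usual target:

    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

    public class SplitSizeExample {
        public static void main(String[] args) throws Exception {
            // Assumed block size of 128 MB, purely for illustration.
            long blockSize = 128L * 1024 * 1024;

            Job job = Job.getInstance();
            // Lower and upper bounds used by FileInputFormat when it
            // computes input splits for this job; pinning both to the
            // block size keeps roughly one map task per HDFS block.
            FileInputFormat.setMinInputSplitSize(job, blockSize);
            FileInputFormat.setMaxInputSplitSize(job, blockSize);
        }
    }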
STEPS:

Executing a map task results in output being written to a local disk on the respective node, not to HDFS.
The reason for choosing the local disk over HDFS is to avoid the replication that takes place with an HDFS store operation.
Map output is intermediate output: it is processed by reduce tasks to produce the final output.
Once the job is complete, the map output can be thrown away, so storing it in HDFS with replication would be overkill.
If a node fails before its map output has been consumed by the reduce task, Hadoop reruns the map task on another node and re-creates the map output.
STEPS:

Reduce tasks do not rely on data locality. The output of every map task is fed to the reduce task, so map output is transferred to the machine where the reduce task is running.
On this machine, the output is merged and then passed to the
user-defined reduce function.
Unlike the map output, the reduce output is stored in HDFS (the first replica is stored on the local node and the other replicas are stored on off-rack nodes). So, writing the reduce output does consume network bandwidth, but only as much as a normal HDFS write pipeline.
WORKING

Hadoop divides the job into tasks. There are two types of tasks:
 Map tasks (Splits & Mapping)
 Reduce tasks (Shuffling, Reducing)
The complete execution process (of both Map and Reduce tasks) is controlled by two types of entities:
 JobTracker: Acts as the master (responsible for complete execution of the submitted job)
 Multiple TaskTrackers: Act as slaves, each of them performing a part of the job.
For every job submitted for execution in the system, there is one JobTracker, which resides on the NameNode, and there are multiple TaskTrackers, which reside on the DataNodes.
WORKING

 A job is divided into multiple tasks, which are then run on multiple data nodes in the cluster.
 It is the responsibility of the job tracker to coordinate this activity by scheduling tasks to run on different data nodes.
 Execution of each individual task is then looked after by a task tracker, which resides on every data node executing part of the job.
 The task tracker's responsibility is to send progress reports to the job tracker.
 In addition, the task tracker periodically sends a 'heartbeat' signal to the JobTracker to notify it of the current state of the system.
 Thus, the job tracker keeps track of the overall progress of each job. In the event of a task failure, the job tracker can reschedule it on a different task tracker.
MAPREDUCE EXAMPLE: WORD COUNT
 Example: Counting the frequency of words in a document:
 1. The Map function generates key-value pairs where each key is a word and each value is 1.
 2. The Shuffle and Sort step groups identical words together.
 3. The Reduce function sums the values for each word to get the word count.
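
The following is a minimal, self-contained sketch of this word-count job using the Hadoop new (org.apache.hadoop.mapreduce) API. The class and path names are illustrative; the slides do not prescribe a particular implementation.

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

      // Map phase: for every word in the input line, emit <word, 1>.
      public static class TokenizerMapper
          extends Mapper<Object, Text, Text, IntWritable> {

        private final static IntWritable one = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
          StringTokenizer itr = new StringTokenizer(value.toString());
          while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, one);
          }
        }
      }

      // Reduce phase: sum the 1s for each word to get its frequency.
      public static class IntSumReducer
          extends Reducer<Text, IntWritable, Text, IntWritable> {

        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable val : values) {
            sum += val.get();
          }
          result.set(sum);
          context.write(key, result);
        }
      }

      // Driver: wires the mapper, combiner and reducer together and submits the job.
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);   // local aggregation before the shuffle
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory must not exist yet
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }

A hypothetical invocation would be: hadoop jar wordcount.jar WordCount /input /output (the jar name and HDFS paths are placeholders).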
BENEFITS:

 1. Scalability
 2. Fault tolerance
 3. Simplicity in handling large datasets
