
MapReduce Working and Advantages
MapReduce
MapReduce is a programming framework that allows us to perform
distributed and parallel processing on large data sets. It is built
around the following three classes.

Mapper Class
• The first stage in data processing using MapReduce is
the Mapper Class. Here, the RecordReader processes each input
record and generates the corresponding key-value pair. Hadoop
saves this intermediate mapper output on the local disk. A minimal
Mapper sketch follows this list.
• Input Split
It is the logical representation of the input data: a block of
work that is processed by a single map task in the MapReduce
program.
• RecordReader
It interacts with the Input Split and converts the obtained data
into key-value pairs.
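As an illustration, here is a minimal Mapper for the word-count job
walked through below, adapted from the standard Apache Hadoop
WordCount example. The class name TokenizerMapper and the fields one
and word are illustrative choices, not from these slides.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Emits the key-value pair (word, 1) for every token in the input record.
public class TokenizerMapper
        extends Mapper<Object, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        // The RecordReader has already turned the raw input into
        // (offset, line) pairs; we tokenize the line here.
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, one); // hardcoded value 1 per word
        }
    }
}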
Reducer Class
The intermediate output generated by the mapper is fed to
the reducer, which processes it and generates the final output,
which is then saved in HDFS. A matching Reducer sketch follows
below.
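A matching Reducer sketch, again following the standard Hadoop
WordCount example; the class name IntSumReducer is an illustrative
choice.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Receives (word, [1, 1, ...]) after shuffle and sort, sums the list,
// and writes the final (word, total) pair, e.g. (Bear, 2).
public class IntSumReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {

    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        result.set(sum);
        context.write(key, result); // written to HDFS by the output format
    }
}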
Driver Class
The major component in a MapReduce job is the Driver Class. It
is responsible for setting up a MapReduce job to run in
Hadoop. Here we specify the names of the Mapper and Reducer
classes, along with the input/output data types and the job name.
A driver sketch follows below.
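A minimal Driver sketch that wires the two classes above into a
runnable job, again modeled on the standard Hadoop WordCount example;
the job name "word count" and the command-line paths are placeholders.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Configures the MapReduce job and submits it to the Hadoop cluster.
public class WordCount {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count"); // job name
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);   // Mapper class
        job.setReducerClass(IntSumReducer.class);    // Reducer class
        job.setOutputKeyClass(Text.class);           // output key type
        job.setOutputValueClass(IntWritable.class);  // output value type
        FileInputFormat.addInputPath(job, new Path(args[0]));   // input in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output in HDFS
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Assuming the classes are packaged into a jar named wordcount.jar (a
hypothetical name) and the input already sits in HDFS, the job could
be launched with: hadoop jar wordcount.jar WordCount /input/path /output/path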
• First, we divide the input into three splits. This will distribute
the work among all the map nodes.
• Then, we tokenize the words in each of the mappers and give
a hardcoded value (1) to each of the tokens or words. The
rationale behind giving a hardcoded value equal to 1 is that
every word, in itself, will occur once.
• Now, a list of key-value pairs is created, where each key is an
individual word and each value is one. So, for the
first line (Dear Bear River) we have 3 key-value pairs: Dear, 1;
Bear, 1; River, 1. The mapping process remains the same on all
the nodes.
• After the mapper phase, a partition process takes place where
sorting and shuffling happen so that all the tuples with the
same key are sent to the corresponding reducer.
• So, after the sorting and shuffling phase, each key arrives at a
reducer together with the list of values corresponding to that
very key. For example: Bear, [1,1]; Car, [1,1,1]; and so on.
• Now, each reducer counts the values in its list. For example, the
reducer for the key Bear gets the list of values [1,1]; it counts
the number of ones in that list and gives the final output
Bear, 2.
• Finally, all the output key/value pairs are then collected and
written in the output file.
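Putting the steps together, here is a compact trace of the whole
pipeline. The first input line (Dear Bear River) comes from these
slides; the other two lines are an assumption, reconstructed from the
reducer inputs Bear, [1,1] and Car, [1,1,1] quoted above.

Input splits:   Dear Bear River | Car Car River | Deer Car Bear
Map output:     (Dear,1)(Bear,1)(River,1) | (Car,1)(Car,1)(River,1) | (Deer,1)(Car,1)(Bear,1)
Shuffle/sort:   Bear,[1,1]  Car,[1,1,1]  Dear,[1]  Deer,[1]  River,[1,1]
Reduce output:  (Bear,2) (Car,3) (Dear,1) (Deer,1) (River,2)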
Advantages of MapReduce

1. Parallel Processing:
In MapReduce, we divide the job among multiple nodes,
and each node works on its part of the job simultaneously.
MapReduce is thus based on the Divide and Conquer paradigm,
which helps us process the data using different machines.
As the data is processed by multiple machines in parallel
instead of a single machine, the time taken to process the data
is reduced by a tremendous amount.
2. Data Locality:
• Instead of moving data to the processing unit, we move
the processing unit to the data in the MapReduce framework.
• In the traditional system, we used to bring data to the
processing unit and process it there. But as the data grew
very large, bringing this huge amount of data to the
processing unit posed the following issues:
• Moving huge data to processing is costly and degrades
network performance.
• Processing takes time, as the data is processed by a single unit,
which becomes the bottleneck.
• The master node can get over-burdened and may fail.
MapReduce allows us to overcome these issues by
bringing the processing unit to the data: the data is distributed
among multiple nodes, and each node processes the part of the
data residing on it. This gives us the following advantages:

• It is very cost-effective to move the processing unit to the data.
• The processing time is reduced, as all the nodes work
with their part of the data in parallel.
• Every node gets a part of the data to process and therefore
no single node gets overburdened.
MapReduce Example
Twitter receives around 500 million tweets per day, which is
roughly 5,800 tweets per second (500,000,000 ÷ 86,400 seconds).
