Traditional Way Vs MapReduce Way and Steps in MapReduce (Word Count)

Big Data Analytics Unit II

MapReduce Tutorial: Traditional Way

In the traditional approach, a large data set is split across several machines, each machine processes its share of the data, and the partial results are then combined. Let us look at the challenges associated with this traditional approach:

1. Critical path problem: It is the amount of time taken to finish the job without delaying
the next milestone or actual completion date. So, if any of the machines delays its part of
the job, the whole work gets delayed.
2. Reliability problem: What if any of the machines working with a part of the data fails?
Managing this failover becomes a challenge.
3. Equal split issue: How do we divide the data into smaller chunks so that each machine
gets an even share of the data to work with? In other words, how do we divide the data so
that no individual machine is overloaded or underutilized?
4. Single split may fail: If any machine fails to provide its output, we will not be able to
calculate the result. So, there should be a mechanism to ensure the fault-tolerance
capability of the system.
5. Aggregation of results: There should be a mechanism to aggregate the results generated
by each of the machines to produce the final output.

To overcome these issues, we have the MapReduce framework, which allows us to perform
such parallel computations without bothering about issues like reliability, fault tolerance,
etc. Therefore, MapReduce gives you the flexibility to write your code logic without caring
about the design issues of the system.

What is MapReduce?

MapReduce is a programming framework that allows us to perform distributed and parallel
processing on large data sets in a distributed environment.

• MapReduce consists of two distinct tasks – Map and Reduce.


• As the name MapReduce suggests, the reducer phase takes place after the mapper phase has
been completed.
• So, the first is the map job, where a block of data is read and processed to produce key-
value pairs as intermediate outputs.
• The output of a Mapper or map job (key-value pairs) is input to the Reducer.
• The reducer receives the key-value pairs from multiple map jobs.
• Then, the reducer aggregates those intermediate data tuples (intermediate key-value
pairs) into a smaller set of tuples or key-value pairs, which is the final output. A minimal
sketch of these two functions is given below.
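To make the two roles concrete, here is a minimal sketch in plain Python (not the Hadoop API; the function names map_fn and reduce_fn are made up for illustration):

```python
# Minimal sketch of the two functions a MapReduce job supplies.
# Plain Python for illustration only, not the Hadoop API.

def map_fn(record):
    """Map job: read one record and emit intermediate (key, value) pairs."""
    for word in record.split():
        yield (word, 1)            # each word counts once for itself

def reduce_fn(key, values):
    """Reduce job: aggregate all values seen for one key."""
    return (key, sum(values))      # e.g. ("Bear", [1, 1]) -> ("Bear", 2)
```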

MapReduce Tutorial: A Word Count Example of MapReduce

Let us understand how MapReduce works by taking an example where we have a text file
called example.txt whose contents are as follows:

Dear, Bear, River, Car, Car, River, Deer, Car and Bear

Now, suppose we have to perform a word count on example.txt using MapReduce. So, we
will be finding the unique words and the number of occurrences of those unique words.

• First, we divide the input into three splits as shown in the figure. This will distribute the
work among all the map nodes.
• Then, we tokenize the words in each of the mappers and give a hardcoded value (1) to
each of the tokens or words. The rationale behind giving a hardcoded value equal to 1
is that every word, in itself, will occur once.
• Now, a list of key-value pairs will be created where the key is nothing but the individual
word and the value is one. So, for the first line (Dear Bear River) we have 3 key-value
pairs – Dear, 1; Bear, 1; River, 1. The mapping process remains the same on all the
nodes.
• After the mapper phase, a partition process takes place where sorting and shuffling happen
so that all the tuples with the same key are sent to the corresponding reducer.
• So, after the sorting and shuffling phase, each reducer will have a unique key and a list
of values corresponding to that key. For example, Bear, [1,1]; Car, [1,1,1], etc.
• Now, each reducer counts the values which are present in its list of values. As shown
in the figure, the reducer gets the list of values [1,1] for the key Bear. Then, it counts
the number of ones in that list and gives the final output as – Bear, 2.

• Finally, all the output key-value pairs are collected and written to the output file. A
plain-Python sketch of the whole pipeline is given below.
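The whole walk-through above can be simulated in a few lines of plain Python. This is only a single-process sketch of the three phases (map, shuffle and sort, reduce), not Hadoop code; the three splits are the ones used in the walk-through above:

```python
from collections import defaultdict

# Single-process simulation of the word-count pipeline described above
# (map -> shuffle/sort -> reduce); it is not Hadoop code.

splits = [
    "Dear Bear River",
    "Car Car River",
    "Deer Car Bear",
]  # the three input splits used in the walk-through

# Map phase: tokenize each split and emit (word, 1) for every word.
intermediate = []
for split in splits:
    for word in split.split():
        intermediate.append((word, 1))

# Shuffle and sort phase: group all values belonging to the same key,
# so that a reducer sees, for example, ("Bear", [1, 1]).
grouped = defaultdict(list)
for word, count in intermediate:
    grouped[word].append(count)

# Reduce phase: count (sum) the list of ones for every key.
result = {word: sum(counts) for word, counts in sorted(grouped.items())}

print(result)  # {'Bear': 2, 'Car': 3, 'Dear': 1, 'Deer': 1, 'River': 2}
```

Running it prints the same final counts as the walk-through: Bear 2, Car 3, Dear 1, Deer 1, River 2.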

MapReduce Tutorial: Advantages of MapReduce

The two biggest advantages of MapReduce are:

1. Parallel Processing:

In MapReduce, we divide the job among multiple nodes and each node works with a part
of the job simultaneously. So, MapReduce is based on the Divide and Conquer paradigm, which
helps us process the data using different machines. As the data is processed by multiple
machines in parallel instead of a single machine, the time taken to process the data is reduced
by a tremendous amount, as shown in the figure below. A toy code sketch of this idea follows the figure.

Fig.: Traditional Way Vs. MapReduce Way – MapReduce Tutorial
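As a toy illustration of this divide-and-conquer idea, the sketch below counts each split in a separate local process and then merges the partial results. It only mimics parallelism on one machine, whereas a real MapReduce job would run the mappers on different nodes of the cluster:

```python
from collections import Counter
from multiprocessing import Pool

def count_split(split):
    """The work of a single mapper node: word count for its own split."""
    return Counter(split.split())

# The same three splits as in the word-count example above.
splits = ["Dear Bear River", "Car Car River", "Deer Car Bear"]

if __name__ == "__main__":
    # Each split is handled by its own local process, in parallel.
    with Pool(processes=len(splits)) as pool:
        partial_counts = pool.map(count_split, splits)

    # Aggregate the partial results into the final output.
    total = sum(partial_counts, Counter())
    print(total)  # e.g. Counter({'Car': 3, 'Bear': 2, 'River': 2, 'Dear': 1, 'Deer': 1})
```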

2. Data Locality:

Instead of moving data to the processing unit, we move the processing unit to the data in the
MapReduce framework. In the traditional system, we used to bring the data to the processing unit
and process it. But as the data grew and became very huge, bringing this huge amount of data
to the processing unit posed the following issues:

• Moving huge amounts of data to the processing unit is costly and deteriorates the network performance.
• Processing takes time as the data is processed by a single unit, which becomes the
bottleneck.
• The master node can get overburdened and may fail.

Now, MapReduce allows us to overcome the above issues by bringing the processing unit to the
data. So, as you can see in the above image, the data is distributed among multiple nodes
where each node processes the part of the data residing on it. This allows us to have the
following advantages:

• It is very cost-effective to move the processing unit to the data.


• The processing time is reduced as all the nodes are working with their part of the data
in parallel.
• Every node gets a part of the data to process and therefore, there is no chance of a node
getting overburdened.
