Traditional Way Vs MapReduce Way and Steps in MapReduce (Word Count) - 1
Processing data the traditional way, by dividing the work among several machines, runs into the following issues:
1. Critical path problem: This is the amount of time available to finish the job without delaying the next milestone or the actual completion date. So, if any one of the machines delays the job, the whole work gets delayed.
2. Reliability problem: What if any of the machines working with a part of the data fails? Managing this failover becomes a challenge.
3. Equal split issue: How will I divide the data into smaller chunks so that each machine gets an even part of the data to work with? In other words, how do I divide the data equally so that no individual machine is overloaded or underutilized?
4. Single split may fail: If any of the machines fails to provide its output, I will not be able to calculate the final result. So, there should be a mechanism that ensures the fault tolerance of the system.
5. Aggregation of results: There should be a mechanism to aggregate the results generated by each of the machines to produce the final output.
To overcome these issues, we have the MapReduce framework, which allows us to perform such parallel computations without worrying about issues like reliability and fault tolerance. MapReduce therefore gives you the flexibility to write your code logic without caring about the design issues of the distributed system.
What is MapReduce?
MapReduce is a programming framework that allows us to perform distributed and parallel processing on large data sets across a cluster of machines. Let us understand how MapReduce works by taking an example where I have a text file called example.txt whose contents are as follows:
Dear, Bear, River, Car, Car, River, Deer, Car and Bear
Now, suppose we have to perform a word count on example.txt using MapReduce. So, we will be finding the unique words and the number of occurrences of those unique words.
• First, we divide the input into three splits as shown in the figure. This will distribute the work among all the map nodes.
• Then, we tokenize the words in each of the mappers and assign a hardcoded value (1) to each of the tokens or words. The rationale behind giving a hardcoded value equal to 1 is that every word, in itself, occurs once.
• Now, a list of key-value pairs will be created where the key is nothing but the individual word and the value is one. So, for the first line (Dear Bear River) we have 3 key-value pairs – Dear, 1; Bear, 1; River, 1. The mapping process remains the same on all the nodes.
• After the mapper phase, a partitioning process takes place, in which sorting and shuffling happen so that all the tuples with the same key are sent to the corresponding reducer.
• So, after the sorting and shuffling phase, each reducer will have a unique key and a list of values corresponding to that key. For example, Bear, [1,1]; Car, [1,1,1]; etc.
• Now, each reducer counts the values present in its list of values. As shown in the figure, the reducer gets the list of values [1,1] for the key Bear. It then counts the number of ones in that list and gives the final output as – Bear, 2.
• Finally, all the output key-value pairs are collected and written to the output file. A minimal Java sketch of the mapper and reducer steps described in this list follows below.
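The steps above correspond directly to the mapper and reducer classes of the standard Hadoop Java API. The sketch below is illustrative rather than definitive: the class names WordCount, TokenizerMapper and IntSumReducer are our own, and for simplicity the mapper splits each line on whitespace only, so the commas in example.txt would need to be stripped in a real job.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

    // Mapper: tokenizes each input line and emits (word, 1) for every token.
    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {

        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE); // e.g. (Bear, 1)
            }
        }
    }

    // Reducer: after shuffle and sort, it receives (word, [1, 1, ...])
    // and sums the ones to get the total count for that word.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {

        private final IntWritable result = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result); // e.g. (Bear, 2)
        }
    }
}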
1. Parallel Processing:
In MapReduce, we divide the job among multiple nodes, and each node works on a part of the job simultaneously. So, MapReduce is based on the divide-and-conquer paradigm, which helps us process the data using different machines. As the data is processed by multiple machines in parallel instead of by a single machine, the time taken to process the data is reduced tremendously, as shown in the figure below. A sketch of the driver that submits such a parallel job follows this paragraph.
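As a minimal sketch (assuming the Hadoop Java API and the WordCount classes from the earlier example, with the hypothetical class name WordCountDriver), the driver below only configures and submits the job; the framework itself computes the input splits and runs one map task per split in parallel across the cluster.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");

        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCount.TokenizerMapper.class);
        job.setCombinerClass(WordCount.IntSumReducer.class); // optional local aggregation on map nodes
        job.setReducerClass(WordCount.IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // The framework divides the input into splits and schedules one map
        // task per split, in parallel, preferably on the nodes holding the data.
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Such a job would typically be packaged into a jar and submitted with something like hadoop jar wordcount.jar WordCountDriver /input /output, where the input and output paths are placeholders.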
2. Data Locality:
Instead of moving the data to the processing unit, we move the processing unit to the data in the MapReduce framework. In the traditional system, we used to bring the data to the processing unit and process it there. But as the data grew and became very large, bringing this huge amount of data to the processing unit posed the following issues:
• Moving huge data to the processing unit is costly and deteriorates network performance.
• Processing takes time, as the data is processed by a single unit, which becomes the bottleneck.
• The master node can get overburdened and may fail.
Now, MapReduce allows us to overcome the above issues by bringing the processing unit to the data. So, as you can see in the above image, the data is distributed among multiple nodes,
where each node processes the part of the data residing on it. This allows us to have the
following advantages: