Map Reduce Tutorial-1

1. MAPREDUCE – INTRODUCTION

MapReduce is a programming model for writing applications that can process Big
Data in parallel on multiple nodes. MapReduce provides analytical capabilities for
analyzing huge volumes of complex data.

What is Big Data?


Big Data is a collection of large datasets that cannot be processed using traditional
computing techniques. For example, the volume of data that Facebook or YouTube
collect and manage on a daily basis falls under the category of Big Data. However,
Big Data is not only about scale and volume; it also involves one or more of the
following aspects − Velocity, Variety, Volume, and Complexity.

Why MapReduce?
Traditional Enterprise Systems normally have a centralized server to store and
process data. The following illustration depicts a schematic view of a traditional
enterprise system. This traditional model is not suitable for processing huge volumes
of data, which cannot be accommodated by standard database servers. Moreover, the
centralized system creates too much of a bottleneck while processing multiple files
simultaneously.

Google solved this bottleneck issue using an algorithm called MapReduce. MapReduce
divides a task into small parts and assigns them to many computers. Later, the
results are collected at one place and integrated to form the result dataset.


How MapReduce Works?


The MapReduce algorithm contains two important tasks, namely Map and Reduce.

 The Map task takes a set of data and converts it into another set of data, where
individual elements are broken down into tuples (key-value pairs).

 The Reduce task takes the output from the Map as an input and combines
those data tuples (key-value pairs) into a smaller set of tuples.

The Reduce task is always performed after the Map task.
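To make this concrete, here is a minimal sketch of the classic word-count job written against Hadoop's Java MapReduce API (an illustration added here, not part of this tutorial's figures): the Map task breaks each line into (word, 1) tuples, and the Reduce task combines the tuples for each word into a single count.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

    // Map task: breaks each input line into words and emits a
    // (word, 1) tuple for every word it finds.
    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {

        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);   // one key-value pair per word
            }
        }
    }

    // Reduce task: receives all the 1s emitted for a given word and
    // combines them into a single (word, total) tuple.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable value : values) {
                sum += value.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }
}
```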

Let us now take a close look at each of the phases and try to understand their
significance.


 Input Phase − Here we have a Record Reader that translates each record in
an input file and sends the parsed data to the mapper in the form of key-value
pairs.

 Map − Map is a user-defined function, which takes a series of key-value pairs
and processes each one of them to generate zero or more key-value pairs.

 Intermediate Keys − The key-value pairs generated by the mapper are
known as intermediate keys.

 Combiner − A combiner is a type of local Reducer that groups similar data
from the map phase into identifiable sets. It takes the intermediate keys from
the mapper as input and applies a user-defined code to aggregate the values
in the small scope of one mapper. It is not a part of the main MapReduce
algorithm; it is optional.

 Shuffle and Sort − The Reducer task starts with the Shuffle and Sort step. It
downloads the grouped key-value pairs onto the local machine, where the
Reducer is running. The individual key-value pairs are sorted by key into a
larger data list. The data list groups the equivalent keys together so that their
values can be iterated easily in the Reducer task.

 Reducer − The Reducer takes the grouped key-value paired data as input and
runs a Reducer function on each one of them. Here, the data can be
aggregated, filtered, and combined in a number of ways, and it may require a
wide range of processing. Once the execution is over, it gives zero or more
key-value pairs to the final step.

 Output Phase − In the output phase, we have an output formatter that
translates the final key-value pairs from the Reducer function and writes them
onto a file using a record writer.
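These phases map directly onto a Hadoop job configuration. The following driver is a sketch, assuming the WordCount.TokenizerMapper and WordCount.IntSumReducer classes from the earlier example; it shows where the input format (with its Record Reader), the Mapper, the optional Combiner, the Reducer, and the output format (with its record writer) are plugged in.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);

        // Input Phase: TextInputFormat's Record Reader turns each line of
        // the input file into a (byte offset, line text) pair for the mapper.
        job.setInputFormatClass(TextInputFormat.class);

        job.setMapperClass(WordCount.TokenizerMapper.class);
        // Combiner: optional local Reducer that pre-aggregates map output.
        job.setCombinerClass(WordCount.IntSumReducer.class);
        job.setReducerClass(WordCount.IntSumReducer.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Output Phase: the output format's record writer persists the
        // final key-value pairs from the Reducer.
        job.setOutputFormatClass(TextOutputFormat.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```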

Let us try to understand the two tasks, Map and Reduce, with the help of a small diagram −

MapReduce-Example
Let us take a real-world example to comprehend the power of MapReduce. Twitter
receives around 500 million tweets per day, which is nearly 3000 tweets per second.
The following illustration shows how Twitter manages its tweets with the help of
MapReduce.


As shown in the illustration, the MapReduce algorithm performs the following actions −

 Tokenize − Tokenizes the tweets into maps of tokens and writes them as key-
value pairs.

 Filter − Filters unwanted words from the maps of tokens and writes the
filtered maps as key-value pairs.

 Count − Generates a token counter per word.

 Aggregate Counters − Prepares an aggregate of similar counter values into
small manageable units.
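As an illustrative sketch (the tutorial's illustration is not reproduced here), the Tokenize and Filter steps could be combined in a single Hadoop Mapper like the one below. The class name TweetTokenMapper and the stop-word list are assumptions for the example, not part of any real Twitter pipeline.

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Tokenize and Filter in one mapper: split each tweet into tokens,
// drop unwanted words, and emit (word, 1) pairs for the Count step.
public class TweetTokenMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    // Hypothetical stop-word list; a real job would load a much larger set.
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "the", "rt"));
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text tweet, Context context)
            throws IOException, InterruptedException {
        for (String token : tweet.toString().toLowerCase().split("\\s+")) {
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                word.set(token);
                context.write(word, ONE);  // Count and Aggregate happen downstream
            }
        }
    }
}
```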

2. MAPREDUCE – ALGORITHM

The MapReduce algorithm contains two important tasks, namely Map and Reduce.

 The map task is done by means of Mapper Class

 The reduce task is done by means of Reducer Class.

The Mapper class takes the input, tokenizes it, maps, and sorts it. The output of the
Mapper class is used as input by the Reducer class, which in turn searches for
matching pairs and reduces them.

MapReduce implements various mathematical algorithms to divide a task into small
parts and assign them to multiple systems. In technical terms, the MapReduce
algorithm helps in sending the Map and Reduce tasks to appropriate servers in a
cluster.

These mathematical algorithms may include the following −

 Sorting

 Searching

 Indexing

 TF-IDF

Sorting
Sorting is one of the basic MapReduce algorithms to process and analyze data.
MapReduce implements sorting algorithm to automatically sort the output key-value
pairs from the mapper by their keys.


 Sorting methods are implemented in the mapper class itself.

 In the Shuffle and Sort phase, after tokenizing the values in the mapper class,
the Context class (user-defined class) collects the keys with matching values
as a collection.

 To collect similar key-value pairs (intermediate keys), the Mapper class takes
the help of RawComparator class to sort the key-value pairs.

 The set of intermediate key-value pairs for a given Reducer is automatically
sorted by Hadoop to form key-values (K2, {V2, V2, …}) before they are
presented to the Reducer.
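For example, the default sort order of the intermediate keys can be changed by registering a custom comparator on the job. The following is a minimal sketch of a WritableComparator subclass that reverses the natural ordering of Text keys; the class name DescendingTextComparator is hypothetical.

```java
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

// A sort comparator that reverses the default ordering of Text keys,
// so intermediate keys reach the Reducer in descending order.
public class DescendingTextComparator extends WritableComparator {

    protected DescendingTextComparator() {
        super(Text.class, true);  // true => instantiate keys for comparison
    }

    @Override
    @SuppressWarnings({"rawtypes", "unchecked"})
    public int compare(WritableComparable a, WritableComparable b) {
        return -a.compareTo(b);   // negate to invert the natural order
    }
}

// Registered on the job with:
//   job.setSortComparatorClass(DescendingTextComparator.class);
```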

Searching
Searching plays an important role in MapReduce algorithm. It helps in the combiner
phase (optional) and in the Reducer phase. Let us try to understand how Searching
works with the help of an example.

Example
The following example shows how MapReduce employs Searching algorithm to find
out the details of the employee who draws the highest salary in a given employee
dataset.

 Let us assume we have employee data in four different files − A, B, C, and D.
Let us also assume there are duplicate employee records in all four files
because the employee data was imported repeatedly from all the database
tables. See the following illustration.


 The Map phase processes each input file and provides the employee data in
key-value pairs (<k, v> : <emp name, salary>). See the following illustration.
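A sketch of this Searching example in Hadoop's Java API is shown below, assuming each record is a simple "name,salary" line; the class and field names are illustrative. Each Mapper tracks the highest-paid record in its own file and emits only that one pair, and a single Reducer then searches the per-file maxima for the overall highest salary.

```java
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class MaxSalary {

    // Each mapper scans its own file (A, B, C, or D) and emits only the
    // highest-paid employee record it has seen, under one shared key.
    public static class MaxSalaryMapper
            extends Mapper<LongWritable, Text, Text, Text> {

        private String bestRecord = null;
        private long bestSalary = Long.MIN_VALUE;

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            // Assumed record layout: "name,salary" per line.
            String[] fields = line.toString().split(",");
            long salary = Long.parseLong(fields[1].trim());
            if (salary > bestSalary) {
                bestSalary = salary;
                bestRecord = line.toString();
            }
        }

        @Override
        protected void cleanup(Context context)
                throws IOException, InterruptedException {
            if (bestRecord != null) {
                context.write(new Text("max"), new Text(bestRecord));
            }
        }
    }

    // One reduce group compares the per-file maxima; duplicate records
    // from the repeated imports collapse here, since only the single
    // highest-salary record is kept.
    public static class MaxSalaryReducer
            extends Reducer<Text, Text, Text, Text> {

        @Override
        protected void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            String bestRecord = null;
            long bestSalary = Long.MIN_VALUE;
            for (Text value : values) {
                long salary = Long.parseLong(value.toString().split(",")[1].trim());
                if (salary > bestSalary) {
                    bestSalary = salary;
                    bestRecord = value.toString();
                }
            }
            context.write(new Text("highest salary"), new Text(bestRecord));
        }
    }
}
```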
