Unit 2

MapReduce is a framework for processing large data sets in parallel across clusters of commodity hardware, utilizing a two-step process involving mapping and reducing tasks. The framework allows for easy scalability and efficient data processing, with mappers handling data transformation and reducers performing aggregation. Integration with R enhances data analysis and visualization capabilities, making it a powerful tool for big data applications.

What is MapReduce?

MapReduce is a framework with which we can write applications that process huge amounts of
data in parallel, on large clusters of commodity hardware, in a reliable manner.
MapReduce is a processing technique and a programming model for distributed computing based on
Java. The MapReduce algorithm contains two important tasks, namely Map and Reduce. Map
takes a set of data and converts it into another set of data, where individual elements are
broken down into tuples (key/value pairs). The reduce task then takes the output from
a map as its input and combines those data tuples into a smaller set of tuples. As the name
MapReduce implies, the reduce task is always performed after the map job.
The major advantage of MapReduce is that it is easy to scale data processing over multiple
computing nodes. Under the MapReduce model, the data processing primitives are called
mappers and reducers. Decomposing a data processing application
into mappers and reducers is sometimes nontrivial. But, once we write an application in the
MapReduce form, scaling the application to run over hundreds, thousands, or even tens of
thousands of machines in a cluster is merely a configuration change. This simple scalability is
what has attracted many programmers to use the MapReduce model.

The Algorithm:-

 Generally, the MapReduce paradigm is based on sending the computation to where the data
resides.
 A MapReduce program executes in three stages, namely the map stage, the shuffle stage, and the
reduce stage (a simplified sketch of these stages follows after this list).
o Map stage − The map or mapper’s job is to process the input data. Generally, the
input data is in the form of a file or directory and is stored in the Hadoop Distributed File
System (HDFS). The input file is passed to the mapper function line by line. The
mapper processes the data and creates several small chunks of data.
o Reduce stage − This stage is the combination of the Shuffle stage and
the Reduce stage. The Reducer’s job is to process the data that comes from the
mapper. After processing, it produces a new set of output, which will be stored in
the HDFS.
 During a MapReduce job, Hadoop sends the Map and Reduce tasks to the appropriate
servers in the cluster.
 The framework manages all the details of data-passing such as issuing tasks, verifying
task completion, and copying data around the cluster between the nodes.
 Most of the computing takes place on nodes with data on local disks that reduces the
network traffic.
 After completion of the given tasks, the cluster collects and reduces the data to form an
appropriate result, and sends it back to the Hadoop server.
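
To make the three stages concrete, here is a minimal, framework-free Java sketch that simulates the map, shuffle, and reduce stages for a simple word count. It is only an illustration of the data flow; the class and variable names are our own and no Hadoop APIs are used.

import java.util.*;

// Illustrative only: a plain-Java simulation of the map, shuffle and reduce stages
// for a word count, without the Hadoop framework.
public class MapReduceStagesDemo {
    public static void main(String[] args) {
        List<String> inputLines = Arrays.asList("it is what it is", "what is it");

        // Map stage: each input line is converted into (word, 1) key/value pairs.
        List<Map.Entry<String, Integer>> mapped = new ArrayList<>();
        for (String line : inputLines) {
            for (String word : line.split("\\s+")) {
                mapped.add(new AbstractMap.SimpleEntry<>(word, 1));
            }
        }

        // Shuffle stage: values of identical keys are grouped together, sorted by key.
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : mapped) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }

        // Reduce stage: each key's list of values is aggregated into a single count.
        for (Map.Entry<String, List<Integer>> entry : grouped.entrySet()) {
            int sum = 0;
            for (int count : entry.getValue()) {
                sum += count;
            }
            System.out.println(entry.getKey() + "\t" + sum);
        }
    }
}

In real Hadoop MapReduce, the shuffle stage is performed by the framework itself; the developer only supplies the map and reduce logic, as described in the following sections.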
Fundamentals
MapReduce is a term that covers a Map phase and a Reduce phase. The Map is
used for transformation, while the Reducer is used for aggregation-style operations. The
terminology for Map and Reduce is derived from functional programming languages such as
Lisp and Scala. A MapReduce program comes with 3 main
components, i.e. our Driver code, the Mapper (for transformation), and the Reducer (for
aggregation).
Let’s take an example where you have a 10 TB file to process on Hadoop. The 10 TB
of data is first distributed across multiple nodes of the cluster by HDFS. To process it we
have the MapReduce framework, and to process this data with MapReduce we need Driver
code, which submits a Job. If we are using the Java programming language
for processing the data on HDFS, then we initiate this Driver class with a Job object.
Suppose you have a car, which is your framework; then the start button used to start the car is
similar to the Driver code in the MapReduce framework. We need to initiate the Driver code
to utilize the advantages of the MapReduce framework.
There are also Mapper and Reducer classes provided by this framework, which are extended
and customized by developers as per the organization's requirements.
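
As a hedged illustration of such Driver code, the following sketch configures and submits a hypothetical word-count Job using the standard Hadoop Job API. The class names WordCountDriver, TokenizerMapper, and IntSumReducer are illustrative names of our own; sketches of the Mapper and Reducer appear in the next sections.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// A minimal sketch of Driver code for an assumed word-count job.
public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");     // initiate the Driver with a Job object
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(TokenizerMapper.class);          // transformation
        job.setCombinerClass(IntSumReducer.class);          // optional local aggregation
        job.setReducerClass(IntSumReducer.class);           // aggregation
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // input stored in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // final output written to HDFS
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

The same pattern applies to any MapReduce job: only the Mapper, the Reducer, and the input/output types change.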

Brief Working of Mapper

The Mapper is the first code that interacts with the input dataset. Suppose we
have 100 data blocks of the dataset we are analyzing; in that case there will be 100
Mapper programs or processes that run in parallel on the machines (nodes) and produce their own
output, known as intermediate output, which is stored on local disk, not on HDFS. The
output of the mappers acts as input for the Reducer, which performs some sorting and aggregation
operations on the data and produces the final output.
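
A sketch of what such a Mapper might look like for the word-count job assumed above (TokenizerMapper is our illustrative name, not a Hadoop class). It receives one line of its input split at a time and emits (word, 1) key-value pairs as intermediate output.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Mapper sketch: transforms each input line into (word, 1) pairs.
public class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, ONE);   // intermediate output, kept by the framework on local disk
        }
    }
}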

Brief Working Of Reducer


The Reducer is the second part of the MapReduce programming model. The Mapper produces
its output in the form of key-value pairs, which work as the input for the Reducer. But before
these intermediate key-value pairs are sent to the Reducer, a shuffle-and-sort step groups and
orders them according to their keys. The output generated
by the Reducer is the final output, which is then stored on HDFS (Hadoop Distributed File
System). The Reducer mainly performs computation operations such as addition, filtering, and
aggregation.
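
Continuing the same assumed word-count example, here is a minimal Reducer sketch (IntSumReducer is an illustrative name) that aggregates the grouped values produced by shuffle and sort and writes the final output, which Hadoop stores on HDFS.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Reducer sketch: sums the grouped counts for each word.
public class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();           // aggregation (here: addition)
        }
        result.set(sum);
        context.write(key, result);     // final output, written to HDFS (e.g. part-r-00000)
    }
}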
Steps of Data-Flow:

 A single input split is processed at a time. The Mapper is overridden by the developer according
to the business logic, and this Mapper runs in parallel on all the machines in our
cluster.
 The intermediate output generated by the Mapper is stored on the local disk and shuffled to
the Reducers for the reduce task.
 Once the Mappers finish their task, the output is sorted and merged and provided to the
Reducer.
 The Reducer performs reducing tasks such as aggregation and other compositional
operations, and the final output is then stored on HDFS in a part-r-00000 file (created by
default).

MapReduce – Algorithm (writing a Hadoop MapReduce example)


The MapReduce algorithm contains two important tasks, namely Map and Reduce.

 The map task is done by means of Mapper Class


 The reduce task is done by means of Reducer Class.
The Mapper class takes the input, tokenizes it, maps and sorts it. The output of the Mapper class is used
as input by the Reducer class, which in turn searches matching pairs and reduces them (the Driver, Mapper, and Reducer sketches shown earlier together form one such example).
MapReduce implements various mathematical algorithms to divide a task into small parts and
assign them to multiple systems. In technical terms, MapReduce algorithm helps in sending the
Map & Reduce tasks to appropriate servers in a cluster.
These mathematical algorithms may include the following −

 Sorting
 Searching
 Indexing
 TF-IDF

Sorting

Sorting is one of the basic MapReduce algorithms used to process and analyze data. MapReduce
implements a sorting algorithm to automatically sort the output key-value pairs from the mapper
by their keys.
 Sorting methods are implemented in the mapper class itself.
 In the Shuffle and Sort phase, after tokenizing the values in the mapper class,
the framework (through the Context class) collects the key-value pairs that share a key as a collection.
 To collect similar key-value pairs (intermediate keys), the framework uses the
RawComparator class to sort the key-value pairs (a sketch of plugging in a custom comparator follows after this list).
 The set of intermediate key-value pairs for a given Reducer is automatically sorted by
Hadoop to form key-values (K2, {V2, V2, …}) before they are presented to the Reducer.
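
As a hedged sketch of how this sort order can be customized, the following illustrative comparator (DescendingTextComparator is our own name, not a Hadoop class) extends Hadoop's WritableComparator, which implements RawComparator, and reverses the default ordering of Text keys.

import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

// Illustrative comparator that reverses the default ordering of Text keys.
public class DescendingTextComparator extends WritableComparator {

    protected DescendingTextComparator() {
        super(Text.class, true);   // true = instantiate keys for object-level comparison
    }

    @Override
    @SuppressWarnings("rawtypes")
    public int compare(WritableComparable a, WritableComparable b) {
        return ((Text) b).compareTo((Text) a);   // descending instead of ascending order
    }
}

// In the Driver, such a comparator would be plugged in with:
//   job.setSortComparatorClass(DescendingTextComparator.class);      // order of keys
//   job.setGroupingComparatorClass(DescendingTextComparator.class);  // optional: how keys are grouped

If no custom comparator is supplied, Hadoop simply sorts the intermediate keys with the key type's default comparator.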

Searching

Searching plays an important role in the MapReduce algorithm. It helps in the (optional) combiner
phase and in the Reducer phase. Let us try to understand how searching works with the
help of an example.
Example
The following example shows how MapReduce employs Searching algorithm to find out the
details of the employee who draws the highest salary in a given employee dataset.
 Let us assume we have employee data in four different files − A, B, C, and D. Let us also
assume there are duplicate employee records in all four files because the employee data was
imported repeatedly from the database tables.

 The Map phase processes each input file and provides the employee data as key-value
pairs (<k, v> : <emp name, salary>).

 The combiner phase (searching technique) accepts the input from the Map phase as
key-value pairs of employee name and salary. Using the searching technique, the
combiner checks each employee's salary to find the highest-salaried employee in each
file. See the following snippet.

<k: employee name, v: salary>


Max = salary of the first employee   // treated as the maximum salary so far

for each remaining employee:
    if (v(employee).salary > Max) {
        Max = v(employee).salary;
    } else {
        continue checking;
    }
The expected result is as follows −

<satish, 26000> <gopal, 50000> <kian, 45000> <manisha, 45000>

 Reducer phase − From each file, you will get the highest-salaried employee. To avoid
redundancy, check all the <k, v> pairs and eliminate duplicate entries, if any. The same
algorithm is applied across the four <k, v> pairs coming from the four input
files. The final output should be as follows −
<gopal, 50000>
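
The following is a hedged Java sketch of the searching idea above. To keep the MapReduce semantics correct, the mapper here emits a single constant key ("max") with "name<TAB>salary" as the value, so one reducer class can be registered both as the combiner (maximum per map output) and as the reducer (global maximum). The class names and the assumed "name,salary" record format are illustrative.

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class MaxSalarySearch {

    // Mapper: emit a single constant key so that all salaries meet in one reduce group.
    public static class SalaryMapper extends Mapper<LongWritable, Text, Text, Text> {
        private static final Text MAX_KEY = new Text("max");

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            String[] parts = line.toString().split(",");   // assumed record format: "satish,26000"
            if (parts.length < 2) {
                return;                                     // skip malformed records
            }
            context.write(MAX_KEY, new Text(parts[0].trim() + "\t" + parts[1].trim()));
        }
    }

    // Reducer: keep only the highest salary seen in the group. Because it preserves the
    // incoming key and value format, the same class can be registered as the combiner
    // (maximum per map output) and as the reducer (global maximum).
    public static class MaxReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            String best = null;
            long max = Long.MIN_VALUE;
            for (Text value : values) {
                long salary = Long.parseLong(value.toString().split("\t")[1]);
                if (salary > max) {
                    max = salary;
                    best = value.toString();                // e.g. "gopal<TAB>50000"
                }
            }
            context.write(key, new Text(best));
        }
    }

    // In the Driver:
    //   job.setMapperClass(SalaryMapper.class);
    //   job.setCombinerClass(MaxReducer.class);
    //   job.setReducerClass(MaxReducer.class);
    //   job.setOutputKeyClass(Text.class);
    //   job.setOutputValueClass(Text.class);
}

With the four sample files as input, the single reduce group would keep gopal with 50000, matching the expected final output shown above.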

Indexing

Normally, indexing is used to point to particular data and its address. MapReduce performs batch
indexing of the input files for a particular Mapper.
The indexing technique that is normally used in MapReduce is known as an inverted index. Search
engines like Google and Bing use the inverted indexing technique. Let us try to understand how
indexing works with the help of a simple example.
Example
The following text is the input for inverted indexing. Here T[0], T[1], and T[2] are the file names
and their contents are in double quotes.
T[0] = "it is what it is"
T[1] = "what is it"
T[2] = "it is a banana"
After applying the Indexing algorithm, we get the following output −
"a": {2}
"banana": {2}
"is": {0, 1, 2}
"it": {0, 1, 2}
"what": {0, 1}
Here "a": {2} implies the term "a" appears in the T[2] file. Similarly, "is": {0, 1, 2} implies the
term "is" appears in the files T[0], T[1], and T[2].
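
A hedged sketch of an inverted-index job: the mapper emits (term, document name) pairs and the reducer collects the distinct documents for each term. The class names are illustrative, and obtaining the document name through FileSplit assumes a file-based input format such as TextInputFormat.

import java.io.IOException;
import java.util.StringTokenizer;
import java.util.TreeSet;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class InvertedIndex {

    // Mapper: for each term in a line, emit (term, name of the file the line came from).
    public static class IndexMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            String docName = ((FileSplit) context.getInputSplit()).getPath().getName();
            StringTokenizer itr = new StringTokenizer(line.toString());
            while (itr.hasMoreTokens()) {
                context.write(new Text(itr.nextToken().toLowerCase()), new Text(docName));
            }
        }
    }

    // Reducer: collect the distinct, sorted document names that contain each term.
    public static class IndexReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text term, Iterable<Text> docs, Context context)
                throws IOException, InterruptedException {
            TreeSet<String> postings = new TreeSet<>();
            for (Text doc : docs) {
                postings.add(doc.toString());
            }
            context.write(term, new Text(postings.toString()));   // e.g. is  [T0, T1, T2]
        }
    }
}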
TF-IDF

TF-IDF is a text processing algorithm which is short for Term Frequency − Inverse Document
Frequency. It is one of the common web analysis algorithms. Here, the term 'frequency' refers
to the number of times a term appears in a document.
Term Frequency (TF)
It measures how frequently a particular term occurs in a document. It is calculated by the
number of times a word appears in a document divided by the total number of words in that
document.
TF(the) = (Number of times the term ‘the’ appears in a document) / (Total number of terms in
the document)
Inverse Document Frequency (IDF)
It measures the importance of a term. It is calculated by the number of documents in the text
database divided by the number of documents where a specific term appears.
While computing TF, all the terms are considered equally important. That means TF counts the
term frequency even for common words like “is”, “a”, “what”, etc. Thus we need to weigh down the
frequent terms while scaling up the rare ones, by computing the following −
IDF(the) = log(Total number of documents / Number of documents with the term ‘the’ in it).
The algorithm is explained below with the help of a small example.
Example
Consider a document containing 1000 words, wherein the word hive appears 50 times. The TF
for hive is then (50 / 1000) = 0.05.
Now, assume we have 10 million documents and the word hive appears in 1000 of these. Then,
the IDF is calculated as log10(10,000,000 / 1,000) = 4.
The TF-IDF weight is the product of these quantities − 0.05 × 4 = 0.20.
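
The same calculation can be reproduced in a few lines of Java (using log base 10, as in the worked example above):

// Reproduces the TF-IDF worked example for the word "hive".
public class TfIdfExample {
    public static void main(String[] args) {
        double tf  = 50.0 / 1000.0;                      // term appears 50 times in a 1000-word document
        double idf = Math.log10(10_000_000.0 / 1_000.0); // 10 million documents, term appears in 1000 of them
        double tfIdf = tf * idf;
        System.out.printf("TF = %.2f, IDF = %.0f, TF-IDF = %.2f%n", tf, idf, tfIdf);
        // prints: TF = 0.05, IDF = 4, TF-IDF = 0.20
    }
}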
Learning the different ways to write Hadoop MapReduce in R

R is an open-source programming language. It is best suited for statistical and graphical
analysis. Also, if we need strong data analytics and visualization features, we can combine R
with Hadoop.

The purpose behind R and Hadoop integration:

1. To use Hadoop to execute R code.


2. To use R to access the data stored in Hadoop.

Hadoop and R complement each other very well in terms of big data visualization and analytics.
There are four ways of using Hadoop and R together, which are as follows:

RHadoop: -

RHadoop is a collection of R packages. It contains three packages, i.e. rmr,
rhbase, and rhdfs.

The rmr package: For the Hadoop framework, the rmr package provides MapReduce
functionality by letting the mapping and reducing code be written and executed in R.

The rhbase package: This package provides R database management capability through
integration with HBase.
The rhdfs package: This package provides file management capabilities by integrating with
HDFS.

RHIPE:-

RHIPE stands for R and Hadoop Integrated Programming Environment. RHIPE was developed as part
of the Divide and Recombine (D&R) effort for carrying out efficient analysis of large amounts of
data. It involves working with R and Hadoop in an integrated programming environment. We can also
use Python, Perl, or Java to read data sets in RHIPE. RHIPE provides various functions that let R
interact with HDFS, so we can read and save the complete data created using RHIPE
MapReduce.
Hadoop streaming:-
Hadoop Streaming is a utility that allows users to create and run jobs with any executable as
the mapper and/or the reducer. Using the streaming system, we can develop working Hadoop
jobs with little or no Java knowledge: the mapper and reducer can simply be two scripts that
work in tandem, reading from standard input and writing to standard output.
The combination of R and Hadoop appears to be a must-have toolkit for people working with large
data sets and statistics. However, some Hadoop enthusiasts have raised a red flag when dealing
with very large Big Data sets. They claim that the advantage of R is not its syntax but its
entire library of primitives for visualization and data analysis, and these libraries are
fundamentally non-distributed, which makes data retrieval time-consuming. This is an inherent
limitation of R; if it is acceptable in your setting, R and Hadoop can still work well together.

ORCH: -

ORCH stands for Oracle R Connector for Hadoop. This method is used particularly to work with Big
Data on Oracle appliances, but it can also be used on non-Oracle Hadoop frameworks.

This method helps in accessing the Hadoop cluster with the help of R and also helps to write the
mapping and reducing functions. It allows us to manipulate the data residing in the Hadoop
Distributed File System.

Integration of Hadoop and R


As we know, data is precious and matters a great deal to an organization; it is no
exaggeration to say that data is its most valuable asset. In order to deal with this
huge volume of structured and unstructured data, we need an effective tool that can perform the
analysis, and we get such a tool by merging the features of the R language with the Hadoop
framework for big data analysis; this merging increases scalability. Hence, we need to
integrate the two, and only then can we find better insights and results from data. The
methodologies discussed above help to integrate these two.

R is an open-source programming language that is extensively used for statistical and
graphical analysis. R supports a large variety of statistical and mathematical libraries
(for linear and nonlinear modeling, classical statistical tests, time-series analysis, data
classification, data clustering, etc.) and graphical techniques for processing data efficiently.

One major quality of R is that it produces well-designed, quality plots with great ease,
including mathematical symbols and formulae where needed. If you are in need of strong
data-analytics and visualization features, then combining the R language with Hadoop in
your task is a good choice for reducing complexity. R is a highly extensible
object-oriented programming language and it has strong graphical capabilities.

Some reasons for which R is considered the best fit for data analytics:
 A robust collection of packages
 Powerful data visualization techniques
 Commendable Statistical and graphical programming features
 Object-oriented programming language
 A wide collection of operators for calculations on arrays and, in particular,
matrices
 Graphical representation capabilities, on screen or on hard copy.
