Unit 2

MapReduce is a framework for processing large data sets in parallel across clusters of commodity hardware, utilizing a two-step process involving mapping and reducing tasks. The framework allows for easy scalability and efficient data processing, with mappers handling data transformation and reducers performing aggregation. Integration with R enhances data analysis and visualization capabilities, making it a powerful tool for big data applications.

What is MapReduce?

MapReduce is a framework with which we can write applications that process huge amounts of
data in parallel, on large clusters of commodity hardware, in a reliable manner.
MapReduce is a processing technique and a programming model for distributed computing based on
Java. The MapReduce algorithm contains two important tasks, namely Map and Reduce. Map
takes a set of data and converts it into another set of data, where individual elements are
broken down into tuples (key/value pairs). The reduce task then takes the output from
a map as its input and combines those data tuples into a smaller set of tuples. As the name
MapReduce implies, the reduce task is always performed after the map job.
The major advantage of MapReduce is that it is easy to scale data processing over multiple
computing nodes. Under the MapReduce model, the data processing primitives are called
mappers and reducers. Decomposing a data processing application
into mappers and reducers is sometimes nontrivial. But, once we write an application in the
MapReduce form, scaling the application to run over hundreds, thousands, or even tens of
thousands of machines in a cluster is merely a configuration change. This simple scalability is
what has attracted many programmers to use the MapReduce model.

The Algorithm:-

 Generally, the MapReduce paradigm is based on sending the computation to where the data
resides.
 A MapReduce program executes in three stages, namely the map stage, the shuffle stage, and the
reduce stage (a simplified sketch of these stages follows after this list).
o Map stage − The map or mapper’s job is to process the input data. Generally, the
input data is in the form of a file or directory and is stored in the Hadoop Distributed File
System (HDFS). The input file is passed to the mapper function line by line. The
mapper processes the data and creates several small chunks of data.
o Reduce stage − This stage is the combination of the Shuffle stage and
the Reduce stage. The Reducer’s job is to process the data that comes from the
mapper. After processing, it produces a new set of output, which will be stored in
the HDFS.
 During a MapReduce job, Hadoop sends the Map and Reduce tasks to the appropriate
servers in the cluster.
 The framework manages all the details of data-passing such as issuing tasks, verifying
task completion, and copying data around the cluster between the nodes.
 Most of the computing takes place on nodes with data on local disks that reduces the
network traffic.
 After completion of the given tasks, the cluster collects and reduces the data to form an
appropriate result, and sends it back to the Hadoop server.
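
To make the three stages concrete, here is a minimal, framework-free Java sketch that simulates the map, shuffle, and reduce stages for a simple word count. It is only an illustration of the data flow; the class and variable names are our own and no Hadoop APIs are used.

import java.util.*;

// Illustrative only: a plain-Java simulation of the map, shuffle and reduce stages
// for a word count, without the Hadoop framework.
public class MapReduceStagesDemo {
    public static void main(String[] args) {
        List<String> inputLines = Arrays.asList("it is what it is", "what is it");

        // Map stage: each input line is converted into (word, 1) key/value pairs.
        List<Map.Entry<String, Integer>> mapped = new ArrayList<>();
        for (String line : inputLines) {
            for (String word : line.split("\\s+")) {
                mapped.add(new AbstractMap.SimpleEntry<>(word, 1));
            }
        }

        // Shuffle stage: values of identical keys are grouped together, sorted by key.
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : mapped) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }

        // Reduce stage: each key's list of values is aggregated into a single count.
        for (Map.Entry<String, List<Integer>> entry : grouped.entrySet()) {
            int sum = 0;
            for (int count : entry.getValue()) {
                sum += count;
            }
            System.out.println(entry.getKey() + "\t" + sum);
        }
    }
}

In real Hadoop MapReduce, the shuffle stage is performed by the framework itself; the developer only supplies the map and reduce logic, as described in the following sections.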
Fundamentals
MapReduce is a term that covers a Map phase and a Reduce phase. The Map is
used for transformation, while the Reducer is used for aggregation-style operations. The
terminology for Map and Reduce is derived from functional programming languages such as
Lisp and Scala. A MapReduce program comes with 3 main
components, i.e. our Driver code, the Mapper (for transformation), and the Reducer (for
aggregation).
Let’s take an example where you have a 10 TB file to process on Hadoop. The 10 TB
of data is first distributed across multiple nodes of the cluster by HDFS. To process it we
have the MapReduce framework, and to process this data with MapReduce we need Driver
code, which submits a Job. If we are using the Java programming language
for processing the data on HDFS, then we initiate this Driver class with a Job object.
Suppose you have a car, which is your framework; then the start button used to start the car is
similar to the Driver code in the MapReduce framework. We need to initiate the Driver code
to utilize the advantages of the MapReduce framework.
There are also Mapper and Reducer classes provided by this framework, which are extended
and customized by developers as per the organization's requirements.
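
As a hedged illustration of such Driver code, the following sketch configures and submits a hypothetical word-count Job using the standard Hadoop Job API. The class names WordCountDriver, TokenizerMapper, and IntSumReducer are illustrative names of our own; sketches of the Mapper and Reducer appear in the next sections.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// A minimal sketch of Driver code for an assumed word-count job.
public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");     // initiate the Driver with a Job object
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(TokenizerMapper.class);          // transformation
        job.setCombinerClass(IntSumReducer.class);          // optional local aggregation
        job.setReducerClass(IntSumReducer.class);           // aggregation
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // input stored in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // final output written to HDFS
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

The same pattern applies to any MapReduce job: only the Mapper, the Reducer, and the input/output types change.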

Brief Working of Mapper

The Mapper is the first code that interacts with the input dataset. Suppose we
have 100 data blocks of the dataset we are analyzing; in that case there will be 100
Mapper programs or processes that run in parallel on the machines (nodes) and produce their own
output, known as intermediate output, which is stored on local disk, not on HDFS. The
output of the mappers acts as input for the Reducer, which performs some sorting and aggregation
operations on the data and produces the final output.
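
A sketch of what such a Mapper might look like for the word-count job assumed above (TokenizerMapper is our illustrative name, not a Hadoop class). It receives one line of its input split at a time and emits (word, 1) key-value pairs as intermediate output.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Mapper sketch: transforms each input line into (word, 1) pairs.
public class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, ONE);   // intermediate output, kept by the framework on local disk
        }
    }
}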

Brief Working Of Reducer


The Reducer is the second part of the MapReduce programming model. The Mapper produces
its output in the form of key-value pairs, which work as the input for the Reducer. But before
these intermediate key-value pairs are sent to the Reducer, a shuffle-and-sort step groups and
orders them according to their keys. The output generated
by the Reducer is the final output, which is then stored on HDFS (Hadoop Distributed File
System). The Reducer mainly performs computation operations such as addition, filtering, and
aggregation.
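
Continuing the same assumed word-count example, here is a minimal Reducer sketch (IntSumReducer is an illustrative name) that aggregates the grouped values produced by shuffle and sort and writes the final output, which Hadoop stores on HDFS.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Reducer sketch: sums the grouped counts for each word.
public class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();           // aggregation (here: addition)
        }
        result.set(sum);
        context.write(key, result);     // final output, written to HDFS (e.g. part-r-00000)
    }
}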
Steps of Data-Flow:

 A single input split is processed at a time. The Mapper is overridden by the developer according
to the business logic, and this Mapper runs in parallel on all the machines in our
cluster.
 The intermediate output generated by the Mapper is stored on the local disk and shuffled to
the Reducers for the reduce task.
 Once the Mappers finish their task, the output is sorted and merged and provided to the
Reducer.
 The Reducer performs reducing tasks such as aggregation and other compositional
operations, and the final output is then stored on HDFS in a part-r-00000 file (created by
default).

MapReduce – Algorithm (writing a Hadoop MapReduce example)


The MapReduce algorithm contains two important tasks, namely Map and Reduce.

 The map task is done by means of Mapper Class


 The reduce task is done by means of Reducer Class.
The Mapper class takes the input, tokenizes it, maps and sorts it. The output of the Mapper class is used
as input by the Reducer class, which in turn searches matching pairs and reduces them (the Driver, Mapper, and Reducer sketches shown earlier together form one such example).
MapReduce implements various mathematical algorithms to divide a task into small parts and
assign them to multiple systems. In technical terms, MapReduce algorithm helps in sending the
Map & Reduce tasks to appropriate servers in a cluster.
These mathematical algorithms may include the following −

 Sorting
 Searching
 Indexing
 TF-IDF

Sorting

Sorting is one of the basic MapReduce algorithms used to process and analyze data. MapReduce
implements a sorting algorithm to automatically sort the output key-value pairs from the mapper
by their keys.
 Sorting methods are implemented in the mapper class itself.
 In the Shuffle and Sort phase, after tokenizing the values in the mapper class,
the framework (through the Context class) collects the key-value pairs that share a key as a collection.
 To collect similar key-value pairs (intermediate keys), the framework uses the
RawComparator class to sort the key-value pairs (a sketch of plugging in a custom comparator follows after this list).
 The set of intermediate key-value pairs for a given Reducer is automatically sorted by
Hadoop to form key-values (K2, {V2, V2, …}) before they are presented to the Reducer.
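
As a hedged sketch of how this sort order can be customized, the following illustrative comparator (DescendingTextComparator is our own name, not a Hadoop class) extends Hadoop's WritableComparator, which implements RawComparator, and reverses the default ordering of Text keys.

import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

// Illustrative comparator that reverses the default ordering of Text keys.
public class DescendingTextComparator extends WritableComparator {

    protected DescendingTextComparator() {
        super(Text.class, true);   // true = instantiate keys for object-level comparison
    }

    @Override
    @SuppressWarnings("rawtypes")
    public int compare(WritableComparable a, WritableComparable b) {
        return ((Text) b).compareTo((Text) a);   // descending instead of ascending order
    }
}

// In the Driver, such a comparator would be plugged in with:
//   job.setSortComparatorClass(DescendingTextComparator.class);      // order of keys
//   job.setGroupingComparatorClass(DescendingTextComparator.class);  // optional: how keys are grouped

If no custom comparator is supplied, Hadoop simply sorts the intermediate keys with the key type's default comparator.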

Searching

Searching plays an important role in the MapReduce algorithm. It helps in the (optional) combiner
phase and in the Reducer phase. Let us try to understand how searching works with the
help of an example.
Example
The following example shows how MapReduce employs Searching algorithm to find out the
details of the employee who draws the highest salary in a given employee dataset.
 Let us assume we have employee data in four different files − A, B, C, and D. Let us also
assume there are duplicate employee records in all four files because the employee data was
imported repeatedly from the database tables.

 The Map phase processes each input file and provides the employee data as key-value
pairs (<k, v> : <emp name, salary>).

 The combiner phase (searching technique) accepts the input from the Map phase as
key-value pairs of employee name and salary. Using the searching technique, the
combiner checks each employee's salary to find the highest-salaried employee in each
file. See the following snippet.

<k: employee name, v: salary>


Max = salary of the first employee   // treated as the maximum salary so far

for each remaining employee:
    if (v(employee).salary > Max) {
        Max = v(employee).salary;
    } else {
        continue checking;
    }
The expected result is as follows −

<satish, 26000> <gopal, 50000> <kian, 45000> <manisha, 45000>

 Reducer phase − From each file, you will get the highest-salaried employee. To avoid
redundancy, check all the <k, v> pairs and eliminate duplicate entries, if any. The same
algorithm is applied across the four <k, v> pairs coming from the four input
files. The final output should be as follows −
<gopal, 50000>
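
The following is a hedged Java sketch of the searching idea above. To keep the MapReduce semantics correct, the mapper here emits a single constant key ("max") with "name<TAB>salary" as the value, so one reducer class can be registered both as the combiner (maximum per map output) and as the reducer (global maximum). The class names and the assumed "name,salary" record format are illustrative.

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class MaxSalarySearch {

    // Mapper: emit a single constant key so that all salaries meet in one reduce group.
    public static class SalaryMapper extends Mapper<LongWritable, Text, Text, Text> {
        private static final Text MAX_KEY = new Text("max");

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            String[] parts = line.toString().split(",");   // assumed record format: "satish,26000"
            if (parts.length < 2) {
                return;                                     // skip malformed records
            }
            context.write(MAX_KEY, new Text(parts[0].trim() + "\t" + parts[1].trim()));
        }
    }

    // Reducer: keep only the highest salary seen in the group. Because it preserves the
    // incoming key and value format, the same class can be registered as the combiner
    // (maximum per map output) and as the reducer (global maximum).
    public static class MaxReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            String best = null;
            long max = Long.MIN_VALUE;
            for (Text value : values) {
                long salary = Long.parseLong(value.toString().split("\t")[1]);
                if (salary > max) {
                    max = salary;
                    best = value.toString();                // e.g. "gopal<TAB>50000"
                }
            }
            context.write(key, new Text(best));
        }
    }

    // In the Driver:
    //   job.setMapperClass(SalaryMapper.class);
    //   job.setCombinerClass(MaxReducer.class);
    //   job.setReducerClass(MaxReducer.class);
    //   job.setOutputKeyClass(Text.class);
    //   job.setOutputValueClass(Text.class);
}

With the four sample files as input, the single reduce group would keep gopal with 50000, matching the expected final output shown above.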

Indexing

Normally, indexing is used to point to particular data and its address. MapReduce performs batch
indexing of the input files for a particular Mapper.
The indexing technique that is normally used in MapReduce is known as an inverted index. Search
engines like Google and Bing use the inverted indexing technique. Let us try to understand how
indexing works with the help of a simple example.
Example
The following text is the input for inverted indexing. Here T[0], T[1], and T[2] are the file names
and their contents are in double quotes.
T[0] = "it is what it is"
T[1] = "what is it"
T[2] = "it is a banana"
After applying the Indexing algorithm, we get the following output −
"a": {2}
"banana": {2}
"is": {0, 1, 2}
"it": {0, 1, 2}
"what": {0, 1}
Here "a": {2} implies the term "a" appears in the T[2] file. Similarly, "is": {0, 1, 2} implies the
term "is" appears in the files T[0], T[1], and T[2].
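
A hedged sketch of an inverted-index job: the mapper emits (term, document name) pairs and the reducer collects the distinct documents for each term. The class names are illustrative, and obtaining the document name through FileSplit assumes a file-based input format such as TextInputFormat.

import java.io.IOException;
import java.util.StringTokenizer;
import java.util.TreeSet;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class InvertedIndex {

    // Mapper: for each term in a line, emit (term, name of the file the line came from).
    public static class IndexMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            String docName = ((FileSplit) context.getInputSplit()).getPath().getName();
            StringTokenizer itr = new StringTokenizer(line.toString());
            while (itr.hasMoreTokens()) {
                context.write(new Text(itr.nextToken().toLowerCase()), new Text(docName));
            }
        }
    }

    // Reducer: collect the distinct, sorted document names that contain each term.
    public static class IndexReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text term, Iterable<Text> docs, Context context)
                throws IOException, InterruptedException {
            TreeSet<String> postings = new TreeSet<>();
            for (Text doc : docs) {
                postings.add(doc.toString());
            }
            context.write(term, new Text(postings.toString()));   // e.g. is  [T0, T1, T2]
        }
    }
}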
TF-IDF

TF-IDF is a text processing algorithm which is short for Term Frequency − Inverse Document
Frequency. It is one of the common web analysis algorithms. Here, the term 'frequency' refers
to the number of times a term appears in a document.
Term Frequency (TF)
It measures how frequently a particular term occurs in a document. It is calculated by the
number of times a word appears in a document divided by the total number of words in that
document.
TF(the) = (Number of times the term ‘the’ appears in a document) / (Total number of terms in
the document)
Inverse Document Frequency (IDF)
It measures the importance of a term. It is calculated by the number of documents in the text
database divided by the number of documents where a specific term appears.
While computing TF, all the terms are considered equally important. That means TF counts the
term frequency even for common words like “is”, “a”, “what”, etc. Thus we need to weigh down the
frequent terms while scaling up the rare ones, by computing the following −
IDF(the) = log(Total number of documents / Number of documents with the term ‘the’ in it).
The algorithm is explained below with the help of a small example.
Example
Consider a document containing 1000 words, wherein the word hive appears 50 times. The TF
for hive is then (50 / 1000) = 0.05.
Now, assume we have 10 million documents and the word hive appears in 1000 of these. Then,
the IDF is calculated as log10(10,000,000 / 1,000) = 4.
The TF-IDF weight is the product of these quantities − 0.05 × 4 = 0.20.
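
The same calculation can be reproduced in a few lines of Java (using log base 10, as in the worked example above):

// Reproduces the TF-IDF worked example for the word "hive".
public class TfIdfExample {
    public static void main(String[] args) {
        double tf  = 50.0 / 1000.0;                      // term appears 50 times in a 1000-word document
        double idf = Math.log10(10_000_000.0 / 1_000.0); // 10 million documents, term appears in 1000 of them
        double tfIdf = tf * idf;
        System.out.printf("TF = %.2f, IDF = %.0f, TF-IDF = %.2f%n", tf, idf, tfIdf);
        // prints: TF = 0.05, IDF = 4, TF-IDF = 0.20
    }
}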
Learning the different ways to write Hadoop MapReduce in R

R is an open-source programming language. It is best suited for statistical and graphical
analysis. Also, if we need strong data analytics and visualization features, we can combine R
with Hadoop.

The purpose behind R and Hadoop integration:

1. To use Hadoop to execute R code.


2. To use R to access the data stored in Hadoop.

Hadoop and R complement each other very well in terms of big data visualization and analytics.
There are four ways of using Hadoop and R together, which are as follows:

RHadoop: -

RHadoop is a collection of R packages. It contains three packages, i.e. rmr,
rhbase, and rhdfs.

The rmr package: For the Hadoop framework, the rmr package provides MapReduce
functionality by letting the mapping and reducing code be written and executed in R.

The rhbase package: This package provides R database management capability through
integration with HBase.
The rhdfs package: This package provides file management capabilities by integrating with
HDFS.

RHIPE:-

RHIPE stands for R and Hadoop Integrated Programming Environment. RHIPE was developed as part
of the Divide and Recombine (D&R) effort for carrying out efficient analysis of large amounts of
data. It involves working with R and Hadoop in an integrated programming environment. We can also
use Python, Perl, or Java to read data sets in RHIPE. RHIPE provides various functions that let R
interact with HDFS, so we can read and save the complete data created using RHIPE
MapReduce.
Hadoop streaming:-
Hadoop Streaming is a utility that allows users to create and run jobs with any executable as
the mapper and/or the reducer. Using the streaming system, we can develop working Hadoop
jobs with little or no Java knowledge: the mapper and reducer can simply be two scripts that
work in tandem, reading from standard input and writing to standard output.
The combination of R and Hadoop appears to be a must-have toolkit for people working with large
data sets and statistics. However, some Hadoop enthusiasts have raised a red flag when dealing
with very large Big Data sets. They claim that the advantage of R is not its syntax but its
entire library of primitives for visualization and data analysis, and these libraries are
fundamentally non-distributed, which makes data retrieval time-consuming. This is an inherent
limitation of R; if it is acceptable in your setting, R and Hadoop can still work well together.

ORCH: -

ORCH stands for Oracle R Connector for Hadoop. This method is used particularly to work with Big
Data on Oracle appliances, but it can also be used on non-Oracle Hadoop frameworks.

This method helps in accessing the Hadoop cluster with the help of R and also helps to write the
mapping and reducing functions. It allows us to manipulate the data residing in the Hadoop
Distributed File System.

Integration of Hadoop and R


As we know, data is precious and matters a great deal to an organization; it is no
exaggeration to say that data is its most valuable asset. In order to deal with this
huge volume of structured and unstructured data, we need an effective tool that can perform the
analysis, and we get such a tool by merging the features of the R language with the Hadoop
framework for big data analysis; this merging increases scalability. Hence, we need to
integrate the two, and only then can we find better insights and results from data. The
methodologies discussed above help to integrate these two.

R is an open-source programming language that is extensively used for statistical and
graphical analysis. R supports a large variety of statistical and mathematical libraries
(for linear and nonlinear modeling, classical statistical tests, time-series analysis, data
classification, data clustering, etc.) and graphical techniques for processing data efficiently.

One major quality of R is that it produces well-designed, quality plots with great ease,
including mathematical symbols and formulae where needed. If you are in need of strong
data-analytics and visualization features, then combining the R language with Hadoop in
your task is a good choice for reducing complexity. R is a highly extensible
object-oriented programming language and it has strong graphical capabilities.

Some reasons for which R is considered the best fit for data analytics:
 A robust collection of packages
 Powerful data visualization techniques
 Commendable Statistical and graphical programming features
 Object-oriented programming language
 A wide collection of operators for calculations on arrays and, in particular,
matrices
 Graphical representation capabilities, on screen or on hard copy.
