Map Reduce Tutorial-1

MapReduce is a programming model used for processing large datasets in a distributed system. It allows for parallel processing of data across clusters of computers. The MapReduce algorithm contains two main tasks - the Map task which processes key-value pairs and the Reduce task which combines the outputs from Map into final results. An example is provided of how Twitter uses MapReduce to process 500 million tweets per day across distributed systems.

Uploaded by

jefferyleclerc

1. MAPREDUCE – INTRODUCTION

MapReduce is a programming model for writing applications that can process Big
Data in parallel on multiple nodes. MapReduce provides analytical capabilities for
analyzing huge volumes of complex data.

What is Big Data?

Big Data is a collection of large datasets that cannot be processed using traditional
computing techniques. For example, the volume of data that Facebook or YouTube
needs to collect and manage on a daily basis falls under the category of Big
Data. However, Big Data is not only about scale and volume; it also involves one or
more of the following aspects − Velocity, Variety, Volume, and Complexity.

Why MapReduce?
Traditional enterprise systems normally have a centralized server to store and
process data. The following illustration depicts a schematic view of a traditional
enterprise system. The traditional model is not suitable for processing huge and
growing volumes of data, which standard database servers cannot accommodate.
Moreover, the centralized system becomes a bottleneck when processing
multiple files simultaneously.

Google solved this bottleneck issue using an algorithm called MapReduce. MapReduce
divides a task into small parts and assigns them to many computers. Later, the
results are collected at one place and integrated to form the result dataset.
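
The divide-and-collect idea can be sketched on a single machine, with a thread pool standing in for the many computers. This is only an illustration under that assumption, not how Google's system or Hadoop schedules work; the helper names `split_into_chunks`, `count_words`, and `distributed_word_total`, and the sample lines, are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(items, n_chunks):
    """Divide a list into at most n_chunks roughly equal parts."""
    size = max(1, -(-len(items) // n_chunks))  # ceiling division
    return [items[i:i + size] for i in range(0, len(items), size)]


def count_words(lines):
    """The small task each worker runs on its own part of the data."""
    return sum(len(line.split()) for line in lines)


def distributed_word_total(lines, n_workers=3):
    chunks = split_into_chunks(lines, n_workers)
    # Each chunk is handed to a separate worker (threads stand in for nodes).
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partial_totals = list(pool.map(count_words, chunks))
    # The partial results are collected in one place and integrated.
    return sum(partial_totals)


# Made-up sample input.
lines = ["big data needs many nodes", "map then reduce", "one more line", "end"]
print(distributed_word_total(lines))
```

Each worker sees only its own chunk, so the per-chunk work is independent; only the final sum needs all the partial results.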

How Does MapReduce Work?

The MapReduce algorithm contains two important tasks, namely Map and Reduce.

• The Map task takes a set of data and converts it into another set of data, where
individual elements are broken down into tuples (key-value pairs).

• The Reduce task takes the output from the Map as an input and combines
those data tuples (key-value pairs) into a smaller set of tuples.

The Reduce task is always performed after the Map task.
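
As a rough sketch of the two tasks, the snippet below maps raw records to key-value tuples and then reduces them to a smaller set of tuples. The record format ("city, reading"), the choice of `max` as the combining step, and the function names `map_task` and `reduce_task` are assumptions made purely for illustration.

```python
from collections import defaultdict


def map_task(records):
    """Break each input record into a (key, value) tuple."""
    for record in records:
        city, reading = record.split(",")
        yield city.strip(), int(reading)


def reduce_task(pairs):
    """Combine tuples that share a key into a smaller set of tuples."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return sorted((key, max(values)) for key, values in grouped.items())


# Made-up sample data.
records = ["Pune, 31", "Delhi, 40", "Pune, 35", "Delhi, 38"]
mapped = list(map_task(records))   # the Map task runs first
reduced = reduce_task(mapped)      # the Reduce task consumes the Map output
print(reduced)
```

Four input tuples go in and two come out, one per distinct key.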

Let us now take a close look at each of the phases and try to understand their
significance.

• Input Phase − Here we have a Record Reader that translates each record in
an input file and sends the parsed data to the mapper in the form of key-value
pairs.

• Map − Map is a user-defined function, which takes a series of key-value pairs
and processes each one of them to generate zero or more key-value pairs.

• Intermediate Keys − The key-value pairs generated by the mapper are
known as intermediate keys.

• Combiner − A combiner is a type of local Reducer that groups similar data
from the map phase into identifiable sets. It takes the intermediate keys from
the mapper as input and applies user-defined code to aggregate the values
within the scope of one mapper. It is not a part of the main MapReduce
algorithm; it is optional.

• Shuffle and Sort − The Reducer task starts with the Shuffle and Sort step. It
downloads the grouped key-value pairs onto the local machine, where the
Reducer is running. The individual key-value pairs are sorted by key into a
larger data list. The data list groups the equivalent keys together so that their
values can be iterated easily in the Reducer task.

• Reducer − The Reducer takes the grouped key-value paired data as input and
runs a Reducer function on each one of them. Here, the data can be
aggregated, filtered, and combined in a number of ways, which can require a wide
range of processing. Once the execution is over, it gives zero or more key-value
pairs to the final step.

• Output Phase − In the output phase, we have an output formatter that
translates the final key-value pairs from the Reducer function and writes them
onto a file using a record writer.
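
The phases above can be traced end to end in a small single-process word count. This is a sketch only: the function names mirror the phase names but are hypothetical, an in-memory buffer stands in for the output file, and a real framework would run the mappers and reducers on different machines.

```python
import io
from collections import defaultdict
from itertools import groupby
from operator import itemgetter


def record_reader(text):
    """Input phase: turn each line into a (line offset, line) pair."""
    offset = 0
    for line in text.splitlines(keepends=True):
        yield offset, line.rstrip("\n")
        offset += len(line)


def mapper(key, value):
    """Map: emit zero or more intermediate (word, 1) pairs per record."""
    for word in value.lower().split():
        yield word, 1


def combiner(pairs):
    """Combiner (optional): pre-aggregate the output of one mapper."""
    local = defaultdict(int)
    for word, count in pairs:
        local[word] += count
    return list(local.items())


def shuffle_and_sort(all_pairs):
    """Shuffle and Sort: sort by key and group equal keys together."""
    ordered = sorted(all_pairs, key=itemgetter(0))
    for key, group in groupby(ordered, key=itemgetter(0)):
        yield key, [value for _, value in group]


def reducer(key, values):
    """Reducer: aggregate the grouped values for one key."""
    yield key, sum(values)


def run_job(splits):
    intermediate = []
    for split in splits:  # one mapper per input split
        mapped = []
        for k, v in record_reader(split):
            mapped.extend(mapper(k, v))
        intermediate.extend(combiner(mapped))
    out = io.StringIO()  # stands in for the output file
    for key, values in shuffle_and_sort(intermediate):
        for k, v in reducer(key, values):
            out.write(f"{k}\t{v}\n")  # Output phase: the record writer
    return out.getvalue()


print(run_job(["map reduce map\nreduce", "map shuffle"]))
```

Removing the `combiner` call (passing `mapped` straight through) leaves the final output unchanged, which is one way to see why the combiner is optional.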

Let us try to understand the two tasks Map & Reduce with the help of a small diagram −

MapReduce-Example
Let us take a real-world example to comprehend the power of MapReduce. Twitter
receives around 500 million tweets per day, which is nearly 6,000 tweets per second.
The following illustration shows how Twitter manages its tweets with the help of
MapReduce.

As shown in the illustration, the MapReduce algorithm performs the following actions −

• Tokenize − Tokenizes the tweets into maps of tokens and writes them as key-value
pairs.

• Filter − Filters unwanted words from the maps of tokens and writes the
filtered maps as key-value pairs.

• Count − Generates a token counter per word.

• Aggregate Counters − Prepares an aggregate of similar counter values into
small manageable units.
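
The four actions can be sketched as four small functions chained together. The stop-word list, the sample tweets, and the function names are all made up for illustration; nothing here reflects Twitter's actual pipeline.

```python
from collections import Counter

STOP_WORDS = {"the", "a", "is", "to", "and"}  # hypothetical "unwanted words"


def tokenize(tweet):
    """Tokenize: split a tweet into tokens, emitted as (token, 1) pairs."""
    return [(token, 1) for token in tweet.lower().split()]


def filter_tokens(pairs):
    """Filter: drop unwanted words from the tokenized pairs."""
    return [(token, n) for token, n in pairs if token not in STOP_WORDS]


def count(pairs):
    """Count: build a token counter per word for one tweet."""
    counter = Counter()
    for token, n in pairs:
        counter[token] += n
    return counter


def aggregate(counters):
    """Aggregate Counters: merge the per-tweet counters into one."""
    total = Counter()
    for counter in counters:
        total.update(counter)
    return total


# Made-up sample tweets.
tweets = ["MapReduce is fun", "the cluster runs MapReduce", "fun fun fun"]
totals = aggregate(count(filter_tokens(tokenize(t))) for t in tweets)
print(totals.most_common(2))
```

Tokenize, Filter, and Count each work on one tweet at a time, so they could run in parallel; only Aggregate Counters needs results from every tweet.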

2. MAPREDUCE – ALGORITHM

The MapReduce algorithm contains two important tasks, namely Map and Reduce.

• The map task is done by means of the Mapper class.

• The reduce task is done by means of the Reducer class.

The Mapper class takes the input, tokenizes it, maps it, and sorts it. The output of the
Mapper class is used as input by the Reducer class, which in turn searches for matching
pairs and reduces them.

MapReduce implements various mathematical algorithms to divide a task into small
parts and assign them to multiple systems. In technical terms, the MapReduce algorithm
helps in sending the Map & Reduce tasks to appropriate servers in a cluster.

These mathematical algorithms may include the following −

• Sorting

• Searching

• Indexing

• TF-IDF

Sorting
Sorting is one of the basic MapReduce algorithms to process and analyze data.
MapReduce implements a sorting algorithm to automatically sort the output key-value
pairs from the mapper by their keys.

• Sorting methods are implemented in the mapper class itself.

• In the Shuffle and Sort phase, after tokenizing the values in the mapper class,
the Context class (provided by the Hadoop framework) collects the matching valued
keys as a collection.

• To collect similar key-value pairs (intermediate keys), the Mapper class takes
the help of the RawComparator interface to sort the key-value pairs.

• The set of intermediate key-value pairs for a given Reducer is automatically
sorted by Hadoop to form key-values (K2, {V2, V2…}) before they are
presented to the Reducer.
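
The sort-then-group step that produces the (K2, {V2, V2…}) form can be imitated in a few lines. This is a sketch, not Hadoop code: the function `compare_keys` only plays the role that a key comparator plays in Hadoop, and `sort_and_group` is a hypothetical name.

```python
from functools import cmp_to_key
from itertools import groupby


def compare_keys(a, b):
    """Three-way comparison of two keys: negative, zero, or positive."""
    return (a > b) - (a < b)


def sort_and_group(pairs):
    """Sort (K2, V2) pairs by key, then collect the values of each key."""
    ordered = sorted(
        pairs, key=cmp_to_key(lambda x, y: compare_keys(x[0], y[0]))
    )
    # groupby only merges adjacent equal keys, so the sort must come first.
    return [
        (key, [value for _, value in group])
        for key, group in groupby(ordered, key=lambda kv: kv[0])
    ]


# Made-up intermediate pairs.
intermediate = [("b", 2), ("a", 1), ("b", 3), ("a", 4)]
print(sort_and_group(intermediate))
```

Because Python's sort is stable, values that share a key keep the order in which the mapper emitted them.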

Searching
Searching plays an important role in the MapReduce algorithm. It helps in the combiner
phase (optional) and in the Reducer phase. Let us try to understand how Searching
works with the help of an example.

Example
The following example shows how MapReduce employs a searching algorithm to find
the details of the employee who draws the highest salary in a given employee
dataset.

• Let us assume we have employee data in four different files − A, B, C, and D.
Let us also assume there are duplicate employee records in all four files
because of importing the employee data from all database tables repeatedly.
See the following illustration.

• The Map phase processes each input file and provides the employee data in
key-value pairs (<k, v> : <emp name, salary>). See the following illustration.
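
The Map phase described above, followed by a simple search for the highest salary, might look like the sketch below. The file contents, the employee names, and the helper `highest_paid` are hypothetical, and the search shown is one straightforward way to reach the goal stated at the start of the example, not necessarily the exact steps the tutorial goes on to use.

```python
def map_phase(file_records):
    """Emit <emp name, salary> pairs for every record in one input file."""
    for name, salary in file_records:
        yield name, salary


def highest_paid(pairs):
    """Search the pairs for the employee with the largest salary."""
    best = None
    for name, salary in pairs:
        if best is None or salary > best[1]:
            best = (name, salary)
    return best


# Hypothetical contents of files A to D; the duplicates are deliberate.
files = {
    "A": [("asha", 45000), ("ravi", 26000)],
    "B": [("ravi", 26000), ("meena", 50000)],
    "C": [("asha", 45000), ("kiran", 40000)],
    "D": [("meena", 50000), ("kiran", 40000)],
}

pairs = [pair for records in files.values() for pair in map_phase(records)]
print(highest_paid(pairs))
```

Duplicate records do not change the answer here, because repeating a pair cannot raise the maximum.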
