MAPREDUCE

WHAT IS MAPREDUCE?

MapReduce is a software framework used for processing vast data sets in a distributed computing environment. It is composed of two key phases: Map and Reduce. Map tasks deal with splitting and mapping the data, while Reduce tasks shuffle and reduce it.
COMPONENTS OF MAPREDUCE

 Map Function: Processes the input data and generates key-value pairs.
 Shuffle and Sort: Organizes the key-value pairs so that all values for the same key are sent to the same reducer.
 Reduce Function: Aggregates the output by performing operations such as sum or count on the grouped key-value pairs.
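
Conceptually, the two user-supplied functions can be summarized with the signatures from the original MapReduce model (this notation is not part of the slides; the concrete key and value types depend on the job):

    map    (k1, v1)        ->  list(k2, v2)
    reduce (k2, list(v2))  ->  list(v2)

In the word-count example used later in this document, k1 is a byte offset into the input, v1 is a line of text, k2 is a word, and v2 is an integer count.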
MAPREDUCE ARCHITECTURE

Input Splits: The input to a MapReduce job is divided into fixed-size pieces called input splits. An input split is the chunk of the input that is consumed by a single map task.
Mapping: This is the very first phase in the execution of a MapReduce program. In this phase, the data in each split is passed to a mapping function to produce output values. In our example, the job of the mapping phase is to count the number of occurrences of each word in its input split and prepare a list in the form of <word, frequency> pairs (more details about input splits are given below).
Shuffling: This phase consumes the output of the Mapping phase. Its task is to consolidate the relevant records from the Mapping phase output. In our example, identical words are clubbed together along with their respective frequencies.
Reducing: In this phase, the output values from the Shuffling phase are aggregated. It combines the values from the Shuffling phase and returns a single output value per key. In short, this phase summarizes the complete dataset.
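
To make these phases concrete, here is a small worked trace of the word-count flow on a hypothetical two-line input (the split boundaries and key ordering shown are illustrative only):

    Input splits:  "deer bear river"               |  "car car river"
    Mapping:       <deer,1> <bear,1> <river,1>     |  <car,1> <car,1> <river,1>
    Shuffling:     <bear,[1]>  <car,[1,1]>  <deer,[1]>  <river,[1,1]>
    Reducing:      <bear,1>  <car,2>  <deer,1>  <river,2>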
STEPS:

One map task is created for each split, and it then executes the map function for each record in the split.
It is always beneficial to have multiple splits, because the time taken to process a split is small compared to the time taken to process the whole input. When the splits are smaller, the processing is better load-balanced, since we process the splits in parallel.
However, it is also not desirable to have splits that are too small. When splits are too small, the overhead of managing the splits and of creating map tasks begins to dominate the total job execution time.
For most jobs, it is better to make the split size equal to the size of an HDFS block (64 MB by default in Hadoop 1; 128 MB in Hadoop 2 and later).
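
As an illustrative sketch (not part of the original slides), the split size can be influenced from the job driver using the Hadoop new-API FileInputFormat helpers; the 128 MB value and class name below are assumptions for demonstration, and in practice the cluster's HDFS block size is the usual target:

    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

    public class SplitSizeExample {
        public static void main(String[] args) throws Exception {
            // Assumed block size of 128 MB, purely for illustration.
            long blockSize = 128L * 1024 * 1024;

            Job job = Job.getInstance();
            // Lower and upper bounds used by FileInputFormat when it
            // computes input splits for this job; pinning both to the
            // block size keeps roughly one map task per HDFS block.
            FileInputFormat.setMinInputSplitSize(job, blockSize);
            FileInputFormat.setMaxInputSplitSize(job, blockSize);
        }
    }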
STEPS:

Executing a map task results in output being written to a local disk on the respective node, not to HDFS.
The reason for choosing the local disk over HDFS is to avoid the replication that takes place with an HDFS store operation.
Map output is intermediate output: it is processed by reduce tasks to produce the final output.
Once the job is complete, the map output can be thrown away, so storing it in HDFS with replication would be overkill.
If a node fails before its map output has been consumed by the reduce task, Hadoop reruns the map task on another node and re-creates the map output.
STEPS:

Reduce tasks do not rely on data locality. The output of every map task is fed to the reduce task, so map output is transferred to the machine where the reduce task is running.
On this machine, the output is merged and then passed to the
user-defined reduce function.
Unlike the map output, the reduce output is stored in HDFS (the first replica is stored on the local node and the other replicas are stored on off-rack nodes). So, writing the reduce output does consume network bandwidth, but only as much as a normal HDFS write pipeline.
WORKING

Hadoop divides the job into tasks. There are two types of tasks:
 Map tasks (Splits & Mapping)
 Reduce tasks (Shuffling, Reducing)
The complete execution process (of both Map and Reduce tasks) is controlled by two types of entities:
 JobTracker: Acts as the master (responsible for complete execution of the submitted job)
 Multiple TaskTrackers: Act as slaves, each of them performing a part of the job.
For every job submitted for execution in the system, there is one JobTracker, which resides on the NameNode, and there are multiple TaskTrackers, which reside on the DataNodes.
WORKING

 A job is divided into multiple tasks, which are then run on multiple data nodes in the cluster.
 It is the responsibility of the job tracker to coordinate this activity by scheduling tasks to run on different data nodes.
 Execution of each individual task is then looked after by a task tracker, which resides on every data node executing part of the job.
 The task tracker's responsibility is to send progress reports to the job tracker.
 In addition, the task tracker periodically sends a 'heartbeat' signal to the JobTracker to notify it of the current state of the system.
 Thus, the job tracker keeps track of the overall progress of each job. In the event of a task failure, the job tracker can reschedule it on a different task tracker.
MAPREDUCE EXAMPLE: WORD COUNT
 Example: Counting the frequency of words in a document:
 1. The Map function generates key-value pairs where each key is a word and each value is 1.
 2. The Shuffle and Sort step groups identical words together.
 3. The Reduce function sums the values for each word to get the word count.
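
The following is a minimal, self-contained sketch of this word-count job using the Hadoop new (org.apache.hadoop.mapreduce) API. The class and path names are illustrative; the slides do not prescribe a particular implementation.

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

      // Map phase: for every word in the input line, emit <word, 1>.
      public static class TokenizerMapper
          extends Mapper<Object, Text, Text, IntWritable> {

        private final static IntWritable one = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
          StringTokenizer itr = new StringTokenizer(value.toString());
          while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, one);
          }
        }
      }

      // Reduce phase: sum the 1s for each word to get its frequency.
      public static class IntSumReducer
          extends Reducer<Text, IntWritable, Text, IntWritable> {

        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable val : values) {
            sum += val.get();
          }
          result.set(sum);
          context.write(key, result);
        }
      }

      // Driver: wires the mapper, combiner and reducer together and submits the job.
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);   // local aggregation before the shuffle
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory must not exist yet
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }

A hypothetical invocation would be: hadoop jar wordcount.jar WordCount /input /output (the jar name and HDFS paths are placeholders).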
BENEFITS:

 1. Scalability
 2. Fault tolerance
 3. Simplicity in handling large datasets
