
MapReduce is a programming model and an associated implementation for processing and

generating large data sets with a parallel, distributed algorithm on a cluster. It was originally
developed by Google and is widely used in various data processing frameworks, including
Hadoop. Understanding MapReduce is crucial for anyone diving into the fundamentals of
data science, particularly in the context of big data.

Key Concepts of MapReduce

1. Map Function:
o The map function takes an input pair and produces a set of intermediate
key/value pairs.
o A typical implementation of the map function involves splitting a large data
set into smaller sub-problems and processing them in parallel.
2. Reduce Function:
o The reduce function takes the intermediate key/value pairs produced by the
map function and combines them to form a smaller set of values.
o The reduce function typically performs a summary operation, such as counting
occurrences or averaging.
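
The two functions are often summarized by their type signatures. The sketch below restates them in Python-style notation for orientation; the k1/v1 and k2/v2 names are conventional, following the original MapReduce paper.

# Conceptual signatures of the two user-supplied functions
# (k1/v1 are the input key/value types, k2/v2 the intermediate types):
#
#   map:    (k1, v1)        -> list[(k2, v2)]   # one input pair, many intermediate pairs
#   reduce: (k2, [v2, ...]) -> list[v2]         # one key with all its values, a smaller result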

How MapReduce Works

1. Splitting: The input data is split into fixed-size chunks, which are then processed in
parallel by the map tasks.
2. Mapping: Each map task processes a chunk of data and produces intermediate
key/value pairs.
3. Shuffling and Sorting: The intermediate data is shuffled (distributed across nodes)
and sorted by key. This is a critical step for ensuring that all values associated with
the same key are brought together.
4. Reducing: Reduce tasks process the sorted intermediate data, applying the reduce
function to generate the final output.

Example of MapReduce

Word Count Example

Let's consider a simple example of counting the number of occurrences of each word in a
large text document.

1. Map Function:
o Input: A chunk of the text document.
o Process: For each word in the chunk, emit (word, 1).
o Output: Intermediate key/value pairs like (word, 1).
2. Shuffle and Sort:
o Group all intermediate key/value pairs by key (word).
3. Reduce Function:
o Input: A key (word) and a list of values ([1, 1, 1, ...]).
o Process: Sum the values.
o Output: Final key/value pairs like (word, count).
Code Example

from collections import defaultdict

# Map Function
def map_function(document):
    words = document.split()
    return [(word, 1) for word in words]

# Reduce Function
def reduce_function(word, counts):
    return (word, sum(counts))

# Example Data
documents = ["cat dog", "cat cat", "dog"]

# Applying Map Function
mapped = []
for document in documents:
    mapped.extend(map_function(document))

# Shuffling and Sorting
shuffled = defaultdict(list)
for key, value in mapped:
    shuffled[key].append(value)

# Applying Reduce Function
reduced = []
for key in shuffled:
    reduced.append(reduce_function(key, shuffled[key]))

print(reduced)

Output:

[('cat', 3), ('dog', 2)]

Importance in Data Science

1. Scalability: MapReduce allows for the processing of vast amounts of data across
many machines.
2. Fault Tolerance: The system is designed to handle failures gracefully, making it
robust for large-scale data processing.
3. Parallel Processing: By dividing tasks into smaller chunks, MapReduce leverages
parallel processing to speed up data analysis.
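
To make the parallel-processing point concrete, here is a minimal single-machine sketch that reuses the map_function from the word-count example and runs the map phase across worker processes with Python's multiprocessing module. This is a stand-in for the cluster-level parallelism MapReduce provides, not how Hadoop itself schedules tasks.

# Map phase parallelized across processes (single-machine illustration).
from collections import defaultdict
from multiprocessing import Pool

def map_function(document):
    # Same mapper as the word-count example: emit (word, 1) per word.
    return [(word, 1) for word in document.split()]

if __name__ == "__main__":
    documents = ["cat dog", "cat cat", "dog"]

    # Each document (one input split) is mapped by a separate worker process.
    with Pool(processes=3) as pool:
        mapped_chunks = pool.map(map_function, documents)

    # Shuffle: group intermediate values by key, then reduce by summing.
    shuffled = defaultdict(list)
    for chunk in mapped_chunks:
        for key, value in chunk:
            shuffled[key].append(value)

    print({word: sum(counts) for word, counts in shuffled.items()})
    # {'cat': 3, 'dog': 2}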

Applications in Data Science

• Log Analysis: Analyzing server logs to extract useful information like error rates or user activity.
• Indexing: Building search indexes for large-scale search engines.
• Data Transformation: Converting data from one format to another, such as transforming raw data into a structured format.
• Machine Learning: Preprocessing data and implementing machine learning algorithms that can be parallelized.
Understanding MapReduce is a fundamental step in mastering data science, especially when
dealing with large datasets that require efficient processing. This model provides a scalable,
reliable, and straightforward approach to big data analysis, making it a powerful tool in a data
scientist's toolkit.

MapReduce Architecture



MapReduce and HDFS are the two major components of Hadoop that make it so powerful and efficient to use. MapReduce is a programming model for processing large data sets in parallel, in a distributed manner: the data is first split, processed, and then combined to produce the final result. MapReduce libraries have been written in many programming languages, each with its own optimizations. The purpose of MapReduce in Hadoop is to map each job into smaller, equivalent tasks and then reduce their results, which lowers overhead on the cluster network and reduces the processing power required. A MapReduce task is divided into two main phases: the Map phase and the Reduce phase.
Components of MapReduce Architecture:

1. Client: The MapReduce client is the one that submits a job to MapReduce for processing. There can be multiple clients continuously sending jobs to the Hadoop MapReduce manager.
2. Job: The MapReduce job is the actual work the client wants done, made up of many smaller tasks that the client wants to process or execute.
3. Hadoop MapReduce Master: Divides a particular job into subsequent job-parts.
4. Job-Parts: The tasks or sub-jobs obtained by dividing the main job. The results of all the job-parts are combined to produce the final output.
5. Input Data: The data set fed to MapReduce for processing.
6. Output Data: The final result obtained after processing.
In MapReduce, a client submits a job of a particular size to the Hadoop MapReduce master. The master divides this job into equivalent job-parts, which are then made available to the Map and Reduce tasks. Each Map and Reduce task contains the program for the use case the particular company is solving; the developer writes the logic to fulfill that requirement. The input data is fed to the Map tasks, and each Map generates intermediate key-value pairs as its output. These key-value pairs are then fed to the Reducers, and the final output is stored on HDFS. Any number of Map and Reduce tasks can be made available for processing the data, as required. The Map and Reduce algorithms are written in a highly optimized way so that time and space complexity are kept to a minimum.
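
To make this flow concrete, here is a hedged, toy sketch of the division into job-parts described above; the function names are illustrative and are not Hadoop APIs.

# Toy illustration of the flow above: a "master" splits the submitted job
# into roughly equal job-parts, and map tasks process each part. The
# combined intermediate output stands in for what would go to the Reducers.
def split_into_parts(records, n_parts):
    # The master's division of the job into job-parts.
    size = max(1, len(records) // n_parts)
    return [records[i:i + size] for i in range(0, len(records), size)]

def map_task(part):
    # Each map task emits intermediate (word, 1) pairs for its part.
    return [(word, 1) for record in part for word in record.split()]

records = ["cat dog", "cat cat", "dog", "dog cat"]
intermediate = []
for part in split_into_parts(records, n_parts=2):
    intermediate.extend(map_task(part))
print(intermediate[:4])  # [('cat', 1), ('dog', 1), ('cat', 1), ('cat', 1)]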
Introduction To MapReduce
MapReduce is a Hadoop framework used for writing applications that can process large amounts of data on clusters. It can also be described as a programming model for processing huge datasets across clusters of computers. It allows data to be stored in a distributed form and works on huge volumes of data at enormous computing scale.

MapReduce consists of two phases: Map and Reduce. The Map phase generally deals with splitting and mapping the data, while the Reduce phase shuffles and reduces the data.
Hadoop is fully capable of running MapReduce programs written in various languages: Python, Java, and C++. This is very useful for performing large-scale data analysis using multiple machines in the cluster.
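
As one hedged illustration of running Python on Hadoop: with Hadoop Streaming, the mapper and reducer are ordinary scripts that read lines from standard input and write tab-separated key-value pairs to standard output, while Hadoop performs the shuffle and sort between them. The word-count pair below follows that convention; the exact command used to submit the streaming job depends on your Hadoop installation.

#!/usr/bin/env python3
# mapper.py -- word-count mapper in the Hadoop Streaming style:
# read raw text lines from stdin, emit one "word<TAB>1" pair per word.
import sys

for line in sys.stdin:
    for word in line.split():
        print(f"{word}\t1")

#!/usr/bin/env python3
# reducer.py -- the matching reducer. Streaming delivers the mapper
# output sorted by key, so all pairs for one word arrive contiguously.
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.strip().split("\t")
    if word != current_word:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word, current_count = word, 0
    current_count += int(count)
if current_word is not None:
    print(f"{current_word}\t{current_count}")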

Applications Of MapReduce
Entertainment: To discover the most popular movies based on what you like and what you have watched, Hadoop MapReduce can help. It works mainly from viewing logs and clicks.
E-commerce: Numerous e-commerce providers, such as Amazon, Walmart, and eBay, use the MapReduce programming model to identify favorite items based on customers' preferences or buying behavior.
This includes building product recommendation mechanisms for e-commerce catalogs, and analyzing website records, purchase histories, user interaction logs, etc.

Data Warehouse: MapReduce can be used to analyze large data volumes in data warehouses while implementing specific business logic for data insights.
Fraud Detection: Hadoop and MapReduce are used in financial industries, including organizations such as banks, insurance providers, and payment processors, for fraud detection, pattern identification, and business metrics derived through transaction analysis.
How Does MapReduce Work?
The MapReduce algorithm contains two important tasks, namely Map and Reduce.

• The Map task takes a set of data and converts it into another set of data, where individual elements are broken down into tuples (key-value pairs).
• The Reduce task takes the output from the Map as input and combines those data tuples (key-value pairs) into a smaller set of tuples.

The Reduce task is always performed after the Map job.

Input Phase − Here we have a Record Reader that translates each record in an input file and
sends the parsed data to the mapper in the form of key-value pairs.
Map − Map is a user-defined function, which takes a series of key-value pairs and processes
each one of them to generate zero or more key-value pairs.
Intermediate Keys − The key-value pairs generated by the mapper are known as intermediate
keys.
Combiner − A combiner is a type of local Reducer that groups similar data from the map phase into identifiable sets. It takes the intermediate keys from the mapper as input and applies user-defined code to aggregate the values within the small scope of one mapper. It is not part of the main MapReduce algorithm; it is optional (a minimal sketch appears after this list).
Shuffle and Sort − The Reducer task starts with the Shuffle and Sort step. It downloads the
grouped key-value pairs onto the local machine, where the Reducer is running. The individual
key-value pairs are sorted by key into a larger data list. The data list groups the equivalent keys
together so that their values can be iterated easily in the Reducer task.
Reducer − The Reducer takes the grouped key-value paired data as input and runs a Reducer function on each group. Here, the data can be aggregated, filtered, and combined in a number of ways, which can require a wide range of processing. Once the execution is over, it passes zero or more key-value pairs to the final step.
Output Phase − In the output phase, we have an output formatter that translates the final key-
value pairs from the Reducer function and writes them onto a file using a record writer.
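
As referenced in the Combiner entry above, here is a minimal hedged sketch of local aggregation: the combiner sums counts within a single mapper's output before the shuffle, so fewer pairs cross the network. The function name and data are illustrative.

# Minimal sketch of a combiner: pre-aggregate one mapper's output locally
# before the shuffle, so fewer (word, 1) pairs cross the network.
from collections import defaultdict

def combiner(mapper_output):
    # Sum counts within a single mapper's output (a "local reduce").
    local = defaultdict(int)
    for word, count in mapper_output:
        local[word] += count
    return list(local.items())

one_mapper_output = [("cat", 1), ("cat", 1), ("dog", 1), ("cat", 1)]
print(combiner(one_mapper_output))  # [('cat', 3), ('dog', 1)]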
Advantages Of MapReduce
Fault tolerance: It can handle failures without downtime.
Speed: It splits, shuffles, and reduces unstructured data in a short time.
Cost-effectiveness: Hadoop MapReduce scales out, enabling users to process or store data in a cost-effective manner.
Scalability: It provides a highly scalable framework; MapReduce allows users to run applications across many nodes.
Parallel Processing: Multiple job-parts of the same dataset can be processed in parallel, which reduces the time taken to complete a task.
Limitations Of MapReduce
• MapReduce cannot cache intermediate data in memory for later reuse, which diminishes Hadoop's performance.
• It is suitable only for batch processing of huge amounts of data.
