
MapReduce is a programming model and an associated implementation for processing and

generating large data sets with a parallel, distributed algorithm on a cluster. It was originally
developed by Google and is widely used in various data processing frameworks, including
Hadoop. Understanding MapReduce is crucial for anyone diving into the fundamentals of
data science, particularly in the context of big data.

Key Concepts of MapReduce

1. Map Function:
o The map function takes an input pair and produces a set of intermediate
key/value pairs.
o A typical implementation of the map function involves splitting a large data
set into smaller sub-problems and processing them in parallel.
2. Reduce Function:
o The reduce function takes the intermediate key/value pairs produced by the
map function and combines them to form a smaller set of values.
o The reduce function typically performs a summary operation, such as counting
occurrences or averaging.
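
The two functions are often summarized by their type signatures. The sketch below restates them in Python-style notation for orientation; the k1/v1 and k2/v2 names are conventional, following the original MapReduce paper.

# Conceptual signatures of the two user-supplied functions
# (k1/v1 are the input key/value types, k2/v2 the intermediate types):
#
#   map:    (k1, v1)        -> list[(k2, v2)]   # one input pair, many intermediate pairs
#   reduce: (k2, [v2, ...]) -> list[v2]         # one key with all its values, a smaller result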

How MapReduce Works

1. Splitting: The input data is split into fixed-size chunks, which are then processed in
parallel by the map tasks.
2. Mapping: Each map task processes a chunk of data and produces intermediate
key/value pairs.
3. Shuffling and Sorting: The intermediate data is shuffled (distributed across nodes)
and sorted by key. This is a critical step for ensuring that all values associated with
the same key are brought together.
4. Reducing: Reduce tasks process the sorted intermediate data, applying the reduce
function to generate the final output.

Example of MapReduce

Word Count Example

Let's consider a simple example of counting the number of occurrences of each word in a
large text document.

1. Map Function:
o Input: A chunk of the text document.
o Process: For each word in the chunk, emit (word, 1).
o Output: Intermediate key/value pairs like (word, 1).
2. Shuffle and Sort:
o Group all intermediate key/value pairs by key (word).
3. Reduce Function:
o Input: A key (word) and a list of values ([1, 1, 1, ...]).
o Process: Sum the values.
o Output: Final key/value pairs like (word, count).
Code Example

from collections import defaultdict

# Map Function
def map_function(document):
    words = document.split()
    return [(word, 1) for word in words]

# Reduce Function
def reduce_function(word, counts):
    return (word, sum(counts))

# Example Data
documents = ["cat dog", "cat cat", "dog"]

# Applying Map Function
mapped = []
for document in documents:
    mapped.extend(map_function(document))

# Shuffling and Sorting
shuffled = defaultdict(list)
for key, value in mapped:
    shuffled[key].append(value)

# Applying Reduce Function
reduced = []
for key in shuffled:
    reduced.append(reduce_function(key, shuffled[key]))

print(reduced)

Output:

[('cat', 3), ('dog', 2)]

Importance in Data Science

1. Scalability: MapReduce allows for the processing of vast amounts of data across
many machines.
2. Fault Tolerance: The system is designed to handle failures gracefully, making it
robust for large-scale data processing.
3. Parallel Processing: By dividing tasks into smaller chunks, MapReduce leverages
parallel processing to speed up data analysis.
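
To make the parallel-processing point concrete, here is a minimal single-machine sketch that reuses the map_function from the word-count example and runs the map phase across worker processes with Python's multiprocessing module. This is a stand-in for the cluster-level parallelism MapReduce provides, not how Hadoop itself schedules tasks.

# Map phase parallelized across processes (single-machine illustration).
from collections import defaultdict
from multiprocessing import Pool

def map_function(document):
    # Same mapper as the word-count example: emit (word, 1) per word.
    return [(word, 1) for word in document.split()]

if __name__ == "__main__":
    documents = ["cat dog", "cat cat", "dog"]

    # Each document (one input split) is mapped by a separate worker process.
    with Pool(processes=3) as pool:
        mapped_chunks = pool.map(map_function, documents)

    # Shuffle: group intermediate values by key, then reduce by summing.
    shuffled = defaultdict(list)
    for chunk in mapped_chunks:
        for key, value in chunk:
            shuffled[key].append(value)

    print({word: sum(counts) for word, counts in shuffled.items()})
    # {'cat': 3, 'dog': 2}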

Applications in Data Science

• Log Analysis: Analyzing server logs to extract useful information like error rates or user activity.
• Indexing: Building search indexes for large-scale search engines.
• Data Transformation: Converting data from one format to another, such as transforming raw data into a structured format.
• Machine Learning: Preprocessing data and implementing machine learning algorithms that can be parallelized.
Understanding MapReduce is a fundamental step in mastering data science, especially when
dealing with large datasets that require efficient processing. This model provides a scalable,
reliable, and straightforward approach to big data analysis, making it a powerful tool in a data
scientist's toolkit.

MapReduce Architecture



MapReduce and HDFS are the two major components of Hadoop that make it so powerful and efficient to use. MapReduce is a programming model for processing large data sets in parallel, in a distributed manner: the data is first split, processed, and then combined to produce the final result. MapReduce libraries have been written in many programming languages, each with its own optimizations. The purpose of MapReduce in Hadoop is to map each job into smaller, equivalent tasks and then reduce their results, which lowers overhead on the cluster network and reduces the processing power required. A MapReduce task is divided into two main phases: the Map phase and the Reduce phase.
Components of MapReduce Architecture:

1. Client: The MapReduce client is the one that submits a job to MapReduce for processing. There can be multiple clients continuously sending jobs to the Hadoop MapReduce manager.
2. Job: The MapReduce job is the actual work the client wants done, made up of many smaller tasks that the client wants to process or execute.
3. Hadoop MapReduce Master: Divides a particular job into subsequent job-parts.
4. Job-Parts: The tasks or sub-jobs obtained by dividing the main job. The results of all the job-parts are combined to produce the final output.
5. Input Data: The data set fed to MapReduce for processing.
6. Output Data: The final result obtained after processing.
In MapReduce, a client submits a job of a particular size to the Hadoop MapReduce master. The master divides this job into equivalent job-parts, which are then made available to the Map and Reduce tasks. Each Map and Reduce task contains the program for the use case the particular company is solving; the developer writes the logic to fulfill that requirement. The input data is fed to the Map tasks, and each Map generates intermediate key-value pairs as its output. These key-value pairs are then fed to the Reducers, and the final output is stored on HDFS. Any number of Map and Reduce tasks can be made available for processing the data, as required. The Map and Reduce algorithms are written in a highly optimized way so that time and space complexity are kept to a minimum.
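
To make this flow concrete, here is a hedged, toy sketch of the division into job-parts described above; the function names are illustrative and are not Hadoop APIs.

# Toy illustration of the flow above: a "master" splits the submitted job
# into roughly equal job-parts, and map tasks process each part. The
# combined intermediate output stands in for what would go to the Reducers.
def split_into_parts(records, n_parts):
    # The master's division of the job into job-parts.
    size = max(1, len(records) // n_parts)
    return [records[i:i + size] for i in range(0, len(records), size)]

def map_task(part):
    # Each map task emits intermediate (word, 1) pairs for its part.
    return [(word, 1) for record in part for word in record.split()]

records = ["cat dog", "cat cat", "dog", "dog cat"]
intermediate = []
for part in split_into_parts(records, n_parts=2):
    intermediate.extend(map_task(part))
print(intermediate[:4])  # [('cat', 1), ('dog', 1), ('cat', 1), ('cat', 1)]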
Introduction To MapReduce
MapReduce is a Hadoop framework used for writing applications that can process large amounts of data on clusters. It can also be described as a programming model for processing huge datasets across clusters of computers. It allows data to be stored in a distributed form and works on huge volumes of data at enormous computing scale.

MapReduce consists of two phases: Map and Reduce. The Map phase generally deals with splitting and mapping the data, while the Reduce phase shuffles and reduces the data.
Hadoop is fully capable of running MapReduce programs written in various languages: Python, Java, and C++. This is very useful for performing large-scale data analysis using multiple machines in the cluster.
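
As one hedged illustration of running Python on Hadoop: with Hadoop Streaming, the mapper and reducer are ordinary scripts that read lines from standard input and write tab-separated key-value pairs to standard output, while Hadoop performs the shuffle and sort between them. The word-count pair below follows that convention; the exact command used to submit the streaming job depends on your Hadoop installation.

#!/usr/bin/env python3
# mapper.py -- word-count mapper in the Hadoop Streaming style:
# read raw text lines from stdin, emit one "word<TAB>1" pair per word.
import sys

for line in sys.stdin:
    for word in line.split():
        print(f"{word}\t1")

#!/usr/bin/env python3
# reducer.py -- the matching reducer. Streaming delivers the mapper
# output sorted by key, so all pairs for one word arrive contiguously.
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.strip().split("\t")
    if word != current_word:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word, current_count = word, 0
    current_count += int(count)
if current_word is not None:
    print(f"{current_word}\t{current_count}")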

Applications Of MapReduce
Entertainment: To discover the most popular movies based on what you like and what you have watched, Hadoop MapReduce can help. It works mainly from viewing logs and clicks.
E-commerce: Numerous e-commerce providers, such as Amazon, Walmart, and eBay, use the MapReduce programming model to identify favorite items based on customers' preferences or buying behavior.
This includes building product recommendation mechanisms for e-commerce catalogs, and analyzing website records, purchase histories, user interaction logs, etc.

Data Warehouse: MapReduce can be used to analyze large data volumes in data warehouses while implementing specific business logic for data insights.
Fraud Detection: Hadoop and MapReduce are used in financial industries, including organizations such as banks, insurance providers, and payment processors, for fraud detection, pattern identification, and business metrics derived through transaction analysis.
How Does MapReduce Work?
The MapReduce algorithm contains two important tasks, namely Map and Reduce.

• The Map task takes a set of data and converts it into another set of data, where individual elements are broken down into tuples (key-value pairs).
• The Reduce task takes the output from the Map as input and combines those data tuples (key-value pairs) into a smaller set of tuples.

The Reduce task is always performed after the Map job.

Input Phase − Here we have a Record Reader that translates each record in an input file and
sends the parsed data to the mapper in the form of key-value pairs.
Map − Map is a user-defined function, which takes a series of key-value pairs and processes
each one of them to generate zero or more key-value pairs.
Intermediate Keys − The key-value pairs generated by the mapper are known as intermediate
keys.
Combiner − A combiner is a type of local Reducer that groups similar data from the map phase into identifiable sets. It takes the intermediate keys from the mapper as input and applies user-defined code to aggregate the values within the small scope of one mapper. It is not part of the main MapReduce algorithm; it is optional (a minimal sketch appears after this list).
Shuffle and Sort − The Reducer task starts with the Shuffle and Sort step. It downloads the
grouped key-value pairs onto the local machine, where the Reducer is running. The individual
key-value pairs are sorted by key into a larger data list. The data list groups the equivalent keys
together so that their values can be iterated easily in the Reducer task.
Reducer − The Reducer takes the grouped key-value paired data as input and runs a Reducer function on each group. Here, the data can be aggregated, filtered, and combined in a number of ways, which can require a wide range of processing. Once the execution is over, it passes zero or more key-value pairs to the final step.
Output Phase − In the output phase, we have an output formatter that translates the final key-
value pairs from the Reducer function and writes them onto a file using a record writer.
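
As referenced in the Combiner entry above, here is a minimal hedged sketch of local aggregation: the combiner sums counts within a single mapper's output before the shuffle, so fewer pairs cross the network. The function name and data are illustrative.

# Minimal sketch of a combiner: pre-aggregate one mapper's output locally
# before the shuffle, so fewer (word, 1) pairs cross the network.
from collections import defaultdict

def combiner(mapper_output):
    # Sum counts within a single mapper's output (a "local reduce").
    local = defaultdict(int)
    for word, count in mapper_output:
        local[word] += count
    return list(local.items())

one_mapper_output = [("cat", 1), ("cat", 1), ("dog", 1), ("cat", 1)]
print(combiner(one_mapper_output))  # [('cat', 3), ('dog', 1)]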
Advantages Of MapReduce
Fault tolerance: It can handle failures without downtime.
Speed: It splits, shuffles, and reduces unstructured data in a short time.
Cost-effectiveness: Hadoop MapReduce scales out, enabling users to process or store data in a cost-effective manner.
Scalability: It provides a highly scalable framework; MapReduce allows users to run applications across many nodes.
Parallel Processing: Multiple job-parts of the same dataset can be processed in parallel, which reduces the time taken to complete a task.
Limitations Of MapReduce
• MapReduce cannot cache intermediate data in memory for later reuse, which diminishes Hadoop's performance.
• It is suitable only for batch processing of huge amounts of data.
