MapReduce Theory
MapReduce is a programming model for processing large datasets in parallel across a distributed cluster of computers. The model consists of two main user-defined steps, the Map step and the Reduce step, with an intermediate Shuffle and Sort step that the framework performs between them.

1. Map Step:
In this step, the input data is divided into chunks, and the map function processes each chunk in parallel, emitting intermediate key-value pairs derived from the input records.

2. Shuffle and Sort Step:
After the map function has run, the system groups and sorts the intermediate key-value pairs by key, so that all values belonging to the same key end up together.

3. Reduce Step:
In this step, the reduce function takes each key together with its grouped values and combines or aggregates them to produce the final result.

MapReduce Workflow:
Map: Breaks the data into smaller, manageable chunks and processes them in parallel.
Shuffle: Groups and sorts the intermediate key-value pairs by key.
Reduce: Combines or aggregates the processed results into the final output.

MapReduce enables large-scale data processing because the work runs across many machines, with each portion of the data processed independently and in parallel.
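
To make the three phases concrete, here is a minimal single-machine sketch in Python. The function names map_phase, shuffle_phase, and reduce_phase are illustrative rather than part of any framework API, and the sketch makes the shuffle explicit as a group-by-key step; a fuller word-count walkthrough follows later in this document.

# A minimal sketch of the three phases on a single machine. In a real
# framework such as Hadoop, the map and reduce calls run in parallel across
# many machines and the framework itself performs the shuffle.
from collections import defaultdict

def map_phase(lines):
    # Map: turn each input line into intermediate (key, value) pairs.
    for line in lines:
        for word in line.split():
            yield (word, 1)

def shuffle_phase(pairs):
    # Shuffle and sort: group all values that share the same key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: aggregate each key's list of values into a final result.
    return {key: sum(values) for key, values in groups.items()}

lines = ["Hello World", "Hello Hadoop", "Hadoop is great"]
print(reduce_phase(shuffle_phase(map_phase(lines))))
# {'Hello': 2, 'World': 1, 'Hadoop': 2, 'is': 1, 'great': 1}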

Key Features of MapReduce:


Scalability: Can handle petabytes of data by distributing the computation across a cluster of
machines.
Fault Tolerance: If a task fails on one machine, it is reassigned to another, ensuring reliable
execution.
Parallel Processing: Processes data simultaneously on multiple machines, speeding up
computation.
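
As a rough single-machine analogue of the parallel-processing idea, the sketch below uses Python's multiprocessing.Pool to apply the map function to several chunks at once; on a real cluster the framework would spread these chunks across separate machines rather than local worker processes.

# Illustrative sketch: run the map function on several chunks in parallel
# using local worker processes (a stand-in for a cluster of machines).
from multiprocessing import Pool

def map_chunk(line):
    # Emit (word, 1) for every word in one chunk (here, one line of text).
    return [(word, 1) for word in line.split()]

if __name__ == "__main__":
    chunks = ["Hello World", "Hello Hadoop", "Hadoop is great"]
    with Pool(processes=4) as pool:
        # Each chunk is processed by a separate worker process.
        mapped = pool.map(map_chunk, chunks)
    # Flatten the per-chunk outputs into one list of key-value pairs.
    pairs = [pair for chunk_output in mapped for pair in chunk_output]
    print(pairs)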

Simple Example of MapReduce

Problem: Count the number of occurrences of each word in a large text file.

Input Data:

Hello World Hello Hadoop Hadoop is great

Step-by-Step Process:

1. Map Step:
Each line of the text is split into words, and for each word, we emit a key-value pair. The key is the
word, and the value is 1 (representing the occurrence of that word).
Input:

Hello World Hello Hadoop Hadoop is great

Map Output (Key-Value Pairs):

(Hello, 1) (World, 1) (Hello, 1) (Hadoop, 1) (Hadoop, 1) (is, 1) (great, 1)

2. Shuffle and Sort Step:


The intermediate data is grouped by keys (the words), so all occurrences of the same word are
grouped together.
Grouped Data:

(Hello, [1, 1]) (World, [1]) (Hadoop, [1, 1]) (is, [1]) (great, [1])

3. Reduce Step:
The reduce function takes each group of values associated with a key and aggregates them (sums
them up in this case).
Reduce Output:

(Hello, 2) (World, 1) (Hadoop, 2) (is, 1) (great, 1)

Final Output:
The final result is the word count for each word in the input text:

Hello: 2 World: 1 Hadoop: 2 is: 1 great: 1

Code Example (Python):

Here's a simple Python implementation of the word-count example. Note that this version folds the shuffle step into the reduce function, which groups and sums the counts for each word in a single pass:

# Map function
def map_function(input_data):
    output = []
    for line in input_data:
        for word in line.split():
            output.append((word, 1))
    return output

# Reduce function
def reduce_function(mapped_data):
    result = {}
    for word, count in mapped_data:
        if word in result:
            result[word] += count
        else:
            result[word] = count
    return result

# Example usage
input_data = ["Hello World", "Hello Hadoop", "Hadoop is great"]

# Map phase
mapped_data = map_function(input_data)

# Reduce phase
final_result = reduce_function(mapped_data)
print(final_result)

Output:

{'Hello': 2, 'World': 1, 'Hadoop': 2, 'is': 1, 'great': 1}

Conclusion:
Big Data refers to large-scale datasets that require specialized tools for storage and
processing.
MapReduce is a powerful method for parallel processing of large datasets, where data is
processed in the form of key-value pairs.
The process involves mapping data into smaller chunks, shuffling and sorting the intermediate
data, and reducing it to a final result.

MapReduce is particularly effective for distributed computing environments like Hadoop, where
data is spread across many machines.
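
As a rough illustration of how the same word count might look on an actual Hadoop cluster, below is a sketch in the style of Hadoop Streaming, where ordinary scripts read from standard input and write tab-separated key-value pairs to standard output. The file names mapper.py and reducer.py are illustrative; in practice they would be supplied to Hadoop Streaming through its -mapper and -reducer options, and the framework performs the input splitting, shuffle, and sort between the two scripts.

# mapper.py (illustrative name): emit one "word<TAB>1" line per word.
import sys

for line in sys.stdin:
    for word in line.split():
        print(f"{word}\t1")

Because Hadoop sorts the mapper output by key before it reaches the reducer, all counts for a given word arrive as consecutive lines and can be summed in a single pass:

# reducer.py (illustrative name): input arrives sorted by word.
import sys

current_word = None
current_count = 0

for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t")
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word = word
        current_count = int(count)

if current_word is not None:
    print(f"{current_word}\t{current_count}")

This mirrors the three phases described above: the two scripts supply the map and reduce logic, while Hadoop itself handles splitting the input, the shuffle and sort, and re-running failed tasks on other machines.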
