Map Reduce
MapReduce Theory
MapReduce is a programming model for processing large datasets in parallel across a distributed
cluster of computers. The model consists of two main steps: the Map step and the Reduce step.
1. Map Step:
In this step, the input data is divided into chunks, and the map function processes these chunks in
parallel, emitting intermediate key-value pairs derived from the input records.
2. Reduce Step:
In this step, the intermediate key-value pairs are first grouped by key (the shuffle phase); the reduce
function then processes each group of values and combines or aggregates it to produce the final result.
MapReduce Workflow:
Map: Breaks the data into smaller, manageable chunks and processes them in parallel.
Shuffle: Sorts and groups the intermediate key-value pairs by key.
Reduce: Combines or aggregates the processed results into the final output.
MapReduce enables large-scale data processing by distributing the work across many machines, so that
each part of the data is processed independently and in parallel.
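To make the workflow concrete, the three phases can be simulated on a single machine. The sketch below is illustrative only; the function run_mapreduce and its arguments are assumptions made for this example, and a real framework such as Hadoop would distribute the map and reduce calls across many machines.

from collections import defaultdict

def run_mapreduce(records, map_fn, reduce_fn):
    # Map: turn every input record into zero or more (key, value) pairs.
    intermediate = []
    for record in records:
        intermediate.extend(map_fn(record))

    # Shuffle: group all values that share the same key.
    grouped = defaultdict(list)
    for key, value in intermediate:
        grouped[key].append(value)

    # Reduce: aggregate the grouped values for each key into a final result.
    return {key: reduce_fn(key, values) for key, values in grouped.items()}

# Example: word count over a list of lines.
counts = run_mapreduce(
    ["Hello World", "Hello Hadoop"],
    map_fn=lambda line: [(word, 1) for word in line.split()],
    reduce_fn=lambda word, values: sum(values),
)
# counts == {"Hello": 2, "World": 1, "Hadoop": 1}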
Problem: Count the number of occurrences of each word in a large text file.
Input Data:
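For illustration, assume a small sample text such as the following (its word counts match the pairs used in the steps below):

Hello World
Hello Hadoop
Hadoop is great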
Step-by-Step Process:
1. Map Step:
Each line of the text is split into words, and for each word, we emit a key-value pair. The key is the
word, and the value is 1 (representing the occurrence of that word).
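A mapper for this step could be written as a small Python script in the Hadoop Streaming style: it reads lines from standard input and emits one tab-separated (word, 1) pair per word. The file name mapper.py is only a placeholder for this sketch.

#!/usr/bin/env python3
# mapper.py - word-count mapper sketch (Hadoop Streaming style).
import sys

for line in sys.stdin:
    # Emit a (word, 1) pair for every word on the line.
    for word in line.strip().split():
        print(f"{word}\t1")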
Map Output:
(Hello, 1) (World, 1) (Hello, 1) (Hadoop, 1) (Hadoop, 1) (is, 1) (great, 1)
2. Shuffle Step:
The intermediate pairs are then sorted and grouped by key, so each word is associated with the list of
all values emitted for it. This grouped data is the input to the reduce step:
(Hello, [1, 1]) (World, [1]) (Hadoop, [1, 1]) (is, [1]) (great, [1])
3. Reduce Step:
The reduce function takes each group of values associated with a key and aggregates them (sums
them up in this case).
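A matching reducer can be sketched in the same style. Hadoop Streaming hands the reducer the mapper output sorted by key, so all lines for one word arrive together and can be summed with itertools.groupby. The file name reducer.py is again a placeholder.

#!/usr/bin/env python3
# reducer.py - word-count reducer sketch (Hadoop Streaming style).
import sys
from itertools import groupby

def parse(stream):
    # Each incoming line has the form "word<TAB>count" produced by the mapper.
    for line in stream:
        word, _, count = line.rstrip("\n").partition("\t")
        yield word, int(count)

# Input arrives sorted by key, so groupby gathers all counts for each word.
for word, pairs in groupby(parse(sys.stdin), key=lambda kv: kv[0]):
    print(f"{word}\t{sum(count for _, count in pairs)}")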
Reduce Output:
(Hello, 2) (World, 1) (Hadoop, 2) (is, 1) (great, 1)
Final Output:
The final result is the word count for each word in the input text:
Output:
Hello: 2
World: 1
Hadoop: 2
is: 1
great: 1
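Assuming the mapper.py and reducer.py sketches above, the whole flow can be checked locally before running it on a cluster, with sort standing in for the shuffle phase (input.txt stands for the sample text):

cat input.txt | python3 mapper.py | sort | python3 reducer.py

On a real Hadoop cluster, the same two scripts would instead be passed to the Hadoop Streaming jar as the mapper and reducer.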
Conclusion:
Big Data refers to large-scale datasets that require specialized tools for storage and
processing.
MapReduce is a powerful method for parallel processing of large datasets, where data is
processed in the form of key-value pairs.
The process involves mapping data into smaller chunks, shuffling and sorting the intermediate
data, and reducing it to a final result.
MapReduce is particularly effective for distributed computing environments like Hadoop, where
data is spread across many machines.