MapReduce and its Phases with a Numerical Example
Last Updated : 18 May, 2023
MapReduce :-
It is a framework for writing applications that process huge amounts of
data in parallel, in a reliable manner, on large clusters of commodity
hardware.
Different Phases of MapReduce :-
The MapReduce model has three major phases and one optional phase.
Mapping
Shuffling and Sorting
Reducing
Combining (optional)
Mapping :- It is the first phase of MapReduce programming. The Mapping
phase accepts key-value pairs (k, v) as input, where the key represents
the address of each record and the value represents the entire record
content. The output of the Mapping phase is also in key-value format
(k’, v’).
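As a minimal sketch of the mapping phase, the function below turns one input pair (k, v) into one output pair (k’, v’). The name `map_record`, the tab-separated record layout, and the choice of the user ID as the output key are assumptions made for illustration; they are not part of any framework API.

```python
def map_record(offset, line):
    """Map one input pair (k, v) to one output pair (k', v').

    offset: the input key, here the position of the record (unused).
    line:   the input value, the whole record as tab-separated text.
    """
    user_id, movie_id, rating, timestamp = line.split("\t")
    # Assumed choice: key on the user, keep the movie as the value.
    return user_id, movie_id


pair = map_record(0, "196\t242\t3\t881250949")
print(pair)
```

Each record is handled on its own, which is what lets many mappers run in parallel on different parts of the input.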
Shuffling and Sorting :- The output of the various mappers (k’, v’) then
goes into the Shuffling and Sorting phase. The pairs are sorted by key,
and all values that share the same key are grouped together. The output
of the Shuffling and Sorting phase is again key-value pairs, this time
a key with an array of values (k, v[ ]).
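The grouping can be sketched in a few lines of plain Python. This is only an in-memory illustration under the assumption that all mapper output fits in one list; a real cluster does the same work across machines. The name `shuffle_and_sort` is hypothetical.

```python
from collections import defaultdict


def shuffle_and_sort(pairs):
    """Group mapper output by key and return (key, [values]) sorted by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    # Keys are unique after grouping, so sorting the items sorts by key.
    return sorted(groups.items())


mapped = [("b", 2), ("a", 1), ("b", 5), ("a", 3)]
grouped = shuffle_and_sort(mapped)
print(grouped)  # [('a', [1, 3]), ('b', [2, 5])]
```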
Reducing :- The output of the Shuffling and Sorting phase (k, v[ ]) is
the input of the Reducer phase. In this phase the reducer function’s
logic is executed and all the values collected against each key are
aggregated. The reducer consolidates the outputs of the various mappers
and computes the final output.
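A reducer receives one key together with all of its values and returns a single result for that key. The sketch below assumes the reduction is a sum; `reduce_sum` and the sample data are hypothetical and only show the shape of the step.

```python
def reduce_sum(key, values):
    """Collapse the list of values collected for one key into one result."""
    return key, sum(values)


# Input in the (k, v[ ]) form produced by shuffling and sorting.
grouped = [("a", [1, 3]), ("b", [2, 5])]
results = [reduce_sum(key, values) for key, values in grouped]
print(results)  # [('a', 4), ('b', 7)]
```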
Combining :- It is an optional phase of MapReduce. The combiner is used
to optimize the performance of a MapReduce job. It pre-aggregates each
mapper’s output locally, so less data has to pass through the Shuffling
and Sorting phase, which makes that phase quicker.
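The effect of a combiner can be illustrated with a small sketch that sums (key, count) pairs on the mapper side before anything is shuffled. The function name `combine` and the sample pairs are assumptions for illustration. A combiner of this kind is only safe when the reduce operation is associative and commutative, as a sum is.

```python
from collections import Counter


def combine(mapper_output):
    """Pre-aggregate one mapper's (key, count) pairs before the shuffle."""
    totals = Counter()
    for key, count in mapper_output:
        totals[key] += count
    return sorted(totals.items())


# Output of a single mapper: one pair per input record.
mapper_output = [("3", 1), ("3", 1), ("1", 1), ("2", 1), ("1", 1)]
combined = combine(mapper_output)
print(len(mapper_output), "pairs before combining,", len(combined), "after")
print(combined)
```

Here five pairs shrink to three, so fewer pairs need to be sent to the reducers.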
flow chart
Numerical :-
MovieLens Data
USER_ID   MOVIE_ID   RATING   TIMESTAMP
196       242        3        881250949
186       302        3        891717742
196       377        1        878887116
244       51         2        880606923
166       346        1        886397596
186       474        4        884182806
186       265        2        881171488
Solution :-
Step 1 – First we have to map the values; this happens in the first
phase of the MapReduce model. Each record is mapped to a
USER_ID:MOVIE_ID pair.
196:242 ; 186:302 ; 196:377 ; 244:51 ; 166:346 ; 186:474 ;
186:265
Step 2 – After mapping we have to shuffle and sort the values.
166:346 ; 186:302,474,265 ; 196:242,377 ; 244:51
Step 3 – After completing step 1 and step 2, we have to reduce each
key’s values.
Now, put all the values for each key together to get the solution.
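The three steps can be traced in plain Python on the rows of the MovieLens table. This is a sketch, not the framework itself. The reduction used in step 3, counting how many movies each user rated, is an assumption chosen for illustration.

```python
from collections import defaultdict

# (USER_ID, MOVIE_ID, RATING, TIMESTAMP) rows from the MovieLens table.
ROWS = [
    (196, 242, 3, 881250949),
    (186, 302, 3, 891717742),
    (196, 377, 1, 878887116),
    (244, 51, 2, 880606923),
    (166, 346, 1, 886397596),
    (186, 474, 4, 884182806),
    (186, 265, 2, 881171488),
]

# Step 1: map each row to a (USER_ID, MOVIE_ID) pair.
mapped = [(user, movie) for user, movie, _rating, _timestamp in ROWS]

# Step 2: shuffle and sort, grouping the movie IDs under each user ID.
groups = defaultdict(list)
for user, movie in mapped:
    groups[user].append(movie)
shuffled = sorted(groups.items())

# Step 3 (assumed reduction): count the movies rated by each user.
reduced = [(user, len(movies)) for user, movies in shuffled]

print(shuffled)
print(reduced)
```

Under that assumption the reduced output is one pair per user: 166:1 ; 186:3 ; 196:2 ; 244:1.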
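CODE FOR MAPPER AND REDUCER TOGETHER:
Python3
from mrjob.job import MRJob
from mrjob.step import MRStep


class RatingsBreak(MRJob):
    def steps(self):
        # A single step: one mapper followed by one reducer.
        return [
            MRStep(mapper=self.mapper_get_ratings,
                   reducer=self.reducer_count_ratings)
        ]

    # MAPPER CODE
    def mapper_get_ratings(self, _, line):
        # Each input line is one tab-separated record.
        (User_id, Movie_id, Rating, Timestamp) = line.split('\t')
        # Emit the rating as the key with a count of 1.
        yield Rating, 1

    # REDUCER CODE
    def reducer_count_ratings(self, key, values):
        # Sum the 1s to count how often each rating occurs.
        yield key, sum(values)


if __name__ == '__main__':
    RatingsBreak.run()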
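To check the logic without a cluster or mrjob installed, the same idea can be sketched in plain Python, assuming the mapper emits (rating, 1) for every record. `SAMPLE_LINES`, `mapper` and `reducer` are hypothetical names, and the sample lines are the seven rows of the MovieLens table written with tab separators.

```python
from collections import defaultdict

# The seven table rows as tab-separated lines (assumed input format).
SAMPLE_LINES = [
    "196\t242\t3\t881250949",
    "186\t302\t3\t891717742",
    "196\t377\t1\t878887116",
    "244\t51\t2\t880606923",
    "166\t346\t1\t886397596",
    "186\t474\t4\t884182806",
    "186\t265\t2\t881171488",
]


def mapper(line):
    """Emit (rating, 1) for one record."""
    user_id, movie_id, rating, timestamp = line.split("\t")
    return rating, 1


def reducer(key, values):
    """Count how many records carried this rating."""
    return key, sum(values)


# Stand-in for shuffle and sort: group the 1s under each rating.
groups = defaultdict(list)
for line in SAMPLE_LINES:
    rating, one = mapper(line)
    groups[rating].append(one)

breakdown = [reducer(rating, ones) for rating, ones in sorted(groups.items())]
print(breakdown)
```

For these rows the ratings 1, 2 and 3 each occur twice and the rating 4 occurs once. With mrjob installed, a job script like the one above is normally run by passing the input file on the command line, for example `python ratings_break.py u.data`, where both file names are placeholders.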