Map Reduce and Its Phases With Numerical Example. - GeeksforGeeks

The document explains the MapReduce framework, which allows for the parallel processing of large data sets across clusters of hardware. It details the three main phases of MapReduce: Mapping, Shuffling and Sorting, and Reducing, along with an optional Combining phase to optimize performance. A numerical example using MovieLens data illustrates the mapping, shuffling, and reducing processes, accompanied by sample Python code for implementing the mapper and reducer.

Uploaded by

ramyatech25
Map Reduce and its Phases with numerical example.
Last Updated : 18 May, 2023

Map Reduce :-
It is a framework for writing applications that process huge amounts
of data in parallel on large clusters of commodity hardware in a
reliable manner.
Different Phases of MapReduce:-
The MapReduce model has three major phases and one optional phase.

Mapping
Shuffling and Sorting
Reducing
Combining

Mapping :- It is the first phase of MapReduce programming. The Mapping
phase accepts key-value pairs (k, v) as input, where the key
represents the address of each record and the value represents the
entire record content. The output of the Mapping phase is also in
key-value format (k’, v’).
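
As a minimal sketch of this phase, assume each record is a tab-separated line and the record's position in the input serves as the input key. The function name `map_record` and the choice of user ID as the output key are illustrative assumptions, not part of any framework.

```python
def map_record(offset, line):
    """Map one input pair (k, v) = (offset, line) to a list of (k', v') pairs.

    Assumption: each line is 'user_id<TAB>movie_id<TAB>rating<TAB>timestamp'.
    """
    user_id, movie_id, _rating, _timestamp = line.split("\t")
    # The input key (offset) is not needed here; the output key is the user ID.
    return [(user_id, movie_id)]


records = ["196\t242\t3\t881250949", "186\t302\t3\t891717742"]
mapped = []
for offset, line in enumerate(records):
    mapped.extend(map_record(offset, line))

print(mapped)  # [('196', '242'), ('186', '302')]
```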

Shuffling and Sorting :- The outputs of the various mappers (k’, v’)
then go into the Shuffling and Sorting phase. All values that share the
same key are grouped together, and the keys are sorted.
The output of the Shuffling and Sorting phase is again key-value pairs,
this time a key and an array of values (k, v[ ]).
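
The grouping can be sketched in a few lines of plain Python. This is only an in-memory illustration under the assumption that all mapper output fits in one list; a real cluster moves the pairs between machines. The name `shuffle_and_sort` is hypothetical.

```python
from collections import defaultdict


def shuffle_and_sort(pairs):
    """Turn a list of (k', v') pairs into a key-sorted list of (k, v[]) pairs."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    # Keys are unique after grouping, so sorting orders the groups by key.
    return sorted(groups.items())


mapped = [("b", 3), ("a", 1), ("b", 2), ("a", 4)]
grouped = shuffle_and_sort(mapped)
print(grouped)  # [('a', [1, 4]), ('b', [3, 2])]
```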

Reducer :- The output of the Shuffling and Sorting phase (k, v[ ]) is
the input of the Reducer phase. In this phase the reducer function’s logic is
executed and all the values are collected against their corresponding
keys. The reducer consolidates the outputs of the various mappers and computes the
final output.
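
A reducer can be sketched as a function that receives one key with its array of values and returns one result. Summation is used here purely as an example of reducer logic, and the name `reduce_values` is an assumption for illustration.

```python
def reduce_values(key, values):
    """Collapse one (k, v[]) pair into a single (k, result) pair by summing."""
    return key, sum(values)


# Input in the (k, v[]) shape produced by the Shuffling and Sorting phase.
grouped = [("a", [1, 4]), ("b", [3, 2])]
reduced = [reduce_values(key, values) for key, values in grouped]
print(reduced)  # [('a', 5), ('b', 5)]
```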
Combining :- It is an optional phase in MapReduce. The
combiner is used to optimize the performance of a MapReduce job: it
aggregates each mapper's output locally before that output is shuffled,
so the Shuffling and Sorting phase has less data to move and runs
quicker.
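
The effect of a combiner can be illustrated with a small sketch. It assumes the job is a count (each mapper emits a 1 per key), so partial sums computed on the mapper's side give the same final answer. The name `combine` is hypothetical.

```python
from collections import Counter


def combine(pairs):
    """Locally pre-aggregate one mapper's (key, count) pairs before the shuffle."""
    local = Counter()
    for key, value in pairs:
        local[key] += value
    return sorted(local.items())


# Output of a single mapper: one (key, 1) pair per record it saw.
mapper_output = [("3", 1), ("1", 1), ("3", 1), ("3", 1), ("1", 1)]
combined = combine(mapper_output)

print(len(mapper_output), "pairs before combining,", len(combined), "after")
print(combined)  # [('1', 2), ('3', 3)]
```

Combining like this is only safe when the reduce operation gives the same result on partial aggregates, as a sum or a count does.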

flow chart

Numerical:-
MovieLens Data

USER_ID  MOVIE_ID  RATING  TIMESTAMP
196      242       3       881250949
186      302       3       891717742
196      377       1       878887116
244      51        2       880606923
166      346       1       886397596
186      474       4       884182806
186      265       2       881171488

Solution : –
Step 1 – First we have to map the values; this happens in the 1st phase of
the MapReduce model. Each record is mapped to a USER_ID:MOVIE_ID pair.

196:242 ; 186:302 ; 196:377 ; 244:51 ; 166:346 ; 186:474 ; 186:265

Step 2 – After Mapping we have to shuffle and sort the values.

166:346 ; 186:302,474,265 ; 196:242,377 ; 244:51

Step 3 – After completion of Step 1 and Step 2 we have to reduce each
key’s values.

Now, put all values together

Solution
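
The three steps can be traced in plain Python on the seven rows of the table. The reduction rule is an assumption: the text does not spell out what Step 3 computes, so this sketch counts the movies rated by each user.

```python
from collections import defaultdict

# Rows from the MovieLens table: (user_id, movie_id, rating, timestamp).
rows = [
    (196, 242, 3, 881250949),
    (186, 302, 3, 891717742),
    (196, 377, 1, 878887116),
    (244, 51, 2, 880606923),
    (166, 346, 1, 886397596),
    (186, 474, 4, 884182806),
    (186, 265, 2, 881171488),
]

# Step 1: map each row to a (user_id, movie_id) pair.
mapped = [(user, movie) for user, movie, _rating, _timestamp in rows]

# Step 2: shuffle and sort, grouping movie IDs under each user ID.
groups = defaultdict(list)
for user, movie in mapped:
    groups[user].append(movie)
shuffled = sorted(groups.items())

# Step 3 (assumed rule): reduce each user's list to the number of movies rated.
reduced = [(user, len(movies)) for user, movies in shuffled]

print(shuffled)  # [(166, [346]), (186, [302, 474, 265]), (196, [242, 377]), (244, [51])]
print(reduced)   # [(166, 1), (186, 3), (196, 2), (244, 1)]
```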

CODE FOR MAPPER AND REDUCER TOGETHER:

Python3

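from mrjob.job import MRJob
from mrjob.step import MRStep


class RatingsBreak(MRJob):
    def steps(self):
        # A single step: one mapper followed by one reducer.
        return [
            MRStep(mapper=self.mapper_get_ratings,
                   reducer=self.reducer_count_ratings)
        ]

    # MAPPER CODE
    def mapper_get_ratings(self, _, line):
        # Each input line is tab-separated: user ID, movie ID, rating, timestamp.
        (user_id, movie_id, rating, timestamp) = line.split('\t')
        # Emit (rating, 1) so the reducer can count how often each rating occurs.
        yield rating, 1

    # REDUCER CODE
    def reducer_count_ratings(self, key, values):
        # Sum the 1s emitted for this rating value.
        yield key, sum(values)


if __name__ == '__main__':
    RatingsBreak.run()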
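
Note that this job counts how many times each rating value occurs, which is a different question from the user-to-movie grouping in the numerical example. To check the logic without installing mrjob, the same mapper and reducer can be run by hand on the seven sample rows. This is a sketch: the small driver loop below stands in for the framework and is not how mrjob itself executes a job.

```python
from collections import defaultdict

# The seven sample rows as tab-separated lines.
lines = [
    "196\t242\t3\t881250949",
    "186\t302\t3\t891717742",
    "196\t377\t1\t878887116",
    "244\t51\t2\t880606923",
    "166\t346\t1\t886397596",
    "186\t474\t4\t884182806",
    "186\t265\t2\t881171488",
]


def mapper_get_ratings(line):
    _user_id, _movie_id, rating, _timestamp = line.split("\t")
    yield rating, 1


def reducer_count_ratings(key, values):
    yield key, sum(values)


# Stand-in for the shuffle: group the mapper output by key.
grouped = defaultdict(list)
for line in lines:
    for key, value in mapper_get_ratings(line):
        grouped[key].append(value)

# Run the reducer once per key, in sorted key order.
result = {}
for key in sorted(grouped):
    for out_key, total in reducer_count_ratings(key, grouped[key]):
        result[out_key] = total

print(result)  # {'1': 2, '2': 2, '3': 2, '4': 1}
```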
