Experiment 03

Big Data Analysis

Harsh Suryanath Nag

201070046
15th February, 2024
Aim
Create MapReduce code for word count and execute it on a multi-node Hadoop cluster.

Theory
MapReduce

MapReduce is a processing technique and a programming model for distributed computing, implemented
in Hadoop in Java. The MapReduce algorithm contains two important tasks, namely Map and Reduce. Map
takes a set of data and converts it into another set of data, where individual elements are broken
down into tuples (key/value pairs). The Reduce task then takes the output from a map as its input
and combines those data tuples into a smaller set of tuples. As the name MapReduce implies, the
reduce task is always performed after the map task.

The major advantage of MapReduce is that it is easy to scale data processing over multiple
computing nodes. Under the MapReduce model, the data processing primitives are called
mappers and reducers. Decomposing a data processing application into mappers and reducers is
sometimes nontrivial. But, once we write an application in the MapReduce form, scaling the
application to run over hundreds, thousands, or even tens of thousands of machines in a cluster
is merely a configuration change. This simple scalability is what has attracted many programmers
to use the MapReduce model.

Algorithm

● Generally the MapReduce paradigm is based on sending the computation to where the data
resides, rather than moving the data to the computation.
● MapReduce program executes in three stages, namely map stage, shuffle stage, and
reduce stage.
a. Map stage − The map or mapper’s job is to process the input data. Generally the
input data is in the form of a file or directory and is stored in the Hadoop Distributed
File System (HDFS). The input file is passed to the mapper function line by line. The
mapper processes the data and creates several small chunks of data.
b. Reduce stage − This stage is the combination of the Shuffle stage and the Reduce
stage. The Reducer’s job is to process the data that comes from the mapper. After
processing, it produces a new set of output, which will be stored in the HDFS.

● During a MapReduce job, Hadoop sends the Map and Reduce tasks to the appropriate
servers in the cluster.
● The framework manages all the details of data-passing such as issuing tasks, verifying
task completion, and copying data around the cluster between the nodes.
● Most of the computing takes place on nodes with data on local disks that reduces the
network traffic.
● After completion of the given tasks, the cluster collects and reduces the data to form an
appropriate result, and sends it back to the Hadoop server.

Inputs and Outputs

The MapReduce framework operates on <key, value> pairs, that is, the framework views the input
to the job as a set of <key, value> pairs and produces a set of <key, value> pairs as the output of
the job, conceivably of different types.

The key and the value classes should be serializable by the framework and hence need to
implement the Writable interface. Additionally, the key classes have to implement the
WritableComparable interface to facilitate sorting by the framework. Input and Output types of a
MapReduce job − (Input) <k1, v1> → map → <k2, v2> → reduce → <k3, v3> (Output).

             Input                  Output

Map          <k1, v1>               list(<k2, v2>)

Reduce       <k2, list(v2)>         list(<k3, v3>)
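
As a concrete illustration of these types, the sketch below is the standard Hadoop word-count job
written against the org.apache.hadoop.mapreduce API: the mapper turns every input line into
<word, 1> pairs and the reducer sums the counts for each word. The class name WordCount and the
reuse of the reducer as a combiner are conventional choices and need not match the exact jar used
in the Procedure below.

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

      // Map stage: <offset, line> -> <word, 1>
      public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private final Text word = new Text();

        public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
          StringTokenizer itr = new StringTokenizer(value.toString());
          while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, one);
          }
        }
      }

      // Reduce stage: <word, list(1)> -> <word, count>
      public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable val : values) {
            sum += val.get();
          }
          result.set(sum);
          context.write(key, result);
        }
      }

      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);   // combiner reuses the reducer logic
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // HDFS output directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }

Packaged into a jar, a class like this is what the hadoop jar command in the Procedure executes,
taking the HDFS input and output directories as its two arguments.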

Terminology

● PayLoad − Applications implement the Map and the Reduce functions, and form the core
of the job.
● Mapper − Mapper maps the input key/value pairs to a set of intermediate key/value pairs.
● NameNode − Node that manages the Hadoop Distributed File System (HDFS).
● DataNode − Node where the data resides before any processing takes
place.
● MasterNode − Node where JobTracker runs and which accepts job requests from clients.
● SlaveNode − Node where the Map and Reduce programs run.
● JobTracker − Schedules jobs and tracks the assigned jobs on the TaskTracker.
● TaskTracker − Tracks the tasks and reports status to the JobTracker.
● Job − An execution of a Mapper and a Reducer across a dataset.
● Task − An execution of a Mapper or a Reducer on a slice of data.
● Task Attempt − A particular instance of an attempt to execute a task on a SlaveNode.

Procedure
1. Start the necessary containers using docker-compose
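For a typical docker-hadoop setup this is a single command, run from the directory containing docker-compose.yml (the service names depend on the compose file used):

    docker-compose up -d
    docker ps            # check that the namenode, datanode(s) and YARN services are running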

2. Enter our master node container “namenode” and create the folder structure to hold the input files
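Assuming the master container is named namenode as in the compose file, a shell inside it is obtained with docker exec; the later steps use /tmp as the staging area for the .jar and .txt files:

    docker exec -it namenode bash    # open a shell inside the namenode container
    mkdir -p /tmp                    # staging directory (usually already present on the image)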

3. Download MapReduce script
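If the word-count job is taken as a prebuilt jar, a placeholder download command would look like the line below (the actual URL depends on where the jar is hosted; alternatively, the hadoop-mapreduce-examples jar shipped with Hadoop already contains a wordcount job):

    wget -O wordcount.jar <URL-of-the-wordcount-jar>    # placeholder URL, not a real address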

4. Download or create the .txt file that you want to process
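Any plain-text file works as input; for a quick test a small one can be generated on the host (file name and contents are illustrative):

    printf "hello hadoop\nhello mapreduce\nhadoop mapreduce wordcount\n" > input.txt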

5. Move our .jar & .txt file into the container
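With the illustrative file names from the previous steps, docker cp copies them from the host into the container’s /tmp directory:

    docker cp wordcount.jar namenode:/tmp/
    docker cp input.txt namenode:/tmp/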

6. Create the input folder inside our namenode container
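The input directory the job reads from lives in HDFS, so it is created with the hdfs client from inside the namenode container (the /input path is illustrative):

    hdfs dfs -mkdir -p /input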

7. View the HDFS dashboard on localhost
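The NameNode web UI is exposed on the host through the port mapping in docker-compose.yml; on Hadoop 3.x images the default UI port is 9870 (50070 on Hadoop 2.x), so the dashboard is typically reachable at:

    http://localhost:9870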

8. Copy your .txt file from /tmp into the HDFS input folder
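From inside the namenode container, the file is uploaded from the local /tmp staging area into the HDFS input directory created in step 6:

    hdfs dfs -put /tmp/input.txt /input
    hdfs dfs -ls /input              # confirm the file is now in HDFS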

9. Run the MapReduce job
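With the illustrative names used above the job is submitted with hadoop jar; the driver class name depends on how the jar was built, and the stock examples jar exposes the same job as the wordcount sub-command:

    hadoop jar /tmp/wordcount.jar WordCount /input /output
    # or, using the examples jar bundled with Hadoop:
    # hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar wordcount /input /output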

10. Check the output of the job (i.e. the word count)
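The reducer writes its results as part files inside the output directory; the word counts can be listed and printed with:

    hdfs dfs -ls /output
    hdfs dfs -cat /output/part-r-00000    # one "word<TAB>count" line per distinct word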

Conclusion
We learnt about the MapReduce paradigm within Hadoop and proceeded to implement it on a
multi-node cluster using Docker containers. Our primary objective was to develop MapReduce
code specifically designed for word counting and execute it across multiple nodes in the cluster.
To accomplish this, we configured the MapReduce code within the namenode and provided a
large data file as input. Through this hands-on exercise, we gained insights into the distributed
computing capabilities of MapReduce, enabling us to effectively harness the power of parallel
processing for large-scale data analysis tasks.
