Experiment 03

Big Data Analysis

Harsh Suryanath Nag

201070046
15th February, 2024
Aim
Create MapReduce code for word count and execute it on a multi-node Hadoop cluster.

Theory
MapReduce

MapReduce is a processing technique and a programming model for distributed computing, implemented
in Hadoop in Java. The MapReduce algorithm contains two important tasks, namely Map and Reduce. Map
takes a set of data and converts it into another set of data, where individual elements are broken
down into tuples (key/value pairs). The Reduce task then takes the output from a map as its input
and combines those data tuples into a smaller set of tuples. As the name MapReduce implies, the
reduce task is always performed after the map task.

The major advantage of MapReduce is that it is easy to scale data processing over multiple
computing nodes. Under the MapReduce model, the data processing primitives are called
mappers and reducers. Decomposing a data processing application into mappers and reducers is
sometimes nontrivial. But, once we write an application in the MapReduce form, scaling the
application to run over hundreds, thousands, or even tens of thousands of machines in a cluster
is merely a configuration change. This simple scalability is what has attracted many programmers
to use the MapReduce model.

Algorithm

● Generally the MapReduce paradigm is based on sending the computation to where the data
resides, rather than moving the data to the computation.
● MapReduce program executes in three stages, namely map stage, shuffle stage, and
reduce stage.
a. Map stage − The map or mapper’s job is to process the input data. Generally the
input data is in the form of a file or directory and is stored in the Hadoop Distributed
File System (HDFS). The input file is passed to the mapper function line by line. The
mapper processes the data and creates several small chunks of data.
b. Reduce stage − This stage is the combination of the Shuffle stage and the Reduce
stage. The Reducer’s job is to process the data that comes from the mapper. After
processing, it produces a new set of output, which will be stored in the HDFS.

● During a MapReduce job, Hadoop sends the Map and Reduce tasks to the appropriate
servers in the cluster.
● The framework manages all the details of data-passing such as issuing tasks, verifying
task completion, and copying data around the cluster between the nodes.
● Most of the computing takes place on nodes with data on local disks that reduces the
network traffic.
● After completion of the given tasks, the cluster collects and reduces the data to form an
appropriate result, and sends it back to the Hadoop server.

Inputs and Outputs

The MapReduce framework operates on <key, value> pairs, that is, the framework views the input
to the job as a set of <key, value> pairs and produces a set of <key, value> pairs as the output of
the job, conceivably of different types.

The key and the value classes should be serializable by the framework and hence need to
implement the Writable interface. Additionally, the key classes have to implement the
WritableComparable interface to facilitate sorting by the framework. Input and Output types of a
MapReduce job − (Input) <k1, v1> → map → <k2, v2> → reduce → <k3, v3> (Output).

             Input                  Output

Map          <k1, v1>               list(<k2, v2>)

Reduce       <k2, list(v2)>         list(<k3, v3>)
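
As a concrete illustration of these types, the sketch below is the standard Hadoop word-count job
written against the org.apache.hadoop.mapreduce API: the mapper turns every input line into
<word, 1> pairs and the reducer sums the counts for each word. The class name WordCount and the
reuse of the reducer as a combiner are conventional choices and need not match the exact jar used
in the Procedure below.

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

      // Map stage: <offset, line> -> <word, 1>
      public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private final Text word = new Text();

        public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
          StringTokenizer itr = new StringTokenizer(value.toString());
          while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, one);
          }
        }
      }

      // Reduce stage: <word, list(1)> -> <word, count>
      public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable val : values) {
            sum += val.get();
          }
          result.set(sum);
          context.write(key, result);
        }
      }

      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);   // combiner reuses the reducer logic
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // HDFS output directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }

Packaged into a jar, a class like this is what the hadoop jar command in the Procedure executes,
taking the HDFS input and output directories as its two arguments.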

Terminology

● PayLoad − Applications implement the Map and the Reduce functions, and form the core
of the job.
● Mapper − Mapper maps the input key/value pairs to a set of intermediate key/value pairs.
● NameNode − Node that manages the Hadoop Distributed File System (HDFS).
● DataNode − Node where the data resides before any processing takes
place.
● MasterNode − Node where JobTracker runs and which accepts job requests from clients.
● SlaveNode − Node where the Map and Reduce programs run.
● JobTracker − Schedules jobs and tracks the assigned jobs on the TaskTracker.
● TaskTracker − Tracks the tasks and reports status to the JobTracker.
● Job − An execution of a Mapper and a Reducer across a dataset.
● Task − An execution of a Mapper or a Reducer on a slice of data.
● Task Attempt − A particular instance of an attempt to execute a task on a SlaveNode.

Procedure
1. Start the necessary containers using docker-compose
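For a typical docker-hadoop setup this is a single command, run from the directory containing docker-compose.yml (the service names depend on the compose file used):

    docker-compose up -d
    docker ps            # check that the namenode, datanode(s) and YARN services are running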

2. Enter our master node container “namenode” and create the folder structure to hold the input files
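Assuming the master container is named namenode as in the compose file, a shell inside it is obtained with docker exec; the later steps use /tmp as the staging area for the .jar and .txt files:

    docker exec -it namenode bash    # open a shell inside the namenode container
    mkdir -p /tmp                    # staging directory (usually already present on the image)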

3. Download MapReduce script
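If the word-count job is taken as a prebuilt jar, a placeholder download command would look like the line below (the actual URL depends on where the jar is hosted; alternatively, the hadoop-mapreduce-examples jar shipped with Hadoop already contains a wordcount job):

    wget -O wordcount.jar <URL-of-the-wordcount-jar>    # placeholder URL, not a real address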

4. Download or create the .txt file that you want to process
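Any plain-text file works as input; for a quick test a small one can be generated on the host (file name and contents are illustrative):

    printf "hello hadoop\nhello mapreduce\nhadoop mapreduce wordcount\n" > input.txt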

5. Move our .jar & .txt file into the container
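With the illustrative file names from the previous steps, docker cp copies them from the host into the container’s /tmp directory:

    docker cp wordcount.jar namenode:/tmp/
    docker cp input.txt namenode:/tmp/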

6. Create the input folder inside our namenode container
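The input directory the job reads from lives in HDFS, so it is created with the hdfs client from inside the namenode container (the /input path is illustrative):

    hdfs dfs -mkdir -p /input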

7. View the HDFS dashboard on localhost
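The NameNode web UI is exposed on the host through the port mapping in docker-compose.yml; on Hadoop 3.x images the default UI port is 9870 (50070 on Hadoop 2.x), so the dashboard is typically reachable at:

    http://localhost:9870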

8. Copy your .txt file from /tmp into the HDFS input folder
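From inside the namenode container, the file is uploaded from the local /tmp staging area into the HDFS input directory created in step 6:

    hdfs dfs -put /tmp/input.txt /input
    hdfs dfs -ls /input              # confirm the file is now in HDFS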

9. Run the MapReduce job
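With the illustrative names used above the job is submitted with hadoop jar; the driver class name depends on how the jar was built, and the stock examples jar exposes the same job as the wordcount sub-command:

    hadoop jar /tmp/wordcount.jar WordCount /input /output
    # or, using the examples jar bundled with Hadoop:
    # hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar wordcount /input /output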

10. Check the output of the job (i.e. the word count)
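The reducer writes its results as part files inside the output directory; the word counts can be listed and printed with:

    hdfs dfs -ls /output
    hdfs dfs -cat /output/part-r-00000    # one "word<TAB>count" line per distinct word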

Conclusion
We learnt about the MapReduce paradigm within Hadoop and proceeded to implement it on a
multi-node cluster using Docker containers. Our primary objective was to develop MapReduce
code specifically designed for word counting and execute it across multiple nodes in the cluster.
To accomplish this, we configured the MapReduce code within the namenode and provided a
large data file as input. Through this hands-on exercise, we gained insights into the distributed
computing capabilities of MapReduce, enabling us to effectively harness the power of parallel
processing for large-scale data analysis tasks.
