MapReduce Architecture
MapReduce and HDFS are the two major components of Hadoop that make it so powerful and efficient to use. MapReduce is a programming model used for processing large data sets in parallel and in a distributed manner. The data is first split and then combined to produce the final result. MapReduce libraries have been written in many programming languages, with various optimizations. The purpose of MapReduce in Hadoop is to map each job and then reduce it to equivalent tasks, which lowers the overhead on the cluster network and reduces the processing power required. A MapReduce task is mainly divided into two phases: the Map phase and the Reduce phase.
MapReduce Architecture:
1. Client: The MapReduce client is the one who brings the job to MapReduce for processing. There can be multiple clients that continuously send jobs for processing to the Hadoop MapReduce Master.
2. Job: The MapReduce job is the actual work that the client wants to perform, comprised of many smaller tasks that the client wants to process or execute.
3. Hadoop MapReduce Master: It divides a particular job into subsequent job-parts.
4. Job-Parts: The tasks or sub-jobs obtained after dividing the main job. The results of all the job-parts are combined to produce the final output.
5. Input Data: The data set that is fed to MapReduce for processing.
6. Output Data: The final result obtained after processing.
In MapReduce we have a client. The client submits a job of a particular size to the Hadoop MapReduce Master. The MapReduce Master then divides this job into equivalent job-parts. These job-parts are made available to the Map and Reduce tasks. The Map and Reduce tasks contain the program written for the use case that the particular company is solving; the developer writes the logic to fulfil that requirement. The input data is fed to the Map task, and the Map generates intermediate key-value pairs as its output. These key-value pairs are then fed to the Reducer, and the final output is stored on HDFS. There can be any number of Map and Reduce tasks made available for processing the data, as the requirement dictates. The Map and Reduce algorithms are written in a highly optimized way so that time and space complexity are kept to a minimum.
Let’s discuss the MapReduce phases to get a better understanding of its
architecture:
The MapReduce task is mainly divided into 2 phases, i.e. the Map phase and the Reduce phase.
1. Map: As the name suggests, its main use is to map the input data into key-value pairs. The input to the map may itself be a key-value pair, where the key can be the id of some kind of address and the value is the actual value that it holds.
   The Map() function is executed in its memory repository on each of these input key-value pairs and generates intermediate key-value pairs, which work as input for the Reducer or Reduce() function.
2. Reduce: The intermediate key-value pairs that work as input for the Reducer are shuffled, sorted, and sent to the Reduce() function. The Reducer aggregates or groups the data based on its key as per the reducer algorithm written by the developer. A minimal word-count sketch of both phases is shown below.
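To make the two phases concrete, here is a minimal word-count sketch using the standard org.apache.hadoop.mapreduce API. The class and field names are illustrative only; the text does not prescribe a particular implementation.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map phase: emit an intermediate (word, 1) pair for every word in the input line.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, ONE);
        }
    }
}

// Reduce phase: aggregate the values grouped under each key by summing them.
// (In a real project this class would normally live in its own source file.)
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum));
    }
}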
How the Job Tracker and the Task Tracker deal with MapReduce:
1. Job Tracker: The work of the Job Tracker is to manage all the resources and all the jobs across the cluster, and to schedule each map on a Task Tracker running on the same data node, since there can be hundreds of data nodes available in the cluster.
2. Task Tracker: The Task Trackers can be considered the slaves that work on the instructions given by the Job Tracker. A Task Tracker is deployed on each of the nodes available in the cluster and executes the Map and Reduce tasks as instructed by the Job Tracker.
There is also one more important component of the MapReduce architecture known as the Job History Server. The Job History Server is a daemon process that saves and stores historical information about the task or application; the logs generated during or after job execution are stored on the Job History Server.
Map Reduce in Hadoop
Map Reduce is one of the three components of Hadoop. The first component of Hadoop, the Hadoop Distributed File System (HDFS), is responsible for storing the file. The second component, Map Reduce, is responsible for processing the file.
MapReduce has mainly 2 tasks, which are divided phase-wise: in the first phase Map is used, and in the next phase Reduce is used.
Suppose there is a word file containing some text. Let us name this file sample.txt. Note that we use Hadoop to deal with huge files, but for the sake of easy explanation we are taking a small text file as an example. So, let's assume that this sample.txt file contains a few lines of text. The content of the file is as follows:
Hello I am GeeksforGeeks
How can I help you
How can I assist you
Are you an engineer
Are you looking for coding
Are you looking for interview questions
what are you doing these days
what are your strengths
Hence, the above 8 lines are the content of the file. Let's assume that while storing this file in Hadoop, HDFS broke it into four parts and named each part first.txt, second.txt, third.txt, and fourth.txt. So the file is divided into four equal parts, each containing 2 lines: the first two lines go into first.txt, the next two into second.txt, the next two into third.txt, and the last two into fourth.txt. All these files are stored on Data Nodes, and the Name Node holds the metadata about them. All of this is the task of HDFS. Now, suppose a user wants to process this file. This is where Map-Reduce comes into the picture. Suppose the user wants to run a query on sample.txt. Instead of bringing sample.txt to the local computer, we send the query to the data. To keep track of our request, we use the Job Tracker (a master service). The Job Tracker traps our request and keeps track of it. Now suppose the user wants to run the query on sample.txt and wants the output in a result.output file. Let the name of the file containing the query be query.jar. So, the user will write a query like:
hadoop jar query.jar DriverCode sample.txt result.output
1. query.jar: the query file that needs to be processed on the input file (a sketch of such a driver class is given below).
2. sample.txt: the input file.
3. result.output: the directory in which the output of the processing will be stored.
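The text does not show the contents of DriverCode, so the following is only a hedged sketch of what such a driver class typically looks like with the org.apache.hadoop.mapreduce API; it reuses the illustrative WordCountMapper and WordCountReducer classes sketched earlier.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class DriverCode {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance();
        job.setJarByClass(DriverCode.class);
        job.setJobName("sample query");

        FileInputFormat.addInputPath(job, new Path(args[0]));    // e.g. sample.txt
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // e.g. result.output

        job.setMapperClass(WordCountMapper.class);               // illustrative classes
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Submit the job and wait for it to finish.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}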
So now the Job Tracker traps this request and asks the Name Node to run the request on sample.txt. The Name Node then provides the metadata to the Job Tracker. The Job Tracker now knows that sample.txt is stored in first.txt, second.txt, third.txt, and fourth.txt. As each of these four files has three copies stored in HDFS, the Job Tracker communicates with the Task Tracker (a slave service) of each of these files, but it communicates with only one copy of each file, the one residing nearest to it. Note: Applying the desired code to the local first.txt, second.txt, third.txt, and fourth.txt is a process; this process is called Map. In Hadoop terminology, the main file sample.txt is called the input file and its four subfiles are called input splits. So, in Hadoop the number of mappers for an input file is equal to the number of input splits of that input file. In the above case, the input file sample.txt has four input splits, hence four mappers will be running to process it. The responsibility of handling these mappers lies with the Job Tracker. Note that the Task Trackers are slave services to the Job Tracker, so if any of the local machines breaks down, the processing of that part of the file will stop, halting the complete process. Therefore, each Task Tracker sends a heartbeat and its number of free slots to the Job Tracker every 3 seconds. This is called the status of the Task Trackers. If a Task Tracker goes down, the Job Tracker waits for 10 heartbeat intervals, that is, 30 seconds; if it still does not get any status, it assumes that the Task Tracker is either dead or extremely busy. It then communicates with the Task Tracker holding another copy of the same file and directs it to process the desired code over it. Similarly, the slot information is used by the Job Tracker to keep track of how many tasks are currently being served by a Task Tracker and how many more tasks can be assigned to it. In this way, the Job Tracker keeps track of our request. Now, suppose the system has generated output for the individual first.txt, second.txt, third.txt, and fourth.txt. But this is not the user's desired output. To produce the desired output, all these individual outputs have to be merged or reduced to a single output. This reduction of multiple outputs to a single one is also a process and is done by the REDUCER. In Hadoop, as many reducers as there are, that many output files are generated. By default, there is one reducer per cluster. Note: Map and Reduce are two different processes of the second component of Hadoop, that is, Map Reduce. These are also called the phases of Map Reduce. Thus we can say that Map Reduce has two phases: Phase 1 is Map and Phase 2 is Reduce.
Functioning of Map Reduce
Now, let us move back to our sample.txt file with the same content. Again it is divided into four input splits, namely first.txt, second.txt, third.txt, and fourth.txt. Now, suppose we want to count the number of occurrences of each word in the file. The content of the file looks like:
Hello I am GeeksforGeeks
How can I help you
How can I assist you
Are you an engineer
Are you looking for coding
Are you looking for interview questions
what are you doing these days
what are your strengths
Then the output of the ‘word count’ code will be like:
Hello - 1
I - 1
am - 1
GeeksforGeeks - 1
How - 2 (How is written two times in the entire file)
Similarly
Are - 3
are - 2
….and so on
Thus, in order to get this output, the user has to send the query to the data. Suppose the 'word count' query is in the file wordcount.jar. So, the query will look like:
hadoop jar wordcount.jar DriverCode sample.txt result.output
Types of File Format in Hadoop
Now, as we know that there are four input splits, four mappers will be running, one on each input split. But mappers don't run directly on the input splits, because the input splits contain text and mappers don't understand text. Mappers understand (key, value) pairs only. Thus the text in the input splits first needs to be converted into (key, value) pairs. This is achieved by Record Readers. So we can also say that there are as many record readers as there are input splits. In Hadoop terminology, each line in a text file is termed a 'record'. How the record reader converts this text into a (key, value) pair depends on the format of the file. In Hadoop, there are four file formats, which are predefined classes in Hadoop. The four formats are listed here, followed by a short snippet showing how one of them is selected in the driver:
1. TextInputFormat
2. KeyValueTextInputFormat
3. SequenceFileInputFormat
4. SequenceFileAsTextInputFormat
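As a hedged illustration of how one of these formats is chosen, the driver can set the input format class explicitly; when nothing is set, TextInputFormat is used.

import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;

public class InputFormatSelection {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance();
        // Switch from the default TextInputFormat, whose record reader emits
        // (byte offset, line) pairs, to KeyValueTextInputFormat, whose record
        // reader splits each line at the first tab into a (key, value) pair.
        job.setInputFormatClass(KeyValueTextInputFormat.class);
    }
}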
By default, a file is in TextInputFormat. The record reader reads one record (line) at a time. While reading, it doesn't consider the format of the file, but it converts each record into a (key, value) pair depending on that format. For the time being, let's assume that the first input split, first.txt, is in TextInputFormat. The record reader working on this input split converts each record into the form (byte offset, entire line). For example, first.txt has the content:
Hello I am GeeksforGeeks
How can I help you
So the output of the record reader has two pairs (since there are two records in the file). The first pair looks like (0, Hello I am GeeksforGeeks) and the second pair looks like (25, How can I help you). Note that the second pair has a byte offset of 25 because there are 24 characters in the first line and the newline character (\n) is also counted. Thus, after the record reader, there are as many (key, value) pairs as there are records. The mapper then runs once for each of these pairs. Similarly, other mappers run on the (key, value) pairs of the other input splits. In this way, Hadoop breaks a big task into smaller tasks and executes them in parallel.
Shuffling and Sorting
Now, the mapper provides an output corresponding to each (key, value) pair provided by the record reader. Let us take the first input split, first.txt. The two pairs generated for this file by the record reader are (0, Hello I am GeeksforGeeks) and (25, How can I help you). The mapper takes one of these pairs at a time and produces output like (Hello, 1), (I, 1), (am, 1), and (GeeksforGeeks, 1) for the first pair, and (How, 1), (can, 1), (I, 1), (help, 1), and (you, 1) for the second pair. Similarly, we have the outputs of all the mappers. Note that this data contains duplicate keys like (I, 1), (How, 1), and so on. These duplicate keys also need to be taken care of. This data is also called intermediate data. Before passing this intermediate data to the reducer, it is first passed through two more stages, called Shuffling and Sorting.
1. Shuffling Phase: This phase combines all the values associated with an identical key. For example, (Are, 1) appears three times in the input file, so after the shuffling phase the output will look like (Are, [1,1,1]).
2. Sorting Phase: Once shuffling is done, the output is sent to the sorting phase, where all the (key, value) pairs are sorted automatically. In Hadoop, sorting is an automatic process because of the presence of an inbuilt interface called WritableComparable; a minimal custom-key sketch follows below.
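The text attributes automatic sorting to the built-in WritableComparable interface. As a hedged illustration, a custom key type only needs to implement this interface for Hadoop to serialize it between the Map and Reduce phases and order it during the sorting phase; the class below is a made-up example, not part of Hadoop itself.

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;

// Sketch of a custom key type. write()/readFields() handle serialization,
// and compareTo() defines the order used in the sorting phase.
public class WordKey implements WritableComparable<WordKey> {
    private String word = "";

    public void set(String w) { this.word = w; }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeUTF(word);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        word = in.readUTF();
    }

    @Override
    public int compareTo(WordKey other) {
        return word.compareTo(other.word);
    }

    @Override
    public int hashCode() {            // used by the default partitioner
        return word.hashCode();
    }

    @Override
    public boolean equals(Object o) {
        return (o instanceof WordKey) && word.equals(((WordKey) o).word);
    }
}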
After the completion of the shuffling and sorting phases, the resultant output is sent to the reducer. Now, if there are n (key, value) pairs after the shuffling and sorting phase, the reducer runs n times and thus produces the final processed output. In the above case, the resultant output of the reducer processing is stored in the directory result.output, as specified in the query code written to process the query on the data.
Hadoop – Mapper In MapReduce
Map-Reduce is a programming model that is mainly divided into two phases, the Map phase and the Reduce phase. It is designed for processing data in parallel, divided across various machines (nodes). Hadoop Java programs consist of a Mapper class and a Reducer class along with a driver class. The Hadoop Mapper is a function or task which processes all input records from a file and generates output that works as input for the Reducer. It produces this output by returning new key-value pairs. The input data has to be converted to key-value pairs, as the Mapper cannot process raw input records. The Mapper also generates small blocks of data while processing the input records as key-value pairs. We will discuss the various processes that occur in the Mapper, their key features, and how the key-value pairs are generated in the Mapper.
Let’s understand the Mapper in Map-Reduce:
1. Input-Splits: These are responsible for converting the physical input data into a logical form that the Hadoop Mapper can easily handle. Input-splits are generated with the help of the InputFormat. A large data set is divided into many input-splits, depending on the size of the input dataset, and there is a separate Mapper assigned to each input-split. Input-splits only reference the input data; they are not the actual data. Data blocks are not the only factor that decides the number of input-splits in a Map-Reduce job: we can manually configure the size of input-splits through the mapred.max.split.size property when the job is executed. All of the data blocks are consumed through these input-splits. The size of input-splits is measured in bytes. Each input-split is stored at some memory location (hostname strings), and Map-Reduce places map tasks as close to the location of the split as possible. Input-splits with larger sizes are executed first, so that the job runtime can be minimized.
2. Record-Reader: The Record-Reader is the process which deals with the output obtained from the input-splits and generates its own output as key-value pairs until the file ends. Each line present in a file is assigned a byte offset with the help of the Record-Reader. By default, the Record-Reader uses TextInputFormat to convert the data obtained from the input-splits into key-value pairs, because the Mapper can only handle key-value pairs.
3. Map: The key-value pairs obtained from the Record-Reader are then fed to the Map, which generates a set of intermediate key-value pairs.
4. Intermediate output disk: Finally, the intermediate key-value pair output is stored on the local disk as intermediate output. There is no need to store this data on HDFS, because it is only an intermediate output: storing it on HDFS would make writing more expensive because of HDFS's replication feature and would also increase execution time, and if the job were terminated, cleaning up intermediate output left on HDFS would be difficult. The intermediate output is therefore always stored on local disk and is cleaned up once the job completes its execution. On the local disk, this Mapper output is first stored in a buffer whose default size is 100 MB, configurable with the io.sort.mb property. The output of the Mapper can be written to HDFS if and only if the job is a map-only job; in that case there is no Reducer task, so the intermediate output is our final output, which can be written to HDFS. The number of Reducer tasks can be set to zero manually with job.setNumReduceTasks(0) (a short driver sketch follows this list). This Mapper output is of no use to the end user, as it is a temporary output useful only to the Reducer.
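As referenced above, here is a hedged driver sketch of a map-only job; the split-size property name follows the older mapred.max.split.size name used in the text, and the paths and sizes are placeholders, not values given by the source.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MapOnlyJob {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance();
        job.setJarByClass(MapOnlyJob.class);

        // Cap the input-split size (in bytes), as mentioned in the list above.
        job.getConfiguration().setLong("mapred.max.split.size", 128L * 1024 * 1024);

        // Zero Reducer tasks makes this a map-only job, so the Mapper output is
        // written straight to HDFS instead of the local intermediate disk.
        job.setNumReduceTasks(0);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}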
How to calculate the number of Mappers in Hadoop:
The number of blocks of the input file defines the number of map tasks in the Hadoop Map phase, which can be calculated with the help of the formula below.
Mapper = (total data size) / (input split size)
For example, for a file of size 10 TB (data size) where the size of each data block is 128 MB (input split size), the number of Mappers will be 10,485,760 MB / 128 MB = 81,920.
Hadoop – Reducer in Map-Reduce
Map-Reduce is a programming model that is mainly divided into two phases, the Map phase and the Reduce phase. It is designed for processing data in parallel, divided across various machines (nodes). Hadoop Java programs consist of a Mapper class and a Reducer class along with a driver class. The Reducer is the second part of the Map-Reduce programming model. The Mapper produces output in the form of key-value pairs, which works as input for the Reducer. But before these intermediate key-value pairs reach the Reducer, they are shuffled and sorted according to their keys, which means the key is the main decisive factor for sorting. The output generated by the Reducer is the final output, which is then stored on HDFS (Hadoop Distributed File System). The Reducer mainly performs computation operations like addition, filtering, and aggregation. By default, the number of reducers used to process the output of the Mapper is 1, which is configurable and can be changed by the user according to the requirement.
Let's understand the Reducer in Map-Reduce:
Here we can observe that there are multiple Mappers generating key-value pairs as output. The output of each Mapper is sent to the sorter, which sorts the key-value pairs according to their keys. Shuffling also takes place during the sorting process, after which the output is sent to the Reducer and the final output is produced.
Let's take an example to understand the working of the Reducer. Suppose we have the data of the faculty of all departments of a college stored in a CSV file. If we want to find the sum of salaries of the faculty per department, we can make the department title the key and the salaries the values. The Reducer will perform the summation operation on this dataset and produce the desired output.
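A hedged sketch of such a Reducer follows, assuming the department title arrives as a Text key and each salary as an IntWritable value; these type choices are assumptions, not given in the text.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Sums the salaries grouped under each department key,
// e.g. ("CSE", [70000, 65000, ...]) -> ("CSE", total).
public class SalarySumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text dept, Iterable<IntWritable> salaries, Context context)
            throws IOException, InterruptedException {
        int total = 0;
        for (IntWritable salary : salaries) {
            total += salary.get();
        }
        context.write(dept, new IntWritable(total));
    }
}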
Increasing the number of Reducers in a Map-Reduce job:
1. Increases framework overhead.
2. Reduces the cost of failures.
3. Increases load balancing.
One thing we also need to remember is that there is always a one-to-one mapping between keys and Reducers: each key is handled by exactly one Reducer. Once the whole Reducer process is done, the output is stored in part files (the default name) on HDFS (Hadoop Distributed File System). In the output directory on HDFS, Map-Reduce always creates a _SUCCESS file and part-r-00000 files. The number of part files depends on the number of reducers: if we have 5 Reducers, the part files will be named part-r-00000 through part-r-00004. By default, these files follow the part-r-nnnnn naming pattern. This can be changed manually; all we need to do is change the below property in the driver code of our Map-Reduce job.
// Here we are changing the output file name from part-r-00000 to GeeksForGeeks
job.getConfiguration().set("mapreduce.output.basename", "GeeksForGeeks");
The Reducer of Map-Reduce mainly consists of 3 processes/phases:
1. Shuffle: Shuffling carries data from the Mappers to the required Reducer. With the help of HTTP, the framework fetches the applicable partition of the output of all the Mappers.
2. Sort: In this phase, the output of the Mappers, that is, the key-value pairs, is sorted on the basis of the keys.
3. Reduce: Once shuffling and sorting are done, the Reducer combines the obtained results and performs the computation operation as per the requirement. The OutputCollector.collect() method is used for writing the output to HDFS. Keep in mind that the output of the Reducer will not be sorted.
Note: Shuffling and sorting both execute in parallel.
Setting the number of Reducers in Map-Reduce:
1. With the command line: While executing our Map-Reduce program, we can manually change the number of Reducers with the mapred.reduce.tasks property.
2. With the Job instance: In our driver class, we can specify the number of reducers using job.setNumReduceTasks(int). For example, job.setNumReduceTasks(2) gives us 2 Reducers. We can also set the number of Reducers to 0 in case we need only a Map job. A small driver sketch follows the guideline below.
// Ideally, the number of Reducers in a Map-Reduce job should be set to:
// 0.95 or 1.75 multiplied by (<no. of nodes> * <no. of maximum containers per node>)
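A hedged sketch of applying this guideline in the driver; the node and container counts are illustrative assumptions, and in practice they would come from the cluster configuration.

import org.apache.hadoop.mapreduce.Job;

public class ReducerCountExample {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance();

        // Rule of thumb from above: 0.95 (or 1.75) * nodes * containers per node.
        int nodes = 10;             // assumed cluster size
        int containersPerNode = 8;  // assumed per-node capacity
        int reducers = (int) (0.95 * nodes * containersPerNode);

        job.setNumReduceTasks(reducers);
    }
}

With the 0.95 factor, all reducers can launch immediately once the map output is ready; with 1.75, the faster nodes finish a first wave of reducers and start a second, which improves load balancing.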
Before it can run the task, the YarnChild process first localizes the resources that the task needs, including the job configuration, any files from the distributed cache, and the job JAR file. It finally runs the map or the reduce task. Any kind of bug in the user-defined map and reduce functions (or even in YarnChild) doesn't affect the node manager, because YarnChild runs in a dedicated JVM; the node manager therefore can't be affected by a crash or hang of the task.
Each task can perform setup and commit actions, which are run in the same JVM as the task itself and are determined by the OutputCommitter for the job. For file-based jobs, the commit action moves the task output from its initial position to its final location. When speculative execution is enabled, the commit protocol ensures that only one of the duplicate task attempts is committed and the other one is aborted.
Each job, including its tasks, has a status: the state of the job or task, the values of the job's counters, the progress of maps and reduces, and a description or status message. These statuses change over the course of the job.
When a task is running, it keeps track of its progress, i.e. the proportion of the task completed. For map tasks, this is the proportion of the input that has been processed. For reduce tasks, it's a little more complex, but the system can still estimate the proportion of the reduce input processed.
The operations that count as progress are as follows (a mapper sketch follows this list):
Read an input record in a mapper or reducer.
Write an output record in a mapper or reducer.
Set the status description.
Increment a counter using Reporter’s incrCounter() method or Counter’s
increment() method.
Call Reporter’s or TaskAttemptContext’s progress() method.
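As referenced above, here is a hedged sketch of a Mapper that performs these progress-reporting operations through the new-API task context (which plays the role of the older Reporter); the counter group and names are made up for illustration.

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class ProgressAwareMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        context.getCounter("MyApp", "RecordsSeen").increment(1); // increment a counter
        context.setStatus("processing offset " + key.get());     // set the status description
        context.progress();                                      // explicit progress report
        context.write(value, new LongWritable(1));               // writing output also counts
    }
}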
Let’s Understand Data-Flow in Map-Reduce
Map Reduce is a terminology that comes with the Map phase and the Reducer phase. The Map is used for transformation, while the Reducer is used for aggregation-style operations. The terminology for Map and Reduce is derived from functional programming languages like Lisp and Scala. The Map-Reduce processing framework program comes with 3 main components: our driver code, the Mapper (for transformation), and the Reducer (for aggregation).
Let's take an example where you have a 10 TB file to process on Hadoop. The 10 TB of data is first distributed across multiple nodes on Hadoop with HDFS. Now we have to process it, and for that we have the Map-Reduce framework. To process this data with Map-Reduce, we have driver code, which is called the Job. If we are using the Java programming language to process the data on HDFS, we need to initiate this driver class with the Job object. Suppose you have a car, which is your framework; then the start button used to start the car is similar to the driver code in the Map-Reduce framework. We need to initiate the driver code to utilize the advantages of the Map-Reduce framework.
There are also Mapper and Reducer classes provided by this framework, which are predefined and extended by developers as per the organization's requirements.
The Mapper is the initial line of code that interacts with the input dataset. Suppose we have 100 data blocks of the dataset we are analyzing; in that case, there will be 100 Mapper programs or processes running in parallel on machines (nodes), each producing its own output, known as intermediate output, which is then stored on local disk, not on HDFS. The output of the Mappers acts as input for the Reducer, which performs some sorting and aggregation operations on the data and produces the final output.
Brief Working Of Reducer
The Reducer is the second part of the Map-Reduce programming model. The Mapper produces output in the form of key-value pairs, which works as input for the Reducer. But before these intermediate key-value pairs reach the Reducer, they are shuffled and sorted according to their keys. The output generated by the Reducer is the final output, which is then stored on HDFS (Hadoop Distributed File System). The Reducer mainly performs computation operations like addition, filtering, and aggregation.
Steps of Data-Flow:
The application master for a MapReduce job is a Java application whose main class is MRAppMaster; it runs in a container under the node manager's management. It initializes the job by creating a number of bookkeeping objects to keep track of the job's progress, because it will receive progress and completion reports from the tasks.
The next step is retrieving the input splits, which were computed in the client from the shared filesystem. A map task object is then created for each split, along with a number of reduce task objects determined by the mapreduce.job.reduces property, which is set by the setNumReduceTasks() method on Job. At this point tasks are given IDs, and the application master decides how to run the tasks that make up the MapReduce job.
The application master may choose to run the tasks in the same JVM as itself if the job is small, i.e. when the overhead of allocating containers and running the tasks in them would outweigh the gain of running them in parallel. Such a job is said to be uberized, or run as an uber task.
By default, a small job is one that has fewer than 10 mappers, only one reducer, and an input size smaller than one HDFS block. These values may be changed for a job by setting mapreduce.job.ubertask.maxmaps, mapreduce.job.ubertask.maxreduces, and mapreduce.job.ubertask.maxbytes.
Uber tasks must be enabled explicitly (for an individual job, or across the cluster) by setting mapreduce.job.ubertask.enable to true. Finally, before any tasks can be run, the application master calls the setupJob() method on the OutputCommitter. For the default FileOutputCommitter, this creates the final output directory for the job and the temporary working space for the task output.
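A hedged sketch of setting these uber-task properties from a driver; the threshold values shown are illustrative, not authoritative defaults.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class UberJobExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.setBoolean("mapreduce.job.ubertask.enable", true);               // uber tasks are off unless enabled
        conf.setInt("mapreduce.job.ubertask.maxmaps", 9);                      // illustrative threshold
        conf.setInt("mapreduce.job.ubertask.maxreduces", 1);                   // illustrative threshold
        conf.setLong("mapreduce.job.ubertask.maxbytes", 128L * 1024 * 1024);   // illustrative threshold

        Job job = Job.getInstance(conf);
        // ... set mapper, reducer, and input/output paths as usual ...
    }
}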
Consider first the most common case: task failure. The most common occurrence of this failure is when user code in the map or reduce task throws a runtime exception. If this happens, the task JVM reports the error back to its parent application master before it exits, and the error finally makes it into the user logs. The application master marks the task attempt as failed and frees up the container so its resources are available for another task.
For Streaming tasks, the Streaming process is marked as failed if it exits with a nonzero exit code; this behaviour is governed by the stream.non.zero.exit.is.failure property (the default is true). Another failure mode is the sudden exit of the task JVM, perhaps due to a JVM bug that causes the JVM to exit for a particular set of circumstances exposed by the MapReduce user code. In this case, the node manager notices that the process has exited and informs the application master, which marks the attempt as failed. Hanging tasks are dealt with differently: the application master notices that it hasn't received a progress update for some time and proceeds to mark the task as failed; after this period, the task JVM process is killed automatically. The timeout period after which tasks are considered failed is normally 10 minutes and can be configured on a per-job basis by setting the mapreduce.task.timeout property to a value in milliseconds. Setting the timeout to a value of zero disables the timeout, so long-running tasks are never marked as failed. In that case, a hanging task will never free up its container, and over time the cluster may slow down as a result, so this approach should be avoided; making sure that a task reports progress periodically should suffice. When the application master is notified of a failed task attempt, it will reschedule execution of the task, and it will try to avoid rescheduling the task on a node manager where it has previously failed. Furthermore, if a task fails four times, it will not be retried again. This value is configurable: the maximum number of attempts is controlled by the mapreduce.map.maxattempts property for map tasks and mapreduce.reduce.maxattempts for reduce tasks. By default, the whole job fails if any task fails four times. For some applications it is undesirable to abort the job if just a few tasks fail, because it may still be possible to use the results of the job despite some failures. The maximum percentage of tasks that are allowed to fail without triggering job failure can be set for the job; map tasks and reduce tasks are controlled independently, using the mapreduce.map.failures.maxpercent and mapreduce.reduce.failures.maxpercent properties. A task attempt may also be killed, which is different from failing: an attempt may be killed because it is a speculative duplicate, or because the node manager it was running on failed. Killed task attempts do not count against the number of attempts to run the task (as set by mapreduce.map.maxattempts and mapreduce.reduce.maxattempts), because it was not the task's fault that the attempt was killed.
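A hedged sketch of adjusting these failure-handling properties from a driver; the values shown are illustrative, not recommendations.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class FailureTolerantJob {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Maximum attempts per task before it is declared failed.
        conf.setInt("mapreduce.map.maxattempts", 4);
        conf.setInt("mapreduce.reduce.maxattempts", 4);
        // Percentage of tasks allowed to fail without failing the whole job.
        conf.setInt("mapreduce.map.failures.maxpercent", 5);
        conf.setInt("mapreduce.reduce.failures.maxpercent", 5);
        // Task timeout in milliseconds (here 10 minutes); zero disables it.
        conf.setLong("mapreduce.task.timeout", 10 * 60 * 1000L);

        Job job = Job.getInstance(conf);
        // ... configure mapper, reducer, and paths as usual ...
    }
}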
Hadoop – Different Modes of Operation
As we all know, Hadoop is an open-source framework that is mainly used for storage and for maintaining and analyzing large amounts of data or datasets on clusters of commodity hardware, which means it is actually a data-management tool. Hadoop also possesses scale-out storage, which means that we can scale the number of nodes up or down as per our future requirements, which is a really useful feature.
1. Standalone Mode
In Standalone Mode none of the daemons run, i.e. Namenode, Datanode, Secondary Namenode, Job Tracker, and Task Tracker. (We use the Job Tracker and Task Tracker for processing purposes in Hadoop 1; in Hadoop 2 we use the Resource Manager and Node Manager.) Standalone Mode also means that we install Hadoop on only a single system. By default, Hadoop is configured to run in this Standalone Mode, which we can also call Local mode. We mainly use Hadoop in this mode for learning, testing, and debugging.
Hadoop runs fastest in this mode among all 3 modes. HDFS (Hadoop Distributed File System), one of the major components of Hadoop that is used for storage, is not used in this mode; instead, the local file system is used, much like the file systems available for Windows, i.e. NTFS (New Technology File System) and FAT32 (File Allocation Table, 32-bit). When Hadoop works in this mode there is no need to configure the files hdfs-site.xml, mapred-site.xml, and core-site.xml for the Hadoop environment. In this mode, all of your processes run in a single JVM (Java Virtual Machine), and this mode can only be used for small development purposes.
2. Pseudo-distributed Mode
In Pseudo-distributed Mode we also use only a single node, but the main thing is that the cluster is simulated, which means that all the processes inside the cluster run independently of each other. All the daemons, that is Namenode, Datanode, Secondary Namenode, Resource Manager, Node Manager, etc., run as separate processes in separate JVMs (Java Virtual Machines), or we can say they run as different Java processes; that is why it is called pseudo-distributed. One thing we should remember is that, as we are using only a single-node setup, all the master and slave processes are handled by the single system. The Namenode and Resource Manager are used as the master, and the Datanode and Node Manager are used as slaves. The Secondary Namenode is also used as a master; its purpose is just to keep an hourly backup of the Namenode. In this mode:
Hadoop is used both for development and for debugging purposes.
Our HDFS (Hadoop Distributed File System) is used for managing the input and output processes.
We need to change the configuration files mapred-site.xml, core-site.xml, and hdfs-site.xml to set up the environment.
3. Fully Distributed Mode
This is the most important mode, in which multiple nodes are used: a few of them run the master daemons, Namenode and Resource Manager, and the rest run the slave daemons, DataNode and Node Manager. Here Hadoop runs on a cluster of machines or nodes, and the data being used is distributed across the different nodes. This is actually the production mode of Hadoop. Let's clarify this mode in physical terms: once you download Hadoop as a tar or zip file, in the earlier modes you install it on a single system and run all the processes there, but in fully distributed mode we extract this tar or zip file onto each of the nodes in the Hadoop cluster and then use a particular node for a particular process. Once you distribute the processes among the nodes, you define which nodes work as masters and which of them work as slaves.
The journey of Hadoop started in 2005 with Doug Cutting and Mike Cafarella. It is open-source software built for dealing with large volumes of data. The objective of this article is to make you familiar with the differences between the Hadoop 2.x and Hadoop 3.x versions. Hadoop 3.x has some more advanced and compatible features than the older Hadoop 2.x versions.
Feature: Minimum supported Java version
Hadoop 2.x: Java 7 is the minimum compatible version.
Hadoop 3.x: Java 8 is the minimum compatible version.