MapReduce Architecture
MapReduce and HDFS are the two major components of Hadoop that make it so powerful and efficient to use. MapReduce is a programming model used for processing large data sets in parallel and in a distributed manner. The data is first split and then combined to produce the final result. MapReduce libraries have been written in many programming languages, with various optimizations. The purpose of MapReduce in Hadoop is to map each job and then reduce it to equivalent tasks, which lowers the overhead on the cluster network and reduces the processing power required. A MapReduce task is mainly divided into two phases: the Map phase and the Reduce phase.
MapReduce Architecture:
1. Client: The MapReduce client is the one who brings the job to MapReduce for processing. There can be multiple clients that continuously send jobs for processing to the Hadoop MapReduce Master.
2. Job: The MapReduce job is the actual work that the client wants to perform, comprised of many smaller tasks that the client wants to process or execute.
3. Hadoop MapReduce Master: It divides a particular job into subsequent job-parts.
4. Job-Parts: The tasks or sub-jobs obtained after dividing the main job. The results of all the job-parts are combined to produce the final output.
5. Input Data: The data set that is fed to MapReduce for processing.
6. Output Data: The final result obtained after processing.
In MapReduce we have a client. The client submits a job of a particular size to the Hadoop MapReduce Master. The MapReduce Master then divides this job into equivalent job-parts. These job-parts are made available to the Map and Reduce tasks. The Map and Reduce tasks contain the program written for the use case that the particular company is solving; the developer writes the logic to fulfil that requirement. The input data is fed to the Map task, and the Map generates intermediate key-value pairs as its output. These key-value pairs are then fed to the Reducer, and the final output is stored on HDFS. There can be any number of Map and Reduce tasks made available for processing the data, as the requirement dictates. The Map and Reduce algorithms are written in a highly optimized way so that time and space complexity are kept to a minimum.
Let’s discuss the MapReduce phases to get a better understanding of its
architecture:
The MapReduce task is mainly divided into 2 phases, i.e. the Map phase and the Reduce phase.
1. Map: As the name suggests, its main use is to map the input data into key-value pairs. The input to the map may itself be a key-value pair, where the key can be the id of some kind of address and the value is the actual value that it holds.
   The Map() function is executed in its memory repository on each of these input key-value pairs and generates intermediate key-value pairs, which work as input for the Reducer or Reduce() function.
2. Reduce: The intermediate key-value pairs that work as input for the Reducer are shuffled, sorted, and sent to the Reduce() function. The Reducer aggregates or groups the data based on its key as per the reducer algorithm written by the developer. A minimal word-count sketch of both phases is shown below.
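To make the two phases concrete, here is a minimal word-count sketch using the standard org.apache.hadoop.mapreduce API. The class and field names are illustrative only; the text does not prescribe a particular implementation.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map phase: emit an intermediate (word, 1) pair for every word in the input line.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, ONE);
        }
    }
}

// Reduce phase: aggregate the values grouped under each key by summing them.
// (In a real project this class would normally live in its own source file.)
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum));
    }
}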
How the Job Tracker and the Task Tracker deal with MapReduce:
1. Job Tracker: The work of the Job Tracker is to manage all the resources and all the jobs across the cluster, and to schedule each map on a Task Tracker running on the same data node, since there can be hundreds of data nodes available in the cluster.
2. Task Tracker: The Task Trackers can be considered the slaves that work on the instructions given by the Job Tracker. A Task Tracker is deployed on each of the nodes available in the cluster and executes the Map and Reduce tasks as instructed by the Job Tracker.
There is also one more important component of the MapReduce architecture known as the Job History Server. The Job History Server is a daemon process that saves and stores historical information about the task or application; the logs generated during or after job execution are stored on the Job History Server.
Map Reduce in Hadoop
Map Reduce is one of the three components of Hadoop. The first component of Hadoop, the Hadoop Distributed File System (HDFS), is responsible for storing the file. The second component, Map Reduce, is responsible for processing the file.
MapReduce has mainly 2 tasks, which are divided phase-wise: in the first phase Map is used, and in the next phase Reduce is used.
Suppose there is a word file containing some text. Let us name this file sample.txt. Note that we use Hadoop to deal with huge files, but for the sake of easy explanation we are taking a small text file as an example. So, let's assume that this sample.txt file contains a few lines of text. The content of the file is as follows:
Hello I am GeeksforGeeks
How can I help you
How can I assist you
Are you an engineer
Are you looking for coding
Are you looking for interview questions
what are you doing these days
what are your strengths
Hence, the above 8 lines are the content of the file. Let's assume that while storing this file in Hadoop, HDFS broke it into four parts and named each part first.txt, second.txt, third.txt, and fourth.txt. So the file is divided into four equal parts, each containing 2 lines: the first two lines go into first.txt, the next two into second.txt, the next two into third.txt, and the last two into fourth.txt. All these files are stored on Data Nodes, and the Name Node holds the metadata about them. All of this is the task of HDFS. Now, suppose a user wants to process this file. This is where Map-Reduce comes into the picture. Suppose the user wants to run a query on sample.txt. Instead of bringing sample.txt to the local computer, we send the query to the data. To keep track of our request, we use the Job Tracker (a master service). The Job Tracker traps our request and keeps track of it. Now suppose the user wants to run the query on sample.txt and wants the output in a result.output file. Let the name of the file containing the query be query.jar. So, the user will write a query like:
hadoop jar query.jar DriverCode sample.txt result.output
1. query.jar: the query file that needs to be processed on the input file (a sketch of such a driver class is given below).
2. sample.txt: the input file.
3. result.output: the directory in which the output of the processing will be stored.
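The text does not show the contents of DriverCode, so the following is only a hedged sketch of what such a driver class typically looks like with the org.apache.hadoop.mapreduce API; it reuses the illustrative WordCountMapper and WordCountReducer classes sketched earlier.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class DriverCode {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance();
        job.setJarByClass(DriverCode.class);
        job.setJobName("sample query");

        FileInputFormat.addInputPath(job, new Path(args[0]));    // e.g. sample.txt
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // e.g. result.output

        job.setMapperClass(WordCountMapper.class);               // illustrative classes
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Submit the job and wait for it to finish.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}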
So now the Job Tracker traps this request and asks the Name Node to run the request on sample.txt. The Name Node then provides the metadata to the Job Tracker. The Job Tracker now knows that sample.txt is stored in first.txt, second.txt, third.txt, and fourth.txt. As each of these four files has three copies stored in HDFS, the Job Tracker communicates with the Task Tracker (a slave service) of each of these files, but it communicates with only one copy of each file, the one residing nearest to it. Note: Applying the desired code to the local first.txt, second.txt, third.txt, and fourth.txt is a process; this process is called Map. In Hadoop terminology, the main file sample.txt is called the input file and its four subfiles are called input splits. So, in Hadoop the number of mappers for an input file is equal to the number of input splits of that input file. In the above case, the input file sample.txt has four input splits, hence four mappers will be running to process it. The responsibility of handling these mappers lies with the Job Tracker. Note that the Task Trackers are slave services to the Job Tracker, so if any of the local machines breaks down, the processing of that part of the file will stop, halting the complete process. Therefore, each Task Tracker sends a heartbeat and its number of free slots to the Job Tracker every 3 seconds. This is called the status of the Task Trackers. If a Task Tracker goes down, the Job Tracker waits for 10 heartbeat intervals, that is, 30 seconds; if it still does not get any status, it assumes that the Task Tracker is either dead or extremely busy. It then communicates with the Task Tracker holding another copy of the same file and directs it to process the desired code over it. Similarly, the slot information is used by the Job Tracker to keep track of how many tasks are currently being served by a Task Tracker and how many more tasks can be assigned to it. In this way, the Job Tracker keeps track of our request. Now, suppose the system has generated output for the individual first.txt, second.txt, third.txt, and fourth.txt. But this is not the user's desired output. To produce the desired output, all these individual outputs have to be merged or reduced to a single output. This reduction of multiple outputs to a single one is also a process and is done by the REDUCER. In Hadoop, as many reducers as there are, that many output files are generated. By default, there is one reducer per cluster. Note: Map and Reduce are two different processes of the second component of Hadoop, that is, Map Reduce. These are also called the phases of Map Reduce. Thus we can say that Map Reduce has two phases: Phase 1 is Map and Phase 2 is Reduce.
Functioning of Map Reduce
Now, let us move back to our sample.txt file with the same content. Again it is divided into four input splits, namely first.txt, second.txt, third.txt, and fourth.txt. Now, suppose we want to count the number of occurrences of each word in the file. The content of the file looks like:
Hello I am GeeksforGeeks
How can I help you
How can I assist you
Are you an engineer
Are you looking for coding
Are you looking for interview questions
what are you doing these days
what are your strengths
Then the output of the ‘word count’ code will be like:
Hello - 1
I - 1
am - 1
GeeksforGeeks - 1
How - 2 (How is written two times in the entire file)
Similarly
Are - 3
are - 2
….and so on
Thus, in order to get this output, the user has to send the query to the data. Suppose the 'word count' query is in the file wordcount.jar. So, the query will look like:
hadoop jar wordcount.jar DriverCode sample.txt result.output
Types of File Format in Hadoop
Now, as we know that there are four input splits, four mappers will be running, one on each input split. But mappers don't run directly on the input splits, because the input splits contain text and mappers don't understand text. Mappers understand (key, value) pairs only. Thus the text in the input splits first needs to be converted into (key, value) pairs. This is achieved by Record Readers. So we can also say that there are as many record readers as there are input splits. In Hadoop terminology, each line in a text file is termed a 'record'. How the record reader converts this text into a (key, value) pair depends on the format of the file. In Hadoop, there are four file formats, which are predefined classes in Hadoop. The four formats are listed here, followed by a short snippet showing how one of them is selected in the driver:
1. TextInputFormat
2. KeyValueTextInputFormat
3. SequenceFileInputFormat
4. SequenceFileAsTextInputFormat
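As a hedged illustration of how one of these formats is chosen, the driver can set the input format class explicitly; when nothing is set, TextInputFormat is used.

import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;

public class InputFormatSelection {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance();
        // Switch from the default TextInputFormat, whose record reader emits
        // (byte offset, line) pairs, to KeyValueTextInputFormat, whose record
        // reader splits each line at the first tab into a (key, value) pair.
        job.setInputFormatClass(KeyValueTextInputFormat.class);
    }
}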
By default, a file is in TextInputFormat. The record reader reads one record (line) at a time. While reading, it doesn't consider the format of the file, but it converts each record into a (key, value) pair depending on that format. For the time being, let's assume that the first input split, first.txt, is in TextInputFormat. The record reader working on this input split converts each record into the form (byte offset, entire line). For example, first.txt has the content:
Hello I am GeeksforGeeks
How can I help you
So the output of the record reader has two pairs (since there are two records in the file). The first pair looks like (0, Hello I am GeeksforGeeks) and the second pair looks like (25, How can I help you). Note that the second pair has a byte offset of 25 because there are 24 characters in the first line and the newline character (\n) is also counted. Thus, after the record reader, there are as many (key, value) pairs as there are records. The mapper then runs once for each of these pairs. Similarly, other mappers run on the (key, value) pairs of the other input splits. In this way, Hadoop breaks a big task into smaller tasks and executes them in parallel.
Shuffling and Sorting
Now, the mapper provides an output corresponding to each (key, value) pair provided by the record reader. Let us take the first input split, first.txt. The two pairs generated for this file by the record reader are (0, Hello I am GeeksforGeeks) and (25, How can I help you). The mapper takes one of these pairs at a time and produces output like (Hello, 1), (I, 1), (am, 1), and (GeeksforGeeks, 1) for the first pair, and (How, 1), (can, 1), (I, 1), (help, 1), and (you, 1) for the second pair. Similarly, we have the outputs of all the mappers. Note that this data contains duplicate keys like (I, 1), (How, 1), and so on. These duplicate keys also need to be taken care of. This data is also called intermediate data. Before passing this intermediate data to the reducer, it is first passed through two more stages, called Shuffling and Sorting.
1. Shuffling Phase: This phase combines all the values associated with an identical key. For example, (Are, 1) appears three times in the input file, so after the shuffling phase the output will look like (Are, [1,1,1]).
2. Sorting Phase: Once shuffling is done, the output is sent to the sorting phase, where all the (key, value) pairs are sorted automatically. In Hadoop, sorting is an automatic process because of the presence of an inbuilt interface called WritableComparable; a minimal custom-key sketch follows below.
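The text attributes automatic sorting to the built-in WritableComparable interface. As a hedged illustration, a custom key type only needs to implement this interface for Hadoop to serialize it between the Map and Reduce phases and order it during the sorting phase; the class below is a made-up example, not part of Hadoop itself.

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;

// Sketch of a custom key type. write()/readFields() handle serialization,
// and compareTo() defines the order used in the sorting phase.
public class WordKey implements WritableComparable<WordKey> {
    private String word = "";

    public void set(String w) { this.word = w; }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeUTF(word);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        word = in.readUTF();
    }

    @Override
    public int compareTo(WordKey other) {
        return word.compareTo(other.word);
    }

    @Override
    public int hashCode() {            // used by the default partitioner
        return word.hashCode();
    }

    @Override
    public boolean equals(Object o) {
        return (o instanceof WordKey) && word.equals(((WordKey) o).word);
    }
}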
After the completion of the shuffling and sorting phases, the resultant output is sent to the reducer. Now, if there are n (key, value) pairs after the shuffling and sorting phase, the reducer runs n times and thus produces the final processed output. In the above case, the resultant output of the reducer processing is stored in the directory result.output, as specified in the query code written to process the query on the data.
Hadoop – Mapper In MapReduce
Map-Reduce is a programming model that is mainly divided into two phases, the Map phase and the Reduce phase. It is designed for processing data in parallel, divided across various machines (nodes). Hadoop Java programs consist of a Mapper class and a Reducer class along with a driver class. The Hadoop Mapper is a function or task which processes all input records from a file and generates output that works as input for the Reducer. It produces this output by returning new key-value pairs. The input data has to be converted to key-value pairs, as the Mapper cannot process raw input records. The Mapper also generates small blocks of data while processing the input records as key-value pairs. We will discuss the various processes that occur in the Mapper, their key features, and how the key-value pairs are generated in the Mapper.
Let’s understand the Mapper in Map-Reduce:
1. Input-Splits: These are responsible for converting the physical input data into a logical form that the Hadoop Mapper can easily handle. Input-splits are generated with the help of the InputFormat. A large data set is divided into many input-splits, depending on the size of the input dataset, and there is a separate Mapper assigned to each input-split. Input-splits only reference the input data; they are not the actual data. Data blocks are not the only factor that decides the number of input-splits in a Map-Reduce job: we can manually configure the size of input-splits through the mapred.max.split.size property when the job is executed. All of the data blocks are consumed through these input-splits. The size of input-splits is measured in bytes. Each input-split is stored at some memory location (hostname strings), and Map-Reduce places map tasks as close to the location of the split as possible. Input-splits with larger sizes are executed first, so that the job runtime can be minimized.
2. Record-Reader: The Record-Reader is the process which deals with the output obtained from the input-splits and generates its own output as key-value pairs until the file ends. Each line present in a file is assigned a byte offset with the help of the Record-Reader. By default, the Record-Reader uses TextInputFormat to convert the data obtained from the input-splits into key-value pairs, because the Mapper can only handle key-value pairs.
3. Map: The key-value pairs obtained from the Record-Reader are then fed to the Map, which generates a set of intermediate key-value pairs.
4. Intermediate output disk: Finally, the intermediate key-value pair output is stored on the local disk as intermediate output. There is no need to store this data on HDFS, because it is only an intermediate output: storing it on HDFS would make writing more expensive because of HDFS's replication feature and would also increase execution time, and if the job were terminated, cleaning up intermediate output left on HDFS would be difficult. The intermediate output is therefore always stored on local disk and is cleaned up once the job completes its execution. On the local disk, this Mapper output is first stored in a buffer whose default size is 100 MB, configurable with the io.sort.mb property. The output of the Mapper can be written to HDFS if and only if the job is a map-only job; in that case there is no Reducer task, so the intermediate output is our final output, which can be written to HDFS. The number of Reducer tasks can be set to zero manually with job.setNumReduceTasks(0) (a short driver sketch follows this list). This Mapper output is of no use to the end user, as it is a temporary output useful only to the Reducer.
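As referenced above, here is a hedged driver sketch of a map-only job; the split-size property name follows the older mapred.max.split.size name used in the text, and the paths and sizes are placeholders, not values given by the source.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MapOnlyJob {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance();
        job.setJarByClass(MapOnlyJob.class);

        // Cap the input-split size (in bytes), as mentioned in the list above.
        job.getConfiguration().setLong("mapred.max.split.size", 128L * 1024 * 1024);

        // Zero Reducer tasks makes this a map-only job, so the Mapper output is
        // written straight to HDFS instead of the local intermediate disk.
        job.setNumReduceTasks(0);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}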
How to calculate the number of Mappers in Hadoop:
The number of blocks of the input file defines the number of map tasks in the Hadoop Map phase, which can be calculated with the help of the formula below.
Mapper = (total data size) / (input split size)
For example, for a file of size 10 TB (data size) where the size of each data block is 128 MB (input split size), the number of Mappers will be 10,485,760 MB / 128 MB = 81,920.
Hadoop – Reducer in Map-Reduce
Map-Reduce is a programming model that is mainly divided into two phases, the Map phase and the Reduce phase. It is designed for processing data in parallel, divided across various machines (nodes). Hadoop Java programs consist of a Mapper class and a Reducer class along with a driver class. The Reducer is the second part of the Map-Reduce programming model. The Mapper produces output in the form of key-value pairs, which works as input for the Reducer. But before these intermediate key-value pairs reach the Reducer, they are shuffled and sorted according to their keys, which means the key is the main decisive factor for sorting. The output generated by the Reducer is the final output, which is then stored on HDFS (Hadoop Distributed File System). The Reducer mainly performs computation operations like addition, filtering, and aggregation. By default, the number of reducers used to process the output of the Mapper is 1, which is configurable and can be changed by the user according to the requirement.
Let's understand the Reducer in Map-Reduce:
Here we can observe that there are multiple Mappers generating key-value pairs as output. The output of each Mapper is sent to the sorter, which sorts the key-value pairs according to their keys. Shuffling also takes place during the sorting process, after which the output is sent to the Reducer and the final output is produced.
Let's take an example to understand the working of the Reducer. Suppose we have the data of the faculty of all departments of a college stored in a CSV file. If we want to find the sum of salaries of the faculty per department, we can make the department title the key and the salaries the values. The Reducer will perform the summation operation on this dataset and produce the desired output.
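A hedged sketch of such a Reducer follows, assuming the department title arrives as a Text key and each salary as an IntWritable value; these type choices are assumptions, not given in the text.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Sums the salaries grouped under each department key,
// e.g. ("CSE", [70000, 65000, ...]) -> ("CSE", total).
public class SalarySumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text dept, Iterable<IntWritable> salaries, Context context)
            throws IOException, InterruptedException {
        int total = 0;
        for (IntWritable salary : salaries) {
            total += salary.get();
        }
        context.write(dept, new IntWritable(total));
    }
}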
Increasing the number of Reducers in a Map-Reduce job:
1. Increases framework overhead.
2. Reduces the cost of failures.
3. Increases load balancing.
One thing we also need to remember is that there is always a one-to-one mapping between keys and Reducers: each key is handled by exactly one Reducer. Once the whole Reducer process is done, the output is stored in part files (the default name) on HDFS (Hadoop Distributed File System). In the output directory on HDFS, Map-Reduce always creates a _SUCCESS file and part-r-00000 files. The number of part files depends on the number of reducers: if we have 5 Reducers, the part files will be named part-r-00000 through part-r-00004. By default, these files follow the part-r-nnnnn naming pattern. This can be changed manually; all we need to do is change the below property in the driver code of our Map-Reduce job.
// Here we are changing the output file name from part-r-00000 to GeeksForGeeks
job.getConfiguration().set("mapreduce.output.basename", "GeeksForGeeks");
The Reducer of Map-Reduce mainly consists of 3 processes/phases:
1. Shuffle: Shuffling carries data from the Mappers to the required Reducer. With the help of HTTP, the framework fetches the applicable partition of the output of all the Mappers.
2. Sort: In this phase, the output of the Mappers, that is, the key-value pairs, is sorted on the basis of the keys.
3. Reduce: Once shuffling and sorting are done, the Reducer combines the obtained results and performs the computation operation as per the requirement. The OutputCollector.collect() method is used for writing the output to HDFS. Keep in mind that the output of the Reducer will not be sorted.
Note: Shuffling and sorting both execute in parallel.
Setting the number of Reducers in Map-Reduce:
1. With the command line: While executing our Map-Reduce program, we can manually change the number of Reducers with the mapred.reduce.tasks property.
2. With the Job instance: In our driver class, we can specify the number of reducers using job.setNumReduceTasks(int). For example, job.setNumReduceTasks(2) gives us 2 Reducers. We can also set the number of Reducers to 0 in case we need only a Map job. A small driver sketch follows the guideline below.
// Ideally, the number of Reducers in a Map-Reduce job should be set to:
// 0.95 or 1.75 multiplied by (<no. of nodes> * <no. of maximum containers per node>)
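A hedged sketch of applying this guideline in the driver; the node and container counts are illustrative assumptions, and in practice they would come from the cluster configuration.

import org.apache.hadoop.mapreduce.Job;

public class ReducerCountExample {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance();

        // Rule of thumb from above: 0.95 (or 1.75) * nodes * containers per node.
        int nodes = 10;             // assumed cluster size
        int containersPerNode = 8;  // assumed per-node capacity
        int reducers = (int) (0.95 * nodes * containersPerNode);

        job.setNumReduceTasks(reducers);
    }
}

With the 0.95 factor, all reducers can launch immediately once the map output is ready; with 1.75, the faster nodes finish a first wave of reducers and start a second, which improves load balancing.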
Before it can run the task, the YarnChild process first localizes the resources that the task needs, including the job configuration, any files from the distributed cache, and the job JAR file. It finally runs the map or the reduce task. Any kind of bug in the user-defined map and reduce functions (or even in YarnChild) doesn't affect the node manager, because YarnChild runs in a dedicated JVM; the node manager therefore can't be affected by a crash or hang of the task.
Each task can perform setup and commit actions, which are run in the same JVM as the task itself and are determined by the OutputCommitter for the job. For file-based jobs, the commit action moves the task output from its initial position to its final location. When speculative execution is enabled, the commit protocol ensures that only one of the duplicate task attempts is committed and the other one is aborted.
Each job, including its tasks, has a status: the state of the job or task, the values of the job's counters, the progress of maps and reduces, and a description or status message. These statuses change over the course of the job.
When a task is running, it keeps track of its progress, i.e. the proportion of the task completed. For map tasks, this is the proportion of the input that has been processed. For reduce tasks, it's a little more complex, but the system can still estimate the proportion of the reduce input processed.
The operations that count as progress are as follows (a mapper sketch follows this list):
Read an input record in a mapper or reducer.
Write an output record in a mapper or reducer.
Set the status description.
Increment a counter using Reporter’s incrCounter() method or Counter’s
increment() method.
Call Reporter’s or TaskAttemptContext’s progress() method.
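As referenced above, here is a hedged sketch of a Mapper that performs these progress-reporting operations through the new-API task context (which plays the role of the older Reporter); the counter group and names are made up for illustration.

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class ProgressAwareMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        context.getCounter("MyApp", "RecordsSeen").increment(1); // increment a counter
        context.setStatus("processing offset " + key.get());     // set the status description
        context.progress();                                      // explicit progress report
        context.write(value, new LongWritable(1));               // writing output also counts
    }
}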
Let’s Understand Data-Flow in Map-Reduce
Map Reduce is a terminology that comes with the Map phase and the Reducer phase. The Map is used for transformation, while the Reducer is used for aggregation-style operations. The terminology for Map and Reduce is derived from functional programming languages like Lisp and Scala. The Map-Reduce processing framework program comes with 3 main components: our driver code, the Mapper (for transformation), and the Reducer (for aggregation).
Let's take an example where you have a 10 TB file to process on Hadoop. The 10 TB of data is first distributed across multiple nodes on Hadoop with HDFS. Now we have to process it, and for that we have the Map-Reduce framework. To process this data with Map-Reduce, we have driver code, which is called the Job. If we are using the Java programming language to process the data on HDFS, we need to initiate this driver class with the Job object. Suppose you have a car, which is your framework; then the start button used to start the car is similar to the driver code in the Map-Reduce framework. We need to initiate the driver code to utilize the advantages of the Map-Reduce framework.
There are also Mapper and Reducer classes provided by this framework, which are predefined and extended by developers as per the organization's requirements.
The Mapper is the initial line of code that interacts with the input dataset. Suppose we have 100 data blocks of the dataset we are analyzing; in that case, there will be 100 Mapper programs or processes running in parallel on machines (nodes), each producing its own output, known as intermediate output, which is then stored on local disk, not on HDFS. The output of the Mappers acts as input for the Reducer, which performs some sorting and aggregation operations on the data and produces the final output.
Brief Working Of Reducer
The Reducer is the second part of the Map-Reduce programming model. The Mapper produces output in the form of key-value pairs, which works as input for the Reducer. But before these intermediate key-value pairs reach the Reducer, they are shuffled and sorted according to their keys. The output generated by the Reducer is the final output, which is then stored on HDFS (Hadoop Distributed File System). The Reducer mainly performs computation operations like addition, filtering, and aggregation.
Steps of Data-Flow:
The application master for a MapReduce job is a Java application whose main class is MRAppMaster; it runs in a container under the node manager's management. It initializes the job by creating a number of bookkeeping objects to keep track of the job's progress, because it will receive progress and completion reports from the tasks.
The next step is retrieving the input splits, which were computed in the client from the shared filesystem. A map task object is then created for each split, along with a number of reduce task objects determined by the mapreduce.job.reduces property, which is set by the setNumReduceTasks() method on Job. At this point tasks are given IDs, and the application master decides how to run the tasks that make up the MapReduce job.
The application master may choose to run the tasks in the same JVM as itself if the job is small, i.e. when the overhead of allocating containers and running the tasks in them would outweigh the gain of running them in parallel. Such a job is said to be uberized, or run as an uber task.
By default, a small job is one that has fewer than 10 mappers, only one reducer, and an input size smaller than one HDFS block. These values may be changed for a job by setting mapreduce.job.ubertask.maxmaps, mapreduce.job.ubertask.maxreduces, and mapreduce.job.ubertask.maxbytes.
Uber tasks must be enabled explicitly (for an individual job, or across the cluster) by setting mapreduce.job.ubertask.enable to true. Finally, before any tasks can be run, the application master calls the setupJob() method on the OutputCommitter. For the default FileOutputCommitter, this creates the final output directory for the job and the temporary working space for the task output.
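A hedged sketch of setting these uber-task properties from a driver; the threshold values shown are illustrative, not authoritative defaults.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class UberJobExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.setBoolean("mapreduce.job.ubertask.enable", true);               // uber tasks are off unless enabled
        conf.setInt("mapreduce.job.ubertask.maxmaps", 9);                      // illustrative threshold
        conf.setInt("mapreduce.job.ubertask.maxreduces", 1);                   // illustrative threshold
        conf.setLong("mapreduce.job.ubertask.maxbytes", 128L * 1024 * 1024);   // illustrative threshold

        Job job = Job.getInstance(conf);
        // ... set mapper, reducer, and input/output paths as usual ...
    }
}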
Consider first the most common case: task failure. The most common occurrence of this failure is when user code in the map or reduce task throws a runtime exception. If this happens, the task JVM reports the error back to its parent application master before it exits, and the error finally makes it into the user logs. The application master marks the task attempt as failed and frees up the container so its resources are available for another task.
For Streaming tasks, the Streaming process is marked as failed if it exits with a nonzero exit code; this behaviour is governed by the stream.non.zero.exit.is.failure property (the default is true). Another failure mode is the sudden exit of the task JVM, perhaps due to a JVM bug that causes the JVM to exit for a particular set of circumstances exposed by the MapReduce user code. In this case, the node manager notices that the process has exited and informs the application master, which marks the attempt as failed. Hanging tasks are dealt with differently: the application master notices that it hasn't received a progress update for some time and proceeds to mark the task as failed; after this period, the task JVM process is killed automatically. The timeout period after which tasks are considered failed is normally 10 minutes and can be configured on a per-job basis by setting the mapreduce.task.timeout property to a value in milliseconds. Setting the timeout to a value of zero disables the timeout, so long-running tasks are never marked as failed. In that case, a hanging task will never free up its container, and over time the cluster may slow down as a result, so this approach should be avoided; making sure that a task reports progress periodically should suffice. When the application master is notified of a failed task attempt, it will reschedule execution of the task, and it will try to avoid rescheduling the task on a node manager where it has previously failed. Furthermore, if a task fails four times, it will not be retried again. This value is configurable: the maximum number of attempts is controlled by the mapreduce.map.maxattempts property for map tasks and mapreduce.reduce.maxattempts for reduce tasks. By default, the whole job fails if any task fails four times. For some applications it is undesirable to abort the job if just a few tasks fail, because it may still be possible to use the results of the job despite some failures. The maximum percentage of tasks that are allowed to fail without triggering job failure can be set for the job; map tasks and reduce tasks are controlled independently, using the mapreduce.map.failures.maxpercent and mapreduce.reduce.failures.maxpercent properties. A task attempt may also be killed, which is different from failing: an attempt may be killed because it is a speculative duplicate, or because the node manager it was running on failed. Killed task attempts do not count against the number of attempts to run the task (as set by mapreduce.map.maxattempts and mapreduce.reduce.maxattempts), because it was not the task's fault that the attempt was killed.
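A hedged sketch of adjusting these failure-handling properties from a driver; the values shown are illustrative, not recommendations.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class FailureTolerantJob {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Maximum attempts per task before it is declared failed.
        conf.setInt("mapreduce.map.maxattempts", 4);
        conf.setInt("mapreduce.reduce.maxattempts", 4);
        // Percentage of tasks allowed to fail without failing the whole job.
        conf.setInt("mapreduce.map.failures.maxpercent", 5);
        conf.setInt("mapreduce.reduce.failures.maxpercent", 5);
        // Task timeout in milliseconds (here 10 minutes); zero disables it.
        conf.setLong("mapreduce.task.timeout", 10 * 60 * 1000L);

        Job job = Job.getInstance(conf);
        // ... configure mapper, reducer, and paths as usual ...
    }
}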
Hadoop – Different Modes of Operation
As we all know, Hadoop is an open-source framework that is mainly used for storage and for maintaining and analyzing large amounts of data or datasets on clusters of commodity hardware, which means it is actually a data-management tool. Hadoop also possesses scale-out storage, which means that we can scale the number of nodes up or down as per our future requirements, which is a really useful feature.
1. Standalone Mode
In Standalone Mode none of the daemons run, i.e. Namenode, Datanode, Secondary Namenode, Job Tracker, and Task Tracker. (We use the Job Tracker and Task Tracker for processing purposes in Hadoop 1; in Hadoop 2 we use the Resource Manager and Node Manager.) Standalone Mode also means that we install Hadoop on only a single system. By default, Hadoop is configured to run in this Standalone Mode, which we can also call Local mode. We mainly use Hadoop in this mode for learning, testing, and debugging.
Hadoop runs fastest in this mode among all 3 modes. HDFS (Hadoop Distributed File System), one of the major components of Hadoop that is used for storage, is not used in this mode; instead, the local file system is used, much like the file systems available for Windows, i.e. NTFS (New Technology File System) and FAT32 (File Allocation Table, 32-bit). When Hadoop works in this mode there is no need to configure the files hdfs-site.xml, mapred-site.xml, and core-site.xml for the Hadoop environment. In this mode, all of your processes run in a single JVM (Java Virtual Machine), and this mode can only be used for small development purposes.
2. Pseudo-distributed Mode
In Pseudo-distributed Mode we also use only a single node, but the main thing is that the cluster is simulated, which means that all the processes inside the cluster run independently of each other. All the daemons, that is Namenode, Datanode, Secondary Namenode, Resource Manager, Node Manager, etc., run as separate processes in separate JVMs (Java Virtual Machines), or we can say they run as different Java processes; that is why it is called pseudo-distributed. One thing we should remember is that, as we are using only a single-node setup, all the master and slave processes are handled by the single system. The Namenode and Resource Manager are used as the master, and the Datanode and Node Manager are used as slaves. The Secondary Namenode is also used as a master; its purpose is just to keep an hourly backup of the Namenode. In this mode:
Hadoop is used both for development and for debugging purposes.
Our HDFS (Hadoop Distributed File System) is used for managing the input and output processes.
We need to change the configuration files mapred-site.xml, core-site.xml, and hdfs-site.xml to set up the environment.
3. Fully Distributed Mode
This is the most important mode, in which multiple nodes are used: a few of them run the master daemons, Namenode and Resource Manager, and the rest run the slave daemons, DataNode and Node Manager. Here Hadoop runs on a cluster of machines or nodes, and the data being used is distributed across the different nodes. This is actually the production mode of Hadoop. Let's clarify this mode in physical terms: once you download Hadoop as a tar or zip file, in the earlier modes you install it on a single system and run all the processes there, but in fully distributed mode we extract this tar or zip file onto each of the nodes in the Hadoop cluster and then use a particular node for a particular process. Once you distribute the processes among the nodes, you define which nodes work as masters and which of them work as slaves.
The journey of Hadoop started in 2005 with Doug Cutting and Mike Cafarella. It is open-source software built for dealing with large volumes of data. The objective of this article is to make you familiar with the differences between the Hadoop 2.x and Hadoop 3.x versions. Hadoop 3.x has some more advanced and compatible features than the older Hadoop 2.x versions.
Feature: Minimum supported Java version
Hadoop 2.x: Java 7 is the minimum compatible version.
Hadoop 3.x: Java 8 is the minimum compatible version.