UNIT III

Anatomy of a MapReduce Job Run, Failures, Job Scheduling, Shuffle and Sort, Task Execution, MapReduce Types and Formats, MapReduce Features.
MAPREDUCE ARCHITECTURE
• MapReduce is a programming model used for efficient parallel processing of large datasets in a distributed manner.
• The data is first split and then combined to
produce the final result.
• MapReduce libraries have been written in many programming languages, with various optimizations.
The purpose of MapReduce in Hadoop is to map each job into smaller, equivalent tasks, which reduces overhead on the cluster network and the processing load on any single node. The MapReduce task is mainly divided into two phases: the Map phase and the Reduce phase.
Components of MapReduce Architecture:
Client: The MapReduce client is the one who brings the
Job to the MapReduce for processing. There can be
multiple clients available that continuously send jobs for
processing to the Hadoop MapReduce Manager.
Job: The MapReduce Job is the actual work the client wants done, composed of many smaller tasks that the client wants to process or execute.
Hadoop MapReduce Master: It divides the particular job
into subsequent job-parts.
Job-Parts: The task or sub-jobs that are obtained after
dividing the main job. The result of all the job-parts
combined to produce the final output.
Input Data: The data set that is fed to MapReduce for processing.
The MapReduce task is mainly divided into two phases, i.e., the Map phase and the Reduce phase.

Map: As the name suggests, its main use is to map the input data into key-value pairs. The input to the map may itself be a key-value pair, where the key can be an identifier such as an address and the value is the actual data it holds. The Map() function is executed in its memory repository on each of these input key-value pairs and generates intermediate key-value pairs, which serve as input for the Reducer (the Reduce() function).
Reduce: The intermediate key-value pairs that serve as input to the Reducer are shuffled, sorted, and sent to the Reduce() function. The Reducer aggregates or groups the data by key, as per the reducer logic written by the developer.
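To make the Map and Reduce phases concrete, here is a minimal word-count sketch using the standard org.apache.hadoop.mapreduce Java API. The class names WordCountMapper and WordCountReducer are illustrative (not from the source); in a real project each class would sit in its own file.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map: input key = byte offset of the line, input value = the line of text.
// Emits an intermediate (word, 1) pair for every word in the line.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE); // intermediate key-value pair for the reducer
        }
    }
}

// Reduce: receives each word together with all of its counts (already shuffled
// and sorted by key) and aggregates them into a single total per word.
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum));
    }
}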
How Job tracker and the task tracker deal with
MapReduce:

Job Tracker: The Job Tracker manages all the resources and all the jobs across the cluster, and schedules each map task on a Task Tracker running on the same data node, since there can be hundreds of data nodes available in the cluster.
Task Tracker: The Task Trackers are the worker daemons that act on the instructions given by the Job Tracker. A Task Tracker is deployed on each node available in the cluster and executes the Map and Reduce tasks as instructed by the Job Tracker.
Anatomy of a MapReduce Job Run
•Can run a MapReduce job with a single method call: submit() on a Job object
•Can also call
waitForCompletion() – submits the job if it hasn’t been
submitted already, then waits for it to finish
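A minimal driver sketch showing both calls, reusing the illustrative WordCountMapper and WordCountReducer classes from the earlier sketch; in practice only one of submit() or waitForCompletion() would be used.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        job.setNumReduceTasks(2); // sets the mapreduce.job.reduces property
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // Fire-and-forget submission:
        // job.submit();

        // Or: submit the job (if not already submitted), then block until it
        // finishes, printing progress to the console as it changes.
        boolean success = job.waitForCompletion(true);
        System.exit(success ? 0 : 1);
    }
}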
The provided figure depicts the entire procedure. At the highest level there are five distinct entities:
The client, which submits the MapReduce job.
The YARN resource manager, which coordinates the allocation of compute resources on the cluster.
The YARN node managers, which launch and monitor the compute containers on machines in the cluster.
The MapReduce application master, which coordinates the tasks running the MapReduce job. The application master and the MapReduce tasks run in containers that are scheduled by the resource manager and managed by the node managers.
The distributed filesystem, which is used for sharing job files between the entities.
Job Submission
•The submit() method on job creates an internal
JobSubmitter instance and calls submitJobInternal() on it
(step 1 in figure)
•Having submitted the job, waitForCompletion() polls the job’s progress once per second and reports the progress to the console if it has changed since the last report
•When the job completes successfully, the job counters
are displayed.
•Otherwise, the error that caused the job to fail is logged
to the console.
•The job submission process implemented by JobSubmitter does the following:
•Asks the resource manager for a new application ID
•Checks the output specification of the job. For example, if the output directory has not been specified or it already exists, the job is not submitted and an error is thrown to the MapReduce program
•Computes the input splits for the job. If the splits cannot be computed, the job is not submitted and an error is thrown to the MapReduce program
•Copies the resources needed to run the job, including the
job JAR file, the configuration file and the computed input
splits, to the shared filesystem in a directory named after
the job ID (step 3). The job JAR is copied with a high
replication factor so that there are lots of copies across
the cluster for the node managers to access when they run
tasks for the job.
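The output-specification check is the one most often tripped over: if the output directory already exists, the job is not submitted. A small hedged sketch of a common client-side workaround, deleting a stale output directory before submission with the standard FileSystem API (the class name OutputDirCleaner is illustrative):

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class OutputDirCleaner {
    // Deletes a stale output directory so the output-specification check passes.
    // Use with care: the delete is recursive.
    public static void cleanIfExists(Configuration conf, Path output) throws IOException {
        FileSystem fs = output.getFileSystem(conf);
        if (fs.exists(output)) {
            fs.delete(output, true); // true = delete recursively
        }
    }
}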
Job Initialization
•When the resource manager receives a call to its submitApplication() method, it hands off the request to the YARN scheduler.
•The scheduler allocates a container, and the resource
manager then launches the application master’s process
there, under the node manager’s management (steps 5a
and 5b)
•The application master for MapReduce jobs is a Java
application whose main class is MRAppMaster.
•It initializes the job by creating a number of bookkeeping objects to keep track of the job’s progress, as it will receive progress and completion reports from the tasks (step 6)
•Next, it retrieves the input splits computed in the client
•It then creates a map task object for each split, as well as a number of reduce task objects determined by the mapreduce.job.reduces property (set by the setNumReduceTasks() method on Job). Tasks are given IDs at this point
•The application master must decide how to run the tasks
that make up the MapReduce job
•If the job is small, the application master may choose to
run the tasks in the same JVM as itself.
•This happens when it judges that the overhead of
allocating and running tasks in new containers outweighs
the gain to be had in running them in parallel, compared to
running them sequentially on one node.
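Such a small job is commonly said to be run as an uber task. A hedged per-job configuration sketch, assuming the standard Hadoop 2 property names mapreduce.job.ubertask.*; the thresholds shown reflect the usual defaults and are illustrative:

import org.apache.hadoop.conf.Configuration;

public class UberTaskConfig {
    // Returns a Configuration that allows small jobs to be "uberized",
    // i.e. run entirely inside the application master's JVM.
    public static Configuration create() {
        Configuration conf = new Configuration();
        conf.setBoolean("mapreduce.job.ubertask.enable", true); // allow uber tasks
        conf.setInt("mapreduce.job.ubertask.maxmaps", 9);       // at most 9 map tasks (fewer than 10)
        conf.setInt("mapreduce.job.ubertask.maxreduces", 1);    // at most 1 reduce task
        return conf;
    }
}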
Failures
•In the real world, user code is buggy, processes crash, and machines fail. One of the major benefits of using Hadoop is its ability to handle such failures and allow the job to complete successfully
• Failure of any of the following entities is considered:
– The task
– The application master
– The node manager
– The resource manager
Task Failure
•The most common occurrence of this failure is when
user code in the map or reduce task throws a runtime
exception
•If this happens, the task JVM reports the error back to
its parent application master before it exits
•The error ultimately makes it into the user logs
•The application master marks the task attempt as failed,
and frees up the container so its resources are available
for another task
•Another failure – sudden exit of the task JVM
•In this case, the node manager notices that the process
has exited and informs the application master so it can
mark the attempt as failed
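A failed task attempt is normally rescheduled a limited number of times before the whole job is declared failed. A hedged sketch of tuning that limit per job, assuming the standard mapreduce.map.maxattempts and mapreduce.reduce.maxattempts properties (both default to 4):

import org.apache.hadoop.conf.Configuration;

public class TaskRetryConfig {
    // Configure how many attempts a map or reduce task gets before the
    // whole job is marked as failed. Both properties default to 4.
    public static Configuration create() {
        Configuration conf = new Configuration();
        conf.setInt("mapreduce.map.maxattempts", 4);
        conf.setInt("mapreduce.reduce.maxattempts", 4);
        return conf;
    }
}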
Application Master Failure
•Just like MapReduce tasks are given several attempts to succeed, applications in YARN are retried in the event of failure
•The maximum number of attempts to run a MapReduce application master is controlled by the mapreduce.am.max-attempts property
•The default value is 2, so if a MapReduce application
master fails twice it will not be tried again and the job will
fail
•YARN imposes a limit on the maximum number of attempts for any YARN application
•The limit is set by yarn.resourcemanager.am.max-attempts, which defaults to 2
Node Manager Failure
•If a node manager fails by crashing or running very
slowly, it will stop sending heartbeats to the resource
manager
•The resource manager will notice a node manager that has stopped sending heartbeats if it hasn’t received one for 10 minutes, and will remove it from its pool of nodes to schedule containers on
•Any task or application master running on the failed node
manager will be recovered
•Node managers may be blacklisted if the number of
failures for the application is high
Resource Manager Failure
•Failure of the resource manager is serious because
without it, neither jobs nor task containers can be
launched
•In the default configuration, the resource manager is a
single point of failure, since in the event of machine
failure, all running jobs fail – and can’t be recovered
•To achieve High Availability (HA), it is necessary to run a
pair of resource managers in an active-standby
configuration
•If the active resource manager fails, then the standby can
take over without a significant interruption to the client
•Information about all the running applications is stored in a highly available state store, so that the standby can recover the core state of the failed active resource manager
Job Scheduling

There are three types of job scheduling in MapReduce: the FIFO Scheduler, the Capacity Scheduler, and the Fair Scheduler.
All these schedulers are algorithms used to schedule tasks in a Hadoop cluster when requests are received from multiple clients.
A job queue is simply the collection of tasks received from the various clients. The tasks sit in the queue and need to be scheduled on the basis of our requirements.
1. FIFO Scheduler
As the name suggests, FIFO means First In First Out, so the tasks or applications that come first are served first. This is the default scheduler used in Hadoop. The tasks are placed in a queue and performed in their submission order. In this method, once the job is scheduled, no intervention is allowed. So sometimes a high-priority job has to wait for a long time, since the priority of the task does not matter in this method.
Advantages:
No need for configuration
First Come First Serve
Simple to execute

Disadvantages:
Priority of task doesn’t matter, so high-priority jobs need to wait
Not suitable for a shared cluster
2. Capacity Scheduler
In the Capacity Scheduler we have multiple job queues for scheduling our tasks. The Capacity Scheduler allows multiple tenants to share a large Hadoop cluster. For each job queue, we provide some slots or cluster resources for performing job operations.
Each job queue has its own slots to perform its tasks. If only one queue has tasks to perform, its tasks can also use the slots of other queues while those slots are free; when new tasks arrive for another queue, the borrowed slots are handed back so that queue can run its own jobs.
Advantages:
Best for working with multiple clients or priority jobs in a Hadoop cluster
Maximizes throughput in the Hadoop cluster

Disadvantages:
More complex
Not easy to configure for everyone
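From the client side, a job is directed at a particular Capacity Scheduler queue by naming the queue when the job is configured. A hedged sketch, assuming the standard mapreduce.job.queuename property; the queue name passed in is illustrative:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class QueueSubmission {
    // Direct a job to a named Capacity Scheduler queue instead of the default queue.
    public static Job jobForQueue(String queueName) throws IOException {
        Configuration conf = new Configuration();
        conf.set("mapreduce.job.queuename", queueName); // e.g. "analytics" (illustrative)
        return Job.getInstance(conf, "job for queue " + queueName);
    }
}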
3. Fair Scheduler
The Fair Scheduler is very similar to the Capacity Scheduler. The priority of the job is kept in consideration. With the help of the Fair Scheduler, YARN applications can share the resources in a large Hadoop cluster, and these resources are assigned dynamically, so there is no need to reserve capacity in advance. The resources are distributed in such a manner that all applications within a cluster get an equal amount of time. The Fair Scheduler makes scheduling decisions on the basis of memory; it can also be configured to work with CPU.

As we said, it is similar to the Capacity Scheduler, but the major thing to notice is that in the Fair Scheduler, whenever a new job arrives, the resources are rebalanced so that every running application gets its fair share.
Advantages:
Resources assigned to each application depend upon its priority.
It can limit the concurrent running tasks in a particular pool or queue.

Disadvantages:
Configuration is required.


Shuffle & Sort
What is MapReduce Shuffling and Sorting?
Shuffling is the process of transferring the mappers’ intermediate output to the reducers. A reducer gets one or more keys and their associated values, depending on the number of reducers.

The intermediate key-value pairs generated by the mapper are sorted automatically by key. In the sort phase, merging and sorting of the map output take place.

Shuffling and sorting in Hadoop occur simultaneously.


Shuffling in MapReduce
The process of transferring data from the mappers to the reducers is shuffling. It is also the process by which the system performs the sort and then transfers the map output to the reducer as input. This is why the shuffle phase is necessary for the reducers; otherwise, they would not have any input. Shuffling can start even before the map phase has finished, which saves some time and completes the job in less time.
Sorting in MapReduce
The MapReduce framework automatically sorts the keys generated by the mapper. Thus, before the reducer starts, all intermediate key-value pairs are sorted by key, not by value. It does not sort the values passed to each reducer; they can be in any order.

Sorting in a MapReduce job helps the reducer easily distinguish when a new reduce task should start. This saves time for the reducer: the reducer starts a new reduce task when the next key in the sorted input data is different from the previous one. Each reduce task takes key-value pairs as input and generates key-value pairs as output.
The important thing to note is that shuffling and sorting in Hadoop MapReduce will not take place at all if you specify zero reducers (setNumReduceTasks(0)).

If the number of reducers is zero, the MapReduce job stops at the map phase.
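A minimal sketch of making a job map-only, so that no shuffle or sort happens and the map output is written directly by the job's output format:

import org.apache.hadoop.mapreduce.Job;

public class MapOnlyJob {
    // Zero reducers turns the job into a map-only job: no shuffle, no sort,
    // and the map output is written directly by the job's output format.
    public static void makeMapOnly(Job job) {
        job.setNumReduceTasks(0);
    }
}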
Secondary Sorting in MapReduce
If we want to sort reducer values, then we use a
secondary sorting technique. This technique enables us
to sort the values (in ascending or descending order)
passed to each reducer.
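A hedged sketch of the usual secondary-sort recipe: pack the part of the value to be sorted into a composite key, let the framework sort on the whole composite key, and group reduce input on the natural key only. The class name CompositeKey and the field types are illustrative, not from the source; a real job would also need a partitioner on the natural key.

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;
import org.apache.hadoop.mapreduce.Job;

// Composite key: natural key plus the value field we want sorted for each reducer call.
public class CompositeKey implements WritableComparable<CompositeKey> {
    private final Text naturalKey = new Text();
    private final IntWritable secondary = new IntWritable();

    public void set(String key, int value) {
        naturalKey.set(key);
        secondary.set(value);
    }

    public Text getNaturalKey() { return naturalKey; }

    @Override
    public void write(DataOutput out) throws IOException {
        naturalKey.write(out);
        secondary.write(out);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        naturalKey.readFields(in);
        secondary.readFields(in);
    }

    // Sort by natural key first, then by the secondary value (ascending).
    @Override
    public int compareTo(CompositeKey other) {
        int cmp = naturalKey.compareTo(other.naturalKey);
        return cmp != 0 ? cmp : secondary.compareTo(other.secondary);
    }

    // Grouping comparator: reduce input is grouped by the natural key only, so all
    // values for one natural key arrive in one reduce call, already ordered by the
    // secondary value.
    public static class GroupComparator extends WritableComparator {
        protected GroupComparator() {
            super(CompositeKey.class, true);
        }

        @Override
        public int compare(WritableComparable a, WritableComparable b) {
            return ((CompositeKey) a).getNaturalKey()
                    .compareTo(((CompositeKey) b).getNaturalKey());
        }
    }

    // Wire the grouping comparator into a job (partitioner omitted for brevity).
    public static void configure(Job job) {
        job.setGroupingComparatorClass(GroupComparator.class);
    }
}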
Task Execution
Speculative Execution
• The MapReduce model is to break jobs into tasks and run the tasks
in parallel to make the overall job execution time smaller than it would
otherwise be if the tasks ran sequentially.
• This makes job execution time sensitive to slow-running tasks, as it
takes only one slow task to make the whole job take significantly longer
than it would have done otherwise.
• When a job consists of hundreds or thousands of tasks, the
possibility of a few straggling tasks is very real.
• Tasks may be slow for various reasons, including hardware
degradation or software mis-configuration, but the causes may be hard to
detect since the tasks still complete successfully.

•Hadoop doesn’t try to diagnose and fix slow-running
tasks; instead, it tries to detect when a task is running
slower than expected and launches another, equivalent,
task as a backup.
•This is termed speculative execution of tasks.
•It’s important to understand that speculative execution
does not work by launching two duplicate tasks at about
the same time so they can race each other.
•This would be wasteful of cluster resources.
•When a task completes successfully, any duplicate tasks
that are running are killed since they are no longer
needed.
•So if the original task completes before the speculative
task, then the speculative task is killed; on the other
hand, if the speculative task finishes first, then the
original is killed.
•Speculative execution is an optimization, not a feature to
make jobs run more reliably.
•If there are bugs that sometimes cause a task to hang or
slow down, then relying on speculative execution to avoid
these problems is unwise, and won’t work reliably, since
the same bugs are likely to affect the speculative task.
•Speculative execution is turned on by default.
•It can be enabled or disabled independently for map
tasks and reduce tasks, on a cluster- wide basis, or on a
per-job basis.
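A hedged per-job sketch of that switch, assuming the standard Hadoop 2 property names mapreduce.map.speculative and mapreduce.reduce.speculative (both default to true):

import org.apache.hadoop.conf.Configuration;

public class SpeculationConfig {
    // Enable or disable speculative execution independently for map and reduce tasks.
    public static Configuration withSpeculation(boolean maps, boolean reduces) {
        Configuration conf = new Configuration();
        conf.setBoolean("mapreduce.map.speculative", maps);
        conf.setBoolean("mapreduce.reduce.speculative", reduces);
        return conf;
    }
}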
Features of MapReduce
The following advanced features characterize MapReduce:

1. Highly scalable
Apache Hadoop MapReduce is a framework with excellent scalability. This is because of its capacity to distribute and store large amounts of data across numerous servers, which can all run simultaneously and are all reasonably priced.
2. Versatile
Businesses can use MapReduce programming to access
new data sources. It makes it possible for companies to
work with many forms of data. Enterprises can access
both organized and unstructured data with this method
and acquire valuable insights from the various data
sources.
3. Secure
The MapReduce programming model uses the HBase and
HDFS security approaches, and only authenticated users
are permitted to view and manipulate the data. HDFS
uses a replication technique in Hadoop 2 to provide fault
tolerance. Depending on the replication factor, it makes a
clone of each block on the various machines.
4. Cost-effective
With the help of the MapReduce programming framework
and Hadoop’s scalable design, big data volumes may be
stored and processed very affordably. Such a system is
particularly cost-effective and highly scalable, making it
ideal for business models that must store data that is
constantly expanding to meet the demands of the present.
5. Fast-paced
The Hadoop Distributed File System, the distributed storage technique used by MapReduce, is a mapping system for locating data in a cluster. Data processing technologies such as MapReduce programming are typically placed on the same servers as the data, which enables quicker data processing.
6. Based on a simple programming model
Hadoop MapReduce is built on a straightforward
programming model and is one of the technology’s
many noteworthy features. This enables
programmers to create MapReduce applications
that can handle tasks quickly and effectively. Java
is a very well-liked and simple-to-learn
programming language used to develop the
MapReduce programming model.
7. Parallel processing-compatible
The parallel processing involved in MapReduce
programming is one of its key components. The tasks are
divided in the programming paradigm to enable the
simultaneous execution of independent activities. As a
result, the program runs faster because of the parallel
processing, which makes it simpler for the processes to
handle each job. Multiple processors can carry out these
broken-down tasks thanks to parallel processing.
Consequently, the entire software runs faster.
8. Reliable
The same set of data is transferred to some other nodes
in a cluster each time a collection of information is sent
to a single node. Therefore, even if one node fails,
backup copies are always available on other nodes that
may still be retrieved whenever necessary. This ensures
high data availability.
9. Highly available
Hadoop’s fault tolerance feature ensures that even if one
of the DataNodes fails, the user may still access the data
from other DataNodes that have copies of it. Moreover,
the high-availability Hadoop cluster comprises two or more active and passive NameNodes running on hot standby. The active NameNode is the working node; a passive node is a backup node that applies the changes made in the active NameNode’s edit logs to its own namespace.
