Unit II:
Hadoop Ecosystem
Autumn Semester, 2024
Introduction to MapReduce

What is MapReduce?
● Hadoop MapReduce is a software framework designed for processing large datasets (multi-terabyte data) in parallel across clusters of commodity hardware, often consisting of thousands of nodes.
● Hadoop can execute MapReduce programs written in various languages such as Java, Python, and Ruby.
● MapReduce programs are inherently parallel, making large-scale data analysis accessible to anyone with a sufficient number of machines.
Key Ideas and Concepts
● A MapReduce job is a unit of work that the client wants to be performed. It includes the input
data, the MapReduce program that defines how the data should be processed, and the configuration
information required for executing the job.
● Example unit of work: count the occurrences of each word across a collection of text documents.
○ Input data: The data to be processed. Example: A collection of text documents.
○ MapReduce program: The code that defines the map and reduce functions. Example: A
program to count the frequency of each word in the text documents.
○ Configuration information: Settings and parameters required for executing the job. Example:
Parameters such as the number of map and reduce tasks, input/output paths, and split size.
● Hadoop (or the MapReduce Framework) runs the job by dividing it into tasks, of which there are two
types: map tasks and reduce tasks.
● The tasks are scheduled using YARN and run on nodes in the cluster. If a task fails, it will be
automatically rescheduled to run on a different node.
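To make these three pieces concrete, here is a minimal driver sketch for the word-count job. The class names WordCountDriver, WordCountMapper, and WordCountReducer are illustrative (the mapper and reducer are sketched later in this unit), and the input/output paths come from the command line.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();          // configuration information
        Job job = Job.getInstance(conf, "word count");     // the unit of work (the job)
        job.setJarByClass(WordCountDriver.class);

        job.setMapperClass(WordCountMapper.class);         // the MapReduce program
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        job.setNumReduceTasks(1);                          // example configuration parameter (a single reducer here)

        FileInputFormat.addInputPath(job, new Path(args[0]));    // input data
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output location
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}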
Key Ideas and Concepts
● Hadoop divides the input to a MapReduce job into fixed-size pieces called input splits, or just
splits.
● Hadoop creates one map task for each split, which runs the user-defined map function for each
record in the split.
● Having many splits means the time taken to process each split is small compared to the time to
process the whole input.
● So if we are processing the splits in parallel, the processing is better load balanced when the splits
are small, since a faster machine will be able to process proportionally more splits over the course
of the job than a slower machine.
● Even if the machines are identical, failed processes or other jobs running concurrently make load
balancing desirable, and the quality of the load balancing increases as the splits become more fine
grained.
Key Ideas and Concepts
● On the other hand, if splits are too small, the overhead of managing the splits and map task
creation begins to dominate the total job execution time.
● For most jobs, a good split size tends to be the size of an HDFS block, which is 128 MB by
default, although this can be changed for the cluster (for all newly created files) or specified
when each file is created.
● Hadoop does its best to run the map task on a node where the input data resides in HDFS,
because it doesn’t use valuable cluster bandwidth. This is called the data locality optimization.
● Sometimes, however, all the nodes hosting the HDFS block replicas for a map task’s input split are running other map tasks, so the job scheduler will look for a free map slot on a node in the same rack as one of the blocks.
● Very occasionally even this is not possible, so an off-rack node is used, which results in an inter-rack network transfer.
Figure: Data-local (a), rack-local (b), and off-rack (c) map tasks.
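The split size discussed above normally follows the HDFS block size, but it can also be bounded per job. A minimal sketch using the standard FileInputFormat helpers (the byte values here are examples only, not recommendations):

import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class SplitSizeExample {
    // Sketch: bound the split size for a job; the equivalent configuration properties are
    // mapreduce.input.fileinputformat.split.minsize and mapreduce.input.fileinputformat.split.maxsize.
    static void configureSplits(Job job) {
        FileInputFormat.setMinInputSplitSize(job, 64L * 1024 * 1024);   // 64 MB lower bound
        FileInputFormat.setMaxInputSplitSize(job, 128L * 1024 * 1024);  // 128 MB upper bound
    }
}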
Key Ideas and Concepts
● Map tasks write their output to the local disk, not to HDFS. Why is this? Map output is intermediate
output: it’s processed by reduce tasks to produce the final output, and once the job is complete, the
map output can be discarded.
● Map output is typically stored as key-value pairs. Each map task generates intermediate data in the
form of these pairs, where the key is a piece of information derived from the input data, and the
value is associated data generated by the map function.
● Storing this intermediate output in HDFS with replication is unnecessary. If the node running the
map task fails before the map output has been consumed by the reduce task, Hadoop will
automatically rerun the map task on another node to re-create the map output.
● Reduce tasks don’t have the advantage of data locality; the input to a single reduce task is normally
the output from all mappers.
Key Ideas and Concepts
● In the present example, we have a single
reduce task that is fed by all of the map
tasks.
● Therefore, the sorted map outputs have
to be transferred across the network to
the node where the reduce task is
running, where they are merged and
then passed to the user-defined reduce
function.
● The output of the reduce is normally
stored in HDFS for reliability.
Figure: The dotted boxes indicate nodes, the dotted arrows show data transfers on a node, and the solid arrows show data transfers between nodes.
Key Ideas and Concepts
● The number of reduce tasks is not governed by the size of the input, but instead is specified
independently.
● When there are multiple reducers, the map tasks partition their output, each creating one
partition for each reduce task.
● There can be many keys (and their associated values) in each partition, but the records for any
given key are all in a single partition.
● The partitioning can be controlled by a user-defined partitioning function, but normally the
default partitioner—which buckets keys using a hash function—works very well.
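For reference, the default partitioner behaves essentially as in the sketch below: the partition (and hence the reducer) is derived from the hash of the key, so all records with the same key end up in the same partition.

import org.apache.hadoop.mapreduce.Partitioner;

// A sketch in the spirit of Hadoop's default hash partitioner.
public class HashLikePartitioner<K, V> extends Partitioner<K, V> {
    @Override
    public int getPartition(K key, V value, int numReduceTasks) {
        // Mask off the sign bit so the result is non-negative, then bucket by modulo.
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}

A user-defined partitioner is plugged in with job.setPartitionerClass(...).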
Key Ideas and Concepts
● It’s also possible to have zero reduce
tasks.
● This can be appropriate when you
don’t need the shuffle because the
processing can be carried out entirely
in parallel
● In this case, the only off-node data
transfer is when the map tasks write
to HDFS
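A minimal sketch of configuring such a map-only job (no shuffle; map output is written straight to HDFS):

import org.apache.hadoop.mapreduce.Job;

public class MapOnlyJobExample {
    // Sketch: with zero reduce tasks the job becomes map-only, so there is
    // no shuffle and each map task writes its output directly to HDFS.
    static void makeMapOnly(Job job) {
        job.setNumReduceTasks(0);
    }
}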
Key Ideas and Concepts
● The data flow for the general case of
multiple reduce tasks is illustrated in
the diagram.
● This diagram makes it clear why the
data flow between map and reduce
tasks is colloquially known as “the
shuffle,” as each reduce task is fed by
many map tasks. The shuffle is more
complicated than this diagram
suggests, and tuning it can have a big
impact on job execution time, as you
will see in Shuffle and Sort.
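Two of the properties commonly tuned for the shuffle are sketched below; the values are illustrative only, not recommendations.

import org.apache.hadoop.conf.Configuration;

public class ShuffleTuningExample {
    // Sketch of two shuffle-related knobs (illustrative values).
    static Configuration tuneShuffle() {
        Configuration conf = new Configuration();
        conf.setInt("mapreduce.task.io.sort.mb", 200);              // map-side sort buffer size, in MB
        conf.setInt("mapreduce.reduce.shuffle.parallelcopies", 10); // parallel map-output fetches per reducer
        return conf;
    }
}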
Advantages of MapReduce
1. Parallel Processing
● Efficient Data Handling: By breaking data into smaller
chunks and processing them simultaneously across multiple
nodes, MapReduce significantly speeds up data processing.
● Scalability: It easily scales to handle large volumes of data by
distributing tasks across a cluster of machines, allowing for
horizontal scaling.
● Fault Tolerance: MapReduce can handle node failures
gracefully by reassigning tasks to other nodes, ensuring that
the processing continues even in the case of hardware
failures.
Advantages of MapReduce
1. Parallel Processing
● Simplified Programming Model: The Map and Reduce functions
abstract the complexity of parallel processing, making it easier
for developers to write distributed applications.
● Efficient Resource Utilization: By leveraging parallelism,
MapReduce optimizes the use of computational resources,
leading to faster processing times.
Advantages of MapReduce
2. Data Locality
● Reduced Data Transfer Costs: By moving processing tasks to where the
data resides, MapReduce minimizes the need to transfer large volumes
of data across the network, which can be costly and time-consuming.
● Improved Performance: Reduces the overhead associated with data
movement, leading to faster and more efficient data processing.
● Enhanced Scalability: As data grows, MapReduce scales efficiently by
distributing both data and processing tasks across a cluster, allowing the
system to handle large datasets without significant performance
degradation.
Advantages of MapReduce
2. Data Locality
● Lower Latency: Reduces the time it takes to access and process data,
resulting in lower latency and quicker results.
● Optimized Resource Utilization: By executing tasks on the nodes
where data is stored, MapReduce makes better use of computing
resources, as it avoids the bottleneck of moving data between
storage and processing units.
Example of Election Votes Counting
● Votes are stored at different booths
● The result center has the details of all the booths.
Example of Election Votes Counting: Traditional Approach
● Counting - Traditional Approach:
Votes are transported to a central
result center for counting.
● Moving all the votes to the center can be costly, and the counting process takes time.
Example of Election Votes Counting: MapReduce Approach
● Counting - MapReduce Approach:
Votes are counted at individual booths.
● The results from each booth are then sent
to the central result center.
● This method allows for the final result to be declared quickly and efficiently.
MapReduce Approach
Anatomy of a MapReduce Program
Map:    (K1, V1) → list(K2, V2)
Reduce: (K2, list(V2)) → list(K3, V3)
Example of Word Count Process in MapReduce
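As a concrete sketch of the word-count process (the classic example; the class names match the driver sketched earlier and are illustrative):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map: (K1 = byte offset, V1 = line of text) -> list(K2 = word, V2 = 1)
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        for (String token : value.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);   // emit (word, 1) for every word in the line
            }
        }
    }
}

// Reduce: (K2 = word, list(V2) = counts) -> list(K3 = word, V3 = total count)
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable count : values) {
            sum += count.get();
        }
        context.write(key, new IntWritable(sum));  // emit (word, total)
    }
}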
Executing a MapReduce Program
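A typical way to execute the program sketched above (the jar name and HDFS paths are hypothetical) is to compile it against the Hadoop classpath, package it into a jar, and submit it with the hadoop command:

javac -cp $(hadoop classpath) WordCount*.java
jar cf wordcount.jar WordCount*.class
hadoop jar wordcount.jar WordCountDriver /user/alice/input /user/alice/output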
Introduction to YARN
What is YARN?
● YARN (Yet Another Resource Negotiator) is a cluster resource management system for Hadoop.
● YARN was introduced in Hadoop 2.x to enhance resource management and job scheduling,
addressing the limitations and bottlenecks of the Hadoop 1.x architecture.
● YARN functions as a resource management and job scheduling layer within the Hadoop
ecosystem, enabling high-level applications (such as MapReduce, Spark, and HBase) to share and
utilize the underlying Hadoop infrastructure efficiently.
● With the introduction of YARN, Hadoop evolved from being solely a MapReduce framework to a
comprehensive platform for big data processing.
● YARN is often described as a large-scale, distributed resource management system designed to
support a variety of big data applications.
MapReduce 1.x Execution Framework
In Hadoop 1.0, there are two major components for job execution: the JobTracker and the TaskTrackers.
❖ The JobTracker is a master daemon.
It is responsible for managing resources and scheduling jobs. It also tracks the status of each job and restarts tasks if there is any failure.
❖ TaskTrackers are slave daemons.
They run on the systems where the DataNodes reside. They are responsible for running tasks and sending progress reports to the JobTracker. The JobTracker reschedules failed tasks on different TaskTrackers.
Motivation for MapReduce V2
Because the JobTracker could be overloaded with multiple responsibilities, Hadoop's architecture was redesigned in version 2 to eliminate the following limitations of Hadoop 1.0:
❏ Scalability: In Hadoop 1.0, the JobTracker is responsible for scheduling the jobs,
monitoring each job, and restarting them on failure.
❏ It means JobTracker spends the majority of its time managing the application's life cycle.
❏ In a larger cluster with more nodes and more tasks, the burden of scheduling and
monitoring increases.
❏ This overhead limits the scalability of Hadoop version 1 to about 4,000 nodes and 40,000 tasks (according to Yahoo).
Motivation for MapReduce V2
❏ High availability: High availability ensures that even if the node serving requests goes down, a standby node can take over the responsibilities of the failed node.
❏ For this to work, the state of the active node must be kept in sync with the standby node.
❏ The JobTracker is a single point of failure. Every few seconds, TaskTrackers send information about tasks to the JobTracker, and this large number of state changes in a very short span of time makes it difficult to implement high availability for the JobTracker.
Motivation for MapReduce V2
❏ Memory utilization: Hadoop version 1 required pre-configured TaskTracker slots for map and reduce tasks. A slot reserved for a map task could not be used for a reduce task, or the other way around, so efficient utilization of TaskTracker memory was not possible with this setup.
❏ Non-MapReduce jobs: Every job in Hadoop version 1 required MapReduce for its completion, because scheduling was only possible through the JobTracker, and the JobTracker and TaskTrackers were tightly coupled with the MapReduce framework. As Hadoop adoption grew rapidly, many new requirements emerged, such as graph processing and real-time analytics, that needed to run over the same HDFS storage to reduce complexity, infrastructure, and maintenance costs.
YARN Architecture
YARN Architecture
❏ The initial idea of YARN was to split the
resource management and job scheduling
responsibilities of the JobTracker.
❏ YARN consists of two major components: the Resource Manager and the Node Manager.
❏ The Resource Manager is a master node that is responsible for managing resources in the cluster.
❏ A per-application Application Master, running on a Node Manager, is responsible for launching and monitoring the containers of its job.
❏ The cluster consists of one Resource
Manager and multiple Node Managers, as
seen in the diagram.
YARN Architecture
1) Resource Manager: The Resource
Manager is a master daemon that is
responsible for managing the resources of
submitted applications.
It has two primary components:
a) Scheduler
b) Application Manager
YARN Architecture
❏ a) Scheduler: The job of the Resource Manager's Scheduler is to allocate the resources requested by the per-application Application Master.
❏ The Scheduler only schedules; it does not monitor any task and is not responsible for relaunching any failed application container.
❏ The application sends a scheduling request to YARN along with detailed scheduling information, including the amount of memory required for the job. Upon receiving the scheduling request, the Scheduler simply schedules the job.
YARN Architecture
❏ b) Application Manager: The job of the Application Manager is to manage the per-application Application Masters.
❏ Each application submitted to YARN has its own Application Master, and the Application Manager keeps track of each of them.
❏ Each client request for job submission is received by the Application Manager, which provides the resources to launch the Application Master for that application.
YARN Architecture
❏ It also destroys the Application Master upon completion of the application's execution.
❏ When cluster resources become limited and are already in use, the Resource Manager can request resources back from a running application so that it can allocate them to other applications.
YARN Architecture
❏ 2) Node Manager: The Node Manager is a slave daemon that runs on every worker node of the cluster and is responsible for launching and executing containers based on instructions from the Resource Manager.
❏ The Node Manager sends heartbeat signals to the Resource Manager; these also carry other information, such as the Node Manager's machine details and its available memory.
❏ Upon receiving each heartbeat, the Resource Manager updates its information about that Node Manager, which helps in planning and scheduling upcoming tasks. Containers are launched on the Node Manager, and the Application Master itself also runs in a Node Manager container.
YARN Architecture
❏ 3) Application Master: The first step of an application is to submit the job to YARN; upon receiving the job-submission request, YARN's Resource Manager launches the Application Master for that particular job in a container on one of the Node Managers.
❏ The Application Master is then responsible for managing the application's execution in the cluster. Each application has a dedicated Application Master, running in some Node Manager container, that coordinates between the Resource Manager and the Node Managers in order to complete the execution of the application.
YARN Architecture
❏ The Application Master requests the resources required for application execution from the Resource Manager; the Resource Manager sends back detailed information about the allocated Resource Containers, and the Application Master then coordinates with the respective Node Managers to launch the containers that execute the application's tasks.
❏ The Application Master sends heartbeats to the Resource Manager at regular intervals and reports its resource usage.
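As a small illustration of the client side of this architecture, the YarnClient API can be used to query the Resource Manager for cluster state. A minimal sketch, assuming a reachable cluster and a yarn-site.xml on the classpath:

import java.util.List;
import org.apache.hadoop.yarn.api.records.NodeReport;
import org.apache.hadoop.yarn.api.records.NodeState;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class ClusterReportExample {
    public static void main(String[] args) throws Exception {
        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(new YarnConfiguration());  // picks up yarn-site.xml from the classpath
        yarnClient.start();

        // Ask the Resource Manager for a report of the running Node Managers.
        List<NodeReport> nodes = yarnClient.getNodeReports(NodeState.RUNNING);
        for (NodeReport node : nodes) {
            System.out.println(node.getNodeId()
                    + " capacity=" + node.getCapability()
                    + " used=" + node.getUsed());
        }
        yarnClient.stop();
    }
}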
Running a WordCount Application in MRv2
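The original slides walk through the run as screenshots; a sketch of the typical command sequence on an MRv2/YARN cluster (the jar name and HDFS paths are hypothetical) looks like this:

# Copy the input data into HDFS (paths are examples).
hdfs dfs -mkdir -p /user/alice/input
hdfs dfs -put local-docs/*.txt /user/alice/input

# Submit the word-count job to YARN.
yarn jar wordcount.jar WordCountDriver /user/alice/input /user/alice/output

# Monitor the application and inspect the result.
yarn application -list
hdfs dfs -cat /user/alice/output/part-r-00000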
Introduction to YARN job scheduling
❏ The Resource Manager has two major components, namely the Application Manager and the Scheduler.
❏ The Resource Manager's Scheduler is responsible for allocating the required resources to an application based on scheduling policies. Before YARN, Hadoop allocated fixed slots for map and reduce tasks from the available memory, which prevented reduce tasks from running in slots allocated for map tasks, and the other way around.
❏ YARN does not define map and reduce slots up front. Based on requests, it launches containers for tasks, which means that any free container can be used for either a map or a reduce task.
Introduction to YARN job scheduling
❏ YARN provides a configurable scheduling policy that allows you to choose the right strategy based on an application's needs. By default, three schedulers are available in YARN, as follows (see the configuration sketch after this list):
a. FIFO scheduler
b. Capacity scheduler
c. Fair scheduler
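The scheduler implementation is selected in yarn-site.xml via the yarn.resourcemanager.scheduler.class property; a sketch (pick one of the three classes):

<property>
  <name>yarn.resourcemanager.scheduler.class</name>
  <!-- FIFO: org.apache.hadoop.yarn.server.resourcemanager.scheduler.fifo.FifoScheduler -->
  <!-- Fair: org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler -->
  <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler</value>
</property>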
Introduction to YARN job scheduling
❏ FIFO scheduler
❏ The FIFO scheduler uses the simple strategy of first come first serve. Memory will be
allocated to applications based on the sequence of request time, which means the
first application in the queue will be allocated the required memory, then the
second, and so on.
❏ In case memory is not available, applications have to wait for sufficient memory
to be available for them to launch their jobs. When the FIFO scheduler is configured,
YARN will make a queue of requests and add applications to the queue, then launch
applications one by one.
Introduction to YARN job scheduling
❏ Capacity scheduler
❏ The Capacity Scheduler divides resources into queues with specific capacities. This approach allows
resources to be allocated based on different organizational or departmental needs. Each queue has a
defined capacity, and jobs within a queue are scheduled according to that queue's capacity and
priority.
❏ The Capacity Scheduler also enables the sharing of resources across an organization,
which supports multi-tenancy and helps increase the utilization of a Hadoop cluster.
Different departments have varying cluster requirements and thus need specific amounts
of resources reserved for them when they submit their jobs.
❏ Resources reserved for a department are used by its users. If no other applications are
submitted to a queue, the unused resources become available for other applications.
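Queues and their capacities are defined in capacity-scheduler.xml; a minimal sketch with two hypothetical departmental queues (capacities are percentages and must sum to 100 under a parent queue):

<property>
  <name>yarn.scheduler.capacity.root.queues</name>
  <value>analytics,reporting</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.analytics.capacity</name>
  <value>70</value>  <!-- 70% of cluster resources reserved for the analytics queue -->
</property>
<property>
  <name>yarn.scheduler.capacity.root.reporting.capacity</name>
  <value>30</value>
</property>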
Introduction to YARN job scheduling
❏ Fair scheduler
❏ In fair scheduling, all applications get roughly an equal share of the available resources. With the fair scheduler, when the first application is submitted to YARN, it is given all of the available resources.
❏ If a new application is then submitted, the scheduler starts allocating resources to it until both applications have roughly equal shares of resources for their execution.
Introduction to YARN job scheduling
❏ Fair scheduler
❏ Unlike the two schedulers discussed before, the fair scheduler protects applications from resource starvation and ensures that the applications in a queue get the memory they need for execution.
❏ The minimum and maximum resource shares of each scheduling queue are calculated from the configuration provided to the fair scheduler (see the allocation-file sketch below). An application gets the amount of resources configured for the queue it is submitted to, and if a new application is submitted to the same queue, the queue's configured resources are shared between both applications.
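Queue shares for the fair scheduler are defined in an allocation file (pointed to by the yarn.scheduler.fair.allocation.file property); a minimal sketch with hypothetical queue names:

<?xml version="1.0"?>
<allocations>
  <queue name="analytics">
    <minResources>10240 mb,4 vcores</minResources>  <!-- guaranteed minimum share -->
    <weight>2.0</weight>                            <!-- twice the share of a weight-1.0 queue -->
  </queue>
  <queue name="reporting">
    <weight>1.0</weight>
  </queue>
</allocations>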
Thank You!