Chap8 YARN

The document discusses the architecture and features of YARN (Yet Another Resource Negotiator) as part of Hadoop 2.x, emphasizing its role in resource management and application execution. It highlights YARN's scalability, multi-tenancy, compatibility, and serviceability, which improve upon the limitations of the earlier Hadoop v1 architecture. Additionally, it explains how YARN allows multiple applications to run concurrently on the same cluster, enhancing overall resource utilization.


YARN architecture


Figure 5-33. YARN architecture


Topics
• Introduction to MapReduce
• Hadoop v1 and MapReduce v1 architecture and limitations
• YARN architecture
• Hadoop and MapReduce v1 compared to v2


Figure 5-34. Topics


YARN
• Acronym for Yet Another Resource Negotiator.
• A new resource manager that is included in Hadoop 2.x and later.
• De-couples Hadoop workload management from resource management.
• Introduces a general-purpose application container.
• Hadoop 2.2.0 includes the first generally available (GA) version of
YARN.
• Most Hadoop vendors support YARN.


Figure 5-35. YARN

YARN is a key component of the Hortonworks Data Platform (HDP).


YARN high-level architecture


In Hortonworks Data Platform (HDP), users can use YARN and
applications that are written to YARN APIs.

Diagram: existing MapReduce applications run on MapReduce v2 (batch), which sits alongside Apache Tez (interactive), HBase (online), Spark (in memory), and other engines (varied) on top of YARN (cluster resource management), which in turn runs on HDFS.

Figure 5-36. YARN high-level architecture


Running an application in YARN (1 of 7)


Diagram: a ResourceManager on node132 and NodeManagers on node133, node134, node135, and node136.

Figure 5-37. Running an application in YARN (1 of 7)


Running an application in YARN (2 of 7)


Diagram: a client submits Application 1 (analyze the lineitem table) to the ResourceManager on node132, which launches Application Master 1 on the NodeManager at node135.

Figure 5-38. Running an application in YARN (2 of 7)


Running an application in YARN (3 of 7)


Diagram: Application Master 1 sends a resource request to the ResourceManager and receives container IDs in response.

Figure 5-39. Running an application in YARN (3 of 7)


Running an application in YARN (4 of 7)


Diagram: Application Master 1 launches Application 1 containers on the NodeManagers at node134 (two containers) and node136 (one container).

Figure 5-40. Running an application in YARN (4 of 7)


Running an application in YARN (5 of 7)


Diagram: a second application, Application 2 (analyze the customer table), is submitted; the ResourceManager launches Application Master 2 on the NodeManager at node136 while Application 1's containers continue to run.

Figure 5-41. Running an application in YARN (5 of 7)


Running an application in YARN (6 of 7)


Diagram: Application Masters 1 and 2 and Application 1's containers run concurrently across the NodeManagers.

Figure 5-42. Running an application in YARN (6 of 7)


Running an application in YARN (7 of 7)


Diagram: Application 2 containers are launched on node133 and node135 alongside Application 1's containers on node134 and node136; both Application Masters remain active.

Figure 5-43. Running an application in YARN (7 of 7)


How YARN runs an application

Diagram: (1) the application client on the client node submits a YARN application to the resource manager; (2a, 2b) the resource manager has a NodeManager start and launch a container that runs the application process; (3) that process allocates further resources from the resource manager over the heartbeat; (4a, 4b) additional containers are started and launched on other node manager nodes to run more application processes.

Figure 5-44. How YARN runs an application

To run an application on YARN, a client contacts the resource manager and prompts it to run an
application master process (step 1). The resource manager then finds a node manager that can
launch the application master in a container (steps 2a and 2b). Precisely what the application
master does after it is running depends on the application. It might simply run a computation in the
container it is running in and return the result to the client, or it might request more containers from
the resource manager (step 3) and use them to run a distributed computation (steps 4a and 4b).
For more information, see White, T. (2015). Hadoop: The Definitive Guide (4th ed.). Sebastopol, CA:
O'Reilly Media, p. 80.
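To make the submission flow concrete, the following is a minimal client-side sketch that uses the YarnClient API from org.apache.hadoop.yarn.client.api (Hadoop 2.x). The application name, queue, resource sizes, and the ApplicationMaster launch script path are illustrative assumptions, not values from this course.

    import java.util.Collections;

    import org.apache.hadoop.yarn.api.ApplicationConstants;
    import org.apache.hadoop.yarn.api.records.ApplicationId;
    import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext;
    import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
    import org.apache.hadoop.yarn.api.records.Resource;
    import org.apache.hadoop.yarn.client.api.YarnClient;
    import org.apache.hadoop.yarn.client.api.YarnClientApplication;
    import org.apache.hadoop.yarn.conf.YarnConfiguration;
    import org.apache.hadoop.yarn.util.Records;

    public class SubmitToYarn {
        public static void main(String[] args) throws Exception {
            YarnConfiguration conf = new YarnConfiguration();
            YarnClient yarnClient = YarnClient.createYarnClient();
            yarnClient.init(conf);
            yarnClient.start();

            // Step 1: ask the ResourceManager for a new application.
            YarnClientApplication app = yarnClient.createApplication();

            // Describe the container that hosts the ApplicationMaster
            // (steps 2a and 2b: the RM asks a NodeManager to launch it).
            ContainerLaunchContext amContainer = Records.newRecord(ContainerLaunchContext.class);
            amContainer.setCommands(Collections.singletonList(
                    "/path/to/my_app_master.sh"                       // hypothetical AM launcher
                    + " 1>" + ApplicationConstants.LOG_DIR_EXPANSION_VAR + "/stdout"
                    + " 2>" + ApplicationConstants.LOG_DIR_EXPANSION_VAR + "/stderr"));

            ApplicationSubmissionContext appContext = app.getApplicationSubmissionContext();
            appContext.setApplicationName("demo-app");                // assumed name
            appContext.setAMContainerSpec(amContainer);
            appContext.setResource(Resource.newInstance(1024, 1));    // 1 GB, 1 vcore for the AM
            appContext.setQueue("default");

            // Submit; the ResourceManager takes over from here (steps 3 and 4
            // are then driven by the ApplicationMaster itself).
            ApplicationId appId = yarnClient.submitApplication(appContext);
            System.out.println("Submitted application " + appId);
        }
    }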


YARN features
• Scalability
• Multi-tenancy
• Compatibility
• Serviceability
• Higher cluster utilization
• Reliability and availability


Figure 5-45. YARN features


YARN features: Scalability


• There is one Application Master per job, which is why YARN scales
better than the previous Hadoop v1 architecture. The Application Master
for a job can run on an arbitrary cluster node, and it runs until the job
reaches termination.
• The separation of functions enables the individual operations to be
improved with less effect on other operations.
• YARN supports rolling upgrades without downtime.

The ResourceManager focuses exclusively on scheduling, enabling clusters to expand to thousands of nodes managing petabytes of data.


Figure 5-46. YARN features: Scalability

YARN lifts the scalability ceiling in Hadoop by splitting the roles of the Hadoop JobTracker into two
processes: A ResourceManager controls access to the cluster’s resources (memory, CPU, and
other components), and the ApplicationMaster (one per job) controls task execution.
YARN can run on larger clusters than MapReduce v1. MapReduce v1 reaches scalability
bottlenecks in the region of 4,000 nodes and 40,000 tasks, which stems from the fact that the
JobTracker must manage both jobs and tasks. YARN overcomes these limitations by using its split
ResourceManager / ApplicationMaster architecture: It is designed to scale up to 10,000 nodes and
100,000 tasks.
In contrast to the JobTracker, each instance of an application has a dedicated ApplicationMaster,
which runs for the duration of the application. This model is closer to the original Google
MapReduce paper, which describes how a master process is started to coordinate Map and
Reduce tasks running on a set of workers.


YARN features: Multi-tenancy


• YARN allows multiple access engines (either open source or
proprietary) to use Hadoop as the common standard for batch,
interactive, and real-time engines that can simultaneously access the
same data sets.
• YARN uses a shared pool of nodes for all jobs.
• YARN allows the allocation of Hadoop clusters of fixed size from the
shared pool.

Multi-tenant data processing improves an enterprise's return on its Hadoop investment.


Figure 5-47. YARN features: Multi-tenancy

Multi-tenancy generally refers to a set of features that enable multiple business users and
processes to share a common set of resources, such as an Apache Hadoop cluster, through policy
rather than physical separation, without negatively impacting service-level agreements (SLAs),
violating security requirements, or even revealing the existence of the other parties.
What YARN does is de-couple Hadoop workload management from resource management, which
means that multiple applications can share a common infrastructure pool. Although this idea is not
new, it is new to Hadoop. Earlier versions of Hadoop consolidated both workload and resource
management functions into a single JobTracker. This approach resulted in limitations for customers
hoping to run multiple applications on the same cluster infrastructure.
To borrow from object-oriented programming terminology, multi-tenancy is an overloaded term. It
means different things to different people depending on their orientation and context. To say that a
solution is multi-tenant is not helpful unless you are specific about the meaning.

Some interpretations of multi-tenancy in big data environments are:
• Support for multiple concurrent Hadoop jobs
• Support for multiple lines of business on a shared infrastructure
• Support for multiple application workloads of different types (Hadoop and non-Hadoop)
• Provisions for security isolation between tenants
• Contract-oriented service level guarantees for tenants
• Support for multiple versions of applications and application frameworks concurrently
Organizations that are sophisticated in their view of multi-tenancy need all these capabilities and
more. YARN promises to address some of these requirements and does so in large measure.
However, future releases of Hadoop will also pursue other approaches that provide other forms of multi-tenancy.
Although it is an important technology, the world is not suffering from a shortage of resource
managers. Some Hadoop providers are supporting YARN, and others are supporting Apache
Mesos.


YARN features: Compatibility


• To the user (a developer, not an administrator), the changes are almost
invisible.
• It is possible to run unmodified MapReduce jobs by using the same
MapReduce API and CLI, although you might need to recompile.

There is no reason not to migrate from MRv1 to YARN.


Figure 5-48. YARN features: Compatibility

To ease the transition from Hadoop v1 to YARN, a major goal of YARN and the MapReduce
framework implementation on top of YARN was to ensure that existing MapReduce applications
that were programmed and compiled against previous MapReduce APIs (MRv1 applications) can
continue to run with little or no modification on YARN (MRv2 applications).
For many users who use the org.apache.hadoop.mapred APIs, MapReduce on YARN ensures full
binary compatibility. These existing applications can run on YARN directly without recompilation.
You can use JAR files from your existing application that code against mapred APIs and use
bin/hadoop to submit them directly to YARN.
Unfortunately, it was difficult to ensure full binary compatibility with the existing applications that
were compiled against the MRv1 org.apache.hadoop.mapreduce APIs. These APIs have gone through
many changes. For example, several classes stopped being abstract classes and changed to
interfaces. Therefore, the YARN community compromised by supporting source compatibility only
for the org.apache.hadoop.mapreduce APIs. Existing applications that use these APIs are
source-compatible and can run on YARN either with no changes, with simple recompilation against
MRv2 .jar files that are included with Hadoop 2, or with minor updates.
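As an illustration of this source compatibility, the following is a minimal WordCount-style driver written against the org.apache.hadoop.mapreduce API. The TokenizerMapper and IntSumReducer classes are assumed to exist elsewhere in the application; recompiled against the Hadoop 2 JAR files, the same driver submits the job to YARN with no source changes.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCountDriver {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "word count");
            job.setJarByClass(WordCountDriver.class);
            job.setMapperClass(TokenizerMapper.class);   // assumed mapper class
            job.setReducerClass(IntSumReducer.class);    // assumed reducer class
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));

            // Under Hadoop 2, waitForCompletion() submits the job to YARN, where
            // the MapReduce framework runs as a per-job ApplicationMaster.
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }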


YARN features: Higher cluster utilization


• Higher cluster utilization means that resources that are not used by one
framework can be consumed by another one.
• The NodeManager is a more generic and efficient version of the
TaskTracker:
Instead of having a fixed number of Map and Reduce slots, the
NodeManager has several dynamically created resource containers.
The size of a container depends upon the amount of resources that are
assigned to it, such as memory, CPU, disk, and network I/O.

The YARN dynamic allocation of cluster resources improves utilization over the more static MapReduce rules that are used in early versions of Hadoop (v1).


Figure 5-49. YARN features: Higher cluster utilization

The NodeManager is a more generic and efficient version of the TaskTracker. Instead of having a
fixed number of Map and Reduce slots, the NodeManager has several dynamically created
resource containers. The size of a container depends upon the amount of resources it contains,
such as memory, CPU, disk, and network I/O.
Currently, only memory and CPU are supported (YARN-3); cgroups might be used to control disk
and network I/O in the future.
The number of containers on a node is a product of configuration parameters and the total amount
of node resources (such as total CPUs and total memory) outside the resources that are dedicated
to the secondary daemons and the OS.
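As a rough illustration of how those configuration parameters bound the number of containers, the sketch below reads the NodeManager resource settings with YarnConfiguration and divides them by an assumed per-container request of 2 GB and 1 vcore; the actual placement decision is made by the ResourceManager scheduler, not by this arithmetic.

    import org.apache.hadoop.yarn.conf.YarnConfiguration;

    public class ContainerEstimate {
        public static void main(String[] args) {
            // Resources that this NodeManager advertises to the ResourceManager
            // (yarn.nodemanager.resource.memory-mb / .cpu-vcores from yarn-site.xml).
            YarnConfiguration conf = new YarnConfiguration();
            int nodeMemMb  = conf.getInt(YarnConfiguration.NM_PMEM_MB,
                                         YarnConfiguration.DEFAULT_NM_PMEM_MB);
            int nodeVcores = conf.getInt(YarnConfiguration.NM_VCORES,
                                         YarnConfiguration.DEFAULT_NM_VCORES);

            int containerMemMb  = 2048;   // assumed per-container memory request
            int containerVcores = 1;      // assumed per-container vcore request

            int byMemory = nodeMemMb / containerMemMb;
            int byCpu    = nodeVcores / containerVcores;
            System.out.printf("At most %d containers of this size fit on the node.%n",
                    Math.min(byMemory, byCpu));
        }
    }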


YARN features: Reliability and availability


• High availability (HA) for the ResourceManager:
Application recovery is performed after a ResourceManager restart.
The ResourceManager stores information about running applications and
completed tasks in HDFS.
If the ResourceManager is restarted, it re-creates the state of the applications
and reruns only the incomplete tasks.
• An HA NameNode makes the Hadoop cluster much more efficient,
powerful, and reliable.

HA is a work in progress that is close to completion, and its features have been actively tested by the community.


Figure 5-50. YARN features: Reliability and availability


YARN major features summarized


• Multi-tenancy:
YARN allows multiple access engines (either open source or proprietary) to use
Hadoop as the common standard for batch, interactive, and real-time engines
that can simultaneously access the same data sets.
Multi-tenant data processing improves an enterprise's return on its Hadoop
investments.
• Cluster utilization
The YARN dynamic allocation of cluster resources feature improves utilization
over more static MapReduce rules that are used in early versions of Hadoop.
• Scalability
Data center processing power continues to rapidly expand. YARN
ResourceManager focuses exclusively on scheduling and keeps pace as clusters
expand to thousands of nodes managing petabytes of data.
• Compatibility
Existing MapReduce applications that were developed for Hadoop v1 can run on YARN
without any disruption to the processes that already work.


Figure 5-51. YARN major features summarized


Apache Spark with Hadoop 2+


• Apache Spark is an alternative in-memory framework to MapReduce.
• Supports general workloads as well as streaming, interactive queries, and
machine learning, providing performance gains.
• Apache Spark SQL provides APIs that allow SQL queries to be
embedded in Java, Scala, or Python programs in Apache Spark.
• MLlib: An Apache Spark optimized library that supports machine
learning functions.
• GraphX: API for graphs and parallel computation.
• Apache Spark Streaming: Lets you write applications that process streaming data
in Java, Scala, or Python.


Figure 5-52. Apache Spark with Hadoop 2+

Apache Spark is a new, alternative in-memory framework to MapReduce.
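To show what "SQL embedded in a program" looks like, here is a small Spark SQL sketch in Java (Spark 2.x API). The HDFS path, view name, and column names are assumptions for illustration; submitted with spark-submit and a YARN master, the SparkSession runs as one more YARN application on the same cluster.

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    public class SparkSqlOnYarn {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder()
                    .appName("lineitem-analysis")       // assumed application name
                    .getOrCreate();

            // Read a table from HDFS, expose it as a temporary view,
            // and query it with SQL embedded in the Java program.
            Dataset<Row> lineitem = spark.read().parquet("hdfs:///data/lineitem"); // assumed path
            lineitem.createOrReplaceTempView("lineitem");

            Dataset<Row> totals = spark.sql(
                    "SELECT l_returnflag, SUM(l_quantity) AS qty "   // assumed columns
                  + "FROM lineitem GROUP BY l_returnflag");
            totals.show();

            spark.stop();
        }
    }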

5.4. Hadoop and MapReduce v1 compared to v2


Hadoop and MapReduce v1 compared to v2


Figure 5-53. Hadoop and MapReduce v1 compared to v2

The original Hadoop (v1) and MapReduce (v1) had limitations, and several issues surfaced over
time. We review these issues in preparation for looking at the differences and changes that were
introduced with Hadoop v2 and MapReduce v2.


Topics
• Introduction to MapReduce
• Hadoop v1 and MapReduce v1 architecture and limitations
• YARN architecture
• Hadoop and MapReduce v1 compared to v2


Figure 5-54. Topics


Hadoop v1 to Hadoop v2

Hadoop 1.0 is a single-use system that usually runs batch applications: Pig, Hive, and other tools run on MapReduce, which provides both cluster resource management and data processing, on top of HDFS (redundant, reliable storage).
Hadoop 2.0 is a multi-purpose platform for batch, interactive, online, and streaming workloads: MR2, Pig, Hive, real-time, stream, and graph engines, HBase, and other services handle execution and data processing on top of YARN (cluster resource management) and HDFS2 (redundant, reliable storage).

Figure 5-55. Hadoop v1 to Hadoop v2

The most notable change from Hadoop v1 to Hadoop v2 is the separation of cluster and resource
management from the execution and data processing environment. This change allows for many
new application types to run on Hadoop, including MapReduce v2.
HDFS is common to both versions. MapReduce is the only execution engine in Hadoop v1. The
YARN framework provides work scheduling that is neutral to the nature of the work that is
performed. Hadoop v2 supports many execution engines, including a port of MapReduce that is
now a YARN application.


YARN modifies MRv1


With YARN, MapReduce has been modified. The two major functions of the
JobTracker (resource management, and job scheduling and monitoring)
are split into separate daemons:
• ResourceManager (RM):
The global ResourceManager and the per-node worker, the NodeManager
(NM), form the data-computation framework.
The ResourceManager is the ultimate authority that arbitrates resources
among all the applications in the system.
• ApplicationMaster (AM):
The per-application ApplicationMaster is, in effect, a framework-specific
library that is tasked with negotiating resources from the ResourceManager
and working with the NodeManagers to run and monitor the tasks.
An application is either a single job in the classical sense of MapReduce jobs
or a directed acyclic graph (DAG) of jobs.


Figure 5-56. YARN modifies MRv1

The fundamental idea of YARN and MRv2 is to split the two major functions of the JobTracker,
resource management and job scheduling / monitoring, into separate daemons. The idea is to have
a global ResourceManager (RM) and per-application ApplicationMaster (AM). An application is
either a single job in the classical sense of MapReduce jobs or a DAG of jobs.
The ResourceManager and per-node worker, the NodeManager (NM), form the data-computation
framework. The ResourceManager is the ultimate authority that arbitrates resources among all the
applications in the system.
The per-application ApplicationMaster is, in effect, a framework-specific library that is tasked with
negotiating resources from the ResourceManager and working with the NodeManagers to run and
monitor the tasks.

The ResourceManager has two main components: Scheduler and ApplicationsManager:
• The Scheduler is responsible for allocating resources to the various running applications. The
Scheduler is a pure scheduler in the sense that it performs no monitoring or tracking of status for
the applications. Also, it offers no guarantees about restarting failed tasks, whether they fail
because of application failure or hardware failure. The Scheduler performs its scheduling function
based on the resource requirements of the applications; it does so based on the abstract notion of a
resource Container, which incorporates elements such as memory, CPU, disk, network, and
other resources. In the first version, only memory is supported.
The Scheduler has a pluggable policy plug-in, which is responsible for partitioning the cluster
resources among the various queues, applications, and other entities. The current MapReduce
schedulers, such as the CapacityScheduler and the FairScheduler, are examples of such
plug-ins.
The CapacityScheduler supports hierarchical queues to allow for more predictable sharing of
cluster resources.
• The ApplicationsManager is responsible for accepting job submissions and negotiating the first
container for running the application-specific ApplicationMaster. It provides the service for
restarting the ApplicationMaster container on failure.
The NodeManager is the per-machine framework agent that is responsible for containers,
monitoring their resource usage (CPU, memory, disk, and network), and reporting the same to
the ResourceManager / Scheduler.
The per-application ApplicationMaster has the task of negotiating appropriate resource
containers from the Scheduler, tracking their status, and monitoring for progress.
MRv2 maintains API compatibility with the previous stable release (hadoop-1.x), which means that
all MapReduce jobs should still run unchanged on top of MRv2 with just a recompile.
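The following sketch shows, in simplified synchronous form, how a per-application ApplicationMaster talks to the Scheduler by using the AMRMClient API: it registers, asks for containers that are described only by an abstract Resource (memory and vcores), and receives allocations on the heartbeat. Resource sizes and the number of requests are assumptions; a real ApplicationMaster would loop on allocate(), launch tasks in the granted containers, and track their completion.

    import org.apache.hadoop.yarn.api.records.FinalApplicationStatus;
    import org.apache.hadoop.yarn.api.records.Priority;
    import org.apache.hadoop.yarn.api.records.Resource;
    import org.apache.hadoop.yarn.client.api.AMRMClient;
    import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;
    import org.apache.hadoop.yarn.conf.YarnConfiguration;

    public class AppMasterSketch {
        public static void main(String[] args) throws Exception {
            YarnConfiguration conf = new YarnConfiguration();

            // Register this ApplicationMaster with the ResourceManager.
            AMRMClient<ContainerRequest> rmClient = AMRMClient.createAMRMClient();
            rmClient.init(conf);
            rmClient.start();
            rmClient.registerApplicationMaster("", 0, "");   // host, RPC port, tracking URL omitted

            // Ask the Scheduler for three containers of 1 GB / 1 vcore each.
            Priority priority = Priority.newInstance(0);
            Resource capability = Resource.newInstance(1024, 1);
            for (int i = 0; i < 3; i++) {
                rmClient.addContainerRequest(new ContainerRequest(capability, null, null, priority));
            }

            // One heartbeat; the response carries any containers granted so far.
            rmClient.allocate(0.0f);

            // Tell the ResourceManager the application is done.
            rmClient.unregisterApplicationMaster(FinalApplicationStatus.SUCCEEDED, "", "");
            rmClient.stop();
        }
    }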


Architecture of MRv1
Classic version of MapReduce (MRv1)

Diagram: clients submit jobs to a single JobTracker, which schedules the submitted jobs, tracks the live TaskTrackers and their available Map and Reduce slots, and monitors job and task execution on the cluster. Each TaskTracker runs Map and Reduce tasks and reports back to the JobTracker.

Figure 5-57. Architecture of MRv1

In MapReduce v1, there is only one JobTracker that is responsible for allocation of resources, task
assignment to data nodes (as TaskTrackers), and ongoing monitoring ("heartbeat") as each job is
run (the TaskTrackers constantly report back to the JobTracker on the status of each running task).


YARN architecture
High-level architecture of YARN
• ResourceManager (RM): Tracks the live NodeManagers and the available resources, allocates those resources to the appropriate applications and tasks, and monitors the application masters.
• NodeManager (NM): Provides computational resources in the form of containers and manages the processes that run in them.
• ApplicationMaster (AM): Coordinates the execution of all tasks within its application and requests the appropriate resource containers to run those tasks (for example, an MR app master requesting Map tasks, or a Giraph app master requesting Giraph tasks).
• Containers: Can run different types of tasks (including Application Masters) and have different sizes, for example, in RAM and CPU.
• Clients (for example, an MR client or a Giraph client): Can submit any type of application that is supported by YARN.

Figure 5-58. YARN architecture

In the YARN architecture, a global ResourceManager runs as a master daemon, usually on a
dedicated machine, and arbitrates the available cluster resources among the various competing
applications. The ResourceManager tracks how many live nodes and resources are available on
the cluster and coordinates which applications that users submit should get these resources
and when.
The ResourceManager is the single process that has this information, so it can make its allocation
(or rather, scheduling) decisions in a shared, secure, and multi-tenant manner (for example,
according to application priority, queue capacity, ACLs, data locality, and other criteria).
When a user submits an application, an instance of a lightweight process that is called the
ApplicationMaster is started to coordinate the execution of all tasks within the application, which
includes monitoring tasks, restarting failed tasks, speculatively running slow tasks, and calculating
the total values of application counters. These responsibilities were previously assigned to the
single JobTracker for all jobs. The ApplicationMaster and tasks that belong to its application run in
resource containers that are controlled by the NodeManagers.

The NodeManager is a more generic and efficient version of the TaskTracker. Instead of having a
fixed number of Map and Reduce slots, the NodeManager has several dynamically created
resource containers. The size of a container depends upon the amount of resources it contains,
such as memory, CPU, disk, and network I/O. Currently, only memory and CPU (YARN-3) are
supported. cgroups might be used to control disk and network I/O in the future. The number of
containers on a node is a product of configuration parameters and the total amount of node
resources (such as total CPU and total memory) outside the resources that are dedicated to the
secondary daemons and the OS.
The ApplicationMaster can run any type of task inside a container. For example, the MapReduce
ApplicationMaster requests a container to start a Map or a Reduce task, and the Giraph
ApplicationMaster requests a container to run a Giraph task. You can also implement a custom
ApplicationMaster that runs specific tasks and invent a new distributed application framework. I
encourage you to read about Apache Twill, which aims to make it easier to write distributed
applications sitting on top of YARN.
In YARN, MapReduce is simply demoted to the role of a distributed application (but still a useful
one) and is now called MRv2. MRv2 is simply the re-implementation of the classical MapReduce
engine, now called MRv1, that runs on top of YARN.
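To illustrate the point that an ApplicationMaster can run any command inside a granted container, here is a small sketch of the NodeManager side of that conversation, using the NMClient API. The helper is assumed to be called from a custom ApplicationMaster after the ResourceManager has granted it a Container; the launched command (/bin/date) is an arbitrary stand-in for a Map task, a Giraph worker, or anything else.

    import java.util.Collections;

    import org.apache.hadoop.yarn.api.records.Container;
    import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
    import org.apache.hadoop.yarn.client.api.NMClient;
    import org.apache.hadoop.yarn.conf.YarnConfiguration;
    import org.apache.hadoop.yarn.util.Records;

    public class ContainerLauncher {
        // Create one NMClient for the lifetime of the ApplicationMaster.
        static NMClient newNmClient(YarnConfiguration conf) {
            NMClient nmClient = NMClient.createNMClient();
            nmClient.init(conf);
            nmClient.start();
            return nmClient;
        }

        // Launch an arbitrary command in a container that the ResourceManager
        // has already granted to this ApplicationMaster.
        static void launch(NMClient nmClient, Container container) throws Exception {
            ContainerLaunchContext ctx = Records.newRecord(ContainerLaunchContext.class);
            ctx.setCommands(Collections.singletonList(
                    "/bin/date 1><LOG_DIR>/stdout 2><LOG_DIR>/stderr"));
            nmClient.startContainer(container, ctx);
        }
    }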


Terminology changes from MRv1 to YARN

YARN terminology          MRv1 terminology
ResourceManager           Cluster manager
ApplicationMaster         JobTracker (but dedicated and short-lived)
NodeManager               TaskTracker
Distributed application   One particular MapReduce job
Container                 Slot


Figure 5-59. Terminology changes from MRv1 to YARN
