Unit II:
Hadoop Ecosystem
Autumn Semester, 2024
Introduction to MapReduce

What is MapReduce?
● Hadoop MapReduce is a software framework designed for processing large datasets (multi-terabyte data) in parallel across clusters of commodity hardware, often consisting of thousands of nodes.
● Hadoop can execute MapReduce programs written in various languages such as Java, Python, and Ruby.
● MapReduce programs are inherently parallel, making large-scale data analysis accessible to anyone with a sufficient number of machines.
Key Ideas and Concepts
● A MapReduce job is a unit of work that the client wants to be performed. It includes the input
data, the MapReduce program that defines how the data should be processed, and the configuration
information required for executing the job.
● Example unit of work: count the occurrences of each word across a collection of text documents.
○ Input data: The data to be processed. Example: A collection of text documents.
○ MapReduce program: The code that defines the map and reduce functions. Example: A
program to count the frequency of each word in the text documents.
○ Configuration information: Settings and parameters required for executing the job. Example:
Parameters such as the number of map and reduce tasks, input/output paths, and split size.
● Hadoop (or the MapReduce Framework) runs the job by dividing it into tasks, of which there are two
types: map tasks and reduce tasks.
● The tasks are scheduled using YARN and run on nodes in the cluster. If a task fails, it will be
automatically rescheduled to run on a different node.
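To make these three pieces concrete, here is a minimal driver sketch for the word-count job. The class names WordCountDriver, WordCountMapper, and WordCountReducer are illustrative (the mapper and reducer are sketched later in this unit), and the input/output paths come from the command line.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();          // configuration information
        Job job = Job.getInstance(conf, "word count");     // the unit of work (the job)
        job.setJarByClass(WordCountDriver.class);

        job.setMapperClass(WordCountMapper.class);         // the MapReduce program
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        job.setNumReduceTasks(1);                          // example configuration parameter (a single reducer here)

        FileInputFormat.addInputPath(job, new Path(args[0]));    // input data
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output location
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}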
Key Ideas and Concepts
● Hadoop divides the input to a MapReduce job into fixed-size pieces called input splits, or just
splits.
● Hadoop creates one map task for each split, which runs the user-defined map function for each
record in the split.
● Having many splits means the time taken to process each split is small compared to the time to
process the whole input.
● So if we are processing the splits in parallel, the processing is better load balanced when the splits
are small, since a faster machine will be able to process proportionally more splits over the course
of the job than a slower machine.
● Even if the machines are identical, failed processes or other jobs running concurrently make load
balancing desirable, and the quality of the load balancing increases as the splits become more fine
grained.
Key Ideas and Concepts
● On the other hand, if splits are too small, the overhead of managing the splits and map task
creation begins to dominate the total job execution time.
● For most jobs, a good split size tends to be the size of an HDFS block, which is 128 MB by
default, although this can be changed for the cluster (for all newly created files) or specified
when each file is created.
● Hadoop does its best to run the map task on a node where the input data resides in HDFS,
because it doesn’t use valuable cluster bandwidth. This is called the data locality optimization.
● Sometimes, however, all the nodes hosting the HDFS block replicas for a map task’s input split are running other map tasks, so the job scheduler will look for a free map slot on a node in the same rack as one of the blocks.
● Very occasionally even this is not possible, so an off-rack node is used, which results in an inter-rack network transfer.
Figure: Data-local (a), rack-local (b), and off-rack (c) map tasks.
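The split size discussed above normally follows the HDFS block size, but it can also be bounded per job. A minimal sketch using the standard FileInputFormat helpers (the byte values here are examples only, not recommendations):

import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class SplitSizeExample {
    // Sketch: bound the split size for a job; the equivalent configuration properties are
    // mapreduce.input.fileinputformat.split.minsize and mapreduce.input.fileinputformat.split.maxsize.
    static void configureSplits(Job job) {
        FileInputFormat.setMinInputSplitSize(job, 64L * 1024 * 1024);   // 64 MB lower bound
        FileInputFormat.setMaxInputSplitSize(job, 128L * 1024 * 1024);  // 128 MB upper bound
    }
}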
Key Ideas and Concepts
● Map tasks write their output to the local disk, not to HDFS. Why is this? Map output is intermediate
output: it’s processed by reduce tasks to produce the final output, and once the job is complete, the
map output can be discarded.
● Map output is typically stored as key-value pairs. Each map task generates intermediate data in the
form of these pairs, where the key is a piece of information derived from the input data, and the
value is associated data generated by the map function.
● Storing this intermediate output in HDFS with replication is unnecessary. If the node running the
map task fails before the map output has been consumed by the reduce task, Hadoop will
automatically rerun the map task on another node to re-create the map output.
● Reduce tasks don’t have the advantage of data locality; the input to a single reduce task is normally
the output from all mappers.
Key Ideas and Concepts
● In the present example, we have a single
reduce task that is fed by all of the map
tasks.
● Therefore, the sorted map outputs have
to be transferred across the network to
the node where the reduce task is
running, where they are merged and
then passed to the user-defined reduce
function.
● The output of the reduce is normally
stored in HDFS for reliability.
Figure: The dotted boxes indicate nodes, the dotted arrows show data transfers on a node, and the solid arrows show data transfers between nodes.
Key Ideas and Concepts
● The number of reduce tasks is not governed by the size of the input, but instead is specified
independently.
● When there are multiple reducers, the map tasks partition their output, each creating one
partition for each reduce task.
● There can be many keys (and their associated values) in each partition, but the records for any
given key are all in a single partition.
● The partitioning can be controlled by a user-defined partitioning function, but normally the
default partitioner—which buckets keys using a hash function—works very well.
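For reference, the default partitioner behaves essentially as in the sketch below: the partition (and hence the reducer) is derived from the hash of the key, so all records with the same key end up in the same partition.

import org.apache.hadoop.mapreduce.Partitioner;

// A sketch in the spirit of Hadoop's default hash partitioner.
public class HashLikePartitioner<K, V> extends Partitioner<K, V> {
    @Override
    public int getPartition(K key, V value, int numReduceTasks) {
        // Mask off the sign bit so the result is non-negative, then bucket by modulo.
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}

A user-defined partitioner is plugged in with job.setPartitionerClass(...).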
Key Ideas and Concepts
● It’s also possible to have zero reduce
tasks.
● This can be appropriate when you
don’t need the shuffle because the
processing can be carried out entirely
in parallel
● In this case, the only off-node data
transfer is when the map tasks write
to HDFS
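A minimal sketch of configuring such a map-only job (no shuffle; map output is written straight to HDFS):

import org.apache.hadoop.mapreduce.Job;

public class MapOnlyJobExample {
    // Sketch: with zero reduce tasks the job becomes map-only, so there is
    // no shuffle and each map task writes its output directly to HDFS.
    static void makeMapOnly(Job job) {
        job.setNumReduceTasks(0);
    }
}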
Key Ideas and Concepts
● The data flow for the general case of
multiple reduce tasks is illustrated in
the diagram.
● This diagram makes it clear why the
data flow between map and reduce
tasks is colloquially known as “the
shuffle,” as each reduce task is fed by
many map tasks. The shuffle is more
complicated than this diagram
suggests, and tuning it can have a big
impact on job execution time, as you
will see in Shuffle and Sort.
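Two of the properties commonly tuned for the shuffle are sketched below; the values are illustrative only, not recommendations.

import org.apache.hadoop.conf.Configuration;

public class ShuffleTuningExample {
    // Sketch of two shuffle-related knobs (illustrative values).
    static Configuration tuneShuffle() {
        Configuration conf = new Configuration();
        conf.setInt("mapreduce.task.io.sort.mb", 200);              // map-side sort buffer size, in MB
        conf.setInt("mapreduce.reduce.shuffle.parallelcopies", 10); // parallel map-output fetches per reducer
        return conf;
    }
}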
Advantages of MapReduce
1. Parallel Processing
● Efficient Data Handling: By breaking data into smaller
chunks and processing them simultaneously across multiple
nodes, MapReduce significantly speeds up data processing.
● Scalability: It easily scales to handle large volumes of data by
distributing tasks across a cluster of machines, allowing for
horizontal scaling.
● Fault Tolerance: MapReduce can handle node failures
gracefully by reassigning tasks to other nodes, ensuring that
the processing continues even in the case of hardware
failures.
Advantages of MapReduce
1. Parallel Processing
● Simplified Programming Model: The Map and Reduce functions
abstract the complexity of parallel processing, making it easier
for developers to write distributed applications.
● Efficient Resource Utilization: By leveraging parallelism,
MapReduce optimizes the use of computational resources,
leading to faster processing times.
Advantages of MapReduce
2. Data Locality
● Reduced Data Transfer Costs: By moving processing tasks to where the
data resides, MapReduce minimizes the need to transfer large volumes
of data across the network, which can be costly and time-consuming.
● Improved Performance: Reduces the overhead associated with data
movement, leading to faster and more efficient data processing.
● Enhanced Scalability: As data grows, MapReduce scales efficiently by
distributing both data and processing tasks across a cluster, allowing the
system to handle large datasets without significant performance
degradation.
Advantages of MapReduce
2. Data Locality
● Lower Latency: Reduces the time it takes to access and process data,
resulting in lower latency and quicker results.
● Optimized Resource Utilization: By executing tasks on the nodes
where data is stored, MapReduce makes better use of computing
resources, as it avoids the bottleneck of moving data between
storage and processing units.
Example of Election Votes Counting
● Votes are stored at different booths
● The result center has the details of all the booths.
Example of Election Votes Counting: Traditional Approach
● Counting - Traditional Approach:
Votes are transported to a central
result center for counting.
● Moving all the votes to the center can be costly, and the counting process takes time.
Example of Election Votes Counting: MapReduce Approach
● Counting - MapReduce Approach:
Votes are counted at individual booths.
● The results from each booth are then sent
to the central result center.
● This method allows for the final result to be declared quickly and efficiently.
MapReduce Approach
Anatomy of a MapReduce Program
Map:    (K1, V1) → list(K2, V2)
Reduce: (K2, list(V2)) → list(K3, V3)
Example of Word Count Process in MapReduce
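As a concrete sketch of the word-count process (the classic example; the class names match the driver sketched earlier and are illustrative):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map: (K1 = byte offset, V1 = line of text) -> list(K2 = word, V2 = 1)
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        for (String token : value.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);   // emit (word, 1) for every word in the line
            }
        }
    }
}

// Reduce: (K2 = word, list(V2) = counts) -> list(K3 = word, V3 = total count)
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable count : values) {
            sum += count.get();
        }
        context.write(key, new IntWritable(sum));  // emit (word, total)
    }
}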
Executing a MapReduce Program
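A typical way to execute the program sketched above (the jar name and HDFS paths are hypothetical) is to compile it against the Hadoop classpath, package it into a jar, and submit it with the hadoop command:

javac -cp $(hadoop classpath) WordCount*.java
jar cf wordcount.jar WordCount*.class
hadoop jar wordcount.jar WordCountDriver /user/alice/input /user/alice/output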
Introduction to YARN
What is YARN?
● YARN (Yet Another Resource Negotiator) is a cluster resource management system for Hadoop.
● YARN was introduced in Hadoop 2.x to enhance resource management and job scheduling,
addressing the limitations and bottlenecks of the Hadoop 1.x architecture.
● YARN functions as a resource management and job scheduling layer within the Hadoop
ecosystem, enabling high-level applications (such as MapReduce, Spark, and HBase) to share and
utilize the underlying Hadoop infrastructure efficiently.
● With the introduction of YARN, Hadoop evolved from being solely a MapReduce framework to a
comprehensive platform for big data processing.
● YARN is often described as a large-scale, distributed resource management system designed to
support a variety of big data applications.
MapReduce 1.x Execution Framework
In Hadoop 1.0, there are two major components for job execution: the JobTracker and the TaskTrackers.
❖ The JobTracker is a master daemon.
It is responsible for managing resources and scheduling jobs. It also tracks the status of each job and restarts tasks if there is any failure.
❖ TaskTrackers are slave daemons.
They run on the systems where the DataNodes reside. They are responsible for running tasks and sending progress reports to the JobTracker. The JobTracker reschedules failed tasks on different TaskTrackers.
Motivation for MapReduce V2
Because the JobTracker could be overloaded with multiple responsibilities, Hadoop's architecture was redesigned in version 2 to eliminate the following limitations of Hadoop 1.0:
❏ Scalability: In Hadoop 1.0, the JobTracker is responsible for scheduling the jobs,
monitoring each job, and restarting them on failure.
❏ It means JobTracker spends the majority of its time managing the application's life cycle.
❏ In a larger cluster with more nodes and more tasks, the burden of scheduling and
monitoring increases.
❏ This overhead limits the scalability of Hadoop version 1 to about 4,000 nodes and 40,000 tasks (according to Yahoo).
Motivation for MapReduce V2
❏ High availability: High availability ensures that even if the node serving requests goes down, a standby node can take over the responsibilities of the failed node.
❏ For this to work, the state of the active node must be kept in sync with the standby node.
❏ The JobTracker is a single point of failure. Every few seconds, TaskTrackers send information about tasks to the JobTracker, and this large number of state changes in a very short span of time makes it difficult to implement high availability for the JobTracker.
Motivation for MapReduce V2
❏ Memory utilization: Hadoop version 1 required pre-configured TaskTracker slots for map and reduce tasks. A slot reserved for a map task could not be used for a reduce task, or the other way around, so efficient utilization of TaskTracker memory was not possible with this setup.
❏ Non-MapReduce jobs: Every job in Hadoop version 1 required MapReduce for its completion, because scheduling was only possible through the JobTracker, and the JobTracker and TaskTrackers were tightly coupled with the MapReduce framework. As Hadoop adoption grew rapidly, many new requirements emerged, such as graph processing and real-time analytics, that needed to run over the same HDFS storage to reduce complexity, infrastructure, and maintenance costs.
YARN Architecture
YARN Architecture
❏ The initial idea of YARN was to split the
resource management and job scheduling
responsibilities of the JobTracker.
❏ YARN consists of two major components: the Resource Manager and the Node Manager.
❏ The Resource Manager is a master node that is responsible for managing resources in the cluster.
❏ A per-application Application Master, running on a Node Manager, is responsible for launching and monitoring the containers of its job.
❏ The cluster consists of one Resource
Manager and multiple Node Managers, as
seen in the diagram.
YARN Architecture
1) Resource Manager: The Resource
Manager is a master daemon that is
responsible for managing the resources of
submitted applications.
It has two primary components:
a) Scheduler
b) Application Manager
YARN Architecture
❏ a) Scheduler: The job of the Resource Manager's Scheduler is to allocate the resources requested by the per-application Application Master.
❏ The Scheduler only schedules; it does not monitor any task and is not responsible for relaunching any failed application container.
❏ The application sends a scheduling request to YARN along with detailed scheduling information, including the amount of memory required for the job. Upon receiving the scheduling request, the Scheduler simply schedules the job.
YARN Architecture
❏ b) Application Manager: The job of the Application Manager is to manage the per-application Application Masters.
❏ Each application submitted to YARN has its own Application Master, and the Application Manager keeps track of each of them.
❏ Each client request for job submission is received by the Application Manager, which provides the resources to launch the Application Master for that application.
YARN Architecture
❏ It also destroys the Application Master upon completion of the application's execution.
❏ When cluster resources become limited and are already in use, the Resource Manager can request resources back from a running application so that it can allocate them to other applications.
YARN Architecture
❏ 2) Node Manager: The Node Manager is a slave daemon that runs on every worker node of the cluster and is responsible for launching and executing containers based on instructions from the Resource Manager.
❏ The Node Manager sends heartbeat signals to the Resource Manager; these also carry other information, such as the Node Manager's machine details and its available memory.
❏ Upon receiving each heartbeat, the Resource Manager updates its information about that Node Manager, which helps in planning and scheduling upcoming tasks. Containers are launched on the Node Manager, and the Application Master itself also runs in a Node Manager container.
YARN Architecture
❏ 3) Application Master: The first step of an application is to submit the job to YARN; upon receiving the job-submission request, YARN's Resource Manager launches the Application Master for that particular job in a container on one of the Node Managers.
❏ The Application Master is then responsible for managing the application's execution in the cluster. Each application has a dedicated Application Master, running in some Node Manager container, that coordinates between the Resource Manager and the Node Managers in order to complete the execution of the application.
YARN Architecture
❏ The Application Master requests the resources required for application execution from the Resource Manager; the Resource Manager sends back detailed information about the allocated Resource Containers, and the Application Master then coordinates with the respective Node Managers to launch the containers that execute the application's tasks.
❏ The Application Master sends heartbeats to the Resource Manager at regular intervals and reports its resource usage.
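As a small illustration of the client side of this architecture, the YarnClient API can be used to query the Resource Manager for cluster state. A minimal sketch, assuming a reachable cluster and a yarn-site.xml on the classpath:

import java.util.List;
import org.apache.hadoop.yarn.api.records.NodeReport;
import org.apache.hadoop.yarn.api.records.NodeState;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class ClusterReportExample {
    public static void main(String[] args) throws Exception {
        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(new YarnConfiguration());  // picks up yarn-site.xml from the classpath
        yarnClient.start();

        // Ask the Resource Manager for a report of the running Node Managers.
        List<NodeReport> nodes = yarnClient.getNodeReports(NodeState.RUNNING);
        for (NodeReport node : nodes) {
            System.out.println(node.getNodeId()
                    + " capacity=" + node.getCapability()
                    + " used=" + node.getUsed());
        }
        yarnClient.stop();
    }
}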
Running a WordCount Application in MRv2
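The original slides walk through the run as screenshots; a sketch of the typical command sequence on an MRv2/YARN cluster (the jar name and HDFS paths are hypothetical) looks like this:

# Copy the input data into HDFS (paths are examples).
hdfs dfs -mkdir -p /user/alice/input
hdfs dfs -put local-docs/*.txt /user/alice/input

# Submit the word-count job to YARN.
yarn jar wordcount.jar WordCountDriver /user/alice/input /user/alice/output

# Monitor the application and inspect the result.
yarn application -list
hdfs dfs -cat /user/alice/output/part-r-00000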
Introduction to YARN job scheduling
❏ The Resource Manager has two major components, namely the Application Manager and the Scheduler.
❏ The Resource Manager's Scheduler is responsible for allocating the required resources to an application based on scheduling policies. Before YARN, Hadoop allocated fixed slots for map and reduce tasks from the available memory, which prevented reduce tasks from running in slots allocated for map tasks, and the other way around.
❏ YARN does not define map and reduce slots up front. Based on requests, it launches containers for tasks, which means that any free container can be used for either a map or a reduce task.
Introduction to YARN job scheduling
❏ YARN provides a configurable scheduling policy that allows you to choose the right strategy based on an application's needs. By default, three schedulers are available in YARN, as follows (see the configuration sketch after this list):
a. FIFO scheduler
b. Capacity scheduler
c. Fair scheduler
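The scheduler implementation is selected in yarn-site.xml via the yarn.resourcemanager.scheduler.class property; a sketch (pick one of the three classes):

<property>
  <name>yarn.resourcemanager.scheduler.class</name>
  <!-- FIFO: org.apache.hadoop.yarn.server.resourcemanager.scheduler.fifo.FifoScheduler -->
  <!-- Fair: org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler -->
  <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler</value>
</property>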
Introduction to YARN job scheduling
❏ FIFO scheduler
❏ The FIFO scheduler uses the simple strategy of first come first serve. Memory will be
allocated to applications based on the sequence of request time, which means the
first application in the queue will be allocated the required memory, then the
second, and so on.
❏ In case memory is not available, applications have to wait for sufficient memory
to be available for them to launch their jobs. When the FIFO scheduler is configured,
YARN will make a queue of requests and add applications to the queue, then launch
applications one by one.
Introduction to YARN job scheduling
❏ Capacity scheduler
❏ The Capacity Scheduler divides resources into queues with specific capacities. This approach allows
resources to be allocated based on different organizational or departmental needs. Each queue has a
defined capacity, and jobs within a queue are scheduled according to that queue's capacity and
priority.
❏ The Capacity Scheduler also enables the sharing of resources across an organization,
which supports multi-tenancy and helps increase the utilization of a Hadoop cluster.
Different departments have varying cluster requirements and thus need specific amounts
of resources reserved for them when they submit their jobs.
❏ Resources reserved for a department are used by its users. If no other applications are
submitted to a queue, the unused resources become available for other applications.
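Queues and their capacities are defined in capacity-scheduler.xml; a minimal sketch with two hypothetical departmental queues (capacities are percentages and must sum to 100 under a parent queue):

<property>
  <name>yarn.scheduler.capacity.root.queues</name>
  <value>analytics,reporting</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.analytics.capacity</name>
  <value>70</value>  <!-- 70% of cluster resources reserved for the analytics queue -->
</property>
<property>
  <name>yarn.scheduler.capacity.root.reporting.capacity</name>
  <value>30</value>
</property>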
Introduction to YARN job scheduling
❏ Fair scheduler
❏ In fair scheduling, all applications get roughly an equal share of the available resources. With the fair scheduler, when the first application is submitted to YARN, it is given all of the available resources.
❏ If a new application is then submitted, the scheduler starts allocating resources to it until both applications have roughly equal shares of resources for their execution.
Introduction to YARN job scheduling
❏ Fair scheduler
❏ Unlike the two schedulers discussed before, the fair scheduler protects applications from resource starvation and ensures that the applications in a queue get the memory they need for execution.
❏ The minimum and maximum resource shares of each scheduling queue are calculated from the configuration provided to the fair scheduler (see the allocation-file sketch below). An application gets the amount of resources configured for the queue it is submitted to, and if a new application is submitted to the same queue, the queue's configured resources are shared between both applications.
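Queue shares for the fair scheduler are defined in an allocation file (pointed to by the yarn.scheduler.fair.allocation.file property); a minimal sketch with hypothetical queue names:

<?xml version="1.0"?>
<allocations>
  <queue name="analytics">
    <minResources>10240 mb,4 vcores</minResources>  <!-- guaranteed minimum share -->
    <weight>2.0</weight>                            <!-- twice the share of a weight-1.0 queue -->
  </queue>
  <queue name="reporting">
    <weight>1.0</weight>
  </queue>
</allocations>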
Thank You!