
UNIT 2

Introduction to Hadoop
INTRODUCTION:
Hadoop is an open-source software framework that is used for storing
and processing large amounts of data in a distributed computing
environment. It is designed to handle big data and is based on the
MapReduce programming model, which allows for the parallel processing of
large datasets.
What is Hadoop?
Hadoop is an open-source software programming framework for storing large amounts of data and performing computation. Its framework is based on Java programming, with some native code in C and shell scripts.

Hadoop has two main components:


HDFS (Hadoop Distributed File System): This is the storage component of
Hadoop, which allows for the storage of large amounts of data across multiple
machines. It is designed to work with commodity hardware,
which makes it cost-effective.

YARN (Yet Another Resource Negotiator): This is the resource management component of Hadoop, which manages the allocation of resources (such as CPU and memory) for processing the data stored in HDFS.

Hadoop also includes several additional modules that provide additional functionality, such as Hive (a SQL-like query language), Pig (a high-level platform for creating MapReduce programs), and HBase (a non-relational, distributed database).

Hadoop is commonly used in big data scenarios such as data warehousing, business intelligence, and machine learning. It’s also used for data processing, data analysis, and data mining. It enables the distributed processing of large data sets across clusters of computers using a simple programming model.

History of Hadoop
Hadoop is developed by the Apache Software Foundation, and its co-founders are Doug Cutting and Mike Cafarella.
Co-founder Doug Cutting named it after his son’s toy elephant. In October 2003, Google released its first paper, on the Google File System. In January 2006, MapReduce development started on Apache Nutch, which consisted of around 6,000 lines of code for MapReduce and around 5,000 lines of code for HDFS. In April 2006, Hadoop 0.1.0 was released.
Hadoop is an open-source software framework for storing and
processing big data. It was created by the Apache Software Foundation in
2006, based on white papers published by Google describing the Google File
System (GFS, 2003) and the MapReduce programming model (2004). The
Hadoop framework allows for the distributed processing of large data
sets across clusters of computers using simple programming models. It is
designed to scale up from single servers to thousands of machines, each
offering local computation and storage. It is used by many
organizations, including Yahoo, Facebook, and IBM, for a variety of
purposes such as data warehousing, log processing, and research. Hadoop
has been widely adopted in the industry and has become a key technology
for big data processing.

Features of Hadoop:
1. It is fault-tolerant.

2. It is highly available.

3. Its programming model is easy.

4. It has huge, flexible storage.

5. It is low cost.

Hadoop has several key features that make it well-suited for big data processing:

Distributed Storage: Hadoop stores large data sets across multiple machines, allowing for the storage and processing of extremely large amounts of data.

Scalability: Hadoop can scale from a single server to thousands of machines, making it easy to add more capacity as needed.

Fault-Tolerance: Hadoop is designed to be highly fault-tolerant, meaning it can continue to operate even in the presence of hardware failures.

Data Locality: Hadoop processes data on the same node where it is stored, which reduces network traffic and improves performance.

High Availability: Hadoop provides a high-availability feature, which helps ensure that data is always available and is not lost.

Flexible Data Processing: Hadoop’s MapReduce programming model allows for the processing of data in a distributed fashion, making it easy to implement a wide variety of data processing tasks.

Data Integrity: Hadoop provides a built-in checksum feature, which helps ensure that stored data is consistent and correct.

Data Replication: Hadoop provides a data replication feature, which replicates data across the cluster for fault tolerance.

Data Compression: Hadoop provides built-in data compression, which helps reduce storage space and improve performance.

YARN: A resource management platform that allows multiple data processing engines, such as real-time streaming, batch processing, and interactive SQL, to run and process data stored in HDFS.

Hadoop Distributed File System

Hadoop has a distributed file system known as HDFS. HDFS splits files into blocks and sends them across various nodes in the form of large clusters. In case of a node failure the system keeps operating, and data transfer between the nodes is facilitated by HDFS.


Advantages of HDFS: It is inexpensive, immutable in nature, stores data reliably, tolerates faults, is scalable and block-structured, can process a large amount of data simultaneously, and much more.

Disadvantages of HDFS: Its biggest disadvantage is that it is not a good fit for small quantities of data. It also has issues related to potential stability, and it can be restrictive and rough in nature. Hadoop also supports a wide range of software packages such as Apache Flume, Apache Oozie, Apache HBase, Apache Sqoop, Apache Spark, Apache Storm, Apache Pig, Apache Hive, Apache Phoenix, and Cloudera Impala.
Some common frameworks of Hadoop

1. Hive - It uses HiveQL for data structuring and for writing complicated MapReduce jobs over HDFS.

2. Drill - It consists of user-defined functions and is used for data exploration.

3. Storm - It allows real-time processing and streaming of data.

4. Spark - It contains a Machine Learning Library (MLlib) for providing enhanced machine learning and is widely used for data processing. It also supports Java, Python, and Scala.

5. Pig - It has Pig Latin, a SQL-like language, and performs data transformation on unstructured data.

6. Tez - It reduces the complexities of Hive and Pig and helps their code run faster.

The Hadoop framework is made up of the following modules:

1. Hadoop MapReduce - a MapReduce programming model for handling and processing large data.

2. Hadoop Distributed File System - distributes files in clusters among nodes.

3. Hadoop YARN - a platform which manages computing resources.

4. Hadoop Common - contains the packages and libraries which are used by the other modules.

Advantages and Disadvantages of Hadoop


Advantages:

Ability to store a large amount of data.

High flexibility.

Cost effective.

High computational power.

Tasks are independent.

Linear scaling.

Hadoop has several advantages that make it a popular choice for big data processing:

Scalability: Hadoop can easily scale to handle large amounts of data by adding more nodes to the cluster.

Cost-effective: Hadoop is designed to work with commodity hardware, which makes it a cost-effective option for storing and processing large amounts of data.

Fault-tolerance: Hadoop’s distributed architecture provides built-in fault-tolerance, which means that if one node in the cluster goes down, the data can still be processed by the other nodes.

Flexibility: Hadoop can process structured, semi-structured, and unstructured data, which makes it a versatile option for a wide range of big data scenarios.

Open-source: Hadoop is open-source software, which means that it is free to use and modify. This also allows developers to access the source code and make improvements or add new features.

Large community: Hadoop has a large and active community of developers and users who contribute to the development of the software, provide support, and share best practices.

Integration: Hadoop is designed to work with other big data technologies such as Spark, Storm, and Flink, which allows for integration with a wide range of data processing and analysis tools.

Disadvantages:

Not very effective for small data.

Hard cluster management.

Has stability issues.

Security concerns.

Complexity: Hadoop can be complex to set up and maintain, especially for organizations without a dedicated team of experts.

Latency: Hadoop is not well-suited for low-latency workloads and may not be the best choice for real-time data processing.

Limited Support for Real-time Processing: Hadoop’s batch-oriented nature makes it less suited for real-time streaming or interactive data processing use cases.

Limited Support for Structured Data: Hadoop is designed to work with unstructured and semi-structured data, so it is not well-suited for structured data processing.

Data Security: Hadoop does not provide built-in security features such as data encryption or user authentication, which can make it difficult to secure sensitive data.

Limited Support for Ad-hoc Queries: Hadoop’s MapReduce programming model is not well-suited for ad-hoc queries, making it difficult to perform exploratory data analysis.

Limited Support for Graph and Machine Learning: Hadoop’s core components, HDFS and MapReduce, are not well-suited for graph and machine learning workloads; specialized components like Apache Giraph and Mahout are available but have some limitations.

Cost: Hadoop can be expensive to set up and maintain, especially for organizations with large amounts of data.

Data Loss: In the event of a hardware failure, the data stored in a single node may be lost permanently.

Data Governance: Data governance is a critical aspect of data management, and Hadoop does not provide built-in features to manage data lineage, data quality, data cataloging, and data auditing.

How Is Hadoop Being Used?

Hadoop is used in many different sectors today. The following sectors make use of Hadoop.

1. Financial Sectors:
Hadoop is used to detect fraud in the financial sector. Hadoop is also used to analyse fraud patterns. Credit card companies also use Hadoop to find the right customers for their products.

2. Healthcare Sectors:
Hadoop is used to analyse huge volumes of data such as medical device data, clinical data, medical reports, etc. Hadoop analyses and scans the reports thoroughly to reduce manual work.

3. Hadoop Applications in the Retail Industry:
Retailers use Hadoop to improve their sales. Hadoop also helps in tracking the products bought by customers. Hadoop helps retailers to predict the price range of products and to sell their products online. These advantages of Hadoop help the retail industry a lot.

4. Security and Law Enforcement:
The National Security Agency of the USA uses Hadoop to prevent terrorist attacks. Data tools are used by the police to chase criminals and predict their plans. Hadoop is also used in defence, cybersecurity, etc.

5. Hadoop Uses in Advertisements:
Hadoop is also used in the advertisement sector. Hadoop is used for capturing video, analysing transactions, and handling social media platforms. The data analysed is generated through social media platforms like Facebook, Instagram, etc. Hadoop is also used in the promotion of products.

There are many more uses of Hadoop in daily life as well as in the software sector.

Components of Hadoop
Hadoop is a framework that uses distributed storage and parallel
processing to store and manage Big Data. It is the most commonly used
software to handle Big Data. There are three components of Hadoop.

1. Hadoop HDFS - Hadoop Distributed File System (HDFS) is the storage unit of
Hadoop.

2. Hadoop MapReduce - Hadoop MapReduce is the processing unit of Hadoop.

3. Hadoop YARN - Hadoop YARN is a resource management unit of Hadoop.

Let us take a detailed look at Hadoop HDFS in this part of the What is Hadoop
article.

Hadoop HDFS
Data is stored in a distributed manner in HDFS. There are two components of
HDFS - name node and data node. While there is only one name node, there can
be multiple data nodes.
HDFS is specially designed for storing huge datasets on commodity hardware. An enterprise-grade server costs roughly $10,000 per terabyte for the full processor; if you need to buy 100 of these enterprise servers, the cost goes up to a million dollars. Hadoop enables you to use commodity machines as your data nodes, so you don’t have to spend millions of dollars just on your data nodes. However, the name node is always an enterprise server.

Features of HDFS
Provides distributed storage

Can be implemented on commodity hardware

Provides data security

Highly fault-tolerant - If one machine goes down, the data from that machine
goes to the next machine

Master and Slave Nodes

Master and slave nodes form the HDFS cluster. The name node is called the master, and the data nodes are called the slaves.

The name node is responsible for the workings of the data nodes. It also stores the metadata.

The data nodes read, write, process, and replicate the data. They
also send signals, known as heartbeats, to the name node. These
heartbeats show the status of the data node.

Consider that 30 TB of data is loaded into the name node. The name node distributes it across the data nodes, and this data is replicated among the data nodes. In the accompanying diagram, the blue, grey, and red data blocks are replicated among the three data nodes.

Replication of the data is performed three times by default. It is done this way so that if a commodity machine fails, you can replace it with a new machine that has the same data.

Let us focus on Hadoop MapReduce in the following section of the What is Hadoop article.

Hadoop MapReduce
Hadoop MapReduce is the processing unit of Hadoop. In the MapReduce
approach, the processing is done at the slave nodes, and the final
result is sent to the master node.

The code used to process the data is sent to the data rather than the other way around. This code is usually very small in comparison to the data itself: you only need to send a few kilobytes worth of code to perform a heavy-duty process on the computers.

The input dataset is first split into chunks of data. In this
example, the input has three lines of text with three separate entities -
“bus car train,” “ship ship train,” “bus ship car.” The dataset is then
split into three chunks, based on these entities, and processed
parallelly.

In the map phase, the data is assigned a key and a value of 1. In this case, we have
one bus, one car, one ship, and one train.

These key-value pairs are then shuffled and sorted together based on
their keys. At the reduce phase, the aggregation takes place, and the
final output is obtained.
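To make the word-count flow described above concrete, the following is a minimal sketch of the classic Hadoop WordCount job written against the Java MapReduce API. The input and output paths passed on the command line are placeholders for real HDFS paths; everything else uses standard Hadoop classes.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase: emit (word, 1) for every word, e.g. ("bus", 1), ("car", 1), ("train", 1).
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable one = new IntWritable(1);
        private final Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    // Reduce phase: after shuffle and sort, sum the 1s for each word, e.g. ("ship", 3).
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // input directory in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory in HDFS
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Run over the three example lines above, this job would emit bus 2, car 2, ship 3, and train 2, which is exactly the shuffle, sort, and reduce behaviour just described.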
Hadoop YARN is the next concept we shall focus on in the What is Hadoop article.

Hadoop YARN
Hadoop YARN stands for Yet Another Resource Negotiator. It is the
resource management unit of Hadoop and is available as a component of
Hadoop version 2.

Hadoop YARN acts like an OS for Hadoop. It is a resource management layer that sits on top of HDFS.

It is responsible for managing cluster resources to make sure you don't overload one machine.

It performs job scheduling to make sure that the jobs are scheduled in the right place.

Suppose a client machine wants to do a query or fetch some code for data
analysis. This job request goes to the resource manager (Hadoop Yarn), which is
responsible for resource allocation and management.

In the node section, each of the nodes has its node managers. These
node managers manage the nodes and monitor the resource usage in the
node. The containers contain a collection of physical resources, which
could be RAM, CPU, or hard drives. Whenever a job request comes in, the
app master requests the container from the node manager. Once the node
manager gets the resource, it goes back to the Resource Manager.

Hadoop EcoSystem
Overview: Apache Hadoop is an open-source framework intended to make interaction with big data easier. However, for those who are not acquainted with this technology, one question arises: what is big data? Big data is a term given to data sets which can’t be processed in an efficient manner with the help of traditional methodologies such as RDBMS. Hadoop has made its place in the industries and companies that need to work on large data sets which are sensitive and need efficient handling. Hadoop is a framework that enables processing of large data sets which reside in the form of clusters. Being a framework, Hadoop is made up of several modules that are supported by a large ecosystem of technologies.

Introduction: The Hadoop ecosystem is a platform or suite which provides various services to solve big data problems. It includes Apache projects and various commercial tools and solutions. There are four major elements of Hadoop, i.e. HDFS, MapReduce, YARN, and Hadoop Common Utilities. Most of the other tools or solutions are used to supplement or support these major elements. All these tools work collectively to provide services such as absorption, analysis, storage, and maintenance of data. The following are the components that collectively form the Hadoop ecosystem:

HDFS: Hadoop Distributed File System

YARN: Yet Another Resource Negotiator

MapReduce: Programming based Data Processing

Spark: In-Memory data processing

PIG, HIVE: Query based processing of data services

HBase: NoSQL Database

Mahout, Spark MLLib: Machine Learning algorithm libraries

Solr, Lucene: Searching and Indexing

Zookeeper: Managing cluster

Oozie: Job Scheduling

Note: Apart from the above-mentioned components, there are many other components that are part of the Hadoop ecosystem. All these toolkits or components revolve around one thing, i.e. data. That’s the beauty of Hadoop: everything revolves around data, which makes working with it easier.

HDFS:

HDFS is the primary or major component of the Hadoop ecosystem and is responsible for storing large data sets of structured or unstructured data across various nodes, thereby maintaining the metadata in the form of log files.

HDFS consists of two core components, i.e.

1. Name Node

2. Data Node

The Name Node is the prime node and contains metadata (data about data), requiring comparatively fewer resources than the data nodes, which store the actual data. These data nodes are commodity hardware in the distributed environment, which undoubtedly makes Hadoop cost-effective.

HDFS maintains all the coordination between the clusters and hardware, thus
working at the heart of the system.

YARN:

Yet Another Resource Negotiator, as the name implies, is the component which helps to manage the resources across the clusters. In short, it performs scheduling and resource allocation for the Hadoop system.

It consists of three major components, i.e.

1. Resource Manager

2. Node Manager

3. Application Manager

The Resource Manager has the privilege of allocating resources for the applications in the system, whereas Node Managers work on the allocation of resources such as CPU, memory, and bandwidth per machine, and later acknowledge the Resource Manager. The Application Manager works as an interface between the Resource Manager and Node Manager and performs negotiations as per the requirements of the two.

MapReduce:

By making use of distributed and parallel algorithms, MapReduce makes it possible to carry over the processing logic and helps to write applications which transform big data sets into manageable ones.

MapReduce makes use of two functions, Map() and Reduce(), whose tasks are:

1. Map() performs sorting and filtering of the data, thereby organizing it in the form of groups. Map() generates a key-value pair based result which is later processed by the Reduce() method.

2. Reduce(), as the name suggests, does the summarization by aggregating the mapped data. In simple terms, Reduce() takes the output generated by Map() as input and combines those tuples into a smaller set of tuples.

PIG:
Pig was originally developed by Yahoo. It works on the Pig Latin language, a query-based language similar to SQL.

It is a platform for structuring the data flow and for processing and analyzing huge data sets.

Pig does the work of executing commands, and in the background all the activities of MapReduce are taken care of. After the processing, Pig stores the result in HDFS.

The Pig Latin language is specially designed for this framework, and it runs on Pig Runtime, just the way Java runs on the JVM.

Pig helps to achieve ease of programming and optimization and hence is a major segment of the Hadoop ecosystem.

HIVE:

With the help of SQL methodology and an SQL-like interface, Hive performs reading and writing of large data sets. Its query language is called HQL (Hive Query Language).

It is highly scalable, as it allows both real-time processing and batch processing. Also, all the SQL data types are supported by Hive, making query processing easier.

Similar to other query processing frameworks, Hive comes with two components: JDBC Drivers and the Hive Command Line.

JDBC, along with ODBC drivers, works on establishing data storage permissions and connections, whereas the Hive Command Line helps in the processing of queries.

Mahout:

Mahout adds machine learnability to a system or application. Machine learning, as the name suggests, helps the system to develop itself based on patterns, user/environmental interaction, or algorithms.

It provides various libraries and functionalities such as collaborative filtering, clustering, and classification, which are core concepts of machine learning. It allows invoking algorithms as per our need with the help of its own libraries.

Apache Spark:

It’s a platform that handles all the process-consumptive tasks like batch processing, interactive or iterative real-time processing, graph conversions, visualization, etc.

It consumes in-memory resources, thus being faster than the prior options in terms of optimization.

Spark is best suited for real-time data, whereas Hadoop is best suited for structured data or batch processing; hence both are used in most companies interchangeably.

Apache HBase:

It’s a NoSQL database which supports all kinds of data and is thus capable of handling anything within a Hadoop database. It provides the capabilities of Google’s BigTable, and is thus able to work on big data sets effectively.

At times when we need to search for or retrieve a few small occurrences of something in a huge database, the request must be processed within a short span of time. At such times, HBase comes in handy, as it gives us a tolerant way of storing limited data.

Other Components: Apart from all of these, there are some other components that carry out huge tasks in order to make Hadoop capable of processing large datasets. They are as follows:

Solr, Lucene: These are two services that perform the task of searching and indexing with the help of some Java libraries. Lucene in particular is based on Java and also allows a spell-check mechanism. Lucene is driven by Solr.

Zookeeper: There was a huge issue with the management of coordination and synchronization among the resources or components of Hadoop, which often resulted in inconsistency. Zookeeper overcame all these problems by performing synchronization, inter-component communication, grouping, and maintenance.

Oozie: Oozie simply performs the task of a scheduler, scheduling jobs and binding them together as a single unit. There are two kinds of jobs, i.e. Oozie workflow jobs and Oozie coordinator jobs. Oozie workflow jobs are those that need to be executed in a sequentially ordered manner, whereas Oozie coordinator jobs are those that are triggered when some data or external stimulus is given to them.

Apache Hive
Apache Hive is a data warehouse and ETL tool which provides an SQL-like interface between the user and the Hadoop Distributed File System (HDFS) and integrates with Hadoop. It is built on top of Hadoop. It is a software project that provides data query and analysis. It facilitates reading, writing, and handling wide datasets stored in distributed storage and queried using Structured Query Language (SQL) syntax. It is not built for Online Transactional Processing (OLTP) workloads. It is frequently used for data warehousing tasks like data encapsulation, ad-hoc queries, and analysis of huge datasets. It is designed to enhance scalability, extensibility, performance, fault-tolerance, and loose coupling with its input formats.
Hive was initially developed by Facebook and is also used by companies such as Amazon and Netflix. It delivers standard SQL functionality for analytics. Without Hive, traditional SQL-style queries have to be written in the MapReduce Java API to execute SQL applications and queries over distributed data. Hive provides portability, as most data warehousing applications work with SQL-based query languages.
Apache Hive is a data warehouse software project that is built on top
of the Hadoop ecosystem. It provides an SQL-like interface to query and
analyze large datasets stored in Hadoop’s distributed file system
(HDFS) or other compatible storage systems.
Hive uses a language called HiveQL, which is similar to SQL, to allow
users to express data queries, transformations, and analyses in a
familiar syntax. HiveQL statements are compiled into MapReduce jobs,
which are then executed on the Hadoop cluster to process the data.
Hive includes many features that make it a useful tool for big data
analysis, including support for partitioning, indexing, and user-defined
functions (UDFs). It also provides a number of optimization techniques
to improve query performance, such as predicate pushdown, column
pruning, and query parallelization.
Hive can be used for a variety of data processing tasks, such as data
warehousing, ETL (extract, transform, load) pipelines, and ad-hoc data
analysis. It is widely used in the big data industry, especially in
companies that have adopted the Hadoop ecosystem as their primary data
processing platform.

Components of Hive:

1. HCatalog – It is a Hive component and is a table and storage management layer for Hadoop. It enables users of various data processing tools like Pig and MapReduce to read and write data on the grid easily.

2. WebHCat – It provides a service which can be utilized by the user to run Hadoop MapReduce (or YARN), Pig, or Hive tasks, or to perform Hive metadata operations, through an HTTP interface.

Modes of Hive:

1. Local Mode – It is used when Hadoop is built in pseudo mode with only one data node, when the data size is small enough to be restricted to a single local machine, and when processing will be faster on the smaller datasets present on the local machine.

2. Map Reduce Mode – It is used when Hadoop is built with multiple data nodes and the data is divided across the various nodes. It works on huge datasets, queries are executed in parallel, and it achieves enhanced performance when processing large datasets.

Characteristics of Hive:

1. Databases and tables are built before loading the data.

2. Hive as a data warehouse is built to manage and query only structured data which resides in tables.

3. When handling structured data, MapReduce lacks optimization and usability functions such as UDFs, whereas the Hive framework provides optimization and usability.

4. Programming in Hadoop deals directly with the files, so Hive can partition the data with directory structures to improve performance on certain queries.

5. Hive is compatible with various file formats such as TEXTFILE, SEQUENCEFILE, ORC, RCFILE, etc.

6. Hive uses the Derby database for single-user metadata storage and MySQL for multi-user or shared metadata.

Features of Hive:

1. It provides indexes, including bitmap indexes, to accelerate queries. Index types including compaction and bitmap indexes are available as of version 0.10.

2. Metadata storage in an RDBMS reduces the time needed to perform semantic checks during query execution.

3. Built-in user-defined functions (UDFs) to manipulate strings, dates, and other data-mining tools. Hive supports extending the UDF set to handle use cases not covered by the built-in functions (a minimal UDF sketch follows this list).

4. DEFLATE, BWT, Snappy, etc. are the algorithms used to operate on compressed data stored in the Hadoop ecosystem.

5. It stores schemas in a database and processes the data into the Hadoop Distributed File System (HDFS).

6. It is built for Online Analytical Processing (OLAP).

7. It delivers various types of querying language, frequently known as Hive Query Language (HQL or HiveQL).
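As a minimal sketch of point 3 above, the class below is a simple Hive UDF written in Java against the classic org.apache.hadoop.hive.ql.exec.UDF API (newer Hive releases also offer GenericUDF). The class name, jar name, and table used in the registration comments are illustrative assumptions, not part of any standard distribution.

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

// A simple UDF that upper-cases a string column.
// It would typically be registered from HiveQL along these lines:
//   ADD JAR my-udfs.jar;
//   CREATE TEMPORARY FUNCTION to_upper AS 'ToUpperUDF';
//   SELECT to_upper(name) FROM employees;
public class ToUpperUDF extends UDF {
    public Text evaluate(Text input) {
        if (input == null) {
            return null;                 // pass NULL values through unchanged
        }
        return new Text(input.toString().toUpperCase());
    }
}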

Advantages:
Scalability: Apache Hive is designed to handle large volumes of data, making it a
scalable solution for big data processing.
Familiar SQL-like interface: Hive uses a SQL-like language called HiveQL, which
makes it easy for SQL users to learn and use.
Integration with Hadoop ecosystem:
Hive integrates well with the Hadoop ecosystem, enabling users to
process data using other Hadoop tools like Pig, MapReduce, and Spark.
Supports partitioning and bucketing: Hive supports partitioning and bucketing,
which can improve query performance by limiting the amount of data scanned.
User-defined functions: Hive allows users to define their own functions, which
can be used in HiveQL queries.

Disadvantages:
Limited real-time processing: Hive is designed for batch processing, which
means it may not be the best tool for real-time data processing.

Slow performance: Hive can be slower than traditional relational databases because it is built on top of Hadoop, which is optimized for batch processing rather than interactive querying.
Steep learning curve: While Hive uses a SQL-like language, it still requires users to have knowledge of Hadoop and distributed computing, which can make it difficult for beginners to use.
Lack of support for transactions: Hive does not support transactions, which can make it difficult to maintain data consistency.
Limited flexibility: Hive is not as flexible as other data warehousing tools because it is designed to work specifically with Hadoop, which can limit its usability in other environments.

Architecture and Working of Hive

The major components of Hive and its interaction with Hadoop are demonstrated in the figure below, and all the components are described further:

User Interface (UI) – As the name describes, the user interface provides an interface between the user and Hive. It enables the user to submit queries and other operations to the system. The Hive Web UI, the Hive command line, and Hive HD Insight (on Windows Server) are supported by the user interface.

Hive Server – It is referred to as the Apache Thrift Server. It accepts the request from different clients and provides it to the Hive Driver.

Driver – Queries submitted by the user through the interface are received by the driver within Hive. The concept of session handles is implemented by the driver. Execute and fetch APIs modelled on JDBC/ODBC interfaces are provided by the driver.

Compiler – The compiler parses the query and performs semantic analysis on the different query blocks and query expressions. The compiler eventually generates the execution plan with the help of the table metadata in the database and the partition metadata obtained from the metastore.

Metastore – All the structural information of the different tables and partitions in the warehouse, including attributes and attribute-level information, is stored in the metastore, along with the serializers and deserializers necessary to read and write data and the corresponding HDFS files where the data is stored. Hive selects corresponding database servers to store the schema or metadata of databases, tables, attributes in a table, data types of databases, and the HDFS mapping.

Execution Engine – Execution of the execution plan made by the compiler is performed by the execution engine. The plan is a DAG of stages. The dependencies between the various stages of the plan are managed by the execution engine, and it executes these stages on the suitable system components.

Diagram – Architecture of Hive, built on top of Hadoop.

In the above diagram, along with the architecture, the job execution flow in Hive with Hadoop is demonstrated step by step.

Step-1: Execute Query – A Hive interface such as the command line or the web user interface delivers the query to the driver to execute. Here, the UI calls the execute interface of the driver, e.g. via ODBC or JDBC.

Step-2: Get Plan – The driver creates a session handle for the query and transfers the query to the compiler to make the execution plan. In other words, the driver interacts with the compiler.

Step-3: Get Metadata – The compiler transfers the metadata request to the metastore and gets the necessary metadata from it.

Step-4: Send Metadata – The metastore transfers the metadata to the compiler as an acknowledgment.

Step-5: Send Plan – The compiler communicates the execution plan it has made back to the driver so the query can be executed.

Step-6: Execute Plan – The execution plan is sent to the execution engine by the driver. At this stage the job is executed (Execute Job, Job Done), along with any DFS/metadata operations.

Step-7: Fetch Results – The user interface (UI) fetches the results from the driver.

Step-8: Send Results – The fetch request is transferred from the driver to the execution engine. When the result is retrieved from the data nodes by the execution engine, it returns the result to the driver and then to the user interface (UI).

Advantages of Hive Architecture:

Scalability: Hive is a distributed system that can easily scale to handle large volumes of data by adding more nodes to the cluster.
Data Accessibility: Hive allows users to access data stored in Hadoop without the need for complex programming skills. A SQL-like language is used for queries, and HiveQL is based on SQL syntax.
Data Integration: Hive integrates easily with other tools and systems in the Hadoop ecosystem such as Pig, HBase, and MapReduce.
Flexibility: Hive can handle both structured and unstructured data, and supports various data formats including CSV, JSON, and Parquet.
Security: Hive provides security features such as authentication, authorization, and encryption to ensure data privacy.

Disadvantages of Hive Architecture:
High Latency: Hive’s performance is slower compared to traditional databases because of the overhead of running queries in a distributed system.
Limited Real-time Processing: Hive is not ideal for real-time data processing as it is designed for batch processing.
Complexity: Hive is complex to set up and requires a high level of expertise in Hadoop, SQL, and data warehousing concepts.
Lack of Full SQL Support: HiveQL does not support all SQL operations, such as transactions and indexes, which may limit the usefulness of the tool for certain applications.
Debugging Difficulties: Debugging Hive queries can be difficult as the queries are executed across a distributed system, and errors may occur in different nodes.

Hive Architecture
The following architecture explains the flow of submission of query into Hive.

Hive Client
Hive allows writing applications in various languages, including
Java, Python, and C++. It supports different types of clients such as:-

Thrift Server - It is a cross-language service provider platform that serves requests from all those programming languages that support Thrift.

JDBC Driver - It is used to establish a connection between Hive and Java applications. The JDBC driver is present in the class org.apache.hadoop.hive.jdbc.HiveDriver.

ODBC Driver - It allows applications that support the ODBC protocol to connect to Hive.
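As a sketch of how a Java application can talk to Hive through JDBC, the example below connects to a HiveServer2 instance and runs a HiveQL query. The host, port, user, and table name are assumptions to be replaced with real values; note that HiveServer2 uses the driver class org.apache.hive.jdbc.HiveDriver, while the class mentioned above belongs to the older HiveServer1.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcExample {
    public static void main(String[] args) throws Exception {
        // Load the HiveServer2 JDBC driver.
        Class.forName("org.apache.hive.jdbc.HiveDriver");

        // Hypothetical connection details; adjust to your cluster.
        String url = "jdbc:hive2://localhost:10000/default";
        try (Connection con = DriverManager.getConnection(url, "hiveuser", "");
             Statement stmt = con.createStatement();
             // A simple HiveQL query; Hive compiles it into MapReduce/Tez jobs.
             ResultSet rs = stmt.executeQuery(
                     "SELECT category, COUNT(*) FROM sales GROUP BY category")) {
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
            }
        }
    }
}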

Hive Services
The following are the services provided by Hive:-

Hive CLI - The Hive CLI (Command Line Interface) is a shell where we can execute Hive queries and commands.

Hive Web User Interface - The Hive Web UI is just an alternative to the Hive CLI. It provides a web-based GUI for executing Hive queries and commands.

Hive MetaStore - It is a central repository that stores all the structure information of the various tables and partitions in the warehouse. It also includes metadata about each column and its type, the serializers and deserializers used to read and write data, and the corresponding HDFS files where the data is stored.

Hive Server - It is referred to as the Apache Thrift Server. It accepts requests from different clients and provides them to the Hive Driver.

Hive Driver - It receives queries from different sources like the web UI, CLI, Thrift, and the JDBC/ODBC driver. It transfers the queries to the compiler.

Hive Compiler - The purpose of the compiler is to parse the query and perform semantic analysis on the different query blocks and expressions. It converts HiveQL statements into MapReduce jobs.

Hive Execution Engine - The optimizer generates the logical plan in the form of a DAG of map-reduce tasks and HDFS tasks. In the end, the execution engine executes the incoming tasks in the order of their dependencies.

Difference Between RDBMS and Hadoop

RDBMS (Relational Database Management System):
RDBMS is an information management system which is based on a data model. In an RDBMS, tables are used for information storage. Each row of the table represents a record and each column represents an attribute of the data. The organization of data and its manipulation processes are different in an RDBMS from other databases. An RDBMS ensures the ACID (atomicity, consistency, isolation, durability) properties required for designing a database. The purpose of an RDBMS is to store, manage, and retrieve data as quickly and reliably as possible.
Hadoop: It is an open-source software framework used for
storing data and running applications on a group of commodity hardware.
It has large storage capacity and high processing power. It can manage
multiple concurrent processes at the same time. It is used in predictive
analysis, data mining and machine learning. It can handle both
structured and unstructured form of data. It is more flexible in
storing, processing, and managing data than traditional RDBMS. Unlike
traditional systems, Hadoop enables multiple analytical processes on the
same data at the same time. It supports scalability very flexibly.
Below is a table of differences between RDBMS and Hadoop:

S.No. | RDBMS | Hadoop
1. | Traditional row-column based databases, basically used for data storage, manipulation and retrieval. | An open-source software framework used for storing data and running applications or processes concurrently.
2. | In this, structured data is mostly processed. | In this, both structured and unstructured data is processed.
3. | It is best suited for an OLTP environment. | It is best suited for BIG data.
4. | It is less scalable than Hadoop. | It is highly scalable.
5. | Data normalization is required in RDBMS. | Data normalization is not required in Hadoop.
6. | It stores transformed and aggregated data. | It stores huge volumes of data.
7. | It has no latency in response. | It has some latency in response.
8. | The data schema of RDBMS is static. | The data schema of Hadoop is dynamic.
9. | High data integrity is available. | Lower data integrity is available than in RDBMS.
10. | Cost is applicable for licensed software. | Free of cost, as it is open-source software.

Introduction to Hadoop Distributed File System (HDFS)
With growing data velocity, the data size easily outgrows the storage limit of a single machine. A solution would be to store the data across a network of machines. Such filesystems are called distributed filesystems. Since data is stored across a network, all the complications of a network come in.
This is where Hadoop comes in. It provides one of the most reliable filesystems. HDFS (Hadoop Distributed File System) is a unique design that provides storage for extremely large files with a streaming data access pattern, and it runs on commodity hardware. Let’s elaborate on these terms:

Extremely large files: Here we are talking about the data in range of
petabytes(1000 TB).

Streaming Data Access Pattern: HDFS is designed on the principle of write-once and read-many-times. Once data is written, large portions of the dataset can be processed any number of times.

Commodity hardware: Hardware that is inexpensive and easily available in the market. This is one of the features which especially distinguishes HDFS from other file systems.

Nodes: Master and slave nodes typically form the HDFS cluster.

1. NameNode (MasterNode):

Manages all the slave nodes and assigns work to them.

It executes filesystem namespace operations like opening, closing, and renaming files and directories.

It should be deployed on reliable, high-configuration hardware, not on commodity hardware.

2. DataNode(SlaveNode):

Actual worker nodes, who do the actual work like reading, writing,
processing etc.

They also perform creation, deletion, and replication upon instruction from
the master.

They can be deployed on commodity hardware.

HDFS daemons: Daemons are the processes running in background.

Namenodes:

Run on the master node.

Store metadata (data about data) like file path, the number of blocks, block
Ids. etc.

Require high amount of RAM.

Store meta-data in RAM for fast retrieval i.e to reduce seek time. Though a
persistent copy of it is kept on disk.

DataNodes:

Run on slave nodes.

Require a large amount of storage, as the data is actually stored here.

Data storage in HDFS: Now let’s see how the data is stored in a distributed manner.

Let’s assume that a 100 TB file is inserted. The master node (namenode) will first divide the file into blocks (say, blocks of 10 TB each for illustration; the actual default block size is 128 MB in Hadoop 2.x and above). These blocks are then stored across different datanodes (slavenodes). The datanodes (slavenodes) replicate the blocks among themselves, and information about which blocks they contain is sent to the master. The default replication factor is 3, meaning that for each block 3 replicas are created (including itself). In hdfs-site.xml we can increase or decrease the replication factor, i.e. we can edit its configuration there.
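Besides editing hdfs-site.xml, the block size and replication factor of an existing file can be inspected, and the replication factor changed per file, through the Java FileSystem API. The sketch below assumes a Hadoop configuration on the classpath and uses a hypothetical file path.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReplicationExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();        // reads core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/data/input/sample.txt");  // hypothetical HDFS path
        FileStatus status = fs.getFileStatus(file);
        System.out.println("Block size (bytes): " + status.getBlockSize());
        System.out.println("Replication factor: " + status.getReplication());

        // Raise the replication factor of this one file to 4;
        // the cluster-wide default (dfs.replication) stays untouched.
        fs.setReplication(file, (short) 4);

        fs.close();
    }
}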

Note: The MasterNode has a record of everything; it knows the location and information of each and every data node and the blocks it contains, i.e. nothing is done without the permission of the masternode.

Why divide the file into blocks?
Answer: Let’s assume that we don’t divide. Now it’s very difficult to store a 100 TB file on a single machine. Even if we store it, each read and write operation on that whole file is going to take a very high seek time. But if we have multiple blocks of size 128 MB, then it becomes easier to perform various read and write operations on them compared to doing it on the whole file at once. So we divide the file to have faster data access, i.e. to reduce seek time.

Why replicate the blocks in data nodes while storing?
Answer: Let’s assume we don’t replicate and only one yellow block is present on datanode D1. Now if the data node D1 crashes, we will lose the block, which will make the overall data inconsistent and faulty. So we replicate the blocks to achieve fault tolerance.
Terms related to HDFS:

HeartBeat: It is the signal that a datanode continuously sends to the namenode. If the namenode doesn’t receive a heartbeat from a datanode, it will consider it dead.

Balancing: If a datanode has crashed, the blocks present on it will be gone too, and those blocks will be under-replicated compared to the remaining blocks. Here the master node (namenode) will give a signal to the datanodes containing replicas of those lost blocks to replicate them, so that the overall distribution of blocks is balanced.

Replication: It is done by the datanodes.

Note: No two replicas of the same block are present on the same datanode.
Features:

Distributed data storage.

Blocks reduce seek time.

The data is highly available, as the same block is present at multiple datanodes.

Even if multiple datanodes are down we can still do our work, thus making it highly reliable.

High fault tolerance.

Limitations: Though HDFS provides many features, there are some areas where it doesn’t work well.

Low latency data access: Applications that require low-latency access to data, i.e. in the range of milliseconds, will not work well with HDFS, because HDFS is designed keeping in mind that we need high throughput of data, even at the cost of latency.

Small file problem: Having lots of small files will result in lots of seeks and lots of movement from one datanode to another to retrieve each small file; this whole process is a very inefficient data access pattern.

Data processing with Hadoop

Processing data with Hadoop involves several steps and components within the Hadoop ecosystem. Here's a detailed overview of the process:

1. Data Acquisition: The first step in processing data with Hadoop is acquiring
the data. This could involve collecting data from various sources such as logs,
databases, sensors, social media feeds, etc. The data may be structured,
semi-structured, or unstructured.

2. Data Ingestion: Once the data is collected, it needs to be ingested into the
Hadoop ecosystem. Tools like Apache Flume, Apache Sqoop, or Apache Kafka
are commonly used for this purpose. Flume is used for streaming data
ingestion, Sqoop for transferring bulk data between Hadoop and relational
databases, and Kafka for building real-time data pipelines.

3. Storage: After ingestion, the data is stored in the Hadoop Distributed File
System (HDFS) or other storage systems like Apache HBase, Apache
Cassandra, etc., depending on the requirements of the use case. HDFS is the
primary storage system in Hadoop, designed to store large files across
multiple machines in a distributed manner.

4. Data Processing: Once the data is stored, various processing frameworks and
tools within the Hadoop ecosystem can be used to analyze and manipulate it.
The two main processing frameworks are MapReduce and Apache Spark.

MapReduce: MapReduce is a programming model and processing engine for distributed computing on large datasets. It consists of two main steps: map and reduce. MapReduce distributes the computation across a cluster of machines, abstracting away the complexities of parallel processing. Developers write MapReduce jobs in languages like Java or Python to process data.

Apache Spark: Apache Spark is a fast and general-purpose cluster computing system that provides APIs for Scala, Java, Python, and R. Spark supports in-memory processing, making it faster than MapReduce for many use cases. Spark provides higher-level libraries like Spark SQL for SQL-based querying, Spark Streaming for real-time processing, MLlib for machine learning, and GraphX for graph processing (a short Spark word-count sketch appears after this list of steps).

5. Data Analysis and Visualization: Once the data is processed, it can be analyzed using tools like Apache Hive, Apache Pig, Apache Impala, etc. These tools provide SQL-like interfaces for querying and analyzing data stored in Hadoop. Additionally, data can be visualized using tools like Apache Zeppelin, Apache Superset, or by integrating with external BI tools like Tableau, Power BI, etc.

6. Data Export: After analysis and visualization, the results can be exported or
stored in various formats or systems depending on the requirements. This
could involve exporting data back to relational databases, data warehouses, or
other storage systems outside of Hadoop.

7. Data Security and Governance: Throughout the data processing lifecycle, it's
crucial to ensure data security and governance. Hadoop provides mechanisms
for authentication, authorization, encryption, auditing, and data lineage to
ensure data integrity and compliance with regulations.

8. Monitoring and Optimization: Monitoring tools like Apache Ambari, Cloudera Manager, or third-party solutions can be used to monitor the health and performance of the Hadoop cluster. Optimization techniques such as tuning hardware configurations, optimizing data processing workflows, and utilizing caching mechanisms can improve the overall performance and efficiency of data processing with Hadoop.

By following these steps and leveraging the various components within the
Hadoop ecosystem, organizations can efficiently process, analyze, and derive
insights from large volumes of data.
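To contrast Spark's higher-level, in-memory API (mentioned in step 4) with the MapReduce WordCount shown earlier, here is a sketch of the same word count using Spark's Java API. The input and output paths are placeholders, and the job would typically be submitted to a YARN cluster with spark-submit.

import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

import scala.Tuple2;

public class SparkWordCount {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("spark-word-count");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            JavaRDD<String> lines = sc.textFile(args[0]);             // e.g. an HDFS path
            JavaPairRDD<String, Integer> counts = lines
                    .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator())
                    .mapToPair(word -> new Tuple2<>(word, 1))
                    .reduceByKey(Integer::sum);                       // aggregation happens in memory
            counts.saveAsTextFile(args[1]);
        }
    }
}

The whole map-shuffle-reduce pipeline collapses into three chained transformations, which illustrates the ease-of-use and performance argument made for Spark above.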

Hadoop YARN Architecture
YARN stands for “Yet Another Resource Negotiator”. It was introduced in Hadoop 2.0 to remove the bottleneck of the Job Tracker which was present in Hadoop 1.0. YARN was described as a “Redesigned Resource Manager” at the time of its launch, but it has now evolved to be known as a large-scale distributed operating system used for big data processing.

The YARN architecture basically separates the resource management layer from the processing layer. With YARN, the responsibility of the Hadoop 1.0 Job Tracker is split between the resource manager and the application manager.
YARN also allows different data processing engines like graph
processing, interactive processing, stream processing as well as batch
processing to run and process data stored in HDFS (Hadoop Distributed
File System) thus making the system much more efficient. Through its
various components, it can dynamically allocate various resources and
schedule the application processing. For large volume data processing,
it is quite necessary to manage the available resources properly so that
every application can leverage them.
YARN Features: YARN gained popularity because of the following features:

Scalability: The scheduler in the Resource Manager of the YARN architecture allows Hadoop to extend and manage thousands of nodes and clusters.

Compatibility: YARN supports existing MapReduce applications without disruption, thus making it compatible with Hadoop 1.0 as well.

Cluster Utilization: YARN supports dynamic utilization of the cluster in Hadoop, which enables optimized cluster utilization.

Multi-tenancy: It allows multiple engines to access Hadoop, thus giving organizations the benefit of multi-tenancy.

Hadoop YARN Architecture

The main components of YARN architecture include:

Client: It submits map-reduce jobs.

Resource Manager: It is the master daemon of YARN and is responsible for resource assignment and management among all the applications. Whenever it receives a processing request, it forwards it to the corresponding node manager and allocates resources for the completion of the request accordingly. It has two major components:

Scheduler: It performs scheduling based on the allocated application and available resources. It is a pure scheduler, meaning it does not perform other tasks such as monitoring or tracking and does not guarantee a restart if a task fails. The YARN scheduler supports plugins such as the Capacity Scheduler and the Fair Scheduler to partition the cluster resources.

Application Manager: It is responsible for accepting the application and negotiating the first container from the resource manager. It also restarts the Application Master container if a task fails.

Node Manager: It takes care of an individual node in the Hadoop cluster and manages the applications and workflow on that particular node. Its primary job is to keep up with the Resource Manager. It registers with the Resource Manager and sends heartbeats with the health status of the node. It monitors resource usage, performs log management, and also kills a container based on directions from the Resource Manager. It is also responsible for creating the container process and starting it at the request of the Application Master.

Application Master: An application is a single job submitted to a framework. The Application Master is responsible for negotiating resources with the Resource Manager, tracking the status, and monitoring the progress of a single application. The Application Master requests the container from the Node Manager by sending a Container Launch Context (CLC), which includes everything an application needs to run. Once the application is started, it sends a health report to the Resource Manager from time to time.

Container: It is a collection of physical resources such as RAM, CPU cores, and disk on a single node. Containers are invoked via the Container Launch Context (CLC), which is a record that contains information such as environment variables, security tokens, dependencies, etc.

Application workflow in Hadoop YARN (a small client-side sketch follows this list):

1. The client submits an application.

2. The Resource Manager allocates a container to start the Application Manager.

3. The Application Manager registers itself with the Resource Manager.

4. The Application Manager negotiates containers from the Resource Manager.

5. The Application Manager notifies the Node Manager to launch containers.

6. Application code is executed in the container.

7. The client contacts the Resource Manager/Application Manager to monitor the application's status.

8. Once the processing is complete, the Application Manager un-registers with the Resource Manager.
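For step 7 above (monitoring an application's status), the sketch below uses the YarnClient Java API to ask the Resource Manager for the applications it is tracking and their current state. It assumes a yarn-site.xml on the classpath that points at the cluster's Resource Manager.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.api.records.ApplicationReport;
import org.apache.hadoop.yarn.client.api.YarnClient;

public class ListYarnApplications {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();            // picks up yarn-site.xml
        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(conf);
        yarnClient.start();

        // Ask the Resource Manager for every application it knows about.
        for (ApplicationReport report : yarnClient.getApplications()) {
            System.out.println(report.getApplicationId()
                    + "  " + report.getName()
                    + "  " + report.getYarnApplicationState());
        }

        yarnClient.stop();
    }
}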

Advantages:
Flexibility: YARN offers flexibility to run various types of distributed processing systems such as Apache Spark, Apache Flink, Apache Storm, and others. It allows multiple processing engines to run simultaneously on a single Hadoop cluster.

Resource Management: YARN provides an efficient way of managing resources in the Hadoop cluster. It allows administrators to allocate and monitor the resources required by each application in a cluster, such as CPU, memory, and disk space.

Scalability: YARN is designed to be highly scalable and can handle thousands of nodes in a cluster. It can scale up or down based on the requirements of the applications running on the cluster.

Improved Performance: YARN offers better performance by providing a centralized resource management system. It ensures that the resources are optimally utilized, and applications are efficiently scheduled on the available resources.

Security: YARN provides robust security features such as Kerberos authentication, Secure Shell (SSH) access, and secure data transmission. It ensures that the data stored and processed on the Hadoop cluster is secure.

Disadvantages:
Complexity: YARN adds complexity to the Hadoop ecosystem. It requires additional configurations and settings, which can be difficult for users who are not familiar with YARN.

Overhead: YARN introduces additional overhead, which can slow down the performance of the Hadoop cluster. This overhead is required for managing resources and scheduling applications.

Latency: YARN introduces additional latency in the Hadoop ecosystem. This latency can be caused by resource allocation, application scheduling, and communication between components.

Single Point of Failure: YARN can be a single point of failure in the Hadoop cluster. If YARN fails, it can cause the entire cluster to go down. To avoid this, administrators need to set up a backup YARN instance for high availability.

Limited Support: YARN has limited support for non-Java programming languages. Although it supports multiple processing engines, some engines have limited language support, which can limit the usability of YARN in certain environments.

Managing resources and apps

Managing resources and applications with Hadoop YARN (Yet Another Resource Negotiator) is a critical aspect of running distributed data processing workloads efficiently on a Hadoop cluster. YARN acts as a resource management layer in Hadoop, responsible for scheduling and allocating resources to various applications running within the cluster. Here's a detailed overview of how YARN manages resources and applications:

1. Resource Management: YARN manages the computing resources (CPU, memory, etc.) available in a Hadoop cluster and allocates these resources to applications as needed. It abstracts the underlying compute resources (nodes) into a unified resource pool, making it easier to manage and utilize resources efficiently.

2. Resource Scheduler: YARN includes a pluggable resource scheduler that determines how resources are allocated among competing applications. The default scheduler is the CapacityScheduler, which supports hierarchical queues to allocate resources based on configured capacities and priorities. Another popular scheduler is the FairScheduler, which aims to provide fair sharing of cluster resources among users and applications.

3. Application Lifecycle Management: YARN manages the lifecycle of applications running on the cluster, including submission, scheduling, execution, and monitoring. When a user submits a job or application to the cluster, YARN accepts the application, negotiates required resources, and schedules it for execution.

4. Application Master: Each application running on YARN has an Application Master (AM), which is responsible for negotiating resources with the ResourceManager (RM) and coordinating the execution of tasks on the cluster. The AM runs as a container within the cluster and manages the application's execution from start to finish (a minimal AM-side sketch follows this list).

5. Containerization: YARN uses containers as the basic unit of resource allocation and isolation. A container encapsulates the execution environment for a specific task or process, including CPU, memory, and other resources. YARN manages the lifecycle of containers, allocating them to applications and monitoring their resource usage (the sketch after this list also shows how a container is launched).

6. Dynamic Resource Allocation: YARN supports dynamic resource allocation, allowing applications to request additional resources or release unused resources based on changing workload requirements. This elasticity enables efficient resource utilization and better responsiveness to varying workloads.

7. Fault Tolerance: YARN provides fault tolerance mechanisms to ensure the reliability of applications running on the cluster. In case of node failures or other issues, YARN automatically detects and reschedules failed tasks on healthy nodes to minimize disruption to ongoing computations.

8. Integration with Other Hadoop Ecosystem Components: YARN integrates with other components of the Hadoop ecosystem, such as MapReduce, Apache Spark, Apache Tez, etc., allowing these frameworks to run as applications on the YARN cluster. This integration enables users to leverage diverse processing frameworks while sharing cluster resources efficiently.

9. Resource Isolation and Security: YARN provides mechanisms for resource
isolation and security, ensuring that applications running on the cluster are
sandboxed and cannot disrupt each other. It enforces resource limits, access
controls, and authentication mechanisms to protect the cluster from
unauthorized access or misuse.
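
To make points 1, 4, 5 and 6 concrete, here is a minimal ApplicationMaster-side sketch using Hadoop's AMRMClient and NMClient APIs. It is illustrative only: the container sizes, the number of containers and the launch command are placeholder assumptions, and error handling and callback-based variants are omitted.

```java
import java.util.Collections;
import java.util.List;

import org.apache.hadoop.yarn.api.records.Container;
import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
import org.apache.hadoop.yarn.api.records.FinalApplicationStatus;
import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.AMRMClient;
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;
import org.apache.hadoop.yarn.client.api.NMClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class AppMasterSketch {
  public static void main(String[] args) throws Exception {
    YarnConfiguration conf = new YarnConfiguration();

    // Resource management: the AM talks to the ResourceManager's scheduler.
    AMRMClient<ContainerRequest> rmClient = AMRMClient.createAMRMClient();
    rmClient.init(conf);
    rmClient.start();

    // Containerization: the AM talks to NodeManagers to launch allocated containers.
    NMClient nmClient = NMClient.createNMClient();
    nmClient.init(conf);
    nmClient.start();

    // Register this ApplicationMaster with the ResourceManager.
    rmClient.registerApplicationMaster("", 0, "");   // host, RPC port, tracking URL

    // Ask for two containers of 1 GB / 1 vcore each (placeholder sizes).
    Resource capability = Resource.newInstance(1024, 1);
    Priority priority = Priority.newInstance(0);
    for (int i = 0; i < 2; i++) {
      rmClient.addContainerRequest(new ContainerRequest(capability, null, null, priority));
    }

    // Heartbeat loop: the RM hands back containers as the scheduler allocates them.
    int launched = 0;
    while (launched < 2) {
      List<Container> allocated = rmClient.allocate(0.1f).getAllocatedContainers();
      for (Container container : allocated) {
        // The launch command is purely illustrative.
        ContainerLaunchContext launchCtx = ContainerLaunchContext.newInstance(
            Collections.emptyMap(),                       // local resources
            Collections.emptyMap(),                       // environment variables
            Collections.singletonList("echo hello-from-container"),
            null, null, null);                            // service data, tokens, ACLs
        nmClient.startContainer(container, launchCtx);
        launched++;
      }
      Thread.sleep(1000);
    }

    // Dynamic allocation: containers no longer needed could be returned with
    // rmClient.releaseAssignedContainer(containerId).

    // Unregister once processing is complete.
    rmClient.unregisterApplicationMaster(FinalApplicationStatus.SUCCEEDED, "done", "");
    nmClient.stop();
    rmClient.stop();
  }
}
```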

Overall, YARN plays a central role in managing resources and applications in a Hadoop cluster, enabling efficient and flexible execution of diverse workloads while maximizing resource utilization and ensuring reliability and security.

MapReduce Architecture
MapReduce and HDFS are the two major components of Hadoop which make it so powerful and efficient to use. MapReduce is a programming model used for efficient parallel processing of large data-sets in a distributed manner. The data is first split and then combined to produce the final result. MapReduce libraries have been written in many programming languages, with various optimizations. The purpose of MapReduce in Hadoop is to map each job and then reduce it to equivalent tasks, which lowers the overhead on the cluster network and reduces the processing power required. A MapReduce task is mainly divided into two phases, the Map phase and the Reduce phase.
MapReduce Architecture:

Components of MapReduce Architecture:
1. Client: The MapReduce client is the one who brings the job to MapReduce for processing. There can be multiple clients that continuously send jobs for processing to the Hadoop MapReduce Master.

2. Job: The MapReduce job is the actual work that the client wants to get done, which is composed of many smaller tasks that the client wants to process or execute.

3. Hadoop MapReduce Master: It divides the submitted job into smaller job-parts.

4. Job-Parts: The tasks or sub-jobs obtained after dividing the main job. The results of all the job-parts are combined to produce the final output.

5. Input Data: The data set that is fed to MapReduce for processing.

6. Output Data: The final result obtained after processing.

In MapReduce there is a client. The client submits a job of a particular size to the Hadoop MapReduce Master. The MapReduce Master divides this job into equivalent job-parts, which are then made available to the Map and Reduce tasks. The Map and Reduce tasks contain the program written for the use-case the particular company is solving; the developer writes the logic that fulfills this requirement. The input data is fed to the Map task, and the Map generates intermediate key-value pairs as its output. These key-value pairs are then fed to the Reducer, and the final output is stored on HDFS. Any number of Map and Reduce tasks can be made available for processing the data, as required. The Map and Reduce algorithms are written in an optimized way so that their time and space complexity are kept to a minimum.
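
The flow just described is what a MapReduce driver program encodes: it configures the job, points it at the Mapper and Reducer classes, and submits it to the cluster. Below is a minimal driver sketch for the classic word-count example; the WordCountMapper and WordCountReducer classes it references are illustrative names, sketched after the phase descriptions that follow, and the input/output paths are placeholders taken from the command line.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
  public static void main(String[] args) throws Exception {
    // The client builds a Job description and submits it to the cluster.
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCountDriver.class);

    // Mapper and Reducer classes from the sketch shown after the phase descriptions.
    job.setMapperClass(WordCountMapper.class);
    job.setCombinerClass(WordCountReducer.class);   // optional local aggregation
    job.setReducerClass(WordCountReducer.class);

    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    // Input data is read from HDFS; the final output is written back to HDFS.
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```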
Let’s discuss the MapReduce phases to get a better understanding of its
architecture:

The MapReduce task is mainly divided into two phases, i.e., the Map phase and the Reduce phase.

1. Map: As the name suggests, its main use is to map the input data into key-value pairs. The input to the map may itself be a key-value pair, where the key can be an identifier such as an address and the value is the actual value that it holds. The Map() function is executed on each of these input key-value pairs and generates intermediate key-value pairs, which work as input for the Reducer, i.e. the Reduce() function.

2. Reduce: The intermediate key-value pairs that work as input for the Reducer are shuffled, sorted, and sent to the Reduce() function. The Reducer aggregates or groups the data by key, according to the reduce logic written by the developer (a classic word-count sketch of both functions follows this list).
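
To make the two phases concrete, here is the classic word-count example: the Map function emits a (word, 1) pair for every word in its input split, the framework shuffles and sorts these pairs by key, and the Reduce function sums the counts for each word. The class names are illustrative and match the driver sketch above.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map phase: each input line is split into words, and an intermediate
// (word, 1) key-value pair is emitted for every word.
class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
  private static final IntWritable ONE = new IntWritable(1);
  private final Text word = new Text();

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    StringTokenizer tokens = new StringTokenizer(value.toString());
    while (tokens.hasMoreTokens()) {
      word.set(tokens.nextToken());
      context.write(word, ONE);
    }
  }
}

// Reduce phase: the framework shuffles and sorts the intermediate pairs by key,
// so each call receives one word with all of its counts, which are summed here.
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
  @Override
  protected void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable count : values) {
      sum += count.get();
    }
    context.write(key, new IntWritable(sum));
  }
}
```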

How the Job Tracker and the Task Tracker deal with MapReduce (these are components of classic MapReduce, MRv1; in YARN-based Hadoop their roles are taken over by the ResourceManager, ApplicationMaster, and NodeManager):

1. Job Tracker: The work of the Job Tracker is to manage all the resources and all the jobs across the cluster, and to schedule each map task on a Task Tracker running on the same data node as the data, since there can be hundreds of data nodes available in the cluster.

2. Task Tracker: The Task Trackers can be considered the workers that act on the instructions given by the Job Tracker. A Task Tracker is deployed on each of the nodes available in the cluster and executes the Map and Reduce tasks as instructed by the Job Tracker.

There is also one more important component of the MapReduce architecture, known as the Job History Server. The Job History Server is a daemon process that saves and stores historical information about tasks and applications; for example, the logs generated during or after job execution are stored on the Job History Server.

