Unit 2-1
Introduction to Hadoop
INTRODUCTION:
Hadoop is an open-source software framework that is used for storing
and processing large amounts of data in a distributed computing
environment. It is designed to handle big data and is based on the
MapReduce programming model, which allows for the parallel processing of
large datasets.
What is Hadoop?
Hadoop is an open-source software programming framework for storing large amounts of data and performing computation. The framework is written primarily in Java, with some native code in C and shell scripts.
The term Hadoop is also often used to refer to the broader ecosystem of related projects, such as Apache Pig (a high-level platform for creating MapReduce programs) and HBase (a non-relational, distributed database).
History of Hadoop
Hadoop was developed under the Apache Software Foundation, and its co-founders are Doug Cutting and Mike Cafarella.
Co-founder Doug Cutting named it after his son's toy elephant. In October 2003, Google published its paper on the Google File System. In January 2006, MapReduce development started on Apache Nutch, with around 6,000 lines of code for MapReduce and around 5,000 lines of code for HDFS. In April 2006, Hadoop 0.1.0 was released.
Hadoop is an open-source software framework for storing and processing big data. It was created under the Apache Software Foundation in 2006, based on papers published by Google describing the Google File System (GFS, 2003) and the MapReduce programming model (2004). The
Hadoop framework allows for the distributed processing of large data
sets across clusters of computers using simple programming models. It is
designed to scale up from single servers to thousands of machines, each
offering local computation and storage. It is used by many
organizations, including Yahoo, Facebook, and IBM, for a variety of
purposes such as data warehousing, log processing, and research. Hadoop
has been widely adopted in the industry and has become a key technology
for big data processing.
Features of Hadoop:
1. It is fault-tolerant.
2. It is highly available.
3. It is easy to program.
4. It is scalable.
5. It is low cost.
Hadoop has several key features that make it well-suited for big
data processing:
Distributed Storage: Hadoop stores large data sets across multiple
machines, allowing for the storage and processing of extremely large
amounts of data.
Data locality: Hadoop stores data on the same node where it will be processed, which helps to reduce network traffic and improve performance.
Hadoop has a distributed file system known as HDFS, which splits files into blocks and distributes them across the nodes of large clusters. In case of a node failure the system keeps operating, and the required data transfer between nodes is handled by HDFS.
Some of the tools and frameworks that work alongside HDFS in the Hadoop ecosystem are:
1. Hive - It uses HiveQL for data structuring and for writing complicated MapReduce jobs over data in HDFS.
4. Spark - It contains a Machine Learning Library (MLlib) for providing enhanced machine learning and is widely used for data processing. It also supports Java, Python, and Scala.
5. Pig - It provides Pig Latin, a SQL-like language, and performs data transformation of unstructured data.
6. Tez - It reduces the complexity of Hive and Pig and helps their code run faster.
7. Hadoop Common - It contains the packages and libraries that are used by the other modules.
Advantages:
High flexibility.
Cost effective.
Linear scaling.
Ability to store and process huge amounts of data.
Disadvantages:
Security concerns.
Latency: Hadoop is not well-suited for low-latency workloads and may not be
the best choice for real-time data processing.
Limited support for structured data processing.
Data Loss: In the event of a hardware failure, the data stored in a single node
may be lost permanently.
Applications of Hadoop:
1. Financial Sector:
Hadoop is used to detect fraud in the financial sector and to analyse fraud patterns. Credit card companies also use Hadoop to identify the right customers for their products.
2. Healthcare Sector:
Hadoop is used to analyse huge volumes of data from sources such as medical devices, clinical data, and medical reports. Hadoop scans and analyses the reports thoroughly to reduce manual work.
3. Retail Industry:
Retailers use Hadoop to improve their sales, track the products bought by customers, predict the price range of products, and bring their products online. These capabilities make Hadoop very useful to the retail industry.
Components of Hadoop
Hadoop is a framework that uses distributed storage and parallel
processing to store and manage Big Data. It is the most commonly used
software to handle Big Data. There are three components of Hadoop.
1. Hadoop HDFS - Hadoop Distributed File System (HDFS) is the storage unit of Hadoop.
2. Hadoop MapReduce - Hadoop MapReduce is the processing unit of Hadoop.
3. Hadoop YARN - Hadoop YARN is the resource management unit of Hadoop.
Let us take a detailed look at Hadoop HDFS first.
Hadoop HDFS
Data is stored in a distributed manner in HDFS. There are two components of
HDFS - name node and data node. While there is only one name node, there can
be multiple data nodes.
HDFS is specially designed for storing huge datasets on commodity hardware. An enterprise-grade server costs roughly $10,000 per terabyte with a full processor; if you need 100 such enterprise servers, the cost goes up to a million dollars.
Hadoop enables you to use commodity machines as your data nodes. This
way, you don’t have to spend millions of dollars just on your data
nodes. However, the name node is always an enterprise server.
Features of HDFS
Provides distributed storage
Highly fault-tolerant - if one machine goes down, the data from that machine is served from another machine that holds a replica.
The name node is responsible for the workings of the data nodes. It also stores
the metadata.
The data nodes read, write, process, and replicate the data. They
also send signals, known as heartbeats, to the name node. These
heartbeats show the status of the data node.
Consider that 30 TB of data is loaded through the name node. The name node distributes it across the data nodes, and this data is replicated among the data nodes, so that each block of the file exists on more than one data node.
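To make this concrete, here is a minimal sketch of a client writing to and reading from HDFS with Hadoop's Java FileSystem API. The file path and contents are placeholders of our own, and the sketch assumes the cluster configuration (core-site.xml, hdfs-site.xml) is on the classpath; the point is that the client works through the name node's metadata while the data nodes hold the replicated blocks.

```java
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsClientSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();      // picks up core-site.xml / hdfs-site.xml
    FileSystem fs = FileSystem.get(conf);

    Path file = new Path("/user/demo/hello.txt");  // hypothetical path

    // Write: the name node records the metadata; the blocks land on data nodes.
    try (FSDataOutputStream out = fs.create(file, true)) {
      out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
    }

    // Read: the name node supplies the block locations; the stream pulls the
    // data from whichever data nodes hold the replicas.
    try (FSDataInputStream in = fs.open(file)) {
      IOUtils.copyBytes(in, System.out, 4096, false);
    }
    fs.close();
  }
}
```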
Hadoop MapReduce
Hadoop MapReduce is the processing unit of Hadoop. In the MapReduce
approach, the processing is done at the slave nodes, and the final
result is sent to the master node.
Rather than moving the data to the code, the code is moved to the data. This code is usually very small in comparison to the data itself: you only need to send a few kilobytes' worth of code to perform a heavy-duty process on the machines where the data already resides.
The input dataset is first split into chunks of data. In this example, the input has three lines of text - "bus car train," "ship ship train," "bus ship car." The dataset is then split into three chunks, one per line, and processed in parallel.
In the map phase, each word is assigned as a key with a value of 1, so for each occurrence we get pairs such as (bus, 1), (car, 1), (ship, 1), and (train, 1).
These key-value pairs are then shuffled and sorted together based on
their keys. At the reduce phase, the aggregation takes place, and the
final output is obtained.
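The word-count example above maps directly onto Hadoop's Mapper and Reducer classes. The following is a minimal, illustrative sketch against the standard MapReduce Java API; the class names WordCountMapper and WordCountReducer are our own, not taken from the text.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map phase: for every word in a line ("bus car train"), emit (word, 1).
class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
  private static final IntWritable ONE = new IntWritable(1);
  private final Text word = new Text();

  @Override
  protected void map(LongWritable offset, Text line, Context context)
      throws IOException, InterruptedException {
    StringTokenizer tokens = new StringTokenizer(line.toString());
    while (tokens.hasMoreTokens()) {
      word.set(tokens.nextToken());
      context.write(word, ONE);   // e.g. (bus, 1), (car, 1), (train, 1)
    }
  }
}

// Reduce phase: after shuffle and sort, all values for one key arrive together,
// so summing them gives the final count for that word.
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
  @Override
  protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable c : counts) {
      sum += c.get();
    }
    context.write(word, new IntWritable(sum));
  }
}
```

For the sample input above, the reducer would emit bus 2, car 2, ship 3, train 2.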
Hadoop YARN is the next component we shall focus on.
Hadoop YARN
Hadoop YARN stands for Yet Another Resource Negotiator. It is the
resource management unit of Hadoop and is available as a component of
Hadoop version 2.
Hadoop YARN acts like an operating system for Hadoop. It is a resource management layer that runs on top of HDFS.
It performs job scheduling to make sure that the jobs are scheduled in the right place.
Suppose a client machine wants to run a query or fetch some code for data analysis. This job request goes to the Resource Manager (Hadoop YARN), which is responsible for resource allocation and management.
In the node section, each node has its own Node Manager. These node managers manage the nodes and monitor resource usage on the node. Containers hold a collection of physical resources, such as RAM, CPU, or hard drives. Whenever a job request comes in, the Application Master requests a container from the Node Manager; once the Node Manager has the resources, it reports back to the Resource Manager.
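A rough sketch of the client side of this flow, using the YARN client API, is shown below. The application name, launch command, and resource sizes are placeholder assumptions; the point is simply that the client hands the request to the Resource Manager, which then arranges an Application Master container on one of the nodes.

```java
import java.util.Collections;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext;
import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.client.api.YarnClientApplication;
import org.apache.hadoop.yarn.util.Records;

public class YarnSubmitSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();      // reads yarn-site.xml from the classpath
    YarnClient yarnClient = YarnClient.createYarnClient();
    yarnClient.init(conf);
    yarnClient.start();

    // Ask the Resource Manager for a new application id.
    YarnClientApplication app = yarnClient.createApplication();
    ApplicationSubmissionContext appContext = app.getApplicationSubmissionContext();
    appContext.setApplicationName("demo-app");     // placeholder name

    // Describe the container that will run the Application Master.
    ContainerLaunchContext amContainer = Records.newRecord(ContainerLaunchContext.class);
    amContainer.setCommands(Collections.singletonList("echo hello-from-app-master"));
    appContext.setAMContainerSpec(amContainer);

    // Resources (RAM in MB, virtual cores) requested for the Application Master.
    appContext.setResource(Resource.newInstance(512, 1));

    // Hand the request to the Resource Manager, which schedules it on a node.
    yarnClient.submitApplication(appContext);
    yarnClient.stop();
  }
}
```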
Hadoop EcoSystem
Overview: Apache Hadoop is an open source framework intended to make interaction with big data easier. However, for those who are not acquainted with this technology, one question arises: what is big data? Big data is a term given to data sets which cannot be processed in an efficient manner with the help of traditional methodologies such as RDBMS. Hadoop has made its place in the industries and companies that need to work on large, sensitive data sets which need efficient handling. Hadoop is a framework that enables the processing of large data sets which reside in the form of clusters. Being a framework, Hadoop is made up of several modules that are supported by a large ecosystem of technologies.
Introduction: The Hadoop Ecosystem is a platform or suite which provides various services to solve big data problems. It includes Apache projects and various commercial tools and solutions. There are four major elements of Hadoop, i.e. HDFS, MapReduce, YARN, and the Hadoop Common utilities.
Most of the tools or solutions are used to supplement or support these major elements. All these tools work collectively to provide services such as ingestion, analysis, storage and maintenance of data.
Following are the components that collectively form a Hadoop ecosystem:
Note: Apart from the above-mentioned components, there are many other
components too that are part of the Hadoop ecosystem.
All these toolkits or components revolve around one term, i.e. data. That is the beauty of Hadoop: it revolves around data, which makes data synthesis easier.
HDFS:
1. Name node
2. Data Node
HDFS maintains all the coordination between the clusters and hardware, thus
working at the heart of the system.
YARN:
1. Resource Manager
2. Node Manager
3. Application Manager
MapReduce:
MapReduce makes use of two functions, i.e. Map() and Reduce(), whose tasks are:
1. Map() performs sorting and filtering of the data, thereby organizing it into groups. Map generates key-value pair results which are later processed by the Reduce() method.
2. Reduce() performs summarization by aggregating the mapped data. Put simply, Reduce() takes the output generated by Map() as input and combines those tuples into a smaller set of tuples.
PIG:
Pig was developed by Yahoo. It works on Pig Latin, a query-based language similar to SQL.
It is a platform for structuring the data flow and for processing and analyzing huge data sets.
The Pig Latin language is specially designed for this framework and runs on the Pig Runtime, just as Java runs on the JVM.
HIVE:
With the help of an SQL methodology and interface, Hive performs reading and writing of large data sets. Its query language is called HQL (Hive Query Language).
Similar to other query-processing frameworks, Hive comes with two components: JDBC drivers and the Hive command line.
Mahout:
Mahout provides machine learning capability to a system or application, allowing it to develop itself on the basis of some patterns, user/environmental interaction, or on the basis of algorithms.
Apache Spark:
It consumes in-memory resources, and is hence faster than MapReduce in terms of optimization.
Apache HBase:
It is a non-relational, distributed (NoSQL) database that runs on top of HDFS and can handle all kinds of data.
Solr, Lucene: These two services perform searching and indexing with the help of Java libraries. Lucene is based on Java and allows a spell check mechanism as well. However, Lucene is driven by Solr.
Oozie: Oozie simply performs the task of a scheduler, scheduling jobs and binding them together as a single unit. There are two kinds of jobs, i.e. Oozie workflow jobs and Oozie coordinator jobs. Oozie workflow jobs are those that need to be executed in a sequentially ordered manner, whereas Oozie coordinator jobs are those that are triggered when some data or an external stimulus is given to them.
Apache Hive
Apache Hive is a data warehouse and ETL tool which provides an SQL-like interface between the user and the Hadoop Distributed File System (HDFS). It is built on top of Hadoop. It is a software project that provides data query and analysis. It facilitates reading, writing and handling large datasets that are stored in distributed storage and queried with Structured Query Language (SQL) syntax. It is not built for Online Transactional Processing (OLTP) workloads. It is frequently used for data warehousing tasks like data encapsulation, ad-hoc queries, and analysis of huge datasets. It is designed to enhance scalability, extensibility, performance, fault-tolerance and loose coupling with its input formats.
Hive was initially developed by Facebook and is used by companies such as Amazon and Netflix. It delivers standard SQL functionality for analytics. Traditional SQL queries are implemented through the MapReduce Java API to execute SQL applications and queries over distributed data. Hive provides portability, as most data warehousing applications work with SQL-based query languages.
Apache Hive is a data warehouse software project that is built on top
of the Hadoop ecosystem. It provides an SQL-like interface to query and
analyze large datasets stored in Hadoop’s distributed file system
(HDFS) or other compatible storage systems.
Hive uses a language called HiveQL, which is similar to SQL, to allow
users to express data queries, transformations, and analyses in a
familiar syntax. HiveQL statements are compiled into MapReduce jobs,
which are then executed on the Hadoop cluster to process the data.
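For example, a HiveQL statement can be submitted from a Java program through the Hive JDBC driver (HiveServer2). In this minimal sketch, the host, port, credentials, and the sales table are assumed placeholders; Hive compiles the statement into jobs that run on the cluster as described above.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQuerySketch {
  public static void main(String[] args) throws Exception {
    // Registers the driver; requires the hive-jdbc library on the classpath.
    Class.forName("org.apache.hive.jdbc.HiveDriver");

    // HiveServer2 JDBC URL: jdbc:hive2://<host>:<port>/<database>
    String url = "jdbc:hive2://localhost:10000/default";

    try (Connection con = DriverManager.getConnection(url, "hive", "");
         Statement stmt = con.createStatement()) {

      // A HiveQL aggregation over a hypothetical "sales" table stored in HDFS.
      ResultSet rs = stmt.executeQuery(
          "SELECT region, SUM(amount) AS total FROM sales GROUP BY region");

      while (rs.next()) {
        System.out.println(rs.getString(1) + " -> " + rs.getString(2));
      }
    }
  }
}
```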
Hive includes many features that make it a useful tool for big data
analysis, including support for partitioning, indexing, and user-defined
functions (UDFs). It also provides a number of optimization techniques
to improve query performance, such as predicate pushdown, column
pruning, and query parallelization.
Hive can be used for a variety of data processing tasks, such as data
warehousing, ETL (extract, transform, load) pipelines, and ad-hoc data
analysis. It is widely used in the big data industry, especially in
companies that have adopted the Hadoop ecosystem as their primary data
processing platform.
Components of Hive:
Modes of Hive:
In local mode, which is used when Hadoop has only a single data node, processing will be faster on smaller datasets existing in the local machine.
Characteristics of Hive:
2. Hive as a data warehouse is built to manage and query only structured data which resides in tables.
4. Programming in Hadoop deals directly with files, so Hive can partition the data with directory structures to improve performance on certain queries.
5. Hive is compatible with various file formats such as TEXTFILE, SEQUENCEFILE, ORC, RCFILE, etc.
6. Hive uses the Derby database for single-user metadata storage and MySQL for multi-user or shared metadata.
Features of Hive:
5. It stores schemas in a database and processes the data in the Hadoop Distributed File System (HDFS).
Advantages:
Scalability: Apache Hive is designed to handle large volumes of data, making it a
scalable solution for big data processing.
Familiar SQL-like interface: Hive uses a SQL-like language called HiveQL, which
makes it easy for SQL users to learn and use.
Integration with Hadoop ecosystem:
Hive integrates well with the Hadoop ecosystem, enabling users to
process data using other Hadoop tools like Pig, MapReduce, and Spark.
Supports partitioning and bucketing: Hive supports partitioning and bucketing,
which can improve query performance by limiting the amount of data scanned.
User-defined functions: Hive allows users to define their own functions, which
can be used in HiveQL queries.
Disadvantages:
Limited real-time processing: Hive is designed for batch processing, which
means it may not be the best tool for real-time data processing.
Limited flexibility:
Hive is not as flexible as other data warehousing tools because it is
designed to work specifically with Hadoop, which can limit its usability
in other environments.
The metastore stores metadata about databases, tables, attributes in a table, data types of databases, and HDFS mapping.
Step-4: Send Metadata – The compiler receives the necessary metadata from the metastore.
Step-5: Send Plan – The compiler sends the execution plan it has created back to the driver.
Step-6: Execute Plan – The driver sends the execution plan to the execution engine, which executes the job on the cluster (Execute Job) and reports back once the job is done (Job Done).
Step-7: Fetch Results – The user interface (UI) fetches the results from the driver.
Step-8: Send Results – The driver sends a fetch request to the execution engine; when the result is retrieved from the data nodes, the execution engine returns it to the driver and on to the user interface (UI).
Disadvantages of Hive Architecture:
High Latency: Hive’s performance is slower compared
to traditional databases because of the overhead of running queries in a
distributed system.
Limited Real-time Processing: Hive is not ideal for real-time data processing as it
is designed for batch processing.
Complexity: Hive is complex to set up and requires a high level of expertise in
Hadoop, SQL, and data warehousing concepts.
Lack of Full SQL Support:
HiveQL does not support all SQL operations, such as transactions and
indexes, which may limit the usefulness of the tool for certain
applications.
Debugging Difficulties: Debugging Hive
queries can be difficult as the queries are executed across a
distributed system, and errors may occur in different nodes.
Hive Architecture
The following architecture explains the flow of query submission into Hive.
Hive Client
Hive allows writing applications in various languages, including
Java, Python, and C++. It supports different types of clients such as:-
JDBC Driver - It allows Java applications to connect to Hive using the JDBC protocol.
ODBC Driver - It allows applications that support the ODBC protocol to connect to Hive.
Hive Services
The following are the services provided by Hive:-
Hive CLI - The Hive CLI (Command Line Interface) is a shell where we can
execute Hive queries and commands.
Hive Compiler - It parses the query and performs semantic analysis on the different query blocks and expressions. It converts HiveQL statements into MapReduce jobs.
A traditional RDBMS is best suited for an OLTP environment, whereas HDFS is best suited for big data.
Extremely large files: Here we are talking about data in the range of petabytes (1,000 TB).
Commodity hardware: Hardware that is inexpensive and easily available in the market. This is one of the features that especially distinguishes HDFS from other file systems.
1. NameNode (MasterNode):
It should be deployed on reliable, high-configuration hardware, not on commodity hardware.
2. DataNode (SlaveNode):
Actual worker nodes, which do the actual work such as reading, writing, and processing.
They also perform block creation, deletion, and replication upon instruction from the master.
NameNodes:
Store metadata (data about data) such as file paths, the number of blocks, block IDs, etc.
Store the metadata in RAM for fast retrieval, i.e. to reduce seek time, though a persistent copy of it is kept on disk.
DataNodes:
Store the actual data blocks and serve read and write requests from clients.
Data storage in HDFS: Now let's see how data is stored in a distributed manner.
Assume that a 100 TB file is inserted. The master node (NameNode) will first divide the file into blocks of 10 TB each (the default block size is 128 MB in Hadoop 2.x and above). These blocks are then stored across different DataNodes (slave nodes). The DataNodes replicate the blocks among themselves, and the information about which blocks they contain is sent to the master. The default replication factor is 3, which means 3 replicas are created for each block (including the original). The replication factor can be increased or decreased by editing the dfs.replication setting in hdfs-site.xml.
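Besides editing the dfs.replication property in hdfs-site.xml, a client program can inspect the configured value and override the replication factor for an individual file. The sketch below is illustrative only; the file path and the factor of 2 are placeholders.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();          // picks up hdfs-site.xml from the classpath
    System.out.println("dfs.replication = " + conf.get("dfs.replication", "3"));

    // Per-file override: ask HDFS to keep 2 replicas of this (hypothetical) file.
    FileSystem fs = FileSystem.get(conf);
    fs.setReplication(new Path("/data/example.txt"), (short) 2);
    fs.close();
  }
}
```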
Why divide a file into blocks at all? Processing many smaller blocks in parallel is easier and faster compared to processing the whole file at once. So we divide the file to get faster data access, i.e. to reduce seek time.
Note: No two replicas of the same block are present on the same datanode.
Features:
Even if multiple datanodes are down we can still do our work, thus making it
highly reliable.
Limitations: Though HDFS provides many features, there are some areas where it doesn't work well.
Low latency data access: Applications that require
low-latency access to data i.e in the range of milliseconds will not
work well with HDFS, because HDFS is designed keeping in mind that we
need high-throughput of data even at the cost of latency.
Small file problem: Having lots of small files results in lots of seeks and lots of movement from one DataNode to another to retrieve each small file; this is a very inefficient data access pattern.
Processing data with Hadoop typically involves the following steps:
1. Data Acquisition: The first step in processing data with Hadoop is acquiring the data. This could involve collecting data from various sources such as logs, databases, sensors, social media feeds, etc. The data may be structured, semi-structured, or unstructured.
2. Data Ingestion: Once the data is collected, it needs to be ingested into the Hadoop ecosystem. Tools like Apache Flume, Apache Sqoop, or Apache Kafka are commonly used for this purpose. Flume is used for streaming data ingestion, Sqoop for transferring bulk data between Hadoop and relational databases, and Kafka for building real-time data pipelines (a minimal Kafka producer sketch appears after this list).
3. Storage: After ingestion, the data is stored in the Hadoop Distributed File
System (HDFS) or other storage systems like Apache HBase, Apache
Cassandra, etc., depending on the requirements of the use case. HDFS is the
primary storage system in Hadoop, designed to store large files across
multiple machines in a distributed manner.
4. Data Processing: Once the data is stored, various processing frameworks and
tools within the Hadoop ecosystem can be used to analyze and manipulate it.
The two main processing frameworks are MapReduce and Apache Spark.
Developers write MapReduce jobs in languages like Java or Python to
process data.
6. Data Export: After analysis and visualization, the results can be exported or
stored in various formats or systems depending on the requirements. This
could involve exporting data back to relational databases, data warehouses, or
other storage systems outside of Hadoop.
7. Data Security and Governance: Throughout the data processing lifecycle, it's
crucial to ensure data security and governance. Hadoop provides mechanisms
for authentication, authorization, encryption, auditing, and data lineage to
ensure data integrity and compliance with regulations.
By following these steps and leveraging the various components within the
Hadoop ecosystem, organizations can efficiently process, analyze, and derive
insights from large volumes of data.
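As a small illustration of the ingestion step (step 2), the sketch below pushes one event into a Kafka topic with the Kafka producer API. The broker address, topic name, and record contents are placeholders, and the kafka-clients library is assumed to be on the classpath; from the topic, the data would typically be landed into HDFS by a downstream consumer or connector.

```java
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class IngestSketch {
  public static void main(String[] args) {
    Properties props = new Properties();
    props.put("bootstrap.servers", "localhost:9092");   // placeholder broker address
    props.put("key.serializer",
        "org.apache.kafka.common.serialization.StringSerializer");
    props.put("value.serializer",
        "org.apache.kafka.common.serialization.StringSerializer");

    try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
      // Each log line or event becomes a record on the "web-logs" topic (hypothetical).
      producer.send(new ProducerRecord<>("web-logs", "user-42", "GET /index.html 200"));
    }
  }
}
```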
Hadoop YARN Architecture
YARN stands for “Yet Another Resource Negotiator“.
It was introduced in Hadoop 2.0 to remove the bottleneck of the Job Tracker that was present in Hadoop 1.0. YARN was described as a "Redesigned Resource Manager" at the time of its launch, but it has now evolved to be known as a large-scale distributed operating system used for big data processing.
YARN also allows different data processing engines like graph
processing, interactive processing, stream processing as well as batch
processing to run and process data stored in HDFS (Hadoop Distributed
File System) thus making the system much more efficient. Through its
various components, it can dynamically allocate various resources and
schedule the application processing. For large volume data processing,
it is quite necessary to manage the available resources properly so that
every application can leverage them.
YARN Features: YARN gained popularity because of the following features-
The main components of YARN architecture include:
Node Manager: It takes care of an individual node in the Hadoop cluster and manages the user jobs and workflow on that node. Its primary job is to keep up with the Resource Manager. It registers with the Resource Manager and sends heartbeats with the health status of the node. It monitors resource usage, performs log management and also kills a container based on directions from the Resource Manager. It is also responsible for creating the container process and starting it at the request of the Application Master.
Application workflow in Hadoop YARN:
1. The client submits an application.
Advantages :
Flexibility: YARN offers flexibility to run various types of distributed
processing systems such as Apache Spark, Apache
Flink, Apache Storm, and others. It allows multiple processing engines
to run simultaneously on a single Hadoop cluster.
Disadvantages :
Complexity: YARN adds complexity to the Hadoop
ecosystem. It requires additional configurations and settings, which can be
difficult for users who are not familiar with YARN.
Single Point of Failure: YARN can be a single point of failure in the Hadoop
cluster. If YARN fails, it can cause the
entire cluster to go down. To avoid this, administrators need to set up a
backup YARN instance for high availability.
8. Scheduling: YARN supports pluggable schedulers. The default scheduler is the CapacityScheduler, which supports hierarchical queues to allocate resources based on configured capacities and priorities. Another popular scheduler is the FairScheduler, which aims to provide fair sharing of cluster resources among users and applications (see the configuration sketch after this list).
9. Resource Isolation and Security: YARN provides mechanisms for resource
isolation and security, ensuring that applications running on the cluster are
sandboxed and cannot disrupt each other. It enforces resource limits, access
controls, and authentication mechanisms to protect the cluster from
unauthorized access or misuse.
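As a small configuration sketch for the scheduler choice mentioned in point 8: the relevant property is yarn.resourcemanager.scheduler.class, normally set in yarn-site.xml on the Resource Manager. The snippet below sets it through the Configuration API purely to show the property and class names involved.

```java
import org.apache.hadoop.conf.Configuration;

public class SchedulerConfigSketch {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // Switch the Resource Manager from the default CapacityScheduler to the FairScheduler.
    conf.set("yarn.resourcemanager.scheduler.class",
        "org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler");
    System.out.println(conf.get("yarn.resourcemanager.scheduler.class"));
  }
}
```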
MapReduce Architecture
MapReduce and HDFS are the two major components of Hadoop which make it so powerful and efficient to use. MapReduce is a programming model used for efficient processing in parallel over large data-sets in a distributed manner. The data is first split and then combined to produce the final result. The libraries for MapReduce are written in many programming languages, with various optimizations. The purpose of MapReduce in Hadoop is to map each job and then reduce it to equivalent tasks, which lowers the overhead on the cluster network and reduces the processing power required. The MapReduce task is mainly divided into two phases: the Map phase and the Reduce phase.
MapReduce Architecture:
Components of MapReduce Architecture:
1. Client: The MapReduce
client is the one who brings the Job to the MapReduce for processing.
There can be multiple clients available that continuously send jobs for
processing to the Hadoop MapReduce Manager.
3. Hadoop MapReduce Master: It divides the particular job into subsequent job-
parts.
4. Job-Parts: The tasks or sub-jobs that are obtained after dividing the main job. The results of all the job-parts are combined to produce the final output.
5. Input Data: The data set that is fed to the MapReduce for processing.
In MapReduce, we have a
client. The client will submit the job of a particular size to the
Hadoop MapReduce Master. Now, the MapReduce master will divide this job
into further equivalent job-parts. These job-parts are then made
available for the Map and Reduce Task. This Map and Reduce task will
contain the program as per the requirement of the use-case that the
particular company is solving. The developer writes their logic to
fulfill the requirements of the use-case. The input data is then fed to the Map task, and the Map generates intermediate key-value pairs as its output. The output of Map, i.e. these key-value pairs, is then fed to the Reducer, and the final output is stored on HDFS. There can be any number of Map and Reduce tasks made available for processing the data, as per the requirement. The algorithms for Map and Reduce are written in a highly optimized way so that time and space complexity are kept to a minimum.
Let’s discuss the MapReduce phases to get a better understanding of its
architecture:
The MapReduce task is mainly divided into 2 phases i.e. Map phase and Reduce
phase.
2. Reduce: The intermediate key-value pairs that serve as input to the Reducer are shuffled, sorted, and sent to the Reduce() function. The Reducer aggregates or groups the data based on its key-value pairs, as per the reducer algorithm written by the developer.
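Tying the two phases together, a driver program configures the job and submits it to the cluster. The sketch below reuses the WordCountMapper and WordCountReducer classes from the earlier word-count sketch; the input and output paths are placeholders.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCountDriver.class);

    job.setMapperClass(WordCountMapper.class);     // Map phase
    job.setReducerClass(WordCountReducer.class);   // Reduce phase

    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    FileInputFormat.addInputPath(job, new Path("/input"));    // placeholder paths
    FileOutputFormat.setOutputPath(job, new Path("/output"));

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```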
How Job tracker and the task tracker deal with MapReduce: