Hadoop ISE 2

Hadoop – Introduction

Introduction
• The changed definition of a powerful person: a powerful person is one who has access to data. This is because data is increasing at a tremendous rate; if we think of all the data in the world as 100%, then roughly 90% of it has been produced in just the last two to four years.
• This is because today, when a child is born, she faces the flash of a camera even before she sees her mother.
• All these pictures and videos are nothing but data, and so are emails, the data from various smartphone applications, statistical records, and so on.
• All this data has enormous power to affect various incidents and trends. It is used not only by companies to influence their consumers but also by politicians to influence elections. This huge data is referred to as Big Data.
• It needs to be maintained, analyzed, and tackled, and this is where the world needs Hadoop.
• Hadoop is an open-source framework, a set of tools distributed under the Apache License. It is used to store, manage, and process data for various big data applications running on clustered systems.
• We know that Big Data was originally defined by the “3 Vs”, but there are now “5 Vs” of Big Data, which are also termed the characteristics of Big Data.
3Vs to 5Vs characteristics
1. Volume: With increasing dependence on technology, data is being produced in large volumes. Common examples are the data produced by various social networking sites, sensors, scanners, airlines, and other organizations.
2. Velocity: A huge amount of data is generated every second. It was estimated that by the end of 2020 every individual would produce about 3 MB of data per second. This large volume of data is generated at great velocity.
3. Variety: The data being produced by different means is of three types:
1. Structured Data: relational data that is stored in the form of rows and columns.
2. Unstructured Data: texts, pictures, videos, etc. are examples of unstructured data, which cannot be stored in the form of rows and columns.
3. Semi-structured Data: log files are an example of this type of data.
4. Veracity: The term veracity is used for inconsistent or incomplete data, which results in doubtful or uncertain information. Data inconsistency often arises because of the volume of data: data in bulk can create confusion, whereas too little data may convey only half or incomplete information.
5. Value: After taking the four Vs above into account, there comes one more V, which stands for Value. A bulk of data that has no value is of no good to a company unless it is turned into something useful. Data in itself is of no use; it needs to be converted into something valuable in order to extract information. Hence, Value can be considered the most important of all the 5 Vs.
Evolution of Hadoop
• Hadoop was designed by Doug Cutting and Michael Cafarella in 2005. The
design of Hadoop is inspired by Google.
• Hadoop stores huge amounts of data through a system called the Hadoop Distributed File System (HDFS) and processes this data with the Map Reduce technology. The designs of HDFS and Map Reduce were inspired by the Google File System (GFS) and Google's MapReduce.
• Around the year 2000, Google suddenly became the most popular and profitable search engine. The success of Google was attributed to its unique Google File System and MapReduce.
• The two enthusiasts, Doug Cutting and Michael Cafarella, studied the Google File System and MapReduce and, in the year 2005, designed what is now called Hadoop.
• Doug's son had a toy elephant named Hadoop, so Doug and Michael gave their new creation the name “Hadoop” and hence the toy-elephant symbol. This is how Hadoop evolved. Thus the designs of HDFS and Map Reduce, though created by Doug Cutting and Michael Cafarella, were originally inspired by Google.
Traditional Approach
• In the traditional approach, we used to store data on local machines. This data was then
processed.
• Now as data started increasing, the local machines or computers were not capable enough to store
this huge data set.
• So, data then started to be stored on remote servers. Now suppose we need to process that data. In the traditional approach, this data has to be fetched from the servers and then processed.
• Suppose this data is 500 GB in size. Practically, it is very complex and expensive to fetch such data. This approach is also called the Enterprise Approach.
• In the new Hadoop approach, instead of fetching the data to local machines, we send the query to the data. Obviously, the query to process the data will not be as huge as the data itself.
• Moreover, at the server, the query is divided into several parts, and all these parts process the data simultaneously. This is called parallel execution and is possible because of Map Reduce.
• So now, not only is there no need to fetch the data, but the processing also takes less time. The result of the query is then sent back to the user.
• Thus Hadoop makes data storage, processing, and analysis far easier than the traditional approach.
Components of Hadoop
Hadoop has three components:
1. HDFS: The Hadoop Distributed File System is a dedicated file system that stores big data on a cluster of commodity (cheaper) hardware with a streaming access pattern. It enables data to be stored on multiple nodes in the cluster, which ensures data security and fault tolerance.
2. Map Reduce: Data stored in HDFS also needs to be processed. Suppose a query is sent to process a data set in HDFS. Hadoop first identifies where this data is stored; this is called mapping. The query is then broken into multiple parts, the results of all these parts are combined, and the overall result is sent back to the user; this is called the reduce process. Thus, while HDFS is used to store the data, Map Reduce is used to process it.
3. YARN: YARN stands for Yet Another Resource Negotiator. It is a dedicated operating system for Hadoop that manages the resources of the cluster and also functions as a framework for job scheduling in Hadoop.
1. The scheduling types include First Come First Serve, the Fair Scheduler, and the Capacity Scheduler. First Come First Serve scheduling is set by default in YARN.
Hadoop -> a solution for Big Data
1. Hadoop Distributed File System: On a local PC, the default block size on the hard disk is 4 KB. When we install Hadoop, HDFS by default changes the block size to 64 MB, since it is used to store huge amounts of data. We can also change the block size, for example to 128 MB.
HDFS works with a Data Node and a Name Node.
The Name Node is a master service; it keeps the metadata about which commodity hardware each piece of data resides on.
The Data Node stores the actual data. Since the block size is 64 MB, the amount of storage required for metadata is reduced, making HDFS better.
Also, Hadoop stores three copies of every block of data at three different locations. This ensures that Hadoop is not prone to a single point of failure. (A short API sketch illustrating these points follows this list.)
2. Map Reduce: In the simplest terms, MapReduce breaks a query into multiple parts, and each part then processes its portion of the data concurrently. This parallel execution helps execute a query faster and makes Hadoop a suitable and optimal choice for dealing with Big Data.
3. YARN: As we know, Yet Another Resource Negotiator works like an operating system for Hadoop, and since operating systems are resource managers, YARN manages the resources of Hadoop so that Hadoop can serve big data in a better way.
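To make the HDFS points above concrete, the following is a minimal sketch using Hadoop's standard Java FileSystem API to print the block size, replication factor, and block locations that the Name Node records for a file. It assumes a reachable HDFS cluster whose configuration is on the classpath; the path /user/demo/sample.txt and the class name BlockInfo are hypothetical examples, not taken from these slides.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockInfo {
    public static void main(String[] args) throws Exception {
        // Picks up fs.defaultFS, dfs.blocksize, dfs.replication, etc.
        // from the cluster configuration files on the classpath.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Hypothetical file path, used purely for illustration.
        Path file = new Path("/user/demo/sample.txt");
        FileStatus status = fs.getFileStatus(file);

        // Block size and replication factor recorded for this file.
        System.out.println("Block size (bytes): " + status.getBlockSize());
        System.out.println("Replication factor: " + status.getReplication());

        // The Name Node's metadata tells us which Data Nodes hold each block.
        for (BlockLocation block : fs.getFileBlockLocations(status, 0, status.getLen())) {
            System.out.println("Block at offset " + block.getOffset()
                    + " stored on: " + String.join(", ", block.getHosts()));
        }
    }
}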
Introduction to Hadoop Distributed File System (HDFS)
• With growing data velocity, the data size easily outgrows the storage limit of a single machine. A solution is to store the data across a network of machines. Such filesystems are called distributed filesystems. Since data is stored across a network, all the complications of a network come into play.
This is where Hadoop comes in. It provides one of the most reliable filesystems. HDFS (Hadoop Distributed File System) is a uniquely designed filesystem that provides storage for extremely large files with a streaming data access pattern, and it runs on commodity hardware.
• Extremely large files: Here we are talking about data in the range of petabytes (1 PB = 1000 TB).
• Streaming data access pattern: HDFS is designed on the principle of write-once, read-many-times. Once data is written, large portions of the dataset can be processed any number of times (see the sketch after this list).
• Commodity hardware: Hardware that is inexpensive and easily available in the market. This is one of the features that especially distinguishes HDFS from other file systems.
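As a small illustration of the write-once, read-many-times principle, here is a hedged sketch using the same FileSystem API. The path /user/demo/notes.txt and the class name WriteOnceReadMany are illustrative assumptions; the point is simply that the file is streamed in once and can then be read back repeatedly.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WriteOnceReadMany {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/user/demo/notes.txt"); // hypothetical path

        // Write once: the content is streamed into HDFS and the file is closed.
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.write("Hello I am GeeksforGeeks\n".getBytes(StandardCharsets.UTF_8));
        }

        // Read many times: large portions of the data can be streamed back repeatedly.
        try (FSDataInputStream in = fs.open(file);
             BufferedReader reader =
                     new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))) {
            reader.lines().forEach(System.out::println);
        }
    }
}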
Nodes:
• An HDFS cluster typically follows a master-slave architecture.
1. NameNode (master node):
1. Manages all the slave nodes and assigns work to them.
2. It executes filesystem namespace operations like opening, closing, and renaming files and directories.
3. It should be deployed on reliable, high-configuration hardware, not on commodity hardware.
2. DataNode (slave node):
1. These are the actual worker nodes, which do the actual work such as reading, writing, and processing data.
2. They also perform creation, deletion, and replication of blocks upon instruction from the master.
3. They can be deployed on commodity hardware.
HDFS daemons:
Daemons are processes that run in the background.
• NameNode:
• Runs on the master node.
• Stores metadata (data about data) such as file paths, the number of blocks, block IDs, etc.
• Requires a large amount of RAM.
• Stores the metadata in RAM for fast retrieval, i.e. to reduce seek time, though a persistent copy of it is kept on disk.
• DataNodes:
• Run on the slave nodes.
• Require a large amount of storage, as the actual data is stored here.
Data storage in HDFS

Note: The master node has a record of everything; it knows the location and information of each and every data node and the blocks they contain, i.e. nothing is done without the permission of the master node.
Example: sample.txt
• Hello I am GeeksforGeeks
• How can I help you
• How can I assist you
• Are you an engineer
• Are you looking for coding
• Are you looking for interview questions
• what are you doing these days
• what are your strengths
HDFS broke this file into four parts and named each part as first.txt, second.txt, third.txt, and fourth.txt. Each part will contain 2 lines. All these files will be stored in Data Nodes, and the Name Node will contain the metadata about them. All this is the task of HDFS.
Why divide the file into blocks?

• Answer: Let's assume that we don't divide the file; it is then very difficult to store a 100 TB file on a single machine. Even if we do store it, each read and write operation on that whole file will have a very high seek time. But if we have multiple blocks of size 128 MB, it becomes much easier to perform various read and write operations on them compared to doing it on the whole file at once. So we divide the file to get faster data access, i.e. to reduce seek time (see the small example below).
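As a rough back-of-the-envelope illustration (using the 100 TB file size and the 128 MB block size from the answer above; the class name BlockCount is purely illustrative), such a file would be split into roughly 800,000 blocks:

public class BlockCount {
    public static void main(String[] args) {
        long fileSizeTb = 100;                      // the 100 TB file from the example
        long fileSizeMb = fileSizeTb * 1024 * 1024; // 100 TB expressed in MB
        long blockSizeMb = 128;                     // block size mentioned in the answer above
        long blocks = (fileSizeMb + blockSizeMb - 1) / blockSizeMb; // round up
        System.out.println(blocks + " blocks");     // prints "819200 blocks"
    }
}

Spreading those blocks across many data nodes is what allows reads and writes to proceed in parallel instead of seeking through one enormous file.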
Why replicate the blocks in data nodes
while storing?
• Answer: Let's assume we don't replicate, and only one copy of a particular block is present on datanode D1. Now if datanode D1 crashes, we will lose that block, which will make the overall data inconsistent and faulty. So we replicate the blocks to achieve fault tolerance.
Terms related to HDFS:
• HeartBeat: This is the signal that a datanode continuously sends to the namenode. If the namenode does not receive a heartbeat from a datanode, it will consider that datanode dead.
• Balancing: If a datanode crashes, the blocks present on it are gone too, and those blocks become under-replicated compared to the remaining blocks. The master node (namenode) then signals the datanodes containing replicas of the lost blocks to replicate them, so that the overall distribution of blocks is balanced.
• Replication: It is performed by the datanodes, upon instruction from the namenode.
Features:
• Distributed data storage.
• Blocks reduce seek time.
• The data is highly available as the same block is present at
multiple datanodes.
• Even if multiple datanodes are down we can still do our work,
thus making it highly reliable.
• High fault tolerance.
Limitations:
• Though HDFS provides many features, there are some areas where it does not work well.
• Low-latency data access: Applications that require low-latency access to data, i.e. in the range of milliseconds, will not work well with HDFS, because HDFS is designed with high throughput of data in mind, even at the cost of latency.
• Small-file problem: Having lots of small files results in lots of seeks and lots of movement from one datanode to another to retrieve each small file; this whole process is a very inefficient data access pattern.
Map Reduce in Hadoop
• One of the three components of Hadoop is Map Reduce.
• The first component of Hadoop, the Hadoop Distributed File System (HDFS), is responsible for storing the file.
• The second component, Map Reduce, is responsible for processing the file.
Processing of the file
• Suppose a user wants to process this file (the sample.txt stored in HDFS above).
• This is where Map Reduce comes into the picture.
• Suppose this user wants to run a query on sample.txt.
• So, instead of bringing sample.txt to the local computer, we send the query to the data.
• To keep track of our request, we use the Job Tracker (a master service). The Job Tracker traps our request and keeps track of it.
Map
$ hadoop jar query.jar DriverCode sample.txt result.output
1. query.jar : query file that needs to be processed on the input file.
2. sample.txt: input file.
3. result.output: directory in which output of the processing will be received.
• The Job Tracker traps this request and asks Name Node to run this request on sample.txt.
• Name Node then provides the metadata to the Job Tracker. Job Tracker now knows that
sample.txt is stored in four different .txt files.
• As all four of these files have three copies stored in HDFS, the Job Tracker communicates with the Task Tracker (a slave service) for each of these files, but it communicates with only one copy of each file, the one residing nearest to it.
Note: Applying the desired code to each of the four .txt files locally is a process, and this process is called Map.
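The slides never show what is inside query.jar, but as a hedged sketch of the map phase, a word-count style mapper over sample.txt could look like the following. The class name WordCountMapper is an illustrative assumption, and the org.apache.hadoop.mapreduce API shown is the newer YARN-era API rather than the Job Tracker / Task Tracker generation described in these slides.

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// One mapper runs per input split, so each mapper processes its own part of sample.txt.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        // Emit (word, 1) for every word in the current line of this split.
        for (String token : line.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);
            }
        }
    }
}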
Responsibility of Job Tracker
• In Hadoop terminology, the main file sample.txt is called input file
and its four subfiles are called input splits.
• So, in Hadoop, the number of mappers for an input file is equal to the number of input splits of that input file.
• In the above example, the input file sample.txt has four input splits
hence four mappers will be running to process it.
• The Job Tracker is responsible for handling these mappers.
Working of Job Tracker
• The task trackers are slave services to the Job Tracker.
• So, if any of the local machines breaks down, the processing of that part of the file would stop, which would halt the complete process.
• Each task tracker sends a heartbeat and its number of slots to the Job Tracker every 3 seconds. This is called the status of the Task Tracker. If a task tracker goes down, the Job Tracker waits for 10 heartbeat intervals, that is, 30 seconds, and if it still does not receive any status, it assumes that the task tracker is either dead or extremely busy.
• It then communicates with the task tracker holding another copy of the same file and directs it to process the desired code over it. Similarly, the slot information is used by the Job Tracker to keep track of how many tasks are currently being served by a task tracker and how many more can be assigned to it. In this way, the Job Tracker keeps track of our request.
Reduce
• Suppose that the system has generated output for each individual .txt file.
• But this is not the user’s desired output.
• To produce the desired output, all these individual outputs have to be
merged or reduced to a single output.
• This reduction of multiple outputs to a single one is also a process which is
done by REDUCER.
• In Hadoop, the number of output files generated equals the number of reducers. By default, there is one reducer per job.
• Map and Reduce are two different processes of the second component of
Hadoop, that is, Map Reduce.
• These are also called phases of Map Reduce. Thus we can say that Map
Reduce has two phases. Phase 1 is Map and Phase 2 is Reduce.
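Continuing the same hedged word-count sketch, the reduce phase merges the per-split outputs into a single result, and a driver class wires the two phases together. The names WordCountDriver and WordCountReducer are illustrative assumptions, similar in spirit to, but not taken from, the DriverCode referenced in the earlier hadoop jar command; WordCountMapper is the mapper from the previous sketch.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {

    // The reducer merges the individual mapper outputs by summing the counts per word.
    public static class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable count : counts) {
                sum += count.get();
            }
            context.write(word, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        // args[0] = input file (e.g. sample.txt), args[1] = output directory (e.g. result.output)
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);   // the mapper from the earlier sketch
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        job.setNumReduceTasks(1);                    // one reducer, hence one output file
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Packaged into a jar, this would be launched in the same style as the command on the Map slide, for example: hadoop jar wordcount.jar WordCountDriver sample.txt result.output (all names here are illustrative).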
