Hadoop ISE 2

Hadoop – Introduction

Introduction
• The changed definition of a powerful person: a powerful person is one who has access to data. This is because data is increasing at a tremendous rate; if we think of all the data in the world as 100%, then roughly 90% of it has been produced in just the last two to four years.
• This is because today, when a child is born, she faces the flash of a camera even before she sees her mother.
• All these pictures and videos are nothing but data, and so are emails, the data from various smartphone applications, statistical records, and so on.
• All this data has enormous power to affect various incidents and trends. It is used not only by companies to influence their consumers but also by politicians to influence elections. This huge data is referred to as Big Data.
• It needs to be maintained, analyzed, and tackled, and this is where the world needs Hadoop.
• Hadoop is an open-source framework, a set of tools distributed under the Apache License. It is used to store, manage, and process data for various big data applications running on clustered systems.
• We know that Big Data was originally defined by the “3 Vs”, but there are now “5 Vs” of Big Data, which are also termed the characteristics of Big Data.
3Vs to 5Vs characteristics
1. Volume: With increasing dependence on technology, data is being produced in large volumes. Common examples are the data produced by various social networking sites, sensors, scanners, airlines, and other organizations.
2. Velocity: A huge amount of data is generated every second. It was estimated that by the end of 2020 every individual would produce about 3 MB of data per second. This large volume of data is generated at great velocity.
3. Variety: The data being produced by different means is of three types:
1. Structured Data: relational data that is stored in the form of rows and columns.
2. Unstructured Data: texts, pictures, videos, etc. are examples of unstructured data, which cannot be stored in the form of rows and columns.
3. Semi-structured Data: log files are an example of this type of data.
4. Veracity: The term veracity is used for inconsistent or incomplete data, which results in doubtful or uncertain information. Data inconsistency often arises because of the volume of data: data in bulk can create confusion, whereas too little data may convey only half or incomplete information.
5. Value: After taking the four Vs above into account, there comes one more V, which stands for Value. A bulk of data that has no value is of no good to a company unless it is turned into something useful. Data in itself is of no use; it needs to be converted into something valuable in order to extract information. Hence, Value can be considered the most important of all the 5 Vs.
Evolution of Hadoop
• Hadoop was designed by Doug Cutting and Michael Cafarella in 2005. The
design of Hadoop is inspired by Google.
• Hadoop stores huge amounts of data through a system called the Hadoop Distributed File System (HDFS) and processes this data with the Map Reduce technology. The designs of HDFS and Map Reduce were inspired by the Google File System (GFS) and Google's MapReduce.
• Around the year 2000, Google suddenly became the most popular and profitable search engine. The success of Google was attributed to its unique Google File System and MapReduce.
• The two enthusiasts, Doug Cutting and Michael Cafarella, studied the Google File System and MapReduce and, in the year 2005, designed what is now called Hadoop.
• Doug's son had a toy elephant named Hadoop, so Doug and Michael gave their new creation the name “Hadoop” and hence the toy-elephant symbol. This is how Hadoop evolved. Thus the designs of HDFS and Map Reduce, though created by Doug Cutting and Michael Cafarella, were originally inspired by Google.
Traditional Approach
• In the traditional approach, we used to store data on local machines. This data was then
processed.
• Now as data started increasing, the local machines or computers were not capable enough to store
this huge data set.
• So, data then started to be stored on remote servers. Now suppose we need to process that data. In the traditional approach, this data has to be fetched from the servers and then processed.
• Suppose this data is 500 GB in size. Practically, it is very complex and expensive to fetch such data. This approach is also called the Enterprise Approach.
• In the new Hadoop approach, instead of fetching the data to local machines, we send the query to the data. Obviously, the query to process the data will not be as huge as the data itself.
• Moreover, at the server, the query is divided into several parts, and all these parts process the data simultaneously. This is called parallel execution and is possible because of Map Reduce.
• So now, not only is there no need to fetch the data, but the processing also takes less time. The result of the query is then sent back to the user.
• Thus Hadoop makes data storage, processing, and analysis far easier than the traditional approach.
Components of Hadoop
Hadoop has three components:
1. HDFS: The Hadoop Distributed File System is a dedicated file system that stores big data on a cluster of commodity (cheaper) hardware with a streaming access pattern. It enables data to be stored on multiple nodes in the cluster, which ensures data security and fault tolerance.
2. Map Reduce: Data stored in HDFS also needs to be processed. Suppose a query is sent to process a data set in HDFS. Hadoop first identifies where this data is stored; this is called mapping. The query is then broken into multiple parts, the results of all these parts are combined, and the overall result is sent back to the user; this is called the reduce process. Thus, while HDFS is used to store the data, Map Reduce is used to process it.
3. YARN: YARN stands for Yet Another Resource Negotiator. It is a dedicated operating system for Hadoop that manages the resources of the cluster and also functions as a framework for job scheduling in Hadoop.
1. The scheduling types include First Come First Serve, the Fair Scheduler, and the Capacity Scheduler. First Come First Serve scheduling is set by default in YARN.
Hadoop -> a solution for Big Data
1. Hadoop Distributed File System: On a local PC, the default block size on the hard disk is 4 KB. When we install Hadoop, HDFS by default changes the block size to 64 MB, since it is used to store huge amounts of data. We can also change the block size, for example to 128 MB.
HDFS works with a Data Node and a Name Node.
The Name Node is a master service; it keeps the metadata about which commodity hardware each piece of data resides on.
The Data Node stores the actual data. Since the block size is 64 MB, the amount of storage required for metadata is reduced, making HDFS better.
Also, Hadoop stores three copies of every block of data at three different locations. This ensures that Hadoop is not prone to a single point of failure. (A short API sketch illustrating these points follows this list.)
2. Map Reduce: In the simplest terms, MapReduce breaks a query into multiple parts, and each part then processes its portion of the data concurrently. This parallel execution helps execute a query faster and makes Hadoop a suitable and optimal choice for dealing with Big Data.
3. YARN: As we know, Yet Another Resource Negotiator works like an operating system for Hadoop, and since operating systems are resource managers, YARN manages the resources of Hadoop so that Hadoop can serve big data in a better way.
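To make the HDFS points above concrete, the following is a minimal sketch using Hadoop's standard Java FileSystem API to print the block size, replication factor, and block locations that the Name Node records for a file. It assumes a reachable HDFS cluster whose configuration is on the classpath; the path /user/demo/sample.txt and the class name BlockInfo are hypothetical examples, not taken from these slides.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockInfo {
    public static void main(String[] args) throws Exception {
        // Picks up fs.defaultFS, dfs.blocksize, dfs.replication, etc.
        // from the cluster configuration files on the classpath.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Hypothetical file path, used purely for illustration.
        Path file = new Path("/user/demo/sample.txt");
        FileStatus status = fs.getFileStatus(file);

        // Block size and replication factor recorded for this file.
        System.out.println("Block size (bytes): " + status.getBlockSize());
        System.out.println("Replication factor: " + status.getReplication());

        // The Name Node's metadata tells us which Data Nodes hold each block.
        for (BlockLocation block : fs.getFileBlockLocations(status, 0, status.getLen())) {
            System.out.println("Block at offset " + block.getOffset()
                    + " stored on: " + String.join(", ", block.getHosts()));
        }
    }
}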
Introduction to Hadoop Distributed File System (HDFS)
• With growing data velocity, the data size easily outgrows the storage limit of a single machine. A solution is to store the data across a network of machines. Such filesystems are called distributed filesystems. Since data is stored across a network, all the complications of a network come into play.
This is where Hadoop comes in. It provides one of the most reliable filesystems. HDFS (Hadoop Distributed File System) is a uniquely designed filesystem that provides storage for extremely large files with a streaming data access pattern, and it runs on commodity hardware.
• Extremely large files: Here we are talking about data in the range of petabytes (1 PB = 1000 TB).
• Streaming data access pattern: HDFS is designed on the principle of write-once, read-many-times. Once data is written, large portions of the dataset can be processed any number of times (see the sketch after this list).
• Commodity hardware: Hardware that is inexpensive and easily available in the market. This is one of the features that especially distinguishes HDFS from other file systems.
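As a small illustration of the write-once, read-many-times principle, here is a hedged sketch using the same FileSystem API. The path /user/demo/notes.txt and the class name WriteOnceReadMany are illustrative assumptions; the point is simply that the file is streamed in once and can then be read back repeatedly.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WriteOnceReadMany {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/user/demo/notes.txt"); // hypothetical path

        // Write once: the content is streamed into HDFS and the file is closed.
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.write("Hello I am GeeksforGeeks\n".getBytes(StandardCharsets.UTF_8));
        }

        // Read many times: large portions of the data can be streamed back repeatedly.
        try (FSDataInputStream in = fs.open(file);
             BufferedReader reader =
                     new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))) {
            reader.lines().forEach(System.out::println);
        }
    }
}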
Nodes:
• An HDFS cluster typically follows a master-slave architecture.
1. NameNode (master node):
1. Manages all the slave nodes and assigns work to them.
2. It executes filesystem namespace operations like opening, closing, and renaming files and directories.
3. It should be deployed on reliable, high-configuration hardware, not on commodity hardware.
2. DataNode (slave node):
1. These are the actual worker nodes, which do the actual work such as reading, writing, and processing data.
2. They also perform creation, deletion, and replication of blocks upon instruction from the master.
3. They can be deployed on commodity hardware.
HDFS daemons:
Daemons are processes that run in the background.
• NameNode:
• Runs on the master node.
• Stores metadata (data about data) such as file paths, the number of blocks, block IDs, etc.
• Requires a large amount of RAM.
• Stores the metadata in RAM for fast retrieval, i.e. to reduce seek time, though a persistent copy of it is kept on disk.
• DataNodes:
• Run on the slave nodes.
• Require a large amount of storage, as the actual data is stored here.
Data storage in HDFS

Note: The master node has a record of everything; it knows the location and information of each and every data node and the blocks they contain, i.e. nothing is done without the permission of the master node.
Example: sample.txt
• Hello I am GeeksforGeeks
• How can I help you
• How can I assist you
• Are you an engineer
• Are you looking for coding
• Are you looking for interview questions
• what are you doing these days
• what are your strengths
HDFS broke this file into four parts and named each part as first.txt, second.txt, third.txt, and fourth.txt. Each part will contain 2 lines. All these files will be stored in Data Nodes, and the Name Node will contain the metadata about them. All this is the task of HDFS.
Why divide the file into blocks?

• Answer: Let's assume that we don't divide the file; it is then very difficult to store a 100 TB file on a single machine. Even if we do store it, each read and write operation on that whole file will have a very high seek time. But if we have multiple blocks of size 128 MB, it becomes much easier to perform various read and write operations on them compared to doing it on the whole file at once. So we divide the file to get faster data access, i.e. to reduce seek time (see the small example below).
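As a rough back-of-the-envelope illustration (using the 100 TB file size and the 128 MB block size from the answer above; the class name BlockCount is purely illustrative), such a file would be split into roughly 800,000 blocks:

public class BlockCount {
    public static void main(String[] args) {
        long fileSizeTb = 100;                      // the 100 TB file from the example
        long fileSizeMb = fileSizeTb * 1024 * 1024; // 100 TB expressed in MB
        long blockSizeMb = 128;                     // block size mentioned in the answer above
        long blocks = (fileSizeMb + blockSizeMb - 1) / blockSizeMb; // round up
        System.out.println(blocks + " blocks");     // prints "819200 blocks"
    }
}

Spreading those blocks across many data nodes is what allows reads and writes to proceed in parallel instead of seeking through one enormous file.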
Why replicate the blocks in data nodes
while storing?
• Answer: Let's assume we don't replicate, and only one copy of a particular block is present on datanode D1. Now if datanode D1 crashes, we will lose that block, which will make the overall data inconsistent and faulty. So we replicate the blocks to achieve fault tolerance.
Terms related to HDFS:
• HeartBeat: This is the signal that a datanode continuously sends to the namenode. If the namenode does not receive a heartbeat from a datanode, it will consider that datanode dead.
• Balancing: If a datanode crashes, the blocks present on it are gone too, and those blocks become under-replicated compared to the remaining blocks. The master node (namenode) then signals the datanodes containing replicas of the lost blocks to replicate them, so that the overall distribution of blocks is balanced.
• Replication: It is performed by the datanodes, upon instruction from the namenode.
Features:
• Distributed data storage.
• Blocks reduce seek time.
• The data is highly available as the same block is present at
multiple datanodes.
• Even if multiple datanodes are down we can still do our work,
thus making it highly reliable.
• High fault tolerance.
Limitations:
• Though HDFS provides many features, there are some areas where it does not work well.
• Low-latency data access: Applications that require low-latency access to data, i.e. in the range of milliseconds, will not work well with HDFS, because HDFS is designed with high throughput of data in mind, even at the cost of latency.
• Small-file problem: Having lots of small files results in lots of seeks and lots of movement from one datanode to another to retrieve each small file; this whole process is a very inefficient data access pattern.
Map Reduce in Hadoop
• One of the three components of Hadoop is Map Reduce.
• The first component of Hadoop, the Hadoop Distributed File System (HDFS), is responsible for storing the file.
• The second component, Map Reduce, is responsible for processing the file.
Processing of the file
• Suppose a user wants to process this file (the sample.txt stored in HDFS above).
• This is where Map Reduce comes into the picture.
• Suppose this user wants to run a query on sample.txt.
• So, instead of bringing sample.txt to the local computer, we send the query to the data.
• To keep track of our request, we use the Job Tracker (a master service). The Job Tracker traps our request and keeps track of it.
Map
$ hadoop jar query.jar DriverCode sample.txt result.output
1. query.jar : query file that needs to be processed on the input file.
2. sample.txt: input file.
3. result.output: directory in which output of the processing will be received.
• The Job Tracker traps this request and asks Name Node to run this request on sample.txt.
• Name Node then provides the metadata to the Job Tracker. Job Tracker now knows that
sample.txt is stored in four different .txt files.
• As all four of these files have three copies stored in HDFS, the Job Tracker communicates with the Task Tracker (a slave service) for each of these files, but it communicates with only one copy of each file, the one residing nearest to it.
Note: Applying the desired code to each of the four .txt files locally is a process, and this process is called Map.
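The slides never show what is inside query.jar, but as a hedged sketch of the map phase, a word-count style mapper over sample.txt could look like the following. The class name WordCountMapper is an illustrative assumption, and the org.apache.hadoop.mapreduce API shown is the newer YARN-era API rather than the Job Tracker / Task Tracker generation described in these slides.

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// One mapper runs per input split, so each mapper processes its own part of sample.txt.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        // Emit (word, 1) for every word in the current line of this split.
        for (String token : line.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);
            }
        }
    }
}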
Responsibility of Job Tracker
• In Hadoop terminology, the main file sample.txt is called input file
and its four subfiles are called input splits.
• So, in Hadoop, the number of mappers for an input file is equal to the number of input splits of that input file.
• In the above example, the input file sample.txt has four input splits
hence four mappers will be running to process it.
• The Job Tracker is responsible for handling these mappers.
Working of Job Tracker
• The task trackers are slave services to the Job Tracker.
• So, if any of the local machines breaks down, the processing of that part of the file would stop, which would halt the complete process.
• Each task tracker sends a heartbeat and its number of slots to the Job Tracker every 3 seconds. This is called the status of the Task Tracker. If a task tracker goes down, the Job Tracker waits for 10 heartbeat intervals, that is, 30 seconds, and if it still does not receive any status, it assumes that the task tracker is either dead or extremely busy.
• It then communicates with the task tracker holding another copy of the same file and directs it to process the desired code over it. Similarly, the slot information is used by the Job Tracker to keep track of how many tasks are currently being served by a task tracker and how many more can be assigned to it. In this way, the Job Tracker keeps track of our request.
Reduce
• Suppose that the system has generated output for each individual .txt file.
• But this is not the user’s desired output.
• To produce the desired output, all these individual outputs have to be
merged or reduced to a single output.
• This reduction of multiple outputs to a single one is also a process which is
done by REDUCER.
• In Hadoop, the number of output files generated equals the number of reducers. By default, there is one reducer per job.
• Map and Reduce are two different processes of the second component of
Hadoop, that is, Map Reduce.
• These are also called phases of Map Reduce. Thus we can say that Map
Reduce has two phases. Phase 1 is Map and Phase 2 is Reduce.
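Continuing the same hedged word-count sketch, the reduce phase merges the per-split outputs into a single result, and a driver class wires the two phases together. The names WordCountDriver and WordCountReducer are illustrative assumptions, similar in spirit to, but not taken from, the DriverCode referenced in the earlier hadoop jar command; WordCountMapper is the mapper from the previous sketch.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {

    // The reducer merges the individual mapper outputs by summing the counts per word.
    public static class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable count : counts) {
                sum += count.get();
            }
            context.write(word, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        // args[0] = input file (e.g. sample.txt), args[1] = output directory (e.g. result.output)
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);   // the mapper from the earlier sketch
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        job.setNumReduceTasks(1);                    // one reducer, hence one output file
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Packaged into a jar, this would be launched in the same style as the command on the Map slide, for example: hadoop jar wordcount.jar WordCountDriver sample.txt result.output (all names here are illustrative).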
