Hadoop 2 & 3 Units Final

Big Data refers to large and complex data sets that require advanced tools for processing and management, characterized by volume, velocity, variety, veracity, and value. It plays a crucial role in various sectors such as e-commerce, healthcare, and finance, enabling informed decision-making and operational efficiency. However, challenges like data quality, security, and the need for skilled personnel must be addressed to harness its full potential, with technologies like the Google File System and Hadoop providing solutions for data storage and processing.

UNIT 2

Working with Big Data

Introduction to Big Data:

Today we are living in an information society and are moving towards a knowledge-based
society. In order to extract better knowledge, we need larger amounts of data. Every organization
needs to collect large data sets to support its decisions and to extract correlations through data
analysis as a basis for those decisions.

What is big data?

 Big Data is a collection of large and complex data sets that are difficult to process
using present database management tools or traditional data processing
applications.
 Big Data consists of large volumes of heterogeneous data that is often generated
at high speed.
 Big Data therefore requires a new set of tools, applications and frameworks to process
and manage these large volumes of data.
 Big data is often too large to process, and even too large to transport, using conventional means.
 Big Data is difficult to store and process, and a major issue is the large amount of time
taken to clean the data in the first place.
Characteristics:

Big Data is data which satisfies the following five characteristics:

 Volume
 Velocity
 Variety
 Veracity
 Value

Volume: The main characteristic of Big Data is that it must be big, which is measured as
volume. Due to advancements in technology and the rise of social media, the
exchange of data has become very easy and the amount of data is growing very rapidly. The
large volumes of data are spread across different places, in different formats. The volumes
may range from gigabytes to terabytes, petabytes or even more. Nowadays data is generated
not only by humans but also by machines. All of this together constitutes the Volume of data.
Example: Facebook stores/generates large volumes of data every day.

Velocity: Velocity refers to the speed at which data is being generated. Decision makers
want the necessary data in as little time as possible. In different fields and
different areas of technology, we see data getting generated at different speeds. A few
examples include trading/stock exchange data, tweets on Twitter, status updates/likes/shares
on Facebook, and many others. This speed aspect of data generation is referred to as Velocity
in the Big Data world.
Example: On a stock exchange, large volumes of data are generated every second and
accessed by users.
Figure 1: Five V’s of Big Data

Variety: Variety refers to the different formats in which the data is being generated.
Different applications generate/store the data in different formats. In today's world, there are
large volumes of unstructured data being generated apart from the structured data getting
generated in enterprises. In today's world, organizations not only need to rely on the
structured data from enterprise databases/warehouses, they are also forced to consume lots of
data that is being generated both inside and outside of the enterprise like clickstream data,
social media, etc. to stay competitive. Apart from the traditional flat files, spreadsheets,
relational databases etc., we have a lot of unstructured data stored in the form of images,
audio files, video files, web logs, sensor data, and many others. This aspect of varied data
formats is referred to as Variety in the Big Data world.

Types of Big Data:

1. Structured Data

Structured data generally refers to data that has a defined length and format.
Examples of structured data include numbers, dates, and groups of words and numbers
called strings, as well as relational databases and .csv files.

2. Unstructured Data:

Unstructured data files often include text and multimedia content.

Examples include email messages, word processing documents, videos, photos, audio
files, presentations, web pages and many other kinds of business documents.

3. Semi-Structured Data

Examples: XML files and JSON documents are semi-structured documents; NoSQL databases are
also considered semi-structured.

Example: There are many varieties of data such as text, image, audio, video, etc., and
each has many formats.

Veracity: It refers to the messiness or trustworthiness of the data. With many forms of data,
quality and accuracy are less controllable; for example, Twitter posts with hashtags,
abbreviations and colloquial speech. Big data analytics still works on these types of data, because
the sheer volume and variety often make up for the lack of quality or accuracy.

Value: Value refers to our ability to turn data into value. It is important that a business makes
a case for any attempt to collect and leverage big data. It is easy to fall into the buzz trap and
embark on big data initiatives without a clear understanding of the business value they will bring.

Sources of Big Data:

1. Enterprise Data

2. Transactional Data

3. Social Media

4. Public Data

Need for Big Data

The following examples illustrate the need for big data for making informed decisions, improving
processes, and staying competitive in various sectors.

i. E-commerce Personalization:
 Big data enables e-commerce platforms to analyze customer behaviour in real time.
 Recommendations and targeted advertisements are tailored based on users' preferences and
past interactions.
ii. Healthcare Analytics:
 Analyzing large volumes of medical data allows for personalized treatment plans and
early detection of potential health issues.
 Real-time monitoring of patient data helps in timely interventions and improving overall
healthcare outcomes.

iii. Financial Fraud Detection:

 Banks and financial institutions use big data to detect unusual patterns in transactions.
 Real-time analysis helps identify potential fraud, enabling swift action to prevent financial
losses.

iv. Smart Cities and IoT:

 In smart cities, sensors and devices generate vast amounts of data in real time.
 Analyzing this data helps city planners optimize traffic flow, manage energy consumption,
and enhance overall urban living.

v. Social Media Insights:

 Platforms like Facebook and Twitter process enormous amounts of data to understand user
engagement and sentiment in real time.
 Advertisers use this information to refine marketing strategies instantly.

vi. Supply Chain Optimization:

 Big data assists in monitoring and optimizing the entire supply chain in real time.
 Predictive analytics helps reduce delays, minimize inventory costs, and enhance overall
efficiency.

Big data presents numerous challenges, and addressing them is crucial for organizations to harness the
full potential of large volumes of data. Here are some challenges associated with big data, along with
real-time examples:

Challenges of Big Data

 Volume:

Challenge: Managing and processing vast amounts of data.

Example: Social media platforms generate enormous volumes of user-generated content,
including text, images, and videos.

 Velocity:
Challenge: Handling the high speed at which data is generated, processed, and updated.

Example: Financial trading platforms process real-time stock market data to make split-second
decisions.

 Variety:

Challenge: Dealing with diverse types of data, including structured, semi-structured, and
unstructured.

Example: E-commerce websites analyze customer behaviour through structured data (purchase
history), semi-structured data (clickstream data), and unstructured data (customer reviews).

 Veracity:

Challenge: Ensuring data quality, accuracy, and reliability.

Example: Sensor data from IoT devices may have inaccuracies or missing values, affecting the
reliability of insights derived from the data.

 Variability:

Challenge: Handling inconsistencies in the data flow, including intermittent data spikes.

Example: Weather forecasting systems process data from various sources, and sudden,
unexpected weather events can lead to spikes in data variability.

 Complexity:

Challenge: Managing the complexity of integrating and analysing data from different sources and
formats.

Example: Healthcare organizations integrate data from electronic health records, medical imaging,
and patient-generated data for comprehensive patient care.

 Security and Privacy:

Challenge: Ensuring the confidentiality, integrity, and availability of sensitive data.

Example: Financial institutions need to secure customer financial transactions and personal
information to prevent fraud and comply with privacy regulations.
 Scalability:

Challenge: Ensuring systems can handle increased data volume and user demands.

Example: Online streaming services must scale their infrastructure to handle a surge in users during
popular events or releases.

 Cost Management:

Challenge: Balancing the costs associated with storing, processing, and analysing large volumes of
data.

Example: Cloud service providers charge based on data storage, processing power, and network
usage, requiring organizations to optimize their usage to control costs.

 Lack of Skilled Personnel:

Challenge: Shortage of skilled professionals with expertise in big data technologies.

Example: Organizations struggle to find and retain data scientists, engineers, and analysts with the
necessary skills to manage and derive insights from big data.

Addressing these challenges often involves a combination of technology adoption, data governance,
and strategic planning to ensure that big data initiatives contribute effectively to an organization's
goals.

Google File Systems:

As a consequence of the services it provides, Google faces the requirement to effectively manage
large amounts of data involving web text/image searches, maps, YouTube, etc. Considering the fact that
the available data is huge and vulnerable to failures, Google decided to create its own file storage structure and came
up with the Google File System (GFS) in 2003.

The basic considerations which forced Google to switch to a new file system included constant
monitoring, fault tolerance, error detection and automatic recovery.

Having defined the clear aims of the prospective file system, Google opted not to use an existing
distributed file system such as the Network File System (NFS) or the Andrew File System (AFS), but instead
decided to develop a new file system customized fully to suit Google's needs.

GFS Architecture

GFS follows a master–chunkserver relationship. It consists of a single master and a number of chunk
servers. Both the master and the chunk servers can be accessed by multiple clients.
Files are divided into fixed-size chunks, with the default chunk size being 64 MB. A chunk handle,
which is usually a 64-bit pointer, is used by the master to identify each chunk. The handle is globally unique
and immutable (write once, read anywhere).

By default, the number of replicas created is three, but this can be changed as required.

The general architecture of the Google File System is shown in the figure below.

Master

One of the primary roles of the master is to maintain all metadata. This includes managing file and chunk
namespaces, keeping track of the location of each chunk's replicas, mapping from files to chunks, and
holding access control information.

Storing metadata in the master provides simplicity and flexibility compared to storing it individually for
each chunk. A general norm followed by the master is to maintain less than 64 bytes of
metadata for each 64 MB chunk. Besides this, the master is also responsible for chunk
management and chunk migration between chunkservers.

It also supplies a garbage collection mechanism to avoid fragmentation. The master periodically gives
instructions to each chunkserver and gathers information about its state.

The role of the client is to ask the master which chunkserver to contact. Using the file name and byte
offset, it computes a chunk index and sends it to the master. The master responds with the locations for the
current request and a few future requests. This minimizes future interaction between client and master.
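To make the chunk-index calculation concrete, here is a tiny illustrative sketch in Java. It is plain arithmetic only, not real GFS client code, and the byte offset used is a made-up example:

// Illustrative arithmetic only, not real GFS client code: how a client could
// derive the chunk index it sends to the master from a file byte offset.
public class ChunkIndex {
    public static void main(String[] args) {
        final long CHUNK_SIZE = 64L * 1024 * 1024;  // 64 MB default GFS chunk size
        long byteOffset = 200L * 1024 * 1024;       // hypothetical read position: 200 MB into the file
        long chunkIndex = byteOffset / CHUNK_SIZE;  // = 3, i.e. the fourth chunk (zero-based)
        System.out.println("chunk index = " + chunkIndex);
    }
}

The master then maps the (file name, chunk index) pair to a chunk handle and the locations of its replicas.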

Snapshot

A Snapshot is an integral function involved in functioning of Google File System, which ensures
consistency control. It helps a user to create a copy of a file or a directory instantaneously, without
hindering the ongoing mutations.
It is more commonly used while creating checkpoints of the current state in order to perform
experimentation on data, which can later be committed or rolled back.

Data Integrity

Given the fact that a typical GFS cluster consists of thousands of machines, it experiences disk failures
time and again, which can corrupt data or cause reads and writes to fail. Google
considered this to be the norm rather than an exception.

Each chunk is broken up into 64 KB blocks, each with a corresponding 32-bit checksum. Similar to metadata,
checksums are stored with logging, independent of user data.

Garbage Collection

Instead of immediately reclaiming the physical storage space after a file or a chunk is deleted, GFS
applies a lazy garbage collection policy.

This approach keeps the system much simpler and more reliable.

Implications

GFS relies on appends rather than overwrites. It relies on checkpointing in order to fully utilize its
fault tolerance capabilities. By separating file system control (master) from data transfer
(chunkservers and clients), it ensures that the master's involvement in common operations is minimized,
resulting in faster execution.

Google later shifted its focus to automation of data management. This led to an entire architectural change in its
existing distributed storage and processing. This new automated type of storage was named 'Colossus'
and is used by Google presently.

Colossus is the next-generation cluster-level file system, with data written to it using Reed–Solomon
error-correction codes, which account for about 1.5 times redundancy compared to the threefold replication of its
predecessor. It is a client-driven system, with full access given to the client to replicate and encode data.

Hadoop
Introduction to Hadoop:

• Apache Hadoop is an open-source, free, Java-based software framework that offers a powerful
distributed platform to store and manage Big Data.

• It runs applications on large clusters of commodity hardware and processes thousands of
terabytes of data on thousands of nodes.

• Hadoop is inspired by Google's MapReduce and Google File System (GFS) papers.

History of Hadoop:

It all started with two people, Mike Cafarella and Doug Cutting, who were in the process of
building a search engine system that could index 1 billion pages. After their research, they
estimated that such a system would cost around half a million dollars in hardware, with a
monthly running cost of $30,000, which is quite expensive. They also realized that
their architecture would not be capable of scaling to the billions of pages on the
web.
They came across a paper, published in 2003, that described the architecture of Google's
distributed file system, called GFS, which was being used in production at Google. This
paper on GFS proved to be exactly what they were looking for: they realized that
it would solve their problem of storing the very large files generated as part of the
web crawl and indexing process. Later, in 2004, Google published another paper that
introduced MapReduce to the world. Together, these two papers led to the foundation of the
framework called "Hadoop", and Doug Cutting has acknowledged Google's contribution to the
development of the Hadoop framework.

Advantages of Hadoop:
1. Scalable
Hadoop is a highly scalable storage platform, because it can store and distribute very large
data sets across hundreds of inexpensive servers that operate in parallel. Unlike traditional
relational database systems (RDBMS) that can’t scale to process large amounts of data,
Hadoop enables businesses to run applications on thousands of nodes involving many
thousands of terabytes of data.

2. Cost-effective
Hadoop also offers a cost-effective storage solution for businesses' exploding data sets. The
problem with traditional relational database management systems is that it is extremely cost
prohibitive to scale to such a degree in order to process such massive volumes of data. In an
effort to reduce costs, many companies in the past would have had to down sample data and
classify it based on certain assumptions as to which data was the most valuable. The raw data
would be deleted, as it would be too cost prohibitive to keep. While this approach may have
worked in the short term, this meant that when business priorities changed, the complete raw
data set was not available, as it was too expensive to store.

3. Flexible
Hadoop enables businesses to easily access new data sources and tap into different types of
data (both structured and unstructured) to generate value from that data. This means
businesses can use Hadoop to derive valuable business insights from data sources such as
social media and email conversations. Hadoop can be used for a wide variety of purposes, such
as log processing, recommendation systems, data warehousing, market campaign analysis
and fraud detection.

4. Fast
Hadoop’s unique storage method is based on a distributed file system that basically ‘maps’
data wherever it is located on a cluster. The tools for data processing are often on the same
servers where the data is located, resulting in much faster data processing. If you’re dealing
with large volumes of unstructured data, Hadoop is able to efficiently process terabytes of
data in just minutes, and petabytes in hours.

5. Resilient to failure
A key advantage of using Hadoop is its fault tolerance. When data is sent to an individual
node, that data is also replicated to other nodes in the cluster, which means that in the event
of failure, there is another copy available for use.
Disadvantages of Hadoop:
As the backbone of so many implementations, Hadoop is almost synonymous with big data, but it also has drawbacks.
1. Security Concerns
Just managing a complex application such as Hadoop can be challenging. A simple example
can be seen in the Hadoop security model, which is disabled by default due to sheer
complexity. If whoever manages the platform does not know how to enable it, your data
could be at huge risk. Hadoop is also missing encryption at the storage and network levels,
which is a major concern for government agencies and others that prefer to keep their
data under wraps.

2. Vulnerable by Nature
Speaking of security, the very makeup of Hadoop makes running it a risky proposition. The
framework is written almost entirely in Java, one of the most widely used yet controversial
programming languages in existence. Java has been heavily exploited by cybercriminals and
as a result, implicated in numerous security breaches.

3. Not Fit for Small Data


While big data is not exclusively made for big businesses, not all big data platforms are
suited for small data needs. Unfortunately, Hadoop happens to be one of them. Due to its
high-capacity design, the Hadoop Distributed File System lacks the ability to efficiently
support the random reading of small files. As a result, it is not recommended for
organizations with small quantities of data.

4. Potential Stability Issues


Like all open source software, Hadoop has had its fair share of stability issues. To avoid these
issues, organizations are strongly recommended to make sure they are running the latest
stable version, or run it under a third party vendor equipped to handle such problems.

5. General Limitations
Hadoop by itself is not a complete solution. Tools such as Apache Flume, MillWheel, and Google's own
Cloud Dataflow are often suggested as complementary platforms. What each of these platforms has in common is the
ability to improve the efficiency and reliability of data collection, aggregation, and integration. The main
point is that companies could be missing out on big benefits by using Hadoop alone.

Distinguish between Hadoop 1.x and Hadoop 2.x ?

1. Hadoop 1.x supports only the MapReduce (MR) processing model and does not support non-MR tools.
   Hadoop 2.x allows working in MR as well as other distributed computing models such as Spark, Hama,
   Giraph, Message Passing Interface (MPI) and HBase coprocessors.
2. In Hadoop 1.x, MR does both processing and cluster resource management.
   In Hadoop 2.x, YARN (Yet Another Resource Negotiator) does cluster resource management, and
   processing is done using different processing models.
3. Hadoop 1.x has limited scaling of nodes, up to about 4000 nodes per cluster.
   Hadoop 2.x has better scalability, up to about 10000 nodes per cluster.
4. Hadoop 1.x works on the concept of slots; a slot can run either a Map task or a Reduce task only.
   Hadoop 2.x works on the concept of containers, which can run generic tasks.
5. Hadoop 1.x has a single NameNode to manage the entire namespace.
   Hadoop 2.x has multiple NameNode servers managing multiple namespaces.
6. Hadoop 1.x has a Single Point of Failure (SPOF) because of its single NameNode; in the case of
   NameNode failure, manual intervention is needed to recover.
   Hadoop 2.x overcomes the SPOF with a standby NameNode and is configured for automatic recovery
   in the case of NameNode failure.
7. The Hadoop 1.x MR API is compatible with Hadoop 1.x; a program written for Hadoop 1.x executes
   on it without any additional files.
   In Hadoop 2.x, the MR API requires additional files for a program written for Hadoop 1.x to execute.
8. Hadoop 1.x is limited as a platform for event processing, streaming and real-time operations.
   Hadoop 2.x can serve as a platform for a wide variety of data analytics, including event processing,
   streaming and real-time operations.
9. Hadoop 1.x does not support Microsoft Windows.
   Hadoop 2.x added support for Microsoft Windows.
Distinguish between Hadoop 2.x and Hadoop 3.x ?

1. Hadoop 2.x: Java version 6 was the minimum requirement.
   Hadoop 3.x: Java version 8 is the minimum requirement, as most of the dependency libraries used are
   built for Java 8.
2. Hadoop 2.x: HDFS supports replication for fault tolerance.
   Hadoop 3.x: HDFS also supports erasure coding (a technique for durably storing data with significant
   space savings compared to replication).
3. Hadoop 2.x: Limited shell scripts, with bugs.
   Hadoop 3.x: Many new Unix shell APIs, along with old bug fixes.
4. Hadoop 2.x: MapReduce became fast due to YARN.
   Hadoop 3.x: MapReduce became faster still, with around 30% improvement in the map output collector
   and shuffle jobs.
5. Hadoop 2.x: The Secondary NameNode was introduced as a standby.
   Hadoop 3.x: Supports more than two NameNodes.
6. Hadoop 2.x: A single DataNode manages multiple disks, and uneven use of those disks can lead to
   significant skew within a DataNode.
   Hadoop 3.x: New intra-DataNode balancing functionality is added, invoked via the hdfs diskbalancer CLI.
7. Hadoop 2.x: Default ports conflicted with the Linux ephemeral port range, which could lead to failures
   in port reservation.
   Hadoop 3.x: The default port assignments have been optimized to avoid this range.

HDFS

HDFS (Hadoop Distributed File System) is a distributed file system (DFS).

Behaviour of the DFS
• The total file data is split into blocks.
• Each block is stored on a separate slave node.
• In HDFS, the minimum block size is 64 MB.
• The recommendation for multiple terabytes of data is a 128 MB block size.
• The block size can be configured in multiples of 64 MB (64 MB, 128 MB, 192 MB, ...).

Example:

file1.txt has a data size of 100 MB. When this file is moved into Hadoop, it is divided into 2 blocks,
Block1 and Block2.
Block1 is stored on Data Node 1 (DN1) and Block2 is stored on Data Node 2 (DN2).

• The advantage of distributing blocks across multiple nodes is parallel processing.

• Example: file1.txt (100 MB) is stored as Block1 and Block2.

• When you submit a job to process file1, the job is divided into two smaller units of work (tasks);
of these 2 tasks, task1 is assigned to DataNode1 and task2 is assigned to DataNode2.

• Both machines run in parallel: the first machine processes Block1 of file1 while the
second machine processes Block2 of file1.

• During this process, if any processing machine goes down, the job execution is
terminated abnormally.

HDFS: Read & Write Commands using Java API

The Hadoop distributed file system is called HDFS (Hadoop Distributed File System). Hadoop-based
applications make use of HDFS.

HDFS is designed for storing very large data files, running on clusters of commodity hardware. It
is fault tolerant, scalable, and extremely simple to expand.
An HDFS cluster primarily consists of a NameNode that manages the file system metadata and
DataNodes that store the actual data.

NameNode: The NameNode can be considered the master of the system. It maintains
the file system tree and the metadata for all the files and directories present in the
system. Two files, the 'namespace image' and the 'edit log', are used to store
metadata information. The NameNode has knowledge of all the DataNodes containing data blocks for a
given file; however, it does not store block locations persistently.
This information is reconstructed from the DataNodes every time the system starts.

DataNode: DataNodes are slaves which reside on each machine in a cluster and
provide the actual storage. They are responsible for serving read and write requests from the clients.

Read Operation In HDFS


A data read request is served by HDFS, the NameNode and the DataNodes. Let's call the reader
a 'client'. The diagram below depicts the file read operation in Hadoop.

1. The client initiates the read request by calling the 'open()' method of the FileSystem object; it is an
object of type DistributedFileSystem.

2. This object connects to the NameNode using RPC and gets metadata information such
as the locations of the blocks of the file. Please note that these addresses are for the
first few blocks of the file.
3. In response to this metadata request, the addresses of the DataNodes having a copy of
each block are returned.
4. Once the addresses of the DataNodes are received, an object of
type FSDataInputStream is returned to the
client. FSDataInputStream contains a DFSInputStream which takes care of
interactions with the DataNodes and NameNode. The client then invokes the
'read()' method, which causes the DFSInputStream to establish a
connection with the first DataNode holding the first block of the file.
5. Data is read in the form of streams, with the client invoking the 'read()' method
repeatedly. This read() process continues until it reaches the end of the block.
6. Once the end of a block is reached, the DFSInputStream closes the connection and moves
on to locate the next DataNode for the next block.
7. Once the client has finished reading, it calls the close() method.
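A minimal sketch of this read path using the public Java API is shown below. It assumes a Hadoop 2.x client with core-site.xml/hdfs-site.xml on the classpath; the path /user/demo/file1.txt is a hypothetical example, not one defined in these notes.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsRead {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();            // picks up *-site.xml if on the classpath
        FileSystem fs = FileSystem.get(conf);                // a DistributedFileSystem for hdfs:// URIs
        Path path = new Path("/user/demo/file1.txt");        // hypothetical file
        try (FSDataInputStream in = fs.open(path);           // step 1: open() returns an FSDataInputStream
             BufferedReader reader = new BufferedReader(new InputStreamReader(in))) {
            String line;
            while ((line = reader.readLine()) != null) {     // repeated read() calls against the DataNodes
                System.out.println(line);
            }
        }                                                    // step 7: the stream is closed here
        fs.close();
    }
}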

Write Operation In HDFS:

1. The client initiates the write operation by calling the 'create()' method of the DistributedFileSystem
object, which creates a new file (step 1 in the diagram).
2. The DistributedFileSystem object connects to the NameNode using an RPC call and
initiates new file creation. However, this file create operation does not associate
any blocks with the file. It is the responsibility of the NameNode to verify that the file
(which is being created) does not exist already and that the client has correct permissions
to create the new file. If the file already exists or the client does not have sufficient permission
to create a new file, then an IOException is thrown to the client. Otherwise, the operation succeeds and a new
record for the file is created by the NameNode.
3. Once the new record in the NameNode is created, an object of type FSDataOutputStream
is returned to the client. The client uses it to write data into HDFS. The data write
method is invoked (step 3 in the diagram).
4. FSDataOutputStream contains a DFSOutputStream object which looks after
communication with the DataNodes and NameNode. While the client continues writing data,
DFSOutputStream continues creating packets with this data. These packets are enqueued into a queue
called the DataQueue.
5. There is one more component, called the DataStreamer, which consumes
this DataQueue. The DataStreamer also asks the NameNode for allocation of new blocks,
thereby picking desirable DataNodes to be used for replication.
6. Now, the process of replication starts by creating a pipeline using DataNodes. In
our case, we have chosen a replication level of 3 and hence there are 3 DataNodes in the pipeline.
7. The DataStreamer pours packets into the first DataNode in the pipeline.
8. Every DataNode in the pipeline stores each packet it receives and forwards it
to the next DataNode in the pipeline.
9. Another queue, the 'Ack Queue', is maintained by DFSOutputStream to store packets
which are waiting for acknowledgement from the DataNodes.
10. Once acknowledgement for a packet in the queue is received from all DataNodes in
the pipeline, it is removed from the 'Ack Queue'. In the event of any DataNode
failure, packets from this queue are used to reinitiate the operation.
11. After the client has finished writing data, it calls the close() method (step 9 in the diagram). The call to
close() results in flushing the remaining data packets to the pipeline,
followed by waiting for acknowledgement.
12. Once the final acknowledgement is received, the NameNode is contacted to tell it that the
file write operation is complete.
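The corresponding write path, again as a hedged sketch against the public Java API (the target path /user/demo/output.txt is hypothetical):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWrite {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path path = new Path("/user/demo/output.txt");          // hypothetical file
        try (FSDataOutputStream out = fs.create(path, true)) {  // step 1: create(); true = overwrite if it exists
            out.writeBytes("hello hdfs\n");                     // data is packetized and pipelined to DataNodes
            out.hflush();                                       // push buffered packets to the DataNode pipeline
        }                                                       // close() waits for the final acknowledgements
        fs.close();
    }
}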

Building blocks of Hadoop:

• The Hadoop architecture is implemented and maintained by 5 subsystems. Each subsystem is
called a "DAEMON".

The 5 daemons of Hadoop are:

HDFS (Storage):
• Name Node (or) Master Node
• Data Node (or) Slave Node
• Secondary Name Node

MapReduce (Process):
• Job Tracker
• Task Tracker

NameNode
• The NameNode is the "controller" of the entire Hadoop cluster. The single point of failure (SPOF) is at the
NameNode.
• The NameNode is mainly responsible for storage services.

• When you load a file into the Hadoop cluster, based on the operating system file size it estimates
the space required in HDFS from the configured "block size" and "number of
replications".
• After the estimation is done, it checks for available space on the DataNodes (slaves).

• If space is available, it divides the data into blocks and replications.

• It orders the selected/chosen DataNodes to save these file blocks.

Decision for DataNode selection (priority sequence):
• Higher configuration

• Nearest location

• More available free space

• Once the DataNodes have saved the file blocks, the NameNode registers the METADATA (file block metadata).
The NameNode maintains the following configuration:
• Physical address of the DataNodes (IP address)

• Hardware configuration

• HDFS configuration

• Block size

• Replication

• Metadata

• It serves this information to the JobTracker and the Secondary NameNode whenever they request it.

SPOF (Single Point of Failure)
• In a Hadoop cluster, the SPOF is at the NameNode. If the NameNode is down, operations cannot be
performed:
• New jobs cannot be submitted.

• Queued jobs cannot be released.

• Running (active) jobs will be abnormally terminated.

DataNode:
DataNodes are the slave nodes in HDFS. Unlike the NameNode, a DataNode is commodity
hardware, that is, a non-expensive system which is not of high quality or high availability.
The DataNode is a block server that stores the data in the local file system (e.g. ext3 or ext4).
Functions of DataNode:

 These are slave daemons or process which runs on each slave machine.
 The actual data is stored on DataNodes.
 The DataNodes perform the low-level read and write requests from the file system’s
clients.

They send heartbeats to the NameNode periodically to report the overall health of HDFS; by
default, this frequency is set to 3 seconds.
Secondary NameNode:
Apart from these two daemons, there is a third daemon or a process called Secondary
NameNode. The Secondary NameNode works concurrently with the primary NameNode as
a helper daemon. And don’t be confused about the Secondary NameNode being a backup
NameNode because it is not.

Functions of Secondary NameNode:

 The Secondary NameNode constantly reads the file system state and
metadata from the RAM of the NameNode and writes it to the hard disk or the file
system.
 It is responsible for combining the EditLogs with FsImage from the NameNode.
 It downloads the EditLogs from the NameNode at regular intervals and applies to
FsImage. The new FsImage is copied back to the NameNode, which is used whenever
the NameNode is started the next time.

Hence, Secondary NameNode performs regular checkpoints in HDFS. Therefore, it is also


called Checkpoint Node.
Job Tracker
• It is responsible for submission of jobs and execution of jobs.
Responsibilities
 Job Management (Scheduling, Tuning).
 Input File Availability check.
 Divide the job into smaller works.
 Work assigned to task trackers.
 Health monitoring of the machines.
 Load balancing
– Before work assignment
– After work assignment
 Fault Tolerance. (Replications)

What happens if the JobTracker is down?

• All running jobs are abnormally terminated, queued jobs cannot be released, and new
jobs cannot be submitted.
Task Tracker
• It is one of the daemons in a Hadoop cluster.

• It is responsible for executing the tasks assigned by the JobTracker.

• The TaskTracker heartbeat interval is 10 seconds and cannot be configured.

• If the JobTracker has received the expected number of heartbeats from a TaskTracker, it indicates
perfect health.
• If the number of heartbeats is less than expected, it indicates ill health.

• If zero heartbeats are received, it indicates the node is down.

• A slave machine is a combination of DataNode + TaskTracker.

What is the single point of failure in Hadoop v1?

The single point of failure in Hadoop v1 is the NameNode. If the NameNode fails, the
whole Hadoop cluster stops working. There is actually no data loss; only the cluster
operations are shut down, because the NameNode is the only point of contact for all DataNodes, and
if the NameNode fails all communication stops.
What are the available solutions to handle the single point of failure in Hadoop 1?
To handle the single point of failure, we can use another setup configuration which
backs up the NameNode metadata. If the primary NameNode fails, our setup can switch to the
secondary (backup) node and no shutdown of the Hadoop cluster is needed.

How has the single point of failure issue been addressed in Hadoop 2?

HDFS High Availability of the NameNode was introduced with Hadoop 2. In this setup, two
separate machines are configured as NameNodes, where one NameNode is always in the
active (working) state and the other is in standby. The active NameNode handles all client requests in the
cluster, while the standby behaves as a slave and maintains enough state to provide a fast
failover of the active NameNode if required.

Differences between GFS and HDFS:

1. Platform: GFS runs on Linux; HDFS is cross-platform.
2. Origin: GFS was developed by Google; HDFS was initially developed at Yahoo and is now an open
   source framework.
3. Nodes: GFS has a master node and chunkservers; HDFS has a NameNode and DataNodes.
4. Block size: GFS has a default chunk size of 64 MB; HDFS has a default block size of 128 MB.
5. Heartbeats: In GFS the master receives heartbeats from the chunkservers; in HDFS the NameNode
   receives heartbeats from the DataNodes.
6. Hardware: Both use commodity hardware.
7. Access model: GFS supports a multiple-writer, multiple-reader model; HDFS follows WORM (Write
   Once, Read Many times).
8. Deletion: In GFS, deleted files are not reclaimed immediately; they are renamed into a hidden
   namespace and deleted after 3 days if not in use. In HDFS, deleted files are renamed into a
   particular (trash) folder and then removed via garbage collection.

HDFS Fault Tolerance


• HDFS creates replications for each block of a file.

• The default number of replications for each block is three (it can be configured).

• In "pseudo-distributed mode" of Hadoop, the number of replications should be 1.

• In "fully distributed mode" of Hadoop, the number of replications should be greater than 1.

• Example: when file1.txt (100 MB) is moved to HDFS, it needs 2 blocks,
Block1 and Block2. For Block1 three copies are kept (including the original), and likewise three copies for Block2.
Block1's first copy is moved to DataNode1.

• If there is no separate node available to store the 3rd copy of Block2, that copy
is stored on any available node.
• After all blocks and their replications are stored, the metadata is written to the
"NameNode".
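As a rough illustration of how the replication factor can be controlled from client code, here is a minimal Java sketch. The property name dfs.replication and the FileSystem.setReplication() call are standard Hadoop API elements; the file path is hypothetical.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SetReplication {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("dfs.replication", "3");                  // default replication for newly created files
        FileSystem fs = FileSystem.get(conf);
        Path existing = new Path("/user/demo/file1.txt");  // hypothetical existing file
        fs.setReplication(existing, (short) 2);            // change the replication of an existing file
        fs.close();
    }
}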
Different Modes of Hadoop

Hadoop can run in three different modes as below :

• Local (Standalone) Mode

• Pseudo-Distributed Mode

• Fully Distributed Mode

Hadoop Operation Modes Once you have downloaded Hadoop, you can operate your
Hadoop cluster in one of the three supported modes:

Local/Standalone Mode : After downloading Hadoop in your system, by default, it is


configured in a standalone mode and can be run as a single java process.
Pseudo Distributed Mode : It is a distributed simulation on single machine. Each
Hadoop daemon such as hdfs, yarn, MapReduce etc., will run as a separate java process. This
mode is useful for development.

Fully Distributed Mode : This mode is fully distributed with minimum two or more
machines as a cluster. We will come across this mode in detail in the coming chapters.
Installing Hadoop in Standalone Mode

Here we will discuss the installation of Hadoop 2.4.1 in standalone mode. There are
no daemons running and everything runs in a single JVM. Standalone mode is suitable for
running MapReduce programs during development, since it is easy to test and debug them.
Setting Up Hadoop You can set Hadoop environment variables by appending the following
commands to ~/.bashrc file.

export HADOOP_HOME=/usr/local/hadoop

Before proceeding further, you need to make sure that Hadoop is working fine. Just
issue the following command:

$ hadoop version
If it prints the Hadoop version information, your Hadoop standalone mode setup is working fine. By default, Hadoop is
configured to run in a non-distributed mode on a single machine. Example: Let's walk through a
simple example; the Hadoop installation delivers the following example program.

Step 1

Create temporary content files in the input directory. You can create this input
directory anywhere you would like to work.

$ mkdir input

$ cp $HADOOP_HOME/*.txt input

$ ls -l input
It will give the following files in your input directory:

total 24

-rw-r--r-- 1 root root 15164 Feb 21 10:14 LICENSE.txt
-rw-r--r-- 1 root root   101 Feb 21 10:14 NOTICE.txt
-rw-r--r-- 1 root root  1366 Feb 21 10:14 README.txt

These files have been copied from the Hadoop installation home directory. For your
experiment, you can have different and large sets of files.

Step 2

Let's start the Hadoop process to count the total number of words in all the files
available in the input directory, as follows:

$ hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar wordcount
input output

Step 3

Step 2 will do the required processing and save the output in the output/part-r-00000
file, which you can check by using:

$cat output/*

It will list down all the words along with their total counts available in all the files available
in the input directory.
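For reference, a word-count job similar in spirit to the bundled example can be sketched as follows. This is an illustrative version written against the Hadoop 2.x MapReduce Java API, not the exact source shipped in hadoop-mapreduce-examples.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    public static class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer it = new StringTokenizer(value.toString());
            while (it.hasMoreTokens()) {
                word.set(it.nextToken());
                context.write(word, ONE);                 // emit (word, 1)
            }
        }
    }

    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));     // emit (word, total count)
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // e.g. "input"
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // e.g. "output"
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}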

Installing Hadoop in Pseudo Distributed Mode

Follow the steps given below to install Hadoop 2.4.1 in pseudo distributed mode.

Step 1: Setting Up Hadoop You can set Hadoop environment variables by appending the
following commands to ~/.bashrc file.

export HADOOP_HOME=/usr/local/hadoop
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
export HADOOP_INSTALL=$HADOOP_HOME

Now apply all the changes into the current running system.
$ source ~/.bashrc
Step 2: Hadoop Configuration You can find all the Hadoop configuration files in the
location “$HADOOP_HOME/etc/hadoop”. It is required to make changes in those
configuration files according to your Hadoop infrastructure.

$ cd $HADOOP_HOME/etc/hadoop

In order to develop Hadoop programs in Java, you have to reset the Java environment
variables in the hadoop-env.sh file by replacing the JAVA_HOME value with the location of Java in
your system.

export JAVA_HOME=/usr/local/jdk1.7.0_71

The following is the list of files that you have to edit to configure Hadoop.
core-site.xml: The core-site.xml file contains information such as the port number used for the
Hadoop instance, memory allocated for the file system, memory limit for storing the data,
and the size of read/write buffers. Open core-site.xml and add the following properties in
between the <configuration> and </configuration> tags.

<configuration>

<property>

<name>fs.default.name </name>

<value> hdfs://localhost:9000 </value>

</property>

</configuration>

hdfs-site.xml: The hdfs-site.xml file contains information such as the value of
replication data, the namenode path,
and the datanode paths of your local file systems, i.e. the place where you want to
store the Hadoop infrastructure.
Hadoop's three configuration files: Hadoop has a bewildering number of configuration
properties. These properties are set in the Hadoop site files: core-site.xml, hdfs-site.xml, and
mapred-site.xml.
core-site.xml configuration file: The core-site.xml file informs the Hadoop daemons where the
NameNode runs in the cluster. It contains the configuration settings for Hadoop Core, such as
I/O settings that are common to HDFS and MapReduce.
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://namenode/</value>
<final>true</final>
</property>
</configuration>
hdfs-site.xml configuration file
The hdfs-site.xml file contains the configuration settings for the HDFS daemons: the NameNode,
the Secondary NameNode, and the DataNodes. Here, we can configure hdfs-site.xml to
specify default block replication and permission checking on HDFS. The actual number of
replications can also be specified when the file is created. The default is used if replication is
not specified at create time.
<?xml version="1.0"?>
<!-- hdfs-site.xml -->
<configuration>
<property>
<name>dfs.name.dir</name>
<value>/disk1/hdfs/name,/remote/hdfs/name</value>
<final>true</final>
</property>
<property>
<name>dfs.data.dir</name>
<value>/disk1/hdfs/data,/disk2/hdfs/data</value>
<final>true</final>
</property>
<property>
<name>fs.checkpoint.dir</name>
<value>/disk1/hdfs/namesecondary,/disk2/hdfs/namesecondary</value>
<final>true</final>
</property>
</configuration>

mapred-site.xml configuration file

The mapred-site.xml file contains the configuration settings for the MapReduce daemons: the
JobTracker and the TaskTrackers.
<?xml version="1.0"?>
<!-- mapred-site.xml -->
<configuration>
<property>
<name>mapred.job.tracker</name>
<value>jobtracker:8021</value>
<final>true</final>
</property>
<property>
<name>mapred.local.dir</name>
<value>/disk1/mapred/local,/disk2/mapred/local</value>
<final>true</final>
</property>
<property>
<name>mapred.system.dir</name>
<value>/tmp/hadoop/mapred/system</value>
<final>true</final>
</property>
<property>
<name>mapred.tasktracker.map.tasks.maximum</name>
<value>7</value>
<final>true</final>
</property>
<property>
<name>mapred.tasktracker.reduce.tasks.maximum</name>
<value>7</value>
<final>true</final>
</property>
<property>
<name>mapred.child.java.opts</name>
<value>-Xmx400m</value>
<!-- Not marked as final so jobs can include JVM debugging options -->
</property>
</configuration>

Hadoop 2.x configuration uses the following files:

1. core-site.xml
2. hdfs-site.xml
3. yarn-site.xml

yarn-site.xml configuration file
YARN configuration options are stored in the /opt/mapr/hadoop/hadoop-
2.x.x/etc/hadoop/yarn-site.xml file and are editable by the root user. This file contains
configuration information that overrides the default values for YARN parameters. Overrides
of the default values for core configuration properties are stored in the Default YARN
Parameters file.
To override a default value for a property, specify the new value within
the <configuration> tags, using the following format:
Syntax:
<property>
<name></name>
<value></value>
<description></description>
</property>

Sample yarn-site.xml:

<?xml version="1.0" encoding="UTF-8"?>

<configuration>
<!-- Site-specific YARN configuration properties -->
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle,myriad_executor</value>
<!-- If using the MapR distro, please use the following value: -->
<value>mapreduce_shuffle,mapr_direct_shuffle,myriad_executor</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.myriad_executor.class</name>
<value>org.apache.myriad.executor.MyriadExecutorAuxService</value>
</property>
<property>
<name>yarn.resourcemanager.scheduler.class</name>
<value>org.apache.myriad.scheduler.yarn.MyriadFairScheduler</value>
</property>
</configuration>
