Hadoop 2 & 3 Units Final
Introduction to BigData:
Today we live in an information society and are moving towards a knowledge-based society. To extract better knowledge, we need larger amounts of data. Every organization needs to collect large data sets to support its decisions and to extract correlations through data analysis as a basis for those decisions.
Big Data is a collection of large and complex data sets that are difficult to process using present database management tools or traditional data processing applications.
Big Data consists of large volumes of heterogeneous data that are often generated at high speed.
Big Data therefore requires a new set of tools, applications and frameworks to process and manage these large volumes of data.
Big Data is generally too big to process conveniently, and also too big to transport easily.
Big Data is difficult to store and process, and a major issue is the large amount of time taken to clean the data in the first place.
Characteristics:
Volume
Velocity
Variety
Veracity
Value
Volume: The main characteristic of Big Data is that it must be big, which is measured as volume. Due to advances in technology and the rise of social media, the exchange of data has become very easy and the amount of data is growing very rapidly. These large volumes of data are spread across different places and stored in different formats, and may range from gigabytes to terabytes, petabytes or even more. Nowadays data is generated not only by humans but also by machines. All of this together is referred to as the Volume of data.
Example: Facebook stores/generates large volumes of data every day.
Velocity: Velocity refers to the speed at which data is being generated. Decision makers want the necessary data in the least amount of time possible. In different fields and different areas of technology, we see data being generated at different speeds. A few examples include trading/stock exchange data, tweets on Twitter, status updates/likes/shares on Facebook, and many others. This speed aspect of data generation is referred to as Velocity in the Big Data world.
Example: On a stock exchange, large volumes of data are generated and accessed by users every second.
Figure 1: Five V’s of Big Data
Variety: Variety refers to the different formats in which data is being generated. Different applications generate and store data in different formats. Today, large volumes of unstructured data are being generated in addition to the structured data produced in enterprises. To stay competitive, organizations can no longer rely only on structured data from enterprise databases and warehouses; they are also forced to consume lots of data generated both inside and outside the enterprise, such as clickstream data and social media. Apart from traditional flat files, spreadsheets, relational databases etc., a lot of unstructured data is stored in the form of images, audio files, video files, web logs, sensor data, and many others. This aspect of varied data formats is referred to as Variety in the Big Data world.
1. Structured Data
Structured data generally refers to data that has a defined length and format. Examples of structured data include numbers, dates, and groups of words and numbers called strings, as well as relational databases and .csv files.
2. Unstructured Data
Examples include email messages, word processing documents, videos, photos, audio files, presentations, web pages and many other kinds of business documents.
3. Semi-Structured Data
E.g.: XML files and JSON documents are semi-structured documents; NoSQL databases are considered semi-structured.
Example: There are many varieties of data such as text, image, audio and video, and each has many formats.
Veracity: Veracity refers to the messiness or trustworthiness of the data. With many forms of data, quality and accuracy are less controllable, for example Twitter posts with hashtags, abbreviations and colloquial speech. Big data and analytics still work on these types of data; the sheer volumes and variety often make up for the lack of quality or accuracy.
Value: Value refers to our ability to turn data into value. It is important that the business makes a case for any attempt to collect and leverage big data. It is easy to fall into the buzz trap and embark on big data initiatives without a clear understanding of the business value they will bring.
Sources of Bigdata:
1. Enterprise Data
2. Transactional Data
3. Social Media
4. Public Data
The following examples illustrate the need for big data in making informed decisions, improving processes, and staying competitive in various sectors.
i. E-commerce Personalization:
Big data enables e-commerce platforms to analyze customer behaviour in real time. Recommendations and targeted advertisements are tailored based on users' preferences and past interactions.
ii. Healthcare Analytics:
Analyzing large volumes of medical data allows for personalized treatment plans and early detection of potential health issues. Real-time monitoring of patient data helps in timely interventions and improving overall healthcare outcomes.
Big data presents numerous challenges, and addressing them is crucial for organizations to harness the full potential of large volumes of data. Here are some challenges associated with big data, along with real-world examples:
Volume:
Challenge: Storing and managing the enormous amounts of data being generated.
Example: Social media platforms generate enormous volumes of user-generated content, including text, images, and videos.
Velocity:
Challenge: Handling the high speed at which data is generated, processed, and updated.
Example: Financial trading platforms process real-time stock market data to make split-second
decisions.
Variety:
Challenge: Dealing with diverse types of data, including structured, semi-structured, and unstructured.
Example: E-commerce websites analyze customer behaviour through structured data (purchase history), semi-structured data (clickstream data), and unstructured data (customer reviews).
Veracity:
Challenge: Ensuring the accuracy and trustworthiness of the data.
Example: Sensor data from IoT devices may have inaccuracies or missing values, affecting the reliability of insights derived from the data.
Variability:
Challenge: Handling inconsistencies in the data flow, including intermittent data spikes.
Example: Weather forecasting systems process data from various sources, and sudden,
unexpected weather events can lead to spikes in data variability.
Complexity:
Challenge: Managing the complexity of integrating and analysing data from different sources and
formats.
Example: Healthcare organizations integrate data from electronic health records, medical imaging,
and patient generated data for comprehensive patient care.
Data Security and Privacy:
Challenge: Protecting sensitive data from unauthorized access while meeting privacy requirements.
Example: Financial institutions need to secure customer financial transactions and personal information to prevent fraud and comply with privacy regulations.
Scalability:
Challenge: Ensuring systems can handle increased data volume and user demands.
Example: Online streaming services must scale their infrastructure to handle a surge in users during
popular events or releases.
Cost Management:
Challenge: Balancing the costs associated with storing, processing, and analysing large volumes of
data.
Example: Cloud service providers charge based on data storage, processing power, and network
usage, requiring organizations to optimize their usage to control costs.
Skill Shortage:
Challenge: Finding and retaining professionals with the skills needed to work with big data.
Example: Organizations struggle to find and retain data scientists, engineers, and analysts with the necessary skills to manage and derive insights from big data.
Addressing these challenges often involves a combination of technology adoption, data governance,
and strategic planning to ensure that big data initiatives contribute effectively to an organization's
goals.
Google File System (GFS)
As a consequence of the services it provides, Google faces the requirement to effectively manage large amounts of data involving web text/image searches, maps, YouTube, etc. Considering that the available data is huge and vulnerable, Google decided to create its own file storage structure and came up with the Google File System (GFS) in 2003.
The basic considerations that forced Google to switch to a new file system included constant monitoring, fault tolerance, error detection and automatic recovery.
Having clear aims for the prospective file system, Google opted not to use an existing distributed file system such as the Network File System (NFS) or the Andrew File System (AFS), and instead decided to develop a new file system fully customized to suit Google's needs.
GFS Architecture
GFS follows a master/chunkserver architecture. It consists of a single master and a number of chunkservers, and both the master and the chunkservers can be accessed by multiple clients.
Files are divided into fixed-size chunks, with the default chunk size being 64 MB. A chunk handle, which is essentially a 64-bit pointer, is used by the master to identify each chunk. The handle is globally unique and immutable (write once, read anywhere).
By default, three replicas of each chunk are created, but this number can be changed.
Master
One of the primary roles of the master is to maintain all metadata. This includes managing the file and chunk namespaces, keeping track of the location of each chunk's replicas, mapping from files to chunks, and access control information.
Storing metadata in the master provides simplicity and flexibility compared to storing it individually for each chunk. A general norm followed by the master is that it maintains less than 64 bytes of metadata for each 64 MB chunk. Besides this, the master is also responsible for chunk management and chunk migration between chunkservers.
It also provides a garbage collection mechanism to avoid fragmentation. The master periodically gives instructions to each chunkserver and gathers information about its state.
The role of the client is to ask the master which chunkserver to contact. Using the file name and byte offset, it computes a chunk index and sends it to the master. The master, in turn, responds with the locations for the current request and a few future requests. This ensures minimum future interaction between client and master.
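To make the offset-to-chunk mapping concrete, here is a minimal Java sketch (not GFS client code; the class and method names are invented for illustration) assuming the 64 MB default chunk size.

// Sketch (not actual GFS code): how a client could derive the chunk index
// that it sends to the master, given a byte offset within a file.
public class ChunkIndexExample {
    static final long CHUNK_SIZE = 64L * 1024 * 1024; // 64 MB default chunk size

    // The chunk index is simply the byte offset divided by the fixed chunk size.
    static long chunkIndex(long byteOffset) {
        return byteOffset / CHUNK_SIZE;
    }

    public static void main(String[] args) {
        long offset = 200L * 1024 * 1024;        // read starting at the 200 MB mark
        System.out.println(chunkIndex(offset));  // prints 3, i.e. the fourth chunk of the file
    }
}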
Snapshot
A snapshot is an integral function of the Google File System that supports consistency control. It lets a user create a copy of a file or a directory instantaneously, without hindering ongoing mutations.
It is most commonly used to create checkpoints of the current state in order to experiment on data that can later be committed or rolled back.
Data Integrity
Given that a typical GFS cluster consists of thousands of machines, it experiences disk failures time and again, which ultimately cause corruption or loss of data on reads and writes. Google considered this to be the norm rather than the exception.
Each chunk is broken up into 64 KB blocks, and each block has a corresponding 32-bit checksum. Like other metadata, checksums are stored with logging, independent from user data.
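As an illustration of this scheme (not GFS source code), the sketch below computes one 32-bit checksum per 64 KB block of a chunk. CRC32 is an assumption made purely for illustration, since the text only states that the checksum is 32 bits wide.

import java.util.zip.CRC32;

// Illustrative sketch only: one 32-bit checksum per 64 KB block of a chunk,
// in the spirit of GFS data integrity. CRC32 is assumed for illustration.
public class BlockChecksums {
    static final int BLOCK_SIZE = 64 * 1024; // 64 KB

    static long[] checksums(byte[] chunkData) {
        int blocks = (chunkData.length + BLOCK_SIZE - 1) / BLOCK_SIZE;
        long[] sums = new long[blocks];
        for (int i = 0; i < blocks; i++) {
            int start = i * BLOCK_SIZE;
            int len = Math.min(BLOCK_SIZE, chunkData.length - start);
            CRC32 crc = new CRC32();
            crc.update(chunkData, start, len);
            sums[i] = crc.getValue(); // 32-bit value, stored in a long
        }
        return sums;
    }
}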
Garbage Collection
Instead of immediately reclaiming the physical storage space after a file or a chunk is deleted, GFS applies a lazy garbage collection policy.
This approach makes the system much simpler and more reliable.
Implications
GFS relies on appends rather than overwrites. It uses checkpointing in order to fully utilize its fault tolerance capabilities. By separating file system control (master) from data transfer (chunkservers and clients), it ensures that the master's involvement in common operations is minimized, resulting in faster execution.
Google later shifted its focus to automation of data, which led to an entire architectural change in its existing distributed storage and processing. This new automated type of storage was named 'Colossus' and is currently used by Google.
Colossus is the next-generation cluster-level file system. Data is written to it using Reed-Solomon error correction codes, which account for a 1.5x redundancy compared to its predecessor. It is a client-driven system, with the client given full access to replicate and encode data.
Hadoop
Introduction to Hadoop:
• Apache Hadoop is an open-source, free and Java-based software framework that offers a powerful distributed platform to store and manage Big Data.
• Hadoop is inspired by Google's MapReduce and Google File System (GFS) papers.
History of Hadoop:
It all started with two people, Mike Cafarella and Doug Cutting, who were in the process of building a search engine system that could index 1 billion pages. After their research, they estimated that such a system would cost around half a million dollars in hardware, with a monthly running cost of $30,000, which is quite expensive. They soon realized that their architecture would not be capable enough to work with the billions of pages on the web.
They came across a paper, published in 2003, that described the architecture of Google's distributed file system, called GFS, which was being used in production at Google. This paper on GFS proved to be exactly what they were looking for, and they realized that it would solve their problems of storing the very large files generated as part of the web crawl and indexing process. Later, in 2004, Google published one more paper that introduced MapReduce to the world. These two papers led to the foundation of the framework called "Hadoop". Doug Cutting has acknowledged Google's contribution to the development of the Hadoop framework.
Advantages of Hadoop:
1. Scalable
Hadoop is a highly scalable storage platform because it can store and distribute very large data sets across hundreds of inexpensive servers that operate in parallel. Unlike traditional relational database systems (RDBMS) that can't scale to process large amounts of data, Hadoop enables businesses to run applications on thousands of nodes involving many thousands of terabytes of data.
2. Cost-effective
Hadoop also offers a cost-effective storage solution for businesses' exploding data sets. The problem with traditional relational database management systems is that it is extremely cost-prohibitive to scale to such a degree in order to process such massive volumes of data. In an effort to reduce costs, many companies in the past would have had to down-sample data and classify it based on assumptions about which data was most valuable. The raw data would be deleted, as it would be too cost-prohibitive to keep. While this approach may have worked in the short term, it meant that when business priorities changed, the complete raw data set was no longer available, as it was too expensive to store.
3. Flexible
Hadoop enables businesses to easily access new data sources and tap into different types of data (both structured and unstructured) to generate value from that data. This means businesses can use Hadoop to derive valuable business insights from data sources such as social media and email conversations. Hadoop can be used for a wide variety of purposes, such as log processing, recommendation systems, data warehousing, market campaign analysis and fraud detection.
4. Fast
Hadoop’s unique storage method is based on a distributed file system that basically ‘maps’
data wherever it is located on a cluster. The tools for data processing are often on the same
servers where the data is located, resulting in much faster data processing. If you’re dealing
with large volumes of unstructured data, Hadoop is able to efficiently process terabytes of
data in just minutes, and petabytes in hours.
5. Resilient to failure
A key advantage of using Hadoop is its fault tolerance. When data is sent to an individual
node, that data is also replicated to other nodes in the cluster, which means that in the event
of failure, there is another copy available for use.
Disadvantages of Hadoop:
As the backbone of so many implementations, Hadoop is almost synonymous with big data.
1. Security Concerns
Just managing a complex application such as Hadoop can be challenging. A simple example can be seen in the Hadoop security model, which is disabled by default due to its sheer complexity. If whoever manages the platform lacks the know-how to enable it, your data could be at huge risk. Hadoop is also missing encryption at the storage and network levels, which is a major concern for government agencies and others that prefer to keep their data under wraps.
2. Vulnerable by Nature
Speaking of security, the very makeup of Hadoop makes running it a risky proposition. The
framework is written almost entirely in Java, one of the most widely used yet controversial
programming languages in existence. Java has been heavily exploited by cybercriminals and
as a result, implicated in numerous security breaches.
5. General Limitations
The article introduces Apache Flume, MillWheel, and Google's own Cloud Dataflow as possible solutions. What each of these platforms has in common is the ability to improve the efficiency and reliability of data collection, aggregation, and integration. The main point the article stresses is that companies could be missing out on big benefits by using Hadoop alone.
Hadoop 2.x vs Hadoop 3.x

Hadoop 2.x: Java version 6 was the minimum requirement.
Hadoop 3.x: Java version 8 is the minimum requirement, as most of the dependency libraries used are from Java 8.

Hadoop 2.x: HDFS supports replication for fault tolerance.
Hadoop 3.x: HDFS supports erasure coding (a technique for durably storing information with significant space savings compared to replication).

Hadoop 2.x: Limited shell scripts, with bugs.
Hadoop 3.x: Many new Unix shell APIs, with old bugs fixed.

Hadoop 2.x: MapReduce became fast due to YARN.
Hadoop 3.x: MapReduce became faster, particularly at the map output collector and shuffle, by about 30%.

Hadoop 2.x: A single DataNode manages multiple disks, and the disks inside can develop significant skew within a DataNode.
Hadoop 3.x: New intra-DataNode balancing functionality is added, invoked via the hdfs diskbalancer CLI.

Hadoop 2.x: Default ports conflicted with the Linux port range, which led to failures in port reservation.
Hadoop 3.x: The port range has been optimized.

Hadoop 2.x: Supports at least Java version 6.
Hadoop 3.x: Supports at least Java version 8.
Behaviour of DFS
• The total file data will be split into blocks.
• Each block will be stored on a separate slave node.
• As per HDFS, the minimum block size is 64 MB.
• The recommendation for multiple terabytes of data is 128 MB.
• The block size can be configured as 64 MB, 64 MB * 2, 64 MB * 3, and so on.
Example:
file1.txt has a data size of 100 MB; when this file is moved into Hadoop, the file will be divided into 2 blocks, Block1 and Block2.
Block1 will be stored in Data Node1 (DN1) and Block2 will be stored in Data Node2 (DN2).
• When you submit a job to process file1, the job will be divided into two smaller units of work (tasks); of the 2 tasks, task1 is assigned to DataNode1 and task2 is assigned to DataNode2.
• Both machines run in parallel: the first machine processes Block1 of file1 while the second machine processes Block2 of file1.
• During this process, if any processing machine goes down, the job execution is terminated abnormally. The block-splitting arithmetic is sketched below.
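This is an illustrative calculation only, not HDFS source code, and the class name is invented; it simply reproduces the 100 MB / 64 MB example above.

// Minimal sketch of the block-splitting arithmetic described above.
public class BlockSplitExample {
    public static void main(String[] args) {
        long blockSize = 64L * 1024 * 1024;   // 64 MB default block size
        long fileSize  = 100L * 1024 * 1024;  // file1.txt of 100 MB

        long fullBlocks  = fileSize / blockSize;
        long remainder   = fileSize % blockSize;
        long totalBlocks = fullBlocks + (remainder > 0 ? 1 : 0);

        // Block1 = 64 MB, Block2 = 36 MB
        System.out.println("Blocks needed: " + totalBlocks); // prints 2
    }
}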
Hadoop provides a distributed file system called HDFS (Hadoop Distributed File System), and Hadoop-based applications make use of HDFS.
HDFS is designed for storing very large data files, running on clusters of commodity hardware. It is fault-tolerant, scalable, and extremely simple to expand.
An HDFS cluster primarily consists of a NameNode that manages the file system metadata and DataNodes that store the actual data.
DataNode: DataNodes are slaves which reside on each machine in a cluster and provide the actual storage. They are responsible for serving read and write requests for the clients.
The steps involved when a client reads a file from HDFS are:
1. The client opens the file it wishes to read by calling open() on the FileSystem object (an instance of DistributedFileSystem).
2. This object connects to the NameNode using RPC and gets metadata information such as the locations of the blocks of the file. Note that these addresses are of the first few blocks of the file.
3. In response to this metadata request, the addresses of the DataNodes having a copy of each block are returned.
4. Once the addresses of the DataNodes are received, an object of type FSDataInputStream is returned to the client. FSDataInputStream contains a DFSInputStream, which takes care of interactions with the DataNodes and the NameNode. The client then invokes the read() method, which causes DFSInputStream to establish a connection with the first DataNode holding the first block of the file.
5. Data is read in the form of streams as the client invokes the read() method repeatedly. This read() process continues until the end of the block is reached.
6. Once the end of a block is reached, DFSInputStream closes the connection and moves on to locate the next DataNode for the next block.
7. Once the client has finished reading, it calls the close() method.
A minimal client-side sketch of this sequence is given after the list.
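The steps above correspond to the HDFS Java client API. The following is a minimal, illustrative read sketch using FileSystem, FSDataInputStream and Path from the Hadoop API; the path /user/hadoop/file1.txt is an assumed example, not a path from this document.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// A minimal client-side read, mirroring the steps above: open() returns an
// FSDataInputStream, and repeated read() calls stream the data block by block.
public class HdfsReadExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);       // DistributedFileSystem when fs points to hdfs://
        Path file = new Path("/user/hadoop/file1.txt");

        try (FSDataInputStream in = fs.open(file);  // NameNode is contacted for block locations
             BufferedReader reader = new BufferedReader(new InputStreamReader(in))) {
            String line;
            while ((line = reader.readLine()) != null) { // data streamed from the DataNodes
                System.out.println(line);
            }
        }                                            // close() releases the connections
    }
}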
NameNode
• The NameNode is the "controller" of the entire Hadoop cluster. The SPF (Single Point of Failure) is at the NameNode.
• The NameNode is mainly responsible for storage services.
• When you load a file into the Hadoop cluster, based on the operating system file size it estimates the required size in HDFS from the configured "block size" and "number of replications".
• After estimation is done, it checks for available space in the DataNodes (slaves), preferring nodes with:
• High configuration
• Nearest location
• Once the DataNodes have saved the file blocks, it registers the metadata (file block metadata).
The NameNode maintains the following configuration:
• Physical address of DataNodes (IP address)
• Hardware configuration
• HDFS configuration
• Block size
• Replication
• Metadata
• It serves this information to the JobTracker and Secondary NameNode whenever they request it.
SPF (Single Point of Failure)
• In a Hadoop cluster, the SPF is at the NameNode. If the NameNode is down, operations cannot be performed.
• New jobs cannot be submitted.
DataNode:
DataNodes are the slave nodes in HDFS. Unlike the NameNode, a DataNode is commodity hardware, that is, a non-expensive system which is not of high quality or high availability. The DataNode is a block server that stores the data on the local file system (ext3 or ext4).
Functions of DataNode:
These are slave daemons or processes which run on each slave machine.
The actual data is stored on DataNodes.
The DataNodes serve the low-level read and write requests from the file system's clients.
They send heartbeats to the NameNode periodically to report the overall health of HDFS; by default, this frequency is set to 3 seconds.
Secondary NameNode:
Apart from these two daemons, there is a third daemon or process called the Secondary NameNode. The Secondary NameNode works concurrently with the primary NameNode as a helper daemon. Don't be confused: the Secondary NameNode is not a backup NameNode.
The Secondary NameNode constantly reads the file system state and metadata from the RAM of the NameNode and writes it to the hard disk or the file system.
It is responsible for combining the EditLogs with the FsImage from the NameNode.
It downloads the EditLogs from the NameNode at regular intervals and applies them to the FsImage. The new FsImage is copied back to the NameNode and is used whenever the NameNode is started the next time.
JobTracker and TaskTracker heartbeats:
• If the JobTracker has received the expected number of heartbeats from a TaskTracker, it indicates perfect health.
• If the number of heartbeats is less than expected, it indicates ill health.
GFS vs HDFS

GFS: The master node receives heartbeats from chunkservers.
HDFS: The NameNode receives heartbeats from DataNodes.

GFS: Commodity hardware is used.
HDFS: Commodity hardware is used.

GFS: Multiple-writer, multiple-reader model.
HDFS: WORM (Write Once and Read Many times).

GFS: Deleted files are not reclaimed immediately; they are renamed into a hidden namespace and deleted after 3 days if not in use.
HDFS: Deleted files are renamed into a particular folder and then removed via garbage collection.
• The default number of replications for each block is three (it can be configured).
• For Block1 and Block2: Block1 has three copies including the original, and Block2 has three copies. Block1's first copy is placed on DataNode1.
• In the above diagram, if there is no dedicated node left to store Block2's third copy, that copy is stored on any available node.
• After the blocks and their replications are stored, the metadata is written to the NameNode. A sketch of how a client can inspect these block locations is shown below.
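As an illustrative sketch (not part of HDFS internals), a client can inspect which DataNodes hold the replicas of each block and request a different replication factor through the standard FileSystem API; the file path used here is an assumed example.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch: listing which DataNodes hold the replicas of each block of a file,
// and changing the replication factor for that file.
public class ReplicaInfoExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/user/hadoop/file1.txt");

        FileStatus status = fs.getFileStatus(file);
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (int i = 0; i < blocks.length; i++) {
            // Each block prints the hostnames of the DataNodes holding its replicas.
            System.out.println("Block " + i + " on: " + String.join(", ", blocks[i].getHosts()));
        }

        fs.setReplication(file, (short) 3); // request three replicas for this file
    }
}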
Different Modes of Hadoop
• Local (Standalone) Mode
• Pseudo-Distributed Mode
• Fully Distributed Mode
Hadoop Operation Modes: Once you have downloaded Hadoop, you can operate your Hadoop cluster in one of the three supported modes:
Local (Standalone) Mode: By default Hadoop is configured to run in a non-distributed mode as a single Java process, with no daemons running.
Pseudo-Distributed Mode: Each Hadoop daemon runs in a separate Java process on a single machine, simulating a distributed cluster.
Fully Distributed Mode: This mode is fully distributed, with a minimum of two or more machines as a cluster. We will come across this mode in detail in the coming chapters.
Installing Hadoop in Standalone Mode
Here we will discuss the installation of Hadoop 2.4.1 in standalone mode. There are no daemons running and everything runs in a single JVM. Standalone mode is suitable for running MapReduce programs during development, since it is easy to test and debug them.
Setting Up Hadoop: You can set Hadoop environment variables by appending the following commands to the ~/.bashrc file.
export HADOOP_HOME=/usr/local/hadoop
Before proceeding further, you need to make sure that Hadoop is working fine. Just issue the following command:
$ hadoop version
If the command prints the Hadoop version information, your Hadoop standalone mode setup is working fine. By default, Hadoop is configured to run in a non-distributed mode on a single machine.
Example: Let's check a simple example of Hadoop, using the example MapReduce jar shipped with the installation.
Step 1
Create temporary content files in the input directory. You can create this input directory anywhere you would like to work.
$ mkdir input
$ cp $HADOOP_HOME/*.txt input
$ ls -l input
It will list the following files in your input directory:
total 24
-rw-r--r-- 1 root root 15164 Feb 21 10:14 LICENSE.txt
-rw-r--r-- 1 root root   101 Feb 21 10:14 NOTICE.txt
These files have been copied from the Hadoop installation home directory. For your experiment, you can use different and larger sets of files.
Step 2
Let's start the Hadoop process to count the total number of words in all the files available in the input directory, as follows:
$ hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar wordcount input output
Step 3
Step 2 will do the required processing and save the output in the output/part-r-00000 file, which you can check by using:
$ cat output/*
It will list all the words along with their total counts across all the files available in the input directory.
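For reference, the word count job run above can also be written by hand against the Hadoop MapReduce Java API. The following is a minimal sketch that mirrors the bundled example in spirit (it is not the example's exact source code); input and output paths are passed as command-line arguments.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// A minimal word-count job: the mapper emits (word, 1) pairs and the reducer sums them.
public class WordCount {

    public static class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(value.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE);    // emit (word, 1) for every token in the line
            }
        }
    }

    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();              // add up all counts for this word
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // e.g. "input"
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // e.g. "output"
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Packaged into a jar, it can be run the same way as in Step 2, replacing the examples jar with your own.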
Installing Hadoop in Pseudo-Distributed Mode
Follow the steps given below to install Hadoop 2.4.1 in pseudo-distributed mode.
Step 1: Setting Up Hadoop. You can set Hadoop environment variables by appending the following commands to the ~/.bashrc file.
export HADOOP_HOME=/usr/local/hadoop
export YARN_HOME=$HADOOP_HOME
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
Now apply all the changes to the currently running system.
$ source ~/.bashrc
Step 2: Hadoop Configuration. You can find all the Hadoop configuration files in the location "$HADOOP_HOME/etc/hadoop". It is required to make changes in those configuration files according to your Hadoop infrastructure.
$ cd $HADOOP_HOME/etc/hadoop
In order to develop Hadoop programs in Java, you have to set the Java environment variable in the hadoop-env.sh file by replacing the JAVA_HOME value with the location of Java on your system.
export JAVA_HOME=/usr/local/jdk1.7.0_71
The following is the list of files that you have to edit to configure Hadoop.
core-site.xml: The core-site.xml file contains information such as the port number used for the Hadoop instance, memory allocated for the file system, memory limit for storing the data, and the size of the Read/Write buffers. Open core-site.xml and add the following properties between the <configuration> and </configuration> tags.
<configuration>
   <property>
      <name>fs.default.name</name>
      <value>hdfs://localhost:9000</value>
   </property>
</configuration>
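Client code picks up this value automatically when it creates a Hadoop Configuration object, because core-site.xml is loaded from the classpath. The following small Java sketch (illustrative only; the class name is invented) prints the configured default file system.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

// Sketch: how client code sees the value configured in core-site.xml above.
public class ShowDefaultFs {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();  // loads core-site.xml (and hdfs-site.xml) from the classpath
        // fs.default.name is the older property name; newer releases prefer fs.defaultFS.
        System.out.println("Configured default FS: " + conf.get("fs.default.name"));
        System.out.println("Resolved URI: " + FileSystem.getDefaultUri(conf));
    }
}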
Sample yarn-site.xml: