BDA Guess Paper Solution
Ans: - Big Data is data, but of enormous size. Big Data is a term used to describe a collection
of data that is huge in volume and yet growing exponentially with time. In short, such data is so
large and complex that none of the traditional data management tools can store or
process it efficiently.
(i) Volume – The name Big Data itself is related to a size which is enormous. Size of data plays a very
crucial role in determining value out of data. Also, whether a particular data can actually be considered as
a Big Data or not, is dependent upon the volume of data. Hence, 'Volume' is one characteristic which
needs to be considered while dealing with Big Data.
(ii) Variety – The next aspect of Big Data is its variety. Variety refers to heterogeneous sources and the
nature of data, both structured and unstructured. During earlier days, spreadsheets and databases were the
only sources of data considered by most of the applications. Nowadays, data in the form of emails,
photos, videos, monitoring devices, PDFs, audio, etc. are also being considered in the analysis
applications. This variety of unstructured data poses certain issues for storage, mining and analyzing data.
(iii) Velocity – The term 'velocity' refers to the speed of generation of data. How fast the data is
generated and processed to meet the demands, determines real potential in the data. Big Data Velocity
deals with the speed at which data flows in from sources like business processes, application logs,
networks, and social media sites, sensors, Mobile devices, etc. The flow of data is massive and
continuous.
(iv) Variability – This refers to the inconsistency which can be shown by the data at times, thus
hampering the process of being able to handle and manage the data effectively.
Ans: - Apache Hadoop MapReduce is a framework for processing large data sets in parallel
across a Hadoop cluster. Data analysis uses a two-step map and reduce process. The job
configuration supplies the map and reduce analysis functions, and the Hadoop framework provides
the scheduling, distribution, and parallelization services.
3. Define HDFS and YARN, and talk about their respective components.
Ans: - HDFS (Hadoop Distributed File System) is a distributed file system that provides access to data
across Hadoop clusters. YARN is the acronym for Yet Another Resource Negotiator; it is Hadoop's
resource management and job scheduling layer.
Following are the components that collectively form a Hadoop ecosystem:
HDFS: Hadoop Distributed File System.
YARN: Yet Another Resource Negotiator.
MapReduce: Programming-based data processing.
Spark: In-memory data processing.
PIG, HIVE: Query-based processing of data services.
HBase: NoSQL database.
Various Components of YARN
Resource Manager: YARN works through a Resource Manager, which is one per cluster, and Node
Managers, which run on all the nodes.
Node Manager: The Node Manager is responsible for the execution of tasks on each data node.
Containers: Packages of resources (memory, CPU, etc.) allocated on a single node.
Application Master: Negotiates resources from the Resource Manager and works with the Node
Managers to execute and monitor an application's tasks.
Ans: - The main task of Reducer is to reduce a larger set of data that shares a key to a smaller set of data.
In Hadoop, the Reducer has the following three core methods:
setup(): At the start of a task, the setup() method is called to configure various parameters for the Reducer.
reduce(): This is the main operation of the Reducer. In the reduce() method we define the task that has to
be done for a set of values that share a key.
cleanup(): Once the reduce() task is done, we can use cleanup() to clean any intermediate data or
temporary files (see the sketch below).
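For illustration, these three hooks look like the following in the newer org.apache.hadoop.mapreduce API (the mapred API used elsewhere in this paper does not expose setup/cleanup in the same way); this is a minimal, hedged sketch, and the class name SumReducer and the min.count parameter are illustrative assumptions:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    private int minimumCount;   // hypothetical parameter read in setup()

    @Override
    protected void setup(Context context) {
        // Called once per task, before any reduce() call: read job configuration.
        minimumCount = context.getConfiguration().getInt("min.count", 0);
    }

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // Main operation: combine all values that share the same key.
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }
        if (sum >= minimumCount) {
            context.write(key, new IntWritable(sum));
        }
    }

    @Override
    protected void cleanup(Context context) {
        // Called once after the last reduce() call: release resources,
        // delete temporary files, flush counters, etc.
    }
}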
Ans: - Big Data can bring key insight to trends in the market place and provide companies with a wealth
of information to help them make more informed decisions. It's a collection of traditional and
digital data that makes up large data sets and can be analyzed to reveal patterns, trends, and associations.
As a result, big data allows for better business decisions, increased revenue, and decreased operating
costs.
Ans: - NFS (Network File System) is a protocol that allows clients to access files over the
network. HDFS is fault tolerant because it stores multiple replicas of files on the file system; the default
replication level is 3. The major difference between the two is replication/fault tolerance. Normal file
systems have a small block size (around 512 bytes), while HDFS has a much larger block size (around
64 MB by default). Larger files require multiple disk seeks in normal file systems, while in HDFS data is
read sequentially after each individual seek.
Ans: - Apart from Resource Management, YARN also performs Job Scheduling. YARN performs all
your processing activities by allocating resources and scheduling tasks. Apache Hadoop YARN
Architecture consists of the following main components:
1. Resource Manager: Runs on a master daemon and manages the resource allocation in the
cluster.
2. Node Manager: They run on the slave daemons and are responsible for the execution of a task
on every single Data Node.
3. Application Master: Manages the user job lifecycle and resource needs of individual
applications. It works along with the Node Manager and monitors the execution of tasks.
4. Container: A package of resources including RAM, CPU, network, HDD, etc., on a single node.
10. What is Pig Grunt?
Ans: - Grunt is Pig's interactive shell. It enables users to enter Pig Latin interactively and provides a
shell for users to interact with HDFS. To enter Grunt, invoke Pig with no script or command to run.
The Grunt shell provides a set of utility commands. These include utility commands such as clear, help,
history, quit, and set; and commands such as exec, kill, and run to control Pig from the Grunt shell.
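Besides the interactive Grunt shell, the same Pig Latin statements can be driven from Java through Pig's embedded PigServer API. The following is a minimal, hedged sketch; local execution mode, the input file student.txt, and the field layout are illustrative assumptions:

import java.io.IOException;
import java.util.Iterator;
import org.apache.pig.ExecType;
import org.apache.pig.PigServer;
import org.apache.pig.data.Tuple;

public class GruntFromJava {
    public static void main(String[] args) throws IOException {
        // Run Pig Latin in local mode (use ExecType.MAPREDUCE on a cluster).
        PigServer pig = new PigServer(ExecType.LOCAL);

        // The same statements one would type at the grunt> prompt.
        pig.registerQuery("students = LOAD 'student.txt' USING PigStorage(',') "
                + "AS (name:chararray, marks:int);");
        pig.registerQuery("passed = FILTER students BY marks >= 40;");

        // Iterate over the result, much as DUMP would do in Grunt.
        Iterator<Tuple> it = pig.openIterator("passed");
        while (it.hasNext()) {
            System.out.println(it.next());
        }
        pig.shutdown();
    }
}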
Ans: - NoSQL allows for high-performance, agile processing of information at massive scale. It stores
unstructured data across multiple processing nodes, as well as across multiple servers. As such, the
NoSQL distributed database infrastructure has been the solution of choice for some of the
largest data warehouses.
Ans:-
File: WC_Mapper.java
package com.javatpoint;
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class WC_Mapper extends MapReduceBase implements Mapper<LongWritable,Text,Text,IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();
    public void map(LongWritable key, Text value, OutputCollector<Text,IntWritable> output,
                    Reporter reporter) throws IOException {
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            output.collect(word, one);
        }
    }
}
File: WC_Reducer.java
package com.javatpoint;
import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class WC_Reducer extends MapReduceBase implements Reducer<Text,IntWritable,Text,IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text,IntWritable> output,
                       Reporter reporter) throws IOException {
        int sum = 0;
        while (values.hasNext()) {
            sum += values.next().get();
        }
        output.collect(key, new IntWritable(sum));
    }
}
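To complete the word-count example, a driver class supplies the job configuration described earlier. The following is a minimal sketch using the same old org.apache.hadoop.mapred API as the mapper and reducer above; the class name WC_Runner and the command-line argument layout are illustrative assumptions:

File: WC_Runner.java
package com.javatpoint;
import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;

public class WC_Runner {
    public static void main(String[] args) throws IOException {
        JobConf conf = new JobConf(WC_Runner.class);
        conf.setJobName("WordCount");
        // Types of the final output key/value pairs.
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);
        // Supply the map and reduce functions written above.
        conf.setMapperClass(WC_Mapper.class);
        conf.setCombinerClass(WC_Reducer.class);
        conf.setReducerClass(WC_Reducer.class);
        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);
        // Input and output paths are passed on the command line.
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        JobClient.runJob(conf);
    }
}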
13. What is ZooKeeper? What are the benefits of ZooKeeper?
Ans: - Apache ZooKeeper is a service used by a cluster (group of nodes) to coordinate between
themselves and maintain shared data with robust synchronization techniques. ZooKeeper is itself a
distributed application providing services for writing a distributed application.
The common services provided by ZooKeeper are as follows −
Naming service − Identifying the nodes in a cluster by name. It is similar to DNS, but for nodes.
Configuration management − Latest and up-to-date configuration information of the system for
a joining node.
Cluster management − Joining / leaving of a node in a cluster and node status in real time.
Leader election − Electing a node as leader for coordination purposes.
Locking and synchronization service − Locking the data while modifying it. This mechanism
helps in automatic fail recovery while connecting to other distributed applications like Apache
HBase.
Highly reliable data registry − Availability of data even when one or a few nodes are down.
Benefits of ZooKeeper
Here are the benefits of using ZooKeeper −
Simple distributed coordination process
Synchronization − Mutual exclusion and co-operation between server processes. This process
helps in Apache HBase for configuration management.
Ordered messages
Serialization − Encode the data according to specific rules to ensure your application runs
consistently. This approach can be used in MapReduce to coordinate queues for executing running
threads.
Reliability
Atomicity − Data transfer either succeeds or fails completely; no transaction is partial.
Ans: - Sharding is a method for distributing data across multiple machines. MongoDB uses sharding to
support deployments with very large data sets and high throughput operations. Database systems with
large data sets or high throughput applications can challenge the capacity of a single server. For example,
high query rates can exhaust the CPU capacity of the server. Working set sizes larger than the system’s
RAM stress the I/O capacity of disk drives.
There are two methods for addressing system growth: vertical and horizontal scaling.
Vertical Scaling involves increasing the capacity of a single server, such as using a more powerful CPU,
adding more RAM, or increasing the amount of storage space. Limitations in available technology may
restrict a single machine from being sufficiently powerful for a given workload.
Horizontal Scaling involves dividing the system dataset and load over multiple servers, adding
additional servers to increase capacity as required. While the overall speed or capacity of a single
machine may not be high, each machine handles a subset of the overall workload, potentially providing
better efficiency than a single high-speed high-capacity server.
15. Difference between Apache Pig and MapReduce
Ans:-
Apache Pig: Any novice programmer with a basic knowledge of SQL can work conveniently with Apache Pig.
MapReduce: Exposure to Java is a must to work with MapReduce.
16. What is Apache Pig? Explain the Pig data model and features
Ans: - Pig is a high-level platform or tool which is used to process large datasets. It provides a high
level of abstraction for processing over MapReduce. To process the data which is stored in
HDFS, the programmers write scripts using the Pig Latin language.
Need of Pig: One limitation of MapReduce is that the development cycle is very long. Writing the mapper
and reducer, compiling and packaging the code, submitting the job and retrieving the output is a time-consuming
task. Apache Pig reduces the time of development using the multi-query approach. Also, Pig is beneficial
for programmers who are not from a Java background. 200 lines of Java code can be written in only 10
lines using the Pig Latin language. Programmers who have SQL knowledge need less effort to learn Pig
Latin.
For performing several operations Apache Pig provides rich sets of operators like the filters, join,
sort, etc.
Easy to learn, read and write. Especially for SQL-programmer, Apache Pig is a boon.
Apache Pig is extensible so that you can make your own user-defined functions and process.
Join operation is easy in Apache Pig.
Fewer lines of code.
Apache Pig allows splits in the pipeline.
The data structure is multi valued, nested and richer.
Pig can handle the analysis of both structured and unstructured data.
17. Explain the MapReduce framework in brief
Ans: - MapReduce is a processing technique and a program model for distributed computing based on
Java. The MapReduce algorithm contains two important tasks, namely Map and Reduce. Map takes a set
of data and converts it into another set of data, where individual elements are broken down into tuples
(key/value pairs). Secondly, reduce task, which takes the output from a map as an input and combines
those data tuples into a smaller set of tuples. As the sequence of the name MapReduce implies, the
reduce task is always performed after the map job.
The major advantage of MapReduce is that it is easy to scale data processing over multiple computing
nodes. Under the MapReduce model, the data processing primitives are called mappers and reducers.
Decomposing a data processing application into mappers and reducers is sometimes nontrivial. But, once
we write an application in the MapReduce form, scaling the application to run over hundreds, thousands,
or even tens of thousands of machines in a cluster is merely a configuration change. This simple
scalability is what has attracted many programmers to use the MapReduce model.
Ans:-
Spark Components
The Spark project consists of different types of tightly integrated components. At its core, Spark is a
computational engine that can schedule, distribute and monitor multiple applications.
Spark Core
o The Spark Core is the heart of Spark and performs the core functionality.
o It holds the components for task scheduling, fault recovery, interacting with storage systems and
memory management.
Spark SQL
o The Spark SQL is built on the top of Spark Core. It provides support for structured data.
o It allows querying the data via SQL (Structured Query Language) as well as the Apache Hive
variant of SQL called HQL (Hive Query Language).
o It supports JDBC and ODBC connections that establish a relation between Java objects and
existing databases, data warehouses and business intelligence tools.
o It also supports various sources of data like Hive tables, Parquet, and JSON.
Spark Streaming
o Spark Streaming is a Spark component that supports scalable and fault-tolerant processing of
streaming data.
o It uses Spark Core's fast scheduling capability to perform streaming analytics.
o It accepts data in mini-batches and performs RDD transformations on that data.
o Its design ensures that the applications written for streaming data can be reused to analyze
batches of historical data with little modification.
o The log files generated by web servers can be considered as a real-time example of a data stream.
MLlib
o MLlib is a Machine Learning library that contains various machine learning algorithms.
o These include correlations and hypothesis testing, classification and regression, clustering, and
principal component analysis.
o It is nine times faster than the disk-based implementation used by Apache Mahout.
GraphX
o GraphX is a library that is used to manipulate graphs and perform graph-parallel
computations.
o It facilitates creating a directed graph with arbitrary properties attached to each vertex and edge.
o To manipulate graphs, it supports various fundamental operators like subgraph, joinVertices, and
aggregateMessages.
Ans:-
Features of HDFS are:
Cost-effectiveness
Large datasets / variety and volume of data
Replication
Fault tolerance and reliability
High availability
Scalability
Data integrity
High throughput
Features of MapReduce
Scalability: Apache Hadoop is a highly scalable framework.
Flexibility: MapReduce programming enables companies to access new sources of data.
Security and authentication
Cost-effective solution
Fast
Simple model of programming
Parallel programming
Availability and resilient nature
Ans: - The general concept is that an application submission client submits an application to the
YARN Resource Manager (RM). This can be done by setting up a YarnClient object. After the YarnClient
is started, the client can then set up the application context, prepare the very first container of the
application that contains the Application Master (AM), and then submit the application. The YARN
Resource Manager will then launch the Application Master (as specified) on an allocated container.
During the execution of an application, the Application Master communicates with Node Managers
through NMClientAsync objects. All container events are handled by the NMClientAsync.CallbackHandler
associated with NMClientAsync. A typical callback handler handles client start, stop, status
update and error. The Application Master also reports execution progress to the Resource Manager by
handling the getProgress() method of AMRMClientAsync.CallbackHandler.
Interfaces
Client <--> Resource Manager: by using YarnClient objects.
Application Master <--> Resource Manager: by using AMRMClientAsync objects, handling events
asynchronously with AMRMClientAsync.CallbackHandler.
Application Master <--> Node Manager: launch containers and communicate with Node Managers by
using NMClientAsync objects, handling container events with NMClientAsync.CallbackHandler.
Note
The three main protocols for YARN applications (Application Client Protocol, Application Master
Protocol and Container Management Protocol) are still preserved. The three clients wrap these three
protocols to provide a simpler programming model for YARN applications.
Under very rare circumstances, a programmer may want to directly use the three protocols to
implement an application. However, note that such behaviors are no longer encouraged for
general use cases.
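A minimal, hedged sketch of the client side described above, built around YarnClient. The ApplicationMaster class name, log paths, and resource sizes are illustrative assumptions, and setter names such as setMemorySize vary slightly across Hadoop versions (older 2.x releases use setMemory):

import java.util.Collections;
import org.apache.hadoop.yarn.api.records.ApplicationId;
import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext;
import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.client.api.YarnClientApplication;
import org.apache.hadoop.yarn.conf.YarnConfiguration;
import org.apache.hadoop.yarn.util.Records;

public class SimpleYarnSubmitter {
    public static void main(String[] args) throws Exception {
        // Client side: connect to the ResourceManager.
        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(new YarnConfiguration());
        yarnClient.start();

        // Ask the RM for a new application and fill in its submission context.
        YarnClientApplication app = yarnClient.createApplication();
        ApplicationSubmissionContext appContext = app.getApplicationSubmissionContext();
        appContext.setApplicationName("demo-app");

        // Describe the first container, which will run the ApplicationMaster.
        ContainerLaunchContext amContainer = Records.newRecord(ContainerLaunchContext.class);
        amContainer.setCommands(Collections.singletonList(
                "java -Xmx256m com.example.MyApplicationMaster"   // hypothetical AM class
                + " 1>/tmp/AM.stdout 2>/tmp/AM.stderr"));
        appContext.setAMContainerSpec(amContainer);

        // Resources requested for the AM container.
        Resource capability = Records.newRecord(Resource.class);
        capability.setMemorySize(512);   // MB (Hadoop 2.8+ API)
        capability.setVirtualCores(1);
        appContext.setResource(capability);

        // Submit; the RM launches the AM on an allocated container.
        ApplicationId appId = yarnClient.submitApplication(appContext);
        System.out.println("Submitted application " + appId);
    }
}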
SECTION-B
1. Draw and Explain HDFS Architecture. Explain the function of Name node and Data
node.
Ans: - The Hadoop Distributed File System (HDFS) is the world's most reliable storage system. HDFS is the
file system of Hadoop, designed for storing very large files running on a cluster of commodity hardware.
It is designed on the principle of storing a small number of large files rather than a huge number of small
files.
Hadoop HDFS provides a fault-tolerant storage layer for Hadoop and its other components. HDFS
Replication of data helps us to attain this feature. It stores data reliably, even in the case of hardware
failure. It provides high throughput access to application data by providing the data access in parallel.
HDFS Nodes
Name Node regulates file access to the clients. It maintains and manages the slave nodes and assigns tasks
to them. Name Node executes file system namespace operations like opening, closing, and renaming files
and directories. Name Node runs on the high configuration hardware. Name Node is the centerpiece of
the Hadoop Distributed File System. It maintains and manages the file system namespace and provides
the right access permission to the clients. The Name Node stores information about block locations,
permissions, etc. on the local disk in the form of two files:
FsImage: FsImage stands for File System image. It contains the complete namespace of the Hadoop
file system since the Name Node's creation.
Edit log: It contains all the recent changes performed to the file system namespace with respect to the most
recent FsImage.
Functions of HDFS Name Node
1. It executes the file system namespace operations like opening, renaming, and closing files and
directories.
2. Name Node manages and maintains the Data Nodes.
3. It determines the mapping of blocks of a file to Data Nodes.
4. Name Node records each change made to the file system namespace.
5. It keeps the locations of each block of a file.
6. Name Node takes care of the replication factor of all the blocks.
7. Name Node receives heartbeat and block reports from all Data Nodes that ensure Data Node is alive.
8. If the Data Node fails, the Name Node chooses new Data Nodes for new replicas.
2. HDFS Slave (Data node)
There are n number of slaves (where n can be up to 1000) or Data Nodes in the Hadoop Distributed File
System that manages storage of data. These slave nodes are the actual worker nodes that do the tasks and
serve read and write requests from the file system’s clients. Data Nodes are the slave nodes in Hadoop
HDFS. Data Nodes are inexpensive commodity hardware. They store blocks of a file.
They perform block creation, deletion, and replication upon instruction from the Name Node. Once a
block is written on a Data Node, it replicates it to other Data Node, and the process continues until
creating the required number of replicas.
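The NameNode/DataNode interaction described above is hidden from applications behind Hadoop's FileSystem API: the client asks the NameNode for metadata and block locations and then streams the bytes to or from the DataNodes. A minimal, hedged sketch of such a client is shown below; the file path is an illustrative assumption and the cluster address is taken from core-site.xml:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsClientExample {
    public static void main(String[] args) throws Exception {
        // fs.defaultFS (e.g. hdfs://namenode:9000) is read from core-site.xml.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/user/demo/hello.txt");    // illustrative path

        // Write: the NameNode chooses DataNodes; blocks are then replicated by them.
        FSDataOutputStream out = fs.create(file, true);
        out.writeUTF("Hello HDFS");
        out.close();

        // Read: block locations come from the NameNode, data from the DataNodes.
        FSDataInputStream in = fs.open(file);
        System.out.println(in.readUTF());
        in.close();

        fs.close();
    }
}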
2. What is the advantage of Hadoop? Explain Hadoop Architecture and its
Components with proper Diagram
Ans: - Hadoop is an open source framework from Apache and is used to store, process and analyze data
which is very huge in volume. Hadoop is written in Java and is not OLAP (online analytical processing).
It is used for batch/offline processing. It is being used by Facebook, Yahoo, Google, Twitter, LinkedIn
and many more. Moreover, it can be scaled up just by adding nodes in the cluster.
Modules of Hadoop
1. HDFS: Hadoop Distributed File System. Google published its paper GFS and on the basis of that
HDFS was developed. It states that the files will be broken into blocks and stored in nodes over
the distributed architecture.
2. YARN: Yet Another Resource Negotiator, used for job scheduling and managing the cluster.
3. MapReduce: This is a framework which helps Java programs to do parallel computation on
data using key-value pairs. The Map task takes input data and converts it into a data set which can
be computed in key-value pairs. The output of the Map task is consumed by the Reduce task, and then
the output of the reducer gives the desired result.
4. Hadoop Common: These Java libraries are used to start Hadoop and are used by other Hadoop
modules.
Advantages of Hadoop
o Fast: In HDFS the data is distributed over the cluster and mapped, which helps in faster
retrieval. Even the tools to process the data are often on the same servers, thus reducing the
processing time. It is able to process terabytes of data in minutes and petabytes in hours.
o Scalable: A Hadoop cluster can be extended just by adding nodes to the cluster.
o Cost Effective: Hadoop is open source and uses commodity hardware to store data, so it is really
cost-effective as compared to a traditional relational database management system.
o Resilient to failure: HDFS has the property with which it can replicate data over the network, so
if one node is down or some other network failure happens, then Hadoop takes the other copy of
data and use it. Normally, data are replicated thrice but the replication factor is configurable.
Hadoop Architecture
The Hadoop architecture is a package of the file system, MapReduce engine and the HDFS (Hadoop
Distributed File System). The MapReduce engine can be MapReduce/MR1 or YARN/MR2. A Hadoop
cluster consists of a single master and multiple slave nodes. The master node includes Job Tracker, Task
Tracker, Name Node, and Data Node whereas the slave node includes Data Node and Task Tracker.
Hadoop Distributed File System
The Hadoop Distributed File System (HDFS) is a distributed file system for Hadoop. It follows a
master/slave architecture. This architecture consists of a single Name Node that performs the role of master,
and multiple Data Nodes that perform the role of slaves.
Both Name Node and Data Node are capable enough to run on commodity machines. The Java language
is used to develop HDFS. So any machine that supports Java language can easily run the Name Node and
Data Node software.
Name Node
o It is a single master server that exists in the HDFS cluster.
o As it is a single node, it may become a single point of failure.
o It manages the file system namespace by executing an operation like the opening, renaming and
closing the files.
o It simplifies the architecture of the system.
Data Node
o The HDFS cluster contains multiple Data Nodes.
o Each Data Node contains multiple data blocks.
o These data blocks are used to store data.
o It is the responsibility of Data Node to read and write requests from the file system's clients.
o It performs block creation, deletion, and replication upon instruction from the Name Node.
Job Tracker
o The role of Job Tracker is to accept the MapReduce jobs from client and process the data by
using Name Node.
o In response, Name Node provides metadata to Job Tracker
Task Tracker
o It works as a slave node for the Job Tracker.
o It receives tasks and code from the Job Tracker and applies that code to the file. This process can also
be called a Mapper.
MapReduce Layer
The MapReduce comes into existence when the client application submits the MapReduce job to Job
Tracker. In response, the Job Tracker sends the request to the appropriate Task Trackers. Sometimes, the
Task Tracker fails or time out. In such a case, that part of the job is rescheduled
Ans: - ZooKeeper is a distributed coordination service to manage a large set of hosts. Coordinating
and managing a service in a distributed environment is a complicated process. ZooKeeper solves this
issue with its simple architecture and API. ZooKeeper allows developers to focus on core application
logic without worrying about the distributed nature of the application.
The ZooKeeper framework was originally built at "Yahoo!" for accessing their applications in an easy
and robust manner. Later, Apache ZooKeeper became a standard for organized service used by Hadoop,
HBase, and other distributed frameworks. For example, Apache HBase uses ZooKeeper to track the
status of distributed data. Before going deep into the working of ZooKeeper, let us take a
look at the fundamental concepts of ZooKeeper. We will discuss the following topics −
Architecture
Hierarchical namespace
Session
Watches
Architecture of ZooKeeper
ZooKeeper follows a "Client-Server Architecture". Each one of the components that is a part of the
ZooKeeper architecture is explained below.
Client − Clients, the nodes in our distributed application cluster, access information from the server.
For a particular time interval, every client sends a message to the server to let the server know that the
client is alive. Similarly, the server sends an acknowledgement when a client connects. If there is no
response from the connected server, the client automatically redirects the message to another server.
Server − A server, one of the nodes in our ZooKeeper ensemble, provides all the services to clients.
It gives an acknowledgement to the client to inform it that the server is alive.
Ensemble − Group of ZooKeeper servers. The minimum number of nodes required to form an
ensemble is 3.
Leader − Server node which performs automatic recovery if any of the connected nodes fails.
Leaders are elected on service startup.
Types of Znodes
Znodes are categorized as persistent, sequential, and ephemeral.
Persistent znode − A persistent znode is alive even after the client which created that particular
znode is disconnected. By default, all znodes are persistent unless otherwise specified.
Ephemeral znode − Ephemeral znodes are active as long as the client is alive. When a client gets
disconnected from the ZooKeeper ensemble, the ephemeral znodes get deleted
automatically. For this reason, ephemeral znodes are not allowed to have children. If
an ephemeral znode is deleted, then the next suitable node will fill its position. Ephemeral
znodes play an important role in leader election.
Sequential znode − Sequential znodes can be either persistent or ephemeral. When a new znode
is created as a sequential znode, ZooKeeper sets the path of the znode by attaching a 10-digit
sequence number to the original name. For example, if a znode with path /myapp is
created as a sequential znode, ZooKeeper will change the path to /myapp0000000001 and set
the next sequence number as 0000000002. If two sequential znodes are created concurrently,
then ZooKeeper never uses the same number for each znode. Sequential znodes play an
important role in locking and synchronization.
Sessions
Sessions are very important for the operation of ZooKeeper. Requests in a session are executed in FIFO
order. Once a client connects to a server, the session will be established and a session id is assigned to
the client. The client sends heartbeats at a particular time interval to keep the session valid. If the
ZooKeeper ensemble does not receive heartbeats from a client for more than the period (session timeout)
specified at the start of the service, it decides that the client has died. Session timeouts are usually
represented in milliseconds. When a session ends for any reason, the ephemeral znodes created during
that session also get deleted.
Watches
Watches are a simple mechanism for the client to get notifications about changes in the ZooKeeper
ensemble. Clients can set watches while reading a particular znode. Watches send a notification to the
registered client for any change to the znode on which the client registered. Znode changes are
modifications of the data associated with the znode or changes in the znode's children. Watches are
triggered only once. If a client wants a notification again, it must be set through another read
operation. When a connection session expires, the client will be disconnected from the server and the
associated watches are also removed.
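A minimal, hedged sketch in Java tying these concepts together: a session, a persistent znode, and a one-time watch. The ensemble address, znode path, and timeout are illustrative assumptions.

import java.util.concurrent.CountDownLatch;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZkWatchExample {
    public static void main(String[] args) throws Exception {
        CountDownLatch connected = new CountDownLatch(1);

        // Session: 5000 ms timeout; client heartbeats keep it alive.
        ZooKeeper zk = new ZooKeeper("localhost:2181", 5000, new Watcher() {
            public void process(WatchedEvent event) {
                if (event.getState() == Event.KeeperState.SyncConnected) {
                    connected.countDown();
                }
            }
        });
        connected.await();

        // A persistent znode survives the session; an ephemeral one would not.
        String path = "/myapp";                               // illustrative path
        if (zk.exists(path, false) == null) {
            zk.create(path, "v1".getBytes(),
                      ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        }

        // Register a one-time watch; it fires once on the next data change.
        zk.getData(path, event ->
                System.out.println("Watch fired: " + event.getType()), null);

        zk.setData(path, "v2".getBytes(), -1);   // -1 = any version; triggers the watch
        Thread.sleep(500);                       // give the async callback time to print
        zk.close();
    }
}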
4. Discuss the concept of regions in HBase and storing Big Data with HBase
Ans: - In HBase, tables are split into regions and are served by the region servers. Regions are vertically
divided by column families into “Stores”. Stores are saved as files in HDFS. Shown below is the
architecture of HBase.
HBase has three major components: the client library, a master server, and region servers. Region
servers can be added or removed as per requirement.
Master Server
The master server -
Assigns regions to the region servers and takes the help of Apache ZooKeeper for this task.
Handles load balancing of the regions across region servers. It unloads the busy servers and shifts
the regions to less occupied servers.
Maintains the state of the cluster by negotiating the load balancing.
Is responsible for schema changes and other metadata operations such as creation of tables and
column families.
Regions
Regions are nothing but tables that are split up and spread across the region servers.
Region server
The region servers have regions that communicate with the client and handle data-related operations.
Each store in a region contains a MemStore and HFiles. The MemStore is just like a cache memory: anything
that is entered into HBase is stored here initially. Later, the data is transferred and saved in HFiles as blocks
and the MemStore is flushed.
HBase Shell
HBase contains a shell using which you can communicate with HBase. HBase uses the Hadoop File
System to store its data. It will have a master server and region servers. The data storage will be in the
form of regions (tables). These regions will be split up and stored in region servers.
The master server manages these region servers and all these tasks take place on HDFS. Given below are
some of the commands supported by HBase Shell.
General Commands
status - Provides the status of HBase, for example, the number of servers.
version - Provides the version of HBase being used.
table_help - Provides help for table-reference commands.
whoami - Provides information about the user.
Data Definition Language
These are the commands that operate on the tables in HBase.
create - Creates a table.
list - Lists all the tables in HBase.
disable - Disables a table.
is_disabled - Verifies whether a table is disabled.
enable - Enables a table.
is_enabled - Verifies whether a table is enabled.
describe - Provides the description of a table.
alter - Alters a table.
exists - Verifies whether a table exists.
drop - Drops a table from HBase.
drop_all - Drops the tables matching the 'regex' given in the command.
Java Admin API - In addition to the above commands, Java provides an Admin API to achieve
DDL functionalities through programming. Under the org.apache.hadoop.hbase.client package,
HBaseAdmin and HTableDescriptor are the two important classes that provide DDL functionalities.
Data Manipulation Language
put - Puts a cell value at a specified column in a specified row in a particular table.
get - Fetches the contents of a row or a cell.
delete - Deletes a cell value in a table.
deleteall - Deletes all the cells in a given row.
scan - Scans and returns the table data.
count - Counts and returns the number of rows in a table.
truncate - Disables, drops, and recreates a specified table.
Java client API - In addition to the above commands, Java provides a client API to achieve DML
functionalities and CRUD (Create, Retrieve, Update, Delete) operations through programming, under
the org.apache.hadoop.hbase.client package. HTable, Put, and Get are the important classes in this
package (see the sketch below).
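For illustration, a hedged sketch of the Java client API using the current Connection/Table classes (older HBase releases use HTable and HBaseAdmin directly, as mentioned above). The table name, column family, and row key are illustrative assumptions:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseCrudExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();   // reads hbase-site.xml
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("employee"))) {

            // put: write a cell value into column family "personal", column "name".
            Put put = new Put(Bytes.toBytes("row1"));
            put.addColumn(Bytes.toBytes("personal"), Bytes.toBytes("name"),
                          Bytes.toBytes("raju"));
            table.put(put);

            // get: fetch the cell back by row key.
            Get get = new Get(Bytes.toBytes("row1"));
            Result result = table.get(get);
            byte[] value = result.getValue(Bytes.toBytes("personal"), Bytes.toBytes("name"));
            System.out.println("name = " + Bytes.toString(value));
        }
    }
}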
HBase provides low-latency random reads and writes on top of HDFS. In HBase, tables are dynamically
distributed by the system whenever they become too large to handle (auto-sharding). The simplest and
foundational unit of horizontal scalability in HBase is a region. A continuous, sorted set of rows that are
stored together is referred to as a region (a subset of table data). HBase architecture has a single HBase
master node (HMaster) and several slaves, i.e. region servers. Each region server (slave) serves a set of
regions, and a region can be served only by a single region server. Whenever a client sends a write
request, HMaster receives the request and forwards it to the corresponding region server.
HBase architecture has 3 important components: HMaster, Region Server and ZooKeeper.
HMaster
The HBase HMaster is a lightweight process that assigns regions to region servers in the Hadoop cluster for
load balancing. Responsibilities of HMaster include coordinating the region servers, assigning and
re-assigning regions for recovery or load balancing, monitoring all region server instances, and handling
DDL operations such as creating and deleting tables.
Region Server
These are the worker nodes which handle read, write, update, and delete requests from clients. The Region
Server process runs on every node in the Hadoop cluster. A Region Server runs on an HDFS Data Node and
consists of the following components:
Block Cache – This is the read cache. The most frequently read data is stored in the read cache, and
whenever the block cache is full, the least recently used data is evicted.
MemStore – This is the write cache and stores new data that has not yet been written to disk. Every
column family in a region has a MemStore.
Write Ahead Log (WAL) – A file that stores new data that is not yet persisted to permanent storage.
HFile – The actual storage file that stores the rows as sorted key values on disk.
ZooKeeper
HBase uses ZooKeeper as a distributed coordination service for region assignments and to recover from any
region server crashes by loading the affected regions onto other region servers that are functioning. ZooKeeper
is a centralized monitoring server that maintains configuration information and provides distributed
synchronization. Whenever a client wants to communicate with regions, it has to approach ZooKeeper
first. HMaster and Region Servers are registered with the ZooKeeper service; a client needs to access the
ZooKeeper quorum in order to connect with Region Servers and HMaster. In case of node failure within an
HBase cluster, the ZooKeeper quorum will trigger error messages and start repairing failed nodes.
Ans: - Apache Spark is a general-purpose cluster computing system. It provides high-level APIs in
Java, Scala, Python, and R. Spark provides an optimized engine that supports general execution graphs. It
also has abundant high-level tools for structured data processing, machine learning, graph processing and
streaming. Spark can either run alone or on an existing cluster manager. Now since we have some
understanding of Spark, let us dive deeper and understand its components. Apache Spark consists of the
Spark Core Engine, Spark SQL, Spark Streaming, MLlib, GraphX and SparkR. You can use the Spark Core
Engine along with any of the other five components mentioned above. It is not necessary to use all
the Spark components together. Depending on the use case and application, any one or more of these can
be used along with Spark Core.
Apache Spark Ecosystem – Spark Core, Spark SQL, Spark Streaming, MLlib, GraphX, SparkR.
Apache Spark Core
All the functionalities being provided by Apache Spark are built on the top of Spark Core. It delivers
speed by providing in-memory computation capability. Thus Spark Core is the foundation of parallel and
distributed processing of huge dataset.
The key features of Apache Spark Core are:
It is in charge of essential I/O functionalities.
Significant in programming and observing the role of the Spark cluster.
Task dispatching.
Fault recovery.
It overcomes the snag of MapReduce by using in-memory computation.
Spark Core is embedded with a special collection called RDD (resilient distributed dataset). RDD is
among the abstractions of Spark. Spark RDD handles partitioning data across all the nodes in a cluster. It
holds them in the memory pool of the cluster as a single unit. There are two operations performed on
RDDs: Transformation and Action-
Transformation: It is a function that produces new RDD from the existing RDDs.
Action: In Transformation, RDDs are created from each other. But when we want to work with the
actual dataset, then, at that point we use Action.
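For example, a minimal word count in the Spark 2.x+ Java RDD API shows both kinds of operations: the transformations are lazy, and only the final action triggers execution. The input path, app name, and local master are illustrative assumptions.

import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class RddWordCount {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("RddWordCount").setMaster("local[*]");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Transformations (lazy): each step builds a new RDD from an existing one.
        JavaRDD<String> lines = sc.textFile("input.txt");          // illustrative path
        JavaRDD<String> words = lines.flatMap(
                line -> Arrays.asList(line.split(" ")).iterator());
        JavaPairRDD<String, Integer> counts = words
                .mapToPair(word -> new Tuple2<>(word, 1))
                .reduceByKey(Integer::sum);

        // Action: triggers the actual computation on the cluster (or locally here).
        counts.collect().forEach(pair ->
                System.out.println(pair._1() + " : " + pair._2()));

        sc.stop();
    }
}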
The Spark SQL component is a distributed framework for structured data processing. Using Spark SQL,
Spark gets more information about the structure of data and the computation. With this information,
Spark can perform extra optimization. It uses same execution engine while computing an output. It does
not depend on API/ language to express the computation.
Spark SQL works to access structured and semi-structured information. It also enables powerful,
interactive, analytical application across both streaming and historical data. Spark SQL is Spark module
for structured data processing. Thus, it acts as a distributed SQL query engine.
Features of Spark SQL include:
Cost based optimizer
Mid query fault-tolerance
Full compatibility
Data Frames and SQL provide a common way to access a variety of data sources. It includes Hive,
Avro, Parquet, ORC, JSON, and JDBC.
Provision to carry structured data inside Spark programs, using either SQL or a familiar Data Frame
API.
Apache Spark Streaming
Micro-batching is a technique that allows a process or task to treat a stream as a sequence of small
batches of data. Hence Spark Streaming, groups the live data into small batches. It then delivers it to the
batch system for processing.
There are 3 phases of Spark Streaming:
a. GATHERING
The Spark Streaming provides two categories of built-in streaming sources:
Basic sources: These are the sources which are available in the Streaming Context API. Examples:
file systems, and socket connections.
Advanced sources: These are the sources like Kafka, Flume, Kinesis, etc. are available through extra
utility classes. Hence Spark access data from different sources like Kafka, Flume, Kinesis, or TCP
sockets.
b. PROCESSING
The gathered data is processed using complex algorithms expressed with a high-level function. For
example, map, reduce, join and window.
c. DATA STORAGE
Spark Streaming also provides a high-level abstraction known as a discretized stream, or DStream. A
DStream in Spark signifies a continuous stream of data. We can form DStreams in two ways: either from
sources such as Kafka, Flume, and Kinesis, or by applying high-level operations on other DStreams. Thus, a
DStream is internally a sequence of RDDs.
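A minimal, hedged sketch of these phases with the Java streaming API: gathering from a basic socket source, processing each micro-batch with RDD-style transformations, and printing the result. The host, port, and 1-second batch interval are illustrative assumptions.

import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaReceiverInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public class SocketWordCountStream {
    public static void main(String[] args) throws InterruptedException {
        SparkConf conf = new SparkConf()
                .setAppName("SocketWordCountStream").setMaster("local[2]");
        // Micro-batch interval of 1 second: the live stream becomes a sequence of RDDs.
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(1));

        // Basic source: a TCP socket (e.g. fed by netcat on port 9999).
        JavaReceiverInputDStream<String> lines = jssc.socketTextStream("localhost", 9999);

        // The same RDD-style transformations are applied to every micro-batch.
        JavaDStream<String> words = lines.flatMap(
                line -> Arrays.asList(line.split(" ")).iterator());
        words.countByValue().print();

        jssc.start();                // start receiving and processing
        jssc.awaitTermination();     // run until stopped
    }
}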
The motive behind MLlib's creation is to make machine learning scalable and easy. It contains machine
learning libraries that have implementations of various machine learning algorithms, for
example clustering, regression, classification and collaborative filtering. Some lower-level machine
learning primitives, like the generic gradient descent optimization algorithm, are also present in MLlib.
In Spark version 2.0 the RDD-based API in the spark.mllib package entered maintenance mode. In this
release, the DataFrame-based API is the primary machine learning API for Spark. So, from now on, MLlib
will not add any new features to the RDD-based API.
The reason MLlib is switching to the DataFrame-based API is that it is more user-friendly than RDDs. Some
of the benefits of using DataFrames are that it includes Spark data sources, SQL/DataFrame
queries, Tungsten and Catalyst optimizations, and uniform APIs across languages. MLlib also uses the
linear algebra package Breeze. Breeze is a collection of libraries for numerical computing and machine
learning.
GraphX in Spark is an API for graphs and graph-parallel execution. It is a network graph analytics engine
and data store. Clustering, classification, traversal, searching, and pathfinding are also possible on graphs.
Furthermore, GraphX extends the Spark RDD by bringing in a new Graph abstraction:
a directed multigraph with properties attached to each vertex and edge.
GraphX also optimizes the way in which we can represent vertices and edges when they are primitive data
types. To support graph computation it supports fundamental operators (e.g., subgraph, joinVertices,
and aggregateMessages) as well as an optimized variant of the Pregel API.
Apache SparkR
SparkR was introduced in the Apache Spark 1.4 release. The key component of SparkR is the SparkR
DataFrame. DataFrames are a fundamental data structure for data processing in R.
The concept of DataFrames extends to other languages with libraries like Pandas etc.
R also provides software facilities for data manipulation, calculation, and graphical display. Hence, the
main idea behind SparkR was to explore different techniques to integrate the usability of R with the
scalability of Spark. It is an R package that gives a lightweight frontend to use Apache Spark from R.
There are various benefits of SparkR:
Data Sources API: By tying into Spark SQL's data sources API, SparkR can read in data from a
variety of sources, for example Hive tables, JSON files, Parquet files etc.
DataFrame optimizations: SparkR DataFrames also inherit all the optimizations made to the
computation engine in terms of code generation and memory management.
Scalability to many cores and machines: Operations that execute on SparkR DataFrames get
distributed across all the cores and machines available in the Spark cluster. As a result, SparkR
DataFrames can run on terabytes of data and clusters with thousands of machines.
Apache Spark, being an open-source framework for big data, has various advantages over other big data
solutions: Apache Spark is dynamic in nature, it supports in-memory computation of RDDs, and it
provides reusability, fault tolerance, real-time stream processing and many more.
Apache Spark is a lightning-fast, in-memory data processing engine. Spark is mainly designed for data science,
and the abstractions of Spark make it easier. Apache Spark provides high-level APIs in Java, Scala,
Python and R; it also has an optimized engine for general execution graphs. Apache Spark is one of the
largest open source projects in data processing.
Features of Apache Spark
a. Swift Processing
Using Apache Spark, we achieve a high data processing speed of about 100x faster in memory and 10x
faster on disk. This is made possible by reducing the number of read/write operations to disk.
b. Dynamic in Nature
We can easily develop a parallel application, as Spark provides 80 high-level operators.
c. In-Memory Computation in Spark
With in-memory processing, we can increase the processing speed. Here the data is cached, so we
need not fetch data from the disk every time; thus time is saved. Spark has a DAG execution engine
which facilitates in-memory computation and acyclic data flow, resulting in high speed.
d. Reusability
We can reuse the Spark code for batch-processing, join stream against historical data or run ad-hoc
queries on stream state.
e. Fault Tolerance in Spark
Apache Spark provides fault tolerance through Spark abstraction-RDD. Spark RDDs are designed to
handle the failure of any worker node in the cluster. Thus, it ensures that the loss of data reduces to zero.
f. Real-Time Stream Processing
Spark has a provision for real-time stream processing. Earlier the problem with Hadoop Map Reduce was
that it can handle and process data which is already present, but not the real-time data. But with Spark
Streaming we can solve this problem.
g. Lazy Evaluation in Apache Spark
All the transformations we make in Spark RDD are Lazy in nature that is it does not give the result right
away rather a new RDD is formed from the existing one. Thus, this increases the efficiency of the system.
h. Support Multiple Languages
In Spark, there is support for multiple languages like Java, R, Scala, and Python. Thus, it provides
dynamicity and overcomes the limitation of Hadoop that it can build applications only in Java.
Developers from over 50 companies were involved in making of Apache Spark. This project was
initiated in the year 2009 and is still expanding and now there are about 250 developers who contributed
to its expansion. It is the most important project of Apache Community.
Spark comes with dedicated tools for streaming data, interactive/declarative queries, and machine
learning which add-on to map and reduce.
Ans:- we are going to cover the features of HDFS. Hadoop HDFS has the features like Fault Tolerance,
Replication, Reliability, High Availability, Distributed Storage, Scalability etc.
Hadoop distributed file system (HDFS) is the primary storage system of Hadoop. It stores very large
files running on a cluster of commodity hardware. HDFS is based on GFS (Google File System). It stores
data reliably even in the case of hardware failure. HDFS also provides high-throughput access to the
application by accessing in parallel.
Fault Tolerance
The fault tolerance in Hadoop HDFS is the working strength of a system in unfavorable conditions. It is
highly fault-tolerant. Hadoop framework divides data into blocks. After that creates multiple copies of
blocks on different machines in the cluster. So, when any machine in the cluster goes down, then a client
can easily access their data from the other machine which contains the same copy of data blocks.
High Availability
Hadoop HDFS is a highly available file system. In HDFS, data gets replicated among the nodes in
the Hadoop cluster by creating a replica of the blocks on the other slaves present in HDFS cluster. So,
whenever a user wants to access this data, they can access their data from the slaves which contain its
blocks. At the time of unfavorable situations like a failure of a node, a user can easily access their data
from the other nodes. Because duplicate copies of blocks are present on the other nodes in the HDFS
cluster.
High Reliability
HDFS provides reliable data storage. It can store data in the range of 100s of petabytes. HDFS stores data
reliably on a cluster. It divides the data into blocks. Hadoop framework stores these blocks on nodes
present in HDFS cluster. HDFS stores data reliably by creating a replica of each and every block present
in the cluster. Hence provides fault tolerance facility. If the node in the cluster containing data goes down,
then a user can easily access that data from the other nodes. HDFS by default creates 3 replicas of each
block containing data present in the nodes. So, data is quickly available to the users. Hence user does not
face the problem of data loss. Thus, HDFS is highly reliable.
Replication
Data replication is a unique feature of HDFS. Replication solves the problem of data loss in
unfavorable conditions like hardware failure, crashing of nodes etc. HDFS maintains the process of
replication at regular intervals of time. HDFS also keeps creating replicas of user data on different machines
present in the cluster. So, when any node goes down, the user can access the data from other machines.
Thus, there is no possibility of losing user data.
Scalability
Hadoop HDFS stores data on multiple nodes in the cluster. So, whenever requirements increase you can
scale the cluster. Two scalability mechanisms are available in HDFS: Vertical and Horizontal
Scalability.
Distributed Storage
All the features in HDFS are achieved via distributed storage and replication. HDFS store data in a
distributed manner across the nodes. In Hadoop, data is divided into blocks and stored on the nodes
present in the HDFS cluster. After that HDFS create the replica of each and every block and store on
other nodes. When a single machine in the cluster gets crashed we can easily access our data from the
other nodes which contain its replica.
Ans:-
Real-time processing.
It's not always very easy to implement each and everything as a MR program.
When you’re intermediate processes need to talk to each other (jobs run in isolation).
When you’re processing requires lot of data to be shuffled over the network.
When you need to handle streaming data. MR is best suited to batch process huge amounts of data
which you already have with you.
When you can get the desired result with a standalone system. It's obviously less painful to configure
and manage a standalone system as compared to a distributed system.
When you have OLTP needs. MR is not suitable for a large number of short on-line transactions.
Ans: - Hive is a data warehouse infrastructure tool to process structured data in Hadoop. It resides on
top of Hadoop to summarize Big Data, and makes querying and analyzing easy.
Initially Hive was developed by Facebook; later the Apache Software Foundation took it up and
developed it further as an open source under the name Apache Hive. It is used by different companies.
For example, Amazon uses it in Amazon Elastic MapReduce.
Hive is not:
A relational database
A design for Online Transaction Processing (OLTP)
A language for real-time queries and row-level updates
Features of Hive
I. Framework
iii. Warehouse
v. Declarative language
vii. Multi-user
However, to perform more detailed data analysis, Hive allows writing custom MapReduce framework
processes.
x. Data Formats
xi. Storage
Hive allows access to files stored in HDFS and in other similar data storage systems such as Apache
HBase.
xii. Format conversion
Moreover, it allows converting between a variety of formats within Hive, which is very simple and
possible.
Features of Pig
Apache Pig comes with the following features −
Rich set of operators − It provides many operators to perform operations like join, sort, filter,
etc.
Ease of programming − Pig Latin is similar to SQL and it is easy to write a Pig script if you are
good at SQL.
Optimization opportunities − The tasks in Apache Pig optimize their execution automatically,
so the programmers need to focus only on the semantics of the language.
Extensibility − Using the existing operators, users can develop their own functions to read,
process, and write data.
UDF’s − Pig provides the facility to create User-defined Functions in other programming
languages such as Java and invoke or embed them in Pig Scripts.
Handles all kinds of data − Apache Pig analyzes all kinds of data, both structured as well as
unstructured. It stores the results in HDFS.
9. Write a short note on NoSQL databases. List the differences between NoSQL and
relational databases
Ans: - When it comes to choosing a database, the biggest decision is picking a relational (SQL) or non-
relational (NoSQL) data structure. While both databases are viable options, there are certain key
differences between the two that users must keep in mind when making a decision.
1. Type –
SQL databases are primarily called Relational Databases (RDBMS), whereas NoSQL
databases are primarily called non-relational or distributed databases.
2. Language – SQL databases define and manipulate data based on structured query language (SQL).
Seen from one side, this language is extremely powerful. SQL is one of the most versatile and
widely-used options available, which makes it a safe choice, especially for great complex queries.
But from the other side, it can be restrictive. SQL requires you to use predefined schemas to determine
the structure of your data before you work with it. Also, all of your data must follow the same
structure. This can require significant up-front preparation, which means that a change in the
structure would be both difficult and disruptive to your whole system. A NoSQL database has a
dynamic schema for unstructured data. Data is stored in many ways, which means it can be
document-oriented, column-oriented, graph-based or organized as a key-value store. This
flexibility means that documents can be created without having a defined structure first. Also, each
document can have its own unique structure. The syntax varies from database to database, and you
can add fields as you go.
3. The Scalability – In almost all situations SQL databases are vertically scalable. This means that you
can increase the load on a single server by increasing things like RAM, CPU or SSD. But on the
other hand, NoSQL databases are horizontally scalable. This means that you handle more traffic by
sharding, or adding more servers to your NoSQL database. It is similar to adding more floors to the
same building versus adding more buildings to the neighborhood. Thus NoSQL can ultimately
become larger and more powerful, making these databases the preferred choice for large or ever-
changing data sets.
4. The Structure – SQL databases are table-based; on the other hand, NoSQL databases are either key-
value pairs, document-based, graph databases or wide-column stores. This makes relational SQL
databases a better option for applications that require multi-row transactions, such as an accounting
system, or for legacy systems that were built for a relational structure.
5. Property followed –
SQL databases follow ACID properties (Atomicity, Consistency, Isolation and Durability), whereas
NoSQL databases follow the Brewer's CAP theorem (Consistency, Availability and Partition
tolerance).
6. Support –
Great support is available for all SQL databases from their vendors. Also, a lot of independent
consultants are there who can help you with SQL databases for very large scale deployments, but
for some NoSQL databases you still have to rely on community support, and only limited outside
experts are available for setting up and deploying your large scale NoSQL deployments. Some
examples of SQL databases include PostgreSQL, MySQL, Oracle and Microsoft SQL Server. NoSQL
database examples include Redis, RavenDB, Cassandra, MongoDB, BigTable, HBase, Neo4j
and CouchDB.
NoSQL databases (aka "not only SQL") are non-tabular, and store data differently than relational
tables. NoSQL databases come in a variety of types based on their data model. The main types are
document, key-value, wide-column, and graph databases.
Difference between SQL and NoSQL:
Definition – SQL databases are primarily called RDBMS or relational databases; NoSQL databases are
primarily called non-relational or distributed databases.
Designed for – Traditional RDBMS use SQL syntax and queries to analyze and get the data for further
insights; they are used for OLAP systems. NoSQL database systems consist of various kinds of database
technologies that were developed in response to the demands of modern application development.
Query language – SQL uses the Structured Query Language; NoSQL has no declarative query language.
Type – SQL databases are table based; NoSQL databases can be document based, key-value pairs or
graph databases.
Schema – SQL databases have a predefined schema; NoSQL databases use a dynamic schema for
unstructured data.
Ability to scale – SQL databases are vertically scalable; NoSQL databases are horizontally scalable.
Examples – SQL: Oracle, Postgres and MS-SQL. NoSQL: MongoDB, Redis, Neo4j, Cassandra and HBase.
Best suited for – SQL is an ideal choice for the complex-query-intensive environment; NoSQL is not a
good fit for complex queries.
Hierarchical data storage – SQL databases are not suitable for hierarchical data storage; NoSQL is more
suitable for a hierarchical data store as it supports the key-value pair method.
Variations – SQL: one type with minor variations. NoSQL: many different types, which include
key-value stores, document databases, and graph databases.
Development year – SQL was developed in the 1970s to deal with issues of flat file storage; NoSQL was
developed in the late 2000s to overcome the issues and limitations of SQL databases.
Open-source – SQL is a mix of open-source (e.g. Postgres and MySQL) and commercial (e.g. Oracle
Database); NoSQL is open-source.
Consistency – SQL should be configured for strong consistency; for NoSQL it depends on the DBMS, as
some offer strong consistency (e.g. MongoDB) whereas others offer only eventual consistency
(e.g. Cassandra).
Best used for – an RDBMS is the right option for solving ACID problems; NoSQL is best used for
solving data availability problems.
Importance – use SQL when data validity is super important; use NoSQL when it is more important to
have fast data than correct data.
Best option – SQL when you need to support dynamic queries; NoSQL when you need to scale based on
changing requirements.
Hardware – SQL: specialized DB hardware (Oracle Exadata, etc.). NoSQL: commodity hardware.
Network – SQL: highly available network (InfiniBand, FabricPath, etc.). NoSQL: commodity network
(Ethernet, etc.).
Storage type – SQL: highly available storage (SAN, RAID, etc.). NoSQL: commodity drive storage
(standard HDDs, JBOD).
Best features – SQL: cross-platform support, secure and free. NoSQL: easy to use, high performance and
flexible.
Top companies using – SQL: Hootsuite, CircleCI, Gauges. NoSQL: Airbnb, Uber, Kickstarter.
Average salary – The average salary for a SQL developer is about $84,328 per year in the USA; the
average salary for a NoSQL developer is approximately $72,174 per year.
ACID vs BASE model – ACID (Atomicity, Consistency, Isolation and Durability) is the standard for
RDBMS; BASE (Basically Available, Soft state, Eventually consistent) is the model of many NoSQL systems.
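To make the schema difference above concrete, here is a minimal, hedged sketch in Java that inserts the same record into a relational table through JDBC and into a schemaless document store through the MongoDB Java driver. The connection strings, database names, table/collection names and credentials are placeholder assumptions for illustration, not values taken from this paper.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import org.bson.Document;
    import com.mongodb.client.MongoClient;
    import com.mongodb.client.MongoClients;
    import com.mongodb.client.MongoCollection;

    public class SqlVsNoSqlExample {
        public static void main(String[] args) throws Exception {
            // SQL: the table must already exist with a fixed schema, e.g. users(name VARCHAR, age INT)
            try (Connection conn = DriverManager.getConnection(
                     "jdbc:mysql://localhost:3306/test", "user", "password");
                 PreparedStatement ps = conn.prepareStatement(
                     "INSERT INTO users (name, age) VALUES (?, ?)")) {
                ps.setString(1, "Raja");
                ps.setInt(2, 30);
                ps.executeUpdate();
            }

            // NoSQL (MongoDB): no predefined schema; a second document can carry an extra field
            try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
                MongoCollection<Document> users =
                    client.getDatabase("test").getCollection("users");
                users.insertOne(new Document("name", "Raja").append("age", 30));
                users.insertOne(new Document("name", "Mohammad").append("age", 45)
                                      .append("city", "Hyderabad")); // new field, no ALTER TABLE needed
            }
        }
    }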
10. Explain working of HIVE with proper steps and Diagram
Ans: - The major components of Hive and its interaction with the Hadoop are demonstrated in the figure
below and all the components are described further:
User Interface (UI) – As the name suggests, the user interface provides an interface between the user
and Hive. It enables users to submit queries and other operations to the system. The Hive Web UI, the
Hive command line, and Hive HDInsight (on Windows Server) are supported user interfaces.
Driver – Queries submitted through the interface are received by the driver within Hive. The driver
implements the concept of session handles and provides execute and fetch APIs modeled on
JDBC/ODBC interfaces.
Compiler – The compiler parses the query and performs semantic analysis on the different query blocks
and query expressions. It eventually generates an execution plan with the help of the table and partition
metadata obtained from the metastore.
Metastore – The metastore holds all the structural information of the different tables and partitions in
the warehouse, including column and column-type details, the serializers and deserializers (SerDes)
necessary to read and write data, and the corresponding HDFS files where the data is stored. Hive uses a
relational database server to store this schema or metadata of databases, tables, attributes of a table,
data types, and HDFS mappings.
Execution Engine – The execution engine carries out the execution plan created by the compiler. The
plan is a DAG of stages; the execution engine manages the dependencies between the various stages of
the plan and executes these stages on the appropriate system components.
Diagram – Architecture of Hive built on top of Hadoop. The above diagram also demonstrates, step by
step, the job execution flow in Hive with Hadoop.
Step-1: Execute Query – A Hive interface such as the command line or the web user interface delivers
the query to the driver for execution. Here the UI calls the execute interface of the driver over JDBC or
ODBC.
Step-2: Get Plan – The driver creates a session handle for the query and passes the query to the
compiler to generate an execution plan. In other words, the driver interacts with the compiler.
Step-3: Get Metadata – The compiler sends a metadata request to the metastore and receives the
necessary metadata from it.
Step-4: Send Metadata – The metastore sends the metadata back to the compiler as an acknowledgement.
Step-5: Send Plan – The compiler sends the generated execution plan back to the driver.
Step-6: Execute Plan – The driver sends the execution plan to the execution engine.
Step-7: Execute Job – The execution engine runs the plan as a job on Hadoop, performing any required
metadata operations against the metastore along the way.
Step-8: Fetch and Send Results – When the result is retrieved from the data nodes, the execution engine
returns the result to the driver, and the driver sends it on to the user interface (UI).
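The JDBC path through the driver described above can be exercised with a few lines of Java. This is only a hedged sketch: the HiveServer2 host/port, database and the table name sample_table are placeholder assumptions, not values taken from this paper.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HiveQueryExample {
        public static void main(String[] args) throws Exception {
            // Hive JDBC driver shipped in the hive-jdbc artifact
            Class.forName("org.apache.hive.jdbc.HiveDriver");
            // placeholder HiveServer2 endpoint and database
            try (Connection conn = DriverManager.getConnection(
                     "jdbc:hive2://localhost:10000/default", "", "");
                 Statement stmt = conn.createStatement()) {
                // the query follows the flow above: UI/driver -> compiler -> execution engine -> HDFS
                ResultSet rs = stmt.executeQuery("SELECT COUNT(*) FROM sample_table");
                while (rs.next()) {
                    System.out.println("row count = " + rs.getLong(1));
                }
            }
        }
    }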
Ans: - Big data is a term used to define massive amounts of data on a large scale, be it structured,
semi-structured or unstructured, coming from several sources such as media and public data, sensor data,
warehouse data, etc., and differing in format: .txt and .csv files, image files, HTML files, and so on.
Data is collected and prepared at a very rapid rate with the help of superfast, high-powered computers
for real-time and wide-ranging applications. To turn this data into action and information, analysis of the
data is required, and this is where big data analytics steps in: we can use the three main characteristics
(Volume, Velocity and Variety) to analyze large sets of data in order to ensure accurate information.
Increasingly noticeable changes in the weather have become a serious concern. Day-by-day fluctuations
in the weather draw the attention not only of meteorologists but also of analysts, especially for
forecasting. It is also an interesting topic for researchers who want to explore and understand the reasons
behind the weather: what is going to happen tomorrow and in the time ahead. Studying changes in the
weather brings numerous advantages such as saving lives, reducing risk, and improving profits and the
quality of weather-dependent life. To forecast the weather, we need to analyze huge amounts of data, and
thus big data acts as a trump card that provides many leads, in advance, for forthcoming natural disasters
like heavy rainfall, thunderstorms, tornadoes, tsunamis, etc.
Our day-to-day life depends directly or indirectly on the weather in terms of the economy and the
environment; it affects us through various factors such as events, timing, duration and location. Taking
these factors into consideration, weather forecasting works with the parameters temperature, humidity
and wind speed. Weather forecasting is a complex and challenging phenomenon, and the interaction
between these factors and parameters must be captured to produce a trustworthy forecast. Accurate
forecasting is an essential task driven by big data analytics; in the era of data, many techniques have been
developed in the analytics domain for correct and quicker results, thanks to analytical algorithms.
Weather applications are typical data-analytics applications: by knowing the accurate state of the
weather from data, forecasts can be used to solve many unusual problems in many businesses:
1. In agriculture, forecasts are required to decide when to plant, irrigate and harvest crops on time.
Weather forecasting also warns of approaching floods, in which case it is advisable to harvest the
crop early even if only 60% of the crop has matured. Similarly, an indication that the rainy season
is starting helps farmers to sow crops on time.
2. In sports, weather prediction has its own role: many applications tell organizers where to play,
within how many days, what the best time would be, and what the climate of the venue is expected
to be when the game is held.
3. In forestry, proper prediction is required for preventing and controlling wildfires and for the safety
of wildlife; the circumstances under which harmful insects spread can also be predicted.
Many other organizations also depend on weather forecasting and demand accurate weather predictions
for smooth functioning without any disruption; airport control management, construction work and
utility companies are examples of places where weather predictions are essential.
Along with these business uses, weather predictions have an impactful effect in predicting or estimating
natural disasters such as floods, volcanic eruptions, thunderstorms and heavy rainfall. Big data analytics
can contribute plenty of information and insights about disasters; it can be used to track daily climatic
conditions and catastrophic events and to give warnings about tsunamis, hurricanes, etc. On your mobile
phone you will have seen barometer, gyroscope and other sensor- and IoT-based apps that record data
such as wind pressure, wind speed, precipitation, temperature and humidity for a particular location,
together with the time at which the data was recorded; this is exactly the data required for weather
predictions. All industries are affected by the weather without exception. An organization can make
smart decisions and strategies about the future by following the weather impact in advance; to do this it
needs to unite its proprietary data with weather data in order to gain a wider understanding of how to
predict and how to influence business outcomes. Organizations in retail, transportation, distribution, etc.
are major industries that use such analytics to decide how to staff, plan for demand and reduce damage,
and they also have an opportunity to use weather data strategically.
IBM's Deep Thunder: a weather prediction application
Deep Thunder is a well-known weather forecasting application powered by big data. It gives forecasts
for extremely specific locations, such as a single city or a single airport, so local authorities get signs of
danger in real time and can manage their work accordingly.
IBM Deep Thunder: preparing for harvesting on time with modeling techniques
Deep Thunder can yield much important information, such as evaluating the areas where floods are most
likely to occur, estimating the direction and scale of tropical storms, determining the amount of heavy
snow or rainfall and where power lines may drop in an area, estimating areas where roads and bridges
may be damaged, and many more.
12. What is HBase? Explain the storage mechanism of HBase with an example
Ans: - HBase is a distributed column-oriented database built on top of the Hadoop file system. It is an
open-source project and is horizontally scalable. HBase is a data model similar to Google's Bigtable,
designed to provide quick random access to huge amounts of structured data. It leverages the fault
tolerance provided by the Hadoop Distributed File System (HDFS). It is a part of the Hadoop ecosystem
that provides random real-time read/write access to data in the Hadoop File System. One can store data
in HDFS either directly or through HBase. Data consumers read/access the data in HDFS randomly
using HBase. HBase sits on top of the Hadoop File System and provides read and write access.
HDFS vs HBase:
HDFS is a distributed file system suitable for storing large files; HBase is a database built on top of HDFS.
HDFS does not support fast individual record lookups; HBase provides fast lookups for larger tables.
HDFS provides high-latency batch processing with no concept of random reads and writes; HBase
provides low-latency access to single rows from billions of records (random access).
HDFS provides only sequential access to data; HBase internally uses hash tables, provides random
access, and stores the data in indexed HDFS files for faster lookups.
Storage Mechanism in HBase
HBase is a column-oriented database and the tables in it are sorted by row. The table schema defines
only column families, which are the key value pairs. A table has multiple column families and each
column family can have any number of columns. Subsequent column values are stored contiguously on
the disk. Each cell value of the table has a timestamp. In short, in an HBase:
[Figure: layout of an HBase table – each row key maps to column families, and each column family contains columns col1, col2, col3, ...]
A row-oriented database is suitable for Online Transaction Processing (OLTP); a column-oriented
database is suitable for Online Analytical Processing (OLAP).
Row-oriented databases are designed for a small number of rows and columns; column-oriented
databases are designed for huge tables.
The following image shows column families in a column-oriented database.
HBase is schema-less; it does not have the concept of a fixed-column schema and defines only column
families. An RDBMS is governed by its schema, which describes the whole structure of its tables.
HBase is built for wide tables and is horizontally scalable. An RDBMS is thin, built for small tables, and
hard to scale.
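To make the storage mechanism concrete (the question asks for an example), here is a small, hedged sketch using the HBase Java client API. The table name 'employee' and the column family 'personal' are assumptions for illustration, and the cluster settings are assumed to come from hbase-site.xml on the classpath.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HBaseStorageExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create(); // reads hbase-site.xml
            try (Connection conn = ConnectionFactory.createConnection(conf);
                 Table table = conn.getTable(TableName.valueOf("employee"))) {
                // a cell is addressed by (row key, column family, column qualifier),
                // and every cell value carries a timestamp, as described above
                Put put = new Put(Bytes.toBytes("row1"));
                put.addColumn(Bytes.toBytes("personal"), Bytes.toBytes("name"), Bytes.toBytes("Raja"));
                put.addColumn(Bytes.toBytes("personal"), Bytes.toBytes("city"), Bytes.toBytes("Hyderabad"));
                table.put(put);

                // random read of a single row by its key (the low-latency lookup HBase adds over HDFS)
                Result result = table.get(new Get(Bytes.toBytes("row1")));
                byte[] name = result.getValue(Bytes.toBytes("personal"), Bytes.toBytes("name"));
                System.out.println("name = " + Bytes.toString(name));
            }
        }
    }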
Features of HBase include linear and modular (horizontal) scalability, strictly consistent reads and
writes, automatic failover support, and easy integration with Hadoop as both a source and a destination
for MapReduce jobs. The first usable HBase release shipped along with Hadoop 0.15.0 in October 2007.
Ans: - The language used to analyze data in Hadoop using Pig is known as Pig Latin. It is a high-level
data processing language which provides a rich set of data types and operators to perform various
operations on the data. To perform a particular task, programmers using Pig need to write a Pig script in
the Pig Latin language and execute it using any of the execution mechanisms (Grunt shell, UDFs, or
embedded mode). After execution, these scripts go through a series of transformations applied by the
Pig framework to produce the desired output. Internally, Apache Pig converts these scripts into a series
of MapReduce jobs, and thus it makes the programmer's job easy.
The architecture of Apache Pig is shown below.
Parser
Initially the Pig Scripts are handled by the Parser. It checks the syntax of the script, does type checking,
and other miscellaneous checks. The output of the parser will be a DAG (directed acyclic graph), which
represents the Pig Latin statements and logical operators. In the DAG, the logical operators of the script
are represented as the nodes and the data flows are represented as edges.
Optimizer
The logical plan (DAG) is passed to the logical optimizer, which carries out logical optimizations such
as projection pushdown.
Compiler
The compiler compiles the optimized logical plan into a series of MapReduce jobs.
Execution engine
Finally, the MapReduce jobs are submitted to Hadoop in a sorted order, and these MapReduce jobs are
executed on Hadoop to produce the desired results.
Pig Latin Data Model
The data model of Pig Latin is fully nested and it allows complex non-atomic data types such
as map and tuples. Given below is the diagrammatical representation of Pig Latin’s data model.
Atom
Any single value in Pig Latin, irrespective of its data type, is known as an Atom. It is stored as a string
and can be used as a string or a number. int, long, float, double, chararray, and bytearray are the atomic
values of Pig. A piece of data or a simple atomic value is known as a field.
Example − ‘raja’ or ‘30’
Tuple
A record formed by an ordered set of fields is known as a tuple; the fields can be of any type. A tuple is
similar to a row in an RDBMS table.
Example − (Raja, 30)
Bag
A bag is an unordered set of tuples. In other words, a collection of tuples (possibly with duplicates) is
known as a bag. Each tuple can have any number of fields (flexible schema). A bag is represented by
'{}'. It is similar to a table in an RDBMS, but unlike an RDBMS table, it is not necessary that every tuple
contain the same number of fields or that the fields in the same position (column) have the same type.
Example − {(Raja, 30), (Mohammad, 45)}
A bag can be a field in a relation; in that context, it is known as inner bag.
Example − {Raja, 30, {9848022338, [email protected],}}
Map
A map (or data map) is a set of key-value pairs. The key needs to be of type char array and should be
unique. The value might be of any type. It is represented by ‘[]’
Example − [name#Raja, age#30]
Relation
A relation is a bag of tuples. The relations in Pig Latin are unordered (there is no guarantee that tuples
are processed in any particular order).
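These data model types are also exposed through Pig's Java API (the classes used inside UDFs). The short sketch below, which simply builds the tuple and bag values shown in the examples above, is illustrative only.

    import org.apache.pig.data.BagFactory;
    import org.apache.pig.data.DataBag;
    import org.apache.pig.data.Tuple;
    import org.apache.pig.data.TupleFactory;

    public class PigDataModelExample {
        public static void main(String[] args) throws Exception {
            TupleFactory tupleFactory = TupleFactory.getInstance();
            BagFactory bagFactory = BagFactory.getInstance();

            // a tuple is an ordered set of fields, e.g. (Raja, 30)
            Tuple t1 = tupleFactory.newTuple(2);
            t1.set(0, "Raja");
            t1.set(1, 30);

            Tuple t2 = tupleFactory.newTuple(2);
            t2.set(0, "Mohammad");
            t2.set(1, 45);

            // a bag is an unordered collection of tuples, e.g. {(Raja,30),(Mohammad,45)}
            DataBag bag = bagFactory.newDefaultBag();
            bag.add(t1);
            bag.add(t2);

            System.out.println(bag); // prints the bag of tuples
        }
    }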
14. Write the Application of Big Data with suitable Diagram
Ans: -
1. Banking and Securities
The Securities Exchange Commission (SEC) is using Big Data to monitor financial market activity. It
currently uses network analytics and natural language processing to catch illegal trading activity in the
financial markets. Retail traders, big banks, hedge funds, and the other so-called 'big boys' in the
financial markets use Big Data for trade analytics used in high-frequency trading, pre-trade
decision-support analytics, sentiment measurement, predictive analytics, etc. This industry also heavily
relies on Big Data for risk analytics, including anti-money laundering, demand enterprise risk
management, "Know Your Customer", and fraud mitigation. Big Data providers specific to this industry
include 1010data, Panopticon Software, StreamBase Systems, NICE Actimize, and Quartet FS.
2. Communications, Media and Entertainment
Since consumers expect rich media on demand in different formats and on a variety of devices, the
communications, media, and entertainment industry faces its own Big Data challenges.
A case in point is the Wimbledon Championships, which leverage Big Data to deliver detailed sentiment
analysis of the tennis matches to TV, mobile, and web users in real time.
Spotify, an on-demand music service, uses Hadoop Big Data analytics to collect data from its millions of
users worldwide and then uses the analyzed data to give informed music recommendations to individual
users. Amazon Prime, which is driven to provide a great customer experience by offering video, music,
and Kindle books in a one-stop shop, also heavily utilizes Big Data. Big Data providers in this industry
include Infochimps, Splunk, Pervasive Software, and Visible Measures.
3. Healthcare Providers
The healthcare sector has access to huge amounts of data but has been plagued by failures to utilize that
data to curb rising healthcare costs, and by inefficient systems that stifle faster and better healthcare
benefits across the board. This is mainly because electronic data is unavailable, inadequate, or unusable.
Additionally, the healthcare databases that hold health-related information have made it difficult to link
data in ways that can reveal patterns useful in the medical field.
Some hospitals, like Beth Israel, are using data collected from a cell phone app, from millions of patients,
to allow doctors to use evidence-based medicine as opposed to administering several medical/lab tests to
all patients who go to the hospital. A battery of tests can be efficient, but it can also be expensive and
often ineffective. Free public health data and Google Maps have been used by the University of Florida
to create visual data that allows for faster identification and efficient analysis of healthcare information,
used in tracking the spread of chronic disease.
4. Education
From a technical point of view, a significant challenge in the education industry is to incorporate Big
Data from different sources and vendors and to utilize it on platforms that were not designed for such
varied data. From a practical point of view, staff and institutions have to learn new data management and
analysis tools.
Applications of Big Data in Education
Big data is used quite significantly in higher education. For example, the University of Tasmania, an
Australian university with over 26,000 students, has deployed a learning and management system that
tracks, among other things, when a student logs onto the system, how much time is spent on different
pages in the system, and the overall progress of a student over time. In a different use case, Big Data in
education is also used to measure teachers' effectiveness to ensure a pleasant experience for both
students and teachers. Teacher performance can be fine-tuned and measured against student numbers,
subject matter, student demographics, student aspirations, behavioral classification, and several other
variables.
5. Manufacturing and Natural Resources
Increasing demand for natural resources, including oil, agricultural products, minerals, gas, metals, and
so on, has led to an increase in the volume, complexity, and velocity of data, which is a challenge to
handle. Similarly, large volumes of data from the manufacturing industry remain untapped. The
underutilization of this information prevents improvements in product quality, energy efficiency,
reliability, and profit margins.
In the natural resources industry, Big Data allows for predictive modeling to support decision making; it
has been utilized for ingesting and integrating large amounts of geospatial data, graphical data, text, and
temporal data. Areas of interest where this has been used include seismic interpretation and reservoir
characterization.
6. Government
Big data is being used in the analysis of the large volume of social disability claims made to the Social
Security Administration (SSA) that arrive in the form of unstructured data. The analytics are used to
process medical information rapidly and efficiently for faster decision making and to detect suspicious or
fraudulent claims. The Food and Drug Administration (FDA) is using Big Data to detect and study
patterns of food-related illnesses and diseases, which allows for a faster response and has led to quicker
treatment and fewer deaths. The Department of Homeland Security uses Big Data for several different
use cases; big data from various government agencies is analyzed and used to protect the country. Big
Data providers in this industry include Digital Reasoning, Socrata, and HP.
7. Insurance
Lack of personalized services, lack of personalized pricing, and the lack of targeted services for new
and specific market segments are some of the main challenges. Big data has been used in the industry to
provide customer insights for transparent and simpler products, by analyzing and predicting customer
behavior through data derived from social media, GPS-enabled devices, and CCTV footage. Big Data
also allows for better customer retention by insurance companies. When it comes to claims management,
predictive analytics from Big Data has been used to offer faster service, since massive amounts of data
can be analyzed, mainly in the underwriting stage. Fraud detection has also been enhanced.
Through massive data from digital channels and social media, real-time monitoring of claims throughout
the claims cycle has been used to provide insights. Big Data providers in this industry include Sprint,
Qualcomm, Octo Telematics, and The Climate Corporation.
8. Retail and Wholesale Trade
From traditional brick-and-mortar retailers and wholesalers to current-day e-commerce traders, the
industry has gathered a lot of data over time. This data, derived from customer loyalty cards, POS
scanners, RFID, etc., is not being used enough to improve customer experiences on the whole, and any
changes and improvements have been quite slow. Big data from customer loyalty programs, POS, store
inventory, and local demographics continues to be gathered by retail and wholesale stores. At New
York's Big Show retail trade conference in 2014, companies like Microsoft, Cisco, and IBM pitched the
need for the retail industry to utilize Big Data for analytics and other uses, including:
Optimized staffing through data from shopping patterns, local events, and so on
Reduced fraud
Social media also has a lot of potential uses and continues to be slowly but surely adopted, especially by
brick-and-mortar stores. Social media is used for customer prospecting, customer retention, promotion of
products, and more. Big Data providers in this industry include First Retail, First Insight, Fujitsu,
Inform, EPCOR, and Vista.
9. Transportation
In recent times, huge amounts of data from location-based social networks and high-speed data from
telecoms have affected travel behavior. Regrettably, research to understand travel behavior has not
progressed as quickly. In most places, transport demand models are still based on poorly understood new
social media structures. Some applications of Big Data by governments, private organizations, and
individuals include:
Government use of Big Data: traffic control, route planning, intelligent transport systems, and congestion
management (by predicting traffic conditions)
Individual use of Big Data: route planning to save fuel and time, travel arrangements in tourism, etc.
10. Energy and Utilities
Smart meter readers allow data to be collected almost every 15 minutes, as opposed to once a day with
the old meter readers. This granular data is being used to analyze the consumption of utilities better,
which allows for improved customer feedback and better control of utility use. In utility companies, the
use of Big Data also allows for better asset and workforce management, which is useful for recognizing
errors and correcting them as soon as possible, before complete failure occurs.
15. Difference between Hive and Pig, Data Warehouse and Big Data, Old API and New API, HDFS
and HBase
Ans:-
Pig vs Hive:
Pig is used by researchers and programmers; Hive is mainly used by data analysts.
Pig is used to handle both structured and semi-structured data; Hive is mainly used to handle structured data.
Pig does not have a dedicated metadata database; Hive makes use of an exact variation of a dedicated
SQL-DDL language by defining tables beforehand.
Pig supports the Avro file format; Hive does not support the Avro file format.
Below is a table of differences between Big Data and Data Warehouse:
Big data does its processing using a distributed file system; a data warehouse does not use a distributed
file system for processing.
Big data does not use SQL queries to fetch data from the database; in a data warehouse we use SQL
queries to fetch data from relational databases.
Apache Hadoop can be used to handle an enormous amount of data; a data warehouse cannot be used to
handle an enormous amount of data.
Big data does not require management techniques as efficient as a data warehouse does; a data
warehouse requires more efficient management techniques as the data is collected from different
departments of the enterprise.
Below are the differences between the Hadoop old API (0.20) and the new API (1.x and later):
How reduce() receives values – In the new API, the reduce() method receives the values as a
java.lang.Iterable; in the old API, the reduce() method receives the values as a java.lang.Iterator.
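For illustration, here is a hedged sketch of a summing reducer written against the new API; the equivalent old-API signature is shown in a comment for contrast.

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    // New API (org.apache.hadoop.mapreduce): values arrive as an Iterable
    public class NewApiSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable value : values) {
                sum += value.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    // Old API (org.apache.hadoop.mapred), for contrast - note the Iterator and OutputCollector:
    //   public void reduce(Text key, Iterator<IntWritable> values,
    //                      OutputCollector<Text, IntWritable> output, Reporter reporter)
    //       throws IOException { ... }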
Below is a table of differences between HDFS and HBase:
HDFS is based on a write-once, read-many-times model; HBase supports random read and write
operations on data in the file system.
HDFS provides high latency for access operations; HBase provides low-latency access to small amounts
of data.
SECTION-C
Ans:-
HDFS stores very large files running on a cluster of commodity hardware. It works on the principle of
storing a smaller number of large files rather than a huge number of small files. HDFS stores data
reliably even in the case of hardware failure. It provides high throughput by providing data access in
parallel.
I. Hardware failure
Hardware failure is no longer an exception; it has become a regular occurrence. An HDFS instance
consists of hundreds or thousands of server machines, each of which stores part of the file system's data.
There is a huge number of components that are very susceptible to hardware failure, which means that
some components are always non-functional. So a core architectural goal of HDFS is quick and
automatic fault detection and recovery.
III. Large data sets
HDFS works with large data sets. In standard practice, a file in HDFS ranges in size from gigabytes to
petabytes. The architecture of HDFS is designed so that it is best for storing and retrieving huge amounts
of data. HDFS should provide high aggregate data bandwidth and should be able to scale to hundreds of
nodes in a single cluster. It should also be good enough to deal with tens of millions of files on a single
instance.
IV. Simple coherency model (write-once-read-many)
HDFS works on a write-once-read-many access model for files. Once a file is created, written, and
closed, it should not be changed. This resolves data coherency issues and enables high-throughput data
access. A MapReduce-based application or a web-crawler application fits this model perfectly. As per
Apache's notes, there is a plan to support appending writes to files in the future.
V. Moving computation is cheaper than moving data
If an application does the computation near the data it operates on, it is much more efficient than when
the computation is done far away from the data. This fact becomes even more important when dealing
with large data sets. The main advantage is that it increases the overall throughput of the system and
minimizes network congestion. The assumption is that it is better to move the computation closer to the
data than to move the data to the computation.
VI. Portability across heterogeneous hardware and software platforms
HDFS is designed with the portable property so that it should be portable from one platform to another.
This enables the widespread adoption of HDFS. It is the best platform while dealing with a large set of
data.
Hadoop Distributed File System follows the master-slave architecture. Each cluster comprises a single
master node and multiple slave nodes. Internally the files get divided into one or more blocks, and each
block is stored on different slave machines depending on the replication factor (which you will see later
in this article). The master node stores and manages the file system namespace that is information about
blocks of files like block locations, permissions, etc. The slave nodes store data blocks of files.
The Master node is the Name Node and Data Nodes are the slave nodes.
Name Node is the centerpiece of the Hadoop Distributed File System. It maintains and manages the file
system namespace and provides the right access permission to the clients.
The Name Node stores information about block locations, permissions, etc. on the local disk in the form
of two files:
Fs image: Fs image stands for File System image. It contains the complete namespace of the Hadoop
file system since the Name Node creation.
Edit log: It contains all the recent changes performed to the file system namespace to the most recent
Fs image.
Functions of HDFS Name Node
1. It executes the file system namespace operations like opening, renaming, and closing files and
directories.
2. Name Node manages and maintains the Data Nodes.
3. It determines the mapping of blocks of a file to Data Nodes.
4. Name Node records each change made to the file system namespace.
5. It keeps the locations of each block of a file.
6. Name Node takes care of the replication factor of all the blocks.
7. Name Node receives heartbeat and block reports from all Data Nodes that ensure Data Node is alive.
8. If the Data Node fails, the Name Node chooses new Data Nodes for new replicas.
Data Nodes are the slave nodes in Hadoop HDFS. Data Nodes are inexpensive commodity hardware.
They store blocks of a file.
Functions of Data Node
1. Data Nodes serve read and write requests from the file system's clients.
2. They perform block creation, deletion, and replication as instructed by the Name Node.
3. They send periodic heartbeats and block reports to the Name Node.
Secondary Name Node
Apart from the Data Node and the Name Node, there is another daemon called the secondary Name
Node. The secondary Name Node works as a helper to the primary Name Node but does not replace the
primary Name
Node. When the Name Node starts, the Name Node merges the Fs image and edit logs file to restore the
current file system namespace. Since the Name Node runs continuously for a long time without any
restart, the size of edit logs becomes too large. This will result in a long restart time for Name Node.
Secondary Name Node solves this issue. Secondary Name Node downloads the Fs image file and edit
logs file from Name Node. It periodically applies edit logs to Fs image and refreshes the edit logs. The
updated Fs image is then sent to the Name Node so that Name Node doesn’t have to re-apply the edit log
records during its restart. This keeps the edit log size small and reduces the Name Node restart time. If
the Name Node fails, the last saved Fs image on the secondary Name Node can be used to recover the
file system metadata. The secondary Name Node performs regular checkpoints in HDFS.
What is Checkpoint Node?
The Checkpoint node is a node that periodically creates checkpoints of the namespace. Checkpoint Node
in Hadoop first downloads Fs image and edits from the Active Name node. Then it merges them (Fs
image and edits) locally, and at last, it uploads the new image back to the active Name Node. It stores the
latest checkpoint in a directory that has the same structure as the Name node’s directory. This permits the
check pointed image to be always available for reading by the Name Node if necessary.
A Backup node provides the same checkpointing functionality as the Checkpoint node. In Hadoop, the
Backup node keeps an in-memory, up-to-date copy of the file system namespace. It is always
synchronized with the active Name Node state. The Backup node in HDFS architecture does not need to
download the Fs image and edits files from the active Name Node to create a checkpoint, because it
already has an up-to-date state of the namespace in memory. The Backup node checkpoint process is
more efficient as it only needs to save the namespace into the local Fs image file and reset edits. The
Name Node supports one Backup node at a time. This was about the different types of nodes in HDFS
architecture. Further in this HDFS architecture tutorial, we will learn about blocks in HDFS, replication
management, rack awareness, and read/write operations.
Internally, HDFS splits a file into block-sized chunks called blocks. The size of a block is 128 MB by
default. One can configure the block size as per the requirement. For example, if there is a file of size
612 MB, then HDFS will create four blocks of size 128 MB and one block of size 100 MB. A file of
smaller size does not occupy the full block size on disk; for example, a file of size 2 MB will occupy
only 2 MB of space on disk.
What is Replication Management?
For a distributed system, the data must be stored redundantly in multiple places so that if one machine
fails, the data is accessible from other machines. In Hadoop, HDFS stores replicas of a block on multiple
Data Nodes based on the replication factor. The replication factor is the number of copies to be created
for the blocks of a file in the HDFS architecture. If the replication factor is 3, then three copies of a block
get stored on different Data Nodes, so if one Data Node containing the data block fails, the block is still
accessible from another Data Node containing a replica of the block. If we are storing a file of 128 MB
and the replication factor is 3, then (3*128 = 384) 384 MB of disk space is occupied for the file, as three
copies of each block get stored. This replication mechanism makes HDFS fault-tolerant.
What is Rack Awareness in HDFS Architecture?
A rack is a collection of around 40-50 machines (Data Nodes) connected using the same network switch.
If the network switch goes down, the whole rack becomes unavailable. Rack Awareness is the concept of
choosing the closest node based on rack information. To ensure that all the replicas of a block are not
stored on the same rack or on a single rack, the Name Node follows a rack awareness algorithm to store
replicas, which reduces latency and provides fault tolerance.
Suppose the replication factor is 3; then, according to the rack awareness algorithm, the first replica is
placed on the same node as the client (or on a random Data Node if the client is outside the cluster), the
second replica is placed on a Data Node in a different rack, and the third replica is placed on a different
Data Node in that same remote rack.
1. Write Operation
When a client wants to write a file to HDFS, it communicates with the Name Node for metadata. The
Name Node responds with the number of blocks, their locations, replicas, and other details. Based on
this information from the Name Node, the client interacts directly with the Data Nodes. The client first
sends block A to Data Node 1 along with the IPs of the other two Data Nodes where replicas will be
stored. When Data Node 1 receives block A from the client, Data Node 1 copies the same block to Data
Node 2 on the same rack; as both Data Nodes are in the same rack, the block is transferred via the rack
switch. Data Node 2 then copies the same block to Data Node 4 on a different rack; as these Data Nodes
are in different racks, the block is transferred via an out-of-rack switch.
2. Read Operation
To read from HDFS, the client first communicates with the Name Node for metadata. The Name Node
responds with the locations of the Data Nodes containing the blocks. After receiving the Data Node
locations, the client interacts directly with the Data Nodes. The client starts reading data in parallel from
the Data Nodes based on the information received from the Name Node, and the data flows directly
from the Data Nodes to the client.
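The same write and read flow can be driven programmatically through the HDFS Java API. This is only a hedged sketch: the NameNode URI and the file path are placeholder assumptions, and on a properly configured cluster FileSystem.get(conf) alone would pick up the address from core-site.xml.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.URI;
    import java.nio.charset.StandardCharsets;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsReadWriteExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration(); // picks up core-site.xml / hdfs-site.xml if present
            FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:9000"), conf); // placeholder URI

            Path file = new Path("/user/demo/hello.txt"); // hypothetical path

            // Write: the client asks the Name Node for target Data Nodes, then streams the data
            // through the Data Node pipeline as described in the write operation above
            try (FSDataOutputStream out = fs.create(file, true)) {
                out.write("Hello HDFS\n".getBytes(StandardCharsets.UTF_8));
            }

            // Read: the client gets block locations from the Name Node and reads directly from Data Nodes
            try (BufferedReader reader = new BufferedReader(
                     new InputStreamReader(fs.open(file), StandardCharsets.UTF_8))) {
                System.out.println(reader.readLine());
            }

            fs.close();
        }
    }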
Overview of HDFS Architecture
In Hadoop HDFS, the Name Node is the master node and the Data Nodes are the slave nodes. A file in
HDFS is stored as data blocks. The file is divided into blocks (A, B, and C in the figure below). These
blocks get stored on different Data Nodes based on the rack awareness algorithm: block A on
DataNode-1 (DN-1), block B on DataNode-6 (DN-6), and block C on DataNode-7 (DN-7). To provide
fault tolerance, replicas of blocks are created based on the replication factor. In the figure, two additional
replicas of each block are created (with the default replication factor of 3, each block exists as three
copies in total). Replicas are placed on different Data Nodes, thus ensuring data availability even in the
case of a Data Node failure or rack failure.
After reading this HDFS architecture tutorial, we can conclude that HDFS divides files into blocks. The
size of a block is 128 MB by default, which we can configure as per our requirements. The master node
(Name Node) stores and manages the metadata about block locations, the blocks of each file, etc., while
the Data Nodes store the actual data blocks. The master node manages the Data Nodes. HDFS creates
replicas of blocks and stores them on different Data Nodes in order to provide fault tolerance. Also, the
Name Node uses the rack awareness algorithm to improve cluster performance.
Features of HDFS
The key features of HDFS are:
1. Cost-effective:
In the HDFS architecture, the Data Nodes, which store the actual data, are inexpensive commodity
hardware, which reduces storage costs.
2. Large Datasets / Variety and volume of data
HDFS can store data of any size (ranging from megabytes to petabytes) and of any format (structured or
unstructured).
3. Replication
Data Replication is one of the most important and unique features of HDFS. In HDFS replication of data
is done to solve the problem of data loss in unfavorable conditions like crashing of a node, hardware
failure, and so on. The data is replicated across a number of machines in the cluster by creating replicas of
blocks. The process of replication is maintained at regular intervals of time by HDFS and HDFS keeps
creating replicas of user data on different machines present in the cluster. Hence whenever any machine
in the cluster gets crashed, the user can access their data from other machines that contain the blocks of
that data. Hence there is no possibility of a loss of user data.
4. Fault Tolerance
HDFS is highly fault-tolerant and reliable. HDFS creates replicas of file blocks depending on the
replication factor and stores them on different machines. If any of the machines containing data blocks
fail, other Data Nodes containing the replicas of that data blocks are available. Thus ensuring no loss of
data and makes the system reliable even in unfavorable conditions. Hadoop 3 introduced Erasure
Coding to provide Fault Tolerance. Erasure Coding in HDFS improves storage efficiency while
providing the same level of fault tolerance and data durability as traditional replication-based HDFS
deployment.
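As a rough illustration of the storage saving (assuming the RS-6-3 Reed-Solomon policy commonly used in Hadoop 3): with 3x replication, 6 blocks of data occupy 18 blocks of raw storage, whereas with RS-6-3 erasure coding the same 6 data blocks need only 3 additional parity blocks, i.e. 9 blocks in total, while still tolerating the loss of any 3 blocks.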
5. High Availability
The high availability feature of Hadoop ensures the availability of data even during a Name Node or
Data Node failure. Since HDFS creates replicas of data blocks, if any Data Node goes down, the user can
access the data from other Data Nodes containing a copy of the same data block. Also, if the active
Name Node goes down, the passive (standby) Name Node takes over the responsibility of the active
Name Node. Thus, data remains available and accessible to the user even during a machine crash.
6. Scalability
As HDFS stores data on multiple nodes in the cluster, the cluster can be scaled as requirements increase.
There are two scalability mechanisms available: vertical scalability – adding more resources (CPU,
memory, disk) to the existing nodes of the cluster – and horizontal scalability – adding more machines to
the cluster. The horizontal way is preferred, since we can scale the cluster from tens of nodes to hundreds
of nodes on the fly, without any downtime.
7. Data Integrity
Data integrity refers to the correctness of data. HDFS ensures data integrity by constantly checking the
data against the checksum calculated during the write of the file. While file reading, if the checksum does
not match with the original checksum, the data is said to be corrupted. The client then opts to retrieve the
data block from another Data Node that has a replica of that block. The Name Node discards the
corrupted block and creates an additional new replica.
8. High Throughput
Hadoop HDFS stores data in a distributed fashion, which allows data to be processed parallels on a cluster
of nodes. This decreases the processing time and thus provides high throughput.
9. Data Locality
Data locality means moving the computation logic to the data rather than moving the data to the
computational unit. In a traditional system, the data is brought to the application layer and then
processed, but in the present scenario, due to the massive volume of data, bringing the data to the
application layer degrades network performance. HDFS therefore moves the computation to the nodes
where the data resides, which reduces network congestion and increases overall throughput.
Commonly used HDFS shell commands:
1. ls: This command is used to list all the files. Use -ls -R (or the older lsr) for a recursive listing; it is
useful when we want the hierarchy of a folder.
Syntax:
bin/hdfs dfs -ls <path>
Example:
bin/hdfs dfs -ls /
It will print all the directories present in HDFS. The bin directory contains the executables, so bin/hdfs
means we want the hdfs executable, and dfs selects the Distributed File System commands.
2. mkdir: To create a directory. In Hadoop DFS there is no home directory by default, so let's first
create it.
Syntax:
bin/hdfs dfs -mkdir <folder name>
Creating the home directory:
bin/hdfs dfs -mkdir /geeks
Note: Observe that we don't write bin/hdfs while checking the things present on the local filesystem.
7. moveFromLocal: This command will move a file from the local file system to HDFS.
Syntax:
bin/hdfs dfs -moveFromLocal <local src> <dest (on hdfs)>
Example:
8. cp: This command is used to copy files within HDFS. Let's copy the folder /geeks to /geeks_copied.
Syntax:
bin/hdfs dfs -cp <src (on hdfs)> <dest (on hdfs)>
Example:
bin/hdfs dfs -cp /geeks /geeks_copied
9. mv: This command is used to move files within HDFS. Let's cut-paste a file myfile.txt from the
/geeks folder to /geeks_copied.
Syntax:
bin/hdfs dfs -mv <src (on hdfs)> <dest (on hdfs)>
Example:
bin/hdfs dfs -mv /geeks/myfile.txt /geeks_copied
10. rmr: This command deletes a file from HDFS recursively. It is a very useful command when you
want to delete a non-empty directory.
Syntax:
bin/hdfs dfs -rmr <filename/directory name>
Example:
bin/hdfs dfs -rmr /geeks_copied -> It will delete all the content inside the directory and then the
directory itself.
11. du: It will give the size of each file in the directory.
Syntax:
bin/hdfs dfs -du <dirName>
Example:
bin/hdfs dfs -du /geeks
12. dus: This command will give the total size of a directory/file.
Syntax:
bin/hdfs dfs -dus <dirName>
Example:
bin/hdfs dfs -dus /geeks
13. stat: It will give the last modified time of a directory or path. In short, it will give stats of the
directory or file.
Syntax:
bin/hdfs dfs -stat <hdfs file>
Example:
bin/hdfs dfs -stat /geeks
14. setrep: This command is used to change the replication factor of a file/directory in HDFS. By
default it is 3 for anything stored in HDFS (as set by the dfs.replication property in hdfs-site.xml).
Example 1: To change the replication factor to 6 for geeks.txt stored in HDFS.
bin/hdfs dfs -setrep -R -w 6 geeks.txt
Example 2: To change the replication factor to 4 for the directory /geeks stored in HDFS.
bin/hdfs dfs -setrep -R 4 /geeks
Note: -w means wait till the replication is completed, and -R means recursively; we use it for directories
as they may also contain many files and folders inside them.
Note: There are more commands in HDFS, but we discussed the commands that are commonly used when
working with Hadoop. You can check out the list of dfs commands using the following command:
bin/hdfs dfs
2. Define GFS and its features. Discuss the GFS Architecture with Diagram
Ans:-
A GFS node cluster consists of a single master with multiple chunk servers that are continuously
accessed by different client systems. Chunk servers store data as Linux files on local disks. Stored data
is divided into large chunks (64 MB), which are replicated in the network a minimum of three times. The
large chunk size reduces network overhead. GFS is designed to accommodate Google's large cluster
requirements without burdening applications. Files are stored in hierarchical directories identified by
path names. Metadata - such as namespace, access control data, and mapping information - is controlled
by the master, which interacts with and monitors the status updates of each chunk server through timed
heartbeat messages.
GFS features include:
Fault tolerance
Critical data replication
Automatic and efficient data recovery
High aggregate throughput
Reduced client and master interaction because of the large chunk size
Namespace management and locking
High availability
The largest GFS clusters have more than 1,000 nodes with 300 TB disk storage capacity. This can be
accessed by hundreds of clients on a continuous basis.
GFS was designed with five basic assumptions, [63] according to its particular application requirements:
1. GFS will anticipate any commodity hardware outages caused by both software and hardware faults.
This means that an individual node may be unreliable. This assumption is similar to one of its system
design principles.
2. GFS accepts a modest number of large files. The quantity "modest" here means a few million files,
with a typical file size of 100 MB per file. The system also accepts smaller files, but it will not optimize
for them.
3. The typical workload size for streaming reads would be from hundreds of KBs to 1 MB, with small
random reads of a few KBs in batch mode.
4. GFS has well-defined semantics for multiple clients with minimal synchronization overhead.
5. Constant, high sustained storage network bandwidth is more important than low latency.
A GFS cluster is made up of three main components:
1. Clients
2. Master servers
3. Chunk servers
Clients can be other computers or computer applications that make file requests. Requests can range
from retrieving and manipulating existing files to creating new files on the system. Clients can be
thought of as the customers of the GFS.
The Master server is the coordinator for the cluster. Its tasks include:
1. Maintaining an operation log that keeps track of the activities of the cluster. The operation log helps
keep service interruptions to a minimum: if the master server crashes, a replacement server that has
monitored the operation log can take its place.
2. Keeping track of metadata, which is the information that describes chunks. The metadata tells the
master server which files the chunks belong to and where they fit within the overall file.
Chunk Servers are the workhorses of the GFS. They store 64-MB file chunks. The chunk servers don't
send chunks to the master server. Instead, they send requested chunks directly to the client. The GFS
copies every chunk multiple times and stores it on different chunk servers. Each copy is called a replica.
By default, the GFS makes three replicas per chunk, but users can change the setting and make more or
fewer replicas if desired.
Chunk size is one of the key design parameters. In GFS it is 64 MB, which is much larger than typical
file system block sizes. Each chunk replica is stored as a plain Linux file on a chunk server and is
extended only as needed.
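As a rough worked example based on the numbers above: a 1 GB file is divided into 1024 MB / 64 MB = 16 chunks; with the default of three replicas, the cluster stores 48 chunk replicas in total, while the master only needs to keep metadata for 16 chunks. This is why a large chunk size keeps the master's in-memory metadata small.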
Advantages
1. It reduces clients’ need to interact with the master because reads and writes on the same chunk require
only one initial request to the master for chunk location information.
2. Since with a large chunk size a client is more likely to perform many operations on a given chunk, it
can reduce network overhead by keeping a persistent TCP connection to the chunk server over an
extended period of time.
3. It reduces the size of the metadata stored on the master. This allows us to keep the metadata in
memory, which in turn brings other advantages.
Disadvantages
2. Even with lazy space allocation, a small file consists of a small number of chunks, perhaps just one.
The chunk servers storing those chunks may become hot spots if many clients access the same file.
In practice, hot spots have not been a major issue because the applications mostly read large multi-chunk
files sequentially. To mitigate the problem, the replication factor can be increased and clients can be
allowed to read from other clients.
3. Define MapReduce and its features. Discuss the MapReduce architecture with a diagram
Ans: - MapReduce is a software framework for writing applications that can process huge amounts of
data across clusters of inexpensive nodes. Hadoop MapReduce is the processing part of Apache Hadoop.
It is also known as the heart of Hadoop and is the most preferred data processing application. Several
players in the e-commerce sector, such as Amazon, Yahoo, and Zuventus, are using the MapReduce
framework for high-volume data processing.
Features of MapReduce
1. Scalability
Apache Hadoop is a highly scalable framework because of its ability to store and distribute huge data
across plenty of servers. All these servers are inexpensive and can operate in parallel, and we can easily
scale the storage and computation power by adding servers to the cluster. Hadoop MapReduce
programming enables organizations to run applications across large sets of nodes, which can involve
thousands of terabytes of data.
2. Flexibility
MapReduce programming enables companies to access new sources of data and to operate on different
types of data. It allows enterprises to access structured as well as unstructured data and to derive
significant value by gaining insights from multiple sources. Additionally, the MapReduce framework
provides support for multiple languages and for data from sources ranging from email and social media
to clickstreams. MapReduce processes data as simple key-value pairs and thus supports data types
including metadata, images, and large files. Hence, MapReduce is more flexible in dealing with data
than a traditional DBMS.
3. Security and Authentication
The MapReduce programming model works with the HBase and HDFS security platforms, which allow
only authenticated users to operate on the data. Thus, it protects system data from unauthorized access
and enhances system security.
4. Cost-effective solution
Hadoop’s scalable architecture with the MapReduce programming framework allows the storage and
processing of large data sets in a very affordable manner.
5. Fast
Hadoop uses a distributed storage method called the Hadoop Distributed File System, which basically
implements a mapping system for locating data in a cluster. The tools used for data processing, such as
MapReduce programs, are generally located on the very same servers, which allows for faster processing
of data. Even when dealing with large volumes of unstructured data, Hadoop MapReduce takes just
minutes to process terabytes of data, and it can process petabytes of data in just an hour.
6. Simple model of programming
Amongst the various features of Hadoop MapReduce, one of the most important is that it is based on a
simple programming model. This allows programmers to develop MapReduce programs that handle
tasks easily and efficiently. MapReduce programs can be written in Java, which is not very hard to pick
up and is also widely used, so anyone can easily learn and write MapReduce programs and meet their
data processing needs.
7. Parallel Programming
One of the major aspects of the working of MapReduce programming is its parallel processing. It divides
the tasks in a manner that allows their execution in parallel. The parallel processing allows multiple
processors to execute these divided tasks. So the entire program is run in less time.
8. Availability and resilient nature
Whenever data is sent to an individual node, the same data is also forwarded to other nodes in the
cluster. So, if any particular node suffers a failure, there are always other copies present on other nodes
that can still be accessed whenever needed. This assures high availability of data.
Hadoop Map Reduce architecture
The MapReduce architecture consists of mainly two processing stages: the first one is the map stage and
the second one is the reduce stage. The actual MR process happens in the task trackers. Between the
map and reduce stages, an intermediate process takes place: it performs operations such as shuffling and
sorting of the mapper output data, and this intermediate data is stored in the local file system.
Mapper Phase
In the mapper phase, the input data is split into two components, a key and a value. The key must be
writable and comparable (WritableComparable), while the value only needs to be writable (Writable).
When a client submits input data to the Hadoop system, the Job Tracker assigns tasks to the Task
Trackers. The input data is divided into several input splits, which are logical splits of the input. The
Record Reader converts each input split into key-value (KV) pairs; this is the actual input format for the
map phase processed inside the Task Tracker. The input format type varies from one type of application
to another, so the programmer has to observe the input data and write code accordingly.
What is MapReduce?
MapReduce is a software framework and programming model used for processing huge amounts of
data. MapReduce programs work in two phases, namely Map and Reduce. Map tasks deal with the
splitting and mapping of data, while Reduce tasks shuffle and reduce the data. Hadoop is capable of
running MapReduce programs written in various languages: Java, Ruby, Python, and C++. MapReduce
programs are parallel in nature and are thus very useful for performing large-scale data analysis using
multiple machines in a cluster. The input to each phase is a set of key-value pairs. In addition, every
programmer needs to specify two functions: a map function and a reduce function.
The whole process goes through four phases of execution, namely splitting, mapping, shuffling, and
reducing. Consider the following three lines of input data for your MapReduce program:
Welcome to Hadoop Class
Hadoop is good
Hadoop is bad
[Figure: MapReduce Architecture – word-count data flow for the input above]
The final output of the MapReduce task is
bad 1
Class 1
good 1
Hadoop 3
is 2
to 1
Welcome 1
Input Splits:
An input to a MapReduce job is divided into fixed-size pieces called input splits. An input split is the
chunk of the input that is consumed by a single map task.
Mapping
This is the very first phase in the execution of a map-reduce program. In this phase, data in each split is
passed to a mapping function to produce output values. In our example, the job of the mapping phase is
to count the number of occurrences of each word from the input splits (more details about input splits
are given above) and prepare a list in the form of <word, frequency>.
Shuffling
This phase consumes the output of the mapping phase. Its task is to consolidate the relevant records from
the mapping phase output. In our example, the same words are clubbed together along with their
respective frequencies.
Reducing
In this phase, output values from the shuffling phase are aggregated. This phase combines values from
the shuffling phase and returns a single output value. In short, this phase summarizes the complete dataset.
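Putting the four phases together, below is a hedged sketch of the classic word-count job written against the Hadoop Java MapReduce API; the input and output paths are supplied on the command line and are assumptions of this illustration, not paths from this paper.

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

        // Map phase: each record of an input split is tokenized, and <word, 1> is emitted per word
        public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            public void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer itr = new StringTokenizer(value.toString());
                while (itr.hasMoreTokens()) {
                    word.set(itr.nextToken());
                    context.write(word, ONE);
                }
            }
        }

        // Reduce phase: after shuffling, all counts for the same word arrive together and are summed
        public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            private final IntWritable result = new IntWritable();

            @Override
            public void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable val : values) {
                    sum += val.get();
                }
                result.set(sum);
                context.write(key, result);
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setCombinerClass(IntSumReducer.class); // optional local aggregation before the shuffle
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input directory
            FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output directory (must not exist)
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }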
MapReduce Architecture explained in detail
One map task is created for each split, which then executes the map function for each record in the
split.
It is always beneficial to have multiple splits, because the time taken to process a split is small
compared to the time taken to process the whole input. When the splits are smaller, the processing is
better load-balanced, since we are processing the splits in parallel.
However, it is also not desirable to have splits that are too small in size. When splits are too small, the
overhead of managing the splits and of map task creation begins to dominate the total job execution
time.
For most jobs, it is better to make the split size equal to the size of an HDFS block (64 MB by default
in Hadoop 1, and 128 MB by default in Hadoop 2 and later).
Execution of map tasks results in output being written to a local disk on the respective node, and not
to HDFS.
Reason for choosing local disk over HDFS is, to avoid replication which takes place in case of
HDFS store operation.
Map output is intermediate output which is processed by reduce tasks to produce the final output.
Once the job is complete, the map output can be thrown away. So, storing it in HDFS with
replication becomes overkill.
In the event of node failure, before the map output is consumed by the reduce task, Hadoop
reruns the map task on another node and re-creates the map output.
Reduce task doesn't work on the concept of data locality. An output of every map task is fed to
the reduce task. Map output is transferred to the machine where reduce task is running.
On this machine, the output is merged and then passed to the user-defined reduce function.
Unlike the map output, reduce output is stored in HDFS (the first replica is stored on the local node and other replicas are stored on off-rack nodes). So, writing the reduce output does consume network bandwidth, but only as much as a normal HDFS write pipeline.
Hadoop divides the job into tasks. There are two types of tasks:
1. Map tasks (splitting and mapping)
2. Reduce tasks (shuffling and reducing)
The complete execution process (execution of both Map and Reduce tasks) is controlled by two types of entities called:
1. Job tracker: Acts like a master (responsible for complete execution of submitted job)
2. Multiple Task Trackers: Acts like slaves, each of them performing the job
For every job submitted for execution in the system, there is one Job tracker that resides on Name
node and there are multiple task trackers which reside on Data node.
A job is divided into multiple tasks which are then run onto multiple data nodes in a cluster.
It is the responsibility of job tracker to coordinate the activity by scheduling tasks to run on
different data nodes.
Execution of an individual task is then looked after by the task tracker, which resides on every data node executing part of the job.
The task tracker's responsibility is to send the progress report to the job tracker.
In addition, the task tracker periodically sends a 'heartbeat' signal to the Job tracker so as to notify it of the current state of the system.
Thus job tracker keeps track of the overall progress of each job. In the event of task failure, the
job tracker can reschedule it on a different task tracker.
4 Define Pig data Model in detail. Discuss how it will help for effective data flow.
Ans:-
To perform a particular task, programmers using Pig need to write a Pig script in the Pig Latin language and execute it using any of the execution mechanisms (Grunt shell, UDFs, Embedded). After execution, these scripts go through a series of transformations applied by the Pig framework to produce the desired output. Internally, Apache Pig converts these scripts into a series of MapReduce jobs, and thus it makes the programmer's job easy. The architecture of Apache Pig is shown below.
Apache Pig Components
As shown in the figure, there are various components in the Apache Pig framework. Let us take a look at
the major components.
Parser
Initially the Pig Scripts are handled by the Parser. It checks the syntax of the script, does type checking,
and other miscellaneous checks. The output of the parser will be a DAG (directed acyclic graph), which
represents the Pig Latin statements and logical operators. In the DAG, the logical operators of the script
are represented as the nodes and the data flows are represented as edges.
Optimizer
The logical plan (DAG) is passed to the logical optimizer, which carries out the logical optimizations
such as projection and pushdown.
Compiler
The compiler compiles the optimized logical plan into a series of MapReduce jobs.
Execution engine
Finally, the MapReduce jobs are submitted to Hadoop in a sorted order, and these MapReduce jobs are executed on Hadoop to produce the desired results.
Tuple
A record formed by an ordered set of fields is known as a tuple; the fields can be of any type. A tuple is similar to a row in an RDBMS table.
Example − (Raja, 30)
Bag
A bag is an unordered set of tuples. In other words, a collection of (non-unique) tuples is known as a bag. Each tuple can have any number of fields (flexible schema). A bag is represented by '{}'. It is similar to a table in an RDBMS, but unlike a table in an RDBMS, it is not necessary that every tuple contain the same number of fields or that the fields in the same position (column) have the same type.
Example − {(Raja, 30), (Mohammad, 45)}
A bag can be a field in a relation; in that context, it is known as an inner bag.
Example − (Raja, 30, {(9848022338, [email protected])})
Map
A map (or data map) is a set of key-value pairs. The key needs to be of type chararray and should be unique. The value can be of any type. A map is represented by '[]'.
Example − [name#Raja, age#30]
Relation
A relation is a bag of tuples. The relations in Pig Latin are unordered (there is no guarantee that tuples
are processed in any particular order).
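As an analogy only (this is Python, not Pig Latin), the nesting of the Pig data model can be pictured with ordinary Python structures; the field values reuse the examples above.

# A tuple is an ordered set of fields, like a row: (Raja, 30)
tuple_1 = ("Raja", 30)
tuple_2 = ("Mohammad", 45)

# A bag is an unordered collection of tuples with a flexible schema: {...}
bag = [tuple_1, tuple_2]               # order carries no meaning in Pig

# A map is a set of key-value pairs whose keys are chararrays: [name#Raja, age#30]
data_map = {"name": "Raja", "age": 30}

# A relation is the outer bag of tuples that Pig Latin statements operate on;
# a tuple may itself contain an inner bag or a map.
relation = [("Raja", 30, {"phone": "9848022338"}),
            ("Mohammad", 45, {})]

print(bag, data_map, relation)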
Machine Learning. Machine learning, a specific subset of AI that trains a machine how to learn, makes it
possible to quickly and automatically produce models that can analyze bigger, more complex data and
deliver faster, more accurate results – even on a very large scale. And by building precise models, an
organization has a better chance of identifying profitable opportunities – or avoiding unknown risks.
Data management. Data needs to be high quality and well-governed before it can be reliably analyzed.
With data constantly flowing in and out of an organization, it's important to establish repeatable processes
to build and maintain standards for data quality. Once data is reliable, organizations should establish a
master data management program that gets the entire enterprise on the same page.
Data mining. Data mining technology helps you examine large amounts of data to discover patterns in
the data – and this information can be used for further analysis to help answer complex business
questions. With data mining software, you can sift through all the chaotic and repetitive noise in data,
pinpoint what's relevant, use that information to assess likely outcomes, and then accelerate the pace of
making informed decisions.
Hadoop. This open source software framework can store large amounts of data and run applications on
clusters of commodity hardware. It has become a key technology to doing business due to the constant
increase of data volumes and varieties, and its distributed computing model processes big data fast. An
additional benefit is that Hadoop's open source framework is free and uses commodity hardware to store
large quantities of data.
In-memory analytics. By analyzing data from system memory (instead of from your hard disk drive),
you can derive immediate insights from your data and act on them quickly. This technology is able to
remove data prep and analytical processing latencies to test new scenarios and create models; it's not only
an easy way for organizations to stay agile and make better business decisions, it also enables them to run
iterative and interactive analytics scenarios.
Predictive analytics. Predictive analytics technology uses data, statistical algorithms and machine-
learning techniques to identify the likelihood of future outcomes based on historical data. It's all about
providing a best assessment on what will happen in the future, so organizations can feel more confident
that they're making the best possible business decision. Some of the most common applications of
predictive analytics include fraud detection, risk, operations and marketing.
Text mining. With text mining technology, you can analyze text data from the web, comment fields,
books and other text-based sources to uncover insights you hadn't noticed before. Text mining
uses machine learning or natural language processing technology to comb through documents – emails,
blogs, Twitter feeds, surveys, competitive intelligence and more – to help you analyze large amounts of
information and discover new topics and term relationships.
The following summarizes the differences between MapReduce and Apache Pig:
Apache Pig: requires only a few lines of code (about 10 lines of Pig Latin can summarize 200 lines of MapReduce code); requires less development time and effort.
MapReduce: requires more extensive code (many more lines of code); requires more development time and effort.
Ans: -
Hadoop is an open-source framework to store and process Big Data in a distributed environment. It
contains two modules, one is MapReduce and another is Hadoop Distributed File System (HDFS).
MapReduce: It is a parallel programming model for processing large amounts of structured,
semi-structured, and unstructured data on large clusters of commodity hardware.
HDFS: Hadoop Distributed File System is a part of Hadoop framework, used to store and
process the datasets. It provides a fault-tolerant file system to run on commodity hardware.
The Hadoop ecosystem contains different sub-projects (tools) such as Sqoop, Pig, and Hive that are used
to help Hadoop modules.
Sqoop: It is used to import and export data to and from between HDFS and RDBMS.
Pig: It is a procedural language platform used to develop a script for MapReduce operations.
Hive: It is a platform used to develop SQL type scripts to do MapReduce operations.
Note: There are various ways to execute MapReduce operations:
The traditional approach using Java MapReduce program for structured, semi-structured, and
unstructured data.
The scripting approach for MapReduce to process structured and semi structured data using Pig.
The Hive Query Language (Hive QL or HQL) for MapReduce to process structured data using
Hive.
What is Hive
Hive is a data warehouse infrastructure tool to process structured data in Hadoop. It resides on top of
Hadoop to summarize Big Data, and makes querying and analyzing easy.
Initially Hive was developed by Facebook; later the Apache Software Foundation took it up and
developed it further as an open source under the name Apache Hive. It is used by different companies.
For example, Amazon uses it in Amazon Elastic MapReduce.
Hive is not
A relational database
A design for On Line Transaction Processing (OLTP)
A language for real-time queries and row-level updates
Features of Hive
It stores schema in a database and processed data into HDFS.
It is designed for OLAP.
It provides an SQL-type language for querying, called HiveQL or HQL.
It is familiar, fast, scalable, and extensible.
The Hive component diagram contains different units. The following describes each unit:
User Interface: Hive is data warehouse infrastructure software that creates interaction between the user and HDFS. The user interfaces that Hive supports are the Hive Web UI, the Hive command line, and Hive HDInsight (on Windows Server).
Meta Store: Hive chooses respective database servers to store the schema or metadata of tables, databases, columns in a table, their data types, and the HDFS mapping.
HiveQL Process Engine: HiveQL is similar to SQL for querying the schema information in the Meta store. It is one of the replacements of the traditional approach of writing a MapReduce program. Instead of writing a MapReduce program in Java, we can write a query for the MapReduce job and process it.
Execution Engine: The conjunction of the HiveQL process engine and MapReduce is the Hive Execution Engine. The execution engine processes the query and generates the same results as MapReduce. It uses the flavor of MapReduce.
HDFS or HBase: The Hadoop Distributed File System or HBase are the data storage techniques used to store data in the file system.
Working of Hive
The following diagram depicts the workflow between Hive and Hadoop.
The following table defines how Hive interacts with Hadoop framework:
Step No. | Operation
1 Execute Query
The Hive interface such as Command Line or Web UI sends query to Driver (any
database driver such as JDBC, ODBC, etc.) to execute.
2 Get Plan
The driver takes the help of query compiler that parses the query to check the syntax and
query plan or the requirement of query.
3 Get Metadata
The compiler sends metadata request to Meta store (any database).
4 Send Metadata
Meta store sends metadata as a response to the compiler.
5 Send Plan
The compiler checks the requirement and resends the plan to the driver. Up to here, the
parsing and compiling of a query is complete.
6 Execute Plan
The driver sends the execute plan to the execution engine.
7 Execute Job
Internally, the process of execution job is a MapReduce job. The execution engine sends
the job to Job Tracker, which is in Name node and it assigns this job to Task Tracker,
which is in Data node. Here, the query executes MapReduce job.
8 Fetch Result
The execution engine receives the results from Data nodes.
9 Send Results
The execution engine sends those resultant values to the driver.
10 Send Results
The driver sends the results to Hive Interfaces.
In Hive, tables and databases are created first and then data is loaded into these tables.
Hive as data warehouse designed for managing and querying only structured data that is stored in
tables.
While dealing with structured data, Map Reduce doesn't have optimization and usability features
like UDFs but Hive framework does. Query optimization refers to an effective way of query
execution in terms of performance.
Hive's SQL-inspired language separates the user from the complexity of Map Reduce
programming. It reuses familiar concepts from the relational database world, such as tables, rows,
columns and schema, etc. for ease of learning.
Hadoop's programming works on flat files. So, Hive can use directory structures to "partition"
data to improve performance on certain queries.
A new and important component of Hive i.e. Meta store used for storing schema information.
This Meta store typically resides in a relational database. We can interact with Hive using
methods like
o Web GUI
o Java Database Connectivity (JDBC) interface
Most interactions tend to take place over a command line interface (CLI). Hive provides a CLI to write Hive queries using the Hive Query Language (HQL).
Generally, HQL syntax is similar to the SQL syntax that most data analysts are familiar with. The sample query below displays all the records present in the mentioned table (a programmatic connection sketch is shown after this list).
o Sample query: Select * from <Table Name>
Hive supports four file formats those are TEXTFILE, SEQUENCEFILE, ORC and
RCFILE (Record Columnar File).
For single user metadata storage, Hive uses derby database and for multiple user Metadata or
shared Metadata case Hive uses MYSQL.
Some of the key points about Hive:
The major difference between HQL and SQL is that a Hive query executes on Hadoop's infrastructure rather than on a traditional database.
Hive query execution takes place as a series of automatically generated MapReduce jobs.
Hive supports partition and buckets concepts for easy retrieval of data when the client executes
the query.
Hive supports custom specific UDF (User Defined Functions) for data cleansing, filtering, etc.
According to the requirements of the programmers one can define Hive UDFs.
By using Hive, we can perform some functionality that is not achievable in relational databases. For a huge amount of data, in petabytes, querying it and getting results in seconds is important, and Hive does this quite efficiently; it processes queries fast and produces results in seconds.
Some key differences between Hive and relational databases are the following;
Relational databases are of "Schema on READ and Schema on Write". First creating a table then
inserting data into the particular table. On relational database tables, functions like Insertions, Updates,
and Modifications can be performed.
Hive is "Schema on READ only". So, functions like the update, modifications, etc. don't work with this.
Because the Hive query in a typical cluster runs on multiple Data Nodes. So it is not possible to update
and modify data across multiple nodes.( Hive versions below 0.13)
Also, Hive supports a "READ Many WRITE Once" pattern, which means that after inserting data into a table it cannot be updated or modified (in older Hive versions).
NOTE: However, newer versions of Hive come with updated features; from Hive 0.14 onwards, Update and Delete options are provided as new features.
Hive Architecture
The Apache Hive architecture consists of three major parts:
1. Hive Clients
2. Hive Services
3. Hive Storage and Computing
Hive Clients:
Hive provides different drivers for communication with a different type of applications. For Thrift based
applications, it will provide Thrift client for communication.
For Java-related applications, it provides JDBC drivers, and for other types of applications it provides ODBC drivers. These clients and drivers in turn communicate with the Hive server in the Hive services.
Hive Services:
Client interactions with Hive can be performed through Hive Services. If the client wants to perform any
query-related operations in Hive, it has to communicate through Hive Services. The CLI is the command line interface that acts as a Hive service for DDL (Data Definition Language) operations. All drivers communicate with the Hive server and with the main driver in Hive services, as shown in the architecture diagram. The driver present in the Hive services represents the main driver, and it communicates with all types of JDBC, ODBC, and other client-specific applications. The driver processes those requests from the different applications and forwards them to the Meta store and file systems for further processing.
Hive services such as the Meta store, File system, and Job Client in turn communicate with Hive storage and perform the following actions:
Metadata information of tables created in Hive is stored in the Hive "Meta storage database".
Query results and data loaded into the tables are stored in the Hadoop cluster on HDFS.
The Execution Engine (EE) first contacts the Name Node and then the Data Nodes to get the values stored in tables. The EE fetches the desired records from the Data Nodes; the actual data of the tables resides only in the data nodes, while from the Name Node it fetches only the metadata information for the query. It collects the actual data from the data nodes related to the mentioned query.
Execution Engine (EE) communicates bi-directionally with Meta store present in Hive to perform
DDL (Data Definition Language) operations. Here DDL operations like CREATE, DROP and
ALTERING tables and databases are done. Meta store will store information about database
name, table names and column names only. It will fetch data related to query mentioned.
Execution Engine (EE) in turn communicates with Hadoop daemons such as Name node, Data
nodes, and job tracker to execute the query on top of Hadoop file system
Hive is continuously in contact with the Hadoop file system and its daemons via the Execution Engine. The dotted arrow in the job flow diagram shows the Execution Engine's communication with the Hadoop daemons.
Hive can operate in two modes depending on the size of data nodes in Hadoop.
Local mode
Map reduce mode
When to use Local mode:
If Hadoop is installed in pseudo mode with a single data node, we use Hive in this mode
If the data size is small and limited to a single local machine, we can use this mode
Processing will be very fast on smaller data sets present in the local machine
When to use Map reduce mode:
If Hadoop is having multiple data nodes and data is distributed across different node we use Hive
in this mode
It will perform on large amount of data sets and query going to execute in parallel way
Processing of large data sets with better performance can be achieved through this mode
In Hive, we can set a property to specify which mode Hive should work in. By default, it works in MapReduce mode; for local mode, you can use the following setting:
SET mapred.job.tracker=local;
From the Hive version 0.7 it supports a mode to run map reduce jobs in local mode automatically.
Recent versions have some advanced features based on Thrift RPC, such as:
Multi-client concurrency
Authentication
Hive is an ETL and data warehouse tool on top of Hadoop ecosystem and used for processing structured
and semi structured data.
Hive is a database present in Hadoop ecosystem performs DDL and DML operations, and it
provides flexible query language such as HQL for better querying and processing of data.
It provides many features compared to an RDBMS, which has certain limitations.
It provides option of writing and deploying custom defined scripts and User defined functions.
In addition, it provides partitions and buckets for storage specific logics.
6 Explain the storage mechanism of HBase with an example. List out the features of HBase.
Ans:-
HBase is a distributed column-oriented database built on top of the Hadoop file system. It is an open-
source project and is horizontally scalable. HBase is a data model that is similar to Google’s big table
designed to provide quick random access to huge amounts of structured data. It leverages the fault
tolerance provided by the Hadoop File System (HDFS). It is a part of the Hadoop ecosystem that
provides random real-time read/write access to data in the Hadoop File System. One can store the data in
HDFS either directly or through HBase. Data consumer reads/accesses the data in HDFS randomly using
HBase. HBase sits on top of the Hadoop File System and provides read and write access.
HDFS is a distributed file system suitable for storing large files, whereas HBase is a database built on top of HDFS.
HDFS does not support fast individual record lookups, whereas HBase provides fast lookups for larger tables.
HDFS provides high-latency batch processing, whereas HBase provides low-latency access to single rows from billions of records (random access).
HDFS provides only sequential access to data, whereas HBase internally uses hash tables, provides random access, and stores the data in indexed HDFS files for faster lookups.
Storage Mechanism in HBase
HBase is a column-oriented database and the tables in it are sorted by row. The table schema defines only column families, which are the key-value pairs. A table has multiple column families and each column family can have any number of columns. Subsequent column values are stored contiguously on the disk. Each cell value of the table has a timestamp. In short, in HBase:
A table is a collection of rows.
A row is a collection of column families.
A column family is a collection of columns.
A column is a collection of key-value pairs.
Row-oriented databases are suitable for Online Transaction Processing (OLTP), whereas column-oriented databases are suitable for Online Analytical Processing (OLAP).
Row-oriented databases are designed for a small number of rows and columns, whereas column-oriented databases are designed for huge tables.
HBase is schema-less: it does not have the concept of a fixed columns schema and defines only column families. An RDBMS is governed by its schema, which describes the whole structure of its tables.
HBase is built for wide tables and is horizontally scalable. An RDBMS is thin, built for small tables, and hard to scale.
Features of HBase
(The first usable HBase release shipped along with Hadoop 0.15.0 in October 2007.)
HBase consists of a set of tables.
Each table has column families and rows.
Each table must have an element defined as the primary key; the row key acts as the primary key in HBase.
Any access to HBase tables uses this primary key.
Each column present in HBase denotes an attribute of the corresponding object.
The main components of the HBase architecture are:
HMaster
HRegion Server
HRegions
ZooKeeper
HDFS
HMaster:
HMaster is the implementation of the Master server in the HBase architecture. It acts as a monitoring agent that monitors all Region Server instances present in the cluster and acts as an interface for all metadata changes. In a distributed cluster environment, the Master runs on the Name Node. The Master runs several background threads.
It plays a vital role in terms of performance and in maintaining nodes in the cluster.
HMaster provides admin functions and distributes services to the different region servers.
HMaster assigns regions to region servers.
HMaster has features like controlling load balancing and failover to handle the load over the nodes present in the cluster.
When a client wants to change any schema or perform any metadata operations, HMaster takes responsibility for these operations.
Some of the methods exposed by the HMaster interface are primarily metadata-oriented methods.
The client communicates in a bi-directional way with both HMaster and ZooKeeper. For read and write operations, it contacts the HRegion servers directly. HMaster assigns regions to region servers and, in turn, checks the health status of the region servers.
In the entire architecture, we have multiple region servers. HLog, present in each region server, stores all the log files.
When a Region Server receives write and read requests from the client, it assigns the request to the specific region where the actual column family resides. The client can contact HRegion servers directly; HMaster's permission is not mandatory for the client to communicate with HRegion servers. The client requires HMaster's help only when operations related to metadata and schema changes are required. HRegionServer is the Region Server implementation. It is responsible for serving and managing the regions (data) present in the distributed cluster. The region servers run on the Data Nodes present in the Hadoop cluster. HMaster can get into contact with multiple HRegion servers, which perform functions such as hosting and managing regions, splitting regions automatically, handling read and write requests, and communicating with the client directly.
HBase Regions:
HRegions are the basic building elements of an HBase cluster; they consist of the distribution of tables and are made up of column families. A region contains multiple stores, one for each column family, and consists of mainly two components: MemStore and HFile.
Zoo Keeper:
In HBase, Zookeeper is a centralized monitoring server which maintains configuration information and
provides distributed synchronization. Distributed synchronization is to access the distributed applications
running across the cluster, with the responsibility of providing coordination services between nodes. If the client wants to communicate with regions, the client has to approach ZooKeeper first.
The Master and the HBase slave nodes (region servers) register themselves with ZooKeeper. The client needs access to the ZK (ZooKeeper) quorum configuration to connect with the master and region servers. During a failure of nodes present in the HBase cluster, the ZK quorum triggers error messages and starts to repair the failed nodes.
HDFS:-
HDFS is a Hadoop distributed file system, as the name implies it provides a distributed environment for
the storage and it is a file system designed in a way to run on commodity hardware. It stores each file in
multiple blocks and, to maintain fault tolerance, the blocks are replicated across the Hadoop cluster. HDFS provides a high degree of fault tolerance and runs on cheap commodity hardware. By adding nodes to the cluster and performing processing and storage on cheap commodity hardware, it gives the client better results than the existing setup. Here, the data stored in each block is replicated to 3 nodes, so if any node goes down there is no loss of data and there is a proper backup recovery mechanism. HDFS is in contact with the HBase components and stores a large amount of data in a distributed manner.
The read and write operations from the client into an HFile can be shown in the diagram below.
Step 1) The client wants to write data; it first communicates with the Region server and then with the region.
Step 2) The region contacts the MemStore associated with the column family for storing the data.
Step 3) The data is first stored in the MemStore, where it is sorted, and after that it is flushed into an HFile. The main reason for using the MemStore is to store data sorted by row key before it reaches the distributed file system. The MemStore is placed in the region server's main memory, while HFiles are written into HDFS.
Step 4) The client wants to read data from a region.
Step 5) In turn, the client can have direct access to the MemStore, and it can request the data.
Step 6) The client approaches the HFiles to get the data. The data is fetched and retrieved by the client.
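A minimal sketch of the client write/read path above, using the third-party happybase library over HBase's Thrift gateway; it assumes a Thrift server on localhost and a pre-created table named calls with a column family cf (both names are hypothetical).

import happybase

connection = happybase.Connection("localhost")   # talks to the HBase Thrift gateway
table = connection.table("calls")

# Write: the value first lands in the region's MemStore and is later flushed to an HFile.
table.put(b"row-0001", {b"cf:caller": b"9848022338", b"cf:duration": b"231"})

# Read: served by the region server holding this row key, from MemStore/HFiles.
print(table.row(b"row-0001"))

connection.close()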
The MemStore holds in-memory modifications to the store. The hierarchy of objects in HBase regions, from top to bottom, is:
Store - one per column family for each region of the table
MemStore - one per store for each region of the table; it sorts data before flushing into HFiles, so write and read performance increase because of the sorting
StoreFile - one or more store files per store for each region of the table
HBase runs on top of HDFS and Hadoop. Some key differences between HDFS and HBase are in terms
of data operations and processing.
HBase: accessed through shell commands, client API in Java, REST, Avro or Thrift; both storage and processing can be performed.
HDFS: primarily accessed through MR (MapReduce) jobs; it is used only for storage.
Some typical IT industry applications use HBase operations along with Hadoop. Applications include stock exchange data and online banking data operations and processing, for which HBase is a best-suited solution.
Following are examples of HBase use cases with a detailed explanation of the solution it provides to
various technical problems
Telecom industry - technical challenges: storing billions of CDR (Call Detail Record) log records generated by the telecom domain; providing real-time access to CDR logs and customers' billing information; providing a cost-effective solution compared to traditional database systems. HBase solution: HBase is used to store billions of rows of detailed call records. If 20 TB of data is added per month to an existing RDBMS database, performance will deteriorate; to handle such a large amount of data, HBase is the best solution, as it performs fast querying and displays records.
Banking industry - the banking industry generates millions of records on a daily basis and also needs an analytics solution that can detect fraud in money transactions. To store, process, and update vast volumes of data and perform analytics, an ideal solution is HBase integrated with several Hadoop ecosystem components.
To better understand it, let us take an example and consider the table below.
If this table is stored in a row-oriented database, it will store the records as shown below. In row-oriented databases, data is stored on the basis of rows or tuples:
1, Paul Walker, US, 231, Gallardo, 2, VIN Diesel, Brazil, 520, Mustang
In a column-oriented database, all the values of a column are stored together: the first column's values are stored together, then the second column's values, and so on (a short sketch follows after the points below):
1, 2, Paul Walker, VIN Diesel, US, Brazil, 231, 520, Gallardo, Mustang
When the amount of data is very huge, like in terms of petabytes or exa bytes, we use column-
oriented approach, because the data of a single column is stored together and can be accessed
faster.
The row-oriented approach, by comparison, handles a smaller number of rows and columns efficiently, as a row-oriented database stores data in a structured format.
When we need to process and analyze a large set of semi-structured or unstructured data, we use
column oriented approach. Such as applications dealing with Online Analytical Processing like
data mining, data warehousing, applications including analytics, etc.
Whereas, Online Transactional Processing such as banking and finance domains which handle
structured data and require transactional properties (ACID properties) use row-oriented approach.
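As an illustration only, the following Python sketch shows how the same two records end up laid out row-wise versus column-wise, matching the example above.

rows = [
    (1, "Paul Walker", "US", 231, "Gallardo"),
    (2, "VIN Diesel", "Brazil", 520, "Mustang"),
]

# Row-oriented storage keeps all the fields of one record together.
row_layout = [value for record in rows for value in record]

# Column-oriented storage keeps all the values of one column together.
column_layout = [value for column in zip(*rows) for value in column]

print(row_layout)     # 1, 'Paul Walker', 'US', 231, 'Gallardo', 2, 'VIN Diesel', ...
print(column_layout)  # 1, 2, 'Paul Walker', 'VIN Diesel', 'US', 'Brazil', 231, 520, ...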
HBase tables have following components, shown in the image below:
Tables: Data is stored in a table format in HBase. But here tables are in column-oriented format.
Row Key: Row keys are used to search records, which makes searches fast; how this works is explained in the architecture part of this answer.
Column Families: Various columns are combined in a column family. These column families are
stored together which makes the searching process faster because data belonging to same column
family can be accessed together in a single seek.
Column Qualifiers: Each column’s name is known as its column qualifier.
Cell: Data is stored in cells. The data is dumped into cells which are specifically identified by row
key and column qualifiers.
Timestamp: A timestamp is a combination of date and time. Whenever data is stored, it is stored with its timestamp. This makes it easy to search for a particular version of the data.
Set of tables
Each table with column families and rows
Row key acts as a Primary key in HBase.
Any access to HBase tables uses this Primary Key
Each column qualifier present in HBase denotes an attribute of the object that resides in the cell.
Now that you know about HBase Data Model, let us see how this data model falls in line with HBase
Architecture and makes it suitable for large storage and faster processing.
7 What is RDD? Explain transformation and actions in RDD. Explain RDD operation
in brief.
Ans:-
RDD (Resilient Distributed Dataset) is the fundamental data structure of Apache Spark: an immutable collection of objects that is computed on the different nodes of the cluster. Each dataset in a Spark RDD is logically partitioned across many servers so that the partitions can be computed on different nodes of the cluster. This answer covers what an RDD is in Apache Spark, the features of RDDs, the motivation behind RDDs, RDD vs. DSM, Spark RDD operations (transformations and actions), and various limitations of RDDs in Spark.
RDD stands for “Resilient Distributed Dataset”. It is the fundamental data structure of Apache Spark.
An RDD in Apache Spark is an immutable collection of objects that is computed on the different nodes of the cluster.
Decomposing the name RDD:
Resilient, i.e. fault-tolerant with the help of the RDD lineage graph (DAG), and so able to recompute missing or damaged partitions due to node failures.
Distributed, since Data resides on multiple nodes.
Dataset represents records of the data you work with. The user can load the data set externally which
can be JSON file, CSV file, text file or database via JDBC with no specific data structure.
Hence, each and every dataset in an RDD is logically partitioned across many servers so that they can be computed on different nodes of the cluster. RDDs are fault-tolerant, i.e. they possess self-recovery in the case of failure.
There are three ways to create RDDs in Spark: from data in stable storage, from other RDDs, and by parallelizing an already existing collection in the driver program. One can also operate on Spark RDDs in parallel with a low-level API that offers transformations and actions. We will study these Spark RDD operations later in this section.
Spark RDDs can also be cached and manually partitioned. Caching is beneficial when we use an RDD several times, and manual partitioning is important to balance partitions correctly. Generally, smaller partitions allow distributing RDD data more equally among more executors, while fewer, larger partitions keep the scheduling overhead low.
Programmers can also call a persist method to indicate which RDDs they want to reuse in future
operations. Spark keeps persistent RDDs in memory by default, but it can spill them to disk if there is not
enough RAM. Users can also request other persistence strategies, such as storing the RDD only on disk or
replicating it across machines, through flags passed to persist().
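A minimal PySpark sketch of the ways to create an RDD and of persistence, assuming a local Spark installation; the HDFS path is hypothetical.

from pyspark import SparkContext, StorageLevel

sc = SparkContext("local[*]", "rdd-creation-demo")

# 1. Parallelizing an existing collection in the driver program.
numbers = sc.parallelize([1, 2, 3, 4, 5])

# 2. Loading a dataset from stable storage such as HDFS (path is an assumption).
lines = sc.textFile("hdfs:///data/sample.txt")

# 3. Transforming an existing RDD produces a new RDD.
squares = numbers.map(lambda x: x * x)

# Persist an RDD we intend to reuse; MEMORY_AND_DISK spills to disk if RAM runs short.
squares.persist(StorageLevel.MEMORY_AND_DISK)
print(squares.collect())

sc.stop()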
The key motivations behind the concept of RDD are-
Iterative algorithms.
Interactive data mining tools.
DSM (Distributed Shared Memory) is a very general abstraction, but this generality makes it harder
to implement in an efficient and fault tolerant manner on commodity clusters. Here the need of RDD
comes into the picture.
In distributed computing systems, data is stored in an intermediate stable distributed store such as HDFS or Amazon S3. This makes the computation of the job slower, since it involves many IO operations, replications, and serializations in the process.
For the first two use cases, keeping data in memory can improve performance by an order of magnitude.
The main challenge in designing RDD is defining a program interface that provides fault tolerance
efficiently. To achieve fault tolerance efficiently, RDDs provide a restricted form of shared memory,
based on coarse-grained transformation rather than fine-grained updates to shared state.
Spark exposes RDD through language integrated API. In integrated API each data set is represented as an
object and transformation is involved using the method of these objects.
Apache Spark evaluates RDDs lazily: they are computed only when needed, the first time they are used in an action, so that the transformations can be pipelined; this saves a lot of time and improves efficiency. Also, the programmer can call a persist method to state which RDDs they want to reuse in future operations. The following points compare RDD and DSM, which highlights why RDDs are central to Apache Spark.
I. Read
RDD – The read operation in RDD is either coarse grained or fine grained. Coarse-grained meaning
we can transform the whole dataset but not an individual element on the dataset. While fine-grained
means we can transform individual element on the dataset.
DSM – The read operation in Distributed shared memory is fine-grained.
ii. Write
RDD – The write operation in RDD is coarse grained.
DSM – The Write operation is fine grained in distributed shared system.
iii. Consistency
RDD – The consistency of an RDD is trivial, meaning it is immutable in nature. Any data held by an RDD is permanent, i.e. we cannot alter the content of an RDD, so the level of consistency is high.
DSM – In Distributed Shared Memory the system guarantees that if the programmer follows the
rules, the memory will be consistent and the results of memory operations will be predictable.
v. Partitioning
Partitioning is the fundamental unit of parallelism in a Spark RDD. Each partition is one logical, immutable division of the data. One can create a partition through transformations on existing partitions.
vi. Persistence
Users can state which RDDs they will reuse and choose a storage strategy for them (e.g., in-memory
storage or on Disk).
viii. Location-Stickiness
RDDs are capable of defining placement preference to compute partitions. Placement preference refers to
information about the location of RDD.
Spark RDDs support two types of operations:
Transformations
Actions
I. Transformations
Spark RDD Transformations are functions that take an RDD as the input and produce one or many RDDs as the output. They do not change the input RDD (since RDDs are immutable and hence cannot be changed), but always produce one or more new RDDs by applying the computations they represent, e.g. map(), filter(), reduceByKey(), etc.
Transformations are lazy operations on an RDD in Apache Spark. It creates one or many new RDDs,
which executes when an Action occurs. Hence, Transformation creates a new dataset from an existing
one.
Certain transformations can be pipelined which is an optimization method, that Spark uses to improve the
performance of computations. There are two kinds of transformations: narrow transformation, wide
transformation.
a. Narrow Transformations
These are the result of map(), filter(), and similar operations where the data comes from a single partition only, i.e. it is self-sufficient. Each partition of the output RDD has records that originate from a single partition in the parent RDD, and only a limited subset of partitions is used to calculate the result. Spark groups narrow transformations into a stage, which is known as pipelining.
b. Wide Transformations
These are the result of operations such as groupByKey() and reduceByKey(). The data required to compute the records in a single partition may live in many partitions of the parent RDD, so the data has to be shuffled across partitions (see the sketch below).
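A minimal PySpark sketch contrasting a narrow and a wide transformation, assuming a local Spark installation.

from pyspark import SparkContext

sc = SparkContext("local[*]", "transformation-demo")
words = sc.parallelize(["Hadoop", "is", "good", "Hadoop", "is", "bad"])

# Narrow transformations: each output partition depends on a single parent partition.
pairs = words.map(lambda w: (w, 1))            # map() is narrow
short = words.filter(lambda w: len(w) <= 3)    # filter() is narrow

# Wide transformation: records with the same key are shuffled across partitions.
counts = pairs.reduceByKey(lambda a, b: a + b)

# Transformations are lazy; nothing runs until an action such as collect() is called.
print(counts.collect())
sc.stop()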
ii. Actions
An Action in Spark returns the final result of the RDD computations. It triggers execution using the lineage graph to load the data into the original RDD, carry out all intermediate transformations, and return the final result to the driver program or write it out to the file system. The lineage graph is the dependency graph of all the parallel RDDs of an RDD.
Actions are RDD operations that produce non-RDD values. They materialize a value in a Spark program. An Action is one of the ways to send a result from the executors to the driver. first(), take(), reduce(), collect(), and count() are some of the Actions in Spark.
Using transformations, one can create an RDD from an existing one. But when we want to work with the actual dataset, we use an Action. When an Action occurs, it does not create a new RDD, unlike a transformation. Thus, Actions are RDD operations that give non-RDD values. An Action stores its value either in the driver or in an external storage system; it brings the laziness of RDDs into motion.
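A minimal PySpark sketch of common actions; each call triggers execution of the lineage graph and returns a non-RDD value to the driver (a local Spark installation is assumed).

from pyspark import SparkContext

sc = SparkContext("local[*]", "action-demo")
rdd = sc.parallelize([5, 3, 8, 1, 4])

print(rdd.count())                     # number of elements      -> 5
print(rdd.first())                     # first element           -> 5
print(rdd.take(3))                     # first three elements    -> [5, 3, 8]
print(rdd.reduce(lambda a, b: a + b))  # aggregate all elements  -> 21
print(rdd.collect())                   # full dataset to driver  -> [5, 3, 8, 1, 4]

sc.stop()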
Limitations of Spark RDD
Apache Spark RDDs also have some limitations, which is why higher-level abstractions such as DataFrames and Datasets were later introduced.
Ans: -
Apache Hadoop is an open source software framework used to develop data processing applications
which are executed in a distributed computing environment. Applications built using HADOOP are run
on large data sets distributed across clusters of commodity computers. Commodity computers are cheap
and widely available. These are mainly useful for achieving greater computational power at low cost.
Similar to data residing in the local file system of a personal computer, in Hadoop, data resides in a distributed file system called the Hadoop Distributed File System. The processing model is based on the 'Data Locality' concept, wherein computational logic is sent to the cluster nodes (servers) containing the data. This computational logic is nothing but a compiled version of a program written in a high-level language such as Java; such a program processes data stored in Hadoop HDFS.
1. Hadoop MapReduce: MapReduce is a computational model and software framework for writing
applications which are run on Hadoop. These MapReduce programs are capable of processing
enormous data in parallel on large clusters of computation nodes.
2. HDFS (Hadoop Distributed File System): HDFS takes care of the storage part of Hadoop
applications. MapReduce applications consume data from HDFS. HDFS creates multiple replicas
of data blocks and distributes them on compute nodes in a cluster. This distribution enables
reliable and extremely rapid computations.
Although Hadoop is best known for MapReduce and its distributed file system- HDFS, the term is also
used for a family of related projects that fall under the umbrella of distributed computing and large-scale
data processing. Other Hadoop-related projects at Apache include
are Hive, HBase, Mahout, Sqoop, Flume, and Zoo Keeper.
Hadoop Architecture
Hadoop has a Master-Slave Architecture for data storage and distributed data processing
using MapReduce and HDFS methods.
Name Node:
The Name Node represents every file and directory used in the namespace.
Data Node:
A Data Node helps you manage the state of an HDFS node and allows you to interact with the blocks.
Master Node:
The master node allows you to conduct parallel processing of data using Hadoop MapReduce.
Slave node:
The slave nodes are the additional machines in the Hadoop cluster that allow you to store data and conduct complex calculations. Moreover, every slave node comes with a Task Tracker and a Data Node, which allows it to synchronize its processes with the Job Tracker and the Name Node respectively.
Features of 'Hadoop'
As Big Data tends to be distributed and unstructured in nature, HADOOP clusters are best suited for
analysis of Big Data. Since it is the processing logic (not the actual data) that flows to the computing nodes, less network bandwidth is consumed. This is called the data locality concept, which helps increase the efficiency of Hadoop-based applications.
• Scalability
HADOOP clusters can easily be scaled to any extent by adding additional cluster nodes and thus allows
for the growth of Big Data. Also, scaling does not require modifications to application logic.
• Fault Tolerance
HADOOP ecosystem has a provision to replicate the input data on to other cluster nodes. That way, in the
event of a cluster node failure, data processing can still proceed by using data stored on another cluster
node.
Topology (Arrangement) of the network, affects the performance of the Hadoop cluster when the size of
the Hadoop cluster grows. In addition to the performance, one also needs to care about the high
availability and handling of failures. In order to achieve this Hadoop, cluster formation makes use of
network topology.
Typically, network bandwidth is an important factor to consider while forming any network. However, as measuring bandwidth can be difficult, in Hadoop a network is represented as a tree, and the distance between nodes of this tree (the number of hops) is considered an important factor in the formation of a Hadoop cluster. Here, the distance between two nodes is equal to the sum of their distances to their closest common ancestor (see the sketch after the list below). A Hadoop cluster consists of data centers, racks, and the nodes that actually execute jobs: a data center consists of racks, and a rack consists of nodes. The network bandwidth available to processes varies depending upon their location; it becomes smaller as we move away from:
Processes on the same node
Different nodes on the same rack
Nodes on different racks of the same data center
Nodes in different data centers
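As an illustration only, the distance rule can be sketched in Python: a node location is written as /data-center/rack/node, and the distance between two nodes is the sum of their distances to the closest common ancestor.

def distance(loc_a, loc_b):
    a, b = loc_a.strip("/").split("/"), loc_b.strip("/").split("/")
    common = 0                       # depth of the closest common ancestor
    for x, y in zip(a, b):
        if x != y:
            break
        common += 1
    return (len(a) - common) + (len(b) - common)

print(distance("/d1/r1/n1", "/d1/r1/n1"))  # same node                  -> 0
print(distance("/d1/r1/n1", "/d1/r1/n2"))  # different nodes, same rack -> 2
print(distance("/d1/r1/n1", "/d1/r2/n3"))  # different racks, same DC   -> 4
print(distance("/d1/r1/n1", "/d2/r3/n4"))  # different data centers     -> 6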
Hadoop Architecture
The Hadoop architecture is a package of the file system, MapReduce engine and the HDFS (Hadoop
Distributed File System). The MapReduce engine can be MapReduce/MR1 or YARN/MR2. A Hadoop
cluster consists of a single master and multiple slave nodes. The master node includes Job Tracker, Task
Tracker, Name Node, and Data Node whereas the slave node includes Data Node and Task Tracker.
The Hadoop Distributed File System (HDFS) is a distributed file system for Hadoop. It follows a master/slave architecture, which consists of a single Name Node performing the role of master and multiple Data Nodes performing the role of slaves. Both the Name Node and the Data Nodes are capable of running on commodity machines. HDFS is developed in the Java language, so any machine that supports Java can easily run the Name Node and Data Node software.
Name Node
It is a single master server exists in the HDFS cluster.
As it is a single node, it may become a single point of failure.
It manages the file system namespace by executing an operation like the opening, renaming and
closing the files.
It simplifies the architecture of the system.
Data Node
The HDFS cluster contains multiple Data Nodes.
Each Data Node contains multiple data blocks.
These data blocks are used to store data.
It is the responsibility of Data Node to read and write requests from the file system's clients.
It performs block creation, deletion, and replication upon instruction from the Name Node.
Job Tracker
o The role of Job Tracker is to accept the MapReduce jobs from client and process the data by
using Name Node.
o In response, Name Node provides metadata to Job Tracker.
Task Tracker
o It works as a slave node for Job Tracker.
o It receives tasks and code from the Job Tracker and applies that code to the file. This process can also be called a Mapper.
MapReduce Layer
The MapReduce comes into existence when the client application submits the MapReduce job to Job
Tracker. In response, the Job Tracker sends the request to the appropriate Task Trackers. Sometimes, the
Task Tracker fails or time out. In such a case, that part of the job is rescheduled.
Advantages of Hadoop
Fast: In HDFS the data distributed over the cluster and are mapped which helps in faster
retrieval. Even the tools to process the data are often on the same servers, thus reducing the
processing time. It is able to process terabytes of data in minutes and Peta bytes in hours.
Scalable: Hadoop cluster can be extended by just adding nodes in the cluster.
Cost Effective: Hadoop is open source and uses commodity hardware to store data, so it is really cost-effective compared to a traditional relational database management system.
Resilient to failure: HDFS has the property with which it can replicate data over the network, so
if one node is down or some other network failure happens, then Hadoop takes the other copy of
data and use it. Normally, data are replicated thrice but the replication factor is configurable.
History of Hadoop
Hadoop was started by Doug Cutting and Mike Cafarella in 2002. Its origin was the Google File
System paper, published by Google.
In 2002, Doug Cutting and Mike Cafarella started to work on a project, Apache Nutch. It is an
open source web crawler software project.
While working on Apache Nutch, they were dealing with big data. Storing that data was very costly, and this problem became one of the important reasons for the emergence of Hadoop.
In 2003, Google introduced a file system known as GFS (Google file system). It is a proprietary
distributed file system developed to provide efficient access to data.
In 2004, Google released a white paper on Map Reduce. This technique simplifies the data
processing on large clusters.
In 2005, Doug Cutting and Mike Cafarella introduced a new file system known as NDFS (Nutch
Distributed File System). This file system also includes Map reduce.
In 2006, Doug Cutting joined Yahoo. On the basis of the Nutch project, Doug Cutting introduced a new project, Hadoop, with a file system known as HDFS (Hadoop Distributed File System). Hadoop's first version, 0.1.0, was released in this year.
Doug Cutting named his project Hadoop after his son's toy elephant.
In 2007, Yahoo runs two clusters of 1000 machines.
In 2008, Hadoop became the fastest system to sort 1 terabyte of data on a 900 node cluster within
209 seconds.
In 2013, Hadoop 2.2 was released.
In 2017, Hadoop 3.0 was released.
9. Write Short note on Hadoop Ecosystem also explain various elements of Hadoop
Apache Hadoop is an open source framework intended to make interaction with big data easier. However, for those who are not acquainted with this technology, one question arises: what is big data? Big data is a term given to data sets which can't be processed efficiently with traditional methodology such as an RDBMS. Hadoop has made its place in the industries and companies that need to work on large data sets which are sensitive and need efficient handling. Hadoop is a framework that enables processing of large data sets which reside in the form of clusters. Being a framework, Hadoop is made up of several modules that are supported by a large ecosystem of technologies.
Introduction: Hadoop Ecosystem is a platform or a suite which provides various services to solve the big
data problems. It includes Apache projects and various commercial tools and solutions. There are four major
elements of Hadoop i.e. HDFS, MapReduce, YARN, and Hadoop Common. Most of the tools or
solutions are used to supplement or support these major elements. All these tools work collectively to
provide services such as absorption, analysis, storage and maintenance of data etc.
Following are the components that collectively form a Hadoop ecosystem:
HDFS: Hadoop Distributed File System
YARN: Yet Another Resource Negotiator
MapReduce: Programming based Data Processing
Spark: In-Memory data processing
PIG, HIVE: Query based processing of data services
HBase: No SQL Database
Mahout, Spark ML Lib: Machine Learning algorithm libraries
Solr, Lucene: Searching and Indexing
Zookeeper: Managing cluster
Oozie: Job Scheduling
Note: Apart from the above-mentioned components, there are many other components too that are part of
the Hadoop ecosystem.
All these toolkits or components revolve around one thing: data. That is the beauty of Hadoop: it revolves around data, which makes its synthesis easier.
HDFS:
HDFS is the primary or major component of Hadoop ecosystem and is responsible for storing large
data sets of structured or unstructured data across various nodes and thereby maintaining the
metadata in the form of log files.
HDFS consists of two core components i.e.
1. Name node
2. Data Node
The Name Node is the prime node; it contains metadata (data about data) and requires comparatively fewer resources than the data nodes that store the actual data. These data nodes are commodity hardware in the distributed environment, which undoubtedly makes Hadoop cost-effective.
HDFS maintains all the coordination between the clusters and hardware, thus working at the heart of
the system.
YARN:
Yet Another Resource Negotiator, as the name implies, YARN is the one who helps to manage the
resources across the clusters. In short, it performs scheduling and resource allocation for the Hadoop
System.
Consists of three major components i.e.
1. Resource Manager
2. Nodes Manager
3. Application Manager
Resource manager has the privilege of allocating resources for the applications in a system whereas
Node managers work on the allocation of resources such as CPU, memory, bandwidth per machine
and later on acknowledges the resource manager. Application manager works as an interface
between the resource manager and node manager and performs negotiations as per the requirement
of the two.
MapReduce:
By making use of distributed and parallel algorithms, MapReduce makes it possible to carry the processing logic over to the data and helps to write applications that transform big data sets into manageable ones.
MapReduce makes use of two functions, Map() and Reduce(), whose tasks are:
1. Map() performs sorting and filtering of the data and thereby organizes it into groups. Map generates a key-value-pair-based result which is later processed by the Reduce() method.
2. Reduce(), as the name suggests, does the summarization by aggregating the mapped data. In short, Reduce() takes the output generated by Map() as input and combines those tuples into a smaller set of tuples.
PIG:
Pig was originally developed by Yahoo. It works on the Pig Latin language, a query-based language similar to SQL.
It is a platform for structuring the data flow, processing and analyzing huge data sets.
Pig does the work of executing commands and in the background, all the activities of MapReduce
are taken care of. After the processing, pig stores the result in HDFS.
Pig Latin language is specially designed for this framework which runs on Pig Runtime. Just the
way Java runs on the JVM.
Pig helps to achieve ease of programming and optimization and hence is a major segment of the
Hadoop Ecosystem.
HIVE:
With the help of SQL methodology and interface, HIVE performs reading and writing of large data
sets. However, its query language is called as HQL (Hive Query Language).
It is highly scalable as it allows real-time processing and batch processing both. Also, all the SQL
data types are supported by Hive thus, making the query processing easier.
Similar to the Query Processing frameworks, HIVE too comes with two components: JDBC
Drivers and HIVE Command Line.
JDBC, along with ODBC drivers work on establishing the data storage permissions and connection
whereas HIVE Command line helps in the processing of queries.
Mahout:
Mahout brings machine-learning ability to a system or application. Machine learning, as the name suggests, helps the system to develop itself based on patterns, user/environment interaction, or on the basis of algorithms.
It provides various libraries or functionalities such as collaborative filtering, clustering, and
classification which are nothing but concepts of Machine learning. It allows invoking algorithms as
per our need with the help of its own libraries.
Apache Spark:
It’s a platform that handles all the process consumptive tasks like batch processing, interactive or
iterative real-time processing, graph conversions, and visualization, etc.
It consumes in-memory resources, thus being faster than the prior approaches in terms of optimization.
Spark is best suited for real-time data whereas Hadoop is best suited for structured data or batch
processing; hence both are used in most of the companies interchangeably.
Apache HBase:
It's a NoSQL database which supports all kinds of data and is thus capable of handling anything within a Hadoop database. It provides the capabilities of Google's Bigtable and is therefore able to work on big data sets effectively.
At times when we need to search for or retrieve the occurrences of something small in a huge database, the request must be processed within a short span of time. At such times, HBase comes in handy, as it gives us a tolerant way of storing limited data.
Other Components: Apart from all of these, there are some other components too that carry out a huge task
in order to make Hadoop capable of processing large datasets. They are as follows:
Solr, Lucene: These are two services that perform the task of searching and indexing with the help of some Java libraries. Lucene is based on Java and also provides a spell-check mechanism; Solr is built on top of Lucene.
Zookeeper: There used to be a huge problem of managing coordination and synchronization among the
resources and components of Hadoop, which often resulted in inconsistency. Zookeeper overcame
these problems by providing synchronization, inter-component communication, grouping,
and maintenance.
Oozie: Oozie simply performs the task of a scheduler: it schedules jobs and binds them
together as a single logical unit. There are two kinds of jobs, i.e., Oozie workflow jobs and Oozie
coordinator jobs. Oozie workflow jobs are those that need to be executed in a sequentially ordered
manner, whereas Oozie coordinator jobs are those that are triggered when some data or an external
stimulus becomes available.
10. Write a brief note on: Spark, HBase, Pig Data, MapReduce, HDFS, Hive, Pig Script
Ans:-
Apache Spark
Apache Spark is an open source parallel processing framework for running large-scale data
analytics applications across clustered computers. It can handle both batch and real-time analytics and
data processing workloads. Spark became a top-level project of the Apache Software Foundation in
February 2014, and version 1.0 of Apache Spark was released in May 2014. Spark version 2.0 was
released in July 2016. The technology was initially designed in 2009 by researchers at the University of
California, Berkeley as a way to speed up processing jobs in Hadoop systems. Spark Core, the heart of the
project, provides distributed task transmission, scheduling and I/O functionality, and gives
programmers a potentially faster and more flexible alternative to MapReduce, the software
framework to which early versions of Hadoop were tied. Spark's developers say it can run jobs 100 times
faster than MapReduce when processed in memory, and 10 times faster on disk.
Spark libraries
The Spark Core engine functions partly as an application programming interface (API) layer and
underpins a set of related tools for managing and analyzing data. Aside from the Spark Core processing
engine, the Apache Spark API environment comes packaged with some libraries of code for use in data
analytics applications. These libraries include:
Spark SQL -- One of the most commonly used libraries, Spark SQL enables users to query data stored
in disparate applications using the common SQL language.
Spark Streaming -- This library enables users to build applications that analyze and present data in
real time.
MLlib -- A library of machine learning code that enables users to apply advanced statistical
operations to data in their Spark cluster and to build applications around these analyses.
Spark was written in Scala, which is considered the primary language for interacting with the Spark Core
engine. Out of the box, Spark also comes with API connectors for using Java and Python. Java is not
considered an optimal language for data engineering or data science, so many users rely on Python, which
is simpler and more geared toward data analysis. There is also an R programming package that users can
download and run in Spark. This enables users to run the popular desktop data science language on larger
distributed data sets in Spark and to use it to build applications that leverage machine learning algorithms.
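To give a feel for the Java API connector and the Spark SQL library mentioned above, here is a minimal sketch. The events.json file and its columns (user) are hypothetical examples, not a fixed schema.

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class SparkSqlExample {
    public static void main(String[] args) {
        // Local-mode session for experimentation; on a cluster the master URL
        // is normally supplied through spark-submit instead.
        SparkSession spark = SparkSession.builder()
                .appName("spark-sql-sketch")
                .master("local[*]")
                .getOrCreate();

        // Hypothetical input file; JSON, CSV or Parquet sources are read the same way.
        Dataset<Row> events = spark.read().json("events.json");
        events.createOrReplaceTempView("events");

        // Query the data with ordinary SQL through the Spark SQL library.
        Dataset<Row> counts = spark.sql(
                "SELECT user, COUNT(*) AS clicks FROM events GROUP BY user");
        counts.show();

        spark.stop();
    }
}

The same code runs unchanged on a cluster; only the way the session is launched (spark-submit and the master URL) differs.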
The wide range of Spark libraries and its ability to compute data from many different types of data stores
means Spark can be applied to many different problems in many industries. Digital advertising companies
use it to maintain databases of web activity and design campaigns tailored to specific consumers.
Financial companies use it to ingest financial data and run models to guide investing activity. Consumer
goods companies use it to aggregate customer data and forecast trends to guide inventory decisions and
spot new market opportunities.
HBase
HBase is a column-oriented non-relational database management system that runs on top of Hadoop
Distributed File System (HDFS). HBase provides a fault-tolerant way of storing sparse data sets, which
are common in many big data use cases. It is well suited for real-time data processing or random
read/write access to large volumes of data. Unlike relational database systems, HBase does not support a
structured query language like SQL; in fact, HBase isn’t a relational data store at all. HBase applications
are written in Java™ much like a typical Apache MapReduce application. HBase does support writing
applications in Apache Avro, REST and Thrift. An HBase system is designed to scale linearly. It
comprises a set of standard tables with rows and columns, much like a traditional database. Each table
must have an element defined as a primary key (the row key), and all access attempts to HBase tables
must use this primary key. Avro, as a component, supports a rich set of primitive data types including
numeric, binary data and strings, and a number of complex types including arrays, maps, enumerations
and records. A sort order can also be defined for the data. HBase relies on ZooKeeper for
high-performance coordination. ZooKeeper is built into HBase, but if you're running a production
cluster, it's suggested that you have a dedicated ZooKeeper cluster that's integrated with your HBase
cluster. HBase works well with Hive, a
query engine for batch processing of big data, to enable fault-tolerant big data applications.
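Since HBase applications are typically written in Java, a minimal client sketch looks roughly like this. The users table, the info column family and the row-key format are hypothetical; the cluster settings are read from hbase-site.xml on the classpath.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseClientExample {
    public static void main(String[] args) throws Exception {
        // Picks up hbase-site.xml (ZooKeeper quorum etc.) from the classpath.
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             // A "users" table with an "info" column family is assumed to exist.
             Table table = connection.getTable(TableName.valueOf("users"))) {

            // Write one cell, addressed by row key + column family + qualifier.
            Put put = new Put(Bytes.toBytes("user#1001"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Alice"));
            table.put(put);

            // Random read by the same row key (the table's primary key).
            Result result = table.get(new Get(Bytes.toBytes("user#1001")));
            byte[] name = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
            System.out.println("name = " + Bytes.toString(name));
        }
    }
}

Every read and write goes through the row key, which is why all access attempts to an HBase table must use its primary key.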
What is HBase
HBase is an open-source, sorted-map data store built on top of Hadoop. It is column-oriented and
horizontally scalable.
It is based on Google's Bigtable. It has a set of tables which keep data in key-value format. HBase is
well suited for sparse data sets, which are very common in big data use cases. HBase provides APIs
enabling development in practically any programming language. It is a part of the Hadoop ecosystem
that provides random real-time read/write access to data in the Hadoop File System.
Why HBase
o RDBMSs become exponentially slower as the data grows large.
o They expect data to be highly structured, i.e. able to fit into a well-defined schema.
o Any change in schema might require downtime.
o For sparse datasets, there is too much overhead in maintaining NULL values.
Features of HBase
o Horizontally scalable: you can add more nodes to the cluster at any time, and tables can hold any
number of columns.
o Automatic failover: automatic failover is a facility that allows a system administrator to
automatically switch data handling to a standby system in the event of a system failure.
o Integration with the MapReduce framework: all the commands and Java code internally use
MapReduce to do the work, and HBase is built over the Hadoop Distributed File System.
o Sparse, distributed, persistent, multidimensional sorted map, which is indexed by row key, column
key, and timestamp.
o Often referred to as a key-value store, a column-family-oriented database, or as storing versioned
maps of maps.
o Fundamentally, it is a platform for storing and retrieving data with random access.
o It does not care about data types (you can store an integer in one row and a string in another for
the same column).
o It does not enforce relationships within your data.
o It is designed to run on a cluster of computers built using commodity hardware.
Pig Data
Apache Pig is a high-level data flow platform for executing MapReduce programs on Hadoop. The
language used for Pig is Pig Latin. Pig scripts get internally converted to MapReduce jobs and are
executed on data stored in HDFS. Apart from that, Pig can also execute its jobs on Apache Tez or
Apache Spark. Pig can handle any type of data, i.e., structured, semi-structured or unstructured, and
stores the corresponding results in the Hadoop Distributed File System. Every task which can be
achieved using Pig can also be achieved using Java in MapReduce; a small sketch of driving Pig from
Java is shown below.
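As a rough sketch of the Java route referenced above, Pig Latin statements can also be registered programmatically through Pig's PigServer API. The people.txt file and its two fields are hypothetical; "local" mode runs against the local file system, while "mapreduce" mode runs on the cluster.

import java.util.Iterator;
import org.apache.pig.PigServer;
import org.apache.pig.data.Tuple;

public class PigEmbedExample {
    public static void main(String[] args) throws Exception {
        // "local" for the local file system, "mapreduce" to run on the Hadoop cluster.
        PigServer pig = new PigServer("local");

        // Hypothetical tab-separated input file with two fields: name, age.
        pig.registerQuery("people = LOAD 'people.txt' AS (name:chararray, age:int);");
        pig.registerQuery("adults = FILTER people BY age >= 18;");

        // Iterate over the result; each statement is compiled to MapReduce behind the scenes.
        Iterator<Tuple> it = pig.openIterator("adults");
        while (it.hasNext()) {
            System.out.println(it.next());
        }
        pig.shutdown();
    }
}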
Features of Apache Pig
1) Ease of programming
Writing complex Java programs for MapReduce is quite tough for non-programmers. Pig makes this
process easy: in Pig, the queries are converted to MapReduce jobs internally.
2) Optimization opportunities
The way tasks are encoded permits the system to optimize their execution automatically, allowing the
user to focus on semantics rather than efficiency.
3) Extensibility
User-defined functions can be written, in which users put their own logic to execute over the data set.
4) Flexible
Pig can handle structured, semi-structured and unstructured data.
5) In-built operators
Pig provides built-in operators to perform data operations such as union, sorting and ordering.
Difference between MapReduce and Apache Pig:
MapReduce: It is required to develop complex programs using Java or Python. | Apache Pig: It is not required to develop complex programs.
MapReduce: It is difficult to perform data operations. | Apache Pig: It provides built-in operators to perform data operations like union, sorting and ordering.
MapReduce: It does not allow nested data types. | Apache Pig: It provides nested data types like tuples, bags, and maps.
Advantages of Apache Pig
Less code - Pig requires fewer lines of code to perform any operation.
Reusability - Pig code is flexible enough to be reused.
Nested data types - Pig provides the useful concept of nested data types like tuples, bags,
and maps.
MapReduce
MapReduce is a processing technique and a programming model for distributed computing based on Java. The
MapReduce algorithm contains two important tasks, namely Map and Reduce. Map takes a set of data
and converts it into another set of data, where individual elements are broken down into tuples
(key/value pairs). Secondly, the reduce task takes the output from a map as an input and combines
those data tuples into a smaller set of tuples. As the sequence of the name MapReduce implies, the
reduce task is always performed after the map job. The major advantage of MapReduce is that it is easy
to scale data processing over multiple computing nodes. Under the MapReduce model, the data
processing primitives are called mappers and reducers. Decomposing a data processing application
into mappers and reducers is sometimes nontrivial. But, once we write an application in the MapReduce
form, scaling the application to run over hundreds, thousands, or even tens of thousands of machines in a
cluster is merely a configuration change. This simple scalability is what has attracted many programmers
to use the MapReduce model.
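To make the mapper and reducer primitives concrete, here is a minimal word-count sketch in Java. The class and field names are illustrative, not a fixed API.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper: breaks each input line into <word, 1> tuples (key/value pairs).
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        for (String token : line.toString().split("\\s+")) {
            if (token.isEmpty()) continue;
            word.set(token);
            context.write(word, ONE);          // emit intermediate tuple
        }
    }
}

// Reducer: combines all tuples that share a word into a single <word, total> tuple.
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable c : counts) {
            sum += c.get();
        }
        context.write(word, new IntWritable(sum));   // emit reduced tuple
    }
}

The framework groups all intermediate <word, 1> pairs by key during the shuffle before handing them to the reducer.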
The Algorithm
Generally, the MapReduce paradigm is based on sending the computation to where the data resides.
A MapReduce program executes in three stages, namely the map stage, the shuffle stage, and the reduce stage.
Map stage − The map or mapper's job is to process the input data. Generally the input
data is in the form of file or directory and is stored in the Hadoop file system (HDFS).
The input file is passed to the mapper function line by line. The mapper processes the
data and creates several small chunks of data.
Reduce stage − this stage is the combination of the Shuffle stage and the Reduce stage.
The Reducer’s job is to process the data that comes from the mapper. After processing, it
produces a new set of output, which will be stored in the HDFS.
During a MapReduce job, Hadoop sends the Map and Reduce tasks to the appropriate servers in
the cluster.
The framework manages all the details of data-passing such as issuing tasks, verifying task
completion, and copying data around the cluster between the nodes.
Most of the computing takes place on nodes with data on local disks that reduces the network
traffic.
After completion of the given tasks, the cluster collects and reduces the data to form an
appropriate result, and sends it back to the Hadoop server.
Inputs and Outputs (Java Perspective)
The MapReduce framework operates on <key, value> pairs, that is, the framework views the input to the
job as a set of <key, value> pairs and produces a set of <key, value> pairs as the output of the job,
conceivably of different types.
The key and the value classes have to be serializable by the framework and hence need to
implement the Writable interface. Additionally, the key classes have to implement the
WritableComparable interface to facilitate sorting by the framework. The input and output types of a
MapReduce job are: (Input) <k1, v1> → map → <k2, v2> → reduce → <k3, v3> (Output).
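A minimal driver sketch showing how these key/value classes are declared in the job configuration. It assumes the WordCountMapper and WordCountReducer classes sketched earlier; input and output paths are taken from the command line.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);    // from the earlier sketch
        job.setReducerClass(WordCountReducer.class);
        // Output key/value classes; Text and IntWritable implement WritableComparable/Writable.
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // input in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory in HDFS
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}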
Terminology
Payload − Applications implement the Map and the Reduce functions, and form the core of the
job.
Mapper − Maps the input key/value pairs to a set of intermediate key/value pairs.
Name Node − Node that manages the Hadoop Distributed File System (HDFS).
Data Node − Node where data is presented in advance before any processing takes place.
Master Node − Node where the Job Tracker runs and which accepts job requests from clients.
Slave Node − Node where the Map and Reduce programs run.
Job Tracker − Schedules jobs and tracks the assigned jobs for the Task Tracker.
Task Tracker − Tracks the tasks and reports status to the Job Tracker.
Job − An execution of a Mapper and Reducer across a dataset.
Task − An execution of a Mapper or a Reducer on a slice of data.
Task Attempt − A particular instance of an attempt to execute a task on a Slave Node.
HDFS
The Hadoop Distributed File System (HDFS) was developed using a distributed file system design. It
runs on commodity hardware. Unlike other distributed systems, HDFS is highly fault tolerant and
designed using low-cost hardware.
HDFS holds very large amounts of data and provides easier access. To store such huge data, the files
are stored across multiple machines. These files are stored in a redundant fashion to protect the
system from possible data losses in case of failure. HDFS also makes applications available for
parallel processing.
HDFS Architecture
HDFS follows the master-slave architecture and it has the following elements.
Name node
The name node is the commodity hardware that contains the GNU/Linux operating system and the name
node software. It is software that can be run on commodity hardware. The system having the name node
acts as the master server and it does the following tasks −
Manages the file system namespace.
Regulates client’s access to files.
It also executes file system operations such as renaming, closing, and opening files and
directories.
Data node
The data node is commodity hardware running the GNU/Linux operating system and the data node
software. For every node (commodity hardware/system) in the cluster, there will be a data node. These
nodes manage the data storage of their system.
Data nodes perform read-write operations on the file systems, as per client request.
They also perform operations such as block creation, deletion, and replication according to the
instructions of the name node.
Generally the user data is stored in the files of HDFS. A file in the file system is divided into one or
more segments and/or stored in individual data nodes. These file segments are called blocks. In other
words, the minimum amount of data that HDFS can read or write is called a block. The default block
size is 64 MB in Hadoop 1.x (128 MB in Hadoop 2.x and later), but it can be changed as needed in the
HDFS configuration.
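A small Java sketch of writing a file to HDFS and inspecting its block size through the FileSystem API. The name node address and the path are hypothetical; in practice the address usually comes from fs.defaultFS in core-site.xml.

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical name node address; normally supplied by the cluster configuration.
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:9000"), conf);

        Path file = new Path("/user/demo/sample.txt");
        try (FSDataOutputStream out = fs.create(file, true)) {   // true = overwrite
            out.writeUTF("hello hdfs");
        }

        // Block size the file was written with (e.g. 128 MB by default on Hadoop 2+).
        long blockSize = fs.getFileStatus(file).getBlockSize();
        System.out.println("Block size in bytes: " + blockSize);

        fs.close();
    }
}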
Goals of HDFS
Fault detection and recovery − Since HDFS includes a large number of commodity hardware components,
failure of components is frequent. Therefore HDFS should have mechanisms for quick and automatic
fault detection and recovery.
Huge datasets − HDFS should have hundreds of nodes per cluster to manage the applications having
huge datasets.
Hardware at data − A requested task can be done efficiently, when the computation takes place near
the data. Especially where huge datasets are involved, it reduces the network traffic and increases the
throughput.
Hive
Hive is a data warehouse infrastructure tool to process structured data in Hadoop. It resides on top of
Hadoop to summarize Big Data, and makes querying and analyzing easy.
Initially Hive was developed by Facebook; later the Apache Software Foundation took it up and
developed it further as an open source under the name Apache Hive. It is used by different companies.
For example, Amazon uses it in Amazon Elastic MapReduce.
Hive is not:
A relational database
A design for OnLine Transaction Processing (OLTP)
A language for real-time queries and row-level updates
Features of Hive
It stores schema in a database (the metastore) and processed data in HDFS.
It is designed for OLAP.
It provides an SQL-like language for querying called HiveQL or HQL.
It is familiar, fast, scalable, and extensible.
Architecture of Hive
The architecture of Hive contains the following units, each described below:
User Interface − Hive is data warehouse infrastructure software that can create interaction between
the user and HDFS. The user interfaces that Hive supports are the Hive Web UI, the Hive command
line, and Hive HDInsight (on Windows Server).
Meta Store − Hive chooses respective database servers to store the schema or metadata of tables,
databases, columns in a table, their data types, and the HDFS mapping.
HiveQL Process Engine − HiveQL is similar to SQL for querying schema information in the Metastore.
It is one of the replacements of the traditional approach for MapReduce programs. Instead of writing
a MapReduce program in Java, we can write a query for the MapReduce job and process it.
Execution Engine − The conjunction of the HiveQL Process Engine and MapReduce is the Hive
Execution Engine. The execution engine processes the query and generates results the same as
MapReduce results. It uses the flavor of MapReduce.
HDFS or HBase − The Hadoop Distributed File System or HBase are the data storage techniques used
to store data in the file system.
Since 1970, the RDBMS has been the standard solution for data storage and maintenance problems. After
the advent of big data, companies realized the benefit of processing big data and started opting for
solutions like Hadoop. Hadoop uses a distributed file system for storing big data and MapReduce to
process it. Hadoop excels at storing and processing huge volumes of data in various formats:
structured, semi-structured, or even unstructured.
Limitations of Hadoop
Hadoop can perform only batch processing, and data will be accessed only in a sequential manner. That
means one has to search the entire dataset even for the simplest of jobs.
A huge dataset when processed results in another huge data set, which should also be processed
sequentially. At this point, a new solution is needed to access any point of data in a single unit of time
(random access).
Hadoop Random Access Databases
Applications such as HBase, Cassandra, CouchDB, Dynamo, and MongoDB are some of the databases
that store huge amounts of data and access the data in a random manner.
HBase
HBase is a distributed column-oriented database built on top of the Hadoop file system. It is an open-
source project and is horizontally scalable. HBase uses a data model similar to Google's Bigtable,
designed to provide quick random access to huge amounts of structured data. It leverages the fault
tolerance provided by the Hadoop File System (HDFS). It is a part of the Hadoop ecosystem that
provides random real-time read/write access to data in the Hadoop File System.
One can store the data in HDFS either directly or through HBase. Data consumer reads/accesses the data
in HDFS randomly using HBase. HBase sits on top of the Hadoop File System and provides read and
write access.
HBase and HDFS
HDFS: It is a distributed file system suitable for storing large files. | HBase: It is a database built on top of HDFS.
HDFS: It does not support fast individual record lookups. | HBase: It provides fast lookups for larger tables.
HDFS: It provides high-latency batch processing. | HBase: It provides low-latency access to single rows from billions of records (random access).
HDFS: It provides only sequential access to data. | HBase: It internally uses hash tables and provides random access, and it stores the data in indexed HDFS files for faster lookups.
Row-oriented vs column-oriented databases:
Row-oriented: It is suitable for Online Transaction Processing (OLTP); such databases are designed for a small number of rows and columns. | Column-oriented: It is suitable for Online Analytical Processing (OLAP); column-oriented databases are designed for huge tables.
Pig Scripts
Basically, to place Pig Latin statements and Pig commands in a single file, we use Pig scripts. It is good
practice to identify the file using the .pig extension, even though it is not required.
Moreover, we can run Pig scripts from the command line and from the Grunt shell.
Pig scripts also allow us to pass values to parameters using parameter substitution.
Executing a Pig Script in Batch mode
Step 1
First, write all the required Pig Latin statements and commands in a single file and save it as a
.pig file.
Step 2
Afterwards, execute the Apache Pig script. To execute the Pig script from the shell (Linux):
Local mode
$ pig -x local Sample_script.pig
MapReduce mode
$ pig -x mapreduce Sample_script.pig
It is also possible to execute it from the Grunt shell using the exec command.
grunt> exec /sample_script.pig