NOSQL & HADOOP
NoSQL
NoSQL stands for "Not Only SQL".
It is a database management approach in which data is stored in a more natural and flexible way.
The term was first coined by Carlo Strozzi.
Features of NoSQL databases:
Open source
Non-relational
Distributed
Schema-less
Cluster-friendly
Used in web applications
NoSQL databases are widely used in big data and other real-time web applications.
[Figure: NoSQL application areas such as social networks, log analysis, and time-based feeds]
Types:
[Figure: Types of NoSQL databases: key-value databases and document-oriented databases, each with example products]
Reasons for Using NoSQL:
Scalability: NoSQL databases are designed so that they can handle large amounts of data and user traffic by adding more commodity hardware.
Performance: Used to handle large volumes of data. This is particularly important for applications that require real-time access and low latency.
Flexibility: NoSQL is schema-less, which means it can handle unstructured and semi-structured data.
Cost effectiveness: As these databases are designed to run on commodity hardware, they are often cost-effective.
Availability: Designed to handle high levels of traffic and data throughput, NoSQL databases can provide high availability and fault tolerance.
Advantages of NoSQL:
Flexible Data Models: NoSQL databases support flexible schemas, allowing dynamic changes to the data model without requiring a predefined schema. This flexibility is crucial for agile development and rapidly changing data environments.
Scalability: NoSQL databases typically support horizontal scalability, which means
they can scale out by adding more servers. This is in contrast to the vertical scaling
of traditional relational databases, which often involves adding more resources to a
single server.
High Availability: Many NoSQL databases are designed with high availability in
mind, offering features like data replication and distribution across multiple servers to
ensure continuous operation and fault tolerance.
Performance: For certain data structures and workloads, NoSQL databases can
offer excellent query performance, especially for applications that involve a lot of
reading and writing, and they are optimized for large-scale data processing.
Handling Large Data Volumes: NoSQL databases are built to store and manage very large volumes of data efficiently.
Use of NoSQL in Industry
NoSQL is used to support analysis for applications such as web user data analysis, log analysis, and sensor feed analysis.
Key-value stores: shopping carts, web user data analysis (a small key-value sketch follows this list).
Document-based stores: real-time analytics, logging, document archive management.
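To make the key-value model concrete, here is a minimal in-memory sketch of a shopping cart store in Java. It is an illustration only, with hypothetical cart keys and values; it is not the API of any particular product, though real key-value databases such as DynamoDB expose essentially the same get/put interface.

    import java.util.HashMap;
    import java.util.Map;

    // A toy in-memory "key-value store": keys are strings, values are opaque blobs.
    public class CartStore {
        private final Map<String, String> store = new HashMap<>();

        // put: associate a value with a key, overwriting any previous value.
        public void put(String key, String value) {
            store.put(key, value);
        }

        // get: look up a value by key; returns null when the key is absent.
        public String get(String key) {
            return store.get(key);
        }

        public static void main(String[] args) {
            CartStore carts = new CartStore();
            // The key encodes user and cart; the value is the cart contents serialized as text.
            carts.put("cart:user42", "laptop x1, mouse x2");
            System.out.println(carts.get("cart:user42")); // prints: laptop x1, mouse x2
        }
    }

Because there is no fixed schema, each value can have a different shape, which is exactly the flexibility described above.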
NoSQL Vendors:

Company | Product | Most Widely Used In
Amazon | DynamoDB | LinkedIn, Mozilla
Facebook | Cassandra | Netflix, Twitter, eBay
Google | BigTable | Adobe Photoshop
COMPARISON OF SQL AND NOSQL

Factors | SQL | NoSQL
Database | Relational | Non-relational, distributed
Data model | Relational model | Model-less approach
Schema | Rigid, pre-defined | Flexible, dynamic
Structure | Table-based | Document-based, key-value-based, wide-column-based, graph-based
Scalability | Vertically scalable (by increasing system resources) | Horizontally scalable (by creating a cluster of commodity machines)
Language | SQL | UnQL (Unstructured Query Language)
Preference of datasets | Smaller datasets | Larger datasets
Integrity/Availability | Integrity (ACID) | Availability (eventual consistency)
Support | From vendors | From community
Examples | Oracle, MySQL, DB2 | MongoDB, Cassandra, BigTable
Hadoop
Introduction
Apache Hadoop software is an open source
framework that allows for the distributed storage and
processing of large datasets across clusters of
computers using simple programming models.
Designed to scale up from a single computer to
thousands of clustered computers, with each machine
offering local computation and storage
Hadoop can efficiently store and process large datasets ranging in size from gigabytes to petabytes.
HADOOP is sometimes referred to as an acronym for High Availability Distributed Object Oriented Platform.
Four modules comprise the primary Hadoop framework and work collectively to form
the Hadoop ecosystem
Hadoop Distributed File System (HDFS): It is the primary
component of the Hadoop ecosystem. It is a distributed file system in
which individual Hadoop nodes operate on data that resides in their local
storage. This reduces network latency, providing high-throughput access to application data. (A short HDFS read sketch follows this module list.)
Yet Another Resource Negotiator (YARN): YARN is a resource-
management platform responsible for managing compute resources in
clusters and using them to schedule users’ applications. It performs
scheduling and resource allocation across the Hadoop system.
MapReduce: MapReduce is a programming model for large-scale data
processing. In the MapReduce model, subsets of larger datasets and
instructions for processing the subsets are dispatched to multiple
different nodes, where each subset is processed by a node in parallel
with other processing jobs. After processing, the results from the
individual subsets are combined into a smaller, more manageable dataset.
Hadoop Common: Hadoop Common includes the libraries and utilities
used and shared by other Hadoop modules.
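As a concrete illustration of the HDFS module, the sketch below reads a file through Hadoop's Java filesystem API (org.apache.hadoop.fs.FileSystem). The path /data/sample.txt is a hypothetical example.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsRead {
        public static void main(String[] args) throws Exception {
            // Picks up fs.defaultFS (the NameNode address) from core-site.xml on the classpath.
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            // Open an HDFS file for reading; the path here is hypothetical.
            Path file = new Path("/data/sample.txt");
            try (BufferedReader reader =
                     new BufferedReader(new InputStreamReader(fs.open(file)))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    System.out.println(line); // process each line of the file
                }
            }
        }
    }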
RDBMS Vs Hadoop:

RDBMS | Hadoop
Traditional row-column based databases, basically used for data storage, manipulation and retrieval | An open-source framework used for storing data and running applications or processes concurrently
Best suited for the OLTP environment | Best suited for big data
Static data schema | Dynamic data schema
High data integrity | Lower data integrity than RDBMS
Cost applies for the licensed software | Free of cost, as it is open-source software
Distributed Computing Challenges:
Although there are several challenges with distributed computing, two major challenges must be addressed:
Hardware Failure: As several servers are networked together, hardware failures are likely. When such failures occur, Hadoop relies on the replication factor (the number of copies of a block that are stored on different data nodes in a cluster) so that no data is lost. (A small replication sketch follows.)
Combining the Data: Analysis usually needs to combine data that is spread across many machines; Hadoop addresses this with the MapReduce programming model, which merges the intermediate results into a final output.
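As a small sketch of the replication factor in practice, an HDFS file's replication can be inspected and changed through the Java filesystem API; the file path below is hypothetical.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ReplicationDemo {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            Path file = new Path("/data/sample.txt"); // hypothetical file

            // Read the current replication factor of the file.
            FileStatus status = fs.getFileStatus(file);
            System.out.println("replication = " + status.getReplication());

            // Ask HDFS to keep 3 copies of each block of this file.
            fs.setReplication(file, (short) 3);
        }
    }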
Overview:
Hadoop is open-source software used in distributed systems to store and process enormous amounts of data.
Key aspects:
Open-source software: free to download and use.
Framework: provides the programs, tools, and libraries needed to develop and run applications.
Distributed: divides and stores the data across multiple computers, and processing is done in parallel.
Massive storage: stores huge amounts of data across nodes of low-cost commodity hardware.
Faster processing: large amounts of data are processed in parallel, yielding quick responses.
Hadoop Components:
Two major components are found in Hadoop:
Hadoop Ecosystem
Hadoop Core Components
[Figure: Hadoop core components: HDFS for storage and MapReduce for computation, coordinated by a master node]
By understanding the block size and its impact on the file storage system, data processing can be optimized. For example, with the HDFS default block size of 128 MB, a 1 GB file is stored as 8 blocks, each replicated across the cluster. (A block-inspection sketch follows.)
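Here is a minimal sketch of inspecting the block size and block placement of an HDFS file via the Java API; the file path is hypothetical.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class BlockInfo {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            FileStatus st = fs.getFileStatus(new Path("/data/big.log")); // hypothetical file

            System.out.println("block size = " + st.getBlockSize() + " bytes");
            // One BlockLocation per block: its offset, length, and the hosts storing its replicas.
            for (BlockLocation b : fs.getFileBlockLocations(st, 0, st.getLen())) {
                System.out.println("offset " + b.getOffset() + ", length " + b.getLength()
                    + ", hosts " + String.join(",", b.getHosts()));
            }
        }
    }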
HDFS Read and Write Operations:
[Figure: HDFS read and write operations across two racks, Rack 1 and Rack 2]
Data replication and data pipelining are the special features of HDFS.
Working with HDFS Commands:

hadoop fs -ls /
    List the directories and files at the root of HDFS.
hadoop fs -ls -R /
    List all directories and files of HDFS recursively.
hadoop fs -mkdir /directory_name
    Create the directory.
hadoop fs -put /root/directory_name/file_name /directory_name/file_name
    Copy a file from the local system to HDFS.
hadoop fs -get /directory_name/file_name /root/directory_name/file_name
    Copy a file from HDFS to the local system.
hadoop fs -copyFromLocal /root/directory_name/file_name /directory_name/file_name
    Copy a file from the local system to HDFS via the copyFromLocal command.
hadoop fs -copyToLocal /directory_name/file_name /root/directory_name/file_name
    Copy a file from HDFS to the local system via the copyToLocal command.
hadoop fs -cat /directory_name/file_name
    Display the contents of the HDFS file on the console.
hadoop fs -cp /directory_name1/file_name /directory_name2
    Copy a file from one HDFS directory to another.
hadoop fs -rm -r /directory_name
    Remove the directory from HDFS.
PROCESSING DATA WITH HADOOP - MAPREDUCE
For processing the huge amounts of data stored in HDFS, the programming model called MapReduce is used.
It is a Java-based programming technique used on top of the Hadoop framework for faster processing of huge amounts of data.
HDFS and the MapReduce framework run on the same set of nodes.
There are two daemons associated with MapReduce programming:
Job Tracker:
It is a master process responsible for executing the overall MapReduce job.
It provides connectivity between Hadoop and the application.
When the client submits the code, it creates the execution plan by deciding which task to assign to which node.
It monitors all the running tasks.
When a task fails, it reschedules the task on a different node.
There is a single Job Tracker per Hadoop cluster.
Task Tracker:
It is responsible for executing the individual tasks assigned by the Job Tracker.
There is a single Task Tracker per slave node; it spawns multiple JVMs to handle multiple map or reduce tasks in parallel.
It continuously sends heartbeat messages to the Job Tracker to confirm connectivity.
If the Job Tracker fails to receive the heartbeat message from a Task Tracker, it reschedules the tasks on another available node in the cluster.
Working of MapReduce:
A MapReduce job is usually composed of three steps:
Map
Shuffle
Reduce
Map:
The input data is split into smaller blocks.
The Hadoop framework then decides the number of mappers to use, based on the size of the data to be processed and the memory block available.
Each block is then assigned to a mapper for processing.
Each worker node applies the map function to its local data and writes the output to temporary storage.
Shuffle:
Worker nodes then redistribute the data based on the output keys (produced by the map function), so that all data belonging to one key is located on the same worker node.
The values are grouped by key in the form of key-value pairs.
Reduce:
A reducer cannot start while a mapper is still in progress.
Worker nodes process each group of <key, value> pairs in parallel to produce <key, value> pairs as output.
All the map output values that have the same key are assigned to a single reducer, which then aggregates the values for that key. (A complete word-count sketch follows the figure below.)
[Figure: MapReduce data flow: input data stored on HDFS is divided into input splits, processed by mappers, passed through shuffling and sorting, combined by reducers, and the output data is stored on HDFS]
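To make the three steps concrete, below is the classic word-count program written against the Hadoop MapReduce Java API: the map step emits (word, 1) pairs, the shuffle groups them by word, and the reduce step sums the counts. Input and output HDFS paths are supplied as command-line arguments.

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {
        // Map: emit (word, 1) for every word in the input split.
        public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
            private final static IntWritable one = new IntWritable(1);
            private final Text word = new Text();
            @Override
            public void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer itr = new StringTokenizer(value.toString());
                while (itr.hasMoreTokens()) {
                    word.set(itr.nextToken());
                    context.write(word, one);
                }
            }
        }

        // Reduce: all counts for the same word arrive together after the shuffle; sum them.
        public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            private final IntWritable result = new IntWritable();
            @Override
            public void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable val : values) sum += val.get();
                result.set(sum);
                context.write(key, result);
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setCombinerClass(IntSumReducer.class);
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. an HDFS input directory
            FileOutputFormat.setOutputPath(job, new Path(args[1])); // must not already exist
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

Note that the reducer is reused as a combiner, so counts are pre-aggregated on each mapper node, reducing shuffle traffic.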
MANAGING RESOURCES WITH HADOOP - YARN
YARN stands for "Yet Another Resource Negotiator".
It was introduced in Hadoop 2.0 to remove the bottleneck of the Job Tracker present in Hadoop 1.0.
The YARN architecture separates the resource-management layer from the processing layer.
With YARN, the responsibilities of the Hadoop 1.0 Job Tracker are split between the Resource Manager and the Application Master.
In Hadoop 2.0, YARN also allows different processing engines, such as graph processing and interactive processing, to run on and process the data stored in HDFS.
Through its various components, it can dynamically allocate resources and schedule application processing. For large-volume data processing, it is necessary to manage the available resources properly so that every application can leverage them.
Features:
Scalability
Compatibility
Cluster Utilization
Multi-tenancy
Hadoop YARN Architecture:
The main components of YARN architecture include:
Client: It submits MapReduce jobs.
Resource Manager:
It is the master daemon of YARN and is responsible for resource
assignment and management among all the applications.
Whenever it receives a processing request, it forwards it to the
corresponding node manager and allocates resources for the
completion of the request accordingly.
It has two major components:
1. Scheduler: It performs scheduling based on the allocated application and available resources. It is a pure scheduler. The YARN scheduler supports plugins such as the Capacity Scheduler and the Fair Scheduler to partition the cluster resources.
2. Application Manager: It is responsible for accepting the application and negotiating the first container from the Resource Manager. It also restarts the Application Master container if a task fails.
Node Manager:
It takes care of an individual node in the Hadoop cluster and manages the applications and workflow on that node.
Its primary job is to keep up with the Resource Manager.
It registers with the Resource Manager and sends heartbeats with the health status of the node.
It monitors resource usage, performs log management, and kills a container based on directions from the Resource Manager.
It is also responsible for creating a container process and starting it at the request of the Application Master.
Application Master:
An application is a single job submitted to a framework.
The application master is responsible for negotiating resources with the resource manager,
tracking the status and monitoring progress of a single application.
The Application Master requests the container from the Node Manager by sending a Container Launch Context (CLC), which includes everything an application needs to run.
Once the application is started, it sends health reports to the Resource Manager from time to time.
Container:
It is a collection of physical resources such as RAM, CPU cores, and disk on a single node.
Containers are invoked via the Container Launch Context (CLC), a record that contains information such as environment variables, security tokens, and dependencies.
Application workflow in Hadoop YARN:
1. The client submits an application.
2. The Resource Manager allocates a container to start the Application Master.
3. The Application Master registers with the Resource Manager.
4. The Application Master negotiates containers from the Resource Manager.
5. The Application Master notifies the Node Manager to launch the containers.
6. The application code is executed in the containers.
7. The client contacts the Resource Manager or Application Master to monitor the application's status.
8. Once processing is complete, the Application Master unregisters with the Resource Manager.
HIVE:
A data warehousing layer on top of Hadoop.
It supports data analysis and ad hoc queries.
SQOOP:
A tool that helps transfer data between Hadoop and relational databases.
Through Sqoop, data can be imported from an RDBMS into Hadoop and exported back again. (An example invocation follows.)
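As an illustration, a typical Sqoop import invocation looks like the following; the JDBC connection string, credentials, table name, and target directory are all hypothetical.

    sqoop import \
      --connect jdbc:mysql://dbhost/sales \
      --username dbuser --password dbpass \
      --table customers \
      --target-dir /data/customers

This pulls the rows of the customers table out of MySQL and writes them as files under the given HDFS directory.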
HBASE:
It is a column-oriented NoSQL database for Hadoop.
It is used to store billions of rows and millions of columns.
It provides random read/write operations.
It supports record-level updates.
It sits on top of HDFS. (A short shell session follows.)
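For a flavor of HBase's record-level access, a short HBase shell session might look like this; the table name 'users' and column family 'info' are hypothetical.

    create 'users', 'info'                          # create a table with one column family
    put 'users', 'row1', 'info:name', 'Asha'        # record-level write to one cell
    get 'users', 'row1'                             # random read of a single row
    scan 'users'                                    # iterate over all rows in the table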
[Figure: Hadoop ecosystem tools: Pig, Hive, Sqoop, HBase]