
Big Data Analytics - UNIT II
NOSQL & HADOOP
NO SQL
 NoSQL stands for “Not Only SQL”.
 It is a database management approach where data is stored in a more natural and flexible way.
 The term was first coined by Carlo Strozzi.
 Features of NoSQL databases:
 Open source
 Non-relational
 Distributed
 Schema-less
 Cluster friendly
 Used in web applications
NO SQL
 Widely used in big data and other real-time web applications (e.g., social networking, log analysis, time-based feeds).
 Types:
 Key Value Databases
 Document Oriented Databases
 Wide Column Stores
 Graph Databases
NO SQL
Document Oriented Databases:
 Stores data in the form of documents similar to JSON (JavaScript Object Notation).
 A document contains pairs of fields and values (of a variety of types, including nested objects).
 Suitable for semi-structured and unstructured data sets.
 Examples: MongoDB, Couchbase
 A typical document looks like:
{
  username: “xxx”,
  email: “[email protected]”,
  phone: {
    cell: “1111111111”,
    home: “2222222222”
  }
}
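A minimal sketch of working with such documents from Python, assuming a local MongoDB server and the pymongo driver are available (the database and collection names, and the field values, are only illustrative):

# Minimal sketch, assuming MongoDB is running locally and pymongo is installed.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
db = client["appdb"]          # hypothetical database name
users = db["users"]           # hypothetical collection name

# Insert a document with nested fields, similar to the example above.
users.insert_one({
    "username": "xxx",
    "email": "xxx@example.com",
    "phone": {"cell": "1111111111", "home": "2222222222"},
})

# Query by a field; documents with different shapes can coexist in one collection.
print(users.find_one({"username": "xxx"}))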
NO SQL
 Key Value Database:
 A type of database where each item contains a key and a value.
 Each key is unique and is associated with a single value.
 Used for caching and session management; provides high performance for reads and writes.
 Examples: Amazon DynamoDB, Redis
 A simple view of data stored in a key-value database looks like:
Key: user:12345
Value: { "name": "xxx", "email": "[email protected]", "designation": "data analyst" }
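A minimal sketch of the same idea from Python, assuming a local Redis server and the redis-py client (the key name follows the example above; the stored values are illustrative):

# Minimal sketch, assuming Redis is running locally and redis-py is installed.
import json
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

# Store a value under a unique key (here a JSON string).
r.set("user:12345", json.dumps({"name": "xxx", "designation": "data analyst"}))

# Reads and writes are single-key lookups, which is what gives key-value
# stores their high read/write performance.
print(json.loads(r.get("user:12345")))

# Keys can carry a time-to-live, which is handy for caching and session data.
r.expire("user:12345", 3600)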
NO SQL
 Wide Column stores:
 Data is stored in tables with rows and dynamic columns.
 Different rows can have different sets of columns.
 Column compression techniques are employed to save storage and enhance performance.
 They are used in distributed environments.
 Examples: Apache Cassandra, HBase, Google Bigtable.
 Best suited for OLAP use cases.
 It can be simply depicted as a row key (e.g., Customer Id) mapped to a set of column name/value pairs, where each row may carry its own set of columns.
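The model can be illustrated informally in plain Python, with each row key mapping to its own set of columns, so different rows need not share the same columns (a conceptual sketch only, not a client-library example; all names and values are hypothetical):

# Conceptual sketch of the wide-column model: rows keyed by an id,
# each row carrying only the columns it actually needs.
customers = {
    "cust-001": {"name": "Asha", "email": "asha@example.com", "city": "Chennai"},
    "cust-002": {"name": "Ravi", "loyalty_tier": "gold"},  # different columns than cust-001
}

# Reading one column of one row, the access pattern a column store optimizes for.
print(customers["cust-001"]["name"])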
NO SQL
 Graph Database:
 It stores data in the form of nodes and edges.
 Nodes- Store information about the things (like nouns).
 Edges- Store information about the relationship between the nodes
 Graph databases use mathematical graph theory to represent data connections.
 Use cases: social networking, fraud detection, generative AI.
 A typical graph for customer and order data: a Customer node is connected to an Order node by a “Places” edge, and the Order node is connected to Product nodes by “Includes” edges.
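A minimal sketch of the same customer/order graph as plain Python data structures, just to make the node/edge idea concrete (node names and values are hypothetical; real graph databases such as Neo4j are queried through a graph query language instead):

# Nodes store information about "things"; edges store relationships between them.
nodes = {
    "customer:1": {"label": "Customer", "name": "Asha"},
    "order:100":  {"label": "Order", "total": 250.0},
    "product:7":  {"label": "Product", "title": "Keyboard"},
    "product:9":  {"label": "Product", "title": "Mouse"},
}

# Each edge is (source node, relationship type, target node).
edges = [
    ("customer:1", "PLACES",   "order:100"),
    ("order:100",  "INCLUDES", "product:7"),
    ("order:100",  "INCLUDES", "product:9"),
]

# A traversal follows edges, e.g. list the products in orders placed by customer:1.
for src, rel, dst in edges:
    if src == "customer:1" and rel == "PLACES":
        included = [d for s, r, d in edges if s == dst and r == "INCLUDES"]
        print([nodes[p]["title"] for p in included])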
NoSQL
 Reasons for Using NoSQL:
 Scalability : NoSQL Databases are designed in such a way
that they can handle large amounts of data and user traffic by
adding more commodity hardware.
 Performance: Designed to handle large volumes of data efficiently. This is particularly important for applications that require real-time access and low latency.
 Flexibility: It is schema-less, which means it can handle unstructured and semi-structured data.
 Cost effectiveness: As these databases are designed to run on commodity hardware, they are often cost effective.
 Availability: Designed to handle high levels of traffic and data throughput, so they can provide high availability and fault tolerance.
Advantages of NoSQL
 Advantages of NoSQL:
 Flexible Data Models: NoSQL databases support flexible schemas, allowing dynamic changes to the data model without requiring a predefined schema. This flexibility is crucial for agile development and rapidly changing data environments.
 Scalability: NoSQL databases typically support horizontal scalability, which means they can scale out by adding more servers. This is in contrast to the vertical scaling of traditional relational databases, which often involves adding more resources to a single server.
 High Availability: Many NoSQL databases are designed with high availability in mind, offering features like data replication and distribution across multiple servers to ensure continuous operation and fault tolerance.
 Performance: For certain data structures and workloads, NoSQL databases can offer excellent query performance, especially for applications that involve a lot of reading and writing, and their performance holds up for extensive data processing.
 Handling Large Data Volumes: NoSQL databases are built to store and manage very large volumes of data.
NoSQL
 Use of NoSQL in Industry
 Used to support analysis for applications such as web user data analysis, log analysis and sensor feed analysis.
 Key-Value Pairs: shopping carts, web user data analysis.
 Document Based: real-time analytics, logging, document archive management.
 Column Oriented: analysing huge volumes of web user actions and sensor feeds (e.g., Netflix).
 Graph Based: network modelling, recommendations (e.g., eBay, Walmart cross-sell and up-sell).
NoSQL
 NOSQL Vendors:
Company | Product | Most Widely Used In
Amazon | DynamoDB | LinkedIn, Mozilla
Facebook | Cassandra | Netflix, Twitter, eBay
Google | Bigtable | Adobe Photoshop
COMPARISON OF SQL AND NOSQL
Factors | SQL | NoSQL
Database | Relational | Non-relational, distributed
Data model | Relational model | Model-less approach
Schema | Rigid, pre-defined | Flexible, dynamic
Structure | Table based | Document based, key-value based, wide-column based, graph based
Scalability | Vertically scalable (by increasing system resources) | Horizontally scalable (by creating a cluster of commodity machines)
Language | SQL | UnQL (Unstructured Query Language)
Preference of datasets | Smaller datasets | Larger datasets
Integrity / Availability | Integrity (ACID) | Availability (eventual consistency)
Support | From vendors | From community
Example | Oracle, MySQL, DB2 | MongoDB, Cassandra, BigTable
Hadoop
 Introduction
 Apache Hadoop software is an open source
framework that allows for the distributed storage and
processing of large datasets across clusters of
computers using simple programming models.
 Designed to scale up from a single computer to
thousands of clustered computers, with each machine
offering local computation and storage
 Hadoop can efficiently store and process large
datasets ranging in size from gigabytes to petabytes
of data
 HADOOP is sometimes referred to as an acronym for
High Availability Distributed Object Oriented Platform
Hadoop
 Four modules comprise the primary Hadoop framework and work collectively to form
the Hadoop ecosystem
 Hadoop Distributed File System (HDFS): It is the primary
component of the Hadoop ecosystem. It is a distributed file system in
which individual Hadoop nodes operate on data that resides in their local
storage. This reduces network latency, providing high-throughput access to application data.
 Yet Another Resource Negotiator (YARN): YARN is a resource-
management platform responsible for managing compute resources in
clusters and using them to schedule users’ applications. It performs
scheduling and resource allocation across the Hadoop system.
 Map Reduce: Map Reduce is a programming model for large-scale data
processing. In the Map Reduce model, subsets of larger datasets and
instructions for processing the subsets are dispatched to multiple
different nodes, where each subset is processed by a node in parallel
with other processing jobs. After processing, the results from the individual subsets are combined into a smaller, more manageable dataset.
 Hadoop Common: Hadoop Common includes the libraries and utilities
used and shared by other Hadoop modules.

Hadoop
RDBMS Vs Hadoop:
S.No | RDBMS | Hadoop
1. | Traditional row-column based databases, basically used for data storage, manipulation and retrieval. | An open-source software framework used for storing data and running applications or processes concurrently.
2. | Mostly structured data is processed. | Both structured and unstructured data are processed.
3. | Best suited for an OLTP environment. | Best suited for big data.
4. | Less scalable than Hadoop. | Highly scalable.
5. | Data normalization is required. | Data normalization is not required.
6. | Stores transformed and aggregated data. | Stores huge volumes of data.
7. | Very low latency in response. | Some latency in response.
8. | The data schema is static. | The data schema is dynamic.
9. | High data integrity. | Lower data integrity than RDBMS.
10. | Cost applies for licensed software. | Free of cost, as it is open-source software.
Hadoop
 Distributed Computing Challenges:
 Although there are several challenges with distributed computing, two major challenges must be addressed:
 Hardware Failure: As several servers are networked together, hardware failures are likely. When such failures occur, Hadoop relies on the replication factor (the number of copies of a block that are stored on different data nodes in a cluster).

Suppose you have a file of size 150 MB. With a 128 MB block size, the file is split into 2 blocks:
128 MB = Block 1 (B1)
22 MB = Block 2 (B2)
As the replication factor is 3 by default, there are 3 copies of each block:
FileBlock1: Replica1 (B1R1), Replica2 (B1R2), Replica3 (B1R3)
FileBlock2: Replica1 (B2R1), Replica2 (B2R2), Replica3 (B2R3)
If Slave 1 crashes, the replicas stored on it (say B1R1 and B2R3) are lost, but B1 and B2 can still be recovered from the other slaves, because replicas of these blocks are already present there.
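The same arithmetic as a small, self-contained Python sketch (block size, replication factor and file size are the values used in the example above):

# Sketch: blocks and replicas for a 150 MB file with the default
# 128 MB block size and replication factor 3.
import math

file_size_mb = 150
block_size_mb = 128
replication = 3

num_blocks = math.ceil(file_size_mb / block_size_mb)             # 2 blocks
last_block_mb = file_size_mb - (num_blocks - 1) * block_size_mb  # 22 MB in the last block
total_replicas = num_blocks * replication                        # 6 block replicas in the cluster

print(num_blocks, last_block_mb, total_replicas)  # 2 22 6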
Hadoop
 Distributed Computing Challenges:
 Processing Large Amounts of Data: As the data is spread across several machines on the network, Hadoop uses Map Reduce programming (a programming paradigm that enables massive scalability across hundreds or thousands of servers in a Hadoop cluster) to integrate and process it.
Hadoop

 Overview:
 Open Source Software used in distributed systems to store and
process enormous amount of data
 Key aspects:
 Open source software: free to download and use.
 Framework: A complete package in which the programs, tools and libraries needed are provided.
 Distributed: Divides and Stores the data across multiple
computers and processing is done in parallel.
 Massive Storage: Stores huge amount of data across nodes of low
cost commodity hardware.
 Faster Processing: Large amounts of Data is processed in Parallel
yielding quick response.
Hadoop
 Hadoop Components:
 Hadoop has two major component groups:
 Hadoop Ecosystem
 Hadoop Core Components

 Hadoop Core Components:


(1) Hadoop Distributed File Systems: Used to store the data across different nodes in
distributed systems and uses replication factor to solve the problem of hardware
failure.
(2) Map Reduce: Used for processing the data by splitting, sorting and combining it using the programming model.
 Hadoop Ecosystem: Describes the environment in which projects were developed to enhance the functionality of the Hadoop Core Components:
(1) HIVE
(2) SQOOP
(3) HBASE
(4) PIG
Hadoop
 Hadoop Conceptual Layer:
 It is conceptually divided into a Data Storage Layer (to store enormous amounts of data) and a Data Processing Layer (to perform processing quickly).

 Hadoop High Level Architecture:


 It is a distributed master-slave architecture.
 The master node is known as the Name Node and the slave nodes are called Data Nodes.

[Diagram] The Master Node and each Slave Node run both Map reduce (computation) and HDFS (storage) components; the master coordinates the slaves.
Hadoop Distributed File System
o Overview:
 HDFS (Hadoop Distributed File System) is a unique design that provides storage for extremely large
files with streaming data access pattern and it runs on commodity hardware.
 Extremely large files: Here we are talking about the data in range of petabytes(1000 TB).
 Streaming Data Access Pattern: HDFS is designed on principle of write-once and read-many-
times. Once data is written large portions of dataset can be processed any number times.
 Commodity hardware: Hardware that is inexpensive and easily available in the market. This is one of the features that distinguishes HDFS from other file systems.
 Nodes: Master and slave nodes typically form the HDFS cluster.
 NameNode (MasterNode):
 Manages all the slave nodes and assigns work to them.
 It executes filesystem namespace operations like opening, closing and renaming files and directories.
 It should be deployed on reliable hardware with a high configuration, not on commodity hardware.
 DataNode (SlaveNode):
 Actual worker nodes, who do the actual work like reading, writing, processing etc.
 They also perform creation, deletion, and replication upon instruction from the master.
 They can be deployed on commodity hardware
o HDFS daemons:
 Daemons are the processes running in background.
 Name nodes:
 Run on the master node.
 Store metadata (data about data) like file path, the number of blocks, block Ids. etc.
 Require high amount of RAM.
 Store meta-data in RAM for fast retrieval i.e to reduce seek time. Though a persistent copy
of it is kept on disk.
 Within the Name Node component, there are two types of files:
1. FsImage Files: FsImage files store a persistent snapshot of the filesystem metadata. They contain a comprehensive and organized representation of the file system's structure.
2. EditLogs Files: EditLogs files track all modifications made to the filesystem's files. They serve as
a log, recording changes and updates made over time.
 Data Nodes:
 Run on slave nodes.
 Require large storage capacity, as the data is actually stored here.
 Each DataNode continuously sends a heartbeat message to the Name Node to confirm connectivity.
 If there is no heartbeat from a DataNode, the Name Node re-replicates the blocks that node held onto other nodes in the cluster and continues to work as if nothing has happened.
 Secondary NameNode:
 Secondary NameNode in HDFS is a helper node that performs periodic checkpoints of the
namespace.
 Secondary NameNode helps keep the size of the log file containing HDFS modifications
within limits and assists in faster recovery of the NameNode in case of failures.
 Metadata management:
 In HDFS, the NameNode loads block information into memory during startup, updates the
metadata with Edit Log data, and creates checkpoints.
 The size of metadata is limited by Name Node's RAM.
 Data storage in HDFS:
 In HDFS, data is divided into blocks for optimized storage and retrieval.
 By default, each block has a size of 128 MB, although this can be adjusted as needed.

 These blocks are stored across different datanodes(slavenode)


 Datanodes(slavenode)replicate the blocks among themselves and the information of what
blocks they contain is sent to the master.
 Default replication factor is 3 means for each block 3 replicas are created (including itself).
 In hdfs-site.xml we can increase or decrease the replication factor (the dfs.replication property), i.e., we can edit the configuration there.
 The MasterNode has a record of everything: it knows the location and information of each and every DataNode and the blocks they contain.
 Example: Let us see how the data is stored in a distributed manner.

 By understanding the block size and its impact on the file storage system, data processing can be optimized.
 HDFS Read and Write operations:
[Figures: HDFS structure and its reading process; HDFS write operation]
 Replica Placement strategy:
 As per the Hadoop Replica Placement Strategy, the first replica is placed on the
same node as the client.
 The second replica is placed on a node that is present on different rack.
 The third replica is placed on the same rack as second, but on a different node in the
rack.
 Once the replica locations are set, a pipeline is built to provide good reliability.
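A small Python sketch of the placement rule just described, with hypothetical rack and node names:

# Sketch of the default HDFS replica-placement rule for replication factor 3.
def place_replicas(client_node, client_rack, remote_rack, remote_nodes):
    # 1st replica: on the same node as the client (the writer).
    first = (client_rack, client_node)
    # 2nd replica: on a node in a different rack.
    second = (remote_rack, remote_nodes[0])
    # 3rd replica: on the same rack as the 2nd, but on a different node.
    third = (remote_rack, remote_nodes[1])
    return [first, second, third]

# Hypothetical cluster layout.
print(place_replicas("node-1", "rack-1", "rack-2", ["node-5", "node-6"]))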

[Diagram: block replicas distributed across Rack 1 and Rack 2]

 Data Replication and Data Pipelining are the special features of HDFS
 Working with HDFS Commands:
Command | Description
hadoop fs -ls / | List the directories and files at the root of HDFS
hadoop fs -ls -R / | List the complete tree of directories and files in HDFS
hadoop fs -mkdir /directory_name | Create a directory
hadoop fs -put /root/directory_name/file_name /directory_name/file_name | Copy a file from the local system to HDFS
hadoop fs -get /directory_name/file_name /root/directory_name/file_name | Copy a file from HDFS to the local system
hadoop fs -copyFromLocal /root/directory_name/file_name /directory_name/file_name | Copy a file from the local system to HDFS via the copyFromLocal command
hadoop fs -copyToLocal /directory_name/file_name /root/directory_name/file_name | Copy a file from HDFS to the local system via the copyToLocal command
hadoop fs -cat /directory_name/file_name | Display the contents of an HDFS file on the console
hadoop fs -cp /directory_name1/file_name /directory_name2 | Copy a file from one directory to another
hadoop fs -rm -r /directory_name | Remove a directory from HDFS
PROCESSING DATA WITH HADOOP - MAP REDUCE
 For processing the huge amounts of data stored in HDFS, the programming model called Map Reduce is used.
 It is a programming technique, based on Java, that is used on top of the Hadoop framework for faster processing of huge amounts of data.
 HDFS and the Map Reduce framework run on the same set of nodes.
 There are two daemons associated with Map Reduce programming:
 Job Tracker:
 It is a master process responsible for executing the overall Map Reduce job.
 It provides connectivity between Hadoop and the application.
 When the client submits the code, it creates the execution plan by deciding which task to assign to which node.
 It monitors all the running tasks.
 When tasks fail, it reschedules them to a different node.
 There is a single Job Tracker per Hadoop cluster.
 Task Tracker:
 This is responsible for executing individual tasks assigned by the Job Tracker.
 There is a single Task Tracker per slave node, and it spawns multiple JVMs to handle multiple map or reduce tasks in parallel.
 It continuously sends a heartbeat message to the Job Tracker to confirm connectivity.
 If the Job Tracker fails to receive the heartbeat message from a Task Tracker, it reschedules that tracker's tasks to another available node in the cluster.
 Working of Map Reduce:
 A Map reduce system is usually composed of three steps
 Map
 Shuffle
 Reduce
 Map:
 The input data is split into smaller blocks.
 The Hadoop framework then decides the number of mappers to use, based on the size of the data to be processed and the memory blocks available.
 Each block is then assigned to a mapper for processing.
 Each worker node applies the map function to its local data and writes the output to temporary storage.
 Shuffle:
 Worker nodes then redistribute the data based on the output keys (produced by the map function).
 All data belonging to one key is located on the same worker node.
 The values are grouped by key in the form of key-value pairs.

 Reduce:
 A reducer cannot start while a mapper is still in progress.
 Worker nodes process each group of <key, value> pairs in parallel to produce <key, value> pairs as output.
 All map output values that have the same key are assigned to a single reducer, which then aggregates the values for that key.

[Diagram] Input data stored on HDFS → input splits → Mappers → shuffling and sorting → Reducers → output data stored on HDFS
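As an illustration (plain Python, not Hadoop code), a word count can be written with explicit map, shuffle and reduce phases; in a real cluster the same three phases run in parallel across many nodes, and the input strings below are only example data:

# Illustrative sketch of the map / shuffle / reduce phases for a word count.
from collections import defaultdict

input_splits = ["big data analytics", "data stored on hdfs", "big data processing"]

# Map: emit (key, value) pairs from each input split.
mapped = []
for split in input_splits:
    for word in split.split():
        mapped.append((word, 1))

# Shuffle: group all values belonging to the same key together.
grouped = defaultdict(list)
for key, value in mapped:
    grouped[key].append(value)

# Reduce: aggregate the values for each key.
counts = {key: sum(values) for key, values in grouped.items()}
print(counts)  # e.g. {'big': 2, 'data': 3, 'analytics': 1, ...}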
MANAGING RESOURCES WITH HADOOP - YARN
 YARN stands for “Yet Another Resource Negotiator”.
 It was introduced in Hadoop 2.0 to remove the bottleneck on the Job Tracker which was present in Hadoop 1.0.
 The YARN architecture basically separates the resource management layer from the processing layer.
 In Hadoop 2.0, the responsibility of the Job Tracker is split between the Resource Manager and the Application Master.
 In Hadoop 2.0, YARN also allows different processing engines like graph processing,
interactive processing etc to run and process the data stored in HDFS.
 Through its various components , it can dynamically allocate various resources and
schedule the application processing .For large volume data processing, it is
necessary to manage the available resources properly so that every application can
leverage them.
Features:
 Scalability
 Compatibility
 Cluster Utilization
 Multi tenancy
 Hadoop YARN Architecture:
 The main components of YARN architecture include:
 Client: It submits map-reduce jobs.
 Resource Manager:
 It is the master daemon of YARN and is responsible for resource
assignment and management among all the applications.
 Whenever it receives a processing request, it forwards it to the
corresponding node manager and allocates resources for the
completion of the request accordingly.
 It has two major components:
1.Scheduler: It performs scheduling based on the allocated
application and available resources. It is a pure scheduler. The YARN
scheduler supports plugins such as Capacity Scheduler and Fair
Scheduler to partition the cluster resources.
2.Application manager: It is responsible for accepting the
application and negotiating the first container from the resource
manager. It also restarts the Application Master container if a task
fails.
 Node Manager:
 It takes care of an individual node in the Hadoop cluster and manages the applications and workflow on that node.
 Its primary job is to keep up with the Resource Manager.
 It registers with the Resource Manager and sends heartbeats with the health status of the node.
 It monitors resource usage, performs log management and also kills a container based on directions from the Resource Manager.
 It is also responsible for creating the container process and starting it at the request of the Application Master.
 Application Master:
 An application is a single job submitted to a framework.
 The application master is responsible for negotiating resources with the resource manager,
tracking the status and monitoring progress of a single application.
 The application master requests the container from the node manager by sending a Container
Launch Context(CLC) which includes everything an application needs to run.
 Once the application is started, it sends the health report to the resource manager from time-to-
time.
 Container:
 It is a collection of physical resources such as RAM, CPU cores and disk on a single node.
 The containers are invoked by Container Launch Context(CLC) which is a record that contains
information such as environment variables, security tokens, dependencies etc.
 Application workflow in Hadoop YARN:

 The client submits an application.
 The Resource Manager allocates a container to start the Application Master.
 The Application Master registers itself with the Resource Manager.
 The Application Master negotiates containers from the Resource Manager.
 The Application Master notifies the Node Manager to launch containers.
 Application code is executed in the container.
 The client contacts the Resource Manager/Application Master to monitor the application’s status.
 Once the processing is complete, the Application Master un-registers with the Resource Manager.
INTERACTING WITH HADOOP
ECOSYSTEM
 It describes the environment in which projects were developed to enhance the functionality of the Hadoop Core Components:
(1) HIVE
(2) SQOOP
(3) HBASE
(4) PIG
 PIG:
 It is a dataflow system for Hadoop.
 It uses Pig Latin to specify data flows.
 Focuses on data processing.
 It consists of 2 components:
 Pig Latin: the data processing language.
 Compiler: translates Pig Latin into Map Reduce programs.

 HIVE:
 A data warehousing layer on top of Hadoop.
 It supports SQL-like queries, ad hoc queries and data analysis.
 SQOOP:
 A tool that helps transfer data between Hadoop and relational databases.
 Through Sqoop, data can be imported from an RDBMS into Hadoop and exported back (vice versa).
 HBASE:
 A column-oriented NoSQL database for Hadoop.
 It is used to store billions of rows and millions of columns.
 It provides random read/write operations.
 It supports record-level updates.
 It sits on top of HDFS.
REFERENCES
 https://www.mongodb.com/resources/basics/databases/nosql-explained
 https://scaleyourapp.com/wide-column-and-column-oriented-databases/
 https://aws.amazon.com/nosql/graph/
 https://6point6.co.uk/insights/use-cases-for-graph-database/
 https://www.geeksforgeeks.org/top-5-reasons-to-chose-nosql/
 https://www.sprinkledata.com/blogs/what-is-a-nosql-database-understanding-the-evolution-of-data-management
 https://www.scalablepath.com/back-end/sql-vs-nosql
 https://cloud.google.com/learn/what-is-hadoop
 https://www.researchgate.net/figure/Summary-of-HDFS-Write-operation_fig7_329155467
 https://www.integrate.io/blog/guide-to-hdfs-for-big-data-processing/
 https://www.researchgate.net/figure/Hadoop-Distributed-File-System-HDFS-structure-and-its-reading-process_fig3_312185695
 https://www.geeksforgeeks.org/introduction-to-hadoop-distributed-file-systemhdfs/
