10 NoSQL Databases - HBase Hive Cassandra

The document introduces NoSQL databases: it defines NoSQL, explains why NoSQL databases were created, and discusses the CAP theorem and how it relates to the ACID and BASE properties. It then covers HBase in detail, describing its data model, architecture, and how data is stored in HFiles on HDFS, followed by overviews of Cassandra and Hive.


NOSQL DATABASES

HBASE CASSANDRA HIVE


Dr. Emmanuel S. Pilli
Malaviya NIT Jaipur
What is NoSQL?
• Stands for Not Only SQL
• Class of non-relational data storage systems
• Usually do not require a fixed table schema nor
do they use the concept of joins
• All NoSQL offerings relax one or more of the
ACID properties (next … CAP theorem)
Why NoSQL?
• For data storage, an RDBMS cannot be the
be-all / end-all
• Just as there are different programming
languages, need to have other data storage tools
in the toolbox
• A NoSQL solution is more acceptable to a client
now than even a year ago
The CAP Theorem
• Consistency: once a writer has written, all readers will see that write.
• Availability: the system is available during software and hardware upgrades and node failures.
• Partition tolerance: the system can continue to operate in the presence of network partition failures.
• Theorem: you can have at most two of these three properties for any shared-data system.

ACID vs BASE
ACID
 Atomic
► All operations in a transaction succeed or every operation
is rolled back.
 Consistent
► On the completion of a transaction, the database is
structurally sound.
 Isolated
► Transactions do not contend with one another.
Contentious access to data is moderated by the database
so that transactions appear to run sequentially.
 Durable
► The results of applying a transaction are permanent,
even in the presence of failures.
BASE
 Basic Availability
► The database appears to work most of the time.
 Soft-state
► Stores don’t have to be write-consistent, nor do different
replicas have to be mutually consistent all the time.
 Eventual consistency
► Stores exhibit consistency at some later point (e.g., lazily
at read time).
CAP theorem
NoSQL Options – Key-Value Stores
• Extremely simple interface
• Data model: (key, value) pairs
• Operations: Insert(key,value), Fetch(key), Update(key,value), Delete(key)
• Some allow (non-uniform) columns within value
• Some allow Fetch on range of keys
• Examples
• Redis, Voldemort
• Memcached
• and 100s more
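The operations above form a very small interface. A minimal in-memory sketch in Java (not tied to Redis, Voldemort, or any particular product) illustrates the data model:

import java.util.HashMap;
import java.util.Map;

// Minimal in-memory sketch of the key-value interface listed above.
// Real stores (Redis, Memcached, ...) add persistence, distribution,
// replication and expiry on top of this basic contract.
public class KeyValueStoreSketch {
    private final Map<String, byte[]> store = new HashMap<>();

    public void insert(String key, byte[] value) { store.put(key, value); }    // Insert(key, value)
    public byte[] fetch(String key)              { return store.get(key); }    // Fetch(key)
    public void update(String key, byte[] value) { store.put(key, value); }    // Update(key, value)
    public void delete(String key)               { store.remove(key); }        // Delete(key)

    public static void main(String[] args) {
        KeyValueStoreSketch kv = new KeyValueStoreSketch();
        kv.insert("user:42", "Alice".getBytes());
        System.out.println(new String(kv.fetch("user:42")));   // prints Alice
        kv.delete("user:42");
    }
}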
NoSQL Options – Document Stores
►Like Key-Value Stores except the value is a document
• Data model: (key, document) pairs
• Document: JSON, XML, other semistructured formats
• Basic operations: Insert(key,document), Fetch(key), Update(key,document), Delete(key)
• Also Fetch based on document contents
►Example systems
• CouchDB, MongoDB, SimpleDB, …
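As a concrete illustration, a hedged sketch using the MongoDB Java driver (the connection string, database and collection names, and field names are assumptions for illustration only):

import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.MongoDatabase;
import com.mongodb.client.model.Filters;
import org.bson.Document;
import java.util.Arrays;

// Store and fetch a JSON-like document; queries can match on document contents.
public class DocumentStoreSketch {
    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoDatabase db = client.getDatabase("shop");
            MongoCollection<Document> orders = db.getCollection("orders");

            // Insert(key, document): the document carries its own (semi)structure
            orders.insertOne(new Document("orderId", 1001)
                    .append("customer", "Alice")
                    .append("items", Arrays.asList("book", "pen")));

            // Fetch based on document contents, not just the key
            Document found = orders.find(Filters.eq("customer", "Alice")).first();
            System.out.println(found.toJson());
        }
    }
}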
NoSQL Options – Column Stores
• Multi-dimensional map
• Not all entries are relevant each time
• Column families
• Examples
• Cassandra
• HBase
• Amazon SimpleDB
NoSQL Options – Graph Stores
• Data model: nodes and edges
• Nodes may have properties (including ID)
• Edges may have labels or roles
 Relational DBs can model graphs, but an edge
requires a join which is expensive
• Examples: Neo4j, FlockDB, Titan, Pregel, …
• RDF “triple stores” can map to graph databases
NoSQL Database Examples
• MongoDB: Open-source document database.
• CouchDB: Database that uses JSON for documents, JavaScript for MapReduce queries, and regular HTTP for an API.
• GemFire: Distributed data management platform providing dynamic scalability, high performance, and database-like persistence.
• Redis: Data structure server wherein keys can contain strings, hashes, lists, sets, and sorted sets.
• Cassandra: Database that provides scalability and high availability without compromising performance.
NoSQL Database Examples
• Memcached: Open-source, high-performance, distributed-memory object-caching system.
• Hazelcast: Open-source, highly scalable data distribution platform.
• HBase: Hadoop database, a distributed and scalable big data store.
• Mnesia: Distributed database management system that exhibits soft real-time properties.
• Neo4j: Open-source, high-performance, enterprise-grade graph database.
HBASE
Dr. Emmanuel S. Pilli
Malaviya NIT Jaipur
It all started when ..
• Google published its paper on BigTable.
• The paper put forward a new way of storing and
retrieving data
• A proven architecture as Google has been using
it for many of their successful applications
• Primarily geared for web scale data storage and
lookups
Why not an RDBMS for such web scale data?
• RDBMSs tend to scale poorly at large data sizes
• ~500GB (both read and write)
• Some RDBMSs support sharding but …
• Data rearrangement (resharding) becomes a problem
• If the size of shards gets imbalanced
• Requires a hashing function to push data into shards
• Code for access becomes complex
• First find the appropriate shard to locate your data
• Pretty much rigid schemas
• Why do we need dynamic schemas?
• Sparse nature of web scale data

• Sharding is a type of database partitioning that separates very large databases into smaller, faster, more easily managed parts called data shards. The word shard means a small part of a whole.
HBase
• Column-Oriented data store, known as “Hadoop
Database”
• Distributed – designed to serve large tables
• Billions of rows and millions of columns
• Supports random real-time CRUD operations (unlike
HDFS) - create, read, update, and delete
• Runs on a cluster of commodity hardware
• Server hardware, not laptop/desktops
• Open-source, written in Java, part of the Apache Hadoop ecosystem
• Type of “NoSQL” DB
• Does not provide SQL-based access
• Does not adhere to the Relational Model for storage
• Simple data model
• Dynamic control over data layout and format
Traditional RDBMS and HBase
HBase Data Model

► Map<byte[], Map<byte[], Map<byte[], Map<Long, byte[]>>>>
► i.e., a sorted map of: row key → column family → column qualifier → timestamp → cell value
HBase Data Model
• Data is stored in Tables
• Tables contain rows
• Rows are referenced by a unique key
• Key is an array of bytes – good news
• Anything can be a key: string, long and your own
serialized data structures
• Rows are made of columns, which are grouped into column families
• Data is stored in cells
• Identified by row x column-family x column
• Cell's content is also an array of bytes
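A short, hedged sketch with the HBase Java client shows how a cell is addressed by row key, column family, column qualifier and value, all as byte arrays (the table name "users" and the family/qualifier "user:first_name" are illustrative names, matching the example on the next slide):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

// Write and read a single cell: (row key x column family x column qualifier) -> value.
public class HBaseCellExample {
    public static void main(String[] args) throws IOException {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("users"))) {

            Put put = new Put(Bytes.toBytes("row-001"));                 // row key (bytes)
            put.addColumn(Bytes.toBytes("user"),                         // column family
                          Bytes.toBytes("first_name"),                   // column qualifier
                          Bytes.toBytes("Alice"));                       // cell value (bytes)
            table.put(put);

            Get get = new Get(Bytes.toBytes("row-001"));
            Result result = table.get(get);
            byte[] value = result.getValue(Bytes.toBytes("user"), Bytes.toBytes("first_name"));
            System.out.println(Bytes.toString(value));                   // prints Alice
        }
    }
}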
HBase Data Model
• Columns are grouped into families
• Labeled as “family:column”
• Example “user:first_name”
• A way to organize your data
• Various features are applied to families
• Compression
• In-memory option
• Stored together - in a file called HFile/StoreFile
• Family definitions are static
• Created with the table; should be rarely added and changed
• Limited to a small number of families
HBase Families
• Family name must be composed of printable
characters
• Not bytes, unlike keys and values
• Think of family:column as a tag for a cell value and NOT
as a spreadsheet
• Columns on the other hand are NOT static
• Create new columns at run-time
• Can scale to millions for a family
HBase timestamps
• Cells' values are versioned
• For each cell multiple versions are kept
• 3 by default
• Another dimension to identify your data
• Either assigned automatically by the region server or provided explicitly by the client
• Versions are stored in decreasing timestamp order
• Read the latest first – optimization to read the current
value
• You can specify how many versions are kept
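A hedged sketch (same illustrative table and column as before, HBase 1.x-style API) of reading several versions of one cell:

import java.io.IOException;
import java.util.List;
import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.CellUtil;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class VersionedReadExample {
    // Fetch up to 3 versions of one cell; they come back newest first.
    static void readVersions(Table table) throws IOException {
        Get get = new Get(Bytes.toBytes("row-001"));
        get.setMaxVersions(3);                      // a Get returns only the latest version by default
        Result result = table.get(get);
        List<Cell> cells = result.getColumnCells(Bytes.toBytes("user"), Bytes.toBytes("first_name"));
        for (Cell cell : cells) {
            System.out.println(cell.getTimestamp() + " -> " + Bytes.toString(CellUtil.cloneValue(cell)));
        }
    }
}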
HBase Row Keys
• Rows are sorted lexicographically by key
• Compared on a binary level from left to right
• For example keys 1,2,3,10,15 will get sorted as
• 1, 10, 15, 2, 3
• Somewhat similar to Relational DB primary index
• Always unique
• Some, but minimal, secondary index support
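Because keys compare as bytes, numeric keys are often zero-padded so that lexicographic order matches numeric order; a small illustrative sketch:

// Keys "1", "10", "15", "2", "3" sort in that (lexicographic) order.
// Zero-padding to a fixed width restores the expected numeric ordering.
public class RowKeyPadding {
    public static void main(String[] args) {
        long[] ids = {1, 2, 3, 10, 15};
        for (long id : ids) {
            String rowKey = String.format("%010d", id);   // "0000000001", "0000000002", ...
            System.out.println(rowKey);
        }
    }
}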
HBase Architecture
• An HBase table is made up of regions
• Each region is defined by a start row key and an end row key.
• Table = SUM of regions
• Region = (tablename, startkey, endkey)
• Each region may live on a different node
• Each region is made up of HDFS blocks and files
• Point where HBase falls back on Hadoop
• These building blocks of storage are replicated by
Hadoop infrastructure (replication settings configurable)
Row distribution across region servers
HBase Architecture – Contd ..
• HBase Nodes are of two types
• RegionServer
• The actual node where the data is stored
• A Region server can hold more than one table.
• Master
• Manage Region Servers
• Load balancing of the Regions
• Moves regions as data is being inserted so that regions stay almost equally balanced.
• This aspect is what gives HBase an advantage over RDBMS sharding approaches.

• Special catalog tables exist inside HBase
• -ROOT-
• Keeps track of the location of the .META. table's regions
• .META.
• Stores the locations of user-table regions and the region servers hosting them
HBase Architecture Implementation
• HBase Master
• Administration of RegionServers

• HRegionServer
• Write Requests
• Read Requests
• Cache Flushes
• Compactions
• Region Splits

• HBase Client
• Caching for region lookups
Data Storage
• Data is stored in files called HFiles/StoreFiles
• Usually saved in HDFS
• HFile is basically a key-value map
• Keys are sorted lexicographically
• When data is added it's written to a log called
Write Ahead Log (WAL) and is also stored in memory
(memstore)
• Flush: when in-memory data exceeds a configured threshold it is flushed to an HFile
• Data persisted to HFile can then be removed from WAL
• Region Server continues serving read-writes during the flush
operations, writing values to the WAL and memstore
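The write path above can be summarised in a purely conceptual sketch (this is not HBase's actual code; the flush threshold is an arbitrary illustrative value):

import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Conceptual sketch of the WAL + memstore + flush cycle described above.
class WritePathSketch {
    private final List<String> wal = new ArrayList<>();                     // stands in for the Write Ahead Log
    private final TreeMap<String, byte[]> memstore = new TreeMap<>();       // sorted, in-memory buffer
    private final List<TreeMap<String, byte[]>> hfiles = new ArrayList<>(); // stands in for flushed HFiles
    private static final int FLUSH_THRESHOLD = 1000;                        // illustrative only

    void put(String rowKey, byte[] value) {
        wal.add(rowKey);                       // 1. append to the log for durability
        memstore.put(rowKey, value);           // 2. buffer the write in memory
        if (memstore.size() >= FLUSH_THRESHOLD) {
            flush();                           // 3. flush once the memstore grows too large
        }
    }

    private void flush() {
        hfiles.add(new TreeMap<>(memstore));   // persist an immutable, key-sorted "HFile"
        memstore.clear();
        wal.clear();                           // flushed data can be dropped from the WAL
    }
}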
Data Storage
• HDFS doesn't support updates to an existing file
therefore HFiles are immutable
• Cannot remove key-values out of HFile(s)
• Over time more and more HFiles are created
• Delete marker is saved to indicate that a record
was removed
• These markers are used to filter the data - to “hide” the
deleted records
• At read time, data is merged from the contents of the HFiles and the memstore
Data Storage
• To control the number of HFiles and to keep the cluster well balanced, HBase periodically performs data compactions
• Minor Compaction: Smaller HFiles are merged into
larger HFiles (n-way merge)
• Fast - Data is already sorted within files
• Delete markers are not applied
• Major Compaction:
• For each region merges all the files within a column-family into
a single file
• Scan all the entries and apply all the deletes as necessary
HBase – Data Access
• HBase Shell
• list, get, put, disable, drop, alter, count, describe, scan, etc.
• Java Client API
• Table API
• Client API for data access, MapReduce
• Thrift Server
• Thrift compiler, Thrift Server and Thrift client
• REST API
• Stargate Servlet
• Avro Server
• Apache Avro is also a cross-language schema compiler
• http://avro.apache.org
• Requires running Avro Server
• HBql
• SQL like syntax for HBase
• http://www.hbql.com
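Of the options above, the Java Client API is the most direct; a brief hedged sketch of a table scan (the table "users" and family "user" are the same illustrative names used earlier):

import java.io.IOException;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class ScanExample {
    // Scan all rows of one column family; rows come back in row-key order.
    static void scanUsers(Connection conn) throws IOException {
        try (Table table = conn.getTable(TableName.valueOf("users"))) {
            Scan scan = new Scan();
            scan.addFamily(Bytes.toBytes("user"));           // restrict to one column family
            try (ResultScanner scanner = table.getScanner(scan)) {
                for (Result row : scanner) {
                    System.out.println(Bytes.toString(row.getRow()));
                }
            }
        }
    }
}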
HBase MapReduce constructs
When to use HBase
 Use HBase if…
– You need random write, random read, or both (but
not neither)
– You need to do many thousands of operations per
second on multiple TB of data
– Your access patterns are well-known and simple

 Don’t use HBase if…


– You only append to your dataset, and tend to read
the whole thing
– You primarily do ad-hoc analytics (ill-defined access
patterns)
– Your data easily fits on one beefy node
References
• An Excellent blog on HBase Architecture
• http://www.larsgeorge.com
• HBase Wiki
• http://wiki.apache.org/hadoop/Hbase
• Some presentations made on HBase
• http://wiki.apache.org/hadoop/HBase/HBasePresentations
CASSANDRA
Dr. Emmanuel S. Pilli
Malaviya NIT Jaipur
The History of Cassandra
Why Use Cassandra?
Cassandra Characteristics…
Column Oriented
Schema Free
Cassandra Use Case - Summary
What is Apache Cassandra?
Apache Cassandra is an open-source, distributed, decentralized, elastically scalable, highly available, fault-tolerant, tunably consistent, column-oriented database.
Distributed and Decentralized
Distributed and Decentralized
Elastic Scalability
High Avalability and Fault Tolerance
Tunable Consistency

Cassandra enables us to tune consistency based on application requirements.
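A hedged sketch using the DataStax Java driver (3.x-style API; the contact point, keyspace, and table names are assumptions) of choosing a consistency level per request:

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ConsistencyLevel;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.SimpleStatement;
import com.datastax.driver.core.Statement;

public class TunableConsistencyExample {
    public static void main(String[] args) {
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect("shop")) {

            // The same query can run at different consistency levels, trading
            // latency/availability against how many replicas must agree.
            Statement read = new SimpleStatement("SELECT * FROM orders WHERE order_id = 42");
            read.setConsistencyLevel(ConsistencyLevel.QUORUM);   // e.g. ONE, QUORUM, ALL, ...

            ResultSet rows = session.execute(read);
            System.out.println(rows.one());
        }
    }
}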
High Performance
Cassandra was designed specifically from the ground up to
take full advantage of multiprocessor / multicore machines
and to run across many dozens of these machines housed
in multiple data centres.

It scales consistently and seamlessly to hundreds of terabytes.
Shows exceptional performance under heavy loads.
Consistently shows very fast throughput for writes per second on a basic commodity workstation.
Where to Use Cassandra?
Use if your application has:
 Big Data (Billions of Records Rows & Columns)
 Very High Velocity Random Reads & Writes
 No Need for Multiple Secondary Indexes
 Low Latency

Use Cases:
 eCommerce Inventory Cache Use Cases
 Time Series / Events Use Cases
 Feed-Based Activity Use Cases
Where NOT to Use Cassandra?
Don’t Use if your application has:
 Secondary Indexes.
 Relational Data.
 Transactional (Rollback, Commit)
 Primary & Financial Records.
 Stringent Security & Authorization Needs On Data.
 Dynamic Queries on Columns.
 Searching Column Data.
 Low Latency.
HIVE
Dr. Emmanuel S. Pilli
Asst. Professor, CSE, MNIT Jaipur

Why Another Data Warehousing System?
• Problem: Data, data and more data
• Several TBs of data every day
• The Hadoop Experiment:
• Uses Hadoop File System (HDFS)
• Scalable/Available
• Problem
• Lacked Expressiveness
• Map-Reduce hard to program
• Solution : HIVE

What is Hive?

• A system for managing and querying unstructured data as if it were structured
• Uses Map-Reduce for execution
• HDFS for Storage

• Key Building Principles
• SQL as a familiar data warehousing tool
• Extensibility (Pluggable map/reduce scripts in the language of
your choice, Rich and User Defined Data Types, User Defined
Functions)
• Interoperability (Extensible Framework to support different file
and data formats)
• Performance
SQL vs HiveQL
HiveQL: Type System
• Primitive types
– Integers: TINYINT, SMALLINT, INT, BIGINT.
– Boolean: BOOLEAN.
– Floating point numbers: FLOAT, DOUBLE.
– String: STRING.
• Complex types
– Structs: {a INT; b INT}.
– Maps: M['group'].
– Arrays: ['a', 'b', 'c'], A[1] returns 'b'
• Functions
► SHOW functions
► DESCRIBE FUNCTION funname

Hive Data Model: Tables


• Managed Tables:
CREATE TABLE managed_table (dummy STRING);
• Hive manages the data
• Moves files to warehouse directory [During LOAD
operation]
• External Tables
CREATE EXTERNAL TABLE external_table (dummy
STRING);

Hive Data Model: Partitions


• Give extra structure to the data
• Useful for more efficient queries.
CREATE TABLE logs (ts BIGINT, line STRING)
PARTITIONED BY (dt STRING, country STRING);

LOAD DATA LOCAL INPATH 'input/hive/partitions/file1'


INTO TABLE logs
PARTITION (dt='2001-01-01', country='GB');

/user/hive/warehouse/logs/dt=2001-01-01/country=GB/file1
                                                  /file2
                                       /country=US/file3
                         /dt=2001-01-02/country=GB/file4
                                        /country=US/file5
                                                   /file6

Hive Data Model: Buckets


• To enable more efficient queries
• JOIN queries
• To make sampling more efficient
CREATE TABLE bucketed_users (id INT, name STRING)
CLUSTERED BY (id) INTO 4 BUCKETS;
Examples – DDL Operations
CREATE TABLE sample (foo INT, bar STRING)
PARTITIONED BY (ds STRING);
SHOW TABLES '.*s';
DESCRIBE sample;
ALTER TABLE sample ADD COLUMNS (new_col
INT);
DROP TABLE sample;
Examples – DML Operations
LOAD DATA LOCAL INPATH './sample.txt'
OVERWRITE INTO TABLE sample PARTITION (ds='2012-02-24');

LOAD DATA INPATH '/user/falvariz/hive/sample.txt' OVERWRITE
INTO TABLE sample PARTITION (ds='2012-02-24');

Importing data
• INSERT OVERWRITE TABLE
INSERT OVERWRITE TABLE target
SELECT col1, col2
FROM source;

• Multitable Insert
FROM records2
INSERT OVERWRITE TABLE stations_by_year
SELECT year, COUNT(DISTINCT station)
GROUP BY year
INSERT OVERWRITE TABLE records_by_year
SELECT year, COUNT(1)
GROUP BY year

• Create Table as Select
CREATE TABLE target
AS
SELECT col1, col2
FROM source;

Querying data
• SELECT
SELECT foo FROM sample WHERE ds='2012-02-24';

• Sorting and Aggregating
FROM records2
SELECT year, temperature
DISTRIBUTE BY year
SORT BY year ASC, temperature DESC;
• JOINS:
• Inner Joins
SELECT sales.*, things.*
FROM sales JOIN things ON (sales.id = things.id);
• Outer Join

Joins…
• Semi joins
SELECT *
FROM things
WHERE things.id IN (SELECT id from sales);
We can rewrite it as follows:
SELECT *
FROM things LEFT SEMI JOIN sales ON (sales.id = things.id);
• Map joins
• If one table is small enough to fit in memory
SELECT /*+ MAPJOIN(things) */ sales.*, things.*
FROM sales JOIN things ON (sales.id = things.id);
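Such queries are usually typed into the Hive shell, but they can also be submitted programmatically; a hedged sketch via the HiveServer2 JDBC driver (the connection URL, empty credentials, and a running HiveServer2 are assumptions for a local, unsecured setup):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcExample {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection con = DriverManager.getConnection(
                 "jdbc:hive2://localhost:10000/default", "", "");
             Statement stmt = con.createStatement();
             ResultSet rs = stmt.executeQuery(
                 "SELECT sales.*, things.* FROM sales JOIN things ON (sales.id = things.id)")) {
            while (rs.next()) {
                System.out.println(rs.getString(1));   // first column of each joined row
            }
        }
    }
}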
Performance - Results
System Architecture and Components
References:
• Hadoop: The Definitive Guide
Tom White (Author)
O'Reilly Media; 3rd Edition (May 6, 2012)
• Programming Hive
Edward Capriolo, Dean Wampler, Jason Rutherglen (Authors)
O'Reilly Media; 1st Edition (October 2012)
Any Questions? Thank you!
