10 NoSQL Databases - HBase Hive Cassandra
The CAP Theorem
• Consistency: once a writer has written, all readers will see that write
• Availability: the system is available during software and hardware upgrades and node failures
• Partition tolerance: the system can continue to operate in the presence of network partition failures
• Theorem: you can have at most two of these properties for any shared-data system
• HRegionServer
– Write Requests
– Read Requests
– Cache Flushes
– Compactions
– Region Splits
• HBase Client
– Caching for region lookups
Data Storage
• Data is stored in files called HFiles/StoreFiles
• Usually saved in HDFS
• HFile is basically a key-value map
• Keys are sorted lexicographically
• When data is added it's written to a log called
Write Ahead Log (WAL) and is also stored in memory
(memstore)
• Flush: when in-memory data exceeds a configured maximum size, it is
flushed to an HFile
• Data persisted to an HFile can then be removed from the WAL
• The Region Server continues serving reads and writes during flush
operations, writing new values to the WAL and memstore
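The write path above can be sketched as a toy model. This is only an illustration in Python, not HBase code: the class name `ToyRegionStore`, the `max_entries` limit (counted in entries rather than bytes), and the sample row keys are all assumptions made for the example.

```python
class ToyRegionStore:
    """Toy model of the write path: WAL + memstore + flush to immutable files."""

    def __init__(self, max_entries=2):
        self.max_entries = max_entries  # hypothetical limit, counted in entries
        self.wal = []        # append-only log of (key, value) edits
        self.memstore = {}   # in-memory buffer of recent edits
        self.hfiles = []     # each "HFile" is an immutable, key-sorted tuple

    def put(self, key, value):
        self.wal.append((key, value))   # log the edit first, for recovery
        self.memstore[key] = value      # then buffer it in memory
        if len(self.memstore) > self.max_entries:
            self.flush()

    def flush(self):
        # Persist the memstore as a sorted, immutable file.
        self.hfiles.append(tuple(sorted(self.memstore.items())))
        self.memstore.clear()
        # Edits that reached a file no longer need to stay in the log.
        self.wal.clear()


store = ToyRegionStore(max_entries=2)
for row in ["row10", "row9", "row1"]:
    store.put(row, "v")

# Keys are ordered as strings, so "row10" sorts before "row9".
print(store.hfiles[0])  # (('row1', 'v'), ('row10', 'v'), ('row9', 'v'))
```

The printed order shows why lexicographic sorting matters when designing row keys: numeric suffixes do not sort numerically unless they are zero-padded.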
Data Storage
• HDFS doesn't support updates to an existing file;
therefore HFiles are immutable
• Cannot remove key-values out of HFile(s)
• Over time more and more HFiles are created
• Delete marker is saved to indicate that a record
was removed
• These markers are used to filter the data - to “hide” the
deleted records
• At read time, data is merged between the content of the
HFiles and the in-memory memstore
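A minimal sketch of how delete markers can hide records at read time. It assumes reads consult the in-memory memstore first and then files from newest to oldest; `TOMBSTONE`, `read`, and the sample data are hypothetical names used only for illustration.

```python
TOMBSTONE = object()  # hypothetical sentinel standing in for a delete marker


def read(key, memstore, hfiles):
    """Return the newest value for key, or None if it is absent or deleted.

    memstore: dict of recent edits; hfiles: list of dicts, oldest first.
    """
    # Search the newest data first: memstore, then files newest to oldest.
    for layer in [memstore] + hfiles[::-1]:
        if key in layer:
            value = layer[key]
            return None if value is TOMBSTONE else value
    return None


hfiles = [{"a": 1, "b": 2}, {"b": TOMBSTONE}]  # the newer file hides "b"
memstore = {"c": 3}

print(read("a", memstore, hfiles))  # 1
print(read("b", memstore, hfiles))  # None
print(read("c", memstore, hfiles))  # 3
```

The value of "b" is still physically present in the older file; the marker only filters it out of results.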
Data Storage
• To control the number of HFiles and to keep the
cluster well balanced, HBase periodically
performs compactions
• Minor Compaction: Smaller HFiles are merged into
larger HFiles (n-way merge)
• Fast - Data is already sorted within files
• Delete markers are not applied
• Major Compaction:
• For each region, merges all the files within a column family into
a single file
• Scans all the entries and applies all the deletes as necessary
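The difference between the two compaction types can be sketched with Python's `heapq.merge`, which performs an n-way merge of already-sorted inputs. This is a simplified illustration, not HBase's algorithm: the `(key, seq, value)` layout, the `DELETE` marker, and the rule "newest edit per key wins" are assumptions of the sketch.

```python
import heapq

DELETE = "<deleted>"  # hypothetical delete-marker value


def minor_compact(files):
    """N-way merge of key-sorted files; delete markers are kept as-is."""
    # Each file is a list of (key, seq, value); seq orders edits of one key.
    return list(heapq.merge(*files))


def major_compact(files):
    """Merge all files, keep the newest edit per key, drop deleted keys."""
    latest = {}
    for key, seq, value in heapq.merge(*files):
        latest[key] = value  # a higher seq for the same key arrives later
    return [(k, v) for k, v in latest.items() if v != DELETE]


f1 = [("a", 1, "x"), ("b", 2, "y")]
f2 = [("b", 3, DELETE), ("c", 4, "z")]

print(minor_compact([f1, f2]))
# [('a', 1, 'x'), ('b', 2, 'y'), ('b', 3, '<deleted>'), ('c', 4, 'z')]
print(major_compact([f1, f2]))
# [('a', 'x'), ('c', 'z')]
```

The minor compaction is cheap because it never has to re-sort; only the major compaction physically removes the deleted row "b".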
HBase – Data Access
• HBase Shell
• list, get, put, disable, drop, alter, count, describe, scan, etc.
• Java Client API
• Table API
• Client API for data access, MapReduce
• Thrift Server
• Thrift compiler, Thrift Server and Thrift client
• REST API
• Stargate Servlet
• Avro Server
• Apache Avro is also a cross-language schema compiler
• http://avro.apache.org
• Requires running Avro Server
• HBql
• SQL like syntax for HBase
• http://www.hbql.com
HBase Map Reduce constructs
When to use HBase
Use HBase if…
– You need random write, random read, or both (but
not neither)
– You need to do many thousands of operations per
second on multiple TB of data
– Your access patterns are well-known and simple
Use Cases:
• eCommerce inventory cache
• Time series / events
• Feed-based activity
Where NOT to Use Cassandra?
Don’t use Cassandra if your application needs:
• Secondary indexes
• Relational data
• Transactional operations (rollback, commit)
• Primary and financial records
• Stringent security and authorization on data
• Dynamic queries on columns
• Searching column data
• Low latency
HIVE
Dr. Emmanuel S. Pilli
Asst. Professor, CSE, MNIT Jaipur
What is Hive?
SQL vs HiveQL
HiveQL: Type System
• Primitive types
– Integers: TINYINT, SMALLINT, INT, BIGINT
– Boolean: BOOLEAN
– Floating point numbers: FLOAT, DOUBLE
– String: STRING
• Complex types
– Structs: {a INT; b INT}
– Maps: M['group']
– Arrays: ['a', 'b', 'c']; A[1] returns 'b' (indexes start at 0)
• Functions
► SHOW FUNCTIONS
► DESCRIBE FUNCTION funname
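/user/hive/warehouse/logs/dt=2010-01-01/country=GB/file1
/user/hive/warehouse/logs/dt=2010-01-01/country=GB/file2
/user/hive/warehouse/logs/dt=2010-01-01/country=US/file3
/user/hive/warehouse/logs/dt=2010-01-02/country=GB/file4
/user/hive/warehouse/logs/dt=2010-01-02/country=US/file5
/user/hive/warehouse/logs/dt=2010-01-02/country=US/file6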
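The directory layout above encodes each partition column as a `column=value` path segment. The following Python sketch shows how such a path can be built and how a filter on a partition column narrows the set of files to read. The helper names `partition_path` and `prune` are hypothetical; Hive does this internally.

```python
def partition_path(table_dir, partition_values):
    """Build a partition directory from ordered column -> value pairs."""
    parts = "/".join(f"{col}={val}" for col, val in partition_values.items())
    return f"{table_dir}/{parts}"


def prune(paths, **wanted):
    """Keep only the files whose partition segments match every filter."""
    keep = []
    for path in paths:
        # Collect the column=value segments of the path into a dict.
        segments = dict(s.split("=", 1) for s in path.split("/") if "=" in s)
        if all(segments.get(col) == val for col, val in wanted.items()):
            keep.append(path)
    return keep


base = "/user/hive/warehouse/logs"
paths = [
    partition_path(base, {"dt": "2010-01-01", "country": "GB"}) + "/file1",
    partition_path(base, {"dt": "2010-01-01", "country": "GB"}) + "/file2",
    partition_path(base, {"dt": "2010-01-01", "country": "US"}) + "/file3",
    partition_path(base, {"dt": "2010-01-02", "country": "GB"}) + "/file4",
    partition_path(base, {"dt": "2010-01-02", "country": "US"}) + "/file5",
    partition_path(base, {"dt": "2010-01-02", "country": "US"}) + "/file6",
]

print(paths[0])  # /user/hive/warehouse/logs/dt=2010-01-01/country=GB/file1
print([p.rsplit("/", 1)[1] for p in prune(paths, country="GB")])
# ['file1', 'file2', 'file4']
```

A query that filters on `country = 'GB'` only needs to touch three of the six files, which is the main benefit of partitioning.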
Importing data
• INSERT OVERWRITE TABLE
INSERT OVERWRITE TABLE target
SELECT col1, col2
FROM source;
• Multitable Insert
FROM records2
INSERT OVERWRITE TABLE stations_by_year
SELECT year, COUNT(DISTINCT station)
GROUP BY year
INSERT OVERWRITE TABLE records_by_year
SELECT year, COUNT(1)
GROUP BY year;
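The point of a multitable insert is that the source table is scanned once while several results are produced. A rough Python analogue, assuming rows of `(year, station)` with no NULL values; the function name `multi_insert` and the sample rows are hypothetical:

```python
from collections import defaultdict


def multi_insert(records):
    """One scan over (year, station) rows feeding two aggregates."""
    stations = defaultdict(set)   # year -> distinct stations seen
    counts = defaultdict(int)     # year -> number of rows
    for year, station in records:
        stations[year].add(station)
        counts[year] += 1
    stations_by_year = {year: len(s) for year, s in stations.items()}
    records_by_year = dict(counts)
    return stations_by_year, records_by_year


records2 = [(1950, "A"), (1950, "A"), (1950, "B"), (1951, "A")]
stations_by_year, records_by_year = multi_insert(records2)

print(stations_by_year)  # {1950: 2, 1951: 1}
print(records_by_year)   # {1950: 3, 1951: 1}
```

Both output tables are filled from the same pass over `records2`, rather than from two separate queries that would each read the source.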
Querying data
• SELECT
SELECT foo FROM sample WHERE ds='2012-02-24';
Joins…
• Semi joins
SELECT *
FROM things
WHERE things.id IN (SELECT id from sales);
This can be rewritten as:
SELECT *
FROM things LEFT SEMI JOIN sales ON (sales.id = things.id);
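The semantics of a left semi join can be sketched in Python: a row from the left table is returned at most once if any matching row exists on the right, and no right-hand columns appear in the result. The function name and the sample rows are hypothetical.

```python
def left_semi_join(left, right, key):
    """Return left rows that have at least one match in right on `key`."""
    right_keys = {row[key] for row in right}
    return [row for row in left if row[key] in right_keys]


things = [
    {"id": 1, "name": "Tie"},
    {"id": 2, "name": "Coat"},
    {"id": 3, "name": "Hat"},
]
sales = [
    {"id": 2, "buyer": "Joe"},
    {"id": 2, "buyer": "Ann"},   # a second sale of the same thing
    {"id": 3, "buyer": "Eve"},
]

print(left_semi_join(things, sales, "id"))
# [{'id': 2, 'name': 'Coat'}, {'id': 3, 'name': 'Hat'}]
```

"Coat" appears once even though it was sold twice; an ordinary inner join would return it twice.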
• Map joins
• If one table is small enough to fit in memory
SELECT /*+ MAPJOIN(things) */ sales.*, things.*
FROM sales JOIN things ON (sales.id = things.id);
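The idea behind a map join can be sketched as a hash join: the small table is loaded into an in-memory lookup, and the large table is streamed past it, so no shuffle or reduce step is needed. This is a simplified Python illustration; `map_join` and the sample rows are hypothetical.

```python
def map_join(big_rows, small_rows, key):
    """Stream big_rows and join each against an in-memory index of small_rows."""
    lookup = {}
    for row in small_rows:                 # build phase: small table in memory
        lookup.setdefault(row[key], []).append(row)
    for big in big_rows:                   # probe phase: one pass over big table
        for small in lookup.get(big[key], []):
            yield big, small


sales = [
    {"id": 2, "buyer": "Joe"},
    {"id": 0, "buyer": "Ali"},   # no matching thing, so dropped by the join
    {"id": 3, "buyer": "Eve"},
]
things = [{"id": 2, "name": "Tie"}, {"id": 3, "name": "Hat"}]

for sale, thing in map_join(sales, things, "id"):
    print(sale["buyer"], thing["name"])
# Joe Tie
# Eve Hat
```

The approach only works when the smaller table fits in the memory of each mapper, which is the condition stated above.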
Performance - Result
System Architecture and Components
Any Questions and Thanks