NoSQL Databases
An Overview
Dr. Kalpakis, Introduction to Data Science, Fall 2017
The need
Scaling Relational Databases
• Vertically (or up)
• Can be achieved by hardware upgrades (e.g., faster CPU, more memory, or
larger disks)
• Limited by the amount of CPU, RAM and disk that can be configured on a
single machine
• Horizontally (or out)
• Can be achieved by adding more machines
• Requires database sharding and probably replication
• Limited by the Read-to-Write ratio and communication overhead
• ACID requirements constrain scalability
Data Sharding
• Data is typically sharded (or striped) to allow for parallel accesses
• Amdahl’s Law bounds the speedup achievable by sharding
• The real speedup is lower because of communication overhead and workload
imbalance
Input data: a large file, split into chunks
Machine 1: Chunk 1, Chunk 2
Machine 2: Chunk 3, Chunk 4
Machine 3: Chunk 5, Chunk 6
E.g., chunks 1, 3 and 5 can be accessed in parallel
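As a rough illustration of the Amdahl's Law bullet above, the sketch below computes the ideal speedup when a fraction of the work can be spread over several shards. The function name and the example values (90% parallel work, 3 to 100 shards) are assumptions chosen for illustration, not figures from the slides.

```python
def amdahl_speedup(parallel_fraction: float, n_shards: int) -> float:
    """Ideal speedup when `parallel_fraction` of the work is spread over `n_shards` machines."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be in [0, 1]")
    if n_shards < 1:
        raise ValueError("n_shards must be >= 1")
    serial_fraction = 1.0 - parallel_fraction
    # The serial part is not sped up; the parallel part is divided among the shards.
    return 1.0 / (serial_fraction + parallel_fraction / n_shards)


if __name__ == "__main__":
    # Hypothetical workload: 90% of the access time can be parallelized.
    for n in (1, 3, 10, 100):
        print(n, round(amdahl_speedup(0.9, n), 2))
```

With 90% parallel work the speedup can never reach 10, however many shards are added; communication overhead and imbalance push the real figure lower still.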
Data Replication
• Replicating data across servers helps to
• Avoid performance bottlenecks
• Avoid single points of failure
• Enhance scalability and availability
[Figure: a main server and its replicated servers]
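The main-server/replica arrangement can be sketched as a toy in-memory store. This is only an illustration under stated assumptions: the class `ReplicatedStore`, synchronous propagation of writes, and round-robin reads are all hypothetical choices, not something the slides prescribe.

```python
import itertools


class ReplicatedStore:
    """Toy store: writes go to the main server, reads are spread over the replicas."""

    def __init__(self, n_replicas: int) -> None:
        self.main = {}
        self.replicas = [{} for _ in range(n_replicas)]
        self.alive = [True] * n_replicas
        self._next = itertools.cycle(range(n_replicas))

    def write(self, key, value) -> None:
        self.main[key] = value
        # Synchronous propagation is an assumption; real systems often replicate asynchronously.
        for replica, up in zip(self.replicas, self.alive):
            if up:
                replica[key] = value

    def read(self, key):
        # Round-robin over live replicas; fall back to the main server if none is up.
        for _ in range(len(self.replicas)):
            i = next(self._next)
            if self.alive[i]:
                return self.replicas[i][key]
        return self.main[key]
```

Reads keep succeeding when a replica is marked as failed, which is the availability benefit listed above; spreading reads over replicas is what relieves the bottleneck on the main server.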
Relational Databases & ACID properties
• Execution of DB code blocks (aka transactions) ensures
• Atomicity: either all instructions are executed or none of them are
• Consistency: on completion, the transaction leaves the database in a consistent state
• Isolation: each transaction is oblivious to other concurrent manipulations of the database
• Durability: upon completion, modifications to the DB are permanent
• Consistency in distributed relational databases is often enforced using the
two-phase commit protocol (2PC)
• When sharding and replicating relational databases, ensuring
consistency is costly since real-life distributed systems are unreliable
• even worse when the network partitions
• Atomicity, isolation, and durability (AID) are relatively easier to support in distributed systems
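To make atomicity and consistency concrete on a single node, here is a minimal sketch using Python's standard `sqlite3` module. The `accounts` table, the `transfer` helper, and the balances are hypothetical; the example only shows that a transaction which would violate a constraint is rolled back as a whole.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()


def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst; return True on commit, False on rollback."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes the block
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?", (amount, dst)
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?", (amount, src)
            )
        return True
    except sqlite3.IntegrityError:
        # The CHECK constraint failed, so the earlier credit was undone as well.
        return False


def balances(conn):
    return dict(conn.execute("SELECT name, balance FROM accounts"))
```

Calling `transfer(conn, "alice", "bob", 500)` credits bob first and then fails on alice's debit; the rollback leaves both balances unchanged, so no partial transfer is ever visible.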
2-Phase Commit protocol (2PC)
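Participants: a Coordinator and DB Servers 1, 2 and 3 (Participants 1, 2 and 3)
Phase I: Voting
1. Coordinator sends VOTE_REQUEST to every participant
2. Each participant replies with VOTE_COMMIT
Phase II: Commit
3. Coordinator sends GLOBAL_COMMIT to every participant
4. Each participant performs LOCAL_COMMIT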
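The message sequence above can be sketched as a small in-process simulation. It is a simplified sketch, not a full protocol implementation: there are no timeouts, logs, or crash recovery, and the abort path (`VOTE_ABORT`, `GLOBAL_ABORT`) is an assumption added here, since the slide shows only the commit path. The class and function names are hypothetical.

```python
from enum import Enum


class Vote(Enum):
    COMMIT = "VOTE_COMMIT"
    ABORT = "VOTE_ABORT"  # assumed name; the slide shows only the commit path


class Participant:
    def __init__(self, name: str, can_commit: bool = True) -> None:
        self.name = name
        self.can_commit = can_commit
        self.state = "INIT"

    def on_vote_request(self) -> Vote:
        # Phase I: prepare locally and answer the coordinator's VOTE_REQUEST.
        self.state = "READY" if self.can_commit else "ABORTED"
        return Vote.COMMIT if self.can_commit else Vote.ABORT

    def on_decision(self, decision: str) -> None:
        # Phase II: apply the coordinator's global decision locally.
        self.state = "COMMITTED" if decision == "GLOBAL_COMMIT" else "ABORTED"


def two_phase_commit(participants) -> str:
    # Steps 1 and 2: request votes and collect the replies.
    votes = [p.on_vote_request() for p in participants]
    # The coordinator commits only if every participant voted to commit.
    decision = "GLOBAL_COMMIT" if all(v is Vote.COMMIT for v in votes) else "GLOBAL_ABORT"
    # Steps 3 and 4: broadcast the decision; each participant applies it locally.
    for p in participants:
        p.on_decision(decision)
    return decision
```

A single participant that cannot commit forces a global abort, which is why 2PC becomes costly when servers or links are unreliable.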
The CAP Theorem
• “Of three properties of a shared data system: data consistency, system availability
and tolerance to network partitions, only two can be achieved at any given moment.”
• Conjectured by Eric Brewer (2000) and proven by Nancy Lynch and Seth Gilbert (2002)
• “CAP prohibits only a tiny part of the design space: perfect availability and consistency in
the presence of partitions, which are rare.” (Eric Brewer, 2012)
• Consistency:
• All nodes should see the same data at the same time (strict consistency)
• Availability:
• Node failures do not prevent survivors from continuing to operate
• Partition-tolerance:
• The system continues to operate despite network partitions
• Very large systems will almost certainly partition, so they must choose between C and A
Various Consistency types
• Strong Consistency
• Any access after an update returns the updated value
• Eventual Consistency
• If no new updates are made, eventually all accesses will return the last updated value
• Read-your-writes
• After updating an item, a process never sees an older value of it
• Monotonic read consistency
• If a process has seen a particular value of an item, that process never sees an older value
afterwards
• Monotonic write consistency
• Writes by the same process are serialized
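One possible way to provide read-your-writes and monotonic reads on top of lagging replicas is for each client session to remember the newest version it has seen. The sketch below assumes a single writer and integer version numbers; `Replica` and `Session` are hypothetical names, not part of any particular database.

```python
class Replica:
    def __init__(self) -> None:
        self.version = 0
        self.value = None

    def apply(self, version: int, value) -> None:
        # Ignore updates that are not newer than what this replica already holds.
        if version > self.version:
            self.version, self.value = version, value


class Session:
    """Client session that never accepts a value older than one it has seen or written."""

    def __init__(self) -> None:
        self.min_version = 0

    def write(self, replica: Replica, value) -> None:
        # Single-writer assumption: the next version is the replica's version plus one.
        replica.apply(replica.version + 1, value)
        self.min_version = max(self.min_version, replica.version)  # read-your-writes

    def read(self, replica: Replica):
        if replica.version < self.min_version:
            raise RuntimeError("replica is stale for this session; retry elsewhere")
        self.min_version = replica.version  # monotonic reads
        return replica.value
```

A stale replica is rejected until the update has propagated to it, after which the session can read from it again; that propagation delay is the window in which the system is only eventually consistent.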
BASE antidote to ACID
• Basically Available: the system guarantees availability
• Soft state: the state of the system may change over time, even
without input
• Eventual consistency: the system will become consistent over
time, provided that input ceases during that time
• Most NoSQL databases relax ACID and adopt BASE
CAP and databases
Taxonomy of NoSQL (Not-only SQL) databases
• Key-Value Stores
• Lookup a single value for a key
• Amazon’s DynamoDB
• Document Stores
• Access data by key or by search of “document” data.
• MongoDB
• CouchDB
• Column Stores
• Column-wise storage of tabular data
• Google’s BigTable
• Apache Cassandra (originally developed at Facebook)
• Graph Stores
• Native graph storage, efficient graph algorithms
• Neo4j
• Google’s Pregel
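To contrast the four families, the sketch below expresses similar data in each model using plain Python structures. All names and values (`user:42`, `Ada`, the follower edge) are made up for illustration; real systems add indexing, persistence, and distribution on top of these shapes.

```python
import json

# Key-value store: an opaque value looked up by a single key.
key_value = {"user:42": json.dumps({"name": "Ada", "city": "Baltimore"})}
profile = json.loads(key_value["user:42"])

# Document store: the value is a structured document that can be searched by field.
documents = [
    {"_id": 42, "name": "Ada", "city": "Baltimore", "tags": ["admin"]},
    {"_id": 43, "name": "Bob", "city": "Boston"},
]
in_baltimore = [d["_id"] for d in documents if d["city"] == "Baltimore"]

# Column store: each column is stored separately, aligned by row position.
columns = {"id": [42, 43], "name": ["Ada", "Bob"], "city": ["Baltimore", "Boston"]}
cities = columns["city"]  # reads one column without touching the others

# Graph store: nodes plus explicit edges; traversals follow adjacency directly.
follows = {42: [43], 43: []}  # 42 -> 43 means "user 42 follows user 43"
followed_by_42 = follows[42]
```

Note that the two documents need not share the same fields, which is the schema flexibility usually associated with document stores.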
Key-Value Stores
DynamoDB Data Model
Mandatory                       Optional
Key-value access pattern        Models 1:N relationships
Determines data distribution    Enables rich queries
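In DynamoDB's terminology the mandatory and optional parts of the key are the partition key and the sort key; the slide text does not name them, so treat those labels as an assumption here. The toy table below is not the DynamoDB API. It only sketches how a hashed partition key can determine data placement while a sort key models a 1:N relationship and supports range queries; `ToyTable` and the sample keys are hypothetical.

```python
import hashlib


class ToyTable:
    def __init__(self, n_partitions: int = 3) -> None:
        # Each partition maps a partition key to a dict of {sort_key: item}.
        self.partitions = [{} for _ in range(n_partitions)]

    def _partition_for(self, partition_key: str) -> dict:
        # Hashing the mandatory key decides which partition stores the item.
        digest = hashlib.sha256(partition_key.encode("utf-8")).hexdigest()
        return self.partitions[int(digest, 16) % len(self.partitions)]

    def put(self, partition_key: str, sort_key: str, item: dict) -> None:
        self._partition_for(partition_key).setdefault(partition_key, {})[sort_key] = item

    def get(self, partition_key: str, sort_key: str) -> dict:
        # Plain key-value access: one lookup, one item.
        return self._partition_for(partition_key)[partition_key][sort_key]

    def query(self, partition_key: str, sort_from: str, sort_to: str) -> list:
        # Range query over the optional key within a single partition key (1:N).
        rows = self._partition_for(partition_key).get(partition_key, {})
        return [rows[sk] for sk in sorted(rows) if sort_from <= sk <= sort_to]
```

For example, using a customer id as the partition key and an order date as the sort key lets one customer own many orders, all stored together and retrievable by date range.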
Column Stores
Document Stores
• Documents are represented in JSON/BSON
MongoDB Architecture
Queries
Graph Stores
Graph Stores – Neo4j
Pros/Cons of NoSQL
• Advantages
• High elastic scalability
• Lower cost
• Schema flexibility, semi-structured data
• Disadvantages
• No standardization
• Less mature
• Limited query capabilities
• Programming with eventual consistency is counter-intuitive
NewSQL
• A DBMS that delivers the scalability and flexibility promised by NoSQL while
retaining the support for SQL queries and/or ACID, or to improve performance for
appropriate workloads.
Matt Aslett – “How Will The Database Incumbents Respond To NoSQL And NewSQL?”
https://www.451research.com/report-short?entityId=66963
• NewSQL databases have
• SQL as the primary interface.
• ACID support for transactions
• Non-locking concurrency control.
• High per-node performance.
• Parallel, shared-nothing architecture.
Michael Stonebraker – “New SQL: An Alternative to NoSQL and Old SQL for New OLTP Apps”
http://cacm.acm.org/blogs/blog-cacm/109710

Properties      Traditional SQL   NoSQL   NewSQL
ACID            Y                 N       Y
In-memory DB    N                 Y       Y
Big Data        N                 Y       Y
RDBMS           Y                 N       Y