BDS Session 10
Janardhanan PS
Topics for today
• NoSQL Introduction
• Classification
• Examples
• MongoDB
• Cassandra
• GraphDBs: Neo4J
Database Sphere
What is a NoSQL Database?
• NoSQL ("Not Only SQL") databases are databases that do not rely on the relational
model or on traditional SQL (Structured Query Language) for storing and manipulating data
• They are designed to handle large amounts of unstructured, semi-structured, or
polymorphic data and are often used for big data, real-time data processing, and
cloud-based applications
• Most NoSQL databases use a distributed architecture that scales horizontally
across multiple servers or nodes, which suits workloads with high concurrency
and large data volumes
What is NoSQL ?
• Coined by Carlo Strozzi in 1998
✓Lightweight, open source database without standard SQL interface
• Reintroduced by Johan Oskarsson in 2009
✓Non-relational databases
• Characteristics
✓Not Only SQL
✓Non-relational
✓Schema-less
✓Loosen consistency to address scalability and availability requirements in
large scale applications
✓Open source movement born out of web-scale applications
✓Distributed for scale
✓Cluster Friendly
Data model
How to choose the right NoSQL database?
NoSQL Use Cases
• Big data: NoSQL databases are perfect for handling large amounts of data since they can
scale horizontally across multiple servers or nodes and handle high levels of concurrency
• Real-time data processing: They are often used for real-time data processing since they can
handle high levels of concurrency and support low latency
• Cloud-based applications: NoSQL databases are perfect for cloud-based applications since
they can easily scale and handle large amounts of data in a distributed environment
• Content management: NoSQL databases are often used for content management systems
since they can handle large amounts of data and support flexible data models
• Social media: NoSQL databases are often used for social media applications since they can
handle high levels of concurrency and support flexible data models
• Internet of Things (IoT): These databases are often used for IoT applications since they can
handle large amounts of data from a large number of devices and handle high levels of
concurrency
• E-commerce: They are often used for e-commerce applications since they can handle high
levels of concurrency and support flexible data models
Why NoSQL (1)
• RDBMS meant for OLTP systems / Systems of Record
• Strict consistency and durability guarantees (ACID) over multiple data items involved
in a transaction
• But they have scale and cost issues with large volumes of data, distributed geo-scale
applications, very high transaction volumes
• Typical web scale systems do not need strict consistency and durability for every use case
• Social networking
• Real-time applications
• Log analysis
• Browsing retail catalogs
• Reviews and blogs
•…
Why NoSQL (2)
Choice between consistency and availability
• In a distributed database
✓Scalability and fault tolerance can be improved by adding nodes,
although this makes consistency (C) harder to maintain.
✓Adding nodes can also hurt availability (A) because of the
latency of the extra communication between nodes.
• A write may have to update all replicas before success is returned to the client, so it
takes longer, and the system may be unable to serve reads on the same data item
during this period.
• Large-scale distributed systems cannot rule out network partitions.
✓Although communication outages are rare and temporary, partition tolerance
(P) must always be supported by a distributed database, so the practical choice is between C and A
NoSQL Offerings in Cloud
NoSQL Characteristics
• Auto sharding
✓ Automatically spreads data across a number of servers
✓ Applications are not aware of it
✓ Helps with data balancing and recovery from failure
• Replication
✓ Good support for data replication, which offers high availability and fault tolerance
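The auto-sharding idea above can be sketched in a few lines. This is a minimal illustration, assuming hash-based placement over a fixed number of shards; the function name `shard_for` and the key format are hypothetical, and real stores use more elaborate schemes (ranges, consistent hashing) so that adding a server moves little data.

```python
import hashlib


def shard_for(key: str, num_shards: int) -> int:
    """Map a key to a shard number using a stable hash (illustrative only)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


# The application only supplies keys; the placement is decided for it.
keys = [f"user:{i}" for i in range(1000)]
counts = [0] * 4
for key in keys:
    counts[shard_for(key, 4)] += 1
```

Because the hash is stable, the same key always lands on the same shard, which is what lets a router find the data again without the application knowing where it lives.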
NoSQL - Pros and Cons
Pros
• Cost effective for large data sets
• Easy to implement
• Easy to distribute, especially across data centers (DCs)
• Easier to scale up/down
• Relaxes data consistency when required
• No pre-defined schema
• Easier to model semi-structured data or connectivity data
• Easy to support data replication
Cons
• Limited or no support for joins between data sets / tables
• Limited or no support for group by operations
• Limited or no ACID properties for transactions
• No standard SQL interface
• Lack of standardisation in this space
• Difficult to port from SQL and across NoSQL stores
• Fewer skilled practitioners compared to SQL
• Fewer BI tools compared to the mature SQL BI space
SQL vs NoSQL
Vendors
• Amazon
• Facebook
• Google
• Oracle
Classification: Document-based
• Store data in form of documents using well known formats like JSON
• Documents accessible via their id, but can be accessed through other index as well
• Maintains data in collections of documents
• Examples: MongoDB, CouchDB, Couchbase
• Book document:
{
"Book Title" : "Fundamentals of Database Systems",
"Publisher" : "Addison-Wesley",
"Authors" : "Elmasri & Navathe",
"Year of Publication" : "2011"
}
Classification: Key-Value store
• Simple data model based on fast access by the key to the value associated with the key
• Value can be a record or object or document or even complex data structure
• Maintains a big hash table of keys and values
• For example,
✓ DynamoDB, Redis, Riak
Key Value
2014HW112220 { Santosh,Sharma,Pilani}
2018HW123123 {Eshwar,Pillai,Hyd}
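The key-value model above can be pictured as a hash table. The following is a minimal sketch, assuming an in-memory dict stands in for the store; the class `KeyValueStore`, its methods, and the field names `first`, `last`, `city` are hypothetical and are not the API of DynamoDB, Redis, or Riak.

```python
from typing import Any, Dict, Optional


class KeyValueStore:
    """Toy in-memory key-value store (illustrative only)."""

    def __init__(self) -> None:
        self._table: Dict[str, Any] = {}  # the "big hash table" of keys and values

    def put(self, key: str, value: Any) -> None:
        # The store does not interpret the value; it can be any structure.
        self._table[key] = value

    def get(self, key: str) -> Optional[Any]:
        # Lookup is by key only; returns None when the key is absent.
        return self._table.get(key)


store = KeyValueStore()
store.put("2014HW112220", {"first": "Santosh", "last": "Sharma", "city": "Pilani"})
store.put("2018HW123123", {"first": "Eshwar", "last": "Pillai", "city": "Hyd"})
record = store.get("2014HW112220")
missing = store.get("2020HW000000")
```

Note that there is no way to ask for "everyone in Pilani" without scanning every value, which is the trade-off this model makes for fast access by key.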
Classification: Column-based
Classification: Graph based
MongoDB
* https://docs.mongodb.com/manual/core/wiredtiger/
MongoDB Data Example
Collection inventory
{
  item: "ABC2",
  details: { model: "14Q3", manufacturer: "M1 Corporation" },
  stock: [ { size: "M", qty: 50 } ],
  category: "clothing"
}
{
  item: "MNO2",
  details: { model: "14Q3", manufacturer: "ABC Company" },
  stock: [ { size: "S", qty: 5 }, { size: "M", qty: 5 }, { size: "L", qty: 1 } ],
  category: "clothing"
}
Document insertion
db.inventory.insert(
  {
    item: "ABC1",
    details: { model: "14Q3", manufacturer: "XYZ Company" },
    stock: [ { size: "S", qty: 25 }, { size: "M", qty: 50 } ],
    category: "clothing"
  }
)
Example of Simple Query
MongoDB Data Model
MongoDB: MapReduce
> db.collection.mapReduce(
    function() { emit(key, value); },                  // map function
    function(key, values) { return reduceFunction; },  // reduce function
    {
      out: collection,
      query: document,
      sort: document,
      limit: number
    }
  )
> db.posts.mapReduce(
    function() { emit(this.name, 1); },
    function(key, values) { return Array.sum(values); },
    {
      query: { publish: "true" },
      out: "total_reviews"
    }
  ).find()
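What the `db.posts.mapReduce` call above computes can be simulated in plain Python. This is a sketch of the map, group, and reduce phases only, not MongoDB's implementation; the sample `posts` documents and the helper `map_reduce` are hypothetical.

```python
from collections import defaultdict

# Hypothetical sample of the `posts` collection.
posts = [
    {"name": "asha", "publish": "true"},
    {"name": "ravi", "publish": "true"},
    {"name": "asha", "publish": "false"},
    {"name": "asha", "publish": "true"},
]


def map_fn(doc):
    # Mirrors: function() { emit(this.name, 1); }
    yield doc["name"], 1


def reduce_fn(key, values):
    # Mirrors: function(key, values) { return Array.sum(values); }
    return sum(values)


def map_reduce(docs, query, map_fn, reduce_fn):
    """Filter by `query`, group emitted values by key, then reduce each group."""
    groups = defaultdict(list)
    for doc in docs:
        if all(doc.get(field) == wanted for field, wanted in query.items()):
            for key, value in map_fn(doc):
                groups[key].append(value)
    return {key: reduce_fn(key, values) for key, values in groups.items()}


total_reviews = map_reduce(posts, {"publish": "true"}, map_fn, reduce_fn)
```

The `query` filter runs before the map phase, so the unpublished post never reaches `map_fn`.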
MongoDB: Indexing
• An index can be created on any field of a document in a collection, including fields of a sub-document (e.g. address.city)
• e.g. a document in a collection
{
"address": {
"city": "New Delhi",
"state": "Delhi",
"pincode": "110001"
},
"tags": [
"football",
"cricket",
"badminton"
],
"name": "Ravi"
}
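To see why an index on a sub-document field or an array field helps, here is a toy in-memory index in Python. It is a sketch of the idea only, assuming a dictionary from field value to document positions; the helpers `get_path` and `build_index` and the second document are hypothetical, and MongoDB itself uses B-tree indexes created with `createIndex`.

```python
from collections import defaultdict

users = [
    {"name": "Ravi",
     "address": {"city": "New Delhi", "state": "Delhi", "pincode": "110001"},
     "tags": ["football", "cricket", "badminton"]},
    # Hypothetical second document, added only for illustration.
    {"name": "Meena",
     "address": {"city": "Pune", "state": "Maharashtra", "pincode": "411001"},
     "tags": ["cricket"]},
]


def get_path(doc, dotted_field):
    """Follow a dotted path such as 'address.city' into nested documents."""
    value = doc
    for part in dotted_field.split("."):
        value = value[part]
    return value


def build_index(docs, dotted_field):
    """Map each field value to the positions of the documents containing it."""
    index = defaultdict(list)
    for position, doc in enumerate(docs):
        value = get_path(doc, dotted_field)
        # An array field contributes one index entry per element.
        entries = value if isinstance(value, list) else [value]
        for entry in entries:
            index[entry].append(position)
    return index


city_index = build_index(users, "address.city")
tag_index = build_index(users, "tags")
cricket_fans = [users[i]["name"] for i in tag_index["cricket"]]
```

A lookup such as `city_index["New Delhi"]` then goes straight to the matching documents instead of scanning the whole collection.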
MongoDB: Joins
• From MongoDB 3.2 onward, data from 2 collections can be joined using aggregate with $lookup
• Collection books (isbn, title, author) and books_selling_data (isbn, copies_sold)
db.books.aggregate([{ $lookup: {
  from: "books_selling_data",
  localField: "isbn",
  foreignField: "isbn",
  as: "copies_sold"
}
}])
• Example: with two collections (users, comments), pull all the comments with pid=444 along with the user info for each
comments
{ uid: 12345, pid: 444, comment: "blah" }
{ uid: 12345, pid: 888, comment: "asdf" }
{ uid: 99999, pid: 444, comment: "qwer" }
users
{ uid: 12345, name: "john" }
{ uid: 99999, name: "mia" }
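The users/comments example can be worked through in Python to show what a `$lookup` stage produces: each input document gains an array field holding the matching documents from the other collection (a left outer join). This is a sketch of the semantics only; the helper `lookup` and the output field name `user` are hypothetical.

```python
comments = [
    {"uid": 12345, "pid": 444, "comment": "blah"},
    {"uid": 12345, "pid": 888, "comment": "asdf"},
    {"uid": 99999, "pid": 444, "comment": "qwer"},
]
users = [
    {"uid": 12345, "name": "john"},
    {"uid": 99999, "name": "mia"},
]


def lookup(local_docs, foreign_docs, local_field, foreign_field, as_field):
    """Attach to each local document the list of matching foreign documents."""
    joined = []
    for doc in local_docs:
        matches = [
            other for other in foreign_docs
            if other.get(foreign_field) == doc.get(local_field)
        ]
        # Unmatched documents are kept, with an empty list (left outer join).
        joined.append({**doc, as_field: matches})
    return joined


# Filter first (like a $match stage), then join (like $lookup).
pid_444 = [c for c in comments if c["pid"] == 444]
result = lookup(pid_444, users, "uid", "uid", "user")
names = [r["user"][0]["name"] for r in result]
```

Filtering before the join keeps the join input small, which is the usual ordering in an aggregation pipeline as well.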
MongoDB “read concerns”
• local :
✓Client reads primary replica
✓Client reads from secondary in causally consistent sessions
• available:
✓Read on secondary but causal consistency not required
• majority :
✓If client wants to read what majority of nodes have. Best option for fault tolerance and
durability.
• linearizable :
✓If client wants to read what has been written to majority of nodes before the read started.
✓Has to be read on primary
✓Only single document can be read
https://docs.mongodb.com/v3.4/core/read-preference-mechanics/
MongoDB “write concerns”
https://docs.mongodb.com/manual/reference/write-concern/
Consistency scenarios - causally consistent and durable
R = W = majority
https://engineering.mongodb.com/post/ryp0ohr2w9pvv0fks88kq6qkz9k9p3
Consistency scenarios - causally consistent but not durable
read=majority
write=1
• W1 may succeed on P1 and P2. R1 will succeed only on P2. W1 on P1 may roll back.
• So causally consistent but not durable with network partition. Fast writes, slower reads.
• Example: Twitter - a post may disappear but if on refresh you see it then it should be durable,
else repost.
https://engineering.mongodb.com/post/ryp0ohr2w9pvv0fks88kq6qkz9k9p3
Consistency scenarios - eventual consistency with durable writes
read=local
write=majority
• W1 will succeed only for P2 and will not be accepted on P1 after failure. Reads may not
succeed to see the last write on P1. Slow durable writes and fast non-causal reads.
• Example: Review site where write should be durable if committed but reads don’t need causal
guarantee as long as it appears some time (eventual consistency).
https://engineering.mongodb.com/post/ryp0ohr2w9pvv0fks88kq6qkz9k9p3
Consistency scenarios - eventual consistency but no durability
read=local
write=1
• Same as the previous scenario, but writes are also not durable and may be rolled back.
• Example: Real-time sensor data feed that needs fast writes to keep up with the rate and reads
should get as much recent real-time data as possible. Data may be dropped on failures.
https://engineering.mongodb.com/post/ryp0ohr2w9pvv0fks88kq6qkz9k9p3
How applications deal with eventual consistency of AP
• Read again, if you do not get the data during the first read
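The "read again" approach can be sketched as a small retry loop. This is a minimal illustration, assuming a read that returns `None` until the write has reached the replica being read; `read_with_retry` and the simulated `LaggingReplica` are hypothetical names, not part of any database driver.

```python
import time


def read_with_retry(read_fn, key, attempts=3, delay_s=0.0):
    """Call read_fn(key) until it returns a value or attempts run out."""
    for attempt in range(attempts):
        value = read_fn(key)
        if value is not None:
            return value
        if attempt < attempts - 1:
            time.sleep(delay_s)  # give replication time to catch up
    return None


class LaggingReplica:
    """Simulated replica on which a write becomes visible only after some reads."""

    def __init__(self, visible_after):
        self.calls = 0
        self.visible_after = visible_after

    def read(self, key):
        self.calls += 1
        return "v1" if self.calls >= self.visible_after else None


replica = LaggingReplica(visible_after=3)
value = read_with_retry(replica.read, "post:1", attempts=5)
```

A bounded number of attempts matters here: if the write was lost rather than delayed, the caller needs to give up and fall back, for example by asking the user to repost.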
MongoDB – ACID Transactions
MongoDB on Cloud
Cassandra
• Born at Facebook and built on Amazon Dynamo and Google Bigtable concepts
• AP design in CAP context
• For high-performance, high-availability applications that can sacrifice consistency
✓ Hence built on peer-to-peer symmetric nodes instead of a primary-secondary
architecture (as in MongoDB)
• Column-oriented (wide-column) DB
✓ Create keyspace (like a DB)
✓ Within keyspace create column family (like a table)
✓ Within CF create attributes / columns with their types
Cassandra features
Cassandra on Cloud
Replication strategy for user data
• Simple
✓ Specify replication factor = N and data is stored in N nodes of
cluster
• NetworkTopology
✓ Specify replication factor per DC where we want reliability from DC
failures
✓ e.g. CREATE KEYSPACE cluster1 WITH replication = {'class':
'NetworkTopologyStrategy', 'eastDC' : 2, 'westDC' : 3};
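The simple strategy above (replication factor N, data stored on N nodes) can be pictured on a token ring: the key's token picks the first node, and the remaining replicas go to the next nodes clockwise. The sketch below assumes one token per node and uses an MD5 hash as a stand-in for the partitioner; the node names and the helpers `token` and `replicas_for` are hypothetical.

```python
import bisect
import hashlib


def token(value: str) -> int:
    """Stand-in partitioner: hash a string to a position on the ring."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


def replicas_for(key, nodes, replication_factor):
    """Return the nodes holding `key`: the owner plus the next nodes clockwise."""
    ring = sorted((token(node), node) for node in nodes)
    tokens = [t for t, _ in ring]
    # The owner is the first node whose token is >= the key's token (wrapping around).
    start = bisect.bisect_left(tokens, token(key)) % len(ring)
    count = min(replication_factor, len(ring))
    return [ring[(start + i) % len(ring)][1] for i in range(count)]


nodes = ["node-a", "node-b", "node-c", "node-d"]
owners = replicas_for("rollno:4", nodes, 3)
```

A topology-aware strategy differs in that it walks the ring separately per data center and applies each data center's own replication factor, which this sketch does not model.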
Consistency semantics (1)
• No primary replica - high partition tolerance and availability, with tunable levels of consistency
• Support for lightweight transactions with “linearizable consistency”
• A Read or Write operation can pick a consistency level
• ONE, TWO, THREE, ALL - 1,2,3 or all replicas respectively have to ack
• ANY - Write to any node even if replicas are down (ref Hinted Handoff)
• QUORUM - majority have to ack
• LOCAL_QUORUM - majority within same datacenter have to ack
•…
https://cassandra.apache.org/doc/latest/architecture/dynamo.html
https://cassandra.apache.org/doc/latest/architecture/guarantees.html#
Tunable Consistency
R+W>N
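The rule R+W>N can be checked with simple arithmetic, where N is the replication factor, W the number of replicas that must acknowledge a write, and R the number that must answer a read. When the inequality holds, every read set overlaps every write set in at least one replica. The function names below are hypothetical, chosen for illustration.

```python
def quorum(n: int) -> int:
    """Smallest majority of n replicas."""
    return n // 2 + 1


def read_sees_latest_write(n: int, r: int, w: int) -> bool:
    """True when any r replicas must overlap any w replicas out of n."""
    return r + w > n


n = 3
quorum_both = read_sees_latest_write(n, quorum(n), quorum(n))  # 2 + 2 > 3
one_and_one = read_sees_latest_write(n, 1, 1)                  # 1 + 1 > 3 fails
one_and_all = read_sees_latest_write(n, 1, n)                  # 1 + 3 > 3
```

With N = 3, QUORUM reads and writes give the overlap at the cost of waiting for two replicas each, while ONE/ONE is fastest but can return stale data.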
Cassandra Consistency Spectrum
• Cassandra has C A P.
• But Consistency is tunable
• Give up a little A and P to get more C
• The higher the consistency level, the lower the chance of reading stale data
✓ Pay for this with latency
Consistency semantics (2)
Partitioners
Sample queries
> create keyspace demo with replication={'class':'SimpleStrategy',
'replication_factor':1};
> describe keyspaces;
> use demo;
-- "columnfamily" may be used in place of "table" below
> create table student_info (rollno int primary key, name text, doj
timestamp, lastexampercent double);
> describe table student_info ;
> consistency quorum
> insert into student_info (rollno,name,doj,lastexampercent) values
(4,'Roxanne', dateof(now()), 90) using ttl 30;
> select rollno from student_info where name='Roxanne' ALLOW
FILTERING;
> update student_info set lastexampercent=98 where rollno=2 IF
name='Sam';
Case study - eBay
• Marketplace has 100 million active buyers with 200+ million items
• 2B page views, 80B DB calls, multi-PB storage capacity
• No transactions, joins, referential integrity
• Multi-DC deployment
• 400M+ writes and 200M+ reads
• 3 Use cases
✓ Social signal on product pages (read latency is not important but write performance is
key)
✓ Connecting users and items via buy, sell, bid, watch events
✓ Many time series analysis cases, e.g. fraud detection
https://www.slideshare.net/jaykumarpatel/cassandra-at-ebay-13920376/2-eBay_Marketplaces_97_million_active
Case study - AdStage (from AWS use cases)
• Sector AdTech
• Online advertising platform to manage multi-channel ad campaigns on Google, FB,
Twitter, Bing, LinkedIn
• 3 clusters with 80+ nodes on AWS
• Vast amount of real-time data from 5 channels
• Constantly monitor trends and optimise campaigns for advertisers
• High performance and availability - consistency is not critical as it is read mainly
• Cassandra cluster can scale as more clients are added with no SPOF
Graphs
Graph computing
• Property graphs
• Data is represented as vertices and edges with
properties
• Properties are key value pairs
• Edges are relationships between vertices
• When to use a graph DB ?
• A relationship-heavy data set with large set of
data items
• Queries are like graph traversals but need to
keep query performance almost constant as
database grows
• A variety of queries may be asked from the data
and static indices on data will not work
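The property graph model above can be sketched with plain Python structures: vertices and edges each carry key-value properties, and a query is a walk along edges. This is a minimal illustration with a small hand-written sample; the helper names `find_vertex` and `out_neighbours` and the property names are hypothetical.

```python
# Vertices: id -> properties (the label is kept as a property here).
vertices = {
    1: {"label": "Person", "name": "Tom Hanks"},
    2: {"label": "Person", "name": "Ron Howard"},
    3: {"label": "Movie", "title": "Apollo 13", "released": 1995},
    4: {"label": "Movie", "title": "The Da Vinci Code", "released": 2006},
}

# Edges: relationships between vertices, which may carry properties too.
edges = [
    {"src": 1, "dst": 3, "type": "ACTED_IN", "props": {"role": "Jim Lovell"}},
    {"src": 1, "dst": 4, "type": "ACTED_IN", "props": {"role": "Robert Langdon"}},
    {"src": 2, "dst": 3, "type": "DIRECTED", "props": {}},
    {"src": 2, "dst": 4, "type": "DIRECTED", "props": {}},
]


def find_vertex(label, **props):
    """Return the id of the first vertex with this label and these properties."""
    for vertex_id, vertex in vertices.items():
        if vertex["label"] == label and all(
            vertex.get(name) == wanted for name, wanted in props.items()
        ):
            return vertex_id
    return None


def out_neighbours(vertex_id, edge_type):
    """Follow outgoing edges of one type from a vertex (one traversal hop)."""
    return [e["dst"] for e in edges if e["src"] == vertex_id and e["type"] == edge_type]


tom = find_vertex("Person", name="Tom Hanks")
recent_titles = [
    vertices[movie]["title"]
    for movie in out_neighbours(tom, "ACTED_IN")
    if vertices[movie]["released"] > 2000
]
```

This toy version scans the whole edge list on every hop; a graph database stores each vertex's edges next to it so that a hop costs the same regardless of how large the graph grows.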
Relational Vs Graph Models
5 Signs You Need a Graph Database
Native vs Non-Native Graph storage
• Non-native graph computing platforms can use external DBs for data storage
• e.g. TinkerPop is an in-memory DB + computing framework that can store in
ElasticSearch, Cassandra etc.
• Native platforms have built-in storage
• e.g. Neo4j
• Native approach is much faster because adjacent nodes and edges are stored close together for
faster traversal
• In a non-native approach, extensive indexing has to be used
• Native approach scales as nodes get added
One-hop index
https://neo4j.com/blog/native-vs-non-native-graph-technology/
Graphs for Storage and Processing
What is Neo4J
➢ It is a graph database supporting full ACID transactions
➢ Embeddable in applications and server deployable
➢ Java based, Open sourced
➢ Schema free, bottom-up data model design
➢ Neo4j is stable
➢ In 24/7 operation since 2003
➢ Neo4j is under active development
➢ High performance graph operations
➢ Supports the Cypher query language
https://neo4j.com/blog/native-vs-non-native-graph-technology/
Neo4j / Cypher
• Cypher is a declarative language for graph queries
• Example: MATCH (:Person {name: 'Tom Hanks'})-[:ACTED_IN]->(m:Movie)
WHERE m.released > 2000 RETURN m LIMIT 5
• Find movies that Tom Hanks acted in, directed by Ron Howard, and released
after 2000
• MATCH (:Person {name: 'Tom Hanks'})-[:ACTED_IN]->(m:Movie), (:Person
{name: 'Ron Howard'})-[:DIRECTED]->(m) WHERE m.released > 2000
RETURN m LIMIT 5
• Who were the other actors in the movies that Tom Hanks acted in, directed by
Ron Howard, and released after 2000
• MATCH (:Person {name: 'Tom Hanks'})-[:ACTED_IN]->(m:Movie), (:Person
{name: 'Ron Howard'})-[:DIRECTED]->(m), (p:Person)-[:ACTED_IN]->(m)
WHERE m.released > 2000 RETURN p LIMIT 5
Neo4j on Cloud
Apache Tinkerpop / Gremlin
• TinkerPop is a computing platform that connects to GraphDBs that actually store the
nodes and edges. Built-in TinkerGraph stores in-memory data only.
• Gremlin is the query language (with traversal machine) that supports declarative and
imperative flavours
• Sample queries
• movies where Tom Hanks has acted
• g.V().hasLabel('person').has('name','Tom Hanks').out('ACTED_IN').hasLabel('movies').values('name')
• movies where Tom Hanks has acted and directed by Ron Howard
• g.V().hasLabel('person').has('name','Tom Hanks').out('ACTED_IN').where(__.in('DIRECTED').has('name','Ron Howard')).values('name')
(Figure: Person vertices connected to Movie vertices by ACTED_IN and DIRECTED edges; example Person: Tom Hanks)
2. Cassandra
Free Cassandra courses – Datastax Academy - https://www.datastax.com/dev/academy
Multiple Learning Paths - Gain an expert understanding of Apache Cassandra™
Each Learning Path is composed of a sequence of recommended courses for your role
• Administrator Certification
✓ DS201
✓ DS210
• Developer Certification
✓ DS201
✓ DS220
3. Neo4j
Neo4j GraphAcademy - https://graphacademy.neo4j.com/
Free, Self-Paced, Hands-on Online Training | Free Neo4j Courses from GraphAcademy
Summary
Next Session:
Cassandra and Graph Database in detail