BDS Session 10

The document provides an overview of NoSQL databases, highlighting their characteristics, classification, and use cases. It discusses various types of NoSQL databases such as key-value, document, column, and graph databases, along with specific examples like MongoDB and Cassandra. Additionally, it addresses the advantages and disadvantages of NoSQL compared to traditional SQL databases, emphasizing their scalability and flexibility for handling large volumes of data.

Uploaded by

Rahul Sbhatla

DSECL ZG 522: Big Data Systems

Session 10: NoSQL Databases

Janardhanan PS
Topics for today

• NoSQL Introduction
• Classification
• Examples
• MongoDB
• Cassandra
• GraphDBs: Neo4J

2
Database Sphere

3
What is a NoSQL Database?

• NoSQL databases, also known as "Not Only SQL" databases, are databases that
do not rely on the relational model or on traditional SQL (Structured Query
Language) as the primary way of storing and manipulating data
• They are designed to handle large amounts of unstructured, semi-structured, or
polymorphic data and are often used for big data, real-time data processing, and
cloud-based applications
• NoSQL databases use a distributed architecture that scales horizontally
across multiple servers or nodes, which suits workloads with high levels of
concurrency and data volume

4
What is NoSQL ?
• Coined by Carlo Strozzi in 1998
✓Lightweight, open source database without standard SQL interface
• Reintroduced by Johan Oskarsson in 2009
✓Non-relational databases
• Characteristics
✓Not Only SQL
✓Non-relational
✓Schema-less
✓Loosen consistency to address scalability and availability requirements in
large scale applications
✓Open source movement born out of web-scale applications
✓Distributed for scale
✓Cluster Friendly
5
Data model

• Supports rich variety of data : structured, semi-structured and unstructured

• No fixed schema, i.e. each record could have different attributes
• Non-relational - no join operations are typically supported
• Transaction semantics for multiple data items are typically not supported
• Relaxed consistency semantics - full ACID guarantees as in RDBMS are typically not provided
• In some cases can model data as graphs and queries as graph traversals

6
How to choose the right NoSQL database?

5 questions to ask before choosing a NoSQL database

• Is NoSQL the right choice?
• Which NoSQL data model do we need?
• What is the latency requirement?
• How important are scalability and data consistency?
• How do we want to deploy it?

Link to infoworld article

7
NoSQL Use Cases

• Big data: NoSQL databases are well suited to large amounts of data since they can
scale horizontally across multiple servers or nodes and handle high levels of concurrency
• Real-time data processing: high concurrency and low latency make them a common choice
for real-time data processing
• Cloud-based applications: they scale easily and handle large amounts of data in a
distributed environment
• Content management: they handle large amounts of data and support flexible data models
• Social media: they handle high levels of concurrency and support flexible data models
• Internet of Things (IoT): they handle large amounts of data arriving concurrently from a
large number of devices
• E-commerce: they handle high levels of concurrency and support flexible data models
8
Why NoSQL (1)
• RDBMS meant for OLTP systems / Systems of Record
• Strict consistency and durability guarantees (ACID) over multiple data items involved
in a transaction
• But they have scale and cost issues with large volumes of data, distributed geo-scale
applications, very high transaction volumes
• Typical web scale systems do not need strict consistency and durability for every use case
• Social networking
• Real-time applications
• Log analysis
• Browsing retail catalogs
• Reviews and blogs
•…
9
Why NoSQL (2)

• RDBMS ensure uniform structure and modelling of relationships between entities
• A class of emerging applications needs granular and extreme
connectivity information modelled between individual semi-structured
data items. This information also needs to be queried at scale without
large expensive joins.
• Connectivity between users in a social media application: How
many friends do you have within 2 hops?
• Connectivity between companies in terms of domain, technology,
people skills, hiring : Useful for skills acquisition, M&A etc.
• Connectivity between IT network devices: Useful for
troubleshooting incidents
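
The "friends within 2 hops" question above can be sketched as a breadth-first traversal. This is a minimal Python illustration, assuming a hypothetical in-memory adjacency map named `friends`; the names and data are invented, and this is not how any particular graph database stores or executes such a query.

```python
from collections import deque

# Hypothetical undirected friendship graph, stored as an adjacency map.
friends = {
    "asha": {"bala", "chitra"},
    "bala": {"asha", "dev"},
    "chitra": {"asha", "dev", "esha"},
    "dev": {"bala", "chitra", "farid"},
    "esha": {"chitra"},
    "farid": {"dev"},
}


def within_hops(graph, start, max_hops):
    """Return everyone reachable from start in at most max_hops edges."""
    seen = {start}
    frontier = deque([(start, 0)])
    while frontier:
        person, hops = frontier.popleft()
        if hops == max_hops:
            continue  # do not expand beyond the hop limit
        for other in graph.get(person, ()):
            if other not in seen:
                seen.add(other)
                frontier.append((other, hops + 1))
    seen.discard(start)
    return seen


two_hop = within_hops(friends, "asha", 2)
```

The traversal only touches the neighbourhood of the starting node, which is the property graph stores try to preserve as the data set grows.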

10
Choice between consistency and availability
• In a distributed database
✓Scalability and fault tolerance can be improved through additional nodes,
although this puts challenges on maintaining consistency (C).
✓The addition of nodes can also cause availability (A) to suffer due to the
latency caused by increased communication between nodes.
• A write may have to update all replicas before success is returned to the client, so it
takes longer, and the system may not be available during this period to service
reads on the same data item.
• Large scale distributed systems cannot rule out network partitions.
✓Although communication outages are rare and temporary, partition tolerance
(P) must always be supported by a distributed database

• In NoSQL, generally a choice between choosing either CP or AP of CAP

• RDBMS systems mainly provide CA for single data items and then on top of that
provide ACID for transactions that touch multiple data items.
11
CAP Theorem – Implications of CA, CP, AP

• CA – Implies a single-site cluster in which all nodes communicate with each other.
• CP – Implies all the available data is consistent or accurate,
but some data may not be available
• AP – Implies all data is available, but some data returned
may be inconsistent
Consistency Models of NoSQL Databases

When you select a NoSQL database from the above categories, check whether it
matches the requirements of the application.
12
Classification of NoSQL DBs
• Key – value
✓ Maintains a big hash table of keys and values
✓ Example : DynamoDB, Redis, Riak etc
• Document
✓ Maintains data in collections of documents
✓ Example : MongoDB, CouchDB etc
• Column
✓ Each storage block has data from only one column
✓ Example : Cassandra, HBase
• Graph
✓ Network databases
✓ Graph stores data in nodes
✓ Example : Neo4j, HyperGraphDB, Apache TinkerPop

NoSQL Databases - http://nosql-database.org/

13
NoSQL Offerings in Cloud

14
NoSQL Characteristics

• Scale out architecture instead of monolithic architecture of relational databases

• Cluster scale - distribution across 100+ nodes across DCs
• Performance scale - 100K+ DB reads and writes per sec
• Data scale - 1B+ docs in DB
• House large amount of structured, semi-structured and unstructured data
• Dynamic schemas
✓ allows insertion of data without pre-defined schema

• Auto sharding
✓ automatically spreads data across a number of servers
✓ applications are not aware of it
✓ helps in data balancing and recovery from failure

• Replication
✓ Good support for replication of data which offers high availability, fault tolerance
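
The auto sharding and replication ideas above can be sketched in a few lines. This is a simplified Python illustration, assuming hypothetical shard names and a plain hash-modulo placement rule; real systems use range or consistent-hash schemes and rebalance data when servers are added.

```python
import hashlib

SHARDS = ["shard-0", "shard-1", "shard-2"]  # hypothetical server names


def shard_for(key, shards=SHARDS):
    """Pick a shard from a stable hash of the key (hash-modulo placement)."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return shards[int(digest, 16) % len(shards)]


def replicas_for(key, replication_factor=2, shards=SHARDS):
    """The key's own shard plus the next shards in list order hold the copies."""
    first = shards.index(shard_for(key, shards))
    return [shards[(first + i) % len(shards)] for i in range(replication_factor)]


# The application only supplies keys; placement is decided by the data layer.
placement = {k: replicas_for(k) for k in ["user:1", "user:2", "user:3", "user:4"]}
```

Because placement is a pure function of the key, any node can compute where a record lives without the application being aware of the shards.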

15
NoSQL - Pros and Cons

Pros
• Cost effective for large data sets
• Easy to implement
• Easy to distribute, especially across DCs
• Easier to scale up/down
• Relaxes data consistency when required
• No pre-defined schema
• Easier to model semi-structured data or connectivity data
• Easy to support data replication

Cons
• Limited or no support for joins between data sets / tables
• Limited support for group by operations
• ACID properties for transactions are often relaxed or absent
• No standard SQL interface
• Lack of standardisation in this space
• Makes it difficult to port from SQL and across NoSQL stores
• Fewer skilled practitioners compared to SQL
• Fewer BI tools compared to the mature SQL BI space

16
SQL vs NoSQL

SQL                               NoSQL

Relational database               Non-relational, distributed databases
Pre-defined schema                Schema-less
Table based databases             Multiple options: Key-Value, Document, Column, Graph
Vertically scalable               Horizontally scalable
Supports ACID properties          Trades off consistency and availability (CAP theorem)
Supports complex querying         Relatively simpler querying
Excellent support from vendors    Relies heavily on community support

17
Vendors

• Amazon
• Facebook
• Google
• Oracle

18
Topics for today

• NoSQL Introduction
• Classification
• Examples
• Cassandra
• Mongo
• GraphDBs: Neo4J

19
Classification: Document-based

• Store data in form of documents using well known formats like JSON
• Documents accessible via their id, but can be accessed through other index as well
• Maintains data in collections of documents
• Example,
• MongoDB, CouchDB, Couchbase

• Book document :
{
"Book Title" : "Fundamentals of Database Systems",
"Publisher" : "Addison-Wesley",
"Authors" : "Elmasri & Navathe",
"Year of Publication" : "2011"
}

20
Classification: Key-Value store

• Simple data model based on fast access by the key to the value associated with the key
• Value can be a record or object or document or even complex data structure
• Maintains a big hash table of keys and values
• For example,
✓ DynamoDB, Redis, Riak

Key             Value
2014HW112220    {Santosh, Sharma, Pilani}
2018HW123123    {Eshwar, Pillai, Hyd}
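
The key-value model above amounts to a hash table with put/get/delete by key. The sketch below is a toy Python illustration, assuming a hypothetical `KeyValueStore` class and invented field names (`first`, `last`, `city`) for the values; it does not reflect the API of DynamoDB, Redis or Riak.

```python
class KeyValueStore:
    """Toy in-memory key-value store; values are opaque to the store."""

    def __init__(self):
        self._table = {}  # a Python dict is a hash table

    def put(self, key, value):
        self._table[key] = value

    def get(self, key, default=None):
        return self._table.get(key, default)

    def delete(self, key):
        self._table.pop(key, None)


store = KeyValueStore()
# The value can be any record, object or document; the store never inspects it.
store.put("2014HW112220", {"first": "Santosh", "last": "Sharma", "city": "Pilani"})
store.put("2018HW123123", {"first": "Eshwar", "last": "Pillai", "city": "Hyd"})
```

Lookups go only through the key, which is why queries on the contents of the value need a different model (document or column stores) or a secondary index.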

21
Classification: Column-based

• Partition a table by column into column families

• A form of vertical partitioning where each column family is stored in its own files
• Allows versioning of data values
• Each storage block has data from only one column
• Example,
✓ Cassandra, HBase

22
Classification: Graph based

• Data is represented as graphs and related nodes can be found by traversing the edges using path expressions
• aka network database
• Graph query languages, e.g. Cypher, Gremlin
• Example
✓ Neo4J
✓ HyperGraphDB
✓ GraphX
✓ Apache TinkerPop

23
Topics for today

• NoSQL Introduction
• Classification
• Examples
✓MongoDB
✓Cassandra
✓GraphDBs: Neo4J

24
MongoDB

• Database is a set of collections

• A collection is like a table in RDBMS
• A collection stores documents
✓ BSON or Binary JSON with hierarchical key-value pairs
✓ Similar to rows in a table
✓ Max 16MB per document, stored in the WiredTiger storage engine*
• Documents larger than 16MB are stored using GridFS
✓ Support for binary data
✓ Large objects are stored in ‘chunks’ of 255KB
✓ Stores metadata in a separate collection
✓ Does not support multi-document transactions

* https://docs.mongodb.com/manual/core/wiredtiger/
MongoDB

• Data is partitioned in shards

✓ For horizontal scaling
✓ Reduces amount of data each shard handles as the cluster grows
✓ Reduces number of operations on each shard
• Data is replicated
✓ Writes go to the primary and are recorded in the oplog. The “write-concern” setting is used to tune write consistency.
✓ Secondaries use the oplog to bring their local copies up to date
✓ Clients usually read from the primary, but the “read-preference” setting can tune read consistency
• Data updates happen in place and are not versioned / timestamped

26
MongoDB Data Example

Collection inventory

{
  item: "ABC2",
  details: { model: "14Q3", manufacturer: "M1 Corporation" },
  stock: [ { size: "M", qty: 50 } ],
  category: "clothing"
}

{
  item: "MNO2",
  details: { model: "14Q3", manufacturer: "ABC Company" },
  stock: [ { size: "S", qty: 5 }, { size: "M", qty: 5 }, { size: "L", qty: 1 } ],
  category: "clothing"
}

Document insertion

db.inventory.insert(
  {
    item: "ABC1",
    details: { model: "14Q3", manufacturer: "XYZ Company" },
    stock: [ { size: "S", qty: 25 }, { size: "M", qty: 50 } ],
    category: "clothing"
  }
)
27
Example of Simple Query

Collection orders

{
  _id: "a",
  cust_id: "abc123",
  status: "A",
  price: 25,
  items: [ { sku: "mmm", qty: 5, price: 3 },
           { sku: "nnn", qty: 5, price: 2 } ]
}
{
  _id: "b",
  cust_id: "abc124",
  status: "B",
  price: 12,
  items: [ { sku: "nnn", qty: 2, price: 2 },
           { sku: "ppp", qty: 2, price: 4 } ]
}

Query

db.orders.find(
  { status: "A" },                      // selection
  { cust_id: 1, price: 1, _id: 0 }      // projection
)

In SQL it would look like this:
SELECT cust_id, price
FROM orders
WHERE status = 'A'

Results

{
  cust_id: "abc123",
  price: 25
}
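
The selection and projection arguments of the query above can be mimicked over plain Python dictionaries. This is a minimal sketch, assuming a hypothetical `find` helper that supports only equality selection and inclusion projection; it is an illustration of the semantics, not the MongoDB driver API.

```python
orders = [
    {"_id": "a", "cust_id": "abc123", "status": "A", "price": 25,
     "items": [{"sku": "mmm", "qty": 5, "price": 3},
               {"sku": "nnn", "qty": 5, "price": 2}]},
    {"_id": "b", "cust_id": "abc124", "status": "B", "price": 12,
     "items": [{"sku": "nnn", "qty": 2, "price": 2},
               {"sku": "ppp", "qty": 2, "price": 4}]},
]


def find(collection, selection, projection):
    """Equality-only selection; inclusion-only projection.

    As in MongoDB, _id is returned unless the projection sets it to 0.
    """
    wanted = [field for field, flag in projection.items() if flag]
    if projection.get("_id", 1) and "_id" not in wanted:
        wanted.insert(0, "_id")
    results = []
    for doc in collection:
        if all(doc.get(f) == v for f, v in selection.items()):
            results.append({f: doc[f] for f in wanted if f in doc})
    return results


result = find(orders, {"status": "A"}, {"cust_id": 1, "price": 1, "_id": 0})
```

The selection plays the role of the SQL WHERE clause and the projection plays the role of the SELECT column list.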

28
MongoDB Data Model

• JavaScript Object Notation (JSON) model

• Database = set of named collections
• Collection = sequence of documents
• Document = {attribute_1: value_1, ..., attribute_k: value_k}
• Attribute = string (attribute_i ≠ attribute_j for i ≠ j)
• Value = primitive value (string, number, date, ...), or a document, or an array
• Array = [value_1, ..., value_n]

• Key properties: hierarchical (like XML), no schema

✓ Collection docs may have different attributes

29
MongoDB: MapReduce
> db.collection.mapReduce(
    function() { emit(key, value); },                   // map function
    function(key, values) { return reduceFunction; },   // reduce function
    {
      out: collection,
      query: document,
      sort: document,
      limit: number
    }
  )

> db.posts.mapReduce(
    function() { emit(this.name, 1); },                     // emit (name, 1) per post
    function(key, values) { return Array.sum(values); },    // count posts per name
    {
      query: { publish: "true" },
      out: "total_reviews"
    }
  ).find()
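
The map and reduce phases of the example above can be traced in plain Python. This is a minimal sketch, assuming a hypothetical `map_reduce` helper and invented `posts` documents; it shows the filter, emit, group and reduce steps only and is not the MongoDB implementation.

```python
from collections import defaultdict

posts = [  # hypothetical documents
    {"name": "ravi", "publish": "true"},
    {"name": "mia", "publish": "true"},
    {"name": "ravi", "publish": "false"},
    {"name": "ravi", "publish": "true"},
]


def map_reduce(docs, query, map_fn, reduce_fn):
    """Filter docs by query, group emitted (key, value) pairs, reduce each group."""
    groups = defaultdict(list)
    for doc in docs:
        if all(doc.get(k) == v for k, v in query.items()):
            for key, value in map_fn(doc):
                groups[key].append(value)
    return {key: reduce_fn(key, values) for key, values in groups.items()}


total_reviews = map_reduce(
    posts,
    query={"publish": "true"},
    map_fn=lambda doc: [(doc["name"], 1)],       # emit(this.name, 1)
    reduce_fn=lambda key, values: sum(values),   # Array.sum(values)
)
```

Only published posts reach the map phase, so the counts per name reflect the `query` filter.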

30
MongoDB: Indexing
• An index can be created on any field of a collection, including sub-document fields
• e.g. document in a collection
{
"address": {
"city": "New Delhi",
"state": "Delhi",
"pincode": "110001"
},
"tags": [
"football",
"cricket",
"badminton"
],
"name": "Ravi"
}

• indexing a field in ascending order and find

> db.users.createIndex({"tags":1})
> db.users.find({tags:"cricket"}).pretty()

• indexing a sub-document field in ascending order and find

> db.users.createIndex({"address.city":1,"address.state":1,"address.pincode":1})
> db.users.find({"address.city":"New Delhi"}).pretty()
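
What an index on an array field or a sub-document field buys can be sketched with a dictionary from indexed value to document positions. This is a simplified Python illustration, assuming hypothetical helpers `get_path` and `build_index` and an invented second user; MongoDB indexes are ordered B-tree structures, whereas this sketch uses an unordered hash map.

```python
from collections import defaultdict

users = [
    {"name": "Ravi",
     "address": {"city": "New Delhi", "state": "Delhi", "pincode": "110001"},
     "tags": ["football", "cricket", "badminton"]},
    {"name": "Mia",  # hypothetical second document
     "address": {"city": "Pune", "state": "Maharashtra", "pincode": "411001"},
     "tags": ["cricket"]},
]


def get_path(doc, dotted):
    """Resolve a dotted path such as 'address.city' inside a nested dict."""
    value = doc
    for part in dotted.split("."):
        value = value[part]
    return value


def build_index(docs, dotted):
    """Map each indexed value to document positions; arrays get one entry per element."""
    index = defaultdict(list)
    for pos, doc in enumerate(docs):
        value = get_path(doc, dotted)
        for v in (value if isinstance(value, list) else [value]):
            index[v].append(pos)
    return index


tags_index = build_index(users, "tags")
city_index = build_index(users, "address.city")
cricket_fans = [users[i]["name"] for i in tags_index["cricket"]]
```

With the index in place, a lookup by tag or city no longer scans every document in the collection.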

31
MongoDB: Joins
• From MongoDB 3.2 onwards it is possible to join data from 2 collections using aggregate
• Collection books (isbn, title, author) and books_selling_data (isbn, copies_sold)

db.books.aggregate([{
  $lookup: {
    from: "books_selling_data",
    localField: "isbn",
    foreignField: "isbn",
    as: "copies_sold"
  }
}])

• Sample joined document:

{
  "isbn": "978-3-16-148410-0",
  "title": "Some cool book",
  "author": "John Doe",
  "copies_sold": [
    {
      "isbn": "978-3-16-148410-0",
      "copies_sold": 12500
    }
  ]
}

• If you have two collections (users, comments) and want to pull all the comments with pid=444 along with the user info for each

comments
{ uid: 12345, pid: 444, comment: "blah" }
{ uid: 12345, pid: 888, comment: "asdf" }
{ uid: 99999, pid: 444, comment: "qwer" }

users
{ uid: 12345, name: "john" }
{ uid: 99999, name: "mia" }

Join command - Join using $lookup

db.users.aggregate([{
  $lookup: {
    from: "comments",
    localField: "uid",
    foreignField: "uid",
    as: "users_comments"
  }
}])
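
The effect of $lookup can be reproduced over plain Python lists as a left outer join that attaches matching documents as an array. This is a minimal sketch, assuming a hypothetical `lookup` helper; it illustrates the shape of the joined document and is not the MongoDB aggregation engine.

```python
books = [
    {"isbn": "978-3-16-148410-0", "title": "Some cool book", "author": "John Doe"},
]
books_selling_data = [
    {"isbn": "978-3-16-148410-0", "copies_sold": 12500},
]


def lookup(local_docs, foreign_docs, local_field, foreign_field, as_field):
    """Left outer join: attach the list of matching foreign docs to each local doc."""
    joined = []
    for doc in local_docs:
        matches = [f for f in foreign_docs
                   if f.get(foreign_field) == doc.get(local_field)]
        joined.append({**doc, as_field: matches})
    return joined


joined_books = lookup(books, books_selling_data, "isbn", "isbn", "copies_sold")
```

A local document with no match still appears in the output, with an empty array in the `as` field.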
MongoDB – Writes and Reads
• Document oriented DB
• Various read and write choices for flexible consistency tradeoff with scale / performance and durability
• Automatic primary re-election on primary failure and/or network partition

33
MongoDB “read concerns”

• local :
✓Client reads primary replica
✓Client reads from secondary in causally consistent sessions
• available:
✓Read on secondary but causal consistency not required
• majority :
✓If client wants to read what majority of nodes have. Best option for fault tolerance and
durability.
• linearizable :
✓If client wants to read what has been written to majority of nodes before the read started.
✓Has to be read on primary
✓Only single document can be read

https://docs.mongodb.com/v3.4/core/read-preference-mechanics/
34
MongoDB “write concerns”

• how many replicas should ack

✓1 - primary only
✓0 - none
✓n - how many including primary
✓majority - a majority of nodes (preferred for durability)
• journaling - If True then nodes need to write to disk journal before ack
else ack after writing to memory (less durable)
• timeout for write operation

https://docs.mongodb.com/manual/reference/write-concern/
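
The "how many replicas should ack" setting above can be expressed as a small function. This is a simplified Python sketch, assuming a hypothetical helper `acks_required` and assuming every replica set member is a voting, data-bearing member; journaling and the write timeout are left out.

```python
def acks_required(w, members):
    """Number of acknowledgements a write waits for, given write concern w.

    w is 0 (none), 1 (primary only), n (n members including the primary),
    or the string "majority".
    """
    if w == "majority":
        return members // 2 + 1
    if isinstance(w, int) and 0 <= w <= members:
        return w
    raise ValueError(f"unsupported write concern: {w!r}")


three_node_majority = acks_required("majority", 3)
five_node_majority = acks_required("majority", 5)
```

A higher value makes an acknowledged write more likely to survive a primary failure, at the cost of write latency.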
35
Consistency scenarios - causally consistent and durable

Figure: R = W = majority; read latest written value from common node

read=majority
write=majority

• W1 and R1 for P1 will fail and will succeed in P2

• So causally consistent, durable even with network partition sacrificing performance
• Example: Used in critical transaction oriented applications, e.g. stock trading

https://engineering.mongodb.com/post/ryp0ohr2w9pvv0fks88kq6qkz9k9p3
36
Consistency scenarios - causally consistent but not durable

read=majority
write=1

• W1 may succeed on P1 and P2. R1 will succeed only on P2. W1 on P1 may roll back.
• So causally consistent but not durable with network partition. Fast writes, slower reads.
• Example: Twitter - a post may disappear, but if you still see it after a refresh it is durable;
otherwise repost.

https://engineering.mongodb.com/post/ryp0ohr2w9pvv0fks88kq6qkz9k9p3
37
Consistency scenarios - eventual consistency with durable writes

read=local
write=majority

• W1 will succeed only on P2 and will not be accepted on P1 after the failure. Reads on P1 may
not see the last write. Slow durable writes and fast non-causal reads.
• Example: Review site where a write should be durable once committed, but reads don’t need a causal
guarantee as long as the write appears at some time (eventual consistency).

https://engineering.mongodb.com/post/ryp0ohr2w9pvv0fks88kq6qkz9k9p3
38
Consistency scenarios - eventual consistency but no durability

read=local
write=1

• Same as the previous scenario, but writes are also not durable and may be rolled back.
• Example: Real-time sensor data feed that needs fast writes to keep up with the rate, and reads
should get as much recent real-time data as possible. Data may be dropped on failures.

https://engineering.mongodb.com/post/ryp0ohr2w9pvv0fks88kq6qkz9k9p3
39
How applications deal with eventual consistency of AP

• Application must be ready to deal with multiple versions of data

• Application must handle stale data

• Expect NULL values instead of data

• Application must be able to fix inconsistent state

• Read again, if you do not get the data during the first read

40
MongoDB – ACID Transactions

Can NoSQL databases be ACID-compliant?

• MongoDB is an ACID-compliant database.

• As of MongoDB 4.0, there is even support for multi-document ACID transactions when required - 2018.

• Version 4.2 even brought distributed multi-document ACID transactions for even more flexibility - 2019.

41
MongoDB on Cloud

• MongoDB Atlas on AWS Cloud

• Automated MongoDB Service on Microsoft Azure
• MongoDB Atlas on Google Cloud

42
Topics for today

• NoSQL Introduction
• Classification
• Examples
• MongoDB
• Cassandra
• GraphDBs: Neo4J and Tinkerpop

43
Cassandra

• Born at Facebook and built on Amazon Dynamo and Google Bigtable concepts
• AP design in CAP context
• High performance, high availability applications that can sacrifice consistency
✓ Hence built for peer-to-peer symmetric nodes instead of primary-secondary
architecture (as in MongoDB)
• Column oriented DB
✓Create keyspace (like a DB)
✓ Within keyspace create column family (like a table)
✓ Within CF create attributes / columns with their types

44
Cassandra features

45
Cassandra on Cloud

• Amazon Keyspaces (for Apache Cassandra)

• Azure Managed Instance for Apache Cassandra
• DataStax Astra DB for Apache Cassandra (Google Cloud)

46
Replication strategy for user data

• Simple
✓ Specify replication factor = N and data is stored in N nodes of
cluster
• NetworkTopology
✓ Specify replication factor per DC where we want reliability from DC
failures
✓ e.g. CREATE KEYSPACE cluster1 WITH replication = {'class':
'NetworkTopologyStrategy', 'eastDC' : 2, 'westDC' : 3};

47
Consistency semantics (1)

• No primary replica - high partition tolerance and availability, with tunable levels of consistency
• Support for lightweight transactions with “linearizable consistency”
• A Read or Write operation can pick a consistency level
• ONE, TWO, THREE, ALL - 1,2,3 or all replicas respectively have to ack
• ANY - Write to any node even if replicas are down (ref Hinted Handoff)
• QUORUM - majority have to ack
• LOCAL_QUORUM - majority within same datacenter have to ack
•…

https://cassandra.apache.org/doc/latest/architecture/dynamo.html
https://cassandra.apache.org/doc/latest/architecture/guarantees.html#
48
Tunable Consistency

Cassandra guarantees strong consistency if

(nodes_Written + nodes_Read) > replication_factor N

R+W>N

Tuning done by controlling the number of nodes (replicas)

• Selected for Write
• Selected for reads

This simple form (https://www.ecyrd.com/cassandracalculator/)
allows you to try out different parameters for your Apache Cassandra
cluster and see their impact on your application.
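
The R + W > N rule above is easy to check for a given choice of consistency levels. This is a minimal Python sketch, assuming hypothetical helper names `quorum` and `is_strongly_consistent`; it encodes only the overlap condition, not Cassandra's read or write path.

```python
def quorum(replication_factor):
    """Size of a majority of replicas."""
    return replication_factor // 2 + 1


def is_strongly_consistent(nodes_read, nodes_written, replication_factor):
    """True when every read set must overlap every write set (R + W > N)."""
    return nodes_read + nodes_written > replication_factor


N = 3
quorum_both = is_strongly_consistent(quorum(N), quorum(N), N)   # QUORUM / QUORUM
one_and_one = is_strongly_consistent(1, 1, N)                   # ONE / ONE
one_and_all = is_strongly_consistent(1, N, N)                   # read ONE / write ALL
```

With N = 3, QUORUM reads and writes (2 + 2 > 3) overlap in at least one replica, while ONE and ONE (1 + 1) do not, so the latter gives only eventual consistency.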

49
Cassandra Consistency Spectrum

• Cassandra has C A P.
• But Consistency is tunable
• Give up a little A and P to get more C

Figure: consistency spectrum from Weak (faster reads and writes) to Strong

The higher the consistency, the lower the chance of getting stale data during a read
● Pay for this with latency

● Depends on your situational needs

50
Consistency semantics (2)

• For “causal consistency” pick Read consistency level = Write consistency level = QUORUM
• Why ? At least one node will be common between write and read set
so a read will get the last write of a data item
• What happens if read and write use LOCAL_QUORUM ?
• If the read and write sets do not overlap, the result is “eventual consistency”

Figure: R level = W level = QUORUM; read latest written value from common node

• Partitions data based on hashing to distribute data blocks from a column among nodes*
• Random
✓ Crypto hash (MD5) - more expensive
• Murmur3
✓ Non-crypto consistent hash (Mu = multiply, R = rotate operations; easier to
reverse compared to a crypto hash)
✓ 3-5x faster and overall 10% performance improvement
• ByteOrdered
✓ Lexical order
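
Hash-based partitioning on a token ring can be sketched as follows. This is a simplified Python illustration in the spirit of the Random (MD5) partitioner, assuming a hypothetical `Ring` class, invented node names, and one token per node derived from the node name; Cassandra itself assigns token ranges (often many virtual nodes per server) differently.

```python
import bisect
import hashlib


def token(key):
    """MD5-based token for a key."""
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)


class Ring:
    def __init__(self, nodes):
        # One token per node, derived from its name (an assumption for brevity).
        self._ring = sorted((token(n), n) for n in nodes)
        self._tokens = [t for t, _ in self._ring]

    def _start(self, key):
        # First node whose token is >= the key's token, wrapping around the ring.
        return bisect.bisect_left(self._tokens, token(key)) % len(self._ring)

    def owner(self, key):
        return self._ring[self._start(key)][1]

    def replicas(self, key, replication_factor):
        """SimpleStrategy-style placement: owner plus the next nodes on the ring."""
        start = self._start(key)
        count = min(replication_factor, len(self._ring))
        return [self._ring[(start + k) % len(self._ring)][1] for k in range(count)]


ring = Ring(["node-a", "node-b", "node-c", "node-d"])
placement = {k: ring.replicas(k, 3) for k in ["rollno:1", "rollno:2", "rollno:3"]}
```

Swapping `token` for a cheaper non-cryptographic hash changes the cost of computing tokens but not the placement logic, which is the trade-off the Murmur3 bullet describes.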

52
Sample queries
> create keyspace demo with replication={'class':'SimpleStrategy', 'replication_factor':1};
> describe keyspaces;
> use demo;
> -- "create table" can also be written "create columnfamily"
> create table student_info (rollno int primary key, name text, doj timestamp, lastexampercent double);
> describe table student_info;
> consistency quorum;
> insert into student_info (rollno,name,doj,lastexampercent) values (4,'Roxanne', dateof(now()), 90) using ttl 30;
> select rollno from student_info where name='Roxanne' ALLOW FILTERING;
> update student_info set lastexampercent=98 where rollno=2 IF name='Sam';

53
Case study - eBay

• Marketplace has 100 million active buyers with 200+ million items
• 2B page views, 80B DB calls, multi-PB storage capacity
• No transactions, joins, referential integrity
• Multi-DC deployment
• 400M+ writes and 200M+ reads
• 3 Use cases
✓ Social signal on product pages (read latency is not important but write performance is
key)
✓ Connecting users and items via buy, sell, bid, watch events
✓ Many time series analysis cases, e.g. fraud detection

https://www.slideshare.net/jaykumarpatel/cassandra-at-ebay-13920376/2-eBay_Marketplaces_97_million_active
54
Case study - AdStage (from AWS use cases)

• Sector AdTech
• Online advertising platform to manage multi-channel ad campaigns on Google, FB,
Twitter, Bing, LinkedIn
• 3 clusters with 80+ nodes on AWS
• Vast amount of real-time data from 5 channels
• Constantly monitor trends and optimise campaigns for advertisers
• High performance and availability - consistency is not critical as it is read mainly
• Cassandra cluster can scale as more clients are added with no SPOF

55
Topics for today

• NoSQL Introduction
• Pros-Cons
• Classification
• Examples
• MongoDB
• Cassandra
• GraphDBs: Neo4J

56
Graphs

57
Graph computing

• Property graphs
• Data is represented as vertices and edges with
properties
• Properties are key value pairs
• Edges are relationships between vertices
• When to use a graph DB ?
• A relationship-heavy data set with large set of
data items
• Queries are like graph traversals but need to
keep query performance almost constant as
database grows
• A variety of queries may be asked from the data
and static indices on data will not work

58
Relational Vs Graph Models

59
5 Signs You Need a Graph Database

Is your traditional database not up to the task of deciphering your

complex data? Here are five of the telltale signs:

1. Hard-to-navigate interconnected data

2. Lack of real-time insights
3. Inflexible data structures
4. Inefficient querying
5. Inability to decode a web of highly connected data

60
Native vs Non-Native Graph storage
• Non-native graph computing platforms can use external DBs for data storage
• e.g. TinkerPop is an in-memory DB + computing framework that can store in
ElasticSearch, Cassandra etc.
• Native platforms have built-in storage
• e.g. Neo4j
• Native approach is much faster because adjacent nodes and edges are stored closer for
faster traversal
• In a non-native approach, extensive indexing has to be used
• Native approach scales as nodes get added

Figure: One-hop index

https://neo4j.com/blog/native-vs-non-native-graph-technology/
Graphs for Storage and Processing

62
What is Neo4J
➢ It is a graph database supporting full ACID transactions
➢ Embeddable in applications and server deployable
➢ Java based, open source
➢ Schema free, bottom-up data model design
➢ Neo4j is stable
➢ In 24/7 operation since 2003
➢ Neo4j is under active development
➢ High performance graph operations
➢ Supports the Cypher query language
➢ Traverses 1,000,000+ relationships/sec on commodity hardware
➢ The number of nodes and relationships determines the volume of data

https://neo4j.com/blog/native-vs-non-native-graph-technology/
Neo4j / Cypher

• Cypher is a declarative language for graph queries
• Example:
  MATCH (:Person {name: 'Tom Hanks'})-[:ACTED_IN]->(m:Movie)
  WHERE m.released > 2000
  RETURN m LIMIT 5

Launch a free sandbox with dataset on neo4j website

64
Neo4j / Cypher: More queries

• Find movies released after 2000 that Tom Hanks acted in and Ron Howard directed
• MATCH (:Person {name: 'Tom Hanks'})-[:ACTED_IN]->(m:Movie),
  (:Person {name: 'Ron Howard'})-[:DIRECTED]->(m)
  WHERE m.released > 2000
  RETURN m LIMIT 5
• Who were the other actors in the movies released after 2000 that Tom Hanks acted in and
Ron Howard directed
• MATCH (:Person {name: 'Tom Hanks'})-[:ACTED_IN]->(m:Movie),
  (:Person {name: 'Ron Howard'})-[:DIRECTED]->(m), (p:Person)-[:ACTED_IN]->(m)
  WHERE m.released > 2000
  RETURN p LIMIT 5
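
The pattern matching in the two queries above can be traced by hand over a tiny edge list. This is a minimal Python sketch, assuming a small hand-written sample of (person, relationship, movie) edges and hypothetical helpers `out` and `into`; it is not the Neo4j sandbox dataset and not how Cypher is executed.

```python
# Small hand-written sample of movies and relationships.
released = {"Apollo 13": 1995, "Cast Away": 2000, "The Da Vinci Code": 2006}
edges = [
    ("Tom Hanks", "ACTED_IN", "Apollo 13"),
    ("Tom Hanks", "ACTED_IN", "Cast Away"),
    ("Tom Hanks", "ACTED_IN", "The Da Vinci Code"),
    ("Audrey Tautou", "ACTED_IN", "The Da Vinci Code"),
    ("Ron Howard", "DIRECTED", "Apollo 13"),
    ("Ron Howard", "DIRECTED", "The Da Vinci Code"),
    ("Robert Zemeckis", "DIRECTED", "Cast Away"),
]


def out(person, rel):
    """Movies reached from person over outgoing rel edges."""
    return {m for p, r, m in edges if p == person and r == rel}


def into(movie, rel):
    """People with a rel edge pointing at movie."""
    return {p for p, r, m in edges if m == movie and r == rel}


# Movies Tom Hanks acted in AND Ron Howard directed, released after 2000.
movies = sorted(
    m for m in out("Tom Hanks", "ACTED_IN") & out("Ron Howard", "DIRECTED")
    if released[m] > 2000
)

# Other actors in those movies (Tom Hanks himself excluded).
co_actors = sorted({p for m in movies for p in into(m, "ACTED_IN")} - {"Tom Hanks"})
```

Both comma-separated pattern parts in the Cypher query share the variable `m`, which corresponds to the set intersection in the sketch.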

65
Neo4j on Cloud

• Neo4j Enterprise Edition on AWS

• Neo4j - The Easiest Way to Graph on MS Azure
• NEO4J AURA on Google Cloud

66
Apache TinkerPop / Gremlin
• TinkerPop is a computing platform that connects to GraphDBs that actually store the
nodes and edges. Built-in TinkerGraph stores in-memory data only.
• Gremlin is the query language (with traversal machine) that supports Declarative and
Imperative flavours
• Sample queries
• movies where Tom Hanks has acted
• g.V().hasLabel('person').has('name','Tom Hanks').out('ACTED_IN').hasLabel('movies').values('name')
• movies where Tom Hanks has acted and directed by Ron Howard
• g.V().hasLabel('person').has('name','Tom Hanks').out('ACTED_IN').where(__.in('DIRECTED').has('name','Ron Howard')).values('name')

Figure: Person (Tom Hanks) -[ACTED_IN]-> Movie <-[DIRECTED]- Person (Ron Howard)

NoSQL Certification Trainings
1. MongoDB
Free MongoDB courses - MongoDB University - https://learn.mongodb.com/

2. Cassandra
Free Cassandra courses – Datastax Academy - https://www.datastax.com/dev/academy
Multiple Learning Paths - Gain an expert understanding of Apache Cassandra™
Each Learning Path is composed of a sequence of recommended courses for your role
• Administrator Certification
✓ DS201
✓ DS210
• Developer Certification
✓ DS201
✓ DS220

3. Neo4j
neo4j Graphacademy - https://graphacademy.neo4j.com/
Free, Self-Paced, Hands-on Online Training | Free Neo4j Courses from GraphAcademy
Summary

• NoSQL databases are useful when

✓ you have to deal with large data sets
✓ you may need geographical distribution
✓ you do not need ACID transactions but need flexible consistency
• Choices between key-value, column based, document based, graph based data
stores
• Graph DBs and computing models are very suitable when data sets are
relationship heavy - can be modelled as large number of nodes and edges and
queries are similar to graph traversal
✓ Complex relation centric queries are possible
✓ Graph traversal costs can be kept stable with data growth

69
Next Session:
Cassandra and Graph Database in detail

No ratings yet
The Intersectional Internet
287 pages
BMS Keystrokes - Defaults
No ratings yet
BMS Keystrokes - Defaults
16 pages
Vinod Resume
No ratings yet
Vinod Resume
1 page
AP Moller Maersk
No ratings yet
AP Moller Maersk
2 pages
Excel Data Tab Functions Explained
No ratings yet
Excel Data Tab Functions Explained
3 pages
PostgreSQL CHEAT SHEET
No ratings yet
PostgreSQL CHEAT SHEET
8 pages
ThoughtWorks Sample Technical Placement Paper Level1
100% (2)
ThoughtWorks Sample Technical Placement Paper Level1
7 pages
Introduction To Structured Query Language (SQL) : Database Systems Design, Implementation, and Management
No ratings yet
Introduction To Structured Query Language (SQL) : Database Systems Design, Implementation, and Management
44 pages
Clustering and Similarity:: Retrieving Documents
No ratings yet
Clustering and Similarity:: Retrieving Documents
47 pages
DATA Base System Class Notes 2024
No ratings yet
DATA Base System Class Notes 2024
22 pages
GIS Book Myanmar - Resized PDF
No ratings yet
GIS Book Myanmar - Resized PDF
300 pages
Arc Hydro - ArcGIS Pro Project Startup Best Practices
No ratings yet
Arc Hydro - ArcGIS Pro Project Startup Best Practices
18 pages
Dzhokhar Tsarnaev: Current Status Update
100% (3)
Dzhokhar Tsarnaev: Current Status Update
79 pages
This Study Resource Was: ISYS3412 Practical Database Concepts
No ratings yet
This Study Resource Was: ISYS3412 Practical Database Concepts
9 pages
Pendekatan Genre Based Dalam Pengajaran Bahasa Ing
No ratings yet
Pendekatan Genre Based Dalam Pengajaran Bahasa Ing
1 page
Practical Devops Tools
0% (1)
Practical Devops Tools
442 pages
Basis Data: Erika Devi, Fajar A. Nugroho Fakultas Ilmu Komputer Udinus
No ratings yet
Basis Data: Erika Devi, Fajar A. Nugroho Fakultas Ilmu Komputer Udinus
43 pages
CSE301 Lec1
No ratings yet
CSE301 Lec1
15 pages
Conducting A: Literature Search
No ratings yet
Conducting A: Literature Search
32 pages
Database Notes
No ratings yet
Database Notes
6 pages
JETIR2303833
No ratings yet
JETIR2303833
9 pages
2011 - Charles H. Davis, Debora Shaw - Introduction To Information Science and Technology-Information Today, Inc.
No ratings yet
2011 - Charles H. Davis, Debora Shaw - Introduction To Information Science and Technology-Information Today, Inc.
289 pages
Software Requirements in The Word Today
No ratings yet
Software Requirements in The Word Today
4 pages
ER (Entity Relationship) Diagram Model
No ratings yet
ER (Entity Relationship) Diagram Model
6 pages
Smart Grid Data Analytics Insights
No ratings yet
Smart Grid Data Analytics Insights
7 pages
Advanced Database Systems Assignment
No ratings yet
Advanced Database Systems Assignment
4 pages
Web Multimedia Data Model Proposal
No ratings yet
Web Multimedia Data Model Proposal
6 pages

DSECL ZG 522: Big Data Systems

Session 10: NoSQL Databases

• Supports a rich variety of data: structured, semi-structured, and unstructured

5 questions to ask before choosing a NoSQL database

Link to infoworld article

• RDBMSs ensure uniform structure and modelling of relationships

• In NoSQL, the choice is generally between CP and AP in the CAP theorem

• CA- Implies single site cluster in which all nodes

When you select a NoSQL database in the

NoSQL Databases - http://nosql-database.org/

• Scale-out architecture instead of the monolithic architecture of relational databases

Relational database vs. non-relational, distributed databases

• Partition a table by column into column families

• Data is represented as graphs and related nodes can be found by

• Database is a set of collections

• Data is partitioned in shards

Collection orders db.users.find(

• JavaScript Object Notation (JSON) model

• Key properties: hierarchical (like XML), no schema
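
The two properties named above, hierarchy and the absence of a schema, can be illustrated with a minimal Python sketch. The documents, field names, and the `get_path` helper are hypothetical and are not taken from the slides; the sketch only shows that two documents in one collection may have different fields, and that nested fields can be reached with a dotted path.

```python
import json

# Two hypothetical documents from the same collection. The second one has
# fields the first lacks, which a schemaless document model permits.
raw_docs = [
    '{"name": "Asha", "address": {"city": "Pune", "zip": "411001"}}',
    '{"name": "Ravi", "phones": ["98450", "98451"], "age": 31}',
]

users = [json.loads(d) for d in raw_docs]


def get_path(doc, dotted_path, default=None):
    """Walk nested dictionaries using a dotted path such as 'address.city'."""
    current = doc
    for part in dotted_path.split("."):
        if not isinstance(current, dict) or part not in current:
            return default
        current = current[part]
    return current


# The second document has no address, so the lookup yields None for it.
cities = [get_path(u, "address.city") for u in users]
```

Because no schema forces every document to carry every field, code that reads such documents has to tolerate missing values, as the default argument does here.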

• indexing a field in ascending order and find

• indexing a sub-document field in ascending order and find

• Sample joined document: Join command - Join using $lookup
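
As a rough illustration of what a `$lookup` join produces, the following Python sketch emulates a left outer join between two in-memory collections. It does not call MongoDB; the `lookup` function, the `orders` and `inventory` data, and the field names are assumptions made for this example.

```python
def lookup(left_docs, right_docs, local_field, foreign_field, as_field):
    """Attach to each left document the list of right documents whose
    foreign_field equals the left document's local_field (left outer join)."""
    by_key = {}
    for doc in right_docs:
        by_key.setdefault(doc.get(foreign_field), []).append(doc)

    joined = []
    for doc in left_docs:
        merged = dict(doc)  # shallow copy so the input is not modified
        merged[as_field] = by_key.get(doc.get(local_field), [])
        joined.append(merged)
    return joined


# Hypothetical sample data.
orders = [
    {"_id": 1, "item": "almonds", "qty": 2},
    {"_id": 2, "item": "pecans", "qty": 1},
    {"_id": 3, "item": "walnuts", "qty": 5},
]
inventory = [
    {"_id": 10, "sku": "almonds", "instock": 120},
    {"_id": 11, "sku": "pecans", "instock": 70},
]

result = lookup(orders, inventory, "item", "sku", "inventory_docs")
```

Every order survives the join; an order with no matching inventory record simply carries an empty list in the new array field.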

• How many replicas should acknowledge (ack)

Read latest written value

• W1 and R1 for P1 will fail and will succeed in P2

• Application must be ready to deal with multiple versions of data

• Application must handle stale data

• Expect NULL values instead of data

• Application must be able to fix inconsistent state
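
One way an application might cope with several versions of the same item is sketched below in Python. It assumes a last-write-wins policy based on a client-supplied timestamp, which is only one of several possible reconciliation strategies and is not stated in the slides; `Version`, `resolve`, and `stale_replicas` are hypothetical names.

```python
from typing import NamedTuple, Optional, Sequence


class Version(NamedTuple):
    value: Optional[str]  # None models a replica that returned no data
    timestamp: int        # hypothetical write timestamp set by the writer


def resolve(versions: Sequence[Optional[Version]]) -> Optional[Version]:
    """Pick the newest non-empty version (last-write-wins assumption)."""
    present = [v for v in versions if v is not None and v.value is not None]
    if not present:
        return None
    return max(present, key=lambda v: v.timestamp)


def stale_replicas(versions, winner):
    """Return the indexes of replicas whose reply differs from the winner."""
    return [i for i, v in enumerate(versions) if v != winner]


# Replies from three replicas: one stale, one current, one with no data.
replies = [Version("cart=[book]", 100), Version("cart=[book, pen]", 105), None]
winner = resolve(replies)
to_repair = stale_replicas(replies, winner)
```

The list in `to_repair` identifies the replicas the application (or a repair process) would need to overwrite with the winning value.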

Can NoSQL databases be ACID-compliant?

• MongoDB is an ACID-compliant database.

• As of MongoDB 4.0, there is even support for multi-document ACID transactions

• Version 4.2 even brought distributed multi-document ACID transactions

• MongoDB Atlas on AWS Cloud

• Amazon Keyspaces (for Apache Cassandra)

Cassandra guarantees strong consistency if

Tuning done by controlling the number of nodes (replicas)

This simple form (https://www.ecyrd.com/cassandracalculator/)

Faster reads and writes

Consistency spectrum: weak to strong

● Depends on your situational needs

• For “causal consistency” pick Read consistency level = Write

R level = W level = QUORUM
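
The arithmetic behind the QUORUM setting can be checked with a short Python sketch. It relies on the commonly cited rule that a read is guaranteed to see the latest acknowledged write when the number of replicas read (R) plus the number of replicas written (W) exceeds the replication factor (N). The function names are hypothetical.

```python
def quorum(replication_factor: int) -> int:
    """Smallest majority of the replicas: floor(N / 2) + 1."""
    return replication_factor // 2 + 1


def replicas_overlap(read_replicas: int, write_replicas: int,
                     replication_factor: int) -> bool:
    """True when every read set must share at least one replica with
    every write set, i.e. R + W > N."""
    return read_replicas + write_replicas > replication_factor


rf = 3
q = quorum(rf)

quorum_both = replicas_overlap(q, q, rf)   # QUORUM reads and QUORUM writes
one_and_one = replicas_overlap(1, 1, rf)   # ONE reads and ONE writes
one_and_all = replicas_overlap(1, rf, rf)  # ONE reads and ALL writes
```

With a replication factor of 3, a quorum is 2 replicas, so QUORUM reads and writes always overlap on at least one replica, whereas reading and writing at level ONE gives no such guarantee.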

• Partitions data based on hashing to distribute data blocks from a
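
Hash-based partitioning can be sketched in Python as a token ring: each node owns a position on the ring, and a key is stored on the first node whose token is at or after the key's token, wrapping around at the end. MD5 is used here only as a convenient stand-in hash, and the node names and `Ring` class are hypothetical; this is an illustration of the idea, not Cassandra's actual partitioner.

```python
import bisect
import hashlib


def token(key: str) -> int:
    """Map a string to a 64-bit integer token (MD5 as a stand-in hash)."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


class Ring:
    def __init__(self, nodes):
        # Sort nodes by their token so the ring can be searched with bisect.
        self._entries = sorted((token(n), n) for n in nodes)
        self._tokens = [t for t, _ in self._entries]

    def owner(self, key: str) -> str:
        """Return the first node at or after the key's token, wrapping."""
        idx = bisect.bisect_left(self._tokens, token(key)) % len(self._entries)
        return self._entries[idx][1]


nodes = ["node-a", "node-b", "node-c"]
ring = Ring(nodes)
placement = {k: ring.owner(k) for k in ["user:1", "user:2", "user:3", "user:4"]}
```

Because placement depends only on the hash of the key, any client that knows the ring can compute the owner of a key without consulting a central directory.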

Is your traditional database not up to the task of deciphering your

1. Hard-to-navigate interconnected data

➢ Traverses 1,000,000+ relationships/sec on commodity hardware

Launch a free sandbox with a dataset on the Neo4j website

• Neo4j Enterprise Edition on AWS

Person: Ron Howard 67

• NoSQL databases are useful when
