
BDS Session 10

The document provides an overview of NoSQL databases, highlighting their characteristics, classification, and use cases. It discusses various types of NoSQL databases such as key-value, document, column, and graph databases, along with specific examples like MongoDB and Cassandra. Additionally, it addresses the advantages and disadvantages of NoSQL compared to traditional SQL databases, emphasizing their scalability and flexibility for handling large volumes of data.

Uploaded by

Rahul Sbhatla

DSECL ZG 522: Big Data Systems

Session 10: NoSQL Databases

Janardhanan PS
Topics for today

• NoSQL Introduction
• Classification
• Examples
• MongoDB
• Cassandra
• GraphDBs: Neo4J

2
Database Sphere

3
What is NoSQL Database ?

• NoSQL databases, also known as "Not Only SQL" databases, are a type of
database that does not use traditional SQL (Structured Query Language) for storing
and manipulating data
• They are designed to handle large amounts of unstructured, semi-structured, or
polymorphic data and are often used for big data, real-time data processing, and
cloud-based applications
• NoSQL databases use a distributed architecture, allowing them to scale horizontally
across multiple servers or nodes, making them ideal for handling high levels of
concurrency and data volume

4
What is NoSQL ?
• Coined by Carlo Strozzi in 1998
✓Lightweight, open source database without standard SQL interface
• Reintroduced by Johan Oskarsson in 2009
✓Non-relational databases
• Characteristics
✓Not Only SQL
✓Non-relational
✓Schema-less
✓Loosen consistency to address scalability and availability requirements in
large scale applications
✓Open source movement born out of web-scale applications
✓Distributed for scale
✓Cluster Friendly
5
Data model

• Supports rich variety of data : structured, semi-structured and unstructured


• No fixed schema, i.e. each record could have different attributes
• Non-relational - no join operations are typically supported
• Transaction semantics for multiple data items are typically not supported
• Relaxed consistency semantics - no support for ACID as in RDBMS
• In some cases can model data as graphs and queries as graph traversals

6
How to choose the right NoSQL database?

5 questions to ask before choosing a NoSQL database


• Is NoSQL the right choice?
• Which NoSQL data model do we need?
• What is the latency requirement?
• How important are scalability and data consistency?
• How do we want to deploy it?

Link to infoworld article

7
NoSQL Use Cases

• Big data: NoSQL databases are perfect for handling large amounts of data since they can
scale horizontally across multiple servers or nodes and handle high levels of concurrency
• Real-time data processing: They are often used for real-time data processing since they can
handle high levels of concurrency and support low latency
• Cloud-based applications: NoSQL databases are perfect for cloud-based applications since
they can easily scale and handle large amounts of data in a distributed environment
• Content management: NoSQL databases are often used for content management systems
since they can handle large amounts of data and support flexible data models
• Social media: NoSQL databases are often used for social media applications since they can
handle high levels of concurrency and support flexible data models
• Internet of Things (IoT): These databases are often used for IoT applications since they can
handle large amounts of data from a large number of devices and handle high levels of
concurrency
• E-commerce: They are often used for e-commerce applications since they can handle high
levels of concurrency and support flexible data models
8
Why NoSQL (1)
• RDBMS meant for OLTP systems / Systems of Record
• Strict consistency and durability guarantees (ACID) over multiple data items involved
in a transaction
• But they have scale and cost issues with large volumes of data, distributed geo-scale
applications, very high transaction volumes
• Typical web scale systems do not need strict consistency and durability for every use case
• Social networking
• Real-time applications
• Log analysis
• Browsing retail catalogs
• Reviews and blogs
•…
9
Why NoSQL (2)

• RDBMS ensure uniform structure and modelling of relationships between entities
• A class of emerging applications need granular and extreme
connectivity information modelled between individual semi-structured
data items. This information needs to be also queried at scale without
large expensive joins.
• Connectivity between users in a social media application: How
many friends do you have within 2 hops ?
• Connectivity between companies in terms of domain, technology,
people skills, hiring : Useful for skills acquisition, M&A etc.
• Connectivity between IT network devices: Useful for
troubleshooting incidents
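
The "friends within 2 hops" question above can be sketched as a breadth-first traversal. This is only an illustration in plain Python, assuming a small in-memory adjacency list; the function name `friends_within_hops` and the `FRIENDS` data are hypothetical, and a graph database would answer the same question through its own query language.

```python
from collections import deque

def friends_within_hops(graph, start, max_hops):
    """Return every user reachable from `start` in 1..max_hops friendship hops."""
    seen = {start}          # users already visited, including the start user
    reached = set()         # result set, excludes the start user
    queue = deque([(start, 0)])
    while queue:
        user, hops = queue.popleft()
        if hops == max_hops:
            continue        # do not expand beyond the hop limit
        for friend in graph.get(user, ()):
            if friend not in seen:
                seen.add(friend)
                reached.add(friend)
                queue.append((friend, hops + 1))
    return reached

# Hypothetical undirected friendship graph stored as adjacency lists.
FRIENDS = {
    "asha": ["bala", "chitra"],
    "bala": ["asha", "dev"],
    "chitra": ["asha"],
    "dev": ["bala", "esha"],
    "esha": ["dev"],
}

two_hop_friends = friends_within_hops(FRIENDS, "asha", 2)
```

The traversal only touches the neighbourhood of the start user, which is why such queries avoid the large joins a relational model would need.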

10
Choice between consistency and availability
• In a distributed database
✓Scalability and fault tolerance can be improved through additional nodes,
although this puts challenges on maintaining consistency (C).
✓The addition of nodes can also cause availability (A) to suffer due to the
latency caused by increased communication between nodes.
• May have to update all replicas before sending success to the client, so writes
take longer and the system may not be available during this period to service
reads on the same data item.
• Network partitions cannot be ruled out entirely in large scale distributed systems.
✓Although communication outages are rare and temporary, partition tolerance
(P) must always be supported by a distributed database

• In NoSQL, the choice is generally between CP and AP of CAP


• RDBMS systems mainly provide CA for single data items and then on top of that
provide ACID for transactions that touch multiple data items.
11
CAP Theorem – Implications of CA, CP, AP

• CA – Implies a single-site cluster in which all nodes communicate with each other.
• CP – Implies all the available data is consistent or accurate,
but some data may not be available
• AP – Implies all data is available, but some data returned
may be inconsistent
Consistency Models of NoSQL Databases

When you select a NoSQL database in the above categories, check whether it
matches the requirements of the application.
12
Classification of NoSQL DBs
• Key – value
✓ Maintains a big hash table of keys and values
✓ Example : DynamoDB, Redis, Riak etc
• Document
✓ Maintains data in collections of documents
✓ Example : MongoDB, CouchDB etc
• Column
✓ Each storage block has data from only one column
✓ Example : Cassandra, HBase
• Graph
✓ Network databases
✓ Graph stores data in nodes
✓ Example : Neo4j, HyperGraphDB, Apache TinkerPop

NoSQL Databases - http://nosql-database.org/

13
NoSQL Offerings in Cloud

14
NoSQL Characteristics

• Scale out architecture instead of monolithic architecture of relational databases


• Cluster scale - distribution across 100+ nodes across DCs
• Performance scale - 100K+ DB reads and writes per sec
• Data scale - 1B+ docs in DB
• House large amount of structured, semi-structured and unstructured data
• Dynamic schemas
✓ allows insertion of data without pre-defined schema

• Auto sharding
✓ automatically spreads data across a number of servers
✓ applications are not aware of it
✓ helps in data balancing and recovery from failure

• Replication
✓ Good support for replication of data which offers high availability, fault tolerance
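
The idea behind sharding can be sketched as a router that hashes each key to pick a server. This is a simplified illustration, assuming a fixed number of shards and modulo placement; `shard_for` and `ShardedStore` are hypothetical names, and real NoSQL systems hide this routing from the application and also rebalance data when servers are added or lost.

```python
import hashlib

def shard_for(key, num_shards):
    """Map a key to a shard index using a stable hash (not Python's salted hash())."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards

class ShardedStore:
    """Toy key-value store that spreads keys over several in-memory shards."""

    def __init__(self, num_shards):
        self.shards = [dict() for _ in range(num_shards)]

    def put(self, key, value):
        self.shards[shard_for(key, len(self.shards))][key] = value

    def get(self, key):
        # The caller never names a shard; routing is derived from the key.
        return self.shards[shard_for(key, len(self.shards))].get(key)

store = ShardedStore(4)
for i in range(100):
    store.put(f"user:{i}", {"id": i})
```

Modulo placement is the simplest choice but moves most keys when the shard count changes, which is one reason production systems prefer range-based or consistent-hash placement.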

15
NoSQL - Pros and Cons

Pros Cons
• Cost effective for large data sets
• Joins between data sets / tables
• Easy to implement • Group by operations
• Easy to distribute esp across DCs • ACID properties for transactions
• Easier to scale up/down • SQL interface
• Relaxes data consistency when required • Lack of standardisation in this space
• No pre-defined schema • Makes it difficult to port from SQL
• Easier to model semi-structured data or and across NoSQL stores
connectivity data
• Less skills compared to SQL
• Easy to support data replication
• Lesser BI tools compared to mature SQL
BI space

16
SQL vs NoSQL

SQL NoSQL

Relational database Non relational, distributed databases


Pre-defined schema Schema less
Table based databases Multiple options: Key-Value,
Document, Column, Graph
Vertically scalable Horizontally scalable
Supports ACID properties Supports CAP theorem
Supports complex querying Relatively simpler querying
Excellent support from vendors Relies heavily on community support

17
Vendors

• Amazon
• Facebook
• Google
• Oracle

18
Topics for today

• NoSQL Introduction
• Classification
• Examples
• Cassandra
• Mongo
• GraphDBs: Neo4J

19
Classification: Document-based

• Store data in form of documents using well known formats like JSON
• Documents accessible via their id, but can be accessed through other index as well
• Maintains data in collections of documents
• Example,
• MongoDB, CouchDB, CouchBase

• Book document :
{
"Book Title" : "Fundamentals of Database Systems",
"Publisher" : "Addison-Wesley",
"Authors" : "Elmasri & Navathe",
"Year of Publication" : "2011"
}

20
Classification: Key-Value store

• Simple data model based on fast access by the key to the value associated with the key
• Value can be a record or object or document or even complex data structure
• Maintains a big hash table of keys and values
• For example,
✓ DynamoDB, Redis, Riak

Key Value
2014HW112220 { Santosh,Sharma,Pilani}
2018HW123123 {Eshwar,Pillai,Hyd}

21
Classification: Column-based

• Partition a table by column into column families
• A form of vertical partitioning where each column family is stored in its own files
• Allows versioning of data values
• Each storage block has data from only one column
• Example,
✓ Cassandra, HBase

22
Classification: Graph based

• Data is represented as graphs and related nodes can be found by
traversing the edges using a path expression
• aka network database
• Graph query languages, e.g. Cypher, Gremlin
• Example
✓ Neo4J
✓ HyperGraphDB
✓ GraphX
✓ Apache TinkerPop

23
Topics for today

• NoSQL Introduction
• Classification
• Examples
✓MongoDB
✓Cassandra
✓GraphDBs: Neo4J

24
MongoDB

• Database is a set of collections


• A collection is like a table in RDBMS
• A collection stores documents
✓ BSON or Binary JSON with hierarchical key-value pairs
✓ Similar to rows in a table
✓ Max 16MB documents stored in WiredTiger storage engine
• Documents larger than 16MB use GridFS
✓ Support for binary data
✓ Large objects can be stored in ‘chunks’ of 255KB
✓ Stores metadata in a separate collection
✓ Does not support multi-document transactions
✓ WiredTiger storage engine*

* https://docs.mongodb.com/manual/core/wiredtiger/
25
MongoDB

• Data is partitioned in shards


✓ For horizontal scaling
✓ Reduces amount of data each shard handles as the cluster grows
✓ Reduces number of operations on each shard
• Data is replicated
✓ Writes go to the primary and are recorded in its oplog. “write-concern” setting used to tweak write consistency.
✓ Secondaries use oplog to get local copies updated
✓ Clients usually read from primary but “read-preference” setting can tweak read consistency
• Data updates happen in place and not versioned / timestamped

26
MongoDB Data Example

Collection inventory

{
  item: "ABC2",
  details: { model: "14Q3", manufacturer: "M1 Corporation" },
  stock: [ { size: "M", qty: 50 } ],
  category: "clothing"
}

{
  item: "MNO2",
  details: { model: "14Q3", manufacturer: "ABC Company" },
  stock: [ { size: "S", qty: 5 }, { size: "M", qty: 5 }, { size: "L", qty: 1 } ],
  category: "clothing"
}

Document insertion

db.inventory.insert(
  {
    item: "ABC1",
    details: { model: "14Q3", manufacturer: "XYZ Company" },
    stock: [ { size: "S", qty: 25 }, { size: "M", qty: 50 } ],
    category: "clothing"
  }
)
27
Example of Simple Query

Collection orders

{
  _id: "a",
  cust_id: "abc123",
  status: "A",
  price: 25,
  items: [ { sku: "mmm", qty: 5, price: 3 },
           { sku: "nnn", qty: 5, price: 2 } ]
}
{
  _id: "b",
  cust_id: "abc124",
  status: "B",
  price: 12,
  items: [ { sku: "nnn", qty: 2, price: 2 },
           { sku: "ppp", qty: 2, price: 4 } ]
}

Query

db.orders.find(
  { status: "A" },                     // selection
  { cust_id: 1, price: 1, _id: 0 }     // projection
)

In SQL it would look like this:
SELECT cust_id, price
FROM orders
WHERE status="A"

Results

{
  cust_id: "abc123",
  price: 25
}
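
How selection and projection combine in the query above can be mimicked in a few lines of plain Python. This is only a sketch, assuming equality filters and an inclusion-style projection; the helper `find` is hypothetical and is not the MongoDB driver API (the real server also returns `_id` unless it is excluded).

```python
def find(collection, selection, projection):
    """Emulate find() for equality filters and an inclusion projection."""
    # Fields flagged with 1 are kept; fields flagged with 0 (such as _id) are dropped.
    wanted = [field for field, flag in projection.items() if flag == 1]
    results = []
    for doc in collection:
        if all(doc.get(field) == value for field, value in selection.items()):
            results.append({field: doc[field] for field in wanted if field in doc})
    return results

# The two order documents from the slide.
ORDERS = [
    {"_id": "a", "cust_id": "abc123", "status": "A", "price": 25,
     "items": [{"sku": "mmm", "qty": 5, "price": 3},
               {"sku": "nnn", "qty": 5, "price": 2}]},
    {"_id": "b", "cust_id": "abc124", "status": "B", "price": 12,
     "items": [{"sku": "nnn", "qty": 2, "price": 2},
               {"sku": "ppp", "qty": 2, "price": 4}]},
]

matches = find(ORDERS, {"status": "A"}, {"cust_id": 1, "price": 1, "_id": 0})
```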

28
MongoDB Data Model

• JavaScript Object Notation (JSON) model


• Database = set of named collections
• Collection = sequence of documents
• Document = {attribute_1: value_1, ..., attribute_k: value_k}
• Attribute = string (attribute_i ≠ attribute_j)
• Value = primitive value (string, number, date, ...), or a document, or an array
• Array = [value_1, ..., value_n]

• Key properties: hierarchical (like XML), no schema


✓ Collection docs may have different attributes

29
MongoDB: MapReduce
> db.collection.mapReduce(
    function() { emit(key, value); },                  // map function
    function(key, values) { return reduceFunction },   // reduce function
    {
      out: collection,
      query: document,
      sort: document,
      limit: number
    }
  )

> db.posts.mapReduce(
    function() { emit(this.name, 1); },
    function(key, values) { return Array.sum(values) },
    {
      query: { publish: "true" },
      out: "total_reviews"
    }
  ).find()
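
The flow of the `db.posts.mapReduce` call above (filter by query, emit key/value pairs, group by key, reduce each group) can be traced with a small Python emulation. This is a sketch under stated assumptions: `map_reduce` is a hypothetical helper, the `POSTS` documents are made up, and the real server runs the JavaScript functions and writes the result to the `out` collection.

```python
from collections import defaultdict

def map_reduce(docs, map_fn, reduce_fn, query=None):
    """Filter docs by equality query, group emitted pairs by key, reduce each group."""
    query = query or {}
    groups = defaultdict(list)
    for doc in docs:
        if all(doc.get(field) == value for field, value in query.items()):
            for key, value in map_fn(doc):      # map step: emit (key, value) pairs
                groups[key].append(value)
    # reduce step: one call per distinct key
    return {key: reduce_fn(key, values) for key, values in groups.items()}

# Hypothetical posts; "publish" is a string, as in the slide's query.
POSTS = [
    {"name": "ravi", "publish": "true"},
    {"name": "ravi", "publish": "true"},
    {"name": "mia", "publish": "false"},
    {"name": "mia", "publish": "true"},
]

totals = map_reduce(
    POSTS,
    map_fn=lambda doc: [(doc["name"], 1)],
    reduce_fn=lambda key, values: sum(values),
    query={"publish": "true"},
)
```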

30
MongoDB: Indexing
• Can create an index on any field of a collection or on a sub-document field
• e.g. document in a collection
{
  "address": {
    "city": "New Delhi",
    "state": "Delhi",
    "pincode": "110001"
  },
  "tags": [
    "football",
    "cricket",
    "badminton"
  ],
  "name": "Ravi"
}

• indexing a field in ascending order and find
> db.users.createIndex({"tags":1})
> db.users.find({tags:"cricket"}).pretty()

• indexing a sub-document field in ascending order and find
> db.users.createIndex({"address.city":1,"address.state":1,"address.pincode":1})
> db.users.find({"address.city":"New Delhi"}).pretty()

31
MongoDB: Joins
• In Mongo 3.2+ it is possible to join data from 2 collections using aggregate
• Collection books (isbn, title, author) and books_selling_data (isbn, copies_sold)
db.books.aggregate([{ $lookup: {
    from: "books_selling_data",
    localField: "isbn",
    foreignField: "isbn",
    as: "copies_sold"
  }
}])

• Sample joined document:
{
  "isbn": "978-3-16-148410-0",
  "title": "Some cool book",
  "author": "John Doe",
  "copies_sold": [
    {
      "isbn": "978-3-16-148410-0",
      "copies_sold": 12500
    }
  ]
}

If you have two collections (users, comments) and want to pull
all the comments with pid=444 along with the user info for each

comments
{ uid:12345, pid:444, comment:"blah" }
{ uid:12345, pid:888, comment:"asdf" }
{ uid:99999, pid:444, comment:"qwer" }

users
{ uid:12345, name:"john" }
{ uid:99999, name:"mia" }

Join command - Join using $lookup
db.users.aggregate({
  $lookup:{
    from:"comments",
    localField:"uid",
    foreignField:"uid",
    as:"users_comments"
  }
})
32
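
What `$lookup` produces can be pictured as a left outer join that attaches the matching documents as an array. The Python sketch below reuses the users and comments from the slide; the helper `lookup` is hypothetical and only mirrors the equality-match form of the stage, not the aggregation framework itself.

```python
def lookup(local_docs, foreign_docs, local_field, foreign_field, as_field):
    """Attach to each local doc the list of foreign docs with an equal field value."""
    joined = []
    for doc in local_docs:
        matches = [f for f in foreign_docs
                   if f.get(foreign_field) == doc.get(local_field)]
        # Every local doc is kept, even when the match list is empty (left outer join).
        joined.append({**doc, as_field: matches})
    return joined

USERS = [{"uid": 12345, "name": "john"}, {"uid": 99999, "name": "mia"}]
COMMENTS = [
    {"uid": 12345, "pid": 444, "comment": "blah"},
    {"uid": 12345, "pid": 888, "comment": "asdf"},
    {"uid": 99999, "pid": 444, "comment": "qwer"},
]

users_with_comments = lookup(USERS, COMMENTS, "uid", "uid", "users_comments")
```

Note that this joins on `uid` only; restricting the result to `pid` 444, as the slide's question asks, would need an additional filter step.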
MongoDB – Writes and Reads
• Document oriented DB
• Various read and write choices for flexible consistency tradeoff with scale / performance and durability
• Automatic primary re-election on primary failure and/or network partition

33
MongoDB “read concerns”

• local :
✓Client reads primary replica
✓Client reads from secondary in causally consistent sessions
• available:
✓Read on secondary but causal consistency not required
• majority :
✓If client wants to read what majority of nodes have. Best option for fault tolerance and
durability.
• linearizable :
✓If client wants to read what has been written to majority of nodes before the read started.
✓Has to be read on primary
✓Only single document can be read

https://docs.mongodb.com/v3.4/core/read-preference-mechanics/
34
MongoDB “write concerns”

• how many replicas should ack


✓1 - primary only
✓0 - none
✓n - how many including primary
✓majority - a majority of nodes (preferred for durability)
• journaling - If True then nodes need to write to disk journal before ack
else ack after writing to memory (less durable)
• timeout for write operation

https://docs.mongodb.com/manual/reference/write-concern/
35
Consistency scenarios - causally consistent and durable

read=majority
write=majority

R = W = majority: read latest written value from common node

• W1 and R1 for P1 will fail and will succeed in P2
• So causally consistent and durable even with a network partition, at the cost of performance
• Example: Used in critical transaction oriented applications, e.g. stock trading

https://engineering.mongodb.com/post/ryp0ohr2w9pvv0fks88kq6qkz9k9p3
36
Consistency scenarios - causally consistent but not durable

read=majority
write=1

• W1 may succeed on P1 and P2. R1 will succeed only on P2. W1 on P1 may roll back.
• So causally consistent but not durable with network partition. Fast writes, slower reads.
• Example: Twitter - a post may disappear but if on refresh you see it then it should be durable,
else repost.

https://engineering.mongodb.com/post/ryp0ohr2w9pvv0fks88kq6qkz9k9p3
37
Consistency scenarios - eventual consistency with durable writes

read=local
write=majority

• W1 will succeed only for P2 and will not be accepted on P1 after failure. Reads may not
succeed to see the last write on P1. Slow durable writes and fast non-causal reads.
• Example: Review site where write should be durable if committed but reads don’t need causal
guarantee as long as it appears some time (eventual consistency).

https://engineering.mongodb.com/post/ryp0ohr2w9pvv0fks88kq6qkz9k9p3
38
Consistency scenarios - eventual consistency but no durability

read=local
write=1

• Same as previous scenario, and writes are also not durable and may be rolled back.
• Example: Real-time sensor data feed that needs fast writes to keep up with the rate and reads
should get as much recent real-time data as possible. Data may be dropped on failures.

https://engineering.mongodb.com/post/ryp0ohr2w9pvv0fks88kq6qkz9k9p3
39
How applications deal with eventual consistency of AP

• Application must be ready to deal with multiple versions of data

• Application must handle stale data

• Expect NULL values instead of data

• Application must be able to fix inconsistent state

• Read again, if you do not get the data during the first read

40
MongoDB – ACID Transactions

Can NoSQL databases be ACID-compliant?

• MongoDB is an ACID-compliant database.

• As of MongoDB 4.0 (2018), there is even support for multi-document ACID
transactions when required.

• Version 4.2 (2019) even brought distributed multi-document ACID
transactions for even more flexibility.

41
MongoDB on Cloud

• MongoDB Atlas on AWS Cloud


• Automated MongoDB Service on Microsoft Azure
• MongoDB Atlas on Google Cloud

42
Topics for today

• NoSQL Introduction
• Classification
• Examples
• MongoDB
• Cassandra
• GraphDBs: Neo4J and Tinkerpop

43
Cassandra

• Born in Facebook and built on Amazon Dynamo and Google Big Table concepts
• AP design in CAP context
• High performance, high availability applications that can sacrifice consistency
✓ Hence built for peer-to-peer symmetric nodes instead of primary-secondary
architecture (as in MongoDB)
• Column oriented DB
✓Create keyspace (like a DB)
✓ Within keyspace create column family (like a table)
✓ Within CF create attributes / columns with their types

44
Cassandra features

45
Cassandra on Cloud

• Amazon Keyspaces (for Apache Cassandra)


• Azure Managed Instance for Apache Cassandra
• DataStax Astra DB for Apache Cassandra (Google Cloud)

46
Replication strategy for user data

• Simple
✓ Specify replication factor = N and data is stored in N nodes of
cluster
• NetworkTopology
✓ Specify replication factor per DC where we want reliability from DC
failures
✓ e.g. CREATE KEYSPACE cluster1 WITH replication = {'class':
'NetworkTopologyStrategy', 'eastDC' : 2, 'westDC' : 3};

47
Consistency semantics (1)

• No primary replica - high partition tolerance and availability, with tunable levels of consistency
• Support for lightweight transactions with “linearizable consistency”
• A Read or Write operation can pick a consistency level
• ONE, TWO, THREE, ALL - 1,2,3 or all replicas respectively have to ack
• ANY - Write to any node even if replicas are down (ref Hinted Handoff)
• QUORUM - majority have to ack
• LOCAL_QUORUM - majority within same datacenter have to ack
•…

https://cassandra.apache.org/doc/latest/architecture/dynamo.html
https://cassandra.apache.org/doc/latest/architecture/guarantees.html#
48
Tunable Consistency

Cassandra guarantees strong consistency if

(nodes_Written + nodes_Read) > replication_factor N, i.e. R + W > N

Tuning is done by controlling the number of nodes (replicas)
• Selected for writes
• Selected for reads

This simple form (https://www.ecyrd.com/cassandracalculator/)
allows you to try out different parameters for your Apache Cassandra
cluster and see their impact on your application.
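
The R + W > N rule can be checked directly: whenever the read and write replica counts sum to more than the replication factor, every possible read set shares at least one replica with every possible write set. The sketch below is illustrative only; the function names are hypothetical, and `quorum` uses the usual majority formula floor(N/2) + 1.

```python
from itertools import combinations

def is_strongly_consistent(read_replicas, write_replicas, replication_factor):
    """The R + W > N condition from the slide."""
    return read_replicas + write_replicas > replication_factor

def quorum(replication_factor):
    """Size of a majority of replicas."""
    return replication_factor // 2 + 1

def sets_always_overlap(r, w, n):
    """Brute force: does every r-replica read set intersect every w-replica write set?"""
    replicas = range(n)
    return all(
        set(read_set) & set(write_set)
        for read_set in combinations(replicas, r)
        for write_set in combinations(replicas, w)
    )

# With N = 3, QUORUM reads and writes (2 + 2 > 3) always share a replica,
# while ONE/ONE (1 + 1 <= 3) can miss the latest write.
quorum_ok = is_strongly_consistent(quorum(3), quorum(3), 3)
one_one_ok = is_strongly_consistent(1, 1, 3)
```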

49
Cassandra Consistency Spectrum

• Cassandra has C A P.
• But Consistency is tunable
• Give up a little A and P to get more C

Consistency spectrum: Weak (faster reads and writes) to Strong

The higher the consistency, the less chance you get stale data during read
● Pay for this with latency

● Depends on your situational needs

50
Consistency semantics (2)

• For “causal consistency” pick Read consistency level = Write
consistency level = QUORUM
• Why ? At least one node will be common between write and read set
so a read will get the last write of a data item
• What happens if read and write use LOCAL_QUORUM ?
• If there is no overlap between read and write sets then “Eventual consistency”

R level = W level = QUORUM: read latest written value from common node

https://cassandra.apache.org/doc/latest/architecture/dynamo.html
https://cassandra.apache.org/doc/latest/architecture/guarantees.html#
51
Partitioners

• Partitions data based on hashing to distribute data blocks from a
column among nodes*
• Random
✓ Crypto hash (MD5) - more expensive
• Murmur3
✓ Non-crypto consistent hash (MU - Multiply / R - Rotate
operations, but easier to reverse compared to a crypto hash)
✓ 3-5x faster and overall 10% performance improvement
• Byteorder
✓ Lexical order
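
How a hash partitioner places data can be sketched as a token ring: each key is hashed to a token, and the owning node is the first one at or after that token, wrapping around. This is a simplified illustration, not Cassandra's implementation; `TokenRing` is a hypothetical class, node tokens are derived here by hashing node names (an assumption, real clusters assign tokens or virtual nodes), and MD5 stands in for the partitioner's hash.

```python
import bisect
import hashlib

def token(key):
    """Hash a partition key (or node name) to an integer token using MD5."""
    return int.from_bytes(hashlib.md5(key.encode("utf-8")).digest(), "big")

class TokenRing:
    """Toy ring: nodes sorted by token; a key belongs to the next node clockwise."""

    def __init__(self, nodes):
        self.ring = sorted((token(node), node) for node in nodes)
        self.tokens = [t for t, _ in self.ring]

    def replicas(self, key, replication_factor):
        # First node whose token is >= the key's token, wrapping past the end.
        start = bisect.bisect_left(self.tokens, token(key)) % len(self.ring)
        count = min(replication_factor, len(self.ring))
        # Further replicas are the following nodes on the ring.
        return [self.ring[(start + i) % len(self.ring)][1] for i in range(count)]

ring = TokenRing(["node1", "node2", "node3", "node4"])
owners = ring.replicas("rollno:4", 3)
```

Because only the key's token decides placement, any node can compute where a row lives without consulting a central directory.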

52
Sample queries
> create keyspace demo with replication={'class':'SimpleStrategy',
'replication_factor':1};
> describe keyspaces;
> use demo;
-- table or columnfamily
> create table student_info (rollno int primary key, name text, doj
timestamp, lastexampercent double);
> describe table student_info ;
> consistency quorum
> insert into student_info (rollno,name,doj,lastexampercent) values
(4,'Roxanne', dateof(now()), 90) using ttl 30;
> select rollno from student_info where name='Roxanne' ALLOW
FILTERING;
> update student_info set lastexampercent=98 where rollno=2 IF
name='Sam';

53
Case study - eBay

• Marketplace has 100 million active buyers with 200+ million items
• 2B page views, 80B DB calls, multi-PB storage capacity
• No transactions, joins, referential integrity
• Multi-DC deployment
• 400M+ writes and 200M+ reads
• 3 Use cases
✓ Social signal on product pages (read latency is not important but write performance is
key)
✓ Connecting users and items via buy, sell, bid, watch events
✓ Many time series analysis cases, e.g. fraud detection

https://www.slideshare.net/jaykumarpatel/cassandra-at-ebay-13920376/2-eBay_Marketplaces_97_million_active
54
Case study - AdStage (from AWS use cases)

• Sector AdTech
• Online advertising platform to manage multi-channel ad campaigns on Google, FB,
Twitter, Bing, LinkedIn
• 3 clusters with 80+ nodes on AWS
• Vast amount of real-time data from 5 channels
• Constantly monitor trends and optimise campaigns for advertisers
• High performance and availability - consistency is not critical as it is read mainly
• Cassandra cluster can scale as more clients are added with no SPOF

55
Topics for today

• NoSQL Introduction
• Pros-Cons
• Classification
• Examples
• MongoDB
• Cassandra
• GraphDBs: Neo4J

56
Graphs

57
Graph computing

• Property graphs
• Data is represented as vertices and edges with
properties
• Properties are key value pairs
• Edges are relationships between vertices
• When to use a graph DB ?
• A relationship-heavy data set with large set of
data items
• Queries are like graph traversals but need to
keep query performance almost constant as
database grows
• A variety of queries may be asked from the data
and static indices on data will not work

58
Relational Vs Graph Models

59
5 Signs You Need a Graph Database

Is your traditional database not up to the task of deciphering your
complex data? Here are five of the telltale signs:

1. Hard-to-navigate interconnected data


2. Lack of real-time insights
3. Inflexible data structures
4. Inefficient querying
5. Inability to decode a web of highly connected data

60
Native vs Non-Native Graph storage
• Non-native graph computing platforms can use external DBs for data storage
• e.g. TinkerPop is an in-memory DB + computing framework that can store in
ElasticSearch, Cassandra etc.
• Native platforms support built-in storage
• e.g. Neo4j
• Native approach is much faster because adjacent nodes and edges are stored closer for
faster traversal
• In a non-native approach, extensive indexing has to be used
• Native approach scales as nodes get added

One-hop index

https://neo4j.com/blog/native-vs-non-native-graph-technology/
61
Graphs for Storage and Processing

62
What is Neo4J
➢ It is a Graph Database supporting full ACID Transactions
➢ Embeddable in applications and server deployable
➢ Java based, Open sourced
➢ Schema free, bottom-up data model design
➢ Neo4j is stable
➢ In 24/7 operation since 2003
➢ Neo4j is under active development
➢ High performance graph operations
➢ Supports the Cypher query language

➢ Traverses 1,000,000+ relationships/sec on commodity hardware
➢ No. of nodes and relationships decide volume of data

https://neo4j.com/blog/native-vs-non-native-graph-technology/
63
Neo4j / Cypher

• Cypher is a declarative language for graph queries
• Example:
MATCH (:Person {name: 'Tom Hanks'})-[:ACTED_IN]->(m:Movie)
WHERE m.released > 2000
RETURN m LIMIT 5

Launch a free sandbox with dataset on neo4j website


64
Neo4j / Cypher: More queries

• Find movies that Tom Hanks acted in that were directed by Ron Howard and
released after 2000
MATCH (:Person {name: 'Tom Hanks'})-[:ACTED_IN]->(m:Movie),
      (:Person {name: 'Ron Howard'})-[:DIRECTED]->(m)
WHERE m.released > 2000
RETURN m LIMIT 5
• Who were the other actors in the movies that Tom Hanks acted in, directed by
Ron Howard and released after 2000
MATCH (:Person {name: 'Tom Hanks'})-[:ACTED_IN]->(m:Movie),
      (:Person {name: 'Ron Howard'})-[:DIRECTED]->(m),
      (p:Person)-[:ACTED_IN]->(m)
WHERE m.released > 2000
RETURN p LIMIT 5
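
The two pattern-matching queries above can be read as set operations over relationship lists. The sketch below is plain Python over a tiny hand-made sample, not the Neo4j movie dataset or a Neo4j driver call; the names `MOVIES`, `ACTED_IN`, `DIRECTED`, `movies_with` and `co_actors` are assumptions for illustration.

```python
# Tiny hand-made sample: movie title -> release year, plus two relationship lists.
MOVIES = {"Apollo 13": 1995, "Cast Away": 2000, "The Da Vinci Code": 2006}
ACTED_IN = [
    ("Tom Hanks", "Apollo 13"),
    ("Kevin Bacon", "Apollo 13"),
    ("Tom Hanks", "Cast Away"),
    ("Tom Hanks", "The Da Vinci Code"),
    ("Audrey Tautou", "The Da Vinci Code"),
]
DIRECTED = [
    ("Ron Howard", "Apollo 13"),
    ("Robert Zemeckis", "Cast Away"),
    ("Ron Howard", "The Da Vinci Code"),
]

def movies_with(actor, director, released_after):
    """Movies the actor ACTED_IN that the director DIRECTED, released after a year."""
    acted = {movie for person, movie in ACTED_IN if person == actor}
    directed = {movie for person, movie in DIRECTED if person == director}
    return sorted(m for m in acted & directed if MOVIES[m] > released_after)

def co_actors(actor, director, released_after):
    """Other people who ACTED_IN those same movies."""
    movies = set(movies_with(actor, director, released_after))
    return sorted({person for person, movie in ACTED_IN
                   if movie in movies and person != actor})

recent = movies_with("Tom Hanks", "Ron Howard", 2000)
others = co_actors("Tom Hanks", "Ron Howard", 2000)
```

A graph database evaluates the same pattern by following stored relationships from the matched nodes rather than scanning whole lists as this sketch does.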

65
Neo4j on Cloud

• Neo4j Enterprise Edition on AWS


• Neo4j - The Easiest Way to Graph on MS Azure
• NEO4J AURA on Google Cloud

66
Apache Tinkerpop / Gremlin
• TinkerPop is a computing platform that connects to GraphDBs that actually store the
nodes and edges. Built-in TinkerGraph stores in-memory data only.
• Gremlin is the query language (with traversal machine) that supports Declarative and
Imperative flavours
• Sample queries
• movies where Tom Hanks has acted
• g.V().hasLabel('person').has('name','Tom Hanks').out('ACTED_IN').hasLabel('movies').values('name')
• movies where Tom Hanks has acted and directed by Ron Howard
• g.V().hasLabel('person').has('name','Tom Hanks').out('ACTED_IN').where(__.in('DIRECTED').has('name','Ron Howard')).values('name')

[Figure: Person -ACTED_IN-> Movie and Person -DIRECTED-> Movie; Person: Tom Hanks -ACTED_IN-> Movie, Person: Ron Howard -DIRECTED-> Movie]
67


NoSQL Certification Trainings
1. MongoDB
Free MongoDB courses - MongoDB University - https://learn.mongodb.com/

2. Cassandra
Free Cassandra courses – Datastax Academy - https://www.datastax.com/dev/academy
Multiple Learning Paths - Gain an expert understanding of Apache Cassandra™
Each Learning Path is composed of a sequence of recommended courses for your role
• Administrator Certification
✓ DS201
✓ DS210
• Developer Certification
✓ DS201
✓ DS220

3. Neo4j
Neo4j GraphAcademy - https://graphacademy.neo4j.com/
Free, Self-Paced, Hands-on Online Training | Free Neo4j Courses from GraphAcademy
68
Summary

• NoSQL databases are useful when


✓ you have to deal with large data sets
✓ may need geographical distribution
✓ No need for ACID transactions and need flexible consistency
• Choices between key-value, column based, document based, graph based data
stores
• Graph DBs and computing models are very suitable when data sets are
relationship heavy - can be modelled as large number of nodes and edges and
queries are similar to graph traversal
✓ Complex relation centric queries are possible
✓ Graph traversal costs can be kept stable with data growth

69
Next Session:
Cassandra and Graph Database in detail

70
