An introduction to
NoSQL databases
POOYAN MEHRPARVAR
DEC 2014
To get more references visit:
http://bit.ly/nosql_srbiau
1
What is covered in this
presentation:
 A brief history of databases
 NoSQL why, what and when?
 Aggregate Data Models
 BASE vs ACID
 CAP theorem
 Polyglot persistence : the future of database systems
2
Why did we choose this topic?
 Is NoSQL replacing traditional databases?
 Where should we use NoSQL databases?
 Should we use NoSQL in any kind of projects?
3
A brief history of databases 4
Relational databases
Benefits of Relational databases:
Designed for all purposes
ACID
Strong consistency, concurrency,
recovery
Mathematical background
Standard query language (SQL)
Lots of tools to use with, e.g. reporting
services, entity frameworks, ...
Vertical scaling (up scaling)
Object / object-relational databases
were not practical, mainly because of
impedance mismatch
5
Era of Distributed Computing
But...
 Relational databases were not built for
distributed applications.
Because...
 Joins are expensive
 Hard to scale horizontally
 Impedance mismatch occurs
 Expensive (product cost, hardware,
Maintenance)
6
Era of Distributed Computing
And...
Relational databases are also weak in:
 Speed (performance)
 High availability
 Partition tolerance
7
Rise of Big data
The three Vs of big data:
 Volume
 Velocity
 Variety
8
Rise of Big data 9
Rise of Big data
 Walmart: 1 million transactions per
hour
 Facebook: 40 billion photos
 People are talking about petabytes
today
10
NoSQL why, what and when?
 Google & Amazon built their own databases (Bigtable & Dynamo)
 Facebook created Cassandra and runs thousands of Cassandra instances
 #NoSQL was a Twitter hashtag for a conference in 2009
 The name doesn’t indicate its characteristics
 There is no strict definition of NoSQL databases
 There are more than 150 NoSQL databases (nosql-database.org)
11
Characteristics of NoSQL databases
 Non relational
 Cluster friendly
 Schema-less
 Built for the 21st-century web
 Open-source
12
Characteristics of NoSQL databases
NoSQL avoids:
 Overhead of ACID transactions
 Complexity of SQL queries
 Burden of up-front schema design
 The need for a DBA
 Transactions (they should be handled at the
application layer)
Provides:
 Easy and frequent changes to DB
 Horizontal scaling (scaling out)
 Solution to Impedance mismatch
 Fast development
13
NoSQL is getting more & more popular 14
What is a schema-less datamodel?
In relational Databases:
 You can’t add a record which does not fit
the schema
 You need to add NULLs to unused items in
a row
 You must respect the data types, e.g.
you can’t add a string to an integer field
 You can’t add multiple items in a field
(you should create another table:
primary key, foreign key, joins,
normalization, ...)
15
What is a schema-less datamodel?
In NoSQL Databases:
 There is no schema to consider
 There is no unused cell
 Data types are implicit, not declared
 Most checks are done in the
application layer
 We gather all items in an aggregate
(document)
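
The contrast between the two slides can be sketched in a few lines. This is only an illustration under assumptions: SQLite stands in for a relational database, plain Python dictionaries stand in for documents, and the `person` table and its fields are hypothetical.

```python
import sqlite3

# Relational side: the table fixes the set of columns up front.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT NOT NULL, phone TEXT)"
)
# An unused field still has to be stored, here as NULL.
conn.execute("INSERT INTO person (id, name, phone) VALUES (1, 'Ada', NULL)")


def try_insert_unknown_column(connection):
    """Return True if a record that does not fit the schema is rejected."""
    try:
        connection.execute(
            "INSERT INTO person (id, name, nickname) VALUES (2, 'Bob', 'bobby')"
        )
    except sqlite3.OperationalError:
        return True
    return False


rejected = try_insert_unknown_column(conn)

# Schema-less side: each document carries only the fields it needs,
# and a field may hold several items without a second table.
people = [
    {"_id": 1, "name": "Ada"},
    {"_id": 2, "name": "Bob", "nickname": "bobby", "phones": ["555-0100", "555-0101"]},
]
```

Note that the checks the schema used to perform (required fields, types) now have to live in application code.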
16
What is Aggregation?
 The term comes from Domain Driven Design
 Shared nothing architecture
 An aggregate is a cluster of domain objects that can be treated as
a single unit
 Aggregates are the basic unit of data transfer and storage: you
request to load or save whole aggregates
 Transactions should not cross aggregate boundaries
 This mechanism reduces the join operations to a minimal level
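
A minimal sketch of the idea, assuming a hypothetical order aggregate and a dictionary standing in for an aggregate-oriented store. The field names are invented for illustration.

```python
import json

# A hypothetical order aggregate: the order, its line items and the shipping
# address are nested together instead of being spread across joined tables.
order = {
    "order_id": 1001,
    "customer": {"id": 7, "name": "Ada"},
    "items": [
        {"sku": "book-1", "qty": 2, "price": 15.0},
        {"sku": "pen-9", "qty": 1, "price": 3.5},
    ],
    "shipping_address": {"city": "Example City"},
}

store = {}  # stand-in for an aggregate-oriented database


def save_aggregate(aggregate):
    """Persist the whole aggregate under one key, as a single unit."""
    store[aggregate["order_id"]] = json.dumps(aggregate)


def load_aggregate(order_id):
    """Load the whole aggregate back with a single lookup and no joins."""
    return json.loads(store[order_id])


save_aggregate(order)
loaded = load_aggregate(1001)
total = sum(item["qty"] * item["price"] for item in loaded["items"])
```

Everything needed to compute the order total arrives in one read, which is why the aggregate is also a natural unit for distribution across a cluster.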
17
What is Aggregation? 18
What is Aggregation? 19
What is Aggregation? 20
Aggregate Data Models
NoSQL databases are classified into four major data models:
 Key-value
 Document
 Column family
 Graph
Each DB has its own query language
21
Key-value data model
 The main idea is the use of a hash table
 Access data (values) by strings called keys
 Data has no required format: values may have any format
 Data model: (key, value) pairs
 Basic operations:
Insert(key, value), Fetch(key), Update(key, value), Delete(key)
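
The four basic operations can be sketched with a Python dictionary as the hash table. The class name and the example key are hypothetical, and a real key-value store would add persistence and distribution on top.

```python
class KeyValueStore:
    """Minimal in-memory key-value store backed by a hash table (dict)."""

    def __init__(self):
        self._data = {}

    def insert(self, key, value):
        if key in self._data:
            raise KeyError(f"key already exists: {key!r}")
        self._data[key] = value

    def fetch(self, key):
        return self._data[key]

    def update(self, key, value):
        if key not in self._data:
            raise KeyError(f"no such key: {key!r}")
        self._data[key] = value

    def delete(self, key):
        del self._data[key]


kv = KeyValueStore()
# The store never looks inside the value; it is an opaque blob.
kv.insert("session:42", b'{"user": "ada"}')
kv.update("session:42", b'{"user": "ada", "cart": [1, 2]}')
```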
22
Key-value data model
 “Value” is stored as a “blob”
- Without caring or knowing what is inside
- Application is responsible for understanding the
data
 Main observation from Amazon (using Dynamo)
– “There are many services on Amazon’s platform
that only need primary-key access to a data
store.”
E.g. Best seller lists, shopping carts, customer
preferences, session management, sales rank,
product catalog
23
Column family data model
 The column is the smallest unit of
data.
 It is a tuple that contains a name, a value
and a timestamp
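
A small sketch of the column tuple. The slide does not say what the timestamp is for; the assumption here is that it is used to keep the newest value when two versions of a row disagree (last write wins). The names and values are hypothetical.

```python
from typing import NamedTuple


class Column(NamedTuple):
    name: str
    value: str
    timestamp: int  # assumed to be a monotonically increasing write time


def merge_row(*versions):
    """Merge several versions of a row, keeping the newest column per name."""
    merged = {}
    for columns in versions:
        for col in columns:
            current = merged.get(col.name)
            if current is None or col.timestamp > current.timestamp:
                merged[col.name] = col
    return merged


replica_a = [Column("email", "ada@old.example", 100), Column("city", "London", 100)]
replica_b = [Column("email", "ada@new.example", 250)]
row = merge_row(replica_a, replica_b)
```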
24
Column family data model
Some statistics about Facebook Search (using Cassandra)
 MySQL > 50 GB Data
 Writes Average : ~300 ms
 Reads Average : ~350 ms
 Rewritten with Cassandra > 50 GB Data
 Writes Average : 0.12 ms
 Reads Average : 15 ms
25
Graph data model
 Based on Graph Theory.
 Scale vertically, no clustering.
 You can use graph algorithms easily
 Transactions
 ACID
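
As an illustration of running a graph algorithm directly on the data, here is a breadth-first search over a hypothetical friendship graph held as an adjacency list. A graph database would expose this through its own query language; the code below is only a sketch of the underlying idea.

```python
from collections import deque

# Hypothetical social graph stored as an adjacency list.
friends = {
    "ada": ["bob", "cy"],
    "bob": ["ada", "dee"],
    "cy": ["ada", "dee"],
    "dee": ["bob", "cy", "eve"],
    "eve": ["dee"],
}


def shortest_path(graph, start, goal):
    """Breadth-first search; returns the shortest path as a list, or None."""
    queue = deque([[start]])
    visited = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for neighbour in graph.get(node, []):
            if neighbour not in visited:
                visited.add(neighbour)
                queue.append(path + [neighbour])
    return None


path = shortest_path(friends, "ada", "eve")
```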
26
Document-based datamodel
 Usually a JSON-like interchange model.
 Query model: JavaScript-like or custom.
 Aggregations: Map/Reduce
 Indexes are implemented as B-trees.
 Unlike simple key-value stores, both keys
and values are fully searchable in
document databases.
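
To show what "searchable values" means in contrast to an opaque blob, the sketch below filters a list of dictionaries by field. The `find` helper and the sample posts are hypothetical and are not the API of any particular document database.

```python
def find(collection, **criteria):
    """Return the documents whose fields equal every given criterion."""
    return [
        doc
        for doc in collection
        if all(doc.get(field) == expected for field, expected in criteria.items())
    ]


posts = [
    {"_id": 1, "author": "ada", "tags": ["nosql", "cap"], "likes": 12},
    {"_id": 2, "author": "bob", "tags": ["sql"], "likes": 3},
    {"_id": 3, "author": "ada", "tags": ["mongodb"], "likes": 7},
]

by_ada = find(posts, author="ada")
popular_by_ada = [doc for doc in by_ada if doc["likes"] > 10]
```

A real document store would answer such queries from an index rather than by scanning every document.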
27
Document-based datamodel 28
Overview of a Document-based datamodel 29
Overview of a Document-based datamodel 30
Overview of a Document-based datamodel 31
Overview of a Document-based datamodel 32
A sample MongoDB query 33
MySQL:
MongoDB:
The MongoDB query needs no join
because we are using an aggregate data model
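
The queries shown on the original slide are not reproduced in this text. As a stand-in, the sketch below uses a hypothetical posts/comments schema: SQLite plays the relational side, and a plain Python dictionary plays the role of a document collection (it is not MongoDB's API).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE posts (id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE comments (
        id INTEGER PRIMARY KEY,
        post_id INTEGER REFERENCES posts(id),
        body TEXT
    );
    INSERT INTO posts VALUES (1, 'Intro to NoSQL');
    INSERT INTO comments VALUES (1, 1, 'Nice'), (2, 1, 'Thanks');
    """
)

# Relational version: the comments live in another table, so we join.
rows = conn.execute(
    "SELECT p.title, c.body FROM posts p "
    "JOIN comments c ON c.post_id = p.id "
    "WHERE p.id = ? ORDER BY c.id",
    (1,),
).fetchall()

# Aggregate version: the comments are embedded in the post document,
# so a single lookup by key returns everything.
post_documents = {
    1: {"title": "Intro to NoSQL", "comments": [{"body": "Nice"}, {"body": "Thanks"}]},
}
doc = post_documents[1]
```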
What do we need?
 We need a distributed database system with these
features:
 – Fault tolerance
 – High availability
 – Consistency
 – Scalability
34
What do we need?
 Fault tolerance, high availability, consistency and
scalability, all at once
Which is impossible,
according to the CAP theorem
35
Should we...?
 In some cases getting an answer quickly is
more important than getting a correct
answer
 By giving up ACID properties, one can
achieve higher performance and scalability.
 Any data store can achieve Atomicity,
Isolation and Durability but do you always
need consistency?
 Maybe we should use asynchronous
inserts and updates and not wait for
confirmation?
36
BASE
Almost the opposite of ACID.
 Basically available: Nodes in a distributed
environment can go down, but the whole
system shouldn’t be affected.
 Soft State (scalable): The state of the system and
data changes over time.
 Eventual Consistency: Given enough time, data
will be consistent across the distributed system.
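
Eventual consistency can be sketched with three in-memory replicas. The assumptions are loud ones: a single writer, a simple version counter, and a naive "everyone pulls from everyone" synchronisation round. Real systems use more careful mechanisms, and all names here are hypothetical.

```python
class Replica:
    def __init__(self, name):
        self.name = name
        self.version = 0
        self.value = None

    def write(self, value):
        self.version += 1
        self.value = value

    def sync_from(self, other):
        """Adopt the other replica's state if it is newer."""
        if other.version > self.version:
            self.version, self.value = other.version, other.value


def anti_entropy_round(replicas):
    """One synchronisation round: every replica pulls from every other one."""
    for target in replicas:
        for source in replicas:
            target.sync_from(source)


nodes = [Replica("a"), Replica("b"), Replica("c")]
nodes[0].write("cart=[book]")  # the write is accepted by one node only
stale_read = nodes[2].value    # another node has not seen it yet
anti_entropy_round(nodes)      # given enough time, all nodes converge
```

The window between the write and the synchronisation round is the period during which clients may read stale data.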
37
BASE vs ACID 38
CAP theorem
Consistency: All clients should
read the same data. There
are many levels of
consistency.
o Strict Consistency – RDBMS.
o Tunable Consistency –
Cassandra.
o Eventual Consistency –
MongoDB.
Availability: Every request
receives a response.
Partition tolerance: The system
keeps operating when network
failures split it into segments
that cannot communicate.
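
Tunable consistency is often explained with a quorum rule that the slide does not state: with N replicas, W write acknowledgements and R replicas read, a read is guaranteed to see the latest acknowledged write when R + W > N. The sketch below only evaluates that rule; the replica count of 3 is a hypothetical setting, and ONE, QUORUM and ALL are used as level names in the style of Cassandra.

```python
def overlaps(n, w, r):
    """True when every read quorum must intersect every write quorum."""
    return r + w > n


# Hypothetical settings for a cluster with 3 replicas per key: (N, W, R).
settings = {
    "write ONE, read ONE": (3, 1, 1),
    "write QUORUM, read QUORUM": (3, 2, 2),
    "write ALL, read ONE": (3, 3, 1),
}
results = {name: overlaps(*nwr) for name, nwr in settings.items()}
```

Lower values of W and R make requests faster and more available, at the price of possibly stale reads.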
39
CAP theorem in different SQL/NoSQL
databases
We cannot achieve all three properties at once
in distributed database systems (the center of the diagram). Proven by Seth Gilbert and Nancy Lynch (MIT) in 2002.
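
The trade-off can be illustrated with a toy two-node cluster. This is a sketch under assumptions, not the proof: replication is synchronous while the network is healthy, and during a partition the cluster either refuses writes ("CP") or accepts them on one side ("AP"). The class and mode names are hypothetical.

```python
class Node:
    def __init__(self):
        self.data = {}


class Cluster:
    """Two replicas; mode is 'CP' (refuse writes when partitioned) or 'AP'."""

    def __init__(self, mode):
        self.mode = mode
        self.nodes = [Node(), Node()]
        self.partitioned = False

    def write(self, node_index, key, value):
        if not self.partitioned:
            for node in self.nodes:  # replicate synchronously
                node.data[key] = value
            return True
        if self.mode == "CP":
            return False  # stay consistent, give up availability
        self.nodes[node_index].data[key] = value  # stay available, diverge
        return True

    def consistent(self, key):
        return self.nodes[0].data.get(key) == self.nodes[1].data.get(key)


cp, ap = Cluster("CP"), Cluster("AP")
for cluster in (cp, ap):
    cluster.write(0, "stock", 5)
    cluster.partitioned = True  # the network splits the two nodes

cp_accepted = cp.write(0, "stock", 4)
ap_accepted = ap.write(0, "stock", 4)
```

Once the partition has happened, each cluster keeps only one of the two remaining properties.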
40
CAP theorem : A simple proof 41
CAP theorem : A simple proof 42
CAP theorem : A simple proof 43
Which data model to choose 44
Polyglot persistence : the future of database
systems
 Future database systems will combine SQL & NoSQL
 We still need relational databases
45
Overview of a polyglot db 46
New approach to database systems:
 Integrated databases have their own
advantages and disadvantages
 With the advent of web services, it
seems it is now time to switch
to decentralized databases
 Single points of failure and bottlenecks
would be avoided
 Clustering & replication would be
much easier
47
Conclusion:
Before you choose NoSQL as a solution,
consider these items:
 It needs a careful evaluation; maybe NoSQL is not the right thing
 You need to read lots of case studies
 Aggregation is a totally different approach
 NoSQL is still immature
 Becoming an expert in a particular NoSQL db takes many hours of study
and work
 There is no standard query language
 Most controls have to be implemented at the application layer
 Relational databases are still the strongest in transactional environments
and provide the best solutions for consistency and concurrency control
48
Conclusion:
Before you choose NoSQL as a solution:
49
Say hello to... 50
NewSQL: a brief definition
 NewSQL group was founded in 2011
Michael Stonebraker’s Definition …
 SQL as the primary interface.
 ACID support for transactions
 Non-locking concurrency control.
 High per-node performance.
 Parallel, shared-nothing architecture: each node is
independent and self-sufficient; nodes do not share memory or storage
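
A shared-nothing layout can be sketched as hash partitioning: each key is routed to exactly one node, and each node keeps its own private storage. The modulo placement, the node names and the keys are assumptions made for illustration; production systems typically use schemes that cope better with adding and removing nodes.

```python
import hashlib


class Shard:
    """A node that owns its own storage and shares nothing with other nodes."""

    def __init__(self, name):
        self.name = name
        self.rows = {}


def shard_for(key, shards):
    """Pick a shard with a stable hash of the key (simple modulo placement)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return shards[int.from_bytes(digest[:8], "big") % len(shards)]


shards = [Shard("node-0"), Shard("node-1"), Shard("node-2")]
for user_id in ("u1", "u2", "u3", "u4", "u5", "u6"):
    shard_for(user_id, shards).rows[user_id] = {"id": user_id}
```

Because the hash is stable, a later read for the same key is routed to the same node without consulting any shared state.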
51
Technology is still in its infancy...
In 2000 no one even thought database
systems could be a hot topic again!
To get more references visit:
http://bit.ly/nosql_srbiau
52
References:
 NoSQL Distilled, Pramod J. Sadalage and Martin Fowler
 Martin Fowler’s presentation at the GOTO conference
 www.mongodb.org
53


Editor's Notes

  • #3 Who is familiar with NoSQL? Who has worked with a practical distributed database? First of all, you have to forget about the SQL view. NoSQL is a new kind of approach.
  • #4 A friend chose MongoDB as a solution for their log db, but it failed because they had difficulties with transactions (there are no transactions). Maybe it was due to their lack of knowledge about NoSQL dbs.
  • #8 Partitioning and Memcache in RDBMSs. Scaling up is expensive; we need to scale out.
  • #14 We have ACID transactions in graph databases. We have atomicity within an aggregate (a document in MongoDB).
  • #26 Facebook is using 6000+ Cassandra dbs. Have you seen the same with Oracle or DB2? (RDBMSs are suited to scaling up.) That’s how Google extends its clusters every day.
  • #27 Good for social networks & CS projects
  • #29 MongoDB has auto-sharding and a map/reduce module
  • #38 -> As the data is written, the latest version is on at least one node. The data is then versioned/replicated to other nodes within the system. -> Eventually, the same version is on all nodes.
  • #39 For a given accepted update and a given node, eventually either the update reaches the node or the node is removed from service
  • #40 Consistency (all nodes see the same data at the same time). Availability (a guarantee that every request receives a response about whether it succeeded or failed). Partition tolerance (the system continues to operate despite arbitrary message loss or failure of part of the system).
  • #43 In NoSQL we prefer availability. When there is an inconsistency in shopping, the most important thing is to shop! (Amazon’s example)
  • #44 An RDBMS will do such a thing: the whole system will be down until it reaches a consistent state.
  • #45 We choose between C & A (it’s not a binary decision). It depends on our domain to decide about the inconsistency window (we should talk to the domain experts).
  • #48 Distribution methods: replication (master-slave, peer-to-peer) and sharding. Cassandra uses sharding and peer-to-peer. Master-slave: single point of failure, good for consistency. Peer-to-peer: consistency is expensive. Replica sets are used for data redundancy, automated failover, read scaling, and server maintenance without downtime.
  • #49 An idea for the thesis: test the ability of NoSQL (e.g. MongoDB) to scale out with virtual machines (various Lubuntu machines).
  • #50 Dental clinic example: XML as a solution, NoSQL as a solution. Facebook query example: show me a female < 30 who is interested in y music living in z. Most projects have some custom tables.