0% found this document useful (0 votes)
22 views53 pages

Advance Database Chap One

Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
22 views53 pages

Advance Database Chap One

Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd

NIE‐PDB: Advanced Database Systems

[Link]

Lecture 1

Introduction
Martin Svoboda
[Link]@[Link]

20. 9. 2022

Charles University, Faculty of Mathematics and Physics


Czech Technical University in Prague, Faculty of Information Technology
Lecture Outline
Big Data
• Characteristics
• Current trends
NoSQL databases
• Motivation
• Features
Overview of NoSQL database types
• Key‐value, wide column, document, graph, …

NIE‐PDB: Advanced Database Systems | Lecture 1: Introduction | 20. 9. 2022 2


What is Big Data?
Buzzword? Bubble? Gold rush? Revolution?

Dan Ariely:
Big Data is like teenage sex: everyone talks about it, nobody
really knows how to do it, everyone thinks everyone else is
doing it, so everyone claims they are doing it.

NIE‐PDB: Advanced Database Systems | Lecture 1: Introduction | 20. 9. 2022 3


What is Big Data?
No standard definition
• Gartner (research and advisory company):
High Performance Computing

Big Data is high volume, high velocity, and/or high variety


information assets that require new forms of processing to
enable enhanced decision making, insight discovery and pro‐
cess optimization.

NIE‐PDB: Advanced Database Systems | Lecture 1: Introduction | 20. 9. 2022 4


Where is Big Data?
Sources of Big Data
• Social media and networks
…all of us are generating data
• Scientific instruments
…collecting all sorts of data
• Mobile devices
…tracking all objects all the time
• Sensor technology and networks
…measuring all kinds of data

NIE‐PDB: Advanced Database Systems | Lecture 1: Introduction | 20. 9. 2022 5


Big Data Characteristics
Volume (Scale)

Source: [Link]

NIE‐PDB: Advanced Database Systems | Lecture 1: Introduction | 20. 9. 2022 6


Big Data Characteristics
Variety (Complexity)

Source: [Link]

NIE‐PDB: Advanced Database Systems | Lecture 1: Introduction | 20. 9. 2022 7


Big Data Characteristics
Velocity (Speed)

Source: [Link]

NIE‐PDB: Advanced Database Systems | Lecture 1: Introduction | 20. 9. 2022 8


Big Data Characteristics
Veracity (Uncertainty)

Source: [Link]

NIE‐PDB: Advanced Database Systems | Lecture 1: Introduction | 20. 9. 2022 9


Big Data Characteristics
Basic 4V

• Volume (Scale)
Data volume is increasing exponentially, not linearly
Even large amounts of small data can result into Big Data
• Variety (Complexity)
Various formats, types, and structures
(from semi‐structured XML to unstructured multimedia)
• Velocity (Speed)
Data is being generated fast and needs to be processed fast
• Veracity (Uncertainty)
Uncertainty due to inconsistency, incompleteness, latency,
ambiguities, or approximations

NIE‐PDB: Advanced Database Systems | Lecture 1: Introduction | 20. 9. 2022 10


Big Data Characteristics
Additional V and C

• Value
Business value of the data (needs to be revealed)
• Validity
Data correctness and accuracy with respect to the intended use
• Volatility
Period of time the data is valid and should be maintained

• Cardinality
• Continuity
• Complexity

NIE‐PDB: Advanced Database Systems | Lecture 1: Introduction | 20. 9. 2022 11


Big Data Characteristics
Additional V

Source: [Link]

NIE‐PDB: Advanced Database Systems | Lecture 1: Introduction | 20. 9. 2022 12


Relational Databases
Data model
Instance → database → table → row
Query languages
• Real‐world: SQL (Structured Query Language)
• Formal: Relational algebra, relational calculi (domain, tuple)
Query patterns
• Selection based on complex conditions, projection, joins,
aggregation, derivation of new values, recursive queries, …
Representatives
• Oracle Database, Microsoft SQL Server, IBM DB2
• MySQL, PostgreSQL

NIE‐PDB: Advanced Database Systems | Lecture 1: Introduction | 20. 9. 2022 13


Relational Databases
Representatives

NIE‐PDB: Advanced Database Systems | Lecture 1: Introduction | 20. 9. 2022 14


Relational Databases
Features: Normal Forms

Model
• Functional dependencies
• 1NF, 2NF, 3NF, BCNF (Boyce‐Codd normal form)
Objective
• Normalization of database schema to BCNF or 3NF
• Algorithms: decomposition or synthesis
Motivation
• Diminish data redundancy, prevent update anomalies
• However:
Data is scattered into small pieces (high granularity), and so
these pieces have to be joined back together when querying!

NIE‐PDB: Advanced Database Systems | Lecture 1: Introduction | 20. 9. 2022 15


Relational Databases
Features: Transactions

Model
• Transaction = flat sequence of database operations
(READ, WRITE, COMMIT, ABORT)
Objectives
• Enforcement of ACID properties
• Efficient parallel / concurrent execution (slow hard drives, …)
ACID properties
• Atomicity – partial execution is not allowed (all or nothing)
• Consistency – transactions turn one valid database state into another
• Isolation – uncommitted effects are concealed among transactions
• Durability – effects of committed transactions are permanent

NIE‐PDB: Advanced Database Systems | Lecture 1: Introduction | 20. 9. 2022 16


Current Trends
Big Data
• Volume: terabytes → zettabytes
• Variety: structured → structured and unstructured data
• Velocity: batch processing → streaming data
• …
Big users
• Population online, hours spent online, devices online, …
• Rapidly growing companies / web applications
Even millions of users within a few months

NIE‐PDB: Advanced Database Systems | Lecture 1: Introduction | 20. 9. 2022 17


Current Trends
Everything is in cloud
• SaaS: Software as a Service
• PaaS: Platform as a Service
• IaaS: Infrastructure as a Service
Processing paradigms
• OLTP: Online Transaction Processing
• OLAP: Online Analytical Processing
• …but also…
• RTAP: Real‐Time Analytical Processing

NIE‐PDB: Advanced Database Systems | Lecture 1: Introduction | 20. 9. 2022 18


Current Trends
Data assumptions
• Data format is becoming unknown or inconsistent
• Linear growth → unpredictable exponential growth
• Read requests often prevail write requests
• Data updates are no longer frequent
• Data is expected to be replaced
• Strong consistency is no longer mission‐critical

NIE‐PDB: Advanced Database Systems | Lecture 1: Introduction | 20. 9. 2022 19


Current Trends
⇒ New approach is required
• Relational databases simply do not follow the current trends
Key technologies
• Distributed file systems
• MapReduce and other programming models
• Grid computing, cloud computing
• NoSQL databases
• Data warehouses
• Large scale machine learning

NIE‐PDB: Advanced Database Systems | Lecture 1: Introduction | 20. 9. 2022 20


NoSQL Databases
What does NoSQL actually mean?
A bit of history …
• 1998
First used for a relational database that omitted usage of SQL
• 2009
First used during a conference to advocate non‐relational
databases
So?
• Not: no to SQL
• Not: not only SQL
• NoSQL is an accidental term with no precise definition

NIE‐PDB: Advanced Database Systems | Lecture 1: Introduction | 20. 9. 2022 21


NoSQL Databases
What does NoSQL actually mean?

NoSQL movement = The whole point of seeking alternatives


is that you need to solve a problem that relational databases
are a bad fit for

NoSQL databases = Next generation databases mostly ad‐


dressing some of the points: being non‐relational, dis‐
tributed, open‐source and horizontally scalable. The original
intention has been modern web‐scale databases. Often more
characteristics apply as: schema‐free, easy replication sup‐
port, simple API, eventually consistent, a huge data amount,
and more.
Source: [Link]

NIE‐PDB: Advanced Database Systems | Lecture 1: Introduction | 20. 9. 2022 22


Types of NoSQL Databases
Core types
• Key‐value stores
• Wide column (column family, column oriented, …) stores
• Document stores
• Graph databases
Non‐core types
• Object databases
• Native XML databases
• RDF stores
• …

NIE‐PDB: Advanced Database Systems | Lecture 1: Introduction | 20. 9. 2022 23


Key‐Value Stores
Data model
• The most simple NoSQL database type
Works as a simple hash table (mapping)
• Key‐value pairs
Key (id, identifier, primary key)
Value: binary object, black box for the database system
Query patterns
• Create, update or remove value for a given key
• Get value for a given key
Characteristics
• Simple model ⇒ great performance, easily scaled, …
• Simple model ⇒ not for complex queries nor complex data

NIE‐PDB: Advanced Database Systems | Lecture 1: Introduction | 20. 9. 2022 24


Key‐Value Stores
Suitable use cases
• Session data, user profiles, user preferences, shopping carts, …
I.e. when values are only accessed via keys
When not to use
• Relationships among entities
• Queries requiring access to the content of the value part
• Set operations involving multiple key‐value pairs
Representatives
• Redis, MemcachedDB, Riak KV, Hazelcast, Ehcache, Amazon
SimpleDB, Berkeley DB, Oracle NoSQL, Infinispan, LevelDB,
Ignite, Project Voldemort
• Multi‐model: OrientDB, ArangoDB

NIE‐PDB: Advanced Database Systems | Lecture 1: Introduction | 20. 9. 2022 25


Key‐Value Stores
Representatives

NIE‐PDB: Advanced Database Systems | Lecture 1: Introduction | 20. 9. 2022 26


Document Stores
Data model
• Documents
Self‐describing
Hierarchical tree structures (JSON, XML, …)
– Scalar values, maps, lists, sets, nested documents, …
Identified by a unique identifier (key, …)
• Documents are organized into collections
Query patterns
• Create, update or remove a document
• Retrieve documents according to complex query conditions
Observation
• Extended key‐value stores where the value part is examinable!

NIE‐PDB: Advanced Database Systems | Lecture 1: Introduction | 20. 9. 2022 27


Document Stores
Suitable use cases
• Event logging, content management systems, blogs, web
analytics, e‐commerce applications, …
I.e. for structured documents with similar schema
When not to use
• Set operations involving multiple documents
• Design of document structure is constantly changing
I.e. when the required level of granularity would outbalance
the advantages of aggregates

NIE‐PDB: Advanced Database Systems | Lecture 1: Introduction | 20. 9. 2022 28


Document Stores
Representatives
• MongoDB, Couchbase, Amazon DynamoDB, CouchDB,
RethinkDB, RavenDB, Terrastore
• Multi‐model: MarkLogic, OrientDB, OpenLink Virtuoso,
ArangoDB

NIE‐PDB: Advanced Database Systems | Lecture 1: Introduction | 20. 9. 2022 29


Document Stores
Representatives

NIE‐PDB: Advanced Database Systems | Lecture 1: Introduction | 20. 9. 2022 30


Wide Column Stores
Data model
• Column family (table)
Table is a collection of similar rows (not necessarily identical)
• Row
Row is a collection of columns
– Should encompass a group of data that is accessed together
Associated with a unique row key
• Column
Column consists of a column name and column value
(and possibly other metadata records)
Scalar values, but also flat sets, lists or maps may be allowed

NIE‐PDB: Advanced Database Systems | Lecture 1: Introduction | 20. 9. 2022 31


Wide Column Stores
Query patterns
• Create, update or remove a row within a given column family
• Select rows according to a row key or simple conditions
Warning
• Wide column stores are not just a special kind of RDBMSs
with a variable set of columns!

NIE‐PDB: Advanced Database Systems | Lecture 1: Introduction | 20. 9. 2022 32


Wide Column Stores
Suitable use cases
• Event logging, content management systems, blogs, …
I.e. for structured flat data with similar schema
When not to use
• ACID transactions are required
• Complex queries: aggregation (SUM, AVG, …), joining, …
• Early prototypes: i.e. when database design may change
Representatives
• Apache Cassandra, Apache HBase, Apache Accumulo,
Hypertable, Google Bigtable

NIE‐PDB: Advanced Database Systems | Lecture 1: Introduction | 20. 9. 2022 33


Wide Column Stores
Representatives

NIE‐PDB: Advanced Database Systems | Lecture 1: Introduction | 20. 9. 2022 34


Graph Databases
Data model
• Property graphs
Directed / undirected graphs, i.e. collections of …
– nodes (vertices) for real‐world entities, and
– relationships (edges) between these nodes
Both the nodes and relationships can be associated
with additional properties
Types of databases
• Non‐transactional = small number of very large graphs
• Transactional = large number of small graphs

NIE‐PDB: Advanced Database Systems | Lecture 1: Introduction | 20. 9. 2022 35


Graph Databases
Query patterns
• Create, update or remove a node / relationship in a graph
• Graph algorithms (shortest paths, spanning trees, …)
• General graph traversals
• Sub‐graph queries or super‐graph queries
• Similarity based queries (approximate matching)
Representatives
• Neo4j, Titan, Apache Giraph, InfiniteGraph, FlockDB
• Multi‐model: OrientDB, OpenLink Virtuoso, ArangoDB

NIE‐PDB: Advanced Database Systems | Lecture 1: Introduction | 20. 9. 2022 36


Graph Databases
Suitable use cases
• Social networks, routing, dispatch, and location‐based
services, recommendation engines, chemical compounds,
biological pathways, linguistic trees, …
I.e. simply for graph structures
When not to use
• Extensive batch operations are required
Multiple nodes / relationships are to be affected
• Only too large graphs to be stored
Graph distribution is difficult or impossible at all

NIE‐PDB: Advanced Database Systems | Lecture 1: Introduction | 20. 9. 2022 37


Graph Databases
Representatives

NIE‐PDB: Advanced Database Systems | Lecture 1: Introduction | 20. 9. 2022 38


Native XML Databases
Data model
• XML documents
Tree structure with nested elements, attributes, and text values
(beside other less important constructs)
Documents are organized into collections
Query languages
• XPath: XML Path Language (navigation)
• XQuery: XML Query Language (querying)
• XSLT: XSL Transformations (transformation)
Representatives
• Sedna, Tamino, BaseX, eXist‐db
• Multi‐model: MarkLogic, OpenLink Virtuoso

NIE‐PDB: Advanced Database Systems | Lecture 1: Introduction | 20. 9. 2022 39


Native XML Databases
Representatives

NIE‐PDB: Advanced Database Systems | Lecture 1: Introduction | 20. 9. 2022 40


RDF Stores
Data model
• RDF triples
Components: subject, predicate, and object
Each triple represents a statement about a real‐world entity
• Triples can be viewed as graphs
Vertices for subjects and objects
Edges directly correspond to individual statements
Query language
• SPARQL: SPARQL Protocol and RDF Query Language
Representatives
• Apache Jena, rdf4j (Sesame), Algebraix
• Multi‐model: MarkLogic, OpenLink Virtuoso

NIE‐PDB: Advanced Database Systems | Lecture 1: Introduction | 20. 9. 2022 41


RDF Stores
Representatives

NIE‐PDB: Advanced Database Systems | Lecture 1: Introduction | 20. 9. 2022 42


Features of NoSQL Databases
Data model
• Traditional approach: relational model
• (New) possibilities:
Key‐value, document, wide column, graph
Object, XML, RDF, …
• Goal
Respect the real‐world nature of data
(i.e. data structure and mutual relationships)

NIE‐PDB: Advanced Database Systems | Lecture 1: Introduction | 20. 9. 2022 43


Features of NoSQL Databases
Aggregate structure
• Aggregate definition
Data unit with a complex structure
Collection of related data pieces we wish to treat as a unit
(with respect to data manipulation and data consistency)
• Examples
Value part of key‐value pairs in key‐value stores
Document in document stores
Row of a column family in wide column stores

NIE‐PDB: Advanced Database Systems | Lecture 1: Introduction | 20. 9. 2022 44


Features of NoSQL Databases
Aggregate structure
• Types of systems
Aggregate‐ignorant: relational, graph
– It is not a bad thing, it is a feature
Aggregate‐oriented: key‐value, document, wide column
• Design notes
No universal strategy how to draw aggregate boundaries
Atomicity of database operations:
just a single aggregate at a time

NIE‐PDB: Advanced Database Systems | Lecture 1: Introduction | 20. 9. 2022 45


Features of NoSQL Databases
Elastic scaling
• Traditional approach: scaling‐up
Buying bigger servers as database load increases
• New approach: scaling‐out
Distributing database data across multiple hosts
– Graph databases (unfortunately): difficult or impossible at all
Data distribution
• Sharding
Particular ways how database data is split into separate groups
• Replication
Maintaining several data copies (performance, recovery)

NIE‐PDB: Advanced Database Systems | Lecture 1: Introduction | 20. 9. 2022 46


Features of NoSQL Databases
Automated processes
• Traditional approach
Expensive and highly trained database administrators
• New approach: automatic recovery, distribution, tuning, …
Relaxed consistency
• Traditional approach
Strong consistency (ACID properties and transactions)
• New approach
Eventual consistency only (BASE properties)
I.e. we have to make trade‐offs because of the data distribution

NIE‐PDB: Advanced Database Systems | Lecture 1: Introduction | 20. 9. 2022 47


Features of NoSQL Databases
Schemalessness
• Relational databases
Database schema present and strictly enforced
• NoSQL databases
Relaxed schema or completely missing
Consequences: higher flexibility
– Dealing with non‐uniform data
– Structural changes cause no overhead
However: there is (usually) an implicit schema
– We must know the data structure at the application level
anyway

NIE‐PDB: Advanced Database Systems | Lecture 1: Introduction | 20. 9. 2022 48


Features of NoSQL Databases
Open source
• Often community and enterprise versions (with extended
features or extent of support)
Simple APIs
• Often state‐less application interfaces (HTTP)

NIE‐PDB: Advanced Database Systems | Lecture 1: Introduction | 20. 9. 2022 49


Features of NoSQL Databases
Current State: Five advantages

• Scaling
Horizontal distribution of data among hosts
• Volume
High volumes of data that cannot be handled by RDBMS
• Administrators
No longer needed because of the automated maintenance
• Economics
Usage of cheap commodity servers, lower overall costs
• Flexibility
Relaxed or missing data schema, easier design changes

NIE‐PDB: Advanced Database Systems | Lecture 1: Introduction | 20. 9. 2022 50


Features of NoSQL Databases
Current State: Five challenges

• Maturity
Often still in pre‐production phase with key features missing
• Support
Mostly open source, limited sources of credibility
• Administration
Sometimes relatively difficult to install and maintain
• Analytics
Missing support for business intelligence and ad‐hoc querying
• Expertise
Still low number of NoSQL experts available in the market

NIE‐PDB: Advanced Database Systems | Lecture 1: Introduction | 20. 9. 2022 51


Conclusion
The end of relational databases?
• Certainly no
They are still suitable for most projects
Familiarity, stability, feature set, available support, …
• However, we should also consider different database models
and systems
Polyglot persistence = usage of different data stores
in different circumstances

NIE‐PDB: Advanced Database Systems | Lecture 1: Introduction | 20. 9. 2022 52


Lecture Conclusion
Big Data
• 4V characteristics: volume, variety, velocity, veracity
NoSQL databases
• (New) logical models
Core: key‐value, wide column, document, graph
Non‐core: XML, RDF, …
• (New) principles and features
Horizontal scaling, data sharding and replication, eventual
consistency, …

NIE‐PDB: Advanced Database Systems | Lecture 1: Introduction | 20. 9. 2022 53

You might also like