Distributed Databases - Concepts & Architectures
Daniel Marcous
What?
Introduction
A distributed database is a database
in which storage devices are not all
attached to a common processing
unit (such as a single CPU), and which
is controlled by a distributed database
management system.
Definitions
● RDBMS - Relational Database Management System
● DDB - Distributed Database
● Node - a unit in a distributed system (mainly a single server)
● DDBMS - Distributed Database Management System
○ In charge of managing the different DDB nodes as one integrated system
● Centralized System - data is stored in one place
● Homogeneous system - built of parts (nodes) that all act the same way / consist of
the same hardware (opposite of Heterogeneous).
Understanding the vocabulary
Basic Concepts
Distributed Database Concepts
● Number of processing elements (database nodes)
● Connection between nodes over a computer network
● Logical interrelation between different database nodes
● Absence of node homogeneity
Types of Distributed Databases
Multiprocessing Systems
● Parallel Systems
○ Shared Memory (tightly coupled) - multiple processors share the same main memory
○ Shared Disk (loosely coupled) - multiple processors share the same secondary disk storage
● Truly Distributed Systems
○ Shared Nothing - each processor with its own memory and disk,
interrelations are only through network (no SPOF)
● Distribution - Data and software distributed over multiple nodes
● Autonomy - Provision DBMS as one whole VS multiple standalone DBMSs
● Heterogeneity - use of different software / hardware on different nodes
Classification of Distributed Systems
Why?
The power of distribution
Reasons for choosing a distributed
database over a “plain” centralized
database.
Advantages
● More computing power
○ CPU
○ Memory
○ Storage
○ Network bandwidth
● Parallelism
○ Inter-query
○ Intra-query
Performance
Ease of use / development
● Transparency
● Geographically distributed sites
● Backups
● Elasticity
○ Growing
○ Shrinking
Challenges
● Transparency - One software (Ring) to rule them all
○ Management - one command
○ Data - one query
● Autonomy - Degree of Independence
○ Different settings / configurations / Cache size
○ “Master” node / Master Election
● Keeping track of data distribution
○ which server has the table / partition I need?
Management Challenges
● Reliability - Probability of failures
○ Does one server failure affect the whole system? (“Freeze”)
● Availability - Percent of time when a data source is available
○ If a node goes down, does its data get lost? Is it unavailable until the node is back up?
● Recovery
○ Recovering to a single, consistent point in time across nodes
○ Node clock synchronisation (NTP)
● Transaction Management - a server must assure that the data is “safe” and no transaction is left partially applied
Complex Features Implementation
Scaling
● Synchronisation Overhead
CAP Theorem
● Eric Brewer (Berkeley->Yahoo->Google)
○ C (Consistency) - a read sees all previously completed writes
○ A (Availability) - reads and writes always succeed
○ P (Partition tolerance) - reads and writes keep working while the network is partitioned
● Choose 2! (2000)
● Sorry, actually it’s only C or A once a partition happens… (2012)
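The C-vs-A trade-off shows up concretely in replica quorums. A minimal sketch (the function and values below are illustrative, not from any specific database): with N replicas, a write quorum of W acks and a read quorum of R replicas, the condition R + W > N guarantees every read quorum overlaps every write quorum, so reads see the latest write.

```python
# Hypothetical sketch: quorum overlap as a consistency condition.
# N = replication factor, W = write quorum size, R = read quorum size.
def quorums_overlap(n: int, w: int, r: int) -> bool:
    """True when any read quorum intersects any write quorum (R + W > N)."""
    return r + w > n

# N=3 with W=2, R=2: quorums overlap, so a read sees the latest write (C).
assert quorums_overlap(3, 2, 2)
# N=3 with W=1, R=1: quorums can miss each other - faster and more
# available (A), but stale reads become possible.
assert not quorums_overlap(3, 1, 1)
```

Lowering W and R buys availability and latency at the cost of consistency, which is the CAP trade in miniature.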
How?
Internals
How does a distributed database
work?
● Advanced Concepts
● Architectures
Advanced Concepts
Replication
● Assumptions
○ Nodes will fail
○ Commodity Hardware - prone to failure
● Settings
○ Replication Factor
○ Data / Actions / Apply logs
○ Synchronous / Asynchronous
○ Delay
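The synchronous/asynchronous setting above can be sketched as follows. This is a toy model, assuming an invented `Replica` class and in-memory “nodes”; a real DDBMS does this over the network with failure handling and apply logs.

```python
# Hypothetical sketch of a replicated write (Replica and node names
# are invented for illustration).
class Replica:
    def __init__(self, name):
        self.name = name
        self.data = {}

    def apply(self, key, value):
        self.data[key] = value

def write(replicas, key, value, replication_factor=3, synchronous=True):
    """Write to the first `replication_factor` replicas.

    Synchronous: the write succeeds only after every target replica acks.
    Asynchronous: return after the first ack; the rest would catch up
    later from a log (this is the replication delay/lag).
    """
    targets = replicas[:replication_factor]
    acked = 0
    for replica in targets:
        replica.apply(key, value)  # in reality a network call that can fail
        acked += 1
        if not synchronous and acked == 1:
            break
    return acked

nodes = [Replica(f"node{i}") for i in range(3)]
assert write(nodes, "user:1", "alice", synchronous=True) == 3   # all acked
assert write(nodes, "user:2", "bob", synchronous=False) == 1    # first ack only
```

Synchronous replication trades write latency for durability; asynchronous replication is faster but a node can fail before catching up.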
Fragmentation
● Dividing a single data object (table / file) into multiple parts
● Types
○ Horizontal - row wise
○ Vertical - column wise (Vertica / Parquet)
○ Hybrid - both
● Advantages
○ Reports on part of the data - horizontal
○ Increased parallelism - multiple physical files
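A minimal sketch of the two fragmentation types on a toy “rides” table (the table and column names are invented for illustration):

```python
# Hypothetical "rides" table as a list of rows.
rides = [
    {"ride_id": 1, "city": "TLV", "fare": 42},
    {"ride_id": 2, "city": "NYC", "fare": 17},
    {"ride_id": 3, "city": "TLV", "fare": 8},
]

# Horizontal fragmentation: split by row (here, by city).
# A report on TLV rides only touches this fragment.
tlv_fragment = [r for r in rides if r["city"] == "TLV"]

# Vertical fragmentation: split by column; each fragment
# keeps the key so the full row can be reassembled by a join.
fares_fragment = [{"ride_id": r["ride_id"], "fare": r["fare"]} for r in rides]

assert len(tlv_fragment) == 2
assert all(set(f) == {"ride_id", "fare"} for f in fares_fragment)
```

Hybrid fragmentation applies both: fragment the rows, then store each fragment column-wise.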
Distributed Processing
● Access by key Only!
○ Using Hash Tables
■ keys are hashed and spread (=sharded) across nodes
■ result of hash tells you which node to access
■ Hash maps exist on every node / client
● Batch Processing
○ MapReduce
■ Map - partition by key
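The key-hashing idea above can be sketched in a few lines (node names are invented; real systems usually use consistent hashing so that adding a node does not remap every key):

```python
import hashlib

# Hypothetical cluster: three named nodes.
NODES = ["node0", "node1", "node2"]

def node_for(key: str) -> str:
    """Hash the key and map it to one node (modulo sharding)."""
    digest = hashlib.md5(key.encode()).hexdigest()
    return NODES[int(digest, 16) % len(NODES)]

# Every node/client computes the same mapping, so there is no
# central lookup service to consult (and no SPOF for routing).
assert node_for("user:42") == node_for("user:42")
assert all(node_for(k) in NODES for k in ("a", "b", "c"))
```

The same partition-by-key step is what a MapReduce Map phase does: the hash of the emitted key decides which node’s Reduce receives it.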
Data Locality
● Local storage (VS centralised storage controller)
○ Bring the processing to the data
○ Free bandwidth
● Smart Load Balancing
○ Route users to the “closest” node holding the data (possible thanks to replication)
● Data sorted by Key / Hash Key
○ Same / Close enough key = Same node
○ “Process” all the rides in the TLV area
ACID
● Atomicity
○ Transactions
● Consistency
○ Locked until done
● Isolation
○ No interference
● Durability
○ Completed = Persistent
BASE
● Basic Availability
○ Response to every request
● Soft State
○ States change, results are not deterministic
● Eventual Consistency
○ A consistent state may take time, but is promised
○ (CAS - Compare & Swap operations exist)
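The CAS operation mentioned above can be sketched as a single-node toy (the `CASRegister` class is invented for illustration; a distributed store implements the same contract across replicas):

```python
import threading

# Hypothetical compare-and-swap register: update the value only if it
# still equals what the caller last saw, otherwise fail and let the
# caller re-read and retry.
class CASRegister:
    def __init__(self, value=None):
        self._value = value
        self._lock = threading.Lock()  # stands in for node-local atomicity

    def compare_and_swap(self, expected, new):
        with self._lock:
            if self._value == expected:
                self._value = new
                return True
            return False  # someone else changed it first

reg = CASRegister(0)
assert reg.compare_and_swap(0, 1)       # succeeds: value was 0
assert not reg.compare_and_swap(0, 2)   # fails: value is now 1
```

This gives a client a way to get atomic read-modify-write behaviour on a single key even in a BASE system, without full ACID transactions.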
Architectures
Plain Old Centralized Database
● Oracle
● SQLServer (MS)
● DB2 (IBM)
● MySQL
● PostgreSQL
Relational (ACID) “Distributed” Database
● Oracle RAC (Real Application Clusters)
● DB2 Data Sharing
● PostgresXL
Federated Database System
● IBM IIDR
Data Warehouse
● Oracle Exadata
● Teradata
● SQL Data Warehouse (MS)
● Vertica (HP)
● Greenplum (EMC)
Interactive Massively Parallel Processing (MPP)
● Dremel (Big Query, Google)
● Redshift (Amazon)
● Presto (Facebook)
● Impala (Cloudera)
NoSQL (BASE) Shared Nothing Database
● MongoDB
● CouchBase
● Cassandra
● HBase
When?/Where?
History and Present
Where did the ideas come from, and
what is available for use nowadays?
The Founding Fathers
Articles
● Old School
○ Fundamentals of Database Systems (1989)
○ Principles of Distributed Database Systems (1991)
● Distributed File System
○ The Google File System (2003)
● Distributed Processing
○ MapReduce: simplified data processing on large clusters (2004)
● Interactive Querying on large scale
Adopters
● Document DB (mostly JSON)
○ MongoDB
○ CouchBase
● Key-Value DB
○ Cassandra
○ HBase
● Graph DB
○ Neo4J
NoSQL – Database Types
Known Users
Big Guys
● Google - Inside tools
○ MapReduce
○ Dremel -> Big Query
○ Flume -> DataFlow
● Facebook - Inside tools open-sourced and modified
○ Cassandra -> HBase
○ Presto
● Yahoo - Hadoop / HBase
● IDF
● Waze
● Viber - Couchbase
● Liveperson - MongoDB, CouchBase
● SimilarWeb - HBase
Israel
Distribution is
awesome, but
requires complex
skills to do right.
Don’t overkill it.
