Outline
• Introduction
• Background
• Distributed Database Design
• Database Integration
• Semantic Data Control
• Distributed Query Processing
➡ Overview
➡ Query decomposition and localization
➡ Distributed query optimization
• Multidatabase Query Processing
• Distributed Transaction Management
• Data Replication
• Parallel Database Systems
• Distributed Object DBMS
• Peer-to-Peer Data Management
• Web Data Management
• Current Issues
Distributed DBMS © M. T. Özsu & P. Valduriez Ch.6/1
Query Processing in a DDBMS
high level user query
query
processor
Low-level data manipulation
commands for D-DBMS
Distributed DBMS © M. T. Özsu & P. Valduriez Ch.6/2
Query Processing Components
• Query language that is used
➡ SQL: “intergalactic dataspeak”
• Query execution methodology
➡ The steps that one goes through in executing high-level (declarative) user
queries.
• Query optimization
➡ How do we determine the “best” execution plan?
• We assume a homogeneous D-DBMS
Distributed DBMS © M. T. Özsu & P. Valduriez Ch.6/3
Selecting Alternatives
SELECTENAME
FROM EMP,ASG
WHERE [Link] = [Link]
AND RESP = "Manager"
Strategy 1
ENAME(RESP=“Manager”[Link]=[Link](EMP×ASG))
Strategy 2
ENAME(EMP ⋈ENO (RESP=“Manager” (ASG))
Strategy 2 avoids Cartesian product, so may be “better”
Distributed DBMS © M. T. Özsu & P. Valduriez Ch.6/4
What is the Problem?
Site 1 Site 2 Site 3 Site 4 Site 5
ASG1=σENO≤“E3”(ASG) ASG2= σENO>“E3”(ASG) EMP1= σENO≤“E3”(EMP) EMP2= σENO>“E3”(EMP) Result
Site 5 Site 5
result EMP1' EMP2' result= (EMP1 × EMP2)⋈ENOσRESP=“Manager”(ASG1× ASG2)
EMP1' EMP2'
Site 3 Site 4 ASG1 ASG2 EMP1 EMP2
EMP’1=EMP1 ⋈ENO ASG’1 EMP’2=EMP2 ⋈ENO ASG’2
Site 1 Site 2 Site 3 Site 4
ASG1' ASG'2
Site 1 Site 2
ASG1' σ RESP "Manager" ASG1 ASG'2 σ RESP "Manager" ASG2
Distributed DBMS © M. T. Özsu & P. Valduriez Ch.6/5
Cost of Alternatives
• Assume
➡ size(EMP) = 400, size(ASG) = 1000
➡ tuple access cost = 1 unit; tuple transfer cost = 10 units
• Strategy 1
➡ produce ASG': (10+10) tuple access cost
20
➡ transfer ASG' to the sites of EMP: (10+10) tuple transfer cost
200
➡ produce EMP': (10+10) tuple access cost 2
40
➡ transfer EMP' to result site: (10+10) tuple transfer cost
200
Total Cost 460
Distributed DBMS © M. T. Özsu & P. Valduriez Ch.6/6
Query Optimization Objectives
• Minimize a cost function
I/O cost + CPU cost + communication cost
These might have different weights in different distributed environments
• Wide area networks
➡ communication cost may dominate or vary much
✦ bandwidth
✦ speed
✦ high protocol overhead
• Local area networks
➡ communication cost not that dominant
➡ total cost function should be considered
• Can also maximize throughput
Distributed DBMS © M. T. Özsu & P. Valduriez Ch.6/7
Complexity of Relational Operations
Operation Complexity
Select
• Assume Project O(n)
➡ relations of cardinality n (without duplicate elimination)
➡ sequential scan Project
(with duplicate elimination) O(n log n)
Group
Join
Semi-join O(n log n)
Division
Set Operators
Cartesian Product O(n2)
Distributed DBMS © M. T. Özsu & P. Valduriez Ch.6/8
Query Optimization Issues – Types
Of Optimizers
• Exhaustive search
➡ Cost-based
➡ Optimal
➡ Combinatorial complexity in the number of relations
• Heuristics
➡ Not optimal
➡ Regroup common sub-expressions
➡ Perform selection, projection first
➡ Replace a join by a series of semijoins
➡ Reorder operations to reduce intermediate relation size
➡ Optimize individual operations
Distributed DBMS © M. T. Özsu & P. Valduriez Ch.6/9
Query Optimization Issues –
Optimization Granularity
• Single query at a time
➡ Cannot use common intermediate results
• Multiple queries at a time
➡ Efficient if many similar queries
➡ Decision space is much larger
Distributed DBMS © M. T. Özsu & P. Valduriez Ch.6/10
Query Optimization Issues –
Optimization Timing
• Static
➡ Compilation optimize prior to the execution
➡ Difficult to estimate the size of the intermediate results⇒error
propagation
➡ Can amortize over many executions
➡ R*
• Dynamic
➡ Run time optimization
➡ Exact information on the intermediate relation sizes
➡ Have to reoptimize for multiple executions
➡ Distributed INGRES
• Hybrid
➡ Compile using a static algorithm
➡ If the error in estimate sizes > threshold, reoptimize at run time
➡ Mermaid
Distributed DBMS © M. T. Özsu & P. Valduriez Ch.6/11
Query Optimization Issues –
Statistics
• Relation
➡ Cardinality
➡ Size of a tuple
➡ Fraction of tuples participating in a join with another relation
• Attribute
➡ Cardinality of domain
➡ Actual number of distinct values
• Common assumptions
➡ Independence between different attribute values
➡ Uniform distribution of attribute values within their domain
Distributed DBMS © M. T. Özsu & P. Valduriez Ch.6/12
Query Optimization Issues –
Decision Sites
• Centralized
➡ Single site determines the “best” schedule
➡ Simple
➡ Need knowledge about the entire distributed database
• Distributed
➡ Cooperation among sites to determine the schedule
➡ Need only local information
➡ Cost of cooperation
• Hybrid
➡ One site determines the global schedule
➡ Each site optimizes the local subqueries
Distributed DBMS © M. T. Özsu & P. Valduriez Ch.6/13
Query Optimization Issues –
Network Topology
• Wide area networks (WAN) – point-to-point
➡ Characteristics
✦ Low bandwidth
✦ Low speed
✦ High protocol overhead
➡ Communication cost will dominate; ignore all other cost factors
➡ Global schedule to minimize communication cost
➡ Local schedules according to centralized query optimization
• Local area networks (LAN)
➡ Communication cost not that dominant
➡ Total cost function should be considered
➡ Broadcasting can be exploited (joins)
➡ Special algorithms exist for star networks
Distributed DBMS © M. T. Özsu & P. Valduriez Ch.6/14
Distributed Query Processing
Methodology
Calculus Query on Distributed Relations
Query
Query GLOBAL
GLOBAL
Decomposition
Decomposition SCHEMA
SCHEMA
Algebraic Query on Distributed
Relations
CONTROL
Data
Data FRAGMENT
FRAGMENT
SITE Localization
Localization SCHEMA
SCHEMA
Fragment Query
Global
Global STATS
STATSON
ON
Optimization
Optimization FRAGMENTS
FRAGMENTS
Optimized Fragment Query
with Communication Operations
LOCAL Local
Local LOCAL
LOCAL
Optimization
Optimization SCHEMAS
SCHEMAS
SITES
Optimized Local Queries
Distributed DBMS © M. T. Özsu & P. Valduriez Ch.6/15