04 Advanced Database System Chap 02 [RVUNC]

Chapter Two discusses query processing and optimization, detailing the steps of parsing, optimization, and evaluation of SQL queries. It explains the internal representations of queries, such as query trees and graphs, and the importance of indexes in improving query execution time. Additionally, the chapter covers algorithms for sorting and implementing SELECT and JOIN operations, emphasizing the role of heuristics and cost estimates in query optimization.

Uploaded by

tagesseabate887

CHAPTER TWO

Query Processing and Optimization


Chapter Outline
• Introduction to Query Processing
• Translating SQL Queries into Relational Algebra
• Basic algorithms
– Sorting: internal sorting and external sorting
– Implementing the SELECT operation
– Implementing the JOIN operation
– Implementing the PROJECT operation
• Using Heuristics in Query Optimization
• Using Selectivity and Cost Estimates in Query Optimization
• Semantic Query Optimization

4/11/2022 1
Introduction to Query Processing

• A query expressed in a high-level query language such as SQL must first be scanned, parsed, and validated.
• The scanner identifies the language tokens—such as
SQL keywords, attribute names, and relation names—in
the text of the query.
• The parser checks the query syntax to determine whether it
is formulated according to the syntax rules (rules of grammar)
of the query language.
• The query must also be validated, by checking that all
attribute and relation names are valid and semantically
meaningful names in the schema of the particular database
being queried.

• An internal representation of the query is then created, usually as a tree data structure called a query tree.
• It is also possible to represent the query using a graph data structure called a query graph.
• The DBMS must then devise an execution strategy for retrieving the result of the query from the database files.
• A query typically has many possible execution strategies, and the process of choosing a suitable one for processing a query is known as query optimization.

• Query optimization: the process of choosing a suitable execution strategy for processing a query.
• Two internal representations of a query:
– Query tree
– Query graph
• The query optimizer module has the task of producing an execution plan,
and the code generator generates the code to execute that plan.
• The runtime database processor has the task of running the query code,
whether in compiled or interpreted mode, to produce the query result.
• If a runtime error results, an error message is generated by the runtime
database processor.

• A query passes through three phases during its processing by the DBMS:
– Parsing and translation
– Optimization
– Evaluation
• Most queries submitted to a DBMS are in a high-level language such as SQL.
• During the parsing and translation stage, the human-readable form of the query is translated into forms usable by the DBMS.
• These forms can be a relational algebra expression, a query tree, or a query graph.

Parsing and Translating the Query
• The first step in processing a query submitted to a DBMS is to convert the
query into a form usable by the query processing engine.
• High-level query languages such as SQL represent a query as a string, or
sequence, of characters.
• Certain sequences of characters represent various types of tokens such as
keywords, operators, operands, literal strings, etc. Like all languages, there are
rules (syntax and grammar) that govern how the tokens can be combined into
understandable (i.e. valid) statements.
• The primary job of the parser is to extract the tokens from the raw string
of characters and translate them into the corresponding internal data
elements (i.e. relational algebra operations and operands) and structures (i.e.
query tree, query graph).
• The last job of the parser is to verify the validity and syntax of the original
query string.
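
The scanning and validation steps described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the token classes, the `KEYWORDS` set, and the helper names `scan` and `validate` are hypothetical and cover a tiny SQL subset, not the grammar of any real DBMS.

```python
import re

# Hypothetical token classes for a tiny SQL subset (an assumption for illustration).
KEYWORDS = {"SELECT", "FROM", "WHERE", "AND", "OR"}
TOKEN_RE = re.compile(r"""
    \s*(?:
        (?P<number>\d+)
      | (?P<string>'[^']*')
      | (?P<name>[A-Za-z_][A-Za-z_0-9]*)
      | (?P<op><=|>=|<>|[=<>*,();.])
    )""", re.VERBOSE)


def scan(query):
    """Split a query string into (kind, text) tokens; raise on unknown input."""
    tokens, pos = [], 0
    query = query.rstrip()
    while pos < len(query):
        m = TOKEN_RE.match(query, pos)
        if m is None:
            raise ValueError(f"unexpected character at position {pos}")
        kind, text = m.lastgroup, m.group(m.lastgroup)
        if kind == "name" and text.upper() in KEYWORDS:
            kind = "keyword"
        tokens.append((kind, text))
        pos = m.end()
    return tokens


def validate(tokens, schema_names):
    """Return the names that do not appear in the (hypothetical) schema catalog."""
    return [t for k, t in tokens if k == "name" and t.upper() not in schema_names]
```

For instance, `scan("SELECT LNAME FROM EMPLOYEE WHERE DNO = 5")` yields eight tokens, and `validate` on that list returns an empty list when LNAME, EMPLOYEE, and DNO are all in the catalog.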
Optimizing the Query

• In this stage, the query processor applies rules to the internal data structures of the query to transform them into equivalent but more efficient representations.
• The rules can be based on mathematical models of the relational algebra expression and tree (heuristics), on cost estimates of the different algorithms applied to operations, or on the semantics within the query and the relations it involves.
• Selecting the proper rules to apply, when to apply them, and how they are applied is the function of the query optimization engine.

Evaluating the Query
• The final step in processing a query is the evaluation phase. The
best evaluation plan candidate generated by the optimization engine is
selected and then executed.
• Note that there can be multiple methods of executing a query. Besides processing a query in a simple sequential manner, some of a query's individual operations can be processed in parallel, either as independent processes or as interdependent pipelines of processes or threads.
• Regardless of the method chosen, the actual results should be the same.
• The term optimization is actually a misnomer because in some cases the
chosen execution plan is not the optimal (best) strategy—it is just a
reasonably efficient strategy for executing the query.
• Finding the optimal strategy is usually too time-consuming except for the
simplest of queries and may require information on how the files are
implemented and even on the contents of the files—information that may
not be fully available in the DBMS catalog.
• Hence, planning of an execution strategy may be a more accurate
description than query optimization.
THE ROLE OF INDEXES
• The utilization of indexes can dramatically reduce the execution time of various
operations such as select and join.
• Let us review some of the types of index file structures and the roles they play
in reducing execution time and overhead:
• Dense Index: The data file is ordered by the search key, and every search-key value has a separate index record.
• This structure requires only a single seek to find the first occurrence of a set of contiguous records with the desired search value.

• Sparse Index: The data file is ordered by the index search key, and only some of the search-key values have corresponding index records. Each index record's data-file pointer points to the first data-file record with that search-key value.
• While this structure can be less efficient (in terms of number of disk accesses) than a dense index for finding the desired records, it requires less storage space and less overhead during insertion and deletion operations.
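
A minimal sketch of the difference between the two structures, assuming a toy data file of three blocks with three keys each. The names `BLOCKS`, `DENSE`, `SPARSE_KEYS`, `dense_lookup`, and `sparse_lookup` are illustrative only: the dense index stores one entry per key, while the sparse index stores one entry per block and must scan the block it lands on.

```python
from bisect import bisect_right

# Data file: blocks of records sorted by search key (toy data, 3 keys per block).
BLOCKS = [[10, 20, 30], [40, 50, 60], [70, 80, 90]]

# Dense index: one entry per search-key value, mapping to its block number.
DENSE = {key: b for b, block in enumerate(BLOCKS) for key in block}

# Sparse index: one entry per block, holding the first key stored in that block.
SPARSE_KEYS = [block[0] for block in BLOCKS]


def dense_lookup(key):
    """Return the block number holding key, or None if the key is absent."""
    return DENSE.get(key)


def sparse_lookup(key):
    """Find the last index entry <= key, then scan that block for the key."""
    i = bisect_right(SPARSE_KEYS, key) - 1
    if i < 0:
        return None
    return i if key in BLOCKS[i] else None
```

Here the sparse index needs 3 entries where the dense index needs 9, at the price of an extra scan inside the located block.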

• Primary Index: The data file is ordered by the attribute that is also the search key in the index file. Primary indices can be dense or sparse. This organization is also referred to as an index-sequential file. For scanning through a relation's records in sequential order by a key value, this is one of the fastest and most efficient structures: locating a record has a cost of 1 seek, and the contiguous makeup of the records in sorted order minimizes the number of blocks that have to be read.
• However, after large numbers of insertions and deletions, performance can degrade quite quickly, and the only way to restore it is to reorganize the file.
• Secondary Index: The data file is ordered by an attribute that is different from the search key in the index file. Secondary indices must be dense.
• Multi-Level Index: An index structure consisting of 2 or more tiers of records where an upper tier's records point to associated index records of the tier below. The bottom tier's index records contain the pointers to the data-file records. Multi-level indices can be used, for instance, to reduce the number of disk block reads needed during a binary search.
• Clustering Index: A two-level index structure where the records in the first level contain the clustering field value in one field and a second field pointing to a block (of second-level records) in the second level. The records in the second level have one field that points to an actual data-file record or to another second-level block.

Translating SQL Queries into Relational Algebra
• SQL is the query language that is used in most commercial RDBMSs.
• An SQL query is first translated into an equivalent extended relational algebra expression, represented as a query tree data structure, that is then optimized.
• Typically, SQL queries are decomposed into query blocks, which form the basic units that can be translated into the algebraic operators and optimized.
• A query block contains a single SELECT-FROM-WHERE expression, as well as GROUP BY and HAVING clauses if these are part of the block.
• Hence, nested queries within a query are identified as separate query blocks.

• For example, consider the COMPANY relational database schema.
[Figure: COMPANY relational database schema]

• Consider the following SQL query on the EMPLOYEE relation:
SELECT LNAME, FNAME FROM EMPLOYEE
WHERE SALARY > (SELECT MAX(SALARY)
FROM EMPLOYEE WHERE DNO = 5);
• The inner block is
SELECT MAX(SALARY) FROM EMPLOYEE WHERE DNO = 5
• The outer block is
SELECT LNAME, FNAME FROM EMPLOYEE
WHERE SALARY > c
where c represents the result returned from the inner block.
• The inner block could be translated into the extended relational algebra expression
ℑ_{MAX SALARY}(σ_{DNO=5}(EMPLOYEE))
where ℑ denotes the aggregate function operator, and the outer block into the expression
Π_{LNAME, FNAME}(σ_{SALARY>c}(EMPLOYEE))

• The query optimizer would then choose an execution plan for each block.

• We should note that in the above example, the inner block needs to be
evaluated only once to produce the maximum salary, which is then used—
as the constant c—by the outer block.

• This is called an uncorrelated nested query.
• It is much harder to optimize the more complex correlated nested queries, where a tuple variable from the outer block appears in the WHERE clause of the inner block.
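
To make the "evaluate the inner block once" point concrete, here is a small sketch in Python. The EMPLOYEE rows are made-up sample values, and `inner_block` and `outer_block` are hypothetical helper names standing in for the two query blocks of the salary example.

```python
# Toy EMPLOYEE rows; the values are made up for illustration.
EMPLOYEE = [
    {"LNAME": "Smith", "FNAME": "John", "SALARY": 30000, "DNO": 5},
    {"LNAME": "Wong", "FNAME": "Franklin", "SALARY": 40000, "DNO": 5},
    {"LNAME": "Borg", "FNAME": "James", "SALARY": 55000, "DNO": 1},
    {"LNAME": "Wallace", "FNAME": "Jennifer", "SALARY": 43000, "DNO": 4},
]


def inner_block(rows):
    """Aggregate MAX(SALARY) over the selection DNO = 5."""
    return max(r["SALARY"] for r in rows if r["DNO"] == 5)


def outer_block(rows, c):
    """Project LNAME, FNAME over the selection SALARY > c."""
    return [(r["LNAME"], r["FNAME"]) for r in rows if r["SALARY"] > c]


c = inner_block(EMPLOYEE)          # evaluated once
result = outer_block(EMPLOYEE, c)  # reuses the constant c for every tuple
```

A correlated nested query could not be split this way, because the inner block would have to be re-evaluated for each tuple of the outer block.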

EX 1: Let R = (A, B, C) and S = (D, E, F), and let relations r(R) and s(S) be given. Give an expression in SQL that is equivalent to each of the following queries.

A. Π_{A}(r): SELECT DISTINCT A FROM r

B. σ_{B=17}(r): SELECT * FROM r WHERE B = 17

C. r × s: SELECT DISTINCT * FROM r, s

D. Π_{A,F}(σ_{C=D}(r × s)): SELECT DISTINCT A, F FROM r, s WHERE C = D

EX 2: Let R = (A, B, C), and let r1 and r2 both be relations on schema R.

r1 ∪ r2: (SELECT * FROM r1) UNION (SELECT * FROM r2)

r1 ∩ r2: SELECT * FROM r1 WHERE (A, B, C) IN (SELECT * FROM r2)

r1 − r2: SELECT * FROM r1 WHERE (A, B, C) NOT IN (SELECT * FROM r2)


• Example
For every project located in 'Stafford', retrieve the project number, the controlling department number, and the department manager's last name, address, and birth date.
• SQL query:
SELECT P.PNUMBER, P.DNUM, E.LNAME, E.ADDRESS, E.BDATE
FROM PROJECT AS P, DEPARTMENT AS D, EMPLOYEE AS E
WHERE P.DNUM = D.DNUMBER AND D.MGRSSN = E.SSN AND P.PLOCATION = 'STAFFORD';
• Relational algebra:
Π_{PNUMBER, DNUM, LNAME, ADDRESS, BDATE}(((σ_{PLOCATION='STAFFORD'}(PROJECT)) ⋈_{DNUM=DNUMBER} (DEPARTMENT)) ⋈_{MGRSSN=SSN} (EMPLOYEE))

Algorithms for External Sorting
• Sorting is one of the primary algorithms used in query processing.

• For example, whenever an SQL query specifies an ORDER BY clause, the query result must be sorted.
• Sorting is also a key component in sort-merge algorithms used for JOIN and other operations (such as UNION and INTERSECTION), and in duplicate elimination algorithms for the PROJECT operation (when an SQL query specifies the DISTINCT option in the SELECT clause).

• External sorting:
– Refers to sorting algorithms that are suitable for large files of records stored on disk that do not fit entirely in main memory, such as most database files.
• Sort-merge strategy:
– Starts by sorting small subfiles (runs) of the main file and then merges the sorted runs, creating larger sorted subfiles that are merged in turn.
• The sort-merge algorithm, like other database algorithms, requires buffer space in main memory, where the actual sorting and merging of the runs is performed.
• The basic algorithm consists of two phases: the sorting phase and the merging phase.
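
The two phases can be sketched as follows. This is a simplified in-memory simulation, not a disk-based implementation: each "block" is a Python list of keys, runs are kept as lists instead of temporary files, and the function name `external_sort` and its parameters are assumptions made for this example.

```python
import heapq


def external_sort(blocks, n_buffers):
    """Sort a list of blocks (lists of keys) using at most n_buffers blocks at a time.

    Returns the fully sorted key list and the number of merge passes performed.
    """
    if n_buffers < 3:
        raise ValueError("need at least 3 buffers: 2 inputs and 1 output")
    # Sorting phase: read n_buffers blocks, sort them in memory, emit one run.
    runs = []
    for i in range(0, len(blocks), n_buffers):
        chunk = [key for block in blocks[i:i + n_buffers] for key in block]
        runs.append(sorted(chunk))
    # Merging phase: merge up to (n_buffers - 1) runs per step until one remains.
    degree = n_buffers - 1
    passes = 0
    while len(runs) > 1:
        runs = [list(heapq.merge(*runs[j:j + degree]))
                for j in range(0, len(runs), degree)]
        passes += 1
    return (runs[0] if runs else []), passes
```

With five two-key blocks and three buffers, the sorting phase produces two runs and a single merge pass finishes the job.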
– Sorting phase: nR = ⌈b/nB⌉
– Merging phase: dM = min(nB − 1, nR); nP = ⌈log_dM(nR)⌉
– nR: number of initial runs; b: number of file blocks;
– nB: available buffer space (in blocks); dM: degree of merging;
– nP: number of merge passes.
• In the sorting phase, runs (portions or pieces) of the file that can fit in the
available buffer space are read into main memory, sorted using an internal
sorting algorithm, and written back to disk as temporary sorted subfiles (or
runs).
• The size of each run and the number of initial runs (nR) are dictated by
the number of file blocks (b) and the available buffer space (nB).

For example

• If the number of available main memory buffers is nB = 5 disk blocks and the size of the file is b = 1024 disk blocks, then nR = ⌈b/nB⌉ = ⌈1024/5⌉ = 205 initial runs, each of size 5 blocks (except the last run, which will have only 4 blocks).
• Hence, after the sorting phase, 205 sorted runs (or 205 sorted subfiles of the original file) are stored as temporary subfiles on disk.

• In the merging phase, the sorted runs are merged during one or more
merge passes. Each merge pass can have one or more merge steps.

• The degree of merging (dM) is the number of sorted subfiles that can be
merged in each merge step.

• During each merge step, one buffer block is needed to hold one disk block
from each of the sorted subfiles being merged, and one additional buffer is
needed for containing one disk block of the merge result, which will
produce a larger sorted file that is the result of merging several smaller
sorted subfiles.

• Hence, dM is the smaller of (nB − 1) and nR, and the number of merge passes is ⌈log_dM(nR)⌉.

• In our example where nB = 5, dM = 4 (four-way merging), so the 205 initial sorted runs would be merged 4 at a time in each step into 52 larger sorted subfiles at the end of the first merge pass.
• These 52 sorted files are then merged 4 at a time into 13 sorted files, which are then merged into 4 sorted files, and then finally into 1 fully sorted file, which means that four passes are needed.
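
The arithmetic of this example can be checked with a short script. The function name `merge_plan` is hypothetical; it counts passes by repeated division rather than by evaluating the logarithm, which sidesteps floating-point rounding.

```python
from math import ceil


def merge_plan(b, n_b):
    """Return (initial runs, degree of merging, run counts after each merge pass)."""
    if n_b < 3:
        raise ValueError("need at least 3 buffers: 2 inputs and 1 output")
    n_r = ceil(b / n_b)          # number of initial runs
    d_m = min(n_b - 1, n_r)      # degree of merging
    counts, runs = [], n_r
    while runs > 1:
        runs = ceil(runs / d_m)  # each merge step combines up to d_m runs
        counts.append(runs)
    return n_r, d_m, counts


n_r, d_m, counts = merge_plan(1024, 5)
print(n_r, d_m, counts)  # 205 4 [52, 13, 4, 1]
```

The length of the returned list is the number of merge passes, here four, which agrees with ⌈log_4(205)⌉ = 4.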

Algorithms for SELECT and JOIN Operations
Implementing the SELECT Operation
• Examples:
– (OP1): σ_{SSN='123456789'}(EMPLOYEE)
– (OP2): σ_{DNUMBER>5}(DEPARTMENT)
– (OP3): σ_{DNO=5}(EMPLOYEE)
– (OP4): σ_{DNO=5 AND SALARY>30000 AND SEX='F'}(EMPLOYEE)
– (OP5): σ_{ESSN='123456789' AND PNO=10}(WORKS_ON)
Search Methods for Simple Selection:
S1 Linear search (brute force): Retrieve every record in the file, and test whether its attribute values satisfy the selection condition.

S2 Binary search: If the selection condition involves an equality comparison on a key attribute on which the file is ordered, binary search (which is more efficient than linear search) can be used. (See OP1.)
S3 Using a primary index or hash key to retrieve a single record: If the selection condition involves an equality comparison on a key attribute with a primary index (or a hash key), use the primary index (or the hash key) to retrieve the record.
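
As a rough illustration of why S2 beats S1 on an ordered file, the sketch below counts how many records each method inspects for an equality condition on a key. The function names and the probe counters are assumptions added for this example; a real DBMS counts block accesses rather than record probes.

```python
def linear_search(records, key, value):
    """S1: scan the file record by record; stop at the first match on a key."""
    probes = 0
    for rec in records:
        probes += 1
        if rec[key] == value:
            return rec, probes
    return None, probes


def binary_search(records, key, value):
    """S2: equality search on the key the file is ordered by."""
    lo, hi, probes = 0, len(records) - 1, 0
    while lo <= hi:
        mid = (lo + hi) // 2
        probes += 1
        if records[mid][key] == value:
            return records[mid], probes
        if records[mid][key] < value:
            lo = mid + 1
        else:
            hi = mid - 1
    return None, probes
```

On a file of 1000 records ordered by the key, looking up the last key takes 1000 probes with the linear scan and at most 10 with binary search.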

S4 Using a primary index to retrieve multiple records: If the comparison condition is >, ≥, <, or ≤ on a key field with a primary index, use the index to find the record satisfying the corresponding equality condition, then retrieve all subsequent records in the (ordered) file.
S5 Using a clustering index to retrieve multiple records: If the selection condition involves an equality comparison on a non-key attribute with a clustering index, use the clustering index to retrieve all the records satisfying the selection condition.
S6 Using a secondary (B+-tree) index: On an equality comparison, this search method can be used to retrieve a single record if the indexing field has unique values (is a key) or to retrieve multiple records if the indexing field is not a key. In addition, it can be used to retrieve records on conditions involving >, >=, <, or <= (range queries).

Search Methods for Complex Selection:
S7 Conjunctive selection: If an attribute involved in any single
simple condition in the conjunctive condition has an access path
that permits the use of one of the methods S2 to S6, use that
condition to retrieve the records and then check whether each
retrieved record satisfies the remaining simple conditions in the
conjunctive condition.
S8 Conjunctive selection using a composite index
If two or more attributes are involved in equality conditions in the
conjunctive condition and a composite index (or hash structure)
exists on the combined field, we can use the index directly.
S9 Conjunctive selection by intersection of record pointers:
– This method is possible if secondary indexes are available on all (or some of) the fields involved in equality comparison conditions in the conjunctive condition and if the indexes include record pointers (rather than block pointers).
– Each index can be used to retrieve the record pointers that satisfy the individual condition.
– The intersection of these sets of record pointers gives the record pointers that satisfy the conjunctive condition, which are then used to retrieve those records directly.
– If only some of the conditions have secondary indexes, each retrieved record is further tested to determine whether it satisfies the remaining conditions.
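
Method S9 can be sketched as below, under the assumption that a record pointer is simply a position in a Python list and that each secondary index is a dictionary from field value to a set of such pointers. The names `build_secondary_index` and `conjunctive_select` are illustrative.

```python
from collections import defaultdict


def build_secondary_index(records, field):
    """Map each field value to the set of record pointers (here: list positions)."""
    index = defaultdict(set)
    for rid, rec in enumerate(records):
        index[rec[field]].add(rid)
    return index


def conjunctive_select(records, indexes, conditions):
    """Evaluate equality conditions {field: value} joined by AND."""
    indexed = [f for f in conditions if f in indexes]
    rest = [f for f in conditions if f not in indexes]
    if not indexed:
        rids = set(range(len(records)))  # no usable index: fall back to a full scan
    else:
        # Intersect the pointer sets returned by each available index.
        rids = set.intersection(
            *(indexes[f].get(conditions[f], set()) for f in indexed))
    # Fetch the surviving records and test the conditions that had no index.
    return [records[rid] for rid in sorted(rids)
            if all(records[rid][f] == conditions[f] for f in rest)]
```

Only the records whose pointers survive the intersection are fetched, and conditions without an index are checked on those records alone.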

– Whenever a single condition specifies the selection, we can only check whether an access path exists on the attribute involved in that condition.
• If an access path exists, the method corresponding to that
access path is used; otherwise, the “brute force” linear search
approach of method S1 is used. (See OP1, OP2 and OP3)
– For conjunctive selection conditions, whenever more than
one of the attributes involved in the conditions have an access
path, query optimization should be done to choose the access path
that retrieves the fewest records in the most efficient way.
Implementing the JOIN Operation:

– Join (EQUIJOIN, NATURAL JOIN)
• Two-way join: a join on two files
• e.g. R ⋈_{A=B} S
• Multi-way joins: joins involving more than two files
• e.g. R ⋈_{A=B} S ⋈_{C=D} T
• Examples
– (OP6): EMPLOYEE ⋈_{DNO=DNUMBER} DEPARTMENT
– (OP7): DEPARTMENT ⋈_{MGRSSN=SSN} EMPLOYEE

Methods for implementing joins:

– J1 Nested-loop join (brute force):
• For each record t in R (outer loop), retrieve every record s from S (inner loop) and test whether the two records satisfy the join condition t[A] = s[B].
– J2 Single-loop join (using an access structure to retrieve the matching records):
• If an index (or hash key) exists for one of the two join attributes (say, B of S), retrieve each record t in R, one at a time, and then use the access structure to retrieve directly all matching records s from S that satisfy s[B] = t[A].
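
A minimal sketch of J1 and J2 on in-memory lists of dictionaries. The function names are assumptions, and the dictionary built inside `single_loop_join` merely stands in for an index on S.B that, in a real system, would already exist on disk.

```python
from collections import defaultdict


def nested_loop_join(R, S, a, b):
    """J1: compare every pair (t, s) and keep those with t[a] == s[b]."""
    return [{**t, **s} for t in R for s in S if t[a] == s[b]]


def single_loop_join(R, S, a, b):
    """J2: probe an access structure on S.b once per record of R."""
    index = defaultdict(list)  # stands in for an existing index on S.b
    for s in S:
        index[s[b]].append(s)
    return [{**t, **s} for t in R for s in index.get(t[a], [])]
```

Both functions return the same rows; the difference is that J1 performs |R| × |S| comparisons, while J2 performs one index probe per record of R.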

– J3 Sort-merge join:
• If the records of R and S are physically sorted (ordered) by value of the join attributes A and B, respectively, we can implement the join in the most efficient way possible.
• Both files are scanned in order of the join attributes, matching the records that have the same values for A and B.
• In this method, the records of each file are scanned only once each for matching with the other file, unless both A and B are non-key attributes, in which case the method needs to be modified slightly.
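
The merge step can be sketched as follows. The function name is an assumption, the explicit `sorted` calls stand in for files that are already physically ordered, and the grouping of equal values is one possible way to handle the non-key case mentioned above.

```python
def sort_merge_join(R, S, a, b):
    """J3: merge two inputs ordered on their join attributes a and b."""
    R = sorted(R, key=lambda t: t[a])  # stands in for a file already ordered on a
    S = sorted(S, key=lambda s: s[b])  # stands in for a file already ordered on b
    out, i, j = [], 0, 0
    while i < len(R) and j < len(S):
        if R[i][a] < S[j][b]:
            i += 1
        elif R[i][a] > S[j][b]:
            j += 1
        else:
            value = R[i][a]
            # Collect the group of equal values on each side, then pair them up.
            i_end = i
            while i_end < len(R) and R[i_end][a] == value:
                i_end += 1
            j_end = j
            while j_end < len(S) and S[j_end][b] == value:
                j_end += 1
            out.extend({**t, **s} for t in R[i:i_end] for s in S[j:j_end])
            i, j = i_end, j_end
    return out
```

When A or B is a key, each group has a single record and the loop reduces to the plain one-pass merge.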

– J4 Hash-join:
• The records of files R and S are both hashed to the same hash
file, using the same hashing function on the join attributes A of R
and B of S as hash keys.
• A single pass through the file with fewer records (say, R)
hashes its records to the hash file buckets.
• A single pass through the other file (S) then hashes each of its
records to the appropriate bucket, where the record is combined
with all matching records from R.
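
A small in-memory sketch of the hash join, assuming the hash file is a Python dictionary of buckets and that the smaller input is the one used to build it. The function name and the use of Python's built-in `hash` are assumptions for illustration.

```python
from collections import defaultdict


def hash_join(R, S, a, b):
    """J4: build buckets from the smaller input, then probe with the larger one."""
    swapped = len(S) < len(R)
    build, probe = (S, R) if swapped else (R, S)
    build_key, probe_key = (b, a) if swapped else (a, b)
    buckets = defaultdict(list)
    for rec in build:  # single pass over the file with fewer records
        buckets[hash(rec[build_key])].append(rec)
    out = []
    for rec in probe:  # single pass over the other file
        for cand in buckets.get(hash(rec[probe_key]), []):
            if cand[build_key] == rec[probe_key]:  # guard against hash collisions
                out.append({**cand, **rec})
    return out
```

Each file is read exactly once, which is why the method works best when the buckets of the smaller file fit in the available buffer space.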

• Factors affecting JOIN performance
– Available buffer space
– Join selection factor
– Choice of inner vs. outer relation
Using Heuristics in Query Optimization

• Process for heuristic optimization
1. The parser of a high-level query generates an initial internal representation.
2. Heuristic rules are applied to optimize the internal representation.
3. A query execution plan is generated to execute groups of operations
based on the access paths available on the files involved in the query.
• The main heuristic is to apply first the operations that reduce the size of
intermediate results.
– E.g., Apply SELECT and PROJECT operations before applying the
JOIN or other binary operations.

• Query tree:
– A tree data structure that corresponds to a relational algebra expression.
– It represents the input relations of the query as leaf nodes of the tree, and
represents the relational algebra operations as internal nodes.
– An execution of the query tree consists of executing an internal node
operation whenever its operands are available and then replacing that internal
node by the relation that results from executing the operation.
• Query graph:
– A graph data structure that corresponds to a relational calculus expression.
– It does not indicate the order in which operations are to be performed.
– There is only a single graph corresponding to each query.
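
The bottom-up execution of a query tree described above can be sketched with nested tuples. The node constructors `select`, `project`, and `join`, and the `execute` function, are hypothetical names for this example; leaves are relation names looked up in a dictionary that stands in for the database.

```python
def select(pred, child):
    """Internal node for a selection with a one-argument predicate."""
    return ("select", pred, child)


def project(attrs, child):
    """Internal node for a projection on a list of attribute names."""
    return ("project", attrs, child)


def join(pred, left, right):
    """Internal node for a join with a two-argument predicate."""
    return ("join", pred, left, right)


def execute(node, db):
    """Evaluate a query tree bottom-up; leaves are relation names found in db."""
    if isinstance(node, str):
        return db[node]
    op = node[0]
    if op == "select":
        return [t for t in execute(node[2], db) if node[1](t)]
    if op == "project":
        return [{a: t[a] for a in node[1]} for t in execute(node[2], db)]
    if op == "join":
        left, right = execute(node[2], db), execute(node[3], db)
        return [{**l, **r} for l in left for r in right if node[1](l, r)]
    raise ValueError(f"unknown operator {op!r}")
```

Each internal node is replaced by the relation that results from executing it, so a selection placed below a join shrinks the join's input before any pairs are compared.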

• Example: For every project located in 'Stafford', retrieve the project number, the controlling department number, and the department manager's last name, address, and birth date.
• Relational algebra:
Π_{PNUMBER, DNUM, LNAME, ADDRESS, BDATE}(((σ_{PLOCATION='STAFFORD'}(PROJECT)) ⋈_{DNUM=DNUMBER} (DEPARTMENT)) ⋈_{MGRSSN=SSN} (EMPLOYEE))
• SQL query:
Q2: SELECT P.PNUMBER, P.DNUM, E.LNAME, E.ADDRESS, E.BDATE
FROM PROJECT AS P, DEPARTMENT AS D, EMPLOYEE AS E
WHERE P.DNUM = D.DNUMBER AND D.MGRSSN = E.SSN AND P.PLOCATION = 'STAFFORD';
• Heuristic Optimization of Query Trees:
– The same query could correspond to many different
relational algebra expressions — and hence many different
query trees.
– The task of heuristic optimization of query trees is to find a
final query tree that is efficient to execute.
• Example:
Q: SELECT LNAME
FROM EMPLOYEE, WORKS_ON, PROJECT
WHERE PNAME = 'AQUARIUS' AND PNUMBER = PNO AND ESSN = SSN
AND BDATE > '1957-12-31';

Using Selectivity and Cost Estimates in Query Optimization
• A query optimizer does not depend solely on heuristic rules; it
also estimates and compares the costs of executing a query
using different execution strategies and algorithms, and it then
chooses the strategy with the lowest cost estimate.
• For this approach to work, accurate cost estimates are required
so that different strategies can be compared fairly and
realistically.
• In addition, the optimizer must limit the number of execution
strategies to be considered; otherwise, too much time will be
spent making cost estimates for the many possible execution
strategies.
• Hence, this approach is more suitable for compiled queries where
the optimization is done at compile time and the resulting execution
strategy code is stored and executed directly at runtime.
• Cost-based query optimization:
– Estimate and compare the costs of executing a query using different execution strategies, and choose the strategy with the lowest cost estimate. (Compare with heuristic query optimization.)
• Issues
– Cost function
– Number of execution strategies to be considered
• Cost Components for Query Execution
1. Access cost to secondary storage
2. Storage cost
3. Computation cost
4. Memory usage cost
5. Communication cost
Note: Different database systems may focus on different cost
components.
• Examples of Cost Functions for SELECT (b: number of file blocks; r: number of records; bfr: blocking factor; s: selection cardinality; x: number of index levels; bI1: number of first-level index blocks)
• S1. Linear search (brute force) approach:
– CS1a = b. For an equality condition on a key, CS1b = (b/2) on average if the record is found; otherwise CS1a = b.
• S2. Binary search: CS2 = log2 b + ⌈s/bfr⌉ − 1. For an equality condition on a unique (key) attribute, CS2 = log2 b.
• S3. Using a primary index (S3a) or hash key (S3b) to retrieve a single record:
– CS3a = x + 1; CS3b = 1 for static or linear hashing;
– CS3b = 2 for extendible hashing.
• S4. Using an ordering index to retrieve multiple records: For the comparison condition on a key field with an ordering index, CS4 = x + (b/2).
• S5. Using a clustering index to retrieve multiple records:
– CS5 = x + ⌈s/bfr⌉
• S6. Using a secondary (B+-tree) index:
– For an equality comparison, CS6a = x + s; for a comparison condition such as >, <, >=, or <=, CS6b = x + (bI1/2) + (r/2).
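
Several of these cost functions can be written as small Python helpers for experimenting with different file sizes. The function names are hypothetical, hashing is left out, and rounding log2 b up to a whole number of block accesses is an assumption of this sketch rather than part of the formulas. Here b is the number of file blocks, s the number of records satisfying the condition, bfr the blocking factor, x the number of index levels, b_i1 the number of first-level index blocks, and r the number of records.

```python
from math import ceil, log2


def cost_linear(b, key_equality_found=False):
    """S1: b block accesses, or b/2 on average for a successful key lookup."""
    return b / 2 if key_equality_found else b


def cost_binary(b, s=1, bfr=1):
    """S2: log2(b) + ceil(s/bfr) - 1, with log2(b) rounded up (an assumption)."""
    return ceil(log2(b)) + ceil(s / bfr) - 1


def cost_primary_index(x):
    """S3a: x index levels plus one data block."""
    return x + 1


def cost_ordering_index_range(x, b):
    """S4: x index levels plus roughly half of the file blocks."""
    return x + b / 2


def cost_clustering_index(x, s, bfr):
    """S5: x index levels plus the blocks holding the s matching records."""
    return x + ceil(s / bfr)


def cost_secondary_equality(x, s):
    """S6, equality: x index levels plus up to one block per matching record."""
    return x + s


def cost_secondary_range(x, b_i1, r):
    """S6, range: x + half of the first-level index blocks + half of the records."""
    return x + b_i1 / 2 + r / 2
```

As an illustration with made-up numbers, a file of b = 2000 blocks costs 2000 block accesses under a linear scan for a non-key condition, but only 11 under binary search for an equality condition on the ordering key.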
Semantic Query Optimization :
– Uses constraints specified on the database schema in order to modify one query
into another query that is more efficient to execute.
• Consider the following SQL query:
SELECT E.LNAME, M.LNAME
FROM EMPLOYEE AS E, EMPLOYEE AS M
WHERE E.SUPERSSN = M.SSN AND E.SALARY > M.SALARY
• Explanation:
– Suppose that we had a constraint on the database schema that stated that no
employee can earn more than his or her direct supervisor. If the semantic query
optimizer checks for the existence of this constraint, it need not execute the query
at all because it knows that the result of the query will be empty. Techniques
known as theorem proving can be used for this purpose.
