04 Advanced Database System Chap 02 [RVUNC]
4/11/2022 1
Introduction to Query Processing
Query Tree
Query Graph
• The query optimizer module has the task of producing an execution plan,
and the code generator generates the code to execute that plan.
• The runtime database processor has the task of running the query code,
whether in compiled or interpreted mode, to produce the query result.
• If a runtime error results, an error message is generated by the runtime
database processor.
• There are three phases that a query passes through during the DBMS
processing of that query:
– Parsing and translation
– Optimization
– Evaluation
• During the parsing and translation stage, the human-readable form of
the query is translated into forms usable by the DBMS.
• These forms can be a relational algebra expression, a query tree, or a
query graph.
Parsing and Translating the Query
• The first step in processing a query submitted to a DBMS is to convert the
query into a form usable by the query processing engine.
• High-level query languages such as SQL represent a query as a string, or
sequence, of characters.
• Certain sequences of characters represent various types of tokens, such as
keywords, operators, operands, and literal strings. As in all languages,
rules (syntax and grammar) govern how the tokens can be combined into
understandable (i.e. valid) statements.
• The primary job of the parser is to extract the tokens from the raw string
of characters and translate them into the corresponding internal data
elements (i.e. relational algebra operations and operands) and structures (i.e.
query tree, query graph).
• The last job of the parser is to verify the validity and syntax of the original
query string.
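The token-extraction step can be sketched in Python with a small regular-expression tokenizer. This is only an illustration: the keyword set, the token names, and the `tokenize` function are assumptions made for this example and do not describe any particular DBMS parser.

```python
import re

# Hypothetical, deliberately tiny keyword set (illustration only).
KEYWORDS = {"SELECT", "FROM", "WHERE", "AND", "OR"}

TOKEN_PATTERN = re.compile(r"""
    (?P<STRING>'[^']*')                  # literal string
  | (?P<NUMBER>\d+(?:\.\d+)?)            # integer or decimal literal
  | (?P<NAME>[A-Za-z_][A-Za-z_0-9.]*)    # keyword or identifier
  | (?P<OP><=|>=|<>|[=<>*,();])          # operators and punctuation
  | (?P<SPACE>\s+)                       # whitespace, skipped
""", re.VERBOSE)


def tokenize(query):
    """Split a query string into (kind, text) tokens, rejecting unknown characters."""
    tokens = []
    pos = 0
    while pos < len(query):
        match = TOKEN_PATTERN.match(query, pos)
        if match is None:
            raise SyntaxError(f"unexpected character {query[pos]!r} at position {pos}")
        pos = match.end()
        kind, text = match.lastgroup, match.group()
        if kind == "SPACE":
            continue
        if kind == "NAME" and text.upper() in KEYWORDS:
            kind = "KEYWORD"
        tokens.append((kind, text))
    return tokens


tokens = tokenize("SELECT LNAME FROM EMPLOYEE WHERE DNO = 5")
```

A real parser would go on to check the token sequence against the grammar and build a query tree or query graph; this sketch stops at token extraction.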
Optimizing the Query
• In this stage, the query processor applies rules to the internal data structures
of the query to transform these structures into equivalent, but more efficient
representations.
• Selecting the proper rules to apply, when to apply them and how they are
applied is the function of the query optimization engine.
Evaluating the Query
• The final step in processing a query is the evaluation phase. The
best evaluation plan candidate generated by the optimization engine is
selected and then executed.
• Note that there can exist multiple methods of executing a query. Besides
processing a query in a simple sequential manner, some of a query's
individual operations can be processed in parallel, either as independent
processes or as interdependent pipelines of processes or threads.
• Regardless of the method chosen, the actual results should be the same.
• The term optimization is actually a misnomer because in some cases the
chosen execution plan is not the optimal (best) strategy—it is just a
reasonably efficient strategy for executing the query.
• Finding the optimal strategy is usually too time-consuming except for the
simplest of queries and may require information on how the files are
implemented and even on the contents of the files—information that may
not be fully available in the DBMS catalog.
• Hence, planning of an execution strategy may be a more accurate
description than query optimization.
THE ROLE OF INDEXES
• The utilization of indexes can dramatically reduce the execution time of various
operations such as select and join.
• Let us review some of the types of index file structures and the roles they play
in reducing execution time and overhead:
• Dense Index: Data-file is ordered by the search key and every search key
value has a separate index record.
• This structure requires only a single seek to find the first occurrence of a set of
contiguous records with the desired search value.
• Sparse Index: Data-file is ordered by the index search key and only some
of the search key values have corresponding index records. Each index
record's data-file pointer points to the first data-file record with the
search key value.
•While this structure can be less efficient (in terms of number of disk
accesses) than a dense index to find the desired records, it requires less
storage space and less overhead during insertion and deletion operations.
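The trade-off between the two structures can be sketched in Python. The block size of three records, the sample keys, and the function names are assumptions made for this illustration; in-memory lists stand in for disk blocks.

```python
from bisect import bisect_right

# Data file ordered by search key, split into blocks of 3 records (assumed size).
BLOCKS = [
    [(10, "a"), (20, "b"), (30, "c")],
    [(40, "d"), (50, "e"), (60, "f")],
    [(70, "g"), (80, "h"), (90, "i")],
]

# Dense index: one entry per search-key value -> (block number, slot).
dense_index = {
    key: (b, s)
    for b, block in enumerate(BLOCKS)
    for s, (key, _) in enumerate(block)
}

# Sparse index: one entry per block, holding the first key stored in that block.
sparse_keys = [block[0][0] for block in BLOCKS]


def dense_lookup(key):
    """Go straight to the record through its own index entry."""
    location = dense_index.get(key)
    if location is None:
        return None
    b, s = location
    return BLOCKS[b][s][1]


def sparse_lookup(key):
    """Find the last index entry with key <= search key, then scan that block."""
    b = bisect_right(sparse_keys, key) - 1
    if b < 0:
        return None
    for k, value in BLOCKS[b]:
        if k == key:
            return value
    return None
```

Here the dense index holds nine entries and the sparse index three, while the sparse lookup pays for a short sequential scan inside the block it lands on.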
• Primary Index: The data file is ordered by the attribute that is also the
search key in the index file. Primary indices can be dense or sparse. This
is also referred to as an Index Sequential File. For scanning through a
relation's records in sequential order by a key value, this is one of the
fastest and most efficient structures: locating a record has a cost of 1
seek, and the contiguous makeup of the records in sorted order
minimizes the number of blocks that have to be read.
• However, after large numbers of insertions and deletions, the
performance can degrade quite quickly, and the only way to restore the
performance is to reorganize the file.
• Secondary Index: The data file is ordered by an attribute that is different
from the search key in the index file. Secondary indices must be dense.
• Clustering Index: A two-level index structure where the records in the first
level contain the clustering field value in one field and a second field pointing
to a block [of 2nd level records] in the second level. The records in the second
level have one field that points to an actual data file record or to another 2nd
level block.
Translating SQL Queries into Relational Algebra
• SQL is the query language that is used in most commercial RDBMSs.
• Typically, SQL queries are decomposed into query blocks, which form the
basic units that can be translated into the algebraic operators and optimized.
• The query optimizer would then choose an execution plan for each block.
• We should note that in the above example, the inner block needs to be
evaluated only once to produce the maximum salary, which is then used—
as the constant c—by the outer block.
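The idea of evaluating the inner block once and reusing its result as a constant can be sketched in Python. The sample rows, the restriction of the inner block to department 5, and the function names are all assumptions made for this illustration.

```python
# Hypothetical sample data (illustration only).
EMPLOYEE = [
    {"LNAME": "Smith", "SALARY": 30000, "DNO": 5},
    {"LNAME": "Wong", "SALARY": 40000, "DNO": 5},
    {"LNAME": "Borg", "SALARY": 55000, "DNO": 1},
    {"LNAME": "Wallace", "SALARY": 43000, "DNO": 4},
]


def inner_block(employees):
    """Inner query block: maximum salary in department 5 (assumed condition)."""
    return max(e["SALARY"] for e in employees if e["DNO"] == 5)


def outer_block(employees, c):
    """Outer query block: employees earning more than the constant c."""
    return [e["LNAME"] for e in employees if e["SALARY"] > c]


c = inner_block(EMPLOYEE)          # evaluated only once
result = outer_block(EMPLOYEE, c)  # c is then treated as a plain constant
```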
EX 1: R = (A, B, C), S = (D, E, F)
EX 2: Let R = (A, B, C), and let r1 and r2 both be relations on schema R.
Algorithms for External Sorting
• Sorting is one of the primary algorithms used in query processing.
External sorting:
– Refers to sorting algorithms that are suitable for large files of records
stored on disk that do not fit entirely in main memory, such as most
database files.
• Sort-Merge strategy:
– Starts by sorting small subfiles (runs) of the main file and then merges
the sorted runs, creating larger sorted subfiles that are merged in turn.
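The sort-merge strategy can be sketched in Python as follows. In-memory lists stand in for disk files here, and the function name and its parameters (`run_size`, `merge_degree`) are assumptions made for this illustration.

```python
import heapq


def external_sort(records, run_size, merge_degree):
    """Sort records with a sort phase followed by repeated merge passes.

    Returns the sorted list and the number of merge passes used.
    """
    if run_size < 1 or merge_degree < 2:
        raise ValueError("run_size must be >= 1 and merge_degree >= 2")

    # Sorting phase: cut the file into runs that fit in memory and sort each one.
    runs = [
        sorted(records[i:i + run_size])
        for i in range(0, len(records), run_size)
    ]

    # Merging phase: merge up to merge_degree runs at a time until one run is left.
    passes = 0
    while len(runs) > 1:
        runs = [
            list(heapq.merge(*runs[i:i + merge_degree]))
            for i in range(0, len(runs), merge_degree)
        ]
        passes += 1

    return (runs[0] if runs else []), passes


sorted_records, passes = external_sort([9, 3, 7, 1, 8, 2, 6, 4, 5], run_size=2, merge_degree=2)
```

With nine records, runs of two records, and two-way merging, the five initial runs shrink to three, then two, then one.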
For example
• Hence, after the sorting phase, 205 sorted runs (or 205 sorted
subfiles of the original file) are stored as temporary subfiles on
disk.
• In the merging phase, the sorted runs are merged during one or more
merge passes. Each merge pass can have one or more merge steps.
• The degree of merging (dM) is the number of sorted subfiles that can be
merged in each merge step.
• During each merge step, one buffer block is needed to hold one disk block
from each of the sorted subfiles being merged, and one additional buffer is
needed for containing one disk block of the merge result, which will
produce a larger sorted file that is the result of merging several smaller
sorted subfiles.
• Hence, dM is the smaller of (nB − 1) and nR, and the number of merge
passes is ⌈log_dM(nR)⌉.
• These 52 sorted files are then merged 4 at a time into 13 sorted files, which
are then merged into 4 sorted files, and then finally into 1 fully sorted file,
which means that four passes are needed.
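The arithmetic behind these figures can be checked with a short Python sketch. The file size of 1024 blocks and the buffer of 5 blocks are assumptions chosen because they are consistent with the 205 initial runs and the four-way merging mentioned in the text; the function name is hypothetical.

```python
import math


def merge_plan(file_blocks, buffer_blocks):
    """Return (initial runs, degree of merging, run counts after each pass)."""
    if buffer_blocks < 3:
        raise ValueError("need at least 3 buffer blocks to merge 2 runs")
    n_runs = math.ceil(file_blocks / buffer_blocks)   # nR
    d_m = min(buffer_blocks - 1, n_runs)              # dM
    counts = [n_runs]
    while counts[-1] > 1:
        counts.append(math.ceil(counts[-1] / d_m))
    return n_runs, d_m, counts


# Assumed example values: b = 1024 blocks, nB = 5 buffer blocks.
n_runs, d_m, counts = merge_plan(1024, 5)
merge_passes = len(counts) - 1
```

Under these assumptions the run counts go 205, 52, 13, 4, 1, which agrees with ⌈log_4(205)⌉ = 4 merge passes.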
Algorithms for SELECT and JOIN Operations
Implementing the SELECT Operation
• Examples:
– (OP1): σ_{SSN='123456789'}(EMPLOYEE)
– (OP2): σ_{DNUMBER>5}(DEPARTMENT)
– (OP3): σ_{DNO=5}(EMPLOYEE)
– (OP4): σ_{DNO=5 AND SALARY>30000 AND SEX='F'}(EMPLOYEE)
– (OP5): σ_{ESSN='123456789' AND PNO=10}(WORKS_ON)
Search Methods for Simple Selection:
S1 Linear search (brute force): Retrieve every record in the file,
and test whether its attribute values satisfy the selection
condition.
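Linear search (S1) can be sketched in Python as a full scan with a test on every record. The sample rows and the function name are assumptions made for this illustration; a list of dictionaries stands in for the file.

```python
# Hypothetical sample rows (illustration only).
EMPLOYEE = [
    {"SSN": "123456789", "DNO": 5},
    {"SSN": "333445555", "DNO": 5},
    {"SSN": "999887777", "DNO": 4},
]


def linear_search(records, condition):
    """Retrieve every record and keep those that satisfy the selection condition."""
    return [r for r in records if condition(r)]


op1 = linear_search(EMPLOYEE, lambda r: r["SSN"] == "123456789")  # like OP1
op3 = linear_search(EMPLOYEE, lambda r: r["DNO"] == 5)            # like OP3
```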
Search Methods for Complex Selection:
S7 Conjunctive selection: If an attribute involved in any single
simple condition in the conjunctive condition has an access path
that permits the use of one of the methods S2 to S6, use that
condition to retrieve the records and then check whether each
retrieved record satisfies the remaining simple conditions in the
conjunctive condition.
S8 Conjunctive selection using a composite index: If two or more
attributes are involved in equality conditions in the conjunctive
condition and a composite index (or hash structure) exists on the
combined field, we can use the index directly.
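Methods S7 and S8 can be sketched in Python, with dictionaries standing in for the index structures. The sample rows, the `build_index` helper, and the other function names are assumptions made for this illustration.

```python
from collections import defaultdict

# Hypothetical sample rows (illustration only).
EMPLOYEE = [
    {"SSN": "123456789", "DNO": 5, "SALARY": 30000, "SEX": "M"},
    {"SSN": "453453453", "DNO": 5, "SALARY": 25000, "SEX": "F"},
    {"SSN": "111222333", "DNO": 5, "SALARY": 38000, "SEX": "F"},
    {"SSN": "987654321", "DNO": 4, "SALARY": 43000, "SEX": "F"},
]
WORKS_ON = [
    {"ESSN": "123456789", "PNO": 1},
    {"ESSN": "123456789", "PNO": 10},
    {"ESSN": "333445555", "PNO": 10},
]


def build_index(records, *attrs):
    """Map each combination of values of attrs to the records that have it."""
    index = defaultdict(list)
    for r in records:
        index[tuple(r[a] for a in attrs)].append(r)
    return index


def conjunctive_select(index, key, remaining_conditions):
    """S7: fetch through one indexed condition, then test the remaining ones."""
    return [
        r for r in index.get(key, [])
        if all(cond(r) for cond in remaining_conditions)
    ]


# S7 on a selection like OP4: the index on DNO retrieves candidates,
# and the salary and sex conditions are checked on each retrieved record.
dno_index = build_index(EMPLOYEE, "DNO")
op4 = conjunctive_select(
    dno_index, (5,),
    [lambda r: r["SALARY"] > 30000, lambda r: r["SEX"] == "F"],
)

# S8 on a selection like OP5: a composite index on (ESSN, PNO) answers it directly.
essn_pno_index = build_index(WORKS_ON, "ESSN", "PNO")
op5 = essn_pno_index.get(("123456789", 10), [])
```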
• e.g. R ⋈_{A=B} S
• Examples
Methods for implementing joins:
– J3 Sort-merge join:
• Both files are scanned in order of the join attributes, matching the
records that have the same values for A and B.
• In this method, the records of each file are scanned only once each
for matching with the other file—unless both A and B are non-key
attributes, in which case the method needs to be modified slightly.
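A sort-merge join can be sketched in Python as follows. The sketch sorts both inputs first and includes one possible form of the modification for non-key join attributes: when several records on both sides share a join value, every pairing is produced. The sample rows and function name are assumptions made for this illustration.

```python
def sort_merge_join(r, s, a, b):
    """Join lists of dicts r and s on r[a] == s[b] by scanning both in sorted order."""
    r = sorted(r, key=lambda t: t[a])
    s = sorted(s, key=lambda t: t[b])
    result = []
    i = j = 0
    while i < len(r) and j < len(s):
        if r[i][a] < s[j][b]:
            i += 1
        elif r[i][a] > s[j][b]:
            j += 1
        else:
            value = r[i][a]
            # Find the end of the group of s records with this join value.
            j_end = j
            while j_end < len(s) and s[j_end][b] == value:
                j_end += 1
            # Pair every r record in the matching group with that s group.
            while i < len(r) and r[i][a] == value:
                for k in range(j, j_end):
                    result.append({**r[i], **s[k]})
                i += 1
            j = j_end
    return result


# Hypothetical sample rows (illustration only).
EMPLOYEE = [
    {"LNAME": "Smith", "DNO": 5},
    {"LNAME": "Wong", "DNO": 5},
    {"LNAME": "Borg", "DNO": 1},
    {"LNAME": "Jabbar", "DNO": 7},
]
DEPARTMENT = [
    {"DNUMBER": 1, "DNAME": "Headquarters"},
    {"DNUMBER": 5, "DNAME": "Research"},
    {"DNUMBER": 4, "DNAME": "Administration"},
]

joined = sort_merge_join(EMPLOYEE, DEPARTMENT, "DNO", "DNUMBER")
```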
– J4 Hash-join:
• The records of files R and S are both hashed to the same hash
file, using the same hashing function on the join attributes A of R
and B of S as hash keys.
• A single pass through the file with fewer records (say, R)
hashes its records to the hash file buckets.
• A single pass through the other file (S) then hashes each of its
records to the appropriate bucket, where the record is combined
with all matching records from R.
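The two passes of the hash join can be sketched in Python, with a dictionary of lists standing in for the hash file buckets. The sample rows and function name are assumptions made for this illustration, and the sketch assumes the smaller file fits in memory.

```python
from collections import defaultdict


def hash_join(r, s, a, b):
    """Join lists of dicts r and s on r[a] == s[b]; r should be the smaller file."""
    # First pass: hash the records of the smaller file R into buckets.
    buckets = defaultdict(list)
    for t in r:
        buckets[t[a]].append(t)

    # Second pass: probe with each record of S and combine it with its matches.
    result = []
    for u in s:
        for t in buckets.get(u[b], []):
            result.append({**t, **u})
    return result


# Hypothetical sample rows (illustration only).
DEPARTMENT = [
    {"DNUMBER": 1, "DNAME": "Headquarters"},
    {"DNUMBER": 5, "DNAME": "Research"},
]
EMPLOYEE = [
    {"LNAME": "Smith", "DNO": 5},
    {"LNAME": "Wong", "DNO": 5},
    {"LNAME": "Borg", "DNO": 1},
    {"LNAME": "Jabbar", "DNO": 7},
]

joined = hash_join(DEPARTMENT, EMPLOYEE, "DNUMBER", "DNO")
```

Each file is read once, and neither input needs to be sorted.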
Using Heuristics in Query Optimization
• Query tree:
– A tree data structure that corresponds to a relational algebra expression.
– It represents the input relations of the query as leaf nodes of the tree, and
represents the relational algebra operations as internal nodes.
– An execution of the query tree consists of executing an internal node
operation whenever its operands are available and then replacing that internal
node by the relation that results from executing the operation.
• Query graph:
– A graph data structure that corresponds to a relational calculus expression.
– It does not indicate an order on which operations to perform first.
– There is only a single graph corresponding to each query.
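The execution rule for a query tree (evaluate an internal node once its operands are available, then replace it by the resulting relation) can be sketched in Python as a recursive evaluation. The tuple-based tree encoding, the operator names, and the sample rows are assumptions made for this illustration; for simplicity the projection keeps duplicates.

```python
def execute(node):
    """Evaluate a query tree bottom-up; leaves are relations (lists of dicts)."""
    if isinstance(node, list):          # leaf node: an input relation
        return node
    op, *args = node
    if op == "select":
        condition, child = args
        return [t for t in execute(child) if condition(t)]
    if op == "project":
        attrs, child = args
        return [{a: t[a] for a in attrs} for t in execute(child)]
    if op == "join":
        condition, left, right = args
        left_rows, right_rows = execute(left), execute(right)
        return [{**x, **y} for x in left_rows for y in right_rows if condition(x, y)]
    raise ValueError(f"unknown operation: {op}")


# Hypothetical sample rows (illustration only).
PROJECT = [
    {"PNUMBER": 10, "PLOCATION": "Stafford", "DNUM": 4},
    {"PNUMBER": 1, "PLOCATION": "Bellaire", "DNUM": 5},
    {"PNUMBER": 30, "PLOCATION": "Stafford", "DNUM": 4},
]
DEPARTMENT = [
    {"DNUMBER": 4, "MGRSSN": "987654321"},
    {"DNUMBER": 5, "MGRSSN": "333445555"},
]
EMPLOYEE = [
    {"SSN": "987654321", "LNAME": "Wallace"},
    {"SSN": "333445555", "LNAME": "Wong"},
]

tree = (
    "project", ["PNUMBER", "DNUM", "LNAME"],
    ("join", lambda x, e: x["MGRSSN"] == e["SSN"],
        ("join", lambda p, d: p["DNUM"] == d["DNUMBER"],
            ("select", lambda p: p["PLOCATION"] == "Stafford", PROJECT),
            DEPARTMENT),
        EMPLOYEE),
)
rows = execute(tree)
```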
• Example: For every project located in ‘Stafford’, retrieve the
project number, the controlling department number and the
department manager’s last name, address and birthdate.
• Relational algebra:
π_{PNUMBER, DNUM, LNAME, ADDRESS, BDATE}
(((σ_{PLOCATION='STAFFORD'}(PROJECT))
⋈_{DNUM=DNUMBER} (DEPARTMENT)) ⋈_{MGRSSN=SSN} (EMPLOYEE))
• SQL query:
Q2: SELECT P.PNUMBER, P.DNUM, E.LNAME,
E.ADDRESS, E.BDATE
FROM PROJECT AS P, DEPARTMENT
AS D, EMPLOYEE AS E
WHERE P.DNUM=D.DNUMBER AND
D.MGRSSN=E.SSN AND
P.PLOCATION='STAFFORD';
• Heuristic Optimization of Query Trees:
– The same query could correspond to many different
relational algebra expressions — and hence many different
query trees.
– The task of heuristic optimization of query trees is to find a
final query tree that is efficient to execute.
• Example:
Q: SELECT LNAME
FROM EMPLOYEE, WORKS_ON,
PROJECT
WHERE PNAME = 'AQUARIUS' AND
PNUMBER=PNO AND ESSN=SSN
AND BDATE > '1957-12-31';
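One common heuristic, applying selections before joins, can be illustrated on query Q with a Python sketch. Both plans below return the same answer, but the second keeps its intermediate results much smaller. The sample rows and function names are assumptions made for this illustration; dates are ISO strings, so string comparison orders them correctly.

```python
from itertools import product

# Hypothetical sample rows (illustration only).
EMPLOYEE = [
    {"SSN": "1", "LNAME": "Smith", "BDATE": "1965-01-09"},
    {"SSN": "2", "LNAME": "Wong", "BDATE": "1955-12-08"},
    {"SSN": "3", "LNAME": "English", "BDATE": "1972-07-31"},
]
WORKS_ON = [
    {"ESSN": "1", "PNO": 10},
    {"ESSN": "2", "PNO": 10},
    {"ESSN": "3", "PNO": 20},
]
PROJECT = [
    {"PNUMBER": 10, "PNAME": "Aquarius"},
    {"PNUMBER": 20, "PNAME": "ProductX"},
]


def initial_plan():
    """Cartesian product of all three relations first, then all conditions."""
    combined = [{**e, **w, **p} for e, w, p in product(EMPLOYEE, WORKS_ON, PROJECT)]
    rows = [
        t for t in combined
        if t["PNAME"] == "Aquarius" and t["PNUMBER"] == t["PNO"]
        and t["ESSN"] == t["SSN"] and t["BDATE"] > "1957-12-31"
    ]
    return [t["LNAME"] for t in rows], len(combined)


def improved_plan():
    """Selections first, then joins on the reduced relations."""
    projects = [p for p in PROJECT if p["PNAME"] == "Aquarius"]
    employees = [e for e in EMPLOYEE if e["BDATE"] > "1957-12-31"]
    pw = [{**p, **w} for p in projects for w in WORKS_ON if p["PNUMBER"] == w["PNO"]]
    rows = [{**t, **e} for t in pw for e in employees if t["ESSN"] == e["SSN"]]
    return [t["LNAME"] for t in rows], max(len(pw), len(rows))


names_a, largest_a = initial_plan()
names_b, largest_b = improved_plan()
```

On this toy data the first plan builds an 18-row intermediate result, while the second never holds more than 2 rows.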
Using Selectivity and Cost Estimates in Query Optimization
• A query optimizer does not depend solely on heuristic rules; it
also estimates and compares the costs of executing a query
using different execution strategies and algorithms, and it then
chooses the strategy with the lowest cost estimate.
• For this approach to work, accurate cost estimates are required
so that different strategies can be compared fairly and
realistically.
• In addition, the optimizer must limit the number of execution
strategies to be considered; otherwise, too much time will be
spent making cost estimates for the many possible execution
strategies.
• Hence, this approach is more suitable for compiled queries where
the optimization is done at compile time and the resulting execution
strategy code is stored and executed directly at runtime.
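The cost-based choice can be sketched in Python as estimating a cost for each candidate strategy and keeping the cheapest. The two candidate strategies, the simplified block-access formulas (a full scan of b blocks versus about log2(b) block accesses for a binary search on an ordered file), and the function names are assumptions made for this illustration, not the cost model of any specific optimizer.

```python
import math


def estimate_costs(num_blocks, file_is_ordered):
    """Estimated block accesses for an equality selection (simplified, assumed formulas)."""
    costs = {"linear search": num_blocks}
    if file_is_ordered:
        # Binary search is only a candidate when the file is ordered on the attribute.
        costs["binary search"] = math.ceil(math.log2(num_blocks))
    return costs


def choose_strategy(costs):
    """Pick the strategy with the lowest cost estimate."""
    return min(costs, key=costs.get)


ordered_costs = estimate_costs(2000, file_is_ordered=True)
best_when_ordered = choose_strategy(ordered_costs)
best_when_unordered = choose_strategy(estimate_costs(2000, file_is_ordered=False))
```

Limiting the candidate list, as the text notes, keeps the time spent on estimation itself under control.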