DDBMS-Chapter-4-SE-LectureNote (Version 1)
4. Query Processing
Query processing refers to the range of activities involved in extracting data from the database.
The activities include translation of queries in high-level database languages into expressions that
can be used at the physical level of the file system, a variety of query optimizing transformations,
and actual evaluations of queries. A query is a statement requesting the retrieval of information.
Before query processing can begin, the system must translate the query into a usable form. A
language such as SQL is suitable for human use, but is ill suited to be the system’s internal
representation of a query. A more useful internal representation is one based on the extended
relational algebra.
Query processing in a database management system (DBMS) involves transforming a high-level
query, typically written in SQL, into an efficient execution plan that retrieves or manipulates the
required data. The process includes parsing the query to check for syntax errors, optimizing the
query to find the most efficient way to execute it, and executing the optimized plan to produce the
desired results. The main goals are to ensure correctness, optimize performance, and manage
resource usage effectively. In a distributed database, query processing also involves coordinating
data retrieval and updates across multiple networked nodes.
Query processing in a distributed database environment is more difficult than in a centralized
database, because many more parameters affect the overall performance of distributed queries.
Moreover, in a distributed environment, the query response time may become very high.
The main objective of query processing in a distributed environment is to transform a high-level query
on a distributed database, which users see as a single database, into an efficient execution
strategy expressed in a low-level language on the local databases.
An important point of query processing is query optimization. Because many execution strategies
are correct transformations of the same high-level query, the one that minimizes resource
consumption should be retained. The lower-level language in which the strategy is expressed is an
extension of relational algebra with communication and data transfer operators, and the strategy
runs on several local databases.
Steps of query processing
A. Query Parsing and Optimization: The query submitted by the user is parsed to identify its
components such as tables, columns, conditions, etc. After parsing, the query optimizer
generates an execution plan. In a distributed environment, the optimizer needs to consider
factors like data distribution, network latency, and availability of resources across multiple
nodes.
B. Distributed Query Optimization: This step involves optimizing the execution plan generated
in the previous step for distributed execution. It includes deciding on the distribution method
for data, selecting appropriate data replication strategies, and determining the most efficient
way to execute the query considering the distributed nature of the database.
C. Data Localization: In a distributed environment, data might be spread across multiple nodes.
Before executing the query, the system needs to determine where the required data is located.
This step involves identifying the relevant data fragments or partitions and determining which
nodes hold them.
D. Query Execution: Once the data localization is done, the query is executed on the appropriate
nodes where the data resides. In some cases, the query might need to be decomposed into
subqueries that are executed independently on different nodes. The results of these subqueries
are then combined to produce the final result.
E. Data Transfer and Communication: During query execution, there might be a need to
transfer intermediate results between nodes. This involves communication over the network,
which can introduce latency and overhead. Efficient data transfer mechanisms are employed
to minimize these overheads.
F. Result Integration: After all subqueries have been executed and intermediate results obtained,
they need to be integrated to produce the final result of the query. This step involves combining,
aggregating, and sorting the results as required by the query.
G. Transaction Management: In a distributed environment, query processing needs to ensure
transactional consistency across multiple nodes. This involves coordinating distributed
transactions, ensuring atomicity, consistency, isolation, and durability (ACID properties), and
handling concurrency control to prevent conflicts between concurrent transactions.
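The execution and result-integration steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the node names, the salary values, and the helper functions `local_subquery` and `integrate` are hypothetical, and no real network is involved. Each node answers a subquery with a partial aggregate, and the coordinator combines the partial results into the final answer.

```python
# Hypothetical salary fragments stored on three nodes (assumed data).
nodes = {
    "node1": [3000, 4200],
    "node2": [5100],
    "node3": [2800, 3900, 4500],
}

def local_subquery(salaries):
    """Runs at one node: return a small partial aggregate instead of raw rows."""
    return {"count": len(salaries), "total": sum(salaries)}

def integrate(partials):
    """Runs at the coordinator: combine partial aggregates into a global average."""
    count = sum(p["count"] for p in partials)
    total = sum(p["total"] for p in partials)
    return total / count

partials = [local_subquery(rows) for rows in nodes.values()]
average_salary = integrate(partials)
print(round(average_salary, 2))  # 3916.67
```

Shipping counts and totals rather than rows keeps the data transfer small, and it also avoids the mistake of averaging the per-node averages, which would give a different (wrong) result when nodes hold different numbers of rows.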
Relational algebra defines the core set of operations of the relational database model. A relational
algebra expression is a sequence of relational algebra operations. The result of the expression is
the result of a query on the database.
A. Selection
Selection in relational algebra is a fundamental operation used to retrieve specific rows from a
relation (table) that satisfy a given predicate (condition). This operation is analogous to the SQL SELECT
... WHERE ... clause and is denoted by the Greek letter sigma (σ). The selection operation filters
rows based on the specified condition and results in a new relation containing only the rows that
meet the criteria.
The general form of the selection operation is:
σ_{condition}(relation)
where:
σ is the selection operator.
condition is a logical predicate (e.g., Age > 30).
relation is the name of the relation (table) from which rows are being selected.
Example: select all employees who are older than 30:
σ_{Age > 30}(Employees)
The condition in the selection operation can be composed of multiple predicates using
logical connectors such as AND (∧), OR (∨), and NOT (¬). For example, to select
employees who are either in the IT department or older than 30, you would write:
σ_{Department = 'IT' ∨ Age > 30}(Employees)
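The two selections above can be mimicked in Python to make the filtering behavior concrete. This is a minimal sketch: the sample rows and the `select` helper are assumptions made for illustration, with a relation modeled as a list of dictionaries.

```python
# Hypothetical sample data for the Employees relation.
employees = [
    {"Name": "Abel", "Department": "IT", "Age": 28},
    {"Name": "Sara", "Department": "HR", "Age": 35},
    {"Name": "Musa", "Department": "IT", "Age": 41},
]

def select(relation, predicate):
    """sigma_predicate(relation): keep only the rows that satisfy the predicate."""
    return [row for row in relation if predicate(row)]

# sigma_{Age > 30}(Employees)
older_than_30 = select(employees, lambda r: r["Age"] > 30)

# sigma_{Department = 'IT' OR Age > 30}(Employees)
it_or_older = select(employees, lambda r: r["Department"] == "IT" or r["Age"] > 30)

print([r["Name"] for r in older_than_30])  # ['Sara', 'Musa']
print([r["Name"] for r in it_or_older])    # ['Abel', 'Sara', 'Musa']
```

Selection never changes the columns of a relation; it only reduces the number of rows.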
B. Projection
Projection in relational algebra is another fundamental operation used to retrieve specific columns
(attributes) from a relation (table), effectively reducing the table to only those columns of interest.
In relational algebra, the result of a projection operation inherently eliminates duplicate rows. This
operation is analogous to the SQL SELECT column1, column2, ... clause (although SQL keeps
duplicate rows unless DISTINCT is specified) and is denoted by the Greek letter pi (π).
The general form of the projection operation is:
π_{attribute list}(relation)
where:
π is the projection operator.
attribute list is a comma-separated list of attributes (columns) to be included in the result.
relation is the name of the relation (table) from which columns are being selected.
Example: to project only the Name and Department columns from the Employees
relation, you would use the projection operation as follows:
π_{Name, Department}(Employees)
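A short Python sketch can show both effects of projection: dropping columns and eliminating duplicates. The sample rows and the `project` helper are hypothetical; a relation is again modeled as a list of dictionaries.

```python
# Hypothetical sample data for the Employees relation.
employees = [
    {"Name": "Abel", "Department": "IT", "Age": 28},
    {"Name": "Sara", "Department": "HR", "Age": 35},
    {"Name": "Musa", "Department": "IT", "Age": 41},
]

def project(relation, attributes):
    """pi_attributes(relation): keep the listed columns and drop duplicate rows."""
    seen = set()
    result = []
    for row in relation:
        key = tuple(row[a] for a in attributes)
        if key not in seen:  # set semantics: each tuple appears only once
            seen.add(key)
            result.append(dict(zip(attributes, key)))
    return result

name_and_department = project(employees, ["Name", "Department"])
departments = project(employees, ["Department"])

print(len(name_and_department))  # 3
print(departments)               # [{'Department': 'IT'}, {'Department': 'HR'}]
```

Projecting on Department alone returns two rows rather than three, because the two IT rows collapse into one.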
C. Union (∪)
Union in relational algebra is an operation used to combine the tuples (rows) of two relations
(tables) into a single relation, effectively forming the set union of the two sets of tuples. This
operation is analogous to the SQL UNION operator and is denoted by the symbol ∪.
For the union operation to be valid, the two relations must satisfy the following conditions:
➢ Same Arity: Both relations must have the same number of attributes (columns).
➢ Attribute Correspondence: Corresponding attributes in the two relations must have
compatible domains (i.e., the same data type).
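Syntax: R ∪ S
Example: to combine the employees older than 30 from the EmployeesIT and EmployeesHR relations:
σ_{Age > 30}(EmployeesIT) ∪ σ_{Age > 30}(EmployeesHR)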
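The union example can be tried with Python sets, which already follow set semantics. The rows below are made up, and the `union` helper with its arity check is only one assumed way to model the union-compatibility conditions.

```python
# Hypothetical rows of the form (Name, Age).
employees_it = {("Abel", 28), ("Musa", 41), ("Hana", 33)}
employees_hr = {("Sara", 35), ("Hana", 33), ("Lia", 25)}

def select_older_than(relation, age):
    """sigma_{Age > age}(relation) for (Name, Age) tuples."""
    return {row for row in relation if row[1] > age}

def union(r, s):
    """R ∪ S for relations stored as sets of tuples of equal length."""
    arities = {len(row) for row in r | s}
    if len(arities) > 1:
        raise ValueError("relations are not union-compatible (different arity)")
    return r | s

result = union(select_older_than(employees_it, 30),
               select_older_than(employees_hr, 30))
print(sorted(result))  # [('Hana', 33), ('Musa', 41), ('Sara', 35)]
```

The tuple for Hana satisfies the condition in both relations but appears only once in the result.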
D. Intersection
The intersection operation in relational algebra is used to retrieve the rows that are common to two
relations (tables). This operation is analogous to the SQL INTERSECT operator and is denoted
by the symbol ∩. The intersection operation requires that the two relations involved have the same
schema, meaning they must have the same number of attributes, and corresponding attributes must
have compatible data types.
Syntax: relation1 ∩ relation2
where:
➢ ∩ is the intersection operator.
➢ relation1 and relation2 are the names of the relations to be intersected.
Example: to create a new relation that includes only the employees who are in both EmployeesIT
and EmployeesProjects, you can use the intersection operation as follows:
EmployeesIT ∩ EmployeesProjects
The intersection operation can be combined with other relational algebra operations to form
complex queries. For example, to find all unique employees who are both in the IT department
and working on projects, and who are older than 30, you could use a combination of selection and
intersection operations:
σ_{Age > 30}(EmployeesIT) ∩ σ_{Age > 30}(EmployeesProjects)
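The combined query can be checked on a small example. The sketch below uses invented (Name, Age) rows and a hypothetical `older_than_30` helper. It also illustrates that applying the same selection before or after the intersection gives the same result.

```python
# Hypothetical rows of the form (Name, Age).
employees_it = {("Abel", 28), ("Musa", 41), ("Hana", 33)}
employees_projects = {("Musa", 41), ("Hana", 33), ("Abel", 28), ("Lia", 45)}

def older_than_30(relation):
    """sigma_{Age > 30}(relation) for (Name, Age) tuples."""
    return {row for row in relation if row[1] > 30}

# Select first, then intersect (the expression given in the text).
select_then_intersect = older_than_30(employees_it) & older_than_30(employees_projects)

# Intersect first, then select: an equivalent ordering.
intersect_then_select = older_than_30(employees_it & employees_projects)

print(sorted(select_then_intersect))                   # [('Hana', 33), ('Musa', 41)]
print(select_then_intersect == intersect_then_select)  # True
```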
E. Minus
The minus operation in relational algebra, also known as the set difference operation, is used to
retrieve the rows that are present in one relation but not in another. This operation is analogous to
the SQL EXCEPT operator and is denoted by the symbol −.
Syntax: relation1 − relation2
where:
➢ − is the minus operator.
➢ relation1 and relation2 are the names of the relations involved in the operation.
Example: Consider two relations, Employees and ProjectMembers, that have the same schema.
To create a new relation that includes only the employees who are not project members, you can
use the minus operation as follows:
Employees − ProjectMembers
For the minus operation to be valid, the following conditions must be met:
➢ Same Degree: The relations must have the same number of attributes.
➢ Attribute Correspondence: Corresponding attributes must have compatible data types.
The minus operation can be combined with other relational algebra operations to form complex
queries. For example, to find all employees in the IT department who are not working on any
projects and are older than 30, you could use a combination of selection and minus operations:
σ_{Age > 30}(EmployeesIT) − EmployeesProjects
This combination first selects employees older than 30 from the EmployeesIT relation and then
subtracts those who are in the EmployeesProjects relation.
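As a rough illustration, the selection-then-minus query can be run on Python sets. The rows are hypothetical. The second half of the sketch shows a standard identity: intersection can be expressed with minus alone, since R ∩ S = R − (R − S).

```python
# Hypothetical rows of the form (Name, Age).
employees_it = {("Abel", 28), ("Musa", 41), ("Hana", 33)}
employees_projects = {("Hana", 33), ("Lia", 45)}

# sigma_{Age > 30}(EmployeesIT) − EmployeesProjects
older_it = {row for row in employees_it if row[1] > 30}
not_on_projects = older_it - employees_projects
print(sorted(not_on_projects))  # [('Musa', 41)]

# Intersection written with minus only: R ∩ S = R − (R − S)
via_minus = employees_it - (employees_it - employees_projects)
print(via_minus == (employees_it & employees_projects))  # True
```

Unlike union and intersection, minus is not commutative: swapping the operands generally gives a different relation.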
F. Join
The join operation in relational algebra is used to combine related tuples from two relations (tables)
into a single relation. There are several types of join operations, but the most common is the natural
join, which is denoted by the symbol ⋈. The join operation is crucial for querying multiple
relations to obtain a comprehensive dataset that includes related information from both.
Natural Join (⋈)
The natural join combines tuples from two relations by implicitly using all attributes with the same
name as the join condition. Duplicate columns are eliminated in the result.
Syntax: relation1 ⋈ relation2
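A nested-loop version of the natural join is easy to sketch in Python. The relations, the attribute names (DeptId, DeptName), and the `natural_join` helper below are assumptions for illustration; real systems would normally use more efficient join algorithms.

```python
# Hypothetical relations sharing the attribute DeptId.
employees = [
    {"Name": "Abel", "DeptId": 1},
    {"Name": "Sara", "DeptId": 2},
    {"Name": "Musa", "DeptId": 3},
]
departments = [
    {"DeptId": 1, "DeptName": "IT"},
    {"DeptId": 2, "DeptName": "HR"},
]

def natural_join(r, s):
    """Match rows of r and s on all attributes the two relations share by name."""
    if not r or not s:
        return []
    common = [a for a in r[0] if a in s[0]]
    result = []
    for row_r in r:
        for row_s in s:
            if all(row_r[a] == row_s[a] for a in common):
                result.append({**row_r, **row_s})  # shared columns appear once
    return result

joined = natural_join(employees, departments)
print(joined)
# [{'Name': 'Abel', 'DeptId': 1, 'DeptName': 'IT'},
#  {'Name': 'Sara', 'DeptId': 2, 'DeptName': 'HR'}]
```

Musa does not appear in the result because no department has DeptId 3; the natural join keeps only matching pairs.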
G. Cartesian product
The Cartesian product in relational algebra, also known as the cross product or cross join, is an
operation that returns all possible combinations of tuples from two relations. This operation is
denoted by the symbol ×. The Cartesian product of two relations relation1 and
relation2 is a new relation that consists of all possible pairs of tuples where the first
element of the pair is a tuple from relation1 and the second element is a tuple from
relation2.
Syntax: relation1 × relation2
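The size of a Cartesian product is the product of the sizes of its operands, which a small Python sketch makes visible. The two relations here are invented single-attribute examples, and `itertools.product` from the standard library does the pairing.

```python
from itertools import product

# Hypothetical single-attribute relations.
colors = [("red",), ("blue",)]
sizes = [("S",), ("M",), ("L",)]

# relation1 × relation2: concatenate every tuple of colors with every tuple of sizes.
pairs = [c + s for c, s in product(colors, sizes)]

print(len(pairs))  # 6
print(pairs[0])    # ('red', 'S')
```

Because the result grows multiplicatively, query processors try to avoid plain Cartesian products and prefer joins that combine only matching tuples.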
A. Query Decomposition
The first layer transforms the calculus query into an algebraic query on global relations, using
information from the global conceptual schema that describes these relations. This transformation
does not yet consider data distribution, which is addressed in the next layer. Thus, this layer
employs techniques typical of a centralized DBMS. Query decomposition involves four successive
steps:
Normalization: The calculus query is rewritten in a normalized form for easier manipulation. This
process typically involves adjusting query quantifiers and applying logical operator priority.
Semantic Analysis: The normalized query is analyzed semantically to detect and reject incorrect
queries as early as possible. Detection techniques exist only for a subset of relational calculus and
often use a graph to capture the query's semantics.
Simplification: The correct query is simplified by eliminating redundant predicates, which often
arise from system transformations applied for semantic data control (such as views, protection,
and integrity control).
Restructuring: The simplified calculus query is restructured into an algebraic query. Since
multiple algebraic queries can be derived from the same calculus query, the goal is to find one that
offers better performance. This process begins with an initial algebraic query, derived directly by
translating predicates and target statements into relational operators. This initial query is then
optimized using transformation rules. This layer typically avoids inefficient executions (for
example, by accessing a relation only once even when several select predicates apply to it), but it
does not yet achieve optimal execution because information about data distribution and
fragment allocation is not available at this stage.
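One commonly used transformation rule in the restructuring step is to push a selection below a join. The sketch below is an illustration only: the relations EMP(eno, name, dept) and ASG(eno, project), their rows, and the `join_on_eno` helper are all assumed. It shows that both algebraic queries return the same result while the restructured one feeds fewer tuples into the join.

```python
# Hypothetical relations: EMP(eno, name, dept) and ASG(eno, project).
emp = [(1, "Abel", "IT"), (2, "Sara", "HR"), (3, "Musa", "IT")]
asg = [(1, "P1"), (2, "P1"), (3, "P2"), (3, "P3")]

def join_on_eno(left, right):
    """Join two relations on their first attribute (eno)."""
    return [l + r[1:] for l in left for r in right if l[0] == r[0]]

# Initial algebraic query: join first, then select dept = 'IT'.
joined = join_on_eno(emp, asg)
plan_a = [t for t in joined if t[2] == "IT"]

# Restructured query: select dept = 'IT' first, then join.
it_emp = [t for t in emp if t[2] == "IT"]
plan_b = join_on_eno(it_emp, asg)

print(plan_a == plan_b)        # True
print(len(emp), len(it_emp))   # 3 2  (EMP tuples fed into the join)
print(len(joined))             # 4    (intermediate result built by plan A)
```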
B. Data Localization
The input to the second layer is an algebraic query on global relations. The primary function of
this layer is to localize the query’s data using the data distribution information found in the
fragment schema. Relations are fragmented and stored in disjoint subsets, known as fragments,
each located at a different site. This layer identifies which fragments are involved in the query and
converts the distributed query into a query on these fragments. Fragmentation is defined by
fragmentation predicates, which can be expressed through relational operators. A global relation
can be reconstructed by applying fragmentation rules and then deriving a localization program of
relational algebra operators, which act on the fragments. Generating a query on fragments involves
two steps:
Mapping to a Fragment Query: The query is mapped to a fragment query by substituting each
relation with its reconstruction (or materialization) program.
Simplification and Restructuring: The fragment query is then simplified and restructured to
produce a more efficient query. This is done using similar rules as those applied in the
decomposition layer.
Just like in the decomposition layer, the final fragment query is generally not optimal because
detailed information regarding fragments is not fully utilized.
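The two localization steps can be sketched for a horizontally fragmented relation. Everything in the example is hypothetical: the fragment names, sites, `eno` ranges, rows, and helper functions. The reconstruction program is a union of the fragments, and simplification drops the fragments whose fragmentation predicate contradicts the query predicate.

```python
# Hypothetical horizontal fragmentation of EMP(eno, name) by eno ranges.
fragments = {
    "EMP1": {"site": "site1", "range": (1, 100), "rows": [(7, "Abel"), (42, "Sara")]},
    "EMP2": {"site": "site2", "range": (101, 200), "rows": [(150, "Musa")]},
    "EMP3": {"site": "site3", "range": (201, 300), "rows": [(250, "Hana")]},
}

def reconstruct(frags):
    """Reconstruction program: EMP = EMP1 ∪ EMP2 ∪ EMP3."""
    rows = set()
    for fragment in frags.values():
        rows |= set(fragment["rows"])
    return rows

def relevant_fragments(frags, low, high):
    """Keep only fragments whose eno range can overlap the query range [low, high]."""
    return [name for name, f in frags.items()
            if f["range"][0] <= high and low <= f["range"][1]]

# Query: select the employees with eno between 120 and 180.
needed = relevant_fragments(fragments, 120, 180)
answer = [row for name in needed
          for row in fragments[name]["rows"] if 120 <= row[0] <= 180]

print(len(reconstruct(fragments)))  # 4
print(needed)                       # ['EMP2']
print(answer)                       # [(150, 'Musa')]
```

Only site2 has to be contacted for this query; the selections on EMP1 and EMP3 would produce empty relations and are removed from the fragment query.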
C. Global Query Optimization
The input to the third layer is an algebraic query on fragments. The goal of this layer, query
optimization, is to find an execution strategy for the query that is close to optimal, recognizing that
finding the absolute optimal solution is computationally intractable. An execution strategy for a
distributed query is described using relational algebra operators and communication primitives
(send/receive operators) for transferring data between sites.
While the previous layers have optimized the query by eliminating redundant expressions, this
optimization does not account for fragment-specific characteristics such as fragment allocation
and cardinalities. Additionally, communication operators are not yet specified.
By reordering the operators within a query on fragments, many equivalent queries can be
generated. Query optimization involves finding the best ordering of operators, including
communication operators, to minimize a cost function. This cost function, typically measured in
time units, accounts for computing resources such as disk space, disk I/Os, buffer space, CPU cost,
and communication cost. It is usually a weighted combination of I/O, CPU, and communication
costs. Historically, early distributed DBMSs prioritized communication cost due to the high
expense of data transfer over wide area networks with limited bandwidth. However, in modern
systems, communication cost can sometimes be lower than I/O cost.
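The weighted cost function described above can be illustrated with a small Python sketch. The two candidate plans, their resource counts, and both sets of weights are invented numbers chosen only to show how the preferred plan can change with the relative weight of communication; they are not measurements from any system.

```python
def total_cost(plan, w_io, w_cpu, w_comm):
    """Weighted combination of I/O, CPU, and communication costs (arbitrary time units)."""
    return (w_io * plan["io_ops"]
            + w_cpu * plan["cpu_ops"]
            + w_comm * plan["bytes_sent"])

# Hypothetical candidate plans for the same query.
plans = {
    "ship_whole_relation": {"io_ops": 100, "cpu_ops": 1000, "bytes_sent": 50000},
    "filter_then_ship": {"io_ops": 120, "cpu_ops": 3000, "bytes_sent": 5000},
}

def best_plan(w_io, w_cpu, w_comm):
    """Return the name of the plan with the lowest weighted cost."""
    return min(plans, key=lambda name: total_cost(plans[name], w_io, w_cpu, w_comm))

# Assumed weights for a slow wide area network: communication dominates.
print(best_plan(w_io=10, w_cpu=1, w_comm=5))      # filter_then_ship

# Assumed weights for a fast local network: local processing dominates.
print(best_plan(w_io=1000, w_cpu=100, w_comm=1))  # ship_whole_relation
```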
To select the optimal ordering of operators, it is necessary to predict the execution costs of
alternative candidate orderings. Static optimization, performed before query execution, relies on
fragment statistics and formulas for estimating the cardinalities of the results of relational
operators. Optimization decisions are thus based on the allocation of fragments and available
statistics on fragments recorded in the allocation schema.
An important aspect of query optimization is join ordering, because different orderings of the same
joins can differ significantly in cost. A basic technique for optimizing distributed join
operations is the semijoin operator, which reduces the size of join operands and, consequently, the
communication cost. However, techniques that consider both local processing costs and
communication costs might avoid semijoins if they increase local processing costs.
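The effect of a semijoin on communication cost can be sketched as follows. The relations R(eno, name) and S(eno, project), their placement at two sites, and the crude cost measure (number of attribute values shipped) are assumptions for illustration. The join result is needed at site 2, where S is stored.

```python
# R(eno, name) is stored at site 1; S(eno, project) is stored at site 2.
r = [(i, f"emp{i}") for i in range(1, 101)]   # 100 tuples
s = [(5, "P1"), (17, "P2"), (17, "P3")]       # 3 tuples

def join(left, right):
    """Join two relations on their first attribute (eno)."""
    return [l + rt[1:] for l in left for rt in right if l[0] == rt[0]]

# Strategy 1: ship all of R to site 2 and join there.
shipped_naive = sum(len(t) for t in r)
result_naive = join(r, s)

# Strategy 2: use a semijoin.
s_keys = {t[0] for t in s}                      # site 2 -> site 1: join column of S
r_reduced = [t for t in r if t[0] in s_keys]    # R semijoin S, computed at site 1
shipped_semijoin = len(s_keys) + sum(len(t) for t in r_reduced)
result_semijoin = join(r_reduced, s)            # site 1 -> site 2, then join

print(result_naive == result_semijoin)  # True
print(shipped_naive, shipped_semijoin)  # 200 6
```

The semijoin pays off here because few R tuples take part in the join. It adds an extra message and extra local work at site 1, so it may not be worthwhile when most tuples of R match.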
The output of the query optimization layer is an optimized algebraic query with communication
operators included for fragments. This is typically represented and saved as a distributed query
execution plan for future executions.
D. Distributed Query Execution
The final layer involves all the sites that have fragments participating in the query. Each subquery
executed at a site, known as a local query, is optimized using the local schema of that site and then
executed. At this stage, algorithms to perform the relational operators are selected. Local
optimization utilizes the algorithms of centralized systems.
The goal of distributed query processing can be summarized as follows: given a calculus query on
a distributed database, the objective is to find an execution strategy that minimizes a system cost
function, which includes I/O, CPU, and communication costs. This execution strategy is specified
in terms of relational algebra operators and communication primitives (send/receive) applied to
the local databases (i.e., the relation fragments). Therefore, the complexity of relational operators
that affect the performance of query execution is of major importance in the design of a query
processor.
Figure: Distributed query processing methodology