E-Notes
on
DATABASE MANAGEMENT SYSTEMS
B. Tech (CSE/IT)
Semester V
CODE: PCC-CS-501
SUBJECT NAME: DATABASE MANAGEMENT SYSTEMS
Pre-requisites: Operating Systems
Course Objectives:
1. To understand the different issues involved in the design and implementation of a database system.
2. To study physical and logical database design, database modelling, and the relational, hierarchical, and network models.
3. To understand and use a data manipulation language to query, update, and manage a database.
4. To develop an understanding of essential DBMS concepts such as database security, integrity, concurrency, distributed databases, intelligent databases, client/server (database server) architecture, and data warehousing.
5. To design and build a simple database system and demonstrate competence with the fundamental tasks involved in modelling, designing, and implementing a DBMS.
MODULE-1: Database system architecture: Data Abstraction, Data Independence, Data Definition
Language (DDL), Data Manipulation Language (DML). Data models: Entity-relationship model,
network model, relational and object oriented data models, integrity constraints, data manipulation
operations.
MODULE-2: Relational query languages: Relational algebra, Tuple and domain relational calculus,
SQL3, DDL and DML constructs, Open source and Commercial DBMS - MYSQL, ORACLE, DB2,
SQL server. Relational database design: Domain and data dependency, Armstrong's axiom, Normal
forms, Dependency preservation, Lossless design. Query processing and optimization: Evaluation of
relational algebra expressions, Query equivalence, Join strategies, Query optimization algorithms.
MODULE-3: Storage strategies: Indices, B-trees, hashing.
MODULE-4: Transaction processing: Concurrency control, ACID property, Serializability of
scheduling, Locking and timestamp based schedulers, Multi-version and optimistic Concurrency
Control schemes, Database recovery.
MODULE-5: Database Security: Authentication, Authorization and access control, DAC, MAC and
RBAC models, Intrusion detection, SQL injection.
MODULE-6: Advanced topics: Object oriented and object relational databases, Logical databases,
Web databases, Distributed databases, Data warehousing and data mining.
Table of Contents
MODULE-1
MODULE-2
  DEPENDENCY PRESERVATION
    Discussion with Example of Non-dependency Preserving Decomposition
    Example
  7. FINDING ATTRIBUTE CLOSURE AND CANDIDATE KEYS USING FUNCTIONAL DEPENDENCIES
    What is Functional Dependency?
  6.1 LOSSLESS DECOMPOSITION
    Example
  2.6 QUERY PROCESSING
    2.6.1 General Strategies of Query Processing
    Query Parsing and Translation
    Query Optimization and the Query Processor
    Step 1 − Query Tree Generation
    Step 2 − Query Plan Generation
    Step 3 − Code Generation
    2.6.2 Approaches to Query Optimization
    2.6.3 Query Processing Steps
  2.7 QUERY OPTIMIZATION
    Automatic Optimization vs. Human Programmer
    The Optimization Process
  2.8 QUERY EQUIVALENCE
    Selection Operations
    Conjunctive Selection
    Sequence of Projection Operations
    Selection with Cartesian Product and Joins
    Natural Join with Θ Join
      Associative Natural Joins
    Distributive Selection Operation
    Distributive Projection over Theta Join
    Commutative Set Operation
    Associative Set Operation
    Distributive Selection Operation over Set Operators
    Distributive Projection over Union Operator
    Translating SQL Queries into Relational Algebra
  2.9 ALGORITHMS FOR EXTERNAL SORTING
    Algorithms for SELECT and JOIN Operations
      Implementing the SELECT Operation
      Implementing the JOIN Operation
    Algorithms for PROJECT and Set Operations
    Implementing Aggregate Operations and Outer Joins
      Implementing Aggregate Operations
      Implementing Outer Join
    Combining Operations Using Pipelining
    Using Heuristics in Query Optimization
      Notation for Query Trees and Query Graphs
      Heuristic Optimization of Query Trees
    Using Selectivity and Cost Estimates in Query Optimization
      Cost Components for Query Execution
      Catalog Information Used in Cost Functions
MODULE-6
MODULE-1
In 1971, the Database Task Group (DBTG), appointed by the Conference on Data Systems and Languages (CODASYL), produced a proposal for a general architecture for database systems. In 1975, ANSI-SPARC (American National Standards Institute - Standards Planning and Requirements Committee) produced a three-level architecture with a system catalog. As shown in Figure 2.1, it consists of three levels, namely:
1. Internal Level
2. Conceptual Level
3. External Level
1.2 External, Conceptual and Internal Levels:
1.2.1 Internal Level
The internal level is the physical representation of the database on the computer, and this view is found at the lowest level of abstraction of the database. This level indicates how the data will be stored in the database and describes the data structures, file structures and access methods to be used by the database. It describes the way the DBMS and the operating system perceive the data in the database. Just below the internal level there is the physical level of data organization, whose implementation is covered by the internal level to achieve good performance and storage space utilization.
The internal schema defines the internal level (or view). The internal schema contains the definition of the stored records, the method of representing the data fields (or attributes), the indexing and hashing schemes, and the access methods used. The internal level provides coverage of the data structures and file organisations used to store data on storage devices.
Essentially, the internal schema summarizes how the relations described in the conceptual schema are actually stored on secondary storage devices such as disks and tapes. It interfaces with the operating system access methods (also called file management techniques for storing and retrieving data records) to place the data on the storage devices, build the indexes and retrieve the data.
The process of arriving at a good internal (or physical) schema is called physical database design. The internal schema is written using SQL or an internal data definition language (internal DDL).
1.2.2 Conceptual Level
The conceptual level is the middle level in the three-level architecture. At this level of database abstraction, all the database entities and the relationships among them are included. The conceptual level provides the community view of the database and describes what data is stored in the database and what relationships exist among the data. It contains the logical structure of the entire database as seen by the DBA. One conceptual view represents the entire database of an organization. It is a complete view of the data requirements of the organisation that is independent of any storage considerations. The conceptual level describes the conceptual view of the database. It is also called the logical schema. There is only one conceptual schema per database. This schema contains the method of deriving the objects in the conceptual view from the objects in the internal view.
The conceptual level supports each external view, in that any data available to a user must be contained in, or derivable from, the conceptual level. This level, though, does not contain any information that is storage dependent. For example, an entity description should contain only the data types of attributes and their length, but not any storage considerations such as the number of bytes occupied.
1.2.3 External Level
The external level is the user's view of the database. This level is at the highest level of data abstraction, where only those portions of the database of concern to a user or application program are included. In other words, this level describes that part of the database which is relevant to the user. Any number of user views, even identical ones, may exist for a given conceptual or global view of the database. The external view includes only those entities, attributes and relationships in the "real world" that the user is interested in. The entities, attributes, etc. which are of no concern to the user may exist in the database, but the user is unaware of their existence.
At the external level, different views may have different representations of the same data. For example, one user may view dates in the form dd/mm/yyyy, while another may view them in the format mm/dd/yyyy. An external schema describes each external view. The external schema consists of the definition of the logical records and the relationships in the external view. It also contains the method of deriving the objects (for example, entities, attributes and relationships) in the external view from the objects in the conceptual view. External schemas allow data access to be customized at the level of individual users or groups of users. Any given database has exactly one internal or physical schema and one conceptual schema, because it has just one set of stored relations, but it may have several external schemas. The external schema is written using an external data definition language (external DDL).
Immunity of the conceptual (or external) schemas to changes in the internal schema is referred to as physical data independence. In physical data independence, the conceptual schema insulates the users from changes in the physical storage of the data. Changes to the internal schema, such as using a different file organisation (like heap, hashed, indexed sequential, etc.) or storage structures, using different storage devices, or modifying indexes or hashing algorithms, must be possible without changing the conceptual or external schemas. In other words, physical data independence indicates that the physical storage structures or devices used for storing the data could be changed without necessitating a change in the conceptual view or any of the external views. The actual data is stored in bit format on the disk. Physical data independence is the power to change the physical data without impacting the schema or logical data. For example, if we want to change or upgrade the storage system itself − suppose we want to replace hard disks with SSDs − it should not have any impact on the logical data or schemas. The change is absorbed by the conceptual/internal mapping.
Immunity of the external schemas (or application programs) to changes in the conceptual schema is referred to as logical data independence. In logical data independence, the users are shielded from changes in the logical structure of the data or changes in the choice of relations to be stored. Changes to the conceptual schema, such as the addition and deletion of entities, addition and deletion of attributes, or addition and deletion of relationships, must be possible without changing existing external schemas or having to rewrite application programs. Only the view definition and the mapping need be changed in a DBMS that supports logical data independence. It is important that the users for whom the changes have been made are not affected. In other words, the application programs that refer to the external schema constructs must work as before, after the conceptual schema undergoes a logical reorganisation.
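For illustration, the following SQL sketch shows the idea; the STUDENT table, the STUDENT_CONTACT view and the index name are hypothetical examples, not objects defined elsewhere in these notes.
-- Conceptual schema: a base table.
CREATE TABLE STUDENT (
   ROLL_NO INT PRIMARY KEY,
   NAME    VARCHAR(50),
   DOB     DATE
);
-- External schema: a view exposes only part of the conceptual schema to one user group.
CREATE VIEW STUDENT_CONTACT AS
SELECT ROLL_NO, NAME FROM STUDENT;
-- Logical data independence: if a column (say EMAIL) is later added to STUDENT,
-- programs that use only STUDENT_CONTACT continue to work unchanged.
-- Physical data independence: adding an index changes only the internal schema;
-- neither the table definition nor the view needs to change.
CREATE INDEX IDX_STUDENT_NAME ON STUDENT (NAME);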
A DBMS provides several database languages, including:
Data definition language (DDL)
View definition language (VDL)
Data manipulation language (DML)
Fourth-generation language (4GL)
In practice, the data definition and data manipulation languages are not two separate languages. Instead, they simply form parts of a single database language, and a comprehensive integrated language is used, such as the widely used structured query language (SQL). SQL represents a combination of DDL, VDL and DML, as well as statements for constraint specification and schema evolution. It includes constructs for conceptual schema definition, view definition, and data manipulation.
Data definition (also called description) language (DDL) is used to specify a database conceptual schema using a set of definitions. It supports the definition or declaration of database objects (or data elements). DDL allows the DBA or user to describe and name the entities, attributes and relationships required for the application, together with any associated integrity and security constraints.
Various techniques are available for writing data definition language statements. One widely used technique is writing DDL into a text file (similar to a source program written using programming languages). Other methods use a DDL compiler or interpreter to process the DDL file or statements in order to identify the descriptions of the schema constructs and to store the schema description in the DBMS catalogue.
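A short sketch of how DDL, VDL and DML statements look in SQL; the DEPARTMENT table and DEPT_NAMES view are hypothetical examples used only for illustration.
-- DDL: define a schema object.
CREATE TABLE DEPARTMENT (
   DEPT_ID   INT PRIMARY KEY,
   DEPT_NAME VARCHAR(30) NOT NULL
);
-- VDL: in SQL, view definition is also expressed with CREATE VIEW.
CREATE VIEW DEPT_NAMES AS SELECT DEPT_NAME FROM DEPARTMENT;
-- DML: manipulate the stored data.
INSERT INTO DEPARTMENT (DEPT_ID, DEPT_NAME) VALUES (10, 'Sales');
SELECT DEPT_NAME FROM DEPARTMENT WHERE DEPT_ID = 10;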
The fourth-generation language (4GL) is a compact (shorthand), efficient and non-procedural programming language that is used to improve the productivity of the DBMS. In a 4GL, the user defines what is to be done and not how it is to be done. A 4GL depends on higher-level 4GL tools, which are used by the users to define parameters to generate an application program. A 4GL has the following components built into it:
Query languages
Report generators
Spreadsheets
Database languages
Application generators to define operations such as insert, retrieve and update data from the database to build applications
High-level languages to generate application programs
Structured query language (SQL) and query by example (QBE) are examples of fourth-generation languages.
The Entity-Relationship (ER) model was originally proposed by P. P. Chen in 1976 as a way to unify the network and relational database views. Simply stated, the ER model is a conceptual data model that views the real world as entities and relationships. A basic component of the model is the Entity-Relationship diagram, which is used to visually represent data objects. Since Chen wrote his paper, the model has been extended and today it is commonly used for database design. For the database designer, the utility of the ER model is:
It maps well to the relational model. The constructs used in the ER model can easily be transformed into relational tables.
It is simple and easy to understand with a minimum of training. Therefore, the model can be used by the database designer to communicate the design to the end user.
In addition, the model can be used as a design plan by the database developer to implement a data model in specific database management software.
ER modelling is a high-level conceptual data model which is independent of any particular DBMS and hardware platform. It is a top-down approach to database design. It is very useful in mapping the meanings and interactions of a real-world enterprise onto a conceptual schema. It is built around entities, relationships and the attributes that describe them. Each such attribute is associated with a value-set (domain) and can take values from this value-set. The ER model views the real world as a construct of entities and associations between entities. Each of the above concepts is discussed in detail in the forthcoming sections.
ER Notation
There is no standard for representing data objects in ER diagrams. Each modelling methodology uses
its own notation. All notational styles represent entities as rectangular boxes and relationships as lines
connecting boxes. Each style uses a special set of symbols to represent the cardinality of a connection.
The notation used in this document is from Martin. The symbols used for the basic ER constructs are:
• Entities are represented by labelled rectangles. The label is the name of the entity. Entity names
should be singular nouns.
• Relationships are represented by a solid line connecting two entities. The name of the relationship is
written above the line. Relationship names should be verbs.
• Attributes, when included, are listed inside the entity rectangle. Attributes which are identifiers are
underlined. Attribute names should be singular nouns.
• Cardinality of many is represented by a line ending in a crow's foot. If the crow's foot is omitted, the
cardinality is one.
• Existence is represented by placing a circle or a perpendicular bar on the line. Mandatory existence is shown by the bar (which looks like a 1) next to the entity for which an instance is required. Optional existence is shown by placing a circle next to the entity that is optional.
Entities
Entities are the principal data objects about which information is to be collected. Entities are usually recognizable concepts, either concrete or abstract, such as persons, places, things, or events which have relevance to the database. Some specific examples of entities are EMPLOYEES, PROJECTS and INVOICES. An entity is analogous to a table in the relational model.
Entities are classified as independent or dependent (in some methodologies, the terms used are strong and weak, respectively). An independent entity is one that does not rely on another for identification. A dependent entity is one that relies on another for identification.
Associative entities (also known as intersection entities) are entities used to associate two or more entities in order to reconcile a many-to-many relationship.
Subtype entities are used in generalization hierarchies to represent a subset of instances of their parent entity, called the supertype, but which have attributes or relationships that apply only to the subset.
Entity Set
An entity set (also called an entity type) is a set of entities of the same type that share the same properties or attributes. In E-R modelling, similar entities are grouped into an entity type. An entity type is a group of objects with the same properties, which are identified by the enterprise as having an independent existence. It can contain objects with a physical (or real) existence or objects with a conceptual (or abstract) existence. Each entity type is identified by a name and a list of properties. A database normally contains many different entity types. The word 'entity' in E-R modelling corresponds to an entity type (a table in the relational environment) and not to a single entity occurrence (a row). The E-R model refers to a specific table row as an entity instance or entity occurrence.
An entity occurrence (also called an entity instance) is a uniquely identifiable object of an entity type.
The diagram below shows a conceptual model with three entity types: Book, Publisher, and Author.
Attributes
Attributes describe the entity with which they are associated. A particular instance of an attribute is a value. An attribute is a property of an entity or a relationship type. All entities in a given entity type have the same attributes. For example, the EMPLOYEE entity type could have attributes such as name (NAME) and date of birth (DOB). The domain of an attribute is the collection of all possible values the attribute can have; the domain of NAME is a character string. Attributes can be classified as identifiers or descriptors. Identifiers, more commonly called keys, uniquely identify an instance of an entity. A descriptor describes a non-unique characteristic of an entity instance.
Attributes can be of the following types:
Simple attributes
Composite attributes
Single-valued attributes
Multi-valued attributes
Identifier attribute
Derived attributes
Simple attributes (or atomic attribute) (represented by simple ellipse): A simple attribute is an
attribute composed of a single component with an independent existence. A simple attribute cannot be
further divided into smaller components.
Composite attributes (represented in branched form): A composite attribute is an attribute composed of multiple components, each with an independent existence. Some attributes can be further broken down or divided into smaller components with an independent existence of their own. It could be a simple composite or it can form a hierarchy.
Single-valued attributes: A single-valued attribute is an attribute that holds a single value for each occurrence of an entity set. For example: Emp_id, Roll_no, Project_id, etc.
Multi-valued attributes (represented by a double ellipse): A multi-valued attribute is an attribute that can hold multiple values for each occurrence of an entity set. For example, an employee may have several phone numbers.
Identifier attribute (represented by underlined text in an ellipse): Each entity is required to be identified uniquely in a particular entity set. This identification is done using one or more entity attributes, known as the identifier attribute.
Derived attribute (represented by a dotted ellipse): A derived attribute is an attribute that represents a value that is derived from the value of a related attribute or set of attributes, not necessarily in the same entity set. For example, if we have date of joining as an attribute, then we can derive the work experience of the employee, and this experience attribute is the derived attribute.
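To make the attribute types concrete, the following sketch maps a hypothetical EMPLOYEE entity type to relational tables; the table and column names are assumptions used only for illustration.
CREATE TABLE EMPLOYEE (
   EMP_ID          INT PRIMARY KEY,   -- identifier attribute
   FIRST_NAME      VARCHAR(30),       -- components of the composite attribute "name"
   LAST_NAME       VARCHAR(30),
   DOB             DATE,              -- simple, single-valued attribute
   DATE_OF_JOINING DATE               -- stored attribute from which experience is derived
);
-- A multi-valued attribute (e.g. phone numbers) is placed in a separate table.
CREATE TABLE EMPLOYEE_PHONE (
   EMP_ID INT,
   PHONE  VARCHAR(15),
   PRIMARY KEY (EMP_ID, PHONE)
);
-- A derived attribute such as work experience is computed from DATE_OF_JOINING
-- in a query rather than stored (date arithmetic syntax varies by DBMS).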
Relationships
A relationship is an association among two or more entities that is of interest to the enterprise.
Relationship Set
A set of relationships of a similar type is called a relationship set. Like entities, a relationship too can have attributes. These attributes are called descriptive attributes.
Degree of Relationship
The number of participating entities in a relationship defines the degree of the relationship; that is, the degree of a relationship is the number of entity types associated with the relationship. The n-ary relationship is the general form for degree n. Special cases are the binary and ternary relationships, where the degree is 2 and 3, respectively.
A binary relationship, the association between two entities, is the most common type in the real world. A recursive binary relationship occurs when an entity is related to itself. An example might be "some employees are married to other employees".
A ternary relationship involves three entities and is used when a binary relationship is inadequate. Many modelling approaches recognize only binary relationships; ternary or n-ary relationships are decomposed into two or more binary relationships.
Recursive or unary relationship
Binary or degree 2 relationship
Ternary or degree 3 relationship
n-ary or n-degree relationship
Mapping Cardinalities
Cardinality defines the number of entities in one entity set which can be associated with the number of entities of another set via a relationship set. The connectivity of a relationship describes the mapping of associated entity instances in the relationship. The values of connectivity are "one" or "many". The cardinality of a relationship is the actual number of related occurrences for each of the two entities. The basic types of connectivity for relationships are: one-to-one, one-to-many, and many-to-many.
One-to-one − One entity from entity set A can be associated with at most one entity of entity set B, and vice versa. For example, employees in the company are each assigned their own office: for each employee there exists a unique office and for each office there exists a unique employee.
One-to-many − One entity from entity set A can be associated with more than one entity of entity set B; however, an entity from entity set B can be associated with at most one entity of entity set A.
Many-to-one − More than one entity from entity set A can be associated with at most one entity of entity set B; however, an entity from entity set B can be associated with more than one entity from entity set A.
Many-to-many − One entity from entity set A can be associated with more than one entity of entity set B, and vice versa.
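The following sketch shows how these cardinalities are typically realised with keys in SQL; the DEPARTMENT, EMPLOYEE, STUDENT, COURSE and ENROLLMENT tables are hypothetical examples.
-- One-to-many: each EMPLOYEE row references exactly one DEPARTMENT,
-- while one department may be referenced by many employees.
CREATE TABLE DEPARTMENT (
   DEPT_ID   INT PRIMARY KEY,
   DEPT_NAME VARCHAR(30)
);
CREATE TABLE EMPLOYEE (
   EMP_ID  INT PRIMARY KEY,
   NAME    VARCHAR(30),
   DEPT_ID INT REFERENCES DEPARTMENT (DEPT_ID)
);
-- Many-to-many: resolved through an associative (intersection) table.
CREATE TABLE STUDENT (
   ROLL_NO INT PRIMARY KEY,
   SNAME   VARCHAR(30)
);
CREATE TABLE COURSE (
   COURSE_ID INT PRIMARY KEY,
   TITLE     VARCHAR(30)
);
CREATE TABLE ENROLLMENT (
   ROLL_NO   INT REFERENCES STUDENT (ROLL_NO),
   COURSE_ID INT REFERENCES COURSE (COURSE_ID),
   PRIMARY KEY (ROLL_NO, COURSE_ID)
);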
ER Diagrams
Steps in Building the Data Model
While the ER model lists and defines the constructs required to build a data model, there is no standard process for doing so. Some methodologies, such as IDEF1X, specify a bottom-up development process where the model is built in stages. Typically, the entities and relationships are modelled first, followed by key attributes, and then the model is finished by adding non-key attributes. Other experts argue that in practice, using a phased approach is impractical because it requires too many meetings with the end users. As noted above, the requirements analysis and the draft of the initial ER diagram often occur simultaneously. Refining and validating the diagram may uncover problems or missing information which require more information gathering and analysis. Now let us consider an example to construct an ER diagram.
1.5.2 Network Data Model
The Database Task Group of the Conference on Data Systems Languages (DBTG/CODASYL) formalised the network data model in the late 1960s. The network data model was eventually standardised as the CODASYL model. The network data model is similar to the hierarchical model except that a record can have multiple parents. The network data model has three basic components: record types, data items (fields), and links. Further, in network model terminology, a relationship is called a set, in which each set is composed of at least two record types. The first record type is called the owner record, which is equivalent to the parent in the hierarchical model. The second record type is called the member record, which is equivalent to the child in the hierarchical model. The connection between an owner and its member records is identified by a link, to which database designers assign a set-name. This set-name is used to retrieve and manipulate data. Just as branches in hierarchical data models represent access paths, the links between owners and their members indicate access paths in the network model and are typically implemented by pointers. A member in the network model can appear in more than one set and thus can have several owners, i.e., it can participate in many-to-many (n:m) relationships. A set itself represents a one-to-many (1:m) relationship between the owner and the members.
Figure: Network Data Model
Advantages of the Network Data Model
Simplicity: Similar to the hierarchical data model, the network model is also simple and easy to design.
Facilitating more relationship types: The network model facilitates the handling of one-to-many (1:m) and many-to-many (n:m) relationships, which helps in modelling real-life situations.
Superior data access: Data access and flexibility are superior to that found in the hierarchical data model. An application can access an owner record and all the member records within a set. If a member record in the set has two or more owners (like a faculty member working for two departments), then one can move from one owner to another.
Database integrity: The network model enforces database integrity and does not allow members to exist without an owner. First of all, the user must define the owner record and then the member.
Data independence: The network data model provides sufficient data independence by at least partially isolating the programs from complex physical storage details. Therefore, changes in the data characteristics do not require changes in the application programs.
Database standards: Unlike the hierarchical model, the network data model is based on the universal standards formulated by DBTG/CODASYL and augmented by ANSI-SPARC. All network data models conform to these standards, which also include a DDL and a DML.
Disadvantages of the Network Data Model
System complexity: Like the hierarchical data model, the network model provides a navigational access mechanism to the data, in which the data are accessed one record at a time. This mechanism makes the system implementation very complex.
Absence of structural independence: It is difficult to make changes in a network database. If changes are made to the database structure, all subschema definitions must be revalidated before any application programs can access the database. In other words, although the network model achieves data independence, it does not provide structural independence.
Not user-friendly: The network data model is not designed for user-friendly systems and is a highly skill-oriented system.
Operational anomalies: The insertion, deletion and updating operations on any record require a large number of pointer adjustments.
1.5.3 Relational Data Model
E. F. Codd of IBM Research first introduced the relational data model in a paper in 1970. The relational data model is implemented using the very sophisticated Relational Database Management System (RDBMS). The RDBMS provides the basic functions of the hierarchical and network DBMSs plus a host of other functions that make the relational data model easier to understand and implement. The relational data model simplified the user's view of the database by using simple tables instead of the more complex tree and network structures. A relational database is a collection of tables (also called relations) in which data is stored. Each of the tables is a matrix of a series of row and column intersections. Tables are related to each other by sharing common entity characteristics.
Advantages of the Relational Data Model
Simplicity: A relational data model is even simpler than the hierarchical and network models. It frees the designers from the actual physical data storage details, thereby allowing them to concentrate on the logical view of the database.
Structural independence: Unlike the hierarchical and network models, the relational model does not depend on a navigational data access system. Changes in the database structure do not affect data access.
Ease of design, implementation, maintenance and use: The relational model provides both structural independence and data independence. Therefore, it makes database design, implementation, maintenance and usage much easier.
Flexible and powerful query capability: It provides very powerful, flexible, and easy-to-use query facilities. Its structured query language (SQL) capability makes ad hoc queries a reality.
Disadvantages of Relational Data Model
Hardware overheads: The relational model needs more powerful computing hardware and data storage devices to perform the tasks assigned to the RDBMS.
Easy-to-design capability leads to bad design: The easy-to-use nature of relational databases results in untrained people generating queries and reports without much understanding of, or thought to, proper database design, which can lead to degraded performance and a slower system in the long run.
1.5.4 Object-Oriented Data Model
The object-oriented data model is a logical data model that captures the semantics of objects supported in object-oriented programming. It is a persistent and sharable collection of defined objects. It has the ability to model complete solutions. Object-oriented database models represent an entity as a class. A class represents both the object attributes as well as the behaviour of the entity. For example, a CUSTOMER class will have not only the customer attributes such as CUST-ID, CUST-NAME, CUST-ADDRESS and so on, but also procedures that imitate actions expected of a customer, such as update-order. Instances of the class object correspond to individual customers. Within an object, the class attributes take specific values, which distinguish one customer (object) from another. However, all the objects belonging to the class share the behaviour pattern of the class. The object-oriented database maintains relationships through logical containment.
The object-oriented database is based on the encapsulation of data and code related to an object into a single unit, whose contents are not visible to the outside world. Therefore, the object-oriented data model emphasises objects, rather than data alone. The object-oriented database management system (OODBMS) is among the most recent approaches to database management.
Advantages of the Object-Oriented Data Model
Capability to handle different types of data: Unlike traditional databases (such as hierarchical, network or relational), object-oriented databases are capable of storing different types of data, for example, pictures and voice.
Combining object-oriented programming with database technology: The object-oriented data model is capable of combining object-oriented programming with database technology and thus provides an integrated application development environment. Object-oriented data models provide powerful features such as inheritance, polymorphism and dynamic binding that allow the users to compose objects and provide solutions without writing object-specific code. These features increase the productivity of database application developers.
Improved data access: The object-oriented data model represents relationships explicitly, supporting both navigational and associative access to information. It further improves data access performance over relational value-based relationships.
Integrity Constraints
Each relational schema must satisfy the following four types of constraints.
A. Domain Constraints
Each attribute Ai must take an atomic value from dom(Ai) for that attribute.
The attribute Name in the example is a bad design, because sometimes we may want to search for a person using only their last name; a redesign is sketched below.
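A minimal sketch of the atomic redesign; the PERSON table and its columns are hypothetical.
-- Poor design: the whole name stored as a single, non-atomic value.
-- CREATE TABLE PERSON ( NAME VARCHAR(60), ... );
-- Better design: each attribute holds one atomic value from its domain.
CREATE TABLE PERSON (
   PERSON_ID  INT PRIMARY KEY,
   FIRST_NAME VARCHAR(30),
   LAST_NAME  VARCHAR(30)
);
SELECT * FROM PERSON WHERE LAST_NAME = 'Smith';  -- searching by last name is now straightforward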
B. Key Constraints
Superkey of R: A set of attributes, SK, of R such that no two tuples in any valid relational instance, r(R), will have the same value for SK. Therefore, for any two distinct tuples t1 and t2 in r(R), t1[SK] ≠ t2[SK].
Key of R: A minimal superkey; that is, a superkey, K, of R such that the removal of ANY attribute from K will result in a set of attributes that is not a superkey.
Example: CAR(State, LicensePlateNo, VehicleID, Model, Year, Manufacturer)
K1 = {State, LicensePlateNo}
K2 = {VehicleID}
If a relation has more than one key, we can select any one (arbitrarily) to be the primary key. Primary key attributes are underlined in the schema.
C. Entity Integrity Constraint
The primary key attributes, PK, of any relational schema R in a database cannot have null values in any tuple. In other words, for each table in a DB there must be a key, and for each key, every row in the table must have non-null values. This is because the PK is used to identify the individual tuples.
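The CAR example can be written in SQL as follows; the data types are assumptions, but the choice of keys follows the text above.
CREATE TABLE CAR (
   State          CHAR(2),
   LicensePlateNo VARCHAR(10),
   VehicleID      VARCHAR(17),
   Model          VARCHAR(20),
   Year           INT,
   Manufacturer   VARCHAR(30),
   PRIMARY KEY (State, LicensePlateNo),  -- the chosen primary key: non-null and unique in every tuple
   UNIQUE (VehicleID)                    -- the alternate candidate key K2 remains unique
);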
D. Referential Integrity Constraint
Referential integrity constraints are used to specify the relationships between two relations in a database.
Consider a referencing relation, R1, and a referenced relation, R2. Tuples in the referencing relation, R1, have attributes FK (called foreign key attributes) that reference the primary key attributes PK of the referenced relation, R2. A tuple, t1, in R1 is said to reference a tuple, t2, in R2 if t1[FK] = t2[PK].
A referential integrity constraint can be displayed in a relational database schema as a directed arc from the referencing (foreign) key to the referenced (primary) key. Examples are shown in the figure below.
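A minimal sketch of a referential integrity constraint in SQL, using the R1/R2 and FK/PK names from the discussion above.
CREATE TABLE R2 (
   PK INT PRIMARY KEY
);
CREATE TABLE R1 (
   ID INT PRIMARY KEY,
   FK INT,
   FOREIGN KEY (FK) REFERENCES R2 (PK)
);
INSERT INTO R2 VALUES (1);
INSERT INTO R1 VALUES (10, 1);      -- accepted: t1[FK] = t2[PK]
-- INSERT INTO R1 VALUES (11, 99);  -- would be rejected: no tuple in R2 has PK = 99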
Relational Algebra
Relational algebra is a procedural query language which takes instances of relations as input and yields instances of relations as output. It uses operators to perform queries. An operator can be either unary or binary. Operators accept relations as their input and yield relations as their output. Relational algebra is performed recursively on a relation, and intermediate results are also considered relations.
The fundamental operations of relational algebra are:
Select
Project
Union
Set difference
Cartesian product
Rename
We will discuss all these operations in the following sections.
Select Operation (σ)
Notation − σp(r)
where σ stands for the selection predicate and r stands for the relation. p is a propositional logic formula which may use connectors like and, or, and not. These terms may use relational operators like =, ≠, ≥, <, >, ≤.
For example −
σsubject = "database"(Books)
Output − Selects tuples from Books where subject is 'database'.
σsubject = "database" and price = "450"(Books)
Output − Selects tuples from Books where subject is 'database' and price is 450.
σsubject = "database" and price = "450" or year > "2010"(Books)
Output − Selects tuples from Books where subject is 'database' and price is 450, or those books published after 2010.
Project Operation (∏)
Notation − ∏A1, A2, ..., An (r), where A1, A2, ..., An are attribute names of relation r. Duplicate rows are automatically eliminated, as a relation is a set.
For example −
∏subject, author (Books)
Selects and projects the columns named subject and author from the relation Books.
Union Operation (∪)
Notation − r ∪ s
r ∪ s = { t | t ∈ r or t ∈ s}
where r and s are either database relations or relation result sets (temporary relations).
For example −
∏author (Books) ∪ ∏author (Articles)
Output − Projects the names of the authors who have either written a book or an article or both.
Set Difference Operation (−)
The result of the set difference operation is a relation containing the tuples that are present in one relation but not in the other.
Notation − r − s
Output − Provides the names of authors who have written books but not articles.
Cartesian Product Operation (Χ)
Notation − r Χ s
r Χ s = { q t | q ∈ r and t ∈ s}
Output − Yields a relation which shows all the books and articles written by DB1.
Rename Operation (ρ)
Notation − ρ x (E), where the result of expression E is saved with the name x.
In addition to the fundamental operations, relational algebra also provides the following additional operations:
Set intersection
Assignment
Natural join
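Because SQL is based on relational algebra, each of the operations above has a close SQL counterpart. The following sketch uses the Books and Articles relations from the examples in this section.
-- Selection  σ subject = "database" (Books)
SELECT * FROM Books WHERE subject = 'database';
-- Projection  ∏ subject, author (Books)
SELECT DISTINCT subject, author FROM Books;
-- Union  ∏ author (Books) ∪ ∏ author (Articles)
SELECT author FROM Books
UNION
SELECT author FROM Articles;
-- Set difference  ∏ author (Books) − ∏ author (Articles)
SELECT author FROM Books
EXCEPT
SELECT author FROM Articles;   -- EXCEPT is called MINUS in Oracle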
MODULE – 2
2.1 SQL
2.1.1 What is SQL?
SQL is Structured Query Language, a computer language for storing, manipulating and retrieving data stored in a relational database.
SQL is the standard language for relational database systems. All relational database management systems like MySQL, MS Access, Oracle, Sybase, Informix, PostgreSQL and SQL Server use SQL as their standard database language.
They also use different dialects, such as:
MS SQL Server uses T-SQL,
Oracle uses PL/SQL,
MS Access uses JET SQL (its native format).
SQL Process:
When you are executing an SQL command for any RDBMS, the system determines the best way to carry out your request, and the SQL engine figures out how to interpret the task.
There are various components included in the process. These components are the Query Dispatcher, Optimization Engines, Classic Query Engine and SQL Query Engine, etc. The Classic Query Engine handles all non-SQL queries, but the SQL Query Engine won't handle logical files.
Following is a simple diagram showing the SQL architecture:
2.1.3 SQL DATABASES:
MySQL
MySQL is an open-source SQL database, which is developed by the Swedish company MySQL AB. MySQL is pronounced "my ess-que-ell," in contrast with SQL, pronounced "sequel."
MySQL supports many different platforms, including Microsoft Windows, the major Linux distributions, UNIX, and Mac OS X.
MySQL has free and paid versions, depending on its usage (non-commercial/commercial) and
features. MySQL comes with a very fast, multi-threaded, multi-user, and robust SQL database
server.
Features: high performance, high availability, scalability, robust transactional support, and web and data warehouse strengths.
MS SQL Server
MS SQL Server is a relational database management system developed by Microsoft Inc. Its primary query languages are T-SQL and ANSI SQL.
Features: high performance, high availability, database mirroring, and database snapshots.
ORACLE
Oracle is a very large, multi-user database management system. It is a relational database management system developed by Oracle Corporation.
Oracle works to efficiently manage its resource, a database of information, among the multiple clients requesting and sending data in the network.
It is an excellent database server choice for client/server computing. Oracle supports all major operating systems for both clients and servers, including MS-DOS, NetWare, UnixWare, OS/2 and most UNIX flavors.
Features: concurrency and read consistency, locking mechanisms, portability, a self-managing database, and the Database Resource Manager.
MS ACCESS
This is one of the most popular Microsoft products. Microsoft Access is entry-level database management software. An MS Access database is not only inexpensive but also a powerful database for small-scale projects.
MS Access uses the Jet database engine, which utilizes a specific SQL language dialect.
Features: users can create tables, queries, forms and reports, and connect them together with macros.
Operators in SQL
An operator is a reserved word or a character used primarily in an SQL statement's WHERE clause to perform operations, such as comparisons and arithmetic operations.
Operators are used to specify conditions in an SQL statement and to serve as conjunctions for multiple conditions in a statement.
SQL Arithmetic Operators: + (addition), − (subtraction), * (multiplication), / (division), % (modulus). (Assume variable a holds 10 and variable b holds 20; then a + b = 30, a * b = 200, b / a = 2, and so on.)
SQL Comparison Operators: =, != (or <>), >, <, >=, <= are used to compare two values.
SQL Logical Operators: AND, OR, NOT, BETWEEN, IN, LIKE, EXISTS, IS NULL are used to combine or test conditions.
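The following statement sketches how arithmetic, comparison and logical operators combine in a WHERE clause; it assumes the CUSTOMERS table used in the command examples later in this module.
SELECT ID, NAME, SALARY, SALARY * 12 AS ANNUAL_SALARY   -- arithmetic operator *
FROM CUSTOMERS
WHERE SALARY >= 2000                                    -- comparison operator >=
  AND (AGE < 25 OR ADDRESS = 'Delhi');                  -- logical operators AND, OR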
COMMANDS IN SQL:
1. CREATE DATABASE
The SQL CREATE DATABASE statement is used to create new SQL database.
Syntax:
Basic syntax of CREATE DATABASE statement is as follows:
CREATE DATABASE DatabaseName;
The database name should always be unique within the RDBMS.
Example:
If you want to create new database <testDB>, then CREATE DATABASE statement would be as
follows:
SQL> CREATE DATABASE testDB;
2. DROP DATABASE
The SQL DROP DATABASE statement is used to drop any existing database in SQL schema.
Syntax:
Basic syntax of DROP DATABASE statement is as follows:
DROP DATABASE DatabaseName;
The database name should always be unique within the RDBMS.
Example:
If you want to delete an existing database <testDB>, then DROP DATABASE statement would be
as follows:
SQL> DROP DATABASE testDB;
3. USE
The SQL USE statement is used to select any existing database in SQL schema.
Syntax:
Basic syntax of USE statement is as follows:
USE DatabaseName;
4. CREATE TABLE
The SQL CREATE TABLE statement is used to create a new table.
Syntax:
Basic syntax of CREATE TABLE statement is as follows:
CREATE TABLE table_name(
column1 datatype,
column2 datatype,
column3 datatype,
.....
columnN datatype,
PRIMARY KEY( one or more columns )
);
CREATE TABLE is the keyword telling the database system what you want to do; in this case, you want to create a new table. The unique name or identifier for the table follows the CREATE TABLE statement.
Then, in brackets, comes the list defining each column in the table and what sort of data type it is. The syntax becomes clearer with the example below.
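For example, the CUSTOMERS table used throughout this module could be created as follows; the exact data types shown here are an assumption consistent with the later INSERT examples.
CREATE TABLE CUSTOMERS(
   ID      INT           NOT NULL,
   NAME    VARCHAR (20)  NOT NULL,
   AGE     INT           NOT NULL,
   ADDRESS CHAR (25),
   SALARY  DECIMAL (18, 2),
   PRIMARY KEY (ID)
);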
A copy of an existing table can be created using a combination of the CREATE TABLE statement
and the SELECT statement.
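For instance, the following sketch copies selected rows and columns into a new table; the table name SALARY_BACKUP is hypothetical. This CREATE TABLE ... AS SELECT form is supported by MySQL and Oracle, while SQL Server uses SELECT ... INTO instead.
CREATE TABLE SALARY_BACKUP AS
SELECT ID, NAME, SALARY
FROM CUSTOMERS
WHERE SALARY > 2000;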
5. DROP TABLE
The SQL DROP TABLE statement is used to remove a table definition and all data, indexes,
triggers, constraints, and permission specifications for that table.
Syntax:
Basic syntax of DROP TABLE statement is as follows:
DROP TABLE table_name;
6. INSERT INTO
The SQL INSERT INTO Statement is used to add new rows of data to a table in the database.
Syntax:
There are two basic forms of the INSERT INTO statement, as follows:
INSERT INTO TABLE_NAME (column1, column2, column3,...columnN)
VALUES (value1, value2, value3,...valueN);
Here column1, column2,...columnN are the names of the columns in the table into which you want to insert data.
You may not need to specify the column names in the SQL query if you are adding values for all the columns of the table. But make sure the order of the values is in the same order as the columns in the table. In that case, the SQL INSERT INTO syntax would be as follows:
INSERT INTO TABLE_NAME VALUES (value1, value2, value3,...valueN);
Example:
Following statements would create six records in CUSTOMERS table:
INSERT INTO CUSTOMERS (ID,NAME,AGE,ADDRESS,SALARY)
VALUES (1, 'Ramesh', 32, 'Ahmedabad', 2000.00 );
INSERT INTO CUSTOMERS (ID,NAME,AGE,ADDRESS,SALARY)
VALUES (2, 'Khilan', 25, 'Delhi', 1500.00 );
INSERT INTO CUSTOMERS (ID,NAME,AGE,ADDRESS,SALARY)
VALUES (3, 'kaushik', 23, 'Kota', 2000.00 );
INSERT INTO CUSTOMERS (ID,NAME,AGE,ADDRESS,SALARY)
VALUES (4, 'Chaitali', 25, 'Mumbai', 6500.00 );
INSERT INTO CUSTOMERS (ID,NAME,AGE,ADDRESS,SALARY)
VALUES (5, 'Hardik', 27, 'Bhopal', 8500.00 );
INSERT INTO CUSTOMERS (ID,NAME,AGE,ADDRESS,SALARY)
VALUES (6, 'Komal', 22, 'MP', 4500.00 );
You can create a record in CUSTOMERS table using second syntax as follows:
INSERT INTO CUSTOMERS
VALUES (7, 'Muffy', 24, 'Indore', 10000.00 );
All the above statements would produce the following records in the CUSTOMERS table:
+----+----------+-----+-----------+----------+
| ID | NAME     | AGE | ADDRESS   | SALARY   |
+----+----------+-----+-----------+----------+
|  1 | Ramesh   |  32 | Ahmedabad |  2000.00 |
|  2 | Khilan   |  25 | Delhi     |  1500.00 |
|  3 | kaushik  |  23 | Kota      |  2000.00 |
|  4 | Chaitali |  25 | Mumbai    |  6500.00 |
|  5 | Hardik   |  27 | Bhopal    |  8500.00 |
|  6 | Komal    |  22 | MP        |  4500.00 |
|  7 | Muffy    |  24 | Indore    | 10000.00 |
+----+----------+-----+-----------+----------+
7. SELECT
SQL SELECT Statement is used to fetch the data from a database table which returns data in the
form of result table. These result tables are called result-sets.
Syntax:
The basic syntax of SELECT statement is as follows:
SELECT column1, column2, columnN FROM table_name;
Here column1, column2... are the fields of a table whose values you want to fetch. If you want to fetch all the fields available in the table, then you can use the following syntax:
SELECT * FROM table_name;
Example:
Consider the CUSTOMERS table having the following records:
Following is an example which would fetch ID, Name and Salary fields of the customers available in
CUSTOMERS table:
SQL> SELECT ID, NAME, SALARY FROM CUSTOMERS;
This would produce the following result:
+----+----------+----------+
| ID | NAME     | SALARY   |
+----+----------+----------+
|  1 | Ramesh   |  2000.00 |
|  2 | Khilan   |  1500.00 |
|  3 | kaushik  |  2000.00 |
|  4 | Chaitali |  6500.00 |
|  5 | Hardik   |  8500.00 |
|  6 | Komal    |  4500.00 |
|  7 | Muffy    | 10000.00 |
+----+----------+----------+
8. WHERE CLAUSE
The SQL WHERE clause is used to specify a condition while fetching data from a single table or joining multiple tables.
If the given condition is satisfied, then only specific values are returned from the table. You would use the WHERE clause to filter the records and fetch only the necessary records.
The WHERE clause is not only used in the SELECT statement; it is also used in the UPDATE, DELETE statements, etc., which we will examine in subsequent sections.
Syntax:
The basic syntax of SELECT statement with WHERE clause is as follows:
SELECT column1, column2, columnN
FROM table_name
WHERE [condition]
You can specify a condition using comparison or logical operators like >, <, =, LIKE, NOT, etc.
Below examples would make this concept clear.
Example:
Consider the CUSTOMERS table having the following records:
Following is an example which would fetch ID, Name and Salary fields from the CUSTOMERS
table where salary is greater than 2000:
SQL> SELECT ID, NAME, SALARY
FROM CUSTOMERS
WHERE SALARY > 2000;
This would produce the following result:
+----+----------+----------+
| ID | NAME     | SALARY   |
+----+----------+----------+
|  4 | Chaitali |  6500.00 |
|  5 | Hardik   |  8500.00 |
|  6 | Komal    |  4500.00 |
|  7 | Muffy    | 10000.00 |
+----+----------+----------+
9. AND & OR OPERATORS
The SQL AND and OR operators are used to combine multiple conditions to narrow data in an SQL statement. These two operators are called conjunctive operators.
These operators provide a means to make multiple comparisons with different operators in the same
SQL statement.
The AND Operator:
The AND operator allows the existence of multiple conditions in an SQL statement's WHERE
clause.
Syntax:
The basic syntax of AND operator with WHERE clause is as follows:
SELECT column1, column2, columnN
FROM table_name
WHERE [condition1] AND [condition2]...AND [conditionN];
You can combine N number of conditions using AND operator. For an action to be taken by the
SQL statement, whether it be a transaction or query, all conditions separated by the AND must be
TRUE.
Example:
Consider the CUSTOMERS table having the following records:
Following is an example which would fetch the ID, Name and Salary fields from the CUSTOMERS table where the salary is greater than 2000 AND the age is less than 25 years:
SQL> SELECT ID, NAME, SALARY
FROM CUSTOMERS
WHERE SALARY > 2000 AND age < 25;
This would produce the following result:
+----+-------+----------+
| ID | NAME  | SALARY   |
+----+-------+----------+
|  6 | Komal |  4500.00 |
|  7 | Muffy | 10000.00 |
+----+-------+----------+
10. UPDATE
The SQL UPDATE Query is used to modify the existing records in a table.
You can use the WHERE clause with the UPDATE query to update selected rows; otherwise all the rows would be affected.
Syntax:
The basic syntax of UPDATE query with WHERE clause is as follows:
UPDATE table_name
SET column1 = value1, column2 = value2...., columnN = valueN
WHERE [condition];
11. DELETE
The SQL DELETE Query is used to delete the existing records from a table.
You can use the WHERE clause with the DELETE query to delete selected rows; otherwise all the records would be deleted.
Syntax:
The basic syntax of DELETE query with WHERE clause is as follows:
DELETE FROM table_name
WHERE [condition];
You can combine N number of conditions using AND or OR operators.
Example:
Consider the CUSTOMERS table having the following records:
Following is an example which would DELETE a customer whose ID is 6:
SQL> DELETE FROM CUSTOMERS
WHERE ID = 6;
Now the CUSTOMERS table would have the following records:
+----+----------+-----+-----------+----------+
| ID | NAME     | AGE | ADDRESS   | SALARY   |
+----+----------+-----+-----------+----------+
|  1 | Ramesh   |  32 | Ahmedabad |  2000.00 |
|  2 | Khilan   |  25 | Delhi     |  1500.00 |
|  3 | kaushik  |  23 | Kota      |  2000.00 |
|  4 | Chaitali |  25 | Mumbai    |  6500.00 |
|  5 | Hardik   |  27 | Bhopal    |  8500.00 |
|  7 | Muffy    |  24 | Indore    | 10000.00 |
+----+----------+-----+-----------+----------+
12. LIKE
The SQL LIKE clause is used to compare a value to similar values using wildcard operators. There
are two wildcards used in conjunction with the LIKE operator:
· The percent sign (%)
· The underscore (_)
The percent sign represents zero, one, or multiple characters. The underscore represents a single
number or character. The symbols can be used in combinations.
Syntax:
The basic syntax of % and _ is as follows:
SELECT column_list FROM table_name
WHERE column LIKE 'XXXX%'
or
SELECT column_list FROM table_name
WHERE column LIKE '%XXXX%'
or
SELECT column_list FROM table_name
WHERE column LIKE 'XXXX_'
Consider the CUSTOMERS table having the following records:
Following is an example which would display all the records from CUSTOMERS table where
SALARY starts with 200:
SQL> SELECT * FROM CUSTOMERS
WHERE SALARY LIKE '200%';
This would produce following result:
13. TOP
The SQL TOP clause is used to fetch a TOP N number or X percent records from a table.
Syntax:
The basic syntax of TOP clause with SELECT statement would be as follows:
SELECT TOP number|percent column_name(s)
FROM table_name
WHERE [condition]
Example:
Consider the CUSTOMERS table having the following records:
Following is an example on SQL Server which would fetch the top 3 records from the CUSTOMERS table:
SQL> SELECT TOP 3 * FROM CUSTOMERS;
This would produce following result:
14. ORDER BY
The SQL ORDER BY clause is used to sort the data in ascending or descending order, based on one or more columns. Some databases sort query results in ascending order by default.
Syntax:
The basic syntax of the ORDER BY clause is as follows:
SELECT column-list
FROM table_name
[WHERE condition]
[ORDER BY column1, column2, .. columnN] [ASC | DESC];
Example:
Consider the CUSTOMERS table having the following records:
+----+----------+-----+-----------+----------+
| ID | NAME     | AGE | ADDRESS   | SALARY   |
+----+----------+-----+-----------+----------+
|  1 | Ramesh   |  32 | Ahmedabad |  2000.00 |
|  2 | Khilan   |  25 | Delhi     |  1500.00 |
|  3 | kaushik  |  23 | Kota      |  2000.00 |
|  4 | Chaitali |  25 | Mumbai    |  6500.00 |
|  5 | Hardik   |  27 | Bhopal    |  8500.00 |
|  6 | Komal    |  22 | MP        |  4500.00 |
|  7 | Muffy    |  24 | Indore    | 10000.00 |
+----+----------+-----+-----------+----------+
Following is an example which would sort the result in ascending order by NAME and SALARY:
SQL> SELECT * FROM CUSTOMERS
ORDER BY NAME, SALARY;
This would produce following result:
15. GROUP BY
The SQL GROUP BY clause is used in collaboration with the SELECT statement to arrange
identical data into groups.
The GROUP BY clause follows the WHERE clause in a SELECT statement and precedes the
ORDER BY clause.
Syntax:
The basic syntax of GROUP BY clause is given below. The GROUP BY clause must follow the
conditions in the WHERE clause and must precede the ORDER BY clause if one is used.
SELECT column1, column2
FROM table_name
WHERE [ conditions ]
GROUP BY column1, column2
ORDER BY column1, column2
Example:
Consider the CUSTOMERS table having the following records:
If you want to know the total amount of salary for each customer, then the GROUP BY query would be as follows:
SQL> SELECT NAME, SUM(SALARY) FROM CUSTOMERS
GROUP BY NAME;
This would produce following result:
16. DISTINCT
The SQL DISTINCT keyword is used in conjunction with the SELECT statement to eliminate all the duplicate records and fetch only unique records.
There may be a situation when you have multiple duplicate records in a table. While fetching such records, it makes more sense to fetch only unique records instead of fetching duplicate records.
Syntax:
The basic syntax of DISTINCT keyword to eliminate duplicate records is as follows:
SELECT DISTINCT column1, column2,.....columnN
FROM table_name
WHERE [condition]
Example:
Consider the CUSTOMERS table having the following records:
First let us see how the following SELECT query returns duplicate salary records:
SQL> SELECT SALARY FROM CUSTOMERS
ORDER BY SALARY;
This would produce the following result, where the salary 2000 appears twice, which is a duplicate record from the original table.
+----------+
| SALARY |
+----------+
| 1500.00 |
| 2000.00 |
| 2000.00 |
| 4500.00 |
| 6500.00 |
| 8500.00 |
| 10000.00 |
+----------+
Now let us use DISTINCT keyword with the above SELECT query and see the result:
SQL> SELECT DISTINCT SALARY FROM CUSTOMERS
ORDER BY SALARY;
This would produce the following result, where we do not have any duplicate entries:
+----------+
| SALARY |
+----------+
| 1500.00 |
| 2000.00 |
| 4500.00 |
| 6500.00 |
| 8500.00 |
| 10000.00 |
+----------+
17. CONSTRAINTS
Constraints are the rules enforced on the data columns of a table. They are used to limit the type of data that can go into a table. This ensures the accuracy and reliability of the data in the database.
Constraints can be at the column level or the table level. Column-level constraints are applied only to one column, whereas table-level constraints are applied to the whole table.
Following are commonly used constraints available in SQL. These constraints have already been discussed in the RDBMS concepts above, but it is worth revising them at this point.
· NOT NULL Constraint: Ensures that a column cannot have NULL value.
· DEFAULT Constraint : Provides a default value for a column when none is specified.
· UNIQUE Constraint: Ensures that all values in a column are different.
· PRIMARY Key: Uniquely identifies each row/record in a database table.
· FOREIGN Key: Uniquely identifies a row/record in another database table.
· CHECK Constraint: The CHECK constraint ensures that all values in a column satisfy certain conditions.
· INDEX: Used to create and retrieve data from the database very quickly.
Constraints can be specified when a table is created with the CREATE TABLE statement, or you can
use the ALTER TABLE statement to add constraints even after the table is created.
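For example, several of these constraints might be declared together when a table is created. The following is only a sketch; the CHECK condition and the DEFAULT value are illustrative assumptions and are not part of the earlier CUSTOMERS examples:
SQL> CREATE TABLE CUSTOMERS(
ID INT NOT NULL,
NAME VARCHAR (20) NOT NULL,
AGE INT NOT NULL CHECK (AGE >= 18),
ADDRESS CHAR (25),
SALARY DECIMAL (18, 2) DEFAULT 5000.00,
PRIMARY KEY (ID)
);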
Dropping Constraints:
Any constraint that you have defined can be dropped using the ALTER TABLE command with the
DROP CONSTRAINT option.
For example, to drop the primary key constraint in the EMPLOYEES table, you can use the
following command:
ALTER TABLE EMPLOYEES DROP CONSTRAINT EMPLOYEES_PK;
Some implementations may provide shortcuts for dropping certain constraints. For example, to drop
the primary key constraint for a table in Oracle, you can use the following command:
ALTER TABLE EMPLOYEES DROP PRIMARY KEY;
Some implementations allow you to disable constraints. Instead of permanently dropping a
constraint from the database, you may want to temporarily disable the constraint, and then enable it
later.
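In Oracle, for example, disabling and re-enabling can be done with statements along these lines (reusing the EMPLOYEES_PK constraint name from the example above):
ALTER TABLE EMPLOYEES DISABLE CONSTRAINT EMPLOYEES_PK;
ALTER TABLE EMPLOYEES ENABLE CONSTRAINT EMPLOYEES_PK;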
Integrity Constraints:
Integrity constraints are used to ensure accuracy and consistency of data in a relational database.
Data integrity is handled in a relational database through the concept of referential integrity.
There are many types of integrity constraints that play a role in referential integrity (RI). These
constraints include Primary Key, Foreign Key, Unique Constraints and other constraints mentioned
above.
18. SQL JOINS
The SQL Joins clause is used to combine records from two or more tables in a database. A JOIN is a
means for combining fields from two tables by using values common to each.
Consider the following two tables. (a) The CUSTOMERS table is as follows:
(b) Another table is ORDERS as follows:
Now let us join these two tables in our SELECT statement as follows:
SQL> SELECT ID, NAME, AGE, AMOUNT
FROM CUSTOMERS, ORDERS
WHERE CUSTOMERS.ID = ORDERS.CUSTOMER_ID;
This would produce a result containing, for each order, the matching customer's ID, NAME and AGE
along with the order AMOUNT.
Here it is notable that the join is performed in the WHERE clause. Several operators can be used to
join tables, such as =, <, >, <>, <=, >=, !=, BETWEEN, LIKE, and NOT. However, the most common
operator is the equal symbol.
· CARTESIAN JOIN: returns the Cartesian product of the sets of records from the two or more
joined tables (see the example below).
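For instance, simply omitting the join condition produces such a Cartesian product; this is a sketch on the same CUSTOMERS and ORDERS tables:
SQL> SELECT CUSTOMERS.NAME, ORDERS.AMOUNT
FROM CUSTOMERS, ORDERS;
Every customer row is paired with every order row, so the result has as many rows as the product of the two table sizes.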
19. UNION
The SQL UNION clause/operator is used to combine the results of two or more SELECT
statements without returning any duplicate rows.
To use UNION, each SELECT must select the same number of columns and column expressions, with
compatible data types, and in the same order; the columns do not, however, have to be of the same
length.
Syntax:
The basic syntax of UNION is as follows:
SELECT column1 [, column2 ]
FROM table1 [, table2 ]
[WHERE condition]
UNION
SELECT column1 [, column2 ]
FROM table1 [, table2 ]
[WHERE condition]
Here given condition could be any given expression based on your requirement.
Example:
Consider the same two tables used in the join example above: (a) the CUSTOMERS table and (b) the
ORDERS table.
Now let us combine these two tables in our SELECT statement as follows:
SQL> SELECT ID, NAME, AMOUNT, DATE
FROM CUSTOMERS
LEFT JOIN ORDERS
ON CUSTOMERS.ID = ORDERS.CUSTOMER_ID
UNION
SELECT ID, NAME, AMOUNT, DATE
FROM CUSTOMERS
RIGHT JOIN ORDERS
ON CUSTOMERS.ID = ORDERS.CUSTOMER_ID;
This would produce the combined result of the LEFT JOIN and RIGHT JOIN, with duplicate rows
eliminated.
20. NULL
The SQL NULL is the term used to represent a missing value. A NULL value in a table is a value in
a field that appears to be blank.
A field with a NULL value is a field with no value. It is very important to understand that a NULL
value is different than a zero value or a field that contains spaces.
Syntax:
The basic syntax of NULL while creating a table:
SQL> CREATE TABLE CUSTOMERS(
ID INT NOT NULL,
NAME VARCHAR (20) NOT NULL,
AGE INT NOT NULL,
ADDRESS CHAR (25) ,
SALARY DECIMAL (18, 2),
PRIMARY KEY (ID)
);
Here, NOT NULL signifies that the column must always accept an explicit value of the given data
type. There are two columns where we did not use NOT NULL, which means these columns could be
NULL.
A field with a NULL value is one that has been left blank during record creation.
Example:
The NULL value can cause problems when selecting data, however, because when comparing an
unknown value to any other value, the result is always unknown and not included in the final results.
You must use the IS NULL or IS NOT NULL operators in order to check for a NULL value.
Consider the CUSTOMERS table having the following records.
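For example, the following query (a standard-SQL sketch) uses the IS NOT NULL operator to fetch only those customers whose SALARY has a value:
SQL> SELECT ID, NAME, AGE, ADDRESS, SALARY
FROM CUSTOMERS
WHERE SALARY IS NOT NULL;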
This would list only those customers whose SALARY is not NULL.
21.ALIAS
You can rename a table or a column temporarily by giving another name known as alias.
The use of table aliases means to rename a table in a particular SQL statement. The renaming is a
temporary change and the actual table name does not change in the database.
The column aliases are used to rename a table's columns for the purpose of a particular SQL query.
Syntax:
The basic syntax of table alias is as follows:
SELECT column1, column2....
FROM table_name AS alias_name
WHERE [condition];
The basic syntax of column alias is as follows:
SELECT column_name AS alias_name
FROM table_name
WHERE [condition];
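For example, table and column aliases can be combined as follows; this sketch reuses the CUSTOMERS and ORDERS tables from the join example:
SQL> SELECT C.ID, C.NAME AS CUSTOMER_NAME, O.AMOUNT
FROM CUSTOMERS AS C, ORDERS AS O
WHERE C.ID = O.CUSTOMER_ID;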
The basic syntax of ALTER TABLE to ADD a PRIMARY KEY constraint to a table is as follows:
ALTER TABLE table_name
ADD CONSTRAINT MyPrimaryKey PRIMARY KEY (column1, column2...);
The basic syntax of ALTER TABLE to DROP CONSTRAINT from a table is as follows:
ALTER TABLE table_name
DROP CONSTRAINT MyUniqueConstraint;
If you're using MySQL, the code is as follows:
ALTER TABLE table_name
DROP INDEX MyUniqueConstraint;
The basic syntax of ALTER TABLE to DROP PRIMARY KEY constraint from a table is as
follows:
ALTER TABLE table_name
DROP CONSTRAINT MyPrimaryKey;
If you're using MySQL, the code is as follows:
ALTER TABLE table_name
DROP PRIMARY KEY;
23. TRUNCATE TABLE
The SQL TRUNCATE TABLE command is used to delete all data from an existing table.
You can also use the DROP TABLE command to delete a complete table, but it would remove the
complete table structure from the database, and you would need to re-create the table once again if
you wish to store some data.
Syntax:
The basic syntax of TRUNCATE TABLE is as follows:
TRUNCATE TABLE table_name;
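For example, the following would remove all rows from the CUSTOMERS table while keeping its structure intact:
SQL> TRUNCATE TABLE CUSTOMERS;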
The following are some important SQL date and time functions (names as in MySQL):
FROM_UNIXTIME() Format a UNIX timestamp as a date
HOUR() Extract the hour
LAST_DAY Return the last day of the month for the argument
LOCALTIME(), LOCALTIME Synonym for NOW()
LOCALTIMESTAMP, LOCALTIMESTAMP() Synonym for NOW()
MAKEDATE() Create a date from the year and day of year
MAKETIME() Create a time value from hour, minute and second
MICROSECOND() Return the microseconds from argument
MINUTE() Return the minute from the argument
MONTH() Return the month from the date passed
MONTHNAME() Return the name of the month
NOW() Return the current date and time
PERIOD_ADD() Add a period to a year-month
PERIOD_DIFF() Return the number of months between periods
QUARTER() Return the quarter from a date argument
SEC_TO_TIME() Converts seconds to 'HH:MM:SS' format
SECOND() Return the second (0-59)
STR_TO_DATE() Convert a string to a date
SUBDATE() When invoked with three arguments a synonym for DATE_SUB()
SUBTIME() Subtract times
SYSDATE() Return the time at which the function executes
TIME_FORMAT() Format as time
TIME_TO_SEC() Return the argument converted to seconds
TIME() Extract the time portion of the expression passed
TIMEDIFF() Subtract time
TIMESTAMP()
With a single argument, this function returns the date or datetime
expression. With two arguments, the sum of the arguments
TIMESTAMPADD() Add an interval to a datetime expression
TIMESTAMPDIFF() Subtract an interval from a datetime expression
TO_DAYS() Return the date argument converted to days
UNIX_TIMESTAMP() Return a UNIX timestamp
UTC_DATE() Return the current UTC date
UTC_TIME() Return the current UTC time
UTC_TIMESTAMP() Return the current UTC date and time
WEEK() Return the week number
WEEKDAY() Return the weekday index
WEEKOFYEAR() Return the calendar week of the date (1-53)
YEAR() Return the year
YEARWEEK() Return the year and week
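A few of these functions in use, as a MySQL-flavoured sketch (the values returned depend on the current date and time):
SQL> SELECT NOW(), YEAR(NOW()), MONTHNAME(NOW()), WEEK(NOW());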
2.2 RELATIONAL ALGEBRA
Relational algebra operators are mathematical functions used to retrieve queries by describing a
sequence of operations on tables or even databases (schemas). With relational algebra operators,
a query is always composed of a number of operators, each of which in turn takes relations as
operands and returns an individual relation as the end product.
The following are the main relational algebra operators as applied to SQL:
2.2.1 The SELECT Operator
The SELECT operator is used to choose a subset of the tuples (rows) from a relation that satisfy a
selection condition, acting as a filter to retain only tuples that fulfil a qualifying requirement.
The SELECT operator in relational algebra is denoted by the symbol σ (sigma).
· The syntax for the SELECT statement is then as follows:
σ<Selection condition>(R)
· The σ would represent the SELECT command
· The <selection condition> would represent the condition for selection.
· The (R) would represent the relation or the table from which we are making a selection of the
tuples.
To implement the SELECT statement in SQL, we take a look at an example in which we would like to
select the EMPLOYEE tuples whose employee number is 7, or those whose date of birth is before
1980:
σempno=7(EMPLOYEE)
σdob<'01-Jan-1980'(EMPLOYEE)
Since the selection operator returns whole tuples, the SQL implementation would translate into:
SELECT *
FROM EMPLOYEE
WHERE empno = 7
SELECT *
FROM EMPLOYEE
WHERE dob < '01-Jan-1980'
2.2.2 The PROJECT Operator
This operator is used to reorder, select and get rid of attributes from a table. At some point we might
want only certain attributes in a relation and eliminate others from our query result. Therefore the
PROJECT operator would be used in such operations.
· The symbol used for the PROJECT operation is Π (pi).
· The general syntax for the PROJECT operator is:
Π<attribute list>(R)
· Π would represent the PROJECT.
· <attribute list> would represent the attributes (columns) we want from a relation.
· (R) would represent the relation or table we want to choose the attributes from.
To implement the PROJECT statement in SQL, we take a look at an example in which we would like
to choose the Date of Birth (dob) and Employee Number (empno) from the relation EMPLOYEE:
· Πdob, empno(EMPLOYEE)
In SQL this would translate to:
SELECT dob, empno
FROM EMPLOYEE
The RENAME Operator
The RENAME operator is used to give a name to the results or output of queries, returns of selection
statements, and views of queries that we would like to view at some other point in time:
· The RENAME operator is symbolized by ρ (rho).
· The general syntax for RENAME operator is: ρ s(B1, B2, B3,….Bn)(R )
· ρ is the RENAME operation.
· S is the new relation name.
· B1, B2, B3, …Bn are the new renamed attributes (columns).
· R is the relation or table from which the attributes are chosen.
To implement the RENAME statement in SQL, we take a look at an example in which we would like
to choose the Date of Birth and Employee Number attributes and RENAME them as 'Birth_Date' and
'Employee_Number' from the EMPLOYEE relation:
ρ S(Birth_Date, Employee_Number)(EMPLOYEE) ← Πdob, empno(EMPLOYEE)
· The arrow symbol ← means that we first get the PROJECT operation results on the right side of the
arrow and then apply the RENAME operation to those results on the left side of the arrow.
In SQL we would translate the RENAME operator using the SQL 'AS' keyword:
SELECT dob AS Birth_Date, empno AS Employee_Number
FROM EMPLOYEE
UNION: the UNION operation on relation A and relation B, designated as A ∪ B, includes all tuples
that are in A or in B, eliminating duplicate tuples. The SQL implementation of the UNION operation
would be as follows:
UNION
RESULT ← A ∪ B
SQL Statement:
SELECT * FROM A
UNION
SELECT * FROM B
MINUS Operation: the MINUS operation includes tuples from one relation that are not in another
relation. Let the relations be A and B; the MINUS operation A MINUS B is denoted by A – B, which
results in tuples that are in A and not in B. The SQL implementation of the MINUS operation would
be as follows:
MINUS
RESULT ← A – B
SQL Statement:
SELECT dob FROM A
MINUS
SELECT dob FROM B
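MINUS is the Oracle keyword; in standard SQL (supported, for example, by SQL Server and PostgreSQL) the same set difference is written with EXCEPT:
SELECT dob FROM A
EXCEPT
SELECT dob FROM B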
The CARTESIAN PRODUCT operator, also referred to as the cross product or cross join, creates a
relation that has all the attributes of A and B, allowing all the attainable combinations of tuples from
A and B in the result. The CARTESIAN PRODUCT of A and B is symbolized by X, as in A X B.
Let there be relation A(A1, A2) and relation B(B1, B2).
The CARTESIAN PRODUCT C of A and B, which is A X B, is
C = A X B
C = (A1B1, A1B2, A2B1, A2B2)
The SQL implementation would be something like:
SELECT A.dob, B.empno FROM A, B
The JOIN operation is denoted by the ⋈ symbol and is used to compound similar tuples from two
relations into single longer tuples. Every row of the first table is combined with every row of the
second table that satisfies the join condition. The result is tuples taken from both tables.
· The general syntax would be A ⋈<join condition> B
SQL translation example, where attribute dob is Date of Birth and empno is Employee Number:
SELECT A.dob, A.empno FROM A JOIN B ON B.empno = A.empno
2. EQUIJOIN Operator
The EQUIJOIN operation returns all combinations of tuples from relation A and relation B satisfying
a join requirement with only equality comparisons. The EQUIJOIN operation is symbolized by:
A ⋈<join condition> B, or
· A ⋈(<join attributes 1>), (<join attributes 2>) B
SQL translation example, where attribute dob is Date of Birth and empno is Employee Number:
SELECT * FROM A INNER JOIN B ON A.empno = B.empno
We can always use the WHERE clause to further restrict our output and prevent a Cartesian product in
the output.
The DIVISION operation will return a relation R(X) that includes all tuples t[X] in R1(Z) that appear
in R1 in combination with every tuple from R2(Y), where Z = X ∪ Y. The DIVISION operator is
symbolized by:
· R1(Z) ÷ R2(Y)
The DIVISION operator is the most difficult to implement in SQL, as no SQL command is given for
the DIVISION operation. The DIVISION operator can be seen as the opposite of the CARTESIAN
PRODUCT operator, just as division is the opposite of multiplication in standard mathematics.
Therefore a series of existing SQL commands has to be utilized in the implementation of the
DIVISION operator. An example of the SQL implementation of the DIVISION operator:
SELECT surname, forenames
FROM employee X
WHERE NOT EXISTS
(SELECT ‘X’
FROM employee y
WHERE NOT EXISTS
(SELECT ‘X’
FROM employee z
WHERE x.empno = z.empno
AND y.surname = z.surname))
ORDER BY empno
2.3 TUPLE RELATIONAL CALCULUS
The tuple relational calculus is based on specifying a number of tuple variables. Each such tuple
variable normally ranges over a particular database relation. This means that the variable may take any
individual tuple from that relation as its value. A simple tuple relational calculus query is of the form
{t | COND(t)}, where 't' is a tuple variable and COND(t) is a conditional expression involving 't'. The
result of such a query is a relation that contains all the tuples (rows) that satisfy COND(t).
For example, the relational calculus query {t | BOOK(t) and t.PRICE>100} will get you all the books
whose price is greater than 100. In the above example, the condition 'BOOK(t)' specifies that the range
relation of the tuple variable 't' is BOOK. Each BOOK tuple 't' that satisfies the condition 't.PRICE>
100' will be retrieved. Note that 't.PRICE' references the attribute PRICE of the tuple variable 't'.
The query {t | BOOK(t) and t.PRICE>100} retrieves all attribute values for each selected BOOK
tuple. To retrieve only some of the attributes (say TITLE, AUTHOR and PRICE) we can modify the
query as follows:
{t.TITLE, t.AUTHOR, t.PRICE | BOOK(t) and t.PRICE>100}
Thus, in a tuple calculus expression we need to specify the following information:
• For each tuple variable t, the range relation R of t. This value is specified by a condition of the form
R(t).
• A condition to select the required tuples from the relation.
• A set of attributes to be retrieved, called the requested attributes. The values of these attributes are
shown for each selected combination of tuples. If the requested attribute list is not specified, then
all the attributes of the selected tuples are retrieved.
Thus, to retrieve the details of all books (Title and Author name) which were published by 'Kalyani'
and whose price is greater than 100, we will write the query as follows:
{t.TITLE, t.AUTHOR | BOOK(t) and t.PUBLISHER='Kalyani' and t.PRICE>100}
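In SQL this calculus query would correspond to something along these lines (assuming a BOOK table with the attributes named above):
SELECT TITLE, AUTHOR
FROM BOOK
WHERE PUBLISHER = 'Kalyani' AND PRICE > 100;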
Bound variables: Bound variables are those range variables for which the meaning of the formula
would remain unchanged if all the occurrences of a range variable, say 'x', were replaced by some
other variable, say 'y'. Such a range variable 'x' is called a bound variable.
For example: ∃x (x>3) means
EXISTS x (x>3)
Here, the WFF simply states that there exists some integer x that is greater than 3. Note that the
meaning of this WFF would remain totally unchanged if all references to x were replaced by references
to some other variable y. In other words, the WFF EXISTS y (y>3) is semantically the same.
Free variables: Free variables are those range variables for which the meaning of the formula changes
if all the occurrences of a range variable, say 'x', are replaced by some other variable, say 'y'. Such a
range variable 'x' is called a free variable.
For example: ∃x (x>3) and x<0 means
EXISTS x (x>3) and x<0
Here, there are three references to x, denoting two different variables. The first two references are
bound and could be replaced by references to some other variable y without changing the overall
meaning. The third reference is free, and cannot be replaced without changing the meaning of the
formula. Thus, of the two WFFs shown below, the first is equivalent to the one just given and the
second is not:
EXISTS y (y>3) and x<0
EXISTS y (y>3) and y<0
Closed and Open WFF: A WFF in which all variable references are bound is called a closed WFF,
e.g. EXISTS x (x>3) is a closed WFF.
An open WFF is a WFF that is not closed, i.e. one that contains at least one free variable reference,
e.g. EXISTS y (y>3) and x<0
2.4 DOMAIN RELATIONAL CALCULUS
The domain calculus differs from the tuple calculus in the type of variables used in formulas. In
domain calculus the variables range over single values from domains of attributes rather than ranging
over tuples. To form a relation of degree 'n' for a query result, we must have 'n' of these domain
variables, one for each attribute.
An expression of the domain calculus is of the following form:
{x1, x2, ..., xn | COND(x1, x2, ..., xn, xn+1, xn+2, ..., xn+m)}
In the above expression x1, x2, ..., xn, xn+1, xn+2, ..., xn+m are domain variables that range over
domains of attributes, and COND is a condition or formula of the domain relational calculus.
Expressions of the domain calculus are constructed from the following elements:
• Domain variables x1, x2, ..., xn, xn+1, xn+2, ..., xn+m; each domain variable is to range over some
specified domain.
• Conditions, which can take two forms:
• Simple comparisons of the form x * y, as for the tuple calculus, except that x and y are now domain
variables.
• Membership conditions, of the form R (term, term, ...).
Here, R is a relation, and each "term" is a pair A:V, where A in turn is an attribute
of R and V is either a domain variable or a constant. For example, EMP (empno: 100, ename: 'Ajay') is
a membership condition (which evaluates to true if and only if there exists an EMP tuple having
empno=100 and ename='Ajay').
• Well-Formed Formulas (WFFs), formed in accordance with the rules of tuple calculus (but with the
revised definition of "condition").
FDs are constraints on well-formed relations and represent a formalism on the infrastructure of a
relation.
Multivalued dependencies occur when the presence of one or more rows in a table implies the
presence of one or more other rows in that same table.
Examples:
For example, imagine a car company that manufactures many models of car, but always makes both
red and blue colors of each model. If you have a table that contains the model name, color and year
of each car the company manufactures, there is a multivalued dependency in that table. If there is a
row for a certain model name and year in blue, there must also be a similar row corresponding to the
red version of that same car.
2.5 NORMALIZATION
ANOMALIES IN DBMS:
Insertion Anomaly
It is a failure to place information about a new database entry into all the places in the database where
information about the new entry needs to be stored. In a properly normalized database, information
about a new entry needs to be inserted into only one place in the database; in an inadequately
normalized database, information about a new entry may need to be inserted into more than one place,
and, human fallibility being what it is, some of the needed additional insertions may be missed.
Deletion anomaly
It is a failure to remove information about an existing database entry when it is time to remove that
entry. In a properly normalized database, information about an old, to-be-removed entry needs to
be deleted from only one place in the database; in an inadequately normalized database, information
about that old entry may need to be deleted from more than one place.
Update Anomaly
An update of a database involves modifications that may be additions, deletions, or both. Thus "update
anomalies" can be either of the kinds discussed above. All three kinds of anomalies are highly
undesirable, since their occurrence constitutes corruption of the database. Properly normalized
databases are much less susceptible to corruption than are unnormalized databases.
Normalization Avoids
· Duplication of Data – the same data is listed in multiple rows of the database.
· Insert Anomaly – a record about an entity cannot be inserted into the table without first inserting
information about another entity (a customer cannot be entered without a sales order).
· Delete Anomaly – a record cannot be deleted without deleting a record about a related entity (a sales
order cannot be deleted without deleting all of the customer's information).
· Update Anomaly – information cannot be updated without changing it in many places (to update
customer information, it must be updated for each sales order the customer has placed).
Process of normalization:
Before getting to know the normalization techniques in detail, let us define a few building blocks
which are used to define normal forms.
1. Determinant: Attribute X can be defined as a determinant if it uniquely defines the value of Y in a
given relationship or entity. To qualify as a determinant, an attribute need NOT be a key attribute.
Usually the dependency is represented as X->Y, which means attribute X decides attribute Y.
Example: In the RESULT relation, the Marks attribute may decide the Grade attribute. This is
represented as Marks->Grade and read as Marks decides Grade.
In the RESULT relation, Marks is not a key attribute. Hence it can be concluded that key attributes are
determinants, but not all determinants are key attributes.
2. Functional Dependency: Functional dependency has a formal definition, but let us first understand
the concept through an example. Consider the following relation:
REPORT(Student#, Course#, CourseName, IName, Room#, Marks, Grade)
Where:
· Student# - Student Number
· Course# - Course Number
· CourseName - Course Name
· IName - Name of the instructor who delivered the course
· Room# - Room number which is assigned to the respective instructor
· Marks - scored in course Course# by student Student#
· Grade - obtained by student Student# in course Course#
· Student#, Course# together (called a composite attribute) define EXACTLY ONE value of
Marks. This can be symbolically represented as
Student#, Course# -> Marks
This type of dependency is called functional dependency. In the above example, Marks is functionally
dependent on Student#, Course#.
Other Functional dependencies in above examples are:
· Course# ->CourseName
· Course#->IName(Assuming one course is taught by one and only one instructor )
· IName -> Room# (Assuming each instructor has his /her own and non shared room)
· Marks ->Grade
Formally, we can define functional dependency as: in a given relation R, X and Y are attributes.
Attribute Y is functionally dependent on attribute X if each value of X determines exactly one value
of Y. This is represented as:
X->Y
However, X may be composite in nature.
Formal definition of full functional dependency: In a given relation R, X and Y are attributes. Y is
fully functionally dependent on attribute X only if it is not functionally dependent on any sub-set of X.
However, X may be composite in nature.
5. Transitive Dependency: In the above example, Room# depends on IName, which in turn depends
on Course#. Here Room# transitively depends on Course#:
Course# -> IName -> Room#
Similarly, Grade depends on Marks, and Marks in turn depends on Student#, Course#; hence Grade
transitively depends on Student#, Course#.
6. Key attributes: In a given relationship R, if attribute X uniquely defines all other attributes, then
attribute X is a key attribute, which is nothing but the candidate key.
Ex: Student#, Course# together form a composite key attribute which determines all attributes in the
relationship REPORT(Student#, Course#, CourseName, IName, Room#, Marks, Grade) uniquely. Hence
Student# and Course# are key attributes.
2.5.1 First Normal Form (1NF)
If a table contains non-atomic values in its rows, it is said to be in UNF (un-normalized form). An atomic
value is something that cannot be further decomposed. A non-atomic value, as the name suggests, can
be further decomposed and simplified. Consider the following table:
In the sample table above, there are multiple occurrences of rows under each key Emp-Id. Although
considered to be the primary key, Emp-Id cannot give us the unique identification facility for any
single row. Further, each primary key points to a variable length record (3 for E01, 2 for E02 and 4
for E03).
A relation is said to be in 1NF if it contains no non-atomic values and each row provides a unique
combination of values. The above table in UNF can be processed to create the following table in 1NF.
As you can see now, each row contains a unique combination of values. Unlike in UNF, this relation
contains only atomic values, i.e. the rows cannot be further decomposed, so the relation is now in 1NF.
2.5.2 Second Normal Form (2NF)
A relation is said to be in 2NF if it is already in 1NF and each and every non-key attribute fully depends
on the primary key of the relation. Speaking inversely, if a table has some attribute which does not
depend on the primary key of that table, then it is not in 2NF.
Let us explain. Emp-Id is the primary key of the above relation. Emp-Name, Month, Sales and Bank-
Name all depend upon Emp-Id. But the attribute Bank-Name depends on Bank-Id, which is not the
primary key of the table. So the table is in 1NF, but not in 2NF. If this portion is moved into
another related relation, it would come to 2NF.
After moving this portion into another relation, we store a lesser amount of data in two relations
without any loss of information. There is also a significant reduction in redundancy.
2.5.3 Third Normal Form (3NF)
A relation is said to be in 3NF if it is already in 2NF and there exists no transitive dependency in that
relation. Speaking inversely, if a table contains a transitive dependency, then it is not in 3NF, and the
table must be split to bring it into 3NF.
What is a transitive dependency? Within a relation if we see
A → B [B depends on A]
And
B → C [C depends on B]
Then we may derive
A → C[C depends on A]
Such derived dependencies hold well in most of the situations. For example if we have
Roll → Marks
And
Marks → Grade
Then we may safely derive
Roll → Grade.
This third dependency was not originally specified but we have derived it.
The derived dependency is called a transitive dependency when such a dependency becomes
improbable. For example, suppose we have been given
Roll → City
And
City → STDCode
If we try to derive Roll → STDCode it becomes a transitive dependency, because obviously the
STDCode of a city cannot depend on the roll number issued by a school or college. In such a case
the relation should be broken into two, each containing one of these two dependencies:
Roll → City
And
City → STD code
2.5.4 Boyce-Codd Normal Form (BCNF)
A relation is said to be in BCNF if it is already in 3NF and the left hand side of every dependency
is a candidate key. A relation which is in 3NF is almost always in BCNF. There could, however, be
situations in which a 3NF relation is not in BCNF, when the following conditions are found to be true:
1. The candidate keys are composite.
2. There is more than one candidate key in the relation.
3. There are some common attributes in the relation.
The given relation is in 3NF. Observe, however, that the names of Dept. and Head of Dept. are
duplicated. Further, if Professor P2 resigns, rows 3 and 4 are deleted, and we lose the information that
Rao is the Head of Department of Chemistry.
The normalization of the relation is done by creating a new relation for Dept. and Head of Dept. and
deleting Head of Dept. from the given relation. The normalized relations are shown in the following.
See the dependency diagrams for these new relations.
2.5.5 Fourth Normal Form (4NF)
When attributes in a relation have a multi-valued dependency, further normalization to 4NF and 5NF is
required. Let us first find out what a multi-valued dependency is.
A multi-valued dependency is a typical kind of dependency in which each and every attribute within
a relation depends upon the other, yet none of them is a unique primary key.
We will illustrate this with an example. Consider a vendor supplying many items to many projects
inan organization. The following are the assumptions:
1. A vendor is capable of supplying many items.
2. A project uses many items.
3. A vendor supplies to many projects.
4. An item may be supplied by many vendors.
A multi-valued dependency exists here because all the attributes depend upon each other and yet none
of them is a primary key having a unique value.
1. If vendor V1 has to supply to project P2, but the item is not yet decided, then a row with a blank for
item code has to be introduced.
2. The information about item I1 is stored twice for vendor V3.
Observe that the relation given is in 3NF and also in BCNF. It still has the problem mentioned above.
The problem is reduced by expressing this relation as two relations in Fourth Normal Form (4NF).
A relation is in 4NF if it has no more than one independent multi-valued dependency, or one
independent multi-valued dependency with a functional dependency.
The table can be expressed as the two 4NF relations given in the following. The fact that vendors are
capable of supplying certain items and that they are assigned to supply for some projects is
independently specified in the 4NF relations.
2.5.6 Fifth Normal Form (5NF)
These relations still have a problem. While defining 4NF we mentioned that all the attributes
depend upon each other. While creating the two tables in 4NF, although we have preserved the
dependencies between Vendor Code and Item Code in the first table and between Vendor Code and
Project No. in the second table, we have lost the relationship between Item Code and Project No. If
there were a primary key then this loss of dependency would not have occurred. In order to revive this
relationship we must add a new table like the following. Please note that during the entire process of
normalization, this is the only step where a new table is created by joining two attributes, rather than
splitting them into separate tables.
Dependency Preservation
Define dependency preservation. What is the dependency preservation property of a decomposition, why
do we need dependency preserving decompositions, and why is this property important in the
normalization process?
Dependency Preservation
A decomposition of a relation R into R1, R2, R3, …, Rn is a dependency preserving decomposition with
respect to the set of functional dependencies F that hold on R only if the following holds:
(F1 U F2 U F3 U … U Fn)+ = F+
where
F1, F2, F3, …, Fn are the sets of functional dependencies of relations R1, R2, R3, …, Rn.
If the closure of the sets of functional dependencies of the individual relations R1, R2, R3, …, Rn is
equal to the closure of the set of functional dependencies of the main relation R (before decomposition),
then we say the decomposition D is a dependency preserving decomposition.
Dependency preservation is a concept that is very much related to Normalization process. Remember
that the solution for converting a relation into a higher normal form is to decompose the relation into
two or more relations. This is done using the set of functional dependencies identified in the lower
normal form state.
For example, let us assume a relation R (A, B, C, D) with set of functional dependencies F = {A→B,
B→C, C→D}. There is no partial dependency in the given set F. Hence, this relation is in 2NF.
Is R (A, B, C, D) in 3NF? No. The reason is Transitive Functional Dependency. How do we convert R
into 3NF? The solution is decomposition.
Suppose the dependency A→B is missing from the decomposed relations. This causes acceptance of
any value for B in R2, so duplicate values may be entered into R2 for B which do not depend on A. If
we would like to avoid this type of duplication, then we need to perform a join operation between R1
and R2 to validate a new value for B, which is a costlier operation. Hence, we demand that the
decomposition be a dependency preserving decomposition.
We would like to check easily that updates to the database do not result in illegal relations
being created.
It would be nice if our design allowed us to check updates without having to compute natural
joins.
We can permit a non-dependency preserving decomposition if the database is static, that is, if
there will be no new insertions or updates in the future.
i. Example:
a. Dependency Preservation
A decomposition D = {R1, R2, R3, …, Rn} of R is dependency preserving w.r.t. a set F of functional
dependencies if
(F1 ∪ F2 ∪ … ∪ Fn)+ = F+.
Consider a relation R with some set F of functional dependencies (FDs).
Problem: Let a relation R (A, B, C, D ) and functional dependency {AB –> C, C –> D, D –> A}.
Relation R is decomposed into R1( A, B, C) and R2(C, D). Check whether decomposition is
dependency preserving or not.
Solution:
R1(A, B, C) and R2(C, D)
For R1(A, B, C):
closure(A) = {A} // trivial
closure(B) = {B} // trivial
closure(C) = {C, A, D}, but D cannot be kept in the closure as D is not present in R1
= {C, A}, giving C --> A
closure(AB) = {A, B, C, D}
= {A, B, C}, giving AB --> C
closure(BC) = {B, C, D, A}
= {A, B, C}, giving BC --> A
closure(AC) = {A, C, D}
= {A, C}, giving nothing new (D is not in R1)
For R2(C, D):
closure(C) = {C, A, D}, restricted to R2 this is {C, D}, giving C --> D
Hence F1 ∪ F2 = {AB --> C, C --> A, C --> D}. The original dependencies AB --> C and C --> D are
preserved, but D --> A cannot be derived from F1 ∪ F2 (the closure of D under these dependencies is
just {D}). Therefore the decomposition is NOT dependency preserving.
For a separate multiple-choice question on decomposition (compare Question 1 under Lossless
Decomposition below), the options were:
(A) gives a lossless join, and is dependency preserving
(B) gives a lossless join, but is not dependency preserving
(C) does not give a lossless join, but is dependency preserving
(D) does not give a lossless join and is not dependency preserving
Answer: (A)
Explanation: Background :
Lossless-Join Decomposition:
Decomposition of R into R1 and R2 is a lossless-join decomposition if at least one of the
following functional dependencies are in F+ (Closure of functional dependencies)
R1 ∩ R2 → R1
OR
R1 ∩ R2 → R2
Dependency Preserving Decomposition:
Decomposition of R into R1 and R2 is a dependency preserving decomposition if closure of
functional dependencies after decomposition is same as closure of of FDs before decomposition.
A simple way is to just check whether we can derive all the original FDs from the FDs present
after decomposition.
Question 2
R(A,B,C,D) is a relation. Which of the following does not have a lossless join, dependency
preserving BCNF decomposition?
(A) A->B, B->CD
(B) A->B, B->C, C->D
(C) AB->C, C->AD
(D) A ->BCD
Ans (c ).
Explanation: Background :
Question :
We know that for lossless decomposition common attribute should be candidate key in one of the
relation.
A) A->B, B->CD
R1(AB) and R2(BCD)
B is the key of second and hence decomposition is lossless.
B) A->B, B->C, C->D
R1(AB) , R2(BC), R3(CD)
B is the key of second and C is the key of third, hence lossless.
C) AB->C, C->AD
R1(ABC), R2(CD)
C is key of second, but C->A violates BCNF condition in ABC as C is not a key. We cannot
decompose ABC further as AB->C dependency would be lost.
D) A ->BCD
Already in BCNF.
Therefore, Option C AB->C, C->AD is the answer.
Functional Dependency
A functional dependency A->B holds in a relation if any two tuples having the same value of attribute A
also have the same value for attribute B. For example, in the relation STUDENT shown in Table 1, the
functional dependencies
STUD_NO->STUD_NAME, STUD_NO->STUD_ADDR hold,
but
STUD_NAME->STUD_ADDR does not hold.
Attribute Closure: Attribute closure of an attribute set can be defined as set of attributes which can
be functionally determined from it.
How to find attribute closure of an attribute set?
To find attribute closure of an attribute set:
Add elements of attribute set to the result set.
Recursively add elements to the result set which can be functionally determined from the
elements of the result set.
Using FD set of table 1, attribute closure can be determined as:
(STUD_NO)+ = {STUD_NO, STUD_NAME, STUD_PHONE, STUD_STATE, STUD_COUNTRY,
STUD_AGE}
(STUD_STATE)+ = {STUD_STATE, STUD_COUNTRY}
How to find Candidate Keys and Super Keys using Attribute Closure?
If the attribute closure of an attribute set contains all attributes of the relation, the attribute set is a
super key of the relation.
If, in addition, no proper subset of this attribute set can functionally determine all attributes of the
relation, the set is a candidate key as well. For example, using the FD set of Table 1,
(STUD_NO, STUD_NAME)+ = {STUD_NO, STUD_NAME, STUD_PHONE, STUD_STATE,
STUD_COUNTRY, STUD_AGE}
(STUD_NO)+ = {STUD_NO, STUD_NAME, STUD_PHONE, STUD_STATE, STUD_COUNTRY,
STUD_AGE}
(STUD_NO, STUD_NAME) will be super key but not candidate key because its subset (STUD_NO)+
is equal to all attributes of the relation. So, STUD_NO will be a candidate key.
GATE Question: Consider the relation scheme R = {E, F, G, H, I, J, K, L, M, N} and the set of
functional dependencies {{E, F} -> {G}, {F} -> {I, J}, {E, H} -> {K, L}, {K} -> {M}, {L} -> {N}} on R.
What is the key for R? (GATE-CS-2014)
A. {E, F}
B. {E, F, H}
C. {E, F, H, K, L}
D. {E}
Answer: Finding the attribute closure of each of the given options, we get:
{E, F}+ = {E, F, G, I, J}
{E, F, H}+ = {E, F, H, G, I, J, K, L, M, N}
{E, F, H, K, L}+ = {E, F, H, G, I, J, K, L, M, N}
{E}+ = {E}
{E, F, H}+ and {E, F, H, K, L}+ result in the set of all attributes, but {E, F, H} is minimal, so it is the
candidate key. The correct option is therefore (B).
Table 1 is an EMPLOYEE relation with attributes E-ID, E-NAME, E-CITY and E-STATE; one of its
sample rows is (E001, John, Delhi, Delhi).
The FD set for EMPLOYEE relation given in Table 1 are:
{E-ID->E-NAME, E-ID->E-CITY, E-ID->E-STATE, E-CITY->E-STATE}
Trivial versus Non-Trivial Functional Dependency: A trivial functional dependency is the one
which will always hold in a relation.
X->Y will always hold if X ⊇ Y
In the example given above, E-ID, E-NAME->E-ID is a trivial functional dependency and will
always hold because {E-ID,E-NAME} ⊃ {E-ID}. You can also see from the table that for each value
of {E-ID, E-NAME}, value of E-ID is unique, so {E-ID, E-NAME} functionally determines E-ID.
If a functional dependency is not trivial, it is called Non-Trivial Functional Dependency. Non-
Trivial functional dependency may or may not hold in a relation. e.g; E-ID->E-NAME is a non-trivial
functional dependency which holds in the above relation.
Properties of Functional Dependencies
Let X, Y, and Z be sets of attributes in a relation R. There are several properties of functional
dependencies which always hold in R, known as Armstrong's Axioms.
1. Reflexivity: If Y is a subset of X, then X → Y. e.g.; Let X represents {E-ID, E-NAME} and Y
represents {E-ID}. {E-ID, E-NAME}->E-ID is true for the relation.
2. Augmentation: If X → Y, then XZ → YZ. e.g.; Let X represents {E-ID}, Y represents {E-
NAME} and Z represents {E-CITY}. As {E-ID}->E-NAME is true for the relation, so { E-ID,E-
CITY}->{E-NAME,E-CITY} will also be true.
3. Transitivity: If X → Y and Y → Z, then X → Z. e.g.; Let X represents {E-ID}, Y represents {E-
CITY} and Z represents {E-STATE}. As {E-ID} ->{E-CITY} and {E-CITY}->{E-STATE} is
true for the relation, so { E-ID }->{E-STATE} will also be true.
4. Attribute Closure: The set of attributes that are functionally dependent on the attribute A is
called Attribute Closure of A and it can be represented as A+.
Steps to Find the Attribute Closure of A
Q. Given the FD set of a relation R, the attribute closure set S of an attribute set A is found as follows:
1. Add A to S.
2. Recursively add attributes which can be functionally determined from the attributes of the set S, until
no more attributes can be added.
From Table 1, FDs are
Given R(E-ID, E-NAME, E-CITY, E-STATE)
FDs = { E-ID->E-NAME, E-ID->E-CITY, E-ID->E-STATE, E-CITY->E-STATE }
The attribute closure of E-ID can be calculated as:
1. Add E-ID to the set {E-ID}
2. Add the attributes which can be derived from any attribute of the set. In this case E-NAME, E-CITY
and E-STATE can be derived from E-ID, so these are also a part of the closure.
3. As there is no other attribute remaining in the relation to be derived from E-ID, the result is:
(E-ID)+ = {E-ID, E-NAME, E-CITY, E-STATE}
Similarly,
(E-NAME)+ = {E-NAME}
(E-CITY)+ = {E-CITY, E_STATE}
Q. Find the attribute closures of given FDs R(ABCDE) = {AB->C, B->D, C->E, D->A} To find
(B)+,we will add attribute in set using various FD which has been shown in table below.
{B} Triviality
{B,D} B->D
{B,D,A} D->A
{B,D,A,C} AB->C
{B,D,A,C,E} C->E
We can find (C, D)+ by adding C and D into the set (triviality) and then E using(C->E) and then
A using (D->A) and set becomes.
(C,D)+ = {C,D,E,A}
Similarly we can find (B,C)+ by adding B and C into the set (triviality) and then D using (B->D)
and then E using (C->E) and then A using (D->A) and set becomes
(B,C)+ ={B,C,D,E,A}
Candidate Key
Candidate Key is minimal set of attributes of a relation which can be used to identify a tuple
uniquely. For Example, each tuple of EMPLOYEE relation given in Table 1 can be uniquely
identified by E-ID and it is minimal as well. So it will be Candidate key of the relation.
A candidate key may or may not be a primary key.
Super Key
Super Key is set of attributes of a relation which can be used to identify a tuple uniquely.For
Example, each tuple of EMPLOYEE relation given in Table 1 can be uniquely identified by E-ID or
(E-ID, E-NAME) or (E-ID, E-CITY) or (E-ID, E-STATE) or (E_ID, E-NAME, E-STATE) etc.
So all of these are super keys of EMPLOYEE relation.
Note: A candidate key is always a super key but vice versa is not true.
Q. Finding Candidate Keys and Super Keys of a Relation using FD set The set of
attributeswhose attribute closure is set of all attributes of relation is called super key of relation. For
Example, the EMPLOYEE relation shown in Table 1 has following FD set. {E-ID->E-NAME, E-ID-
>E-CITY, E-ID->E-STATE, E-CITY->E-STATE} Let us calculate attribute closure of different set
of attributes:
(E-ID)+ = {E-ID, E-NAME,E-CITY,E-STATE}
(E-ID,E-NAME)+ = {E-ID, E-NAME,E-CITY,E-STATE}
(E-ID,E-CITY)+ = {E-ID, E-NAME,E-CITY,E-STATE}
(E-ID,E-STATE)+ = {E-ID, E-NAME,E-CITY,E-STATE}
(E-ID,E-CITY,E-STATE)+ = {E-ID, E-NAME,E-CITY,E-STATE}
(E-NAME)+ = {E-NAME}
(E-CITY)+ = {E-CITY,E-STATE}
As (E-ID)+, (E-ID, E-NAME)+, (E-ID, E-CITY)+, (E-ID, E-STATE)+, (E-ID, E-CITY, E-
STATE)+ give set of all attributes of relation EMPLOYEE. So all of these are super keys of relation.
The minimal set of attributes whose attribute closure is set of all attributes of relation is called
candidate key of relation. As shown above, (E-ID)+ is set of all attributes of relation and it is minimal.
So E-ID will be candidate key. On the other hand (E-ID, E-NAME)+ also is set of all attributes but it is
not minimal because its subset (E-ID)+ is equal to set of all attributes. So (E-ID, E-NAME) is not a
candidate key.
6.1 LOSSLESS DECOMPOSITION
i. Example:
A decomposition of R into R1 and R2 is a lossless-join decomposition if at least one of the following
functional dependencies is in F+:
R1 ∩ R2 → R1
OR
R1 ∩ R2 → R2
Question 1:
Let R (A, B, C, D) be a relational schema with the following functional dependencies:
A → B, B → C,
C → D and D → B.
Answer:(A)
Explanation:Background :
Lossless-Join Decomposition:
Decomposition of R into R1 and R2 is a lossless-join decomposition if at least one of the
following functional dependencies are in F+ (Closure of functional dependencies)
R1 ∩ R2 → R1
OR
R1 ∩ R2 → R2
Dependency Preserving Decomposition:
Decomposition of R into R1 and R2 is a dependency preserving decomposition if the closure of the
functional dependencies after decomposition is the same as the closure of the FDs before
decomposition.
A simple way is to just check whether we can derive all the original FDs from the FDs present
after decomposition.
Question :
Let R (A, B, C, D) be a relational schema with the following functional dependencies:
A -> B, B -> C,
C -> D and D -> B.
The decomposition is dependency preserving, as all the functional dependencies are preserved directly
or indirectly. Note that C -> D is also preserved via the following two dependencies: C -> B and B -> D.
2.6 QUERY PROCESSING AND OPTIMIZATION
Query processing includes translation of high-level queries into low-level expressions that can be
used at the physical level of the file system, query optimization and actual execution of the query to get
the result. It is a three-step process that consists of parsing and translation, optimization and execution
of the query submitted by the user.
The goals of query processing and optimization are to:
· Minimize the response time of a query (the time taken to produce the results to the user's query).
· Maximize system throughput (the number of requests processed in a given amount of time).
· Reduce the amount of memory and storage required for processing.
· Increase parallelism.
Query Parsing and Translation
Query optimization involves three steps, namely query tree generation, plan generation, and query
plan code generation.
Step 1 − Query Tree Generation
A query tree is a tree data structure representing a relational algebra expression. The tables of the
query are represented as leaf nodes. The relational algebra operations are represented as the
internal nodes. The root represents the query as a whole.
During execution, an internal node is executed whenever its operand tables are available. The node
is then replaced by the result table. This process continues for all internal nodes until the root node is
executed and replaced by the result table.
Consider two example relations, EMPLOYEE and DEPARTMENT.
1. Example 1
πEmpID(σEName="ArunKumar"(EMPLOYEE))
2. Example 2
πEName,Salary(σDName="Marketing"(DEPARTMENT))⋈DNo=DeptNo(EMPLOYEE)
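For reference, these two relational algebra examples correspond to SQL along the following lines (a sketch assuming the column names used in the expressions):
SELECT EmpID FROM EMPLOYEE WHERE EName = 'ArunKumar';
SELECT EName, Salary
FROM DEPARTMENT JOIN EMPLOYEE ON DNo = DeptNo
WHERE DName = 'Marketing';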
Step 2 − Query Plan Generation
After the query tree is generated, a query plan is made. A query plan is an extended query tree that
includes access paths for all operations in the query tree. Access paths specify how the relational
operations in the tree should be performed. For example, a selection operation can have an access
path that gives details about the use of a B+ tree index for selection.
Besides, a query plan also states how the intermediate tables should be passed from one operator
to the next, how temporary tables should be used and how operations should be
pipelined/combined.
Step 3− Code Generation
Code generation is the final step in query optimization. It is the executable form of the query,
whose form depends upon the type of the underlying operating system. Once the query code is
generated, the Execution Manager runs it and produces the results.
Among the approaches for query optimization, exhaustive search and heuristics-based
algorithms are mostly used.
Exhaustive Search Optimization
In these techniques, for a query, all possible query plans are initially generated and then the
best plan is selected. Though these techniques provide the best solution, they have exponential time
and space complexity owing to the large solution space. An example is the dynamic programming
technique.
Heuristic Based Optimization
Heuristic based optimization uses rule-based optimization approaches for query optimization.
These algorithms have polynomial time and space complexity, which is lower than the exponential
complexity of exhaustive search-based algorithms. However, these algorithms do not necessarily
produce the best query plan. Commonly used heuristic rules are:
· Perform select and project operations before join operations. This is done by moving the select and
project operations down the query tree. This reduces the number of tuples available for join.
· Perform the most restrictive select/project operations first, before the other operations.
· Avoid cross-product operations, since they result in very large intermediate tables.
2.6.3 Query Processing Steps:
· Each operation in the query (SELECT, JOIN, etc.) can be implemented using one or more different
Access Routines.
· For example, an access routine that employs an index to retrieve some rows would be more
efficientthat an access routine that performs a full table scan.
· The goal of the query optimizer is to find a reasonably efficient strategy for executing the query
(notquite what the name implies) using the access routines.
· Optimization typically takes one of two forms: Heuristic Optimization or Cost Based Optimization
· In Heuristic Optimization, the query execution is refined based on heuristic rules for reordering
theindividual operations.
· With Cost Based Optimization, the overall cost of executing the query is systematically reduced
byestimating the costs of executing several different execution plans.
3. Query Code Generator (interpreted or compiled)
· Once the query optimizer has determined the execution plan (the specific ordering of access
routines),the code generator writes out the actual access routines to be executed.
· With an interactive session, the query code is interpreted and passed directly to the runtime
databaseprocessor for execution.
· It is also possible to compile the access routines and store them for later execution.
4. Execution in the runtime database processor
· At this point, the query has been scanned, parsed, planned and (possibly) compiled.
· The runtime database processor then executes the access routines against the database.
· The results are returned to the application that made the query in the first place.
· Any runtime errors are also returned.
· The aim is to enable the system to achieve (or improve to) acceptable performance by choosing a
better (if not the best) strategy during the processing of a query. This is one of the great strengths of
the relational database.
b. Reduce the number of comparisons by converting a restriction condition to an equivalent condition
in conjunctive normal form - that is, a condition consisting of a set of restrictions that are ANDed
together, where each restriction in turn consists of a set of simple comparisons connected only by
ORs.
c. A sequence of restrictions (selects) before the join.
d. In a sequence of projections, all but the last can be ignored.
e. A restriction of projection is equivalent to a projection of a restriction.
f. Others
3. Choose candidate low-level procedures by evaluating the transformed query.
* Access path selection: Consider the query expression as a series of basic operations (join,
restriction, etc.); the optimizer then chooses from a set of pre-defined, low-level implementation
procedures. These procedures may involve the use of primary keys, foreign keys or indexes and other
information about the database.
4. Generate query plans and choose the cheapest, by constructing a set of candidate query plans first
and then choosing the best plan. Picking the best plan is achieved by assigning a cost to each given
plan. The cost is computed according to the number of disk I/Os involved.
Equivalence rules define how to write equivalent expressions for each of the operators. Let us see
them below.
c. Selection Operations
When we have multiple selection operations nested one inside the other on a table, we can write them
in any order. That is,
σθ1 (σθ2 (T)) = σθ2 (σθ1 (T)), where T is the table and θ is a filter condition.
This implies that in a selection operation the order of θ1 and θ2 does not affect the result; they can be
applied in any order. That is, the conditions of the selection operation are commutative in nature.
For example, retrieve the students of age 18 who are studying in class DESIGN_01. The two filter
conditions, AGE = 18 and CLASS_ID = 'DESIGN_01', can be applied in either order; both orders
produce the same result (see the sketch below).
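In SQL, the two orders of applying the filters could be sketched with nested queries like these (assuming a STUDENT table with columns AGE and CLASS_ID, as used later in this section):
SELECT * FROM (SELECT * FROM STUDENT WHERE CLASS_ID = 'DESIGN_01') S WHERE AGE = 18;
SELECT * FROM (SELECT * FROM STUDENT WHERE AGE = 18) S WHERE CLASS_ID = 'DESIGN_01';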
Actually, relational expressions are written in this form as part of the equivalence rules; the queries
above are written for better understanding. Going forward, let us try to understand the equivalence
rule in terms of relational expressions. The relational expression for the above query can be written as:
σAGE=18 (σCLASS_ID='DESIGN_01' (STUDENT)) = σCLASS_ID='DESIGN_01' (σAGE=18 (STUDENT))
Let us see this with the help of data. Below tables show the select operation by altering the filtering
condition. Though the intermediate tables will have different records, the final result is same.
d. Conjunctive Selection
When we have a selection operation with multiple filter conditions, we can split it into a sequence of
selection operations. That is,
σθ1 ∧ θ2 (T) = σθ1 (σθ2 (T))
For example, retrieve the students of age 18 and who are studying in class DESIGN_01.
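In SQL this is simply the conjunction of the two conditions (a sketch on the same assumed STUDENT table):
SELECT * FROM STUDENT WHERE AGE = 18 AND CLASS_ID = 'DESIGN_01';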
e. Sequence of projection operations
When there is a sequence of projection operations, only the outermost projection is required and the
rest of the projections can be ignored (each column list being contained in the next). In terms of
relational expressions, it can be written as below:
∏Col_list1(∏Col_list2(…(∏Col_listN(T))….)) = ∏Col_list1(T)
∏STD_ID, STD_NAME (… (STUDENT)) = ∏STD_ID, STD_NAME (STUDENT)
That means final set of columns which we are going to select is only required rather than selecting
all set of different columns throughout the query.
When we have a Cartesian product of two tables and a selection condition on the result, we can
replace it with a natural join using that condition.
i.e., the Cartesian product of EMP and DEPT combines the records of both tables irrespective of any
join condition. When a selection operation with the join condition is applied to it, it selects only those
records which have the same DEPT_ID. The same can be done with a natural join, which selects only
the matching records from both tables based on DEPT_ID.
Similarly, when we have natural join on two tables with one condition and a selection operation on it
with another condition, then we can combine the conditions into natural join.
i.e.; When a natural join on EMP and DEPT is performed based on DEPT_ID, and then the result is
filtered for AGE = 23 will yield a same result as performing natural join with condition on
DEPT_ID and AGE = 23.
When we have a theta join of two tables, the order of the tables does not have any significance; the
tables can be written in either order, i.e. theta joins are commutative.
T ⋈θ S = S ⋈θ T
This is the enhancement of above rule. It states that when natural joins are performed on three or
more tables, then it can be performed in any combination of two tables. That means natural joins are
associative.
(R ⋈ S) ⋈ T = R ⋈ (S ⋈ T)
(EMP ⋈ DEPT) ⋈ PROJECT = EMP ⋈ (DEPT ⋈ PROJECT)
Selection operations are distributive over theta join (natural join), as below.
When all the columns in the selection condition belong to only one table, say T, we can rewrite the
selection with the theta join as:
σθ1 (T ⋈θ S) = (σθ1 (T)) ⋈θ S
Joining the tables with less number of records is always efficient. Performing SELECT operation
first will reduce number of records to be joined in a relation, hence increasing the performance. This
is the inference we get from above rule.
When the condition θ1 has only the columns of T, θ2 has only the columns of S, and θ has columns
from both tables, then we have the equivalence rule:
σθ1 ∧ θ2 (T ⋈θ S) = (σθ1 (T)) ⋈θ (σθ2 (S))
i. Distributive Projection Over Theta Join
Let CT and CS be the columns of T and S respectively and are selected in the final result. Let
CT1 and CS2 be the columns of T and S respectively and are used in the theta join, but are
not in CT and CS. Then selecting the columns of T and S after having theta join on them is
equivalent to selecting the columns of individual tables and then having join on them.
That is:
∏CT ∪ CS (T ⋈θ S) = ∏CT ∪ CS ((∏CT ∪ CT1 (T)) ⋈θ (∏CS ∪ CS2 (S)))
j. Commutative Set Operation
This rule states that tables used in set operators like UNION and INTERSECTION can be used
interchangeably. i.e.;
T U S = S U T
T ∩S = S ∩T
This rule states that when there is a sequence of UNION or INTERSECTION operations on more than
two tables, it can be evaluated from beginning to end or from end to beginning, combining two tables
at a time. i.e.,
R U (S U T) = (R U S) U T
R ∩ (S ∩ T) = (R ∩ S) ∩ T
The first rule says that applying a selection after a set operator is the same as applying the selection
to each operand individually and then applying the set operator:
σ θ (T U S) = σ θ (T) U σ θ (S)
σ θ (T ∩ S) = σ θ (T) ∩ σ θ (S)
σ θ (T - S) = σ θ (T) - σ θ (S)
The second rule states that, for intersection and difference, applying the selection condition after the
set operator is the same as applying the selection condition to the first table only and then applying
the set operator (this shortcut does not hold for UNION):
σ θ (T ∩ S) = σ θ (T) ∩ S
σ θ (T - S) = σ θ (T) – S
This rule says that projecting the columns after applying the UNION operator is the same as projecting
the corresponding columns of the individual tables and then applying the UNION operator:
∏ CT U CS (T U S) = ∏ CT (T) U ∏ CS (S)
When there is a complex query, the processor will analyze it and see which of the above rules can be
applied so that processing becomes efficient and easier. Some key points to be noted while
evaluating a query are:
Executing queries with fewer records is always more efficient. Hence, perform the
selection operation as early as possible, so that it reduces the number of records in the
query.
For example, consider retrieving the names of students who are studying in class DESIGN_01. This can be done in two ways. The first is to retrieve all student names and classes from the STUDENT table and then apply the filter to select only DESIGN_01 students. Suppose we have 100 students in the STUDENT table and only 10 are in the DESIGN_01 class. This approach retrieves STD_NAME and CLASS_ID for all 100 students first, and only then applies the filter to select the 10 design students. The cost of the query is therefore higher, and it needs more space to store the temporary result of STD_NAME and CLASS_ID for 100 students.
The second method is to first filter the students who are studying in class DESIGN_01 and then select the student names from that list of records. Here it filters the DESIGN_01 students first, so it selects only 10 student records, and then projects only the required column, STD_NAME. This is the more efficient method, as it reduces the number of records at the first step itself (see the sketch below).
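The difference between the two methods can be sketched in plain Python (a toy illustration over in-memory lists with made-up student data, not actual DBMS internals):

students = [{"STD_NAME": "S" + str(i), "CLASS_ID": "DESIGN_01" if i < 10 else "OTHER"}
            for i in range(100)]

# Plan 1: project STD_NAME and CLASS_ID for all students, then filter.
projected = [{"STD_NAME": s["STD_NAME"], "CLASS_ID": s["CLASS_ID"]} for s in students]
plan1 = [r["STD_NAME"] for r in projected if r["CLASS_ID"] == "DESIGN_01"]

# Plan 2: filter DESIGN_01 students first, then project STD_NAME.
filtered = [s for s in students if s["CLASS_ID"] == "DESIGN_01"]
plan2 = [s["STD_NAME"] for s in filtered]

print(len(projected), len(filtered))   # 100 vs 10 intermediate rows
assert plan1 == plan2                  # same answer, different cost

Both plans return the same names, but the second materializes only 10 intermediate rows instead of 100.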
When multiple tables are involved in a query with different joins, it is better to split the query into different parts and then execute them in a sequence. We can apply the above 12 rules in this situation.
Suppose we have to retrieve the details of employees who work in the DESIGN department and are 23 years old. The query can be written as a join of EMP and DEPT followed by selections on the department and on AGE.
But applying the selections only at the end keeps a large number of records around until the end. It is better to filter the records of each table first and then apply the join: the first task is to filter the EMP records, then filter the DEPT records, and then join them. These actions are performed in a sequence.
When only a few columns need to be displayed in the result, push the projection operation as early as possible, similar to pushing the selection operation. Having selection and projection early in the query reduces the number of unnecessary records and attributes in the temporary tables. We can use any of the 12 equivalence rules to get this done.
∏EMP_ID, EMP_NAME, DEPT_ID, DEPT_NAME (EMP ⋈DEPT_ID DEPT) = ∏EMP_ID, EMP_NAME, DEPT_ID (EMP) ⋈DEPT_ID ∏DEPT_ID, DEPT_NAME (DEPT)
While joining more than two tables, we can use the join associativity rule so that the smaller tables are joined first and the intermediate temporary table stays small.
Suppose we have EMP, DEPT and PROJ tables with 10000, 20 and 50 records respectively, and we have to retrieve EMP_NAME, DEPT_NAME and PROJ_NAME for each employee. The query is then a three-way join of EMP, DEPT and PROJ projected onto those columns.
Here we can evaluate from beginning to end or from end to beginning. But starting from the beginning joins the very large table EMP first, so the intermediate temporary table needs a large amount of space to hold the joined records. It is therefore better to evaluate from end to beginning, using the join associativity rule (a rough size comparison is sketched below).
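A rough back-of-the-envelope sketch in Python of the intermediate sizes (the cardinalities are the assumed ones above, and we additionally assume each employee matches one DEPT row and each PROJ row belongs to one department):

emp_rows, dept_rows, proj_rows = 10000, 20, 50

# (EMP ⋈ DEPT) ⋈ PROJ: the first intermediate table has one row per employee.
intermediate_emp_first = emp_rows        # 10000 rows to materialize

# EMP ⋈ (DEPT ⋈ PROJ): the first intermediate table pairs departments with projects.
intermediate_proj_first = proj_rows      # at most 50 rows to materialize

print(intermediate_emp_first, intermediate_proj_first)

Under these assumptions the end-to-beginning order keeps the temporary table roughly 200 times smaller.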
• An internal representation (query tree or query graph) of the query is created after scanning, parsing, and validating.
• Then the DBMS must devise an execution strategy for retrieving the result from the database files.
• How to choose a suitable (efficient) strategy for processing a query is known as query optimization.
• There are two main techniques for implementing query optimization:
– Using heuristic rules to order the operations in an execution strategy.
– Systematically estimating the cost of different execution strategies and choosing the one with the lowest cost estimate.
• An SQL query is first translated into an equivalent extended relational algebra expression (as a query tree) that is then optimized.
Query block in SQL: the basic unit that can be translated into the algebraic operators and optimized. A query block contains a single SELECT-FROM-WHERE expression, as well as GROUP BY and HAVING clauses.
Inner block:
( SELECT MAX(SALARY)
  FROM EMPLOYEE
  WHERE DNO = 5 )
Outer block:
SELECT LNAME, FNAME
FROM EMPLOYEE
WHERE SALARY > c
• External sorting refers to sorting algorithms that are suitable for large files of records stored on disk that do not fit entirely in main memory.
• Use a sort-merge strategy, which starts by sorting small subfiles – called runs – of the main file and merges the sorted runs, creating larger sorted subfiles that are merged in turn.
• Sorting phase:
– Runs of the file that can fit in the available buffer space are read into main memory, sorted using an internal sorting algorithm, and written back to disk as temporary sorted subfiles (or runs).
– Let nR be the number of initial runs, b the number of file blocks, and nB the available buffer space (in blocks).
– nR = ⌈b / nB⌉
– If the available buffer size is 5 blocks and the file contains 1024 blocks, then there are 205 initial runs, each of size 5 blocks. After the sort phase, the 205 sorted runs are stored as temporary subfiles on disk.
• Merging phase:
– The degree of merging (dM) is the number of runs that can be merged together in each pass.
– In each pass, one buffer block is needed to hold one block from each of the runs being merged, and one block is needed to hold one block of the merge result.
• For example, with degree of merging dM = 2:
– 5 initial runs: [2, 8, 11], [4, 6, 7], [1, 9, 13], [3, 12, 15], [5, 10, 14]
– After the first pass: 3 runs
[2, 4, 6, 7, 8, 11], [1, 3, 9, 12, 13, 15], [5, 10, 14]
– After the second pass: 2 runs
[1, 2, 3, 4, 6, 7, 8, 9, 11, 12, 13, 15], [5, 10, 14]
– After the third pass:
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]
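The sort-merge strategy can be simulated in a few lines of Python (an in-memory sketch that ignores real disk I/O; the run size and degree of merging are the illustrative values used above):

import heapq

def external_sort(records, buffer_size, degree_of_merging):
    # Sorting phase: cut the file into runs that fit in the buffer and sort each run.
    runs = [sorted(records[i:i + buffer_size])
            for i in range(0, len(records), buffer_size)]
    # Merging phase: repeatedly merge up to dM runs until one sorted run remains.
    while len(runs) > 1:
        runs = [list(heapq.merge(*runs[i:i + degree_of_merging]))
                for i in range(0, len(runs), degree_of_merging)]
    return runs[0] if runs else []

data = [2, 8, 11, 4, 6, 7, 1, 9, 13, 3, 12, 15, 5, 10, 14]
print(external_sort(data, buffer_size=3, degree_of_merging=2))
# [1, 2, 3, ..., 15], reproducing the three merge passes traced above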
Algorithms for SELECT and JOIN Operations
– S3. Using a primary index: If the selection condition involves an equality comparison on a key attribute with a primary index (e.g., OP1).
– S4. Using a primary index to retrieve multiple records: If the comparison condition is >, ≤, <, ≥ on a key field with a primary index – for example, DNUMBER > 5 in OP2 – use the index to find the record satisfying the equality condition (DNUMBER = 5), then retrieve all subsequent (or preceding) records in the (ordered) file.
– S5. Using a clustering index to retrieve multiple records: If the selection condition involves an equality comparison on a non-key attribute with a clustering index – for example, DNO = 5 in OP3 – use the index to retrieve all the records satisfying the condition.
– (OP6): EMPLOYEE ⋈DNO=DNUMBER DEPARTMENT
– (OP7): DEPARTMENT ⋈MGRSSN=SSN EMPLOYEE
– General form: R ⋈A=B S
– J1. Nested-loop join (brute force): For each record t in R (outer loop), retrieve every record s from S and test whether the two records satisfy the join condition t[A] = s[B].
– J2. Single-loop join (using an access structure to retrieve the matching records): If an index exists for one of the two join attributes – say, B of S – retrieve each record t in R, one at a time (single loop), and then use the access structure to retrieve directly all matching records s from S that satisfy s[B] = t[A].
– J4. Hash-join:
∗ Partitioning phase: a single pass through the file with fewer records (say, R) hashes its records (using A as the hash key) into the hash file buckets.
∗ Probing phase: a single pass through the other file (S) then hashes each of its records to probe the appropriate bucket.
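A minimal sketch of the J4 hash-join idea on in-memory lists of dictionaries (the table and column names are illustrative, not taken from a specific example in these notes):

from collections import defaultdict

def hash_join(smaller, larger, key):
    # Partitioning phase: hash the smaller relation on the join attribute.
    buckets = defaultdict(list)
    for row in smaller:
        buckets[row[key]].append(row)
    # Probing phase: scan the larger relation once, probing the matching bucket.
    result = []
    for row in larger:
        for match in buckets.get(row[key], []):
            result.append({**match, **row})
    return result

dept = [{"DEPT_ID": 1, "DEPT_NAME": "DESIGN"}, {"DEPT_ID": 2, "DEPT_NAME": "HR"}]
emp = [{"EMP_ID": 10, "EMP_NAME": "A", "DEPT_ID": 1},
       {"EMP_ID": 11, "EMP_NAME": "B", "DEPT_ID": 2}]
print(hash_join(dept, emp, "DEPT_ID"))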
• PROJECT operation:
– If the <attribute list> of the PROJECT operation π<attribute list>(R) includes a key of R, then the number of tuples in the projection result is equal to the number of tuples in R, but only with the values for the attributes in <attribute list> in each tuple.
– If the <attribute list> does not contain a key of R, duplicate tuples must be eliminated. The following methods can be used to eliminate duplicates:
∗ Sorting the result of the operation and then eliminating duplicate tuples.
∗ Hashing the result of the operation into a hash file in memory and checking each hashed record against those in the same bucket; if it is a duplicate, it is not inserted.
• CARTESIAN PRODUCT operation: inherently expensive, since every record of one file must be combined with every record of the other; it should be avoided or rewritten as a join if possible.
• Set operations (UNION, INTERSECTION, SET DIFFERENCE): can use a hashing technique – that is, first hash (partition) one file, then hash (probe) the other file.
Implementing Aggregate Operations
• The aggregate operators (MIN, MAX, COUNT, AVERAGE, SUM), when applied to an entire table, can be computed by a table scan or by using an appropriate index.
• SELECT MAX(SALARY)
FROM EMPLOYEE;
If an (ascending) index on SALARY exists for the EMPLOYEE relation, the optimizer can determine the largest value from the rightmost leaf (B-tree and B+-tree) or the last entry in the first-level index (clustering, secondary).
• The index can also be used for the COUNT, AVERAGE, and SUM aggregates, if it is a dense index. For a nondense index, the actual number of records associated with each index entry must be used for a correct computation.
• For a GROUP BY clause in a query, the technique is to partition the relation on the grouping attributes (by either sorting or hashing) and then to apply the aggregate operators to each group.
– If a clustering index exists on the grouping attributes, then the records are already partitioned.
Implementing Outer Join
– For a left (right) outer join, we use the left (right) relation as the outer loop or single loop in the join algorithms.
– If there are matching tuples in the other relation, the joined tuples are produced and saved in the result. However, if no matching tuple is found, the tuple is still included in the result but is padded with null values.
• The other join algorithms, sort-merge and hash-join, can also be extended to compute outer joins.
• Pipelined processing: instead of generating temporary files on disk, the result tuples from one operation are provided directly as input for subsequent operations.
• Heuristic optimization uses heuristic rules to modify the internal representation (query tree) of a query.
• The SELECT and PROJECT operations reduce the size of a file and hence should be applied first.
– A query tree is a tree data structure that corresponds to a relational algebra expression. The input relations of the query are the leaf nodes; the relational algebra operations are the internal nodes.
– The query graph does not indicate an order in which to perform the operations. There is only a single graph corresponding to each query. Hence, a query graph corresponds to a relational calculus expression.
• Many different relational algebra expressions – and hence many different query trees – can be equivalent.
• The initial query tree (canonical query tree) is generated by the following sequence:
– The CARTESIAN PRODUCT of the relations in the FROM clause is formed.
– The selection and join conditions of the WHERE clause are applied.
– The attributes of the SELECT clause are projected.
• Example of transforming a query. See Figure 15.5 (Fig. 18.5 in e3). Consider the SQL query below.
SELECT LNAME
– 1. Cascade of σ:
σ c1 and c2 and ... and cn (R) ≡ σc1 (σc2 (. . . (σcn (R)) . . .))
– 2. Commutativity of σ:
σc1 (σc2 (R)) ≡ σc2 (σc1 (R))
– 3. Cascade of π:
πList1 (πList2 (. . . (πListn (R)) . . .)) ≡ πList1 (R)
– 5. Commutativity of ⋈ (and ×):
R ⋈c S ≡ S ⋈c R
R × S ≡ S × R
– 6. Commuting σ with ⋈ (or ×): If the selection condition c can be written as c1 AND c2, where c1 involves only attributes of R and c2 involves only attributes of S, then:
σc (R ⋈ S) ≡ (σc1 (R)) ⋈ (σc2 (S))
– 7. Commuting π with ⋈ (or ×): Suppose the projection list is L = {A1, . . . , An, B1, . . . , Bm}, where A1, . . . , An are attributes of R and B1, . . . , Bm are attributes of S.
∗ If the join condition c involves only attributes in L:
πL (R ⋈c S) ≡ (πA1,...,An (R)) ⋈c (πB1,...,Bm (S))
– 10. Commuting σ with set operations: Let θ stand for any of the three set operations ∩, ∪, and −. Then:
σc (R θ S) ≡ (σc (R)) θ (σc (S))
– 5. Break up and push down PROJECT operations:
Using rules 3, 4, 7, and 11 concerning the cascading of PROJECT and the commuting of PROJECT with other operations, break down and move lists of projection attributes down the tree as far as possible by creating new PROJECT operations as needed.
• Example for transforming SQL query ⇒ Initial query tree ⇒Optimized query tree.
SELECT lname
• The cost of executing a query includes the following components:
– 1. Access cost to secondary storage: The cost of searching for, reading, and writing data blocks that reside on secondary storage.
– 2. Storage cost: The cost of storing temporary files generated by an execution strategy for the query.
– 3. Computation cost: The cost of performing in-memory operations on the data buffers during query execution.
– 4. Memory usage cost: The cost pertaining to the number of memory buffers needed during query execution.
– 5. Communication cost: The cost of shipping the query and its results from the database site to the site or terminal where the query originated.
• Different applications emphasize individual cost components differently. For example:
– For large databases, the main emphasis is on minimizing the access cost to secondary storage.
– For smaller databases, the emphasis is on minimizing computation cost, because most of the data in the files involved in the query can be completely stored in memory.
– For distributed databases, communication cost must be minimized ahead of the other factors.
– It is difficult to include all the cost components in a weighted cost function because of the difficulty of assigning suitable weights to the cost components.
• The necessary information for cost function evaluation is stored in the DBMS catalog:
– For each file, the access methods or indexes and the corresponding access attributes.
– The number of levels (x) of each multilevel index, needed for cost functions that estimate the number of block accesses.
– The number of first-level index blocks (bI1).
– The number of distinct values (d) of an attribute and its selectivity (sl), which is the fraction of records satisfying an equality condition on the attribute.
∗ The selectivity allows us to estimate the selection cardinality (s = sl × r) of an attribute, which is the average number of records that will satisfy an equality selection on that attribute.
∗ For a key attribute, d = r, sl = 1/r and s = 1.
∗ For a nonkey attribute, assuming that the d distinct values are uniformly distributed among the records, sl = 1/d and s = r/d.
Module-3
3 Storage strategies
3.1 Indices
The main goal of database design is faster access to data and quicker insert/delete/update operations, because no one likes waiting. When a database is very large, even the smallest transaction takes time to perform. To reduce the time spent in transactions, indexes are used. Indexes are similar to the catalogue in a library, or to the index of a book: they make our search simpler and quicker. The same concept is applied in a DBMS to access the files in memory.
When records are stored in primary memory such as RAM, accessing them is easy and quick. But the number of records is far too large to keep them all in RAM, so we have to store them in secondary memory such as a hard disk. As we have seen already, records are not stored in memory the way we see them – as tables. They are stored in the form of files in different data blocks, and each block can store one or more records depending on its size.
When we have to retrieve any required data or perform some transaction on those data, we have to pull them from
memory, perform the transaction and save them back to the memory. In order to do all these activities, we need to have a
link between the records and the data blocks so that we can know where these records are stored. This link between the
records and the data block is called index. It acts like a bridge between the records and the data block.
How do we build the index of a book? We list the main topics first and group the different sub-topics under them. We do the same thing in a database. Each table has a unique column, or primary key column, that uniquely determines each record in the table. Most of the time we use this primary key to create the index. Sometimes we have to fetch records based on other columns that are not the primary key, and in such cases we create an index on those columns. But what is this index? An index in a database is a pointer to a block address in memory, and these pointers are stored in (column value, block_address) format.
The first column is the Search key that contains a copy of the primary key or candidate key of the table. These
values are stored in sorted order so that the corresponding data can be accessed quickly (Note that the data may or
may not be stored in sorted order).
The second column is the Data Reference which contains a set of pointers holding the address of the disk block
where that particular key value can be found.
1. Ordered Indices
Imagine we have a student table with thousands of records, each 10 bytes long, and their IDs start from 1, 2, 3, … and go on. We have to search for the student with ID 678. In a database with no index, the DBMS searches the disk blocks from the beginning until it reaches 678, so it reaches this record only after reading 677*10 = 6770 bytes. But if we have an index on the ID column, then the address of the location is stored for each record as (1, 200), (2, 201), …, (678, 879) and so on; one can imagine it as a smaller table with an index column and an address column. Now if we want to search for the record with ID 678, the search goes through the index, traversing only 677*2 = 1354 bytes, which is much less than before. Hence retrieving the record from disk becomes faster. In most cases these indexes are kept sorted to make searching faster; if the indexes are sorted, they are called ordered indices.
2. Primary Index
If the index is created on the primary key of the table, it is called primary indexing. Since primary keys are unique and have a 1:1 relation with the records, fetching records through them is much easier. These primary keys are also kept in sorted form, which helps the performance of transactions. Primary indexing is of two types – dense index and sparse index.
In a dense index, an index entry is stored for every record (every search-key value) in the file. Indexing is created not only for the primary key but also for the other columns on which we perform transactions: a user can fire a query based on any column in the table according to the requirement, and an index only on the primary key does not help in those cases. Hence indexes on all the search-key columns are stored; this method is called a dense index.
For example, a student can be searched by ID, which is the primary key. In addition, we search for a student by first name, last name, a particular age group, place of residence, courses opted for, and so on. That means most of the columns in the table can be used to search for a student based on different criteria, but if we only have an index on the ID, those other searches will not be efficient. Hence indexes on the other search columns are also stored to make those fetches faster. Though this supports quick search on any search key, the space used for the indexes and addresses becomes an overhead in memory: the (index, address) data becomes almost as large as the (table record, address) data, so more and more space is consumed as the number of records grows.
To address the issues of dense indexing, sparse indexing is introduced. In this method of indexing, a range of index column values shares the same data block address, and when data is to be retrieved, the block at that address is scanned linearly until we get the requested data.
Let us see how the above example of a dense index is converted into a sparse index.
In the above diagram we can see that indexes are not stored for all the records; instead indexes are stored only for 3 records. Now if we have to search for a student with ID 102, the largest index value less than or equal to 102 is searched, which returns the address of ID 100. The block at this address is then scanned linearly until we reach the record for 102. Hence it makes searching faster and also reduces the storage space for indexes.
The range of column values sharing one index address can be increased or decreased depending on the number of records in the table. The main goal of this method is a more efficient search with less memory space.
But if the table is very large, providing a very large range between the indexed column values will not work; we have to make the column ranges considerably shorter. In that situation the (index, address) mapping file grows again, as we saw with dense indexing.
3. Secondary Index
In sparse indexing, as the table grows, the (index, address) mapping file also grows. These mappings are usually kept in primary memory so that the address fetch is fast, and the actual data is then fetched from secondary memory using the address obtained from the mapping. If the mapping itself grows too large, fetching the address becomes slow, and sparse indexing is no longer efficient. To overcome this problem, the next version of sparse indexing is introduced: secondary indexing.
In this method, another level of indexing is introduced to reduce the size of the (index, address) mapping. Initially a large range of column values is chosen so that the first-level mapping is small; each range is then further divided into smaller ranges. The first level of the mapping is stored in primary memory so that address fetches are fast; the second level of the mapping and the actual data are stored in secondary memory – the hard disk.
In the above diagram we can see that the column values are first divided into groups of 100, and these groups are stored in primary memory. In secondary memory, each group is further divided into sub-groups, and the actual data records are then stored in physical memory. Notice that each first-level index address points to the first address of its secondary-level block, and each secondary-level index address points to the first address of its data block. To search for any value in between, the corresponding addresses are looked up in the first and second levels respectively, and then a linear search is performed from that address in the data block.
For example, to search for 111 in the above diagram, it looks up the largest first-level index value that is <= 111, which gives 100. Then at the secondary index level it again finds the largest value <= 111, which gives 110. Now it goes to the data block at address 110 and scans each record until it finds 111. This is how a search is done in this method; inserting, deleting and updating are done in the same manner (see the sketch below).
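The two-level lookup can be sketched as follows (the index contents and "data blocks" are illustrative assumptions, not real disk pages):

import bisect

first_level = [(100, 0), (200, 1)]                # (lowest key in range, second-level slot)
second_level = [[(100, "B100"), (110, "B110")],   # (lowest key in sub-range, block address)
                [(200, "B200"), (210, "B210")]]
data_blocks = {"B100": [100, 101], "B110": [110, 111, 112],
               "B200": [200, 205], "B210": [210, 215]}

def lookup(key):
    # At each level, pick the largest entry whose key is <= the search key.
    slot = first_level[bisect.bisect_right([k for k, _ in first_level], key) - 1][1]
    entries = second_level[slot]
    block = entries[bisect.bisect_right([k for k, _ in entries], key) - 1][1]
    # Finally scan the data block linearly for the requested record.
    return key if key in data_blocks[block] else None

print(lookup(111))   # follows 100 -> 110 -> linear scan in block B110 -> 111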
4. Clustering Index
In some cases the index is created on non-primary-key columns, which may not be unique for each record. In such cases, in order to identify the records faster, two or more columns are grouped together to obtain unique values and an index is created out of them. This method is known as a clustering index. Basically, records with similar characteristics are grouped together and indexes are created for these groups.
For example, students studying in each semester are grouped together: 1st-semester students, 2nd-semester students, 3rd-semester students, and so on form separate groups.
In the above diagram we can see that indexes are created for each semester in the index file. In the data blocks, the students of each semester are grouped together to form a cluster, and the address in the index file points to the beginning of each cluster. Within the data blocks, the requested student ID is then searched sequentially.
New records are inserted into the clusters based on their group. In the above case, if a new student joins the 3rd semester, the record is inserted into the semester-3 cluster in secondary memory; updates and deletes are handled the same way. If any cluster runs short of space, new data blocks are added to that cluster.
This method of file organization is better than the others in that it distributes the records cleanly, making search easier and faster. But each cluster can be left with unused space, so it takes more memory compared to other methods.
3.2 B-Trees
Before we proceed to B-tree indexing, let's understand what an index means. An index can be simply defined as an optional structure associated with a table or cluster that enables fast access to data and thereby reduces disk I/O. Using an index, a small set of randomly distributed rows can be retrieved from the table.
Without an index, the database would have to perform a full table scan to find a particular value, looking at one row at a time until the desired value is found. One of the most common types of database index is the B-tree (balanced tree); it is the default index type for many storage engines in MySQL. A B-tree index is a well-ordered set of values that are divided into ranges. A B-tree is an example of multilevel indexing; record pointers are present at the leaf nodes as well as at the internal nodes.
B-tree indexes provide universal applicability and are storage-friendly: they work at any layer of the storage hierarchy.
Let's take an example to explain how B-tree indexing is helpful. Imagine books arranged alphabetically in a college library that holds books of all departments such as Automobile, Aeronautical, Bio-tech, Chemical, Civil, Electronics and so on. After entering the library, you see that the ground floor contains books for department names A-G, the first floor H-N, the second floor O-U and the third floor V-Z, so based on your requirement you can quickly find the required book. Now consider the equivalent database search: imagine a books table with a B-tree index on the dpt_name column. To find your Civil book, you can simply run the query below:
SELECT * FROM books WHERE
dpt_name = 'civil'
Initially, the database examines the root node of the B-tree index, which defines four ranges corresponding to the four floors of the library. Each node in the B-tree is a block. Blocks on every level of the tree except the last are called branch blocks, whose entries are sets of ranges and pointers to blocks on the next level of the tree. Those on the last level are called leaf blocks, which contain the key values (an example is 'civil').
B-tree indexes have several sub-types:
Index-organized tables: different from heap-organized tables in that the data itself is the index here.
Reverse key indexes: the bytes of the index key are stored in reverse order of the original key; for example, 346 is stored as 643.
Descending indexes: the data is stored in descending order of a particular column.
Unlike binary search trees, B-trees are optimized for systems that read and write large blocks of data; they are a good example of a data structure for external memory and are commonly used in databases and file systems. In a binary tree, each node has at most two children (a left and a right node), and the time required to find a node is proportional to the tree's height and balance (balance here means both sub-trees have almost the same height), which makes it a good indexing data structure where the access time is logarithmic in the number of nodes.
It would be better to keep the whole index structure in memory, but for large indexes that is not possible. For a binary tree the branching factor is 2, so the nodes are highly granular and one has to do many round trips to arrive at the final node; a B-tree, on the other hand, has a high branching factor, so the required node is reached in far fewer steps.
Following is an example B-tree of minimum degree 3. Note that in practical B-trees, the value of the minimum degree is much greater than 3.
Search
Search is similar to the search in Binary Search Tree. Let the key to be searched be k. We start from the root and recursively
traverse down. For every visited non-leaf node, if the node has the key, we simply return the node. Otherwise, we recur down to
the appropriate child (The child which is just before the first greater key) of the node. If we reach a leaf node and don’t find k in
the leaf node, we return NULL.
Traverse
Traversal is similar to the in-order traversal of a binary tree. We start from the leftmost child and recursively print it, then repeat the same process for the remaining keys and children. In the end, we recursively print the rightmost child. A small sketch of both operations follows.
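Both operations can be sketched compactly in Python (the node layout – a list of sorted keys plus a list of children – is a simplifying assumption; real B-tree nodes also carry record pointers):

class BTreeNode:
    def __init__(self, keys, children=None):
        self.keys = keys                  # sorted keys stored in this node
        self.children = children or []    # empty list for leaf nodes

def search(node, k):
    i = 0
    while i < len(node.keys) and k > node.keys[i]:
        i += 1
    if i < len(node.keys) and node.keys[i] == k:
        return node                       # key found in this node
    if not node.children:
        return None                       # reached a leaf without finding k
    return search(node.children[i], k)    # child just before the first greater key

def traverse(node):
    if not node.children:
        return list(node.keys)
    out = []
    for i, key in enumerate(node.keys):
        out += traverse(node.children[i]) + [key]
    return out + traverse(node.children[-1])

root = BTreeNode([10, 20], [BTreeNode([5, 8]), BTreeNode([13, 17]), BTreeNode([25, 30])])
print(search(root, 17) is not None)       # True
print(traverse(root))                     # [5, 8, 10, 13, 17, 20, 25, 30]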
3.3 Hashing
In a database management system, when we want to retrieve particular data, it becomes very inefficient to search through all the index values to reach the desired data. In this situation, the hashing technique comes into the picture.
Hashing is an efficient technique to locate the desired data on the disk directly, without using an index structure. Data is stored in data blocks whose addresses are generated using a hash function. The memory location where these records are stored is called a data block or data bucket.
3.3.1 Hash File Organization:
Data bucket – Data buckets are the memory locations where the records are stored. These buckets are also considered the unit of storage.
Hash function – A hash function is a mapping function that maps the set of search keys to actual record addresses. Generally, the hash function uses the primary key to generate the hash index – the address of the data block. The hash function can be anything from a simple mathematical function to a complex one.
Hash index – The prefix of an entire hash value is taken as the hash index. Every hash index has a depth value to signify how many bits are used for computing the hash function; these bits can address 2^n buckets. When all these bits are consumed, the depth value is increased by one and twice as many buckets are allocated.
The above diagram depicts the data block address being the same as the primary key value. The hash function can also be a simple mathematical function like mod, sin, cos, exponential, etc. Imagine we have mod (5) as the hash function to determine the address of the data block. What happens in the above case? Applying mod (5) to the primary keys generates 3, 3, 1, 4 and 2 respectively, and the records are stored at those data block addresses.
From the above two diagrams it is now clear how the hash function works.
3.3.2 Static Hashing
In static hashing, when a search-key value is provided, the hash function always computes the same address. For example, if we want to generate the address for STUDENT_ID = 104 using the mod (5) hash function, it always results in the same bucket address, 4. The bucket address never changes, so the number of data buckets in memory remains constant throughout for static hashing.
Operations –
Insertion – When a new record is inserted into the table, the hash function h generates a bucket address for the new record based on its hash key K:
Bucket address = h(K)
Searching – When a record needs to be searched, the same hash function is used to compute the bucket address for the record. For example, if we want to retrieve the whole record for ID 104 and the hash function is mod (5) on that ID, the bucket address generated is 4. We then go directly to address 4 and retrieve the whole record for ID 104. Here the ID acts as the hash key.
Deletion – If we want to delete a record, we first fetch the record to be deleted using the hash function, and then remove the record from that address in memory.
Updation – The data record that needs to be updated is first located using the hash function, and then the data record is updated.
A small sketch of these operations follows.
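A tiny sketch of these operations with h(K) = K mod 5 (the record contents are made up, and bucket overflow is ignored here because it is discussed next):

NUM_BUCKETS = 5
buckets = [[] for _ in range(NUM_BUCKETS)]

def h(key):
    return key % NUM_BUCKETS              # static hash function: always the same address

def insert(record):                       # record = (ID, payload)
    buckets[h(record[0])].append(record)

def search(key):
    return next((r for r in buckets[h(key)] if r[0] == key), None)

def delete(key):
    buckets[h(key)] = [r for r in buckets[h(key)] if r[0] != key]

insert((104, "Alice"))
insert((76, "Bob"))
print(h(104), search(104))                # bucket 4, (104, 'Alice')
delete(76)
print(search(76))                         # None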
Now, suppose we want to insert a new record into the file but the data bucket address generated by the hash function is not empty, i.e., data already exists at that address. This becomes a critical situation to handle, and in static hashing it is called bucket overflow.
There are several methods provided to overcome this situation. Some commonly used methods are discussed below:
Open hashing – In the open hashing method, the next available data bucket is used to store the new record instead of overwriting the older one. This method is also called linear probing. For example, if D3 is a new record to be inserted and the hash function generates address 105, but that bucket is already full, the system searches for the next available data bucket, say 123, and assigns D3 to it.
Closed hashing – In the closed hashing method, a new data bucket is allocated for the same address and is linked after the full data bucket. This method is also known as overflow chaining.
For example, we have to insert a new record D3 into the table. The static hash function generates the data bucket address 105, but this bucket is too full to store the new data. In this case a new data bucket is added at the end of bucket 105 and linked to it, and the new record D3 is inserted into the new bucket. Both methods are sketched below.
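The two overflow-handling methods can be contrasted with a small sketch (the bucket count and record names are illustrative):

NUM = 5

# Open hashing / linear probing: one record per bucket; step to the next free bucket.
probing = [None] * NUM
def insert_probing(key, value):
    i = key % NUM
    while probing[i] is not None:         # bucket already occupied, try the next one
        i = (i + 1) % NUM
    probing[i] = (key, value)

# Closed hashing / overflow chaining: each bucket is a chain; overflow records are linked on.
chained = [[] for _ in range(NUM)]
def insert_chained(key, value):
    chained[key % NUM].append((key, value))

insert_probing(3, "D1"); insert_probing(8, "D3")   # 8 also hashes to 3, so D3 lands in bucket 4
insert_chained(3, "D1"); insert_chained(8, "D3")   # D3 stays chained under bucket 3
print(probing)
print(chained)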
i. Quadratic probing:
Quadratic probing is very similar to open hashing (linear probing). The only difference is that, where linear probing moves to the next bucket by a fixed (linear) step, quadratic probing uses a quadratic function of the probe number to determine the new bucket address.
ii. Double Hashing:
Double Hashing is another method similar to linear probing. Here the difference is fixed as in linear
probing, but this fixed difference is calculated by using another hash function. That’s why the name is
double hashing.
The drawback of static hashing is that it does not expand or shrink dynamically as the database grows or shrinks. In dynamic hashing, data buckets grow or shrink (are added or removed dynamically) as the number of records increases or decreases. Dynamic hashing is also known as extended hashing.
In dynamic hashing, the hash function is made to produce a large number of values, of which only a prefix is used at first. For example, suppose there are three data records D1, D2 and D3, and the hash function generates the addresses 1001, 0101 and 1010 respectively. This method of storing considers only part of each address – initially only the first bit – to place the data, so it tries to load the three records at addresses 0 and 1.
But then no bucket address remains for D3. The buckets have to grow dynamically to accommodate D3, so the addressing is changed to use 2 bits rather than 1 bit, the existing data is updated to the 2-bit addresses, and then D3 is accommodated.
MODULE-4
4 TRANSACTION PROCESSING
4.1 Introduction
A transaction is a unit of program execution that accesses and possibly updates various data items. Usually, a transaction
is initiated by a user program written in a high level data-manipulation language or programming language (for
example, SQL, COBOL, C, C++, or Java), where it is delimited by statements (or function calls) of the form begin
transaction and end transaction. The transaction consists of all operations executed between the begin transaction and
end transaction. To ensure integrity of the data, we require that the database system maintain the following properties of
the transactions:
• Atomicity. Either all operations of the transaction are reflected properly in the database, or none are.
• Consistency. Execution of a transaction in isolation (that is, with no other transaction executing concurrently)
preserves the consistency of the database.
• Isolation. Even though multiple transactions may execute concurrently, the system guarantees that, for every pair of
transactions Ti and Tj , it appears to Ti that either Tj finished execution before Ti started, or Tj started execution after Ti
finished. Thus, each transaction is unaware of other transactions executing concurrently in the system.
• Durability. After a transaction completes successfully, the changes it has made to the database persist, even if there
are system failures.
These properties are often called the ACID properties; the acronym is derived from the first letter of each of the four
properties. Let Ti be a transaction that transfers $50 from account A to account B. This transaction can be defined as
Ti: read(A);
A := A − 50;
write(A);
read(B);
B := B + 50;
write(B).
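As an illustration, the same transfer can be run against SQLite from Python (the account table is assumed for the example; the point is that both updates commit together or neither does):

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account(name TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO account VALUES (?, ?)", [("A", 100), ("B", 0)])
conn.commit()

try:
    # sqlite3 opens a transaction implicitly before the first UPDATE.
    conn.execute("UPDATE account SET balance = balance - 50 WHERE name = 'A'")
    conn.execute("UPDATE account SET balance = balance + 50 WHERE name = 'B'")
    conn.commit()        # durability: the transfer persists once commit returns
except Exception:
    conn.rollback()      # atomicity: on failure, neither update is applied

print(conn.execute("SELECT name, balance FROM account ORDER BY name").fetchall())
# [('A', 50), ('B', 50)]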
• Active, the initial state; the transaction stays in this state while it is executing.
• Partially committed, after the final statement has been executed.
• Failed, after the discovery that normal execution can no longer proceed.
• Aborted, after the transaction has been rolled back and the database has been restored to its state prior to the start of the transaction.
• Committed, after successful completion.
The state diagram corresponding to a transaction appears in Figure 15.1. We say that a transaction has committed only if it has entered the committed state.
Similarly, we say that a transaction has aborted only if it has entered the aborted state. A transaction is said to have terminated if it has either committed or aborted. A transaction starts in the active state. When it finishes its final statement,
it enters the partially committed state. At this point, the transaction has completed its execution, but it is still possible
that it may have to be aborted, since the actual output may still be temporarily residing in main memory, and thus a
hardware failure may preclude its successful completion. The database system then writes out enough information to
disk that, even in the event of a failure, the updates performed by the transaction can be re-created when the system
restarts after the failure. When the last of this information is written out, the transaction enters the committed state.
Figure 1: State diagram of a transaction
It can restart the transaction, but only if the transaction was aborted as a result of some hardware or software error that was not created through the internal logic of the transaction. A restarted transaction is considered to be a new transaction.
It can kill the transaction. It usually does so because of some internal logical error that can be corrected only by rewriting the application program, or because the input was bad, or because the desired data were not found in the database.
Concurrent Executions:
Schedules
Schedules – sequences that indicate the chronological order in which instructions of concurrent transactions are executed:
a schedule for a set of transactions must consist of all instructions of those transactions;
it must preserve the order in which the instructions appear in each individual transaction.
Example Schedules
Let T1 transfer $50 from A to B, and T2 transfer 10% of the balance from A to B. The following is a serial schedule (Schedule 1 in the text), in which T1 is followed by T2.
Let T1 and T2 be the transactions defined previously. The following schedule (Schedule 3 in the text) is not a serial schedule, but it is equivalent to Schedule 1.
4.2 Serializability
Basic Assumption – Each transaction preserves database consistency.
Thus serial execution of a set of transactions preserves database consistency.
A (possibly concurrent) schedule is serializable if it is equivalent to a serial schedule. Different forms of schedule equivalence give rise to the notions of:
conflict serializability
view serializability
We ignore operations other than read and write instructions, and we assume that transactions may perform arbitrary computations on data in local buffers in between reads and writes. Our simplified schedules consist of only read and write instructions.
Instructions li and lj of transactions Ti and Tj respectively conflict if and only if there exists some item Q accessed by both li and lj, and at least one of these instructions wrote Q.
o li = read(Q), lj = read(Q): li and lj don't conflict.
o li = read(Q), lj = write(Q): they conflict.
o li = write(Q), lj = read(Q): they conflict.
o li = write(Q), lj = write(Q): they conflict.
Intuitively, a conflict between li and lj forces a (logical) temporal order between them. If li and lj are consecutive in a schedule and they do not conflict, their results would remain the same even if they had been interchanged in the schedule.
If a schedule S can be transformed into a schedule S´ by a series of swaps of non-conflicting instructions, we say that S and S´ are conflict equivalent.
We say that a schedule S is conflict serializable if it is conflict equivalent to a serial schedule.
Example of a schedule that is not conflict serializable:
T3          T4
read(Q)
            write(Q)
write(Q)
We are unable to swap instructions in the above schedule to obtain either the serial schedule <T3, T4>, or the serial schedule <T4, T3>.
Schedule 3 below can be transformed into Schedule 1, a serial schedule where T2 follows T1, by a series of swaps of non-conflicting instructions. Therefore Schedule 3 is conflict serializable. A precedence-graph check for this is sketched below.
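Conflict serializability can be tested mechanically by building a precedence graph (an edge Ti -> Tj whenever an operation of Ti conflicts with a later operation of Tj) and checking it for cycles; the sketch below encodes a schedule as (transaction, operation, item) tuples:

def conflict_serializable(schedule):
    edges = set()
    for i, (ti, op_i, x) in enumerate(schedule):
        for tj, op_j, y in schedule[i + 1:]:
            if ti != tj and x == y and "write" in (op_i, op_j):
                edges.add((ti, tj))              # conflicting pair: Ti must precede Tj
    nodes = {t for t, _, _ in schedule}
    def has_cycle(node, path, seen):             # depth-first search for a cycle
        seen.add(node); path.add(node)
        for a, b in edges:
            if a == node and (b in path or (b not in seen and has_cycle(b, path, seen))):
                return True
        path.discard(node)
        return False
    seen = set()
    return not any(has_cycle(n, set(), seen) for n in nodes if n not in seen)

# The schedule above: T3 read(Q); T4 write(Q); T3 write(Q) gives edges T3->T4 and T4->T3.
print(conflict_serializable([("T3", "read", "Q"), ("T4", "write", "Q"), ("T3", "write", "Q")]))
# False: the precedence graph has a cycle, so the schedule is not conflict serializable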
View Serializability
Let S and S´ be two schedules with the same set of transactions. S and S´ are view equivalent if the following three conditions are met:
1. For each data item Q, if transaction Ti reads the initial value of Q in schedule S, then transaction Ti must, in schedule S´, also read the initial value of Q.
2. For each data item Q, if transaction Ti executes read(Q) in schedule S, and that value was produced by transaction Tj (if any), then transaction Ti must in schedule S´ also read the value of Q that was produced by transaction Tj.
3. For each data item Q, the transaction (if any) that performs the final write(Q) operation in schedule S must perform the final write(Q) operation in schedule S´.
As can be seen, view equivalence is also based purely on reads and writes alone.
A schedule S is view serializable if it is view equivalent to a serial schedule.
Every conflict serializable schedule is also view serializable.
Schedule 9 (from the text) is a schedule which is view serializable but not conflict serializable.
Cascading rollback – a single transaction failure leads to a series of transaction rollbacks. Consider the
following schedule where none of the transactions has yet committed (so the schedule is recoverable)
Lock-Based Protocols
Timestamp-Based Protocols
Validation-Based Protocols
Multiple Granularity
Multiversion Schemes
Deadlock Handling
Lock requests are made to the concurrency-control manager. A transaction can proceed only after the request is granted.
Lock-compatibility matrix
A transaction may be granted a lock on an item if the requested lock is compatible with locks already held on the item by other transactions.
Any number of transactions can hold shared locks on an item, but if any transaction holds an exclusive lock on the item, no other transaction may hold any lock on the item.
If a lock cannot be granted, the requesting transaction is made to wait until all incompatible locks held by other transactions have been released. The lock is then granted.
Example of a transaction performing locking:
T2: lock-S(A);
read(A);
unlock(A);
lock-S(B);
read(B);
unlock(B);
display(A+B)
Locking as above is not sufficient to guarantee serializability — if A and B get updated in between the reads of A and B, the displayed sum would be wrong.
A locking protocol is a set of rules followed by all transactions while requesting and releasing locks. Locking protocols restrict the set of possible schedules.
Neither T3 nor T4 can make progress — executing lock-S(B) causes T4 to wait for T3 to release its lock on B, while executing lock-X(A) causes T3 to wait for T4 to release its lock on A.
Such a situation is called a deadlock.
o To handle a deadlock, one of T3 or T4 must be rolled back and its locks released.
One protocol that ensures serializability is the two-phase locking protocol. This protocol requires that each transaction issue lock and unlock requests in two phases:
1. Growing phase. A transaction may obtain locks, but may not release any lock.
2. Shrinking phase. A transaction may release locks, but may not obtain any new locks.
Initially, a transaction is in the growing phase and acquires locks as needed. Once the transaction releases a lock, it enters the shrinking phase, and it can issue no more lock requests (a small sketch of this rule follows).
For example, transactions T3 and T4 are two phase. On the other hand, transactions T1 and T2 are not two phase. Note that the unlock instructions do not need to appear at the end of the transaction. For example, in the case of transaction T3, we could move the unlock(B) instruction to just after the lock-X(A) instruction and still retain the two-phase locking property.
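The two-phase rule itself can be captured in a few lines (this is only a sketch of the rule, not a full lock manager with lock modes and waiting):

class TwoPhaseTransaction:
    def __init__(self, name):
        self.name = name
        self.locks = set()
        self.shrinking = False            # becomes True after the first unlock

    def lock(self, item):
        if self.shrinking:
            raise RuntimeError(self.name + ": cannot lock in the shrinking phase")
        self.locks.add(item)

    def unlock(self, item):
        self.locks.discard(item)
        self.shrinking = True             # growing phase is over

t = TwoPhaseTransaction("T3")
t.lock("B"); t.lock("A")                  # growing phase
t.unlock("B")                             # shrinking phase begins
try:
    t.lock("C")                           # violates 2PL, so it is rejected
except RuntimeError as e:
    print(e)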
Cascading rollbacks can be avoided by a modification of two-phase locking called the strict two-phase locking
protocol. This protocol requires not only that locking be two phase, but also that all exclusive-mode locks taken
by a transaction be held until that transaction commits. This requirement ensures that any data written by an
uncommitted transaction are locked in exclusive mode until the transaction commits, preventing any other
transaction from reading the data.
Timestamps:
With each transaction Ti in the system, we associate a unique fixed timestamp, denoted by TS(Ti). This timestamp is assigned by the database system before the transaction Ti starts execution. If a transaction Ti has been assigned timestamp TS(Ti), and a new transaction Tj enters the system, then TS(Ti) < TS(Tj). There are two simple methods for implementing this scheme:
1. Use the value of the system clock as the timestamp; that is, a transaction's timestamp is equal to the value of the clock when the transaction enters the system.
2. Use a logical counter that is incremented after a new timestamp has been assigned; that is, a transaction's timestamp is equal to the value of the counter when the transaction enters the system.
The timestamps of the transactions determine the serializability order. Thus, if TS(Ti) < TS(Tj), then the system must ensure that the produced schedule is equivalent to a serial schedule in which transaction Ti appears before transaction Tj.
To implement this scheme, we associate with each data item Q two timestamp values:
• W-timestamp(Q) denotes the largest timestamp of any transaction that executed write(Q) successfully.
• R-timestamp(Q) denotes the largest timestamp of any transaction that executed read(Q) successfully.
These timestamps are updated whenever a new read(Q) or write(Q) instruction is executed.
The timestamp-ordering protocol ensures that any conflicting read and write operations are executed in timestamp order. This protocol operates as follows:
1. Suppose that transaction Ti issues read(Q).
If TS(Ti) < W-timestamp(Q), then Ti needs to read a value of Q that was already overwritten. Hence, the read operation is rejected, and Ti is rolled back.
If TS(Ti) ≥ W-timestamp(Q), then the read operation is executed, and R-timestamp(Q) is set to the maximum of R-timestamp(Q) and TS(Ti).
2. Suppose that transaction Ti issues write(Q).
o If TS(Ti) < R-timestamp(Q), then the value of Q that Ti is producing was needed previously, and the system assumed that that value would never be produced. Hence, the system rejects the write operation and rolls Ti back.
o If TS(Ti) < W-timestamp(Q), then Ti is attempting to write an obsolete value of Q. Hence, the system rejects this write operation and rolls Ti back. Otherwise, the system executes the write operation and sets W-timestamp(Q) to TS(Ti).
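These checks can be sketched directly (a simplified model in which each data item carries its R- and W-timestamps and each operation either succeeds or forces a rollback):

class Item:
    def __init__(self):
        self.r_ts = 0     # largest timestamp of a successful read(Q)
        self.w_ts = 0     # largest timestamp of a successful write(Q)

def read(ts, item):
    if ts < item.w_ts:
        return "rollback"                 # Q was already overwritten by a younger transaction
    item.r_ts = max(item.r_ts, ts)
    return "ok"

def write(ts, item):
    if ts < item.r_ts or ts < item.w_ts:
        return "rollback"                 # a younger transaction already read or wrote Q
    item.w_ts = ts
    return "ok"

q = Item()
print(write(2, q))    # ok: W-timestamp(Q) becomes 2
print(read(1, q))     # rollback: a transaction with TS=1 would read a value from its future
print(read(3, q))     # ok: R-timestamp(Q) becomes 3
print(write(2, q))    # rollback: a younger transaction (TS=3) has already read Q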
Like any computer system, a database system can fail, yet the data stored in the database should be available whenever it is needed. Database recovery means recovering the data when it gets deleted, hacked, or damaged accidentally. Atomicity must be preserved: whether or not a transaction completes, its effects should either be reflected in the database permanently or not affect the database at all. Database recovery and recovery techniques are therefore essential in a DBMS; the main techniques are given below.
Crash recovery:
A DBMS may be an extremely complicated system with many transactions being executed every second. Its durability and robustness depend on its complex architecture and its underlying hardware and system software. If it fails or crashes in the middle of transactions, the system is expected to follow some kind of technique to recover the lost data.
Classification of failures:
To see where the problem has occurred, we generalize failures into the following classes:
Transaction failure
System crash
Disk failure
1. Transaction failure: A transaction has to abort when it fails to execute or when it reaches a point from which it cannot proceed any further. This is called transaction failure, where only a few transactions or processes are affected. The reasons for transaction failure are:
Logical errors
System errors
1. Logical errors: where a transaction cannot complete because of a code error or an internal error condition.
2. System errors: where the database system itself terminates an active transaction because the DBMS is not able to execute it, or has to stop it because of some system condition. For instance, in case of deadlock or resource unavailability, the system aborts an active transaction.
3. System crash: There are problems, external to the system, that may cause the system to stop abruptly and crash. For instance, interruptions in the power supply may cause the failure of underlying hardware or software. Examples include OS errors.
4. Disk failure: In the early days of technology evolution, it was a common problem that hard-disk drives or storage drives failed frequently. Disk failures include the formation of bad sectors, unreachability of the disk, a disk crash, or any other failure that destroys all or part of the disk storage.
Storage structure:
The classification of storage structures is as follows:
1. Volatile storage: As the name suggests, volatile storage cannot survive system crashes. Volatile storage devices are placed very close to the CPU; usually they are embedded in the chipset itself. For instance, main memory and cache memory are examples of volatile storage. They are fast but can store only a small amount of data.
2. Non-volatile storage: These memories are built to survive system crashes. They are huge in storage capacity but slower to access. Examples include hard disks, magnetic tapes, flash memory, and non-volatile (battery backed up) RAM.
When a DBMS recovers from a crash:
It ought to check the states of all the transactions that were being executed.
A transaction may have been in the middle of some operation; the DBMS must ensure the atomicity of the transaction in this case.
It ought to check whether the transaction can be completed now or must be rolled back.
No transaction should be allowed to leave the DBMS in an inconsistent state.
There are two types of techniques that can help a DBMS recover, as well as maintain the atomicity of a transaction:
Maintaining the logs of every transaction and writing them onto stable storage before actually modifying the database.
Maintaining shadow paging, where the changes are first made in volatile memory and the actual database is updated later.
A log is a sequence of records that maintains a record of the actions performed by a transaction. It is important that the log is written before the actual modification and stored on stable storage, which is failsafe. Log-based recovery works as follows:
Deferred update – This technique does not physically update the database on disk until a transaction has reached its commit point. Before reaching commit, all transaction updates are recorded in the local transaction workspace. If a transaction fails before reaching its commit point, it will not have changed the database in any way, so UNDO is not needed. It may be necessary to REDO the effect of the operations recorded in the local transaction workspace, because their effect may not yet have been written to the database. Hence, deferred update is also known as the NO-UNDO/REDO algorithm (a small sketch follows this list).
Immediate update – In the immediate update technique, the database may be updated by some operations of a transaction before the transaction reaches its commit point. However, these operations are recorded in a log on disk before they are applied to the database, making recovery still possible. If a transaction fails to reach its commit point, the effect of its operations must be undone, i.e., the transaction must be rolled back; hence we require both UNDO and REDO. This technique is known as the UNDO/REDO algorithm.
Shadow paging – This technique provides atomicity and durability. A directory with n entries is constructed, where the ith entry points to the ith database page on disk. When a transaction begins executing, the current directory is copied into a shadow directory. When a page is to be modified, a shadow page is allocated in which the changes are made, and when it is ready to become durable, all pages that refer to the original are updated to refer to the new replacement page.
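The deferred-update (NO-UNDO/REDO) idea can be illustrated with a toy sketch (the "database", workspace and log are plain Python dictionaries and lists standing in for disk pages and stable storage):

database = {"A": 100, "B": 0}
log = []                                  # append-only log on "stable storage"

def run_transfer(commit=True):
    workspace = {}                        # local transaction workspace
    workspace["A"] = database["A"] - 50
    workspace["B"] = database["B"] + 50
    for item, value in workspace.items():
        log.append(("write", item, value))
    if not commit:
        return                            # failure before commit: database untouched, no UNDO
    log.append(("commit",))
    for item, value in workspace.items(): # after the commit record: REDO the logged writes
        database[item] = value

run_transfer(commit=False)
print(database)                           # unchanged: {'A': 100, 'B': 0}
run_transfer(commit=True)
print(database)                           # {'A': 50, 'B': 50}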
Reference
https://www.cs.uct.ac.za/mit_notes/database/htmls/chp14.html
MODULE-5
5 Database Security
Database security is a growing concern evidenced by an increase in the number of reported incidents of loss of or
unauthorized exposure to sensitive data. As the amount of data collected, retained and shared electronically expands, so
does the need to understand database security. The Defence Information Systems Agency of the US Department of
Defence (2004), in its Database Security Technical Implementation Guide, states that database security should provide
controlled, protected access to the contents of a database as well as preserve the integrity, consistency, and overall quality
of the data. Students in the computing disciplines must develop an understanding of the issues and challenges related to
database security and must be able to identify possible solutions.
• Definitions
– "Security protects data from intentional or accidental misuse or destruction, by controlling access to the data." (Stamper & Price)
– "Database security is concerned with the ability of the system to enforce a security policy governing the disclosure, modification or destruction of information." (Pangalos)
Security models
A security model establishes the external criteria for the examination of security issues in general, and provides the
context for database considerations, including implementation and operation. Specific DBMSs have their own security
models which are highly important in systems design and operation. Refer to the SeaView model for an example. You
will realise that security models explain the features available in the DBMS which need to be used to develop and operate
the actual security systems. They embody concepts, implement policies and provide servers for such functions. Any
faults in the security model will translate either into insecure operation or clumsy systems.
We are all familiar as users with the log-in requirement of most systems. Access to IT resources generally requires a log-
in process that is trusted to be secure. This topic is about access to database management systems, and is an overview of
the process from the DBA perspective. Most of what follows is directly about Relational client-server systems. Other
system models differ to a greater or lesser extent, though the underlying principles remain true. For a simple schematic,
see Authorisation and Authentication Schematic. Among the main principles for database systems are authentication and
authorisation.
Authentication
The client has to establish the identity of the server and the server has to establish the identity of the client. This is done
often by means of shared secrets (either a password/user-id combination, or shared biographic and/or biometric data). It
can also be achieved by a system of higher authority which has previously established authentication. In client-server
systems where data (not necessarily the database) is distributed, the authentication may be acceptable from a peer system.
Note that authentication may be transmissible from system to system. The result, as far as the DBMS is concerned, is an
authorisation-identifier. Authentication does not give any privileges for particular tasks. It only establishes that the
DBMS trusts that the user is who he/she claimed to be and that the user trusts that the DBMS is also the intended system.
Authentication is a prerequisite for authorisation.
Authorization
Authorisation relates to the permissions granted to an authorised user to carry out particular transactions, and hence to
change the state of the database (write-item transactions) and/or receive data from the database (read-item transactions).
The result of authorisation, which needs to be on a transactional basis, is a vector: Authorisation (item, auth-id,
operation). A vector is a sequence of data values at a known location in the system. How this is put into effect is down to
the DBMS functionality. At a logical level, the system structure needs an authorisation server, which needs to co-operate
with an auditing server. There is an issue of server-to-server security and a problem with amplification as the
authorisation is transmitted from system to system. Amplification here means that the security issues become larger as a
larger number of DBMS servers are involved in the transaction. Audit requirements are frequently implemented poorly.
To be safe, you need to log all accesses and log all authorisation details with transaction identifiers. There is a need to
audit regularly and maintain an audit trail, often for a long period.
Real-world situations involve a number of complex policies, in which access decisions depend on rules. Depending on the application, these rules come, for example, from organizational regulations, practices, and government laws. An access control system needs to ensure the availability of resources and the confidentiality and integrity of the data. There are three main types of access control policies.
Mandatory access control (MAC): under this policy users do not have the authority to override the rules; access is controlled centrally by the security policy administrator. The administrator defines the usage of resources and their access policy, which cannot be overridden by the end users, and the policy decides who has the authority to access particular programs and files. MAC is mostly used in systems where confidentiality is the priority.
MAC is a system-wide policy that defines who is allowed access; individual users cannot change the access rules, and enforcement relies entirely on the central system. MAC policies are defined by the system administrator and are strictly enforced by the OS or security kernel. For example, by law a court can access driving records without the owner's permission. MAC mechanisms have been tightly coupled to a few security models and are implemented in systems such as Trusted Solaris, TrustedBSD and SELinux. Two common forms of MAC policy are:
1. Multilevel Security
2. Multilateral Security
Disadvantages of MAC
MAC models place restrictions on user access that, according to the security policy, cannot be altered dynamically.
MAC requires the operating system and its associated utilities to be placed outside the access control framework.
MAC requires predetermined planning to implement effectively, and after implementation it demands a high level of system management, because object and account labels must be updated constantly as new data is collected.
Discretionary access control (DAC): in contrast with MAC, where the policy is determined by the system administrator, DAC policies are determined by the end users who own the resources. In DAC, a user has complete authority over all the resources he or she owns and also determines the permissions for other users on those resources and programs.
DAC was developed to implement the Access Control Matrix defined by Lampson in his paper on system protection [4]. Discretionary policies define access control based on the identity of the requestor and on explicit access rules that determine who can, or cannot, execute particular actions on particular resources. In DAC, users can grant other users the authority to access their resources, and the assigning and granting of privileges is governed by an administrative policy. Different types of DAC policies and models have been proposed in the literature.
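The owner-driven delegation that characterises DAC is visible in SQL's GRANT ... WITH GRANT OPTION clause: the owner of an object may pass a privilege on, and the recipient may in turn pass it further. A small hedged sketch follows (object and user names are invented; the cascading behaviour of REVOKE differs between DBMS products):
-- alice, as owner, grants read access to bob and allows him to delegate it
GRANT SELECT ON project_docs TO bob WITH GRANT OPTION;
-- bob, at his own discretion, passes the same privilege to carol
GRANT SELECT ON project_docs TO carol;
-- revoking bob's privilege also removes the grants that depended on it
REVOKE SELECT ON project_docs FROM bob CASCADE;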
DAC allows users to decide the access control policies on their own resources, yet these individual decisions together make up the global policy; DAC therefore has trouble ensuring consistency.
Malicious software/programs: DAC is vulnerable to processes that execute malicious programs, because such programs exploit the authorizations of the user on whose behalf they are executing. Trojan horses are a typical example.
Information flow: once particular information has been acquired by a process, DAC has no further control over the flow of that information. Information can be copied from one object to another, so it is possible to access a copy even if the owner does not grant access to the original.
Role-based access control (RBAC): this policy is very simple to use. In RBAC, roles are assigned statically by the system administrator, and access is controlled according to the roles that users hold in the system. RBAC is mostly used to control access to computer or network resources based on the roles of individual users within an organization. The literature describes many different access control policies and models proposed by researchers, the current status of access control systems, and their low-level implementation in terms of security mechanisms, together with comparisons of the different security policies and their mechanisms.
To grant access rights to a user it is important to know the responsibilities the organization has assigned to that user. DAC, in which the owner's rights over the data play the central part, does not capture this well, and MAC requires users to hold security clearances and objects to carry security classifications. RBAC tries to close this gap by combining enforced organizational constraints with the flexibility of explicit authorizations. RBAC is mostly used for controlling access to computer resources.
RBAC is a very useful method for controlling what kind of information users can access on a computer, which programs they can execute, and what changes they can make. In RBAC, roles are assigned to users statically, so it is not suited to dynamic environments: it is difficult to change a user's access rights without changing the user's specified roles. RBAC is the preferred access control model for a local domain; thanks to the static role assignment it is not complex and therefore needs little maintenance attention.
A role is an abstraction of user behaviour and assigned duties, and roles are used to assign system resources to departments and their respective members. Using the role concept is beneficial for providing secure access control in a software system, and it also reduces the cost of authority management. Essentially, role-based access control policies require the roles in the system to be identified; a role can be defined as a set of responsibilities and actions associated with a particular working activity. In an access control security model, a role is treated as a set of job-related access rights that can be given to authorized users within an organization, allowing an authorized user to carry out the responsibilities associated with that role.
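Most SQL DBMSs support RBAC directly through roles: privileges are granted to a role, and the role is then granted to users, which matches the two-step authorization management described under the advantages below. A hedged sketch follows (role, table and user names are invented for illustration; the syntax is close to that of Oracle and PostgreSQL but differs slightly between products):
-- define a role that bundles the access rights of one job function
CREATE ROLE payroll_clerk;
GRANT SELECT, UPDATE ON salaries TO payroll_clerk;
-- assign the role to the users who currently hold that job
GRANT payroll_clerk TO alice;
GRANT payroll_clerk TO bob;
-- when duties change, only the user-role assignment is altered
REVOKE payroll_clerk FROM bob;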
Advantages of RBAC
Authorization management: Role-based policies provide logical independence in specifying user authorizations. The authorization task can be broken down into two parts: i) assigning roles to particular users, and ii) assigning objects (access rights) to roles.
Hierarchical roles: Many applications and organizations have a hierarchy of roles, based on the principles of generalization and specialization.
Least privilege: Roles define the minimum privileges a user requires to perform a particular task. Users who are authorized for powerful roles do not need to activate them until those rights are actually needed.
Separation of duties: This principle states that no user should hold enough rights to misuse the system on his or her own. For instance, the person who authorizes a paycheck and the person who prepares it should not be the same.
Constraints enforcement: Roles provide the means of specifying and enforcing the protection constraints that real-world policies may need to define.
Disadvantages of RBAC
In the RBAC model there is still some work to be done to cover all the requirements of real-world scenarios. Defining roles in different contexts is difficult and may result in very large sets of role definitions; sometimes it produces more roles than users. Modern applications often require fine-grained access decisions, which RBAC does not provide.
RBAC assigns roles to its users statically, which is not preferred in a dynamic environment, and it is difficult to implement when the environment is dynamic and distributed. Because of this, it is hard to change a user's access rights without changing that user's roles, so RBAC does not support dynamic attributes, such as the time of day, on which a user's permissions may depend.
RBAC maintains the relation between users and roles as well as the relation between permissions and roles. To implement the RBAC model, roles must therefore be assigned in advance, and it is not possible to change access rights without altering the roles.
Intrusion Detection
Protecting data in network environments is a very important task nowadays. One way to make data less vulnerable to
malicious attacks is to deploy an intrusion detection system (IDS). To detect attacks, the IDS is configured with a number
of signatures that support the detection of known intrusions. Unfortunately, it is not a trivial task to keep intrusion
detection signatures up to date because a large number of new intrusions take place daily. To overcome this problem, the
IDS should be coupled with anomaly detection schemes, which support detection of new attacks. An IDS enables early detection of attacks and therefore makes the recovery of lost or damaged data simpler. Many researchers have proposed approaches for increasing the efficiency and accuracy of intrusion detection, but most of these efforts focus on detecting intrusions at the network or operating system level [1-6] and are not capable of detecting malicious transactions that corrupt database data or access it without permission. Corrupted data can affect other data, and the damage can spread across the database very quickly, posing a real danger to many real-world database applications. Such attacks or intrusions should therefore be detected quickly and accurately; otherwise it may be very difficult to recover from the damage.
There are many methodologies for database intrusion detection, such as analysing users' access patterns, time signatures, Hidden Markov Models, and mining data dependencies among data items. Researchers are also working on using Artificial Intelligence and Data Mining to make IDSs more accurate and efficient.
SQL Injection
SQL Injection (SQLi) is a type of injection attack that makes it possible to execute malicious SQL statements. These
statements control a database server behind a web application. Attackers can use SQL Injection vulnerabilities to bypass
application security measures. They can go around authentication and authorization of a web page or web application and
retrieve the content of the entire SQL database. They can also use SQL Injection to add, modify, and delete records in the
database.
An SQL Injection vulnerability may affect any website or web application that uses an SQL database such as MySQL,
Oracle, SQL Server, or others. Criminals may use it to gain unauthorized access to your sensitive data: customer
information, personal data, trade secrets, intellectual property, and more. SQL Injection attacks are one of the oldest,
most prevalent, and most dangerous web application vulnerabilities. The OWASP organization (Open Web Application
Security Project) lists injections in their OWASP Top 10 2017 document as the number one threat to web application
security.
How and Why Is an SQL Injection Attack Performed
To make an SQL Injection attack, an attacker must first find vulnerable user inputs within the web page or web
application. A web page or web application that has an SQL Injection vulnerability uses such user input directly in an
SQL query. The attacker can create input content. Such content is often called a malicious payload and is the key part of
the attack. After the attacker sends this content, malicious SQL commands are executed in the database.
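The classic illustration is a login check whose SQL text is built by pasting user input directly into the query. The sketch below is hypothetical (the USERS table and its columns are invented), but it shows how a payload in the password field rewrites the query's logic:
-- query the application intends to run, with the two input values pasted in verbatim
SELECT * FROM users WHERE username = 'alice' AND password = 'secret';
-- if the attacker submits  ' OR '1'='1  as the password, the query becomes
SELECT * FROM users WHERE username = 'alice' AND password = '' OR '1'='1';
-- because '1'='1' is always true, the WHERE clause matches rows regardless of the real password,
-- so the authentication check is bypassed.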
SQL is a query language that was designed to manage data stored in relational databases. You can use it to access,
modify, and delete data. Many web applications and websites store all the data in SQL databases. In some cases, you can
also use SQL commands to run operating system commands. Therefore, a successful SQL Injection attack can have very
serious consequences.
Attackers can use SQL Injections to find the credentials of other users in the database. They can then impersonate
these users. The impersonated user may be a database administrator with all database privileges.
SQL lets you select and output data from the database. An SQL Injection vulnerability could allow the attacker to
gain complete access to all data in a database server.
SQL also lets you alter data in a database and add new data. For example, in a financial application, an attacker
could use SQL Injection to alter balances, void transactions, or transfer money to their account.
You can use SQL to delete records from a database, even drop tables. Even if the administrator makes database
backups, deletion of data could affect application availability until the database is restored. Also, backups may
not cover the most recent data.
In some database servers, you can access the operating system using the database server. This may be intentional
or accidental. In such case, an attacker could use an SQL Injection as the initial vector and then attack the internal
network behind a firewall.
There are several types of SQL Injection attacks: in-band SQLi (using database errors or UNION commands), blind
SQLi, and out-of-band SQLi.
SQL injection examples
There are a wide variety of SQL injection vulnerabilities, attacks, and techniques, which arise in different situations.
Some common SQL injection examples include:
Retrieving hidden data, where you can modify an SQL query to return additional results.
Subverting application logic, where you can change a query to interfere with the application's logic.
UNION attacks, where you can retrieve data from different database tables.
Examining the database, where you can extract information about the version and structure of the database.
Blind SQL injection, where the results of a query you control are not returned in the application's responses.
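As an illustration of a UNION attack, the injected payload appends a second SELECT to the query the application already runs, so that rows from an unrelated table are returned alongside the expected results. The tables and columns below are hypothetical; both SELECTs must return the same number of columns for the UNION to succeed:
-- intended query, with the category value taken from the URL
SELECT name, price FROM products WHERE category = 'Gifts';
-- after injecting  Gifts' UNION SELECT username, password FROM users --  the query becomes
SELECT name, price FROM products WHERE category = 'Gifts'
UNION SELECT username, password FROM users -- ';
-- the trailing comment (--) discards the leftover quote, and user credentials are returned as if they were products.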
Module – 6
6 Advanced Topics
The object-oriented data model is a logical data model that captures the semantics of objects as supported in object-oriented programming. It is a persistent and sharable collection of defined objects, and it has the ability to model a complete solution. Object-oriented database models represent an entity as a class. A class represents both the attributes and the behaviour of the entity. For example, a CUSTOMER class will have not only the customer attributes such as CUST-ID, CUST-NAME, CUST-ADDRESS and so on, but also procedures that imitate actions expected of a customer, such as update-order. Instances of the class correspond to individual customers. Within an object, the class attributes take specific values, which distinguish one customer (object) from another; however, all the objects belonging to the class share the behaviour pattern of the class. The object-oriented database maintains relationships through logical containment.
The object-oriented database is based on the encapsulation of the data and code related to an object into a single unit whose contents are not visible to the outside world. The object-oriented data model therefore emphasises objects rather than data alone. The object-oriented database management system (OODBMS) is among the more recent approaches to database management. Its advantages and disadvantages include the following:
Capable of handling a large variety of data types: Unlike traditional databases (such as hierarchical, network or relational), object-oriented databases are capable of storing different types of data, for example pictures, voice and video, as well as text, numbers and so on.
Combining object-oriented programming with database technology: Object-oriented data model is
capable of combining object-oriented programming with database technology and thus, providing an
integrated application development system.
Improved productivity: Object-oriented data models provide powerful features such as inheritance, polymorphism and dynamic binding that allow users to compose objects and provide solutions without
writing object-specific code. These features increase the productivity of the database application developers
significantly.
Improved data access: Object-oriented data model represents relationships explicitly, supporting both
navigational and associative access to information. It further improves the data access performance over
relational value-based relationships.
No precise definition: It is difficult to provide a precise definition of what constitutes an object oriented
DBMS because the name has been applied to a variety of products and prototypes, some of which differ
considerably from one another.
Difficult to maintain: As organisational information needs change, the definitions of objects must be changed periodically and existing databases migrated to conform to the new definitions. Changing object definitions and migrating databases poses a real challenge.
Not suited for all applications: Object-oriented data models are used where there is a need to manage complex relationships among data objects. They are especially suited to specific applications such as engineering, e-commerce and medicine, and not to all applications; performance degrades and processing requirements become high when they are used for ordinary applications.
6.2 Object-Relational Database
An object-relational database (ORD) is a database management system (DBMS) that's composed of both a relational
database (RDBMS) and an object-oriented database (OODBMS). ORD supports the basic components of any object-
oriented database model in its schemas and the query language used, such as objects, classes and inheritance.
An object-relational database may also be known as an object relational database management system (ORDBMS).
ORD is said to be the middleman between relational and object-oriented databases because it contains aspects and
characteristics from both models. In ORD, the basic approach is based on RDB, since the data is stored in a traditional
database and manipulated and accessed using queries written in a query language like SQL. However, ORD also
showcases an object-oriented characteristic in that the database is considered an object store, usually for software that is
written in an object-oriented programming language. Here, APIs are used to store and access the data as objects.
One of ORD’s aims is to bridge the gap between conceptual data modelling techniques for relational and object-oriented
databases like the entity-relationship diagram (ERD) and object-relational mapping (ORM). It also aims to connect the
divide between relational databases and the object-oriented modelling techniques that are usually used in programming
languages like Java, C# and C++.
In a four-quadrant view of the database world, as illustrated in the figure, the lower-left quadrant contains those applications that process simple data and have no requirements for querying the data.
Traditional RDBMS products concentrate on the efficient organization of data that is derived from a limited set of data-
types. On the other hand, an ORDBMS has a feature that allows developers to build and innovate their own data types
and methods, which can be applied to the DBMS. With this, ORDBMS intends to allow developers to increase the
abstraction with which they view the problem area.
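For instance, the CUSTOMER class described for the object-oriented model can be approximated with user-defined types in object-relational SQL. The sketch below uses Oracle-style syntax as an illustration only; the type, attribute and table names are invented, and the member procedure body assumes a hypothetical ORDERS table:
CREATE TYPE customer_t AS OBJECT (
  cust_id      NUMBER,
  cust_name    VARCHAR2(100),
  cust_address VARCHAR2(200),
  MEMBER PROCEDURE update_order(p_order_id IN NUMBER)
);
/
CREATE TYPE BODY customer_t AS
  MEMBER PROCEDURE update_order(p_order_id IN NUMBER) IS
  BEGIN
    -- behaviour stored with the data: re-assign the given order to this customer
    UPDATE orders SET cust_id = SELF.cust_id WHERE order_id = p_order_id;
  END;
END;
/
-- an object table whose rows are instances of the user-defined type
CREATE TABLE customer OF customer_t;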
Advantages of ORDBMSs
The main advantages of ORDBMSs are the following:
Reuse and Sharing: The main advantages of extending the Relational data model come from reuse and sharing.
Reuse comes from the ability to extend the DBMS server to perform standard functionality centrally, rather than
have it coded in each application.
Increased Productivity: ORDBMS provides increased productivity both for the developer and for the end user.
Use of experience in developing RDBMS: Another obvious advantage is that the extended relational approach preserves the significant body of knowledge and experience that has gone into developing relational applications.
This is a significant advantage, as many organizations would find it prohibitively expensive to change. If the new
functionality is designed appropriately, this approach should allow organizations to take advantage of the new
extensions in an evolutionary way without losing the benefits of current database features and functions.
Disadvantages of ORDBMSs
The ORDBMS approach has the obvious disadvantages of complexity and associated increased costs. Further, proponents of the relational approach believe that the essential simplicity and purity of the relational model are lost with these types of extension.
Distributed Databases
Distributed database systems are similar to client/server architecture in a number of ways. Both typically involve the use of multiple computer systems and enable users to access data from remote systems. However, a distributed database system broadens the extent to which data can be shared well beyond what can be achieved with a client/server system. The figure below shows a diagram of a distributed database architecture.
As shown in the figure, in a distributed database system data is spread across a variety of different DBMS software products running on a variety of different computing machines supported by a variety of different operating systems. These machines are spread (or distributed) geographically and connected by a variety of communication networks. One application can operate on data that is spread geographically across different machines. Thus, the enterprise data might be distributed over different computers in such a way that the data for one portion (or department) of the enterprise is stored on one computer and the data for another department is stored on another. Each machine can hold data and applications of its own, yet the users on one computer can access data stored on several other computers. Therefore, each machine will act as a server for some users and as a client for others.
A distributed DBMS also has disadvantages:
1. A distributed DBMS that hides the distributed nature from the user is inherently more complex than a centralized DBMS.
2. The increased complexity means higher procurement and maintenance costs.
3. In a centralized system, access to the data can be easily controlled. However, in a distributed DBMS not only
does access to replicated data have to be controlled in multiple locations but also the network itself has to be
made secure.
Web Databases
A Web database is a database application designed to be managed and accessed through the Internet. Website operators can manage this collection of data and present analytical results based on the data in the Web database application. Web databases first appeared in the 1990s and have been an asset for businesses, allowing the collection of seemingly unlimited amounts of data from very large numbers of customers.
Three-Tier Architectures
Web database applications are typically built around the three-tier architecture model shown in the figure below. At the base of an application is the database tier, consisting of the database management system
that manages the data users create, delete, modify, and query. Built on top of the database tier is the middle tier, which
contains most of the application logic that you develop. It also communicates data between the other tiers. On top is the
client tier, usually web browser software that interacts with the application.
Figure: The three-tier architecture model of a web database application
The three-tier architecture is conceptual. In practice, there are different implementations of web database applications that fit this architecture. The most common implementation has the web server (which includes the scripting engine that processes the scripts and carries out the actions they specify) and the database management system installed on one machine: this is the simplest implementation to manage and secure. With this implementation on modern hardware, an application can typically handle tens of thousands of requests every hour.
For popular web sites, a common implementation is to install the web server and the database server on different
machines, so that resources are dedicated to permit a more scalable and faster application. For very high-end
applications, a cluster of computers can be used, where the database and web servers are replicated and the load
distributed across many machines.
Data Warehousing
A data warehouse is a subject-oriented, integrated, time-variant, and non-volatile collection of data. This data helps analysts to take informed decisions in an organization.
An operational database undergoes frequent changes on a daily basis on account of the transactions that take place. If a business executive wants to analyze previous feedback on, say, a product, a supplier, or consumer data, there may be no historical data available to analyze, because the earlier values have been overwritten by subsequent transactions.
A data warehouse provides generalized and consolidated data in a multidimensional view. Along with this view of the data, a data warehouse also provides Online Analytical Processing (OLAP) tools, which support interactive and effective analysis of data in a multidimensional space. This analysis supports data generalization and data mining.
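OLAP-style roll-up and slicing can be expressed directly in SQL over a warehouse fact table. A hedged sketch follows; the SALES fact table and its columns are invented for illustration, and GROUP BY ROLLUP is available in most warehouse-capable DBMSs (Oracle, SQL Server, DB2, PostgreSQL):
-- roll-up: subtotals per (region, product), per region, and a grand total in one query
SELECT region, product, SUM(amount) AS total_sales
FROM sales
GROUP BY ROLLUP (region, product);
-- slicing the cube: restrict to one time period before aggregating
SELECT region, SUM(amount) AS total_sales
FROM sales
WHERE sale_year = 2017
GROUP BY region;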
Data mining functions such as association, clustering, classification and prediction can be integrated with OLAP operations to enhance the interactive mining of knowledge at multiple levels of abstraction. That is why the data warehouse has become an important platform for data analysis and online analytical processing.
A data warehouse is a database, which is kept separate from the organization's operational database.
It possesses consolidated historical data, which helps the organization to analyze its business.
A data warehouse helps executives to organize, understand, and use their data to take strategic decisions.
Subject Oriented − A data warehouse is subject oriented because it provides information around a subject rather
than the organization's ongoing operations. These subjects can be product, customers, suppliers, sales, revenue,
etc. A data warehouse does not focus on the ongoing operations, rather it focuses on modelling and analysis of
data for decision making.
Integrated − A data warehouse is constructed by integrating data from heterogeneous sources such as relational
databases, flat files, etc. This integration enhances the effective analysis of data.
Time Variant − The data collected in a data warehouse is identified with a particular time period. The data in a
data warehouse provides information from the historical point of view.
Non-volatile − Non-volatile means the previous data is not erased when new data is added. A data warehouse is kept separate from the operational database, and therefore frequent changes in the operational database are not reflected in the data warehouse.
Data warehouses are widely used in the following sectors:
Financial services
Banking services
Consumer goods
Retail sectors
Controlled manufacturing
A data warehouse can be used for three kinds of processing:
Information Processing − A data warehouse allows the data stored in it to be processed by means of querying, basic statistical analysis, and reporting using crosstabs, tables, charts, or graphs.
Analytical Processing − A data warehouse supports analytical processing of the information stored in it. The
data can be analyzed by means of basic OLAP operations, including slice-and-dice, drill down, drill up, and
pivoting.
Data Mining − Data mining supports knowledge discovery by finding hidden patterns and associations,
constructing analytical models, performing classification and prediction. These mining results can be presented
using the visualization tools.
Data Mining
Data Mining is defined as the procedure of extracting information from huge sets of data. In other words, data mining is mining knowledge from data. The information or knowledge extracted in this way can be used in any of the following applications −
Market Analysis
Fraud Detection
Customer Retention
Production Control
Science Exploration
1. Market Analysis and Management
Customer Profiling − Data mining helps determine what kind of people buy what kind of products.
Identifying Customer Requirements − Data mining helps in identifying the best products for different
customers. It uses prediction to find the factors that may attract new customers.
Cross Market Analysis − Data mining can identify associations and correlations between the sales of different products (a simple SQL sketch follows this list).
Target Marketing − Data mining helps to find clusters of model customers who share the same characteristics
such as interests, spending habits, income, etc.
Determining Customer purchasing pattern − Data mining helps in determining customer purchasing pattern.
Providing Summary Information − Data mining provides us various multidimensional summary reports.
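The cross-market (association) analysis mentioned above can be sketched even in plain SQL by counting how often two products occur in the same order; dedicated association-rule algorithms then refine such counts. The ORDER_ITEMS table and its columns below are hypothetical:
-- count how many orders contain each pair of products
SELECT a.product_id AS product_a,
       b.product_id AS product_b,
       COUNT(*) AS times_bought_together
FROM order_items a
JOIN order_items b
  ON a.order_id = b.order_id AND a.product_id < b.product_id
GROUP BY a.product_id, b.product_id
ORDER BY times_bought_together DESC;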
2. Corporate Analysis and Risk Management
Finance Planning and Asset Evaluation − It involves cash flow analysis and prediction, contingent claim
analysis to evaluate assets.
Resource Planning − It involves summarizing and comparing the resources and spending.
3. Fraud Detection
Data mining is also used in the fields of credit card services and telecommunications to detect fraud. For fraudulent telephone calls, it helps to identify the destination of the call, its duration, and the time of day or week, and it analyzes calling patterns that deviate from expected norms.
Data mining also faces a number of issues:
Mining different kinds of knowledge in databases − Different users may be interested in different kinds of knowledge. Therefore, it is necessary for data mining to cover a broad range of knowledge discovery tasks.
Interactive mining of knowledge at multiple levels of abstraction − The data mining process needs to be
interactive because it allows users to focus the search for patterns, providing and refining data mining requests
based on the returned results.
Incorporation of background knowledge − Background knowledge can be used to guide the discovery process and to express the discovered patterns, not only in concise terms but at multiple levels of abstraction.